Exploitation: use the best-performing model
Exploration: take actions whose performance is unknown
The multi-armed bandit method: we pre-define a probability p with which we choose between exploration and exploitation. With probability p we select a random action from those available, and with probability 1-p we exploit the empirically best action. At runtime we monitor the KPIs to determine which action is currently the best one, and we update these statistics as more feedback arrives.
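This selection scheme can be sketched in Java as follows. This is a minimal illustration, not the book's implementation; the class and method names (`EpsilonGreedy`, `selectAction`, `update`) are hypothetical, and "reward" stands in for whatever KPI is being monitored.

```java
import java.util.Random;

// Sketch of the scheme above: with probability p explore a random
// action, otherwise exploit the action with the best average reward.
public class EpsilonGreedy {
    private final double p;        // pre-defined exploration probability
    private final int[] counts;    // how often each action was taken
    private final double[] sums;   // cumulative reward (KPI) per action
    private final Random random;

    public EpsilonGreedy(int numActions, double p, long seed) {
        this.p = p;
        this.counts = new int[numActions];
        this.sums = new double[numActions];
        this.random = new Random(seed);
    }

    // With probability p pick a random action (exploration);
    // with probability 1-p pick the empirically best one (exploitation).
    public int selectAction() {
        if (random.nextDouble() < p) {
            return random.nextInt(counts.length);
        }
        return bestAction();
    }

    // The action with the highest average observed reward so far.
    public int bestAction() {
        int best = 0;
        for (int a = 1; a < counts.length; a++) {
            if (average(a) > average(best)) {
                best = a;
            }
        }
        return best;
    }

    private double average(int a) {
        return counts[a] == 0 ? 0.0 : sums[a] / counts[a];
    }

    // Update the statistics with the feedback observed for an action.
    public void update(int action, double reward) {
        counts[action]++;
        sums[action] += reward;
    }
}
```

In a typical loop one would call `selectAction()`, observe the resulting KPI, and feed it back via `update(action, reward)`, so the "best" action can change as evidence accumulates.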
Java: Data Science Made Easy; Richard M. Reese, Jennifer L. Reese, Alexey Grigorev