Exploitation: use the best-performing model
Exploration: take actions whose performance is unknown
The multi-armed bandit method: we pre-define a probability p with which we choose between exploration and exploitation. With probability p we select a random action from those available, and with probability 1-p we exploit the empirically best action. At runtime we monitor the KPIs to determine which action is currently the best one, and we update these statistics as more feedback arrives.
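This selection scheme can be sketched in Java as follows. This is a minimal illustration, not the book's implementation; the class and method names (`EpsilonGreedy`, `selectAction`, `update`) are hypothetical, and "reward" stands in for whatever KPI is being monitored.

```java
import java.util.Random;

// Sketch of the scheme above: with probability p explore a random
// action, otherwise exploit the action with the best average reward.
public class EpsilonGreedy {
    private final double p;        // pre-defined exploration probability
    private final int[] counts;    // how often each action was taken
    private final double[] sums;   // cumulative reward (KPI) per action
    private final Random random;

    public EpsilonGreedy(int numActions, double p, long seed) {
        this.p = p;
        this.counts = new int[numActions];
        this.sums = new double[numActions];
        this.random = new Random(seed);
    }

    // With probability p pick a random action (exploration);
    // with probability 1-p pick the empirically best one (exploitation).
    public int selectAction() {
        if (random.nextDouble() < p) {
            return random.nextInt(counts.length);
        }
        return bestAction();
    }

    // The action with the highest average observed reward so far.
    public int bestAction() {
        int best = 0;
        for (int a = 1; a < counts.length; a++) {
            if (average(a) > average(best)) {
                best = a;
            }
        }
        return best;
    }

    private double average(int a) {
        return counts[a] == 0 ? 0.0 : sums[a] / counts[a];
    }

    // Update the statistics with the feedback observed for an action.
    public void update(int action, double reward) {
        counts[action]++;
        sums[action] += reward;
    }
}
```

In a typical loop one would call `selectAction()`, observe the resulting KPI, and feed it back via `update(action, reward)`, so the "best" action can change as evidence accumulates.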
Java: Data Science Made Easy; Richard M. Reese, Jennifer L. Reese, Alexey Grigorev