Combinatorial Rising Bandits

1 Pohang University of Science and Technology,
2 Microsoft Research Asia
ICLR 2026 (Poster)
*Indicates Equal Contribution

Abstract

Combinatorial online learning is a fundamental task for selecting the optimal action (or super arm) as a combination of base arms in sequential interactions with systems providing stochastic rewards. It is applicable to diverse domains such as robotics, social advertising, network routing, and recommendation systems. In many real-world scenarios, we often encounter rising rewards, where playing a base arm not only provides an instantaneous reward but also contributes to the enhancement of future rewards, e.g., robots improving through practice and social influence strengthening through a history of successful recommendations. Crucially, these enhancements may propagate to multiple super arms that share the same base arms, introducing dependencies beyond the scope of existing bandit models. To address this gap, we introduce the Combinatorial Rising Bandit (CRB) framework and propose a provably efficient and empirically effective algorithm, Combinatorial Rising Upper Confidence Bound (CRUCB). We empirically demonstrate the effectiveness of CRUCB in realistic deep reinforcement learning environments and synthetic settings, while our theoretical analysis establishes tight regret bounds. Together, they underscore the practical impact and theoretical rigor of our approach.

1. Combinatorial Rising Bandits

Toy Example: Rising Rewards and Shared Enhancement

Figure 1: An illustrative example of partially shared enhancement. Crucially, every route incorporates the shared edge, which triggers partially shared enhancement across the system. This structural dependency induces coupled reward dynamics that existing frameworks fail to handle, whereas our algorithm (CRUCB) minimizes regret and converges to the optimal long-term strategy.

The Combinatorial Rising Bandit (CRB) framework formalizes online decision-making in environments where selecting a combination of base arms (a super arm) not only yields an immediate reward but also enhances the future expected outcomes of its constituent base arms. The fundamental and unique challenge of CRB is partially shared enhancement: as illustrated by the "shared edge" in Figure 1a, multiple super arms often share common base arms, so improvements gained from one choice naturally propagate to others. Existing frameworks cannot handle this partially shared enhancement: traditional combinatorial bandits overlook the rising nature of the rewards, while previous rising bandit models fail to account for the structural dependencies created by overlapping actions. Our method, CRUCB, effectively navigates these coupled dynamics to minimize regret and converge to the optimal path, while baseline algorithms fail to account for the rising nature of the shared components (Figure 1c-d).
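To make the setup concrete, the sketch below (ours, not from the paper) implements a toy CRB environment in the spirit of Figure 1: two routes share one edge, and each pull of an edge raises that edge's expected reward toward a ceiling. The names, growth schedule, and noise level are all illustrative assumptions.

```python
import numpy as np

class ToyCRBEnv:
    """Toy CRB instance: super arms (routes) are sets of base arms (edges).

    Illustrative sketch only: each pull of an edge raises that edge's expected
    reward toward a ceiling, so routes sharing "e_shared" benefit from each
    other's pulls -- the partially shared enhancement described above.
    """

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.super_arms = {"route_A": ["e_shared", "e_a"],
                           "route_B": ["e_shared", "e_b"]}
        self.pulls = {"e_shared": 0, "e_a": 0, "e_b": 0}
        self.ceiling = {"e_shared": 0.9, "e_a": 0.8, "e_b": 0.4}  # asymptotic means

    def expected_reward(self, edge):
        # Rising, concave expected reward that approaches the ceiling with pulls.
        n = self.pulls[edge]
        return self.ceiling[edge] * (1.0 - 1.0 / np.sqrt(n + 1))

    def play(self, route):
        """Play a super arm; every constituent edge is pulled and enhanced."""
        total = 0.0
        for edge in self.super_arms[route]:
            total += self.rng.normal(self.expected_reward(edge), 0.05)
            self.pulls[edge] += 1  # the enhancement persists for all routes
        return total
```

Playing route_A repeatedly also raises the future value of route_B through the shared edge, which is exactly the coupling that stationary combinatorial bandits ignore.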

2. Proposed Method: CRUCB

Algorithm 1: CRUCB

The Future-UCB index for each base arm \(i\) is defined as:

\[ \hat{\mu}_i(t) := \underbrace{\frac{1}{h_i}\sum_{l=N_{i,t-1}-h_i+1}^{N_{i,t-1}}X_i(l)}_{\text{(i) recent average}} + \underbrace{\frac{1}{h_i}\sum_{l=N_{i,t-1}-h_i+1}^{N_{i,t-1}}(t-l)\frac{X_i(l)-X_i(l-h_i)}{h_i}}_{\text{(ii) predicted improvement}} + \underbrace{\sigma(t-N_{i,t-1}+h_i-1)\sqrt{\frac{10 \log t^3}{h_i^3}}}_{\text{(iii) exploration bonus}} \]

(i) Recent Average
Captures the current performance level using a sliding window.

(ii) Predicted Improvement
Estimates the growth slope and extrapolates future gains.

(iii) Exploration Bonus
Accounts for the uncertainty inherent in rising reward dynamics.
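To make the three terms concrete, the snippet below computes the Future-UCB index of a single base arm from its reward history, following the displayed formula. The window size \(h_i\), the sub-Gaussian scale \(\sigma\), and all variable names are our own placeholders, and we assume at least \(2h_i\) observations are available.

```python
import numpy as np

def future_ucb_index(rewards, t, h, sigma=1.0):
    """Sketch of the Future-UCB index for one base arm.

    rewards : observed outcomes X_i(1), ..., X_i(N_{i,t-1}) for this arm
    t       : current round
    h       : window size h_i (assumes len(rewards) >= 2 * h)
    sigma   : sub-Gaussian scale parameter (assumed given)
    """
    N = len(rewards)                                 # N_{i, t-1}
    recent = np.asarray(rewards[N - h:])             # X_i(N-h+1), ..., X_i(N)
    older = np.asarray(rewards[N - 2 * h:N - h])     # X_i(l - h) for the same window

    # (i) recent average: current performance level
    recent_avg = recent.mean()

    # (ii) predicted improvement: extrapolate the per-pull slope to round t
    slopes = (recent - older) / h                    # (X_i(l) - X_i(l - h)) / h
    steps_ahead = t - np.arange(N - h + 1, N + 1)    # (t - l) for each l in the window
    predicted = np.sum(steps_ahead * slopes) / h

    # (iii) exploration bonus for the uncertainty of the rising dynamics
    bonus = sigma * (t - N + h - 1) * np.sqrt(10 * np.log(t ** 3) / h ** 3)

    return recent_avg + predicted + bonus
```

In a CUCB-style loop, such per-arm indices would be passed to a combinatorial oracle that returns the super arm with the largest total index; see Algorithm 1 in the paper for CRUCB's exact selection rule.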

3. Theoretical Results

We establish the first explicit and rigorous comparison between regret upper and lower bounds in the rising bandit literature. Our analysis demonstrates that CRUCB achieves near-optimal efficiency by adapting to the inherent difficulty of the reward growth pattern.

Problem Instance Class \(\mathcal{A}_c\)

To analyze the learning difficulty across different scenarios, we define a fine-grained class of CRB instances \(\mathcal{A}_c\) for an arbitrary constant \(1 < c < 2\) as:

\[ \mathcal{A}_c := \left\{ \nu : \gamma_i(n) \le (n+1)^{-c}, \forall i \in [K], n \in [T-1] \right\} \]

This class \(\mathcal{A}_c\) acts as a structural separator: a larger \(c\) corresponds to slower outcome growth ("easy" instances), while a smaller \(c\) indicates faster growth ("difficult" instances).
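For a quick sanity check on how the classes nest (our reading, with \(\gamma_i(n)\) interpreted as the per-pull growth of base arm \(i\)'s expected outcome), an instance whose growth decays exactly at rate \(c_0\) belongs to every class with a smaller exponent:

\[ \gamma_i(n) = (n+1)^{-c_0} \;\Longrightarrow\; \nu \in \mathcal{A}_c \ \text{ for all } 1 < c \le c_0, \quad \text{since } (n+1)^{-c_0} \le (n+1)^{-c} \text{ whenever } c \le c_0. \]

In particular, \(\mathcal{A}_{c'} \subseteq \mathcal{A}_c\) for \(c \le c'\): a larger \(c\) rules out fast-growing (difficult) instances.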

Regret Upper Bound (Corollary 2)

For the problem class \(\mathcal{A}_c\), the cumulative regret of CRUCB is bounded by:

\[ Reg(\pi_{\epsilon}, T) = \tilde{O}\left(\max\{T^{2/3}, T^{2-c}\}\right) \]

Regret Lower Bound (Theorem 5)

For any policy acting on the class \(\mathcal{A}_c\), the worst-case regret is bounded by:

\[ \min_{\pi} \max_{\nu \in \mathcal{A}_c} Reg_{\nu}(\pi, T) = \Omega\left(\max\{\sqrt{T}, T^{2-c}\}\right) \]

Figure 2: Regret bound gap. The orders of our upper and lower bounds closely match across varying growth rates \(c\), establishing theoretical tightness.

Interpretation

As illustrated in Figure 2, the gap between our upper and lower bounds is minimal. CRUCB effectively adapts to varying problem difficulties without prior knowledge of the growth parameter \(c\), ensuring robustness across diverse rising reward scenarios. Please refer to our paper for the regret upper and lower bounds on more general classes of instances.
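Comparing the two displayed orders directly (our arithmetic, ignoring logarithmic factors), the ratio of the upper to the lower bound is

\[ \frac{\max\{T^{2/3},\, T^{2-c}\}}{\max\{\sqrt{T},\, T^{2-c}\}} = \begin{cases} 1, & 1 < c \le \tfrac{4}{3} \quad (\text{both orders equal } T^{2-c}),\\[2pt] T^{\,c - 4/3}, & \tfrac{4}{3} < c < \tfrac{3}{2},\\[2pt] T^{1/6}, & \tfrac{3}{2} \le c < 2, \end{cases} \]

so the bounds match exactly in the difficult regime \(c \le 4/3\) and differ by at most \(T^{1/6}\) elsewhere.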

4. Experimental Results: AntMaze

We evaluate CRUCB on AntMaze, a standard benchmark in reinforcement learning (RL). As shown in the performance overview below, the observed outcomes (Figure 3b) exhibit non-concave dynamics during the initial learning phase, yet CRUCB adapts effectively once reward growth begins. This robustness allows our method to significantly outperform all baselines, achieving a nearly flat regret curve (Figure 3c), whereas traditional algorithms accumulate linear regret due to their inability to handle coupled structural dependencies.

Experimental Results on AntMaze-complex

Figure 3: (a) Task Topology of AntMaze-complex. (b) Observed outcomes for each edge type. (c) Cumulative regret comparison highlighting CRUCB's superior efficiency.

A deep dive into the exploration patterns (Figure 4) reveals the structural reasons for this performance gap. Non-combinatorial methods produce scattered, breadth-first exploration patterns, and naive combinatorial methods get trapped in infeasible paths, whereas CRUCB integrates the rising reward nature with the graph structure to focus its exploration budget on the optimal path, mimicking the focused behavior of an oracle policy.

Exploration Heatmaps

Figure 4: Evolution of visit frequencies at \(t = 100\), \(500\), and \(3000\). CRUCB demonstrates the most efficient convergence toward the goal.

BibTeX

@inproceedings{song2026combinatorial,
  title     = {Combinatorial Rising Bandits},
  author    = {Song, Seockbean and Yoon, Youngsik and Wang, Siwei and Chen, Wei and Ok, Jungseul},
  booktitle = {International Conference on Learning Representations},
  year      = {2026}
}