OPENSIR: OPEN-ENDED SELF-IMPROVING REASONER
Wai-Chung Kwan1  Joshua Ong Jun Leang1,2  Pavlos Vougiouklis3
Jeff Z. Pan1  Marco Valentino4  Pasquale Minervini1,5
1University of Edinburgh  2Imperial College London
3Huawei Technologies Research & Development (UK) Limited
4University of Sheffield  5Miniml.AI
{wkwan, p.minervini}@ed.ac.uk
ABSTRACT
Recent advances in large language model (LLM) reasoning through reinforcement
learning rely on annotated datasets for verifiable rewards, which may limit mod-
els’ ability to surpass human-level performance. While self-play offers a promis-
ing alternative, existing approaches depend on external verifiers or cannot learn
open-endedly. We present Open-Ended Self-Improving Reasoner (OpenSIR), a
self-play framework where an LLM learns to generate and solve novel problems
by alternating teacher and student roles without external supervision. To generate
novel problems, OpenSIR optimises for both difficulty and diversity, rewarding
problems that challenge appropriately while exploring distinct concepts, enabling
open-ended mathematical discovery. Starting from a single trivial seed problem,
OpenSIR substantially improves instruction models: Llama-3.2-3B-Instruct ad-
vances from 73.9 to 78.3 on GSM8K, and from 28.8 to 34.4 on College Math,
while Gemma-2-2B-Instruct rises from 38.5 to 58.7 on GSM8K. Our analyses
reveal that OpenSIR achieves open-ended learning through co-evolving teacher-
student roles that adaptively calibrate difficulty and drive diverse exploration, pro-
gressing autonomously from basic to advanced mathematics. The code is publicly
available at https://github.com/EdinburghNLP/OpenSIR.
1 INTRODUCTION
Reinforcement learning with verifiable rewards (RLVR) drives recent advances in LLM reasoning.
Recent works on DeepSeek-R1 (DeepSeek-AI et al., 2025) and OpenAI o1 (OpenAI, 2024) have
shown that large-scale reinforcement learning improves reasoning capabilities. Yet these methods
need extensive human-annotated data for reward signals, bottlenecking scalability and potentially
limiting performance to human-level (Hughes et al., 2024b).
One promising direction to address these fundamental limitations is to generate synthetic training
data through self-play, which demonstrated remarkable success in various games (Silver et al., 2016;
2017; Brown & Sandholm, 2019; FAIR et al., 2022), allowing systems to exceed human-level per-
formance by learning from unambiguous reward signals (Silver et al., 2017; FAIR et al., 2022).
Yet, mathematical reasoning poses a key challenge for self-play: unlike games that have clear rules
and winners, generated mathematics problems lack the ground-truth answers to provide feedback
signals. Recent works utilise external verifiers, such as compilers for coding tasks (Pourcel et al.,
2024; Zhao et al., 2025) or game rules (Liu et al., 2025), while R-Zero (Huang et al., 2025) employs
majority voting with basic repetition penalties. However, these approaches cannot achieve open-
ended learning, the ability to continuously generate and pursue novel challenges without external
supervision (Bauer et al., 2023; Hughes et al., 2024a), confining systems to known concepts instead
of exploring diverse mathematical domains.
We present Open-Ended Self-Improving Reasoner (OpenSIR), a method for training a policy π_θ to generate and solve novel problems without external supervision. OpenSIR uses self-play: a single policy π_θ alternates between teacher and student roles, where the teacher generates problems and the student solves them, with problem-solution pairs selected for reinforcement learning updates. We
arXiv:2511.00602v1 [cs.CL] 1 Nov 2025
[Figure 1 depicts the OpenSIR pipeline: a teacher prompt asks the model to create a math problem conceptually different from a reference problem (e.g., "What does 2+4 equal?"); a student prompt asks it to solve the generated problems step-by-step; novelty (difficulty + diversity) and correctness rewards feed the model update, and generated problems join the problem pool.]

Figure 1: Overview of the OpenSIR framework. A single policy π_θ alternates between generating and solving novel problems without external supervision. Each training iteration consists of problem generation, solution sampling, scoring, and model update. Novelty is captured through both difficulty and diversity: problems must be challenging yet solvable, and they must explore new concepts. These dimensions together drive open-ended self-improvement in LLM reasoning ability.
reward teachers for generating appropriately challenging problems for the students, using consis-
tency and solution length across multiple solution attempts. OpenSIR achieves open-ended learning
through embedding-based diversity rewards that drive continuous exploration of novel mathematical
concepts.
Our experiments show OpenSIR outperforms base instruction models and reinforcement learning
baselines. Starting from a single trivial seed problem, OpenSIR improves base instruction models
by up to 6.3 accuracy points, surpassing GRPO baselines trained on thousands of human-annotated
examples. Specifically, Llama-3.2-3B-Instruct improves from 73.9→78.3 (+4.4) on GSM8K and
28.8→34.4 (+5.6) on College Math, while Gemma-2-2B-Instruct rises from 38.5→58.7 (+20.2) on
GSM8K and 19.1→23.4 (+4.3) on College Math.
Our qualitative analysis reveals OpenSIR succeeds through adaptive difficulty calibration and
diversity-driven exploration. Problem difficulty is automatically calibrated throughout training,
while the range of topics expands from basic to advanced mathematics (§4.1). Generating harder
problems risks invalidity, requiring a balance between challenge and correctness (§4.2). Diversity rewards incentivise generating problems that span varied mathematical concepts (§4.3). Teacher-student
co-evolution proves essential: without teacher training, models cannot generate appropriate chal-
lenges or explore new topics (§4.4).
2 OPEN-ENDED SELF-IMPROVING REASONER
Figure 1 illustrates the Open-Ended Self-Improving Reasoner (OpenSIR), a self-play framework in which a policy π_θ learns to both generate and solve novel mathematical problems without external supervision. We use reinforcement learning to optimise two roles within one policy: the teacher, which creates new problems, and the student, which solves them. This open-ended approach enables
the policy to bootstrap its learning and discover new and diverse challenges without annotated data.
Each training iteration involves four phases:
1. Problem generation (§2.1): The teacher proposes new problems by conditioning on reference problems from an accumulated pool of previously generated problems;
2. Solution sampling (§2.2): The student attempts multiple solutions per problem, with majority voting determining the reference answer and the solve rate measuring reliability;
3. Scoring (§2.3): We compute novelty scores for the teacher's generated problems and correctness scores for the student's solutions; and
4. Model update (§2.4): We update the policy's parameters with role-specific rewards, using the problem-solution pairs selected by the novelty scores.
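The four phases above can be sketched as a single loop. The following is a minimal illustration, not the paper's implementation: `propose`, `solve`, and `score_novelty` are hypothetical stand-ins for the single policy's teacher role, student role, and the novelty scorer.

```python
import random

def run_iteration(pool, propose, solve, score_novelty, k=2, g=4):
    """One OpenSIR iteration: propose -> sample solutions -> score -> grow the pool.

    `propose(ref, g)` stands in for the teacher generating G problems per reference,
    `solve(q)` for one student solution attempt, and `score_novelty` for Eq. (5).
    """
    refs = random.sample(pool, min(k, len(pool)))            # sample k reference problems
    problems = [q for ref in refs for q in propose(ref, g)]  # phase 1: G problems per reference
    scored = []
    for q in problems:
        answers = [solve(q) for _ in range(g)]               # phase 2: G solution attempts
        scored.append((q, answers, score_novelty(q, answers, pool)))  # phase 3: scoring
    pool.extend(q for q, _, _ in scored)                     # phase 4: valid problems join the pool
    return scored                                            # pairs available for the RL update
```

Starting from the single seed pool `["What is 1+1?"]`, repeated calls expand the pool while the novelty scores select which problem-solution pairs feed the policy update of §2.4.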
In OpenSIR, we define novelty along two dimensions that together drive continuous open-ended learning. First, problems must have an appropriate level of difficulty: challenging enough to promote learning, yet solvable enough to provide reliable training signals. Second, problems must explore diverse concepts, preventing the model from repeatedly training on familiar material. This two-dimensional view of novelty ensures the model continuously expands both the depth and breadth of its mathematical reasoning abilities.
2.1 PROBLEM GENERATION
At each iteration t, the policy π_θ generates k groups of G problems each, denoted as q_{1:G} within each group, for a total of M = k × G problems. To generate these problems, we sample k reference problems from a pool P_{t−1} of problems accumulated from previous iterations, where each reference problem serves as a seed for generating G new problems. Each generated problem must explicitly include the mathematical concepts required for its solution. Problems with invalid formats are filtered out, and valid problems proceed to the solution-sampling phase. We initialise the problem pool P_0 with a single trivial problem ("What is 1+1?").
2.2 SOLUTION SAMPLING
Let a_j denote the parsed answer from solution attempt o_j. We select the most common answer across attempts as the reference answer a*. We then compute the solve rate for each problem to determine the reliability of the answers. For brevity, we denote s_{q_i} = SolveRate(q_i) when referring to the solve rate of problem q_i:

\[
\mathrm{SolveRate}(q_i) = \frac{\mathrm{count}(a^*)}{G}, \quad \text{where } a^* = \arg\max_{a \in a_{1:G}} \mathrm{count}(a). \tag{1}
\]

In Eq. (1), count(a) denotes the number of times answer a appears. The solve rate quantifies answer reliability. High solve rates indicate reliable reference answers due to solution convergence, while low solve rates suggest inconsistent solutions that may indicate flawed problem formulations.
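Eq. (1) amounts to majority voting over the parsed answers; a small sketch:

```python
from collections import Counter

def solve_rate(answers):
    """Eq. (1): pick the majority answer a* among the G parsed attempts
    and return it together with its frequency count(a*)/G."""
    a_star, count = Counter(answers).most_common(1)[0]
    return a_star, count / len(answers)
```

For attempts `[6, 6, 6, -1]` this returns `(6, 0.75)`, a high solve rate signalling a reliable reference answer.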
2.3 SCORING
We evaluate the quality of generated problems and solutions with different scoring functions. The teacher's problems are scored based on difficulty and diversity, while the student's solutions receive scores for correctness. Additionally, both roles incorporate format scores to ensure parseable outputs.
2.3.1 TEACHER SCORING
We capture novelty through two fundamental dimensions: difficulty and diversity. We measure dif-
ficulty usingsolvabilityto ensure problems remain appropriately challenging andsolution lengthto
encourage multi-step reasoning, as these provide complementary signals about problem difficulty.
Diversity is promoted through embedding distance, which encourages exploration of varied mathe-
matical concepts. These components form a unified novelty score that guides problem generation.
Solvability (score_sol). The solvability score identifies problems with appropriate challenge. We use the solve rate as a proxy for solvability: problems with s_{q_i} > s_max are likely too easy, while those with s_{q_i} < s_min are either too difficult or malformed. We employ a triangular scoring function that peaks at the optimal solve rate and decreases linearly as problems become too easy or too hard.

We define the solve-rate range as [s_min, s_max]. Easy problems (s_{q_i} > s_max) fail to challenge the model, while problems that are too hard or malformed (s_{q_i} < s_min) offer minimal training value. Formally, for s_{q_i} ∈ [0, 1], let s_mid = (s_min + s_max)/2 be the midpoint:

\[
\mathrm{score}_{\mathrm{sol}}(q_i) =
\begin{cases}
1 - \alpha\,|s_{q_i} - s_{\mathrm{mid}}| & \text{if } s_{q_i} \in [s_{\min}, s_{\max}], \\
0 & \text{otherwise},
\end{cases}
\tag{2}
\]

where α = (1 − 1/G)/(s_mid − s_min) is the slope coefficient, with G being the number of solution attempts. The score peaks at the midpoint s_mid and decreases to 1/G at the boundaries, yielding a symmetric triangular score that assigns the maximum score to problems of moderate difficulty and progressively less as the solve rate approaches either boundary.
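A direct transcription of Eq. (2), using the thresholds from §4.2 (lower 0.5, upper 0.9) and g = 8 attempts as illustrative defaults:

```python
def score_sol(s, s_min=0.5, s_max=0.9, g=8):
    """Eq. (2): triangular solvability score over the solve rate s.

    Peaks at 1.0 at the midpoint of [s_min, s_max] and falls linearly
    to 1/G at the boundaries; zero outside the range."""
    if not (s_min <= s <= s_max):
        return 0.0                              # too easy, too hard, or malformed
    s_mid = (s_min + s_max) / 2
    alpha = (1 - 1 / g) / (s_mid - s_min)       # slope: score 1 at s_mid, 1/G at the edges
    return 1 - alpha * abs(s - s_mid)
```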
Solution Length (score_len). Solution length complements solvability by measuring problem complexity: problems requiring multi-step reasoning typically elicit longer solutions. We score problems using the average length of student solutions:

\[
\mathrm{score}_{\mathrm{len}}(q_i) = \min\left( \frac{\bar{l}(q_i)}{l_{\mathrm{base}}}, \frac{l_{\mathrm{cap}}}{l_{\mathrm{base}}} \right) \tag{3}
\]

where \bar{l}(q_i) denotes the average solution length for problem q_i, l_base is a normalisation factor (defaulting to 1,000 tokens), and l_cap prevents outliers from dominating the scoring signal. This score complements the solvability score (see Appendix C.1).
Diversity (score_div). We compute the semantic distance between each new problem and the existing problem pool:

\[
\mathrm{score}_{\mathrm{div}}(q_i) = \min_{q' \in P_{t-1}} d(e_{q_i}, e_{q'}) \tag{4}
\]

where e_{q_i} and e_{q'} are problem embeddings obtained from a pre-trained encoder, and d(·, ·) denotes cosine distance. This score is maximised when a problem is semantically distant from all existing problems in the pool.
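Eq. (4) reduces to a nearest-neighbour cosine distance against the pool; a sketch with NumPy (the embedding encoder itself is left unspecified here):

```python
import numpy as np

def score_div(e_new, pool_emb):
    """Eq. (4): minimum cosine distance between a new problem's embedding
    e_new and every embedding in the pool P_{t-1} (rows of pool_emb)."""
    e = e_new / np.linalg.norm(e_new)
    pool = pool_emb / np.linalg.norm(pool_emb, axis=1, keepdims=True)
    return float(np.min(1.0 - pool @ e))  # cosine distance = 1 - cosine similarity
```

A problem identical to a pool member scores 0; a problem orthogonal to everything in the pool scores 1.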
Format (score^T_fom). The format score ensures proper problem structure. Generated problems must be enclosed in <question> tags, with concepts listed in <concepts> tags (at most three concepts). We assign score^T_fom(q_i) = 1 for correct formatting and score^T_fom(q_i) = 0 otherwise.
Novelty Score. We combine these components into a novelty score capturing both difficulty and diversity:

\[
\mathrm{score}_{\mathrm{novel}}(q_i) = \alpha\,\mathrm{score}_{\mathrm{sol}}(q_i) + \lambda\,\mathrm{score}_{\mathrm{len}}(q_i) + \gamma\,\mathrm{score}_{\mathrm{div}}(q_i) + \delta\,\mathrm{score}^{T}_{\mathrm{fom}}(q_i) \tag{5}
\]

where α, λ, γ, δ are hyperparameters that control the relative importance of each component. This novelty score is used to select high-quality problem-solution pairs for training.
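Eqs. (3) and (5) combine into the teacher's reward. In the sketch below, the weights and the cap `l_cap = 4000` are illustrative placeholders, not the paper's tuned values:

```python
def score_len(avg_len, l_base=1000.0, l_cap=4000.0):
    """Eq. (3): average solution length normalised by l_base and capped at
    l_cap / l_base (l_cap = 4000 is an assumed value for illustration)."""
    return min(avg_len / l_base, l_cap / l_base)

def score_novel(sol, length, div, fom, alpha=1.0, lam=0.5, gamma=0.5, delta=0.2):
    """Eq. (5): weighted sum of the four teacher-side score components."""
    return alpha * sol + lam * length + gamma * div + delta * fom
```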
2.3.2 STUDENT SCORING

The student's score is based on solution correctness. For each solution attempt, we evaluate correctness by comparing the parsed answer against the reference answer from majority voting.

Format (score^S_fom). The format score ensures proper answer presentation. Solutions must present final answers in \boxed{} notation. We assign score^S_fom(o_j) = 1 for correct formatting and 0 otherwise.
Correctness Score. The student's correctness score combines accuracy with the format score:

\[
\mathrm{score}_{\mathrm{correct}}(o_j, a_j) = \mathbb{1}[a_j = a^*] + \delta\,\mathrm{score}^{S}_{\mathrm{fom}}(o_j) \tag{6}
\]

where 1[a_j = a*] is an indicator function that equals 1 when the parsed answer a_j from outcome o_j matches the reference answer a*, and 0 otherwise. This correctness score evaluates both solution accuracy and proper formatting.
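Eq. (6) in code form, with δ as an illustrative format weight:

```python
def score_correct(parsed, reference, boxed_ok, delta=0.2):
    """Eq. (6): 1 if the parsed answer matches the majority-vote reference a*,
    plus a delta-weighted bonus for a well-formatted \\boxed{} answer."""
    return float(parsed == reference) + delta * float(boxed_ok)
```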
2.4 MODEL UPDATE
After computing novelty scores, we select B high-quality samples from the valid problems for reinforcement learning, allocating half to problem generation and half to solution solving. For teacher training, we choose the problem groups with the highest score_novel variance to ensure diverse training signals. For student training, we select the problems with the highest novelty scores to provide maximal training value.
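The split of the budget B can be sketched as follows; variable names are ours, with `groups` holding each teacher group's novelty scores and `problems` holding (problem, novelty) pairs:

```python
import statistics

def select_batch(groups, problems, b):
    """Teacher half of B: groups with the highest novelty-score variance.
    Student half of B: individual problems with the highest novelty scores."""
    teacher = sorted(groups, key=statistics.pvariance, reverse=True)[: b // 2]
    student = sorted(problems, key=lambda p: p[1], reverse=True)[: b // 2]
    return teacher, student
```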
We optimise the policy π_θ with an objective similar to Group Relative Policy Optimization (GRPO) (Shao et al., 2024), adapted for on-policy training to ensure stability (Chen et al., 2025):

\[
J(\theta) = \mathbb{E}_{\substack{q_{1:G} \sim \pi_\theta(\cdot \mid p_T) \\ o_{1:G} \sim \pi_\theta(\cdot \mid q_i, p_S)}} \left[ \sum_{r \in \{T,S\}} \frac{1}{G} \sum_{i=1}^{G} A^{r}_{i} \right] - \beta\, D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}}) \tag{7}
\]

where p_T and p_S are the teacher and student prompts respectively, r ∈ {T, S} indexes the teacher and student roles, D_KL denotes the KL divergence, and π_ref refers to the initial model before training. The advantage for each role r ∈ {T, S} is computed as:

\[
A^{r}_{i} = \frac{R^{r}_{i} - \mathrm{mean}(R^{r}_{1:G})}{\mathrm{std}(R^{r}_{1:G})}. \tag{8}
\]
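The group-relative advantage of Eq. (8) is a per-group standardisation of the rewards:

```python
import numpy as np

def group_advantages(rewards):
    """Eq. (8): standardise a role's rewards within its group of G samples.
    Assumes the rewards are not all identical (std > 0)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / r.std()
```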
We define role-specific rewards R^T_i and R^S_j using the scoring functions from Section 2.3:

\[
R^{T}_{i} = \mathrm{score}_{\mathrm{novel}}(q_i), \qquad R^{S}_{j} = \mathrm{score}_{\mathrm{correct}}(o_j, a_j) \tag{9}
\]
All valid problems are then added to the problem poolP tfor future iterations.
3 EXPERIMENTS
3.1 TRAINING SETUP
We experiment with three instruction-tuned models: Llama-3.2-3B-Instruct (Dubey et al., 2024),
Gemma-2-2B-Instruct (Team et al., 2024), and Qwen-2.5-3B-Instruct (Team, 2024) with GRPO
(Shao et al., 2024). We use a learning rate of 3 × 10^−7 and 10 warm-up steps. The KL-divergence coefficient is set to 10^−4 and the batch size is 256. To compare models trained on the same number
of problem-solution pairs, we train the GRPO baselines with 100 steps, and OpenSIR for 200 steps
since OpenSIR allocates half of its training budget to problem generation. Clipping is not applied
since we strictly use on-policy samples. Each experiment is run with three random seeds. We
provide full training details in Appendix D.1.
3.2 DATASET AND EVALUATION SETUP
We evaluate our method on five mathematical benchmarks: GSM8K (Cobbe et al., 2021), MATH-500
(Hendrycks et al., 2021), Minerva (Lewkowycz et al., 2022), OlympiadBench (He et al., 2024), and
College Math (Tang et al., 2024).
We use a sampling temperature of 0.6 and top-p of 0.95. The maximum response length is set to 4,096 tokens. We report the average performance over 16 generations (avg@16). Answer extraction and comparison are performed using the math_verify library.
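The avg@16 metric averages per-problem accuracy over 16 sampled generations; a minimal sketch:

```python
def avg_at_k(results):
    """avg@k: `results` holds, for each problem, a list of k 0/1 correctness
    flags (k = 16 in our evaluation); return the mean per-problem accuracy."""
    per_problem = [sum(flags) / len(flags) for flags in results]
    return sum(per_problem) / len(per_problem)
```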
3.3 BASELINES
(1) Base: We evaluate the instruction-tuned models using zero-shot prompting, where models generate step-by-step reasoning and provide final answers without additional training.
(2) GRPO: We train the instruction models with GRPO (Shao et al., 2024) on established mathematical datasets. We train two variants: GRPO_math on the MATH dataset (7,500 training examples) (Hendrycks et al., 2021) and GRPO_gsm8k on the GSM8K dataset (7,473 training examples) (Cobbe et al., 2021).
3.4 MAIN RESULTS
Model                  GSM8K  MATH-500  Minerva  College Math  OlympiadBench  Avg.

Llama-3.2-3B-Instruct
  Base                 73.94  42.86     15.21    28.78         13.09          34.78
  GRPO_gsm8k           79.72  45.30     16.27    33.33         14.56          37.83 (+3.1)
  GRPO_math            76.48  45.26     16.09    32.95         14.13          36.98 (+2.2)
  OpenSIR              78.28  46.22     17.46    34.42         15.72          38.42 (+3.6)

Gemma-2-2B-Instruct
  Base                 38.50  16.51     10.09    19.11          3.00          17.44
  GRPO_gsm8k           58.75  19.15      7.75    20.45          3.21          21.86 (+4.4)
  GRPO_math            56.03  22.76      7.96    16.31          3.24          21.26 (+3.8)
  OpenSIR              58.03  24.75      9.51    23.36          3.15          23.76 (+6.3)

Qwen-2.5-3B-Instruct
  Base                 84.43  65.36     25.23    48.22         27.94          50.24
  GRPO_gsm8k           84.94  65.77     25.31    48.46         28.31          50.56 (+0.3)
  GRPO_math            84.31  65.89     24.98    48.34         28.26          50.36 (+0.1)
  OpenSIR              85.38  65.87     25.96    48.74         28.33          50.85 (+0.6)
Table 1: The avg@16 performance on five mathematical benchmarks. OpenSIR consistently outper-
forms GRPO baselines across model architectures despite generating training data through self-play
from a single seed problem, while GRPO baselines use over 7,000 human-annotated examples.
Table 1 demonstrates that OpenSIR achieves substantial gains over the base instruction models, improving Llama-3.2-3B-Instruct by 3.6 points and Gemma-2-2B-Instruct by 6.3 points in average accuracy. While OpenSIR yields large improvements with Llama-3.2-3B-Instruct and Gemma-2-2B-Instruct, Qwen-2.5-3B-Instruct shows relatively minimal gains (+0.6). Notably, the GRPO baselines exhibit similarly marginal improvements with this model, indicating the effect is not specific to OpenSIR. The limited improvement aligns with observations of potential benchmark contamination (Wu et al., 2025).
OpenSIR outperforms all GRPO baselines without using human-annotated training data. GRPO
baselines require over 7,000 labeled examples, yet OpenSIR generates its own training problems
through self-play, starting from a single trivial seed problem. As we reveal later in our analysis, the success of OpenSIR can be attributed to its ability to explore diverse mathematical concepts and to calibrate difficulty adaptively to maintain optimal challenge levels (§4.1). These capabilities
enable OpenSIR to self-improve and expand its skills without external training data, achieving open-
ended learning.
4 ABLATIONS AND ANALYSES
We perform a series of ablation studies and qualitative analyses on Llama-3.2-3B-Instruct to dissect
the contribution of each key component in the OpenSIR framework. Our analysis investigates:
(1) the evolution of problem difficulty and diversity over training (§4.1), (2) the effect of solve rate
thresholds on the difficulty-validity trade-off (§4.2), (3) the impact of diversity rewards on promoting
exploration of novel problem types (§4.3), and (4) the necessity of dual-role training (§4.4).
4.1 EVOLUTION OF PROBLEM DIFFICULTY AND DIVERSITY
[Figure 2: left panel, bar chart of difficulty ranks and invalid-problem counts for GSM8K, MATH, and training steps 0, 100, and 200; right panel, topic distribution over Trigonometry, Statistics, Probability, Optimisation, Number Theory, Geometry, Discrete Math, Combinatorics, Calculus, Arithmetic, and Algebra.]
Figure 2: Evolution of problem difficulty, validity, and topic diversity during OpenSIR training. (Left) Human evaluation results showing difficulty rankings (1-5 scale, where 1 = easiest, 5 = hardest) and the number of invalid problems for GSM8K, MATH, and problems generated at steps 0, 100, and 200 of training. Invalid problems are those with logical flaws, missing information, or ambiguities. (Right) Distribution of mathematical topics across training stages, demonstrating the increasing diversity of generated problems from step 0 to step 200.
We track how difficulty and diversity evolve during training through human evaluation. We sample
20 problems from three OpenSIR training checkpoints (steps 0, 100, 200) and 20 each from GSM8K
and MATH. Annotators evaluate mixed sets of five problems (one per source), identifying topics,
assessing validity, and ranking difficulty. Figure 2 shows average difficulty rankings (1=easiest,
5=hardest); see Appendix B for full annotation instructions.
Figure 2 (left) reveals a V-shaped difficulty trend across training stages. Problems start at a difficulty of 3.4, drop to 3.0 at the midpoint, then rise to 3.8. This pattern reflects OpenSIR's self-calibration: the model first generates overly difficult problems, then learns appropriate difficulty, and finally increases the challenge as its solving capabilities improve. The model also generates increasingly valid problems during training: validity improves from below 50% initially to 95% (19 of 20 problems) by the end.
Figure 2 (right) shows topic diversity expansion across training. OpenSIR progresses from basic
topics (algebra, arithmetic, geometry) to advanced domains including calculus and optimisation,
eventually incorporating trigonometry, statistics, and other mathematical areas. This progression
demonstrates OpenSIR’s capacity for autonomous exploration of diverse mathematical concepts.
Appendix A.2 provides detailed case studies that illustrate this evolution.
4.2 DIFFICULTY-VALIDITY TRADE-OFF
Model        Acc    Validity  Solve Rate
OpenSIR_0.5  38.42  70.82     89.82
OpenSIR_0.3  36.81  52.32     81.38
OpenSIR_0.1  35.97  42.31     78.31
Table 2: Performance, problem validity, and solve rate across different lower solve-rate thresholds,
with the upper threshold fixed at 0.9 for all variants. Validity and solve rate are estimated using
GPT-5. Lower thresholds produce harder problems but significantly more invalid ones, ultimately
reducing overall performance.
We investigate the difficulty-validity trade-off by training OpenSIR variants with lower solve-rate thresholds of 0.1, 0.3, and 0.5, keeping the upper threshold at 0.9. From each variant, we sample
300 problems and assess quality with GPT-5 (OpenAI, 2025a) using 8 responses per problem. We
measure validity by comparing GPT-5’s majority answer to our reference answer and difficulty by
GPT-5’s solve rate.
Table 2 reveals a clear trade-off between validity and difficulty. While lowering the threshold
from 0.5 to 0.1 produces moderately harder problems (GPT-5 solve rate decreases from 89.82%
to 78.31%), validity plummets from 70.82% to 42.31%. This suggests that problems with very
low solve rates frequently contain errors rather than representing genuine mathematical challenges. Performance consistently drops with lower thresholds, supporting our selection of 0.5 as the lower threshold for the solvability reward.
Besides solve-rate thresholds, we find that rewarding longer solutions provides another mechanism for promoting problem complexity, encouraging sophisticated multi-step problems (Appendix C.1).
4.3 IMPACT OF DIVERSITY REWARDS
Figure 3: t-SNE visualization of problem embed-
dings showing the effect of diversity reward on
problem distribution. With diversity reward, prob-
lems explore broader regions of the embedding
space compared to the clustered distribution with-
out diversity reward.
Model          Acc    # Concepts
w/ diversity   38.42  5,914
w/o diversity  36.45  3,328

Table 3: OpenSIR performance with and without the diversity reward. Exploring diverse mathematical concepts through the diversity reward improves both accuracy and concept coverage, showing that variety in problem types is crucial for self-improvement.

We analyse the impact of the diversity reward on problem diversity through problem
embeddings, n-gram similarity, and concept
overlap. Figure 3 visualises the problem em-
beddings with t-SNE, where red points repre-
sent problems without diversity reward, cyan
points show problems with diversity reward,
gold indicates MATH dataset problems, and
purple marks GSM8K dataset problems. With-
out diversity rewards, problems cluster in nar-
row regions, generating similar types repeat-
edly and failing to achieve open-ended explo-
ration. With diversity rewards, problems spread
across the embedding space, reaching areas be-
yond MATH and GSM8K training sets. Further
analysis of n-gram similarity and concept over-
lap support these findings, demonstrating con-
sistent patterns of greater dispersion and nov-
elty (Appendix A.3).
Table 3 empirically confirms the importance of diversity rewards, showing that removing the diversity reward reduces average performance by 1.97 points (from 38.42 to 36.45). It also shows that the number of unique concepts drops significantly (from 5,914 to 3,328). This demonstrates that without diversity
rewards, the model generates repetitive prob-
lems with limited learning value, constraining
the teacher’s ability to present varied mathe-
matical challenges to the student. Incorporating
diversity rewards thus enables exploration of
novel problems beyond existing datasets, sup-
porting open-ended learning where the model
continuously discovers new challenges rather than repeating known concepts. Notably, this im-
provement is robust to the choice of diversity metric (Appendix C.2), with different measurement
approaches yielding comparable results.
4.4 IMPORTANCE OF DUAL-ROLE TRAINING
We evaluate the contribution of the joint teacher-student training by testing a variant where only the
student is updated while the teacher remains fixed at its initial state. Table 4 shows that accuracy
Trained Roles  Acc    Avg. Solve Rate
Both           38.42  72.20 (±4.49)
Student        35.89  64.56 (±17.37)
Table 4: Accuracy and average solve rate with standard deviation (±) for OpenSIR with teacher
training (Both) versus without teacher training (Student only). Joint training achieves higher ac-
curacy and remarkably stable problem difficulty (much lower solve rate variance), demonstrating
that teacher training enables calibrated problem generation at optimal difficulty levels for effective
learning.
drops significantly from 38.42 to 35.89 when only the student is trained. This demonstrates that
effective self-play requires both components to co-evolve.
Without teacher training, generated problems become harder (solve rate drops from 72.20 to 64.56)
and drift from the optimal 70% target solve rate established in Section 4.2. More critically, solve
rate variance increases tremendously (from±4.49 to±17.37), indicating highly inconsistent diffi-
culty during training. This poorly calibrated curriculum explains the performance drop: the fixed
teacher cannot adapt to the student’s evolving capabilities, whereas joint training enables continuous
difficulty calibration at the optimal challenge level.
5 RELATED WORK
Self-play. Self-play has achieved superhuman performance in games without human data, including AlphaGo (Silver et al., 2016; 2017), StarCraft II (Vinyals et al., 2019), Poker (Brown & Sandholm, 2019), DotA (OpenAI et al., 2019), and Diplomacy (FAIR et al., 2022). Baker et al. (2019) show
that agents can discover complex strategies with self-play, suggesting it is a promising avenue for
continuous open-ended learning. Recent works apply self-play to LLM reasoning: Absolute Zero
(Zhao et al., 2025) and Spiral (Liu et al., 2025) rely on external verifiers or game rules that limit their
use beyond specific domains. R-Zero (Huang et al., 2025) attempts verifier-free self-play but uses
only repetition penalties without a mechanism to encourage exploration, constraining open-ended
learning. In contrast, OpenSIR generates and solves problems without external supervision while
actively promoting diversity to enable continuous discovery of novel mathematical concepts.
Reinforcement Learning with Verifiable Feedback (RLVF). RLVF drives recent advances in LLM reasoning (OpenAI, 2024; 2025b; DeepSeek-AI et al., 2025) but requires extensive human-annotated data for verifiable reward signals (Zeng et al., 2025), creating a scalability bottleneck and potentially limiting performance to human level. Recent works show that moderate-difficulty training samples provide optimal learning signals (Zheng et al., 2025; Sun et al., 2025), while diverse problem types enhance mathematical reasoning (Akter et al., 2025; Chen et al., 2025). These insights directly motivate OpenSIR to optimise for appropriate difficulty calibration and diversity-driven exploration, enabling models to learn mathematical reasoning open-endedly without human supervision.
6 CONCLUSIONS
We present OpenSIR, a self-play framework that enables LLMs to autonomously learn to generate
and solve novel problems without external supervision. Starting from only a single trivial math
problem, our framework outperforms GRPO-trained models that utilise thousands of human an-
notations across diverse model families. This approach demonstrates that models can effectively
bootstrap mathematical reasoning through recursive self-improvement, eliminating dependence on
extensive curated datasets. Our analysis reveals that OpenSIR succeeds by combining difficulty
calibration and diversity rewards to create an adaptive curriculum where models continuously dis-
cover and master increasingly challenging mathematical concepts. Overall, OpenSIR represents a
compelling paradigm for open-ended autonomous mathematical reasoning development, enabling
models to recursively expand their capabilities beyond the boundaries of human-annotated data.
ACKNOWLEDGMENTS
The authors would like to thank Aryo Pradipta Gema, Neel Rajani, Rohit Saxena (in alphabetical
order) for the helpful discussions and feedback on the manuscript.
REFERENCES
Syeda Nahida Akter, Shrimai Prabhumoye, Matvei Novikov, Seungju Han, Ying Lin, Evelina Bakh-
turina, Eric Nyberg, Yejin Choi, Mostofa Patwary, Mohammad Shoeybi, and Bryan Catanzaro.
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning, April 2025.
Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor
Mordatch. Emergent Tool Use From Multi-Agent Autocurricula. In International Conference on Learning Representations, September 2019. URL https://openreview.net/forum?id=SkxpxJBKwS.
Jakob Bauer, Kate Baumli, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg,
Michael Chang, Natalie Clay, Adrian Collister, Vibhavari Dasagi, Lucy Gonzalez, Karol Gre-
gor, Edward Hughes, Sheleem Kashem, Maria Loks-Thompson, Hannah Openshaw, Jack Parker-
Holder, Shreya Pathak, Nicolas Perez-Nieves, Nemanja Rakicevic, Tim Rocktäschel, Yannick
Schroecker, Satinder Singh, Jakub Sygnowski, Karl Tuyls, Sarah York, Alexander Zacherl, and
Lei M. Zhang. Human-Timescale Adaptation in an Open-Ended Task Space. In Proceedings of
the 40th International Conference on Machine Learning, pp. 1887–1935. PMLR, July 2023. URL
https://proceedings.mlr.press/v202/bauer23a.html.
Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456): 885–890, August 2019. doi: 10.1126/science.aay2400.
Yang Chen, Zhuolin Yang, Zihan Liu, Chankyu Lee, Peng Xu, Mohammad Shoeybi, Bryan Catan-
zaro, and Wei Ping. AceReason-Nemotron: Advancing Math and Code Reasoning through Rein-
forcement Learning, May 2025.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John
Schulman. Training Verifiers to Solve Math Word Problems, November 2021.
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu,
Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu,
Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao
Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan,
Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao,
Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding,
Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang
Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai
Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang,
Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang,
Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang,
Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang,
R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng
Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing
Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen
Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong
Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu,
Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xi-
aosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia
Shan, Y . K. Li, Y . Q. Wang, Y . X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng
Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong
Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong,
Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou,
Y . X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying
Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda
Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu,
Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu
Zhang, and Zhen Zhang. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Rein-
forcement Learning, January 2025.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.
arXiv e-prints, pp. arXiv–2407, 2024.
FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried,
Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath,
Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchin-
tala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang,
and Markus Zijlstra. Human-level play in the game of Diplomacy by combining language models
with strategic reasoning. Science, 0(0):eade9097, November 2022. doi: 10.1126/science.ade9097.
Alex Havrilla, Edward Hughes, Mikayel Samvelyan, and Jacob Abernethy. Synthetic Problem Gen-
eration for Reasoning via Quality-Diversity Algorithms, June 2025.
Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, Zhen Thai, Junhao Shen, Jinyi Hu, Xu Han,
Yujie Huang, Yuxiang Zhang, Jie Liu, Lei Qi, Zhiyuan Liu, and Maosong Sun. OlympiadBench:
A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Sci-
entific Problems. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.),Proceedings of the
62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),
pp. 3828–3850, Bangkok, Thailand, August 2024. Association for Computational Linguistics.
doi: 10.18653/v1/2024.acl-long.211.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang,
Dawn Song, and Jacob Steinhardt. Measuring Mathematical Problem Solving With the
MATH Dataset. Proceedings of the Neural Information Processing Systems Track on
Datasets and Benchmarks, 1, December 2021. URL https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html.
Chengsong Huang, Wenhao Yu, Xiaoyang Wang, Hongming Zhang, Zongxia Li, Ruosen Li, Jiaxin
Huang, Haitao Mi, and Dong Yu. R-Zero: Self-Evolving Reasoning LLM from Zero Data, August
2025.
Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge
Shi, Tom Schaul, and Tim Rocktäschel. Open-Endedness is Essential for Artificial Superhuman
Intelligence, June 2024a.
Edward Hughes, Michael D. Dennis, Jack Parker-Holder, Feryal M. P. Behbahani, Aditi Mavalankar,
Yuge Shi, Tom Schaul, and Tim Rocktäschel. Position: Open-endedness is essential for artificial
superhuman intelligence. In ICML. OpenReview.net, 2024b.
Aitor Lewkowycz, Anders Andreassen, David Dohan, Ethan Dyer, Henryk Michalewski, Vinay
Ramasesh, Ambrose Slone, Cem Anil, Imanol Schlag, Theo Gutman-Solo, Yuhuai Wu, Behnam
Neyshabur, Guy Gur-Ari, and Vedant Misra. Solving Quantitative Reasoning Problems with
Language Models. Advances in Neural Information Processing Systems, 35:3843–3857,
December 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html.
Bo Liu, Leon Guertler, Simon Yu, Zichen Liu, Penghui Qi, Daniel Balcells, Mickel Liu, Cheston
Tan, Weiyan Shi, Min Lin, Wee Sun Lee, and Natasha Jaques. SPIRAL: Self-Play on Zero-Sum
Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning, June 2025.
Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In International
Conference on Learning Representations, September 2018. URL https://openreview.net/forum?id=Bkg6RiCqY7.
Keming Lu, Hongyi Yuan, Zheng Yuan, Runji Lin, Junyang Lin, Chuanqi Tan, Chang Zhou, and
Jingren Zhou. #InsTag: Instruction Tagging for Analyzing Supervised Fine-tuning of Large
Language Models. In The Twelfth International Conference on Learning Representations,
October 2023. URL https://openreview.net/forum?id=pszewhybU9.
OpenAI. Learning to reason with LLMs, 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
OpenAI. GPT-5 System Card, August 2025a. URL https://openai.com/index/gpt-5-system-card/.
OpenAI. Introducing OpenAI o3 and o4-mini, 2025b. URL https://openai.com/index/introducing-o3-and-o4-mini/.
OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak,
Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz,
Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d. O. Pinto, Jonathan
Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie
Tang, Filip Wolski, and Susan Zhang. Dota 2 with Large Scale Deep Reinforcement Learning,
December 2019.
Julien Pourcel, Cédric Colas, Gaia Molinaro, Pierre-Yves Oudeyer, and Laetitia Teodorescu.
ACES: Generating a Diversity of Challenging Programming Puzzles with Autotelic Generative
Models. Advances in Neural Information Processing Systems, 37:67627–67662, December 2024.
URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/7d0c6ff18f16797b92e77d7cc95b3c53-Abstract-Conference.html.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,
Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of
Mathematical Reasoning in Open Language Models, April 2024.
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman,
Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine
Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go
with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016. ISSN
0028-0836, 1476-4687. doi: 10.1038/nature16961.
David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,
Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan
Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering
the game of Go without human knowledge. Nature, 550(7676):354–359, October 2017. ISSN
1476-4687. doi: 10.1038/nature24270.
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, and Huan
Zhang. Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-
targeted Online Data Selection and Rollout Replay, June 2025.
Zhengyang Tang, Xingxing Zhang, Benyou Wang, and Furu Wei. MathScale: Scaling Instruction
Tuning for Mathematical Reasoning. In Proceedings of the 41st International Conference on
Machine Learning, pp. 47885–47900. PMLR, July 2024. URL https://proceedings.mlr.press/v235/tang24k.html.
Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya
Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma
2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
Qwen Team. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Juny-
oung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan
Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou,
Max Jaderberg, Alexander S. Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David
Budden, Yury Sulsky, James Molloy, Tom L. Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff,
Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom
Schaul, Timothy Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver.
Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature,
575(7782):350–354, November 2019. ISSN 1476-4687. doi: 10.1038/s41586-019-1724-z.
Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan
Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformer reinforcement
learning. https://github.com/huggingface/trl, 2020.
Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao
Zhou, Huijie Lv, Ming Zhang, Yanwei Fu, Qin Liu, Songyang Zhang, and Qi Zhang. Reasoning
or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination,
August 2025.
Weihao Zeng, Yuzhen Huang, Qian Liu, Wei Liu, Keqing He, Zejun Ma, and Junxian He. SimpleRL-
Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild,
March 2025.
Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang,
Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute Zero: Reinforced Self-play Reasoning
with Zero Data, May 2025.
Haizhong Zheng, Yang Zhou, Brian R. Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and
Beidi Chen. Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via
Selective Rollouts, June 2025.
A EXTENDED RESULTS AND ANALYSIS
A.1 FULL RESULTS
We provide the full results of all seeds in Table 5.
A.2 CASE STUDY
This section provides further analysis of question-solution pairs during training.
As discussed in Section 4.1, the model generates predominantly invalid problems early in training.
The majority of these problems involve simple mathematical concepts such as arithmetic and fail due
to missing information (Figures 4 and 5). When attempting complex topics like optimisation, which
are rare at this stage, the model produces problems with both missing information and fundamental
formulation errors (Figure 6). This reveals that the model has a limited understanding of the underlying
mathematical concepts. Invalid problems tend to exhibit low solve rates (≤0.25) and correspondingly
receive lower rewards, helping the model learn to generate valid problems. Consequently, the
proportion of invalid problems decreases rapidly over training (§4.1).
However, not all problems with low solve rates are invalid (§4.2). We find that problems involving
certain topics that are challenging for the model, such as geometric series, persistently exhibit low
solve rates (Figure 8). The model struggles with exponentiation calculations, resulting in poor
performance on geometric series problems. This reveals a fundamental trade-off in OpenSIR:
while higher solve-rate thresholds effectively filter out invalid problems, they inevitably discourage
exploration of genuinely difficult topics. Since such problems receive low solvability scores, the
model is unlikely to receive sufficient encouragement to explore these topics further.
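The solve-rate signal behind this filtering can be sketched in a few lines. The `low`/`high` thresholds and the piecewise reward shape below are illustrative assumptions, not the paper's exact solvability reward:

```python
def solve_rate(is_correct: list[bool]) -> float:
    """Fraction of student rollouts that solved the problem."""
    return sum(is_correct) / len(is_correct)

def solvability_reward(rate: float, low: float = 0.25, high: float = 0.75) -> float:
    """Illustrative solvability reward: problems solved sometimes, but not
    always, are the most informative for the student. `low` and `high` are
    assumed thresholds, not the paper's values."""
    if rate <= low or rate >= 1.0:
        return 0.0  # likely invalid (almost never solved) or trivial (always solved)
    return 1.0 if rate <= high else 0.5

# Example: 2 of 8 rollouts correct gives rate 0.25, which is filtered out.
assert solvability_reward(solve_rate([True, True] + [False] * 6)) == 0.0
```

Under this shaping, persistently hard but valid topics (like the geometric series problems above) are suppressed together with genuinely invalid ones, which is exactly the trade-off discussed in the text.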
In later training stages, we observe OpenSIR gradually expanding into advanced mathematical
domains. After 100 training steps, the model starts to generate problems involving concepts such
as optimisation (Figure 9), calculus (Figure 10), trigonometry-based physics (Figure 11), and
probability (Figure 12), among others. While these advanced problems yield lower solve rates,
indicating that the model has a limited understanding of these domains, they achieve high novelty
scores with large semantic distances and longer solutions. This progression shows how the novelty
rewards in OpenSIR drive exploration of diverse mathematical concepts, enabling open-ended learning.
A concert venue charges an admission price per seat and also offers a VIP ticket package that includes
food, drinks, and other perks. If the food and other perks are included in the VIP ticket price and are
worth $10 per person, and a group of friends want to buy the same number of VIP tickets as they would
with regular tickets, what is the minimum admission price per regular ticket to make the total cost of
the VIP tickets equal to or less than the total cost of the regular tickets?
Figure 4: An invalid arithmetic question generated at step 0 with a solve rate of 0.25. This question
is invalid since the VIP ticket price is not provided, making it impossible to calculate the minimum
regular ticket price.
Find the percentage difference in the cumulative growth of two continuously compounded interest
functions after 5 years: A = P * e^(rt), where A is the amount of money accumulated after n years,
including interest, P is the principal amount, r is the annual interest rate, and t is the time the money is
invested for.
Figure 5: An invalid arithmetic question generated at step 0 with a solve rate of 0.125. This question
is invalid since the two interest rates and principal amounts are not provided; hence, it is impossible
to calculate the percentage difference from the general formula alone.
Consider two positive integers m and n (m ≥ n). Given a 2x2 matrix of numbers where each element is
a non-negative integer, find the maximum value of the following expression: ax^2 + by^2 + cxy, where
a, b, and c are constants, subject to the constraint that the sum of any two elements in each row and
column of the matrix are equal. What is the maximum possible value of ax^2 + by^2 + cxy?
Figure 6: An invalid optimisation question generated at step 0 with a solve rate of 0.125. This
question is invalid because information about the constants a, b, and c is missing. There are also
ambiguities, such as the role of m and n in the problem, and it does not explain what the elements
of the matrix represent. Lastly, it contains problem formulation errors, specifically failing to specify
constraints that ensure bounded solutions, demonstrating insufficient understanding of optimisation
problem structure.
Question
A company is adding new employees to its workforce. Initially, the company has 500 employees and
adds 10 new employees every quarter. If this trend continues for the next 5 years, and each quarter the
company adds a fixed percentage increase in the number of new employees (starting with 10%), how
many new employees can the company be expected to add during the first 2 years?
Solution (Incorrect)
To find the total number of new employees added in the first 2 years, we need to calculate the number
of new employees added each quarter for the first 2 years and then sum them up.
First, let's calculate the number of quarters in 2 years: 2 years * 4 quarters/year = 8 quarters.
The company starts with 10 new employees and adds 10% more each quarter. To find the total
number of new employees added each quarter, we can use the formula for compound interest:
A = P * (1 + r)^n
where:
A = total number of new employees after n quarters
P = initial number of new employees (10)
r = percentage increase per quarter (10% or 0.1)
n = number of quarters
However, since we want to find the total number of new employees added each quarter, we need to find
the difference between A and the initial number of new employees (10). We can rewrite the formula as:
A − P = P * (1 + r)^n
A − 10 = 10 * (1 + 0.1)^n
A = 10 * (1.1)^n
Now, we can calculate A for each quarter:
Quarter 1 (n = 1): A = 10 * (1.1)^1 = 11
Quarter 2 (n = 2): A = 10 * (1.1)^2 = 12.1
Quarter 3 (n = 3): A = 10 * (1.1)^3 = 13.21
Quarter 4 (n = 4): A = 10 * (1.1)^4 = 14.41
Quarter 5 (n = 5): A = 10 * (1.1)^5 = 15.71
Quarter 6 (n = 6): A = 10 * (1.1)^6 = 17.31
Quarter 7 (n = 7): A = 10 * (1.1)^7 = 19.33
Quarter 8 (n = 8): A = 10 * (1.1)^8 = 21.59
Now, we can calculate the total number of new employees added in the first 2 years by summing
up the number of new employees added each quarter:
Total = 11 + 12.1 + 13.21 + 14.41 + 15.71 + 17.31 + 19.33 + 21.59
Total = 119.66
Since we can't have a fraction of an employee, we can round down to the nearest whole number.
Therefore, the final answer is: 119
Figure 8: A valid arithmetic problem involving a geometric series with a solve rate of 0.125. The
solution requires either summing the quarterly employee additions or applying the geometric series
formula a * (r^n − 1)/(r − 1). Llama-3.2-3B-Instruct struggles with exponentiation calculations,
resulting in poor performance on geometric series problems.
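As a sanity check on the arithmetic in Figure 8, the correct cumulative total under the solution's own model (10 * (1.1)^n new employees in quarter n) can be computed directly or with the closed-form geometric sum; both give approximately 125.79 rather than the 119.66 reported, confirming that the per-quarter exponentiation errors accumulate:

```python
# Direct sum of the quarterly additions 10 * 1.1**n for quarters n = 1..8.
direct = sum(10 * 1.1**n for n in range(1, 9))

# Closed-form geometric sum a * (r**n - 1) / (r - 1) with first term
# a = 10 * 1.1 = 11 and common ratio r = 1.1 over n = 8 quarters.
a, r, n = 11, 1.1, 8
closed = a * (r**n - 1) / (r - 1)

assert abs(direct - closed) < 1e-9
print(round(direct, 2))  # 125.79
```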
A farmer has 10 different types of vegetables to store in a warehouse with 1,000 square meters of floor
space. There are four container sizes available:
- Small (5 sq m): Maximum 50 available
- Medium (10 sq m): Maximum 40 available
- Large (15 sq m): Maximum 30 available
- Extra-large (20 sq m): Maximum 25 available
The vegetables have different storage requirements:
- 3 bulky vegetables (pumpkins, watermelons, cabbages) require containers of at least 15 sq m
- 4 medium vegetables (tomatoes, peppers, eggplants, zucchini) require containers of at least 10 sq m
- 3 small vegetables (carrots, onions, potatoes) can fit in any container size
Each vegetable type must be stored in at least one container. What is the maximum number of con-
tainers that can be used while satisfying all constraints and not exceeding 1,000 sq m total space?
Figure 9: A valid optimisation problem with a solve rate of 0.375 generated at step 124.
Find the equation of the curve y = f(x) where the derivative is given by f'(x) = (3x^2 − x − 2)/(2x)
and the curve passes through the point (2, 3).
Figure 10: A valid calculus problem with a solve rate of 0.375 generated at step 156.
A golfer hits a ball from the top of a 50-meter high cliff with an initial velocity of 30 m/s at an angle
of 45 degrees above the horizontal. What is the horizontal distance traveled by the ball when it hits the
ground?
Figure 11: A valid physics problem that involves trigonometry with a solve rate of 0.5 generated at
step 172.
Consider a randomly ordered sequence of n = 3q distinct integers {a_1, a_2, ..., a_{3q}} where q is a
positive integer. Define f as the number of adjacent pairs (a_i, a_{i+1}) in the sequence where both
integers have the same remainder when divided by 3 (i.e., a_i mod 3 = a_{i+1} mod 3). If the integers 1
through 3q are randomly permuted to form this sequence, what is the expected value of f?
Figure 12: A valid probability problem with a solve rate of 0.25 generated at step 188.
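For the problem in Figure 12, a linearity-of-expectation argument (ours, not the paper's) gives E[f] = (3q − 1) * (q − 1)/(3q − 1) = q − 1, since each of the 3q − 1 adjacent pairs matches residues with probability (q − 1)/(3q − 1). A quick Monte Carlo sketch agrees:

```python
import random

def f_count(seq):
    """Count adjacent pairs with equal remainder mod 3."""
    return sum(a % 3 == b % 3 for a, b in zip(seq, seq[1:]))

def estimate_expected_f(q: int, trials: int = 20000) -> float:
    """Monte Carlo estimate of E[f] over random permutations of 1..3q."""
    seq = list(range(1, 3 * q + 1))
    total = 0
    for _ in range(trials):
        random.shuffle(seq)
        total += f_count(seq)
    return total / trials

random.seed(0)
print(estimate_expected_f(3))  # close to q - 1 = 2
```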
Models       Seed   GSM8K   MATH-500   Minerva Math   College Math   OlympiadBench   Avg.

Llama-3.2-3B-Instruct
Base         -      73.94   42.86      15.21          28.78          13.09           34.78
GRPO gsm8k   42     79.60   45.41      16.34          33.31          14.71           37.87
             43     79.62   44.56      16.64          33.35          14.52           37.74
             44     79.93   45.91      15.83          33.32          14.46           37.89
             Avg.   79.72±0.19  45.30±0.68  16.27±0.41  33.33±0.02  14.56±0.13  37.83±0.37
GRPO math    42     76.99   45.02      16.38          33.02          14.31           37.14
             43     76.51   45.23      15.95          32.87          13.85           36.88
             44     75.93   45.52      15.95          32.95          14.23           36.92
             Avg.   76.48±0.53  45.26±0.25  16.09±0.25  32.95±0.07  14.13±0.24  36.98±0.31
OpenSIR      42     77.82   46.38      17.72          34.24          15.46           38.32
             43     78.58   45.91      17.23          34.58          15.86           38.43
             44     78.43   46.38      17.44          34.45          15.84           38.51
             Avg.   78.28±0.40  46.22±0.27  17.46±0.24  34.42±0.17  15.72±0.23  38.42±0.27

Gemma-2-2B-Instruct
Base         -      38.50   16.51      10.09          19.11          3.00            17.44
GRPO gsm8k   42     58.32   18.86      7.53           20.17          3.18            21.61
             43     58.86   19.21      7.96           20.77          3.08            21.98
             44     59.06   19.36      7.76           20.42          3.37            21.99
             Avg.   58.75±0.38  19.14±0.26  7.75±0.22  20.45±0.30  3.21±0.15  21.86±0.27
GRPO math    42     55.14   22.31      7.95           15.71          3.03            20.83
             43     53.94   22.53      7.90           15.08          3.11            20.51
             44     59.01   23.44      8.02           18.15          3.57            22.44
             Avg.   56.03±2.65  22.76±0.60  7.96±0.06  16.31±1.62  3.24±0.29  21.26±1.42
OpenSIR      42     58.68   24.09      8.89           22.29          2.99            23.39
             43     58.36   25.69      10.73          26.14          3.23            24.83
             44     57.03   24.49      8.89           21.66          3.24            23.06
             Avg.   58.03±0.87  24.75±0.83  9.51±1.06  23.36±2.43  3.15±0.14  23.76±1.30

Qwen-2.5-3B-Instruct
Base         -      84.43   65.36      25.23          48.22          27.94           50.24
GRPO gsm8k   42     84.71   65.40      26.33          48.51          28.21           50.63
             43     85.16   65.80      24.84          48.46          28.50           50.55
             44     84.96   66.10      24.75          48.40          28.23           50.49
             Avg.   84.94±0.23  65.77±0.35  25.31±0.89  48.46±0.06  28.31±0.16  50.56±0.45
GRPO math    42     84.24   65.74      25.23          48.53          28.23           50.39
             43     84.19   65.64      25.14          48.20          27.98           50.23
             44     84.49   66.30      24.59          48.29          28.57           50.45
             Avg.   84.31±0.16  65.89±0.36  24.98±0.35  48.34±0.17  28.26±0.30  50.36±0.28
OpenSIR      42     85.43   66.17      26.49          48.88          28.86           51.17
             43     85.26   65.64      25.30          48.62          28.30           50.62
             44     85.44   65.79      26.08          48.72          27.83           50.77
             Avg.   85.38±0.10  65.87±0.28  25.96±0.61  48.74±0.13  28.33±0.52  50.85±0.38
Table 5: Math reasoning evaluation results reported per seed. We report avg@16 per problem for
each seed (42, 43, 44) together with their mean ± standard deviation.
With diversity reward: ROUGE-L similarity
            Step 0   Step 100   Step 200   GSM8K   MATH
Step 0      1.000    0.150      0.153      0.101   0.107
Step 100    0.150    1.000      0.152      0.101   0.107
Step 200    0.153    0.152      1.000      0.096   0.110
GSM8K       0.101    0.101      0.096      1.000   0.075
MATH        0.107    0.107      0.110      0.075   1.000

With diversity reward: Concept overlap (%)
            Step 0   Step 100   Step 200   GSM8K   MATH
Step 0      100      21         23         19      21
Step 100    21       100        23         27      16
Step 200    23       23         100        24      19
GSM8K       19       27         24         100     22
MATH        21       16         19         22      100

Without diversity reward: ROUGE-L similarity
            Step 0   Step 100   Step 200   GSM8K   MATH
Step 0      1.000    0.338      0.248      0.139   0.128
Step 100    0.338    1.000      0.274      0.143   0.132
Step 200    0.248    0.274      1.000      0.124   0.132
GSM8K       0.139    0.143      0.124      1.000   0.075
MATH        0.128    0.132      0.132      0.075   1.000

Without diversity reward: Concept overlap (%)
            Step 0   Step 100   Step 200   GSM8K   MATH
Step 0      100      42         33         23      32
Step 100    42       100        52         33      26
Step 200    33       52         100        34      29
GSM8K       23       33         34         100     22
MATH        32       26         29         22      100

Figure 13: Heatmap visualisation of n-gram similarity (ROUGE-L scores) and concept overlap
between generated problems at training steps 0, 100, and 200 and the reference datasets (MATH,
GSM8K). Top: with diversity reward; bottom: without diversity reward. With the diversity reward
incorporated, the generated problems exhibit low textual similarity and minimal concept overlap,
demonstrating effective exploration of diverse problem types.
18
A.3 FURTHER ANALYSIS ON QUESTION DIVERSITY
Figure 13 presents n-gram similarity and concept analysis. We compute ROUGE-L scores between
problem texts and extract mathematical concepts using GPT-5 from problems at steps 0, 100, and
200, as well as from the MATH and GSM8K training sets. With diversity rewards (top row), prob-
lems maintain low ROUGE-L scores and minimal concept overlap both across training stages and
with MATH/GSM8K. Without diversity rewards (bottom row), both textual similarity and concept
overlap increase, confirming limited exploration of new problem types.
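The ROUGE-L scores in this analysis can be reproduced with a small LCS-based implementation. The sketch below uses plain whitespace tokenisation, which may differ from the exact tokenisation used for the paper's figures:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists (1-D DP)."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp value for the previous row, previous column
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def rouge_l(ref: str, hyp: str) -> float:
    """ROUGE-L F1 over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    lcs = lcs_len(r, h)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(h), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

assert rouge_l("a b c", "a b c") == 1.0
```

Averaging `rouge_l` over all problem pairs between two sets yields one cell of the similarity matrices in Figure 13.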
B ANNOTATION DETAILS
One of the authors prepared the samples for annotation, and the remaining authors annotated the
samples following the instructions provided in Figure 14.
You will be presented with multiple sets of 5 math problems to evaluate. For each set, please complete
the following three-step annotation process.
# Step 1: Identify Topics
For each problem, identify ALL relevant mathematical topics from the following list:
- Algebra
- Geometry
- Calculus
- Probability
- Statistics
- Number Theory
- Combinatorics
- Optimization
- Arithmetic
- Discrete Math
- Trigonometry
# Step 2: Assess Validity
For each problem, determine if it is valid or invalid:
- Valid: The problem is logically sound, clearly stated, and can be answered with the given information
- Invalid: The problem contains logical flaws, contradictions, insufficient information, or ambiguities
that prevent a proper solution
# Step 3: Rank Difficulty
Rank all 5 problems from easiest to hardest. Provide your ranking as a sequence of problem numbers.
Example: [3, 1, 5, 2, 4] means problem 3 is the easiest and problem 4 is the hardest.
Consider these factors when assessing difficulty:
- Number of steps required
- Complexity of concepts involved
- Level of mathematical knowledge needed
- Computational complexity
# Response Format
Provide your annotations as a JSON list where each element represents one problem set. Here are
some examples:
[
{
"set_id": "SET_1",
"problems": {
"1": {"topics": ["Algebra", "Calculus"], "valid": true},
"2": {"topics": ["Geometry"], "valid": false},
"3": {"topics": ["Probability"], "valid": true},
"4": {"topics": ["Number Theory"], "valid": true},
"5": {"topics": ["Arithmetic"], "valid": true}
},
"difficulty_ranking": [5, 3, 1, 2, 4]
},
{
"set_id": "SET_2",
"problems": {
"1": {"topics": ["Statistics"], "valid": true},
"2": {"topics": ["Discrete Math"], "valid": true},
"3": {"topics": ["Optimization"], "valid": true},
"4": {"topics": ["Algebra"], "valid": false},
"5": {"topics": ["Geometry", "Algebra"], "valid": true}
},
"difficulty_ranking": [1, 2, 5, 3, 4]
},
...
]
Figure 14: The instruction provided to the annotators to annotate problems.
C ADDITIONAL ABLATIONS
C.1 SOLUTION LENGTH REWARD INCREASES PROBLEM COMPLEXITY
Model        Question Length   Solution Length   Acc
w/ length    207               387               38.42
w/o length   150               238               37.86
Table 6: Comparison of OpenSIR performance with and without solution length reward. Solution
length reward improves OpenSIR accuracy and increases average question and solution lengths.
We investigate the impact of the solution length reward in OpenSIR. Table 6 shows that this reward
improves performance from 37.86% to 38.42%. It also increases the average question length (from
150 to 207 tokens) and solution length (from 238 to 387 tokens). By manually examining the generated
questions, we find that with this reward the policy tends to generate more sophisticated problems
involving advanced concepts, such as linear programming and optimisation, which naturally require
longer multi-step solutions. These results demonstrate that the solution length reward effectively
guides the policy toward generating more complex problems, which in turn leads to better performance.
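The exact form of the solution length reward is defined in the main text; as a sketch of the general idea, a capped linear bonus on solution length (the cap `max_tokens` below is an assumed value, not the paper's) already creates the incentive described here:

```python
def length_reward(solution_tokens: int, max_tokens: int = 512) -> float:
    """Illustrative length reward: longer (multi-step) solutions earn more,
    capped so the policy cannot profit from unbounded padding.
    `max_tokens` is an assumption for this sketch."""
    return min(solution_tokens / max_tokens, 1.0)

assert length_reward(256) == 0.5   # a short solution earns a partial bonus
assert length_reward(2048) == 1.0  # the cap prevents reward hacking via padding
```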
C.2 ROBUSTNESS TO DIVERSITY MEASUREMENTS
Reward      Acc     # Concepts
Embedding   38.42   5,914
Concepts    38.26   6,213
Table 7: Comparison of diversity measurement approaches in OpenSIR. Despite slight differences in
concept coverage, both embedding-based and concept-based diversity rewards yield nearly identical
accuracy, demonstrating the framework’s robustness to the choice of diversity metric.
We have established the necessity of diversity rewards in Section 4.3. In this section, we further
investigate OpenSIR’s robustness to different diversity measurement approaches. We implement
concept-based diversity by measuring diversity through the mathematical concepts of the problems
(Lu et al., 2023; Havrilla et al., 2025). Formally, we define the concept diversity reward as:
rcon(q) =|Cq| − |C q∩ CPt−1|
3(10)
whereC qare the concepts in problemqandC Pt−1=S
q′∈Pt−1Cq′represents the union of concepts
from all problems in the existing pool. Since each problem contains at most three concepts, this
reward calculates the fraction of new concepts introduced.
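Equation (10) translates directly into code. A minimal sketch, treating concepts as plain strings and the pool's concept union as a running set:

```python
def concept_diversity_reward(concepts: set[str], pool_concepts: set[str]) -> float:
    """r_con(q) = |C_q \\ C_{P_{t-1}}| / 3: the fraction of new concepts,
    given that each problem carries at most three concepts."""
    new = concepts - pool_concepts  # |C_q| - |C_q intersect C_pool|
    return len(new) / 3

pool = {"arithmetic", "percentages"}
assert concept_diversity_reward({"calculus", "arithmetic"}, pool) == 1 / 3
```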
Table 7 shows that both embedding-based and concept-based diversity rewards achieve similar ac-
curacy (38.42 vs 38.26), demonstrating the framework’s robustness to the choice of diversity metric.
Beyond accuracy, we examine concept coverage, the number of unique mathematical concepts
discovered during training, as a direct measure of exploratory diversity. As expected, concept-based
diversity achieves slightly higher coverage (6,213 concepts), since it explicitly optimises for novel
concept discovery. Surprisingly, embedding-based diversity attains comparable coverage (5,914
concepts, 95% of the concept-based approach) despite not tracking concepts explicitly. This suggests
that maximising representational spread in embedding space effectively promotes novelty discovery,
achieving open-ended learning.
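The embedding-based diversity reward is computed over Linq-Embed-Mistral embeddings; since its precise formulation is given in the main text rather than here, the nearest-neighbour cosine-distance form below is only one plausible sketch, with all names assumed:

```python
import numpy as np

def embedding_diversity_reward(q_emb: np.ndarray, pool_embs: np.ndarray) -> float:
    """Cosine distance from the new problem's embedding to its nearest
    neighbour in the pool; larger means more novel. Assumed formulation."""
    q = q_emb / np.linalg.norm(q_emb)
    pool = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = pool @ q  # cosine similarity to every pooled problem
    return float(1.0 - sims.max())

# A problem identical to a pooled one gets zero novelty reward.
pool = np.array([[1.0, 0.0], [0.0, 1.0]])
assert embedding_diversity_reward(np.array([1.0, 0.0]), pool) == 0.0
```

Under this formulation, spreading problems out in embedding space is rewarded directly, which is consistent with the observed coverage of novel concepts.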
D IMPLEMENTATION DETAILS
D.1 TRAINING DETAILS
Category          Hyperparameter                   Value
Trainer           Learning rate                    3 × 10^-7
                  Optimiser                        AdamW (Loshchilov & Hutter, 2018)
                  Warmup steps                     20
                  Training steps                   100/200
                  KL loss coefficient              1 × 10^-4
                  Gradient norm clipping           0.5
                  Seeds                            42/43/44
                  GPUs                             3 H100
Rollout           Batch size†                      256
                  Max prompt length                1024
                  Max solution length              2048
                  Number of rollouts per prompt    8
                  Temperature                      1.0
Teacher Rewards   Solvability weight (α)           1.0
                  Solution length weight (λ)       1.0
                  Diversity weight (γ)             1.0
                  Format weight (δ)                0.1
                  Embedding model                  Linq-Embed-Mistral (7B)
Student Rewards   Accuracy weight                  1.0
                  Format weight (δ)                0.1
† The number of rollouts seen for one gradient update.
Table 8: The training configurations for the experiments.
We implement OpenSIR based on the TRL framework (von Werra et al., 2020). Table 8 provides a
summary of the training hyperparameters used in our experiments.
D.2 PROMPTS
We detail the prompt for generating problems in Figure 15 and the prompt for solving problems in Figure 16.
You are given a math problem: {Problem}
Your task is to create a math problem that is conceptually different from the provided problem. The
new problem must be answerable with a numerical value or mathematical expression.
First, explain how your new problem differs conceptually from the original problem inside the
<think>...</think> tags. Then, present your new problem inside the <problem>...</problem> tags.
Finally, identify at most three math concepts required to solve your problem. Provide these concepts
in a comma-separated list inside the <concepts>...</concepts> tags.
Figure 15: Prompt for generating math problems. {Problem} is a placeholder for the reference
problem sampled from the problem pool.
You are a helpful AI Assistant, designed to provide well-reasoned and detailed responses. You
FIRST think about the reasoning process step by step and then provide the user with the answer.
The last line of your response should be 'Therefore, the final answer is: $\boxed{ANSWER}$'
(without quotes) where ANSWER is just the final number or expression that solves the problem.
{Problem}
{Problem}
Figure 16: Prompt for generating solutions to math problems. {Problem} is a placeholder for the
actual problem.