Optimal inference schedules for masked diffusion models
Sitan Chen* (Harvard)
Kevin Cong† (Harvard)
Jerry Li‡ (UW)
November 11, 2025
Abstract
A major bottleneck of standard autoregressive large language models is that their inference process is inherently sequential, resulting in very long and costly inference times. To circumvent this, practitioners proposed a class of language models called diffusion language models, of which the masked diffusion model (MDM) is the most successful. The MDM is able to sample tokens out of order and, ostensibly, many tokens at once and in parallel. However, there is very limited rigorous understanding of how much parallel sampling these models can perform without noticeable degradation in their sampling performance. Prior work of Li and Cai [LC25] obtained some preliminary bounds, but these are not tight for many natural classes of distributions. In this work, we give a new, exact characterization of the expected divergence between the true distribution and the sampled distribution, for any distribution and any unmasking schedule for the sampler, showing an elegant connection to the theory of univariate function approximation.
By leveraging this connection, we then attain a number of novel lower and upper bounds for this problem. While the connection to function approximation in principle gives the optimal unmasking schedule for any distribution, we show that it is in general impossible to compete with it without strong a priori knowledge of the distribution, even in seemingly benign settings. However, we also demonstrate new upper bounds and new sampling schedules in terms of well-studied information-theoretic properties of the base distribution, namely, its total correlation and dual total correlation, which show that in some natural settings, one can sample in $O(\log n)$ steps without any visible loss in performance, where $n$ is the total sequence length.
*Email: sitan@seas.harvard.edu. Work supported in part by NSF CAREER Award CCF-2441635.
†Email: kcong@college.harvard.edu
‡Email: jerryzli@cs.washington.edu

arXiv:2511.04647v2 [cs.LG] 9 Nov 2025
Contents

1 Introduction 1
1.1 Result 1: Optimal unmasking schedule 2
1.2 Result 2: Impossibility of competing with the optimal schedule 3
1.3 Result 3: (Dual) total correlation and reduction to a single hyperparameter sweep 4
1.4 Related work 5
2 Technical preliminaries 8
2.1 Notation 8
2.2 Oracle model 8
2.3 Information-theoretic quantities 8
3 Sampling error in terms of unmasking schedule: proof of Theorem 1.4 10
4 Lower bounds on competing with the oracle rate 13
4.1 Warmup example 15
4.2 Lower bounds for arbitrary information curves 17
5 Upper bound in terms of (dual) total correlation: proof of Theorem 1.9 17
A Logarithmic overhead is necessary 25
B Recovering existing bounds 27
B.1 Recovering the bound of Li and Cai [LC25] 27
B.2 Recovering the bound of Austin [Aus20] 28
C Decoupling estimation error and sampling error 29
D Finishing the proof of Theorem 4.10 30
1 Introduction
Diffusion models are the state-of-the-art approach to generative modeling over domains like video and
molecules, and in recent years have also emerged as a powerful alternative [SAS+24, SHW+24, NZY+25,
KKL+25] to autoregressive large language models (LLMs). Abstractly, these models perform distribution
learning by learning to reverse a corruption process transforming data into noise. Starting from fresh noise,
they apply the learned reverse process to map it into a fresh sample from the data distribution.
In state-of-the-art diffusion language models, the most common choice of corruption process is the erasure process. This is the basis for the popular paradigm of masked diffusion models (MDMs), which now form the backbone of leading approaches to non-autoregressive language modeling. The erasure process proceeds as follows: starting from a sample $X^0 = x^0 \in \Sigma^n$ at time $t = 0$, draw times $T_1, \dots, T_n$ from some measure on $[0,1]$, and define $X^t_i = x^0_i$ if $t \le T_i$ and $X^t_i = *$ otherwise, where "$*$" is a special character corresponding to erasure. Given samples from the data distribution $\mu$, one then trains a neural network to learn conditional marginals: for every time $t$ and conditioning $X^t = x^t$, estimate $\mathrm{law}(X^0_i \mid X^t = x^t)$ for all $i \in [n]$. It is readily seen that this is equivalent to learning the conditional marginals
$$\mathrm{law}(X_i \mid X_S = z), \qquad S \subseteq [n],\; i \notin S,$$
where $X \sim \mu$, $z \in \Sigma^{|S|}$, and $X_S = z$ denotes the partial assignment to the indices given by $S$. Given these marginals, it is straightforward to sample from $\mu$ by sampling one token at a time. Unlike LLMs, which sample one token at a time from left to right, MDMs can sample out of order.
Sampling multiple tokens at a time. Crucially, in practice the neural network that is trained to learn these conditional marginals can, given any such partial assignment $X_S = z$, simultaneously compute the conditional marginals for all $i \notin S$ in one network evaluation. One of the key selling points of MDMs is thus that these models have the freedom to sample multiple tokens at a time in parallel, whereas LLMs are inherently limited to sequential sampling. Empirically, a standard heuristic is the following: fix an unmasking schedule given by step sizes $s_1, s_2, \dots, s_k$ summing to $n$, and iterate the following for $t = 1, \dots, k$:

• Sample a random subset $S_t$ of size $s_t$ from among the indices $[n] \setminus (S_1 \cup \cdots \cup S_{t-1})$.

• For every $i \in S_t$, sample $x_i$ independently from $\mathrm{law}(X_i \mid X_{S_1 \cup \cdots \cup S_{t-1}} = x_{S_1 \cup \cdots \cup S_{t-1}})$; note that this ignores correlations across $S_t$ and thus introduces statistical error.¹

The goal is to make $k$ as small as possible while keeping the statistical error small. If $k = n$ and $s_1 = \cdots = s_n = 1$, then this will perfectly sample from the data distribution $\mu$, but it is no more efficient than sampling with an autoregressive model. On the other hand, if $k = 1$ and $s_1 = n$, this will sample in one step but output the product distribution whose 1-wise marginals agree with $\mu$, which in general will not be a good approximation to $\mu$.
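The heuristic above can be sketched concretely as follows. This is a minimal illustration, not the paper's implementation: the toy distribution `MU` and all function names are our own, and `cond_marginal` stands in for the single network evaluation that returns conditional marginals for all masked positions at once (here computed by brute force).

```python
import random
from collections import defaultdict

# Toy data distribution mu over Sigma^n (our own illustrative example).
SIGMA = (0, 1)
N = 3
MU = {(0, 0, 0): 0.35, (0, 1, 1): 0.15, (1, 0, 1): 0.15, (1, 1, 0): 0.35}

def cond_marginal(mu, revealed):
    """Stand-in for the network: law(X_i | X_S = x_S) for every masked i,
    given revealed = {position: token}, computed by brute force."""
    total = sum(p for x, p in mu.items()
                if all(x[i] == v for i, v in revealed.items()))
    out = {}
    for i in range(N):
        if i in revealed:
            continue
        dist = defaultdict(float)
        for x, p in mu.items():
            if all(x[j] == v for j, v in revealed.items()):
                dist[x[i]] += p / total
        out[i] = dist
    return out

def unmask_sample(mu, schedule, rng):
    """One run of the parallel unmasking heuristic with step sizes `schedule`."""
    revealed, masked = {}, list(range(N))
    for s_t in schedule:
        S_t = rng.sample(masked, s_t)            # random size-s_t subset of masked positions
        marginals = cond_marginal(mu, revealed)  # one "network evaluation"
        for i in S_t:                            # sample tokens in S_t independently
            toks, probs = zip(*marginals[i].items())
            revealed[i] = rng.choices(toks, weights=probs)[0]
            masked.remove(i)
    return tuple(revealed[i] for i in range(N))

rng = random.Random(0)
sample = unmask_sample(MU, [2, 1], rng)
```

With the fully sequential schedule `[1, 1, 1]` this samples exactly from `MU`, while `[3]` samples from the product of 1-wise marginals, matching the two extremes discussed above.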
In real-world deployments of MDMs, there is an art to picking the unmasking schedule to trade off between these extremes, giving rise to popular heuristics like the cosine schedule [CZJ+22, SHW+24] and the log-linear schedule [LME24, SAS+24], in which $s_1, s_2, \dots$ start out small and progressively increase. However, our understanding of how to pick these schedules, and how to rigorously quantify the statistical errors that arise from sampling multiple tokens in parallel, remains limited. In this work we therefore ask:

What is the optimal unmasking schedule for a given data distribution $\mu$ and target level of error?
¹ Of course, in reality we never have exact access to the conditional marginals, but in our theoretical analysis it is straightforward to decouple this error from the overall sampling error; see Appendix C.
This is a challenge not just for theory, but for practice. Indeed, a large-scale ML benchmark [KGO+25] was
released just weeks ago in an effort to systematically evaluate unmasking schedules for diffusion language
models. But as we will see, this is a question that is particularly amenable to the lens of theory.
1.1 Result 1: Optimal unmasking schedule
Our first result is a tight and surprisingly simple theoretical characterization of the optimal unmasking schedule for any $\mu$. The result exposes an elegant connection to univariate function approximation. To state the result, we first require some terminology.
Definition 1.1 (Expected KL error). Conditioned on a sequence of subsets $S_1, \dots, S_k$ of sizes $s_1, \dots, s_k$, let $\nu_{S_1,\dots,S_k}$ denote the distribution over outputs $x$ generated by the sampling algorithm above. The notion of error we will consider in this work is the expected KL error
$$\mathbb{E}_{S_1,\dots,S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1,\dots,S_k}\big)\big],$$
where the expectation is over subsets $S_i$ of size $s_i$ sampled according to the algorithm above.
Definition 1.2 (Left Riemann approximation). Given $Z = (Z_1, \dots, Z_n) \in \mathbb{R}^n_{\ge 0}$ and nodes $1 = N_1 < \cdots < N_k < n$, define the left Riemann approximation of $Z$ to be the $k$-step sequence $Z^N_1, \dots, Z^N_n$ given by:
$$Z^N_j = \begin{cases} Z_{N_a} & \text{if } N_a \le j < N_{a+1} \\ Z_{N_k} & \text{if } j \ge N_k. \end{cases}$$
Given any sequences $Z = (Z_1, \dots, Z_n)$ and $Z' = (Z'_1, \dots, Z'_n)$, we can define the integration error $\|Z - Z'\|_{L^1} := \sum_{j=1}^n |Z_j - Z'_j|$. The $k$-step left Riemann approximation to $Z$ minimizing this integration error is:
$$N^{*,k} := \operatorname*{argmin}_{1 = N_1 < \cdots < N_k < n} \|Z - Z^N\|_{L^1}. \tag{1}$$
Note that given $Z$, one can find the minimizing $N^{*,k}$ in polynomial time via dynamic programming.
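To illustrate, the dynamic program can be organized as follows; this sketch and its naming conventions are our own, not the paper's. Here `dp[m][a]` stores the minimal cost of the first $m - 1$ blocks given that the $m$-th node is placed at position $a$, and the tail block $[N_k, n]$ is closed off at the end.

```python
def riemann_error(z, nodes):
    """L1 error of the left Riemann approximation of z (1-indexed list; z[0] unused)."""
    n = len(z) - 1
    err, idx = 0.0, 0
    for j in range(1, n + 1):
        while idx + 1 < len(nodes) and nodes[idx + 1] <= j:
            idx += 1
        err += abs(z[j] - z[nodes[idx]])
    return err

def optimal_nodes(z, k):
    """Eq. (1): best nodes 1 = N_1 < ... < N_k < n via dynamic programming."""
    n = len(z) - 1
    INF = float("inf")

    def block_cost(a, b):  # sum_{j=a}^{b-1} |z_j - z_a|
        return sum(abs(z[j] - z[a]) for j in range(a, b))

    # dp[m][a]: minimal cost of the first m-1 full blocks with N_1 = 1, N_m = a.
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    parent = [[0] * (n + 1) for _ in range(k + 1)]
    dp[1][1] = 0.0
    for m in range(2, k + 1):
        for a in range(m, n):
            for b in range(m - 1, a):
                c = dp[m - 1][b] + block_cost(b, a)
                if c < dp[m][a]:
                    dp[m][a], parent[m][a] = c, b
    # Close with the tail block [N_k, n], then backtrack.
    best = min(range(k, n), key=lambda a: dp[k][a] + block_cost(a, n + 1))
    nodes = [best]
    for m in range(k, 1, -1):
        nodes.append(parent[m][nodes[-1]])
    return nodes[::-1]
```

The run time is $O(k n^2)$ evaluations of `block_cost` (itself linear), which is polynomial as claimed; faster variants are possible but beside the point here.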
The central object in this work is the following sequence quantifying correlations within $\mu$:

Definition 1.3 (Average mutual information curve). Given a random variable $X \sim \mu$ over $\Sigma^n$, define its information curve, denoted $Z = Z(\mu)$, by
$$Z_j = Z_j(\mu) := \mathbb{E}_{|S| = j-1,\, i \notin S}\,[I(X_i; X_S)], \qquad j \in [n],$$
i.e., the average mutual information between $X_i$ and $X_S$ for random $S \subseteq [n]$ of size $j - 1$ and random $i \notin S$. By Han's inequality [PW25, Theorem 1.7], we have that $0 = Z_1 \le Z_2 \le \cdots \le Z_n$.
Our first main result is an exact characterization of the optimal expected KL error achievable by any $k$-step sampler, in terms of the piecewise approximability of the distribution's information curve:

Theorem 1.4 (Optimal schedule given by best step approximation). Let $\mu$ be any distribution over $\Sigma^n$, and let $1 \le k \le n$. Let $N^{*,k}$ be the solution to Eq. (1) for $\mu$'s information curve $Z = Z(\mu)$.

Then for any unmasking schedule $s_1, \dots, s_k$, the expected KL error is given by
$$\mathbb{E}_{S_1,\dots,S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1,\dots,S_k}\big)\big] = \|Z - Z^N\|_{L^1}, \qquad \text{for } N_a := 1 + \sum_{t=1}^{a-1} s_t \quad \forall\, a \in [k].$$
In particular, the schedule that minimizes the expected KL error is
$$s_t = N^{*,k}_{t+1} - N^{*,k}_t, \qquad t \in [k].$$
In Figure 1, we give a pictorial depiction of the expected error in Theorem 1.4.
[Figure 1: plot of a discrete information curve $Z_j$ against $j$, with step sizes $s_1 = 4$, $s_2 = 5$, $s_3 = 2$, $s_4 = 2$ and $n = 13$.]

Figure 1: Discrete curve $Z$ (blue) and left Riemann approximation $Z^N$ (red) for a sample $Z_i$ curve. The latter extends beyond the $Z_j$ curve to $n + 1$ to show the final rectangle $Z_n - Z_{N_{k-1}+1}$; note that this term is not present in a standard left Riemann approximation. Light blue background rectangles represent the Riemann approximation terms. The total area is $\|Z - Z^N\|_{L^1}$.
The proof of Theorem 1.4 is remarkably simple once one realizes that the key object driving the statistical error of MDM sampling is the information curve of $\mu$; we therefore regard the main technical contributions of this result as identifying the correct information-theoretic object to study, as well as drawing the surprising connection to univariate function approximation.
1.2 Result 2: Impossibility of competing with the optimal schedule
Although Theorem 1.4 gives an exact characterization of the optimal schedule, and this schedule can be found in polynomial time given a priori knowledge of the information curve, pragmatically it is unclear how to use it, as this a priori knowledge is not readily available.² One might hope that by making use of conditional marginal queries to the neural network, one can estimate $Z(\mu)$ to sufficient accuracy and then deduce the optimal schedule from this.

In the next part of this work, we prove a collection of impossibility results demonstrating that this is not possible in general, even under seemingly benign conditions. Our lower bounds in this part apply to an even more general setting where the sampling algorithm can adaptively make any $k$ conditional marginal queries it chooses (see Definition 2.1), possibly in a randomized fashion, and then must output a sample such that, marginally over its internal randomness, the algorithm's output distribution is close to $\mu$.
We begin by considering a simple scenario where the curve is promised to be either the constant zero curve ($Z_j = 0$) or a single step function ($Z_j = \mathbb{I}[j > j^*]$ for an unknown $j^*$), in which case the optimal schedule is simply determined by the location of the step, if it exists. There exist distributions realizing both kinds of information curve, namely the uniform distribution over $\Sigma^n$ and the uniform distribution over a maximum distance separable (MDS) code (see Definition 4.1). Our first result shows that even in this situation, finding $j^*$, if it exists, requires a prohibitive number of conditional marginal queries.

² If one has $\mathrm{poly}(n, \varepsilon)$ many held-out samples from $\mu$, one can always estimate each of the $Z_i$'s to sufficient precision, but in practice it can be prohibitively expensive to generate this many samples using the diffusion model.
Theorem 1.5 (Uniform versus code is hard; see Theorem 4.9 for formal statement). There does not exist a single sampling algorithm which simultaneously achieves iteration complexity $o(n)$ for sampling to expected KL error $O(1)$ for all distributions, in fact even for sampling to expected TV error $1/2$. Moreover, this holds even if the algorithm knew a priori that the distribution were either the uniform distribution over $\Sigma^n$ or a uniform distribution over an unknown MDS code.

One might wonder if this worst-case result is too pessimistic, and whether in practice the relevant data distributions are very far from uniform and have useful correlational structure that one might hope to exploit. Unfortunately, the following strengthening of Theorem 1.5 shows that this is still not the case:
Theorem 1.6 (Hardness for general information curves; see Theorem 4.10 for formal statement). Let $Z = Z(\mu)$ be any information curve, where $\mu$ is a distribution over $\mathbb{F}_q^n$, such that there exists an unmasking schedule under which one can sample from $\mu$ to expected KL error $O(1)$, in fact even just to expected TV error $1/2$. For every $1 \le k < n$, let $Z^{\uparrow k}$ denote the information curve given by shifting every $Z_j$ for $j > k$ up by $\log_2(q)$.

There does not exist a single sampling algorithm which simultaneously achieves iteration complexity $o(n)$, even if the algorithm knew a priori that the information curve of the distribution was one of $Z, Z^{\uparrow 1}, \dots, Z^{\uparrow n-1}$.

Intuitively, this result and the previous one follow from the fact that one can engineer sharp discrete jumps in the information curve which are not detectable unless one conditions on exactly the right number of indices. We remark that the two results are technically incomparable, as the hard distributions for Theorem 1.6 when $Z$ is uniformly zero are slightly more intricate than simply uniform vs. MDS.
1.3 Result 3: (Dual) total correlation and reduction to a single hyperparameter sweep
In the final part of this paper, we redeem the situation by showing that for any distribution $\mu$, there exist unmasking schedules, depending only on a single scalar parameter quantifying correlations in the distribution, which achieve small expected KL error.

For this, we first need to define two relevant information-theoretic quantities:
Definition 1.7 (Total correlation and dual total correlation). For any random variable $X \sim \mu$ over $\Sigma^n$, define the total correlation (TC) as
$$\mathrm{TC} = \mathrm{TC}(\mu) := \Big(\sum_{i=1}^n H(X_i)\Big) - H(X_1, \dots, X_n)$$
and the dual total correlation (DTC) as
$$\mathrm{DTC} = \mathrm{DTC}(\mu) := H(X_1, \dots, X_n) - \sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$
From its definition, we see that TC is equivalently the KL divergence between $\mu$ and the product distribution whose marginals agree with those of $\mu$, and thus it characterizes how "product" $\mu$ is. On the other hand, DTC has been shown to quantify the extent to which $\mu$ can be expressed as a sparse mixture of product distributions [Aus20]. These quantities admit nice characterizations in terms of the information curve of $\mu$:
Lemma 1.8. For any distribution $\mu$ with information curve $Z$,

1. $\mathrm{TC}(\mu) = \sum_{i=1}^n Z_i$, and
2. $\mathrm{DTC}(\mu) = nZ_n - \sum_{i=1}^n Z_i = nZ_n - \mathrm{TC}(\mu)$.

Under the pictorial representation in Figure 1, TC is therefore the area under the information curve, and DTC is the area above the information curve (capped at $Z_n$).
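As a numerical sanity check of Lemma 1.8, the following sketch computes the information curve of Definition 1.3 and the TC/DTC of Definition 1.7 by brute-force enumeration over a small toy distribution; the distribution and all helper names are our own illustrative choices.

```python
import itertools
import math
from collections import defaultdict

# Toy distribution mu over {0,1}^3 with nontrivial correlations (our own example).
MU = {(0, 0, 0): 0.4, (0, 1, 1): 0.1, (1, 0, 1): 0.2, (1, 1, 0): 0.3}
N = 3

def marginal(mu, coords):
    out = defaultdict(float)
    for x, p in mu.items():
        out[tuple(x[i] for i in coords)] += p
    return out

def H(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def info_curve(mu, n):
    """Z_j = E_{|S|=j-1, i not in S} I(X_i; X_S)  (Definition 1.3); 1-indexed."""
    Z = [0.0] * (n + 1)
    for j in range(1, n + 1):
        terms = []
        for S in itertools.combinations(range(n), j - 1):
            for i in set(range(n)) - set(S):
                # I(X_i; X_S) = H(X_i) + H(X_S) - H(X_{S u {i}})
                terms.append(H(marginal(mu, (i,))) + H(marginal(mu, S))
                             - H(marginal(mu, tuple(sorted(S + (i,))))))
        Z[j] = sum(terms) / len(terms)
    return Z

def tc(mu, n):
    return sum(H(marginal(mu, (i,))) for i in range(n)) - H(marginal(mu, tuple(range(n))))

def dtc(mu, n):
    full = H(marginal(mu, tuple(range(n))))
    # H(X_i | X_{-i}) = H(X) - H(X_{-i}) by the chain rule.
    cond = sum(full - H(marginal(mu, tuple(j for j in range(n) if j != i)))
               for i in range(n))
    return full - cond

Z = info_curve(MU, N)
```

On this example, `tc(MU, N)` matches `sum(Z[1:])` and `dtc(MU, N)` matches `N * Z[N] - sum(Z[1:])`, as Lemma 1.8 predicts for any $\mu$.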
We show that while it is not in general possible to compete with a sampler that can choose the unmasking schedule based on a priori knowledge of the information curve, there are unmasking schedules that only depend on having access to constant-factor approximations to $\mathrm{TC}(\mu)$ and $\mathrm{DTC}(\mu)$, and which only require a number of iterations scaling in $\min(\mathrm{TC}(\mu), \mathrm{DTC}(\mu))$, up to log factors. In situations where these quantities are sublinear in $n$, this gives us a way to sample asymptotically faster than the naive $n$-step sampler even without full knowledge of the information curve. As a simple example, if $\mu$ is a distribution over a linear subspace of dimension or codimension $O(1)$ (e.g., if $\mu$ corresponds to an unknown parity), then this yields an exponential speedup over naive schedules. We discuss other such examples in Section 1.4 below.
Theorem 1.9 (Iteration complexity depending on TC, DTC). For any $\varepsilon > 0$, there exists an unmasking schedule which depends only on $\varepsilon$ and a parameter $\widehat{\mathrm{TC}}$ (resp. $\widehat{\mathrm{DTC}}$) such that for any distribution $\mu$ for which $\mathrm{TC}(\mu) \le \widehat{\mathrm{TC}}$ (resp. $\mathrm{DTC}(\mu) \le \widehat{\mathrm{DTC}}$), the expected KL error satisfies
$$\mathbb{E}_{S_1,\dots,S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1,\dots,S_k}\big)\big] \le \varepsilon,$$
and furthermore the number of steps satisfies
$$k \le 2 + (1 + \log n) \cdot (1 + \lceil \widehat{\mathrm{TC}}/\varepsilon \rceil) \qquad \big(\text{resp. } k \le 2 + (1 + \log n) \cdot (1 + \lceil \widehat{\mathrm{DTC}}/\varepsilon \rceil)\big).$$
While realizing either (and in particular, the minimum) of these iteration complexities still requires knowing an upper-bound approximation of $\mathrm{TC}(\mu)$ or $\mathrm{DTC}(\mu)$, in practice this is not really an issue: one can simply treat these as hyperparameters and either estimate them with held-out data or guess their values via doubling. While the no-go results of Section 1.2 tell us it is impossible in theory to know when to stop doubling, in practice we can simply generate samples according to the different schedules and inspect at what point the output is sufficiently coherent. We emphasize that the reason this is feasible, compared to the scheme suggested in Theorem 1.4, is that we have reduced from designing a schedule that depends on $n$ different hyperparameters $Z_1, \dots, Z_n$ (more than the number of hyperparameters describing the unmasking schedule itself) to designing one that depends on only 2 hyperparameters, namely $\mathrm{TC}(\mu)$ and $\mathrm{DTC}(\mu)$.
Finally, the reader may wonder whether the $\log(n)$ factor in Theorem 1.9 is a technical artifact or fundamental. In Appendix A we show that it is unavoidable, since there exist information curves which can only be approximated to $L^1$ error $\varepsilon$ by step functions with at least $\Omega(\min(\mathrm{TC}, \mathrm{DTC}) \cdot \log(n)/\varepsilon)$ steps.
1.4 Related work
We contrast our results with some existing bounds from the literature.
The bound of Li and Cai [LC25]. The most closely related prior work is that of Li and Cai [LC25]. They considered the same setting as the present work and showed that under any unmasking schedule $s_1, \dots, s_k$ with $s_{\max} := \max_i s_i$, the expected KL error can be bounded by
$$\frac{2^{\lceil \log_2 s_{\max} \rceil} - 1}{n} \sum_{i=1}^n I(X_i; X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$
This was proven using a delicate inductive argument based on recursively relating the expected KL error for a given unmasking schedule to the expected KL error with a schedule whose steps are twice as fine.
We make two observations about this bound. First, armed with Theorem 1.4, which gives an exact characterization of the expected KL error for any unmasking schedule, we can give a proof of Li and Cai's bound in just four lines; see Appendix B.1. Second, we note that up to $\log n$ factors, the bound in Theorem 1.9 is strictly better. The reason is that
$$\sum_{i=1}^n I(X_i; X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) = nZ_n = \mathrm{TC}(\mu) + \mathrm{DTC}(\mu) \asymp \max(\mathrm{TC}(\mu), \mathrm{DTC}(\mu)).$$
For instance, in the aforementioned simple example where $\mu$ is distributed over a generic linear subspace, $\mathrm{TC}(\mu) + \mathrm{DTC}(\mu) = \Theta(n)$, whereas $\min(\mathrm{TC}(\mu), \mathrm{DTC}(\mu))$ scales with the minimum of the dimension and codimension, which can be much smaller (see Example 1).
DTC and the work of Tim Austin [Aus20, Aus19]. The elegant work of Austin [Aus20] gave a powerful operational characterization of DTC. First, it is easily seen that any distribution $\mu$ which is a mixture of $2^{o(n)}$ product distributions has $\mathrm{DTC}(\mu) = o(n)$. Austin showed an approximate converse: if $\mathrm{DTC}(\mu) = o(n)$, then $\mu$ is well-approximated by a mixture of $2^{o(n)}$ product distributions. In fact, his proof is algorithmic and has an interesting interpretation from the perspective of the present work: it shows that if one first samples $O(\sqrt{\mathrm{DTC}(\mu) \cdot n/\varepsilon})$ indices in sequence and then samples the remaining indices in $O(\sqrt{\mathrm{DTC}(\mu) \cdot n/\varepsilon})$ iterations, one can achieve expected KL error $\varepsilon$:

Theorem 1.10 (Austin's iteration complexity bound [Aus20]). For any $\varepsilon > 0$, there exists an unmasking schedule which depends only on $\varepsilon$ and a parameter $\widehat{\mathrm{DTC}}$ such that for any distribution $\mu$ for which $\mathrm{DTC}(\mu) \lesssim \widehat{\mathrm{DTC}}$, the expected KL error satisfies
$$\mathbb{E}_{S_1,\dots,S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1,\dots,S_k}\big)\big] \le \varepsilon,$$
and furthermore the number of steps satisfies $k \lesssim \big\lceil \sqrt{\mathrm{DTC} \cdot n/\varepsilon} \big\rceil$.
We provide a proof for completeness in Appendix B.2. Although this bound is already sublinear in $n$ when $\mathrm{DTC}(\mu) = o(n)$, it is bottlenecked at $\sqrt{n}$. Indeed, note that Austin's bound is the geometric mean of our stronger bound in Theorem 1.9 and the trivial iteration complexity bound of $n$, up to a logarithmic factor.

The connection between parallel sampling and decomposition of measure has been quite fruitful within probability theory. For instance, Austin [Aus19] showed the remarkably general result that for any Gibbs measure $\propto e^{-\beta H}$ of "low complexity," in the sense that the discrete derivatives of the Hamiltonian lie within a set of bounded metric entropy, the DTC is $o(n)$. The decomposition of such distributions into mixtures of product measures arises naturally in the theory of nonlinear large deviations; see, e.g., [CD16, EG18a, Eld18, EG18b]. The theory of parallel sampling can thus also be fruitfully interpreted as providing richer and more accurate hierarchical measure decompositions, where the levels of the hierarchy correspond to iterations of the sampler.
Pinning lemma and stochastic localization. Finally, we note a closely related notion from statistical physics and theoretical computer science, namely the pinning lemma [RT12, And08, EAM22]. The premise behind this result is that if one conditions on a random subset of coordinates of size $s$ according to their true $s$-wise marginal in $\mu$, then the remaining coordinates have pairwise correlation which is bounded by $O(1/s)$. In some sense this is the fundamental premise behind MDMs: pinning random tokens reduces the correlation among the remaining tokens, which intuitively enables more aggressive parallel sampling of later tokens. This has been used to great effect in the context of SDP rounding algorithms for solving dense CSPs [RT12, MR17, YZ14, JKR19].
That being said, the pinning lemma holds for all distributions, whereas our impossibility results show that without additional prior information about the distribution, one cannot simultaneously achieve $o(n)$ complexity for all $\mu$. It is worth contrasting this state of affairs with the work of [AGR24]. By leveraging the pinning lemma, they showed that in a much stronger parallel model, where in each round one can simultaneously make multiple conditional marginal queries, each corresponding to a possibly different partial assignment, it is possible to sample in $\widetilde{O}(n^{2/3})$ parallel rounds for general distributions.
Other theoretical works on discrete diffusion. We briefly mention some other works in the discrete diffusion literature that derive theoretical bounds. In [CY24], the authors study discretization bounds for a different paradigm of discrete diffusions, where the corruption process being reversed is a bit-flip channel rather than an erasure channel. Here, it is nontrivial even to derive bounds which scale linearly in $n$. In [RCZ+25], the authors consider finding better discretizations of the continuous-time Markov chain associated to the discrete diffusion model; under some smoothness assumptions on the underlying distribution, which are primarily relevant to the bit-flip setting, their higher-order solvers achieve nontrivial sampling guarantees relative to naive (Euler) discretization.
We also remark that the conditional marginal oracle we consider is very similar in spirit to the conditional query model in distribution testing, pioneered by [CFGM13, CRS15]. Our model is different in two ways: (1) we restrict to subcube conditionings in the sense of [CCK+21], and (2) a single oracle query gives an entire vector of 1-wise conditional marginals, rather than just a single sample from the posterior. The literature here is extensive and orthogonal to our work; we refer the interested reader to [Can20, Chapter 11]. Lastly, we note that masked diffusion models, and autoregressive models, can be thought of as modern instantiations of the classical Jerrum–Valiant–Vazirani counting-to-sampling reduction [JVV86].
Continuous diffusion. In recent years there has been significant progress on understanding discretization bounds for diffusion models over continuous spaces, e.g., the works of [CCL+23, LLT23, CLL23, BBDD24, CDS25, LWCC24]. The techniques in this area are largely distinct from the ones in this work, with the exception of the recent work of [RP25], which derived an expression analogous to our Theorem 1.4 for the discretization error incurred by continuous diffusions. In that context, the analogue of our information curve is the MMSE curve $\mathbb{E}\big[\|X - \mathbb{E}[X \mid \alpha_t X + \beta_t \gamma]\|^2\big]$ for Gaussian $\gamma$, and they show (see Lemma 2 therein) that the discretization error is exactly given by the left Riemann integration error to this curve. Interestingly, whereas in the continuous diffusion context this integration error exactly characterizes the KL error in path space, which is only an upper bound on the KL error at the endpoint of the sampler, in the masked diffusion setting the integration error exactly characterizes the sampler's KL error. In light of the connection to [RP25], it would be interesting to extend our impossibility results and TC/DTC-based bounds to the continuous setting.
Concurrent work. Independent concurrent work of [LZ25] also identified the connection to Riemann approximation of the information curve (our "Result 1"). Unlike our work, they did not explore the query complexity of learning an optimal schedule (our "Result 2") and did not devise explicit schedules that scale better than the bound in [LC25] (our "Result 3"). Instead, they additionally provided worst-case bounds for sampling error under arbitrary, non-random orderings, and studied a natural $n \to \infty$ scaling limit of the step function approximation problem.
2 Technical preliminaries
In this section, we will first provide a brief overview of our notation and oracle model. We will then discuss some important information-theoretic quantities and results.
2.1 Notation
Throughout the remainder of this paper, we will use the following notation.
Vocabulary, data distribution, and product distributions. We will let $\Sigma$ be a vocabulary and $\mu$ be the data distribution over $\Sigma^n$. We use $\Delta(\Sigma)$ to denote the probability simplex over $\Sigma$. Let $X = (X_1, \dots, X_n) \sim \mu$. For any set $S \subseteq [n]$, let $X_S = \{X_i\}_{i \in S}$. Define $\mu(\cdot \mid S)$ to be the conditional distribution of $\mu$ given $S$, and $\mu^{\otimes}(\cdot \mid S)$ to be the product distribution which has the same marginals as $\mu(\cdot \mid S)$. Lastly, define $f_\mu(\cdot \mid S)$ and $f^{\otimes}_\mu(\cdot \mid S)$ to be the corresponding probability mass functions.
2.2 Oracle model
Throughout this work, our main oracle object will be the conditional marginal oracle, which outputs the marginals of $\mu$ conditioned on any subset. To define it, first recall that $(X_1, \dots, X_n) \sim \mu$ is the data distribution over a vocabulary $\Sigma$. Let $\mathcal{D}$ be the collection of all multivariate distributions over $\Sigma^n$. Moreover, let $p_{i|S}(x_S) = \{p(X_i = j \mid X_S = x_S),\, j \in \Sigma\}$ be the marginal probability vector on coordinate $i$. We then have the following.

Definition 2.1 (Conditional marginal oracle). The conditional marginal oracle $\mathrm{CO}$ takes as input a partial assignment $X_S = x_S$ and outputs the conditional marginal distributions of $\mu$ given $X_S = x_S$. Formally,
$$\mathrm{CO}(X_i \mid X_S = x_S) = \{\mu(X_i \mid X_S = x_S)\}_{i \notin S}.$$
If the pinning $X_S = x_S$ is impossible in $\mathrm{supp}(\mu)$, the oracle outputs an arbitrary element of $\Delta(\Sigma)^{n - |S|}$.

In our upper bounds, we will only ever use the oracle to obtain conditional marginals to sample from in parallel, as this is the standard way in practice to use this oracle. Our lower bounds, however, apply to the most general setting of arbitrary randomized algorithms with adaptive query access to $\mathrm{CO}$ (see Definition 4.6).
Note that $\mathrm{CO}$ is an exact oracle, whereas in practice an approximate oracle $\widehat{\mathrm{CO}}$ is learned from the training data. However, in Appendix C, we show, following [LC25], that the error of our sampling algorithms can be decoupled into learning and sampling error. Since this work focuses on the sampling procedure and its error, we will assume that the learned oracle is perfect.
2.3 Information-theoretic quantities
In this section, we will recall some information-theoretic quantities and prove a few preliminary lemmas which will be useful in the subsequent proofs of our main results. First, recall that $H(X)$ refers to the entropy of a random variable, and $H(Y \mid X)$ refers to the conditional entropy of a pair of random variables. Moreover, recall from Definition 1.7 that
$$\mathrm{TC} = \mathrm{TC}(\mu) := \Big(\sum_{i=1}^n H(X_i)\Big) - H(X_1, \dots, X_n)$$
and
$$\mathrm{DTC} = \mathrm{DTC}(\mu) := H(X_1, \dots, X_n) - \sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$
To provide some intuition for these quantities, we first give some examples of the TC and DTC of linear subspace and product mixture distributions.

Example 1 (Linear subspaces). Suppose $\Sigma = \mathbb{F}_q$ and let $V \subseteq \Sigma^n$ denote a linear subspace of dimension $d$. Let $\mu_V$ denote the uniform distribution over $V$. Then there is a matrix $M$ such that $MU \sim \mu_V$ for a uniform $U \in \mathbb{F}_q^d$. Then
$$\mathrm{TC}(\mu) = (n - k) \log q - d \log q = (n - d - k) \log q,$$
where $k = \#\{i : M_i = 0\}$ is the number of rows of $M$ which are identically 0. Moreover,
$$\mathrm{DTC}(\mu) = d \log q - \ell \log q = (d - \ell) \log q,$$
where $\ell = \#\{i : M_i \notin \mathrm{span}(\{M_j\}_{j \ne i})\}$ is the number of rows of $M$ which are not in the span of the remaining rows. In general, we will often have $k = \ell = 0$, in which case $\mathrm{TC}(\mu)$ and $\mathrm{DTC}(\mu)$ are the codimension and dimension of $V$, respectively (in units of $\log q$).
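For a concrete instance, the following sketch verifies these formulas by brute force for a parity, i.e. the uniform distribution over $\{x \in \mathbb{F}_2^4 : x_4 = x_1 + x_2 + x_3\}$, which has $d = 3$ and $k = \ell = 0$; the instance and helper names are our own.

```python
import itertools
import math
from collections import defaultdict

# Uniform distribution over the parity subspace {x in F_2^4 : x4 = x1 + x2 + x3},
# i.e. n = 4, d = 3, q = 2, with no zero or redundant generator rows (k = l = 0).
N, D = 4, 3
MU = {}
for u in itertools.product((0, 1), repeat=D):
    x = u + (sum(u) % 2,)
    MU[x] = 1.0 / 2 ** D

def marginal(mu, coords):
    out = defaultdict(float)
    for x, p in mu.items():
        out[tuple(x[i] for i in coords)] += p
    return out

def H(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

full = H(marginal(MU, tuple(range(N))))
tc = sum(H(marginal(MU, (i,))) for i in range(N)) - full
# H(X_i | X_{-i}) = H(X) - H(X_{-i}) by the chain rule.
dtc = full - sum(full - H(marginal(MU, tuple(j for j in range(N) if j != i)))
                 for i in range(N))
```

Here `tc` comes out to $(n - d)\log 2 = \log 2$ (the codimension) and `dtc` to $d \log 2 = 3\log 2$ (the dimension), as predicted.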
Example 2 (Mixtures of products). The DTC of product mixtures has been studied in depth in [Aus20]. In particular, by Proposition 8.1 of [Aus20], the DTC of a mixture of $m$ product distributions is at most $\log m$. Thus, any mixture $\mu$ of $2^{o(n)}$ products satisfies $\mathrm{DTC}(\mu) = o(n)$. In the converse direction, by Theorem A of [Aus20], any $\mu$ with $\mathrm{DTC}(\mu) = o(n)$ is close in transport distance to a relatively "simple" mixture of products.
Next, we define a sequence of values based on the average entropy of fixed-cardinality subsets. They will be useful in analyzing the information curve.

Definition 2.2 (Average entropy curve). The average entropy curve of the distribution $(X_1, \dots, X_n) \sim \mu$ is given by³
$$H_i(\mu) = \frac{1}{\binom{n}{i}} \sum_{S \subseteq [n],\, |S| = i} H(\{X_j\}_{j \in S}) = \mathbb{E}_{|S| = i}\, H(\{X_j\}_{j \in S}).$$
When $\mu$ is clear from context, we denote this by $H_i$.
We can then express the information curve in terms of the average entropy curve as follows.

Lemma 2.3. We have $Z_i = H_1 + H_{i-1} - H_i$.

Proof. Recall that $I(X_i; X_{S \setminus \{i\}}) = H(X_i) + H(X_{S \setminus \{i\}}) - H(X_S)$; taking expectations, we find that
$$Z_i = \mathbb{E}_{|S| = i-1,\, j \notin S}[I(X_j; X_S)] = \mathbb{E}_{|T| = i,\, j \in T}\big[H(X_j) + H(X_{T \setminus \{j\}}) - H(X_T)\big] = H_1 + H_{i-1} - H_i,$$
as desired.
We now provide several useful statements involving the information curve, namely that it provides a clean way to express the TC and DTC of a distribution.

³ For $i = 0$, the entropy of the empty set is defined to be 0.
Lemma 2.4. We have the expressions

1. $\mathrm{TC} = \sum_{i=1}^n Z_i$, and
2. $\mathrm{DTC} = nZ_n - \sum_{i=1}^n Z_i = nZ_n - \mathrm{TC}$.

Proof. First, observe that
$$\sum_{i=1}^n Z_i = nH_1 - H_n = \sum_{i=1}^n H(X_i) - H(X_1, \dots, X_n) = \mathrm{TC},$$
which proves item 1. Next, using the chain rule of conditional entropy, we observe that
$$\mathrm{DTC} + \mathrm{TC} = \sum_{i=1}^n \big(H(X_i) - H(X_i \mid X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)\big) = \sum_{i=1}^n \big(H(X_i) + H(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n) - H(X_1, \dots, X_n)\big) = n(H_1 + H_{n-1} - H_n) = nZ_n.$$
Combining this equation with item 1, it follows that
$$\mathrm{DTC} = nZ_n - \mathrm{TC} = nZ_n - \sum_{i=1}^n Z_i,$$
which proves item 2.
3 Sampling error in terms of unmasking schedule: proof of Theorem 1.4

In this section we establish an upper bound on the expected KL error of the fixed and random unmasking algorithms, in the process proving Theorem 1.4. We first formalize the definition of the fixed unmasking algorithm.
Definition 3.1 (Fixed unmasking algorithm). The fixed unmasking algorithm with subset schedule $(S_1, \dots, S_k)$, denoted $\mathcal{A}_{\mathrm{fixed}}(k, \{S_i\}_{i=1}^k)$, proceeds as follows. First define $N_i = \sum_{j=1}^i s_j$, where $N_0 = 0$ and $s_j = |S_j|$. Then at each stage $i \in [k]$, beginning at $i = 1$, independently and in parallel sample
$$x_j \sim \mu\Big(X_j \,\Big|\, X_{\bigsqcup_{t=1}^{i-1} S_t} = x_{\bigsqcup_{t=1}^{i-1} S_t}\Big)$$
for all $j \in S_i$. The algorithm then outputs the sample $(x_1, \dots, x_n) \sim \nu_{S_1, \dots, S_k}$.

We next define the random unmasking algorithm in terms of the fixed unmasking algorithm. This is the formal version of the algorithm outlined in Section 1.
Definition 3.2 (Random unmasking algorithm). The random unmasking algorithm with unmasking schedule $(k, \{s_i\}_{i=1}^k)$, denoted $\mathcal{A}(k, \{s_i\}_{i=1}^k)$, proceeds as follows. First, sample a uniformly random partition of coordinates $S = \bigsqcup_{i=1}^k S_i$ with $|S_i| = s_i$. Then run $\mathcal{A}_{\mathrm{fixed}}(k, \{S_i\}_{i=1}^k)$ to obtain a sample $(x_1, \dots, x_n) \sim \nu_{S_1, \dots, S_k}$. The algorithm outputs this sample; we denote its overall law by $\nu$.
Note that the fixed unmasking algorithm $\mathcal{A}_{\mathrm{fixed}}$ unmasks $s_i$ tokens at fixed positions at each stage $i$ and samples those tokens independently and in parallel, while the random unmasking algorithm selects the token positions at each stage uniformly at random among all masked tokens. We can now formally state and prove (a slightly stronger form of) Theorem 1.4.
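For concreteness, here is a minimal sketch of the two samplers when the joint distribution is available as an explicit table standing in for the conditional marginal oracle (the function names `cond_marginal` and `random_unmask` are ours, not the paper's):

```python
import numpy as np

def cond_marginal(mu, j, revealed):
    # conditional marginal mu(X_j | X_revealed) from an explicit joint table;
    # this plays the role of one call to the conditional marginal oracle
    idx = tuple(revealed.get(a, slice(None)) for a in range(mu.ndim))
    sub = mu[idx]
    rem = [a for a in range(mu.ndim) if a not in revealed]
    keep = rem.index(j)
    p = sub.sum(axis=tuple(t for t in range(sub.ndim) if t != keep))
    return p / p.sum()

def random_unmask(mu, sizes, rng):
    # A(k, {s_i}): draw a uniformly random partition into blocks of sizes s_i,
    # then run A_fixed on it, filling each block in parallel
    n = mu.ndim
    perm, x, start = rng.permutation(n), [None] * n, 0
    for s in sizes:
        block = perm[start:start + s]; start += s
        revealed = {a: x[a] for a in range(n) if x[a] is not None}
        for j in block:  # "in parallel": every conditional uses the same revealed set
            p = cond_marginal(mu, int(j), revealed)
            x[int(j)] = int(rng.choice(len(p), p=p))
    return tuple(x)

p, q = np.array([0.3, 0.7]), np.array([0.9, 0.1])
prod = np.einsum('i,j,k->ijk', p, q, p)  # a product distribution over 3 bits
assert np.allclose(cond_marginal(prod, 1, {0: 1, 2: 0}), q)
sample = random_unmask(prod, [2, 1], np.random.default_rng(0))
assert len(sample) == 3 and all(v in (0, 1) for v in sample)
```

For a product distribution, any schedule is exact, which the first assertion reflects: conditioning does not change the marginal.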
Theorem 3.3. Let $\mu$ denote the underlying distribution of the data. Let $(k, \{s_i\}_{i=1}^k)$ be an unmasking schedule and $\{S_i\}_{i=1}^k$ with $|S_i| = s_i$ a fixed subset schedule. Let $N_i = \sum_{j=1}^i s_j$ denote the partial sums of the $s_i$ sequence, where $N_0 = 0$. Suppose the fixed unmasking algorithm $\mathcal{A}_{\mathrm{fixed}}(\{S_i\}_{i=1}^k)$ samples from a distribution $\nu_{S_1, \dots, S_k}$ and the random unmasking algorithm $\mathcal{A}(k, \{s_i\}_{i=1}^k)$ samples from a distribution $\nu$. Then both algorithms have query complexity $k$, and $\nu$ achieves KL error relative to $\mu$ of
$$\mathrm{KL}(\mu \| \nu) \le \mathbb{E}_{S_1, \dots, S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1, \dots, S_k}\big)\big] = \sum_{i=1}^k \sum_{j=1}^{s_i} \big(Z_{N_{i-1}+j} - Z_{N_{i-1}+1}\big),$$
where the expectation is taken over all partitions $S = \bigsqcup_{i=1}^k S_i$ for which $|S_i| = s_i$.
Proof. Let $T_i = \bigcup_{j<i} S_j$, with $T_1 = \emptyset$, be the set of coordinates that have already been sampled at the start of stage $i$. We first work with the distribution $\nu_{S_1, \dots, S_k}$. Writing $f_\mu$ for the density of $\mu$ and $f^\otimes_\mu$ for the density of $\nu_{S_1, \dots, S_k}$, observe that
\begin{align*}
\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1, \dots, S_k}\big) &= \mathbb{E}_{x\sim\mu}\bigg[\log \frac{f_\mu(x)}{f^\otimes_\mu(x)}\bigg] = \mathbb{E}_{x\sim\mu}\bigg[\sum_{i=1}^k \log \frac{f_\mu(X_{S_i} \mid X_{T_i} = x_{T_i})}{f^\otimes_\mu(X_{S_i} \mid X_{T_i} = x_{T_i})}\bigg] \\
&= \sum_{i=1}^k \mathbb{E}_{x\sim\mu}\Big[\mathrm{KL}\big(\mu(X_{S_i} \mid X_{T_i} = x_{T_i}) \,\big\|\, \mu^\otimes(X_{S_i} \mid X_{T_i} = x_{T_i})\big)\Big].
\end{align*}
To simplify this, observe that the inner KL term is precisely the total correlation of the conditional distribution of $X_{S_i}$ given $X_{T_i} = x_{T_i}$. Therefore, it follows that
\begin{align*}
\mathbb{E}_{x\sim\mu}\Big[\mathrm{KL}\big(\mu(X_{S_i} \mid X_{T_i} = x_{T_i}) \,\big\|\, \mu^\otimes(X_{S_i} \mid X_{T_i} = x_{T_i})\big)\Big] &= \mathbb{E}_{x\sim\mu}\bigg[\Big(\sum_{j\in S_i} H(X_j \mid X_{T_i} = x_{T_i})\Big) - H(X_{S_i} \mid X_{T_i} = x_{T_i})\bigg] \\
&= \Big(\sum_{j\in S_i} H(X_j \mid X_{T_i})\Big) - H(X_{S_i} \mid X_{T_i}) \\
&= \sum_{j\in S_i} \big(H(X_{T_i\cup\{j\}}) - H(X_{T_i})\big) - \big(H(X_{S_i\sqcup T_i}) - H(X_{T_i})\big),
\end{align*}
where in the second equality $H(X_j \mid X_{T_i} = x_{T_i})$ denotes the entropy of the conditional distribution of $X_j$ given $X_{T_i} = x_{T_i}$, while $H(X_j \mid X_{T_i})$ denotes the conditional entropy of $X_j$ given $X_{T_i}$.
Combining this with the previous equation, we find that
$$\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1, \dots, S_k}\big) = \sum_{i=1}^k \bigg[\sum_{j\in S_i} \big(H(X_{T_i\cup\{j\}}) - H(X_{T_i})\big) - \big(H(X_{S_i\sqcup T_i}) - H(X_{T_i})\big)\bigg].$$
Recall now that $\nu$ is given by the mixture
$$\nu = \binom{n}{s_1\,\cdots\,s_k}^{-1} \sum_{\{S_i\}_{i=1}^k,\, |S_i| = s_i} \nu_{S_1, \dots, S_k}.$$
We therefore find that
\begin{align*}
\mathrm{KL}(\mu \| \nu) &= \mathrm{KL}\bigg(\mu \,\bigg\|\, \binom{n}{s_1\,\cdots\,s_k}^{-1} \sum_{\{S_i\}_{i=1}^k,\, |S_i| = s_i} \nu_{S_1, \dots, S_k}\bigg) \\
&\le \binom{n}{s_1\,\cdots\,s_k}^{-1} \sum_{\{S_i\}_{i=1}^k,\, |S_i| = s_i} \mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1, \dots, S_k}\big) \\
&= \mathbb{E}_{\{S_i\}_{i=1}^k,\, |S_i| = s_i}\Big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1, \dots, S_k}\big)\Big] \\
&= \mathbb{E}_{\{S_i\}_{i=1}^k,\, |S_i| = s_i}\bigg[\sum_{i=1}^k \Big(\sum_{j\in S_i} \big(H(X_{T_i\cup\{j\}}) - H(X_{T_i})\big) - \big(H(X_{S_i\sqcup T_i}) - H(X_{T_i})\big)\Big)\bigg] \\
&= \sum_{i=1}^k \Big[s_i\big(\mathbb{E}_{|S|=N_{i-1}+1}[H(X_S)] - \mathbb{E}_{|S|=N_{i-1}}[H(X_S)]\big) - \big(\mathbb{E}_{|S|=N_i}[H(X_S)] - \mathbb{E}_{|S|=N_{i-1}}[H(X_S)]\big)\Big] \\
&= \sum_{i=1}^k \Big[s_i\big(H_{N_{i-1}+1} - H_{N_{i-1}}\big) - \big(H_{N_i} - H_{N_{i-1}}\big)\Big] \\
&= \sum_{i=1}^k \bigg[s_i H_1 - s_i Z_{N_{i-1}+1} - \sum_{j=1}^{s_i}\big(H_{N_{i-1}+j} - H_{N_{i-1}+j-1}\big)\bigg] \\
&= \sum_{i=1}^k \bigg[\Big(\sum_{j=1}^{s_i} Z_{N_{i-1}+j}\Big) - s_i Z_{N_{i-1}+1}\bigg],
\end{align*}
where the first line is an equality, the second follows by convexity of KL, the third and fourth lines are direct simplification, the fifth line follows from the fact that $T_i\cup\{j\}$, $T_i$, $S_i$, and $S_i\sqcup T_i$ are individually uniformly random subsets of $[n]$ of sizes $N_{i-1}+1$, $N_{i-1}$, $s_i$, and $N_i$, respectively, and the final three lines follow by directly applying the definitions of $H_i$ and $Z$. The theorem follows from the first, third, and final lines.
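The equality in Theorem 3.3 can be verified exactly on a tiny example by enumerating the output law of $\mathcal{A}_{\mathrm{fixed}}$ (a self-contained sketch; all helper names are ours). For $n = 3$ and schedule $(s_1, s_2) = (2, 1)$, we have $N = (0, 2, 3)$ and the theorem predicts an expected KL of $Z_2 - Z_1$:

```python
import itertools
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(2)
n = 3
mu = rng.random((2,) * n); mu /= mu.sum()

def p_of(x, S):
    # probability under mu of the partial assignment x restricted to S
    if not S:
        return 1.0
    m = mu.sum(axis=tuple(a for a in range(n) if a not in S))
    return float(m[tuple(x[a] for a in sorted(S))])

def nu_fixed(blocks):
    # exact output law of A_fixed with subset schedule blocks = (S_1, ..., S_k)
    nu = np.zeros_like(mu)
    for x in itertools.product(range(2), repeat=n):
        prob, T = 1.0, set()
        for S in blocks:
            for j in S:  # stage tokens are drawn from mu(X_j | X_T) in parallel
                prob *= p_of(x, T | {j}) / p_of(x, T)
            T |= set(S)
        nu[x] = prob
    return nu

def kl(a, b):
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

kls = [kl(mu, nu_fixed((S1, tuple(a for a in range(n) if a not in S1))))
       for S1 in itertools.combinations(range(n), 2)]

def avg_H(i):  # average entropy curve H_i
    if i == 0:
        return 0.0
    return float(np.mean([entropy(mu.sum(axis=tuple(a for a in range(n)
                                  if a not in S)).ravel())
                          for S in itertools.combinations(range(n), i)]))

Z = lambda i: avg_H(1) + avg_H(i - 1) - avg_H(i)
assert np.isclose(np.mean(kls), Z(2) - Z(1))  # Theorem 3.3, exactly
```

Here each $\mathrm{KL}(\mu \| \nu_{S_1, S_2})$ equals $I(X_a; X_b)$ for $S_1 = \{a, b\}$, and averaging over the three partitions gives $2H_1 - H_2 = Z_2$ (note $Z_1 = 0$ always).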
We make two brief comments about Theorem 3.3.
Theorem 3.3 and Theorem 1.4.First, we note that Theorem 1.4 is an immediate corollary.
Proof of Theorem 1.4. Let $N_a = 1 + \sum_{t=1}^{a-1} s_t$ for all $a \in [k]$, and $N_0 = 1$. By Theorem 3.3 and the definition of $Z_N$, we have that
$$\mathbb{E}_{S_1, \dots, S_k}\Big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1, \dots, S_k}\big)\Big] = \sum_{i=1}^k \sum_{j=1}^{s_i} \big(Z_{N_{i-1}+j} - Z_{N_{i-1}+1}\big) = \|Z - Z_N\|_{L^1},$$
yielding the formula for the KL error. The remainder of the theorem statement is immediate.
Comparison between the fixed and random unmasking algorithms. There are two ways to approach sampling: first, fixing the schedule $S_i$ ahead of time, and second, resampling $S_i$ subject to $|S_i| = s_i$ for each sample. These correspond to the fixed and random unmasking algorithms, respectively. The inequality in Theorem 3.3 shows that the distribution output by the random unmasking algorithm is, on average, superior with respect to KL error from $\mu$ to the distribution output by the fixed unmasking algorithm. This is an additional guarantee not given in Theorem 1.4, and suggests that the random unmasking algorithm is preferable, albeit at the cost of an additional step in the sampling process.
4 Lower bounds on competing with the oracle rate

Definition 4.1 (MDS codes). A $k$-dimensional linear subspace $\mathcal{V}$ of $\mathbb{F}_q^n$ is a maximum distance separable (MDS) code if for any $k \times n$ matrix $M$ whose rows constitute a basis for $\mathcal{V}$, every $k$ columns of $M$ are linearly independent. We denote by $\mathrm{Unif}(\mathcal{V})$ the uniform distribution over points in $\mathcal{V}$.
In this work, we will consider affine shifts of MDS codes. That is, we will consider distributions over affine subspaces given by taking some MDS code and translating it by a fixed vector in $\mathbb{F}_q^n$. We will abuse terminology and refer to such affine subspaces as MDS codes.
We will also consider "random" MDS codes:

Definition 4.2 (Balanced random MDS codes). A distribution $\mathcal{D}$ over $k$-dimensional MDS codes is balanced if for every subset $S \subseteq [n]$ for which $|S| \ge k$ and every partial assignment $x \in \mathbb{F}_q^{|S|}$,
$$\Pr_{\mathcal{V}\sim\mathcal{D}}\big[\exists\, x^* \in \mathcal{V} : x^*_S = x\big] = (1/q)^{|S|-k}.$$
Reed–Solomon codes provide an example of MDS codes. Below, we recall their definition.

Definition 4.3 (Reed–Solomon codes). Let $q$ be any prime power exceeding $n$, and let $k$ be any value between $1$ and $n-1$. A $k$-dimensional Reed–Solomon (RS) code in $\mathbb{F}_q^n$ is a linear subspace specified by a collection of distinct evaluation points $a_1, \dots, a_n \in \mathbb{F}_q$: it is given by the set of all evaluations $(p(a_1), \dots, p(a_n))$ where $p$ ranges over polynomials over $\mathbb{F}_q$ of degree less than $k$.

As in Definition 4.1, we will abuse terminology and also refer to affine shifts of RS codes as RS codes.
We will leverage the following basic property of MDS codes:

Proposition 4.4. Let $\mu = \mathrm{Unif}(\mathcal{V})$ for any $k$-dimensional MDS code $\mathcal{V} \subseteq \mathbb{F}_q^n$. Then for any $S \subseteq [n]$ satisfying $|S| < k$ and any partial assignment $x \in \mathbb{F}_q^{|S|}$, we have $\mu(X_i \mid X_S = x) = \mathrm{Unif}(\mathbb{F}_q)$ for all $i \notin S$. In particular, this implies that $Z_j(\mu) = \log_2(q)\cdot\mathbb{I}[j > k]$.

In addition, recall from the definition of the oracle that if $|S| > k$ and the partial assignment $X_S = x$ is incompatible with every element of $\mathcal{V}$, then the output of the conditional marginal oracle can be arbitrary. Throughout this section, we will take the oracle's output in this case to be $\mathrm{Unif}(\mathbb{F}_q)^{\otimes(n-|S|)}$.
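The uniformity property in Proposition 4.4 is easy to verify by brute force for a small Reed–Solomon code (a sketch with our own choice of parameters):

```python
import itertools

q, n, k = 7, 6, 2   # prime q > n, so 0, ..., n-1 serve as distinct evaluation points
points = list(range(n))
# codewords: evaluations of every polynomial over F_q of degree < k
codewords = [tuple(sum(c[t] * pow(a, t, q) for t in range(k)) % q for a in points)
             for c in itertools.product(range(q), repeat=k)]
assert len(codewords) == q ** k

# Proposition 4.4 with |S| = 1 < k: conditioned on any value of any coordinate,
# every other coordinate of a uniform codeword is exactly uniform over F_q
for i, j in itertools.permutations(range(n), 2):
    for v in range(q):
        vals = sorted(w[i] for w in codewords if w[j] == v)
        assert vals == list(range(q))
```

Fixing $x_j = v$ leaves a one-parameter family of polynomials, and since the evaluation points are distinct, $x_i = v + c_1(a_i - a_j)$ sweeps out all of $\mathbb{F}_q$.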
We will also use the following property of random Reed–Solomon codes. Given a prime power $q \ge n$ and dimension $0 < k < n$, let $\mathcal{D}_{n,k,q}$ denote the following distribution over $k$-dimensional RS codes over $\mathbb{F}_q$. When $n, q$ are clear from context, we denote this by $\mathcal{D}_k$.

Lemma 4.5. $\mathcal{D}_k$ is balanced in the sense of Definition 4.2.
Next, we formalize the model of computation under which we prove a lower bound.

Definition 4.6 (Sampling algorithm). Let $\mathcal{F} \subseteq \Delta(\Sigma^n)$ denote some known family of distributions $\mu$. Given access to the conditional marginal oracle for some $\mu \in \mathcal{F}$, an $\mathcal{F}$-aware sampling algorithm $\mathcal{A}$ is a procedure of the following form:

1. Repeat the following:
•Based only on the query outcomes from previous rounds and $\mathcal{F}$ (and not on knowledge of $\mu$), and possibly using additional randomness, either query the oracle on a partial assignment $X_S = x$, or exit the loop.
•If the former, observe the conditional marginals $\{\mu(X_i \mid X_S = x)\}_{i \notin S}$.
2. Output a string in $\Sigma^n$.

Importantly, the decision to exit the loop can be made adaptively. We say that $\mathcal{A}$ is $T$-query if with probability $1$ it performs at most $T$ queries to the oracle before terminating. We denote by $\mathcal{A}[\mu]$ the distribution over outputs of $\mathcal{A}$.
Any such sampling algorithm can be naturally represented by a stochastic decision tree as follows:

Definition 4.7 (Stochastic decision tree representation). Any sampling algorithm $\mathcal{A}$ can be regarded as an (infinite-degree) stochastic decision tree as follows. Every node is either a decision node (including the root), a query node, or a leaf node. Decision and leaf nodes (resp. query nodes) are at even (resp. odd) distance from the root:

•For every decision node $v$, the outgoing edges $(v, w)$ connect $v$ to query nodes $w$. Each such edge is labeled with a partial assignment $X_{S(w)} = x(w)$ with which to query the oracle. From $v$, the sampler transitions to $w$ with some probability $P_{\mathcal{A}}[w \mid v]$.
•For every query node $w$, there is a continuum of infinitely many outgoing edges $(w, v')$, each labeled by an element of $\Delta(\Sigma)^{n-|S(w)|}$ corresponding to a possible response by the oracle to the query $X_{S(w)} = x(w)$. From $w$, the sampler walks along the edge corresponding to the oracle's response to $X_{S(w)} = x(w)$.
•Each leaf node $\ell$ is labeled with a distribution $\nu_\ell$ over $\Sigma^n$, corresponding to the algorithm's (randomized) output if it has reached that state and decided to exit the loop. Let $\mathrm{leaf}(\mu)$ (resp. $\mathrm{leaf}_{\le T}(\mu)$) denote all leaf nodes of the stochastic decision tree corresponding to $\mathcal{A}$ that are reachable given oracle access to $\mu$ (resp. and are additionally at distance at most $2T$ from the root).

Every path from the root to a decision or leaf node $v$ is given by a path whose edges are alternately labeled by partial assignments $X_S = x$ and corresponding oracle responses. For any internal or leaf node $v$ of the tree, let $P_{\mathcal{A}}[v \mid \mu]$ denote the probability that the algorithm traverses that node at some point in its execution, conditioned on the oracle responses coming from the conditional marginal oracle for $\mu$.
Definition 4.8 (Query budget and cost function). Let $\mathcal{T}: \Delta(\Sigma^n) \to \mathbb{N}$ denote a query budget for the query complexity of such a sampler, and define
$$\mathrm{cost}^{\mathrm{KL}}_{\mathcal{T}}(\mathcal{A};\mu) := \begin{cases} \mathrm{KL}(\mu \| \mathcal{A}[\mu]) & \text{if $\mathcal{A}$ is at most $\mathcal{T}(\mu)$-query} \\ \infty & \text{otherwise.} \end{cases}$$
Define $\mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu)$ in the same way, with KL replaced by TV.
4.1 Warmup example

We begin by exhibiting a simple ensemble of distributions for which no single algorithm can successfully sample from $\mu$ to error $\varepsilon$ using $O(\min(\mathrm{TC}(\mu), \mathrm{DTC}(\mu))\log(n)/\varepsilon)$ queries for every $\mu$ in the ensemble. Let $\mathcal{U}$ denote the uniform distribution over $\mathbb{F}_q^n$, and let $\mathcal{F}$ consist of $\mathcal{U}$ as well as $\mathcal{U}_{\mathcal{V}} := \mathrm{Unif}(\mathcal{V})$ for all Reed–Solomon codes $\mathcal{V} \subseteq \mathbb{F}_q^n$ of dimension $0 < k < n$.
Formally, we show:

Theorem 4.9. No $\mathcal{F}$-aware sampling algorithm $\mathcal{A}$ can achieve $\sup_{\mu\in\mathcal{F}} \mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu) \le 1/16$ for any budget $\mathcal{T}$ satisfying $\mathcal{T}(\mu) \lesssim \max(1, \min(\mathrm{TC}(\mu), \mathrm{DTC}(\mu)))\log(n)$ for all $\mu \in \mathcal{F}$.

We will use the following terminology: in the stochastic decision tree associated to $\mathcal{A}$, a leaf $\ell$ is said to avoid a subspace $\mathcal{V}$ if, for every partial assignment labeling an edge on the root-to-leaf path to $\ell$, either the assignment has size less than $\dim \mathcal{V}$, or otherwise there does not exist $x^* \in \mathcal{V}$ consistent with that assignment. Otherwise, $\ell$ is said to hit $\mathcal{V}$.
Proof. Let $\mathcal{D}$ denote the mixture distribution over $\mathcal{F}$ given by
$$\frac{1}{2}\delta_{\mathcal{U}} + \frac{1}{2(n-1)}\sum_{k=1}^{n-1}\mathcal{D}_k,$$
where the $\mathcal{D}_k$ are as defined in Lemma 4.5. We will prove the stronger statement that any $\mathcal{F}$-aware sampling algorithm $\mathcal{A}$ must incur $\mathbb{E}_{\mu\sim\mathcal{D}}[\mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu)] \ge 1/16$.

In order for $\mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu)$ to be finite, we must have $\mathrm{leaf}(\mu) = \mathrm{leaf}_{\le\mathcal{T}(\mu)}(\mu)$. Henceforth, let $\mathrm{leaf}^* := \mathrm{leaf}_{\le\mathcal{T}(\mathcal{U})}(\mathcal{U})$. We may assume
$$\mathrm{TV}\bigg(\mathcal{U},\ \sum_{\ell\in\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}]\cdot\nu_\ell\bigg) \le 1/8,$$
or else $\mathbb{E}_{\mu\sim\mathcal{D}}[\mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu)] \ge \frac{1}{2}\,\mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mathcal{U}) > 1/16$ and we are done.

For any leaf node $\ell$, let $v_1 \to w_1 \to v_2 \to \cdots \to w_{T-1} \to v_T$ denote the sequence of decision and query nodes along the root-to-leaf path to $\ell$, and suppose the edges $(v_i, w_i)$ are labeled with partial assignments $X_{S(i)} = x(i)$. If $\ell \in \mathrm{leaf}^*$, then the edges $(w_i, v_{i+1})$ are all labeled with $\mathrm{Unif}(\mathbb{F}_q)^{\otimes(n-|S(i)|)}$.

Let $k_1 \le \cdots \le k_T$ denote the numbers $|S(1)|, \dots, |S(T)|$ in sorted order. By Proposition 4.4, for any MDS code $\mathcal{V}$ of dimension $k > k_T$ we have $P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}] = P_{\mathcal{A}}[\ell \mid \mathcal{U}]$. For $k_j < k < k_{j+1}$, by Lemma 4.5,
$$\Pr_{\mathcal{V}\sim\mathcal{D}_k}[\ell\ \text{avoids}\ \mathcal{V}] \ge 1 - \sum_{s>j} q^{-(k_s-k)} \ge 1 - T/q, \qquad (2)$$
and if $\ell$ avoids $\mathcal{V}$, the oracle's response to every query along the path consists of uniform marginals, so again $P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}] = P_{\mathcal{A}}[\ell \mid \mathcal{U}]$. The same reasoning applies to $k < k_1$.
Let us write
$$\mathbb{E}_{\mathcal{V}\sim\mathcal{D}_k}\,\mathrm{TV}\bigg(\mathcal{U}_{\mathcal{V}},\ \sum_{\ell\in\mathrm{leaf}(\mathcal{U}_{\mathcal{V}})} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell\bigg) \ge \frac{1}{2} - \mathbb{E}_{\mathcal{V}\sim\mathcal{D}_k}\,\mathrm{TV}\bigg(\mathcal{U},\ \sum_{\ell\in\mathrm{leaf}(\mathcal{U}_{\mathcal{V}})} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell\bigg),$$
where we used that $\mathrm{TV}(\mathcal{U}, \mathcal{U}_{\mathcal{V}}) \ge 1/2$ for any proper subspace $\mathcal{V}$, together with the triangle inequality. We can rewrite the mixture on the right-hand side as
\begin{align*}
&\sum_{\ell\in\mathrm{leaf}^*:\ \text{avoids}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell + \sum_{\ell\in\mathrm{leaf}^*:\ \text{hits}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell + \sum_{\ell\in\mathrm{leaf}(\mathcal{U}_{\mathcal{V}})\setminus\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell \\
&= \sum_{\ell\in\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}]\cdot\nu_\ell - \sum_{\ell\in\mathrm{leaf}^*:\ \text{hits}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}]\cdot\nu_\ell + \sum_{\ell\in\mathrm{leaf}(\mathcal{U}_{\mathcal{V}})\setminus\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell, \qquad (3)
\end{align*}
where we used that for $\ell\in\mathrm{leaf}^*$ that avoid $\mathcal{V}$ we have $P_{\mathcal{A}}[\ell \mid \mathcal{U}] = P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]$, while for $\ell\in\mathrm{leaf}^*$ that hit $\mathcal{V}$ it must be that $P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}] = 0$, as the sampler under $\mathcal{U}_{\mathcal{V}}$ must deviate from the path that leads to $\ell$. As $\sum_{\ell\in\mathrm{leaf}(\mathcal{U}_{\mathcal{V}})\setminus\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}] = \sum_{\ell\in\mathrm{leaf}^*:\ \text{hits}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}]$, the TV between $\mathcal{U}$ and the mixture in Eq. (3) is thus upper bounded by $1/8 + \sum_{\ell\in\mathrm{leaf}^*:\ \text{hits}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}]$, and thus
$$\mathbb{E}_{\mathcal{V}\sim\mathcal{D}_k}\,\mathrm{TV}\bigg(\mathcal{U}_{\mathcal{V}},\ \sum_{\ell\in\mathrm{leaf}(\mathcal{U}_{\mathcal{V}})} P_{\mathcal{A}}[\ell \mid \mathcal{U}_{\mathcal{V}}]\cdot\nu_\ell\bigg) \ge \frac{3}{8} - \mathbb{E}_{\mathcal{V}\sim\mathcal{D}_k}\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}].$$
We say that $\mathcal{V}$ is $\eta$-good if it satisfies $\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits}\ \mathcal{V}} P_{\mathcal{A}}[\ell \mid \mathcal{U}] \le \eta$, for a parameter $\eta > 0$ to be chosen shortly. Observe that
\begin{align*}
\frac{1}{n-1}\sum_{k=1}^{n-1}\bigg\{\sum_{\ell\in\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}]\cdot\Pr_{\mathcal{V}\sim\mathcal{D}_k}[\ell\ \text{hits}\ \mathcal{V}]\bigg\} &= \sum_{\ell\in\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}]\cdot\frac{1}{n-1}\sum_{k=1}^{n-1}\Pr_{\mathcal{V}\sim\mathcal{D}_k}[\ell\ \text{hits}\ \mathcal{V}] \\
&\le \sum_{\ell\in\mathrm{leaf}^*} P_{\mathcal{A}}[\ell \mid \mathcal{U}]\cdot\frac{\mathcal{T}(\mathcal{U}) + (n-1-\mathcal{T}(\mathcal{U}))\,\mathcal{T}(\mathcal{U})/q}{n-1} \\
&= \frac{\mathcal{T}(\mathcal{U}) + (n-1-\mathcal{T}(\mathcal{U}))\,\mathcal{T}(\mathcal{U})/q}{n-1} \le \frac{2\,\mathcal{T}(\mathcal{U})}{n-1},
\end{align*}
where in the second step we used that for any leaf $\ell$ at distance $2T$ from the root, there are at most $T$ dimensions $0 < k < n$ equal to the size of some partial assignment along the root-to-leaf path to $\ell$, and for all other dimensions $k$, $\Pr_{\mathcal{V}\sim\mathcal{D}_k}[\ell\ \text{hits}\ \mathcal{V}] \le T/q$ by Eq. (2). By Markov's inequality, we conclude that for $\eta := \frac{4\,\mathcal{T}(\mathcal{U})}{n-1} \ll 1$,
$$\Pr_{0<k<n,\,\mathcal{V}\sim\mathcal{D}_k}[\mathcal{V}\ \text{is}\ \eta\text{-good}] \ge 1/2.$$
We conclude that
$$\mathbb{E}_{\mu\sim\mathcal{D}}\big[\mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu)\big] \ge \frac{1}{2}\cdot\Pr_{0<k<n,\,\mathcal{V}\sim\mathcal{D}_k}[\mathcal{V}\ \text{is}\ \eta\text{-good}]\cdot\Big(\frac{3}{8}-\eta\Big) \ge \frac{1}{16},$$
as claimed.

In fact, one sees from the definition of $\eta$ in the proof above that we have shown the even stronger statement that it is necessary to set the budget $\mathcal{T}(\mathcal{U})$ for the uniform distribution to be linear in $n$ for the costs to be sufficiently bounded across all $\mu \in \mathcal{F}$. Intuitively, this comes from the fact that one has to make $\Omega(n)$ queries before one can decisively rule out that $\mu$ is supported on a subspace.
4.2 Lower bounds for arbitrary information curves

Although the lower bound in Section 4.1 is quite tailored to distributions over MDS codes, it turns out that the same idea can be extended to show that any information curve admits a realization by some distribution which cannot be distinguished from a distribution with the same information curve shifted upwards by an additive constant at all indices past a certain point.

Theorem 4.10. Let $Z$ be the information curve associated to some distribution with total correlation TC and dual total correlation DTC. Suppose that
$$\min(\mathrm{TC}, \mathrm{DTC})\log n \ll n. \qquad (4)$$
Given $1 \le k < n$, let $Z^{\uparrow k}$ denote the information curve given by
$$Z^{\uparrow k}_j := \begin{cases} Z_j & \text{if } j \le k \\ Z_j + \log_2(q) & \text{if } j > k. \end{cases}$$
There exists a family $\mathcal{F}$ of distributions whose information curves are from among $\{Z, Z^{\uparrow 1}, \dots, Z^{\uparrow n-1}\}$, such that for every $k$ there is at least one such distribution in $\mathcal{F}$, and furthermore for any budget $\mathcal{T}$ satisfying $\mathcal{T}(\mu) \lesssim \max(1, \min(\mathrm{TC}(\mu), \mathrm{DTC}(\mu)))\log(n)$ for all $\mu \in \mathcal{F}$, no $\mathcal{F}$-aware sampling algorithm $\mathcal{A}$ can achieve $\sup_{\mu\in\mathcal{F}} \mathrm{cost}^{\mathrm{TV}}_{\mathcal{T}}(\mathcal{A};\mu) \le 1/16$.
Proof. Let $\mu^*$ denote a distribution over $\Sigma^n$ with information curve $Z$, and let $\mathcal{U}$ and $\mathcal{U}_{\mathcal{V}}$ denote the uniform distribution over $\mathbb{F}_q^n$ and the uniform distribution over a subspace $\mathcal{V} \subset \mathbb{F}_q^n$ as before, where $\mathcal{V}$ will range over RS codes. Define $\mu^*[\mathcal{U}] := \mu^* \times \mathcal{U}$ and $\mu^*[\mathcal{U}_{\mathcal{V}}] := \mu^* \times \mathcal{U}_{\mathcal{V}}$, regarded as distributions over $(\Sigma\times\mathbb{F}_q)^n$. If $\mathcal{V}$ has dimension $k$, then by the linearity of the information curve in the average entropy curve, and by additivity of entropy,
$$Z(\mu^*[\mathcal{U}]) = Z(\mu^*) + Z(\mathcal{U}) = Z, \qquad Z(\mu^*[\mathcal{U}_{\mathcal{V}}]) = Z(\mu^*) + Z(\mathcal{U}_{\mathcal{V}}) = Z(\mu^*) + \big(\log_2(q)\cdot\mathbb{I}[j>k]\big)_j = Z^{\uparrow k}.$$
So to construct $\mathcal{F}$, we include $\mu^*[\mathcal{U}]$, and then for every dimension $1 \le k < n$ and every $k$-dimensional RS code $\mathcal{V}$, we include $\mu^*[\mathcal{U}_{\mathcal{V}}]$, thus satisfying the first condition of the theorem. The rest of the proof is nearly identical to that of Theorem 4.9, and we defer it to Appendix D.
5 Upper bound in terms of (dual) total correlation: proof of Theorem 1.9

In this section, we use Theorem 3.3 to obtain data-agnostic bounds on the expected KL error of the fixed and random unmasking algorithms. As mentioned in Section 1.4, Theorem 3.3 already immediately implies the bounds from the prior works [LC25] and [Aus20]. In this section we use Theorem 3.3 to improve these bounds in most regimes, assuming only access to estimates $\widehat{\mathrm{TC}}$ and $\widehat{\mathrm{DTC}}$ of TC and DTC.

Recall that Theorem 1.9 states the existence of an algorithm attaining error at most $\varepsilon$ and query complexity
$$k \le 2 + (1+\log n)\cdot\big(1+\lceil \widehat{\mathrm{TC}}/\varepsilon\rceil\big) \qquad \big(\text{resp. } k \le 2 + (1+\log n)\cdot\big(1+\lceil \widehat{\mathrm{DTC}}/\varepsilon\rceil\big)\big).$$
Provided that $\widehat{\mathrm{TC}}$ and $\widehat{\mathrm{DTC}}$ are constant-factor approximations of their respective estimands, this yields a query complexity proportional to $\min(\mathrm{TC}, \mathrm{DTC})$ and is generally significantly better than Theorems B.4 and B.1.
We now turn to the main technical content of this section, namely proving this result.
Proof of Theorem 1.9. We split into two cases, which are roughly similar. The main idea is to use an exponentially increasing schedule to attain the $\widehat{\mathrm{DTC}}$ bound and an exponentially decreasing schedule for the $\widehat{\mathrm{TC}}$ bound. This attains the correct query complexity. Moreover, using the pictorial representation, we find that the horizontal slices of the error can be enlarged by a factor of $\widehat{\mathrm{DTC}}/\varepsilon$ or $\widehat{\mathrm{TC}}/\varepsilon$, respectively, and subsequently shifted horizontally to fit above or below the information curve, respectively. From this it follows that the total error is at most a factor of $\varepsilon/\widehat{\mathrm{DTC}}$ or $\varepsilon/\widehat{\mathrm{TC}}$ times the area DTC or the area TC, respectively, yielding the upper bound of $\varepsilon$, provided that $\mathrm{TC} \le \widehat{\mathrm{TC}}$ and $\mathrm{DTC} \le \widehat{\mathrm{DTC}}$. We provide the full details below.
1. The $\widehat{\mathrm{TC}}$ bound. We proceed by defining the mask schedule, and then analyzing the query complexity and sampling error.

Mask schedule. We first define our mask schedule. Let $\zeta = 1+\lceil\widehat{\mathrm{TC}}/\varepsilon\rceil > 1$. If $\zeta \ge n+1$, then pick $k = n$ and $s_i = 1$ for all $i$; the sampler is perfect and the query complexity is $n \le 1+\lceil\widehat{\mathrm{TC}}/\varepsilon\rceil$, resolving this special case. From now on assume $\zeta \le n$. Consider the sequence $N_i$ given by $N_0 = 0$ and then recursively
$$N_i = \bigg\lfloor N_{i-1} + (n-N_{i-1})\frac{1}{\zeta}\bigg\rfloor \quad\text{for } 1 \le i \le \Bigg\lfloor \frac{\log(n-\zeta+1)}{\log\frac{1}{1-1/\zeta}}\Bigg\rfloor + 2 =: \lambda.$$
Note that definitionally we have $N_i \ge N_{i-1}$, and by induction $N_i \le n-1$ for all $i$. Moreover, since $n - N_{i-1}$ is an integer,
$$N_i \ge N_{i-1} + (n-N_{i-1})\frac{1}{\zeta} - \frac{\zeta-1}{\zeta} = (N_{i-1}-1)\Big(1-\frac{1}{\zeta}\Big) + n\cdot\frac{1}{\zeta},$$
so that
$$n-\zeta+1-N_i \le (n-\zeta+1-N_{i-1})\Big(1-\frac{1}{\zeta}\Big).$$
It follows that
$$n-1 \ge N_\lambda \ge (n-\zeta+1)\bigg(1-\Big(1-\frac{1}{\zeta}\Big)^\lambda\bigg) > (n-\zeta+1)\Big(1-\frac{1}{n-\zeta+1}\Big) = n-\zeta.$$
Now set $N_i = N_{i-1}+1$ for $\lambda+1 \le i \le \lambda+n-N_\lambda$, so that $N_{\lambda+n-N_\lambda} = n$. Lastly, define $s_i = N_i - N_{i-1}$.$^4$ We consider the mask schedule given by $\{s_i\}_{i=1}^{\lambda+n-N_\lambda}$.
Query complexity. The query complexity $k$ equals the number of stages of unmasking, i.e.
$$k = \lambda + n - N_\lambda \le \zeta + 2 + \frac{\log n}{\log\frac{1}{1-1/\zeta}} \le 2 + \zeta(1+\log n) \le 2 + (1+\log n)\cdot\bigg(1+\bigg\lceil\frac{\widehat{\mathrm{TC}}}{\varepsilon}\bigg\rceil\bigg),$$
$^4$Note that potentially some of the final values of $s_i$ will be $0$.
where we have used the fact that $\log\frac{1}{1-z} = -\log(1-z) \ge z$ for $z = \frac{1}{\zeta} \in [0,1)$.
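The case-1 mask schedule can be written out concretely (a sketch under our reading of the recursion, with the implicit rounding taken to be a floor; `tc_schedule` is our name, not the paper's):

```python
import math

def tc_schedule(n, zeta):
    # geometric phase: N_i = floor(N_{i-1} + (n - N_{i-1}) / zeta) for lam steps,
    # then one token at a time; trailing s_i = 0 entries are harmless (footnote 4)
    lam = math.floor(math.log(n - zeta + 1) / math.log(1 / (1 - 1 / zeta))) + 2
    N = [0]
    for _ in range(lam):
        N.append(math.floor(N[-1] + (n - N[-1]) / zeta))
    while N[-1] < n:
        N.append(N[-1] + 1)
    return [N[i + 1] - N[i] for i in range(len(N) - 1)]

n, zeta = 10**4, 12
s = tc_schedule(n, zeta)
assert sum(s) == n and min(s) >= 0
assert len(s) <= 2 + zeta * (1 + math.log(n))   # the claimed query complexity
```

Early blocks are large (about $n/\zeta$ tokens) and shrink geometrically, so each $s_i$ stays bounded by roughly $\varepsilon(n-N_i)/\widehat{\mathrm{TC}}$, which is what lets the error increments be charged against TC.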
Sampling error. First observe that for $1 \le i \le \lambda$, we have
$$s_i = N_i - N_{i-1} \le (n-N_{i-1})\frac{1}{\zeta} \implies s_i \le (n-N_i)\frac{1}{\zeta-1} \le \frac{\varepsilon}{\widehat{\mathrm{TC}}}\,(n-N_i).$$
Applying Theorem 3.3, we find that
\begin{align*}
\mathrm{KL}(\mu\|\nu) &\le \mathbb{E}_{S_1,\dots,S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1,\dots,S_k}\big)\big] = \sum_{i=1}^k \bigg[\Big(\sum_{j=1}^{s_i} Z_{N_{i-1}+j}\Big) - s_i Z_{N_{i-1}+1}\bigg] \\
&\le \sum_{i=1}^k s_i\big(Z_{N_i} - Z_{N_{i-1}+1}\big) \\
&\le \frac{\varepsilon}{\widehat{\mathrm{TC}}}\bigg(\sum_{i=1}^{\lambda}(n-N_i)\big(Z_{N_i}-Z_{N_{i-1}}\big)\bigg) + \sum_{i=\lambda+1}^k s_i\big(Z_{N_i}-Z_{N_{i-1}+1}\big) \\
&\le \frac{\varepsilon}{\widehat{\mathrm{TC}}}\bigg(\sum_{i=1}^{\lambda-1}(N_{i+1}-N_i)Z_{N_i}\bigg) + \frac{\varepsilon}{\widehat{\mathrm{TC}}}\,(n-N_\lambda)Z_{N_\lambda} \\
&\le \frac{\varepsilon}{\widehat{\mathrm{TC}}}\bigg(\sum_{i=1}^{\lambda-1}\sum_{j=N_i}^{N_{i+1}-1} Z_j\bigg) + \frac{\varepsilon}{\widehat{\mathrm{TC}}}\sum_{j=N_\lambda}^{n-1} Z_j \\
&\le \frac{\varepsilon}{\widehat{\mathrm{TC}}}\sum_{j=1}^{n-1} Z_j \le \frac{\varepsilon}{\widehat{\mathrm{TC}}}\cdot\mathrm{TC} \le \varepsilon,
\end{align*}
where we let $Z_0 = Z_1 = 0$ by convention and we have repeatedly used that the $Z_j$'s are nonnegative and nondecreasing (see Lemma 2.4). Note that the fourth line follows from a rearrangement (summation by parts) and the fact that $N_i = N_{i-1}+1$ for $i > \lambda$, so that the sum over $i > \lambda$ vanishes. Thus the algorithm yields the correct query complexity and sampling error, completing the proof of this case.
2. The $\widehat{\mathrm{DTC}}$ bound. We proceed in the same three steps as in case 1; the proof is largely similar, except that the mask schedule is essentially reversed.

Mask schedule. Let $\zeta = 1+\lceil\widehat{\mathrm{DTC}}/\varepsilon\rceil > 1$. If $\zeta \ge n+1$, then pick $k = n$ and $s_i = 1$ for all $i$; the sampler is perfect and the query complexity is $n \le 1+\lceil\widehat{\mathrm{DTC}}/\varepsilon\rceil$, resolving this special case. From now on assume $\zeta \le n$. Consider the sequence $N'_i$ given by $N'_0 = n$ and then recursively
$$N'_i = \bigg\lceil N'_{i-1}\Big(1-\frac{1}{\zeta}\Big)\bigg\rceil \quad\text{for } 1 \le i \le \Bigg\lfloor \frac{\log(n-\zeta+1)}{\log\frac{1}{1-1/\zeta}}\Bigg\rfloor + 2 =: \lambda.$$
Note that definitionally we have $N'_i \le N'_{i-1}$, and by induction $N'_i \ge 1$ for all $i$. Moreover,
$$N'_i \le N'_{i-1}\Big(1-\frac{1}{\zeta}\Big) + \frac{\zeta-1}{\zeta} \implies N'_i - \zeta + 1 \le \big(N'_{i-1}-\zeta+1\big)\Big(1-\frac{1}{\zeta}\Big).$$
It follows that
$$1 \le N'_\lambda \le \zeta-1+(n-\zeta+1)\Big(1-\frac{1}{\zeta}\Big)^\lambda < \zeta-1+(n-\zeta+1)\cdot\frac{1}{n-\zeta+1} = \zeta.$$
Now, set $N'_i = N'_{i-1}-1$ for $\lambda+1 \le i \le \lambda+N'_\lambda$, so that $N'_{\lambda+N'_\lambda} = 0$. Lastly, define $s_i = N'_{\lambda+N'_\lambda-i} - N'_{\lambda+N'_\lambda-i+1}$. We consider the mask schedule given by $\{s_i\}_{i=1}^{\lambda+N'_\lambda}$.
Query complexity. The query complexity $k$ equals the number of stages of unmasking, i.e.
$$k = \lambda + N'_\lambda \le \zeta + 2 + \frac{\log n}{\log\frac{1}{1-1/\zeta}} \le 2 + \zeta(1+\log n) \le 2 + (1+\log n)\cdot\bigg(1+\bigg\lceil\frac{\widehat{\mathrm{DTC}}}{\varepsilon}\bigg\rceil\bigg),$$
where we have used the fact that $\log\frac{1}{1-z} = -\log(1-z) \ge z$ for $z = \frac{1}{\zeta} \in [0,1)$.
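Symmetrically, the case-2 schedule can be sketched as follows (again our reading of the recursion, with the rounding taken to be a ceiling; `dtc_schedule` is our name):

```python
import math

def dtc_schedule(n, zeta):
    # N'_i = ceil(N'_{i-1} (1 - 1/zeta)) shrinks n down below zeta; reading the
    # gaps in reverse order yields a schedule whose blocks grow over time
    lam = math.floor(math.log(n - zeta + 1) / math.log(1 / (1 - 1 / zeta))) + 2
    Np = [n]
    for _ in range(lam):
        Np.append(math.ceil(Np[-1] * (1 - 1 / zeta)))
    while Np[-1] > 0:
        Np.append(Np[-1] - 1)
    return [Np[i] - Np[i + 1] for i in range(len(Np) - 2, -1, -1)]

n, zeta = 10**4, 12
s = dtc_schedule(n, zeta)
assert sum(s) == n and min(s) >= 0
assert len(s) <= 2 + zeta * (1 + math.log(n))
```

Here the first stages unmask one token at a time and the final stages unmask large blocks, so that each $s_i$ is bounded by roughly $\varepsilon N_{i-1}/\widehat{\mathrm{DTC}}$, mirroring the bound derived below.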
Sampling error. First observe that for $i > N'_\lambda$, we have $\lambda+N'_\lambda-i+1 \le \lambda$ and hence
$$s_i = N'_{\lambda+N'_\lambda-i} - N'_{\lambda+N'_\lambda-i+1} \le N'_{\lambda+N'_\lambda-i}\cdot\frac{1}{\zeta} \implies s_i \le N'_{\lambda+N'_\lambda-i+1}\cdot\frac{1}{\zeta-1} \le \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\,N'_{\lambda+N'_\lambda-i+1}.$$
As usual, let
$$N_i = \sum_{j\le i} s_j = N'_{\lambda+N'_\lambda-i}.$$
Applying Theorem 3.3, we find that
\begin{align*}
\mathrm{KL}(\mu\|\nu) &\le \mathbb{E}_{S_1,\dots,S_k}\big[\mathrm{KL}\big(\mu \,\big\|\, \nu_{S_1,\dots,S_k}\big)\big] = \sum_{i=1}^k \bigg[\Big(\sum_{j=1}^{s_i} Z_{N_{i-1}+j}\Big) - s_i Z_{N_{i-1}+1}\bigg] \\
&\le \sum_{i=1}^k s_i\big(Z_{N_i}-Z_{N_{i-1}+1}\big) \\
&\le \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\sum_{i=N'_\lambda+1}^{N'_\lambda+\lambda} N'_{\lambda+N'_\lambda-i+1}\big(Z_{N_i}-Z_{N_{i-1}}\big) + \sum_{i\le N'_\lambda} s_i\big(Z_{N_i}-Z_{N_{i-1}+1}\big) \\
&= \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\,N_{N'_\lambda+\lambda-1}Z_n - \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\sum_{i=N'_\lambda+1}^{N'_\lambda+\lambda-1}(N_i-N_{i-1})Z_{N_i} - \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\,N_{N'_\lambda}Z_{N_{N'_\lambda}} \\
&\le \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\,N_{N'_\lambda+\lambda-1}Z_n - \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\sum_{i=N'_\lambda+1}^{N'_\lambda+\lambda-1}\bigg(\sum_{j=N_{i-1}+1}^{N_i} Z_j\bigg) - \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\sum_{j=1}^{N_{N'_\lambda}} Z_j \\
&= \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\sum_{j=1}^{N_{N'_\lambda+\lambda-1}}\big(Z_n-Z_j\big) \\
&\le \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\sum_{j=1}^n\big(Z_n-Z_j\big) = \frac{\varepsilon}{\widehat{\mathrm{DTC}}}\cdot\mathrm{DTC} \le \varepsilon,
\end{align*}
where we let $Z_0 = Z_1 = 0$ by convention and we have repeatedly used that the $Z_j$'s are nonnegative and nondecreasing (see Lemma 2.4). Note that the fourth line follows from a rearrangement (summation by parts) and the fact that $N_i = N_{i-1}+1$ for $i \le N'_\lambda$, so that the sum over $i \le N'_\lambda$ vanishes. Thus the algorithm yields the correct query complexity and sampling error, completing the proof of this case.

In conclusion, we find that in both cases there exists a mask schedule with the desired query complexity and sampling error. This completes the proof.
Knowledge of TC and DTC. We elaborate briefly on the "hyperparameter sweep" discussed in the introduction. While the above result does not require knowledge of the data distribution or of the entire information curve, it nonetheless requires the values of TC and DTC. These values are in general unknown and, moreover, not readily estimable from our conditional oracle. In practice, however, we can treat TC and DTC as sampling hyperparameters and sweep over a feasible range $\mathcal{H}$.

We suggest a choice of $\mathcal{H}$ as follows. First, it is not difficult to see that if we choose estimates $\widehat{\mathrm{TC}} \in [\mathrm{TC}, 2\cdot\mathrm{TC}]$ and $\widehat{\mathrm{DTC}} \in [\mathrm{DTC}, 2\cdot\mathrm{DTC}]$, the mask schedule in the proof of Theorem 1.9 achieves error at most $\varepsilon$ and query complexity within a factor of two of
$$2 + (1+\log n)\cdot\bigg(1+\min\bigg(\bigg\lceil\frac{\mathrm{TC}}{\varepsilon}\bigg\rceil, \bigg\lceil\frac{\mathrm{DTC}}{\varepsilon}\bigg\rceil\bigg)\bigg),$$
that is, the complexity attainable with complete knowledge of TC and DTC. Moreover, we know that
$$1 \le \bigg\lceil\frac{\mathrm{DTC}}{\varepsilon}\bigg\rceil, \bigg\lceil\frac{\mathrm{TC}}{\varepsilon}\bigg\rceil \qquad\text{and}\qquad \mathrm{DTC}, \mathrm{TC} \le nZ_n \le nH_1 \le n\log|\Sigma|.$$
Combining these observations, we can take
$$\mathcal{H} = \{2^i : i\in\mathbb{Z},\ \varepsilon \le 2^i \le 2n\log|\Sigma|\}, \qquad |\mathcal{H}| = O\bigg(\log\frac{n\log|\Sigma|}{\varepsilon}\bigg),$$
for which there exists a pair $(\widehat{\mathrm{TC}}, \widehat{\mathrm{DTC}}) \in \mathcal{H}^2$ which, if used as the estimates of TC and DTC, yields the desired error and query complexity. Under this choice of $\mathcal{H}$, the hyperparameter sweep incurs an extra query complexity factor of $|\mathcal{H}|^2 = O\big(\log^2\frac{n\log|\Sigma|}{\varepsilon}\big)$. For most choices of $n, \varepsilon, |\Sigma|$, this is a polylogarithmic factor in $n$.
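The grid $\mathcal{H}$ is just a geometric net of candidate estimates; a short sketch (the function name `sweep_grid` and the exact range endpoints below are our reading of the constraints above):

```python
import math

def sweep_grid(n, eps, alphabet_size):
    # powers of two spanning [eps, 2 n log|Sigma|]: any candidate value in
    # [eps, n log|Sigma|] then has a 2-approximation from above in the grid
    lo = math.floor(math.log2(eps))
    hi = math.ceil(math.log2(2 * n * math.log(alphabet_size)))
    return [2.0 ** i for i in range(lo, hi + 1)]

H = sweep_grid(n=10**4, eps=0.01, alphabet_size=50)
for val in [0.01, 1.0, 37.3, 10**4 * math.log(50)]:   # candidate TC / DTC values
    assert any(val <= h <= 2 * val for h in H)        # some h in [val, 2*val]
```

Since some power of two always falls in $[v, 2v)$, the grid contains a valid estimate for every feasible value of TC or DTC.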
References
[AGR24] Nima Anari, Ruiquan Gao, and Aviad Rubinstein. Parallel sampling via counting. InProceed-
ings of the 56th Annual ACM Symposium on Theory of Computing, pages 537–548, 2024.
[And08] Andrea Montanari. Estimating random variables from random sparse observations. European Transactions on Telecommunications, 19(4):385–403, 2008.
[Aus19] Tim Austin. The structure of low-complexity gibbs measures on product spaces.The Annals of
Probability, 47(6):4002–4023, 2019.
[Aus20] Tim Austin. Multi-variate correlation and mixtures of product measures.Kybernetika,
56(3):459–499, 2020.
[BBDD24] Joe Benton, Valentin De Bortoli, Arnaud Doucet, and George Deligiannidis. Nearly d-linear convergence bounds for diffusion models via stochastic localization. In The Twelfth International Conference on Learning Representations, 2024.
[Bir87] Lucien Birgé. Estimating a density under order restrictions: Nonasymptotic minimax risk.The
Annals of Statistics, pages 995–1012, 1987.
[Can20] Clément L Canonne. A survey on distribution testing: Your data is big. but is it blue?Theory
of Computing, pages 1–100, 2020.
[CCK+21] Clément L Canonne, Xi Chen, Gautam Kamath, Amit Levi, and Erik Waingarten. Random
restrictions of high dimensional distributions and uniformity testing with subcube conditioning.
InProceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages
321–336. SIAM, 2021.
[CCL+23] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R. Zhang. Sampling is as
easy as learning the score: theory for diffusion models with minimal data assumptions. InThe
Eleventh International Conference on Learning Representations, 2023.
[CD16] Sourav Chatterjee and Amir Dembo. Nonlinear large deviations.Advances in Mathematics,
299:396–450, 2016.
[CDS25] Giovanni Conforti, Alain Durmus, and Marta Gentiloni Silveri. KL convergence guarantees
for score diffusion models under minimal data assumptions.SIAM Journal on Mathematics of
Data Science, 7(1):86–109, 2025.
[CFGM13] Sourav Chakraborty, Eldar Fischer, Yonatan Goldhirsh, and Arie Matsliah. On the power of
conditional samples in distribution testing. InProceedings of the 4th conference on Innovations
in Theoretical Computer Science, pages 561–580, 2013.
[CHJW09] Qi Chen, Chen He, Lingge Jiang, and Qingchuan Wang. Average entropy functions. In2009
IEEE International Symposium on Information Theory, pages 2632–2633. IEEE, 2009.
[CLL23] Hongrui Chen, Holden Lee, and Jianfeng Lu. Improved analysis of score-based generative
modeling: user-friendly bounds under minimal smoothness assumptions. InInternational Con-
ference on Machine Learning, pages 4735–4763. PMLR, 2023.
[CRS15] Clément L Canonne, Dana Ron, and Rocco A Servedio. Testing probability distributions using
conditional samples.SIAM Journal on Computing, 44(3):540–616, 2015.
[CY24] Hongrui Chen and Lexing Ying. Convergence analysis of discrete diffusion model: Exact
implementation through uniformization.arXiv preprint arXiv:2402.08095, 2024.
[CZJ+22] Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked
generative image transformer. InProceedings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 11315–11325, 2022.
[EAM22] Ahmed El Alaoui and Andrea Montanari. An information-theoretic view of stochastic localiza-
tion.IEEE Transactions on Information Theory, 68(11):7423–7426, 2022.
[EG18a] Ronen Eldan and Renan Gross. Decomposition of mean-field gibbs distributions into product
measures.Electronic Journal of Probability, 23, 2018.
[EG18b] Ronen Eldan and Renan Gross. Exponential random graphs behave like mixtures of stochastic
block models.The Annals of Applied Probability, 28(6):3698–3735, 2018.
[Eld18] Ronen Eldan. Gaussian-width gradient complexity, reverse log-sobolev inequalities and non-
linear large deviations.Geometric and Functional Analysis, 28(6):1548–1596, 2018.
[JKR19] Vishesh Jain, Frederic Koehler, and Andrej Risteski. Mean-field approximation, convex hierar-
chies, and the optimality of correlation rounding: a unified perspective. InProceedings of the
51st Annual ACM SIGACT Symposium on Theory of Computing, pages 1226–1236, 2019.
[JVV86] Mark R Jerrum, Leslie G Valiant, and Vijay V Vazirani. Random generation of combinatorial
structures from a uniform distribution.Theoretical computer science, 43:169–188, 1986.
[KGO+25] Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjae Lee, Yuchen Zeng, Shuibai Zhang, Cole-
man Hooper, Yuezhou Hu, Hyung Il Koo, Nam Ik Cho, et al. Parallelbench: Understanding the
trade-offs of parallel decoding in diffusion llms.arXiv preprint arXiv:2510.04767, 2025.
[KKL+25] Samar Khanna, Siddhant Kharbanda, Shufan Li, Harshit Varma, Eric Wang, Sawyer Birnbaum,
Ziyang Luo, Yanis Miraoui, Akash Palrecha, Stefano Ermon, et al. Mercury: Ultra-fast lan-
guage models based on diffusion.arXiv preprint arXiv:2506.17298, 2025.
[LC25] Gen Li and Changxiao Cai. A convergence theory for diffusion language models: An
information-theoretic perspective.arXiv preprint arXiv:2505.21400, 2025.
[LLT23] Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence of score-based generative modeling for
general data distributions. InInternational Conference on Algorithmic Learning Theory, pages
946–985. PMLR, 2023.
[LME24] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the
ratios of the data distribution. InProceedings of the 41st International Conference on Machine
Learning, pages 32819–32848, 2024.
[LWCC24] Gen Li, Yuting Wei, Yuejie Chi, and Yuxin Chen. A sharp convergence theory for the probability
flow ODEs of diffusion models.arXiv preprint arXiv:2408.02320, 2024.
[LZ25] Hugo Lavenant and Giacomo Zanella. Error bounds and optimal schedules for masked diffu-
sions with factorized approximations.arXiv preprint arXiv:2510.25544, 2025.
[MR17] Pasin Manurangsi and Prasad Raghavendra. A birthday repetition theorem and complexity of
approximating dense csps. In44th International Colloquium on Automata, Languages, and
Programming (ICALP 2017), pages 78–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik,
2017.
[NZY+25] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai
Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.arXiv preprint
arXiv:2502.09992, 2025.
[PW25] Yury Polyanskiy and Yihong Wu.Information theory: From coding to learning. Cambridge
university press, 2025.
[RCZ+25] Yinuo Ren, Haoxuan Chen, Yuchen Zhu, Wei Guo, Yongxin Chen, Grant M Rotskoff, Molei
Tao, and Lexing Ying. Fast solvers for discrete diffusion models: Theory and applications of
high-order algorithms.arXiv preprint arXiv:2502.00234, 2025.
[RP25] Galen Reeves and Henry D Pfister. Information-theoretic proofs for diffusion sampling.arXiv
preprint arXiv:2502.02305, 2025.
[RT12] Prasad Raghavendra and Ning Tan. Approximating csps with global cardinality constraints
using sdp hierarchies. InProceedings of the twenty-third annual ACM-SIAM symposium on
Discrete Algorithms, pages 373–387. SIAM, 2012.
[SAS+24] Subham Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
[SHW+24] Jiaxin Shi, Kehang Han, Zhe Wang, Arnaud Doucet, and Michalis Titsias. Simplified and
generalized masked diffusion for discrete data.Advances in neural information processing
systems, 37:103131–103167, 2024.
[YZ14] Yuichi Yoshida and Yuan Zhou. Approximation schemes via sherali-adams hierarchy for dense
constraint satisfaction problems and assignment problems. InProceedings of the 5th conference
on Innovations in theoretical computer science, pages 423–438, 2014.
A Logarithmic overhead is necessary

In this section, we show that the $\log n$ term in Theorem 1.9 is necessary. This is essentially implicit in [Bir87] and has been used in various works on monotone distribution estimation. We were, however, unable to find a complete statement and proof in the literature, and so we provide one for completeness. First, recall the following definition.

Definition A.1 (Piecewise functions). A $k$-piecewise function $f: [n] \to \mathbb{R}$ is a function for which there are at most $k$ values $i \in [n-1]$ such that $f(i) \ne f(i+1)$.

We can now state the main result of this section.

Lemma A.2. Let $n \ge 2$ and let $\varepsilon$ be such that $\frac{2}{n}\log\frac{2}{\varepsilon} \le \varepsilon \le \frac{1}{\log n}$, and let $k \le c\cdot\frac{\log n}{\varepsilon}$ for some sufficiently small constant $c$. Then there exists a non-negative monotone function $f: [n] \to \mathbb{R}$ satisfying $\sum_i f(i) = 1$ such that for any $k$-piecewise constant function $h: [n] \to \mathbb{R}$, we have
$$\|f-h\|_{L^1} \ge \Omega(\varepsilon).$$
Proof. We define $f$ as follows. For $i=0,\ldots,\left\lceil\frac{\log(n+1)}{\log(1+\varepsilon)}-1\right\rceil=m\le\frac{\log(n+1)}{\varepsilon}$, we let $B_i=\{\lfloor(1+\varepsilon)^i\rfloor,\ldots,\min(\lfloor(1+\varepsilon)^{i+1}\rfloor-1,\,n)\}$, and for $x\in B_i$ we let
$$f(x)=p_i=\frac{1}{4}\cdot\frac{(1+\varepsilon)^{-i}}{\log n}.$$
Note that
$$\sum_{x=1}^n f(x)\le\frac14\sum_{\substack{i=0\\ \varepsilon(1+\varepsilon)^i\ge 1}}^{m}\bigl(\varepsilon(1+\varepsilon)^i+1\bigr)\frac{(1+\varepsilon)^{-i}}{\log n}\le\frac14\left(\sum_{i=0}^m\frac{2\varepsilon}{\log n}\right)\le 1$$
as $m\le\frac{2\log n}{\varepsilon}$, and so this is a valid choice of $f$.
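As a sanity check, the bucketed construction above can be instantiated directly. The following sketch is ours, not the paper's (the function and parameter names are hypothetical); it builds $f$ bucket by bucket and verifies non-negativity, monotonicity, and validity as a sub-distribution.

```python
import math

# Illustrative sketch (not from the paper): instantiate the hard function f
# from the proof of Lemma A.2. Buckets are
#   B_i = {floor((1+eps)^i), ..., min(floor((1+eps)^{i+1}) - 1, n)},
# and f is constant on B_i, equal to p_i = (1/4) * (1+eps)^(-i) / log(n).
def build_f(n: int, eps: float) -> list[float]:
    f = [0.0] * (n + 1)  # 1-indexed; f[0] unused
    i = 0
    while math.floor((1 + eps) ** i) <= n:
        lo = math.floor((1 + eps) ** i)
        hi = min(math.floor((1 + eps) ** (i + 1)) - 1, n)
        p_i = 0.25 * (1 + eps) ** (-i) / math.log(n)
        for x in range(max(lo, 1), hi + 1):
            f[x] = p_i
        i += 1
    return f[1:]

n, eps = 1024, 1 / math.log(1024)
f = build_f(n, eps)
assert all(v >= 0 for v in f)                       # non-negative
assert all(f[x] >= f[x + 1] for x in range(n - 1))  # monotone (decreasing)
assert sum(f) <= 1                                  # valid sub-distribution
```

The buckets are disjoint and cover $[n]$, so each point is written exactly once; the total mass check mirrors the displayed inequality above.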
Let $h$ be any $k$-piecewise function, and let $I_1,\ldots,I_k$ denote the partition of $[n]$ into $k$ disjoint intervals on which $h$ is constant. We will refer to the right endpoints of these intervals as the breakpoints of $h$. The remainder of the proof proceeds in two main steps. We first show that $h$ may be modified to have two desirable structural qualities: namely, to have breakpoints only at the right endpoints of the $B_i$ and values only in the $\{p_i\}$. We then directly analyze functions $h$ with these properties to prove the bound.

Shifting the image of $h$. We first claim that for any $h$, there is some $h_{\mathrm{image}}$ such that $\|h_{\mathrm{image}}-f\|_{L^1}\le\|h-f\|_{L^1}$, and $h_{\mathrm{image}}$ has the same breakpoints as $h$ but $h_{\mathrm{image}}(x)=p_{i_j}$ for all $x\in I_j$, for values $i_j$ satisfying $B_{i_j}\cap I_j\neq\emptyset$. That is, on each interval where $h$ is constant, we can replace the value of $h$ on that interval with one of the values that $f$ attains on the same interval.

Indeed, this is because the contribution of $I_j$ to the total error is $F_j(h(x))=\sum_{x\in I_j}|h(x)-f(x)|$. Suppose $h(x)\in[p_t,p_{t+1}]$ on $I_j$. Then since $F_j$ is linear on $[p_t,p_{t+1}]$, it must be optimized at one of the endpoints. In particular, we may replace the value of $h$ on $I_j$ with either $p_t$ or $p_{t+1}$ without increasing the number of pieces or the total error. Applying this to all intervals $I_j$ yields some $h_{\mathrm{image}}$ which satisfies the conditions of the claim.
Shifting the breakpoints of $h$. We now claim that for any $h$, there is some $h_{\mathrm{final}}$ such that $\|h_{\mathrm{final}}-f\|_{L^1}\le\|h-f\|_{L^1}$, which is also $k$-piecewise but whose breakpoints are a subset of the breakpoints of $f$, and whose image is contained in the image of $h$.
Indeed, suppose $h$ has a breakpoint $t\in I_j\cap B_\ell$ which is not the right endpoint of $B_\ell$; call such breakpoints bad. The contribution of $I_j$ to the total error is $F_j(h(x))=\sum_{x\in I_j}|h(x)-f(x)|$. We now have two cases.

Case 1. If $|h(t)-f(t)|>|h(t+1)-f(t+1)|$, let $h'(r)=h(t+1)$ for all $r\in I_j\cap B_\ell$ with $r\le t$, and $h'(x)=h(x)$ otherwise. Then the number of pieces and total error of $h'$ are not greater than those of $h$. Moreover, either the total number of pieces decreases by 1 or $h'$ has one less bad breakpoint than $h$.⁵

Case 2. If $|h(t)-f(t)|\le|h(t+1)-f(t+1)|$, let $h'(r)=h(t)$ for all $r\in I_{j+1}\cap B_\ell$ with $r>t$, and $h'(x)=h(x)$ otherwise. Then the number of pieces and total error of $h'$ are not greater than those of $h$. Moreover, either the total number of pieces decreases by 1 or $h'$ has one less bad breakpoint than $h$.⁶

Iterating the operation $h\to h'$, we conclude that after a finite number of rounds there are no bad breakpoints. The output $h_{\mathrm{final}}$ satisfies the conditions of the claim.
Completing the proof. Combining these claims, there exists an optimal $k$-piecewise approximation $h$ to $f$ for which $I_j=\{\lfloor(1+\varepsilon)^{i_j}\rfloor+1,\ldots,\lfloor(1+\varepsilon)^{i_{j+1}}\rfloor\}$ for some $0=i_1\le\cdots\le i_{k+1}=m$, and for all $x\in I_j$, we have that $h(x)=p_{t_j}$ for some $t_j$. Moreover, breaking up each interval into the intervals before and after $t_j$, we may further assume that $h(x)=p_{i_j}$ or $h(x)=p_{i_{j+1}}$.⁷ Lastly, we may ignore the regions $B_j$ for which $j\lesssim\frac{\log(2/\varepsilon)}{\varepsilon}$, so that for all considered $B_s$, we have $\varepsilon(1+\varepsilon)^s\ge 2$.⁸ Now, let $\ell_j=i_{j+1}-i_j$. We have the following two cases.
Large interval regime. First, suppose there is some interval $I_j$ such that $\ell_j\ge\frac{2(1+\varepsilon)}{\varepsilon}$. If $h(x)=p_{i_{j+1}}$ on this interval, then the error of the approximation on this interval is at least
$$\frac{1}{\log n}\sum_{s=i_j}^{i_{j+1}}\left(\frac{1}{(1+\varepsilon)^s}-\frac{1}{(1+\varepsilon)^{i_{j+1}}}\right)\cdot\frac{1}{2}\varepsilon(1+\varepsilon)^s=\frac{\varepsilon}{2\log n}\left(\ell_j-\sum_{s=0}^{\ell_j-1}\frac{1}{(1+\varepsilon)^s}\right)\ge\frac{\varepsilon}{2\log n}\left(\ell_j-\frac{1+\varepsilon}{\varepsilon}\right)\ge\frac{1}{2\log n}=\Omega(\varepsilon).$$
If instead $h(x)=p_{i_j}$, we obtain the same result via an analogous calculation. In either case, the desired statement holds.
Small interval regime. Otherwise, assume that $\ell_j\le\frac{2(1+\varepsilon)}{\varepsilon}$ for all $j$. Recall the approximation inequality $1-\frac{1}{(1+\varepsilon)^r}\ge\min(1/2,\,r\varepsilon/2)$, which follows from Bernoulli's inequality. Applying it, we find that if $h(x)=p_{i_{j+1}}$, the error on $I_j$ is at least
$$\frac{1}{\log n}\sum_{s=i_j}^{i_{j+1}}\left(\frac{1}{(1+\varepsilon)^s}-\frac{1}{(1+\varepsilon)^{i_{j+1}}}\right)\cdot\frac{1}{2}\varepsilon(1+\varepsilon)^s\ge\frac{\varepsilon}{4\log n}\sum_{s=0}^{\ell_j-1}s\varepsilon\ge\frac{\varepsilon^2}{16\log n}(\ell_j-1)^2.$$
If instead $h(x)=p_{i_j}$, we obtain the same result again via an analogous calculation. Thus, the overall
⁵The former occurs if $I_j$ has a smaller left endpoint than $B_\ell$, and the latter otherwise.
⁶The former occurs if $I_{j+1}$ has a larger right endpoint than $B_\ell$, and the latter otherwise.
⁷Note that this operation at most doubles the number of intervals in $h$; thus, it can be accounted for by changing the constant $c$ appropriately.
⁸This operation at most increments the number of intervals by a constant, and underestimates the total error, so it is also permissible.
error of the approximation can be lower bounded by
$$\|h-f\|_{L^1}\ge\frac{\varepsilon^2}{16\log n}\sum_{j=1}^k(\ell_j-1)^2.$$
Since $\sum_{j=1}^k(\ell_j-1)\ge(1-c)\frac{\log n}{\varepsilon}$, by the Cauchy-Schwarz inequality we conclude that
$$\|h-f\|_{L^1}\ge\frac{\varepsilon^2(1-c)^2}{16\log n}\cdot\frac{(\log n)^2}{k\varepsilon^2}=\frac{(1-c)^2}{16}\cdot\frac{\log n}{k}\ge\Omega(\varepsilon),$$
completing the proof of the desired claim.
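The quantity being lower bounded, the best $k$-piecewise $L^1$ approximation error, can be computed exactly for small $n$ by dynamic programming, using the fact that on a fixed interval the $L^1$-optimal constant is a median. The following sketch is ours, not the paper's; here a "$k$-piecewise" function is taken to have at most $k$ constant pieces, a slight shift from Definition A.1.

```python
# Sketch (ours, not from the paper): exact optimal k-piecewise L1
# approximation error of a function f on [n] via dynamic programming.
# On a fixed interval the best constant value is a median of f there.
def best_piecewise_l1(f, k):
    n = len(f)
    # cost[i][j]: L1 error of the best constant on f[i..j] (inclusive).
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            seg = sorted(f[i:j + 1])
            med = seg[len(seg) // 2]  # (upper) median is L1-optimal
            cost[i][j] = sum(abs(v - med) for v in seg)
    INF = float("inf")
    # dp[p][j]: best error covering f[0..j] with p pieces.
    dp = [[INF] * n for _ in range(k + 1)]
    for p in range(1, k + 1):
        for j in range(n):
            for i in range(j + 1):
                prev = 0.0 if i == 0 else dp[p - 1][i - 1]
                dp[p][j] = min(dp[p][j], prev + cost[i][j])
    return dp[k][n - 1]

# One piece per point drives the error to zero; a single piece cannot.
assert best_piecewise_l1([1.0, 2.0, 4.0], 3) == 0.0
assert best_piecewise_l1([1.0, 2.0, 4.0], 1) > 0.0
```

Running this on the function $f$ from the proof (for small $n$) gives a numerical handle on how the optimal error decays with $k$.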
To make this result applicable to distributions, recall the following result about the realizability of any information curve by some distribution:

Lemma A.3 (Theorem 1 in [CHJW09]). For any information curve $0\le Z_1\le\cdots\le Z_n$ and approximation errors $\varepsilon_1,\ldots,\varepsilon_n>0$, there exists a distribution $\mu$ for which $|Z_i(\mu)-Z_i|\le\varepsilon_i$ for all $i$.
Combining Lemmas A.2 and A.3, we conclude that there exist distributions for which the $\log n$ is unavoidable in many parameter regimes.
Theorem A.4. Let $n\ge 2$, let $\varepsilon$ be such that $\frac{2}{n}\log\frac{2}{\varepsilon}\le\varepsilon\le\frac{1}{\log n}$, and let $c$ be a sufficiently small constant. Then there exists a distribution $\mu$ such that for any unmasking schedule $(k,\{s_i\}_1^k)$ with $k\le c\cdot\frac{\log n}{\varepsilon}$, we have
$$\mathbb{E}_{S_1,\ldots,S_k}\left[\mathrm{KL}\bigl(\mu\,\big\|\,\nu_{S_1,\ldots,S_k}\bigr)\right]\ge\Omega(\varepsilon),$$
where the expectation is taken over all partitions $S=\bigsqcup_{i=1}^k S_i$ and $\nu_{S_1,\ldots,S_k}$ denotes the distribution output by $\mathcal{A}(k,\{S_i\}_1^k)$.
Proof. By Theorem 1.4, the expected KL error is given by $\|Z-Z_N\|_{L^1}$. The result follows from choosing a distribution $\mu$ via Lemma A.3 with $\varepsilon_i=o(\varepsilon/n)$ and applying Lemma A.2 together with the triangle inequality for the $L^1$ norm.
B Recovering existing bounds
In this section we recover the iteration complexity bound of [LC25] and an iteration complexity bound
which is implicit in [Aus20].
B.1 Recovering the bound of Li and Cai [LC25]
In [LC25], the authors prove the following bound on the sampling error, given the sizes of the mask schedule.

Theorem B.1 (Theorem 1 of [LC25]). Let $\mu$ be the data distribution and $\nu_{S_1,\ldots,S_k}$ be the output of the fixed unmasking algorithm $\mathcal{A}_{\mathrm{fixed}}(k,\{S_i\}_1^k)$. Let $s_{\max}=\max_{i=1}^k|S_i|$. Then we have
$$\mathbb{E}_{S_1,\ldots,S_k}\left[\mathrm{KL}\bigl(\mu\,\big\|\,\nu_{S_1,\ldots,S_k}\bigr)\right]\le\frac{2^{\lceil\log_2 s_{\max}\rceil}-1}{n}\sum_{i=1}^n I(X_i;\{X_j\}_{j\neq i})=\frac{2^{\lceil\log_2 s_{\max}\rceil}-1}{n}(\mathrm{TC}+\mathrm{DTC}).$$
We provide a short proof of this result via Theorem 3.3.
Proof. The equality in the theorem follows from the definitions of $\mathrm{TC}$ and $\mathrm{DTC}$, so we aim to prove the inequality. By Theorem 3.3 and Lemma 2.4, we have
$$\mathbb{E}_{S_1,\ldots,S_k}\left[\mathrm{KL}\bigl(\mu\,\big\|\,\nu_{S_1,\ldots,S_k}\bigr)\right]=\sum_{i=1}^k\sum_{j=1}^{s_i}\bigl(Z_{N_{i-1}+j}-Z_{N_{i-1}+1}\bigr)\le\sum_{i=1}^k(s_i-1)\bigl(Z_{N_i}-Z_{N_{i-1}}\bigr)\le(s_{\max}-1)Z_n\le\frac{2^{\lceil\log_2 s_{\max}\rceil}-1}{n}(\mathrm{TC}+\mathrm{DTC}),$$
where in the second step we have noted that the summand $Z_{N_{i-1}+j}-Z_{N_{i-1}+1}$ is zero for $j=1$ and at most $Z_{N_i}-Z_{N_{i-1}}$ otherwise. This completes the proof.
Remark B.2. We can restate Theorem B.1 in terms of query complexity given a fixed $\varepsilon$. In particular, the number of queries is $k\ge\frac{n}{s_{\max}}$. Thus, we find that given any $\varepsilon$, there is a mask schedule attaining expected sampling error at most $\varepsilon$ in $O\!\left(\frac{\mathrm{TC}+\mathrm{DTC}}{\varepsilon}\right)$ queries.
As shown in [LC25], this upper bound is optimal in some cases. However, as we remarked in the
introduction, it is generally worse than Theorem 1.9.
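For intuition about the quantities in the bound, $\mathrm{TC}$ and $\mathrm{DTC}$ can be computed directly for a small joint distribution via the standard identities $\mathrm{TC}=\sum_i H(X_i)-H(X)$ and $\mathrm{DTC}=H(X)-\sum_i H(X_i\mid X_{-i})$. The following sketch is ours, not the paper's code.

```python
import math

# Sketch (ours, not from the paper): total correlation and dual total
# correlation of a joint pmf over {0,1}^n, given as a dict from
# bit-tuples of length n to probabilities. Entropies are in nats.
def entropy(p):
    return -sum(q * math.log(q) for q in p if q > 0)

def tc_dtc(pmf, n):
    H_joint = entropy(pmf.values())
    tc = -H_joint   # TC  = sum_i H(X_i) - H(X)
    dtc = H_joint   # DTC = H(X) - sum_i H(X_i | X_{-i})
    for i in range(n):
        # marginal of X_i
        marg = [0.0, 0.0]
        for x, p in pmf.items():
            marg[x[i]] += p
        tc += entropy(marg)
        # H(X_i | X_{-i}) = H(X) - H(X_{-i})
        rest = {}
        for x, p in pmf.items():
            key = x[:i] + x[i + 1:]
            rest[key] = rest.get(key, 0.0) + p
        dtc -= H_joint - entropy(rest.values())
    return tc, dtc

# Two perfectly correlated fair bits: TC = DTC = log 2, so
# TC + DTC = 2 log 2 = sum_i I(X_i; X_{-i}), matching the theorem's identity.
pmf = {(0, 0): 0.5, (1, 1): 0.5}
tc, dtc = tc_dtc(pmf, 2)
assert abs(tc - math.log(2)) < 1e-9 and abs(dtc - math.log(2)) < 1e-9
```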
B.2 Recovering the bound of Austin [Aus20]
In [Aus20], the author proves that distributions with low $\mathrm{TC}$ can be decomposed into a fixed subset of coordinates $S$ and a remaining subset of coordinates $[n]\setminus S$ which has low conditional $\mathrm{TC}$. For completeness, we show the result here.
Lemma B.3 (Lemma 8.3 of [Aus20]). Let $\mu$ be the data distribution. Then there is a subset size $s\le\frac{\mathrm{DTC}}{\delta^2}$ for which, in expectation over all $S\subseteq[n]$ with $|S|=s$, we have
$$\mathrm{TC}(X_1,\ldots,X_n\mid X_S)+\mathrm{DTC}(X_1,\ldots,X_n\mid X_S)\le\delta^2(n-|S|),$$
where $\mathrm{TC}(Y\mid X)=\mathbb{E}_{x\sim p(X)}[\mathrm{TC}(Y\mid X=x)]$ is the conditional $\mathrm{TC}$, and similarly for $\mathrm{DTC}(Y\mid X)$.
This lemma yields a natural method of sampling $\mu$: first perfectly sample an arbitrary subset of size $s$, and then sample the remaining coordinates in one shot.
Corollary B.4. Suppose $\mathrm{DTC}\le\delta^2 n$. Let $\mu$ be the data distribution and $\nu$ be the output of the random unmasking algorithm $\mathcal{A}(k,\{s_i\}_1^k)$. Then there exists a schedule $(k,\{s_i\}_1^k)$ for which the query complexity satisfies $k\le\frac{\mathrm{DTC}}{\delta^2}+1$ and the error is bounded by
$$\mathrm{KL}(\mu\|\nu)\le\delta^2(n-k+1).$$
Note that here $k=s+1$, since there is the final one-shot step. We provide a short proof of this result via Theorem 3.3.
Proof. Consider the schedule given by $k=\frac{\mathrm{DTC}}{\delta^2}+1$, $s_i=1$ for $i\le k-1$, and $s_k=n-k+1$. By Theorem 3.3, we have that
$$\mathrm{KL}(\mu\|\nu)=(n-k+1)(Z_n-Z_k)\le\frac{n-k+1}{k}\sum_{j=1}^k(Z_n-Z_j)\le\frac{n-k+1}{k}\,\mathrm{DTC}\le\delta^2(n-k+1),$$
where we have used Lemma 2.4 repeatedly.
Note that the bound in Corollary B.4 is not particularly strong; in particular, the sampling procedure is essentially two-step and does not provide significant flexibility in the choice of mask schedule. This can be improved by replacing the one-shot sample of $S_k$ with an $\ell$-step, constant-mask-size sampler. Under this regime, combining the above result with Theorem B.1 and then optimizing over $k$ and $\ell$, we can recover Theorem 1.10 given in the introduction. We provide a detailed proof below.
Proof of Theorem 1.10. Consider the schedule $s_i=1$ for $i\le k-1$ and $s_i=\lfloor\frac{n-k+1}{\ell}\rfloor$ for $k\le i\le k+\ell-1$.⁹ Applying Theorem B.1 to the conditional distribution $X_1,\ldots,X_n\mid X_S$ and then Lemma B.3, we find that the total error is bounded by
$$\mathbb{E}[\mathrm{KL}(\mu\|\nu)]\le\frac{1}{\ell}\bigl(\mathrm{TC}(X_1,\ldots,X_n\mid X_S)+\mathrm{DTC}(X_1,\ldots,X_n\mid X_S)\bigr)\le\frac{\delta^2 n}{\ell}.$$
The total query complexity is $k+\ell$. Taking $\ell=\left\lceil\frac{\delta^2 n}{\varepsilon}\right\rceil$ and $k\le\frac{\mathrm{DTC}}{\delta^2}$, observe that we can then set $\delta^2=\sqrt{\frac{\mathrm{DTC}\cdot\varepsilon}{n}}$. We finally find that this schedule yields error $\mathbb{E}[\mathrm{KL}(\mu\|\nu)]\le\varepsilon$ and query complexity $k+\ell=O\!\left(\sqrt{\frac{\mathrm{DTC}\cdot n}{\varepsilon}}\right)$, as desired. This completes the proof.
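The parameter choice in the proof above can be made concrete. The following arithmetic sketch is ours (the function name is hypothetical): it picks $\delta^2=\sqrt{\mathrm{DTC}\cdot\varepsilon/n}$ and computes the two phase lengths, each of which comes out on the order of $\sqrt{\mathrm{DTC}\cdot n/\varepsilon}$.

```python
import math

# Sketch (ours, not the paper's code) of the schedule parameters in the
# proof of Theorem 1.10: a first phase of ~DTC/delta^2 single-token steps,
# then l = ceil(delta^2 * n / eps) constant-size steps, with
# delta^2 = sqrt(DTC * eps / n) balancing the two phases.
def schedule_params(dtc: float, n: int, eps: float):
    delta_sq = math.sqrt(dtc * eps / n)
    k = math.ceil(dtc / delta_sq)        # single-token phase length
    l = math.ceil(delta_sq * n / eps)    # constant-size phase length
    return k, l

k, l = schedule_params(dtc=4.0, n=10_000, eps=0.01)
# Both phases balance at sqrt(DTC * n / eps) = sqrt(4 * 10^6) = 2000.
assert k == 2000 and l == 2000
```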
C Decoupling estimation error and sampling error
This appendix follows the work of [LC25], and is written so that the present work is self-contained. For simplicity, let $\mu(x)$ denote the PDF of $\mu$. Let $T_i=\cup_{j<i}S_j$ for any partition $S=\bigsqcup_{i=1}^k S_i$. We consider learning the estimate $\widehat{\mathrm{CO}}$ which minimizes the following error:
$$\mathrm{error}(\mu,\widehat{\mathrm{CO}})=\mathbb{E}_{S_1,\ldots,S_k;\,i\in[k]}\left[\frac{n}{|S_i|}\sum_{j\in S_i}\log\frac{\mu(X_j\mid X_{T_i}=x_{T_i})}{\widehat{\mathrm{CO}}(X_j\mid X_{T_i}=x_{T_i})}\right],$$
where $\widehat{\mathrm{CO}}(X_j\mid X_{T_i}=x_{T_i})$ denotes the conditional marginal of $X_j$ output by the learned oracle, and the expectation is over all schedules $S=\bigsqcup_{i=1}^k S_i$, over $x\sim\mu$, and over $i$ drawn from the distribution $p(i)=\frac{|S_i|}{n}$.
⁹This is approximate, as we do not necessarily have $\sum_i s_i=n$. Formally, if $\sum_i s_i>n$, omit as many terms $s_i$ from the end as necessary and cap the final value of $s_i$.
As a brief remark, note that since $\widehat{\mathrm{CO}}$ only appears in the denominator of the logarithmic term, we can estimate the learning error
$$\text{learning error}=-\mathbb{E}_{S_1,\ldots,S_k;\,i\in[k]}\left[\frac{n}{|S_i|}\sum_{j\in S_i}\log\widehat{\mathrm{CO}}(X_j\mid X_{T_i}=x_{T_i})\right]$$
via training samples; moreover, by positivity of KL, the optimum is precisely $\widehat{\mathrm{CO}}=\mathrm{CO}$, hence with sufficiently many samples, we can expect to learn a good estimate $\widehat{\mathrm{CO}}$.
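The claim that the optimum is attained at $\widehat{\mathrm{CO}}=\mathrm{CO}$ is the familiar fact that cross-entropy satisfies $H(p,q)=H(p)+\mathrm{KL}(p\|q)\ge H(p)$, with equality iff $q=p$. A toy numerical check (ours, not from the paper):

```python
import math

# Sketch (ours): cross-entropy H(p, q) = H(p) + KL(p || q) is minimized
# over q exactly at q = p, which is why the learning-error objective is
# optimized by the true conditional oracle.
def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
assert cross_entropy(p, q) > cross_entropy(p, p)
```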
Now, the following error-decoupling result justifies our assumption that the conditional oracle $\widehat{\mathrm{CO}}$ is perfect.

Lemma C.1 ([LC25]). Let $\mu$ be the data distribution and $(S_1,\ldots,S_k)$ be an unmasking schedule. Moreover, let $\widehat{\mathrm{CO}}$ be a learned conditional marginal oracle which estimates $\mathrm{CO}$. Let $\nu_{S_1,\ldots,S_k}$ be the distribution sampled by $\mathcal{A}_{\mathrm{fixed}}(k,\{S_i\}_1^k)$ using $\mathrm{CO}$, and let $\hat\nu_{S_1,\ldots,S_k}$ be the distribution sampled by $\mathcal{A}_{\mathrm{fixed}}(k,\{S_i\}_1^k)$ using $\widehat{\mathrm{CO}}$. Then
$$\mathrm{KL}\bigl(\mu\,\big\|\,\hat\nu_{S_1,\ldots,S_k}\bigr)=\mathrm{KL}\bigl(\mu\,\big\|\,\nu_{S_1,\ldots,S_k}\bigr)+\mathrm{error}(\mu,\widehat{\mathrm{CO}}).$$
Proof. Observe that
$$\mathrm{KL}\bigl(\mu\,\big\|\,\hat\nu_{S_1,\ldots,S_k}\bigr)-\mathrm{KL}\bigl(\mu\,\big\|\,\nu_{S_1,\ldots,S_k}\bigr)=\int_{\Sigma^n}\mu(x)\log\frac{\nu_{S_1,\ldots,S_k}(x)}{\hat\nu_{S_1,\ldots,S_k}(x)}\,dx$$
$$=\sum_{i=1}^k\int_{\Sigma^n}\mu(x)\log\frac{\nu_{S_1,\ldots,S_k}(x_{S_i}\mid X_{T_i}=x_{T_i})}{\hat\nu_{S_1,\ldots,S_k}(x_{S_i}\mid X_{T_i}=x_{T_i})}\,dx=\sum_{i=1}^k\mathbb{E}_{x\sim\mu}\left[\sum_{j\in S_i}\log\frac{\mu(X_j\mid X_{T_i}=x_{T_i})}{\widehat{\mathrm{CO}}(X_j\mid X_{T_i}=x_{T_i})}\right]$$
$$=\mathbb{E}_{S_1,\ldots,S_k;\,i\in[k]}\left[\frac{n}{|S_i|}\sum_{j\in S_i}\log\frac{\mu(X_j\mid X_{T_i}=x_{T_i})}{\widehat{\mathrm{CO}}(X_j\mid X_{T_i}=x_{T_i})}\right]=\mathrm{error}(\mu,\widehat{\mathrm{CO}}),$$
where the factor of $\frac{n}{|S_i|}$ is due to the distribution of $i$ in the expectation formula for $\mathrm{error}(\mu,\widehat{\mathrm{CO}})$.
D Finishing the proof of Theorem 4.10
Proof. It remains to verify that the family $\mathcal{F}$ constructed in Section 4.2 satisfies the second part of Theorem 4.10. For this, we closely follow the proof of Theorem 4.9. By Proposition 4.4, if one queries the conditional marginal oracle for $\mu^*[UV]$ on a partial assignment of size $<k$, then the response will be identical to the one for $\mu^*[U]$, and likewise if the assignment is of size $>k$ and its projection to the $\mathbb{F}_q^n$ component is incompatible with any element of $V$. We will use the same hit and miss terminology from before.
Let $\mathcal{D}_k^{\mu^*}$ denote the distribution over $\mu^*[UV]$ where $V$ is a random $k$-dimensional RS code. Let $\mathcal{D}$ denote the mixture distribution over $\mathcal{F}$ given by
$$\frac{1}{2}\delta_{\mu^*[U]}+\frac{1}{2n-2}\sum_{k=1}^{n-1}\mathcal{D}_k^{\mu^*}.$$
As before, let $\mathrm{leaf}^*:=\mathrm{leaf}_{T(\mu^*[U])}(\mu^*[U])$. We must have
$$\mathrm{TV}\left(\mu^*[U],\ \sum_{\ell\in\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\cdot\nu_\ell\right)\le 1/8,$$
or else $\mathbb{E}_{\mu\sim\mathcal{D}}[\mathrm{cost}^{\mathrm{TV}}_T(\mathcal{A};\mu)]>1/16$.
For any leaf node $\ell$, let $v_1\to w_1\to v_2\to\cdots\to w_{T-1}\to v_T$ denote the sequence of decision and leaf nodes along the root-to-leaf path to $\ell$, and suppose the edges $(v_i,w_i)$ are labeled with partial assignments $X_{S(i)}=x^{(i)}$. If $\ell\in\mathrm{leaf}^*$, then the edges $(w_i,v_{i+1})$ are labeled with $\nu\otimes\mathrm{Unif}(\mathbb{F}_q)^{\otimes(n-|S(i)|)}$ for some distribution $\nu$ over $\Sigma^n$.

Let $k_1\le\cdots\le k_T$ denote the numbers $|S(1)|,\ldots,|S(T)|$ in sorted order. For any $V$ of dimension $k>k_T$, we have $\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]=\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]$. For $k_j<k<k_{j+1}$, by Lemma 4.5, Eq. (2) holds as before, and if $\ell$ avoids $V$, the oracle's output under every query along the path is of the form $\nu\otimes\mathrm{Unif}(\mathbb{F}_q)^{\otimes(n-|S|)}$ for some distribution $\nu$. In this case, again we have $\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]=\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]$. The same reasoning applies to $k<k_1$.
Let us write
$$\mathbb{E}_{V\sim\mathcal{D}_k}\,\mathrm{TV}\left(\mu^*[UV],\ \sum_{\ell\in\mathrm{leaf}(\mu^*[UV])}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell\right)\ge 1/2-\mathbb{E}_{V\sim\mathcal{D}_k}\,\mathrm{TV}\left(\mu^*[U],\ \sum_{\ell\in\mathrm{leaf}(\mu^*[UV])}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell\right),$$
where we used that $\mathrm{TV}(\mu^*[U],\mu^*[UV])\ge\mathrm{TV}(U,U_V)\ge 1/2$ for any proper subspace $V$. We can rewrite the mixture on the right-hand side as
$$\sum_{\ell\in\mathrm{leaf}^*:\ \text{avoids }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell+\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell+\sum_{\ell\in\mathrm{leaf}(\mu^*[UV])\setminus\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell$$
$$=\sum_{\ell\in\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\cdot\nu_\ell-\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\cdot\nu_\ell+\sum_{\ell\in\mathrm{leaf}(\mu^*[UV])\setminus\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell,\qquad(5)$$
where we used that for $\ell\in\mathrm{leaf}^*$ that avoid $V$, $\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]=\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]$, and for $\ell\in\mathrm{leaf}^*$ that hit $V$, it must be that $\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]=0$, as the sampler under $\mu^*[UV]$ must deviate from the path that leads to $\ell$. As $\sum_{\ell\in\mathrm{leaf}(\mu^*[UV])\setminus\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]=\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]$, the TV between $\mu^*[U]$ and the mixture in Eq. (5) is thus upper bounded by $1/8+\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]$, and thus
$$\mathbb{E}_{V\sim\mathcal{D}_k}\,\mathrm{TV}\left(\mu^*[UV],\ \sum_{\ell\in\mathrm{leaf}(\mu^*[UV])}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[UV]]\cdot\nu_\ell\right)\ge\frac{3}{8}-\mathbb{E}_{V\sim\mathcal{D}_k}\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]].$$
We say that $V$ is $\eta$-good if it satisfies $\sum_{\ell\in\mathrm{leaf}^*:\ \text{hits }V}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\le\eta$ for some $\eta>0$. Observe that
$$\frac{1}{n-1}\sum_{k=1}^{n-1}\Bigl\{\sum_{\ell\in\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\cdot\mathbb{P}_{V\sim\mathcal{D}_k}[\ell\text{ hits }V]\Bigr\}=\sum_{\ell\in\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\cdot\frac{1}{n-1}\sum_{k=1}^{n-1}\mathbb{P}_{V\sim\mathcal{D}_k}[\ell\text{ hits }V]$$
$$\le\sum_{\ell\in\mathrm{leaf}^*}\mathbb{P}_{\mathcal{A}}[\ell\mid\mu^*[U]]\cdot\frac{T(\mu^*[U])+(n-1-T(\mu^*[U]))\,T(\mu^*[U])/q}{n-1}=\frac{T(\mu^*[U])+(n-1-T(\mu^*[U]))\,T(\mu^*[U])/q}{n-1}\le\frac{2T(\mu^*[U])}{n-1},$$
where in the second step we used that for any leaf $\ell$ at distance $2T$ from the root, there are at most $T$ dimensions $0<k<n$ that are equal to the size of some partial assignment along the root-to-leaf path to $\ell$, and for all other dimensions $k$, $\mathbb{P}_{V\sim\mathcal{D}_k}[\ell\text{ hits }V]\le T/q$ by Eq. (2). By Markov's inequality, we conclude that for $\eta:=\frac{4T(\mu^*[U])}{n-1}\ll 1$ (here we used the hypothesis from Eq. (4)),
$$\mathbb{P}_{0<k<n,\,V\sim\mathcal{D}_k}[V\text{ is }\eta\text{-good}]\ge 1/2.$$
We conclude that
$$\mathbb{E}_{\mu\sim\mathcal{D}}[\mathrm{cost}^{\mathrm{TV}}_T(\mathcal{A};\mu)]\ge\frac{1}{2}\cdot\mathbb{P}_{0<k<n,\,V\sim\mathcal{D}_k}[V\text{ is }\eta\text{-good}]\cdot\Bigl(\frac{3}{8}-\eta\Bigr)\ge\frac{1}{16}.$$