arXiv:2511.09708v1  [cs.LG]  12 Nov 2025
Efficient Hyperdimensional Computing with
Modular Composite Representations
Marco Angioli, Christopher J. Kymn, Antonello Rosato, Amy Loutfi, Mauro Olivieri, and Denis Kleyko
Abstract—The modular composite representation (MCR) is a computing model that represents information with high-dimensional integer vectors using modular arithmetic. Originally proposed as a generalization of the binary spatter code model, it aims to provide higher representational power while remaining a lighter alternative to models requiring high-precision components. However, despite this potential, MCR has received limited attention in the literature. Systematic analyses of its trade-offs and comparisons with other models, such as binary spatter codes, multiply-add-permute, and Fourier holographic reduced representation, are lacking, sustaining the perception that its added complexity outweighs the improved expressivity over simpler models. In this work, we revisit MCR by presenting its first extensive evaluation, demonstrating that it achieves a unique balance of information capacity, classification accuracy, and hardware efficiency. Experiments measuring information capacity demonstrate that MCR outperforms binary and integer vectors while approaching complex-valued representations at a fraction of their memory footprint. Evaluation on a collection of 123 classification datasets confirms consistent accuracy gains and shows that MCR can match the performance of binary spatter codes using up to 4.0× less memory. We investigate the hardware realization of MCR by showing that it maps naturally to digital logic and by designing the first dedicated accelerator for it. Evaluations on basic operations and seven selected datasets demonstrate a speedup of up to three orders of magnitude and significant energy reductions compared to a software implementation. Furthermore, when matched for accuracy against binary spatter codes, MCR achieves on average 3.08× faster execution and 2.68× lower energy consumption. These findings demonstrate that, although MCR requires more sophisticated operations than binary spatter codes, its modular arithmetic and higher per-component precision enable much
The work of CJK was supported by the Center for the Co-Design of Cognitive Systems (CoCoSys), one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA, in addition to the NDSEG Fellowship, Fernström Fellowship, Swartz Foundation, and NSF Grants 2147640 and 2313149. The work of AL and DK was supported by the Knut and Alice Wallenberg Foundation under the Wallenberg Scholars program (Grant No. KAW2023.0327). DK acknowledges funding from the Swedish Strategic Research Foundation under the Future Research Leaders program (Grant No. FFL24-0111) and the Swedish Research Council under the Starting Grant program (Grant No. 2025-05421). Corresponding authors: Marco Angioli and Denis Kleyko.
M. Angioli, A. Rosato, and M. Olivieri are with the Department of Information Engineering, Electronics and Telecommunications at Sapienza University of Rome, 00184 Rome, Italy (e-mails: {marco.angioli; antonello.rosato; mauro.olivieri}@uniroma1.it).
C. J. Kymn is with the Redwood Center for Theoretical Neuroscience at the University of California at Berkeley, CA 94720, USA (e-mail: cjkymn@berkeley.edu).
A. Loutfi is with the AI, Robotics and Cybersecurity Center (ARC), and with the Department of Computer Science at Örebro University, 70281 Örebro, Sweden, and also with the Department of Science and Technology at Linköping University, 58183 Linköping, Sweden (e-mail: amy.loutfi@oru.se).
D. Kleyko is with the AI, Robotics and Cybersecurity Center (ARC), and with the Department of Computer Science at Örebro University, 70281 Örebro, Sweden, and also with the Intelligent Systems Lab at RISE Research Institutes of Sweden, 16440 Kista, Sweden (e-mail: denis.kleyko@oru.se).
lower dimensionality of representations. When realized with our dedicated hardware accelerator, this results in a faster, more energy-efficient, and high-precision alternative to existing models.
Index Terms—Modular Composite Representation, Hyperdimensional Computing, Hardware Acceleration
I. INTRODUCTION
Hyperdimensional computing, also known as vector symbolic architectures (HD/VSA), is a computing paradigm that combines the flexibility of connectionist models with structured transformations of vectors [1], [2]. At its core, HD/VSA represents information using high-dimensional distributed randomized vectors, known as hypervectors (HVs), which are combined and compared through simple vector arithmetic [3], [4]. This principle endows HD/VSA with several attractive properties, including robustness to noise, energy and computational efficiency, and inherent parallelism, which have made HD/VSA a compelling and promising alternative for deployment in resource-constrained scenarios, digital hardware accelerators, and novel neuromorphic devices [5], and for modeling and solving classification [6], [7], regression [8], [9], clustering [10], [11], and reinforcement learning tasks [12]–[14].
Over the years, HD/VSA has been implemented using many different kinds of vector spaces, with data types ranging from binary to real and complex numbers, each striking a different balance between information capacity and computational efficiency [3], [4]. Within this spectrum, the modular composite representation (MCR) model, introduced in 2014 and based on modular integer arithmetic, generalizes binary HD/VSA, aiming to increase representational power while remaining a lighter alternative to models requiring floating-point arithmetic. Yet, despite this potential, MCR has received limited attention in the literature, probably because it is perceived as neither as efficient as binary HD/VSA nor as expressive as integer- or real-valued models.
In this article, we present the first extensive study of MCR, guided by two questions: when and why is MCR beneficial? and what are its execution time and energy consumption when implemented in dedicated hardware? We answer them across three complementary dimensions. First, in Section III-A, we show that MCR achieves substantially higher information capacity than binary and low-precision integer HVs, while approaching the expressiveness of complex-valued models at a fraction of the memory footprint. Second, in Section III-B, through large-scale experiments on 123 classification datasets, we demonstrate that per-component precision matters: the modular discretized space of MCR consistently outperforms
binary and low-precision integer HVs under both equal dimensionality and equal memory footprint settings, and can reduce the HV dimensionality by up to 4× while still surpassing binary models. Finally, in Section III-C, we demonstrate that the modular arithmetic at the core of MCR maps seamlessly to digital hardware and we design the first dedicated accelerator for MCR, extending the RISC-V accelerator for binary HD/VSA [15]. Hardware experiments on the basic operations and seven classification datasets provide details about its execution time, energy efficiency, and scaling behavior. In addition, by directly comparing hardware-accelerated MCR with hardware-accelerated binary models (Section III-D), we demonstrate that although MCR requires more sophisticated arithmetic, its lower HV dimensionality paired with efficient hardware support translates into a faster, more energy-efficient, more compact, and higher-precision alternative to existing HD/VSA models.
II. BACKGROUND AND METHODS
A. Hyperdimensional Computing/Vector Symbolic Architectures
HD/VSA are a family of neuro-inspired computational models that leverage the mathematics of high-dimensional spaces to represent, combine, and manipulate information [4]. The motivation behind HD/VSA stems from the observation that brains operate with massively parallel circuits built from unreliable components, yet still achieve remarkable robustness and generalization [1]. Building on this analogy, HD/VSA represents information using HVs where information is distributed: each component contributes equally and independently to the overall representation [3]. This distributed and holistic representation confers two crucial properties on HD/VSA. First, it provides robustness to noise and faults, since even when a large fraction of components are corrupted, the overall representation can still be reliably recognized [16], [17]. Second, it offers inherent parallelism, because all HV components can be processed independently, making HD/VSA particularly well-suited for efficient implementations in digital hardware [15], [18]–[21] as well as emerging neuromorphic devices [22]–[24].
Numerous HD/VSA models have been proposed, yet they all support the same basic operations [4], [7]. At the core of these models lies the fact that fundamental concepts, symbols, or elements can be mapped to random HVs that, thanks to the geometric properties of high-dimensional spaces, are nearly orthogonal, i.e., linearly independent [1]. These HVs can then be combined and manipulated using a small set of basic operations:
•Binding (◦): associates arguments (e.g., key–value pairs) into a new HV that is dissimilar to the arguments.
•Superposition (+): combines multiple HVs into a single composite HV that remains similar to each argument. Since this aggregation can produce components outside the admissible dynamic range, a subsequent normalization step can be applied to project the HV back into its original domain.
Fig. 1. Redrawn illustration from [32] showing the possible discretized values in Z_16 for a single component of an MCR HV with r = 16. The complement of a value is defined as the element whose sum with it equals r, while the opposite is its antipode, i.e., the value obtained by adding r/2.
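As a quick illustration of these two notions, the complement and opposite can be computed in one line each (a minimal sketch; the function names are ours, not from [32]):

```python
def complement(v, r=16):
    # Element whose modular sum with v equals r (i.e., 0 mod r).
    return (r - v) % r

def opposite(v, r=16):
    # Antipode of v on the discretized circle: v shifted by half a revolution.
    return (v + r // 2) % r

print(complement(5), opposite(5))  # 11 13
```

For example, with r = 16 the complement of 5 is 11 (since 5 + 11 = 16) and its opposite is 13.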
•Permutation (ρ): reorders the components of an HV to produce a dissimilar one, and is used to represent order or role information, such as in sequences.
Finally, a distance metric (δ) quantifies how close two HVs are in high-dimensional space, serving as the basic mechanism for recognition and retrieval. Together, these four operations form the algebra of HD/VSA, enabling the composition of complex information without increasing dimensionality.
The implementation of these operations varies depending on the characteristics of the vector space in which HVs are defined [4]. The components of HVs can be of different data types, with each choice giving rise to a distinct HD/VSA model. For instance, the binary spatter codes (BSC) model [25] operates with binary HVs, while the multiply–add–permute (MAP) [26] model exists in several variants: MAP-B with binary/bipolar values, MAP-I with integers, and MAP-C with real numbers. Holographic reduced representations (HRR) and Fourier holographic reduced representations (FHRR) [27] employ real and complex numbers, respectively. This variety of models points to the trade-off between information capacity and computational complexity, enabling adaptation to different application domains and hardware platforms. For example, BSC is often favored in resource-constrained scenarios [28] and in-memory computing [29], whereas complex-valued FHRR is attractive for neuromorphic hardware [22], [24], [30], [31].
B. Modular Composite Representation
MCR is an HD/VSA model introduced in [32] where information is represented with vectors consisting of D integer-valued components. MCR relies on modular arithmetic to define its operations. Each component of an HV lies on a discretized unit circle Z_r, where r is the chosen modulus, so that values range from 0 to r−1 and incrementing r−1 wraps
around to 0. Modular vectors form a homogeneous space in the sense that for any two vectors h and u, there exists a third vector c such that binding h with u produces c. Fig. 1 illustrates an example of this discretized space with r = 16, where the complement of a value is defined such that their sum equals r, and the opposite is obtained by adding r/2.
Within this space, MCR defines binding between two HVs h and u as the component-wise modular sum (c_i = mod_r(h_i + u_i)) and unbinding as the component-wise modular subtraction (c_i = mod_r(h_i − u_i)). Similarity is measured by a modular variant of the Manhattan distance that is defined as follows:
\[
\delta(\mathbf{h},\mathbf{u}) = \sum_{i=1}^{D} \min\big(\mathrm{mod}_r(h_i - u_i),\; \mathrm{mod}_r(u_i - h_i)\big). \tag{1}
\]
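These three operations take only a few lines of NumPy (an illustrative sketch; the function and variable names are ours):

```python
import numpy as np

r = 16  # modulus

def bind(h, u):
    return (h + u) % r            # component-wise modular sum

def unbind(c, u):
    return (c - u) % r            # component-wise modular subtraction

def distance(h, u):
    d = (h - u) % r               # Eq. (1): modular Manhattan distance
    return int(np.minimum(d, r - d).sum())

rng = np.random.default_rng(0)
h, u = rng.integers(0, r, 1000), rng.integers(0, r, 1000)
assert np.array_equal(unbind(bind(h, u), u), h)  # unbinding inverts binding
assert distance(h, h) == 0
```

Note that binding is invertible exactly, and the distance between two random HVs concentrates around D·r/4, which is what makes unrelated HVs reliably separable.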
The superposition (also called bundling) of m HVs h^(1), ..., h^(m) is carried out in two steps. First, each component h_i^(j) ∈ [0, r−1] of the j-th HV is projected to an equivalent point on the unit circle (cf. Fig. 1), and summed in \(\mathbb{C}\):
\[
v_i = \sum_{j=1}^{m} \Big( \cos\big(\tfrac{2\pi}{r} h_i^{(j)}\big) + i \sin\big(\tfrac{2\pi}{r} h_i^{(j)}\big) \Big). \tag{2}
\]
Next, since the resultant HV v no longer lies on the discretized unit circle Z_r, its components are mapped back to the nearest elements in Z_r by finding integers from [0, r−1] with phases that are nearest to the phases of the components of the resultant HV:
\[
w_i = \mathrm{mod}_r\Big( \big\lfloor \tfrac{r}{2\pi}\, \mathrm{mod}_{2\pi}\big( \mathrm{atan2}(\Im(v_i), \Re(v_i)) \big) \big\rceil \Big). \tag{3}
\]
If one component happens to be zero (e.g., when an element is superimposed with its inverse), its undefined phase is resolved by assigning it to the integer closest to the arithmetic mean of the original arguments in the integer domain:
\[
w_i = \mathrm{mod}_r\Big( \big\lfloor \tfrac{1}{m} \sum_{j=1}^{m} h_i^{(j)} \big\rceil \Big). \tag{4}
\]
Notably, each discretization step (hereafter referred to as normalization) results in information loss. Therefore, when superimposing multiple HVs, the normalization in Eq. (3) should be deferred until the final step [32].
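A sketch of Eqs. (2)–(4) with the recommended deferred normalization (illustrative NumPy only; the zero-magnitude tie-break of Eq. (4) is applied component-wise wherever the complex sum vanishes):

```python
import numpy as np

def bundle(hvs, r=16):
    hvs = np.asarray(hvs)
    # Eq. (2): interpret each component as a phasor and sum in C.
    v = np.exp(2j * np.pi * hvs / r).sum(axis=0)
    # Eq. (3): snap each resultant phase back to the nearest element of Z_r.
    w = np.rint(r * (np.angle(v) % (2 * np.pi)) / (2 * np.pi)).astype(int) % r
    # Eq. (4): where the sum vanishes, fall back to the rounded integer mean.
    zero = np.isclose(np.abs(v), 0.0)
    w[zero] = np.rint(hvs[:, zero].mean(axis=0)).astype(int) % r
    return w

print(bundle([[3, 2], [3, 10]]))  # [3 6]: agreement is kept, opposites averaged
```

In the second component, 2 and 10 are opposites on the r = 16 circle, so their phasors cancel and the fallback of Eq. (4) yields their integer mean, 6.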
MCR can be seen as a middle ground between the properties and advantages of BSC and FHRR. It was originally proposed as a generalization of BSC, inheriting its low memory requirements [32]. In fact, when r = 2, MCR is equivalent to BSC. For r > 2, however, its operations generalize the modular arithmetic of BSC beyond Z_2. At the same time, under the interpretation of integers in [0, r−1] as phasors on the unit circle (the r-th roots of unity), the binding and superposition operations of MCR implement a discretized FHRR, offering comparable expressiveness while being more computationally efficient and easier to implement with LUTs (as we report in Section III-C1). In this sense, MCR can be seen as a model covering the whole spectrum between BSC and FHRR, balancing expressiveness and efficient implementation. In line with this discretized-phase view, Yu et al. introduced the Cyclic Group Representation (CGR) model [33], which, similarly to MCR, implements the binding operation as the component-wise modular sum. Analysis in [33] shows that certain similarity functions cannot be expressed by binary HVs but can be realized within CGR; accordingly, MCR with r > 2 also possesses this broader expressivity. Furthermore, modules of MCR/CGR can be combined to form residue number systems [34], which has promising applications to unconventional computing [35] and neuroscience [36], [37].
No prior work has systematically investigated MCR. Beyond being briefly mentioned in a survey [3] and implemented in software [38], its performance on real datasets and potential for hardware implementation have not been thoroughly evaluated. The existing literature has predominantly focused on the two “extreme” points of the design spectrum, BSC and FHRR, while largely overlooking integer-valued alternatives. Perhaps this was because the added computational cost of such designs seemed unjustified relative to their potential gains over binary HVs. Even the original study [32] conjectured that the simpler operations of BSC might be preferable in resource-constrained scenarios with strict timing requirements. In this study, we revisit that conjecture by systematically analyzing the algorithmic performance and hardware complexity of MCR, demonstrating that it offers a favorable trade-off between HV dimensionality and component dynamic range compared to many existing HD/VSA models.
C. Capacity of hypervectors
In HD/VSA, the information capacity is measured as the
maximum amount of information (bits) that can be decoded
from an HV representing a data structure.
We adopt the setup from [39]–[41] to evaluate the information capacity of MCR and compare it with the well-known BSC, MAP-I, and FHRR models. Given a codebook matrix Φ of d symbols, where each column Φ_j is a fixed D-dimensional HV representing symbol j, we uniformly sample a sequence s = (s_1, ..., s_m) of length m and construct its composite HV as follows:
\[
\phi(\mathbf{s}) = \sum_{j=1}^{m} \rho^{\,m-j}(\Phi_{s_j}), \tag{5}
\]
where ρ(·) is a permutation operation representing the position of each symbol within the sequence.
The decoding task then consists of reconstructing the original sequence s from φ(s), producing an estimate ŝ as close as possible to s. This can be done using various decoding techniques (see, e.g., [40], [41]). In this study, we use the simplest one – Codebook decoding, where each symbol in the codebook is compared to the permuted composite HV ρ^{−(m−j)}(φ(s)) using an appropriate distance metric, δ, and the symbol with the lowest distance is selected as the estimate ŝ_j, as in Eq. (6). This step is repeated for all positions j ∈ [1, m] to decode the complete sequence ŝ:
\[
\hat{s}_j = \operatorname*{argmin}_{c \in \{1,\dots,d\}} \delta\big(\Phi_c,\ \rho^{-(m-j)}(\phi(\mathbf{s}))\big). \tag{6}
\]
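For MCR, the whole encode–decode loop of Eqs. (5)–(6) can be sketched as follows (illustrative only; we use `np.roll` as the permutation ρ and the deferred complex accumulation described in Section II-B):

```python
import numpy as np

r, D, d, m = 16, 500, 5, 3
rng = np.random.default_rng(1)
Phi = rng.integers(0, r, (d, D))       # codebook: one random HV per symbol

def mcr_dist(h, u):
    diff = (h - u) % r
    return int(np.minimum(diff, r - diff).sum())

def encode(seq):
    # Eq. (5): sum of positionally permuted symbol HVs (complex accumulation),
    # normalized back to Z_r only once at the end.
    v = sum(np.exp(2j * np.pi * np.roll(Phi[s], m - 1 - j) / r)
            for j, s in enumerate(seq))
    return np.rint(r * (np.angle(v) % (2 * np.pi)) / (2 * np.pi)).astype(int) % r

def decode(phi):
    # Eq. (6): undo each position's permutation, pick the closest codebook HV.
    return [min(range(d), key=lambda c: mcr_dist(Phi[c], np.roll(phi, -(m - 1 - j))))
            for j in range(m)]

assert decode(encode([4, 0, 2])) == [4, 0, 2]  # short sequences decode exactly
```

With D = 500 and only m = 3 superimposed symbols, the crosstalk from the other positions is far below the expected distance to an unrelated codebook entry, so decoding is exact; capacity experiments probe how this degrades as m grows.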
Following the setup in [40], [41], the decoding performance and, thus, the capacity of HVs are evaluated in terms of two performance metrics. The first metric is the decoding accuracy, defined in Eq. (7) as the proportion of correctly decoded symbols, averaged over g trials:
\[
a = \frac{1}{gm} \sum_{k=1}^{g} \sum_{j=1}^{m} \mathbb{I}\big[s_j^{(k)} = \hat{s}_j^{(k)}\big]. \tag{7}
\]
The second metric is the information rate. For a given decoding accuracy a and codebook size d, the amount of information decoded per symbol can be computed as follows:
\[
I_{\mathrm{symb}}(a,d) = a\log_2(da) + (1-a)\log_2\!\left(\frac{d}{d-1}(1-a)\right). \tag{8}
\]
The total information decoded from a sequence of length m can then be calculated as:
\[
I_{\mathrm{tot}} = m\, I_{\mathrm{symb}}(a,d). \tag{9}
\]
Normalizing with respect to the dimensionality D yields the information rate per component (I_dim):
\[
I_{\mathrm{dim}} = \frac{I_{\mathrm{tot}}}{D}. \tag{10}
\]
Furthermore, we also report the information rate per storage bit (I_bit), Eq. (11), where b denotes the number of bits required per HV component. This metric accounts for both accuracy and memory footprint and is, therefore, suitable for comparing models with components requiring different precision, such as binary, integer-, real-, or complex-valued:
\[
I_{\mathrm{bit}} = \frac{I_{\mathrm{tot}}}{Db}. \tag{11}
\]
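Equations (8)–(11) map directly to a small helper (a sketch; the a = 1 branch avoids the logarithm of zero at perfect accuracy):

```python
from math import log2

def info_metrics(a, d, m, D, b):
    """Information per symbol, total, per dimension, and per storage bit."""
    if a >= 1.0:
        i_symb = log2(d)           # limit of Eq. (8) as a -> 1
    else:
        i_symb = a * log2(d * a) + (1 - a) * log2(d / (d - 1) * (1 - a))  # Eq. (8)
    i_tot = m * i_symb             # Eq. (9)
    return i_symb, i_tot, i_tot / D, i_tot / (D * b)   # Eqs. (10)-(11)

# e.g., perfect decoding of m = 100 symbols with d = 15, D = 500, b = 4:
# I_tot = 100 * log2(15) ≈ 390.7 bits, I_bit ≈ 0.195 bit/bit
```

The per-bit variant simply divides the same total by D·b, which is what lets a 4-bit MCR component be compared fairly against a 128-bit FHRR component.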
D. Classification with HVs
In this study, we compare the classification performance of different HD/VSA models across a collection of classification datasets with tabular data. Therefore, we use the key–value pair transformation to represent feature vectors as HVs. For tabular data, this transformation has been shown to achieve the best trade-off between execution time and accuracy in this setting [7], [42]. Given a d-dimensional feature vector, x = [x_1, x_2, ..., x_d], each feature j is assigned a random HV r^(j) acting as a key. This HV is then bound to a value vector (ψ(x_j)) that is obtained by quantizing the corresponding feature value x_j and mapping it into the high-dimensional space using the thermometer code [43], [44]. In MCR, the thermometer code is realized by assigning 0 and r/2 as the two extreme values. Such a mapping ensures that similar feature values are transformed into similar HVs. The outcome of the bindings is d key–value pairs, which are then superimposed, resulting in the composite HV φ(x) representing the feature vector:
\[
\phi(\mathbf{x}) = \sum_{j=1}^{d} \mathbf{r}^{(j)} \circ \psi(x_j). \tag{12}
\]
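The transformation of Eq. (12) can be sketched as follows (illustrative; the exact thermometer layout, a prefix of components set to the extreme value r/2 whose length grows with the feature value, is our assumption based on the description above):

```python
import numpy as np

r, D, L = 16, 1024, 1024      # modulus, HV dimensionality, quantization levels
rng = np.random.default_rng(0)
keys = rng.integers(0, r, (3, D))         # one random key HV r^(j) per feature

def thermometer(x_j):
    # Value HV psi(x_j) for x_j in [0, 1]: extremes are 0 and r/2.
    q = round(x_j * (L - 1))
    hv = np.zeros(D, dtype=int)
    hv[: (q * D) // (L - 1)] = r // 2
    return hv

def encode_features(x):
    # Eq. (12): bind key to value (modular sum), superimpose in C, normalize.
    v = sum(np.exp(2j * np.pi * ((k + thermometer(xj)) % r) / r)
            for k, xj in zip(keys, x))
    return np.rint(r * (np.angle(v) % (2 * np.pi)) / (2 * np.pi)).astype(int) % r

phi = encode_features([0.2, 0.7, 0.5])    # composite HV of a 3-feature sample
```

Because nearby feature values share most of their thermometer prefix, their value HVs, and hence the bound key–value pairs, stay similar, which is the property the classifier relies on.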
To train the classifier, we adopt a prototype-based Learning Vector Quantization (LVQ) approach [45], [46]. During the first epoch, prototypes are initialized using a simple centroid classifier [47]–[49], where all HVs belonging to the same class are aggregated into a class prototype:
\[
\mathbf{p}_c = \sum_{k:\, y^{(k)} = c} \phi(\mathbf{x}^{(k)}). \tag{13}
\]
From the second epoch onward, LVQ2.1 [45], [50] is used according to Eq. (14), where two prototypes are updated per training sample: the correct prototype p^+ and the closest incorrect prototype p^−. For a sample φ(x) with label c, distance is computed using the appropriate model-specific metric: Hamming for BSC, cosine for MAP, modular Manhattan for MCR, Eq. (1).
\[
\begin{cases}
\mathbf{p}^+ \leftarrow \mathbf{p}^+ + \epsilon\,(\phi(\mathbf{x}) - \mathbf{p}^+) \\
\mathbf{p}^- \leftarrow \mathbf{p}^- - \epsilon\,(\phi(\mathbf{x}) - \mathbf{p}^-).
\end{cases} \tag{14}
\]
Here ε is the learning rate. The learning rule is triggered only if φ(x) lies inside the LVQ2.1 window:
\[
\min\!\left( \frac{\delta(\phi(\mathbf{x}), \mathbf{p}^-)}{\delta(\phi(\mathbf{x}), \mathbf{p}^+)},\ \frac{\delta(\phi(\mathbf{x}), \mathbf{p}^+)}{\delta(\phi(\mathbf{x}), \mathbf{p}^-)} \right) > s, \qquad s = \frac{1-\omega}{1+\omega}, \tag{15}
\]
where in the conducted experiments ω is set to 0.1 following [45]. To prevent the uncontrolled growth of prototypes’ norms during iterative updates and ensure the correct operation of LVQ2.1, the L2-norms of all class prototypes are reset to one after each training epoch.
To minimize the loss of precision, we maintain high intermediate precision (floating-point) during training and normalize class prototypes only after the end of training. For BSC, this corresponds to the sign function. For MAP-I with low precision, we first rescale the distribution of prototypes’ values to match the dynamic range of the targeted b-bit interval [−2^{b−1}, 2^{b−1}−1], and then apply uniform quantization. For MAP-C32 (32 bits per component), no normalization is applied, maintaining prototypes in floating-point precision. For MCR, prototype updates are accumulated in the complex domain, Eq. (2), and then discretized back to Z_r as described by Eq. (3).
During inference, each test sample is represented following Eq. (12) and classified using the class label of the nearest prototype as follows:
\[
\hat{y}(\phi(\mathbf{x})) = \operatorname*{argmin}_{c}\ \delta(\phi(\mathbf{x}), \mathbf{p}_c). \tag{16}
\]
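Inference (Eq. (16)) thus reduces to a nearest-prototype search under the model-specific distance (a minimal sketch for MCR; `prototypes` is assumed to hold one normalized HV per class):

```python
import numpy as np

r = 16

def mcr_dist(h, u):
    diff = (h - u) % r
    return int(np.minimum(diff, r - diff).sum())   # Eq. (1)

def classify(phi_x, prototypes):
    # Eq. (16): return the index of the nearest class prototype.
    return int(np.argmin([mcr_dist(phi_x, p) for p in prototypes]))

prototypes = np.array([[0, 0, 0, 0], [8, 8, 8, 8]])
print(classify(np.array([1, 0, 15, 0]), prototypes))  # 0 (closer to class 0)
```

Each query therefore costs one distance evaluation per class, i.e., O(C·D) integer operations for C classes, which is the workload the hardware accelerator of Section II-E targets.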
E. Hyperdimensional Coprocessor Unit
To demonstrate the practical feasibility of MCR arithmetic in digital hardware, in this study we extend the Hyperdimensional Coprocessor Unit (HDCU) presented in [15] to support the complete set of MCR operations. The HDCU is a configurable, general-purpose, open-source accelerator, originally designed for BSC and tightly coupled with the Klessydra-T03 RISC-V core [51], operating as a coprocessor for the core. Its architecture is based on three guiding principles:
•General-purpose HDC acceleration: instead of hard-wiring fixed learning pipelines, the HDCU accelerates the core HD/VSA operations: binding, superposition, permutation, and distance, via dedicated functional units.
•Software programmability: a custom RISC-V Instruction Set Extension (ISE) for HD/VSA enables the control of the accelerator via intrinsic functions fully integrated into the GNU Compiler Collection (GCC) toolchain. This allows the same hardware to be programmed for
Fig. 2. Decoding performance and information capacity analysis for MCR compared to BSC, MAP-I, and FHRR (panels show decoding accuracy, information rate per dimension, and information rate per storage bit as functions of sequence length). Results are shown for three different codebook sizes (d = 5, 15, 100), using HVs of size D = 500 and sequence lengths ranging from m = 10 to m = 400, averaged over 20 independent codebooks with 50 test sequences per length. MCR variants achieve higher capacity than BSC and MAP-I variants and approach the performance of FHRR while maintaining much greater memory efficiency.
accelerating diverse HD/VSA tasks such as classification, regression, or reinforcement learning.
•Hardware configurability: extensive synthesis-time parameters (e.g., degree of parallelism, supported operations, local memory size) allow designers to trade off performance against resource usage, tailoring the accelerator to the requirements of the target platform.
Further details on the original HDCU internals and interface can be found in [15].
III. RESULTS
A. Capacity analysis
Following the experiment design introduced in Section II-C, we perform experiments to measure the information capacity of MCR and several other well-known HD/VSA models: BSC, MAP-I, and FHRR.
Fig. 2 reports the obtained results for three different codebook sizes (d ∈ {5, 15, 100}), using HVs of size D = 500 and sequence lengths ranging from m = 10 to m = 400, with analysis performed following the methods reported in [40], [41]. We compare MCR with different quantizations (r ∈ {8, 16, 32}, corresponding to 3, 4, and 5 storage bits per component, respectively), BSC (1 bit per component), MAP-I (3, 4, 5, and 32 bits per component), and FHRR (128 bits per component, with 64 bits each for the real and imaginary parts). For MCR, MAP-I, and BSC, to minimize information loss, the composite HV φ(s) was first constructed using full-precision accumulation, and only afterwards normalized back to the target precision. The results are averaged over 20 independent random codebooks, with 50 randomly generated sequences for each considered sequence length m.
As expected, the results demonstrate that MCR drastically outperforms BSC as the sequence length increases, with an average improvement in decoding accuracy of 25.5% across sequence lengths from 10 to 400 and the three codebook sizes. More impressively, each MCR variant outperforms the corresponding MAP-I variant with matching precision, yielding average gains of 12.01% for 3 bits, 10.80% for 4 bits, and 8.88% for 5 bits. This positions the modular space of MCR as a superior choice over MAP-I quantization for memory-constrained environments. Surprisingly, MCR even performs significantly better than the unconstrained MAP model with 32 bits per component, with an average accuracy gain of 7.63%. Although this may appear counterintuitive, the difference in performance comes from the different properties of superposition adopted by the two models. While MAP relies on a simple integer sum (in effect, only a real part), MCR interprets integers as discretized phasors on the unit circle, and then performs vector addition in \(\mathbb{C}\). The greater expressivity in the complex plane preserves more information during superposition and explains why MCR achieves higher capacity.
Fig. 3. Average accuracy across 123 classification datasets for 1-bit BSC, MAP-I4, MCR-4, and MAP-C32 as a function of the memory footprint per class prototype (bits × dimensions), using HVs of size D = 1024. MCR outperforms BSC and MAP-I4 and approaches MAP-C32.
This fact is also consistent with the strong performance of FHRR decoding, which operates in the same manner except for quantization. FHRR has slightly higher decoding accuracy than MCR, although at the cost of requiring many more bits, as we will see. Notably, increasing r yields diminishing returns in MCR: although higher r reduces phase quantization error, the normalization step discards magnitude, preventing MCR from fully converging to FHRR.
When the results are analyzed in terms of information per bit, MCR clearly outperforms all other HD/VSA models. Despite its high capacity, FHRR has the lowest I_bit due to the 128 bits required per component. For the same reason, MAP-I32 is the second lowest. MAP-I variants with low precision improve memory efficiency by using only three to five bits, but they still underperform compared to 1-bit BSC. In contrast, MCR combines high capacity with low precision, consistently achieving the highest information-per-bit rate across all settings.
B. Classification performance
To evaluate the performance of MCR on real datasets, we carried out a large-scale study on a collection of 123 classification datasets from [7]. This collection is based on a popular collection from [52] representing a subset of 121 datasets from the UCI Machine Learning repository [53]. We compare MCR-4 to BSC (1 bit), MAP-I4, and MAP-C32. The objective of this analysis is to understand whether the higher per-component precision and higher capacity of MCR translate into tangible improvements in classification accuracy when applied to diverse data, and whether it provides a better trade-off under strict memory constraints.
For our experiments, we used an HV dimensionality D = 1024, key–value transformation with 1024 quantization levels generated using the thermometer code, 10 training epochs, and LVQ2.1 with the default parameters: ω = 0.1 and ε = 0.01. We averaged the results over 20 independent runs per dataset.
Fig. 3 reports the results for MCR-4 compared to the baseline models, plotting the average accuracy versus the memory footprint per class prototype (bits × dimension). The results
footprint per class prototype (bits ×dimension). The results40 50 60 70 80 90 100
Accuracy MAP-I4405060708090100Accuracy MCR -4
Fig. 4. Scatter plot comparing classiﬁcation accuracy of MA P-I4 (x-axis)
against MCR-4 ( y-axis) across 123 classiﬁcation datasets. Each point cor-
responds to one dataset. The dashed diagonal represents equ al performance
between the two models: the points above the diagonal indica te datasets where
MCR-4 outperforms MAP-I4, while the points below indicate t he opposite.
confirm that MCR-4, with its higher per-component precision (D = 1024, b = 4), outperforms BSC (D = 1024, b = 1), achieving an average gain of +4.84%. More importantly, MCR-4 also outperforms MAP-I4 (D = 1024, b = 4) under the same memory footprint, with an average gain of +0.61%, and approaches the accuracy of full-precision MAP-C32 (−1.57%) while requiring only a quarter of its memory footprint.
A complementary view of these results is shown in Fig. 4, which compares the relative accuracy of MCR-4 and MAP-I4 on each dataset. Most points lie above the diagonal, confirming that MCR provides systematic improvements.
To further investigate the accuracy–memory trade-off, Fig. 5 presents the results for MCR-4 with a reduced number of components in HVs. When D is reduced to 256, which gives MCR-4 the same memory footprint as BSC (D = 1024, b = 1), MCR-4 still outperforms the binary model with an average improvement of +3.94%. Even more impressively, MCR-4 remains robust when dimensionality is further decreased: with only 64 components, MCR-4 still surpasses BSC (+1.14%) while requiring 75% less memory.
These results clearly demonstrate that per-component precision matters: the few-bit quantization defined by MCR enables superior accuracy–memory efficiency compared to simple binarization or restricting the integer range with the clipping function. It also supports a substantial reduction in dimensionality while maintaining competitive classification performance.
C. Hardware design for MCR
The previous sections demonstrated that MCR achieves a superior trade-off between memory footprint, classification accuracy, and information capacity compared to other well-known HD/VSA models. However, a natural concern is whether these improvements come at the expense of additional computational costs. Indeed, the original MCR study [32] conjectured that BSC, with its simpler operations, could be preferable for resource-constrained scenarios. This raises an
Fig. 5. Average classification accuracy across 123 classification datasets for MCR-4 with different D (2048, 1024, 256, 128, 64, and 32) compared to BSC (D = 1024, b = 1), as a function of the memory footprint per class prototype (bits × dimensions). With the same memory footprint, MCR-4 (D = 256, b = 4) outperforms BSC by +3.94% on average. Even at D = 64, requiring only a quarter of BSC’s memory, MCR-4 still achieves a +1.14% improvement, highlighting its superior accuracy–memory trade-off.
important question: can MCR-specific hardware design offset the additional computational cost of MCR in terms of execution time and energy consumption? In this section, we provide an answer by moving from software simulations to hardware design, showing that the arithmetic of MCR naturally maps onto digital circuits and presenting the first dedicated hardware accelerator for this HD/VSA model.
1) Hardware-friendly MCR arithmetic: There are two
aspects that might initially be perceived as computationally
expensive in MCR: the use of modular reductions and the
reliance on trigonometric functions to perform normalization.
a) Modular reductions: As described in Section II-B, the
binding, unbinding, and distance operations in MCR require
applying a modular reduction (mod_r) to every HV component
to remain in Z_r. This frequent operation might seem costly,
since modulo is normally computed in hardware by a divider
unit: division is neither associative nor distributive, making
parallelization difficult and leading to high area and power
costs even with optimized dividers [54]. However, when
implemented in digital hardware with r chosen as a power
of 2, modular reduction actually comes for free: if each
component is stored using b = ⌈log2(r)⌉ bits, the modulo
operation is automatically performed by binary overflow. For
example, with r = 16, storing each component with 4 bits
ensures that 14 + 5 = 19 overflows to 3, which is exactly
mod_16(19). As a result, binding reduces to integer addition,
unbinding to integer subtraction, and distance computation
to two subtractions followed by a min operation. These
operations are lightweight, fully parallelizable, and incur no
additional cost beyond standard integer arithmetic.
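A software sketch of this trick (illustrative only, assuming r = 16 and NumPy integer arrays; not the accelerator's actual code):

```python
import numpy as np

R = 16          # modulus, a power of 2
MASK = R - 1    # keeping the low b = 4 bits == mod-16 reduction (binary overflow)

def bind(x, y):
    # Binding as component-wise modular addition; & MASK stands in for overflow.
    return (x + y) & MASK

def unbind(x, y):
    # Unbinding as component-wise modular subtraction.
    return (x - y) & MASK

def distance(x, y):
    # Per-component distance on the ring Z_r: two subtractions and a min.
    d = (x - y) & MASK
    return int(np.minimum(d, R - d).sum())

assert bind(np.array([14]), np.array([5]))[0] == 3  # 14 + 5 = 19 wraps to 3
```

The masking makes every operation branch-free and lane-parallel, mirroring the b-bit adders of the hardware datapath.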
b) Trigonometric mappings: Another potentially costly
step is the involvement of trigonometric functions. First, during
superposition, each component h_i^(j) ∈ [0, r−1] must be
mapped to a unit vector in R^2 using cos and sin functions. In
software, this requires repeated evaluations of trigonometric
functions, which are computationally expensive. In hardware,
however, the discretized modular space makes this step efficient:
all cos and sin values for all values in Z_r can be pre-
computed and stored in compact fixed-point LUTs, so that
each mapping reduces to a single memory lookup followed
by a fixed-point accumulation. This completely removes the
need for trigonometric evaluation at runtime.
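A minimal sketch of such a LUT (the fixed-point scale is our illustrative choice; the paper's FP parameter plays this role):

```python
import math

R = 16          # modulus
FRAC_BITS = 14  # fixed-point fractional bits (an illustrative assumption)

# One-time precomputation: 2*R small fixed-point entries replace all
# runtime trigonometric evaluations.
COS_LUT = [round(math.cos(2 * math.pi * k / R) * (1 << FRAC_BITS)) for k in range(R)]
SIN_LUT = [round(math.sin(2 * math.pi * k / R) * (1 << FRAC_BITS)) for k in range(R)]

def accumulate(acc_re, acc_im, hv):
    # Add one Z_r hypervector into a fixed-point Cartesian accumulator:
    # per component, two LUT lookups and two integer additions.
    for i, k in enumerate(hv):
        acc_re[i] += COS_LUT[k]
        acc_im[i] += SIN_LUT[k]
```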
Second, during normalization, the accumulated HV in
C must be projected back to the nearest integer in Z_r using
atan2, Eq. (3). A direct hardware realization of this step
is costly, as it requires fixed-point division and nonlinear
functions. Division, as already discussed, is slow and difficult
to parallelize. Similarly, approximating atan2 by computing
the ratio ℑ(v_i)/ℜ(v_i) and mapping it through a LUT demands
a relatively large memory, since the ratio may assume many
values depending on the fixed-point precision, creating an
unfavorable trade-off between accuracy, area, and power consumption.
To overcome these limitations in the hardware realization,
two approaches can be adopted. The first is the well-known
CORDIC algorithm, which computes atan2(ℑ(v_i), ℜ(v_i))
iteratively using only shift-and-add operations and a small LUT
of arctangent constants. The second, proposed in this work,
is a winner-take-all (WTA)-based approach:

w_i = argmax_{k ∈ [0, r−1]} ( ℜ(v_i) cos(2πk/r) + ℑ(v_i) sin(2πk/r) ).   (17)
The key idea is to compare v_i against all r discrete directions
(cos(2πk/r), sin(2πk/r)) retrieved from the LUTs and select the one
with the highest inner product. Importantly, since the quadrant
of v_i can be identified directly from the sign bits of ℜ(v_i)
and ℑ(v_i), the search must cover only r/4 + 1 values in
the corresponding quadrant, rather than all r directions. This
reduces the normalization to r/4 + 1 inner-product computations
(two multiplications and one addition each), followed by
the search for the argument of the maximum. In discretized
spaces, this normalization procedure is highly efficient and
avoids division. Unlike CORDIC, it is fully parallelizable
across all candidate directions while being more compact and
not requiring additional LUTs.
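A behavioral sketch of this WTA normalization (floating-point here for readability; the unit itself uses the fixed-point LUT values):

```python
import math

R = 16
COS = [math.cos(2 * math.pi * k / R) for k in range(R)]
SIN = [math.sin(2 * math.pi * k / R) for k in range(R)]

def normalize_component(re, im):
    # The sign bits of (re, im) select a quadrant, restricting the search
    # to R/4 + 1 candidate phases instead of all R.
    if re >= 0 and im >= 0:
        candidates = range(0, R // 4 + 1)
    elif re < 0 and im >= 0:
        candidates = range(R // 4, R // 2 + 1)
    elif re < 0 and im < 0:
        candidates = range(R // 2, 3 * R // 4 + 1)
    else:  # re >= 0, im < 0: the quadrant wraps around k = 0
        candidates = list(range(3 * R // 4, R)) + [0]
    # Winner-take-all: the phase with the largest inner product wins,
    # with no division or atan2 involved.
    return max(candidates, key=lambda k: re * COS[k] + im * SIN[k])
```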
2) MCR-HDCU: Building on the foundation of HDCU
(Section II-E), we design MCR-HDCU by preserving its
integration strategy while replacing the binary datapath with
one tailored for modular operations, achieving software-
controlled and efficient acceleration while retaining full flexi-
bility and programmability. To the best of our knowledge, this
represents the first dedicated hardware accelerator for MCR.
Fig. 6 depicts a high-level view of the MCR-HDCU
microarchitecture. The accelerator features dedicated pipelines
and optimized functional units (FUs) for each MCR arith-
metic operation. As in the original HDCU, each FU can
be selectively enabled or disabled at synthesis time, and its
degree of hardware parallelism can be configured through
the SIMD (Single Instruction Multiple Data) parameter. The
design of each functional unit in MCR-HDCU incorporates the
optimized arithmetic strategies introduced in Section III-C1,
ensuring efficient implementation of the modular operations
of MCR.
Each HV component is represented using b = ⌈log2(r)⌉ bits,
where r is a synthesis-time parameter. With this representation,
modular wrap-around is naturally handled by binary overflow,
Fig. 6. High-level schematic of the MCR-HDCU microarchitecture. The
design integrates dedicated functional units for each arithmetic operation in
MCR, running in parallel over SIMD lanes. Specialized Scratchpad Memories
ensure low-latency, high-bandwidth access. The hardware parallelism and
memory configuration (number and size of Scratchpad Memories) can be
configured at synthesis time.
TABLE I
MCR-HDCU CONFIGURATION PARAMETERS

Parameter                        Configuration Time
SPM Size                         Synthesis
SPM Number                       Synthesis
Hardware Parallelism (SIMD)      Synthesis
Functional Unit Enable/Disable   Synthesis
MCR Modulo (r)                   Synthesis
Fixed-point Precision (FP)       Synthesis
HVDIM                            Runtime
HVCLASS                          Runtime
eliminating explicit modulo operations and directly wrapping
in the Z_r ring, as discussed in Section III-C1.
To sustain parallel HV operations, the accelerator integrates
local dedicated Scratchpad Memories (SPMs) that provide low-
latency, high-bandwidth access. The number and size of these
SPMs are configurable at synthesis time. By default, four
SPMs are instantiated: three with bandwidth b·SIMD, sustain-
ing the transfer of SIMD components per cycle, and one with
higher bandwidth FP·SIMD dedicated to the Superposition
Unit, which requires streaming SIMD fixed-point components
per cycle for Cartesian accumulation, where FP denotes the
fixed-point precision. When the target learning task is known
in advance, the SPM dimensions can be set precisely to the
required number of HVs; for more general-purpose deployments,
larger SPMs may be synthesized to preserve flexibility. Data
exchange between the host core and the SPMs is supported by
dedicated hvmemld and hvmemstr instructions [15], though
in typical workloads the accelerator operates autonomously on
locally stored HVs. Table I summarizes the parameters that can
be tuned at synthesis time and at runtime in MCR-HDCU.
Next, we describe the implementation of each arithmetic
operation in the MCR-HDCU accelerator.
a) Binding: By representing HV components using b bits,
binding is implemented as a simple modular addition, with
the SIMD parameter defining how many adders operate in
parallel. The same holds for unbinding, which is a modular
subtraction. Since additions/subtractions are fully parallelized
over the SIMD lanes, the total latency in clock cycles (L) is

L_bind = L_unbind = HVDIM / SIMD cycles

Fig. 7. Architecture of the Distance Unit. The design computes 2·SIMD
differences in parallel, selects the modular distance through a min circuit,
and accumulates results using a parallel tree adder.
b) Distance metric: In MCR, distance is computed as
the angular difference across all components of two HVs,
as defined in Eq. (1). As in binding, modular reduction is
inherently handled by binary overflow. Accordingly, the
unit computes 2·SIMD differences in parallel using subtractors,
and for each pair of results, a lightweight min circuit selects
the correct distance. The resulting per-component distances are
then accumulated across SIMD lanes by a parallel tree adder of
depth ⌈log2(SIMD)⌉, as illustrated in Fig. 7. The total latency
of the Distance Unit is thus:

L_sim = HVDIM / SIMD + ⌈log2(SIMD)⌉ cycles

where the first term accounts for the selected hardware paral-
lelism and the second for the depth of the tree adder.
c) Superposition: This operation requires first mapping
each HV component to its equivalent vector, then performing
vector addition, and finally normalizing the accumulated result
back onto the discrete ring Z_r. As highlighted in Section II-B,
the normalization step should only be applied once, after all
operands have been accumulated, since repeated quantization
would cause excessive loss of stored information. For this
reason, in MCR-HDCU we decouple superposition and nor-
malization into two separate instructions and FUs, enabling
intermediate accumulation in fixed-point precision before the
final projection back onto Z_r.
The Superposition Unit is built around two compact LUTs
storing precomputed cos(2πk/r) and sin(2πk/r) values for k ∈
[0, r−1], which, as highlighted in Section III-C1, are small
and hardware-friendly. The FP parameter can be specified at
synthesis time to balance performance and hardware cost. To
maintain the desired precision in cumulative superposition, the
unit assumes that the first input HV is already in fixed-point
Cartesian form and produces a fixed-point result. The second
input, represented in Z_r, is mapped on-the-fly to Cartesian
Fig. 8. Architecture of the Superposition Unit. Each component is mapped
to Cartesian form via LUT-based cos/sin values and accumulated in fixed-
point precision using dedicated high-bandwidth SPM access.
coordinates using the cosine and sine LUTs. At each cycle, the
unit fetches SIMD/2 real parts and SIMD/2 imaginary parts
of the first HV from the dedicated high-bandwidth SPM,
while the second HV remains fixed for two cycles. The
corresponding cosine values are added to the real components
of the first HV, and the sine values to its imaginary components.
The results are written back to the SPM at a throughput of
SIMD/2 complex components per cycle. The total latency of
the operation is therefore:

L_superimpose = 2·HVDIM / SIMD cycles
d) Normalization: The Normalization Unit performs the
normalization step, projecting the results of superposition that
have accumulated in Cartesian coordinates back onto the
discrete ring representing Z_r. As outlined in Section III-C1,
this is achieved through the WTA-based approach, which
avoids the costly division and atan2 computations otherwise
required by direct normalization.
Given a superimposed component in Cartesian form
(ℜ(v_i), ℑ(v_i)), the unit identifies its quadrant from the sign
bits of ℜ(v_i) and ℑ(v_i) and then compares it against the r/4 + 1
directions in that quadrant, with coordinates (cos(2πk/r), sin(2πk/r))
precomputed and stored in the same compact LUTs used by
the Superposition Unit. Each comparison is implemented as
an inner product, requiring two fixed-point multiplications
and one addition. The accumulated Cartesian HV is stored in
a dedicated high-bandwidth SPM that provides simultaneous
access to SIMD/2 real and SIMD/2 imaginary parts per cycle.
As a result, the total latency of the Normalization Unit is

L_norm = (2·HVDIM / SIMD) · (r/4 + 1) cycles
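Collecting the latency formulas of the four units gives a small cycle-count model, useful for sanity checking (a transcription of the formulas in the text; the function itself is ours):

```python
import math

def cycles(op, hvdim, simd, r=16):
    """Clock-cycle latency of the MCR-HDCU functional units,
    transcribed from the formulas in the text."""
    if op in ("bind", "unbind"):
        return hvdim // simd
    if op == "distance":
        return hvdim // simd + math.ceil(math.log2(simd))
    if op == "superpose":
        return 2 * hvdim // simd
    if op == "normalize":
        return (2 * hvdim // simd) * (r // 4 + 1)
    raise ValueError(f"unknown operation: {op}")

# Example: D = 1024, SIMD = 32, r = 16
assert cycles("bind", 1024, 32) == 32
assert cycles("distance", 1024, 32) == 37       # 32 + log2(32)
assert cycles("superpose", 1024, 32) == 64
assert cycles("normalize", 1024, 32) == 320     # 64 * (16/4 + 1)
```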
TABLE II
HARDWARE REQUIREMENTS OF MCR-HDCU FOR DIFFERENT SIMD

SIMD   Freq.     Module        #LUTs   #FFs   #DSPs   #BRAMs
8      150 MHz   Accelerator    1707    960      9       0
                 SPMI            726    282      0      16
16     125 MHz   Accelerator    3139   1671     17       0
                 SPMI           1930    486      0      16
32     115 MHz   Accelerator    5823   3098     33       0
                 SPMI           3065    887      0      20
64     118 MHz   Accelerator   13609   8564     65       0
                 SPMI           7193   1693      0      28

e) Permutation: Permutation in HD/VSA produces HVs
that are nearly orthogonal to their input HVs. Implementing
arbitrary rotations in hardware while processing SIMD
components in parallel would require complex shuffling logic
and intermediate buffers, leading to high area and latency
overheads. In this work, we use the insight that even structured
permutations (e.g., circular shifts or block-cyclic permutations)
produce HVs that are approximately orthogonal to input HVs.
Using this observation, MCR-HDCU performs permutation at
the granularity of whole SIMD blocks. Rather than shifting
components within each block, the design cyclically reorders
entire blocks by remapping memory addresses. With this design
choice, permutation reduces to pure address manipulation
in the SPM: read addresses are offset by multiples of SIMD
lanes, and the HV is streamed directly in permuted form. This
yields HVDIM/SIMD distinct permutations, corresponding to
all possible block-level cyclic shifts, while keeping the im-
plementation extremely lightweight. The total latency of the
permutation is:

L_perm = HVDIM / SIMD cycles.
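The block-level scheme can be modeled in a few lines (an illustrative software model of the address remapping, not the RTL):

```python
def block_permute(hv, shift, simd):
    # Cyclically reorder whole SIMD-sized blocks via remapped read
    # addresses; components inside each block keep their positions.
    n_blocks = len(hv) // simd
    out = []
    for blk in range(n_blocks):
        src = (blk + shift) % n_blocks  # block-level address offset
        out.extend(hv[src * simd:(src + 1) * simd])
    return out

hv = list(range(16))
# With HVDIM = 16 and SIMD = 4, there are HVDIM/SIMD = 4 distinct shifts.
assert len({tuple(block_permute(hv, s, simd=4)) for s in range(4)}) == 4
assert block_permute(hv, 1, 4) == [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3]
```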
f) Search: The final operation implemented in MCR-
HDCU is search, which is essential during inference to identify
the closest prototype among a set of HVs. Implementing this
purely in software would require issuing the distance instruc-
tion c times for a c-class problem, limiting the benefits of
acceleration. To avoid this overhead, MCR-HDCU integrates
a dedicated search instruction that reuses the Distance Unit in
a hardware-controlled loop. At runtime, the number of classes
is specified as a parameter, enabling the accelerator to iterate
over all prototypes in the SPM. The query HV is held constant,
while the second operand cycles through each class HV. After
each distance computation, the result is compared against the
current best match stored in a register, which is updated if
necessary. Once the loop completes, the index of the class
with the lowest distance is written back to memory.
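A sketch of this hardware-controlled loop (an illustrative software model assuming r = 16; the register update corresponds to the best-match comparison):

```python
R = 16

def distance(x, y):
    # Per-component modular (angular) distance on Z_r, summed.
    return sum(min((a - b) % R, (b - a) % R) for a, b in zip(x, y))

def search(query, prototypes):
    # Query held constant; the Distance Unit is reused for each of the
    # c class prototypes, tracking the running best match in a "register".
    best_idx, best_dist = 0, float("inf")
    for c, proto in enumerate(prototypes):
        d = distance(query, proto)
        if d < best_dist:
            best_idx, best_dist = c, d
    return best_idx
```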
D. Performance of MCR-HDCU
This section presents a comprehensive evaluation of the
MCR-HDCU accelerator. We begin by reporting hardware
synthesis results, including resource utilization and maximum
achievable frequency on FPGA, to characterize the hardware
requirements of the proposed accelerator and evaluate the
cost of different SIMD configurations. We then analyze per-
formance, evaluating the advantages of the proposed MCR-
HDCU on both basic HD/VSA operations and full learn-
ing kernels to demonstrate the accelerator's efficiency and
flexibility. Finally, we directly compare MCR and BSC by
benchmarking the proposed accelerator against the original
binary-only HDCU [15], denoted here as BSC-HDCU.
TABLE III
EXECUTION TIME IN µS AND ENERGY CONSUMPTION IN µJ PER ARITHMETIC OPERATION IN MCR

                       Binding          Superposition      Normalization      Permutation       Distance
Realization  HVDIM   Time   Energy     Time    Energy     Time     Energy    Time    Energy    Time   Energy
Software        64    5.90    0.74     13.12     1.64      76.21     9.53    11.90     1.49     4.71    0.59
Software       512   43.63    5.45    102.21    12.78     606.18    75.77    92.90    11.61    35.42    4.43
Software      2048  173.02   21.63    407.58    50.95    2411.27   301.41   380.38    47.55   140.66   17.58
SIMD 8          64    0.15    0.02      0.36     0.04       0.65     0.07     0.15     0.01     0.24    0.02
SIMD 8         512    0.52    0.06      1.85     0.22       4.38     0.53     0.52     0.06     0.99    0.11
SIMD 8        2048    1.80    0.22      6.97     0.93      17.18     2.32     1.80     0.22     3.55    0.44
SIMD 16         64    0.14    0.01      0.31     0.04       0.46     0.05     0.14     0.01     0.23    0.02
SIMD 16        512    0.37    0.04      1.21     0.15       2.70     0.33     0.37     0.04     0.68    0.08
SIMD 16       2048    1.14    0.15      4.28     0.62      10.38     1.53     1.14     0.14     2.22    0.28
SIMD 32         64    0.14    0.02      0.28     0.04       0.32     0.04     0.14     0.02     0.23    0.03
SIMD 32        512    0.26    0.04      0.77     0.12       1.54     0.25     0.26     0.04     0.47    0.06
SIMD 32       2048    0.68    0.10      2.44     0.39       5.71     0.93     0.68     0.10     1.30    0.19
SIMD 64         64    0.13    0.02      0.24     0.04       0.23     0.04     0.13     0.02     0.20    0.03
SIMD 64        512    0.19    0.04      0.48     0.10       0.82     0.17     0.19     0.04     0.33    0.06
SIMD 64       2048    0.39    0.08      1.30     0.27       2.86     0.59     0.39     0.08     0.74    0.15
In the experiments below, all benchmarks are implemented
using a custom software library that supports two execution
modes: a standard (non-accelerated) configuration compiled
for the baseline RISC-V instruction set and executed on the
Klessydra-T03 [51] core, and an accelerated configuration
compiled with the extended instruction set to exploit the
HDCU coprocessor.
Execution time is estimated using cycle-accurate Register-
Transfer-Level simulations in QuestaSim, by extracting the
number of clock cycles and dividing it by the post-
implementation maximum operating frequency of each con-
figuration. Speedup factors are computed as the ratio between
baseline and accelerated execution times. For energy consump-
tion, we perform post-implementation gate-level simulations
to generate switching activity files (.saif) for each kernel,
which are then analyzed using the Vivado power estimator to
derive the dynamic power of both the core and the accelerator.
1) Hardware realization: To assess the resource require-
ments and achievable performance of MCR-HDCU, we syn-
thesized the Klessydra-T03 core extended with our MCR-
HDCU accelerator on a Xilinx Zynq UltraScale+ ZCU106
(EK-U1-ZCU106-G) device, using Vivado 2023.2. Table II
reports the detailed hardware utilization of MCR-HDCU for
SIMD widths ranging from 8 to 64, while fixing the modulus
r = 16, fixed-point precision FP = 16, and using four 2 KB
SPMs. The reported metrics include LUTs, Flip-Flops (FFs),
Digital Signal Processors (DSPs), Block RAMs (BRAMs), as
well as the maximum operating frequency achievable on the
FPGA.
This analysis highlights how the accelerator can be adapted
to different scenarios by balancing hardware cost against per-
formance. For example, increasing the SIMD width from 8 to
64 raises the total LUT count from 2433 to 20802, underlining the
need to carefully select the degree of parallelism depending on
the application. In resource-constrained edge devices primarily
targeting inference, a smaller SIMD configuration may be
preferable. In contrast, when resources are less constrained
and training acceleration is also desired, higher parallelism
can be exploited to maximize throughput.
2) Impact on the basic HD/VSA arithmetic: Table III
summarizes the execution time and the dynamic energy con-
sumption achieved by MCR-HDCU for each core arithmetic
operation. All experiments are conducted across a range of HV
dimensionalities and hardware parallelism configurations to
explore scalability and design trade-offs. Specifically, we eval-
uate four representative SIMD configurations {8, 16, 32, 64}
and three HV dimensionalities {64, 512, 2048}.
As the HV dimensionality increases, the hardware loops
inside the accelerator allow efficient processing without re-
peatedly re-fetching instructions, maximizing execution
speed. On the other hand, increasing the SIMD value enables
greater parallelism, allowing for the simultaneous processing
of more HV components, but at the cost of increased hardware
complexity, as reported in Table II.
For the binding operation, MCR-HDCU achieves speedups
from 39× (SIMD = 8, HVDIM = 64) up to 444× (SIMD =
64, HVDIM = 2048), thanks to its lightweight implementa-
tion as modular additions with implicit overflow. The more
complex superposition operation benefits from fixed-point
Cartesian accumulation using compact cosine/sine LUTs (see
Section III-C), and achieves speedups ranging from 36× up
to 314×. Similarly, the normalization operation, implemented
via WTA on the precomputed LUTs, provides speedups from
117× to 843×, depending on HV dimensionality and SIMD
parallelism. Much larger gains are observed for the permuta-
tion operation: by reducing it to a simple hardware-friendly
block-wise reordering of memory addresses, MCR-HDCU
achieves speedups from 79× to 975×. Finally, the distance
metric, which is critical for inference, also benefits strongly from
hardware acceleration, with speedups ranging from 20×
to 190×, depending on the configuration.
Overall, these results confirm that all core operations of
MCR can be executed extremely efficiently in hardware,
demonstrating that the additional expressiveness of the modu-
lar space does not come at the cost of computational efficiency.
Energy efficiency follows a similar trend. Across all op-
erations, the software implementation on the RISC-V core
consumes between 29.5× and 594× more energy than its
accelerated counterpart. This reduction highlights the value of
specialized hardware, particularly for low-power applications,
even when accelerating only individual arithmetic operations.
3) Performance on real data: To demonstrate the flexibility
of the proposed accelerator and evaluate its performance in
realistic scenarios, we selected seven classification datasets
Fig. 9. Overview of the real datasets selected for the evaluation, showing
their complexity in terms of the number of features and classes. The datasets
span a wide range of complexities: HabermanSurvival (d = 3, c = 2),
Adult (d = 14, c = 2), Letter (d = 16, c = 26), Cardio10 (d = 21,
c = 10), PlantMargin (d = 64, c = 100), UCIHAR (d = 561, c = 6),
and ISOLET (d = 617, c = 26).
from the collection used in Section III-B. The selection
captures a wide spectrum of problem complexities in terms
of both the number of features and classes. As shown in
Fig. 9, the selected datasets span from very simple problems,
such as HabermanSurvival and Adult, with only a
few features and two classes, to demanding problems such
as PlantMargin, UCIHAR, and ISOLET, which combine
hundreds of features with dozens or even hundreds of classes.
Intermediate problems such as Letter and Cardio10 rep-
resent median cases, containing tens of input features with
challenging multi-class classification. This variability ensures
that the evaluation covers the full range of settings, from
lightweight binary problems to high-dimensional, multi-class
problems, thereby providing a comprehensive validation of
MCR-HDCU. It is important to emphasize that the goal of
this analysis is not to assess the classification accuracy, but
rather to evaluate the efficiency of the proposed accelerator.
Every arithmetic operation implemented in MCR-HDCU was
rigorously verified to ensure full consistency with the software
model. As a result, the accelerator delivers exactly the same
accuracy as the reference software implementation reported in
Section III-B.
For these experiments, we adopted HVs of size D = 1024
and a modulus r = 16 (i.e., MCR-4), comparing software exe-
cution on the baseline Klessydra-T03 core against accelerated
execution with different SIMD configurations {8, 16, 32, 64}.
Table IV reports the obtained execution times for the seven
selected datasets, while Fig. 10 shows the relative speedups.
The reported time corresponds to a single inference iteration.
In MCR-HDCU, latency depends only on the number
of features and classes and is, therefore, constant across all
samples within a dataset.
Across all datasets, the accelerator delivers substantial
performance improvements. For the simplest tasks, such as
HabermanSurvival (d = 3, c = 2), speedups range from
80.4× at SIMD = 8 to 137.9× at SIMD = 64, while Adult
(d = 14, c = 2) achieves 98.1× to 259.9×. Larger datasets
yield even higher gains: Letter (d = 16, c = 26) improves
from 95.4× to 306.9×, while Cardio10 (d = 21, c = 10) goes
from 100.6× to 320.2×. For the extreme cases with a large
number of features and classes, the impact of hardware ac-
celeration is even more pronounced: PlantMargin (d = 64,
c = 100) shows 145.6× to 616.9×, UCIHAR (d = 561, c = 6)
114.6× to 625.3×, and ISOLET (d = 617, c = 26) achieves
the highest gains with 114.5× to 627.7×.
Overall, these results confirm that the benefits of hardware
acceleration increase with dataset complexity. In particular,
as the number of features increases, the transformation of
feature vectors into HVs dominates the runtime, and the accel-
eration of the superposition and binding operations becomes
increasingly beneficial. On the other hand, a larger number of
classes amplifies the cost of searching for the closest prototype,
making the search and distance units the dominant
factor.
4) Comparison with BSC: The previous sections demon-
strated how the arithmetic of MCR can be efficiently imple-
mented in digital hardware and how the adoption of a
dedicated accelerator can substantially reduce its execution
time and energy consumption. At the same time, Section III-B
reported that the higher per-component precision and the
modular space allow MCR-4 with D = 64 components
and r = 16 (4 bits per component) to achieve higher
average accuracy (75.49% across the collection) than BSC
with D = 1024 components (74.35%), despite requiring 4×
less memory. Building on these findings, we conclude the
experiments by revisiting the hypothesis from the original
MCR study [32] that BSC is advantageous over MCR in
resource-constrained scenarios with strict requirements on
execution time and energy efficiency. To answer this question,
we directly compare MCR-4 (D = 64, r = 16) against BSC
(D = 1024), with both accelerated on their corresponding
architectures: MCR-HDCU and BSC-HDCU.
For the sake of fairness, we evaluate four SIMD config-
urations for each accelerator: SIMD ∈ {8, 16, 32, 64} for
MCR-HDCU and SIMD ∈ {32, 64, 128, 256} for BSC-HDCU.
These configurations yield an equivalent number of bits pro-
cessed per clock cycle, since MCR-4 uses r = 16 (b = 4
bits per component). Both accelerators are synthesized with
four 2 KB SPMs each. Testing SIMD values larger than
the dimensionality of HVs for MCR provides no additional
insights, as configurations with SIMD > HVDIM converge to
the same execution time and energy consumption as the case
when SIMD = HVDIM.
Table V reports the results obtained across the previously
selected datasets. Despite the intrinsically more complex arith-
metic of MCR, the combination of an optimized accelera-
tor design and the reduced number of components enabled
by higher per-component precision allows MCR-HDCU to
consistently outperform BSC-HDCU in both execution time
and energy consumption. On average, across all datasets and
SIMD configurations, MCR-HDCU runs kernels 3.08× faster
and with 2.68× lower energy consumption than BSC-HDCU.
Fig. 10. Execution time of the 7 selected datasets on the baseline Klessydra-T03 core compared to the MCR-HDCU accelerator with different SIMD ∈
{8, 16, 32, 64}.
TABLE IV
EXECUTION TIME IN µS COMPARING SOFTWARE AND SIMD HARDWARE
REALIZATIONS ACROSS THE SELECTED DATASETS

Dataset             Software    SIMD-8   SIMD-16   SIMD-32   SIMD-64
HabermanSurvival     2258.40     28.08     23.27     19.65     16.38
Adult                5565.15     56.75     40.70     29.50     21.42
PlantMargin         27495.41    188.79    120.92     74.84     44.57
UCIHAR             170264.30   1486.03    909.21    520.56    272.30
ISOLET             188503.79   1647.03   1006.94    575.63    300.32
Letter               7854.92     82.33     56.10     37.97     25.59
Cardio10             8230.95     81.83     55.88     38.00     25.70
These advantages are even more pronounced for large datasets
such as UCIHAR and ISOLET, where the execution time
improves by up to 6.35× while the energy consumption drops
by up to 5.72×.
Importantly, the reported improvements do not come at the
expense of larger accelerators. For example, when comparing
MCR-4 with SIMD = 8 against BSC with SIMD = 64, an
instance where MCR-HDCU is actually smaller in terms of
the required resources, MCR-4 still provides improvements:
the execution time is reduced by 1.90× to 3.21× and the
energy consumption by 1.82× to 3.01×, depending on the
dataset.
These results confirm the key insight of this study: although
MCR requires more sophisticated arithmetic than BSC, its
modular space and higher per-component precision allow
much lower HV dimensionality. When paired with efficient
hardware support, this leads to significantly faster execution
and lower energy consumption. In other words, MCR is not
only more accurate per bit but also more efficient when
deployed on dedicated hardware.
IV. CONCLUSIONS
In this study, we revisited the modular composite representa-
tion (MCR) model, providing an extensive analysis of its prop-
erties and implementing the first hardware accelerator for this
model. Our study demonstrates that the MCR model achieves
the best trade-off between information capacity, classification
accuracy, and efficiency, positioning it as a compelling alterna-
tive to more well-known models such as binary spatter codes
[25] and multiply–add–permute [26].
First, we found that the modular discretized space of MCR
offers substantially higher information capacity than binary
and integer hypervectors, while approaching the information
capacity of complex-valued hypervectors at a fraction of their
memory footprint. Second, a large-scale evaluation on 123
classification datasets confirmed that MCR achieves higher
accuracy than binary and low-precision integer hypervectors
and can match the performance of binary hypervectors while
requiring up to 4× less memory. These results highlight
the importance of per-component precision, showing that the
larger dynamic range of non-binary modular arithmetic indeed
pays off in practice.
Next, we designed MCR-HDCU, the first hardware ac-
celerator for MCR, demonstrating that its arithmetic maps
naturally to digital logic, with modular operations realized im-
plicitly through binary overflow and trigonometric mappings
efficiently implemented using compact lookup tables. Exper-
imental results on basic operations and seven classification
datasets show up to three orders-of-magnitude speedups and
substantial energy savings compared to software execution.
When compared to binary hypervectors at iso-accuracy and
accelerated with the original HDCU, MCR-HDCU achieves
on average 3.08× faster execution and 2.68× lower energy
consumption.
Overall, our findings refute the intuition that the additional
complexity of MCR outweighs its benefits. On the contrary,
MCR provides a hardware-friendly alternative that delivers supe-
rior information capacity, higher classification accuracy, lower
memory footprint, and great efficiency when implemented in
specialized hardware.
REFERENCES
[1] P. Kanerva, "Hyperdimensional computing: An introduction to comput-
ing in distributed representation with high-dimensional random vectors,"
Cognitive Computation, vol. 1, no. 2, pp. 139–159, 2009.
[2] R. W. Gayler, "Vector symbolic architectures answer Jackendoff's chal-
lenges for cognitive neuroscience," in Joint International Conference on
Cognitive Science (ICCS/ASCS), 2003, pp. 133–138.
TABLE V
EXECUTION TIME (µS) AND ENERGY CONSUMPTION (µJ) FOR BSC (D = 1024) AND MCR-4 (D = 64), ACROSS DATASETS AND SIMD
CONFIGURATIONS

                                 Haberman        Adult           Letter         Cardio10        Plant           UCIHAR           ISOLET
         SIMD  #LUTs  #FFs   Time   Energy   Time    Energy   Time    Energy  Time    Energy  Time    Energy   Time     Energy   Time     Energy
BSC,       32   1886   728   6.418   0.624  21.018    2.158  27.164   2.960  31.473   3.458  87.673  10.185  747.627   87.529  824.864   97.315
D = 1024   64   3255  1010   4.200   0.428  11.550    1.240  14.632   1.679  16.809   1.929  45.105   5.506  377.345   46.067  416.218   50.812
          128   6547  1671   3.181   0.338   7.070    0.780   8.670   1.042   9.842   1.183  24.819   3.169  200.586   25.615  221.126   28.238
          256  13094  2909   2.605   0.304   4.600    0.553   5.409   0.690   6.019   0.772  13.707   1.847  103.907   13.927  114.437   15.422
MCR-4,      8   2433  1242   2.173   0.235   4.447    0.507   6.140   0.743   6.320   0.771  14.887   1.920  117.720   15.304  130.360   17.077
D = 64     16   5069  2157   1.976   0.219   3.736    0.437   4.824   0.603   5.112   0.639  11.800   1.569   91.400   12.156  101.000   13.433
           32   8887  3985   1.750   0.224   3.217    0.428   3.883   0.563   4.283   0.621   9.917   1.527   76.233   11.740   84.033   12.941
           64  20802 10257   1.687   0.270   3.026    0.499   3.470   0.607   3.939   0.693   9.122   1.688   69.661   12.818   76.652   14.181
[3] D. Kleyko, D. A. Rachkovskij, E. Osipov et al., "A survey on hyperdi-
mensional computing aka vector symbolic architectures, Part I: Models
and data transformations," ACM Computing Surveys, vol. 55, no. 6, pp.
1–40, 2022.
[4] K. Schlegel, P. Neubert, and P. Protzel, “A comparison of vector
symbolic architectures,” Artiﬁcial Intelligence Review , vol. 55, no. 6,
pp. 4523–4555, 2022.
[5] D. Kleyko, M. Davies, E. P. Frady et al. , “Vector symbolic architectures
as a computing framework for emerging hardware,” Proceedings of the
IEEE , vol. 110, no. 10, pp. 1538–1571, 2022.
[6] D. Kleyko, D. A. Rachkovskij, E. Osipov et al., "A survey on hyper-
dimensional computing aka vector symbolic architectures, Part II: Ap-
plications, cognitive models, and challenges," ACM Computing Surveys,
vol. 55, no. 9, pp. 1–52, 2023.
[7] P. Vergés, M. Heddes, I. Nunes et al. , “Classiﬁcation using hyperdi-
mensional computing: A review with comparative analysis,” Artiﬁcial
Intelligence Review , vol. 58, no. 6, pp. 1–41, 2025.
[8] A. Hernandez-Cano, C. Zhuo, X. Yin et al., "RegHD: Robust and
efficient regression in hyper-dimensional learning system," in ACM/IEEE
Design Automation Conference (DAC), 2021, pp. 7–12.
[9] E. P. Frady, D. Kleyko, C. J. Kymn et al. , “Computing on functions
using randomized vector representations (in brief),” in Neuro-Inspired
Computational Elements Conference (NICE) , 2022, pp. 115–122.
[10] M. Imani, Y. Kim, T. Worley et al., "HDCluster: An accurate clustering
using brain-inspired high-dimensional computing," in Design, Automa-
tion & Test in Europe Conference & Exhibition (DATE), 2019, pp. 1591–
1594.
[11] T. Bandaragoda, D. De Silva, D. Kleyko et al., "Trajectory clustering of
road traffic in urban environments using incremental machine learning
in combination with hyperdimensional computing," in IEEE Intelligent
Transportation Systems Conference (ITSC), 2019, pp. 1664–1670.
[12] D. Kleyko, E. Osipov, R. W. Gayler et al., "Imitation of honey bees'
concept learning processes using vector symbolic architectures," Biolog-
ically Inspired Cognitive Architectures, vol. 14, pp. 57–72, 2015.
[13] Y. Ni, M. Issa, D. Abraham et al., "HDPG: Hyperdimensional policy-
based reinforcement learning for continuous control," in ACM/IEEE
Design Automation Conference (DAC), 2022, pp. 1141–1146.
[14] M. Angioli, A. Rosato, M. Barbirotta et al., "HD-CB: The first explo-
ration of hyperdimensional computing for contextual bandits problems,"
arXiv:2501.16863, 2025.
[15] R. Martino, M. Angioli, A. Rosato et al. , “Conﬁgurable hardware
acceleration for hyperdimensional computing extension on RISC-V,”
TechRxiv , 2024.
[16] M. Imani, A. Rahimi, D. Kong et al. , “Exploring hyperdimensional
associative memory,” in IEEE International Symposium on High Per-
formance Computer Architecture (HPCA) , 2017, pp. 445–456.
[17] A. Rahimi, S. Datta, D. Kleyko et al. , “High-dimensional computing as
a nanoscalable paradigm,” IEEE Transactions on Circuits and Systems
I: Regular Papers , vol. 64, no. 9, pp. 2508–2521, 2017.
[18] M. Schmuck, L. Benini, and A. Rahimi, "Hardware optimizations of
dense binary hyperdimensional computing: Rematerialization of hyper-
vectors, binarized bundling, and combinational associative memory,"
ACM Journal on Emerging Technologies in Computing Systems, vol. 15,
no. 4, pp. 1–25, 2019.
[19] B. Khaleghi, H. Xu, J. Morris et al. , “tiny-HD: Ultra-efﬁcient hy-
perdimensional computing engine for IoT applications,” in Design,
Automation & Test in Europe Conference & Exhibition (DATE) , 2021,
pp. 408–413.
[20] D. Kleyko, E. P. Frady, and F. T. Sommer, "Cellular automata can reduce
memory requirements of collective-state computing," IEEE Transactions
on Neural Networks and Learning Systems, vol. 33, no. 6, pp. 2701–
2713, 2022.
[21] S. A. Wasif, M. Wael, P. R. Genssler et al., "Domain-specific hyperdi-
mensional RISC-V processor for Edge-AI training," IEEE Transactions
on Circuits and Systems I: Regular Papers, 2025.
[22] E. P. Frady and F. T. Sommer, "Robust computation with rhythmic spike
patterns," Proceedings of the National Academy of Sciences, vol. 116,
no. 36, pp. 18050–18059, 2019.
[23] G. Bent, C. Simpkin, Y. Li et al., "Hyperdimensional computing using
time-to-spike neuromorphic circuits," in International Joint Conference
on Neural Networks (IJCNN), 2022, pp. 1–8.
[24] J. Orchard, P. M. Furlong, and K. Simone, "Efficient hyperdimensional
computing with spiking phasors," Neural Computation, vol. 36, no. 9,
pp. 1886–1911, 2024.
[25] P. Kanerva, “A family of binary spatter codes,” in International Confer-
ence on Artiﬁcial Neural Networks (ICANN) , 1995, pp. 517–522.
[26] R. W. Gayler, "Multiplicative binding, representation operators & anal-
ogy," in Advances in Analogy Research: Integration of Theory and Data
from the Cognitive, Computational, and Neural Sciences, 1998, pp. 1–4.
[27] T. A. Plate, Holographic Reduced Representations: Distributed Repre-
sentation for Cognitive Structures . Stanford: Center for the Study of
Language and Information (CSLI), 2003.
[28] M. Eggimann, A. Rahimi, and L. Benini, "A 5 µW standard cell memory-
based configurable hyperdimensional computing accelerator for always-
on smart sensing," IEEE Transactions on Circuits and Systems I: Regular
Papers, vol. 68, no. 10, pp. 4116–4128, 2021.
[29] G. Karunaratne, M. Le Gallo, G. Cherubini et al. , “In-memory hyperdi-
mensional computing,” Nature Electronics , vol. 3, no. 6, pp. 327–337,
2020.
[30] V. Sumanasena, D. de Silva, E. Osipov et al., "Implementing holographic
reduced representations for spiking neural networks," IEEE Access,
vol. 13, pp. 116606–116620, 2025.
[31] C. Kymn, C. Bybee, Z. Yun et al. , “Oscillator associative memories
facilitate high-capacity, compositional inference,” in New Frontiers in
Associative Memories , 2025.
[32] J. Snaider and S. Franklin, "Modular composite representation," Cogni-
tive Computation, vol. 6, no. 3, pp. 510–527, 2014.
[33] T. Yu, Y. Zhang, Z. Zhang et al., "Understanding hyperdimensional
computing for parallel single-pass learning," in Advances in Neural
Information Processing Systems (NeurIPS), 2022, pp. 1157–1169.
[34] C. J. Kymn, D. Kleyko, E. P. Frady et al. , “Computing with residue num-
bers in high-dimensional representation,” Neural Computation , vol. 37,
no. 1, pp. 1–37, 2024.
[35] A. R. Omondi and A. B. Premkumar, Residue number systems: Theory
and implementation . World Scientiﬁc, 2007, vol. 2.
[36] I. R. Fiete, Y. Burak, and T. Brookings, "What grid cells convey about
rat location," Journal of Neuroscience, vol. 28, no. 27, pp. 6858–6871,
2008.
[37] C. Kymn, S. Mazelet, A. Thomas et al., "Binding in hippocampal-
entorhinal circuits enables compositionality in cognitive maps," in Ad-
vances in Neural Information Processing Systems (NeurIPS), 2024, pp.
39128–39157.
[38] M. Heddes, I. Nunes, P. Vergés et al. , “Torchhd: An open source Python
library to support research on hyperdimensional computing and vector
symbolic architectures,” Journal of Machine Learning Research , vol. 24,
no. 255, pp. 1–10, 2023.
[39] E. P. Frady, D. Kleyko, and F. T. Sommer, "A theory of sequence
indexing and working memory in recurrent neural networks," Neural
Computation, vol. 30, no. 6, pp. 1449–1513, 2018.
[40] M. Hersche, S. Lippuner, M. Korb et al., "Near-channel classifier:
Symbiotic communication and classification in high-dimensional space,"
Brain Informatics, vol. 8, pp. 1–15, 2021.
[41] D. Kleyko, C. Bybee, P.-C. Huang et al. , “Efﬁcient decoding of
compositional structure in holistic representations,” Neural Computation ,
vol. 35, no. 7, pp. 1159–1186, 2023.
[42] D. Kleyko, M. Kheffache, E. P. Frady et al. , “Density encoding enables
resource-efﬁcient randomly connected neural networks,” IEEE Transac-
tions on Neural Networks and Learning Systems , vol. 32, no. 8, pp.
3777–3783, 2021.
[43] P. A. Penz, "The closeness code: An input integer to binary vector
transformation suitable for neural network algorithms," in IEEE First
Annual International Conference on Neural Networks (ICNN), 1987, pp.
515–522.
[44] D. A. Rachkovskij, S. V. Slipchenko, E. M. Kussul et al., "Sparse binary
distributed encoding of scalars," Journal of Automation and Information
Sciences, vol. 37, no. 6, pp. 12–23, 2005.
[45] D. Nova and P. A. Estévez, "A review of learning vector quantization
classifiers," Neural Computing and Applications, vol. 25, no. 3, pp. 511–
524, 2014.
[46] C. Diao, D. Kleyko, J. M. Rabaey et al., "Generalized learning vector
quantization for classification in randomized neural networks and hyper-
dimensional computing," in International Joint Conference on Neural
Networks (IJCNN), 2021, pp. 1–9.
[47] D. Kleyko, E. Osipov, N. Papakonstantinou et al., "Fault detection
in the hyperspace: Towards intelligent automation systems," in IEEE
International Conference on Industrial Informatics (INDIN), 2015, pp.
1219–1224.
[48] A. Rahimi, P. Kanerva, and J. M. Rabaey, "A robust and energy-
efficient classifier using brain-inspired hyperdimensional computing," in
IEEE/ACM International Symposium on Low Power Electronics and
Design (ISLPED), 2016, pp. 64–69.
[49] D. Kleyko, E. Osipov, N. Papakonstantinou et al., "Hyperdimensional
computing in industrial systems: The use-case of distributed fault
isolation in a power plant," IEEE Access, vol. 6, pp. 30766–30777,
2018.
[50] T. Kohonen, "Improved versions of learning vector quantization," in
International Joint Conference on Neural Networks (IJCNN), 1990, pp.
545–550.
[51] A. Cheikh, S. Sordillo, A. Mastrandrea et al., "Klessydra-T: Designing
vector coprocessors for multithreaded edge-computing cores," IEEE
Micro, vol. 41, no. 2, pp. 64–71, 2021.
[52] M. Fernandez-Delgado, E. Cernadas, S. Barro et al., "Do we need
hundreds of classifiers to solve real world classification problems?"
Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014.
[53] D. Dua and C. Graff, “UCI machine learning repository,” 2019.
[Online]. Available: http://archive.ics.uci.edu/ml
[54] M. Angioli, M. Barbirotta, A. Cheikh et al., "Design, implementation
and evaluation of a new variable latency integer division scheme," IEEE
Transactions on Computers, vol. 73, no. 7, pp. 1767–1779, 2024.