State of the Art in Text Classification for South Slavic Languages:
Fine-Tuning or Prompting?
Taja Kuzman Pungeršek∗, Peter Rupnik∗, Ivan Porupski∗,
Vuk Dinić∗, Nikola Ljubešić∗†‡
∗Jožef Stefan Institute;
†Faculty of Computer and Information Science, University of Ljubljana;
‡Institute of Contemporary History;
Ljubljana, Slovenia
{taja.kuzman, peter.rupnik, ivan.porupski, vuk.dinic, nikola.ljubesic}@ijs.si
Abstract
Until recently, fine-tuned BERT-like models provided state-of-the-art performance on text classification tasks. With
the rise of instruction-tuned decoder-only models, commonly known as large language models (LLMs), the field has
increasinglymovedtowardzero-shotandfew-shotprompting. However,theperformanceofLLMsontextclassification,
particularly on less-resourced languages, remains under-explored. In this paper, we evaluate the performance of
current language models on text classification tasks across several South Slavic languages. We compare openly
available fine-tuned BERT-like models with a selection of open-source and closed-source LLMs across three tasks in
three domains: sentiment classification in parliamentary speeches, topic classification in news articles and parliamen-
tary speeches, and genre identification in web texts. Our results show that LLMs demonstrate strong zero-shot
performance, often matching or surpassing fine-tuned BERT-like models. Moreover, when used in a zero-shot setup,
LLMs perform comparably in South Slavic languages and English. However, we also point out key drawbacks of
LLMs, including less predictable outputs, significantly slower inference, and higher computational costs. Due to
theselimitations,fine-tunedBERT-likemodelsremainamorepracticalchoiceforlarge-scaleautomatictextannotation.
Keywords:LLM evaluation, text classification, large language models, South Slavic languages, sentiment
identification, topic classification, genre identification
1. Introduction
Untilrecently,thedominantapproachfortextclassi-
fication tasks relied on fine-tuning BERT-like trans-
formermodelsonthousandsofmanually-annotated
training examples. Recently, however, the field has
shifted with the development of instruction-tuned
decoder-only transformer models. These models,
also commonly referred to as large language mod-
els (LLMs), which were originally developed primar-
ily for text generation tasks, have demonstrated
remarkable capabilities across a broad range of
natural language processing (NLP) tasks, includ-
ing text classification (Kuzman et al., 2023; Huang
et al., 2023).
In this paper, we focus on South Slavic lan-
guages, where research on text classification tasks
included in our study has, until recently, been lim-
ited or even non-existent (Kuzman and Ljubešić,
2023; Mochtak et al., 2024; Kuzman and Ljubešić,
2025). We take a first step toward systematically
evaluatingthecurrentstateoftheartfortextclassifi-
cation in these languages. Our evaluation is based
onthreetextclassificationtasksinthreedifferentdo-
mains for which manually-annotated test datasets
inSouthSlaviclanguagesandfine-tunedBERT-like
classifiers arefreely available: sentimentclassifica-
tion of parliamentary speeches, topic classificationin news articles, topic classification in parliamen-
tary speeches, and automatic genre identification
in web texts. These tasks span different domains
and language styles, allowing for a comprehensive
analysis of the performance of transformer-based
models on text classification tasks. Specifically,
we compare the performance of openly available
fine-tuned BERT-like models with the zero-shot ca-
pabilities of both open-source and closed-source
LLMs used via prompting.
An important aspect of our study is to examine
whether the performance of multilingual models on
South Slavic languages is on par with their perfor-
mance on English. This question is particularly
relevant given that the evaluated large language
models have been predominantly pretrained and
instruction-tuned on English data.
By evaluating various models on a selection
of text classification tasks in English and various
South Slavic languages, we set out to test the fol-
lowing two hypotheses that are based on previous
experiments with fine-tuned BERT-like models and
LLMs on automatic genre identification (Kuzman
etal.,2023),newstopicclassification(Kuzmanand
Ljubešić, 2025) and sentiment analysis in parlia-
mentary texts (Mochtak et al., 2025):
H1Zero-shot prompting with instruction-tuned
large language models (LLMs) can achievearXiv:2511.07989v1  [cs.CL]  11 Nov 2025
Dataset Lang # Instances #Labels % Most and Least Frequent Label
Sentiment classification in parliamentary speeches
ParlaSent-EN-test EN 2600 3 40.8% (Neutral), 26.8% (Positive)
ParlaSent-HR-test HR 1336 3 41.9% (Negative), 17.2% (Positive)
ParlaSent-SR-test SR 1074 3 46.2% (Negative), 17.6% (Positive)
ParlaSent-BS-test BS 190 3 47.9% (Negative), 14.7% (Positive)
Genre classification in web texts
EN-GINCO EN 272 8 23.5% (Information/Explanation), 0.4% (Le-
gal)
X-GINCO-SL SL 80 8 15% (Prose/Lyrical), 8.8% (Opin-
ion/Argumentation)
X-GINCO-HR HR 80 8 16.3% (Promotion), 7.5% (Instruction)
X-GINCO-MK MK 80 8 15% (News), 1% (Opinion/Argumentation)
Topic classification in news articles
IPTC-test-HR HR 291 17 11.0% (Economy), 3.8% (Conflict, War and
Peace)
IPTC-test-SL SL 282 17 10.6% (Society), 3.2% (Conflict, War and
Peace)
Topic classification in parliamentary speeches
ParlaCAP-test-EN EN 876 22 6.4% (Law and Crime), 2.1% (Culture)
ParlaCAP-test-HR HR 869 22 8.5% (Government Operations), 1.7% (Immi-
gration)
ParlaCAP-test-SR SR 874 22 7.1% (Government Operations), 1.7% (Immi-
gration)
ParlaCAP-test-BS BS 824 22 10.4% (Other), 0.5% (Culture)
Table1: InformationontestdatasetsinEnglish(EN),Croatian(HR),Serbian(SR),Bosnian(BS),Slovenian
(SL), and Macedonian (MK).
results comparable to the use of BERT-like
models fine-tuned on training data that are
similar to the test data.
H2The performance of LLMs used in a zero-shot
setup on text classification tasks on South
Slavic test datasets is comparable to the per-
formance on English test datasets.
2. Related Work
After the introduction of transformer architectures,
BERT (bidirectional encoder representations from
transformers) models have achieved state-of-the-
art results in text classification tasks, outperform-
ingearliernon-neuralapproaches,suchassupport
vector machines (SVMs). They have also demon-
strated strong cross-lingual zero-shot capabilities
in various classification tasks, including automatic
genre identification (Kuzman and Ljubešić, 2023),
news topic classification (Petukhova and Fachada,
2023; De Clercq et al., 2020), and sentiment clas-
sification (Mochtak et al., 2024). However, these
modelsstillrequirefine-tuningonatrainingdataset,developed during manual annotation campaigns
that are time-consuming and costly.
Instruction-tuned decoder-only transformer mod-
els, commonly referred to as large language mod-
els (LLMs), have recently shown strong perfor-
mance in a range of classification tasks, even in
zero-shot prompting setups that require no training
data (Kuzman et al., 2023; Ljubešić et al., 2024a;
Huang et al., 2023). They have achieved promis-
ing results on various natural language process-
ing tasks, including stance detection (Zhang et al.,
2022), implicit hate speech categorization (Huang
et al., 2023), news topic classification (Kuzman
and Ljubešić, 2025), automatic genre identification
(Kuzman et al., 2023), causal commonsense rea-
soning (Ljubešić et al., 2024b), and machine trans-
lation (Hendy et al., 2023). Due to their promis-
ing performance, researchers have even started
using them as data annotators, either by generat-
ing text and labels (Meng et al., 2022) or by an-
notating pre-existing texts (Kuzman and Ljubešić,
2025). Despite the growing interest in this topic,
the majority of evaluations of LLMs used in text
classification tasks are limited only to English (Sun
et al., 2023; Zhang et al., 2025; Kostina et al.,
2025; Zhao et al., 2024). Systematic multilingual
evaluations, especially which would include less-
resourced languages such as those in the South
Slavic group remain limited. Our work addresses
this gap by providing a comparative evaluation of
open-source and closed-source LLMs with openly-
available fine-tuned BERT-like models across four
benchmarkfamiliescomprisingthreediverseclassi-
fication tasks and three different domains in South
Slavic languages and English.
3. Benchmarks
The benchmarks (evaluation datasets) used in this
study cover three text classification tasks, namely,
sentiment identification, topic classification, and
automatic genre identification, and three domains:
parliamentary speeches, news articles and web
texts. An overview of the datasets is provided in
Table 1. The four benchmark families differ signif-
icantly in terms of language coverage, number of
test instances, and label granularity.
The topic classification task is evaluated on two
domains: 1) news articles, namely, the Croat-
ian and Slovenian IPTC test datasets (Kuzman
and Ljubešić, 2025), which comprise around 300
text instances per language, and 2) parliamentary
speeches, namely, the Bosnian, Croatian, English
and Serbian ParlaCAP test datasets that consist of
approximately 820 to 880 instances per language.
In the ParlaCAP benchmarks, an instance is a tran-
scription of an utterance given by a parliamentary
member in a parliamentary session.
The topic classification task involves the high-
est number of labels, that is, 17 news topic labels
from the top level of the IPTC NewsCodes Media
Topic hierarchical schema1(IPTC, 2022), and 22
agenda setting topic labels (21 major topics and a
labelOther)fromtheComparativeAgendasProject
(CAP) (Baumgartner et al., 2019) Master Code-
book (Bevan, 2019).2
In contrast, the Bosnian, Croatian, English, and
SerbianParlaSentsentimentidentificationdatasets
(Mochtak et al., 2024; Mochtak et al., 2023) have
a significantly lower granularity of labels, with
only 3 categories. They are represented by the
largest number of instances, ranging from 190
(Bosnianpart)to2600(Englishpart)sentence-level
instances.
With8labels,theCroatian,English,Macedonian,
and Slovenian GINCO genre datasets (Kuzman
1https://show.newscodes.org/index.
html?newscodes=medtop&lang=en-GB&
startTo=Show
2https://www.comparativeagendas.net/
pages/master-codebooket al., 2023) represent a midpoint in label gran-
ularity among the four benchmark families. How-
ever,thegenreidentificationtaskmightbethemost
difficult one, as genre identification depends on
the interpretation of full texts with the focus on au-
thor’s purpose, the common function of the text,
and the text’s conventional form (Orlikowski and
Yates, 1994). This complexity has also contributed
to smaller test datasets in terms of the number of
text instances, as manual annotation is more time-
consuming. It is also important to note that, unlike
the parliamentary datasets, the English portion of
the genre datasets is not fully comparable to the
South Slavic portions, which are label-balanced
and contain fewer ambiguous instances. Neverthe-
less, the genre datasets remain valuable for evalu-
ating model performance within each language.
All test datasets were manually annotated by an-
notatorsthataredeemedreliablebasedontheirsat-
isfactory inter-annotator agreement, namely, Krip-
pendorff’s alpha (Krippendorff, 2018) values close
to or above the 0.667 threshold for reliable anno-
tation. To prevent large language models from in-
corporating the test datasets during their training
phase, the test datasets are not publicly available,
except for the ParlaSent benchmark family. Ac-
cess to other datasets is granted on request from
the corresponding authors. Further details on the
test datasets are provided in Section A.1 of the
Appendix.
4. Methodology
In this paper, we evaluate the main machine learn-
ing approaches that have recently been used for
our selection of text classification tasks, with the
focus on the comparison between the freely avail-
able fine-tuned BERT models and the open-source
and closed-source LLMs.3The models are evalu-
ated on four families of test datasets that comprise
South Slavic languages. The performance of the
models is evaluated based on the micro-F1 and
macro-F1metrics, whichenableassessmentofthe
model performance at both the instance and label
levels, respectively.
The following machine learning models are in-
cluded in the evaluation:
•dummy classifier: a dummy classifier that
predicts the most frequent class in the train-
ing data. To allow comparison, the dummy
classifiers were trained on the same datasets
that were used for fine-tuning the BERT-like
models, mentioned below.
3The code for the model evaluation
and analysis of results is available at
https://github.com/TajaKuzman/
Benchmarking-Text-Classification-on-South-Slavic .
•fine-tuned BERT-like classifiers: in our
study, we evaluate previously developed
openly accessible multilingual fine-tuned
BERT-like models that have been fine-tuned
for the respective task, namely, the XLM-R-
ParlaSent (Rupnik et al., 2023; Mochtak et al.,
2024) model for sentiment identification in
parliamentary texts, the X-GENRE classifier
(Kuzman et al., 2023; Kuzman and Ljubešić,
2024d, 2023) for automatic genre identifica-
tion, the IPTC News Topic classifier (Kuzman
and Ljubešić, 2025; Kuzman and Ljubešić,
2025) for news topic classification, and the
ParlaCAP classifier (Kuzman Pungeršek and
Ljubešić, 2025) for topic classification in par-
liamentary speeches. The XLM-R-ParlaSent
and the ParlaCAP models are based on the
XLM-R-parla pretrained model (Ljubešić et al.,
2023) that was developed by additionally pre-
training the large-sized XLM-RoBERTa model
(Conneau et al., 2020) on parliamentary pro-
ceedings in 30 European languages (Mochtak
etal.,2024). TheXLM-R-ParlaSentmodelwas
fine-tuned on 13 thousand instances from the
ParlaSentsentimenttrainingdataset(Mochtak
et al., 2023) in seven European languages
(Bosnian, Croatian, Czech, English, Serbian,
Slovak, and Slovenian) (Mochtak et al., 2024),
while the ParlaCAP model was fine-tuned on
around 30 thousand speeches from parlia-
mentary debates annotated with CAP topic
labels, originating from the ParlaMint 4.1 par-
liamentary datasets (Erjavec et al., 2024; Er-
javec et al., 2025) from 29 European parlia-
ments. The X-GENRE classifier is based on
the base-sized XLM-RoBERTa model (Con-
neau et al., 2020) and was fine-tuned on the
trainingsplitoftheX-GENREdataset(Kuzman
andLjubešić,2024a)inEnglishandSlovenian;
while the IPTC News Topic classifier is based
onthelarge-sizedXLM-RoBERTamodel(Con-
neau et al., 2020) that was fine-tuned on the
EMMediaTopic dataset (Kuzman and Ljubešić,
2024c)inCatalan,Croatian,Greek,andSlove-
nian. All fine-tuned models use the same
classes as the test datasets used in our study.
•open-source and closed-source large lan-
guage models: we use closed-source Ope-
nAI models, namely the GPT-3.5-Turbo (gpt-
3.5-turbo-0125 ) (OpenAI, 2023), GPT-4o
(gpt-4o-2024-08-06 ) (OpenAI, 2024) and
the GPT-5 ( gpt-5-2025-08-07 ) (OpenAI,
2025); a closed-source Gemini 2.5 Flash
model(Comanicietal.,2025)byGoogleDeep-
Mind; a closed-source Mistral Medium 3.1
model ( mistral-medium-2508 ) (Mistral AI,
2025)byMistralAI;andfouropen-sourcemod-
els, namely, the Meta LLaMA 3.3 model (Meta,2024), the Gemma 3 model (Gemma Team
et al., 2025), the Qwen 3 model (Yang et al.,
2025), and the DeepSeek-R1-Distill model
(DeepSeek-R1-Distill-Qwen-14B ) (Guo
et al., 2025). It is important to note that while
theLLaMAmodelwaspretrainedonawebtext
collectioninvariouslanguages,itissaidtosup-
port only 8 languages, namely English, Ger-
man, French, Italian, Portuguese, Hindi, Span-
ish, and Thai (Meta, 2024). The DeepSeek-
R1-Distill model is based on the Qwen 2.5
model (Qwen Team, 2024b,a) that provides
support for more than 29 languages – not in-
cluding South Slavic languages though. In
contrast, the Gemma 3 model is reported to
support over 140 languages (Gemma Team
et al., 2025), and the Qwen 3 model was pre-
trained on 119 languages (Yang et al., 2025).
While closed-source models are said to be
massively multilingual, with Gemini 2.5 mod-
els being pretrained on over 400 languages
(Comanici et al., 2025), details on their lan-
guage coverage are very limited.
Open-source models were installed locally and
executed via the Ollama API service (Marić et al.,
2025). OpenAI models were used through the
chat completion endpoint via the OpenAI API,
whereas other closed-source models were ac-
cessed through the OpenRouter platform4that pro-
videsaunifiedAPIaccesstovariousclosed-source
models. To prevent any bias, all models were used
with their default parameters. The only parameter
that we defined is the temperature which we set to
0 to ensure a more deterministic behaviour of the
models. Moredetailsonthemodelsandtheirimple-
mentation, including information on the availability
ofopenlyavailablemodelsandfine-tuningdatasets,
are provided in Section A.2 of the Appendix.
All instruction-tuned LLMs are used in a zero-
shot prompting setup, meaning that they receive
only a task description and label definitions. The
modelsareinstructedtooutputalabel,represented
by a digit. The same prompt per benchmark family
isusedforallLLMs. PromptsareprovidedinFigure
4 in Section A.2 of the Appendix.
5. Results
Inthissection,weevaluatetheperformanceoffine-
tuned BERT-like models and the instruction-tuned
LLMs on a selection of text classification tasks that
include test datasets in South Slavic languages.
First, in Section 5.1, we provide results on the four
benchmark families with the focus on hypothesis
H1, which expects that zero-shot prompting with
LLMs can provide performance that is comparable
4https://openrouter.ai/
(a) Sentiment classification.
 (b) Automatic genre identification.
(c) News topic classification.
 (d) Parliamentary topic classification.
Figure 1: Micro-F1 and macro-F1 scores across models and languages on the test datasets for sentiment
classification (Figure 1a), automatic genre identification (Figure 1b), and topic classification on news
(Figure 1c) and parliamentary speeches (Figure 1d).
to that of fine-tuned BERT-like models. In Section
5.2, we compare in more detail the performance of
the closed-source and open-source LLMs on the
three text classification tasks, which is followed by
a discussion on the advantages and limitations of
LLMs for data annotation based on text classifica-
tion tasks (Section 5.3). Lastly, in Section 5.4, we
compare the performance of LLMs on English test
datasets with their performance on South Slavic
datasets, addressing hypothesis H2, which pre-
sumes that the available multilingual LLMs perform
similarly on South Slavic languages as on English.
5.1.State of the Art in Text Classification
Tasks
Figure 1 provides results of model evaluation on
our selection of text classification tasks. A consis-
tent pattern emerges across all four benchmark
families: LLMs, when used in a zero-shot prompt-
ing setup, achieve some of the highest scores. As
shown in Table 2, which compares model rankings
across tasks, LLMs achieve first place more oftenon average than the fine-tuned BERT-like model.
Figure 1a shows that both open-source and
closed-sourceLLMs,usedinazero-shotprompting
setup on the sentiment identification task, achieve
performance that is comparable or even signifi-
cantlyhighertothatofafine-tunedBERT-likemodel
trained on a large manually-annotated sentiment
dataset. The only models that consistently per-
form worse than the fine-tuned BERT-like model
are GPT-3.5-Turbo and DeepSeek-R1-Distill. Sen-
timent classification appears broad enough that
more potent LLMs can interpret label definitions ef-
fectively without task-specific fine-tuning, reducing
the benefit of additional training.
In contrast, fine-tuned BERT-like models outper-
form most LLMs on automatic genre identification
and topic classification tasks. These tasks depend
on predefined label sets based on specific guide-
lines, and the strong performance of fine-tuned
BERT-like models indicates that domain-specific
fine-tuningonlabelleddatastilloffersanadvantage
over the general knowledge leveraged by LLMs in
zero-shot setups. This advantage is particularly
(a) Sentiment classification.
 (b) Automatic genre identification.
(c) News topic classification.
 (d) Parliamentary topic classification.
Figure2: ComparisonofLLMsusedinazero-shotpromptingfashiononsentimentidentification(Figure2a),
automatic genre identification (Figure 2b), and topic classification on news (Figure 2c) and parliamentary
speeches (Figure 2d).
clear in genre identification for South Slavic texts,
where the fine-tuned BERT-like model significantly
outperforms LLMs. The likely reason for the fine-
tuned model’s very strong performance on South
Slavic genre datasets is the curated nature of the
test data – more challenging examples were re-
moved before and during manual annotation, un-
like in the English genre test dataset where the
instances were randomly sampled from an English
web corpus. Nevertheless, despite this limitation,
the South Slavic test dataset remains valuable for
comparing the performance of LLMs.
To conclude, since some LLMs used in a zero-
shot prompting setup achieve higher or compara-
ble results to fine-tuned BERT-like models across
all classification tasks and languages, as shown
in Table 2, we can confirm hypothesis H1, which
proposed that zero-shot prompting with LLMs can
perform comparably to fine-tuned BERT-like mod-
els.5.2. Comparison of Large Language
Models
Figure 2 shows the performance of open-source
and closed-source LLMs, used via prompting, on
the tasks of sentiment identification, automatic
genre identification, news topic classification, and
parliamentary topic classification. The DeepSeek-
R1-Distill model is not included in the comparison,
asitperformssignificantlyworsethanothermodels,
as shown in Figure 1.
While different models perform best across dif-
ferent languages and test datasets, a clear trend
emerges: thetop-performingmodelsacrossallfour
benchmark families are the closed-source GPT-
4o and GPT-5 from OpenAI, along with Gemini
2.5 Flash. Although GPT-5 is newer and report-
edly more powerful, it does not outperform GPT-
4o on all benchmarks. Among open-source mod-
els, Gemma 3 generally achieves the best results
in sentiment identification (Figure 2a) and news
topicclassification(Figure2c). Forautomaticgenre
identification (Figure 2b) and parliamentary topic
classification (Figure 2d), rankings of open-source
Model RankRank
(EN)Rank
(South
Slavic)
GPT-5 2.291.33 2.55
GPT-4o 2.362.00 2.45
Fine-Tuned BERT-
Like Model3.214.67 2.82
Gemini 2.5 Flash 3.503.33 3.55
Mistral Medium 3.1 5.365.00 5.45
Gemma 3 5.715.67 5.73
LLaMA 3.3 6.006.67 5.82
Qwen 3 7.437.00 7.55
GPT-3.5-Turbo 8.799.00 8.73
DeepSeek-R1-
Distill10.0010.00 10.00
Table 2: Comparison of models based on their
average rank (1 = best-performing, 10 = worst-
performing) across all test datasets (first column),
and averaged across English (second column) or
South Slavic (third column) test datasets.
models vary by language. Overall, the weakest
performance is observed with the older closed-
sourceGPT-3.5-Turbomodel, highlightingtherapid
progress in both open-source and closed-source
model development.
5.3. Advantages and Disadvantages of
LLMs
Figure 3: Comparison of models on the parliamen-
tary topic classification based on their inference
speed (seconds per instance) and performance
(macro-F1 scores), both averaged across all four
languages.
A clear advantage of LLMs is that they do not re-
quire manually-annotated training data for specific
tasks, yet still achieve strong performance when
provided only with task instructions and brief labeldescriptions. However, these models are signifi-
cantly more computationally expensive than fine-
tunedBERT-likemodels. Whileclosed-sourcemod-
els deliver the best performance, as shown in pre-
vious sections, they come with several limitations:
they are costly to use, their architectures and pre-
training data are not publicly disclosed, and access
through APIs hinders reproducibility, in contrast to
open-source LLMs and fine-tuned BERT-like mod-
els.
What is more, the inference speed of all LLMs
is significantly slower than that of a fine-tuned
BERT-like model. As shown in Figure 3, the fine-
tunedBERT-likemodelachievesoneofthehighest
macro-F1 scores on the topic classification task
for parliamentary speeches, while maintaining a
very low inference time of just 0.02 seconds per
instance. In contrast, most LLMs have inference
times between 0.6 and 1.4 seconds per instance,
making them three to seven times slower for an-
notating the same dataset. The slowest model,
GPT-5, takes 5.5 seconds per instance, which ren-
ders it impractical for large-scale automatic anno-
tation of text collections. In this regard, fine-tuned
BERT-like models offer a key advantage due to
theirlowercomputationalcostandhigherinference
speed. Moreover, they can be trained on training
datathatisannotatedbyLLMsusingtherecentlyin-
troduced LLM teacher-student paradigm (Kuzman
and Ljubešić, 2025), which considerably reduces
the effort needed to develop task-specific models.
Another limitation of LLMs, as revealed by the
experiments, is their occasional deviation from the
defined label set. This issue was especially notice-
able in topic classification and, to a lesser extent,
in genre identification. The highest rate of label
hallucination was found in the DeepSeek-R1-Distill
model, which produced non-existing labels for 8%
of instances in the news topic test datasets and 4%
in the genre test dataset. Similar issues were also
observed, though much less frequently (less than
1%), with the LLaMA 3.3, Gemma 3, Qwen 3 and
Mistral Medium 3.1 models. In contrast, fine-tuned
BERT-like models do not suffer from this issue, as
they output probabilities for the predefined classes.
5.4. Performance on English versus on
South Slavic languages
The sentiment identification ParlaSent and the
topic classification ParlaCAP benchmark families
comprise test datasets in South Slavic languages
and English that were constructed with the same
methodology. Thus, they also allow for a compari-
son of the performance of the LLMs on English, a
highly resourced language, with South Slavic lan-
guages, which are significantly less represented
in the pretraining and instruction-tuning datasets
Model Difference
(sentiment)Difference
(topic)
GPT-5 0.02 0.05
GPT-4o 0.04 0.08
Gemini 2.5
Flash0.04 0.08
Gemma 3 0.05 0.07
LLaMA 3.3 0.05 0.07
Mistral
Medium
3.10.07 0.09
Qwen 3 0.07 0.10
GPT-3.5-
Turbo0.07 0.03
Table 3: Difference between model performance in
macro-F1 scores obtained on sentiment and topic
classification in parliamentary texts on English ver-
sus the average macro-F1 scores on South Slavic
languages.
used to develop large language models.
As shown in Figures 2a and 2d, LLMs gener-
ally perform worse on Bosnian compared to other
languages. However, as shown in Table 3, the dif-
ferences in macro-F1 scores between English and
the average of macro-F1 scores for South Slavic
languages are relatively small for sentiment identi-
fication, ranging from 2 to 7 points. For topic clas-
sification, the performance gap is slightly larger,
ranging from 3 to 10 points. This is likely due to
the increased difficulty of the task, which involves
greater label granularity: 22 labels compared to
just 3 in sentiment classification. These findings
partially confirm hypothesis H2, which stated that
LLMs, when used in a zero-shot setup, perform
comparably on text classification tasks in South
Slavic languages as they do on English.
Interestingly, even the open-source LLaMA 3.3
model – reported to support only eight languages,
excluding the South Slavic group – does not show
a substantial performance drop when applied to
South Slavic languages compared to English.
6. Conclusion
In this paper, we evaluated how well current ma-
chine learning technologies handle text classifica-
tiontasksinSouthSlaviclanguages. Wecompared
fine-tunedBERT-likemodelswithdecoder-onlygen-
erativelargelanguagemodels(LLMs)thatareused
in a zero-shot prompting setup across three tasks
and three text domains: sentiment classification in
parliamentary texts, news topic classification, topic
classification in parliamentary texts, and automaticgenre identification on web texts.
Our results show that LLMs used with prompting,
where only a brief task description and labels were
provided, achieved strong results across all tasks
andlanguages,particularlytheclosed-sourceGPT-
4o (OpenAI, 2024), GPT-5 (OpenAI, 2025) and
Gemini 2.5 Flash (Comanici et al., 2025) models.
The performance of LLMs is comparable or higher
to that of fine-tuned BERT-like models specialized
for the tasks. On the sentiment identification task,
most open-source and closed-source LLMs outper-
formed the fine-tuned model, demonstrating strong
general knowledge of the notion of sentiment. For
genre and topic classification, however, fine-tuning
BERT-likemodelsremainbeneficial,asthesetasks
rely on predefined label sets and fine-tuning aligns
themodelsmorecloselywiththetaskrequirements.
Interestingly, LLMs perform similarly in English
and South Slavic languages, with rather minor
drops in micro- and macro-F1 scores, namely a
drop of 2 to 7 points in terms of macro-F1 scores
on sentiment classification, and a slightly higher
drop from 3 to 10 points on topic classification in
parliamentary texts. This suggests that the gap in
multilingual performance is smaller than expected,
even for open-source models not explicitly dedi-
cated to these languages.
Althoughlargelanguagemodelsofferimpressive
zero-shot performance and reduce the need for an-
notated data, they come with higher computational
costs and are more prone to producing invalid la-
bels. Moreover, their inference speed is at least
threetimesslowerthanthatofthefine-tunedBERT-
like models. Thus, their use in use cases with ex-
tensive data to be processed, such as automatic
enrichment of large corpora with text categories,
remainsimpracticalduetotheirhighcomputational
demands. In contrast, fine-tuned BERT-like mod-
els are more computationally efficient and can be
better tailored to the specific characteristics of a
task and its domain. They remain a practical and
reliable choice for text classification tasks, espe-
cially when computational resources are limited,
high inference speed is desired or output reliabil-
ity is critical. Moreover, it is possible to combine
the strengths of both approaches, as proposed by
the LLM teacher-student pipeline paradigm (Kuz-
man and Ljubešić, 2025): LLMs can be used to
automatically annotate training data, reducing the
need for costly and time-consuming manual anno-
tation, while fine-tuned BERT-like models can then
be trained on these datasets.
This study represents only an initial step to sys-
tematically benchmark text classification perfor-
mance in South Slavic languages. Although our
evaluation includes four diverse benchmark fam-
ilies, some of the test datasets remain relatively
small. Future work will aim to increase dataset
sizes, include more South Slavic languages and di-
alects, and introduce additional classification tasks.
As new large language models continue to emerge
rapidly, it will also be important to establish ongo-
ing evaluations to track whether their performance
continues to improve, particularly on South Slavic
languages. Importantly, this study only evaluated
the performance of LLMs in a zero-shot prompting
setup. In future work, we plan to extend the evalu-
ation to include few-shot prompting and fine-tuning
on training data. To support further research and
facilitate reproducibility, we have made all code,
evaluation scripts, and results publicly available.5
7. Ethical Considerations and
Limitations
Our study has several limitations that should be
acknowledged. First, while we aimed to include
a broad set of South Slavic languages, some –
most notably Bulgarian – were not covered in our
experiments. We assume that the performance
on Bulgarian would be similar to that observed
for Macedonian, given their close linguistic prox-
imity, or the results for Bulgarian could be slightly
better, as Macedonian is comparatively more low-
resourced. Moreover, due to the high computa-
tional cost of evaluating the LLMs on all the test
datasets and the financial cost associated with the
useofclosed-sourcemodels,eachmodelwaseval-
uated on each dataset only once. This setup pre-
vents us from fully estimating the variance of the
results, however, based on our preliminary experi-
ments,weexpectthisvariancetoberelativelysmall.
Finally, the scope of our evaluation remains limited
in terms of test datasets, language coverage and
tasks. Expanding the range of benchmarks would
allow for a more comprehensive validation of our
findings, particularly regarding the hypothesis that
LLMs can perform on par with fine-tuned BERT-
like models across diverse natural language under-
standing tasks, languages and language varieties.
8. Acknowledgements
We would like to thank the developers of the
llm.ijs.siservice(Marićetal.,2025)forestablishing
the LLM inference platform deployed at the Jožef
Stefan Institute, which provided convenient access
to the open-source large language models used
in this study. We also thank the annotators of the
test datasets for their diligence and the time de-
voted to manual annotation, which resulted in the
high-quality evaluation datasets used in this work.
5https://github.com/TajaKuzman/
Benchmarking-Text-Classification-on-South-SlavicLastly, we would like to thank the CLASSLA knowl-
edge centre for South Slavic languages and the
Slovenian CLARIN.SI infrastructure for their valu-
able support.
This work was supported in part by the projects
“Spoken Language Resources and Speech Tech-
nologies for the Slovenian Language” (Grant J7-
4642), “Large Language Models for Digital Hu-
manities” (Grant GC-0002), the research pro-
gramme “Language Resources and Technologies
for Slovene” (Grant P6-0411), all funded by the
ARIS Slovenian Research and Innovation Agency,
and the research project “Embeddings-based tech-
niques for Media Monitoring Applications” (L2-
50070), co-funded by the Kliping d.o.o. agency.
The authors acknowledge the OSCARS project
– and its ParlaCAP cascading grant project –,
which has received funding from the European
Commission’s Horizon Europe Research and In-
novation programme under grant agreement No.
101129751.
9. Bibliographical References
Marta Bañón, Miquel Esplà-Gomis, Mikel L For-
cada, Cristian García-Romero, Taja Kuzman,
Nikola Ljubešić, Rik van Noord, Leopoldo Pla
Sempere, Gema Ramírez-Sánchez, Peter Rup-
nik, et al. 2022. MaCoCu: Massive collection
and curation of monolingual and bilingual data:
focus on under-resourced languages. In23rd
Annual Conference of the European Association
for Machine Translation, pages 301–302.
Frank R Baumgartner, Christian Breunig, and Emil-
ianoGrossman.2019.ComparativePolicyAgen-
das: Theory, Tools, Data. Oxford University
Press.
Shaun Bevan. 2019. Gone Fishing.Comparative
Policy Agendas: Theory, Tools, Data, pages 17–
34.
Gheorghe Comanici, Eric Bieber, Mike Schaeker-
mann, Ice Pasupat, Noveen Sachdeva, Inderjit
Dhillon, Marcel Blistein, Ori Ram, Dan Zhang,
Evan Rosen, et al. 2025. Gemini 2.5: Pushing
thefrontierwithadvancedreasoning,multimodal-
ity, long context, and next generation agentic ca-
pabilities.arXiv preprint arXiv:2507.06261.
Alexis Conneau, Kartikay Khandelwal, Naman
Goyal, Vishrav Chaudhary, Guillaume Wen-
zek, Francisco Guzmán, Édouard Grave, Myle
Ott, Luke Zettlemoyer, and Veselin Stoyanov.
2020. Unsupervised Cross-lingual Represen-
tation Learning at Scale. In58th Annual Meeting
of the Association for Computational Linguistics,
pages 8440–8451.
OrphéeDeClercq,LunaDeBruyne,andVéronique
Hoste. 2020. News topic classification as a
first step towards diverse news recommenda-
tion.Computational Linguistics in the Nether-
lands Journal, 10:37–55.
TomažErjavec,MatyášKopp,NikolaLjubešić,Taja
Kuzman, Paul Rayson, Petya Osenova, Maciej
Ogrodniczuk, Çağrı Çöltekin, Danijel Koržinek,
Katja Meden, et al. 2025. ParlaMint II: advanc-
ing comparable parliamentary corpora across
Europe.Language Resources and Evaluation,
59(3):2071–2102.
Tomaž Erjavec, Matyáš Kopp, Maciej Ogrodniczuk,
Petya Osenova, et al. 2024. Multilingual com-
parable corpora of parliamentary debates Par-
laMint 4.1. Slovenian language resource reposi-
tory CLARIN.SI.
Gemma Team, Aishwarya Kamath, Johan Fer-
ret, Shreya Pathak, Nino Vieillard, Ramona
Merhej, Sarah Perrin, Tatiana Matejovicova,
Alexandre Ramé, Morgane Rivière, et al. 2025.
Gemma 3 technical report.arXiv preprint
arXiv:2503.19786.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao
Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu,
Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025.
DeepSeek-R1: Incentivizing Reasoning Capabil-
ity in LLMs via Reinforcement Learning.arXiv
preprint arXiv:2501.12948.
Amr Hendy, Mohamed Abdelrehim, Amr Sharaf,
Vikas Raunak, Mohamed Gabr, Hitokazu Mat-
sushita, Young Jin Kim, Mohamed Afify, and
Hany Hassan Awadalla. 2023. How Good
Are GPT Models at Machine Translation? A
Comprehensive Evaluation.arXiv preprint
arXiv:2302.09210.
Fan Huang, Haewoon Kwak, and Jisun An. 2023.
Is ChatGPT better than human annotators? Po-
tential and limitations of ChatGPT in explaining
implicit hate speech. InCompanion Proceed-
ings of the ACM Web Conference 2023, pages
294–297.
IPTC. 2022. Groups of NewsCodes. https://iptc.
org/standards/newscodes/groups/#descrncd.
Accessed October 29, 2024.
Miloš Jakubíček, Adam Kilgarriff, Vojtěch Kovář,
Pavel Rychl `y, and Vít Suchomel. 2013. The Ten-
Ten corpus family. In7th international corpus
linguistics conference CL, pages 125–127. Lan-
caster University.Jacob Devlin Ming-Wei Chang Kenton and
LeeKristinaToutanova.2019.BERT:Pre-training
ofDeepBidirectionalTransformersforLanguage
Understanding.Proceedings of NAACL-HLT,
pages 4171–4186.
Arina Kostina, Marios D Dikaiakos, Dimosthenis
Stefanidis, and George Pallis. 2025. Large
language models for text classification: Case
study and comprehensive review.arXiv preprint
arXiv:2501.08457.
Klaus Krippendorff. 2018.Content analysis: An
introduction to its methodology. Sage Publica-
tions.
TajaKuzmanandNikolaLjubešić.2023. Automatic
genre identification: a survey.Language Re-
sources and Evaluation, pages 1–34.
Taja Kuzman and Nikola Ljubešić. 2023. Multilin-
gual text genre classification model X-GENRE.
Hugging Face.
Taja Kuzman and Nikola Ljubešić. 2024a. English-
Slovenian text genre dataset X-GENRE. Slove-
nianLanguageResourceRepositoryCLARIN.SI.
Taja Kuzman and Nikola Ljubešić. 2024b. Genre-
enriched web corpora MaCoCu-Genre. Slove-
nianLanguageResourceRepositoryCLARIN.SI.
Taja Kuzman and Nikola Ljubešić. 2024c. Multilin-
gual IPTC Media Topic dataset EMMediaTopic
1.0. Slovenian Language Resource Repository
CLARIN.SI.
Taja Kuzman and Nikola Ljubešić. 2024d. Mul-
tilingual text genre classification model X-
GENRE. Slovenian language resource repos-
itory CLARIN.SI.
Taja Kuzman and Nikola Ljubešić. 2025. Multilin-
gual IPTC News Topic Classifier. Hugging Face.
Taja Kuzman and Nikola Ljubešić. 2025. LLM
Teacher-Student Framework for Text Classifica-
tion With No Manually Annotated Data: A Case
Study in IPTC News Topic Classification.IEEE
Access, 13:35621–35633.
Taja Kuzman, Igor Mozetič, and Nikola Ljubešić.
2023. Automatic Genre Identification for Robust
Enrichment of Massive Text Collections: Inves-
tigation of Classification Methods in the Era of
Large Language Models.Machine Learning and
Knowledge Extraction, 5(3):1149–1175.
Taja Kuzman, Peter Rupnik, and Nikola Ljubešić.
2022. The GINCO Training Dataset for Web
Genre Identification of Documents Out in the
Wild. InLanguage Resources and Evalua-
tion Conference, pages 1584–1594, Marseille,
France. European Language Resources Associ-
ation.
TajaKuzmanPungeršekandNikolaLjubešić.2025.
Multilingual ParlaCAP model for CAP Topic Clas-
sification in Parliamentary Speeches. Hugging
Face.
NikolaLjubešić, NadaGalant, SonjaBenčina, Jaka
Čibej, Stefan Milosavljević, Peter Rupnik, and
Taja Kuzman. 2024a. DIALECT-COPA: Ex-
tending the Standard Translations of the COPA
Causal Commonsense Reasoning Dataset to
South Slavic Dialects. InProceedings of the
Eleventh Workshop on NLP for Similar Lan-
guages, Varieties, and Dialects (VarDial 2024),
pages 89–98.
Nikola Ljubešić, Taja Kuzman, Peter Rupnik, Ivan
Vulić,FabianSchmidt,andGoranGlavaš.2024b.
JSI and WüNLP at the DIALECT-COPA Shared
Task: In-Context Learning From Just a Few Di-
alectal Examples Gets You Quite Far. InPro-
ceedings of the Eleventh Workshop on NLP for
Similar Languages, Varieties, and Dialects (Var-
Dial 2024), pages 209–219.
Nikola Ljubešić, Peter Rupnik, and Rik van Noord.
2023. Multilingual parliamentary model XLM-R-
parla. Hugging Face.
Nikola Marić, Boshko Koloski, Damjan Demšar,
Jan Jona Javoršek, and Sašo Džeroski. 2025.
Running large language models locally: design
and operational insights with llm.ijs.si.Interna-
tional conference AI for science 2025: Ljubljana,
Slovenia, 22.09.2025-26.09.2025, page 77.
Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei
Han. 2022. Generating training data with lan-
guage models: Towards zero-shot language un-
derstanding.Advances in Neural Information
Processing Systems, 35:462–477.
Meta. 2024. Llama 3.3 Model Card. https:
//github.com/meta-llama/llama-models/blob/
main/models/llama3_3/MODEL_CARD.md.
Accessed: June 26, 2025.
Shervin Minaee, Nal Kalchbrenner, Erik Cambria,
Narjes Nikzad, Meysam Chenaghlu, and Jian-
feng Gao. 2020. Deep Learning Based Text
Classification: A Comprehensive Review.arXiv
preprint arXiv:2004.03705.
Mistral AI. 2025. Medium is the new large. https:
//mistral.ai/news/mistral-medium-3. Accessed:
October 10, 2025.
Michal Mochtak, Peter Rupnik, Taja Kuzman, and
Nikola Ljubešić. 2025. Parlasent: mappingsentiment in political discourse with large lan-
guage models.Political Research Exchange,
7(1):2508377.
Michal Mochtak, Peter Rupnik, and Nikola Ljubešić.
2024. The ParlaSent Multilingual Training
DatasetforSentimentIdentificationinParliamen-
tary Proceedings. InProceedings of the 2024
Joint International Conference on Computational
Linguistics,LanguageResourcesandEvaluation
(LREC-COLING 2024), pages 16024–16036.
Michal Mochtak, Peter Rupnik, Katja Meden, and
Nikola Ljubešić. 2023. The multilingual senti-
mentdatasetofparliamentarydebatesParlaSent
1.0. Slovenian language resource repository
CLARIN.SI.
OpenAI. 2023. ChatGPT General FAQ.
https://help.openai.com/en/articles/
6783457-chatgpt-general-faq. Accessed:
June 26, 2025.
OpenAI. 2024. Hello GPT-4o. https://openai.com/
index/hello-gpt-4o/. Accessed September 11,
2024.
OpenAI. 2025. Introducing GPT-5. https://openai.
com/index/introducing-gpt-5/. Accessed: Octo-
ber 10, 2025.
Wanda J Orlikowski and JoAnne Yates. 1994.
Genre repertoire: The structuring of communica-
tive practices in organizations.Administrative
science quarterly, pages 541–574.
Alina Petukhova and Nuno Fachada. 2023. MN-
DS:Amultilabelednewsdatasetfornewsarticles
hierarchical classification.Data, 8(5):74.
Qwen Team. 2024a. Qwen2 technical report.arXiv
preprint arXiv:2407.10671.
Qwen Team. 2024b. Qwen2.5: A party of founda-
tion models. Accessed: June 26, 2025.
Peter Rupnik, Nikola Ljubešić, and Michal Mochtak.
2023. Multilingual parliament sentiment regres-
sion model XLM-R-ParlaSent. Hugging Face.
Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei
Guo, Tianwei Zhang, and Guoyin Wang. 2023.
Text Classification via Large Language Models.
InFindings of the Association for Computational
Linguistics: EMNLP 2023, pages 8990–9005.
Ashish Vaswani, Noam Shazeer, Niki Parmar,
Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. 2017. Atten-
tion is all you need.Advances in neural informa-
tion processing systems, 30.
An Yang, Anfeng Li, Baosong Yang, Beichen
Zhang, Binyuan Hui, Bo Zheng, Bowen Yu,
Chang Gao, Chengen Huang, Chenxu Lv, et al.
2025. Qwen3 technical report.arXiv preprint
arXiv:2505.09388.
BowenZhang,DaijunDing,LiwenJing,GenanDai,
andNanYin.2022. HowwouldStanceDetection
TechniquesEvolveaftertheLaunchofChatGPT?
arXiv preprint arXiv:2212.14548.
Yazhou Zhang, Mengyao Wang, Qiuchi Li, Prayag
Tiwari, and Jing Qin. 2025. Pushing the limit of
LLM capacity for text classification. InCompan-
ion Proceedings of the ACM on Web Conference
2025, pages 1524–1528.
Hang Zhao, Qile P Chen, Yijing Barry Zhang,
and Gang Yang. 2024. Advancing Single
and Multi-task Text Classification through Large
Language Model Fine-tuning.arXiv preprint
arXiv:2412.08587.
A. Appendix
A.1. Benchmarking Datasets
In this section, we provide additional information
on the datasets used for benchmarking the models
on sentiment identification, topic classification, and
genre identification tasks in this study.
ParlaSenttestdatasetsforsentimentclassifica-
tion in parliamentary speechesinclude Croa-
tian, Serbian, Bosnian, and English data from
the multilingual sentiment dataset of parliamen-
tary debates ParlaSent 1.0 (Mochtak et al., 2024,
Mochtak et al., 2023).6The dataset comprises
sentences that were randomly sampled from Croa-
tian, Serbian, Bosnian and British parliamentary
corporaandmanuallyannotatedwithreportedinter-
annotator agreement ranging from 0.53 to 0.66 in
Krippendorff’s alpha (Krippendorff, 2018). The an-
notation involved a more granular six-level senti-
ment polarity scale that has been mapped to a
three-level sentiment polarity scale which we use
in our experiments: negative (0), neutral (1), and
positive (2).
GINCO datasets for automatic genre identifi-
cationcomprise the English EN-GINCO dataset
(Kuzman et al., 2023) and a multilingual X-GINCO
dataset from the AGILE benchmark for Automatic
Genre Identification.7The test instances were
sampled from the enTenTen20 English web cor-
pus (Jakubíček et al., 2013) and the MaCoCu
multilingual web corpus collection (Bañón et al.,
2022). They were manually annotated by ex-
perts with a background in linguistics and compu-
tational linguistics who had experience with previ-
ous genre annotation campaigns (Kuzman et al.,
2022, 2023) where they reached an acceptable
inter-annotator agreement of 0.71 in nominal Krip-
pendorff’s alpha (Krippendorff, 2018). While the
X-GINCO dataset comprises numerous European
languages, for the purposes of this study, we fo-
cus on three South Slavic languages: Croatian,
Macedonian, and Slovenian. The test datasets
use the X-GENRE annotation schema (Kuzman
etal.,2023)thatincludesthefollowinggenrelabels:
Information/Explanation,News,Instruction,Opin-
ion/Argumentation,Forum,Prose/Lyrical,Legal
andPromotion. While EN-GINCO and X-GINCO
datasets have been annotated by the same anno-
tator with the same schema, one should note that
6Available in the CLARIN.SI repository at http://
hdl.handle.net/11356/1585 and in the Hugging Face
repository at https://huggingface.co/datasets/classla/
ParlaSent.
7https://github.com/TajaKuzman/
AGILE-Automatic-Genre-Identification-Benchmarkthere are important differences between them in
termsoftheirconstruction–theEnglishtestdataset
wassampledrandomlyfromthewebcorpus,result-
ing in an unbalanced label distribution, while the X-
GINCOdatasetswerecuratedwithmoredeliberate
interventions to ensure a balanced label distribu-
tion and a more controlled sampling process. Con-
sequently, the X-GINCO datasets comprise fewer
ambiguous instances and could be regarded as an
easier test dataset.
IPTC News Topic test datasets(Kuzman and
Ljubešić, 2025) comprise Croatian and Slovenian
news articles extracted from the MaCoCu-Genre
web corpus collection (Kuzman and Ljubešić,
2024b) and manually annotated by one annotator.
The reliability of the annotator was confirmed on a
sample of data that was annotated by an additional
annotator. The two annotators reached an accept-
able inter-annotator agreement of 0.73 in nominal
Krippendorff’s alpha (Krippendorff, 2018). Text in-
stances are annotated with 17 topic labels from the
top level of the IPTC NewsCodes Media Topic hi-
erarchical schema, developed by the International
Press Telecommunications Council (IPTC) (IPTC,
2022). The datasets are more or less balanced by
labels.
ParlaCAP test datasetscomprise parliamentary
speeches in Bosnian, Croatian, English, and Ser-
bian, sourced from the ParlaMint 4.1 dataset (Er-
javec et al., 2024; Erjavec et al., 2025). These
speeches were annotated by a single expert anno-
tator using the 21 CAP categories from the official
CAP schema (Baumgartner et al., 2019), along
withanadditionalOtherlabel. Thedatasetsareap-
proximately balanced across labels. To assess the
annotation quality, the Croatian dataset was inde-
pendently annotated by two additional annotators.
Inter-annotator agreement between the expert an-
notator and the others ranged from 0.62 to 0.68 in
Krippendorff’s alpha, which is around the threshold
of 0.67 typically considered acceptable for annota-
tion reliability (Krippendorff, 2018).
A.2. Models
In the following subsections, we outline the models
included in the evaluation – the fine-tuned BERT-
like classifiers (Section A.2.1) and the open-source
and closed-source LLMs (Section A.2.2).
A.2.1. Fine-Tuned BERT-like Models
BERT (bidirectional encoder representations from
transformers) deep neural models (Kenton and
Toutanova, 2019) have revolutionized the field of
natural language processing (NLP), outperform-
ing the non-neural methods across various NLP
tasks. Theyhaveamorecomplexandcomputation-
ally expensive architecture featuring transformers
– neural networks with self-attention mechanisms
(Vaswani et al., 2017) – that significantly improves
the efficiency of training models on massive text
data. Similarlytodecoder-onlytransformermodels,
BERT models are pretrained on massive amounts
of texts, possibly in multiple languages, which es-
tablishes their ability to encode the words and texts
in high-dimensional vector spaces (Minaee et al.,
2020) and enables their application even across
languages in a zero-shot classification scenario.
To develop BERT-based classifiers, the pretrained
models are trained, that is, fine-tuned, on a training
dataset comprising text instances annotated with
labels. In our study, we evaluate openly-accessible
multilingual fine-tuned BERT-like models that have
been already developed in recent related research.
Namely, we evaluate the following models:
•IPTC News Topic classifier8(Kuzman and
Ljubešić, 2025) is a multilingual fine-tuned
BERT-like model for news topic classification
according to the top-level IPTC NewsCodes
schema (IPTC, 2022). The model is based on
the large-sized XLM-RoBERTa model (Con-
neau et al., 2020) and was fine-tuned on
15,000 training text instances from the EM-
MediaTopic9dataset (Kuzman and Ljubešić,
2024c). The training dataset contains news
article instances in four languages: Catalan,
Croatian, Greek, and Slovenian. The training
dataset was annotated using an LLM that was
shown to achieve annotation reliability compa-
rabletothatofhumanannotators(Kuzmanand
Ljubešić,2025). Thisapproachisbasedonthe
novel methodology that uses the LLM teacher-
student pipeline to develop BERT-like classi-
fiers in the absence of manually-annotated
training data.
•XLM-R-ParlaSent(Rupnik et al., 2023;
Mochtak et al., 2024) is a domain-specific mul-
tilingual transformer model for sentiment iden-
tification in parliamentary texts. It is based on
the XLM-R-parla pretrained model (Ljubešić
et al., 2023) that was developed by addition-
ally pretraining the large-sized XLM-RoBERTa
model (Conneau et al., 2020) on 1.72 billion
words from parliamentary proceedings in 30
European languages. To develop the XLM-
R-ParlaSent model,10the pretrained XLM-R-
Parla model was fine-tuned on the ParlaSent
8The IPTC News Topic classifier is available in
the Hugging Face repository at https://huggingface.co/
classla/multilingual-IPTC-news-topic-classifier.
9The EMMediaTopic training dataset is available in
the CLARIN.SI repository at http://hdl.handle.net/11356/
1991.
10The XLM-R-ParlaSent model is accessible in thesentiment training dataset (Mochtak et al.,
2024; Mochtak et al., 2023) in seven Euro-
pean languages (Bosnian, Croatian, Czech,
English, Serbian, Slovak, and Slovenian). The
training dataset11comprises 13,000 instances
sampled from parliamentary proceedings and
manually annotated with sentiment labels.
•ParlaCAP classifier12(Kuzman Pungeršek
and Ljubešić, 2025) is a domain-specific multi-
lingual transformer model for topic classifica-
tion in parliamentary texts based on the CAP
schema (Baumgartner et al., 2019). As the
XLM-R-ParlaSent model, this model is based
ontheXLM-R-parlapretrainedmodel(Ljubešić
et al., 2023; Mochtak et al., 2024). The XLM-
R-parla model was then fine-tuned on around
30 thousand speeches from parliamentary de-
bates from the ParlaMint 4.1 parliamentary
datasets (Erjavec et al., 2024; Erjavec et al.,
2025) in 29 European languages. The train-
ing dataset was annotated with the CAP cat-
egories by a GPT-4o (OpenAI, 2024) model
used in a zero-shot prompting fashion, follow-
ing the LLM teacher-student framework (Kuz-
man and Ljubešić, 2025). Based on the inter-
annotator agreement, calculated on a sample
that was annotated by three human annota-
tors and the LLM annotator, the agreement
between the LLM and the human annotators
wascomparabletotheagreementbetweenthe
human annotators themselves. This indicates
that the LLM performs as reliably as human
annotators on this task, supporting its use for
annotating the training data.
•X-GENRE classifier(Kuzman et al., 2023;
Kuzman and Ljubešić, 2024d) is a multilin-
gual fine-tuned BERT-like model for automatic
genre identification.13The model is based on
the base-sized XLM-RoBERTa model (Con-
neau et al., 2020) and was fine-tuned on the
training split of the X-GENRE dataset (Kuz-
man and Ljubešić, 2024a), which contains
1,772 text instances in Slovenian and English,
manually-annotated with genre labels from the
X-GENRE schema (Kuzman et al., 2023).
Hugging Face repository at https://huggingface.co/
classla/xlm-r-parlasent.
11The ParlaSent training and test datasets are freely
availableintheCLARIN.SIrepositoryathttp://hdl.handle.
net/11356/1868.
12The ParlaCAP topic classifier is available in the Hug-
ging Face repository at https://huggingface.co/classla/
ParlaCAP-Topic-Classifier.
13The X-GENRE classifier is freely available in the
Hugging Face repository at https://doi.org/10.57967/hf/
0927 and the CLARIN.SI repository at http://hdl.handle.
net/11356/1961.
A.2.2. Instruction-Tuned Large Language
Models
As the BERT models, decoder-only large language
models are based on a transformer deep neural
architecture and are pretrained on massive text
collections. However, while the development of
fine-tuned BERT-like classifiers necessitates large
amounts of annotated training data, recent ad-
vances in the field have shown that the instruction-
tuned LLMs are capable of text classification in a
zero-shotorfew-shotpromptingsetupswhichdoes
not require any training data. We assess the per-
formance of the following large language models:
•OpenAI models, namely the GPT-3.5-Turbo
(gpt-3.5-turbo-0125 ) (OpenAI, 2023),
GPT-4o ( gpt-4o-2024-08-06 ) (OpenAI,
2024) and the GPT-5 ( gpt-5-2025-08-
07) (OpenAI, 2025). These closed-source
instruction-tuned LLMs were developed by
OpenAI. OpenAI states that the models are
trainedonlargemultilingualwebcorpora,how-
ever, specific details about the training data,
procedures, and architecture are not publicly
known.
•Gemini 2.5 Flash model(Comanici et al.,
2025)isaclosed-sourcemultilingualandmulti-
modal instruction-tuned LLM by Google Deep-
Mind. The model is reported to be pretrained
onover400languages(Comanicietal.,2025),
however,detailsonthelanguagecoverageare
not available.
•Mistral Medium 3.1 model( mistral-
medium-2508 ) (Mistral AI, 2025) is a closed-
source multimodal instruction-tuned model by
MistralAI.Availabledetailsonthemodelarchi-
tecture and language coverage are very lim-
ited.
•LLaMA 3.3 model14(Meta, 2024) is an open-
source instruction-tuned multilingual LLM, de-
veloped by Meta, with 70 billion parameters.
The model was pretrained on a web text col-
lection in various languages, however, it is re-
ported to support only 8 languages, namely,
English, German, French, Italian, Portuguese,
Hindi, Spanish, and Thai.
•Gemma 3 model15(Gemma Team et al.,
2025) is an open-source multilingual
instruction-tuned LLM, developed by Google
DeepMind. The model was pretrained on
multimodal data with large quantities of
multilingual texts and is reported to support
over 140 languages. We use the model in 27
billion parameter size.
14https://ollama.com/library/llama3.3
15https://ollama.com/library/gemma3•DeepSeek-R1-Distill16(Guo et al., 2025) is
an open-source reasoning LLM, developed by
DeepSeek AI. We use the distilled model in 14
billion parameter size, namely the DeepSeek-
R1-Distill-Qwen-14B model. The model
is based on the Qwen 2.5 model (Qwen
Team, 2024b,a) that was fine-tuned using a
datasetcuratedwiththeDeepSeek-R1reason-
ing model. The Qwen 2.5 model provides mul-
tilingual support for over 29 languages, includ-
ing Chinese, English, French, Spanish, Por-
tuguese, German, Italian, Russian, Japanese,
Korean, Vietnamese, Thai, and Arabic.
•Qwen 317(Qwen3-2504 ) (Yang et al., 2025)
is an open-source LLM, developed by Alibaba
Cloud. Weusethemodelwiththe32billionpa-
rameter size, namely, the qwen3:32b model.
The model is said to support over 100 lan-
guages and dialects (Yang et al., 2025).
Open-source models were installed locally and
executed via the Ollama API service (Marić et al.,
2025). We use the quantized versions of the mod-
els as they are available through the Ollama li-
brary.18OpenAI models are used through the
chat completion endpoint via the OpenAI API,
whereas other closed-source models were ac-
cessed through the OpenRouter platform19that
provides a unified API access to various closed-
source models.
To prevent any bias, all models were used with
their default parameters. The only parameter that
we defined is the temperature which we set to 0
to ensure a more deterministic behaviour of the
models. The same prompts were used for all open-
source and closed-source models. In Figure 4, we
provide prompts that were provided to the LLMs
for zero-shot text classification, namely for senti-
ment classification (Figure 4a), automatic genre
identification (Figure 4b), news topic classification
(Figure 4c) and topic classification in parliamentary
speeches (Figure 4d). For more details on the se-
tups used to apply fine-tuned BERT-like models
and instruction-tuned LLMs to the test datasets,
refer to the code published on GitHub.20
16https://ollama.com/library/deepseek-r1:14b
17https://ollama.com/library/qwen3
18https://ollama.com/library
19https://openrouter.ai/
20https://github.com/TajaKuzman/
Benchmarking-Text-Classification-on-South-Slavic
(a) Sentiment classification.
 (b) Automatic genre identification.
(c) News topic classification.
 (d) Parliamentary topic classification.
Figure 4: The prompts that are provided to the LLMs for the sentiment identification task (Figure 4a),
automatic genre identification (Figure 4b), and topic classification on news (Figure 4c) and parliamentary
speeches(Figure4d). Thepromptscomprisethedescriptionofthetaskandlabelswithashortdescription.