JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
An Adversarial Robust Behavior Sequence
Anomaly Detection Approach Based on Critical
Behavior Unit Learning
Dongyang Zhan, Member, IEEE, Kai Tan, Lin Ye*, Xiangzhan Yu, Hongli Zhang, and Zheng He
Abstract—Sequential deep learning models (e.g., RNN and LSTM) can learn the sequence features of software behaviors, such as
API or syscall sequences. However, recent studies have shown that these deep learning-based approaches are vulnerable to
adversarial samples. Attackers can use adversarial samples to change the sequential characteristics of behavior sequences and
mislead malware classifiers. In this paper, an adversarially robust
anomaly detection method based on the analysis of behavior units
is proposed to overcome this problem. We extract related behaviors
that together fulfill a behavioral intention as a behavior unit, which
contains the representative semantic information of local behaviors and can be used to improve the robustness of behavior analysis.
By learning the overall semantics of each behavior unit and the contextual relationships among behavior units based on a multilevel
deep learning model, our approach can mitigate perturbation attacks that target local and large-scale behaviors. In addition, our
approach can be applied to both low-level and high-level behavior logs (e.g., API and syscall logs). The experimental results show that
our approach outperforms all compared methods, indicating
better robustness against obfuscation attacks.
Index Terms—Adversarial attacks, anomaly detection, deep learning, behavior unit extraction, malware detection.
✦
1 INTRODUCTION
The amount of malware is growing rapidly. Malware
such as ransomware and Trojans is also evolving quickly
and has become one of the most serious threats in
cyberspace [1]. Therefore, detecting malware is very important
for cybersecurity.
To improve the efficiency of malware detection, secu-
rity researchers have proposed many detection methods
based on machine/deep learning technology. Deep neural
networks have been shown to enable efficient and accurate
malware classification[2]. Existing machine/deep learning-
based malware detection and classification systems mainly
learn features by static analysis of executable files[3] or
dynamic behavior analysis [4, 5]. However, static analysis
methods are insufficient against code adversarial technologies
(e.g., code obfuscation, dynamic code loading, and shelling) [5].
In contrast, dynamic analysis approaches [5] can overcome
these problems by tracking and analyzing the execution
processes of the target programs. Specifically, these ap-
proaches usually leverage virtual environments or kernel
modules to trace program execution and record sequential
behavior logs (e.g., APIs or syscalls). During the execution of
a program/system, it will trigger many events (e.g., syscalls
and APIs), which can be intercepted or recorded for security
analysis. We define these events as behaviors. There are
many dynamic analysis approaches proposed by security
researchers, such as rule-based approaches [6] and machine-
learning-based approaches [7].
•D. Zhan, K. Tan, X. Yu, H. Zhang, L. Ye are with the School of Cyberspace
Science, Harbin Institute of Technology, Harbin, Heilongjiang, 150001.
Z. He is with the Heilongjiang Meteorological Bureau, Harbin, Hei-
longjiang, 150001.
E-mail:{zhandy, yuxiangzhan, zhanghongli, hityelin}@hit.edu.cn
• * Corresponding Author: hityelin@hit.edu.cn
In recent years, the combination of dynamic approaches and
deep learning techniques has become increasingly popular. These
approaches can
automatically learn features and train neural networks from
behavioral sequences. Sequential neural networks (e.g.,
RNN-based models and transformers) are usually employed
to perform sequence anomaly detection [8] based on dy-
namic behavior logs and can be used for malware detection,
since these deep learning approaches have a good ability to
model sequential data.
However, these sequential neural network models are
vulnerable to adversarial sample attacks [9], which perturb
the model's input samples to bypass detection by these
models. Unlike adversarial samples for images
[10], the adversarial samples for dynamic malware detec-
tion are behavior logs. The adversarial samples need to be
practical, so attackers cannot directly replace/remove the
original behaviors of the malware. Therefore, the generation
approaches of adversarial samples mainly insert irrelevant
behavior sequences (i.e., normal behavior fragments) into
the original behavior sequences or replace the anomalous
behaviors with those with similar functions [11–13]. For
instance, [13] achieves an attack success rate of up to 87.94%
against an LSTM classifier by injecting perturbed behaviors.
To improve the robustness of malware detection, many
approaches have been proposed to analyze and learn high-level
OS-related characteristics of program behaviors. For in-
stance, DroidSpan [14] learns the behavior characteristics
associated with accessing Android sensitive data (e.g., user
accounts), which are found to change little with the evolu-
tion of Android software. DroidCat [15] captures many app-
level dynamic characteristics of Android apps to perform
robust malware identification. However, these approaches
are usually based on the high-level semantics of APIs in
specific operating systems, such as the APIs for accessing
accounts in Android, which is the key to achieving robust
analysis. Therefore, it is not easy to apply these approaches to
low-level behavior logs that lack such high-level semantic
information (e.g., Linux syscalls).
To analyze behavior logs without high-level OS-related
semantic information (e.g., syscall logs), many approaches
for sequential data have been proposed, such as adversarial
training [16], defense Sequence-GAN [17], and sequence
squeezing [18]. However, adversarial training reduces the
detection performance of networks on unperturbed data
samples. Defense Sequence-GAN suffers from the prob-
lem of training overhead and cannot handle adversarial
samples with long sequence injections. Sequence squeezing
can only defend against the attack of replacing labeled
malicious behaviors with less well-known behaviors with
similar functionality. In addition, these approaches lack in-
depth analysis of the relationships among behaviors and
therefore cannot solve the problems of adversarial attacks
in the behavior analysis domain.
In this paper, an adversarial robust behavior sequence
anomaly detection approach based on critical behavior unit
learning is proposed. Based on our observation, the behav-
ior of a program is composed of a series of representative
behavior units. A behavior unit is a collection of related
behaviors that together serve a specific behavioral purpose;
it contains the representative semantics of local behaviors. For instance,
open, read, and close can constitute a behavior unit whose
purpose is to access a file. Behavior units such as accessing
files, sending data to the network, etc., can together constitute
the behavior of a program. Based on this
observation, our approach learns the overall semantics of
each behavior unit and the contextual relationships among
behavior units to perform adversarial robust anomaly de-
tection. By extracting and analyzing the behavior units,
the overall representation and behavior intention of local
behaviors can be obtained, and the robustness of behavior
analysis can be improved. It is difficult to change the local
behavior intention with small-scale behavior injections or
replacements, so our approach is resistant to perturbation
attacks that target local behaviors. Then, the contextual re-
lationships among behavior units are analyzed to obtain the
global representation of the target behaviors and mitigate
perturbation attacks that target a wide range of behav-
iors. By combining the local and global features of target
behaviors based on behavior unit analysis, our approach
can perform robust anomaly detection for both high-level
behavior logs (e.g., Android APIs) and low-level behavior
logs (e.g., syscalls).
Our approach first identifies behavior units from un-
perturbed behavior sequences, which contain the represen-
tative semantics of the original behavior sequences. Then,
we extract such behavior units from perturbed samples for
behavior analysis, since our goal is to analyze the security
of behavior sequences with adversarial attacks. Last, we
design a multilevel deep learning model to perform security
analysis based on behavior units. Even if an attacker tam-
pers with some critical behaviors or behavior units, which
the attacker needs to replace with behaviors of the same
function to keep the program’s functionality, this threat canbe identified by our approach.
In summary, the first contribution of this paper is that
an adversarial robust behavior sequence anomaly detection
approach is proposed based on the analysis of behavior
units. By learning the overall semantics of each behavior
unit and the contextual relationships among behavior units,
our model can improve the robustness of behavior analysis
and can be applied to both low-level and high-level behav-
ior logs.
The second contribution is that, after implementing the
prototype, comparative experiments are carried out to demonstrate
the performance and robustness of our approach and the
importance of the behavior unit feature.
The rest of this paper is organized as follows. Section 2
summarizes the related work. The threat model and defense
strategy are introduced in Section 3. Section 4 describes the
overall framework and its implementation. The evaluation
of our approach is performed in Section 5. Section 6 sum-
marizes this paper.
2 RELATED WORK
In this section, we summarize related work on deep
learning-based malware detection, adversarial attacks of
sequential models, and adversarial defense approaches.
2.1 Deep Learning-based Malware Detection
Detecting malware is a hot topic in computer security.
Compared with static analysis approaches [3], dynamic analysis
approaches [4, 19] are robust against adversarial technologies
(e.g., code obfuscation, dynamic code loading, and
shelling) [5]. With the development of deep learning, lever-
aging deep learning to detect malware by analyzing the
behaviors of malware is an important approach in malware
detection[20].
Deep learning-based models can analyze dynamic exe-
cution information (e.g., APIs and syscalls) of software to
identify anomalies. The invoked API sequences can be ef-
fectively applied to model the most representative behavior
features of malware [21]. For example, [22–24] use sequences
of API calls for malware detection. These approaches are
based on frequency analysis of API calls [24] or identify
specific malicious API call sequence characteristics [22, 23].
Since sequential deep learning models have a strong
ability to learn sequential features, many approaches use
sequential deep learning models to detect abnormal behav-
iors of malware. [25] proposes an LSTM-based detection
approach. This approach converts system call events to se-
mantic information in natural language, and treats a system
call event as a sentence. Then, an LSTM-based classifier is
proposed to identify anomalies. [26] proposes an LSTM-
transformer architecture to improve the classification of
malicious system calls, which leverages the ability of LSTM
to capture sequential pattern features and the ability of the
transformer-encoder to capture global dependencies. This
approach combines the strengths of these two models to
identify abnormal patterns in system calls.
2.2 Adversarial Attacks of Sequential Models
Sequential neural network models are vulnerable to ad-
versarial sample attacks. Different from adversarial sample
attacks in the image recognition field, attackers can inject
irrelevant behavior sequences into the original behavior se-
quences or modify some unimportant behaviors to generate
adversarial samples. Deleting or modifying the behaviors of
malware may affect the functionality of the malware. There-
fore, these approaches usually generate perturbed samples
by inserting intercepted benign fragments or generated
fragments, which significantly reduces the detection ability of
sequential neural network-based methods [11–13]. For instance,
[27] proposes an end-to-end black-box method to generate
adversarial examples for RNN-based models by changing
unnecessary features. [12] mimics benign behaviors in the
malware by periodically injecting intelligently selected API
calls in original malicious API call sequences, to bypass
the classifiers. [11] outputs sequential adversarial examples
based on a generative RNN and injects them into malicious
sequences to attack RNN-based malware detection systems.
[13] perturbs the classifier by injecting normal sequences
into abnormal sequences. Since directly injecting benign
sequences into malicious sequences is easy to identify,
Sequence-GAN is used to generate benign sequences. An
algorithm is proposed to minimize the amount of injection.
Furthermore, by injecting enough benign sequences, per-
turbed samples can completely bypass the classifier.
2.3 Adversarial Defense Approaches
To defend against such adversarial attacks, several adver-
sarial defense approaches have been proposed. Adversarial
learning[16] is a common approach to defend against ad-
versarial sample attacks, which improves model robustness
by adding adversarial examples to the training set. How-
ever, adversarial training has several limitations. First, the
robustness of the model heavily depends on the quality
of adversarial samples. Second, if the generated perturbed
samples are very similar to normal samples, it may reduce
the model's detection ability. Furthermore, adversarial training
generalizes poorly to new adversarial attacks [28].
Another defense approach is Defense Sequence-
GAN[17], which filters out the perturbations added by
adversarial attacks by training a GAN to model the dis-
tribution of unperturbed input. However, there are several
problems with this approach. First, the training overhead
of this approach is high, as discussed in [17]. Second, this
approach can only slightly mitigate adversarial attacks, and it
performs poorly on adversarial samples in long sequences:
if an attacker inserts a long normal sequence among a few
malicious behavior sequences, it is difficult for this approach
to detect the anomaly.
Sequence squeezing [18] is a possible defense approach
that merges similar behaviors into a single representative
feature. This approach mainly defends against critical be-
havior obfuscation attacks, which replace well-known mali-
cious behaviors with less well-known behaviors with simi-
lar functionality. However, this approach cannot be applied
to identifying perturbations with benign behaviors.
Therefore, existing adversarial defense approaches can-
not effectively defend against such adversarial attacks in the
malicious behavior detection domain.
3 THREAT MODEL AND DEFENSE STRATEGIES
In this section, we introduce the threat model and defense
strategies of our approach.
3.1 Threat Model
Recent studies have shown that deep learning-based mal-
ware detection methods are vulnerable to adversarial at-
tacks. The adversarial attack modifies the behavior execu-
tion sequence of malware so that the modified sequence is
incorrectly classified as benign.
We assume that attackers can only obtain limited knowl-
edge of the target classification model by query-only obser-
vation, because it is usually difficult to obtain perfect details
of the model architecture that are highly protected [29]. As
discussed in [13, 18], an attacker cannot modify behavior
logs directly because they are generated by tracking pro-
gram execution, which is different from directly modifying
pixel data to perturb the image samples. To maintain the
program’s functionality, attackers can only replace the orig-
inal behaviors with functionally similar behaviors or insert
irrelevant behaviors into the original sequences.
3.2 Our Defense Strategies
To defend against the above attacks, the defense strategies
are as follows:
1) Strongly correlated behavior subsequences with ob-
vious behavioral intentions are identified, and then the
unrecognized irrelevant behaviors are excluded during the
sequence detection process. According to our observation
of syscall behaviors, we find that a collection of strongly
related behaviors, which usually have a specific behavioral
intention, constitute a behavioral unit. For instance, the
operation of reading a file involves the syscall set {open,
read, close}. These related syscalls together express an obvious
behavioral intention. If we can automatically identify these key
behavioral intentions and delete irrelevant behaviors, we can
defend against the interference of adversarial attacks.
2) The joint feature representations within and between
behavior units are learned to identify the latent features of
malicious behavior intentions and the dependence among
multiple behavior intentions. The intention of some behav-
ior units can clearly distinguish whether they are benign
or malicious. However, we argue that multiple behavioral
intentions can also further describe anomalies. To correlate
the behavior units and learn the joint feature representa-
tions, a multilevel deep learning model based on the trans-
former encoder is proposed, as detailed in Section 4. First,
a transformer encoder block is used to learn the embedded
representation of each behavior unit. Next, we concatenate
the embeddings of behavior units with the embeddings of
the in-unit corresponding behaviors to generate the joint
embeddings of behavior units and behaviors, which are fed
into other transformer encoder blocks to learn the contex-
tual and joint feature representations of the behavior and
behavior unit sequences.
4 SYSTEM DESIGN
This section describes the overall architecture of the pro-
posed malicious behavior detection approach based on crit-
ical behavior unit learning.
4.1 Framework
As shown in Figure 1, the input of our proposed system is
behavior logs. The system consists of three modules: (1) behavior
unit pattern identification, (2) behavior unit extraction, and
(3) feature extraction and behavior classification.
1) Behavior Unit Pattern Identification
This module collects the behavior sequences of normal
and malicious software. Then, we identify the behavior
subsequences with obvious behavior classification charac-
teristics from the unperturbed behavior sequences. Based
on the patterns of the identified behavior subsequences,
we identify the behavior units in the perturbed sequences.
Specifically, we obtain candidate subsequences of possible
behaviors from unperturbed sequences, and then apply
the shapelet algorithm to select subsequences with obvious
classification characteristics as the pattern of the critical
behavior unit, as described in Section 4.2.
2) Behavior Unit Extraction
To extract behavior units from the behavior sequences
with perturbation and remove behaviors unrelated to crit-
ical behaviors, the module applies the longest common
subsequence (LCS) algorithm to extract behavior units from
the perturbed behavior sequence based on the extracted
patterns, which improves the robustness against obfuscation
attacks.
3) Feature Extraction and Behavior Classification
This module integrates multilayer transformer models
to extract multilayer features of software behaviors. The
classification result determines whether the input sequence
is normal or abnormal. Its workflow is shown in Figure 2.
First, the behavior sequence is subjected to behavioral unit
extraction, so perturbations from sequences of unrelated
behaviors are excluded. Second, representations of the input
behavior sequences, which are split into training and testing
data, are generated. Last, the training data are used to train
the transformer-based classification model, and the test data
are fed into the trained transformer-based model to test the
model performance.
4.2 Behavior Unit Pattern Identification
As illustrated in Figure 1, this module generates behav-
ioral sequence fragments from the unperturbed behavior
sequences as behavior unit candidates and then evaluates
the quality of these fragments to choose the top K fragments
as critical behavior units.
4.2.1 Behavior Unit Candidate Generation
We use binary classification to identify normal and abnormal
behaviors with class labels $Y = \{0, 1\}$ for the given
unperturbed behavior sequences $I = \{I_1, I_2, \ldots, I_n\}$. We
aim to identify the critical behavior sequence fragments with
obvious behavior features and extract the most representative
fragment of all the behavior sequences to detect abnormal
sequences.

We assume that $S$ is a subsequence (fragment) of a behavior
sequence $I_i$ and that their lengths are $l$ and $m$, respectively,
where $l \leq m$. Any behavior sequence of length $m$ contains
$m - l + 1$ distinct subsequences of length $l$. We denote the set
of all subsequences of length $l$ for sequence $I_i$ by $W_{i,l}$, and
the set of all subsequences of length $l$ for the dataset by
$$W_l = \{W_{1,l}, W_{2,l}, \ldots, W_{n,l}\} \quad (1)$$
The set of all candidate critical behavior units for dataset $I$ is
$$W = \{W_{min}, W_{min+1}, \ldots, W_{max}\} \quad (2)$$
where $min \geq 3$ and $max \leq m$, and the process of extracting
the top $k$ critical behavior units is defined in Algorithm 1.
Algorithm 1 Behavior Unit Candidate Extraction
Input: I, min, max, k
1:  k_behavior_units = ∅
2:  C = classLabels(I)
3:  for behavior sequence I_i in I do
4:    behavior_units = ∅
5:    for l = min to max do
6:      W_{i,l} = generateCandidates(I_i, min, max)
7:      for all subsequences S in W_{i,l} do
8:        quality = assessCandidate(S)
9:        behavior_units.add(S, quality)
10:     end for
11:   end for
12:   sortByQuality(behavior_units)
13:   removeSelfSimilar(behavior_units)
14:   k_behavior_units = merge(k, k_behavior_units, behavior_units)
15: end for
Output: k_behavior_units
For each sequence in the dataset, all subsequences of
all possible lengths according to the min and max length
parameters are visited. Algorithm 1 stores all candidates
for a given behavior sequence with their associated quality
measures (Line 8, which is detailed in Section 4.2.2). Once all
behavior unit candidates have been assessed, first, they are
sorted in order of quality, and self-similar behavior units are
removed. Second, we merge these behavior units with the
existing top k behavior units before processing the following
behavior sequences. Last, we obtain the top k behavior units
and discard all self-similar behavior units from the current
sequences.
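As a rough illustration, the candidate-generation and top-k selection of Algorithm 1 can be sketched in Python as follows. The scoring function `assess` stands in for assessCandidate (Line 8), and exact-duplicate removal stands in for the self-similarity filter (Line 13); both are simplifying assumptions for illustration.

```python
def generate_candidates(seq, l_min, l_max):
    """All contiguous subsequences of seq with length in [l_min, l_max]."""
    cands = []
    for l in range(l_min, min(l_max, len(seq)) + 1):
        for i in range(len(seq) - l + 1):
            cands.append(tuple(seq[i:i + l]))
    return cands

def top_k_units(sequences, l_min, l_max, k, assess):
    """Score every candidate from every sequence and keep the best k."""
    scored = {}
    for seq in sequences:
        for cand in generate_candidates(seq, l_min, l_max):
            q = assess(cand)
            # keep the best quality seen for each distinct candidate
            if cand not in scored or q > scored[cand]:
                scored[cand] = q
    ranked = sorted(scored.items(), key=lambda kv: kv[1], reverse=True)
    return [list(c) for c, _ in ranked[:k]]
```

As noted above, a sequence of length $m$ yields $m - l + 1$ candidates per length $l$, so the candidate set grows quickly with the length range.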
4.2.2 Measuring the Quality of a Critical Behavior Unit
We denote the (squared) Euclidean distance between two
subsequences $W_S$ and $W_R$ of length $l$ as
$$dist(W_S, W_R) = \sum_{i=1}^{l} (s_i - r_i)^2 \quad (3)$$
The distance between a subsequence $W_S$ of length $l$ and a
behavior sequence $I_i$ is the minimum distance between $W_S$
and all normalized subsequences $W_R$ of $I_i$, i.e.,
$$d_{i,S} = \min_{W_R \in W_{i,l}} dist(W_S, W_R) \quad (4)$$
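Assuming behaviors are encoded as numbers, Equations (3) and (4) amount to a sliding-window minimum, which can be sketched as:

```python
def dist(w_s, w_r):
    # Eq. (3): sum of squared differences between two length-l subsequences
    return sum((s - r) ** 2 for s, r in zip(w_s, w_r))

def min_dist(w_s, sequence):
    # Eq. (4): minimum distance between candidate w_s and every
    # length-l window of the behavior sequence
    l = len(w_s)
    return min(dist(w_s, sequence[j:j + l])
               for j in range(len(sequence) - l + 1))
```

A distance of 0 means the candidate occurs verbatim somewhere in the sequence.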
Fig. 1: The framework of our approach.
Fig. 2: The workflow of the proposed approach.
Therefore, all distances between a candidate behavior unit
$S_k$ and the behavior sequences $I = \{I_1, I_2, \ldots, I_n\}$ are
represented as a distance list $D_k$:
$$D_k = \langle d_{1,k}, d_{2,k}, \ldots, d_{n,k} \rangle \quad (5)$$
To further identify and extract critical behavior units,
the shapelet algorithm [30] is employed to determine the
quality of a behavior unit. The original shapelet papers use
information gain to determine the quality of a candidate
shapelet [30, 31], because information gain is suitable for
identifying how to obtain a partition of the data and can be
applied to recursively divide the data. The original shapelet
algorithm sorts the distance listD k, and then evaluates the
information gain for each possible split value. However,
calculating distances between the behavior unit candidates
and each behavior sequence is very time-consuming.
To solve this problem, we use the learning time-series
shapelets model [32] to measure the quality of a critical
behavior unit. First, a learning-shapelets model is trained.
Second, we input each behavior unit candidate into the
model and evaluate its quality according to the difference
between the output of the model and the actual label.
We iteratively optimize the shapelet model by minimizing
a classification loss function (shown in Equation 7) instead
of searching among possible behavior units from all the
behavior subsequences.
Given classifier weights $w \in \mathbb{R}^{J+1}$ (including the bias) and
a feature vector $x_i \in \mathbb{R}^{J}$, the linear prediction model is
expressed as
$$\hat{y} = \sum_{j=1}^{J} w_j x_{i,j} + w_0 \quad (6)$$
where $x_{i,j}$ is the distance between the $i$-th behavior
sequence $I_i$ and the $j$-th shapelet $S_j$.
The formulation jointly optimizes the shapelets $S$ and the
classifier weights $w$ in
$$\underset{S \in \mathbb{R}^{J \times L},\, w \in \mathbb{R}^{J+1}}{\text{minimize}} \sum_{i=1}^{I} \mathcal{L}(y_i, \hat{y}_i) + \frac{\alpha}{I} \sum_{j=1}^{J} w_j^2 \quad (7)$$
where $\alpha \geq 0$ is a regularization parameter. Considering
class labels $Y = \{1, 0\}$, the loss function $\mathcal{L}$ is
$$\mathcal{L}(y_i, \hat{y}_i) = -y_i \ln(\sigma(\hat{y}_i)) - (1 - y_i) \ln(1 - \sigma(\hat{y}_i)) \quad (8)$$
and the sigmoid function $\sigma$ is
$$\sigma(\hat{y}_i) = (1 + e^{-\hat{y}_i})^{-1} \quad (9)$$
Then, the shapelets $S = \{S_1, S_2, \ldots, S_J\}$ and the classifier
weights $w$ can be learned to minimize the classification
objective and reduce generalization error without compromising
shapelet interpretability.
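A minimal NumPy sketch of Equations (6)–(9) follows; the epsilon guard inside the log is a numerical-stability detail we add, not part of the paper's formulation.

```python
import numpy as np

def predict(w, X):
    # Eq. (6): linear model over shapelet distances; w[0] is the bias w_0
    return w[0] + X @ w[1:]

def sigmoid(y_hat):
    # Eq. (9)
    return 1.0 / (1.0 + np.exp(-y_hat))

def objective(w, X, y, alpha):
    # Eq. (7): logistic loss of Eq. (8) plus the scaled L2 penalty
    p = sigmoid(predict(w, X))
    eps = 1e-12  # numerical guard, an implementation detail
    loss = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).sum()
    return loss + alpha / len(y) * np.sum(w[1:] ** 2)
```

In practice, this objective would be minimized by gradient descent over both the weights and the shapelet positions, as in the learning-shapelets model [32].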
We input each behavior unit candidate $W_i$ into the model
and evaluate its quality according to the difference $\xi$
between the output of the model $\hat{y}(W_i)$ and the actual label
$y_{actual}$ of $W_i$:
$$\xi = y_{actual} - \hat{y}(W_i) = y_{actual} - \sum_{j=1}^{J} w_j d_{W_i,S_j} - w_0 \quad (10)$$
where $d_{W_i,S_j}$ is the distance between the behavior unit
candidate $W_i$ and the $j$-th shapelet $S_j$. We use $\xi$ to evaluate
the quality of candidates: if $\xi$ is close to 0, the quality of the
behavior unit is high.
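The quality measure of Equation (10) is simply the residual of the linear model above; as a sketch:

```python
def candidate_quality(w, d_vec, y_actual):
    # Eq. (10): xi = y_actual - y_hat(W_i); w[0] is the bias w_0 and
    # d_vec holds the distances d_{W_i,S_j} to the J learned shapelets
    y_hat = w[0] + sum(wj * dj for wj, dj in zip(w[1:], d_vec))
    return y_actual - y_hat
```

Candidates whose residual is near 0 are ranked as high quality and retained.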
4.3 Behavior Unit Extraction and Representation
4.3.1 Behavior Unit Extraction
By extracting the critical behavior units, we exclude un-
related behavior sequences from the perturbed sequences
and improve the robustness against obfuscation attacks.
The token-level longest common subsequence (LCS) [33]
is employed for the extraction. We assume that $I_x = (x_1, x_2, \ldots, x_n)$
and $I_y = (y_1, y_2, \ldots, y_m)$ are a behavior unit and a behavior
sequence of lengths $n$ and $m$, respectively ($n \ll m$). Their LCS
is represented by $LCS(I_x, I_y) = M(n, m)$, where
$$M(i,j) = \begin{cases} 1 + M(i-1, j-1) & x_i = y_j;\ i, j > 0 \\ \max\{M(i-1, j),\ M(i, j-1)\} & x_i \neq y_j;\ i, j > 0 \\ 0 & i = 0 \text{ or } j = 0 \end{cases} \quad (11)$$
When the length of the LCS $M(n, m)$ equals $n$, the behavior unit
$I_x$ is completely contained in the behavior sequence $I_y$.
The LCS-based behavior unit extraction process is shown
in Step 3 of Figure 1. For instance, behavior unit identification
may yield the behavior unit set $\alpha = \{D{\to}C{\to}H,\ B{\to}G{\to}A,\ D{\to}F\}$.
Given a behavior sequence $\beta = D{\to}E{\to}B{\to}G{\to}E{\to}C{\to}H{\to}A$,
we match the identified behavior units from $\alpha$ against $\beta$ and
retain only the behavior units that are completely contained in
the behavior sequence. We then obtain the extracted behavior
sequence $\gamma = D{\to}B{\to}G{\to}C{\to}H{\to}A$.
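The recurrence of Equation (11) and the containment test can be sketched as follows. As a simplification, this sketch keeps every behavior whose symbol appears in some fully contained unit, rather than tracking exact match positions.

```python
def lcs_len(x, y):
    # Dynamic program of Eq. (11)
    m = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                m[i][j] = 1 + m[i - 1][j - 1]
            else:
                m[i][j] = max(m[i - 1][j], m[i][j - 1])
    return m[len(x)][len(y)]

def extract_units(units, sequence):
    """Keep only behaviors covered by a fully contained unit."""
    kept = set()
    for u in units:
        if lcs_len(u, sequence) == len(u):  # unit completely contained
            kept.update(u)
    return [b for b in sequence if b in kept]
```

On the example above, `extract_units(["DCH", "BGA", "DF"], "DEBGECHA")` discards the unmatched unit D→F and the irrelevant E behaviors, yielding D, B, G, C, H, A.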
4.3.2 Behavior Sequence Representation
To characterize behaviors by the similarity of their semantics
and context, we employ the word2vec [34] model to represent
the behaviors as vectors, as shown in Figure 3. The word2vec
model places the embeddings of behaviors with similar
semantics and contexts closer together in the representation
space. We leverage the skip-gram method to train word2vec
on the raw input behavior sequences in the dataset.
Let the behavior sequence be $I = (I_1, I_2, \ldots, I_n)$; the
representation of the behavior sequence is
$$B_e = Behavior2Vec(I),\quad B_e \in \mathbb{R}^{K \times Q} \quad (12)$$
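As a sketch of Equation (12), the representation is a per-behavior embedding lookup; here a random matrix stands in for trained word2vec vectors, and the vocabulary and embedding size Q = 8 are illustrative assumptions.

```python
import numpy as np

def behavior2vec(sequence, vocab, emb):
    # B_e of Eq. (12): one row per behavior, so B_e has shape (K, Q)
    # with K the sequence length and Q the embedding size
    return np.stack([emb[vocab[b]] for b in sequence])

rng = np.random.default_rng(0)
vocab = {b: i for i, b in enumerate(["open", "read", "close", "send"])}
emb = rng.normal(size=(len(vocab), 8))  # stand-in for trained word2vec vectors
B_e = behavior2vec(["open", "read", "close"], vocab, emb)
```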
4.4 Feature Extraction and Behavior Classification
After the previous steps, we can obtain the new behavior
sequences from the perturbed sequences based on the iden-
tified behavior units.
Since sequential deep learning models (e.g., LSTM and
RNN) have a good ability to learn sequential features, ab-
normal behavior sequences can usually be detected by these
models. However, we argue that existing deep learning
models are not sufficient to represent the different granularity
levels of behavior semantics in behavior sequences. A
behavior unit contains a collection of strongly related
behaviors, which represents the key classification features of
related behaviors. Therefore, the features of behavior units
are necessary for behavior analysis.
Based on this insight, we design a multilevel
transformer-based model to learn the joint features within
and between behavior units, thereby improving the accu-
racy and antiattack ability of the detection method. The
overall architecture of our classification model is shown in
Figure 4.
1) Learning Internal Features of Behavior Units
Since a behavior unit can represent the critical classifica-
tion features of related behaviors, we provide a transformer
encoder block to learn the embedded representation of
different behavior units. The learning process is shown in
Figure 5.
We use only the transformer encoder in our anomaly
detection model, because it is well suited to learning the
features of behaviors at different levels. This step is achieved
by employing a multi-head self-attention mechanism, which
allows the model to selectively focus on different parts of
the input sequence at each level. The transformer encoder is
known to be highly effective for sequence modeling tasks, as
it is based on the self-attention mechanism that can establish
long-range dependencies among different positions. This
kind of dependency can help the model better capture the
structure and features of the data, and thus better distin-
guish between normal data and anomalous data.
We did not use the decoder part of the Transformer
model, because our anomaly detection problem is focused
on learning the feature representations of input sequences
and identifying anomalies in the input sequences rather
than generating a new sequence based on a given input
sequence. Therefore, the decoder part, which is responsible
for generating the output sequence based on the encoded
input sequence, is not relevant to our task.
Specifically, we associate each behavior $I_i$ with its
corresponding behavior unit $U_i$. Assuming that the behavior
unit $U_i$ contains $J$ behaviors, where $U_i = \{B_{i,1}, B_{i,2}, \ldots, B_{i,J}\}$,
we input each behavior unit into the transformer encoder block
and use the last layer of the transformer encoder as the
embedded representation of the behavior unit, because it
contains the richest information after multiple layers of
self-attention. For the $n$ layers of the transformer encoder
block, the representations $U_e$ from the $n$-th layer are denoted as
$$U_e = \phi^{(n)}_{transformer}(I) = \{h^{(n)}_1, h^{(n)}_2, \ldots, h^{(n)}_J\} \quad (13)$$
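To make Equation (13) concrete, the following NumPy sketch computes one single-head self-attention layer over a unit of J behaviors, producing one hidden state h_j per behavior. A real transformer encoder block additionally uses multi-head attention, residual connections, layer normalization, and a feed-forward sublayer; the sizes here are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Each of the J behavior embeddings in X attends to every other
    # behavior in the unit, yielding hidden states h_1 ... h_J
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # J x J attention weights
    return A @ V

rng = np.random.default_rng(1)
J, d = 3, 8  # a unit of J behaviors with embedding size d (illustrative)
X = rng.normal(size=(J, d))
U_e = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```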
2) Learning Sequential Features of Behavior Units
We use other transformer encoder blocks to learn the
sequential features of behavior units.
We concatenate the embeddings of the behavior unit representations with the embeddings of the behavior representations and feed the joint embeddings into the transformer encoder blocks to learn the contextual features of the behavior and behavior unit sequences:

input = concatenate(B_e, U_e)    (14)
3) Behavior Classification
The outputs of the transformer encoder blocks are sent to
a pooling layer and an MLP classification layer. The softmax
Fig. 3: Behavior sequence representation.

Fig. 4: The architecture of the proposed classification model.
layer is employed to classify the behaviors as normal or
abnormal.
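A minimal sketch of this classification head (mean pooling over the encoder outputs, a single-layer MLP, and softmax over the two classes), with random weights standing in for trained parameters:

```python
import numpy as np

def classify(H, W, b):
    # H: (seq_len, d) encoder outputs for one sequence.
    # Mean-pool over the sequence, apply a one-layer MLP, then softmax
    # to obtain [p_normal, p_abnormal].
    pooled = H.mean(axis=0)
    logits = pooled @ W + b
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(1)
H = rng.standard_normal((10, 8))   # encoder outputs (illustrative)
W = rng.standard_normal((8, 2))    # untrained weights (illustrative)
probs = classify(H, W, np.zeros(2))
```

In the actual model these parameters are learned end-to-end with the encoder blocks.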
5 EVALUATION
In this section, we first evaluate the detection performance and adversarial robustness of the model and compare it with several baseline models [35–38]. Second, we conduct ablation studies to analyze the contribution of each defense process of our model. Last, we compare our model with other defense approaches.

5.1 Environment
The following experiments are performed on a Windows 10 operating system with an AMD Ryzen 7 5800 8-core 3.40 GHz processor, an NVIDIA 3060 GPU, and 32.0 GB of RAM.
Keras version 2.7 is employed to implement our model.
5.2 Dataset and Sequence Generation
We use the AndroCT dataset [39], a dataset for Android malware detection that contains more than 35,974
Android applications collected from 2010 through 2019,
including malicious and benign applications. Behavior data
consist of API invocation sequences. In our experiments, we
use the latest version (i.e., 2019) of behavior logs.
We apply fixed-window grouping with different numbers of log records to generate behavior sequences from this dataset. We label the API sequences of malware “positive” and those of benign software “negative”.
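The fixed-window segmentation and labeling described above can be sketched as follows; the window size and behavior names are illustrative:

```python
def segment(log, window):
    """Split a behavior log into fixed-size windows (the last, possibly
    shorter, window is kept)."""
    return [log[i:i + window] for i in range(0, len(log), window)]

log = ["open", "read", "write", "close", "socket", "send", "recv"]
seqs = segment(log, window=3)
# Every sequence from a malware trace is labeled "positive";
# sequences from benign software would be labeled "negative".
labeled = [(s, "positive") for s in seqs]
```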
We use a method similar to [13] to implement a state-
of-the-art adversarial attack focused on malware, as shown
in Algorithm 2. Instead of intercepting benign fragments
directly from the behavior execution sequences, we employ
seqGAN [40] to generate irrelevant behaviors and insert
them into the original sequences. The intercepted fragments
are more likely to be detected as “adversarial signatures”
with obvious insertion marks.
Algorithm 2: Adversarial Sequence Generation
Input: x (malicious sequence to perturb, of length l),
       n (number of adversarial sliding windows),
       B (max injection rate of generated benign fragments)
1: for each sliding window w_j of n in x do
2:     while curr injection rate in x < B do
3:         Randomly select a behavior's position i in w_j
4:         Insert a new adversarial fragment at position i of w_j, i ∈ {1, 2, ..., n}
5:     end while
6: end for
Output: perturbed x
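A Python sketch of Algorithm 2, under two stated assumptions: gen_fragment() is a stub standing in for the seqGAN generator of benign fragments, and the injection budget B is filled as a cumulative per-window share rather than the paper's exact bookkeeping:

```python
import random

def perturb(x, n, B, gen_fragment, rng=None):
    """Insert generated benign fragments into malicious sequence x,
    window by window, until the injection rate reaches B."""
    rng = rng or random.Random(0)
    x = list(x)
    L0 = len(x)                    # original length l
    win = max(1, L0 // n)          # sliding window size
    for w in range(n):
        # Fill each window until the cumulative injection rate
        # reaches this window's share of the budget B.
        while (len(x) - L0) / L0 < B * (w + 1) / n:
            i = rng.randrange(w * win, min((w + 1) * win, L0))
            x[i:i] = list(gen_fragment())   # insert a benign fragment
    return x

out = perturb(list(range(20)), n=2, B=0.4, gen_fragment=lambda: ["benign"])
```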
5.3 Evaluation Method
The confusion matrix is applied to evaluate the performance
of our proposed model. Let TP represent the number of
sequences that are correctly predicted as positive, TN denote
the number of sequences that are correctly classified as
negative, FN denote the number of traces that are positive
but are incorrectly predicted as negative, and FP indicate
the number of traces that are negative but are predicted as
Fig. 5: Learning the internal features of behavior units.
positive. We measure our model in terms of accuracy, preci-
sion, recall, and F1 score to assess our detection performance
and make comparisons. In our experiment, we use fixed
windows to divide behavior logs into behavior sequences,
and the goal of our classification is to determine whether
the behavior sequence is malicious.
Precision = TP / (TP + FP)    (15)

Recall = TP / (TP + FN)    (16)

F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (17)
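These metrics can be computed directly from the confusion matrix counts, for example:

```python
def metrics(tp, fp, fn):
    """Precision, recall, and F1 score from confusion matrix counts
    (Eqs. 15-17)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts, not results from the paper.
p, r, f1 = metrics(tp=90, fp=10, fn=30)
```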
5.4 Baselines
5.4.1 Log Behavior Analysis Approaches
The proposed model is compared with several baseline
models. All the baselines focus on log-based behavior analy-
sis approaches, which are open-source and based on sequen-
tial deep learning models. These approaches are presented
as follows:
1) Min Du et al. (2017) [35]: This approach is named
DeepLog, which adopts LSTM networks to learn the behav-
ior patterns of logs. The input is a one-hot vector for each
behavior pattern.
2) Sasho Nedelkoski et al. (2020) [36]: They adopt a
Transformer encoder with a multi-head self-attention mech-
anism, which learns context information from behavior logs
in the form of log vector representations (i.e., embeddings).
3) Siyang Lu et al. (2018) [37]: This approach performs
behavior log anomaly detection by leveraging convolutional
neural networks (CNN), which can explore the latent com-
plex relationships in behavior logs.
4) Amir Farzad et al. (2020) [38]: They propose an
unsupervised model for behavior log anomaly detection
that employs two deep autoencoder networks for anomaly
detection.
5.4.2 Defense Approaches of Adversarial Attacks
We compare our approach with several typical defense
approaches [17] against behavior log adversarial attacks.
1) Sequence Squeezing [18]: Sequence squeezing reduces the search space available to an adversary by merging similar semantic features into a single representative feature. For instance, syscall variants with the same semantics (e.g., the different read syscalls) can be merged into a single “reading file” operation behavior. To implement this approach, we use word2vec to represent the syscall names as word embeddings and cluster syscalls with the same semantics (by Euclidean distance). Different merged groups can then represent different operation behaviors.
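The clustering step of sequence squeezing can be sketched as a greedy merge of syscall embeddings by Euclidean distance; the toy 2-D embeddings and the threshold value are illustrative assumptions, not the word2vec vectors used in the experiments:

```python
import numpy as np

def squeeze(embeddings, threshold):
    """Map each syscall to the first existing representative within
    `threshold` (Euclidean distance), merging near-synonyms into one
    representative feature."""
    reps, mapping = [], {}
    for name, vec in embeddings.items():
        for rep_name, rep_vec in reps:
            if np.linalg.norm(vec - rep_vec) < threshold:
                mapping[name] = rep_name   # merged into an existing group
                break
        else:
            reps.append((name, vec))       # becomes a new representative
            mapping[name] = name
    return mapping

emb = {"read": np.array([0.0, 0.0]),
       "readv": np.array([0.1, 0.0]),    # semantically close to "read"
       "write": np.array([5.0, 5.0])}
m = squeeze(emb, threshold=1.0)
```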
2) Defense Sequence-GAN [17]: This approach defends against
adversarial attacks by training a GAN to learn the distri-
bution of unperturbed behavior sequences. Therefore, the
Defense Sequence-GAN can generate approximate samples
that meet the unperturbed sample distribution. We employ
SeqGAN to implement this approach. We train a benign
SeqGAN and malicious SeqGAN using the unperturbed
dataset and generate benign samples and malicious sam-
ples, respectively. When an input sequence emerges, we
choose the generated unperturbed sequence nearest the
perturbed sequence (calculated by Euclidean distance) and
feed it to the classifier.
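The selection step of Defense Sequence-GAN (choosing the generated unperturbed sequence nearest the input by Euclidean distance) can be sketched as follows; the fixed-length sequence vectors are an illustrative assumption:

```python
import numpy as np

def nearest_clean(perturbed, generated):
    """Replace a (possibly perturbed) input with the closest
    GAN-generated unperturbed sequence, by Euclidean distance."""
    dists = [np.linalg.norm(perturbed - g) for g in generated]
    return generated[int(np.argmin(dists))]

gen = [np.array([1.0, 1.0, 1.0]), np.array([4.0, 4.0, 4.0])]
out = nearest_clean(np.array([1.2, 0.9, 1.1]), gen)  # closest to gen[0]
```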
3) Adversarial Learning [16]: Adversarial learning adds
adversarial samples to the training set, which can make
the classifier learn the distribution of adversarial samples,
thereby defending against adversarial attacks. We generate
malicious adversarial samples according to Algorithm 2 and
add them to our training dataset. In addition, we label these
samples abnormal. Then, we train the classifier using the
dataset and analyze the detection performance.
4) GuardOL [41]: This approach constructs a collection
of critical events, and the feature vector of software is gen-
erated by extracting the frequency of each event to perform
classification based on an MLP model. Since the dataset used by this approach comprises syscall logs, we directly use the critical behaviors identified by our approach on our dataset to extract the event frequencies of each software sample and then use an MLP as the classifier.
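A minimal sketch of this GuardOL-style feature extraction (the frequency of each critical event per trace); the event names and the normalization by trace length are illustrative assumptions:

```python
from collections import Counter

def frequency_vector(sequence, critical_events):
    """Build a fixed-layout feature vector: the relative frequency of
    each critical event in one trace. The vector then feeds an MLP
    classifier."""
    counts = Counter(sequence)
    total = max(len(sequence), 1)
    return [counts[e] / total for e in critical_events]

vec = frequency_vector(["open", "read", "read", "send"],
                       critical_events=["read", "send", "exec"])
```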
5.5 Results
5.5.1 Robustness to Adversarial Attacks
To evaluate the robustness of our proposed model, we design an adversarial attack method targeting behavior sequences and conduct comparative experiments between the proposed model and the models in [35–38] under these adversarial attacks.
Specifically, we employ seqGAN [40], a contextualized
generation model, to generate benign fragments. To prevent
damage to the functionality of the malware, we follow the
idea of generating adversarial sequence samples mentioned
in [13]. We insert these adversarial samples into the original
sequences.
We divide the dataset into two equal parts: DataSet1 and DataSet2. Then, we use the normal sequences of DataSet1
to train the SeqGAN model and generate benign fragments,
while the victim models are trained on DataSet2. We gen-
erate adversarial samples according to Algorithm 2. These
samples are fed into the trained model. Table 1 shows an
adversarial example, and the inserted behaviors are high-
lighted.
Table 2 shows the detection rates on DataSet2 when the injection rate increases from 0% to 40%. The detection rates of the models under adversarial attacks are clearly reduced. Specifically, the F1 score of DeepLog drops from 98.7% to 75.3% when the injection rate is 40%. These results show that the context characteristics of
malicious sequences can obviously be changed by inserting
the generated benign fragments.
In contrast, our model is effective in defending against adversarial samples. As shown in Table 2, its F1 score drops by only 6% in the worst case. The results show that our proposed model, based on critical behavior unit learning, is more robust against adversarial attacks than the other approaches.
The robustness comes from two aspects. Specifically, in the
testing process, we only focus on learning critical behavior
units, and the inserted fragments in the adversarial samples
are filtered out. In addition, our model learns the overall
semantics of each critical behavior unit and the contextual
relationships among behavior units. Therefore, the inten-
tions of malicious behaviors can be extracted and learned.
5.5.2 Ablation Studies
In the process of feature extraction and behavior classifica-
tion, the features of behavior units are learned. To assess the
contribution of behavior unit features, we remove behavior
unit features from the model and then evaluate the perfor-
mance of the remaining model.
The results show that the lack of behavior unit feature learning causes performance degradation. According to Table 3, the model without behavior unit features performs worse, as its F1 scores decrease by 3.1% and 0.8% when the injection rates are 20% and 40%, respectively. Therefore, behavior unit features have an
important impact on the proposed model. This finding in-
dicates that the behavior unit features can perceive a wider
range of behaviors and improve the anomaly detection and
anti-obfuscation ability.
Figure 6 shows the ROC curves of the proposed model
and that the AUC score decreases after the removal of
behavior unit features. Thus, behavior unit features have a
good effect on malware detection, which enables the model
to adapt to complex detection situations.
Additionally, to verify the effectiveness of behavior unit
extraction and identification, we input the sequences ob-
tained by these two processes into the baseline models
and evaluate whether the performance of these baseline
models is improved. As shown in Table 4, the detection
performance of all models has been improved. Overall, with
the increasing injection ratio of benign fragments, these
improved baseline models can maintain higher F1 scores
than the original models, which implies that behavior unit
Fig. 6: ROC curves of the proposed model with different injection rates: (a) injection rate of 20%; (b) injection rate of 40%.
extraction and identification can improve the ability to resist
adversarial sample attacks.
5.5.3 Comparison with Other Defense Methods
We apply the above-mentioned adversarial defense meth-
ods as baselines for comparison with the same datasets and
in the same experimental setting. The results are shown in
Table 5. Overall, our proposed approach achieves the high-
est performance. As the injection rate of adversarial attacks
increases, the detection ability of our model maintains the
best performance.
The Defense Sequence-GAN method cannot handle adversarial samples with long injected sequences, so its performance is poor. In addition, this method needs to
identify the samples most similar to the input among a
large number of generated samples, which is very time-
consuming.
Sequence Squeezing mainly defends against critical be-
havior obfuscation attacks, which replace well-known mali-
cious behaviors with less well-known behaviors with simi-
lar functionality. However, this approach cannot be applied
to identifying perturbations with benign behaviors.
As the injection rate of adversarial samples increases, the F1 score of adversarial learning drops by 11.4%, a much larger decrease than that of our approach. This is because the generated perturbed samples are very similar to normal samples and reduce the detection ability of the model.
The performance of GuardOL is better than that of other
baselines, which indicates that identifying critical behaviors
is important for the robustness of the model. Since this ap-
proach cannot analyze the sequential features of behaviors,
its performance is worse than that of our model.
5.5.4 Sustainability Analysis
Our approach is related to sustainability analysis [14, 42].
Although the use of artificially generated adversarial sam-
ples for evaluation purposes is common practice, it may not
accurately represent real-world adversarial attack scenarios.
Therefore, it is crucial to incorporate real-world adversarial
malware samples. Software may change its behaviors over
time as it evolves. We can use the behavior logs of the
software for a certain year as unperturbed data and then
use the logs of later versions as perturbed behavior data.
In addition, we compare our approach with other base-
lines. The targets analyzed by these methods are mainly
complete programs, whereas the targets we analyzed in the
previous tests were independent behavioral sequences. In
TABLE 1: A behavior sequence example (inserted behaviors are marked with asterisks).

Original Sequence:    size booleanValue hashCode equals valueOf
Adversarial Sequence: size booleanValue *readLine* *replace* hashCode *gethtml* equals valueOf
TABLE 2: The robustness to adversarial attacks of different methods.

Method                        Metric  Injection Rate
                                      0%    5%    10%   15%   20%   25%   30%   35%   40%
LSTM-based Model [35]         R       0.992 0.931 0.885 0.828 0.797 0.753 0.725 0.665 0.607
                              P       0.982 0.993 0.991 0.991 0.989 0.989 0.989 0.991 0.990
                              F1      0.987 0.961 0.935 0.902 0.883 0.855 0.837 0.796 0.753
Transformer-based Model [36]  R       0.985 0.945 0.904 0.865 0.825 0.801 0.761 0.733 0.702
                              P       0.984 0.992 0.992 0.990 0.992 0.988 0.990 0.992 0.992
                              F1      0.985 0.968 0.946 0.923 0.901 0.885 0.861 0.843 0.822
CNN-based Model [37]          R       0.978 0.929 0.862 0.842 0.778 0.742 0.706 0.667 0.633
                              P       0.990 0.995 0.995 0.989 0.989 0.995 0.993 0.994 0.994
                              F1      0.984 0.961 0.924 0.910 0.871 0.850 0.825 0.806 0.774
Autoencoder-based Model [38]  R       0.988 0.962 0.921 0.872 0.857 0.808 0.775 0.733 0.724
                              P       0.824 0.814 0.810 0.807 0.806 0.800 0.799 0.795 0.789
                              F1      0.899 0.882 0.862 0.839 0.831 0.804 0.787 0.763 0.755
Our Model                     R       0.968 0.969 0.949 0.947 0.935 0.954 0.911 0.932 0.902
                              P       0.999 0.974 0.987 0.969 0.971 0.944 0.983 0.938 0.944
                              F1      0.983 0.971 0.968 0.958 0.953 0.949 0.945 0.935 0.923
TABLE 3: The ablation experimental results with different injection rates.

Method                          Metric  20%    40%
Without behavior unit features  R       0.921  0.876
                                P       0.923  0.956
                                F1      0.922  0.915
With behavior unit features     R       0.935  0.902
                                P       0.971  0.944
                                F1      0.953  0.923
order to compare with baselines at the software level, we
organize the behavior sequences generated by each program
into groups. If any behavior sequence within a group is
identified as abnormal, it is classified as malware. The
baselines are as follows.
1) DroidSpan [14]: It is a novel behavior profile that effectively captures the distribution of sensitive accesses within
Android apps. Through a longitudinal analysis, consistent
distinctions between benign apps and malware are identi-
fied, persisting over a duration of seven years despite the
evolving nature of both app types.
2) DroidEvolver [43]: Since different models have dif-
ferent sensitivities to software behavior evolution, this
method establishes a model pool composed of multiple ma-
chine learning models and performs software classification
through multi-model voting.
3) MamaDroid [44]: It utilizes Markov chain modeling of API call sequences for Android malware detection. It offers three modes of operation for abstracting API calls at different granularities (families, packages, or classes) and achieves high accuracy in detecting unknown malware samples.
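The Markov chain abstraction used by MamaDroid can be sketched as follows; the two-state call abstraction ("net", "io") is an illustrative assumption, not MamaDroid's actual family/package/class abstraction:

```python
from collections import defaultdict

def markov_features(calls):
    """Model an abstracted API call sequence as a Markov chain and
    return the transition probabilities, which serve as features."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(calls, calls[1:]):   # count observed transitions
        counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

probs = markov_features(["net", "io", "net", "io", "io"])
```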
We train our model and the baselines on the AndroCT
dataset of years 2013 and 2014 and evaluate their perfor-
mance on data from the subsequent years. For a fair com-
parison, we use the same samples in the AndroCT dataset
as DroidSpan [14] to test all baselines and our approach.
The numbers of benign and malicious samples in the test
are shown in Table 6.
The experimental results are shown in Figure 7. Among
the baselines, DroidSpan achieves the best performance. For instance, it obtains the highest average F1 score of 0.691 across the years 2014-2017 among the baselines when trained on
the data of 2013. When it is trained on the data of 2014
and tested on the data of 2015-2017, the average F1 score
is 0.723, which is also the highest among the baselines. The
results demonstrate its effectiveness in sustainable malware detection for Android apps. DroidEvolver achieves average
F1 scores of 0.604 and 0.615 when trained on the datasets of
the years 2013 and 2014, respectively. Similarly, MamaDroid
achieves average F1 scores of 0.579 and 0.614. These results
can be attributed to whether the features selected by these
methods can effectively capture the evolving characteristics
of both malware and benign apps. For instance, despite the
evolution of malware, the selected classification features of
DroidSpan remained sustainable.
In comparison, our approach performs better than the baselines in most years, and it achieves better average F1 scores overall, reaching 0.719 and 0.729. This result can be
attributed to the analysis of behavior units, which allows for
good robustness in the analysis of the program behaviors. In
addition, our model has high generality because it only an-
TABLE 4: The results of improved models with behavior unit extraction and identification.

Method                        Metric  Injection Rate
                                      5%    10%   15%   20%   25%   30%   35%   40%
LSTM-based Model [35]         R       0.986 0.963 0.953 0.942 0.926 0.863 0.842 0.813
                              P       0.936 0.938 0.936 0.935 0.937 0.935 0.936 0.936
                              F1      0.961 0.950 0.944 0.938 0.931 0.897 0.886 0.870
Transformer-based Model [36]  R       0.994 0.973 0.963 0.960 0.923 0.871 0.860 0.854
                              P       0.945 0.936 0.940 0.935 0.935 0.937 0.937 0.933
                              F1      0.968 0.954 0.951 0.947 0.929 0.902 0.897 0.892
CNN-based Model [37]          R       0.986 0.947 0.933 0.923 0.889 0.854 0.765 0.744
                              P       0.942 0.941 0.942 0.938 0.937 0.939 0.941 0.937
                              F1      0.963 0.944 0.937 0.930 0.912 0.894 0.844 0.829
Autoencoder-based Model [38]  R       0.973 0.933 0.907 0.897 0.864 0.841 0.812 0.767
                              P       0.828 0.824 0.829 0.817 0.824 0.831 0.831 0.825
                              F1      0.894 0.875 0.866 0.855 0.843 0.836 0.821 0.795
TABLE 5: Comparison with other defense methods.

Method                      Metric  Injection Rate
                                    5%    10%   15%   20%   25%   30%   35%   40%
Sequence Squeezing [18]     R       0.943 0.902 0.859 0.819 0.789 0.758 0.730 0.719
                            P       0.985 0.982 0.985 0.985 0.985 0.985 0.985 0.971
                            F1      0.964 0.940 0.918 0.895 0.876 0.857 0.838 0.826
Defense Sequence-GAN [17]   R       0.907 0.873 0.841 0.815 0.794 0.771 0.752 0.717
                            P       0.984 0.979 0.978 0.983 0.972 0.967 0.970 0.975
                            F1      0.944 0.923 0.905 0.891 0.874 0.859 0.847 0.826
Adversarial Learning [16]   R       0.967 0.937 0.894 0.871 0.844 0.828 0.806 0.771
                            P       0.961 0.969 0.961 0.962 0.962 0.955 0.964 0.947
                            F1      0.964 0.953 0.927 0.914 0.899 0.887 0.878 0.850
GuardOL [41]                R       0.935 0.931 0.925 0.915 0.910 0.905 0.901 0.895
                            P       0.930 0.930 0.930 0.929 0.929 0.928 0.928 0.927
                            F1      0.933 0.930 0.927 0.922 0.919 0.916 0.914 0.911
Our Approach                R       0.969 0.949 0.947 0.935 0.954 0.911 0.932 0.902
                            P       0.974 0.987 0.969 0.971 0.944 0.983 0.938 0.944
                            F1      0.971 0.968 0.958 0.953 0.949 0.945 0.935 0.923
Fig. 7: Performance of different models trained on the data of 2013 and 2014 and tested on the data of subsequent years: (a) trained on the 2013 dataset; (b) trained on the 2014 dataset.
Fig. 8: The F1 scores of our approach and four other adversarial defense approaches under different attack injection rates on two datasets.
TABLE 6: The number of samples from each year used on the AndroCT dataset.

Year  Number of Benign Apps  Number of Malware
2013  1568                   1139
2014  2953                   1337
2015  1178                   1451
2016  1370                   1769
2017  1612                   1934
alyzes behavior sequences without the need to understand
the high-level semantic information of behaviors related
to Android systems. However, our approach does not perform better than DroidSpan on the data of several years (e.g., 2014 and 2017) in Figure 7 (a), which shows that the analysis of DroidSpan for Android apps is also robust. Since our approach lacks an understanding of the behavior features of Android (e.g., accessing sensitive data in Android), it does not perform as well as DroidSpan in those years. To improve the robustness of our approach to
analyze the software behaviors of some specific operating
systems (e.g., Android), the next step could be to explore
combining behavior unit analysis with Android program
behavior features, which we leave for future work.
5.5.5 Generality Analysis
To further investigate the effectiveness of our approach on
other datasets, we extend our experiments to the ADFA-
LD dataset [45], which is also a widely used dataset for
anomaly detection and consists of system call sequences.
We use the same approach to generate adversarial samples
in this dataset, as shown in Algorithm 2. For comparison, we
benchmark our approach against other defense techniques.
The experimental results are shown in Figure 8. Our ap-
proach has the highest F1 scores on the data with different
injection rates. When the attack injection rate increases from
5% to 40%, the F1 scores of our approach decrease by
5.68% and 10.38% on the two datasets, respectively. The
decreases are the lowest among all the compared approaches, which demonstrates the robustness of our approach across different datasets.
6 CONCLUSION
RNN-based behavior analysis models are vulnerable to adversarial sample attacks. To mitigate this problem, this paper proposes an adversarial robust behavior sequence anomaly detection approach based on critical behavior unit learning. The perturbations of irrelevant sequences are
eliminated by identifying and extracting critical behavior
units, and the robustness of the model is improved. A
multi-level transformer-based abnormal behavior detection
approach is proposed to learn the joint features within
and between behavior units. The experimental results show
that our proposed approach has good performance against
obfuscation attacks.
ACKNOWLEDGMENTS
This work was supported by the National Key R&D
Program of China (No. 2021YFB2012402), the National
Natural Science Foundation of China under Grants No.
61872111, and the Natural Science Foundation of Hei-
longjiang Province of China under Grants No. LH2023F017.
REFERENCES
[1] A. Afianian, S. Niksefat, B. Sadeghiyan, and D. Baptiste, “Malware dynamic analysis evasion techniques: A survey,” ACM Computing Surveys (CSUR), vol. 52, no. 6, pp. 1–28, 2019.
[2] M. Sahin and S. Bahtiyar, “A survey on malware detection with deep learning,” in 13th International Conference on Security of Information and Networks, 2020, pp. 1–6.
[3] M. Amin, T. A. Tanveer, M. Tehseen, M. Khan, F. A. Khan, and S. Anwar, “Static malware detection and attribution in android byte-code through an end-to-end deep system,” Future Generation Computer Systems, vol. 102, pp. 112–126, 2020.
[4] C. Li, Q. Lv, N. Li, Y. Wang, D. Sun, and Y. Qiao, “A novel deep framework for dynamic malware detection based on api sequence intrinsic features,” Computers & Security, vol. 116, p. 102686, 2022.
[5] O. Or-Meir, N. Nissim, Y. Elovici, and L. Rokach, “Dynamic malware analysis in the modern era—a state of the art survey,” ACM Computing Surveys (CSUR), vol. 52, no. 5, pp. 1–48, 2019.
[6] S. Forrest, S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff, “A sense of self for unix processes,” in Proceedings 1996 IEEE Symposium on Security and Privacy. IEEE, 1996, pp. 120–128.
[7] I. Firdausi, A. Erwin, A. S. Nugroho et al., “Analysis of machine learning techniques used in behavior-based malware detection,” in 2010 Second International Conference on Advances in Computing, Control, and Telecommunication Technologies. IEEE, 2010, pp. 201–203.
[8] R. Vinayakumar, K. Soman, P. Poornachandran, and S. Sachin Kumar, “Detecting android malware using long short-term memory (lstm),” Journal of Intelligent & Fuzzy Systems, vol. 34, no. 3, pp. 1277–1288, 2018.
[9] H. Liang, E. He, Y. Zhao, Z. Jia, and H. Li, “Adversarial attack and defense: A survey,” Electronics, vol. 11, no. 8, p. 1283, 2022.
[10] X. Fang, Z. Li, and G. Yang, “A novel approach to generating high-resolution adversarial examples,” Applied Intelligence, vol. 52, no. 2, pp. 1289–1305, 2022.
[11] W. Hu and Y. Tan, “Black-box attacks against rnn based malware detection algorithms,” in Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] F. Fadadu, A. Handa, N. Kumar, and S. K. Shukla, “Evading api call sequence based malware classifiers,” in International Conference on Information and Communications Security. Springer, 2019, pp. 18–33.
[13] I. Rosenberg, A. Shabtai, Y. Elovici, and L. Rokach, “Query-efficient black-box attack against sequence-based malware classifiers,” in Annual Computer Security Applications Conference, 2020, pp. 611–626.
[14] H. Cai, “Assessing and improving malware detection sustainability through app evolution studies,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 29, no. 2, pp. 1–28, 2020.
[15] H. Cai, N. Meng, B. Ryder, and D. Yao, “Droidcat: Effective android malware detection and categorization via app-level profiling,” IEEE Transactions on Information Forensics and Security, vol. 14, no. 6, pp. 1455–1470, 2018.
[16] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
[17] I. Rosenberg, A. Shabtai, Y. Elovici, and L. Rokach, “Defense methods against adversarial examples for recurrent neural networks,” arXiv preprint arXiv:1901.09963, 2019.
[18] I. Rosenberg, A. Shabtai, Y. Elovici, and L. Rokach, “Sequence squeezing: A defense method against adversarial examples for api call-based rnn variants,” in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–10.
[19] Z. Hu, L. Liu, H. Yu, and X. Yu, “Using graph representation in host-based intrusion detection,” Security and Communication Networks, vol. 2021, 2021.
[20] H. Hasan, B. T. Ladani, and B. Zamani, “Megdroid: A
model-driven event generation framework for dynamic
android malware analysis,”Information and Software
Technology, vol. 135, p. 106569, 2021.
[21] G. D’Angelo, M. Ficco, and F. Palmieri, “Association
rule-based malware classification using common sub-
sequences of api calls,”Applied Soft Computing, vol. 105,
p. 107234, 2021.
[22] W. Hardy, L. Chen, S. Hou, Y. Ye, and X. Li, “Dl4md:
A deep learning framework for intelligent malware de-tection,” inProceedings of the International Conference on
Data Science (ICDATA). The Steering Committee of The
World Congress in Computer Science, Computer . . . ,
2016, p. 61.
[23] M. Rhode, P . Burnap, and K. Jones, “Early-stage mal-
ware prediction using recurrent neural networks,”com-
puters & security, vol. 77, pp. 578–594, 2018.
[24] P . Natani and D. Vidyarthi, “Malware detection using
api function frequency with ensemble based classifier,”
inInternational Symposium on Security in Computing and
Communication. Springer, 2013, pp. 378–388.
[25] X. Xiao, S. Zhang, F. Mercaldo, G. Hu, and A. K. Sanga-
iah, “Android malware detection based on system call
sequences and lstm,”Multimedia Tools and Applications,
vol. 78, no. 4, pp. 3979–3999, 2019.
[26] Y. Guan and N. Ezzati-Jivan, “Malware system calls de-
tection using hybrid system,” in2021 IEEE International
Systems Conference (SysCon). IEEE, 2021, pp. 1–8.
[27] I. Rosenberg, A. Shabtai, L. Rokach, and Y. Elovici,
“Generic black-box end-to-end attack against state of
the art api call based malware classifiers,” inInterna-
tional Symposium on Research in Attacks, Intrusions, and
Defenses. Springer, 2018, pp. 490–510.
[28] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and
A. Vladu, “Towards deep learning models resistant
to adversarial attacks,”arXiv preprint arXiv:1706.06083,
2017.
[29] G. Apruzzese, H. S. Anderson, S. Dambra, D. Free-
man, F. Pierazzi, and K. A. Roundy, “” real attackers
don’t compute gradients”: Bridging the gap between
adversarial ml research and practice,”arXiv preprint
arXiv:2212.14315, 2022.
[30] L. Ye and E. Keogh, “Time series shapelets: a new prim-
itive for data mining,” inProceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery
and data mining, 2009, pp. 947–956.
[31] J. Lines, L. M. Davis, J. Hills, and A. Bagnall, “A
shapelet transform for time series classification,” in Proceedings of the 18th ACM SIGKDD international conference
on Knowledge discovery and data mining, 2012, pp. 289–
297.
[32] J. Grabocka, N. Schilling, M. Wistuba, and L. Schmidt-
Thieme, “Learning time-series shapelets,” in Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining, 2014, pp. 392–401.
[33] L. Bergroth, H. Hakonen, and T. Raita, “A survey of
longest common subsequence algorithms,” in Proceedings Seventh International Symposium on String Processing
and Information Retrieval. SPIRE 2000. IEEE, 2000, pp.
39–48.
[34] S. Al-Saqqa and A. Awajan, “The use of word2vec
model in sentiment analysis: A survey,” in Proceedings
of the 2019 International Conference on Artificial Intelli-
gence, Robotics and Control, 2019, pp. 39–43.
[35] M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog:
Anomaly detection and diagnosis from system logs
through deep learning,” in Proceedings of the 2017 ACM
SIGSAC conference on computer and communications secu-
rity, 2017, pp. 1285–1298.
[36] S. Nedelkoski, J. Bogatinovski, A. Acker, J. Car-
doso, and O. Kao, “Self-attentive classification-based
anomaly detection in unstructured logs,” in 2020 IEEE
International Conference on Data Mining (ICDM). IEEE,
2020, pp. 1196–1201.
[37] S. Lu, X. Wei, Y. Li, and L. Wang, “Detecting anomaly in big data system logs using convolutional neural network,” in 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech). IEEE, 2018, pp. 151–158.
[38] A. Farzad and T. A. Gulliver, “Unsupervised log mes-
sage anomaly detection,”ICT Express, vol. 6, no. 3, pp.
229–237, 2020.
[39] W. Li, X. Fu, and H. Cai, “Androct: ten years of app call
traces in android,” in 2021 IEEE/ACM 18th International
Conference on Mining Software Repositories (MSR). IEEE,
2021, pp. 570–574.
[40] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence
generative adversarial nets with policy gradient,” in
Proceedings of the AAAI conference on artificial intelligence,
vol. 31, no. 1, 2017.
[41] S. Das, Y. Liu, W. Zhang, and M. Chandramohan,
“Semantics-based online malware detection: Towards
efficient real-time protection against malware,” IEEE Transactions on Information Forensics and Security, vol. 11,
no. 2, pp. 289–302, 2015.
[42] H. Cai, “Embracing mobile app evolution via contin-
uous ecosystem mining and characterization,” in Proceedings of the IEEE/ACM 7th International Conference on
Mobile Software Engineering and Systems, 2020, pp. 31–
35.
[43] K. Xu, Y. Li, R. Deng, K. Chen, and J. Xu, “Droide-
volver: Self-evolving android malware detection sys-
tem,” in 2019 IEEE European Symposium on Security and
Privacy (EuroS&P). IEEE, 2019, pp. 47–62.
[44] L. Onwuzurike, E. Mariconti, P. Andriotis, E. D. Cristofaro, G. Ross, and G. Stringhini, “Mamadroid: Detecting android malware by building markov chains of behavioral models (extended version),” ACM Transactions
on Privacy and Security (TOPS), vol. 22, no. 2, pp. 1–34,
2019.
[45] G. Creech and J. Hu, “Generation of a new ids test
dataset: Time to retire the kdd collection,” in 2013
IEEE Wireless Communications and Networking Conference
(WCNC). IEEE, 2013, pp. 4487–4492.
Dongyang Zhan is an assistant professor in the School of Cyberspace Science at Harbin Institute of Technology (HIT). He received the B.S. degree in Computer Science from HIT, where he studied from 2010 to 2014. From 2015 to 2019, he worked toward the Ph.D. degree in the School of Computer Science and Technology at HIT. His research interests include cloud computing and security.
Kai Tan is a Ph.D. candidate at the Harbin Institute of Technology, China. His research focuses on cloud security.
Xiangzhan Yu is a professor in the School of Cyberspace Science at Harbin Institute of Technology. His main research fields include network and information security, security of the Internet of Things, and privacy protection. He has published one academic book and more than 50 papers in international journals and conferences.
Hongli Zhang received her B.S. degree in Computer Science from Sichuan University, Chengdu, China, in 1994, and her Ph.D. degree in Computer Science from Harbin Institute of Technology (HIT), Harbin, China, in 1999. She is currently a professor in the School of Cyberspace Science at HIT. Her research interests include network and information security, network measurement and modeling, and parallel processing.
Lin Ye received the Ph.D. degree from Harbin Institute of Technology in 2011. From January 2016 to January 2017, he was a visiting scholar in the Department of Computer and Information Sciences, Temple University, USA. His current research interests include network security, peer-to-peer networks, network measurement, and cloud computing.
Zheng He is an engineer at the Heilongjiang Meteorological Bureau. She received her bachelor's and master's degrees in Meteorological Science from Nanjing University of Information Science and Technology, where she studied from 2011 to 2018. Since 2018, she has been working in the Weather Modification Office of Heilongjiang Province. Her research interests include climate change, weather modification, and machine learning.