A Dual-stage Prompt-driven Privacy-preserving
Paradigm for Person Re-Identification
Ruolin Li, Min Liu, Yuan Bian, Zhaoyang Li, Yuzhen Li, Xueping Wang and Yaonan Wang
Abstract—With growing concerns over data privacy, re-
searchers have started using virtual data as an alternative to
sensitive real-world images for training person re-identification
(Re-ID) models. However, existing virtual datasets produced by
game engines still face challenges such as complex construction
and poor domain generalization, making them difficult to apply in
real scenarios. To address these challenges, we propose a Dual-
stage Prompt-driven Privacy-preserving Paradigm (DPPP). In
the first stage, we generate rich prompts incorporating multi-
dimensional attributes such as pedestrian appearance, illumina-
tion, and viewpoint that drive the diffusion model to synthesize
diverse data end-to-end, building a large-scale virtual dataset
named GenePerson with 130,519 images of 6,641 identities. In
the second stage, we propose a Prompt-driven Disentanglement
Mechanism (PDM) to learn domain-invariant generalization
features. With the aid of contrastive learning, we employ two
textual inversion networks to map images into pseudo-words
representing style and content, respectively, thereby constructing
style-disentangled content prompts to guide the model in learning
domain-invariant content features at the image level. Experiments
demonstrate that models trained on GenePerson with PDM
achieve state-of-the-art generalization performance, surpassing
those on popular real and virtual Re-ID datasets.
Index Terms—Privacy protection, synthesized dataset, person
re-identification.
I. INTRODUCTION
PERSON re-identification (Re-ID) aims to match the iden-
tity of the same person across camera views and is widely
used in video surveillance, smart security, and other fields [1],
[2], [3], [4], [5], [6], [7], [8]. Most Re-ID methods rely on
representation learning using real-world images captured by
cameras [9], [10], [11], [12], [13], [14]. However, collecting
large-scale pedestrian datasets is not only costly, but also
raises personal privacy concerns, as these raw images usually
contain sensitive information such as identity information
and personal whereabouts. Once a Re-ID model is illegally
accessed, these data can be leaked and misused, posing serious
threats to public safety. Due to these privacy issues, several
This work was supported in part by the National Natural Science Foundation
of China under Grant 62425305, U22B2050 and 62221002, in part by the
Science and Technology Innovation Program of Hunan Province under Grant
2023RC1048, in part by the Hunan Provincial Natural Science Foundation of
China under Grant 2024JJ3013. (Corresponding author: Min Liu.)
Ruolin Li, Min Liu, Yuan Bian, Zhaoyang Li, Yuzhen Li and Yaonan
Wang are with the School of Artificial Intelligence and Robotics at Hu-
nan University and National Engineering Research Center of Robot Vi-
sual Perception and Control Technology, Changsha, 410082, Hunan, China
(e-mail: liruolin@hnu.edu.cn; liu min@hnu.edu.cn; yuanbian@hnu.edu.cn;
zhaoyli@hnu.edu.cn; zzrs@hnu.edu.cn; yaonan@hnu.edu.cn).
Xueping Wang is with the College of Information Science and Engineering
at Hunan Normal University, Changsha, 410081, Hunan, China (e-mail:
wang xueping@hnu.edu.cn).
Fig. 1. Motivation of our method. (a) Our method generates diverse virtual
samples with only text, greatly simplifying the dataset construction process.
(b) PDM learns domain-invariant visual content features under the guidance
of prompts in the joint vision-language space.
Re-ID datasets [15], [16] have been withdrawn in recent years,
leading to a shortage of publicly available datasets and limiting
further research of Re-ID.
Recently, researchers have increasingly focused on privacy-
preserving Re-ID, aiming to protect data while maintaining
model utility. Current approaches primarily include learnable
anonymization [6], [17], [18], [19], [20], [21], federated learn-
ing [22], [23], [24], [25], and virtual data [26], [27], [28],
[29], [30], [31], [32]. The first two methods still require the
use of real data during training, which may lead to potential
privacy risks such as information leakage or gradient inversion
attacks [33], [34]. In contrast, virtual data-based approaches
fundamentally eliminate reliance on real-world images, offer-
ing stricter privacy protection constraints. However, existing
virtual Re-ID datasets are mostly synthesized using game
engines, which still face several challenges: 1) the construction
of existing virtual datasets involves multiple steps, including
clothing texturing, character modeling, motion setting, route
planning, scene setup, rendering, data collection, and anno-
tation, which is inherently complex and reliant on manual
intervention; and 2) there exists a significant domain gap
between virtual and real data, making it difficult for models
to directly transfer features learned from virtual data to real-
world scenarios, resulting in poor generalization performance
on real datasets.
To address these challenges, we propose a Dual-stage
Prompt-driven Privacy-preserving Paradigm, dubbed DPPP.
DPPP aims to establish a low-cost virtual data generation
pipeline and effectively mitigate the domain gap between
virtual and real domains, thereby ensuring privacy preservation
while maintaining the performance of the Re-ID model. In the
first stage, we leverage a large language model to construct
rich and fine-grained prompts that consider multiple attributes
such as clothing, body shape, pose, and lighting, and introduce
a conditional control module along with lightweight adaptation
strategies to precisely control the diffusion model [35], thus
directly generating high-quality and diverse virtual data. As
shown in Fig. 1(a), our generation method employs a sim-
plified one-step process that bypasses the complex multi-step
procedures of traditional virtual dataset construction. In the
second stage, we propose a Prompt-driven Disentanglement
Mechanism (PDM), inspired by recent studies [36], [37], [38],
[39], [40], [41], [42], [43] showing that text representations can
serve as prototypes for characterizing different image styles
within the vision–language joint embedding space (i.e., the
embedding space of CLIP [44]). As shown in Fig. 1(b), each
image can be represented in the joint vision-language space
by a style-content prompt (“a [S] style of a [content]”). Due
to the absence of semantic labels in most Re-ID datasets
and the diversity and variability of pedestrian image styles,
it is challenging to accurately describe the content and style
of pedestrian images with concrete words. Consequently,
we employ two separate textual inversion networks to learn
style and content pseudo-words that represent specific visual
contexts for more flexible image semantic modeling. The
learned content pseudo-word is subsequently used to construct
a style-disentangled content prompt (“a photo of a [content]”)
as a prototype, guiding the model to learn domain-invariant
visual content features, thereby mitigating the impact of style
variations on its generalization to real-world scenarios. We
align the text representations of style-content prompts and con-
tent prompts with their corresponding image representations
through contrastive learning, enabling effective modeling of
style and content.
Our contributions are summarized as follows:
• We propose DPPP, a novel dual-stage prompt-driven virtual data privacy-preserving paradigm that utilizes language prompts to control virtual image generation and guide the model in learning domain-invariant features.
• To the best of our knowledge, this work is the first to apply a diffusion model to privacy-preserving Re-ID, enabling one-step automatic generation of virtual data. Moreover, a large-scale generative Re-ID dataset named GenePerson is constructed, comprising 130,519 images of 6,641 identities in a variety of outfits, poses, and scenes to ensure diversity.
• A novel disentanglement mechanism based on the joint vision-language space is proposed, which independently models image style and content through text, and further utilizes content-relevant text features to guide the learning of domain-invariant visual content representations, thus enhancing generalization to real-world domains.
• Experimental results demonstrate that models trained on GenePerson exhibit better generalization than those trained on other widely used real-world and virtual Re-ID datasets, with the proposed PDM further improving the results.
II. RELATEDWORKS
A. Privacy Preservation in Re-ID
Efforts to address data privacy concerns can be grouped into
three categories: learnable anonymization [6], [17], [18], [19],
[20], [21], federated learning [22], [23], [24], and virtual data
[26], [27], [28], [31], [32], [29], [30]. The first two approaches
assume that visual information about the target is available.
However, the collection of datasets has become extremely
difficult due to increasingly stringent privacy restrictions.
For example, DukeMTMC-reID [45] was withdrawn due to
privacy concerns; meanwhile, some European countries have
enacted laws prohibiting unauthorized collection and use of
personal data [46], [47]. In this context, virtual data-based
methods are gradually gaining attention. These methods can effectively re-identify real-world individuals using only synthetic data for training, which fundamentally eliminates the privacy issues that may arise from training with real images. Barbosa et al. [26] manually create SOMAset, which consists of 50 human models. Bak et al. [27] propose SyRI, which renders 100 virtual identities under different lighting conditions using HDR environment maps. Sun et al. [28] propose PersonX, a dataset comprising 1,266 characters with 36 different viewpoints. The SynPerson [29] and FineGPR [30] datasets both consider weather conditions. Wang et al. [31] propose RandPerson, a large-scale synthetic dataset containing 8,000 characters. Zhang et al. [32] propose another large synthetic dataset, UnrealPerson, which contains 3,000 characters and renders scenes with greater realism.
Essentially, these game engine-based virtual data methods
rely on complex multi-step production pipelines and multiple
software tools. In contrast, our approach achieves end-to-end
generation through a diffusion model [35] that can directly
synthesize a variety of virtual pedestrian images based on text
prompts we generate. This text-to-image generation paradigm
not only simplifies the dataset construction pipeline, but also
provides the flexibility to create various samples through
natural language.
B. Domain Generalization
A major challenge of virtual data lies in the lack of real
images and the large domain shift between virtual and real
data, resulting in limited generalization of trained models
in real-world deployments. To mitigate such domain gap,
Domain Generalization (DG) aims to learn domain-invariant
representations that can generalize to unseen target domains.
Traditional DG methods rely on models trained exclusively on
image data [48], [49], [50], [51]. Recently, with the advance-
ment of large-scale vision-language models (i.e., CLIP [44]),
Fig. 2. Overview of our framework, which consists of two parts. (a) The prompt-driven virtual image generation pipeline shows the generation of our
dataset GenePerson. (b) The disentanglement stage first uses text to effectively capture the style and content information of the image, and then utilizes
style-disentangled prompts as a guide for disentangling the visual representations.
several DG methods have begun exploring integration with
prompt engineering to better leverage semantic information for
domain generalization [36], [37], [52], [53], [54]. These large-
scale vision-language models are trained on massive image-
text pairs, aligning image and text representations in a shared
semantic space. Various visual tasks focusing on generalization
can utilize this space to efficiently manipulate visual features
through domain-invariant prompts [38], [39], [55], [56], [57],
[58], [59], [60]. Yang et al. [58] employ CLIP as a tool for text-driven feature manipulation and domain expansion. Fahes et al. [39] leverage textual prompts from the target domain to guide the mapping of source features in the CLIP latent space. Cho et al. [37] model various image style distributions in the CLIP latent space through textual prompts to enhance the model’s domain generalization ability. These works rely
on semantic labels to construct textual descriptions, which are
unavailable in Re-ID where only index-based identity labels
are provided. To address this limitation, we introduce textual
inversion [40] to encapsulate visual context into prompts,
serving as an alternative to semantic annotation. Also, unlike
existing work [13] that inverts an image into a rough global pseudo-word token, we decompose the inversion task into two dimensions: content and style.

III. METHODOLOGY
A. Problem Definition
We consider a privacy-preserving Re-ID task, where access to the appearance of real pedestrians is restricted. Therefore, we construct a large-scale virtual person dataset via a text-to-image diffusion model, which serves as the source domain, denoted by $\mathcal{D}_s = \{x_i^s, y_i^s\}_{i=1}^{n_s} \approx P_s$, where $x_i^s \in \mathcal{X}_s$, $y_i^s \in \mathcal{Y}_s$, and $P_s$ denote the input data, the identity label, and the joint distribution over the data and label spaces, respectively. The goal is to train a model on $\mathcal{D}_s$ that can be directly deployed in an unseen real-world target domain $\mathcal{D}_t = \{x_i^t, y_i^t\}_{i=1}^{n_t} \approx P_t$, where $x_i^t \in \mathcal{X}_t$, $y_i^t \in \mathcal{Y}_t$, and $P_t$ denotes the target distribution. In our work, we consider $P_t \neq P_s$, indicating that the target and source distributions are mutually distinct.
B. Image Generation Pipeline
In the first stage, we propose a prompt-driven virtual data
auto-generation pipeline for building our dataset GenePerson.
As shown in Fig. 2(a), the overall production process contains
two core modules, prompt generation and image synthesis.
Fig. 3. Pedestrians with different postures in the GenePerson dataset. Corresponding virtual samples are generated based on the given poses.
Fig. 4. Some of the image samples in the proposed GenePerson dataset, including 3 different pedestrians in different background scenes, and pedestrians in 9 different illumination conditions.
Character Prompt Generation.Text-to-image synthesis
models rely on semantic control from the input of textual
prompts. As shown in Fig. 2(a)(i), we design a prompt
template containing multiple placeholders, each representing
key visual attributes of a character, such as clothing style,
body shape, hairstyle, and skin color. By substituting these
placeholders with specific terms in different combinations, we
can generate rich prompts. We utilize ChatGPT to generate candidate attribute descriptions through the following questions: 1) “Please list specific [color] names”; 2) “What types of [patterns] are used in clothing design”; 3) “What are the styles of clothing”; 4) “What are the different types of body shapes”; and 5) “What are the different kinds of hairstyles”, along with additional questions such as “Please list the common skin colors”. The responses
are then inserted into the corresponding placeholders to form
initial textual prompts. These initial prompts are subsequently
randomly combined based on the attributes to generate diverse
descriptions, with deduplication applied to the combinations
of upper wear, lower wear, and footwear color and style,
as clothing is a crucial visual cue for identity recognition.
For example, a final prompt might read: “girl, walking, pink
plaid shirt, black distressed jeans, white loafers, day, sunshine,
slim figure, maternal build body, dark brown hair color, waist-
length hair, straight hair texture, high ponytail, bangs, warm beige skin, simple background, multiple views of the same
character, multiple views, side view, back view, front view”.
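The template-filling and outfit-deduplication procedure described above can be sketched as follows. The attribute pools here are illustrative stand-ins for the ChatGPT-generated lists, and the template is abbreviated from the paper's full version; only the dedup-on-outfit logic follows the pipeline in Fig. 2(a)(i):

```python
import random

# Illustrative attribute pools; the paper's actual pools come from ChatGPT answers.
ATTRS = {
    "gender": ["girl", "man", "woman", "boy"],
    "upper": ["pink plaid shirt", "navy hoodie", "white t-shirt"],
    "lower": ["black distressed jeans", "khaki shorts", "gray trousers"],
    "shoes": ["white loafers", "black sneakers", "brown boots"],
    "lighting": ["day, sunshine", "dusk", "night, dark"],
    "body": ["slim figure", "athletic build", "plump figure"],
    "hair": ["dark brown waist-length straight hair", "short black curly hair"],
    "skin": ["warm beige skin", "fair skin", "dark skin"],
}

TEMPLATE = ("{gender}, walking, {upper}, {lower}, {shoes}, {lighting}, "
            "{body}, {hair}, {skin}, simple background, multiple views of "
            "the same character, multiple views, side view, back view, front view")

def generate_prompts(n, seed=0):
    """Randomly combine attributes into prompts; discard combinations
    that repeat the (upper, lower, shoes) outfit triple, since clothing
    is the key visual cue for identity."""
    rng = random.Random(seed)
    seen_outfits, prompts = set(), []
    while len(prompts) < n:
        choice = {k: rng.choice(v) for k, v in ATTRS.items()}
        outfit = (choice["upper"], choice["lower"], choice["shoes"])
        if outfit in seen_outfits:  # repeated outfit -> discard
            continue
        seen_outfits.add(outfit)
        prompts.append(TEMPLATE.format(**choice))
    return prompts
```

Each returned string is a complete character prompt ready to condition the diffusion model.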
Character Generation.We introduce the Stable Diffusion
Model (SDM) [35] to end-to-end synthesize diverse characters
based on gender, body shape, styling, viewpoint, illumination,
and other factors via generated prompts. Although the native
SDM excels at generic images, it remains limited in rendering details such as clothing textures, poses, limbs, and lighting, making it insufficient for fine-grained Re-ID tasks [61]. To generate reasonable characters, we explore LoRA [62]
for targeted enhancements: 1) an Image Enhancement LoRA
to improve the control of the model over visual details; 2) a
Lighting Adaptation LoRA to enhance adaptability to complex
lighting scenarios; and 3) a Style Modulation LoRA to achieve
more natural appearances in character generation. Further, we
integrate ControlNet [63] for finer control through additional
inputs from conditional images. We input an edge map along
with multiple pose skeleton maps to simultaneously generate
multi-pose images of the same pedestrian. Fig. 3 illustrates the
characters generated based on the specified poses. Ultimately,
we generate 6,641 virtual pedestrians.
Scenario Design.To construct a scene-diverse dataset, we
employ SDM to generate background images of indoor and
outdoor scenes, such as streets, commercial areas, parks, fields
and residential areas. We embed different simple temporal
lighting descriptions in the background prompts, such as “day,
sunshine”, “dusk” and “night, dark”. Finally, we perform
pixel-level fusion of the segmented virtual characters with
the generated background images according to the lighting
conditions. Fig. 4 shows images of the same pedestrian with
different backgrounds, and the pedestrian images under differ-
ent lighting conditions in the GenePerson dataset.
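The pixel-level fusion step can be sketched as a standard alpha composite. `alpha_fuse` and its arguments are illustrative names; the lighting-conditioned pairing of person and background is omitted here:

```python
import numpy as np

def alpha_fuse(person_rgba, background_rgb, top_left=(0, 0)):
    """Blend a segmented person (RGBA, alpha from U2Net-style matting)
    onto a generated background (RGB) using the alpha channel as weight."""
    out = background_rgb.astype(np.float32).copy()
    h, w = person_rgba.shape[:2]
    y, x = top_left
    alpha = person_rgba[..., 3:4].astype(np.float32) / 255.0   # (h, w, 1)
    fg = person_rgba[..., :3].astype(np.float32)
    region = out[y:y + h, x:x + w]
    out[y:y + h, x:x + w] = alpha * fg + (1.0 - alpha) * region
    return out.astype(np.uint8)
```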
Data Annotation.Our method simultaneously generates
multi-view images of the same identity on a single canvas,
enabling automatic cropping and labeling without additional
data collection, thus achieving zero-cost annotation.
C. Pseudo-word Token Learning
As shown in Fig. 2(b)(i), to capture the style and content
information of each input virtual image, we utilize two in-
dependently parameterized textual inversion networks to learn
trainable tokens corresponding to the style and content pseudo-words in the style-content prompt “a $[S_1^*, \cdots, S_k^*]$ style of a $[C^*]$ person”. Note that the text encoder is kept frozen during this learning stage.
Content Pseudo-word Token. We aim to encapsulate the domain-shared identity information in an image into a content pseudo-word $C^*$. Specifically, we extract the global visual embedding $v_g \in \mathbb{R}^{d_1}$ from the final layer of the CLIP visual encoder, where $d_1$ denotes the dimension of the global feature. A multi-layer perceptron (MLP) is employed as an inversion network to invert $v_g$ into a pseudo-word token $c^*$. This process can be formalized as:
$$c^* = f_\theta(v_g), \tag{1}$$
where $f_\theta(\cdot)$ denotes an inversion network parameterized by $\theta$, and $c^* \in \mathcal{T}^*$, where $\mathcal{T}^*$ denotes the token embedding space.
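A minimal sketch of the inversion network $f_\theta$ of Eq. (1). The three-layer MLP with hidden width 512 follows the implementation details; the 512-d visual and token dimensions are assumptions matching CLIP ViT-B/16:

```python
import torch
import torch.nn as nn

class ContentInversion(nn.Module):
    """f_theta: invert the global visual embedding v_g into a content
    pseudo-word token c* (Eq. 1). Three-layer MLP as in the paper's
    implementation details; exact dimensions are assumptions."""
    def __init__(self, d_visual=512, d_token=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_visual, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, d_token),
        )

    def forward(self, v_g):          # v_g: (B, d_visual)
        return self.net(v_g)         # c*:  (B, d_token)
```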
Style Pseudo-word Token. Style is usually expressed as a combination of local visual elements such as color distribution and texture structure, which relies on the combination of multiple related local regions rather than a single image patch. To effectively model image style, we assume each image contains $n$ potential style elements, represented by a set of learnable tokens corresponding to style pseudo-words $S^* = \{S_i^*\}_{i=1}^{k}$, which are learned through adaptive aggregation of local patch features. Additionally, considering that the deepest layers of visual encoders primarily capture global semantic information such as identity or category while lacking local pattern details [64], [65], [66], we extract local features from the penultimate layer, as it better preserves the low- and mid-level visual attributes related to image style.

Let $V = \{v_l^i\}_{i=1}^{m} \in \mathbb{R}^{d_2 \times m}$ denote the patch features output from the penultimate layer of the encoder, where $m$ is the number of image patches and $d_2$ is the feature dimension of each patch. We define a set of $n$ learnable tokens $T_s = \{T_i\}_{i=1}^{n} \in \mathbb{R}^{d_2 \times n}$, where each token corresponds to a latent style element. These tokens and $V$ are jointly fed into a Transformer block for adaptive optimization. To ensure alignment with CLIP's textual embedding space, a fully connected layer is utilized to project the output into the target dimension. The entire process can be formalized as:
$$\tilde{V} = \mathrm{FC}\left(\mathrm{Transformer}\left([T_s \,|\, V]\right)\right), \tag{2}$$
where $\tilde{V} = \{\tilde{v}_l^i\}_{i=1}^{n} \in \mathbb{R}^{d_1 \times n}$ denotes the style features of the image, and $[\cdot\,|\,\cdot]$ means concatenation.
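Eq. (2) can be sketched as the following module. The 3 layers and 1 head follow the implementation details; the 768-d patch dimension (ViT-B/16 penultimate layer) and 512-d text dimension are assumptions:

```python
import torch
import torch.nn as nn

class StyleTokenizer(nn.Module):
    """Sketch of Eq. (2): n learnable style tokens T_s are concatenated with
    the patch features V, refined by a Transformer block, and projected by an
    FC layer into CLIP's text embedding dimension d1."""
    def __init__(self, d_patch=768, d_text=512, n_tokens=24):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, d_patch) * 0.02)  # T_s
        layer = nn.TransformerEncoderLayer(d_model=d_patch, nhead=1,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.fc = nn.Linear(d_patch, d_text)
        self.n_tokens = n_tokens

    def forward(self, patches):                  # patches V: (B, m, d_patch)
        B = patches.size(0)
        tok = self.tokens.unsqueeze(0).expand(B, -1, -1)
        x = self.encoder(torch.cat([tok, patches], dim=1))   # [T_s | V]
        return self.fc(x[:, :self.n_tokens])     # V~: (B, n, d_text)
```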
Notably, our approach strictly aligns the discriminative dimension during style disentanglement, which is expected to ensure that the learned style features focus only on visual features relevant to identity matching. To mitigate interference from irrelevant noisy style representations, we introduce a Global Feature-based Local Filtering Module (GFLFM). GFLFM utilizes the global image feature as a soft reference to evaluate the relevance of each style feature, and retains only those with high relevance. First, for the $i$-th style feature $\tilde{v}_l^i$, its attention weight $w_i$ with the global feature $v_g$ is computed as:
$$w_i = \frac{\exp\left(\tilde{v}_l^{i\top} v_g\right)}{\sum_{j=1}^{n} \exp\left(\tilde{v}_l^{j\top} v_g\right)}. \tag{3}$$
Then, based on the computed weights $W = \{w_i\}_{i=1}^{n}$, we sort the style features and select the top $k$ with the highest scores:
$$V_L = \mathrm{TopK}\left(\tilde{V}, W, k\right), \tag{4}$$
where $V_L \in \mathbb{R}^{d_1 \times k}$ denotes the final selected style features.

Similarly, we employ an MLP-based inversion network, parameterized by $\phi$, to invert the selected style features into a set of pseudo-word tokens. This can be mathematically described as:
$$[s_1^*, \cdots, s_k^*] = f_\phi(V_L). \tag{5}$$
Cross-modal Contrastive Learning. After obtaining the content and style pseudo-word tokens, each image can be represented as a structured textual description, which is then fed into the frozen CLIP text encoder to obtain the corresponding text feature, denoted as $t_{s\text{-}c}$. To encourage pseudo-words to efficiently encapsulate the corresponding visual contexts belonging to the same identity, we apply the following symmetric supervised contrastive loss:
$$\mathcal{L}^1_{\mathrm{SupCon}} = \mathcal{L}_{i2t} + \mathcal{L}_{t2i}, \tag{6}$$
$$\mathcal{L}_{i2t} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{p^+ \in P(i)} \log \frac{\exp\left(\mathrm{sim}\left(v_g^i, t_{s\text{-}c}^{p^+}\right)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}\left(v_g^i, t_{s\text{-}c}^{j}\right)/\tau\right)}, \tag{7}$$
$$\mathcal{L}_{t2i} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{p^+ \in P(i)} \log \frac{\exp\left(\mathrm{sim}\left(t_{s\text{-}c}^i, v_g^{p^+}\right)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}\left(t_{s\text{-}c}^i, v_g^{j}\right)/\tau\right)}, \tag{8}$$
where $v_g^i$ and $t_{s\text{-}c}^i$ denote the global image feature and the style-content text feature of the $i$-th image in a batch of size $B$, respectively, $P(i)$ denotes the positive samples associated with $v_g^i$ and $t_{s\text{-}c}^i$, and $\tau$ is a temperature hyperparameter.
To enhance the identity discriminability of content pseudo-words, we utilize a triplet loss to encourage pseudo-words with the same identity to cluster in the text embedding space, while pushing apart those with different identities. The content pseudo-word tokens are fed into the CLIP text encoder to obtain their corresponding text representations $t_{c^*}$. Given an anchor $t_{c^*}^a$, a positive $t_{c^*}^p$, and a negative $t_{c^*}^n$, the triplet loss is formulated as:
$$\mathcal{L}_{\mathrm{Tri\text{-}txt}} = \max\left(\|t_{c^*}^a - t_{c^*}^p\|_2^2 - \|t_{c^*}^a - t_{c^*}^n\|_2^2 + \delta, \; 0\right), \tag{9}$$
where $\|\cdot\|_2$ denotes the Euclidean distance and $\delta$ is a margin hyperparameter.
We also employ the standard Re-ID losses, i.e., the triplet loss and the identity loss [67], to optimize the image encoder:
$$\mathcal{L}^1_{\mathrm{Re\text{-}ID}} = \mathcal{L}^1_{\mathrm{Tri\text{-}img}} + \mathcal{L}^1_{\mathrm{ID}}, \tag{10}$$
$$\mathcal{L}^1_{\mathrm{Tri\text{-}img}} = \max\left(d_p - d_n + \alpha, \; 0\right), \tag{11}$$
$$\mathcal{L}^1_{\mathrm{ID}} = -\frac{1}{B} \sum_{j=1}^{B} \log p_j, \tag{12}$$
where $p_j$ is the ID prediction probability for the $j$-th class, $d_p$ and $d_n$ represent the feature distances of the positive and negative pairs, and $\alpha$ denotes the margin. Notably, in accordance with CLIP-ReID [68], we use the features before and after the linear layer following the transformer for the computation here, and additionally compute $\mathcal{L}^1_{\mathrm{Tri\text{-}img}}$ after the 11-th transformer layer.

The overall objective is defined as follows:
$$\mathcal{L}_{\mathrm{PTL}} = \mathcal{L}^1_{\mathrm{Re\text{-}ID}} + \mathcal{L}^1_{\mathrm{SupCon}} + \mathcal{L}_{\mathrm{Tri\text{-}txt}}. \tag{13}$$
D. Prompt-driven Style Disentanglement
After learning the style and content pseudo-words, as de-
picted in Fig. 2(b)(ii), we further perform the style disentan-
glement at the image level guided by the content text features.
Concretely, we design a de-stylization projector $P_s$ to filter out the style information from the raw image features, obtaining the visual content feature, denoted as $v_c \in \mathbb{R}^{d_1}$. The process is defined as:
$$v_c = P_s(v_g). \tag{14}$$
Note that during this learning stage, only the parameters of $P_s$ are tunable.

With the aim of learning visually domain-invariant representations, we encourage $v_c$ to gather around style-disentangled content prompts. Specifically, the trained content inversion network $f_\theta$ is kept fixed, and a content prompt is constructed as “a photo of a $[C^*]$ person”, where $C^*$ is the learned content pseudo-word. This content prompt is fed into the frozen CLIP text encoder to derive the content-oriented text feature $t_c$, and $v_c$ is encouraged to align with $t_c$ through a symmetric supervised contrastive loss, thereby enabling the learning of modality-shared identity content. The loss is formulated as:
$$\mathcal{L}^2_{\mathrm{SupCon}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{p^+ \in P(i)} \left( \log \frac{\exp\left(\mathrm{sim}\left(v_c^i, t_c^{p^+}\right)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}\left(v_c^i, t_c^{j}\right)/\tau\right)} + \log \frac{\exp\left(\mathrm{sim}\left(t_c^i, v_c^{p^+}\right)/\tau\right)}{\sum_{j=1}^{B} \exp\left(\mathrm{sim}\left(t_c^i, v_c^{j}\right)/\tau\right)} \right). \tag{15}$$
We also perform the Re-ID loss computation using $v_c$ to enhance the modeling of identity-related features. Ultimately, the overall loss for training $P_s$ is defined as:
$$\mathcal{L}_{\mathrm{PSD}} = \mathcal{L}^2_{\mathrm{Re\text{-}ID}} + \mathcal{L}^2_{\mathrm{SupCon}}. \tag{16}$$

Fig. 5. At inference time, the trained visual encoder and de-stylization projector are used to extract image content features.
E. Inference
As shown in Fig. 5, we employ the trained visual encoder
and de-stylization projector for inference. Given a query image
xt∈Drfrom the real domain, the visual encoder extracts its
original visual featurev r∈Rd1. Subsequently,v ris passed
through the de-stylization projector to obtain the visual content
feature˜v r. We take˜v ras the final identity representation,
which is matched for similarity with the gallery corresponding
embedding.TABLE I
THE STATISTICS OF REAL-WORLD DATASETS IN OUR EXPERIMENTS.
Dataset #ID #Images #Cams
Market-1501 [69] 1,501 32,668 6
DukeMTMC-reID [15] 1,404 36,411 8
TABLE II
ACOMPARISON OF SOME SYNTHETIC DATASETS. “GENERATION”REFERS
TO DATA GENERATION TECHNIQUES.
Dataset #ID #Images Generation
SOMAset [26] 50 100,000 Game Engine
SyRI [27] 100 1,680,000 Game Engine
PersonX [28] 1,266 273,456 Game Engine
RandPerson [31] 8,000 132,145 Game Engine
UnrealPerson [32] 3,000 120,000 Game Engine
FineGPR [30] 1,150 2,028,600 Game Engine
GenePerson (Ours) 6,641 130,519 Generative Model
IV. EXPERIMENTS
A. Experimental Settings
Datasets and Evaluation Protocols.In this paper, two
widely used real-world person Re-ID datasets are used for
generalization evaluation, including Market-1501 [69] and
DukeMTMC-reID [15]. The statistics of the real datasets
are listed in Tab. I, and six existing synthetic datasets for
comparison are described in Tab. II.
Following common practice in the Re-ID community, this
work evaluates the performance using mean Average Precision
(mAP) and Cumulative Matching Characteristics (CMC) at
Rank-1 and Rank-5.
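Given a query-by-gallery similarity matrix, these metrics can be computed as in the simplified sketch below (single-camera setting; real Re-ID evaluation additionally filters same-camera and junk matches):

```python
import numpy as np

def eval_rank(sim, q_ids, g_ids):
    """Simplified CMC Rank-k and mAP from a (num_query, num_gallery)
    similarity matrix; no camera or junk filtering."""
    ranks, aps = [], []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])                     # best match first
        matches = (g_ids[order] == q_ids[i]).astype(np.float32)
        ranks.append(np.argmax(matches))                # rank of first hit
        hits = np.cumsum(matches)
        prec = hits / (np.arange(len(matches)) + 1)     # precision@k
        aps.append((prec * matches).sum() / matches.sum())
    ranks = np.asarray(ranks)
    cmc = lambda k: float((ranks < k).mean())
    return cmc(1), cmc(5), float(np.mean(aps))
```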
Implementation Details.For the image generation stage,
we select Stable Diffusion XL as the generative model and
Euler a as the sampling algorithm. Human pose skeletons are
pre-generated using OpenPose [70] and fed into ControlNet.
Meanwhile, we prepare a grid map with 42 evenly distributed
blocks, which is processed using the Canny edge detection
algorithm [71] before being fed into ControlNet. Subsequently,
each generated image is automatically divided into 42 equally-
sized regions, which are assigned identity labels corresponding
to the image. Finally, U2Net [72] is used to segment the
foreground person from each region, which is fused to the
background based on the alpha channel.
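The automatic 42-region cropping might look like the following; the 6×7 layout is an assumption, since the paper only specifies 42 evenly distributed blocks:

```python
import numpy as np

def crop_grid(canvas, rows=6, cols=7):
    """Split a generated multi-view canvas into rows*cols equally sized
    regions (42 blocks), each inheriting the canvas's identity label."""
    h, w = canvas.shape[:2]
    bh, bw = h // rows, w // cols
    return [canvas[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            for r in range(rows) for c in range(cols)]
```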
For the feature disentanglement stage, we adopt the ViT-
B/16 pre-trained CLIP as the backbone to extract features.
Our framework adds randomly initialized inversion networks,
Transformer and de-stylization projector. Both content and
style inversion networks are designed as three-layer MLPs
with hidden dimensions of 512 and 2048, respectively. The
Transformer for extracting local style features of the image is set to 3 layers and 1 head, and the number of potential style elements $n$ is set to 24. Inspired by the projection head
design in CLIP, the de-stylization projector is implemented as
a lightweight two-layered perceptron. A batch normalization
7
TABLE III
PERFORMANCECOMPARISON WITHEXISTINGREAL ANDSYNTHETICDATASETS ONMARKET-1501ANDDUKEMTMC-REID, RESPECTIVELY.
Testing Set→ Market-1501 DukeMTMC-reID
Training Set↓ Synthetic Rank-1 Rank-5 mAP Rank-1 Rank-5 mAP
Market-1501 [69] × - - - 30.7 45.0 15.0
DukeMTMC-reID [15] × 49.8 66.8 22.5 - - -
SOMAset [26] ✓ 4.5 - 1.3 4.0 - 1.0
SyRI [27] ✓ 29.0 - 10.8 23.7 - 9.0
PersonX [28] ✓ 44.0 - 20.4 35.4 - 18.1
FineGPR [30] ✓ 50.5 67.7 24.6 - - -
RandPerson [31] ✓ 55.6 - 28.8 47.6 - 27.1
UnrealPerson [32] ✓ 54.4 70.2 27.9 48.2 64.5 26.3
GenePerson (Ours) ✓ 57.0 71.9 31.9 56.1 70.5 36.1
GenePerson†(Ours) ✓ 57.7 73.0 32.6 57.5 71.3 37.2
Redindicates the best andbluethe second best. A dagger (†) means training on our PDM method. Unrealperson is extracted
from unreal v1.1, unreal v2.1, unreal v3.1 and unreal v4.1.
layer followed by a linear fully connected layer is placed at
the end of the network. The number of effective local features
k in Eq. (4) is set to 6. The batch size is set to 64, containing
8 identities with 8 images per identity. All input images are
resized to 256×128. In addition, to enhance intra-class variation,
we apply image-level augmentation by independently
sampling a contrast coefficient in the range (0.5, 1.5) for each
image. We use the Adam optimizer for training, with an initial
learning rate of 5e-6 for the visual encoder and 5e-5 for the
randomly initialized modules. Both the visual encoder and
the de-stylization projector are trained for 20 epochs, and the
learning rate is decayed by a factor of 0.1 at the
15th and 20th epochs.
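The batch composition and step schedule above can be sketched in a few lines. This is a hedged illustration: the helper names (`pk_batch`, `lr_at_epoch`) and the with-replacement fallback for small identities are our own choices, not the authors' implementation.

```python
import random

def pk_batch(index_by_id, p=8, k=8, rng=None):
    """Identity-balanced batch: sample P identities, then K images each.
    Identities with fewer than K images are sampled with replacement
    (an assumption; the paper does not specify this case)."""
    rng = rng or random.Random(0)
    batch = []
    for pid in rng.sample(sorted(index_by_id), p):
        imgs = index_by_id[pid]
        picks = (rng.sample(imgs, k) if len(imgs) >= k
                 else [rng.choice(imgs) for _ in range(k)])
        batch.extend(picks)
    return batch  # p * k samples, e.g. 8 x 8 = 64

def lr_at_epoch(epoch, base_lr, milestones=(15, 20), gamma=0.1):
    """Step schedule: multiply the base rate by gamma at each passed milestone."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

With `base_lr=5e-5` for the randomly initialized modules, the rate drops to 5e-6 after epoch 15 and 5e-7 after epoch 20, matching the schedule described above.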
The entire framework runs on a single NVIDIA RTX
A6000 GPU with 48 GB of VRAM; the training stage is
implemented in PyTorch.
B. Comparison with State-of-the-art Methods
We evaluate the generalization of GenePerson using direct
transfer, which means training a model on a specific dataset
and then evaluating its performance on another dataset without
any adjustments. We employ two real-world datasets as testing
sets, and the evaluation results are shown in Tab. III. It can
be seen that our proposed GenePerson dataset outperforms all
real-world and synthetic datasets, achieving Rank-1 accuracies
of 57.0% and 56.1% on Market-1501 and DukeMTMC-reID,
respectively. Although our efficiently generated GenePerson is only
one-fifteenth the size of FineGPR, its remarkable improvements
on real-world benchmarks demonstrate its effectiveness
in modeling different identities and scene variations with
finite training data. We attribute this superior result to
GenePerson's more diverse pedestrians and scenes. In particular,
the best results are achieved when training on GenePerson
with our PDM method, which yields further improvements in
Rank-1 accuracy of 0.7% and 1.4% on the two real-world
datasets, respectively. Our results emphasize that introducing
TABLE IV
ABLATION STUDY OF THE EFFECTIVENESS OF EACH LOSS COMPONENT IN PDM ON MARKET-1501.

     L1_SupCon  L1_Re-ID  L_Tri-txt  L2_Re-ID  L2_SupCon  Rank-1  mAP
a)       ✓         -          -          -         -        23.0   10.6
b)       ✓         ✓          -          -         -        55.0   29.9
c)       ✓         ✓          ✓          -         -        56.4   31.7
d)       ✓         ✓          ✓          ✓         -        57.3   32.4
e)       ✓         ✓          ✓          ✓         ✓        57.7   32.6
TABLE V
ABLATION OF TRAINING WITH DIFFERENT PROMPT STRATEGIES, INCLUDING GENERIC PROMPT, CONTENT PROMPT, AND STYLE-CONTENT PROMPT, EVALUATED ON MARKET-1501.

Method                 Rank-1  Rank-5  mAP
generic prompt          54.2    70.7   29.0
content prompt          55.8    71.2   31.5
style-content prompt    56.4    71.3   31.7
disentangled text representations of style and content during
training encourages the model to capture domain-invariant
visual features.
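As a concrete reference for how direct transfer is scored, a minimal Rank-1/mAP computation over a query-by-gallery distance matrix might look as follows. This is a sketch of standard Re-ID evaluation under simplifying assumptions, not the authors' exact protocol: junk/distractor images and same-camera filtering are omitted.

```python
def evaluate_direct_transfer(dist, q_ids, g_ids):
    """Compute CMC Rank-1 and mAP from a query-by-gallery distance matrix.

    dist[i][j] is the distance between query i and gallery image j;
    q_ids and g_ids are the identity labels of queries and gallery images.
    """
    rank1_hits, aps = 0, []
    for row, qid in zip(dist, q_ids):
        # gallery indices sorted by increasing distance to this query
        order = sorted(range(len(g_ids)), key=lambda j: row[j])
        # 0-based ranks at which a gallery image shares the query identity
        hits = [rank for rank, j in enumerate(order) if g_ids[j] == qid]
        if not hits:
            continue  # query has no ground-truth match in the gallery
        rank1_hits += hits[0] == 0
        # average precision: precision evaluated at each correct rank
        aps.append(sum((m + 1) / (rank + 1)
                       for m, rank in enumerate(hits)) / len(hits))
    return rank1_hits / len(aps), sum(aps) / len(aps)
```

Running this on a model's cross-dataset distance matrix, with no fine-tuning on the target set, is exactly the "direct transfer" setting evaluated in Tab. III.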
C. Ablation Study
Ablation Study on Loss Components. In Tab. IV, we
evaluate the contribution of each loss component to model
performance through the incremental addition of loss terms. Row
a) serves as the baseline, using only the contrastive supervision
L1_SupCon on the initial pseudo-words. Comparing row b) with
row a), we observe a significant improvement after introducing
L1_Re-ID to enforce identity consistency during pseudo-word
learning, showing that identity supervision is crucial for model
TABLE VI
EFFECTIVENESS OF THE GLOBAL FEATURE-BASED LOCAL FILTERING MODULE AND SENSITIVITY ANALYSIS TO THE NUMBER OF LOCAL FEATURES k ON MARKET-1501.

Method              k         mAP   Rank-1  Rank-5
w/o filter module   24        30.6   54.5    70.2
w/ filter module    4         31.7   55.9    71.7
                    6 (Ours)  32.6   57.7    78.7
                    8         32.0   56.6    72.1
                    10        31.9   55.9    72.1
                    12        30.9   55.3    70.9
[Fig. 6 plot: probability density of feature distances, "Baseline" vs. "Baseline + PDM".]
Fig. 6. Visualization of distance distributions between randomly selected
cross-domain sample pairs from Market-1501 and GenePerson, before and
after applying PDM.
training. In row c), a triplet constraint L_Tri-txt is applied to
the text features corresponding to the content pseudo-words,
which contributes to enhancing discriminability in the semantic
space. Furthermore, building on the joint optimization in
the pseudo-word token learning module, the introduction of
the prompt-driven style disentanglement module, which employs
L2_Re-ID to learn image content, leads to performance gains,
as shown in row d). This confirms that the de-stylization
projector can acquire domain-invariant content features for
identity recognition. Finally, as seen in row e), the best
performance is achieved when L2_SupCon is added to align the
visual representations with the content text representations.
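For reference, the supervised contrastive term that appears throughout Tab. IV can be sketched with a generic SupCon formulation over L2-normalized features. The temperature value and the per-positive averaging are our assumptions, not the paper's exact configuration.

```python
import math

def supcon_loss(feats, labels, tau=0.07):
    """Generic supervised contrastive loss over L2-normalized vectors.

    For each anchor i, every same-label sample j is a positive; all other
    samples form the denominator. tau is an assumed temperature."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    n, total, count = len(feats), 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        denom = sum(math.exp(dot(feats[i], feats[k]) / tau)
                    for k in range(n) if k != i)
        for j in positives:
            total += -math.log(math.exp(dot(feats[i], feats[j]) / tau) / denom)
            count += 1
    return total / max(count, 1)
```

The loss shrinks as same-identity features align, which is the behavior row a) of Tab. IV relies on for the initial pseudo-words.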
Ablation Study on Style-Content Prompt. To explore
whether the disentangled modeling strategy of style and content
pseudo-word tokens provides a more granular guide for
generalizable learning of visual features, we design three sets
of comparison experiments in Tab. V. We train a baseline
model that relies only on a generic template "a photo of a
person" without any pseudo-word tokens to provide semantic
guidance. We then use "a photo of a [C*] person" and "a
[S*_1, ..., S*_k] style of a [C*] person" as prompts in turn to
explore the effects of style and content pseudo-words. The results
in Tab. V show that the mAP improves by 2.5% when the
content pseudo-word is incorporated into the template. When
style and content are jointly modeled, the model achieves
optimal performance with 31.7% mAP. This result emphasizes
that modeling visual concepts at different semantic levels can
effectively enhance the generalization of the model.
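The three templates compared in Tab. V can be assembled mechanically. In this sketch the bracketed pseudo-word tokens are stand-in strings ("[C*]", "[S*_i]") rather than the learned embeddings that would actually be injected into the text encoder.

```python
def build_prompt(content_token=None, style_tokens=None):
    """Return the generic, content, or style-content template from Tab. V.

    content_token / style_tokens stand in for learned pseudo-word
    embeddings; here they are plain placeholder strings."""
    if content_token is None:
        return "a photo of a person"                      # generic prompt
    if not style_tokens:
        return f"a photo of a {content_token} person"     # content prompt
    styles = ", ".join(style_tokens)                      # style-content prompt
    return f"a {styles} style of a {content_token} person"
```

In the actual method, each placeholder position would be replaced by a pseudo-word token vector produced by the corresponding inversion network before CLIP's text encoder is applied.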
Ablation Study on GFLFM. We analyze the impact of
the global feature-based local feature filtering strategy under
different designs on generalization performance. We disable
the filtering strategy by setting k = n. The results in
Tab. VI show that introducing GFLFM outperforms the setting
without the filtering strategy (w/o filter module) on all
evaluation metrics, which validates the effectiveness of our
strategy. In addition, to further determine the optimal number
of effective local style features k, we vary k from 4 to 12 for
a sensitivity analysis. We observe that model performance
first improves and then decreases as k increases, reaching peak
performance at k = 6, with Rank-1 and mAP
accuracies of 57.7% and 32.6%, respectively. It is reasonable
that an appropriate increase in the number of potential style
features provides richer information. However, too many local
style features may introduce redundant visual information,
leading to overfitting and increased computational costs.
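The filtering step can be sketched as a top-k selection of local style features scored against the global feature. Cosine similarity is our assumed scoring function for illustration; the paper's exact criterion is the one defined in its Eq. (4).

```python
def filter_local_features(local_feats, global_feat, k=6):
    """Keep the k local features most aligned with the global feature.

    Scoring by cosine similarity is an assumption made for this sketch."""
    def cosine(a, b):
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return num / (na * nb)
    # sorted() is stable, so tied features keep their original order
    ranked = sorted(local_feats, key=lambda f: cosine(f, global_feat),
                    reverse=True)
    return ranked[:k]
```

Setting k equal to the total number of local features n (24 here) recovers the "w/o filter module" row of Tab. VI, since no feature is discarded.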
Qualitative Visualization of PDM. To visually understand
and validate the effectiveness of our proposed PDM method,
we conduct a qualitative analysis. Fig. 6 presents the distribution
of distances between 10,000 randomly selected pairs
of cross-domain samples from the real and virtual domains,
before and after applying PDM. The results indicate that
the distances between cross-domain samples are significantly
reduced and more concentrated after disentanglement, which
confirms our hypothesis about prompt-driven style disentanglement.
This indicates that PDM successfully weakens
the effect of domain style differences and improves feature
consistency, helping the model learn more domain-invariant
content representations.
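The distance statistics behind Fig. 6 can be reproduced in spirit with a simple pair-sampling routine. Euclidean distance and the helper name are our assumptions; the paper does not state which metric the figure uses.

```python
import math
import random

def sampled_cross_domain_distances(feats_a, feats_b, n_pairs=10000, seed=0):
    """Distances between randomly paired feature vectors from two domains."""
    rng = random.Random(seed)
    return [math.dist(rng.choice(feats_a), rng.choice(feats_b))
            for _ in range(n_pairs)]
```

Histogramming the returned list for features extracted before and after PDM would reproduce the leftward, more concentrated shift that Fig. 6 shows.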
V. CONCLUSION
In this paper, we propose an innovative virtual dataset
construction pipeline that employs a text-to-image diffusion
model to directly synthesize virtual samples using generated
prompts that incorporate multiple pedestrian attributes. This
pipeline not only streamlines the data construction process
but also improves the diversity and quality of generated data.
In addition, we propose a simple yet effective generalization
strategy, PDM. PDM utilizes the aligned multimodal latent
space provided by CLIP and mitigates the impact of style
differences between virtual and real-world images by guiding
the model to focus on domain-invariant content information
through prompts. Extensive experiments validate the superiority
of GenePerson and the effectiveness of PDM. Looking
ahead, we will extend our dataset and method to handle
more challenging tasks such as pose estimation, body part
segmentation, and cross-modal retrieval.
REFERENCES
[1] X. Tan, X. Gong, and Y. Xiang, “Clip-based camera-agnostic feature learning for intra-camera supervised person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 5, pp. 4100–4115, 2025.
[2] X. Liu, J. Guo, H. Chen, Q. Miao, Y. Xi, and R. Liu, “Adaptive occlusion-aware network for occluded person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 35, no. 5, pp. 5067–5077, 2025.
[3] Z. Pang, C. Wang, L. Zhao, Y. Liu, and G. Sharma, “Cross-modality hierarchical clustering and refinement for unsupervised visible-infrared person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 4, pp. 2706–2718, 2024.
[4] C. Peng, B. Wang, D. Liu, N. Wang, R. Hu, and X. Gao, “Mrlreid: Unconstrained cross-resolution person re-identification with multi-task resolution learning,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 10, pp. 10050–10062, 2024.
[5] M. Liu, Y. Bian, Q. Liu, X. Wang, and Y. Wang, “Weakly supervised tracklet association learning with video labels for person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 5, pp. 3595–3607, 2024.
[6] M. Ye, W. Shen, J. Zhang, Y. Yang, and B. Du, “Securereid: Privacy-preserving anonymization for person re-identification,” IEEE Trans. Inf. Forensics Secur., vol. 19, pp. 2840–2853, 2024.
[7] Y. Tang, M. Liu, B. Li, Y. Wang, and W. Ouyang, “Nas-ped: Neural architecture search for pedestrian detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 47, no. 3, pp. 1800–1817, 2025.
[8] Z. Huang, S. Yang, M. Zhou, Z. Li, Z. Gong, and Y. Chen, “Feature map distillation of thin nets for low-resolution object recognition,” IEEE Trans. Image Process., vol. 31, pp. 1364–1379, 2022.
[9] Z. Gao, P. Chen, T. Zhuo, M. Liu, L. Zhu, M. Wang, and S. Chen, “A semantic perception and cnn-transformer hybrid network for occluded person re-identification,” IEEE Trans. Circuits Syst. Video Technol., vol. 34, no. 4, pp. 2010–2025, 2024.
[10] W. Chen, X. Xu, J. Jia, H. Luo, Y. Wang, F. Wang, R. Jin, and X. Sun, “Beyond appearance: A semantic controllable self-supervised learning framework for human-centric visual tasks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 15050–15061.
[11] M. Liu, F. Wang, X. Wang, Y. Wang, and A. K. Roy-Chowdhury, “A two-stage noise-tolerant paradigm for label corrupted person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 46, no. 7, pp. 4944–4956, 2024.
[12] H. Pan, Q. Liu, Y. Chen, Y. He, Y. Zheng, F. Zheng, and Z. He, “Pose-aided video-based person re-identification via recurrent graph convolutional network,” IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 12, pp. 7183–7196, 2023.
[13] Z. Yang, D. Wu, C. Wu, Z. Lin, J. Gu, and W. Wang, “A pedestrian is worth one prompt: Towards language guidance person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2024, pp. 17343–17353.
[14] Y. Bian, M. Liu, X. Wang, Y. Tang, and Y. Wang, “Occlusion-aware feature recover model for occluded person re-identification,” IEEE Trans. Multimedia, vol. 26, pp. 5284–5295, 2024.
[15] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in Int. Conf. Comput. Vis., 2017.
[16] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer gan to bridge domain gap for person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2018, pp. 79–88.
[17] M. Maximov, I. Elezi, and L. Leal-Taixé, “Ciagan: Conditional identity anonymization generative adversarial networks,” in IEEE Conf. Comput. Vis. Pattern Recog., 2020, pp. 5446–5455.
[18] J.-W. Chen, L.-J. Chen, C.-M. Yu, and C.-S. Lu, “Perceptual indistinguishability-net (pi-net): Facial image obfuscation with manipulable semantics,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 6474–6483.
[19] J. Dietlmeier, F. Hu, F. Ryan, N. E. O’Connor, and K. McGuinness, “Improving person re-identification with temporal constraints,” in IEEE/CVF Winter Conf. Appl. Comput. Vis., 2022, pp. 540–549.
[20] J. Zhang, M. Ye, and Y. Yang, “Learnable privacy-preserving anonymization for pedestrian images,” in ACM Int. Conf. Multimedia, 2022, pp. 7300–7308.
[21] K. Kansal, Y. Wong, and M. Kankanhalli, “Privacy-enhancing person re-identification framework – a dual-stage approach,” in IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 8528–8537.
[22] W. Zhuang, Y. Wen, X. Zhang, X. Gan, D. Yin, D. Zhou, S. Zhang, and S. Yi, “Performance optimization of federated person re-identification via benchmark analysis,” in ACM Int. Conf. Multimedia, 2020, pp. 955–963.
[23] W. Zhuang, Y. Wen, and S. Zhang, “Joint optimization in edge-cloud continuum for federated unsupervised person re-identification,” in ACM Int. Conf. Multimedia, 2021, pp. 433–441.
[24] G. Wu and S. Gong, “Decentralised learning from independent multi-domain labels for person re-identification,” in AAAI Conf. Artif. Intell., vol. 35, no. 4, 2021, pp. 2898–2906.
[25] W. Ma, X. Wu, S. Zhao, T. Zhou, D. Guo, L. Gu, Z. Cai, and M. Wang, “Fedsh: Towards privacy-preserving text-based person re-identification,” IEEE Trans. Multimedia, vol. 26, pp. 5065–5077, 2024.
[26] I. B. Barbosa, M. Cristani, B. Caputo, A. Rognhaugen, and T. Theoharis, “Looking beyond appearances: Synthetic training data for deep cnns in re-identification,” Comput. Vis. Image Underst., vol. 167, pp. 50–62, 2018.
[27] S. Bak, P. Carr, and J.-F. Lalonde, “Domain adaptation through synthesis for unsupervised person re-identification,” in Eur. Conf. Comput. Vis., 2018, pp. 193–209.
[28] X. Sun and L. Zheng, “Dissecting person re-identification from the viewpoint of viewpoint,” in IEEE Conf. Comput. Vis. Pattern Recog., 2019, pp. 608–617.
[29] S. Xiang, G. You, L. Li, M. Guan, T. Liu, D. Qian, and Y. Fu, “Rethinking illumination for person re-identification: A unified view,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 4731–4739.
[30] S. Xiang, D. Qian, M. Guan, B. Yan, T. Liu, Y. Fu, and G. You, “Less is more: Learning from synthetic data with fine-grained attributes for person re-identification,” ACM Trans. Multimedia Comput. Commun. Appl., vol. 19, no. 5s, pp. 1–20, 2023.
[31] Y. Wang, S. Liao, and L. Shao, “Surpassing real-world source training data: Random 3d characters for generalizable person re-identification,” in ACM Int. Conf. Multimedia, 2020, pp. 3422–3430.
[32] T. Zhang, L. Xie, L. Wei, Z. Zhuang, Y. Zhang, B. Li, and Q. Tian, “Unrealperson: An adaptive pipeline towards costless person re-identification,” in IEEE Conf. Comput. Vis. Pattern Recog., 2021, pp. 11506–11515.
[33] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” in Adv. Neural Inform. Process. Syst., vol. 32, 2019.
[34] B. Zhao, K. R. Mopuri, and H. Bilen, “idlg: Improved deep leakage from gradients,” arXiv preprint arXiv:2001.02610, 2020.
[35] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 10684–10695.
[36] S. Bose, A. Jha, E. Fini, M. Singha, E. Ricci, and B. Banerjee, “Stylip: Multi-scale style-conditioned prompt learning for clip-based domain generalization,” in IEEE/CVF Winter Conf. Appl. Comput. Vis., 2024, pp. 5542–5552.
[37] J. Cho, G. Nam, S. Kim, H. Yang, and S. Kwak, “Promptstyler: Prompt-driven style generation for source-free domain generalization,” in Int. Conf. Comput. Vis., 2023, pp. 15702–15712.
[38] G. Kwon and J. C. Ye, “Clipstyler: Image style transfer with a single text condition,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022, pp. 18041–18050.
[39] M. Fahes, T.-H. Vu, A. Bursuc, P. Pérez, and R. de Charette, “Poda: Prompt-driven zero-shot domain adaptation,” in Int. Conf. Comput. Vis., 2023, pp. 18623–18633.
[40] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” 2022.
[41] W. Li, Z. Zhang, X. Lan, and D. Jiang, “Transferable adversarial face attack with text controlled attribute,” AAAI Conf. Artif. Intell., vol. 39, no. 5, pp. 4977–4985, 2025.
[42] Y. Zheng, B. Zhong, Q. Liang, S. Zhang, G. Li, X. Li, and R. Ji, “Towards universal modal tracking with online dense temporal token learning,” IEEE Trans. Pattern Anal. Mach. Intell., 2025.
[43] Y. Zheng, B. Zhong, Q. Liang, Z. Mo, S. Zhang, and X. Li, “Odtrack: Online dense temporal token learning for visual tracking,” in AAAI Conf. Artif. Intell., vol. 38, no. 7, 2024, pp. 7588–7596.
[44] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Int. Conf. Mach. Learn., vol. 139, 2021, pp. 8748–8763.
[45] Z. Zheng, L. Zheng, and Y. Yang, “Unlabeled samples generated by gan improve the person re-identification baseline in vitro,” in Int. Conf. Comput. Vis., 2017.
[46] M. Goddard, “The eu general data protection regulation (gdpr): European regulation that has a global impact,” Int. J. Market Res., vol. 59, no. 6, pp. 703–705, 2017.
[47] S. Bhaimia, “The general data protection regulation: the next generation of eu data protection,” Leg. Inf. Manag., vol. 18, no. 1, pp. 21–28, 2018.
[48] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang, “Deep domain-adversarial image generation for domain generalisation,” in AAAI Conf. Artif. Intell., vol. 34, no. 07, 2020, pp. 13025–13032.
[49] K. Zhou, Y. Yang, Y. Qiao, and T. Xiang, “Domain generalization with mixstyle,” in Int. Conf. Learn. Represent., 2021.
[50] K. Zhou, Y. Yang, T. Hospedales, and T. Xiang, “Learning to generate novel domains for domain generalization,” in Eur. Conf. Comput. Vis., 2020, pp. 561–578.
[51] J. Kang, S. Lee, N. Kim, and S. Kwak, “Style neophile: Constantly seeking novel styles for domain generalization,” in IEEE Conf. Comput. Vis. Pattern Recog., 2022.
[52] X. Yu, S. Yoo, and Y. Lin, “Clipceil: Domain generalization through clip via channel refinement and image-text alignment,” in Adv. Neural Inform. Process. Syst., vol. 37, 2024, pp. 4267–4294.
[53] V. Vidit, M. Engilberge, and M. Salzmann, “Clip the gap: A single domain generalization approach for object detection,” in IEEE Conf. Comput. Vis. Pattern Recog., 2023, pp. 3219–3229.
[54] J. Cha, K. Lee, S. Park, and S. Chun, “Domain generalization by mutual-information regularization with pre-trained models,” in Eur. Conf. Comput. Vis., 2022, pp. 440–457.
[55] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, “Styleclip: Text-driven manipulation of stylegan imagery,” in Int. Conf. Comput. Vis., 2021, pp. 2065–2074.
[56] L. Dunlap, C. Mohri, D. Guillory, H. Zhang, T. Darrell, J. E. Gonzalez, A. Raghunathan, and A. Rohrbach, “Using language to extend to unseen domains,” in Int. Conf. Learn. Represent., 2023.
[57] R. Gal, O. Patashnik, H. Maron, A. H. Bermano, G. Chechik, and D. Cohen-Or, “Stylegan-nada: Clip-guided domain adaptation of image generators,” ACM Trans. Graph., vol. 41, no. 4, 2022.
[58] H. Yang, J. Jeong, and K.-J. Yoon, “Prompt-driven contrastive learning for transferable adversarial attacks,” in Eur. Conf. Comput. Vis., 2025, pp. 36–53.
[59] Z. Zhang, X. Yuan, L. Zhu, J. Song, and L. Nie, “Badcm: Invisible backdoor attack against cross-modal learning,” IEEE Trans. Image Process., vol. 33, pp. 2558–2571, 2024.
[60] T. Wang, F. Li, L. Zhu, J. Li, Z. Zhang, and H. T. Shen, “Cross-modal retrieval: A systematic review of methods and future directions,” Proc. IEEE, vol. 112, no. 11, pp. 1716–1754, 2024.
[61] R. Zhu, S. Xu, P. Liu, J. Liu, Y. Lu, D. Niu, H. Zheng, Y.-K. Chen, M. Jing, and Y. Fan, “A flexible zero-shot approach to tone mapping via structure-preserving diffusion models,” IEEE Trans. Circuits Syst. Video Technol., pp. 1–1, 2025.
[62] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models,” Int. Conf. Learn. Represent., vol. 1, no. 2, p. 3, 2022.
[63] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Int. Conf. Comput. Vis., 2023, pp. 3836–3847.
[64] G. Lin, Z. Bao, Z. Huang, Z. Li, W.-s. Zheng, and Y. Chen, “A multi-level relation-aware transformer model for occluded person re-identification,” Neural Networks, vol. 177, p. 106382, 2024.
[65] R. Zhu, Z. Tu, J. Liu, A. C. Bovik, and Y. Fan, “Mwformer: Multi-weather image restoration using degradation-aware transformers,” IEEE Trans. Image Process., vol. 33, pp. 6790–6805, 2024.
[66] L. Zheng, Y. Zhao, S. Wang, J. Wang, and Q. Tian, “Good practice in cnn feature transfer,” arXiv preprint arXiv:1604.00133, 2016.
[67] L. He, X. Liao, W. Liu, X. Liu, P. Cheng, and T. Mei, “Fastreid: A pytorch toolbox for general instance re-identification,” in ACM Int. Conf. Multimedia, 2023, pp. 9664–9667.
[68] S. Li, L. Sun, and Q. Li, “Clip-reid: Exploiting vision-language model for image re-identification without concrete text labels,” AAAI Conf. Artif. Intell., vol. 37, no. 1, pp. 1405–1413, 2023.
[69] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in Int. Conf. Comput. Vis., 2015, pp. 1116–1124.
[70] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 1, pp. 172–186, 2021.
[71] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 8, no. 6, pp. 679–698, 1986.
[72] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. R. Zaiane, and M. Jagersand, “U2-net: Going deeper with nested u-structure for salient object detection,” Pattern Recognit., vol. 106, p. 107404, 2020.