Linguistic Profiling of Deepfakes
Anonymous Author(s)
Affiliation
Address
email
Abstract
1 Introduction
The field of deepfakes has experienced a significant shift in recent years, moving beyond traditional generative models focused on realistic images and videos to the emergence of text-to-image (T2I) generative models. These new models have unlocked new possibilities for creating highly realistic and convincing deepfakes. In particular, recent T2I generative models like DALL·E [36], Imagen [41] and Stable Diffusion [39] have revolutionized deepfakes by enabling the direct synthesis of images from textual descriptions or prompts. However, the proliferation of T2I generative models raises serious ethical concerns. The potential for misuse and the rapid dissemination of misleading or false information further intensify concerns about the authenticity and trustworthiness of visual content.

Although many datasets and methods (e.g., [40, 13, 56, 29, 50, 34, 25, 51, 46]) exist for deepfake detection, they suffer from significant limitations. Firstly, these methods typically focus solely on binary classification, i.e., distinguishing between real and deepfake images, without providing any explanatory information.
Figure 1: DFLIP-3K enables linguistic profiling of deepfakes, including the assessment of authenticity (deepfake detection), the identification of source model (deepfake identification), and the prediction of source prompt (prompt prediction). [Figure: a linguistic profiling tree routes an input image through detection (deepfake image vs. real image), identification of the deepfake model, and prediction of the deepfake prompt. Next-generation deepfake detection keys: Detection: "It's an AI generated image."; Identification: "The source generative model is DALL·E."; Prediction: "The source textual prompt is 'a panda playing a piano in space'."]

Table 1: A comparison of DFLIP-3K against the early and recent deepfake datasets. It collects deepfakes from about 3K generative models, representing the largest scale in terms of the number of deepfake models.

Dataset               | # Generative Models
--- Early Datasets ---
FaceForensic++ [40]   | 4
DFDC [13]             | 8
WildDeepfake [56]     | Unknown
CDDB [29]             | 11
--- Recent Datasets ---
DiffusionDB [50]      | 1
SAC [34]              | 4
Pick-a-Pic [25]       | 2
HP [51]               | 1
MidJourney(250K) [46] | Unknown
DFLIP-3K (Ours)       | ≈ 3K
However, in practice, countering forgery requires comprehensive evidence and explanations to effectively challenge fake content, going beyond a simple binary judgment. Secondly, as shown in Table 1, individual datasets are often generated by a limited number of generative models, resulting in a lack of representation for the extensive diversity of existing deepfakes.
In response to this pressing issue, this paper focuses on the development of convincing and explainable deepfake detection, which serves as a foundational step towards next-generation deepfake detection. We posit that the key lies not only in binary deepfake detection but, more importantly, in dissecting the results and presenting them in a manner understandable to humans. Hence, we place significant emphasis on the importance of this new and challenging task, which we term linguistic profiling of deepfake detection. As shown in Fig. 1, the task of linguistic profiling of deepfake detection can be further decomposed into three sub-tasks, namely the detection of deepfakes, the identification of deepfake source models, and the prediction of textual prompts that are used for text-to-image generation. To facilitate this emerging study, we curate an open dataset (DFLIP-3K), which standardizes resources for analyzing the linguistic characteristics of deepfake contents, and provides a benchmark for evaluating and comparing new approaches for linguistic profiling of deepfake detection.
In particular, the established DFLIP-3K database encompasses approximately 300K deepfake samples produced from about 3K generative models, which is the largest scale in the literature of deepfake detection. In addition, we gather about 190K textual prompts that are used to create images. The collected prompts allow for the exploitation of linguistic profiling in simultaneous deepfake detection, identification, and prompt prediction. Apart from curating this database, we thoroughly examine the ethical considerations and potential flaws associated with large-scale data collection. By making DFLIP-3K publicly available, we provide the community with the initial opportunity to further enhance a database of such scope and magnitude.
To assess the potential value of DFLIP-3K, we establish a benchmark for linguistic profiling in simultaneous deepfake detection, identification, and prompt prediction. Based on this benchmark, we conduct several experiments. Our evaluation focuses on state-of-the-art vision-based and vision-language models adapted for deepfake profiling. The results reveal that these vision-language models outperform traditional vision models in detecting and identifying deepfakes. Unlike vision-based models, our suggested vision-language models have the ability to generate either regular image captions or textual prompts in specific formats for image generation. The evaluations demonstrate that training the vision-language models with our collected prompts leads to more sensible prompt prediction, resulting in reconstructed images that closely resemble the input deepfake images in terms of perceptual, semantic, and aesthetic similarities. Furthermore, we also show that the suggested deepfake profiling facilitates the reliable interpretability of the generated content.
Despite the validation results, DFLIP-3K is not considered a finalized data product at this stage. Given the continuous emergence of generative models, the comprehensive curation of DFLIP-3K for widespread usage extends beyond the scope of a single research paper. Therefore, in addition to releasing the dataset, we are also sharing the software stack that we developed for assembling DFLIP-3K. We consider this initial data release and accompanying paper as a first step towards creating a widely applicable deepfake dataset suitable for a broader range of linguistic profiling tasks. Consequently, we are making all collection methods, metadata, and deepfake data within our database openly accessible to fully unleash its potential for next-generation deepfake detection.
2 Related Works
Deepfake Datasets with Generative Models. Deepfake datasets are vital for advancing development in the field of deepfake detection, manipulation, and understanding. The progress in generating deepfake datasets is closely tied to the advancements in generative models. As generative models have evolved significantly, they have enabled the creation of more realistic and diverse deepfakes. Traditional generative models, such as generative adversarial nets (GANs) [16, 10, 23] and variational autoencoders (VAEs) [44, 19], have played a significant role in creating early deepfake datasets like FaceForensic++ [40], DFDC [13], WildDeepfake [56] and CDDB [29], to name a few. However, as the field evolved, the limitations of traditional generative models became apparent. While they achieved some success in generating realistic deepfakes, they lacked fine-grained control and interpretability. This led researchers to explore new approaches for deepfake dataset creation. Text-to-image (T2I) generative models have emerged as a promising solution, allowing for the generation of deepfake datasets based on textual descriptions. Current state-of-the-art generative models for text-guided image synthesis include scaled-up GANs like GigaGAN [22], autoregressive models such as DALL·E [37] and Parti [52], and diffusion models like DALL·E 2 [36] and Stable Diffusion [39]. Their integration of textual information enables researchers to generate new deepfake datasets (e.g., DiffusionDB [50], SAC [34], Pick-a-Pic [25], HP [51], MidJourney(250K) [46]) that align with specific prompts, facilitating high-degree control and excellent interpretability. These datasets are enumerated in Table 1, and their additional details are provided in the supplementary material.
It is noteworthy that these datasets mainly focus on a limited number of specific models such as Stable Diffusion and MidJourney. Nonetheless, the field is experiencing a rapid emergence of numerous other generative models. To fill this gap, our DFLIP-3K database encompasses deepfake images generated by a significantly larger volume (about 3K) of T2I generative models. This provides a comprehensive resource that best reflects the current landscape of T2I deepfake technologies.
Deepfake Detection and Identification. Previous studies on deepfake detection have predominantly utilized deep neural networks, such as Residual Networks (ResNet) [47], Vision Transformers (ViT) [49], and Contrastive Language-Image Pretraining (CLIP) [48], as the backbone for training binary classifiers. Two recent works [17, 38] propose to detect and recognize both GAN-based and recently emerged diffusion-based deepfakes. Their experimental results demonstrate that detecting deepfakes generated by T2I models is more challenging compared to traditional GAN-based deepfakes. However, their solutions mainly focus on Stable Diffusion and LDM [39].
These deepfake detection methods typically concentrate solely on binary classification, while lacking the exploration of evidential and explanatory information. However, combating forgery necessitates comprehensive evidence and explanations to effectively challenge fraudulent content, surpassing a mere binary judgment. Our DFLIP-3K database is proposed to bridge this gap, aiming for the development of convincing and explainable deepfake detection. In particular, this database allows for leveling up the traditional deepfake detection task to linguistic profiling of deepfakes, which is composed of three sub-tasks, i.e., deepfake detection, deepfake model identification, and prompt prediction. These three sub-tasks serve as crucial elements in dissecting and understanding deepfakes, providing valuable insights for effective detection and analysis.
Deepfake Prompt Prediction. Prompt prediction (a.k.a. prompt engineering) involves analyzing the text prompts that were used to generate deepfakes. Emerging prompt engineering tools like Lexica [43] allow users to explore textual prompts with various variations and find visually similar images from a database that match the input deepfake. However, the primary drawback of such tools is the requirement for users to be engaged in costly iteration. Prompt auto-completers [33] complement user-provided prompts with additional keywords derived from statistical distribution predictions to generate higher-quality deepfakes with desired styles. However, refining these keywords often requires expensive human intervention or assistance. Other approaches employ image captioning models such as CLIP [35] and BLIP [30] to generate prompts. However, merely describing image elements using these models does not guarantee visually appealing results.
Table 2: Statistics of DFLIP-3K, which consists of six big families of generative models. Personalized Diffusions include deepfakes from 3,434 known generative models and a set of unknown models.

Models                  | # Models         | # Prompts | # Images
Stable Diffusion [39]   | 1                | 15,000    | 15,000
Personalized Diffusion  | 3,434 (at least) | 120,978   | 280,753
DALL·E 2 [36]           | 1                | 31,315    | 33,705
MidJourney [20]         | Unknown          | 21,858    | 27,241
Parti [52]              | 1                | 0         | 195
Imagen [41]             | 1                | 0         | 224

Table 3: The proposed benchmark setting on DFLIP-3K. Digits outside parentheses are for deepfake detection, and the ones in parentheses are for the other two tasks. Personalized Diffusion (PD) is divided into PD1 (51 models) for in-distribution training/test, and PD2 (≈ 3K models) for out-of-distribution test.

Methods                 | Train Real | Train Fake      | Test Real | Test Fake
Stable Diffusion [39]   | 5,000      | 5,000 (15,000)  | 1,000     | 1,000
PD1                     | 10,000     | 10,000 (40,000) | 4,000     | 4,000
DALL·E 2 [36]           | 5,000      | 5,000 (30,961)  | 1,000     | 1,000
MidJourney [20]         | 5,000      | 5,000 (20,867)  | 1,000     | 1,000
Imagen [41]             | -          | -               | 223       | 223
Parti [52]              | -          | -               | 193       | 193
PD2                     | -          | -               | -         | 72,721
The development of our DFLIP-3K database, which collects textual prompts used for image generation, opens up new avenues for researchers to explore the realm of automatic prompt prediction systems. By profiling the prompts, we can better understand how to describe the style and improve image quality, and gain insights into the intentions, biases, or specific characteristics of the individuals or groups behind the creation of deepfakes.
3 DFLIP-3K Database

DFLIP-3K is constructed by scraping publicly available high-quality images. It encompasses deepfakes generated by prominent T2I models, namely Stable Diffusion [39], DALL·E 2 [36], MidJourney [20], Imagen [41], Parti [52], and a substantial number of personalized models based on Stable Diffusion. Table 2 presents an overview of the DFLIP-3K database.
Overall Dataset Generation Pipeline. The fundamental principles guiding our data collection efforts are three-fold: public availability, high-quality images, and reliable sources. To this end, our collection comprises four distinct parts: 1) website selection, 2) web scraping, 3) parsing and filtering, and 4) storing data. The code used for the whole collection pipeline will be openly available.
Website Selection. Our objective is to collect a diverse and high-quality database of synthetic data that accurately reflects the latest advancements in T2I deepfakes. To achieve this goal, we first select social media platforms such as Instagram and Twitter, as well as art-sharing websites and imageboards. We meticulously select source websites to ensure that all deepfakes included in our database are created by using recent T2I tools and exhibit high aesthetic quality. We also exclude any content that is illegal, hateful, or NSFW (not safe for work), such as sexual or violent imagery, to ensure that the dataset maintains ethical standards and aligns with responsible usage guidelines.
For DALL·E 2, we retrieve its public database [4], a website that enables users to search for images by text. It boasts over 30,000 images, although their quality is only fair given the immense quantity. In addition, we also scrape DALL·E generated images that were posted by OpenAI on their social media channels like Instagram and Twitter. These images are carefully selected and are of superior quality. For Imagen and Parti, their models are not publicly available, and thus we use their officially released data from their respective project pages [52, 41] as well as their social media channels. For MidJourney, we use the MidJourney User Prompts & Generated Images (250k) dataset [46], which crawls messages from the public MidJourney Discord servers. We sample a subset of images that are upscaled by users from this dataset. Furthermore, we scrape the official showcase gallery [7] to obtain more aesthetically pleasing images.
To obtain deepfakes generated by Stable Diffusion, we utilize a subset of the data released by DiffusionDB, which is a large-scale T2I prompt dataset. This dataset contains 14 million images with prompts and hyperparameters generated by Stable Diffusion (SD) from official Discord channels. However, the aesthetic quality of the images in this dataset is only fair. Hence, we additionally select Lexica [43], an art gallery for artwork created with Stable Diffusion, which provides more aesthetically pleasing images.
To avoid any confusion, we use the term 'Stable Diffusion' to refer exclusively to the official base models developed by StabilityAI. Custom models built on top of Stable Diffusion through personal data finetuning are referred to as 'Personalized Diffusions' throughout this paper. In order to enhance the quality of the collected synthetic images, we select several popular imageboards and galleries that specialize in AI-generated illustrations, such as Civitai [31], Aigodlike [2], AIBooru [1], ArtHub.ai [3], FindingArt [5], and MajinAI [6]. These platforms are dedicated to AI art enthusiasts and provide a means for sharing generative models and deepfake images. The images available on these platforms are sourced from both direct user uploads and web crawlers. Additionally, some images come with generation parameters, such as prompts, samplers, and other hyper-parameters.
Web Scraping. We exclusively select publicly available image-sharing platforms that do not require any login credentials. Accordingly, we develop web-scraping tools to extract a comprehensive collection of images. Our web-scraping tool primarily utilizes the BeautifulSoup Python library to parse the image download URLs and accessible parameters, such as prompts and models, from the returned HTML messages. To ensure compliance with SFW (safe for work) regulations and avoid downloading age-restricted content, we configure the web-scraping tool to crawl only safe content (where possible). We download all raw images from the parsed URLs and maintain their original formats, such as WEBP, PNG, JPG, etc. We publicly disclose all image URLs and sanitized metadata, as we do not claim ownership of the data.
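To make the scraping step concrete, the following is a minimal sketch (not the released tool itself) of how image URLs and any exposed prompt text can be parsed from a public gallery page with requests and BeautifulSoup; the example URL and the assumption that prompts appear in the image alt text are placeholders, and real sites need per-site parsers.

    # Minimal sketch of the scraping step: fetch one public gallery page and
    # collect (image URL, prompt text) pairs. URL and selectors are placeholders.
    import requests
    from bs4 import BeautifulSoup

    def scrape_gallery_page(page_url: str):
        """Return (image_url, prompt_text) pairs found on one public gallery page."""
        html = requests.get(page_url, timeout=30).text
        soup = BeautifulSoup(html, "html.parser")
        samples = []
        for img in soup.find_all("img"):
            src = img.get("src")
            # Simplifying assumption: the prompt (if exposed) sits in the alt text.
            prompt = img.get("alt", "").strip()
            if src and src.lower().endswith((".png", ".jpg", ".jpeg", ".webp")):
                samples.append((src, prompt))
        return samples

    if __name__ == "__main__":
        for url, prompt in scrape_gallery_page("https://example.com/ai-gallery"):
            print(url, "|", prompt[:80])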
Parsing & Filtering. To aggregate images from diverse sources, we archive them based on their generative models. Following processing and filtering practices similar to those of Common Crawl, we obtain over 300K samples. This section provides a guide on processing data into a clean data format.

Parsing data from the official releases of DALL·E, Parti, Imagen, and MidJourney is simple and straightforward, as these sources already identify their generative methods and adhere to strict regulations for publishing SFW content in standardized formats.

Our primary focus is on processing data from other imageboards and websites. After downloading all the collected deepfake images from various platforms, we apply publicly available NSFW detectors, such as GantMan [26] and LAION-SAFETY [28], to filter out potentially inappropriate images. It is worth noting that some NSFW images may not be detected, and we make efforts to periodically update DFLIP-3K to minimize such instances. To address the issue of duplicate images generated with the same prompt and hyper-parameters but using different random seeds, we use fastdup [9] to remove duplicate and similar images with the same prompts. Besides, we filter out any corrupted images that failed to download. To obtain the parameters of generated images, we employ two methods: scraping metadata from websites and parsing the original raw images. For the former, the standard format of the scraped data suffices. For the latter, we try to extract the raw information used to generate the images whenever possible. Currently, widely used open-source AI-generation tools such as Automatic1111-Stable-Diffusion-WebUI [11] automatically embed information about the hyperparameters utilized to generate images into PNG files. Therefore, it is feasible to directly obtain uploaded images in PNG format and extract the crucial parameters, including prompts, the name of the model used for generation, and the model hash.
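As a concrete illustration of the second method, the sketch below shows one way to read embedded generation parameters from such PNG files with Pillow; the assumption that the WebUI stores them under a "parameters" text key, and the settings-line layout parsed here, reflect common practice but are not guaranteed for every file.

    # Hedged sketch: recover prompt, negative prompt, model name and model hash
    # from a PNG saved by Automatic1111-Stable-Diffusion-WebUI, assuming the
    # metadata lives in a "parameters" text chunk (common but not universal).
    import re
    from PIL import Image

    def parse_a1111_png(path: str) -> dict:
        """Extract generation parameters embedded in the PNG, if present."""
        raw = Image.open(path).info.get("parameters", "")
        if not raw:
            return {}
        lines = raw.split("\n")
        record = {"prompt": lines[0].strip(), "negative_prompt": "", "settings": {}}
        for line in lines[1:]:
            if line.startswith("Negative prompt:"):
                record["negative_prompt"] = line[len("Negative prompt:"):].strip()
            else:
                # Settings line looks like "Steps: 20, Sampler: Euler a, Model hash: abc123, ..."
                for key, value in re.findall(r"([\w ]+):\s*([^,]+)", line):
                    record["settings"][key.strip()] = value.strip()
        record["model"] = record["settings"].get("Model", "unknown")
        record["model_hash"] = record["settings"].get("Model hash", "unknown")
        return record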
We utilize search engines, such as Google and CivitAI, to search for model information and download URLs. However, there exist a significant number of images with unknown model hashes, which are likely generated by personalized or merged models. Additionally, numerous images are generated using additional networks, such as embeddings (via textual inversion [15]), LoRA [21], and ControlNet [53], among others. It is challenging, and sometimes impossible, to extract the exact additional networks utilized, as individuals tend to withhold this information. However, these networks are primarily used to encode personalized characters or poses and have a relatively minor impact on image quality and style. Thus, we rely on base models as the primary source for generating images. Since some deepfakes contain watermarks or logos (e.g., DALL·E, Imagen, and Parti), we address this issue by cropping these images to eliminate potentially trivial methods for detection.
To construct a deepfake detection database, it is crucial to include high-quality real data as a counterpart to the deepfake samples. All real images utilized in DFLIP-3K are gathered from LAION-5B [42], which is a database containing 5 billion image-text pairs crawled from web pages between 2014 and 2021. We choose LAION-5B for two primary reasons.
Figure 2: DFLIP-3K examples generated by the six groups of generative models (DALL·E [36], SD [39], PD, MJ [20], Imagen [41], Parti [52]). More examples are presented in the Supplementary Material.

Figure 3: Word clouds of our collected textual prompts for all associated generative models. More word clouds are in the Supplementary Material.
Firstly, given the recent emergence of deepfake technologies, it is very unlikely that high-quality deepfakes were easily accessible on the internet before 2021. Secondly, this dataset has been used to train the Stable Diffusion model, and as such, we can leverage it as negative samples to construct a training set for deepfake detection.
We select a subset of deepfakes from our database and employ an image retrieval tool [32] provided by LAION-5B to search for high-quality and visually similar real images. We search for images that have an aesthetic score of over 8, and select the most similar image from LAION-5B.
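This retrieval step can be approximated with the public clip-retrieval client; the sketch below queries a hosted LAION-5B index for images visually similar to a deepfake while biasing towards high aesthetic scores. The service URL, index name, and keyword arguments follow the clip-retrieval project documentation as we recall it and should be treated as assumptions rather than the exact tool used here.

    # Hedged sketch: nearest-neighbour search over LAION-5B for real counterparts
    # of a deepfake image. Endpoint, index name and parameters are assumptions.
    from clip_retrieval.clip_client import ClipClient

    client = ClipClient(
        url="https://knn.laion.ai/knn-service",  # public demo backend (assumption)
        indice_name="laion5B-L-14",
        aesthetic_score=8,      # bias results towards aesthetic score >= 8
        aesthetic_weight=0.5,
        num_images=5,
    )

    # Query with a local deepfake image; each result carries a URL, caption and similarity.
    results = client.query(image="deepfake_sample.png")
    for r in results:
        print(r.get("url"), r.get("similarity"))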
Table 2 presents the statistics of our DFLIP-3K database. We collect over 300K images, and after cleaning and filtering, we obtain a dataset of 189,151 images with prompts, paired with over 3K models. For Personalized Diffusions, we gather 280,753 images from either 3,434 parsed models or a set of unknown models. For Stable Diffusion, we consider different versions, such as v1.4 and v1.5, as one model, as they only differ in the number of training iterations while using the same training dataset. Similarly, in Personalized Diffusions, we treat different versions of models with the same name as the same model, as their versions only have slight differences in style. It is worth noting that half of the models in Personalized Diffusions have fewer than 10 images. Regarding the number of models in the MidJourney subset, it is unknown to us since we are unsure of their model structures and whether they use the same model for each operation.
Prompt Analysis. After collecting the data, we analyze the prompts provided by users who utilize T2I models. We observe that these prompts differ significantly from usual natural language. Many users submit prompts containing comma-delimited phrases that impose desirable constraints. These instructions often include the elements that should be present in the generated images, and begin with words that describe the quality of the image, such as "Masterpiece" or "8K", followed by specific elements such as "cat", "dog", or "person". Furthermore, we find that users of DALL·E, MidJourney, and Stable Diffusion tend to use longer sentences that are closer to natural language, whereas many personalized diffusion models use pure words or tags to describe images. This phenomenon may be due to the fact that many fine-tuned models use image tag estimation tools, such as DeepDanbooru [24], to label the training images, while other models' captions are often more closely related to natural language annotations.
For personalized diffusion models, the textual prompts follow specific grammatical structures, i.e., the Stable-Diffusion-Webui [11] grammar. This grammar utilizes various symbols to manipulate the model's attention towards specific words or to implement certain controls. For instance, the use of parentheses "()" increases the model's focus on the enclosed words, while the use of brackets "[]" decreases it. In addition, there are negative prompts that instruct the model to avoid generating particular objects in the deepfake image. This is achieved by using the negative prompt for unconditional conditioning in the sampling process, instead of an empty string. Furthermore, additional networks, such as LoRA [21], can be incorporated into the base model by using the format "<lora:LoRA_Name>". To facilitate training, we offer a clean prompt format version that excludes symbols and special characters.
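A minimal sketch of such a cleaning step is shown below, assuming the WebUI conventions described above (attention parentheses/brackets, weight suffixes like ":1.2", and "<lora:...>" tags); it is illustrative rather than the exact script shipped with DFLIP-3K.

    # Illustrative cleaner for WebUI-style prompts: strips angle-bracket network
    # tags, explicit attention weights, and attention parentheses/brackets,
    # keeping only the plain comma-delimited text.
    import re

    def clean_webui_prompt(prompt: str) -> str:
        text = re.sub(r"<[^>]*>", " ", prompt)        # drop <lora:...>, <hypernet:...>, etc.
        text = re.sub(r":\s*\d+(\.\d+)?", " ", text)  # drop attention weights such as ":1.2"
        text = re.sub(r"[()\[\]{}]", " ", text)       # drop attention parentheses/brackets
        text = re.sub(r"\s*,[\s,]*", ", ", text)      # collapse comma runs and surrounding spaces
        return re.sub(r"\s+", " ", text).strip(" ,")

    print(clean_webui_prompt("(masterpiece:1.2), [simple background], 1girl, <lora:styleX:0.8>, 8k"))
    # -> "masterpiece, simple background, 1girl, 8k"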
To illustrate the commonly used words in our collected textual prompts, we conduct a frequency analysis, excluding special words and punctuation. The resulting word cloud (Figure 3) shows the top 200 most frequent words.
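A simple way to reproduce this kind of frequency analysis is sketched below; the stop-word list is a placeholder, and rendering the cloud itself is left to a dedicated package such as wordcloud.

    # Illustrative top-k word frequency count over cleaned prompts.
    import re
    from collections import Counter

    def top_words(prompts, k=200, stop_words=frozenset({"the", "a", "of", "and", "with"})):
        counts = Counter()
        for p in prompts:
            for w in re.findall(r"[a-zA-Z][a-zA-Z0-9'-]+", p.lower()):
                if w not in stop_words:
                    counts[w] += 1
        return counts.most_common(k)

    print(top_words(["masterpiece, 8k, detailed face", "RAW photo, best quality"], k=5))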
4 Benchmark Setup

Based on DFLIP-3K, we develop a benchmark to validate its efficacy as a standardized resource for evaluating methods for linguistic profiling of deepfake detection, which comprises three sub-tasks: 1) Deepfake Detection, 2) Deepfake Model Identification, and 3) Prompt Prediction (Fig. 1).
Deepfake detection is one of the fundamental sub-tasks, particularly in light of recent advances in large-scale T2I generative models that enable individuals to create high-quality deepfakes. As these deepfakes blur the distinction between reality and fantasy, differentiating them from non-AI-generated images becomes increasingly challenging. Moreover, it is crucial to identify which deepfake model is used to generate an image, as this can either serve as evidence for deepfake detection or aid model publishers in safeguarding their models. Nonetheless, identifying the origins of deepfakes remains a non-trivial task, given their high diversity and the prevalence of personalized models trained on private data. Furthermore, recent advancements in T2I models have led to the emergence of the prompt prediction problem. The predicted prompts form another piece of evidence for deepfake detection. Based on the resulting prompts, we can better understand and interpret the created content.
To effectively train deep learning models that can accomplish the three tasks at hand, it is imperative to perform additional data processing. This is primarily because some models have few images, rendering them inadequate for training purposes. Following our initial data collection, we conduct a thorough pre-screening process and present the resulting processed dataset in Table 3. Our aim is to ensure that all generative models included in the training set have a minimum of 100 images, with 50 images for testing. As a result, we have a total of 54 models for training, comprising 51 personalized diffusion models (PD1 in Table 3), as well as DALL·E, MidJourney, and Stable Diffusion. Further details on the selected models can be found in the supplementary material. In addition to the generative models, we also select 25,000 real images to serve as a separate category in the deepfake identification task. Finally, we categorize the remaining models of Personalized Diffusion (PD2 in Table 3), Parti, and Imagen for out-of-distribution testing.
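The pre-screening rule above can be expressed as a short filtering sketch; the record layout (image path, model name) is hypothetical and only illustrates the minimum-count criterion, not the released preprocessing script.

    # Sketch of the pre-screening rule: keep a generative model in the benchmark
    # only if it has at least 100 training images plus 50 test images.
    from collections import defaultdict

    MIN_TRAIN, MIN_TEST = 100, 50

    def select_benchmark_models(records):
        """records: iterable of (image_path, model_name) pairs (hypothetical layout)."""
        per_model = defaultdict(list)
        for path, model in records:
            per_model[model].append(path)
        selected = {}
        for model, paths in per_model.items():
            if len(paths) >= MIN_TRAIN + MIN_TEST:
                # Reserve the last MIN_TEST images for testing, the rest for training.
                selected[model] = {"train": paths[:-MIN_TEST], "test": paths[-MIN_TEST:]}
        return selected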
5 Experiments

In this section, we present our experimental setting along with the detailed training procedures. We report the performance of baseline methods and discuss limitations.
For deepfake detection and deepfake model identification, we select two state-of-the-art methods, CNNDet [47] and S-Prompts [48], as our baseline approaches. In particular, we follow them in using the vision-based networks ResNet-50 [18] and ViT-base-16 [14], as well as the vision-language network CLIP [35], as our baselines for these two tasks. Prompt prediction is a recently emerging task, for which no dedicated methods are available yet. Hence, we choose the state-of-the-art captioning method, BLIP [30], as our baseline, which only performs prompt prediction.
To accomplish all three sub-tasks of linguistic deepfake profiling, we suggest exploiting Flamingo [8], which leverages pre-trained vision-language models to accept images and texts as input and generate free-form texts as output. We discover that language models can be used to unify the three different sub-tasks in a simple and efficient manner. For instance, in the case of deepfake detection, we convert the real and fake labels in the dataset to a dialogue format, such as 'Question: Is this image generated by AI? Answer: This is an AI generated image by Stable Diffusion' or 'Answer: This is a real image.' Similarly, for deepfake identification, the answer to the above question can determine the deepfake source model. For prompt prediction, we design the image prompts as a question-answer format, such as 'Question: Give me prompts using Stable Diffusion to generate this image. Answer: I suggest using ChilloutMix and this prompt: 8k, RAW photo, best quality, masterpiece, realistic, photo-realistic, ultra-detailed, 1 girl, solo, beautiful detailed sky street, standing, nose blush, closed mouth, beautiful detailed eyes, short hair, white shirt, belly button, torn jeans.'
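A minimal sketch of this label-to-dialogue conversion is shown below; the record fields (is_fake, model_name, prompt, image_path) are hypothetical names for the metadata stored in DFLIP-3K, and the question/answer templates mirror the examples above.

    # Convert one DFLIP-3K record into the question-answer strings used for
    # fine-tuning on the three sub-tasks. Field names are hypothetical.
    def to_dialogue(record: dict) -> list[dict]:
        samples = []
        # Detection and identification share one question; the answer reveals both.
        if record["is_fake"]:
            detect_answer = f"This is an AI generated image by {record['model_name']}."
        else:
            detect_answer = "This is a real image."
        samples.append({
            "image": record["image_path"],
            "question": "Is this image generated by AI?",
            "answer": detect_answer,
        })
        # Prompt prediction only applies to deepfakes with a known source prompt.
        if record["is_fake"] and record.get("prompt"):
            samples.append({
                "image": record["image_path"],
                "question": "Give me prompts using Stable Diffusion to generate this image.",
                "answer": f"I suggest using {record['model_name']} and this prompt: {record['prompt']}",
            })
        return samples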
We use the OpenFlamingo-9B [12] implementation, which uses a CLIP ViT-Large vision encoder and a LLaMA-7B [45] language model trained on large multimodal datasets like Multimodal C4 [55] and LAION-2B [42]. We further fine-tune OpenFlamingo-9B on our DFLIP-3K database using the aforementioned question-answering data format.
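For reference, instantiating such a model with the open_flamingo package typically looks like the sketch below; the checkpoint paths and the cross-attention interval follow the public OpenFlamingo-9B release as we recall it and should be treated as assumptions, and the fine-tuning loop itself is omitted.

    # Hedged sketch of loading OpenFlamingo-9B before fine-tuning on the
    # DFLIP-3K question-answer data; argument values are assumptions based on
    # the public OpenFlamingo release.
    from open_flamingo import create_model_and_transforms

    model, image_processor, tokenizer = create_model_and_transforms(
        clip_vision_encoder_path="ViT-L-14",
        clip_vision_encoder_pretrained="openai",
        lang_encoder_path="decapoda-research/llama-7b-hf",  # LLaMA-7B weights (assumption)
        tokenizer_path="decapoda-research/llama-7b-hf",
        cross_attn_every_n_layers=4,                        # value used by the 9B model (assumption)
    )
    # Fine-tuning then interleaves each image with its "Question: ... Answer: ..."
    # text and minimizes the language-modeling loss on the answer tokens.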
[Figure: four example rows comparing BLIP-predicted prompts and images against Flamingo-predicted prompts and images; for the four examples, Flamingo identifies the source models as ChilloutMix, Deliberate, AbyssOrangeMix2, and DreamShaper.]

Figure 4: Visualization of Prompt Prediction Results. The first column displays the input reference images used for prompt prediction, while the second and third columns showcase the prompts predicted by BLIP and the corresponding images generated by Stable Diffusion v1.5, respectively. Similarly, the fourth and fifth columns exhibit the fine-tuned Flamingo-predicted prompts and the images generated by the models predicted by Flamingo, respectively. Compared with BLIP, Flamingo can jointly perform model identification and prompt prediction, and its predicted prompts interpret the image contents more faithfully (like 'cake' vs. 'tower'), providing more global styles (like 'high res') and more local attributes (like 'a cute cat'). Thanks to the better dissection of the two essentials (source model and prompt), Flamingo's resulting images look visually closer to the reference images.
Following [47, 29], we use average detection accuracy to measure the deepfake detection results. For deepfake model identification, we calculate multiclass accuracy, i.e., the accuracy with which an image is classified into the correct deepfake model or the real-image category. To evaluate the quality of predicted prompts, we select 1,000 images from the test set as reference images for prompt prediction. To assess the quality of predicted prompts and models serving as two essential pieces of evidence for deepfakes, we suggest measuring the similarity between each input deepfake and the reconstructed deepfake, which is generated by feeding the predicted prompt into the identified model. For BLIP, as it only outputs image captions, we use Stable Diffusion v1.5 to produce the reconstructed deepfakes. In contrast, since Flamingo can jointly identify the deepfake model, we use its predicted models and prompts to generate the reconstructed deepfakes. We adopt CLIP-Score [15], Learned Perceptual Image Patch Similarity (LPIPS) [54] and LAION-Aesthetic Score [27] to measure the semantic, perceptual, and aesthetic similarities, respectively. Please find training and testing details in the Supplementary Material.
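These similarity metrics can be computed with off-the-shelf packages; the sketch below pairs a reference deepfake with its reconstruction and reports LPIPS (via the lpips package) and a CLIP image-image cosine similarity (via open_clip). It is an illustrative setup, not the paper's exact evaluation code, and the aesthetic predictor is omitted.

    # Illustrative computation of perceptual (LPIPS) and semantic (CLIP image-image
    # cosine) similarity between a reference deepfake and its reconstruction.
    # Assumes both images share the same resolution (resize beforehand if needed).
    import torch
    import lpips
    import open_clip
    from PIL import Image

    lpips_fn = lpips.LPIPS(net="alex")
    clip_model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="openai")

    @torch.no_grad()
    def similarities(ref_path: str, rec_path: str) -> dict:
        # Perceptual similarity: lpips utilities load images into [-1, 1] tensors.
        ref_p = lpips.im2tensor(lpips.load_image(ref_path))
        rec_p = lpips.im2tensor(lpips.load_image(rec_path))
        perceptual = lpips_fn(ref_p, rec_p).item()         # lower = perceptually closer

        # Semantic similarity: cosine similarity between CLIP image embeddings.
        ref_c = preprocess(Image.open(ref_path).convert("RGB")).unsqueeze(0)
        rec_c = preprocess(Image.open(rec_path).convert("RGB")).unsqueeze(0)
        ref_emb = clip_model.encode_image(ref_c)
        rec_emb = clip_model.encode_image(rec_c)
        semantic = torch.cosine_similarity(ref_emb, rec_emb).item()  # higher = closer
        return {"lpips": perceptual, "clip_score": semantic}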
Table 5: Results of deepfake model identification.

Methods              | Accuracy (↑)
CNNDet-ResNet [47]   | 46.43
SPrompts-ViT [48]    | 51.74
SPrompts-CLIP [48]   | 53.10
Flamingo [8]         | 63.39

Table 6: Similarities between reference and reconstructed deepfakes by either identified (Flamingo) or default models (BLIP) over predicted prompts.

Methods       | CLIP (↑) | LPIPS (↑) | Aesthetic (↑)
BLIP [30]     | 0.65     | 0.67      | 6.30
Flamingo [8]  | 0.66     | 0.72      | 6.97
5.3 Evaluation Results
Deepfake Detection. Table 4 reports the results of deepfake detection. Some conclusions are as follows: (1) Different from the observation of [47] on GAN-based deepfakes, Gaussian blur and JPEG augmentation bring a clear performance degradation over the non-augmentation case in terms of detection accuracy. (2) Pre-trained vision-language models, such as CLIP and Flamingo, show superior performance in detecting out-of-distribution deepfakes compared to traditional vision models, such as ResNet and ViT. These models leverage both visual and textual information, enabling a more comprehensive analysis of the input data. The incorporation of multimodal information leads to clear advancements in deepfake detection.
Deepfake Identification. Table 5 reports the results of deepfake model identification. Our results demonstrate that pre-trained vision-language models, such as CLIP and Flamingo, outperform traditional pre-trained vision models in accurately identifying the deepfake model used to generate deepfakes. In particular, Flamingo exhibits superior performance in deepfake identification, possibly due to its massive pretraining data, its large number of network parameters, and its effective vision-language learning mechanism. Furthermore, the correctly identified deepfake models provide a substantial amount of evidence for convincing deepfake detection.
Prompt Prediction. Figure 4 visualizes the predicted prompts from BLIP and Flamingo. From the results, we find that Flamingo's predicted prompts describe the given reference images more faithfully. For example, Flamingo correctly understands the main content of the first reference image as 'a big cake', while BLIP misunderstands it as 'a very tall tower'. Moreover, the predicted prompts of Flamingo are capable of providing more global styles like 'ultra high res' and 'detailed background', as well as more local attributes like 'a cute cat' and 'beautiful young jaina'. The more faithful textual prompts contribute to a more reliable interpretation of deepfakes. In addition, we also show the reconstructed deepfakes produced by feeding the predicted prompts to either the predicted models (Flamingo) or the default model (BLIP). The results reflect that Flamingo's reconstructed deepfakes look clearly closer to their reference deepfakes. This shows that our suggested linguistic profiling mechanism over DFLIP-3K is able to provide the two types of trustworthy evidence for deepfake detection. In addition, Table 6 presents the corresponding quantitative evaluation. The results demonstrate that Flamingo's reconstructed deepfakes are more similar to the reference deepfakes in terms of semantic, perceptual, and aesthetic scores, aligning well with the visual comparison.
References
[1] AIBooru. https://aibooru.online/. Accessed June 7, 2023.
[4] DALL·E 2.0 Artistic Visual Gallery. https://dalle2.gallery/. Accessed June 7, 2023.
[8] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning. ArXiv, abs/2204.14198, 2022.
[9] Amir Alush, Dickson Neoh, Danny Bickson, et al. Fastdup. GitHub, https://github.com/visual-layer/fastdup, 2022.
[10] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
[12] Anas Awadalla, Irena Gao, Joshua Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, and Ludwig Schmidt. OpenFlamingo, March 2023.
[13] Brian Dolhansky, Joanna Bitton, Ben Pflaum, Jikuo Lu, Russ Howes, Menglin Wang, and Cristian Canton Ferrer. The DeepFake Detection Challenge (DFDC) dataset. arXiv preprint arXiv:2006.07397, 2020.
[14] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[15] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[17] Luca Guarnera, Oliver Giudice, and Sebastiano Battiato. Level up the deepfake detection: A method to effectively discriminate images generated by GAN architectures and diffusion models. arXiv preprint arXiv:2303.00608, 2023.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[19] Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning – ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part I, pages 44–51. Springer, 2011.
[20] David Holz. Midjourney: Exploring new mediums of thought and expanding the imaginative powers of the human species. 2022.
[21] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[22] Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[23] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[24] Kichang Kim. DeepDanbooru repository. https://github.com/KichangKim/DeepDanbooru. Accessed June 7, 2023.
[25] Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. arXiv preprint arXiv:2305.01569, 2023.
[26] Gant Laborde. Deep NN for NSFW detection.
[27] LAION. LAION Aesthetics V1. Technical Report Version 1.0, LAION AI, 2022. https://github.com/LAION-AI/aesthetic-predictor.
[28] LAION-AI. LAION-SAFETY repository. https://github.com/LAION-AI/LAION-SAFETY. Accessed June 7, 2023.
[29] Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and essentials. arXiv preprint arXiv:2205.05467, 2022.
[30] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, 2022.
[31] Justin Maier. Civitai: The AI art community's free, open-source model-sharing hub, 2023.
[32] Romain Montier. CLIP-Retrieval. https://rom1504.github.io/clip-retrieval. Accessed June 7, 2023.
[33] Jonas Oppenlaender. Prompt engineering for text-based generative art. arXiv preprint arXiv:2204.13988, 2022.
[34] John David Pressman, Katherine Crowson, and Simulacra Captions Contributors. Simulacra aesthetic captions. Technical Report Version 1.0, Stability AI, 2022. https://github.com/JD-P/simulacra-aesthetic-captions.
[35] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[36] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
[37] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[38] Jonas Ricker, Simon Damm, Thorsten Holz, and Asja Fischer. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571, 2022.
[39] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
[40] Andreas Rossler, Davide Cozzolino, Luisa Verdoliva, Christian Riess, Justus Thies, and Matthias Nießner. FaceForensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1–11, 2019.
[41] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
[42] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv preprint arXiv:2210.08402, 2022.
[43] Sharif Shameem. Lexica: Building a creative tool for the future, 2022.
[44] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
[45] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[46] Iulia Turc and Gaurav Nemade. Midjourney user prompts & generated images (250k), 2023.
[47] Sheng-Yu Wang, Oliver Wang, Richard Zhang, Andrew Owens, and Alexei A Efros. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8695–8704, 2020.
[48] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-Prompts learning with pre-trained transformers: An Occam's razor for domain incremental learning. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[49] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In CVPR, 2022.
[50] Zijie J Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models. arXiv preprint arXiv:2210.14896, 2022.
[51] Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. arXiv preprint arXiv:2303.14420, 2023.
[52] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, et al. Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789, 2022.
[53] Lvmin Zhang and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023.
[54] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
[55] Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
[56] Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, and Yu-Gang Jiang. WildDeepfake: A challenging real-world dataset for deepfake detection. In Proceedings of the 28th ACM International Conference on Multimedia, pages 2382–2390, 2020.