0% found this document useful (0 votes)
42 views9 pages

DreamTalk-DMT: A Lightweight Sparse Mechanism Model With Dynamic Thresholds

The paper presents DreamTalk-DMT, an optimized digital human synthesis model that enhances computational efficiency and expression generation by introducing a dynamic threshold sparsification mechanism and a decoupled decoder. This model utilizes adaptive sparsity and cross-modal feature enhancement to reduce computational load while preserving key features, improving synchronization between speech and expression. The proposed framework aims to address the limitations of existing models in real-time interaction scenarios, providing a new technical path for lightweight end-to-end speech synthesis.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
42 views9 pages

DreamTalk-DMT: A Lightweight Sparse Mechanism Model With Dynamic Thresholds

The paper presents DreamTalk-DMT, an optimized digital human synthesis model that enhances computational efficiency and expression generation by introducing a dynamic threshold sparsification mechanism and a decoupled decoder. This model utilizes adaptive sparsity and cross-modal feature enhancement to reduce computational load while preserving key features, improving synchronization between speech and expression. The proposed framework aims to address the limitations of existing models in real-time interaction scenarios, providing a new technical path for lightweight end-to-end speech synthesis.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

International Journal of Advanced Engineering Research and

Science (IJAERS)
Peer-Reviewed Journal
ISSN: 2349-6495(P) | 2456-1908(O)
Vol-12, Issue-5; May, 2025
Journal Home Page Available: https://ijaers.com/
Article DOI: https://dx.doi.org/10.22161/ijaers.125.3

DreamTalk-DMT: A Lightweight Sparse Mechanism


Model with Dynamic Thresholds
Jia Zhang1,*, Lin Po Shang2
1Department of Electronic Information Engineering, Electronic Information Science and Technology major, Guangdong University of
Petrochemical Technology, China
2Department of Energy and Power Engineering, Process Equipment and Control Engineering major, Guangdong University of

Petrochemical Technology, China


*Corresponding author:[email protected]

Received: 11 Apr 2025, Abstract— Aiming at the shortcomings of the DreamTalk 2D digital human
Receive in revised form: 09 May 2025, synthesis model in computational efficiency and expression generation
fineness, this paper proposes an optimization method combining adaptive
Accepted: 15 May 2025,
sparsity and cross-modal feature enhancement. By introducing a dynamic
Available online: 20 May 2025 threshold sparsity mechanism into the diffusion model, the sparsity ratio was
©2025 The Author(s). Published by AI dynamically adjusted based on the learnable threshold and Exponential
Publication. This is an open-access Moving Average (EMA), and the Mutual information Constraint (MI
article under the CC BY license Constraint) was combined to minimize the information loss, which reduced
the calculation amount of the model while retaining key features. The model
Keywords— Diffusion model; Dynamic
architecture is improved, and the decoupled decoder is designed to
threshold sparsification; Mutual information
decompose the facial expression into the upper and lower regions for
constrained optimization; Decoupled
independent processing. The dynamic linear layer is combined to realize
decoder; Cross-modal feature fusion module
parameter adaptation under the style condition, and the detail expression of
expression generation is improved. In addition, Tacotron speech features
and Wav2Vec acoustic features are fused to enhance the synchronization of
speech and expression, and skip connections are used to optimize the
information transmission efficiency.

I. INTRODUCTION StarGAN-VC model, which has attracted people's attention


From the perspective of technology evolution, digital because it can solve this problem using only a single
human synthesis technology has experienced a significant generator. However, there is still a gap between real and
transformation from traditional methods based on physical converted speech.[4] Diffusion Model has aroused a new
models to data-driven deep learning methods. Initially, upsurge of research in the field of digital human synthesis
DaViT regress 3DMM parameters from the input image to due to its theoretical completeness and generation quality
roughly scout the shape and texture of the face. Although advantages. Among them, DreamTalk model, as the
3DMM provides valuable information, its linear nature landmark achievement in the field of speech-driven
limits its realism.[1] Subsequently, an innovative approach expression synthesis, is an audio-driven framework based
developed by Buhari et al.[2] combined graph theory and on two-stage diffusion, which uses emotional conditional
FACS to extract useful features (68 landmark points) that diffusion model and lip refinement network[5] to improve
can distinguish between various microexpressions.[3] The facial emotional expression while maintaining high video
development of deep learning technology, methods based quality. DREAM-Talk represents a major leap forward in
on Generative Adversarial networks (GAN) have made the field of emotional conversational face generation,
breakthroughs in the field of image generation, such as the enabling the creation of realistic and emotionally engaged

www.ijaers.com Page | 12
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

digital human representations in a wide range of relationship between speech prosody and expression
applications.[5] dynamics, and the matching degree of expression and
Many scientific research institutions and enterprises speech emotion is not good.
continue to make efforts in digital generation technology To address the above technical challenges, this study
and other related fields. In the direction of expression proposes a dynamic threshold sparsification and
generation, VASA-1, a diffusion-based global facial decoupling generation framework based on information
dynamics and head motion generation model proposed by theory and dynamic system theory. By introducing the
Microsoft Research Asia, can not only generate lip learnable sparse threshold and Exponential Moving
movements perfectly synchronized with audio, but also Average (EMA) mechanism[10], combined with the
generate a large number of facial nuances and natural head mutual information loss function[11], the framework
movements, providing high video quality through realistic reduced the floating-point operation efficiency while
facial and head dynamics IC. Online generation of ensuring that the key information was not lost. The
512×512 videos at up to 40 FPS with negligible startup decoupled decoder was designed, the facial expression
latency is also supported.[6]; OTAvatar[7] proposed by space was divided into the upper and lower halves, and the
Ma et al., OTAvatar invert the portrait image into a dynamic linear layer was used to realize the adaptive
motion-free identity code, and then use the identity code adjustment of parameters to improve the naturalness of
and motion code to modulate an efficient CNN to generate expression. The gated fusion module of Tacotron acoustic
a three-plane formula volume. Finally, the image is features and Wav2Vec speech representation is
generated by volume rendering, and the identity and constructed, and the gradient transfer path is optimized by
motion in the latent code are decoupled by a novel anti- combining jump connection, which greatly improves the
phase decoupling strategy. The face image is constructed accuracy of speech-expression synchronization, and
based on generalized controllable three-plane rendering. In provides an innovative solution for the practical
addition, the Make-A-Video model[8] launched by Meta development of digital human technology.
AI tries to model the multi-modal generation of text-
speech-image in a unified way. Although it shows strong
II. DREAMTALK
potential in creative content generation, there are still
technical bottlenecks in the accurate synchronization of In the field of speech-driven expression synthesis of
voice and expression. digital human, DreamTalk model uses the diffusion
mechanism[5], uses the Transformer-based EmoDiff
At present, in the aspect of film and television special
network, and performs temporal denoising learning of 3D
effects, the application of digital human is more and more
expression under the conditions of audio, portrait and
widely, and the fidelity of image and motion has been
emotional style, and realizes the end-to-end generation of
improved. The continuous expansion of digital human
speech to expression. Excellent results are achieved on the
application scenarios to strong interaction fields such as
VoxCeleb dataset, which alleviates the mode collapse
real-time broadcast, virtual idol interaction, and intelligent
problem of traditional GAN. The diffusion mechanism
education, the limitations of existing technologies have
adopted by the method is derived from the denoising
become increasingly prominent. Aided by the diffusion
Diffusion Probability Model (DDPM)[12], which is based
model mechanism, the DreamTalk model represents a
on the Markov chain[13]. The data generation is realized
major leap forward in the field of emotional talk face
through the process of adding Gaussian noise forward and
generation, enabling the creation of realistic and
reverse iterative denoising. Compared with traditional
emotionally engaging digital human representations in a
generative models, diffusion models have more solid
wide range of applications[9]. However, with the
theoretical foundation and stronger conditional generation
expansion of application scenarios and the improvement of
ability, and have shown significant advantages in the field
requirements, its defects gradually appear. In terms of
of multimodal generation, which provides important
computational efficiency, the model parameters are dense,
technical support for models such as DreamTalk.
and in real-time interaction scenarios, the memory
denoising diffusion probabilistic model (DDPM) is a class
footprint is high and the reasoning time is long, which
of generative models based on probabilistic diffusion
seriously affects the interaction fluency. For expression
process. In recent years, remarkable progress has been
generation, it is difficult for a single decoder to accurately
made in the field of deep learning and generative models.
simulate the differentiated motion of the eyebrow, mouth
The core idea of diffusion model is to treat the process of
and other regions, and synthesize expression detail
data generation as a random process that gradually changes
distortion. In cross-modal fusion, the simple feature
from simple distribution (e.g., Gaussian distribution) to
concatenation method cannot deeply explore the complex

www.ijaers.com Page | 13
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

complex data distribution. Diffusion models usually (2)


include two processes: the Forward Diffusion Process and
the Reverse Process. However, both of them are a The noisy data at any time t can be obtained。
parameterized Markov chain in nature, which has
Reverse Diffusion Process is a process that gradually
stationary property. That is, if a probability changes with
recovers useful information from noisy data.
time, it will tend to a stationary distribution under the
action of the Markov chain, and the longer the time, the The goal is to gradually recover the distribution of the
more stable the distribution will be. It was this stationarity original data from the pure noise state (the final result of
that allowed him to gradually restore the image, given a the forward diffusion process). It is the opposite of the
neural network that predicted the noise. forward diffusion process and tries to learn how to remove
the noise added at each time step so as to recover the
original data.
The backward diffusion process takes advantage of the
fact that the way noise is added in the forward diffusion
process is known, and gradually restores the noisy data to
the original data by training a neural network to predict
how much noise should be subtracted at each step.In the
backward diffusion process, the neural network is
constructed to fit , and the original data is
gradually recovered from the noise, which can be
expressed as follows.
(3)

Where θ is the neural network parameter, and


are the mean and variance, respectively.
Fig.1 Diffusion Model generation process
The training process in the diffusion model is achieved
by optimizing the variational lower bound of the negative
The Forward Diffusion Process is a process that log-likelihood with . To simplify the training process,
continuously adds noise to the data to be trained. The the variance of the model is set to a constant and the
process usually starts from a simple distribution (e.g., coefficients of the loss function are removed, so the loss
Gaussian distribution, etc.), and through multiple rounds of function is:
small cardinality noise, the image data to be trained is
closer to a complex data distribution. Meanwhile, at each (4)
step, the model predicts the noise at the next step based on
the current data state and noise level, thus gradually
pushing the data into a high-dimensional and complex
distribution space.
In the forward process, given the initial data
distribution x0~q(x), the noise with standard deviation βt is
gradually added to the initial data according to the
schedule to obtain the noise data.

(1)

Where t represents the final time, as t continues to Fig.2 Diffusion Process


increase, the noise data gradually approaches the Gaussian
distribution.
III. IMPROVED MODEL ARCHITECTURE AND
However, the efficiency of stepwise iteration based on KEY TECHNOLOGIES
Equation (1) is very low, and the training process
Aiming at the technical bottlenecks of DreamTalk
consumes a lot of time. To improve the efficiency of
model in terms of computational efficiency, expression
computing, introducing the =1- , , type (1) generation accuracy and cross-modal fusion, this study
can be converted to: proposes a dynamic threshold sparsifation-decoupling

www.ijaers.com Page | 14
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

generation framework (DTS-DG). The framework realized facilitates the deployment in practical applications.
systematic optimization through four core modules. At the
level of efficiency optimization, the dynamic sparse
threshold and EMA dynamic adjustment mechanism were
used, and the mutual information loss function was
combined to reduce the amount of calculation while
ensuring the loss of information. In the cross-modal fusion
dimension, the gated fusion module of Tacotron[14] and
Wav2Vec features[15] is constructed, supplemented by
skip connection to optimize the gradient transfer path and
enhance the depth correlation between speech and
expression. In the aspect of expression generation, the
upper and lower half decoupling decoder is designed, and
the parameters are adaptively adjusted by the dynamic
linear layer, which significantly improves the accuracy of
expression detail description and emotion synchronization,
and provides a new solution for speech-driven digital
human synthesis technology.
Through the four-layer optimization system, the
improved model achieves a significant improvement in
computational efficiency and generalization ability while
maintaining the naturalness of speech synthesis, which
provides a new technical path for the lightweight of end-
to-end speech synthesis models.
3.1 ynamic threshold sparsification mechanism
When the original dreamtalk model deals with high-
dimensional features, there are problems such as large
consumption of computing resources and slow inference
speed. A large number of redundant parameters not only Fig.3 Model framework
increase the computational burden, but also may lead to
overfitting. In view of this, this study introduces a dynamic Under the key requirements of model computational
threshold sparsification mechanism, which dynamically efficiency optimization, the dynamic threshold
screens features based on a dynamic sparse mask. By sparsification mechanism becomes one of the core
setting a learnable threshold, the feature dimensions that innovations of this research. The mechanism aims to solve
contribute less to the model are automatically identified the problem that the traditional fixed sparsity method
and eliminated. cannot adapt to the dynamic changes of features in the
The computational efficiency of the improved model is process of model training. By introducing a learnable
significantly improved, and the number of parameters and threshold and combining with the Exponential Moving
reasoning time are reduced compared with the original Average (EMA) technology, the dynamic adjustment of
model, which effectively alleviates the bottleneck of the sparsity ratio of model parameters is realized, and the
computing resources. At the same time, the generalization calculation amount is reduced while the key information is
ability of the model is enhanced because the redundant retained to the maximum extent, ensuring that the model
information interference is reduced. In addition, the performance is not significantly affected.
dynamic threshold sparsification mechanism ensures that In the training process of the diffusion model, the data
the model can still maintain a high level of performance characteristics show a complex change trend with the
while being lightweight by retaining key features, which advancement of time steps. To effectively capture these
changes and adjust the sparsification strategy accordingly,
we design a dynamic threshold calculation method based
on learnable threshold and EMA. First, we define a
learnable threshold parameter θ, which is optimized
through backpropagation during model training. To map

www.ijaers.com Page | 15
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

the values of θ to a reasonable range, we use the sigmoid Through the above dynamic threshold sparsifying
function [16] to convert it to θ', i.e. mechanism, the model can dynamically adjust the sparsity
ratio of the parameters during the training process, and
(5)
flexibly balance the computational efficiency and model
The value range of θ' is limited to the interval of (0,1), performance under different training stages and data
which enables the threshold to be adjusted in a reasonable feature distributions. This mechanism not only effectively
dynamic range. reduces the computational burden of the model and
improves the inference speed, but also ensures the
At the same time, in order to track the dynamic
accuracy and stability of the model in tasks such as
changes of features, we introduce EMA[10] to calculate
expression generation by retaining key information. In
the mean value μt of the absolute values of features. EMA
practical applications, this mechanism enables the model
is a commonly used time series smoothing technique,
to maintain good performance under limited computing
which is able to dynamically update statistics based on
resources when dealing with large-scale data and complex
historical information and current data. In this study, μ t is
tasks.
calculated as follows.
3.2 Mutual information constrained optimization
(6)
mechanism
α is the smoothing coefficient of EMA, which is In the process of dynamic threshold sparsification, the
usually set to a value close to 1, and α=0.9 was taken in original dreamtalk model is easy to cause the loss of
this study. This means that the calculation of μ t is more feature information, which affects the model's ability to
dependent on the historical mean μt-1, but at the same time, capture key semantic and emotional information, and leads
it is also adjusted according to the expectation of to the decline of the accuracy and integrity of the
the absolute value of the feature at the current time. In this generated results.
way, μt can better reflect the overall trend of the absolute
In order to solve this problem, based on Mutual
value of the feature, and it is somewhat robust to sudden
Information[17] and Kullback-Leibler Divergence
outliers.
theory[18] in information theory, this study constructs a
Based on the computed θ' and μt, we generate the mutual information constrained optimization mechanism.
dynamic threshold θ'·μt and construct the sparse mask Mt Mutual information was proposed by Shannon in 1948 to
accordingly. For each element xt[i] in the feature vector xt, quantify the dependence between two random variables.
the element Mt[i] of the sparse mask Mt is generated KL divergence was defined by Kullback and Leibler in
according to the following rules: 1951 as a measure of how different two probability
distributions are.
(7)
The basic definition of KL divergence is as follows.
When |xt[i]| is greater than the dynamic threshold, the (8)
value of Mt[i] is 1, and the corresponding element is
retained in the sparsification process. Otherwise, Mt[i] is 0, Here, p(x) and q(x) represent two probability
and the corresponding element is set to zero, thus distributions, and the formula measures the difference
sparsifying the feature vector xt. This dynamic threshold between p(x) and q(x) by calculating the weighted sum of
setting allows the sparsification process to be dynamically log ratios over all values x.
adjusted according to the importance and distribution of The basic definition of mutual information is based on
features. The key features that have larger absolute values joint distribution and marginal distribution, which is
and contribute more to the model output are more likely to expressed as follows.
be retained; However, the relatively unimportant features (9)
are sparsified to reduce the amount of calculation.
That is, the mutual information is equal to the KL
During backpropagation, to ensure that the sparsified
divergence between the joint probability distribution
model can still learn effectively, we only perform gradient
p(X,Y) and the product p(X)p(Y) of the marginal
updates on the corresponding parameters with a value of 1
probability distributions, which reflects the amount of
in the sparse mask Mt. This not only ensures that the model
information shared between two random variables X and Y.
can continue to be optimized in the case of parameter
compression, but also avoids the invalid calculation of the In this study, the original feature distribution is denoted
sparsified (zeroed) parameters, which further improves the as q(xt), and the feature distribution under the action of
computational efficiency. sparse mask Mt is denoted as p(xt|Mt). Based on the above

www.ijaers.com Page | 16
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

theory, the mutual information loss function is constructed to solve the problem of speech and expression
as follows. synchronization. The Tacotron model is used to extract the
512-dimensional speech feature containing prosodic
and semantic information, while the Wav2Vec model is
(10) used to obtain the 1024-dimensional feature focusing
This formula quantifies the information loss during on acoustic details, providing multi-dimensional speech
dynamic threshold sparsification by calculating the KL information for fusion.
divergence of the feature distribution before and after
The module adopts the gating mechanism to realize
sparsification. In the actual calculation, because it is
feature fusion, and learns from the idea that the LSTM
difficult to estimate the probability distribution directly,
gating unit[20] controls the information flow through the
the feature mean and variance are used to approximate the
Sigmoid function ( ). Firstly, the two features
distribution. In the training process, is incorporated
into the total loss function, and the model parameters and are concatenated and linear transformed, and then the
sparse threshold are optimized through back propagation, gating signal g is generated by the Sigmoid function:
which effectively retains key information while reducing . Based on this, the fusion
the amount of calculation and maintaining the performance feature is obtained by
of the model. weighted summation, so that the model can adaptively
adjust the feature weight according to the speech
After introducing this mechanism, the model performs
characteristics. In addition, the jump connection
well in information retention, the retention rate of key
of ResNet is introduced to ensure the
features is improved, and the performance degradation
caused by information loss is effectively avoided. At the effective transmission of key information, improve the
same time, the mutual information constraint optimization expression ability of the network, and realize the deep
mechanism makes the model more accurately balance the correlation between speech features and expression
computational efficiency and information retention in the generation.
sparsification process, which provides a guarantee for the 3.4 Decouple the decoder
stable training and efficient operation of the model.
The original dreamtalk model uses a single decoder to
3.3 (Multi-model Fusion network) Cross-modal process facial expression generation, which is difficult to
feature fusion module accurately model the movement of different facial regions.
The original dreamtalk model has the problems of It is easy to interfere with emotional expression and mouth
insufficient synchronization and insufficient feature fusion movement, resulting in unnatural local
when processing speech and facial expression features, expressions and loss of details. In this study, based on
which leads to the inability to accurately match the the decoupled generation network designed by FACS
generated facial expression and speech, and poor theory[21], the facial expression space is divided into the
expression naturalness and dynamic correlation. The cross- upper and lower half regions, which are processed
modal feature fusion module constructed in this study independently and modeled by the dual-branch structure
strengthens the dynamic association between speech and respectively.
expression by deeply fusing Tacotron speech features and
The introduction of decoupled decoder effectively
Wav2Vec acoustic features.
solves the defects of the original model. In the generation
After the introduction of this module, the model of eye expressions, the movements of eyebrows and
achieves a significant improvement in speech-expression eyelids are more consistent with emotional semantics, and
synchronization, and the time deviation between lip the emotional expression is more accurate. In terms of
movements and speech phonemes is reduced, which mouth movement generation, the synchronization between
greatly improves the phenomenon of phonetic and painting mouth shape and speech is further enhanced, and the
synchronization. In addition, the cross-modal feature speech-expression synchronization error is reduced. At the
fusion module effectively enhances the network's ability to same time, the structure avoids the interference between
express multimodal information through the gate the actions of different regions, which greatly improves the
mechanism and skip connection[19], so that the model can naturalness and accuracy of local expressions. The
better capture the complex mapping relationship between generated facial expressions are more vivid and realistic,
speech and expression. and have more advantages in detail processing.
On the basis of computational efficiency optimization,
this study constructs a cross-modal feature fusion module

www.ijaers.com Page | 17
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

Considering that a single decoder is difficult to the Visual Geometry Group at the University of Oxford.
accurately simulate the movement of different facial The dataset is derived from speech clips in YouTube
regions, this study designs a decoupled generation network videos related to celebrities. It is split into VoxCeleb1,
based on FACS theory, and divides the facial expression which has more than 100,000 voice clips of 1,251
space into the upper and lower half regions for celebrities, and VoxCeleb2, which is much larger, with
independent processing. The upper half is responsible for more than 1 million voice clips of 6,112 celebrities and
emotional expression, while the lower half is closely each clip is at least 3 seconds. It is characterized by a high
related to speech articulation. diversity of speech, including different races, accents, ages,
The decoupled decoder adopts a dual-branch structure, and complex backgrounds, while being of high quality and
each branch is equipped with a dynamic linear layer, and carefully screened. It has a wide range of applications in
its design refers to the idea of conditional normalization. speech recognition, speaker verification, speech sentiment
The eyebrow decoder uses analysis, speech synthesis and other fields, which provides
rich and high-quality data resources for speech-related
to generate a
research and application.
weight matrix based on semo. According to the acoustic
The experimental platform environment configuration
features , the mouth decoder uses used in this experiment is shown in Table 1
Table 1 Experimental environment
Determine the parameters, including
Name version informatio
. Finally, the output of the upper and
Operating system Microsoft Windows11
lower halves is concatenated to avoid the mutual CPU 12th Gen Intel(R) Core(TM) i7-
interference between emotional expression and mouth 12700
movement, realize the accurate control of eye emotional
transmission and mouth speech synchronization, and GPU NVIDIA GeForce RTX 4060 Ti
significantly improve the naturalness and detail accuracy Memory capacity 16GB
of expression generation.
Deep Learning PyTorch
FrameworkPython 3.10.14
IV. EXPERIMENTAL ANALYSIS CUDA 11.8
4.1 Experimental Environment and experimental data PyTorch 2.1.2
set
TorchVision 0.16.2
In this study, an end-to-end training approach is used to
jointly optimize modules such as speech feature extraction,
cross-modal feature fusion, and expression generation. In 4.2 Comparative analysis of data
the early stage of training, the parameters of pre-trained
We use a variety of evaluation metrics to evaluate the
models such as Tacotron and Wav2Vec are fine-tuned with
experimental results, and the experimental results are
a small learning rate to adapt them to the speech feature
shown in Table 3, which show the experimental results of
extraction task of this study. Then, the cross-modal feature
the four methods SadTalker, Wav2lip, TANGO and Ours,
fusion module was gradually introduced to decouple the
respectively. In this paper, SSIM, SIFT and PSNR are
decoder, and the alternating training strategy was adopted.
selected as the performance evaluation metrics, which
The parameters of the expression generation network were
measure the quality of the 2D digital human video
fixed, and the feature fusion module was optimized to
generated based on the diffusion model from different
enhance the correlation between speech and expression
perspectives, thus providing a comprehensive evaluation of
features. Then the feature fusion module is fixed, and the
the performance of the method.
decoupled decoder is trained to improve the quality of
expression generation. In the training process, the Early SSIM is a full-reference image quality assessment
Stopping method is used to avoid overfitting, and the index, which measures the similarity of images from three
training rounds are dynamically adjusted according to the aspects: brightness, contrast and structure. SSIM values
expression naturalness index on the validation set. range from [0,1], with higher values indicating lower
image distortion. Therefore, for the similarity curve of
The dataset used in this experiment is VoxCeleb. The
video frames, a higher SSIM value is better, and a flatter
VoxCeleb dataset is an open source dataset maintained by
curve is better, because it means that the similarity

www.ijaers.com Page | 18
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

between video frames does not change much and the video V. CONCLUSION
quality is stable. In this study, the DreamTalk speech synthesis model is
Table 2 Comparison of experimental results of different optimized, and the performance of the model is
methods significantly improved by introducing techniques such as
adaptive threshold sparsification method, mutual
Methods SSIM↑ LPIPS↓ PSNR↑
information constraint, multi-model fusion and skip
DMT 0.7970 0.1093 28.2298 connection. In terms of speech synthesis quality,
DreamTalk 0.6973 0.4582 20.3429 computational efficiency and generalization ability, the
improved model is significantly better than the traditional
SadTalker 0.6693 0.5348 12.8915
DreamTalk model and other comparison models.
Wav2lip 0.8470 0.1277 34.6643
However, there are still some shortcomings in this
TANGO 0.8758 0.1359 29.0019 study. In the process of multi-model fusion, the current
simple average fusion method essentially treats the output
of each model with equal weight, which fails to fully
LPIPS is a deep learning-based image similarity
consider the differences in the advantages of different
evaluation metric, which evaluates image similarity by
models in processing specific speech features or scenes,
comparing perceptual differences between image patches.
and it is difficult to maximize the effectiveness of each
The smaller the LPIPS value, the more similar the images.
model in complex speech synthesis tasks. In the field of
For the similarity curve of video frames, a lower LPIPS
cross-modal applications, although the speech-image
value is better, and a flatter curve is better, which indicates
matching has been improved, there is still a large room for
that the perceptual difference between video frames is
improvement in the quality and diversity of image
small and the video quality is high.
generation. There is a gap between the generated image
PSNR is a commonly used metric to evaluate video and and the real image and user expectation in detail texture,
image quality, which is calculated by comparing the color richness and creative expression. When the adaptive
original signal with the distorted signal. A higher PSNR sparsization method faces extreme data distribution, such
value indicates less distortion of the video frame. For the as a small number of abnormal speech samples or a serious
similarity curve of video frames, the higher the PSNR imbalance of data feature distribution, the stability of the
value, the better, the upward of the curve indicates that the model will be affected, and problems such as fluctuations
video quality is improving, and the downward of the curve in the quality of synthesized speech and abnormal
indicates that the video quality is decreasing. parameter update may occur.
The experimental results show that the model proposed To address these shortcomings, future research will be
in this study outperforms the previous methods in many carried out in several directions. In the aspect of multi-
aspects. The cross-modal feature fusion module realized model fusion, the fusion strategy based on attention
the deep fusion of speech features through the gate mechanism and dynamic weight allocation will be deeply
mechanism and skip connection, which significantly explored. By constructing an intelligent evaluation system,
improved the synchronization. The decoupled decoder the model can automatically allocate the weight of each
separated the upper and lower half of the facial motion sub-model according to the characteristics of the input
based on FACS theory, and combined with the dynamic speech, and give full play to the advantages of different
linear layer to enhance the expression detail generation models. In the field of cross-modal research, we plan to
ability. The dynamic threshold sparsification and mutual combine generative adversarial networks and self-
information constrained optimization mechanism greatly supervised learning technology to further explore the
reduce the computational complexity under the premise of potential correlation between speech and image, build a
controllable information loss. When mutual information more powerful cross-modal mapping model, improve the
constrained optimization is disabled, the inference time of quality and diversity of image generation, and realize more
the model decreases but the performance index deteriorates creative and realistic image generation driven by speech.
significantly. These results prove that the collaborative For the adaptive sparsification method, a dynamic
design of model components is the key to achieve efficient adjustment threshold strategy and an abnormal data
and natural expression generation. detection mechanism are introduced. By real-time
monitoring of data distribution characteristics, the
sparsification process is dynamically optimized, and the
stability of the model in extreme data environments is
enhanced, so as to further improve the overall performance

www.ijaers.com Page | 19
Zhang and Shang International Journal of Advanced Engineering Research and Science, 12(5)-2025

and application range of the model, which provides more [12] Nair, N. G., Mei, K., & Patel, V. M. (2023). At-ddpm:
powerful support for the development of speech synthesis Restoring faces degraded by atmospheric turbulence using
technology and cross-modal research. denoising diffusion probabilistic models. In Proceedings of
the IEEE/CVF winter conference on applications of
computer vision (pp. 3434-3443).
ACKNOWLEDGEMENTS [13] Hou, J., Lu, Y., Wang, M., Ouyang, W., Yang, Y., Zou, F.,
... & Liu, Z. (2024). A Markov Chain approach for video-
This paper received funding from "Guangdong based virtual try-on with denoising diffusion generative
University of Petrochemical Technology College Student adversarial network. Knowledge-Based Systems, 300,
Innovation and Entrepreneurship Project No. 24A014". 112233.
[14] Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss,
R. J., Jaitly, N., ... & Saurous, R. A. (2017). Tacotron:
REFERENCES Towards end-to-end speech synthesis. arXiv preprint
[1] Priyadharshini, A. R., & Annamalai, R. (2024). arXiv:1703.10135.
Identification and Reconstruction of Human Faces into 3D [15] Schneider, S., Baevski, A., Collobert, R., & Auli, M.
Models Using SSD-Based and Attention Mesh Models in (2019). wav2vec: Unsupervised pre-training for speech
Real-Time. SN Computer Science, 5(8), 1-9. recognition. arXiv preprint arXiv:1904.05862.
[2] Buhari, A. M., Ooi, C. P., Baskaran, V. M., Phan, R. C., [16] Zhao, Y., & Li, X. (2024, September). Better
Wong, K., & Tan, W. H. (2020). FACS-based graph approximation of sigmoid function for privacy-preserving
features for real-time micro-expression neural networks. In Journal of Physics: Conference
recognition. Journal of Imaging, 6(12), 130. Series (Vol. 2852, No. 1, p. 012007). IOP Publishing.
[3] Chauhan, A., & Jain, S. (2024). FMeAR: FACS Driven [17] A Mutual Information Based Approach for Feature Subset
Ensemble Model for Micro-Expression Action Unit Selection and Image ClassificationA Mutual Information
Recognition. SN Computer Science, 5(5), 598. Based Approach for Feature Subset Selection and Image
[4] Kaneko, T., Kameoka, H., Tanaka, K., & Hojo, N. (2019). Classification.
Stargan-vc2: Rethinking conditional methods for stargan- [18] Xu, H., Li, Y., Zhang, M., & Tong, P. (2024). Sonar image
based voice conversion. arXiv preprint arXiv:1907.12279. segmentation using a multi-spatial information constraint
[5] Zhang, C., Wang, C., Zhang, J., Xu, H., Song, G., Xie, Y., fuzzy C-means clustering algorithm based on KL
... & Feng, J. (2023). Dream-talk: Diffusion-based realistic divergence. International Journal of Machine Learning
emotional audio-driven method for single image talking and Cybernetics, 1-18.
face generation. arXiv preprint arXiv:2312.13578. [19] Chen, H., Lu, X., Li, S., & He, L. (2025). Improving
[6] Xu, S., Chen, G., Guo, Y. X., Yang, J., Li, C., Zang, Z., ... aluminum surface defect super-resolution with diffusion
& Guo, B. (2024). Vasa-1: Lifelike audio-driven talking models and skip connections. Materials Today
faces generated in real time. Advances in Neural Communications, 42, 111297.
Information Processing Systems, 37, 660-684. [20] Wang, W., Han, D., Duan, X., Yong, Y., Wu, Z., Ma, X.,
[7] Ma, Z., Zhu, X., Qi, G. J., Lei, Z., & Zhang, L. (2023). ... & Dai, K. (2024). Fast-Activated Minimal Gated Unit:
Otavatar: One-shot talking face avatar with controllable tri- Lightweight Processing and Feature Recognition for
plane rendering. In Proceedings of the IEEE/CVF Multiple Mechanical Impact Signals. Sensors, 24(16),
Conference on Computer Vision and Pattern 5245.
Recognition (pp. 16901-16910). [21] Gilbert, M., Demarchi, S., & Urdapilleta, I. (2021).
[8] Singer, U., Polyak, A., Hayes, T., Yin, X., An, J., Zhang, FACSHuman, a software program for creating
S., ... & Taigman, Y. (2022). Make-a-video: Text-to-video experimental material by modeling 3D facial
generation without text-video data. arXiv preprint expressions. Behavior Research Methods, 53(5), 2252-
arXiv:2209.14792. 2272.
[9] Ma, Y., Zhang, S., Wang, J., Wang, X., Zhang, Y., &
Deng, Z. (2023). DreamTalk: When Emotional Talking
Head Generation Meets Diffusion Probabilistic
Models. arXiv preprint arXiv:2312.09767.
[10] Chen, R., Pang, K., Wang, Z., Liu, Q., Tang, C., Chang,
Y., & Huang, M. (2025). A self-supervised graph
convolutional model for recommendation with exponential
moving average. Neural Computing and Applications, 1-
17.
[11] Zhao, S., Wang, Y., Yang, Z., & Cai, D. (2019). Region
mutual information loss for semantic
segmentation. Advances in Neural Information Processing
Systems, 32.

www.ijaers.com Page | 20

You might also like