Handwriting Text Generation
3. Transformer Encoder: The style vectors obtained from the style samples
are fed into a Transformer encoder. The encoder uses self-attention
mechanisms to capture long-range dependencies and enriches the style
vectors with contextual information.
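As a rough illustration of this step (not any paper's exact configuration), the style tokens can be enriched with PyTorch's nn.TransformerEncoder; the shapes below are assumptions:

```python
import torch
import torch.nn as nn

# Hypothetical shapes: 15 style samples, each flattened into a sequence of
# 100 style vectors of dimension 512 (batch_first layout).
style_vectors = torch.randn(15, 100, 512)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
style_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

# Self-attention lets every style token attend to every other token,
# enriching each vector with long-range contextual information.
contextual_style = style_encoder(style_vectors)  # (15, 100, 512)
```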
Summary: I did not see this work as writer-specific style generation but rather as calligraphy-specific style generation. The style is extracted in the following way: first, a pretrained ResNet-18 extracts feature vectors from the images of a specific writer, of which only a few samples are available. It extracts P feature maps from P images, flattens them, and sends them to a Transformer encoder to capture long-term dependencies. The encoder output is then passed to a Transformer decoder. The decoder input is the style vector from the encoder as well as image encodings/embeddings of the visual archetypes; these embeddings decide which content to generate. The Transformer output is then passed to a CNN decoder to generate the image. The model also uses a writer classification loss.
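A minimal sketch of the feature-extraction step described above, using torchvision's pretrained ResNet-18 truncated before its classification head; the image size, number of samples, and layer cut are assumptions, not the paper's exact setup:

```python
import torch
import torch.nn as nn
from torchvision import models

# Pretrained ResNet-18 truncated before average pooling and the classifier,
# so it returns spatial feature maps rather than logits.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2])

# P style images from one writer (only a few samples are available),
# e.g. P = 15 grayscale word crops replicated to 3 channels, resized to 64x256.
p_images = torch.randn(15, 3, 64, 256)

with torch.no_grad():
    feature_maps = backbone(p_images)            # (15, 512, 2, 8)

# Flatten each feature map into a token sequence for the Transformer encoder.
style_tokens = feature_maps.flatten(2).permute(0, 2, 1)  # (15, 16, 512)
```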
3. Handwriting Transformers
This work also uses a Transformer for handwritten text generation. The generator consists of a Transformer encoder and a decoder. The encoder extracts features from the input handwritten images: it uses a ResNet-18 for feature extraction and then, presumably, applies self-attention to capture long-term dependencies. The decoder again applies multi-head attention, where the keys and values come from the feature maps provided by the encoder and the queries come from the decoder's content input. The decoder output is passed to convolutional layers to generate the image. The framework uses a cyclic consistency loss, which looks similar to an MSE loss.
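To illustrate that observation, here is a generic sketch in which the style re-extracted from the generated images is pulled toward the original style features with an MSE-shaped penalty; this is not HWT's exact formulation:

```python
import torch
import torch.nn.functional as F

def cyclic_consistency_loss(style_encoder, real_style_feats, generated_images):
    """Re-encode the generated images and penalize deviation from the
    original style features. Sketch only; the actual HWT term may use a
    different norm and operate on different tensors."""
    regenerated_feats = style_encoder(generated_images)
    return F.mse_loss(regenerated_feats, real_style_feats)
```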
This paper uses the same Transformer encoder-decoder architecture for style generation. It uses a focal frequency loss, which differs from other works, to preserve the style.
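A simplified sketch of a focal-frequency-style loss, comparing real and generated images in the 2D frequency domain and up-weighting the frequencies with the largest errors; the official formulation differs in details such as patching and normalization:

```python
import torch

def focal_frequency_loss(fake, real, alpha=1.0):
    """Compare images in the 2D frequency domain and focus on the
    frequency components with the largest errors (simplified sketch)."""
    f_fake = torch.fft.fft2(fake, norm="ortho")   # complex spectra (B, C, H, W)
    f_real = torch.fft.fft2(real, norm="ortho")

    # Squared distance per frequency component.
    dist = (f_fake - f_real).abs() ** 2

    # Focal weighting: harder (larger-error) frequencies get larger weights.
    weight = (dist.sqrt() ** alpha).detach()
    weight = weight / (weight.max() + 1e-8)       # normalize to [0, 1]

    return (weight * dist).mean()
```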
5. GANwriting: Content-Conditioned Generation of Styled Handwritten Word
Images
Get the input style images X_i, e.g. 15 word images from a single writer.
Pass each image through a CNN backbone (like VGG19) to extract feature maps. For example,
for an image of size 32x128:
VGG19 convolution layers extract 32 feature maps of size 8x32.
These capture stylistic information such as strokes, shapes, slant, etc.
Aggregate features across all input images:
Resize all feature maps to the same (height, width).
Concatenate along the channel dimension.
E.g. if there are 15 images, the aggregated map is (32, 8, 32, 15).
Pass the aggregated maps through additional convolution layers
to reduce dimensions and compute statistics.
This outputs the style features Fs, e.g. of size (256, 4, 8).
Add small random noise to Fs; this allows natural variations in style (see the sketch after these steps):
Fs' = Fs + N(0, 0.05)
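A minimal sketch of the aggregation above, assuming a torchvision VGG19 truncated at an intermediate block; the layer cut, the reduction convolution, and the noise scale are illustrative assumptions rather than GANwriting's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

# K style images from one writer, e.g. K = 15 word crops of size 32x128
# (grayscale replicated to 3 channels for VGG). All shapes are illustrative.
K = 15
style_images = torch.randn(K, 3, 32, 128)

# VGG19 truncated at an intermediate block so it returns spatial feature maps.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:10]
with torch.no_grad():
    feats = vgg(style_images)                        # (15, 128, 8, 32)

# Resize to a common (height, width) and concatenate along the channel dim.
feats = F.interpolate(feats, size=(8, 32), mode="bilinear", align_corners=False)
aggregated = feats.reshape(1, K * feats.shape[1], 8, 32)     # (1, 1920, 8, 32)

# Extra convolution layers reduce the aggregated maps to style features Fs.
reduce = nn.Sequential(
    nn.Conv2d(K * feats.shape[1], 256, kernel_size=3, stride=(2, 4), padding=1),
    nn.ReLU(),
)
Fs = reduce(aggregated)                              # (1, 256, 4, 8)

# Small Gaussian noise allows natural variation: Fs' = Fs + N(0, 0.05).
Fs_noisy = Fs + 0.05 * torch.randn_like(Fs)
```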
Multi-Branch Encoder:
Uses multiple independent encoding branches to capture different representations of a letter
trajectory, with each branch specialized to extract a different prototype writing style.
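A minimal sketch of the multi-branch idea, with a hypothetical simple MLP per branch and assumed dimensions:

```python
import torch
import torch.nn as nn

class MultiBranchEncoder(nn.Module):
    """Independent branches encode the same letter trajectory; each branch
    can specialize in a different prototype writing style (sketch only)."""
    def __init__(self, in_dim=256, hidden=128, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden))
            for _ in range(num_branches)
        ])

    def forward(self, trajectory):
        # trajectory: (batch, in_dim); returns one encoding per branch.
        return [branch(trajectory) for branch in self.branches]

encodings = MultiBranchEncoder()(torch.randn(4, 256))  # list of 3 tensors, each (4, 128)
```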
Here are the common points regarding how writer styles are extracted:
Sequence of 100 style tokens per sample: (1, 100, 512)
15 stacked style samples: (15, 100, 1024)
Mean of writer style: (1, 100, 1024)
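For example, averaging the per-sample style sequences over a writer's samples reproduces the shapes noted above:

```python
import torch

# 15 style samples from one writer, each a sequence of 100 tokens of size 1024.
per_sample_styles = torch.randn(15, 100, 1024)

# Averaging over the sample dimension gives one writer-level style sequence.
writer_style = per_sample_styles.mean(dim=0, keepdim=True)  # (1, 100, 1024)
```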
3. Transformer Decoder: The content strings or input sequences, which represent the desired
text or characters, are input to the Transformer decoder. The decoder performs cross-attention
between the style vectors or feature maps obtained from the encoder and the content strings.
This allows the model to capture the entanglement between content and style, enabling the
generation of locally styled text.
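A minimal sketch of this cross-attention step with PyTorch's nn.TransformerDecoder, where the embedded content acts as the query and the encoded style as the memory (keys/values); shapes are illustrative:

```python
import torch
import torch.nn as nn

# Encoded style tokens from the encoder (memory) and embedded content
# characters (target); batch of 1, illustrative sizes.
style_memory = torch.randn(1, 100, 512)    # keys/values come from here
content_tokens = torch.randn(1, 20, 512)   # queries: the desired text

decoder_layer = nn.TransformerDecoderLayer(d_model=512, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=3)

# Cross-attention entangles content (queries) with style (keys/values).
styled_content = decoder(tgt=content_tokens, memory=style_memory)  # (1, 20, 512)
```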
4. Convolutional Decoder: After the Transformer decoder, some papers utilize convolutional
decoders to generate the final handwritten text images conditioned on both the content and
style information. The convolutional decoder takes the entangled content-style representation
and produces the output images.
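A rough sketch of such a convolutional decoder, upsampling an entangled content-style feature map into a word image; the channel counts and output size are assumptions:

```python
import torch
import torch.nn as nn

# Entangled content-style features reshaped to a spatial map (illustrative).
features = torch.randn(1, 512, 4, 16)

conv_decoder = nn.Sequential(
    nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1),  # -> 8 x 32
    nn.ReLU(),
    nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # -> 16 x 64
    nn.ReLU(),
    nn.ConvTranspose2d(128, 1, kernel_size=4, stride=2, padding=1),    # -> 32 x 128
    nn.Tanh(),  # grayscale handwriting image in [-1, 1]
)

image = conv_decoder(features)  # (1, 1, 32, 128)
```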
6. Loss Functions: Different loss functions are used to train the models, such as cyclic
consistency loss, focal frequency loss, and adversarial loss (in the context of GANs).
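As an illustration of how such terms are typically combined for the generator (the exact weights and set of terms vary across the papers; this is only a generic sketch with a hinge-style adversarial term):

```python
import torch

def generator_loss(adv_logits, cyclic_term, freq_term,
                   lambda_cyclic=1.0, lambda_freq=1.0):
    """Hinge-style adversarial generator loss plus weighted auxiliary terms;
    a generic sketch, not any single paper's exact recipe."""
    adversarial = -adv_logits.mean()   # generator wants high discriminator scores
    return adversarial + lambda_cyclic * cyclic_term + lambda_freq * freq_term

# Example with dummy values:
loss = generator_loss(torch.randn(8, 1), torch.tensor(0.3), torch.tensor(0.1))
```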
Overall, these papers combine various techniques, including pre-trained models, Transformer
encoders and decoders, convolutional decoders, and contrastive learning, to extract and
represent writer-specific styles for generating handwritten text images.
[1] Bhunia, Ankan Kumar, et al. "Handwriting Transformers." Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion
https://github.com/tmaham/DS-Fusion/tree/main
Improving Handwritten OCR with Training Samples Generated by Glyph Conditional Denoising Diffusion
Probabilistic Model
full: 33
5skips: 26.51
10skips: 26.99