PP Advanced Typography in PDF
technology
The ideal match or an uneasy compromise?
Create a PDF file with text
• Nothing can be simpler
GSUB
• Structure: Script List (with Language System Info) → Feature List → Lookup List
• Lookup types shown: Multiple Substitution, Alternate Substitution, Ligature Substitution, Contextual Substitution, Chained Contextual Substitution
• Example script tags: arab, cyrl, grek, …
• Example feature tags: init, medi, fina, rlig, dnom, frac, c2sc, smcp
GPOS
• Structure: Script List (with Language System Info) → Feature List → Lookup List
• Lookup types shown: Single Adjustment, Cursive Adjustment, Mark-to-Base Attachment, Mark-to-Ligature Attachment, Mark-to-Mark Attachment
Features
• How do we know which features to apply?
• How and when to apply them?
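One way to picture the answer is to walk the table hierarchy: the script and language system determine which feature tags are on offer, and each enabled feature contributes lookup indices. The sketch below uses plain dictionaries with made-up lookup indices. `FONT_TABLE` and `lookups_for` are hypothetical names for illustration, not a real font-parsing API.

```python
# Hypothetical, hand-written stand-in for a parsed GSUB table.
# The lookup indices are invented for illustration only.
FONT_TABLE = {
    "scripts": {
        "arab": {"dflt": ["init", "medi", "fina", "rlig"]},
        "grek": {"dflt": ["smcp", "c2sc"]},
    },
    "features": {
        "init": [0],
        "medi": [1],
        "fina": [2],
        "rlig": [3, 4],
        "smcp": [5],
        "c2sc": [6],
    },
}


def lookups_for(table, script, language, enabled):
    """Collect lookup indices for features that are both offered by the
    script/language system and enabled by the caller."""
    offered = table["scripts"].get(script, {}).get(language, [])
    indices = set()
    for tag in offered:
        if tag in enabled:
            indices.update(table["features"][tag])
    # Sorted so the result follows Lookup List order.
    return sorted(indices)
```

Asking for `{"init", "rlig", "smcp"}` on `arab`/`dflt` yields only the `init` and `rlig` lookups, because `smcp` is not offered for that script in this toy table.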
Indic scripts: overview of the algorithm
• Clustering into syllables (Unicode)
• Reordering (Unicode, aside from font)
• Substitutions (OpenType features)
• one to one
• one to many
• many to one
• contextual
• Positioning (OpenType features)
• kerning
• mark positioning
Indic shaping algorithm: Unicode
• Initial
• \u091A\u093F\u0928\u094D\u0939\u0947
• Clustering into syllables
• \u091A\u093F\u0928\u094D\u0939\u0947
• Reordering
• \u093F\u091A\u0928\u094D\u0939\u0947
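The two Unicode-level steps above can be sketched in a few lines. This is a deliberately simplified model, assuming a syllable is "consonant (virama consonant)* followed by combining marks" and that the only reordering needed is moving the pre-base vowel sign U+093F to the front of its cluster. Real Indic shaping has many more rules. The function names are hypothetical.

```python
import unicodedata

VIRAMA = "\u094D"
PRE_BASE_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn before the consonant


def is_consonant(ch):
    # Core Devanagari consonant range (KA..HA); a simplification.
    return "\u0915" <= ch <= "\u0939"


def split_clusters(text):
    """Split text into simplified syllable clusters."""
    clusters = []
    i, n = 0, len(text)
    while i < n:
        j = i + 1
        if is_consonant(text[i]):
            # Absorb "virama + consonant" pairs to build conjuncts.
            while j + 1 < n and text[j] == VIRAMA and is_consonant(text[j + 1]):
                j += 2
        # Absorb trailing combining marks (vowel signs, anusvara, ...).
        while j < n and unicodedata.category(text[j]) in ("Mn", "Mc"):
            j += 1
        clusters.append(text[i:j])
        i = j
    return clusters


def reorder(cluster):
    """Move the pre-base vowel sign to the start of its cluster."""
    if PRE_BASE_MATRA in cluster:
        return PRE_BASE_MATRA + cluster.replace(PRE_BASE_MATRA, "", 1)
    return cluster


word = "\u091A\u093F\u0928\u094D\u0939\u0947"
reordered = "".join(reorder(c) for c in split_clusters(word))
```

For the word on this slide, `split_clusters` gives two clusters and `reordered` matches the reordered string shown above.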
Indic shaping algorithm: OpenType
• Initial
• GSUB
• GPOS
GSUB features implementation
• Single substitution:
• One to one
• Replacement glyph might not have Unicode value (swashes)
• Remember Unicode value and replace glyph id.
• Multiple substitution:
• One to many
• Do not confuse with Unicode decomposition
• Same approach
• How to enable copying? (/ToUnicode)
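A minimal sketch of "remember Unicode value and replace glyph id": each glyph carries the text it came from, and substitutions only touch the glyph id. The `Glyph` class, the function names, and the glyph ids are hypothetical. For one-to-many substitution the sketch assumes the source text stays on the first output glyph, which is one possible convention rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    gid: int   # glyph id in the font
    text: str  # Unicode text this glyph stands for (may be empty)


def single_substitution(glyphs, mapping):
    """One to one: swap the glyph id, keep the remembered text."""
    return [Glyph(mapping.get(g.gid, g.gid), g.text) for g in glyphs]


def multiple_substitution(glyphs, mapping):
    """One to many: the first output glyph keeps the text (assumption),
    the remaining output glyphs carry no text of their own."""
    out = []
    for g in glyphs:
        replacement = mapping.get(g.gid)
        if replacement is None:
            out.append(g)
            continue
        for i, gid in enumerate(replacement):
            out.append(Glyph(gid, g.text if i == 0 else ""))
    return out


def to_unicode_map(glyphs):
    """Collect glyph id -> text pairs that could feed a /ToUnicode CMap."""
    return {g.gid: g.text for g in glyphs if g.text}
```

A swash glyph with no Unicode value of its own still copies correctly, because the map is built from the remembered text rather than from the font's cmap.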
GSUB features implementation
• Alternate substitution:
(Figure: glyph positioning values: X placement, Y placement, X advance, Y advance)
GPOS features implementation
• Check if current glyph has placement
• Move the cursor to the position of the glyph the current glyph is attached to (Td)
• Apply xPlacement and yPlacement to move the origin to the anchor (Td)
• Show current glyph (Tj / TJ)
• Roll back the cursor to the initial position (Td)
• Apply xAdvance and yAdvance (Td)
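The steps above can be sketched as a function that emits content-stream operators for one attached glyph. It assumes all font values are given in 1/1000 em and scaled by the font size, and it relies on the fact that Td offsets are relative to the start of the current line, so the rollback is simply the negated sum of the earlier moves. The function and parameter names are hypothetical.

```python
def fmt(value):
    """Format a number compactly and avoid printing '-0'."""
    return "0" if value == 0 else f"{value:g}"


def td(x, y):
    return f"{fmt(x)} {fmt(y)} Td"


def show_attached_glyph(glyph_hex, to_base, placement, advance, font_size):
    """Emit operators for a glyph that has placement values.

    to_base   -- offset from the cursor to the glyph it attaches to
    placement -- (xPlacement, yPlacement)
    advance   -- (xAdvance, yAdvance)
    All three are (x, y) pairs in 1/1000 em (assumption).
    """
    def scale(pair):
        return pair[0] * font_size / 1000, pair[1] * font_size / 1000

    bx, by = scale(to_base)
    px, py = scale(placement)
    ax, ay = scale(advance)
    return [
        td(bx, by),                    # move to the base glyph
        td(px, py),                    # move the origin to the anchor
        f"<{glyph_hex}> Tj",           # show the glyph
        td(-(bx + px), -(by + py)),    # roll back to the initial position
        td(ax, ay),                    # apply the advance
    ]


ops = show_attached_glyph("01C4", (-500, 0), (50, 300), (0, 0), 10)
```

A real producer would probably merge consecutive Td operators; they are kept separate here so each line maps to one bullet.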
GPOS features implementation
• Horizontal placement => Tj + Td can be replaced with TJ
• Vertical placement (yPlacement != 0 or yAdvance != 0) => TJ is not enough => need to use Td
• Might be a problem for text extraction
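A small sketch of that decision: when every adjustment is horizontal, the run collapses into a single TJ array, otherwise the caller has to fall back to Td. Numbers inside a TJ array are in thousandths of a text space unit and are subtracted from the current position, so an adjustment that tightens spacing appears as a positive number. `build_tj` and its tuple layout are hypothetical.

```python
def build_tj(glyphs):
    """Build a TJ operator from (hex_gid, x_adjust, y_adjust) tuples.

    x_adjust is extra space applied after the glyph, in thousandths of a
    text space unit (negative tightens). Returns None when any vertical
    adjustment is present, because TJ cannot express it.
    """
    if any(y_adjust != 0 for _, _, y_adjust in glyphs):
        return None
    parts = []
    for gid, x_adjust, _ in glyphs:
        parts.append(f"<{gid}>")
        if x_adjust != 0:
            # TJ subtracts its numbers, so the sign is flipped.
            parts.append(str(-x_adjust))
    return "[" + " ".join(parts) + "] TJ"
```

Keeping a whole word in one TJ array also tends to be friendlier to text extraction than a string of Td-separated single glyphs.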
Back to /ToUnicode
1C4
\u0935\u0930\u094D\u0923\u094B\u0902
Indic does not work that way
How to map glyphs to Unicode?
• 39, 27, 1C4
• \u0935\u0930\u094D\u0923\u094B\u0902
• 39 <-> u0935
• 27 <-> u0923
• 1C4 <-> ???
• Easy without reordering, but not in our case
• \u0935|\u0930\u094D\u0923|\u094B\u0902
• Incorrect when copying single glyphs
• Incorrect when adding new words
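The problem is easy to reproduce. The sketch below extracts text the way a viewer would with only a per-glyph /ToUnicode mapping, using the mapping this deck later assigns to glyph 1C4. The dictionary and function names are hypothetical.

```python
# Per-glyph mapping, as a /ToUnicode CMap would express it.
TO_UNICODE = {
    0x39: "\u0935",
    0x27: "\u0923",
    0x1C4: "\u094B\u0930\u094D\u0902",
}

ORIGINAL = "\u0935\u0930\u094D\u0923\u094B\u0902"


def extract_text(glyph_ids, to_unicode):
    """Concatenate the mapped text of each glyph in drawing order."""
    return "".join(to_unicode[gid] for gid in glyph_ids)


extracted = extract_text([0x39, 0x27, 0x1C4], TO_UNICODE)
```

`extracted` contains the same characters as `ORIGINAL` but in a different order, because the reordered glyph 1C4 contributes characters that logically belong before glyph 27.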
How to map glyphs to Unicode?
• 39, 27, 1C4
• 27 <-> u0923
• \u0935|\u0930\u094D\u0923|\u094B\u0902
• \u0935\u0930\u094D\u0917\u0940\u0915\u0930\u0923
• Extra chars will be copied along with the word
• \u0935\u0930\u094D\u0917\u0940\u0915\u0930\u0930\u094D\u0923
• Challenge for most of the PDF producers even today
/ActualText comes to save us
• Can be specified for content that does translate into text but that is represented in a nonstandard way (ISO 32000-1)
• Replacement text can be specified for the following items:
• A structure element, by means of the optional ActualText entry (PDF 1.4) of the structure element dictionary.
• A marked-content sequence, through an ActualText entry in a property list attached to the marked-content sequence with a Span tag.
/ActualText comes to save us
• \u0935\u0930\u094D\u0923\u094B\u0902
• [39, 27, 1C4]
• /ToUnicode CMAP:
• 39 <-> u0935
• 27 <-> u0923
• 1C4 <-> \u094B\u0930\u094D\u0902
/Span <</ActualText <FEFF 0930 094D 0923 094B 0902> >> BDC
<002701C4>Tj
EMC
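A sketch of how a producer might generate such a span: the replacement text is written as a hex string in UTF-16BE with a leading byte order mark. `actual_text_span` is a hypothetical helper that only builds the three lines of the content stream.

```python
def actual_text_span(text, glyph_hex):
    """Wrap a shown glyph string in a /Span with /ActualText.

    text      -- the logical Unicode text the glyphs stand for
    glyph_hex -- hex glyph codes passed to Tj
    """
    # UTF-16BE with a leading byte order mark (FEFF).
    encoded = ("\uFEFF" + text).encode("utf-16-be").hex().upper()
    return "\n".join([
        f"/Span <</ActualText <{encoded}>>> BDC",
        f"<{glyph_hex}> Tj",
        "EMC",
    ])


span = actual_text_span("\u0930\u094D\u0923\u094B\u0902", "002701C4")
```

The output matches the slide except that the hex string is written without spaces. Whitespace inside a PDF hex string is ignored, so both forms are equivalent.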
/ActualText
• Not supported in many PDF viewers
• Problems with determining spaces when extracting text
Features + algorithms
• Lookup tables don’t know script rules
• Half characters
• त + व = त्व tva
• ण + ढ = ण्ढ ṇḍha
• स + थ = स्थ stha
• Don’t blindly apply all features
• Set up masks for features during preprocessing
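A minimal sketch of the mask idea for half forms. It assumes the last consonant of a cluster is the base and that only "consonant + virama" pairs before the base may take a half form. `half_lookup` stands in for a font's lookup table and its glyph names are invented.

```python
VIRAMA = "\u094D"


def is_consonant(ch):
    # Core Devanagari consonant range (KA..HA); a simplification.
    return "\u0915" <= ch <= "\u0939"


def half_form_mask(cluster):
    """Mark the positions where the half-form feature may be applied."""
    consonants = [i for i, ch in enumerate(cluster) if is_consonant(ch)]
    if not consonants:
        return [False] * len(cluster)
    base = consonants[-1]  # assumption: the last consonant is the base
    return [
        i < base and (is_consonant(ch) or ch == VIRAMA)
        for i, ch in enumerate(cluster)
    ]


def apply_half_forms(cluster, mask, half_lookup):
    """Apply the lookup only where the mask allows it."""
    out = []
    i = 0
    while i < len(cluster):
        pair = cluster[i:i + 2]
        if mask[i] and pair in half_lookup:
            out.append(half_lookup[pair])
            i += 2
        else:
            out.append(cluster[i])
            i += 1
    return out


HALF_LOOKUP = {"\u0924\u094D": "ta.half"}  # hypothetical glyph name
```

In त्व the lookup fires for त, while a cluster that ends in त plus virama is left untouched even though the lookup table itself would match.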
Arabic
• Right-to-left
• Unicode => logical order
• init, medi, fina, liga
• /ReversedChars
• /ReversedChars BMC
• ( olleH ) Tj
• -200 0 Td
• ( . dlrow ) Tj
• EMC
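Choosing between init, medi and fina depends on whether a letter connects to its neighbours in logical order. The sketch below is a rough approximation that covers only a handful of right-joining letters and ignores marks, ligatures and non-joining characters. The `isol` tag for unconnected letters is added for completeness, and the function name is hypothetical.

```python
# Letters that join only to the preceding letter (partial list):
# alef, dal, thal, reh, zain, waw.
RIGHT_JOINING = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def joining_forms(word):
    """Return one feature tag per letter of a word in logical order."""
    forms = []
    last = len(word) - 1
    for i, ch in enumerate(word):
        # The previous letter must be able to connect forward.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # This letter must be able to connect forward itself.
        joins_next = i < last and ch not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medi")
        elif joins_next:
            forms.append("init")
        elif joins_prev:
            forms.append("fina")
        else:
            forms.append("isol")
    return forms


forms = joining_forms("\u0643\u062A\u0627\u0628")  # kaf, teh, alef, beh
```

The final beh comes out as `isol` because the alef before it does not connect forward.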
Arabic
Why OpenType?
• All non-obligatory font-specific features + positioning
• Many ligatures have no Unicode equivalent: script-specific rules produce too many of them => encode them in lookup tables
• A text can have several correct representations: some glyphs might be present in a font, some might not => too hard to check all options => encode transformations in lookup tables
Conclusions
• OpenType features
• Obligatory (Indic, Arabic shaping)
• Non-obligatory (Latin Swashes, Kerning)
• Unicode preprocessing for complex scripts
• Work in tandem with algorithms and script rules
• PDF + OpenType = solid (and necessary) match, but…
• Td even for showing a single word (vertical positioning)
• /ActualText for complex scripts text extraction (two buffer analogue)
Questions? प्रश्न? أسئلة؟