PP Advanced Typography in PDF


PDF and OpenType technology
The ideal match or an uneasy compromise?
Create a PDF file with text
• Nothing can be simpler

• Choose the right font (Tf)

• Set text matrix (Tm) or move text cursor (Td)

• Convert Unicode chars to PDF characters via encoding (CIDs)

• Output (Tj / TJ) and go for a coffee

This is what you see when you get back if your text is in
Devanagari script:

Below is the correct result.

How many differences can you find between the two?
Devanagari
Tamil: different appearance, same problems
Tamil
What is OpenType
• File format combining TrueType and Type1 outlines?
• Not just this: support for all kinds of scripts, world languages and
their specific features.
Latin
Kerning
Discretionary ligatures
Swashes
Stylistic alternates
PDF Implementation
• Kerning: Text positioning + text showing operators (Tm/Td + Tj),
or TJ operator
• [ (A) 120 (W) 120 (A) 95 (Y again) ] TJ
• Ligatures, swashes, other substitutions: output correct glyph id and
specify /ToUnicode mapping correctly (if you want to be able to
extract the text from PDF afterwards)
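A minimal Python sketch of the TJ approach to kerning, assuming a simple font whose characters can be written as literal strings and a font with 1000 units per em. The function names and the kerning table are hypothetical and only serve the illustration. A TJ number is subtracted from the current position in thousandths of a text space unit, so a negative (tightening) kerning value becomes a positive number in the array.

```python
def pdf_escape(s):
    # Escape the characters that are special inside a PDF literal string.
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

def tj_array(text, kerning, units_per_em=1000):
    """Build a TJ operand from text and a {(left, right): adjustment} table.

    Adjustments are in font units; negative values pull glyphs together.
    """
    items, run = [], ""
    for i, ch in enumerate(text):
        adjust = kerning.get((text[i - 1], ch), 0) if i > 0 else 0
        if adjust:
            # Close the current string and insert the kerning number.
            items.append(f"({pdf_escape(run)})")
            items.append(f"{-adjust * 1000 / units_per_em:g}")
            run = ""
        run += ch
    if run:
        items.append(f"({pdf_escape(run)})")
    return "[ " + " ".join(items) + " ] TJ"

# Hypothetical kerning pairs chosen to reproduce the array shown on the slide.
KERNING = {("A", "W"): -120, ("W", "A"): -120, ("A", "Y"): -95}
print(tj_array("AWAY again", KERNING))
```

With a composite (CID) font the strings would be written as hexadecimal glyph ids instead of literal text, but the structure of the array stays the same.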
/ToUnicode CMAP

• Different glyph ids, same Unicode

• <002a><002a><004b>
• <00f8><00f8><004b>
/ToUnicode CMAP

• Some ligatures have Unicode values, but some do not

• ﬀ (U+FB00): 'LATIN SMALL LIGATURE FF'
• <02c4><02c4><005400480045>
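The mappings above can be generated mechanically. The following Python sketch writes them as a `bfchar` section (one glyph id per line) rather than the `bfrange` form shown on the slide; the function name is hypothetical, and a complete CMap also needs its header, code space range and trailer, which are omitted here. Destination strings are UTF-16BE, so a ligature without its own code point simply maps to several characters.

```python
def to_unicode_bfchar(mapping):
    """Return a bfchar section for a {glyph_id: unicode_text} mapping."""
    lines = [f"{len(mapping)} beginbfchar"]
    for gid, text in sorted(mapping.items()):
        dst = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{gid:04X}> <{dst}>")
    lines.append("endbfchar")
    return "\n".join(lines)

# Two different glyphs for "K" (regular and swash) and a "THE" ligature.
print(to_unicode_bfchar({0x2A: "K", 0xF8: "K", 0x2C4: "THE"}))
```

A real producer would also split long mappings into several sections, since a single section is limited to 100 entries.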
OpenType features
• 'aalt' Access All Alternates
• 'abvf' Above-base Forms
• 'abvm' Above-base Mark Positioning
• 'abvs' Above-base Substitutions
• 'afrc' Alternative Fractions
• 'akhn' Akhands
•…
And ≈130 more!
OpenType features: basic operations
• Substitute glyphs

• Adjust the positions of glyphs

OpenType Layout tag registry
• ≈ 150 Script tags
• For some scripts there are old and new implementations (e.g. deva
and dev2)
• ≈ 500 Language tags
• ≈ 140 Feature tags
• Font developers also may define and register their own features
• How is everything organized?
GSUB and GPOS tables in the OpenType font
• GSUB = Glyph Substitution
• GPOS = Glyph Positioning
• GSUB structure: Script List → Language System Info → Feature List → Lookup List
• GSUB lookup types:
• Single Substitution
• Multiple Substitution
• Alternate Substitution
• Ligature Substitution
• Contextual Substitution
• Chained Contextual Substitution
Example
• Scripts: arab, cyrl, grek
• Language systems: FAR, URD, …, DFLT
• Features: …, init, medi, fina, rlig, dnom, frac, c2sc, smcp
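The chain from script to language system to features to lookups can be sketched in Python as nested dictionaries. This is a simplified model, not the binary table layout: real tables reference features and lookups by index, the lookup numbers below are invented, and the "DFLT" key for the default language system follows the diagram above.

```python
# Hypothetical, heavily simplified GSUB model for illustration only.
GSUB = {
    "scripts": {
        "arab": {
            "DFLT": ["init", "medi", "fina", "rlig"],
            "URD": ["init", "medi", "fina", "rlig"],
        },
    },
    # Feature tag -> indices into the lookup list (invented numbers).
    "features": {"init": [0], "medi": [1], "fina": [2], "rlig": [3, 4]},
}

def lookups_for(table, script, language, enabled):
    """Resolve the lookup indices to apply for a script, language and feature set."""
    lang_systems = table["scripts"].get(script, {})
    # Fall back to the default language system when the language is unknown.
    features = lang_systems.get(language, lang_systems.get("DFLT", []))
    indices = set()
    for tag in features:
        if tag in enabled:
            indices.update(table["features"][tag])
    return sorted(indices)

print(lookups_for(GSUB, "arab", "URD", {"init", "fina"}))
```

Returning the indices in sorted order mirrors the idea that lookups are applied in lookup-list order; a shaping engine for complex scripts additionally groups features into stages.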
• GPOS structure: Script List → Language System Info → Feature List → Lookup List
• GPOS lookup types:
• Single Adjustment
• Pair Adjustment
• Cursive Attachment
• Mark-to-Base Attachment
• Mark-to-Ligature Attachment
• Mark-to-Mark Attachment
Features
• How do we know which features to apply?
• How and when to apply them?
Indic scripts: overview of the algorithm
• Clustering into syllables (Unicode)
• Reordering (Unicode rules, independent of the font)
• Substitutions (OpenType features)
• one to one
• one to many
• many to one
• contextual
• Positioning (OpenType features)
• kerning
• mark positioning
Indic shaping algorithm: Unicode
• Initial
• \u091A\u093F\u0928\u094D\u0939\u0947
• Clustering into syllables
• \u091A\u093F\u0928\u094D\u0939\u0947
• Reordering
• \u093F\u091A\u0928\u094D\u0939\u0947
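The clustering and reordering steps can be sketched in a few lines of Python. This is a deliberately naive model, assuming Devanagari only: a new cluster starts at a consonant or independent vowel that does not follow a virama, and the only reordering handled is moving the pre-base vowel sign I (U+093F) to the front of its cluster. Real shaping engines implement many more rules; all function names here are hypothetical.

```python
VIRAMA = "\u094D"
PRE_BASE_MATRA = "\u093F"  # vowel sign I, drawn before the base consonant

def is_base(ch):
    # Independent vowels (U+0904..U+0914) and consonants (U+0915..U+0939).
    return "\u0904" <= ch <= "\u0939"

def cluster(text):
    """Split text into syllable clusters (naive rule, see lead-in)."""
    clusters = []
    for i, ch in enumerate(text):
        if not clusters or (is_base(ch) and text[i - 1] != VIRAMA):
            clusters.append(ch)
        else:
            clusters[-1] += ch
    return clusters

def reorder(syllable):
    """Move the pre-base matra to the start of its cluster."""
    if PRE_BASE_MATRA in syllable:
        return PRE_BASE_MATRA + syllable.replace(PRE_BASE_MATRA, "", 1)
    return syllable

word = "\u091A\u093F\u0928\u094D\u0939\u0947"
syllables = cluster(word)
visual = "".join(reorder(s) for s in syllables)
print(syllables, visual.encode("unicode_escape").decode())
```

For the word on the slide this yields two clusters, and the reordered string starts with U+093F followed by U+091A, matching the sequence shown above.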
Indic shaping algorithm: OpenType
• Initial

• GSUB

• GPOS
GSUB features implementation
• Single substitution:
• One to one
• Replacement glyph might not have Unicode value (swashes)
• Remember Unicode value and replace glyph id.
• Multiple substitution:
• One to many
• Do not confuse with Unicode decomposition
• Same approach
• How to enable copying? (/ToUnicode)
GSUB features implementation
• Alternate substitution:

• One to one of many

• Same approach
• Ligature substitution
• Many to one
• Same approach and define /ToUnicode as described before
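A Python sketch of the "remember the Unicode value and replace the glyph id" approach, assuming each glyph in the buffer carries the text it stands for. The `Glyph` class, the function names and the glyph ids for T, H and E are hypothetical; only the swash and ligature ids echo the /ToUnicode examples earlier in the deck.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    gid: int   # glyph id written to the content stream
    text: str  # Unicode characters this glyph stands for

def apply_single(glyphs, table):
    """One to one: swap the glyph id, keep the remembered text."""
    return [Glyph(table.get(g.gid, g.gid), g.text) for g in glyphs]

def apply_ligatures(glyphs, ligatures):
    """Many to one: merge matching glyph runs and concatenate their text."""
    rules = sorted(ligatures.items(), key=lambda r: len(r[0]), reverse=True)
    out, i = [], 0
    while i < len(glyphs):
        for seq, lig_gid in rules:  # longest rule first
            window = glyphs[i:i + len(seq)]
            if tuple(g.gid for g in window) == seq:
                out.append(Glyph(lig_gid, "".join(g.text for g in window)))
                i += len(seq)
                break
        else:
            out.append(glyphs[i])
            i += 1
    return out

def to_unicode_map(glyphs):
    """Collect the glyph id -> text pairs needed for /ToUnicode."""
    return {g.gid: g.text for g in glyphs}

the = [Glyph(0x37, "T"), Glyph(0x2B, "H"), Glyph(0x28, "E")]
shaped = apply_ligatures(the, {(0x37, 0x2B, 0x28): 0x2C4})
print(to_unicode_map(shaped))
```

Because the text travels with the glyph, the /ToUnicode entries fall out of the shaped buffer without a second pass over the font.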
GPOS features
• Single adjustments: superscript or subscript
• Pair adjustments: kerning
• Cursive attachment: connect glyphs with attachment points
• MarkToBase attachment: position mark characters with respect to
base glyph
• MarkToLigature attachment: associate mark with one of the ligature
glyph’s components
• MarkToMark attachment: attach one mark to another
GPOS features implementation
• Placement
• Advance
• Glyph attachment points
• Offset to attaching point
GPOS features implementation

[Figure: glyph diagram labelling X placement, Y placement, X advance and Y advance]
GPOS features implementation
• Check if current glyph has placement
• Move the cursor to the position of the glyph the current glyph is attached to
(Td)
• Apply xPlacement and yPlacement to move the origin to the anchor (Td)
• Show current glyph (Tj / TJ)
• Roll back the cursor to the initial position (Td)
• Apply xAdvance and yAdvance (Td)
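The steps above can be sketched in Python as follows. The sketch assumes placements are offsets from the current pen position and advances are total advances, both in font units with 1000 units per em, and that the font and the start of the line have already been set. Since Td moves relative to the start of the current line rather than to the point reached after Tj, the code tracks both positions and emits the difference, which folds the "roll back" and "advance" steps into one Td. The `Positioned` type and the glyph values are hypothetical.

```python
from typing import NamedTuple

class Positioned(NamedTuple):
    gid: int
    x_placement: int = 0
    y_placement: int = 0
    x_advance: int = 0
    y_advance: int = 0

def emit_text_ops(glyphs, font_size, units_per_em=1000):
    """Return Td/Tj operators that honour placement and advance values."""
    scale = font_size / units_per_em
    ops = []
    pen_x = pen_y = 0    # where the next glyph goes before placement
    line_x = line_y = 0  # origin established by the last Td
    for g in glyphs:
        gx, gy = pen_x + g.x_placement, pen_y + g.y_placement
        if (gx, gy) != (line_x, line_y):
            ops.append(f"{(gx - line_x) * scale:g} {(gy - line_y) * scale:g} Td")
            line_x, line_y = gx, gy
        ops.append(f"<{g.gid:04X}> Tj")
        pen_x += g.x_advance
        pen_y += g.y_advance
    return ops

glyphs = [
    Positioned(0x39, x_advance=600),
    Positioned(0x1C4, x_placement=-300, y_placement=200),  # mark over the base
    Positioned(0x27, x_advance=500),
]
print("\n".join(emit_text_ops(glyphs, font_size=10)))
```

The mark is shown at an offset without moving the pen, so the following base glyph continues from where the first one ended.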
GPOS features implementation
• Horizontal placement => Tj + Td can be replaced with TJ
• Vertical placement (yPlacement != 0 or yAdvance != 0) =>
TJ is not enough => need to use Td
• Might be a problem for text extraction
Back to /ToUnicode

The underlying Unicode sequence is:

\u0935\u0930\u094D\u0923\u094B\u0902
Content stream glyph ids: 39, 27, 1C4
???
\u0935\u0930\u094D\u0923\u094B\u0902
Two buffer approach
• Text editors keep two buffers
• Buffer with Unicode string
• Buffer with glyph ids
• Easy correspondence for non-breakable parts (syllables)
• Cursor goes over syllables
• Windows: Uniscribe
• Linux: ICU - International Components for Unicode
Two syllables

• Cursor in your browser knows that!

• \u0935\u0930\u094D\u0923\u094B\u0902
PDF Approach
• Have only content stream and glyphs written there
• /ToUnicode
• Have to map all the glyphs to Unicode characters
1C4 ??

1C4

\u0935\u0930\u094D\u0923\u094B\u0902
Indic does not work that way
How to map glyphs to Unicode?
• 39, 27, 1C4
• \u0935\u0930\u094D\u0923\u094B\u0902
• 39 <-> u0935
• 27 <-> u0923
• 1C4 <-> ???
• Easy without reordering, but not in our case
• \u0935|\u0930\u094D\u0923|\u094B\u0902
• Incorrect when copying single glyphs
• Incorrect when adding new words
How to map glyphs to Unicode?
• 39, 27, 1C4
• 27 <-> u0923
• \u0935|\u0930\u094D\u0923|\u094B\u0902
• \u0935\u0930\u094D\u0917\u0940\u0915\u0930\u0923
• Extra chars will be copied along with the word
• \u0935\u0930\u094D\u0917\u0940\u0915\u0930\u0930\u094D\u0923
• Challenge for most of the PDF producers even today
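A small Python sketch of why neither mapping survives text extraction. It simulates an extractor that concatenates the /ToUnicode value of each glyph in content-stream order. The first table splits the word as on the slide; the second keeps glyph 27 mapped to its own character and, as an assumption made only for this illustration, assigns the remaining characters to glyph 1C4.

```python
WORD = "\u0935\u0930\u094D\u0923\u094B\u0902"
GLYPHS = [0x39, 0x27, 0x1C4]

# Split mapping: correct for the whole word, wrong for single glyphs.
SPLIT = {0x39: "\u0935", 0x27: "\u0930\u094D\u0923", 0x1C4: "\u094B\u0902"}

# Per-glyph mapping: glyph 27 is right on its own, but the order is lost.
PER_GLYPH = {0x39: "\u0935", 0x27: "\u0923", 0x1C4: "\u094B\u0930\u094D\u0902"}

def extract(glyph_ids, to_unicode):
    """Naive extraction: concatenate the mapped text in glyph order."""
    return "".join(to_unicode[gid] for gid in glyph_ids)

print(extract(GLYPHS, SPLIT) == WORD)           # whole word round-trips
print(extract([0x27], SPLIT) == "\u0923")       # single glyph does not
print(extract(GLYPHS, PER_GLYPH) == WORD)       # reordered characters
```

Because one /ToUnicode table serves every use of a glyph in the font, the split mapping also leaks extra characters into any other word that contains glyph 27.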
/ActualText comes to save us
• Can be specified for content that does translate into text but that is
represented in a nonstandard way (ISO 32000-1)
• Replacement text can be specified for the following items:
• A structure element, by means of the optional ActualText entry (PDF 1.4) of
the structure element dictionary.
• A marked-content sequence, through an ActualText entry in a property list
attached to the marked-content sequence with a Span tag.
/ActualText comes to save us
• \u0935\u0930\u094D\u0923\u094B\u0902
• [39, 27, 1C4]
• /ToUnicode CMAP:
• 39 <-> u0935
• 27 <-> u0923
• 1C4 <-> \u094B\u0930\u094D\u0902

/Span <</ActualText <FEFF 0930 094D 0923 094B 0902> >> BDC
<002701C4>Tj
EMC
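A Python sketch that produces a marked-content sequence like the one above, assuming two-byte glyph ids and a text string written as UTF-16BE with a byte order mark. The function name is hypothetical.

```python
def actual_text_span(text, glyph_ids):
    """Wrap a glyph run in a /Span carrying its replacement text."""
    hex_text = "FEFF" + text.encode("utf-16-be").hex().upper()
    hex_glyphs = "".join(f"{gid:04X}" for gid in glyph_ids)
    return (
        f"/Span <</ActualText <{hex_text}> >> BDC\n"
        f"<{hex_glyphs}> Tj\n"
        "EMC"
    )

print(actual_text_span("\u0930\u094D\u0923\u094B\u0902", [0x27, 0x1C4]))
```

Wrapping one syllable per span keeps the replacement text aligned with the unbreakable unit, which is the closest PDF analogue of the two-buffer approach.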
/ActualText
• Not supported in many PDF viewers
• Problems with determining spaces when extracting text
Features + algorithms
• Lookup tables don’t know script rules
• Half characters
• त + व = त्व tva
• ण + ढ = ण्ढ ṇḍha
• स + थ = स्थ stha
• Don’t blindly apply all features
• Set up masks for features during preprocessing
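One way to picture feature masks is a per-character bit field filled in during preprocessing, so that a lookup only fires where the script rules allow it. The Python sketch below is an assumption-laden illustration, not a complete Indic rule set: it marks a consonant plus virama followed by another consonant for the half-form feature, and treats an initial Ra plus virama as a reph candidate instead. The flag names follow the registered `half` and `rphf` feature tags; everything else is hypothetical.

```python
from enum import IntFlag

class Feature(IntFlag):
    NONE = 0
    HALF = 1  # 'half': half forms
    RPHF = 2  # 'rphf': reph form

VIRAMA = "\u094D"
RA = "\u0930"

def is_consonant(ch):
    return "\u0915" <= ch <= "\u0939"

def feature_masks(syllable):
    """Return one mask per character of a single syllable cluster."""
    masks = [Feature.NONE] * len(syllable)
    for i, ch in enumerate(syllable):
        joins_next = (
            is_consonant(ch)
            and i + 2 < len(syllable)
            and syllable[i + 1] == VIRAMA
            and is_consonant(syllable[i + 2])
        )
        if joins_next:
            flag = Feature.RPHF if (i == 0 and ch == RA) else Feature.HALF
            masks[i] |= flag      # the consonant
            masks[i + 1] |= flag  # its virama
    return masks

print(feature_masks("\u0924\u094D\u0935"))
```

A shaper would then apply the lookups of a feature only to glyphs whose mask contains that feature's bit.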
Arabic
• Right-to-left
• Unicode => logical order
• init, medi, fina, liga
• /ReversedChars
• /ReversedChars BMC
• ( olleH ) Tj
• -200 0 Td
• ( . dlrow ) Tj
• EMC
Arabic
Why OpenType?
• All non-obligatory font-specific features + positioning
• Many ligatures do not have Unicode equivalent as there are too many
of them because of script-specific rules => encode them in lookup
tables
• Different correct representations of a text: some glyphs might be
present in a font, some may not => too hard to check all options =>
encode transformations in lookup tables
Conclusions
• OpenType features
• Obligatory (Indic, Arabic shaping)
• Non-obligatory (Latin Swashes, Kerning)
• Unicode preprocessing for complex scripts
• Work in tandem with algorithms and script rules
• PDF + OpenType = solid (and necessary) match, but…
• Td even for showing a single word (vertical positioning)
• /ActualText for complex scripts text extraction (two buffer analogue)
References
• OpenType specification - https://www.microsoft.com/en-us/Typography/OpenTypeSpecification.aspx
• Microsoft Typography - https://www.microsoft.com/en-us/Typography/default.aspx
• FontForge Open Source tool - https://fontforge.github.io
• OpenType CookBook - http://opentypecookbook.com/index.html
Questions? प्रशन? ?‫أسئلة‬

• Benoît Lagae

• Alexey Subach
