
ucharclasses v2.4
Mike “Pomax” Kamermans
February 28, 2021

Contents
1 Introduction

2 Use
2.1 Overriding ucharclass transitions

3 Problems with RTL languages

4 Commands
4.1 \setTransitionTo[2]
4.2 \setTransitionFrom[2]
4.3 \setTransitions[3]
4.4 \setTransitionsForXXXX[2]
4.5 \setDefaultTransitions[2]

5 Code

6 Package options and Unicode blocks

1 Introduction
Sometimes you don't want to have to bother with font switching just because you're using
languages that are distinct enough to use different Unicode blocks, but aren't covered
by the polyglossia package. Where normal word processing packages such as MS, Star-
or OpenOffice pretty much handle this for you, LaTeX (because it needs you to tell it what
to do) has no default behaviour for this, and so we arrive at a need for a package that does
this for us. You already discovered that regular LaTeX has no native understanding of
Unicode (it grew up with 7-bit input and still processes text byte by byte), and ended up
going for Xe(La)TeX as your TeX compiler of choice, which means
you now have two excellent resources available: fontspec, and ucharclasses.
The first of these lets you pick fonts based on what your system calls them, without
needing to rewrite them as MetaFont files. This is convenient. This is good. The second
lets you define what should happen when we change from a character in one Unicode
block to a character in another. This is also convenient, and paired with fontspec it offers
automatic font switching in the same way that normal Office applications take care of this
for you. With one big difference: you stay in control. In an Office application, if at some
point you need the switch to follow a completely different rule, that's just too bad for
you. In Xe(La)TeX, you stay on top of things and still get to say exactly what happens,
and when.
For instance, this document has no explicit font commands in the text itself. Instead, there
are a few Unicode block transition rules defined, which all say “when entering block
..., use fontspec to change the font to ...”. As such, typesetting the following list in the
appropriate fonts just works:

・ English: This is an English phrase (using Palatino Linotype)
・ Japanese: 日本語が分かりますか (using Ume Mincho)
・ Thai: คุณพูดภาษาอังกฤษได้ไหม (using IrisUPC)
・ Sinhala: කරැණාකරල ඒක නැවත කියන්න පුළුවන්ද (using Iskoola Pota)
・ Malayalam: നിങ്ങളുെട േപെരന്താണ്? (using Arial Unicode MS)
・ and even domino tiles, 🁇 🀼 🁐 🁋 🁚 🁝 (using Segoe UI Symbol),
・ or mahjong tiles: 🀑 🀑 🀑 🀒 🀒 🀒 🀕 🀕 🀕 🀗 🀗 🀗 🀅 🀅 (using Segoe UI Emoji)

However, be aware that this only “just works” for Unicode blocks. If you are working
with typographically overlapping languages, such as combining English and Vietnamese
in one document, things get a lot more complex if you want one font for English and
another for Vietnamese. Both of these languages use the Latin blocks, so it is inherently
impossible to tell which language is intended based on which Unicode block a character
in a word belongs to.
As an example, this document uses one rule for applying a font for general CJK, and
an override with a different font for all Japanese-specific CJK characters. This causes a
problem for Chinese, because both Japanese and Chinese mostly use characters from the
"CJK Unified Ideographs" block, but most Japanese fonts contain fewer characters than
are necessary to typeset Chinese:

・ Chinese, using the Japanese CJK font, which may have gaps: 我的母□是□□ (uses
Ume Mincho, which does not contain the three Chinese-specific characters used in
that phrase, so they appear here as missing glyphs)

We can get around this by explicitly setting the font to one that supports Chinese,
turning off the switching rules for the stretch of Chinese text, using {\uccoff + a fontspec
rule + the text we wanted to typeset + \uccon}. This gives us: 我的母语是汉语 (This now
explicitly uses Han Nom A).
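
In source form, that override might look roughly like the sketch below. It assumes a font family \cjkfont that points at Han Nom A, as in the preamble shown in the Use section; any font that has the required glyphs would do.

```latex
% Switch the transition rules off, pick the font by hand, switch them back on.
% \cjkfont is assumed to be defined via \newfontfamily{\cjkfont}{HAN NOM A}.
{\uccoff\cjkfont 我的母语是汉语\uccon}
```

The surrounding braces keep the manual font choice local to this one phrase.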

2 Use
In order to get this all to work, the only thing that had to be indicated was a set of
transition rules in the preamble:

\usepackage{fontspec}

\newfontfamily{\defaultfont}{Code2000}
\newfontfamily{\latinfont}{Palatino Linotype}
\newfontfamily{\cjkfont}{HAN NOM A}
\newfontfamily{\japanesefont}{Ume Mincho}
\newfontfamily{\unifiedCJKfont}{SimSun-ExtB}
\newfontfamily{\thaifont}{IrisUPC}
\newfontfamily{\sinhalafont}{Iskoola Pota}
\newfontfamily{\malayalamfont}{Arial Unicode MS}
\newfontfamily{\dominofont}{Segoe UI Symbol}
\newfontfamily{\mahjongfont}{Segoe UI Emoji}

\usepackage[CJK, Latin, Thai, Sinhala, Malayalam,
DominoTiles, MahjongTiles]{ucharclasses}

\setDefaultTransitions{\defaultfont}{}

\setTransitionsForLatin{\latinfont}{}
\setTransitionsForCJK{\cjkfont}{}
\setTransitionsForJapanese{\japanesefont}{}
\setTransitionTo{CJKUnifiedIdeographsExtensionB}{\unifiedCJKfont}
\setTransitionTo{Thai}{\thaifont}
\setTransitionTo{Sinhala}{\sinhalafont}
\setTransitionTo{Malayalam}{\malayalamfont}
\setTransitionTo{DominoTiles}{\dominofont}
\setTransitionTo{MahjongTiles}{\mahjongfont}

By default, ucharclasses is agnostic with regard to what you want inserted at the start
or end of Unicode blocks, so while using this package for font switching is the most
obvious application, you could also use it for far more creative purposes.
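
As one possible illustration of a non-font use, the sketch below colours every run of Thai text red. It is a minimal example under assumptions, not a recommended recipe: the xcolor package and the grouping with \begingroup/\endgroup are choices made here, and the font name is simply the one this document already uses for Thai.

```latex
% Compile with XeLaTeX. A sketch only: colour every run of Thai text red.
\documentclass{article}
\usepackage{fontspec}
\usepackage{xcolor}                   % assumption: used here for \color
\newfontfamily{\thaifont}{IrisUPC}    % font name taken from this document
\usepackage[Latin, Thai]{ucharclasses}
% Entering Thai: open a group, select font and colour.
% Leaving Thai: close the group again.
\setTransitions{Thai}{\begingroup\thaifont\color{red}}{\endgroup}
\begin{document}
English, then คุณพูดภาษาอังกฤษได้ไหม , then English again.
\end{document}
```

Because only one piece of code can be attached to each pair of classes, a later rule for a neighbouring block can replace the exit code for that pair and leave the group unbalanced, so this kind of trick is best kept to simple cases.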

2.1 Overriding ucharclass transitions
If you need to “override” ucharclasses transition rules (for instance, you want a custom font
for a bit of cross-Unicode-block text), you will want to temporarily disable and re-enable
XeTeX's interchar token state. You can do this in three ways:

1. call [\XeTeXinterchartokenstate = 0] before, and [\XeTeXinterchartokenstate = 1] after
you're done,
2. call the macros \disableTransitionRules before, and \enableTransitionRules after
you're done, or
3. call \uccoff before, and \uccon after you're done.

This last option is mainly there because it's nice and short, and is more convenient in
a scoped environment {\uccoff such as this\uccon} where you only want to override the
transition behaviour within a paragraph. If you need it disabled for a few blocks of text
instead, the full-name commands are probably a better choice, because they make your .tex
more readable. As the base XeTeX command uses the un-LaTeX-like “... = ...” construction,
it's best to avoid it outside of the preamble (and when using ucharclasses, it should not be
in the preamble at all).
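
Side by side, the three forms might be used as in the following sketch. The placeholder text and the \cjkfont family are assumptions for illustration; note that the XeTeX primitive is spelled \XeTeXinterchartokenstate.

```latex
% 1. Raw XeTeX primitive: works, but does not read like LaTeX.
\XeTeXinterchartokenstate=0
Text typeset without any transition rules.
\XeTeXinterchartokenstate=1

% 2. Full-name macros: suited to longer stretches of text.
\disableTransitionRules
Several paragraphs in which fonts are chosen by hand.
\enableTransitionRules

% 3. Short forms inside a group: suited to a few words in a paragraph.
{\uccoff\cjkfont 我的母语是汉语\uccon}
```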

3 Problems with RTL languages
The overlapping block problem is especially notable when using RTL/LTR rules for lan‑
guages such as Arabic or Hebrew. While you would want to be able to specify something
along the lines of:

\setTransitionsForArabics{\arabicfont\setRTL}{\setLTR}

this will not work, because Arabic (and Hebrew, and other RTL languages) has things
like spaces in it. Going from Arabic to a space character “leaves” the Arabic block, so the
transition rule for leaving the Arabic block is applied. Rather than ending up with a full
sentence that starts with \setRTL, then the Arabic text, and then finally \setLTR, every
word in the Arabic sentence is wrapped in \setRTL and \setLTR, which gets the
typesetting all wrong.
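
To make the failure concrete, here is roughly what XeTeX ends up processing for a two-word Arabic phrase under that rule. This is a schematic view of the inserted tokens, not something you would type; <word1> and <word2> stand for the Arabic words, and \arabicfont is the font family assumed by the rule above.

```latex
% Input as typed:          <word1> <word2>
% What gets processed once the transition rules have fired (schematic):
\arabicfont\setRTL <word1>\setLTR{} \arabicfont\setRTL <word2>\setLTR
```

Each word becomes its own right-to-left island, and the space between them is typeset left-to-right.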
If you need script support, rather than Unicode blocks, you may want to have a look
at the polyglossia package instead. You can try to combine the two packages by relying
on \uccoff and \uccon to turn off Unicode block transitions inside regions
of text, but this may not always work, or may have interesting interaction side-effects.

4 Commands
4.1 \setTransitionTo[2]
This command has two arguments:

1. The name of the Unicode class to which the transition should apply (see 'Unicode
blocks' list)
2. The code you want used when entering this Unicode block

4.2 \setTransitionFrom[2]
This command has two arguments:

1. The name of the Unicode class to which the transition should apply (see 'Unicode
blocks' list)
2. The code you want used when exiting this Unicode block

4.3 \setTransitions[3]
This command has three arguments:

1. The name of the Unicode class to which the transition should apply (see 'Unicode
blocks' list)
2. The code you want used when entering this Unicode block
3. The code you want used when exiting this Unicode block
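
Going by the argument descriptions above, the three commands relate to each other as in this sketch. The font families \thaifont and \latinfont are assumed to have been defined with \newfontfamily beforehand.

```latex
% One call that sets both directions for the Thai block ...
\setTransitions{Thai}{\thaifont}{\latinfont}

% ... should have the same effect as these two calls together:
\setTransitionTo{Thai}{\thaifont}      % code used when entering Thai
\setTransitionFrom{Thai}{\latinfont}   % code used when exiting Thai
```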

4.4 \setTransitionsForXXXX[2]
There are a number of these commands, pertaining to particular “informal groups”:
collections of Unicode blocks which can be considered part of a single meta-block. Available
informal groups (the names of which replace the XXXX in the command name above)
are:

・ Arabics
・ CanadianSyllabics
・ CherokeeFull
・ Chinese
・ CJK
・ Cyrillics
・ Diacritics
・ EthiopicFull
・ GeorgianFull
・ Greek
・ Korean
・ Japanese
・ Latin
・ Mathematics
・ MongolianFull
・ MyanmarFull
・ Phonetics
・ Punctuation
・ SundaneseFull
・ Symbols
・ SyriacFull
・ Yi

Furthermore, these commands have two arguments:

1. The code you want used when entering blocks from the command's informal group
2. The code you want used when exiting blocks from the command's informal group
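
Since each rule overwrites whatever was set earlier for the same blocks, the order of group rules matters. The sketch below repeats the pattern used in this document's own preamble: a broad rule first, then a narrower override. Which blocks the Japanese group actually covers is defined by the package and not spelled out here.

```latex
% Broad rule first: every block in the CJK group switches to \cjkfont on entry.
\setTransitionsForCJK{\cjkfont}{}

% Narrower rule second: blocks in the Japanese group now switch to
% \japanesefont instead, replacing what the CJK rule set for those blocks.
\setTransitionsForJapanese{\japanesefont}{}
```

Reversing the two lines would let the CJK rule win for any block that belongs to both groups.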

4.5 \setDefaultTransitions[2]
This is a blanket command that lets you set up the same to and from transition rules for
all blocks in one go. It has (fairly obviously) two arguments:

1. The code you want used when entering any Unicode block
2. The code you want used when exiting any Unicode block

5 Code
The code relies on running through individual definition blocks for each Unicode block,
conditional on whether ucharclasses is loaded with package options or not:

...
\newif\if@overrideClassLoading
\newcommand{\overrideClassLoading}{\@overrideClassLoadingtrue
\let\overrideClassLoading\relax}

\def\do#1#2#3{\DeclareOption{#1}%
{\overrideClassLoading\expandafter\let\csname enable#1\endcsname\@empty}}
% We execute the list with this definition of \do
\AllClasses
...

The classes are automatically numbered by using the \newXeTeXintercharclass command,
and every time a new class is defined, the class counter goes up. After all desired
classes have been defined, the code iterates over the class numbers from lower bound to
upper bound.
The block loading code is defined as follows:

\chardef\@classstart=\xe@alloc@intercharclass

\providecommand\@gobblethree[3]{}
% Define a class only if it was enabled; otherwise discard its three
% arguments (name, first code point, last code point).
\def\do#1{%
  \ifcsname enable#1\endcsname
    \expandafter\@defineUnicodeClass
  \else
    \expandafter\@gobblethree
  \fi{#1}}

% #1 = block name, #2 = first code point, #3 = last code point
\def\@defineUnicodeClass#1#2#3{%
  \if@ucharclassverbose\typeout{Defining #1 Class}\fi
  \expandafter\newXeTeXintercharclass\csname #1Class\endcsname
  \count@=#2
  \loop
    \if@ucharclassverbose
      \typeout{\XeTeXcharclass\number\count@=
        \expandafter\string\csname #1Class\endcsname}%
    \fi
    \XeTeXcharclass\count@=\csname #1Class\endcsname
    \ifnum\count@<#3
      \advance\count@\@ne
  \repeat
}
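
For a single block, the loop above boils down to something like the following stand-alone sketch. It uses the Thai block (U+0E00 to U+0E7F) as an example; the names \MyThaiClass and \MyCodePoint are hypothetical, and the sketch deliberately does not load ucharclasses.

```latex
% Compile with XeLaTeX. Stand-alone sketch of what defining one class involves.
\documentclass{article}
\newXeTeXintercharclass\MyThaiClass   % allocate the next free class number
\newcount\MyCodePoint                 % scratch counter (hypothetical name)
\MyCodePoint="0E00                    % first code point of the Thai block
\loop
  \XeTeXcharclass\MyCodePoint=\MyThaiClass  % put this code point in the class
  \ifnum\MyCodePoint<"0E7F            % last code point of the Thai block
    \advance\MyCodePoint 1
\repeat
\begin{document}
Every code point in the Thai block now belongs to one interchar class.
\end{document}
```

On its own this changes nothing visible; it only prepares the class so that \XeTeXinterchartoks rules can refer to it.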

And the transition commands are defined as follows:
% Set both the entry code (#2) and the exit code (#3) for class #1,
% against every other class and against the boundary class.
\def\setTransitionsFor#1#2#3{%
  \ifcsname enable#1\endcsname
    \count@=\@classstart
    \loop\ifnum\count@<\@classend
      \advance\count@\@ne
      \ifnum\count@=\csname #1Class\endcsname\else
        \XeTeXinterchartoks\count@ \csname #1Class\endcsname={#2}%
        \XeTeXinterchartoks\csname #1Class\endcsname \count@={#3}%
      \fi
    \repeat
    \XeTeXinterchartoks\@ucharclass@boundary\csname #1Class\endcsname={#2}%
    \XeTeXinterchartoks\csname #1Class\endcsname\@ucharclass@boundary={#3}%
  \else
    \if@ucharclassverbose
      \PackageWarningNoLine{ucharclasses}{Class #1\MessageBreak
        not loaded}%
    \fi
  \fi
}

% Set only the entry code (#2) for class #1.
\def\setTransitionTo#1#2{%
  \ifcsname enable#1\endcsname
    \count@=\@classstart
    \loop\ifnum\count@<\@classend
      \advance\count@\@ne
      \ifnum\count@=\csname #1Class\endcsname\else
        \XeTeXinterchartoks\count@ \csname #1Class\endcsname={#2}%
      \fi
    \repeat
    \XeTeXinterchartoks\@ucharclass@boundary\csname #1Class\endcsname={#2}%
  \else
    \if@ucharclassverbose
      \PackageWarningNoLine{ucharclasses}{Class #1\MessageBreak
        not loaded}%
    \fi
  \fi
}

% Set only the exit code (#2) for class #1.
\def\setTransitionFrom#1#2{%
  \ifcsname enable#1\endcsname
    \count@=\@classstart
    \loop\ifnum\count@<\@classend
      \advance\count@\@ne
      \ifnum\count@=\csname #1Class\endcsname\else
        \XeTeXinterchartoks\csname #1Class\endcsname \count@={#2}%
      \fi
    \repeat
    \XeTeXinterchartoks\csname #1Class\endcsname\@ucharclass@boundary={#2}%
  \else
    \if@ucharclassverbose
      \PackageWarningNoLine{ucharclasses}{Class #1\MessageBreak
        not loaded}%
    \fi
  \fi
}

The broad-level \setTransitionsFor(InformalGroupName)[2] commands are essentially
wrapper commands, calling \setTransitionsFor for each block that is in the informal
group. The Arabics group, for instance, uses the following code:

\def\doclass#1{%
\expandafter\noexpand\csname setTransitionsFor#1\endcsname{####1}{####2}}
\begingroup\edef\x{\endgroup
\noexpand\newcommand\noexpand\setDefaultTransitions[2]{%
\ClassGroups}}\x

...

\doclass{Arabics}

6 Package options and Unicode blocks
The following Unicode blocks are available for use in transition rules (corresponding to
Unicode version 10.0), as well as for use as package options when you want ucharclasses
to only load those classes that you know are used in your document.
Starting with XeTeX version 0.99994 (available in TeXLive 2016), the number of
\XeTeXcharclass registers was extended from 256 to 4096; some less important blocks are
thus provided only for this and newer versions. In the list below, those blocks are shown
in parentheses.

・ (Adlam) ・ Chakma
・ AegeanNumbers ・ Cham
・ (Ahom) ・ Cherokee
・ AlchemicalSymbols ・ CherokeeSupplement
・ AlphabeticPresentationForms ・ (ChessSymbols)
・ (AnatolianHieroglyphs) ・ (Chorasmian)
・ AncientGreekMusicalNotation ・ CJKCompatibility
・ AncientGreekNumbers ・ CJKCompatibilityForms
・ AncientSymbols ・ CJKCompatibilityIdeographs
・ Arabic ・ CJKCompatibilityIdeographsSupplement
・ ArabicExtendedA ・ CJKRadicalsSupplement
・ ArabicMathematicalAlphabeticSymbols ・ CJKStrokes
・ ArabicPresentationFormsA ・ CJKSymbolsAndPunctuation
・ ArabicPresentationFormsB ・ CJKUnifiedIdeographs
・ ArabicSupplement ・ CJKUnifiedIdeographsExtensionA
・ Armenian ・ CJKUnifiedIdeographsExtensionB
・ Arrows ・ CJKUnifiedIdeographsExtensionC
・ Avestan ・ CJKUnifiedIdeographsExtensionD
・ Balinese ・ CJKUnifiedIdeographsExtensionE
・ Bamum ・ CJKUnifiedIdeographsExtensionF
・ BamumSupplement ・ CJKUnifiedIdeographsExtensionG
・ BasicLatin ・ CombiningDiacriticalMarks
・ BassaVah ・ CombiningDiacriticalMarksExtended
・ Batak ・ CombiningDiacriticalMarksForSymbols
・ Bengali ・ CombiningDiacriticalMarksSupplement
・ (Bhaiksuki) ・ CombiningHalfMarks
・ BlockElements ・ CommonIndicNumberForms
・ Bopomofo ・ ControlPictures
・ BopomofoExtended ・ Coptic
・ BoxDrawing ・ CopticEpactNumbers
・ Brahmi ・ CountingRodNumerals
・ BraillePatterns ・ Cuneiform
・ Buginese ・ CuneiformNumbersAndPunctuation
・ Buhid ・ CurrencySymbols
・ ByzantineMusicalSymbols ・ CypriotSyllabary
・ (Carian) ・ Cyrillic
・ CaucasianAlbanian ・ CyrillicExtendedA
・ CyrillicExtendedB ・ Hanunoo
・ CyrillicExtendedC ・ (Hatran)
・ CyrillicSupplement ・ Hebrew
・ Deseret ・ Hiragana
・ Devanagari ・ IdeographicDescriptionCharacters
・ DevanagariExtended ・ IdeographicSymbolsAndPunctuation
・ Dingbats ・ ImperialAramaic
・ (DivesAkuru) ・ (IndicSiyaqNumbers)
・ (Dogra) ・ InscriptionalPahlavi
・ DominoTiles ・ InscriptionalParthian
・ (Duployan) ・ IPAExtensions
・ (EarlyDynasticCuneiform) ・ Javanese
・ EgyptianHieroglyphs ・ Kaithi
・ (EgyptianHieroglyphFormatControls) ・ KanaExtendedA
・ Elbasan ・ KanaSupplement
・ (Elymaic) ・ Kanbun
・ Emoticons ・ KangxiRadicals
・ EnclosedAlphanumerics ・ Kannada
・ EnclosedAlphanumericSupplement ・ Katakana
・ EnclosedCJKLettersAndMonths ・ KatakanaPhoneticExtensions
・ EnclosedIdeographicSupplement ・ KayahLi
・ Ethiopic ・ Kharoshthi
・ EthiopicExtended ・ (KhitanSmallScript)
・ EthiopicExtendedA ・ Khmer
・ EthiopicSupplement ・ KhmerSymbols
・ GeneralPunctuation ・ Khojki
・ GeometricShapes ・ Khudawadi
・ GeometricShapesExtended ・ Lao
・ Georgian ・ LatinExtendedAdditional
・ GeorgianExtended ・ LatinExtendedA
・ GeorgianSupplement ・ LatinExtendedB
・ Glagolitic ・ LatinExtendedC
・ GlagoliticSupplement ・ LatinExtendedD
・ Gothic ・ LatinExtendedE
・ Grantha ・ LatinSupplement
・ GreekAndCoptic ・ Lepcha
・ GreekExtended ・ LetterlikeSymbols
・ Gujarati ・ Limbu
・ (GunjalaGondi) ・ LinearA
・ Gurmukhi ・ LinearBIdeograms
・ HalfwidthAndFullwidthForms ・ LinearBSyllabary
・ HangulCompatibilityJamo ・ Lisu
・ HangulJamo ・ (LisuSupplement)
・ HangulJamoExtendedA ・ Lycian
・ HangulJamoExtendedB ・ Lydian
・ HangulSyllables ・ Mahajani
・ (HanifiRohingya) ・ MahjongTiles
・ (Makasar) ・ (OldSogdian)
・ Malayalam ・ (OldSouthArabian)
・ Mandaic ・ (OldTurkic)
・ Manichaean ・ OpticalCharacterRecognition
・ (Marchen) ・ Oriya
・ (MasaramGondi) ・ OrnamentalDingbats
・ MathematicalAlphanumericSymbols ・ (Osage)
・ MathematicalOperators ・ Osmanya
・ (MayanNumerals) ・ (OttomanSiyaqNumbers)
・ (Medefaidrin) ・ PahawhHmong
・ MeeteiMayek ・ Palmyrene
・ MeeteiMayekExtensions ・ PauCinHau
・ MendeKikakui ・ PhagsPa
・ MeroiticCursive ・ (PhaistosDisc)
・ MeroiticHieroglyphs ・ Phoenician
・ Miao ・ PhoneticExtensions
・ MiscellaneousMathematicalSymbolsA ・ PhoneticExtensionsSupplement
・ MiscellaneousMathematicalSymbolsB ・ PlayingCards
・ MiscellaneousSymbols ・ PrivateUseArea
・ MiscellaneousSymbolsAndArrows ・ PsalterPahlavi
・ MiscellaneousSymbolsAndPictographs ・ Rejang
・ MiscellaneousTechnical ・ RumiNumeralSymbols
・ Modi ・ Runic
・ ModifierToneLetters ・ Samaritan
・ Mongolian ・ Saurashtra
・ MongolianSupplement ・ Sharada
・ Mro ・ Shavian
・ (Multani) ・ (ShorthandFormatControls)
・ MusicalSymbols ・ Siddham
・ Myanmar ・ Sinhala
・ MyanmarExtendedA ・ SinhalaArchaicNumbers
・ MyanmarExtendedB ・ SmallFormVariants
・ Nabataean ・ SmallKanaExtension
・ (Nandinagari) ・ (Sogdian)
・ (Newa) ・ SoraSompeng
・ NewTaiLue ・ (Soyombo)
・ NKo ・ SpacingModifierLetters
・ NumberForms ・ Sundanese
・ (NyiakengPuachueHmong) ・ SundaneseSupplement
・ (Nushu) ・ SuperscriptsAndSubscripts
・ Ogham ・ SupplementalArrowsA
・ OlChiki ・ SupplementalArrowsB
・ OldHungarian ・ SupplementalArrowsC
・ (OldItalic) ・ SupplementalMathematicalOperators
・ (OldNorthArabian) ・ SupplementalPunctuation
・ OldPermic ・ SupplementalSymbolsAndPictographs
・ OldPersian ・ (SupplementaryPrivateUseAreaA)
・ (SupplementaryPrivateUseAreaB)
・ (SuttonSignWriting)
・ SylotiNagri
・ SymbolsAndPictographsExtendedA
・ (SymbolsForLegacyComputing)
・ Syriac
・ SyriacSupplement
・ Tagalog
・ Tagbanwa
・ Tags
・ TaiLe
・ TaiTham
・ TaiViet
・ TaiXuanJingSymbols
・ Takri
・ Tamil
・ (TamilSupplement)
・ (Tangut)
・ (TangutComponents)
・ (TangutSupplement)
・ Telugu
・ Thaana
・ Thai
・ Tibetan
・ Tifinagh
・ Tirhuta
・ TransportAndMapSymbols
・ Ugaritic
・ UnifiedCanadianAboriginalSyllabics
・ UnifiedCanadianAboriginalSyllabicsExtended
・ Vai
・ VedicExtensions
・ VerticalForms
・ (Wancho)
・ WarangCiti
・ (Yezidi)
・ YiRadicals
・ YiSyllables
・ YijingHexagramSymbols
・ (ZanabazarSquare)

In addition, the informal groups available as package options are:

・ Arabics
・ CanadianSyllabics
・ CherokeeFull
・ Chinese
・ CJK
・ Cyrillics
・ Diacritics
・ EthiopicFull
・ GeorgianFull
・ Greek
・ Korean
・ Japanese
・ Latin
・ Mathematics
・ MongolianFull
・ MyanmarFull
・ Phonetics
・ Punctuation
・ SundaneseFull
・ Symbols
・ SyriacFull
・ Yi
