How To Transcribe Documents With Transkribus
Transkribus is a platform for the automated recognition, transcription and searching of historical
documents, using Handwritten Text Recognition (HTR) technology.
Transcripts created in Transkribus can be:
- Used to train a neural network (“model”) which is capable of automatically recognising printed
or handwritten documents
- Enriched and marked-up to serve as the basis for digital editions of documents.
This introduction enables you to either quickly create training data for the automated recognition of
your specific documents or to create a transcription for a scholarly edition.
If you already have transcribed documents available and would like to use them as training data for HTR,
please consult our HowToUseExisitingTranscriptions guide.
Download the Transkribus Expert Client, or make sure you are using the latest version:
- https://transkribus.eu/
2 HowToTranscribe – Basic Instruction
Consult the Transkribus Wiki for further information and other How to Guides:
- https://transkribus.eu/wiki/
Transkribus and the technology behind it are made available via the following projects and sites:
- https://read.transkribus.eu/
- https://transcriptorium.eu/
- https://github.com/transkribus/
Contact
Contents
Introduction
Upload documents to Transkribus
Segmentation
Viewing profiles
Automatically detect text regions, lines and baselines
Correcting the results of automated segmentation
Simple transcription – for HTR training
Train a HTR model
Advanced transcription - for a scholarly edition
Reading order
Reading order: Interline additions
Reading order: Additions as extra notes
Transcription and Virtual Keyboards
Diacritics and ligatures
Punctuation marks
References
Credits
The READ project has received funding from the European Union’s Horizon
2020 research and innovation programme under grant agreement No
674943.
Introduction
This guide explains the process of transcribing documents in Transkribus. Transcripts can be used:
• As training data for a Handwritten Text Recognition (HTR) model which is capable of
automatically transcribing your documents.
• As the basis for a digital scholarly edition.
Step 1: Uploading
Step 2: Segmentation
- Run the automated segmentation tool to create baselines for your document.
Step 3: Transcription
This form of simple transcription is sufficient for training HTR technology. Note: HTR can work on
both handwritten and printed documents.
There are also advanced transcription options for those working on scholarly editions. You can adjust
the reading order of the text, use historical characters, add tags and metadata, expand abbreviations
and more.
▪ This allows you to upload documents directly from repositories which support
the DFG (Deutsche Forschungsgemeinschaft – German Research Foundation) Viewer
o Extract and upload images from PDF
Segmentation
- Once you have uploaded your documents to Transkribus, you are ready to start
segmentation.
- In order to transcribe your documents in Transkribus, they must be segmented into text
regions, lines and baselines.
- For the HTR to work, the text and image need to be connected.
Viewing profiles
- Viewing profiles are available to help you with the tasks of segmentation and transcription.
- You can select between viewing profiles for “Segmentation” and “Transcription” by clicking
the “Profiles” button in the Main menu.
- The “Segmentation” profile means that baselines are displayed in red, making it easier to
spot any errors resulting from the automated segmentation process.
- The “Transcription” profile means that the Text Editor field will be displayed, allowing you to
transcribe your document.
- Of course you can simply use the “default” profile to perform either task.
- In the example above the first line had been missed by the program. If you would like to add
it to the existing text region:
o Click inside the region so that it is highlighted.
o Drag the border of the text region as needed.
o If you need to split one region into two, you can do this with buttons in the Canvas menu.
o As shown in Figure 6, the “H-button” splits a text region horizontally.
o The “V” button splits a text region vertically.
o The “L-button” allows you to split a text region with a customisable line.
o In the example above two regions are overlapping, so one can be deleted.
o Click on the text region you wish to delete, and click the red “Remove a shape” button.
o To merge two text regions, hold down the “CTRL” key on your keyboard and click on both text regions.
o Click the “Merges the selected shapes” button in the Canvas menu.
Correct baselines
- Of course it is also possible to correct the baselines in your document.
- As with the text regions, click on a baseline and you can drag the parts of the line, split a line
into two or merge two lines together.
- You can also delete a baseline and draw a new one from scratch. Click the “+BL” button in the
Canvas menu. Click once to start drawing your baseline and double-click to finish your line.
- Note: Baselines are most important for HTR; line regions do not need to be corrected.
- Transcribe the text according to the language of your source document. Use the characters of
your keyboard.
- You can have more than one person working on a document but they should not work on
the same page simultaneously. You can let other Transkribus users see your documents by
clicking the “User Manager” button in the “Server” tab.
Figure 11 “Item visibility” button displays the logical order of segmentation elements
o Once you choose to show the reading order of text regions or lines, numbers will be
displayed on the image of your document.
o By clicking on one of the numbers marking the reading order, it is possible to type in
a new number and change the reading order accordingly. The same can be done by
moving the segmentation elements in the “Layout” tab.
Figure 12 Edit reading order by clicking on the digit and entering a new number
- In cases where the reading order of a page is completely incorrect, it is possible to reorder
the text:
o Make the line reading order visible as described above
o Click on the “Layout” tab on the left side of the screen
o Select the page or text region that you wish to reorder
o Click the “R” button
o The reading order will be rearranged according to the coordinates of the top left
corner of a text or line region. After that, the lines should be in the right order.
o There can be issues with the reading order of newspaper columns and similar
documents. For example, the programme may assign a reading order based on the horizontal
layout of lines on a page, rather than putting the lines in order by column. To fix this
issue, use the “V” button in the Canvas menu to split the text region on the page into
separate regions for each column. Once there is a separate text region for each
column, the reading order should automatically update and be correct.
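The column problem described above can be illustrated with a minimal Python sketch. It is not Transkribus code: the `Region` class, the coordinates and the exact sort key (top edge first, then left edge) are assumptions chosen only to show why one region spanning two columns produces an interleaved order, and why one region per column fixes it.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical stand-in for a text or line region (not a Transkribus type)."""
    name: str
    x: int  # left edge of the bounding box, in pixels
    y: int  # top edge of the bounding box, in pixels


def reading_order(regions):
    """Order regions by their top-left corner: top to bottom, then left to right."""
    # Assumed sort key; the tool's actual rule may differ in detail.
    return [r.name for r in sorted(regions, key=lambda r: (r.y, r.x))]


def column_reading_order(columns):
    """Order the lines inside each column, then read the columns left to right."""
    ordered = []
    for column in sorted(columns, key=lambda col: min(r.x for r in col)):
        ordered.extend(reading_order(column))
    return ordered


# Two newspaper columns with two lines each (made-up coordinates).
left = [Region("L1", 100, 100), Region("L2", 100, 150)]
right = [Region("R1", 600, 100), Region("R2", 600, 150)]

# Treated as one text region, the lines interleave across the columns.
print(reading_order(left + right))          # ['L1', 'R1', 'L2', 'R2']

# With one text region per column, each column is read in full.
print(column_reading_order([left, right]))  # ['L1', 'L2', 'R1', 'R2']
```

Splitting the page into one text region per column corresponds to the second call: the ordering is applied inside each column before moving on to the next one.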
Figure 14 Click the “Shape Visibility” button, then choose to show baselines and the reading order of lines.
o Select the baseline below the addition (if the addition is above the line).
o Split the line region with the “V” button in the Canvas menu exactly where the
addition should be logically placed
- Edit the reading order so that it is correct. Click on the number associated with each line
region and then type the correct one.
▪ Following the movement of your mouse pointer you can add points to the
original text region and expand the shape so that it also includes the
addition.
▪ Afterwards the additional lines/baselines can be renumbered according to
their correct reading order.
o Option 2: You can generate just one large text region for the whole page and do the
line/baseline segmentation manually in the correct order. In this way you will get the
correct reading order right from the beginning.
▪ Note: this may be the best option if you are dealing with a document which
has a sophisticated layout with many additions, notes and deletions.
o Option 3: You can connect the extra text region which contains the addition to the
line where the addition belongs. To do this, select both text regions and then click
the “Links two shapes” button in the “Structural” tab, within the “Metadata” tab.
▪ Note: The linking will be part of the XML file but is currently not supported in
the export formats.
o In simple transcripts you will transcribe this as LATIN SMALL LETTER Y since the base
character is clearly visible.
o Example 2: LATIN SMALL LETTER S is expressed with two graphemes in most European
historical scripts. We find therefore a clear distinction between LATIN SMALL LETTER
S and LATIN SMALL LETTER LONG S.
Figure 23 “Thatbestand.” vs. “Revisionsgerichts”: LATIN SMALL LETTER LONG S vs. LATIN SMALL LETTER S
o But although there is a clear distinction, a simple transcription would use LATIN SMALL
LETTER S in both cases.
- Option 2: Palaeographic Transcription
o Philologists or palaeographers are not only interested in the correct transcription, but
also in the historical appearance and development of graphemes. Therefore it might
also be interesting to transcribe the above examples with full support of the Unicode
character set, or even by using the Private Use Area of Unicode.
o Note: Please take into account that this is an important decision and will affect the
usability of the text in many ways. If you decide to go for a palaeographic transcription
it will require considerably more work than a slightly normalized transcription.
- Note: In printed texts (which can also be transcribed in Transkribus) the transcription of
ligatures may play a role. The same rule applies: although specific combinations of letters,
such as “ft”, are printed as a single grapheme joining two letters, and although such
ligatures can also be encoded with specific Unicode characters, we recommend transcribing
them according to the dictionary.
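If a palaeographic transcription already exists and a simple one is needed, for example as HTR training data, the normalization described above can be approximated outside Transkribus. The following Python sketch is an illustration only, not a Transkribus feature. It relies on the standard `unicodedata` module and on Unicode compatibility normalization (NFKC), which maps LATIN SMALL LETTER LONG S to LATIN SMALL LETTER S and splits ligature code points into their base letters. The function name and sample strings are made up for this example.

```python
import unicodedata

LONG_S = "\u017f"  # LATIN SMALL LETTER LONG S


def simple_transcription(text):
    """Map long s and ligature code points to plain dictionary letters."""
    # NFKC applies the Unicode compatibility mappings. Note that it also
    # changes other compatibility characters (e.g. superscript digits),
    # so check the result before using it on your own material.
    return unicodedata.normalize("NFKC", text)


print(unicodedata.name(LONG_S))                    # LATIN SMALL LETTER LONG S
print(simple_transcription("Thatbe\u017ftand."))   # Thatbestand.
print(simple_transcription("O\ufb00izier"))        # Offizier (U+FB00 is the "ff" ligature)
```

The reverse direction is not possible automatically: once long s and ligatures have been normalized, the palaeographic information is lost, which is why the choice between the two options should be made before transcription starts.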
Punctuation marks
- Punctuation marks are transcribed in the same way as characters. Use the appropriate
character on your keyboard and do not normalize or add punctuation marks. Typical
punctuation marks are:
o modern characters such as dot, comma, semicolon, colon: “.”, “,”, “;”, “:”
o historical characters such as virgule (slash), or line fillers, etc.
o Note: Colons in historical texts are often used to mark abbreviated words. These
should be transcribed as a colon.
- In contrast to many transcription rules, where punctuation marks are added and omitted
according to a modern understanding, we recommend keeping the original punctuation
marks.
- If you want to add punctuation marks which do not appear in the original document, you may
use the “supplied” tag in the “Tagging” tab, within the “Metadata” tab, to indicate that you
added the punctuation mark yourself.
References
For an overview of the scripts covered by Unicode: http://www.unicode.org/charts/
- Contains e.g.:
o Non-European and historic Latin
o Phonetic and historic letters
o Additions for Slovenian and Croatian
o etc.
- Contains e.g.:
o Orthographic Latin additions
o etc.
- Contains e.g.:
o Medievalist additions
o Insular and Celtic letters
o Ancient Roman epigraphic letters
o etc.
- This initiative has collected and systematized about 1512 characters which are especially
recommended for the transcription of medieval documents. Note: Some of them are still in
the Private Use Area of Unicode and are therefore not officially part of the standard.
- http://folk.uib.no/hnooh/mufi/
- http://folk.uib.no/hnooh/mufi/specs/MUFI-Alphabetic-4-0.pdf
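Because private-use characters are only displayed correctly with a font that supports them, it can be useful to know whether a transcript contains any. The Python sketch below is an illustration under stated assumptions, not part of Transkribus: it uses the standard `unicodedata` module, where the general category "Co" marks private-use code points, and the function name and sample text are hypothetical.

```python
import unicodedata


def private_use_characters(text):
    """Return the distinct private-use characters in text, in order of first use."""
    found = []
    for ch in text:
        # General category "Co" marks private-use code points.
        if unicodedata.category(ch) == "Co" and ch not in found:
            found.append(ch)
    return found


# Made-up sample: plain letters, one private-use code point, one long s.
sample = "abc\ue000\u017f"
for ch in private_use_characters(sample):
    print(f"U+{ord(ch):04X} is a private-use code point")
```

For the sample above only U+E000 is reported. The long s (U+017F) is a regular Unicode character and is not flagged.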
Credits
We would like to thank the many users who have contributed their feedback to help improve the
Transkribus software.
Transkribus is made available to the public as part of the H2020 e-Infrastructure Project READ (Recognition
and Enrichment of Archival Documents) which received funding from the European Commission under
grant agreement No 674943.