Data Annotation Guideline

Guide

Uploaded by

martinmuriithi324

Revision date    Revision content
2025/02/27       Updated the usage of the ' symbol
                 Updated the capitalization rules

(1)Validity Judgment
For entire segments of qualified speech, annotations should be made by
breaking down the audio into sentences. Sentences that meet any of the
following conditions are considered invalid and should not be segmented:
1.Overlapping Speech:
• If two people speak simultaneously within a sentence with similar
volume levels and there is significant overlap, the segment should be
marked as invalid. However, if the overlap is minimal (only one or two
words), and the main speaker’s content can still be clearly heard and
transcribed normally, it should be transcribed.
2.Unclear Content:
• If part of a sentence is unclear and the content cannot be determined,
the sentence should be marked as invalid.
3.High Noise Levels:
• If a sentence contains strong noise (environmental or equipment noise)
that makes it difficult to hear the main speaker’s content, the sentence
should be marked as invalid.
4.Frame Loss:
• If a sentence has missing frames, it should be marked as invalid.
5.Non-Human Voice:
• If a sentence does not contain normal human speech (e.g., machine
customer service, synthesized voice, TV/radio broadcast), it should be
marked as invalid.
6.Non-Target Language Content:
• If a sentence contains parts in languages other than the target language,
it should be marked as invalid. English can be transcribed, but for other
languages, communication with the product manager is required.
7.Sensitive Information:
• If a sentence contains sensitive information (politically sensitive,
religiously sensitive, pornographic, violent), it should be marked as invalid.
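
The seven conditions above can be read as a simple checklist. The sketch below is only an illustration of that reading: the class and field names are hypothetical and do not come from any annotation tool, and each flag still has to be set by a human listener.

```python
from dataclasses import dataclass


@dataclass
class SentenceFlags:
    """Hypothetical per-sentence flags, one per invalidity condition."""
    significant_overlap: bool = False    # similar volume, more than one or two words
    unclear_content: bool = False
    strong_noise: bool = False
    frame_loss: bool = False
    non_human_voice: bool = False
    non_target_language: bool = False    # English is excepted by the guideline
    sensitive_information: bool = False


def invalid_reasons(flags: SentenceFlags) -> list[str]:
    """Return the names of every invalidity condition that is set."""
    return [name for name, value in vars(flags).items() if value]


def is_valid(flags: SentenceFlags) -> bool:
    """A sentence is valid only if none of the seven conditions applies."""
    return not invalid_reasons(flags)
```

For example, `invalid_reasons(SentenceFlags(frame_loss=True))` lists the single reason the sentence would be marked invalid.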

(2)Effective Voice Interception

1.Annotators should consider semantic coherence and segment the audio
by sentence. If a sentence is too long, it can be divided into
sub-sentences. Each segment should not exceed 8 seconds but also
should not be excessively short. Based on annotation experience, an
average natural language segment of about 5-6 seconds is ideal.

2.Optimal Time Boundary Position:
• The best position for each time boundary should be at the lowest point
of the waveform.
3.Different Speakers:
• Speech from different speakers should not be segmented into the same
sentence.
4.Silence Padding Around Segments:
• When segmenting, try to leave 0.2~0.3 seconds of silence before and
after the annotated speech segment. This is not mandatory if that much
silence is not naturally present. Aim for segments free of sudden noise;
if necessary, reduce the padding before and after the segment to avoid
sudden noise, but ensure that no speech is clipped.
5.Single Word Responses:
• Even single-word responses should be segmented. Where possible,
combine them with adjacent sentences for better context.
6.Handling Pauses Between Sentences:
• If a pause in speech results in a silent segment longer than 2 seconds,
the sentence should be split into two sub-sentences without considering
sentence meaning. If the pause is less than 2 seconds and the total
sentence length does not exceed 8 seconds, it can remain as one
sentence.
7.Intra-Sentence Pauses:
• For pauses within a sentence that are no longer than 2 seconds, even if
there is noise during the pause or the resulting segment is semantically
incomplete, it can remain unsplit.
8.Invalid Conditions:
• Speech segments with clipping, amplitude truncation, frame loss, or
abnormal energy levels should be considered invalid.
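
The timing rules in this section (8-second maximum, split at pauses longer than 2 seconds, 0.2~0.3 seconds of padding) can be sketched as follows. This is a minimal illustration, assuming speech has already been located as (start, end) spans in seconds; the function names and the 0.25-second padding value are assumptions, and the sketch does not look at waveform minima, speakers, or sudden noise.

```python
MAX_SEGMENT_S = 8.0   # a segment should not exceed 8 seconds
MAX_PAUSE_S = 2.0     # a silent pause longer than this forces a split
PADDING_S = 0.25      # assumed value inside the 0.2~0.3 second range


def split_on_long_pauses(speech_spans):
    """Merge (start, end) spans into segments, splitting where the gap exceeds MAX_PAUSE_S."""
    groups = []
    for span in speech_spans:
        if groups and span[0] - groups[-1][-1][1] <= MAX_PAUSE_S:
            groups[-1].append(span)       # short pause: stay in the same segment
        else:
            groups.append([span])         # first span or long pause: new segment
    return [(g[0][0], g[-1][1]) for g in groups]


def pad_segment(start, end, audio_len, pad=PADDING_S):
    """Add silence padding on both sides, clamped to the audio bounds."""
    return max(0.0, start - pad), min(audio_len, end + pad)


def too_long(start, end):
    """True if the segment exceeds the 8-second limit and needs sub-sentences."""
    return end - start > MAX_SEGMENT_S
```

A segment flagged by `too_long` would still need a human decision on where to divide it into sub-sentences.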

(3) Speaker Identification

• Different speakers within the same segment should be identified using
distinct identity IDs.
• The gender of each speaker must also be marked.

(1)Content Transcription
Transcription personnel must transcribe the content exactly as heard,
ensuring that no extra characters, missing characters, or incorrect
characters are included. The general guidelines are as follows:
1.Capitalization:
• If a word typically starts with a capital letter, it should be transcribed
according to standard writing conventions, e.g., "China," "Microsoft."
• Capitalization should follow basic writing rules, such as capitalizing
the first letter of the first word of a sentence. If a sentence is not
finished at the end of a segment, the next segment should begin with a
lowercase letter.
2.Numbers:
• Numbers appearing in the text should not be transcribed as Arabic
numerals but should be written out in words according to the language
being used.

Original             Transcription
I’m 15 years old     I’m fifteen years old
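
As an illustration of the number rule for English, the sketch below spells out standalone integers from 0 to 999. It is an assumption-laden helper, not part of the guideline: the function names are hypothetical, larger numbers and other languages are left untouched, and the output must still be checked against what the speaker actually said (for example, "one five" rather than "fifteen").

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 in English."""
    if not 0 <= n < 1000:
        raise ValueError("only 0..999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")


def spell_out_numbers(text: str) -> str:
    """Replace standalone runs of one to three digits with English words."""
    return re.sub(r"\b\d{1,3}\b", lambda m: number_to_words(int(m.group())), text)
```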

3.Spelled-Out Words:
• Letters should be capitalized and separated by spaces, e.g., "A B C."

Original             Transcription
five thirty pm       five thirty PM
FBI                  FBI
NFC                  NFC

4.Abbreviations:
• Abbreviated words should be fully expanded during transcription. Always
use the full word based on its pronunciation, not its abbreviated form.

Original             Transcription
This is Dr.Smith     This is doctor Smith

5.Punctuation:
• Use punctuation marks according to grammatical rules.
• Punctuation spoken by the speaker should be transcribed as heard, e.g.,
"@" should be transcribed as "at," ".com" should be written as "dot com."
• Only the following punctuation marks are allowed: commas (,), hyphens
(-) only within words, periods (.), exclamation points (!), single quotation
marks ('), and question marks (?). No other punctuation marks should be
added, and all symbols must conform to grammatical rules. All symbols
should be input in normal English typing mode.
• Ex: I've been learning 'Chinese' for 2 months.
Our ancestors said: "Where there is caution, there is peace of mind."
In these sentences the use of ' and " as quotation marks is incorrect; the
' mark is only used in contractions and possessives.
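
A rough automated check of the punctuation whitelist could look like the sketch below. It assumes special labels have been stripped from the text first, and the function name is hypothetical. It treats a hyphen or apostrophe as acceptable only when it directly follows a letter, which covers in-word hyphens, contractions, possessives, and cut-off words, and it flags digits because numbers are to be written out in words.

```python
ALLOWED_PUNCTUATION = set(",-.!'?")


def punctuation_errors(text: str) -> list[str]:
    """Return a list of problems found in a label-free transcription."""
    errors = []
    # Anything that is not a letter, a space, or whitelisted punctuation.
    for ch in sorted(set(text)):
        if not (ch.isalpha() or ch == " " or ch in ALLOWED_PUNCTUATION):
            errors.append(f"disallowed character: {ch!r}")
    # Apostrophes and hyphens must directly follow a letter.
    for i, ch in enumerate(text):
        if ch in "'-" and (i == 0 or not text[i - 1].isalpha()):
            errors.append(f"{ch!r} at position {i} does not follow a letter")
    return errors
```

Applied to the first incorrect example above, the check reports both the digit and the opening quotation mark before "Chinese".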
6.Interjections:
• Interjections should be transcribed accurately based on pronunciation
and meaning (e.g., uh, hum, ah).
7.Other Guidelines:
• Profanity: Profanity should be transcribed normally without substituting
letters.
• Internet Slang and Common Terms: These should be transcribed
according to common usage.
• Repetitions: Any repeated words or phrases in the audio should be
transcribed in full.
• Unclear but Pronounceable Words: If a word is clearly pronounced but
the meaning is uncertain (e.g., common names), you can use a
homophone, ensuring that the transcription matches the pronunciation. In
cases where the context provides clear sentence meaning, choose words
that fit both the pronunciation and the sentence meaning.
• Incomplete Words: If a word is cut off mid-sentence, add a hyphen (-)
followed by a space before the next word, e.g., "I want to go to s-
school." Note that sentences must end with complete words; if an
incomplete word appears at the end of a sentence, it should be omitted.

1.Special Label

If the following situations occur during the annotation process,
corresponding special labels need to be added, and the labels must be
valid: avoid missing paired labels, inconsistent capitalization, mismatched
brackets, etc.

(Original table columns: Valid or Invalid Data, Noise, Special Label Mark,
Explain, Role, Text Annotation)

Valid data

Special label: None
Noise: No noise
Role: O1 or O2
Explanation: Audio property label. Transcribe what is heard and follow the
rules ……
Text annotation: I went to eat today

Special label: [N]
Explanation: Audio property label. A sentence containing noise needs to be
marked [N] at the end of the sentence, but there is no need to distinguish
the type of noise.
Placement: After punctuation, at the end of a sentence.
Text annotation: I went to eat today.[N]

Special label: [HM]
Explanation: Audio property label. The speaker’s rap content needs to be
marked [HM] at the end of the sentence.
Text annotation: I get drunk alone[HM]

Special label: [OVERLAP/][/OVERLAP]
Explanation: Audio property label. If the voices overlap and one of them is
particularly clear, only the voice of the person who speaks clearly should
be transcribed. The speaker should be marked with a role, and the overlap
text should be labeled between the special labels.
Text annotation: Today I went [OVERLAP/]to eat[/OVERLAP]

Invalid data

Special label: [IVS]
Category: The recording person’s invalid human voice segment
Role: N
Explanation: Noise segments longer than 0.5 seconds will be marked. For
example:
• The voices overlap and the sound volumes are about the same
• Voice frame loss
• Speech clipping
• Speech echoes
• Not a normal speaking tone, such as singing, speaking with a high voice,
etc.
• Non-target language
• Some words in the speech segment cannot be heard clearly or cannot be
transcribed due to noise
Text annotation: [IVS]

Special label: [OIVS]
Category: Non-recording person’s invalid human voice segment
Role: N
Explanation: Noise segments longer than 0.5 seconds will be marked. For
example:
• TV vocals
• The voiceover in the program that narrates the commercials
• Music with human voice, etc.
Text annotation: [OIVS]

Special label: [PIL]
Category: Personal info
Role: N
Explanation: The audio may contain private information of the recording
subject, including but not limited to:
• Detailed Address
• Phone Number
• ID Number (e.g., National ID, Social Security Number)
• Bank Account Number
• Social Insurance Number
• Passport Number
• And other sensitive personal information.
Text annotation: [PIL]
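
The requirement that labels be valid (no missing paired labels, inconsistent capitalization, or mismatched brackets) lends itself to a mechanical check. The sketch below is one possible reading, assuming the label set is exactly the one listed in this section; the function name is hypothetical.

```python
import re

STANDALONE_LABELS = {"[N]", "[HM]", "[IVS]", "[OIVS]", "[PIL]"}
OVERLAP_OPEN, OVERLAP_CLOSE = "[OVERLAP/]", "[/OVERLAP]"
BRACKETED = re.compile(r"\[[^\[\]]*\]")


def label_errors(text: str) -> list[str]:
    """Return problems with the special labels in one annotated sentence."""
    errors = []
    # Any bracket left after removing well-formed [..] tokens is mismatched.
    remainder = BRACKETED.sub("", text)
    if "[" in remainder or "]" in remainder:
        errors.append("mismatched bracket")
    open_overlap = False
    for tok in BRACKETED.findall(text):
        if tok == OVERLAP_OPEN:
            if open_overlap:
                errors.append("nested [OVERLAP/]")
            open_overlap = True
        elif tok == OVERLAP_CLOSE:
            if not open_overlap:
                errors.append("[/OVERLAP] without [OVERLAP/]")
            open_overlap = False
        elif tok not in STANDALONE_LABELS:
            # Case-sensitive on purpose: [n] is not the same label as [N].
            errors.append(f"unknown label: {tok}")
    if open_overlap:
        errors.append("[OVERLAP/] without [/OVERLAP]")
    return errors
```

The check only covers label syntax; whether a label is the right one for the audio remains a listening judgment.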

1.Quality Requirements
The accuracy of the annotated words should be 98% or above. If any of the
following annotation errors occurs in part of a sentence (incorrect
annotation, validity error, etc.), the whole sentence is deemed to be
incorrectly annotated.

The accuracy rate of role sentences should be over 90%.

Accuracy rate of special labels = 1 - (number of incorrect labels (wrong
label, multi-label, missing label) / number of labeled labels) ≥ 90%

Quality inspectors must have lived for a long time in a country where the
target language is spoken (native speakers are preferred).
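
The thresholds can be checked with a few lines of arithmetic. The sketch below reads the special-label accuracy as one minus the ratio of incorrect labels to labeled labels, which is an interpretation rather than something the guideline spells out; the function names are hypothetical.

```python
def label_accuracy(incorrect: int, total: int) -> float:
    """Accuracy of special labels, read as 1 - incorrect / total."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= incorrect <= total:
        raise ValueError("incorrect must be between 0 and total")
    return 1 - incorrect / total


def meets_thresholds(word_acc: float, role_acc: float, label_acc: float) -> bool:
    """Words at 98% or above, roles over 90%, special labels at 90% or above."""
    return word_acc >= 0.98 and role_acc > 0.90 and label_acc >= 0.90
```

For instance, 25 incorrect labels out of 100 gives an accuracy of 0.75, which fails the 90% requirement.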
