Data Annotation Guideline


Revision date    Revision content
2025/02/27       Updated the usage of the ' symbol
                 Updated the capitalization rules

(1)Validity Judgment
Qualified speech is annotated by breaking the audio down into sentences.
A sentence that meets any of the following conditions is considered
invalid and should not be segmented:
1.Overlapping Speech:
• If two people speak simultaneously within a sentence with similar
volume levels and there is significant overlap, the segment should be
marked as invalid. However, if the overlap is minimal (only one or two
words), and the main speaker’s content can still be clearly heard and
transcribed normally, it should be transcribed.
2.Unclear Content:
• If part of a sentence is unclear and the content cannot be determined,
the sentence should be marked as invalid.
3.High Noise Levels:
• If a sentence contains strong noise (environmental or equipment noise)
that makes it difficult to hear the main speaker’s content, the sentence
should be marked as invalid.
4.Frame Loss:
• If a sentence has missing frames, it should be marked as invalid.
5.Non-Human Voice:
• If a sentence does not contain normal human speech (e.g., machine
customer service, synthesized voice, TV/radio broadcast), it should be
marked as invalid.
6.Non-Target Language Content:
• If a sentence contains parts in a language other than the target language,
it should be marked as invalid. English may be transcribed; for any other
language, consult the product manager.
7.Sensitive Information:
• If a sentence contains sensitive information (politically sensitive,
religiously sensitive, pornographic, violent), it should be marked as invalid.
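
The seven conditions above amount to a simple rule: a sentence is valid only if none of them applies. A minimal sketch of that check follows. The class, field, and function names are hypothetical and only illustrate the logic; the flags themselves would still come from an annotator's judgment.

```python
from dataclasses import dataclass


@dataclass
class SentenceFlags:
    """Annotator observations for one sentence (hypothetical field names)."""
    heavy_overlap: bool = False      # similar volume, significant overlap
    unclear_content: bool = False    # part of the sentence cannot be determined
    strong_noise: bool = False       # noise hides the main speaker
    frame_loss: bool = False
    non_human_voice: bool = False    # machine service, synthesized voice, TV/radio
    other_language: bool = False     # assumption: set only for non-English,
                                     # non-target content confirmed with the
                                     # product manager
    sensitive_content: bool = False


def invalid_reasons(flags):
    """Return the names of every condition that makes the sentence invalid."""
    return [name for name, value in vars(flags).items() if value]


def is_valid(flags):
    """A sentence is valid only when no invalidating condition is set."""
    return not invalid_reasons(flags)
```

For example, `is_valid(SentenceFlags())` is true, while a sentence flagged with `strong_noise=True` is rejected and `invalid_reasons` reports why.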

(2)Effective Voice Interception


1.Annotators should consider semantic coherence and segment the audio
by sentence. If a sentence is too long, it can be divided into
sub-sentences. Each segment should not exceed 8 seconds but also
should not be excessively short. Based on annotation experience, an
average natural language segment of about 5-6 seconds is ideal.

2.Optimal Time Boundary Position:


• The best position for each time boundary should be at the lowest point
of the waveform.
3.Different Speakers:
• Speech from different speakers should not be segmented into the same
sentence.
4.Silence Padding Around Segments:
• When segmenting, try to leave 0.2~0.3 seconds of silence before and after
the annotated speech segment. This is not mandatory if that much silence
is not naturally present. Try to keep sudden noise out of the segment: if
necessary, reduce the padding before or after the segment to avoid it,
but make sure no speech is clipped.
5.Single Word Responses:
• Even single-word responses should be segmented. Where possible,
combine them with adjacent sentences for better context.
6.Handling Pauses Between Sentences:
• If a pause in speech produces a silent segment longer than 2 seconds,
the sentence should be split into two sub-sentences at the pause,
regardless of sentence meaning. If the pause is no longer than 2 seconds
and the total sentence length does not exceed 8 seconds, it can remain
as one sentence.
7.Intra-Sentence Pauses:
• For pauses within a sentence that are no longer than 2 seconds, even if
there is noise during the pause or the resulting segment is semantically
incomplete, it can remain unsplit.
8.Invalid Conditions:
• Speech segments with clipping, amplitude truncation, frame loss, or
abnormal energy levels should be considered invalid.
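
The timing rules above (split at pauses longer than 2 seconds, at most 8 seconds per segment, 0.2~0.3 seconds of padding) can be sketched as follows. This is only an illustration under stated assumptions: the voiced spans are taken to be (start, end) pairs in seconds for a single speaker, obtained from manual marks or any voice activity detector, the padding value 0.25 is simply the midpoint of the suggested range, and all function names are hypothetical.

```python
MAX_SEGMENT_S = 8.0   # upper bound per segment, from the guideline
MAX_PAUSE_S = 2.0     # pauses longer than this force a split
PADDING_S = 0.25      # assumed target padding (guideline: 0.2~0.3 s)


def split_on_pauses(spans):
    """Group voiced (start, end) spans of one speaker into segments.

    A new segment starts whenever the silence between two consecutive
    spans is longer than MAX_PAUSE_S, regardless of sentence meaning.
    """
    segments = []
    for start, end in spans:
        if segments and start - segments[-1][1] <= MAX_PAUSE_S:
            segments[-1] = (segments[-1][0], end)  # extend the current segment
        else:
            segments.append((start, end))
    return segments


def pad_segment(segment, audio_length, prev_end=0.0, next_start=None):
    """Add up to PADDING_S of silence on each side without reaching
    into the neighbouring segments or past the ends of the audio."""
    start, end = segment
    limit = audio_length if next_start is None else next_start
    padded_start = max(prev_end, start - PADDING_S, 0.0)
    padded_end = min(limit, end + PADDING_S)
    return (padded_start, padded_end)


def too_long(segment):
    """True if the segment exceeds 8 seconds and needs sub-sentences."""
    return segment[1] - segment[0] > MAX_SEGMENT_S
```

The sketch does not choose the boundary at the lowest point of the waveform and does not detect sudden noise; both still require inspecting the audio.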

(3) Speaker Identification


• Different speakers within the same segment should be identified using
distinct identity IDs.
• The gender of each speaker must also be marked.

(1)Content Transcription
Transcribers must transcribe the content exactly as heard, with no extra,
missing, or incorrect characters. The general guidelines are as follows:
1.Capitalization:
• If a word typically starts with a capital letter, it should be transcribed
according to standard writing conventions, e.g., "China," "Microsoft."
• Capitalization should follow basic writing rules, such as capitalizing
the first letter of the first word of a sentence. If a sentence is not
finished at the end of a segment, the next segment should begin with a
lowercase letter.
2.Numbers:
• Numbers appearing in the text should not be transcribed as Arabic
numerals but should be written out in words according to the language
being used.

Original                 Transcription
I'm 15 years old         I'm fifteen years old

3.Spelled-Out Words:
• Letters should be capitalized and separated by spaces, e.g., "A B C."

Original                 Transcription
five thirty pm           five thirty PM
FBI                      FBI
NFC                      NFC

4.Abbreviations:
• Abbreviated words should be fully expanded during transcription. Always
use the full word based on its pronunciation, not its abbreviated form.

Original                 Transcription
This is Dr.Smith         This is doctor Smith

5.Punctuation:
• Use punctuation marks according to grammatical rules.
• Punctuation spoken by the speaker should be transcribed as heard, e.g.,
"@" should be transcribed as "at," ".com" should be written as "dot com."
• Only the following punctuation marks are allowed: commas (,), hyphens
(-) only within words, periods (.), exclamation points (!), single quotation
marks ('), and question marks (?). No other punctuation marks should be
added, and all symbols must conform to grammatical rules. All symbols
should be input in normal English typing mode.
• Incorrect examples:
I've been learning 'Chinese' for 2 months.
Our ancestors said: "Where there is caution, there is peace of mind."
In these sentences the quotation marks (' around Chinese, and ") are used
incorrectly. The ' mark may be used only in abbreviated forms
(contractions such as I've) and possessives.
6.Interjections:
• Interjections should be transcribed accurately based on pronunciation
and meaning (e.g., uh, hum, ah).
7.Other Guidelines:
• Profanity: Profanity should be transcribed normally without substituting
letters.
• Internet Slang and Common Terms: These should be transcribed
according to common usage.
• Repetitions: Any repeated words or phrases in the audio should be
transcribed in full.
• Unclear but Pronounceable Words: If a word is clearly pronounced but
the meaning is uncertain (e.g., common names), you can use a
homophone, ensuring that the transcription matches the pronunciation. In
cases where the context provides clear sentence meaning, choose words
that fit both the pronunciation and the sentence meaning.
• Incomplete Words: If a word is cut off mid-sentence, add a hyphen (-)
followed by a space before the next word, e.g., "I want to go to s-
school." Note that sentences must end with complete words; if an
incomplete word appears at the end of a sentence, it should be omitted.
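
Several of the transcription rules above can be checked mechanically: no Arabic numerals, only the allowed punctuation marks typed in normal English mode, and no incomplete word at the end of a sentence. The following is a rough sketch of such a check, not an official validator. The function name and messages are hypothetical, and the text is assumed to have had any special labels removed beforehand.

```python
import re

# Punctuation explicitly allowed by the guideline: , - . ! ' ?
ALLOWED_PUNCTUATION = set(",.!?'-")


def transcript_issues(text):
    """Return a list of rule violations found in one transcribed sentence."""
    issues = []

    # Numbers must be written out in words.
    if re.search(r"\d", text):
        issues.append("digits: write numbers out in words")

    # Anything that is not a letter, digit, space, or allowed mark is flagged.
    # This also catches typographic quotes, which are not normal typing mode.
    bad = sorted({ch for ch in text
                  if not (ch.isalnum() or ch.isspace() or ch in ALLOWED_PUNCTUATION)})
    if bad:
        issues.append("disallowed characters: " + " ".join(bad))

    # A hyphen must follow a word character, either inside a word
    # or marking a cut-off word such as "s- school".
    if re.search(r"(?<!\w)-", text):
        issues.append("hyphen not attached to a preceding word")

    # A sentence must not end with an incomplete word.
    if re.search(r"-[\s.,!?]*$", text):
        issues.append("sentence ends with an incomplete word")

    return issues
```

Checks that depend on meaning, such as expanding abbreviations or choosing the right homophone, are left to the transcriber.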

1.Special Label

If any of the following situations occurs during annotation, the
corresponding special label must be added. Labels must be well-formed:
avoid missing paired labels, inconsistent capitalization, mismatched
brackets, etc.

Valid data

• Noise: No noise
  Special label mark: None
  Explanation: Audio property label. Transcribe what is heard and follow
  the rules ……
  Role: O1 or O2
  Text annotation: I went to eat today

• Special label mark: [N]
  Explanation: Audio property label. A sentence containing noise needs to
  be marked [N] at the end of the sentence, but there is no need to
  distinguish the type of noise.
  Text annotation: I went to eat today.[N] (after punctuation, at the end
  of the sentence)

• Special label mark: [HM]
  Explanation: Audio property label. The speaker's rap content needs to be
  marked with [HM] at the end of the sentence.
  Text annotation: I get drunk alone[HM]

• Special label mark: [OVERLAP/][/OVERLAP]
  Explanation: Audio property label. If the voices overlap and one of them
  is particularly clear, only the voice of the person who speaks clearly
  should be transcribed. The speaker should be marked with a role, and the
  overlapping text should be placed between the special labels.
  Text annotation: Today I went [OVERLAP/]to eat[/OVERLAP]

Invalid data

• Noise: The recording person's invalid human voice segment
  Special label mark: [IVS]
  Explanation: Noise segments longer than 0.5 seconds will be marked. For
  example: the voices overlap and the sound volumes are about the same;
  voice frame loss; speech clipping; speech echoes; not a normal speaking
  tone, such as singing or speaking with a high voice; non-target
  language; some words in the speech segment cannot be heard clearly or
  cannot be transcribed due to noise.
  Role: N
  Text annotation: [IVS]

• Noise: Non-recording person's invalid human voice segment
  Special label mark: [OIVS]
  Explanation: Noise segments longer than 0.5 seconds will be marked. For
  example: TV vocals; the voiceover in a program that narrates the
  commercials; music with human voice, etc.
  Role: N
  Text annotation: [OIVS]

• Noise: Personal info
  Special label mark: [PIL]
  Explanation: The audio may contain private information of the recording
  subject, including but not limited to:
    • Detailed Address
    • Phone Number
    • ID Number (e.g., National ID, Social Security Number)
    • Bank Account Number
    • Social Insurance Number
    • Passport Number
    • And other sensitive personal information.
  Role: N
  Text annotation: [PIL]
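
The requirement that labels be free of missing pairs, inconsistent capitalization, and mismatched brackets lends itself to an automatic check. Below is a minimal sketch, assuming the label set is exactly the one listed in this section. The function name and the wording of the messages are hypothetical.

```python
import re

# Labels taken from the special label list in this guideline.
KNOWN_LABELS = {"[N]", "[HM]", "[OVERLAP/]", "[/OVERLAP]", "[IVS]", "[OIVS]", "[PIL]"}
TOKEN = re.compile(r"\[[^\[\]]*\]")  # one well-bracketed token, e.g. [N]


def label_issues(text):
    """Return a list of problems with the special labels in one annotation."""
    issues = []

    # Brackets left over after removing well-bracketed tokens are mismatched.
    leftover = TOKEN.sub("", text)
    if "[" in leftover or "]" in leftover:
        issues.append("mismatched bracket")

    open_overlaps = 0
    for token in TOKEN.findall(text):
        normalized = token.upper()
        if normalized not in KNOWN_LABELS:
            issues.append("unknown label: " + token)
            continue
        if token != normalized:
            issues.append("inconsistent capitalization: " + token)
        # Paired labels: every [OVERLAP/] needs a later [/OVERLAP].
        if normalized == "[OVERLAP/]":
            open_overlaps += 1
        elif normalized == "[/OVERLAP]":
            if open_overlaps == 0:
                issues.append("[/OVERLAP] without [OVERLAP/]")
            else:
                open_overlaps -= 1

    if open_overlaps:
        issues.append("[OVERLAP/] without [/OVERLAP]")
    return issues
```

For instance, `label_issues("Today I went [OVERLAP/]to eat")` reports the missing closing label, while the two example annotations `I went to eat today.[N]` and `Today I went [OVERLAP/]to eat[/OVERLAP]` pass with no issues.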

1.Quality Requirements
The accuracy of the annotated words should be 98% or above. If any
annotation error (incorrect annotation, validity error, etc.) occurs in
part of a sentence, the whole sentence is deemed incorrectly annotated.

The accuracy of role labels, counted per sentence, should be over 90%.

Accuracy rate of special labels = 1 - (number of incorrect labels (wrong
label, multi-label, missing label) / number of labeled labels) ≥ 90%

Quality inspectors must have lived for a long time in a country where the
target language is spoken (native speakers are preferred).
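
The thresholds above can be checked with a few lines of arithmetic. The sketch below reads the special label accuracy as the share of labels that are not incorrect, that is, one minus incorrect labels divided by labeled labels. That reading is an assumption, as are the function names and the treatment of a batch with no labels as fully accurate.

```python
def label_accuracy(incorrect_labels, total_labels):
    """Share of special labels that are not wrong, multi-labeled, or missing."""
    if total_labels == 0:
        return 1.0  # assumption: nothing labeled means nothing to get wrong
    return 1.0 - incorrect_labels / total_labels


def meets_requirements(word_accuracy, role_accuracy, incorrect_labels, total_labels):
    """Compare one batch against the three thresholds of the guideline.

    word_accuracy and role_accuracy are fractions between 0 and 1.
    """
    return {
        "words": word_accuracy >= 0.98,   # 98% or above
        "roles": role_accuracy > 0.90,    # over 90%
        "labels": label_accuracy(incorrect_labels, total_labels) >= 0.90,
    }
```

A batch with 98.5% word accuracy, 92% role accuracy, and 5 incorrect labels out of 100 passes all three checks; a batch with 20 incorrect labels out of 100 fails the label check at 80%.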
