Data Annotation Guideline
Revision date    Revision content
2025/02/27       Updated the usage of the ' symbol
                 Updated the capitalization rules
(1)Validity Judgment
Qualified speech should be annotated by splitting the audio into
sentences. A sentence that meets any of the following conditions is
considered invalid and should not be segmented:
1.Overlapping Speech:
• If two people speak simultaneously within a sentence with similar
volume levels and there is significant overlap, the segment should be
marked as invalid. However, if the overlap is minimal (only one or two
words), and the main speaker’s content can still be clearly heard and
transcribed normally, it should be transcribed.
2.Unclear Content:
• If part of a sentence is unclear and the content cannot be determined,
the sentence should be marked as invalid.
3.High Noise Levels:
• If a sentence contains strong noise (environmental or equipment noise)
that makes it difficult to hear the main speaker’s content, the sentence
should be marked as invalid.
4.Frame Loss:
• If a sentence has missing frames, it should be marked as invalid.
5.Non-Human Voice:
• If a sentence does not contain normal human speech (e.g., machine
customer service, synthesized voice, TV/radio broadcast), it should be
marked as invalid.
6.Non-Target Language Content:
• If a sentence contains parts in languages other than the target language,
it should be marked as invalid. English can be transcribed, but for other
languages, communication with the product manager is required.
7.Sensitive Information:
• If a sentence contains sensitive information (politically sensitive,
religiously sensitive, pornographic, violent), it should be marked as invalid.
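The seven conditions above can be read as a checklist. The following is a minimal sketch of that checklist, not part of the guideline itself: the `SegmentFlags` fields, the function name, and the "more than two words" cut-off for significant overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SegmentFlags:
    """Annotator observations for one sentence (all field names are hypothetical)."""
    overlap_words: int = 0            # words spoken over by a second speaker
    main_speaker_clear: bool = True   # main speaker can still be transcribed
    unclear_content: bool = False
    strong_noise: bool = False
    frame_loss: bool = False
    non_human_voice: bool = False
    other_languages: frozenset = frozenset()  # non-target languages heard, e.g. {"en"}
    sensitive: bool = False


def invalid_reasons(flags: SegmentFlags) -> list:
    """Return the reasons a sentence is invalid; an empty list means it is valid."""
    reasons = []
    # Assumption: "minimal" overlap means at most two words with a clear main speaker.
    if flags.overlap_words > 2 or (flags.overlap_words > 0 and not flags.main_speaker_clear):
        reasons.append("overlapping speech")
    if flags.unclear_content:
        reasons.append("unclear content")
    if flags.strong_noise:
        reasons.append("high noise level")
    if flags.frame_loss:
        reasons.append("frame loss")
    if flags.non_human_voice:
        reasons.append("non-human voice")
    # English may be transcribed; other languages need the product manager's input.
    if flags.other_languages - {"en"}:
        reasons.append("non-target language (check with product manager)")
    if flags.sensitive:
        reasons.append("sensitive information")
    return reasons
```

For example, `invalid_reasons(SegmentFlags(overlap_words=2))` returns an empty list, because a two-word overlap with a clear main speaker is still transcribed.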
(2)Content Transcription
Transcription personnel must transcribe the content exactly as heard,
ensuring that no extra characters, missing characters, or incorrect
characters are included. The general guidelines are as follows:
1.Capitalization:
• If a word typically starts with a capital letter, it should be transcribed
according to standard writing conventions, e.g., "China," "Microsoft."
• Capitalization should follow basic writing rules, such as capitalizing
the first letter of the first word of a sentence. If a sentence is not
finished by the end of a segment, the next segment should begin with a
lowercase letter.
2.Numbers:
• Numbers appearing in the text should not be transcribed as Arabic
numerals but should be written out in words according to the language
being used.
Original Transcription
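A transcript can be screened for leftover Arabic numerals before review. This is only a sketch; the function name `find_digits` is an assumption and is not part of any existing tool.

```python
import re


def find_digits(transcript: str) -> list:
    """Return every run of Arabic numerals found in the transcript."""
    return re.findall(r"[0-9]+", transcript)
```

An empty result means the transcript contains no Arabic numerals. It does not confirm that the numbers were written out correctly in the target language.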
3.Spelled-Out Words:
• Letters should be capitalized and separated by spaces, e.g., "A B C."
Original Transcription
FBI F B I
NFC N F C
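The spelled-out rule (capital letters separated by spaces) can be sketched as a small helper. The name `spell_out` is hypothetical, and dropping non-letter characters such as periods is an assumption.

```python
def spell_out(letters: str) -> str:
    """Format a letter-by-letter spelling as capital letters separated by spaces."""
    # Assumption: anything that is not a letter (periods, spaces) is dropped.
    return " ".join(ch.upper() for ch in letters if ch.isalpha())
```

With this helper, `spell_out("fbi")` gives `"F B I"`.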
4.Abbreviations:
• Abbreviated words should be fully expanded during transcription. Always
use the full word based on its pronunciation, not its abbreviated form.
Original Transcription
5.Punctuation:
• Use punctuation marks according to grammatical rules.
• Punctuation spoken by the speaker should be transcribed as heard, e.g.,
"@" should be transcribed as "at," ".com" should be written as "dot com."
• Only the following punctuation marks are allowed: commas (,), hyphens
(-) only within words, periods (.), exclamation points (!), single quotation
marks ('), and question marks (?). No other punctuation marks should be
added, and all symbols must conform to grammatical rules. All symbols
should be input in normal English typing mode.
• Ex: I've been learning 'Chinese' for 2 months.
Our ancestors said: "Where there is caution, there is peace of mind."
In these sentences, the use of ' and " as quotation marks is incorrect;
the single quotation mark (') is used only in contractions (e.g., I've)
and possessives.
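The punctuation rules above can be approximated with a character-level check. This is a rough sketch under stated assumptions, not an official validator: the function name is hypothetical, a hyphen followed by a space is accepted because the guideline uses that form for cut-off words, and an apostrophe is accepted whenever a letter precedes it, so a closing quotation mark after a word is not detected.

```python
ALLOWED = set(",.!?'-")  # the only punctuation marks the guideline permits


def punctuation_problems(text: str) -> list:
    """List punctuation that appears to break the allowed-symbols rules."""
    problems = []
    for i, ch in enumerate(text):
        if ch.isalnum() or ch.isspace():
            continue
        prev_ch = text[i - 1] if i > 0 else " "
        next_ch = text[i + 1] if i + 1 < len(text) else " "
        if ch not in ALLOWED:
            problems.append(f"disallowed symbol {ch!r} at position {i}")
        elif ch == "-" and not (prev_ch.isalpha() and (next_ch.isalpha() or next_ch.isspace())):
            # A hyphen must sit inside a word or mark a cut-off word ("s- school").
            problems.append(f"hyphen outside a word at position {i}")
        elif ch == "'" and not prev_ch.isalpha():
            # An apostrophe with no letter before it looks like an opening quote.
            problems.append(f"quotation-style apostrophe at position {i}")
    return problems
```

Typographic quotes such as ’ are reported as disallowed symbols, which matches the requirement to type symbols in normal English input mode.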
6.Interjections:
• Interjections should be transcribed accurately based on pronunciation
and meaning (e.g., uh, hum, ah).
7.Other Guidelines:
• Profanity: Profanity should be transcribed normally without substituting
letters.
• Internet Slang and Common Terms: These should be transcribed
according to common usage.
• Repetitions: Any repeated words or phrases in the audio should be
transcribed in full.
• Unclear but Pronounceable Words: If a word is clearly pronounced but
the meaning is uncertain (e.g., common names), you can use a
homophone, ensuring that the transcription matches the pronunciation. In
cases where the context provides clear sentence meaning, choose words
that fit both the pronunciation and the sentence meaning.
• Incomplete Words: If a word is cut off mid-sentence, add a hyphen (-)
followed by a space before the next word, e.g., "I want to go to s-
school." Note that sentences must end with complete words; if an
incomplete word appears at the end of a sentence, it should be omitted.
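The end-of-sentence part of the incomplete-word rule can be sketched as follows. The function name is an assumption, and the pattern handles only a single trailing fragment that ends in a hyphen.

```python
import re


def drop_trailing_fragment(sentence: str) -> str:
    """Remove a cut-off word (text ending in a hyphen) from the end of a sentence."""
    # Matches optional whitespace, a run of non-space, non-hyphen characters,
    # a hyphen, and nothing but whitespace up to the end of the string.
    return re.sub(r"\s*[^\s-]+-\s*$", "", sentence)
```

A fragment in the middle of a sentence, as in "I want to go to s- school.", is left untouched because it does not end the sentence.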
1.Special Label
brackets, etc.
1.Quality Requirements
The accuracy of the annotated words should be 98% or above. If the
annotated.
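The guideline does not say how word accuracy is measured. One common reading, assumed here, is one minus the word-level edit distance divided by the number of reference words. The function names and the use of an inspector's reference transcript are illustrative assumptions.

```python
def word_edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two word lists (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, annotated: str) -> float:
    """Accuracy of an annotated transcript against a reference, in [0, 1]."""
    ref, hyp = reference.split(), annotated.split()
    if not ref:
        return 1.0 if not hyp else 0.0
    return max(0.0, 1.0 - word_edit_distance(ref, hyp) / len(ref))


def meets_threshold(reference: str, annotated: str, threshold: float = 0.98) -> bool:
    """Check the 98% requirement under the accuracy definition assumed above."""
    return word_accuracy(reference, annotated) >= threshold
```

Under this definition, one wrong word in a four-word sentence gives an accuracy of 0.75, so the check is only meaningful over a reasonably large sample of annotated words.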
Quality inspectors must live in the country where the target language