Gujarat (Standard Language) Specification

Uploaded by

Dered

Gujarati (standard language) labeling specification

I. Data annotation
1. Validity judgment
Every fully qualified utterance must be segmented and labeled sentence by
sentence. A single sentence is judged unqualified, and does not need to be
segmented, in any of the following situations:
1) Two people speak at the same time in the sentence. If their voices are of
similar volume and the overlap is extensive, the sentence is marked as
invalid speech. If the overlap is small (only one or two words) and the
main speaker's content can be heard clearly, the main speaker's content is
transcribed normally;
2) If part of a sentence is inaudible and its content cannot be determined,
the sentence is invalid;
3) If a sentence has strong noise (environmental noise, equipment noise) that
makes it difficult to hear the main speaker's content, the sentence is
judged as invalid;
4) If a sentence contains dropped frames, the sentence is judged as invalid;
5) If a sentence is not a normal human voice (automated customer service,
synthetic voice, TV broadcast voice), the sentence is judged as invalid;
6) If a sentence contains non-language parts, the sentence is judged as
invalid;
7) If a sentence involves sensitive information (politically sensitive,
religiously sensitive, pornographic, or violent content), the sentence is
judged as invalid.
2. Valid speech segmentation
1) Annotators must respect semantic coherence and cut the audio into sentence
units. A sentence that is too long can be cut into its component clauses.
No segment may exceed 8 seconds, but segments should not be too short
either. In labeling practice, a natural speech segment averages 5 to 6
seconds;
2) The best position for each time boundary is at the lowest point of the
waveform;
3) The voices of different speakers must not be placed in the same segment;
4) Try to leave 0.2 to 0.3 seconds of silence on each side of the labeled
speech segment. If no silence of that length is available, this is not
mandatory. Keep sudden noises out of the segment as far as possible: the
silence reserved before and after the speech may be shortened to avoid a
sudden noise, but the speech itself must not be cut off;
5) A one-word response must also be segmented. If it can be merged with an
adjacent sentence, merge it wherever possible.
6) If the speaker pauses and the silence inside a sentence lasts more than 2
seconds, the sentence must be cut into two segments regardless of its
meaning. If the pause is shorter than 2 seconds and the whole sentence does
not exceed 8 seconds, keep it as one segment.
7) If a speaker pauses for no more than 2 seconds in mid-speech, there is
noise during the pause, and splitting would leave incoherent segments with
incomplete meaning, the sentence does not need to be split.
8) Segments with speech clipping, amplitude truncation, dropped frames, or
abnormal energy are regarded as invalid.
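The pause and length rules above can be sketched in code. This is a minimal illustration, not part of the specification: it assumes the speech of one speaker is already available as a list of (start, end) intervals in seconds, and the names `split_on_pauses` and `too_long` are hypothetical. A pause of exactly 2 seconds is treated as "no more than two seconds" and therefore not split.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds

MAX_PAUSE_S = 2.0    # a pause longer than this forces a split (rule 6)
MAX_SEGMENT_S = 8.0  # upper bound on the length of one segment (rule 1)


def split_on_pauses(speech: List[Interval],
                    max_pause: float = MAX_PAUSE_S) -> List[Interval]:
    """Merge one speaker's speech intervals into segments.

    A new segment starts whenever the silence between two consecutive
    intervals is longer than max_pause.
    """
    segments: List[Interval] = []
    for start, end in sorted(speech):
        if segments and start - segments[-1][1] <= max_pause:
            prev_start, prev_end = segments[-1]
            segments[-1] = (prev_start, max(prev_end, end))
        else:
            segments.append((start, end))
    return segments


def too_long(segments: List[Interval],
             max_len: float = MAX_SEGMENT_S) -> List[Interval]:
    """Return the segments that exceed the maximum allowed duration."""
    return [(s, e) for s, e in segments if e - s > max_len]


speech = [(0.0, 1.5), (2.0, 4.0), (6.5, 9.0)]
segments = split_on_pauses(speech)
print(segments)            # [(0.0, 4.0), (6.5, 9.0)]
print(too_long(segments))  # []
```

Segments reported by `too_long` would still need a manual cut at a clause boundary, since the sketch does not look at meaning or at the waveform.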
3. Speaker identification
Different speakers in the same passage must be given different identity IDs,
and the gender of each speaker must be labeled.
4. Content transcription
Annotators must transcribe exactly what they hear in the audio. The transcript
must match the speech completely, with no added words, missing words, or typos.
The general guidelines are as follows:
1) Case: If the first letter of a word is normally capitalized, transcribe
it according to normal writing habits. For example: China, Microsoft
2) Numbers: When a number is spoken, do not transcribe it with Arabic
numerals. Write it out in the words of this language.
Original text           Transcribe
I'm 15 years old        I'm fifteen years old
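The number rule can be illustrated with a short sketch. It uses English number words only because the example above is in English; a real project would need the number words of the target language. The function names are hypothetical, and the sketch deliberately covers only 0 to 99.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 99 in English words."""
    if not 0 <= n <= 99:
        raise ValueError("this sketch only covers 0 to 99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def spell_out_numbers(text: str) -> str:
    """Replace every run of digits in text with its spelled-out form."""
    return re.sub(r"\d+", lambda m: number_to_words(int(m.group())), text)


print(spell_out_numbers("I'm 15 years old"))  # I'm fifteen years old
```

How a number is read (for example, a year or a phone number read digit by digit) depends on what the speaker actually says, so any automatic conversion is only a starting point for the annotator.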

3) Spelling class words:
Words that are spelled out letter by letter are transcribed as capital
letters separated by spaces. For example:
Original text           Transcribe
Five third pm           Five third P M
FBI                     F B I
NFC                     N F C
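A minimal sketch of this formatting rule follows. The helper name `spell_out_letters` is hypothetical, and deciding which words were actually spelled out by the speaker remains the annotator's judgment.

```python
def spell_out_letters(token: str) -> str:
    """Format a spelled-out word as capital letters separated by spaces.

    Non-letter characters (such as periods in "p.m.") are dropped.
    """
    return " ".join(ch.upper() for ch in token if ch.isalpha())


print(spell_out_letters("FBI"))  # F B I
print(spell_out_letters("pm"))   # P M
```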

4) Abbreviation
Do not use abbreviations when transcribing. Always write out the full word
as it is pronounced. For example:
Original text           Transcribe
This is Dr. Smith       This is doctor Smith

5) Punctuation
Punctuation is used according to grammatical rules.
Punctuation spoken aloud by the speaker must be transcribed as words, for
example: "@" is transcribed as "at", and ".com" is written as "dot com".
Only the following punctuation marks are allowed during transcription:
commas (,), hyphens (-), which can only appear in the middle of words,
periods (.), exclamation marks (!), single quotation marks ('), and question
marks (?). Punctuation marks other than these are not allowed. Any symbol
that is added must comply with grammatical rules. All symbols must be typed
in the normal English input mode.
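A check for stray punctuation could look like the sketch below. It assumes the allowed set is comma, hyphen, period, exclamation mark, single quotation mark, and question mark, and that special labels such as [N] are checked separately (their brackets would be reported here). The function name is hypothetical.

```python
import unicodedata

# Assumed allowed set, taken from the punctuation rule in this section.
ALLOWED_PUNCTUATION = set(",-.!'?")


def disallowed_symbols(text: str) -> list:
    """Return every punctuation or symbol character that is not allowed.

    Unicode categories starting with "P" (punctuation) or "S" (symbol)
    are inspected, so full-width marks typed outside the English input
    mode are reported as well.
    """
    return [ch for ch in text
            if unicodedata.category(ch)[0] in ("P", "S")
            and ch not in ALLOWED_PUNCTUATION]


print(disallowed_symbols("Hello, world!"))        # []
print(disallowed_symbols("mail me @ home; ok？"))  # ['@', ';', '？']
```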
6) Modal particle
Modal particles should be transcribed accurately according to pronunciation
and semantics. Pure laughter does not need to be labeled. If a modal
particle contributes to the meaning of the surrounding sentences, it must be
labeled. For example: "Shall we have dinner together later?" "Hmm." The
"Hmm" here is a response to the question and carries meaning. If a modal
particle does not contribute to the meaning of the context (for example,
unconscious humming), it does not need to be labeled, but labeling it is
not an error.
7) Other
• Profanity is transcribed as spoken. Replacing it with letters is
forbidden.
• Internet buzzwords and common Internet words are transcribed according
to common usage.
• If words are repeated in the speech, all repetitions must be
transcribed.
• If the audio is clear and the pronunciation can be determined but the
meaning is uncertain (for example, ordinary personal names), a homophone
may be used, provided the written text matches the pronunciation. When
the context makes the meaning clear, choose the words that fit both the
pronunciation and the meaning.
• If a word is not finished, add a hyphen (-) after it and leave a space
between it and the following word, for example: I want to go to s- school.
Note that a sentence must end with a complete word. If the unfinished
word is at the end of the sentence, discard it and do not include it in
the segment.
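The rule that a sentence must end with a complete word can be checked on the transcript. The sketch below assumes unfinished words are written with a trailing hyphen followed by a space, as the rule describes. The helper name is hypothetical, and the corresponding audio boundary still has to be moved by the annotator.

```python
def drop_trailing_fragment(sentence: str) -> str:
    """Remove unfinished words (ending in "-") from the end of a sentence.

    Unfinished words in the middle of the sentence are kept.
    """
    words = sentence.split()
    while words and words[-1].endswith("-"):
        words.pop()
    return " ".join(words)


print(drop_trailing_fragment("I want to go to s- school"))  # unchanged
print(drop_trailing_fragment("I want to go to s-"))         # I want to go to
```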
5. Special symbols
If any of the following situations occurs during labeling, the corresponding
special label must be added. Labels must be well formed: avoid unpaired labels,
inconsistent case, and unbalanced brackets.
Is the data valid | Noise | Special labels | Explanation | Role labeling | Text annotation
Valid data | No noise | None | Transcribe according to the specification or according to what you hear. | O1 or O2 … | I went to dinner today.
Valid data | | [N] | Noise within a sentence must be marked with [N] at the end of the sentence. There is no need to distinguish the type of noise. | O1 or O2 … | I went to dinner today [N]
Valid data | | [HM] | Content rapped by the speaker must be marked with [HM] at the end of the sentence. | O1 or O2 … | I'm drunk alone [HM]
Valid data | | [OVERLAP/] [/OVERLAP] | The speech overlaps and one speaker is particularly clear. Transcribe only the speech of the person who speaks clearly. The role marks that speaker, and the affected text is enclosed in the label pair. | O1 or O2 … | Today I went to [OVERLAP/] for dinner [/OVERLAP]
Invalid data | Invalid voice segment of the recorder's voice | [IVS] | Only noise paragraphs longer than 0.5 seconds will be marked. For example: the voices overlap and are of similar volume; speech frame drop; speech cut-off; speech has an echo; not the normal tone of speech, such as singing or pinching the throat; non-target language; some words in the speech segment are inaudible or cannot be transcribed because of noise. | N | [IVS]
Invalid data | Invalid voice segment of a non-recording person's voice | [OIVS] | Only noise paragraphs longer than 0.5 seconds will be marked. For example: TV vocals; broadcast-style program narration, commentary, or advertisements; music with vocals; etc. | N | [OIVS]
Invalid data | Sensitive information | [PIL] | The voice contains private information of the recorder: detailed address, mobile phone number, ID number, bank card number, social security number, passport number, etc. | N | [PIL]
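The requirement that labels be well formed (paired, consistent in case, balanced brackets) lends itself to an automatic check. The following is a sketch under the assumption that the only legal labels are the seven listed in this section. The function name and error messages are hypothetical.

```python
import re

# Assumption: these are the only legal labels, written exactly like this.
VALID_LABELS = {"[N]", "[HM]", "[IVS]", "[OIVS]", "[PIL]",
                "[OVERLAP/]", "[/OVERLAP]"}


def label_errors(text: str) -> list:
    """Return a list of label problems found in one annotated sentence."""
    errors = []
    if text.count("[") != text.count("]"):
        errors.append("unbalanced brackets")
    # Any bracketed token must be one of the known labels (case-sensitive).
    for label in re.findall(r"\[[^\[\]]*\]", text):
        if label not in VALID_LABELS:
            errors.append(f"unknown label {label}")
    # [OVERLAP/] and [/OVERLAP] must appear as properly ordered pairs.
    inside_overlap = False
    for tag in re.findall(r"\[OVERLAP/\]|\[/OVERLAP\]", text):
        if tag == "[OVERLAP/]":
            if inside_overlap:
                errors.append("nested [OVERLAP/]")
            inside_overlap = True
        else:
            if not inside_overlap:
                errors.append("[/OVERLAP] without [OVERLAP/]")
            inside_overlap = False
    if inside_overlap:
        errors.append("[OVERLAP/] without [/OVERLAP]")
    return errors


print(label_errors("Today I went to [OVERLAP/] for dinner [/OVERLAP]"))  # []
print(label_errors("I went to dinner today [n]"))  # ['unknown label [n]']
```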

6. Quality Requirements
The accuracy rate of labeled words must be 98% or above. If any part of a
sentence contains a labeling error (wrong labeling, wrong validity judgment,
etc.), the whole sentence is counted as incorrectly labeled.
The accuracy rate of role labeling, counted by sentence, must be above 90%.
The accuracy rate of special labels must be not less than 85%, where
accuracy rate = 1 - (number of erroneous labels (wrong, extra, or missed
labels) / number of labels annotated).
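As a worked example of the special-label requirement, the sketch below assumes that the accuracy rate is one minus the share of erroneous labels, which is the reading under which an 85% minimum makes sense. The function name and the sample counts are hypothetical.

```python
LABEL_ACCURACY_MIN = 0.85  # special labels: not less than 85%


def label_accuracy(total_labels: int, wrong: int, extra: int,
                   missed: int) -> float:
    """Accuracy of special labels, assuming accuracy = 1 - errors / total."""
    if total_labels <= 0:
        raise ValueError("total_labels must be positive")
    return 1.0 - (wrong + extra + missed) / total_labels


# Hypothetical batch: 200 labels, of which 10 wrong, 8 extra, 6 missed.
accuracy = label_accuracy(200, wrong=10, extra=8, missed=6)
print(f"{accuracy:.2%}")               # 88.00%
print(accuracy >= LABEL_ACCURACY_MIN)  # True
```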
7. Data check
To avoid low-level format errors in the exported data, carry out the following
checks before the data is stored in the database. If any of the following
problems is found during final acceptance, the whole delivery is treated as
unqualified.
1) Only the following characters are allowed in the annotation text:
• Alphabet of this language
• Space
• Common punctuation (, . ! ? -)
2) Merge multiple consecutive spaces into one, and remove the spaces at the
beginning and end of each sentence
3) Arabic numerals are not allowed in the annotation text
4) Each audio file must have corresponding .metadata and .txt files. Audio
with any missing file is regarded as invalid
5) Check the timestamp format and make sure there are no negative values
6) Check metadata files for unexpected line breaks
7) Check for audio files smaller than 15 KB
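Several of these checks are mechanical and can be scripted. The sketch below covers checks 2, 3, 4, 5, and 7 under stated assumptions: the companion files share the audio file's name with a different extension, "15 k" means 15 KiB, and a valid timestamp pair also has its start before its end. All function names are hypothetical.

```python
import re
from pathlib import Path

MIN_AUDIO_BYTES = 15 * 1024  # assumption: "15 k" means 15 KiB


def normalize_spaces(text: str) -> str:
    """Collapse runs of whitespace and strip both ends (check 2)."""
    return " ".join(text.split())


def has_arabic_digits(text: str) -> bool:
    """True if the text contains any of the digits 0-9 (check 3)."""
    return re.search(r"[0-9]", text) is not None


def valid_timestamps(start: float, end: float) -> bool:
    """No negative values (check 5); start before end is an added assumption."""
    return 0 <= start < end


def file_problems(audio: Path) -> list:
    """Report missing companion files and undersized audio (checks 4 and 7)."""
    problems = []
    for suffix in (".metadata", ".txt"):
        if not audio.with_suffix(suffix).is_file():
            problems.append(f"missing {suffix} file")
    if audio.stat().st_size < MIN_AUDIO_BYTES:
        problems.append("audio smaller than 15 KiB")
    return problems


print(normalize_spaces("  I  went   to dinner today "))  # I went to dinner today
print(has_arabic_digits("I'm 15 years old"))             # True
print(valid_timestamps(-0.2, 3.1))                       # False
```

The character whitelist in check 1 is left out because it depends on the exact alphabet of the target language.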
