Gujarat (standard language) specification

Uploaded by

Dered
Copyright
© All Rights Reserved
Gujarati (standard language) labeling specification

I. Data annotation
1. Validity judgment
Each qualified utterance must be segmented and labeled sentence by sentence.
A sentence is judged unqualified and does not need to be segmented in any of
the following situations:
1) If two speakers overlap in a sentence at similar volume and the overlap
is extensive, the sentence is marked as invalid speech. If the overlap is
brief (only one or two words) and the main speaker's content can be heard
clearly, the sentence remains valid and the main speaker's content is
transcribed normally;
2) If part of a sentence is inaudible and its content cannot be determined,
the sentence is invalid;
3) If strong noise (environmental noise, equipment noise) makes it difficult
to hear the main speaker's content, the sentence is invalid;
4) If frames are dropped in a sentence, the sentence is invalid;
5) If a sentence is not a normal human voice (automated customer service,
synthetic voice, TV broadcast voice), the sentence is invalid;
6) If a sentence contains non-language parts, the sentence is invalid;
7) If a sentence involves sensitive information (politically sensitive,
religiously sensitive, pornographic or violent content), the sentence is
invalid.
2. Valid speech segmentation
1) Annotators must consider semantic coherence and segment the audio in
sentence units. Sentences that are too long can be segmented into their
component clauses. Each segment must not exceed 8 seconds, but it should not
be too short either. In labeling practice, a natural speech segment averages
5-6 seconds;
2) The optimal position for each time boundary is the lowest point of the
waveform;
3) The voices of different speakers cannot be included in the same segment;
4) Try to leave 0.2 to 0.3 seconds of silence on each side of the marked
speech segment. If no silence that long is available, this is not required.
Keep sudden noise out of the segment where possible; the silence reserved
before and after the speech can be shortened to avoid sudden noise, but the
speech itself must not be cut off;
5) A one-word response must also be segmented; if it can be merged with an
adjacent sentence, it should be merged where possible;
6) If a speaker's pause creates a silence of more than 2 seconds within a
sentence, the sentence must be split into two segments, regardless of its
meaning. If the pause is shorter than 2 seconds and the sentence does not
exceed 8 seconds, keep it as one segment;
7) If the speaker pauses for no more than 2 seconds mid-speech and there is
noise during the pause, but splitting at the pause would leave incoherent,
semantically incomplete segments, the sentence does not need to be split;
8) Segments with speech clipping, amplitude clipping, dropped frames, or
abnormal energy are regarded as invalid.
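As a rough illustration of the 8-second limit, the 2-second pause rule, and the 0.2 to 0.3 second silence margin described above, the following Python sketch merges one speaker's detected speech intervals into segments. The function names, the (start, end) interval representation, and the choice of a 0.2 s margin are assumptions made for this example and are not part of the specification. Placing boundaries at waveform minima and avoiding sudden noise still require an annotator's judgment.

```python
from typing import List, Tuple

MAX_SEGMENT_S = 8.0  # maximum segment length
MAX_PAUSE_S = 2.0    # a longer pause forces a split
PAD_S = 0.2          # assumed margin, lower end of the 0.2 to 0.3 s range

Interval = Tuple[float, float]  # (start, end) in seconds, one speaker only


def merge_intervals(speech: List[Interval]) -> List[Interval]:
    """Greedily merge speech intervals into sentence-level segments.

    An interval joins the current segment only if the pause before it is at
    most MAX_PAUSE_S and the merged segment stays within MAX_SEGMENT_S.
    A single interval longer than MAX_SEGMENT_S is left for manual splitting.
    """
    segments: List[Interval] = []
    for start, end in sorted(speech):
        if segments:
            seg_start, seg_end = segments[-1]
            pause = start - seg_end
            if pause <= MAX_PAUSE_S and end - seg_start <= MAX_SEGMENT_S:
                segments[-1] = (seg_start, end)
                continue
        segments.append((start, end))
    return segments


def pad_segments(segments: List[Interval], audio_len: float,
                 pad: float = PAD_S) -> List[Interval]:
    """Add up to `pad` seconds of silence per side without overlapping neighbors."""
    padded: List[Interval] = []
    for i, (start, end) in enumerate(segments):
        # Room on each side: up to the file edge, or half the gap to a neighbor.
        room_left = start if i == 0 else (start - segments[i - 1][1]) / 2
        room_right = (audio_len - end if i == len(segments) - 1
                      else (segments[i + 1][0] - end) / 2)
        padded.append((start - min(pad, room_left), end + min(pad, room_right)))
    return padded
```

For instance, intervals (0.5, 2.0) and (2.4, 4.0) are merged because the pause between them is 0.4 s, while an interval that starts 2.5 s after the previous one opens a new segment.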
3. Speaker identification
Different speakers in the same audio passage must be marked with different
identity IDs, and each speaker's gender must be marked.
4. Content transcription
Annotators must transcribe the content exactly as heard in the audio. The
transcription must match the speech completely, with no added words, missing
words, or typos. The general guidelines are as follows:
1) Case: If the first letter of a word is normally capitalized, transcribe
it according to normal writing conventions. For example: China, Microsoft
2) Numbers: When numbers appear in the text, they cannot be transcribed as
Arabic numerals and must be written out in the words of this language.
Original text          Transcription
I'm 15 years old       I'm fifteen years old

3) Spelled-out words: Letters are capitalized and separated by spaces. For
example:
Original text          Transcription
Five third pm          Five third P M
FBI                    F B I
NFC                    N F C

4) Abbreviations: Abbreviated forms of words cannot be used in transcription.
Always write out the full word as pronounced. For example:
Original text          Transcription
This is Dr. Smith      This is doctor Smith

5) Punctuation
Punctuation is used according to grammatical rules.
Punctuation spoken aloud by the speaker must be transcribed as words. For
example, "@" is transcribed as "at" and ".com" as "dot com".
Only the following punctuation marks are allowed in transcription: commas
(,), hyphens (-, which can only appear in the middle of words), periods (.),
exclamation marks (!), single quotation marks ('), and question marks (?).
No other punctuation marks are allowed. Any symbol added must comply with
grammatical rules, and all symbols must be typed in English (half-width)
input mode.
6) Modal particles
Modal particles must be transcribed accurately according to pronunciation
and meaning. Pure laughter does not need to be labeled. If a modal particle
contributes to the meaning of the surrounding sentences, it must be labeled.
For example, in "Shall we have dinner together later?" "Hmm.", the "Hmm" is
a response to the question and carries meaning. If a modal particle does not
contribute to the meaning of the context (for example, unconscious humming),
it does not need to be labeled, although labeling it is not an error.
7) Other
• Profanity is transcribed as spoken; replacing it with letters is
forbidden.
• Internet slang and common Internet words are transcribed according to
common usage.
• If words are repeated in the speech, every repetition must be transcribed.
• If the audio is clear and the pronunciation can be determined but the
exact word is uncertain (for example, ordinary personal names), a homophone
may be used, provided the text matches the pronunciation. When the context
makes the meaning clear, choose the word that fits both the pronunciation
and the meaning.
• If a word is left unfinished, add "-" after it, followed by a space before
the next word, for example: I want to go to s- school. The sentence must end
with a complete word. If the unfinished word is at the end of the sentence,
discard it and do not include it in the segment.
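Two of the transcription rules above are mechanical enough to sketch in Python: writing spelled-out words as capital letters separated by spaces, and detecting a sentence that ends in an unfinished word. The helper names below are hypothetical, and the sketch assumes spelled-out words are Latin-letter acronyms; it is an illustration, not a tool defined by this specification.

```python
def spell_out(token: str) -> str:
    """Format a letter-by-letter word as capitals separated by spaces.

    Intended for Latin-letter acronyms such as "FBI"; non-letter characters
    (spaces, dots) are dropped.
    """
    letters = [ch for ch in token if ch.isalpha()]
    return " ".join(ch.upper() for ch in letters)


def ends_with_truncated_word(text: str) -> bool:
    """Return True if the last token is an unfinished word marked with "-"."""
    tokens = text.split()
    if not tokens:
        return False
    last = tokens[-1].rstrip(".,!?")  # ignore sentence-final punctuation
    return len(last) > 1 and last.endswith("-")
```

A sentence for which `ends_with_truncated_word` returns True would need its final fragment removed before delivery.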
5. Special symbols
If any of the following situations occurs during labeling, the corresponding
special label must be added. Labels must be well formed: avoid unpaired
labels, inconsistent case, and unbalanced brackets.
Each entry below gives the special label, its explanation, the role labeling,
and an example text annotation.

Valid data

Special label:   none (no noise)
Explanation:     Transcribe according to the specification or according to
                 what you hear.
Role labeling:   O1 or O2 …
Text annotation: I went to dinner today.

Special label:   [N]
Explanation:     When a sentence contains noise, [N] must be marked at the
                 end of the sentence. The type of noise does not need to be
                 distinguished.
Role labeling:   O1 or O2 …
Text annotation: I went to dinner today [N]

Special label:   [HM]
Explanation:     Rap content by the speaker must be marked with [HM] at the
                 end of the sentence.
Role labeling:   O1 or O2 …
Text annotation: I'm drunk alone [HM]

Special label:   [OVERLAP/] [/OVERLAP]
Explanation:     When speech overlaps and one speaker is particularly clear,
                 transcribe only the speech of the clear speaker. The role
                 marks that speaker, and the affected text is enclosed in
                 the label pair.
Role labeling:   O1 or O2 …
Text annotation: Today I went to [OVERLAP/] for dinner [/OVERLAP]

Invalid data

Category:        Invalid voice segment of the recorder's voice
Special label:   [IVS]
Explanation:     Only noise passages longer than 0.5 seconds are marked.
                 For example:
                 voices overlap at similar volume;
                 speech frame drop;
                 speech cut-off;
                 speech has an echo;
                 not a normal speaking tone, such as singing or speaking in
                 a pinched voice;
                 non-target language;
                 some words in the speech segment are inaudible or cannot be
                 transcribed because of noise.
Role labeling:   N
Text annotation: [IVS]

Category:        Invalid voice segment of a non-recording person's voice
Special label:   [OIVS]
Explanation:     Only noise passages longer than 0.5 seconds are marked.
                 For example:
                 TV voices;
                 broadcast-style narration, commentary, or advertisements;
                 music with vocals; etc.
Role labeling:   N
Text annotation: [OIVS]

Category:        Sensitive information
Special label:   [PIL]
Explanation:     The speech contains private information of the recorder:
                 detailed address, mobile phone number, ID number, bank card
                 number, social security number, passport number, etc.
Role labeling:   N
Text annotation: [PIL]
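The requirement that labels be well formed (paired, consistently cased, with balanced brackets) can be checked automatically. The sketch below is one possible approach in Python, built only from the label names listed in this section; the function name and the error messages are assumptions for illustration.

```python
import re
from typing import List

SINGLE_LABELS = {"[N]", "[HM]", "[IVS]", "[OIVS]", "[PIL]"}
OPEN_LABEL, CLOSE_LABEL = "[OVERLAP/]", "[/OVERLAP]"
BRACKETED = re.compile(r"\[[^\[\]]*\]")  # any "[...]" without nested brackets


def label_errors(text: str) -> List[str]:
    """Return a list of label problems found in one annotated sentence."""
    errors: List[str] = []
    if text.count("[") != text.count("]"):
        errors.append("unbalanced brackets")
    open_count = 0  # number of [OVERLAP/] labels still waiting for a close
    for label in BRACKETED.findall(text):
        if label == OPEN_LABEL:
            open_count += 1
        elif label == CLOSE_LABEL:
            if open_count == 0:
                errors.append("[/OVERLAP] without [OVERLAP/]")
            else:
                open_count -= 1
        elif label not in SINGLE_LABELS:
            # Also catches wrong case, for example [n] instead of [N].
            errors.append(f"unknown label {label}")
    if open_count > 0:
        errors.append("[OVERLAP/] without [/OVERLAP]")
    return errors
```

An empty list means no label problem was detected; the check does not verify that [N] or [HM] sits at the end of the sentence.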

6. Quality requirements
The word accuracy rate of the labeled text must be 98% or above. If any part
of a sentence contains a labeling error, such as a wrong label or a wrong
validity judgment, the whole sentence is counted as incorrectly labeled.
The sentence-level accuracy rate of role labels must be above 90%.
The accuracy rate of special labels must be not less than 85%, where
accuracy rate = 1 - (number of wrong labels, including wrong, extra, and
missed labels) / (number of labels).
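A short Python sketch of the threshold arithmetic follows. It reads the accuracy rate as one minus the share of wrong items, which is an assumption about the intended formula, and it treats all three thresholds as "greater than or equal to". The function and dictionary names are hypothetical.

```python
# Minimum accuracy per metric, taken from the percentages in this section.
THRESHOLDS = {"words": 0.98, "roles": 0.90, "special_labels": 0.85}


def accuracy(wrong: int, total: int) -> float:
    """Accuracy as 1 - wrong / total (assumed reading of the formula)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 - wrong / total


def meets_requirement(metric: str, wrong: int, total: int) -> bool:
    """Check one metric against its threshold."""
    return accuracy(wrong, total) >= THRESHOLDS[metric]
```

For example, 12 wrong, extra, or missed special labels out of 100 gives an accuracy of 88%, which passes the 85% requirement, while 16 out of 100 gives 84% and fails.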
7. Data check
To avoid low-level format errors in the exported data, the following checks
must be carried out before the data is stored in the database. If any of the
following problems is found during final acceptance, the whole delivery is
treated as unqualified.
1) Only the following characters are allowed in the annotation text:
• Letters of this language's alphabet
• Spaces
• Common punctuation (, . ! ? -)
2) Multiple consecutive spaces must be merged into one, and spaces at the
beginning and end of a sentence must be removed
3) Arabic numerals are not allowed in the annotation text
4) Each audio file must have corresponding .metadata and .txt files. Audio
with any missing file is regarded as invalid
5) Check the timestamp format to rule out negative numbers
6) Check metadata files for unexpected line wraps
7) Check for audio files smaller than 15 KB
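The text-level checks (allowed characters, space normalization, no Arabic numerals) and the timestamp check can be sketched in Python as follows. Several points are assumptions for this example: the alphabet is approximated by the Unicode Gujarati block (U+0A80 to U+0AFF), special labels such as [N] are stripped before the character check, and all function names are hypothetical. File-level checks (missing .metadata or .txt files, file size) are left out.

```python
import re
from typing import List

LABEL = re.compile(r"\[/?[A-Z]+/?\]")  # special labels, e.g. [N] or [OVERLAP/]
# Assumed alphabet: the Unicode Gujarati block, plus space and , . ! ? -
ALLOWED = re.compile(r"[\u0A80-\u0AFF ,.!?\-]*")


def normalize_spaces(text: str) -> str:
    """Merge consecutive whitespace and strip both ends."""
    return " ".join(text.split())


def text_errors(text: str) -> List[str]:
    """Return format problems found in one line of annotation text."""
    errors: List[str] = []
    body = normalize_spaces(LABEL.sub(" ", text))  # labels are checked elsewhere
    if re.search(r"[0-9]", body):
        errors.append("Arabic numerals present")
    if not ALLOWED.fullmatch(body):
        errors.append("disallowed character present")
    return errors


def valid_timestamps(start: float, end: float) -> bool:
    """Timestamps must be non-negative and the segment must have positive length."""
    return 0 <= start < end
```

Running `text_errors` on every exported line and `valid_timestamps` on every segment before delivery would surface these format problems early.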
