Mod 3
Introduction
String processing refers to the manipulation and analysis of text data using various
operations and functions. It involves working with sequences of characters, known as
strings.
The central data structure utilised by numerous early synthesis systems was
commonly known as a string rewriting mechanism. The formalism stores the linguistic
representation of an utterance as a string. The initial state of the string consists of textual
content. As the processing occurs, the string undergoes modifications or enhancements
through the addition of supplementary symbols. This method was utilised by systems
such as MITalk and the CSTR Alvey synthesiser.
Text Mining
Text mining is a specialised form of data mining that focuses on analysing and
extracting information from unstructured text data. It involves utilising natural language
processing (NLP) techniques to mine vast amounts of unstructured text for valuable
insights. Text mining can be used independently for specific objectives or as a crucial
initial step in the broader data mining process.
Through text mining, unstructured text data can be transformed into structured
data, enabling various data mining tasks like association rule mining, classification and
clustering. Businesses can leverage this capability to extract valuable insights from
diverse data sources such as customer reviews, social media posts and news articles.
The extraction of insights from unstructured text data and the subsequent use of these insights to make data-driven decisions have become indispensable tools for organisations.
All string processing functions in the stringr library are prefixed with "str_". The behaviour described suggests that R will automatically provide a list of available functions and display them as options when the user types "str_" and presses the tab key.
Consequently, there is no obligation to memorise each function name. One additional
advantage is that the string being processed is consistently placed as the first argument
in the functions within this package, thereby facilitating the usage of the pipe operation.
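Because the string is consistently the first argument, str_* calls chain naturally with the pipe. A minimal sketch (assuming the magrittr package for %>%; the toy input string is invented):
library(stringr)
library(magrittr)  # provides the %>% pipe
"  Los Angeles  " %>%
  str_trim() %>%      # drop surrounding whitespace
  str_to_upper() %>%  # convert to uppercase
  str_length()        # count the characters
## [1] 11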
First, an explanation of the usage of the functions within the stringr package will be provided.
The primary source of examples for this analysis will be the second case study,
which pertains to self-reported heights of students. The majority of the topic is focused
on instructing the reader on regular expressions (regex) and the utilities available in the
stringr package.
(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)
Language applications:
Java Substring
substr in C++
Python find
3. Replace in String:
Modifying strings by replacing specific characters, words or phrases is a common
operation. To achieve this, you can follow the methods outlined below:
a) Create a New String:
One way to handle the task is to create a new string from scratch, replacing
every instance of the substring S1 with S2 when encountered in the original
string S.
b) Iterative Approach:
Using a variable "i", iterate through the characters of string S and take the following steps:
If the prefix substring of string S starting from index i matches S1, add the string S2 to the new string "ans."
If there is no match, add the current character to the new string "ans."
After completing the above procedures, the resulting string “ans” will contain the
modifications. You can then print the “ans” as the final output.
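A base-R sketch of this iterative approach (gsub() performs the same job in one call; the helper replace_all() here is only for illustration):
replace_all <- function(S, S1, S2) {
  ans <- ""
  i <- 1
  n <- nchar(S); m <- nchar(S1)
  while (i <= n) {
    # does the substring of S starting at index i match S1?
    if (i + m - 1 <= n && substr(S, i, i + m - 1) == S1) {
      ans <- paste0(ans, S2)   # append the replacement S2
      i <- i + m               # skip past the matched substring
    } else {
      ans <- paste0(ans, substr(S, i, i))  # copy the current character
      i <- i + 1
    }
  }
  ans
}
replace_all("one two one", "one", "1")
## [1] "1 two 1"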
4. Finding the Length of String
Finding the length or size of a given string is one of the most common operations on
a String. The length of a string is determined by how many characters are included in
it.
(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)
Subsequence of a String
A subsequence of a string is a sequence that can be derived from it by deleting zero or more characters without changing the order of the remaining characters; a string of size n has 2^n subsequences, including the empty one.
(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)
8. Substring of a String
A continuous portion of a string, or a string inside another string, is referred to as a substring. In general, there are n*(n+1)/2 non-empty substrings for a string of size n.
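A quick check of that count in R, using "abc" as a toy example:
s <- "abc"
n <- nchar(s)
# enumerate every substring from position i to position j (i <= j)
subs <- unlist(lapply(1:n, function(i) sapply(i:n, function(j) substr(s, i, j))))
subs
## [1] "a"   "ab"  "abc" "b"   "bc"  "c"
length(subs) == n * (n + 1) / 2
## [1] TRUE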
9. Binary String
A binary string is a unique type of string that only contains two-character types, such
as 0 and 1.
(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)
Importance of Regex
Regular expressions (regex) are utilised in Google Analytics for URL matching and
supporting search and replace operations. They are also commonly supported in popular
text editors such as Sublime, Notepad++, Brackets, Google Docs and Microsoft Word.
Example: Regular expression for an email address:
The regular expression provided is used to validate an email address. It consists of a pattern that matches a sequence of characters. The pattern starts with a caret symbol (^), which indicates the beginning of the string. The next part of the pattern, enclosed in square brackets, specifies the range of characters allowed in the local part of the address: lowercase letters (a-z), uppercase letters (A-Z), digits (0-9), the underscore (_) and the dot. This is followed by the @ symbol, a similar character class for the domain name, a literal dot (\.) and a top-level domain of at least two letters:
^[a-zA-Z0-9_.]+@[a-zA-Z0-9.]+\.[a-zA-Z]{2,}$
The above regular expression can be used for checking whether a given set of characters is an email address or not.
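A minimal sketch of applying this pattern in R with grepl() (the sample addresses are invented; note the doubled backslash needed inside an R string):
pattern <- "^[a-zA-Z0-9_.]+@[a-zA-Z0-9.]+\\.[a-zA-Z]{2,}$"
emails <- c("user_01@example.com", "not-an-email", "a@b.io")
grepl(pattern, emails)
## [1]  TRUE FALSE  TRUE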
1. Metacharacters
Metacharacters are a group of specialised operators that regex does not treat as literal characters; regex follows its own set of rules for them. Every line of text you encounter will most likely contain one of these operators. They consist of:
. \ | ( ) [ ] { } $ * + ?
2. Quantifiers
Quantifiers, despite their brevity in typing, possess significant potency: the alteration of a single position can have a significant impact on the overall output value. Quantifiers are primarily utilised to ascertain the extent of the match outcome. It is important to note that quantifiers exert their influence on the items directly preceding them. The following quantifiers are frequently employed in the process of identifying and analysing patterns within textual data:
* Matches the preceding item zero or more times
+ Matches the preceding item one or more times
? Matches the preceding item zero or one time
{n} Matches the preceding item exactly n times
{n,} Matches the preceding item at least n times
{n,m} Matches the preceding item between n and m times
The quantifiers can be utilised in conjunction with metacharacters, sequences and character classes in order to yield intricate patterns. The utilisation of various combinations of these quantifiers enables one to effectively identify and match a given pattern. The understanding of these quantifiers can be enhanced through two distinct notions:
Greedy quantifier: Quantifiers are greedy by default; the algorithm attempts to match the pattern as many times as the number of repetitions available for that pattern allows.
Non-greedy quantifier: Denoted by appending the symbol "?" (as in "*?"), a non-greedy quantifier halts its search upon encountering the initial match.
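A small sketch of the difference in R (the HTML-like input string is invented; perl = TRUE enables the non-greedy form):
x <- "<p>first</p><p>second</p>"
# greedy: "*" grabs as much as possible
regmatches(x, regexpr("<.*>", x))
## [1] "<p>first</p><p>second</p>"
# non-greedy: "*?" stops at the first possible match
regmatches(x, regexpr("<.*?>", x, perl = TRUE))
## [1] "<p>"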
3. Sequences
Sequences are composed of specific characters that are utilised to represent a pattern
within a provided string. The following is a list of commonly used sequences in R:
Sequences Description
\d Matches a digit character
\D Matches a non-digit character
\s Matches a space character
\S Matches a non-space character
\w Matches a word character
\W Matches a non-word character
\b Matches a word boundary
\B Matches a non-word boundary
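A brief sketch of two of these sequences in base R (the input string is invented):
x <- "Order 66 shipped on 2023-05-04"
# \\d+ : one or more digit characters
regmatches(x, gregexpr("\\d+", x))[[1]]
## [1] "66"   "2023" "05"   "04"
# \\w+ : runs of word characters (letters, digits, underscore)
regmatches(x, gregexpr("\\w+", x))[[1]]
## [1] "Order"   "66"      "shipped" "on"      "2023"    "05"      "04"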
4. Character classes
Character classes are defined as a collection of characters enclosed within square brackets [ ]. The classes match only the characters that are enclosed within the brackets. Quantifiers can be combined with these classes. An intriguing aspect is the caret (^) symbol: used inside a character class, it negates the expression, causing it to match all elements that do not match the specified pattern. The following section outlines the various character classes that are commonly utilised in regular expressions (regex):
Characters Description
[aeiou] Matches lowercase vowels
[AEIOU] Matches uppercase vowels
[0123456789] Matches any digit
[0-9] Same as the previous class
[a-z] Match any lowercase letter
[A-Z] Match any uppercase letter
[a-zA-Z0-9] Match any of the above classes
[^aeiou] Matches everything except lowercase vowels
[^0-9] Matches everything except digits
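Two quick illustrations of these classes with gsub() (the input strings are invented):
# replace every lowercase vowel with "*"
gsub("[aeiou]", "*", "regular expression")
## [1] "r*g*l*r *xpr*ss**n"
# keep only the digits by deleting everything that is not 0-9
gsub("[^0-9]", "", "R version 4.3.2")
## [1] "432"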
Curly braces, denoted by { }, are a common syntax element in regular expressions. They instruct the computer to repeat the preceding character (or set of characters) the number of times specified within the braces.
Example: {2} means that the preceding character is to be repeated 2 times.
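For instance, anchoring a {3} repetition checks for strings of exactly three digits (sample inputs invented):
grepl("^[0-9]{3}$", c("123", "12", "1234"))
## [1]  TRUE FALSE FALSE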
5. Wildcard ( . )
The wildcard ( . ) is a symbol used in various programming languages and regular
expressions to represent any character or set of characters. It is often used as a
placeholder or a pattern-matching tool. The dot wildcard can match any single character, including letters, numbers and symbols (but typically not the newline character).
The dot symbol, also known as the wildcard character, has the ability to substitute for
any other symbol.
Example : The Regular expression .* will tell the computer that any character
can be used any number of times.
6. Optional character ( ? )
The optional character, denoted by the symbol "?", indicates to the computer that the preceding character in the string to be matched may or may not be present.
Example: The format for a document file can be written as "docx?".
The symbol '?' indicates to the computer that the character 'x' may or may not be included in the file format name.
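A quick check of the "docx?" example in R (the file names are invented; $ anchors the match to the end):
grepl("docx?$", c("notes.doc", "notes.docx", "notes.pdf"))
## [1]  TRUE  TRUE FALSE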
7. The caret ( ^ ) symbol
The caret symbol (^) is used to set the position for a match.
The caret symbol is used to indicate to the computer that the match should commence
at the start of the string or line.
Example:The regular expression pattern ^\d{3} will successfully match with patterns
such as “901” in the string “901-333-”.
8. The dollar ( $ ) symbol
The dollar symbol ( $ ) is used to anchor the match at the end of the input. The regular expression specifies that the match should be located at the end of the string or immediately before the newline character (\n) at the end of the line or string.
Example:The regular expression pattern “-\d{3}$” can be used to match strings that
end with a three-digit number preceded by a hyphen. For instance, in the string “-901-
333”, the pattern will successfully match the substring “-333”.
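Both anchors together, sketched on the two strings from the examples above:
x <- c("901-333-", "-901-333")
grepl("^\\d{3}", x)   # starts with three digits
## [1]  TRUE FALSE
grepl("-\\d{3}$", x)  # ends with a hyphen followed by three digits
## [1] FALSE  TRUE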
9. Character Classes
Character classes are a fundamental concept in computer programming and regular
H[SUHVVLRQV7KH\DUHXVHGWRGH¿QHDVHWRIFKDUDFWHUVWKDWFDQEHPDWFKHGLQD
given pattern.A character class is capable of matching any single character from a
VSHFL¿HGVHWRIFKDUDFWHUV5HJXODUH[SUHVVLRQVDUHXVHGWRPDWFKWKHIXQGDPHQWDO
components of a language, such as letters, digits, spaces, symbols and other similar
entities.
The escape sequence “\s” is used to represent any whitespace characters,
including spaces and tabs.
The regular expression pattern “\S” is used to match any character that is not a
whitespace character.
The regular expression \d is used to match any digit character.
The regular expression \D matches any characters that are not digits.
The regular expression pattern \w is used to match any word character, which
includes alphanumeric characters.
The regular expression pattern \W is used to match any non-word character.
The regular expression “\b” is used to match any word boundary, which includes
spaces, dashes, commas, semi-colons and other similar characters.
The [set_of_characters] pattern is used to match any single character that is included in the set_of_characters. The default behaviour of the match is to consider case sensitivity.
Example:The pattern [abc] is used to match any occurrence of the characters a, b, or
c within a given string.
10. [^set_of_characters]: Negation
The notation [^set_of_characters] is used to represent a set of characters that should not be included in a given context. Negation refers to the ability to match any individual character that does not belong to a specified set of characters. The default behaviour of the match is to consider case sensitivity.
Example:The regular expression [^abc] is used to match any character except for the
characters a, b and c.
11. [first-last]: Character range
The character range refers to a pattern that matches any single character within a specified range, starting from the first character and ending with the last character.
Example:The regular expression [a-zA-Z] is used to match any character within the
range of a to z or A to Z.
12. The Escape Symbol ( \ )
The escape symbol (\) is used to match specific characters, such as '.', '+' and others. To match these characters, simply add a backslash (\) before the desired character. The computer will be instructed to interpret the subsequent character as a search character and include it in the evaluation of a matching pattern.
Example:The regular expression \d+[\+-x\*]\d+ can be used to identify patterns such
as “2+2” and “3*9” within the string “(2+2) * 3*9”.
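A sketch of that example in R; the character class [+*x-] lists the operators literally (the trailing "-" is literal because it sits at the end of the class):
x <- "(2+2) * 3*9"
regmatches(x, gregexpr("\\d+[+*x-]\\d+", x))[[1]]
## [1] "2+2" "3*9"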
13. Grouping Characters ( )
The process of grouping characters is achieved through the use of brackets ( ).
A collection of distinct symbols within a regular expression can be consolidated into a cohesive entity, functioning as a unified block. To achieve this, it is necessary to enclose the regular expression within brackets ( ).
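For instance, grouping lets a quantifier apply to a whole block rather than a single character:
# "(ab)+" repeats the two-character group, not just "b"
grepl("^(ab)+$", c("ababab", "abab", "aba"))
## [1]  TRUE  TRUE FALSE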
1. grep()
To identify a pattern within a character vector and obtain the element values or
indices as the output, the grep() function can be utilised:
# use the built in data set state.division
head ( as.character (state.division))
## [1] “East South Central” “Pacific” “Mountain”
## [4] “West South Central” “Pacific” “Mountain”
# find the elements which match the pattern
grep (“North”, state.division)
## [1] 13 14 15 16 22 23 25 27 34 35 41 49
# use value = TRUE to show the element value
grep (“North”, state.division, value = TRUE)
## [1] “East North Central” “East North Central” “West North Central”
## [4] “West North Central” “East North Central” “West North Central”
## [7] “West North Central” “West North Central” “West North Central”
## [10] “East North Central” “West North Central” “East North Central”
# can use the invert argument to show the non-matching elements
grep (“North | South”, state.division, invert = TRUE)
## [1] 2 3 5 6 7 8 9 10 11 12 19 20 21 26 28 29 30 31 32 33 37 38 39
## [24] 40 44 45 46 47 48 50
2. grepl()
To find a pattern in a character vector and to have logical (TRUE/FALSE) outputs, use grepl():
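For example, using the same text vector that appears in the regexpr() examples below:
text <- c("one word", "a sentence", "you and me", "three two one")
grepl("one", text)
## [1]  TRUE FALSE FALSE  TRUE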
3. regexpr()
The interpretation of the output from the regexpr() function is as follows: The first
element of the output indicates the starting position of the match in each element.
It should be noted that a value of -1 indicates the absence of a match. The second
element, referred to as the attribute “match length,” indicates the length of the match.
The value of the third element, “useBytes” attribute, is set to TRUE. This indicates that
the matching process was performed on a byte-by-byte basis, rather than a character-by-
character basis.
# some text
text = c(“one word”, “a sentence”, “you and me”, “three two one”)
# default usage
regexpr(“one”, text)
## [1] 1 -1 -1 11
## attr(,”match.length”)
## [1] 3 -1 -1 3
## attr(,”useBytes”)
## [1] TRUE
4. gregexpr()
The gregexpr() function is a built-in function in many programming languages that is
used for pattern matching within strings. It returns the starting
The gregexpr() function performs a similar task to regexpr(). It is used to locate the
position of a pattern within a vector of strings by searching each element individually.
The primary distinction lies in the output format of gregexpr(), which is a list. The function
gregexpr() returns a list that has the same length as the input text. Each element in
the list follows the same format as the return value of regexpr(). However, instead of
providing the starting position of only one match, it provides the starting positions of all
non-overlapping matches.
# some text
text = c(“one word”, “a sentence”, “you and me”, “three two one”)
# pattern
pat = “one”
# default usage
gregexpr(pat, text)
## [[1]]
## [1] 1
## attr(,”match.length”)
## [1] 3
## attr(,”useBytes”)
## [1] TRUE
##
## [[2]]
## [1] -1
## attr(,”match.length”)
## [1] -1
## attr(,”useBytes”)
## [1] TRUE
##
## [[3]]
## [1] -1
## attr(,”match.length”)
## [1] -1
## attr(,”useBytes”)
## [1] TRUE
##
## [[4]]
## [1] 11
## attr(,”match.length”)
## [1] 3
## attr(,”useBytes”)
## [1] TRUE
5. regexec()
The regexec() function is used to perform regular expression matching in a
programme.
The regexec() function closely resembles gregexpr() as it produces an output that is a
list of the same length as the text. The starting position of the match is stored in each element
of the list. A value of -1 indicates the absence of a match. Furthermore, it should be noted
that every element within the list possesses the attribute “match.length,” which provides the
lengths of the matches. In cases where there is no match, the attribute value is -1.
# some text
text = c(“one word”, “a sentence”, “you and me”, “three two one”)
# pattern
pat = “one”
# default usage
regexec(pat, text)
## [[1]]
## [1] 1
## attr(,”match.length”)
## [1] 3
##
## [[2]]
## [1] -1
## attr(,”match.length”)
## [1] -1
##
## [[3]]
## [1] -1
## attr(,”match.length”)
## [1] -1
##
## [[4]]
## [1] 11
## attr(,”match.length”)
## [1] 3
Extracting Patterns
In Stringr, there are two primary options for extracting a string that matches a
pattern: (a) extracting the first occurrence or (b) extracting all occurrences. When using
str_extract(), it searches for the first instance of the pattern in a character vector. If a
match is found, it returns the matched string; otherwise, the output for that element will be
NA and the output vector will be the same length as the input string.
y <- c (“I use R #useR2014”, “I use R and love R #useR2015”, “Beer”)
str_extract(y, pattern = "R")
## [1] “R” “R” NA
Use str_extract_all() to find every instance of a pattern within a character vector. A
list that is the same length as the vector’s number of items is produced as the output.
The matching pattern occurrence within each list item’s respective vector element will be
provided.
str_extract_all(y, pattern = "[[:punct:]]*[a-zA-Z0-9]*R[a-zA-Z0-9]*")
## [[1]]
## [1] “R” “#useR2014”
##
## [[2]]
## [1] "R"         "R"         "#useR2015"
##
## [[3]]
## character(0)
Functions Description
nchar() It counts the number of characters in a string or a vector. In the stringr package, its substitute function is str_length().
tolower() It converts a string to lowercase. Alternatively, you can use the str_to_lower() function.
toupper() It converts a string to uppercase. Alternatively, you can also use the str_to_upper() function.
chartr() It is used to replace each character in a string. Alternatively, you can use the str_replace() function to replace a complete string.
substr() It is used to extract parts of a string. Start and end positions need to be specified. Alternatively, you can use the str_sub() function.
setdiff() It is used to determine the difference between two vectors.
setequal() It is used to check if the two vectors have the same string values.
abbreviate() It is used to abbreviate strings. The length of the abbreviated string needs to be specified.
strsplit() It is used to split a string based on a criterion. It returns a list. Alternatively, you can use the str_split() function. This function lets you convert your list output to a character matrix.
sub() It is used to find and replace the first match in a string.
gsub() It is used to find and replace all the matches in a string vector. Alternatively, you can use the str_replace_all() function.
Now, these functions will be written and their effects on strings will be understood. The outputs have not been shown here; you are expected to run these commands locally and observe the differences.
library(stringr)
string <- “Los Angeles, officially the City of Los Angeles and often known by its initials
L.A., is the second-most populous city in the United States (after New York City), the
most populous city in California and the county seat of Los Angeles County. Situated in
Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity,
sprawling metropolis and as a major centre of the American entertainment industry.”
strwrap(string)
#count number of characters
nchar(string)
str_length(string)
#convert to lower
tolower(string)
str_to_lower(string)
#convert to upper
toupper(string)
str_to_upper(string)
#replace strings
chartr(“and”,”for”,x = string) #letters a,n,d get replaced by f,o,r
str_replace_all(string = string, pattern = c("City"), replacement = "state") # this is case sensitive
#abbreviate strings
abbreviate(c(“monday”,”tuesday”,”wednesday”),minlength = 3)
#split strings
strsplit(x = c(“ID-101”,”ID-102”,”ID-103”,”ID-104”),split = “-”)
str_split(string = c("ID-101","ID-102","ID-103","ID-104"), pattern = "-", simplify = T)
#find and replace first match
sub(pattern = “L”,replacement = “B”,x = string,ignore.case = T)
The term “data wrangling” is commonly employed to refer to the initial phases of the
data analytics process. The process entails the conversion and mapping of data from one
format to another. The objective is to enhance the accessibility of data for applications
such as business analytics and machine learning. The process of data wrangling
encompasses a diverse range of tasks. The aforementioned tasks encompass data
collection, exploratory analysis, data cleansing, creation of data structures and storage.
Those expecting quick outcomes may encounter unexpected delays in the process of transforming data into a format that is readily employable. In contrast to the outcomes yielded by data analysis,
which frequently offer captivating and stimulating revelations, the data wrangling phase
typically yields minimal visible progress despite significant efforts. In light of budget and
time constraints, the challenges faced by data wranglers are amplified for businesses. The
position requires proficient management of expectations and technical expertise.
1. filter()
The filter() function is used to generate a subset of data that meets specific criteria.
It allows proper filtering using various methods, including conditional operators, logical
operators, NA values and range operators. The syntax of the filter() function is as follows:
filter(dataframeName, condition)
Example:
The code below utilises the filter() function to retrieve data from the “stats” data
frame, specifically for players who have scored more than 100 runs.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# fetch players who scored more
# than 100 runs
filter(stats, runs>100)
Output
player runs wickets
1 B 200 20
2 C 408 NA
2. distinct()
The distinct() method is used to eliminate duplicate rows from a data frame or based
on the specified columns. The distinct() method follows the syntax provided below:
distinct(dataframeName, col1, col2, ..., .keep_all = TRUE)
Example: In this example, the distinct() method was utilised to eliminate duplicate
rows from the data frame. Additionally, duplicates were removed based on a specified
column.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’, ‘A’, ‘A’),
runs=c(100, 200, 408, 19, 56, 100),
wickets=c(17, 20, NA, 5, 2, 17))
# removes duplicate rows
distinct(stats)
# remove duplicates based on a column (the player column is assumed here)
distinct(stats, player, .keep_all = TRUE)
Output
4. select()
The select() method is used to retrieve the desired columns as a table by specifying the necessary column names within the select() method. The syntax of the select() method is as follows:
select(dataframeName, col1,col2,…)
Example:
In the following code, the select() method was utilised to retrieve only the player and wickets column data.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# fetch required column data
select(stats, player,wickets)
Output
player wickets
1 A 17
2 B 20
3 C NA
4 D 5
5. rename()
The rename() function is used for the purpose of modifying column names. This can be achieved using the following syntax:
rename(dataframeName, newName=oldName)
Example:
In this particular instance, the column name "runs" is modified to "runs_scored"
within the stats data frame.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# renaming the column
rename(stats, runs_scored = runs)
Output
6. mutate() and transmute()
The following methods are utilised for the purpose of generating new variables. The mutate() function is used to generate new variables while retaining the existing ones, whereas the transmute() function drops the old variables and generates new ones. The syntax for both methods is provided below:
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)
Example:
In this example, a new column named “avg” was created using the mutate() and
transmute() methods.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, 7, 5))
# add new column avg
mutate(stats, avg=runs/4)
# drop all and create a new column
transmute(stats, avg=runs/4)
Output
avg
1 25.00
2 50.00
3 102.00
4 4.75
The mutate() function is used to append a new column to the existing data frame
without removing any of the existing columns. On the other hand, the transmute()
function is used to create a new variable while discarding all the previous columns.
7. summarise()
The data in the data frame can be summarised using the summarise method. This
can be achieved by applying aggregate functions such as sum(), mean() and others. The
syntax of the summarise() method is as follows:
summarise(dataframeName, aggregate_function(columnName))
Example:
The code below demonstrates the utilisation of the summarise() method to present a summary of the data contained within the runs column.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, 7, 5))
# summarise method
summarise(stats, sum(runs), mean(runs))
Output
sum(runs) mean(runs)
1 727 181.75
Below is a table presenting multiple logic rules that can be used with the filter() function:
Rule Description
< Less than
> Greater than
== Equal to
<= Less than or equal to
>= Greater than or equal to
!= Not equal to
%in% Group membership
is.na Is NA
!is.na Is not NA
&, |, ! Boolean operators (and, or, not)
You can use these logic rules within the filter() function to filter data based on
specific conditions and generate subsets of the data that meet those criteria.
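A short sketch of these rules in action, reusing the stats data frame defined in the dplyr examples above:
# AND: both conditions must hold
filter(stats, runs > 100 & wickets > 5)
# OR: either condition may hold, including rows with missing wickets
filter(stats, runs > 400 | is.na(wickets))
# negation: drop rows with missing wickets
filter(stats, !is.na(wickets))
# membership: keep only the listed players
filter(stats, player %in% c("A", "C"))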
There are additional filtering and subsetting functions that are quite useful:
# remove duplicate rows
sub_exp %>% distinct()
# random sample, 50% sample size without replacement
sub_exp %>% sample_frac(size = 0.5, replace = FALSE)
# random sample of 10 rows with replacement
sub_exp %>% sample_n(size = 10, replace = TRUE)
# select rows 3-5
sub_exp %>% slice(3:5)
# select top n entries - in this case ranks variable X2011 and selects
# the rows with the top 5 values
sub_exp %>% top_n(n = 5, wt = X2011)
Selecting
When dealing with a large data frame, it is common to have the need to evaluate
specific variables. The select() function facilitates the selection and potential renaming of
variables. Assuming our objective is to exclusively evaluate the expenditure data from the
past five years.
The application of special functions within the select() method is also possible. To
illustrate, users have the option to select all variables that begin with the letter ‘X’. To view
the list of available functions, please use the command ‘?select’.
expenditures %>%
select(starts_with("X")) %>%
head()
Variables can be de-selected by utilising the "-" symbol before the respective name or function. The inverse of the functions above can be obtained using the following procedure:
expenditures %>% select (-X1980:-X2006)
expenditures %>% select(-starts_with("X"))
To enhance convenience, the system provides two options for renaming selected
variables.
# select and rename a single column
expenditures %>% select(Yr_2011 = X2011)
# Select and rename the multiple variables with an “X” prefix:
expenditures %>% select(Yr_ = starts_with("X"))
# keep all variables and rename a single variable
expenditures %>% rename (`2011` = X2011)
Sorting
There are instances where it is desirable to examine observations in a ranked
sequence based on specific variable(s).
The arrange() function facilitates the sorting of data based on variables in either
ascending or descending order. In order to evaluate the average expenditures per
division, the following analysis will be performed.
The arrange() function can be utilised to sort the divisions based on their expenditure
for 2011, arranging them in ascending order. This facilitates the identification of significant
distinctions among Divisions 8, 4, 1 and 6 in comparison to Divisions 5, 7, 9, 3 and 2.
sub_exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(Mean_2011)

sub_exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(desc(Mean_2011))
Aggregating Data
Data aggregation involves the process of summarising and organising data in a
more concise and meaningful manner. It entails collecting data from various sources and
presenting it in a summarised format, which is pivotal for effective data analysis.
Accurate and high-quality data collected in substantial quantities is imperative
for generating meaningful insights. This data collection plays a pivotal role in making
informed decisions across various aspects, including financial strategies, product
development, pricing, operational decisions, and crafting marketing strategies.
Example:A large e-commerce company in India wants to analyse its customer
data to gain insights into purchasing behaviour across different regions of the country.
The company has a massive dataset containing information about individual customer
transactions.
Here’s a simplified example of their data:
Customer ID
Region (e.g., North, South, East, West)
Product Category (e.g., Electronics, Clothing, Books)
Purchase Amount (in Indian Rupees)
Sample Data:
Example:
North Region Total Sales: `8,000
South Region Total Sales: `2,500
East Region Total Sales: `1,200
West Region Total Sales: `4,000
Example:
Electronics Category Total Sales: `9,000
Clothing Category Total Sales: `2,500
Books Category Total Sales: `2,000
Example:
North Region Average Purchase Amount: `4,000 (Total Sales `8,000 / Number of
Transactions in North Region)
South Region Average Purchase Amount: `2,500 (Total Sales `2,500 / Number of
Transactions in South Region)
By aggregating the data in this way, the e-commerce company can gain insights into
which regions are the most profitable, which product categories are the most popular, and
what the average spending patterns are in different regions. This information can inform
marketing strategies, inventory management, and product recommendations to improve
the company’s overall performance in the Indian market.
The as.Date function, a built-in feature, is designed to handle dates without including time information. On the other hand, the chron library, which is a contributed library, is capable of handling both dates and times. However, it does not provide functionality for managing time zones.
For more advanced control over dates and times, the POSIXct and POSIXlt classes can
be used. These classes offer support for both dates and times, while also providing the
ability to manage time zones. In R, it is recommended to employ the most straightforward
approach when working with date and time data. In the case of data that only includes
dates, the most suitable option is typically the as.Date function.
The chron library is a suitable option for managing dates and times that do not
include timezone information. When it comes to manipulating time zones, the POSIX
classes within the library are particularly valuable. Additionally, it is important to consider
the various “as.” functions that can be utilised to convert between different date types as
needed.
With the exception of the POSIXct class, dates are internally stored as the numerical
representation of days or seconds elapsed from a specific reference date. Dates in R
typically have a numeric mode and the class function can be utilised to determine their
actual storage method. The POSIXlt class is designed to store date/time values as a
collection of components, such as hour, minute, second, month and so on. This structure
facilitates the extraction of specific parts of the date/time information.
The Sys.Date function can be used to obtain the current date. It returns a Date
object that can be converted to a different class if required.
The as.Date function converts the character representation of a date into a date object. This is commonly done in programming when dealing with date-related operations. By converting strings to dates, various operations
can be performed.
When importing date and time data into R, it is common for them to be automatically
converted to character strings. The task at hand necessitates the conversion of strings
into dates. Multiple strings can be merged to form a date variable.
In order to consolidate the individual data into a single date object, the ISOdate() function should be utilised.
yr <- c (“2012”, “2013”, “2014”, “2015”)
mo <- c (“1”, “5”, “7”, “2”)
day <- c (“02”, “22”, “15”, “28”)
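A small sketch of how these vectors might be consolidated: ISOdate() returns POSIXct date-times, which can then be coerced to dates with as.Date():
# combine the pieces into date-times, then coerce to plain dates
ISOdate(year = yr, month = mo, day = day)
as.Date(ISOdate(yr, mo, day))
## [1] "2012-01-02" "2013-05-22" "2014-07-15" "2015-02-28"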
Specifier Description
%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%C Century
%y Year without century
%Y Year with century
%d Day of month (01-31)
%j Day in Year (001-366)
%m Month of year (01-12)
%D Date in %m/%d/%y format
%u Weekday (1-7), Starts on Monday
Key point: In order to obtain the current date in R, the Sys.Date() function can be utilised. This method is designed to retrieve the present date.
1. Weekday
This section will explore the %a, %A and %u specifiers, which provide the
abbreviated weekday, full weekday and numbered weekday starting from Monday,
respectively.
# today date
date<-Sys.Date()
# abbreviated weekday
format(date,format=”%a”)
# full weekday
format(date,format=”%A”)
# weekday
format(date,format=”%u”)
Output
[1] “Sat”
[1] “Saturday”
[1] “6”
[Execution complete with exit code 0]
2. Date:
Let’s look into the day, month and year format specifiers to represent dates in
different formats.
Example:
# today date
date<-Sys.Date()
# day in month
format(date,format=”%d”)
# month in year
format(date,format=”%m”)
# abbreviated month
format(date,format=”%b”)
# full month
format(date,format=”%B”)
# Date
format(date,format=”%D”)
format(date,format=”%d-%b-%y”)
Output
[1] “2022-04-02”
[1] “02”
[1] “04”
[1] “Apr”
[1] “April”
[1] “04/02/22”
[1] “02-Apr-22”
[Execution complete with exit code 0]
3. Year:
The year can be formatted in various ways. The format specifiers %y, %Y and %C
are used to retrieve different representations of the year from a given date. The %y
specifier returns the year without the century, the %Y specifier returns the year with the
century and the %C specifier returns only the century of the date.
# today date
date<-Sys.Date()
# year without century
format(date,format=”%y”)
# year with century
format(date,format=”%Y”)
# century
format(date,format=”%C”)
Output
[1] “22”
[1] “2022”
[1] “20”
[Execution complete with exit code 0]
x <- Sys.Date ()
x
## [1] “2015-09-26”
y <- as.Date (“2015-09-11”)
x>y
## [1] TRUE
x-y
## Time difference of 15 days
One of the advantages of utilising the date/time classes is their ability to accurately
handle leap years, leap seconds, daylight savings and time zones. To obtain a
comprehensive list of acceptable time zone specifications, the OlsonNames() function
can be utilised.
# last leap year
x <- as.Date (“2012-03-1”)
y <- as.Date (“2012-02-28”)
x-y
## Time difference of 2 days
# the same clock time recorded in two different time zones
x <- as.POSIXct ("2017-01-01 01:00:00", tz = "US/Eastern")
y <- as.POSIXct ("2017-01-01 01:00:00", tz = "US/Pacific")
y == x
## [1] FALSE
y-x
## Time difference of 3 hours
The lubridate package also provides the same functionality, with the only distinction
being the use of different accessor function(s).
library (lubridate)
x <- now ()
x
## [1] “2015-09-26 10:08:18 EDT”
x>y
## [1] TRUE
x-y
x - hours (4)
## [1] “2015-09-26 06:08:18 EDT”
Time spans can be managed using the duration functions available in the lubridate
package.
Durations are a means of quantifying the length of time between specified start
and end dates. Utilising the base R date functions for duration calculations can be a
cumbersome and error-prone process. The lubridate package offers a straightforward
syntax for calculating durations using various units of measurement, such as seconds,
minutes, hours and so on.
# create new duration (represented in seconds)
new_duration (60)
## [1] “60s”
dhours (1)
## [1] “3600 s (~1 hours)”
dyears (1)
## [1] “31536000 s (~365 days)”
x + dhours (10)
## [1] “2015-09-22 22:00:00 UTC”
Using time series objects, various analyses can be performed, such as sales analysis for a company, inventory analysis, price analysis for a specific stock or market and population analysis.
The syntax for using the ts() function is as follows:
objectName <- ts(data, start, end, frequency)
Where:
data: Represents the data vector containing the time series observations.
start: Denotes the timestamp of the first observation in the time series.
end: Denotes the timestamp of the last observation in the time series.
frequency: Specifies the number of observations per unit time. For example, frequency = 12 for monthly data and frequency = 1 for annual data.
By using the ts() function in R, you can conduct comprehensive Time Series Analysis
to gain valuable insights and make informed predictions about the behaviour of your data
over time.
Note: To know about more optional parameters, use the following command in the R
console: help(“ts”)
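A minimal sketch of ts() usage (the values below are invented monthly observations):
# 12 monthly observations beginning January 2020
sales <- ts(c(120, 135, 150, 110, 95, 102, 130, 148, 160, 155, 170, 180),
            start = c(2020, 1), frequency = 12)
sales        # prints the series with month and year labels
plot(sales)  # quick line plot of the series over time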
Output
# library required for decimal_date() function
library(lubridate)
Output:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2020.307 2547989 2491957 2604020 2462296 2633682
2020.326 2915130 2721277 3108983 2618657 3211603
2020.345 3202354 2783402 3621307 2561622 3843087
2020.364 3462692 2748533 4176851 2370480 4554904
2020.383 3745054 2692884 4797225 2135898 5354210
The graph presented below illustrates the projected values of COVID-19 if its
widespread transmission persists over the course of the next five weeks.
1. Risk Management:
Text mining plays a crucial role in risk management by systematically analysing,
recognising, treating and monitoring risks in various processes within organisations.
Implementing Risk Administration Software based on text mining technology enhances
the ability to identify and mitigate risks effectively. It allows the management of vast
amounts of text data from multiple sources and enables the extraction of relevant
information to make informed decisions.
4. Business Intelligence:
Text mining has become a fundamental component of business intelligence for
companies and businesses. It provides significant insights into user behaviour and
trends, allowing organisations to understand their strengths and weaknesses compared
to competitors. By leveraging text mining methods, companies gain a competitive
advantage in the industry.
Gather relevant textual data from various sources in different document formats (e.g., plain text, web pages, PDF files).
Execute pre-processing and data cleansing tasks to identify and remove
inconsistencies.
Perform data cleansing to capture authentic text and remove stop words
through stemming (identifying word roots and indexing data).
Process and control the data set to facilitate review and further cleansing.
2. Implementation of Pattern Analysis:
Utilise pattern analysis as a critical component within the Management
Information System.
3. Extracting Relevant Information:
Leverage the data obtained from the previous steps to extract relevant and
significant information.
4. Enable effective decision-making processes and trend analysis.
Tokenization
Several TDM (Text Data Mining) methods rely on the utilisation of word or short
phrase counting techniques. However, from a computer’s perspective, the texts in your
corpus are merely sequences of characters and it lacks the understanding of words or
phrases. To enable the computer to count and perform calculations, it is necessary to
provide instructions on how to segment the text into meaningful units. The individual units
within the text are referred to as tokens and the act of dividing the text into these units is
known as tokenization.
Tokenizing text is a common practice where the text is divided into individual words
or tokens. However, there are other types of tokenization that can also be beneficial.
If one is interested in examining particular phrases, such as ‘artificial intelligence’ or
‘White Australia Policy’, or investigating the co-occurrence patterns of certain words, it
is advisable to segment the text into two or three-word units. To perform an analysis of
sentence structures and features, it is recommended to begin by tokenizing the text into
discrete sentences. Frequently, it is necessary to tokenize text using various methods in
order to facilitate different types of analyses.
In languages where words are not separated in writing, such as Chinese, Thai, or
Vietnamese, tokenization necessitates careful consideration to determine how the text
should be divided in order to facilitate the intended analysis.
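A minimal base-R sketch of word tokenization (real projects typically use packages such as tidytext or quanteda; the sentence is invented):
text <- "Text mining turns unstructured text into structured data."
# lower-case, strip punctuation, then split on runs of whitespace
tokens <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
tokens
## [1] "text"         "mining"       "turns"        "unstructured"
## [5] "text"         "into"         "structured"   "data"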
Word Replacement
The presence of spelling variations can pose challenges in text analysis, as the
computer interprets distinct spellings as separate words instead of recognising their
shared reference. To resolve this issue, it is recommended to select a singular spelling
and substitute any alternative variations with that particular version throughout the text.
In order to process a large corpus, the initial step involves tokenizing the words and
subsequently standardising the spelling. Another option is to utilise tools like VARD to
automate the task.
Stopwords
In the realm of textual analysis, it is worth noting that certain words, such as ‘the’,
‘is’, ‘that’, ‘a’ and others, are frequently employed but do not contribute significantly to
the understanding or interpretation of the content within your documents. Consequently,
these words tend to exert a disproportionate influence on any analysis conducted.
The words that need to be filtered out prior to text analysis are commonly referred to
as “stopwords”. There is a wide range of available stopword lists that can be utilised for
the purpose of eliminating commonly used words across multiple languages. To exclude
certain words from your analysis that are frequently found in your documents but are not
relevant, you have the option to personalise the existing stopwords lists by incorporating
your own words into them.
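A minimal base-R sketch of stopword removal, using a tiny hand-made stopword list (real analyses draw on the published lists mentioned above):
tokens <- c("the", "cat", "is", "on", "the", "mat")
stopwords <- c("a", "is", "on", "that", "the")  # a tiny hand-made list
tokens[!tokens %in% stopwords]                  # keep only content words
## [1] "cat" "mat"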
Stemming
Stemming involves the application of a predefined set of rules to determine the
suffixes of words that can be removed, thereby isolating the fundamental root of the
word. The resulting “stem” may or may not correspond to a valid word. There are various
stemming algorithms available, including the Snowball and Lancaster stemmers. The
execution of these rules will yield varying outcomes. Therefore, it is recommended to
examine the rules they employ and test them on your data in order to determine which
one best aligns with your requirements. The implementation of a stemming algorithm
necessitates the execution of programming tasks.
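A minimal sketch, assuming the SnowballC package (an implementation of the Snowball stemmer) is installed; note that stems such as "easili" are not valid words:
library(SnowballC)
wordStem(c("running", "runs", "easily", "studies"), language = "english")
## [1] "run"    "run"    "easili" "studi"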
Lemmatization
Lemmatization is a computational process that involves the analysis of words in
order to derive their corresponding dictionary root form. In order to perform this analysis,
it is necessary for the lemmatizer to have an understanding of the context in which each word is used, such as its part of speech.
Information Retrieval
Information retrieval (IR) is a process that retrieves relevant information or
documents by using a predetermined set of queries or phrases. Information retrieval (IR)
systems employ algorithms to monitor user behaviours and discern pertinent data. The
utilisation of information retrieval is prevalent in library catalogue systems and widely
used search engines such as Google. There are several common sub-tasks associated
with Information Retrieval (IR), which are frequently encountered in this field. These sub-
tasks include:
● Tokenization is a fundamental process in natural language processing that involves
the segmentation of lengthy text into individual sentences and words, referred to
as “tokens”. These components are utilised in various models, such as the bag-of-
words model, to perform tasks like text clustering and document matching.
● Stemming is the procedure of extracting the root word form and meaning by
separating the prefixes and suffixes from words. This technique enhances the
process of information retrieval by decreasing the size of indexing files.
These subtasks help computers comprehend written text; they play crucial roles in various projects and tasks and contribute to their successful completion.
1. Summarization:
Summarization is a technique used to condense lengthy textual content into concise
and well-organised summaries that capture the main ideas of a document. It helps in
extracting essential information efficiently.
2. Part-of-Speech (PoS) Tagging:
PoS tagging involves assigning specific tags to each token in a document to indicate
its part of speech, such as nouns, verbs, adjectives, etc. This enables semantic
analysis of unstructured text and aids in understanding the linguistic elements.
3. Text Categorization (Text Classification):
Text categorization, also known as text classification, categorises text documents based on predefined topics or categories. It proves valuable in organising synonyms and abbreviations and facilitates efficient information retrieval.
4. Sentiment Analysis:
Sentiment analysis identifies and classifies positive or negative sentiments expressed
in internal and external data sources. It tracks customer attitudes and analyses
sentiment changes over time. Businesses use sentiment analysis to understand
brand, product and service perceptions, leading to better customer engagement and
improved user experiences.
These NLP sub-tasks are integral to unlocking the potential of unstructured text data,
enabling businesses to gain valuable insights and make data-driven decisions.
Data Mining
Data mining is a process that involves extracting useful patterns and information from large datasets; it is a technique used in various fields, such as business and marketing.
Data mining is a systematic procedure that involves the identification of patterns
and the extraction of valuable insights from large sets of data. This practice involves the
evaluation of both structured and unstructured data in order to identify novel information.
It is commonly employed for the analysis of consumer behaviours within the domains
of marketing and sales. Text mining is a sub-field of data mining that specifically deals
with the organisation and analysis of unstructured data in order to produce new and
valuable findings. The aforementioned techniques are classified as data mining methods,
specifically falling within the domain of textual data analysis.
2. Detection of emotions
Emotion detection sentiment analysis goes beyond simple polarity by identifying
specific emotions like happiness, frustration, anger and sadness in text. To achieve this,
many emotion recognition systems employ advanced machine learning algorithms or
lexicons, which are lists of words and their associated emotions.
However, using lexicons has a limitation because people express emotions in
various ways. For example, words like “bad” and “kill,” which can signify rage (e.g., “this
is bad ass” or “your customer service is killing me”), can also be used to convey joy,
creating potential ambiguities in emotion detection.
3. Analysis of Sentiment Based on Aspects
Aspect-based sentiment analysis is a valuable approach when you aim to identify the
specific qualities or traits mentioned in texts with positive, neutral, or negative sentiments.
For example, if a product review includes the statement “The battery life of this camera is
too short,” an aspect-based classifier can determine that the author expressed a negative
sentiment about the device’s battery life.
4. Multilingual sentiment analysis
Performing sentiment analysis across different languages presents a considerable
challenge. The preparation process requires a significant amount of time and effort. Most
of the tools mentioned, such as sentiment lexicons, can be accessed online. However,
there are others, like translated corpora or noise detection algorithms, that require coding
skills to be effectively utilised.
One potential approach is to employ a language classifier to automatically detect the
language present in texts. Subsequently, you can proceed to train a distinct sentiment
analysis model that categorises texts based on the language of your preference.
Text Classification
Text classification refers to the systematic procedure of assigning tags or categories
to texts, primarily determined by the content they contain.
Automated text classification enables the efficient tagging of extensive text datasets,
yielding favourable outcomes within a significantly reduced timeframe, eliminating the
need for manual labour. The technology described possesses promising potential for
various domains.
1. Rule-based System
Text classification systems of this kind rely on linguistic rules. The term “rules” refers
to associations that have been manually created by humans, linking a particular linguistic
pattern with a corresponding tag. Once the algorithm has been implemented with the
specified rules, it is capable of automatically identifying and assigning appropriate tags to
various linguistic structures.
Rules typically encompass references to syntactic, morphological and lexical
patterns. Additionally, these factors can also pertain to semantic or phonological aspects.
An illustrative instance of a rule for categorising product descriptions according to the
colour of a product is as follows:
The set of colours can be expressed as the rule (Black | Grey | White | Blue) → COLOUR.
In this scenario, the system will automatically assign the tag “COLOUR” whenever it
detects any of the words mentioned earlier.
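A toy sketch of such a rule in R, using grepl() to assign the COLOUR tag (the product descriptions are invented):
descriptions <- c("Black leather case", "USB-C cable", "Grey wool scarf")
rule <- "\\b(Black|Grey|White|Blue)\\b"   # the linguistic pattern for the tag
ifelse(grepl(rule, descriptions), "COLOUR", "no tag")
## [1] "COLOUR" "no tag" "COLOUR"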
When working with imbalanced data, it is possible to encounter an accuracy paradox. This paradox arises due to the model's tendency to make accurate predictions, primarily because a majority of the data
belongs to a single category. In such instances, it is advisable to take into account
alternative metrics such as precision and recall.
● Precision is a metric used to assess the accuracy of a classifier. It measures the
proportion of correct predictions made by the classifier for a specific tag, taking into
account both correct and incorrect predictions. The presence of a high precision
metric suggests a lower occurrence of false positives. It is crucial to take into
consideration that precision solely quantifies the instances in which the classifier
accurately predicts that a given text belongs to a particular tag. Certain tasks, such
as automated email responses, necessitate models that exhibit a notable degree of
precision. This precision is crucial in ensuring that a response is delivered to a user
only when there is a high probability of the prediction being accurate.
● The term “recall” refers to the ratio of correctly predicted texts to the total number of
texts that should have been assigned a specific tag. A high recall metric indicates a
lower occurrence of false negatives. This metric is especially valuable in situations
where it is necessary to direct support tickets to the appropriate teams. The
objective is to optimise the automatic routing of tickets associated with a specific
tag, such as Billing Issues, by prioritising the maximum number of tickets routed,
even if it results in occasional incorrect predictions.
● The F1 score is a metric that integrates precision and recall to provide an
assessment of the performance of your classifier. The aforementioned metric
serves as a superior indicator compared to accuracy when assessing the quality of
predictions across all categories within your model.
5. Cross-validation
Cross-validation is a commonly employed technique for evaluating the effectiveness
of a text classifier. The process involves partitioning the training data into distinct subsets
using a random approach. An illustration of this concept involves dividing the training data
into four distinct subsets, with each subset comprising 25% of the original dataset.
Next, all subsets, with the exception of one, are utilised for training a text classifier.
The purpose of this text classifier is to generate predictions for the remaining subset of
data, specifically for testing purposes. Following this step, the performance metrics are
computed by comparing the predicted values with the predefined tags. This process is
then repeated until all subsets of data have been utilised for testing.
The final step involves compiling the results from all subsets of data in order to
calculate the average performance for each metric.
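A minimal base-R sketch of the 4-fold split described above (the classifier itself is left as a placeholder comment):
set.seed(42)
n <- 100                                   # number of labelled documents
folds <- sample(rep(1:4, length.out = n))  # randomly assign each document to a fold
for (k in 1:4) {
  test_idx  <- which(folds == k)           # held-out 25% for this round
  train_idx <- which(folds != k)           # remaining 75% for training
  # train the classifier on train_idx, predict on test_idx,
  # then compare predictions with the predefined tags to compute the metrics
}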
Summary
● String processing refers to the manipulation and analysis of text data using various operations and functions. It involves working with sequences of characters, known as strings.
● The central data structure utilised by numerous early synthesis systems was commonly known as a string rewriting mechanism. The formalism stores the linguistic representation of an utterance as a string. The initial state of the string consists of textual content. As the processing occurs, the string undergoes modifications or enhancements through the addition of supplementary symbols. This method was utilised by systems such as MITalk and the CSTR Alvey synthesiser.