
Module - III: String Processing and Text Mining


Learning Objectives
At the end of this module, you will be able to:
Ɣ Understand the basics of string processing and the importance of regular expressions in
text manipulation
Ɣ Learn how to create and use regular expressions to match and extract specific
patterns from strings in R
Ɣ Discuss various string manipulation techniques using regular expressions in R to
clean, format and modify text data
Ɣ Develop familiarity with the principles of data wrangling and the role of the Dplyr
package in R for data manipulation
Ɣ Learn how to filter, sort and select data using Dplyr functions to perform essential
data transformations
Ɣ Discuss the concept of handling dates and times in R, including formatting and
arithmetic operations on date-time data

Introduction
String processing refers to the manipulation and analysis of text data using various
operations and functions. It involves working with sequences of characters, known as
strings.
The central data structure utilised by numerous early synthesis systems was
commonly known as a string rewriting mechanism. The formalism stores the linguistic
representation of an utterance as a string. The initial state of the string consists of textual
content. As the processing occurs, the string undergoes modifications or enhancements
through the addition of supplementary symbols. This method was utilised by systems
such as MITalk and the CSTR Alvey synthesiser.

Text Mining
Text mining is a specialised form of data mining that focuses on analysing and
extracting information from unstructured text data. It involves utilising natural language
processing (NLP) techniques to mine vast amounts of unstructured text for valuable
insights. Text mining can be used independently for specific objectives or as a crucial
initial step in the broader data mining process.
Through text mining, unstructured text data can be transformed into structured
data, enabling various data mining tasks like association rule mining, classification and
clustering. Businesses can leverage this capability to extract valuable insights from
diverse data sources such as customer reviews, social media posts and news articles.

Uses of Text Mining


The common usage of Text Mining refers to the application of techniques and
methodologies to extract valuable insights and knowledge from large volumes of
unstructured textual data.
Text mining is a commonly employed technique in diverse domains, including but
not limited to natural language processing, information retrieval and social media


analysis. The extraction of insights from unstructured text data and the subsequent use
of these insights to make data-driven decisions have become indispensable tools for
organisations.

Process of Text Mining


The text mining process typically consists of several essential steps. The following
steps are implemented in order to extract valuable insights and knowledge from
unstructured text data. The process generally consists of the following stages:
Ɣ The process of gathering and recording information. During the initial step, the
necessary text data is collected from different sources.
Ɣ The process entails the gathering of unstructured information from various sources,
which are accessible in a range of document formats including plain text, web
pages, PDF files and others.
Ɣ The execution of pre-processing and data cleansing tasks aims to identify and
eliminate inconsistencies present in the data. The data cleansing process captures
accurate text by removing stop words and applying stemming, which entails
identifying the base form of a word, before indexing the data (a minimal R sketch of
these steps follows this list).
Ɣ The data set is subjected to processing and control tasks in order to enable a
comprehensive review and subsequent cleansing.
Ɣ Pattern analysis plays a crucial role in the successful implementation of a
Management Information System (MIS).
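The cleansing and analysis stages above can be sketched with base R alone. The snippet below is only a minimal, illustrative example (the documents and the stop-word list are made up for this demonstration): it lower-cases the text, strips punctuation, removes a few stop words and counts the remaining terms; stemming and indexing are omitted for brevity.

# illustrative mini text-mining pipeline in base R
docs <- c("Text mining extracts insight from text.",
          "Mining text data needs cleansing and analysis.")
clean <- tolower(gsub("[[:punct:]]", "", docs))    # lower-case and strip punctuation
tokens <- unlist(strsplit(clean, "\\s+"))          # split the documents into words
stopwords <- c("and", "from", "the", "needs")      # illustrative stop-word list
tokens <- tokens[!tokens %in% stopwords]           # remove stop words
sort(table(tokens), decreasing = TRUE)             # simple term-frequency summary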

3.1 String Processing and Regular Expressions (regex)


String Processing
In order to perform tasks such as graph creation, summary generation, or
model fitting in R, it is necessary to extract numeric data from character strings and
convert them into the appropriate numeric representations. The practice of converting
unorganised data into informative variable names or categorical variables is widely
recognised.
Data scientists often face unique and unexpected challenges when it comes to
processing strings. Hence, composing a comprehensive section on this topic is a
challenging endeavour. To be more precise, we now delineate the procedure for
converting the original unprocessed data, which has not yet been presented and from
which the examples of murders, heights and research funding rates were obtained, into
the data frames that have been examined in this topic.
Various common string processing tasks will be explored in this guide, drawing
insights from the provided case studies. These tasks include extracting numbers
from strings, removing unnecessary characters from text, searching for and replacing
characters, extracting specific portions of strings, converting unstructured text into
standardised formats and splitting strings into multiple values.
All of the aforementioned activities can be accomplished utilising the fundamental
R functions. However, a notable drawback is the absence of a cohesive convention,
resulting in a slight difficulty in recalling and implementing them. The stringr package
serves as a repackage of this functionality, providing more logical function names and an
organised arrangement of parameters.


All string processing functions in the stringr library are prefixed with “str_”. The
behaviour described suggests that R will automatically provide a list of available functions
and display them as options when the user types “str_” and presses the tab key.
Consequently, there is no obligation to memorise each function name. One additional
advantage is that the string being processed is consistently placed as the first argument
in the functions within this package, thereby facilitating the usage of the pipe operation.
First, we provide an explanation of the usage of the functions within the stringr package.
The primary source of examples for this analysis will be the second case study,
which pertains to self-reported heights of students. The majority of the topic is focused
on instructing the reader on regular expressions (regex) and the utilities available in the
stringr package.

Regular Expressions (regex)


Regular expressions, also known as regex, are a collection of pattern matching
commands utilised for identifying string sequences within extensive text data. The
purpose of these commands is to identify and process text that belongs to a specific
family, such as alphanumeric characters, digits, or words. This versatility allows them to
effectively handle any type of text or string class.
In summary, the utilisation of regular expressions enables the extraction of additional
information from textual data, all while reducing the length of the code required for such
operations.
For instance, consider a scenario where data has been extracted from a website
using web scraping techniques. The dataset includes the timestamp of user log events.
The objective is to perform log time extraction. However, the data exhibits significant
disorder and lacks organisation. The code contains HTML div elements, JavaScript
functions and other elements. Regular expressions should be utilised in such scenarios.
In addition to R, regular expressions are supported in various programming
languages such as Python, Ruby, Perl, Java and JavaScript.

3.1.1 Introduction to String Processing


Strings are commonly represented as arrays of bytes or words, allowing for the
storage of character sequences. They are a versatile data type and defined as a
sequence of characters, with termination marked by the special character “0.”
In the C programming language, strings can be denoted as either character arrays
or character pointers. Character arrays are stored in a manner similar to other array
types. When declared as an auto variable, the string str[] is stored in the stack segment,
whereas if declared as a global or static variable, it is stored in the data segment.

General Operations Performed on String:


Here are certain string notions that you need to be aware of:
1. Concatenation of Strings:
Concatenation is the process of combining multiple strings together. It involves joining
two strings to create a new one. There are two methods for concatenating strings:
a) String Concatenation without Inbuilt Methods:
To concatenate two strings without using any inbuilt methods, follow the
CONCATENATE algorithm as shown below:


Algorithm: CONCATENATE (STR1, STR2, STR3)


b) String Concatenation using Inbuilt Methods:
String concatenation can be achieved using various inbuilt methods in different
programming languages, such as C/C++, Java, Python, C# and Javascript. Each
language provides its own specific approach to perform string concatenation.
2. Find in String
Finding a specific item inside a given full string is a very simple operation carried out
on strings. This can be used to locate a specific character within a string or to locate
an entire string within another string.

(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)

a) Find a character in string:


Given a string and a character, find the first position at which the character occurs
in the string. Finding the position of a character in a string is a common task in
competitive programming.
b) Find a substring in another string:
Imagine a string with length N and a substring of length M. Run a nested loop
after that, with the inner loop running from 0 to M and the outer loop from 0 to
(N-M). Verify whether the inner loop’s traversed sub-string corresponds to the
given substring at each index.
Using an O(n) searching method, such as the KMP or Z algorithms, is an effective
approach.
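As a rough sketch of the nested-loop idea above (the helper name naive_find is illustrative, not a standard function), the same search can be written in a few lines of R using nchar() and substr():

# illustrative sketch of the naive O(N*M) substring search described above
naive_find <- function(text, pattern) {
  n <- nchar(text); m <- nchar(pattern)
  for (i in 0:(n - m)) {                        # outer loop: candidate start positions
    if (substr(text, i + 1, i + m) == pattern)  # inner comparison of M characters
      return(i + 1)                             # 1-based position of the first match
  }
  -1                                            # no match found
}
naive_find("hello world", "world")   # 7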

Language applications:
™ Java Substring
™ substr in C++
™ Python find
3. Replace in String:
Modifying strings by replacing specific characters, words or phrases is a common
operation. To achieve this, you can follow the methods outlined below:
a) Create a New String:
One way to handle the task is to create a new string from scratch, replacing
every instance of the substring S1 with S2 when encountered in the original
string S.


b) Iterative Approach:
Using a variable “i,” iterate through the characters of string S and take the
following steps:
If the prefix substring of string S starting from index i matches S1, add the
string S2 to the new string “ans.”
 If there is no match, add the current character to the new string “ans.”
After completing the above procedures, the resulting string “ans” will contain the
modifications. You can then print the “ans” as the final output.
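The iterative approach can be sketched in R as follows; replace_all here is a hypothetical helper written only to mirror the steps above, not a library function:

# hypothetical sketch of the iterative replace described above
replace_all <- function(S, S1, S2) {
  ans <- ""; i <- 1; n <- nchar(S); m <- nchar(S1)
  while (i <= n) {
    if (i + m - 1 <= n && substr(S, i, i + m - 1) == S1) {
      ans <- paste0(ans, S2); i <- i + m               # prefix matches S1: append S2
    } else {
      ans <- paste0(ans, substr(S, i, i)); i <- i + 1  # otherwise copy the character
    }
  }
  ans
}
replace_all("banana", "an", "XY")   # "bXYXYa"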
4. Finding the Length of String
Finding the length or size of a given string is one of the most common operations on
a String. The length of a string is determined by how many characters are included in
it.

(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)

There are two methods to calculate the length of a string:


a) Length of String without using any inbuilt methods:
To find the length of a string without using any built-in functions, you can follow
this algorithm:
1. Set LEN = 0 and I = 0.
2. Repeat Steps 3 and 4 while STRING[I] is not NULL:
3. LEN = LEN + 1.
4. Set I = I + 1.
5. Exit.
b) Length of String using Inbuilt Methods:
Calculating the length of a string can also be achieved using built-in functions in
different programming languages like C, C++, Java, Python, C# and Javascript.
Each language provides specific methods to obtain the length of a string.
5. Trim a String
Strings frequently contain spaces or other special characters. Therefore, understanding
how to trim such characters in a String is crucial.

Here is a Straightforward Fix


1) Iterate through the entire string, moving all succeeding characters one position
backwards if the current character is a space and increasing the length of the
resultant string otherwise.
The above solution has an O(n²) time complexity.
It can be solved in O(n) time using a Better Solution. The goal is to record the total
number of non-space characters that have been observed.


1) Set the initial value of ‘count’ to 0 (the number of non-space characters seen
thus far).
2) As you go over the characters in the given string iteratively, do the following:
a) If the character you are on is not a space, move it to the index ‘count’ and
increase ‘count’.
3) Finally, put the terminating character ‘\0’ at the index ‘count’.
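A minimal R sketch of the same idea, keeping only the non-space characters seen so far (trim_spaces is an illustrative helper and the example string is made up):

# keep only the non-space characters, preserving their order
trim_spaces <- function(s) {
  chars <- strsplit(s, "")[[1]]                # split the string into single characters
  paste(chars[chars != " "], collapse = "")    # drop spaces and re-join
}
trim_spaces("g e e k")   # "geek"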
6. Reverse and Rotation of a String:
a) Reverse Operation:
The reverse operation involves flipping the order of characters in a string, so
that the first character becomes the last, the second becomes the second last
and so on.
b) Rotations of a String:
Rotations of a string involve finding all possible circular shifts of the characters
in the string. Let’s consider an example with the string “hello.” All possible
rotations of the string “hello” would be:
 hello
 elloh
 llohe
 lohel
 ohell
Each rotation represents a circular shift of the characters in the string.
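The rotations listed above can be generated with a short, illustrative R helper (rotations is not a built-in function); for each shift it pastes the tail of the string in front of its head:

# generate all circular shifts of a string
rotations <- function(s) {
  n <- nchar(s)
  sapply(0:(n - 1), function(i)
    paste0(substr(s, i + 1, n), substr(s, 1, i)))   # tail + head for shift i
}
rotations("hello")
# [1] "hello" "elloh" "llohe" "lohel" "ohell"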
7. Subsequence of a String
A subsequence is a sequence that can be created by eliminating zero or more
elements from another sequence while maintaining the original order of the remaining
elements. More broadly, it can be stated that there can be a total of (2^n − 1)
non-empty subsequences for a sequence of size n.

Subsequence of a String

(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)


8. Substring of a String
A continuous portion of a string, or a string inside another string, is referred to as a
substring. In general, there are n*(n+1)/2 non-empty substrings for a string of size n.
9. Binary String
A binary string is a unique type of string that only contains two-character types, such
as 0 and 1.

(Image source: https://www.geeksforgeeks.org/introduction-to-strings-data-structure-and-algorithm-tutorials/)

10. Palindrome String


If a string’s reverse is identical to the string, it is referred to as a palindrome.
11. Lexicographic Patterns
Lexicographical pattern, often known as dictionary order, is a pattern based on ASCII
values. Consider a character’s lexicographic order as well as its ASCII value. As a
result, the characters’ lexicographical order will be
‘A’, ‘B’, ‘C’, …, ‘Y’, ‘Z’, ‘a’, ‘b’, ‘c’, …, ‘y’, ‘z’.
12. Pattern Searching
Pattern searching is the process of finding a specific pattern within a given string. It is
an advanced topic related to strings and is considered a subset of string algorithms.
Pattern searching algorithms are used to efficiently locate occurrences of one string
(the pattern) within another string (the text). These techniques are valuable for
searching and identifying patterns in text data.

3.1.2 Fundamentals of Regular Expressions


A regular expression, also known as regex, is a specific arrangement of characters
that establishes a predetermined pattern for conducting searches. The following
instructions outline the process of writing regular expressions:
Ɣ Begin by familiarising yourself with the special characters utilised in regular
expressions (regex), including “.”, “*”, “+”, “?” and others.
Ɣ Select a programming language or tool that provides support for regular expressions
(regex), such as Python, Perl, or grep.
Ɣ Construct your pattern using a combination of special characters and literal
characters.
Ɣ Utilise the suitable function or method to perform a pattern search within a given
string.

Examples:


1. To achieve a match for a sequence of literal characters, it is sufficient to directly
include those characters within the pattern.
2. To match a single character from a set of possibilities, the square brackets
notation can be utilised. For example, the expression [0123456789] can be
employed to match any digit.
3. The star (*) symbol is used to match zero or more occurrences of the preceding
expression.
A regular expression, also known as a rational expression, is a string of characters
that specifies a search pattern. It is primarily used for pattern matching with strings, such
as in operations like “find and replace”. Regular expressions provide a versatile method
for identifying patterns within character sequences. This functionality is utilised in various
programming languages such as C++, Java and Python. It is a powerful tool used in
computer science and programming to match and manipulate strings of text based on
specific patterns. Regular expressions are important due to their versatility and efficiency
in tasks such as data validation, text parsing and pattern matching. Their ability to
represent complex patterns and perform string manipulation makes them a fundamental
component in various applications and programming languages.

Importance of Regex
Regular expressions (regex) are utilised in Google Analytics for URL matching and
supporting search and replace operations. They are also commonly supported in popular
text editors such as Sublime, Notepad++, Brackets, Google Docs and Microsoft Word.
Example : Regular expression for an email address :
The regular expression provided is used to validate an email address. It consists of
a pattern that matches a sequence of characters. The pattern starts with a caret symbol
(^) which indicates the beginning of the string. The next part of the pattern, enclosed in
square brackets, specifies a range of characters that are allowed in the email address.
In this case, it includes lowercase letters (a-z), uppercase letters (A-Z), digits (0-9),
underscore (_) and the dot (.):
^[a-zA-Z0-9_.]+@[a-zA-Z0-9_.]+\.[a-zA-Z]{2,}$
The above regular expression can be used for checking whether a given set of
characters is an email address or not.
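As an illustrative sketch (the sample addresses below are invented for this demonstration, and the pattern simply mirrors the one above), such a check can be run in R with grepl():

# hedged sketch: validating e-mail-like strings with a simple pattern
emails <- c("user_01@example.com", "not-an-email", "a.b@dept.uni.edu")
pattern <- "^[a-zA-Z0-9_.]+@[a-zA-Z0-9_.]+\\.[a-zA-Z]{2,}$"
grepl(pattern, emails)
# [1]  TRUE FALSE  TRUE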

Process for writing regular expressions:


Regular expressions in R can be divided into 5 categories:
1. Metacharacters
2. Sequences
3. Quantifiers
4. Character Classes
5. POSIX character classes
The following elements are commonly used in the composition of regular
expressions:


1. Metacharacters
Metacharacters are a group of specialised operators that regex does not treat as
ordinary literal characters; regex follows its own set of rules for them. Almost every
line of text you encounter will contain at least one of these operators. They consist of:
. \ | ( ) [ ] { } $ * + ?
2. Quantifiers
   Quantifiers, despite their brevity in typing, possess significant potency as minuscule
entities. The alteration of a single position can have a significant impact on the overall
output value. Quantifiers are primarily utilised to ascertain the extent of the match
outcome. It is important to note that quantifiers exert their influence on the items
directly preceding them. They are frequently employed in the process of identifying and
analysing patterns within textual data (the wildcard ".", for instance, matches all
characters except for newline characters).
   The quantifiers can be utilised in conjunction with metacharacters, sequences
and character classes in order to yield intricate patterns. The utilisation of various
combinations of these quantifiers enables one to effectively identify and match a given
pattern. The understanding of these quantifiers can be enhanced through two distinct
approaches:
   Greedy quantifier: The symbol "*" is commonly referred to as a greedy quantifier. It
attempts to match a specific pattern as many times as the available repetitions of that
pattern allow.
   Non-greedy quantifier: Denoted by the symbol "?", this makes matching non-greedy,
meaning a specific pattern will halt its search upon encountering the initial match.
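A small, hedged illustration of the difference in base R (the sample string is made up; perl = TRUE is used so that the lazy "?" form is available):

# greedy vs. non-greedy matching
x <- "<b>bold</b> and <i>italic</i>"
regmatches(x, gregexpr("<.*>",  x, perl = TRUE))[[1]]   # greedy: one long match
# [1] "<b>bold</b> and <i>italic</i>"
regmatches(x, gregexpr("<.*?>", x, perl = TRUE))[[1]]   # non-greedy: stops at first ">"
# [1] "<b>"  "</b>" "<i>"  "</i>"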
3. Sequences
Sequences are composed of specific characters that are utilised to represent a pattern
within a provided string. The following is a list of commonly used sequences in R:

Sequences Description
\d Matches a digit character
\D Matches a non-digit character
\s Matches a space character
\S Matches a non-space character
\w Matches a word character
\W Matches a non-word character
\b Matches a word boundary
\B Matches a non-word boundary
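A few quick, illustrative checks of these sequences in base R (remember that the backslash must be doubled inside R string literals):

grepl("\\d", c("abc", "a1c"))                # digit present?  FALSE TRUE
gsub("\\s+", "-", "too   many  spaces")      # collapse whitespace: "too-many-spaces"
grepl("\\bcat\\b", c("cat", "concatenate"))  # word boundary:  TRUE FALSE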

4. Character classes
Character classes are defined as a collection of characters that are enclosed within
a square bracket [ ]. The classes exclusively correspond to the characters that are
enclosed within the brackets. The utilisation of quantifiers can be combined with these
classes. The caret (^) symbol is utilised in character classes, which is an intriguing
aspect. The operation of negation is applied to the given expression, causing it to


search for all elements that do not match the specified pattern. The following section
outlines the various character classes that are commonly utilised in regular expressions
(regex):

Characters Description
[aeiou] Matches lowercase vowels
[AEIOU] Matches uppercase vowels
[0123456789] Matches any digit
[0-9] Same as the previous class
[a-z] Match any lowercase letter
[A-Z] Match any uppercase letter
[a-zA-Z0-9] Match any of the above classes
[^aeiou] Matches everything except lowercase vowels
[^0-9] Matches everything except digits
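The classes and the caret-based negation can be tried out directly; the vector below is an invented example:

x <- c("apple", "SKY", "r2d2")
grepl("[aeiou]", x)      # contains a lowercase vowel:   TRUE FALSE FALSE
gsub("[^0-9]", "", x)    # keep only the digits:         ""  ""  "22"
grepl("^[A-Z]+$", x)     # consists only of capitals:    FALSE TRUE FALSE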

5. POSIX character classes


In the R programming language, POSIX character classes can be identified by being enclosed within
a double square bracket ([[ ]]). Character classes are utilised in a similar manner.
The presence of a caret symbol before an expression indicates the negation of the
expression’s value. These classes are perceived as more intuitive compared to others,
resulting in a higher level of ease when learning. The available POSIX character
classes in R are as follows:
Some of the Symbols are explained below with example:
1. Repeaters ( *, + and { } )
The use of repeaters, specifically *, + and { }, is a common practice in technical writing.
These repeaters serve the purpose of repeating elements in a concise and efficient
manner. By utilising these symbols, writers can convey repetition without the need for
extensive repetition of the same content.
The symbols in question serve as repeaters, indicating to the computer that the
preceding character should be used multiple times.
2. The asterisk symbol ( * )
The regular expression instructs the computer to search for zero or more occurrences
(up to an infinite number) of the preceding character or set of characters.
Example: The regular expression “ab*c” matches strings such as “ac”, “abc”, “abbc”,
“abbbc” and so forth.
3. The plus symbol (+)
The command directs the computer to repeat the previous character or group of
characters at least once or more, possibly an unlimited number of times.
Example: The regular expression “ab+c” matches strings that contain the letter ‘a’,
followed by one or more occurrences of the letter ‘b’ and then the letter ‘c’. This
pattern will match strings such as “abc”, “abbc”, “abbbc” and so on.
4. The curly braces { … }


The use of curly braces, denoted by { ... }, is a common syntax element in programming
languages. They instruct the computer to repeat the preceding character (or set of
characters) for the number of times specified within the brackets.
Example : {2} means that the preceding character is to be repeated 2 times
5. Wildcard ( . )
The wildcard ( . ) is a symbol used in various programming languages and regular
expressions to represent any character or set of characters. It is often used as a
placeholder or a pattern-matching tool. The dot wildcard can match any single
character, including letters, numbers and other symbols.
The dot symbol, also known as the wildcard character, has the ability to substitute for
any other symbol.
Example : The Regular expression .* will tell the computer that any character
can be used any number of times.
6. Optional character ( ? )
The optional character, denoted by the symbol “?”, indicates to the computer that the
preceding character in the string to be matched may or may not be present.
Example: The format for a document file can be written as “docx?”.
The symbol ‘?’ is used to indicate to the computer that the character ‘x’ may or may not
be included in the file format name.
7. The caret ( ^ ) symbol
The caret symbol (^) is used to set the position for a match.
The caret symbol is used to indicate to the computer that the match should commence
at the start of the string or line.
Example:The regular expression pattern ^\d{3} will successfully match with patterns
such as “901” in the string “901-333-”.
8. The dollar ( $ ) symbol
The dollar symbol ($) is used to anchor a match at the end.
The regular expression specifies that the match should be located at the end of the
string, or immediately before the newline character (\n) at the end of the line or string.
Example:The regular expression pattern “-\d{3}$” can be used to match strings that
end with a three-digit number preceded by a hyphen. For instance, in the string “-901-
333”, the pattern will successfully match the substring “-333”.
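Both anchors can be verified in R on the strings used in the two examples above:

x <- c("901-333-", "-901-333")
grepl("^\\d{3}", x)     # starts with three digits:       TRUE FALSE
grepl("-\\d{3}$", x)    # ends with "-" and three digits: FALSE TRUE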
9. Character Classes
Character classes are a fundamental concept in computer programming and regular
expressions. They are used to define a set of characters that can be matched in a
given pattern. A character class is capable of matching any single character from a

Amity Directorate of Distance & Online Education


94 Optimization and Dimension Reduction Techniques

specified set of characters. Regular expressions are used to match the fundamental
components of a language, such as letters, digits, spaces, symbols and other similar
entities.
™ The escape sequence “\s” is used to represent any whitespace characters,
including spaces and tabs.
™ The regular expression pattern “\S” is used to match any character that is not a
whitespace character.
™ The regular expression \d is used to match any digit character.
™ The regular expression \D matches any characters that are not digits.
™ The regular expression pattern \w is used to match any word character, which
includes alphanumeric characters.
™ The regular expression pattern \W is used to match any non-word character.
™ The regular expression “\b” is used to match any word boundary, which includes
spaces, dashes, commas, semi-colons and other similar characters.
™ The [set_of_characters] pattern is used to match any single character that is
included in the set_of_characters. The match is case-sensitive by default.
Example:The pattern [abc] is used to match any occurrence of the characters a, b, or
c within a given string.
10. [^set_of_characters]: Negation
The notation [^set_of_characters] is used to represent a set of characters that should
not be included. Negation refers to the ability to match any individual character that
does not belong to a specified set of characters. The match is case-sensitive by default.
Example:The regular expression [^abc] is used to match any character except for the
characters a, b and c.
11. [first-last]: Character range
The character range refers to a pattern that matches any single character within a
specified range, starting from the first character and ending with the last character.
Example:The regular expression [a-zA-Z] is used to match any character within the
range of a to z or A to Z.
12. The Escape Symbol ( \ )
The escape symbol ( \ ) is used to match specific characters such as ‘.’, ‘*’ and
others. To match for these characters, simply add a backslash (\) before the desired
character. The computer will be instructed to interpret the subsequent character as a
search character and include it in the evaluation of a matching pattern.
Example:The regular expression \d+[\+-x\*]\d+ can be used to identify patterns such
as “2+2” and “3*9” within the string “(2+2) * 3*9”.
13. Grouping Characters ( )
The process of grouping characters is achieved through the use of brackets ( ).
A collection of distinct symbols within a regular expression can be consolidated into
a cohesive entity, functioning as a unified block. To achieve this, it is necessary to
enclose the regular expression within brackets ( ).


Example: The regular expression ([A-Z]\w+) is composed of two distinct elements that
have been combined. The provided regular expression is designed to identify patterns
that consist of an uppercase letter followed by one or more word characters.
14. Vertical Bar ( | )
The vertical bar (|) is a symbol used to match any single element that is separated by
it.
Example:The regular expression “th(e|is|at)” can be used to match words such as
“the,” “this,” and “that.”
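A short, illustrative check of grouping and alternation together in R (the word list is made up; the anchors are added so that whole words are compared):

words <- c("the", "this", "that", "then", "those")
grepl("^th(e|is|at)$", words)
# [1]  TRUE  TRUE  TRUE FALSE FALSE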

3.1.3 Pattern Matching and Extraction


There are a total of five functions available that offer pattern matching capabilities.
The three functions explained first, namely grep(), grepl() and regexpr(),
are among the most commonly used functions. The main distinction among these three
functions lies in the type of output they produce.
The functions gregexpr() and regexec() are also illustrated here. The following two
functions offer comparable functionalities to the regexpr() function, but with the output
presented in list format.

1. grep()
To identify a pattern within a character vector and obtain the element values or
indices as the output, the grep() function can be utilised
# use the built in data set state.division
head(as.character(state.division))
## [1] "East South Central" "Pacific"            "Mountain"
## [4] "West South Central" "Pacific"            "Mountain"
# find the elements which match the pattern
grep("North", state.division)
## [1] 13 14 15 16 22 23 25 27 34 35 41 49
# use value = TRUE to show the element value
grep("North", state.division, value = TRUE)
## [1] "East North Central" "East North Central" "West North Central"
## [4] "West North Central" "East North Central" "West North Central"
## [7] "West North Central" "West North Central" "West North Central"
## [10] "East North Central" "West North Central" "East North Central"
# can use the invert argument to show the non-matching elements
grep("North | South", state.division, invert = TRUE)
## [1] 2 3 5 6 7 8 9 10 11 12 19 20 21 26 28 29 30 31 32 33 37 38 39
## [24] 40 44 45 46 47 48 50

2. grepl()
To find a pattern in a character vector and to have logical (TRUE/FALSE) outputs
use grepl() :

Amity Directorate of Distance & Online Education


96 Optimization and Dimension Reduction Techniques

grepl("North | South", state.division)
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [23]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## [34]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE
## [45] FALSE FALSE FALSE FALSE  TRUE FALSE
# wrap in sum() to get the count of matches
sum(grepl("North | South", state.division))
## [1] 20
To find exactly where the pattern exists in a string use regexpr():
x <- c("v.111", "0v.11", "00v.1", "000v.", "00000")
regexpr("v.", x)
## [1]  1  2  3  4 -1
## attr(,"match.length")
## [1]  2  2  2  2 -1
## attr(,"useBytes")
## [1] TRUE

3. regexpr()
The interpretation of the output from the regexpr() function is as follows: The first
element of the output indicates the starting position of the match in each element.
It should be noted that a value of -1 indicates the absence of a match. The second
element, referred to as the attribute “match length,” indicates the length of the match.
The value of the third element, “useBytes” attribute, is set to TRUE. This indicates that
the matching process was performed on a byte-by-byte basis, rather than a character-by-
character basis.
# some text
text = c("one word", "a sentence", "you and me", "three two one")
# default usage
regexpr("one", text)
## [1]  1 -1 -1 11
## attr(,"match.length")
## [1]  3 -1 -1  3
## attr(,"useBytes")
## [1] TRUE


4. gregexpr()
The gregexpr() function performs a similar task to regexpr(). It is used to locate the
position of a pattern within a vector of strings by searching each element individually.
The primary distinction lies in the output format of gregexpr(), which is a list. The function
gregexpr() returns a list that has the same length as the input text. Each element in
the list follows the same format as the return value of regexpr(). However, instead of
providing the starting position of only one match, it provides the starting positions of all
non-overlapping matches.
# some text
text = c("one word", "a sentence", "you and me", "three two one")
# pattern
pat = "one"
# default usage
gregexpr(pat, text)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
## attr(,"useBytes")
## [1] TRUE
##
## [[4]]
## [1] 11
## attr(,"match.length")
## [1] 3
## attr(,"useBytes")
## [1] TRUE

5. regexec()
The regexec() function is another base R function for performing regular expression
matching.
The regexec() function closely resembles gregexpr() as it produces an output that is a
list of the same length as the text. The starting position of the match is stored in each element
of the list. A value of -1 indicates the absence of a match. Furthermore, it should be noted
that every element within the list possesses the attribute “match.length,” which provides the
lengths of the matches. In cases where there is no match, the attribute value is -1.
# some text
text = c("one word", "a sentence", "you and me", "three two one")
# pattern
pat = "one"
# default usage
regexec(pat, text)
## [[1]]
## [1] 1
## attr(,"match.length")
## [1] 3
##
## [[2]]
## [1] -1
## attr(,"match.length")
## [1] -1
##
## [[3]]
## [1] -1
## attr(,"match.length")
## [1] -1
##
## [[4]]
## [1] 11
## attr(,"match.length")
## [1] 3
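What sets regexec() apart from gregexpr() is that it also reports the positions of parenthesised sub-expressions, which regmatches() can turn into the captured text. The date pattern and sample strings below are illustrative only:

# regexec() also reports parenthesised groups; regmatches() extracts them
dates <- c("born 1995-07-14", "no date here")
m <- regexec("(\\d{4})-(\\d{2})-(\\d{2})", dates)
regmatches(dates, m)
## [[1]]
## [1] "1995-07-14" "1995"       "07"         "14"
##
## [[2]]
## character(0)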

Pattern Replacement Functions


Pattern replacement functions are a type of function that allows for the replacement
of specific patterns within a given text or string. In addition to identifying patterns in


character vectors, it is also a common requirement to replace a specific pattern within a
string with a new pattern. The Base R regex functions offer two options for this purpose:
There are two options available for replacing occurrences:
(a) replacing the first matching occurrence or
(b) replacing all occurrences.
To replace the first matching occurrence of a pattern use sub() :
new <- c("New York", "new new York", "New New New York")
new
## [1] "New York"         "new new York"     "New New New York"
# Default is case sensitive
sub("New", replacement = "Old", new)
## [1] "Old York"         "new new York"     "Old New New York"
# use 'ignore.case = TRUE' to perform the obvious
sub("New", replacement = "Old", new, ignore.case = TRUE)
## [1] "Old York"         "Old new York"     "Old New New York"

To replace all matching occurrences of a pattern use gsub() :


# Default is case sensitive
gsub("New", replacement = "Old", new)
## [1] "Old York"         "new new York"     "Old Old Old York"
# use ignore.case = TRUE to perform the obvious
gsub("New", replacement = "Old", new, ignore.case = TRUE)
## [1] "Old York"         "Old Old York"     "Old Old Old York"

Extracting Patterns
In Stringr, there are two primary options for extracting a string that matches a
pattern: (a) extracting the first occurrence or (b) extracting all occurrences. When using
str_extract(), it searches for the first instance of the pattern in a character vector. If a
match is found, it returns the matched string; otherwise, the output for that element will be
NA and the output vector will be the same length as the input string.
y <- c("I use R #useR2014", "I use R and love R #useR2015", "Beer")
str_extract(y, pattern = "R")
## [1] "R" "R" NA
Use str_extract_all() to find every instance of a pattern within a character vector. A
list that is the same length as the vector’s number of items is produced as the output.
The matching pattern occurrence within each list item’s respective vector element will be
provided.
str_extract_all(y, pattern = "[[:punct:]]*[a-zA-Z0-9]*R[a-zA-Z0-9]*")
## [[1]]
## [1] "R"         "#useR2014"
##
## [[2]]
## [1] "R"         "R"         "#useR2015"
##
## [[3]]
## character(0)

3.1.4 String Manipulation with Regex in R


String manipulation involves a set of functions that are utilised to extract information
from variables containing text. In the field of machine learning, these functions are
extensively utilised for the purpose of feature engineering. Specifically, they are
employed to generate novel features from pre-existing string features. In the R
programming language, there are packages available, such as stringr and stringi, that
provide a comprehensive set of functions for manipulating strings. These packages can
be loaded into the R environment to access their functionalities.
Furthermore, the R programming language includes a variety of built-in functions
specifically designed for manipulating strings. The purpose of these functions is to
enhance the functionality of regular expressions. The practical differences between string
manipulation functions and regular expressions can be observed in the following aspects:
String manipulation functions serve basic operations on strings, such as splitting or
extracting specific letters. In contrast, regular expressions handle more complex tasks, like
extracting email IDs or dates from text. String manipulation functions provide predefined
responses, while regular expressions offer customization according to specific needs.
For instance, consider a dataset with customer names as a variable. String
manipulation functions can help extract first names and last names, creating new
features. To gain practical knowledge on these functions and commands, make sure you
have R installed and consider installing the stringr R package.

String Manipulation Functions in R


In R, strings are represented by values enclosed in quotes (“ “). Even numeric values
can be treated as strings. In R, strings are recognized as objects of the character class.
Let’s examine this behaviour with examples:
text <- "san francisco"
typeof(text)
[1] "character"
num <- c("24", "34", "36")
typeof(num)
[1] "character"

paste - Base Function for Combining Strings:


The paste function in R is widely used in machine learning to create or restructure
variable names. For instance, if you want to combine “Var1” and “Var2” to create a new
string “Var3”, you can use paste:
var3 <- paste("Var1", "Var2", sep = "-")
var3
[1] "Var1-Var2"


Recycling Behaviour of paste:


When you pass vectors of different lengths to paste, it will recycle the shorter vector
to match the length of the longer vector. For example:
paste(1:5, c("?", "!"), sep = "-")
[1] "1-?" "2-!" "3-?" "4-!" "5-?"
cat - Printing and Concatenating without Quotes:
The cat function in R allows you to display and concatenate strings without quotes:
cat(text, "USA", sep = "-")
san francisco-USA
cat(month.name[1:5], sep = " ")
January February March April May
toString - Converting Non-Character Values to Strings:
The toString function converts non-character values to their string representations:
toString(1:10)
[1] "1, 2, 3, 4, 5, 6, 7, 8, 9, 10"
These string manipulation functions in R enable efficient handling and processing of
textual data.
Now, let us look at some of the commonly used base R functions (also available in
stringr) to modify strings:

Functions Description
nchar() It counts the number of characters in a string or a vector. In the stringr
package, its substitute function is str_length().
tolower() It converts a string to lowercase. Alternatively, you can use the str_to_lower()
function.
toupper() It converts a string to uppercase. Alternatively, you can also use the
str_to_upper() function.
chartr() It is used to replace each character in a string. Alternatively, you can use the
str_replace() function to replace a complete string.
substr() It is used to extract parts of a string. Start and end positions need to be
specified. Alternatively, you can use the str_sub() function.
setdiff() It is used to determine the difference between two vectors.
setequal() It is used to check if the two vectors have the same string values.
abbreviate() It is used to abbreviate strings. The length of the abbreviated string needs to be
specified.
strsplit() It is used to split a string based on a criterion. It returns a list. Alternatively, you
can use the str_split() function. This function lets you convert your list output to
a character matrix.
sub() It is used to find and replace the first match in a string.
gsub() It is used to find and replace all the matches in a string vector. Alternatively, you
can use the str_replace_all() function.


Now, these functions will be written and their effects on strings will be understood.
The outputs are not shown here; you are expected to run these commands locally
and observe the differences.
library(stringr)
string <- "Los Angeles, officially the City of Los Angeles and often known by its initials
L.A., is the second-most populous city in the United States (after New York City), the
most populous city in California and the county seat of Los Angeles County. Situated in
Southern California, Los Angeles is known for its Mediterranean climate, ethnic diversity,
sprawling metropolis and as a major centre of the American entertainment industry."
strwrap(string)
#count number of characters
nchar(string)

str_length(string)
#convert to lower
tolower(string)
str_to_lower(string)

#convert to upper
toupper(string)
str_to_upper(string)

#replace strings
chartr("and", "for", x = string) #letters a,n,d get replaced by f,o,r
str_replace_all(string = string, pattern = c("City"), replacement = "state") #this is case sensitive

#extract parts of string


substr(x = string, start = 5, stop = 11)
str_sub(string = string, start = 5, end = 11) #extract Angeles

#get difference between two vectors


setdiff(c("monday","tuesday","wednesday"), c("monday","thursday","friday"))

#check if strings are equal


setequal(c("monday","tuesday","wednesday"), c("monday","tuesday","wednesday"))
setequal(c("monday","tuesday","thursday"), c("monday","tuesday","wednesday"))

#abbreviate strings
abbreviate(c("monday","tuesday","wednesday"), minlength = 3)


#split strings
strsplit(x = c("ID-101","ID-102","ID-103","ID-104"), split = "-")
str_split(string = c("ID-101","ID-102","ID-103","ID-104"), pattern = "-", simplify = T)
#find and replace first match
sub(pattern = "L", replacement = "B", x = string, ignore.case = T)

#find and replace all matches


gsub(pattern = "Los", replacement = "Bos", x = string, ignore.case = T)

3.2 Data Wrangling with Dplyr


The dplyr package contains an additional set of functions for data wrangling. The
aforementioned package is a component of the tidyverse. This topic provides a concise
introduction to dplyr and its distinct coding structure, which differs from the coding
structure used thus far.
The tidyverse is a curated set of R packages specifically developed for data science
purposes. All packages exhibit a common underlying design philosophy, adhere to a
consistent grammar and utilise standardised data structures.
The dplyr package possesses its own dedicated website.
dplyr is a comprehensive framework for data manipulation that offers a cohesive
collection of verbs designed to address prevalent challenges encountered during data
manipulation tasks.
Ɣ The mutate() function is used to introduce additional variables that are derived from
existing variables.
Ɣ The select() function is used to choose variables by their names.
Ɣ The filter() function is used to select cases based on their values.
Ɣ The function “summarise()” is designed to condense a set of multiple values into a
single summary.
Ɣ The function arrange() is used to modify the order of the rows.
The combination of these functions is seamless when used in conjunction with
group_by(). This function enables the execution of various operations on a per-group
basis. Additional information about the vignette titled “dplyr” can be obtained for a more
comprehensive understanding. In addition to the single-table verbs, dplyr also offers a
range of two-table verbs. Further information on these can be found in the vignette titled
“two-table”. (For individuals who are unfamiliar with dplyr, it is highly recommended to
commence their learning journey by referring to the data transformation chapter in the
book “R for Data Science.”)
The package comprises functions that are named as verbs, such as “mutate” and
“select.” These functions collectively provide extensive support for various data wrangling
tasks.
To get started, proceed with the installation of dplyr:
install.packages(“dplyr”)
library(dplyr)
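As a minimal sketch of how these verbs combine with group_by() and the pipe operator (using the built-in mtcars data set; the column choices are arbitrary):

mtcars %>%
  filter(mpg > 20) %>%                        # keep fuel-efficient cars
  group_by(cyl) %>%                           # per-group operations
  summarise(avg_hp = mean(hp), n = n()) %>%   # one summary row per group
  arrange(desc(avg_hp))                       # order the result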


3.2.1 Introduction to Data Wrangling


The core principle of data analytics is centred on the notion of data manipulation. In the
present context, we are referring to a distinct form of data that is
gathered and examined for diverse objectives. Data manipulation is a crucial component
in various tasks, including web scraping, statistical analysis and the development of
dashboards and visualisations. Before proceeding with any of these tasks, it is crucial
to ensure that data is in a suitable format for utilisation. Data wrangling is an essential
component of data manipulation.

Defining Data Wrangling


Data wrangling refers to the process of cleaning, transforming and organising raw
data into a structured format that is suitable for analysis. It plays a crucial role in data
science and analytics as it ensures that the data is accurate, consistent and complete,
enabling meaningful insights and informed decision-making.

Image Source: https://towardsdatascience.com/data-wrangling-raw-to-clean-transformation-b30a27bf4b3b

The term “data wrangling” is commonly employed to refer to the initial phases of the
data analytics process. The process entails the conversion and mapping of data from one
format to another. The objective is to enhance the accessibility of data for applications
such as business analytics and machine learning. The process of data wrangling
encompasses a diverse range of tasks. The aforementioned tasks encompass data
collection, exploratory analysis, data cleansing, creation of data structures and storage.

Importance of Data Wrangling


The process of data wrangling can be characterised as a time-consuming
endeavour. In reality, the task can consume approximately 80% of a data analyst’s
time. This is primarily due to the fluid nature of the process, where clear and sequential
steps may not always be readily available for guidance throughout the entire duration.
However, this is primarily due to the iterative nature of the process and the labour-
intensive activities it entails. The necessary actions to be taken are contingent upon
factors such as the origin or origins of the data, the level of data integrity, the data
architecture implemented within your organisation and the intended utilisation of the data
upon completion of the data wrangling process.
The insights acquired throughout the data wrangling process can prove to be
extremely valuable. The potential impact of these factors on the project’s future trajectory
is significant. Failure to properly complete this step will lead to the development of subpar
data models that have negative consequences on an organisation’s decision-making
processes and overall reputation. If someone ever suggests that data wrangling is not of
significant importance, you are authorised to inform them otherwise.
Regrettably, the significance of data wrangling can be overlooked due to a lack of
understanding in some cases. High-level decision-makers with a preference for expeditious
outcomes may encounter unexpected delays in the process of transforming data into a
format that is readily employable. In contrast to the outcomes yielded by data analysis,
which frequently offer captivating and stimulating revelations, the data wrangling phase
typically yields minimal visible progress despite significant efforts. In light of budget and
time constraints, the challenges faced by data wranglers are amplified for businesses. The
position requires proficient management of expectations and technical expertise.

What distinguishes data wrangling from data cleaning?


The terms ‘data wrangling’ and ‘data cleaning’ are often used interchangeably by
certain individuals. The reason for this is that both of these tools serve the purpose of
transforming data into a format that is more practical and beneficial. The reason for this is
also due to the fact that they possess certain shared characteristics. However, it is crucial
to note that there exist significant distinctions between them.
Data wrangling is the systematic procedure of gathering unprocessed data, performing
data cleansing, transforming it into a structured format and subsequently storing it in a
manner that facilitates efficient utilisation. The term “data wrangling” is frequently used to
refer to each of the individual steps involved in the process, as well as their combination.
This can lead to confusion, as data wrangling is not always well understood.
Data cleaning is a specific component within the broader data wrangling process.
Data cleaning is a multifaceted procedure that encompasses various tasks. These tasks
include removing undesired observations, eliminating outliers, rectifying structural errors
and typos, standardising units of measure and validating the data set. The process of
data cleaning typically involves a more structured approach compared to data wrangling,
although the steps involved may not always be strictly sequential. The differentiation
between data wrangling and data cleaning is often ambiguous. Data wrangling can be
conceptualised as a comprehensive task that encompasses various activities. Data
cleaning is categorised within this domain, along with various other activities. The
process typically includes several steps such as planning the desired data collection,
performing data scraping, conducting exploratory analysis, cleansing and mapping the
data, creating appropriate data structures and finally storing the data for future utilisation.

3.2.2 Data Manipulation with Dplyr Package


R offers a library known as dplyr that encompasses a range of pre-existing methods
designed to facilitate data manipulation. These methods can be utilised to effectively
manipulate data. To utilise the data manipulation function, it is necessary to import the
dplyr package by executing the line of code library(dplyr). The following is a compilation
of several data manipulation functions that are available in the dplyr package.

Function Name Description


filter() Produces a subset of a Data Frame.
distinct() Removes duplicate rows in a Data Frame.
arrange() Reorders the rows of a Data Frame.
select() Produces data in required columns of a Data Frame.
rename() Renames the variable names.
mutate() Creates new variables without dropping old ones.
transmute() Creates new variables by dropping the old.
summarise() Gives summarised data like Average, Sum, etc.


1. filter()
The filter() function is used to generate a subset of data that meets specific criteria.
It allows proper filtering using various methods, including conditional operators, logical
operators, NA values and range operators. The syntax of the filter() function is as follows:
filter(dataframeName, condition)

Example:
The code below utilises the filter() function to retrieve data from the “stats” data
frame, specifically for players who have scored more than 100 runs.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D'),
                    runs=c(100, 200, 408, 19),
                    wickets=c(17, 20, NA, 5))
# fetch players who scored more
# than 100 runs
filter(stats, runs>100)

Output
player runs wickets
1 B 200 20
2 C 408 NA
2. distinct()
The distinct() method is used to eliminate duplicate rows from a data frame or based
on the specified columns. The distinct() method follows the syntax provided below:
distinct(dataframeName, col1, col2, ..., .keep_all = TRUE)
Example: In this example, the distinct() method was utilised to eliminate duplicate
rows from the data frame. Additionally, duplicates were removed based on a specified
column.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c('A', 'B', 'C', 'D', 'A', 'A'),
                    runs=c(100, 200, 408, 19, 56, 100),
                    wickets=c(17, 20, NA, 5, 2, 17))
# removes duplicate rows
distinct(stats)
#remove duplicates based on a column


distinct(stats, player, .keep_all = TRUE)


Output

player runs wickets


1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5
5 A 56 2

player runs wickets


1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5

3. arrange() method
The arrange() function in R is used to sort the rows according to a designated column.
The syntax of the arrange() method is as follows:
arrange(dataframeName, columnName)
Example:
The code below demonstrates the utilisation of the arrange() function to
order the data in ascending order based on the runs.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# ordered data based on runs
arrange(stats, runs)

Output

player runs wickets


1 D 19 5
2 A 100 17
3 B 200 20
4 C 408 NA

4. select()
The select() method is used to retrieve the desired columns as a table
by specifying the necessary column names within the select() method. The syntax of
the select() method is as follows:
select(dataframeName, col1,col2,…)

Example:


In the following code, the select() method was utilised to retrieve only the player and
wickets column data.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# fetch required column data
select(stats, player,wickets)

Output

player wickets
1 A 17
2 B 20
3 C NA
4 D 5

5. rename()
The rename() function is used to modify column names. This can be achieved using the following syntax:
rename(dataframeName, newName=oldName)

Example:
In this particular instance, the column name "runs" is modified to "runs_scored"
within the stats data frame.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, NA, 5))
# renaming the column
rename(stats, runs_scored = runs)

Output

player runs_scored wickets


1 A 100 17
2 B 200 20
3 C 408 NA
4 D 19 5

6. mutate() and transmute()


The following methods are utilised for the purpose of generating new variables. The
mutate() function is used to generate new variables while retaining the existing ones,
whereas the transmute() function drops the old variables and generates new ones. The
syntax for both methods is provided below:
mutate(dataframeName, newVariable=formula)
transmute(dataframeName, newVariable=formula)

Example:
In this example, a new column named “avg” was created using the mutate() and
transmute() methods.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, 7, 5))
# add new column avg
mutate(stats, avg=runs/4)
# drop all and create a new column
transmute(stats, avg=runs/4)

Output

player runs wickets avg


1 A 100 17 25.00
2 B 200 20 50.00
3 C 408 7 102.00
4 D 19 5 4.75

avg
1 25.00
2 50.00
3 102.00
4 4.75
The mutate() function is used to append a new column to the existing data frame
without removing any of the existing columns. On the other hand, the transmute()
function is used to create a new variable while discarding all the previous columns.
7. summarise()
The data in the data frame can be summarised using the summarise method. This
can be achieved by applying aggregate functions such as sum(), mean() and others. The
syntax of the summarise() method is as follows:
summarise(dataframeName, aggregate_function(columnName))


Example:
The code below demonstrates the utilisation of the summarise() method to present a
summary of the data contained within the runs column.
# import dplyr package
library(dplyr)
# create a data frame
stats <- data.frame(player=c(‘A’, ‘B’, ‘C’, ‘D’),
runs=c(100, 200, 408, 19),
wickets=c(17, 20, 7, 5))
# summarise method
summarise(stats, sum(runs), mean(runs))

Output
sum(runs) mean(runs)
1 727 181.75

3.2.3 Filtering, Sorting and Selecting Data


Filtering
Filtering data is a frequently performed operation used to identify and select
observations that meet a specific value or condition for a particular variable. The filter()
function offers the capability mentioned. The sub_exp data frame, which contains the
expenditures from the past 5 years, can be filtered by Division.

Below is a table presenting multiple logic rules that can be used with the filter()
function:

Logic Rule Description


< Less than
!= Not equal to
> Greater than
%in% Group membership
== Equal to
is.na Check if the value is NA
<= Less than or equal to
!is.na Check if the value is not NA
>= Greater than or equal to


& Logical AND
| Logical OR
! Logical NOT

You can use these logic rules within the filter() function to filter data based on
specific conditions and generate subsets of the data that meet those criteria.
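As a quick, hedged illustration, several of these logic rules can be combined within a single filter() call. The sketch below reuses the stats data frame created earlier in this section; the conditions themselves are purely illustrative.
# players with more than 50 runs whose wickets value is not missing
filter(stats, runs > 50 & !is.na(wickets))
# players named A or B (group membership with %in%)
filter(stats, player %in% c("A", "B"))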
There are additional filtering and subsetting functions that are quite useful:
# remove duplicate rows
sub_exp %>% distinct()
# random sample, 50% sample size without replacement
sub_exp %>% sample_frac(size = 0.5, replace = FALSE)
# random sample of 10 rows with replacement
sub_exp %>% sample_n(size = 10, replace = TRUE)
# select rows 3-5
sub_exp %>% slice(3:5)
# select top n entries - in this case ranks variable X2011 and selects
# the rows with the top 5 values
sub_exp %>% top_n(n = 5, wt = X2011)

Selecting
When dealing with a large data frame, it is common to have the need to evaluate
specific variables. The select() function facilitates the selection and potential renaming of
variables. Assuming our objective is to exclusively evaluate the expenditure data from the
past five years.
The application of special functions within the select() method is also possible. To
illustrate, users have the option to select all variables that begin with the letter ‘X’. To view
the list of available functions, please use the command ‘?select’.
expenditures %>%
  select(starts_with("X")) %>%
  head()


Variables can be de-selected by utilising the “-” symbol before the respective name
or function. The inverse of the functions above can be obtained using the following
procedure:
expenditures %>% select(-X1980:-X2006)
expenditures %>% select(-starts_with("X"))
To enhance convenience, the system provides two options for renaming selected
variables.
# select and rename a single column
expenditures %>% select(Yr_2011 = X2011)
# Select and rename the multiple variables with an “X” prefix:
expenditures %>% select(Yr_ = starts_with("X"))
# keep all variables and rename a single variable
expenditures %>% rename (`2011` = X2011)

Sorting
There are instances where it is desirable to examine observations in a ranked
sequence based on specific variable(s).
The arrange() function facilitates the sorting of data based on variables in either
ascending or descending order. In order to evaluate the average expenditures per
division, the following analysis will be performed.
The arrange() function can be utilised to sort the divisions based on their expenditure
for 2011, arranging them in ascending order. This facilitates the identification of significant
distinctions among Divisions 8, 4, 1 and 6 in comparison to Divisions 5, 7, 9, 3 and 2.
sub_exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(Mean_2011)

A descending argument can be employed to arrange items in a rank-order from highest to lowest.
The data can be displayed in descending order by utilising the desc() function within
the arrange() function.


sub_exp %>%
  group_by(Division) %>%
  summarise(Mean_2010 = mean(X2010, na.rm = TRUE),
            Mean_2011 = mean(X2011, na.rm = TRUE)) %>%
  arrange(desc(Mean_2011))

3.2.4 Transforming and Aggregating Data


Transforming Data
Data transformation in data mining is the process of converting raw data into a
format suitable for analysis and modelling. Its goal is to prepare data for data mining to
extract useful insights and knowledge. Data transformation involves several key steps:
1. Data Cleaning
Data cleaning involves identifying and correcting errors, inconsistencies, and missing
values in the dataset. This step ensures that the data is accurate and reliable.
2. Data Integration
Data integration combines data from multiple sources, such as databases and
spreadsheets, into a single format. This step enables the creation of a unified dataset
for analysis.
3. Data Normalisation
Data normalisation scales the data to a common range of values, often between 0
and 1. This scaling facilitates data comparison and analysis by removing differences
in data magnitude (a short R sketch of normalisation and discretization follows this list).
4. Data Reduction
Data reduction reduces the dimensionality of the dataset by selecting a subset of
relevant features or attributes. This step simplifies the dataset while retaining essential
information.
5. Data Discretization
Data discretization converts continuous data into discrete categories or bins. It
simplifies the representation of data and can aid in identifying patterns.
6. Data Aggregation
Data aggregation involves combining data at different levels of granularity. This can
include summing or averaging data to create new features or attributes.
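The following is a minimal, hedged sketch of two of the steps above, min-max normalisation and discretization, in base R; the numeric values are made up purely for illustration.
# illustrative values
runs <- c(100, 200, 408, 19)
# min-max normalisation: rescale values to the 0-1 range
runs_norm <- (runs - min(runs)) / (max(runs) - min(runs))
# discretization: bin the continuous values into categories with cut()
runs_bins <- cut(runs, breaks = c(0, 100, 300, 500),
                 labels = c("low", "medium", "high"))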


Advantages of Data Transformation in Data Mining


Ɣ Improves Data Quality: Data transformation helps eliminate errors and
inconsistencies, enhancing data quality.
Ɣ Facilitates Data Integration: Integration of data from various sources improves
accuracy and completeness.
Ɣ Enhances Data Analysis: Data is prepared for analysis and modelling through
normalisation, dimensionality reduction, and discretization.
Ɣ Increases Data Security: Sensitive information can be masked or removed during
transformation, enhancing data security.
Ɣ Enhances Algorithm Performance: Improved data quality and dimensionality
reduction lead to better algorithm performance.

Disadvantages of Data Transformation in Data Mining


Ɣ Time-Consuming: Data transformation can be time-consuming, particularly with
large datasets.
Ɣ Complexity: It can be a complex process, requiring specialised skills.
Ɣ Data Loss: Some transformation methods may result in data loss, impacting
analysis.
Ɣ Bias: Improperly applied transformation can introduce bias into the data.
Ɣ High Cost: Data transformation can be costly in terms of hardware, software, and
personnel.
Overall, data transformation is a crucial step in data mining as it ensures data is suitable
for analysis and modelling, free from errors, and optimally prepared for extracting
valuable insights and knowledge.

Aggregating Data
Data aggregation involves the process of summarising and organising data in a
more concise and meaningful manner. It entails collecting data from various sources and
presenting it in a summarised format, which is pivotal for effective data analysis.
Accurate and high-quality data collected in substantial quantities is imperative
for generating meaningful insights. This data collection plays a pivotal role in making
informed decisions across various aspects, including financial strategies, product
development, pricing, operational decisions, and crafting marketing strategies.
Example:
A large e-commerce company in India wants to analyse its customer
data to gain insights into purchasing behaviour across different regions of the country.
The company has a massive dataset containing information about individual customer
transactions.
Here’s a simplified example of their data:
Customer ID
Region (e.g., North, South, East, West)
Product Category (e.g., Electronics, Clothing, Books)
Purchase Amount (in Indian Rupees)

Sample Data:


Customer 1, North, Electronics, `5,000


Customer 2, South, Clothing, `2,500
Customer 3, East, Books, `1,200
Customer 4, West, Electronics, `4,000
Customer 5, North, Books, `800
... (many more records)

Data Aggregation in this Scenario:


Total Sales by Region:
Aggregate the total sales for each region (North, South, East, West) by summing the
purchase amounts of all transactions in that region.

Example:
North Region Total Sales: `5,800
South Region Total Sales: `2,500
East Region Total Sales: `1,200
West Region Total Sales: `4,000

Total Sales by Product Category:


Aggregate the total sales for each product category (Electronics, Clothing, Books) by
summing the purchase amounts of all transactions in that category.

Example:
Electronics Category Total Sales: `9,000
Clothing Category Total Sales: `2,500
Books Category Total Sales: `2,000

Average Purchase Amount by Region:


Calculate the average purchase amount for each region by taking the mean of
purchase amounts in that region.

Example:
North Region Average Purchase Amount: `2,900 (Total Sales `5,800 / 2 Transactions
in North Region)
South Region Average Purchase Amount: `2,500 (Total Sales `2,500 / Number of
Transactions in South Region)
By aggregating the data in this way, the e-commerce company can gain insights into
which regions are the most profitable, which product categories are the most popular, and
what the average spending patterns are in different regions. This information can inform
marketing strategies, inventory management, and product recommendations to improve
the company’s overall performance in the Indian market.
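A hedged sketch of how such aggregations could be computed with dplyr is shown below. The sales data frame, its column names and its values are illustrative assumptions that mirror the five sample records above, not the company's actual schema.
# import dplyr package
library(dplyr)
# illustrative transaction data
sales <- data.frame(customer = 1:5,
                    region = c("North", "South", "East", "West", "North"),
                    category = c("Electronics", "Clothing", "Books", "Electronics", "Books"),
                    amount = c(5000, 2500, 1200, 4000, 800))
# total sales by region
sales %>% group_by(region) %>% summarise(total_sales = sum(amount))
# total sales by product category
sales %>% group_by(category) %>% summarise(total_sales = sum(amount))
# average purchase amount by region
sales %>% group_by(region) %>% summarise(avg_amount = mean(amount))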

3.3 Working with Dates and Times


The R programming language offers a variety of options for managing and
manipulating date and date/time data. The as.Date function, which is included as a


built-in feature, is designed to handle dates without including time information. On the
other hand, the chron library, which is a contributed library, is capable of handling both
dates and times. However, it does not provide functionality for managing time zones.
For more advanced control over dates and times, the POSIXct and POSIXlt classes can
be used. These classes offer support for both dates and times, while also providing the
ability to manage time zones. In R, it is recommended to employ the most straightforward
approach when working with date and time data. In the case of data that only includes
dates, the most suitable option is typically the as.Date function.
The chron library is a suitable option for managing dates and times that do not
include timezone information. When it comes to manipulating time zones, the POSIX
classes within the library are particularly valuable. Additionally, it is important to consider
the various “as.” functions that can be utilised to convert between different date types as
needed.
With the exception of the POSIXct class, dates are internally stored as the numerical
representation of days or seconds elapsed from a specific reference date. Dates in R
typically have a numeric mode and the class function can be utilised to determine their
actual storage method. The POSIXlt class is designed to store date/time values as a
collection of components, such as hour, minute, second, month and so on. This structure
facilitates the extraction of specific parts of the date/time information.
The Sys.Date function can be used to obtain the current date. It returns a Date
object that can be converted to a different class if required.
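As a small illustration of this internal storage, the numeric value underlying a Date object can be inspected directly; the date below is an assumed example.
d <- as.Date("2015-09-24")
class(d)
## [1] "Date"
# number of days elapsed since 1970-01-01
as.numeric(d)
## [1] 16702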

3.3.1 Handling Dates and Times in R


Real-world data is commonly linked to dates and time. However, accurately handling
dates can seem complex due to the diverse range of formats and the need to account for
time zone differences and leap years. The R programming language provides a variety of
functions that enable users to manipulate and analyse dates and times. In addition, the
utilisation of packages such as lubridate facilitates the manipulation and handling of dates
and times.

Getting Current Date and Time


To get current date and time information:
Sys.timezone ()
## [1] "America/New_York"
Sys.Date ()
## [1] “2015-09-24”
Sys.time ()
## [1] “2015-09-24 15:08:57 EDT”
If using the lubridate package:
library (lubridate)
now ()
## [1] “2015-09-24 15:08:57 EDT”

Converting Strings to Dates


The process of converting strings to dates involves transforming a string


representation of a date into a date object. This is commonly done in programming when
dealing with date-related operations. By converting strings to dates, various operations
can be performed.
When importing date and time data into R, it is common for them to be automatically
converted to character strings. The task at hand necessitates the conversion of strings
into dates. Multiple strings can be merged to form a date variable.

1. Convert Strings to Dates


To convert a string that is already in a date format (YYYY-MM-DD) into a date
object use as.Date() :
x <- c (“2015-07-01”, “2015-08-01”, “2015-09-01”)
as.Date (x)
## [1] “2015-07-01” “2015-08-01” “2015-09-01”
It is important to be aware that the default date format is YYYY-MM-DD. If the
string you are working with has a different format, you will need to include the format
argument. There exist various formats in which dates can be represented. To obtain a
comprehensive list of formatting code options in R, use ?strftime function.
y <- c (“07/01/2015”, “07/01/2015”, “07/01/2015”)
as.Date (y, format = “%m/%d/%Y”)
## [1] “2015-07-01” “2015-07-01” “2015-07-01”
If using the lubridate package:
library (lubridate)
ymd (x)
## [1] “2015-07-01 UTC” “2015-08-01 UTC” “2015-09-01 UTC”
mdy (y)
## [1] “2015-07-01 UTC” “2015-07-01 UTC” “2015-07-01 UTC”
One of the numerous advantages provided by the lubridate package is its automatic
recognition of common date separators, such as "-", "/", "." and even no separator at all.
The parsing function applied can be determined by specifying the order of the date
elements.

Order of elements in date- time Parse Function


year, month, day ymd()
year, day, month ydm()
month, day, year mdy()
day, month, year dmy()
hour, minute hm()
hour, minute, second hms()
year, month, day, hour, minute, second ymd_hms()
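A few hedged examples of these parsers are shown below; the date and time strings are made-up values for illustration only, and the printed output may vary slightly between lubridate versions.
library(lubridate)
dmy("01-07-2015")
## [1] "2015-07-01"
hms("10:15:30")
## [1] "10H 15M 30S"
ymd_hms("2015-07-01 10:15:30")
## [1] "2015-07-01 10:15:30 UTC"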

2. Create Dates by Merging Data


In certain cases, the date data may be gathered and stored in distinct elements. In


order to consolidate the individual data into a single date object, the ISOdate() function
should be utilised.
yr <- c (“2012”, “2013”, “2014”, “2015”)
mo <- c (“1”, “5”, “7”, “2”)
day <- c (“02”, “22”, “15”, “28”)

# ISOdate converts to a POSIXct object


ISOdate (year = yr, month = mo, day = day)
## [1] “2012-01-02 12:00:00 GMT” “2013-05-22 12:00:00 GMT”
## [3] “2014-07-15 12:00:00 GMT” “2015-02-28 12:00:00 GMT”

# truncate the unused time data by converting with as.Date


as.Date ( ISOdate (year = yr, month = mo, day = day))
## [1] “2012-01-02” “2013-05-22” “2014-07-15” “2015-02-28”
It should be noted that the ISOdate() function has additional arguments that allow
for the inclusion of data pertaining to hours, minutes, seconds and time zone. This can be
useful when there is a need to combine these individual components.
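As a hedged illustration, the same merge can also carry time-of-day and time zone information; the values below are assumptions for demonstration.
ISOdate(year = 2015, month = 7, day = 1,
        hour = 9, min = 30, sec = 0, tz = "UTC")
## [1] "2015-07-01 09:30:00 UTC"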

3.3.2 Date and Time Formatting


The R programming language offers a range of functions that are specifically
designed to handle date and time operations. The purpose of these functions is to
facilitate the formatting and conversion of dates between different formats. The R
programming language offers a format function that accepts date objects as input.
Additionally, it provides a format parameter that allows users to specify the desired format
for the date.

Specifier Description
%a Abbreviated weekday
%A Full weekday
%b Abbreviated month
%B Full month
%C Century
%y Year without century
%Y Year with century
%d Day of month (01-31)
%j Day in Year (001-366)
%m Month of year (01-12)
%D Date in %m/%d/%y format
%u Weekday (1-7), Starts on Monday

Key point: In order to obtain the current date in R, the Sys.Date() method can be
utilised. This method is designed to retrieve the present date.


1. Weekday
This section will explore the %a, %A and %u specifiers, which provide the
abbreviated weekday, full weekday and numbered weekday starting from Monday,
respectively.

# today date
date<-Sys.Date()

# abbreviated weekday
format(date, format = "%a")

# full weekday
format(date, format = "%A")

# weekday number (week starts on Monday)
format(date, format = "%u")

Output
[1] “Sat”
[1] “Saturday”
[1] “6”
[Execution complete with exit code 0]

2. Date:
Let’s look into the day, month and year format specifiers to represent dates in
different formats.

Example:
# today date
date<-Sys.Date()

# default format yyyy-mm-dd


date

# day in month
format(date,format=”%d”)

# month in year
format(date,format=”%m”)

# abbreviated month
format(date,format=”%b”)

# full month


format(date,format=”%B”)
# Date
format(date,format=”%D”)
format(date,format=”%d-%b-%y”)

Output
[1] “2022-04-02”
[1] “02”
[1] “04”
[1] “Apr”
[1] “April”
[1] “04/02/22”
[1] “02-Apr-22”
[Execution complete with exit code 0]

3. Year:
The year can be formatted in various ways. The format specifiers %y, %Y and %C
are used to retrieve different representations of the year from a given date. The %y
specifier returns the year without the century, the %Y specifier returns the year with the
century and the %C specifier returns only the century of the date.
# today date
date<-Sys.Date()

# year without century


format(date,format=”%y”)

# year with century


format(date,format=”%Y”)

# century
format(date,format=”%C”)

Output
[1] “22”
[1] “2022”
[1] “20”
[Execution complete with exit code 0]

3.3.3 Date and Time Arithmetic


R stores date and time objects as numeric values, enabling users to perform a range
of calculations including logical comparisons, addition, subtraction and manipulation of
durations.


x <- Sys.Date ()
x
## [1] “2015-09-26”
y <- as.Date (“2015-09-11”)
x>y
## [1] TRUE
x-y
## Time difference of 15 days
One of the advantages of utilising the date/time classes is their ability to accurately
handle leap years, leap seconds, daylight savings and time zones. To obtain a
comprehensive list of acceptable time zone specifications, the OlsonNames() function
can be utilised.
# last leap year
x <- as.Date (“2012-03-1”)
y <- as.Date (“2012-02-28”)
x-y
## Time difference of 2 days

# example with time zones


x <- as.POSIXct (“2015-09-22 01:00:00”, tz = “US/Eastern”)
y <- as.POSIXct (“2015-09-22 01:00:00”, tz = “US/Pacific”)

y == x
## [1] FALSE

y-x
## Time difference of 3 hours
The lubridate package also provides the same functionality, with the only distinction
being the use of different accessor function(s).
library (lubridate)

x <- now ()
x
## [1] “2015-09-26 10:08:18 EDT”

y <- ymd (“2015-09-11”)

x>y
## [1] TRUE
x-y


## Time difference of 15.5891 days


y + days (4)
## [1] “2015-09-15 UTC”

x - hours (4)
## [1] “2015-09-26 06:08:18 EDT”
Time spans can be managed using the duration functions available in the lubridate
package.
Durations are a means of quantifying the length of time between specified start
and end dates. Utilising the base R date functions for duration calculations can be a
cumbersome and error-prone process. The lubridate package offers a straightforward
syntax for calculating durations using various units of measurement, such as seconds,
minutes, hours and so on.
# create new duration (represented in seconds)
new_duration(60)
## [1] “60s”

# create durations for minutes, hours, years


dminutes (1)
## [1] “60s”

dhours (1)
## [1] “3600 s (~1 hours)”

dyears (1)
## [1] “31536000 s (~365 days)”

# add/subtract durations from date/time object


x <- ymd_hms("2015-09-22 12:00:00")

x + dhours (10)
## [1] “2015-09-22 22:00:00 UTC”

x + dhours (10) + dminutes (33) + dseconds (54)


## [1] “2015-09-22 22:33:54 UTC”

3.3.4 Time Series Analysis


Time Series Analysis in R involves analysing the behaviour of a particular object
or phenomenon over a defined time period. To conduct Time Series Analysis in R, you
can utilise the ts() function, which requires specific parameters. The ts() function takes a
data vector, where each data point corresponds to a user-provided timestamp value. Its
primary purpose is to aid in gaining insights and predicting the performance of an asset
within a specified timeframe, particularly in the context of business operations. Various

analyses can be performed, such as sales analysis for a company, inventory analysis,
price analysis for a specific stock or market and population analysis.
The syntax for using the ts() function is as follows:
objectName <- ts(data, start, end, frequency)

Where:
™ data: Represents the data vector containing the time series observations.
™ start: Denotes the timestamp of the first observation in the time series.
™ end: Denotes the timestamp of the last observation in the time series.
™ frequency: Specifies the number of observations per unit time. For example,
frequency = 12 for monthly data (12 observations per year).
By using the ts() function in R, you can conduct comprehensive Time Series Analysis
to gain valuable insights and make informed predictions about the behaviour of your data
over time.
Note: To know about more optional parameters, use the following command in the R
console: help(“ts”)
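Before the larger example below, here is a minimal, hedged sketch of the ts() call using twelve made-up monthly observations starting in January 2020.
# illustrative monthly values
sales <- c(5, 7, 9, 6, 8, 12, 15, 14, 13, 11, 9, 8)
# monthly series starting January 2020
sales_ts <- ts(sales, start = c(2020, 1), frequency = 12)
sales_ts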

Examples of Time Series Analysis


In this illustration, the case of the COVID-19 pandemic will be examined. The dataset
used in this analysis comprises the cumulative count of confirmed COVID-19 cases on
a weekly basis, spanning from January 22, 2020, to April 15, 2020. The data has been
sourced from the Our World in Data repository.
# Weekly data of COVID-19 positive cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843, 471497,
936851, 1508725, 2072113)
# library required for decimal_date() function
library(lubridate)

# output to be created as png file


png(file =”timeSeries.png”)

# creating time series object


# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)

# plotting the graph


plot(mts, xlab =”Weekly Data”,
ylab =”Total Positive Cases”,


main = "COVID-19 Pandemic",
col.main = "darkgreen")

# saving the file


dev.off()

Output

Fig: Time Series Data visualisation chart


(Image Source: https://www.geeksforgeeks.org/time-series-analysis-in-r/)

Multivariate Time Series Analysis


The concept of Multivariate Time Series involves the creation of multiple time series
within a single chart. The data vector contains weekly records of total positive COVID-19
cases and total deaths, spanning from January 22, 2020, to April 15, 2020.
# Weekly data of COVID-19 positive cases and
# weekly deaths from 22 January, 2020 to
# 15 April, 2020
positiveCases <- c(580, 7813, 28266, 59287,
75700, 87820, 95314, 126214,
218843, 471497, 936851,
1508725, 2072113)


deaths <- c(17, 270, 565, 1261, 2126, 2800,
            3285, 4628, 8951, 21283, 47210,
            88480, 138475)
# library required for decimal_date() function
library(lubridate)
# output to be created as png file
png(file=”multivariateTimeSeries.png”)

# creating multivariate time series object


# from date 22 January, 2020
mts <- ts(cbind(positiveCases, deaths),
start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)
# plotting the graph
plot(mts, xlab =”Weekly Data”,
main =”COVID-19 Cases”,
col.main =”darkgreen”)
# saving the file
dev.off()

Output

Fig: Multivariate Time Series Analysis using R


(Image Source: https://www.geeksforgeeks.org/time-series-analysis-in-r/)


Time series forecasting


Time series forecasting can be accomplished by utilising various models available
in the R programming language. In this particular instance, the Arima automated model
is employed. To obtain additional information regarding the parameters of the arima()
function, execute the following command.
help(“arima”)
The code below utilises the forecast library for conducting forecasting. Therefore, it is
essential to install the forecast library.
# Weekly data of COVID-19 cases from
# 22 January, 2020 to 15 April, 2020
x <- c(580, 7813, 28266, 59287, 75700,
87820, 95314, 126214, 218843,
471497, 936851, 1508725, 2072113)

# library required for decimal_date() function
library(lubridate)

# library required for forecasting


library(forecast)

# output to be created as png file


png(file =”forecastTimeSeries.png”)

# creating time series object


# from date 22 January, 2020
mts <- ts(x, start = decimal_date(ymd("2020-01-22")),
frequency = 365.25 / 7)

# forecasting model using arima model


fit <- auto.arima(mts)

# Next 5 forecasted values


forecast(fit, 5)

# plotting the graph with next


# 5 weekly forecasted values
plot(forecast(fit, 5), xlab =”Weekly Data”,
ylab =”Total Positive Cases”,
main =”COVID-19 Pandemic”, col.main =”darkgreen”)

# saving the file


dev.off()

Output:
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
2020.307 2547989 2491957 2604020 2462296 2633682
2020.326 2915130 2721277 3108983 2618657 3211603
2020.345 3202354 2783402 3621307 2561622 3843087
2020.364 3462692 2748533 4176851 2370480 4554904
2020.383 3745054 2692884 4797225 2135898 5354210

The graph presented below illustrates the projected values of COVID-19 if its
widespread transmission persists over the course of the next five weeks.

Fig:Time Series Forecasting using R


(Image Source: https://www.geeksforgeeks.org/time-series-analysis-in-r/)

3.4 Text Mining and Analysis


Text mining originated in the computational and information management domains,
specifically in areas such as database searching and information retrieval. On the other
hand, text analysis had its roots in the humanities, where it initially involved manual
examination of text, including activities like creating Bible concordances and newspaper
indexes. In contemporary usage, the terms “search,” “retrieve,” and “analyse” are
commonly associated with the application of computational methods to text data. As a
result, these terms are now often used interchangeably.
Text mining, also known as text analytics, encompasses a variety of techniques
aimed at extracting valuable information from collections of documents. This is achieved
by identifying and investigating noteworthy patterns within unstructured textual data found
in different types of documents, including books, web pages, emails, reports and product
descriptions.

3.4.1 Introduction to Text Mining


Describing Text Mining:
Text mining is a specialised aspect of data mining that focuses on the analysis
and extraction of information from unstructured text data. The process entails utilising
techniques in natural language processing (NLP) to extract valuable information and
insights from extensive quantities of unstructured text data. Text mining serves as a
preprocessing step for data mining or can function independently to accomplish specific
tasks.
Text mining is a technique that enables the conversion of unstructured text data
into structured data. This structured data can then be effectively utilised for various data
mining tasks, including classification, clustering and association rule mining. This feature
enables organisations to extract valuable insights from diverse data sources, including
customer feedback, social media posts and news articles.

Applications of Text Mining:

1. Risk Management:
Text mining plays a crucial role in risk management by systematically analysing,
recognizing, treating and monitoring risks in various processes within organisations.
Implementing Risk Administration Software based on text mining technology enhances
the ability to identify and mitigate risks effectively. It allows the management of vast
amounts of text data from multiple sources and enables the extraction of relevant
information to make informed decisions.

2. Customer Care Service:


Text mining, especially Natural Language Processing (NLP) techniques, is
increasingly valuable in the customer care sector. Companies utilise text analytics
software to improve overall customer experience by analysing textual information from
diverse sources such as surveys, user feedback and customer calls. Text analysis helps
reduce response times and enables rapid and efficient resolution of user grievances.

3. Social Media Analysis:


Specialised text mining tools are employed to analyse social media platforms’ data
and interactions. These tools track and analyse text from news, blogs, emails and other
sources. By efficiently analysing posts, likes and followers, businesses gain insights into
how people interact with their brand and online content.

4. Business Intelligence:
Text mining has become a fundamental component of business intelligence for
companies and businesses. It provides significant insights into user behaviour and
trends, allowing organisations to understand their strengths and weaknesses compared
to competitors. By leveraging text mining methods, companies gain a competitive
advantage in the industry.

Process of Text Mining


The conventional process of text mining involves the following key steps:
1. Data Collection:

™ Gather relevant textual data from various sources in different document formats
(e.g., plain text, web pages, PDF files).
™ Execute pre-processing and data cleansing tasks to identify and remove
inconsistencies.
™ Perform data cleansing to capture authentic text and remove stop words
through stemming (identifying word roots and indexing data).
™ Process and control the data set to facilitate review and further cleansing.
2. Implementation of Pattern Analysis:
™ Utilise pattern analysis as a critical component within the Management
Information System.
3. Extracting Relevant Information:
™ Leverage the data obtained from the previous steps to extract relevant and
significant information.
4. Enable effective decision-making processes and trend analysis.

Fig: process of text mining

3.4.2 Text Preprocessing and Cleaning


The term “pre-processing” encompasses various activities that are performed to
prepare documents for analysis. The selection of pre-processing techniques can vary
based on the nature of the documents, the type of text and the desired analyses. It is
possible to employ a limited number of pre-processing techniques or opt for a diverse
range, depending on these factors.
The initial preprocessing step in every Text Document Mining (TDM) project involves
identifying the necessary cleaning procedures required to facilitate the subsequent
analysis. Cleaning involves a series of procedures aimed at standardising text and
eliminating irrelevant text and characters. Upon completion of these procedures, you will
obtain a well-organised text dataset that is prepared for analysis.
Certain text data mining (TDM) methods necessitate the inclusion of
additional context in your corpus prior to conducting analysis. Pre-processing techniques,
such as parts of speech tagging and named entity recognition, facilitate the analysis
process by categorising and assigning semantic significance to various elements within
the text.
The interface of the TDM tool you are using may incorporate cleaning and pre-
processing methods. There are various tools available that incorporate cleaning and pre-
processing methods within their functionalities. Some examples of such tools are:
Voyant Tools and Gale Digital Scholar Lab are two software platforms that offer
various features and functionalities for users in the field of digital scholarship. These
tools provide researchers with a range of capabilities to analyse and explore textual data.
Voyant Tools enables users to conduct text analysis, visualisation and exploration, while Gale Digital Scholar Lab offers comparable cleaning and analysis workflows for its digitised collections.

In some cases, it may be necessary to engage in programming activities to


adequately preprocess your corpus in preparation for your analyses.

Cleaning and other pre-processing techniques


™ Tokenization
™ Converting your text to lowercase
™ Word replacement
™ Punctuation and non-alphanumeric character removal
™ Stopwords
™ Parts of speech tagging
™ Named entity recognition
™ Stemming and lemmatization

Tokenization
Several TDM (Text Data Mining) methods rely on the utilisation of word or short
phrase counting techniques. However, from a computer’s perspective, the texts in your
corpus are merely sequences of characters and it lacks the understanding of words or
phrases. To enable the computer to count and perform calculations, it is necessary to
provide instructions on how to segment the text into meaningful units. The individual units
within the text are referred to as tokens and the act of dividing the text into these units is
known as tokenization.
Tokenizing text is a common practice where the text is divided into individual words
or tokens. However, there are other types of tokenization that can also be beneficial.
If one is interested in examining particular phrases, such as ‘artificial intelligence’ or
‘White Australia Policy’, or investigating the co-occurrence patterns of certain words, it
is advisable to segment the text into two or three-word units. To perform an analysis of
sentence structures and features, it is recommended to begin by tokenizing the text into
discrete sentences. Frequently, it is necessary to tokenize text using various methods in
order to facilitate different types of analyses.
In languages where words are not separated in writing, such as Chinese, Thai, or
Vietnamese, tokenization necessitates careful consideration to determine how the text
should be divided in order to facilitate the intended analysis.

Converting your text to Lowercase


Computers frequently distinguish between capitalised and lowercase versions of
words, leading to potential issues during analysis. One potential solution to this problem
is to convert all text to lowercase.

Word Replacement
The presence of spelling variations can pose challenges in text analysis, as the
computer interprets distinct spellings as separate words instead of recognising their
shared reference. To resolve this issue, it is recommended to select a singular spelling
and substitute any alternative variations with that particular version throughout the text.
In order to process a large corpus, the initial step involves tokenizing the words and
subsequently standardising the spelling. Another option is to utilise tools like VARD to
automate the task.


Removing Punctuation and non-alphanumeric Characters


The presence of punctuation or special characters within your data can introduce
unnecessary visual noise and complicate the process of analysing the text. The
occurrence of errors in Optical Character Recognition (OCR) can lead to the inadvertent
inclusion of atypical non-alphanumeric characters in your text. A straightforward method
for eliminating unnecessary clutter in your text involves identifying non-alphanumeric
characters and subsequently removing them.

Stopwords
In the realm of textual analysis, it is worth noting that certain words, such as ‘the’,
‘is’, ‘that’, ‘a’ and others, are frequently employed but do not contribute significantly to
the understanding or interpretation of the content within your documents. Consequently,
these words tend to exert a disproportionate influence on any analysis conducted.
The words that need to be filtered out prior to text analysis are commonly referred to
as “stopwords”. There is a wide range of available stopword lists that can be utilised for
the purpose of eliminating commonly used words across multiple languages. To exclude
certain words from your analysis that are frequently found in your documents but are not
relevant, you have the option to personalise the existing stopwords lists by incorporating
your own words into them.
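The cleaning steps described so far can be sketched in a few lines of base R; the sample sentence and the tiny stopword list below are assumptions used only for illustration.
text <- "Text Mining, in practice, is mostly cleaning the text!"
# lowercase and remove punctuation / non-alphanumeric characters
clean <- tolower(gsub("[^[:alnum:] ]", "", text))
# tokenize into individual words
tokens <- unlist(strsplit(clean, "\\s+"))
# remove a small illustrative stopword list
stopwords <- c("the", "is", "in", "a", "of")
tokens[!tokens %in% stopwords]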

Parts of Speech Tagging


The process of parts of speech tagging is employed to impart contextual information
to textual data. Text is commonly known as “unstructured data,” which refers to data
that lacks a specific structure or pattern. In the context of a computer system, text can
be defined as an uninterrupted sequence of characters devoid of inherent meaning or
interpretation. However, it is possible to perform analyses that examine the contextual
usage of words or tokens in order to categorise them in specific manners.
In the process of parts of speech tagging, each word in the given text is assigned to
a specific word class, including but not limited to nouns, verbs, adjectives, prepositions
and determiners. The inclusion of additional information with words allows for more
advanced processing and analysis, such as lemmatization, sentiment analysis, or any
analysis that requires a closer examination of a particular group of words.
Parts of speech taggers are software applications that utilise machine learning
techniques to classify words based on their grammatical roles. These taggers are trained
on text data that has been manually annotated by human experts. Tagging software
operates using various methods, although typically, a tagger relies on probabilities, such
as the previous tagging history of a word.
A tagger has the capability to analyse the typical ordering patterns of tags. In the
case of the phrase ‘this erudite scholar’, if the tagger is not familiar with the term ‘erudite’,
it can analyse the surrounding words to deduce that an adjective is probable when a
word preceded by a determiner and followed by a noun is encountered.

Named Entity Recognition (NER)


It is a natural language processing (NLP) technique used to identify and classify
named entities in text. Named entities refer to specific types of words or phrases, such as the names of people, places and organisations.
This method, similar to parts of speech tagging, is employed to impart context and
structure to textual data.


Named Entity Recognition (NER) is a computational process that involves the


analysis of text to identify specific entities that would be recognisable to a human. The
entities are subsequently categorised into various classifications, including person,
location, organisation, nationality, time, date and so on.
Certain named entity recognizers, like SpaCy, possess a collection of pre-
established categories that they have undergone training to accurately detect. Stanford
NER, among other tools, provides the capability to define custom categories. When you
define your own categories, it is necessary to train the recognizer in order to accurately
identify the entities that are of interest to you. In order to accomplish this task, it will be
necessary to engage in the manual classification of numerous documents, which can be
a time-consuming and labour-intensive endeavour.
The utilisation of entity tagging within your text enables you to pose inquiries about
the texts in a more seamless manner. In the context of extracting information about
individuals mentioned in a given text, it is important to note that computer systems lack
the inherent ability to identify and categorise people without the utilisation of Named
Entity Recognition (NER) techniques. Prior to performing NER, the computer system
lacks the knowledge required to discern individuals from other entities within the text.
Once the entities in the text have been classified, the computer can easily generate
a list of all the entities tagged as “Person”. Named entity recognition (NER) offers a
wide range of potential inquiries to delve into. These include determining the individuals
mentioned in the same documents, identifying significant locations within a text and
examining whether the entities mentioned in the text vary across different time periods.

Stemming and Lemmatization


In certain instances, it can be advantageous for your analysis to identify words with
the same root as being identical. In the context of computational analysis, it is common
for words such as ‘swim’, ‘swims’, ‘swimming’, ‘swam’ and ‘swum’ to be considered
distinct entities. However, there may be instances where it is desirable to recognise all of
these variations as different forms of the base word ‘swim’.
Stemming and lemmatization are distinct techniques employed to reduce words
to their fundamental root form, enabling them to be categorised accordingly. When
embarking on text mining, beginners typically do not require the use of advanced
standardisation methods. However, it is beneficial to have knowledge of these methods.

Stemming
Stemming involves the application of a predefined set of rules to determine the
suffixes of words that can be removed, thereby isolating the fundamental root of the
word. The resulting “stem” may or may not correspond to a valid word. There are various
stemming algorithms available, including the Snowball and Lancaster stemmers. The
execution of these rules will yield varying outcomes. Therefore, it is recommended to
examine the rules they employ and test them on your data in order to determine which
one best aligns with your requirements. The implementation of a stemming algorithm
necessitates the execution of programming tasks.
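A hedged sketch of stemming in R is given below, assuming the SnowballC package (an R wrapper around the Snowball stemmers) is installed; note that irregular forms such as "swam" and "swum" keep their own stems, which illustrates that a stem is not always the dictionary root.
library(SnowballC)
words <- c("swim", "swims", "swimming", "swam", "swum")
wordStem(words, language = "english")
## e.g. "swim" "swim" "swim" "swam" "swum"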

Lemmatization
Lemmatization is a computational process that involves the analysis of words in
order to derive their corresponding dictionary root form. In order to perform this analysis,
it is necessary for the lemmatizer to have an understanding of the context in which each


word is utilised. Therefore, prior to lemmatising your text, it is imperative to pre-process it


by applying parts of speech tagging. There are multiple lemmatizers available for use, but
programming is necessary for all of them.

3.4.3 Text mining techniques


The text mining process encompasses a series of activities that facilitate the
extraction of information from unstructured text data. In order to utilise various text mining
techniques, it is necessary to begin with text preprocessing. This involves the process
of cleansing and converting text data into a format that can be effectively utilised. The
practice described is a fundamental component of natural language processing (NLP). It
typically encompasses various techniques, including language identification, tokenization,
part-of-speech tagging, chunking and syntax parsing. These techniques are employed to
properly structure data for analysis purposes. Once the process of text preprocessing has
been successfully executed, it becomes possible to utilise text mining algorithms in order
to extract valuable insights from the data. There are several commonly used text mining
techniques, which are as follows:

Information Retrieval
Information retrieval (IR) is a process that retrieves relevant information or
documents by using a predetermined set of queries or phrases. Information retrieval (IR)
systems employ algorithms to monitor user behaviours and discern pertinent data. The
utilisation of information retrieval is prevalent in library catalogue systems and widely
used search engines such as Google. There are several common sub-tasks associated
with Information Retrieval (IR), which are frequently encountered in this field. These sub-
tasks include:
Ɣ Tokenization is a fundamental process in natural language processing that involves
the segmentation of lengthy text into individual sentences and words, referred to
as “tokens”. These components are utilised in various models, such as the bag-of-
words model, to perform tasks like text clustering and document matching.
Ɣ Stemming is the procedure of extracting the root word form and meaning by
separating the prefixes and suffixes from words. This technique enhances the
process of information retrieval by decreasing the size of indexing files.

Natural Language Processing (NLP)


Natural Language Processing (NLP) is an interdisciplinary field that combines
knowledge from computational linguistics, computer science, artificial intelligence
and data science to enable computers to understand human language in written and
spoken forms. NLP sub-tasks involve analysing sentence structures and grammar to

comprehend written text. These subtasks play crucial roles in various projects and tasks
and contribute to their successful completion.
1. Summarization:
Summarization is a technique used to condense lengthy textual content into concise
and well-organised summaries that capture the main ideas of a document. It helps in
extracting essential information efficiently.
2. Part-of-Speech (PoS) Tagging:
PoS tagging involves assigning specific tags to each token in a document to indicate
its part of speech, such as nouns, verbs, adjectives, etc. This enables semantic
analysis of unstructured text and aids in understanding the linguistic elements.
3. Text Categorization (Text Classification):
Text categorization, also known as text classification, categorises text documents
based on predefined topics or categories. It proves valuable in organising synonyms
and abbreviations and facilitates efficient information retrieval.
4. Sentiment Analysis:
Sentiment analysis identifies and classifies positive or negative sentiments expressed
in internal and external data sources. It tracks customer attitudes and analyses
sentiment changes over time. Businesses use sentiment analysis to understand
brand, product and service perceptions, leading to better customer engagement and
improved user experiences.
These NLP sub-tasks are integral to unlocking the potential of unstructured text data,
enabling businesses to gain valuable insights and make data-driven decisions.

Information Extraction (IE)


Information extraction is a process that involves identifying and retrieving relevant
data from various documents during a search operation. The main goal is to extract
structured information from unstructured text and store it, including entities, attributes and
relationship data, in a database. Common sub-tasks in information extraction include:
1. Feature Selection:
Feature selection, also known as attribute selection, is the systematic process of
identifying and selecting the most significant features or dimensions that have the
greatest impact on the output of a predictive analytics model.
2. Feature Extraction:
Feature extraction involves choosing a subset of features to enhance the accuracy of
a classification task. Dimensionality reduction is particularly crucial in this context.
3. Named-Entity Recognition (NER):
Named-Entity Recognition (NER), also known as entity identification or extraction, is a
computational process that aims to identify and classify distinct entities within a given
text, such as names or locations. For example, the NER system can successfully
recognize “California” as a geographical location and “Mary” as a female personal
name.
These information extraction subtasks play a vital role in converting unstructured text
data into structured and meaningful information, facilitating effective data analysis and
decision-making processes.


Data Mining
It is a process that involves extracting useful patterns and information from large
datasets. It is a technique used in various fields, such as business, healthcare and scientific research.
Data mining is a systematic procedure that involves the identification of patterns
and the extraction of valuable insights from large sets of data. This practice involves the
evaluation of both structured and unstructured data in order to identify novel information.
It is commonly employed for the analysis of consumer behaviours within the domains
of marketing and sales. Text mining is a sub-field of data mining that specifically deals
with the organisation and analysis of unstructured data in order to produce new and
valuable findings. The aforementioned techniques are classified as data mining methods,
specifically falling within the domain of textual data analysis.

3.4.4 Sentiment Analysis and Text Classification


Sentiment Analysis
Sentiment analysis, also referred to as the process of discerning positive or
negative sentiment in text, is a widely recognised technique. Businesses commonly
utilise information technology (IT) to analyse data from social media platforms in order to
assess sentiment, evaluate brand reputation and understand their customer base.

Types of Sentiment Analysis


Sentiment analysis focuses primarily on determining the polarity of a given text,
categorising it as positive, negative, or neutral. However, its scope extends beyond
polarity to encompass the identification of various moods and emotions, such as anger,
pleasure and sadness. Additionally, sentiment analysis can also assess the urgency of a
text, distinguishing between urgent and non-urgent messages. Furthermore, it can even
discern the intentions of the author, differentiating between those who are interested and
those who are uninterested.
The categories can be constructed and customised to align with specific sentiment
analysis requirements, allowing for the interpretation of consumer comments and
inquiries. The following are some of the most popular types of sentiment analysis
currently available:
1. Sentiment Analysis by Grade
If your company places a premium on polarity precision, you can think about
expanding your polarity categories to cover various intensities of positive and negative:
™ Very positive
™ Positive
™ Neutral
™ Negative
™ Very negative
This is typically known as graded or fine-grained sentiment analysis and might be
used, for instance, to evaluate 5-star reviews:

Very Positive = 5 stars
Very Negative = 1 star


2. Detection of emotions
Emotion detection sentiment analysis goes beyond simple polarity by identifying
specific emotions like happiness, frustration, anger and sadness in text. To achieve this,
many emotion recognition systems employ advanced machine learning algorithms or
lexicons, which are lists of words and their associated emotions.
However, using lexicons has a limitation because people express emotions in
various ways. For example, words like “bad” and “kill,” which can signify rage (e.g., “this
is bad ass” or “your customer service is killing me”), can also be used to convey joy,
creating potential ambiguities in emotion detection.
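A very rough sketch of the lexicon idea in base R follows; the word lists are tiny illustrative assumptions rather than a real sentiment or emotion lexicon.
positive <- c("good", "great", "love", "happy")
negative <- c("bad", "kill", "terrible", "angry")
score_sentiment <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "\\s+"))
  sum(tokens %in% positive) - sum(tokens %in% negative)
}
score_sentiment("this is a great product")
## [1] 1
# note: the inflected form "killing" does not match "kill",
# one limitation of plain word lists
score_sentiment("your customer service is killing me")
## [1] 0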
3. Analysis of Sentiment Based on Aspects
Aspect-based sentiment analysis is a valuable approach when you aim to identify the
specific qualities or traits mentioned in texts with positive, neutral, or negative sentiments.
For example, if a product review includes the statement “The battery life of this camera is
too short,” an aspect-based classifier can determine that the author expressed a negative
sentiment about the device’s battery life.
4. Multilingual sentiment analysis
Performing sentiment analysis across different languages presents a considerable
challenge. The preparation process requires a significant amount of time and effort. Most
of the tools mentioned, such as sentiment lexicons, can be accessed online. However,
there are others, like translated corpora or noise detection algorithms, that require coding
skills to be effectively utilised.
One potential approach is to employ a language classifier to automatically detect the
language present in texts. Subsequently, you can proceed to train a distinct sentiment
analysis model that categorises texts based on the language of your preference.

Text Classification
Text classification refers to the systematic procedure of assigning tags or categories
to texts, primarily determined by the content they contain.
Automated text classification enables the efficient tagging of extensive text datasets,
yielding favourable outcomes within a significantly reduced timeframe, eliminating the
need for manual labour. The technology described possesses promising potential for
various domains.
1. Rule-based System
Text classification systems of this kind rely on linguistic rules. The term “rules” refers
to associations that have been manually created by humans, linking a particular linguistic
pattern with a corresponding tag. Once the algorithm has been implemented with the
specified rules, it is capable of automatically identifying and assigning appropriate tags to
various linguistic structures.
Rules typically encompass references to syntactic, morphological and lexical
patterns. Additionally, these factors can also pertain to semantic or phonological aspects.
An illustrative instance of a rule for categorising product descriptions according to the
colour of a product is as follows:
A rule of this kind might map the pattern (Black | Grey | White | Blue) to the tag “COLOUR”.
In this scenario, the system will automatically assign the tag “COLOUR” whenever it
detects any of the words mentioned earlier.
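
A small sketch of such a rule in base R follows, expressed as a regular expression
applied to lower-cased product descriptions; the descriptions themselves are invented for
the example:

descriptions <- c("Wireless mouse, black, USB-C",
                  "Ceramic mug with floral print",
                  "Blue cotton t-shirt, size M")

# Rule: if any of the listed colour words appears, assign the tag "COLOUR"
colour_pattern <- "\\b(black|grey|white|blue)\\b"
tags <- ifelse(grepl(colour_pattern, tolower(descriptions)), "COLOUR", "")

data.frame(descriptions, tags)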

Rule-based systems are characterised by their ease of comprehension, as they
are created and enhanced by human developers. The process of incorporating new
rules into an algorithm typically necessitates extensive testing to evaluate their potential
impact on the predictions generated by existing rules. This can pose challenges in terms
of scalability for the system. In addition, the creation of intricate systems necessitates a
comprehensive understanding of linguistics and the specific dataset that is intended for
analysis.
2. Machine learning-based systems
They are computer programmes that utilise algorithms and statistical models to
automatically learn and improve from experience without being explicitly programmed.
These systems are designed to analyse and interpret large amounts of data.
Machine learning-based text classification systems have the capability to acquire
knowledge from historical data, specifically through the use of examples. In order to
accomplish this, the model must be trained on pertinent examples of text, commonly
referred to as training data, which have been accurately labelled.
In order for the model to generate accurate predictions, it is crucial that the training
samples are both consistent and representative. How, then, does a text classifier work in
practice?
Machines are required to convert the training data into a format that is
comprehensible to them. In this particular scenario, vectors (which are encoded
data in the form of numerical values) are utilised for this purpose. Vectors are utilised
to represent distinct characteristics or attributes of the given data. The bag of words
approach is a widely used technique for vectorization. It involves tallying the frequency of
occurrence of each word, selected from a predetermined set of words, within the text that
is being analysed.
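
A minimal bag-of-words sketch in base R follows. The two example texts and the fixed
vocabulary are assumptions; in practice the vocabulary is derived from the training corpus
itself:

texts <- c("great product, great price",
           "terrible product, never again")

# Predetermined set of words to count (assumed for the example)
vocab <- c("great", "terrible", "product", "price")

bag_of_words <- function(text, vocab) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  # Tally how often each vocabulary word occurs in this text
  sapply(vocab, function(v) sum(words == v))
}

# One row per text, one column (feature) per vocabulary word
t(sapply(texts, bag_of_words, vocab = vocab))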
3. Hybrid systems
Hybrid systems refer to a combination of different technologies or methodologies that
work together to achieve a common goal. In text classification, hybrid systems integrate
rule-based systems with machine learning-based systems. The two components work in
tandem to enhance the precision of the outcomes.
4. Evaluation
The performance evaluation of a text classifier is conducted using various
parameters, including accuracy, precision, recall and F1 score. Gaining a comprehensive
understanding of these metrics will enable you to assess the effectiveness of your
classifier model in text analysis.
The classifier can be evaluated by either using a fixed testing set, which consists of
data with known expected tags, or by employing cross-validation. The process involves
partitioning the training data into two distinct subsets. One subset is utilised for training
purposes, while the other subset is reserved for testing purposes.
Ɣ Accuracy is a metric that quantifies the classifier’s ability to make correct
predictions. It is calculated by dividing the number of correct predictions by the
total number of predictions made. Nevertheless, relying solely on accuracy may
not always be the most optimal approach for assessing the effectiveness of a
classifier. In instances where there is an imbalance in categories, specifically when
one category has a significantly larger number of examples compared to others, it is

possible to encounter an accuracy paradox. This paradox arises due to the model’s
tendency to make accurate predictions, primarily because a majority of the data
belongs to a single category. In such instances, it is advisable to take into account
alternative metrics such as precision and recall.
Ɣ Precision is a metric used to assess the accuracy of a classifier. It measures the
proportion of correct predictions made by the classifier for a specific tag, taking into
account both correct and incorrect predictions. The presence of a high precision
metric suggests a lower occurrence of false positives. It is crucial to take into
consideration that precision solely quantifies the instances in which the classifier
accurately predicts that a given text belongs to a particular tag. Certain tasks, such
as automated email responses, necessitate models that exhibit a notable degree of
precision. This precision is crucial in ensuring that a response is delivered to a user
only when there is a high probability of the prediction being accurate.
Ɣ The term “recall” refers to the ratio of correctly predicted texts to the total number of
texts that should have been assigned a specific tag. A high recall metric indicates a
lower occurrence of false negatives. This metric is especially valuable in situations
where it is necessary to direct support tickets to the appropriate teams. The
objective is to optimise the automatic routing of tickets associated with a specific
tag, such as Billing Issues, by prioritising the maximum number of tickets routed,
even if it results in occasional incorrect predictions.
Ɣ The F1 score is a metric that integrates precision and recall to provide an
assessment of the performance of your classifier. The aforementioned metric
serves as a superior indicator compared to accuracy when assessing the quality of
predictions across all categories within your model.
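
The sketch below computes these four metrics in base R for a single tag, given invented
vectors of true and predicted labels; the figures are purely illustrative:

truth <- c("Billing", "Billing", "Other", "Billing", "Other", "Other")
pred  <- c("Billing", "Other",   "Other", "Billing", "Billing", "Other")

tag <- "Billing"
tp  <- sum(pred == tag & truth == tag)   # true positives for the tag
fp  <- sum(pred == tag & truth != tag)   # false positives
fn  <- sum(pred != tag & truth == tag)   # false negatives

accuracy  <- mean(pred == truth)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1)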
5. Cross-validation
Cross-validation is a commonly employed technique for evaluating the effectiveness
of a text classifier. The process involves partitioning the training data into distinct subsets
using a random approach. An illustration of this concept involves dividing the training data
into four distinct subsets, with each subset comprising 25% of the original dataset.
Next, all subsets, with the exception of one, are utilised for training a text classifier.
The purpose of this text classifier is to generate predictions for the remaining subset of
data, specifically for testing purposes. Following this step, the performance metrics are
computed by comparing the predicted values with the predefined tags. This process is
then repeated until all subsets of data have been utilised for testing.
The final step involves compiling the results from all subsets of data in order to
calculate the average performance for each metric.
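
A rough sketch of four-fold cross-validation in base R is given below. The labelled data
and the stand-in “classifier” (which simply predicts the majority tag of the training folds)
are assumptions made so that the example runs end to end:

set.seed(42)

# Toy labelled data: 100 examples, each with a true tag (invented for illustration)
labels <- sample(c("Billing", "Other"), 100, replace = TRUE)

# Randomly assign every example to one of four folds of roughly equal size
folds <- sample(rep(1:4, length.out = length(labels)))

accuracy_per_fold <- sapply(1:4, function(k) {
  test_idx <- which(folds == k)          # held-out subset for this round
  # Stand-in classifier: predict the majority tag seen in the training folds
  majority <- names(which.max(table(labels[-test_idx])))
  mean(labels[test_idx] == majority)     # accuracy on the held-out fold
})

mean(accuracy_per_fold)                  # average performance across the four folds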

Summary
Ɣ String processing refers to the manipulation and analysis of text data using various
operations and functions. It involves working with sequences of characters, known
as strings
Ɣ The central data structure utilised by numerous early synthesis systems was
commonly known as a string rewriting mechanism. The formalism stores the
linguistic representation of an utterance as a string. The initial state of the string
consists of textual content. As the processing occurs, the string undergoes
modifications or enhancements through the addition of supplementary symbols. This
method was utilised by systems such as MITalk and the CSTR Alvey synthesiser
