
Module - II: Introduction to Data Wrangling


Learning Objectives
At the end of this module, you will be able to:

• Infer the importance of data import in R for data analysis
• Learn how to import data from common file formats such as CSV, Excel and text files in R
• State the concept and purpose of web scraping in data acquisition
• Learn how to perform web scraping using the rvest package to extract data from websites
• Explain techniques to extract specific data from HTML tables available on web pages
• Discuss the principles of tidy data and its significance in data manipulation
• Learn about the Tidyverse ecosystem and its essential packages for data wrangling

Introduction
Much as the mariner in Samuel Taylor Coleridge's poem "The Rime of the Ancient Mariner" was surrounded by water he could not drink, analysts are often surrounded by data whose value depends heavily on their proficiency in managing and manipulating it. Despite the remarkable progress made in data-related technologies, analysts continue to dedicate a significant portion of their time to acquiring data, identifying and addressing data quality concerns and preparing data for effective utilisation.
Extensive research has demonstrated that within the realm of data analysis, this
particular phase stands out as the most laborious and time-intensive element. In the
realm of data analysis and exploration, data wrangling continues to hold a significant role
as a foundational element. Despite the inherent difficulties and complexities associated
with this process, it serves as a crucial stepping stone that empowers the creation of
impactful visualisations and facilitates the development of robust statistical models.

Defining Data Wrangling


Data munging, also known as data wrangling, is the process of converting and
manipulating raw data into a more structured and usable format. This transformation is
essential to make data suitable for various downstream applications, such as analytics.
Data wrangling aims to improve data quality and practicality and it often constitutes a
significant part of a data analyst’s workload compared to direct data examination.
During the data wrangling process, tasks like data visualisation, aggregation, training
statistical models and more can be performed. The general steps in data wrangling
include extracting raw data from the source, performing necessary transformations (e.g.,
sorting or parsing) and storing the processed data in a data sink for future use.
Data mapping is closely associated with data wrangling and involves matching
source data fields with their corresponding destination data fields. While data wrangling
focuses on data transformation, data mapping ensures a coherent connection between
different data components.


(Image source: https://www.javatpoint.com/data-wrangling)

Data Wrangling Process

(Image source: https://www.javatpoint.com/data-wrangling)

The data wrangling process refers to the steps involved in preparing and transforming raw data into a format that is suitable for analysis. This process typically includes tasks such as data collection, organisation, cleaning, enrichment, validation and publishing.
Data wrangling is a technical term that can be considered self-descriptive. The term "wrangling" describes the process of gathering and organising information in a specific manner. The operation comprises the following sequence of processes:
1. Exploration: Prior to commencing the data wrangling procedure, it is imperative to consider the potential underlying aspects of your data. Think critically about the anticipated outcomes of your data and its intended applications after the completion of the data wrangling process. After establishing your objectives, proceed to collect the necessary data.
2. Organisation: Once the raw data has been collected within a specific dataset, it is necessary to arrange the data in a structured manner. The initial observation of raw data can be overwhelming due to the multitude of data types and sources, as well as their inherent complexity.
3. Data Cleaning: Once your data has been organised, the next step is to initiate the data cleaning process. Data cleaning encompasses identifying and removing outliers, formatting null values and eliminating duplicate data. Cleaning data obtained through web scraping can be more laborious than cleaning data obtained from a database: web data is typically highly unstructured and often requires longer processing time than structured data from a database.

4. Data Enrichment: Enrichment involves evaluating the available data to ascertain its sufficiency for further processing. Insufficient data at the completion of the wrangling process can undermine the insights derived from subsequent analysis. For example, investors seeking to analyse product review data would require a substantial volume of data to effectively depict the market and enhance their investment intelligence.
5. Validation: Once a sufficient amount of data has been collected, it is necessary to implement validation rules to ensure the accuracy and integrity of the data. The validation rules are executed in iterative sequences to ensure the consistency of your data across the entire dataset. Validation rules serve the dual purpose of ensuring both quality and security. This step employs a similar logical approach to data normalisation, which standardises data through the application of validation rules.
6. Data Publishing: Publishing is the concluding stage of the data munging process. It encompasses adequately preparing data for subsequent utilisation. This may involve creating comprehensive notes and documentation detailing the steps taken during the data wrangling process. Additionally, it may require establishing access permissions for other users and applications.

Data Wrangling Use Case

(Image source: https://www.javatpoint.com/data-wrangling)

The use case of data wrangling involves the process of transforming and cleaning
raw data into a structured format that is suitable for analysis and interpretation. This
process is essential for ensuring data quality and reliability in various domains, such as
business intelligence, data science and research.
Data munging is a versatile technique that finds application in various use-cases,
which are outlined below:
1. Fraud Detection: By leveraging a data wrangling tool, businesses are empowered to execute the following actions:
◦ Detect corporate fraud by carefully analysing detailed information, such as multi-party and multi-layered emails or web chats; by examining these sources, one can identify unusual behaviour that may indicate fraudulent activities.
◦ Enhance data security by enabling non-technical operators to efficiently analyse and manipulate data in order to effectively manage the multitude of daily security tasks.
◦ Establish a standardised approach for quantifying both structured and unstructured data sets, which is essential for achieving accurate and consistent modelling results.


◦ Improve compliance by ensuring that your business adheres to industry and government standards through the implementation of security protocols during the integration process.
2. Customer Behaviour Analysis is a process that involves the systematic examination
and evaluation of customer actions, preferences and patterns. This analysis aims to
gain insights into customer behaviour in order to make informed business decisions
and improve overall customer satisfaction. The utilisation of a data-munging tool
facilitates the expeditious acquisition of accurate insights pertaining to customer
behaviour analysis, thereby enhancing business processes. The marketing team is
provided with the ability to assume control over business decisions and optimise their
outcomes. Data wrangling tools can be utilised to:
◦ Optimise the efficiency of data preparation for analysis by reducing the time required.
◦ Efficiently comprehend the business value inherent in your data.
◦ Enable your analytics team to directly leverage customer behaviour data.
◦ Enable data scientists to uncover data trends through the utilisation of data discovery and visual profiling techniques.

Tools for Data Wrangling


Data wrangling tools are utilised to collect, import, organise and clean data before it
is inputted into analytics and business intelligence (BI) programmes. By utilising software
that facilitates the assessment of data mappings and analysis of data samples throughout
the transformation process, automated methods for data wrangling can be employed.
This feature enhances the ability to promptly detect and resolve data mapping issues.
Businesses that deal with extremely large data volumes must automate data cleaning; in such settings, manual data cleansing methods are managed by the data team or data scientist. In smaller installations, however, non-data specialists are in charge of cleansing the data before using it.
Spreadsheets and script-based data munging are two examples of different data wrangling techniques, and some of the more recent all-in-one tools make data wrangling accessible to all users of the data.
The following list contains some of the more popular data wrangling tools.
◦ Excel Power Query: the most fundamental manual data wrangling tool, built into Excel spreadsheets.
◦ OpenRefine: an automated data cleansing tool that requires programming knowledge.
◦ Tabula: a tool that works with all forms of data.
◦ DataPrep by Google: a data service that examines, cleans and prepares data.
◦ Database manager: a tool for cleansing and altering data.
◦ Plotly (data wrangling with Python): helpful for maps and chart data.
◦ CSVKit: converts data between formats.


Benefits of Data Wrangling



(Image source: https://www.javatpoint.com/data-wrangling)

The benefits of data wrangling include improved data quality, increased efficiency
in data analysis, enhanced data integration and better decision-making capabilities.
Data wrangling allows for the identification and correction of errors, inconsistencies and missing values.
As stated earlier, the utilisation of big data has become a fundamental component in
the realm of business and finance in the present day. However, the complete potential of
the aforementioned data is not always evident. Data processes, such as data discovery,
serve as valuable tools for identifying and acknowledging the potential of your data.
In order to maximise the potential of your data, it is necessary to implement data wrangling. The following are several significant advantages associated with data wrangling.
1. Data Consistency: The process of data wrangling ensures that the resulting dataset
exhibits a higher level of consistency from an organisational perspective. Ensuring data
consistency is of utmost importance in business operations that entail the collection of
data input by consumers or other human end-users. In the event that a human end-
user erroneously submits personal information, such as creating a duplicate customer
account, it can have a subsequent impact on performance analysis.
2. Enhanced Statistical Insights: Enhanced statistical insights can be obtained through the process of data wrangling, which involves transforming metadata to achieve greater consistency. These insights are frequently attained through enhanced data consistency. When metadata remains consistent, automated tools can effectively and efficiently analyse the data, leading to faster and more accurate results. In the context of constructing a model for forecasting market performance, data wrangling becomes essential in order to ensure that the metadata is cleansed and prepared appropriately. This enables the model to execute smoothly and without encountering any errors.
3. Cost Efficiency: As stated earlier, the utilisation of data wrangling techniques enables businesses to enhance the efficiency of their data analysis and model-building procedures, resulting in long-term cost savings. One example of an effective practice is to perform a comprehensive cleaning and organisation of data prior to its integration, as this will result in a reduction of errors and time saved for developers.

2.1 Importing Data from Different File Formats


In R programming, importing data involves reading data from external files, writing
data to external files and accessing these files from within or outside the R environment.
A wide range of file formats, such as CSV, XML, xlsx, JSON and web data, can be


imported into R for data analysis. Similarly, data within the R environment can be saved to external files using the same file formats. This process facilitates seamless data exchange and enables efficient data manipulation and analysis in R.

2.1.1 Overview of Data Import in R


The R programming language has the ability to read data from various formats,
including files generated by other statistical packages. R is capable of reading and
loading data into memory regardless of whether it was prepared using Excel (in CSV,
XLSX, or TXT format), SAS, Stata, SPSS, or other software.
R has two native data formats: Rdata (also known as Rda) and Rds. The
aforementioned formats are employed for the purpose of preserving R objects for future
utilisation. The Rdata file format is utilised for storing multiple R objects, whereas the
Rds file format is employed for saving a single R object. Please refer to the following
instructions for reading and loading data into R from different file extensions.
To set the working directory, follow these steps. In order to access the data, it is necessary to configure the R working directory to the specific location where the data is stored.
◦ The function setwd("...") sets the current working directory to a specified location.
◦ The function getwd() retrieves and displays the current directory.
> setwd("C:/mydata")
When specifying a pathname, use forward slashes ("/"), which R accepts on all platforms; Windows-style backward slashes ("\") must be doubled ("\\") inside R strings. Establishing the working directory can effectively mitigate issues related to path ambiguity.

Reading R Data Files


R data files are binary files that store serialised R objects, such as data frames, matrices and lists, for later use.
Function: load()
> load("survey.rdata")
Or
> load("survey.rda")
It should be noted that the output of this function is not being assigned to a
variable. The load() function in R is used to load all R objects stored in a file into the R
environment. The original names assigned to these objects during the saving process will
be retained and assigned to them upon loading. The command `ls()` is used to display a
list of all the objects that are currently loaded into the R environment.

RDS Files
Function: readRDS()


> dataRDS <- readRDS("survey.rds")
RDS files store a single serialised R object in R's native R Data Serialisation format. The readRDS() function reads an RDS file and returns the object it contains; here, the contents of "survey.rds" are stored in the variable dataRDS. Because readRDS restores a single object rather than a set of named objects, the restored object can be assigned to any name; in this instance it is designated dataRDS.
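For completeness, a minimal round trip through both native formats might look as follows (a sketch; the object and file names are illustrative):
# save two objects to an RData file and one object to an RDS file
x <- data.frame(id = 1:3, score = c(90, 85, 88))
y <- c("a", "b", "c")
save(x, y, file = "objects.rdata")  # multiple objects, names preserved
saveRDS(x, file = "x.rds")          # single object, name not preserved
load("objects.rdata")               # restores x and y under their original names
z <- readRDS("x.rds")               # returns the object; assign it to any name
ls()                                # list the objects now in the environment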

Reading Delimited Data Files


Delimited data files are commonly used to store structured data, where each field is separated by a delimiter character such as a space, tab or comma.

Space-Delimited
The read.table() function is a built-in function in R that is used to read data from a file
and create a data frame.
Function: read.table()

Common Parameters:
◦ header: TRUE when the first row includes variable names. The default is FALSE.
◦ sep: a string indicating what separates the data. The default is "" (any white space).
> dataSPACE <- read.table("C:/mydata/survey.dat", header=TRUE, sep=" ")
With the working directory set, this is equivalent to:
> dataSPACE <- read.table("survey.dat", header=TRUE, sep=" ")
The data from the file "survey.dat" is read into the variable dataSPACE using the read.table function. The file has a header row and the columns are separated by spaces.

Tab-Delimited
Function: read.table()
Common Parameters:
◦ header: TRUE when the first row includes variable names. The default is FALSE.
◦ sep: a string indicating what separates the data. The default is "" (any white space).
> dataTAB <- read.table("survey.dat", header=TRUE, sep="\t")
The data from the file "survey.dat" is read into the data frame dataTAB. The file has a header row and the columns are separated by tabs.


Comma-Delimited
Function: read.csv()
Common Parameters:
◦ header: TRUE when the first row includes variable names. For read.csv() the default is TRUE.
> dataCOMMA <- read.csv("survey.csv", header=TRUE)
The read.csv() function reads a CSV (comma-separated values) file; here "survey.csv" is read into the variable dataCOMMA with the header parameter set to TRUE.

Fixed-Width Formats
Function: read.fwf()
Common Parameters:
◦ widths: an integer vector giving the width of each fixed-width field (required).
◦ header: TRUE when the first row includes variable names. The default is FALSE.
In fixed-width formats, each field or data element is allocated a fixed number of characters, so no separator character is needed.
> dataFW <- read.fwf("survey.txt", widths=c(10, 5, 8), header=TRUE)
The dataFW object is assigned the result of the read.fwf function, which reads the "survey.txt" file. The widths vector (the values here are purely illustrative) tells R how many characters each column occupies, and header=TRUE indicates that the first row of the file contains variable names.

Reading SPSS, Stata and SAS Data Files


How to read data files in SPSS, Stata and SAS formats is described below.
The R package "foreign" enables users to read data stored in SPSS SAV files, Stata DTA files, or SAS XPORT libraries.
> install.packages("foreign")
> library(foreign)
The first command installs the foreign package (needed only once per machine); the second loads it into the current R session.

SPSS
The read.spss() function is designed to read SPSS data files.
Function: read.spss()

Common Parameters:
◦ to.data.frame: TRUE if R should treat loaded data as a data frame. The default is FALSE.
◦ use.value.labels: TRUE if R should convert variables with value labels into R factors with those levels. The default is TRUE.
> dataSPSS <- read.spss("C:/mydata/survey.sav", to.data.frame=TRUE)
The dataSPSS object is assigned the result of the read.spss function, which reads the "C:/mydata/survey.sav" file and converts it into a data frame.
In R, it is assumed that any value labels present in the SPSS file pertain to factors, which are R's equivalent of categorical variables. Consequently, R stores the labels themselves instead of the original numerical values. An illustrative instance involves a variable denoted "gender", encoded as 0 for male and 1 for female, with the corresponding labels stored within the SAV file. When data is imported from SPSS into R, the variable values will be represented as "male" and "female" instead of the original 0 and 1 values. This default behaviour can be modified in the call to read.spss; for example, the user can select the file interactively with file.choose() and suppress label conversion:
> dataSPSS <- read.spss(file.choose(), use.value.labels=FALSE)
STATA
The read.dta() function reads data from a Stata DTA file.
Function: read.dta()

Common Parameters:
◦ convert.dates: convert Stata dates to the Date class. The default is TRUE.
◦ convert.factors: TRUE to convert value labels into factors. The default is TRUE.
> dataStata <- read.dta("survey.dta")
The data from the file "survey.dta" is read into the variable dataStata using read.dta(). The object that is generated is inherently a data frame. By default, the conversion process transforms value labels into factor levels. To disable this feature, use:
> dataStata <- read.dta("survey.dta", convert.factors=FALSE)
Note: Stata has a tendency to modify the way it stores data files between versions, which may cause compatibility issues with the foreign package. In the event that the read.dta command encounters an error, it is recommended to use the saveold command in Stata to store the data. This generates a DTA file saved in a previous Stata version, which is more likely to be recognised by the read.dta function.

SAS
The read.xport() function reads data from a SAS XPORT file.
Function: read.xport()
> dataSAS <- read.xport("C:/mydata/survey")
The data from the file "C:/mydata/survey" is read into the variable dataSAS using read.xport. The function will yield a data frame if there is only one dataset present in the library; if there are multiple datasets, it will produce a list of data frames.

2.1.2 Importing CSV, Excel and Text Files


CSV Files
In R programming, there are three popular methods for importing CSV files:
• Using the read.csv() method
• Using the read_csv() method
• Using the fread() method
Let’s explore each of these methods and demonstrate how to import a CSV file using
them.
Using the read.csv() method
The read.csv() method is used for importing CSV files, particularly those of smaller sizes. The data from a CSV file is stored in a variable to facilitate subsequent manipulation; multiple CSV files can be imported into separate variables. The returned output is a data frame, with row numbers assigned as integers.

Syntax:
read.csv(path, header = TRUE, sep = ",")
The function read.csv() reads a CSV file from the specified path. It takes three main arguments: path, header and sep.

Arguments:
◦ path: the path of the CSV file to be imported.
◦ header: indicates whether the first row of the CSV contains column names. By default, it is set to TRUE.
◦ sep: the field separator character.
R commonly employs factors for re-encoding strings. It is recommended to set the parameter stringsAsFactors = FALSE in order to prevent R from automatically converting character or categorical variables into factors.
# read the data from the CSV file
data <- read.csv("C:\\Personal\\IMS\\cricket_points.csv", header = TRUE)
# print the data variable (outputs as a data frame)
data


Output:
  Teams          Wins  Lose  Points
1 India          5     2     10
2 South Africa   3     4     6
3 West Indies    1     6     2
4 England        2     4     4
5 Australia      4     2     8
6 New Zealand    2     5     4
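Following the recommendation above about factors, a brief sketch that disables automatic factor conversion (the path is the same illustrative one used earlier):
# keep character columns as characters rather than factors
data <- read.csv("C:\\Personal\\IMS\\cricket_points.csv",
                 header = TRUE, stringsAsFactors = FALSE)
str(data)  # Teams is now chr rather than Factor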

Using the read_csv() method
The read_csv() method is widely regarded as the preferred approach for reading CSV files in R. The programme sequentially processes each line of a CSV file.
The data is read in as a tibble, with only 10 rows initially displayed; additional rows can be accessed by expanding the view.
Additionally, it provides the user with a progress indicator showing the percentage of the file that has been read into the system, making it more robust than the read.csv() method. When dealing with large CSV files, it is advisable to use the read_csv() method.
Syntax:
read_csv(path, col_names, n_max, col_types, progress)

Arguments:
◦ path: the path of the CSV file to be imported.
◦ col_names: indicates whether the first row of the CSV contains column names. By default, it is set to TRUE.
◦ n_max: the maximum number of rows to read.
◦ col_types: the column types; if NULL, the types are guessed from the data, and they can also be specified in a compact string format.
◦ progress: a progress meter showing the percentage of the file read into the system.
# import the readr library (part of the tidyverse), which provides read_csv()
library(readr)
# import data
data <- read_csv("C:\\Personal\\IMS\\cricket_points.csv")

Output

Teams Wins Lose Points


1 India 5 2 10
2 South Africa 3 4 6
3 West Indies 1 6 2
4 England 2 4 4
5 Australia 4 2 8
6 New Zealand 2 5 4


Using the fread() method
If the CSV files are extremely large, the best way to import them into R is with the fread() method from the data.table package. The output in this case is a data.table.
# import the data.table library
library(data.table)
# read the CSV file
data <- fread("C:\\Personal\\IMS\\cricket_points.csv")

Output

Teams Wins Lose Points


1: India 5 2 10
2: South Africa 3 4 6
3: West Indies 1 6 2
4: England 2 4 4
5: Australia 4 2 8
6: New Zealand 2 5 4
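fread() also auto-detects the separator and header in most cases and can read just part of a file; a small sketch using its select and nrows parameters (the column names follow the illustrative dataset above):
# read only two columns and the first three rows
subset_data <- fread("C:\\Personal\\IMS\\cricket_points.csv",
                     select = c("Teams", "Points"), nrows = 3)
subset_data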

Importing Excel File


Here, we will discuss two different approaches to import Excel files into the R
programming language.

(Image source: https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/)


Method 1: Using read_excel()
The user must call the read_excel() function from the readxl package, passing the file name as a parameter, in order to import an Excel file into R using this method. Both .xlsx and .xls files can be imported using the readxl package, which is installed as part of the tidyverse collection of packages. The user will be able to import the Excel file into R by using this function.

Syntax:
read_excel(path, sheet = NULL)

Parameters:
path: the file name to read from.
sheet: the sheet to read, specified by name or position; by default, the first sheet is read.

Returns:
A tibble (data frame) representing the sheet's data.
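A minimal sketch of this method (the file name and sheet number are illustrative):
# load readxl and import the second sheet of a workbook
library(readxl)
data <- read_excel("survey.xlsx", sheet = 2)
head(data)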

Method 2: Utilising the built-in menu options in RStudio
This approach is simpler than the previous one; it is the only method for importing an Excel file into R in which the user is not required to type any code into the console. The user works instead with the Environment window within RStudio.

(Image source: https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/)

Steps to import excel file using Dataset option from the environment window of
Rstudio:
Step 1: The Import Dataset option should be selected in the environment window
of RStudio. The user needs to choose this option to import the dataset from the
environment window in RStudio.


(Image source: https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/)

Step 2: Choose the “From Excel” option within the import Dataset menu. To import
an Excel file, the user should select the “From Excel” option under the import dataset
menu. This option is specifically designed for importing Excel files.

(Image source: https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/)

Step 3: Use the browse option to choose and import the desired Excel file. The user will be presented with the option to browse for the desired Excel file by clicking on the corresponding button. Next, the user must choose the specific Excel file that they wish to import into R.


(Image source: https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/)

Step 4: Click the Import button to complete the importation of the selected Excel file into R.
In the import dialogue the user can adjust options to suit their needs: they can modify the dataset name, choose which sheet to read (for example, selecting the second of two sheets via the Sheet option), set the maximum number of rows to import, skip a desired number of leading rows using the Skip box and enter a value in the NA box so that, wherever that value occurs in the dataset, it is treated as NA.
An alternative approach for importing Excel files into R-Studio is available.
Step 1: To initiate the desired action, please select the “file” option by clicking on it.
Step 2: Within the file, locate and select the “Import Dataset” option, followed by
choosing the desired dataset from the available Excel files.


(Image source: https://www.geeksforgeeks.org/how-to-import-an-excel-file-into-r/)

Importing a Text File


A .txt file can easily be imported or read using the basic R function read.table(), which reads a file in table format. This function is easy to use and flexible.

Syntax:
# read data stored in a .txt file
x <- read.table("file_name.txt", header = TRUE/FALSE)
# simple R program to read a txt file
x <- read.table("D://Data//myfile.txt", header = FALSE)
# print x
print(x)

Output:
V1 V2 V3
1 100 a1 b1
2 200 a2 b2
3 300 a3 b3

If the header argument is set to TRUE, the column names are read from the first line of the file, if they exist.
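Writing data out works analogously to reading it in; a brief sketch (the file names are illustrative):
# write a data frame to a text file and a CSV file
write.table(x, "out.txt", row.names = FALSE)
write.csv(x, "out.csv", row.names = FALSE)
# save any single R object in R's native RDS format
saveRDS(x, "out.rds")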

2.1.3 Reading Data from Databases


The process of importing data from a relational database involves retrieving data from a database system and transferring it into R. This can be done using various methods and tools, depending on the specific requirements and the database systems involved.
The R programming language has limitations when it comes to processing extremely
large datasets and does not support concurrent access to data. A Relational Database
Management System (RDBMS), in contrast, possesses the capability to swiftly retrieve
specific segments of large-scale data. Additionally, it enables simultaneous access
by multiple users operating on different hosts. There exist multiple R packages that
facilitate communication with RDBMS, each offering a distinct level of abstraction.
Certain packages offer the capability to efficiently copy complete data frames to and
from databases. There are several packages available on CRAN that can be used for
importing data from a Relational Database.
• RODBC
• RMySQL
• ROracle
• RPostgreSQL
• RSQLite, for the bundled DBMS SQLite
• RJDBC, which uses Java to connect to any DBMS equipped with a JDBC driver
• PL/R
• RpgSQL
• RMongo, a Java-based client interface that allows users to interact with MongoDB from R

RMySQL
The RMySQL package serves as an interface to the MySQL Database Management System (DBMS). To use it, install it with install.packages("RMySQL") and load it with library(RMySQL). The current version of this package requires the DBI package to be installed first.
The dbDriver function, when passed the argument "MySQL", returns an object that manages database connections. This object can then be used with functions such as dbConnect and dbDisconnect to establish and terminate a connection to a database. Before working with other DBMSs via their respective driver calls, such as dbDriver("SQLite"), dbDriver("PostgreSQL") and dbDriver("Oracle"), it is necessary to install packages such as RSQLite, RPostgreSQL and ROracle.
• The function dbGetQuery is used to send queries and retrieve results in the form of a data frame.


• The function dbSendQuery sends the query and returns an object of a class inheriting from "DBIResult", which can then be used to retrieve the result.
• The function dbClearResult removes the result from the cache memory.
• The fetch operation retrieves either a subset or all of the rows specified in the query; its output is a collection of elements organised in a list structure.
• The function dbHasCompleted verifies whether all the rows have been retrieved.
• The functions dbReadTable and dbWriteTable read and write tables between a database and an R data frame.

Syntax:
> library(RMySQL)
> connection <- dbConnect(dbDriver("MySQL"), dbname = "Test_Database")
## Assuming MySQL tables are used for the DBMS
> dbListTables(connection)
## Loading a data frame into the database
> data <- Sample_Data
> dbWriteTable(connection, "Column", Sample_Data, overwrite = TRUE)
## To read Column1 from the database
> dbReadTable(connection, "Column1")
## Selecting from the loaded table as a query
> dbGetQuery(connection, paste(Row_Name, Variable_Name, Condition))
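Because the same DBI functions work across back ends, the dbSendQuery / fetch / dbClearResult cycle described above can be tried without a running server using the bundled SQLite driver. A self-contained sketch (the table and column names are illustrative):
library(DBI)
library(RSQLite)
# create an in-memory SQLite database and copy a data frame into it
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "scores",
             data.frame(team = c("India", "England"), points = c(10, 4)))
# send a query, fetch rows in batches, then clear the result
res <- dbSendQuery(con, "SELECT team, points FROM scores WHERE points > 5")
while (!dbHasCompleted(res)) {
  print(dbFetch(res, n = 1))  # fetch one row at a time
}
dbClearResult(res)
dbDisconnect(con)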

RODBC
The RODBC package is used for connecting R to databases using the Open
Database Connectivity (ODBC) interface.
The RODBC package offers an interface for accessing database sources that
support the ODBC interface. The popularity of this package stems from its ability to
utilise the same R code for importing data from various database systems. The RODBC
package is compatible with OS X, Windows and Unix/Linux operating systems and
supports a wide range of database systems including MySQL, Microsoft SQL Server,
Oracle and PostgreSQL.
• The functions odbcConnect and odbcDriverConnect are used to establish a connection to a database.
• The odbcGetInfo function retrieves information about the client and server.
• The odbcClose function closes the database connection.
• The sqlSave function stores the R data frame provided as an argument in a database table.
• The sqlFetch function performs the reverse operation: it retrieves a database table and stores it as an R data frame.


• The sqlQuery function transmits an SQL query to the database and returns an R data frame.
In the example below, PostgreSQL with an ODBC driver is used.
> library(RODBC)
> connection <- odbcConnect("Sample_Database", uid = "Name", case = "tolower")
> data <- Sample_Data
> sqlSave(connection, Sample_Data, rownames = "Row", addPK = TRUE)
> rm(Sample_Data)
> sqlQuery(connection, "Select_Column Condition")
Importing Data from a Non-Relational Database
R also has packages that support importing data from non-relational databases:
◦ rhbase is used for the Hadoop Distributed File System
◦ RCassandra is used for the Cassandra database system
◦ rmongodb is used for MongoDB
> library(rmongodb)
> SampleDatabase <- "Test_Database"
> MyMongoDB <- mongo.create(db = SampleDatabase)
## To insert a list
> mongo.insert(MyMongoDB, "Test_Database.Column", list(name = "value"))

2.2 Web Scraping in R


The majority of individuals possess a basic understanding of web pages. However, it is important to note that the way an individual perceives a website differs from the way Google or a web browser does.
When a user enters a website address into their browser, the browser initiates the
process of downloading and rendering the corresponding web page. However, in order to
properly render the page, the browser requires specific instructions.
There exist three distinct categories of instructions:
1. HTML is a markup language that provides a structural framework for web pages. It defines the infrastructure and organisation of the content within a website, such as headings, paragraphs and other components.
The infrastructure consists of elements known as tags, such as <h1>...</h1> or <p>...</p>. Tags in an HTML document serve as the fundamental elements that define the nature of the content enclosed within them. For instance, the tag "h1" is used to indicate a heading of level 1. Note that there are two distinct categories of tags:
◦ starting tags (e.g. <h1>)
◦ ending tags (e.g. </h1>)
As the output of raw HTML is not particularly elegant, CSS is used to style the final website.

2. CSS, on the other hand, is a style sheet language that determines the visual presentation and layout of a website. It is responsible for defining the appearance and aesthetics of the site, including aspects such as colours, fonts and spacing.
For instance, CSS is used to specify various aspects of a website, such as fonts,
colours, sizes, spacing and more.
One of the crucial aspects of CSS is selectors, which are patterns used to select
elements. The .class selector is particularly important, as it selects all elements with the
same class. For example, the .xyz selector will target all elements with class=”xyz”.
3. JavaScript is a programming language that enables interactive and dynamic functionality on web pages. It is used to define the behaviour and interactivity of a website, allowing for actions such as user input validation, content manipulation and dynamic updates.
Web scraping is a technique used to extract information from the lines of code in HTML, CSS and JavaScript. The term typically denotes an automated process characterised by reduced error rates and increased speed compared to manual data collection methods.
It is imperative to acknowledge that web scraping may give rise to ethical concerns
due to its involvement in accessing and utilising data from websites without explicit
permission from the website owner. Adhering to the terms of use for a website and
obtaining written consent prior to extracting substantial volumes of data are considered
best practices.

Web Scraping vs. APIs: A Comparison


APIs: Structured and Authorised Access
• An API (Application Programming Interface) facilitates communication between software systems, providing structured access to data from websites and online services.
• APIs are a more ethical approach, as they require explicit permission and authorisation from the website or service to access data.
• APIs offer controlled access and follow a set of rules and protocols for data retrieval.

Web Scraping: Data Access without Explicit Permission


• Web scraping involves extracting data from web pages using automated scripts or bots.
• Web scraping does not require explicit permission from the website owner, raising ethical concerns.
• It is useful for gathering data from websites without APIs, providing an alternative option for data access.

Limitations of APIs and Web Scraping


• APIs may have rate limits, restricting the number of requests within a specific timeframe.
• Not all websites or online services offer APIs, making web scraping necessary for accessing certain data.
• Both methods have their use cases, but using APIs is generally preferred due to their structured access and ethical nature. However, web scraping can be valuable when APIs are not available or feasible.


2.2.1 Introduction to Web Scraping


Web scraping refers to the automated process of extracting data from websites. It involves retrieving and parsing the HTML code of a web page in order to extract specific information or data points.
Web scraping is an automated process of extracting large amounts of data from
websites, often in unstructured HTML format. The extracted data is then converted
into structured form and stored in databases or spreadsheets for various applications.
Different methods can be used for web scraping, including utilising APIs, online services,
or creating custom code.
Notable websites like Google, Twitter and Facebook provide APIs that allow users
to access their data in an organised manner, making API usage the preferred option.
However, for websites without APIs or limited data access, web scraping becomes
necessary to retrieve substantial data.
Web scraping involves two key components: the crawler and the scraper. The
crawler is an AI algorithm that systematically navigates the web, following hyperlinks to
retrieve specific data. The scraper is a specialised tool designed to extract data from
websites efficiently and accurately, tailored to the project’s complexity and requirements.

How Do Web Scrapers Work?


Web scrapers are software tools that enable the extraction of data from websites.
They can retrieve either all the available data on a given website or specific data as
per the user’s requirements. For optimal performance, it is recommended to provide
specific data specifications to the web scraper, enabling it to efficiently extract the desired
information. One possible use case involves extracting data from an Amazon webpage,
specifically focusing on the available types of juicers. However, the objective is to solely
retrieve information pertaining to the various juicer models, while excluding any customer
reviews.
In the process of web scraping, the initial step involves providing the URLs to the
web scraper. The process involves loading the HTML code of the specified websites and
in some cases, a more sophisticated scraper may also extract the CSS and Javascript
elements. The scraper retrieves the necessary data from the provided HTML code and
generates the output in the format specified by the user. The data is typically stored in
either an Excel spreadsheet or a CSV file format. However, alternative formats like JSON
files can also be utilised for data storage.

Classification of Web Scrapers


Web Scrapers can be categorised based on various criteria, such as Self-built or
Pre-built Web Scrapers, Browser extension or Software Web Scrapers and Cloud or
Local Web Scrapers.
• The implementation of self-built web scrapers necessitates a proficient understanding of programming concepts and techniques, and a deeper understanding is required to extend a scraper's functionality. In contrast, pre-built web scrapers are pre-existing scrapers that have been developed in advance and can be conveniently downloaded and executed, often with advanced options for customisation.
• Browser extensions are small software programs that enhance web browsers' functionality, usually developed using web technologies like HTML and CSS.


Web scrapers delivered as browser extensions provide added features and can be easily integrated with the browser for user convenience. However, they have limitations, since they operate within the browser's constraints and cannot execute advanced features beyond those limits.
• Software web scrapers, on the other hand, are independent programs that users download and install on their computers. They offer greater complexity and advanced features not restricted by browser limitations.
• Cloud web scrapers use remote servers provided by the vendor, freeing up a user's computer resources for other tasks. Local web scrapers, by contrast, rely on the user's computer resources, potentially impacting overall performance if significant CPU or RAM resources are required during scraping.

Uses of Web Scraping


Web scraping is a technique employed to extract data from websites. It is commonly
used in various domains such as data analysis, research, market intelligence and
automation.
The practice of web scraping finds utility in numerous industries, serving a range of purposes. Some of these are examined below.
1. Price Monitoring: Price monitoring refers to the systematic process of tracking and analysing the prices of goods or services in a specific market or industry. Web scraping is a technique employed by companies to extract product data from various sources, including their own products and those of their competitors. This data is then analysed to evaluate its influence on pricing strategies. The data can be utilised by companies to determine the most effective pricing strategy for their products, enabling them to maximise their revenue potential.
2. Market Research: Market research is a systematic process of gathering, analysing and interpreting information about a target market or industry, collecting data on factors such as customer preferences and market trends. The utilisation of web scraping in market research is a common practice among companies. The acquisition of substantial quantities of meticulously extracted web data can prove highly advantageous for businesses seeking to analyse consumer patterns and gain insights into future strategic directions.
3. News Monitoring: News monitoring is the process of systematically tracking and analysing news sources to gather relevant information, using various tools and techniques to monitor news outlets and social media platforms. Web scraping enables companies to obtain comprehensive reports on the latest news from various news sites. This is particularly crucial for companies that experience frequent media coverage or rely on daily news for their operational needs. In the realm of corporate operations, news reports possess the potential to significantly impact the success or failure of a company within a mere twenty-four-hour period.

2.2.2 Web Scraping with Rvest Package


There exist multiple packages in R specifically designed for web scraping, each with
its own unique set of advantages and constraints. The rvest package is widely used in
the field.


To initiate web scraping in R, it is necessary to have R and RStudio installed. If these are not already installed, please refer to the appropriate documentation for installation instructions. After successfully installing R and RStudio, the next step is to install the rvest package.
install.packages("rvest")

rvest
The rvest package, inspired by the Python libraries Beautiful Soup and RoboBrowser, offers a comparable syntax, making it an ideal choice for individuals transitioning from Python.
The rvest package offers a set of functions that enable users to retrieve information from web pages and extract specific elements using CSS selectors and XPath. The library is part of the Tidyverse package collection, which means it adheres to certain coding conventions, such as the use of pipes. These conventions are shared by other libraries within the Tidyverse, such as tibble and ggplot2.
To begin the scraping process, it is essential to load the rvest package.
library(rvest)
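As a first taste, a minimal sketch (the URL and selectors are illustrative) that reads a page, selects elements with the .class CSS selector discussed earlier and extracts their text:
# read a page, select elements by CSS class, extract their text
page <- read_html("https://example.com")
titles <- page %>% html_nodes(".title") %>% html_text()
# a single element can also be selected by tag name
h1 <- page %>% html_node("h1") %>% html_text()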

2.2.3 Extracting Data from HTML Tables


The following are the main ideas in R for table scraping:
• Web scraping in R entails using the XML and rvest packages to collect data from web pages.
• R has the ability to scan HTML websites and parse them in order to retrieve specific data of interest.
• Selectors are the most important tools for collecting data from websites. It is essential to fully comprehend the HTML structure of the web page in order to correctly obtain the needed information. R's selectors make it possible to choose HTML page elements using XPath or CSS selectors.
• HTML Parsing: once the desired elements have been chosen, the next step is to parse the HTML content and extract the pertinent information.

Scraping a Table from a Static Website


To extract tabular data from a static website, follow these steps:
1. Reading HTML Content: Use the read_html function to fetch the HTML content of the website.
2. Selecting the Table: Use the html_nodes function with a CSS selector to select the desired table from the HTML content.
3. Extracting Table Content: Apply the html_table function to extract the table's content.
4. Displaying the Data: Print the first six rows of the extracted table to review the data obtained from the website.
obtained from the website.
library(rvest)
# Read the HTML content of the website
webpage <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")
# Select the table using a CSS selector
table_node <- html_nodes(webpage, "table")
# Extract the table content
table_content <- html_table(table_node)[[1]]
# Print the first rows of the table
head(table_content)

Scraping a Table from a Dynamic Website


To scrape a table from a dynamic website, where the content is generated using JavaScript, you can follow these steps:
1. Reading HTML Code: Use the rvest library to read the HTML code of the web page.
2. Selecting the Table: Use the html_nodes function to select the first table on the page.
3. Converting to a Data Frame: Apply the html_table function to convert the selected HTML code into a data frame.
4. Displaying the Data: Use the head function to display the first few rows of the data frame, showing the extracted table data from the dynamic website.
library(rvest)
library(tidyverse)
# URL of the website
url <- "https://www.worldometers.info/world-population/population-by-country/"
# Read the HTML code of the page
html_code <- read_html(url)
# Use the html_nodes function to extract the table
table_html <- html_code %>% html_nodes("table") %>% .[[1]]
# Use the html_table function to convert the table HTML code into a data frame
table_df <- table_html %>% html_table()
# Inspect the first few rows of the data frame
head(table_df)

2.3 Tidying Data with the Tidyverse


Data tidying, also known as data cleaning or data cleansing, refers to the process of identifying and rectifying errors, inconsistencies and inaccuracies. It is commonly acknowledged that a significant portion, roughly 80%, of the data analysis process is dedicated to the crucial tasks of data cleaning and preparation. The iterative nature of analysis necessitates repeating this crucial step multiple times, in order to address emerging problems and incorporate newly acquired data. The objective of data tidying is to organise datasets in a manner that optimises their suitability for analysis.


The principles of tidy data establish a standardised approach for organising data values in a dataset. The use of a standard facilitates initial data cleaning by eliminating the need to start from scratch and develop a solution anew on each occasion. The tidy data standard has been specifically designed to enhance the ease of initial data exploration and analysis, as well as to streamline the development of cohesive data analysis tools. Translation is frequently necessary when using existing tools: a certain amount of time is spent manipulating the output of one tool to prepare it as input to another. Tidy datasets and tidy tools are complementary components that facilitate data analysis, enabling users to concentrate on the substantive domain problem rather than the mundane data logistics.

2.3.1 Principles of Tidy Data


There are three interrelated rules or principles which make a dataset tidy:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

Figure: Following three rules makes a dataset tidy: variables are in columns, observations are in rows
and values are in cells.
(Image Source: https://r4ds.had.co.nz/tidy-data.html)

1. Each variable must have its own column.

(Image Source: https://r4ds.had.co.nz/tidy-data.html)

2. Each observation must have its own row.

3. Each value must have its own cell.


(Image Source: https://r4ds.had.co.nz/tidy-data.html)

The interrelation of these three rules stems from the impossibility of satisfying
only two out of the three. The aforementioned interrelationship gives rise to a more
streamlined set of practical instructions:
◦ Place each dataset in a tibble.
◦ Place each variable in a column.
For individuals who extensively utilise Excel, particularly Excel pivot tables, it is
beneficial to perceive tidy data as data that is highly compatible with pivoting operations.
Consider a scenario in which you encountered a situation where the utilisation of a pivot
table was necessary. However, the original dataset contained dimensions presented in
both rows and columns. For instance, the rows contained information about Campaign,
while the columns contained information about Device Category, with the corresponding
Sessions data populated within the cells. If an individual encounters the need to manually
or extensively apply formulas to transform raw data into a desired format suitable for
input into a pivot table, they have encountered a situation involving non-tidy data.
Tidy data refers to data that may not be optimised for human readability, but is highly
compatible with subsequent R functions, particularly those within the tidyverse, as the
short sketch below illustrates.
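As an illustration, here is a minimal sketch of tidying the hypothetical Campaign-by-Device table described above with tidyr's pivot_longer(); every column name and value is invented for the example:
library(tidyverse)
# Hypothetical untidy data: one row per campaign,
# one column per device category
sessions_wide <- tribble(
  ~Campaign, ~Desktop, ~Mobile, ~Tablet,
  "Spring",       120,     340,      45,
  "Summer",        95,     410,      60
)
# Pivot the device columns into a single variable so that each
# campaign/device combination becomes its own row (observation)
sessions_tidy <- sessions_wide %>%
  pivot_longer(
    cols = Desktop:Tablet,
    names_to = "DeviceCategory",
    values_to = "Sessions"
  )
sessions_tidy
In this tidy form, Campaign, DeviceCategory and Sessions each occupy a column and every campaign/device combination occupies a row, which is precisely the shape a pivot table (or a tidyverse function) expects as input.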
Ensuring the tidiness of your data is important for several reasons. There are two
primary advantages:
1. There is a notable benefit associated with selecting a uniform approach for data
storage. Having a consistent data structure facilitates the learning process of
tools that operate on it due to the presence of a fundamental uniformity.
2. Placing variables in columns offers a distinct advantage by leveraging the
vectorised nature of R. As previously discussed in the context of the mutate()
and summarise() functions, it is important to note that the majority of built-in
R functions operate on vectors of values. This makes transforming tidy data
straightforward and intuitive, as the sketch below shows.
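To illustrate the second point, a small hedged sketch: because each variable occupies its own column, a vectorised function applies to an entire variable at once. The data frame and values here are invented:
library(dplyr)
# Hypothetical tidy data: one row per student
scores <- tibble(
  student = c("A", "B", "C"),
  marks = c(72, 85, 64)
)
# mutate() operates on the whole marks vector in a single step
scores %>% mutate(fraction = marks / 100)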

2.3.2 Introduction to the Tidyverse


The term “Tidyverse” is the newer name for the “Hadleyverse”, a compilation of R packages
developed or endorsed by Hadley Wickham.

(Image: various packages in tidyverse)

The tidyverse refers to a cohesive assortment of packages that exhibit seamless
compatibility owing to their shared data representations and API design. The primary
purpose of the tidyverse package is to streamline the process of installing and loading
essential tidyverse packages by providing a single command.
The purpose of these packages is to provide comprehensive coverage of data
analysis in R. Each package complements the others by supporting shared concepts and
generating compatible output. The tidyverse consists of three major packages, commonly
referred to as the “Big Three.” These packages are:
™ The dplyr package is a tool used for performing data manipulation on data
frames.
™ The tidyr package provides a set of tools for efficiently organising and
restructuring data frames. These tools allow users to easily tidy and untidy
their data, facilitating data cleaning and manipulation tasks.
™ The ggplot2 package is a powerful tool for creating visualisations of tidy data.
Please note that these three packages have distinct and individual names, which
proves useful when searching online for instructions on a specific operation or when
troubleshooting a particular issue. For example, one can search for details on how to
filter data by column name using dplyr. Including any of these package names helps
ensure that the search results pertain to the R programming language.
To install the tidyverse, run the following code in RStudio:
# Install from CRAN
install.packages("tidyverse")
# To check your installation, load the package
library(tidyverse)

Output:
── Attaching packages ──────────────────────────── tidyverse ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.5
✔ tibble  3.1.8     ✔ dplyr   1.0.10
✔ tidyr   1.2.1     ✔ stringr 1.4.1
✔ readr   2.1.3     ✔ forcats 0.5.2
── Conflicts ─────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Furthermore, it is worth noting that there exist several other packages within the
tidyverse framework that users may encounter.
Ɣ The tibble package is designed to enhance the user-friendliness of data frames by
providing a more convenient and intuitive data structure.
Ɣ The purrr package provides a comprehensive set of tools for functional
programming, making it easier to work with functions and to iterate over vectors and lists.
Ɣ Magrittr is the package from which the %>% pipe operator originates; the pipe is used
extensively throughout the tidyverse (see the short sketch after this list). It is worth
noting that magrittr was not developed by Wickham himself, but his packages depend
on it: the pipe operator in the dplyr package is re-exported from magrittr. Mentioning
this fact in casual conversations with R enthusiasts can enhance one's credibility.
Ɣ The broom package facilitates the transformation of statistical models into tidy data
frames or tibbles.
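As a brief sketch of the pipe in action, %>% passes the result of one expression as the first argument of the next, so deeply nested calls can be rewritten to read left to right; the numbers below are arbitrary:
library(magrittr)
x <- c(4, 9, 16)
# Without the pipe, the calls read inside-out
round(mean(sqrt(x)), 1)
# With the pipe, the same computation reads left to right
x %>% sqrt() %>% mean() %>% round(1)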
The provided list represents a subset of the tidyverse packages. However, due to the
increasing popularity of the tidyverse, it is now possible to conveniently install and load
all the packages, including the ones mentioned here, by simply installing the ‘tidyverse’
package. This package will automatically install the mentioned packages along with a few
additional ones.
Although it is possible to replicate the functionality of the aforementioned packages
using base R or alternative packages (the data.table package being a notable substitute
for dplyr), adopting the tidyverse approach necessitates gaining familiarity with one or
more of the packages described above.

2.3.3 Data Cleaning and Transformation with Dplyr


Although it is feasible to perform most of the tasks using Base R functions (i.e.,
without importing an external package), dplyr simplifies the process significantly. Similar
to other highly functional R packages, dplyr was created by the data scientist Hadley
Wickham.
dplyr is a software package designed to facilitate the manipulation of tabular
data. It achieves this by providing a concise collection of functions that can be utilised
in combination to efficiently extract and summarise valuable insights from the data. It
complements tidyr, facilitating efficient conversion between diverse data formats (long vs.
wide) for the purpose of visualisation and analysis.
dplyr is a package that is included in the tidyverse ecosystem. To follow along, load the
tidyverse package and the previously downloaded dataset (the examples below use a census dataset).

To install the dplyr package, type the following command.


install.packages(“dplyr”)
To load the dplyr package, type the command below:
library(dplyr)
Important dplyr Functions to remember

dplyr Function     Description                        Equivalent SQL

select()           Selecting columns (variables)      SELECT
filter()           Filter (subset) rows               WHERE
group_by()         Group the data                     GROUP BY
summarise()        Summarise (or aggregate) data      -
arrange()          Sort the data                      ORDER BY
join()             Joining data frames (tables)       JOIN
mutate()           Creating new variables             COLUMN ALIAS

Some of the most common dplyr functions explained:


™ The rename() function is used to modify the names of columns.
™ The recode() function is used to modify or transform values within a specific
column.
™ The select() function is used to choose specific columns from a dataset, while
the filter() function is used to extract rows from the dataset based on specified
conditions.
™ The mutate() function is used to generate new columns by extracting
information from existing columns.
™ The group_by() and summarise() functions are used to generate summary
statistics for data that has been grouped together.
™ The arrange() function is used to sort the results in a specific order, while the
count() function is used to determine the number of distinct values. A short
sketch of several of these verbs appears below.
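As mentioned above, here is a short hedged sketch of rename(), recode() and mutate() applied to the census data; the column names and the coded values "M" and "F" are assumed purely for illustration:
census %>%
  # rename(): give the education column a shorter name
  rename(edu = education) %>%
  # mutate() with recode(): replace values within a column (old = new);
  # the codes "M" and "F" are hypothetical
  mutate(sex = recode(sex, "M" = "Male", "F" = "Female")) %>%
  # mutate(): derive a new logical column from an existing one
  mutate(is_senior = age >= 60) %>%
  head()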

Select Function
# To select the following columns
mycols <- select(census, age, education, occupation)
head(mycols)
# To select all columns from education to relationship
mycols <- select(census, education:relationship)
# To print the first 5 rows
head(mycols, 5)
# To select columns by numeric index (here, columns 6 to 9 of census)
mycols <- select(census, c(6:9))
head(mycols)
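The remaining verbs combine in the same style. Below is a hedged sketch using the same census data, assuming it contains age and occupation columns:
# filter(): keep only respondents older than 30
older <- filter(census, age > 30)
head(older)
# group_by() + summarise() + arrange(): average age per occupation,
# sorted in descending order; the verbs chain naturally with the pipe
census %>%
  group_by(occupation) %>%
  summarise(mean_age = mean(age, na.rm = TRUE)) %>%
  arrange(desc(mean_age))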
