Data-Analytics - All UNITS
– I Semester
(19BT71201) DATA ANALYTICS
(Common to CSSE and IT)
COURSE DESCRIPTION: The course provides Introduction to Data Analytics and its Life
Cycle, Review of Basic Data Analytic Methods Using R, Advanced Analytical Theory and
Methods, Advanced Analytics-Technology and Tools: In-Database Analytics and
Communicating and Operationalizing an Analytics Project.
COURSE OUTCOMES: After successful completion of this course, the students will be able to:
CO1. Use Analytical Architecture and its life cycle in Data Analytics
CO2. Analyze and Visualize the Data Analytics Methods using R.
CO3. Apply Advanced Analytical Methods for Text Analysis and Time-Series Analysis.
CO4. Develop Analytical Report for given Analytical problems.
CO5. Analyze and Design Data Analytics Application on Societal Issues.
DETAILED SYLLABUS:
UNIT I – INTRODUCTION TO DATA ANALYTICS and R (9 periods)
Practice in Analytics: BI versus Data Science, Current Analytical Architecture, Emerging
Big Data Ecosystem and a New Approach to Analytics. Data Analytics Life Cycle: Key
Roles for a Successful Analytics Project, Background and Overview of Data Analytics
Lifecycle Phases - Discovery Phase, Data Preparation Phase, Model Planning, Model
Building, Communicate Results, Operationalize. Introduction to R: R Graphical User
Interfaces, Data Import and Export, Attribute and Data Types, Descriptive Statistics.
Findings, Approach, Model Description, Key Points Supported with Data, Model Details,
Recommendations, Additional Tips on Final Presentation, Providing Technical
Specifications and Code, Data Visualization.
UNIT V –DATA ANALYTICS APPLICATIONS (9 periods)
Text and Web: Data Acquisition, Feature Extraction, Tokenization, Stemming, Conversion
to Structured Data, Sentiment Analysis, Web Mining.
Recommender Systems: Feedback, Recommendation Tasks, Recommendation
Techniques, Final Remarks.
2. Joao Moreira, Andre Carvalho, Andre Carlos Ponce de Leon Ferreira Carvalho, Tomas
Horvath, A General Introduction to Data Analytics, John Wiley and Sons,
1st Edition, 2019.
REFERENCE BOOKS:
1. Anil Maheshwari, Data Analytics Made Accessible, Lake Union Publishing,
1st Edition, 2017.
2. Richard Dorsey, Data Analytics: Become a Master in Data Analytics, CreateSpace
Independent Publishing Platform, 2017.
Lesson Plan
S. No. | Topic | No. of periods required | Book(s) followed | Course Outcomes | Blooms Level | Remarks
Bayes in R
Total no of periods required: 09
UNIT - III: ADVANCED ANALYTICAL TECHNOLOGY AND METHODS
16. | Time Series Analysis: Overview of Time Series Analysis | 1 | T1 | CO3 | BL2
17. | Box-Jenkins Methodology, ARIMA Model | 1 | T1 | CO3 | BL2
18. | Autocorrelation Function (ACF), Autoregressive Models, Moving Average Models | 1 | T1 | CO3 | BL3
19. | ARMA and ARIMA Models | 1 | T1 | CO3 | BL2
20. | Building and Evaluating an ARIMA Model, Reasons to Choose and Cautions | 1 | T1 | CO3 | BL2
21. | Text Analysis: Text Analysis Steps, A Text Analysis Example | 1 | T1 | CO3 | BL3
22. | Collecting Raw Text, Representing Text | 1 | T1 | CO3 | BL2
23. | Term Frequency-Inverse Document Frequency (TFIDF), Categorizing Documents by Topics | 1 | T1 | CO3 | BL2
24. | Determining Sentiments, Gaining Insights | 1 | T1 | CO3 | BL4
Remarks | Case Study: Customer Response Prediction and Profit Optimization
Total no of periods required: 09
UNIT - IV: ANALYTICAL DATA REPORT AND VISUALIZATION
25. | Communicating and Operationalizing an Analytics Project, Creating the Final Deliverables: Developing Core Material for Multiple Audiences | 1 | T1 | CO4 | BL2
26. | Project Goals, Main Findings | 1 | T1 | CO4 | BL3
27. | Approach, Model Description | 2 | T1 | CO4 | BL3
28. | Key Points Supported with Data | 1 | T1 | CO4 | BL4
29. | Model Details, Recommendations | 1 | T1 | CO4 | BL3
30. | Additional Tips on Final Presentation | 1 | T1 | CO4 | BL4
31. | Providing Technical Specifications and Code | 1 | T1 | CO4 | BL2
32. | Data Visualization | 1 | T1 | CO4 | BL2
Remarks | Case Study: Predictive Modeling of Big Data with Limited Memory
Total no of periods required: 09
UNIT – V: DATA ANALYTICS APPLICATIONS
33. | Text and Web: Data Acquisition, Feature Extraction | 1 | T1 | CO5 | BL2
34. | Tokenization, Stemming, Conversion to Structured Data | 1 | T1 | CO5 | BL2
35. | Sentiment Analysis, Web Mining | 1 | T1 | CO5 | BL3
36. | Recommender Systems: Feedback | 1 | T1 | CO5 | BL4
37. | Recommendation Tasks | 1 | T1 | CO5 | BL2
38. | Recommendation Techniques, Final Remarks | 1 | T1 | CO5 | BL3
39. | Social Network Analysis: Representing Social Networks | 1 | T1 | CO5 | BL4
40. | Basic Properties of Nodes | 1 | T1 | CO5 | BL2
41. | Basic and Structural Properties of Networks | 1 | T1 | CO5 | BL2
Remarks | Case Study: Alpine Miner and Data Wrangler
Total no of periods required: 09
Grand total periods required: 45
TEXT BOOKS:
T1. EMC Education Services, Data Science and Big Data Analytics – Discovering,
Analyzing, Visualizing and Presenting Data, John Wiley and Sons, 2015.
T2. Joao Moreira, Andre Carvalho, Andre Carlos Ponce de Leon Ferreira Carvalho, Tomas
Horvath, A General Introduction to Data Analytics, John Wiley and Sons,
1st Edition, 2019.
UNIT-I: Chapter-I
PRACTICE IN ANALYTICS
BI versus Data Science
Current Analytical Architecture
Emerging Big Data Ecosystem and New Approach to Analytics.
Points to Remember:
• Keeping up with this huge influx of data is difficult.
• More challenging still is analyzing vast amounts of data that does not conform to
traditional notions of data structure, in order to identify meaningful patterns and
extract useful information.
Note:
These challenges of the data deluge present the opportunity to transform
business, government, science, and everyday life.
Several industries have led the way in developing their ability to gather and exploit
data:
• Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived by
processing billions of transactions, as shown in figure 1.
• Mobile phone companies analyze subscribers' calling patterns to determine, for
example, whether a caller's frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to defect,
the mobile phone company can proactively offer the subscriber an incentive to
remain in her contract.
• For companies such as LinkedIn and Facebook, data itself is their primary product.
The valuations of these companies are heavily derived from the data they gather and
host, which contains more and more intrinsic value as the data grows.
Another definition of Big Data comes from the McKinsey Global report from 2011:
• Big Data is data whose scale, distribution, diversity, and/or timeliness require the use
of new technical architectures and analytics to enable insights that unlock new
sources of business value.
• Social media and genetic sequencing are among the fastest-growing sources of
Big Data and examples of untraditional sources of data being used for analysis.
For Example:
• In 2012 Facebook users posted 700 status updates per second worldwide, which
can be leveraged to deduce latent interests or political views of users and show
relevant ads.
• For instance, an update in which a woman changes her relationship status from
"single" to "engaged" would trigger ads on bridal dresses, wedding planning, or
name-changing services.
• Facebook can also construct social graphs to analyze which users are connected to
each other as an interconnected network.
• In March 2013, Facebook released a new feature called "Graph Search," enabling
users and developers to search social graphs for people with similar interests,
hobbies, and shared locations.
• Another example comes from genomics. Genetic sequencing and human genome
mapping provide a detailed understanding of genetic makeup and lineage. The health
care industry is looking toward these advances to help predict which illnesses a
person is likely to get in his lifetime and take steps to avoid these maladies.
Application of Big Data analytics
Data Structures
Big data can come in multiple forms, including structured and unstructured data such as
financial data, text files, multimedia files, and genetic mappings.
Contrary to much of the traditional data analysis performed by organizations, most of
the Big Data is unstructured or semi-structured in nature, which requires different
techniques and tools to process and analyze.
E.g.: R Programming, Tableau Public, SAS, Python, etc.
3. Quasi-structured data: Textual data with erratic data formats that can be
formatted with effort, tools, and time (for instance, web click stream data that may
contain inconsistencies in data values and formats) shown in figure 5.
E.g.: Google Queries
4. Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video.
Figure.4 Semi-structured data
Table 1 shows types of data repositories from an analyst perspective, and Table 2
shows business drivers for advanced analytics.
Analyst Perspective on Data Repositories
TABLE 1. Types of Data Repositories, from an Analyst Perspective
1.1.1 BI versus Data Science
Figure.6 Business Intelligence (BI) vs. Data Science
Figure 6 shows the difference between Business Intelligence vs. Data Science.
1.1.2 Current Analytical Architecture
• These are high-priority operational processes getting critical data feeds from the
data warehouses and repositories.
4. At the end of this workflow, analysts get data provisioned for their downstream
analytics, as shown in figure 7.
• Because users generally are not allowed to run custom or intensive analytics on
production databases, analysts create data extracts from the EDW to analyze data
offline in R or other local analytical tools.
• Many times these tools are limited to in-memory analytics on desktops analyzing
samples of data, rather than the entire population of a dataset.
• Because these analyses are based on data extracts, they reside in a separate
location, and the results of the analysis (and any insights on the quality of the data or
anomalies) are rarely fed back into the main data repository.
The data now comes from multiple sources (as shown in figure 8), such as
these:
Medical information, such as genomic sequencing and diagnostic imaging
Photos and video footage uploaded to the World Wide Web
Video surveillance, such as the thousands of video cameras spread across a city
Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smart phones
Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry infrastructures
Nontraditional IT devices, including the use of radio-frequency identification
(RFID) readers, GPS navigation systems, and seismic processing
FIGURE.8 Data evolution and the rise of Big Data sources
1.1.3 Emerging Big Data Ecosystem and a New Approach to Analytics
Four main groups of players (shown in figure 9)
Data devices
Games, smartphones, computers, etc.
Data collectors
Phone and TV companies, Internet, Gov't, etc.
Data aggregators – make sense of data
Websites, credit bureaus, media archives, etc.
Data users and buyers
Banks, law enforcement, marketers, employers, etc.
Figure.9 Main groups in Big Data
Key Roles for the New Big Data Ecosystem (shown in the figures below)
1. Deep analytical talent
• Advanced training in quantitative disciplines – e.g., math, statistics,
machine learning
• Examples of professionals: statisticians, economists, mathematicians, and the new
role of the Data Scientist
2. Data savvy professionals
• Savvy but less technical than group 1
• Examples of professionals: financial analysts, market research analysts, life
scientists, operations managers, and business and functional managers
3. Technology and data enablers
• Support people – e.g., DB admins, programmers, etc.
• Data scientists are generally thought of as having five main sets of skills and
behavioral characteristics, as shown in Figure.
UNIT-I: Chapter-II
Data Analytics Life Cycle
Key Roles for a Successful Analytics Project
Background and Overview of Data Analytics Lifecycle Phases
Discovery Phase
Data Preparation Phase
Model Planning
Model Building
Communicate Results
Operationalize
Figure.11 Key roles in Big Data
1.2.2 Background and Overview of Data Analytics Lifecycle Phases
Data Analytics Lifecycle defines the analytics process and best practices from
discovery to project completion.
The Lifecycle employs aspects of:
Scientific method
Cross Industry Standard Process for Data Mining (CRISP-DM)
Process model for data mining
Davenport's DELTA framework
Hubbard's Applied Information Economics (AIE) approach
MAD Skills: New Analysis Practices for Big Data by Cohen et al.
1.2.3 Discovery Phase
Figure.12 Data Analytics Lifecycle: Discovery
Phase 1- Discovery (shown in figure 12):
• In Phase 1, the team learns the business domain, including relevant history such
as whether the organization or business unit has attempted similar projects in the
past from which they can learn.
• The team assesses the resources available to support the project in terms of
people, technology, time, and data.
• Important activities in this phase include framing the business problem as an
analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.
1. Learning the Business Domain
• Understanding the domain area of the problem is essential.
• At this early stage in the process, the team needs to determine how much business
or domain knowledge the data scientist needs to develop models in Phases 3 and 4.
• These data scientists have deep knowledge of the methods, techniques, and
ways for applying heuristics to a variety of business and conceptual problems.
2. Resources:
• Ensure the project team has the right mix of domain experts, customers,
analytic talent, and project management to be effective.
• In addition, evaluate how much time is needed and if the team has the right
breadth and depth of skills.
• After taking inventory of the tools, technology, data, and people, consider if the
team has sufficient resources to succeed on this project, or if additional resources are
needed. Negotiating for resources at the outset of the project, while scoping the
goals, objectives, and feasibility, is generally more useful than later in the
process and ensures sufficient time to execute it properly.
Following is a brief list of common questions that are helpful to ask during the
discovery phase when interviewing the project sponsor. The responses will begin to
shape the scope of the project and give the team an idea of the goals and objectives
of the project.
• What business problem is the team trying to solve?
• What is the desired outcome of the project?
• What data sources are available?
• What industry issues may impact the analysis?
• What timelines need to be considered?
• Who could provide insight into the project?
• Who has final decision-making authority on the project?
• How will the focus and scope of the problem change if the following dimensions
change:
• Time: Analyzing 1 year or 10 years' worth of data?
• People: Assess impact of changes in resources on project timeline.
• Risk: Conservative to aggressive
• Resources: None to unlimited (tools, technology, systems)
• Size and attributes of data: Including internal and external data sources
1.2.4 Data Preparation Phase
Figure.13 Data Analytics Lifecycle: Data Prep
Includes steps to explore, preprocess, and condition data.
Create robust environment – analytics sandbox.
Data preparation tends to be the most labor-intensive step in the analytics
lifecycle.
Often at least 50% of the data science project's time.
The data preparation phase is generally the most iterative and the one that teams
tend to underestimate most often, as shown in figure 13.
4. Data Conditioning
Figure.14 Common tools for data preparation
1.2.5 Model Planning
Phase 3: Model Planning in Industry Verticals
1.2.6 Model Building
viii. Is a different form of the model required to address the business problem?
1.2.7 Communicate Results
Figure.17 Data Analytics Lifecycle: Communicate Results
In this last phase, the team communicates the benefits of the project more broadly
and sets up a pilot project to deploy the work in a controlled way, as shown in figure
17.
Risk is managed effectively by undertaking a small-scope pilot deployment before a
wide-scale rollout.
During the pilot project, the team may need to execute the algorithm more
efficiently in the database rather than with in-memory tools like R, especially with
larger datasets.
To test the model in a live setting, consider running the model in a production
environment for a discrete set of products or a single line of business.
Monitor model accuracy and retrain the model if necessary.
Figure.18 Data Analytics Lifecycle: Operationalize
UNIT-I Chapter-III
1.3.1 Introduction to R Studio, Basic operations and import and
export of data using R Tool.
Agenda:
1. About Data Mining
2. About R and RStudio
3. Datasets
4. Data Import and Export
a. Save and Load R Data
Data mining is the process of discovering interesting knowledge from large amounts of
data [Han and Kamber, 2000].
It is an interdisciplinary field with contributions from many areas, such as:
o Statistics, machine learning, information retrieval, pattern recognition and
bioinformatics.
Data mining is widely used in many domains, such as:
o Retail, Finance, telecommunication and social media.
In real world applications, a data mining process can be broken into six major
phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation and
6. Deployment
as defined by the CRISP-DM (Cross Industry Standard Process for Data Mining).
Figure 19 shows the different panels of the RStudio environment.
About R:
R is a free software environment for statistical computing and graphics.
It provides a wide variety of statistical and graphical techniques (http://www.r-
project.org/).
R can be easily extended with 7324 packages available on CRAN (Comprehensive R
Archive Network) (http://cran.r-project.org/)
To help users to find out which R packages to use, the CRAN Task Views are a good
guidance (http://cran.r-project.org/web/views/). They provide collections of packages
for different tasks. Some Task Views related to data mining are:
o Machine Learning & Statistical Learning
o Cluster Analysis & Finite Mixture Models
o Time Series Analysis
o Natural Language Processing
o Multivariate Statistics and
o Analysis of Spatial Data.
RStudio
RStudio is an integrated development environment (IDE) for R and can run on
various operating systems like Windows, Mac OS X and Linux. It is a very useful and
powerful tool for R programming.
When RStudio is launched for the first time, you can see a window similar to below
Figure. There are four panels:
1. Source panel (top left), which shows your R source code. If you cannot see the
source panel, you can find it by clicking menu "File", "New File" and then "R Script".
You can run a line or a selection of R code by clicking the "Run" button on top of the
source panel, or by pressing "Ctrl + Enter".
2. Console panel (bottom left), which shows outputs and system messages displayed
in a normal R console;
3. Environment/History/Presentation panel (top right), whose three tabs show
respectively all objects and functions loaded in R, a history of submitted R code, and
presentations generated with R;
4. Files/Plots/Packages/Help/Viewer panel (bottom right), whose tabs show
respectively a list of files, plots, R packages installed, help documentation and local
web content.
In addition to the above three folders, which are useful for most projects, you may
create the additional folders below depending on your project and preference:
1. rawdata, where to put all raw data,
2. models, where to put all produced analytics models, and
3. reports, where to put your analysis reports.
Datasets
1. The Iris Dataset
2. The Bodyfat Dataset
> str(iris)
'data.frame': 150 observations (records, or rows) of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Data Import and Export
An alternative way to save and load R data objects is using functions saveRDS() and
readRDS(). They work in a similar way as save() and load().
The differences are:
a. multiple R objects can be saved into one single file with save(), but only one
object can be saved in a file with saveRDS(); and
b. readRDS() enables us to restore the data under a different object name, while
load() restores the data under the same object name as when it was saved.
> a <- 1:10
> saveRDS(a, file="./data/mydatafile2.rds")
> a2 <- readRDS("./data/mydatafile2.rds")
> print(a2)
[1] 1 2 3 4 5 6 7 8 9 10
R also provides function save.image() to save everything in current workspace into a single
file, which is very convenient to save your current work and resume it later, if the data
loaded into R are not very big.
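For comparison, a minimal sketch of save() and load() (the file path is only illustrative):
x <- 1:5
y <- letters[1:3]
save(x, y, file="./data/mydatafile1.RData")  # several objects saved into one file
rm(x, y)                                     # remove them from the workspace
load("./data/mydatafile1.RData")             # x and y reappear under their original names
print(x)
print(y)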
1.3.2 Implement Data Exploration and Visualization on different Datasets to explore
multiple and Individual Variables.
Agenda:
5. R If ... Else
6. R While Loop
7. R Functions
a. Creating a Function
b. Arguments
8. Data Structures
a. Vectors
b. Lists
c. Matrices
d. Arrays
e. Data Frames
9. R Graphics
a. Plot
b. Line
c. Scatterplot
d. Pie Charts
e. Bars
10.R Statistics
11.Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files
R If ... Else
Example:
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
Output: "b is greater than a"
if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}
R Loops
Loops are handy because they save time, reduce errors, and they make code more readable.
while loops
for loops
R While Loops
Example
i <- 1
while (i < 6) {
print(i)
i <- i + 1
}
Note:
Break
With the break statement, we can stop the loop even if the while condition is TRUE.
Next
With the next statement, we can skip an iteration without terminating the loop.
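A small sketch showing break and next inside a while loop:
i <- 0
while (i < 10) {
  i <- i + 1
  if (i == 3) {
    next     # skip printing 3 and move on to the next iteration
  }
  if (i == 6) {
    break    # stop the loop completely once i reaches 6
  }
  print(i)
}
# Prints 1, 2, 4 and 5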
For Loops
A for loop is used for iterating over a sequence.
Example
for (x in 1:10) {
print(x)
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
R Functions
A function is a block of code which only runs when it is called.
You can pass data, known as parameters, into a function.
A function can return data as a result.
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() {
# create a function with the name my_function
print("Hello World!")
}
my_function() # call the function named my_function
Arguments
Information can be passed into functions as arguments.
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}
my_function("Peter")
my_function("Lois")
my_function("Stewie")
Data Structures
Vectors
To combine the list of items to a vector, use the c() function and separate the items
by a comma.
Example
# Vector of strings
fruits <- c("banana", "apple", "orange")
# Print fruits
fruits
Example
# Vector of numerical values
numbers <- c(1, 2, 3)
# Print numbers
numbers
Example
# Vector of logical values
log_values <- c(TRUE, FALSE, TRUE, FALSE)
log_values
Vector Length
To find out how many items a vector has, use the length() function.
Sort a Vector
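For example, the sort() function sorts the items of a vector alphabetically or numerically (an illustrative sketch):
fruits <- c("banana", "apple", "orange", "mango", "lemon")
numbers <- c(13, 3, 5, 7, 20, 2)
sort(fruits)   # sorts the strings alphabetically
sort(numbers)  # sorts the numbers in ascending order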
Access Vectors
You can access the vector items by referring to its index number inside brackets []. The first
item has index 1, the second item has index 2, and so on. # Ex. fruits[1]
You can also access multiple elements by referring to different index positions with the c()
function.
Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")
Note:
2. The seq() function has three parameters: from is where the sequence starts, to is
where the sequence stops, and by is the interval of the sequence.
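For example (a short sketch of seq() with the three parameters named above):
seq(from = 1, to = 10, by = 2)
# Output: [1] 1 3 5 7 9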
Lists
A list in R can contain many different data types inside it. A list is a collection of data which is
ordered and changeable.
Example
# List of strings
thislist <- list("apple", "banana", "cherry")
Access Lists
You can access the list items by referring to its index number, inside brackets. The first item
has index 1, the second item has index 2, and so on.
Check if Item Exists
To find out if a specified item is present in a list, use the %in% operator.
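A brief sketch of the %in% check:
thislist <- list("apple", "banana", "cherry")
"apple" %in% thislist   # TRUE
"mango" %in% thislist   # FALSE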
Matrices
Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix
Access More Than One Row
More than one row can be accessed if you use the c() function.
Ex. thismatrix[c(1,2),]
Access More Than One Column
Ex. thismatrix[, c(1,2)]
Add Rows and Columns
Use the cbind() function to add additional columns in a Matrix.
Use the rbind() function to add additional rows in a Matrix.
Remove Rows and Columns
Use the c() function to remove rows and columns in a Matrix.
#Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
Number of Rows and Columns
Use the dim() function to find the number of rows and columns in a Matrix.
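A short sketch combining these functions:
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
newmatrix <- cbind(thismatrix, c(10, 11, 12))  # add a third column
newmatrix <- rbind(newmatrix, c(13, 14, 15))   # add a fourth row
dim(newmatrix)  # Output: [1] 4 3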
Arrays
Compared to matrices, arrays can have more than two dimensions.
We can use the array() function to create an array, and the dim parameter to specify
the dimensions.
Example
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray
Output:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
,,1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
,,2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
Data Frames
Data Frames can have different types of data inside it. While the first column can
be character, the second and third can be numeric or logical. However, each
column should have the same type of data.
Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Use the summary() function to summarize the data from a Data Frame:
Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame
summary(Data_Frame)
Output:
Training Pulse Duration
1 Strength 100 60
2 Stamina 150 30
3 Other 120 45
Training Pulse Duration
Other :1 Min. :100.0 Min. :30.0
Stamina :1 1st Qu.:110.0 1st Qu.:37.5
Strength:1 Median :120.0 Median :45.0
Mean :123.3 Mean :45.0
3rd Qu.:135.0 3rd Qu.:52.5
Max. :150.0 Max. :60.0
Access Items
Example:
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Output:
Training
1 Strength
2 Stamina
3 Other
[1] Strength Stamina Other
Levels: Other Stamina Strength
[1] Strength Stamina Other
Levels: Other Stamina Strength
Plot
The plot() function is used to draw points (markers) in a diagram.
The function takes parameters for specifying points in the diagram.
Parameter 1 specifies points on the x-axis.
Parameter 2 specifies points on the y-axis.
Example
Draw one point in the diagram, at the position where x = 1 and y = 3:
>plot(1, 3)
To draw more points, use vectors:
Example
Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
>plot(c(1, 8), c(3, 10))
Multiple Points
Example
>plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
Draw a Line
The plot() function also takes a type parameter with the value l to draw a line to connect
all the points in the diagram:
Plot Labels
The plot() function also accept other parameters, such as main, xlab and ylab if you
want to customize the graph with a main title and different labels for the x and y-axis:
>plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
Graph Appearance
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while 0.5 means 50% smaller,
and 2 means 100% larger):
Example
plot(1:10, cex=2)
Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)
Lines
A line graph has a line that connects all the points in a diagram.
To create a line, use the plot() function and add the type parameter with a value of
"l":
Example
plot(1:10, type="l")
Line Color
The line color is black by default. To change the color, use the col parameter
Example
plot(1:10, type="l", col="blue")
Line Width
To change the width of the line, use the lwd parameter (1 is default, while 0.5 means
50% smaller, and 2 means 100% larger).
Line Styles
The line is solid by default. Use the lty parameter with a value from 0 to 6 to specify
the line format.
For example, lty=3 will display a dotted line instead of a solid line
Multiple Lines
To display more than one line in a graph, use the plot() function together with the
lines() function.
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue")
lines(line2, type="l", col = "red")
Scatter Plot
A "scatter plot" is a type of plot used to display the relationship between two
numerical variables, and plots one dot for each observation.
It needs two vectors of same length, one for the x-axis (horizontal) and one for
the y-axis (vertical).
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)
Compare Plots
To compare the plot with another plot, use the points() function:
Example
Draw two plots on the same figure:
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars:
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)
plot(x1, y1)
points(x2, y2, col = "red")   # add the second day's points to the same plot
Pie Charts
A pie chart is a circular graphical view of data.
Use the pie() function to draw pie charts.
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)
Start Angle
You can change the start angle of the pie chart with the init.angle parameter.
The value of init.angle is defined with angle in degrees, where default angle is
0.
Example
Start the first pie at 90 degrees:
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)
Colors
You can add a color to each pie with the col parameter.
Example
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, main = "Fruits", col = colors)
Legend
To add a list of explanation for each pie, use the legend() function.
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, label = mylabel, main = "Pie Chart", col = colors)
# Display the explanation box
legend("bottomright", mylabel, fill = colors)
Bar Charts
A bar chart uses rectangular bars to visualize data. Bar charts can be displayed
horizontally or vertically. The height or length of the bars are proportional to the
values they represent.
Use the barplot() function to draw a vertical bar chart.
Example
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, col = "red")
Statistics Introduction
Data Set
A data set is a collection of data, often presented in a table.
There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road
Tests), which is retrieved from the 1974 Motor Trend US Magazine.
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data set:
?mtcars
Get Information
Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
Data_Cars <- mtcars  # create a variable of the mtcars data set for better organization
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)
Use the rownames() function to get the name of each row in the first column,
which is the name of each car: rownames(Data_Cars)
From the examples above, we have found out that the data set has 32 observations (Mazda
RX4, Mazda RX4 Wag, Datsun 710, etc) and 11 variables (mpg, cyl, disp, etc).
A variable is defined as something that can be measured or counted.
Here is a brief explanation of the variables from the mtcars data set:
use the summary() function to get a statistical summary of the data:
summary(Data_Cars).
The summary() function returns six statistical numbers for each variable:
1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max
Max Min
Example
#Find the largest and smallest value of the variable hp (horsepower).
Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)
For example, we can use the which.max() and which.min() functions to find the
index position of the max and min value in the table:
Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)
Or even better, combine which.max() and which.min() with the rownames() function to get
the name of the car with the largest and smallest horsepower:
Example
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]
Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424
Example
Find the average weight (wt) of a car:
Data_Cars <- mtcars
mean(Data_Cars$wt)
Median
The median value is the value in the middle, after you have sorted all the values.
Note: If there are two numbers in the middle, you must divide the sum of those numbers
by two, to find the median.
Example
#Find the mid point value of weight (wt):
Data_Cars <- mtcars
median(Data_Cars$wt)
Mode
The mode value is the value that appears the most number of times.
Example: names(sort(-table(Data_Cars$wt)))[1]
Percentiles
Percentiles are used in statistics to give you a number that describes the value that a
given percent of the values are lower than.
Example
Data_Cars <- mtcars
# c() specifies which percentile you want
quantile(Data_Cars$wt, c(0.75))
Note:
1. If you run the quantile() function without specifying the c() parameter, you will get
the percentiles of 0, 25, 50, 75 and 100.
2. Quartiles
a. Quartiles are data divided into four parts, when sorted in an ascending order:
b. The value of the first quartile cuts off the first 25% of the data
c. The value of the second quartile cuts off the first 50% of the data
d. The value of the third quartile cuts off the first 75% of the data
e. The value of the fourth quartile cuts off 100% of the data
12.Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files
Look at Data
Note: The iris data is used in this for demonstration of data exploration with R.
Execute the following commands and note the output for each and write the purpose
of the command in comments using #:
> dim(iris)
> names(iris)
> str(iris)
> attributes(iris)
> iris[1:5, ]
> head(iris)
> tail(iris)
> ## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
> iris[idx, ]
> iris[1:10, "Sepal.Length"]
> iris[1:10, 1]
> iris$Sepal.Length[1:10]
Explore Individual Variables
Execute the following commands and note the output for each and write the purpose
of the command in comments using #:
> summary(iris)
> quantile(iris$Sepal.Length)
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
> var(iris$Sepal.Length)
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)
> pie(table(iris$Species))
> barplot(table(iris$Species))
#calculate covariance and correlation between variables with cov() and cor().
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[,1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[,1:4])
> aggregate(Sepal.Length ~ Species, summary, data=iris)
>boxplot(Sepal.Length~Species, data=iris, xlab="Species", ylab="Sepal.Length")
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))
> ## same function as above
> # plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species,
pch=as.numeric(iris$Species))
> smoothScatter(iris$Sepal.Length, iris$Sepal.Width)
> pairs(iris)
More Explorations
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> distMatrix <- as.matrix(dist(iris[,1:4]))
> heatmap(distMatrix)
> library(lattice)
> levelplot(Petal.Width~Sepal.Length*Sepal.Width, iris, cuts=9,
+ col.regions=grey.colors(10)[10:1])
> filled.contour(volcano, color=terrain.colors, asp=1,
+ plot.axes=contour(volcano, add=T))
> persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")
> library(MASS)
> parcoord(iris[1:4], col=iris$Species)
> library(lattice)
> parallelplot(~iris[1:4] | Species, data=iris)
> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~.)
DATA ANALYTICS
Unit – II chapter I: Exploratory Data Analysis
Exploratory Data Analysis:
‣ Visualization before analysis
‣ Dirty data
‣ Visualizing a single variable
‣ Examining multiple variables
‣ Data Exploration versus Presentation
Exploratory Data Analysis (EDA):
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data
sets and summarize their main characteristics, often employing data visualization methods.
Fig1.EDA process
Steps to explain the EDA process (shown in figure 1):
Look at the structure of the data: number of data points, number of features, feature names,
data types, etc.
When dealing with multiple data sources, check for consistency across datasets.
Identify what data signifies (called measures) for each of data points and be mindful while
obtaining metrics.
Calculate key metrics for each data point (summary analysis):
a. Measures of central tendency (Mean, Median, Mode);
b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation);
c. Measures of skewness and kurtosis.
Investigate visuals:
a. Histogram for each variable;
b. Scatterplot to correlate variables.
Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
Identify outliers and mark them. Based on context, either discard outliers or analyze them
separately.
Estimate missing points using data imputation techniques.
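A sketch of a few of these steps in R, using the built-in iris data purely for illustration:
str(iris)        # structure: number of rows, columns and data types
summary(iris)    # central tendency and spread for each variable
sd(iris$Sepal.Length)                       # a measure of dispersion
hist(iris$Sepal.Length)                     # histogram for one variable
plot(iris$Sepal.Length, iris$Petal.Length)  # scatterplot to correlate two variables
boxplot.stats(iris$Sepal.Width)$out         # values flagged as outliers by the 1.5*IQR rule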
2.1. 1: Visualization before analysis
What is Visualization?
Data visualization is the representation of data through the use of common graphics, such as
charts, plots, infographics, and even animations. These visual displays of information
communicate complex data relationships and data-driven insights in a way that is easy to
understand.
Visualization Tools:
Tableau (Tableau is a data visualization tool that can be used to create interactive graphs,
charts, and maps), QlikView, Microsoft Power BI, Datawrapper, Plotly, Excel, Zoho analytics.
Figure 2 shows the benefits of data visualization tools.
Fig 2. Benefits of Data Visualization Tools
Visualization before analysis: Why is it important?
Data visualization allows business users to gain insight into their vast amounts of data. It
benefits them to recognize new patterns and errors in the data. Making sense of these
patterns helps the users pay attention to areas that indicate red flags or progress. This
process, in turn, drives the business ahead. Figure 3 shows importance of Data visualization.
Example: covid cases (red flag)
Fig4.Anscombe's quartet
Figure 4 shows the Anscombe‘s quartet datasets and figure 5 shows Anscombe's quartet
visualized as scatterplots.
Fig5.Anscombe's quartet visualized as scatterplots
R code:
>install.packages("ggplot2")
>data(anscombe)  # load the anscombe dataset into the current workspace
>anscombe
>nrow(anscombe)  # number of rows
[1] 11
# generate levels to indicate which group each data point belongs to
>levels <- gl(4, nrow(anscombe))
>levels
# group anscombe into a data frame
>mydata <- with(anscombe, data.frame(x=c(x1,x2,x3,x4), y=c(y1,y2,y3,y4),
mygroup=levels))
>mydata
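To visualize the four groups as scatterplots (as in figure 5), a sketch along the following lines can be used once ggplot2 is installed as above:
>library(ggplot2)
>ggplot(mydata, aes(x, y)) + geom_point(colour = "darkorange", size = 3) +
facet_wrap(~mygroup) +                   # one panel per Anscombe group
geom_smooth(method = "lm", se = FALSE)   # near-identical fitted lines in every panel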
2.1.2 Dirty data
Dirty data can be detected in the data exploration phase with visualizations. In general,
analysts should look for anomalies, verify the data with domain knowledge, and decide the
most appropriate approach to clean the data.
Example to explain dirty data: (Sample taken as student details)
Consider a scenario in which a bank is conducting data analyses of its account holders to
gauge customer retention as shown in figure 6 and 7.
> iris[1:5, ]
> head(iris)
> tail(iris)
## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
> iris[idx, ]
> iris[1:10, "Sepal.Length"]
> iris[1:10, 1]
> iris$Sepal.Length[1:10]
b) Visualizing single variable
Example Functions for Visualizing a Single Variable
Figure 8 (a) Dotchart on the miles per gallon of cars (b) Barplot on the distribution of
car cylinder counts
Histogram and Density Plot:
It includes a histogram of household income. The histogram shows a clear concentration of
low household incomes on the left and the long tail of the higher incomes on the right (figure
9 (a) and (b)).
Barplot to visualize multiple variables
> counts <- table(mtcars$gear, mtcars$cyl)
> barplot(counts, main = "Distribution of Car Cylinder Counts and Gears", xlab = "Number of
Cylinders", ylab = "Counts", col = c("#0000FFFF", "#0080FFFF", "#00FFFFFF"), legend =
rownames(counts), beside = TRUE, args.legend = list(x = "top", title = "Number of Gears"))
Scatterplot Matrix:
Fisher's iris dataset [13] includes the measurements in centimeters of the sepal length, sepal
width, petal length, and petal width for 50 flowers from each of three species of iris. The three
species are setosa, versicolor, and virginica.
Scatterplot matrix of Fisher's [13] iris dataset
The R code for generating the scatterplot matrix is provided next.
> colors <- c("red", "green", "blue")
> pairs(iris[1:4], main = "Fisher's Iris Dataset", pch = 21, bg = colors[unclass(iris$Species)])
> par(xpd = TRUE)
> legend(0.2, 0.02, horiz = TRUE, as.vector(unique(iris$Species)), fill = colors, bty = "n")
The vector colors defines the color scheme for the plot. It could be changed to something like
colors <- c("gray50", "white", "black") to make the scatter plots gray scale.
Explore multiple variables:
> barplot(table(iris$Species))
#calculate covariance and correlation between variables with cov() and cor().
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[,1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[,1:4])
> aggregate(Sepal.Length ~ Species, summary, data=iris)
>boxplot(Sepal.Length~Species, data=iris, xlab="Species", ylab="Sepal.Length")
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))
2.1.4: Data Exploration versus presentation
Data exploration means the deep-dive analysis of data in search of new insights.
Data presentation means the delivery of data insights to an audience in a form that makes
clear the implications.
1. Audience - Who is the data for?
For data exploration, the primary audience is the data analyst herself. She is the person
who is both manipulating the data and seeing the results. She needs to work with tight
feedback cycles of defining hypotheses, analyzing data, and visualizing results.
For data presentation, the audience is a separate group of end-users, not the author of the
analysis. These end-users are often non-analytical; they are on the front lines of business
decision-making and may have difficulty connecting the dots between an analysis and the
implications for their job.
2. Message - What do you want to say?
Data exploration is about the journey to find a message in your data. The analyst is trying
to put together the pieces of a puzzle.
Data presentation is about sharing the solved puzzle with people who can take action on
the insights. Authors of data presentations need to guide an audience through the content
with a purpose and point of view.
3. Explanation - What does the data mean?
For the analysts using data exploration tools, the meaning of their analysis can be self-
evident. A 1% jump in your conversion metric may represent a big change that alters your
marketing tactics. The important challenge for the analysts is to answer why this is
happening.
Data presentations carry a heavier burden in explaining the results of analysis. When the
audience isn‘t as familiar with the data, the data presentation author needs to start with
more basic descriptions and context. How do we measure the conversion metric? Is a 1%
change a big deal or not? What is the business impact of this change?
4. Visualizations - How do I show the data?
The visualizations for data exploration need to be easy to create and may often show
multiple dimensions to unearth complex patterns.
For data presentation, it is important that visualizations be simple and intuitive. The
audience doesn't have the patience to decipher the meaning of a chart. I used to love
presenting data in treemaps but found that as a visualization it could seldom stand alone
without a two-minute tutorial to teach new users how to read the content.
5. Interactions - How are data insights created and shared?
In data exploration, analysts work on their own to gather data, connect data across silos, and
dig into the data to find insights. Data exploration is often a solitary activity that only connects
with other people when insights are found and need to be shared.
Data presentation is a collaborative, social activity. The value emerges when insights found
in data are shared with people who understand the context of the business. The dialogue that
emerges is the point, not a failure of the analysis.
Reference: https://stattrek.com/hypothesis-test/difference-in-means
Problem 1: Two-Tailed Test
Within a school district, students were randomly assigned to one of two Math teachers -
Mrs. Smith and Mrs. Jones. After the assignment, Mrs. Smith had 30 students, and Mrs.
Jones had 25 students.
At the end of the year, each class took the same standardized test. Mrs. Smith's students
had an average test score of 78, with a standard deviation of 10; and Mrs. Jones'
students had an average test score of 85, with a standard deviation of 15.
Test the hypothesis that Mrs. Smith and Mrs. Jones are equally effective teachers. Use a
0.10 level of significance. (Assume that student performance is approximately normal.)
Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work
through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.
Null hypothesis: μ1 - μ2 = 0
Alternative hypothesis: μ1 - μ2 ≠ 0
Note that these hypotheses constitute a two-tailed test. The null hypothesis will be
rejected if the difference between sample means is too big or if it is too small.
Formulate an analysis plan. For this analysis, the significance level is 0.10. Using sample
data, we will conduct a two-sample t-test of the null hypothesis.
Analyze sample data. Using sample data, we compute the standard error (SE), degrees
of freedom (DF), and the test statistic (t).
SE = sqrt[(s1²/n1) + (s2²/n2)]
SE = sqrt[(10²/30) + (15²/25)] = sqrt(3.33 + 9)
SE = sqrt(12.33) = 3.51
t = [(x1 - x2) - d] / SE = [(78 - 85) - 0] / 3.51 = -1.99
If you enter 1.99 as the t statistic in the t Distribution Calculator, you will find that
P(t ≤ 1.99) is about 0.973. Therefore, P(t > 1.99) is 1 minus 0.973, or 0.027. Thus, the
P-value = 0.027 + 0.027 = 0.054.
Interpret results: Since the P-value (0.054) is less than the significance level (0.10), we
cannot accept the null hypothesis.
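A sketch of the same calculation in R, using only the summary statistics given in the problem (Welch's approximation is used here for the degrees of freedom, so the p-value comes out close to the 0.054 above):
n1 <- 30; m1 <- 78; s1 <- 10   # Mrs. Smith's class
n2 <- 25; m2 <- 85; s2 <- 15   # Mrs. Jones' class
se <- sqrt(s1^2/n1 + s2^2/n2)  # standard error, about 3.51
t_stat <- (m1 - m2) / se       # about -1.99
df <- (s1^2/n1 + s2^2/n2)^2 / ((s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1))
2 * pt(-abs(t_stat), df)       # two-tailed P-value, roughly 0.05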
Non-parametric test:
Since the Wilcoxon Rank Sum Test does not assume known distributions, it does not deal
with parameters, and therefore we call it a non-parametric test. Whereas the null hypothesis
of the two-sample t test is equal means, the null hypothesis of the Wilcoxon test is usually
taken as equal medians.
Reference: https://www.slideserve.com/italia/wilcoxon-rank-sum-test
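A minimal sketch of the Wilcoxon rank-sum test in R; the two samples below are made up purely for demonstration:
group_a <- c(42, 51, 39, 47, 55, 44)
group_b <- c(58, 62, 49, 65, 60, 57)
# Null hypothesis: the two populations have equal medians
wilcox.test(group_a, group_b)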
The significance level, as mentioned in the Student's t-test discussion, is equivalent to the
type I error. For a significance level such as α = 0.05, if the null hypothesis (μ1 = μ2) is
TRUE, there is a 5% chance that the observed T value based on the sample data will be large
enough to reject the null hypothesis. By selecting an appropriate significance level, the
probability of committing a type I error can be defined before any data is collected or
analyzed. The probability of committing a Type II error is somewhat more difficult to
determine. If two population means are truly not equal, the probability of committing a type
II error will depend on how far apart the means truly are. To reduce the probability of a type
II error to a reasonable level, it is often necessary to increase the sample size.
2.2.5: Power and Sample Size
The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted
by 1- beta where beta is the probability of a type II error. Because the power of a test
improves as the sample size increases, power is used to determine the necessary sample
size. In the difference of means, the power of a hypothesis test depends on the true
difference of the population means. In other words, for a fixed significance level, a larger
sample size is required to detect a smaller difference in the means. In general, the
magnitude of the difference is known as the effect size. As the sample size becomes larger, it
is easier to detect a given effect size, as illustrated in the figure.
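In R, the base function power.t.test() illustrates this relationship; the numbers below are only illustrative:
# Sample size per group needed to detect a difference in means of 5
# (sd = 10, significance level 0.05, power 0.80)
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)
# Halving the effect size roughly quadruples the required sample size
power.t.test(delta = 2.5, sd = 10, sig.level = 0.05, power = 0.80)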
Sales within a region would differ, and this would be true for all four regions (within-group
variation). There might also be an impact of the regions themselves: the mean sales of the
four regions might not all be the same, i.e., there might be variation among regions
(between-group variation).
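A sketch of a one-way ANOVA for this kind of question in R, with simulated monthly sales for four regions (the figures are hypothetical):
set.seed(1)
sales <- c(rnorm(12, mean = 100, sd = 10),   # Region A
           rnorm(12, mean = 105, sd = 10),   # Region B
           rnorm(12, mean = 100, sd = 10),   # Region C
           rnorm(12, mean = 115, sd = 10))   # Region D
region <- factor(rep(c("A", "B", "C", "D"), each = 12))
fit <- aov(sales ~ region)   # partitions between-group and within-group variation
summary(fit)                 # F statistic and p-value for the region effect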
2.2.7: Decision Trees in R
A decision tree is a flowchart-like structure in which each internal node represents a "test"
on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the
outcome of the test, and each leaf node represents a class label (decision taken after
computing all attributes). The paths from root to leaf represent classification rules.
In decision analysis, a decision tree and the closely related influence diagram are used as a
visual and analytical decision support tool, where the expected values (or expected utility) of
competing alternatives are calculated.
A Decision Tree is an algorithm used for supervised learning problems such as classification
or regression. A decision tree or a classification tree is a tree in which each internal (nonleaf)
node is labeled with an input feature shown in figure 11.
Decision trees used in data mining are of two main types:
Classification tree − when the response is a nominal variable, for example whether an email
is spam or not.
Regression tree − when the predicted outcome can be considered a real number (e.g. the
salary of a worker).
Steps:
1. Begin the tree with the root node, say S, which contains the complete dataset.
2. Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
3. Divide S into subsets that contain possible values for the best attribute.
4. Generate the decision tree node which contains the best attribute.
5. Recursively make new decision trees using the subsets of the dataset created in step
3. Continue this process until a stage is reached where you cannot further classify the
nodes; the final node is called a leaf node.
Figure.11 Decision Tree
Applications:
• Marketing
• Companies
• Diagnosis of disease
Example:
So, to solve such problems there is a technique called the Attribute Selection Measure
(ASM). With this measure, we can easily select the best attribute for the nodes of the tree.
There are two popular techniques for ASM, which are:
1. Information Gain
2. Gini Index
• Information gain is the measurement of changes in entropy.
• Entropy is a metric to measure the impurity in a given attribute.
• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
Advantages:
• Easy to read and interpret
• Easy to prepare
• It gives great visual representation
Disadvantages:
• Unstable nature
• Less effective in predicting the outcome of a continuous variable.
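A short sketch of building a classification tree in R with the rpart package (assuming rpart is installed; the iris data is used purely for illustration):
library(rpart)
set.seed(42)
idx <- sample(1:nrow(iris), 100)          # random training rows
train <- iris[idx, ]
test <- iris[-idx, ]
# Grow a classification tree: Species is the class label,
# the four measurements are the input features
fit <- rpart(Species ~ ., data = train, method = "class")
print(fit)                                # text view of the splits
pred <- predict(fit, newdata = test, type = "class")
table(Predicted = pred, Actual = test$Species)   # confusion matrix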
Topic 8: Naïve Bayes in R
Naive Bayes is a Supervised Non-linear classification algorithm in R Programming. Naive
Bayes classifiers are a family of simple probabilistic classifiers based on applying Baye's
theorem with strong(Naive) independence assumptions between the features or
variables.
The Naive Bayes algorithm is called "Naive" because it makes the assumption that the
occurrence of a certain feature is independent of the occurrence of other features.
Theory:
Naive Bayes algorithm is based on Bayes theorem. Bayes theorem gives the conditional
probability of an event A given another event B has occurred.
where,
P(A|B) = Conditional probability of A given B.
P(B|A) = Conditional probability of B given A.
P(A) = Probability of event A.
P(B) = Probability of event B.
For many predictors, we can formulate the posterior probability as follows:
P(A|B1, ..., Bn) ∝ P(B1|A) * P(B2|A) * P(B3|A) * P(B4|A) * ... * P(A)
Example:
Consider a sample space: {HH, HT, TH, TT}
where,
H: Head
T: Tail
P(second coin being head given first coin is tail) = P(A|B)
= [P(B|A) * P(A)] / P(B)
= [P(first coin is tail given second coin is head) * P(second coin being head)] / P(first coin being tail)
= [(1/2) * (1/2)] / (1/2)
= 1/2
= 0.5
Performing Naive Bayes on Dataset:
Using the Naive Bayes algorithm on the built-in iris dataset, which includes 150 observations
and 5 variables or attributes.
R:
# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")
# Loading package
library(e1071)
library(caTools)
library(caret)
# Splitting data into train
# and test data
split <- sample.split(iris, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")
# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# Fitting Naive Bayes Model
# to training dataset
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl
# Predicting on test data'
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix
cm <- table(test_cl$Species, y_pred)
cm
# Model Evaluation
confusionMatrix(cm)
Output: Model classifier_cl:
The model creates the conditional probability for each feature or variable separately. The
a priori probabilities, which indicate the class distribution of the data, are also calculated.
Confusion Matrix:
So, all 20 Setosa are correctly classified as Setosa. Out of 16 Versicolor, 15 are correctly
classified as Versicolor and 1 is classified as Virginica. Out of 24 Virginica, 19 are correctly
classified as Virginica and 5 are classified as Versicolor.
Model Evaluation:
The model achieved about 90% accuracy with a very small p-value. The sensitivity, specificity,
and balanced accuracy values indicate that the model built is good.
Naive Bayes is widely used in industry for sentiment analysis, document categorization,
email spam filtering, and similar tasks.
UNIT III – ADVANCED ANALYTICAL TECHNOLOGY AND METHODS
Chapter – I: Time Series Analysis
Time Series Analysis: Overview of Time Series Analysis, Box-Jenkins Methodology, ARIMA
Model, Autocorrelation Function (ACF), Autoregressive Models, Moving Average Models, ARMA
and ARIMA Models, Building and Evaluating an ARIMA Model, Reasons to Choose and
Cautions.
AGENDA:
Overview of Time Series Analysis
Box-Jenkins Methodology
ARIMA Model
Autocorrelation Function (ACF)
Autoregressive Models
Moving Average Models
ARMA and ARIMA Models
Building and Evaluating an ARIMA Model
Reasons to Choose and Cautions
As shown in Figure 3.1.1, the time series consists of an ordered sequence of 144 values. The
analyses presented here are limited to equally spaced time series of one variable. The goals of
time series analysis are:
• Identify and model the structure of the time series.
• Forecast future values in the time series.
Time series analysis has many applications in finance, economics, biology, engineering,
retail, and manufacturing. Here are a few specific use cases:
Retail sales: For various product lines, a clothing retailer is looking to forecast future
monthly sales. These forecasts need to account for the seasonal aspects of the
customers' purchasing decisions.
Spare parts planning: Companies' service organizations have to forecast future
spare part demands to ensure an adequate supply of parts to repair customer
products. To forecast future demand, complex models for each part number can be
built using input variables such as expected part failure rates, service diagnostic
effectiveness, and forecasted new product shipments.
Stock trading: Some high-frequency stock traders utilize a technique called pairs
trading. In pairs trading, an identified strong positive correlation between the prices of
two stocks is used to detect a market opportunity. Pairs trading is one of many
techniques that falls into a trading strategy called statistical arbitrage.
Developed by George Box and Gwilym Jenkins, the Box-Jenkins methodology for time
series analysis involves the following three main steps:
1. Condition data and select a model.
• Identify and account for any trends or seasonality in the time series.
• Examine the remaining time series and determine a suitable model.
2. Estimate the model parameters.
3. Assess the model and, if necessary, return to step 1.
The covariance of yt and yt+h is a measure of how the two variables, yt and yt+h, vary together.
It is expressed as:
cov(yt, yt+h) = E[(yt − µt)(yt+h − µt+h)]    -> Eq. 1.1
If two variables are independent of each other, their covariance is zero. If the variables
change together in the same direction, the variables have a positive covariance.
Conversely, if the variables change together in the opposite direction, the variables have a
negative covariance.
For a stationary time series, by condition (a), the mean is a constant, say µ. So, for a given
stationary sequence yt, the covariance notation can be simplified to:
cov(h) = E[(yt − µ)(yt+h − µ)]    -> Eq. 1.2
By part (c), the covariance between two points in the time series can be nonzero, as long as
the value of the covariance is only a function of h. For example, for h = 3:
cov(3) = cov(y1, y4) = cov(y2, y5) = …    -> Eq. 1.3
It is important to note that for h = 0, cov(0) = cov(yt, yt) = var(yt) for all t. Because var(yt) < ∞,
by condition (b), the variance of yt is a constant for all t. The autocorrelation function (ACF) is
then defined as the covariance at lag h normalized by the variance:
ACF(h) = cov(h) / cov(0)    -> Eq. 1.4
Because cov(0) is the variance, the ACF is analogous to the correlation function of two
variables, corr(yt, yt+h), and the value of the ACF falls between −1 and 1. Thus, the closer
the absolute value of ACF(h) is to 1, the more useful yt can be as a predictor of yt+h.
Fig 3.1.4(b) Autocorrelation function (ACF)
By convention, the quantity h in the ACF is referred to as the lag, the difference between the
time points t and t + h. At lag 0, the ACF provides the correlation of every point with itself, so
ACF(0) always equals 1. According to the ACF plot, at lag 1 the correlation between yt and yt-1
is approximately 0.9, which is very close to 1. So yt-1 appears to be a good predictor of the
value of yt. Because ACF(2) is around 0.8, yt-2 also appears to be a good predictor of the
value of yt. A similar argument could be made for lag 3 to lag 8 (all the autocorrelations are
greater than 0.6). In other words, a model can be considered that would express yt as a
linear sum of its previous 8 terms. Such a model is known as an autoregressive model of
order 8.
For example, an autoregressive model of order 1, AR(1), expresses yt in terms of its previous value:
yt = ø1 yt-1 + εt    -> Eq. 1.6
It is evident that yt-1 = ø1 yt-2 + εt-1. Thus, substituting for yt-1 yields
yt = ø1(ø1 yt-2 + εt-1) + εt
yt = ø1^2 yt-2 + ø1 εt-1 + εt    -> Eq. 1.7
As this substitution process is repeated, Yt can be expressed as a function of Yt-h for h = 3, 4
... and a sum of the error terms. This observation means that even in the simple AR(1)
model, there will be considerable autocorrelation with the larger lags even though those lags
are not explicitly included in the model. What is needed is a measure of the autocorrelation
between Yt andYt+h for h = 1, 2, 3 ... with the effect of the Yt+1 to Yt+h-1 values excluded from
the measure. The partial autocorrelation function (PACF) provides such a measure
PACF(h) = corr(yt − y*t, yt+h − y*t+h) for h ≥ 2    -> Eq. 1.8
         = corr(yt, yt+1) for h = 1
where y*t = β1 yt+1 + β2 yt+2 + … + βh-1 yt+h-1,
y*t+h = β1 yt+h-1 + β2 yt+h-2 + … + βh-1 yt+1, and
the h − 1 values of the βs are based on linear regression.
In other words, after linear regression is used to remove the effect of the variables between
yt and yt+h on yt and yt+h , the PACF is the correlation of what remains. For h = 1, there are
no variables between Yt and Yt+1. So the PACF(1) equals ACF(1). Although the computation of
the PACF is somewhat complex, many software tools hide this complexity from the analyst.
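In R, the acf() and pacf() functions compute these quantities directly. A small sketch (the simulated AR(1) series below is only an illustrative stand-in for a real dataset):
# Sketch: ACF and PACF of a simulated AR(1) series
set.seed(42)
y <- arima.sim(model = list(ar = 0.9), n = 144)
acf(y, lag.max = 8)    # autocorrelations; ACF(0) is always 1 and the values decay slowly
pacf(y, lag.max = 8)   # partial autocorrelations; for an AR(1) process the PACF cuts off after lag 1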
For an MA(q) model, the ACF is nonzero only for lags less than or equal to q and drops to zero
for larger lags. To understand why this phenomenon occurs, consider an MA(3) time series model:
yt = εt + θ1εt-1 + θ2εt-2 + θ3εt-3    -> Eq. 1.10
yt-1 = εt-1 + θ1εt-2 + θ2εt-3 + θ3εt-4    -> Eq. 1.11
yt-2 = εt-2 + θ1εt-3 + θ2εt-4 + θ3εt-5    -> Eq. 1.12
yt-3 = εt-3 + θ1εt-4 + θ2εt-5 + θ3εt-6    -> Eq. 1.13
yt-4 = εt-4 + θ1εt-5 + θ2εt-6 + θ3εt-7    -> Eq. 1.14
Because the expression of yt shares specific white noise variables with the expressions for yt-1
through yt-3, inclusive, those three variables are correlated to yt. However, the expression of
yt does not share white noise variables with yt-4. So the theoretical correlation between yt and
yt-4 is zero.
3.1.7 ARMA and ARIMA Models
In general, the data scientist does not have to choose between an AR(p) and an MA(q) model
to describe a time series. In fact, it is often useful to combine these two representations into
one model. The combination of these two models for a stationary time series results in an
Autoregressive Moving Average model, ARMA(p,q), which is expressed as:
yt = δ + ø1 yt-1 + ø2 yt-2 + … + øp yt-p + εt + θ1 εt-1 + … + θq εt-q    -> Eq. 1.15
where δ is a constant for a nonzero-centered time series,
øj is a constant for j = 1, 2, ..., p, with øp ≠ 0,
θk is a constant for k = 1, 2, ..., q, with θq ≠ 0, and
εt ~ N(0, σε^2) for all t.
If p ≠ 0 and q = 0, then the ARMA(p,q) model is simply an AR(p) model. Similarly, if p = 0
and q ≠ 0, then the ARMA(p,q) model is an MA(q) model.
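As an illustrative sketch (the coefficient values below are arbitrary examples, not taken from the text), an ARMA series can be simulated and fitted in R; note that an ARMA(p,q) model is fitted as ARIMA(p,0,q):
# Sketch: simulating and fitting an ARMA(2,1) series
set.seed(1)
y <- arima.sim(model = list(ar = c(0.6, -0.2), ma = 0.4), n = 200)
fit <- arima(y, order = c(2, 0, 1))   # ARMA(2,1) is ARIMA(2,0,1)
fit                                   # prints the estimated ø and θ coefficients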
To apply an ARMA model properly, the time series must be a stationary one. However, many
time series exhibit some trend over time. Since such a time series does not meet the
requirement of a constant expected value (mean), the data needs to be adjusted to remove
the trend. One transformation option is to perform a regression analysis on the time series
and then to subtract the value of the fitted regression line from each observed y-value.
If detrending using a linear or higher order regression model does not provide a stationary
series, a second option is to compute the difference between successive y-values. This is
known as differencing.
dt = yt − yt-1 for t = 2, 3, ..., n
The mean of the time series plotted is certainly not a constant. Applying differencing to the
time series results in the plot below. This plot illustrates a time series with a constant mean
and a fairly constant variance over time.
Fig 3.1.7(b) Time series for differencing example
If the differenced series is not reasonably stationary, applying differencing additional times
may help. Differencing the series twice gives, for t = 3, 4, ..., n:
dt − dt-1 = (yt − yt-1) − (yt-1 − yt-2) = yt − 2yt-1 + yt-2    -> Eq. 1.16
Successive differencing can be applied, but over-differencing should be avoided. One reason
is that over-differencing may unnecessarily increase the variance. The increased variance can
be detected by plotting the possibly over-differenced values and observing that the spread of
the values is much larger, after differencing the values of y twice.
Because the need to make a time series stationary is common, the differencing can be
included (integrated) into the ARMA model definition by defining the Autoregressive
Integrated Moving Average model, denoted ARIMA(p,d,q). The structure of the ARIMA
model is identical to the expression, but the ARMA(p,q) model is applied to the time series,
Yt, after applying differencing d times. Additionally, it is often necessary to account for
seasonal patterns in time series. For example, in the retail sales use case example monthly
clothing sales track closely with the calendar month. Similar to the earlier option of
detrending a series by first applying linear regression, the seasonal pattern could be
determined and the time series appropriately adjusted. An alternative is to use a seasonal
autoregressive integrated moving average model, denoted ARIMA(p,d,q) x (P,D,Q)s
where P, D, and Q denote the orders of the seasonal autoregressive, seasonal differencing,
and seasonal moving average components, and s denotes the length of the seasonal period.
For a time series with a seasonal pattern, typical values of s are 4 for quarterly data and 12
for monthly data.
Fig 3.1.8(a) Monthly gasoline production
In R, the ts() function creates a time series object from a vector or a matrix. The use of time
series objects in R simplifies the analysis by providing several methods that are tailored
specifically for handling equally spaced time series data. For example, the plot() function
does not require an explicitly specified variable for the x-axis. To apply an ARMA model, the
dataset needs to be a stationary time series. Using the diff() function, the gasoline
production time series is differenced once and plotted:
> plot(diff(gas_prod))
> abline(a = 0, b = 0)
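The construction of such a time series object might look like the following sketch (the input values and the 1993 start date are placeholders for illustration; they are not given in the text):
> gas_prod_input <- rnorm(240, mean = 400, sd = 10)   # placeholder values standing in for the real data
> gas_prod <- ts(gas_prod_input, start = c(1993, 1), frequency = 12)   # monthly data, so frequency = 12
> plot(gas_prod, xlab = "Time (months)", ylab = "Gasoline production")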
Fig 3.1.8(c) ACF of the differenced gasoline time series
> acf(diff(gas_prod), xaxp = c(0, 48, 4), lag.max = 48, main = "")
> pacf(diff(gas_prod), xaxp = c(0, 48, 4), lag.max = 48, main = "")
The dashed lines provide upper and lower bounds at a 95% significance level. Any value of
the ACF or PACF outside of these bounds indicates that the value is significantly different
from zero.
It shows several significant ACF values. The slowly decaying ACF values at lags 12, 24, 36,
and 48 are of particular interest. A similar behavior in the ACF was seen but for lags 1, 2, 3,
... It indicates a seasonal autoregressive pattern every 12 months. Examining the PACF plot,
the PACF value at lag 12 is quite large, but the PACF values are close to zero at lags 24, 36,
and 48. Thus, a seasonal AR(1) model with period = 12 will be considered. It is often useful
to address the seasonal portion of the overall ARMA model before addressing the nonseasonal
portion of the model.
The arima() function in R is used to fit a (0,1,0) × (1,0,0)12 model. The analysis is applied to
the original time series variable, gas_prod. The differencing, d = 1, is specified by the
order = c(0,1,0) term.
> arima_1 <- arima(gas_prod, order = c(0, 1, 0),
                   seasonal = list(order = c(1, 0, 0), period = 12))
> arima_1
Series: gas_prod
ARIMA(0,1,0)(1,0,0)[12]
Coefficients:
        sar1
      0.8335
s.e.  0.0324
sigma^2 estimated as 37.29:  log likelihood = -778.69
AIC = 1561.38   AICc = 1561.43   BIC = 1568.33
The value of the coefficient for the seasonal AR(1) model is estimated to be 0.8335 with a
standard error of 0.0324. Because the estimate is several standard errors away from zero,
this coefficient is considered significant. The output from this first pass ARIMA analysis is
stored in the variable arima_1, which contains several useful quantities, including the
residuals. The next step is to examine the residuals from fitting the (0,1,0) × (1,0,0)12
ARIMA model. The ACF and PACF plots of the residuals are obtained as follows:
> acf(arima_1$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")
> pacf(arima_1$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")
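The residual ACF and PACF of the first model motivate refining the nonseasonal part of the fit. A hedged reconstruction of that step (the object name arima_2 is taken from its use below; the exact call is an assumption consistent with the (0,1,1) × (1,0,0)12 model named in the figure captions):
> arima_2 <- arima(gas_prod, order = c(0, 1, 1),
                   seasonal = list(order = c(1, 0, 0), period = 12))
> arima_2
> acf(arima_2$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")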
> pacf(arima_2$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")
Fig 3.1.8(g) ACF for the residuals from the (0,1,1) × (1,0,0)12 model
Fig 3.1.8(h) PACF for the residuals from the (0,1,1) × (1,0,0)12 model
It should be noted that the ACF and PACF plots each have several points that are close to the
bounds at a 95% significance level. However, these points occur at relatively large lags. To
avoid overfitting the model, these values are attributed to random chance. So no attempt is
made to include these lags in the model. However, it is advisable to compare a reasonably
fitting model to slight variations of that model.
Comparing Fitted Time Series Models
The arima() function in R uses Maximum Likelihood Estimation (MLE) to estimate the model
coefficients. In the R output for an ARIMA model, the log-likelihood (log L) value is provided.
The values of the model coefficients are determined such that the value of the log-likelihood
function is maximized. Based on the log L value, the R output provides several measures that
are useful for comparing the appropriateness of one fitted model against another fitted
model. These measures follow:
AIC (Akaike Information Criterion)
AICc (Akaike Information Criterion, corrected)
BIC (Bayesian Information Criterion)
Because these criteria impose a penalty based on the number of parameters included in the
models, the preferred model is the fitted model with the smallest AIC, AICc, or BIC value.
The table provides the information criteria measures for the ARIMA models already fitted as
well as a few additional fitted models. The highlighted row corresponds to the fitted ARIMA
model obtained previously by examining the ACF and PACF plots.
Table: Information Criteria to Measure Goodness of Fit
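These information criteria can also be read directly from the fitted objects in R; a brief sketch, assuming arima_1 and arima_2 hold the two fitted models discussed above (smaller values indicate the preferred model):
> AIC(arima_1); AIC(arima_2)
> BIC(arima_1); BIC(arima_2)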
Fig 3.1.8(i) Plot of residuals from the fitted (0,1,1) × (1,0,0)12 model
Fig 3.1.8(j) Histogram of the residuals from the fitted (0,1,1) × (1,0,0)12 model
> qqnorm(arima_2$residuals, main = "")
> qqline(arima_2$residuals)
Fig 3.1.8(k) Q-Q plot of the residuals from the fitted (0,1,1) x (1,0,0) 12 model
If the normality or the constant variance assumptions do not appear to be true, it may be
necessary to transform the time series prior to fitting the ARIMA model. A common
transformation is to apply a logarithm function.
Forecasting
The next step is to use the fitted (0,1,1) × (1,0,0)12 model to forecast the next 12 months of
gasoline production. In R, the forecasts are easily obtained using the predict() function and
the fitted model already stored in the variable arima_2. The predicted values, along with the
associated upper and lower bounds at a 95% confidence level, are displayed in R.
# Predict the next 12 months
> arima_2.predict <- predict(arima_2, n.ahead = 12)
> matrix(c(arima_2.predict$pred - 1.96 * arima_2.predict$se,
           arima_2.predict$pred,
           arima_2.predict$pred + 1.96 * arima_2.predict$se),
         12, 3, dimnames = list(c(241:252), c("LB", "Pred", "UB")))
LB PRED UB
241 394.9689 404.8167 414.6645
242 378.6142 388.8773 399.1404
243 394.9943 405.6566 416.3189
244 405.0188 416.0658 427.1128
245 397.9545 409.3733 420.7922
246 396.1202 407.8991 419.6780
247 396.6028 408.7311 420.8594
248 387.5241 399.9920 412.4598
249 387.1523 399.9507 412.7492
250 387.8486 400.9693 414.0900
251 383.1724 396.6076 410.0428
252 390.2075 403.9500 417.6926
> plot(gas_prod, xlim = c(145, 252), xlab = "Time (months)",
       ylab = "Gasoline production (millions of barrels)", ylim = c(360, 440))
> lines(arima_2.predict$pred)
> lines(arima_2.predict$pred + 1.96 * arima_2.predict$se, col = 4, lty = 2)
> lines(arima_2.predict$pred - 1.96 * arima_2.predict$se, col = 4, lty = 2)
Fig 3.1.8(l) Actual and forecasted gasoline production
3.1.9 Reasons to Choose and Cautions
One advantage of ARIMA modeling is that the analysis can be based simply on
historical time series data for the variable of interest. In regression modeling, by contrast,
various input variables need to be considered and evaluated for inclusion in the model for
the outcome variable. Because ARIMA modeling, in general, ignores any additional input
variables, the forecasting process is simplified. If regression analysis were used to model
gasoline production, input variables such as Gross Domestic Product (GDP), oil prices, and
the unemployment rate might be useful.
However, to forecast gasoline production using regression, predictions would be
required for the GDP, oil price, and unemployment rate input variables. The minimal data
requirement also leads to a disadvantage of ARIMA modeling; the model does not provide an
indication of what underlying variables affect the outcome. For example, if ARIMA modeling
was used to forecast future retail sales, the fitted model would not provide an indication of
what could be done to increase sales. In other words, causal inferences cannot be drawn
from the fitted ARIMA model. One caution in using time series analysis is the impact of
severe shocks to the system. In the gas production example, shocks might include refinery
fires, international incidents, or weather-related impacts such as hurricanes. Such events can
lead to short-term drops in production, followed by persistently high increases in production
to compensate for the lost production or to simply capitalize on any price increases. Along
similar lines of reasoning, time series analysis should only be used for short-term forecasts.
Over time, gasoline production volumes may be affected by changing consumer demands as
a result of more fuel-efficient gasoline-powered vehicles, electric vehicles, or the introduction
of natural gas-powered vehicles. Changing market dynamics in addition to shocks will make
any long-term forecasts, several years into the future, very questionable.
Additional Methods
Additional time series methods include the following:
Autoregressive Moving Average with Exogenous inputs (ARMAX) is used to analyze a
time series that is influenced by another time series. For example, the demand for products
can be modeled based on the previous demand combined with a weather-related time series
such as temperature or rainfall.
Spectral analysis is commonly used for signal processing and other engineering
applications. Speech recognition software uses such techniques to separate the signal
for the spoken words from the overall signal that may include some noise.
Kalman filtering is useful for analyzing real-time inputs about a system that can
exist in certain states. Typically, there is an underlying model of how the various
components of the system interact and affect each other. A Kalman filter processes
the various inputs, attempts to identify the errors in the input, and predicts the
current state. For example, a Kalman filter in a vehicle navigation system can process
various inputs, such as speed and direction, and update the estimate of the current
location.
Multivariate time series analysis examines multiple time series and their effect on
each other. Vector ARIMA (VARIMA) extends ARIMA by considering a vector of several
time series at a particular time, t. VARIMA can be used in marketing analyses that
examine the time series related to a company's price and sales volume as well as
related time series for the competitors.
Unit – III: ADVANCED ANALYTICAL THEORY AND METHODS: TEXT ANALYSIS
Chapter 2: TEXT ANALYSIS
Agenda
A major challenge of text analysis is that most of the time the text is not structured.
Search and Retrieval: Search and retrieval is the identification of the documents in a
corpus that contain search items such as specific words, phrases, topics, or entities like
people or organizations. These search items are generally called key terms. Search and
retrieval originated from the field of library science and is now used extensively by web
search engines.
Text Mining: Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest. With the
proper representation of the text, many of the techniques, such as clustering and
classification, can be adapted to text mining. For example, the k-means can be modified to
cluster text documents into groups, where each group represents a collection of
documents with a similar topic. The distance of a document to a centroid represents how
closely the document talks about that topic. Classification tasks such as sentiment analysis
and spam filtering are prominent use cases for the naive Bayes classifier. Text mining may utilize
methods and techniques from various fields of study, such as statistical analysis,
information retrieval, data mining, and natural language processing. Note that, in reality,
all three steps do not have to be present in a text analysis project. If the goal is to
construct a corpus or provide a catalog service, for example, the focus would be the
parsing task using one or more text pre-processing techniques, such as part-of-speech
(POS) tagging, named entity recognition, lemmatization, or stemming. Furthermore, the
three tasks do not have to be sequential. Sometimes their orders might even look like a
tree. For example, one could use parsing to build a data store and choose to either search
and retrieve the related documents or use text mining on the entire data store to gain
insights.
Part-of-Speech (POS) Tagging, Lemmatization, and Stemming:
The goal of POS tagging is to build a model whose input is a sentence, such as:
he saw a fox
and whose output is a tag sequence. Each tag marks the POS for the corresponding word,
such as:
PRP VBD DT NN, according to the Penn Treebank POS tags.
Therefore, the four words are mapped to pronoun (personal), verb (past tense),
determiner, and noun (singular), respectively.
Both lemmatization and stemming are techniques to reduce the number of dimensions
and reduce inflections or variant forms to the base form to more accurately measure the
number of times each word appears. With the use of a given dictionary, lemmatization
finds the correct dictionary base form of a word. For example, given the sentence:
obesity causes many problems,the output of lemmatization would be:
obesity cause many problem.
Different from lemmatization, stemming does not need a dictionary, and it usually refers
to a crude process of stripping affixes based on a set of heuristics with the hope of
correctly achieving the goal to reduce inflections or variant forms. After the process, words
are stripped to become stems. A stem is not necessarily an actual word defined in the
natural language, but it is sufficient to differentiate itself from the stems of other words. A
well-known rule-based stemming algorithm is Porter's stemming algorithm. It defines a set
of production rules to iteratively transform words into their stems. For the sentence shown
previously:
obesity causes many problems, the output of Porter's stemming algorithm is:
obes caus mani problem
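For illustration, a Porter-style stemmer is available in R through the SnowballC package (a sketch, assuming SnowballC is installed; the exact stems may differ slightly from the hand-worked example depending on the stemmer version):
# Sketch: stemming the example sentence with a Snowball (Porter-style) stemmer
library(SnowballC)
wordStem(c("obesity", "causes", "many", "problems"), language = "porter")
# expected output is similar to: "obes" "caus" "mani" "problem"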
3.2.2 A Text Analysis Example
To further describe the three text analysis steps, consider the fictitious company
ACME, maker of two products: bPhone and bEbook. ACME is in strong competition with
other companies that manufacture and sell similar products. To succeed, ACME needs to
produce excellent phones and eBook readers and increase sales. One of the ways the
company does this is to monitor what is being said about ACME products in social media.
In other words, what is the buzz on its products? ACME wants to search all that is said
about ACME products in social media sites, such as Twitter and Facebook, and popular
review sites, such as Amazon and ConsumerReports. It wants to answer questions such as
these:
• Are people mentioning its products?
• What is being said? Are the products seen as good or bad?
• If people think an ACME product is bad, why? For example, are they complaining about
the battery life of the bPhone, or the response time in their bEbook?
ACME can monitor the social media buzz using a simple process based on the three steps
outlined above. This process is illustrated in the figure below, and it includes the modules
listed there.
Figure: ACME's Text Analysis Process – 1. Collect Raw Text, 2. Represent Text, 3. TFIDF,
4. Categorize Documents by Topics, 5. Sentiment Analysis, 6. Gain Insights
Text analysis
This can help your customer support teams sort text data at scale, more accurately, and
faster than human analysis. That means valuable customer insights sooner rather than
later, and quicker turnarounds when it comes to making business decisions.
Text analysis, also known as text mining, is a machine learning technique used to
automatically extract value from text data. With the help of natural language processing
(NLP), text analysis tools are able to understand, analyze, and extract insights from
your unstructured data.
They make processing and analyzing huge amounts of unstructured data incredibly easy.
For example, if you receive thousands of support tickets, text analysis tools can analyze
them as soon as they come into your helpdesk, and alert you to any recurring issues or
even angry customers that are at risk of churning.
Text Analysis Examples:
There are two main text analysis techniques that you can use: text classification and text
extraction. While they can be used individually, you'll be able to get more detailed insights
by combining them.
Text Classification:
• Sentiment Analysis
• Topic Analysis
• Language Detection
• Intent Detection
Text Extraction:
• Keyword Extraction
• Entity Extraction
Text Classification:
Text Classification, also referred to as text tagging, is the practice of classifying text using
pre-defined tags. There are many examples of text classification, but we‘ll just touch upon
some of the most popular methods used by businesses.
Sentiment Analysis: Sentiment analysis can automatically detect the emotional
undertones embedded in customer reviews, survey responses, social media posts, and
beyond, which helps organizations understand how their customers feel about their brand,
product, or service. Sentiment analysis of product reviews, for example, can tell you what
customers like or dislike about your product. Restaurants might want to quickly detect
negative reviews on public opinion sites, like Yelp. By performing sentiment analysis on
Yelp reviews, they can quickly detect negative sentiments, and respond right away.
Review sites like Capterra and G2 Crowd also offer unsolicited feedback. Take Slack, for
example. Customers leave long-winded reviews that praise or criticize different aspects of
the software. By running sentiment analysis, you can start organizing these reviews by
sentiment in real time. By training your own sentiment analysis model to detect emotional
tones in your customer feedback, you‘ll be able to gain more accurate results that are
tailored to your dataset.
Let‘s go over the two main text analysis methods – text classification and text extraction –
and the various models available. The one you choose will depend on the insights you are
hoping to gain, and/or the problem you‘re attempting to solve. Let‘s take a closer look:
Text Classification:
Text classification, also referred to as text tagging, is the practice of classifying text into
pre-defined groups. With the help of Natural Language Processing (NLP), text
classification tools are powerful enough to automatically analyze text and classify it into
categories, depending on the content that you're dealing with.
Now, let's proceed with the different types of text classification models available.
Sentiment Analysis:
Nowadays, analysis falls short if it doesn‘t examine the different emotions behind every
piece of text. Sentiment analysis can automatically detect the emotional undertones
embedded in customer reviews, survey responses, social media posts, and so on, which
helps organizations understand how their customers feel about their brand, product, or
service.
For example, a sentiment analysis of product reviews can help a business understand
what customers like or dislike about its product. Think of review sites
like Yelp, Capterra, and G2 Crowd, where you might stumble upon feedback about your,
let‘s say, SaaS business. In the following reviews for Slack, customers praise or criticize a
few aspects of the tool:
"In love with Slack, I won't be using anything else to communicate going forward. How did
I survive without it?!" → Positive
"I don't agree with the hype. Slack failed to notify me of several important messages and
that's what a communication platform should be all about." → Negative
"The UX is one of the best, it's very intuitive and easy to use. However, we don't have a
budget for the high price Slack asks for its paid plans." → Neutral
By training a model to detect sentiment, on the other hand, you can delegate the task
of categorizing texts as Positive, Neutral, or Negative to machines. Not only does this
help speed up the process, you'll also receive more consistent results, since machines
apply the same criteria every time.
Topic analysis:
Topic analysis is a machine learning technique that interprets and categorizes large
collections of text according to individual topics or themes.
For example, instead of humans having to read thousands of product reviews to identify
the main topic that customers are talking about in regards to your product, you can use a
topic analysis tool to do it in seconds.
Let‘s say you‘re an entertainment on-demand service company that‘s just released new
content, and you want to know what topics customers are mentioning. You could define
tags such as UX/UI, Quality, Functionality, and Reliability, and find out which aspect is
being talked about most often, and how customers are talking about each aspect. Take
this review for Prime Video:
“I think Amazon is making a great effort in adding engaging content but I can‟t get past
the ugly interface. It‟s not as intuitive as other competing streaming services and if it
weren‟t lumped in with my Prime membership, I wouldn‟t pay for the stand-alone service.”
In this example, the topic analysis classifier can be trained to process this and
automatically tag it under UX/UI.
Language Detection
This text analysis model identifies and classifies text according to its language. This is
particularly helpful for businesses with a global presence that need to route information to
the correct localized teams.
Take this ticket, for example:
“La blusa es más grande de lo que esperaba, quisiera devolverla por una prenda de una
talla menor.”
A language classifier could easily detect that the language is Spanish and route it to the
correct customer service agent, helping businesses improve response times and avoiding
unnecessary delays.
Intent Detection
Text classifiers can also be used to automatically discover customer intent in text. For
example, you might receive a message like the one below:
"Sheesh, the amount of emails I receive is staggering - and it's only been one week. It's
sporting goods, folks. I don't need over 20 emails per week to remind me of that.
Unsubscribing ASAP."
With an intent detection classifier in place, you could quickly detect that this customer
wants to 'Unsubscribe' and address this customer immediately. Perhaps you'll convince
them to change the settings to limit the number of emails they receive every week, rather
than unsubscribe.
With a clear intent detected, you can easily classify customers and take immediate action.
Text Extraction:
Text Extraction is a text analysis technique that extracts valuable pieces of data from
text, like keywords, names, product specifications, phone numbers, and more.
Here are some examples of text extraction models.
Keyword Extraction:
Keyword extraction shows you the most relevant words or expressions in your text data.
For example, if you want to understand the main topics mentioned in a set of customer
reviews about a particular product or feature, you‘d quickly run your data through
a keyword extractor.
Entity Extraction:
Entity extraction is used to obtain names of people, companies, brands, and more. This
technique is particularly helpful when you‘re trying to discover names of competitors in
text, brand mentions, and specific customers that you want to track.
Textual information is all around us.
Soon after you wake up, you usually navigate through large amounts of textual data in the
form of text messages, emails, social media updates, and blog posts before you make it to
your first cup of coffee.
Deriving information from such large volumes of text data is challenging. Businesses deal
with massive quantities of text data generated from several data sources, including apps,
web pages, social media, customer reviews, support tickets, and call transcripts.
To extract high-quality, relevant information from such huge amounts of text data,
businesses employ a process called text mining. This process of information extraction
from text data is performed with the help of text analysis software.
Text Mining
Text mining, also called text data mining, is the process of analyzing large volumes of
unstructured text data to derive new information. It helps identify facts, trends, patterns,
concepts, keywords, and other valuable elements in text data.
It's also known as text analysis and transforms unstructured data into structured data,
making it easier for organizations to analyze vast collections of text documents. Some of
the common text mining tasks are text classification, text clustering, creation of granular
taxonomies, document summarization, entity extraction, and sentiment analysis.
Text mining uses several methodologies to process text, including natural language
processing (NLP).
What is natural language processing?
Natural language processing (NLP) is a subfield of computer science, linguistics, data
science, and artificial intelligence concerned with the interactions between humans and
computers using natural language.
In other words, natural language processing aims to make sense of human languages to
enhance the quality of human-machine interaction. NLP evolved from computational
linguistics, enabling computers to understand both written and spoken forms of human
language.
Many of the applications you use have NLP at their core. Voice assistants like Siri, Alexa,
and Google Assistant use NLP to understand your queries and craft responses. Grammarly
uses NLP to check the grammatical accuracy of sentences. Even Google Translate is made
possible by NLP.
Natural language processing employs several machine learning algorithms to extract the
meaning associated with each sentence and convert it into a form that computers can
understand. Semantic analysis and syntactic analysis are the two main methods used to
perform natural language processing tasks.
Semantic analysis
Semantic analysis is the process of understanding human language. It's a critical aspect of
NLP, as understanding the meaning of words alone won't do the trick. It enables
computers to understand the context of sentences as we comprehend them.
Semantic analysis is based on semantics – the meaning conveyed by a text. The semantic
analysis process starts with identifying the text elements of a sentence and assigning them
to their grammatical and semantical role. It then analyzes the context in the surrounding
text to determine the meaning of words with more than one interpretation.
Syntactic analysis
Syntactic analysis is used to determine how a natural language aligns with grammatical
rules. It's based on syntax, a field of linguistics that refers to the rules for arranging words
in a sentence to make grammatical sense.
Some of the syntax techniques used in NLP are:
Part-of-speech tagging: Identifying the part of speech for each word
Sentence breaking: Assigning sentence boundaries on a huge piece of text
Morphological segmentation: Dividing words into simpler individual parts called
morphemes
Word segmentation: Dividing huge pieces of continuous text into smaller, distinct
units
Lemmatization: Reducing inflected forms of a word into singular form for easy
analysis
Stemming: Cutting inflected words into their root forms
Parsing: Performing grammatical analysis of a sentence
Why is text mining important?
Most businesses have the opportunity to collect large volumes of text data. Customer
feedback, product reviews, and social media posts are just the tip of the big data iceberg.
The kind of ideas that can be derived from such sources of textual (big) data are
profoundly lucrative and can help companies create products that users will value the
most.
Without text mining, the opportunity mentioned above is still a challenge. This is because
analyzing vast amounts of data isn't something the human brain is capable of. Even if a
group of people tries to pull off this Herculean task, the insights extracted might become
obsolete by the time they succeed.
Text mining helps companies automate the process of classifying text. The classification
could be based on several attributes, including topic, intent, sentiment, and language.
Many manual and tedious tasks can be eliminated with the help of text mining. Suppose
you need to understand how the customers feel about a software application you offer. Of
course, you can manually go through user reviews, but if there are thousands of reviews,
the process becomes tedious and time-consuming.
Text mining makes it quick and easy to analyze large and complex data sets and derive
relevant information from them. In this case, text mining enables you to identify the
general sentiment of a product. This process of determining whether the reviews are
positive, negative, or neutral is called sentiment analysis or opinion mining.
Further, text mining can be used to determine what users like or dislike or what they want
to be included in the next update. You can also use it to identify the keywords customers
use in association with certain products or topics.
Organizations can use text mining tools to dig deeper into text data to identify relevant
business insights or discover interrelationships within texts that would otherwise go
undetected with search engines or traditional applications.
Here are some specific ways organizations can benefit from text mining:
The pharmaceutical industry can uncover hidden knowledge and accelerate the pace of
drug discovery.
Product companies can perform real-time analysis on customer reviews and identify
product bugs or flaws that require immediate attention.
Companies can create structured data, integrate it into databases and use it for
different types of big data analytics such as descriptive or predictive analytics.
In short, text mining helps businesses put data to work and make data-driven decisions
that can make customers happy and ultimately increase profitability.
Want to learn more about Text Analysis Software? Explore Text Analysis
products.
Text mining vs. text analytics vs. text analysis
Text mining and text analysis are often used synonymously. However, text analytics is
different from both.
Simply put, text analytics can be described as a text analysis or text mining software
application that allows users to extract information from structured and
unstructured text data.
Both text mining and text analytics aim to solve the same problem – analyzing raw text
data. But their results vary significantly. Text mining extracts relevant information from
text data that can be considered qualitative results. On the other hand, text analytics aims
to discover trends and patterns in vast volumes of text data that can be viewed
as quantitative results.
Put differently, text analytics is about creating visual reports such as graphs and tables by
analyzing large amounts of textual data. Whereas text mining is about transforming
unstructured data into structured data for easy analysis.
Text mining is a subfield of data mining and relies on statistics, linguistics, and machine
learning to create models capable of learning from examples and predicting results on
newer data. Text analytics uses the information extracted by text mining models for data
visualization.
Text mining techniques
Numerous text mining techniques and methods are used to derive valuable insights from
text data. Here are some of the most common ones.
Concordance
Concordance is used to identify the context in which a word or series of words appear.
Since the same word can mean different things in human language, analyzing the
concordance of a word can help comprehend the exact meaning of a word based on the
context. For example, the term "windows" describes openings in a wall and is also the
name of the operating system from Microsoft.
Word frequency
As the name suggests, word frequency is used to determine the number of times a word
has been mentioned in unstructured text data. For example, it can be used to check the
occurrence of words like "bugs," "errors," and "failure" in the customer reviews. Frequent
occurrences of such terms may indicate that your product requires an update.
Collocation
Collocation is a sequence of words that co-occur frequently. "Decision making," "time-
consuming," and "keep in touch" are some examples. Identifying collocation can improve
the granularity of text and lead to better text mining results.
Then there are advanced text mining methods such as text classification and text
extraction. We'll go over them in detail in the next section.
How does text mining work?
Text mining is primarily made possible through machine learning. Text mining algorithms
are trained to extract information from vast volumes of text data by looking at many
examples.
The first step in text mining is gathering data. Text data can be collected from multiple
sources, including surveys, chats, emails, social media, review websites, databases, news
outlets, and spreadsheets.
The next step is data preparation. It's a pre-processing step in which the raw data is
cleaned, organized, and structured before textual data analysis. It involves standardizing
data formats and removing outliers, making it easier to perform quantitative and
qualitative analysis.
Natural language processing techniques such as parsing, tokenization, stop word removal,
stemming, and lemmatization are applied in this phase.
After that, the text data is analyzed. Text analysis is performed using methods such as
text classification and text extraction. Let's look at both methods in detail.
Text classification
Text classification, also known as text categorization or text tagging, is the process of
classifying text. In other words, it's the process of assigning categories to unstructured
text data. Text classification enables businesses to quickly analyze different types of
textual information and obtain valuable insights from them.
Some common text classification tasks are sentiment analysis, language detection, topic
analysis, and intent detection.
Sentiment analysis is used to understand the emotions conveyed through a given
text. By understanding the underlying emotions of a text, you can classify it as
positive, negative, or neutral. Sentiment analysis is helpful to enhance customer
experience and satisfaction.
Language detection is the process of identifying which natural language the given
text is in. This will allow companies to redirect customers to specific teams specialized
in a particular language.
Topic analysis is used to understand the central theme of a text and assign a topic to
it. For example, a customer email that says "the refund hasn't been processed" can be
classified as a "Returns and Refunds issue".
Intent detection is a text classification task used to recognize the purpose or intention
behind a given text. It aims to understand the semantics behind customer messages
and assign the correct label. It's a critical component of several natural language
understanding (NLU) software.
Now, let's take a look at the different types of text classification systems.
1. Rule-based systems
Rule-based text classification systems are based on linguistic rules. Once the text mining
algorithms are coded with these rules, they can detect various linguistic structures and
assign the correct tags.
For example, a rule-based system can be programmed to assign the tag "food" whenever
it encounters words like "bacon," "sandwich," "pasta," or "burger".
Since rule-based systems are developed and maintained by humans, they're easy to
understand. However, unlike machine learning-based systems, rule-based systems
demand humans to manually code prediction rules, making them hard to scale.
2. Machine learning-based systems
Machine learning-based text classification systems learn and improve from examples.
Unlike rule-based systems, machine learning-based systems don't demand data scientists
to code the linguistic rules manually. Instead, they learn from training data that contains
examples of correctly tagged text data.
Machine learning algorithms such as Naive Bayes and Support Vector Machines (SVM) are
used to predict the tag of a text. Many a time, deep learning algorithms are also used to
create machine learning-based systems with greater accuracy.
3. Hybrid systems
As expected, hybrid text classification systems combine both rule-based and machine
learning-based systems. In such systems, both machine learning-based and rule-based
systems complement each other, and their combined results have higher accuracy.
Evaluation of text classifiers
A text classifier's performance is measured with the help of four
parameters: accuracy, precision, recall, and F1 score.
Accuracy is the number of times the text classifier made the correct prediction divided
by the total number of predictions.
Precision indicates the number of correct predictions made by the text classifier over
the total number of predictions for a specific tag.
Recall depicts the number of texts correctly predicted divided by the total number that
should have been categorized with a specific tag.
F1 score combines precision and recall parameters to give a better understanding of
how adept the text classifier is at making predictions. It's a better indicator than
accuracy as it shows how good the classifier is at predicting all the categories in the
model.
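As a small illustrative sketch (the confusion-matrix counts below are invented for the example), these four metrics can be computed directly in R:
# Sketch: accuracy, precision, recall, and F1 from a 2x2 confusion matrix
cm <- matrix(c(40, 10,    # predicted positive: 40 true positives, 10 false positives
               5, 45),    # predicted negative:  5 false negatives, 45 true negatives
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("pos", "neg"), actual = c("pos", "neg")))
TP <- cm["pos", "pos"]; FP <- cm["pos", "neg"]
FN <- cm["neg", "pos"]; TN <- cm["neg", "neg"]
accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(accuracy = accuracy, precision = precision, recall = recall, F1 = f1), 3)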
Another way to test the performance of a text classifier is with cross-validation.
Cross-validation is the process of randomly dividing the training data into several subsets.
The text classifier trains on all subsets, except one. After the training, the text classifier is
tested by making predictions on the remaining subset.
In most cases, multiple rounds of cross-validation are performed with different subsets,
and their results are averaged to estimate the model's predictive performance.
Text extraction
Text extraction, also known as keyword extraction, is the process of extracting specific,
relevant information from unstructured text data. This is mainly done with the help of
machine learning and is used to automatically scan text and obtain relevant words and
phrases from unstructured text data such as surveys, news articles, and support tickets.
Text extraction allows companies to extract relevant information from large blocks of text
without even reading it. For example, you can use it to quickly identify the features of a
product from its description.
Quite often, text extraction is performed along with text classification. Some of the
common text extraction tasks are feature extraction, keyword extraction, and named
entity recognition.
Feature extraction is the process of identifying critical features or attributes of an
entity in text data. Understanding the common theme of an extensive collection of text
documents is an example. Similarly, it can analyze product descriptions and extract
their features such as model or color.
Keyword extraction is the process of extracting important keywords and phrases
from text data. It's useful for summarization of text documents, finding the frequently
mentioned attributes in customer reviews, and understanding the opinion of social
media users towards a particular subject.
Named entity recognition (NER), also known as entity extraction or chunking, is the
text extraction task of identifying and extracting critical information (entities) from text
data. An entity can be a word or a series of words, such as the names of companies.
Regular expressions and conditional random field (CRF) are the two common methods of
implementing text extraction.
1. Regular expressions
Regular expressions are a series of characters that can be correlated with a tag. Whenever
the text extractor matches a text with a sequence, it assigns the corresponding tag.
Similar to the rule-based text classification systems, each pattern is a specific rule.
Unsurprisingly, this approach is hard to scale as you have to establish the correct
sequence for any kind of information you wish to obtain. It also becomes difficult to handle
when patterns become complex.
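For illustration, a simple regular-expression extractor can be written in base R (a sketch; the phone-number pattern and the sample tickets are made-up examples):
# Sketch: extracting US-style phone numbers from support tickets with a regular expression
tickets <- c("Please call me at 555-123-4567 tomorrow", "No phone number in this one")
pattern <- "\\b[0-9]{3}-[0-9]{3}-[0-9]{4}\\b"
regmatches(tickets, regexpr(pattern, tickets, perl = TRUE))   # returns only the matched substrings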
2. Conditional random fields
Conditional random fields (CRFs) are a class of statistical approaches often applied in
machine learning and used for text extraction. It builds systems capable of learning the
patterns in text data that they need to extract. It does this by weighing various features
from a sequence of words in text data.
CRFs are more proficient at encoding information when compared to regular expressions.
This makes them more capable of creating richer patterns. However, this method will
require more computational resources to train the text extractor.
Evaluation of text extractors
You can use the same metrics used in text classification to evaluate the performance of the
text extractor. However, they‘re blind to partial matches and consider only exact matches.
Due to that reason, another set of metrics called ROUGE (Recall-Oriented Understudy for
Gisting Evaluation) is used.
Text mining applications
The amount of data managed by most organizations is growing and diversifying at a rapid
pace. It's nearly impossible to take advantage of it without an automated process like text
mining in place.
An excellent example of text mining is how information retrieval happens when you
perform a Google search. For example, if you search for a keyword, say "cute puppies,"
most search results won't include your exact query.
Instead, they'll be synonyms or phrases that closely match your query. In the example of
"cute puppies," you'll come across search engine results that include phrases such as
"cutest puppy," "adorable puppies," "adorable pups," and "cute puppy".
This happens because text mining applications actually read and comprehend the body of
texts, closely similar to how we do it. Instead of just relying on keyword matching, they
understand search terms at conceptual levels. They do an excellent job of understanding
complex queries and can discover patterns in text data, which is otherwise hidden to the
human eye.
Text mining can also help companies solve several problems in areas such as patent
analysis, operational risk analysis, business intelligence, and competitive intelligence.
Text mining has a broad scope of applications spanning multiple industries. Marketing,
sales, product development, customer service, and healthcare are a few of them. It
eliminates several monotonous and time-consuming tasks with the help of machine
learning models.
Here are some of the applications of text mining.
Fraud detection: Text mining technologies make it possible to analyze large volumes of
text data and detect fraudulent transactions or insurance claims. Investigators can quickly
identify fraudulent claims by checking for commonly used keywords in descriptions of
accidents. It can also be used to promptly process genuine claims by automating the
analysis process.
Customer service: Text mining can automate the ticket tagging process and
automatically route tickets to appropriate geographic locations by analyzing their
language. It can also help companies determine the urgency of a ticket and prioritize
the most critical tickets.
Business intelligence: Text mining makes it easier for analysts to examine large
amounts of data and quickly identify relevant information. Since petabytes of business
data, collected from several sources, are involved, manual analysis is impossible. Text
mining tools speed up the process and enable analysts to extract actionable information.
Healthcare: Text mining is becoming increasingly valuable in the healthcare industry,
primarily for clustering information. Manual investigation is time-consuming and costly.
Text mining can be used in medical research to automate the process of extracting
crucial information from medical literature.
3.2.4 Representing Text:
Tokenization is the task of separating (also called tokenizing) words from the body of text.
Raw text is converted into collections of tokens after the tokenization, where each token is
generally a word. A common approach is tokenizing on spaces. For example, with the
tweet shown previously:
I once had a gf back in the day. Then the bPhone came out lol
Tokenization based on spaces would output a list of tokens.
{ I, once, had, a, gf, back, in, the, day.
Then, the, bPhone, came, out, lol}
Note that the token "day." contains a period. This is the result of only using the space as the
separator. Therefore, tokens "day." and "day" would be considered different terms in the
downstream analysis unless an additional lookup table is provided. One way to fix the
problem without the use of a lookup table is to remove the period if it appears at the end of
a sentence. Another way is to tokenize the text based on punctuation marks and spaces. In
that case, the previous tweet would become:
{I, once, had, a, gf, back, in, the, day, .,
Then, the, bPhone, came, out, lol}
However, tokenizing based on punctuation marks might not be well suited to certain
scenarios. For example, if the text contains contractions such as we'll, tokenizing based on
punctuation will split them into the separate words we and ll. For words such as can't, the
output would be can and t. It would be preferable either not to tokenize them or to
tokenize we'll into we and 'll, and can't into can and 't. The 't token is more recognizable as
negative than the t token. If the team is dealing with certain tasks such as information
extraction or sentiment analysis, tokenizing solely based on punctuation marks and spaces
may obscure or even distort meanings in the text.
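A minimal sketch of the two tokenization strategies in base R (the regular expression used for the punctuation-and-space split is a simple illustration that drops the punctuation characters rather than keeping them as separate tokens):
# Sketch: tokenizing the example tweet
tweet <- "I once had a gf back in the day. Then the bPhone came out lol"
# Tokenize on spaces only: the token "day." keeps its trailing period
strsplit(tweet, " ")[[1]]
# Tokenize on spaces and punctuation (real tokenizers are more careful with contractions)
strsplit(tweet, "[[:space:][:punct:]]+")[[1]]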
Tokenization is a much more difficult task than one may expect. For example, should terms
like state-of-the-art, Wi-Fi, and San Francisco be considered one token or more? Should words
like résumé and resume all map to the same token? Tokenization is even more difficult
beyond English. In German, for example, there are many unsegmented compound nouns.
In Chinese, there are no spaces between words. Japanese has several alphabets
intermingled. This list can go on.
Another text normalization technique is called case folding, which reduces all letters to
lowercase (or the opposite if applicable). For the previous tweet, after case folding the text
would become this:
I once had a gf back in the day. then the bphone came out lol
One needs to be cautious applying case folding to tasks such as information extraction,
sentiment analysis, and machine translation. If implemented incorrectly, case folding may
reduce or change the meaning of the text and create additional noise. For example, when
General Motors becomes general
and motors, the downstream analysis may very likely consider them as separated words
rather than the name of a company. When the abbreviation of the World Health
Organization WHO or the rock band The Who become who, they may both be interpreted
as the pronoun who.
If case folding must be present, one way to reduce such problems is to create a lookup
table of words not to be case folded. Alternatively, the team can come up with some
heuristics or rules-based strategies for the case folding. For example, the program can be
taught to ignore words that have uppercase in the middle of a sentence.
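To make these tokenization and case-folding choices concrete, here is a minimal Python sketch (standard library only, using the example tweet above) that contrasts whitespace tokenization, punctuation-aware tokenization, and case folding; it is an illustration, not a full normalization pipeline.
import re

tweet = "I once had a gf back in the day. Then the bPhone came out lol"

# Tokenize on whitespace only: "day." keeps its trailing period
space_tokens = tweet.split()

# Tokenize on punctuation and spaces: the period becomes its own token,
# but contractions such as "can't" would be split into "can" and "t"
punct_tokens = re.findall(r"\w+|[^\w\s]", tweet)

# Case folding: reduce every letter to lowercase
folded = [t.lower() for t in punct_tokens]

print(space_tokens)
print(punct_tokens)
print(folded)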
3.2.5 Term Frequency-Inverse Document Frequency (TF-IDF):
TF-IDF stands for Term Frequency-Inverse Document Frequency. It can be defined as a
calculation of how relevant a word in a series or corpus is to a text. A word's relevance
increases proportionally to the number of times it appears in the text, but this is offset by
the frequency of the word in the corpus (data set).
Terminologies:
Term Frequency: In document d, the term frequency represents the number of instances of a
given word t. It is reasonable that a word becomes more relevant the more often it appears
in the text. Since the ordering of terms is not significant, we can use a vector to describe
the text in the bag-of-words model. For each specific term in the document, there is an
entry with the value being the term frequency. The weight of a term that occurs in a
document is simply proportional to the term frequency.
tf(t,d) = count of t in d / number of words in d
Document Frequency: Document frequency measures the importance of a term across the
whole corpus collection, and is very similar to TF. The only difference is that TF is the
frequency counter for a term t in document d, while df is the number of occurrences of the
term t in the document set N. In other words, DF is the number of documents in which the
word is present:
df(t) = occurrence of t in documents
Inverse Document Frequency: Mainly, it tests how relevant the word is. The key aim of
a search is to locate the appropriate records that fit the demand. Since tf considers all
terms equally significant, term frequencies alone are not sufficient to measure the weight
of a term in a document. First, find the document frequency of a term t by counting the
number of documents containing the term:
df(t) = N(t) where
df(t) = Document frequency of a term t
N(t) = Number of documents containing the term t
Term frequency is the number of instances of a term in a single document only, whereas
document frequency is the number of separate documents in which the term appears, so it
depends on the entire corpus. Now let's look at the definition of inverse document
frequency. The IDF of a term is the number of documents in the corpus divided by the
document frequency of the term:
idf(t) = N/ df(t) = N/N(t)
A more common word should be considered less significant, but the raw ratio above grows
too quickly, so we take the logarithm (with base 2) of the inverse document frequency.
The idf of the term t then becomes:
idf(t) = log(N/ df(t))
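As a quick numeric check of these formulas, here is a minimal sketch that computes tf, df, and idf for a single term over a made-up three-document corpus; a base-2 logarithm is used to match the definition above.
import math

# Toy corpus of N = 3 documents (made up for illustration)
docs = [
    "big data big analytics",
    "data mining",
    "text analytics",
]
N = len(docs)
term = "data"

words = docs[0].split()
tf = words.count(term) / len(words)            # 1/4 = 0.25 in document 0
df = sum(term in doc.split() for doc in docs)  # "data" appears in 2 of the 3 documents
idf = math.log2(N / df)                        # log2(3/2) ~ 0.585

print(tf, df, idf, tf * idf)                   # tf-idf ~ 0.146 for "data" in document 0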
Computation: TF-IDF is one of the best metrics to determine how significant a term is to
a text in a series or a corpus. TF-IDF is a weighting system that assigns a weight to each
word in a document based on its term frequency (TF) and inverse document frequency
(IDF). The words with higher weight scores are deemed to be more significant.
Usually, the TF-IDF weight consists of two terms-
1. Normalized Term Frequency (TF)
2. Inverse Document Frequency (IDF)
tf-idf(t, d) = tf(t, d) * idf(t)
In Python, TF-IDF values can be computed using the TfidfVectorizer() class in the
scikit-learn (sklearn) module.
Syntax:
sklearn.feature_extraction.text.TfidfVectorizer(input)
Parameters:
input: specifies how documents are passed to the vectorizer; it can be 'filename', 'file',
or 'content' (the default, meaning the documents themselves are passed as strings).
Attributes:
vocabulary_: a dictionary with terms as keys and feature indices as values.
idf_: the inverse document frequency vector learned from the documents passed to the
vectorizer.
Methods:
fit_transform(): learns the vocabulary and returns the document-term matrix of TF-IDF
values.
get_feature_names_out(): returns the list of feature names (terms); older scikit-learn
releases use get_feature_names().
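A minimal sketch of this workflow follows; the three-document corpus is made up, and note that scikit-learn applies smoothing and normalization, so its values differ slightly from the textbook formula above.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "data analytics with big data",
    "text mining extracts insight from text",
    "big data needs scalable storage",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)   # sparse document-term matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())  # the learned vocabulary (terms)
print(vectorizer.vocabulary_)              # term -> feature index
print(vectorizer.idf_)                     # inverse document frequency per term
print(tfidf.toarray())                     # tf-idf weights per document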
3.2.6 Categorizing Documents By Topics:
A topic consists of a cluster of words that frequently occur together and share the same
theme.
The topics of a document are not as straightforward as they might initially appear.
Consider these two reviews:
1. The bPhoneSx has coverage everywhere. It's much less flaky than my old bPhone4G.
2. While I love ACME's bPhone series, I've been quite disappointed by the bEbook. The
text is illegible, and it makes even my old NBook look blazingly fast.
A document typically consists of multiple themes running through the text in
different proportions-
For Example: 30% on a topic related to phones, 15% on a topic related to
appearance, 10% on a topic related to shipping, 5% on a topic related to service, and
so on.
Document grouping can be achieved with clustering methods such as:
o k-means clustering
or with classification methods such as:
o support vector machines
o k-nearest neighbors
o naive Bayes
However, a more feasible and prevalent approach is to use Topic Modelling.
Topic modelling provides tools to automatically organize, search, understand, and
summarize vast amounts of information.
Topic models are statistical models that examine words from a set of documents,
determine the themes over the text, and discover how the themes are associated or
change over time.
The process of topic modelling can be simplified to the following.
1. Uncover the hidden topical patterns within a corpus.
2. Annotate documents according to these topics.
3. Use annotations to organize, search, and summarize texts.
A topic is formally defined as a distribution over a fixed vocabulary of words. Different
topics would have different distributions over the same vocabulary.
A topic can be viewed as a cluster of words with related meanings, and each word
has a corresponding weight inside this topic.
Note that a word from the vocabulary can reside in multiple topics with different
weights.
Topic models do not necessarily require prior knowledge of the texts. The topics can
emerge solely based on analyzing the text.
The simplest topic model is latent Dirichlet allocation (LDA), a generative probabilistic
model of a corpus proposed by David M. Blei and two other researchers.
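As a rough illustration of topic modelling, the sketch below fits an LDA model with scikit-learn on a few made-up product reviews; the documents, number of topics, and parameter values are assumptions for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the bPhone has great coverage and battery life",
    "the bEbook screen is illegible and shipping was slow",
    "love the bPhone camera, battery could be better",
    "slow shipping and poor service for the bEbook",
]

# Build a document-term matrix of raw word counts
counts = CountVectorizer(stop_words="english").fit(docs)
dtm = counts.transform(docs)

# Fit a 2-topic LDA model
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

# Each topic is a distribution over the vocabulary; print its top words
terms = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top}")

# Per-document topic proportions (themes in different proportions)
print(lda.transform(dtm))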
3.2.7 Determining Sentiment:
Sentiment analysis, also known as opinion mining, is the process of determining the
emotions behind a piece of text. Sentiment analysis aims to categorize the given text
as positive, negative, or neutral. Furthermore, it then identifies and quantifies
subjective information about those texts with the help of natural language processing,
text analysis, computational linguistics, and machine learning.
There are two main methods for sentiment analysis: machine learning and lexicon-based.
The machine learning method leverages human-labeled data to train
the text classifier, making it a supervised learning method. The lexicon-based
approach breaks down a sentence into words and scores each word's semantic
orientation based on a dictionary. It then adds up the various scores to arrive at a
conclusion.
In this example, we will look at how sentiment analysis works using a simple lexicon-
based approach. We'll take the following comment as our test data:
"That movie was a colossal disaster… I absolutely hated it! Waste of time and
money #skipit"
Step 1: Cleaning
The initial step is to remove special characters, punctuation, and numbers from the text. In our
example, we'll remove the ellipsis, exclamation mark, and hash symbol from the comment above.
That movie was a colossal disaster I absolutely hated it Waste of time and
money skipit
Step 2: Tokenization
Tokenization is the process of breaking down a text into smaller chunks called tokens,
which are either individual words or short sentences. Breaking down a paragraph into
sentences is known as sentence tokenization, and breaking down a sentence into words
is known as word tokenization.
['That', 'movie', 'was', 'a', 'colossal', 'disaster', 'I', 'absolutely', 'hated',
'it', 'Waste', 'of', 'time', 'and', 'money', 'skipit']
Step 3: Part-of-speech (POS) tagging
Part-of-speech tagging is the process of tagging each word with its grammatical group,
categorizing it as either a noun, pronoun, adjective, or adverb—depending on its context.
This transforms each token into a tuple of the form (word, tag). POS tagging is used to
preserve the context of a word.
[('That', 'DT'), ('movie', 'NN'), ('was', 'VBD'), ('a', 'DT'), ('colossal', 'JJ'),
('disaster', 'NN'), ('I', 'PRP'), ('absolutely', 'RB'), ('hated', 'VBD'), ('it', 'PRP'),
('Waste', 'NN'), ('of', 'IN'), ('time', 'NN'), ('and', 'CC'), ('money', 'NN'), ('skipit', 'NN')]
Step 4: Removing stop words
Stop words are words like 'have,' 'but,' 'we,' 'he,' 'into,' 'just,' and so on. These words
carry information of little value and are generally considered noise, so they are removed
from the data.
['movie', 'colossal', 'disaster', 'absolutely', 'hated', 'Waste', 'time', 'money',
'skipit']
Step 5: Stemming
Stemming is a process of linguistic normalization which removes the suffix of each of these
words and reduces them to their base word. For example, loved is reduced to love, wasted
is reduced to waste. Here, hated is reduced to hate.
['movie', 'colossal', 'disaster', 'absolutely', 'hate', 'Waste', 'time', 'money', 'skipit']
Step 6: Final Analysis
In a lexicon-based approach, the remaining words are compared against the sentiment
libraries, and the scores obtained for each token are added or averaged. Sentiment
libraries are a list of predefined words and phrases which are manually scored by humans.
For example, 'worst' is scored -3, and 'amazing' is scored +3. With a basic dictionary, our
example comment will be turned into:
movie = 0, colossal = 0, disaster = -2, absolutely = 0, hate = -2, waste = -1, time = 0,
money = 0, skipit = 0
This makes the overall score of the comment -5, classifying the comment as negative.
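The six steps above can be condensed into a small lexicon-based scorer. The sketch below uses a tiny, made-up lexicon and stop-word list (real sentiment libraries are far larger) and reproduces the -5 score for the example comment; it is illustrative only.
import re

# Toy sentiment lexicon and stop-word list (hypothetical, for illustration)
lexicon = {"disaster": -2, "hate": -2, "waste": -1, "amazing": 3, "love": 2}
stop_words = {"that", "was", "a", "i", "it", "of", "and"}

def score_comment(text):
    # Steps 1-2: remove special characters and tokenize on whitespace
    tokens = re.sub(r"[^\w\s]", " ", text).lower().split()
    # Step 4: remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Step 5 (very rough stemming): strip a trailing 'd' when the stem is in the lexicon (hated -> hate)
    tokens = [t[:-1] if t.endswith("d") and t[:-1] in lexicon else t for t in tokens]
    # Step 6: sum the lexicon scores (unknown words score 0)
    return sum(lexicon.get(t, 0) for t in tokens)

comment = ("That movie was a colossal disaster… I absolutely hated it! "
           "Waste of time and money #skipit")
print(score_comment(comment))   # -5, i.e. a negative comment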
Data insights help organizations to:
1. Optimize processes to improve performance.
2. Uncover new markets, products or services to add new sources of revenue.
3. Better balance risk vs reward to reduce loss.
4. Deepen the understanding of customers to increase loyalty and lifetime value.
How to Get Data Insights:
The process to obtain actionable data insights typically involves defining objectives,
collecting, integrating and managing the data, analyzing the data to gain insights and then
sharing these insights.
1) Define business objectives
Stakeholders initiate the process by clearly defining objectives such as improving
production processes or determining which marketing campaigns are most effective, like in
the example above.
2) Data collection
Ideally, systems have already been put in place to collect and store raw source data. If
not, the organization needs to establish a systematic process to gather the data.
3) Data integration & management
Once collected, source data must be transformed into clean, analytics-ready information
via data integration. This process includes data replication, ingestion and transformation to
combine different types of data into standardized formats which are then stored in a
repository such as a data lake or data warehouse.
UNIT – IV: Analytical Data Report and Visualization
Communicating and Operationalizing an Analytics Project, Creating the Final
Deliverables:
Project Goals,
Main Findings,
Approach,
Model Description,
Data Visualization.
4.1 Developing Core Material for Multiple Audiences:
It is important to tailor the project outputs to the audience. For a project sponsor,
show that the team met the project goals. Focus on what was done, what the team
accomplished, what ROI can be anticipated, and what business value can be realized.
When presenting to a technical audience such as data scientists and analysts, focus
on how the work was done. Discuss how the team accomplished the goals and the
choices it made in selecting models or analyzing the data. Share analytical methods and
decision-making processes so other analysts can learn from them for future projects.
Describe methods, techniques, and technologies used, as this technical audience will be
interested in learning about these details and considering whether the approach
makes sense in this case and whether it can be extended to other, similar projects. Plan
to provide specifics related to model accuracy and speed, such as how well the model
will perform in a production environment.
Because some of the components of the projects can be used for different audiences, it can
be helpful to create a core set of materials regarding the project, which can be used to
create presentations for either a technical audience or an executive sponsor. Table below
depicts the main components of the final presentations for the project sponsor and an
analyst audience. Notice that teams can create a core set of materials in these seven
areas, which can be used for the two presentation audiences.
Three areas (Project Goals, Main Findings, and Model Description) can be used as is
for both presentations. Other areas need additional elaboration, such as the Approach. Still
other areas, such as the Key Points, require different levels of detail for the analysts and
data scientists than for the project sponsor.
Key points supported with data
• For the project sponsor: simple charts (for example, bar charts).
• For analysts and data scientists: more technical charts and graphs, such as ROC curves
and histograms, plus visuals of key variables and the significance of each.
Model details
• For the project sponsor: omit this section, or discuss only at a high level.
• For analysts and data scientists: show the code or main logic of the model, and include
the model type, variables, and technology used to execute the model and score data.
Identify key variables and the impact of each. Describe expected model performance and
any caveats. Provide a detailed description of the modeling technique, and discuss
variables, scope, and predictive power.
Recommendations
• For the project sponsor: focus on business impact, including risks and ROI. Give the
sponsor salient points to help her evangelize the work within the organization.
• For analysts and data scientists: supplement recommendations with implications for the
modeling or for deploying in a production environment.
The points on this version of the Goals slide emphasize what needs to be done,
but not why, which will be included in the alternative.
Figure 4.2.1 Example of Project Goals slide for YoyoDyne case study
Figure 4.2.2 shows a variation of the Project Goals slide shown in Figure 4.2.1. It adds a
summary of the situation prior to listing the goals. Keep in mind that when delivering
final presentations, these deliverables are shared within organizations, and the original
context can be lost, especially if the original sponsor leaves the group or changes roles.
It is good practice to briefly recap the situation prior to showing the project goals. Keep
in mind that adding a situation overview to the Goals slide does make it appear busier.
The team needs to determine whether to split this into a separate slide or keep it
together, depending on the audience and the team's style for delivering the final
presentation. One method for writing the situational overview in a succinct way is to
summarize as follows:
• Situation: Give a one-sentence overview of the situation that has led to the analytics
project.
• Complication: Give a one-sentence overview of the need for addressing this now.
Something has triggered the organization to decide to take action at this time. For
instance, perhaps it lost 100 customers in the past two weeks and now has an executive
mandate to address an issue, or perhaps it has lost five points of market share to its
biggest competitor in the past three months. Usually, this sentence represents the driver
for why a particular project is being initiated at this time, rather than in some vague time
in the future.
FIGURE 4.2.2 Example of Situation & Project Goals slide for YoyoDyne case study
FIGURE 4.3.1 Example of Executive Summary slide for YoyoDyne case study
The key message should be clear and conspicuous at the front of the slide. It can be
set apart with color or shading. The key message may become the single talking point
that executives or the project sponsor take away from the project and use to support
the team's recommendation for a pilot project, so it needs to be succinct and
compelling.
Follow the key message with three major supporting points. Although Executive
Summary slides can have more than three major points, going beyond three ideas
makes it difficult for people to recall the main points, so it is important to ensure that
the ideas remain clear and limited to the few most impactful ideas the team wants the
audience to take away from the work that was done. If the author lists ten key points,
messages become diluted, and the audience may remember only one or two main
points.
In addition, because this is an analytics project, be sure to make one of the key points
related to if, and how well, the work will meet the sponsor's service level agreement
(SLA) or expectations. Traditionally, the SLA refers to an arrangement between
someone providing services, such as an information technology (IT) department or a
consulting firm, and an end user or customer. In this case, the SLA refers to system
performance, expected uptime of a system, and other constraints that govern an
agreement.
In practice, this term has become less formal and often conveys system performance
expectations more generally, related to performance or timeliness. Finally, although it is
not required, it is often a good idea to support the main points with a visual or graph.
Visual imagery makes a visceral connection and helps the reader retain the main message.
4.4 Approach:
In the Approach portion of the presentation, the team needs to explain the
methodology pursued on the project. This can include interviews with domain experts,
the groups collaborating within the organization, and a few statements about the
solution developed. The objective of this slide is to ensure the audience understands the
course of action that was pursued well enough to explain it to others within the
organization. The team should also include any additional comments related to working
assumptions the team followed as it performed the work, because this can be critical in
defending why they followed a specific course of action. When explaining the solution, the
discussion should remain at a high level for the project sponsors. If presenting to
analysts or data scientists, provide additional detail about the type of model used,
including the technology and the actual performance of the model during the tests.
Finally, as part of the description of the approach, the team may want to mention
constraints from systems, tools, or existing processes and any implications for how these
things may need to change with this project.
Figure 4.4.1 shows an example of how to describe the methodology followed during a
data science project to a sponsor audience.
FIGURE 4.4.1 Example describing the project methodology for project sponsors
4.5 Model Description:
After describing the project approach, teams generally include a description of the model
that was used. Assuming the model will meet the agreed-upon SLAs, mention that the
model is expected to meet the SLAs based on its performance within the testing or staging
environment. For instance, one may want to indicate that the model processed
500,000 records in 5 minutes to give stakeholders an idea of the speed of the model
during run time. Analysts will want to understand the details of the model, including the
decisions made in constructing the model and the scope of the data extracts for testing
and training. Be prepared to explain the team's thought process on this, as well as the
speed of running the model within the test environment.
FIGURE 4.5 Model Description
4.6 Key Points Supported with Data
The next step is to identify key points based on insights and observations resulting from
the data and model scoring results. Find ways to illustrate the key points with charts and
visualization techniques, using simpler charts for sponsors and more technical data
visualization for analysts and data scientists. For project sponsors, use simple charts
such as bar charts, which illustrate data clearly and enable the audience to understand
the value of the insights. This is also a good point to foreshadow some of the team's
recommendations and begin tying together ideas to demonstrate what led to the
recommendations and why. In other words, this section supplies the data and foundation
for the recommendations that come later in the presentation.
FIGURE 4.6 Key points
Example of a presentation of key points of a data science project shown as a bar chart.
4.7 Model Details
Model details are typically needed by people who have a more technical understanding
than the sponsors, such as those who will implement the code, or colleagues on the
analytics team. Project sponsors are typically less interested in the model details; they are
usually more focused on the business implications of the work rather than the details of
the model. This portion of the presentation needs to show the code or main logic of the
model, including the model type, variables, and technology used to execute the model and
score data.
FIGURE 4.7.1 Model Details
As part of the model detail description, guidance should be provided regarding the speed
with which the model can run in the test environment; the expected performance in a live,
production environment; and the technology needed.
Recommendations
The final main component of the presentation involves creating a set of recommendations
that include how to deploy the model from a business perspective within the organization
and any other suggestions on the rollout of the model's logic.
• Implement the model as a pilot before a more wide-scale rollout; test and learn from the
initial pilot's performance and precision.
• Addressing at-risk customers promptly can potentially save more customers from churning
over time and also prevent the networking effect that seems to drive additional churn.
• An early churn-warning trigger can be set up based on this model; run the predictive
model daily or weekly to be proactive on customer churn.
• The in-database scorer can score large datasets in a matter of minutes and can be run
daily.
• Each customer retained via the early-warning trigger saves 4 hours of account retention
effort and $50K in new account acquisition costs.
• Develop targeted customer surveys to investigate the causes of churn, which will make
the collection of data for investigating the causes of churn easier.
Additional Tips on the Final Presentation
As a team completes a project and strives to move on to the next one, it must remember
to invest adequate time in developing the final presentations.
Use imagery and visual representations: Visuals tend to make the presentation more
compelling. Also, people recall imagery better than words, because images can have a
more visceral impact. These visual representations can be static or interactive.
Make sure the text is mutually exclusive and collectively exhaustive (MECE): This
means having an economy of words in the presentation and making sure the key
points are covered but not repeated unnecessarily.
Measure and quantify the benefits of the project: This can be challenging and requires
time and effort to do well. This kind of measurement should attempt to quantify the
project's financial and other benefits in a specific way. Making the
statement that a project provided "$8.5M in annual cost savings" is much more
compelling than saying it has "great value."
Make the benefits of the project clear and conspicuous: After calculating the benefits of
the project, make sure to articulate them clearly in the presentation.
4.8 Providing Technical Specifications and Code:
In addition to authoring the final presentations, the team needs to deliver the actual code
that was developed and the technical documentation needed to support it. The team
should consider how the project will affect the end users and the technical people who will
need to implement the code.
Teams should approach writing technical documentation for their code as if it were an
application programming interface (API). Many times, the models become encapsulated as
functions that read a set of inputs in the production environment, possibly perform
preprocessing on data, and create an output, including a set of post-processing results.
For example, if the model returns a value representing the probability of customer churn,
additional logic may be needed to identify the scoring threshold to determine which
customer accounts to flag as being at risk of churn. In addition, some provision should be
made for adjusting this threshold and training the algorithm, either in an automated
learning fashion or with human intervention.
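As a sketch of what such post-processing logic might look like, the function below wraps a fitted model with an adjustable churn threshold; the function name, inputs, and threshold value are hypothetical, and a scikit-learn-style model with predict_proba() plus a pandas DataFrame of features are assumed.
def flag_at_risk_accounts(model, accounts_df, threshold=0.7):
    """Score customer accounts and flag those whose churn probability
    exceeds an adjustable threshold.

    model       -- a fitted classifier exposing predict_proba() (assumption)
    accounts_df -- a pandas DataFrame of the model's input features (assumption)
    threshold   -- scoring cut-off; tune it (or learn it) as the business requires
    """
    churn_prob = model.predict_proba(accounts_df)[:, 1]
    scored = accounts_df.assign(churn_probability=churn_prob)
    return scored[scored["churn_probability"] >= threshold]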
Although the team must create technical documentation, many times engineers and other
technical staff receive the code and may try to use it without reading through all the
documentation. Therefore, it is important to add extensive comments in the code. If the
team can do a thorough job adding comments in the code, it is much easier for someone
else to maintain the code and tune it in the runtime environment. In addition, it helps the
engineers edit the code when their environment changes or they need to modify processes
that may be providing inputs to the code or receiving its outputs.
4.9 Data Visualization:
As the volume of data continues to increase, more vendors and communities are
developing tools to create clear and impactful graphics for use in presentations and
applications. Although not exhaustive, Table lists some popular tools.
What is Data Visualization?
Data visualization is the representation of data through use of common
graphics, such as charts, plots, infographics and even animations. These visual
displays of information communicate complex data relationships and data-driven
insights in a way that is easy to understand.
Types of Data Visualization
Tables: This consists of rows and columns used to compare variables.
Tables can show a great deal of information in a structured way, but they
can also overwhelm users that are simply looking for high-level trends.
Pie charts and stacked bar charts: These graphs are divided into
sections that represent parts of a whole. They provide a simple way to
organize data and compare the size of each component to one another.
Line charts and area charts: These visuals show change in one or more
quantities by plotting a series of data points over time and are frequently
used within predictive analytics. Line graphs utilize lines to demonstrate
these changes while area charts connect data points with line segments,
stacking variables on top of one another and using color to distinguish
between variables.
Histograms: This graph plots a distribution of numbers using a bar chart
(with no spaces between the bars), representing the quantity of data that
falls within a particular range. This visual makes it easy for an end user to
identify outliers within a given dataset.
Scatter plots: These visuals are beneficial in revealing the relationship
between two variables, and they are commonly used within regression data
analysis. However, these can sometimes be confused with bubble charts,
which are used to visualize three variables via the x-axis, the y-axis, and
the size of the bubble.
Heat maps: These graphical representations are helpful in
visualizing behavioral data by location. This can be a location on a map, or
even a webpage.
Tree maps: These display hierarchical data as a set of nested shapes,
typically rectangles. Tree maps are great for comparing the proportions
between categories via their area size.
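As a small illustration of two of these chart types, the following sketch uses matplotlib with synthetic data to draw a histogram and a bar chart; the data, labels, and figure layout are made up for illustration.
import matplotlib.pyplot as plt
import numpy as np

values = np.random.normal(loc=50, scale=10, size=500)   # synthetic numeric data

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a numeric variable, useful for spotting outliers
ax1.hist(values, bins=20)
ax1.set(title="Histogram", xlabel="Value", ylabel="Frequency")

# Bar chart: simple comparison across categories, suited to sponsor audiences
ax2.bar(["North", "South", "East", "West"], [120, 95, 150, 80])
ax2.set(title="Bar chart", xlabel="Region", ylabel="Sales")

plt.tight_layout()
plt.show()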
UNIT V –DATA ANALYTICS APPLICATIONS
layer, in which different data sources can be addressed. The content of these data
sources is gathered in parallel, converted, and finally added to an index, which
builds the basis for data analytics, business intelligence, and all other data-driven
applications. Other big players such as IBM rely on architectures similar to Oracle's
(IBM 2013).
Throughout the different architectures to big data processing, the core of data
acquisition boils down to gathering data from distributed information sources with
the aim of storing them in scalable, big data-capable data storage. To achieve this
goal, three main components are required:
1. Protocols that allow the gathering of information from distributed data sources of
any type (unstructured, semi-structured, structured).
2. Frameworks with which the data is collected from the distributed sources by
using different protocols.
3. Technologies that allow the persistent storage of the data retrieved by the
frameworks.
5.1.1 The Data Acquisition Process:
What is exciting about data acquisition to data professionals is the richness of its
process.
Consider a basic set of tasks that constitute a data acquisition process:
• A need for data is identified, perhaps with use cases
• Prospecting for the required data is carried out
• Data sources are disqualified, leaving a set of qualified sources
• Vendors providing the sources are contacted and legal agreements entered into for
evaluation
• Sample data sets are provided for evaluation
• Semantic analysis of the data sets is undertaken, so they are adequately understood
• The data sets are evaluated against originally established use cases
• Legal, privacy and compliance issues are understood, particularly with respect to
permitted use of data
• Vendor negotiations occur to purchase the data
• Implementation specifications are drawn up, usually involving Data Operations, who
will be responsible for production processes
• Source onboarding occurs, such that ingestion is technically accomplished
• Production ingest is undertaken
There are several things that stand out about this list. The first is that it consists of
a relatively large number of tasks. The second is that it may easily be inferred that
many different groups are going to be involved, e.g., Analytics or Data Science will
likely come up with the need and use cases, whereas Data Governance, and
perhaps the Office of General Counsel, will have to give an opinion on legal,
privacy and compliance requirements.
An even more important feature of data acquisition is that the end-to-end process
sketched out above is only one of a number of possible variations. Other
approaches to data acquisition may involve using "open" data sources or
configuring tools to scan internet sources, or hiring a company to aggregate the
required data. Each of these variations will amount to a different end-to-end
process.
Increase in explainability of our model.
Let us take an example of Kaggle Mushroom Classification Dataset. Our objective
will be to try to predict if a Mushroom is poisonous or not by looking at the given
features.
First of all, we need to import all the necessary libraries and load the Kaggle Mushroom
Classification dataset.
We can now train and evaluate a classifier first on the whole dataset and then, successively,
on reduced versions of it, and compare the results. Training a Random Forest classifier
using all the features led to 100% accuracy in about 2.2 s of training time. In each of the
following examples, the training time of the model is printed out for reference.
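A minimal sketch of such a training run is shown below; it assumes a local copy of the Kaggle dataset saved as mushrooms.csv with a 'class' label column, and exact timings and accuracy will vary with hardware and library versions.
import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical local copy of the Kaggle Mushroom Classification dataset
df = pd.read_csv("mushrooms.csv")

# All columns are categorical, so one-hot encode the features
X = pd.get_dummies(df.drop(columns=["class"]))
y = df["class"]                  # 'p' = poisonous, 'e' = edible

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

start = time.process_time()
forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print("Training time:", time.process_time() - start)

print("Accuracy:", accuracy_score(y_test, forest.predict(X_test)))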
maximizing the distance between the means of each class when projecting the data
into a lower-dimensional space can lead to better classification results (thanks to the
reduced overlap between the different classes). When using LDA, it is assumed that the
input data follows a Gaussian distribution (as in this case); therefore, applying LDA to
non-Gaussian data can lead to poor classification results.
4) Locally Linear Embedding (LLE)
We have so far considered methods such as PCA and LDA, which perform well when there
are linear relationships between the different features; we now move on to how to deal
with non-linear cases.
Locally Linear Embedding is a dimensionality reduction technique based on
Manifold Learning. A manifold is an object of D dimensions embedded in a
higher-dimensional space. Manifold Learning aims to make this object representable in
its original D dimensions rather than in an unnecessarily larger space.
Some examples of Manifold Learning algorithms are: Isomap, Locally Linear
Embedding, Modified Locally Linear Embedding, Hessian Eigenmapping, etc.
dimensional space and its equivalent in the reduced low-dimensional space. t-SNE
then makes use of the Kullback-Leibler (KL) divergence to measure the dissimilarity of
the two distributions; the KL divergence is minimized using gradient descent.
When using t-SNE, the higher-dimensional space is modelled using a Gaussian
distribution, while the lower-dimensional space is modelled using a Student's
t-distribution. This is done to avoid an imbalance in the distribution of neighboring-point
distances caused by the translation into a lower-dimensional space.
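A minimal sketch contrasting a linear method (PCA) with t-SNE follows, using scikit-learn's built-in digits dataset as a stand-in; the dataset, perplexity value, and plotting choices are assumptions for illustration only.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)     # stand-in dataset for illustration

# Linear reduction with PCA
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear reduction with t-SNE (Gaussian in high dimensions, Student's t in 2-D)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=5)
ax1.set_title("PCA")
ax2.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=5)
ax2.set_title("t-SNE")
plt.show()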
5.1.3 Tokenization:
Tokenization is a way of separating a piece of text into smaller units
called tokens. Here, tokens can be either words, characters, or subwords. Hence,
tokenization can be broadly classified into 3 types – word, character, and subword
(n-gram characters) tokenization.
For example, consider the sentence: "Never give up".
The most common way of forming tokens is based on space. Assuming space as a
delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As
each token is a word, it becomes an example of Word tokenization.
Tokenization Types:
1)Word Tokenization
Word Tokenization is the most commonly used tokenization algorithm.
It splits a piece of text into individual words based on a certain delimiter.
Depending upon delimiters, different word-level tokens are formed. Pretrained
Word Embeddings such as Word2Vec and GloVe come under word tokenization.
2) Character Tokenization
Character Tokenization splits a piece of text into a set of characters. It
overcomes the drawbacks of Word Tokenization.
Character Tokenizers handle OOV (out-of-vocabulary) words coherently by preserving the
information of the word. An OOV word is broken down into characters and
represented in terms of these characters.
It also limits the size of the vocabulary. Want to take a guess at the size of the
vocabulary? It is 26, since the vocabulary contains only the unique set of characters.
Tokenization Algorithm:
Byte Pair Encoding (BPE):
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-
based models. BPE addresses the issues of Word and Character Tokenizers:
BPE tackles OOV effectively. It segments OOV as subwords and represents the
word in terms of these subwords
The length of input and output sentences after BPE is shorter compared to
character tokenization.
BPE is a word segmentation algorithm that iteratively merges the most frequently
occurring character or character sequences. Here is a step-by-step guide to learning BPE
(a code sketch follows the steps).
Steps to learn BPE
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
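A compact sketch of these BPE learning steps on a toy vocabulary follows; the corpus and number of merges are made up, and the structure follows the commonly cited reference implementation style.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given pair into a single symbol."""
    merged = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word, freq in vocab.items():
        merged[pattern.sub("".join(pair), word)] = freq
    return merged

# Step 1-2: words split into characters with an end-of-word marker </w> (toy corpus)
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}

# Steps 3-6: repeatedly count pairs and merge the most frequent one
for _ in range(10):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("Merged pair:", best)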
5.1.4 Stemming:
Stemming is a natural language processing technique that lowers
inflection in words to their root forms, hence aiding in the preprocessing of text,
words, and documents for text normalization.
Inflection is the process through which a word is modified to communicate many
grammatical categories, including tense, case, voice, aspect, person, number,
gender, and mood. Thus, although a word may exist in several inflected forms,
having multiple inflected forms inside the same text adds redundancy to the NLP
process.
As a result, we employ stemming to reduce words to their basic form or stem,
which may or may not be a legitimate word in the language.
For instance, the stem of these three words, connections, connected, connects, is
"connect". On the other hand, the root of trouble, troubled, and troubles is
"troubl", which is not a recognized word.
Application of Stemming:
Stemming is employed in information retrieval, text mining, SEO, web search results,
indexing, tagging systems, and word analysis. For instance, a Google
search for prediction and predicted returns comparable results.
Types of Stemmer:
There are several kinds of stemming algorithms. Let us have a look.
1. Porter Stemmer – PorterStemmer()
Martin Porter invented the Porter Stemmer or Porter algorithm
in 1980. Five steps of word reduction are used in the method, each with its own
set of mapping rules. Porter Stemmer is the original stemmer and is renowned for
its ease of use and rapidity. Frequently, the resultant stem is a shorter word with
the same root meaning.
PorterStemmer() is a module in NLTK that implements the Porter Stemming
technique. Let us examine this with the aid of an example.
Example of PorterStemmer()
In the example below, we construct an instance of PorterStemmer() and use the
Porter algorithm to stem the list of words.
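A minimal sketch of this example using NLTK follows; the word list is chosen to match the connections/trouble examples above, and NLTK is assumed to be installed.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connections", "connected", "connects", "trouble", "troubled", "troubles"]
for w in words:
    # e.g. connections -> connect, troubled -> troubl
    print(w, "->", stemmer.stem(w))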
3. Lancaster Stemmer – LancasterStemmer()
Lancaster Stemmer is straightforward, although it often produces
results with excessive stemming. Over-stemming renders stems non-linguistic or
meaningless.
LancasterStemmer() is a module in NLTK that implements the Lancaster stemming
technique. Allow me to illustrate this with an example.
Example of LancasterStemmer()
In the example below, we construct an instance of LancasterStemmer() and then
use the Lancaster algorithm to stem the list of words.
Example of RegexpStemmer()
In this example, we first construct an object of RegexpStemmer() and then use the
Regex stemming method to stem the list of words.
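A minimal sketch covering both of the examples mentioned above (LancasterStemmer and RegexpStemmer in NLTK) follows; the word lists and the regular expression are illustrative assumptions.
from nltk.stem import LancasterStemmer, RegexpStemmer

# Lancaster stemming: simple but aggressive, so over-stemming may occur
lancaster = LancasterStemmer()
for w in ["friendship", "friends", "studies", "maximum"]:
    print(w, "->", lancaster.stem(w))

# Regexp stemming: strips suffixes matching the expression from words of at least `min` length
regexp = RegexpStemmer("ing$|s$|e$|able$", min=4)
for w in ["mass", "was", "computer", "advisable", "running"]:
    print(w, "->", regexp.stem(w))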
5.1.5 Conversion to Structured Data:
There are seven steps to analyzing unstructured data in order to extract
structured insights, as below.
First, analyze the data sources
Before you can initiate, you need to analyze what sources of data are essential for
the data analysis. Unstructured data sources are found in different forms like
web pages, video files, audio files, text documents, customer emails, chats and
more. You should analyze and use only those unstructured data sources that are
completely relevant.
1. Know what will be done with the results of the analysis
If the end result is not clear, the analysis may be unusable. It is key to understand
what sort of outcome is required: a trend, an effect, a cause, a quantity or something
else. There should be a clear road-map defined for what will be done with the final
results so they can be put to use for business, market or other organizational gains.
2. Decide the technology for data intake and storage as per business
needs
Though the unstructured data will come from different sources, the outcomes of
the analysis must be fed into a technology stack so that they can be used
straightforwardly. The features that matter when selecting the data retrieval and
storage technology depend on the volume, scalability, velocity and variety
requirements. A prospective technology stack should be well assessed against the
concluding requirements, after which the data architecture of the whole project is
set up.
Certain examples of business needs and the selection of the technology
stack are:
Real-time: It has become very critical for e-commerce companies to offer real-time
prices. This requires monitoring and tracking real-time competitor activities and
making offers based on the instant results of an analytics software. Such pricing
technologies include competitor price monitoring software.
Higher availability: This is vital for ingesting unstructured data and information
from social media platforms. The technology platform used should make sure that
there is no loss of data in real time. It is a good idea to build redundancy into the
information intake.
Support multi-tenancy: Another important element is the capability to isolate
data from diverse user groups. Effective data intelligence solutions should natively
support multi-tenancy. The isolation of data is significant given the sensitivities
involved with customer data and feedback, combined with the important insights
derived from them, in order to meet confidentiality requirements.
7. Implement and Influence project measurement
The end results matter the most, whatever they might be. It is vital that the results
are provided in the required format, extracting and offering structured data insights
from unstructured data.
This should be handled through web data extraction software and a data
intelligence tool, so that the user can execute the required actions on a real-time
basis.
The ultimate step would be to measure the effect with the required ROI by
revenue, process effectiveness and business improvements.
5.1.6 Sentiment Analysis:
Sentiment Analysis is the process of classifying whether a block of
text is positive, negative, or, neutral. Sentiment analysis is contextual mining of
words which indicates the social sentiment of a brand and also helps the business
to determine whether the product which they are manufacturing is going to be in
demand in the market or not. The goal of sentiment analysis is to
analyze people's opinions in a way that can help businesses expand. It
focuses not only on polarity (positive, negative & neutral) but also on emotions
(happy, sad, angry, etc.). It uses various Natural Language Processing algorithms
such as Rule-based, Automatic, and Hybrid.
Why perform Sentiment Analysis?
According to surveys, 80% of the world's data is unstructured. This data needs to
be analyzed and put into a structured form, whether it is in the form of emails,
texts, documents, articles, or anything else.
1. Sentiment analysis is required as it stores data in an efficient, cost-friendly manner.
2. Sentiment analysis solves real-time issues and can help address real-time
scenarios.
Types of Sentiment Analysis
1. Fine-grained sentiment analysis: This is based on polarity. The categories can
be very positive, positive, neutral, negative, and very negative, with ratings
on a scale of 1 to 5. A rating of 5 is very positive, 2 is negative, and 3 is
neutral.
2. Emotion detection: Sentiments such as happy, sad, angry, upset, jolly, and
pleasant come under emotion detection. It is also known as the lexicon
method of sentiment analysis.
3. Aspect-based sentiment analysis: This focuses on a particular aspect. For
instance, if a person wants to check a feature of a cell phone, aspects such
as the battery, screen, or camera quality are checked; in such cases aspect-based
analysis is used.
4. Multilingual sentiment analysis: Multilingual consists of different
languages where the classification needs to be done as positive, negative, and
neutral. This is highly challenging and comparatively difficult.
Approaches of Sentiment Analysis:
There are three approaches used:
1. Rule-based approach: The lexicon method, tokenization, and parsing come
under the rule-based approach. The approach counts the number of positive
and negative words in the given dataset. If the number of positive words is
greater than the number of negative words, the sentiment is positive, and
vice versa.
2. Automatic approach: This approach works with machine learning techniques
(see the sketch after this list). First, the datasets are trained and predictive
analysis is done. Next, words are extracted from the text. This text extraction
can be done using techniques such as Naive Bayes, linear regression, support
vector machines, and deep learning.
3. Hybrid approach: This is the combination of both the above approaches, i.e.,
the rule-based and automatic approaches. The advantage is that the accuracy is
higher compared to the other two approaches.
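A minimal sketch of the automatic (machine learning) approach using a Naive Bayes classifier from scikit-learn follows; the tiny labeled training set is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hand-labeled training set (hypothetical examples)
train_texts = [
    "I love this phone, amazing battery",
    "great camera and fast delivery",
    "terrible screen, waste of money",
    "I hate the service, total disaster",
]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # word-count features

clf = MultinomialNB().fit(X_train, train_labels)  # supervised text classifier

test = ["the battery is a disaster and a waste of money"]
print(clf.predict(vectorizer.transform(test)))    # expected: ['negative']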
Applications:
Sentiment Analysis has a wide range of applications as:
1. Social media: Comments on social media sites such as Instagram are analyzed
and categorized as positive, negative, and neutral.
2. Customer service: In the Play Store, comments accompanying ratings of 1 to 5
are analyzed with the help of sentiment analysis approaches.
3. Marketing Sector: In the marketing area where a particular product needs to
be reviewed as good or bad.
4. Reviewer side: All the reviewers will have a look at the comments and will
check and give the overall review of the product.
Challenges of Sentiment Analysis:
There are major challenges in sentiment analysis approach:
1. If the data is in the form of a tone, then it becomes really difficult to detect
whether the comment is pessimist or optimist.
2. If the data is in the form of emoji, then you need to detect whether it is good
or bad.
3. Even the ironic, sarcastic, comparing comments detection is really hard.
4. Comparing a neutral statement is a big task.
5.1.7 Web Mining:
Different Processes in Web Mining:
The processes of web mining are divided into four stages: i. Source data collection,
ii. Data pre-processing, iii. Pattern discovery, and iv. Pattern analysis.
1. Source Data Collection
The direct source of data in web mining is mostly web log files, which are stored on
the web server. Web log files record all the behaviour of the user at that time on
the web, including the server log, agent log and client log.
2. Data Pre-processing
The data which generally gets collected from the web has features that are
incomplete, redundant and ambiguous. Pre-processing provides accurate and concise
data for data mining. This is the technique used to clean server log files to
eliminate irrelevant data, hence its importance for web log analysis in web mining. It
includes data cleaning, user identification, user session identification, access path
supplements and transaction identification (Fig. 2), the details of which are as below.
2.5. Transaction Identification
This process is totally based on the user session identification. Web transactions are
divided or combined according to the demand of the data mining tools.
3. Pattern Discovery
There are many types of access pattern mining that can be performed depending on
the needs of the analyst. Some of these are path analysis, association rule discovery,
sequential pattern discovery, clustering analysis and classification.
3.1 Path Analysis
The physical layout of any website is presented in graphical form. A web page is
denoted as a node and the hyperlink between two pages is represented as an edge
in the graph.
3.2 Association Rules
The association rules focus mainly on the discovery of relations between pages
visited by the users on the website. Association rules can be used to relate the web
pages that are most often accessed together in a single server session. These pages
may be interlinked to one another with the help of hyperlinks. For instance, an
association rule for a BBA program is BBA/seminar.html and BBA/speakers.html,
whereby seminar is related to speakers.
3.3 Sequential Pattern Discovery
This technique is used to find inter-session patterns such that the presence of a set
of items or data is followed by another item within the time allotted to that session.
With the help of this approach, web sellers or buyers can predict future visit patterns,
which will be helpful in placing advertisements aimed at certain user groups. There are
also some other techniques which are useful for sequential patterns, including
change-point detection, similarity analysis and trend analysis.
3.4 Classification Analysis
Classification is the mapping of data into one or several predefined classes, i.e., it
classifies data according to predefined categories. Classification can be done with the
help of algorithms such as decision tree classifiers and naive Bayesian classifiers. Web
mining classification techniques allow one to develop a profile for clients who access a
particular server file, based on their access patterns.
3.5 Clustering Analysis
Cluster analysis is the most popular technique used in web mining, wherein a set of
items or data which have similar attributes or characteristics is grouped together. It
can help with the marketing decisions of marketers. Clustering user information from
web transaction logs can inform future marketing strategies, both online and off-line.
Two types of clustering methods are used, namely hierarchical clustering
(agglomerative vs. divisive and single link vs. complete link) and partition clustering
(distance-based, model-based and density-based).
4. Pattern Analysis
Its main purpose is to find valuable patterns. Many types of techniques are used for
analysis, such as visualization tools, OLAP techniques, data and knowledge querying
and usability analysis (Fig. 2), whose details are given below:
4.1. Visualization Techniques
This is a natural choice for understanding the behaviour of web users. The web is
visualized as a directed graph with cycles, where pages are represented as nodes and
hyperlinks are denoted as edges of the graph.
4.2. OLAP (Online Analytical Processing)
This is a very powerful paradigm for the strategic analysis of databases in business
systems. OLAP can be performed directly on top of a relational database.
4.3 Data and Knowledge Query
This is the most important analysis pattern in web mining, whereby focus is given to
the proper analysis of user problems or user needs. Such focus is provided in two
different ways:
* Constraints may be placed on the database in a declarative language.
* A query may be performed on the knowledge that has been discovered and
extracted by the mining process.
4.4. Usability Analysis
In this analysis, details of software usability as well as user usability are given.
This approach can also be used in any model for assessing the access behaviour of
the user on the website.
Unit – V: Chapter-2 Recommender Systems
Feedback
Recommendation Tasks
Recommendation Techniques
Final Remarks
5.2.1 Feedback:
Recommender systems require certain feedback to perform
recommendations. That is why they require information on users' past behavior,
the behavior of other people, or content information about the domain to produce
predictions. It is possible to define the workflow of a recommendation process as:
Collecting information
Learning
Producing recommendations
There are three main ways for a recommender to collect information, which is also
known as feedback:
Implicit Feedback
Explicit Feedback
Hybrid Feedback
Implicit Feedback
There is no user participation required to gather implicit feedback, unlike explicit
feedback. The system automatically tracks users' preferences by monitoring the
actions performed, such as which items they visited, where they clicked, which items
they purchased, or how long they stayed on a web page. One
must find the correct actions to track based on the domain that the recommender
system operates on. Another advantage of implicit feedback is that it reduces the
cold start problems that occur until an item is rated enough to be served as a
recommendation.
Explicit Feedback
To collect explicit feedback from the user, the system must ask
users to provide their ratings for items. After collecting the feedback, the system
knows how relevant or similar an item is to the user's preferences. Even though this
allows the recommender to learn the user's exact opinion, it requires direct
participation from the user and is therefore often not easy to collect. That is why there
are different ways to collect feedback from users. Implementing like/dislike
functionality on a web site gives users an easy way to evaluate the content.
Alternatively, the system can ask users to insert ratings, in which a discrete numeric
scale represents how much the user liked or disliked the content. Netflix often asks
customers to rate movies.
Another way to collect explicit feedback is to ask users to insert their comments as
text. While this is a great way to learn user opinion, it is usually not easy to obtain
and evaluate.
Hybrid Feedback
Hybrid Feedback uses both explicit and implicit feedback to
maximize prediction quality. To use the hybrid method, the system must be able to
collect explicit and implicit feedback from users.
5.2.2 Recommendation System:
Recommendation engines are a subclass of machine learning which
generally deal with ranking or rating products / users. Loosely defined, a
recommender system is a system which predicts ratings a user might give to a
specific item. These predictions are then ranked and returned to the user.
The most common types of recommendation systems which are widely used are:
Content-Based Filtering
Collaborative Filtering
Hybrid Recommendation Systems
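A minimal sketch of user-based collaborative filtering on a toy rating matrix follows; the ratings, similarity measure, and prediction rule are illustrative assumptions, not a production design.
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items); 0 = not rated
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    # Cosine similarity, treating unrated items as 0 for simplicity
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9)

def predict(user, item):
    """Predict a rating by weighting other users' ratings of the item
    by their similarity to the active user."""
    sims = np.array([cosine(R[user], R[other]) for other in range(len(R))])
    rated = R[:, item] > 0
    rated[user] = False              # exclude the active user
    if not rated.any():
        return 0.0
    return sims[rated] @ R[rated, item] / sims[rated].sum()

print(predict(user=0, item=2))       # predicted rating of user 0 for item 2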
consistency of the assignment with the already existing assignments and defined
set of constraints. If all the possible values of the current variable are inconsistent
with the existing assignments and the constraints, the constraint solver backtracks
which means that the previously instantiated variable is selected again. In the case
that a consistent assignment has been identified, a recursive activation of the
backtracking algorithm is performed and the next variable is selected [54].
Constraint Propagation.
and orders the result according to a similarity metric (defined in the ORDER BY
clause). Finally, instead of combining the mentioned standard constraint solvers
with MAUT, we can represent a recommendation task in the form of soft
constraints where the importance (preference) for each combination of variable
values is determined on the basis of a corresponding utility operation (for details
see, for example, [1]).
Example query (in the style of probabilistic databases):
Result = SELECT *                                       /* calculate a solution */
    FROM Products                                       /* select items from "Products" */
    WHERE x1 = a1 AND x2 = a2                           /* "must" criteria */
    ORDER BY score(abs(x3 - a3), ..., abs(xm - am))     /* similarity-based utility function */
    STOP AFTER N;                                       /* at most N items in the solution (result set) */
5.2.3 Recommendation Techniques:
There are six different recommendation techniques (approaches). They are:
1. Content-based: The system learns to recommend items that are similar to the
ones that the user liked in the past. The similarity of items is calculated based on
the features associated with the compared items. For example, if a user has
positively rated a movie that belongs to the comedy genre, then the system can
learn to recommend other movies from this genre. Chapter 3 provides an overview
of content-based recommender systems, imposing some order among the extensive
and diverse aspects involved in their design and implementation. It presents the
basic concepts and terminology of content-based RSs, their high level architecture,
and their main advantages and drawbacks. The chapter then surveys state-of-the-
art systems that have been adopted in several application domains. The survey
encompasses a thorough description of both classical and advanced techniques for
representing items and user profiles. Finally, it discusses trends and future
research which might lead towards the next generation of recommender systems.
2. Collaborative filtering: The simplest and original implementation of this
approach recommends to the active user the items that other users with similar
tastes liked in the past. The similarity in taste of two users is calculated based on
the similarity in the rating history of the users. This is why collaborative filtering is
referred to as "people-to-people correlation." Collaborative filtering is
considered to be the most popular and widely implemented technique in RS.
3. Demographic: This type of system recommends items based on the
demographic profile of the user. The assumption is that different recommendations
should be generated for different demographic niches. Many Web sites adopt
simple and effective personalization solutions based on demographics. For
example, users are dispatched to particular Web sites based on their language or
country. Or suggestions may be customized according to the age of the user. While
these approaches have been quite popular in the marketing literature, there has
been relatively little proper RS research into demographic systems.
have no ratings. This does not limit content-based approaches since the prediction
for new items is based on their description (features) that are typically easily
available. Given two (or more) basic RS techniques, several ways have been
proposed for combining them to create a new hybrid system. As we have already
mentioned, the context of the user when she is
seeking a recommendation can be used to better personalize the output of the
system. For example, in a temporal context, vacation recommendations in winter
should be very different from those provided in summer. Or a restaurant
recommendation for a Saturday evening with your friends should be different from
that suggested for a workday lunch with co-workers.
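One simple way to combine two such techniques is a weighted hybrid; the R sketch below uses hypothetical score vectors produced by a content-based and a collaborative component.

# Hypothetical normalised scores from two different recommenders.
content_score <- c(item1 = 0.9, item2 = 0.2, item3 = 0.6)
collab_score  <- c(item1 = 0.4, item2 = 0.8, item3 = 0.7)
alpha <- 0.5                                        # relative weight of the two techniques
hybrid_score <- alpha * content_score + (1 - alpha) * collab_score
sort(hybrid_score, decreasing = TRUE)               # final hybrid ranking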
5.2.4 Final Remarks:
Recommendation algorithms can be divided into two broad paradigms: collaborative
approaches (such as user-user, item-item and matrix factorisation), which are based
only on the user-item interaction matrix, and content-based approaches (such as
regression or classification models), which use prior information about users and/or
items.
Memory-based collaborative methods do not assume any latent model and therefore
have low bias but high variance; model-based collaborative approaches assume a
latent interaction model that needs to learn both user and item representations from
scratch and so have a higher bias but a lower variance; content-based methods
assume a latent model built around explicitly given user and/or item features and
thus have the highest bias and the lowest variance.
Recommender systems are increasingly important in many big industries, and some
scale considerations have to be taken into account when designing the system
(better use of sparsity, iterative methods for factorisation or optimisation,
approximate techniques for nearest-neighbour search, ...).
Unit – V: Chapter-3: Social Network Analysis
Representing Social Networks
Basic Properties of Nodes
Basic and Structural Properties of Networks
Ego network analysis finds the relationships around individual people. The analysis
is done for a particular sample of people chosen from the whole population; this
sampling is done randomly. The attributes involved in ego network analysis are the
size, diversity, etc. of a person's network. The analysis is traditionally carried out
through surveys, in which people are asked with whom they interact and what their
relationship with those contacts is. It is not focused on finding the relationships
between everyone in the sample; rather, it is an effort to estimate the density of the
network around each sampled person. Hypotheses about these networks are tested
using statistical hypothesis testing techniques.
The following functions are served by Ego Networks:
Efficient propagation of information.
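As a minimal sketch of how an ego network and its density might be computed, assuming the igraph package is available (the names and ties below are illustrative survey responses):

library(igraph)
g <- graph_from_data_frame(data.frame(
  from = c("Ego", "Ego", "Ego", "Ann", "Ben"),
  to   = c("Ann", "Ben", "Cara", "Ben", "Cara")
), directed = FALSE)

ego_net <- make_ego_graph(g, order = 1, nodes = "Ego")[[1]]  # ego plus direct contacts
vcount(ego_net) - 1                # size of the ego network (number of alters)
edge_density(ego_net)              # density of the ties in the ego network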
Complete network analysis analyses the relationships among all members of a
sample of people chosen from a larger population. Subgroup analysis, centrality
measures, and equivalence analysis are all based on complete network analysis.
These measures help an organization or company make decisions with the help of
the relationships they reveal. Testing the sample shows the relationships in the
whole network, since the sample is taken from a single set of domains.
Difference between ego network analysis and complete network analysis:
The difference between ego and complete network analysis is that ego network
analysis focuses on collecting the relationships of the sampled people with the
outside world, whereas complete network analysis focuses on finding the
relationships among the sampled people themselves.
The majority of network analyses are done only for a particular domain or a single
organization and are not focused on the relationships between organizations. For
this reason, most social network analysis measures use complete network analysis.
Representing Social Networks:
Nodes (for example A, B, C, D, E) usually represent entities in the network and can
hold self-properties (such as weight, size, position and any other attribute) and
network-based properties (such as degree, the number of neighbors, or cluster, the
connected component the node belongs to).
Edges represent the connections between the nodes and may hold properties as
well (such as a weight representing the strength of the connection, a direction in the
case of an asymmetric relation, or a time stamp if applicable).
These two basic elements can describe multiple phenomena, such as social
connections, virtual routing networks, physical electricity networks, road networks,
biological relation networks and many other relationships.
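A small sketch of this representation in R, assuming the igraph package is available (the attribute values are illustrative):

library(igraph)
g <- graph_from_literal(A - B, A - C, B - C, C - D, D - E)

V(g)$size   <- c(3, 1, 4, 2, 1)    # self-properties of the nodes
V(g)$degree <- degree(g)           # network-based property: number of neighbors
E(g)$weight <- c(1, 2, 1, 3, 1)    # strength of each connection

V(g)$degree
components(g)$membership           # cluster (connected component) of each node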
Real-world networks
Real-world networks, and in particular social networks, have a unique structure
which often distinguishes them from random mathematical networks:
The Small World phenomenon claims that real networks often have very short paths
(in terms of number of hops) between any connected network members. This
applies to real and virtual social networks (the "six handshakes" theory) and to
physical networks such as airports or electricity or web-traffic routings.
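For instance, assuming igraph is available, the average shortest path length of a toy network can be checked as follows (the network itself is illustrative):

library(igraph)
g <- graph_from_literal(A - B, B - C, C - D, D - E, E - F, A - D, B - F)
mean_distance(g)                   # average shortest path length, in hops
diameter(g)                        # longest shortest path in the network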
Centrality Measures
Highly central nodes play a key role in a network, serving as hubs for different
network dynamics. However, the definition and importance of centrality might differ
from case to case and may refer to different centrality measures:
Degree: the number of neighbors of the node
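For example, degree centrality can be computed directly, again assuming igraph and an illustrative toy graph:

library(igraph)
g <- graph_from_literal(A - B, A - C, B - C, C - D, D - E)
deg <- degree(g)                   # number of neighbors of each node
sort(deg, decreasing = TRUE)       # the highest-degree nodes act as hubs
names(which.max(deg))              # the single most central node by degree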
5.3.2 Basic Properties of Nodes:
Nodes usually represent entities in the network and can hold self-properties (such
as weight, size, position and any other attribute) and network-based properties
(such as degree, the number of neighbors, or cluster, the connected component the
node belongs to).
Properties of Nodes:
1. Supported data types
The following are the supported property types, together with their corresponding
fallback values:
Long - Long.MIN_VALUE
Double - NaN
Most algorithms that are capable of using node properties require a specific
property type. In cases of a mismatch between the type of the provided property
and the required type, the library will try to convert the property value into the
required type. This automatic conversion only happens when the following
conditions are satisfied:
Neither the given nor the expected type is an Array type.
o Long to Double: the Long value does not exceed the supported range of the
Double type.
o Double to Long: the Double value does not have any decimal places.
The algorithm computation will fail if any of these conditions are not satisfied
for any node property value.
5.3.3 Basic and Structural Properties of Networks:
1. Connectivity (Beta-Index)
2. Diameter of a graph
3. Accessibility of vertices and places
4. Centrality / Location in the network
5. Hierarchies in trees
Connectivity (Beta-Index):
The simplest measure of the degree of connectivity of a graph is given by the Beta
index (β). It measures the density of connections and is defined as
β = E / V,
where E is the total number of edges and V is the total number of vertices in the
network.
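A quick check of the Beta index in R, assuming igraph and an illustrative toy network:

library(igraph)
g <- graph_from_literal(A - B, B - C, C - D, D - A, A - C)
beta <- ecount(g) / vcount(g)      # beta = E / V
beta                               # 5 edges / 4 vertices = 1.25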
Diameter of a graph:
Another measure for the structure of a graph is its diameter. The diameter δ is an
index measuring the topological length or extent of a graph by counting the number
of edges in the shortest path between the most distant vertices:
δ = max over all vertex pairs (i, j) of s(i, j),
where s(i, j) is the number of edges in the shortest path from vertex i to vertex j.
With this formula, first all the shortest paths between all pairs of vertices are found;
then the longest of them is chosen. The diameter therefore describes the longest
shortest path between any two vertices of a graph.
In addition to the purely topological application, actual track lengths or any other
weight (e.g. travel time) can be assigned to the edges. This suggests a more
complex measurement based on the metric of the network. The resulting index is π
= mT/mδ, where mT is the total mileage of the network and mδ is the total
mileage of the network's diameter. The higher π is, the denser the network.
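A small R sketch of both measures, assuming igraph; the network and its edge lengths (read as kilometres) are illustrative, and mδ is taken here as the metric length of the longest shortest weighted path:

library(igraph)
g <- graph_from_literal(A - B, B - C, C - D, D - E, B - E)
E(g)$weight <- c(4, 3, 5, 2, 6)                  # metric edge lengths

diameter(g, weights = NA)                        # topological diameter: number of edges
m_total <- sum(E(g)$weight)                      # total mileage of the network
m_diam  <- diameter(g, weights = E(g)$weight)    # mileage of the weighted diameter
m_total / m_diam                                 # pi index: the higher, the denser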
Accessibility of vertices and places:
A frequent type of analysis in transport networks is the investigation of the
accessibility of certain traffic nodes and the developed areas around them. A
measure of accessibility of a vertex i is calculated by summing the shortest node
distances from i to every other vertex:
A(i) = Σ n(i, j), summed over all j = 1, ..., v,
where v is the number of vertices in the network and n(i, j) is the shortest node
distance (i.e. the number of intermediate nodes along the shortest path) between
vertex i and vertex j. For each node i the sum of all shortest node distances n(i, j)
is calculated, which can efficiently be done with a distance matrix. The higher the
sum, the lower the accessibility; the lower the sum, the better the accessibility.
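A minimal R sketch of this accessibility measure, assuming igraph and an illustrative toy network; n(i, j) is taken as the number of intermediate nodes on the shortest path:

library(igraph)
g <- graph_from_literal(A - B, B - C, C - D, D - E, C - E)
hops <- distances(g)               # shortest path lengths in edges
n_ij <- pmax(hops - 1, 0)          # node distance: number of intermediate nodes
rowSums(n_ij)                      # accessibility A(i): the lower the sum, the better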
The importance of the node distance lies in the fact that nodes may also be
transfer stations, transfer points for goods, or subway stations. Therefore, a large
node distance hinders travel through the network.
where e is the number of edges and s(i, j) the shortest weighted path between two
nodes.
Centrality / Location in the network
The first measure of centrality was developed by König in 1936 and is called the
König number Ki. Let s(i, j) denote the number of edges in the shortest path from
vertex i to vertex j. The König number for vertex i is then defined as
Ki = max over all vertices j of s(i, j),
i.e. Ki is the length of the longest shortest path originating from vertex i. It is a
measure of topological distance in terms of edges and suggests that vertices with a
low König number occupy a central place in the network.
Once the shortest edge distances between all nodes have been determined, the
largest value in each column (or row) of the distance matrix is the König number of
the corresponding vertex. The vertex with the smallest König number is centrally
located, while vertices with large König numbers are peripheral.
The method for determining the König number is also applicable to a weighted
distance matrix: the same matrix used in the accessibility example can be reused,
this time taking the maximum of each row instead of the sum.
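A minimal R sketch of the König number, assuming igraph and an illustrative toy network:

library(igraph)
g <- graph_from_literal(A - B, B - C, C - D, D - E, C - E)
s <- distances(g)                  # shortest edge distances between all vertex pairs
K <- apply(s, 1, max)              # König number: longest shortest path per vertex
K
names(which.min(K))                # the most central vertex (lowest König number)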
Hierarchies in trees
In quantitative geomorphology, more specifically in the field of fluvial morphology,
different methods for structuring and ordering hierarchical stream networks have
been developed. With such orderings, different networks can be compared with
each other (e.g. by the highest occurring order or the relative frequencies of the
individual levels), and sub-catchments can be segregated easily. Of the four common
ordering schemes, only three are topologically defined; the Horton scheme is the
only one that also takes the metric component into account.
Calculating the Strahler number, we start with the outermost branches of the tree:
the ordering value 1 is assigned to those segments of the stream. When two streams
with the same order come together, they form a stream with their order value plus
one; otherwise, the higher order of the two streams is used. The Strahler number is
formally defined as:
S(i) = 1, if segment i is an outermost (leaf) segment;
S(i) = the maximum order of its tributaries, if this maximum is reached by only one
tributary;
S(i) = that maximum + 1, if two or more tributaries share the maximum order.
First, an order according to Strahler is calculated. Then, the highest current order
larger than 2 is assigned to the longest (metric) branch in the remaining sub-trees.
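A minimal recursive R sketch of the Strahler ordering described above, on a hypothetical stream tree stored as a named list of child branches:

# Each entry maps a segment to its upstream tributaries; empty = outermost branch.
tree <- list(
  outlet = c("left", "right"),
  left   = c("l1", "l2"),
  right  = character(0),
  l1     = character(0),
  l2     = character(0)
)

strahler <- function(node) {
  kids <- tree[[node]]
  if (length(kids) == 0) return(1)                # outermost segments get order 1
  orders <- sapply(kids, strahler)
  top <- max(orders)
  if (sum(orders == top) >= 2) top + 1 else top   # equal highest orders merge to top + 1
}

strahler("outlet")                                # Strahler number of the whole network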