Data-Analytics - All UNITS

IV B.Tech. – I Semester
(19BT71201) DATA ANALYTICS
(Common to CSSE and IT)

Int. Marks   Ext. Marks   Total Marks   L   T   P   C
    40           60           100       3   -   -   3

PRE-REQUISITES: A Course on Database Management Systems.

COURSE DESCRIPTION: The course provides an introduction to Data Analytics and its life
cycle; a review of basic data analytic methods using R; advanced analytical theory and
methods; advanced analytics technology and tools, including in-database analytics; and
communicating and operationalizing an analytics project.

COURSE OUTCOMES: After successful completion of this course, the students will be able to:
CO1. Use Analytical Architecture and its life cycle in Data Analytics.
CO2. Analyze and Visualize the Data Analytics Methods using R.
CO3. Apply Advanced Analytical Methods for Text Analysis and Time-Series Analysis.
CO4. Develop Analytical Report for given Analytical problems.
CO5. Analyze and Design Data Analytics Application on Societal Issues.

DETAILED SYLLABUS:
UNIT I – INTRODUCTION TO DATA ANALYTICS and R (9 periods)
Practice in Analytics: BI versus Data Science, Current Analytical Architecture, Emerging
Big Data Ecosystem and a New Approach to Analytics. Data Analytics Life Cycle: Key
Roles for a Successful Analytics Project, Background and Overview of Data Analytics
Lifecycle Phases - Discovery Phase, Data Preparation Phase, Model Planning, Model
Building, Communicate Results, Operationalize. Introduction to R: R Graphical User
Interfaces, Data Import and Export, Attribute and Data Types, Descriptive Statistics.
UNIT II – BASIC DATA ANALYTICAL METHODS (9 periods)


Exploratory Data Analysis: Visualization Before Analysis, Dirty Data, Visualizing a
Single Variable, Examining Multiple Variables, Data Exploration Versus Presentation.
Statistical Methods for Evaluation: Hypothesis Testing, Difference of Means, Wilcoxon
Rank-Sum Test, Type I and Type II Errors, Power and Sample Size, ANOVA, Decision
Trees in R, Naïve Bayes in R.
UNIT III – ADVANCED ANALYTICAL TECHNOLOGY AND METHODS (9 periods)
Time Series Analysis: Overview of Time Series Analysis, Box-Jenkins Methodology,
ARIMA Model, Autocorrelation Function (ACF), Autoregressive Models, Moving Average
Models, ARMA and ARIMA Models, Building and Evaluating an ARIMA Model, Reasons to
Choose and Cautions.
Text Analysis: Text Analysis Steps, A Text Analysis Example, Collecting Raw Text,
Representing Text, Term Frequency—Inverse Document Frequency (TFIDF), Categorizing
Documents by Topics, Determining Sentiments, Gaining Insights.
UNIT IV – ANALYTICAL DATA REPORT AND VISUALIZATION (9 periods)
Communicating and Operationalizing an Analytics Project, Creating the Final
Deliverables: Developing Core Material for Multiple Audiences, Project Goals, Main
Findings, Approach, Model Description, Key Points Supported with Data, Model Details,
Recommendations, Additional Tips on Final Presentation, Providing Technical
Specifications and Code, Data Visualization.
UNIT V –DATA ANALYTICS APPLICATIONS (9 periods)
Text and Web: Data Acquisition, Feature Extraction, Tokenization, Stemming, Conversion
to Structured Data, Sentiment Analysis, Web Mining.
Recommender Systems: Feedback, Recommendation Tasks, Recommendation
Techniques, Final Remarks.

Social Network Analysis: Representing Social Networks, Basic Properties of Nodes,
Basic and Structural Properties of Networks.
Total Periods: 45
Topics for self-study are provided in the lesson plan
TEXT BOOKS:
1. EMC Education Services, Data Science and Big Data Analytics – Discovering,
Analyzing, Visualizing and Presenting Data, John Wiley and Sons, 2015.

2. Joao Moreira, Andre Carvalho, Andre Carlos Ponce de Leon Ferreira Carvalho, Tomas
Horvath, A General Introduction to Data Analytics, John Wiley and Sons,
1st Edition, 2019.

REFERENCE BOOKS:
1. Anil Maheshwari, Data Analytics Made Accessible, Lake Union Publishing,
1st Edition, 2017.
2. Richard Dorsey, Data Analytics: Become a Master in Data Analytics, CreateSpace
Independent Publishing Platform, 2017.

ADDITIONAL LEARNING RESOURCES:


1. https://www.tutorialspoint.com/excel_data_analysis/data_analysis_overview.html
2. https://data-flair.training/blogs/data-analytics-tutorial/
3. https://pythonprogramming.net/data-analysis-tutorials/

CO-PO-PSO Mapping Table


PO1 PO2 PO3 PO4 PO5 PO6 PO7 PO8 PO9 PO10 PO11 PO12 PSO1 PSO2 PSO3 PSO4
CO1 2 3 3 - 2 - - - - - - 3 2
CO2 2 3 - - 2 2 - - - - - 3 1
CO3 1 2 3 - 2 - - - - - - 3 2
CO4 2 3 3 2 2 - - - - - - 3 2
CO5 2 2 3 2 2 3 - - - - - 3 2
Average: 1.8 2.6 3 2 2 2 3 1.8
Level of correlation of the course: 2 3 3 2 3 2 3 2
Correlation Level: 3 - High, 2 - Medium, 1 - Low
Sree Sainath Nagar, A. Rangampet-517 102

Department of Information Technology

Lesson Plan

Name of the Subject : Data Analytics (19BT71201)


Name of the faculty Member : Dr. K. Khaja Baseer
Class & Semester : IV B.Tech I Semester
Section : IT- A&B
S.No. | Topic | No. of periods required | Book(s) followed | Course Outcomes | Blooms Level | Remarks
UNIT - I : INTRODUCTION TO DATA ANALYTICS and R
1. Practice in Analytics: BI versus Data Science, Current Analytical Architecture | 1 | T1 | CO1 | BL1
2. Emerging Big Data Ecosystem and a New Approach to Analytics | 1 | T1 | CO1 | BL4
3. Data Analytics Life Cycle: Key Roles for a Successful Analytics Project Background | 2 | T1 | CO1 | BL2
4. Overview of Data Analytics Lifecycle Phases - Discovery Phase, Data Preparation Phase, Model Planning, Model Building, Communicate Results, Operationalize | 2 | T1 | CO1 | BL4
5. Introduction to R: R Graphical User Interfaces, Data Import and Export | 1 | T1 | CO1 | BL2
6. Attribute and Data Types | 1 | T1 | CO1 | BL2
7. Descriptive Statistics | 1 | T1 | CO1 | BL2
Remarks: Analytics for Unstructured Data
Total no of periods required: 09
UNIT – II: BASIC DATA ANALYTICAL METHODS
8. Exploratory Data Analysis: Visualization Before Analysis, Dirty Data | 1 | T1 | CO2 | BL2
9. Visualizing a Single Variable, Examining Multiple Variables | 1 | T1 | CO2 | BL2
10. Data Exploration Versus Presentation | 1 | T1 | CO2 | BL2
11. Statistical Methods for Evaluation: Hypothesis Testing, Difference of Means | 1 | T1 | CO2 | BL4
12. Wilcoxon Rank-Sum Test | 1 | T1 | CO2 | BL4
13. Type I and Type II Errors, Power and Sample Size | 1 | T1 | CO2 | BL3
14. ANOVA | 1 | T1 | CO2 | BL4
15. Decision Trees in R, Naïve Bayes in R | 2 | T1 | CO2 | BL3
Remarks: Case Study: Analysis and Forecasting of House Price Indices
Total no of periods required: 09
UNIT - III: ADVANCED ANALYTICAL TECHNOLOGY AND METHODS
16. Time Series Analysis: Overview of Time Series Analysis | 1 | T1 | CO3 | BL2
17. Box-Jenkins Methodology, ARIMA Model | 1 | T1 | CO3 | BL2
18. Autocorrelation Function (ACF), Autoregressive Models, Moving Average Models | 1 | T1 | CO3 | BL3
19. ARMA and ARIMA Models | 1 | T1 | CO3 | BL2
20. Building and Evaluating an ARIMA Model, Reasons to Choose and Cautions | 1 | T1 | CO3 | BL2
21. Text Analysis: Text Analysis Steps, A Text Analysis Example | 1 | T1 | CO3 | BL3
22. Collecting Raw Text, Representing Text | 1 | T1 | CO3 | BL2
23. Term Frequency—Inverse Document Frequency (TFIDF), Categorizing Documents by Topics | 1 | T1 | CO3 | BL2
24. Determining Sentiments, Gaining Insights | 1 | T1 | CO3 | BL4
Remarks: Case Study: Customer Response Prediction and Profit Optimization
Total no of periods required: 09
UNIT - IV : ANALYTICAL DATA REPORT AND VISUALIZATION
25. Communicating and Operationalizing an Analytics Project, Creating the Final Deliverables: Developing Core Material for Multiple Audiences | 1 | T1 | CO4 | BL2
26. Project Goals, Main Findings | 1 | T1 | CO4 | BL3
27. Approach, Model Description | 2 | T1 | CO4 | BL3
28. Key Points Supported with Data | 1 | T1 | CO4 | BL4
29. Model Details, Recommendations | 1 | T1 | CO4 | BL3
30. Additional Tips on Final Presentation | 1 | T1 | CO4 | BL4
31. Providing Technical Specifications and Code | 1 | T1 | CO4 | BL2
32. Data Visualization | 1 | T1 | CO4 | BL2
Remarks: Case Study: Predictive Modeling of Big Data with Limited Memory
Total no of periods required: 09
UNIT – V: DATA ANALYTICS APPLICATIONS
33. Text and Web: Data Acquisition, Feature Extraction | 1 | T1 | CO5 | BL2
34. Tokenization, Stemming, Conversion to Structured Data | 1 | T1 | CO5 | BL2
35. Sentiment Analysis, Web Mining | 1 | T1 | CO5 | BL3
36. Recommender Systems: Feedback | 1 | T1 | CO5 | BL4
37. Recommendation Tasks | 1 | T1 | CO5 | BL2
38. Recommendation Techniques, Final Remarks | 1 | T1 | CO5 | BL3
39. Social Network Analysis: Representing Social Networks | 1 | T1 | CO5 | BL4
40. Basic Properties of Nodes | 1 | T1 | CO5 | BL2
41. Basic and Structural Properties of Networks | 1 | T1 | CO5 | BL2
Remarks: Alpine Miner and Data Wrangler
Total no of periods required: 09
Grand total periods required: 45

TEXT BOOKS:

T1. EMC Education Services, Data Science and Big Data Analytics – Discovering,
Analyzing, Visualizing and Presenting Data, John Wiley and Sons, 2015.
T2. Joao Moreira, Andre Carvalho, Andre Carlos Ponce de Leon Ferreira Carvalho, Tomas
Horvath, A General Introduction to Data Analytics, John Wiley and Sons,
1st Edition, 2019.

UNIT-I: Chapter-I
PRACTICE IN ANALYTICS
 BI versus Data Science
 Current Analytical Architecture
 Emerging Big Data Ecosystem and New Approach to Analytics.

Big Data Overview


 Data is created constantly, and at an ever-increasing rate.
 Mobile phones, social media, and imaging technologies used to determine a medical
diagnosis all create new data, and that data must be stored somewhere for some
purpose.
 Devices and sensors automatically generate diagnostic information that needs to be
stored and processed in real time.

Points to Remember:
• Keeping up with this huge influx of data is difficult.
• More challenging still is analyzing vast amounts of data that does not conform to
traditional notions of data structure, in order to identify meaningful patterns and
extract useful information.
Note:
 These challenges of the data deluge present the opportunity to transform
business, government, science, and everyday life.

Several industries have led the way in developing their ability to gather and exploit
data:
• Credit card companies monitor every purchase their customers make and can
identify fraudulent purchases with a high degree of accuracy using rules derived by
processing billions of transactions, as illustrated in figure 1.

Figure.1 Fraud Detection

• Mobile phone companies analyze subscribers' calling patterns to determine, for
example, whether a caller's frequent contacts are on a rival network. If that rival
network is offering an attractive promotion that might cause the subscriber to defect,
the mobile phone company can proactively offer the subscriber an incentive to
remain in her contract.

• For companies such as LinkedIn and Facebook, data itself is their primary product.
The valuations of these companies are heavily derived from the data they gather and
host, which contains more and more intrinsic value as the data grows.

Three attributes stand out as defining Big Data characteristics:


1. Huge volume of data: Rather than thousands or millions of rows, Big Data can be
billions of rows and millions of columns.
2. Complexity of data types and structures: Big Data reflects the variety of new
data sources, formats, and structures, including digital traces being left on the
web and other digital repositories for subsequent analysis.
3. Speed of new data creation and growth: Big Data can describe high velocity
data, with rapid data ingestion and near real time analysis.

Figure.2 Attributes of Big Data

What is BIG DATA?

 'Big Data' is similar to 'small data', but bigger in size.

 Because the data is bigger, it requires different approaches:
– Techniques, tools and architecture
 An aim to solve new problems, or old problems in a better way.
 Walmart handles more than 1 million customer transactions every hour.
 Facebook handles 40 billion photos from its user base.

No single definition; here is from Wikipedia:


 Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications.
 The challenges include capture, curation, storage, search, sharing, transfer, analysis,
and visualization.
 The trend to larger data sets is due to the additional information derivable from
analysis of a single large set of related data, as compared to separate smaller sets
with the same total amount of data, allowing correlations to be found to "spot
business trends, determine quality of research, prevent diseases, link legal citations,
combat crime, and determine real-time roadway traffic conditions."

Another definition of Big Data comes from the McKinsey Global report from 2011:
• Big Data is data whose scale, distribution, diversity, and/or timeliness require the use
of new technical architectures and analytics to enable insights that unlock new
sources of business value.

• Social media and genetic sequencing are among the fastest-growing sources of
Big Data and examples of untraditional sources of data being used for analysis.

For Example:
• In 2012 Facebook users posted 700 status updates per second worldwide, which
can be leveraged to deduce latent interests or political views of users and show
relevant ads.

• For instance, an update in which a woman changes her relationship status from
"single" to "engaged" would trigger ads on bridal dresses, wedding planning, or
name-changing services.

• Facebook can also construct social graphs to analyze which users are connected to
each other as an interconnected network.
• In March 2013, Facebook released a new feature called "Graph Search," enabling
users and developers to search social graphs for people with similar interests,
hobbies, and shared locations.

• Another example comes from genomics. Genetic sequencing and human genome
mapping provide a detailed understanding of genetic makeup and lineage. The health
care industry is looking toward these advances to help predict which illnesses a
person is likely to get in his lifetime and take steps to avoid these maladies.

Application of Big Data analytics

Data Structures
 Big data can come in multiple forms.
 It includes structured and unstructured data such as financial data, text files,
multimedia files, and genetic mappings.
 Contrary to much of the traditional data analysis performed by organizations, most
Big Data is unstructured or semi-structured in nature, which requires different
techniques and tools to process and analyze.
 E.g.: R programming, Tableau Public, SAS, Python, ...

Figure.3 Big Data Characteristics


Figure 3 shows four types of data structures.
1. Structured data: Data containing a defined data type, format, and structure, e.g.,
 transaction data, online analytical processing [OLAP] data cubes, traditional
DBMS, CSV files, and even simple spreadsheets.
 Note: A CSV file stores tabular data (numbers and text) in plain text.
2. Semi-structured data: Textual data files with a discernible pattern that enables
parsing such as:
 Extensible Markup Language [XML] data files that are self-describing and defined by
an XML schema (shown in figure 4).

3. Quasi-structured data: Textual data with erratic data formats that can be
formatted with effort, tools, and time (for instance, web click stream data that may
contain inconsistencies in data values and formats) shown in figure 5.
 E.g.: Google Queries
4. Unstructured data: Data that has no inherent structure, which may include text
documents, PDFs, images, and video.

Figure.4 Semi-structured data

Figure.5 Quasi-structured data

Table 1 shows the types of data repositories from an analyst perspective, and Table 2
shows the business drivers for advanced analytics.

Analyst Perspective on Data Repositories
TABLE 1. Types of Data Repositories, from an Analyst Perspective

State of the Practice in Analytics


TABLE 2. Business Drivers for Advanced Analytics

UNIT-I: Chapter-I
PRACTICE IN ANALYTICS
1.1.1. BI versus Data Science
 Current Analytical Architecture
 Emerging Big Data Ecosystem and
 New Approach to Analytics.

Figure.6 Business Intelligence (BI) vs. Data Science
Figure 6 shows the difference between Business Intelligence vs. Data Science.

UNIT-I: Chapter-I
PRACTICE IN ANALYTICS
 BI versus Data Science
1.1.2 Current Analytical Architecture
 Emerging Big Data Ecosystem and
 New Approach to Analytics.

Figure.7 Analytical Architecture


1. For data sources to be loaded into the data warehouse, data needs to be well
understood, structured, and normalized with the appropriate data type
definitions.
• Although this kind of centralization enables security, backup, and
failover of highly critical data, it also means that data typically must
go through significant preprocessing and checkpoints before it can
enter this sort of controlled environment, which does not lend itself to
data exploration and iterative analytics.
2. As a result of this level of control on the EDW, additional local systems may
emerge in the form of departmental warehouses and local data marts that business
users create to accommodate their need for flexible analysis.
• These local data marts may not have the same constraints for security and
structure as the main EDW and allow users to do some level of more in-depth
analysis.
• However, these one-off systems reside in isolation, often are not synchronized or
integrated with other data stores, and may not be backed up.
3. Once in the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes.

• These are high-priority operational processes getting critical data feeds from the
data warehouses and repositories.

4. At the end of this workflow, analysts get data provisioned for their downstream
analytics, as shown in figure 7.
• Because users generally are not allowed to run custom or intensive analytics on
production databases, analysts create data extracts from the EDW to analyze data
offline in R or other local analytical tools.
• Many times these tools are limited to in-memory analytics on desktops analyzing
samples of data, rather than the entire population of a dataset.
• Because these analyses are based on data extracts, they reside in a separate
location, and the results of the analysis, along with any insights on the quality of the
data or anomalies, rarely are fed back into the main data repository.

Drivers (Sources) of Big Data

The data now comes from multiple sources (as shown in figure 8), such as
these:
 Medical information, such as genomic sequencing and diagnostic imaging
 Photos and video footage uploaded to the World Wide Web
 Video surveillance, such as the thousands of video cameras spread across a city
 Mobile devices, which provide geospatial location data of the users, as well as
metadata about text messages, phone calls, and application usage on smart phones
 Smart devices, which provide sensor-based collection of information from smart
electric grids, smart buildings, and many other public and industry infrastructures
 Nontraditional IT devices, including the use of radio-frequency identification
(RFID) readers, GPS navigation systems, and seismic processing

FIGURE.8 Data evolution and the rise of Big Data sources

UNIT-I: Chapter-I
PRACTICE IN ANALYTICS
 BI versus Data Science
 Current Analytical Architecture
1.1.3 Emerging Big Data Ecosystem and New Approach to Analytics.
 Four main groups of players (shown in figure 9)
 Data devices
 Games, smartphones, computers, etc.
 Data collectors
 Phone and TV companies, Internet, government, etc.
 Data aggregators – make sense of data
 Websites, credit bureaus, media archives, etc.
 Data users and buyers
 Banks, law enforcement, marketers, employers, etc.

Figure.9 Main groups in Big Data

Key Roles for the New Big Data Ecosystem (shown in below figures)
1. Deep analytical talent
• Advanced training in quantitative disciplines, e.g., math, statistics,
machine learning
• Ex. of professionals: statisticians, economists, mathematicians, and the new
role of the Data Scientist
2. Data savvy professionals
• Savvy but less technical than group 1
• Ex. of professionals: financial analysts, market research analysts, life
scientists, operations managers, and business and functional managers
3. Technology and data enablers
• Support people, e.g., DB admins, programmers, etc.

• Data scientists are generally thought of as having five main sets of skills and
behavioral characteristics, as shown in Figure.

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
 Background and Overview of Data Analytics Lifecycle Phases
 Discovery Phase
 Data Preparation Phase
 Model Planning
 Model Building
 Communicate Results
 Operationalize

Data Analytics Lifecycle Overview


• The Data Analytics Lifecycle is designed specifically for Big Data problems and data
science projects. The lifecycle has six phases, shown in figure 10:

Figure.10 Data Analytics Lifecycle

1.2.1 Key Roles for a Successful Analytics Project

Figure.11 Key roles in Big Data

 Business User – understands the domain area


 Project Sponsor – provides requirements
 Project Manager – ensures meeting objectives
 Business Intelligence Analyst – provides business domain expertise based on deep
understanding of the data
 Database Administrator (DBA) – creates DB environment
 Data Engineer – provides technical skills, assists data management and extraction,
supports analytic sandbox
 Data Scientist – provides analytic techniques and modeling

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
1.2.2. Background and Overview of Data Analytics Lifecycle Phases
 Discovery Phase
 Data Preparation Phase
 Model Planning
 Model Building
 Communicate Results
 Operationalize

 Data Analytics Lifecycle defines the analytics process and best practices from
discovery to project completion
 The Lifecycle employs aspects of:
 Scientific method

 Cross Industry Standard Process for Data Mining (CRISP-DM)
 Process model for data mining
 Davenport's DELTA framework
 Hubbard's Applied Information Economics (AIE) approach
 MAD Skills: New Analysis Practices for Big Data by Cohen et al.

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
 Background and Overview of Data Analytics Lifecycle Phases
1.2.3 Discovery Phase
 Data Preparation Phase
 Model Planning
 Model Building
 Communicate Results
 Operationalize

Figure.12 Data Analytics Lifecycle: Discovery
Phase 1- Discovery (shown in figure 12):
• In Phase 1, the team learns the business domain, including relevant history such
as whether the organization or business unit has attempted similar projects in the
past from which they can learn.
• The team assesses the resources available to support the project in terms of
people, technology, time, and data.
• Important activities in this phase include framing the business problem as an
analytics challenge that can be addressed in subsequent phases and formulating
initial hypotheses (IHs) to test and begin learning the data.

1. Learning the Business Domain


2. Resources
3. Framing the Problem
4. Identifying Key Stakeholders
5. Interviewing the Analytics Sponsor
6. Developing Initial Hypotheses
7. Identifying Potential Data Sources

1. Learning the Business Domain
• Understanding the domain area of the problem is essential.
• At this early stage in the process, the team needs to determine how much business
or domain knowledge the data scientist needs to develop models in Phases 3 and 4.
• These data scientists have deep knowledge of the methods, techniques, and
ways for applying heuristics to a variety of business and conceptual problems.

2. Resources:
• Ensure the project team has the right mix of domain experts, customers,
analytic talent, and project management to be effective.
• In addition, evaluate how much time is needed and if the team has the right
breadth and depth of skills.
• After taking inventory of the tools, technology, data, and people, consider if the
team has sufficient resources to succeed on this project, or if additional resources are
needed. Negotiating for resources at the outset of the project, while scoping the
goals, objectives, and feasibility, is generally more useful than doing so later in the
process, and it ensures sufficient time to execute the project properly.

3. Framing the Problem


• Framing the problem well is critical to the success of the project. Framing is the
process of stating the analytics problem to be solved.
4. Identifying Key Stakeholders
• The team can identify the success criteria, key risks, and stakeholders, which
should include anyone who will benefit from the project or will be significantly
impacted by the project.

5. Interviewing the Analytics Sponsor


• Prepare for the interview; draft questions, and review with colleagues.
• Use open-ended questions; avoid asking leading questions.
• Probe for details and pose follow-up questions.
• Avoid filling every silence in the conversation; give the other person time to think.
• Let the sponsors express their ideas and ask clarifying questions, such as "Why? Is
that correct? Is this idea on target? Is there anything else?"
• Use active listening techniques; repeat back what was heard to make sure the team
heard it correctly, or reframe what was said.
• Try to avoid expressing the team's opinions, which can introduce bias; instead, focus
on listening.
• Be mindful of the body language of the interviewers and stakeholders; use eye
contact where appropriate, and be attentive.
• Minimize distractions.
• Document what the team heard, and review it with the sponsors.

Following is a brief list of common questions that are helpful to ask during the
discovery phase when interviewing the project sponsor. The responses will begin to
shape the scope of the project and give the team an idea of the goals and objectives
of the project.
• What business problem is the team trying to solve?
• What is the desired outcome of the project?
• What data sources are available?
• What industry issues may impact the analysis?
• What timelines need to be considered?
• Who could provide insight into the project?
• Who has final decision-making authority on the project?

• How will the focus and scope of the problem change if the following dimensions
change:
• Time: Analyzing 1 year or 10 years' worth of data?
• People: Assess impact of changes in resources on project timeline.
• Risk: Conservative to aggressive
• Resources: None to unlimited (tools, technology, systems)
• Size and attributes of data: Including internal and external data sources

6. Developing Initial Hypotheses


• Developing a set of IHs is a key facet of the discovery phase. This step involves
forming ideas that the team can test with data.
7. Identifying Potential Data Sources
• Identify data sources
• Capture aggregate data sources
• Review the raw data
• Evaluate the data structures and tools needed
• Scope the sort of data infrastructure needed for this type of problem

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
 Background and Overview of Data Analytics Lifecycle Phases
 Discovery Phase
1.2.4 Data Preparation Phase
 Model Planning
 Model Building
 Communicate Results
 Operationalize

Figure.13 Data Analytics Lifecycle: Data Prep
 Includes steps to explore, preprocess, and condition data.
 Create robust environment – analytics sandbox.
 Data preparation tends to be the most labor-intensive step in the analytics
lifecycle.
 Often at least 50% of the data science project's time.
 The data preparation phase is generally the most iterative and the one that teams
tend to underestimate most often, as shown in figure 13.

1. Preparing the Analytic Sandbox


2. Performing ETLT
3. Learning About the Data
4. Data Conditioning
5. Survey and Visualize
6. Common Tools for the Data Preparation Phase

1. Preparing the Analytic Sandbox

 Create the analytic sandbox (also called workspace).


 Allows team to explore data without interfering with live production data.
 Sandbox collects all kinds of data (expansive approach).
 The sandbox allows organizations to undertake ambitious projects beyond traditional
data analysis and BI to perform advanced predictive analytics.
 Although the concept of an analytics sandbox is relatively new, this concept has
become acceptable to data science teams and IT groups.

2. Performing ETLT (Extract, Transform, Load, Transform)

 In ETL users perform extract, transform, load.


 In the sandbox the process is often ELT – early load preserves the raw data which
can be useful to examine.
 Example – in credit card fraud detection, outliers can represent high-risk
transactions that might be inadvertently filtered out or transformed before being
loaded into the database.
 Depending on the size and number of the data sources, the team may need to
consider how to parallelize the movement of the datasets into the sandbox.
 For this purpose, moving large amounts of data is sometimes referred to as Big
ETL. The data movement can be parallelized by technologies such as Hadoop or
MapReduce.

3. Learning about the Data


 Becoming familiar with the data is critical.
 This activity accomplishes several goals:
 Determines the data available to the team early in the project.
 Highlights gaps – identifies data not currently available.
 Identifies data outside the organization that might be useful.

Learning about the Data: Sample Dataset Inventory

4. Data Conditioning

 Data conditioning includes cleaning data, normalizing datasets, and performing
transformations.
 Often viewed as a preprocessing step prior to data analysis, it might be
performed by data owner, IT department, DBA, etc.
 Best to have data scientists involved
 Data science teams prefer more data than too little

 Additional questions and considerations


 What are the data sources? Target fields?
 How clean is the data?
 How consistent are the contents and files? Missing or inconsistent values?
 Assess the consistency of the data types – numeric, alphanumeric?
 Review the contents to ensure the data makes sense
 Look for evidence of systematic error
5. Survey and Visualize

 Leverage data visualization tools to gain an overview of the data


 Shneiderman's mantra:
 “Overview first, zoom and filter, then details-on-demand”
 This enables the user to find areas of interest, zoom and filter to find more
detailed information about a particular area, then find the detailed data in that
area.
 Review data to ensure calculations are consistent
 Does the data distribution stay consistent?
 Assess the granularity of the data, the range of values, and the level of aggregation
of the data
 Does the data represent the population of interest?
 Check time-related variables – daily, weekly, monthly? Is this good enough?
 Is the data standardized/normalized? Scales consistent?
 For geospatial datasets, are state/country abbreviations consistent?

6. Common Tools for Data Preparation (shown in figure 14)

 Hadoop can perform parallel ingest and analysis.


 Alpine Miner provides a graphical user interface for creating analytic workflows.
 OpenRefine (formerly Google Refine) is a free, open source tool for working with
messy data.
 Similar to OpenRefine, Data Wrangler is an interactive tool for data cleansing and
transformation.

Figure.14 Common tools for data preparation

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
 Background and Overview of Data Analytics Lifecycle Phases
 Discovery Phase
 Data Preparation Phase
1.2.5 Model Planning
 Model Building
 Communicate Results
 Operationalize

Figure.15 Data Analytics Lifecycle: Model Planning


Activities to consider for model planning shown in figure 15:
 Assess the structure of the data – this dictates the tools and analytic
techniques for the next phase.
 Ensure the analytic techniques enable the team to meet the business
objectives and accept or reject the working hypotheses.
 Determine if the situation warrants a single model or a series of techniques
as part of a larger analytic workflow.
 Research and understand how other analysts have approached this kind of
problem or similar kinds of problems.

Phase 3: Model Planning in Industry Verticals

Example of other analysts approaching a similar problem

Data Exploration and Variable Selection


 Explore the data to understand the relationships among the variables to inform
selection of the variables and methods.
 A common way to do this is to use data visualization tools.
 Often, stakeholders and subject matter experts may have ideas.
 For example, some hypothesis that led to the project
 Aim for capturing the most essential predictors and variables.
 This often requires iterations and testing to identify key variables.
 If the team plans to run regression analysis, identify the candidate predictors and
outcome variables of the model.
Model Selection
 The main goal is to choose an analytical technique, or several candidates, based on
the end goal of the project.
 We observe events in the real world and attempt to construct models that emulate
this behavior with a set of rules and conditions.
 A model is simply an abstraction from reality
 Determine whether to use techniques best suited for structured data,
unstructured data, or a hybrid approach
 Teams often create initial models using statistical software packages such as R, SAS,
or Matlab.
 Which may have limitations when applied to very large datasets
 The team moves to the model building phase once it has a good idea about the type
of model to try.
Common Tools for the Model Planning Phase
 R has a complete set of modeling capabilities
 R contains about 5000 packages for data analysis and graphical presentation
 SQL Analysis services can perform in-database analytics of common data mining
functions, involved aggregations, and basic predictive models
 SAS/ACCESS provides integration between SAS and the analytics sandbox via
multiple data connections

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
 Background and Overview of Data Analytics Lifecycle Phases
 Discovery Phase
 Data Preparation Phase
 Model Planning
1.2.6. Model Building
 Communicate Results
 Operationalize

Figure.16 Data Analytics Lifecycle: Model Building


1. Execute the models defined in Phase 3
2. Develop datasets for training, testing, and production
3. Develop analytic model on training data, test on test data
4. Question to consider
i. Does the model appear valid and accurate on the test data?
ii. Does the model output/behavior make sense to the domain experts?
iii. Do the parameter values make sense in the context of the domain?
iv. Is the model sufficiently accurate to meet the goal?
v. Does the model avoid intolerable mistakes?
vi. Are more data or inputs needed?
vii. Will the kind of model chosen support the runtime environment?

viii. Is a different form of the model required to address the business problem?

Figure 16 shows Data Analytics Lifecycle: Model Building.

Common Tools for the Model Building Phase


 Commercial Tools
 SAS Enterprise Miner – built for enterprise-level computing and analytics
 SPSS Modeler (IBM) – provides enterprise-level computing and analytics
 Matlab – high-level language for data analytics, algorithms, data exploration
 Alpine Miner – provides GUI frontend for backend analytics tools
 STATISTICA and MATHEMATICA – popular data mining and analytics tools
 Free or Open Source Tools
 R and PL/R - PL/R is a procedural language for PostgreSQL with R
 Octave – language for computational modeling
 WEKA – data mining software package with analytic workbench
 Python – language providing toolkits for machine learning and analysis
 SQL – in-database implementations provide an alternative tool

UNIT-I: Chapter-II
Data Analytics Life Cycle
 Key Roles for a Successful Analytics Project
 Background and Overview of Data Analytics Lifecycle Phases
 Discovery Phase
 Data Preparation Phase
 Model Planning
 Model Building
1.2.7 Communicate Results
 Operationalize

Figure.17 Data Analytics Lifecycle: Communicate Results
In this last phase, the team communicates the benefits of the project more broadly
and sets up a pilot project to deploy the work in a controlled way, as shown in figure
17.
 Risk is managed effectively by undertaking small scope, pilot deployment before a
wide-scale rollout.
 During the pilot project, the team may need to execute the algorithm more
efficiently in the database rather than with in-memory tools like R, especially with
larger datasets.
 To test the model in a live setting, consider running the model in a production
environment for a discrete set of products or a single line of business.
 Monitor model accuracy and retrain the model if necessary.

1.2.8. Phase 6: Operationalize

Key outputs from a successful analytics project

Figure.18 Data Analytics Lifecycle: Operationalize

 Business user – tries to determine business benefits and implications.


 Project sponsor – wants business impact, risks, ROI.
 Project manager – needs to determine if project completed on time, within budget,
goals met.
 Business intelligence analyst – needs to know if reports and dashboards will be
impacted and need to change.
 Data engineer and DBA – must share code and document.
 Data scientist – must share code and explain model to peers, managers,
stakeholders.

Phase 6: Operationalize: Four main deliverables


 Although the seven roles represent many interests, the interests overlap and can be
met with four main deliverables shown in figure 18:
1. Presentation for project sponsors – high-level takeaways for executive
level stakeholders
2. Presentation for analysts – describes business process changes and
reporting changes, includes details and technical graphs
3. Code for technical people
4. Technical specifications of implementing the code

UNIT-I Chapter-III
1.3.1 Introduction to R Studio, Basic operations and import and
export of data using R Tool.
Agenda:
1. About Data Mining
2. About R and RStudio
3. Datasets
4. Data Import and Export
a. Save and Load R Data

 Data mining is the process of discovering interesting knowledge from large amounts
of data [Han and Kamber, 2000].
 It is an interdisciplinary field with contributions from many areas, such as:
o Statistics, machine learning, information retrieval, pattern recognition and
bioinformatics.
 Data mining is widely used in many domains, such as:
o Retail, Finance, telecommunication and social media.

 The main techniques for data mining include:


o Classification and prediction, clustering, outlier detection, association rules,
sequence analysis, time series analysis and text mining, and also some new
techniques such as social network analysis and sentiment analysis.

 In real world applications, a data mining process can be broken into six major
phases:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation and
6. Deployment
as defined by the CRISP-DM (Cross Industry Standard Process for Data Mining).

Figure 19 shows the different panels of the RStudio environment.
About R:
 R is a free software environment for statistical computing and graphics.
 It provides a wide variety of statistical and graphical techniques
(http://www.r-project.org/).
 R can be easily extended with 7324 packages available on CRAN (Comprehensive R
Archive Network) (http://cran.r-project.org/).
 To help users find out which R packages to use, the CRAN Task Views are a good
guide (http://cran.r-project.org/web/views/). They provide collections of packages
for different tasks. Some Task Views related to data mining are:
o Machine Learning & Statistical Learning
o Cluster Analysis & Finite Mixture Models
o Time Series Analysis
o Natural Language Processing
o Multivariate Statistics and
o Analysis of Spatial Data.

RStudio
 RStudio is an integrated development environment (IDE) for R and can run on
various operating systems like Windows, Mac OS X and Linux. It is a very useful and
powerful tool for R programming.

Figure.19 R Studio Panels

 When RStudio is launched for the first time, you can see a window similar to below
Figure. There are four panels:
1. Source panel (top left), which shows your R source code. If you cannot see the
source panel, you can find it by clicking menu "File", "New File" and then "R Script".
You can run a line or a selection of R code by clicking the "Run" button on top of the
source panel, or pressing "Ctrl + Enter".
2. Console panel (bottom left), which shows outputs and system messages as displayed
in a normal R console.
3. Environment/History/Presentation panel (top right), whose three tabs show,
respectively, all objects and functions loaded in R, a history of submitted R code, and
presentations generated with R.
4. Files/Plots/Packages/Help/Viewer panel (bottom right), whose tabs show,
respectively, a list of files, plots, R packages installed, help documentation and local
web content.

It is always a good practice to begin R programming with an RStudio project, which is a
folder in which to put your R code, data files and figures.
 To create a new project, click the "Project" button at the top-right corner and then
choose "New Project".
 After that, select "Create project from new directory" and then "Empty Project".
After typing a directory name, which will also be your project name, click "Create
Project" to create your project folder and files.

After that, create three folders as below:


1. code, where to put your R source code;
2. data, where to put your datasets; and
3. figures, where to put produced diagrams.

In addition to above three folders which are useful to most projects, depending on your
project and preference, you may create additional folders below:
1. rawdata, where to put all raw data,
2. models, where to put all produced analytics models, and
3. reports, where to put your analysis reports.

Datasets
1. The Iris Dataset
2. The Bodyfat Dataset

The iris dataset (https://archive.ics.uci.edu/ml/datasets/Iris) has been used for classification
in many research publications. It consists of 50 samples from each of three classes of iris
flowers [Frank and Asuncion, 2010]. One class is linearly separable from the other two, while
the latter are not linearly separable from each other. There are five attributes in the dataset:
1. sepal length in cm,
2. sepal width in cm,
3. petal length in cm,
4. petal width in cm, and
5. class: Iris Setosa, Iris Versicolour, and Iris Virginica.

> str(iris)
'data.frame': 150 observations (records, or rows) of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
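
A quick way to get a feel for the dataset (a brief illustration using base R functions; the
printed output is omitted here) is to check its size, preview the first rows, and summarize
each column:

> dim(iris)           # 150 rows and 5 columns
> head(iris)          # first six rows of the dataset
> summary(iris)       # min, quartiles, mean and max for numeric columns; counts for Species
> table(iris$Species) # 50 observations in each of the three classes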

2. The Bodyfat Dataset


Bodyfat is a dataset available in package TH.data [Hothorn, 2015]. It has 71 rows, and each
row contains information of one person. It contains the following 10 numeric columns.
• age: age in years.
• DEXfat: body fat measured by DXA, response variable.
• waistcirc: waist circumference.
• hipcirc: hip circumference.
• elbowbreadth: breadth of the elbow.
• kneebreadth: breadth of the knee.
• anthro3a: sum of logarithm of three anthropometric measurements.
• anthro3b: sum of logarithm of three anthropometric measurements.
• anthro3c: sum of logarithm of three anthropometric measurements.
• anthro4: sum of logarithm of three anthropometric measurements.

The value of DEXfat is to be predicted by the other variables.


> data("bodyfat", package = "TH.data")
> str(bodyfat)
data.frame: 71 obs. of 10 variables:
$ age : num 57 65 59 58 60 61 56 60 58 62 ...
$ DEXfat : num 41.7 43.3 35.4 22.8 36.4 ...
$ waistcirc : num 100 99.5 96 72 89.5 83.5 81 89 80 79 ...
$ hipcirc : num 112 116.5 108.5 96.5 100.5 ...
$ elbowbreadth: num 7.1 6.5 6.2 6.1 7.1 6.5 6.9 6.2 6.4 7 ...
$ kneebreadth : num 9.4 8.9 8.9 9.2 10 8.8 8.9 8.5 8.8 8.8 ...
$ anthro3a : num 4.42 4.63 4.12 4.03 4.24 3.55 4.14 4.04 3.91 3.66 ...
$ anthro3b : num 4.95 5.01 4.74 4.48 4.68 4.06 4.52 4.7 4.32 4.21 ...
$ anthro3c : num 4.5 4.48 4.6 3.91 4.15 3.64 4.31 4.47 3.47 3.6 ...
$ anthro4 : num 6.13 6.37 5.82 5.66 5.91 5.14 5.69 5.7 5.49 5.25 ...

Data Import and Export

Save and Load R Data


 Data in R can be saved as .Rdata files with function save() and .Rdata files can be
reloaded into R with load().
 With the code below, we first create a new object a as a numeric sequence (1, 2,
..., 10) and a second new object b as a vector of characters ('a', 'b', 'c', 'd', 'e').
 Object letters is a built-in vector in R of the 26 English letters, and letters[1:5] returns
the first five letters. We then save them to a file and remove them from R with
function rm(). After that, we reload both a and b from the file and print their values.
> a <- 1:10
> b <- letters[1:5]
> getwd() # shows the current working directory; use setwd() to change it
> save(a, b, file="mydatafile.Rdata")
> rm(a, b)
> load("mydatafile.Rdata")
> print(a)
[1] 1 2 3 4 5 6 7 8 9 10
> print(b)
[1] "a" "b" "c" "d" "e"

 An alternative way to save and load R data objects is using functions saveRDS() and
readRDS(). They work in a similar way as save() and load().
 The differences are:
a. multiple R objects can be saved into one single file with save(), but only one
object can be saved in a file with saveRDS(); and
b. readRDS() enables us to restore the data under a different object name, while
load() restores the data under the same object name as when it was saved.
> a <- 1:10
> saveRDS(a, file="./data/mydatafile2.rds")
> a2 <- readRDS("./data/mydatafile2.rds")
> print(a2)
[1] 1 2 3 4 5 6 7 8 9 10

R also provides function save.image() to save everything in current workspace into a single
file, which is very convenient to save your current work and resume it later, if the data
loaded into R are not very big.
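
A minimal sketch of using save.image() (the file name here is chosen only for illustration):

> save.image(file="myworkspace.RData") # save every object in the current workspace
> # later, possibly in a new R session:
> load("myworkspace.RData")            # restore all of the saved objects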

1.3.2 Implement Data Exploration and Visualization on different Datasets to explore
multiple and Individual Variables.

Agenda:
5. R If ... Else
6. R While Loop
7. R Functions
a. Creating a Function
b. Arguments
8. Data Structures
a. Vectors
b. Lists
c. Matrices
d. Arrays
e. Data Frames
9. R Graphics
a. Plot
b. Line
c. Scatterplot
d. Pie Charts
e. Bars
10.R Statistics
11.Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files

R If ... Else

 Conditions and If Statements


 R supports the usual logical conditions from mathematics:

Operator Name Example


== Equal x == y
!= Not equal x != y
> Greater than x > y
< Less than x < y
>= Greater than or equal to x >= y
<= Less than or equal to x <= y

Example:
a <- 33
b <- 200
if (b > a) {
print("b is greater than a")
}
Output: "b is greater than a"

Note: R uses curly brackets { } to define the scope in the code.


Else If
Example
a <- 33
b <- 33

if (b > a) {
print("b is greater than a")
} else if (a == b) {
print ("a and b are equal")
}

Output: "a and b are equal"

R Loops

Loops can execute a block of code as long as a specified condition is met.

Loops are handy because they save time, reduce errors, and they make code more readable.

R has two loop commands:

 while loops

 for loops

R While Loops
Example

#Print i as long as i is less than 6:

i <- 1
while (i < 6) {
print(i)
i <- i + 1
}

Note:
Break
 With the break statement, we can stop the loop even if the while condition is TRUE.
Next
 With the next statement, we can skip an iteration without terminating the loop.
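
A short illustration of break and next inside a while loop (a minimal sketch):

i <- 0
while (i < 10) {
  i <- i + 1
  if (i == 3) {
    next    # skip the rest of this iteration when i is 3
  }
  if (i == 6) {
    break   # stop the loop entirely when i reaches 6
  }
  print(i)  # prints 1, 2, 4 and 5
}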

For Loops
A for loop is used for iterating over a sequence.

Example
for (x in 1:10) {
print(x)
}

Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10

R Functions
 A function is a block of code which only runs when it is called.
 You can pass data, known as parameters, into a function.
 A function can return data as a result.
Creating a Function
To create a function, use the function() keyword:
Example
my_function <- function() {
# create a function with the name my_function
print("Hello World!")
}
my_function() # call the function named my_function

Arguments
 Information can be passed into functions as arguments.
Example
my_function <- function(fname) {
paste(fname, "Griffin")
}

my_function("Peter")
my_function("Lois")
my_function("Stewie")

Data Structures
Vectors

 A vector is simply a list of items that are of the same type.

 To combine the list of items to a vector, use the c() function and separate the items
by a comma.

Example
# Vector of strings
fruits <- c("banana", "apple", "orange")

# Print fruits
fruits

Example
# Vector of numerical values
numbers <- c(1, 2, 3)

# Print numbers
numbers

Example
# Vector of logical values

log_values <- c(TRUE, FALSE, TRUE, FALSE)

log_values

Vector Length

 To find out how many items a vector has, use the length() function.

Sort a Vector

 To sort items in a vector alphabetically or numerically, use the sort() function.
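
For example (a brief sketch of both functions):

fruits <- c("banana", "apple", "orange")
length(fruits)    # 3
sort(fruits)      # "apple" "banana" "orange" (alphabetical order)
numbers <- c(13, 3, 5)
sort(numbers)     # 3 5 13 (numerical order)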

Access Vectors

You can access the vector items by referring to its index number inside brackets []. The first
item has index 1, the second item has index 2, and so on. # Ex. fruits[1]

You can also access multiple elements by referring to different index positions with the c()
function.

Example
fruits <- c("banana", "apple", "orange", "mango", "lemon")

# Access the first and third item (banana and orange)


fruits[c(1, 3)]

Note:

1. Repeat Vectors: To repeat vectors, use the rep() function

2. The seq() function has three parameters: from is where the sequence starts, to is
where the sequence stops, and by is the interval of the sequence.
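
A short illustration of rep() and seq():

rep(c(1, 2, 3), times = 2)      # 1 2 3 1 2 3 (repeat the whole vector)
rep(c(1, 2, 3), each = 2)       # 1 1 2 2 3 3 (repeat each element)
seq(from = 0, to = 10, by = 2)  # 0 2 4 6 8 10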

Lists

A list in R can contain many different data types inside it. A list is a collection of data which is
ordered and changeable.

To create a list, use the list() function

Example

# List of strings
thislist <- list("apple", "banana", "cherry")

# Print the list


thislist

Access Lists
You can access the list items by referring to its index number, inside brackets. The first item
has index 1, the second item has index 2, and so on.

Check if Item Exists
To find out if a specified item is present in a list, use the %in% operator.
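
For example (a brief sketch; thislist is re-created here so the snippet is self-contained):

thislist <- list("apple", "banana", "cherry")
"apple" %in% thislist    # TRUE
"orange" %in% thislist   # FALSE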

Matrices

 A matrix is a two dimensional data set with columns and rows.

 A column is a vertical representation of data, while a row is a horizontal


representation of data.

 A matrix can be created with the matrix() function. Specify
the nrow and ncol parameters to set the number of rows and columns.

Example
# Create a matrix
thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
thismatrix
Access More Than One Row
 More than one row can be accessed if you use the c() function.
Ex. thismatrix[c(1,2),]
Access More Than One Column
Ex. thismatrix[, c(1,2)]
Add Rows and Columns
 Use the cbind() function to add additional columns in a Matrix.
 Use the rbind() function to add additional rows in a Matrix.
Remove Rows and Columns
Use the c() function to remove rows and columns in a Matrix.
#Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
Number of Rows and Columns
Use the dim() function to find the number of rows and columns in a Matrix.
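
A brief sketch combining cbind(), rbind() and dim() (values chosen only for illustration):

thismatrix <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
dim(thismatrix)                                # 3 2 (3 rows, 2 columns)
thismatrix <- cbind(thismatrix, c(7, 8, 9))    # add a third column
thismatrix <- rbind(thismatrix, c(10, 11, 12)) # add a fourth row
dim(thismatrix)                                # 4 3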

Arrays
 Compared to matrices, arrays can have more than two dimensions.
 We can use the array() function to create an array, and the dim parameter to specify
the dimensions.

Example
 # An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
thisarray
# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
multiarray

Output:
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
,,1
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
,,2
[,1] [,2] [,3]
[1,] 13 17 21
[2,] 14 18 22
[3,] 15 19 23
[4,] 16 20 24
Data Frames

 Data Frames are data displayed in a format as a table.

 Data Frames can have different types of data inside it. While the first column can
be character, the second and third can be numeric or logical. However, each
column should have the same type of data.

 Use the data.frame() function to create a data frame:

Example
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

# Print the data frame


Data_Frame
Summarize the Data

Use the summary() function to summarize the data from a Data Frame:

Example
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)

Data_Frame

summary(Data_Frame)
Output:
Training Pulse Duration
1 Strength 100 60
2 Stamina 150 30
3 Other 120 45
Training Pulse Duration
Other :1 Min. :100.0 Min. :30.0
Stamina :1 1st Qu.:110.0 1st Qu.:37.5
Strength:1 Median :120.0 Median :45.0
Mean :123.3 Mean :45.0
3rd Qu.:135.0 3rd Qu.:52.5
Max. :150.0 Max. :60.0
Access Items

 We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a


data frame.

Example:
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Data_Frame[1]
Data_Frame[["Training"]]
Data_Frame$Training
Output:
Training
1 Strength
2 Stamina
3 Other
[1] Strength Stamina Other
Levels: Other Stamina Strength
[1] Strength Stamina Other
Levels: Other Stamina Strength

Plot
 The plot() function is used to draw points (markers) in a diagram.
 The function takes parameters for specifying points in the diagram.
 Parameter 1 specifies points on the x-axis.
 Parameter 2 specifies points on the y-axis.
Example
Draw one point in the diagram, at position x = 1, y = 3:
>plot(1, 3)

To draw more points, use vectors:
Example
Draw two points in the diagram, one at position (1, 3) and one in position (8, 10):
>plot(c(1, 8), c(3, 10))

Multiple Points
Example
>plot(c(1, 2, 3, 4, 5), c(3, 7, 8, 9, 12))
Draw a Line

The plot() function also takes a type parameter with the value l to draw a line to connect
all the points in the diagram:

plot(1:10, type="l") # yel (l) not one

Plot Labels
 The plot() function also accept other parameters, such as main, xlab and ylab if you
want to customize the graph with a main title and different labels for the x and y-axis:
>plot(1:10, main="My Graph", xlab="The x-axis", ylab="The y axis")
Graph Appearance
Colors
Use col="color" to add a color to the points:
Example
plot(1:10, col="red")
Size
Use cex=number to change the size of the points (1 is default, while 0.5 means 50% smaller,
and 2 means 100% larger):
Example
plot(1:10, cex=2)
Point Shape
Use pch with a value from 0 to 25 to change the point shape format:
Example
plot(1:10, pch=25, cex=2)

Lines
 A line graph has a line that connects all the points in a diagram.
 To create a line, use the plot() function and add the type parameter with a value of
"l":
Example
plot(1:10, type="l")

Line Color
 The line color is black by default. To change the color, use the col parameter
Example
plot(1:10, type="l", col="blue")
Line Width
 To change the width of the line, use the lwd parameter (1 is default, while 0.5 means
50% smaller, and 2 means 100% larger).
Line Styles
 The line is solid by default. Use the lty parameter with a value from 0 to 6 to specify
the line format.
For example, lty=3 will display a dotted line instead of a solid line

Available parameter values for lty:

 0 removes the line


 1 displays a solid line
 2 displays a dashed line
 3 displays a dotted line
 4 displays a "dot dashed" line
 5 displays a "long dashed" line
 6 displays a "two dashed" line
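
For example (a brief sketch combining the lwd and lty parameters):

plot(1:10, type = "l", lwd = 2)           # a line twice the default width
plot(1:10, type = "l", lty = 3, lwd = 2)  # a thick dotted line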

Multiple Lines
 To display more than one line in a graph, use the plot() function together with the
lines() function.
line1 <- c(1,2,3,4,5,10)
line2 <- c(2,5,7,8,9,10)
plot(line1, type = "l", col = "blue")
lines(line2, type="l", col = "red")

Scatter Plot
 A "scatter plot" is a type of plot used to display the relationship between two
numerical variables, and plots one dot for each observation.
 It needs two vectors of same length, one for the x-axis (horizontal) and one for
the y-axis (vertical).
Example
x <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y <- c(99,86,87,88,111,103,87,94,78,77,85,86)
plot(x, y)

Compare Plots
To compare the plot with another plot, use the points() function:
Example

Draw two plots on the same figure:
# day one, the age and speed of 12 cars:
x1 <- c(5,7,8,7,2,2,9,4,11,12,9,6)
y1 <- c(99,86,87,88,111,103,87,94,78,77,85,86)
# day two, the age and speed of 15 cars:
x2 <- c(2,2,8,1,15,8,12,9,7,3,11,4,7,14,12)
y2 <- c(100,105,84,105,90,99,90,95,94,100,79,112,91,80,85)

plot(x1, y1, main="Observation of Cars", xlab="Car age", ylab="Car speed",


col="red", cex=2)
points(x2, y2, col="blue", cex=2)

Pie Charts
 A pie chart is a circular graphical view of data.
 Use the pie() function to draw pie charts.
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart
pie(x)
Start Angle
 You can change the start angle of the pie chart with the init.angle parameter.
 The value of init.angle is defined with angle in degrees, where default angle is
0.
Example
Start the first pie at 90 degrees:
# Create a vector of pies
x <- c(10,20,30,40)
# Display the pie chart and start the first pie at 90 degrees
pie(x, init.angle = 90)

Labels and Header


 Use the label parameter to add a label to the pie chart, and use the main
parameter to add a header.
Example
# Create a vector of pies
x <- c(10,20,30,40)
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Display the pie chart with labels
pie(x, label = mylabel, main = "Fruits")

Colors
 You can add a color to each pie with the col parameter.
Example
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, label = mylabel, main = "Fruits", col = colors)

Legend
 To add a list of explanation for each pie, use the legend() function.
Example
# Create a vector of labels
mylabel <- c("Apples", "Bananas", "Cherries", "Dates")
# Create a vector of colors
colors <- c("blue", "yellow", "green", "black")
# Display the pie chart with colors
pie(x, label = mylabel, main = "Pie Chart", col = colors)
# Display the explanation box
legend("bottomright", mylabel, fill = colors)

Bar Charts
 A bar chart uses rectangular bars to visualize data. Bar charts can be displayed
horizontally or vertically. The height or length of the bars are proportional to the
values they represent.
 Use the barplot() function to draw a vertical bar chart.
Example
# x-axis values
x <- c("A", "B", "C", "D")
# y-axis values
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, col = "red")

Density / Bar Texture


 To change the bar texture, use the density parameter.
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, density = 10)
Bar Width
 Use the width parameter to change the width of the bars
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, width = c(1,2,3,4))
Horizontal Bars
 If you want the bars to be displayed horizontally instead of vertically, use
horiz=TRUE.
Example
x <- c("A", "B", "C", "D")
y <- c(2, 4, 6, 8)
barplot(y, names.arg = x, horiz = TRUE)

Statistics Introduction

 Statistics is the science of analyzing, reviewing, and drawing conclusions from data.

Some basic statistical numbers include:

 Mean, median and mode


 Minimum and maximum value
 Percentiles
 Variance and Standard Deviation
 Covariance and Correlation
 Probability distributions

 The R language was developed by two statisticians. It has many built-in functions, in addition to libraries, designed specifically for statistical analysis. (For more information visit: https://www.w3schools.com/statistics/index.php)

Data Set
 A data set is a collection of data, often presented in a table.
 There is a popular built-in data set in R called "mtcars" (Motor Trend Car Road
Tests), which is retrieved from the 1974 Motor Trend US Magazine.
Example
# Print the mtcars data set
mtcars
Information About the Data Set
You can use the question mark (?) to get information about the mtcars data set:
?mtcars
Get Information
 Use the dim() function to find the dimensions of the data set, and the
names() function to view the names of the variables:
Example
Data_Cars <- mtcars # create a variable of the mtcars data set for better
organization
# Use dim() to find the dimension of the data set
dim(Data_Cars)
# Use names() to find the names of the variables from the data set
names(Data_Cars)

 Use the rownames() function to get the name of each row in the first column,
which is the name of each car: rownames(Data_Cars)

From the examples above, we have found out that the data set has 32 observations (Mazda
RX4, Mazda RX4 Wag, Datsun 710, etc) and 11 variables (mpg, cyl, disp, etc).
 A variable is defined as something that can be measured or counted.
 Here is a brief explanation of the variables from the mtcars data set:

Variable Name Description


mpg Miles/(US) Gallon
cyl Number of cylinders
disp Displacement
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs Engine (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors

Print Variable Values


 If you want to print all values that belong to a variable, access the data frame by
using the $ sign, and the name of the variable (for example cyl (cylinders)):
Example
Data_Cars <- mtcars
Data_Cars$cyl
Sort Variable Values
To sort the values, use the sort() function: sort(Data_Cars$cyl)
Analyzing the Data

 use the summary() function to get a statistical summary of the data:
summary(Data_Cars).

The summary() function returns six statistical numbers for each variable:

1. Min
2. First quartile (25th percentile)
3. Median
4. Mean
5. Third quartile (75th percentile)
6. Max

Max and Min

Example
#Find the largest and smallest value of the variable hp (horsepower).
Data_Cars <- mtcars
max(Data_Cars$hp)
min(Data_Cars$hp)

For example, we can use the which.max() and which.min() functions to find the
index position of the max and min value in the table:

Example
Data_Cars <- mtcars
which.max(Data_Cars$hp)
which.min(Data_Cars$hp)

Or even better, combine which.max() and which.min() with the rownames() function to get
the name of the car with the largest and smallest horsepower:
Example
Data_Cars <- mtcars
rownames(Data_Cars)[which.max(Data_Cars$hp)]
rownames(Data_Cars)[which.min(Data_Cars$hp)]

Mean, Median, and Mode


In statistics, there are often three values that interest us:

1. Mean - The average value


2. Median - The middle value
3. Mode - The most common value
Mean
 To calculate the average value (mean) of a variable from the mtcars data set, find the
sum of all values, and divide the sum by the number of values.

Sorted observation of wt (weight)
1.513 1.615 1.835 1.935 2.140 2.200 2.320 2.465
2.620 2.770 2.780 2.875 3.150 3.170 3.190 3.215
3.435 3.440 3.440 3.440 3.460 3.520 3.570 3.570
3.730 3.780 3.840 3.845 4.070 5.250 5.345 5.424
Example
Find the average weight (wt) of a car:
Data_Cars <- mtcars
mean(Data_Cars$wt)

Median
 The median value is the value in the middle, after you have sorted all the values.
Note: If there are two numbers in the middle, you must divide the sum of those numbers
by two, to find the median.
Example
#Find the mid point value of weight (wt):
Data_Cars <- mtcars
median(Data_Cars$wt)

Mode
 The mode value is the value that appears the most number of times.
Example: names(sort(-table(Data_Cars$wt)))[1]
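Because R has no built-in function for the statistical mode (R's mode() returns the storage type), the one-liner above is often wrapped in a small helper. A possible sketch, assuming a numeric vector; the function name get_mode is our own choice, not part of base R:
Example
# returns the most frequent value of a numeric vector
# (if several values tie, the first one encountered in the frequency table is returned)
get_mode <- function(v) {
  counts <- table(v)                              # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])    # value with the highest count
}
Data_Cars <- mtcars
get_mode(Data_Cars$wt)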
Percentiles
 Percentiles are used in statistics to give you a number that describes the value that a
given percent of the values are lower than.
Example
Data_Cars <- mtcars
# c() specifies which percentile you want
quantile(Data_Cars$wt, c(0.75))
Note:
1. If you run the quantile() function without specifying the c() parameter, you will get
the percentiles of 0, 25, 50, 75 and 100.
2. Quartiles
a. Quartiles are data divided into four parts, when sorted in an ascending order:
b. The value of the first quartile cuts off the first 25% of the data
c. The value of the second quartile cuts off the first 50% of the data
d. The value of the third quartile cuts off the first 75% of the data
e. The value of the fourth quartile cuts off 100% of the data

12.Data Exploration and Visualization
a. Look at Data
b. Explore Individual Variables
c. Explore Multiple Variables
d. Save Charts into Files

Look at Data
 Note: The iris data is used in this for demonstration of data exploration with R.
 Execute the following commands and note the output for each and write the purpose
of the command in comments using #:
> dim(iris)
> names(iris)
> str(iris)
> attributes(iris)
> iris[1:5, ]
> head(iris)
> tail(iris)
> ## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
> iris[idx, ]
> iris[1:10, "Sepal.Length"]
> iris[1:10, 1]
> iris$Sepal.Length[1:10]
Explore Individual Variables
 Execute the following commands and note the output for each and write the purpose
of the command in comments using #:
> summary(iris)
> quantile(iris$Sepal.Length)
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
> var(iris$Sepal.Length)
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)
> pie(table(iris$Species))

EXPLORE MULTIPLE VARIABLES


 Execute the following commands and note the output for each and write the purpose
of the command in comments using #:

> barplot(table(iris$Species))
#calculate covariance and correlation between variables with cov() and cor().
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[,1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[,1:4])
> aggregate(Sepal.Length ~ Species, summary, data=iris)
>boxplot(Sepal.Length~Species, data=iris, xlab="Species", ylab="Sepal.Length")
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))
> ## same function as above
> # plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species,
pch=as.numeric(iris$Species))
> smoothScatter(iris$Sepal.Length, iris$Sepal.Width)
> pairs(iris)

More Explorations
> library(scatterplot3d)
> scatterplot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> library(rgl)
> plot3d(iris$Petal.Width, iris$Sepal.Length, iris$Sepal.Width)
> distMatrix <- as.matrix(dist(iris[,1:4]))
> heatmap(distMatrix)
> library(lattice)
> levelplot(Petal.Width~Sepal.Length*Sepal.Width, iris, cuts=9,
+ col.regions=grey.colors(10)[10:1])
> filled.contour(volcano, color=terrain.colors, asp=1,
+ plot.axes=contour(volcano, add=T))
> persp(volcano, theta=25, phi=30, expand=0.5, col="lightblue")
> library(MASS)
> parcoord(iris[1:4], col=iris$Species)
> library(lattice)
> parallelplot(~iris[1:4] | Species, data=iris)
> library(ggplot2)
> qplot(Sepal.Length, Sepal.Width, data=iris, facets=Species ~.)

Save Charts into Files


> # save as a PDF file
> pdf("myPlot.pdf")
> x <- 1:50
> plot(x, log(x))
> graphics.off()
> #
> # Save as a postscript file
> postscript("myPlot2.ps")
> x <- -20:20
> plot(x, x^2)
> graphics.off()

DATA ANALYTICS
Unit – II chapter I: Exploratory Data Analysis
Exploratory Data Analysis:
‣ Visualization before analysis
‣ Dirty data
‣ Visualizing a single variable
‣ Examining multiple variables
‣ Data Exploration versus Presentation
Exploratory Data Analysis (EDA):
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data
sets and summarize their main characteristics, often employing data visualization methods.

Fig1.EDA process
Steps to explain the EDA process (shown in figure 1):
Look at the structure of the data: number of data points, number of features, feature names,
data types, etc.
When dealing with multiple data sources, check for consistency across datasets.
Identify what data signifies (called measures) for each of data points and be mindful while
obtaining metrics.
Calculate key metrics for each data point (summary analysis):
a. Measures of central tendency (Mean, Median, Mode);
b. Measures of dispersion (Range, Quartile Deviation, Mean Deviation, Standard Deviation);
c. Measures of skewness and kurtosis.
Investigate visuals:
a. Histogram for each variable;
b. Scatterplot to correlate variables.
Calculate metrics and visuals per category for categorical variables (nominal, ordinal).
Identify outliers and mark them. Based on context, either discard outliers or analyze them
separately.
Estimate missing points using data imputation techniques.
2.1. 1: Visualization before analysis
What is Visualization?
Data visualization is the representation of data through the use of common graphics, such as
charts, plots, infographics, and even animations. These visual displays of information
communicate complex data relationships and data-driven insights in a way that is easy to
understand.
Visualization Tools:
Tableau (Tableau is a data visualization tool that can be used to create interactive graphs,
charts, and maps), QlikView, Microsoft Power BI, Datawrapper, Plotly, Excel, Zoho analytics.
Figure 2 shows the benefits of data visualization tools.

Fig2.1.1 Benefits of Data Visualization Tools

Visualization before analysis: Why is it important?

Data visualization allows business users to gain insight into their vast amounts of data. It
benefits them to recognize new patterns and errors in the data. Making sense of these
patterns helps the users pay attention to areas that indicate red flags or progress. This
process, in turn, drives the business ahead. Figure 3 shows importance of Data visualization.
Example: covid cases (red flag)

Fig3. Importance of Data Visualization


To illustrate the importance of visualizing data, consider Anscombe's quartet. Anscombe's
quartet consists of four datasets, as shown in Figure 4. It was constructed by the statistician
Francis Anscombe [10] in 1973 to demonstrate the importance of graphs in statistical
analyses.

Fig4.Anscombe's quartet

Figure 4 shows the Anscombe‘s quartet datasets and figure 5 shows Anscombe's quartet
visualized as scatterplots.

Fig5.Anscombe's quartet visualized as scatterplots
R code:
>install.packages(''ggplot2")
>data (anscombe) # It load the anscombe dataset into the current workspace
>anscombe
>nrow(anscombe) # It number of rows
[1] 11
# generates levels to indicate which group each data point belongs to
>levels<- gl(4, nrow(anscombe))
>levels
# Group anscombe into a data frame
>mydata <- with(anscombe, data.frame(x=c(xl,x2,x3,x4), y=c(yl,y2,y3,y4),
mygroup=levels))
>mydata
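The data frame mydata built above can then be plotted to reproduce the four scatterplots of Figure 5. A possible continuation, not part of the original code; the ggplot2 calls and styling choices below are our own:
> library(ggplot2)
> ggplot(mydata, aes(x, y)) +
+   geom_point(size=2, colour="darkblue") +
+   facet_wrap(~mygroup, ncol=2) +      # one panel per Anscombe group
+   ggtitle("Anscombe's Quartet")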
2.1.2 Dirty data
Dirty data can be detected in the data exploration phase with visualizations. In general,
analysts should look for anomalies, verify the data with domain knowledge, and decide the
most appropriate approach to clean the data.
Example to explain dirty data: (sample taken as bank account holder details)

Consider a scenario in which a bank is conducting data analyses of its account holders to
gauge customer retention as shown in figure 6 and 7.

Fig6. Age distribution of bank account holders


R Script: >hist(age, breaks=100, main= "Age Distribution of Account Holders ",
xlab="Age", ylab="Frequency", col="gray" )
Causes of dirty data:
Incorrect data. First on the list is incorrect data.
Incomplete/missing data. Incomplete data can be either an unfinished field or a null-value
entry, i.e. no info added.
Inaccurate data.
Duplicate data.
Inconsistent data.
Treating the CRM as a data warehouse.
In R, the is.na() function provides tests for missing values. The following example creates a
vector x where the fourth value is not available (NA). The is.na() function returns TRUE at
each NA value and FALSE otherwise.
> x <- c(1, 2, 3, NA, 4)
> is.na(x)
[1] FALSE FALSE FALSE TRUE FALSE
Some arithmetic functions, such as mean(), applied to data containing missing values can
yield an NA result. To prevent this, set the na.rm parameter to TRUE to remove the missing
value during the function's execution.
>mean(x)
[1] NA
>mean(x, na.rm=TRUE)
[1] 2.5
The na.exclude() function returns the object with incomplete cases removed.
>DF <- data.frame(x = c(1, 2, 3), y = c(10, 20, NA))
>DF
>DF1 <- na.exclude(DF)
>DF1

Fig7. Distribution of mortgage in years since origination from a bank's home loan portfolio

2.1.3: Data Exploration


a. Look at the data
b. Visualizing single variable
c. Exploring multiple variable
a) Look at the data: consider the iris dataset
The iris data is used in this for demonstration of data exploration with R.
 Execute the following commands and note the output for each and write the purpose of the
command in comments using #:
> dim(iris)
> names(iris)
> str(iris)
> attributes(iris)

> iris[1:5, ]
> head(iris)
> tail(iris)
## draw a sample of 5 rows
> idx <- sample(1:nrow(iris), 5)
> idx
> iris[idx, ]
> iris[1:10, "Sepal.Length"]
> iris[1:10, 1]
> iris$Sepal.Length[1:10]
b) Visualizing single variable
Example Functions for Visualizing a Single Variable:
 plot(data) – scatterplot where x is the index and y is the value; suitable for low-volume data
 barplot(data) – barplot with vertical or horizontal bars
 dotchart(data) – Cleveland dot plot [12]
 hist(data) – histogram
 plot(density(data)) – density plot (a continuous histogram)
 stem(data) – stem-and-leaf plot
 rug(data) – adds a rug representation (1-d plot) of the data to an existing plot

Dotchart and Barplot:
Dotchart and barplot portray continuous values with labels from a discrete variable. A dotchart can be created in R with the function dotchart(x, labels=...), where x is a numeric vector and labels is a vector of categorical labels for x. A barplot can be created with the barplot(height) function, where height represents a vector or matrix.
Figure 8 shows (a) a dotchart and (b) a barplot based on the mtcars dataset, which includes the fuel consumption and 10 aspects of automobile design and performance of 32 automobiles. This dataset comes with the standard R distribution.
The plots in Figure 8 can be produced with the following R code.
> data(mtcars)
> dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7,
           main="Miles Per Gallon (MPG) of Car Models", xlab="MPG")
> barplot(table(mtcars$cyl), main="Distribution of Car Cylinder Counts",
          xlab="Number of Cylinders")

Figure 8 (a) Dotchart on the miles per gallon of cars (b) Barplot on the distribution of
car cylinder counts
Histogram and Density Plot:
Figure 9 includes a histogram of household income. The histogram shows a clear concentration of
low household incomes on the left and the long tail of the higher incomes on the right (figure
9 (a) and (b)).

Figure 9 (a) Histogram (b) Density plot of household income


Exploring individual variables:
> summary(iris)
> quantile(iris$Sepal.Length)
> quantile(iris$Sepal.Length, c(0.1, 0.3, 0.65))
>var(iris$Sepal.Length)
> hist(iris$Sepal.Length)
> plot(density(iris$Sepal.Length))
> table(iris$Species)
c. Exploring multiple variable
A scatterplot is a simple and widely used visualization for finding the relationship among
multiple variables. A scatterplot can represent data with up to five variables using x-axis, y-
axis, size, color, and shape.
Dotchart and Barplot: The dotchart and barplot from the previous section can also visualize
multiple variables. Both of them use color as an additional dimension for visualizing the data.
For the same mtcars dataset, the dotchart below groups vehicle cylinders on the y-axis and
uses colors to distinguish the different cylinder counts. The vehicles are sorted according to
their MPG values.

Dotplot to visualize multiple variables


cars <- mtcars[order(mtcars$mpg), ]
cars$cyl <- factor(cars$cyl)
cars$color[cars$cyl==4] <- "red"
cars$color[cars$cyl==6] <- "blue"
cars$color[cars$cyl==8] <- "darkgreen"
dotchart(cars$mpg, labels=row.names(cars), cex=.7, groups=cars$cyl,
         main="Miles Per Gallon (MPG) of Car Models\nGrouped by Cylinder",
         xlab="Miles Per Gallon", color=cars$color, gcolor="black")

Barplot to visualize multiple variables
> counts <- table(mtcars$gear, mtcars$cyl)
> barplot(counts, main="Distribution of Car Cylinder Counts and Gears",
          xlab="Number of Cylinders", ylab="Counts",
          col=c("#0000FFFF", "#0080FFFF", "#00FFFFFF"),
          legend=rownames(counts), beside=TRUE,
          args.legend=list(x="top", title="Number of Gears"))

Scatterplot Matrix:
Fisher's iris dataset [13] includes the measurements in centimeters of the sepal length, sepal
width, petal length, and petal width for 50 flowers from each of three species of iris. The three
species are setosa, versicolor, and virginica.

Scatterplot matrix of Fisher's {13} iris dataset
The R code for generating the scatterplot matrix is provided next.
> colors <- c("red", "green", "blue")
> pairs(iris[1:4], main="Fisher's Iris Dataset", pch=21,
        bg=colors[unclass(iris$Species)])
> par(xpd = TRUE)
> legend(0.2, 0.02, horiz = TRUE, as.vector(unique(iris$Species)), fill = colors, bty = "n")
The vector colors defines the color scheme for the plot. It could be changed to something like
colors <- c("gray50", "white", "black") to make the scatterplots grayscale.
Explore multiple variables:
> barplot(table(iris$Species))
#calculate covariance and correlation between variables with cov() and cor().
> cov(iris$Sepal.Length, iris$Petal.Length)
> cov(iris[,1:4])
> cor(iris$Sepal.Length, iris$Petal.Length)
> cor(iris[,1:4])
> aggregate(Sepal.Length ~ Species, summary, data=iris)
>boxplot(Sepal.Length~Species, data=iris, xlab="Species", ylab="Sepal.Length")
> with(iris, plot(Sepal.Length, Sepal.Width, col=Species, pch=as.numeric(Species)))
2.1.4: Data Exploration versus presentation
Data exploration means the deep-dive analysis of data in search of new insights.
Data presentation means the delivery of data insights to an audience in a form that makes
clear the implications.
1. Audience - Who is the data for?
For data exploration, the primary audience is the data analyst herself. She is the person
who is both manipulating the data and seeing the results. She needs to work with tight
feedback cycles of defining hypotheses, analyzing data, and visualizing results.
For data presentation, the audience is a separate group of end-users, not the author of the
analysis. These end-users are often non-analytical, they are on the front-lines of business
decision-making, and may have difficulty connecting the dots between an analysis and the
implications for their job.
2. Message - What do you want to say?
Data exploration is about the journey to find a message in your data. The analyst is trying
to put together the pieces of a puzzle.
Data presentation is about sharing the solved puzzle with people who can take action on
the insights. Authors of data presentations need to guide an audience through the content
with a purpose and point of view.
3. Explanation - What does the data mean?
For the analysts using data exploration tools, the meaning of their analysis can be self-
evident. A 1% jump in your conversion metric may represent a big change that alters your
marketing tactics. The important challenge for the analysts is to answer why this is
happening.
Data presentations carry a heavier burden in explaining the results of analysis. When the
audience isn‘t as familiar with the data, the data presentation author needs to start with
more basic descriptions and context. How do we measure the conversion metric? Is a 1%
change a big deal or not? What is the business impact of this change?
4. Visualizations - How do I show the data?
The visualizations for data exploration need to be easy to create and may often show
multiple dimensions to unearth complex patterns.
For data presentation, it is important that visualizations be simple and intuitive. The
audience doesn‘t have the patience to decipher the meaning of a chart. I used to love
presenting data in treemaps but found that as a visualization it could seldom stand-alone
without a two-minute tutorial to teach new users how to read the content.
5. Interactions - How are data insights created and shared?
In data exploration, analysts work on their own to gather data, connect data across silos, and dig into
the data to find insights. Data exploration is often a solitary activity that only connects with
other people when insights are found and need to be shared.
Data presentation is a collaborative, social activity. The value emerges when insights found
in data are shared with people who understand the context of the business. The dialogue that
emerges is the point, not a failure of the analysis.

Unit – II chapter II: Statistical Methods for Evaluation


Statistical Methods for Evaluation:
‣ Hypothesis Testing
‣ Difference of Means
‣ Wilcoxon Rank-Sum Test
‣ Type I and Type II Errors
‣ Power and Sample Size
‣ ANOVA
‣ Decision Trees in R
‣ Naïve Bayes in R

2.2.1: Hypothesis Testing


Hypothesis testing is a type of statistical analysis in which you put your assumptions about a
population parameter to the test. It is used to estimate the relationship between two
statistical variables, as shown in figure 10.
Examples of statistical hypothesis from real-life:
• A teacher assumes that 60% of his college's students come from lower-middle-class
families.
• A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic
patients.

Figure.10 Workflow of Hypothesis Testing


2.2.2: Difference of Means
The test procedure, called the two-sample t-test, is appropriate when the following
conditions are met:
The sampling method for each sample is simple random sampling.
The samples are independent.
Each population is at least 20 times larger than its respective sample.
The sampling distribution is approximately normal, which is generally the case if any of
the following conditions apply.
The population distribution is normal.
The population data are symmetric, unimodal, without outliers, and the sample size is 15
or less.
The population data are slightly skewed, unimodal, without outliers, and the sample size
is 16 to 40.
The sample size is greater than 40, without outliers.
This approach consists of four steps: (1) state the hypotheses, (2) formulate an analysis
plan, (3) analyze sample data, and (4) interpret results.

Reference: https://stattrek.com/hypothesis-test/difference-in-means
Problem 1: Two-Tailed Test
Within a school district, students were randomly assigned to one of two Math teachers -
Mrs. Smith and Mrs. Jones. After the assignment, Mrs. Smith had 30 students, and Mrs.
Jones had 25 students.
At the end of the year, each class took the same standardized test. Mrs. Smith's students
had an average test score of 78, with a standard deviation of 10; and Mrs. Jones'
students had an average test score of 85, with a standard deviation of 15.
Test the hypothesis that Mrs. Smith and Mrs. Jones are equally effective teachers. Use a
0.10 level of significance. (Assume that student performance is approximately normal.)
Solution: The solution to this problem takes four steps: (1) state the hypotheses, (2)
formulate an analysis plan, (3) analyze sample data, and (4) interpret results. We work
through those steps below:
State the hypotheses. The first step is to state the null hypothesis and an alternative
hypothesis.
Null hypothesis: μ1 - μ2 = 0
Alternative hypothesis: μ1 - μ2 ≠ 0
Note that these hypotheses constitute a two-tailed test. The null hypothesis will be
rejected if the difference between sample means is too big or if it is too small.
Formulate an analysis plan. For this analysis, the significance level is 0.10. Using sample
data, we will conduct a two-sample t-test of the null hypothesis.
Analyze sample data. Using sample data, we compute the standard error (SE), degrees
of freedom (DF), and the t statistic (t).

SE = sqrt[ (s1^2/n1) + (s2^2/n2) ]
SE = sqrt[ (10^2/30) + (15^2/25) ] = sqrt(3.33 + 9)
SE = sqrt(12.33) = 3.51

DF = (s1^2/n1 + s2^2/n2)^2 / { [ (s1^2/n1)^2 / (n1 - 1) ] + [ (s2^2/n2)^2 / (n2 - 1) ] }

DF = (10^2/30 + 15^2/25)^2 / { [ (10^2/30)^2 / 29 ] + [ (15^2/25)^2 / 24 ] }

DF = (3.33 + 9)^2 / { [ (3.33)^2 / 29 ] + [ (9)^2 / 24 ] } = 152.03 / (0.382 + 3.375) = 152.03 / 3.757 = 40.47

t = [ (x1 - x2) - d ] / SE = [ (78 - 85) - 0 ] / 3.51 = -7 / 3.51 = -1.99

where s1 is the standard deviation of sample 1, s2 is the standard deviation of sample 2, n1 is
the size of sample 1, n2 is the size of sample 2, x1 is the mean of sample 1, x2 is the mean of
sample 2, d is the hypothesized difference between the population means, and SE is the
standard error. Since we have a two-tailed test, the P-value is the probability that a t
statistic having 40 degrees of freedom is more extreme than -1.99; that is, less than -1.99
or greater than 1.99. We use the t Distribution Calculator to find that P(t < -1.99) is about 0.027.

If you enter 1.99 as the sample mean in the t Distribution Calculator, you will find that
P(t ≤ 1.99) is about 0.973. Therefore, P(t > 1.99) is 1 minus 0.973, or 0.027. Thus, the
P-value = 0.027 + 0.027 = 0.054.

Interpret results: Since the P-value (0.054) is less than the significance level (0.10), we
reject the null hypothesis.
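The same kind of two-sample t-test can be run in R with the t.test() function, which expects the raw scores rather than the summary statistics used above. A minimal sketch on simulated class scores; the rnorm() calls simply generate data that mimic the means and standard deviations of this example:
> set.seed(1)
> smith <- rnorm(30, mean=78, sd=10)   # simulated scores for Mrs. Smith's class
> jones <- rnorm(25, mean=85, sd=15)   # simulated scores for Mrs. Jones' class
> t.test(smith, jones, alternative="two.sided", conf.level=0.90)   # Welch two-sample t-test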

2.2.3: Wilcoxon Rank-Sum Test


What is Wilcoxon rank sum test used for?
Wilcoxon rank-sum test is used to compare two independent samples, while Wilcoxon signed-
rank test is used to compare two related samples, matched samples, or to conduct a paired
difference test of repeated measurements on a single sample to assess whether their
population mean ranks differ.

Non-parametric test:
Since the Wilcoxon Rank Sum Test does not assume known distributions, it does not deal
with parameters, and therefore we call it a non-parametric test. Whereas the null hypothesis
of the two-sample t test is equal means, the null hypothesis of the Wilcoxon test is usually
taken as equal medians.
Reference: https://www.slideserve.com/italia/wilcoxon-rank-sum-test
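In R, the test is available through the wilcox.test() function. A minimal sketch on two small independent samples (the data values below are made up purely for illustration):
> x <- c(12, 15, 9, 20, 17, 11)
> y <- c(23, 18, 26, 21, 30, 19)
> wilcox.test(x, y)    # rank-sum test of equal location for two independent samples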

2.2.4: Type I and Type II Errors


A hypothesis test may result in two types of errors, depending on whether the test accepts or
rejects the null hypothesis. These two errors are known as type I and type II errors.
• A type I error is the rejection of the null hypothesis when the null hypothesis is TRUE. The
probability of a type I error is denoted by the Greek letter alpha (α).
• A type II error is the acceptance of the null hypothesis when the null hypothesis is FALSE.
The probability of a type II error is denoted by the Greek letter beta (β).
                   H0 is true          H0 is false

H0 is accepted     Correct outcome     Type II error

H0 is rejected     Type I error        Correct outcome

The significance level, as mentioned in the Student's t-test discussion, is equivalent to the
type I error. For a significance level such as α = 0.05, if the null hypothesis (µ1 = µ2) is
TRUE, there is a 5% chance that the observed T value based on the sample data will be large
enough to reject the null hypothesis. By selecting an appropriate significance level, the
probability of committing a type I error can be defined before any data is collected or
analyzed. The probability of committing a Type II error is somewhat more difficult to
determine. If two population means are truly not equal, the probability of committing a type
II error will depend on how far apart the means truly are. To reduce the probability of a type
II error to a reasonable level, it is often necessary to increase the sample size.
2.2.5: Power and Sample Size
The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted
by 1 - β, where β is the probability of a type II error. Because the power of a test
improves as the sample size increases, power is used to determine the necessary sample
size. In the difference of means, the power of a hypothesis test depends on the true
difference of the population means. In other words, for a fixed significance level, a larger
sample size is required to detect a smaller difference in the means. In general, the
magnitude of the difference is known as the effect size. As the sample size becomes larger, it
is easier to detect a given effect size, as illustrated in the figure below.

A larger sample size better identifies a fixed effect size


With a large enough sample size, almost any effect size can appear statistically significant.
However, a very small effect size may be useless in a practical sense. It is important to
consider an appropriate effect size for the problem at hand.
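The relationship between sample size, effect size, significance level, and power can be explored in R with the built-in power.t.test() function; leaving one of these arguments out asks R to solve for it. A brief sketch, where the effect size and standard deviation are assumed values, not taken from a specific study:
> # sample size per group needed to detect a difference in means of 5 units
> # when the standard deviation is 10, at alpha = 0.05 and power = 0.80
> power.t.test(delta=5, sd=10, sig.level=0.05, power=0.80)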
2.2.6: ANOVA
Definition:
Analysis of variance, or ANOVA, is a statistical method that separates observed variance data
into different components to use for additional tests. A one-way ANOVA is used for three or
more groups of data, to gain information about the relationship between the dependent and
independent variables.
The hypothesis tests presented in the previous sections are good for analyzing means
between two populations. But what if there are more than two populations? Consider an
example of testing the impact of nutrition and exercise on 60 candidates between age 18 and
50. The candidates are randomly split into six groups, each assigned with a different weight
loss strategy, and the goal is to determine which strategy is the most effective.
Example:
Random sample data of sales collected from different outlets spread over the four
geographical regions.
Variation, being a fundamental characteristic of data, would always be present. Here, the
total variation in the sales may be measured by the squared sum of deviations from the mean
sales. If we analyze the sources of variation in the sales, in this case, we may identify two
sources:

 Sales within a region would differ, and this would be true for all four regions (within-group variation).
 There might be an impact of the regions, and the mean sales of the four regions would not all be the same, i.e., there might be variation among regions (between-group variation).
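A one-way ANOVA of this kind can be run in R with the aov() function. A minimal sketch on made-up regional sales figures (the numbers and region labels are invented purely to illustrate the call):
> sales  <- c(120, 135, 128, 90, 95, 88, 150, 160, 155, 110, 105, 112)
> region <- factor(rep(c("North", "South", "East", "West"), each=3))
> fit <- aov(sales ~ region)    # partitions between-group and within-group variation
> summary(fit)                  # F statistic and p-value for the region effect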

2.2.7: Decision Trees in R

A decision tree is a flowchart-like structure in which each internal node represents a "test"
on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the
outcome of the test, and each leaf node represents a class label (decision taken after
computing all attributes). The paths from root to leaf represent classification rules.

In decision analysis, a decision tree and the closely related influence diagram are used as a
visual and analytical decision support tool, where the expected values (or expected utility) of
competing alternatives are calculated.

A decision tree consists of three types of nodes:

1. Decision nodes – typically represented by squares


2. Chance nodes – typically represented by circles
3. End nodes – typically represented by triangles

What problems can be solved using DT?


 The decision tree algorithm falls under the category of supervised learning. It can be
used to solve both regression and classification problems.
 A decision tree uses the tree representation to solve the problem, in which each leaf
node corresponds to a class label and attributes are represented on the internal nodes
of the tree.

A Decision Tree is an algorithm used for supervised learning problems such as classification
or regression. A decision tree or a classification tree is a tree in which each internal (nonleaf)
node is labeled with an input feature shown in figure 11.
Decision trees used in data mining are of two main types –
 Classification tree − when the response is a nominal variable, for example whether an email is spam or not.
 Regression tree − when the predicted outcome can be considered a real number (e.g., the salary of a worker).
Steps:
1. Begin the tree with the root node, say S, which contains the complete dataset.
2. Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
3. Divide S into subsets that contain possible values of the best attribute.
4. Generate the decision tree node, which contains the best attribute.
5. Recursively make new decision trees using the subsets of the dataset created in step 3.
Continue this process until a stage is reached where you cannot further classify the
nodes; the final node is called a leaf node.

Figure.11 Decision Tree
Applications:
• Marketing
• Companies
• Diagnosis of disease
Example:

To solve such problems, a technique called Attribute Selection Measure (ASM) is used. With
this measure, we can easily select the best attribute for the nodes of the tree. There are two
popular techniques for ASM:
1. Information Gain
2. Gini Index
• Information gain is the measurement of changes in entropy.
• Entropy is a metric to measure the impurity in a given attribute.
• Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
Advantages:
• Easy to read and interpret
• Easy to prepare
• It gives great visual representation
Disadvantages:
• Unstable nature
• Less effective in predicting the outcome of a continuous variable.
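Although the section heading refers to decision trees in R, no R code is shown above. A commonly used option is the rpart package; the following is a minimal sketch on the built-in iris data, where the package and all parameter choices are ours, not prescribed by the text:
# Install and load the packages (rpart.plot is optional, for nicer tree plots)
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
# Fit a classification tree predicting Species from the four measurements
tree_model <- rpart(Species ~ ., data = iris, method = "class")
# Inspect and visualize the fitted tree
print(tree_model)
rpart.plot(tree_model)
# Predict classes for the training data and tabulate a confusion matrix
pred <- predict(tree_model, iris, type = "class")
table(iris$Species, pred)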
2.2.8: Naïve Bayes in R
Naive Bayes is a Supervised Non-linear classification algorithm in R Programming. Naive
Bayes classifiers are a family of simple probabilistic classifiers based on applying Baye's
theorem with strong(Naive) independence assumptions between the features or
variables.
The Naive Bayes algorithm is called "Naive" because it makes the assumption that the
occurrence of a certain feature is independent of the occurrence of other features.
Theory:
The Naive Bayes algorithm is based on Bayes' theorem. Bayes' theorem gives the conditional
probability of an event A given that another event B has occurred:
P(A|B) = [P(B|A) * P(A)] / P(B)
where,
P(A|B) = Conditional probability of A given B.
P(B|A) = Conditional probability of B given A.
P(A) = Probability of event A.
P(B) = Probability of event B.
For many predictors B1, B2, ..., Bn, the posterior probability is proportional to the prior multiplied by the individual conditional probabilities:
P(A|B1,...,Bn) ∝ P(A) * P(B1|A) * P(B2|A) * P(B3|A) * ... * P(Bn|A)
Example:
Consider a sample space: {HH, HT, TH, TT}
where,
H: Head
T: Tail
P(Second coin being head given first coin is tail)
= P(A|B)
= [P(B|A) * P(A)] / P(B)
= [P(First coin is tail given second coin is head) * P(Second coin being head)] / P(First coin being tail)
= [(1/2) * (1/2)] / (1/2)
= 1/2
= 0.5
Performing Naive Bayes on Dataset:
Using the Naive Bayes algorithm on the built-in iris dataset, which includes 150 observations
and 5 variables (four measurements plus the Species label).
R code:
# Installing Packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")
# Loading package
library(e1071)
library(caTools)
library(caret)
# Splitting data into train
# and test data
split <- sample.split(iris, SplitRatio = 0.7)
train_cl <- subset(iris, split == "TRUE")
test_cl <- subset(iris, split == "FALSE")
# Feature Scaling
train_scale <- scale(train_cl[, 1:4])
test_scale <- scale(test_cl[, 1:4])
# Fitting Naive Bayes Model
# to training dataset
set.seed(120) # Setting Seed
classifier_cl <- naiveBayes(Species ~ ., data = train_cl)
classifier_cl
# Predicting on test data'
y_pred <- predict(classifier_cl, newdata = test_cl)
# Confusion Matrix
cm <- table(test_cl$Species, y_pred)
cm
# Model Evaluation
confusionMatrix(cm)
Output: Model classifier_cl:

The conditional probability for each feature or variable is computed by the model separately. The
a priori probabilities are also calculated, which indicate the distribution of our data.
Confusion Matrix:

So, 20 Setosa are correctly classified as Setosa. Out of 16 Versicolor, 15 Versicolor are
correctly classified as Versicolor, and 1 are classified as virginica. Out of 24 virginica, 19
virginica are correctly classified as virginica and 5 are classified as Versicolor.
Model Evaluation:

The model achieved about 90% accuracy. Based on its sensitivity, specificity, and balanced
accuracy, the model built is a good one.
Naive Bayes is widely used in industry for sentiment analysis, document categorization, email
spam filtering, etc.

UNIT III – ADVANCED ANALYTICAL TECHNOLOGY AND METHODS
Chapter – I: Time Series Analysis
Time Series Analysis: Overview of Time Series Analysis, Box-Jenkins Methodology, ARIMA
Model, Autocorrelation Function (ACF), Autoregressive Models, Moving Average Models, ARMA
and ARIMA Models, Building and Evaluating an ARIMA Model, Reasons to Choose and
Cautions.

AGENDA:
 Overview of Time Series Analysis
 Box-Jenkins Methodology
 ARIMA Model
 Autocorrelation Function (ACF)
 Autoregressive Models
 Moving Average Models
 ARMA and ARIMA Models
 Building and Evaluating an ARIMA Model
 Reasons to Choose and Cautions

3.1.1 Overview of Time Series Analysis


Time series analysis attempts to model the underlying structure of observations taken
over time. A time series, denoted yt for t = 1, 2, ..., n, is an ordered sequence of equally spaced
values over time.
Example:

Fig 3.1.1: Monthly international airline passengers

From the figure 3.1.1, the time series consists of an ordered sequence of 144 values. The
analyses presented are limited to equally spaced time series of one variable. The goals of
time series analysis are:
 Identify and model the structure of the time series.
 Forecast future values in the time series.

Time series analysis has many applications in finance, economics, biology, engineering,
retail, and manufacturing. Here are a few specific use cases:
 Retail Sales: For various product lines, a clothing retailer is looking to forecast future
monthly sales. These forecasts need to account for the seasonal aspects of the
customer's purchasing decisions.
 Spare parts planning: Companies' service organizations have to forecast future
spare part demands to ensure an adequate supply of parts to repair customer
products. To forecast future demand, complex models for each part number can be
built using input variables such as expected part failure rates, service diagnostic
effectiveness, and forecasted new product shipments.
 Stock trading: Some high-frequency stock traders utilize a technique called pairs
trading. In pairs trading, an identified strong positive correlation between the prices of
two stocks is used to detect a market opportunity. Pairs trading is one of many
techniques that falls into a trading strategy called statistical arbitrage.

3.1.2 Box-Jenkins Methodology:


A time series consists of an ordered sequence of equally spaced values over time.
Examples of a time series are monthly unemployment rates, daily website visits, or stock
prices every second.
A time series can consist of 4 components:
a. Trend: The trend refers to the long-term movement in a time series. It indicates
whether the observation values are increasing or decreasing over time. Examples of
trends are a steady increase in sales month over month or an annual decline of
fatalities due to car accidents.
b. Seasonality: The seasonality component describes the fixed, periodic fluctuation
in the observations over time. As the name suggests, the seasonality component is
often related to the calendar. For example, monthly retail sales can fluctuate over the
year due to the weather and holidays.
c. Cyclic: A cyclic component also refers to a periodic fluctuation, but one that is not
as fixed as in the case of a seasonality component. For example, retail sales are
influenced by the general state of the economy. Thus, a retail sales time series can
often follow the lengthy boom-bust cycles of the economy.
d. Random: After accounting for the other three components, the random component is
what remains. Although noise is certainly part of this random component, there is
often some underlying structure to this random component that needs to be modelled
to forecast future values of a given time series.
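These components can be examined in R with the built-in decompose() function applied to a time series object. A brief sketch using the AirPassengers dataset that ships with R; the multiplicative type is our own choice, since that series' seasonal swings grow with the trend:
> data(AirPassengers)
> components <- decompose(AirPassengers, type="multiplicative")
> plot(components)    # panels for the observed series, trend, seasonal, and random parts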

Developed by George Box and Gwilym Jenkins, the Box-Jenkins methodology for time
series analysis involves the following three main steps:

1. Condition data and select a model.
 Identify and account for any trends or seasonality in the time series.
 Examine the remaining time series and determine a suitable model.

2. Estimate the model parameters.


3. Assess the model and return to Step 1, if necessary

Fig 3.1.2 Three stage Box-Jenkins Methodology

3.1.3 ARIMA model:


ARIMA stands for Autoregressive Integrated Moving Average; the name describes the model's
various parts and how they are combined. As stated in the first step of the Box-Jenkins
methodology, it is necessary to remove any trends or seasonality in the time series. This step
is necessary to achieve a time series with certain properties to which autoregressive and
moving average models can be applied. Such a time series is known as a stationary time
series. A time series, yt for t = 1, 2, 3, ..., is a stationary time series if the following three
conditions are met:
a) The expected value (mean) of Yt is a constant for all values of t.
b) The variance of Yt is finite.
c) The covariance of Yt and Yt+h depends only on the value of h = 0, 1, 2, .. .for all t.

The covariance of yt and yt+h is a measure of how the two variables, yt and yt+h, vary together.
It is expressed as:
cov(yt, yt+h) = E[(yt - µt)(yt+h - µt+h)]    ->Eq. 1.1
If two variables are independent of each other, their covariance is zero. If the variables
change together in the same direction, the variables have a positive covariance.

Conversely, if the variables change together in the opposite direction, the variables have a
negative covariance.
For a stationary time series, by condition (a), the mean is a constant, say µ. So, for a given
stationary sequence, yt, the covariance notation can be simplified to:
cov(h) = E[(yt - µ)(yt+h - µ)]    ->Eq. 1.2
By part (c), the covariance between two points in the time series can be nonzero, as long as
the value of the covariance is only a function of h. Equation 1.3 is an example for h = 3.
cov(3) = cov(y1, y4) = cov(y2, y5) = ...    ->Eq. 1.3
It is important to note that for h = 0, cov(0) = cov(yt, yt) = var(yt) for all t. Because
var(yt) < ∞, by condition (b), the variance of yt is a constant for all t.

3.1.4 Autocorrelation Function (ACF)


Although there is not an overall trend in the time series plotted in Fig 3.1.4(a), it appears that
each point is somewhat dependent on the past points. The difficulty is that the plot does not
provide insight into the covariance of the variables in the time series and its underlying
structure. The plot of the autocorrelation function (ACF) provides this insight. For a stationary
time series, the ACF is defined as shown in Equation 1.4.

ACF(h) = cov(h) / cov(0)    ->Eq. 1.4

Fig 3.1.4(a) A plot of stationary series

Because cov(0) is the variance, the ACF is analogous to the correlation function of two
variables, corr(yt, yt+h), and the value of the ACF falls between -1 and 1. Thus, the closer
the absolute value of ACF(h) is to 1, the more useful yt can be as a predictor of yt+h.

Fig 3.1.4(b) Autocorrelation function (ACF)

By convention, the quantity h in the ACF is referred to as the lag, the difference between the
time points t and t+h. At lag 0, the ACF provides the correlation of every point with itself. So
ACF(0) always equals 1. According to the ACF plot, at lag 1 the correlation between yt and yt-1
is approximately 0.9, which is very close to 1. So yt-1 appears to be a good predictor of the
value of yt. Because ACF(2) is around 0.8, yt-2 also appears to be a good predictor of the
value of yt. A similar argument could be made for lag 3 to lag 8. (All the autocorrelations are
greater than 0.6.) In other words, a model can be considered that would express yt as a
linear sum of its previous 8 terms. Such a model is known as an autoregressive model of
order 8.

3.1.5 Autoregressive Models


For a stationary time series, yt, t = 1, 2, 3, ..., an autoregressive model of order p, denoted
AR(p), is expressed as:
yt = δ + ø1yt-1 + ø2yt-2 + ... + øpyt-p + εt    ->Eq. 1.5
where δ is a constant for a nonzero-centered time series
øj is a constant for j = 1, 2, ..., p
yt-j is the value of the time series at time t - j
øp ≠ 0
εt ~ N(0, σε^2) for all t
Thus, a particular point in the time series can be expressed as a linear combination of the
prior p values, Yt-j for j = 1, 2, ... p, of the time series plus a random error term, εt. In this
definition, the εt, time series is often called a white noise process and is used to represent
random, independent fluctuations that are part of the time series.
The autocorrelations are quite high for the first several lags. Although it appears that an
AR(8) model might be a good candidate to consider for the given dataset, examining an AR(1)
model provides further insight into the ACF and the appropriate value of p to choose. For an
AR(1) model, centered around δ = 0:

yt = ø1yt-1 + εt    ->Eq. 1.6
It is evident that yt-1 = ø1yt-2 + εt-1. Thus, substituting for yt-1 yields:
yt = ø1(ø1yt-2 + εt-1) + εt
yt = ø1^2 yt-2 + ø1εt-1 + εt    ->Eq. 1.7
As this substitution process is repeated, Yt can be expressed as a function of Yt-h for h = 3, 4
... and a sum of the error terms. This observation means that even in the simple AR(1)
model, there will be considerable autocorrelation with the larger lags even though those lags
are not explicitly included in the model. What is needed is a measure of the autocorrelation
between Yt andYt+h for h = 1, 2, 3 ... with the effect of the Yt+1 to Yt+h-1 values excluded from
the measure. The partial autocorrelation function (PACF) provides such a measure
PACF(h) = corr(yt - y*t, yt+h - y*t+h) for h ≥ 2    ->Eq. 1.8
        = corr(yt, yt+1) for h = 1
where y*t = β1yt+1 + β2yt+2 + ... + βh-1yt+h-1,
y*t+h = β1yt+h-1 + β2yt+h-2 + ... + βh-1yt+1, and
the h-1 values of the βs are based on linear regression.
In other words, after linear regression is used to remove the effect of the variables between
yt and yt+h on yt and yt+h , the PACF is the correlation of what remains. For h = 1, there are
no variables between Yt and Yt+1. So the PACF(1) equals ACF(1). Although the computation of
the PACF is somewhat complex, many software tools hide this complexity from the analyst.

Fig3.1.5 Partial auto correlation function(PACF) plot
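The ACF and PACF behavior described above can be reproduced on a simulated series with the base R function arima.sim(). A small sketch; the AR coefficient 0.8 and the series length are arbitrary choices for illustration:
> set.seed(123)
> y_ar1 <- arima.sim(model=list(ar=0.8), n=500)    # simulate an AR(1) series
> acf(y_ar1, main="ACF of simulated AR(1)")        # slowly decaying autocorrelations
> pacf(y_ar1, main="PACF of simulated AR(1)")      # sharp cutoff after lag 1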

3.1.6 Moving Average Models


For a time series, yt , centered at zero, a moving average model of order q, denoted MA(q),
is expressed as:
yt = εt + θ1εt-1 + ... + θqεt-q    ->Eq. 1.9
where θk is a constant for k = 1, 2, ..., q
θq ≠ 0
εt ~ N(0, σε^2) for all t
In an MA(q) model, the value of a time series is a linear combination of the current white
noise term and the prior q white noise terms. So earlier random shocks directly affect the
current value of the time series. For MA(q) models, the behavior of the ACF and PACF plots
are somewhat swapped from the behavior of these plots for AR(p) models.
For a simulated MA(3) time series of the form yt = εt - 0.4εt-1 + 1.1εt-2 - 2.5εt-3,
where εt ~ N(0, 1):

Fig 3.1.6(a) Scatterplot of a simulated MA(3) time series


The ACF(0) equals 1, because any variable is perfectly correlated with itself. At lags 1, 2, and
3, the value of the ACF is relatively large in absolute value compared to the subsequent
terms. In an autoregressive model, the ACF slowly decays, but for an MA(3) model, the ACF
somewhat abruptly cuts off after lag 3.

Fig 3.1.6(b) ACF plot of a simulated MA(3) time series

To understand why this phenomenon occurs, for an MA(3) time series model:
yt   = εt   + θ1εt-1 + θ2εt-2 + θ3εt-3    ->Eq. 1.10
yt-1 = εt-1 + θ1εt-2 + θ2εt-3 + θ3εt-4    ->Eq. 1.11
yt-2 = εt-2 + θ1εt-3 + θ2εt-4 + θ3εt-5    ->Eq. 1.12
yt-3 = εt-3 + θ1εt-4 + θ2εt-5 + θ3εt-6    ->Eq. 1.13
yt-4 = εt-4 + θ1εt-5 + θ2εt-6 + θ3εt-7    ->Eq. 1.14
Because the expression of yt shares specific white noise variables with the expressions for y t-1
through yt-3, inclusive, those three variables are correlated to y t. However, the expression of
yt does not share white noise variables with yt-4. So the theoretical correlation between yt and
yt-4 is zero.
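This MA(3) behavior can also be checked on a simulated series, again with arima.sim(); the coefficients below are those of the example above, while the seed and series length are arbitrary:
> set.seed(123)
> y_ma3 <- arima.sim(model=list(ma=c(-0.4, 1.1, -2.5)), n=500)   # simulated MA(3) series
> acf(y_ma3, main="ACF of simulated MA(3)")     # significant values only at lags 1 to 3
> pacf(y_ma3, main="PACF of simulated MA(3)")   # slowly decaying values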
3.1.7 ARMA and ARIMA Models
In general, the data scientist does not have to choose between an AR(p) and an MA(q) model
to describe a time series. In fact, it is often useful to combine these two representations into
one model. The combination of these two models for a stationary time series results in an
Autoregressive Moving Average model, ARMA(p,q), which is expressed as:
yt = δ + ø1yt-1 + ø2yt-2 + ... + øpyt-p + εt + θ1εt-1 + ... + θqεt-q    ->Eq. 1.15
where δ is a constant for a nonzero-centered time series
øj is a constant for j = 1, 2, ..., p
øp ≠ 0
θk is a constant for k = 1, 2, ..., q
εt ~ N(0, σε^2) for all t
If p ≠ 0 and q = 0, then the ARMA(p,q) model is simply an AR(p) model. Similarly, if p = 0
and q ≠ 0, then the ARMA(p,q) model is an MA(q) model.
To apply an ARMA model properly, the time series must be a stationary one. However, many
time series exhibit some trend over time. Since such a time series does not meet the
requirement of a constant expected value (mean), the data needs to be adjusted to remove
the trend. One transformation option is to perform a regression analysis on the time series
and then to subtract the value of the fitted regression line from each observed y-value.
If detrending using a linear or higher order regression model does not provide a stationary
series, a second option is to compute the difference between successive y-values. This is
known as differencing.
dt = yt - yt-1 for t = 2, 3, ..., n

Fig 3.1.7(a) A time series with a trend

The mean of the time series plotted is certainly not a constant. Applying differencing to the
time series results in the plot below. This plot illustrates a time series with a constant mean
and a fairly constant variance over time.

Fig 3.1.7(b) Time series for differencing example

If the differenced series is not reasonably stationary, applying differencing additional times
may help. Equation 1.16 provides the twice-differenced time series for t = 3, 4, ..., n.
dt - dt-1 = (yt - yt-1) - (yt-1 - yt-2) = yt - 2yt-1 + yt-2    ->Eq. 1.16
Successive differencing can be applied, but over-differencing should be avoided. One reason
is that over-differencing may unnecessarily increase the variance. The increased variance can
be detected by plotting the possibly over-differenced values and observing that the spread of
the values is much larger, after differencing the values of y twice.

Fig 3.1.7(c) Detrended time series using differencing

Fig 3.1.7(d) Twice differenced series

Because the need to make a time series stationary is common, the differencing can be
included (integrated) into the ARMA model definition by defining the Autoregressive

Integrated Moving Average model, denoted ARIMA(p,d,q). The structure of the ARIMA
model is identical to the ARMA(p,q) expression in Eq. 1.15, but the model is applied to the time series,
Yt, after applying differencing d times. Additionally, it is often necessary to account for
seasonal patterns in time series. For example, in the retail sales use case example monthly
clothing sales track closely with the calendar month. Similar to the earlier option of
detrending a series by first applying linear regression, the seasonal pattern could be
determined and the time series appropriately adjusted. An alternative is to use a seasonal
autoregressive integrated moving average model, denoted ARIMA(p,d,q) x (P,D,Q)s
where:

 p, d, and q are the same as defined previously.

 s denotes the seasonal period.

 P is the number of terms in the AR model across the s periods.

 D is the number of differences applied across the s periods.

 Q is the number of terms in the MA model across the s periods.

For a time series with a seasonal pattern, following are typical values of s:

 52 for weekly data

 12 for monthly data

 7 for daily data

3.1.8 Building and Evaluating an ARIMA Model


For a large country, the monthly gasoline production measured in millions of barrels has been
obtained for the past 240 months (20 years). A market research firm requires some short-
term gasoline production forecasts to assess the petroleum industry's ability to deliver future
gasoline supplies and the effect on gasoline prices.
> library(forecast)
# read in the gasoline production time series
# monthly gas production expressed in millions of barrels
> gas_prod_input <- as.data.frame(read.csv("c:/data/gas_prod.csv"))
# create a time series object
> gas_prod <- ts(gas_prod_input[, 2])
# examine the time series
> plot(gas_prod, xlab = "Time (months)", ylab = "Gasoline production (millions of barrels)")

Fig 3.1.8(a) Monthly gasoline production

In R, the ts() function creates a time series object from a vector or a matrix. The use of time
series objects in R simplifies the analysis by providing several methods that are tailored
specifically for handling equally time spaced data series. For example, the plot() function
does not require an explicitly specified variable for the x-axis. To apply an ARMA model, the
dataset needs to be a stationary time series. Using the diff() function, the gasoline
production time series is differenced once and plotted:
> plot(diff(gas_prod))
> abline(a = 0, b = 0)

Fig 3.1.8(b) Differenced gasoline production time series


The differenced time series has a constant mean near zero with a fairly constant variance
over time. Thus, a stationary time series has been obtained. Using the following R code, the
ACF and PACF plots for the differenced series are generated:
> acf(diff(gas_prod), xaxp = c(0, 48, 4), lag.max = 48, main = "")

Fig 3.1.8(c) ACF of the differenced gasoline time series
> pacf(diff(gas_prod), xaxp = c(0, 48, 4), lag.max = 48, main = "")

Fig 3.1.8(d) PACF of the differenced gasoline time series

The dashed lines provide upper and lower bounds at a 95% significance level. Any value of
the ACF or PACF outside of these bounds indicates that the value is significantly different
from zero.
The ACF plot shows several significant values. The slowly decaying ACF values at lags 12, 24,
36, and 48 are of particular interest; a similar slowly decaying behavior was seen earlier in an
ACF plot, but at lags 1, 2, 3, and so on. Here, the pattern indicates a seasonal autoregressive
behavior with a period of 12 months. Examining the PACF plot,
the PACF value at lag 12 is quite large, but the PACF values are close to zero at lags 24, 36,
and 48. Thus, a seasonal AR(1) model with period = 12 will be considered. It is often useful
to address the seasonal portion of the overall ARMA model before addressing the nonseasonal
portion of the model.
The arima() function in R is used to fit a (0,1,0) x (1,0,0) model with period 12. The analysis is
applied to the original time series variable, gas_prod. The differencing, d = 1, is specified by
the order = c(0,1,0) term.
> arima_1 <- arima(gas_prod, order = c(0,1,0),
                   seasonal = list(order = c(1,0,0), period = 12))
> arima_1
Series: gas_prod
ARIMA(0,1,0)(1,0,0)[12]
Coefficients:
        sar1
      0.8335
s.e.  0.0324
sigma^2 estimated as 37.29:  log likelihood = -778.69
AIC=1561.38   AICc=1561.43   BIC=1568.33
The value of the coefficient for the seasonal AR(1) model is estimated to be 0.8335 with a
standard error of 0.0324. Because the estimate is several standard errors away from zero,
this coefficient is considered significant. The output from this first pass ARIMA analysis is
stored in the variable arima_1, which contains several useful quantities, including the
residuals. The next step is to examine the residuals from fitting the (0,1,0) x (1,0,0) 12
ARIMA model. The ACF and PACF plots of the residuals are provided:
> acf(arima_1$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")
> pacf(arima_1$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")

Fig 3.1.8(e) ACF of residuals from seasonal AR(1) model


The ACF plot of the residuals indicates that the autoregressive behavior at lags 12, 24, 36,
and 48 has been addressed by the seasonal AR(1) term. The only remaining ACF value of any
significance occurs at lag 1. There are several significant PACF values at lags 1, 2, 3, and 4.
Because the PACF plot exhibits a slowly decaying PACF, and the ACF cuts off sharply at lag 1,
an MA(1) model should be considered for the nonseasonal portion of the ARMA model on the
differenced series. In other words, a (0,1,1) x (1,0,0) ARIMA model with period 12 will be
fitted to the original gasoline production time series.
> arima_2 <- arima(gas_prod, order = c(0,1,1),
                   seasonal = list(order = c(1,0,0), period = 12))
> arima_2
Series: gas_prod
ARIMA(0,1,1)(1,0,0)[12]
Coefficients:
          ma1    sar1
      -0.7065  0.8566
s.e.   0.0526  0.0298
sigma^2 estimated as 25.24:  log likelihood = -733.22
AIC=1472.43   AICc=1472.53   BIC=1482.86
> acf(arima_2$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")

> pacf(arima_2$residuals, xaxp = c(0, 48, 4), lag.max = 48, main = "")

Fig 3.1.8(f) PACF of residuals from seasonal AR(1) model


Based on the standard errors associated with each coefficient estimate, the coefficients are
significantly different from zero. The respective ACF and PACF plots for the residuals from the
second pass ARIMA model indicate that no further terms need to be considered in the ARIMA
model.

Fig 3.1.8(g) ACF for the residuals from the (0,1,1) x (1,0,0) 12 model

Fig 3.1.8(h) PACF for the residuals from the (0,1,1) x (1,0,0) 12 model

It should be noted that the ACF and PACF plots each have several points that are close to the
bounds at a 95% significance level. However, these points occur at relatively large lags. To
avoid overfitting the model, these values are attributed to random chance. So no attempt is
made to include these lags in the model. However, it is advisable to compare a reasonably
fitting model to slight variations of that model.
Comparing Fitted Time Series Models
The arima() function in R uses Maximum Likelihood Estimation (MLE) to estimate the model
coefficients. In the R output for an ARIMA model, the log-likelihood (log L) value is provided.
The values of the model coefficients are determined such that the value of the log likelihood
function is maximized. Based on the log L value, the R output provides several measures that
are useful for comparing the appropriateness of one fitted model against another fitted
model. These measures follow:
 AIC (Akaike Information Criterion)
 AICc (Akaike Information Criterion, corrected)
 BIC (Bayesian Information Criterion)

Because these criteria impose a penalty based on the number of parameters included in the
models, the preferred model is the fitted model with the smallest AIC, AICc, or BIC value.
The table provides the information criteria measures for the ARIMA models already fitted as
well as a few additional fitted models. The highlighted row corresponds to the fitted ARIMA
model obtained previously by examining the ACF and PACF plots.
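As a minimal sketch of how these measures are obtained in R (using the two models fitted
above; the base function AIC() accepts fitted arima objects, and each fitted object also stores
its AIC value):
# compare the information criteria of the two fitted models;
# the model with the smaller value is preferred
> AIC(arima_1, arima_2)
> arima_1$aic
> arima_2$aic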
Table: Information Criteria to Measure Goodness of Fit

Normality and Constant Variance


The last model validation step is to examine the normality assumption of the residuals. The
plot of the residuals indicates a mean near zero and a constant variance over time. The
histogram and the Q-Q plot support the assumption that the error terms are normally
distributed.
>plot(arima_2$residuals, ylab = "Residuals")
>abline(a=0,b=0)

Fig 3.1.8(i) Plot of residuals from the fitted (0,1,1) x (1,0,0)12 model

> hist(arima_2$residuals, xlab = "Residuals", xlim = c(-20, 20))

Fig 3.1.8(j) Histogram of the residuals from the fitted (0,1,1) x (1,0,0) 12 model
> qqnorm(arima_2$residuals, main = "")
> qqline(arima_2$residuals)

Fig 3.1.8(k) Q-Q plot of the residuals from the fitted (0,1,1) x (1,0,0) 12 model
If the normality or the constant variance assumptions do not appear to be true, it may be
necessary to transform the time series prior to fitting the ARIMA model. A common
transformation is to apply a logarithm function.
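As a minimal sketch (assuming the production series is strictly positive), the same seasonal
ARIMA model could be refit to the log-transformed series:
# fit the (0,1,1) x (1,0,0) model with period 12 to the logged series
> log_gas_prod <- log(gas_prod)
> arima_2_log <- arima(log_gas_prod, order = c(0,1,1),
                       seasonal = list(order = c(1,0,0), period = 12))
# forecasts made on the log scale would then be back-transformed with exp()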
Forecasting
The next step is to use the fitted (0,1,1) x (1,0,0) 12 model to forecast the next 12 months of
gasoline production. In R, the forecasts are easily obtained using the predict() function and
the fitted model already stored in the variable arima_2. The predicted values, along with the
associated upper and lower bounds at a 95% confidence level, are displayed in R.
# predict the next 12 months
> arima_2.predict <- predict(arima_2, n.ahead = 12)

> matrix(c(arima_2.predict$pred - 1.96*arima_2.predict$se,
           arima_2.predict$pred,
           arima_2.predict$pred + 1.96*arima_2.predict$se),
         12, 3, dimnames = list(c(241:252), c("LB", "Pred", "UB")))
LB PRED UB
241 394.9689 404.8167 414.6645
242 378.6142 388.8773 399.1404
243 394.9943 405.6566 416.3189
244 405.0188 416.0658 427.1128
245 397.9545 409.3733 420.7922
246 396.1202 407.8991 419.6780
247 396.6028 408.7311 420.8594
248 387.5241 399.9920 412.4598
249 387.1523 399.9507 412.7492
250 387.8486 400.9693 414.0900
251 383.1724 396.6076 410.0428
252 390.2075 403.9500 417.6926
> plot(gas_prod, xlim = c(145, 252), xlab = "Time (months)",
       ylab = "Gasoline production (millions of barrels)", ylim = c(360, 440))
> lines(arima_2.predict$pred)
> lines(arima_2.predict$pred + 1.96*arima_2.predict$se, col = 4, lty = 2)
> lines(arima_2.predict$pred - 1.96*arima_2.predict$se, col = 4, lty = 2)
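Because the forecast package is already loaded, an alternative sketch (not the approach shown
above) is to use its forecast() function, which returns point forecasts with 80% and 95%
prediction intervals by default and has its own plot method:
> gas_forecast <- forecast(arima_2, h = 12)
> gas_forecast
> plot(gas_forecast)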

Fig 3.1.8(l) Actual and forecasted gasoline production
3.1.9 Reasons to Choose and Cautions
One advantage of ARIMA modeling is that the analysis can be based simply on
historical time series data for the variable of interest. In a regression model, by contrast,
various input variables need to be considered and evaluated for inclusion in the model for the
outcome variable. Because ARIMA modeling, in general, ignores any additional input variables,
the forecasting process is simplified. If regression analysis were used to model gasoline
production, input variables such as Gross Domestic Product (GDP), oil prices, and the
unemployment rate might be useful.
However, to forecast gasoline production using regression, predictions would also be
required for the GDP, oil price, and unemployment rate input variables. The minimal data
requirement also leads to a disadvantage of ARIMA modeling; the model does not provide an
indication of what underlying variables affect the outcome. For example, if ARIMA modeling
was used to forecast future retail sales, the fitted model would not provide an indication of
what could be done to increase sales. In other words, causal inferences cannot be drawn
from the fitted ARIMA model. One caution in using time series analysis is the impact of
severe shocks to the system. In the gas production example, shocks might include refinery
fires, international incidents, or weather-related impacts such as hurricanes. Such events can
lead to short-term drops in production, followed by persistently high increases in production
to compensate for the lost production or to simply capitalize on any price increases. Along
similar lines of reasoning, time series analysis should only be used for short-term forecasts.
Over time, gasoline production volumes may be affected by changing consumer demands as
a result of more fuel-efficient gasoline-powered vehicles, electric vehicles, or the introduction
of natural gas-powered vehicles. Changing market dynamics in addition to shocks will make
any long-term forecasts, several years into the future, very questionable.
Additional Methods
Additional time series methods include the following:

 Autoregressive Moving Average with Exogenous inputs (ARMAX) is used to analyze a
time series that is dependent on another time series. For example, retail demand for
products can be modeled based on the previous demand combined with a weather-related
time series such as temperature or rainfall.

 Spectral analysis is commonly used for signal processing and other engineering
applications. Speech recognition software uses such techniques to separate the signal
for the spoken words from the overall signal that may include some noise.

 Generalized Autoregressive Conditionally Heteroscedastic (GARCH) is a useful model
for addressing time series with nonconstant variance or volatility. GARCH is used for
modeling stock market activity and price fluctuations.

 Kalman filtering is useful for analyzing real-time inputs about a system that can
exist in certain states. Typically, there is an underlying model of how the various
components of the system interact and affect each other. A Kalman filter processes
the various inputs, attempts to identify the errors in the input, and predicts the
current state. For example, a Kalman filter in a vehicle navigation system can process
various inputs, such as speed and direction, and update the estimate of the current
location.

 Multivariate time series analysis examines multiple time series and their effect on
each other. Vector ARIMA (VARIMA) extends ARIMA by considering a vector of several
time series at a particular time, t. VARIMA can be used in marketing analyses that
examine the time series related to a company's price and sales volume as well as
related time series for the competitors.

Unit – III: ADVANCED ANALYTICAL THEORY AND METHODS: TEXT ANALYSIS
Chapter 2: TEXT ANALYSIS
Agenda

 Text Analysis Steps


 A Text Analysis Example
 Collecting Raw Text
 Representing Text
 Term Frequency—Inverse Document Frequency (TFIDF)
 Categorizing Documents by Topics
 Determining Sentiments
 Gaining Insights.

3.2.1 Text Analysis Steps:

Text analysis (TA) is a machine learning technique used to automatically extract valuable
insights from unstructured text data. Text analysis, sometimes called text analytics, refers to
the representation, processing, and modeling of textual data to derive useful insights. An
important component of text analysis is text mining, the process of discovering relationships
and interesting patterns in large text collections.
Text analysis often deals with textual data that is far more complex. A corpus (plural:
corpora) is a large collection of texts used for various purposes in Natural Language
Processing (NLP). Table below lists a few example corpora that are commonly used in NLP
research.
TABLE: Example Corpora in Natural Language Processing
Corpus                   Word Count     Domain      Website
William Shakespeare      0.88 million   Written     http://shakespeare.mit.edu/
Brown Corpus             1 million      Written     http://icame.uib.no/brown/bcm.html
Penn Treebank            1 million      Newswire    http://www.cis.upenn.edu/~treebank/
Google N-Grams Corpus    1 trillion     Written     http://catalog.ldc.upenn.edu/LDC2006T13
The smallest corpus in the list, the complete works of Shakespeare, contains about 0.88
million words. In contrast, the Google n-gram corpus contains one trillion words from
publicly accessible web pages. Out of the one trillion words in the Google n-gram corpus,
there might be one million distinct words, which would correspond to one million
dimensions. The high dimensionality of text is an important issue, and it has a direct
impact on the complexities of many text analysis tasks. Another major challenge with text
analysis is that most of the time the text is not structured.

TABLE: Data Sources and formats for Text Analysis


Data Source      Data Format                  Data Structure Type
News articles    TXT, HTML, or scanned PDF    Unstructured
Literature       TXT, DOC, HTML, or PDF       Unstructured
E-mail           TXT, MSG, EML                Unstructured
Web pages        HTML                         Semi-structured
Server logs      LOG or TXT                   Semi-structured or quasi-structured
Why Is Text Analysis Important?
When you put machines to work on organizing and analysing your text data, the insights and
benefits are huge. Let's take a look at some of the advantages of text analysis, below:
Text Analysis Is Scalable: Text analysis tools allow businesses to structure vast quantities
of information, like emails, chats, social media, support tickets, documents, and so on, in
seconds rather than days, so you can redirect extra resources to more important business
tasks.
Analyze Text in Real-time:
Businesses are inundated with information and customer comments can appear anywhere
on the web these days, but it can be difficult to keep an eye on it all. Text analysis is a
game-changer when it comes to detecting urgent matters, wherever they may appear,
24/7 and in real time. By training text analysis models to detect expressions and
sentiments that imply negativity or urgency, businesses can automatically flag tweets,
reviews, videos, tickets, and the like, and take action sooner rather than later.
AI Text Analysis Delivers Consistent Criteria:
Humans make errors. Fact. And the more tedious and time-consuming a task is, the
more errors they make. By training text analysis models to your needs and criteria,
algorithms are able to analyze, understand, and sort through data much more accurately
than humans ever could.
Text Analysis Steps:
A text analysis problem usually consists of three important steps:
1. Parsing,
2. Search and retrieval, and
3. Text mining.
Parsing: Parsing is the process that takes unstructured text and imposes a structure for
further analysis. The unstructured text could be a plain text file, a weblog, an Extensible
Markup Language (XML) file, a Hyper Text Markup Language (HTML) file, or a Word
document. Parsing deconstructs the provided text and renders it in a more structured way
for the subsequent steps.

Search and Retrieval: Search and retrieval is the identification of the documents in a
corpus that contain search items such as specific words, phrases, topics, or entities like
people or organizations. These search items are generally called key terms. Search and
retrieval originated from the field of library science and is now used extensively by web
search engines.
Text Mining: Text mining uses the terms and indexes produced by the prior two steps to
discover meaningful insights pertaining to domains or problems of interest. With the
proper representation of the text, many of the techniques, such as clustering and
classification, can be adapted to text mining. For example, the k-means can be modified to
cluster text documents into groups, where each group represents a collection of
documents with a similar topic. The distance of a document to a centroid represents how
closely the document talks about that topic. Classification tasks such as sentiment analysis
and spam filtering are prominent use cases of the naïve Bayes classifier. Text mining may utilize
methods and techniques from various fields of study, such as statistical analysis,
information retrieval, data mining, and natural language processing. Note that, in reality,
all three steps do not have to be present in a text analysis project. If the goal is to
construct a corpus or provide a catalog service, for example, the focus would be the
parsing task using one or more text pre-processing techniques, such as part-of-speech
(POS) tagging, named entity recognition, lemmatization, or stemming. Furthermore, the
three tasks do not have to be sequential. Sometimes their orders might even look like a
tree. For example, one could use parsing to build a data store and choose to either search
and retrieve the related documents or use text mining on the entire data store to gain
insights.
Part-of-Speech (POS) Tagging, Lemmatization, and Stemming:
The goal of POS tagging is to build a model whose input is a sentence, such as:
he saw a fox
and whose output is a tag sequence. Each tag marks the POS for the corresponding word,
such as:
PRP VBD DT NN
according to the Penn Treebank POS tags. Therefore, the four words are mapped to pronoun
(personal), verb (past tense), determiner, and noun (singular), respectively.
Both lemmatization and stemming are techniques to reduce the number of dimensions
and reduce inflections or variant forms to the base form to more accurately measure the
number of times each word appears. With the use of a given dictionary, lemmatization
finds the correct dictionary base form of a word. For example, given the sentence:
obesity causes many problems, the output of lemmatization would be:
obesity cause many problem.
Different from lemmatization, stemming does not need a dictionary, and it usually refers
to a crude process of stripping affixes based on a set of heuristics with the hope of
correctly achieving the goal to reduce inflections or variant forms. After the process, words
are stripped to become stems. A stem is not necessarily an actual word defined in the
natural language, but it is sufficient to differentiate itself from the stems of other words. A
well-known rule-based stemming algorithm is Porter's stemming algorithm. It defines a set
of production rules to iteratively transform words into their stems. For the sentence shown
previously:
obesity causes many problems, the output of Porter's stemming algorithm is:
obes caus mani problem
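As a hedged illustration in R, the same sentence can be stemmed with the SnowballC
package's wordStem() function (the package choice is an assumption; any Porter-style
stemmer could be substituted):
> library(SnowballC)
> tokens <- c("obesity", "causes", "many", "problems")
# apply the Porter stemming algorithm to each token
> wordStem(tokens, language = "porter")
# expected stems: obes, caus, mani, problem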
3.2.2 A Text Analysis Example
To further describe the three text analysis steps, consider the fictitious company
ACME, maker of two products: bPhone and bEbook. ACME is in strong competition with
other companies that manufacture and sell similar products. To succeed, ACME needs to
produce excellent phones and eBook readers and increase sales. One of the ways the
company does this is to monitor what is being said about ACME products in social media.
In other words, what is the buzz on its products? ACME wants to search all that is said
about ACME products in social media sites, such as Twitter and Facebook, and popular
review sites, such as Amazon and ConsumerReports. It wants to answer questions such as these:
 Are people mentioning its products?
 What is being said? Are the products seen as good or bad?
 If people think an ACME product is bad, why? For example, are they complaining about the
battery life of the bPhone, or the response time in their bEbook?
ACME can monitor the social media buzz using a simple process based on the three steps
outlined above. This process is illustrated in Figure below, and it includes the modules in
the next list.
Figure: ACME's Text Analysis process
(1. Collect Raw Text → 2. Represent Text → 3. TFIDF → 4. Topic Modeling → 5. Sentiment Analysis → 6. Gain Insights)

3.2.3 Collect raw text


1. Collect raw text. This corresponds to Phase 1 and Phase 2 of the Data Analytic Lifecycle.
In this step, the Data Science team at ACME monitors websites for references to specific
products. The websites may include social media and review sites. The team could
interact with social network application programming interfaces (APIs), process data
feeds, or scrape pages and use product names as keywords to get the raw data.
Regular expressions are commonly used in this case to identify text that matches
certain patterns. Additional filters can be applied to the raw data for a more focused
study. For example, only retrieving the reviews originating in New York instead of the
entire United States would allow ACME to conduct regional studies on its products.
Generally, it is a good practice to apply filters during the data collection phase. They
can reduce I/O workloads and minimize the storage requirements.
2. Represent text. Convert each review into a suitable document representation with
proper indices, and build a corpus based on these indexed reviews. This step
corresponds to Phases 2 and 3 of the Data Analytic Lifecycle.
3. Compute the usefulness of each word in the reviews using methods such as TFIDF.
This and the following two steps correspond to Phases 3 through 5 of the Data
Analytic Lifecycle.
4. Categorize documents by topics. This can be achieved through topic models
(such as latent Dirichlet allocation).
5. Determine sentiments of the reviews. Identify whether the reviews are positive
or negative. Many product review sites provide ratings of a product with each review.
If such information is not available, techniques like sentiment analysis can be used
on the textual data to infer the underlying sentiments. People can express many
emotions. To keep the process simple, ACME considers sentiments as positive,
neutral, or negative.
6. Review the results and gain greater insights. This step corresponds to
Phase 5 and 6 of the Data Analytic Lifecycle. Marketing gathers the results from the
previous steps. Find out what exactly makes people love or hate a product. Use one
or more visualization techniques to report the findings. Test the soundness of the
conclusions and operationalize the findings if applicable.
This process organizes the topics presented in the rest of the chapter and calls out some of
the difficulties that are unique to text analysis.

Text analysis
Text analysis can help your customer support teams sort text data at scale, more accurately, and
faster than human analysis. That means valuable customer insights sooner rather than
later, and quicker turnarounds when it comes to making business decisions.
Text analysis, also known as text mining, is a machine learning technique used to
automatically extract value from text data. With the help of natural language processing
(NLP), text analysis tools are able to understand, analyze, and extract insights from
your unstructured data.
They make processing and analyzing huge amounts of unstructured data incredibly easy.
For example, if you receive thousands of support tickets, text analysis tools can analyze
them as soon as they come into your helpdesk, and alert you to any recurring issues or
even angry customers that are at risk of churning.
Text Analysis Examples:
There are two main text analysis techniques that you can use: text classification and text
extraction. While they can be used individually, you'll be able to get more detailed
insights when you use them in unison.


Let's take a look at some examples of the most popular text analysis techniques.
 Text Classification

 Sentiment Analysis

 Topic Analysis

 Language Detection

 Intent Detection

 Text Extraction

 Keyword Extraction

 Entity Extraction

Text Classification:
Text Classification, also referred to as text tagging, is the practice of classifying text using
pre-defined tags. There are many examples of text classification, but we‘ll just touch upon
some of the most popular methods used by businesses.
Sentiment Analysis: Sentiment analysis can automatically detect the emotional
undertones embedded in customer reviews, survey responses, social media posts, and
beyond, which helps organizations understand how their customers feel about their brand,
product, or service. Sentiment analysis of product reviews, for example, can tell you what
customers like or dislike about your product. Restaurants might want to quickly detect
negative reviews on public opinion sites, like Yelp. By performing sentiment analysis on

Yelp reviews, they can quickly detect negative sentiments, and respond right away.
Review sites like Capterra and G2 Crowd also offer unsolicited feedback. Take Slack, for
example. Customers leave long-winded reviews that praise or criticize different aspects of
the software. By running sentiment analysis, you can start organizing these reviews by
sentiment in real time. By training your own sentiment analysis model to detect emotional
tones in your customer feedback, you‘ll be able to gain more accurate results that are
tailored to your dataset.
Let‘s go over the two main text analysis methods – text classification and text extraction –
and the various models available. The one you choose will depend on the insights you are
hoping to gain, and/or the problem you‘re attempting to solve. Let‘s take a closer look:
Text Classification:
Text classification, also referred to as text tagging, is the practice of classifying text into
pre-defined groups. With the help of Natural Language Processing (NLP), text
classification tools are powerful enough to automatically analyze text and classify it into
categories, depending on the content that you‘re dealing with.
Now, let's proceed with the different types of text classification models available.
Sentiment Analysis:
Nowadays, analysis falls short if it doesn‘t examine the different emotions behind every
piece of text. Sentiment analysis can automatically detect the emotional undertones
embedded in customer reviews, survey responses, social media posts, and so on, which
helps organizations understand how their customers feel about their brand, product, or
service.
For example, a sentiment analysis of product reviews can help a business understand
what customers like or dislike about your product. Think of review sites
like Yelp, Capterra, and G2 Crowd, where you might stumble upon feedback about your,
let‘s say, SaaS business. In the following reviews for Slack, customers praise or criticize a
few aspects of the tool:
"In love with Slack, I won't be using anything else to communicate going forward. How did
I survive without it?!" → Positive
"I don't agree with the hype, Slack failed to notify me of several important messages and
that's what a communication platform should be all about." → Negative
"The UX is one of the best, it's very intuitive and easy to use. However, we don't have a
budget for the high price Slack asks for its paid plans." → Neutral
By training a model to detect sentiment, on the other hand, you can delegate the task
of categorizing texts into Positive, Neutral, and Negative to machines. Not only does this
help speed up the process, but you'll also receive more consistent results, since machines
are inherently not biased.
Topic analysis:
Topic analysis is a machine learning technique that interprets and categorizes large
collections of text according to individual topics or themes.
For example, instead of humans having to read thousands of product reviews to identify
the main topic that customers are talking about in regards to your product, you can use a
topic analysis tool to do it in seconds.
Let‘s say you‘re an entertainment on-demand service company that‘s just released new
content, and you want to know what topics customers are mentioning. You could define
tags such as UX/UI, Quality, Functionality, and Reliability, and find out which aspect is
being talked about most often, and how customers are talking about each aspect. Take
this review for Prime Video:
"I think Amazon is making a great effort in adding engaging content but I can't get past
the ugly interface. It's not as intuitive as other competing streaming services and if it
weren't lumped in with my Prime membership, I wouldn't pay for the stand-alone service."
In this example, the topic analysis classifier can be trained to process this and
automatically tag it under UX/UI.
Language Detection
This text analysis model identifies and classifies text according to its language. This is
particularly helpful for businesses with a global presence that need to route information to
the correct localized teams.
Take this ticket, for example:
“La blusa es más grande de lo que esperaba, quisiera devolverla por una prenda de una
talla menor.”
A language classifier could easily detect that the language is Spanish and route it to the
correct customer service agent, helping businesses improve response times and avoiding
unnecessary delays.
Intent Detection
Text classifiers can also be used to automatically discover customer intent in text. For
example, you might receive a message like the one below:
"Sheesh, the amount of emails I receive is staggering - and it's only been one week. It's
sporting goods, folks. I don't need over 20 emails per week to remind me of that.
Unsubscribing ASAP."
With an intent detection classifier in place, you could quickly detect that this customer
wants to 'Unsubscribe' and address this customer immediately. Perhaps you'll convince
them to change the settings to limit the number of emails they receive every week, rather
than unsubscribe.
With a clear intent detected, you can easily classify customers and take immediate action.
Text Extraction:
Text Extraction is a text analysis technique that extracts valuable pieces of data from
text, like keywords, names, product specifications, phone numbers, and more.
Here are some examples of text extraction models.
Keyword Extraction:
Keyword extraction shows you the most relevant words or expressions in your text data.
For example, if you want to understand the main topics mentioned in a set of customer
reviews about a particular product or feature, you‘d quickly run your data through
a keyword extractor.
Entity Extraction:
Entity extraction is used to obtain names of people, companies, brands, and more. This
technique is particularly helpful when you‘re trying to discover names of competitors in
text, brand mentions, and specific customers that you want to track.
Textual information is all around us.
Soon after you wake up, you usually navigate through large amounts of textual data in the
form of text messages, emails, social media updates, and blog posts before you make it to
your first cup of coffee.
Deriving information from such large volumes of text data is challenging. Businesses deal
with massive quantities of text data generated from several data sources, including apps,
web pages, social media, customer reviews, support tickets, and call transcripts.
To extract high-quality, relevant information from such huge amounts of text data,
businesses employ a process called text mining. This process of information extraction
from text data is performed with the help of text analysis software.
Text Mining
Text mining, also called text data mining, is the process of analyzing large volumes of
unstructured text data to derive new information. It helps identify facts, trends, patterns,
concepts, keywords, and other valuable elements in text data.
It's also known as text analysis and transforms unstructured data into structured data,
making it easier for organizations to analyze vast collections of text documents. Some of
the common text mining tasks are text classification, text clustering, creation of granular
taxonomies, document summarization, entity extraction, and sentiment analysis.
Text mining uses several methodologies to process text, including natural language
processing (NLP).
What is natural language processing?
Natural language processing (NLP) is a subfield of computer science, linguistics, data
science, and artificial intelligence concerned with the interactions between humans and
computers using natural language.
In other words, natural language processing aims to make sense of human languages to
enhance the quality of human-machine interaction. NLP evolved from computational
linguistics, enabling computers to understand both written and spoken forms of human
language.
Many of the applications you use have NLP at their core. Voice assistants like Siri, Alexa,
and Google Assistant use NLP to understand your queries and craft responses. Grammarly
uses NLP to check the grammatical accuracy of sentences. Even Google Translate is made
possible by NLP.
Natural language processing employs several machine learning algorithms to extract the
meaning associated with each sentence and convert it into a form that computers can
understand. Semantic analysis and syntactic analysis are the two main methods used to
perform natural language processing tasks.
Semantic analysis
Semantic analysis is the process of understanding human language. It's a critical aspect of
NLP, as understanding the meaning of words alone won't do the trick. It enables
computers to understand the context of sentences as we comprehend them.
Semantic analysis is based on semantics – the meaning conveyed by a text. The semantic
analysis process starts with identifying the text elements of a sentence and assigning them
to their grammatical and semantical role. It then analyzes the context in the surrounding
text to determine the meaning of words with more than one interpretation.

Syntactic analysis
Syntactic analysis is used to determine how a natural language aligns with grammatical
rules. It's based on syntax, a field of linguistics that refers to the rules for arranging words
in a sentence to make grammatical sense.
Some of the syntax techniques used in NLP are:
 Part-of-speech tagging: Identifying the part of speech for each word
 Sentence breaking: Assigning sentence boundaries on a huge piece of text
 Morphological segmentation: Dividing words into simpler individual parts called
morphemes
 Word segmentation: Dividing huge pieces of continuous text into smaller, distinct
units
 Lemmatization: Reducing inflected forms of a word into singular form for easy
analysis
 Stemming: Cutting inflected words into their root forms
 Parsing: Performing grammatical analysis of a sentence
Why is text mining important?
Most businesses have the opportunity to collect large volumes of text data. Customer
feedback, product reviews, and social media posts are just the tip of the big data iceberg.
The kind of ideas that can be derived from such sources of textual (big) data are
profoundly lucrative and can help companies create products that users will value the
most.
Without text mining, the opportunity mentioned above is still a challenge. This is because
analyzing vast amounts of data isn't something the human brain is capable of. Even if a
group of people tries to pull off this Herculean task, the insights extracted might become
obsolete by the time they succeed.
Text mining helps companies automate the process of classifying text. The classification
could be based on several attributes, including topic, intent, sentiment, and language.
Many manual and tedious tasks can be eliminated with the help of text mining. Suppose
you need to understand how the customers feel about a software application you offer. Of
course, you can manually go through user reviews, but if there are thousands of reviews,
the process becomes tedious and time-consuming.
Text mining makes it quick and easy to analyze large and complex data sets and derive
relevant information from them. In this case, text mining enables you to identify the
general sentiment of a product. This process of determining whether the reviews are
positive, negative, or neutral is called sentiment analysis or opinion mining.
Further, text mining can be used to determine what users like or dislike or what they want
to be included in the next update. You can also use it to identify the keywords customers
use in association with certain products or topics.
Organizations can use text mining tools to dig deeper into text data to identify relevant
business insights or discover interrelationships within texts that would otherwise go
undetected with search engines or traditional applications.
Here are some specific ways organizations can benefit from text mining:
 The pharmaceutical industry can uncover hidden knowledge and accelerate the pace of
drug discovery.
 Product companies can perform real-time analysis on customer reviews and identify
product bugs or flaws that require immediate attention.
 Companies can create structured data, integrate it into databases and use it for
different types of big data analytics such as descriptive or predictive analytics.
In short, text mining helps businesses put data to work and make data-driven decisions
that can make customers happy and ultimately increase profitability.
Text mining vs. text analytics vs. text analysis
Text mining and text analysis are often used synonymously. However, text analytics is
different from both.
Simply put, text analytics can be described as a text analysis or text mining software
application that allows users to extract information from structured and
unstructured text data.
Both text mining and text analytics aim to solve the same problem – analyzing raw text
data. But their results vary significantly. Text mining extracts relevant information from
text data that can be considered qualitative results. On the other hand, text analytics aims
to discover trends and patterns in vast volumes of text data that can be viewed
as quantitative results.
Put differently, text analytics is about creating visual reports such as graphs and tables by
analyzing large amounts of textual data. Whereas text mining is about transforming
unstructured data into structured data for easy analysis.
Text mining is a subfield of data mining and relies on statistics, linguistics, and machine
learning to create models capable of learning from examples and predicting results on
newer data. Text analytics uses the information extracted by text mining models for data
visualization.
Text mining techniques
Numerous text mining techniques and methods are used to derive valuable insights from
text data. Here are some of the most common ones.
Concordance
Concordance is used to identify the context in which a word or series of words appear.
Since the same word can mean different things in human language, analyzing the
concordance of a word can help comprehend the exact meaning of a word based on the
context. For example, the term "windows" describes openings in a wall and is also the
name of the operating system from Microsoft.
Word frequency
As the name suggests, word frequency is used to determine the number of times a word
has been mentioned in unstructured text data. For example, it can be used to check the
occurrence of words like "bugs," "errors," and "failure" in the customer reviews. Frequent
occurrences of such terms may indicate that your product requires an update.
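A minimal base-R sketch of such a word-frequency count (the review text below is made up
purely for illustration):
# hypothetical customer reviews
> reviews <- c("too many bugs after the update",
               "the update fixed some errors but new bugs appeared")
# split into lowercase tokens on spaces and punctuation, then count occurrences
> tokens <- unlist(strsplit(tolower(reviews), "[[:space:][:punct:]]+"))
> sort(table(tokens), decreasing = TRUE)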
Collocation
Collocation is a sequence of words that co-occur frequently. "Decision making," "time-
consuming," and "keep in touch" are some examples. Identifying collocation can improve
the granularity of text and lead to better text mining results.
Then there are advanced text mining methods such as text classification and text
extraction. We'll go over them in detail in the next section.
How does text mining work?
Text mining is primarily made possible through machine learning. Text mining algorithms
are trained to extract information from vast volumes of text data by looking at many
examples.
The first step in text mining is gathering data. Text data can be collected from multiple
sources, including surveys, chats, emails, social media, review websites, databases, news
outlets, and spreadsheets.
The next step is data preparation. It's a pre-processing step in which the raw data is
cleaned, organized, and structured before textual data analysis. It involves standardizing
data formats and removing outliers, making it easier to perform quantitative and
qualitative analysis.
Natural language processing techniques such as parsing, tokenization, stop word removal,
stemming, and lemmatization are applied in this phase.
After that, the text data is analyzed. Text analysis is performed using methods such as
text classification and text extraction. Let's look at both methods in detail.
Text classification
Text classification, also known as text categorization or text tagging, is the process of
classifying text. In other words, it's the process of assigning categories to unstructured
text data. Text classification enables businesses to quickly analyze different types of
textual information and obtain valuable insights from them.
Some common text classification tasks are sentiment analysis, language detection, topic
analysis, and intent detection.
 Sentiment analysis is used to understand the emotions conveyed through a given
text. By understanding the underlying emotions of a text, you can classify it as

positive, negative, or neutral. Sentiment analysis is helpful to enhance customer
experience and satisfaction.
 Language detection is the process of identifying which natural language the given
text is in. This will allow companies to redirect customers to specific teams specialized
in a particular language.
 Topic analysis is used to understand the central theme of a text and assign a topic to
it. For example, a customer email that says "the refund hasn't been processed" can be
classified as a "Returns and Refunds issue".
 Intent detection is a text classification task used to recognize the purpose or intention
behind a given text. It aims to understand the semantics behind customer messages
and assign the correct label. It's a critical component of several natural language
understanding (NLU) software.
Now, let's take a look at the different types of text classification systems.
1. Rule-based systems
Rule-based text classification systems are based on linguistic rules. Once the text mining
algorithms are coded with these rules, they can detect various linguistic structures and
assign the correct tags.
For example, a rule-based system can be programmed to assign the tag "food" whenever
it encounters words like "bacon," "sandwich," "pasta," or "burger".
Since rule-based systems are developed and maintained by humans, they're easy to
understand. However, unlike machine learning-based systems, rule-based systems
demand humans to manually code prediction rules, making them hard to scale.
2. Machine learning-based systems
Machine learning-based text classification systems learn and improve from examples.
Unlike rule-based systems, machine learning-based systems don't demand data scientists
to code the linguistic rules manually. Instead, they learn from training data that contains
examples of correctly tagged text data.
Machine learning algorithms such as Naive Bayes and Support Vector Machines (SVM) are
used to predict the tag of a text. Many a time, deep learning algorithms are also used to
create machine learning-based systems with greater accuracy.
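As a hedged sketch of this idea in R (the e1071 package and the toy term counts are
assumptions made for illustration, not part of the original text), a naive Bayes classifier can
be trained on labeled examples and then used to tag a new document:
> library(e1071)
# toy document-term counts and sentiment labels, made up for illustration
> train_x <- data.frame(bug = c(3, 0, 2, 1), love = c(0, 2, 1, 3))
> train_y <- factor(c("negative", "positive", "negative", "positive"))
> nb_model <- naiveBayes(train_x, train_y)
# predict the tag for a new document mentioning "bug" twice and "love" once
> predict(nb_model, data.frame(bug = 2, love = 1))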
3. Hybrid systems
As expected, hybrid text classification systems combine both rule-based and machine
learning-based systems. In such systems, both machine learning-based and rule-based
systems complement each other, and their combined results have higher accuracy.
Evaluation of text classifiers
A text classifier's performance is measured with the help of four
parameters: accuracy, precision, recall, F1 score.
 Accuracy is the number of times the text classifier made the correct prediction divided
by the total number of predictions.

 Precision indicates the number of correct predictions made by the text classifier over
the total number of predictions for a specific tag.
 Recall depicts the number of texts correctly predicted divided by the total number that
should have been categorized with a specific tag.
 F1 score combines precision and recall parameters to give a better understanding of
how adept the text classifier is at making predictions. It's a better indicator than
accuracy as it shows how good the classifier is at predicting all the categories in the
model.
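As a worked sketch (the confusion-matrix counts below are made up), the four measures can
be computed directly for a single tag:
# hypothetical counts for one tag: true/false positives and negatives
> tp <- 40; fp <- 10; fn <- 5; tn <- 45
> accuracy  <- (tp + tn) / (tp + fp + fn + tn)
> precision <- tp / (tp + fp)
> recall    <- tp / (tp + fn)
> f1        <- 2 * precision * recall / (precision + recall)
> c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)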
Another way to test the performance of a text classifier is with cross-validation.
Cross-validation is the process of randomly dividing the training data into several subsets.
The text classifier trains on all subsets, except one. After the training, the text classifier is
tested by making predictions on the remaining subset.
In most cases, multiple rounds of cross-validation are performed with different subsets,
and their results are averaged to estimate the model's predictive performance.
Text extraction
Text extraction, also known as keyword extraction, is the process of extracting specific,
relevant information from unstructured text data. This is mainly done with the help of
machine learning and is used to automatically scan text and obtain relevant words and
phrases from unstructured text data such as surveys, news articles, and support tickets.
Text extraction allows companies to extract relevant information from large blocks of text
without even reading it. For example, you can use it to quickly identify the features of a
product from its description.
Quite often, text extraction is performed along with text classification. Some of the
common text extraction tasks are feature extraction, keyword extraction, and named
entity recognition.
 Feature extraction is the process of identifying critical features or attributes of an
entity in text data. Understanding the common theme of an extensive collection of text
documents is an example. Similarly, it can analyze product descriptions and extract
their features such as model or color.
 Keyword extraction is the process of extracting important keywords and phrases
from text data. It's useful for summarization of text documents, finding the frequently
mentioned attributes in customer reviews, and understanding the opinion of social
media users towards a particular subject.
 Named entity recognition (NER), also known as entity extraction or chunking, is the
text extraction task of identifying and extracting critical information (entities) from text
data. An entity can be a word or a series of words, such as the names of companies.
Regular expressions and conditional random field (CRF) are the two common methods of
implementing text extraction.
1. Regular expressions
Regular expressions are a series of characters that can be correlated with a tag. Whenever
the text extractor matches a text with a sequence, it assigns the corresponding tag.
Similar to the rule-based text classification systems, each pattern is a specific rule.
Unsurprisingly, this approach is hard to scale as you have to establish the correct
sequence for any kind of information you wish to obtain. It also becomes difficult to handle
when patterns become complex.
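A minimal base-R sketch of this idea (the pattern and the ticket text are made up for
illustration) uses a regular expression to pull e-mail addresses out of free text:
# hypothetical support-ticket text
> ticket <- "Please send the refund confirmation to jane.doe@example.com today."
# a simple e-mail pattern; real-world patterns are usually more elaborate
> pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"
> regmatches(ticket, gregexpr(pattern, ticket))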
2. Conditional random fields
Conditional random fields (CRFs) are a class of statistical approaches often applied in
machine learning and used for text extraction. It builds systems capable of learning the
patterns in text data that they need to extract. It does this by weighing various features
from a sequence of words in text data.
CRFs are more proficient at encoding information when compared to regular expressions.
This makes them more capable of creating richer patterns. However, this method will
require more computational resources to train the text extractor.
Evaluation of text extractors
You can use the same metrics used in text classification to evaluate the performance of the
text extractor. However, they‘re blind to partial matches and consider only exact matches.
Due to that reason, another set of metrics called ROUGE (Recall-Oriented Understudy for
Gisting Evaluation) is used.
Text mining applications
The amount of data managed by most organizations is growing and diversifying at a rapid
pace. It's nearly impossible to take advantage of it without an automated process like text
mining in place.
An excellent example of text mining is how information retrieval happens when you
perform a Google search. For example, if you search for a keyword, say "cute puppies,"
most search results won't include your exact query.
Instead, they'll be synonyms or phrases that closely match your query. In the example of
"cute puppies,‖ you'll come across search engine page results that include phrases such as
"cutest puppy,‖ "adorable puppies,‖ "adorable pups,‖ and "cute puppy".
This happens because text mining applications actually read and comprehend the body of
texts, closely similar to how we do it. Instead of just relying on keyword matching, they
understand search terms at conceptual levels. They do an excellent job of understanding
complex queries and can discover patterns in text data, which is otherwise hidden to the
human eye.
Text mining can also help companies solve several problems in areas such as patent
analysis, operational risk analysis, business intelligence, and competitive intelligence.
Text mining has a broad scope of applications spanning multiple industries. Marketing,
sales, product development, customer service, and healthcare are a few of them. It
eliminates several monotonous and time-consuming tasks with the help of machine
learning models.
Here are some of the applications of text mining.
 Fraud detection: Text mining technologies make it possible to analyze large volumes of
text data and detect fraudulent transactions or insurance claims. Investigators can quickly
identify fraudulent claims by checking for commonly used keywords in descriptions of
accidents. It can also be used to promptly process genuine claims by automating the
analysis process.
 Customer service: Text mining can automate the ticket tagging process and
automatically route tickets to appropriate geographic locations by analyzing their
language. It can also help companies determine the urgency of a ticket and prioritize
the most critical tickets.
 Business intelligence: Text mining makes it easier for analysts to examine large
amounts of data and quickly identify relevant information. Since petabytes of business
data, collected from several sources, are involved, manual analysis is impossible. Text
mining tools fasten the process and enable analysts to extract actionable information.
 Healthcare: Text mining is becoming increasingly valuable in the healthcare industry,
primarily for clustering information. Manual investigation is time-consuming and costly.
Text mining can be used in medical research to automate the process of extracting
crucial information from medical literature.
3.2.4 Representing Text:
Tokenization is the task of separating (also called tokenizing) words from the body of text.
Raw text is converted into collections of tokens after the tokenization, where each token is
generally a word. A common approach is tokenizing on spaces. For example, with the
tweet shown previously:
I once had a gf back in the day. Then the bPhone came out lol
Tokenization based on spaces would output a list of tokens.
{ I, once, had, a, gf, back, in, the, day.
Then, the, bPhone, came, out, lol}
Note that the token "day." contains a period. This is the result of only using space as the
separator. Therefore, tokens "day." and "day" would be considered different terms in the
downstream analysis unless an additional lookup table is provided. One way to fix the
problem without the use of a lookup table is to remove the period if it appears at the end of
a sentence. Another way is to tokenize the text based on punctuation marks and spaces. In
that case, the previous tweet would become:
become:
{I, once, had, a, gf, back, in, the, day, .,
Then, the, bPhone, came, out, lol}
However, tokenizing based on punctuation marks might not be well suited to certain
scenarios. For example, if the text contains contractions such as we'll, tokenizing based on
punctuation will split them into separated words we and ll. For words such as can't, the
output would be can and t. It would be more preferable either not to tokenize them or to
tokenize we'll into we and 'll, and can't into can and 't. The 't token is more recognizable as
negative than the t token. If the team is dealing with certain tasks such as information
extraction or sentiment analysis, tokenizing solely based on punctuation marks and spaces
may obscure or even distort meanings in the text.
Tokenization is a much more difficult task than one may expect. For example, should words
such as state-of-the-art, Wi-Fi, and San Francisco be considered one token or more? Should
words like résumé, resumé, and resume all map to the same token? Tokenization is even more difficult
beyond English. In German, for example, there are many unsegmented compound nouns.
In Chinese, there are no spaces between words. Japanese has several alphabets
intermingled. This list can go on.
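As a hedged base-R sketch (not necessarily how a production system would do it), the two
strategies discussed above can be compared with strsplit():
> tweet <- "I once had a gf back in the day. Then the bPhone came out lol"
# tokenize on spaces only; note that "day." keeps its trailing period
> strsplit(tweet, " ")[[1]]
# tokenize on spaces and punctuation marks
> strsplit(tweet, "[[:space:][:punct:]]+")[[1]]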
Another text normalization technique is called case folding, which reduces all letters to
lowercase (or the opposite if applicable). For the previous tweet, after case folding the text
would become this:
i once had a gf back in the day. then the bphone came out lol
One needs to be cautious applying case folding to tasks such as information extraction,
sentiment analysis, and machine translation. If implemented incorrectly, case folding may
reduce or change the meaning of the text and create additional noise. For example, when
General Motors becomes general
and motors, the downstream analysis may very likely consider them as separated words
rather than the name of a company. When the abbreviation of the World Health
Organization WHO or the rock band The Who become who, they may both be interpreted
as the pronoun who.
If case folding must be present, one way to reduce such problems is to create a lookup
table of words not to be case folded. Alternatively, the team can come up with some
heuristics or rules-based strategies for the case folding. For example, the program can be
taught to ignore words that have uppercase in the middle of a sentence.
3.2.5 Term Frequency-Inverse Document Frequency (TF-IDF)
TF-IDF stands for Term Frequency Inverse Document Frequency of records. It can be
defined as the calculation of how relevant a word in a series or corpus is to a text. The
meaning increases proportionally to the number of times in the text a word appears but is
compensated by the word frequency in the corpus (data-set).
Terminologies:
Term Frequency: In document d, the frequency represents the number of instances of a
given word t. Therefore, we can see that it becomes more relevant when a word appears
in the text, which is rational. Since the ordering of terms is not significant, we can use a
vector to describe the text in the bag-of-terms model. For each specific term in the document,
there is an entry with the value being the term frequency. The weight of a term that
occurs in a document is simply proportional to the term frequency.
tf(t,d) = count of t in d / number of words in d
Document Frequency: This measures the importance of a term across the whole corpus
collection, and it is very similar to TF. The only difference is that TF is the frequency counter
for a term t in document d, while DF is the number of documents in the set N that contain
the term t. In other words, DF is the number of documents in which the word is present:
df(t) = occurrence of t in documents
Inverse Document Frequency: This measures how informative a word is. The key aim of a
search is to locate the records that are most relevant to the query. Since TF considers all
terms equally significant, term frequencies alone cannot be used to measure the weight of a
term in a document. First, find the document frequency of a term t by counting the number of
documents containing the term:
df(t) = N(t), where
df(t) = document frequency of a term t
N(t) = number of documents containing the term t
Term frequency is the number of instances of a term in a single document only, whereas
document frequency is the number of separate documents in which the term appears, so it
depends on the entire corpus. Now let's look at the definition of inverse document frequency.
The IDF of a word is the number of documents in the corpus divided by the document
frequency of that word:
idf(t) = N / df(t) = N / N(t)
A more common word should be considered less significant, but this raw ratio can be too
harsh, so we take the logarithm of the inverse document frequency. The idf of the term t then
becomes:
idf(t) = log(N / df(t))
Computation: TF-IDF is one of the best metrics to determine how significant a term is to a
text in a series or a corpus. TF-IDF is a weighting scheme that assigns a weight to each word
in a document based on its term frequency (TF) and its inverse document frequency (IDF).
Words with higher weights are deemed more significant. The TF-IDF weight is usually the
product of two terms:
1. Normalized Term Frequency (TF)
2. Inverse Document Frequency (IDF)
tf-idf(t, d) = tf(t, d) * idf(t)
In python tf-idf values can be computed
using TfidfVectorizer() method in sklearn module.
Syntax:
sklearn.feature_extraction.text.TfidfVectorizer(input)
Parameters:
 input: It refers to the document source passed; it can be a filename, a file object, or the content itself.
Attributes:
 vocabulary_: It returns a dictionary with terms as keys and feature indices as values.
 idf_: It returns the inverse document frequency vector learned from the documents.
Methods:
 fit_transform(): It returns the document-term matrix of TF-IDF values.
 get_feature_names_out(): It returns a list of feature names (get_feature_names() in older scikit-learn versions). A short usage sketch follows below.
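A minimal sketch of using TfidfVectorizer on a small toy corpus (the example sentences are our own, not from the text):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the bPhone has coverage everywhere",
    "the bEbook text is illegible",
    "I love the bPhone series",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)   # sparse document-term matrix of TF-IDF weights

print(vectorizer.get_feature_names_out())         # the learned vocabulary
print(vectorizer.idf_)                            # inverse document frequency of each term
print(tfidf_matrix.toarray())                     # TF-IDF weight of each term in each document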
3.2.6 Categorizing Documents By Topics:
A topic consists of a cluster of words that frequently occur together and share the same
theme.
 The topics of a document are not as straightforward as they might initially appear.
 Consider these two reviews:
1. The bPhoneSx has coverage everywhere. It's much less flaky than my old bPhone4G.
2. While I love ACME's bPhone series, I've been quite disappointed by the bEbook. The
text is illegible, and it makes even my old NBook look blazingly fast.
 A document typically consists of multiple themes running through the text in
different proportions-
 For Example: 30% on a topic related to phones, 15% on a topic related to
appearance, 10% on a topic related to shipping, 5% on a topic related to service, and
so on.
 Document grouping can be achieved with clustering methods such as:
o k-means clustering
o or classification methods such as
o support vector machines
o k-nearest neighbors
o naive Bayes
 However, a more feasible and prevalent approach is to use Topic Modelling.
 Topic modelling provides tools to automatically organize, search, understand, and
summarize vast amounts of information.
 Topic models are statistical models that examine words from a set of documents,
determine the themes over the text, and discover how the themes are associated or
change over time.
 The process of topic modelling can be simplified to the following.
1. Uncover the hidden topical patterns within a corpus.
2. Annotate documents according to these topics.
3. Use annotations to organize, search, and summarize texts.
 A topic is formally defined as a distribution over a fixed vocabulary of words. Different
topics would have different distributions over the same vocabulary.
 A topic can be viewed as a cluster of words with related meanings, and each word
has a corresponding weight inside this topic.
 Note that a word from the vocabulary can reside in multiple topics with different
weights.
 Topic models do not necessarily require prior knowledge of the texts. The topics can
emerge solely based on analyzing the text.
 The simplest topic model is latent Dirichlet allocation (LDA), a generative probabilistic
model of a corpus proposed by David M. Blei and two other researchers. A minimal sketch of fitting such a model follows below.
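The following is a hedged sketch of fitting a small LDA topic model with scikit-learn; the documents and parameter values are illustrative assumptions, not from the original text:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the bPhone has great coverage and battery",
    "shipping was slow but the service was helpful",
    "the bEbook screen is illegible and slow",
    "customer service resolved my shipping issue",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)                 # document-term matrix of word counts

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)                  # per-document topic proportions

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):          # per-topic word weights
    top_terms = [terms[i] for i in topic.argsort()[-5:][::-1]]
    print(f"Topic {k}: {top_terms}")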
3.2.7 Determining Sentiment:
 Sentiment analysis, also known as opinion mining, is the process of determining the emotions behind a piece of text. Sentiment analysis aims to categorize the given text as positive, negative, or neutral. Furthermore, it then identifies and quantifies subjective information about those texts with the help of natural language processing, text analysis, computational linguistics, and machine learning.
 There are two main methods for sentiment analysis: the machine learning method and the lexicon-based method. The machine learning method leverages human-labeled data to train the text classifier, making it a supervised learning method. The lexicon-based approach breaks down a sentence into words and scores each word's semantic orientation based on a dictionary. It then adds up the various scores to arrive at a conclusion.
In this example, we will look at how sentiment analysis works using a simple lexicon-based approach. We'll take the following comment as our test data:
"That movie was a colossal disaster… I absolutely hated it! Waste of time and money #skipit"
Step 1: Cleaning
The initial step is to remove special characters and numbers from the text. In our example, we'll remove the exclamation marks and commas from the comment above.
That movie was a colossal disaster I absolutely hated it Waste of time and money skipit
Step 2: Tokenization
Tokenization is the process of breaking down a text into smaller chunks called tokens,
which are either individual words or short sentences. Breaking down a paragraph into
sentences is known as sentence tokenization, and breaking down a sentence into words
is known as word tokenization.
[ 'That', 'movie', 'was', 'a', 'colossal', 'disaster', 'I', 'absolutely', 'hated', 'it', 'Waste', 'of', 'time', 'and', 'money', 'skipit' ]
Step 3: Part-of-speech (POS) tagging
Part-of-speech tagging is the process of tagging each word with its grammatical group,
categorizing it as either a noun, pronoun, adjective, or adverb—depending on its context.
This transforms each token into a tuple of the form (word, tag). POS tagging is used to
preserve the context of a word.
[ ('That', 'DT'),
('movie', 'NN'),
('was', 'VBD'),
('a', 'DT'),
('colossal', 'JJ'),
('disaster', 'NN'),
('I', 'PRP'),
('absolutely', 'RB'),
('hated', 'VBD'),
('it', 'PRP'),
('Waste', 'NN'),
('of', 'IN'),
('time', 'NN'),
('and', 'CC'),
('money', 'NN'),
('skipit', 'NN') ]
Step 4: Removing stop words
Stop words are words like 'have,' 'but,' 'we,' 'he,' 'into,' 'just,' and so on. These words carry little information and are generally considered noise, so they are removed from the data.
[ 'movie', 'colossal', 'disaster', 'absolutely', 'hated', 'Waste', 'time', 'money', 'skipit' ]
Step 5: Stemming
Stemming is a process of linguistic normalization which removes the suffix of each of these
words and reduces them to their base word. For example, loved is reduced to love, wasted
is reduced to waste. Here, hated is reduced to hate.
[ 'movie', 'colossal', 'disaster', 'absolutely', 'hate', 'Waste', 'time', 'money', 'skipit' ]
Step 6: Final Analysis
In a lexicon-based approach, the remaining words are compared against sentiment libraries, and the scores obtained for each token are added or averaged. Sentiment libraries are lists of predefined words and phrases that are manually scored by humans. For example, 'worst' is scored -3, and 'amazing' is scored +3. With a basic dictionary, our example comment is turned into:
movie = 0, colossal = 0, disaster = -2, absolutely = 0, hate = -2, waste = -1, time = 0, money = 0, skipit = 0.
This makes the overall score of the comment -5, classifying the comment as negative.
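A minimal sketch of this scoring step; the token list and dictionary scores mirror the worked example above, while the helper name is our own:

sentiment_lexicon = {"disaster": -2, "hate": -2, "waste": -1, "amazing": 3, "worst": -3}

def score_tokens(tokens, lexicon):
    """Sum the lexicon score of each token; unknown tokens score 0."""
    return sum(lexicon.get(token.lower(), 0) for token in tokens)

tokens = ["movie", "colossal", "disaster", "absolutely", "hate",
          "Waste", "time", "money", "skipit"]
total = score_tokens(tokens, sentiment_lexicon)
label = "negative" if total < 0 else "positive" if total > 0 else "neutral"
print(total, label)   # -5 negative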
Applications of Sentiment Analysis:
 Customer support
 Social media monitoring
 Voice assistants & chatbots
 Election polls
 Customer experience about a product
 Stock market sentiment and market movement
3.2.8 Gaining Insight:
The broad definition of insight is a deep understanding of a situation (or person or
thing). In the context of data and analytics, the word insight refers to an analyst or
business user discovering a pattern in data or a relationship between variables that they
didn't previously know existed. This can have a thrilling, "aha moment" aspect to it.
Difference Between data, analytics and insights:
Data = a collection of facts.
Analytics = organizing and examining data.
Insights = discovering patterns in data.
There's also a linear aspect to these terms that differentiates them. Data is collected and
organized, then analysis is performed, and insights are generated. Data insights help organizations to:
1. Optimize processes to improve performance.
2. Uncover new markets, products or services to add new sources of revenue.
3. Better balance risk vs reward to reduce loss.
4. Deepen the understanding of customers to increase loyalty and lifetime value.
How to Get Data Insights:
The process to obtain actionable data insights typically involves defining objectives,
collecting, integrating and managing the data, analyzing the data to gain insights and then
sharing these insights.
1) Define business objectives
Stakeholders initiate the process by clearly defining objectives such as improving
production processes or determining which marketing campaigns are most effective, like in
the example above.
2) Data collection
Ideally, systems have already been put in place to collect and store raw source data. If
not, the organization needs to establish a systematic process to gather the data.
3) Data integration & management
Once collected, source data must be transformed into clean, analytics-ready information
via data integration. This process includes data replication, ingestion and transformation to
combine different types of data into standardized formats which are then stored in a
repository such as a data lake or data warehouse.
UNIT – IV: Analytical Data Report and Visualization
Communicating and Operationalizing an Analytics Project, Creating the Final Deliverables:
 Developing Core Material for Multiple Audiences,
 Project Goals,
 Main Findings,
 Approach,
 Model Description,
 Key Points Supported with Data,
 Model Details Recommendations,
 Additional Tips on Final Presentation,
 Providing Technical Specifications and Code,
 Data Visualization.
4.1 Developing Core Material for Multiple Audiences:
It is important to tailor the project outputs to the audience. For a project sponsor,
show that the team met the project goals. Focus on what was done, what the team
accomplished, what ROI can be anticipated, and what business value can be realized.
When presenting to a technical audience such as data scientists and analysts, focus
on how the work was done. Discuss how the team accomplished the goals and the
choices it made in selecting models or analyzing the data. Share analytical methods and
decision-making processes so other analysts can learn from them for future projects.
Describe methods, techniques, and technologies used, as this technical audience will be
interested in learning about these details and considering whether the approach
makes sense in this case and whether it can be extended to other, similar projects. Plan
to provide specifics related to model accuracy and speed, such as how well the model
will perform in a production environment.
Because some of the components of the projects can be used for different audiences, it can
be helpful to create a core set of materials regarding the project, which can be used to
create presentations for either a technical audience or an executive sponsor. Table below
depicts the main components of the final presentations for the project sponsor and an
analyst audience. Notice that teams can create a core set of materials in these seven
areas, which can be used for the two presentation audiences.
Three areas (Project Goals, Main Findings, and Model Description) can be used as is
for both presentations. Other areas need additional elaboration, such as the Approach. Still
other areas, such as the Key Points, require different levels of detail for the analysts and
data scientists than for the project sponsor.
TABLE: Comparison of Materials for Sponsor and Analyst Presentations

Project Goals
 Sponsor presentation: List 3-5 agreed-upon goals.
 Analyst presentation: Same as for the sponsor.

Main Findings
 Sponsor presentation: Emphasize key messages.
 Analyst presentation: Same as for the sponsor.

Approach
 Sponsor presentation: High-level methodology.
 Analyst presentation: High-level methodology, plus relevant details on modeling techniques and technology.

Model Description
 Sponsor presentation: Overview of the modeling technology.
 Analyst presentation: Same as for the sponsor.

Key points supported with Data
 Sponsor presentation: Support key points with simple charts and graphics (example: bar charts).
 Analyst presentation: Show details to support the key points; analyst-oriented charts and graphs, such as ROC curves and histograms; visuals of key variables and the significance of each.

Model details
 Sponsor presentation: Omit this section, or discuss only at a high level.
 Analyst presentation: Show the code or main logic of the model, including model type, variables, and technology used to execute the model and score data. Identify key variables and the impact of each. Describe expected model performance and any caveats. Give a detailed description of the modeling technique and discuss variables, scope, and predictive power.

Recommendations
 Sponsor presentation: Focus on business impact, including risks and ROI. Give the sponsor salient points to help her evangelize the work within the organization.
 Analyst presentation: Supplement recommendations with implications for the modeling or for deploying the model in a production environment.
4.2 Project Goals:
The Project Goals portion of the final presentation is generally the same, or similar, for sponsors and for analysts. For each audience, the team needs to reiterate the goals of the project to lay the groundwork for the solution and recommendations that are shared later in the presentation. In addition, the Goals slide serves to ensure there is a shared understanding between the project team and the sponsors and to confirm they are aligned in moving forward with the project. Generally, the goals are agreed on early in the project. It is good practice to write them down and share them to ensure the goals and objectives are clearly understood by both the project team and the sponsors.
Figures 4.2.1 and 4.2.2 show two examples of slides for Project Goals. Figure 4.2.1 shows three goals for creating a predictive model to anticipate customer churn. The points on this version of the Goals slide emphasize what needs to be done, but not why, which will be included in the alternative.

Figure 4.2.1 Example of Project Goals slide for YoyoDyne case study
Figure 4.2.2 shows a variation of the Project Goals slide in Figure 4.2.1. It adds a summary of the situation prior to listing the goals. Keep in mind that when delivering final presentations, these deliverables are shared within organizations, and the original context can be lost, especially if the original sponsor leaves the group or changes roles. It is good practice to briefly recap the situation prior to showing the project goals. Keep in mind that adding a situation overview to the Goals slide does make it appear busier. The team needs to determine whether to split this into a separate slide or keep it together, depending on the audience and the team's style for delivering the final presentation. One method for writing the situational overview in a succinct way is to summarize as follows:

• Situation: Give a one-sentence overview of the situation that has led to the analytics project.

• Complication: Give a one-sentence overview of the need for addressing this now. Something has triggered the organization to decide to take action at this time. For instance, perhaps it lost 100 customers in the past two weeks and now has an executive mandate to address an issue, or perhaps it has lost five points of market share to its biggest competitor in the past three months. Usually, this sentence represents the driver for why a particular project is being initiated at this time, rather than at some vague time in the future.

• Implication: Give a one-sentence overview of the impact of the complication. For instance, if the bank fails to address its customer attrition problem, it stands to lose its dominant market position in three key markets. Focus on the business impact to illustrate the urgency of doing the project.

FIGURE 4.2.2 Example of Situation & Project Goals slide for YoyoDyne case study
4.3 Main Findings:
Write a solid executive summary to portray the main findings of a
project. In many cases, the summary may be the only portion of the presentation that
hurried managers will read. For this reason, it is imperative to make the language
clear, concise, and complete. Those reading the executive summary should be able
to grasp the full story of the project and the key insights in a single slide. In addition,
this is an opportunity to provide key talking points for the executive sponsor to use to
evangelize the project work with others in the customer's organization. Be sure to
frame the outcomes of the project in terms of both quantitative and qualitative business
value. This is especially important if the presentation is for the project sponsor. The
executive summary slide containing the main findings is generally the same for both
sponsor and analyst audiences. Figure 4.3.1 shows an example of an executive summary slide for the YoyoDyne case study. It is useful to take a closer look at the parts of the slide to make sure it is clear.

FIGURE 4.3.1 Example of Executive Summary slide for YoyoDyne case study
The key message should be clear and conspicuous at the front of the slide. It can be
set apart with color or shading. The key message may become the single talking point
that executives or the project sponsor take away from the project and use to support
the team's recommendation for a pilot project, so it needs to be succinct and
compelling.
Follow the key message with three major supporting points. Although Executive
Summary slides can have more than three major points, going beyond three ideas
makes it difficult for people to recall the main points, so it is important to ensure that
the ideas remain clear and limited to the few most impactful ideas the team wants the
audience to take away from the work that was done. If the author lists ten key points,
messages become diluted, and the audience may remember only one or two main
points.
In addition, because this is an analytics project, be sure to make one of the key points
related to if, and how well, the work will meet the sponsor's service level agreement
(SLA) or expectations. Traditionally, the SLA refers to an arrangement between
someone providing services, such as an information technology (IT) department or a
consulting firm, and an end user or customer. In this case, the SLA refers to system
performance, expected uptime of a system, and other constraints that govern an
agreement.
This term has become less formal and many times conveys system performance or
expectations more generally related to performance or timeliness. Finally, although it's
not required, it is often a good idea to support the main points with a visual or graph.
Visual imagery serves to make a visceral connection and helps retain the main message
with the reader.
FIGURE 4.3.2 Anatomy of an Executive Summary slide
4.4 Approach:
In the Approach portion of the presentation, the team needs to explain the
methodology pursued on the project. This can include interviews with domain experts,
the groups collaborating within the organization, and a few statements about the
solution developed. The objective of this slide is to ensure the audience understands the
course of action that was pursued well enough to explain it to others within the
organization. The team should also include any additional comments related to working assumptions the team followed as it performed the work, because this can be critical in defending why it followed a specific course of action. When explaining the solution, the discussion should remain at a high level for the project sponsors. If presenting to analysts or data scientists, provide additional detail about the type of model used, including the technology and the actual performance of the model during the tests.
Finally, as part of the description of the approach, the team may want to mention constraints from systems, tools, or existing processes and any implications for how these things may need to change with this project.
Figure 4.4.1 shows an example of how to describe the methodology followed during a data science project to a sponsor audience.

FIGURE 4.4.1 Example describing the project methodology for project sponsors
4.5 Model Description:
After describing the project approach, teams generally include a description of the model that was used. Assuming the model will meet the agreed-upon SLAs, mention that this assessment is based on the performance of the model within the testing or staging environment. For instance, one may want to indicate that the model processed 500,000 records in 5 minutes to give stakeholders an idea of the speed of the model at run time. Analysts will want to understand the details of the model, including the decisions made in constructing the model and the scope of the data extracts used for testing and training. Be prepared to explain the team's thought process on this, as well as the speed of running the model within the test environment.

FIGURE 4.5 Model Description
4.6 Key Points Supported with Data
The next step is to identify key points based on insights and observations resulting from the data and the model scoring results. Find ways to illustrate the key points with charts and visualization techniques, using simpler charts for sponsors and more technical data visualizations for analysts and data scientists. For project sponsors, use simple charts such as bar charts, which illustrate data clearly and enable the audience to understand the value of the insights. This is also a good point to foreshadow some of the team's recommendations and begin tying together ideas to demonstrate what led to the recommendations and why. In other words, this section supplies the data and foundation for the recommendations that come later in the presentation.

FIGURE 4.6 Key points: Example of a presentation of key points of a data science project shown as a bar chart
4.7 Model Details
Model details are typically needed by people who have a more technical understanding
than the sponsors, such as those who will implement the code, or colleagues on the
analytics team. Project sponsors are typically less interested in the model details; they are
usually more focused on the business implications of the work rather than the details of
the model. This portion of the presentation needs to show the code or main logic of the
model, including the model type, variables, and technology used to execute the model and
score data.
FIGURE 4.7.1 Model Details
As part of the model detail description, guidance should be provided regarding the speed
with which the model can run in the test environment; the expected performance in a live,
production environment; and the technology needed.
FIGURE 4.7.2 Model details comparing two data variables
Recommendations
The final main component of the presentation involves creating a set of recommendations that include how to deploy the model from a business perspective within the organization and any other suggestions on the rollout of the model's logic. For the churn example, the recommendations might include:
 Implement the model as a pilot before a more wide-scale rollout; test and learn from the initial pilot regarding performance and precision.
 Address at-risk customers promptly; this can potentially save more customers from churning over time and also prevent the negative networking effects that seem to drive additional churn.
 Set up an early churn warning trigger based on this model; run the predictive model daily or weekly to be proactive on customer churn.
 Use an in-database scorer, which can score large datasets in a matter of minutes and can be run daily.
 Each customer retained via the early warning trigger saves 4 hours of account retention effort and $50K in new account acquisition costs.
 Develop targeted customer surveys to investigate the causes of churn, which will make the collection of data for investigating churn easier.
Additional Tips on the Final Presentation
As a team completes a project and strives to move on to the next one, it must remember to invest adequate time in developing the final presentation.
 Use imagery and visual representations: Visuals tend to make the presentation more compelling. Also, people recall imagery better than words, because images can have a more visceral impact. These visual representations can be static or interactive.
 Make sure the text is mutually exclusive and collectively exhaustive (MECE): This means having an economy of words in the presentation and making sure the key points are covered but not repeated unnecessarily.
 Measure and quantify the benefits of the project: This can be challenging and requires time and effort to do well. This kind of measurement should attempt to quantify financial and other benefits in a specific way. Stating that a project provided "$8.5M in annual cost savings" is much more compelling than saying it has "great value."
 Make the benefits of the project clear and conspicuous: After calculating the benefits of the project, make sure to articulate them clearly in the presentation.
4.8 Providing Technical Specifications and Code:
In addition to authoring the final presentations, the team needs to deliver the actual code
that was developed and the technical documentation needed to support it. The team
should consider how the project will affect the end users and the technical people who will
need to implement the code.
Teams should approach writing technical documentation for their code as if it were an application programming interface (API). Many times, the models become encapsulated as functions that read a set of inputs in the production environment, possibly perform preprocessing on the data, and create an output, including a set of post-processing results. For example, if the model returns a value representing the probability of customer churn, additional logic may be needed to identify the scoring threshold that determines which customer accounts to flag as being at risk of churn. In addition, some provision should be made for adjusting this threshold and retraining the algorithm, either in an automated learning fashion or with human intervention. A small sketch of such a wrapper function is shown below.
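The following is an illustrative sketch only, wrapping a churn model as a function with a tunable scoring threshold as the text suggests; the model object, field names, and threshold value are assumptions, not part of the original project deliverables:

from typing import Dict, List

def score_accounts(model, accounts: List[Dict], threshold: float = 0.7) -> List[Dict]:
    """Score each account and flag those whose churn probability exceeds the threshold."""
    flagged = []
    for account in accounts:
        # Assumed feature fields; a real deployment would document these in the technical spec.
        features = [account["tenure_months"], account["support_calls"], account["monthly_spend"]]
        churn_probability = model.predict_proba([features])[0][1]   # assumes a scikit-learn-style classifier
        if churn_probability >= threshold:
            flagged.append({"account_id": account["account_id"],
                            "churn_probability": round(churn_probability, 3)})
    return flagged

Keeping the threshold as a parameter makes it easy for engineers to tune the cutoff, or to retrain and reconfigure the model, without touching the core scoring logic.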
Although the team must create technical documentation, many times engineers and other
technical staff receive the code and may try to use it without reading through all the
documentation. Therefore, it is important to add extensive comments in the code. If the
team can do a thorough job adding comments in the code, it is much easier for someone
else to maintain the code and tune it in the runtime environment. In addition, it helps the
engineers edit the code when their environment changes or they need to modify processes
that may be providing inputs to the code or receiving its outputs.
4.9 Data Visualization:
As the volume of data continues to increase, more vendors and communities are developing tools to create clear and impactful graphics for use in presentations and applications. A wide range of popular open source and commercial tools exists for this purpose.
What is Data Visualization?
Data visualization is the representation of data through use of common
graphics, such as charts, plots, infographics and even animations. These visual
displays of information communicate complex data relationships and data-driven
insights in a way that is easy to understand.
Types of Data Visualization
 Tables: This consists of rows and columns used to compare variables.
Tables can show a great deal of information in a structured way, but they
can also overwhelm users that are simply looking for high-level trends.
 Pie charts and stacked bar charts: These graphs are divided into
sections that represent parts of a whole. They provide a simple way to
organize data and compare the size of each component to one another.
 Line charts and area charts: These visuals show change in one or more
quantities by plotting a series of data points over time and are frequently
used within predictive analytics. Line graphs utilize lines to demonstrate
these changes while area charts connect data points with line segments,
stacking variables on top of one another and using color to distinguish
between variables.
 Histograms: This graph plots a distribution of numbers using a bar chart
(with no spaces between the bars), representing the quantity of data that
falls within a particular range. This visual makes it easy for an end user to
identify outliers within a given dataset.
 Scatter plots: These visuals are beneficial in revealing the relationship
between two variables, and they are commonly used within regression data
analysis. However, they can sometimes be confused with bubble charts,
which are used to visualize three variables via the x-axis, the y-axis, and
the size of the bubble.
 Heat maps: These graphical representations are helpful in
visualizing behavioral data by location. This can be a location on a map, or
even a webpage.
 Tree maps: These display hierarchical data as a set of nested shapes,
typically rectangles. Tree maps are great for comparing the proportions
between categories via their area size. A short plotting sketch of two of these chart types follows this list.
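A small hedged sketch of two of the chart types above (a bar chart and a histogram) using matplotlib; the data values are made up for illustration:

import matplotlib.pyplot as plt
import numpy as np

categories = ["Phone", "Tablet", "Laptop"]
units_sold = [120, 75, 90]

rng = np.random.default_rng(0)
order_values = rng.normal(loc=50, scale=12, size=500)   # synthetic order values

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(categories, units_sold)                         # bar chart: compare categories
ax1.set_title("Units sold by product")

ax2.hist(order_values, bins=20)                         # histogram: distribution of a single variable
ax2.set_title("Distribution of order values")

plt.tight_layout()
plt.show()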
UNIT V – DATA ANALYTICS APPLICATIONS
Text and Web: Data Acquisition, Feature Extraction, Tokenization, Stemming, Conversion to Structured Data, Sentiment Analysis, Web Mining.
Recommender Systems: Feedback, Recommendation Tasks, Recommendation Techniques, Final Remarks.
Social Network Analysis: Representing Social Networks, Basic Properties of Nodes, Basic and Structural Properties of Networks.
Unit V: Chapter – 1: Text and Web
 Data Acquisition
 Feature Extraction
 Tokenization
 Stemming
 Conversion to Structured Data
 Sentiment Analysis
 Web Mining
5.1.1 Data Acquisition:
Data acquisition has been understood as the process of gathering, filtering, and
cleaning data before the data is put in a data warehouse or any other storage
solution.
The acquisition of big data is most commonly governed by four of the Vs: "volume,
velocity, variety, and value". Most data acquisition scenarios assume high-volume,
high-velocity, high-variety, but low-value data, making it important to have
adaptable and time-efficient gathering, filtering, and cleaning algorithms that
ensure that only the high-value fragments of the data are actually processed by
the data-warehouse analysis.
To get a better understanding of data acquisition, the chapter will first take a look
at the different big data architectures of Oracle, Vivisimo, and IBM. This will
integrate the process of acquisition within the big data processing pipeline.
The big data processing pipeline has been abstracted in numerous ways in previous
works. Oracle (2012) relies on a three-step approach for data processing. In the
first step, the content of different data sources is retrieved and stored within a
scalable storage solution such as a NoSQL database or the Hadoop Distributed File
System (HDFS). The stored data is subsequently processed by first being
reorganized and stored in an SQL-capable big data analytics software and finally
analysed by using big data analytics algorithms.
Velocity (Vivisimo 2012) relies on a different view of big data. Here, the approach
is more search-oriented. The main component of the architecture is a connector
layer, in which different data sources can be addressed. The content of these data
sources is gathered in parallel, converted, and finally added to an index, which
builds the basis for data analytics, business intelligence, and all other data-driven
applications. Other big players such as IBM rely on architectures similar to Oracle's
(IBM 2013).
Throughout the different architectures to big data processing, the core of data
acquisition boils down to gathering data from distributed information sources with
the aim of storing them in scalable, big data-capable data storage. To achieve this
goal, three main components are required:
1. Protocols that allow the gathering of information from distributed data sources of any type (unstructured, semi-structured, structured).
2. Frameworks with which the data is collected from the distributed sources by using different protocols.
3. Technologies that allow the persistent storage of the data retrieved by the frameworks.
5.1.1 The Data Acquisition Process:
What is exciting about data acquisition to data professionals is the richness of its
process.
Consider a basic set of tasks that constitute a data acquisition process:
 A need for data is identified, perhaps with use cases
 Prospecting for the required data is carried out
 Data sources are disqualified, leaving a set of qualified sources
 Vendors providing the sources are contacted and legal agreements entered into
for evaluation
 Sample data sets are provided for evaluation
 Semantic analysis of the data sets is undertaken, so they are adequately
understood
 The data sets are evaluated against originally established use cases
 Legal, privacy and compliance issues are understood, particularly with respect
to permitted use of data
 Vendor negotiations occur to purchase the data
 Implementation specifications are drawn up, usually involving Data Operations
who will be responsible for production processes
 Source onboarding occurs, such that ingestion is technically accomplished
 Production ingest is undertaken
There are several things that stand out about this list. The first is that it consists of
a relatively large number of tasks. The second is that it may easily be inferred that
many different groups are going to be involved, e.g., Analytics or Data Science will
likely come up with the need and use cases, whereas Data Governance, and
perhaps the Office of General Counsel, will have to give an opinion on legal,
privacy and compliance requirements.

An even more important feature of data acquisition is that the end-to-end process
sketched out above is only one of a number of possible variations. Other
approaches to data acquisition may involve using "open" data sources or
configuring tools to scan internet sources, or hiring a company to aggregate the
required data. Each of these variations will amount to a different end-to-end
process.
5.1.2 Feature Extraction:
Feature Extraction aims to reduce the number of features in a dataset by
creating new features from the existing ones (and then discarding the original
features). These new reduced set of features should then be able to summarize
most of the information contained in the original set of features. In this way, a
summarised version of the original features can be created from a combination of
the original set.
Feature Extraction techniques can also lead to other types of advantages
such as:
 Accuracy improvements.
 Overfitting risk reduction.
 Speed up in training.
 Improved Data Visualization.
 Increase in explainability of our model.
Let us take the Kaggle Mushroom Classification dataset as an example. The objective is to predict whether a mushroom is poisonous or not by looking at the given features.
First of all, we need to import all the necessary libraries and load the Mushroom Classification dataset.
Before feeding this data into our machine learning models, we divide the data into features (X) and labels (Y) and one-hot encode all the categorical variables.
Next, we create a function (forest_test) to divide the input data into train and test sets and then train and test a Random Forest classifier.
We can now use this function on the whole dataset, and then use it again to compare the results obtained when using just a reduced version of the dataset instead.
Training a Random Forest classifier using all the features led to 100% accuracy in about 2.2 s of training time. In each of the following examples, the training time of each model is reported for your reference.
Feature Extraction Techniques:
1) Principal Component Analysis (PCA)
PCA is one of the most widely used linear dimensionality reduction techniques. When using PCA, we take our original data as input and try to find a combination of the input features which can best summarize the original data distribution, so as to reduce its original dimensions. PCA does this by maximizing variances and minimizing the reconstruction error by looking at pairwise distances. In PCA, the original data is projected onto a set of orthogonal axes, and each of the axes is ranked in order of importance. PCA is an unsupervised learning algorithm. A minimal sketch of applying PCA to the mushroom features follows below.
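Hedged sketch: reducing the one-hot encoded features to a few principal components and re-running the same classifier, continuing the X, Y and forest_test() names introduced above (which are themselves assumptions); the number of components is illustrative:

from sklearn.decomposition import PCA

pca = PCA(n_components=3)
X_pca = pca.fit_transform(X)             # project the features onto 3 orthogonal axes

print(pca.explained_variance_ratio_)     # share of variance captured by each component
forest_test(X_pca, Y)                    # compare accuracy and speed against the full feature set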
2) Independent Component Analysis (ICA)
ICA is a linear dimensionality reduction method which takes as input data a mixture of independent components, and it aims to correctly identify each of them (deleting all the unnecessary noise). Two input features can be considered independent if both their linear and nonlinear dependence is equal to zero [1]. Independent Component Analysis is commonly used in medical applications such as EEG and fMRI analysis to separate useful signals from unhelpful ones.
3) Linear Discriminant Analysis (LDA)
LDA is a supervised learning dimensionality reduction technique and a machine learning classifier. LDA aims to maximize the distance between the mean of each class and minimize the spreading within the class itself. LDA therefore uses within-class and between-class scatter as measures. This is a good choice because maximizing the distance between the means of each class when projecting the data into a lower-dimensional space can lead to better classification results (thanks to the reduced overlap between the different classes). When using LDA, it is assumed that the input data follows a Gaussian distribution (as in this case); therefore, applying LDA to non-Gaussian data can possibly lead to poor classification results.
4) Locally Linear Embedding (LLE)
We have so far considered methods such as PCA and LDA, which perform really well in the case of linear relationships between the different features; we will now move on to considering how to deal with non-linear cases. Locally Linear Embedding is a dimensionality reduction technique based on Manifold Learning. A manifold is an object of D dimensions which is embedded in a higher-dimensional space. Manifold Learning aims to make this object representable in its original D dimensions instead of being represented in an unnecessarily larger space. Some examples of Manifold Learning algorithms are: Isomap, Locally Linear Embedding, Modified Locally Linear Embedding, Hessian Eigenmapping, etc.
Locally Linear Embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.
5) t-distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique which is typically used to visualize high-dimensional datasets. Some of the main applications of t-SNE are Natural Language Processing (NLP), speech processing, etc.
t-SNE works by minimizing the divergence between a distribution constituted by the pairwise probability similarities of the input features in the original high-dimensional space and its equivalent in the reduced low-dimensional space. t-SNE then makes use of the Kullback-Leibler (KL) divergence to measure the dissimilarity of the two distributions. The KL divergence is then minimized using gradient descent.
When using t-SNE, the higher-dimensional space is modelled using a Gaussian distribution, while the lower-dimensional space is modelled using a Student's t-distribution. This is done in order to avoid an imbalance in the neighboring points' distance distribution caused by the translation into a lower-dimensional space.
5.1.3 Tokenization:
Tokenization is a way of separating a piece of text into smaller units
called tokens. Here, tokens can be either words, characters, or subwords. Hence,
tokenization can be broadly classified into 3 types – word, character, and subword
(n-gram characters) tokenization.
For example, consider the sentence: "Never give up".
The most common way of forming tokens is based on space. Assuming space as a
delimiter, the tokenization of the sentence results in 3 tokens – Never-give-up. As
each token is a word, it becomes an example of Word tokenization.
Tokenization Types:
1)Word Tokenization
Word Tokenization is the most commonly used tokenization algorithm.
It splits a piece of text into individual words based on a certain delimiter.
Depending upon the delimiters, different word-level tokens are formed. Pretrained
word embeddings such as Word2Vec and GloVe come under word tokenization.
2) Character Tokenization
Character Tokenization splits a piece of text into a set of characters. It
overcomes some drawbacks of Word Tokenization.
 Character tokenizers handle out-of-vocabulary (OOV) words coherently by preserving the
information of the word. They break down the OOV word into characters and
represent the word in terms of these characters.
 It also limits the size of the vocabulary: for English text the vocabulary size is only
26, since it contains just the unique set of characters.
Tokenization Algorithm:
Byte Pair Encoding (BPE):
Byte Pair Encoding (BPE) is a widely used tokenization method among transformer-
based models. BPE addresses the issues of Word and Character Tokenizers:
 BPE tackles OOV words effectively. It segments an OOV word into subwords and represents the
word in terms of these subwords
 The length of the input and output sentences after BPE is shorter compared to
character tokenization
BPE is a word segmentation algorithm that iteratively merges the most frequently occurring character or character sequences. Here is a step-by-step guide to learning BPE (a minimal sketch follows the steps below).
Steps to learn BPE
1. Split the words in the corpus into characters after appending </w>
2. Initialize the vocabulary with unique characters in the corpus
3. Compute the frequency of a pair of characters or character sequences in corpus
4. Merge the most frequent pair in corpus
5. Save the best pair to the vocabulary
6. Repeat steps 3 to 5 for a certain number of iterations
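A minimal sketch of the pair-counting and merging steps above on a toy corpus; the corpus, the number of iterations, and the helper names are assumptions for illustration:

import re
from collections import Counter

def get_pair_counts(vocab):
    """Count frequencies of adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single symbol."""
    merged = {}
    bigram = re.escape(" ".join(pair))
    pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")
    for word, freq in vocab.items():
        merged[pattern.sub("".join(pair), word)] = freq
    return merged

# Step 1 and 2: words split into characters with an end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):                       # steps 3-6: repeat for a chosen number of iterations
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)      # most frequent pair
    vocab = merge_pair(best, vocab)
    print(best)                           # the pair saved to the vocabulary at this iteration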
5.1.4 Stemming:
Stemming is a natural language processing technique that lowers
inflection in words to their root forms, hence aiding in the preprocessing of text,
words, and documents for text normalization.
Inflection is the process through which a word is modified to communicate many
grammatical categories, including tense, case, voice, aspect, person, number,
gender, and mood. Thus, although a word may exist in several inflected forms,
having multiple inflected forms inside the same text adds redundancy to the NLP
process.
As a result, we employ stemming to reduce words to their basic form or stem,
which may or may not be a legitimate word in the language.
For instance, the stem of the three words connections, connected, connects is
"connect". On the other hand, the root of trouble, troubled, and troubles is
"troubl", which is not a recognized word.
Application of Stemming:
Stemming is employed in information retrieval, text mining, SEO, web search results, indexing,
tagging systems, and word analysis. For instance, a Google search for prediction and predicted
returns comparable results.
Types of Stemmer:
There are several kinds of stemming algorithms. Let us have a look.
1. Porter Stemmer – PorterStemmer()
Martin Porter invented the Porter Stemmer or Porter algorithm
in 1980. Five steps of word reduction are used in the method, each with its own
set of mapping rules. Porter Stemmer is the original stemmer and is renowned for
its ease of use and rapidity. Frequently, the resultant stem is a shorter word with
the same root meaning.
PorterStemmer() is a module in NLTK that implements the Porter Stemming
technique. Let us examine this with the aid of an example.
Example of PorterStemmer()
In the example below, we construct an instance of PorterStemmer() and use the Porter algorithm to stem a list of words.
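The original code listing was not reproduced in the text; a minimal sketch of such an example, assuming NLTK is installed:

from nltk.stem import PorterStemmer

porter = PorterStemmer()
words = ["connections", "connected", "connects", "trouble", "troubled", "troubles"]
print([porter.stem(word) for word in words])
# e.g. ['connect', 'connect', 'connect', 'troubl', 'troubl', 'troubl']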
2. Snowball Stemmer – SnowballStemmer()
Martin Porter also created the Snowball Stemmer. The method utilized in this instance is more precise and is referred to as the "English Stemmer" or "Porter2 Stemmer." It is somewhat faster and more logical than the original Porter Stemmer.
SnowballStemmer() is a module in NLTK that implements the Snowball stemming technique. Let us examine this form of stemming using an example.
Example of SnowballStemmer()
In the example below, we first construct an instance of SnowballStemmer() and use the Snowball algorithm to stem a list of words.
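A minimal sketch of such an example, assuming NLTK is installed; note that SnowballStemmer takes the language as an argument:

from nltk.stem import SnowballStemmer

snowball = SnowballStemmer("english")
words = ["generously", "running", "happiness", "troubled"]
print([snowball.stem(word) for word in words])
# e.g. ['generous', 'run', 'happi', 'troubl']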
3. Lancaster Stemmer – LancasterStemmer()
Lancaster Stemmer is straightforward, although it often produces
results with excessive stemming. Over-stemming renders stems non-linguistic or
meaningless.
LancasterStemmer() is a module in NLTK that implements the Lancaster stemming
technique. Allow me to illustrate this with an example.
Example of LancasterStemmer()
In the example below, we construct an instance of LancasterStemmer() and then
use the Lancaster algorithm to stem the list of words.
4. Regexp Stemmer – RegexpStemmer()
Regex stemmer identifies morphological affixes using regular expressions.
Substrings matching the regular expressions will be discarded.
RegexpStemmer() is a module in NLTK that implements the Regex stemming
technique. Let us try to understand this with an example.
Example of RegexpStemmer()
In this example, we first construct an object of RegexpStemmer() and then use the
Regex stemming method to stem the list of words.
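A minimal sketch of such an example, assuming NLTK is installed; RegexpStemmer strips substrings matching the given regular expression, and min is the minimum word length to which stripping is applied:

from nltk.stem import RegexpStemmer

regexp = RegexpStemmer("ing$|s$|e$|able$", min=4)
words = ["mass", "was", "bee", "computer", "advisable"]
print([regexp.stem(word) for word in words])
# e.g. ['mas', 'was', 'bee', 'computer', 'advis']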
5.1.5 Conversion to Structured Data:
There are seven steps to analyze unstructured data to extract
structured data insights as below.
First analyze the data sources
Before you can begin, you need to analyze what sources of data are essential for
the data analysis. Unstructured data sources are found in different forms like
web pages, video files, audio files, text documents, customer emails, chats and
more. You should analyze and use only those unstructured data sources that are
completely relevant.
1. Know what will be done with the results of the analysis
If the end result is not clear, the analysis may be unusable. It is key to understand what sort of outcome is required: is it a trend, an effect, a cause, a quantity or something else that is needed? There should be a clear road-map defined for what will be done with the final results so that they can be used for business, market or other organizational gains.
2. Decide the technology for data intake and storage as per business
needs
Though the unstructured data will come from different sources, the outcomes of
the analysis must be injected in a technology stack so that the outcomes can be
straightforwardly used. The features that are important for selecting the data retrieval
and storage technology depend on the volume, scalability, velocity and variety
requirements. A prospective technology stack should be well assessed against the
concluding requirements, after which the data architecture of the whole project is
set-up.
Certain examples of business needs and the selection of the technology
stack are:
Real-time: It has become very critical for e-commerce companies to offer real-time prices. This requires monitoring and tracking real-time competitor activities and adjusting offers based on the instant results of an analytics software. Such pricing technologies include competitor price monitoring software.
Higher availability: This is vital for ingesting unstructured data and information from social media platforms. The technology platform used should make sure that there is no loss of data in real time. It is a good idea to back the information intake with a data redundancy plan.
Support multi-tenancy: Another important element is the capability to isolate data from diverse user groups. Effective data intelligence solutions should natively support multi-tenancy. The isolation of data is significant given the sensitivities involved with customer data and feedback, combined with the important insights, in order to meet confidentiality requirements.
3. Keep the information stored in a data warehouse till the end
Information should be stored in its native format until it is judged beneficial and required for a precise purpose, while maintaining storage of metadata or other information that might help in the analysis later, if not now.
4. Formulate data for the storage
While maintaining the original data files, if you need to enable utilization of the data, the best option is to clean one of the copies. It is always better to cleanse whitespace and symbols while transforming text. Duplicate results should be removed, and off-topic data or information should be removed from the datasets.
5. Understand the data patterns and text flow
By using semantic analysis and natural language processing, you can use part-of-speech tagging to fetch common entities, like "person", "location", "company", and their internal relationships. By doing this, you can build a term frequency matrix to better understand the data patterns and the text flow (a minimal sketch follows below).
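A hedged sketch of building a simple term-frequency matrix from cleaned text, as described above; the documents are illustrative assumptions:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

documents = [
    "customer praised the delivery speed",
    "customer complained about delivery delay",
    "product quality praised by the customer",
]

vectorizer = CountVectorizer(stop_words="english")
tf_matrix = vectorizer.fit_transform(documents)

# One row per document, one column per term, each cell holding the term frequency
tf_table = pd.DataFrame(tf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tf_table)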
6. Text mining and Data extraction
Once the database has been shaped, the data must be categorized and properly segmented. Data intelligence tools can be utilized to search for similarities in customer behavior when targeting a particular campaign or classification. The outlook of customers can be determined using sentiment analysis of feedback and reviews, which assists in better understanding product recommendations and market trends and offers guidance for new product or service launches.
You can utilize social media intelligence solutions to extract the posts or events that customers and prospects are sharing through social media, forums and other platforms to improve your products and services.
7. Implement and Influence project measurement
The end results matter the most, whatever they might be. It is vital that the results are provided in the required format, extracting and offering structured data insights from unstructured data.
This should be handled through web data extraction software and a data intelligence tool, so that the user can execute the required actions on a real-time basis.
The ultimate step is to measure the effect and the ROI in terms of revenue, process effectiveness and business improvements.
5.1.6 Sentiment Analysis:
Sentiment Analysis is the process of classifying whether a block of text is positive, negative, or neutral. Sentiment analysis is contextual mining of words which indicates the social sentiment of a brand and also helps the business to determine whether the product they are manufacturing is going to be in demand in the market or not. The goal of sentiment analysis is to analyze people's opinions in a way that can help businesses expand. It focuses not only on polarity (positive, negative and neutral) but also on emotions (happy, sad, angry, etc.). It uses various Natural Language Processing approaches such as rule-based, automatic, and hybrid.
Why perform Sentiment Analysis?
According to surveys, 80% of the world's data is unstructured. This data needs to be analyzed and put into a structured form, whether it is in the form of emails, texts, documents, articles, or many more.
1. Sentiment analysis is required as it stores data in an efficient, cost-friendly manner.
2. Sentiment analysis solves real-time issues and can help you handle real-time scenarios.
Types of Sentiment Analysis
1. Fine-grained sentiment analysis: This depends on the polarity based. This
category can be designed as very positive, positive, neutral, negative, very
negative. The rating is done on the scale 1 to 5. If the rating is 5 then it is
very positive, 2 then negative and 3 then neutral.
2. Emotion detection: The sentiment happy, sad, anger, upset, jolly, pleasant,
and so on come under emotion detection. It is also known as a lexicon
method of sentiment analysis.
3. Aspect based sentiment analysis: It focuses on a particular aspect like for
instance, if a person wants to check the feature of the cell phone then it
checks the aspect such as battery, screen, camera quality then aspect based
is used.
4. Multilingual sentiment analysis: Multilingual consists of different
languages where the classification needs to be done as positive, negative, and
neutral. This is highly challenging and comparatively difficult.
Approaches of Sentiment Analysis:
There are three approaches used:
1. Rule-based approach: Here, the lexicon method, tokenization, and parsing come into play. The approach counts the number of positive and negative words in the given dataset. If the number of positive words is greater than the number of negative words then the sentiment is positive, else vice versa.
2. Automatic approach: This approach works with machine learning techniques. Firstly, the datasets are trained and predictive analysis is done. Next, words are extracted from the text. This text extraction can be done using different techniques such as Naive Bayes, Linear Regression, Support Vector Machines, or Deep Learning (a minimal sketch of this approach follows the list below).
3. Hybrid approach: This is the combination of both the above approaches, i.e., rule-based and automatic. The benefit is that the accuracy is higher compared to the other two approaches.
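A minimal sketch of the automatic (machine learning) approach: training a Naive Bayes classifier on a tiny, made-up labelled dataset (the texts, labels, and pipeline choices are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["I love this phone", "great battery and screen",
               "worst purchase ever", "the screen is terrible"]
train_labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["battery is great", "terrible phone"]))
# e.g. ['positive' 'negative']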
Applications:
Sentiment Analysis has a wide range of applications as:
1. Social media: For instance, comments on social media sites such as Instagram are analyzed and categorized as positive, negative, and neutral.
2. Customer service: In the Play Store, comments with ratings from 1 to 5 are analyzed with the help of sentiment analysis approaches.
3. Marketing sector: In the marketing area, a particular product needs to be reviewed as good or bad.
4. Reviewer side: Reviewers can look at the comments and check and give the overall review of the product.
Challenges of Sentiment Analysis:
There are major challenges in the sentiment analysis approach:
1. If the sentiment is conveyed through tone, it becomes really difficult to detect whether the comment is pessimistic or optimistic.
2. If the data is in the form of emojis, you need to detect whether each one is good or bad.
3. Even detecting ironic, sarcastic, or comparative comments is really hard.
4. Handling a neutral statement is a big task.
5.1.7 Web Mining:
Different Process in Web Mining:
The processes of Web mining are divided into four stages: i. source data collection, ii. data pre-processing, iii. pattern discovery, and iv. pattern analysis.
1. Source Data Collection
The direct source of data in web mining is mostly web log files which are stored on the web server. Web log files record all the behaviour of the user at that time on the web, including the server log, agent log and client log.
2. Data Pre-processing
The data which gets collected from the web generally has features that are incomplete, redundant and ambiguous. Pre-processing provides accurate and concise data for data mining. This is the technique used to clean server log files to eliminate irrelevant data, hence its importance for web log analysis in web mining. It includes data cleaning, user identification, user session identification, access path supplementation and transaction identification (Fig. 2), the details of which are as below.

2.1 Data Cleaning


This process removes redundant web log data that is not associated with useful
data, thus improving the scope of the data objects.
2.2 User Identification
The main task of this process is to identify each user uniquely on the web server.
This can be done using cookie technology, user registration and heuristic rules.
2.3 User Session Identification
This process is done on the basis of user identification, and its purpose is to divide
each user's access information into several separate sessions. The timeout
estimation approach is used to separate sessions while the web server is in use:
when the allocated time for one session has elapsed, a new session is initiated
automatically.
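A small sketch of the timeout-based session splitting described above, assuming one user's log records are already sorted by time; the 30-minute threshold and the log entries are arbitrary illustrative choices.

# Split one user's ordered page requests into sessions using a timeout rule.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)   # assumed session timeout

def split_sessions(requests):
    """requests: list of (timestamp, url) tuples sorted by time."""
    sessions, current, last_time = [], [], None
    for ts, url in requests:
        if last_time is not None and ts - last_time > TIMEOUT:
            sessions.append(current)   # allocated time elapsed: close the session
            current = []               # and start a new one automatically
        current.append(url)
        last_time = ts
    if current:
        sessions.append(current)
    return sessions

log = [(datetime(2024, 1, 1, 10, 0), "/home"),
       (datetime(2024, 1, 1, 10, 5), "/products"),
       (datetime(2024, 1, 1, 11, 0), "/home")]    # 55-minute gap -> new session
print(split_sessions(log))    # [['/home', '/products'], ['/home']]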
2.4 Access Path Estimation
Because of page caching technology and proxy servers, the access path recorded in
the access log files on the web server does not give the complete access path of
users. Instead, path supplements can be achieved by using the web site topology to
complete the page analysis.

2.5. Transaction Identification
This process is entirely based on user session identification. Web transactions are
divided or combined according to the demands of the data mining tools.
3. Pattern Discovery
There are many types of access pattern mining that can be performed, depending
on the needs of the analyst. Some of these are path analysis, association rule
discovery, sequential pattern discovery, clustering analysis and classification.
3.1 Path Analysis
The physical layout of a website is represented in graphical form: each web page is
denoted as a node and the hyperlink between two pages is represented as an edge
of the graph.
3.2 Association Rules
Association rules focus mainly on discovering relations between pages visited by
users on the website. Association rules can be used to relate the web pages that
are most often accessed in a single server session. These pages may be interlinked
to one another by hyperlinks. For instance, an association rule for a BBA program
is BBA/seminar.html and BBA/speakers.html, whereby seminar is related to
speakers.
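To make the idea concrete, the following pure-Python sketch counts page co-occurrences across sessions and derives simple association rules with support and confidence; the session data and page names are made up, and a real system would typically use a dedicated mining algorithm such as Apriori.

# Derive simple association rules (support, confidence) from page sets per session.
from itertools import combinations
from collections import Counter

sessions = [
    {"BBA/seminar.html", "BBA/speakers.html", "BBA/index.html"},
    {"BBA/seminar.html", "BBA/speakers.html"},
    {"BBA/index.html", "BBA/fees.html"},
]

page_count = Counter(p for s in sessions for p in s)
pair_count = Counter(frozenset(c) for s in sessions for c in combinations(s, 2))

n = len(sessions)
for pair, cnt in pair_count.items():
    a, b = sorted(pair)                   # fix an order for the rule a -> b
    support = cnt / n                     # fraction of sessions containing both pages
    confidence = cnt / page_count[a]      # P(b in session | a in session)
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")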
3.3 Sequential Pattern Discovery
This technique is used to find inter-session patterns such that the presence of a set
of items is followed by another item within the time window allotted to the session.
With the help of this approach, web sellers or buyers can predict future visit
patterns, which is helpful in placing advertisements aimed at certain user groups.
Other techniques that are useful for sequential patterns include change point
detection, similarity analysis and trend analysis.
3.4 Classification Analysis
Classification is the mapping of a data item into one of several predefined classes,
i.e. it classifies data according to predefined categories. Classification can be done
with the help of algorithms such as decision tree classifiers and naïve Bayesian
classifiers. Web mining classification techniques allow one to develop a profile for
clients who access a particular server file, based on their access patterns.
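As a sketch of such profiling, the snippet below trains a decision tree classifier on synthetic access-pattern features; scikit-learn is assumed to be installed, and the feature set and class labels are invented for illustration.

# Classify visitors from access-pattern features with a decision tree.
from sklearn.tree import DecisionTreeClassifier

# Features per visitor: [pages_viewed, avg_time_per_page_sec, downloads]
X = [[3, 20, 0], [40, 95, 5], [5, 30, 0], [35, 80, 4], [4, 25, 1], [50, 120, 6]]
y = ["casual", "client", "casual", "client", "casual", "client"]

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict([[45, 100, 3]]))   # expected to fall into the "client" profile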
3.5 Clustering Analysis
Cluster analysis is the most popular technique used in web mining, wherein a set of
items with similar attributes or characteristics is grouped together. It can support
the marketing decisions of marketers: clustering of user information from the web
transaction logs can inform future marketing strategies, both online and off-line.
Two types of clustering methods are used, namely hierarchical clustering
(agglomerative vs. divisive and single link vs. complete link) and partition
clustering (distance-based, model-based and density-based).
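The two clustering families mentioned above can be sketched on toy user data as follows; scikit-learn is assumed to be installed, and the features, cluster counts and linkage choice are arbitrary illustrative settings.

# Hierarchical (agglomerative) and partition (k-means) clustering on toy data.
from sklearn.cluster import AgglomerativeClustering, KMeans

# Features per user: [sessions_per_week, avg_purchase_value]
X = [[1, 10], [2, 12], [1, 9], [8, 150], [9, 160], [7, 140]]

hier = AgglomerativeClustering(n_clusters=2, linkage="complete").fit(X)
part = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("hierarchical labels:", list(hier.labels_))   # two groups of users
print("k-means labels:     ", list(part.labels_))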
4. Pattern Analysis
Its main purpose is to find valuable patterns in the discovered models. Many types
of techniques are used for the analysis, such as visualization tools, OLAP
techniques, data and knowledge querying and usability analysis (Fig. 5.1), whose
details are given below:
4.1. Visualization Techniques
Visualization is a natural choice for understanding the behaviour of web users. The
web is visualized as a directed graph with cycles, in which pages are represented
as nodes and hyperlinks are denoted as edges of the graph.
4.2. OLAP (Online Analytical Processing)
This is a very powerful paradigm for the strategic analysis of databases in business
systems. OLAP can be performed directly on top of a relational database.
4.3 Data and Knowledge Query
This is the most important analysis pattern in web mining, where the focus is on
the proper analysis of user problems or user needs. Such a focus is provided in two
different ways:
* Constraints may be placed on the database in a declarative language.
* Queries may be performed on the knowledge that has been discovered and
extracted by the mining process.
4.4. Usability Analysis
This analysis covers the details of software usability as well as user usability. The
approach can also be used for any model of user access behaviour on the website.

Fig 5.1 Web mining process

Unit – V: Chapter-2 Recommender Systems
 Feedback
 Recommendation Tasks
 Recommendation Techniques
 Final Remarks
5.2.1 Feedback:
Recommender systems require certain feedback to perform
recommendations. That is why they need information on users' past behavior, the
behavior of other people, or the content information of the domain to produce
predictions. The workflow of a recommendation process can be defined as:
 Collecting information
 Learning
 Producing recommendations
There are three main ways for a recommender to collect information, also known
as feedback:
 Implicit Feedback
 Explicit Feedback
 Hybrid Feedback
Implicit Feedback
No user participation is required to gather implicit feedback, unlike explicit
feedback. The system automatically tracks users' preferences by monitoring the
actions they perform, such as which items they visited, where they clicked, which
items they purchased, or how long they stayed on a web page. One must find the
correct actions to track based on the domain in which the recommender system
operates. Another advantage of implicit feedback is that it reduces the cold start
problems that occur until an item is rated enough to be served as a
recommendation.
Explicit Feedback
To collect explicit feedback, the system must ask users to provide their ratings for
items. After collecting the feedback, the system knows how relevant or similar an
item is to a user's preferences. Even though this allows the recommender to learn
the user's exact opinion, it requires direct participation from the user and is
therefore often not easy to collect. That is why there are different ways to collect
feedback from users. Implementing a like/dislike functionality in a web site allows
users to evaluate the content easily. Alternatively, the system can ask users to
enter ratings on a discrete numeric scale that represents how much the user liked
or disliked the content. Netflix, for example, often asks customers to rate movies.

Another way to collect explicit feedback is to ask users to insert their comments as
text. While this is a great way to learn user opinion, it is usually not easy to obtain
and evaluate.
Hybrid Feedback
Hybrid Feedback uses both explicit and implicit feedback to
maximize prediction quality. To use the hybrid method, the system must be able to
collect explicit and implicit feedback from users.
5.2.2 Recommendation System:
Recommendation engines are a subclass of machine learning systems that
generally deal with ranking or rating products and users. Loosely defined, a
recommender system is a system which predicts the ratings a user might give to a
specific item. These predictions are then ranked and returned to the user.
The most common and widely used types of recommendation systems are:
 Content-Based Filtering
 Collaborative Filtering
 Hybrid Recommendation Systems

1. Content-Based Filtering: In content-based filtering, the recommendation of
a product to the user is based on similarity measures over the properties of
products the user has purchased in the past (see the sketch after this list).
2. Collaborative Filtering: In collaborative filtering, the recommendation of a
product to the user is based on the similarity measures of like-minded people
or items. It is sub-divided into Neighborhood-based approach, Model-based
approach, and Hybrid models
3. Hybrid Recommendation Systems: User preferences are dynamic in nature,
and a single content-based or collaborative filtering approach is unable to
recommend products to users with great accuracy. In a hybrid recommendation
system, the recommendation of a product to the user is therefore based on a
combination of content-based filtering and collaborative filtering.
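As referenced in item 1 above, a minimal content-based filtering sketch can score catalogue items by cosine similarity to an item the user already bought; the feature vectors and item names below are hypothetical.

# Content-based filtering sketch: rank items by cosine similarity of features.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Item feature vectors: [battery, camera, screen, affordability]
catalogue = {
    "phone_A": [0.9, 0.8, 0.7, 0.3],
    "phone_B": [0.2, 0.9, 0.8, 0.9],
    "phone_C": [0.85, 0.75, 0.7, 0.35],
}
purchased = "phone_A"   # item the user liked in the past

scores = {item: cosine(catalogue[purchased], vec)
          for item, vec in catalogue.items() if item != purchased}
print(max(scores, key=scores.get))   # phone_C, the most similar item, is recommended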
Applications of Recommendation Systems:
There are various fields where recommender systems are applicable; some of them
are stated below:
 E-Commerce: It is used in E-Commerce sites to recommend products to its
user.
 Media: It is used in electronic media to recommend the latest news and
updates.
 Banking: It is used in banking sectors to suggest the latest offers and
benefits to its user.
 Telecom: It is used in telecommunication sectors to offer the best services to
its user.
 Movies: It is used to recommend movies as per its user choices.
 Music: It is used to recommend songs to its user based on their previous
choices.
 Books: It is used to recommend books to its end-users based on the genre
they love to read.
 Tourism scenic spots: It is used in tourism sites to offer the most prominent
and adequate travel services to the users.
Recommendation Tasks:
A recommendation task can be defined as a constraint satisfaction problem
(V_C, V_PROD, C_C ∪ C_F ∪ C_R ∪ C_PROD), where V_C is a set of variables
representing possible customer requirements and V_PROD is a set of variables
describing product properties. C_PROD is a set of constraints describing product
instances, C_R is a set of constraints describing possible combinations of customer
requirements, and C_F (also called filter conditions) is a set of constraints describing
the relationship between customer requirements and product properties. Finally,
C_C is a set of unary constraints representing concrete customer requirements.
Problem Solving Approaches of Recommendation Tasks:
Typical approaches to solving a recommendation task are constraint satisfaction
algorithms and conjunctive database queries.
Constraint Satisfaction. Solutions for constraint satisfaction problems are
calculated on the basis of search algorithms that use different combinations of
backtracking and constraint propagation; the basic principle of both concepts is
explained in the following.
Backtracking. In each step, backtracking chooses a variable and tries to assign one
of its possible values. It checks the consistency of the assignment with the already
existing assignments and the defined set of constraints. If all possible values of the
current variable are inconsistent with the existing assignments and the constraints,
the constraint solver backtracks, which means that the previously instantiated
variable is selected again. If a consistent assignment has been identified, the
backtracking algorithm is activated recursively and the next variable is
selected [54].
Constraint Propagation.

The major disadvantage of pure backtracking-based search is "thrashing", where
parts of the search space are revisited although no solution exists in these parts.
In order to make constraint solving more efficient, constraint propagation
techniques have been introduced. These techniques try to modify an existing
constraint satisfaction problem such that the search space can be reduced
significantly. The methods try to create a state of local consistency that guarantees
consistent instantiations among groups of variables; the modification steps turn an
existing constraint satisfaction problem into an equivalent one. A well-known type
of local consistency is arc consistency [54], which states that for two variables X
and Y there must not exist a value in the domain of Y that does not have a
corresponding consistent value in X. Arc consistency is thus a directed concept: if X
is arc consistent with Y, the reverse is not necessarily the case. When using a
constraint solver, constraints are typically represented as expressions of the
corresponding programming language. Many of the existing constraint solvers are
implemented on the basis of Java (see, for example, jacop.osolpro.com).
Conjunctive Database Queries. Solutions to conjunctive queries are calculated on
the basis of database queries that try to retrieve items which fulfill all of the
defined customer requirements. For details on the database technologies and the
execution of queries on database tables see, for example, [46].
Ranking Items. Given a recommendation task, both constraint solvers and database
engines try to identify a set of items that fulfill the given customer requirements.
Typically, we have to deal with situations where more than one item is part of a
recommendation result. In such situations the items (products) in the result set
have to be ranked. In both cases (constraint solvers and database engines), we can
apply the concepts of multi-attribute utility theory (MAUT) [56], which helps to
determine a ranking for each of the items in the result set. Examples of the
application of MAUT can be found in [13]. An alternative to applying MAUT in
combination with conjunctive queries are probabilistic databases [35], which allow
a direct specification of ranking criteria within a query. The example below shows
such a query, which selects all products that fulfill the criteria in the WHERE clause
and orders the result according to a similarity metric (defined in the ORDER BY
clause). Finally, instead of combining the mentioned standard constraint solvers
with MAUT, we can represent a recommendation task in the form of soft
constraints, where the importance (preference) of each combination of variable
values is determined on the basis of a corresponding utility operation (for details
see, for example, [1]).
Example: query in a probabilistic database
Result = SELECT *                                         /* calculate a solution */
         FROM Products                                    /* select items from "Products" */
         WHERE x1 = a1 AND x2 = a2                        /* "must" criteria */
         ORDER BY score(abs(x3-a3), ..., abs(xm-am))      /* similarity-based utility function */
         STOP AFTER N;                                    /* at most N items in the solution (result set) */
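The same filter-then-rank pattern can be sketched in plain Python, which may help readers without access to a probabilistic database; the product list, attribute names (x1, x2, x3) and the constant a3 below are hypothetical stand-ins for the placeholders in the query above.

# Filter products on the hard ("must") criteria, then order the survivors by a
# similarity-based utility, mirroring the WHERE / ORDER BY / STOP AFTER pattern.
products = [
    {"id": 1, "x1": "a1", "x2": "a2", "x3": 100},
    {"id": 2, "x1": "a1", "x2": "a2", "x3": 140},
    {"id": 3, "x1": "a1", "x2": "b2", "x3": 100},   # violates the x2 constraint
]
a3, N = 110, 2                                      # customer requirement and result size

candidates = [p for p in products if p["x1"] == "a1" and p["x2"] == "a2"]
ranked = sorted(candidates, key=lambda p: abs(p["x3"] - a3))   # smaller distance = higher utility
print(ranked[:N])                                   # at most N items in the result set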
5.2.3 Recommendation Techniques:
Recommendation techniques comprise six different approaches. They are:
1.Content-based: The system learns to recommend items that are similar to the
ones the user liked in the past. The similarity of items is calculated based on the
features associated with the compared items. For example, if a user has positively
rated a movie that belongs to the comedy genre, the system can learn to
recommend other movies from this genre. Content-based recommender systems
have well-established basic concepts and terminology, a high-level architecture,
and known advantages and drawbacks; state-of-the-art systems have been
adopted in several application domains, using both classical and advanced
techniques for representing items and user profiles, and ongoing research on these
topics may lead towards the next generation of recommender systems.
2.Collaborative filtering: The simplest and original implementation of this
approach recommends to the active user the items that other users with similar
tastes liked in the past. The similarity in taste of two users is calculated based on
the similarity of their rating histories; this is why collaborative filtering is referred
to as "people-to-people correlation". Collaborative filtering is considered the most
popular and widely implemented technique in RS (a small user-user sketch follows
this list of techniques).
3.Demographic: This type of system recommends items based on the
demographic profile of the user. The assumption is that different recommendations
should be generated for different demographic niches. Many web sites adopt
simple and effective personalization solutions based on demographics: for
example, users are dispatched to particular web sites based on their language or
country, or suggestions may be customized according to the age of the user. While
these approaches have been quite popular in the marketing literature, there has
been relatively little proper RS research into demographic systems.

4.Knowledge-based: Knowledge-based systems recommend items based on
specific domain knowledge about how certain item features meet users' needs and
preferences and, ultimately, how the item is useful for the user. Notable
knowledge-based recommender systems are case-based. In these systems a
similarity function estimates how well the user's needs (the problem description)
match the recommendations (solutions of the problem); here the similarity score
can be directly interpreted as the utility of the recommendation for the user.
Constraint-based systems are another type of knowledge-based RS. In terms of the
knowledge used, both systems are similar: user requirements are collected;
repairs for inconsistent requirements are automatically proposed in situations
where no solution could be found; and recommendation results are explained. The
major difference lies in the way solutions are calculated.
Case-based recommenders determine recommendations on the basis of similarity
metrics, whereas constraint-based recommenders predominantly exploit predefined
knowledge bases that contain explicit rules about how to relate customer
requirements to item features. Knowledge-based systems tend to work better than
others at the beginning of their deployment, but if they are not equipped with
learning components they may be surpassed by shallower methods that can
exploit the logs of the human/computer interaction (as in CF).
5.Community-based: This type of system recommends items based on the
preferences of the user's friends. The technique follows the epigram "Tell me who
your friends are, and I will tell you who you are". Evidence suggests that people
tend to rely more on recommendations from their friends than on
recommendations from similar but anonymous individuals. This observation,
combined with the growing popularity of open social networks, is generating a
rising interest in community-based systems or, as they are usually referred to,
social recommender systems. This type of RS models and acquires information
about the social relations of the users and the preferences of the user's friends.
The recommendation is based on ratings that were provided by the user's friends.
In fact, these RSs follow the rise of social networks and enable a simple and
comprehensive acquisition of data related to the social relations of the users.

6.Hybrid recommender systems: These RSs are based on a combination of the
above-mentioned techniques. A hybrid system combining techniques A and B tries
to use the advantages of A to fix the disadvantages of B. For instance, CF methods
suffer from the new-item problem, i.e. they cannot recommend items that have no
ratings. This does not limit content-based approaches, since the prediction for new
items is based on their descriptions (features), which are typically easily available.
Given two (or more) basic RS techniques, several ways have been proposed for
combining them to create a new hybrid system. As already mentioned, the context
of the user when she is seeking a recommendation can be used to better
personalize the output of the system. For example, in a temporal context, vacation
recommendations in winter should be very different from those provided in
summer; a restaurant recommendation for a Saturday evening with friends should
be different from one suggested for a workday lunch with co-workers.
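As referenced in the collaborative filtering paragraph above, the following is a minimal user-user (memory-based) collaborative filtering sketch; the ratings matrix, user names and item names are invented, and cosine similarity over co-rated items is just one of several possible similarity choices.

# User-user (memory-based) collaborative filtering: predict a missing rating as
# the similarity-weighted average of ratings by other users. Toy data only.
import math

ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 3, "m3": 5},
    "carol": {"m1": 1, "m2": 5, "m3": 2},
    "dave":  {"m1": 5, "m2": 2},          # m3 is unknown and will be predicted
}

def cosine_sim(u, v):
    common = set(ratings[u]) & set(ratings[v])      # co-rated items only
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    nv = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (nu * nv) if nu and nv else 0.0

def predict(user, item):
    num = den = 0.0
    for other, r in ratings.items():
        if other != user and item in r:
            s = cosine_sim(user, other)
            num += s * r[item]
            den += abs(s)
    return num / den if den else None

print(round(predict("dave", "m3"), 2))    # roughly 4, close to the similar users' ratings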
5.2.4 Final Remarks:
 Recommendation algorithms can be divided into two great paradigms:
collaborative approaches (such as user-user, item-item and matrix factorisation),
which are based only on the user-item interaction matrix, and content-based
approaches (such as regression or classification models), which use prior
information about users and/or items.

 Memory-based collaborative methods do not assume any latent model and
therefore have low bias but high variance; model-based collaborative approaches
assume a latent interaction model that needs to learn both user and item
representations from scratch and so have a higher bias but a lower variance;
content-based methods assume a latent model built around explicitly given user
and/or item features and thus have the highest bias and lowest variance.

 Recommender systems are more and more important in many big industries,
and some scale considerations have to be taken into account when designing the
system (better use of sparsity, iterative methods for factorisation or optimisation,
approximate techniques for nearest-neighbour search, ...).

 Recommender systems are difficult to evaluate: while some classical metrics
such as MSE, accuracy, recall or precision can be used, one should keep in mind
that some desired properties such as diversity (serendipity) and explainability
cannot be assessed this way; real-condition evaluation (like A/B testing or sample
testing) is finally the only real way to evaluate a new recommender system, but it
requires a certain confidence in the model.

Unit – V: Chapter-3: Social Network Analysis
 Representing Social Networks
 Basic Properties of Nodes
 Basic and Structural Properties of Networks

5.3.1 Social Network Analysis:


Social Network Analysis (SNA) is the process of exploring or examining a social
structure using graph theory. It is used to measure and analyze the structural
properties of the network, and it helps to measure relationships and flows between
people, groups, organizations, and other connected entities.
Social networks are networks that depict the relations between people in the form
of a graph for different kinds of analysis. The graph used to store the relationships
of people is known as a sociogram, and the matrix data structure storing all the
graph points and lines is called a sociomatrix. The relationships can be of any kind,
such as kinship, friendship, enmity, acquaintance, colleagues, neighbors, disease
transmission, etc.
Basically, there are two types of social networks:
 Ego network Analysis

 Complete network Analysis

1. Ego Network Analysis:

Ego network analysis finds the relationships around individual people. The analysis
is done for a particular sample of people chosen from the whole population, and
this sampling is done randomly. The attributes involved in ego network analysis
are a person's network size, diversity, etc. The analysis is usually done with
traditional surveys, in which people are asked with whom they interact and what
their relationship with those contacts is. It is not focused on finding the relationship
between everyone in the sample; rather, it is an effort to estimate the density of
the network around the sampled individuals. Hypotheses about these networks are
tested using statistical hypothesis testing techniques.
The following functions are served by Ego Networks:
 Propagation of information efficiently.

 Sensemaking from links, For example, Social links, relationships.

 Access to resources, efficient connection path generation.

 Community detection, identification of the formation of groups.

 Analysis of the ties among individuals for social support.

2. Complete Network Analysis:

Complete network analysis is the analysis used in most network analyses. It
analyses the relationships among a sample of people chosen from a large
population. Subgroup analysis, centrality measures, and equivalence analysis are
based on complete network analysis. This analysis helps an organization or
company make decisions based on the relationships it reveals. Testing the sample
will reflect the relationships in the whole network, since the sample is taken from a
single set of domains.
Difference between Ego network analysis and Complete network analysis:

The difference between ego and complete network analysis is that ego network
analysis focuses on collecting the relationships of people in the sample with the
outside world, whereas complete network analysis focuses on finding the
relationships among the sampled members themselves.
The majority of network analyses are done only for a particular domain or a single
organization and are not focused on relationships between organizations, so many
social network analysis measures use only complete network analysis.
Representing Social Networks:

Nodes usually represent entities in the network and can hold self-properties (such
as weight, size, position and any other attribute) as well as network-based
properties (such as degree, the number of neighbors, or cluster, the connected
component the node belongs to).
Edges represent the connections between the nodes and might hold properties as
well (such as a weight representing the strength of the connection, a direction in
the case of an asymmetric relation, or a time if applicable).
These two basic elements can describe multiple phenomena, such as social
connections, virtual routing networks, physical electricity networks, road networks,
biological relation networks and many other relationships.
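A small sketch of this representation using the networkx library (assumed to be installed); the node and edge attribute names are arbitrary examples.

# Represent a tiny social network: nodes and edges both carry attributes.
import networkx as nx

G = nx.Graph()
G.add_node("A", weight=3)                    # self-property of a node
G.add_node("B", weight=1)
G.add_edge("A", "B", weight=0.8)             # edge property: strength of the tie
G.add_edge("B", "C", weight=0.2)

print(G.degree("B"))                         # network-based property: degree (2)
print(list(nx.connected_components(G)))      # cluster: the connected component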

Real-world networks
Real-world networks, and in particular social networks, have a unique structure
which often distinguishes them from random mathematical networks:

 Small World phenomenon claims that real networks often have very short
paths (in terms of number of hops) between any connected network members.
This applies to real and virtual social networks (the six-handshakes theory) and to
physical networks such as airports or the routing of electricity and web traffic.

 Scale Free networks with a power-law degree distribution have a skewed
population with a few highly connected nodes (such as social influencers) and a lot
of loosely connected nodes.

 Homophily is the tendency of individuals to associate and bond with similar
others, which results in similar properties among neighbors.

Centrality Measures
Highly central nodes play a key role in a network, serving as hubs for different
network dynamics. However, the definition and importance of centrality might
differ from case to case and may refer to different centrality measures (a networkx
sketch follows this list):
 Degree — the number of neighbors of the node

 EigenVector / PageRank — iterative circles of neighbors

 Closeness — the level of closeness to all of the nodes

 Betweenness — the number of shortest paths going through the node
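As referenced above, the four measures can be computed with networkx (assumed to be installed) on a small example graph; the graph itself is arbitrary.

# Compute the listed centrality measures on a toy graph.
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E")])

print(nx.degree_centrality(G))        # degree: share of possible neighbours
print(nx.pagerank(G))                 # PageRank: iterative circles of neighbours
print(nx.closeness_centrality(G))     # closeness to all other nodes
print(nx.betweenness_centrality(G))   # share of shortest paths through a node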

5.3.2 Basic Properties of Nodes:
Nodes usually represent entities in the network and can hold self-properties (such
as weight, size, position and any other attribute) as well as network-based
properties (such as degree, the number of neighbors, or cluster, the connected
component the node belongs to).
Properties of Nodes:
1.Supported Data types
The following list gives the supported property types as well as their corresponding
fallback values.
 Long - Long.MIN_VALUE

 Double - NaN

 Long Array - null

 Float Array - null

 Double Array – null

2. Defining the type of a node property


When creating a graph projection that specifies a set of node properties, the type
of each property is automatically determined from the first property value read by
the loader for that property. All integral numerical types are interpreted as Long
values, and all floating-point values are interpreted as Double values. Array values
are explicitly defined by the type of the values that the array contains, i.e. a
conversion of, for example, an Integer Array into a Long Array is not supported.
Arrays with mixed content types are not supported.

3.Automatic Type Conversion

Most algorithms that are capable of using node properties require a specific
property type. In cases of a mismatch between the type of the provided property
and the required type, the library will try to convert the property value into the
required type. This automatic conversion only happens when the following
conditions are satisfied:
 Neither the given nor the expected type is an Array type.

 The conversion is loss-less

o Long to Double: The Long value does not exceed the supported range of the
Double type.

o Double to Long: The Double value does not have any decimal places.

The algorithm computation will fail if any of these conditions are not satisfied
for any node property value.
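The conversion rules can be illustrated with a small, library-independent Python check; this is not the API of any particular graph library, and the 2^53 bound is an assumption standing in for the "supported range" of exactly representable integers in a double.

# Illustrative loss-less conversion checks, mirroring the rules described above.
MAX_EXACT_INT_IN_DOUBLE = 2 ** 53     # assumed bound for exact integer doubles

def long_to_double(value: int) -> float:
    if abs(value) > MAX_EXACT_INT_IN_DOUBLE:
        raise ValueError("conversion would lose precision")
    return float(value)

def double_to_long(value: float) -> int:
    if not value.is_integer():
        raise ValueError("value has decimal places; conversion is not loss-less")
    return int(value)

print(long_to_double(42))      # 42.0
print(double_to_long(7.0))     # 7
# double_to_long(7.5) would raise ValueError, as the rules above require.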
5.3.3 Basic and Structural Properties of Networks:
1. Connectivity (Beta-Index)

2. Diameter of a graph

3. Accessability of nodes and places

4. Centrality / location in the network

5. Hierarchies in trees

Connectivity(Beta- Index):
The simplest measure of the degree of connectivity of a graph is given by
the Beta index (β). It measures the density of connections and is defined as:

β = E / V

where E is the total number of edges and V is the total number of vertices in the
network.

In an example series of graphs A, B, C and D, the number of vertices remains
constant while the number of connecting edges is progressively increased from
four to ten (until the graph is complete). As the number of edges increases, the
connectivity between the vertices rises and the Beta index changes progressively
from 0.8 to 2. Values for the index start at zero and are open-ended, with values
below one indicating trees and disconnected graphs (A) and a value of one
indicating a network which has only one circuit (B). Thus, the larger the index, the
higher the density.
With the help of this index, regional disparities can be described. For example, the
railway networks of selected countries have been compared to general economic
development (using the energy-consumption index of the 1960s), with energy
consumption plotted on the y-axis and the Beta index on the x-axis: where
connectivity is high, economic development is high as well.
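A minimal computation of the Beta index with networkx (assumed to be installed); the example graph is arbitrary.

# Beta index: edges divided by vertices.
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])
beta = G.number_of_edges() / G.number_of_nodes()
print(beta)    # 5 edges / 4 vertices = 1.25 -> more than one circuit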

Diameter of a graph:
Another measure for the structure of a graph is its diameter. The diameter δ is an
index measuring the topological length or extent of a graph by counting the
number of edges in the shortest path between the most distant vertices. It is:

δ = max_{i, j} s(i, j)

where s(i, j) is the number of edges in the shortest path from vertex i to vertex j.
With this formula, first all the shortest paths between all the vertices are found;
then the longest of them is chosen. This measure therefore describes the longest
shortest path between any two vertices of a graph.

In addition to the purely topological application, actual track lengths or any other
weights (e.g. travel time) can be assigned to the edges. This suggests a more
complex measurement based on the metric of the network. The resulting index is
π = m_T / m_δ, where m_T is the total mileage of the network and m_δ is the total
mileage of the network's diameter. The higher π is, the denser the network.
Accessibility of vertices and places:
A frequent type of analysis in transport networks is the investigation of the
accessibility of certain traffic nodes and the developed areas around them. The
accessibility of a vertex i is calculated by:

E_i = Σ_j n(i, j)

where the sum runs over the v vertices of the network and n(i, j) is the shortest
node distance (i.e. the number of nodes along a path) between vertex i and
vertex j. For each node i, the sum of all the shortest node distances n(i, j) is
calculated, which can efficiently be done with a matrix. The node distance between
two nodes i and j is the number of intermediate nodes. For every node the sum is
formed: the higher the sum, the lower the accessibility, and the lower the sum, the
better the accessibility.
The importance of the node distance lies in the fact that nodes may also be
transfer stations, transfer points for goods, or subway stations. Therefore, a large
node distance hinders travel through the network.
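A sketch of this accessibility measure with networkx (assumed to be installed), using the convention from the text that the node distance n(i, j) counts intermediate nodes, i.e. shortest-path edges minus one; the example graph is arbitrary.

# Accessibility E_i as the sum of shortest node distances n(i, j).
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")])

def accessibility(G, i):
    lengths = nx.shortest_path_length(G, source=i)       # edges on shortest paths
    return sum(max(d - 1, 0) for j, d in lengths.items() if j != i)

for node in G.nodes:
    print(node, accessibility(G, node))   # the lower the sum, the better the accessibility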

(Figure: calculation of the accessibility E_i)


As with the diameter of a network, a weighted edge distance can also be used
instead of the purely topological node distance. Examples of possible weighting
factors are distance in miles, travel time, or transportation cost. For this weighted
measure, however, the edge distance is used and not the node distance:

E_i = Σ_j s(i, j)

where e is the number of edges and s(i, j) is the shortest weighted path between
two nodes.
Centrality / Location in the network
The first measure of centrality was developed by König in 1936 and is called the
König number K_i. Let s(i, j) denote the number of edges in the shortest path from
vertex i to vertex j. Then the König number for vertex i is defined as:

K_i = max_j s(i, j)

where s(i, j) is the shortest edge distance between vertex i and vertex j. Therefore,
K_i is the longest shortest path originating from vertex i. It is a measure of
topological distance in terms of edges and suggests that vertices with a low König
number occupy a central place in the network.

Once the shortest edge distances between all nodes have been determined (for
example in a distance matrix), the largest value in each column is the König
number of the corresponding node; the node with the smallest König number is
centrally located, and the nodes with the largest values are peripheral.
The method for determining the König number can also be applied to a distance
matrix: the accessibility example can be reused with the same values to calculate
the König numbers.
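Since the König number of a vertex is its longest shortest path, it coincides with the graph-theoretic eccentricity, which networkx (assumed to be installed) computes directly; the path graph below is an arbitrary example.

# König number per vertex = eccentricity (longest shortest edge distance).
import networkx as nx

G = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")])

koenig = nx.eccentricity(G)
print(koenig)                             # {'A': 4, 'B': 3, 'C': 2, 'D': 3, 'E': 4}
print(min(koenig, key=koenig.get))        # 'C' has the lowest value: the central vertex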

Hierarchies in trees
In quantitative geomorphology, more specifically in the field of fluvial morphology,
different methods for structuring and ordering hierarchical stream networks have
been developed. With them, different networks can be compared with each other
(e.g. by the highest occurring order or the relative frequencies of the individual
levels), and sub-catchments can be segregated easily. Of the four ordering
schemes discussed below, only three are topologically defined; the Horton scheme
is the only one that also takes the metric component into account.

Calculating the Strahler number, we start with the outermost branches of the tree:
an order value of 1 is assigned to those segments of the stream. When two
streams with the same order come together, they form a stream whose order is
their order value plus one; otherwise, the higher order of the two streams is used.
The Strahler number is formally defined as:

order(e3) = order(e1) + 1                 if order(e1) = order(e2)
order(e3) = max(order(e1), order(e2))     otherwise

where e1 and e2 are the joining stream segments and e3 is the evolving stream.
(Figure: Strahler stream order)

Associated with the Strahler numbers of a tree are bifurcation ratios, numbers
describing how close to balanced a tree is:

R_b = N_s1 / N_s2

where N_s1 is the number of edges of a specific order (e.g. order 1) and N_s2 is
the number of edges of the next higher order. In the example, N_s1 = 15 and
N_s2 = 7, which results in a bifurcation ratio of R_b = 15/7 ≈ 2.14.

Horton stream order: First, an order according to Strahler is calculated. Then, the
highest current order larger than 2 is assigned to the longest (metric) branch in the
remaining sub-trees.
(Figure: Horton stream order)

Shreve stream order: The Shreve stream order (also called magnitude) of a
sub-tree indicates how many segments of first order (or "sources") are upstream.
One possible application outside hydrology or geomorphology is the choice of line
widths in the cartographic representation of river networks.
(Figure: Shreve stream order)

A simple order by path length is achieved by identifying the length of the paths
starting at the tree's root.
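A recursive sketch of the Strahler ordering described above, for a stream tree given as child lists; the tree structure and node names are hypothetical.

# Compute the Strahler order of a stream tree (leaves are first-order segments).
def strahler(tree, node):
    children = tree.get(node, [])
    if not children:
        return 1                                    # outermost branch -> order 1
    orders = [strahler(tree, c) for c in children]
    top = max(orders)
    # two joining streams of the same highest order raise the order by one
    return top + 1 if orders.count(top) >= 2 else top

tree = {"outlet": ["x", "y"], "x": ["a", "b"]}      # a, b and y are source segments
print(strahler(tree, "outlet"))                     # x has order 2, y order 1 -> outlet order 2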
