DATA ANALYTICS
(Professional Elective - I)
Subject Code: CS513PE

NOTES MATERIAL
UNIT 1

For
B. TECH (CSE)
3rd YEAR – 1st SEM (R18)

Faculty:
B. RAVIKRISHNA

DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI

DATA ANALYTICS UNIT–I

Prerequisites:
1. A course on “Database Management Systems”.
2. Knowledge of probability and statistics.

Course Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles and methods of statistical analysis.
3. To discover interesting patterns, analyze supervised and unsupervised models, and estimate the accuracy of the algorithms.
4. To understand the various search methods and visualization techniques.

Course Outcomes: After completion of this course students will be able to

1. Understand the impact of data analytics on business decisions and strategy
2. Carry out data analysis/statistical analysis
3. Carry out standard data visualization and formal inference procedures
4. Design data architectures
5. Understand various data sources

INTRODUCTION:

In the early days of computers and the Internet, far less data was generated than today; it could easily be stored and managed by users and business enterprises on a single computer, because the total volume of data never exceeded about 19 exabytes. In the current era, roughly 2.5 quintillion bytes of data are generated every day.

Most of this data is generated from social media sites like Facebook, Instagram, Twitter, etc.; other sources include e-business and e-commerce transactions and hospital, school, and bank data. This data is impossible to manage with traditional data storage techniques. Whether the data is generated by a large-scale enterprise or by an individual, every aspect of it needs to be analysed to benefit from it. But how do we do that? That is where the term 'Data Analytics' comes in.

Why is Data Analytics important?


Data Analytics plays a key role in improving your business: it is used to gather hidden insights and interesting patterns in data, generate reports, perform market analysis, and improve business requirements.

What is the role of Data Analytics?

 Gather Hidden Insights – Hidden insights from data are gathered and then
analyzed with respect to business requirements.


 Generate Reports – Reports are generated from the data and are passed on to the
respective teams and individuals to deal with further actions for a high rise in
business.
 Perform Market Analysis – Market Analysis can be performed to understand the
strengths and weaknesses of competitors.
 Improve Business Requirements – Analysis of data helps align the business with customer requirements and improve the customer experience.

What are the tools used in Data Analytics?


With the increasing demand for Data Analytics in the market, many tools with various functionalities have emerged for this purpose. Whether open-source or commercial, the top tools in the data analytics market are as follows.
 R programming
 Python
 Tableau Public
 QlikView
 SAS
 Microsoft Excel
 RapidMiner
 KNIME
 OpenRefine
 Apache Spark

Data Architecture and Design:

Data architecture in Information Technology is composed of models, policies, rules or standards


that govern which data is collected, and how it is stored, arranged, integrated, and put to use in data
systems and in organizations.

 A data architecture should set data standards for all its data systems as a vision or a
model of the eventual interactions between those data systems.
 Data architectures address data in storage and data in motion; descriptions of data
stores, data groups and data items; and mappings of those data artifacts to data
qualities, applications, locations etc.
 Essential to realizing the target state, Data Architecture describes how data is
processed, stored, and utilized in a given system. It provides criteria for data
processing operations that make it possible to design data flows and also control the
flow of data in the system.
 The Data Architect is typically responsible for defining the target state, aligning
during development and then following up to ensure enhancements are done in the
spirit of the original blueprint.


During the definition of the target state, the Data Architecture breaks a subject down to
the atomic level and then builds it back up to the desired form.

The Data Architect breaks the subject down by going through 3 traditional architectural
processes:

Conceptual model: It is a business model which uses the Entity Relationship (ER) model to represent the relations between entities and their attributes.
Logical model: It is a model in which the problem is represented in logical form, such as rows and columns of data, classes, XML tags, and other DBMS constructs.
Physical model: The physical model holds the database design, such as which type of database technology is suitable for the architecture.

Layer | View                      | Data (What)                                                           | Stakeholder
1     | Scope/Contextual          | List of things and architectural standards important to the business | Planner
2     | Business Model/Conceptual | Semantic model or Conceptual/Enterprise Data Model                    | Owner
3     | System Model/Logical      | Enterprise/Logical Data Model                                         | Designer
4     | Technology Model/Physical | Physical Data Model                                                   | Builder
5     | Detailed Representations  | Actual databases                                                      | Subcontractor

The data architecture is formed by dividing it into these three essential models (conceptual, logical, and physical), which are then combined.

Factors that influence Data Architecture:



Various constraints and influences affect data architecture design. These include enterprise requirements, technology drivers, economics, business policies, and data processing needs.
Enterprise requirements:
 These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed),
transaction reliability, and transparent data management.
 In addition, the conversion of raw data such as transaction records and image files
into more useful information forms through such features as data warehouses is
also a common organizational requirement, since this enables managerial decision
making and other organizational processes.
 One of the architecture techniques is the split between managing transaction data
and (master) reference data. Another one is splitting data capture systems from
data retrieval systems (as done in a data warehouse).
Technology drivers:
 These are usually suggested by the completed data architecture and database
architecture designs.
 In addition, some technology drivers will derive from existing organizational
integration frameworks and standards, organizational economics, and existing site
resources (e.g. previously purchased software licensing).
Economics:
 These are also important factors that must be considered during the data
architecture phase. It is possible that some solutions, while optimal in principle, may
not be potential candidates due to their cost.
 External factors such as the business cycle, interest rates, market conditions, and
legal considerations could all have an effect on decisions relevant to data
architecture.
Business policies:
 Business policies that also drive data architecture design include internal
organizational policies, rules of regulatory bodies, professional standards, and
applicable governmental laws that can vary by applicable agency.
 These policies and rules will help describe the manner in which enterprise wishes to
process their data.
Data processing needs
 These include accurate and reproducible transactions performed in high volumes,
data warehousing for the support of management information systems (and
potential data mining), repetitive periodic reporting, ad hoc reporting, and support
of various organizational initiatives as required (e.g. annual budgets, new product development).
 The General Approach is based on designing the Architecture at three Levels of
Specification.
 The Logical Level

 The Physical Level


 The Implementation Level

Understand various sources of the Data:


 Data can be generated from two types of sources, namely primary and secondary sources.
 Data collection is the process of acquiring, collecting, extracting, and storing the
voluminous amount of data which may be in the structured or unstructured form
like text, video, audio, XML files, records, or other image files used in later stages of
data analysis.
 In the process of big data analysis, “Data collection” is the initial step before
starting to analyse the patterns or useful information in data. The data which is to
be analysed must be collected from different valid sources.
 The data which is collected is known as raw data, which is not useful as-is; cleaning the impurities and using the data for further analysis turns it into information, and the insight obtained from that information is known as "knowledge". Knowledge has many applications, such as business knowledge about sales of enterprise products, disease treatment, etc.
 The main goal of data collection is to collect information-rich data.
 Data collection starts with asking some questions such as what type of data is to
be collected
and what is the source of collection.
 Most of the data collected is of two types: qualitative data, which is a group of non-numerical data such as words and sentences, mostly focused on the behaviour and actions of a group; and quantitative data, which is in numerical form and can be calculated using different scientific tools and sampling methods.
The actual data is then further divided mainly into two types known as:
1. Primary data
2. Secondary data

1. Primary data:
 The data which is Raw, original, and extracted directly from the official sources is
known as primary data. This type of data is collected directly by performing
techniques such as

questionnaires, interviews, and surveys. The data collected must be according to the demand and requirements of the target audience on which the analysis is performed; otherwise it would be a burden in data processing.
Few methods of collecting primary data:
1. Interview method:
 The data collected during this process is through interviewing the target audience
by a person called interviewer and the person who answers the interview is known
as the interviewee.
 Some basic business or product related questions are asked and noted down in
the form of notes, audio, or video and this data is stored for processing.
 These can be both structured and unstructured like personal interviews or formal
interviews through telephone, face to face, email, etc.

2. Survey method:
 The survey method is the process of research where a list of relevant questions are
asked and answers are noted down in the form of text, audio, or video.
 The survey method can be obtained in both online and offline mode like through
website forms and email. Then that survey answers are stored for analysing data.
Examples are online surveys or surveys through social media polls.
3. Observation method:
 The observation method is a method of data collection in which the researcher
keenly observes the behaviour and practices of the target audience using some
data collecting tool and stores the observed data in the form of text, audio, video, or
any raw formats.
 In this method, the data is collected directly by posing a few questions to the participants. For example, observing a group of customers and their behaviour towards the products. The data obtained is then sent for processing.
4. Experimental method:
 The experimental method is the process of collecting data through performing
experiments, research, and investigation.
 The most frequently used experiment methods are CRD, RBD, LSD, FD.
CRD- Completely Randomized design is a simple experimental design used in data
analytics which is based on randomization and replication. It is mostly used for comparing
the experiments.
RBD- Randomized Block Design is an experimental design in which the experiment is
divided into small units called blocks.
 Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agriculture sector.


 Randomized Block Design - The Term Randomized Block Design has originated from
agricultural research. In this design several treatments of variables are applied to
different blocks of land to ascertain their effect on the yield of the crop.
 Blocks are formed in such a manner that each block contains as many plots as a
number of treatments so that one plot from each is selected at random for each
treatment. The production of each plot is measured after the treatment is given.
 These data are then interpreted and inferences are drawn by using the analysis of variance technique, so as to know the effect of various treatments like different doses of fertilizers, different types of irrigation, etc. (a minimal analysis sketch is given below).
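Below is a minimal Python sketch of how such an RBD analysis could be run with statsmodels; the yield figures, block labels, and treatment labels are hypothetical and only illustrate the two-way (treatment + block) ANOVA described above.

    # Hedged sketch: additive two-way ANOVA for a Randomized Block Design.
    # The dataset below is hypothetical (3 treatments applied across 3 blocks).
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.formula.api import ols

    data = pd.DataFrame({
        "block":     ["B1", "B1", "B1", "B2", "B2", "B2", "B3", "B3", "B3"],
        "treatment": ["T1", "T2", "T3", "T1", "T2", "T3", "T1", "T2", "T3"],
        "yield_kg":  [20.1, 23.4, 21.8, 19.5, 24.0, 22.2, 20.7, 23.1, 21.5],
    })

    # Model the yield as treatment effects plus block effects.
    model = ols("yield_kg ~ C(treatment) + C(block)", data=data).fit()
    print(sm.stats.anova_lm(model, typ=2))  # F-tests for treatment and block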
LSD – Latin Square Design is an experimental design that is similar to CRD and RBD
blocks but contains rows and columns.
 It is an arrangement of NxN squares with an equal number of rows and columns which
contain letters that occurs only once in a row. Hence the differences can be easily
found with fewer errors in the experiment. Sudoku puzzle is an example of a Latin
square design.
 A Latin square is one of the experimental designs which has a balanced two-way
classification scheme say for example - 4 X 4 arrangement. In this scheme each letter
from A to D occurs only once in each row and also only once in each column.
 The Latin square is probably under used in most fields of research because text book
examples tend to be restricted to agriculture, the area which spawned most original
work on ANOVA. Agricultural examples often reflect geographical designs where rows
and columns are literally two dimensions of a grid in a field.
 Rows and columns can be any two sources of variation in an experiment. In this sense
a Latin square is a generalisation of a randomized block design with two different
blocking systems
A B C D
B C D A
C D A B
D A B C
 The balance arrangement achieved in a Latin Square is its main strength. In this
design, the comparisons among treatments, will be free from both differences
between rows and columns. Thus, the magnitude of error will be smaller than any
other design.
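The cyclic structure of a Latin square can be generated programmatically; the short Python sketch below reproduces the 4 x 4 A–D arrangement shown above (the function name is only illustrative).

    # Build an N x N Latin square by cyclically shifting the letter sequence,
    # so each letter appears exactly once in every row and every column.
    def latin_square(n):
        letters = [chr(ord("A") + i) for i in range(n)]
        return [letters[i:] + letters[:i] for i in range(n)]

    for row in latin_square(4):
        print(" ".join(row))
    # A B C D
    # B C D A
    # C D A B
    # D A B C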
FD- Factorial Design is an experimental design in which each experiment has two or more factors, each with possible values, and by performing trials the combinations of factor levels are compared. This design allows the experimenter to test two or more variables simultaneously. It also measures the interaction effects of the variables and analyses the impact of each variable. In a true experiment, randomization is essential so that the experimenter can infer cause and effect without any bias.

2. Secondary data:

Secondary data is the data which has already been collected and reused again for some
valid purpose. This type of data is previously recorded from primary data and it has two
types of sources named internal source and external source.
Internal source:
These types of data can easily be found within the organization such as market record, a
sales record, transactions, customer data, accounting resources, etc. The cost and time
consumption is less in obtaining internal sources.
 Accounting resources- This gives so much information which can be used by the
marketing researcher. They give information about internal factors.
 Sales Force Report- It gives information about the sales of a product. The
information provided is from outside the organization.
 Internal Experts- These are people who are heading the various departments.
They can give an idea of how a particular thing is working.
 Miscellaneous Reports- These are what information you are getting from
operational reports. If the data available within the organization are unsuitable or
inadequate, the marketer should extend the search to external secondary data
sources.
External source:
The data which can’t be found at internal organizations and can be gained through
external third-party resources is external source data. The cost and time consumption are
more because this contains a huge amount of data. Examples of external sources are
Government publications, news publications, Registrar General of India, planning
commission, international labour bureau, syndicate services, and other non-governmental
publications.
1. Government Publications-
 Government sources provide an extremely rich pool of data for the researchers. In
addition, many of these data are available free of cost on internet websites. There
are number of government agencies generating data.
These are like: Registrar General of India- It is an office which generates
demographic data. It includes details of gender, age, occupation etc.
2. Central Statistical Organization-
 This organization publishes the national accounts statistics. It contains estimates
of national income for several years, growth rate, and rate of major economic
activities. Annual survey of Industries is also published by the CSO.
 It gives information about the total number of workers employed, production
units, material used and value added by the manufacturer.
3. Director General of Commercial Intelligence-
 This office operates from Kolkata. It gives information about foreign trade i.e.
import and export. These figures are provided region-wise and country-wise.
4. Ministry of Commerce and Industries-

 This ministry through the office of economic advisor provides information on


wholesale price index. These indices may be related to a number of sectors like
food, fuel, power, food grains etc.
 It also generates All India Consumer Price Index numbers for industrial workers,
urban, non- manual employees and cultural labourers.
5. Planning Commission-
 It provides the basic statistics of Indian Economy.
6. Reserve Bank of India-
 This provides information on Banking Savings and investment. RBI also prepares
currency and finance reports.
7. Labour Bureau-
 It provides information on skilled, unskilled, white collared jobs etc.
8. National Sample Survey-
 This is done by the Ministry of Planning and it provides social, economic,
demographic, industrial and agricultural statistics.
9. Department of Economic Affairs-
 It conducts economic survey and it also generates information on income,
consumption, expenditure, investment, savings and foreign trade.
10. State Statistical Abstract-
 This gives information on various types of activities related to the state like -
commercial activities, education, occupation etc.
11. Non-Government Publications-
 These include publications of various industrial and trade associations, such as the Indian Cotton Mill Association and various chambers of commerce.
12. The Bombay Stock Exchange-
 It publishes a directory containing financial accounts, key profitability ratios, and other relevant matter.
 Various associations of press and media
 Export Promotion Council
 Confederation of Indian Industries (CII)
 Small Industries Development Board of India
 Different mills like woollen mills, textile mills, etc.
 The only disadvantage of the above sources is that the data may be biased; they are likely to colour their negative points.
13. Syndicate Services-
 These services are provided by certain organizations which collect and tabulate
the marketing information on a regular basis for a number of clients who are the
subscribers to these services.
 These services are useful in television viewing, movement of consumer goods etc.


 These syndicate services provide data collected from both households and institutions.

In collecting data from household, they use three approaches:


Survey- They conduct surveys regarding - lifestyle, sociographic, general topics.
Mail Diary Panel- It may be related to 2 fields - Purchase and Media.
Electronic Scanner Services- These are used to generate data on volume.
For institutions, they collect data from:
 Wholesalers
 Retailers, and
 Industrial Firms
 Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).
Importance of Syndicate Services:
 Syndicate services are becoming popular since the constraints of decision making
are changing and we need more of specific decision-making in the light of changing
environment. Also, Syndicate services are able to provide information to the
industries at a low unit cost.
Disadvantages of Syndicate Services:
 The information provided is not exclusive. A number of research agencies provide
customized services which suits the requirement of each individual organization.
International Organization-
These includes
 The International Labour Organization (ILO):
 It publishes data on the total and active population, employment,
unemployment, wages and consumer prices.
 The Organization for Economic Co-operation and development (OECD):
 It publishes data on foreign trade, industry, food, transport, and science and
technology.
 The International Monetary Fund (IMF):
 It publishes reports on national and international foreign exchange
regulations.
Other sources:
Sensor’s data: With the advancement of IoT devices, the sensors of these devices collect
data which can be used for sensor data analytics to track the performance and usage of
products.
Satellite data: Satellites collect a large volume of images and data, amounting to terabytes, on a daily basis through surveillance cameras; these can be used to extract useful information.
Web traffic: Thanks to fast and cheap internet facilities, data in many formats uploaded by users on different platforms can be collected (with their permission) for data analysis. Search engines also provide their data through the keywords
and queries that are searched most often.
Exporting data to the cloud (e.g., Amazon Web Services S3): We usually export our data to the cloud for purposes like safety, multiple access, and real-time
simultaneous analysis.

Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively. The goal of data management is to help people, organizations, and connected things optimize the use of data within the bounds of policy and regulation, so that they can make decisions and take actions that maximize the benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies,
procedures, and practices. The work of data management has a wide scope, covering
factors such as how to:
 Create, access, and update data across a diverse data tier
 Store data across multiple clouds and on premises
 Provide high availability and disaster recovery
 Use data in a growing variety of apps, analytics, and algorithms
 Ensure data privacy and security
 Archive and destroy data in accordance with retention schedules and compliance
requirements
What is Cloud Computing?
Cloud computing is a term referred to storing and accessing data over the internet. It
doesn’t store any data on the hard disk of your personal computer. In cloud computing,
you can access data from a remote server.
Service models of cloud computing are the reference models on which cloud computing is based. These can be categorized into three basic service models, as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual
machines, virtual storage, etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development & deployment tools,
etc.
3. SOFTWARE as a SERVICE (SaaS)
SaaS model allows to use software applications as a service to end users.

For providing the above services models AWS is one of the popular platforms. In this
Amazon Cloud (Web) Services is one of the popular service platforms for Data Management
Amazon Cloud (Web) Services Tutorial
What is AWS?
The full form of AWS is Amazon Web Services. It is a platform that offers flexible, reliable, scalable, easy-to-use, and cost-effective cloud computing solutions.

AWS is a comprehensive, easy-to-use computing platform offered by Amazon. The platform is developed with a combination of infrastructure as a service (IaaS), platform as a service (PaaS), and packaged software as a service (SaaS) offerings.

History of AWS
2002 - AWS services launched
2006 - Launched its cloud products
2012 - Holds first customer event
2015 - Reveals revenues of $4.6 billion
2016 - Surpasses $10 billion revenue target
2016 - Releases Snowball and Snowmobile
2019 - Offers nearly 100 cloud services
2021 - AWS comprises over 200 products and services

Important AWS Services


Amazon Web Services offers a wide range of global cloud-based products for different business purposes. The products include storage, databases, analytics, networking, mobile, development tools, and enterprise applications, with a pay-as-you-go pricing model.

Amazon Web Services - Amazon S3:

 Amazon S3 (Simple Storage Service) is a scalable, high-speed, low-cost web-


based service designed for online backup and archiving of data and application
programs.
 It allows users to upload, store, and download any type of file up to 5 TB in size. This service allows subscribers to access the same systems that Amazon uses to run its own web sites.
 The subscriber has control over the accessibility of data, i.e. privately/publicly
accessible.
1. How to Configure S3?
Following are the steps to configure a S3 account.

Step 1 − Open the Amazon S3 console using this link −
https://console.aws.amazon.com/s3/home
Step 2 − Create a Bucket using the following steps.


 A prompt window will open. Click the Create Bucket button at the bottom of the
page.

 Create a Bucket dialog box will open. Fill the required details and click the Create
button.

 The bucket is created successfully in Amazon S3. The console displays the list of
buckets and its properties.

 Select the Static Website Hosting option. Click the radio button Enable website hosting
and fill the required details.

Step 3 − Add an Object to a bucket using the following steps.


 Open the Amazon S3 console using the
following link.
https://console.aws.amazon.com/s3/home


 Click the Upload button.

 Click the Add files option. Select those files which are to be uploaded from the
system and then click the Open button.

 Click the start upload button. The files will get uploaded into the bucket.
 Afterwards, we can create, edit, modify, update the objects and other files in wide
formats.
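The console steps above can also be scripted. Below is a hedged sketch using the AWS SDK for Python (boto3); the bucket name, file name, and key are hypothetical, and valid AWS credentials and region configuration are assumed to be in place.

    # Sketch: create a bucket, upload an object, then list the bucket's contents.
    # Bucket/file names are placeholders; credentials are assumed to be configured.
    import boto3

    s3 = boto3.client("s3", region_name="us-east-1")

    # Bucket names must be globally unique across all of Amazon S3.
    s3.create_bucket(Bucket="my-example-analytics-bucket")

    # Add (upload) an object to the bucket.
    s3.upload_file(Filename="report.csv",
                   Bucket="my-example-analytics-bucket",
                   Key="unit1/report.csv")

    # Confirm the upload by listing the objects in the bucket.
    response = s3.list_objects_v2(Bucket="my-example-analytics-bucket")
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])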

Amazon S3 Features
 Low cost and Easy to Use − Using Amazon S3, the user can store a large amount
of data at
very low charges.
 Secure − Amazon S3 supports data transfer over SSL and the data gets
encrypted automatically once it is uploaded. The user has complete control over
their data by configuring bucket policies using AWS IAM.
 Scalable − Using Amazon S3, there need not be any worry about storage
concerns. We can store as much data as we have and access it anytime.
 Higher performance − Amazon S3 is integrated with Amazon CloudFront, that
distributes content to the end users with low latency and provides high data
transfer speeds without any minimum usage commitments.
 Integrated with AWS services − Amazon S3 is integrated with AWS services including Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.

We have been discussing Amazon S3; for an overview of the other AWS services, see:
https://d1.awsstatic.com/whitepapers/aws-overview.pdf

Data Quality:

What is Data Quality?


There are many definitions of data quality. In general, data quality is the assessment of how usable the data is and how well it fits its serving context.

Why Data Quality is Important?


Enhancing data quality is a critical concern, as data is considered the core of all activities within organizations; poor data quality leads to inaccurate reporting, which results in inaccurate decisions and, ultimately, economic damage.

Many factors help measuring data quality such as:


 Data Accuracy: Data are accurate when data values stored in the database
correspond to real-world values.
 Data Uniqueness: A measure of unwanted duplication existing within or
across systems for a particular field, record, or data set.
 Data Consistency: The degree to which the data does not violate semantic rules defined over the dataset.
 Data Completeness: The degree to which values are present in a data collection.
 Data Timeliness: The extent to which the age of the data is appropriate for the task at hand.
Other factors can be taken into consideration such as Availability, Ease of
Manipulation, Believability.
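As a brief, hedged illustration of how two of these factors could be quantified, the Python sketch below computes per-column completeness and row uniqueness for a small hypothetical customer table using pandas.

    # Sketch: measuring completeness and uniqueness on a hypothetical table.
    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "email": ["a@example.com", None, None, "d@example.com"],
        "city": ["Hyderabad", "Pune", "Pune", None],
    })

    completeness = df.notna().mean()          # fraction of non-missing values per column
    uniqueness = 1 - df.duplicated().mean()   # fraction of rows that are not exact duplicates

    print(completeness)
    print("Row uniqueness:", round(uniqueness, 2))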


OUTLIERS:

 Outlier is a point or an observation that deviates significantly


from the other observations.
 Outlier is a commonly used terminology by analysts and data scientists
as it needs close attention else it can result in wildly wrong estimations. Simply
speaking, Outlier is an observation that appears far away and diverges from an
overall pattern in a sample.
 Reasons for outliers: Due to experimental errors or “special circumstances”.
 There is no rigid mathematical definition of what constitutes an outlier;
determining whether or not an observation is an outlier is ultimately a subjective
exercise.
 There are various methods of outlier detection. Some are graphical such as normal
probability plots. Others are model-based. Box plots are a hybrid.
Types of Outliers:

Outlier can be of two types:


Univariate: These outliers can be found when we look at the distribution of a single variable.
Multivariate: Multivariate outliers are outliers in an n-dimensional space. In order to find them, you have to look at distributions in multiple dimensions.

Impact of Outliers on a dataset:


Outliers can drastically change the results of the data analysis and statistical modelling.
There are numerous unfavourable impacts of outliers in the data set:
 It increases the error variance and reduces the power of statistical tests
 If the outliers are non-randomly distributed, they can decrease normality
 They can bias or influence estimates that may be of substantive interest
 They can also impact the basic assumption of Regression, ANOVA and other
statistical model assumptions.


Detect Outliers:
The most commonly used method to detect outliers is visualization. We use various visualization methods, like box plots, histograms, and scatter plots.

Outlier treatments are three types:


Retention:
 There is no rigid mathematical definition of what constitutes an outlier;
determining whether or not an observation is an outlier is ultimately a subjective
exercise. There are various methods of outlier detection. Some are graphical such
as normal probability plots. Others are model- based. Box plots are a hybrid.
Exclusion:
 According to a purpose of the study, it is necessary to decide, whether and which
outlier will be removed/excluded from the data, since they could highly bias the
final results of the analysis.

Rejection:
 Rejection of outliers is more acceptable in areas of practice where the underlying
model of the process being measured and the usual distribution of measurement
error are confidently known.
 An outlier resulting from an instrument reading error may be excluded but it is
desirable that the reading is at least verified.

Other treatment methods


The OUTLIER package in R can be used to detect and treat outliers in data.
Outliers can also be detected from graphical representations such as scatter plots and box plots.

The observations that fall outside the box-and-whisker range are treated as outliers in the data, as the sketch below illustrates.
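The following Python sketch applies the standard box-plot (1.5 x IQR) rule to a small hypothetical sample; any value outside the whisker bounds is flagged as an outlier.

    # Sketch: flag outliers using the box-plot (interquartile range) rule.
    import numpy as np

    values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])

    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = values[(values < lower) | (values > upper)]
    print("Bounds:", lower, upper)
    print("Outliers:", outliers)   # 95 lies outside the whiskers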



Missing Data treatment:


Missing Values
 Missing data in the training data set can reduce the power/fit of a model or can lead to a biased model, because we have not analyzed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.

 In R, missing values are represented by the symbol NA (not available).


 Impossible values (e.g., dividing zero by zero) are represented by the symbol NaN (not a number), and R outputs the result of dividing a non-zero number by zero as ‘Inf’ (Infinity).

PMM approach to treat missing values:


• PMM (Predictive Mean Matching) is a semi-parametric imputation approach.
• It is similar to the regression method, except that for each missing value it fills in a value drawn randomly from among the observed donor values of observations whose regression-predicted values are closest to the regression-predicted value for the missing value under the simulated regression model.
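As a rough illustration of the idea (not the full mice/MICE procedure used in R), the Python sketch below fits a regression on the complete cases, then fills each missing value with an observed donor value whose predicted mean is among the closest; the synthetic data and the choice of k donors are hypothetical.

    # Simplified PMM-style imputation sketch on synthetic data.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    x = rng.normal(size=100)
    y = 2.0 * x + rng.normal(scale=0.5, size=100)
    y[rng.choice(100, size=10, replace=False)] = np.nan   # introduce missing values

    observed = ~np.isnan(y)
    model = LinearRegression().fit(x[observed].reshape(-1, 1), y[observed])
    pred = model.predict(x.reshape(-1, 1))                 # predicted means for all rows

    k = 5                                                  # number of candidate donors
    for i in np.where(~observed)[0]:
        # Donors: observed rows whose predictions are closest to this row's prediction.
        donors = np.argsort(np.abs(pred[observed] - pred[i]))[:k]
        y[i] = rng.choice(y[observed][donors])             # impute with a random donor's value

    print("Remaining missing values:", int(np.isnan(y).sum()))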

Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique that is used to transform raw data into a useful and efficient format.

Steps Involved in Data Preprocessing:



1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning
is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some data is missing in the data. It can be handled in
various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

2. Fill the Missing values:


There are various ways to do this task. You can choose to fill the missing values manually, with the attribute mean, or with the most probable value, as sketched below.
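A brief pandas sketch of both options follows; the column names and values are hypothetical.

    # Sketch: dropping incomplete tuples vs. filling missing values.
    import pandas as pd

    df = pd.DataFrame({
        "age": [25, None, 31, 29, None],
        "city": ["Hyd", "Pune", None, "Hyd", "Hyd"],
    })

    dropped = df.dropna()                                 # ignore (drop) incomplete tuples
    df["age"] = df["age"].fillna(df["age"].mean())        # fill with the attribute mean
    df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill with the most probable value
    print(df)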

 (b). Noisy Data:


Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
1. Binning Method:
This method works on sorted data in order to smooth it. Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data (the total number of unique values for a dimension is known as its cardinality). Binning groups related values together in bins to reduce the number of distinct values, as illustrated below.
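The short sketch below shows equal-width and equal-frequency binning with pandas; the income values and the choice of three bins are hypothetical.

    # Sketch: equal-width vs. equal-frequency (quantile) binning with pandas.
    import pandas as pd

    income = pd.Series([12, 15, 18, 22, 25, 31, 44, 52, 60, 75])

    width_bins = pd.cut(income, bins=3)    # three equal-width intervals
    freq_bins = pd.qcut(income, q=3)       # three equal-frequency (quantile) intervals

    print(pd.DataFrame({"income": income,
                        "width_bin": width_bins,
                        "freq_bin": freq_bins}))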
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering:
This approach groups the similar data in a cluster. The outliers may be undetected or
it will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:

1. Normalization:
Normalization is a technique often applied as part of data preparation in Data
Analytics through machine learning. The goal of normalization is to change the
values of numeric columns in the dataset to a common scale, without distorting
differences in the ranges of

values. Not every dataset requires normalization for machine learning; it is done in order to scale the data values into a specified range (such as -1.0 to 1.0 or 0.0 to 1.0), as sketched below.
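A minimal scikit-learn sketch of min-max normalization follows; the two-column feature matrix is hypothetical.

    # Sketch: min-max normalization of each column into the 0.0-1.0 range.
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    X = np.array([[50.0, 3000.0],
                  [20.0, 1500.0],
                  [80.0, 4200.0]])

    scaler = MinMaxScaler(feature_range=(0.0, 1.0))
    X_scaled = scaler.fit_transform(X)   # each column is rescaled independently
    print(X_scaled)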
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

3. Discretization:
Discretization is the process through which we can transform continuous
variables, models or functions into a discrete form. We do this by creating a set
of contiguous intervals (or bins) that go across the range of our desired
variable/model/function. Continuous data is Measured, while Discrete data is
Counted

4. Concept Hierarchy Generation:


Here attributes are converted from lower level to higher level in hierarchy. For
Example-The
attribute “city” can be converted to “country”.

3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder when working with such volumes. To get around this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.

2. Attribute Subset Selection:


The highly relevant attributes should be used, rest all can be discarded. For
performing attribute selection, one can use level of significance and p- value of
the attribute. The attribute having p-value greater than significance level can be
discarded.

3. Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example regression models.

4. Dimensionality Reduction:
This reduces the size of data using encoding mechanisms. It can be lossy or lossless. If, after reconstruction from the compressed data, the original data can be
retrieved, such reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis), as sketched below.
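A short scikit-learn sketch of lossy dimensionality reduction with PCA follows; the random 4-feature dataset and the choice of two components are hypothetical.

    # Sketch: project a 4-feature dataset onto 2 principal components (lossy).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 4))

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)               # compressed representation
    X_restored = pca.inverse_transform(X_reduced)  # approximate (lossy) reconstruction

    print("Reduced shape:", X_reduced.shape)
    print("Explained variance ratio:", pca.explained_variance_ratio_)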

Data Processing:
Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or a team of data scientists, data processing must be done correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable
format (graphs, documents, etc.), giving it the form and context necessary to be
interpreted by computers and utilized by employees throughout an organization.

Six stages of data processing


1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources,
including data lakes and data warehouses. It is important that the data sources available
are trustworthy and well- built so the data collected (and later used as information) is of
the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation,
often referred to as “pre-processing” is the stage at which raw data is cleaned up and
organized for the following stage of data processing. During preparation, raw data is
diligently checked for any errors. The purpose of this step is to eliminate bad data
(redundant, incomplete, or incorrect data) and begin to create high-quality data for the
best business intelligence.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data
warehouse like Redshift), and translated into a language that it can understand. Data
input is the first stage in which raw data begins to take the form of usable information.

4. Processing
During this stage, the data inputted to the computer in the previous stage is actually
processed for interpretation. Processing is done using machine learning algorithms,
though the process itself may vary slightly depending on the source of data being
processed (data lakes, social networks, connected devices etc.) and its intended use
(examining advertising patterns, medical diagnosis from connected devices, determining
customer needs, etc.).


5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then
stored for future use. While some information may be put to use immediately, much of it
will serve a purpose later on. When data is properly stored, it can be quickly and easily
accessed by members of the organization when needed.

*** End of Unit-1 ***

DATA ANALYTICS (Professional Elective - I)
Subject Code: CS513PE

Data Analytics – Introduction & Tools and Environment

NOTES MATERIAL
UNIT 2

For
B. TECH (CSE)
3rd YEAR – 1st SEM (R18)
Faculty:
B. RAVIKRISHNA

DEPARTMENT OF CSE


VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE


DESHMUKHI

UNIT – II Syllabus
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment,
Application of Modeling in Business, Databases & Types of Data and variables, Data
Modeling Techniques, Missing Imputations etc. Need for Business Modeling.
Topics:
1. Introduction to Data Analytics
2. Data Analytics Tools and Environment
3. Need for Business Modeling.
4. Data Modeling Techniques
5. Application of Modeling in Business
6. Databases & Types of Data and variables
7. Missing Imputations etc.

Unit-2 Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles of analytics tools and environments
3. To explore the applications of Business Modelling
4. To understand Data Modeling Techniques
5. To understand Data Types and Variables and Missing Imputations

Unit-2 Outcomes:
After completion of this course students will be able to
1. Describe concepts of data analytics
2. Demonstrate the principles of analytics tools and environments
3. Analyze the applications of Business Modelling
4. Understand and compare Data Modeling Techniques
5. Describe Data Types and Variables and Missing Imputations


INTRODUCTION:

Data has been the buzzword for ages now. Whether it is generated by large-scale enterprises or by individuals, every aspect of data needs to be analyzed to benefit from it.

Why is Data Analytics important?


Data Analytics has a key role in improving your business as it is used to gather hidden
insights, generate reports, perform market analysis, and improve business
requirements.

What is the role of Data Analytics?


 Gather Hidden Insights – Hidden insights from data are gathered and then
analyzed with respect to business requirements.
 Generate Reports – Reports are generated from the data and are passed
on to the respective teams and individuals to deal with further actions for a high
rise in business.
 Perform Market Analysis – Market Analysis can be performed to understand
the strengths and weaknesses of competitors.
 Improve Business Requirements – Analysis of data helps align the business with customer requirements and improve the customer experience.

Ways to Use Data Analytics:


Now that you have looked at what data analytics is, let’s understand how we can use data
analytics.

Fig: Ways to use Data Analytics


1. Improved Decision Making: Data Analytics eliminates guesswork and manual
tasks. Be it choosing the right content, planning marketing campaigns, or developing
products. Organizations can use the insights they gain from data analytics to make
informed decisions. Thus, leading to better outcomes and customer satisfaction.
2. Better Customer Service: Data analytics allows you to tailor customer service
according to their needs. It also provides personalization and builds stronger
relationships with customers. Analyzed data can reveal information about customers’
interests, concerns, and more. It helps you give better recommendations for products
and services.


3. Efficient Operations: With the help of data analytics, you can streamline your
processes, save money, and boost production. With an improved understanding of what
your audience wants, you spend lesser time creating ads and content that aren’t in
line with your audience’s interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your
campaigns are performing. This helps in fine-tuning them for optimal outcomes.
Additionally, you can also find potential customers who are most likely to interact with a
campaign and convert into leads.

Steps Involved in Data Analytics:


Next step to understanding what data analytics is to learn how data is analyzed in
organizations. There are a few steps that are involved in the data analytics lifecycle.
Below are the steps that you can take to solve your problems.

Fig: Data Analytics process steps


1. Understand the problem: Understanding the business problems, defining the
organizational goals, and planning a lucrative solution is the first step in the analytics
process. E-commerce companies often encounter issues such as predicting the return
of items, giving relevant product recommendations, cancellation of orders, identifying
frauds, optimizing vehicle routing, etc.
2. Data Collection: Next, you need to collect transactional business data and
customer-related information from the past few years to address the problems your
business is facing. The data can have information about the total units that were sold for
a product, the sales, and profit that were made, and also when was the order placed.
Past data plays a crucial role in shaping the future of a business.
3. Data Cleaning: Now, all the data you collect will often be disorderly, messy, and
contain unwanted missing values. Such data is not suitable or relevant for performing
data analysis. Hence, you need to clean the data to remove unwanted, redundant, and
missing values to make it ready for analysis.


4. Data Exploration and Analysis: After you gather the right data, the next vital step
is to execute exploratory data analysis. You can use data visualization and business
intelligence tools, data mining techniques, and predictive modelling to analyze,
visualize, and predict future outcomes from this data. Applying these methods can tell
you the impact and relationship of a certain feature as compared to other variables.
Below are the results you can get from the analysis:
 You can identify when a customer purchases the next product.
 You can understand how long it took to deliver the product.
 You get a better insight into the kind of items a customer looks for, product
returns, etc.
 You will be able to predict the sales and profit for the next quarter.
 You can minimize order cancellation by dispatching only relevant products.
 You’ll be able to figure out the shortest route to deliver the product, etc.
5. Interpret the results: The final step is to interpret the results and validate if the
outcomes meet your expectations. You can find out hidden patterns and future trends.
This will help you gain insights that will support you with appropriate data-driven
decision making.

What are the tools used in Data Analytics?


With the increasing demand for Data Analytics in the market, many tools with various functionalities have emerged for this purpose. Whether open-source or commercial, the top tools in the data analytics market are as follows.

 R programming – This tool is the leading analytics tool used for statistics and
data modeling. R compiles and runs on various platforms such as UNIX,
Windows, and Mac OS. It also provides tools to automatically install all packages
as per user-requirement.
 Python – Python is an open-source, object-oriented programming language that
is easy to read, write, and maintain. It provides various machine learning and
visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras,
etc. It can also be assembled on any platform, like SQL Server, a MongoDB database, or JSON.


 Tableau Public – This is a free software that connects to any data source
such as Excel, corporate Data Warehouse, etc. It then creates visualizations,
maps, dashboards etc with real-time updates on the web.
 QlikView – This tool offers in-memory data processing with the results delivered
to the end-users quickly. It also offers data association and data visualization
with data being compressed to almost 10% of its original size.
 SAS – A programming language and environment for data manipulation and
analytics, this tool is easily accessible and can analyze data from different
sources.
 Microsoft Excel – This tool is one of the most widely used tools for data
analytics. Mostly used for clients’ internal data, this tool analyzes the tasks that
summarize the data with a preview of pivot tables.
 RapidMiner – A powerful, integrated platform that can integrate with any data
source types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc.
This tool is mostly used for predictive analytics, such as data mining, text
analytics, machine learning.
 KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through
its modular data pipeline concept.
 OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the
transformation of data and parsing data from websites.
 Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This tool is also popular for data pipelines and machine learning model development.

Data Analytics Applications:


Data analytics is used in almost every sector of business, let’s discuss a few of them:
1. Retail: Data analytics helps retailers understand their customer needs and buying
habits to predict trends, recommend new products, and boost their business. They
optimize the supply chain, and retail operations at every step of the customer
journey.
2. Healthcare: Healthcare industries analyse patient data to provide lifesaving
diagnoses and treatment options. Data analytics help in discovering new drug
development methods as well.
3. Manufacturing: Using data analytics, manufacturing sectors can discover new
cost-saving opportunities. They can solve complex supply chain issues, labour
constraints, and equipment breakdowns.

4. Banking sector: Banking and financial institutions use analytics to find out
probable loan defaulters and customer churn out rate. It also helps in detecting
fraudulent transactions immediately.


5. Logistics: Logistics companies use data analytics to develop new business models
and optimize routes. This, in turn, ensures that the delivery reaches on time in a
cost-efficient manner.

Cluster computing:
 Cluster computing is a collection of tightly or
loosely connected computers that work
together so that they act as a single entity.
 The connected computers execute operations
all together thus creating the idea of a single
system.
 The clusters are generally connected through
fast local area networks (LANs)

Why is Cluster Computing important?

 Cluster computing provides a relatively inexpensive, unconventional alternative to large server or mainframe computer solutions.
 It resolves the demand for content criticality and process services in a faster way.
 Many organizations and IT companies are implementing cluster computing to
augment their scalability, availability, processing speed and resource
management at economic prices.
 It ensures that computational power is always available. It provides a single
general strategy for the implementation and application of parallel high-
performance systems independent of certain hardware vendors and their
product decisions.

Apache Spark:


 Apache Spark is a lightning-fast cluster computing technology, designed for fast


computation. It is based on Hadoop MapReduce and it extends the MapReduce
model to efficiently use it for more types of computations, which includes
interactive queries and stream processing.
 The main feature of Spark is its in-memory cluster computing that increases the
processing speed of an application.
 Spark is designed to cover a wide range of workloads such as batch applications,
iterative algorithms, interactive queries and streaming.
 Apart from supporting all these workloads in a respective system, it reduces the
management burden of maintaining separate tools.

Evolution of Apache Spark


Spark is one of Hadoop's sub-projects, developed in 2009 in UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.

Features of Apache Spark:


Apache Spark has following features.
Speed − Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; Spark stores the intermediate processing data in memory.
Supports multiple languages − Spark provides built-in APIs in Java, Scala, or
Python. Therefore, you can write applications in different languages. Spark comes up
with 80 high-level operators for interactive querying.
Advanced Analytics − Spark not only supports ‘Map’ and ‘Reduce’; it also supports SQL queries, streaming data, machine learning (ML), and graph algorithms.


Spark Built on Hadoop


The following diagram shows three ways of how Spark can be built with Hadoop
components.

There are three ways of Spark deployment as explained below.


Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce will run side by side to cover all Spark jobs on the cluster.
Hadoop Yarn − Hadoop Yarn deployment means, simply, Spark runs on Yarn (Yet Another Resource Negotiator) without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, a user can start Spark and use its shell without any administrative access.
Components of Spark
The following illustration depicts the different components of Spark.


Apache Spark Core


Spark Core is the underlying general execution engine for spark platform that all other
functionality is built upon. It provides In-Memory computing and referencing datasets
in external storage systems.

Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, built on Spark's distributed memory-based architecture. According to benchmarks done by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computation that can model the user-defined graphs by using
Pregel abstraction API. It also provides an optimized runtime for this abstraction.

What is Scala?
 Scala is a statically typed, general-purpose programming language that incorporates both functional and object-oriented paradigms, and is also suitable for imperative programming approaches, to increase the scalability of applications. It is a strong static type language. In Scala, everything is an object, whether it is a function or a number; it does not have the concept of primitive data types.
 Scala primarily runs on JVM platform and it can also be used to write
software for native platforms using Scala-Native and JavaScript runtimes
through ScalaJs.
 This language was originally built for the Java Virtual Machine (JVM) and
one of
Scala’s strengths is that it makes it very easy to interact with Java code.
 Scala is a scalable language used to write software for multiple platforms; hence, it got the name “Scala”. This language is intended to solve the problems of Java while simultaneously being more concise. Initially designed by Martin Odersky, it was released in 2003.

Why Scala?
 Scala is the core language to be used in writing the most popular
distributed big data processing framework Apache Spark. Big Data
processing is becoming inevitable from small to large enterprises.
 Extracting the valuable insights from data requires state of the art
processing tools and frameworks.
 Scala is easy to learn for object-oriented programmers, Java developers. It
is becoming one of the popular languages in recent years.
 Scala offers first-class functions for users
 Scala can be executed on JVM, thus paving the way for the
interoperability with other languages.
 It is designed for applications that are concurrent (parallel), distributed,
and resilient (robust) message-driven. It is one of the most demanding
languages of this decade.
 It is concise, powerful language and can quickly grow according to the
demand of its users.
 It is object-oriented and has a lot of functional programming features
providing a lot of flexibility to the developers to code in a way they want.
 Scala offers many Duck Types(Structural Types)
 Unlike Java, Scala has many features of functional programming
languages like Scheme, Standard ML and Haskell, including currying, type
inference, immutability, lazy evaluation, and pattern matching.
 The name Scala is a portmanteau of "scalable" and "language", signifying
that it is designed to grow with the demands of its users.

Where Scala can be used?


 Web Applications
 Utilities and Libraries
 Data Streaming
 Parallel batch processing
 Concurrency and distributed application
 Data analytics with Spark
 AWS lambda Expression


Cloudera Impala:

 Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
 Impala is the open source, massively parallel processing (MPP) SQL query
engine for
native analytic database in a computer cluster running Apache Hadoop.
 It is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
 Cloudera Impala is a query engine that runs on Apache Hadoop.
 The project was announced in October 2012 with a public beta test
distribution and became generally available in May 2013.
 Impala brings enabling users to issue low latency SQL queries to data
stored in HDFS and Apache HBase without requiring data movement or
transformation.
 Impala is integrated with Hadoop to use the same file and data formats,
metadata, security and resource management frameworks used by
MapReduce, Apache Hive, Apache Pig and other Hadoop software.
 Impala is promoted for analysts and data scientists to perform analytics on
data stored in Hadoop via SQL or business intelligence tools.
 The result is that large-scale data processing (via MapReduce) and
interactive queries can be done on the same system using the same data
and metadata – removing the need to migrate data sets into specialized
systems and/or proprietary formats simply to perform analysis.

Features include:
 Supports HDFS and Apache HBase storage,
 Reads Hadoop file formats, including text, LZO, SequenceFile, Avro,
RCFile, and Parquet,
 Supports Hadoop security (Kerberos authentication),
 Fine-grained, role-based authorization with Apache Sentry,
 Uses metadata, ODBC driver, and SQL syntax from Apache Hive.


Databases & Types of Data and variables


Data Base: A Database is a collection of related data.
Database Management System: DBMS is a software or set of Programs used to
define, construct and manipulate the data.
Relational Database Management System: RDBMS is a software system
used to maintain relational databases. Many relational database systems have
an option of using the SQL.
NoSQL:
 NoSQL Database is a non-relational Data Management System, that does
not require a fixed schema. It avoids joins, and is easy to scale. The major
purpose of using a NoSQL database is for distributed data stores with
humongous data storage needs. NoSQL is used for Big data and real-time
web apps. For example, companies like Twitter, Facebook and Google
collect terabytes of user data every single day.
 NoSQL database stands for “Not Only SQL” or “Not SQL.” Though a
better term would be “NoREL”, NoSQL caught on. Carl Strozz introduced
the NoSQL concept in 1998.
 Traditional RDBMS uses SQL syntax to store and retrieve data for further
insights. Instead, a NoSQL database system encompasses a wide range of
database technologies that can store structured, semi-structured,
unstructured and polymorphic data.


Why NoSQL?
 The concept of NoSQL databases became popular with Internet giants like
Google, Facebook, Amazon, etc. who deal with huge volumes of data. The
system response time becomes slow when you use RDBMS for massive
volumes of data.
 To resolve this problem, we could “scale up” our systems by upgrading
our existing hardware. This process is expensive. The alternative for this
issue is to distribute database load on multiple hosts whenever the load
increases. This method is known as “scaling out.”

Types of NoSQL Databases:

 Document-oriented: JSON documents - MongoDB and CouchDB
 Key-value: Redis and DynamoDB
 Wide-column: Cassandra and HBase
 Graph: Neo4j and Amazon Neptune
Relational Databases (SQL)        Non-relational Databases (NoSQL)
Oracle                            MongoDB
MySQL                             CouchDB
SQL Server                        BigTable


SQL vs NOSQL DB:

SQL                                                    NoSQL
Relational database management system (RDBMS)          Non-relational or distributed database system
Fixed, static or predefined schema                     Dynamic schema
Not suited for hierarchical data storage               Best suited for hierarchical data storage
Best suited for complex queries                        Not so good for complex queries
Vertically scalable                                    Horizontally scalable
Follows ACID properties                                Follows CAP (consistency, availability, partition tolerance)
Differences between SQL and NoSQL

The table below summarizes the main differences between SQL and NoSQL databases.

Data Storage Model
  SQL: Tables with fixed rows and columns.
  NoSQL: Document: JSON documents; Key-value: key-value pairs; Wide-column: tables with rows and dynamic columns; Graph: nodes and edges.

Development History
  SQL: Developed in the 1970s with a focus on reducing data duplication.
  NoSQL: Developed in the late 2000s with a focus on scaling and allowing for rapid application change driven by agile and DevOps practices.

Examples
  SQL: Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.
  NoSQL: Document: MongoDB and CouchDB; Key-value: Redis and DynamoDB; Wide-column: Cassandra and HBase; Graph: Neo4j and Amazon Neptune.

Primary Purpose
  SQL: General purpose.
  NoSQL: Document: general purpose; Key-value: large amounts of data with simple lookup queries; Wide-column: large amounts of data with predictable query patterns; Graph: analyzing and traversing relationships between connected data.

Schemas
  SQL: Rigid.
  NoSQL: Flexible.

Scaling
  SQL: Vertical (scale-up with a larger server).
  NoSQL: Horizontal (scale-out across commodity servers).

Multi-Record ACID Transactions
  SQL: Supported.
  NoSQL: Most do not support multi-record ACID transactions; however, some, like MongoDB, do.

Joins
  SQL: Typically required.
  NoSQL: Typically not required.

Data to Object Mapping
  SQL: Requires ORM (object-relational mapping).
  NoSQL: Many do not require ORMs; MongoDB documents map directly to data structures in most popular programming languages.
Benefits of NoSQL
 The NoSQL data model addresses several issues that the relational
model is not designed to address:
 Large volumes of structured, semi-structured, and unstructured data.
 Object-oriented programming that is easy to use and flexible.
 Efficient, scale-out architecture instead of expensive, monolithic
architecture.

Variables:
 Data consist of individuals and variables that give us information
about those individuals. An individual can be an object or a person.
 A variable is an attribute, such as a measurement or a label.
 Two types of Data
 Quantitative data(Numerical)
 Categorical data

 Quantitative Variables: Quantitative data contains numerical values that can be added, subtracted, divided, etc.
There are two types of quantitative variables: discrete and continuous.


Discrete vs continuous variables

Type of variable        What does the data represent?                       Examples
Discrete variables      Counts of individual items or values.               • Number of students in a class
                                                                            • Number of different tree species in a forest
Continuous variables    Measurements of continuous or non-finite values.    • Distance  • Volume  • Age

 Categorical variables: Categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things.
There are three types of categorical variables: binary, nominal, and ordinal variables.

Type of variable      What does the data represent?                  Examples
Binary variables      Yes/no outcomes.                               • Heads/tails in a coin flip  • Win/lose in a football game
Nominal variables     Groups with no rank or order between them.     • Colors  • Brands  • ZIP codes
Ordinal variables     Groups that are ranked in a specific order.    • Finishing place in a race  • Rating scale responses in a survey
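As a quick illustration of these variable types, a minimal R sketch (with made-up values) showing how quantitative data is stored as numeric vectors and categorical data as factors, including an ordered factor for ordinal data:

# Hypothetical example: representing variable types in R
age      <- c(21, 35, 42, 28)                       # continuous quantitative variable
children <- c(0, 2, 1, 3)                           # discrete quantitative variable
passed   <- factor(c("Yes", "No", "Yes", "Yes"))    # binary categorical variable
colour   <- factor(c("Red", "Blue", "Green", "Red"))   # nominal categorical variable
rating   <- factor(c("Low", "High", "Medium", "Low"),
                   levels = c("Low", "Medium", "High"),
                   ordered = TRUE)                  # ordinal categorical variable
str(age); str(colour); str(rating)                  # inspect how R stores each type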

Missing Imputations:
Imputation is the process of replacing missing data with substituted values.

Types of missing data


Missing data can be classified into one of three categories
1. MCAR
Data which is Missing Completely At Random has nothing systematic about
which observations are missing values. There is no relationship between
missingness and either observed or unobserved covariates.

2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but
due entirely to observed variables. For example, those from a lower
socioeconomic status may be less willing to provide salary information (but we
know their SES status). The key is that the missingness is not due to the values
which are not observed. MCAR implies MAR but not vice-versa.

3. MNAR
If the data are Missing Not At Random, then the missingness depends on the
values of the missing data. Censored data falls into this category. For example,
individuals who are heavier are less likely to report their weight. Another
example, the device measuring some response can only measure values above
.5. Anything below that is missing.
There can be two types of gaps in Data:
1. Missing Data Imputation
2. Model based Technique

Imputations: (Treatment of Missing Values)


1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not very
effective, unless the tuple contains several attributes with missing values.
It is especially poor when the percentage of missing values per attribute
varies considerably.
2. Fill in the missing value manually: In general, this approach is time-
consuming and may not be feasible given a large data set with many
missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common, that of “Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Considering the
average value of that particular attribute and use this value to replace the
missing value in that attribute column.
5. Use the attribute mean for all samples belonging to the same class
as the given tuple:
For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same
credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
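To illustrate method 4 (attribute mean) in R, together with a simple most-frequent-value fill for a categorical column, a minimal sketch on a hypothetical data frame (the column names income and city are made up):

# Hypothetical data with missing values
df <- data.frame(income = c(52000, 48000, NA, 61000, NA, 55000),
                 city   = c("Hyd", "Hyd", NA, "Pune", "Hyd", "Pune"))

# Method 4: replace missing numeric values with the attribute mean
df$income[is.na(df$income)] <- mean(df$income, na.rm = TRUE)

# Categorical attribute: replace NA with the most frequent value (mode)
mode_val <- names(sort(table(df$city), decreasing = TRUE))[1]
df$city[is.na(df$city)] <- mode_val

df   # inspect the imputed data frame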

Need for Business Modelling:


The main need for business modelling arises because companies that embrace big data analytics and transform their business models in parallel will create new opportunities for revenue streams, customers, products and services. This requires having a big data strategy and vision that identifies and capitalizes on new opportunities.

Analytics applications to various Business Domains


Application of Modelling in Business:
 Applications of Data Modelling can be termed as Business analytics.
 Business analytics involves the collating, sorting, processing, and
studying of business-related data using statistical models and iterative
methodologies. The goal of BA is to narrow down which datasets are useful
and which can increase revenue, productivity, and efficiency.
 Business analytics (BA) is the combination of skills, technologies, and
practices used to examine an organization's data and performance as a
way to gain insights and make data-driven decisions in the future using
statistical analysis.


Although business analytics is being leveraged in most commercial sectors and


industries, the following applications are the most common.
1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and they are an
ideal way of gathering information about a purchaser’s spending habits,
financial situation, behavior trends, demographics, and lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to retain
customer loyalty to stay in business for the long haul. CRM systems analyze
important performance indicators such as demographics, buying patterns,
socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to extract
insights that help organizations maneuver their way through tricky terrain.
Corporations turn to business analysts to optimize budgeting, banking, financial
planning, forecasting, and portfolio management.
4. Human Resources
Business analysts help the process by pouring through data that characterizes high
performing candidates, such as educational background, attrition rate, the
average length of employment, etc. By working with this information, business
analysts help HR by forecasting the best fits between the company and
candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that
affect operations and the bottom line. Identifying things like equipment
downtime, inventory levels, and maintenance costs help companies streamline
inventory management, risks, and supply-chain management to create
maximum efficiency.
6. Marketing
Business analysts help answer these questions and so many more, by measuring
marketing and advertising metrics, identifying consumer behavior and the
target audience, and analyzing market trends.


Data Modelling Techniques in Data Analytics:


What is Data Modelling?
 Data Modelling is the process of analyzing the data objects and their relationship
to the other objects. It is used to analyze the data requirements that are
required for the business processes. The data models are created for the data to
be stored in a database.
 The Data Model's main focus is on what data is needed and how we have to
organize data rather than what operations we have to perform.
 Data Model is basically an architect's building plan. It is a process of
documenting complex software system design as in a diagram that can be easily
understood.

Uses of Data Modelling:


 Data Modelling helps create a robust design with a data model that can
show an organization's entire data on the same platform.
 The database at the logical, physical, and conceptual levels can be designed with
the help data model.
 Data Modelling Tools help in the improvement of data quality.
 Redundant data and missing data can be identified with the help of data models.
 The data model is quite a time consuming, but it makes the maintenance
cheaper and faster.

Data Modelling Techniques:

Below given are 5 different types of techniques used to organize the data:
1. Hierarchical Technique
The hierarchical model is a tree-like structure. There is one root node, or we can say
one parent node and the other child nodes are sorted in a particular order. But, the
hierarchical model is very rarely used now. This model can be used for real-world model
relationships.


2. Object-oriented Model
The object-oriented approach is the creation of objects that contains stored values. The
object- oriented model communicates while supporting data abstraction, inheritance,
and encapsulation.
3. Network Technique
The network model provides us with a flexible way of representing objects and
relationships between these entities. It has a feature known as a schema representing
the data in the form of a graph. An object is represented inside a node and the relation
between them as an edge, enabling them to maintain multiple parent and child records
in a generalized manner.
4. Entity-relationship Model
ER model (Entity-relationship model) is a high-level relational model which is used to
define data elements and relationship for the entities in a system. This conceptual
design provides a better view of the data that helps us easy to understand. In this
model, the entire database is represented in a diagram called an entity-relationship
diagram, consisting of Entities, Attributes, and Relationships.
5. Relational Technique
Relational is used to describe the different relationships between the entities. And there
are different sets of relations between the entities such as one to one, one to many,
many to one, and many to many.

*** End of Unit-2 ***

DATA ANALYTICS (Professional Elective - I)
Subject Code: CS513PE

NOTES MATERIAL
UNIT 3

For B. TECH (CSE)


3rd YEAR – 1st SEM (R18)

Faculty:
B. RAVIKRISHNA

DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI


UNIT - III
Linear & Logistic Regression
Syllabus
Regression – Concepts, Blue property assumptions,Least Square Estimation,
Variable Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics
applications to various Business Domains etc.

Topics:
1. Regression – Concepts
2. Blue property assumptions
3. Least Square Estimation
4. Variable Rationalization
5. Model Building etc.
6. Logistic Regression - Model Theory
7. Model fit Statistics
8. Model Construction
9. Analytics applications to various Business Domains

Unit-3 Objectives:
1. To explore the Concept of Regression
2. To learn the Linear Regression
3. To explore Blue Property Assumptions
4. To Learn the Logistic Regression
5. To understand the Model Theory and Applications

Unit-3 Outcomes:
After completion of this course students will be able to
1. To Describe the Concept of Regression
2. To demonstrate Linear Regression
3. To analyze the Blue Property Assumptions
4. To explore the Logistic Regression
5. To describe the Model Theory and Applications


Regression – Concepts:
Introduction:
 The term regression is used to indicate the estimation or prediction of
the average value of one variable for a specified value of another variable.
 Regression analysis is a very widely used statistical tool to establish a
relationship model between two variables.

“Regression Analysis is a statistical process for estimating the relationships between the dependent variables (criterion variables / response variables) and one or more independent variables (predictor variables).”
 Regression describes how an independent variable is numerically
related to the dependent variable.
 Regression can be used for prediction, estimation and hypothesis testing,
and modeling causal relationships.

When Regression is chosen?


 A regression problem is when the output variable is a real or continuous value,
such as
“salary” or “weight”.
 Many different models can be used, the simplest is linear regression. It tries
to fit data with the best hyperplane which goes through the points.
 Mathematically a linear relationship represents a straight line when plotted as
a graph.
 A non-linear relationship where the exponent of any variable is not equal to
1 creates a curve.
Types of Regression Analysis Techniques:
1. Linear Regression
2. Logistic Regression
3. Ridge Regression
4. Lasso Regression
5. Polynomial Regression
6. Bayesian Linear Regression


Advantages & Limitations:


 Fast and easy to model and is particularly useful when the relationship to
be modeled is not extremely complex and if you don’t have a lot of data.
 Very intuitive to understand and interpret.
 Linear Regression is very sensitive to outliers.
Linear regression:
 Linear Regression is a very simple method but has proven to be very
useful for a large number of situations.
 When we have a single input attribute (x) and we want to use linear
regression, this is called simple linear regression.

 In simple linear regression we want to model our data as follows: y = B0 + B1 * x
 Here x is the input value we know, and B0 and B1 are coefficients that we need to estimate and that move the line around.
 Simple regression is great, because rather than having to search for values
by trial and error or calculate them analytically using more advanced
linear algebra, we can estimate them directly from our data.
OLS Regression:
Linear Regression using Ordinary Least Squares
Approximation Based on Gauss Markov Theorem:
We can start off by estimating the value for B1 as:
B1 = Σ (xi − mean(x)) * (yi − mean(y)) / Σ (xi − mean(x))²,  with both sums running over i = 1..n

B0 = mean(y) − B1 * mean(x)


 If we had multiple input attributes (e.g. x1, x2, x3, etc.) This would be
called multiple linear regression. The procedure for linear regression is
different and simpler than that for multiple linear regression.
Let us consider the following Example:
for an equation y=2*x+3.
x     y     xi−mean(x)   yi−mean(y)   (xi−mean(x))*(yi−mean(y))   (xi−mean(x))²
-3    -3    -4.4         -8.8          38.72                       19.36
-1     1    -2.4         -4.8          11.52                        5.76
 2     7     0.6          1.2           0.72                        0.36
 4    11     2.6          5.2          13.52                        6.76
 5    13     3.6          7.2          25.92                       12.96
                                       Sum = 90.4                  Sum = 45.2

Mean(x) = 1.4 and Mean(y) = 5.8


Applying the above formulas:
B1 = 90.4 / 45.2 = 2
B0 = mean(y) − B1 * mean(x) = 5.8 − 2 * 1.4 = 3
So we find B1 = 2 and B0 = 3, which recovers the original equation y = 2*x + 3.
Example for Linear Regression using R:
Consider the following data set:
x = {1,2,4,3,5} and y = {1,3,3,2,5}
We use R to apply Linear Regression for the above data.
> rm(list=ls()) #removes the list of variables in the current session of R
> x<-c(1,2,4,3,5) #assigns values to x
> y<-c(1,3,3,2,5) #assigns values to y
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off() #to clear the existing plot/s
> plot(x,y,pch=16, col="red")
> relxy<-lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)
x
0.4 0.8
> abline(relxy,col="Blue")


> a <- data.frame(x = 7)
> a
  x
1 7
> result <- predict(relxy,a)
> print(result)
1
6
> #Note: you can observe that
> 0.8*7+0.4
[1] 6   #the same value calculated using the line equation y = 0.8*x + 0.4
Simple linear regression is the simplest form of regression and the most studied.
Calculating B1 & B0 using Correlations and Standard Deviations:
B1 = corr(x, y) * stdev(y) / stdev(x)

B1 = Correlation(x, y) * St.Deviation(y) / St.Deviation(x)
Where cor (x,y) is the correlation between x & y and stdev() is the calculation of the
standard deviation for a variable. The same is calculated in R as follows:
> x<-c(1,2,4,3,5)
> y<-c(1,3,3,2,5)
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> B1=cor(x,y)*sd(y)/sd(x)
> B1
[1] 0.8
> B0=mean(y)-B1*mean(x)
> B0
[1] 0.4

Estimating Error: (RMSE: Root Mean Squared Error)


We can calculate the error for our predictions called the Root Mean Squared Error or
RMSE. Root Mean Square Error can be calculated by
Err (RMSE) = sqrt( Σ (pi − yi)² / n ),  with the sum over i = 1..n

where p is the predicted value, y is the actual value, i is the index of a specific instance and n is the number of predictions, because we must calculate the error across all predicted values.

Estimating the error for y = 0.8*x + 0.4:

x    y (actual)    p (predicted)    p−y     (p−y)²
1    1             1.2              0.2     0.04
2    3             2.0             -1.0     1.00
4    3             3.6              0.6     0.36
3    2             2.8              0.8     0.64
5    5             4.4             -0.6     0.36

s = sum of (p−y)² = 2.4
s/n = 2.4 / 5 = 0.48
RMSE = sqrt(s/n) = sqrt(0.48) ≈ 0.692
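The same RMSE can be reproduced in R for this data set and the fitted line y = 0.8*x + 0.4 (a sketch using base R only):

x <- c(1,2,4,3,5); y <- c(1,3,3,2,5)
p <- 0.8*x + 0.4                  # predictions from the fitted line
rmse <- sqrt(mean((p - y)^2))     # square root of the mean squared error
rmse                              # approximately 0.6928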


Properties and Assumptions of OLS approximation:


1. Unbiasedness:
i. The bias of an estimator is defined as the difference between its expected value and the true value, i.e., e(y) = y_actual − y_predicted.
ii. If the bias is zero then the estimator is unbiased.
iii. Unbiasedness is important only when it is combined with small variance
2. Least Variance:
i. An estimator is best when it has the smallest or least variance
ii. Least variance property is more important when it combined with small
biased.
3. Efficient estimator:
i. An estimator is said to be efficient when it fulfils both conditions.
ii. The estimator should be unbiased and have the least variance.
4. Best Linear Unbiased Estimator (BLUE Properties):
i. An estimator is said to be BLUE when it fulfils the above properties.
ii. An estimator is BLUE if it is an unbiased, least-variance, linear estimator.
5. Minimum Mean Square Error (MSE):
i. An estimator is said to be MSE estimator if it has smallest mean square
error.
ii. Less difference between estimated value and True Value
6. Sufficient Estimator:
i. An estimator is sufficient if it utilizes all the information of a sample
about the True parameter.
ii. It must use all the observations of the sample.
Assumptions of OLS Regression:
1. There are random sampling of observations.
2. The conditional mean should be zero
3. There is homoscedasticity and no Auto-correlation.
4. Error terms should be normally distributed(optional)
5. The Properties of OLS estimates of simple linear regression
equation is y = B0+B1*x + µ (µ -> Error)
6. The above equation is based on the following assumptions
a. Randomness of µ
b. Mean of µ is Zero
c. Variance of µ is constant
d. The variance of µ has normal distribution
e. Error µ of different observations are independent.


Homoscedasticity vs Heteroscedasticity:

 The assumption of homoscedasticity (meaning “same variance”) is central to linear regression models. Homoscedasticity describes a situation in which the
error term (that is, the “noise” or random disturbance in the relationship
between the independent variables and the dependent variable) is the same
across all values of the independent variables.
 Heteroscedasticity (the violation of homoscedasticity) is present when the size of
the error term differs across values of an independent variable.
 The impact of violating the assumption of homoscedasticity is a matter of degree,
increasing as heteroscedasticity increases.
 Homoscedasticity means “having the same scatter.” For it to exist in a set of data,
the points
must be about the same distance from the fitted regression line.
 The opposite is heteroscedasticity (“different scatter”), where points are at
widely varying distances from the regression line.
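A common informal check for homoscedasticity in R is to plot residuals against fitted values: a roughly constant vertical spread suggests homoscedasticity, while a funnel shape suggests heteroscedasticity. A sketch reusing the small data set from the linear regression example:

x <- c(1,2,4,3,5); y <- c(1,3,3,2,5)
model <- lm(y ~ x)
plot(fitted(model), residuals(model),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")    # look for a constant spread around zero
abline(h = 0, col = "red")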
Variable Rationalization:
 The data set may have a large number of attributes. But some of those attributes
can be irrelevant or redundant. The goal of Variable Rationalization is to improve
the Data Processing in an optimal way through attribute subset selection.
 This process is to find a minimum set of attributes such that dropping of those
irrelevant attributes does not much affect the utility of data and the cost of data
analysis could be reduced.
 Mining on a reduced data set also makes the discovered pattern easier to
understand. As part of Data processing, we use the below methods of Attribute
subset selection
1. Stepwise Forward Selection
2. Stepwise Backward Elimination
3. Combination of Forward Selection and Backward Elimination
4. Decision Tree Induction.
All the above methods are greedy approaches for attribute subset selection.


1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute (having the minimum p-value) is chosen and added to the minimal set. In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
3. Combination of Forward Selection and Backward Elimination: Stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique generally used for attribute selection.
4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flowchart-like structure with nodes denoting a test on an attribute; each branch corresponds to an outcome of the test and each leaf node is a class prediction. Any attribute that is not part of the tree is considered irrelevant and hence discarded. (A short R sketch of these selection strategies follows below.)
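Base R's step() function implements this kind of greedy attribute (variable) selection on a fitted model, using AIC rather than raw p-values as the selection criterion; a minimal sketch on the built-in mtcars data, where direction = "forward", "backward" or "both" correspond to the first three approaches above:

full  <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)   # model with all candidate attributes
empty <- lm(mpg ~ 1, data = mtcars)                       # intercept-only model

# Stepwise forward selection: start empty, add one attribute per iteration
step(empty, scope = formula(full), direction = "forward")

# Stepwise backward elimination: start full, drop one attribute per iteration
step(full, direction = "backward")

# Combination of forward selection and backward elimination
step(empty, scope = formula(full), direction = "both")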

Model Building Life Cycle in Data Analytics:


When we come across a business analytics problem, we often proceed towards execution without acknowledging the stumbling blocks, and try to implement and predict outcomes before realizing the pitfalls. The sections below walk through the problem-solving steps involved in the data science model-building life cycle.
Let’s understand every model-building step in depth.
The data science model-building life cycle includes some important steps to follow. The
following are the steps to follow to build a Data Model

1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment

1. Problem Definition
 The first step in constructing a model is to
understand the industrial problem in a more comprehensive way. To identify the
purpose of the problem and the prediction target, we must define the project
objectives appropriately.
 Therefore, to proceed with an analytical approach, we have to recognize the obstacles first. Remember, excellent results always depend on a better understanding of the problem.

2. Hypothesis Generation


 Hypothesis generation is the guessing approach through which we derive some essential data parameters that have a significant correlation with the prediction target.
prediction target.
 Your hypothesis research must be in-depth, looking for every perceptive of all
stakeholders into account. We search for every suitable factor that can influence
the outcome.
 Hypothesis generation focuses on what you can create rather than what is
available in the dataset.

3. Data Collection
 Data collection is gathering data from relevant sources regarding the
analytical problem, then we extract meaningful insights from the data for
prediction.

The data gathered must have:


 Proficiency in answer hypothesis questions.
 Capacity to elaborate on every data parameter.
 Effectiveness to justify your research.
 Competency to predict outcomes accurately.

4. Data Exploration/Transformation
 The data you collected may be in unfamiliar shapes and sizes. It may contain
unnecessary features, null values, unanticipated small values, or immense
values. So, before applying any algorithmic model to data, we have to explore it
first.
 By inspecting the data, we get to understand the explicit and hidden trends in
data. We find the relation between data features and the target variable.
 Usually, a data scientist invests his 60–70% of project time dealing with data
exploration only.
 There are several sub steps involved in data exploration:
o Feature Identification:
 You need to analyze which data features are available and which
ones are not.
 Identify independent and target variables.

 Identify data types and categories of these variables.


o Univariate Analysis:
 We inspect each variable one by one. This kind of analysis depends
on the variable type whether it is categorical and continuous.
 Continuous variable: We mainly look for statistical trends like
mean, median, standard deviation, skewness, and many
more in the dataset.
 Categorical variable: We use a frequency table to understand
the spread of data for each category. We can measure the
counts and frequency of occurrence of values.
o Multi-variate Analysis:
 The bi-variate analysis helps to discover the relation between two or
more variables.
 We can find the correlation in case of continuous variables and the
case of categorical, we look for association and dissociation
between them.
o Filling Null Values:
 Usually, the dataset contains null values which lead to lower the
potential of the model. With a continuous variable, we fill these null
values using the mean or mode of that specific column. For the null
values present in the categorical column, we replace them with the
most frequently occurred categorical value. Remember, don’t
delete that rows because you may lose the information.
5. Predictive Modeling
 Predictive modeling is a mathematical approach to create a statistical model
to forecast future behavior based on input test data.
Steps involved in predictive modeling:
 Algorithm Selection:
o When we have the structured dataset, and we want to estimate the
continuous or categorical outcome then we use supervised machine
learning methodologies like regression and classification techniques.
When we have unstructured data and want to predict the clusters of items
to which a particular input test sample belongs, we use unsupervised
algorithms. An actual data scientist applies multiple algorithms to get a
more accurate model.
 Train Model:
o After assigning the algorithm and getting the data handy, we train our
model using the input data applying the preferred algorithm. It is an action
to determine the correspondence between independent variables, and the
prediction targets.
 Model Prediction:

Downloaded by KANNAN S
DATA ANALYTICS UNIT–3

o We make predictions by giving the input test data to the trained model.
We measure the accuracy by using a cross-validation strategy or ROC
curve which performs well to derive model output for test data.

6. Model Deployment
 There is nothing better than deploying the model in a real-time environment. It
helps us to gain analytical insights into the decision-making procedure. You
constantly need to update the model with additional features for customer
satisfaction.
 To predict business decisions, plan market strategies, and create personalized
customer interests, we integrate the machine learning model into the existing
production domain.
 When you go through the Amazon website and notice the product
recommendations completely based on your curiosities. You can experience the
increase in the involvement of the customers utilizing these services. That’s how
a deployed model changes the mindset of the customer and convince him to
purchase the product.

Key Takeaways

SUMMARY OF DA MODEL LIFE CYCLE:


 Understand the purpose of the business analytical problem.
 Generate hypotheses before looking at data.
 Collect reliable data from well-known resources.
 Invest most of the time in data exploration to extract meaningful insights from the
data.
 Choose the signature algorithm to train the model and use test data to evaluate.
 Deploy the model into the production environment so it will be available to
users and strategize to make business decisions effectively.


Logistic Regression:
Model Theory, Model fit Statistics, Model Construction
Introduction:
 Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of
independent variables.
 The outcome must be a categorical or discrete value. It can be either Yes or
No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1,
it gives the probabilistic values which lie between 0 and 1.
 In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
 The curve from the logistic function indicates the likelihood of something
such as whether or not the cells are cancerous or not, a mouse is obese or
not based on its weight, etc.
 Logistic regression uses the concept of predictive modeling as regression;
therefore, it is called logistic regression, but is used to classify samples;
therefore, it falls under the classification algorithm.
 In logistic regression, we use the concept of the threshold value, which
defines the probability of either 0 or 1. Such as values above the threshold
value tends to 1, and a value below the threshold values tends to 0.
Types of Logistic Regressions:
On the basis of the categories, Logistic Regression can be classified into three
types:
 Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
 Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
 Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
Definition: Multi-collinearity:
 Multicollinearity is a statistical phenomenon in which multiple independent
variables show high correlation between each other and they are too inter-
related.
 Multicollinearity also called as Collinearity and it is an undesired situation for any
statistical regression model since it diminishes the reliability of the model itself.


 If two or more independent variables are too correlated, the data obtained
from the regression will be disturbed because the independent variables
are actually dependent between each other.
Assumptions for Logistic Regression:
 The dependent variable must be categorical in nature.
 The independent variable should not have multi-collinearity.
Logistic Regression Equation:
 The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are
given below:
 Logistic Regression uses a more complex cost function, this cost function can
be defined as the ‘Sigmoid function’ or also known as the ‘logistic function’
instead of a linear function.
 The hypothesis of logistic regression limits the cost function to values between 0 and 1. Therefore, linear functions fail to represent it, as they can take a value greater than 1 or less than 0, which is not possible as per the hypothesis of logistic regression.

0 ≤ hθ(x) ≤ 1   --- Logistic Regression Hypothesis Expectation


Logistic Function (Sigmoid Function):
 The sigmoid function is a mathematical function used to map the predicted
values to probabilities.
 The sigmoid function maps any real value into another value within a range
of 0 and 1, and so forma S-Form curve.
 The value of the logistic regression must be between 0 and 1, which cannot
go beyond this limit, so it forms a curve like the "S" form.
 The below image is showing the logistic function:

Fig: Sigmoid Function Graph


The Sigmoid function can be interpreted as a probability indicating Class-1 or Class-0. So the Regression model makes the following predictions:

z = sigmoid(y) = σ(y) = 1 / (1 + e^(−y))

Hypothesis Representation
 When using linear regression, we used a formula for the line equation as:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
 In the above equation y is the response variable, x1, x2, ..., xn are the predictor variables, and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
 For logistic regression, we need the maximum likelihood hypothesis hθ(y).
 Apply the Sigmoid function on y as:
z = σ(y) = σ(b0 + b1*x1 + b2*x2 + ... + bn*xn)
z = σ(y) = 1 / (1 + e^(−(b0 + b1*x1 + b2*x2 + ... + bn*xn)))

Example for Sigmoid Function in R:
> #Example for Sigmoid Function
> y<-c(-10:10);y
[1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
> z<-1/(1+exp(-y));z
 [1] 4.539787e-05 1.233946e-04 3.353501e-04 9.110512e-04 2.472623e-03 6.692851e-03 1.798621e-02 4.742587e-02
 [9] 1.192029e-01 2.689414e-01 5.000000e-01 7.310586e-01 8.807971e-01 9.525741e-01 9.820138e-01 9.933071e-01
[17] 9.975274e-01 9.990889e-01 9.996646e-01 9.998766e-01 9.999546e-01
> plot(y,z)

> rm(list=ls())
> attach(mtcars)   #attaching a data set into the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]

> head(input)
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215


Hornet Sportabout 18.7 360 175 3.440


Valiant 18.1 225 105 3.460
> #model<-lm(mpg~disp+hp+wt);model1# Show the model
> model<-glm(mpg~disp+hp+wt);model

Call: glm(formula = mpg ~ disp + hp + wt)

Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891

Degrees of Freedom: 31 Total (i.e. Null);  28 Residual
Null Deviance:      1126
Residual Deviance:  195     AIC: 158.6
> newx<-data.frame(disp=150,hp=150,wt=4) #new input for prediction
> predict(model,newx)
1
17.08791
> 37.15+(-0.000937)*150+(-0.0311)*150+(-3.8008)*4 #checking with the data
newx
[1] 17.14125
> y<-input[,c("mpg")]
> y
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4 10.4
[17] 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7 15.0 21.4
> z=1/(1+exp(-y));z
 [1] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.9999994 1.0000000
 [9] 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 0.9999696 0.9999696
[17] 0.9999996 1.0000000 1.0000000 1.0000000 1.0000000 0.9999998 0.9999997 0.9999983
[25] 1.0000000 1.0000000 1.0000000 1.0000000 0.9999999 1.0000000 0.9999997 1.0000000
> plot(y,z)
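Note that the glm() call above was fitted without a family argument, so it behaves like an ordinary linear model. For an actual logistic regression the response must be categorical and family = binomial is specified; a sketch using the binary am column (transmission type) of mtcars, with made-up values for the new observation:

# Logistic regression: predict transmission type (am: 0 = automatic, 1 = manual)
logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_model)

# Predicted probability that am = 1 for a hypothetical car
newcar <- data.frame(wt = 2.5, hp = 120)
predict(logit_model, newcar, type = "response")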


Confusion Matrix (or) Error Matrix (or) Contingency Table:


What is a Confusion Matrix?
“A Confusion matrix is an N x N matrix used for evaluating the performance of a
classification model, where N is the number of target classes. The matrix
compares the actual target values with those predicted by the machine learning
model. This gives us a holistic view of how well our classification model is
performing and what kinds of errors it is making. It is a specific table layout that
allows visualization of the performance of an algorithm, typically a supervised
learning one (in unsupervised learning it is usually called a matching matrix).”
For a binary classification problem, we would have a 2 x 2 matrix as shown below
with 4 values:

Let’s decipher the matrix:

 The target variable has two values: Positive or Negative


 The columns represent the actual values of the target variable
 The rows represent the predicted values of the target variable

 True Positive
 True Negative
 False Positive – Type 1 Error
 False Negative – Type 2 Error

Why we need a Confusion matrix?


 Precision vs Recall
 F1-score


Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
 The predicted value matches the actual value
 The actual value was positive and the model predicted a positive value
True Negative (TN)
 The predicted value matches the actual value
 The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
 The predicted value was falsely predicted
 The actual value was negative but the model predicted a positive value
 Also known as the Type 1
error False Negative (FN) – Type
2 error
 The predicted value was falsely predicted
 The actual value was positive but the model predicted a negative value
 Also known as the Type 2 error
To evaluate the performance of a model, we have the performance metrics called,
Accuracy, Precision, Recall & F1-Score metrics
Accuracy:
Accuracy is the most intuitive performance measure and it is simply a ratio of
correctly predicted observation to the total observations.
 Accuracy is a great measure to understand that the model is Best.
 Accuracy is dependable only when you have symmetric datasets where
values of false positive and false negatives are almost same.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision:
Precision is the ratio of correctly predicted positive observations to the total
predicted positive observations.
It tells us how many of the correctly predicted cases actually turned out to be
positive.

Precision = TP / (TP + FP)

 Precision is a useful metric in cases where False Positive is a higher
concern than False Negatives.


 Precision is important in music or video recommendation systems, e-


commerce websites, etc. Wrong results could lead to customer churn and
be harmful to the business.
Recall: (Sensitivity)
Recall is the ratio of correctly predicted positive observations to the all
observations in actual class.

Recall = TP / (TP + FN)
 Recall is a useful metric in cases where False Negative trumps False Positive.
 Recall is important in medical cases where it doesn’t matter whether
we raise a
false alarm but the actual positive cases should not go undetected!
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea
about these two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account.

F1-Score = 2 / (1/Precision + 1/Recall) = 2 * Precision * Recall / (Precision + Recall)
 F1 is usually more useful than accuracy, especially if you have an
uneven class distribution.
 Accuracy works best if false positives and false negatives have similar cost.
 If the cost of false positives and false negatives are very different, it’s
better to look
at both Precision and Recall.
 But there is a catch here. If the interpretability of the F1-score is poor,
means that we don’t know what our classifier is maximizing – precision
or recall? So, we use it in combination with other evaluation metrics which
gives us a complete picture of the result.
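All four metrics can be computed directly from the confusion-matrix counts; a small R helper (the function name confusion_metrics is our own, not a library function) assuming TP, TN, FP and FN are already known:

confusion_metrics <- function(TP, TN, FP, FN) {
  accuracy  <- (TP + TN) / (TP + FP + TN + FN)
  precision <- TP / (TP + FP)
  recall    <- TP / (TP + FN)
  f1        <- 2 * precision * recall / (precision + recall)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
}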


Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on
it and get the below confusion matrix:

The different values of the Confusion matrix


would be as follows:
 True Positive (TP) = 560
-Means 560 positive class data points were
correctly classified by the model.

 True Negative (TN) = 330


-Means 330 negative class data points
were correctly classified by the model.

 False Positive (FP) = 60


-Means 60 negative class data points were incorrectly classified as belonging to
the positive class by the model.

 False Negative (FN) = 50


-Means 50 positive class data points were incorrectly classified as belonging to
the negative class by the model.
This turned out to be a pretty decent classifier for our dataset considering the
relatively larger number of true positive and true negative values.
Precisely we have the outcomes represented in Confusion Matrix as:
TP = 560, TN = 330, FP = 60, FN = 50
Accuracy:
The accuracy for our model turns out to be:
TP  TN
Accuracy 
TP  FP  TN 
FN
 Accuracy 560 
 890
  0.89
330
560  60  330  50 1000
Hence Accuracy is 89%...Not bad!


Precision:
It tells us how many of the correctly predicted cases actually turned out to be positive. This would determine whether our model is reliable or not. Recall tells us how many of the actual positive cases we were able to predict correctly with our model. We can easily calculate Precision and Recall for our model by plugging the values into the above equations:

Precision = TP / (TP + FP) = 560 / (560 + 60) ≈ 0.903

Recall = TP / (TP + FN) = 560 / (560 + 50) ≈ 0.918

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
         = 2 * (0.903 * 0.918) / (0.903 + 0.918)
         = 2 * 0.8289 / 1.821 ≈ 0.91
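Plugging the counts of this example into the helper defined earlier reproduces these figures (values rounded):

confusion_metrics(TP = 560, TN = 330, FP = 60, FN = 50)
#  Accuracy Precision    Recall        F1
#     0.890     0.903     0.918     0.911   (approximately)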

AUC (Area Under Curve) - ROC (Receiver Operating Characteristics) Curves:
Performance measurement is an essential task in data model evaluation, and the AUC-ROC curve is one of the most important evaluation metrics for checking any classification model's performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics). So when it comes to a classification problem, we can count on an AUC-ROC curve.
When we need to check or visualize the performance of the multi-class
classification problem, we use the AUC (Area Under The Curve) ROC
(Receiver Operating Characteristics) curve.

What is the AUC - ROC Curve?

AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and patients without the disease.

The ROC curve is plotted with TPR against the FPR where TPR is on the y-axis and
FPR is on the x-axis.

TPR (True Positive Rate) / Recall / Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
FPR (False Positive Rate) = FP / (FP + TN) = 1 − Specificity
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of
a classification model at all classification thresholds. This curve plots two parameters:

 True Positive Rate

 False Positive Rate

 True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
 TPR=TPTP+FN

False Positive Rate (FPR) is defined as follows:


An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
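One common way to produce such a curve in practice is sketched below with scikit-learn; this is an illustrative example on synthetic data, and the dataset, model and variable names are our own choices rather than part of the notes:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]        # predicted probability of class 1

fpr, tpr, thresholds = roc_curve(y_test, scores)  # TPR vs FPR at each threshold
auc = roc_auc_score(y_test, scores)

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()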


Analytics applications to various Business Domains:


Application of Modelling in Business:
 Applications of Data Modelling can be termed as Business analytics.
 Business analytics involves the collating, sorting, processing, and studying
of business-related data using statistical models and iterative
methodologies. The goal of BA is to narrow down which datasets are useful
and which can increase revenue, productivity, and efficiency.
 Business analytics (BA) is the combination of skills, technologies, and
practices used to examine an organization's data and performance as a
way to gain insights and make data-driven decisions in the future using
statistical analysis.

Although business analytics is being leveraged in most commercial sectors and industries, the following applications are the most common.
1. Credit Card Companies
Credit and debit cards are an everyday part of consumer spending, and
they are an ideal way of gathering information about a purchaser’s
spending habits, financial situation, behaviour trends, demographics, and
lifestyle preferences.
2. Customer Relationship Management (CRM)
Excellent customer relations is critical for any company that wants to
retain customer loyalty to stay in business for the long haul. CRM systems
analyze important performance indicators such as demographics, buying
patterns, socio-economic information, and lifestyle.
3. Finance
The financial world is a volatile place, and business analytics helps to
extract insights that help organizations maneuver their way through tricky
terrain. Corporations turn to business analysts to optimize budgeting,
banking, financial planning, forecasting, and portfolio management.
4. Human Resources
Business analysts help the process by poring over data that
characterizes high-performing candidates, such as educational
background, attrition rate, the average length of employment, etc. By
working with this information, business analysts help HR by forecasting
the best fits between the company and candidates.
5. Manufacturing

Business analysts work with data to help stakeholders understand the
things that affect operations and the bottom line. Identifying things like
equipment downtime, inventory levels, and maintenance costs help
companies streamline inventory management, risks, and supply-chain
management to create maximum efficiency.
6. Marketing
Business analysts help answer these questions and so many more, by
measuring marketing and advertising metrics, identifying consumer
behaviour and the target audience, and analyzing market trends.

*** End of Unit-3 ***


Add-ons for Unit-3

TO BE DISCUSSED:
Receiver Operating Characteristics:
ROC & AUC

Derivation for Logistic Regression:


The logistic regression model assumes that the log-odds of an observation y can be
expressed as a linear function of the K input variables x:

    logit(P(x)) = ln( P(x) / (1 - P(x)) ) = b0*x0 + b1*x1 + ... + bK*xK = z

Here, we add the constant term b0 by setting x0 = 1. This gives us K+1
parameters. The left hand side of the above equation is called the logit of P
(hence, the name logistic regression).

Let's take the exponent of both sides of the logit equation:

    P(x) / (1 - P(x)) = exp(b0*x0 + b1*x1 + ... + bK*xK) = exp(b0*x0) exp(b1*x1) ... exp(bK*xK)

(Since ln(ab) = ln(a) + ln(b) and exp(a+b) = exp(a) exp(b).) We can also invert the
logit equation to get a new expression for P(x):

    P(x) = 1 / (1 + exp(-z)),    where z = b0*x0 + b1*x1 + ... + bK*xK


The right hand side of the top equation is the sigmoid of z, which maps the real
line to the interval (0, 1), and is approximately linear near the origin. A useful
fact about P(z) is that the derivative P'(z) = P(z) (1 - P(z)). Here's the derivation:

    P(z)  = 1 / (1 + exp(-z))
    P'(z) = exp(-z) / (1 + exp(-z))^2
          = [ 1 / (1 + exp(-z)) ] * [ exp(-z) / (1 + exp(-z)) ]
          = P(z) (1 - P(z))

Later, we will want to take the gradient of P with respect to the set of
coefficients b, rather than z. In that case, P'(z) = P(z) (1 – P(z))z‘, where ‘ is the
gradient taken with respect to b.

DATA ANALYTICS (Professional Elective - I)
Subject Code: CS513PE

UNIT 4
NOTES MATERIAL
OBJECT SEGMENTATION
TIME SERIES METHODS

For B. TECH (CSE)


3rd YEAR – 1st SEM (R18)

Faculty:
B. RAVIKRISHNA

DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI


UNIT - IV
Object Segmentation & Time Series Methods
Syllabus:
Object Segmentation: Regression Vs Segmentation – Supervised and
Unsupervised Learning, Tree Building – Regression, Classification, Overfitting,
Pruning and Complexity, Multiple Decision Trees etc.
Time Series Methods: Arima, Measures of Forecast Accuracy, STL approach,
Extract features from generated model as Height, Average Energy etc and
analyze for prediction

Topics:
Object Segmentation:
 Supervised and Unsupervised Learning
 Segmentation & Regression Vs Segmentation
 Regression, Classification, Overfitting,
 Decision Tree Building
 Pruning and Complexity
 Multiple Decision Trees etc.

Time Series Methods:


 Arima, Measures of Forecast Accuracy
 STL approach
 Extract features from generated model as Height Average Energy etc.
and analyze for prediction

Unit-4 Objectives:
1. To explore the Segmentation & Regression Vs Segmentation
2. To learn the Regression, Classification, Overfitting
3. To explore Decision Tree Building, Multiple Decision Trees etc.
4. To Learn the Arima, Measures of Forecast Accuracy
5. To understand the STL approach

Unit-4 Outcomes:
After completion of this course students will be able to
1. To describe the Segmentation & Regression Vs Segmentation
2. To demonstrate Regression, Classification, Overfitting
3. To analyze the Decision Tree Building, Multiple Decision Trees etc.
4. To explore the Arima, Measures of Forecast Accuracy
5. To describe the STL approach


Supervised and Unsupervised Learning


Supervised Learning:
 Supervised learning is a machine learning method in which models are
trained using labeled data. In supervised learning, models need to find the
mapping function to map the input variable (X) with the output variable
(Y).
 We find a relation between x & y, such that y = f(x)
 Supervised learning needs supervision to train the model, which is similar
to as a student learns things in the presence of a teacher. Supervised
learning can be used for two types of problems: Classification and
Regression.
 Example: Suppose we have an image of different types of fruits. The task
of our supervised learning model is to identify the fruits and classify them
accordingly. So to identify the image in supervised learning, we will give
the input data as well as output for that, which means we will train the
model by the shape, size, color, and taste of each fruit. Once the training is
completed, we will test the model by giving the new set of fruit. The model
will identify the fruit and predict the output using a suitable algorithm.
Unsupervised Machine Learning:
 Unsupervised learning is another machine learning method in which
patterns are inferred from unlabeled input data. The goal of unsupervised
learning is to find the structure and patterns in the input data.
Unsupervised learning does not need any supervision. Instead, it finds
patterns from the data by its own.
 Unsupervised learning can be used for two types of problems: Clustering
and Association.
 Example: To understand the unsupervised learning, we will use the
example given above. So unlike supervised learning, here we will not
provide any supervision to the model. We will just provide the input
dataset to the model and allow the model to find the patterns from the
data. With the help of a suitable algorithm, the model will train itself and
divide the fruits into different groups according to the most similar
features between them.
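The contrast between the two settings can be seen in a few lines of code. The following is a minimal scikit-learn sketch (our own example on the Iris data set, not part of the notes): the classifier receives the labels, while K-means receives only the inputs and must discover the groups itself.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: both inputs X and labels y are given to the learner.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("Supervised training accuracy:", clf.score(X, y))

# Unsupervised: only X is given; the algorithm groups similar records on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster sizes found without labels:",
      [list(km.labels_).count(c) for c in range(3)])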


The main differences between Supervised and Unsupervised learning are given
below:

 Supervised learning algorithms are trained using labeled data, whereas unsupervised learning algorithms are trained using unlabeled data.
 A supervised learning model takes direct feedback to check if it is predicting the correct output or not; an unsupervised learning model does not take any feedback.
 A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.
 In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.
 The goal of supervised learning is to train the model so that it can predict the output when it is given new data; the goal of unsupervised learning is to find the hidden patterns and useful insights from the unknown dataset.
 Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision to train the model.
 Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be categorized into Clustering and Association problems.
 Supervised learning can be used for cases where we know the inputs as well as the corresponding outputs; unsupervised learning can be used for cases where we have only input data and no corresponding output data.
 A supervised learning model produces an accurate result; an unsupervised learning model may give a less accurate result compared to supervised learning.
 Supervised learning is not close to true Artificial Intelligence, as we first train the model for each data point and only then can it predict the correct output; unsupervised learning is closer to true Artificial Intelligence, as it learns similarly to how a child learns daily routine things from experience.
 Supervised learning includes algorithms such as Linear Regression, Logistic Regression, Support Vector Machines, Multi-class Classification, Decision Trees, Random Forests, Bayesian Logic, etc.; unsupervised learning includes K-means clustering, KNN (k-nearest neighbors), Hierarchical clustering, Anomaly detection, Neural Networks, Principal Component Analysis, Independent Component Analysis, Apriori algorithm, etc.

Segmentation
 Segmentation refers to the act of segmenting data according to your
company’s needs in order to refine your analyses based on a defined
context. It is a technique of splitting customers into separate groups
depending on their attributes or behavior.
 The purpose of segmentation is to better understand your
customers (visitors), and to obtain actionable data in order to improve your
website or mobile app. In concrete terms, a segment enables you to filter
your analyses based on certain elements (single or combined).
 Segmentation can be done on elements related to
a visit, as well as on elements related to multiple
visits during a studied period.

Steps:
 Define purpose – Already mentioned in the statement above
 Identify critical parameters – Some of the variables which come up in mind
are skill, motivation, vintage, department, education etc. Let us say that
basis past experience, we know that skill and motivation are most
important parameters. Also, for sake of simplicity we just select 2
variables. Taking additional variables will increase the complexity, but can
be done if it adds value.
 Granularity – Let us say we are able to classify both skill and motivation
into High and Low using various techniques.
There are two broad sets of methodologies for segmentation:
 Objective (supervised) segmentation
 Non-Objective (unsupervised) segmentation


Objective Segmentation
 Segmentation to identify the type of customers who would respond to a
particular offer.
 Segmentation to identify high spenders among customers who will use
the e- commerce channel for festive shopping.
 Segmentation to identify customers who will default on their credit
obligation for a loan or credit card.

Non-Objective Segmentation

https://www.yieldify.com/blog/types-of-market-segmentation/
 Segmentation of the customer base to understand the specific profiles
which exist within the customer base so that multiple marketing actions
can be personalized for each segment
 Segmentation of geographies on the basis of affluence and lifestyle of
people living in each geography so that sales and distribution strategies
can be formulated accordingly.
 Hence, it is critical that the segments created on the basis of an objective
segmentation methodology must be different with respect to the stated
objective (e.g. response to an offer).
 However, in case of a non-objective methodology, the segments are
different with respect to the “generic profile” of observations belonging
to each segment, but not with regards to any specific outcome of interest.
 The most common techniques for building non-objective segmentation are
cluster analysis, K nearest neighbor techniques etc.

Regression Vs Segmentation
 Regression analysis focuses on finding a relationship between a
dependent variable and one or more independent variables.
 Predicts the value of a dependent variable based on the value of at
least one independent variable.
 Explains the impact of changes in an independent variable on the
dependent variable.
 We use linear or logistic regression technique for developing accurate
models for predicting an outcome of interest.
 Often, we create separate models for separate segments.
 Segmentation methods such as CHAID or CRT is used to judge their
effectiveness.


 Creating a separate model for each segment may be time consuming and not worth the effort. However, creating separate models for separate segments may provide higher predictive power.

Decision Tree Classification Algorithm:

 Decision Tree is a supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for
solving Classification problems.
 Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
 A decision tree simply asks a question, and based on the answer (Yes/No),
it further splits the tree into subtrees.
 It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
 It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
 In a Decision tree, there are two nodes, which are the Decision Node and
Leaf Node. Decision nodes are used to make any decision and have
multiple branches, whereas Leaf nodes are the output of those decisions
and do not contain any further branches.
 Basic Decision Tree Learning Algorithm:
 Now that we know what a Decision Tree is, we’ll see how it works
internally. There are many algorithms out there which construct Decision
Trees, but one of the best is called as ID3 Algorithm. ID3 Stands for
Iterative Dichotomiser 3.

There are two main types of Decision Trees:


1. Classification trees (Yes/No types)
What we’ve seen above is an example of classification tree, where the outcome
was a variable like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.
2. Regression trees (Continuous data types)
Here the decision or the outcome variable is Continuous, e.g. a number like 123.

Decision Tree Terminologies


Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.

Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
Decision Tree Representation:
 Each non-leaf node is connected to a test that splits its set of possible
answers into subsets corresponding to different test results.
 Each branch carries a particular test result's subset to another node.
 Each node is connected to a set of possible answers.
 Below diagram explains the general structure of a decision tree:

 A decision tree is an arrangement of tests that provides an appropriate classification at every step in an analysis.
 "In general, decision trees represent a disjunction of conjunctions of
constraints on the attribute-values of instances. Each path from the tree
root to a leaf corresponds to a conjunction of attribute tests, and the tree
itself to a disjunction of these conjunctions" (Mitchell, 1997, p.53).
 More specifically, decision trees classify instances by sorting them down
the tree from the root node to some leaf node, which provides the
classification of the instance. Each node in the tree specifies a test of
some attribute of the instance,

and each branch descending from that node corresponds to one of the
possible
values for this attribute.
 An instance is classified by starting at the root node of the decision tree,
testing the attribute specified by this node, then moving down the tree
branch corresponding to the value of the attribute. This process is then
repeated at the node on this branch and so on until a leaf node is reached.

Appropriate Problems for Decision Tree Learning


Decision tree learning is generally best suited to problems with the following
characteristics:
 Instances are represented by attribute-value pairs.
o There is a finite list of attributes (e.g. hair colour) and each instance
stores a value for that attribute (e.g. blonde).
o When each attribute has a small number of distinct values (e.g.
blonde, brown, red) it is easier for the decision tree to reach a
useful solution.
o The algorithm can be extended to handle real-valued
attributes (e.g. a floating point temperature)
 The target function has discrete output values.
o A decision tree classifies each example as one of the output values.
 Simplest case exists when there are only two possible
classes (Boolean classification).
 However, it is easy to extend the decision tree to produce a
target function with more than two possible output values.
o Although it is less common, the algorithm can also be extended to
produce a target function with real-valued outputs.
 Disjunctive descriptions may be required.
o Decision trees naturally represent disjunctive expressions.
 The training data may contain errors.
o Errors in the classification of examples, or in the attribute values
describing those examples are handled well by decision trees,
making them a robust learning method.
 The training data may contain missing attribute values.
o Decision tree methods can be used even when some training
examples have unknown values (e.g., humidity is known for
only a fraction of the examples).


After a decision tree learns classification rules, it can also be re-represented as a set of if-then rules in order to improve readability.

How does the Decision Tree algorithm Work?
The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more
sub- nodes. The creation of sub-nodes increases the homogeneity of resultant
sub-nodes. In other words, we can say that the purity of the node increases with
respect to the target variable. The decision tree splits the nodes on all available
variables and then selects the split which results in most homogeneous sub-
nodes.

Tree Building: Decision tree learning is the construction of a decision tree from
class- labeled training tuples. A decision tree is a flow-chart-like structure, where
each internal (non-leaf) node denotes a test on an attribute, each branch
represents the outcome of a test, and each leaf (or terminal) node holds a class
label. The topmost node in a tree is the root node. There are many specific
decision-tree algorithms. Notable ones include the following.
ID3 → (extension of D3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square automatic interaction detection Performs multi-level splits
when computing classification trees)
MARS → (multivariate adaptive regression splines): Extends decision trees to
handle numerical data better
Conditional Inference Trees → Statistics-based approach that uses non-
parametric tests as splitting criteria, corrected for multiple testing to avoid over
fitting.

The ID3 algorithm builds decision trees using a top-down greedy search
approach through the space of possible branches with no backtracking. A greedy
algorithm, as the name suggests, always makes the choice that seems to be the
best at that moment.

In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree. This algorithm compares the values of root
attribute with the record (real dataset) attribute and, based on the comparison,
follows the branch and jumps to the next node.


For the next node, the algorithm again compares the attribute value with the
other sub- nodes and move further. It continues the process until it reaches the
leaf node of the tree. The complete process can be better understood using the
below algorithm:
 Step-1: Begin the tree with the root node, says S, which contains the
complete dataset.
 Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
 Step-3: Divide the S into subsets that contains possible values for
the best attributes.
 Step-4: Generate the decision tree node, which contains the best attribute.
 Step-5: Recursively make new decision trees using the subsets of the
dataset created in
 Step -6: Continue this process until a stage is reached where you cannot
further classify the nodes and called the final node as a leaf node.
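The same procedure is available in standard libraries. Below is a minimal sketch (our own example, not from the notes) that grows a decision tree with scikit-learn using entropy, i.e. information gain, as the attribute selection measure and prints the learned rules:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# criterion="entropy" selects splits by information gain, as in ID3/C4.5
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
tree.fit(X_train, y_train)

print(export_text(tree, feature_names=["sepal length", "sepal width",
                                       "petal length", "petal width"]))
print("Test accuracy:", tree.score(X_test, y_test))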
Entropy:
Entropy is a measure of the randomness in the information
being processed. The higher the entropy, the harder it is to
draw any conclusions from that information. Flipping a coin is
an example of an action that provides information that is
random.
From the graph, it is quite evident that the entropy H(X) is zero when the
probability is either 0 or 1. The Entropy is maximum when the probability is 0.5
because it projects perfect randomness in the data and there is no chance if
perfectly determining the outcome.
Information Gain
Information gain or IG is a statistical property that
measures how well a given attribute separates the
training examples according to their target
classification. Constructing a decision tree is all about
finding an attribute that returns the highest
information gain and the smallest entropy.
ID3 follows the rule — A branch with an entropy of zero is a leaf node and A
branch with entropy more than zero needs further splitting.
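A minimal sketch (our own helper functions, assuming a two-class problem with p positive and n negative examples per node) of how entropy and information gain are computed:

from math import log2

def entropy(p, n):
    """Entropy of a node with p positive and n negative examples."""
    total = p + n
    value = 0.0
    for count in (p, n):
        if count:                      # treat 0 * log2(0) as 0
            frac = count / total
            value -= frac * log2(frac)
    return value

def information_gain(parent, branches):
    """parent = (p, n); branches = [(p_i, n_i), ...] produced by a split."""
    total = sum(p + n for p, n in branches)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - remainder

# Example: 9 positive / 5 negative examples split into three branches.
print(round(information_gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # 0.247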

Hypothesis space search in decision tree learning:

In order to derive the Hypothesis space, we compute the Entropy and Information
Gain of Class and attributes. For them we use the following statistics formulae:


Entropy of the Class (with P positive and N negative examples in total):

    Entropy(Class) = - ( P / (P + N) ) log2( P / (P + N) )  -  ( N / (P + N) ) log2( N / (P + N) )

For any attribute, if its i-th value covers pi positive and ni negative examples:

    I(pi, ni) = - ( pi / (pi + ni) ) log2( pi / (pi + ni) )  -  ( ni / (pi + ni) ) log2( ni / (pi + ni) )

Entropy of an Attribute:

    Entropy(Attribute) = Σi ( (pi + ni) / (P + N) ) * I(pi, ni)

Information Gain of the Attribute:

    Gain = Entropy(Class) - Entropy(Attribute)

Illustrative Example: Concept: “Play Tennis”:

Data set: the 14-example Play Tennis data (the full table is reproduced in the worked example appended after this unit).


Basic algorithm for inducing a decision tree from training tuples:


Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
    Data partition D, which is a set of training tuples and their associated class labels;
    attribute_list, the set of candidate attributes;
    Attribute_selection_method, a procedure to determine the splitting criterion that "best"
    partitions the data tuples into individual classes. This criterion consists of a
    splitting_attribute and, possibly, either a split point or a splitting subset.

Output: A decision tree.

Method:
(1)  create a node N;
(2)  if tuples in D are all of the same class C, then
         return N as a leaf node labeled with the class C;
(3)  if attribute_list is empty then
         return N as a leaf node labeled with the majority class in D;    // majority voting
(4)  apply Attribute_selection_method(D, attribute_list) to find the "best" splitting criterion;
(5)  label node N with the splitting criterion;
(6)  if splitting_attribute is discrete-valued and multiway splits allowed then    // not restricted to binary trees
(7)      attribute_list = attribute_list - splitting_attribute;
(8)  for each outcome j of the splitting criterion    // partition the tuples and grow subtrees for each partition
(9)      let Dj be the set of data tuples in D satisfying outcome j;    // a partition
(10)     if Dj is empty then
             attach a leaf labeled with the majority class in D to node N;
         else
             attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
(11) return N;

Advantages of Decision Tree:


 Simple to understand and interpret. People are able to understand
decision tree models after a brief explanation.


 Requires little data preparation. Other techniques often require data normalization, dummy variables need to be created and blank values to be removed.
 Able to handle both numerical and categorical data. Other techniques are
usually specialized in analysing datasets that have only one type of variable.
(For example, relation rules can be used only with nominal variables while
neural networks can be used only with numerical variables.)
 Uses a white box model. If a given situation is observable in a model the
explanation for the condition is easily explained by Boolean logic. (An
example of a black box model is an artificial neural network since the
explanation for the results is difficult to understand.)
 Possible to validate a model using statistical tests. That makes it possible to
account for the reliability of the model.
 Robust: Performs well with large datasets. Large amounts of data can be
analyzed using standard computing resources in reasonable time.
Tools used to make Decision Tree:
Many data mining software packages provide implementations of one or more
decision tree algorithms. Several examples include:
 SAS Enterprise Miner
 Matlab
 R (an open source software environment for statistical computing which
includes several CART implementations such as rpart, party and random
Forest packages)
 Weka (a free and open-source data mining suite, contains many
decision tree algorithms)
 Orange (a free data mining software suite, which includes the tree module
orngTree)
 KNIME
 Microsoft SQL Server
 Scikit-learn (a free and open-source machine learning library for the
Python programming language).
 Salford Systems CART (which licensed the proprietary code of the original
CART authors)
 IBM SPSS Modeler
 Rapid Miner


Multiple Decision Trees:


Classification & Regression Trees:
 Classification and regression trees is a term used to describe decision tree
algorithms that are used for classification and regression learning tasks.
 The Classification and Regression Tree methodology, also known as the CART
were introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen,
and Charles Stone.

Classification Trees:
A classification tree is an algorithm where
the target variable is fixed or categorical.
The algorithm is then used to identify
the “class” within which a target variable
would most likely fall.
 An example of a classification-type
problem would be determining who will or
will not subscribe to a digital platform;
or who will or will not graduate from high
school.
 These are examples of simple binary classifications where the categorical
dependent variable can assume only one of two, mutually exclusive values.


Regression Trees
 A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value.
 As an example of a regression type
problem, you may want to predict the
selling prices of a residential house,
which is a continuous dependent
variable.
 This will depend on both continuous
factors like square footage as well as
categorical factors.

Difference Between Classification and Regression Trees


 Classification trees are used when the dataset needs to be split into classes that belong to the response variable. In many cases, the classes are Yes or No.
 In other words, they are just two and mutually exclusive. In some cases, there
may be more than two classes in which case a variant of the classification
tree algorithm is used.
 Regression trees, on the other hand, are used when the response variable is
continuous. For instance, if the response variable is something like the price
of a property or the temperature of the day, a regression tree is used.
 In other words, regression trees are used for prediction-type problems while
classification trees are used for classification-type problems.

CART: CART stands for Classification And Regression Tree.


 CART algorithm was introduced in Breiman et al. (1984). A CART tree is a
binary decision tree that is constructed by splitting a node into two child
nodes repeatedly, beginning with the root node that contains the whole
learning sample. The CART growing method attempts to maximize within-
node homogeneity.
 The extent to which a node does not represent a homogenous subset of cases
is an indication of impurity. For example, a terminal node in which all cases

Downloaded by KANNAN S
RK VIGNAN VITS – CSE 16 | P a g e
DATA ANALYTICS UNIT-4
have the same

Downloaded by KANNAN S
DATA ANALYTICS UNIT-4

value for the dependent variable is a homogenous node that requires no further
splitting because it is "pure." For categorical (nominal, ordinal) dependent
variables the common measure of impurity is Gini, which is based on squared
probabilities of membership for each category. Splits are found that
maximize the homogeneity of child nodes with respect to the value of the
dependent variable.

Decision tree pruning:


Pruning is a data compression technique in machine learning and search algorithms
that reduces the size of decision trees by removing sections of the tree that are
non-critical and redundant to classify instances. Pruning reduces the complexity
of the final classifier, and hence improves predictive accuracy by the reduction
of overfitting.

One of the questions that arises in a decision tree algorithm is the optimal size of
the final tree. A tree that is too large risks overfitting the training data and poorly
generalizing to new samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell when a tree
algorithm should stop because it is impossible Before and After pruning to tell if
the addition of a single extra node will dramatically decrease error. This problem
is known as the horizon effect. A common strategy is to grow the tree until each
node contains a small number of instances then use pruning to remove nodes

that do not provide additional information. Pruning should reduce the size of a
learning tree without reducing predictive accuracy as measured by a cross-
validation set. There are many techniques for tree pruning that differ in the
measurement that is used to optimize performance.

Pruning Techniques:
Pruning processes can be divided into two types: PrePruning & Post Pruning
 Pre-pruning procedures prevent a complete induction of the training set by replacing a stopping criterion in the induction algorithm (e.g. maximum tree depth, or information gain(Attr) > minGain). They are considered to be more efficient because they do not induce an entire tree, but rather trees remain small from the start.
 Post-Pruning (or just pruning) is the most common way of simplifying
trees. Here, nodes and subtrees are replaced with leaves to reduce
complexity.
The procedures are differentiated on the basis of their approach in the tree: Top-
down approach & Bottom-Up approach

Bottom-up pruning approach:


 These procedures start at the last node in the tree (the lowest point).
 Following recursively upwards, they determine the relevance of each
individual node.
 If the relevance for the classification is not given, the node is dropped or
replaced by a leaf.
 The advantage is that no relevant sub-trees can be lost with this method.
 These methods include Reduced Error Pruning (REP), Minimum Cost
Complexity Pruning (MCCP), or Minimum Error Pruning (MEP).

Top-down pruning approach:


 In contrast to the bottom-up method, this method starts at the root of the
tree. Following the structure below, a relevance check is carried out which
decides whether a node is relevant for the classification of all n items or
not.
 By pruning the tree at an inner node, it can happen that an entire sub-tree
(regardless of its relevance) is dropped.
 One of these representatives is pessimistic error pruning (PEP), which
brings quite good results with unseen items.
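As a concrete illustration, scikit-learn exposes Minimum Cost-Complexity Pruning (one of the post-pruning methods mentioned above) through the ccp_alpha parameter. The sketch below is our own example; in practice the alpha value should be chosen with cross-validation rather than a single split:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# A fully grown tree (no pruning) tends to overfit the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Candidate alphas for cost-complexity pruning; pick the best on the validation split.
alphas = full.cost_complexity_pruning_path(X_train, y_train).ccp_alphas
best = max(alphas, key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
           .fit(X_train, y_train).score(X_val, y_val))

pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best).fit(X_train, y_train)
print("Unpruned:", full.tree_.node_count, "nodes, validation accuracy",
      round(full.score(X_val, y_val), 3))
print("Pruned  :", pruned.tree_.node_count, "nodes, validation accuracy",
      round(pruned.score(X_val, y_val), 3))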


CHAID:
 CHAID stands for CHI-squared Automatic Interaction Detector. Morgan and
Sonquist (1963) proposed a simple method for fitting trees to predict a
quantitative variable.
 Each predictor is tested for splitting as follows: sort all the n cases on the
predictor and examine all n-1 ways to split the cluster in two. For each
possible split, compute the within-cluster sum of squares about the mean of
the cluster on the dependent variable.
 Choose the best of the n-1 splits to represent the predictor’s contribution.
Now do this for every other predictor. For the actual split, choose the
predictor and its cut point which yields the smallest overall within-cluster
sum of squares. Categorical predictors require a different approach. Since
categories are unordered, all possible splits between categories must be
considered. For deciding on one split of k categories into two groups, this
means that 2k-1 possible splits must be considered. Once a split is found, its
suitability is measured on the same within-cluster sum of squares as for a
quantitative predictor.
 It has to do instead with conditional discrepancies. In the analysis of
variance, interaction means that a trend within one level of a variable is not
parallel to a trend within another level of the same variable. In the ANOVA
model, interaction is represented by cross-products between predictors.
 In the tree model, it is represented by branches from the same nodes which
have different splitting predictors further down the tree. Regression trees
parallel regression/ANOVA modeling, in which the dependent variable is
quantitative. Classification trees parallel discriminant analysis and
algebraic classification methods. Kass (1980) proposed a modification to
AID called CHAID for categorized dependent and independent variables. His
algorithm incorporated a sequential merge and split procedure based on a
chi-square test statistic.
 Kass’s algorithm is like sequential cross-tabulation. For each predictor:
1) cross tabulate the m categories of the predictor with the k
categories of the dependent variable.
2) find the pair of categories of the predictor whose 2xk sub-table is
least significantly different on a chi-square test and merge these two
categories.
3) if the chi-square test statistic is not “significant” according to a
preset critical value, repeat this merging process for the selected

predictor until no non-significant chi-square is found for a sub-table, and pick the predictor variable whose chi-square is largest and split the sample into subsets, where
l is the number of categories resulting from the merging process on
that predictor.
4) Continue splitting, as with AID, until no “significant” chi-squares
result. The CHAID algorithm saves some computer time, but it is not
guaranteed to find the splits which predict best at a given step.
 Only by searching all possible category subsets can we do that. CHAID is
also limited to categorical predictors, so it cannot be used for quantitative or
mixed categorical quantitative models.

GINI Index Impurity Measure:


 GINI Index Used by the CART (classification and regression tree) algorithm,
Gini impurity is a measure of how often a randomly chosen element from
the set would be incorrectly labeled if it were randomly labeled according to
the distribution of labels in the subset. Gini impurity can be computed by
summing the probability fi of each item being chosen times the probability
1-fi of a mistake in categorizing that item.
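A small sketch (our own example) of this formula, using Gini = Σ fi (1 - fi) = 1 - Σ fi², where fi is the fraction of items of class i:

def gini_impurity(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    fractions = [labels.count(c) / total for c in set(labels)]
    return 1.0 - sum(f * f for f in fractions)

print(round(gini_impurity(["yes"] * 9 + ["no"] * 5), 3))   # mixed node -> 0.459
print(gini_impurity(["yes"] * 10))                          # pure node  -> 0.0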

Overfitting and Underfitting

 Let’s clearly understand overfitting, underfitting and perfectly fit models.


 From the three graphs shown above, one can clearly understand that the
leftmost figure line does not cover all the data points, so we can say that
the model is under- fitted. In this case, the model has failed to generalize
the pattern to the new dataset, leading to poor performance on testing.
The under-fitted model can be easily seen as it gives very high errors on
both training and testing data. This is because the dataset is not clean and
contains noise, the model has High Bias, and the size of the training data is
not enough.
 When it comes to the overfitting, as shown in the rightmost graph, it shows
the model is covering all the data points correctly, and you might think this
is a perfect fit. But actually, no, it is not a good fit! Because the model
learns too many details from the dataset, it also considers noise. Thus, it
negatively affects the new data set; not every detail that the model has
learned during training needs also apply to the new data points, which
gives a poor performance on the testing or validation dataset. This is because the model has trained itself in a very complex
manner and has high variance.

 The best fit model is shown by the middle graph, where both training and
testing (validation) loss are minimum, or we can say training and testing
accuracy should be near each other and high in value.


Time Series Methods:


 Time series forecasting focuses on analyzing data changes across equally
spaced time intervals.
 Time series analysis is used in a wide variety of domains, ranging from
econometrics to geology and earthquake prediction; it’s also used in
almost all applied sciences and engineering.
 Time-series databases are highly popular and provide a wide spectrum of
numerous applications such as stock market analysis, economic and sales
forecasting, budget analysis, to name a few.
 They are also useful for studying natural phenomena like atmospheric
pressure, temperature, wind speeds, earthquakes, and medical prediction
for treatment.
 Time series data is data that is observed at different points in time. Time
Series Analysis finds hidden patterns and helps obtain useful insights from
the time series data.
 Time Series Analysis is useful in predicting future values or detecting
anomalies from the data. Such analysis typically requires many data
points to be present in the dataset to ensure consistency and reliability.
 The different types of models and analyses that can be created through
time series analysis are:
o Classification: To Identify and assign categories to the data.
o Curve fitting: Plot the data along a curve and study the relationships
of variables present within the data.
o Descriptive analysis: Help Identify certain patterns in time-series
data such as trends, cycles, or seasonal variation.
o Explanative analysis: To understand the data and its relationships,
the dependent features, and cause and effect and its tradeoff.
o Exploratory analysis: Describe and focus on the main characteristics
of the time series data, usually in a visual format.
o Forecasting: Predicting future data based on historical trends. Using
the historical data as a model for future data and predicting
scenarios that could happen along with the future plot points.
o Intervention analysis: The Study of how an event can change the data.
o Segmentation: Splitting the data into segments to discover the
underlying properties from the source information.


Components of Time Series:


Long term trend – The smooth long term direction of time series where the data
can increase or decrease in some pattern.
Seasonal variation – Patterns of change in a time series within a year which tends to
repeat every year.
Cyclical variation – Its much alike seasonal variation but the rise and fall of time
series over periods are longer than one year.
Irregular variation – Any variation that is not explainable by any of the three above
mentioned components. They can be classified into – stationary and non –
stationary variation.
Stationary Variation: When the data neither increases nor decreases, i.e. it is completely random, it is called stationary variation.
Non-Stationary Variation: When the data has some explainable portion remaining and can be analyzed further, such a case is called non-stationary variation.
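These components can be separated automatically. Below is a minimal sketch (our own example on a synthetic monthly series) using STL, the Seasonal-Trend decomposition based on Loess, as implemented in statsmodels:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic monthly series: long-term trend + yearly seasonality + irregular noise.
rng = np.random.default_rng(0)
index = pd.date_range("2015-01-01", periods=72, freq="MS")
values = 2 * np.arange(72) + 10 * np.sin(2 * np.pi * np.arange(72) / 12) + rng.normal(0, 2, 72)
series = pd.Series(values, index=index)

result = STL(series, period=12).fit()
print(result.trend.head())       # long-term trend component
print(result.seasonal.head(12))  # repeating seasonal pattern
print(result.resid.head())       # irregular (residual) variation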

ARIMA & ARMA:


What is ARIMA?
 In time series analysis, ARIMA stands for AutoRegressive Integrated Moving Average. An ARIMA model is a generalization of an autoregressive moving average (ARMA) model. These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
 They are applied in cases where the data show evidence of non-stationarity.
 A popular and very widely used statistical method for time series forecasting and analysis is the ARIMA model.


 It is a class of models that capture a spectrum of different standard temporal structures present in time series data. By implementing an
ARIMA model, you can forecast and analyze a time series using past
values, such as predicting future prices based on historical earnings.
 Univariate models such as these are used to understand better a single
time- dependent variable present in the data, such as temperature over
time. They predict future data points of and from the variables.
 Here, an initial differencing step (corresponding to the "integrated" part of the model) can be applied to reduce the non-stationarity. A standard notation used for describing ARIMA uses the parameters p, d and q.
 Non-seasonal ARIMA models are generally denoted ARIMA(p, d, q) where
parameters p, d, and q are non-negative integers, p is the order of the
Autoregressive model, d is the degree of differencing, and q is the order of
the Moving-average model.
 The parameters are substituted with an integer value to indicate the
specific ARIMA model being used quickly. The parameters of the ARIMA
model are further described as follows:

o p: Stands for the number of lag observations included in the model, also known as the lag order.
o d: The number of times the raw observations are differentiated, also
called the degree of differencing.
o q: Is the size of the moving average window and also called the
order of moving average.
Univariate stationary processes (ARMA)
A covariance stationary process is an ARMA(p, q) process of autoregressive order p and moving average order q if it can be written as

    yt = c + ϕ1 yt-1 + … + ϕp yt-p + et + θ1 et-1 + … + θq et-q

where et is a white-noise error term.

The acronym ARIMA stands for Auto-Regressive Integrated Moving Average.


Lags of the stationarized series in the forecasting equation are called
"autoregressive" terms, lags of the forecast errors are called "moving average"
terms, and a time series which needs to be differenced to be made stationary is
said to be an "integrated" version of a stationary series. Random-walk and
random-trend models, autoregressive models, and exponential smoothing
models are all special cases of ARIMA models.

A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:

 p is the number of autoregressive terms,


 d is the number of nonseasonal differences needed for stationarity, and
 q is the number of lagged forecast errors in the prediction equation.

The forecasting equation is constructed as follows. First, let y denote the dth
difference of Y, which means:

If d=0: yt = Yt

If d=1: yt = Yt - Yt-1

If d=2: yt = (Yt - Yt-1) - (Yt-1 - Yt-2) = Yt - 2Yt-1 + Yt-2

Note that the second difference of Y (the d=2 case) is not the difference from 2
periods ago. Rather, it is the first-difference-of-the-first difference, which is the
discrete analog of a second derivative, i.e., the local acceleration of the series
rather than its local trend.

In terms of y, the general forecasting equation is:

ŷt = μ + ϕ1 yt-1 +…+ ϕp yt-p - θ1et-1 -…- θqet-q
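In practice such a model is rarely fitted by hand. The following is a minimal sketch (our own example) using statsmodels; the order (1, 1, 1) is purely illustrative and would normally be chosen by inspecting ACF/PACF plots or an information criterion:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# A simple non-stationary (trending) series with noise.
rng = np.random.default_rng(1)
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)))

model = ARIMA(y, order=(1, 1, 1))   # p=1 AR term, d=1 difference, q=1 MA term
fitted = model.fit()

print(fitted.params)                # estimated AR, MA and variance parameters
print(fitted.forecast(steps=10))    # forecasts for the next 10 periods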

Measure of Forecast Accuracy: Forecast Accuracy can be defined as the deviation of the Forecast or Prediction from the actual results.

    Error:  ei = At - Ft    (Actual demand - Forecast)

We measure Forecast Accuracy by 2 methods:

1. Mean Forecast Error (MFE). For n time periods where we have actual demand and forecast values:

    MFE = ( Σ i=1..n  ei ) / n

   Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0, the model tends to over-forecast.

2. Mean Absolute Deviation (MAD). For n time periods where we have actual demand and forecast values:

    MAD = ( Σ i=1..n  |ei| ) / n

While MFE is a measure of forecast model bias, MAD indicates the absolute size of the errors.
Uses of Forecast error:
 Forecast model bias
 Absolute size of the forecast errors
 Compare alternative forecasting models

 Identify forecast models that need adjustment
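A minimal sketch (our own example, with made-up demand figures) of computing both measures:

actual   = [100, 110, 120, 130, 140]   # actual demand
forecast = [ 98, 112, 118, 135, 137]   # forecasted demand

errors = [a - f for a, f in zip(actual, forecast)]

mfe = sum(errors) / len(errors)                  # bias: >0 under-forecast, <0 over-forecast
mad = sum(abs(e) for e in errors) / len(errors)  # average absolute size of the errors

print("MFE =", mfe, " MAD =", mad)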

ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage and
especially in data warehousing that:
 Extracts data from homogeneous or heterogeneous data sources
 Transforms the data for storing it in proper format or structure for
querying and analysis purpose
 Loads it into the final target (database, more specifically, operational data
store, data mart, or data warehouse)
Usually all the three phases execute in parallel since the data extraction takes
time, so while the data is being pulled another transformation process executes,
processing the already received data and prepares the data for loading and as
soon as there is some data ready to be loaded into the target, the data loading
kicks off without waiting for the completion of the previous phases.
ETL systems commonly integrate data from multiple applications (systems),
typically developed and supported by different vendors or hosted on separate
computer hardware. The disparate systems containing the original data are
frequently managed and operated by different employees. For example, a cost
accounting system may combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
 Anatella
 Alteryx
 CampaignRunner
 ESF Database Migration Toolkit
 InformaticaPowerCenter
 Talend
 IBM InfoSphereDataStage
 Ab Initio
 Oracle Data Integrator (ODI)
 Oracle Warehouse Builder (OWB)
 Microsoft SQL Server Integration Services (SSIS)
 Tomahawk Business Integrator by Novasoft Technologies.
 Pentaho Data Integration (or Kettle) opensource data integration framework
 Stambia


 Diyotta DI-SUITE for Modern Data Integration


 FlyData
 Rhino ETL
 SAP Business Objects Data Services
 SAS Data Integration Studio
 SnapLogic
 Clover ETL opensource engine supporting only basic partial functionality
and not server
 SQ-ALL - ETL with SQL queries from internet sources such as APIs
 North Concepts Data Pipeline
There are various steps involved in ETL. They are as below in detail:
Extract:
The Extract step covers the data extraction from the source system and makes it
accessible for further processing. The main objective of the extract step is to
retrieve all the required data from the source system with as little resources as
possible. The extract step should be designed in a way that it does not
negatively affect the source system in terms or performance, response time or
any kind of locking.
There are several ways to perform the extract:
 Update notification - if the source system is able to provide a notification
that a record has been changed and describe the change, this is the
easiest way to get the data.
 Incremental extract - some systems may not be able to provide
notification that an update has occurred, but they are able to identify
which records have been modified and provide an extract of such records.
During further ETL steps, the system needs to identify changes and
propagate it down. Note, that by using daily extract, we may not be able to
handle deleted records properly.
 Full extract - some systems are not able to identify which data has been
changed at all, so a full extract is the only way one can get the data out of
the system. The full extract requires keeping a copy of the last extract in
the same format in order to be able to identify changes. Full extract
handles deletions as well.
 When using Incremental or Full extracts, the extract frequency is
extremely important. Particularly for full extracts; the data volumes can be
in tens of gigabytes.


 Clean: The cleaning step is one of the most important as it ensures the
quality of the data in the data warehouse. Cleaning should perform basic
data unification rules, such as:
 Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
 Convert null values into a standardized Not Available/Not Provided value
 Convert phone numbers and ZIP codes to a standardized form
 Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
 Validate address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).

Transform:
 The transform step applies a set of rules to transform the data from the
source to the target.
 This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
 The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.

Load:
 During the load step, it is necessary to ensure that the load is performed
correctly and with as little resources as possible. The target of the Load
process is often a database.
 In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only after
the load completes. The referential integrity needs to be maintained by
ETL tool to ensure consistency.
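A minimal sketch of the three phases using pandas and SQLite (our own example; the file name, column names and table name are illustrative assumptions only):

import sqlite3
import pandas as pd

# Extract: pull the raw data out of the source system (here, a CSV export).
raw = pd.read_csv("sales_raw.csv")                 # hypothetical source file

# Transform: cleaning/unification rules and derived values.
raw["sex"] = raw["sex"].fillna("Unknown").replace({"M": "Male", "F": "Female"})
raw["amount_usd"] = raw["amount"] * raw["fx_rate"]             # one common currency
summary = raw.groupby("region", as_index=False)["amount_usd"].sum()

# Load: write the transformed data into the target (a small warehouse table).
with sqlite3.connect("warehouse.db") as conn:
    summary.to_sql("sales_by_region", conn, if_exists="replace", index=False)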

Managing ETL Process


The ETL process seems quite straight forward. As with every application, there is
a possibility that the ETL process fails. This can be caused by missing extracts
from one of the systems, missing values in one of the reference tables, or simply
a connection or power outage. Therefore, it is necessary to design the ETL
process keeping fail-recovery in mind.

Staging:
It should be possible to restart, at least, some of the phases independently from
the others. For example, if the transformation step fails, it should not be
necessary to restart the Extract step. We can ensure this by implementing
proper staging. Staging means that the data is simply dumped to the location
(called the Staging Area) so that it can then be read by the next processing
phase. The staging area is also used during ETL process to store intermediate
results of processing. This is fine for the ETL process, which uses the staging area for exactly this purpose. However, the staging area should be accessed by the ETL load process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation to the end-user and may contain incomplete or in-the-middle-of-the-processing data.

*** End of Unit-4 ***

Decision Tree Example
for
DATA ANALYTICS
(UNIT 4)


Hypothesis space search in decision tree learning:

In order to derive the Hypothesis space, we compute the Entropy and Information Gain of Class
and attributes. For them we use the following statistics formulae:

Entropy of the Class (with P positive and N negative examples in total):

    Entropy(Class) = - ( P / (P + N) ) log2( P / (P + N) )  -  ( N / (P + N) ) log2( N / (P + N) )

For any attribute, if its i-th value covers pi positive and ni negative examples:

    I(pi, ni) = - ( pi / (pi + ni) ) log2( pi / (pi + ni) )  -  ( ni / (pi + ni) ) log2( ni / (pi + ni) )

Entropy of an Attribute:

    Entropy(Attribute) = Σi ( (pi + ni) / (P + N) ) * I(pi, ni)

Information Gain of the Attribute:

    Gain = Entropy(Class) - Entropy(Attribute)
Illustrative Example:

Concept: “Play Tennis”:

Data set:

Day   Outlook    Temperature   Humidity   Wind     Play Tennis
D1    Sunny      Hot           High       Weak     No
D2    Sunny      Hot           High       Strong   No
D3    Overcast   Hot           High       Weak     Yes
D4    Rain       Mild          High       Weak     Yes
D5    Rain       Cool          Normal     Weak     Yes
D6    Rain       Cool          Normal     Strong   No
D7    Overcast   Cool          Normal     Strong   Yes
D8    Sunny      Mild          High       Weak     No
D9    Sunny      Cool          Normal     Weak     Yes
D10   Rain       Mild          Normal     Weak     Yes
D11   Sunny      Mild          Normal     Strong   Yes
D12   Overcast   Mild          High       Strong   Yes
D13   Overcast   Hot           Normal     Weak     Yes
D14   Rain       Mild          High       Strong   No


By using the said formulae, the decision tree is derived as follows:
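The derivation computes the entropy of the class, then the information gain of every attribute, picks the attribute with the highest gain as the root, and repeats the procedure on each branch. The short Python sketch below (our own illustration, not part of the original handout) reproduces these numbers for the data set above:

from math import log2
from collections import Counter

data = [
    ("Sunny","Hot","High","Weak","No"),          ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),      ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),       ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),      ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),    ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),    ("Rain","Mild","High","Strong","No"),
]
attributes = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

labels = [row[-1] for row in data]
print("Entropy(Class) =", round(entropy(labels), 3))          # 0.940

for i, attr in enumerate(attributes):
    values = set(row[i] for row in data)
    remainder = sum(
        (len(subset) / len(data)) * entropy([r[-1] for r in subset])
        for subset in ([r for r in data if r[i] == v] for v in values)
    )
    print(f"Gain({attr}) =", round(entropy(labels) - remainder, 3))

# Outlook has the largest gain (~0.247), so it becomes the root of the tree;
# the same procedure is then repeated on each branch.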

DATA ANALYTICS (Professional Elective - I)
Subject Code: CS513PE

NOTES MATERIAL
UNIT 5 - Data Visualization

For
B. TECH (CSE)
3rd YEAR – 1st SEM (R18)
Faculty:
B. RAVIKRISHNA

DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI

UNIT - V
Data Visualization
Syllabus:
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric
Projection Visualization Techniques, Icon-Based Visualization Techniques,
Hierarchical Visualization Techniques, Visualizing Complex Data and Relations.

Data Visualization – Topics


 Pixel-Oriented Visualization Techniques
 Geometric Projection Visualization Techniques
 Icon-Based Visualization Techniques
 Hierarchical Visualization Techniques
 Visualizing Complex Data and Relations.

Unit-5 Objectives:
1. To explore Pixel-Oriented Visualization Techniques
2. To learn Geometric Projection Visualization Techniques
3. To explore Icon-Based Visualization Techniques
4. To Learn Hierarchical Visualization Techniques
5. To understand Visualizing Complex Data and Relations

Unit-5 Outcomes:
After completion of this course, students will be able to
1. Describe Pixel-Oriented Visualization Techniques
2. Demonstrate Geometric Projection Visualization Techniques
3. Analyze Icon-Based Visualization Techniques
4. Explore Hierarchical Visualization Techniques
5. Compare approaches for Visualizing Complex Data and Relations

Data Visualization

 Data visualization is the art and practice of gathering, analyzing, and graphically representing empirical information.
 Visualizations are sometimes called information graphics, or even just charts and graphs.
 The goal of visualizing data is to tell the story in the data.
 Telling that story is predicated on understanding the data at a deep level and gathering insight from comparisons of data points.

Why data visualization?

 Gain insight into an information space by mapping data onto graphical primitives.
 Provide a qualitative overview of large data sets.
 Search for patterns, trends, structure, irregularities, and relationships among data.
 Help find interesting regions and suitable parameters for further quantitative analysis.
 Provide visual proof of computer representations derived.

Categorization of visualization methods


 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations

Pixel-Oriented Visualization Techniques

 For a data set of m dimensions, create m windows on the screen, one for each dimension.
 The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
 The colors of the pixels reflect the corresponding values.
 To save space and show the connections among multiple dimensions, space filling is often done in a circle segment. A minimal sketch of the basic idea follows below.
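A minimal pixel-oriented sketch (assuming matplotlib and numpy are available; the data set and window layout are invented for illustration). Each of the m dimensions gets its own window, and each record contributes one pixel per window whose colour encodes the value:

    import numpy as np
    import matplotlib.pyplot as plt

    # Invented data: 400 records with m = 4 dimensions, sorted by the first dimension
    # so that correlated dimensions produce visibly similar pixel patterns.
    rng = np.random.default_rng(0)
    data = rng.random((400, 4))
    data = data[np.argsort(data[:, 0])]

    fig, axes = plt.subplots(1, 4, figsize=(10, 3))
    for d, ax in enumerate(axes):
        # Lay the 400 values of dimension d out as a 20 x 20 block of pixels.
        ax.imshow(data[:, d].reshape(20, 20), cmap="viridis")
        ax.set_title(f"Dimension {d + 1}")
        ax.axis("off")
    plt.show()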

Geometric Projection Visualization Techniques


Visualization of geometric transformations and projections of the data. Methods include:
 Direct visualization
 Scatterplot and scatterplot matrices
 Landscapes
 Projection pursuit technique: helps users find meaningful projections of multidimensional data
 Prosection views
 Hyperslice
 Parallel coordinates (a sketch of scatterplot matrices and parallel coordinates follows below)
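A minimal sketch of two of these methods using pandas' built-in plotting helpers (assuming pandas and matplotlib are installed; the columns and values are invented for illustration):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from pandas.plotting import scatter_matrix, parallel_coordinates

    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "length": rng.normal(5, 1, 60),
        "width": rng.normal(3, 0.5, 60),
        "weight": rng.normal(10, 2, 60),
        "species": ["A"] * 30 + ["B"] * 30,   # class column used to colour the parallel coordinates
    })

    scatter_matrix(df[["length", "width", "weight"]], diagonal="hist")  # pairwise scatterplots
    plt.show()

    parallel_coordinates(df, "species")  # one vertical axis per dimension, one polyline per record
    plt.show()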

Line Plot:
 This is the plot you will see in almost any analysis of the relationship between two variables.
 A line plot simply connects the values of a series of data points with straight lines.
 The plot may seem very simple, but it has many applications, not only in machine learning but in many other areas.
 For example, it is used to analyze the performance of a model via the ROC-AUC curve. A basic sketch follows below.
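A minimal line-plot sketch (matplotlib assumed; the monthly sales figures are invented):

    import matplotlib.pyplot as plt

    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    sales = [120, 135, 128, 150, 162, 158]       # invented values

    plt.plot(months, sales, marker="o")          # points connected by straight lines
    plt.xlabel("Month")
    plt.ylabel("Sales (units)")
    plt.title("Line plot of monthly sales")
    plt.show()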

Bar Plot
 This is one of the most widely used plots; we see it not just in data analysis but wherever a trend across categories needs to be shown.
 It lets us visualize the data clearly and convey the details to others in a straightforward way.
 The plot is simple and clear, but on its own it is not used as frequently in data science applications as some of the plots below.
Stacked Bar Graph:

 Unlike a Multi-set Bar Graph which displays their bars side-by-side,


Stacked Bar Graphs segment their bars. Stacked Bar Graphs are used to
show how a larger category is divided into smaller categories and what
the relationship of each part has on the total amount. There are two
types of Stacked Bar Graphs:
 Simple Stacked Bar Graphs place each value for the segment after the previous one. The total value of the bar is all the segment values added together. Ideal for comparing the total amounts across each group/segmented bar.
 100% Stack Bar Graphs show the percentage-of-the-whole of each
group and are plotted by the percentage of each value to the total
amount in each group. This makes it easier to see the relative
differences between quantities in each group.
 One major flaw of Stacked Bar Graphs is that they become harder to
read the more segments each bar has. Also comparing each segment to
each other is difficult, as they're not aligned on a common baseline.
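A minimal stacked-bar sketch (matplotlib assumed; the quarterly revenue figures are invented), using the bottom= parameter to stack one segment on top of another:

    import matplotlib.pyplot as plt

    quarters = ["Q1", "Q2", "Q3", "Q4"]
    product_a = [30, 35, 40, 38]     # invented values
    product_b = [20, 25, 22, 30]

    plt.bar(quarters, product_a, label="Product A")
    plt.bar(quarters, product_b, bottom=product_a, label="Product B")  # stack B on top of A
    plt.ylabel("Revenue")
    plt.title("Simple stacked bar graph")
    plt.legend()
    plt.show()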
Scatter Plot

 It is one of the most commonly used plots for visualizing simple data in machine learning and data science.
 A scatter plot represents the dataset as points, where each point is placed according to the values of two or three features (columns).
 Scatter plots are available in both 2-D and 3-D. The 2-D scatter plot is the common one, where we primarily try to find patterns, clusters, and the separability of the data.
 Colors are assigned to different data points based on the target column of the dataset.
 That is, we can color the data points according to their class label, as in the sketch below.
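A minimal 2-D scatter sketch coloured by class label (matplotlib and numpy assumed; the two clusters are invented):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    class0 = rng.normal([2, 2], 0.5, size=(50, 2))   # invented cluster for class 0
    class1 = rng.normal([4, 4], 0.5, size=(50, 2))   # invented cluster for class 1

    plt.scatter(class0[:, 0], class0[:, 1], label="class 0")
    plt.scatter(class1[:, 0], class1[:, 1], label="class 1")
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.legend()
    plt.show()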

Box and Whisker Plot
 This plot can be used to obtain more statistical details about the data.
 The straight lines at the maximum and minimum are also called whiskers.
 Points that lie outside the whiskers are considered outliers.
 The box plot also gives us a description of the 25th, 50th, and 75th percentiles (the quartiles).
 With the help of a box plot, we can also determine the interquartile range (IQR), within which the middle 50% of the data lies.
 Box plots are a form of univariate analysis, meaning that we are exploring the data with only one variable. A sketch follows below.
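A minimal box-and-whisker sketch (matplotlib and numpy assumed; the sample is invented):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    sample = np.concatenate([rng.normal(50, 10, 200), [5, 110]])  # invented data plus two outliers

    plt.boxplot(sample)      # box = IQR (25th-75th percentile), line = median, dots = outliers
    plt.ylabel("value")
    plt.title("Box and whisker plot")
    plt.show()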

Pie Chart:
A pie chart shows how categories represent parts of a whole: the composition of something. A pie chart represents numbers as percentages, and the total sum of all segments needs to equal 100%.
 Extensively used in presentations and offices, Pie Charts help show proportions and percentages between categories, by dividing a circle into proportional segments. Each arc length represents a proportion of each category, while the full circle represents the total sum of all the data, equal to 100%.

Donut Chart:

 A donut chart is essentially a Pie Chart with the area of the centre cut out. Pie Charts are sometimes criticised for focusing readers on the proportional areas of the slices to one another and to the chart as a whole. This makes it tricky to see the differences between slices, especially when you try to compare multiple Pie Charts together.
 A Donut Chart somewhat remedies this problem by de-emphasizing the use of the area. Instead, readers focus more on reading the length of the arcs, rather than comparing the proportions between slices.
 Also, Donut Charts are more space-efficient than Pie Charts because the blank space inside a Donut Chart can be used to display information inside it.
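A minimal pie and donut sketch (matplotlib assumed; the category shares are invented). The donut is drawn by giving the wedges a width smaller than the radius:

    import matplotlib.pyplot as plt

    labels = ["Rent", "Food", "Travel", "Other"]
    shares = [40, 30, 20, 10]                      # invented percentages, summing to 100

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.pie(shares, labels=labels, autopct="%1.0f%%")            # ordinary pie chart
    ax1.set_title("Pie chart")
    ax2.pie(shares, labels=labels, wedgeprops={"width": 0.4})    # hollow centre -> donut
    ax2.set_title("Donut chart")
    plt.show()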

Marimekko Chart:

Also known as a Mosaic Plot.

 Marimekko Charts are used to visualise categorical data over a pair of
variables. In a Marimekko Chart, both axes are variable with a percentage
scale, that determines both the width and height of each segment. So
Marimekko Charts work as a kind of two- way 100% Stacked Bar Graph. This
makes it possible to detect relationships between categories and their
subcategories via the two axes.
 The main flaws of Marimekko Charts are that they can be hard to read,
especially when there are many segments. Also, it’s hard to accurately make
comparisons between each segment, as they are not all arranged next to each
other along a common baseline. Therefore, Marimekko Charts are better suited
for giving a more general overview of the data.

Icon-Based Visualization Techniques


 It uses small icons to represent multidimensional data values
 Visualization of the data values as features of icons
 Typical visualization methods
o Chernoff Faces
o Stick Figures
Chernoff Faces
 A way to display variables on a two-dimensional surface: for example, let x be eyebrow slant, y be eye size, z be nose length, etc.
 The figure shows faces produced using 10 characteristics: head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening. Each is assigned one of 10 possible values.

Stick Figure

 A census data figure showing age, income, gender, education


 A 5-piece stick figure (1 body and 4 limbs w. different angle/length)
 Age, income are indicated by position of the figure.
 Gender, education are indicated by angle/length.
 Visualization can show a texture pattern.
 2 dimensions are mapped to the display axes and the remaining
dimensions are mapped to the angle and/ or length of the limbs.

Hierarchical Visualization
Circle Packing

 Circle Packing is a variation of a Treemap that uses circles instead of rectangles. Containment within each circle represents a level in the hierarchy: each branch of the tree is represented as a circle and its sub-branches are represented as circles inside of it. The area of each circle can also be used to represent an additional arbitrary value, such as quantity or file size. Colour may also be used to assign categories or to represent another variable via different shades.
 As beautiful as Circle Packing appears, it's not as space-efficient as a Treemap, as there's a lot of empty space within the circles. Despite this, Circle Packing actually reveals hierarchical structure better than a Treemap.

Sunburst Diagram

 Also known as a Sunburst Chart, Ring Chart, Multi-level Pie Chart, Belt Chart, or Radial Treemap.
 This type of visualisation shows hierarchy through a series of rings, that
are sliced for each category node. Each ring corresponds to a level in
the hierarchy, with the central circle representing the root node and
the hierarchy moving outwards from it.

 Rings are sliced up and divided based on their hierarchical relationship
to the parent slice. The angle of each slice is either divided equally
under its parent node or can be made proportional to a value.
 Colour can be used to highlight hierarchical groupings or specific categories. A sketch follows below.
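A minimal sunburst sketch (assuming the plotly library is installed; the department/team headcounts are invented):

    import pandas as pd
    import plotly.express as px

    # Invented two-level hierarchy: department -> team, sized by headcount.
    df = pd.DataFrame({
        "department": ["Engineering", "Engineering", "Sales"],
        "team": ["Backend", "Frontend", "Domestic"],
        "headcount": [30, 20, 25],
    })
    fig = px.sunburst(df, path=["department", "team"], values="headcount")
    fig.show()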

Treemap:

 Treemaps are an alternative way of visualising the hierarchical structure of a Tree Diagram while also displaying quantities for each category via area
size. Each category is assigned a rectangle area with their subcategory
rectangles nested inside of it.
 When a quantity is assigned to a category, its area size is displayed in proportion to that quantity and to the other quantities within the same parent category in a part-to-whole relationship. Also, the area size of the parent category is the total of its subcategories. If no quantity is assigned to a subcategory, then its area is divided equally amongst the other subcategories within its parent category.
 The way rectangles are divided and ordered into sub-rectangles is
dependent on the tiling algorithm used. Many tiling algorithms have
been developed, but the "squarified algorithm" which keeps each
rectangle as square as possible is the one commonly used.
 Ben Shneiderman originally developed Treemaps as a way of
visualising a vast file directory on a computer, without taking up too
much space on the screen. This makes Treemaps a more compact and
space-efficient option for displaying hierarchies, that gives a quick
overview of the structure. Treemaps are also great at comparing the
proportions between categories via their area size.
 The downside to a Treemap is that it doesn't show the hierarchical levels as clearly as other charts that visualise hierarchical data (such as a Tree Diagram or Sunburst Diagram). A sketch using a squarified tiling library follows below.
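A minimal treemap sketch (assuming the third-party squarify package, which implements the squarified tiling algorithm, plus matplotlib; the category sizes are invented):

    import matplotlib.pyplot as plt
    import squarify   # third-party package: pip install squarify

    sizes = [500, 300, 120, 80]                      # invented quantities per category
    labels = ["Electronics", "Clothing", "Books", "Toys"]

    squarify.plot(sizes=sizes, label=labels, alpha=0.8)  # area of each rectangle is proportional to its quantity
    plt.axis("off")
    plt.title("Treemap (squarified)")
    plt.show()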

Visualizing Complex Data and Relations

 For a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time.
 Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces).
 The subspaces are visualized in a hierarchical manner.
 “Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical visualization method.
 To visualize a 6-D data set with dimensions F, X1, X2, X3, X4, X5, where we want to observe how F changes w.r.t. the other dimensions, we can fix the X3, X4, X5 dimensions to selected values and visualize the changes to F w.r.t. X1, X2.
 Most visualization techniques were designed mainly for numeric data.
 Recently, more and more non-numeric data, such as text and social networks, have become available.
 Many people on the Web tag various objects such as pictures, blog entries, and product reviews.
 A tag cloud is a visualization of statistics of user-generated tags.

 Often, in a tag cloud, tags are listed alphabetically or in a user-preferred
order.
 The importance of a tag is indicated by font size or color.

Word Cloud:

Also known as a Tag Cloud.

 A visualisation method that displays how frequently words appear in a given body of text, by making the size of each word proportional to its frequency. All the words are then arranged in a cluster or cloud of words. Alternatively, the words can also be arranged in any format: horizontal lines, columns or within a shape.
 Word Clouds can also be used to display words that have meta-data assigned
to them. For example, in a Word Cloud with all the world's country names,
the population could be assigned to each name to determine its size.

 Colour used on Word Clouds is usually meaningless and is primarily aesthetic,
but it can be used to categorise words or to display another data variable.
 Typically, Word Clouds are used on websites or blogs to depict keyword or tag
usage. Word Clouds can also be used to compare two different bodies of text
together.
 Although being simple and easy to understand, Word Clouds have some major
flaws:
 Long words are emphasised over short words.
 Words whose letters contain many ascenders and descenders may receive
more attention.
 They're not great for analytical accuracy, so used more for aesthetic reasons
instead.
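A minimal word cloud sketch (assuming the third-party wordcloud package and matplotlib; the input text is invented):

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud   # third-party package: pip install wordcloud

    text = ("data analytics visualization data chart graph data insight "
            "pattern trend cluster data visualization story")   # invented sample text

    wc = WordCloud(background_color="white").generate(text)  # word size is proportional to word frequency
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.show()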

Source: https://datavizcatalogue.com/index.html
