Data Analytics Full Notes
DATA ANALYTICS
(Professional Elective - I)
Subject Code: CS513PE
NOTES MATERIAL
UNIT 1
For
B. TECH (CSE)
3rd YEAR – 1st SEM (R18)
Faculty:
B. RAVIKRISHNA
DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI
DATA ANALYTICS UNIT–I
Prerequisites:
1. A course on “Database Management Systems”.
2. Knowledge of probability and statistics.
Course Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles and methods of statistical analysis
3. Discover interesting patterns, analyze supervised and unsupervised models and estimate the
accuracy of the algorithms.
4. To understand the various search methods and visualization techniques.
INTRODUCTION:
In the early days of computers and the Internet, there was far less data than there is today; it could easily be stored and managed by users and business enterprises on a single computer, because the total volume of data never exceeded about 19 exabytes. In the current era, however, around 2.5 quintillion bytes of data are generated every day.
Most of this data is generated from social media sites such as Facebook, Instagram and Twitter, while other sources include e-business and e-commerce transactions, hospital, school and bank data, etc. This data is impossible to manage with traditional data-storage techniques. Whether the data is generated by large-scale enterprises or by individuals, every aspect of it needs to be analysed to benefit from it. But how do we do it? Well, that's where the term 'Data Analytics' comes in.
Gather Hidden Insights – Hidden insights from data are gathered and then
analyzed with respect to business requirements.
Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals so they can take further action to grow the business.
Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
Improve Business Requirements – Analysis of data helps improve business-to-customer requirements and experience.
Data Architecture:
A data architecture should set data standards for all its data systems as a vision or a
model of the eventual interactions between those data systems.
Data architectures address data in storage and data in motion; descriptions of data
stores, data groups and data items; and mappings of those data artifacts to data
qualities, applications, locations etc.
Essential to realizing the target state, Data Architecture describes how data is
processed, stored, and utilized in a given system. It provides criteria for data
processing operations that make it possible to design data flows and also control the
flow of data in the system.
The Data Architect is typically responsible for defining the target state, aligning
during development and then following up to ensure enhancements are done in the
spirit of the original blueprint.
During the definition of the target state, the Data Architecture breaks a subject down to
the atomic level and then builds it back up to the desired form.
The Data Architect breaks the subject down by going through 3 traditional architectural
processes:
Conceptual model: It is a business model which uses the Entity Relationship (ER) model to describe the relationships between entities and their attributes.
Logical model: It is a model in which problems are represented in logical form, such as rows and columns of data, classes, XML tags and other DBMS techniques.
Physical model: The physical model holds the database design, such as which type of database technology will be suitable for the architecture.
The data architecture is thus formed by dividing it into these three essential models, which are then combined:
Various constraints and influences will have an effect on data architecture design. These include enterprise requirements, technology drivers, economics, business policies and data processing needs.
Enterprise requirements:
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed),
transaction reliability, and transparent data management.
In addition, the conversion of raw data such as transaction records and image files
into more useful information forms through such features as data warehouses is
also a common organizational requirement, since this enables managerial decision
making and other organizational processes.
One of the architecture techniques is the split between managing transaction data
and (master) reference data. Another one is splitting data capture systems from
data retrieval systems (as done in a data warehouse).
Technology drivers:
These are usually suggested by the completed data architecture and database
architecture designs.
In addition, some technology drivers will derive from existing organizational
integration frameworks and standards, organizational economics, and existing site
resources (e.g. previously purchased software licensing).
Economics:
These are also important factors that must be considered during the data
architecture phase. It is possible that some solutions, while optimal in principle, may
not be potential candidates due to their cost.
External factors such as the business cycle, interest rates, market conditions, and
legal considerations could all have an effect on decisions relevant to data
architecture.
Business policies:
Business policies that also drive data architecture design include internal
organizational policies, rules of regulatory bodies, professional standards, and
applicable governmental laws that can vary by applicable agency.
These policies and rules help describe the manner in which the enterprise wishes to process its data.
Data processing needs:
These include accurate and reproducible transactions performed in high volumes, data warehousing for the support of management information systems (and potential data mining), repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives as required (e.g. annual budgets, new product development).
The General Approach is based on designing the Architecture at three Levels of
Specification.
The Logical Level
1. Primary data:
The data which is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by performing techniques such as the following:
2. Survey method:
The survey method is a research process in which a list of relevant questions is asked and the answers are recorded in the form of text, audio, or video.
Surveys can be conducted both online and offline, for example through website forms and email, and the survey responses are then stored for data analysis. Examples are online surveys or surveys through social media polls.
3. Observation method:
The observation method is a method of data collection in which the researcher
keenly observes the behaviour and practices of the target audience using some
data collecting tool and stores the observed data in the form of text, audio, video, or
any raw formats.
In this method, the data is collected by directly observing the participants rather than by posing questions to them. For example, observing a group of customers and their behaviour towards the products. The data obtained is then sent for processing.
4. Experimental method:
The experimental method is the process of collecting data through performing
experiments, research, and investigation.
The most frequently used experiment methods are CRD, RBD, LSD, FD.
CRD – Completely Randomized Design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing experiments.
RBD – Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks. Random experiments are performed on each of the blocks and results are drawn using a technique known as analysis of variance (ANOVA). RBD originated in the agricultural sector.
Randomized Block Design - The Term Randomized Block Design has originated from
agricultural research. In this design several treatments of variables are applied to
different blocks of land to ascertain their effect on the yield of the crop.
Blocks are formed in such a manner that each block contains as many plots as a
number of treatments so that one plot from each is selected at random for each
treatment. The production of each plot is measured after the treatment is given.
These data are then interpreted and inferences are drawn by using the analysis of variance technique, so as to know the effect of various treatments such as different doses of fertilizers, different types of irrigation, etc.
LSD – Latin Square Design is an experimental design that is similar to CRD and RBD
blocks but contains rows and columns.
It is an arrangement of N x N squares with an equal number of rows and columns, in which each letter occurs only once in each row and each column. Hence the differences can be easily found with fewer errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
A Latin square is one of the experimental designs which has a balanced two-way classification scheme, for example a 4 x 4 arrangement. In this scheme each letter from A to D occurs only once in each row and only once in each column.
The Latin square is probably underused in most fields of research because textbook examples tend to be restricted to agriculture, the area which spawned most of the original work on ANOVA. Agricultural examples often reflect geographical designs where rows and columns are literally two dimensions of a grid in a field.
Rows and columns can be any two sources of variation in an experiment. In this sense a Latin square is a generalisation of a randomized block design with two different blocking systems.
A B C D
B C D A
C D A B
D A B C
The balanced arrangement achieved in a Latin square is its main strength. In this design, the comparisons among treatments will be free from both differences between rows and columns. Thus, the magnitude of error will be smaller than in any other design.
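As a small illustration, the 4 x 4 cyclic Latin square shown above can be generated in a few lines of base R (a purely illustrative sketch):

# Construct the 4 x 4 cyclic Latin square shown above
n <- 4
square <- matrix(LETTERS[(outer(0:(n - 1), 0:(n - 1), "+") %% n) + 1], nrow = n)
print(square)   # each letter appears exactly once in every row and every column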
FD – Factorial Design is an experimental design in which each experiment involves two or more factors, each with several possible values (levels), and trials are performed over the combinations of these factor levels. This design allows the experimenter to test two or more variables simultaneously. It also measures the interaction effects of the variables and analyses the impact of each of the variables. In a true experiment, randomization is essential so that the experimenter can infer cause and effect without any bias.
2. Secondary data:
Secondary data is data which has already been collected and is reused again for some valid purpose. This type of data was previously recorded from primary data, and it has two types of sources, namely internal sources and external sources.
Internal source:
These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumption are lower when obtaining data from internal sources.
Accounting resources- This gives so much information which can be used by the
marketing researcher. They give information about internal factors.
Sales Force Report- It gives information about the sales of a product. The
information provided is from outside the organization.
Internal Experts- These are people who are heading the various departments.
They can give an idea of how a particular thing is working.
Miscellaneous Reports- This is the information obtained from operational reports. If the data available within the organization are unsuitable or inadequate, the marketer should extend the search to external secondary data sources.
External source:
Data which cannot be found from internal sources and is obtained through external third-party resources is external source data. The cost and time consumption are greater because this involves a huge amount of data. Examples of external sources are
Government publications, news publications, Registrar General of India, planning
commission, international labour bureau, syndicate services, and other non-governmental
publications.
1. Government Publications-
Government sources provide an extremely rich pool of data for the researchers. In
addition, many of these data are available free of cost on internet websites. There
are a number of government agencies generating data, for example:
Registrar General of India – This office generates demographic data. It includes details of gender, age, occupation, etc.
2. Central Statistical Organization-
This organization publishes the national accounts statistics. It contains estimates
of national income for several years, growth rate, and rate of major economic
activities. Annual survey of Industries is also published by the CSO.
It gives information about the total number of workers employed, production
units, material used and value added by the manufacturer.
3. Director General of Commercial Intelligence-
This office operates from Kolkata. It gives information about foreign trade i.e.
import and export. These figures are provided region-wise and country-wise.
4. Ministry of Commerce and Industries-
These syndicate services provide information from both households and institutions.
and queries searched mostly.
Export all the data onto the cloud, such as Amazon Web Services S3.
We usually export our data to the cloud for purposes like safety, multiple access and real-time simultaneous analysis.
Data Management:
Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively. The goal of data management is to help people,
organizations, and connected things optimize the use of data within the bounds of policy
and regulation so that they can make decisions and take actions that maximize the
benefit to the organization.
Managing digital data in an organization involves a broad range of tasks, policies,
procedures, and practices. The work of data management has a wide scope, covering
factors such as how to:
Create, access, and update data across a diverse data tier
Store data across multiple clouds and on premises
Provide high availability and disaster recovery
Use data in a growing variety of apps, analytics, and algorithms
Ensure data privacy and security
Archive and destroy data in accordance with retention schedules and compliance
requirements
What is Cloud Computing?
Cloud computing refers to storing and accessing data over the internet. It doesn't store any data on the hard disk of your personal computer; instead, in cloud computing you access data from a remote server.
Service models of cloud computing are the reference models on which cloud computing is based. These can be categorized into three basic service models, as listed below:
1. INFRASTRUCTURE as a SERVICE (IaaS)
IaaS provides access to fundamental resources such as physical machines, virtual
machines, virtual storage, etc.
2. PLATFORM as a SERVICE (PaaS)
PaaS provides the runtime environment for applications, development & deployment tools,
etc.
3. SOFTWARE as a SERVICE (SaaS)
The SaaS model allows end users to use software applications as a service.
AWS is one of the popular platforms for providing the above service models, and Amazon Cloud (Web) Services is a popular service platform for data management.
Amazon Cloud (Web) Services Tutorial
What is AWS?
The full form of AWS is Amazon Web Services. It is a platform that offers flexible, reliable,
scalable, easy-to-use and, cost-effective cloud computing solutions.
AWS is a comprehensive, easy-to-use computing platform offered by Amazon. The platform is developed with a combination of infrastructure as a service (IaaS), platform as a service (PaaS) and packaged software as a service (SaaS) offerings.
History of AWS
2002 – AWS services launched
2006 – Launched its cloud products
2012 – Held its first customer event
2015 – Revealed revenues of $4.6 billion
2016 – Surpassed the $10 billion revenue target
2016 – Released Snowball and Snowmobile
2019 – Offered nearly 100 cloud services
2021 – AWS comprises over 200 products and services
Creating a Bucket in Amazon S3:
Step 1 − Open the Amazon S3 console using this link − https://console.aws.amazon.com/s3/home
Step 2 − Create a Bucket using the following steps.
A prompt window will open. Click the Create Bucket button at the bottom of the
page.
Create a Bucket dialog box will open. Fill the required details and click the Create
button.
The bucket is created successfully in Amazon S3. The console displays the list of buckets and their properties.
Select the Static Website Hosting option. Click the radio button Enable website hosting and fill in the required details.
Click the Add files option. Select the files which are to be uploaded from the system and then click the Open button.
Click the Start Upload button. The files will be uploaded into the bucket.
Afterwards, we can create, edit, modify and update the objects and other files in a wide range of formats.
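The same upload can also be scripted from R. Below is a minimal, hedged sketch using the cloudyr 'aws.s3' package; the bucket name, file name and credentials are placeholders, not values from these notes:

# install.packages("aws.s3")   # cloudyr package that wraps the Amazon S3 API
library(aws.s3)

# Credentials are read from environment variables (placeholders shown here)
Sys.setenv("AWS_ACCESS_KEY_ID"     = "<your-access-key>",
           "AWS_SECRET_ACCESS_KEY" = "<your-secret-key>",
           "AWS_DEFAULT_REGION"    = "us-east-1")

put_bucket("my-analytics-bucket")                 # create a bucket
put_object(file   = "sales.csv",                  # local file to upload
           object = "raw/sales.csv",              # key inside the bucket
           bucket = "my-analytics-bucket")
bucketlist()                                      # list the buckets in the account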
Amazon S3 Features
Low cost and easy to use − Using Amazon S3, the user can store a large amount of data at very low charges.
Secure − Amazon S3 supports data transfer over SSL and the data gets
encrypted automatically once it is uploaded. The user has complete control over
their data by configuring bucket policies using AWS IAM.
Scalable − Using Amazon S3, there need not be any worry about storage
concerns. We can store as much data as we have and access it anytime.
Higher performance − Amazon S3 is integrated with Amazon CloudFront, which distributes content to end users with low latency and provides high data transfer speeds without any minimum usage commitments.
Integrated with AWS services − Amazon S3 is integrated with AWS services including Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.
Data Quality:
OUTLIERS:
Detect Outliers:
The most commonly used method to detect outliers is visualization. We use various visualization methods, like box-plot, histogram and scatter plot (above, we have used box plot and scatter plot for visualization).
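For example, a box-plot based check can be done in R as follows (the marks vector is made up for illustration):

# Illustrative data: one value (150) is far outside the rest
marks <- c(52, 55, 57, 60, 61, 63, 65, 68, 70, 150)
boxplot(marks, main = "Box-plot of marks")   # the outlier appears as a separate point
boxplot.stats(marks)$out                     # values flagged as outliers by the box-plot rule

# Equivalent 1.5 * IQR rule applied manually
q   <- quantile(marks, c(0.25, 0.75))
iqr <- IQR(marks)
marks[marks < q[1] - 1.5 * iqr | marks > q[2] + 1.5 * iqr]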
Rejection:
Rejection of outliers is more acceptable in areas of practice where the underlying
model of the process being measured and the usual distribution of measurement
error are confidently known.
An outlier resulting from an instrument reading error may be excluded but it is
desirable that the reading is at least verified.
Missing Values
Missing data in the training data set can reduce the power / fit of a model, or can lead to a biased model because we have not analyzed the behavior and relationship with other variables correctly. It can lead to wrong prediction or classification.
Data Pre-processing:
Preprocessing in Data Mining: Data preprocessing is a data mining technique which is used to transform raw data into a useful and efficient format.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning
is done. It involves handling of missing data, noisy data etc.
(a). Missing Data:
This situation arises when some values are missing in the data. It can be handled in various ways. Some of them are:
1. Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.
1. Normalization:
Normalization is a technique often applied as part of data preparation in data analytics and machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values. Not every dataset requires normalization for machine learning; it is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to 1.0).
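A minimal R sketch of min-max normalization (the income values are made up for illustration):

# Min-max normalization: rescale a numeric column to the range 0.0 - 1.0
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
income <- c(12000, 35000, 50000, 98000)
normalize(income)        # returns values between 0 and 1
# scale(income) would instead standardize to mean 0 and unit variance (z-scores)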
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
3. Discretization:
Discretization is the process through which we can transform continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that go across the range of our desired variable/model/function. Continuous data is measured, while discrete data is counted.
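For example, a continuous age variable can be discretized into bins in R using cut() (the breaks and labels below are chosen only for illustration):

age <- c(5, 17, 23, 31, 46, 58, 72)
age_group <- cut(age,
                 breaks = c(0, 18, 40, 60, Inf),
                 labels = c("child", "young adult", "middle age", "senior"))
table(age_group)         # counts per discrete category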
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and when working with such volumes, analysis becomes harder. To get around this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
The various steps to data reduction are:
1. Data Cube Aggregation:
Aggregation operation is applied to data for the construction of the data cube.
3. Numerosity Reduction:
This enables us to store a model of the data instead of the whole data, for example: regression models.
4. Dimensionality Reduction:
This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, such a reduction is called lossless reduction; otherwise it is called lossy reduction. The two effective methods of dimensionality reduction are wavelet transforms and principal component analysis (PCA).
Data Processing:
Data processing occurs when data is collected and translated into usable information.
Usually performed by a data scientist or team of data scientists, it is important for data processing to be done correctly so as not to negatively affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a more readable
format (graphs, documents, etc.), giving it the form and context necessary to be
interpreted by computers and utilized by employees throughout an organization.
2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation,
often referred to as “pre-processing” is the stage at which raw data is cleaned up and
organized for the following stage of data processing. During preparation, raw data is
diligently checked for any errors. The purpose of this step is to eliminate bad data
(redundant, incomplete, or incorrect data) and begin to create high-quality data for the
best business intelligence.
3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data
warehouse like Redshift), and translated into a language that it can understand. Data
input is the first stage in which raw data begins to take the form of usable information.
4. Processing
During this stage, the data inputted to the computer in the previous stage is actually
processed for interpretation. Processing is done using machine learning algorithms,
though the process itself may vary slightly depending on the source of data being
processed (data lakes, social networks, connected devices etc.) and its intended use
(examining advertising patterns, medical diagnosis from connected devices, determining
customer needs, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data finally becomes usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.
6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then
stored for future use. While some information may be put to use immediately, much of it
will serve a purpose later on. When data is properly stored, it can be quickly and easily
accessed by members of the organization when needed.
DATA ANALYTICS (Professional Elective - I)
Subject Code: CS513PE
NOTES MATERIAL
UNIT 2
For
B. TECH (CSE)
3rd YEAR – 1st SEM (R18)
Faculty:
B. RAVIKRISHNA
DEPARTMENT OF CSE
UNIT – II Syllabus
Data Analytics: Introduction to Analytics, Introduction to Tools and Environment,
Application of Modeling in Business, Databases & Types of Data and variables, Data
Modeling Techniques, Missing Imputations etc. Need for Business Modeling.
Topics:
1. Introduction to Data Analytics
2. Data Analytics Tools and Environment
3. Need for Business Modeling.
4. Data Modeling Techniques
5. Application of Modeling in Business
6. Databases & Types of Data and variables
7. Missing Imputations etc.
Unit-2 Objectives:
1. To explore the fundamental concepts of data analytics.
2. To learn the principles of tools and environment
3. To explore the applications of Business Modelling
4. To understand the Data Modeling Techniques
5. To understand the Data Types and Variables and Missing imputations
Unit-2 Outcomes:
After completion of this course students will be able to
1. To Describe concepts of data analytics.
2. To demonstrate the principles of tools and environment
3. To analyze the applications of Business Modelling
4. To understand and Compare the Data Modeling Techniques
5. To describe the Data Types and Variables and Missing imputations
INTRODUCTION:
Data has been the buzzword for ages now. Whether it is generated by large-scale enterprises or by individuals, every aspect of data needs to be analyzed to benefit from it.
3. Efficient Operations: With the help of data analytics, you can streamline your
processes, save money, and boost production. With an improved understanding of what
your audience wants, you spend less time creating ads and content that aren't in line with your audience's interests.
4. Effective Marketing: Data analytics gives you valuable insights into how your
campaigns are performing. This helps in fine-tuning them for optimal outcomes.
Additionally, you can also find potential customers who are most likely to interact with a
campaign and convert into leads.
4. Data Exploration and Analysis: After you gather the right data, the next vital step
is to execute exploratory data analysis. You can use data visualization and business
intelligence tools, data mining techniques, and predictive modelling to analyze,
visualize, and predict future outcomes from this data. Applying these methods can tell
you the impact and relationship of a certain feature as compared to other variables.
Below are the results you can get from the analysis:
You can identify when a customer purchases the next product.
You can understand how long it took to deliver the product.
You get a better insight into the kind of items a customer looks for, product
returns, etc.
You will be able to predict the sales and profit for the next quarter.
You can minimize order cancellation by dispatching only relevant products.
You’ll be able to figure out the shortest route to deliver the product, etc.
5. Interpret the results: The final step is to interpret the results and validate if the
outcomes meet your expectations. You can find out hidden patterns and future trends.
This will help you gain insights that will support you with appropriate data-driven
decision making.
R programming – This tool is the leading analytics tool used for statistics and
data modeling. R compiles and runs on various platforms such as UNIX,
Windows, and Mac OS. It also provides tools to automatically install all packages
as per user-requirement.
Python – Python is an open-source, object-oriented programming language that
is easy to read, write, and maintain. It provides various machine learning and
visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras,
etc. It can also connect to platforms such as SQL Server, a MongoDB database, or JSON.
Tableau Public – This is a free software that connects to any data source
such as Excel, corporate Data Warehouse, etc. It then creates visualizations,
maps, dashboards etc with real-time updates on the web.
QlikView – This tool offers in-memory data processing with the results delivered
to the end-users quickly. It also offers data association and data visualization
with data being compressed to almost 10% of its original size.
SAS – A programming language and environment for data manipulation and
analytics, this tool is easily accessible and can analyze data from different
sources.
Microsoft Excel – This tool is one of the most widely used tools for data
analytics. Mostly used for clients’ internal data, this tool analyzes the tasks that
summarize the data with a preview of pivot tables.
RapidMiner – A powerful, integrated platform that can integrate with any data
source types such as Access, Excel, Microsoft SQL, Tera data, Oracle, Sybase etc.
This tool is mostly used for predictive analytics, such as data mining, text
analytics, machine learning.
KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through
its modular data pipeline concept.
OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the
transformation of data and parsing data from websites.
Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. This tool is also popular for data pipelines and machine learning model development.
4. Banking sector: Banking and financial institutions use analytics to find probable loan defaulters and the customer churn rate. It also helps in detecting fraudulent transactions immediately.
5. Logistics: Logistics companies use data analytics to develop new business models
and optimize routes. This, in turn, ensures that the delivery reaches on time in a
cost-efficient manner.
Cluster computing:
Cluster computing is a collection of tightly or loosely connected computers that work together so that they act as a single entity. The connected computers execute operations all together, thus creating the idea of a single system. The clusters are generally connected through fast local area networks (LANs).
Apache Spark:
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform
streaming analytics. It ingests data in mini-batches and performs RDD (Resilient
Distributed Datasets) transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark. Thanks to the distributed memory-based Spark architecture, Spark MLlib is, according to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, about nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computation that can model the user-defined graphs by using
Pregel abstraction API. It also provides an optimized runtime for this abstraction.
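From R, these Spark components can be reached through the 'sparklyr' package. The sketch below assumes sparklyr and a local Spark installation are available; the data and query are only illustrative:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")     # start a local Spark session
cars_tbl <- copy_to(sc, mtcars, "cars")   # register an R data frame as a Spark DataFrame

# dplyr verbs are translated into Spark SQL ...
cars_tbl %>% group_by(cyl) %>% summarise(avg_mpg = mean(mpg))

# ... or raw SQL can be sent through the DBI interface
DBI::dbGetQuery(sc, "SELECT cyl, AVG(mpg) AS avg_mpg FROM cars GROUP BY cyl")

spark_disconnect(sc)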
What is Scala?
Scala is a statically typed programming language that incorporates both functional and object-oriented programming, and is also suitable for imperative programming approaches, in order to increase the scalability of applications. It is a general-purpose programming language with a strong static type system. In Scala, everything is an object, whether it is a function or a number; it does not have the concept of primitive data.
Scala primarily runs on the JVM platform, and it can also be used to write software for native platforms using Scala Native and for JavaScript runtimes through Scala.js.
This language was originally built for the Java Virtual Machine (JVM), and one of Scala's strengths is that it makes it very easy to interact with Java code.
Scala is a scalable language used to write software for multiple platforms; hence it got the name “Scala”. This language is intended to solve the problems of Java.
Why Scala?
Scala is the core language to be used in writing the most popular
distributed big data processing framework Apache Spark. Big Data
processing is becoming inevitable from small to large enterprises.
Extracting the valuable insights from data requires state of the art
processing tools and frameworks.
Scala is easy to learn for object-oriented programmers, Java developers. It
is becoming one of the popular languages in recent years.
Scala offers first-class functions for users
Scala can be executed on JVM, thus paving the way for the
interoperability with other languages.
It is designed for applications that are concurrent (parallel), distributed, resilient (robust) and message-driven. It is one of the most in-demand languages of this decade.
It is concise, powerful language and can quickly grow according to the
demand of its users.
It is object-oriented and has a lot of functional programming features
providing a lot of flexibility to the developers to code in a way they want.
Scala offers many Duck Types (Structural Types)
Unlike Java, Scala has many features of functional programming
languages like Scheme, Standard ML and Haskell, including currying, type
inference, immutability, lazy evaluation, and pattern matching.
The name Scala is a portmanteau of "scalable" and "language", signifying
that it is designed to grow with the demands of its users.
Cloudera Impala:
Cloudera Impala is an open-source, massively parallel processing (MPP) SQL query engine for data stored in a Hadoop cluster. Features include:
Supports HDFS and Apache HBase storage,
Reads Hadoop file formats, including text, LZO, SequenceFile, Avro,
RCFile, and Parquet,
Supports Hadoop security (Kerberos authentication),
Fine-grained, role-based authorization with Apache Sentry,
Uses metadata, ODBC driver, and SQL syntax from Apache Hive.
Why NoSQL?
The concept of NoSQL databases became popular with Internet giants like
Google, Facebook, Amazon, etc. who deal with huge volumes of data. The
system response time becomes slow when you use RDBMS for massive
volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading
our existing hardware. This process is expensive. The alternative for this
issue is to distribute database load on multiple hosts whenever the load
increases. This method is known as “scaling out.”
The table below summarizes the main differences between SQL and NoSQL databases.

SQL (e.g. MySQL)                               NoSQL (e.g. CouchDB)
Not suited for hierarchical data storage.      Best suited for hierarchical data storage.
Best suited for complex queries.               Not so good for complex queries.
Benefits of NoSQL
The NoSQL data model addresses several issues that the relational
model is not designed to address:
Large volumes of structured, semi-structured, and unstructured data.
Object-oriented programming that is easy to use and flexible.
Efficient, scale-out architecture instead of expensive, monolithic
architecture.
Variables:
Data consist of individuals and variables that give us information
about those individuals. An individual can be an object or a person.
A variable is an attribute, such as a measurement or a label.
Two types of data:
Quantitative data (Numerical)
Categorical data
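In R, these two kinds of variables are typically stored as numeric vectors and factors respectively (the values below are only illustrative):

height      <- c(160.5, 172.0, 168.2)          # quantitative (numerical): measured values
blood_group <- factor(c("A", "B", "O"))        # categorical: labels with a fixed set of levels
str(height)
str(blood_group)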
Missing Imputations:
Imputation is the process of replacing missing data with substituted values.
1. MCAR
Missing Completely At Random: the probability that a value is missing depends neither on the observed data nor on the unobserved data.
2. MAR
Missing At Random is weaker than MCAR. The missingness is still random, but
due entirely to observed variables. For example, those from a lower
socioeconomic status may be less willing to provide salary information (but we
know their SES status). The key is that the missingness is not due to the values
which are not observed. MCAR implies MAR but not vice-versa.
3. MNAR
If the data are Missing Not At Random, then the missingness depends on the
values of the missing data. Censored data falls into this category. For example,
individuals who are heavier are less likely to report their weight. Another
example, the device measuring some response can only measure values above
.5. Anything below that is missing.
Such gaps in the data can be handled in two ways:
1. Missing data imputation
2. Model-based techniques
3. Use a global constant to fill in the missing value: Replace all missing attribute values by the same constant, such as a label like “Unknown” or −∞. If missing values are replaced by, say, “Unknown,” then the mining program may mistakenly think that they form an interesting concept, since they all have a value in common – that of “Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Consider the average value of that particular attribute and use this value to replace the missing value in that attribute column.
5. Use the attribute mean for all samples belonging to the same class
as the given tuple:
For example, if classifying customers according to credit risk, replace the
missing value with the average income value for customers in the same
credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian
formalism, or decision tree induction. For example, using the other
customer attributes in your data set, you may construct a decision tree to
predict the missing values for income.
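A minimal R sketch of methods 4 and 5 above, using a small made-up data frame (the column names and values are only illustrative):

df <- data.frame(income = c(52000, NA, 61000, NA, 45000),
                 risk   = c("low", "low", "high", "high", "low"))

# Method 4: fill missing values with the overall attribute mean
df$income_filled <- ifelse(is.na(df$income),
                           mean(df$income, na.rm = TRUE),
                           df$income)

# Method 5: fill missing values with the mean of the same class (here, the credit-risk group)
class_mean <- ave(df$income, df$risk, FUN = function(x) mean(x, na.rm = TRUE))
df$income_class_filled <- ifelse(is.na(df$income), class_mean, df$income)
df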
Below given are 5 different types of techniques used to organize the data:
1. Hierarchical Technique
The hierarchical model is a tree-like structure. There is one root node, or we can say one parent node, and the other child nodes are sorted in a particular order. However, the hierarchical model is very rarely used now. This model can be used to represent real-world relationships.
2. Object-oriented Model
The object-oriented approach is the creation of objects that contain stored values. The object-oriented model communicates while supporting data abstraction, inheritance, and encapsulation.
3. Network Technique
The network model provides us with a flexible way of representing objects and
relationships between these entities. It has a feature known as a schema representing
the data in the form of a graph. An object is represented inside a node and the relation
between them as an edge, enabling them to maintain multiple parent and child records
in a generalized manner.
4. Entity-relationship Model
The ER model (Entity-Relationship model) is a high-level relational model which is used to define data elements and relationships for the entities in a system. This conceptual design provides a better view of the data and is easier to understand. In this model, the entire database is represented in a diagram called an entity-relationship diagram, consisting of entities, attributes, and relationships.
5. Relational Technique
Relational is used to describe the different relationships between the entities. And there
are different sets of relations between the entities such as one to one, one to many,
many to one, and many to many.
DATA ANALYTICS (Professional Elective - I)
Subject Code: CS513PE
NOTES MATERIAL
UNIT 3
Faculty:
B. RAVIKRISHNA
DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI
UNIT - III
Linear & Logistic Regression
Syllabus
Regression – Concepts, Blue property assumptions, Least Square Estimation,
Variable Rationalization, and Model Building etc.
Logistic Regression: Model Theory, Model fit Statistics, Model Construction, Analytics
applications to various Business Domains etc.
Topics:
1. Regression – Concepts
2. Blue property assumptions
3. Least Square Estimation
4. Variable Rationalization
5. Model Building etc.
6. Logistic Regression - Model Theory
7. Model fit Statistics
8. Model Construction
9. Analytics applications to various Business Domains
Unit-3 Objectives:
1. To explore the Concept of Regression
2. To learn the Linear Regression
3. To explore Blue Property Assumptions
4. To Learn the Logistic Regression
5. To understand the Model Theory and Applications
Unit-3 Outcomes:
After completion of this course students will be able to
1. To Describe the Concept of Regression
2. To demonstrate Linear Regression
3. To analyze the Blue Property Assumptions
4. To explore the Logistic Regression
5. To describe the Model Theory and Applications
Regression – Concepts:
Introduction:
The term regression is used to indicate the estimation or prediction of
the average value of one variable for a specified value of another variable.
Regression analysis is a very widely used statistical tool to establish a
relationship model between two variables.
For simple linear regression, y = B0 + B1 * x, the coefficients are estimated as:
B1 = Σ (xi − mean(x)) * (yi − mean(y)) / Σ (xi − mean(x))²
B0 = mean(y) – B1 * mean(x)
If we had multiple input attributes (e.g. x1, x2, x3, etc.) This would be
called multiple linear regression. The procedure for linear regression is
different and simpler than that for multiple linear regression.
Let us consider the following example, for the equation y = 2*x + 3.
x    y    xi-mean(x)   yi-mean(y)   (xi-mean(x))*(yi-mean(y))   (xi-mean(x))²
-3   -3   -4.4         -8.8         38.72                        19.36
-1    1   -2.4         -4.8         11.52                         5.76
 2    7    0.6          1.2          0.72                         0.36
 4   11    2.6          5.2         13.52                         6.76
 5   13    3.6          7.2         25.92                        12.96
mean(x) = 1.4   mean(y) = 5.8        Sum = 90.4                   Sum = 45.2
Applying the above formulas:
B1 = 90.4 / 45.2 = 2
B0 = mean(y) – B1 * mean(x) = 5.8 – 2 * 1.4 = 3
So we find B1 = 2 and B0 = 3.
Example for Linear Regression using R:
Consider the following data set:
x = {1,2,4,3,5} and y = {1,3,3,2,5}
We use R to apply Linear Regression for the above data.
> rm(list=ls()) #removes the list of variables in the current session of R
> x<-c(1,2,4,3,5) #assigns values to x
> y<-c(1,3,3,2,5) #assigns values to y
> x;y
[1] 1 2 4 3 5
[1] 1 3 3 2 5
> graphics.off() #to clear the existing plot/s
> plot(x,y,pch=16, col="red")
> relxy<-lm(y~x)
> relxy
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept)            x
        0.4          0.8
> abline(relxy,col="Blue")
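The fitted model can then be used to predict y for a new x; since the estimated line is y = 0.4 + 0.8x, a value of x = 6 gives 5.2:

> predict(relxy, data.frame(x = 6))
  1
5.2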
Homoscedasticity vs Heteroscedasticity:
1. Problem Definition
2. Hypothesis Generation
3. Data Collection
4. Data Exploration/Transformation
5. Predictive Modelling
6. Model Deployment
1. Problem Definition
The first step in constructing a model is to understand the industrial problem in a more comprehensive way. To identify the purpose of the problem and the prediction target, we must define the project objectives appropriately.
Therefore, to proceed with an analytical approach, we have to recognize the obstacles first. Remember, excellent results always depend on a better understanding of the problem.
2. Hypothesis Generation
3. Data Collection
Data collection is gathering data from relevant sources regarding the
analytical problem, then we extract meaningful insights from the data for
prediction.
4. Data Exploration/Transformation
The data you collected may be in unfamiliar shapes and sizes. It may contain
unnecessary features, null values, unanticipated small values, or immense
values. So, before applying any algorithmic model to data, we have to explore it
first.
By inspecting the data, we get to understand the explicit and hidden trends in
data. We find the relation between data features and the target variable.
Usually, a data scientist invests 60–70% of the project time in data exploration alone.
There are several sub steps involved in data exploration:
o Feature Identification:
You need to analyze which data features are available and which
ones are not.
Identify independent and target variables.
Identify data types and categories of these variables.
o Univariate Analysis:
We inspect each variable one by one. This kind of analysis depends on whether the variable type is categorical or continuous.
Continuous variable: We mainly look for statistical trends like
mean, median, standard deviation, skewness, and many
more in the dataset.
Categorical variable: We use a frequency table to understand
the spread of data for each category. We can measure the
counts and frequency of occurrence of values.
o Bi-variate / Multi-variate Analysis:
This analysis helps to discover the relation between two or more variables.
We can find the correlation in the case of continuous variables, and in the case of categorical variables we look for association and dissociation between them.
o Filling Null Values:
Usually, the dataset contains null values, which lower the potential of the model. With a continuous variable, we fill these null values using the mean or mode of that specific column. For the null values present in a categorical column, we replace them with the most frequently occurring categorical value. Remember, don't delete those rows, because you may lose information.
5. Predictive Modeling
Predictive modeling is a mathematical approach to create a statistical model
to forecast future behavior based on input test data.
Steps involved in predictive modeling:
Algorithm Selection:
o When we have the structured dataset, and we want to estimate the
continuous or categorical outcome then we use supervised machine
learning methodologies like regression and classification techniques.
o When we have unstructured data and want to predict the clusters of items to which a particular input test sample belongs, we use unsupervised algorithms. An actual data scientist applies multiple algorithms to get a more accurate model.
Train Model:
o After assigning the algorithm and getting the data handy, we train our
model using the input data applying the preferred algorithm. It is an action
to determine the correspondence between independent variables, and the
prediction targets.
Model Prediction:
o We make predictions by giving the input test data to the trained model.
We measure the accuracy by using a cross-validation strategy or ROC
curve which performs well to derive model output for test data.
6. Model Deployment
There is nothing better than deploying the model in a real-time environment. It
helps us to gain analytical insights into the decision-making procedure. You
constantly need to update the model with additional features for customer
satisfaction.
To predict business decisions, plan market strategies, and create personalized
customer interests, we integrate the machine learning model into the existing
production domain.
When you go through the Amazon website, you notice product recommendations based entirely on your interests. You can experience the increased involvement of customers who use these services. That is how a deployed model changes the mindset of the customer and convinces him to purchase the product.
Key Takeaways
Logistic Regression:
Model Theory, Model fit Statistics, Model Construction
Introduction:
Logistic regression is one of the most popular Machine Learning algorithms,
which comes under the Supervised Learning technique. It is used for
predicting the categorical dependent variable using a given set of
independent variables.
The outcome must be a categorical or discrete value. It can be either Yes or
No, 0 or 1, true or False, etc. but instead of giving the exact value as 0 and 1,
it gives the probabilistic values which lie between 0 and 1.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped
logistic function, which predicts two maximum values (0 or 1).
The curve from the logistic function indicates the likelihood of something, such as whether or not cells are cancerous, or whether a mouse is obese based on its weight, etc.
Logistic regression uses the concept of predictive modeling, as regression does; therefore, it is called logistic regression, but it is used to classify samples, and therefore it falls under the classification algorithms.
In logistic regression, we use the concept of the threshold value, which defines the probability of either 0 or 1. Values above the threshold value tend to 1, and values below the threshold value tend to 0.
Types of Logistic Regressions:
On the basis of the categories, Logistic Regression can be classified into three
types:
Binomial: In binomial Logistic regression, there can be only two possible types
of the dependent variables, such as 0 or 1, Pass or Fail, etc.
Multinomial: In multinomial Logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered
types of dependent variables, such as "low", "Medium", or "High".
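As an illustration of the binomial case above, a logistic regression can be fitted in R with glm() using family = binomial. The built-in mtcars data and the chosen predictors below are only for illustration:

data(mtcars)
logit_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(logit_model)                       # coefficients and model-fit statistics

# Predicted probabilities and a 0.5 threshold for class labels
prob       <- predict(logit_model, type = "response")
pred_class <- ifelse(prob > 0.5, 1, 0)
table(Predicted = pred_class, Actual = mtcars$am)   # confusion matrix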
Definition: Multi-collinearity:
Multicollinearity is a statistical phenomenon in which multiple independent variables show high correlation with each other and are too inter-related.
Multicollinearity, also called collinearity, is an undesired situation for any statistical regression model, since it diminishes the reliability of the model itself.
If two or more independent variables are too strongly correlated, the data obtained from the regression will be disturbed, because the independent variables are actually dependent on each other.
Assumptions for Logistic Regression:
The dependent variable must be categorical in nature.
The independent variable should not have multi-collinearity.
Logistic Regression Equation:
The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get Logistic Regression equations are
given below:
Logistic regression uses a more complex cost function; this cost function can be defined as the 'sigmoid function', also known as the 'logistic function', instead of a linear function.
The hypothesis of logistic regression tends to limit the cost function between 0 and 1. Therefore linear functions fail to represent it, as they can have a value greater than 1 or less than 0, which is not possible as per the hypothesis of logistic regression.
z = sigmoid(y) = σ(y) = 1 / (1 + e^(−y))
Hypothesis Representation
When using linear regression, we used a formula for the line equation as:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
In the above equation, y is the response variable; x1, x2, ..., xn are the predictor variables; and b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
z = σ(y) = 1 / (1 + e^−(b0 + b1*x1 + b2*x2 + ... + bn*xn))

Example for Sigmoid Function in R:
> #Example for Sigmoid Function
> y<-c(-10:10);y
[1] -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
> z<-1/(1+exp(-y));z
[1] 4.539787e-05 1.233946e-04 3.353501e-04 9.110512e-04 2.472623e-03 6.692851e-03 1.798621e-02 4.742587e-02
[9] 1.192029e-01 2.689414e-01 5.000000e-01 7.310586e-01 8.807971e-01 9.525741e-01 9.820138e-01 9.933071e-01
[17] 9.975274e-01 9.990889e-01 9.996646e-01 9.998766e-01 9.999546e-01
> plot(y,z)
> rm(list=ls())
> attach(mtcars) #attaching a data set into the R environment
> input <- mtcars[,c("mpg","disp","hp","wt")]
> head(input)
mpg disp hp wt
Mazda RX4 21.0 160 110 2.620
Mazda RX4 Wag 21.0 160 110 2.875
Datsun 710 22.8 108 93 2.320
Hornet 4 Drive 21.4 258 110 3.215
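The model-fitting call that produces the coefficients below presumably looked like the following (a reconstruction; only its output survives in these notes):

> model <- lm(mpg ~ disp + hp + wt, data = input)
> model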
Coefficients:
(Intercept)         disp           hp           wt
  37.105505    -0.000937    -0.031157    -3.800891
True Positive
True Negative
False Positive – Type 1 Error
False Negative – Type 2 Error
Precision = TP / (TP + FP)
Precision is a useful metric in cases where False Positives are a higher concern than False Negatives.
Recall = TP / (TP + FN)
Recall is a useful metric in cases where False Negative trumps False Positive. Recall is important in medical cases, where it doesn't matter whether we raise a false alarm, but the actual positive cases should not go undetected!
F1-Score:
F1-score is a harmonic mean of Precision and Recall. It gives a combined idea
about these two metrics. It is maximum when Precision is equal to Recall.
Therefore, this score takes both false positives and false negatives into account.
F1 Score = 2 / (1/Precision + 1/Recall) = 2 * (Precision * Recall) / (Precision + Recall)
F1 is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar costs. If the costs of false positives and false negatives are very different, it's better to look at both Precision and Recall.
But there is a catch here. The interpretability of the F1-score is poor, meaning that we don't know what our classifier is maximizing – precision or recall. So, we use it in combination with other evaluation metrics, which gives us a complete picture of the result.
Example:
Suppose we had a classification dataset with 1000 data points. We fit a classifier on
it and get the below confusion matrix:

                      Predicted: Positive    Predicted: Negative
Actual: Positive          TP = 560               FN = 50
Actual: Negative          FP = 60                TN = 330
Precision:
It tells us how many of the correctly predicted cases actually turned out to be positive. This would determine whether our model is reliable or not.
Precision = TP / (TP + FP)
Recall:
Recall tells us how many of the actual positive cases we were able to predict correctly with our model.
Recall = TP / (TP + FN)
We can easily calculate Precision and Recall for our model by plugging the values into the above equations:
Precision = 560 / (560 + 60) = 0.903
Recall = 560 / (560 + 50) = 0.918
F1-Score:
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.903 * 0.918) / (0.903 + 0.918) = 2 * 0.4552 = 0.910
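The same calculation can be reproduced in R from the confusion-matrix counts above:

TP <- 560; FP <- 60; FN <- 50
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
round(c(precision = precision, recall = recall, f1 = f1), 3)
# precision    recall        f1
#     0.903     0.918     0.911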
AUC - ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability: it tells how much the model is capable of distinguishing between classes.
The ROC curve is plotted with TPR against the FPR, where TPR is on the y-axis and FPR is on the x-axis.
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR), which equals 1 - Specificity, is defined as follows:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification
threshold classifies more items as positive, thus increasing both False Positives and True Positives.
The following figure shows a typical ROC curve.
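A minimal sketch of how an ROC curve can be traced in base R by sweeping the classification threshold (the label and probability vectors below are purely illustrative):
> # Hypothetical true labels (0/1) and predicted probabilities
> labels <- c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0)
> probs  <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.60, 0.30, 0.70, 0.50)
> thresholds <- seq(0, 1, by = 0.05)
> # TPR = TP/(TP+FN) and FPR = FP/(FP+TN) at each threshold
> tpr <- sapply(thresholds, function(t) sum(probs >= t & labels == 1) / sum(labels == 1))
> fpr <- sapply(thresholds, function(t) sum(probs >= t & labels == 0) / sum(labels == 0))
> plot(fpr, tpr, type = "b", xlab = "FPR", ylab = "TPR", main = "ROC curve")
> abline(0, 1, lty = 2)   # diagonal reference line (random classifier)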
TO BE DISCUSSED:
Receiver Operating Characteristics:
ROC & AUC
Here, we add the constant term b0, by setting x0 = 1. This gives us K+1
parameters. The left hand side of the above equation is called the logit of P
(hence, the name logistic regression).
The right-hand side of the top equation is the sigmoid of z, which maps the real line to the interval (0, 1), and is approximately linear near the origin. A useful fact about P(z) is that the derivative P'(z) = P(z)(1 - P(z)). Here's the derivation: with P(z) = 1 / (1 + e^-z),
P'(z) = e^-z / (1 + e^-z)^2 = [1 / (1 + e^-z)] * [e^-z / (1 + e^-z)] = P(z) * (1 - P(z)).
Later, we will want to take the gradient of P with respect to the set of coefficients b, rather than z. In that case, P'(z) = P(z)(1 - P(z)) z', where z' is the gradient taken with respect to b.
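As a small illustration of fitting a logistic regression in R (a sketch only, using the built-in mtcars data with the binary am column as response; the choice of predictors is for demonstration):
> # Logistic regression: probability of manual transmission (am = 1) from hp and wt
> logit_model <- glm(am ~ hp + wt, data = mtcars, family = binomial)
> coef(logit_model)                               # estimated b0, b1, b2
> head(predict(logit_model, type = "response"))   # fitted sigmoid probabilities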
UNIT 4
NOTES MATERIAL
OBJECT SEGMENTATION
TIME SERIES METHODS
Faculty:
B. RAVIKRISHNA
DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI
UNIT - IV
Object Segmentation & Time Series Methods
Syllabus:
Object Segmentation: Regression Vs Segmentation – Supervised and
Unsupervised Learning, Tree Building – Regression, Classification, Overfitting,
Pruning and Complexity, Multiple Decision Trees etc.
Time Series Methods: Arima, Measures of Forecast Accuracy, STL approach,
Extract features from generated model as Height, Average Energy etc and
analyze for prediction
Topics:
Object Segmentation:
Supervised and Unsupervised Learning
Segmentation & Regression Vs Segmentation
Regression, Classification, Overfitting,
Decision Tree Building
Pruning and Complexity
Multiple Decision Trees etc.
Unit-4 Objectives:
1. To explore the Segmentation & Regression Vs Segmentation
2. To learn the Regression, Classification, Overfitting
3. To explore Decision Tree Building, Multiple Decision Trees etc.
4. To Learn the Arima, Measures of Forecast Accuracy
5. To understand the STL approach
Unit-4 Outcomes:
After completion of this course students will be able to
1. To Describe the Segmentation & Regression Vs Segmentation
2. To demonstrate Regression, Classification, Overfitting
3. To analyze the Decision Tree Building, Multiple Decision Trees etc.
4. To explore the Arima, Measures of Forecast Accuracy
5. To describe the STL approach
The main differences between Supervised and Unsupervised learning are given below: supervised learning works on labelled data, where each training example has a known output, and the model learns a mapping from inputs to outputs (e.g. classification, regression); unsupervised learning works on unlabelled data and discovers hidden patterns or groupings on its own (e.g. clustering, association).
Segmentation
Segmentation refers to the act of segmenting data according to your
company’s needs in order to refine your analyses based on a defined
context. It is a technique of splitting customers into separate groups
depending on their attributes or behavior.
The purpose of segmentation is to better understand your
customers(visitors), and to obtain actionable data in order to improve your
website or mobile app. In concrete terms, a segment enables you to filter
your analyses based on certain elements (single or combined).
Segmentation can be done on elements related to
a visit, as well as on elements related to multiple
visits during a studied period.
Steps:
Define purpose – Already mentioned in the statement above
Identify critical parameters – Some of the variables which come to mind
are skill, motivation, vintage, department, education etc. Let us say that,
based on past experience, we know that skill and motivation are the most
important parameters. Also, for the sake of simplicity, we just select 2
variables. Taking additional variables will increase the complexity, but can
be done if it adds value.
Granularity – Let us say we are able to classify both skill and motivation
into High and Low using various techniques.
There are two broad set of methodologies for segmentation:
Objective (supervised) segmentation
Non-Objective (unsupervised) segmentation
Objective Segmentation
Segmentation to identify the type of customers who would respond to a
particular offer.
Segmentation to identify high spenders among customers who will use
the e- commerce channel for festive shopping.
Segmentation to identify customers who will default on their credit
obligation for a loan or credit card.
Non-Objective Segmentation
https://www.yieldify.com/blog/types-of-market-segmentation/
Segmentation of the customer base to understand the specific profiles
which exist within the customer base so that multiple marketing actions
can be personalized for each segment
Segmentation of geographies on the basis of affluence and lifestyle of
people living in each geography so that sales and distribution strategies
can be formulated accordingly.
Hence, it is critical that the segments created on the basis of an objective
segmentation methodology must be different with respect to the stated
objective (e.g. response to an offer).
However, in case of a non-objective methodology, the segments are
different with respect to the “generic profile” of observations belonging
to each segment, but not with regards to any specific outcome of interest.
The most common techniques for building non-objective segmentation are
cluster analysis, K nearest neighbor techniques etc.
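A minimal sketch of non-objective segmentation using k-means cluster analysis in R (the mtcars columns and the choice of 3 segments are only illustrative):
> # Standardize the variables, then cluster the observations into 3 segments
> seg_data <- scale(mtcars[, c("mpg", "hp", "wt")])
> set.seed(42)
> segments <- kmeans(seg_data, centers = 3)
> table(segments$cluster)   # how many observations fall into each segment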
Regression Vs Segmentation
Regression analysis focuses on finding a relationship between a
dependent variable and one or more independent variables.
Predicts the value of a dependent variable based on the value of at
least one independent variable.
Explains the impact of changes in an independent variable on the
dependent variable.
We use linear or logistic regression technique for developing accurate
models for predicting an outcome of interest.
Often, we create separate models for separate segments.
Segmentation methods such as CHAID or CRT are used to judge their effectiveness.
Decision Tree:
Decision Tree is a supervised learning technique that can be used for both Classification and Regression problems, but mostly it is preferred for solving Classification problems.
Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
A decision tree simply asks a question, and based on the answer (Yes/No),
it further splits the tree into subtrees.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and
Leaf Node. Decision nodes are used to make any decision and have
multiple branches, whereas Leaf nodes are the output of those decisions
and do not contain any further branches.
Basic Decision Tree Learning Algorithm:
Now that we know what a Decision Tree is, we'll see how it works internally. There are many algorithms out there which construct Decision Trees, but one of the best is called the ID3 Algorithm. ID3 stands for Iterative Dichotomiser 3.
Root Node: Root node is from where the decision tree starts. It represents the
entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output node, and the tree cannot be
segregated further after getting a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-
nodes according to the given conditions.
Branch/Sub Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
Decision Tree Representation:
Each non-leaf node is connected to a test that splits its set of possible
answers into subsets corresponding to different test results.
Each branch carries a particular test result's subset to another node.
Each node is connected to a set of possible answers.
Below diagram explains the general structure of a decision tree:
Each non-leaf node specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
An instance is classified by starting at the root node of the decision tree,
testing the attribute specified by this node, then moving down the tree
branch corresponding to the value of the attribute. This process is then
repeated at the node on this branch and so on until a leaf node is reached.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
Tree Building: Decision tree learning is the construction of a decision tree from class-labeled training tuples. A decision tree is a flow-chart-like structure, where
each internal (non-leaf) node denotes a test on an attribute, each branch
represents the outcome of a test, and each leaf (or terminal) node holds a class
label. The topmost node in a tree is the root node. There are many specific
decision-tree algorithms. Notable ones include the following.
ID3 → (extension of D3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square automatic interaction detection Performs multi-level splits
when computing classification trees)
MARS → (multivariate adaptive regression splines): Extends decision trees to
handle numerical data better
Conditional Inference Trees → Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting.
The ID3 algorithm builds decision trees using a top-down greedy search
approach through the space of possible branches with no backtracking. A greedy
algorithm, as the name suggests, always makes the choice that seems to be the
best at that moment.
In a decision tree, for predicting the class of the given dataset, the algorithm
starts from the root node of the tree. This algorithm compares the values of root
attribute with the record (real dataset) attribute and, based on the comparison,
follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues the process until it reaches the leaf node of the tree. The complete process can be better understood using the below algorithm:
Step-1: Begin the tree with the root node, says S, which contains the
complete dataset.
Step-2: Find the best attribute in the dataset using Attribute Selection
Measure (ASM).
Step-3: Divide the S into subsets that contains possible values for
the best attributes.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
Step-6: Continue this process until a stage is reached where you cannot further classify the nodes, and call the final node a leaf node.
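The same steps can be tried out in R; a minimal sketch using the rpart package (assuming it is installed) on the built-in iris data:
> library(rpart)
> # Grow a classification tree for Species from all other attributes
> tree_model <- rpart(Species ~ ., data = iris, method = "class")
> print(tree_model)                                # the splits chosen at each node
> predict(tree_model, head(iris), type = "class")  # classify a few instances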
Entropy:
Entropy is a measure of the randomness in the information
being processed. The higher the entropy, the harder it is to
draw any conclusions from that information. Flipping a coin is
an example of an action that provides information that is
random.
From the graph, it is quite evident that the entropy H(X) is zero when the probability is either 0 or 1. The entropy is maximum when the probability is 0.5, because it projects perfect randomness in the data and there is no chance of perfectly determining the outcome.
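A small sketch in R showing that binary entropy is zero at p = 0 or 1 and maximal at p = 0.5:
> # Binary entropy H(p) = -p*log2(p) - (1-p)*log2(1-p)
> p <- seq(0.01, 0.99, by = 0.01)
> H <- -p * log2(p) - (1 - p) * log2(1 - p)
> plot(p, H, type = "l", xlab = "p", ylab = "Entropy H(p)")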
Information Gain
Information gain or IG is a statistical property that
measures how well a given attribute separates the
training examples according to their target
classification. Constructing a decision tree is all about
finding an attribute that returns the highest
information gain and the smallest entropy.
ID3 follows the rule — A branch with an entropy of zero is a leaf node and A
branch with entropy more than zero needs further splitting.
In order to derive the Hypothesis space, we compute the Entropy and Information
Gain of Class and attributes. For them we use the following statistics formulae:
Information Gain(Attribute) = I(p, n) - Entropy(Attribute), where
I(p_i, n_i) = - (p_i / (p_i + n_i)) log2(p_i / (p_i + n_i)) - (n_i / (p_i + n_i)) log2(n_i / (p_i + n_i))
Entropy of an Attribute is:
Entropy(Attribute) = Σ_i ((p_i + n_i) / (P + N)) * I(p_i, n_i)
Here P and N are the total numbers of positive and negative examples, and p_i, n_i are the counts for the i-th value of the attribute.
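A minimal sketch of these formulae as R functions (p_i and n_i are vectors of positive/negative counts for each value of the attribute; the Outlook counts used in the example are the well-known play-tennis figures, assumed here for illustration):
> # I(p, n): information (entropy) for p positive and n negative examples
> I_pn <- function(p, n) {
+   f <- function(x) ifelse(x == 0, 0, -x * log2(x))
+   f(p / (p + n)) + f(n / (p + n))
+ }
> # Entropy(Attribute): weighted average of I over the attribute's values
> entropy_attr <- function(p_i, n_i) sum(((p_i + n_i) / (sum(p_i) + sum(n_i))) * I_pn(p_i, n_i))
> # Information gain of the attribute
> gain <- function(p_i, n_i) I_pn(sum(p_i), sum(n_i)) - entropy_attr(p_i, n_i)
> # Outlook: Sunny 2+/3-, Overcast 4+/0-, Rain 3+/2-
> gain(c(2, 4, 3), c(3, 0, 2))   # approx. 0.247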
Data set:
Classification Trees:
A classification tree is an algorithm where the target variable is fixed or categorical. The algorithm is then used to identify the "class" within which a target variable would most likely fall.
An example of a classification-type problem would be determining who will or will not subscribe to a digital platform, or who will or will not graduate from high school. These are examples of simple binary classifications where the categorical dependent variable can assume only one of two, mutually exclusive values.
Regression Trees
A regression tree refers to an algorithm where the target variable is continuous and the algorithm is used to predict its value.
As an example of a regression-type problem, you may want to predict the selling price of a residential house, which is a continuous dependent variable. This will depend on both continuous factors like square footage as well as categorical factors.
A node in which all cases have the same value for the dependent variable is a homogeneous node that requires no further splitting because it is "pure." For categorical (nominal, ordinal) dependent variables the common measure of impurity is Gini, which is based on squared probabilities of membership for each category. Splits are found that maximize the homogeneity of child nodes with respect to the value of the dependent variable.
One of the questions that arises in a decision tree algorithm is the optimal size of
the final tree. A tree that is too large risks overfitting the training data and poorly
generalizing to new samples. A small tree might not capture important structural
information about the sample space. However, it is hard to tell when a tree algorithm should stop, because it is impossible to tell whether the addition of a single extra node will dramatically decrease error. This problem is known as the horizon effect. A common strategy is to grow the tree until each node contains a small number of instances and then use pruning to remove nodes
that do not provide additional information. Pruning should reduce the size of a
learning tree without reducing predictive accuracy as measured by a cross-
validation set. There are many techniques for tree pruning that differ in the
measurement that is used to optimize performance.
Pruning Techniques:
Pruning processes can be divided into two types: Pre-Pruning & Post-Pruning.
Pre-pruning procedures prevent a complete induction of the training set by replacing a stop() criterion in the induction algorithm (e.g. maximum tree depth or information gain(Attr) > minGain). They are considered more efficient because they do not induce an entire set; rather, trees remain small from the start.
Post-Pruning (or just pruning) is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.
The procedures are differentiated on the basis of their approach in the tree: Top-
down approach & Bottom-Up approach
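A minimal sketch of post-pruning in R using rpart's complexity parameter (cp); the dataset and the cp values are illustrative:
> library(rpart)
> # Grow a deliberately large tree, then prune it back
> fit <- rpart(Species ~ ., data = iris, method = "class",
+              control = rpart.control(cp = 0.001, minsplit = 2))
> printcp(fit)                     # cross-validated error for each subtree size
> pruned <- prune(fit, cp = 0.05)  # post-pruning: keep only splits worth cp = 0.05
> print(pruned)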
CHAID:
CHAID stands for CHI-squared Automatic Interaction Detector. Morgan and
Sonquist (1963) proposed a simple method for fitting trees to predict a
quantitative variable.
Each predictor is tested for splitting as follows: sort all the n cases on the
predictor and examine all n-1 ways to split the cluster in two. For each
possible split, compute the within-cluster sum of squares about the mean of
the cluster on the dependent variable.
Choose the best of the n-1 splits to represent the predictor’s contribution.
Now do this for every other predictor. For the actual split, choose the
predictor and its cut point which yields the smallest overall within-cluster
sum of squares. Categorical predictors require a different approach. Since
categories are unordered, all possible splits between categories must be
considered. For deciding on one split of k categories into two groups, this
means that 2^(k-1) - 1 possible splits must be considered. Once a split is found, its
suitability is measured on the same within-cluster sum of squares as for a
quantitative predictor.
Interaction has to do with conditional discrepancies rather than with correlation between predictors. In the analysis of
variance, interaction means that a trend within one level of a variable is not
parallel to a trend within another level of the same variable. In the ANOVA
model, interaction is represented by cross-products between predictors.
In the tree model, it is represented by branches from the same nodes which
have different splitting predictors further down the tree. Regression trees
parallel regression/ANOVA modeling, in which the dependent variable is
quantitative. Classification trees parallel discriminant analysis and
algebraic classification methods. Kass (1980) proposed a modification to
AID called CHAID for categorized dependent and independent variables. His
algorithm incorporated a sequential merge and split procedure based on a
chi-square test statistic.
Kass’s algorithm is like sequential cross-tabulation. For each predictor:
1) cross tabulate the m categories of the predictor with the k
categories of the dependent variable.
2) find the pair of categories of the predictor whose 2xk sub-table is
least significantly different on a chi-square test and merge these two
categories.
3) if the chi-square test statistic is not “significant” according to a
preset critical value, repeat this merging process for the selected
predictor until no non-significant chi-square is found for a sub-table, and pick the predictor variable whose chi-square is largest and split the sample into l subsets, where l is the number of categories resulting from the merging process on that predictor.
4) Continue splitting, as with AID, until no “significant” chi-squares
result. The CHAID algorithm saves some computer time, but it is not
guaranteed to find the splits which predict best at a given step.
Only by searching all possible category subsets can we do that. CHAID is
also limited to categorical predictors, so it cannot be used for quantitative or
mixed categorical quantitative models.
Overfitting and Under-fitting:
From the three graphs shown above, one can clearly understand that in the leftmost figure the line does not cover all the data points, so we can say that the model is under-fitted. In this case, the model has failed to generalize the pattern to the new dataset, leading to poor performance on testing. The under-fitted model can be easily seen as it gives very high errors on both training and testing data. This happens when the dataset is not clean and contains noise, when the model has high bias, or when the size of the training data is not enough.
When it comes to overfitting, as shown in the rightmost graph, the model covers all the data points correctly, and you might think this is a perfect fit. But actually, no, it is not a good fit! Because the model learns too many details from the dataset, it also considers noise. Thus, it negatively affects the new data set; not every detail that the model has learned during training also applies to the new data points, which gives poor performance on the testing or validation dataset. This is because the model has trained itself in a very complex manner and has high variance.
The best fit model is shown by the middle graph, where both training and testing (validation) loss are minimum, or we can say training and testing accuracy should be near each other and high in value.
ARIMA forecasting equation:
The forecasting equation is constructed as follows. First, let y denote the dth difference of Y, which means:
If d=0: yt = Yt
If d=1: yt = Yt - Yt-1
If d=2: yt = (Yt - Yt-1) - (Yt-1 - Yt-2) = Yt - 2Yt-1 + Yt-2
Note that the second difference of Y (the d=2 case) is not the difference from 2
periods ago. Rather, it is the first-difference-of-the-first difference, which is the
discrete analog of a second derivative, i.e., the local acceleration of the series
rather than its local trend.
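A minimal sketch of fitting an ARIMA model in R with the built-in arima() function (the (1,1,1) order and the AirPassengers series are chosen only for illustration):
> # ARIMA(1,1,1): d = 1 means the series is differenced once, as described above
> fit <- arima(AirPassengers, order = c(1, 1, 1))
> fit$coef                        # estimated AR and MA coefficients
> predict(fit, n.ahead = 12)$pred # point forecasts for the next 12 periods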
Measures of Forecast Accuracy:
1. Mean Forecast Error (MFE): For n time periods where we have actual demand and forecast values, with e_i the forecast error in period i:
MFE = ( Σ_{i=1..n} e_i ) / n
Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0, the model tends to over-forecast.
2. Mean Absolute Deviation (MAD): For n time periods where we have actual demand and forecast values:
MAD = ( Σ_{i=1..n} |e_i| ) / n
While MFE is a measure of forecast model bias, MAD indicates the absolute size
of the errors
Uses of Forecast error:
Forecast model bias
Absolute size of the forecast errors
Compare alternative forecasting models
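A minimal sketch of computing MFE and MAD in R (the actual and forecast vectors are illustrative only):
> actual   <- c(120, 130, 110, 140, 150)
> forecast <- c(115, 135, 112, 138, 145)
> e <- actual - forecast       # forecast errors
> MFE <- mean(e); MFE          # positive here, so this model tends to under-forecast
> MAD <- mean(abs(e)); MAD     # absolute size of the errors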
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage and
especially in data warehousing that:
Extracts data from homogeneous or heterogeneous data sources
Transforms the data for storing it in proper format or structure for
querying and analysis purpose
Loads it into the final target (database, more specifically, operational data
store, data mart, or data warehouse)
Usually all three phases execute in parallel. Since data extraction takes time, another transformation process executes while the data is being pulled, processing the already received data and preparing it for loading; as soon as there is some data ready to be loaded into the target, the data loading kicks off without waiting for the completion of the previous phases.
ETL systems commonly integrate data from multiple applications (systems),
typically developed and supported by different vendors or hosted on separate
computer hardware. The disparate systems containing the original data are
frequently managed and operated by different employees. For example, a cost
accounting system may combine data from payroll, sales, and purchasing.
Commercially available ETL tools include:
Anatella
Alteryx
CampaignRunner
ESF Database Migration Toolkit
Informatica PowerCenter
Talend
IBM InfoSphere DataStage
Ab Initio
Oracle Data Integrator (ODI)
Oracle Warehouse Builder (OWB)
Microsoft SQL Server Integration Services (SSIS)
Tomahawk Business Integrator by Novasoft Technologies.
Pentaho Data Integration (or Kettle), an open-source data integration framework
Stambia
Clean: The cleaning step is one of the most important as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules (a small R sketch follows this list), such as:
Making identifiers unique (sex categories Male/Female/Unknown, M/F/null,
Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
Convert null values into a standardized Not Available/Not Provided value
Convert phone numbers and ZIP codes to a standardized form
Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
Validate address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).
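A small R sketch of what such cleaning rules might look like (the data frame, column names and category codes are hypothetical):
> # Hypothetical customer extract with inconsistent codes
> cust <- data.frame(sex = c("M", "F", "Man", NA),
+                    zip = c("500 068", "500068", "5000-68", NA),
+                    stringsAsFactors = FALSE)
> # Unify sex categories to Male/Female/Unknown
> cust$sex <- ifelse(cust$sex %in% c("M", "Man"), "Male",
+             ifelse(cust$sex %in% c("F", "Woman"), "Female", "Unknown"))
> # Standardize ZIP codes: remove spaces and dashes, flag missing values
> cust$zip <- gsub("[ -]", "", cust$zip)
> cust$zip[is.na(cust$zip)] <- "Not Available"
> cust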
Transform:
The transform step applies a set of rules to transform the data from the
source to the target.
This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
Load:
During the load step, it is necessary to ensure that the load is performed
correctly and with as little resources as possible. The target of the Load
process is often a database.
In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only after
the load completes. The referential integrity needs to be maintained by
ETL tool to ensure consistency.
Staging:
It should be possible to restart, at least, some of the phases independently from
the others. For example, if the transformation step fails, it should not be
necessary to restart the Extract step. We can ensure this by implementing
proper staging. Staging means that the data is simply dumped to the location
(called the Staging Area) so that it can then be read by the next processing
phase. The staging area is also used during the ETL process to store intermediate results of processing. This is OK for the ETL process, which uses it for this purpose. However, the staging area should be accessed by the load ETL process only. It should never be available to anyone else, particularly not to end users, as it is not intended for data presentation to the end-user and may contain incomplete or in-the-middle-of-processing data.
NOTES MATERIAL
UNIT 5 - Data Visualization
For B.
TECH (CSE)
3rd YEAR – 1st SEM
(R18)
Faculty:
B. RAVIKRISHNA
DEPARTMENT OF CSE
VIGNAN INSTITUTE OF TECHNOLOGY & SCIENCE
DESHMUKHI
UNIT - V
Data Visualization
Syllabus:
Data Visualization: Pixel-Oriented Visualization Techniques, Geometric
Projection Visualization Techniques, Icon-Based Visualization Techniques,
Hierarchical Visualization Techniques, Visualizing Complex Data and Relations.
Unit-5 Objectives:
1. To explore Pixel-Oriented Visualization Techniques
2. To learn Geometric Projection Visualization Techniques
3. To explore Icon-Based Visualization Techniques
4. To Learn Hierarchical Visualization Techniques
5. To understand Visualizing Complex Data and Relations
Unit-5 Outcomes:
After completion of this course students will be able to
1. To Describe the Pixel-Oriented Visualization Techniques
2. To demonstrate Geometric Projection Visualization Techniques
3. To analyze the Icon-Based Visualization Techniques
4. To explore the Hierarchical Visualization Techniques
5. To compare the Visualizing Complex Data and Relations
Data Visualization
Pixel-Oriented Visualization Techniques:
For a data set of m dimensions, create m windows on the screen, one for each dimension.
The m dimension values of a record are mapped to m pixels at the corresponding positions in the windows.
The colors of the pixels reflect the corresponding values.
Line Plot:
This is the plot that you can see in almost any kind of analysis between 2 variables.
A line plot is nothing but a series of data points whose values are connected with straight lines.
The plot may seem very simple, but it has many applications, not only in machine learning but in many other areas.
It is used, for example, to analyze the performance of a model using the ROC-AUC curve.
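A minimal sketch of a line plot in base R (the data is illustrative):
> x <- 1:10
> y <- c(2, 4, 3, 6, 7, 9, 8, 11, 12, 15)
> plot(x, y, type = "l", xlab = "x", ylab = "y", main = "Line Plot")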
Bar Plot
This is one of the widely used plots, which we would have seen multiple times not just in data analysis; we also use this plot wherever there is trend analysis in many fields.
We can visualize the data in a cool plot and convey the details straightforwardly to others.
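A minimal sketch of a bar plot in base R (the category counts are illustrative):
> counts <- c(A = 12, B = 7, C = 9, D = 4)
> barplot(counts, xlab = "Category", ylab = "Count", main = "Bar Plot")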
This plot may be simple and clear, but it is not very frequently used in data science applications.
Stacked Bar Graph:
Stacked Bar Graphs are useful for comparing the total amounts across each group/segmented bar.
100% Stack Bar Graphs show the percentage-of-the-whole of each
group and are plotted by the percentage of each value to the total
amount in each group. This makes it easier to see the relative
differences between quantities in each group.
One major flaw of Stacked Bar Graphs is that they become harder to
read the more segments each bar has. Also comparing each segment to
each other is difficult, as they're not aligned on a common baseline.
Scatter Plot
It is one of the most commonly used plots used for visualizing simple
data in Machine learning and Data Science.
This plot gives us a representation where each point in the entire dataset is plotted with respect to any 2 to 3 features (columns).
Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the common one, where we will primarily try to find the patterns, clusters, and separability of the data.
The colors are assigned to different data points based on how they are represented in the dataset, i.e., the target column.
We can color the data points as per their class label given in the dataset.
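A minimal sketch of a 2-D scatter plot in base R, colouring points by their class label (using the built-in iris data for illustration):
> plot(iris$Petal.Length, iris$Petal.Width, col = as.integer(iris$Species),
+      xlab = "Petal Length", ylab = "Petal Width", main = "Scatter Plot")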
Box and Whisker Plot
This plot can be used to obtain more statistical details about the data.
The straight lines at the maximum and minimum are also called whiskers. Points that lie outside the whiskers are considered outliers.
The box plot also gives us a description of the 25th, 50th and 75th quartiles (percentiles).
With the help of a box plot, we can also determine the interquartile range (IQR), where the bulk of the data is present.
These box plots come under univariate analysis, which means that we are exploring data only with one variable.
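A minimal sketch of a box and whisker plot in base R (using mtcars$mpg for illustration):
> boxplot(mtcars$mpg, main = "Box and Whisker Plot", ylab = "mpg")
> quantile(mtcars$mpg, c(0.25, 0.5, 0.75))   # the 25th, 50th and 75th percentiles
> IQR(mtcars$mpg)                            # interquartile range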
Pie Chart:
A pie chart shows a static number and how categories represent part of a whole, i.e., the composition of something. A pie chart represents numbers in percentages, and the total sum of all segments needs to equal 100%.
Extensively used in presentations and offices, Pie Charts help show
proportions and percentages between categories, by dividing a circle into
proportional segments. Each arc length represents a proportion of each
category, while the full circle represents the total sum of all the data, equal to
100%.
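A minimal sketch of a pie chart in base R (the segment values are illustrative and sum to 100%):
> shares <- c(Rent = 30, Food = 25, Travel = 15, Savings = 20, Other = 10)
> pie(shares, main = "Pie Chart")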
Donut Chart:
A Donut Chart somewhat remedies the pie chart's problem of comparing slice areas by de-emphasizing the use of the area. Instead, readers focus more on reading the length of the arcs, rather than comparing the proportions between slices.
Also, Donut Charts are more space-efficient than Pie Charts because the blank
space inside a Donut Chart can be used to display information inside it.
Marimekko Chart:
Marimekko Charts are used to visualise categorical data over a pair of
variables. In a Marimekko Chart, both axes are variable with a percentage
scale, that determines both the width and height of each segment. So
Marimekko Charts work as a kind of two- way 100% Stacked Bar Graph. This
makes it possible to detect relationships between categories and their
subcategories via the two axes.
The main flaws of Marimekko Charts are that they can be hard to read,
especially when there are many segments. Also, it’s hard to accurately make
comparisons between each segment, as they are not all arranged next to each
other along a common baseline. Therefore, Marimekko Charts are better suited
for giving a more general overview of the data.
Chernoff Faces
A way to display variables on a two-dimensional surface, e.g., let x
be eyebrow slant, y be eye size, z be nose length, etc.
The figure shows faces produced using 10 characteristics: head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening. Each characteristic is assigned one of 10 possible values.
Stick Figure
Hierarchical Visualization
Circle Packing
Sunburst Diagram
Also known as a Sunburst Chart, Ring Chart, Multi-level Pie Chart, Belt Chart, or Radial Treemap.
This type of visualisation shows hierarchy through a series of rings, that
are sliced for each category node. Each ring corresponds to a level in
the hierarchy, with the central circle representing the root node and
the hierarchy moving outwards from it.
Rings are sliced up and divided based on their hierarchical relationship
to the parent slice. The angle of each slice is either divided equally
under its parent node or can be made proportional to a value.
Colour can be used to highlight hierarchal groupings or specific
categories.
Treemap:
Treemaps are an alternative way of visualising the hierarchical structure of a Tree Diagram while also displaying quantities for each category via area size. Each category is assigned a rectangle area with their subcategory rectangles nested inside of it.
When a quantity is assigned to a category, its area size is displayed in
proportion to that quantity and to the other quantities within the same
parent category in a part-to-whole relationship. Also, the area size of
the parent category is the total of its subcategories. If no quantity is
assigned to a subcategory, then its area is divided equally amongst
the other subcategories within its parent category.
The way rectangles are divided and ordered into sub-rectangles is
dependent on the tiling algorithm used. Many tiling algorithms have
been developed, but the "squarified algorithm" which keeps each
rectangle as square as possible is the one commonly used.
Ben Shneiderman originally developed Treemaps as a way of
visualising a vast file directory on a computer, without taking up too
much space on the screen. This makes Treemaps a more compact and
space-efficient option for displaying hierarchies, that gives a quick
overview of the structure. Treemaps are also great at comparing the
proportions between categories via their area size.
The downside to a Treemap is that it doesn't show the hierarchal levels
as clearly as other charts that visualise hierarchal data (such as a Tree
Diagram or Sunburst Diagram).
Visualizing Complex Data and Relations
A tag cloud is a visualization of the statistics of user-generated tags. Often, in a tag cloud, tags are listed alphabetically or in a user-preferred order.
The importance of a tag is indicated by font size or color.
Word Cloud:
Colour used on Word Clouds is usually meaningless and is primarily aesthetic,
but it can be used to categorise words or to display another data variable.
Typically, Word Clouds are used on websites or blogs to depict keyword or tag
usage. Word Clouds can also be used to compare two different bodies of text
together.
Although being simple and easy to understand, Word Clouds have some major
flaws:
Long words are emphasised over short words.
Words whose letters contain many ascenders and descenders may receive
more attention.
They're not great for analytical accuracy, so they are used more for aesthetic reasons instead.
Source: https://datavizcatalogue.com/index.html