
Big Data Chapter 3

Uploaded by

sagarmeravi563


What Is Correlation?

• Correlation is a statistical measure of how two or more variables are related to each other.

• These variables can be the input features used to forecast a target variable.

• Two features (variables) can be positively correlated with each other: when the value of one variable increases, the value of the other tends to increase as well.

Correlation is one of the basics of data analysis and an important tool for a data analyst, as it can help identify trends, support predictions and suggest possible root causes for certain phenomena.

There are essentially two types of data you can work with when determining correlation:

Univariate data:

• In a simple setup we work with a single variable.

We measure central tendency to find a representative value, dispersion to measure the deviations around the central tendency, skewness to measure the asymmetry of the distribution, and kurtosis to measure how heavy its tails are (often described as how peaked it is around the centre). Data relating to a single variable is called univariate data.
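As a rough illustration of the four univariate measures named above, the sketch below computes them from population central moments. The function name `describe` and the sample values are made up for this example, and other definitions (for instance sample-corrected skewness) give slightly different numbers.

```python
import math


def describe(values):
    """Return mean, standard deviation, skewness and excess kurtosis.

    Uses population central moments (dividing by n), which is one common
    convention; it is an assumption of this sketch, not the only choice.
    """
    n = len(values)
    mean = sum(values) / n
    # Second, third and fourth central moments.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,                          # central tendency
        "std": math.sqrt(m2),                  # dispersion
        "skewness": m3 / m2 ** 1.5,            # asymmetry (0 if symmetric)
        "excess_kurtosis": m4 / m2 ** 2 - 3.0  # tail weight relative to a normal curve
    }


# Made-up observations of a single variable.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
summary = describe(observations)
print(summary)
```

A positive skewness here reflects the longer tail on the right-hand side of these values.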
Bivariate data:

It often becomes essential in our analysis to study two variables simultaneously.

For example: (a) the height and weight of a person, (b) age and blood pressure, etc.

Statistical data on two characteristics of each individual, measured simultaneously, is termed bivariate data.
Types of correlation:
1. Positive correlation
2. Negative correlation
3. Zero correlation
4. Spurious correlation
5. Perfect positive correlation
6. Perfect negative correlation
Positive correlation:
If an increase in one of the two variables tends to be accompanied by an increase in the other, the two variables are said to be positively correlated.

For example, the height and weight of a person are positively correlated.

Negative correlation:
If an increase in one of the two variables tends to be accompanied by a decrease in the other, the two variables are said to be negatively correlated.
For example, the price and demand of a commodity are negatively correlated: when the price increases, the demand generally goes down.
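One common way to put a number on positive and negative correlation is the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The slides do not name a specific coefficient, so treating Pearson's r as the measure is an assumption of this sketch, and all data values are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: taller people tend to weigh more.
height_cm = [150, 160, 170, 180, 190]
weight_kg = [52, 58, 69, 75, 88]

# Made-up data: demand falls as price rises.
price = [10, 12, 14, 16, 18]
demand = [100, 92, 81, 75, 60]

r_positive = pearson_r(height_cm, weight_kg)  # close to +1
r_negative = pearson_r(price, demand)         # close to -1
r_perfect = pearson_r([1, 2, 3], [2, 4, 6])   # exactly linear, so r is 1

print(r_positive, r_negative, r_perfect)
```

A value near zero would correspond to the zero-correlation case.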
Zero correlation:

If there is no clear-cut trend between the two variables, i.e. a change in one is not accompanied by a consistent change in the other, the two variables are said to be uncorrelated, or to possess zero correlation.

For example, qualities like affection and kindness are in most cases uncorrelated with academic achievement; likewise, a person's intellect is uncorrelated with complexion.
Spurious correlation:

• If the correlation between two variables is due to the influence of some other, ‘third’ variable, the two variables are said to be spuriously correlated.

For example, “body control problems” and clumsiness in children have been reported as being associated with adult obesity. One can probably say that uncontrolled and clumsy kids participate less in sports and outdoor activities, and that is the ‘third’ variable here. Most of the time it is difficult to identify the ‘third’ variable, and even when it is identified, it is more difficult still to gauge the extent of its influence on the two primary variables.
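The effect of a ‘third’ variable can be illustrated with simulated data. In this sketch a hidden variable `z` drives both `x` and `y`, which have no direct link to each other. The first-order partial correlation, a standard formula chosen here for illustration and not taken from the slides, estimates what remains of the x-y correlation once `z` is accounted for. All names and numbers are invented.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 2000

# z is the hidden 'third' variable; x and y each depend on z plus
# independent noise, and do not depend on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

r_xy = pearson_r(x, y)
r_xz = pearson_r(x, z)
r_yz = pearson_r(y, z)

# First-order partial correlation of x and y, controlling for z.
partial_xy_given_z = (r_xy - r_xz * r_yz) / math.sqrt(
    (1 - r_xz ** 2) * (1 - r_yz ** 2)
)

print("raw correlation of x and y:", round(r_xy, 3))
print("after controlling for z:  ", round(partial_xy_given_z, 3))
```

The raw correlation is clearly positive, while the partial correlation is close to zero, which is the signature of a spurious relationship. In practice the third variable is rarely known in advance, as the text notes.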
Regression

Regression is a statistical technique used to model the relationship of a dependent variable with respect to one or more independent variables.

Regression is widely used in statistical analysis problems and is also one of the most important tools in Machine Learning.

In finance, investing, and other disciplines, regression is used to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).

Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.

Regression analysis is the statistical technique that expresses a functional relationship between two or more variables in the form of an equation, so that the value of one variable can be estimated from the given value of another.

The variable whose value is to be estimated is called the dependent variable.

The variable whose value is used to make this estimate is called the independent variable.
Regression Analysis:

Regression analysis is used in statistics to find trends in data.

For example, you might guess that there’s a connection between how much you eat and how much you weigh; regression analysis can help you quantify that.

Regression analysis provides an equation for a graph so that you can make predictions about your data. For example, if you’ve been putting on weight over the last few years, it can predict how much you’ll weigh in ten years’ time if you continue to put on weight at the same rate.

In statistics, it’s hard to stare at a set of numbers in a table and make any sense of it; a fitted equation summarizes the trend.
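The weight example can be sketched with an ordinary least-squares straight line, which is the usual form of simple linear regression. The yearly weights below are invented, and the prediction simply extends the fitted line, so it holds only under the stated assumption that the rate of gain stays the same.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up record: weight measured once a year for five years.
year = [0, 1, 2, 3, 4]                     # independent variable
weight_kg = [70.0, 71.1, 71.9, 73.0, 74.0]  # dependent variable

slope, intercept = fit_line(year, weight_kg)

# Predict ten years after the last observation (year 4 + 10 = 14).
predicted_weight = intercept + slope * 14

print(f"weight = {intercept:.2f} + {slope:.2f} * year")
print(f"predicted weight in year 14: {predicted_weight:.2f} kg")
```

The slope is the estimated gain per year, and the intercept is the fitted weight at year 0.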
Types of Regression Models

Regression models are of two types: Simple and Multiple.

Simple Regression Analysis
• It is used to estimate the relationship between a dependent variable and a single independent variable.
• Regression models that involve one explanatory variable are called simple regression models.
• For example, the relationship between crop yield and rainfall.

Multiple Regression Analysis
• It is used to estimate the relationship between a dependent variable and two or more independent variables.
• When two or more explanatory variables are involved, the relationship is called multiple regression.
• For example, the relationship between the salaries of employees and their experience and education.
• Multiple regression analysis introduces several additional complexities but may produce more realistic results than simple regression analysis.
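The salary example can be sketched as a multiple regression with two explanatory variables, fitted here by solving the least-squares normal equations with a small Gaussian-elimination routine. The helper names and the data are hypothetical; the salaries were generated from an exact rule (20 + 2 × experience + 3 × education) so that the fit can be checked, whereas real data would be noisy and would usually be handled with a statistics library.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination
    with partial pivoting (a is a square list of lists)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Move the row with the largest pivot into place.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] obtained
    from the normal equations (X^T X) b = X^T y."""
    design = [[1.0] + list(r) for r in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: (years of experience, years of education) per employee.
features = [(1, 12), (3, 16), (5, 12), (7, 18), (10, 16), (2, 14)]
salary = [20 + 2 * exp + 3 * edu for exp, edu in features]

coefficients = fit_multiple(features, salary)
print(coefficients)  # intercept, experience coefficient, education coefficient
```

Each coefficient estimates the change in salary for a one-unit change in that variable while the other is held fixed, which is one of the additional complexities of interpretation that multiple regression introduces.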
Data Science Process
•Business Understanding –
• In this first step, we try to get a better idea of which business needs the data should address.
•What kinds of questions should we be asking to help further the business, and to help the business understand what actions it should take based on the trends the data shows?
• This could be open-ended, in that you, as the data scientist, ask questions about the data that you see and find. Or it could be a series of questions that your client specifically wants answered.
•Data Understanding –
•This means getting a business-level idea of the data that you have and understanding what each part of the data means.
• This may involve working out what data is actually needed and the best ways to acquire it.
•This also means finding out what each of the data points signifies in terms of the business.
•For instance, if you’re given a data set from a client, you have to know what each column and row represents. Do rows represent a single customer? Does a column whose heading looks like an acronym have an important relationship with the rest of the data? We can’t really know this without understanding exactly what it means.
•Data Preparation –
• The data preparation part of the process is where most of your time will be spent. Cleaning the data can be more of an art form than a science, since you have to judge whether you have the right data to proceed to a good model and know how to clean it correctly so it won’t corrupt your model. Having reliable data is part of this as well. There’s an old saying, “garbage in, garbage out”: your model won’t be very effective if you’re giving it bad data.
•Modeling –
•Here is where statistics and data analysis come in to create a model that best fits the data.
•You may have to try several models in order to find the one with the best fit.
•The data preparation step can also help in selecting the best model.
•To do that, you may often have to go back to how the data was prepared. There are several ways to handle missing data. Is it safe to just remove the rows? Is there an average we can put in for it? There may even be a better value to put in for the missing ones, depending on the business. All of these can help make the model much better.
•Evaluation –
•This part is where you test whether you have a good model before deploying or presenting it.
•As the diagram indicates, this is also the part where you make sure the model answers the business questions you had at the beginning of this process. It may even uncover further questions that are more important.
•Deployment –
•This is where you share your findings from the data.
•This isn’t limited to having an API to call that uses your model. It could simply be documenting your findings in an email or a shared document, or presenting to a group of executives. While it’s easy to talk technical with your colleagues, relaying what you find in the data to a sales team or to executives so they can take action on it is the key to this step.
•Findings can be shared in many ways, such as (1) by email, (2) with colleagues, (3) through a presentation.
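The two missing-data questions raised under Modeling (is it safe to just remove the rows, or is there an average we can put in?) can be sketched as follows. The records, field names and helper functions are hypothetical, and which option is appropriate depends on the business context, as the text says.

```python
# Made-up customer records; None marks a missing value.
records = [
    {"age": 25, "income": 40.0},
    {"age": 31, "income": None},
    {"age": 47, "income": 52.0},
    {"age": 38, "income": 46.0},
]


def drop_missing(rows, key):
    """Option 1: remove every row where `key` is missing."""
    return [r for r in rows if r[key] is not None]


def impute_mean(rows, key):
    """Option 2: replace missing values of `key` with the mean
    of the values that are present."""
    known = [r[key] for r in rows if r[key] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{key: mean if r[key] is None else r[key]}) for r in rows]


dropped = drop_missing(records, "income")
imputed = impute_mean(records, "income")

print(len(dropped), "rows kept after dropping")
print("imputed income for row 2:", imputed[1]["income"])
```

Dropping rows loses the other fields of those records, while mean imputation keeps every row but shrinks the variability of the filled-in column; both functions leave the original list unchanged.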
Phases of Data Analytics
1. Discovery:
• The discovery step involves acquiring data from all the identified internal and external sources that can help you answer the business question.
The data can be:
•Logs from web servers
•Data gathered from social media
•Census datasets
•Data streamed from online sources using APIs
2. Data Preparation:
• Data can have many inconsistencies, such as missing values, blank columns and incorrect data formats, which need to be cleaned.
• You need to process, explore, and condition data before modeling.
• The cleaner your data, the better your predictions.
3. Model Planning:
• In this stage, you need to determine the methods and techniques to use to find the relationships between input variables.
• Planning for a model is performed by using different statistical formulas and visualization tools.
• SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
4. Model Building:
• In this step, the actual model building process starts.
• Here, the data scientist splits the dataset into training and testing sets.
• Techniques like association, classification, and clustering are applied to the training data set.
• Once prepared, the model is tested against the "testing" dataset.
5. Operationalize:
• In this stage, you deliver the final baselined model with reports, code, and technical documents.
• The model is deployed into a real-time production environment after thorough testing.
6. Communicate Results:
• In this stage, the key findings are communicated to all stakeholders.
• This helps you to decide whether the results of the project are a success or a failure, based on the outputs of the model.
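The training/testing split described under Model Building might look like the following sketch. The function name, the 25% test fraction and the fixed seed are assumptions made for illustration; the slides do not prescribe a ratio or a tool.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in dataset: 20 record identifiers.
dataset = list(range(20))
train_set, test_set = split_train_test(dataset)

print(len(train_set), "training records;", len(test_set), "testing records")
```

The model is fitted on `train_set` only and then scored on `test_set`, so the test score reflects data the model has not seen.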
Uses of Regression Analysis

1. Predictive Analytics:

Predictive analytics, i.e. forecasting future opportunities and risks, is the most prominent application of regression analysis (RA) in business.

Demand analysis, for instance, predicts the number of items a consumer will probably purchase. However, demand is not the only dependent variable when it comes to business.
RA can go far beyond forecasting the impact on direct revenue.

E.g. insurance companies rely heavily on regression analysis to estimate the credit standing of policyholders and the likely number of claims in a given time period.
2. Operational Efficiency:

Regression models can also be used to optimize business processes.

In a call center, for example, we can analyze the relationship between the wait times of callers and the number of complaints.

Such analysis improves business performance by highlighting the areas that have the maximum impact on operational efficiency and revenues.
3. Supporting Decisions:

• Businesses today are overloaded with data on finances, operations and customer purchases. Increasingly, executives are leaning on data analytics to make informed business decisions.

• RA can bring a scientific angle to the management of any business.
• By reducing the tremendous amount of raw data to actionable information, regression analysis leads the way to smarter and more accurate decisions.
• This technique is a useful tool for testing a hypothesis before diving into execution.
4. Correcting Errors:

• Regression is not only useful for lending empirical support to management decisions but also for identifying errors in judgment.

For example, a retail store manager may believe that extending shopping hours will greatly increase sales. RA, however, may indicate that the increase in revenue might not be sufficient to cover the rise in operating expenses due to longer working hours (such as additional employee labor charges). Hence, this analysis can provide quantitative support for decisions and prevent mistakes caused by a manager’s intuition.
5. New Insights:

Over time, businesses have gathered a large volume of unorganized data that has the potential to yield valuable insights. However, this data is useless without proper analysis.
RA techniques can find relationships between different variables by uncovering patterns that were previously unnoticed.
For example, analysis of data from point-of-sale systems and purchase accounts may highlight market patterns such as an increase in demand on certain days of the week or at certain times of the year. By acting on these insights, you can maintain optimal stock and personnel before a spike in demand arises.
