
Big Data Chapter 3

Uploaded by

sagarmeravi563

What Is Correlation?

• Correlation is a statistical measure of how closely two
variables move together.

• It describes how two or more variables are related to each
other. These variables can be input features that are used
to forecast a target variable.
• Two features (variables) are positively correlated when an
increase in the value of one tends to be accompanied by an
increase in the value of the other.
Correlation is one of the basics of data analysis and an
important tool for a data analyst: it helps to identify trends,
make predictions and point to possible root causes of a
phenomenon.
There are essentially two types of data you can work with
when studying correlation:

Univariate Data:

• In a simple setup we work with a single variable. We
measure central tendency to find a representative value,
dispersion to measure the spread of the data around that
value, skewness to measure the asymmetry of the
distribution, and kurtosis to measure how heavy its tails are
(often described as how peaked it is around the centre).
Data relating to a single variable is called univariate data.
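
The four univariate summaries named above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the function name `describe` and the sample `heights_cm` are assumptions made up for the example, and the skewness and kurtosis use the simple moment-based (population) definitions.

```python
from statistics import mean, pstdev


def describe(values):
    """Return mean, population standard deviation, skewness and excess kurtosis."""
    n = len(values)
    m = mean(values)
    sd = pstdev(values)
    # Standardised third and fourth central moments.
    skewness = sum((v - m) ** 3 for v in values) / (n * sd ** 3)
    excess_kurtosis = sum((v - m) ** 4 for v in values) / (n * sd ** 4) - 3
    return {
        "mean": m,
        "std_dev": sd,
        "skewness": skewness,
        "excess_kurtosis": excess_kurtosis,
    }


# Hypothetical sample, symmetric around 170 so the skewness is zero.
heights_cm = [150, 160, 170, 180, 190]
summary = describe(heights_cm)
print(summary)
```

A skewness of zero indicates a symmetric sample; a negative excess kurtosis indicates lighter tails than a normal distribution.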
Bivariate Data:

• It often becomes essential in an analysis to study two
variables simultaneously, for example (a) the height and
weight of a person, or (b) age and blood pressure.
• Statistical data on two characteristics of the same
individual, measured simultaneously, is termed bivariate data.
Types of correlation:
1. Positive correlation
2. Negative correlation
3. Zero correlation
4. Spurious correlation
5. Perfect positive correlation
6. Perfect negative correlation
Positive correlation:
If an increase in one of the two variables is accompanied by
an increase in the other, the two variables are said to be
positively correlated.

For example, the height and weight of a person are
positively correlated.
Negative correlation:
If an increase in one of the two variables is accompanied by
a decrease in the other, the two variables are said to be
negatively correlated.
For example, the price and demand of a commodity are
negatively correlated: when the price increases, the
demand generally goes down.
Zero correlation:

If there is no clear-cut trend between the two variables, i.e.,
a change in one is not accompanied by any consistent
change in the other, the two variables are said to be
uncorrelated, or to have zero correlation.

For example, qualities such as affection and kindness are in
most cases uncorrelated with academic achievement;
likewise, a person’s intellect is uncorrelated with their
complexion.
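
The positive, negative and zero cases can be made concrete with the Pearson correlation coefficient, which ranges from -1 (perfect negative) to +1 (perfect positive). The sketch below is only an illustration: the helper `pearson_r` and the toy data are assumptions, not taken from the slides.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])   # y rises with x
r_negative = pearson_r(x, [-3 * v + 5 for v in x])  # y falls as x rises
r_zero = pearson_r(x, [v * v for v in x])           # no linear trend

print(r_positive, r_negative, r_zero)
```

The first two results are the perfect positive and perfect negative cases (+1 and -1). The third is zero even though y depends on x, which shows that the coefficient only captures linear trends.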
Spurious correlation:

• If the correlation between two variables arises from the
influence of some other ‘third’ variable, the two variables
are said to be spuriously correlated.

For example, “body control problems” and clumsiness in
children have been reported to be associated with adult
obesity. One plausible explanation is that clumsy children
participate less in sports and outdoor activities, and that
lower level of activity is the ‘third’ variable here. It is often
difficult to identify the ‘third’ variable, and even when it is
identified, it is harder still to gauge the extent of its
influence on the two primary variables.
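
One way to see the effect of a ‘third’ variable is a small simulation. The sketch below uses purely synthetic numbers (it says nothing about the real clumsiness and obesity data): two variables `a` and `b` are both driven by a hidden variable `third`, so they correlate with each other, and the correlation largely disappears once the influence of `third` is removed from each. The helper names and the simulation design are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))


def residuals(ys, zs):
    """What is left of ys after removing a least-squares straight-line fit on zs."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    slope = sum((z - mz) * (y - my) for z, y in zip(zs, ys)) / sum((z - mz) ** 2 for z in zs)
    return [(y - my) - slope * (z - mz) for y, z in zip(ys, zs)]


rng = random.Random(0)  # fixed seed so the run is repeatable
third = [rng.gauss(0, 1) for _ in range(5000)]   # hidden common cause
a = [t + rng.gauss(0, 1) for t in third]         # depends on third plus noise
b = [t + rng.gauss(0, 1) for t in third]         # depends on third plus noise

raw_r = pearson_r(a, b)
# Correlation between a and b after the effect of `third` is taken out of both.
partial_r = pearson_r(residuals(a, third), residuals(b, third))
print(round(raw_r, 2), round(partial_r, 2))
```

In practice the difficulty noted above remains: this check is only possible when the third variable has been identified and measured.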
Regression

Regression is a statistical technique used to model the
relationship of a dependent variable with respect to one or
more independent variables.

Regression is widely used in statistical analysis and is also
one of the most important tools in Machine Learning.

It is used in finance, investing and other disciplines to
determine the strength and character of the relationship
between one dependent variable (usually denoted by Y)
and a series of other variables (known as independent
variables).

Regression helps investment and financial managers to
value assets and understand the relationships between
variables, such as commodity prices and the stocks of
businesses dealing in those commodities.
Regression analysis is the statistical technique that expresses a functional
relationship between two or more variables in the form of an equation, so that the
value of one variable can be estimated from the given value of another.

The variable whose value is to be estimated is called the dependent variable.

The variable whose value is used to make the estimate is called the independent
variable.
Regression Analysis:

Regression analysis is used in statistics to find trends in data.

For example, you might guess that there’s a connection
between how much you eat and how much you weigh;
regression analysis can help you quantify that.

Regression analysis provides an equation for a graph so
that you can make predictions about your data. For
example, if you’ve been putting on weight over the last few
years, it can predict how much you’ll weigh in ten years’
time if you continue to put on weight at the same rate.
It is hard to stare at a table of raw numbers and make sense
of it; a fitted equation summarises the trend in a form that is
easier to read.
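
The weight example can be sketched as a least-squares straight line. This is a minimal illustration under stated assumptions: the helper `fit_line` and the yearly weights are hypothetical, and the data is deliberately exact (1 kg gained per year) so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical measurements: year index and weight in kg.
years = [0, 1, 2, 3, 4]
weights_kg = [70.0, 71.0, 72.0, 73.0, 74.0]

slope, intercept = fit_line(years, weights_kg)
# Predict ten years after the last measurement (year 4 + 10 = year 14).
predicted_kg = intercept + slope * 14
print(slope, intercept, predicted_kg)
```

Here the fitted equation is weight = 70 + 1 × year, giving 84 kg at year 14. Such an extrapolation only holds if the past rate really does continue.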
Types of Regression Models

Regression models are of two types: simple and multiple.
Simple Regression Analysis:
• It is used to estimate the relationship between a dependent variable and a single
independent variable.
• Regression models that involve one explanatory variable are called simple regression
models.
• For example, the relationship between crop yield and rainfall.

Multiple Regression Analysis:
• It is used to estimate the relationship between a dependent variable and two or more
independent variables.
• When two or more explanatory variables are involved, the model is called multiple
regression.
• For example, the relationship between the salaries of employees and their experience
and education.
• Multiple regression analysis introduces several additional complexities but may produce
more realistic results than simple regression analysis.
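
A multiple regression with two explanatory variables can be sketched by solving the least-squares normal equations directly. Everything here is an assumption made for illustration: the helpers `solve` and `fit_multiple`, the employee records, and the salary formula used to generate the data (30 + 2 × experience + 3 × education, in thousands). The formula is synthetic so that the recovered coefficients can be checked.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 is the intercept term
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical records: (years of experience, years of education).
employees = [(1, 12), (3, 16), (5, 12), (7, 18), (10, 16), (12, 14)]
# Synthetic salaries (in thousands) generated from a known formula.
salaries = [30 + 2 * exp + 3 * edu for exp, edu in employees]

intercept, b_experience, b_education = fit_multiple(employees, salaries)
print(intercept, b_experience, b_education)
```

The fit recovers the generating coefficients (about 30, 2 and 3). With real data a numerical library would normally be used instead of hand-written elimination, and the explanatory variables must not be exact linear combinations of one another.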
Data Science Process
•Business Understanding –
• In this first step, we try to get a better idea of which business
needs the data should address.
• What kinds of questions should we ask to help the business move
forward and understand what actions it should take based on the
trends that the data shows?
• This can be open ended, in that you, as the data scientist, ask
questions about the data that you see and find. Or it can be a
series of specific questions that your client wants answered.
•Data Understanding –
• This means getting a business-level idea of the data that you have
and understanding what each part of the data means.
• It may involve working out what data is needed and the best ways
to acquire it.
• It also means finding out what each of the data points signifies
in terms of the business.
• For instance, if you’re given a data set by a client, you have to
know what each column and row represents. Do rows represent a
single customer? Does a column whose heading looks like an
acronym have a strong relationship with the rest of the data? We
can’t really know this without understanding exactly what it means.
•Data Preparation –
• The data preparation part of the process is where most
of your time will be spent. Cleaning the data can be more
of an art than a science, since you have to judge whether
you have the right data to build a good model and know
how to clean it so that it won’t corrupt your model.
Having reliable data is part of this as well. As the old
saying goes, “garbage in, garbage out”: your model won’t
be very effective if you give it bad data.
•Modeling –
• This is where statistics and data analysis come in, to
create a model that best fits the data.
• You may have to try several models in order to find the
one with the best fit.
• Choosing the best model often means going back to the
data preparation step and revisiting how the data was
prepared. There is more than one way to handle missing
data. Is it safe to just remove the rows? Is there an
average we can put in instead? There may even be a
better value to fill in, depending on the business. All of
these choices can help make the model much better.
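
The two options for missing data mentioned above, removing the rows or filling in an average, can be sketched as follows. The column `ages` and its values are hypothetical, and `None` is assumed to mark a missing entry.

```python
from statistics import mean

# Hypothetical column with two missing values.
ages = [34, None, 29, 41, None, 36]

# Option 1: drop the incomplete entries.
dropped = [a for a in ages if a is not None]

# Option 2: fill each gap with the mean of the observed values.
observed_mean = mean(dropped)
imputed = [a if a is not None else observed_mean for a in ages]

print(dropped)
print(imputed)
```

Dropping keeps only genuine observations but shrinks the data set; filling in the mean keeps every row but understates the spread of the column. Which is safer depends on how much data is missing and why.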
•Evaluation –
• This is where you test whether you have a good model
before deploying or presenting it.
• As the diagram indicates, this is also where you make
sure the model answers the business questions you had at
the beginning of the process. It may even uncover new
questions that are more important.
•Deployment –
• This is where you share your findings from the data.
• This isn’t limited to exposing your model through an API.
It could simply mean documenting your findings in an
email or a shared document, or presenting them to a group
of executives. While it’s easy to talk in technical terms with
your colleagues, the key to this step is relaying what you
found in the data to a sales team or to executives so that
they can act on it.
• Findings can be shared in several ways: 1) by email,
2) with colleagues, 3) in a presentation.
Phases Of Data Analytics
1. Discovery:
The discovery step involves acquiring data from all identified internal and external
sources that help you answer the business question.
The data can be:
• Logs from web servers
• Data gathered from social media
• Census datasets
• Data streamed from online sources using APIs
2. Data Preparation:
• Data can have many inconsistencies, such as missing values, blank columns and
incorrect data formats, which need to be cleaned.
• You need to process, explore, and condition the data before modeling.
• The cleaner your data, the better your predictions.
3. Model Planning:
• In this stage, you determine the methods and techniques to use to establish the
relationships between input variables.
• Model planning is carried out using various statistical formulas and visualization
tools.
• SQL Analysis Services, R, and SAS/ACCESS are some of the tools used for this purpose.
4. Model Building:
• In this step, the actual model building process starts.
• Here, the data scientist splits the data into training and testing sets.
• Techniques such as association, classification, and clustering are applied to the
training data set.
• Once built, the model is tested against the testing data set.
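
The split into training and testing sets in the model building step can be sketched as a shuffled partition. The helper `split_rows`, the 80/20 ratio and the fixed seed are assumptions chosen for illustration; the slides do not prescribe a ratio.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = rows[:]                      # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for 100 data rows
train, test = split_rows(records)
print(len(train), len(test))
```

The model is fitted on `train` only; `test` is held back so that the evaluation uses rows the model has not seen.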
5. Operationalize:
• In this stage, you deliver the final baselined model with reports, code, and
technical documents.
• The model is deployed into a real-time production environment after thorough
testing.
6. Communicate Results:
• In this stage, the key findings are communicated to all stakeholders.
• This helps you decide whether the project is a success or a failure, based on the
results from the model.
Uses of Regression Analysis

1. Predictive Analytics:

Predictive analytics, i.e. forecasting future opportunities
and risks, is the most prominent application of regression
analysis in business.

Demand analysis, for instance, predicts the number of
items that a consumer will probably purchase. However,
demand is not the only dependent variable when it comes
to business.
Regression analysis (RA) can go far beyond forecasting
the impact on direct revenue.

For example, insurance companies rely heavily on
regression analysis to estimate the credit standing of
policyholders and the likely number of claims in a given
time period.
2. Operational Efficiency:

Regression models can also be used to optimize business
processes.

In a call center, for example, we can analyze the
relationship between the wait times of callers and the
number of complaints.

This improves business performance by highlighting the
areas that have the greatest impact on operational
efficiency and revenue.
3. Supporting Decisions:

• Businesses today are overloaded with data on
finances, operations and customer purchases.
Increasingly, executives are leaning on data analytics
to make informed business decisions.

• RA can bring a scientific angle to the management of
any business.
• By reducing the tremendous amount of raw data to
actionable information, regression analysis leads the
way to smarter and more accurate decisions.
• This technique is a useful tool for testing a
hypothesis before diving into execution.
4. Correcting Errors:

• Regression is useful not only for lending empirical
support to management decisions but also for
identifying errors in judgment.

For example, a retail store manager may believe that
extending shopping hours will greatly increase sales. RA,
however, may indicate that the increase in revenue would
not be sufficient to cover the rise in operating expenses
due to longer working hours (such as additional employee
labor charges). Hence, this analysis can provide
quantitative support for decisions and prevent mistakes
that stem from a manager’s intuition.
5. New Insights:

Over time, businesses have gathered large volumes of
unorganized data that have the potential to yield valuable
insights. However, this data is of little use without proper
analysis.
RA techniques can find relationships between different
variables by uncovering patterns that were previously
unnoticed.
For example, analysis of data from point-of-sale systems
and purchase accounts may highlight market patterns such
as an increase in demand on certain days of the week or at
certain times of the year. By acting on these insights, you
can maintain optimal stock and staffing before a spike in
demand arises.
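
A first look for a day-of-week pattern can be as simple as averaging sales per weekday, before any regression model is fitted. The sketch below is a plain aggregation, not a regression, and the records in `sales` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical point-of-sale records: (weekday, units sold).
sales = [
    ("Mon", 20), ("Tue", 22), ("Sat", 55),
    ("Mon", 24), ("Tue", 18), ("Sat", 65),
]

# Group the units sold by weekday.
by_day = defaultdict(list)
for day, units in sales:
    by_day[day].append(units)

average_by_day = {day: mean(units) for day, units in by_day.items()}
busiest_day = max(average_by_day, key=average_by_day.get)
print(average_by_day, busiest_day)
```

With these made-up numbers Saturday stands out, which is the kind of pattern that would justify extra stock and staff on that day.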
