Unit-3 DS
DATA ANALYSIS
Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases,
patterns, ranges, and distributions of values within the data. This exploration drives
hypothesis generation for A/B testing. It also allows analysts to determine the data's
relevance for use in modelling efforts for predictive analytics, machine learning, and/or
deep learning. Depending on a model's accuracy, organizations can come to rely on these
insights for business decision making, allowing them to scale more effectively.
As a branch of science, statistics incorporates data acquisition, data interpretation, and
data validation, and statistical data analysis is the approach of carrying out various
statistical operations, i.e. thorough quantitative research that attempts to quantify data by
applying some form of statistical analysis. Here, quantitative data typically includes
descriptive data such as survey data and observational data.
In the context of business applications, it is a crucial technique for business intelligence
organizations that need to operate on large data volumes.
The basic goal of statistical data analysis is to identify trends. In the retail business, for
example, it can be used to uncover patterns in unstructured and semi-structured consumer
data that support better decisions for enhancing the customer experience and increasing sales.
Apart from that, statistical data analysis has various applications, including statistical
analysis of market research, business intelligence (BI), data analytics in big data, machine
learning and deep learning, and financial and economic analysis.
Data comprises variables that are either univariate or multivariate, and depending on the
number of variables, experts apply different statistical techniques.
If the data has a single variable, univariate statistical data analysis can be conducted,
including the t-test for significance, the z-test, the F-test, one-way ANOVA, etc.
If the data has many variables, multivariate techniques can be performed, such as
discriminant analysis and other multivariate statistical methods.
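For illustration, a univariate test can be run in a few lines of Python. The sketch below uses scipy.stats with a made-up dataset and a made-up hypothesised mean of 12.0; it is not part of the original notes.

from scipy import stats

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical measurements
t_stat, p_value = stats.ttest_1samp(data, popmean=12.0)  # H0: population mean = 12.0
print("t =", round(t_stat, 3), "p =", round(p_value, 3))
# a small p-value (e.g. < 0.05) would suggest the mean differs from 12.0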
Here, a variable is a characteristic that changes from one individual of a population to
another. The image below shows the classification of data variables.
Classification of variables
The discrete data can be counted and has a certain number of values, e.g. the number of
bulbs, the number of people in a group, etc.
Under statistical data analysis, continuous data is described by a continuous
distribution function, also known as the probability density function, and discrete data is
described by a discrete distribution function, also termed the probability mass function.
Data can be either quantitative or qualitative.
Qualitative data are labels or names used to identify a characteristic of each element,
whereas quantitative data are always numbers that indicate either how much or how many.
Under statistical data analysis, cross-sectional and time-series data are important. By
definition, cross-sectional data are data collected at the same time or at roughly the same
point in time, whereas time-series data are data gathered across several time periods.
These tools allow extensive data-handling capabilities and several statistical analysis methods
that can examine anything from a small chunk of data to very comprehensive data statistics.
Though computers serve as an important aid in statistical data analysis by helping to
summarize data, statistical data analysis concentrates on the interpretation of the results in
order to draw inferences and predictions.
Descriptive Statistics:
Descriptive statistics attempts to illustrate the relationship between variables in a sample or
population and gives a summary in the form of mean, median and mode.
Inferential Statistics:
This method is used for drawing conclusions from a data sample by using null and
alternative hypotheses that are subject to random variation.
Also, probability distribution, correlation testing and regression analysis fall into this
category. In simple words, inferential statistics employs a random sample of data, taken
from a population, to make and explain inferences about the whole population.
The table below shows the key differences between descriptive statistics and inferential
statistics:
Descriptive statistics: arranges, analyzes and presents the data in a meaningful form;
concluding outcomes are represented in the form of charts, tables and graphs.
Inferential statistics: correlates, tests and anticipates future outcomes; final outcomes are
probability scores.
A precise and accurate definition of the problem is imperative for obtaining accurate data
concerning it. It becomes extremely difficult to collect data without knowing the exact
definition of the problem.
After the specific problem has been addressed, designing ways to collect the data is an
important task in statistical data analysis.
Data can be collected from existing sources or obtained through observational and
experimental research studies conducted to gather new data.
In an experimental study, the variable of interest is identified according to the defined
problem; then one or more factors in the study are controlled in order to obtain data about
how they affect other variables.
In an observational study, no attempt is made to control or influence the variable of interest.
A survey is a common example of an observational study.
Under statistical data analysis, the analysis methods are divided into two categories:
Exploratory methods are used to discover what the data is revealing, using simple arithmetic
and easy-to-draw graphs to summarize the data.
Confirmatory methods apply concepts and ideas from probability theory in order to try to
answer specific questions.
Probability is extremely important in decision making because it provides a mechanism for
estimating, representing, and explaining the uncertainties associated with future events.
4. Reporting the outcomes
Median
Here, 12 is the middle or median number, with 6 values above it and 6 values below it.
Now, consider another example with an even number of observations that are arranged in
descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and 17
When you look at the given dataset, the two middle values obtained are 27 and 29.
Now, find out the mean value for these two numbers.
i.e., (27 + 29)/2 = 28
Therefore, the median for the given data distribution is 28.
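As a quick check, the same median can be computed with Python's built-in statistics module:

import statistics

values = [40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, 17]
print(statistics.median(values))  # 28.0, the mean of the two middle values 27 and 29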
Mode
The mode represents the most frequently occurring value in the dataset. Sometimes a dataset
may contain multiple modes, and in some cases it may not contain any mode at all.
Consider the given dataset: 5, 4, 2, 3, 2, 1, 5, 4, 5
Since the mode represents the most common value, the most frequently repeated value in the
given dataset is 5.
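The mode of the example dataset can be verified the same way; statistics.multimode would also list every mode if the data had several.

import statistics

data = [5, 4, 2, 3, 2, 1, 5, 4, 5]
print(statistics.mode(data))       # 5, the single most frequent value
print(statistics.multimode(data))  # [5]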
Based on the properties of the data, the measures of central tendency are selected.
If you have a symmetrical distribution of continuous data, all three measures of
central tendency hold good. But most of the time, the analyst uses the mean because
it involves all the values in the distribution or dataset.
If you have a skewed distribution, the best measure of central tendency is
the median.
If you have ordinal data, then both the median and mode are good choices for
measuring the central tendency.
If you have categorical data, the mode is the best choice to find the central tendency.
Central Limit Theorem
The Central Limit Theorem (CLT) states that, for any data, provided a sufficiently large
number of samples has been taken, the following properties hold:
Sampling Distribution Mean(μₓ¯) = Population Mean(μ)
Sampling distribution’s standard deviation (Standard error) = σ/√n ≈S/√n
For n > 30, the sampling distribution becomes a normal distribution.
Let’s verify the properties of CLT in Python through Jupyter Notebook.
For the following Python code, we’ll use the datasets of Population and Random
Values, which we can find here.
First, import necessary libraries into Jupyter Notebook.
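The import cell is not reproduced in these notes; a minimal sketch of the imports assumed by the rest of the walkthrough is:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

np.random.seed(42)  # fix the random seed so the sampling results are reproducible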
We import all the necessary packages used in the code that follows. Since we are going to
sample the data randomly, we set a random seed with np.random.seed(42) so that the
analysis is reproducible.
Now, let’s read the dataset we are dealing with,
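A sketch of the read step, assuming the Population dataset is stored in a CSV file (the file name weight_population.csv is a placeholder, not the actual dataset name):

df = pd.read_csv("weight_population.csv")  # placeholder file name
df.head()                                  # preview the first few rows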
The dataset looks like this,
Population Dataset
Let’s extract the ‘Weight’ column from the dataset and see the distribution of that column.
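One way this step could look (sns.histplot is used here as an assumption; the original notebook may have plotted the distribution differently):

weight = df["Weight"]           # extract the 'Weight' column
sns.histplot(weight, kde=True)  # histogram with a KDE curve to show the distribution
plt.show()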
This weight column and its distribution graph looks like this,
As we can see, the chart is close to a Normal Distribution graph.
Let’s also find out the mean and standard deviation of the weight column through code.
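A minimal way to obtain these two values:

print("Mean =", df.Weight.mean())      # mean of the Weight column
print("Std. Dev. =", df.Weight.std())  # standard deviation of the Weight column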
Mean = 220.67326732673268
Std. Dev. = 26.643110470317723
These values are the exact Mean and Standard Deviation values of the Weight Column.
Now, let’s start sampling the data.
First, we'll take a sample of 30 members from the data. The reason is that, after repeatedly
sampling observations, we want to check whether the sampling distribution follows a normal
distribution or not.
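The sampling cell is not shown in these notes; presumably it looks something like the sketch below.

samp_size = 30                      # sample size used throughout the walkthrough
df.Weight.sample(samp_size).mean()  # mean of one random sample of 30 weights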
The mean value for the above sample = 222.1, which is greater than the actual mean of
220.67. Let’s rerun the code,
df.Weight.sample(samp_size).mean()
The mean value for the above sample = 220.5, which is almost equal to the original mean. If
we rerun the code, we’ll get the mean value = 221.6
Each time we take a sample, the mean is different. There is variability in the sample mean
itself. Let’s move ahead and find out if the sample mean follows a distribution.
Instead of taking one sample mean at a time, we’ll take about 1000 such sample means and
assign it to a variable.
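A sketch of how those 1000 sample means could be generated and stored:

# draw 1000 random samples of size 30 and record the mean of each one
sample_means = [df.Weight.sample(samp_size).mean() for _ in range(1000)]
sample_means = pd.Series(sample_means)  # a Series provides .mean() and .std() directly
len(sample_means)                       # 1000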
We converted sample_means into a pandas Series object because a plain list does not
provide mean and standard deviation methods.
The total number of samples = 1000
Now, we have 1000 sample means with us. Let's plot their distribution graph
using seaborn.
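For example (again assuming sns.histplot as the plotting call):

sns.histplot(sample_means, kde=True)  # distribution of the 1000 sample means
plt.show()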
The distribution plot looks like this,
As we can observe, the above distribution looks approximately like Normal Distribution.
The other thing we need to check here is the Samples Mean and Standard Deviation.
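These can be read straight off the Series:

print("Samples Mean =", sample_means.mean())
print("Samples Std =", sample_means.std())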
Samples Mean = 220.6945, which is very close to the original mean of 220.67, and
Samples Std = 4.641450507418211.
Let's see the relation between the standard deviation of the sample means and the standard
deviation of the actual data.
When we divide the standard deviation of the original data by the square root of the sample size,
df.Weight.std()/np.sqrt(samp_size)
we get the value of the above code = 4.86.
The value is close to the sample_means.std().
So, from the above code, we can infer that:
Sampling distribution’s mean (μₓ¯) = Population mean (μ)
Sampling distribution’s standard deviation (standard error) = σ/√n
So far, the original data of the "Weight" column was itself approximately normally
distributed. Let's see whether the sampling distribution will still follow a normal distribution
even if the original data is not normally distributed.
We’ll take another data set that contains some random values and plot the values in a
distribution graph.
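A sketch of reading and plotting the second dataset (the file name random_values.csv is a placeholder; the column name Value is taken from the code shown further below):

df1 = pd.read_csv("random_values.csv")  # placeholder file name for the Values dataset
sns.histplot(df1["Value"], kde=True)    # distribution of the 'Value' column
plt.show()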
The Dataset and the graph looks like this,
As we can see, the Values column does not resemble the Normal Distribution graph. It looks
somewhat like an exponential distribution.
Let’s pick samples from this distribution, calculate their means, and plot the sampling
distribution.
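Repeating the earlier sampling sketch on this non-normal column:

# 1000 sample means, each from a random sample of size 30 of the 'Value' column
sample_means = pd.Series([df1.Value.sample(samp_size).mean() for _ in range(1000)])
sns.histplot(sample_means, kde=True)  # sampling distribution of the mean
plt.show()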
Now, the distribution graph for the samples looks like,
Surprisingly, the distribution of the sample_means obtained from the Values column, whose
own distribution is far from normal, is still very much a normal distribution.
Let’s compare the sample_means Mean value to its parent Mean value.
sample_means.mean()
# The output will be: 130.39213999999996
df1.Value.mean()
# The output is: 130.4181654676259
As we can see, the sample_means mean value and original dataset’s mean value are both
similar.
Similarly, the standard deviation of the sample means is sample_means.std() = 13.263962580003142.
That value should be quite close to df1.Value.std()/np.sqrt(samp_size) = 14.060457446377631.
Let's compare the distribution graphs of each dataset with its corresponding sampling
distribution.
As we can see, irrespective of the original dataset’s distribution, the sampling distribution
resembles the Normal Distribution Curve.
There’s only one thing to consider now, i.e., Sample Size. We’ll observe that, as the sample
size increases, the sampling distribution will approximate a normal distribution even more
closely.
Effect of Sample Size on the Sampling Distribution
Let's create samples of different sizes and plot the corresponding distribution graphs.
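A sketch of how this comparison could be coded, looping over the sample sizes used in the figure below:

sizes = [3, 10, 30, 50, 100, 200]
fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for ax, n in zip(axes.flatten(), sizes):
    # sampling distribution of the mean for the current sample size
    means = pd.Series([df1.Value.sample(n).mean() for _ in range(1000)])
    sns.histplot(means, kde=True, ax=ax)
    ax.set_title("Sample size = " + str(n))

plt.tight_layout()
plt.show()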
Now, the Distribution Graph for Sample Sizes of 3, 10, 30, 50, 100, 200 looks like,
Distribution of Different Sample Sizes
As we can observe, the distribution graphs for sample sizes 3 and 10 do not resemble a
normal distribution. However, from sample size 30 onward, as the sample size increases, the
sampling distribution resembles a normal distribution more and more closely.
As a rule of thumb, we can say that a sample size of 30 or above is ideal for concluding that
the sampling distribution is nearly normal, and further inferences can be drawn from it.
Through this Python code, we can conclude that the following three properties of the CLT hold:
Sampling Distribution Mean(μₓ¯) = Population Mean(μ)
Sampling distribution’s standard deviation (Standard error) = σ/√n
For n > 30, the sampling distribution becomes a normal distribution.
Estimating Mean Using CLT
Suppose we want to estimate the mean commute time of 30000 employees. The population
mean commute time (μ) = 36.6 (the sample mean) ± some margin of error. We can find this
margin of error using the CLT (central limit theorem). Now that we know what the CLT is,
let's see how we can find the margin of error.
Let's say the mean commute time of a sample of 100 employees is X¯ = 36.6 min, and the
standard deviation of the sample is S = 10 min. Using the CLT, we can infer that:
Sampling Distribution Mean(μₓ¯) = Population Mean(μ)
Sampling Distributions’ Standard Deviation = σ/√n ≈S/√n = 10/√100 = 1
Since the sampling distribution is a normal distribution,
P(μ-2 < X¯ < μ+2) = 95.4%; we get this value from the 1-2-3 rule (68-95-99.7 rule) of the
normal distribution curve.
Substituting X¯ = 36.6: P(μ-2 < 36.6 < μ+2) = P(36.6-2 < μ < 36.6+2) = 95.4%
You can find the standard distribution curve, Z-Table, and its properties in my previous article,
“Inferential Statistics.”
Now, we can say that there is a 95.4% probability that the Population Mean(μ) lies between
(36.6–2, 36.6+2). In other words, we are 95.4% confident that the error in estimating the mean
≤ 2.
Hence, the probability associated with the claim is called the confidence level (here it is 95.4%).
The maximum error made in the sample mean is called the margin of error (here it is 2 min).
The final interval of values is called the confidence interval (here it is (34.6, 38.6)).
We can generalize this concept in the following manner.
Let’s say that we have a sample with sample size n, mean X¯, and standard deviation S. Now,
the y% confidence interval (i.e., the confidence interval corresponding to a y% confidence
level) for μ would be given by the range:
Confidence interval = (X¯ - (Z* S/√n), X¯ + (Z* S/√n))
where Z* is the Z-score associated with a y% confidence level.
Some commonly used Z* values are given below:
Z* Values
That is how we calculate the margin of error and estimate the mean of the whole population
with the help of samples.
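As a sketch, the commute-time interval above can be reproduced with scipy, which also gives the Z* value for any confidence level; the numbers 36.6, 10 and 100 come from the example, everything else is illustrative.

import numpy as np
from scipy import stats

x_bar, s, n = 36.6, 10, 100  # sample mean, sample standard deviation, sample size
confidence = 0.954           # the 95.4% confidence level used in the example

z_star = stats.norm.ppf(1 - (1 - confidence) / 2)  # two-sided critical value, about 2
margin = z_star * s / np.sqrt(n)                   # margin of error, about 2 minutes

print("Z* =", round(z_star, 2))
print("Confidence interval =", (round(x_bar - margin, 2), round(x_bar + margin, 2)))
# approximately (34.6, 38.6), matching the interval derived above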
6. Basic Machine Learning Algorithms
Machine learning algorithms are classified into 4 types:
Supervised Learning
Unsupervised Learning
Semi-supervised Learning
Reinforcement Learning
Linear Regression
To understand the working of Linear Regression, imagine how you would arrange random
logs of wood in increasing order of their weight. There is a catch, however: you cannot weigh
each log. You have to guess its weight just by looking at the height and girth of the log
(visual analysis) and arrange the logs using a combination of these visible parameters. This is
what linear regression in machine learning is like.
In this process, a relationship is established between independent and dependent variables by
fitting them to a line. This line is known as the regression line and is represented by a linear
equation Y= a *X + b.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a and b are derived by minimizing the sum of the squared distances between
the data points and the regression line.
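As an illustrative sketch (the numbers are made up), the same least-squares fit can be obtained with scikit-learn's LinearRegression:

import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: X = girth of each log, y = its weight
X = np.array([[10], [15], [20], [25], [30]])  # independent variable, 2-D for sklearn
y = np.array([22, 31, 40, 52, 60])            # dependent variable

model = LinearRegression().fit(X, y)          # least-squares fit of Y = a*X + b
print("Slope a =", model.coef_[0])
print("Intercept b =", model.intercept_)
print("Predicted weight for girth 18:", model.predict([[18]])[0])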
SVM (Support Vector Machine) Algorithm
The SVM algorithm is a classification method in which you plot the raw data as points in an
n-dimensional space (where n is the number of features you have). The value of each feature
is then tied to a particular coordinate, making it easy to classify the data. Lines called
classifiers can then be used to split the data and plot them on a graph.
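A minimal sketch with scikit-learn's SVC on a made-up two-feature dataset:

from sklearn.svm import SVC

# toy 2-dimensional data: each point is [feature_1, feature_2]
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]                # class labels

clf = SVC(kernel="linear")            # a linear kernel gives a straight-line separator
clf.fit(X, y)
print(clf.predict([[3, 4], [7, 7]]))  # classify two new points, expected [0 1]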
Naive Bayes Algorithm
A Naive Bayes classifier assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature.
Even if these features are related to each other, a Naive Bayes classifier would consider all of
these properties independently when calculating the probability of a particular outcome.
A Naive Bayesian model is easy to build and useful for massive datasets. It's simple and is
known to outperform even highly sophisticated classification methods.
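A minimal sketch using scikit-learn's GaussianNB (one of several Naive Bayes variants; the data is made up):

from sklearn.naive_bayes import GaussianNB

# the same toy data: two numeric features, two classes
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]]
y = [0, 0, 0, 1, 1, 1]

nb = GaussianNB()                    # treats the features as conditionally independent
nb.fit(X, y)
print(nb.predict([[2, 2], [7, 6]]))  # expected [0 1]
print(nb.predict_proba([[2, 2]]))    # class probabilities for the first point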