Unit-3 DS

1. DATA ANALYSIS
Data analysis: Here, data scientists conduct an exploratory data analysis to examine biases,
patterns, ranges, and distributions of values within the data. This exploration drives
hypothesis generation for A/B testing. It also allows analysts to determine the data’s
relevance for use within modelling efforts for predictive analytics, machine learning, and/or
deep learning. Depending on a model’s accuracy, organizations can come to rely on these
insights for business decision-making, allowing them to scale those decisions across the business.

Data science tools


Data scientists rely on popular programming languages to conduct exploratory data analysis
and statistical regression. These open source tools support pre-built statistical modeling,
machine learning, and graphics capabilities. These languages include the following (read
more at "Python vs. R: What's the Difference?"):
R: An open source programming language and environment for statistical computing and
graphics, commonly used through the RStudio IDE.
Python: A dynamic and flexible programming language. Python includes numerous
libraries, such as NumPy, Pandas, and Matplotlib, for analyzing data quickly.
To facilitate sharing code and other information, data scientists may use GitHub and Jupyter
notebooks.
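As a small illustration of that workflow, here is a minimal exploratory sketch using the Python libraries mentioned above; the file name is hypothetical, and real exploratory analysis would go well beyond this.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('sales.csv')   # hypothetical dataset
print(df.describe())            # summary statistics for the numeric columns
df.hist(figsize=(10, 6))        # distribution of each numeric column
plt.show()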
Some data scientists may prefer a user interface, and two common enterprise tools for
statistical analysis include:
SAS: A comprehensive tool suite, including visualizations and interactive dashboards, for
analyzing, reporting, data mining, and predictive modeling.
IBM SPSS: Offers advanced statistical analysis, a large library of machine learning
algorithms, text analysis, open source extensibility, integration with big data, and seamless
deployment into applications.
Data scientists also gain proficiency in using big data processing platforms, such as Apache
Spark, the open source framework Apache Hadoop, and NoSQL databases. They are also
skilled with a wide range of data visualization tools, including simple graphics tools included
with business presentation and spreadsheet applications (like Microsoft Excel), built-for-
purpose commercial visualization tools like Tableau and IBM Cognos, and open source tools
like D3.js (a JavaScript library for creating interactive data visualizations) and RAW Graphs.
For building machine learning models, data scientists frequently turn to several frameworks
like PyTorch, TensorFlow, MXNet, and Spark MLlib.
Given the steep learning curve in data science, many companies seek to accelerate their
return on investment for AI projects, yet they often struggle to hire the talent needed to
realize a data science project’s full potential. To address this gap, they are turning to
multipersona data science and machine learning (DSML) platforms, giving rise to the role of
“citizen data scientist.”
Multipersona DSML platforms use automation, self-service portals, and low-code/no-code
user interfaces so that people with little or no background in digital technology or expert data
science can create business value using data science and machine learning. These platforms
also support expert data scientists by offering a more technical interface. Using a
multipersona DSML platform encourages collaboration across the enterprise.

2. Terminology and Concepts


Data analytics: Key concepts
There are four key types of data analytics:
descriptive, diagnostic, predictive, and prescriptive.
Together, these four types of data analytics can help an organization make data-driven
decisions. At a glance, each of them tells us the following:
Descriptive analytics tell us what happened.
Diagnostic analytics tell us why something happened.
Predictive analytics tell us what will likely happen in the future.
Prescriptive analytics tell us how to act.
People who work with data analytics will typically explore each of these four areas using the
data analysis process, which includes identifying the question, collecting raw data, cleaning
data, analyzing data, and interpreting the results.
Read more: What Is Data Analysis? (With Examples)
Data analytics skills
Data analytics requires a wide range of skills to be performed effectively. According to search
and enrollment data among Coursera’s community of 87 million global learners, these are the
top in-demand data science skills, as of December 2021:
Structured Query Language (SQL), a programming language commonly used for databases
Statistical programming languages, such as R and Python, commonly used to create
advanced data analysis programs
Machine learning, a branch of artificial intelligence that involves using algorithms to spot data
patterns
Probability and statistics, in order to better analyze and interpret data trends
Data management, or the practices around collecting, organizing and storing data
Statistical visualization, or the ability to use charts and graphs to tell a story with data
Econometrics, or the ability to use data trends to create mathematical models that forecast
future trends based on historical data
While careers in data analytics require a certain amount of technical knowledge, approaching
the above skills methodically—for example by learning a little bit each day or learning from
your mistakes—can help lead to mastery, and it’s never too late to get started.
Read more: Is Data Analytics Hard? Tips for Rising to the Challenge

Data analytics careers


Typically, data analytics professionals make a higher than average salary and are in high
demand within the labor market. According to 2017 research by CrowdFlower, for example,
there are more open roles for data analysts than people with the skills to perform those jobs, a
trend that ensures data analytics professionals are much sought after by employers [1]. More
recently, the US Bureau of Labor Statistics (BLS) has projected that careers in data analytics
fields will grow by 23 percent between 2021 and 2031 (much faster than average) and are
estimated to pay a higher than average annual income of $82,360 [2].
Entry-level careers in data analytics include roles such as:
 Junior data analyst
 Associate data analyst
 Junior data scientist

3. What is Statistical Data Analysis?

As a branch of science, statistics incorporates data acquisition, data interpretation, and
data validation, and statistical data analysis is the approach of conducting various
statistical operations, i.e., thorough quantitative research that attempts to quantify data
through some form of statistical analysis. Here, quantitative data typically includes
descriptive data such as survey data and observational data.

In the context of business applications, it is a very crucial technique for business intelligence
organizations that need to operate with large data volumes.

The basic goal of statistical data analysis is to identify trends. In the retail business, for
example, this method can be applied to uncover patterns in unstructured and semi-structured
consumer data, which can then be used to make more powerful decisions for enhancing the
customer experience and increasing sales.

Apart from that, statistical data analysis has various applications in market research,
business intelligence (BI), big data analytics, machine learning and deep learning, and
financial and economic analysis.

SIGNIFICANCE OF DATA UNDER STATISTICAL DATA ANALYSIS

Data comprises variables that are univariate or multivariate and, depending on the
number of variables, experts apply different statistical techniques.
If the data has a single variable, univariate statistical data analysis can be conducted,
including the t-test for significance, z-test, f-test, one-way ANOVA, etc.
If the data has many variables, different multivariate techniques can be performed,
such as factor analysis or discriminant analysis.

Here, a variable is a characteristic that changes from one individual of a population to
another. The image below shows the classification of data variables.

Classification of variables

(Related blog: An Introduction to Probability Distribution)


Data is of two types, continuous data and discrete data. Continuous data cannot be
counted and changes over time, e.g., the intensity of light, the temperature of a room, etc.

The discrete data can be counted and has a certain number of values, e.g. the number of
bulbs, the number of people in a group, etc.

(Related blog: Types of data in statistics)

Under statistical data analysis, the continuous data is distributed under continuous
distribution function, also known as the probability density function, and the discrete data is
distributed under a discrete distribution function, also termed as the probability mass
function.
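As a hedged illustration of this distinction, scipy.stats exposes both kinds of functions; this minimal sketch evaluates a continuous pdf and a discrete pmf.

from scipy import stats

print(stats.norm.pdf(0.5, loc=0, scale=1))  # density of a N(0, 1) variable at x = 0.5
print(stats.poisson.pmf(3, mu=2))           # probability of exactly 3 events when the mean is 2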
Data can either be quantitative or qualitative.
Qualitative data are labels or names used to identify a characteristic of each
element, whereas quantitative data are always numbers that indicate either how
much or how many.

(More to read: Steps for qualitative data analysis)

Under statistical data analysis, cross-sectional and time-series data are important. By
definition, cross-sectional data are data collected at the same, or approximately the
same, point in time, whereas time-series data are data gathered across several time
periods.

Statistical data analysis can be adopted to:

 Extract essential findings/conclusions from a dataset.
 Summarize and compile information.
 Compute measures of cohesiveness, relevance, or diversity in data.
 Make predictions about the future on the basis of earlier reported data.
 Test experimental forecasts.
Statistical Data Analysis Tools
Generally, statistical data analysis relies on statistical analysis tools that a layman cannot
use without statistical knowledge.
Various software programs are available to perform statistical data analysis; these include
the Statistical Analysis System (SAS), the Statistical Package for the Social Sciences (SPSS),
StatSoft, and many more.

These tools provide extensive data-handling capabilities and many statistical analysis methods
that can examine anything from a small chunk of data to very comprehensive statistics.

Though computers play an important role in statistical data analysis by assisting with the
summarization of data, statistical data analysis concentrates on the interpretation of the
results in order to draw inferences and make predictions.

(Must check: Statistical Data analysis techniques)

What are the Types of Statistical Data Analysis?


There are two important components of a statistical study:
Population - an assemblage of all elements of interest in a study, and
Sample - a subset of the population.
And there are two widely used types of statistical methods under statistical data analysis
techniques:
Descriptive Statistics:
It is a form of data analysis that is basically used to describe, show or summarize data from a
sample in a meaningful way, for example through the mean, median, standard deviation and variance.

In other words, descriptive statistics attempts to illustrate the relationship between variables
in a sample or population and gives a summary in the form of mean, median and mode.
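For instance, here is a minimal sketch of computing these descriptive summaries in Python; the sample values are made up for illustration.

import pandas as pd

sample = pd.Series([23, 21, 18, 16, 15, 13, 12])
print(sample.mean())     # average of the sample
print(sample.median())   # middle value of the sample
print(sample.std())      # standard deviation
print(sample.var())      # variance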

Inferential Statistics:
This method is used to draw conclusions from a data sample by using null and
alternative hypotheses that are subject to random variation.
Also, probability distribution, correlation testing and regression analysis fall into this
category. In simple words, inferential statistics employs a random sample of data, taken
from a population, to make and explain inferences about the whole population.
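As a minimal sketch of inferential statistics in practice, here is a one-sample t-test with scipy; the sample values and the hypothesized mean are made up for illustration.

from scipy import stats

sample = [36, 38, 35, 40, 37, 39, 36, 38]
t_stat, p_value = stats.ttest_1samp(sample, popmean=37)
print(t_stat, p_value)   # a small p-value would lead us to reject the null hypothesis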

(Most related: What is p-value in statistics?)

The table below shows the key differences between descriptive statistics and inferential
statistics:

S.No | Descriptive Statistics | Inferential Statistics
1 | Concerned with describing the target population. | Makes inferences from the sample and generalizes them to the population.
2 | Arranges, analyzes and presents the data in a meaningful way. | Correlates, tests and anticipates future outcomes.
3 | Final outcomes are represented in the form of charts, tables and graphs. | Final outcomes are probability scores.
4 | Explains data that is already known. | Attempts to draw conclusions about the population beyond the available data.
5 | Tools deployed: measures of central tendency (mean, median, mode) and spread of data (range, standard deviation, etc.). | Tools deployed: hypothesis testing, analysis of variance, etc.

Difference between Descriptive Statistics and Inferential Statistics

4 Basic Steps for Statistical Data Analysis


Analyzing any problem using statistical data analysis comprises four basic steps:

1. Defining the problem

A precise and accurate definition of the problem is imperative for collecting accurate data
about it. It is extremely difficult to collect data without knowing the exact definition of
the problem.

2. Accumulating the data

After defining the specific problem, designing multiple ways to collect data is an important
task in statistical data analysis.
Data can be collected from existing sources or obtained through observational and
experimental research studies conducted to gather new data.
In an experimental study, the variable of interest is identified according to the defined
problem; then one or more factors in the study are controlled in order to obtain data about
how those factors affect other variables.
In an observational study, no attempt is made to control or influence the variable of
interest; a survey is a common example of an observational study.

3. Analyzing the data

Under statistical data analysis, the analysis methods are divided into two categories:
Exploratory methods are deployed to discover what the data reveals, using simple arithmetic
and easy-to-draw graphs to summarize the data.
Confirmatory methods apply ideas from probability theory in an attempt to answer specific
problems.
Probability is extremely important in decision-making because it provides a mechanism for
estimating, representing, and explaining the possibilities associated with future events.
4. Reporting the outcomes

Through inference, an estimate or hypothesis test concerning the characteristics of a


population can be derived from a sample; these results can be reported in the form of a
table, a graph, or a set of percentages.
Since only a small portion of the data has been examined, the reported results can reflect
uncertainty through probability statements and interval estimates.
With the help of statistical data analysis, experts can forecast and anticipate future trends
from data. Understanding the available information and utilizing it effectively can lead to
sound decision-making. (Source)
Conclusion
Statistical data analysis gives meaning to otherwise meaningless numbers, thereby bringing
lifeless data to life. It is therefore imperative for a researcher to have adequate knowledge
of statistics and statistical methods in order to perform any research study.
This will help in conducting an appropriate and well-designed study that leads to accurate
and reliable results. Moreover, results and inferences are valid only if proper statistical
tests are applied.

4. Central Tendencies and Distribution


Measures of Central Tendency
The central tendency of a dataset can be found using three important measures, namely the
mean, median and mode.
Mean
The mean represents the average value of the dataset. It can be calculated as the sum of all
the values in the dataset divided by the number of values. In general, it is considered as the
arithmetic mean. Some other measures of mean used to find the central tendency are as
follows:
 Geometric Mean
 Harmonic Mean
 Weighted Mean
It is observed that if all the values in the dataset are the same, then the geometric,
arithmetic and harmonic means are all the same. If there is variability in the data, then the
mean values differ. Calculating the mean is straightforward. The formula for the (arithmetic)
mean is:

Mean (x̄) = (x₁ + x₂ + … + xₙ) / n
The histogram below shows the mean value of symmetric continuous data and of skewed
continuous data.
In a symmetric data distribution, the mean value is located exactly at the centre. But in a
skewed continuous data distribution, the extreme values in the extended tail pull the mean
away from the centre. So the mean is recommended for symmetric distributions.
Median
The median is the middle value of the dataset when the dataset is arranged in ascending or
descending order. When the dataset contains an even number of values, the median can be
found by taking the mean of the middle two values.
Consider the given dataset with the odd number of observations arranged in descending order
– 23, 21, 18, 16, 15, 13, 12, 10, 9, 7, 6, 5, and 2

Here 12 is the middle or median number that has 6 values above it and 6 values below it.
Now, consider another example with an even number of observations that are arranged in
descending order – 40, 38, 35, 33, 32, 30, 29, 27, 26, 24, 23, 22, 19, and 17
When you look at the given dataset, the two middle values obtained are 27 and 29.
Now, find out the mean value for these two numbers.
i.e.,(27+29)/2 =28
Therefore, the median for the given data distribution is 28.
Mode
The mode represents the frequently occurring value in the dataset. Sometimes the dataset may
contain multiple modes and in some cases, it does not contain any mode at all.
Consider the given dataset 5, 4, 2, 3, 2, 1, 5, 4, 5

Since the mode represents the most common value, the mode of the given dataset is 5, the
most frequently repeated value.
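A minimal sketch of computing all three measures for this dataset in Python:

import pandas as pd

data = pd.Series([5, 4, 2, 3, 2, 1, 5, 4, 5])
print(data.mean())     # 3.44..., the arithmetic mean
print(data.median())   # 4.0, the middle value after sorting
print(data.mode()[0])  # 5, the most frequent value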
The measure of central tendency is selected based on the properties of the data:
 If you have a symmetrical distribution of continuous data, all three measures of
central tendency hold good. But most of the time, analysts use the mean because
it involves all the values in the distribution or dataset.
 If you have a skewed distribution, the best measure of central tendency is
the median.
 If you have ordinal data, then the median and mode are the best choices for
measuring central tendency.
 If you have categorical data, the mode is the best choice for finding the central tendency.
Central Limit Theorem
The Central Limit Theorem (CLT) states that, for any data, provided a large number of
sufficiently large samples is taken, the following properties hold:
 Sampling distribution mean (μₓ¯) = Population mean (μ)
 Sampling distribution’s standard deviation (standard error) = σ/√n ≈ S/√n
 For n > 30, the sampling distribution becomes approximately a normal distribution.
Let’s verify the properties of the CLT in Python through a Jupyter Notebook. For the
following Python code, we’ll use the Population and Random Values datasets, which we can
find here.
First, import the necessary libraries into the Jupyter Notebook. We import all the packages
used in the code below, and since we are going to sample the data randomly, we set a random
seed with np.random.seed(42) so that the analysis is reproducible.
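A sketch of that setup cell (the exact packages in the original notebook may differ):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)   # fix the seed so the sampling below is reproducible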
Now, let’s read the dataset we are dealing with. The first few rows of the Population
dataset look like this:
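A sketch of the loading step; the file name is an assumption, so substitute the path of the downloaded Population dataset:

df = pd.read_csv('population.csv')   # hypothetical file name for the Population dataset
df.head()                            # inspect the first few rows in a notebook cell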
Let’s extract the ‘Weight’ column from the dataset and see its distribution. The column’s
distribution graph is close to a normal distribution graph.
Let’s also find the mean and standard deviation of the Weight column through code.
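A sketch of those two steps, continuing from the setup above (the original notebook’s exact plotting calls may differ):

weight = df['Weight']
sns.histplot(weight, kde=True)   # distribution of the Weight column
plt.show()

print(weight.mean())   # mean of the Weight column
print(weight.std())    # standard deviation of the Weight column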
Mean = 220.67326732673268
Std. Dev. = 26.643110470317723
These values are the exact Mean and Standard Deviation values of the Weight Column.
Now, let’s start sampling the data.
First, we’ll take a sample of 30 members from the data. The reason is that, after repeated
sampling of observations, we want to check whether the sampling distribution follows a
normal distribution.
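A sketch of drawing one such sample:

samp_size = 30
df.Weight.sample(samp_size).mean()   # mean of one random sample of 30 observations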
The mean value for the above sample = 222.1, which is greater than the actual mean of
220.67. Let’s rerun the code,
df.Weight.sample(samp_size).mean()
The mean value for the above sample = 220.5, which is almost equal to the original mean. If
we rerun the code, we’ll get the mean value = 221.6
Each time we take a sample, the mean is different. There is variability in the sample mean
itself. Let’s move ahead and find out if the sample mean follows a distribution.
Instead of taking one sample mean at a time, we’ll take about 1000 such sample means and
assign them to a variable.
We convert sample_means into a pandas Series object because a plain list does not provide
mean and standard deviation methods.
The total number of samples = 1000.
Now we have 1000 samples and their mean values. Let’s plot the distribution graph
using seaborn.
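A sketch of that step, continuing from the cells above:

# draw 1000 samples of size 30 and keep each sample's mean
sample_means = pd.Series(
    [df.Weight.sample(samp_size).mean() for _ in range(1000)]
)
sns.histplot(sample_means, kde=True)   # sampling distribution of the mean
plt.show()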
The distribution plot looks like this,

As we can observe, the above distribution looks approximately like a normal distribution.
The other thing to check here is the samples’ mean and standard deviation.
Samples mean = 220.6945, which is very close to the original mean of 220.67; sample
std = 4.641450507418211.
Let’s see the relation between the standard deviation of the samples and the standard
deviation of the actual data.

When we divide the standard deviation of the original data by the square root of the sample
size,

df.Weight.std()/np.sqrt(samp_size)

we get the value 4.86, which is close to sample_means.std().
So, from the above code, we can infer that:
Sampling distribution’s mean (μₓ¯) = Population mean (μ)
Sampling distribution’s standard deviation (standard error) = σ/√n

So far, we have seen that the original data of the “Weight” column follows a normal
distribution. Let’s see whether the sampling distribution is still normal even when the
original data is not normally distributed.
We’ll take another data set that contains some random values and plot the values in a
distribution graph.
The dataset and the graph look like this:

As we can see, the Values column does not resemble the Normal Distribution graph. It looks
somewhat like an exponential distribution.
Let’s pick samples from this distribution, calculate their means, and plot the sampling
distribution.
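A sketch of that step, continuing from the cells above; the file name is an assumption:

df1 = pd.read_csv('random_values.csv')   # hypothetical file name for the Random Values dataset
sns.histplot(df1.Value, kde=True)        # the original, non-normal distribution
plt.show()

sample_means = pd.Series(
    [df1.Value.sample(samp_size).mean() for _ in range(1000)]
)
sns.histplot(sample_means, kde=True)     # sampling distribution of the mean
plt.show()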
Now, the distribution graph for the samples looks like this:

Remarkably, the distribution of the sample_means obtained from the Values column, whose
original distribution is far from normal, is still very much a normal distribution.
Let’s compare the sample_means mean value to its parent mean value:

sample_means.mean()   # Output: 130.39213999999996
df1.Value.mean()      # Output: 130.4181654676259
As we can see, the sample_means mean value and the original dataset’s mean value are
similar.
Similarly, the standard deviation of the sample means is sample_means.std() = 13.263962580003142,
which is quite close to df1.Value.std()/np.sqrt(samp_size) = 14.060457446377631.
Let’s compare the distribution graph of each dataset with its corresponding sampling
distribution.
As we can see, irrespective of the original dataset’s distribution, the sampling distribution
resembles the normal distribution curve.
There’s only one thing to consider now, i.e., Sample Size. We’ll observe that, as the sample
size increases, the sampling distribution will approximate a normal distribution even more
closely.
Effect of Sample Size on the Sampling Distribution
Let’s create samples of different sizes and plot the corresponding distribution graphs.
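A sketch of that comparison, continuing from the cells above and drawing from the non-normal Values column:

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, n in zip(axes.flat, [3, 10, 30, 50, 100, 200]):
    means = [df1.Value.sample(n).mean() for _ in range(1000)]   # sampling distribution for size n
    sns.histplot(means, kde=True, ax=ax)
    ax.set_title('Sample size = {}'.format(n))
plt.tight_layout()
plt.show()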
Now, the distribution graphs for sample sizes of 3, 10, 30, 50, 100 and 200 look like this:

Distribution of Different Sample Sizes

As we can observe, the distribution graphs for sample sizes 3 and 10 do not resemble a
normal distribution. But from sample size 30 onwards, as the sample size increases, the
sampling distribution increasingly resembles a normal distribution.
As a rule of thumb, a sample size of 30 or above is adequate for concluding that the
sampling distribution is nearly normal, and further inferences can be drawn from it.
Through this Python code, we can conclude that the CLT’s three properties hold:
Sampling distribution mean (μₓ¯) = Population mean (μ)
Sampling distribution’s standard deviation (standard error) = σ/√n
For n > 30, the sampling distribution becomes approximately a normal distribution.
Estimating Mean Using CLT
The mean commute time of 30,000 employees (μ) = 36.6 (the sample mean) plus or minus some
margin of error. We can find this margin of error using the CLT (central limit theorem).
Now that we know what the CLT is, let’s see how to find the margin of error.
Let’s say the mean commute time of a sample of 100 employees is X̄ = 36.6 min, and the
standard deviation of the sample is S = 10 min. Using the CLT, we can infer that:
Sampling distribution mean (μₓ¯) = Population mean (μ)
Sampling distribution’s standard deviation = σ/√n ≈ S/√n = 10/√100 = 1
Since the sampling distribution is a normal distribution:
P(μ - 2 < 36.6 < μ + 2) = 95.4%; we get this value from the 1-2-3 rule of the normal
distribution curve.
P(μ - 2 < 36.6 < μ + 2) = P(36.6 - 2 < μ < 36.6 + 2) = 95.4%
You can find the standard distribution curve, Z-Table, and its properties in my previous article,
“Inferential Statistics.”
Now, we can say that there is a 95.4% probability that the population mean (μ) lies between
(36.6 - 2, 36.6 + 2). In other words, we are 95.4% confident that the error in estimating the
mean is ≤ 2.
The probability associated with the claim is called the confidence level (here, 95.4%).
The maximum error made in the sample mean is called the margin of error (here, 2 min).
The final interval of values is called the confidence interval (here: (34.6, 38.6)).
We can generalize this concept in the following manner.
Let’s say that we have a sample with sample size n, mean X¯, and standard deviation S. Now,
the y% confidence interval (i.e., the confidence interval corresponding to a y% confidence
level) for μ would be given by the range:
Confidence interval = (X̄ - Z*·S/√n, X̄ + Z*·S/√n)
where Z* is the Z-score associated with a y% confidence level.
Some commonly used Z* values are given below:

Confidence level 90%: Z* = 1.65; 95%: Z* = 1.96; 99%: Z* = 2.58
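A self-contained sketch of the calculation, reusing the commute-time numbers from the example above with a 95% confidence level:

import numpy as np

x_bar, s, n = 36.6, 10, 100   # sample mean, sample standard deviation, sample size
z_star = 1.96                 # Z* for a 95% confidence level

margin = z_star * s / np.sqrt(n)         # margin of error
print((x_bar - margin, x_bar + margin))  # 95% confidence interval: (34.64, 38.56)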
That is how we calculate the margin of error and estimate the mean of the whole population
with the help of samples.
6. Basic Machine Learning Algorithms
Machine learning algorithms are classified into four types:
 Supervised Learning
 Unsupervised Learning
 Semi-supervised Learning
 Reinforcement Learning
Linear Regression
To understand the working functionality of Linear Regression, imagine how you would
arrange random logs of wood in increasing order of their weight. There is a catch; however –
you cannot weigh each log. You have to guess its weight just by looking at the height and
girth of the log (visual analysis) and arranging them using a combination of these visible
parameters. This is what linear regression in machine learning is like.
In this process, a relationship is established between independent and dependent variables by
fitting them to a line. This line is known as the regression line and is represented by the
linear equation Y = a*X + b.
In this equation:
Y – Dependent Variable
a – Slope
X – Independent variable
b – Intercept
The coefficients a & b are derived by minimizing the sum of the squared difference of
distance between data points and the regression line.
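A minimal scikit-learn sketch of fitting such a line; the data points here are made up for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])    # independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # dependent variable

model = LinearRegression().fit(X, y)       # least-squares fit of Y = a*X + b
print(model.coef_[0], model.intercept_)    # slope a and intercept b
print(model.predict([[6]]))                # prediction for a new observation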
SVM (Support Vector Machine) Algorithm
The SVM algorithm is a classification method in which you plot raw data as points in an
n-dimensional space (where n is the number of features you have). The value of each feature
is then tied to a particular coordinate, making it easy to classify the data. Lines called
classifiers can be used to split the data and plot it on a graph.
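A minimal scikit-learn sketch of an SVM classifier on the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)      # 4 features per flower, 3 classes
clf = SVC(kernel='linear').fit(X, y)   # fit a linear separating boundary
print(clf.predict(X[:3]))              # classify the first three samples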
Naive Bayes Algorithm
A Naive Bayes classifier assumes that the presence of a particular feature in a class is
unrelated to the presence of any other feature.
Even if these features are related to each other, a Naive Bayes classifier would consider all of
these properties independently when calculating the probability of a particular outcome.
A Naive Bayesian model is easy to build and useful for massive datasets. It's simple and is
known to outperform even highly sophisticated classification methods.
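A minimal scikit-learn sketch of a Gaussian Naive Bayes classifier on the same iris dataset:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
nb = GaussianNB().fit(X, y)   # each feature's likelihood is modelled independently
print(nb.score(X, y))         # accuracy on the training data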
