Module 3

The document discusses different types of data analysis including descriptive analysis, exploratory data analysis, predictive analysis, and inferential analysis. Descriptive analysis uses numerical methods to summarize data through measures like mean, median, mode, standard deviation, and variance. Exploratory data analysis takes a visual approach using plots and graphs to analyze single and multiple variables. Stem and leaf plots arrange data to show frequency of values through stems and leaves. Normal distributions follow a bell curve shape and describe randomness in many phenomena through mean and standard deviation. Skewness measures asymmetry in a distribution, with positive skewness having a longer right tail and negative skewness a longer left tail.

Uploaded by

Sayan Majumder
Copyright
© All Rights Reserved

Module 3

Data Analysis Types:


Data analysis may be divided into four types, depending on the methodology used:

 Descriptive Analysis
 Exploratory Data Analysis
 Predictive Analysis
 Inferential Analysis

Descriptive Analysis:
Descriptive analysis is a numerical method of extracting information from data. It
summarises the values of the numerical variables. Assume you’re looking at sales data
from a vehicle company. In a descriptive analysis, you’ll look for answers to questions
such as: what are the mean, mode, and median of a car type’s selling price, and what
income was generated by selling a specific model of automobile? Using this form of
analysis, we can determine the central tendency and dispersion of the numerical
variables in the data. In most practical data science use cases, a descriptive analysis
helps you gain a high-level understanding of the data and become familiar with the data set.

The following are some key descriptive analysis terminologies:

 Mean: average of the values in the given list of numbers
 Mode: most frequent number in the given list of numbers
 Median: middle value of the given list of numbers once it is sorted
 Standard deviation: measure of how far the given values typically vary from the mean value
 Variance: average squared deviation from the mean (the square of the standard deviation)
 Interquartile Range (IQR): difference between the 75th and 25th percentiles of a list of numbers
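
The measures listed above can be computed with Python's standard `statistics` module. The following is a minimal sketch: the `prices` list is hypothetical data made up for illustration, the helper name `describe` is not from the original notes, and the population forms of the standard deviation and variance are an assumption (the sample forms `stdev` and `variance` would serve equally well).

```python
import statistics

# Hypothetical selling prices (in thousands) of one car model; illustrative only.
prices = [22, 25, 25, 27, 30, 31, 35, 40]


def describe(values):
    """Return the descriptive measures listed above for a list of numbers."""
    # quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Population forms are assumed here; use stdev/variance for a sample.
        "std_dev": statistics.pstdev(values),
        "variance": statistics.pvariance(values),
        "iqr": q3 - q1,
    }


summary = describe(prices)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Note that the quartiles, and therefore the IQR, depend on the interpolation method; `statistics.quantiles` uses its default "exclusive" method here.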

Importance of Descriptive Analysis:


Descriptive statistics make data easier to interpret. They present data in a meaningful
and intelligible form, allowing for a more straightforward understanding of the data set.
Analysing raw data directly is laborious, and trends and patterns are hard to determine.
Furthermore, raw data makes it difficult to visualise what is being displayed.

Exploratory Data Analysis:


In contrast to descriptive data analysis, which is a numerical approach to data analysis,
exploratory data analysis is a visual approach to data analysis. We turn to exploratory
data analysis once we have a basic comprehension of the data at hand through descriptive
analysis. Exploratory data analysis may be divided into two parts:
 Univariate analysis: Analysis of a single variable (exploring the characteristics of a
single variable)
 Multivariate analysis: Analysis of several variables together (comparative analysis of
multiple variables; when exactly two variables are compared, for example through their
correlation, it is called bivariate analysis)

We employ numerous types of plots and graphs to analyse data in the visual style of data
analysis. Bar plots, histograms, box-and-whisker plots, violin plots, and other plots can be
used to study a single variable (univariate analysis). For multivariate analysis, we employ
scatter plots, contour plots, multi-dimensional graphs, and similar tools.

Need of Exploratory Data Analysis:


 Exploratory data analysis provides a visual representation of the data, which aids in
identifying the data’s features more clearly
 It assists us in determining which characteristics are most significant, which is very
handy when dealing with data that has a lot of dimensions (approaches such as PCA and
t-SNE help with dimensionality reduction)
 It’s a good way to communicate the findings to non-technical stakeholders and
executives

What are Stem and Leaf Plots?


A stem and leaf plot, also known as a stem and leaf diagram, is a way to arrange and
represent data so that it is simple to see how frequently various data values occur. It is a plot
that displays ordered numerical data. 

A stem and leaf plot is shown as a special table where the digits of a data value are divided
into a stem (first few digits) and a leaf (usually the last digit). The symbol ‘|’ is used to split
and illustrate the stem and leaf values. For instance, 105 is written as 10 on the stem and 5 on
the leaf. This can be written as 10 | 5. Here, 10 | 5 = 105 is called the key. The key depicts the
data value a stem and leaf represent.

 
How do we Construct a Stem and Leaf Plot?
Step 1: Classify the data values in terms of the number of digits in each value, such as 2 digit
numbers or 3 digit numbers.

Step 2: Fix the key for the stem and leaf plot. For example, 2 | 5 = 25, 3 | 2 = 3.2, or
19 | 2 = 192.

Step 3: Consider the first digits as stems and the last digit as leaves.

Step 4: Find the range of the data, that is the lowest and the highest values among the data.

Step 5: Draw a vertical line. Place the stem on the left and the leaf on the right of the vertical
line.

Step 6: List the stems in the stem column. Sort them in ascending order.

Step 7: List the leaf values in the column against the stem from lowest to the highest
horizontally.
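
The steps above can be sketched in Python. This is a minimal illustration that assumes non-negative whole numbers and a key in which the leaf is the ones digit (for example, 2 | 5 = 25); the function name `stem_and_leaf` and the sample data are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot for non-negative integers.

    Assumes the key "stem | leaf" means stem * 10 + leaf.
    """
    leaves = defaultdict(list)
    # Sorting first means each stem's leaves are collected in ascending order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)

    rows = []
    # List every stem from the lowest to the highest, including empty ones.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {row}".rstrip())
    return rows


# Hypothetical data, for illustration only.
data = [27, 12, 48, 21, 15, 27]
for line in stem_and_leaf(data):
    print(line)
```

Stems with no leaves are still listed, so gaps in the data remain visible in the plot.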

Rapid Recall

                                      
Key : 0 | 1 = 1

Solved Examples
Example 1:

The table below shows the duration of calls that Rosy makes each day. Represent the
given data using a stem and leaf plot.

Solution:

Step 1: Sort the data (number of minutes).

2, 3, 5, 6, 10, 14, 19, 23, 23, 30, 36, 56

Step 2: Choose the stems and the leaves. Because the data values range from 2 to 56, use
the tens digit for the stem and the ones digit for the leaf. Also, include the key.

Step 3: Write down the stems on the left of the vertical line.

Step 4: Write down the leaves for each stem on the right of the vertical line.

0 | 2 3 5 6
1 | 0 4 9
2 | 3 3
3 | 0 6
4 |
5 | 6

Key : 1 | 0 = 10 minutes

Example 2

The stem-and-leaf plot below shows the quiz scores of students. 

(a) Find the number of students who scored less than 9 points.

(b) Find the number of students who scored a minimum of 9 points.

Solution:
a) There are fourteen scores less than 9 points. 

They are 6.6, 7.0, 7.5, 7.7, 7.8, 8.1, 8.1, 8.3, 8.4, 8.4, 8.6, 8.8, 8.8 and 8.9.

So, fourteen students scored less than 9 points.

b) There are four scores which are at least 9 points.

They are 9.0, 9.2, 9.9, and 10.0.

So, four students scored a minimum of 9 points.

Example 3:

Construct a stem-and-leaf plot for the data in the table.

Solution:

Step 1: Sort the data values: 1, 1, 1, 2, 2, 4, 5, 5, 7, 12, 20, 23, 27, 30, 32, 33, 38, 40, 44, 47

Step 2: Choose the stems and the leaves. As the data values range from 1 to 47, use the tens
digits for the stems and the ones digits for the leaves. Be sure to include the key.

Step 3: Write the stems to the left of the vertical line from the top to bottom.
 

Step 4: Write the leaf values corresponding to each stem to the right of the vertical line.

0 | 1 1 1 2 2 4 5 5 7
1 | 2
2 | 0 3 7
3 | 0 2 3 8
4 | 0 4 7

Key : 0 | 1 = 1 cm

What Is a Normal Distribution?


A normal distribution is a continuous probability distribution for a random variable. A
random variable is a variable whose value depends on the outcome of a random event. For
example, flipping a coin will give you either heads or tails at random. You cannot determine
with absolute certainty whether the next outcome will be a head or a tail.

When you plot the probabilities of the outcomes of a random event, you get its probability
distribution. The probability distribution of a random variable that can take on any value
within a range is called a continuous probability distribution. The number of values the
variable could take is infinite, and the probabilities form a continuous curve. Hence, instead
of assigning a probability to each individual value, you assign probabilities to ranges of
values.

When the continuous probability distribution curve is bell-shaped, i.e., it looks like a hill with
a well-defined peak, it is said to be a normal distribution. The peak of the curve is at the
mean, and the data is symmetrically distributed on either side of it. In an exactly normal
distribution the mean, median, and mode are equal; in real data that is approximately normal
they lie close to each other.
Figure 1: Normal distribution  
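
As a small illustration of assigning probabilities to ranges rather than to single values, the sketch below uses `NormalDist` from Python's standard `statistics` module. The mean of 70 and standard deviation of 10 are hypothetical test-score parameters chosen only for this example, and `prob_between` is a helper name introduced here.

```python
from statistics import NormalDist

# Hypothetical test scores: mean 70, standard deviation 10 (assumed values).
scores = NormalDist(mu=70, sigma=10)


def prob_between(dist, low, high):
    """Probability that a value drawn from dist falls between low and high."""
    return dist.cdf(high) - dist.cdf(low)


# Probability of scoring within one standard deviation of the mean.
within_one_sd = prob_between(scores, 60, 80)
print(f"P(60 < score < 80) = {within_one_sd:.4f}")

# Symmetry: the curve has the same height at equal distances from the mean,
# and exactly half of the probability lies below the mean.
print(scores.pdf(60), scores.pdf(80))
print(scores.cdf(70))
```

Roughly 68% of the probability lies within one standard deviation of the mean, whatever values are chosen for the mean and standard deviation.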

Consider the marks scored in a math test by students in a class. The majority of the students
would have scored close to the average mark. Fewer students would have scored somewhat
less or somewhat more, and fewer still would be in the bottom 10% and the top 10%. Some
examples commonly described as approximately normal are:

1. Blood pressure of people
2. I.Q. scores
3. Salaries

Measures of Skewness and Kurtosis


Skewness refers to the degree of symmetry, or more precisely, the degree of lack of
symmetry. Distributions, or data sets, are said to be symmetric if they appear the same on both sides
of a central point. Kurtosis describes whether the data are heavy-tailed or light-tailed in
comparison with a normal distribution.

What Is Skewness?
Skewness is used to measure the level of asymmetry in a distribution. It quantifies how far
the shape of our data deviates from the symmetry of a normal distribution.

Sometimes, a distribution tilts more to one side. This happens when extreme values occur
more often on one side of the mean than on the other, which makes the distribution
asymmetrical. This also means that the data is not equally distributed about the mean.
Skewness can be of two types:

1. Positively Skewed: In a distribution that is Positively Skewed, the values are more
concentrated towards the left side, and the right tail is spread out. The few large values in
the right tail pull the mean above the median. "Positive" refers to the direction of the long
tail, not to the sign of the mean, median, or mode. In this distribution, typically
Mean > Median > Mode.

Figure 2: Positively Skewed 


2. Negatively Skewed: In a Negatively Skewed distribution, the data points are more
concentrated towards the right-hand side of the distribution, and the left tail is spread out.
The few small values in the left tail pull the mean below the median. "Negative" refers to the
direction of the long tail, not to the sign of these values. In this distribution, typically
Mode > Median > Mean.

Figure 3: Negatively Skewed 

Pearson’s First Coefficient


In a skewed distribution, the median typically lies between the mean and the mode, so you
can derive a formula that captures the horizontal distance between mean and mode.

Pearson’s first coefficient = (Mean - Mode) / Standard Deviation

Figure 4: Pearson’s First Coefficient 

The above formula gives you Pearson's first coefficient. Dividing by the standard deviation
expresses the difference between mean and mode in units of standard deviation, so the
coefficient can be compared across data sets measured on different scales. For moderately
skewed distributions, the mode, mean, and median are approximately related as follows:

Mode = 3 × Median - 2 × Mean

Figure 5: Mode in terms of mean and median 

Substituting this in Pearson’s first coefficient gives us Pearson’s second coefficient and the
formula for skewness:

Skewness = 3 × (Mean - Median) / Standard Deviation

Figure 6: Pearson’s Second Coefficient

If this value is:

1. between -0.5 and 0.5, the distribution is almost symmetrical.
2. between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the
skewness is moderate.
3. lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data is
highly skewed.
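
The following sketch computes Pearson's second coefficient, 3 × (mean - median) / standard deviation, and applies the thresholds above. The function names and the sample data are hypothetical, and using the population standard deviation is an assumption; the sample standard deviation gives a slightly smaller coefficient.

```python
import statistics


def pearson_second_skewness(values):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    # Population standard deviation is assumed; statistics.stdev also works.
    std_dev = statistics.pstdev(values)
    return 3 * (mean - median) / std_dev


def describe_skewness(coefficient):
    """Label a skewness coefficient using the thresholds given above."""
    if -0.5 <= coefficient <= 0.5:
        return "almost symmetrical"
    direction = "positively" if coefficient > 0 else "negatively"
    strength = "moderately" if abs(coefficient) <= 1 else "highly"
    return f"{strength} {direction} skewed"


# Hypothetical data with one large value stretching the right tail.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]
sk = pearson_second_skewness(right_tailed)
print(f"{sk:.3f}: {describe_skewness(sk)}")

# A perfectly symmetric data set has mean equal to median, so the result is 0.
print(pearson_second_skewness([1, 2, 3, 4, 5]))
```

The long right tail raises the mean above the median, so the coefficient for `right_tailed` comes out positive.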

What Is Kurtosis?
Kurtosis describes how heavy the tails of a distribution are compared with a normal
distribution, which indicates how prone the data are to outliers.

The data can be light-tailed, and the peak can be flatter, almost like punching the
distribution or squishing it. This is called Negative Kurtosis (Platykurtic). If the distribution
is heavy-tailed and the peak sharper, like pulling up the distribution, it is called Positive
Kurtosis (Leptokurtic).

Figure 7: (a) Leptokurtic, (b) Normal Distribution, (c) Platykurtic

The kurtosis of a normal distribution is 3. A kurtosis greater than three indicates Positive
Kurtosis, and a kurtosis less than three indicates Negative Kurtosis. Subtracting 3 from the
kurtosis gives the excess kurtosis, which is 0 for a normal distribution. Kurtosis can never be
less than 1, so the excess kurtosis ranges from -2 to 0 for Negative Kurtosis and from 0 to
infinity for Positive Kurtosis. The greater the value of kurtosis, the heavier the tails and,
typically, the sharper the peak.

Figure 8: Excess Kurtosis
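
As a rough sketch, kurtosis can be computed as the fourth central moment divided by the squared variance, and excess kurtosis as that value minus 3. The population (biased) form is assumed here, the function names are introduced for illustration, and the sample data are hypothetical.

```python
import statistics


def kurtosis(values):
    """Population kurtosis: fourth central moment over squared variance."""
    mean = statistics.mean(values)
    variance = statistics.pvariance(values)
    fourth_moment = sum((x - mean) ** 4 for x in values) / len(values)
    return fourth_moment / variance ** 2


def excess_kurtosis(values):
    """Kurtosis relative to a normal distribution, whose kurtosis is 3."""
    return kurtosis(values) - 3


# Two equally likely values give the smallest possible kurtosis (1).
two_point = [0, 1]
print(kurtosis(two_point), excess_kurtosis(two_point))

# Hypothetical data with two extreme values: heavy tails, positive excess.
heavy_tailed = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]
print(kurtosis(heavy_tailed), excess_kurtosis(heavy_tailed))
```

The two-point data set reaches the lower limit of kurtosis, which is why excess kurtosis cannot fall below -2.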

Hence, you can say that Skewness and Kurtosis describe how the shape of a distribution
differs from a normal distribution. Skewness denotes the horizontal pull on the data: it tells
you how far, and in which direction, the distribution leans away from symmetry. Kurtosis
denotes the vertical pull: it reflects the weight of the tails and the sharpness of the peak.
