Module 3
Descriptive Analysis
Exploratory Data Analysis
Predictive Analysis
Inferential Analysis
Descriptive Analysis:
Descriptive analysis is a numerical method of extracting information from data. It
summarises the values of the numerical variables. Assume you’re looking at sales data
from a vehicle company. In a descriptive analysis, you’ll look for answers to questions
such as: what are the mean, mode, and median of a car type’s selling price, what income
was generated by selling a specific model of automobile, and so on. Using this form of
analysis, we can determine the central tendency and dispersion of the numerical
variables in the data. In most practical data science use cases, a descriptive analysis
helps you gain high-level knowledge of the data and become familiar with the data set.
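The measures of central tendency and dispersion mentioned above can be sketched with Python's standard statistics module. The price list is a hypothetical data set made up for illustration; it does not come from any real sales data.

```python
import statistics

# Hypothetical selling prices (in thousands) of one car type; illustrative only.
prices = [18, 20, 20, 22, 25, 27, 30]

# Central tendency
mean_price = statistics.mean(prices)      # arithmetic average
median_price = statistics.median(prices)  # middle value of the sorted data
mode_price = statistics.mode(prices)      # most frequent value

# Dispersion
price_range = max(prices) - min(prices)   # distance between the extremes
std_dev = statistics.stdev(prices)        # sample standard deviation

# Income generated by these sales
total_income = sum(prices)

print("mean:", round(mean_price, 2))
print("median:", median_price, "mode:", mode_price)
print("range:", price_range, "std dev:", round(std_dev, 2))
print("total income:", total_income)
```

The choice of the sample standard deviation (stdev) rather than the population version (pstdev) is an assumption; use whichever matches how the data was collected.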
In the visual style of data analysis, we use various types of plots and graphs to analyse
data. Bar plots, histograms, box-and-whisker plots, violin plots, and other plots can be
used to study a single variable (univariate analysis). To study several variables together
(multivariate analysis), we use scatter plots, contour plots, multi-dimensional graphs,
and similar tools.
A stem and leaf plot is shown as a special table where the digits of a data value are divided
into a stem (first few digits) and a leaf (usually the last digit). The symbol ‘|’ is used to split
and illustrate the stem and leaf values. For instance, 105 is written as 10 on the stem and 5 on
the leaf. This can be written as 10 | 5. Here, 10 | 5 = 105 is called the key. The key depicts the
data value a stem and leaf represent.
How do we Construct a Stem and Leaf Plot?
Step 1: Classify the data values in terms of the number of digits in each value, such as
2-digit numbers or 3-digit numbers.
Step 2: Fix the key for the stem and leaf plot. For example, 2 | 5 = 25, 3 | 2 = 3.2 or
19 | 2 = 192.
Step 3: Consider the first digits as stems and the last digit as leaves.
Step 4: Find the range of the data, that is the lowest and the highest values among the data.
Step 5: Draw a vertical line. Place the stem on the left and the leaf on the right of the vertical
line.
Step 6: List the stems in the stem column. Sort them in ascending order.
Step 7: List the leaf values in the column against the stem from lowest to the highest
horizontally.
Rapid Recall
Key : 0 | 1 = 1
Solved Examples
Example 1:
The table below shows the duration of calls that Rosy makes each day. Represent the
given data using a stem and leaf plot.
Solution:
Step 2: Choose the stems and the leaves. Because the data values range from 2 to 56, use
the tens digit for the stem and the ones digit for the leaf. Also, include the key.
Step 3: Write down the stems on the left of the vertical line.
Step 4: Write down the leaves for each stem on the right of the vertical line.
Example 2:
(a) Find the number of students who scored less than 9 points.
Solution:
a) There are fourteen scores less than 9 points.
They are 6.6, 7.0, 7.5, 7.7, 7.8, 8.1, 8.1, 8.3, 8.4, 8.4, 8.6, 8.8, 8.8 and 8.9.
Example 3:
Solution:
Step 1: Sort the data values: 1, 1, 1, 2, 2, 4, 5, 5, 7, 12, 20, 23, 27, 30, 32, 33, 38, 40, 44, 47
Step 2: Choose the stems and the leaves. As the data values range from 1 to 47, use the tens
digits for the stems and the ones digits for the leaves. Be sure to include the key.
Step 3: Write the stems to the left of the vertical line from the top to bottom.
Step 4: Write the leaf values corresponding to each stem to the right of the vertical line.
Key : 0 | 1 = 1 cm
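The construction in Example 3 can be sketched in Python as follows. The function name stem_and_leaf is a hypothetical helper introduced here for illustration, and it assumes non-negative whole numbers with the tens digits as stems and the ones digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem and leaf plot (stem = tens, leaf = ones)."""
    groups = defaultdict(list)
    # Sorting first means the leaves of every stem end up in ascending order.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        groups[stem].append(leaf)
    rows = []
    # List every stem from the lowest to the highest, even if it has no leaves.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return rows

# Data values from Example 3 (in cm).
data = [1, 1, 1, 2, 2, 4, 5, 5, 7, 12, 20, 23, 27, 30, 32, 33, 38, 40, 44, 47]

rows = stem_and_leaf(data)
for row in rows:
    print(row)
print("Key : 0 | 1 = 1 cm")
```

Each printed row shows one stem on the left of the vertical bar and its leaves, lowest to highest, on the right.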
When you plot the probabilities of the outcomes of a random event, you get its probability
distribution. The distribution of a random variable that can take on any value within a
range is called a continuous probability distribution. The number of possible values is
infinite, and their probabilities form a continuous curve. Hence, instead of assigning a
probability to each individual value, you define the probability that the variable lies
within a range.
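As a minimal sketch of probability over a range, consider a variable that is equally likely to fall anywhere between two limits (a continuous uniform distribution). This distribution and the function name uniform_probability are assumptions chosen for illustration; they are not part of the original notes.

```python
def uniform_probability(a, b, low, high):
    """P(a <= X <= b) when X is equally likely anywhere in [low, high]."""
    # Only the part of [a, b] that overlaps [low, high] carries probability.
    a, b = max(a, low), min(b, high)
    if b <= a:
        return 0.0
    return (b - a) / (high - low)

# Probability that X lies between 2 and 5 when X ranges over [0, 10].
p_range = uniform_probability(2, 5, 0, 10)

# A single exact value is a range of zero width, so its probability is zero.
p_point = uniform_probability(3, 3, 0, 10)

print(p_range, p_point)
```

This is why continuous distributions are described through ranges: any single exact value has probability zero, while an interval has a probability equal to the area under the curve above it.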
When the continuous probability distribution curve is bell-shaped, i.e., it looks like a hill with
a well-defined peak, it is said to be a normal distribution. The peak of the curve is at the
mean, and the data is symmetrically distributed on either side of it. The mean, median, and
mode are equal to each other or lie close to each other.
Figure 1: Normal distribution
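The properties described above can be checked with NormalDist from Python's standard statistics module. The mean of 70 and standard deviation of 10 are assumed values picked only for this sketch.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 70, standard deviation 10 (assumed).
dist = NormalDist(mu=70, sigma=10)

# Symmetry: the curve has the same height at equal distances from the mean.
left_density = dist.pdf(60)
right_density = dist.pdf(80)

# The peak is at the mean: the density there is higher than on either side.
peak_density = dist.pdf(70)

# Half of the probability lies below the mean, so the median equals the mean.
below_mean = dist.cdf(70)

# Share of values within one standard deviation of the mean.
within_one_sd = dist.cdf(80) - dist.cdf(60)

print(left_density, right_density, peak_density)
print(below_mean, round(within_one_sd, 4))
```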
Consider the marks scored in a math test by students in a class. Most of the students
would have scored close to the average mark. Fewer students would have scored a little
less or a little more, and even fewer would be in the bottom 10% and the top 10%. Some
examples of normal distributions are:
What Is Skewness?
Skewness measures the level of asymmetry in our graph, that is, how far the shape of our
data deviates from the symmetric shape of the normal distribution.
Sometimes, the distribution tilts more to one side. This happens because values on one
side of the mean are more likely than values on the other side, which makes the
distribution asymmetrical. This also means that the data is not equally distributed
around the mean. Skewness can be of two types:
1. Positively Skewed: In a distribution that is Positively Skewed, the values are more
concentrated towards the left side, and the right tail is spread out. Hence, the bulk of
the data leans towards the left-hand side, while a few large values pull the mean to the
right, and the skewness value is positive. In this distribution, Mean > Median > Mode.
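The ordering Mean > Median > Mode can be checked on a small sample. The values below are hypothetical and chosen so that a few large values pull the mean upward.

```python
import statistics

# Hypothetical sample: most values are small, a few are much larger.
sample = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]

mean = statistics.mean(sample)      # pulled upward by the large values 9 and 15
median = statistics.median(sample)  # middle of the sorted values
mode = statistics.mode(sample)      # most frequent value

print("mean:", mean, "median:", median, "mode:", mode)
print("Mean > Median > Mode:", mean > median > mode)
```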
Pearson's first coefficient of skewness is (Mean - Mode) / Standard Deviation. Dividing
by the standard deviation scales the difference between the mean and the mode, giving a
unit-free value that can be compared across data sets. Now consider the following
approximate relationship between mode, mean and median for moderately skewed data:
Mode ≈ 3 Median - 2 Mean.
Substituting this in Pearson’s first coefficient gives us Pearson’s second coefficient and the
formula for skewness: Skewness = 3 (Mean - Median) / Standard Deviation.
Figure 6: Pearson’s Second Coefficient
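Both coefficients can be sketched in Python using their standard textbook definitions: (mean - mode) / standard deviation for the first and 3 * (mean - median) / standard deviation for the second. The function names, the sample data, and the use of the sample standard deviation are assumptions made for this illustration.

```python
import statistics

def pearson_first(data):
    """Pearson's first coefficient: (mean - mode) / standard deviation."""
    return (statistics.mean(data) - statistics.mode(data)) / statistics.stdev(data)

def pearson_second(data):
    """Pearson's second coefficient: 3 * (mean - median) / standard deviation."""
    return 3 * (statistics.mean(data) - statistics.median(data)) / statistics.stdev(data)

# Hypothetical samples, illustrative only.
skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]   # a few large values
symmetric = [1, 2, 3, 3, 3, 4, 5]          # mirrored around 3

print("skewed:   ", round(pearson_first(skewed), 3), round(pearson_second(skewed), 3))
print("symmetric:", pearson_first(symmetric), pearson_second(symmetric))
```

A positive result indicates positive skew, and a result of zero indicates that the mean coincides with the mode or median, as in the symmetric sample.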
What Is Kurtosis?
Kurtosis is used to find the presence of outliers in our data. It measures how heavy the
tails of the distribution are, that is, how much of the data lies in extreme values.
The data can be light-tailed, and the peak can be flatter, almost like pushing down on the
distribution or squishing it. This is called Negative Kurtosis (Platykurtic). If the distribution
is heavy-tailed and the peak sharper, like pulling up the distribution, it is called Positive
Kurtosis (Leptokurtic).
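A minimal sketch of the sign of kurtosis, assuming the common moment-based definition of excess kurtosis (fourth central moment divided by the squared variance, minus 3, so that a normal distribution scores 0). The notes do not give a formula, so this definition, the function name, and both samples are assumptions for illustration.

```python
def excess_kurtosis(data):
    """Moment-based excess kurtosis: m4 / m2**2 - 3 (population moments)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# Hypothetical samples, illustrative only.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]         # evenly spread, flat top
peaked = [5, 5, 5, 5, 5, 5, 5, 5, -10, 20]     # sharp peak at 5, two extreme values

flat_kurtosis = excess_kurtosis(flat)
peaked_kurtosis = excess_kurtosis(peaked)

print("flat:  ", round(flat_kurtosis, 3))    # negative: platykurtic
print("peaked:", round(peaked_kurtosis, 3))  # positive: leptokurtic
```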
Hence, you can say that Skewness and Kurtosis describe how the shape of your distribution
departs from the normal distribution. Skewness denotes the horizontal pull on the data: it
tells you to which side the data is stretched. Kurtosis denotes the vertical pull, that is,
how sharp the peak is and how heavy the tails are.