Module 1: Overview of Statistics
Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and Libraries
for Visualization.
NumPy: NumPy Operations - Indexing, Slicing, Splitting, Iterating, Filtering, Sorting, Combining, and
Reshaping.
Pandas: Advantages of pandas over NumPy, Disadvantages of pandas, Pandas Operations - Indexing,
Slicing, Iterating, Filtering, Sorting, and Reshaping using pandas.
Overview of Statistics:
Probability:
Probability Distribution:
It shows all the values that a random variable can take, together with their probabilities.
Example: Rolling a die. Possible outcomes: 1, 2, 3, 4, 5, 6
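The six outcomes above can be written out as a discrete probability distribution. The sketch below assumes a fair six-sided die (the notes list only the outcomes, not their probabilities), so each face gets probability 1/6.

```python
from fractions import Fraction

# Assumption: a fair six-sided die, so every outcome is equally likely.
outcomes = [1, 2, 3, 4, 5, 6]
distribution = {face: Fraction(1, len(outcomes)) for face in outcomes}

for face, probability in distribution.items():
    print(f"P(X = {face}) = {probability}")

# The probabilities of all possible outcomes must add up to 1.
total_probability = sum(distribution.values())
print("Total probability:", total_probability)
```

Using Fraction instead of floats keeps the probabilities exact, so the total is exactly 1.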
Example:
The time taken to reach home is usually around 60 minutes, but it could be less (no traffic) or more
(traffic jam).
The probability is spread over a continuous range of values (such as 45.3 min or 60.8 min) rather than
a fixed set of numbers.
Figure: Continuous probability distribution for the time taken to reach home.
The normal distribution is a continuous probability distribution. It is symmetric and bell-shaped, with
most values clustering around the mean.
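The symmetry and clustering described above can be checked numerically. This is a minimal sketch, assuming the travel time is normally distributed with a mean of 60 minutes (taken from the example) and a standard deviation of 10 minutes (a hypothetical value chosen only for illustration).

```python
from statistics import NormalDist

# Assumptions: mean of 60 minutes (from the travel-time example) and a
# hypothetical standard deviation of 10 minutes.
commute = NormalDist(mu=60, sigma=10)

# Symmetry: densities at equal distances on either side of the mean match.
density_below = commute.pdf(50)
density_above = commute.pdf(70)

# Clustering: probability of a value within one standard deviation of the mean.
within_one_sd = commute.cdf(70) - commute.cdf(50)

print(f"pdf(50) = {density_below:.5f}, pdf(70) = {density_above:.5f}")
print(f"P(50 <= X <= 70) = {within_one_sd:.4f}")
```

For any normal distribution, roughly 68% of the probability lies within one standard deviation of the mean, which is what "clustering around the mean" refers to.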
1. Measures of Central Tendency
2. Measures of Dispersion
3. Correlation
Measures of central tendency are often called averages and describe the central or typical value of a
probability distribution.
1. Mean
2. Median
3. Mode
1. Mean:
The arithmetic average is computed by summing up all measurements and dividing the sum by the
number of observations.
2. Median:
The median is the middle value of the ordered dataset. If the dataset has an even number of values,
the median is the average of the two middle values.
3. Mode:
The mode is the most frequent value in the dataset. There may be more than one mode in cases where
multiple values are equally frequent.
Example:
dataset = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]
Mr. Gopinath C B., Assistant Professor, Dept. of AI&DS, NCE, Hassan 4
Mean = (4+5+4+3+4+2+1+1+2+1)/10
Mean = 2.7
Median: To calculate the median, the dice rolls have to be ordered according to their value. The
ordered values are as follows:
1, 1, 1, 2, 2, 3, 4, 4, 4, 5
Since the dataset has an even number of values, the median is the average of the two middle values.
Median = (2 + 3)/2 = 2.5
Mode: The modes are 1 and 4, since they are the two most frequent values (each occurs three times).
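The worked example above can be reproduced with Python's standard statistics module. This is only a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import mean, median, multimode

dataset = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

dataset_mean = mean(dataset)                # sum of values / number of values
dataset_median = median(dataset)            # average of the two middle values here
dataset_modes = sorted(multimode(dataset))  # every value tied for most frequent

print("Mean:", dataset_mean)
print("Median:", dataset_median)
print("Modes:", dataset_modes)
```

multimode is used instead of mode because it returns all values that share the highest frequency, which matters when a dataset has more than one mode.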
2. Measures of Dispersion:
Dispersion, also called variability, is the extent to which a probability distribution is stretched or
squeezed. In other words, it describes how spread out or close together the data values are.
Example: People’s salaries in a company range from Rs. 30,000/- to Rs. 2,00,000/- (big difference).
Example: Heights of students in a class range from 160 cm to 165 cm (small difference).
1. Variance
2. Standard Deviation
3. Range
4. Interquartile Range
1. Variance:
The expected value of the squared deviation from the mean, describing how far numbers are spread
out.
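Written as a formula, the definition above takes the following form for a dataset of N values. The symbols x_i (the i-th value), \mu (the mean), and N (the number of values) are notation introduced here, and the division by N is an assumption that matches the population variance.

```latex
\mathrm{Var}(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2,
\qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
```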
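2. Standard Deviation:
The square root of the variance.
$SD = \sqrt{\mathrm{Var}(X)}$
3. Range:
The difference between the largest and smallest values in a dataset.
4. Interquartile Range: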
Also called the midspread or middle 50%, this is the difference between the 75th and 25th percentiles,
or between the upper and lower quartiles.
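Example:
dataset = (10, 20, 30, 40, 50)
Variance:
Mean = (10 + 20 + 30 + 40 + 50)/5 = 30
$$\text{Variance} = \frac{(10 - 30)^2 + (20 - 30)^2 + (30 - 30)^2 + (40 - 30)^2 + (50 - 30)^2}{5} = \frac{1000}{5} = 200$$
Standard Deviation:
$$SD = \sqrt{200} \approx 14.14$$
(Tells us that values typically lie about 14.14 units away from the mean)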
Range:
Largest value = 50
Smallest value = 10
Range = 50 − 10 = 40
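The variance, standard deviation, and range for the dataset (10, 20, 30, 40, 50) can be checked with the standard statistics module. The sketch assumes the population versions of the measures (dividing by the number of values), since that is what the worked example does.

```python
from statistics import pvariance, pstdev

dataset = [10, 20, 30, 40, 50]

# Population variance: mean of the squared deviations from the mean (divides by N).
variance = pvariance(dataset)
# Standard deviation: square root of the variance.
standard_deviation = pstdev(dataset)
# Range: largest value minus smallest value.
value_range = max(dataset) - min(dataset)

print("Variance:", variance)
print(f"Standard deviation: {standard_deviation:.2f}")
print("Range:", value_range)
```

Note that statistics.variance and statistics.stdev compute the sample versions, which divide by N − 1 and would give a variance of 250 for this dataset.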
IQR = Q3 − Q1
where Q1 (First Quartile) is the 25th percentile and Q3 (Third Quartile) is the 75th percentile of the
dataset.
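Arrange the data in ascending order. Sorting is important to find the quartiles correctly.
The median (Q2) splits the data into a lower half and an upper half; Q1 is the median of the lower half
and Q3 is the median of the upper half. If the dataset has an odd number of values, the median is the
middle value and is left out of both halves. If it has an even number of values, the median is the
average of the two middle values.
The given dataset = (10, 20, 30, 40, 50). It has an odd number of values.
Q2 = 30
Q1 = (10 + 20)/2 = 15
Q3 = (40 + 50)/2 = 45
IQR = Q3 − Q1 = 45 − 15 = 30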
Another example:
The given dataset = (10, 20, 30, 40, 50, 60). It has an even number of values.
Q2 = (30 + 40)/2 = 35
Q1 = 20 (median of 10, 20, 30)
Q3 = 50 (median of 40, 50, 60)
IQR = Q3 − Q1 = 50 − 20 = 30
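Both IQR examples can be reproduced with a small helper. This is a sketch under the assumption that the quartiles are found with the median-of-halves method, which is what the results above imply; the function names quartiles and iqr are illustrative, not from any library.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)          # quartiles are only defined on ordered data
    half = len(data) // 2
    lower = data[:half]            # values below the median position
    # For an odd count, skip the middle value so it belongs to neither half.
    upper = data[half + 1:] if len(data) % 2 else data[half:]
    return median(lower), median(data), median(upper)

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(values)
    return q3 - q1

print(quartiles([10, 20, 30, 40, 50]), iqr([10, 20, 30, 40, 50]))
print(quartiles([10, 20, 30, 40, 50, 60]), iqr([10, 20, 30, 40, 50, 60]))
```

Library routines such as numpy.percentile interpolate between values by default and may return different quartiles for small datasets, so the method should be stated whenever an IQR is reported.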
Example:
We want to find a decent apartment to rent that is not too expensive compared to other apartments
we've found. The other apartments we found on a website are priced as follows: $700, $850, $1,500,
and $750 per month.
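For the four apartment prices, the mean and the median give noticeably different pictures of a "typical" rent, because the $1,500 listing is much higher than the rest. The sketch below computes both; the variable names are illustrative.

```python
from statistics import mean, median

rents = [700, 850, 1500, 750]  # monthly prices in dollars, from the example

average_rent = mean(rents)    # pulled upward by the $1,500 listing
median_rent = median(rents)   # average of the two middle prices, 750 and 850

print("Mean rent:", average_rent)
print("Median rent:", median_rent)
```

The mean comes out at $950, which is higher than three of the four listings, while the median of $800 is closer to what most of these apartments cost. When a dataset contains an outlier like this, the median is usually the more representative measure.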
It is important to understand what kind of data we are dealing with so that we can select both the
right statistical measure and the right visualization.
Categorical data: This type of data describes characteristics or qualities. It is not measured with
numbers.
We can further divide categorical data into nominal data and ordinal data.
Example (nominal): Colors (Red, Blue, Green), Types of Fruits (Apple, Banana, Orange)
Example (ordinal): Movie ratings (Poor, Average, Good, Excellent), Education levels (High School,
Bachelor's, Master's, Ph.D.)
Numerical data: This type of data is measured with numbers. We can further divide numerical data into
discrete data and continuous data.
Example (discrete): Number of students in a class (30, 31, 32), Number of cars in a parking lot (5, 10, 15)
Example (continuous): Height of people (5.4 ft, 5.5 ft, 5.6 ft), Temperature (23.5°C, 24.1°C)
Summary Statistics:
The following table gives an overview of which measure of central tendency is best suited to a
particular type of data:
Figure: Best suited measures of central tendency for different types of data.
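As an illustration of why the data type matters, the sketch below applies a suitable measure to a nominal and an ordinal sample. It assumes the common convention that the mode suits nominal data and the median suits ordinal data; it does not reproduce the table referred to above, and both samples are hypothetical.

```python
from statistics import mode, median_low

# Nominal data: categories have no order, so only the mode is meaningful.
colors = ["Red", "Blue", "Red", "Green"]  # hypothetical sample
most_common_color = mode(colors)

# Ordinal data: categories are ordered, so a median can be taken by rank.
rating_order = ["Poor", "Average", "Good", "Excellent"]
ratings = ["Good", "Poor", "Excellent", "Average", "Good"]  # hypothetical sample
ranks = [rating_order.index(rating) for rating in ratings]
# median_low always returns an actual rank, never an average of two ranks.
median_rating = rating_order[median_low(ranks)]

print("Most common color:", most_common_color)
print("Median rating:", median_rating)
```

A mean is not computed for either sample, since adding or averaging category labels has no meaning.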
In data visualization, these statistical measures help summarize and interpret data effectively, making
patterns and trends easier to understand.