
Module 1 Overview - of - Statistics


Module-1: Data Visualization and Data Exploration

Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and Libraries
for Visualization.

Overview of Statistics: Measures of Central Tendency, Measures of Dispersion, Correlation, Types of
Data, Summary Statistics.

Numpy: Numpy Operations - Indexing, Slicing, Splitting, Iterating, Filtering, Sorting, Combining, and
Reshaping.

Pandas: Advantages of pandas over numpy, Disadvantages of pandas, Pandas operation - Indexing,
Slicing, Iterating, Filtering, Sorting and Reshaping using Pandas.

Overview of Statistics:

Definition: Statistics is the collection, analysis, interpretation and representation of numerical
data.

It helps in making sense of data by finding patterns, trends and relationships.

Probability:

Probability is the chance of an event happening, measured on a scale from 0 (impossible) to 1 (certain).

Probability Distribution:

A probability distribution tells us how likely different outcomes are.

It can be of two types:

1. Discrete Probability Distribution

2. Continuous Probability Distribution

1. Discrete Probability Distribution (Fixed Values):

It lists all the values that a random variable can take, together with the probability of each value.

The following diagram illustrates an example of a discrete probability distribution.

Mr. Gopinath C B., Assistant Professor, Dept. of AI&DS, NCE, Hassan 1


Example:

Rolling a Six-sided die

Possible outcomes: 1, 2, 3, 4, 5, 6

Each number has an equal probability of occurring: 1/6 (or 16.67%)

The graph of a discrete probability distribution is typically a bar chart.

Figure: Discrete probability distribution for die rolls.
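The die example can be sketched in Python as a mapping from each outcome to its probability. This is a minimal illustration; the names outcomes and pmf are chosen here and do not come from the notes.

```python
from fractions import Fraction

# Outcomes of a fair six-sided die
outcomes = [1, 2, 3, 4, 5, 6]

# Each outcome is equally likely, so each one gets probability 1/6
pmf = {face: Fraction(1, len(outcomes)) for face in outcomes}

# The probabilities of a discrete distribution must add up to 1
total = sum(pmf.values())

for face, p in pmf.items():
    print(f"P(X = {face}) = {p} (about {float(p):.4f})")
print("Sum of probabilities:", total)
```

Plotting the values of pmf as bar heights would give the bar chart described above.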

2. Continuous Probability Distribution (Any value in a range):

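It describes how probability is spread over the possible values of a continuous random variable. The probability of any single exact value is zero, so probabilities are assigned to ranges of values.

The following diagram provides an example of a continuous probability distribution.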

Example:

Time taken to drive home

Usually takes around 60 minutes, but could be less (no traffic) or more (traffic jam).

The probability is spread over a range of values rather than a fixed set of numbers (for example, 45.3 min or 60.8 min).



The graph of a continuous probability distribution is typically a smooth curve, where the area under
the curve represents the probability.

Figure: Continuous probability distribution for the time taken to reach home.

The normal distribution is a continuous probability distribution. It is symmetric and bell-shaped, with
most values clustering around the mean.

Mean – Center of the distribution

Standard Deviation – Spread of the distribution.
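As a rough sketch, the drive-home example can be modelled with a normal distribution using Python's standard statistics module. The mean of 60 minutes comes from the example above; the standard deviation of 10 minutes is an assumed value chosen only for illustration.

```python
from statistics import NormalDist

# Mean of 60 minutes is from the example; sigma = 10 minutes is an assumption
drive_time = NormalDist(mu=60, sigma=10)

# Probability of a range = area under the curve between its end points
p_50_to_70 = drive_time.cdf(70) - drive_time.cdf(50)

# Probability of a long trip = area under the curve to the right of 80
p_over_80 = 1 - drive_time.cdf(80)

print(f"P(50 <= T <= 70) = {p_50_to_70:.4f}")
print(f"P(T > 80)        = {p_over_80:.4f}")
```

Under these assumed parameters, about 68% of trips fall within one standard deviation of the mean (between 50 and 70 minutes).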

Statistics can be further broken down into:

1. Measures of Central Tendency

2. Measures of Dispersion

3. Correlation

1. Measures of Central Tendency:

Measures of central tendency are often called averages. They describe the central or typical value of
a probability distribution.



Three kinds of average:

1. Mean

2. Median

3. Mode

1. Mean:

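The arithmetic average is computed by summing up all measurements and dividing the sum by the
number of observations.

The mean is calculated as follows:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $x_i$ are the observations and $n$ is the number of observations.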

2. Median:

It is the middle value of the ordered dataset.

If the dataset has an odd number of values, the median is the middle value.

If the dataset has an even number of values, the median is the average of the two middle values.

3. Mode:

The mode is defined as the most frequently occurring value in a dataset.

If no value repeats, the dataset has no mode.

There may be more than one mode in cases where multiple values are equally frequent.

Example:

A die was rolled 10 times.

dataset = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]
Mean = (4+5+4+3+4+2+1+1+2+1)/10

Mean = 2.7

Median: To calculate the median, the die rolls have to be ordered according to their value. The
ordered values are as follows:

1, 1, 1, 2, 2, 3, 4, 4, 4, 5

Since the dataset has an even number of values, the median is the average of the two middle values.

Median = (2+3)/2 = 2.5

Mode: The modes are 1 and 4, since each occurs three times, more often than any other value.
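The same three averages can be checked with Python's standard statistics module. This is a small sketch; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# The ten die rolls from the example
dataset = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

avg = mean(dataset)                 # sum of the values divided by their count
mid = median(dataset)               # average of the two middle values (even count)
modes = sorted(multimode(dataset))  # every value tied for the highest frequency

print("Mean:", avg)
print("Median:", mid)
print("Modes:", modes)
```

multimode is used instead of mode because this dataset has two equally frequent values.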

2. Measures of Dispersion:

Dispersion, also called variability, is the extent to which a probability distribution is stretched or
squeezed. In other words, it describes how spread out or close together the data values are.

Stretched (wide spread data): The values are far apart.

Example: People’s salaries in a company range from Rs. 30,000/- to 2,00,000/- (big difference).

Squeezed (tightly packed data): The values are close together.

Example: Heights of students in a class range from 160 cm to 165 cm (small difference).

The different measures of dispersion are as follows:

1. Variance

2. Standard Deviation

3. Range

4. Interquartile Range (IQR)

1. Variance:

The expected value of the squared deviation from the mean, describing how far numbers are spread
out.



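Variance is calculated as follows:

$$Var(X) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2$$

where $\mu$ is the mean and $n$ is the number of values.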

2. Standard Deviation:

It is the square root of the variance.

𝑺𝑫 = √𝑽𝒂𝒓(𝑿)

3. Range:

It is the difference between the largest and smallest values in a dataset.

4. Interquartile Range:

Also called the midspread or middle 50%, this is the difference between the 75th and 25th percentiles,
or between the upper and lower quartiles.

Example:

dataset = [10, 20, 30, 40, 50]

Variance:

Mean = 30

$$\text{Variance} = \frac{(10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2}{5} = \frac{1000}{5} = 200$$

(Shows how far numbers deviate from 30)


Standard Deviation:

𝑆𝐷 = √𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = √200 = 14.14

(Tells us that values typically lie about 14.14 units away from the mean)

Range:

Difference between the largest and smallest values.

Largest value = 50

Smallest value = 10

Range = Largest value – Smallest value = 50 – 10 = 40
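The variance, standard deviation and range worked out above can be reproduced with the standard statistics module. The sketch uses the population versions (pvariance and pstdev), which divide by the number of values, to match the division by 5 in the example.

```python
from statistics import pvariance, pstdev

dataset = [10, 20, 30, 40, 50]

# Population variance: mean of the squared deviations from the mean
variance = pvariance(dataset)

# Population standard deviation: square root of the variance
std_dev = pstdev(dataset)

# Range: largest value minus smallest value
data_range = max(dataset) - min(dataset)

print("Variance:", variance)
print("Standard deviation:", round(std_dev, 2))
print("Range:", data_range)
```

The sample versions (statistics.variance and statistics.stdev) divide by n - 1 instead and would give 250 for the variance of this dataset.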

Interquartile Range (IQR):

The Interquartile Range (IQR) is calculated as:

IQR = Q3 − Q1

where Q1 (First Quartile) is the 25th percentile and Q3 (Third Quartile) is the 75th percentile of the
dataset.

Arrange the Data in Ascending Order. Sorting is important to correctly find quartiles.

First, find the median (Q2):

If the dataset has an odd number of values, the median is the middle value.

If the dataset has an even number of values, the median is the average of the two middle values.

The given dataset = (10, 20, 30, 40, 50). It has an odd number of values.

Q2 = 30

Find Q1 (First Quartile - 25th Percentile)

Q1 is the median of the lower half of the data (before Q2).

The lower half of the data is: 10, 20

The median of 10 and 20 is Q1 = (10+20)/2 = 15


Find Q3 (Third Quartile - 75th Percentile)

Q3 is the median of the upper half of the data (after Q2).

The upper half of the data is: 40, 50

The median of 40 and 50 is Q3 = (40+50)/2 = 45

Then, find the IQR:

IQR = Q3 - Q1 = 45 - 15 = 30

Another example:

The given dataset = (10, 20, 30, 40, 50, 60). It has an even number of values.

Q2 = (30+40)/2 = 35

Find Q1 (First Quartile - 25th Percentile)

Q1 is the median of the lower half of the data (before Q2).

The lower half of the data is: 10, 20, 30

The median of 10, 20 and 30 is Q1 = 20

Find Q3 (Third Quartile - 75th Percentile)

Q3 is the median of the upper half of the data (after Q2).

The upper half of the data is: 40, 50, 60

The median of 40, 50 and 60 is Q3 = 50

Then, find the IQR:

IQR = Q3 - Q1 = 50 - 20 = 30
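The "median of each half" procedure used in both examples can be sketched as follows. The function names quartiles_by_halves and iqr are hypothetical helpers written for this illustration. Library routines such as numpy.percentile use interpolation by default and can return different quartiles for the same data.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the notes.

    Assumes at least two values, so that both halves are non-empty.
    """
    data = sorted(values)          # quartiles require ordered data
    n = len(data)
    half = n // 2
    lower = data[:half]            # values before Q2
    # For an odd count the middle value itself is excluded from both halves
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

def iqr(values):
    """Interquartile range: Q3 - Q1."""
    q1, _, q3 = quartiles_by_halves(values)
    return q3 - q1

print(quartiles_by_halves([10, 20, 30, 40, 50]), iqr([10, 20, 30, 40, 50]))
print(quartiles_by_halves([10, 20, 30, 40, 50, 60]), iqr([10, 20, 30, 40, 50, 60]))
```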



3. Correlation

Correlation describes the statistical relationship between two variables.

In a positive correlation, both variables move in the same direction.

In a negative correlation, the variables move in opposite directions.

In zero correlation, the variables are not related.
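One common way to quantify these three cases is the Pearson correlation coefficient, which ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). The notes do not name a specific coefficient, so treat this as one possible choice. The helper pearson_r and the sample data below are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (denominator terms)
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical data, for illustration only
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]   # rises with hours studied
hours_of_tv = [5, 4, 3, 2, 1]       # falls as hours studied rise

r_positive = pearson_r(hours_studied, exam_score)
r_negative = pearson_r(hours_studied, hours_of_tv)

print(f"Positive correlation: r = {r_positive:.2f}")
print(f"Negative correlation: r = {r_negative:.2f}")
```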

Example:

We want to find a decent apartment to rent that is not too expensive compared to other apartments
we've found. The other apartments we found on a website are priced as follows: $700, $850, $1,500,
and $750 per month:

Given Rent Prices: $700, $850, $1500, $750
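The notes do not show what is computed from these prices. As a sketch, assuming the aim is to summarize the rents with the measures introduced earlier, the calculation could look like this (population standard deviation is used, as in the earlier worked example):

```python
from statistics import mean, median, pstdev

# Monthly rent prices from the example, in dollars
rent_prices = [700, 850, 1500, 750]

rent_mean = mean(rent_prices)                     # pulled upward by the $1,500 outlier
rent_median = median(rent_prices)                 # average of the two middle prices
rent_sd = pstdev(rent_prices)                     # population standard deviation
rent_range = max(rent_prices) - min(rent_prices)  # largest minus smallest

print("Mean:", rent_mean)
print("Median:", rent_median)
print("Standard deviation:", round(rent_sd, 2))
print("Range:", rent_range)
```

The mean ($950) is noticeably higher than the median ($800) because of the single $1,500 apartment, so the median is the better guide to a typical rent here.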



Types of Data:

It is important to understand what kind of data we are dealing with so that we can select both the
right statistical measure and the right visualization.

Data is mainly categorized into categorical (qualitative) and numerical (quantitative).

Categorical (Qualitative) Data:

This type of data describes characteristics or qualities. It is not measured with numbers.

We can further divide categorical data into nominal data and ordinal data.

Nominal Data (No specific order):

Example: Colors (Red, Blue, Green), Types of Fruits (Apple, Banana, Orange)

Ordinal Data (Has a specific order):

Example: Movie ratings (Poor, Average, Good, Excellent), Education levels (High School, Bachelor's,
Master's, Ph.D.)

Numerical (Quantitative) Data:

This type of data consists of numbers and represents measurable quantities.

Numerical data can be divided into discrete and continuous data.

Discrete Data (Countable, whole numbers):

Example: Number of students in a class (30, 31, 32), Number of cars in a parking lot (5, 10, 15)

Continuous Data (Can take any value within a range):

Example: Height of people (5.4 ft, 5.5 ft, 5.6 ft), Temperature (23.5°C, 24.1°C)

Other Important Considerations:

Temporal Data (Changes over time):

Example: Daily temperature, Stock prices, Monthly rainfall



Spatial Data (Related to location):

Example: Population density in different cities, Weather patterns across regions

Figure: Classification of types of data.

Summary Statistics:

The following table gives an overview of which measure of central tendency is best suited to a
particular type of data:

Figure: Best suited measures of central tendency for different types of data.
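The table itself is not reproduced in this text. As a hedged sketch based on the usual convention (mode for nominal data, median or mode for ordinal data, and mean, median or mode for numerical data), the choice of measure might look like this in Python. The sample values reuse the kinds of examples given above and are illustrative only.

```python
from statistics import mean, median_low, mode

# Nominal data: categories have no order, so only the mode is meaningful
colors = ["Red", "Blue", "Red", "Green"]
most_common_color = mode(colors)

# Ordinal data: categories are ordered, so a median can be taken by rank
rating_order = ["Poor", "Average", "Good", "Excellent"]
ratings = ["Good", "Poor", "Excellent", "Good", "Average"]
ranks = [rating_order.index(r) for r in ratings]
# median_low always returns an actual data point, so it maps back to a category
median_rating = rating_order[median_low(ranks)]

# Numerical data: arithmetic is meaningful, so the mean can be used
heights_ft = [5.4, 5.5, 5.6]
mean_height = mean(heights_ft)

print("Mode of colors:", most_common_color)
print("Median rating:", median_rating)
print("Mean height (ft):", round(mean_height, 2))
```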

In data visualization, these statistical measures help summarize and interpret data effectively, making
patterns and trends easier to understand.
