Module 1 Overview - of - Statistics

Uploaded by

likithgn17

Module-1: Data Visualization and Data Exploration

Introduction: Data Visualization, Importance of Data Visualization, Data Wrangling, Tools and Libraries
for Visualization.

Overview of Statistics: Measures of Central Tendency, Measures of Dispersion, Correlation, Types of
Data, Summary Statistics.

Numpy: Numpy Operations - Indexing, Slicing, Splitting, Iterating, Filtering, Sorting, Combining, and
Reshaping.

Pandas: Advantages of pandas over numpy, Disadvantages of pandas, Pandas operation - Indexing,
Slicing, Iterating, Filtering, Sorting and Reshaping using Pandas.

Overview of Statistics:

Definition: Statistics is the collection, analysis, interpretation and presentation of numerical data.

It helps in making sense of data by finding patterns, trends and relationships.

Probability:

Probability is the chance of an event happening, measured on a scale from 0 (impossible) to 1 (certain).

Probability Distribution:

A probability distribution tells us how likely different outcomes are.

It can be of two types

1. Discrete Probability Distribution

2. Continuous Probability Distribution

1. Discrete Probability Distribution (Fixed Values):

It shows all the values that a random variable can take, together with their probability.

The following diagram illustrates an example of a discrete probability distribution.

Mr. Gopinath C B., Assistant Professor, Dept. of AI&DS, NCE, Hassan 1

Example:

Rolling a Six-sided die

Possible outcomes: 1, 2, 3, 4, 5, 6

Each number has an equal probability of occurring: 1/6 (or 16.67%)

The graph of a discrete probability distribution is typically a bar chart.

Figure: Discrete probability distribution for die rolls.
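
As a rough sketch of the die example, the distribution can be written as a mapping from each outcome to its probability. The variable names are illustrative only, and exact fractions are used so that the total comes out as exactly 1.

```python
from fractions import Fraction

# Outcomes of a fair six-sided die, each with probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
distribution = {face: Fraction(1, 6) for face in outcomes}

# The probabilities of a discrete distribution must add up to 1.
total_probability = sum(distribution.values())

# Probability of an event made of several outcomes: rolling an even number.
p_even = sum(p for face, p in distribution.items() if face % 2 == 0)

print(total_probability)  # 1
print(p_even)             # 1/2
```

Plotting the six equal probabilities as bars gives the flat bar chart described above.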

2. Continuous Probability Distribution (Any value in a range):

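It describes how probability is spread over the possible values of a continuous random variable. The probability of any single exact value is zero, so probabilities are assigned to ranges of values.

The following diagram provides an example of a continuous probability distribution.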

Example:

Time taken to drive home

It usually takes around 60 minutes, but could take less (no traffic) or more (traffic jam).

The probability is spread over a range of values, not fixed numbers (for example 45.3 min, 60.8 min, etc.).

The graph of a continuous probability distribution is typically a smooth curve, where the area under
the curve over a range of values represents the probability of falling within that range.

Figure: Continuous probability distribution for the time taken to reach home.

The normal distribution is a continuous probability distribution. It is symmetric and bell-shaped, with
most values clustering around the mean.

Mean – Center of the distribution

Standard Deviation – Spread of the distribution.
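
To make the role of the mean and standard deviation concrete, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. It assumes, purely for illustration, that the drive time is normally distributed with a mean of 60 minutes and a standard deviation of 10 minutes; the value 10 is not from the notes.

```python
from statistics import NormalDist

# Assumed for illustration only: mean 60 minutes, standard deviation 10 minutes.
drive_time = NormalDist(mu=60, sigma=10)

# Probability of a range = area under the curve over that range.
# Here: the drive takes between 50 and 70 minutes (within 1 SD of the mean).
p_50_to_70 = drive_time.cdf(70) - drive_time.cdf(50)

# Symmetry: points equally far from the mean have the same density.
density_left = drive_time.pdf(45)
density_right = drive_time.pdf(75)

print(round(p_50_to_70, 4))  # 0.6827
```

Under this assumption, roughly 68% of drives fall within one standard deviation of the mean, which is what "most values clustering around the mean" looks like in numbers.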

Descriptive statistics can be further broken down into:

1. Measures of Central Tendency

2. Measures of Dispersion

3. Correlation

1. Measures of Central Tendency:

Measures of central tendency are often called averages. They describe the central or typical value of a
probability distribution.

Three kinds of average:

1. Mean

2. Median

3. Mode

1. Mean:

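The arithmetic average is computed by summing up all measurements and dividing the sum by the
number of observations.

The mean is calculated as follows:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

where N is the number of observations and x_i is the i-th measurement.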

2. Median:

It is the middle value of the ordered dataset.

If the dataset has an odd number of values, the median is the middle value.

If the dataset has an even number of values, the median is the average of the two middle values.

3. Mode:

The mode is defined as the most frequently occurring value in a dataset.

If no value repeats, the dataset has no mode.

There may be more than one mode in cases where multiple values are equally frequent.

Example:

A die was rolled 10 times.

dataset = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]
Mean = (4+5+4+3+4+2+1+1+2+1)/10

Mean = 2.7

Median: To calculate the median, the die rolls have to be ordered according to their value. The
ordered values are as follows:

1, 1, 1, 2, 2, 3, 4, 4, 4, 5

Since the dataset has an even number of values, the median is the average of the two middle values.

Median = (2+3)/2 = 2.5

Mode: The modes are 1 and 4, since each occurs three times, more often than any other value.
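
The same three values can be checked with Python's standard-library `statistics` module. This is only a sketch to confirm the hand calculation; `multimode` is used because this dataset has more than one mode.

```python
from statistics import mean, median, multimode

rolls = [4, 5, 4, 3, 4, 2, 1, 1, 2, 1]

roll_mean = mean(rolls)                # sum of the rolls divided by 10
roll_median = median(rolls)            # average of the two middle values
roll_modes = sorted(multimode(rolls))  # every value tied for most frequent

print(roll_mean)    # 2.7
print(roll_median)  # 2.5
print(roll_modes)   # [1, 4]
```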

2. Measures of Dispersion:

Dispersion, also called variability, is the extent to which a probability distribution is stretched or
squeezed. In other words, it describes how spread out or close together the data values are.

Stretched (wide spread data): The values are far apart.

Example: People’s salaries in a company range from Rs. 30,000/- to 2,00,000/- (big difference).

Squeezed (tightly packed data): The values are close together.

Example: Height of students in a class range from 160cm to 165cm (small difference).

The different measures of dispersion are as follows:

1. Variance

2. Standard Deviation

3. Range

4. Interquartile Range (IQR)

1. Variance:

Variance is the expected value of the squared deviation from the mean. It describes how far the numbers
are spread out from the mean.

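Variance is calculated as follows:

$$Var(X) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

where N is the number of values, x_i is the i-th value and μ is the mean of the dataset.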

2. Standard Deviation:

It is the square root of the variance.

𝑺𝑫 = √𝑽𝒂𝒓(𝑿)

3. Range:

It is the difference between the largest and smallest values in a dataset.

4. Interquartile Range:

Also called the midspread or middle 50%, this is the difference between the 75th and 25th percentiles,
or between the upper and lower quartiles.

Example:

dataset = [10, 20, 30, 40, 50]

Variance:

Mean = 30

$$Variance = \frac{(10 - 30)^2 + (20 - 30)^2 + (30 - 30)^2 + (40 - 30)^2 + (50 - 30)^2}{5}$$

$$Variance = \frac{1000}{5} = 200$$

(Shows how far numbers deviate from 30)

Standard Deviation:

𝑆𝐷 = √𝑉𝑎𝑟𝑖𝑎𝑛𝑐𝑒 = √200 = 14.14

(Tells us that values typically lie about 14.14 units away from the mean)

Range:

Difference between the largest and smallest values.

Largest value = 50

Smallest value = 10

Range = Largest value – Smallest value = 50 – 10 = 40
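
The variance, standard deviation and range worked out above can be reproduced with a short sketch. It uses the population versions `pvariance` and `pstdev` from the standard library, since the hand calculation divides by the number of values (5).

```python
from statistics import pvariance, pstdev

data = [10, 20, 30, 40, 50]

variance = pvariance(data)           # sum of squared deviations divided by N
std_dev = pstdev(data)               # square root of the population variance
value_range = max(data) - min(data)  # largest value minus smallest value

print(variance == 200)    # True
print(round(std_dev, 2))  # 14.14
print(value_range)        # 40
```

Note that `statistics.variance` (without the `p`) divides by N - 1 instead and would return 250 for the same data, so the choice of function matters when comparing against a hand calculation.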

Interquartile Range (IQR):

The Interquartile Range (IQR) is calculated as:

IQR = Q3 − Q1

where Q1 (First Quartile) is the 25th percentile and Q3 (Third Quartile) is the 75th percentile of the
dataset.

Arrange the data in ascending order. Sorting is important to find the quartiles correctly.

First, find the median (Q2).

If the dataset has an odd number of values, the median is the middle value.

If the dataset has an even number of values, the median is the average of the two middle values.

The given dataset = (10, 20, 30, 40, 50). It has an odd number of values.

Q2 = 30

Find Q1 (First Quartile - 25th Percentile)

Q1 is the median of the lower half of the data (before Q2).

The lower half of the data is: 10, 20

The median of 10 and 20 gives Q1 = (10+20)/2 = 15

Find Q3 (Third Quartile - 75th Percentile)

Q3 is the median of the upper half of the data (after Q2).

The upper half of the data is: 40, 50

The median of 40 and 50 gives Q3 = (40+50)/2 = 45

Then, find the IQR

IQR = Q3- Q1 = 45 – 15 = 30

Another example:

The given dataset = (10, 20, 30, 40, 50, 60). It has an even number of values.

Q2 = (30+40)/2 = 35

Find Q1 (First Quartile - 25th Percentile)

Q1 is the median of the lower half of the data (before Q2).

The lower half of the data is: 10, 20, 30

The median of 10, 20 and 30 gives Q1 = 20

Find Q3 (Third Quartile - 75th Percentile)

Q3 is the median of the upper half of the data (after Q2).

The upper half of the data is: 40, 50, 60

The median of 40, 50 and 60 gives Q3 = 50

Then, find the IQR

IQR = Q3- Q1 = 50 – 20 = 30
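
The steps used in both examples (sort, find Q2, then take the median of each half) can be sketched as a small function. The name `quartiles_by_halves` is hypothetical, and the function assumes at least two values so that both halves are non-empty.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    For an odd number of values, the middle value (Q2) is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

q1_odd, q2_odd, q3_odd = quartiles_by_halves([10, 20, 30, 40, 50])
iqr_odd = q3_odd - q1_odd

q1_even, q2_even, q3_even = quartiles_by_halves([10, 20, 30, 40, 50, 60])
iqr_even = q3_even - q1_even

print(q1_odd, q3_odd, iqr_odd)     # 15.0 45.0 30.0
print(q1_even, q3_even, iqr_even)  # 20 50 30
```

Several definitions of quartiles are in use. Library routines that interpolate between values (NumPy's default percentile method, for example) can return different quartiles for the same data, so results should be compared only when the method is known.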

3. Correlation

Correlation describes the statistical relationship between two variables:

In a positive correlation, both variables move in the same direction.

In a negative correlation, the variables move in opposite directions.

In zero correlation, the variables show no linear relationship.
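
The notes do not name a specific correlation measure. A common choice is the Pearson correlation coefficient, which ranges from -1 (perfect negative) through 0 (no linear relationship) to +1 (perfect positive). The sketch below computes it by hand; the function name and the small datasets are made up for illustration.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations of each variable.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_dev / (spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises as x rises
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_zero = pearson_r(x, [3, 1, 2, 1, 3])       # no overall linear trend

print(round(r_positive, 2))  # 1.0
print(round(r_negative, 2))  # -1.0
print(round(r_zero, 2))      # 0.0
```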

Example:

We want to find a decent apartment to rent that is not too expensive compared to other apartments
we've found. The other apartments we found on a website are priced as follows: $700, $850, $1,500,
and $750 per month:

Given Rent Prices: $700, $850, $1500, $750
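
The calculation for this example is not shown in the notes. One plausible way to judge whether a rent is "too expensive", assuming the intent is to compare typical prices, is to compute the mean and the median of the listed rents using the measures defined earlier.

```python
from statistics import mean, median

rents = [700, 850, 1500, 750]

mean_rent = mean(rents)      # (700 + 850 + 1500 + 750) / 4
median_rent = median(rents)  # average of the two middle values, 750 and 850

print(mean_rent)    # 950
print(median_rent)  # 800.0
```

The single $1,500 listing pulls the mean up to $950, while the median stays at $800, so the median is arguably the better reference point for a typical rent here.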

Types of Data:

It is important to understand what kind of data we are dealing with so that we can select both the
right statistical measure and the right visualization.

Data is mainly categorized into categorical (qualitative) and numerical (quantitative).

Categorical (Qualitative) Data:

This type of data describes characteristics or qualities. It is not measured with numbers.

We can further divide categorical data into nominal data and ordinal data.

Nominal Data (No specific order):

Example: Colors (Red, Blue, Green), Types of Fruits (Apple, Banana, Orange)

Ordinal Data (Has a specific order):

Example: Movie ratings (Poor, Average, Good, Excellent), Education levels (High School, Bachelor's,
Master's, Ph.D.)

Numerical (Quantitative) Data:

This type of data consists of numbers and represents measurable quantities.

Numerical data can be divided into discrete and continuous data.

Discrete Data (Countable, whole numbers):

Example: Number of students in a class (30, 31, 32), Number of cars in a parking lot (5, 10, 15)

Continuous Data (Can take any value within a range):

Example: Height of people (5.4 ft, 5.5 ft, 5.6 ft), Temperature (23.5°C, 24.1°C)

Other Important Considerations:

Temporal Data (Changes over time):

Example: Daily temperature, Stock prices, Monthly rainfall

Spatial Data (Related to location):

Example: Population density in different cities, Weather patterns across regions

Figure: Classification of types of data.

Summary Statistics:

The following table gives an overview of which measure of central tendency is best suited to a
particular type of data:

Figure: Best suited measures of central tendency for different types of data.
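
The table itself is not reproduced in these notes. As a commonly used rule of thumb (an assumption about what the table shows), the mode applies to nominal data, the mode and median apply to ordinal data, and the mode, median and mean all apply to numerical data. The sketch below illustrates this with made-up values based on the examples above.

```python
from statistics import mean, median_low, mode

# Nominal data: categories have no order, so only the mode is meaningful.
colors = ["Red", "Blue", "Green", "Blue"]
most_common_color = mode(colors)

# Ordinal data: categories have an order, so a median category also exists.
rating_order = ["Poor", "Average", "Good", "Excellent"]
ratings = ["Good", "Poor", "Excellent", "Good", "Average"]
ranks = [rating_order.index(r) for r in ratings]  # position in the ordering
median_rating = rating_order[median_low(ranks)]

# Numerical data: arithmetic is meaningful, so the mean can be used too.
heights_cm = [160, 162, 163, 165]
mean_height = mean(heights_cm)

print(most_common_color)  # Blue
print(median_rating)      # Good
print(mean_height)        # 162.5
```

`median_low` is used for the ordinal case so that the result is always one of the existing categories rather than a value halfway between two of them.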

In data visualization, these statistical measures help summarize and interpret data effectively, making
patterns and trends easier to understand.
