Module1 Understanding Data1

The document provides an overview of descriptive statistics, including dataset summarization and data types such as categorical, ordinal, and numerical data. It discusses various data analysis techniques, including univariate, bivariate, and multivariate analyses, as well as visualization methods like bar charts and histograms. Additionally, it covers central tendency measures, dispersion, skewness, kurtosis, and the coefficient of variation, emphasizing their importance in understanding data distributions.

Uploaded by

Lakshmi Hj

Module-1

Understanding of Data
2.4 Descriptive Statistics
• Descriptive statistics is the branch of statistics that summarizes and describes datasets.
• It is not concerned with ML algorithms or how they work.
• Descriptive analytics and data visualization techniques help to understand
the nature of the data, which in turn helps to determine the kinds of
machine learning or data mining tasks that can be applied to the data.
• This step is often known as Exploratory Data Analysis (EDA).
Dataset and Data Types
• A dataset can be assumed to be a collection of data objects.
• The data objects may be records, points, vectors, patterns, events, cases, samples
or observations.
• Example: Sample Patient Table
Patient ID | Name  | Age | Blood Test | Fever | Disease
1          | John  | 21  | Negative   | Low   | No
2          | Andre | 36  | Positive   | High  | Yes
• Every attribute should be associated with a value. This process is called measurement.
• The type of attribute determines the data types.
Types of Data
Categorical or Qualitative Data
Nominal Data: Nominal data is a type of categorical data that represents labels,
names, or categories without any inherent order or ranking.
• Mathematical operations like addition or subtraction do not apply.
Example of Nominal Data:
• Gender: Male, Female
• Eye Color: Blue, Brown, Green
• Blood Group: A, B, AB, O
• Car Brands: Toyota, Ford, BMW
• Yes/No Responses: Yes, No
• Since nominal data is purely categorical, it is typically analyzed using frequency
counts or mode (most frequently occurring category).
• Only operations like (=,≠) are meaningful for these data.
• Example: The patient ID can be checked for equality and nothing else.
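As a small illustration of frequency counts and the mode for nominal data, the sketch below uses only Python's standard library. The list of blood groups is made-up sample data, not taken from the patient table.

```python
from collections import Counter

# Hypothetical sample of a nominal attribute (blood group); not real data.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]

# Frequency count: how many times each category occurs.
frequencies = Counter(blood_groups)

# Mode: the most frequently occurring category.
mode_value, mode_count = frequencies.most_common(1)[0]

print(frequencies)
print("Mode:", mode_value, "appears", mode_count, "times")

# Equality is the only meaningful comparison between two nominal values.
print(blood_groups[0] == blood_groups[6])
```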
Ordinal Data: Ordinal data is a type of categorical data where the categories have a
meaningful order or ranking.
• But, the differences between them are not necessarily uniform or measurable.
• Arithmetic operations like addition or subtraction are not applicable.
Example of Ordinal Data:
• Customer Satisfaction Levels: Poor, Fair, Good, Very Good, Excellent
• Education Levels: High School, Bachelor's, Master's, PhD
• Movie Ratings: 1 star, 2 stars, 3 stars, 4 stars, 5 stars
• Ranks in a Competition: 1st place, 2nd place, 3rd place
• Fever: Low, Medium, High
• In addition to equality checks, order comparisons (<, >) are meaningful for these data.
• Since ordinal data has a ranking, it is often analyzed using median or percentile-based methods.
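The median of ordinal data can be found from the ranking alone, without doing arithmetic on the categories. The sketch below assumes the Fever ordering Low < Medium < High from the example list; the seven readings are hypothetical.

```python
# Order of the categories, lowest to highest (taken from the Fever example).
LEVELS = ["Low", "Medium", "High"]
rank = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical fever readings for seven patients; not real data.
readings = ["High", "Low", "Medium", "Low", "High", "Medium", "Medium"]

# Sort by rank, then take the middle element (odd number of values).
ordered = sorted(readings, key=rank.__getitem__)
median_level = ordered[len(ordered) // 2]

print(ordered)
print("Median fever level:", median_level)

# Order comparisons are meaningful, but differences between levels are not.
print(rank["Low"] < rank["High"])
```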
Numeric or Quantitative Data
Interval Data: Interval data is a type of numerical data where the differences
between values are meaningful and equal, but there is no true zero point.
• This means that while addition and subtraction are possible, multiplication and
division are not meaningful.
Example of Interval Data:
• Temperature (in Celsius or Fahrenheit): 10°C, 20°C, 30°C (The difference
between 10°C and 20°C is the same as between 20°C and 30°C, but 0°C does
not mean "no temperature.")
• Time of the Day (on a 12-hour clock): 3 AM, 6 AM, 9 AM (There is no absolute
zero time; the clock continues in cycles.)
• IQ Scores: 90, 100, 110 (The differences are measurable, but there is no true
zero IQ.)
• Only operations like (+,-) are meaningful for these data
Ratio Data: Ratio data is a type of numerical data that has all the properties of interval data,
but with a true zero point, meaning zero represents the total absence of the
measured variable.
• This allows for meaningful addition, subtraction, multiplication, and division
operations.
Examples of Ratio Data:
• Height & Weight: A person who is 100 kg weighs twice as much as someone
who is 50 kg.
• Age: A 20-year-old is twice as old as a 10-year-old.
• Kelvin Temperature: 0 K (absolute zero) means a complete absence of thermal
energy.
• Distance: 0 km means no distance, and 10 km is twice as far as 5 km.
Since ratio data has a true zero, it allows for meaningful ratio comparisons, unlike
interval data (e.g., 40°C is not "twice as hot" as 20°C, but 40 kg is twice as heavy as
20 kg).
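The difference between interval and ratio scales can be checked numerically. The sketch below converts the two Celsius values to Kelvin (a ratio scale) before dividing; the helper name `celsius_to_kelvin` is introduced here only for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature to Kelvin (a ratio scale)."""
    return celsius + 273.15

# Naive ratio on the interval scale: suggests "twice as hot", which is misleading.
naive_ratio = 40 / 20

# Ratio on the Kelvin scale, where zero is a true zero.
kelvin_ratio = celsius_to_kelvin(40) / celsius_to_kelvin(20)

# Weight is ratio data, so this ratio is meaningful as it stands.
weight_ratio = 40 / 20

print(naive_ratio)
print(round(kelvin_ratio, 3))
print(weight_ratio)
```

On the Kelvin scale the ratio comes out only slightly above 1, which is why the Celsius ratio of 2 should not be read as "twice as hot".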
• Another way of classifying the data:
1. Discrete value data
2. Continuous data
Discrete data is a type of numerical data that consists of distinct, separate values.
It can only take specific, countable values and cannot be divided into smaller parts
meaningfully.
Examples of Discrete Data:
• Number of students in a class: 25, 30, 35 (cannot be 25.5 students).
• Number of goals scored in a match: 1, 2, 3 (not 1.5 goals).

Discrete data is typically represented using bar charts or count-based statistics like
mode and frequency.
Continuous data is a type of numerical data that can take any value within a given
range, including fractions and decimals.
It is measurable rather than countable.

Examples of Continuous Data:

• Height of a person: 170.2 cm, 170.25 cm, 170.257 cm.
• Weight of an object: 65.5 kg, 65.55 kg, 65.555 kg.
• Temperature: 22.3°C, 22.35°C, 22.357°C.

Continuous data is typically represented using histograms or line graphs and
analyzed using statistical measures like mean, standard deviation, and range.
Classification of data based on number of variables:
1. Univariate Data
2. Bivariate Data
3. Multivariate Data
Univariate Data: Univariate data refers to a dataset that contains only one variable.
• It focuses on analyzing a single characteristic or feature without considering
relationships with other variables.
• The analysis is simple and focuses on distribution, central tendency (mean,
median, mode), and spread (range, variance, standard deviation).
Examples of Univariate Data:
• Number of cars in different households: (1, 2, 3, 2, 4, etc.)
• Test scores of students: (85, 90, 78, 92, etc.)
• Colors of cars in a parking lot: (Red, Blue, Black, White, etc.)
Bivariate Data: Bivariate data refers to a dataset that involves two variables and examines the
relationship between them.
• It is used to explore correlations and possible cause-and-effect relationships between the variables (correlation alone does not establish causation).
• It helps in identifying patterns, correlations, or dependencies between the two
variables.
Examples of Bivariate Data:
• Height vs. Weight: Analyzing how a person’s weight changes with height.
• Study Hours vs. Exam Score: Examining whether more study hours lead to higher
scores.
• Temperature vs. Ice Cream Sales: Studying how ice cream sales change with
temperature.
Multivariate Data: Multivariate data refers to a dataset that contains three or more variables and
examines relationships among them.
• It is used in complex analysis to understand how multiple factors interact with each
other.
Examples of Multivariate Data:
• Student Performance Analysis: Examining how study hours, attendance, and
sleep duration affect exam scores.
• Weather Prediction: Analyzing temperature, humidity, and wind speed to
forecast weather conditions.
• Sales Performance: Studying how price, advertising budget, and customer
reviews impact product sales.
• Health Analysis: Evaluating how age, blood pressure, cholesterol levels, and
exercise habits affect heart disease risk.
2.5 Univariate Data Analysis and Visualization
• Data visualization is the graphical representation of information and data.
• It helps people understand complex data patterns, trends, and insights by using
visual elements such as charts, graphs, maps, and dashboards.

Bar Chart: A bar chart is a graphical representation of data using rectangular bars,
where the length or height of each bar represents the value of a particular
category.
• Bars can be displayed vertically (column chart) or horizontally.
• A bar chart is best suited for categorical data (data divided into groups or
categories).
• It can also be used for discrete numerical data.
• It is not ideal for continuous data (e.g., temperature, speed); a histogram or line chart is usually
better for that.
Pie Chart: A pie chart is a circular graph divided into slices, where each slice
represents a proportion of the whole.
• The size of each slice corresponds to the percentage or fraction of a category
within the dataset.
• It is equally helpful in illustrating univariate data.
Histogram: A histogram is a graphical representation of the distribution of
numerical data.
• It looks like a bar chart, but instead of showing categories, it groups continuous
data into bins (intervals) and shows the frequency of data points within each bin.
When to Use a Histogram?
• When analyzing continuous numerical data
• To understand the frequency of data within specific ranges
• To observe distribution patterns (e.g., normal, skewed, uniform)
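The binning step behind a histogram can be sketched in a few lines. The function name `histogram_counts`, the bin edges, and the list of heights below are all assumptions chosen for illustration, not data from the slides.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall into each bin [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        # The last bin also includes its upper edge.
        if v == start + n_bins * width:
            index = n_bins - 1
        # Values outside the overall range are ignored.
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical heights in cm; not real data.
heights = [151.2, 158.4, 160.0, 162.5, 163.1, 167.8, 169.9, 170.2, 174.6, 181.3]

counts = histogram_counts(heights, start=150, width=10, n_bins=4)

# Text histogram: one '#' per data point in the bin.
for i, count in enumerate(counts):
    low = 150 + i * 10
    print(f"{low}-{low + 10} cm | {'#' * count}")
```

Unlike a bar chart, the bins here are contiguous numeric intervals, so the shape of the printed bars reflects the distribution of the data.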
Problem 1:
• There are 60 students in a class. Among them, 15 students were placed in a
company offering a 3.5 lakh package, 10 students in a 6.5 lakh package, 8
students in a 10 lakh package, and 5 students in a 12 lakh package.
Generate a bar chart and a pie chart.
• Solution:
3.5 lakh package: 15/60 × 100 = 25 %
6.5 lakh package: 10/60 × 100 = 16.67 ≈ 16.7 %
10 lakh package: 8/60 × 100 = 13.33 ≈ 13.3 %
12 lakh package: 5/60 × 100 = 8.33 ≈ 8.3 %
Not placed = total − placed students = 60 − 38 = 22 students
22/60 × 100 = 36.67 ≈ 36.7 %
[Figure: bar chart and pie chart, "Placement distribution of students"]
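The percentages in Problem 1 can be reproduced with a short script. This is a minimal sketch using only the counts given in the problem; it prints a text bar chart rather than drawing one, and the computed percentages are the slice sizes a pie chart would use.

```python
total_students = 60

# Number of students placed at each package (in lakhs), from Problem 1.
placed = {"3.5 lakh": 15, "6.5 lakh": 10, "10 lakh": 8, "12 lakh": 5}

# Students without a placement make up the remaining category.
counts = dict(placed)
counts["Not placed"] = total_students - sum(placed.values())

# Percentage share of each category (the pie chart slice sizes).
percentages = {label: n / total_students * 100 for label, n in counts.items()}

# Text bar chart: one '#' per student.
for label, n in counts.items():
    print(f"{label:>10} | {'#' * n} {n} ({percentages[label]:.1f} %)")
```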
Problem 2:
Total students = 60
Consider the ranges 0-3 lakh, 3-6, 6-9, 9-12, 12-15.
No. of placed students:
Package (in Lakhs) | College A | College B
3                  | 25        | 6
5                  | 12        | 15
7                  | 2         | 6
10                 | 1         | 14
11.5               | 0         | 9
15                 | 0         | 10
Central Tendency
• Central tendency refers to the measure that represents the center or typical value
of a dataset.
• It helps in understanding the overall trend of the data by identifying a single
value that best describes the distribution.
The three main measures of central tendency are:
1. Mean (Arithmetic Average)
2. Median (Middle Value)
3. Mode (Most Frequent Value)
1. Mean (Arithmetic Average)
• The sum of all values divided by the number of values.
• Formula: Mean = ∑X / N, i.e. (x1 + x2 + ... + xN) / N
• Example: If the numbers are 5, 10, 15, 20, 25,
then Mean = (5 + 10 + 15 + 20 + 25) / 5 = 75 / 5 = 15
• Best for: Numerical data without extreme values (outliers).

2. Median (Middle Value)
• The middle value when the data is arranged in ascending order.
• If the number of values is odd, the median is the middle number.
• If the number of values is even, the median is the average of the two middle
numbers.
Example:

For 3, 7, 9 → Median = 7
For 3, 7, 9, 12 → Median = (7+9)/2 = 8

Best for: Skewed distributions or data with outliers.

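Median Formula for Continuous Data
• When dealing with continuous data (grouped frequency distributions), the
median is estimated using the following formula:
Median = L + ((N/2 − cf) / f) × h
where L is the lower boundary of the median class, N is the total frequency, cf is the cumulative frequency of the class before the median class, f is the frequency of the median class, and h is the class width.
Steps to Find the Median in Continuous Data
• Calculate N/2 (where N is the total frequency).
• Identify the median class, which is the class where the cumulative frequency just
exceeds N/2.
• Use the median formula to compute the median value.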
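The three steps can be sketched in code. This assumes the standard grouped-data estimate Median = L + ((N/2 − cf) / f) × h, with L the lower boundary of the median class, cf the cumulative frequency before it, f its frequency, and h the class width. The function name and the frequency table are hypothetical.

```python
def grouped_median(lower_bounds, width, frequencies):
    """Estimate the median of a grouped frequency distribution.

    Assumes equal-width classes and uses L + ((N/2 - cf) / f) * h.
    """
    n = sum(frequencies)
    half = n / 2                      # Step 1: N/2
    cumulative = 0                    # cumulative frequency before the current class
    for lower, freq in zip(lower_bounds, frequencies):
        # Step 2: the median class is the first one whose cumulative
        # frequency reaches N/2.
        if freq > 0 and cumulative + freq >= half:
            # Step 3: apply the formula.
            return lower + (half - cumulative) / freq * width
        cumulative += freq
    raise ValueError("empty frequency table")

# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50 with made-up frequencies.
lower_bounds = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 10, 5]

median = grouped_median(lower_bounds, 10, frequencies)
print(round(median, 2))
```

Here N = 40, so N/2 = 20; the cumulative frequencies are 5, 13, 25, which makes 20-30 the median class with cf = 13 and f = 12.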
3. Mode (Most Frequent Value)

• The number that appears most frequently in the dataset.

• Example: 3, 5, 7, 7, 9, 9, 9 → Mode = 9 (since it appears the most).

Best for: Categorical or discrete data.

Choosing the Best Measure

• Use Mean if data is normally distributed (no extreme values).

• Use Median if data has outliers or skewness.
• Use Mode if dealing with categorical data or discrete data.
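The examples above can be checked with Python's `statistics` module. The salary list at the end is hypothetical and is included only to show how a single extreme value pulls the mean away from the median.

```python
import statistics

# Examples from the text.
mean_value = statistics.mean([5, 10, 15, 20, 25])     # arithmetic average
median_odd = statistics.median([3, 7, 9])             # odd count: middle value
median_even = statistics.median([3, 7, 9, 12])        # even count: average of the two middle values
mode_value = statistics.mode([3, 5, 7, 7, 9, 9, 9])   # most frequent value
print(mean_value, median_odd, median_even, mode_value)

# Hypothetical salaries (in lakhs) with one extreme value.
salaries = [3, 4, 4, 5, 6, 50]
mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
print(mean_salary, median_salary)
```

With the outlier present, the mean is far above what most of the values look like, while the median stays close to the bulk of the data.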
Dispersion
• In statistics, dispersion refers to the extent to which a set of data points is
spread out or scattered around a central value (such as the mean or median).
• It helps measure the variability or consistency of data.

Ways of measuring dispersion:

1. Range
2. Variance
3. Standard deviation
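A minimal sketch of the three measures, using the patient ages from the IQR problem in this section. The slides do not say whether the population or sample formula is intended; the population versions (`pvariance`, `pstdev`) are assumed here.

```python
import statistics

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]

# Range: difference between the largest and smallest value.
data_range = max(ages) - min(ages)

# Population variance: mean of squared deviations from the mean.
variance = statistics.pvariance(ages)

# Population standard deviation: square root of the variance.
std_dev = statistics.pstdev(ages)

print(data_range)
print(round(variance, 2))
print(round(std_dev, 2))
```

For a sample rather than a full population, `statistics.variance` and `statistics.stdev` divide by N − 1 instead of N.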
Problem: Given the patients' age list {12, 14, 19, 22, 24, 26, 28, 31, 34}, find the IQR.
Solution: To find the Interquartile Range (IQR) for the given patients' ages:
Step 1: Arrange the data in ascending order and find the median
The data is already sorted:
12, 14, 19, 22, 24, 26, 28, 31, 34. With 9 values, the median is the 5th value, 24.

Step 2: Find the quartiles (Q1 and Q3)

• Q1 or Q0.25 (First Quartile): The median of the lower half ({12, 14, 19, 22})
• Median of {12, 14, 19, 22} = (14 + 19) / 2 = 16.5
• Q3 or Q0.75 (Third Quartile): The median of the upper half ({26, 28, 31, 34})
• Median of {26, 28, 31, 34} = (28 + 31) / 2 = 29.5

Step 3: Compute the IQR

• IQR = Q3 − Q1 = 29.5 − 16.5 = 13
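The same steps can be written as a short function. This sketch follows the median-of-halves method used in the solution (the middle value is left out of both halves when the count is odd); other quartile conventions exist and can give slightly different values. The function name is hypothetical.

```python
import statistics

def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves method.

    For an odd number of values the middle value is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

ages = [12, 14, 19, 22, 24, 26, 28, 31, 34]
q1, med, q3 = quartiles_by_halves(ages)
iqr = q3 - q1

# Minimum, Q1, median, Q3 and maximum together are the values a box plot draws.
five_number = (min(ages), q1, med, q3, max(ages))

print(q1, med, q3, iqr)
print(five_number)
```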
Five-point summary
Summary of the box-plot

• Wide Box (High IQR): The data between the first quartile (Q1) and third
quartile (Q3) is more dispersed, so the box in the box plot appears
wider.
• Narrow Box (Low IQR): The data is more concentrated around the
median.
Shape of Data
Skewness and kurtosis (measures based on moments) indicate the symmetry/asymmetry and
the tail heaviness of the dataset.
Skewness: It is a measure of asymmetry in the distribution of data values. It tells us
whether the data is symmetrically distributed or leans more toward one side of
the mean.
Types of Skewness
• Positive Skewness (Right-Skewed Data)
• The tail on the right side (higher values) is longer.
• Most data points are concentrated towards the left.
• Typically, Mean > Median > Mode.
• Negative Skewness (Left-Skewed Data)
• The tail on the left side (lower values) is longer.
• Most data points are concentrated towards the right.
• Typically, Mean < Median < Mode.
• Zero Skewness (Symmetric Data)
• The left and right sides of the distribution are roughly mirror images.
• Mean = Median = Mode (for a symmetric distribution with a single peak).
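Skewness can also be computed as a number. The sketch below uses one common moment-based definition (third central moment divided by the variance raised to the power 1.5); the slides do not specify which formula is intended, so treat this choice as an assumption. Both datasets are hypothetical.

```python
import statistics

def skewness(values):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # second central moment (variance)
    m3 = sum((x - mean) ** 3 for x in values) / n   # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]   # hypothetical data with a long right tail
symmetric = [1, 2, 3, 4, 5]                # hypothetical symmetric data

print(skewness(right_skewed))
print(skewness(symmetric))

# For the right-skewed data the mean sits above the median.
print(statistics.mean(right_skewed) > statistics.median(right_skewed))
```

A positive result indicates a longer right tail, a negative result a longer left tail, and a value near zero a roughly symmetric distribution.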
Kurtosis is a statistical measure that describes the shape of a probability
distribution, specifically its "tailedness" or the extremity of outliers in the data.
• In simpler terms, it tells us whether the data has heavy or light tails compared
to a normal distribution.

Why do we use kurtosis?

Outlier Detection: Kurtosis can help identify whether a dataset has extreme outliers.

• High kurtosis indicates that the data may contain outliers, which can be important in
many applications like risk management, financial modeling, or quality control.
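To see how an extreme value affects kurtosis, the sketch below computes excess kurtosis (fourth central moment over squared variance, minus 3, so that a normal distribution scores about 0). This is one common definition and is assumed here; the two small datasets are hypothetical.

```python
import statistics

def excess_kurtosis(values):
    """Fourth central moment over squared variance, minus 3."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((x - mean) ** 2 for x in values) / n   # variance
    m4 = sum((x - mean) ** 4 for x in values) / n   # fourth central moment
    return m4 / m2 ** 2 - 3

no_outlier = [8, 9, 10, 10, 10, 11, 12]     # hypothetical, no extreme values
with_outlier = [8, 9, 10, 10, 10, 11, 40]   # same data with one extreme value

k_plain = excess_kurtosis(no_outlier)
k_outlier = excess_kurtosis(with_outlier)
print(round(k_plain, 2), round(k_outlier, 2))
```

Replacing a single value with an extreme one moves the result from negative (lighter tails than normal) to clearly positive (heavier tails).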
Shape of Data
MEAN ABSOLUTE DEVIATION AND COEFFICIENT OF VARIATION
• The coefficient of variation (CV) is a statistical measure that describes the
relative variability of a dataset.
• It is the ratio of the standard deviation to the mean, often expressed as a
percentage.
Formula:
Coefficient of Variation (CV) = (Standard Deviation / Mean) × 100

• Higher CV: A higher CV indicates greater variability relative to the mean,
suggesting that the data points are more spread out compared to the average.

• Lower CV: A lower CV means less variability relative to the mean, implying that
the data points are more consistent around the average.
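A short sketch of the CV formula applied to two datasets with the same mean but different spread. The population standard deviation is assumed, and the sales figures are made up for illustration.

```python
import statistics

def coefficient_of_variation(values):
    """CV in percent: (standard deviation / mean) * 100, using the population standard deviation."""
    return statistics.pstdev(values) / statistics.fmean(values) * 100

# Hypothetical daily sales for two shops; not real data.
shop_a = [98, 100, 102, 100, 100]
shop_b = [60, 140, 100, 80, 120]

cv_a = coefficient_of_variation(shop_a)
cv_b = coefficient_of_variation(shop_b)
print(round(cv_a, 2), round(cv_b, 2))
```

Both shops average 100, but the second has a much larger CV, so its sales are less consistent around that average.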
Special Univariate plots
• A convenient way to check the shape of a small dataset is a stem-and-leaf plot.
• A stem-and-leaf plot is a method of organizing numerical data to show its
distribution while maintaining the original values.
• It helps in quickly identifying patterns, such as the shape of the data, clusters,
and outliers.

Structure of a Stem-and-Leaf Plot:

• The stem represents the leading digits (e.g., tens place).
• The leaves represent the last digit (e.g., ones place).
Stem-Leaf Plot
Example:
• Consider the following set of numbers: 23, 25, 31, 32, 35, 41, 42, 43, 47. Construct a
stem-and-leaf plot.
Stem | Leaf
2 | 3 5
3 | 1 2 5
4 | 1 2 3 7

• Stem (left side): 2, 3, 4 (representing 20s, 30s, 40s).

• Leaves (right side): The last digit of each number.
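The plot in this example can be generated programmatically. The sketch below assumes non-negative integers with the tens part as the stem and the ones digit as the leaf; the function name is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group numbers by tens part (stem) and list their ones digits (leaves)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        plot[stem].append(leaf)
    return dict(plot)

numbers = [23, 25, 31, 32, 35, 41, 42, 43, 47]
plot = stem_and_leaf(numbers)

print("Stem | Leaf")
for stem, leaves in plot.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```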
Q-Q Plot
• A Q-Q (Quantile-Quantile) plot is a graphical tool used to compare the
distribution of a dataset with a theoretical distribution (such as a normal
distribution).
• It helps determine whether a dataset follows a specific distribution by plotting the
quantiles of the data against the quantiles of the chosen theoretical distribution.
Key Features of a Q-Q Plot:
• If the points lie close to a straight diagonal line, the data approximately follows the given
distribution.
• Deviations from the line indicate differences in shape, skewness, or outliers.
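The points of a normal Q-Q plot can be computed without a plotting library. This sketch fits a normal distribution to the data and pairs each sorted value with the matching theoretical quantile; the plotting positions (i − 0.5) / n are one common convention and are an assumption here. The sample reuses the patient ages from the IQR problem.

```python
from statistics import NormalDist, fmean, pstdev

def qq_points(values):
    """Return (theoretical, observed) quantile pairs against a fitted normal distribution."""
    data = sorted(values)
    n = len(data)
    fitted = NormalDist(fmean(data), pstdev(data))
    # Plotting positions (i - 0.5) / n keep probabilities strictly between 0 and 1.
    theoretical = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, data))

sample = [12, 14, 19, 22, 24, 26, 28, 31, 34]
points = qq_points(sample)

for theoretical, observed in points:
    print(f"{theoretical:6.2f}  {observed}")
```

If the two columns track each other closely, the plotted points would fall near the diagonal line; large gaps at the ends would suggest heavier or lighter tails than the normal distribution.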
END
Lab program 2
• Develop a program to compute the correlation matrix to understand the
relationships between pairs of features. Visualize the correlation matrix using a
heatmap to see which variables have strong positive/negative correlations.
Create a pair plot to visualize pairwise relationships between features. Use the
California Housing dataset.
Introduction
• In data analysis and machine learning, understanding the relationships
between features is crucial for feature selection, multicollinearity
detection, and data interpretation.
• Correlation and pair plots are two essential techniques to analyze these
relationships.
1. Correlation Matrix
• A correlation matrix is a table showing correlation coefficients between
variables.
• It helps in understanding how strongly features are related to each other.
Types of Correlation
• Positive Correlation (0 to +1): As one feature increases, the other also
tends to increase.
• Negative Correlation (−1 to 0): As one feature increases, the other tends to decrease.
• No Correlation (0): No linear relationship between the variables.
Why Should You Use a Correlation Matrix?

• Identifies relationships between features.

• Helps in detecting multicollinearity in machine learning models.
• Highlights redundant features that may not add value to the model.
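To make the idea concrete, the sketch below computes a Pearson correlation matrix by hand on a tiny made-up table. It is not the lab solution: the feature names and values are hypothetical and are not the California Housing dataset, and in practice a data-frame library would usually be used instead of these helper functions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict with the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical toy features; NOT the California Housing dataset.
features = {
    "rooms": [2, 3, 4, 5, 6],
    "price": [20, 31, 39, 52, 60],
    "distance": [10, 8, 7, 4, 2],
}

matrix = correlation_matrix(features)
for name, row in matrix.items():
    print(f"{name:>8}", [round(v, 2) for v in row.values()])
```

Each feature correlates perfectly with itself (the diagonal), and the matrix is symmetric, which is why heatmaps of correlation matrices are mirrored across the diagonal.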
2. Heatmap for Correlation Matrix
• A heatmap is a visual representation of the correlation matrix.
• It uses color coding to indicate the strength of relationships between variables.
Benefits of Using a Heatmap
• Easy to interpret relationships between features.
• Quickly identifies highly correlated variables.
• Helps in feature selection and data preprocessing.
3. Pair Plot
• A pair plot (also known as a scatter plot matrix) is a collection of scatter plots for
every pair of numerical variables in the dataset.
• It helps in visualizing relationships between variables.
Why Use a Pair Plot?
• Shows the distribution of individual features along the diagonal.
• Displays relationships between features using scatter plots.
• Helps in identifying clusters, trends, and potential outliers.
