Knowing The Data Set

Uploaded by

Debasis Mahapatra

Summer Training cum Internship Program on Machine Learning and Deep Learning

Organized By:
Department of Computer Science & Engineering
Parala Maharaja Engineering College,
Berhampur, Odisha, 761003
Knowing the dataset

Presenter:
Dr. Debasis Mohapatra
Assistant Professor
Dept. of CSE, PMEC, BAM, 761003
Overall approach of Machine Learning

[Figure source: KDnuggets, "Text Data Preprocessing: A Walkthrough in Python"]
Data objects and attributes
 Data sets are made up of data objects. Data objects stored in a database are
referred to as data tuples, rows, or samples.
 Examples:-
 In a medical database, the objects may be patients.
 In a university database, the objects may be students, professors, etc.

 Data objects are typically described by attributes, which correspond to the columns
of the data set. Attributes are also known as features, dimensions, or variables.
 Examples: - Attributes of a Student object may be “Name”, “Subject_Name”,
“Total_Marks”, “Obtained_Marks”, etc.
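As a minimal sketch of the row/column view above, the Student example can be written in Python. The class name, field names, and the two sample records are illustrative assumptions, not data from the training material.

```python
from dataclasses import dataclass, astuple, fields


@dataclass
class Student:
    # Each field is one attribute (column) of the data object.
    name: str
    subject_name: str
    total_marks: int
    obtained_marks: int


# Each instance is one data object (a row / tuple / sample).
# The values below are made up for illustration.
students = [
    Student("Asha", "Mathematics", 100, 82),
    Student("Ravi", "Physics", 100, 74),
]

# Column names of the data set.
attribute_names = [f.name for f in fields(Student)]

# One tuple per data object, in attribute order.
rows = [astuple(s) for s in students]

print(attribute_names)
for row in rows:
    print(row)
```

Each tuple in `rows` is what the later slides call an attribute vector (or feature vector).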
What is an attribute?
 An attribute is a data field, representing a characteristic or feature of a data
object.
 The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
 The term dimension is commonly used in data warehousing.
 Machine learning literature tends to use the term feature.
 Statisticians prefer the term variable.
 Data mining and database professionals commonly use the term attribute.
 A set of attributes used to describe a given object is called an attribute vector (or
feature vector).
 The distribution of data involving one attribute (or variable) is called univariate. A
bivariate distribution involves two attributes, and so on. In general, a distribution
with two or more variables is multivariate.
Different types of attributes
 The type of an attribute is determined by the set of possible
values it can take. An attribute can be one of the following types.

 Nominal (Qualitative)
 Binary (Qualitative)
 Ordinal (Qualitative)
 Numeric (Quantitative)
Nominal/Categorical attribute
 Nominal means “relating to names”. The values of a nominal attribute are
symbols or names of things.
 Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
 The values do not have any meaningful order.
 Examples:-
 Possible values for attribute “hair color” are black, brown, blond, red, gray,
and white.
 The attribute marital status can take on the values single, married, divorced,
and widowed.
Binary Attribute

 A binary attribute is a nominal attribute with only two categories or
states: 0 or 1.
 Examples:-
 Given the attribute smoker describing a patient object, 1 indicates that
the patient smokes, while 0 indicates that the patient does not.
 Similarly, suppose the patient undergoes a medical test that has two
possible outcomes. The attribute medical test is binary, where a value
of 1 means the result of the test for the patient is positive, while 0
means the result is negative.
A Binary attribute can be Symmetric or Asymmetric
 A binary attribute is symmetric if both of its states are equally valuable and carry
the same weight; that is, there is no preference on which outcome should be
coded as 0 or 1.
 One such example could be the attribute gender having the states male and
female.

 A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for HIV.
 By convention, we code the most important outcome by 1 (e.g., HIV positive)
and the other by 0 (e.g., HIV negative).
Ordinal Attribute

 An ordinal attribute is an attribute with possible values that have a meaningful
order or ranking among them.
 Examples:-
 Grade (e.g., A+, A, A−, B+, and so on)
 Customer satisfaction review: 0: very dissatisfied, 1: somewhat
dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied
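The customer satisfaction scale above can be sketched in Python to show what "meaningful order" buys in practice: the integer code doubles as a rank, so values can be compared and sorted. The `responses` list is made-up sample data, and the variable names are assumptions.

```python
# Ordinal scale from the customer satisfaction example:
# the position in this list is the integer code (0..4) and also the rank.
SATISFACTION_ORDER = [
    "very dissatisfied",
    "somewhat dissatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]
rank = {label: code for code, label in enumerate(SATISFACTION_ORDER)}

# Made-up survey responses for illustration.
responses = ["satisfied", "neutral", "very satisfied", "very dissatisfied"]

# Encode each response as its ordinal code.
codes = [rank[r] for r in responses]

# Because the attribute is ordinal, sorting by rank is meaningful.
sorted_responses = sorted(responses, key=rank.get)

print(codes)
print(sorted_responses)
```

The same trick would not make sense for a nominal attribute such as hair color, where any numeric code is arbitrary.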
Numeric/Continuous attribute

 A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
 The terms numeric attribute and continuous attribute are often used
interchangeably.
 Examples:- Temperature, Mark, Salary, Weight, Height, etc.
Basic Statistical Descriptions of Data

 Basic statistics are used to understand and analyze the data set. They also help
in data preprocessing tasks such as filling in missing values, smoothing noisy
values, and spotting outliers in the data set.

 Some basic statistical methods used for this purpose are:

 Measure of central tendency (mean, median, mode)
 Measuring data dispersion/spread of data (Range, Quartile, Variance,
Standard deviation, Interquartile Range)
Measure of Central Tendency: Mean
 The most common and effective numeric measure of the “center” of a set
of data is the (arithmetic) mean.
Median and Mode
 The median is the middle value of a list of numbers sorted in ascending or
descending order.
 If the list has an odd number of values, the median is the middle value,
with equally many values below and above it.
 If the list has an even number of values, the median is the mean of the
middle pair: add the two middle values and divide by two.
 The mode is the value that appears most often in a set of data
values.
 Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with two or
more modes is multimodal.
Measuring Skewness

Skewness measures the asymmetry of a distribution. A perfectly symmetric
distribution has a skewness of 0, so a normal distribution has a skewness of 0.
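The slide's own skewness formula is not preserved in this text, so the sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). The function name and both sample lists are illustrative.

```python
from statistics import fmean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments.

    Assumes at least two distinct values, so that m2 is not zero.
    """
    n = len(values)
    mean = fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [1, 2, 3, 4, 5]      # mirror image around the mean 3
right_tailed = [1, 2, 2, 3, 10]  # one large value stretches the right tail

print(skewness(symmetric))     # 0 for symmetric data
print(skewness(right_tailed))  # positive: longer tail on the right
```

A positive result indicates a longer right tail and a negative result a longer left tail.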
Measuring data dispersion
 These are the measures to assess the dispersion or spread of numeric
data.
 Some measures are useful in finding outlier values.
 Outlier:-
 An outlier is an observation that lies at an abnormal distance from
other values in a random sample from a population.
 Examining the data for unusual observations that lie far from the bulk
of the data helps to identify such points.

 Some measures of dispersion are Range, Quartile, Variance, Standard
deviation, Interquartile Range.
Standard deviation and dispersion
Variance = 10.24
Q. Find the variance of the variable X, given the sample values
1, 2, 2, 3, 4, 5.
Practice Python code (Mean) (Measure of central tendency)

Finding the arithmetic mean (AM), geometric mean (GM), and harmonic mean (HM). For positive values, AM >= GM >= HM.
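The original code for this slide is not preserved in the text. A minimal sketch using Python's standard `statistics` module (Python 3.8 or later) could look like this; the data values are an assumption chosen for illustration.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Illustrative positive values; GM and HM are only meaningful for positive data.
values = [1, 2, 2, 3, 4, 5]

am = mean(values)            # arithmetic mean: sum / count
gm = geometric_mean(values)  # n-th root of the product
hm = harmonic_mean(values)   # count / sum of reciprocals

print("AM:", am)
print("GM:", gm)
print("HM:", hm)
print("AM >= GM >= HM:", am >= gm >= hm)
```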

Practice Python code (Cont..) (Median) (Measure of central tendency)
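The slide's code is not preserved, so the following is only a sketch. It implements the odd/even rule for the median by hand and compares it with `statistics.median`; the function name and sample lists are assumptions.

```python
from statistics import median


def median_by_hand(values):
    """Median via the odd/even rule: middle value, or mean of the middle pair."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of middle pair


odd_count = [7, 3, 5]       # sorted: 3, 5, 7
even_count = [7, 3, 5, 9]   # sorted: 3, 5, 7, 9

print(median_by_hand(odd_count), median(odd_count))
print(median_by_hand(even_count), median(even_count))
```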
Practice Python code (Cont..) (Mode) (Measure of central tendency)
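The slide's code is not preserved. One possible sketch uses `statistics.multimode` (Python 3.8 or later), which returns every most-frequent value and so handles unimodal and multimodal data alike. The sample lists are illustrative.

```python
from statistics import multimode

unimodal = [1, 2, 2, 3, 4, 5]  # 2 appears most often
bimodal = [1, 1, 2, 3, 3]      # 1 and 3 tie for most frequent

modes_uni = multimode(unimodal)
modes_bi = multimode(bimodal)

print(modes_uni)
print(modes_bi)
```

The length of the returned list tells whether the data are unimodal, bimodal, and so on.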
Practice Python code (Cont..) (Std. Dev, Variance) (Measure of Dispersion)
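The slide's code and formula are not preserved, so it is unclear whether the material divides by n (population) or by n - 1 (sample). The sketch below therefore computes both for the values 1, 2, 2, 3, 4, 5 from the earlier exercise, using the standard `statistics` module.

```python
from statistics import pstdev, pvariance, stdev, variance

x = [1, 2, 2, 3, 4, 5]

population_variance = pvariance(x)  # sum of squared deviations / n
sample_variance = variance(x)       # sum of squared deviations / (n - 1)
population_std = pstdev(x)          # square root of the population variance
sample_std = stdev(x)               # square root of the sample variance

print("population variance:", population_variance)
print("sample variance:", sample_variance)
print("population std. dev.:", population_std)
print("sample std. dev.:", sample_std)
```

For this data the sum of squared deviations from the mean 17/6 is 65/6, so the population variance is 65/36 and the sample variance is 13/6.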
Measure of Dispersion (Cont..)
 Range:- The range of the set is the difference between the largest (max()) and
smallest (min()) values.

 Quantiles:- Quantiles are points taken at regular intervals of a data distribution,
dividing it into essentially equal size consecutive sets.
 The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
 The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution. They
are more commonly referred to as quartiles. (Points are Q1,Q2,Q3)
 The 100-quantiles are more commonly referred to as percentiles; they divide
the data distribution into 100 equal-sized consecutive sets.
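As a rough sketch of these definitions, `statistics.quantiles` (Python 3.8 or later) returns the n - 1 cut points of the n-quantiles. The data values are made up, and the function's default interpolation rule is only one of several in use, so other tools may report slightly different cut points.

```python
from statistics import median, quantiles

data = [2, 4, 6, 8, 10, 12, 14, 16]  # illustrative values

value_range = max(data) - min(data)         # largest minus smallest

two_quantile = quantiles(data, n=2)         # 1 cut point: the median
quartile_points = quantiles(data, n=4)      # 3 cut points: Q1, Q2, Q3
percentile_points = quantiles(data, n=100)  # 99 cut points

print("range:", value_range)
print("2-quantile:", two_quantile, "median:", median(data))
print("quartiles:", quartile_points)
print("number of percentile cut points:", len(percentile_points))
```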
Measure of Dispersion (Cont..)
 The quartiles give an indication of a distribution’s center, spread, and
shape.
 The first quartile, denoted by Q1, is the 25th percentile. It cuts off the
lowest 25% of the data.
 The third quartile, denoted by Q3, is the 75th percentile—it cuts off
the lowest 75% (or highest 25%) of the data.
 The second quartile is the 50th percentile. As the median, it gives the
center of the data distribution.
 The distance between the first and third quartiles is a simple measure of
spread that gives the range covered by the middle half of the data. This
distance is called the interquartile range (IQR) and is defined as

IQR = Q3 − Q1
Example
Q. Find the range and interquartile range of the set {3, 7, 8, 5, 12, 14, 21, 13, 18}.

 First, we write the data in increasing order: 3, 5, 7, 8, 12, 13, 14, 18, 21.
 range = max – min = 21 – 3 = 18.
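 The median is 12. Q1 is the median of the lower half (3, 5, 7, 8), so Q1 = (5 + 7)/2 = 6. Q3 is the median of the upper half (13, 14, 18, 21), so Q3 = (14 + 18)/2 = 16.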
 Therefore, the interquartile range (IQR) = Q3 – Q1 = 16 – 6 = 10.
 The range is 18 and the interquartile range is 10.
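The worked example can be checked with a short sketch. It assumes the median-of-halves rule that the example's numbers imply (for an odd count, the middle value is left out of both halves). The function name is hypothetical, and library routines with a different interpolation rule may return other quartile values.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule.

    For an odd count the middle value is excluded from both halves.
    Assumes at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)
iqr = q3 - q1

print("Q1, Q2, Q3:", q1, q2, q3)
print("range:", data_range)
print("IQR:", iqr)
```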
Five number summary and Outlier detection
 The five-number summary of a data set consists of the five numbers determined
by computing the minimum, Q1 , median, Q3 , and maximum of the data set.

Q. Find the five-number summary for the data set {3, 7, 8, 5, 12, 14, 35,
13, 18}.

 Sorted: 3, 5, 7, 8, 12, 13, 14, 18, 35
 Q1 = 6, Median = 12, Q3 = 16, so IQR = Q3 - Q1 = 10 and 1.5*(Q3-Q1) = 15.
 Lower Fence = Q1 - 1.5*(IQR) = 6 - 15 = -9
 Upper Fence = Q3 + 1.5*(IQR) = 16 + 15 = 31
 Values < lower fence and values > upper fence are considered outliers
and should be removed from the data set.
 35 > 31, so 35 is removed as an outlier, leaving 3, 5, 7, 8, 12, 13, 14, 18.

The five-number summary is:
Minimum: 3, Q1: 6, Median: 12, Q3: 16, Maximum: 18
Algorithm
 Find Q1, Q2, and Q3 of the given set of values, say S.
 Find the IQR, the Lower Fence (Q1-1.5*IQR), and the Upper Fence (Q3+1.5*IQR).
 Remove from S the points that lie above the Upper Fence or below the Lower
Fence (outlier removal).
 Set the min_val as minimum of S.
 Set the max_val as maximum of S.
 Return min_val, Q1,Q2,Q3, and max_val. (Five number summary)
Five number summary (Python Code)
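The code shown on this slide is not preserved in the text. The following is a sketch of the algorithm as described, assuming the median-of-halves quartile rule used in the worked example; the function name is hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Five-number summary after removing points outside the 1.5*IQR fences.

    Quartiles are computed on the full data with the median-of-halves rule;
    min and max are taken after outlier removal. Assumes at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2

    q1 = median(ordered[:half])
    q2 = median(ordered)
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])

    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Keep only the points inside the fences (outlier removal).
    kept = [x for x in ordered if lower_fence <= x <= upper_fence]

    return min(kept), q1, q2, q3, max(kept)


summary = five_number_summary([3, 7, 8, 5, 12, 14, 35, 13, 18])
print(summary)
```

For the example data the value 35 lies above the upper fence of 31, so the reported maximum is 18.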
