Knowing The Data Set

Summer Training cum Internship Program
on
Machine Learning and Deep Learning

Organized By:
Department of Computer Science & Engineering
Parala Maharaja Engineering College,
Berhampur, Odisha, 761003
Knowing the dataset

Presenter:
Dr. Debasis Mohapatra
Assistant Professor
Dept. of CSE, PMEC, BAM, 761003
Overall approach of Machine Learning

[Figure: overall machine learning workflow. Source: KDnuggets, “Text Data Preprocessing: A Walkthrough in Python”]
Data objects and attributes
 Data sets are made up of data objects. When stored in a database, data objects are
referred to as data tuples (rows); they are also called samples.
 Examples:-
 In a medical database, the objects may be patients.
 In a university database, the objects may be students, professors, etc.

 Data objects are typically described by attributes, which correspond to the columns of
the data set. Attributes are also known as features, dimensions, or variables.
 Examples:- Attributes of a Student object may be “Name”, “Subject_Name”,
“Total_Marks”, “Obtained_Marks”, etc.
What is an attribute?
 An attribute is a data field, representing a characteristic or feature of a data
object.
 The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
 The term dimension is commonly used in data warehousing.
 Machine learning literature tends to use the term feature.
 Statisticians prefer the term variable.
 Data mining and database professionals commonly use the term attribute.
 A set of attributes used to describe a given object is called an attribute vector (or
feature vector).
 The distribution of data involving one attribute (or variable) is called univariate. A
bivariate distribution involves two attributes, and so on. In general, a distribution
with two or more variables is multivariate.
Different types of attributes
 The type of an attribute is determined by the set of possible
values it can take. An attribute can be one of the following types.

 Nominal (Qualitative)
 Binary (Qualitative)
 Ordinal (Qualitative)
 Numeric (Quantitative)
Nominal/Categorical attribute
 Nominal means “relating to names”. The values of a nominal attribute are
symbols or names of things.
 Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
 The values do not have any meaningful order.
 Examples:-
 Possible values for attribute “hair color” are black, brown, blond, red, gray,
and white.
 The attribute marital status can take on the values single, married, divorced,
and widowed.
Binary Attribute

 A binary attribute is a nominal attribute with only two categories or
states: 0 or 1.
 Examples:-
 Given the attribute smoker describing a patient object, 1 indicates that
the patient smokes, while 0 indicates that the patient does not.
 Similarly, suppose the patient undergoes a medical test that has two
possible outcomes. The attribute medical test is binary, where a value
of 1 means the result of the test for the patient is positive, while 0
means the result is negative.
A Binary attribute can be Symmetric or Asymmetric
 A binary attribute is symmetric if both of its states are equally valuable and carry
the same weight; that is, there is no preference on which outcome should be
coded as 0 or 1.
 One such example could be the attribute gender having the states male and
female.

 A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for HIV.
 By convention, we code the most important outcome by 1 (e.g., HIV positive)
and the other by 0 (e.g., HIV negative).
Ordinal Attribute

 An ordinal attribute is an attribute with possible values that have a meaningful
order or ranking among them.
 Examples:-
 Grade (e.g., A+, A, A−, B+, and so on)
 Customer satisfaction review: 0: very dissatisfied, 1: somewhat
dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied
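The ordering of an ordinal attribute can be made explicit in code. The following is a minimal sketch based on the customer satisfaction example; the names `SATISFACTION_LEVELS`, `RANK`, and `reviews` are illustrative assumptions, not part of the original slides.

```python
# Ordinal scale from the customer satisfaction example (lowest to highest).
SATISFACTION_LEVELS = [
    "very dissatisfied",
    "somewhat dissatisfied",
    "neutral",
    "satisfied",
    "very satisfied",
]

# Map each label to its rank (0..4) so that comparisons respect the order.
RANK = {label: rank for rank, label in enumerate(SATISFACTION_LEVELS)}

# Hypothetical sample of reviews, sorted by rank rather than alphabetically.
reviews = ["neutral", "very satisfied", "somewhat dissatisfied", "satisfied"]
sorted_reviews = sorted(reviews, key=RANK.get)

print(sorted_reviews)
print(RANK["satisfied"] > RANK["neutral"])  # True: the order is meaningful
```

The ranks only encode order; the difference between two ranks is not necessarily a meaningful quantity.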
Numeric/Continuous attribute

 A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values.
 The terms numeric attribute and continuous attribute are often used
interchangeably.
 Examples:- Temperature, Mark, Salary, Weight, Height, etc.
Basic Statistical Descriptions of Data

 Basic statistics are used to understand and analyze the data set. They also help
in data preprocessing tasks such as filling in missing values, smoothing noisy
values, and spotting outliers in the data set.

 Some basic statistical methods used for this purpose are:


 Measure of central tendency (mean, median, mode)
 Measuring data dispersion/spread of data (Range, Quartile, Variance,
Standard deviation, Interquartile Range)
Measure of Central Tendency: Mean
 The most common and effective numeric measure of the “center” of a set
of data is the (arithmetic) mean: the sum of all values divided by the number of values.
Median and Mode
 The median is the middle number in a list of numbers sorted in ascending or
descending order.
 If the list has an odd number of values, the median is the value
in the middle, with the same number of values below and
above it.
 If the list has an even number of values, the median is the average of the
middle pair: the two middle values are added together and divided by two.
 The mode is the value that appears most often in a set of data
values.
 Data sets with one, two, or three modes are respectively called
unimodal, bimodal, and trimodal. In general, a data set with two or
more modes is multimodal.
Measuring Skewness

Skewness is a measure of the asymmetry of a distribution. A symmetrical
dataset has a skewness equal to 0, so a normal distribution has
a skewness of 0.
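As a minimal sketch, skewness can be computed as the moment coefficient m3 / m2^1.5, where m2 and m3 are the second and third central moments. This particular formula and the sample lists below are assumptions chosen for illustration; other skewness definitions exist.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # symmetric around 3
right_skewed = [1, 2, 2, 3, 10]  # long tail to the right

print(skewness(symmetric))     # 0.0
print(skewness(right_skewed))  # a positive value
```

A positive result indicates a longer right tail, a negative result a longer left tail.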
Measuring data dispersion
 These are the measures to assess the dispersion or spread of numeric
data.
 Some measures are useful in finding outlier values.
 Outlier:-
 An outlier is an observation that lies at an abnormal distance from
other values in a random sample from a population.
 Outliers are found by examining the data for unusual observations that are far
removed from the mass of the data.

 Some measures of dispersion are Range, Quartile, Variance, Standard
deviation, Interquartile Range.
Standard deviation and dispersion
[Formula and worked example not recoverable from the slide; its result: Variance = 10.24]
Q. Find the variance of the data for variable X, where the sampled values are
1, 2, 2, 3, 4, 5.
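The question can be checked with Python's standard `statistics` module. The slide does not say whether the population or the sample variance is intended, so this sketch computes both; the variable names are illustrative.

```python
import statistics

x = [1, 2, 2, 3, 4, 5]

mean = statistics.mean(x)                      # 17 / 6
population_variance = statistics.pvariance(x)  # divides by N
sample_variance = statistics.variance(x)       # divides by N - 1

print(round(mean, 4))                 # 2.8333
print(round(population_variance, 4))  # 1.8056
print(round(sample_variance, 4))      # 2.1667
```

Neither value equals the 10.24 shown earlier on this slide, which presumably belongs to a different data set.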
Practice Python code (Mean) (Measure of central tendency)

Finding AM, GM, HM (AM >= GM >= HM for positive values)
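A minimal sketch of the three means using the standard `statistics` module (Python 3.8 or later for `geometric_mean`). The sample values are an assumption chosen for illustration.

```python
import statistics

values = [2, 4, 8]  # illustrative positive values

am = statistics.mean(values)            # arithmetic mean: (2 + 4 + 8) / 3
gm = statistics.geometric_mean(values)  # geometric mean: cube root of 2 * 4 * 8
hm = statistics.harmonic_mean(values)   # harmonic mean: 3 / (1/2 + 1/4 + 1/8)

print(am, gm, hm)
print(am >= gm >= hm)  # True
```

The three means coincide only when all values are equal.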


Practice Python code (Cont..)(Median)
(Measure of Central tendency)
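A minimal sketch of the median for odd and even list lengths, using `statistics.median`. The sample lists are assumptions chosen for illustration.

```python
import statistics

odd_count = [7, 3, 5, 12, 8]       # sorted: 3, 5, 7, 8, 12
even_count = [7, 3, 5, 12, 8, 13]  # sorted: 3, 5, 7, 8, 12, 13

median_odd = statistics.median(odd_count)    # the single middle value
median_even = statistics.median(even_count)  # average of the middle pair (7, 8)

print(median_odd)   # 7
print(median_even)  # 7.5
```

The function sorts a copy of the data internally, so the input does not need to be sorted first.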
Practice Python code (Cont..) (Mode) (Measure of
central tendency)
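A minimal sketch of the mode using the standard `statistics` module. `multimode` (Python 3.8 or later) returns every most frequent value, which is useful for bimodal data; the sample lists are assumptions chosen for illustration.

```python
import statistics

unimodal = [1, 2, 2, 3, 4, 5]
bimodal = [1, 1, 2, 3, 3]

single_mode = statistics.mode(unimodal)     # the most frequent value
all_modes = statistics.multimode(unimodal)  # list with one mode
two_modes = statistics.multimode(bimodal)   # list with two modes

print(single_mode)  # 2
print(all_modes)    # [2]
print(two_modes)    # [1, 3]
```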
Practice Python code (Cont..)
(Std. Dev, Variance) (Measure of Dispersion)
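A minimal sketch that computes the population variance and standard deviation directly from their definitions and compares the result with the `statistics` module. The function name `population_variance` and the sample data are assumptions chosen for illustration.

```python
import math
import statistics

def population_variance(values):
    """Average squared deviation from the mean (divides by N)."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

marks = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5

var = population_variance(marks)
std = math.sqrt(var)

print(var)  # 4.0
print(std)  # 2.0
print(var == statistics.pvariance(marks))  # True
```

For a sample rather than a whole population, `statistics.variance` and `statistics.stdev` divide by N - 1 instead of N.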
Measure of Dispersion (Cont..)
 Range:- The range of the set is the difference between the largest (max()) and
smallest (min()) values.

 Quantiles:- Quantiles are points taken at regular intervals of a data distribution,
dividing it into essentially equal-size consecutive sets.
 The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
 The 4-quantiles are the three data points that split the data distribution into
four equal parts; each part represents one-fourth of the data distribution. They
are more commonly referred to as quartiles. (Points are Q1,Q2,Q3)
 The 100-quantiles are more commonly referred to as percentiles; they divide
the data distribution into 100 equal-sized consecutive sets.
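A minimal sketch of 2-quantiles, quartiles, and percentiles using `statistics.quantiles` (Python 3.8 or later). The data set is an assumption chosen for illustration. The function interpolates between data points, and other quantile methods can give slightly different cut points.

```python
import statistics

data = list(range(1, 101))  # illustrative data: the integers 1..100

median_cut = statistics.quantiles(data, n=2)     # 1 cut point: the median
quartiles = statistics.quantiles(data, n=4)      # 3 cut points: Q1, Q2, Q3
percentiles = statistics.quantiles(data, n=100)  # 99 cut points

print(len(median_cut), len(quartiles), len(percentiles))  # 1 3 99
print(median_cut[0] == statistics.median(data))           # True
print(quartiles[1] == statistics.median(data))            # True: Q2 is the median
```

In general, the n-quantiles are n - 1 cut points that split the data into n parts.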
Measure of Dispersion (Cont..)
 The quartiles give an indication of a distribution’s center, spread, and
shape.
 The first quartile, denoted by Q1, is the 25th percentile. It cuts off the
lowest 25% of the data.
 The third quartile, denoted by Q3, is the 75th percentile—it cuts off
the lowest 75% (or highest 25%) of the data.
 The second quartile is the 50th percentile. As the median, it gives the
center of the data distribution.
 The distance between the first and third quartiles is a simple measure of
spread that gives the range covered by the middle half of the data. This
distance is called the interquartile range (IQR) and is defined as

IQR = Q3 − Q1
Example
Q. Find the range and interquartile range of the set {3, 7, 8, 5, 12, 14, 21, 13, 18}.

 First, we write the data in increasing order: 3, 5, 7, 8, 12, 13, 14, 18, 21.
 range = max – min = 21 – 3 = 18.
 Q1 = 6 (the median of the lower half 3, 5, 7, 8) and Q3 = 16 (the median of the upper half 13, 14, 18, 21).
 Therefore, the interquartile range (IQR) = Q3 – Q1 = 16 – 6 = 10.
 The range is 18 and the interquartile range is 10.
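The example can be reproduced with a short sketch. It assumes the median-of-halves method for quartiles (the median itself is left out of both halves when the count is odd), which matches the values in the example; the helper name `quartiles` is illustrative. Library routines that interpolate differently may return other quartile values for the same data.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # For an odd count, skip the middle element when building the upper half.
    upper = s[half + 1:] if n % 2 else s[half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]
q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)
iqr = q3 - q1

print(data_range, q1, q3, iqr)  # 18 6.0 16.0 10.0
```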
Five number summary and Outlier detection
 The five-number summary of a data set consists of the five numbers determined
by computing the minimum, Q1 , median, Q3 , and maximum of the data set.

Q. Find the five-number summary for the data set {3, 7, 8, 5, 12, 14, 35, 13, 18}.

 Sorted: 3, 5, 7, 8, 12, 13, 14, 18, 35.
 Q1 = 6, median = 12, Q3 = 16, so IQR = Q3 − Q1 = 10 and 1.5 × IQR = 15.
 Lower Fence = Q1 − 1.5 × IQR = 6 − 15 = −9.
 Upper Fence = Q3 + 1.5 × IQR = 16 + 15 = 31.
 Values below the lower fence or above the upper fence are considered outliers
and should be removed from the data set.
 Here 35 > 31, so 35 is an outlier and is removed, leaving 3, 5, 7, 8, 12, 13, 14, 18.

The five-number summary is:
Minimum: 3, Q1: 6, Median: 12, Q3: 16, Maximum: 18
Algorithm
 Find Q1, Q2, Q3 from the given set of values, say S.
 Find IQR, Lower Fence (Q1-1.5*IQR), and Upper Fence (Q3+1.5*IQR).
 Remove the points above the Upper Fence or below the Lower Fence
(outlier removal), and update S accordingly.
 Set the min_val as minimum of S.
 Set the max_val as maximum of S.
 Return min_val, Q1,Q2,Q3, and max_val. (Five number summary)
Five number summary (Python Code)
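A minimal sketch of the algorithm above, assuming quartiles are computed with the median-of-halves method used in the worked example. The function names `quartiles` and `five_number_summary` are illustrative.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    upper = s[half + 1:] if n % 2 else s[half:]
    return statistics.median(lower), statistics.median(s), statistics.median(upper)

def five_number_summary(values):
    """Return ((min, Q1, Q2, Q3, max), outliers) after fence-based outlier removal."""
    q1, q2, q3 = quartiles(values)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Keep only the points that lie within the fences.
    kept = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return (min(kept), q1, q2, q3, max(kept)), outliers

data = [3, 7, 8, 5, 12, 14, 35, 13, 18]
summary, outliers = five_number_summary(data)

print(summary)   # (3, 6.0, 12, 16.0, 18)
print(outliers)  # [35]
```

As in the algorithm, the quartiles are computed once on the original set and are not recomputed after the outliers are removed.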
