Knowing The Data Set
Program on Machine Learning and Deep Learning
Organized By:
Department of Computer Science & Engineering
Parala Maharaja Engineering College,
Berhampur, Odisha, 761003
Knowing the dataset
Presenter:
Dr. Debasis Mohapatra
Assistant Professor
Dept. of CSE, PMEC, BAM, 761003
Overall approach of Machine Learning
Figure source: KDnuggets, "Text Data Preprocessing: A Walkthrough in Python"
Data objects and attributes
Data sets are made up of data objects. In a database, data objects are stored
as rows, also referred to as data tuples or samples.
Examples:-
In a medical database, the objects may be patients.
In a university database, the objects may be students, professors, etc.
Data objects are typically described by attributes, which correspond to the
columns of the data set. Attributes are also known as features, dimensions, or
variables of the data set.
Examples:- Attributes of a Student object may be “Name”, “Subject_Name”,
“Total_Marks”, “Obtained_Marks”, etc.
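The row/column view above can be sketched in Python. This is only an illustration: the attribute names come from the Student example, while the records and their values are made up.

```python
# Each dict is one data object (a row/tuple/sample); each key is an attribute (a column).
# The student records below are hypothetical values used only for illustration.
students = [
    {"Name": "A", "Subject_Name": "Maths", "Total_Marks": 100, "Obtained_Marks": 82},
    {"Name": "B", "Subject_Name": "Physics", "Total_Marks": 100, "Obtained_Marks": 67},
]

# The attributes are the column names shared by every data object.
attributes = list(students[0].keys())

# A single column: the values of one attribute across all data objects.
obtained = [s["Obtained_Marks"] for s in students]

print(attributes)
print(obtained)
```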
What is an attribute?
An attribute is a data field, representing a characteristic or feature of a data
object.
The nouns attribute, dimension, feature, and variable are often used
interchangeably in the literature.
The term dimension is commonly used in data warehousing.
Machine learning literature tends to use the term feature.
Statisticians prefer the term variable.
Data mining and database professionals commonly use the term attribute.
A set of attributes used to describe a given object is called an attribute vector (or
feature vector).
The distribution of data involving one attribute (or variable) is called univariate. A
bivariate distribution involves two attributes, and so on. In general, a distribution
with two or more variables is multivariate.
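A minimal sketch of an attribute (feature) vector, assuming a hypothetical student object; the names and values are illustrative only.

```python
# Hypothetical data object; the values are assumptions made for this sketch.
student = {"Name": "A", "Total_Marks": 100, "Obtained_Marks": 82}

# The attributes chosen to describe the object, in a fixed order.
feature_names = ["Total_Marks", "Obtained_Marks"]

# The attribute (feature) vector holds the object's values for those attributes.
feature_vector = [student[name] for name in feature_names]

# Two attributes are involved, so a distribution over them would be bivariate.
n_attributes = len(feature_vector)

print(feature_vector, n_attributes)
```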
Different types of attributes
The type of an attribute is determined by the set of possible
values it can take. An attribute can be one of the following types.
Nominal (Qualitative)
Binary (Qualitative)
Ordinal (Qualitative)
Numeric (Quantitative)
Nominal/Categorical attribute
Nominal means “relating to names”. The values of a nominal attribute are
symbols or names of things.
Each value represents some kind of category, code, or state, and so nominal
attributes are also referred to as categorical.
The values do not have any meaningful order.
Examples:-
Possible values for attribute “hair color” are black, brown, blond, red, gray,
and white.
The attribute marital status can take on the values single, married, divorced,
and widowed.
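Because nominal values have no meaningful order, it makes sense to count them but not to average or sort them. The sketch below counts the values of a nominal attribute and encodes each one as a 0/1 indicator vector (one-hot encoding, a common technique that is not part of the slide itself). The observations are hypothetical.

```python
from collections import Counter

# Hypothetical observations of the nominal attribute "hair color".
hair_color = ["black", "brown", "black", "red", "blond", "black"]

# Counting is meaningful for nominal data; the most frequent value is the mode.
counts = Counter(hair_color)
mode_value = counts.most_common(1)[0][0]

# The categories are sorted only to get a reproducible column order;
# the order itself carries no meaning for a nominal attribute.
categories = sorted(set(hair_color))

def one_hot(value):
    """Return a 0/1 indicator vector with a single 1 at the value's category."""
    return [1 if value == c else 0 for c in categories]

print(mode_value)
print(categories)
print(one_hot("brown"))
```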
Binary Attribute
A binary attribute is a nominal attribute with only two states: 0 or 1.
A binary attribute is asymmetric if the outcomes of the states are not equally
important, such as the positive and negative outcomes of a medical test for HIV.
By convention, we code the most important outcome by 1 (e.g., HIV positive)
and the other by 0 (e.g., HIV negative).
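The coding convention above can be sketched as follows. The list of test results is hypothetical and the labels "positive"/"negative" are assumed for illustration.

```python
# Hypothetical outcomes of a medical test (assumed labels).
results = ["negative", "positive", "negative", "negative"]

# Asymmetric binary coding: the more important outcome (positive) is coded 1,
# the other outcome (negative) is coded 0.
encoded = [1 if r == "positive" else 0 for r in results]

# With this coding, summing the column counts the important outcomes.
positives = sum(encoded)

print(encoded, positives)
```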
Ordinal Attribute
Basic statistics are used to understand and analyze the data set. They are also
helpful in data preprocessing tasks such as filling in missing values, smoothing
noisy values, and spotting the outliers present in the data set.
IQR = Q3 − Q1
Example
Q. Find the range and interquartile range of the set {3, 7, 8, 5, 12, 14, 21, 13, 18}.
First, we write the data in increasing order: 3, 5, 7, 8, 12, 13, 14, 18, 21.
range = max – min = 21 – 3 = 18.
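The median is 12. Q1 is the median of the lower half {3, 5, 7, 8}, so Q1 = (5 + 7)/2 = 6.
Q3 is the median of the upper half {13, 14, 18, 21}, so Q3 = (14 + 18)/2 = 16.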
Therefore, the interquartile range (IQR) = Q3 – Q1 = 16 – 6 = 10.
The range is 18 and the interquartile range is 10.
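The same calculation can be sketched in Python. The helper `quartiles` is a hypothetical function written for this example. It assumes the median-of-halves convention used above, where the middle value is left out of both halves when the number of values is odd. Other quartile conventions exist and may give slightly different results.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Assumes at least two values. When the number of values is odd,
    the middle value is excluded from both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return median(lower), median(data), median(upper)

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]

q1, q2, q3 = quartiles(data)
data_range = max(data) - min(data)   # range = max - min
iqr = q3 - q1                        # IQR = Q3 - Q1

print("range =", data_range)
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
```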
Five number summary and Outlier detection
The five-number summary of a data set consists of the five numbers determined
by computing the minimum, Q1, median, Q3, and maximum of the data set.
Find the five-number summary for the data set {3, 7, 8, 5, 12, 14, 35,
13, 18}.
Sorted: 3, 5, 7, 8, 12, 13, 14, 18, 35
Five-number summary: minimum = 3, Q1 = 6, median = 12, Q3 = 16, maximum = 35.
IQR = Q3 – Q1 = 16 – 6 = 10, so 1.5*(Q3 – Q1) = 15.
Lower Fence = Q1 – 1.5*(IQR) = 6 – 15 = –9
Upper Fence = Q3 + 1.5*(IQR) = 16 + 15 = 31
Values < lower fence and values > upper fence are considered as outliers and
should be removed from the data set. Here, 35 exceeds the upper fence of 31,
so 35 is an outlier.
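The five-number summary and the fence rule can be sketched in Python. The functions `five_number_summary` and `find_outliers` are hypothetical helpers written for this example. They assume the median-of-halves quartile convention (middle value excluded from both halves when the count is odd) and the 1.5*IQR multiplier used above.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max); assumes at least two values."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return data[0], median(lower), median(data), median(upper), data[-1]

def find_outliers(values, k=1.5):
    """Return (lower fence, upper fence, outliers) using the k*IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # A value is an outlier if it lies below the lower fence or above the upper fence.
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return lower_fence, upper_fence, outliers

data = [3, 7, 8, 5, 12, 14, 35, 13, 18]

summary = five_number_summary(data)
lower_fence, upper_fence, outliers = find_outliers(data)

# Drop the flagged values, keeping the original order of the rest.
cleaned = [v for v in data if v not in outliers]

print("summary =", summary)
print("fences =", lower_fence, upper_fence)
print("outliers =", outliers)
```

Removing the flagged values follows the rule stated above; in practice one may prefer to inspect them before deleting.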