Ch01 ICS422 04
Faculty Development Programme on Emerging Trends in Machine Learning and Deep Learning Techniques
Data Preprocessing Techniques
Roadmap
• Data Visualization
Know Your Data
Types of Data Sets: (1) Record Data
• Relational records
• Relational tables, highly structured
• Data matrix, e.g. numerical matrix, crosstabs
Term-frequency (document-term) matrix:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

• Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
• Graph and network data, e.g. a transportation network, molecular structures
• Spatio-temporal data, e.g. measurements (precipitation, temperature) recorded at a variety of geographical locations
• Image data
Data Types From A Machine Learning
Perspective
Text
Text data consists of words; it must be converted into a numerical representation (e.g. a document-term matrix) before most machine learning algorithms can use it.
• Descriptive
• Describes the basic features of the data with simple summaries
• Data analysis
• Analyze the previous year's sales data to find interesting insights
• Investment made by the company
• Profit %, etc.
• Uses numerical measures
• Inferential
• Draws inferences and predictions about the larger population from which the sample was drawn
• Data Science
• Uses data to find insights and inferences
• Suggests strategies to the company to increase its profit
Basic Statistical Descriptions of Data
• What
• Measure of central tendency
• Mean, Median and mode
• Location of the centre of a data distribution
• Where do most of the attribute values fall?
• Dispersion measures
• Range, quartiles, interquartile range, five-number summary and box plots, variance and standard deviation
• These describe how the data are spread out.
Descriptive Statistics
Measuring Central Tendency
Mode
• Value that occurs
most frequently in the
data
• Unimodal, bimodal,
multimodal
Contd…
The Mode : The most frequently occurring score
- Unimodal distribution has only one major peak
- Bimodal distribution has two major peaks
Median
• The median is simply the middle value of your observations when they are ordered from smallest to largest.
• Finding it involves two steps: sort the observations, then take the middle value (or the average of the two middle values when the count is even).
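As a quick illustration, all three measures of central tendency can be computed with Python's standard library; the weights list is the hypothetical sample used again in the dispersion example later:

```python
from statistics import mean, median, mode

# Hypothetical sample: weights of 10 males (also used in the dispersion slide)
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

print(mean(weights))    # 62.1
print(median(weights))  # even count: average of 60 and 61 -> 60.5
print(mode(weights))    # 56 occurs most frequently
```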
Negative Skew
Positive Skew
Properties of Normal Distribution
Curve
(Figure: the width of the curve represents the dispersion, or spread, of the data.)
• The normal (distribution) curve
• From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of them
• From μ–3σ to μ+3σ: contains about 99.7% of them
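The 68–95–99.7 rule can be verified numerically with the standard library's NormalDist; this is a sketch against the standard normal curve, not tied to any particular dataset:

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)  # standard normal curve

# Probability mass within +/- k standard deviations of the mean
for k in (1, 2, 3):
    p = nd.cdf(k) - nd.cdf(-k)
    print(f"within ±{k}σ: {p:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```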
Measuring the Dispersion of Data

Range
• The difference between the largest and the smallest data item.

Variance
• Variance signifies how much the data items deviate from the mean.
• A larger variance means the data items deviate more from the mean; a smaller variance means they are closer to the mean.

Let's look at a very simple data set representing the weights of 10 males:
55, 56, 56, 58, 60, 61, 63, 64, 70, 78 (mean = 62.1)

Range = 78 – 55 = 23

Variance = [(55–62.1)² + (56–62.1)² + (56–62.1)² + (58–62.1)² + (60–62.1)² + (61–62.1)² + (63–62.1)² + (64–62.1)² + (70–62.1)² + (78–62.1)²] / 9 = 466.9 / 9 = 51.88

Standard deviation
• The square root of the variance. In the formula above, σ is the standard deviation and σ² is the variance.
• Std dev = sqrt(51.88) = 7.20
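The worked example above maps directly onto the standard library's sample-statistics functions, which also divide by n − 1:

```python
from statistics import variance, stdev  # sample statistics: divide by n - 1

weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

data_range = max(weights) - min(weights)  # 78 - 55 = 23
var = variance(weights)                   # sum of squared deviations / 9
sd = stdev(weights)                       # square root of the variance
print(data_range, round(var, 2), round(sd, 2))  # 23 51.88 7.2
```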
Measuring the Dispersion of
Data
Correlation and Covariance
• Measures how two variables vary with respect to each other
Positive covariance signifies that the higher values of one variable correspond
with the higher values of the other variable, and similarly for the lower ones.
Negative covariance, on the other hand, signifies that the higher values of one
variable correspond to the lower values of the other
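A minimal sketch of the sign convention; the hours/score pairs below are invented for illustration:

```python
def covariance(x, y):
    """Sample covariance: average product of deviations from the two means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

hours = [1, 2, 3, 4, 5]       # hypothetical hours studied
score = [52, 55, 61, 70, 77]  # hypothetical exam scores, rising with hours

print(covariance(hours, score))        # positive: high hours pair with high scores
print(covariance(hours, score[::-1]))  # negative: high hours pair with low scores
```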
Measuring the Dispersion of
Data
Why percentiles?
A percentile gives the relative position of a particular value within the dataset. If we are interested in relative positions, the mean and standard deviation do not make sense. In the case of exam scores, we do not know whether it was a difficult exam on which 7 points out of 20 was an amazing score. Here the raw scores by themselves are meaningless, but the percentile reflects everything. For example, GRE and GMAT scores are reported together with percentile ranks.
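A minimal sketch of percentile rank; the exam scores below are invented for illustration:

```python
def percentile_rank(scores, value):
    """Percent of observations that are less than or equal to `value`."""
    return 100.0 * sum(s <= value for s in scores) / len(scores)

# Hypothetical results of a difficult exam scored out of 20
exam = [2, 3, 3, 4, 5, 5, 6, 6, 7, 12]
print(percentile_rank(exam, 7))  # 7/20 looks low, yet it beats or ties 90% here
```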
Characteristics of Range
• Rarely used
• It is a crude measure
• Highly susceptible to outliers
• Used mostly with nominal or ordinal data
• The interquartile range (IQR) overcomes these limitations to some extent
Measures of dispersion - Interquartile Range (IQR)
• The difference between the third quartile and the first
quartile.
IQR=Q3−Q1
Why IQR?
The interquartile range is a better option than the range because it is not affected by outliers: it excludes them by focusing only on the spread of the middle 50% of the data.
Interquartile Range
• Quartiles divide the data into four equal sections (25% each)
• Interquartile range: the range of the middle 50% of the data
- Measures the variability between the 1st and 3rd quartiles
- Variability in the middle half of the data
- Describes spread at the centre of the data
- Not largely affected by outliers
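Using the standard library on the weights sample from the earlier slides (quantiles' default "exclusive" method interpolates at positions k(n+1)/4):

```python
import statistics

weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# n=4 cut points -> Q1, median, Q3 (default 'exclusive' interpolation)
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 56.0 65.5 9.5
```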
Calculating Interquartile Range
Sort the data, take Q1 (the 25th percentile) and Q3 (the 75th percentile), and compute IQR = Q3 − Q1.
MinMax Scaler
X_norm = (X − min(X)) / (max(X) − min(X)) × (NewMax − NewMin) + NewMin
Age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
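A minimal sketch of the formula applied to the Age list in pure Python; sklearn's MinMaxScaler computes the same with NewMin=0 and NewMax=1 by default:

```python
def min_max_scale(xs, new_min=0.0, new_max=1.0):
    """X_norm = (X - min(X)) / (max(X) - min(X)) * (NewMax - NewMin) + NewMin"""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
scaled = min_max_scale(age)
print([round(v, 3) for v in scaled])  # 19.4 maps to 0.0, 44.9 maps to 1.0
```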
One Hot Encoding
• It takes a column which has categorical data, which
has been label encoded, and then splits the column
into multiple columns
• For k distinct values, we can transform the feature into a k-dimensional vector with a single 1 and 0s in the remaining positions.
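A minimal sketch of that k-dimensional mapping in pure Python; the colour values are invented for illustration:

```python
def one_hot(values):
    """Encode each of the k distinct values as a k-dim vector with a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values], categories

colors = ["green", "red", "blue", "green"]  # hypothetical categorical feature
vectors, cats = one_hot(colors)
print(cats)     # ['blue', 'green', 'red']
print(vectors)  # exactly one 1 per row
```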
Label Binarizer
• The LabelBinarizer class performs one-hot encoding in a single step.
• To convert a one-hot encoded vector back into the original text category, LabelBinarizer provides the inverse_transform function.

from sklearn.preprocessing import LabelBinarizer

# df is a DataFrame with 'color' and 'make' columns (from earlier examples)
color_lb = LabelBinarizer()
make_lb = LabelBinarizer()

X = color_lb.fit_transform(df.color.values)
Xm = make_lb.fit_transform(df.make.values)
print(X)

green_ohe = X[[0]]
color_lb.inverse_transform(green_ohe)
Any
Queries?