Ch01 ICS422 04
Faculty Development Programme on Emerging Trends in Machine Learning and Deep Learning Techniques
Data Preprocessing Techniques
Roadmap
• Data Visualization
Know Your Data
Types of Data Sets: (1) Record Data
• Relational records
• Relational tables, highly structured
• Data matrix, e.g. numerical matrix, crosstabs
Term-frequency (document-term) matrix:

             team  coach  play  ball  score  game  win  lost  timeout  season
Document 1      3      0     5     0      2     6    0     2        0       2
Document 2      0      7     0     2      1     0    0     3        0       0
Document 3      0      1     0     0      1     2    2     0        3       0

• Transaction data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
• Graph and network data, e.g. a transportation network, molecular structures
• Spatio-temporal data, e.g. measurements (precipitation, temperature) recorded at a variety of geographical locations
• Image data
Data Types From A Machine Learning
Perspective
Text
Text data consists of words; it must be converted into a numerical representation (e.g. a document-term matrix) before most machine learning algorithms can use it.
• Descriptive
• Describes the basic features of the data with simple summaries
• Data analysis
• Analyze the previous year's sales data to find interesting insights
• Investment made by the company
• Profit %, etc.
• Uses numerical measures
• Inferential
• Draws inferences and predictions about the larger population from which the sample was drawn
• Data Science
• Uses data to find insights and inferences
• Suggests strategies to the company to increase its profit
Basic Statistical Descriptions of Data
• What
• Measure of central tendency
• Mean, Median and mode
• Location of the centre of a data distribution
• Where do most of the attribute values fall?
• Dispersion measures
• Range, quartiles, interquartile range, five-number summary and box plots, variance and standard deviation
• These describe how the data are spread out.
Descriptive Statistics
Measuring Central Tendency
Mode
• Value that occurs
most frequently in the
data
• Unimodal, bimodal,
multimodal
Contd…
The Mode : The most frequently occurring score
- Unimodal distribution has only one major peak
- Bimodal distribution has two major peaks
Median
• The median is simply the middle value of your observations when they are ordered from smallest to largest.
• Finding it involves two steps: sort the observations, then take the middle value (or the average of the two middle values when the count is even).
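As a quick illustration, all three measures of central tendency can be computed with Python's standard library; the weights list is the hypothetical sample used again in the dispersion example later:

```python
from statistics import mean, median, mode

# Hypothetical sample: weights of 10 males (also used in the dispersion slide)
weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

print(mean(weights))    # 62.1
print(median(weights))  # even count: average of 60 and 61 -> 60.5
print(mode(weights))    # 56 occurs most frequently
```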
Negative Skew
Positive Skew
Properties of Normal Distribution
Curve
(Figure: the width of the curve represents the dispersion, or spread, of the data.)
• The normal (distribution) curve
• From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
• From μ–2σ to μ+2σ: contains about 95% of them
• From μ–3σ to μ+3σ: contains about 99.7% of them
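The 68–95–99.7 rule can be verified numerically with the standard library's NormalDist; this is a sketch against the standard normal curve, not tied to any particular dataset:

```python
from statistics import NormalDist

nd = NormalDist(mu=0, sigma=1)  # standard normal curve

# Probability mass within +/- k standard deviations of the mean
for k in (1, 2, 3):
    p = nd.cdf(k) - nd.cdf(-k)
    print(f"within ±{k}σ: {p:.4f}")  # ~0.6827, ~0.9545, ~0.9973
```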
Measuring the Dispersion of Data

Range
• The difference between the largest and the smallest data item.

Variance
• Variance signifies how much the data items deviate from the mean.
• A larger variance means the data items deviate more from the mean; a smaller variance means they are closer to the mean.

Let's look at a very simple data set representing the weights of 10 males:
55, 56, 56, 58, 60, 61, 63, 64, 70, 78 (mean = 62.1)

Range = 78 – 55 = 23

Variance = [(55–62.1)² + (56–62.1)² + (56–62.1)² + (58–62.1)² + (60–62.1)² + (61–62.1)² + (63–62.1)² + (64–62.1)² + (70–62.1)² + (78–62.1)²] / 9 = 466.9 / 9 = 51.88

Standard deviation
• The square root of the variance. In the formula above, σ is the standard deviation and σ² is the variance.
• Std dev = sqrt(51.88) = 7.20
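The worked example above maps directly onto the standard library's sample-statistics functions, which also divide by n − 1:

```python
from statistics import variance, stdev  # sample statistics: divide by n - 1

weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

data_range = max(weights) - min(weights)  # 78 - 55 = 23
var = variance(weights)                   # sum of squared deviations / 9
sd = stdev(weights)                       # square root of the variance
print(data_range, round(var, 2), round(sd, 2))  # 23 51.88 7.2
```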
Measuring the Dispersion of
Data
Correlation and Covariance
• Measures how two variables vary with respect to each other
Positive covariance signifies that the higher values of one variable correspond
with the higher values of the other variable, and similarly for the lower ones.
Negative covariance, on the other hand, signifies that the higher values of one
variable correspond to the lower values of the other
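A minimal sketch of the sign convention; the hours/score pairs below are invented for illustration:

```python
def covariance(x, y):
    """Sample covariance: average product of deviations from the two means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

hours = [1, 2, 3, 4, 5]       # hypothetical hours studied
score = [52, 55, 61, 70, 77]  # hypothetical exam scores, rising with hours

print(covariance(hours, score))        # positive: high hours pair with high scores
print(covariance(hours, score[::-1]))  # negative: high hours pair with low scores
```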
Measuring the Dispersion of
Data
Why percentiles?
A percentile gives the relative position of a particular value within the dataset. If we are interested in relative positions, the mean and standard deviation do not make sense. In the case of exam scores, we do not know whether it was a difficult exam on which 7 points out of 20 was an amazing score. Here the raw scores by themselves are meaningless, but the percentile reflects everything. For example, GRE and GMAT scores are reported together with percentile ranks.
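A minimal sketch of percentile rank; the exam scores below are invented for illustration:

```python
def percentile_rank(scores, value):
    """Percent of observations that are less than or equal to `value`."""
    return 100.0 * sum(s <= value for s in scores) / len(scores)

# Hypothetical results of a difficult exam scored out of 20
exam = [2, 3, 3, 4, 5, 5, 6, 6, 7, 12]
print(percentile_rank(exam, 7))  # 7/20 looks low, yet it beats or ties 90% here
```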
Characteristics of Range
• Rarely used
• It is a crude measure
• Highly susceptible to outliers
• Used mostly with nominal or ordinal data
• The interquartile range (IQR) overcomes these limitations to some extent
Measures of dispersion - Interquartile Range (IQR)
• The difference between the third quartile and the first
quartile.
IQR=Q3−Q1
Why IQR?
The interquartile range is a better option than the range because it is not affected by outliers: it excludes them by focusing only on the spread of the middle 50% of the data.
Interquartile Range
• Quartiles divide the data into four equal sections (25% each)
• Interquartile range: the range of the middle 50% of the data
- Measures the variability between the 1st and 3rd quartiles
- Variability in the middle half of the data
- Describes spread at the centre of the data
- Not largely affected by outliers
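Using the standard library on the weights sample from the earlier slides (quantiles' default "exclusive" method interpolates at positions k(n+1)/4):

```python
import statistics

weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

# n=4 cut points -> Q1, median, Q3 (default 'exclusive' interpolation)
q1, q2, q3 = statistics.quantiles(weights, n=4)
iqr = q3 - q1
print(q1, q3, iqr)  # 56.0 65.5 9.5
```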
Calculating Interquartile Range
Sort the data, take Q1 (the 25th percentile) and Q3 (the 75th percentile), and compute IQR = Q3 − Q1.
MinMax Scaler
X_norm = (X − min(X)) / (max(X) − min(X)) × (NewMax − NewMin) + NewMin
Age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
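A minimal sketch of the formula applied to the Age list in pure Python; sklearn's MinMaxScaler computes the same with NewMin=0 and NewMax=1 by default:

```python
def min_max_scale(xs, new_min=0.0, new_max=1.0):
    """X_norm = (X - min(X)) / (max(X) - min(X)) * (NewMax - NewMin) + NewMin"""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) * (new_max - new_min) + new_min for x in xs]

age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
scaled = min_max_scale(age)
print([round(v, 3) for v in scaled])  # 19.4 maps to 0.0, 44.9 maps to 1.0
```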
One Hot Encoding
• It takes a column which has categorical data, which
has been label encoded, and then splits the column
into multiple columns
• For k distinct values, we can transform the feature into a k-dimensional vector with a single 1 and 0s in the remaining positions.
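A minimal sketch of that k-dimensional mapping in pure Python; the colour values are invented for illustration:

```python
def one_hot(values):
    """Encode each of the k distinct values as a k-dim vector with a single 1."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1 if index[v] == i else 0 for i in range(len(categories))]
            for v in values], categories

colors = ["green", "red", "blue", "green"]  # hypothetical categorical feature
vectors, cats = one_hot(colors)
print(cats)     # ['blue', 'green', 'red']
print(vectors)  # exactly one 1 per row
```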
Label Binarizer
• The LabelBinarizer class performs one-hot encoding in a single step.
• To convert a one-hot encoded vector back into the original text category, LabelBinarizer provides the inverse_transform function.

from sklearn.preprocessing import LabelBinarizer

# df is a DataFrame with 'color' and 'make' columns (from earlier examples)
color_lb = LabelBinarizer()
make_lb = LabelBinarizer()

X = color_lb.fit_transform(df.color.values)
Xm = make_lb.fit_transform(df.make.values)
print(X)

green_ohe = X[[0]]
color_lb.inverse_transform(green_ohe)
Any
Queries?