
Ch01 ICS422 04

Uploaded by

Vipul Khandke
Copyright
© © All Rights Reserved

Five Days Virtual Faculty Development Programme on
Emerging Trends on Machine Learning and Deep Learning Techniques
Data Preprocessing Techniques
Roadmap

• Know Your Data

• Statistical Descriptions of Data

• Data Visualization
Know Your Data
Types of Data Sets: (1) Record Data
• Relational records
• Relational tables, highly structured
• Data matrix, e.g. numerical matrix, crosstabs
• Transaction data

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

• Document data: term-frequency vector (matrix) of text documents

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Types of Data Sets: (2) Graphs and Networks
• Transportation network
• World Wide Web
• Molecular structures
• Social or information networks
Types of Data Sets: (3) Ordered Data

• Video data: sequence of images
• Temporal data: time series. For example, a financial data set might contain objects that are time series of the daily prices of various stocks.
• Sequential data: transaction sequences (ordered by time)
• Genetic sequence data (ordered by position): the genetic information of plants and animals can be represented as sequences of nucleotides, known as genes.
Types of Data Sets: (4) Spatial, image and multimedia Data

• Spatial data: an example of spatial data is weather data (precipitation, temperature, pressure) that is collected for a variety of geographical locations.
• Image data:
Data Types From A Machine Learning Perspective
Numerical Data
• Data points are exact numbers
• Quantitative data
• Measurement: e.g. the number of residential properties in Los Angeles, or how many houses were sold in the past year
• Numerical data is not ordered in time
• Continuous data can assume any value within a range, whereas discrete data has distinct values
• The number of students taking a machine learning class is a discrete data set: it can only take whole-number values like 10, 25, or 33. A class cannot have 12.75 students enrolled; a student either joins a class or does not.
• Continuous data are numbers that can fall anywhere within a range. For example, a student could have an average score of 88.25, which falls between 0 and 100.
Data Types From A Machine Learning Perspective
Categorical Data
• Represents characteristics, e.g. a cricket player's position, team, hometown
• Values belong to a specific finite set of categories or classes (also called labels)
• Major classes of categorical data: nominal and ordinal
• Nominal categorical: no concept of ordering amongst the values of the attribute. Movie, music and video game genres, country names, and food and cuisine types are examples of nominal categorical attributes.
• Ordinal categorical: there is a sense or notion of order amongst the values
Data Types From A Machine Learning Perspective
Time Series Data
• Sequence of numbers collected at regular intervals over some period of time
• Time series data has a temporal value attached to it, such as a date or a time stamp, so you can look for trends over time
• For example, measure the average number of home sales over many years.
• The difference between time series data and numerical data is that, rather than being a bunch of numerical values with no time ordering, time series data has an implied ordering.
• There is a first data point collected and a last data point collected.
Data Types From A Machine Learning Perspective
Text
• Text data is basically just words.
• Often the first thing you do with text is turn it into numbers, for example with the bag-of-words formulation.
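A minimal sketch of the bag-of-words idea in Python. The two sentences and the whitespace tokenizer are illustrative assumptions, not part of the slides.

```python
from collections import Counter

# Hypothetical example documents (not from the slides)
docs = ["the team won the game", "the coach lost the ball"]

# Assumption: lowercasing and splitting on whitespace is enough tokenization here
tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

def bag_of_words(tokens):
    """Return the term-frequency vector of one document over the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(tokens) for tokens in tokenized]
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes one row of a term-frequency matrix, with one column per vocabulary word.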
Statistical Descriptions of Data
Basic Statistical Descriptions of Data
Data matrix: tabular representation of cases and variables
• Each row of a data matrix represents a case and each column represents a variable
• A complete data matrix may contain thousands, lakhs (hundreds of thousands), or even more cases.
Frequency table
• Shows how the values of a variable are distributed over the cases.
Types of statistics

• Descriptive
  • Describes the basic features of data with simple summaries
  • Data analysis
  • Analyze the previous year's sales data to find interesting insights
  • Investments made by the company
  • Profit %, etc.
  • Uses numerical measures
• Inferential
  • Draws inferences and predictions about the larger population from which the sample was drawn
  • Data science
  • Uses data to find several insights/inferences
  • Suggests strategies to the company to increase its profit
Basic Statistical Descriptions of Data

• What
• Measures of central tendency
  • Mean, median and mode
  • Location of the centre of a data distribution
  • Where do most of the attribute values fall?
• Measures of dispersion
  • Range, quartiles, interquartile range, five-number summary and box plots, variance and standard deviation
  • Describe how the data are spread out.
Descriptive Statistics
Measuring Central Tendency

• What single number best represents our data?
• Are the data packed together?
• Central tendency indicates where the centre of the distribution tends to be, e.g. a typical score or a typical height.
• A measure of central tendency answers whether the scores are generally high or low.
• Determining which measure of central tendency to use depends on:
- Scale of measurement (NOIR: nominal, ordinal, interval, ratio)
- Shape of the distribution (skew, kurtosis)
Measuring Distribution of Your
Data

Mode
• Value that occurs
most frequently in the
data
• Unimodal, bimodal,
multimodal
Contd…
The Mode: the most frequently occurring score
- A unimodal distribution has only one major peak
- A bimodal distribution has two major peaks
• The mode is commonly used with nominal data
Dataset A: 2 3 3 4 4 4 4 7 7 8 9 → mode = 4
Dataset B: 2 3 4 4 4 4 7 8 9 9 9 9 10 12 → modes = 4 and 9 (bimodal)
• Not susceptible to outliers at all
  Outlier: a score significantly higher or lower than the other scores
• The mode ignores much of the data
• Used in non-parametric methods.
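A short sketch of finding the mode(s) of Dataset A and Dataset B with Python's standard library. Using `statistics.multimode` is a choice made for this illustration; it returns every value tied for the highest count.

```python
from statistics import multimode

dataset_a = [2, 3, 3, 4, 4, 4, 4, 7, 7, 8, 9]
dataset_b = [2, 3, 4, 4, 4, 4, 7, 8, 9, 9, 9, 9, 10, 12]

# multimode lists all values that share the highest frequency
modes_a = multimode(dataset_a)
modes_b = multimode(dataset_b)

print("Dataset A modes:", modes_a)
print("Dataset B modes:", modes_b)
```

A result with two values signals a bimodal data set, where quoting a single mode would be misleading.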
The Median
Median: the middle of the ordered dataset
- If the median is 30, then half the scores are at or below 30
• Put the data in ascending order
• Find the middle number
- If n is odd, the median is the value at position (n + 1)/2
- If n is even, the median is the average of the two middle values, at positions n/2 and n/2 + 1
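The two cases (odd and even n) can be sketched as a small Python function. The function name and the sample lists are hypothetical, chosen only for illustration.

```python
def median(values):
    """Median via the sort-and-pick-the-middle procedure."""
    ordered = sorted(values)           # step 1: ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd n: the value at 1-based position (n + 1) / 2
        return ordered[mid]
    # even n: average of the values at 1-based positions n/2 and n/2 + 1
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_example = median([7, 1, 5])        # hypothetical data
even_example = median([8, 2, 4, 6])    # hypothetical data
print(odd_example, even_example)
```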
Contd…

Median
• The median is nothing more than the middle value of your observations when they are ordered from the smallest to the largest.
It involves two steps:
• Order your cases from smallest to largest
• Find the middle value
The Mean
• Mean is located at the mathematical centre of
the distribution.
• Most commonly used with Interval and Ratio
data
• It is also called average
• In normal distribution Mean = Median =
Mode
Contd…

• Mean: the mean is the sum of all the values divided by the number of observations
When to use which measure of central tendency?
• If the data is nominal, neither the mean nor the median can be calculated, so use the mode. For ordinal data the median is also meaningful, but the mean is not.
• If your data is quantitative, use the mean or the median.
• If your data has influential outliers or is highly skewed, the median is the better measure of central tendency. Otherwise use the mean.
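A small sketch of why the median is preferred when outliers are present. The numbers are made up for illustration and are not from the slides.

```python
from statistics import mean, median

# Hypothetical values; the last entry of the second list is an outlier
base = [30, 32, 35, 38, 40]
with_outlier = base + [400]

mean_base, median_base = mean(base), median(base)
mean_out, median_out = mean(with_outlier), median(with_outlier)

print("without outlier: mean =", mean_base, "median =", median_base)
print("with outlier:    mean =", round(mean_out, 2), "median =", median_out)
```

One extreme value pulls the mean far away from the bulk of the data, while the median barely moves.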
Whether the data are packed
together or spread-out?
Dispersion or Variability of Data

Variability indicates how spread out the scores are
• When there are large differences in the scores, the data are said to contain a lot of variability.
An example:
Suppose you are driving across the district and eating at every A2B restaurant that you pass. Do you expect the variability of the food taste from one restaurant to the next to be high or low?
Variability will be low and consistency will be high. The food tastes similar.
Dispersion or Variability of
Data
The greater the variability, the less accurately the data are summarized by the measures of central tendency
When measurement errors are larger, the mean is less useful
Dispersion Vs Central
Tendency
Dispersion and central tendency are independent measures
• The mean tells where the centre is
- Distributions with the same mean may have different variability
0, 2, 6, 10, 12 : Mean = 6 (High variability)
8, 7, 6, 5, 4 : Mean = 6 (Data points are closer)
6, 6, 6, 6, 6 : Mean = 6 (No variability)
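The three data sets above can be checked with a few lines of Python. Using the population standard deviation (`pstdev`) as the measure of spread is an assumption made for this sketch.

```python
from statistics import mean, pstdev

datasets = {
    "high variability": [0, 2, 6, 10, 12],
    "closer points": [8, 7, 6, 5, 4],
    "no variability": [6, 6, 6, 6, 6],
}

# Same mean for every data set, very different spread
summary = {name: (mean(values), pstdev(values)) for name, values in datasets.items()}
for name, (centre, spread) in summary.items():
    print(f"{name}: mean = {centre}, std dev = {spread:.2f}")
```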


Measures of dispersion - Skewness
• Checks whether the distribution has a longer tail on one side or the other, or has left-right symmetry.
• It can be positive (right-skewed distribution), negative (left-skewed distribution), or zero (unskewed distribution).
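A minimal sketch of a skewness coefficient in Python. It assumes the moment-based definition (third central moment divided by the 1.5 power of the second central moment), which is one common choice; the slides do not say which formula they use, and the sample lists are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (assumed definition)."""
    n = len(values)
    centre = sum(values) / n
    m2 = sum((x - centre) ** 2 for x in values) / n
    m3 = sum((x - centre) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]            # hypothetical data
right_skewed = [1, 1, 2, 2, 3, 10]     # long tail on the right
left_skewed = [-x for x in right_skewed]

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```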
Symmetric vs. Skewed Data
• Data can be "skewed", meaning it tends to have a long tail on one side or the other:
[Figure: three distributions labelled No Skew / symmetric, Negative Skew, Positive Skew]
Properties of Normal Distribution Curve
[Figure: normal curve. The horizontal spread represents data dispersion; the centre represents central tendency.]
Properties of Normal Distribution
Curve
• The normal (distribution) curve
• From μ–σ to μ+σ: contains about 68% of the
measurements (μ: mean, σ: standard
deviation)
• From μ–2σ to μ+2σ: contains about 95% of
the measurements
• From μ–3σ to μ+3σ: contains about 99.7% of
the measurements
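The 68 / 95 / 99.7 figures can be reproduced from the standard normal distribution. This sketch uses `statistics.NormalDist` from the Python standard library; the helper name `coverage` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```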

Measuring the Dispersion of
Data
Range
• The difference between the largest and the smallest data item.
Let's look at a very simple set of data representing the weights of 10 males:
55, 56, 56, 58, 60, 61, 63, 64, 70, 78.
Range = 78 – 55 = 23.
Variance
• Variance signifies how much the data items deviate from the mean.
• A larger variance means the data items deviate more from the mean.
• A smaller variance means the data items are closer to the mean.
Mean = 62.1
Variance = [(55–62.1)² + (56–62.1)² + (56–62.1)² + (58–62.1)² + (60–62.1)² + (61–62.1)² + (63–62.1)² + (64–62.1)² + (70–62.1)² + (78–62.1)²]/9 = 466.9/9 = 51.88 (sample variance, dividing by n – 1 = 9)
Standard deviation
• The square root of the variance: σ is the standard deviation and σ² is the variance.
Std dev = sqrt(51.88) = 7.20
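The worked numbers for the ten weights can be checked in Python. `statistics.variance` and `statistics.stdev` divide by n - 1, which matches the division by 9 in the calculation.

```python
from statistics import mean, stdev, variance

weights = [55, 56, 56, 58, 60, 61, 63, 64, 70, 78]

data_range = max(weights) - min(weights)
sample_variance = variance(weights)   # divides by n - 1 = 9
sample_std = stdev(weights)           # square root of the sample variance

print("mean =", mean(weights))
print("range =", data_range)
print("variance =", round(sample_variance, 2))
print("std dev =", round(sample_std, 2))
```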
Measuring the Dispersion of
Data
Correlation - Covariance
• Covariance measures how two variables vary with respect to each other.
• Covariance indicates the direction of the linear relationship between variables. It can take any value between -∞ and +∞.
• The covariance of two dependent variables measures how much, in the variables' real units (e.g. cm, kg, liters), they co-vary on average.
• Positive covariance signifies that the higher values of one variable correspond with the higher values of the other variable, and similarly for the lower ones.
• Negative covariance, on the other hand, signifies that the higher values of one variable correspond to the lower values of the other.
Measuring the Dispersion of
Data

• A positive covariance between two returns means that both returns move in the same direction, i.e. either both have positive returns or both have negative returns.
• A covariance very close to zero signifies the lack of a linear relationship between the two variables.
Measuring the Dispersion of
Data
Correlation - how two variables move with respect to each other
• Correlation measures both the strength and the direction of the linear relationship between two variables. It ranges from -1 to 1.
• The correlation of two dependent variables measures the proportion of how much, on average, these variables vary with respect to one another.
• A perfect positive correlation means that the correlation coefficient is 1.
• A perfect negative correlation means that the correlation coefficient is -1.
• A correlation coefficient of 0 means that the two variables have no linear relationship; they may still be dependent in a non-linear way.
• The correlation coefficient
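A minimal sketch of sample covariance and the Pearson correlation coefficient in plain Python. The function names and both small data sets are hypothetical. The second data set (y = x²) shows a fully dependent pair whose correlation is still 0, because the relationship is not linear.

```python
from math import sqrt

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical data: a perfectly linear pair and a curved (y = x^2) pair
linear_x, linear_y = [1, 2, 3, 4, 5], [2, 4, 6, 8, 10]
curved_x = [-2, -1, 0, 1, 2]
curved_y = [x * x for x in curved_x]

print(covariance(linear_x, linear_y), correlation(linear_x, linear_y))
print(covariance(curved_x, curved_y), correlation(curved_x, curved_y))
```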
Measuring the Dispersion of
Data
Measures of dispersion -
Percentile
• A measure which indicates the value below which a given percentage of points in a dataset fall.
• The 35th percentile (P35) is the score below which 35% of the data points may be found.
• The median represents the 50th percentile, the 0th percentile represents the minimum, and the 100th percentile represents the maximum of all data points.
To calculate the kth percentile (Pk) for a data set of N observations arranged in increasing order:
Step 1: Calculate i = (k/100) × N
Step 2: If i is a whole number, count the observations in the data set from left to right until the ith data point is reached. The kth percentile, in this case, is the average of the value of the ith data point and the value of the data point that follows it.
Step 3: If i is not a whole number, round it up to the nearest integer and count the observations in the data set from left to right until the ith data point is reached. The kth percentile is then the value of this data point.
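The three steps can be sketched as a Python function. The function name is hypothetical, the sketch assumes 0 < k < 100, and it uses integer arithmetic for i = (k/100) × N to avoid floating-point surprises. Other percentile definitions (for example with interpolation) can give slightly different answers.

```python
def percentile(values, k):
    """kth percentile using the average-or-round-up rule described above."""
    if not 0 < k < 100:
        raise ValueError("this sketch assumes 0 < k < 100")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)    # i = whole + remainder / 100
    if remainder == 0:
        # i is a whole number: average the ith value and the one after it
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: round up, then take that (1-based) position
    return ordered[whole]

scores = [72, 54, 56, 61, 62, 66, 68, 43, 69, 69, 70, 71, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
grades = [85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65]

p60 = percentile(scores, 60)
p80 = percentile(grades, 80)
print(p60, p80)
```

When i is not whole, `ordered[whole]` is the rounded-up position because Python lists are 0-based.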
Measures of dispersion - Percentile
Example
• There are 25 test scores: 72, 54, 56, 61, 62, 66, 68, 43, 69, 69, 70, 71, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. Find the 60th percentile.
• Step 1: Arrange the data in ascending order: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99.
• Step 2: Find the rank: k/100 = 60/100 = 0.60
• Step 3: Find the position: i = 0.60 × 25 = 15
• Step 4: Since 15 is a whole number, count the values from left to right until you reach the 15th value, which is 79. Take the 15th and 16th values and find their average: (79 + 85) / 2 = 164 / 2 = 82
• Hence, the 60th percentile of the given data set = 82.
Measures of dispersion -
Percentile
Let us consider another percentile example: in a college, a list of grades of 12 students has been declared. Their grades are: 85, 34, 42, 51, 84, 86, 78, 85, 87, 69, 74, 65. Find the 80th percentile.
• Step 1: Arrange the data in ascending order: 34, 42, 51, 65, 69, 74, 78, 84, 85, 85, 86, 87.
• Step 2: Find the rank: k/100 = 80/100 = 0.80
• Step 3: Find the position: i = 0.80 × 12 = 9.6
• Step 4: Since 9.6 is not a whole number, round it up to the next whole number, 10.
• Now count the values in the data set from left to right until you reach the 10th value, which is 85.
• Hence, the 80th percentile of the given data set = 85.
Measures of dispersion -
Quartiles
• Three points that split the data set into four equal parts, so that each group contains one-fourth of the data.
• The 25th percentile is the first quartile (Q1), the 50th percentile is the second quartile (Q2), and the 75th percentile is the third quartile (Q3).

Why percentiles?
A percentile gives the relative position of a particular value within the dataset. If we are interested in relative positions, the mean and standard deviation do not help. In the case of exam scores, we do not know whether it was a difficult exam in which 7 points out of 20 was an amazing score. In this case the raw scores by themselves are meaningless, but the percentile reflects the relative standing. For example, GRE and GMAT scores are
Characteristic of Range

• Rarely used
• It is a crude measure
• Highly susceptible to outliers
• Requires values that can be ordered, so it applies to ordinal or numerical data, not to nominal data
• The Interquartile Range (IQR) overcomes these limitations to some extent
Measures of dispersion - Interquartile
Range(IQR)
• The difference between the third quartile and the first
quartile.
IQR=Q3−Q1

Why IQR?
The interquartile range is often a better option than the range because it is much less affected by outliers: it focuses only on the spread of the middle 50% of the data.
Interquartile Range
• Quartiles divide the data into four equal sections (25%
each)
• Interquartile Range - The range between middle 50% of
data
- Measure the variability between 1st and 3rd quartile
- Variability in the middle half of the data
- Describes spread at centre of the data
- Not largely affected by the outliers
Calculating Interquartile Range

• Split the ordered data at the median into a lower half and an upper half
• Find the median of each half (the lower and upper quartiles)
• Subtract the lower quartile from the upper quartile
MDN = (3 + 4) / 2 = 3.5    IQR = 9 - 3 = 6
Data Visualization
Graphic Displays of Basic Statistical
Descriptions
• Boxplot: graphic display of five-number
summary
• Histogram: the x-axis shows values, the y-axis represents frequencies
• Scatterplot: each pair of values is a pair of
coordinates and plotted as points in the
plane
The 5 Number Summary
• Five numbers summarize the dataset:
Minimum, Quartile 1 (Q1), Median, Quartile 3 (Q3), Maximum
Q1: first quartile, 25th percentile, middle score between the minimum and the median
Q3: third quartile, 75th percentile, middle score between the median and the maximum
Min = 1, Q1 = 3, MDN = 3.5, Q3 = 9, Max = 20


Measuring the Dispersion of Data:
Quartiles & Boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th
percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median,
Q3, max
• Boxplot: data is represented with a box
• Q1, Q3, IQR: the ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
• The median (Q2) is marked by a line within the box
• Whiskers: two lines outside the box, extended to the minimum and maximum
• Outliers: points beyond a specified outlier threshold, plotted individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Measuring the Dispersion of Data:
Quartiles & Boxplots
A sample of 10 boxes of raisins has these weights (in grams):
25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Make a box plot of the data.
Step 1: Order the data from smallest to largest.
Our data is already in order: 25, 28, 29, 29, 30, 34, 35, 35, 37, 38
Step 2: Find the median.
The median is the mean of the middle two numbers: (30 + 34)/2 = 32, so median = 32
Step 3: Find the quartiles.
The first quartile is the median of the data points to the left of the median. Q1 = 29
The third quartile is the median of the data points to the right of the median. Q3 = 35
Step 4: Complete the five-number summary by finding the min and the max.
The min is the smallest data point, which is 25.
The max is the largest data point, which is 38.
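A sketch of the five-number summary using the median-of-halves rule from the steps above (when n is odd, the median itself is left out of both halves). The function name is hypothetical, and libraries such as NumPy use interpolation rules that can give slightly different quartiles.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]               # points to the left of the median
    upper = ordered[half + n % 2:]       # points to the right of the median
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

raisins = [25, 28, 29, 29, 30, 34, 35, 35, 37, 38]
summary = five_number_summary(raisins)
iqr = summary[3] - summary[1]
print(summary, "IQR =", iqr)
```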
Constructing a box and whisker plot :
Example 2
• Step 1 – Take the set of numbers given: 34, 18, 100, 27, 54, 52, 93, 59, 61, 87, 68, 85, 78, 82, 91
Place the numbers in order from least to greatest: 18, 27, 34, 52, 54, 59, 61, 68, 78, 82, 85, 87, 91, 93, 100
• Step 2 – Find the median. Remember, the median is the middle value in a data set.
68 is the median of this data set.
• Step 3 – Find the lower quartile. The lower quartile is the median of the data to the left of 68: (18, 27, 34, 52, 54, 59, 61)
52 is the lower quartile.
• Step 4 – Find the upper quartile. The upper quartile is the median of the data to the right of 68: (78, 82, 85, 87, 91, 93, 100)
87 is the upper quartile.
• Step 5 – Find the maximum and minimum values in the set. The maximum is the greatest value in the data set (100). The minimum is the least value in the data set (18).
Constructing a box and whisker plot :
Example 2

• Step 6 – Find the inter-quartile range (IQR).
The inter-quartile range (IQR) is the difference between the upper and lower quartiles.
Upper Quartile = 87
Lower Quartile = 52
IQR = 87 – 52 = 35
• Organize the 5 number summary
• Min – 18
• Lower Quartile – 52
• Median – 68
• Upper Quartile – 87
• Max – 100
Interpreting the Box Plot:

Symmetric: If a box and whisker plot is symmetric, the median is equidistant from the minimum and the maximum.

Negatively Skewed: If a box and whisker plot is negatively skewed, the distance from the median to the minimum is greater than the distance from the median to the maximum.

Positively Skewed: If a box and whisker plot is positively skewed, the distance from the median to the maximum is greater than the distance from the median to the minimum.
Histogram Analysis

• A histogram is used to summarize discrete or continuous data. In other words, it provides a visual interpretation of numerical data by showing the number of data points that fall within a specified range of values (called "bins"). It is similar to a vertical bar graph. However, a histogram, unlike a vertical bar graph, shows no gaps between the bars.
• Histograms can display a large amount of data and the frequency of the data values. The median and distribution of the data can be estimated from a histogram. In addition, it can show any outliers or gaps in the data.
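A minimal text-histogram sketch in Python, using the 25 test scores from the percentile example. The bin width of 10 is an assumption made for this illustration.

```python
from collections import Counter

scores = [72, 54, 56, 61, 62, 66, 68, 43, 69, 69, 70, 71, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]

bin_width = 10  # assumed bin width

# Map each score to the lower edge of its bin, then count scores per bin
counts = Counter((score // bin_width) * bin_width for score in scores)

for start in sorted(counts):
    print(f"{start}-{start + bin_width - 1}: {'#' * counts[start]}")
```

Each row is one bin and the number of `#` marks is the frequency of that bin.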
Distributions of a Histogram
• A normal distribution: In a normal distribution, points on one side
of the average are as likely to occur as on the other side of the
average.

• A bimodal distribution: In a bimodal distribution, there are two peaks. In a bimodal distribution, the data should be separated and analyzed as separate normal distributions.
Distributions of a Histogram
• A right-skewed distribution: A right-skewed distribution is also
called a positively skewed distribution. In a right-skewed
distribution, a large number of data values occur on the left side
with a fewer number of data values on the right side. A right-
skewed distribution usually occurs when the data has a range
boundary on the left-hand side of the histogram. For example, a
boundary of 0.
Distributions of a Histogram
• A left-skewed distribution: A left-skewed distribution is also called a negatively skewed distribution. In a left-skewed distribution, a large number of data values occur on the right side with fewer data values on the left side. A left-skewed distribution usually occurs when the data has a range boundary on the right-hand side of the histogram. For example, a boundary such as 100.
Distributions of a Histogram
• A random distribution: A random distribution lacks an apparent
pattern and has several peaks. In a random distribution histogram,
it can be the case that different data properties were combined.
Therefore, the data should be separated and analyzed separately.
Example of a Histogram
• There are 3 customers waiting between 30.1 and 35 seconds
• There are 5 customers waiting between 35.1 and 40 seconds
• There are 5 customers waiting between 40.1 and 45 seconds
• There are 5 customers waiting between 45.1 and 50 seconds
• There are 2 customers waiting between 50.1 and 55 seconds
• We can conclude that the majority of customers wait between 35.1 and 50 seconds.
Scatter plot

• A scatter plot is an effective graphical method for determining whether there appears to be a relationship, pattern, or trend between two numeric attributes.
• Provides a first look at bivariate data to see clusters of points,
outliers, etc
• Each pair of values is treated as a pair of coordinates and plotted
as points in the plane
Positively and Negatively Correlated Data

• The left half fragment is positively correlated
• The right half is negatively correlated
Uncorrelated Data
Rescaling Data
• Features are the columns of a dataset, and the values in a column all describe the same feature. In everyday datasets, different features are usually measured in different units and on different magnitudes.
• For example, a column named height may hold data in cm (centimetres) and a column named weight may hold data in kg (kilograms).
• Scaling data is the process of increasing or decreasing its magnitude according to a fixed ratio; in simpler words, you change the size but not the shape of the data.
MinMax Scaler

• Scales each feature to a given range
• This transformation does not change the shape of the feature's distribution.
• Feature range parameter (default (0, 1)). Consider the following example:
Age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
Height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]

$$X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)} \cdot (NewMax - NewMin) + NewMin$$
MinMax Scaler
Age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
Height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]

After re-scaling:
Age = [1, 0.61568627, 0.34509804, 0, 0.37254902, 0.55294118, 0.10196078, 0.09019608, 0.45098039, 0.33333333]
Height = [0.61392405, 0.06329114, 0.92405063, 0.38607595, 0.39240506, 0.03797468, 0.06329114, 0.86708861, 1, 0]
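The re-scaled values can be reproduced with a few lines of plain Python. This is a sketch of the min-max formula itself, not of scikit-learn's `MinMaxScaler`; the function name and the `new_min` / `new_max` parameter names are hypothetical.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so that min -> new_min and max -> new_max."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) * (new_max - new_min) + new_min
            for v in values]

age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]

age_scaled = min_max_scale(age)
height_scaled = min_max_scale(height)

print([round(v, 8) for v in age_scaled])
print([round(v, 8) for v in height_scaled])
```

Passing, say, `new_min=-1, new_max=1` maps the same data onto a different target range.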
MinMax Scaler

The same distribution of data, but rescaled in such a way that distances between points won't be biased by differences in scale

➢ Neural networks often require their inputs to be bounded between 0 and 1.
➢ In images, for example, where pixels can only take on a
specific range of RGB values, data may have to be
STANDARDIZE Z-Score

• A Z-score describes how far a raw score falls from the mean, in standard deviation units
- Subtract the mean from the raw score
- Divide by the standard deviation
STANDARDIZE Z-Score

• Used for comparing scores from different distributions
• Used for computing the relative frequency of scores from any distribution
• Standardized Z-scores keep the shape of the original distribution; they follow the standard normal curve only if the original scores are normally distributed
• They have a mean of 0 and a standard deviation of 1
• Z-scores exceeding a chosen cutoff are treated as statistically significant
Z-Score - An example

• For instance, suppose you went to college in New York and your best friend went to college in Georgia.
• You might get a grade of 87 in a test with a mean of 77 and a standard deviation of 5, and the same day your friend might get a grade of 612 (mean 600, standard deviation 100).
• Although the two grades (87 and 612) can't be compared directly, the standardized values will allow you to immediately see who is doing better compared with the rest of the class.
• (612 – 600) / 100 = 0.12, so your friend's z-score or standardized grade is 0.12. (87 – 77) / 5 = 2 is your standardized grade.
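The two standardized grades can be checked with a one-line function. The function name is hypothetical.

```python
def z_score(raw, mean, std_dev):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / std_dev

your_z = z_score(87, mean=77, std_dev=5)
friend_z = z_score(612, mean=600, std_dev=100)

print("your z-score:", your_z)
print("friend's z-score:", friend_z)
```

The larger z-score belongs to the grade that stands further above its own class average.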
Normalize or Standardize
• Normalization is good to use when you know that your data does not follow a Gaussian distribution. This can be useful in algorithms that do not assume any distribution of the data, like K-Nearest Neighbors and Neural Networks.
• Standardization, on the other hand, can be helpful in cases where the data follows a Gaussian distribution. However, this does not have to be necessarily true. Also, unlike normalization, standardization does not have a bounding range, so outliers are not squeezed into a fixed interval and distort the scaled values less than they do under min-max normalization.
Binarize Data

• All values above the threshold are marked 1 and all values equal to or below it are marked 0.
• This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values.
• It is the process of thresholding numerical features to get Boolean values.
• It plays a key role in the discretization of continuous feature values.
Example:
The pixel values of an 8-bit grayscale image are continuous data ranging between 0 (black) and 255 (white), and one needs the image to be black and white. So, using Binarizer() one can set a threshold converting pixel values from 0 – 127 to 0
from sklearn.preprocessing import Binarizer
# For age, let threshold be 35
# For salary, let threshold be 61000
binarizer_1 = Binarizer(threshold=35)
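A plain-Python sketch of the thresholding rule (values above the threshold become 1, values equal to or below it become 0). It illustrates the rule only and is not scikit-learn's implementation; the function name and the sample values are hypothetical.

```python
def binarize(values, threshold):
    """Mark values strictly above the threshold as 1, all others as 0."""
    return [1 if v > threshold else 0 for v in values]

# Hypothetical 8-bit grayscale pixel values and ages
pixels = [0, 64, 127, 128, 255]
ages = [22, 35, 47]

pixels_bw = binarize(pixels, threshold=127)
ages_binary = binarize(ages, threshold=35)

print(pixels_bw)
print(ages_binary)
```

A value exactly equal to the threshold (127 or 35 here) falls on the 0 side.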
One Hot Encoding

• To process categorical features, which can be in numerical or text form, they must first be transformed into a numerical representation.
Label Encoding
• The first column is the color column, which is all text
• To convert this kind of categorical text data into model-understandable numerical data, we use the LabelEncoder
• After applying the LabelEncoder, you'll see that the three colours in the first column have been replaced by the numbers 0, 1, and 2
• The problem here is that, since there are different numbers in the same column, the model will misunderstand the data to be in some kind of order, 0 < 1 < 2. But this isn't the case at all. To overcome this problem, we use the One Hot Encoder.
Labeled Encoding

One Hot Encoding
• It takes a column of categorical data that has been label encoded and splits it into multiple columns
• For k distinct values, the feature is transformed into a k-dimensional vector with a single 1 and 0 everywhere else.
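A plain-Python sketch of one hot encoding for k distinct values. The colour list is made up for illustration, and sorting the categories alphabetically is an assumption about column order.

```python
def one_hot(values):
    """Return (categories, rows): one k-dimensional 0/1 vector per input value."""
    categories = sorted(set(values))                 # k distinct values
    position = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1                     # single 1, zeros elsewhere
        rows.append(row)
    return categories, rows

colors = ["green", "red", "blue", "red"]             # hypothetical column
categories, encoded = one_hot(colors)

print(categories)
for row in encoded:
    print(row)
```

Unlike label encoding, no category ends up numerically larger than another.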
One Hot Encoding

Label Binarizer
• The LabelBinarizer class performs one hot encoding in a single step
# df is assumed to be a DataFrame with "color" and "make" columns
from sklearn.preprocessing import LabelBinarizer
color_lb = LabelBinarizer()
make_lb = LabelBinarizer()
X = color_lb.fit_transform(df.color.values)
Xm = make_lb.fit_transform(df.make.values)
• To convert from the one-hot encoded vector back into the original text category, the LabelBinarizer class provides the inverse_transform function
print(X)
green_ohe = X[[0]]
color_lb.inverse_transform(green_ohe)
Any
Queries?
