Know - Your - Data and Rescaling
Data
Roadmap
• Data Visualisation
Types of Data Sets
Data Types From A Machine Learning
Perspective
Numerical Data
➢ Data points are exact numbers
➢ Quantitative data
➢ Example measurements: the number of residential
properties in Coimbatore, or how many
houses were sold in the past year.
Numerical data is not ordered in time
Categorical Data
➢ Represents characteristics, e.g. a cricket
player’s position, team, hometown
➢ Values belong to a specific finite set of
categories, also called classes or labels
➢ Major classes of categorical data =>
nominal, ordinal, interval, and ratio
For interval data such as temperature, if you said, “It is twice as hot outside as inside,”
you would be incorrect. Stating that the temperature outside is twice that inside uses
0 degrees as the reference point for comparing the two temperatures. Since it’s possible
to measure temperatures below 0 degrees, 0 can’t serve as a reference point for
comparison. You must use an actual number (such as 16 degrees) instead.
Data Types From A Machine Learning
Perspective
Time Series Data
For example:
• Measure the average number of house sales
over many years.
• The difference between time series data and
numerical data is that plain numerical values
have no time ordering, whereas time series
data has an implied ordering.
• There is a first data point collected and a last
data point collected.
Data Types From A Machine Learning
Perspective
Text
➢ Text data is basically just words.
➢ Often the first thing you do with text is
turn it into numbers, for example with the
bag-of-words representation.
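As a rough illustration of the bag-of-words idea, the sketch below counts how often each vocabulary word occurs in each document. The function name `bag_of_words` and the two sentences are hypothetical, chosen only for this example; real pipelines usually add tokenisation and punctuation handling that are omitted here.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, vectors): one word-count vector per document."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical sentences, not taken from the slides.
docs = ["the food tastes good", "the food tastes the same"]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['food', 'good', 'same', 'tastes', 'the']
print(vectors)  # [[1, 1, 0, 1, 1], [1, 0, 1, 1, 2]]
```

Each document becomes a vector of the same length, which is what lets a model treat text as numerical input.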
• Susceptible to outliers
Dispersion or Variability of Data
An example:
Suppose you are driving across the district and eating at every A2B
restaurant that you pass. Do you expect the variability of the
food taste from one restaurant to the next to be high or
low?
Variability will be low and consistency will be
high. The food tastes similar.
Dispersion or Variability of Data
Range = largest value − smallest value
0, 2, 6, 10, 12 : Range = 12
8, 7, 6, 5, 4 : Range = 4
6, 6, 6, 6, 6 : Range = 0
Characteristic of Range
• Rarely used
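The range values above can be checked with a few lines of Python. This is a minimal sketch; the helper name `value_range` and the extra list with one extreme value are assumptions added to show how a single point can dominate the range.

```python
def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


# The three data sets from the slide.
datasets = [[0, 2, 6, 10, 12], [8, 7, 6, 5, 4], [6, 6, 6, 6, 6]]
ranges = [value_range(d) for d in datasets]
print(ranges)  # [12, 4, 0]

# Hypothetical data: one extreme value changes the range from 0 to 54.
with_outlier = [6, 6, 6, 6, 60]
print(value_range(with_outlier))  # 54
```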
Smoking and Lung Capacity
N    Cigs (X)    Lung Cap (Y)
1    0           45
2    5           42
3    10          33
4    15          31
5    20          29
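Covariance
• $x_i$ = i-th data value of X
• $y_i$ = i-th data value of Y
• $\bar{x}$ = Mean of X
• $\bar{y}$ = Mean of Y
• $n$ = Number of data points
$$ \mathrm{cov}(x, y) = S_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
For the smoking data, $\bar{x} = 10$ and $\bar{y} = 36$:
x     y     x − x̄    y − ȳ    (x − x̄)(y − ȳ)
0     45    -10      9        -90
5     42    -5       6        -30
10    33    0        -3       0
15    31    5        -5       -25
20    29    10       -7       -70
                     Sum =    -215
$$ \mathrm{cov}(x, y) = \frac{-215}{4} = -53.75 = S_{xy} $$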
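The covariance calculation for the smoking and lung capacity data can be reproduced in code. The sketch below assumes the sample form of the covariance (division by n − 1), which matches the division by 4 for five data points; the function name `sample_covariance` is chosen for illustration.

```python
# Data from the Smoking and Lung Capacity table.
cigs = [0, 5, 10, 15, 20]
lung_cap = [45, 42, 33, 31, 29]


def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)


cov_xy = sample_covariance(cigs, lung_cap)
print(cov_xy)  # -53.75
```

The negative sign says that lung capacity tends to fall as the number of cigarettes rises; the magnitude depends on the units, which is why the correlation coefficient is used to judge strength.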
Correlation (rxy)
• Shows whether and how strongly pairs of variables are related
• The main result of a correlation is called the correlation
coefficient (or “r")
• It ranges from -1 to 1
• The closer r is to +1 or -1, the more closely the two variables
are related.
• If r is close to 0, it means there is little or no linear relationship
between the variables.
• If r is positive, it means that as one variable gets larger the
other gets larger.
• If r is negative, it means that as one gets larger, the other
gets smaller (often called an "inverse" correlation).
Correlation Coefficient rxy
Correlation Coefficient - Common
Expression
$$ r_{xy} = \frac{S_{xy}}{S_x S_y} $$
Sxy - Covariance of X and Y
Sx - Standard Deviation of X
Sy - Standard Deviation of Y
Cigs (X)     Cap (Y)
0            45
5            42
10           33
15           31
20           29
SD = 7.90    SD = 7.071
$$ r_{xy} = \frac{-53.75}{7.90 \times 7.071} = -0.96 $$
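The same result can be obtained by computing the covariance and both standard deviations directly. This is a minimal sketch using the sample (n − 1) versions of each quantity; the function name `pearson_r` is an assumption made for this example.

```python
import math

# Data from the Cigs (X) and Cap (Y) table.
cigs = [0, 5, 10, 15, 20]
lung_cap = [45, 42, 33, 31, 29]


def pearson_r(xs, ys):
    """Correlation coefficient r = S_xy / (S_x * S_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


r_xy = pearson_r(cigs, lung_cap)
print(round(r_xy, 2))  # -0.96
```

A value this close to −1 indicates a strong inverse relationship in this small sample.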
Conclusion
• Features are the columns of a dataset; each column holds values of
one kind. In everyday datasets, different features are usually
measured in different units and at different magnitudes.
• For example, a height column may hold values in cm (centimetres)
and a weight column values in kg (kilograms).
• Scaling data is the process of increasing or decreasing its
magnitude according to a fixed ratio; in simpler words, you change
the size but not the shape of the data.
MinMax Scaler
• Scaling each feature to a given range
Age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
Height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]
After re-scaling
Age = [1, 0.61568627, 0.34509804, 0, 0.37254902, 0.55294118, ...]
The distribution of the data is unchanged, but it is rescaled so that distances between points
are not biased by differences in scale.
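The rescaling can be sketched in plain Python using the usual min-max formula, x' = (x − min) / (max − min), applied to each feature separately. The function name `min_max_scale` and its `new_min`/`new_max` parameters are assumptions for this illustration; scikit-learn's MinMaxScaler applies the same per-feature formula.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature carries no spread; map every value to new_min.
        return [new_min for _ in values]
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


# Data from the slide.
age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]

scaled_age = min_max_scale(age)
scaled_height = min_max_scale(height)
print([round(v, 8) for v in scaled_age])
print([round(v, 8) for v in scaled_height])
```

After scaling, both features lie in [0, 1], so neither dominates a distance calculation simply because of its units.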
Binarizer
Example:
The pixel values of an 8-bit grayscale image are continuous data ranging from 0 (black) to 255 (white).
To make the image pure black and white, one can use Binarizer() with a threshold of 127, converting pixel
values 0 – 127 to 0 and 128 – 255 to 1.
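The thresholding step can be sketched without any library. The function name `binarize` and the sample pixel values are hypothetical; the rule used here (values strictly greater than the threshold become 1, everything else 0) is the one scikit-learn's Binarizer follows, so a threshold of 127 gives the split described above.

```python
def binarize(values, threshold=127):
    """Map values strictly greater than threshold to 1, all others to 0."""
    return [1 if v > threshold else 0 for v in values]


# Hypothetical 8-bit grayscale pixel values.
pixels = [0, 64, 127, 128, 200, 255]
binary = binarize(pixels)
print(binary)  # [0, 0, 0, 1, 1, 1]
```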
One Hot Encoding
After label encoding, each category in a column is replaced by a number such as 0, 1 or 2. The
problem is that the model may misread these numbers as being in some kind of order, 0 < 1 < 2,
which isn’t the case at all. To overcome this problem, we use One Hot Encoder.
• It takes a column of categorical data that has been label encoded
and splits it into multiple columns, one per category
• For k distinct values, we can transform the feature into a k-dimensional
vector with a single 1 and 0 everywhere else.
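To make the k-dimensional vector concrete, here is a small library-free sketch of encoding and decoding. The function names and the colour list are hypothetical, and ordering the categories alphabetically is an assumption made so that the column order is predictable.

```python
def one_hot_encode(labels):
    """Return (categories, rows): each row has a single 1 in its category's column."""
    categories = sorted(set(labels))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows


def one_hot_decode(categories, rows):
    """Recover the original labels from one-hot rows."""
    return [categories[row.index(1)] for row in rows]


# Hypothetical colour column with k = 3 distinct values.
colors = ["green", "red", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
print(one_hot_decode(categories, encoded))  # ['green', 'red', 'blue', 'green']
```

Because every row contains exactly one 1, no category appears larger or smaller than another, which removes the false ordering introduced by label encoding.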
Label Binarizer
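• The LabelBinarizer class performs one hot encoding in a single step
# Assumes df is a pandas DataFrame with categorical "color" and "make" columns
from sklearn.preprocessing import LabelBinarizer
color_lb = LabelBinarizer()
make_lb = LabelBinarizer()
# fit_transform learns the categories and returns one row of 0/1 values per sample
X = color_lb.fit_transform(df.color.values)
Xm = make_lb.fit_transform(df.make.values)
To convert from the one-hot encoded vector back into the original text category, the
LabelBinarizer class provides the inverse_transform method
print(X)
# Select the first row as a 2-D array; the slide assumes this row encodes "green"
green_ohe = X[[0]]
color_lb.inverse_transform(green_ohe)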
Thank You