Know - Your - Data and Rescaling

The document provides a comprehensive overview of data types relevant to machine learning, including numerical, categorical, and time series data, along with their characteristics and examples. It also discusses statistical concepts such as central tendency, dispersion, variance, and covariance, explaining how these measures help in understanding data distribution and relationships between variables. Additionally, it includes practical exercises to test understanding of the material presented.

Machine Learning – Know Your Data
Roadmap

• Know Your Data

• Statistical Descriptions of Data

• Data Visualisation
Types of Data Sets
Data Types From A Machine Learning Perspective
Numerical Data
➢ Data points are exact numbers
➢ Quantitative data
➢ Measurements, e.g. the number of residential properties in Coimbatore or how many houses were sold in the past year
Numerical data is not ordered in time

Continuous data can assume any value within a range, whereas discrete data has distinct values.
• The number of students taking a machine learning class is a discrete data set.
• You can only have discrete whole-number values like 10, 25, or 33.
• A class cannot have 12.75 students enrolled: a student either joins a class or does not.
• Continuous data are numbers that can fall anywhere within a range. For example, a student could have an average score of 88.25, which falls between 0 and 100.
Categorical Data
➢ Represents characteristics – a cricket player's position, team, hometown
➢ Values belong to a specific finite set of categories or classes (also called labels)
➢ Major classes of categorical data => nominal, ordinal, interval, and ratio

➢ Nominal categorical: No concept of ordering amongst the values of that attribute
Similarly, movie, music and video game genres, country names, and food and cuisine types are other examples of nominal categorical attributes.

Country    Nominal
France     1
Spain      2
Germany    3

Gender     Nominal
Male       1
Female     2
➢ Ordinal categorical: Sense or notion of order amongst its values

Place finished in the race: 1st, 2nd, 3rd, …
Place in election result: 1st, 2nd, 3rd, …
➢ Interval categorical: Numbers are ordered, and there are equal intervals between adjacent categories.

Temperature in degrees Fahrenheit: the difference between 78 degrees and 79 degrees (1 degree) is the same as the difference between 45 and 46 degrees.

Interval vs Ordinal: Temperature vs Place in Race?
The temperature in an air-conditioned room is 16 degrees Celsius, while the temperature outside the room is 32 degrees Celsius. You can conclude that the temperature outside is 16 degrees higher than inside the room.

But if you said, “It is twice as hot outside as inside,” you would be incorrect. That statement treats 0 degrees as the reference point for comparing the two temperatures. Since temperatures below 0 degrees Celsius can be measured, 0 is not a true zero and cannot serve as that reference point. You can compare only differences from an actual value (such as 16 degrees).
➢ Ratio categorical: Differences are meaningful (like interval), plus ratios are meaningful and there is a true zero point

Interval vs Ratio: Temperature vs Weight
Weight in pounds: 10 lbs. is twice as much as 5 lbs. (ratios are meaningful: 10/5 = 2), and zero pounds means no weight or an absence of weight (true zero point).
Test Your Skill
• “Places from where students are travelling to take a machine learning course” is an example of which scale of measurement?
• “Students’ scores on a biology test” is an example of which scale of measurement?
• A meteorologist compiles a list of temperatures in degrees Celsius for the month of May.
• "Amount of calories in a small Yogurt" is which scale of measurement?
• Country of your birth is an example of which scale of measurement?
• Arranging the shirt sizes as small, medium and large is an example of which scale of measurement?
Test Your Skill
• “Places from where students are travelling to take a machine learning course” is an example of which scale of measurement?
  Answer: Nominal
• “Students’ scores on a biology test” is an example of which scale of measurement?
  Answer: Ratio
• A meteorologist compiles a list of temperatures in degrees Celsius for the month of May.
  Answer: Interval
• "Amount of calories in a small Yogurt" is which scale of measurement?
  Answer: Ratio
• Country of your birth is an example of which scale of measurement?
  Answer: Nominal
• Arranging the shirt sizes as small, medium and large is an example of which scale of measurement?
  Answer: Ordinal
Time Series Data
➢ Sequence of numbers collected at regular intervals over some period of time
➢ Time series data has a temporal value attached to it, such as a date or a time stamp, so you can look for trends over time

For Example
• Measure the average number of house sales over many years.
• The difference between time series data and numerical data is that plain numerical values have no time ordering, whereas time series data has an implied ordering.
• There is a first data point collected and a last data point collected.
Text
➢ Text data is basically just words.
➢ Often the first thing you do with text is turn it into numbers, for example with the bag-of-words formulation.

Text : John likes to watch movies. Mary likes movies too.

BoW = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
The list : [1, 2, 1, 1, 2, 1, 1, 0, 0, 0] (each position is one vocabulary word; a 0 marks a vocabulary word that does not occur in this text)

Exercise: Generate the Bag-of-Words for “The quick brown fox jumped over the lazy dog.”
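
The word counts above can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `bag_of_words`, the simple regular-expression tokeniser, and the optional `lowercase` flag are assumptions made for illustration, not part of the slides.

```python
import re
from collections import Counter


def bag_of_words(text, lowercase=False):
    """Count how often each word occurs in text (hypothetical helper)."""
    if lowercase:
        text = text.lower()
    # Treat runs of letters, digits and apostrophes as words; drop punctuation.
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    return Counter(tokens)


bow = bag_of_words("John likes to watch movies. Mary likes movies too.")
print(dict(bow))
```

Calling `bag_of_words(..., lowercase=True)` merges words such as "The" and "the" into one entry, which is a common but optional preprocessing choice.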
Basic Statistical Descriptions of Data
Measuring Distribution of Your Data
Measuring Central Tendency
• What single number best represents our data?
• Are the data packed together?

Central tendency: indicates where the centre of the distribution tends to be, e.g. for a set of scores or a set of heights.

• A measure of central tendency answers whether the scores are generally high or low.
• Which measure of central tendency to use depends on:
- Scale of measurement (NOIR): nominal, ordinal, interval, and ratio
- Shape of the distribution (skew, kurtosis)
The Mode
The Mode : The most frequently occurring score
- Unimodal distribution has only one major peak
- Bimodal distribution has two major peaks
• Mode is commonly used with nominal data

Dataset A : 2 3 3 4 4 4 4 7 7 8 9 : Mode = 4
Dataset B : 2 3 4 4 4 4 7 8 9 9 9 9 10 12 : Modes = 4 and 9 (bimodal)
• Mode ignores much of the data
• Used in non-parametric methods.
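
As a quick check of the two datasets above, the sketch below uses `statistics.multimode` from the Python standard library (Python 3.8+). It reports every value tied for the highest count, so a bimodal dataset yields two modes. The variable names are illustrative assumptions.

```python
from statistics import multimode

dataset_a = [2, 3, 3, 4, 4, 4, 4, 7, 7, 8, 9]
dataset_b = [2, 3, 4, 4, 4, 4, 7, 8, 9, 9, 9, 9, 10, 12]

# multimode returns all most-frequent values, in order of first appearance.
modes_a = multimode(dataset_a)
modes_b = multimode(dataset_b)
print(modes_a, modes_b)
```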
The Median
Median : The middle score of the ordered dataset
- If the median is 30, then half the scores are at or below 30

• Put the data in ascending order
• Find the middle number
- If n is odd, the median is the score at position (n + 1)/2
- If n is even, the median is the average of the two middle scores, those at positions n/2 and n/2 + 1
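
The position rule above can be sketched directly in Python. The function name `median_by_position` is an assumption for illustration; the result is compared against the standard library's `statistics.median`, and the two sample lists are Dataset A and Dataset B from the mode slide.

```python
from statistics import median


def median_by_position(scores):
    """Median via the (n + 1)/2 rule for odd n, average of the two middle scores for even n."""
    ordered = sorted(scores)
    n = len(ordered)
    if n % 2 == 1:
        # 1-based position (n + 1)/2 is 0-based index (n + 1)//2 - 1.
        return ordered[(n + 1) // 2 - 1]
    lower_middle = ordered[n // 2 - 1]   # 1-based position n/2
    upper_middle = ordered[n // 2]       # 1-based position n/2 + 1
    return (lower_middle + upper_middle) / 2


odd_scores = [2, 3, 3, 4, 4, 4, 4, 7, 7, 8, 9]              # n = 11
even_scores = [2, 3, 4, 4, 4, 4, 7, 8, 9, 9, 9, 9, 10, 12]  # n = 14

print(median_by_position(odd_scores), median_by_position(even_scores))
```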
The characteristics of Median
• A dataset has only one median
• Will not always be one of the actual scores
• Useful with highly skewed data
• Median is preferred when the data are ordinal
• Not useful with nominal data
• Used in non-parametric methods.


The Mean
• The mean is located at the mathematical centre of the distribution.
• Most commonly used with interval and ratio data
• It is also called the average
• In a normal distribution, Mean = Median = Mode


The characteristics of Mean
• Only used with interval- or ratio-level data that are not excessively skewed.
• May not actually exist in the sample
• Susceptible to outliers
• Used in parametric methods.


Are the data packed together or spread out?
Measuring the Dispersion of Data
Dispersion or Variability of Data

Variability indicates how spread out the scores are.
• When there are large differences in the scores, the data are said to contain a lot of variability.

An example:
Suppose you are driving across the district and eating at every A2B restaurant that you pass. Do you expect the variability of the food taste from one restaurant to the next to be high or low?
Variability will be low and consistency will be
high. The food tastes similar.
• The greater the variability, the less accurately the data are summarized by the measures of central tendency.
• When measurement errors are large, the mean is not a useful summary.
Dispersion Vs Central Tendency
Dispersion and central tendency are independent measures

Mean tells where the centre is
- Distributions with the same mean may have different variability

0, 2, 6, 10, 12 : Mean = 6 (High variability)
8, 7, 6, 5, 4 : Mean = 6 (Data points are closer)
6, 6, 6, 6, 6 : Mean = 6 (No variability)


Range and IQR
• The distance between the two most extreme scores in the dataset

R = High score - Low score

0, 2, 6, 10, 12 : Range = 12
8, 7, 6, 5, 4 : Range = 4
6, 6, 6, 6, 6 : Range = 0
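
To see that the mean and the range are independent, the sketch below computes both for the three example datasets. The dictionary keys and the helper `score_range` are illustrative names, not from the slides.

```python
from statistics import mean

datasets = {
    "high variability": [0, 2, 6, 10, 12],
    "closer points": [8, 7, 6, 5, 4],
    "no variability": [6, 6, 6, 6, 6],
}


def score_range(scores):
    """R = high score - low score."""
    return max(scores) - min(scores)


# Same mean for every dataset, but a different range for each.
summary = {name: (mean(s), score_range(s)) for name, s in datasets.items()}
for name, (centre, spread) in summary.items():
    print(f"{name}: mean = {centre}, range = {spread}")
```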
Characteristics of Range
• Rarely used
• Highly susceptible to outliers
• Used mostly with nominal data or ordinal data
• The Interquartile Range (IQR) overcomes these limitations to some extent
Interquartile Range
• Quartiles divide the data into four equal sections (25% each)
• Interquartile Range - the range of the middle 50% of the data
- Measures the variability between the 1st and 3rd quartiles
- Variability in the middle half of the data
- Describes the spread at the centre of the data
- Not largely affected by outliers
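Calculating Interquartile Range
• Split the ordered data at the median into a lower half and an upper half
• Find the median of the lower half (Q1) and the median of the upper half (Q3)
• Subtract the lower from the upper: IQR = Q3 - Q1

MDN = (3 + 4) / 2 = 3.5, R = 19, IQR = 9 - 3 = 6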


The 5 Number Summary
• Five numbers summarise the dataset

Minimum, Quartile 1 (Q1), Median, Quartile 3 (Q3), Maximum

Q1 : First quartile, 25th percentile, middle score between the minimum and the median
Q3 : Third quartile, 75th percentile, middle score between the median and the maximum

Min = 1, Q1 = 3, MDN = 3.5, Q3 = 9, Max = 20
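
The median-split procedure can be sketched as follows. The slides do not show the underlying dataset, so the list `scores` below is a hypothetical one chosen only because it reproduces the summary values quoted above; the function name `five_number_summary` is likewise an assumption. Other quartile conventions exist and can give slightly different Q1 and Q3.

```python
from statistics import median


def five_number_summary(data):
    """Min, Q1, median, Q3, max using the median-split method."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # scores below the median position
    upper = ordered[half + n % 2:]    # skip the middle score when n is odd
    return {
        "min": ordered[0],
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "max": ordered[-1],
    }


scores = [1, 3, 3, 4, 9, 20]  # hypothetical data, not taken from the slides
summary = five_number_summary(scores)
iqr = summary["q3"] - summary["q1"]
data_range = summary["max"] - summary["min"]
print(summary, "IQR =", iqr, "Range =", data_range)
```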


Make a box plot of the data
• Step 1: Order the data from smallest to largest.
• Step 2: Find the median.
• Step 3: Find the quartiles.
• Step 4: Complete the five-number summary by finding the min and the max.
Box Plot
Sum of Squares
Finds overall variability in the dataset.
Variance
• The sum of squares increases as the size of the dataset increases, so on its own it does not represent the typical variability in the data.
Average Variability : 38 / 6 = 6.33
• The Average Variability is also called Variance.
• Variance is hard to interpret
• Variance is based on squared deviations, so it is expressed in squared units and can be an unrealistically large number.
Standard Deviation
• Taking the square root of the variance returns it to the original units.
SD = \sqrt{\text{Variance}}
Tells how consistent and close the scores are
- Small SD : close and consistent
- Large SD : far apart and inconsistent
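
The chain from sum of squares to variance to standard deviation can be sketched in plain Python. This is an illustration under stated assumptions: it reuses the dataset 0, 2, 6, 10, 12 from the dispersion slide (not the data behind the 38 / 6 figure, which is not shown), and it divides by N - 1, the convention the covariance slide mentions; dividing by N instead gives the population variance. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def sum_of_squares(scores):
    """Sum of squared deviations from the mean."""
    centre = mean(scores)
    return sum((x - centre) ** 2 for x in scores)


def sample_variance(scores):
    """Average squared deviation, using the N - 1 divisor (an assumption)."""
    return sum_of_squares(scores) / (len(scores) - 1)


def sample_sd(scores):
    """Square root of the variance, back in the original units."""
    return sqrt(sample_variance(scores))


scores = [0, 2, 6, 10, 12]
print(sum_of_squares(scores), sample_variance(scores), round(sample_sd(scores), 3))
```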
Standard Deviation
• SD affects shape of the distribution
• Small SD - leptokurtic
• Large SD - platykurtic
Covariance
• Variables may change in relation to each other
• Covariance measures how much the movement in one variable predicts the movement in a corresponding variable
• Covariance is a statistical term that refers to a systematic relationship between two random variables, in which a change in one variable is reflected by a change in the other
Covariance
• The covariance value can range from -∞ to +∞, with a negative value indicating a negative relationship and a positive value indicating a positive relationship.
• The greater the absolute value of this number, the more strongly the two variables move together, although the value depends on the units of the variables.
• Positive covariance denotes a direct relationship and is represented by a positive number.
• A negative number, on the other hand, denotes negative covariance, which indicates an inverse relationship between the two variables.
• Covariance is great for defining the type of relationship, but it's terrible for interpreting the magnitude.
Smoking and Lung Capacity
• Example: investigate the relationship between cigarette smoking and lung capacity
• Data: a sample group's responses on smoking habits and their measured lung capacities, respectively

N    Cigarettes (X)    Lung Capacity (Y)
1    0                 45
2    5                 42
3    10                33
4    15                31
5    20                29
Smoking and Lung Capacity
• Observe that as smoking exposure goes up, the corresponding lung capacity goes down
• The variables covary inversely

Covariance
• Similar to variance, for theoretical reasons the average is typically computed using (N - 1), not N. Thus,

cov(x, y) = S_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}

• x_i = i-th data value of X
• y_i = i-th data value of Y
• \bar{x} = Mean of X
• \bar{y} = Mean of Y
• N = Number of data points


Calculating Covariance

Cigs (X)    Lung Cap (Y)    x - \bar{x}    y - \bar{y}    (x - \bar{x})(y - \bar{y})
0           45              -10            9              -90
5           42              -5             6              -30
10          33              0              -3             0
15          31              5              -5             -25
20          29              10             -7             -70

\bar{x} = 10, \bar{y} = 36, \sum (x - \bar{x})(y - \bar{y}) = -215

cov(x, y) = \frac{-215}{4} = -53.75 = S_{xy}
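
The covariance calculation for the smoking data can be checked with a short sketch. It follows the (N - 1) convention stated earlier; the function and variable names are illustrative assumptions.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by N - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    x_bar, y_bar = mean(xs), mean(ys)
    products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

s_xy = covariance(cigarettes, lung_capacity)
print(s_xy)
```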
Correlation (rxy)
• Shows whether and how strongly pairs of variables are related
• The main result of a correlation is called the correlation coefficient (or "r")
• It ranges from -1 to 1
• The closer r is to +1 or -1, the more closely the two variables are related.
• If r is close to 0, it means there is no linear relationship between the variables.
• If r is positive, it means that as one variable gets larger the other gets larger.
• If r is negative, it means that as one gets larger, the other gets smaller (often called an "inverse" correlation).
Correlation Coefficient rxy
Correlation Coefficient - Common Expression

r_{xy} = \frac{S_{xy}}{S_x S_y}

S_{xy} - Covariance between features x and y
S_x - Standard deviation of X
S_y - Standard deviation of Y
Correlation Coefficient - Common Expression

Cigs (X)    Cap (Y)
0           45
5           42
10          33
15          31
20          29
SD = 7.90   SD = 7.071

r_{xy} = \frac{-53.75}{7.90 \times 7.071} = -0.96
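
The same figures can be reproduced in Python as a sanity check. This is a minimal sketch that divides the sample covariance by the product of the two sample standard deviations (`statistics.stdev` uses the N - 1 divisor); the function name `correlation` is an illustrative choice.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """r_xy = S_xy / (S_x * S_y), with sample (N - 1) statistics throughout."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return s_xy / (stdev(xs) * stdev(ys))


cigarettes = [0, 5, 10, 15, 20]
lung_capacity = [45, 42, 33, 31, 29]

r_xy = correlation(cigarettes, lung_capacity)
print(round(stdev(cigarettes), 2), round(stdev(lung_capacity), 3), round(r_xy, 2))
```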
Conclusion
• r_xy = -0.96 implies that a smoker is almost certain to have diminished lung capacity
• Greater smoking exposure implies a greater likelihood of lung damage
Rescaling of Data
Rescaling Data

• Features are the columns of your dataset, and the data in each column measure the same kind of quantity. In everyday datasets, different features are usually recorded in different units and at different magnitudes.
• For example, the column named height may hold data in cm (centimetres), and the column named weight may hold data in kg (kilograms).
• Scaling data is the process of increasing or decreasing its magnitude according to a fixed ratio; in simpler words, you change the size but not the shape of the data.
MinMax Scaler
• Scales each feature to a given range
• Feature range parameter (default (0, 1)). Consider the following example:

Age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
Height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]
After re-scaling
Age = [1, 0.61568627, 0.34509804, 0, 0.37254902, 0.55294118, 0.10196078, 0.09019608, 0.45098039, 0.33333333]
Height = [0.61392405, 0.06329114, 0.92405063, 0.38607595, 0.39240506, 0.03797468, 0.06329114, 0.86708861, 1, 0]
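
The rescaled values follow from the min-max formula x' = (x - min) / (max - min), stretched to the chosen feature range. The sketch below is a plain-Python illustration of that formula rather than scikit-learn's implementation; the function name `min_max_scale` is an assumption.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Rescale values linearly so the minimum maps to low and the maximum to high."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        raise ValueError("all values are identical; cannot rescale")
    return [low + (v - v_min) / span * (high - low) for v in values]


age = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
height = [70.4, 61.7, 75.3, 66.8, 66.9, 61.3, 61.7, 74.4, 76.5, 60.7]

scaled_age = min_max_scale(age)
scaled_height = min_max_scale(height)
print([round(v, 8) for v in scaled_age])
print([round(v, 8) for v in scaled_height])
```

Each feature is scaled with its own minimum and maximum, which is why Age and Height both end up spanning exactly 0 to 1.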
The same distribution of data, but rescaled in such a way that distances between points won’t be biased by differences in scale

➢ Neural networks often require their inputs to be bounded between 0 and 1.
➢ In images, for example, where pixels can only take on a specific range of RGB values, data may have to be normalised.
STANDARDIZE Z-Score
• A Z-Score describes how far a raw score falls from the mean, in standard deviation units
- Subtract the mean from the raw score
- Divide by the standard deviation
STANDARDIZE Z-Score
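• Used for comparing scores from different distributions
• Used for computing the relative frequency of scores from any distribution
• Standardised Z scores keep the shape of the raw-score distribution; when the raw scores are normally distributed, the Z scores follow the standard normal curve
• The Z scores have a mean of 0 and a standard deviation of 1
• Z scores exceeding the critical value are statistically significant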
Z-Score - An example
• For instance, suppose you went to college in New York and your
best friend went to college in Georgia.
• You might get a grade of 87 in a test with a mean of 77 and a
standard deviation of 5, and the same day your friend might get a
grade of 612 (mean 600, standard deviation 100).
• Although the two grades (87 and 612) can’t be compared directly,
the standardized values will allow you to immediately see who is
doing better compared with the rest of the class.
• (612 – 600) / 100 = 0.12, so your friend’s z-score or standardized
grade is 0.12. (87 – 77) / 5 = 2, your standardized grade.
• With standardized data you have grounds to boast to your friend
that you are doing much better than he or she is in the class
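
The two standardised grades can be reproduced with a small sketch. The helper names `z_score` and `standardise` are assumptions; `standardise` uses the population standard deviation (`statistics.pstdev`) so that the resulting scores have mean 0 and standard deviation 1, and it is applied to the Age list from the MinMax Scaler slide purely as sample input.

```python
from statistics import mean, pstdev


def z_score(raw, mu, sigma):
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mu) / sigma


your_z = z_score(87, 77, 5)         # your grade
friend_z = z_score(612, 600, 100)   # your friend's grade
print(your_z, friend_z)


def standardise(values):
    """Z-score every value against the mean and SD of the list itself."""
    mu, sigma = mean(values), pstdev(values)
    return [z_score(v, mu, sigma) for v in values]


ages = [44.9, 35.1, 28.2, 19.4, 28.9, 33.5, 22.0, 21.7, 30.9, 27.9]
z_ages = standardise(ages)
```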
Binarize Data
• All values above the threshold are marked 1 and all values equal to or below it are marked 0.
• This is called binarizing or thresholding your data. It can be useful when you have probabilities that you want to turn into crisp values.
• It is the process of thresholding numerical features to get Boolean values.
• It plays a key role in the discretization of continuous feature values.

from sklearn.preprocessing import Binarizer
# For age, let threshold be 35
# For salary, let threshold be 61000
binarizer_1 = Binarizer(threshold=35)
binarizer_2 = Binarizer(threshold=61000)

Example :
The pixel values of an 8-bit grayscale image are continuous data ranging between 0 (black) and 255 (white), and one needs the image to be black and white. Using Binarizer() with a threshold of 127, one can convert pixel values 0 – 127 to 0 and 128 – 255 to 1.
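
The thresholding rule itself is simple enough to sketch without any library. The function `binarize` and the sample pixel list below are illustrative assumptions that mirror the rule stated above (strictly above the threshold maps to 1, everything else to 0); they are not scikit-learn code.

```python
def binarize(values, threshold):
    """Map each value to 1 if it is strictly above the threshold, else 0."""
    return [1 if v > threshold else 0 for v in values]


pixels = [0, 64, 127, 128, 255]          # sample 8-bit grayscale values
black_and_white = binarize(pixels, threshold=127)
print(black_and_white)
```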
One Hot Encoding

• To process categorical features, which can be in numerical or text form, they must first be transformed into a numerical representation.
Label Encoding
• The first column is the country column, which is all text
• To convert this kind of categorical text data into model-understandable numerical data, we use the LabelEncoder
• After applying the LabelEncoder, you’ll see that the three countries in the first column have been replaced by the numbers 0, 1, and 2

The problem here is that, since there are different numbers in the same column, the model will misinterpret the data as being in some kind of order, 0 < 1 < 2. But this isn’t the case at all. To overcome this problem, we use the One Hot Encoder.
One Hot Encoding
• It takes a column of categorical data that has been label encoded and splits it into multiple columns
• For k distinct values, we can transform the feature into a k-dimensional vector with a single 1 and 0 everywhere else.
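
The two steps, label encoding followed by one hot encoding, can be sketched in plain Python. The helpers `label_encode` and `one_hot` are hypothetical names, the country list is sample input, and assigning codes in alphabetical order is an assumption of this sketch.

```python
def label_encode(values):
    """Assign each distinct category an integer code (alphabetical order)."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return classes, [index[v] for v in values]


def one_hot(codes, k):
    """Turn each code into a k-dimensional vector with a single 1."""
    return [[1 if position == code else 0 for position in range(k)] for code in codes]


countries = ["France", "Spain", "Germany", "Spain"]
classes, codes = label_encode(countries)
vectors = one_hot(codes, len(classes))

print(classes)   # category for each vector position
print(codes)     # label-encoded column
print(vectors)   # one hot encoded rows
```

Because every vector has exactly one 1, no category appears "larger" than another, which removes the false ordering that label encoding introduces.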
Label Binarizer
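• The LabelBinarizer class performs one hot encoding in a single step
# df is assumed to be a pandas DataFrame with 'color' and 'make' columns
from sklearn.preprocessing import LabelBinarizer
color_lb = LabelBinarizer()
make_lb = LabelBinarizer()
# One row per sample; one column per category when there are more than two categories
X = color_lb.fit_transform(df.color.values)
Xm = make_lb.fit_transform(df.make.values)

To convert from the one-hot encoded vector back into the original text category, the LabelBinarizer class provides the inverse_transform function

print(X)
# X[[0]] keeps the first row as a 2-D array (assumed here to be the row for "green")
green_ohe = X[[0]]
color_lb.inverse_transform(green_ohe)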
Thank You
