Data Preprocessing and Linear Regression
Data Scientist, Ganit Inc.
Data Preprocessing
Real World Data
Any Problem?

S.No | Credit_rating | Age | Income           | Credit_cards
1    | 0.00          | 21  | 10000            | y
2    | 1.0           |     | 2500             | n
3    | 2.0           | 62  | -500             | y
4    | 100.012       | 42  |                  | n
5    | yes           | 200 | 1                | y
6    | 30            | 0   | Seventy thousand | No
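Issues like these can be surfaced programmatically before any modelling. A minimal pandas sketch, with column names and values mirroring the table above (the DataFrame here is recreated by hand for illustration):

```python
import pandas as pd
import numpy as np

# Recreate the messy table from the slide (NaN marks the missing cells)
df = pd.DataFrame({
    "Credit_rating": [0.00, 1.0, 2.0, 100.012, "yes", 30],
    "Age":           [21, np.nan, 62, 42, 200, 0],
    "Income":        [10000, 2500, -500, np.nan, 1, "Seventy thousand"],
    "Credit_cards":  ["y", "n", "y", "n", "y", "No"],
})

print(df.dtypes)        # mixed types: Credit_rating and Income come out as 'object'
print(df.isna().sum())  # count of missing values per column
print(df.describe())    # implausible ranges: Age of 0 and 200, negative Income
```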
Data Preprocessing
● Data Cleaning
● Data Integration
● Data Reduction
● Data Transformation
Data Cleaning
1. Missing Data
● Central Imputation
● KNN Imputation
2. Noisy Data
● Smoothing
● Clustering
3. Outlier Removal
● Using Boxplot
Imputation
● Replace with the mean or the median
  ○ When should we use the mean?
● Replace with the nearest neighbour
  ○ How many nearest neighbours should we look at?

S.No | Qualification | Age | Income
1    | B.Tech        | 25  | 30k
2    | M.Tech        | 30  | 50k
3    | B.Tech        | 26  | 32k
4    | B.Tech        | 25  | ?
5    | M.Tech        | 29  | 60k
6    | B.Tech        | ?   | 30k
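A rough sketch of both ideas, assuming the table above is loaded with numeric Age and Income columns; scikit-learn's KNNImputer handles the nearest-neighbour variant:

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "Qualification": ["B.Tech", "M.Tech", "B.Tech", "B.Tech", "M.Tech", "B.Tech"],
    "Age":    [25, 30, 26, 25, 29, np.nan],
    "Income": [30000, 50000, 32000, np.nan, 60000, 30000],
})

# Central imputation: fill missing values with the column mean (or median)
central = df.copy()
central["Age"] = central["Age"].fillna(central["Age"].mean())
central["Income"] = central["Income"].fillna(central["Income"].median())

# KNN imputation: fill a missing cell from the k most similar rows
knn = df.copy()
knn[["Age", "Income"]] = KNNImputer(n_neighbors=2).fit_transform(knn[["Age", "Income"]])
```

The mean works for roughly symmetric columns; the median is the safer choice when the column contains outliers.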
Outlier
● Boxplot: values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are flagged as outliers
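A minimal sketch of the boxplot (IQR) rule, using a hypothetical numeric Income column:

```python
import pandas as pd

income = pd.Series([10000, 25000, 30000, 32000, 60000, 200000])  # hypothetical values

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]    # flagged by the boxplot rule
cleaned  = income[(income >= lower) & (income <= upper)]  # data with outliers removed
```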
Data Transformation
● Normalization
1. Min-Max Normalization: v' = (v - min) / (max - min)
2. Z-Score Normalization: v' = (v - mean) / (standard deviation)
3. Decimal Scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
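The three rescalings side by side, as a small sketch on a hypothetical numeric column:

```python
import numpy as np

v = np.array([2.0, 45.0, 120.0, 350.0, 986.0])  # hypothetical values

# 1. Min-max normalization: rescale to the range [0, 1]
min_max = (v - v.min()) / (v.max() - v.min())

# 2. Z-score normalization: zero mean, unit standard deviation
z_score = (v - v.mean()) / v.std()

# 3. Decimal scaling: divide by 10^j so all values fall inside (-1, 1)
j = int(np.ceil(np.log10(np.abs(v).max())))
decimal_scaled = v / 10 ** j
```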
Data Integration
Given the following x and Y values, what is the relationship Y = ?

x | Y
2 | 8
6 | 20
4 | 14
3 | 11
7 | 23
4 | 14
2 | 8
5 | 17

Relationship
Y = 2 + 3(X)

What is 2 here?
Find the Y in ?

x  | Y
10 | ?
1  | ?

Value for Y with the given X, using Y = 2 + 3(X):

x  | Y
10 | 32
1  | 5
Terminology

Y = 2 + 3(X)

Y = Model
2 = Intercept
3 = Slope
X = input
Formula for a line: Y = Intercept + Slope * X
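Putting the terminology together, a tiny sketch (names are illustrative) that evaluates the line for the x values in the table above:

```python
intercept, slope = 2, 3          # the 2 and 3 in Y = 2 + 3(X)

def model(x):
    """The model: a straight line Y = intercept + slope * X."""
    return intercept + slope * x

for x in [2, 6, 4, 3, 7, 4, 2, 5, 10, 1]:
    print(x, model(x))           # e.g. 10 -> 32, 1 -> 5
```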
Linear Regression
Welcome to the world of data science

What is linear?
A straight line.

What is Regression?
x | y
1 | 1
2 | 3
4 | 3
3 | 2
5 | 5
y = B0 + B1 * x
B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
B0 = mean(y) - B1 * mean(x)

x | mean(x) | x - mean(x) | y | mean(y) | y - mean(y)
1 | 3       | -2          | 1 | 2.8     | -1.8
2 | 3       | -1          | 3 | 2.8     | 0.2
4 | 3       | 1           | 3 | 2.8     | 0.2
3 | 3       | 0           | 2 | 2.8     | -0.8
5 | 3       | 2           | 5 | 2.8     | 2.2

B1 = 8 / 10 = 0.8
B0 = mean(y) - B1 * mean(x) = 2.8 - 0.8 * 3 = 0.4

y = 0.4 + 0.8 * x
x | y | predicted y
1 | 1 | 1.2
2 | 3 | 2.0
4 | 3 | 3.6
3 | 2 | 2.8
5 | 5 | 4.4

RMSE = 0.693
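The same calculation in code, as a small sketch that reproduces B1 = 0.8, B0 = 0.4 and the RMSE above:

```python
import math

x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

mean_x = sum(x) / len(x)   # 3
mean_y = sum(y) / len(y)   # 2.8

# B1 = sum((xi - mean(x)) * (yi - mean(y))) / sum((xi - mean(x))^2)
b1 = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
     / sum((xi - mean_x) ** 2 for xi in x)          # 8 / 10 = 0.8
b0 = mean_y - b1 * mean_x                           # 2.8 - 0.8 * 3 = 0.4

predictions = [b0 + b1 * xi for xi in x]            # [1.2, 2.0, 3.6, 2.8, 4.4]
rmse = math.sqrt(sum((p - yi) ** 2 for p, yi in zip(predictions, y)) / len(y))
print(b0, b1, round(rmse, 3))                       # 0.4 0.8 0.693
```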
Gradient Descent
Any Suggestions?
● Line of best fit
● Learning Rate
● Momentum
● Partial Derivative
[Plots: line of best fit after gradient descent steps 2 to 5]
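A minimal gradient-descent sketch for the same data (the learning rate and iteration count are arbitrary choices, and momentum is left out for brevity):

```python
x = [1, 2, 4, 3, 5]
y = [1, 3, 3, 2, 5]

b0, b1 = 0.0, 0.0   # start with a flat line
lr = 0.01           # learning rate: size of each update step
n = len(x)

for step in range(5000):
    # partial derivatives of the mean squared error with respect to b0 and b1
    grad_b0 = (2 / n) * sum((b0 + b1 * xi - yi) for xi, yi in zip(x, y))
    grad_b1 = (2 / n) * sum((b0 + b1 * xi - yi) * xi for xi, yi in zip(x, y))
    # move both coefficients against the gradient
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 2), round(b1, 2))   # converges towards 0.4 and 0.8
```

Momentum, if added, would mix a fraction of the previous update into each step to speed up convergence.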
Advantage of Linear Regression