Preprocessing ch.1
PREPROCESSING FOR MACHINE LEARNING IN PYTHON
James Chapman
Curriculum Manager, DataCamp
What is data preprocessing?
After exploratory data analysis and data cleaning
Preparing data for modeling
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Prop_ID         33 non-null     object
 1   Name            33 non-null     object
 2   Location        33 non-null     object
 3   Park_Name       33 non-null     object
 4   Length          29 non-null     object
 5   Difficulty      27 non-null     object
 6   Other_Details   31 non-null     object
 7   Accessible      33 non-null     object
 8   Limited_Access  33 non-null     object
 9   lat             0 non-null      float64
 10  lon             0 non-null      float64
dtypes: float64(2), object(9)
memory usage: 3.0+ KB
Before:
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

After:
     A    B    C
1  4.0  7.0  3.0
4  5.0  9.0  7.0
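
The slide's code did not survive extraction, but the output above is consistent with removing every row that contains at least one missing value. A minimal sketch, assuming the DataFrame is named `df` and was built as below (both are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example DataFrame shown above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# dropna() with no arguments keeps only rows that have no missing values
complete_rows = df.dropna()
print(complete_rows)
```

Only rows 1 and 4 are complete, so they are the only rows that remain.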
Before:
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

After:
     A    B    C
0  1.0  NaN  2.0
4  5.0  9.0  7.0
Before:
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

After:
     B    C
0  NaN  2.0
1  7.0  3.0
2  NaN  NaN
3  7.0  NaN
4  9.0  7.0
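
The two results above look like rows 1 to 3 being dropped by index label, and then column A being dropped. The original calls are not in the extracted text, so the following is only a sketch of one way to get the same output with `DataFrame.drop`, again assuming the DataFrame is named `df`:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example DataFrame shown above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# Drop specific rows by passing a list of index labels
rows_dropped = df.drop([1, 2, 3])
print(rows_dropped)

# Drop a column by name; axis=1 tells pandas to look in the columns
column_dropped = df.drop("A", axis=1)
print(column_dropped)
```

Neither call modifies `df` in place; each returns a new DataFrame.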
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

Missing values per column:
A    1
B    2
C    2
dtype: int64

print(df.dropna(subset=["B"]))
     A    B    C
1  4.0  7.0  3.0
3  NaN  7.0  NaN
4  5.0  9.0  7.0
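
The per-column counts above match what `df.isna().sum()` would report, although that call is an assumption since only its output survived. A short sketch that pairs it with the `dropna(subset=["B"])` call from the slide:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example DataFrame shown above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop only the rows where column B is missing; NaNs elsewhere are kept
b_present = df.dropna(subset=["B"])
print(b_present)
```

Row 3 survives even though A and C are missing there, because `subset` restricts the check to column B.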
Before:
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
2  7.0  NaN  NaN
3  NaN  7.0  NaN
4  5.0  9.0  7.0

After:
     A    B    C
0  1.0  NaN  2.0
1  4.0  7.0  3.0
4  5.0  9.0  7.0
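
Here rows 2 and 3, which each have only one non-missing value, are gone, while row 0 with two non-missing values is kept. That is what `dropna` does with `thresh=2`; the exact call is an assumption inferred from the output:

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example DataFrame shown above
df = pd.DataFrame({
    "A": [1.0, 4.0, 7.0, np.nan, 5.0],
    "B": [np.nan, 7.0, np.nan, 7.0, 9.0],
    "C": [2.0, 3.0, np.nan, np.nan, 7.0],
})

# thresh is the minimum number of NON-missing values a row needs to be kept
at_least_two = df.dropna(thresh=2)
print(at_least_two)
```

Note that `thresh` counts present values, not missing ones, which is easy to get backwards.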
Why are types important?
print(volunteer.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 35 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   opportunity_id  665 non-null    int64
 1   content_id      665 non-null    int64
 2   vol_requests    665 non-null    int64
 3   event_time      665 non-null    int64
 4   title           665 non-null    object
 ..  ...             ...             ...
 34  NTA             0 non-null      float64
dtypes: float64(13), int64(8), object(14)
memory usage: 182.0+ KB

object : string/mixed types
int64 : integer
float64 : float
datetime64 : dates and times
   A        B    C
0  1   string  1.0
1  2  string2  2.0
2  3  string3  3.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       3 non-null      int64
 1   B       3 non-null      object
 2   C       3 non-null      object
dtypes: int64(1), object(2)
memory usage: 200.0+ bytes
   A        B    C
0  1   string  1.0
1  2  string2  2.0
2  3  string3  3.0

A      int64
B     object
C    float64
dtype: object
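
In the two outputs above, column C goes from `object` to `float64`, which suggests a type conversion in between. The conversion code is not in the extracted text; one common way to do it is `astype`, sketched here under the assumption that C originally held numbers stored as strings:

```python
import pandas as pd

# Assumed reconstruction: column C holds numeric values stored as strings
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": ["string", "string2", "string3"],
    "C": ["1.0", "2.0", "3.0"],
})
print(df.dtypes)  # C is not numeric yet

# Convert column C to floating point so it can be used in numeric operations
df["C"] = df["C"].astype("float")
print(df.dtypes)
```

`astype` raises an error if any value cannot be parsed as a number, so it works best on columns that are already clean.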
Why split?
1. Reduces overfitting
X_train y_train
0 1.0 n
1 4.0 n
...
5 5.0 n
6 6.0 n
X_test y_test
0 9.0 y
1 1.0 n
2 4.0 n
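
The tables above show seven training rows and three test rows. The splitting code itself is missing from the extracted text; a likely candidate is scikit-learn's `train_test_split`, which holds out 25% of the rows by default. A minimal sketch with made-up data (the values, column names, and `random_state` are all hypothetical):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten feature values with a binary label
X = pd.DataFrame({"feature": [1.0, 4.0, 5.0, 6.0, 9.0, 2.0, 3.0, 7.0, 8.0, 10.0]})
y = pd.Series(["n", "n", "n", "n", "y", "n", "n", "y", "y", "y"], name="label")

# With no test_size given, 25% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(len(X_train), len(X_test))
```

With ten rows, the default split leaves seven for training and three for testing, the same sizes as in the tables above.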
y["labels"].value_counts()
class1 80
class2 20
Name: labels, dtype: int64
Training set:
class1    60
class2    15
Name: labels, dtype: int64

Test set:
class1    20
class2     5
Name: labels, dtype: int64
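
The counts above keep the original 80/20 class ratio in both the 75-row training set and the 25-row test set, which is what a stratified split produces. A sketch of how this could be done with the `stratify` argument of scikit-learn's `train_test_split`; the feature column and `random_state` are hypothetical, and only the label counts come from the slide:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data matching the counts above: 80 class1 and 20 class2 labels
y = pd.DataFrame({"labels": ["class1"] * 80 + ["class2"] * 20})
X = pd.DataFrame({"feature": range(100)})

# stratify keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y["labels"], random_state=42
)

print(y_train["labels"].value_counts())
print(y_test["labels"].value_counts())
```

Without `stratify`, a random split of an imbalanced dataset can leave the minority class over- or under-represented in the test set.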