6 - Machine Learning 2
(Continued)
Welcome to Machine Learning!
https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
Splitting a Dataset with scikit-learn
https://towardsdatascience.com/splitting-a-dataset-e328dab2760a
Scikit-learn
• Train, Test split
• Features, target
• fit() - for training
• model
• new_features
• model.predict() - for testing
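The workflow in the bullets above can be sketched as follows. This is a minimal illustration, not the only way to do it: the iris dataset and the k-nearest-neighbors classifier are assumptions chosen for convenience, and the variable names simply mirror the bullets.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Features and target (iris is only a stand-in dataset for this sketch).
features, target = load_iris(return_X_y=True)

# Train/test split: hold out 25% of the rows for testing.
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.25, random_state=0
)

# fit() trains the model on the training portion only.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_features, train_target)

# predict() on the held-out test features, for testing.
predictions = model.predict(test_features)

# predict() also works on brand-new rows with the same number of columns.
new_features = [[5.1, 3.5, 1.4, 0.2]]
new_prediction = model.predict(new_features)
```

Any other scikit-learn estimator could take the place of the classifier here, since they share the same fit() and predict() interface.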
Concepts on Features
• Feature extraction
• Numerical features: feature scaling
• Categorical features:
• One-Hot Encoding
• Ordinal Encoding
A Note on Feature Extraction
• Not all your data will be ready to input into an algorithm.
Preprocessing!
• fit() computes the mean and variance of the data
• transform() scales the data using that mean and variance
• fit_transform() does both!
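A small sketch of these calls, assuming the scaler in question is scikit-learn's StandardScaler (the slide does not name it) and using made-up toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration: two numerical feature columns.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
scaler.fit(train)                       # learns per-column mean and variance
train_scaled = scaler.transform(train)  # applies the learned statistics
test_scaled = scaler.transform(test)    # reuses the training statistics

# fit_transform() performs fit() and transform() in one call.
train_scaled_again = StandardScaler().fit_transform(train)
```

A common design choice is to call fit() (or fit_transform()) on the training data only and then transform() on the test data, so that the test set does not influence the learned statistics.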
Dealing with Categorical Variables as Features
• Feature Encoding:
• One-Hot Encoding
• Ordinal Encoding
https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
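The two encodings listed above might look like this with scikit-learn's OneHotEncoder and OrdinalEncoder. The color and size values are invented for illustration, and the small < medium < large ordering is an assumption passed in explicitly.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical columns (one column each, as 2D arrays).
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# One-hot: one 0/1 column per category, ordered alphabetically
# (blue, green, red). toarray() converts the sparse result to dense.
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Ordinal: one integer code per category, using the order we supply.
ordinal = OrdinalEncoder(
    categories=[["small", "medium", "large"]]
).fit_transform(sizes)
```

One-hot encoding suits categories with no natural order, while ordinal encoding suits categories where the order carries meaning.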
A shortcut from Pandas
• get_dummies()
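A quick sketch of get_dummies() on a made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: one categorical column and one numerical column.
df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "price": [1.0, 2.0, 3.0],
})

# One-hot encode only the "color" column; "price" is left untouched.
encoded = pd.get_dummies(df, columns=["color"])

# New columns are named <original column>_<category>.
column_names = list(encoded.columns)
```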
Encoding the Labels
• LabelEncoder can be used to normalize labels
• It can also transform categorical labels into numerical labels
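As a minimal sketch, here is scikit-learn's LabelEncoder applied to a made-up list of string labels (the animal names are only an example):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels.
labels = ["cat", "dog", "cat", "bird"]

encoder = LabelEncoder()
# Classes are sorted alphabetically: bird=0, cat=1, dog=2.
encoded = encoder.fit_transform(labels)

# inverse_transform() maps the integers back to the original labels.
decoded = encoder.inverse_transform(encoded)
```

LabelEncoder is meant for the target column; for categorical features, the one-hot and ordinal encoders are the usual choice.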
Which Model to Choose? Underfitting vs Overfitting
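One way to see the difference, as a rough sketch: train decision trees of different depths and compare the score on the training data with the score on the test data. The breast cancer dataset and the particular depths are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Map each depth to (training accuracy, testing accuracy).
scores = {}
for depth in (1, 3, None):  # None lets the tree grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    scores[depth] = (tree.score(X_train, y_train), tree.score(X_test, y_test))

for depth, (train_score, test_score) in scores.items():
    print(depth, round(train_score, 3), round(test_score, 3))
```

A model that scores poorly on both sets is likely underfitting. A model that scores much better on the training set than on the test set is likely overfitting.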
Model Improvement - if you care to know…
• Validation dataset
• Hyperparameter tuning
• Cross-validation
• Cost functions
• Regularization
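Two of these ideas, cross-validation and hyperparameter tuning, can be sketched with scikit-learn's cross_val_score and GridSearchCV. The dataset, the decision tree model, and the max_depth grid are all assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validation: 5 different train/validation splits, one score each.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)

# Hyperparameter tuning: try each max_depth value with 5-fold
# cross-validation and keep the one with the best average score.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
)
search.fit(X, y)
best_depth = search.best_params_["max_depth"]
```

Limiting max_depth also acts as a simple form of regularization for a tree, since it restricts how closely the model can fit the training data.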
Activity for Group Project
Run a decision tree algorithm with scikit-learn on the dataset you chose for your project.
• Remember to separate features from targets. You might also have to convert your data to a NumPy array.
• Remember to split the data into training and testing sets appropriately.
• Fit the decision tree (either classifier or regressor) to your data.
• Predict using the testing features.
• Compare the expected values (test_target) to your predictions.
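A possible starting point for the activity, as a sketch only. The iris data stands in for your project dataset, the "species" column name is hypothetical, and a regression target would call for DecisionTreeRegressor and a different metric instead.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in for your project data; replace with your own DataFrame.
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target  # hypothetical target column name

# Separate features from the target and convert both to NumPy arrays.
features = df.drop(columns=["species"]).to_numpy()
target = df["species"].to_numpy()

# Train/test split: 80% for training, 20% for testing.
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

# Fit the decision tree on the training data.
model = DecisionTreeClassifier(random_state=42)
model.fit(train_features, train_target)

# Predict using the testing features.
predictions = model.predict(test_features)

# Compare the expected values (test_target) to the predictions.
accuracy = accuracy_score(test_target, predictions)
print("Accuracy:", accuracy)
```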