Kaggle Course Notes
2021-04-11
Junfan Zhu
([email protected]; [email protected])
Course Links
https://www.kaggle.com/learn/
Table of Contents
• Kaggle Course Notes
• Junfan Zhu
• Course Links
• Table of Contents
• 1. Machine Learning
• 1.1. Decision Tree
• 1.2. Random Forest
• 1.3. Missing Values
• 1.4. Categorical Variables
• 1.5. Pipelines
– 1.5.1. Step 1: Define Preprocessing Steps
– 1.5.2. Step 2: Define the Model
– 1.5.3. Step 3: Create and Evaluate the Pipeline
• 1.6. Cross Validation
• 1.7. XGBoost
– 1.7.1. Parameter Tuning
• 1.8. Prevent Data Leakage
– 1.8.1. Target Leakage
– 1.8.2. Train-Test Contamination
• 2. Pandas
• 2.1. DataFrame and Series
• 2.2. Indexing
• 2.3. Label-based selection
• 2.4. Functions and Maps
• 2.5. Grouping and Sorting
– 2.5.1. Group by
– 2.5.2. Multi-indexes
– 2.5.3. Sorting
• 2.6. Data Types
• 2.7. Renaming and Combining
• 3. Data Visualization
• 3.1. Plot
• 3.2. Bar Chart
• 3.3. Heat Map
• 3.4. Scatter Plots
• 3.5. Distribution
• 3.6. Plot Types
• 4. Feature Engineering
• 4.1. Mutual Info (MI)
• 4.2. Creating Features
• 4.3. Group Transforms
• 4.4. Features’ Pros and Cons
• 4.5. K-Means Clustering
• 4.6. PCA
• 4.7. Target Encoding
• 4.8. Feature Engineering for House Prices
– 4.8.1. Data Preprocessing
– 4.8.2. Clean Data
– 4.8.3. Establish Baseline
– 4.8.4. Feature Utility Scores
– 4.8.5. Create Features
– 4.8.6. K-Means Clustering
– 4.8.7. PCA
– 4.8.8. Target Encoding
– 4.8.9. Create Final Feature Set
– 4.8.10. Hyperparameter Tuning
– 4.8.11. Train Model
• 5. Data Cleaning
• 5.1. Missing data
• 5.2. Scaling
• 5.3. Parsing Dates
• 5.4. Character Encodings
• 5.5. Inconsistent Data Entry
– 5.5.1. Fuzzy matching to correct inconsistent data entry
• 6. Intro to SQL
• 6.1. Dataset
• 6.2. Table Schema
• 6.3. Queries
• 6.4. Count, Group by
– 6.4.1. Count
– 6.4.2. Group by
– 6.4.3. Group by . . . Having
– 6.4.4. Aliasing
• 6.5. Order By
• 6.6. EXTRACT
• 6.7. As & With
• 6.8. Joining Data
• 7. Advanced SQL
• 7.1. Join & Union
– 7.1.1. Join
– 7.1.2. Union
• 7.2. Analytic Functions
– 7.2.1. Over
– 7.2.2. Window frame clauses
– 7.2.3. 3 types of analytic functions
• 7.3. Nested and Repeated Data
– 7.3.1. Nested data
– 7.3.2. Repeated data
• 7.4. Writing Efficient Queries
• 8. Deep Learning
• 8.1. Linear Units in Keras
• 8.2. Deep NN
• 8.3. Sequential models
• 8.4. Stochastic Gradient Descent
– 8.4.1. Loss Function
– 8.4.2. Optimizer - Stochastic Gradient Descent
– 8.4.3. Red Wine Quality
• 8.5. Overfitting and Underfitting
– 8.5.1. Early Stopping
– 8.5.2. Red Wine Example Again
• 8.6. Dropout and Batch Normalization
– 8.6.1. Dropout
– 8.6.2. Batch Normalization (batchnorm)
– 8.6.3. Red Wine Example Again. . .
• 8.7. Binary Classification
• 8.8. Detecting Higgs Boson with TPUs
– 8.8.1. Wide and Deep Network
– 8.8.2. Load Data
– 8.8.3. Model
– 8.8.4. Training
• 9. Computer Vision
• 9.1. CNN Classifier
• 9.2. Convolution, ReLU, Max Pooling
– 9.2.1. Feature Extraction
– 9.2.2. Weights/Kernels
– 9.2.3. Activations/Feature Maps
– 9.2.4. Detect with ReLU
– 9.2.5. Max Pooling
– 9.2.6. Translation Invariance
– 9.2.7. Sliding Window
• 9.3. CNN
– 9.3.1. Model
– 9.3.2. Train
• 9.4. Data Augmentation
• 10. Machine Learning Explainability
• 10.1. Permutation Importance
• 10.2. Partial Dependence Plots
– 10.2.1. 2D Partial Dependence Plots
• 10.3. SHAP Values
– 10.3.1. Break down components of individual predictions
– 10.3.2. Summary Plots
– 10.3.3. SHAP Dependence Contribution Plots
1. Machine Learning
If we measure accuracy on the same data used to build the model, we get an in-sample
score. Using a single sample of data for both fitting and scoring is misleading: the
pattern was derived from the training data, so it may not hold in practice. We
need to exclude some data from the model-building process, and then use that data
to test the model's accuracy on data it hasn't seen before; this held-out portion is the
validation data.
model = DecisionTreeRegressor()
model.fit(train_X, train_y)
val_predictions = model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
Controlling tree depth trades off overfitting against underfitting: the more leaves we allow the
model to have, the more we move from the underfitting area to the overfitting
area.
We use a utility function to compare MAE scores from different values of
max_leaf_nodes.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
model = DecisionTreeRegressor(max_leaf_nodes = max_leaf_nodes, random_state = 0)
model.fit(train_X, train_y)
preds_val = model.predict(val_X)
mae = mean_absolute_error(val_y, preds_val)
return (mae)
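A minimal sketch of using get_mae to choose a tree size; the candidate values below are arbitrary, not the notes' exact choices:

for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print('Max leaf nodes: %d \t MAE: %d' % (max_leaf_nodes, my_mae))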
X = data[features]
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))
Imputation fills in missing values with some number.
One-Hot encoding creates new columns indicating the presence of each possible
value in the original data. It doesn't assume an ordering of the categories (Red
is not more or less than Yellow), so it is good for nominal variables.
# Approach 1: Drop Categorical Variables
drop_X_train = X_train.select_dtypes(exclude=[’object’])
drop_X_valid = X_valid.select_dtypes(exclude=[’object’])
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
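The complementary one-hot approach could look like the sketch below; object_cols (the list of categorical column names) and the encoder settings are assumptions, not the notes' exact code.

from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# One-hot encode the categorical columns (handle_unknown avoids errors when the
# validation data contains classes not seen in training; newer scikit-learn
# versions use sparse_output=False instead of sparse=False)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]), index=X_train.index)
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]), index=X_valid.index)
# Drop the original categorical columns and append the encoded ones
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))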
1.5. Pipelines
A pipeline bundles preprocessing and modeling steps so you can use the whole
bundle as if it were a single step.
1.5.1. Step 1: Define Preprocessing Steps
Use Pipeline class to define pipeline that bundles the preprocessing and model-
ing steps.
• With a pipeline, we preprocess the training data and fit the model in a
single line of code. Without a pipeline, we have to do imputation, one-hot
encoding, and model training in separate steps. This becomes messy if we
have to deal with both numerical and categorical variables.
• With a pipeline, we supply the unpreprocessed features in X_valid to
the predict() command, and the pipeline automatically preprocesses
the features before generating predictions. Without a pipeline, we have to
remember to preprocess the validation data before making predictions.
# Evaluate model
score = mean_absolute_error(y_valid, preds)
print(’MAE:’, score)
• For small datasets, where the extra computational burden isn't a big deal, you
should run cross-validation.
• For large datasets, a single validation set is sufficient; there's little need to
reuse some of it for a holdout.
Define a pipeline that uses an imputer to fill in missing values, and a random
forest to make predictions.
my_pipeline = Pipeline(steps = [
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50))
])
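With the pipeline defined, cross-validation (Section 1.6) becomes a one-liner; a sketch, where the 5-fold split and the scoring string are typical defaults rather than the notes' exact settings:

from sklearn.model_selection import cross_val_score

# Multiply by -1 because scikit-learn reports a *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
print('Average MAE score:', scores.mean())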
1.7. XGBoost
XGBoost has few parameters that can dramatically affect accuracy and training
speed.
n_estimators specifies how many times to go through the modeling cycle
described below. It equals the number of models we use in the ensemble. Typical
values range from 100 to 1000, and depend on learning_rate.
The modeling cycle: make predictions ⇒ calculate loss ⇒ train a new model ⇒ add the new model to the ensemble
⇒ repeat.
• Too low value ⇒ underfitting, inaccurate predictions on both training data
and testing data.
• Too high value ⇒ overfitting, accurate predictions on training data, inac-
curate predictions on test data.
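A hedged sketch of putting these pieces together with early stopping, so a too-large n_estimators is cut off automatically once the validation score stops improving; parameter values and the X_train/X_valid names are illustrative, and older XGBoost versions accept early_stopping_rounds in fit() while newer ones move it to the constructor:

from xgboost import XGBRegressor

my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5, # stop after 5 rounds without validation improvement
             eval_set=[(X_valid, y_valid)],
             verbose=False)
predictions = my_model.predict(X_valid)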
Data Leakage. Leakage happens when the training data contains information about the target,
but similar data won't be available when the model is used for prediction. We then get high
performance on the training set, but the model will perform poorly in production.
Leakage causes a model to look accurate until you start making decisions with the
model, at which point it becomes very inaccurate.
Target leakage occurs when your predictors include data that won't be available at the time you make
predictions. Any variable updated after the target value is realized should be
excluded.
1.8.2. Train-Test Contamination
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]
print(’Fraction of those who didn\’t receive a card and had no expenditures: %.2f’ \
%((expenditures_noncardholders == 0).mean()))
Results: everyone who didn't receive a card had no expenditures, while only
2% of those who received a card had no expenditures. Although our model
appears to have high accuracy, this is target leakage: expenditures
probably means expenditures on the card they applied for. Since share is partially
determined by expenditure, it should be excluded too.
# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y, cv=5, scoring = ’accuracy’)
print(’Cross-val accuracy: %f’ % cv_scores.mean())
The accuracy is lower now, but we can expect it to be right about 80% of the time on new
applications, whereas the leaky model, despite its higher cross-validation score, would likely do much worse in practice.
2. Pandas
A DataFrame is a table of entries, for example:
Yes No
A 50 131
B 21 2
A Series is a single column of a DataFrame:
pd.Series([30,35,40], index = [’2015 Sales’, ’2016 Sales’, ’2017 Sales’], name = ’A’)
To see how large the DataFrame is, check its shape. Sometimes a CSV file has a built-in index, which
pandas doesn't pick up automatically, so we can specify an index_col.
2.2. Indexing
Both loc and iloc are row-first, column-second. It’s different from basic
Python’s column-first and row-second.
Select the first row of data:
reviews.iloc[0]
For loc, it's the data index value, not its position, that matters.
Get the first entry in reviews:
reviews.loc[0,'country']
Note: iloc is conceptually simpler than loc because it ignores the dataset's indices. iloc
treats the dataset like a matrix that we index into by position. In contrast, loc
uses the information in the indices.
reviews.loc[:,[’Name’,’Twitter’,’points’]]
0 Sam @Sam 87
1 Kate @Kate 90
• iloc uses Python stdlib indexing scheme, where the first element of the
range is included and last one excluded. 0:10 will select entries 0,...,9.
But loc indexes inclusively, 0:10 will select 0,...,10.
• loc can index any stdlib type, including strings. DataFrame with index val-
ues Apples, ..., Bananas, then df.loc[’Apples’:’Bananas’] is more
convenient.
• df.iloc[0:1000] will return 1000 entries, while df.loc[0:1000] return
1001 entries, and you need to change to df.loc[0:999].
set_index() method
reviews.set_index(’title’)
To select relevant data, use conditional selection. isin can select data whose
value is in a list of values. isnull can highlight values which are NaN.
map() is used to create new representations from existing data, or for transforming
data. The function you pass to map() receives a single value from the Series; map()
returns a new Series where all the values have been transformed by the function.
Remean the scores the wines received to 0:
review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)
def remean_points(row):
row.points = row.points - review_points_mean
return row
reviews.apply(remean_points, axis = ’columns’)
groupby() creates a group of reviews which allotted the same point values to
given wines. For each of these groups, we grabbed the points column and counted
how many times it appeared. value_counts() is a shortcut. Each group we
generate is a slice of the DataFrame containing data with values that match. With the
apply() method, we can manipulate the data in any way we see fit; here we want to
select the name of the first wine reviewed from each winery. We can also group
by more than one column.
reviews.groupby(’points’).points.count()
reviews.groupby(’points’).points.min()
reviews.groupby([’country’,’province’]).apply(lambda df: df.loc[df.points.idxmax()])
2.5.2. Multi-indexes
2.5.3. Sorting
# Sort by index
countries_reviewed.sort_index()
reviews[pd.isnull(reviews.country)]
# replace NaN with Unknown
reviews.region.fillna(’Unknown’)
# replace non-null values
reviews.name.replace(’apple’,’banana’)
2.7. Renaming and Combining
rename() changes index names and/or column names, and rename index or
column values by specifying index or column.
But to rename index (not columns) values, set_index() is more convenient.
rename_axis() can change names, as both row and column indices have their
own name attribute.
reviews.rename(columns={’points’:’score’})
reviews.rename(index={0:’firstEntry’, 1:’secondEntry’})
reviews.rename_axis(’wines’, axis=’rows’).rename_axis(’fields’, axis=’columns’)
concat() combines.
data1 = pd.read_csv(’...’)
data2 = pd.read_csv(’...’)
pd.concat([data1, data2])
3. Data Visualization
# Setup
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print(’setup complete’)
# Load data
path = ’...’
index_col = 'Date': column to use as row labels. When we load the data, we want
each entry in the first column to denote a different row, so we set index_col to
the name of the first column (Date).
parse_dates = True: recognize the row labels as dates; this tells the notebook to
interpret each row label as a date.
3.1. Plot
sns.lineplot says we want to create a line chart. sns.barplot, sns.heatmap.
list(data.columns)
# to plot only a single column, and add label..
# ..to make the line appear in the legend and set its label
sns.lineplot(data = data[’Sharpe Ratio’], label = "Sharpe Ratio")
plt.xlabel(’Date’)
3.3. Heat Map
annot=True ensures that the values for each cell appear on the chart. Leaving
this out will remove the numbers from each of the cells.
plt.figure(figsize=(14,7))
# Heat map showing average arrival delay for each airline by month
sns.heatmap(data=flight_data, annot=True)
plt.xlabel(’Airline’)
Color-code the points by smoker, and plot the other two columns, bmi and charges,
on the axes. sns.lmplot adds two regression lines, corresponding to smokers
and nonsmokers.
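A sketch of the plots described above; the DataFrame name insurance_data is an assumption:

sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])
sns.lmplot(x='bmi', y='charges', hue='smoker', data=insurance_data)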
3.5. Distribution
2D KDE.
Color-coded plots
Break dataset into 3 separate files, one for each species.
iris_set_filepath = ’...’
iris_ver_filepath = ’...’
iris_vir_filepath = ’...’
iris_set_data = pd.read_csv(iris_set_filepath, index_col = 'Id')
iris_ver_data = pd.read_csv(iris_ver_filepath, index_col = 'Id')
iris_vir_data = pd.read_csv(iris_vir_filepath, index_col = 'Id')
sns.distplot(a=iris_set_data['Petal Length'], label = 'Iris-setosa', kde = False)
sns.distplot(a=iris_ver_data['Petal Length'], label = 'Iris-versicolor', kde = False)
sns.distplot(a=iris_vir_data['Petal Length'], label = 'Iris-virginica', kde = False)
plt.title(’histogram of petal length by species’)
# Force legend to appear
plt.legend()
Plot types fall into three groups:
• Trends: line charts.
• Relationships: bar charts, heat maps, scatter plots, regression plots.
• Distributions: histograms, KDE plots.
data = pd.read_csv(path, index_col = ’Date’, parse_dates = True)
sns.set_style(’dark’) # dark theme
# Line chart
plt.figure(figsize = (12,6))
sns.lineplot(data=data)
4. Feature Engineering
• Determine which features are most important with mutual info.
• Invent new features in several real-world problem domains
• Encode high-cardinality categoricals with a target encoding
• Create segmentation features with k-means clustering
• Decompose a dataset’s variation into features with PCA
Goal
Adding a few synthetic features to dataset can improve the predictive performance
of a random forest model.
Various ingredients go into each variety of concrete. Adding some additional
synthetic features derived from these can help a model learn important
relationships among them.
First, establish a baseline by training the model on the un-augmented dataset,
so we can determine whether our new features are useful.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
X = df.copy()
y = X.pop(’CompressiveStrength’)
Ratio of ingredients in a recipe is a better predictor of how the recipe turns out
than absolute amounts. The ratios of the features above are a good predictor of
CompressiveStrength.
X = df.copy()
y = X.pop(’CompressiveStrength’)
X = df.copy()
y = X.pop(’price’)
#Label encoding for categoricals
for colname in X.select_dtypes(’object’):
X[colname], _ = X[colname].factorize()
# All discrete features should now have integer dtypes (double-check before MI)
discrete_features = X.dtypes == int
def make_mi_scores(X,y,discrete_features):
mi_scores = mutual_info_regression(X,y,discrete_features=discrete_features)
mi_scores = pd.Series(mi_scores, name=’MI Scores’, index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)
return mi_scores
mi_scores = make_mi_scores(X,y,discrete_features)
mi_scores[::3] # show a few features with MI scores
Identify some potential features, and develop them. The more complicated a
combination is, the more difficult it will be for a model to learn.
autos['stroke_ratio'] = autos.stroke/autos.bore
autos[[’stroke’,’bore’,’stroke_ratio’]].head()
autos[’displacement’] = (
np.pi * ((0.5 * autos.bore) ** 2) * autos.stroke * autos.num_of_cylinders
)
# If feature has 0.0 values, use np.log1p (log(1+x)) instead of np.log
accidents['LogWindSpeed'] = accidents.WindSpeed.apply(np.log1p)
# Plot a comparison
fig, axs = plt.subplots(1,2,figsize=(8,4))
sns.kdeplot(accidents.WindSpeed, shade=True, ax=axs[0])
sns.kdeplot(accidents.LogWindSpeed, shade=True, ax=axs[1])
roadway_features = [’Amenity’,’Bump’,’Crossing’]
accidents[’RoadwayFeatures’] = accidents[roadway_features].sum(axis=1)
accidents[roadway_features + ['RoadwayFeatures']].head(10)
components = [’Cement’,’Water’]
concrete[’Components’] = concrete[components].gt(0).sum(axis=1)
concrete[components + [’Components’]].head(10)
customer[’AverageIncome’] = (
customer.groupby(’State’) # for each state
[’Income’] # select income
.transform(’mean’) # compute mean
)
customer[[’State’,’Income’,’AverageIncome’]].head(10)
Built-in aggregations like mean can be passed as a string to transform.
customer[’StateFreq’]= (
customer.groupby(’State’)
[’State’]
.transform(’count’)
/ customer.State.count()
)
customer[[’State’,’StateFreq’]].head(10)
# create splits
df_train = customer.sample(frac=0.5)
df_valid = customer.drop(df_train.index)
• Linear models learn sums and differences, but can't learn complex proper-
ties.
• Ratios are difficult for most models to learn. Ratio combinations often lead to easy
performance gains.
• Linear models and neural nets do better with normalized features. Neural nets need
features scaled to values not too far from 0. Tree-based models (random
forests and XGBoost) can benefit from normalization, but much less so.
• Tree models can learn to approximate almost any combination of features,
but when a combination is especially important they can benefit from
having it explicitly created, especially when data is limited.
• Counts are helpful for tree models, since these models don't have a natural way
of aggregating info across many features at once.
• assign points to the nearest cluster centroid
• move each centroid to minimize the distance to its points
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 6)
X['Cluster'] = kmeans.fit_predict(X)
X['Cluster'] = X['Cluster'].astype('category')
sns.relplot(
    x = 'Longitude', y = 'Latitude', hue = 'Cluster', data = X, height = 6
)
X[’MedHouseVal’] = df[’MedHouseVal’]
sns.catplot(x=’MedHouseVal’,y=’Cluster’, data=X, kind=’boxen’, height=6)
4.6. PCA
features = [’horsepower’,’curb_weight’]
X = df.copy()
y = X.pop(’price’)
X = X.loc[:,features]
# Standardize
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
# Fit PCA and create the principal components
from sklearn.decomposition import PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)
# Wrap in a DataFrame
component_names = [f'PC{i+1}' for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns = component_names)
X_pca.head()
# wrap up loadings in a dataframe
loadings = pd.DataFrame(
    pca.components_.T, # transpose the matrix of loadings
    columns = component_names, # columns are principal components
    index = X.columns, # rows are original features
)
loadings
plot_variance(pca)
mi_scores = make_mi_scores(X_pca, y, discrete_features = False)
# third component shows contrast between the first two
idx = X_pca[’PC3’].sort_values(ascending=False).index
cols = ['horsepower','curb_weight']
df.loc[idx, cols]
# create new ratio feature to express contrast
df[’sports_or_wagon’] = X.curb_weight / X.horsepower
sns.regplot(x=’sports_or_wagon’, y=’price’, data=df, order=2)
Target encoding = replace a feature’s categories with some number derived from
the target. mean encoding, bin encoding.
autos[’make_encoded’] = autos.groupby(’make’)[’price’].transform(’mean’)
autos[[’make’,’price’,’make_encoded’]].head(10)
X = df.copy()
y = X.pop(’Rating’)
X_encode = X.sample(frac=0.25)
y_encode = y[X_encode.index]
X_pretrain = X.drop(X_encode.index)
y_train = y[X_pretrain.index]
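The step that fits an encoder on the encoding split and transforms the training split isn't shown in these notes; a sketch using category_encoders' MEstimateEncoder (the smoothing value m is an assumption; Zipcode is the column used in the plot below):

from category_encoders import MEstimateEncoder

# Fit the target encoder on the encoding split only
encoder = MEstimateEncoder(cols=['Zipcode'], m=5.0)
encoder.fit(X_encode, y_encode)
# Encode the training split; X_train is what the plot below draws from
X_train = encoder.transform(X_pretrain)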
plt.figure(dpi=90)
ax = sns.distplot(y, kde=False, norm_hist=True)
ax = sns.kdeplot(X_train.Zipcode, color=’r’, ax=ax)
ax.set_xlabel(’Rating’)
ax.legend(labels=[’Zipcode’,’Rating’])
def load_data():
data_dir = Path(’...’)
df_train = pd.read_csv(data_dir / ’train.csv’, index_col = ’Id’)
df_test = pd.read_csv(data_dir / ’test.csv’, index_col = ’Id’)
df = pd.concat([df_train, df_test])
df = clean(df)
df = encode(df)
df = impute(df)
# reform splits
df_train = df.loc[df_train.index, :]
df_test = df.loc[df_test.index, :]
return df_train, df_test
data_dir = Path(’...’)
df = pd.read_csv(data_dir / ’train.csv’, index_col = ’Id’)
df.Exterior2nd.unique()
def clean(df):
df[’Exterior2nd’] = df[’Exterior2nd’].replace({’Brk Cmn’:’BrkComm’})
# some values of GarageYrBlt are corrupt, so we replace them
# with the year the house was built
df[’GarageYrBlt’] = df[’GarageYrBlt’].where(df.GarageYrBlt <= 2010, df.YearBuilt)
# Names beginning with numbers are awkward to work with
df.rename(columns={
’1stFlrSF’ : ’FirstFlrSF’,
’2ndFlrSF’ : ’SecondFlrSF’,
’3SsnPorch’ : ’Threeseasonporch’,
}, inplace = True)
return df
# Load Data
df_train, df_test = load_data()
# display(df_train)
# display(df_test.info())
X = df_train.copy()
y = X.pop(’SalePrice’)
baseline_score = score_dataset(X,y)
print(f’Baseline score: {baseline_score:.5f} RMSLE’)
Baseline score helps us know whether some set of features we’ve assembled has
led to any improvement or not.
Mutual info computes a utility score for a feature, how much potential the feature
has.
def make_mi_scores(X,y):
X = X.copy()
for colname in X.select_dtypes([’object’,’category’]):
X[colname], _ = X[colname].factorize()
# all discrete features should have integer dtypes
discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]
mi_scores = mutual_info_regression(X, y, discrete_features = discrete_features, random_state = 0)
mi_scores = pd.Series(mi_scores, name=’MI Scores’, index=X.columns)
mi_scores = mi_scores.sort_values(ascending=False)
return mi_scores
def plot_mi_scores(scores):
scores = scores.sort_values(ascending=True)
width = np.arange(len(scores))
ticks = list(scores.index)
plt.barh(width, scores)
plt.yticks(width, ticks)
plt.title(’Mutual Info Scores’)
X = df_train.copy()
y = X.pop(’SalePrice’)
mi_scores = make_mi_scores(X,y)
def create_features(df):
    X = df.copy()
    y = X.pop('SalePrice')
    X = X.join(create_features_1(X))
    X = X.join(create_features_2(X))
    # ...
    return X
A label encoding is ok for any kind of categorical feature when we're using a tree
ensemble like XGBoost, even for unordered categories. For a linear regression
model, we would use a one-hot encoding instead (especially for features with unordered
categories).
def mathematical_transforms(df):
X = pd.DataFrame() # to hold new features
X[’LivLotRatio’] = df.GrLivArea / df.LotArea
X['Spaciousness'] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd
return X
def interactions(df):
X = pd.get_dummies(df.BldgType, prefix = ’Bldg’)
X = X.mul(df.GrLivArea, axis=0)
return X
def counts(df):
X = pd.DataFrame()
X[’PorchTypes’] = df[[
’WoodDeckSF’, ’ScreenPorch’, #...
]].gt(0.0).sum(axis=1)
return X
def break_down(df):
X = pd.DataFrame()
X[’MSClass’] = df.MSSubClass.str.split(’_’, n=1, expand=True)[0]
return X
def group_transforms(df):
X = pd.DataFrame()
X[’MedNhbdArea’] = df.groupby(’Neighborhood’)[’GrLivArea’].transform(’median’)
return X
• Interactions between quality Qual and condition Cond features.
OverallQual is a high-scoring feature; we can combine it with OverallCond
by converting both to integer type and taking a product.
• Square roots of area features, to convert units of square feet to feet.
• Logarithms of numeric features: if a feature has a skewed distribution, we
can normalize it.
• Interactions between numeric and categorical features that describe the
same thing.
• Group statistics in Neighborhood: mean, std, count, differences from the
group statistic, etc.
cluster_features = [
’LotArea’, ’TotalBsmtSF’, ’FirstFlrSF’, ’SecondFlrSF’, ’GrLivArea’
]
4.8.7. PCA
# utility function
def apply_pca(X, standardize=True):
# standardize
if standardize:
X = (X-X.mean(axis=0)) / X.std(axis=0)
# create principal components
pca = PCA()
X_pca = pca.fit_transform(X)
# convert to dataframe
component_names = [f’PC{i+1}’ for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns = component_names)
# create loadings
loadings = pd.DataFrame(
pca.components_.T, # transpose matrix of loadings
columns = component_names, # columns are principal components
index = X.columns, # rows are original features
)
return pca, X_pca, loadings
PCA doesn't change the distance between points; it's just a rotation. So
clustering with the full set of components is the same as clustering with the
original features. Instead, pick some subset of components, maybe those with
the most variance or the highest MI scores.
def indicate_outliers(df):
X_new = pd.DataFrame()
X_new[’Outlier’] = (df.Neighborhood == ’Edwards’) & (df.SaleCondition == ’Partial’)
return X_new
Can use target encoding without held-out encoding data, similar to cross valida-
tion.
• Split the data into folds, each fold having 2 splits of dataset.
• Train the encoder on one split but transform the values of the other.
• Repeat for all splits.
class CrossFoldEncoder:
def __init__(self, encoder, **kwargs):
self.encoder_ = encoder
self.kwargs_ = kwargs # keyword arguments for encoder
self.cv_ = KFold(n_splits = 5)
    # Fit an encoder on one split and transform the values of the other, for every fold
    def fit_transform(self, X, y, cols):
        self.fitted_encoders_ = []
        self.cols_ = cols
        X_encoded = []
        for idx_encode, idx_train in self.cv_.split(X):
            fitted_encoder = self.encoder_(cols = cols, **self.kwargs_)
            fitted_encoder.fit(
                X.iloc[idx_encode, :], y.iloc[idx_encode],
            )
            X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])
            self.fitted_encoders_.append(fitted_encoder)
        X_encoded = pd.concat(X_encoded)
        X_encoded.columns = [name + '_encoded' for name in X_encoded.columns]
        return X_encoded
# transform the test data, average the encodings learned from each fold
def transform(self,X):
from functools import reduce
X_encoded_list = []
for fitted_encoder in self.fitted_encoders_:
X_encoded = fitted_encoder.transform(X)
X_encoded_list.append(X_encoded[self.cols_])
X_encoded = reduce(
lambda x, y: x.add(y, fill_value = 0), X_encoded_list
) / len(X_encoded_list)
X_encoded.columns = [name + ’_encoded’ for name in X_encoded.columns]
return X_encoded
We can turn any of the encoders from the category_encoders library into a cross-
fold encoder. CatBoostEncoder is similar to MEstimateEncoder, but uses tricks
to better prevent overfitting. Its smoothing parameter is called a instead of m.
def create_features(df, df_test=None):
    X = df.copy()
    y = X.pop('SalePrice')
    mi_scores = make_mi_scores(X, y)
    # Combine splits if test data is given
    if df_test is not None:
        X_test = df_test.copy()
        X_test.pop('SalePrice')
        X = pd.concat([X, X_test])
# Mutual Info
X = drop_uninformative(X, mi_scores)
# Transformations
X = X.join(mathematical_transforms(X))
X = X.join(interactions(X))
X = X.join(counts(X))
X = X.join(break_down(X))
X = X.join(group_transforms(X))
# Clustering
X = X.join(cluster_labels(X, cluster_features, n_clusters = 20))
X = X.join(cluster_distance(X, cluster_features, n_clusters = 20))
# PCA
X = X.join(pca_inspired(X))
X = X.join(pca_components(X,pca_features))
X = X.join(indicate_outliers(X))
X = label_encode(X)
# Reform splits
if df_test is not None:
X_test = X.loc[df_test.index,:]
X.drop(df_test.index, inplace=True)
# Target Encoder
encoder = CrossFoldEncoder(MEstimateEncoder, m=1)
X = X.join(encoder.fit_transform(X,y,cols=[’MSSubClass’]))
if df_test is not None:
X_test = X_test.join(encoder.transform(X_test))
if df_test is not None:
return X, X_test
else:
return X
X_train = create_features(df_train)
y_train = df_train.loc[:,’SalePrice’]
xgb_params = dict(
max_depth = 6, # tree depth: 2 to 10
learning_rate = 0.01, # effect of each tree: 0.0001 to 1
n_estimators = 1000, # number of trees/boosting rounds: 1000 to 8000
min_child_weight = 1, # min number of houses in a leaf: 1 to 10
colsample_bytree = 0.7, # features/columns per tree: 0.2 to 1.0
subsample = 0.7, # instances/rows per tree: 0.2 to 1.0
reg_alpha = 0.5, # L1 regularization LASSO: 0.0 to 10.0
reg_lambda = 1.0, # L2 regularization Ridge: 0.0 to 10.0
num_parallel_tree = 1, # > 1 for boosted random forests
)
xgb = XGBRegressor(**xgb_params)
score_dataset(X_train, y_train, xgb)
import optuna
def objective(trial):
    xgb_params = dict(
        max_depth = trial.suggest_int('max_depth', 2, 10),
        learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-1, log=True),
        n_estimators = trial.suggest_int('n_estimators', 1000, 8000),
        min_child_weight = trial.suggest_int('min_child_weight', 1, 10),
        colsample_bytree = trial.suggest_float('colsample_bytree', 0.2, 1.0),
        subsample = trial.suggest_float('subsample', 0.2, 1.0),
        reg_alpha = trial.suggest_float('reg_alpha', 1e-4, 1e2, log=True),
        reg_lambda = trial.suggest_float('reg_lambda', 1e-4, 1e2, log=True),
    )
    xgb = XGBRegressor(**xgb_params)
    return score_dataset(X_train, y_train, xgb)
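Running the search is then a couple of Optuna calls; a sketch, with an arbitrary number of trials:

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20)
xgb_params = study.best_params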
xgb = XGBRegressor(**xgb_params)
# XGBoost minimizes MSE, but the competition loss is RMSLE
# So we log-transform y to train and exp-transform the predictions
xgb.fit(X_train, np.log(y_train))
predictions = np.exp(xgb.predict(X_test))
output = pd.DataFrame({’Id’: X_test.index, ’SalePrice’:predictions})
output.to_csv(’my_submission.csv’, index=False)
5. Data Cleaning
5.2. Scaling
import pandas as pd
import numpy as np
# Box-Cox transformation
from scipy import stats
# min_max scaling
from mlxtend.preprocessing import minmax_scaling
# plot
import seaborn as sns
import matplotlib.pyplot as plt
# set seed for reproducibility
np.random.seed(0)
# min-max scale the data between 0 and 1
scaled_data = minmax_scaling(original_data, columns = [0])
# normalized_data = stats.boxcox(original_data)
Convert date columns to datetime: we take in a string and identify its
component parts.
infer_datetime_format = True: what if we run into errors because of multiple date
formats? We can have pandas try to infer what the right date format should be, but
it's slower.
dt.day doesn't know how to deal with a column of dtype object. Even if our
dataframe has dates in it, we have to parse them before we can interact with them.
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
# create new column, date_parsed
data[’date_parsed’] = pd.to_datetime(data[’date’], format = ’%m/%d/%y’)
# Now, dtype: datetime64[ns]
# remove NaN
day_of_month_data = day_of_month_data.dropna()
# plot day of month
sns.distplot(day_of_month_data, kde=False, bins = 31)
import pandas as pd
import numpy as np
# character encoding
import chardet
before = ’some foreign languages..’
after = before.encode(’utf-8’, errors = ’replace’)
print(after.decode(’utf-8’))
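A sketch of using chardet to guess a file's encoding before reading it; the path is a placeholder, as elsewhere in these notes:

# Look at the first ten thousand bytes to guess the character encoding
with open('...', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))
print(result) # something like {'encoding': 'Windows-1252', 'confidence': 0.73}
# Then read the file with the detected encoding
data = pd.read_csv('...', encoding=result['encoding'])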
import pandas as pd
import numpy as np
import fuzzywuzzy
from fuzzywuzzy import process
import chardet
data = pd.read_csv(’...’)
np.random.seed(0)
# get unique values in ’Country’ column
countries = data[’Country’].unique()
# sort alphabetically
countries.sort()
# convert to lower case
data[’Country’] = data[’Country’].str.lower()
# remove trailing white spaces
data[’Country’] = data[’Country’].str.strip()
Fuzzy Matching. 'SouthKorea' and 'south korea' should be treated as the same value.
Fuzzy matching measures how close one string is to another. Fuzzywuzzy returns a ratio
given 2 strings; the closer the ratio is to 100, the smaller the edit distance between the 2 strings.
Then, replace all rows that have a ratio of > 47 with 'south korea'. We write a
function so this is easy to reuse (see the sketch after the matches query below).
# get top 10 closest matches to ’south korea’
matches = fuzzywuzzy.process.extract('south korea', countries, limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
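A sketch of the reusable replacement function mentioned above; the exact course implementation may differ slightly:

def replace_matches_in_column(df, column, string_to_match, min_ratio=47):
    # get the unique strings and their top matches against string_to_match
    strings = df[column].unique()
    matches = fuzzywuzzy.process.extract(string_to_match, strings,
                                         limit=10, scorer=fuzzywuzzy.fuzz.token_sort_ratio)
    # only keep matches with a ratio above the threshold
    close_matches = [m[0] for m in matches if m[1] > min_ratio]
    # replace all rows containing a close match with the canonical string
    rows_with_matches = df[column].isin(close_matches)
    df.loc[rows_with_matches, column] = string_to_match

replace_matches_in_column(df=data, column='Country', string_to_match='south korea')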
# Check out column and see if we’ve tidied up ’south korea’ correctly.
# get all unique values in the column
countries = data[’Country’].unique()
# sort alphabetically
countries.sort()
# We see there is only ’south korea’. Done.
6. Intro to SQL
6.1. Dataset
from google.cloud import bigquery

# Create a client, construct a reference to the dataset, and fetch it
client = bigquery.Client()
dataset_ref = client.dataset('hacker_news', project = 'bigquery-public-data')
dataset = client.get_dataset(dataset_ref)
# List all the tables in the dataset
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)
Fetch a table.
Each SchemaField describes one column. For example, for the by field: the name of
the field (column) is by, the data in the field is a STRING, NULL values are allowed
(mode NULLABLE), and the description says it contains the username ...
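Fetching one table and previewing a few rows might look like the sketch below; the table name 'comments' is an assumption:

table_ref = dataset_ref.table('comments')
table = client.get_table(table_ref)
table.schema # list of SchemaField objects, as described above
client.list_rows(table, max_results=5).to_dataframe()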
6.3. Queries
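The notes keep no query example here; a minimal sketch against the hacker_news dataset, where the table and column names are assumptions:

query = """
        SELECT id, title
        FROM 'data.hacker_news.stories'
        WHERE score > 100
        """
query_job = client.query(query)
stories = query_job.to_dataframe()
stories.head()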
6.4. Count, Group by
6.4.1. Count
6.4.2. Group by
How many of each type of animal do we have in the pets table? Use GROUP BY
to group together rows that have the same value in the Animal column, while
using COUNT() to find out how many IDs we have in each group.
GROUP BY takes the name of one or more columns, and treats all rows with the same
value in that column as a single group when applying aggregate functions like
COUNT().
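A query of roughly this shape would produce the result table below; the pets table is the course's toy dataset, so treat the names as assumptions:

query = """
        SELECT Animal, COUNT(ID)
        FROM 'data.pet_records.pets'
        GROUP BY Animal
        """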
Animal f0_
Rabbit 1
Dog 1
Cat 2
Notes
It doesn't make sense to use GROUP BY without an aggregate function; every variable
you select should be passed to either
• a GROUP BY command, or
• an aggregation function.
Example: with two variables, parent and id, parent is passed to the GROUP BY command
(GROUP BY parent) and id is passed to the aggregate function COUNT(id). If an author
column isn't passed to an aggregate function or a GROUP BY clause, the query is
invalid.
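The Group by ... Having example that produced the next table was presumably of this shape (a hedged reconstruction); HAVING keeps only groups meeting a condition, here groups with more than one pet:

query = """
        SELECT Animal, COUNT(ID)
        FROM 'data.pet_records.pets'
        GROUP BY Animal
        HAVING COUNT(ID) > 1
        """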
Animal f0_
Cat 2
6.4.4. Aliasing
• Aliasing: the column resulting from COUNT(id) was f0_; we can change the
name by adding AS NumPosts.
• If unsure what to put inside COUNT(), we can do COUNT(1) to count the rows
in each group. It's readable because we know it's not focusing on other
columns. It also scans less data than if we supplied column names, so it can
be faster.
6.5. Order By
ORDER BY is usually the last clause in the query and changes the order of the results.
6.6. EXTRACT
Name Day
Tom 2
Peter 7
num_accidents day_of_week
5658 7
4928 1
4918 6
4828 5
4728 4
4628 2
4528 3
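The query behind that table presumably extracted the day of the week from each accident timestamp and counted rows per day; a sketch, where the table and column names are assumptions recalled from the course's traffic-fatalities example:

query = """
        SELECT COUNT(consecutive_number) AS num_accidents,
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM 'data.nhtsa_traffic_fatalities.accident_2015'
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """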
SELECT Animal, COUNT(ID) AS Number
FROM 'data.pet_records.pets'
GROUP BY Animal
Animal Number
Rabbit 1
Dog 1
Cat 2
With ... As. Common table expression (CTE): a temporary table that you return
within a query. CTEs let you split your queries into readable chunks.
We use the pets table to ask questions about older animals in particular.
WITH Seniors AS
(
SELECT ID, Name
FROM ’data.records.pets’
WHERE Years_old > 5
)
The incomplete query won’t return anything, it just creates a CTE named
Seniors (not returned by the query) while writing the rest of the query.
ID Name
2 Tom
4 Peter
WITH Seniors AS
(
SELECT ID, Name
FROM ’data.records.pets’
WHERE Years_old > 5
)
SELECT ID
FROM Seniors
ID
2
4
CTE only exists inside the query, we create CTE and then write a query that
uses the CTE.
How many bitcoin transactions are made per month?
WITH time AS
(
SELECT DATE(block_timestamp) AS trans_date
FROM ’data.bitcoin.transactions’
)
SELECT COUNT(1) AS transactions, trans_date
FROM time
GROUP BY trans_date
ORDER BY trans_date
transactions trans_date
0 12 2020-01-03
1 64 2020-01-09
2 193 2020-01-29
We can then use Python to plot the results. The CTE shifts a lot of the data cleaning into SQL,
which is faster than doing that work in Pandas.
transactions_by_date.set_index(’trans_date’).plot()
Combine information from both tables by matching rows where the ID column in the
pets table matches the Pet_ID column in the owners table.
ON determines which column in each table to use to combine the tables. The ID
column exists in both tables, so we have to clarify which table each column comes
from: p.ID is from the pets table and o.Pet_ID is from the owners table.
Inner Join. A row will only be put in the final output table if the value shows
up in both the tables you’re joining.
SELECT L.license, COUNT(1) AS number_of_files
FROM ’data.github_repos.sample_files’ AS sf
INNER JOIN ’data.github_repos.licenses’ AS L
ON sf.repo_name = L.repo_name
GROUP BY L.license
ORDER BY number_of_files DESC
license number_of_files
0 mit 20200103
1 gpl-2.0 16200109
2 apache-2.0 700129
7. Advanced SQL
7.1.1. Join
Some pets are not included. But we want to create a table containing all pets,
regardless of whether they have owners.
Create table containing all rows from owners table
• LEFT JOIN: Table that appears before JOIN in the query. Returns all
rows where 2 tables have matching entries, along with all of the rows in
the left table regardless of whether there is a match or not.
• RIGHT JOIN: Table that appears after the JOIN. Returns matching rows,
along with all of the rows in the right table.
• FULL JOIN: Return all rows from both tables. Any row without a match
will have NULL entries.
7.1.2. Union
WITH c AS
(
SELECT parent, COUNT(*) as num_comments
FROM ’data.hacker_news.comments’
GROUP BY parent
)
SELECT s.id as story_id, s.by, s.title, c.num_comments
FROM ’data.hacker_news.stories’ AS s
LEFT JOIN c
ON s.id = c.parent
WHERE EXTRACT(DATE FROM s.time_ts) = ’2021-01-01’
ORDER BY c.num_comments DESC
SELECT c.by
FROM ’data.hacker_news.comments’ AS c
WHERE EXTRACT(DATE FROM c.time_ts) = ’2021-01-01’
UNION DISTINCT
SELECT s.by
FROM ’data.hacker_news.stories’ AS s
WHERE EXTRACT(DATE FROM s.time_ts) = ’2021-01-01’
7.2.1. Over
SELECT *,
AVG(time) OVER(
PARTITION BY id
ORDER BY date
ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
) as avg_time
FROM ’data.runners.train_time’
• first_value(), last_value()
• lead(), lag(): value on subsequent/preceding row.
• row_number()
• rank(): all rows with the same value in the ordering column receive the
same rank value.
WITH trips_by_day AS
( -- calculate daily number of trips
SELECT DATE(start_date) AS trip_date,
COUNT(*) as num_trips
FROM ’data.san_francisco.bikeshare_trips’
WHERE EXTRACT(YEAR FROM start_date) = 2015
GROUP BY trip_date
)
SELECT *,
SUM(num_trips) -- SUM is aggregate function
OVER(
ORDER BY trip_date -- earliest dates appear first
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- window frame: all rows up to and including the
-- current date are used to calculate the cumulative sum
) AS cumulative_trips
FROM trips_by_day
SELECT bike_number,
TIME(start_date) AS trip_time,
-- FIRST_VALUE analytic function
FIRST_VALUE(start_station_id)
OVER(
-- break data into partitions based on bike_number column
-- This column holds unique identifiers for the bikes,
-- this ensures the calculations performed separately for each bike
PARTITION BY bike_number
ORDER BY start_date
-- window frame: each row’s entire partition
-- is used to perform the calculation.
-- This ensures the calculated values
-- for rows in the same partition are identical.
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_station_id,
-- LAST_VALUE analytic function
LAST_VALUE(end_station_id)
OVER (
PARTITION BY bike_number
ORDER BY start_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_station_id,
start_station_id,
end_station_id
FROM ’data.san_francisco.bikeshare_trips’
WHERE DATE(start_date) = ’2021-01-02’
bike_number trip_time first_station_id last_station_id start_station_id end_station_id
0 22 13:25:00 2 16 2 16
1 25 11:43:00 77 51 77 60
2 25 12:14:00 77 51 60 51
pets_and_toys_type table:
ID Name Age Animal Toys
toys_type table:
ID Type Pet_ID
1 Bone 1
2 Feather 2
3 Ball 3
Toys column contains repeated data, it permits more than one value for each
row. Mode of Toys column is REPEATED (not NULLABLE).
Each entry in a repeated field is an ARRAY. The entry in the Toys column
for Moon the Dog is [Rope, Bone] array with 2 values.
When querying repeated data, need to put the name of the column containing
the repeated data inside an UNNEST() function.
This flattens the repeated data, and appended to the right side of the table, so
that we have one element on each row.
What if pets can have multiple toys? Make Toys column both nested and
repeated.
SELECT Name AS Pet_Name,
t.Name AS Toy_Name,
t.Type AS Toy_Type
FROM 'data.pet_records.more_pets_and_toys',
UNNEST(Toys) AS t
--small join: decrease the size of the JOIN and runs faster.
WITH commits AS
(
SELECT COUNT(DISTINCT committer.name) AS num_committers, repo
FROM ’data.github_repos.commits’,
UNNEST(repo_name) as repo
WHERE repo IN (’tensorflow/tensorflow’, ’facebook/react’, ’Microsoft/vscode’)
GROUP BY repo
),
files AS
(
SELECT COUNT(DISTINCT id) AS num_files, repo_name as repo
FROM ’data.github_repos.files’
WHERE repo_name IN (’tensorflow/tensorflow’, ’facebook/react’, ’Microsoft/vscode’)
GROUP BY repo
)
SELECT commits.repo, commits.num_committers, files.num_files
FROM commits
INNER JOIN files
ON commits.repo = files.repo
ORDER BY repo
8. Deep Learning
8.2. Deep NN
A layer can be any kind of data transformation. Without activation functions,
neural nets can only learn linear relationships. In order to fit curves, we need
to use activation functions, which are functions we apply to each of a layer's
outputs, like ReLU.
Sequential connects together a list of layers in order. We pass all layers together in a
list, like [layer, layer, ...], instead of as separate arguments.
model = keras.Sequential([
# hidden ReLU layers
layers.Dense(units = 4, activation = ’relu’, input_shape=[2]),
layers.Dense(units = 3, activation = ’relu’),
# linear output layer
layers.Dense(units=1)
])
Train network = adjust its weights so that it can transform the features into
target.
The loss function measures how good the network's predictions are: the difference
between the target's true value and the model's predicted value. Common loss functions for
regression are mean absolute error (MAE) = abs(y_true - y_pred), mean
squared error (MSE), and Huber loss.
Optimizer tells network how to change weights, it adjusts the weights to minimize
the loss.
Def. SGD (stochastic gradient descent):
• gradient: a vector that tells us in what direction to adjust the weights to change the loss fastest.
• descent: use the gradient to descend the loss curve towards a minimum.
• stochastic: each step uses a random minibatch sampled from the training data.
Steps: sample a minibatch, run it through the network to make predictions, measure the loss,
then adjust the weights in a direction that makes the loss smaller; repeat.
The two parameters that have the largest effect on how SGD training proceeds are the
learning rate and the batch size.
Optimizer: Adam has adaptive learning rate that makes it suitable for most
problems without any parameter tuning, it’s self-tuning.
Keras keeps a history of the training and validation loss over the epochs training
the model.
import pandas as pd
from IPython.display import display
red_wine = pd.read_csv(’...’)
# scale to [0,1]
max_ = df_train.max(axis=0)
min_ = df_train.min(axis=0)
df_train = (df_train - min_) / (max_ - min_)
df_valid = (df_valid - min_) / (max_ - min_)
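The compile/fit calls below assume a network and train/valid feature splits were defined first; a minimal sketch (the layer sizes mirror the later example in the early-stopping section, the rest is an assumption):

from tensorflow import keras
from tensorflow.keras import layers

# Split features and target ('quality' is the target column)
X_train = df_train.drop('quality', axis=1)
X_valid = df_valid.drop('quality', axis=1)
y_train = df_train['quality']
y_valid = df_valid['quality']

# A three-hidden-layer regression network (11 input features in the red-wine data)
model = keras.Sequential([
    layers.Dense(512, activation='relu', input_shape=[11]),
    layers.Dense(512, activation='relu'),
    layers.Dense(512, activation='relu'),
    layers.Dense(1),
])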
model.compile(
optimizer=’adam’,
loss=’mae’
)
# training
# fit method keeps you updated on the loss as the model trains.
history = model.fit(
X_train, y_train,
validation_data = (X_valid, y_valid),
batch_size = 256, # feed 256 rows of training data per batch
epochs = 10 # go through the full dataset 10 times
)
# convert fit data to dataframe and plot
import pandas as pd
history_df = pd.DataFrame(history.history)
# plot
history_df[’loss’].plot()
• Training loss goes down either when the model learns signal or when it
learns noise.
Figure 1: Learning Curve
• Validation loss goes down only when the model learns signal. Whatever
noise the model learned from training set won’t generalize to new data.
• Gap: how much noise the model has learned.
Model Capacity: the size and complexity of the patterns the model is able to learn, determined by
how many neurons it has and how they are connected to each other. If the network is underfitting
the data, we should increase capacity by making it:
• wider (more units in existing layers): an easier time learning more linear
relationships
• deeper (more layers): prefers more nonlinear ones
model = keras.Sequential([
layers.Dense(16, activation=’relu’),
layers.Dense(1)
])
wider = keras.Sequential([
layers.Dense(32, activation=’relu’),
layers.Dense(1)
])
deeper = keras.Sequential([
layers.Dense(16, activation=’relu’),
layers.Dense(16, activation=’relu’),
layers.Dense(1)
])
When model is too eagerly learning noise, the validation loss may increase during
training. So we simply stop the training whenever it seems the validation loss
isn’t decreasing anymore. Once we see validation loss is rising again, we can
reset weights back to where the minimum occurred. This ensures the model
won’t learn noise and overfit data.
Early stopping prevents overfitting from training too long, and also prevents
underfitting from not training long enough. We just set the number of training epochs to a large
number, and early stopping will do the rest. It is implemented as a callback.
If there hasn't been at least an improvement of 0.001 in the validation loss over the
previous 20 epochs, then stop the training and keep the best model found.
early_stopping = callbacks.EarlyStopping(
min_delta = 0.001, # min amount of change to count as improvement
patience = 20, # how many epochs to wait before stopping
restore_best_weights = True
)
model = keras.Sequential([
layers.Dense(512, activation='relu', input_shape=[11]),
layers.Dense(512, activation=’relu’),
layers.Dense(512, activation=’relu’),
layers.Dense(1)
])
model.compile(
optimizer=’adam’,
loss=’mae’
)
history = model.fit(
X_train, y_train,
validation_data = (X_valid, y_valid),
batch_size = 256,
epochs = 500,
callbacks=[early_stopping], # put callbacks in a list
verbose = 0 # turn off training log
)
history_df = pd.DataFrame(history.history)
history_df.loc[:, [’loss’, ’val_loss’]].plot()
print(’Min validation loss: {}’.format(history_df[’val_loss’].min()))
8.6.1. Dropout
keras.Sequential([
layers.Dropout(rate=0.3), # 30% dropout to next layer
layers.Dense(16)
])
layers.BatchNormalization()
It can be put at almost any point in a network (between layers, or after a layer) as an aid
to the optimization process.
y_train = df_train[’quality’]
y_valid = df_valid[’quality’]
With a sigmoid activation function, the real-valued output produced by the final dense layer
becomes a probability; applying a threshold to that probability gives the class prediction.
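A minimal sketch of a binary classifier compiled with this setup; the layer sizes and input width are assumptions:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(4, activation='relu', input_shape=[33]),
    layers.Dense(4, activation='relu'),
    layers.Dense(1, activation='sigmoid'), # probability output
])
model.compile(
    optimizer='adam',
    loss='binary_crossentropy', # cross-entropy is the loss to use with probabilities
    metrics=['binary_accuracy'],
)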
Wide and Deep network trains a linear layer side-by-side with deep stack of
dense layers.
Each TPU has 8 cores acting independently; we create a distribution strategy so training is spread across them.
# model configuration
UNITS = 2048
ACTIVATION = ’relu’
DROPOUT = 0.1
# training configuration
BATCH_SIZE_PER_REPLICA = 2048
# Set up
# Tensorflow
import tensorflow as tf
print(’Tensorflow version’ + tf.__version__)
# Detect and init TPU
try: # detect TPU
tpu = tf.distribute.cluster_resolver.TPUClusterResolver.connect()
strategy = tf.distribute.TPUStrategy(tpu)
except ValueError:
strategy = tf.distribute.get_strategy()
# default strategy that works on CPU and single GPU
print(’Number of accelerators: ’, strategy.num_replicas_in_sync)
# Plot
import pandas as pd
import matplotlib.pyplot as plt
# Matplotlib
plt.style.use(’seaborn-whitegrid’)
plt.rc(’figure’, autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large', titleweight = 'bold', titlesize = 18)
# Data
from kaggle_datasets import KaggleDatasets
from tensorflow.io import FixedLenFeature
AUTO = tf.data.experimental.AUTOTUNE
# Model
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import callbacks
The datasets are encoded in binary TFRecord files; two functions will parse the
TFRecords and build a tf.data.Dataset object for training.
def make_decoder(feature_description):
def decoder(example):
example = tf.io.parse_single_example(example, feature_description)
features = tf.io.parse_tensor(example[’features’], tf.float32)
features = tf.reshape(features, [28])
label = example[’label’]
return features, label
return decoder
dataset_size = int(11e6)
validation_size = int(5e5)
training_size = dataset_size - validation_size
# for model.fit
batch_size = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
steps_per_epoch = training_size // batch_size
validation_steps = validation_size // batch_size
# for model.compile
steps_per_execution = 256
feature_description = {
’features’: FixedLenFeature([], tf.string),
’label’ : FixedLenFeature([], tf.float32)
}
decoder = make_decoder(feature_description)
data_dir = KaggleDatasets().get_gcs_path(’higgs-boson’)
train_files = tf.io.gfile.glob(data_dir + ’/training’ + ’/*.tfrecord’)
valid_files = tf.io.gfile.glob(data_dir + ’/validation’ + ’/*.tfrecord’)
# A reconstructed sketch of the training pipeline: read the TFRecords, decode them,
# then shuffle, batch, and prefetch. (ds_valid is built the same way from valid_files,
# without the shuffle.)
ds_train = (
    tf.data.TFRecordDataset(train_files)
    .map(decoder, num_parallel_calls=AUTO)
    .shuffle(2 ** 19)
    .batch(batch_size)
    .prefetch(AUTO)
)
8.8.3. Model
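The dense_block helper used below isn't defined in these notes; a plausible sketch (a Dense layer followed by batchnorm, activation, and dropout, returned as a function to apply to a tensor):

def dense_block(units, activation, dropout_rate):
    def make(inputs):
        x = layers.Dense(units)(inputs)
        x = layers.BatchNormalization()(x)
        x = layers.Activation(activation)(x)
        x = layers.Dropout(dropout_rate)(x)
        return x
    return make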
with strategy.scope():
# Wide network
wide = keras.experimental.LinearModel()
# Deep Network
inputs = keras.Input(shape=[28])
x = dense_block(UNITS, ACTIVATION, DROPOUT)(inputs)
x = dense_block(UNITS, ACTIVATION, DROPOUT)(x)
x = dense_block(UNITS, ACTIVATION, DROPOUT)(x)
x = dense_block(UNITS, ACTIVATION, DROPOUT)(x)
x = dense_block(UNITS, ACTIVATION, DROPOUT)(x)
outputs = layers.Dense(1)(x)
deep = keras.Model(inputs = inputs, outputs = outputs)
# Wide and Deep Network
wide_and_deep = keras.experimental.WideDeepModel(
linear_model = wide,
dnn_model = deep,
activation = ’sigmoid’
)
wide_and_deep.compile(
    loss = 'binary_crossentropy',
    optimizer = 'adam',
    metrics = ['AUC', 'binary_accuracy'],
    experimental_steps_per_execution = steps_per_execution,
)
8.8.4. Training
Gradually decreasing the learning rate over training can improve performance.
Define a learning rate schedule that will multiply the learning rate by 0.2 if the validation
loss didn't decrease after an epoch.
early_stopping = callbacks.EarlyStopping(
patience = 2,
min_delta = 0.001,
restore_best_weights = True
)
lr_schedule = callbacks.ReduceLROnPlateau(
patience = 0,
factor = 0.2,
min_lr = 0.001
)
history = wide_and_deep.fit(
ds_train,
validation_data = ds_valid,
epochs = 50,
steps_per_epoch = steps_per_epoch,
validation_steps = validation_steps,
callbacks = [early_stopping, lr_schedule]
)
history_frame = pd.DataFrame(history.history)
history_frame.loc[:,[’loss’,’val_loss’]].plot(title=’cross entropy loss’)
history_frame.loc[:,[’auc’,’val_auc’]].plot(title=’AUC’)
9. Computer Vision
Structure of a CNN classifier: a head acting as a classifier on top of a base used for feature
extraction.
We can attach a unit that performs feature engineering to the classifier itself.
So, given the right network structure, the NN can learn how to engineer the features
it needs to solve its problem.
Training Goal
We reuse the base of a pretrained model and attach an untrained head.
Because the head has only a few dense layers, very accurate classifiers can be created
from relatively little data.
Transfer learning = reusing a pretrained model. Almost every image
classifier will use it.
# Reproducability
def set_seed(seed=31415):
np.random.seed(seed)
tf.random.set_seed(seed)
os.environ[’PYTHONHASHSEED’] = str(seed)
os.environ[’TF_DETERMINISTIC_OPS’] = ’1’
set_seed(31415)
# Matplotlib
plt.rc(’figure’, autolayout=True)
plt.rc('axes', labelweight='bold', labelsize='large', titleweight='bold', titlesize=18, titlepad=10)
plt.rc(’image’, cmap=’magma’)
warnings.filterwarnings(’ignore’) # to clean up output cells
from tensorflow.keras.preprocessing import image_dataset_from_directory

# Load training and validation sets
ds_train_ = image_dataset_from_directory(
    '...',
    labels = 'inferred',
    label_mode = 'binary',
    image_size = [128,128],
    interpolation = 'nearest',
    batch_size = 64,
    shuffle = True
)
ds_valid_ = image_dataset_from_directory(
’...’,
labels=’inferred’,
label_mode = ’binary’,
image_size = [128,128],
interpolation = ’nearest’,
batch_size = 64,
shuffle = False
)
# Data Pipeline
def convert_to_float(image, label):
image = tf.image.convert_image_dtype(image, dtype = tf.float32)
return image, label
AUTOTUNE = tf.data.experimental.AUTOTUNE
ds_train = (
ds_train_
.map(convert_to_float)
.cache()
.prefetch(buffer_size = AUTOTUNE)
)
ds_valid = (
ds_valid_
.map(convert_to_float)
.cache()
.prefetch(buffer_size = AUTOTUNE)
)
# Attach a head of dense layers to the pretrained base
# (pretrained_base is assumed to be a saved convolutional base loaded earlier)
model = keras.Sequential([
    pretrained_base,
    layers.Flatten(),
    layers.Dense(6, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
9.2.2. Weights/Kernels
Kernels define how a convolutional layer is connected to the layer that follows,
and determine what kind of features it creates. They usually have odd-numbered dimensions
like (3,3) or (5,5), so that a single pixel sits at the center. A CNN learns what features it
needs to solve the classification problem.
9.2.3. Activations/Feature Maps
Filter with Conv2D layers and detect with the ReLU activation. The filters argument tells the convo-
lutional layer how many feature maps it creates as output.
Max pooling. After ReLU detects the feature map, there is a lot of 'dead
space' (large areas containing only 0, i.e. black image). To reduce the size of the model,
we condense the feature map to retain only the feature itself.
Max pooling takes a patch of activations in the original feature map, and
replaces them with max activation in that patch. When applied after ReLU,
it has ‘intensifying’ effect. The pooling step increases the proportion of active
pixels to 0 pixels.
MaxPool2D uses simple max function instead of kernel.
model = keras.Sequential([
layers.Conv2D(filters = 64, kernel_size = 3), # no activation here (None); ReLU is applied in the detect step
layers.MaxPool2D(pool_size = 2)
])
import tensorflow as tf
# Kernel
kernel = tf.constant([
    [-1,-1,-1],[-1,8,-1],[-1,-1,-1]
], dtype = tf.float32)
plt.figure(figsize=(3,3))
show_kernel(kernel)
image_filter = tf.nn.conv2d(
    input = image,
    filters = kernel,
    strides = 1,
    padding = 'SAME'
)
plt.figure(figsize=(6,6))
plt.imshow(tf.squeeze(image_filter))
plt.axis(’off’)
plt.show()
image_detect = tf.nn.relu(image_filter)
plt.figure(figsize=(6,6))
plt.imshow(tf.squeeze(image_detect))
plt.axis(’off’)
plt.show()
image_condense = tf.nn.pool(
input = image_detect, # image in Detect step
window_shape = (2,2),
pooling_type = ’MAX’,
strides = (2,2),
padding = ’SAME’
)
plt.figure(figsize=(6,6))
plt.imshow(tf.squeeze(image_condense))
plt.axis(’off’)
plt.show()
This invariance to small differences in the positions of features is a nice property:
because of differences in perspective or framing, the same kind of feature might
be positioned in various parts of the image, but it is still recognized as the same. Since
the invariance is built into the network, we can use less data for training, and we
don't need to teach the network to ignore those differences. This helps make a CNN far more
efficient than a network with only dense layers.
• strides of the window = how far the window should move at each step. For
high quality features, use strides = (1,1). If we can afford to miss some information, as in max
pooling layers, use strides = (2,2) [strides=2] or (3,3).
• padding = how to handle the image edges.
– padding = 'same'. Pad the input with 0 around the borders.
– padding = 'valid'. The window stays entirely inside the input, so the output
shrinks (loses pixels), and shrinks more for larger kernels. This can limit the
number of layers the network can contain, especially for small input sizes.
9.3. CNN
9.3.1. Model
Note how the number of filters doubles block-by-block: 64, 128, 256. Since the MaxPool2D layer
is reducing the size of the feature maps, we can afford to increase the quantity we
create.
model = keras.Sequential([
    # ... convolutional base blocks (Conv2D + MaxPool2D) go here ...
    # Classifier Head
    layers.Flatten(),
    layers.Dense(units=6, activation='relu'),
    layers.Dense(units=1, activation='sigmoid')
])
model.summary()
9.3.2. Train
model.compile(
optimizer = tf.keras.optimizers.Adam(epsilon=0.01),
loss = ’binary_crossentropy’,
metrics = [’binary_accuracy’]
)
history = model.fit(
ds_train,
validation_data = ds_valid,
epochs=40, verbose=0
)
import pandas as pd
history_frame = pd.DataFrame(history.history)
history_frame.loc[:,[’loss’,’val_loss’]].plot()
history_frame.loc[:,[’binary_accuracy’,’val_binary_accuracy’]].plot()
Augment the training data with flipped images, so our classifier will learn that left vs.
right is a difference it should ignore.
Data augmentation is done online - as the images are being fed into the network for training.
Training is usually done on mini-batches of data, so data augmentation may be applied to
a batch of 16 images at a time.
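A sketch of declaring the augmentation with Keras preprocessing layers; in older TF versions these live under layers.experimental.preprocessing:

from tensorflow import keras
from tensorflow.keras import layers

augment = keras.Sequential([
    layers.RandomFlip(mode='horizontal'), # flip left/right
    layers.RandomContrast(factor=0.10),
    layers.RandomRotation(factor=0.10),
])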
history_frame = pd.DataFrame(history.history)
history_frame.loc[:, [’loss’,’val_loss’]].plot()
history_frame.loc[:,[’binary_accuracy’, ’val_binary_accuracy’]].plot()
We can see that we achieved modest improvement in validation loss and accuracy.
Idea (permutation importance)
Permutation importance is calculated after a model has been fitted, so we don't change the model.
The question is: if we randomly shuffle a single column of the validation data, leaving
the target and all other columns in place, how would that affect the accuracy of the
predictions?
Pros
• fast to calculate
• widely used and understood
• consistent with properties we want a feature importance measure to have
Steps
• Get a trained model.
• Shuffle the values in a single column and make predictions with the shuffled data; the drop in
performance measures how important that column is.
• Return the data to its original order and repeat for the next column.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv(’...’)
y = (data[’Match’] == ’Yes’) # convert from string Yes/No to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100,
random_state = 0).fit(train_X, train_y)
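A sketch of computing and displaying the permutation importances with the eli5 library:

import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())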
We can see the Weight | Feature table. The first number in each row shows how much
model performance decreased with a random shuffling; the ± term measures how
performance varied from one reshuffling to the next, i.e. the randomness. Random
chance may also cause the predictions on shuffled data to be more accurate, which
is common for small datasets, because there is more room for luck and chance.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
data = pd.read_csv(’...’)
y = (data[’Match’] == ’Yes’)
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state = 0, max_depth = 5, min_samples_split = 5).fit(train_X, train_y)
# Decision Tree Graph visualization
from sklearn import tree
import graphviz
tree_graph = tree.export_graphviz(tree_model, out_file = None, feature_names= feature_names)
graphviz.Source(tree_graph)
# Random Forest
rf_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
# Partial dependence plot with the pdpbox library
from pdpbox import pdp
feature_to_plot = 'Goal Scored' # feature of interest
pdp_dist = pdp.pdp_isolate(model = rf_model, dataset = val_X, model_features = feature_names, feature = feature_to_plot)
pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
In the plot, the y axis is the change in the prediction relative to what would be predicted
at the baseline (leftmost) value. The blue shaded area indicates the level of
confidence.
Interactions between features: the 2D graph shows predictions for any combination
of Gold and Silver.
Figure 2: 1D PDP
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
Figure 3: 2D PDP
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv(’...’)
y = (data[’Match’] == ’Yes’) # convert from string yes/no to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64, np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
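The force plot below assumes a SHAP explainer and a single row of data were prepared first; a minimal sketch (the row index is arbitrary):

import shap

row_to_show = 5
data_for_prediction = val_X.iloc[[row_to_show]] # keep it as a one-row DataFrame
explainer = shap.TreeExplainer(my_model)
shap_values = explainer.shap_values(data_for_prediction)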
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)
A single importance number can hide whether a feature has:
• a large effect for a few predictions, but no effect in general; or
• a medium effect for all predictions.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv(’...’)
y = (data[’Match’] == ’Yes’)
feature_names = [i for i in data.columns if data[i].dtype in [np.int64, np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state = 0).fit(train_X, train_y)
import shap
explainer = shap.TreeExplainer(my_model)
# calculate shap_values for all val_X rather than a single row
shap_values = explainer.shap_values(val_X)
# make plot. Index [1] selects the SHAP values for the positive class
shap.summary_plot(shap_values[1], val_X)
Notes:
What’s the distribution of effects? Does the value vary a lot depending on the
values of other features?
Each dot represents a row of data. Horizontal location is actual value from
dataset, vertical location is what the value did to the prediction. The upward
slope means, the more you possess the ball, the higher the model’s prediction is.
Other features must interact with Ball Possession.
The two points are far away from the upward trend. Having the ball increases a
team's chance of having their player win the award. But if they only score one
goal, that trend reverses: the award judges may penalize them for having
the ball so much if they score that little.
import shap
explainer = shap.TreeExplainer(my_model)
shap_values = explainer.shap_values(X)
shap.dependence_plot('Ball Possession %', shap_values[1], X, interaction_index = 'Goal Scored')
Figure 6: SHAP Dependence Contribution Plots
Figure 7: High possession lowered prediction