
1 - Practical Guide For Kaggle Competitions

The document provides tips for participating in machine learning competitions. It recommends defining goals for participation, organizing ideas systematically, and sorting parameters by importance and understandability. It also suggests starting with simple solutions, debugging the full pipeline, and progressing from simple to complex models. Additional tips include using good code practices like commenting and version control, reusing code between training and testing, and reading papers for new ideas and domain knowledge. The document stresses keeping code clean and reproducible.


Practical Guide

Alexander Guschin
Practical guide: intro
Before you enter a competition

Define your goals. What can you get out of your participation?
1. To learn more about an interesting problem
2. To get acquainted with new software tools
3. To hunt for a medal
After you enter a competition:
Working with ideas

1. Organize ideas in some structure


2. Select the most important and promising ideas
3. Try to understand the reasons why something
does/doesn’t work
After you enter a competition:
Everything is a hyperparameter

Sort all parameters by these principles:


1. Importance
2. Feasibility
3. Understanding
Note: changing one parameter can affect the whole pipeline
Dmitry Altukhov
Data loading

• Do basic preprocessing and convert csv/txt files into


hdf5/npy for much faster loading
• Do not forget that by default data is stored in 64-bit arrays;
most of the time you can safely downcast it to 32 bits
• Large datasets can be processed in chunks
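The three loading tips above can be sketched as follows; the in-memory CSV and column names are made up for illustration:

```python
import io
import numpy as np
import pandas as pd

# Small in-memory CSV standing in for a real train.csv (illustration only).
csv_text = "a,b\n" + "\n".join(f"{i}.0,{i}" for i in range(10))

# 1) Downcast 64-bit columns to 32 bits: halves memory use, usually safe.
df = pd.read_csv(io.StringIO(csv_text))
for col in df.select_dtypes("float64"):
    df[col] = df[col].astype(np.float32)
for col in df.select_dtypes("int64"):
    df[col] = df[col].astype(np.int32)

# 2) Save to a binary format (npy/hdf5) for much faster reloading than CSV.
buf = io.BytesIO()
np.save(buf, df.to_numpy())

# 3) Process a large file in chunks instead of loading it all at once.
n_rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    n_rows += len(chunk)
```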
Performance evaluation

• Extensive validation is not always needed


• Start with the fastest models, such as LightGBM
Fast and dirty is always better

• Don’t pay too much attention to code quality


• Keep things simple: save only important things
• If you feel uncomfortable with the given computational
resources, rent a larger server
Mikhail Trofimov
Initial pipeline

• Start with a simple (or even primitive) solution


• Debug full pipeline
− From reading data to writing submission file
• “From simple to complex”
− I prefer to start with Random Forest rather than
Gradient Boosted Decision Trees
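A primitive solution that still exercises the full pipeline, from reading data to writing a submission file, can look like this; the file contents and column names are made up for illustration:

```python
import csv
import io

# Stand-ins for real competition files (contents are illustrative only).
train_csv = "id,feature,target\n1,0.5,10\n2,1.5,20\n3,2.5,30\n"
test_csv = "id,feature\n4,0.7\n5,1.9\n"

# The simplest possible model: predict the global mean of the target.
rows = list(csv.DictReader(io.StringIO(train_csv)))
mean_target = sum(float(r["target"]) for r in rows) / len(rows)

# Predict the constant for every test row and write the submission file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "target"])
for r in csv.DictReader(io.StringIO(test_csv)):
    writer.writerow([r["id"], mean_target])

submission = out.getvalue()
```

Once this end-to-end skeleton works, every later idea only swaps out the model part.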
Best Practices from Software Development

• Use good variable names


− If your code is hard to read, you will definitely have
problems sooner or later
• Keep your research reproducible
− Fix random seed
− Write down exactly how any features were generated
− Use Version Control Systems (VCS, for example, git)
• Reuse code
− It is especially important to use the same code for the train
and test stages
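The seed-fixing advice can be sketched as below; the split helper and seed value are illustrative, and frameworks like NumPy or PyTorch each need their own seed call (`np.random.seed`, `torch.manual_seed`, ...):

```python
import random

SEED = 42  # one fixed seed, written down once at the top

def make_split(n, seed=SEED):
    """Deterministic shuffle of row indices for a train/val split."""
    rng = random.Random(seed)  # local RNG: no hidden global state
    idx = list(range(n))
    rng.shuffle(idx)
    return idx

# Same seed -> identical split on every run, on every machine.
assert make_split(10) == make_split(10)
```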
Read papers

• This can get you ideas about ML-related things


− For example, how to optimize AUC
• A way to get familiar with the problem domain
− Especially useful for feature generation
Dmitry Ulyanov
My pipeline

• Read forums and examine kernels first


– There are always discussions happening!
• Start with EDA and a baseline
– To make sure the data is loaded correctly
– To check if validation is stable
• I add features in bulk
– At the start I create all the features I can think of
– I evaluate many features at once (not “add one and
evaluate”)
• Hyperparameter optimization
– First find the parameters that overfit the train dataset
– And then trim the model back
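The overfit-then-trim idea can be sketched with a decision tree; the data and parameter values here are made up for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data standing in for a real train set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] + 0.1 * rng.normal(size=200)

# Step 1: find settings with enough capacity to drive training error
# to ~zero -- this confirms the model and pipeline can fit the data.
overfit = DecisionTreeRegressor(max_depth=None, random_state=0).fit(X, y)

# Step 2: trim capacity (depth, leaves, regularization) until
# validation error -- not training error -- stops improving.
trimmed = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# overfit.score(X, y) is ~1.0 on train; the trimmed model scores lower
# on train but typically generalizes better.
```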
Code organization: keeping it clean

• Very important to have reproducible results!


– Keep important code clean
• Long execution history leads to mistakes

• Your notebooks can become a total mess


Code organization: keeping it clean

• One notebook per submission (and use git)

• Before creating a submission, restart the kernel


– Use “Restart and run all” button
Code organization: test/val

• Split train.csv into train and val parts that mirror the structure
of the original train.csv and test.csv
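A sketch of such a split is below; the column names and split point are made up for illustration (with time-based data you would split by time instead of by row position):

```python
import io
import pandas as pd

# In-memory stand-in for train.csv (columns are illustrative).
csv_text = "id,feature,target\n" + "\n".join(
    f"{i},{i * 0.1},{i % 2}" for i in range(10)
)
df = pd.read_csv(io.StringIO(csv_text))

# "train" plays the role of train.csv; "val" mimics test.csv
# (no target column), with the answers kept aside for local scoring.
train = df.iloc[:8].copy()
val = df.iloc[8:].drop(columns="target")
val_answers = df.iloc[8:]["target"]
```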
Code organization: test/val

• When validating, set a single switch constant at the top of the notebook

• To retrain models on the whole dataset and get predictions for the
test set, just change that constant
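A minimal sketch of that switch; the variable names and sizes are illustrative:

```python
# One constant at the top of the notebook controls whether you
# validate locally or retrain on everything for a real submission.
VALIDATION = True  # flip to False before making a submission

n_total = 100  # stand-in for the full train set size
if VALIDATION:
    train_size = 80  # hold out the rest for local validation
else:
    train_size = n_total  # retrain on the whole dataset

val_size = n_total - train_size
```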
Code organization: macros

I use macros for frequently used code


Code organization: custom library

• I use a library with frequently used operations implemented


– Out-of-fold predictions
– Averaging
– I can specify a classifier by its name
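A tiny sketch of such a helper library; the registry, the trivial mean-predictor "model", and all names are made up for illustration (real code would plug in sklearn estimators or similar):

```python
import numpy as np

# Name -> model factory, so a classifier can be picked by its name.
MODELS = {"mean": lambda y_tr: float(np.mean(y_tr))}

def oof_predictions(y, n_folds=5, model_name="mean"):
    """Out-of-fold predictions: each row is predicted by a model
    that never saw it during training."""
    y = np.asarray(y, dtype=float)
    oof = np.empty_like(y)
    folds = np.array_split(np.arange(len(y)), n_folds)
    for fold_idx in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold_idx] = False
        model = MODELS[model_name](y[mask])  # "train" on other folds
        oof[fold_idx] = model                # predict the held-out fold
    return oof

def average(*prediction_arrays):
    """Simple blend: element-wise mean of several prediction vectors."""
    return np.mean(prediction_arrays, axis=0)
```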
