
Framework To Approach A Kaggle Problem: 1. Importing The Training / Test Population

This document provides a framework and tips for approaching problems on the Kaggle machine learning competition platform. It notes that competing with experienced data scientists can be challenging, as some have automated tools for data exploration. The tips include: working hard; teaming up initially; focusing on feature engineering; researching the domain and problem; making simple initial submissions; being open to starting from scratch; and experimenting with algorithms and ensembles. It then outlines a framework involving importing training/test data, sampling the population, choosing attributes, and comparing models. The goal is to help readers get started competing on Kaggle to enter the new era of analytics and machine learning.

Uploaded by

Govind Naik

Competing with the best data scientists can be challenging, especially when some of them
have been doing so for years. I know a few people who have well-automated scripts that perform
most of the data exploration! These people are already deciding on the best algorithms while
the rest of the world is still figuring out the nuances of the data.

Here are a few things you need to keep in mind before starting a problem on Kaggle:

1. Like all good things in life, winning a Kaggle competition is all about hard work. Get
ready to devote long hours to pondering the same problem for days/weeks/months.
2. Team up with a good teammate for your initial competitions. A good teammate
is someone with a similar bent of mind and thought process, but who might have
complementary skills in tools / domain / work experience.
3. Be ready to do a lot of feature engineering; that is what differentiates the best from
the rest.
4. Do some preliminary research on the domain and the problem. There might be good
research papers with unconventional but effective solutions available on the internet.
5. Make simple initial solutions and submit them to get a sense of how much of a gap you
need to cover.
6. Always be open to starting from scratch.
7. Experiment with different algorithms and be prepared to build ensembles.

The list is not exhaustive, but it covers a significant portion. Now let’s look at a simple
framework to approach a Kaggle problem. Kaggle challenges participants at each step of this
framework.

Framework to approach a Kaggle Problem


Next, we will take you through a step-by-step process for taking a simple shot at a Kaggle
problem statement. The process generally involves the following pieces:

1. Importing the training / test population: Kaggle challenges you to import the training /
test dataset. In general, this is not very straightforward. For example, in the following
problems, the training data needs to be massaged well before we can start working on the model.

Here are two problem statements where you need to extract data from multiple Excel files:

a. Driver Telematics Analysis

b. BCI Challenge @ NER 2015
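
As a rough sketch of this import step, the snippet below stacks several files into one list of rows and tags each row with the file it came from. It assumes the files have already been exported to CSV and share the same header. The function name `load_all` and the `source_file` column are hypothetical, chosen only for illustration, and are not part of any Kaggle dataset.

```python
import csv
from pathlib import Path


def load_all(folder, pattern="*.csv"):
    """Read every matching CSV file in `folder` into one list of dict rows."""
    rows = []
    # Sort the paths so the row order is reproducible across runs.
    for path in sorted(Path(folder).glob(pattern)):
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Remember which file each row came from (hypothetical column).
                row["source_file"] = path.name
                rows.append(row)
    return rows
```

Keeping the source file name on every row makes it possible to group the data by subject or session later, which is often needed when each file holds one unit of the training population.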

2. Sampling the population: In general, the population is huge, and training on the entire
population might not be the best idea. For example, in “Sentiment Analysis on Movie
Reviews”, using the enormous number of phrases to build an initial dictionary might be a bad
idea. The sample can be chosen randomly or in a stratified way.
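
The two sampling options in step 2 can be sketched as follows. This is a minimal illustration using only the Python standard library. The function names, the `label_of` callback and the rule of keeping at least one row per class are assumptions made for the example, not a prescribed procedure.

```python
import random
from collections import defaultdict


def random_sample(rows, fraction, seed=0):
    """Simple random sample of roughly `fraction` of the rows."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * fraction))
    return rng.sample(rows, k)


def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample `fraction` of each label group so class proportions are kept."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per class so rare labels are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample
```

A stratified sample is usually the safer choice when the target classes are imbalanced, because a purely random sample can under-represent the rare class or miss it entirely.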
3. Choosing the right attributes: This is the most critical step, and it distinguishes different
submissions on Kaggle. In general, we use principal component analysis, factor analysis,
Information Value, or Weight of Evidence for this part, but there is no set procedure.
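
To make Weight of Evidence (WoE) and Information Value (IV) from step 3 concrete, here is a small sketch for a single categorical attribute and a binary target. It assumes the convention WoE = ln(share of non-events / share of events); some references flip the ratio. The `smoothing` term that avoids taking the log of zero is also an assumption of this example.

```python
import math
from collections import Counter


def information_value(values, targets, smoothing=0.5):
    """Return (IV, {category: WoE}) for one categorical attribute.

    Assumes a binary target where 1 is the event and 0 the non-event.
    `smoothing` is added to every count to avoid log(0) (an assumption).
    """
    events = Counter(v for v, t in zip(values, targets) if t == 1)
    non_events = Counter(v for v, t in zip(values, targets) if t == 0)
    categories = set(events) | set(non_events)
    total_events = sum(events.values()) + smoothing * len(categories)
    total_non_events = sum(non_events.values()) + smoothing * len(categories)

    woe, iv = {}, 0.0
    for cat in categories:
        p_event = (events[cat] + smoothing) / total_events
        p_non_event = (non_events[cat] + smoothing) / total_non_events
        woe[cat] = math.log(p_non_event / p_event)
        # Each term is non-negative, so IV grows with class separation.
        iv += (p_non_event - p_event) * woe[cat]
    return iv, woe
```

An attribute whose categories have the same event rate gets an IV of zero, so ranking candidate attributes by IV is one simple way to shortlist them before modelling.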

4. Comparing different ensemble / simple models: Once we have the input and the target
variables, we start building different models. The choice of model depends on the evaluation
metric, the type of input / target variables, the distribution of the population across target
values, etc.
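
One common way to compare candidate models for step 4 is k-fold cross-validation with the competition's evaluation metric. The sketch below is illustrative only: the two toy models (`fit_majority` and `fit_threshold`), the `accuracy` metric and the helper `k_fold_score` are hypothetical names, and a real entry would plug in actual learning algorithms.

```python
import random
from statistics import mean


def k_fold_score(fit, metric, X, y, k=5, seed=0):
    """Average `metric` over k held-out folds for the model built by `fit`."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # `fit` returns a function that predicts the label of one input.
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(X[i]) for i in held_out]
        scores.append(metric([y[i] for i in held_out], preds))
    return mean(scores)


def accuracy(y_true, y_pred):
    return mean(1.0 if t == p else 0.0 for t, p in zip(y_true, y_pred))


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    majority = max(set(y), key=y.count)
    return lambda x: majority


def fit_threshold(X, y):
    """Toy 1-D model: split at the midpoint of the two class means."""
    mean0 = mean(x for x, t in zip(X, y) if t == 0)
    mean1 = mean(x for x, t in zip(X, y) if t == 1)
    cut = (mean0 + mean1) / 2
    return lambda x: 1 if (x > cut) == (mean1 > mean0) else 0
```

Calling `k_fold_score(fit_threshold, accuracy, X, y)` and `k_fold_score(fit_majority, accuracy, X, y)` on the same data gives two comparable numbers. Swapping `accuracy` for another metric shows how the ranking of models can change with the evaluation criterion.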

In this article, we will start with the first step, using the BCI Challenge as the example. We
will begin with the problem statement and then define the scope of this article. After reading
this article, I believe you can start competing on Kaggle and begin your journey into the new
era of Analytics & Machine Learning.
