Data Analysis and Machine Learning with Kaggle: How to Win Competitions on Kaggle and Build a Successful Career in Data Science
This is an Early Access product. Every effort has been made in the preparation
of this book to ensure the accuracy of the information presented. However, the
content and extracts of this book may evolve as it is being developed to ensure it
is up-to-date.
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
The information contained in this book is sold without warranty, either express
or implied. Neither the author nor Packt Publishing nor its dealers and distributors
will be held liable for any damages caused or alleged to have been caused
directly or indirectly by this book.
Early Access Publication: Data Analysis and Machine Learning with Kaggle
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK
ISBN: 978-1-80181-747-9
www.packt.com
Table of Contents
1. Data Analysis and Machine Learning with Kaggle: How to win
competitions on Kaggle and build a successful career in data science
2. Introducing data science competitions
I. The rise of data science competition platforms
II. Kaggle competition platform
III. Other competition platforms
IV. Stages of a competition
V. Types of competitions and examples
VI. Submission and leaderboard dynamics
VII. Computational resources
VIII. Teaming and networking
IX. Performance tiers and rankings
X. Criticism and opportunities
3. Organizing Data with Datasets
I. Setting up a dataset
II. Gathering the data
III. Using the Kaggle datasets outside of Kaggle
IV. Building around datasets
V. Legal caveats
4. Working and Learning with Kaggle Notebooks
I. Setting up a kernel
II. Upgrade to GCP
III. One step beyond
IV. Kaggle courses
Data Analysis and Machine
Learning with Kaggle: How to win
competitions on Kaggle and build a
successful career in data science
Welcome to Packt Early Access. We’re giving you an exclusive preview of this
book before it goes on sale. It can take many months to write a book, but our
authors have cutting-edge information to share with you today. Early Access
gives you an insight into the latest developments by making chapter drafts
available. The chapters may be a little rough around the edges right now, but our
authors will update them over time. You’ll be notified when a new version is
ready. This title is in development, with more chapters still to be written, which
means you have the opportunity to have your say about the content. We want to
publish books that provide useful information to you and other customers, so
we’ll send questionnaires out to you regularly. All feedback is helpful, so please
be open about your thoughts and opinions. Our editors will work their magic on
the text of the book, so we’d like your input on the technical elements and your
experience as a reader. We’ll also provide frequent updates on how our authors
have changed their chapters based on your feedback. You can dip in and out
of this book or follow along from start to finish; Early Access is designed to be
flexible. We hope you enjoy getting to know more about the process of writing a
Packt book. Join the exploration of new topics by contributing your ideas and
see them come to life in print.
In this chapter, we will explore how competitive programming evolved into data
science competitions, why the Kaggle platform is the most popular site for such
competitions, and how it works.
The first KDD Cup, held in 1997, involved a direct marketing problem for lift
curve optimization, and it started a long series of competitions (you can find the
archives containing datasets, instructions, and winners at:
https://www.kdd.org/kdd-cup) that continues to this day (here is the latest
available at the time of writing: https://www.kdd.org/kdd2020/kdd-cup). KDD
Cups proved quite effective in establishing best practices: many published papers
describe the winning solutions and techniques, and the shared competition
datasets have been useful to many practitioners for experimentation, education,
and benchmarking.
The experience of competitive programming and the KDD Cups together gave rise to
data science competition platforms: platforms where companies can host data
science challenges that are hard to solve and that could benefit from a
crowdsourcing approach. In fact, since there is no golden approach that works
for all the problems in data science, many problems require the time-consuming
approach of trying everything you can.
In the long run, no algorithm can beat all the others on all problems;
each machine learning algorithm performs well only if its space of
hypotheses comprises the solution. Yet you cannot know that beforehand, hence
you have to try and test in order to be assured that you are doing the right
thing. You can consult the no free lunch theorem for a theoretical explanation
of this practical truth; here is a complete article from Analytics India
Magazine on the topic:
https://analyticsindiamag.com/what-are-the-no-free-lunch-theorems-in-data-science/.
The Kaggle competition platform has so far hosted hundreds of competitions,
and these two are just a couple of examples of companies that have successfully
used its competitions to boost their own models and analytics efforts. Let's
take a step back from specific competitions for a moment and talk about the
Kaggle company, which is the common thread of this book.
Kaggle took its first steps in February 2010, thanks to the idea of Anthony
Goldbloom, an Australian-trained economist (he has a degree in Economics and
Econometrics from Melbourne University). After working at Australia's
Department of Treasury and in the research department of the Reserve Bank of
Australia, Goldbloom worked in London as an intern at The Economist, the
international weekly newspaper on current affairs, international business,
politics, and technology. At The Economist he had occasion to write an article
about big data, which inspired his idea of building a competition platform that
could crowdsource the best analytical experts to solve interesting machine
learning problems (https://www.smh.com.au/technology/from-bondi-to-the-big-
bucks-the-28yearold-whos-making-data-science-a-sport-20111104-1myq1.html).
Since crowdsourcing dynamics had a relevant part in the business idea for
this platform, he derived the name Kaggle, which recalls by rhyme the term
“gaggle”, a flock of geese (the goose is also the symbol of the platform).
After moving to Silicon Valley in the USA, his Kaggle start-up received
$11.25 million in Series A funding in a round led by Khosla Ventures and
Index Ventures, two quite renowned venture capital firms. The first competitions
rolled out, the community grew, and some of the initial competitors became
quite prominent, such as Jeremy Howard, the Australian data scientist and
entrepreneur who, after winning a couple of competitions on Kaggle, became
the President and Chief Scientist of the company. Jeremy Howard left his
position as President in December 2013 and thereafter founded a new start-up,
fast.ai (www.fast.ai), offering machine learning courses and a deep learning
library for coders.
At the time there were other prominent Kagglers (the name for frequent
participants in competitions held by Kaggle) such as Jeremy Achin and Thomas
de Godoy. After reaching the top 20 of the global rankings on the platform, they
decided to retire and found their own company, DataRobot. Soon after, they
started hiring their best employees from among the participants in Kaggle
competitions, in order to instill the best machine learning knowledge and
practice into the software they were developing. Today DataRobot is an
undisputed leader in AutoML (automated machine learning).
Kaggle competitions attracted more and more attention from a larger
audience, and even Geoffrey Hinton, the godfather of deep learning, participated
in (and won) a Kaggle competition hosted by Merck in 2012
(https://www.kaggle.com/c/MerckActivity/overview/winners). Kaggle has also
been the platform where Francois Chollet launched his deep learning package
Keras during the Otto Group Product Classification Challenge
(https://www.kaggle.com/c/otto-group-product-classification-
challenge/discussion/13632) and where Tianqi Chen launched XGBoost, a faster and
more accurate implementation of gradient boosting machines, in the Higgs Boson
Machine Learning Challenge (https://www.kaggle.com/c/higgs-
boson/discussion/10335).
Today the Kaggle community is still active and growing. It has offered many
of its participants opportunities to create their own companies, launch machine
learning software and packages, be interviewed by magazines
(https://www.wired.com/story/solve-these-tough-data-problems-and-watch-job-
offers-roll-in/), create a course on Coursera
(https://www.coursera.org/learn/competitive-data-science), write machine
learning books (https://twitter.com/antgoldbloom/status/745662719588589568),
find their dream job and, most importantly, learn more about the skills and
technicalities of data science.
Since many other competition platforms are localized to specific countries or
specialized in certain kinds of competitions, for completeness we will briefly
introduce some of them, at least those we have some experience and knowledge
of.
There are quite a few alternatives and opportunities besides Kaggle. The
interesting aspect of such an abundance of opportunities is that you can more
easily find a competition that interests you because of its specialization and
data. Also, expect less competitive pressure on these challenges, since they are
less known and advertised, and expect less sharing among participants, since no
other competition platform has so far reached the same richness of sharing
and networking tools as Kaggle.
Stages of a competition
A competition on Kaggle is arranged in different stages. By having a glance
at each of them, you can get a better understanding of how a data science
competition works and what to expect from it.
When a competition is launched, there are usually some posts on social media
(for instance on Kaggle's Twitter profile: https://twitter.com/kaggle)
announcing it, and a new tab will appear among the active competitions on the
page https://www.kaggle.com/competitions. If you click on the competition's tab,
you'll be taken to the competition page. At a glance, you can immediately
check whether the competition will have prizes (and whether it awards points and
medals, a secondary consequence of participating in a competition), how many
teams are currently involved, and how much time is still left for you to work on
a solution.
There you can explore the Overview menu first, which will provide information
about the topic of the competition, its evaluation metric (your models will
be evaluated against that metric), the timeline of the competition, the prizes,
and the legal or competition requirements. The timeline is usually a bit
overlooked, but it should be one of the first things you check: it doesn't
simply tell you when the competition starts and ends, it also provides the rules
acceptance deadline, which is usually from seven days to two weeks before the
competition closes. The rules acceptance deadline marks the time limit for
joining the competition by accepting its rules, and the team merger deadline
marks the time limit for merging your team with another competitor's team;
after that deadline, it won't be possible. The Rules menu is also quite often
overlooked (with people just jumping to the data), but it is important to check
it because it tells you about the requirements of the competition. Among the key
information you can get from the rules is your eligibility for a prize, along
with a few other important details, such as whether you can use external data to
improve your score, how many submissions (tests of your solution) per day you
get, how many final solutions you can choose, and so on. Finally, you can have a
look at the data, though you can download it only after accepting the rules of
the competition.
Once you have accepted the rules, you can download any data or directly start
working in a Kaggle Kernel, an online notebook, re-using code that others have
made available to other participants or creating your own code from scratch. If
you decide to download the data, also consider that there is a Kaggle API that
can help you run downloads and submissions in an almost automated way. You
can find more details about the API at https://www.kaggle.com/docs/api and you
can get the code from GitHub at https://github.com/Kaggle/kaggle-api. By the
way, if you look closely at the Kaggle GitHub repo, you can also find all the
Docker images they use for their online notebooks, the Kaggle Kernels. At this
point, as you develop your solution, it is not a bad idea to contact other
competitors through the discussion forum: there you can ask and answer
questions. Often you will also find useful hints about specific problems with
the data, or even ideas to improve your solution.
Once your solution is ready, you can submit it to the Kaggle evaluation engine,
according to the specifications of the competition (some competitions will
accept a CSV file as a solution, others will require you to code and produce
results in a Kaggle Kernel). Throughout the competition, you can keep submitting
solutions.
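For competitions that accept a CSV file, the submission is typically just a file with an identifier column and a prediction column. Here is a minimal sketch with made-up data and hypothetical column names; always check the sample submission file provided with each competition's data for the real layout:

```python
import csv

# Toy predictions keyed by row id (made-up data for illustration).
predictions = {1: 0.20, 2: 0.75, 3: 0.51}

# Write them in an "Id,Predicted" layout; the actual column names are
# defined by each competition's sample submission file.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Predicted"])
    for row_id in sorted(predictions):
        writer.writerow([row_id, predictions[row_id]])
```

This file is what you would then upload on the competition's submission page (or send through the Kaggle API).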
Every time you submit a solution, the leaderboard will provide you, soon after
(depending on the computations necessary for the evaluation), with a score and a
position among the competitors. That position is only indicative, anyway,
because it reflects the performance of your model on one part of the test set,
called the public part, since performance on it is made public during the
competition for everyone to see. Only when the competition closes, and the
contestants have decided which of their models have to be scored, is their score
on the other part of the test set, the private part, revealed. This new
leaderboard, the private leaderboard, constitutes the final, effective ranking
of the competition.
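The mechanics just described can be sketched in a few lines of code. Everything here (the data, the metric, and the split proportions) is made up for illustration, since the actual public/private split is hidden from competitors:

```python
import random

random.seed(0)

# Made-up ground truth and model predictions for a 1,000-row test set.
y_true = [random.randint(0, 1) for _ in range(1000)]
y_pred = [random.random() for _ in range(1000)]

# A hidden, fixed split: say 30% of rows feed the public leaderboard,
# the remaining 70% the private one (real proportions vary by competition).
indices = list(range(1000))
random.shuffle(indices)
public = set(indices[:300])

def accuracy(rows):
    """Fraction of rows where thresholded prediction matches the label."""
    hits = sum((y_pred[i] >= 0.5) == (y_true[i] == 1) for i in rows)
    return hits / len(rows)

public_score = accuracy(public)
private_score = accuracy([i for i in range(1000) if i not in public])

# The two scores generally differ a little; tuning obsessively against the
# public score risks overfitting that subset and dropping on the private one.
print(f"public: {public_score:.3f}  private: {private_score:.3f}")
```

The gap between the two numbers is exactly why a high public position is only indicative until the private leaderboard is revealed.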
When a competition closes, the Kaggle team will take some time to check
that everything is correct and that all contestants have respected the rules of
the competition. After a while (and sometimes after some adjustments), the
private leaderboard will become definitive, the winners will be declared, and
many participants will, at their own discretion, unveil their strategies,
solutions, and code to the others on the competition discussion forum.
Here are the official categories that you can use to filter out the different
competitions:
Featured
Masters
Annuals
Research
Recruitment
Getting started
Playground
Analytics
InClass
There are examples of Featured competitions every time you access Kaggle.
At the moment, many of them are problems related to the application of deep
learning methods to unstructured data like text, images, videos, or sound, but
in the past there were many tabular data competitions, that is, competitions
based on problems related to the structured data that can be found in a
database. Now such competitions are much less in demand, because a crowdsourced
solution would often not be much of an advance with respect to what a good team
of data scientists, or even AutoML software, could achieve. First using random
forests, then using gradient boosting methods with clever feature engineering,
tabular data solutions derived from Kaggle could once greatly improve on an
existing solution. Nowadays, given the spread of better software and good
practices, the improvements obtainable from competitions can indeed be marginal.
In the unstructured data world, instead, a good deep learning solution can still
make the difference: for instance, pretrained networks such as BERT brought
about double-digit increases over the previous standards on many well-known
text problems.
“Masters” are less usual now; they are private, invite-only competitions. The
purpose was to create competitions only among experts (generally competitors
ranked as Masters or Grandmasters, based on Kaggle medal rankings).
“Annuals” are competitions that always appear at a certain period of the year.
Among the Annuals, we have the Santa Claus competitions (usually based on an
algorithmic optimization problem) and the March Machine Learning
Competition, run every year since 2014 during the US College Basketball
Tournaments.
Google has released a few Research competitions in the past, such as Google
Landmark Recognition 2020 (https://www.kaggle.com/c/landmark-recognition-
2020), where the task was to label famous (and not-so-famous) landmarks in
images.
Sponsors that want to test potential job candidates for their abilities hold
“Recruitment” competitions. These competitions are limited to teams of one
and offer the best-placed competitors an interview with the sponsor as a prize.
Competitors have to upload their curriculum vitae at the end of the competition
if they want to be considered for contact.
“Getting started” competitions do not offer any prize, just friendly and easy
problems for beginners to get accustomed to Kaggle's principles and dynamics.
They are usually semi-permanent competitions whose leaderboards are refreshed
from time to time. If you are looking for a tutorial in machine learning, these
competitions are the right place to start.
“Playground” competitions are a little more difficult than the Getting Started
ones, but they are also meant for competitors to learn and test their abilities
without the pressure of a fully fledged Featured competition (though sometimes,
even in Playground competitions, the heat of the competition may turn quite
high). The usual prizes of such competitions are just swag or small amounts of
money.
One famous Playground competition was the original “Dogs vs. Cats”
competition (https://www.kaggle.com/c/dogs-vs-cats), where the task was to
create an algorithm to distinguish dogs from cats.
The system works best if the task is well defined and the data is of good
quality. In the long run, the performance of solutions improves by small gains
until it reaches an asymptote. The process can be sped up by allowing a
certain amount of sharing among participants (as happens on Kaggle by means
of discussions, and by sharing Kernel notebooks and extra data through
Datasets). Notably, competitive pressure alone, without any degree of sharing
among participants, doesn't stop the improvement of the solution; it just makes
it slower.
This is because the secret sauce in the CTF paradigm is the competition itself,
which, in the framework of a practical problem whose empirical performance has
to be improved, always leads to the emergence of new benchmarks, new data
and modeling solutions, and in general to an improved application of machine
learning to the problem that is the object of the competition. A competition can
therefore provide a new way to solve a prediction problem, new ways of feature
engineering, and new algorithmic or modeling solutions. For instance, deep
learning did not simply emerge from academic research; it first gained a great
boost because of successful competitions that demonstrated its efficacy (we have
already mentioned, for instance, the Merck competition won by Geoffrey Hinton's
team: https://www.kaggle.com/c/MerckActivity/overview/winners).
Coupled with the open source software movement, which allows everyone access to
powerful analytical tools (such as Scikit-learn, TensorFlow, or PyTorch), the
CTF paradigm brings even better results because all competitors start on the
same line. On the other hand, the reliance of a competition solution on
specialized or improved hardware can limit the achievable results, because it
can prevent competitors without access to such resources from properly
participating and contributing, directly to the solution or indirectly by
exercising competitive pressure on the other participants. Understandably, that
is the reason why Kaggle also started offering free cloud services to the
participants in its competitions (the Kernels): they can flatten some
differences in hardware-intensive competitions (as most deep learning ones are)
and increase the overall competitive pressure.
There are, anyway, occurrences that can go wrong and instead lead to a
suboptimal result in a competition.
You have leakage from the data when part of the solution can be traced in the
data itself: for instance, certain variables are posterior to the target
variable (so they reveal something about it), or the ordering of the training
and test examples, or some identifier, is evocative of the solution. Such
solution leakage, sometimes called “golden features” by competitors (because
getting a hint of such nuances in the data could turn into golden prizes for the
participants), invariably leads to a solution that is not reusable. It also
implies a suboptimal result for the sponsor (who will at least have learned
something about the leaking features that could affect a solution to their
problem).
In order to avoid problems with leaderboard probing and overfitting, Kaggle has
recently introduced different innovations based on Code Competitions, where
the evaluation is split into two distinct stages: you have a test set for the
public leaderboard (the leaderboard you follow during the competition) and a
completely held-out test set for the final private leaderboard. In this way,
participants are actually blind to the data their solutions will be evaluated
against, and they should be forced to rely more on their own tests and on a
general solution rather than one specific to the test set.
Computational resources
Some competitions do pose limitations in order to make solutions feasible for
production. For instance, the Bosch Production Line Performance
competition (https://www.kaggle.com/c/bosch-production-line-performance)
had strict limits on execution time, model file output, and memory for your
solution. Kernel-based competitions, when they require both training and
inference to be executed on Kernels, do not pose a problem for the resources you
have to use, because Kaggle will provide you with all the resources you need
(this is also intended as a way to put all participants on the same line for a
better competition result).
Problems arise when you have Kernel competitions limited to inference time
only: you can then train your models on your own machine, and the only limit is,
at test time, the number and complexity of the models you produce. Since at the
moment most competitions require deep learning solutions, you have to consider
that you will surely need specialized hardware, such as GPUs, in order to
achieve some interesting result in a competition. Even if you participate in
some of the now rare tabular competitions, you'll soon realize that you need a
strong machine with quite a number of processors and plenty of memory in order
to easily apply feature engineering to the data, run experiments, and build
models quickly.
Our suggestion, unless your ambition is to climb to the top rankings of Kaggle
participants, is therefore to go with the machines provided for free by Kaggle:
the Kaggle Notebooks (previously also known as Kaggle Kernels).
- A CPU Notebook has 4 CPU cores and 16 GB of memory; you can run 10
Notebooks of this kind at a time, with no time quota on them
- A GPU Notebook features 2 CPU cores and 13 GB of memory; you can run 2
Notebooks of this kind at a time, with a 30-hour weekly quota for this
kind of Notebook
- A TPU Notebook features 4 CPU cores and 16 GB of memory; you can run 2
Notebooks of this kind at a time, with a 30-hour weekly quota for this
kind of Notebook
All Notebooks can run for 9 hours maximum and have a 20 GB disk allowance to
store your models and results, plus an additional scratchpad disk that can
exceed 20 GB for temporary usage while a script is running.
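As a quick sanity check of what those quotas allow, with the figures quoted above (which may change over time), the weekly GPU quota fits only a handful of maximum-length sessions:

```python
GPU_WEEKLY_QUOTA_H = 30  # weekly GPU hours, as quoted above
MAX_SESSION_H = 9        # maximum single run of a Notebook

# How many full-length GPU sessions fit in a week, and what is left over.
full_sessions = GPU_WEEKLY_QUOTA_H // MAX_SESSION_H
leftover_h = GPU_WEEKLY_QUOTA_H % MAX_SESSION_H
print(f"{full_sessions} full {MAX_SESSION_H}h sessions, {leftover_h}h left")
# → 3 full 9h sessions, 3h left
```

Three long training runs per week is not much for deep learning experimentation, which is why planning your GPU time matters.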
In certain cases, the GPU-enhanced machines provided by Kaggle Kernels may
not be enough. For instance, the recent Deepfake Detection Challenge
(https://www.kaggle.com/c/deepfake-detection-challenge) required processing
data consisting of around 500 GB of videos. That is challenging especially
because of the weekly usage limit, which at the time of this writing is about 30
hours, and because you cannot have more than two GPU machines running at the
same time (10 machines at a time is the limit for the CPU-only instances). Even
if you can double your machine time by changing your code to leverage TPUs
instead of GPUs (you can find some guidance for achieving that easily here:
https://www.kaggle.com/docs/tpu), that may still prove not enough for fast
experimentation in a data-heavy competition such as the Deepfake Detection
Challenge. That's the reason why, in the chapter devoted to Kaggle Kernels, we
are going to provide you with many tips and tricks for successfully coping with
such limitations and obtaining decent results without having to buy a
high-performance machine. We are also going to show you how to integrate Kaggle
Kernels with Google Cloud Platform (GCP), or simply how to move all your work to
another cloud-based solution.
There are also quite a few occasions to network with other Kagglers outside the
Kaggle platform itself. First of all, there are a few Slack channels that can be
helpful. For instance, KaggleNoobs (see: https://www.kaggle.com/getting-
started/20577) is a channel, opened five years ago, that features many
discussions about Kaggle competitions; it has a supportive community that can
help you if you have a specific problem with code or models. There are quite a
few other channels devoted to exchanging opinions about Kaggle competitions and
data science related topics. Some channels are organized on a regional or
national basis; for instance, the Japanese channel Kaggler-ja
(http://kaggler-ja-wiki.herokuapp.com/) or the Russian community Open Data
Science Network (https://ods.ai/), created six years ago, which later also
opened to non-Russian-speaking participants. The Open Data Science Network
(mostly known simply as ODS) doesn't just offer a Slack channel, but also
courses on how to win competitions, events, and reporting on active competitions
on all the known data science platforms (see https://ods.ai/competitions).
Apart from Slack channels, quite a lot of local meetups themed around Kaggle
in general, or around specific competitions, have sprung up, some for a short
time, others for longer. A meetup on a Kaggle competition, usually built around
a presentation from a competitor who wants to share their experience and
suggestions, is the best occasion to meet other Kagglers in person, exchange
opinions, and build alliances for participating together in data science
contests. In this league, a special mention goes to Kaggle Days
(https://kaggledays.com/), built by Maria Parysz and Paweł Jankiewicz, a
renowned Kaggle competitor. The Kaggle Days organization arranged a few events
in major locations around the world (https://kaggledays.com/about-us/) with the
aim of bringing together a conference of Kaggle experts (which came to an abrupt
stop due to the COVID-19 pandemic), and it created a network of local meetups in
different countries which are still quite active
(https://kaggledays.com/meetups/).
First, there is a general leaderboard that combines all the leaderboards of the
single competitions. On this general leaderboard
(https://www.kaggle.com/rankings), one is ranked based on the position taken in
each single competition, which awards some points; summed together, these points
provide one's ranking on the general leaderboard. At first glance, the formula
for scoring the points of a competition may look a bit complex; nevertheless, in
reality, it is simply based on a few ingredients: your rank in the competition,
your team size, the popularity of the competition, and how old the competition
is.
Another point to keep in mind is that points decay with time. The decay is not
linear but, as a rule of thumb, just think that after a year very little is left
of the points you gained. Therefore, glory on the general leaderboard of Kaggle
is ephemeral and cannot last unless you keep on participating in competitions
with results similar to those you obtained before. As a consolation, your
profile will always keep the highest rank you ever reached, as a memento of your
best combined results at a certain time.
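As a sketch, the point formula publicized by Kaggle around 2015 combined exactly those ingredients: rank, team size, competition popularity (number of teams), and an exponential time decay. Treat the constants below as illustrative rather than authoritative, since the current formula may differ:

```python
import math

def competition_points(rank, team_size, n_teams, days_elapsed):
    """Per-competition ranking points, per the formula Kaggle publicized
    around 2015 (illustrative; the exact formula may have changed since)."""
    base = 100000 / math.sqrt(team_size)              # split over teammates
    rank_weight = rank ** -0.75                       # steep reward at the top
    popularity = math.log10(1 + math.log10(n_teams))  # more teams, more points
    decay = math.exp(-days_elapsed / 500)             # points fade over time
    return base * rank_weight * popularity * decay

# A solo first place in a 1,000-team competition, just finished...
fresh = competition_points(rank=1, team_size=1, n_teams=1000, days_elapsed=0)
# ...and the same result one year later, worth roughly half under this decay.
old = competition_points(rank=1, team_size=1, n_teams=1000, days_elapsed=365)
print(round(fresh), round(old))
```

The exponential decay term is what makes positions on the general leaderboard so hard to hold without fresh results.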
Longer lasting is the medal system, which covers all four aspects of competing
on Kaggle: you will be awarded medals in Competitions, Notebooks, Discussions,
and Datasets based on your results. In competitions, medals are awarded based on
your position on the leaderboard. In the other areas, such as Discussions,
Notebooks, and Datasets, medals are awarded based on the upvotes of other
competitors (which actually leads at times to suboptimal situations, since
upvotes are a less objective metric and also depend on popularity). The more
medals you get, the higher the ranks of Kaggle mastery you can enter. The ranks
are Novice, Contributor, Expert, Master, and Grandmaster. The page
https://www.kaggle.com/progression explains everything about how to get medals
and how many, and of what kind, are needed to access the different ranks.
Please keep in mind that such ranks and honors are always relative and that they
change over time. A few years ago, in fact, the scoring system and the ranks
were quite different. Most probably, in the future the ranks will change again,
in order to keep the higher ones rarer and thus more valuable.
Many perceive Kaggle, like many other data science competition platforms, as far
from what data science is in reality. The points they raise are that business
problems are not given out of nowhere, and that you seldom already have a well-
prepared dataset to start with, since you usually build it along the way, based
on refining the business specifications and your understanding of the problem at
hand. Moreover, they emphasize that production is not considered either, since a
winning solution cannot be constrained by resource limits or considerations
about technical debt (though this is not always true for all competitions).
We cannot help but notice how all such criticism relates, in the end, both to
the fact that Kaggle is a crowdsourcing experience with a purpose (the CTF
paradigm) and to how Kaggle rankings stand, in the data science world, in
comparison with data science education and work experience. One persistent
myth that fuels the criticism is, in fact, that Kaggle competitions may help you
get a job, or a better job, in data science, or that performing in Kaggle
competitions may put you on another plane with respect to data scientists who do
not participate at all.
Our stance on such a myth is that it is a misleading belief that Kaggle rankings
have an automatic value beyond the Kaggle community. For instance, in a job
search, Kaggle can provide you with some very useful competencies in
modeling data and problems and in effective model testing. It can also expose you
to many techniques and different data/business problems (even beyond your
actual experience and comfort zone), but it cannot supply you with
everything you need to successfully place yourself as a data scientist in a
company.
You can use Kaggle for learning and for differentiating yourself from other
candidates in a job search; however, how this will be regarded varies
considerably from company to company. In any case, what you learn on
Kaggle will invariably prove useful throughout your career and will
give you an edge when you have to solve complex and unusual problems
with data modeling, because by participating in Kaggle competitions you build
up strong competencies in modeling and validation. You also network with other
data scientists, which can more easily get you a reference for a job and provide
you with another way to handle difficult problems beyond your own skills, because
you will have access to others' competencies and opinions.
Hence, our opinion is that Kaggle can help you in your career as a data scientist
more indirectly, and that it can do so in different ways. Of course, sometimes
Kaggle will help you directly, when you are contacted as a job candidate based on
your competition successes, but more often it will help by providing you
with the intellectual skills and experience you need to succeed, first as a
candidate and then as a practitioner. In fact, after playing with data and models on
Kaggle for a while, you'll have had the chance to see enough different datasets,
problems, and ways to deal with them under time pressure that, when faced
with similar problems in real settings, you'll be quite skilled at finding solutions
quickly and effectively.
Use Kaggle and other competition platforms in a smart way. Kaggle is not a
passe-partout: being first in a competition won't assure you a highly paid job or
glory beyond the Kaggle community. However, consistently participating in
competitions is a card to be played smartly, to show interest and passion
in your data science job search and to improve some specific skills that can
differentiate you as a data scientist and keep you from becoming obsolete in the
face of AutoML solutions.
If you follow us through this book, we will show you how.
Organizing Data with Datasets
In his story “The Adventure of the Copper Beeches”, Arthur Conan Doyle has
Sherlock Holmes shout “Data! Data! Data! I cannot make bricks without
clay”—and this mindset, which served the most famous detective in literature so
well, should be adopted by every data scientist. For that reason, we begin the
more technical part of this book with a chapter dedicated to data: specifically, in
the Kaggle context, leveraging the power of Kaggle Datasets functionality for
our purposes.
Setting up a dataset
In principle, any data you can use (subject to limitations; see the legal caveats
section below) can be uploaded to Kaggle. The specific limits at the time of
writing this book are 20 GB per dataset and a 100 GB total quota. Keep in
mind that the size limit per single dataset is calculated uncompressed:
uploading compressed versions speeds up the transfer but does not help against
the limits. You can check the most recent Datasets documentation at this
link:
https://www.kaggle.com/docs/datasets
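Besides the web flow described below, a dataset can also be assembled and uploaded programmatically. The following is a minimal sketch, not the book's own procedure: the folder and file names are placeholders of our own, and the client calls that hit the API are left as comments since they require valid credentials.

```python
import csv
import pathlib

# Gather the files you want to publish in a single folder
# (my-dataset and sample.csv are placeholder names)
folder = pathlib.Path("my-dataset")
folder.mkdir(exist_ok=True)
with open(folder / "sample.csv", "w", newline="") as f:
    csv.writer(f).writerows([["id", "value"], [1, 42]])

# With the official client installed (pip install kaggle) and an API
# token in ~/.kaggle/kaggle.json, the upload itself is two shell
# commands: the first writes a dataset-metadata.json stub (edit its
# title and id fields), the second pushes the folder to Kaggle.
#   kaggle datasets init -p my-dataset
#   kaggle datasets create -p my-dataset
```

Remember that the 20 GB per-dataset limit applies to the uncompressed size of whatever you place in this folder.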
Kaggle promotes itself as a “home of open data science” and the impressive
collection of datasets available from the site certainly lends some credence to
that claim: before uploading the data for your project into a dataset, make sure to
check the existing content—for several popular applications, there is a chance it
has already been stored there:
For the sake of this introduction, let us assume that the kind of data you will be
using in your project is not already there, so you need to create a new dataset.
When you head to the menu with three lines on the left-hand side and click on
Data, you will be redirected to the dataset page:
When you click on New Dataset you will be prompted for the basics: uploading
the actual data and giving it a title:
Keep in mind that Kaggle is a popular platform and numerous people upload their
data there, including private (not publicly visible) datasets, so try to think of a
non-generic title.
Voilà! Your first dataset is ready. You can then head to the Data tab:
In principle, you do not have to fill out all the fields: your newly created dataset
is perfectly usable without them (and if it is a private one, you probably do not
care; after all, you know what is in it). However, community etiquette
suggests filling in the info for the ones you make public: the more you specify, the
more usable the data will be to others (as measured by the usability score
displayed in the upper-right corner).
Discussion of the different frameworks for harvesting data from social media
(Twitter, Reddit etc) is outside the scope of this book.
But even once we move the computations to Colab, we might still want to have
access to the Kaggle datasets, so importing them into Colab is a rather handy
feature.
The first thing we do—since you are reading this, we assume you already are
registered on Kaggle—is head to the account page to generate the API token:
The next step is to create a folder named “Kaggle” in your drive and upload the
.json file there.
Once done, you need to create a new Colab notebook and mount your drive:
Get the authorization code from the URL prompt and provide it in the empty box,
then execute the following code to provide the path to the .json config:
We can download the dataset now: begin by going to Kaggle and copying the
API command:
The dataset will be downloaded as a .zip archive—unpack it and you are good
to go.
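The workflow above can be condensed into a short sketch. A few caveats: the drive-mounting lines only work inside Colab (so they are shown as comments), the `configure_kaggle` helper name is our own, and the folder path assumes the “Kaggle” folder created earlier.

```python
import os

# Inside Colab, you would first mount your Google Drive:
#   from google.colab import drive
#   drive.mount("/content/gdrive")
# so that kaggle.json sits under /content/gdrive/MyDrive/Kaggle/.

def configure_kaggle(config_dir):
    """Point the Kaggle API client at the folder holding kaggle.json."""
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    token = os.path.join(config_dir, "kaggle.json")
    # Tighten permissions so the client does not warn about a
    # world-readable token (skipped if the file is not there yet).
    if os.path.exists(token):
        os.chmod(token, 0o600)

configure_kaggle("/content/gdrive/MyDrive/Kaggle")

# With the client configured, downloading is the API command copied
# from the dataset page, followed by unpacking the archive:
#   !kaggle datasets download -d <owner>/<dataset-name>
#   !unzip -q <dataset-name>.zip
```

Setting `KAGGLE_CONFIG_DIR` is what lets the client find the token outside its default `~/.kaggle` location, which is exactly the situation in a fresh Colab session.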
Alternatively, you can start a conversation around the data by clicking on Create
a discussion.
Legal caveats
Just because you can put some data on Kaggle does not necessarily mean that
you should. An excellent example is the “People of Tinder” dataset: in
2017, a developer used the Tinder API to scrape the website for semi-private
profiles and uploaded the data on Kaggle. After the issue became known, Kaggle
ended up taking the dataset down. You can read the full story here:
https://www.forbes.com/sites/janetwburns/2017/05/02/tinder-profiles-have-been-
looted-again-this-time-for-teaching-ai-to-genderize-faces/?sh=1afb86b25454
In general, before you upload anything to Kaggle, ask yourself two questions: is
it legal (from a copyright standpoint; always check the licenses), and are there
any risks associated with this dataset (privacy or otherwise)?
Working and Learning with Kaggle
Notebooks
Kaggle notebooks—which until recently were called kernels, so please excuse
me if I occasionally use those terms interchangeably—are Jupyter notebooks in
the browser that can run free of charge. This means you can execute your
experiments from any device with an internet connection, although something
bigger than a mobile phone is probably a good idea. The technical specification
of the environment (as of the time of this writing) is given below:
Without further ado, let us jump into it. The first thing we do is figure out how to
set up a notebook.
Setting up a kernel
There are two primary methods of creating a notebook: from the front page or
from a dataset level.
To proceed with the first method, go to the Code section of the menu on the left-
hand side of the landing page at https://www.kaggle.com/ and press the New
Notebook button. This is a preferred method if you are planning an experiment
involving uploading your own dataset:
Alternatively, you can go to the page of the dataset you are interested in and
click the New Notebook button there:
Whichever method you choose, after clicking New Notebook you will be taken to
your notebook page:
An important aspect of using notebooks: you can always take an existing one
(created by somebody else) and clone it to modify and adjust to your needs. This
can be achieved by pressing the Copy and Edit button on the kernel page,
although in Kaggle parlance the process is referred to as forking:
A note on etiquette: if you have participated in a Kaggle competition, you have
probably noticed that the leaderboard is flooded with forks of forks of well-
scoring notebooks. There is nothing wrong with building on somebody else’s
work, but if you do, remember to upvote the original author.
A notebook you create is private (i.e. only visible to you) by default. If you want
to make it available to others, you can select
Upgrade to GCP
Sometimes the resources provided freely by Kaggle are not sufficient for the task
at hand, and you need to move to a beefier machine. You can set up the whole
environment yourself, or you can stay within the framework of notebooks but
swap the underlying machine. This is where Google Cloud AI Notebooks come
in.
There are many potential criteria to consider when building your data science
portfolio (branding, audience reach, enabling a pitch to your potential employer
etc.) but none of them matter if nobody can find them. Because Kaggle is part of
Google, the notebooks are indexed by the most popular search engine in the
world—so if someone is looking for a topic related to your code, it will show up
in their search results.
Below, I show a personal example: a few years ago, I wrote a notebook for a
competition tackling adversarial validation (for those unfamiliar with the topic:
a fairly easy way to see whether your training and test sets have a similar
distribution is to build a binary classifier trained to tell them apart). When
writing this chapter, I gave it a try and, lo and behold, it showed up high in the
search results (notice that I did not mention Kaggle or any personal details,
like my name, in the query):
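The idea of adversarial validation can be sketched in a few lines of scikit-learn. This is a minimal illustration on synthetic data, not the original notebook: the feature construction, the deliberate distribution shift, and the choice of a random forest are all our own assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Synthetic "train" and "test" sets: the second feature is shifted in
# the test set, so the two distributions differ on purpose.
train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
test = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
test[:, 1] += 1.5  # artificial distribution shift on one feature

# Label each row by its origin (0 = train, 1 = test) and train a
# classifier to tell the two sets apart.
X = np.vstack([train, test])
y = np.array([0] * len(train) + [1] * len(test))

clf = RandomForestClassifier(n_estimators=100, random_state=0)
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()

# An AUC near 0.5 means the sets are indistinguishable (similar
# distributions); an AUC well above 0.5 signals a shift.
print(f"Adversarial validation AUC: {auc:.2f}")
```

Because of the injected shift, the classifier here separates the two sets easily; on well-matched train and test sets, the same procedure would return an AUC close to 0.5.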
Your Kaggle profile comes with a followers/following option and gives you the
possibility to link other professional networks, like LinkedIn or GitHub, so you
can leverage the connections gained inside the community:
In this day and age, it is easy to be skeptical about claims of “community
building”—but in the case of Kaggle, it happens to actually be true. Their brand
recognition in the data science universe is second to none, both among
practitioners and among recruiters who actually do their homework. In practice,
this means that a (decent enough) Kaggle profile can get you through the door
already—which, as we all know, is frequently the hardest step.
Kaggle courses
A great many things about Kaggle revolve around acquiring knowledge: be it
what you learn in a competition, the datasets you manage to find in the ever-
growing repository, or the demonstration of a hitherto unknown model class,
there is always something new to find out. The newest addition to this collection
are the courses gathered under the Kaggle Learn label:
https://www.kaggle.com/learn. These are micro-courses marketed by Kaggle as
“the single fastest way to gain the skills you’ll need to do independent data
science projects”, the core unifying theme being a crash-course introduction to a
variety of topics. Each course is divided into small chapters, followed by coding
practice questions.
Below, we provide a brief summary of their content:
Apart from the original content created by Kaggle, there are multiple other
learning opportunities available on the platform using kernels. A prominent
example worth mentioning is the extremely popular fast.ai course:
https://www.kaggle.com/general/63077