Python For Data Science Quickstart Guide
This quickstart tutorial will get you set up and coding in Python for data science.
If you want to learn one of the most in-demand programming languages in the world…
you’re in the right place.
By the end of this guide, you’ll have a strong foundation and be able to follow along other
tutorials on this site, even if you’ve never programmed before. Let’s jump right in!
Step 1: Download Anaconda

Anaconda is the closest thing to a one-stop-shop for all your setup needs. Simply download Anaconda with the latest version of Python 3 and follow the installation wizard:
Step 2: Start Jupyter Notebook
Jupyter Notebook is our favorite IDE (integrated development environment) for data
science in Python. An IDE is just a fancy name for an advanced text editor for coding.
(As an analogy, think of Excel as an “IDE for spreadsheets.” For example, it has tabs,
plugins, keyboard shortcuts, and other useful extras.)
The good news is that Jupyter Notebook already came installed with Anaconda. Three
cheers for synergy! To open it, run the following command in the Command Prompt
(Windows) or Terminal (Mac/Linux):
jupyter notebook
Alternatively, you can open Anaconda's “Navigator” application, and then launch the
notebook from there:
You should see this dashboard open in your browser:
*Note: If you get a message about “logging in,” simply follow the instructions in the
browser. You’ll just need to paste in a token from the Command Prompt/Terminal.
Then, open a new notebook by clicking “New” in the top right. It will open in a new browser tab. You should see a blank canvas brimming with potential. Try typing in a few simple calculations:
import math

# Area of circle with radius 5
25*math.pi

# Two to the fourth
2**4

# Length of triangle's hypotenuse
math.sqrt(3**2 + 4**2)
(To run a code cell, click into the cell so that it’s highlighted and then press Shift + Enter
on your keyboard.)
In addition, Jupyter Notebook will only display the output from the final line of code.
To print multiple calculations in one output, wrap each of them in the print(…) function.
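For example, here's a quick sketch that prints all three of our earlier calculations (reusing the math import from before):

print(25*math.pi)
print(2**4)
print(math.sqrt(3**2 + 4**2))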
Another useful tip is that you can store things in objects (i.e. variables). See if you can follow along with what this code is doing:
5/13
message = "The length of the hypotenuse is"
c = math.sqrt(3**2 + 4**2)
print(message, c)
By the way, in the above code, the message was surrounded by quotes, which means it’s
a string. A string is any sequence of characters surrounded by single or double quotes.
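For instance, both of these are valid strings (the variable names are just for illustration):

greeting = 'hello'     # single quotes
farewell = "goodbye"   # double quotes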
Now, we’re not going to dive much further into the weeds right now. To learn more about
programming fundamentals, check out our Python for Data Science Self-Study Guide.
Contrary to popular belief, you won’t actually need to learn an immense amount of programming to use Python for data science. That’s because most of the data science and machine learning functionality you’ll need is already packaged into libraries, or bundles of code that you can import and use out of the box.
Which brings us to the next step... Let’s import those libraries! In a new code cell (Insert > Insert Cell Below), write the following code:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
(It might take a while to run this code the first time.)
First, we imported the Pandas library. We also gave it the alias pd. This means we can invoke the library with pd. You’ll see this in action shortly.
Next, we imported the pyplot module from the matplotlib library. Matplotlib is the
main plotting library for Python. There’s no need to bring in the entire library, so we
just imported a single module. Again, we gave it an alias of plt.
Oh yeah, and the %matplotlib inline command? That’s Jupyter Notebook specific. It simply tells the notebook to display our plots inside the notebook, instead of in a separate window.
Finally, we imported a basic linear regression algorithm from scikit-learn. Scikit-learn
has a buffet of algorithms to choose from. At the end of this guide, we’ll point you to
a few resources for learning more about these algorithms.
There are plenty of other great libraries available for data science, but these are the most
commonly used.
For this tutorial, we’ll be reading from an Excel file that has data on the energy efficiency
of buildings. Don’t worry – even if you don’t have Excel installed, you can still follow
along.
First, download the dataset and put it into the same folder as your current Jupyter notebook.
Then, use the following code to read the file and store its contents in a df object (“df” is
short for dataframe).
df = pd.read_excel('ENB2012_data.xlsx')
If you saved the dataset in a subfolder, then you would write the code like this instead:
df = pd.read_excel('subfolder_name/ENB2012_data.xlsx')
To see what’s inside, just run this code in your notebook (it displays the first 5
observations from the dataframe):
df.head()
For extra practice on this step, feel free to download a few others from our hand-picked
list of datasets. Then, try using other IO tools (such as pd.read_csv()) to import datasets
with different formats.
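For instance, reading a CSV file looks nearly identical (the filename below is just a placeholder):

df2 = pd.read_csv('some_dataset.csv')
df2.head()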
We showcase more of what you can do in Pandas in our Python Data Wrangling Tutorial.
We won’t go through the entire exploratory analysis phase right now. Instead, let’s just
take a quick glance at the distributions of our variables. We’ll start with the “X1” variable,
which refers to “Relative Compactness” as described in the file’s data dictionary.
To plot a histogram of X1:

plt.hist(df.X1)
In general, these types of functions will have different parameters that you can pass into
them. Those parameters control things like the color scheme, the number of bins used,
the axes, and so on.
There’s no need to memorize all of the parameters. Instead, get in the habit of checking
the documentation page for available options. For example, the documentation page of
plt.hist() indicates that you can change the number of bins in the histogram:
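For example, something like this would do it (the bin count of 20 is arbitrary, purely for illustration):

plt.hist(df.X1, bins=20)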
For now, we don’t recommend trying to get too fancy with matplotlib. It’s a powerful, but
complex library.
Instead, we prefer a library that’s built on top of matplotlib called seaborn. If matplotlib
“tries to make easy things easy and hard things possible”, seaborn tries to make a well-
defined set of hard things easy as well.
Learn more about it in our Seaborn Data Visualization Tutorial.
Even so, for illustrative purposes, let’s at least check for missing values. You can do so
with just one line of code (but there’s a ton of cool stuff packed into this one line).
df.isnull().sum()
df is where we stored the data. It’s called a “dataframe,” and it’s also a Python
object, like the variables from Step 4.
.isnull() is called a method, which is just a fancy term for a function attached to an
object. This method looks through our entire dataframe and labels any cell with a
missing value as True. (Tip: Try running df.head().isnull() and see what you get!)
Finally, .sum() is a method that sums all of the True values across each column.
Well… technically, it sums any number, while treating True as 1 and False as 0.
You can learn more about .isnull() and .sum() on the documentation page for Pandas
dataframes.
Consider a “highest education level” categorical feature. It’s also ordinal. In other words, its classes have an implied order to them. For example, 'college' implies more schooling than 'high school'.
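If you ever need to encode an ordinal feature like that yourself, one common approach is to map each class to an integer that preserves the order. A minimal sketch with made-up data (this feature is not part of our dataset):

# Hypothetical ordinal feature
education = pd.Series(['high school', 'college', 'high school'])

# Map each class to an integer that preserves the order
education_encoded = education.map({'high school': 1, 'college': 2})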
A problem arises when categorical features are not ordinal. In fact, we have this problem
in our current dataset.
If you remember from its data dictionary, features X6 (Orientation) and X8 (Glazing Area
Distribution) are actually categorical. For example, X6 has four possible values:
2 == 'north',
3 == 'east',
4 == 'south',
5 == 'west'
However, in the current way it’s encoded (i.e. as four integers), an algorithm will interpret “east” as “1 more than north” and “west” as “2 more than east.”
Therefore, we should create dummy variables for X6 and X8. These are brand new
input features that only take the value of 0 or 1. You’d create one dummy per unique class
for each feature.
So for X6, we’d create four variables—X6_2, X6_3, X6_4, and X6_5—that represent its
four unique classes. We can do this for both X6 and X8 in one fell swoop:
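Here’s a minimal sketch of that step, assuming we want Pandas’ get_dummies() to replace the original columns within df:

# Create dummy variables for X6 and X8 in one shot
df = pd.get_dummies(df, columns=['X6', 'X8'])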
(Tip: after running this code, try running df.head() again. Is it what you expected?)
We won’t cover any more feature engineering for now, but you can get a checklist of
specific ideas in our Guide to Feature Engineering Best Practices.
After just a few short steps, we’re actually ready to train a model. But before we jump in,
just a quick disclaimer: we won’t be using model training best practices for now. Instead,
this code is simplified to the extreme. But it’s super helpful to start with these “toy
problems” as learning tools.
Before we do anything else, let’s split our dataset into separate objects for our input
features (X) and the target variable (y). The target variable is simply what we wish to
predict with our model.
Let’s predict “Y1,” a building’s “Heating Load.”
# Target variable
y = df.Y1

# Input features
X = df.drop(['Y1', 'Y2'], axis=1)
In the first line of code, we’re copying Y1 from the dataframe into a separate y object.
Then, in the second line of code, we’re copying all of the variables except Y1 and Y2 into
the X object.
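Now let’s train the model. Here’s a minimal sketch of the two steps described next (the instance name model is an assumption, chosen to match the prediction code later):

# Initialize a model instance
model = LinearRegression()

# Fit the model to our input features and target variable
model.fit(X, y)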
First, we initialize a model instance. Think of this as a single “version” of the model. For example, if you wanted to train a separate model and compare them, you could initialize a separate instance (e.g. model_2 = LinearRegression()).
Then, we call the .fit() method and pass the input features (X) and target variable (y) as
parameters.
There are many cool mechanics working under the hood, but that’s basically all you need
to create a basic model. In fact, you can get predictions and calculate the model’s R^2
like so:
from sklearn.metrics import r2_score

# Get model R^2 (note: r2_score expects the true values first)
y_hat = model.predict(X)
r2_score(y, y_hat)
# roughly 0.91
Congratulations! You are now officially up and running with Python for data science.

This was a great start, and you’re well on your way to learning the rest!
Next Steps
As mentioned earlier, we’ve just scratched the surface. Even so, hopefully you’ve seen
how easy it is to just get started.
Just get started, and don’t overthink it. Data science has a lot of moving pieces, so just
take it one step at a time.
From here, there are three routes you can go for next steps. You’ll want to do all three of
them eventually, but you can take them in any order.
Strike while the iron is hot, and keep practicing with the other tutorials on this site.
Shore up programming fundamentals and your Python skills with our Self-Study Guide to
Learning Python for Data Science.