Python For Data Science Quickstart Guide
This quickstart tutorial will get you set up and coding in Python for data science.
If you want to learn one of the most in-demand programming languages in the world…
you’re in the right place.
By the end of this guide, you’ll have a strong foundation and be able to follow along other
tutorials on this site, even if you’ve never programmed before. Let’s jump right in!
Step 1: Download Anaconda

Anaconda is the closest thing to a one-stop-shop for all your setup needs. Simply download Anaconda with the latest version of Python 3 and follow the installation wizard:
Step 2: Start Jupyter Notebook
Jupyter Notebook is our favorite IDE (integrated development environment) for data
science in Python. An IDE is just a fancy name for an advanced text editor for coding.
(As an analogy, think of Excel as an “IDE for spreadsheets.” For example, it has tabs,
plugins, keyboard shortcuts, and other useful extras.)
The good news is that Jupyter Notebook already came installed with Anaconda. Three
cheers for synergy! To open it, run the following command in the Command Prompt
(Windows) or Terminal (Mac/Linux):
jupyter notebook
Alternatively, you can open Anaconda's “Navigator” application, and then launch the
notebook from there:
You should see this dashboard open in your browser:
*Note: If you get a message about “logging in,” simply follow the instructions in the
browser. You’ll just need to paste in a token from the Command Prompt/Terminal.
Then, open a new notebook by clicking “New” in the top right. It will open in a new browser tab. You should see a blank canvas brimming with potential. Try typing in a few simple calculations:
import math

# Area of circle with radius 5
25*math.pi

# Two to the fourth
2**4

# Length of triangle's hypotenuse
math.sqrt(3**2 + 4**2)
(To run a code cell, click into the cell so that it’s highlighted and then press Shift + Enter
on your keyboard.)
In addition, Jupyter Notebook will only display the output from the final line of code.
To print multiple calculations in one output, wrap each of them in the print(…) function.
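For example, here's a quick sketch that prints all three of our earlier calculations (reusing the math import from before):

print(25*math.pi)
print(2**4)
print(math.sqrt(3**2 + 4**2))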
Another useful tip is that you can store things in objects (i.e. variables). See if you can follow along with what this code is doing:
5/13
message = "The length of the hypotenuse is"
c = math.sqrt(3**2 + 4**2)
print(message, c)
By the way, in the above code, the message was surrounded by quotes, which means it’s
a string. A string is any sequence of characters surrounded by single or double quotes.
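For instance, both of these are valid strings (the variable names are just for illustration):

greeting = 'hello'     # single quotes
farewell = "goodbye"   # double quotes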
Now, we’re not going to dive much further into the weeds right now. To learn more about
programming fundamentals, check out our Python for Data Science Self-Study Guide.
Contrary to popular belief, you won’t actually need to learn an immense amount of programming to use Python for data science. That’s because most of the data science and machine learning functionality you’ll need is already packaged into libraries, or bundles of code that you can import and use out of the box.
Which brings us to the next step... Let’s import those libraries! In a new code cell (Insert > Insert Cell Below), write the following code:
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.linear_model import LinearRegression
(It might take a while to run this code the first time.)
First, we imported the Pandas library. We also gave it the alias pd. This means we can invoke the library with pd. You’ll see this in action shortly.
Next, we imported the pyplot module from the matplotlib library. Matplotlib is the
main plotting library for Python. There’s no need to bring in the entire library, so we
just imported a single module. Again, we gave it an alias of plt.
Oh yeah, and the %matplotlib inline command? That’s Jupyter Notebook specific. It simply tells the notebook to display our plots inside the notebook, instead of in a separate window.
Finally, we imported a basic linear regression algorithm from scikit-learn. Scikit-learn
has a buffet of algorithms to choose from. At the end of this guide, we’ll point you to
a few resources for learning more about these algorithms.
There are plenty of other great libraries available for data science, but these are the most
commonly used.
For this tutorial, we’ll be reading from an Excel file that has data on the energy efficiency
of buildings. Don’t worry – even if you don’t have Excel installed, you can still follow
along.
First, download the dataset and put it into the same folder as your current Jupyter notebook.
Then, use the following code to read the file and store its contents in a df object (“df” is
short for dataframe).
df = pd.read_excel('ENB2012_data.xlsx')
If you saved the dataset in a subfolder, then you would write the code like this instead:
df = pd.read_excel('subfolder_name/ENB2012_data.xlsx')
To see what’s inside, just run this code in your notebook (it displays the first 5
observations from the dataframe):
df.head()
For extra practice on this step, feel free to download a few others from our hand-picked
list of datasets. Then, try using other IO tools (such as pd.read_csv()) to import datasets
with different formats.
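For instance, reading a CSV file looks nearly identical (the filename below is just a placeholder):

df2 = pd.read_csv('some_dataset.csv')
df2.head()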
We showcase more of what you can do in Pandas in our Python Data Wrangling Tutorial.
We won’t go through the entire exploratory analysis phase right now. Instead, let’s just
take a quick glance at the distributions of our variables. We’ll start with the “X1” variable,
which refers to “Relative Compactness” as described in the file’s data dictionary.
To plot a histogram of X1:

plt.hist(df.X1)
In general, these types of functions will have different parameters that you can pass into
them. Those parameters control things like the color scheme, the number of bins used,
the axes, and so on.
There’s no need to memorize all of the parameters. Instead, get in the habit of checking
the documentation page for available options. For example, the documentation page of
plt.hist() indicates that you can change the number of bins in the histogram:
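For example, something like this would do it (the bin count of 20 is arbitrary, purely for illustration):

plt.hist(df.X1, bins=20)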
For now, we don’t recommend trying to get too fancy with matplotlib. It’s a powerful, but
complex library.
Instead, we prefer a library that’s built on top of matplotlib called seaborn. If matplotlib
“tries to make easy things easy and hard things possible”, seaborn tries to make a well-
defined set of hard things easy as well.
Learn more about it in our Seaborn Data Visualization Tutorial.
Even so, for illustrative purposes, let’s at least check for missing values. You can do so
with just one line of code (but there’s a ton of cool stuff packed into this one line).
df.isnull().sum()
df is where we stored the data. It’s called a “dataframe,” and it’s also a Python
object, like the variables from Step 4.
.isnull() is called a method, which is just a fancy term for a function attached to an
object. This method looks through our entire dataframe and labels any cell with a
missing value as True. (Tip: Try running df.head().isnull() and see what you get!)
Finally, .sum() is a method that sums all of the True values across each column.
Well… technically, it sums any number, while treating True as 1 and False as 0.
You can learn more about .isnull() and .sum() on the documentation page for Pandas
dataframes.
Consider a “highest education level” categorical feature. It’s also ordinal. In other words, its classes have an implied order to them. For example, 'college' implies more schooling than 'high school'.
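If you ever need to encode an ordinal feature like that yourself, one common approach is to map each class to an integer that preserves the order. A minimal sketch with made-up data (this feature is not part of our dataset):

# Hypothetical ordinal feature
education = pd.Series(['high school', 'college', 'high school'])

# Map each class to an integer that preserves the order
education_encoded = education.map({'high school': 1, 'college': 2})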
A problem arises when categorical features are not ordinal. In fact, we have this problem
in our current dataset.
If you remember from its data dictionary, features X6 (Orientation) and X8 (Glazing Area
Distribution) are actually categorical. For example, X6 has four possible values:
2 == 'north',
3 == 'east',
4 == 'south',
5 == 'west'
However, in the current way it’s encoded (i.e. as four integers), an algorithm will interpret “east” as “1 more than north” and “west” as “2 more than east.”
Therefore, we should create dummy variables for X6 and X8. These are brand new
input features that only take the value of 0 or 1. You’d create one dummy per unique class
for each feature.
So for X6, we’d create four variables—X6_2, X6_3, X6_4, and X6_5—that represent its
four unique classes. We can do this for both X6 and X8 in one fell swoop:
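Here’s a minimal sketch of that step, assuming we want Pandas’ get_dummies() to replace the original columns within df:

# Create dummy variables for X6 and X8 in one shot
df = pd.get_dummies(df, columns=['X6', 'X8'])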
(Tip: after running this code, try running df.head() again. Is it what you expected?)
We won’t cover any more feature engineering for now, but you can get a checklist of
specific ideas in our Guide to Feature Engineering Best Practices.
After just a few short steps, we’re actually ready to train a model. But before we jump in,
just a quick disclaimer: we won’t be using model training best practices for now. Instead,
this code is simplified to the extreme. But it’s super helpful to start with these “toy
problems” as learning tools.
Before we do anything else, let’s split our dataset into separate objects for our input
features (X) and the target variable (y). The target variable is simply what we wish to
predict with our model.
Let’s predict “Y1,” a building’s “Heating Load.”
# Target variable
y = df.Y1

# Input features
X = df.drop(['Y1', 'Y2'], axis=1)
In the first line of code, we’re copying Y1 from the dataframe into a separate y object.
Then, in the second line of code, we’re copying all of the variables except Y1 and Y2 into
the X object.
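Now let’s train the model. Here’s a minimal sketch of the two steps described next (the instance name model is an assumption, chosen to match the prediction code later):

# Initialize a model instance
model = LinearRegression()

# Fit the model to our input features and target variable
model.fit(X, y)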
First, we initialize a model instance. Think of this as a single “version” of the model. For example, if you wanted to train a separate model and compare them, you could initialize a separate instance (e.g. model_2 = LinearRegression()).
Then, we call the .fit() method and pass the input features (X) and target variable (y) as
parameters.
There are many cool mechanics working under the hood, but that’s basically all you need
to create a basic model. In fact, you can get predictions and calculate the model’s R^2
like so:
from sklearn.metrics import r2_score

# Get model R^2 (note: r2_score expects the true values first)
y_hat = model.predict(X)
r2_score(y, y_hat)
# roughly 0.91
Congratulations! You are now officially up and running with Python for data science.

This was a great start, and you’re well on your way to learning the rest!
Next Steps
As mentioned earlier, we’ve just scratched the surface. Even so, hopefully you’ve seen
how easy it is to just get started.
Just get started, and don’t overthink it. Data science has a lot of moving pieces, so just
take it one step at a time.
From here, there are three routes you can go for next steps. You’ll want to do all three of
them eventually, but you can take them in any order.
Strike while the iron is hot, and keep practicing with the other tutorials on this site.
Shore up programming fundamentals and your Python skills with our Self-Study Guide to
Learning Python for Data Science.