
DATA SCIENCE IN PYTHON

PART 1:
Data Prep &
Exploratory
Data Analysis
With Expert Data Science Instructor Alice Zhao

*Copyright Maven Analytics, LLC


ABOUT THIS SERIES

This is Part 1 of a 5-Part series designed to take you through several applications of data science
using Python, including data prep & EDA, regression, classification, unsupervised learning & NLP

PART 1: Data Prep & EDA
PART 2: Regression
PART 3: Classification
PART 4: Unsupervised Learning
PART 5: Natural Language Processing


COURSE OUTLINE

1. Intro to Data Science – Introduce the field of data science, review essential skills, and introduce each phase of the data science workflow
2. Scoping a Project – Review the process of scoping a data science project, including brainstorming problems and solutions, choosing techniques, and setting clear goals
3. Installing Jupyter Notebook – Install Anaconda and introduce Jupyter Notebook, the user-friendly coding environment we’ll use for writing Python code
4. Gathering Data – Read flat files into a Pandas DataFrame in Python, and review common data sources & formats, including Excel spreadsheets and SQL databases
5. Cleaning Data – Identify and convert data types, find and fix common data issues like missing values, duplicates, and outliers, and create new columns for analysis


COURSE OUTLINE

6. Exploratory Data Analysis – Explore datasets to discover insights by sorting, filtering, and grouping data, then visualize it using common chart types like scatterplots & histograms
7. MID-COURSE PROJECT – Put your skills to the test by cleaning, exploring and visualizing data from a brand-new data set containing Rotten Tomatoes movie ratings
8. Preparing for Modeling – Structure your data so that it’s ready for machine learning models by creating a numeric, non-null table and engineering new features
9. FINAL COURSE PROJECT – Apply all the skills learned throughout the course by gathering, cleaning, exploring, and preparing multiple data sets for Maven Music


INTRODUCING THE COURSE PROJECT

THE SITUATION: You’ve just been hired as a Jr. Data Scientist for Maven Music, a streaming service that’s been losing more customers than usual over the past few months and would like to use data science to figure out how to reduce customer churn.

THE ASSIGNMENT: You’ll have access to data on Maven Music’s customers, including subscription details and music listening history. Your task is to gather, clean, and explore the data to provide insights about the recent customer churn issues, then prepare it for modeling in the future.

THE OBJECTIVES:
1. Scope the data science project
2. Gather the data in Python
3. Clean the data
4. Explore & visualize the data
5. Prepare the data for modeling


SETTING EXPECTATIONS

This course covers data gathering, cleaning and exploratory data analysis
• We’ll review common techniques for gathering, cleaning and analyzing data with Python, but will not
cover more complex data formats or advanced statistical tools

We will NOT be applying machine learning models in this course


• This course will focus on preparing raw data for deeper analysis and modeling; we will introduce and apply
supervised and unsupervised machine learning algorithms in-depth later in this series

We’ll use Jupyter Notebook as our primary coding environment


• Jupyter Notebook is free to use, and the industry standard for conducting data analysis with Python

You do NOT need to be a Python expert to take this course


• We strongly recommend completing the Maven Analytics Data Analysis with Python and Pandas
course before this one, or having some familiarity working with Pandas DataFrames



INTRO TO DATA SCIENCE



INTRO TO DATA SCIENCE

In this section we’ll introduce the field of data science, discuss how it compares to
other data fields, and walk through each phase of the data science workflow

TOPICS WE’LL COVER:
• What is Data Science?
• Essential Skills
• What is Machine Learning?
• Data Science Workflow

GOALS FOR THIS SECTION:
• Compare roles under the broader data analytics umbrella
• Discuss essential skills of a data scientist
• Compare data science and machine learning
• Introduce supervised and unsupervised learning, and commonly used algorithms
• Review each phase of the data science workflow


WHAT IS DATA SCIENCE?

Data science is about using data to make smart decisions

Wait, isn’t that data analysis / business intelligence / predictive analytics?

Yes! The differences lie in the types of problems you solve, and the tools and techniques you use to solve them:

What happened?
• Descriptive Analytics
• Data Analysis
• Business Intelligence

What’s going to happen?
• Predictive Analytics
• Data Mining
• Data Science


DATA SCIENCE SKILL SET

Data science requires a blend of coding, math, and domain expertise. (The classic skills Venn diagram: coding + math = machine learning; math + domain expertise = traditional research; coding + domain expertise without math = danger zone!; all three together = data science.)

The key is in applying these along with soft skills like:
• Communication
• Problem solving
• Curiosity & creativity
• Grit
• Googling prowess

Data scientists & analysts approach problem solving in similar ways, but data scientists will often work with larger, more complex data sets and utilize advanced algorithms


WHAT IS MACHINE LEARNING?

Machine learning enables computers to learn and make decisions from data

How can a computer learn? By using algorithms – sets of instructions for a computer to follow

How does this compare with data science? Data scientists know how to apply algorithms, meaning they’re able to tell a computer how to learn from data


SUPERVISED VS. UNSUPERVISED LEARNING

Machine learning algorithms fall into two broad categories:
supervised learning and unsupervised learning

Supervised Learning – Using historical data to predict the future
• What will house prices look like for the next 12 months?
• How can I flag suspicious emails as spam?

Unsupervised Learning – Finding patterns and relationships in data
• How can I segment my customers?
• Which TV shows should I recommend to each user?


COMMON ALGORITHMS

These are some of the most common machine learning algorithms that data scientists use in practice:

Supervised Learning:
• Regression: Linear Regression, Regularized Regression, Time Series
• Classification: KNN, Logistic Regression, Tree-Based Models, Naïve Bayes (NLP)
• Others: support vector machines, neural networks, deep learning, etc.

Unsupervised Learning:
• K-Means Clustering, Hierarchical Clustering
• Anomaly Detection
• Matrix Factorization, Principal Components Analysis
• Recommender Systems, Topic Modeling (NLP)
• Others: DBSCAN, t-SNE, factor analysis, association rule mining, etc.

NOTE: Another category of machine learning is called reinforcement learning, which is more commonly used in robotics and gaming. Other fields like deep learning and natural language processing utilize both supervised and unsupervised learning techniques.


DATA SCIENCE WORKFLOW

The data science workflow consists of scoping a project; gathering, cleaning, and exploring the data; building models; and sharing insights with end users:

1. Scoping a Project
2. Gathering Data
3. Cleaning Data
4. Exploring Data
5. Modeling Data
6. Sharing Insights

This is not a linear process! You’ll likely go back to further gather, clean and explore your data


STEP 1: SCOPING A PROJECT

Projects don’t start with data, they start with a clearly defined scope:
• Who are your end users or stakeholders?
• What business problems are you trying to help them solve?
• Is this a supervised or unsupervised learning problem? (do you even need data science?)
• What data do you need for your analysis?


STEP 2: GATHERING DATA

A project is only as strong as the underlying data, so gathering the right data is essential to set a proper foundation for your analysis

Data can come from a variety of sources, including:
• Files (flat files, spreadsheets, etc.)
• Databases
• Websites
• APIs
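As a sketch of what gathering data looks like in practice, here’s how a flat file might be read into a Pandas DataFrame; the column names and values below are hypothetical stand-ins, and in a real project you’d pass a file path (e.g. a .csv on disk) instead of an in-memory string:

```python
import pandas as pd
from io import StringIO

# A hypothetical flat file; pd.read_csv accepts file paths, URLs,
# or file-like objects such as this StringIO buffer
csv_data = StringIO(
    "customer_id,plan,monthly_payment\n"
    "1,Premium,9.99\n"
    "2,Free,0.00"
)

# Read the flat file into a DataFrame
customers = pd.read_csv(csv_data)

print(customers.shape)  # (2, 3)
```

The same `pd.read_csv` call (or its siblings like `pd.read_excel` and `pd.read_sql`) covers most of the file, spreadsheet, and database sources listed above.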



STEP 3: CLEANING DATA

A popular saying within data science is “garbage in, garbage out”, which means that cleaning data properly is key to producing accurate and reliable results

Data cleaning tasks may include:
• Correcting data types
• Imputing missing data
• Dealing with data inconsistencies
• Reformatting the data

Building models is the flashy part of data science; cleaning data is less fun, but very important (data scientists estimate that around 50-80% of their time is spent here!)
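The cleaning tasks above can be sketched in Pandas. The table below is a hypothetical raw extract with three common issues: a numeric column stored as text, missing payment values, and a duplicated row:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with common issues
raw = pd.DataFrame({
    "customer": ["Aria", "Chord", "Chord", "Melody"],
    "months_active": ["25", "2", "2", "14"],          # stored as text
    "monthly_payment": [9.99, np.nan, np.nan, 14.99], # missing values
})

clean = (
    raw.drop_duplicates()  # remove the duplicated row
       .assign(
           # correct the data type from text to integer
           months_active=lambda df: df["months_active"].astype(int),
           # impute missing payments with 0
           monthly_payment=lambda df: df["monthly_payment"].fillna(0),
       )
)

print(clean)
```

Which imputation value is appropriate (zero, a mean, a flag column) depends on the data; this sketch just shows the mechanics.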



STEP 4: EXPLORING DATA

Exploratory data analysis (EDA) is all about exploring and understanding the data you’re working with before building models

EDA tasks may include:
• Slicing & dicing the data
• Summarizing the data
• Visualizing the data

A good number of the final insights that you share will come from the exploratory data analysis phase!
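As a minimal sketch of the slicing and summarizing tasks in Pandas (the listening data below is made up for illustration):

```python
import pandas as pd

listening = pd.DataFrame({
    "customer": ["Aria", "Aria", "Chord", "Melody", "Melody"],
    "genre": ["Pop", "Indie", "Pop", "Podcast", "Pop"],
    "hours": [4.0, 1.5, 3.0, 2.0, 0.5],
})

# Slicing: filter down to one genre
pop_only = listening[listening["genre"] == "Pop"]

# Summarizing: total listening hours per customer, sorted
hours_by_customer = (
    listening.groupby("customer")["hours"]
             .sum()
             .sort_values(ascending=False)
)

print(hours_by_customer)
```

Visualizing would typically follow the same pattern, e.g. `hours_by_customer.plot(kind="bar")` to turn the summary into a chart.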



STEP 5: MODELING DATA

Modeling data involves structuring and preparing data for specific modeling techniques, and applying algorithms to make predictions or discover patterns

Data modeling tasks may include:
• Restructuring the data
• Feature engineering (adding new fields)
• Applying machine learning algorithms

With fancy new algorithms introduced every year, you may feel the need to learn and apply the latest and greatest techniques. In practice, simple is best; businesses & leadership teams appreciate solutions that are easy to understand, interpret and implement.
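One common restructuring step is converting text columns into numeric ones so the table is fully numeric and non-null, ready for a model. A sketch with a hypothetical table, using Pandas dummy (one-hot) columns:

```python
import pandas as pd

# Hypothetical customer table with a text column
customers = pd.DataFrame({
    "months_active": [25, 2, 14],
    "plan": ["Premium", "Free", "Premium"],
})

# Feature engineering: replace the text "plan" column with
# one dummy column per plan value, leaving an all-numeric table
model_ready = pd.get_dummies(customers, columns=["plan"])

print(model_ready.columns.tolist())
```

The resulting `plan_Free` / `plan_Premium` columns are new fields a model can accept directly; this kind of preparation is covered in depth later in the course.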



STEP 6: SHARING INSIGHTS

The final step of the workflow involves summarizing your key findings and sharing insights with end users or stakeholders:
• Reiterate the problem
• Interpret the results of your analysis
• Share recommendations and next steps
• Focus on potential impact, not technical details

Even with all the technical work that’s been done, it’s important to remember that the focus here is on non-technical solutions.

NOTE: Another way to share results is to deploy your model, or put it into production


DATA PREP & EDA

DATA PREP & EDA (this course) covers the early steps of the workflow; REGRESSION, CLASSIFICATION, UNSUPERVISED LEARNING, and NLP (the rest of the series) cover the modeling steps and beyond.

Data prep and EDA is a critical part of every data science project, and should always come before applying machine learning algorithms


KEY TAKEAWAYS

Data science is about using data to make smart decisions


• Supervised learning techniques use historical data to predict the future, and unsupervised learning
techniques use algorithms to find patterns and relationships

Data scientists have both coding and math skills along with domain expertise
• In addition to technical expertise, soft skills like communication, problem-solving, curiosity, creativity, grit,
and Googling prowess round out a data scientist’s skillset

The data science workflow starts with defining a clear scope


• Once the project scope is defined, you can move on to gathering and cleaning data, performing exploratory data
analysis, preparing data for modeling, applying algorithms, and sharing insights with end users

Much of a data scientist’s time is spent cleaning & preparing data for analysis
• Properly cleaning and preparing your data ensures that your results are accurate and meaningful (garbage in,
garbage out!)



SCOPING A PROJECT



SCOPING A PROJECT

In this section we’ll discuss the process of scoping a data science project, from
understanding your end users to deciding which tools and techniques to deploy

TOPICS WE’LL COVER:
• Project Scoping Steps
• Thinking Like an End User
• Problems & Solutions
• Modeling Techniques
• Data Requirements
• Summarizing the Scope

GOALS FOR THIS SECTION:
• Outline the key steps for defining a clear and effective data science project scope
• Learn how to collaborate with end users to understand the business context, explore potential problems and solutions, and align on goals
• Identify which types of solutions would require supervised vs. unsupervised techniques
• Review common data requirements, including structure, features, sources and scope


SCOPING A PROJECT

Scoping a data science project means clearly defining the goals, techniques, and data sources you plan to use for your analysis

Scoping steps:
1. Think like an end user
2. Brainstorm problems
3. Brainstorm solutions
4. Determine the techniques
5. Identify data requirements
6. Summarize the scope & objectives

Project scoping is one of the most difficult yet important steps of the data science workflow. If a project is not properly scoped, a lot of time can be wasted going down a path to create a solution that solves the wrong problem.

We’ll be scoping the course project in this section together to set you up for success throughout the rest of the course


THINK LIKE AN END USER

An end user (also known as a stakeholder) is a person, team, or business that will ultimately benefit from the results of your analysis

When introduced to a new project, start by asking questions that help you:
• Empathize with the end user – what do they care about?
• Focus on impact – what metrics are important to them?

Debbie Dayda (Data Science Team): What’s the situation?
Carter Careswell (Customer Care Team): People keep cancelling their streaming music subscriptions
Debbie: Why is this a major issue?
Carter: Our monthly revenue growth is slowing
Debbie: What would success look like for you?
Carter: Decreasing cancellation rate by 2% would be a big win


BRAINSTORM PROBLEMS

Before thinking about potential solutions, it helps to brainstorm problems with the end user to identify improvements and opportunities
• The ideas do not have to be data science related
• The sky is the limit – don’t think about your data or resource limitations
• You want to start by thinking big and then whittle it down from there

An example brainstorm between the data scientist and the end user:
• Is our product differentiated from competitors?
• Are technical bugs or limitations to blame?
• Is our product too expensive?
• Do we need to expand our music library?
• Do we even know WHY customers are cancelling?

“That’s a good one – let’s focus on that problem first!”


BRAINSTORM SOLUTIONS

Once you settle on a problem, the next step is to brainstorm potential solutions

Data science can be one potential solution, but it’s not the only one:
• PROS: Solutions are backed by data, may lead to hidden insights, let you make predictions
• CONS: Projects take more time and specialized resources, and can lead to complex solutions

POTENTIAL SOLUTIONS:
• Ask the product team to add a survey to capture cancellation feedback
• Speak with customer reps about any changes they’ve noticed in recent customer interactions
• Conduct customer interviews to gather qualitative data and insights
• Research external factors that may be impacting cancellations (competitive landscape, news, etc.)
• Suggest that the leadership team speak with other leaders in the space to compare notes
• Use data to identify the top predictors for account cancellations (this is the only data science solution!)


DETERMINE THE TECHNIQUES

If you decide to take a data science approach to solving the problem, the next step is to determine which techniques are most suitable:
• Do you need supervised or unsupervised learning?

Supervised Learning – Focused on using historical data to predict the future
Unsupervised Learning – Focused on finding patterns or relationships in the data


KNOWLEDGE CHECK

Classify each of these scenarios under Supervised Learning or Unsupervised Learning:
• Estimating how many customers will visit your website on New Year’s Day
• Looking at the products purchased by the highest-spend customers
• Visualizing cancellation rate over time for various customer segments
• Identifying the main themes mentioned in customer reviews
• Clustering customers into different groups based on their preferences
• Flagging which customers are most likely to cancel their membership


KNOWLEDGE CHECK

Supervised Learning:
• Estimating how many customers will visit your website on New Year’s Day
• Flagging which customers are most likely to cancel their membership

Unsupervised Learning:
• Identifying the main themes mentioned in customer reviews
• Clustering customers into different groups based on their preferences

Neither (analysis tasks that don’t require machine learning):
• Looking at the products purchased by the highest-spend customers
• Visualizing cancellation rate over time for various customer segments


IDENTIFY DATA REQUIREMENTS

After deciding on a technique, the next step is to identify data requirements, including structure, features, sources and scope:

• Structure – You will need to structure your data in different ways, whether you are using supervised or unsupervised techniques
• Features – Brainstorm specific columns or “features” that might provide insight (these will be good inputs for your models)
• Sources – Determine which additional data sources (if any) are required to create or “engineer” those features
• Scope – Narrow down your data to just what you need to get started (you can expand and iterate from there)


DATA STRUCTURE

How you structure your data often depends on which technique you’re using:
• A supervised learning model takes in labeled data (including the outcome you want to predict)
• An unsupervised learning model takes in unlabeled data

Supervised Learning – Focused on using historical data to predict the future; has a “label”
Unsupervised Learning – Focused on finding patterns or relationships in the data; does NOT have a “label”

A “label” is an observed variable which you are trying to predict
(house price ($), spam or not (1/0), chance of failure (probability), etc.)


DATA STRUCTURE

How you structure your data often depends on which technique you’re using:
• A supervised learning model takes in labeled data (the outcome you want to predict)
• An unsupervised learning model takes in unlabeled data

EXAMPLE – Supervised Learning: Predicting which customers are likely to cancel

x variables = inputs = features (what goes into a model)
y variable = output = target (what you are trying to predict)

Each row represents a customer, and each “Cancelled?” value is called the label for the row:

Customer | Months Active | Monthly Payment | Listened to Indie Artists? | Cancelled?
Aria     | 25            | $9.99           | Yes                        | No
Chord    | 2             | $0              | No                         | No
Melody   | 14            | $14.99          | No                         | Yes
Rock     | 12            | $9.99           | Yes                        | ???

When shown a new customer (like Rock), predict whether they will cancel or not
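The labeled table above can be represented in Pandas and split into features (x variables) and the label (y variable); this is a sketch using the slide’s example values:

```python
import pandas as pd

# The labeled table from the slide: features plus the label
data = pd.DataFrame({
    "customer": ["Aria", "Chord", "Melody"],
    "months_active": [25, 2, 14],
    "monthly_payment": [9.99, 0.00, 14.99],
    "indie_listener": ["Yes", "No", "No"],
    "cancelled": ["No", "No", "Yes"],  # <- the label
})

# x variables: what goes into a model (drop identifiers and the label)
X = data.drop(columns=["customer", "cancelled"])

# y variable: what you are trying to predict
y = data["cancelled"]
```

A supervised model would be trained on `X` and `y`, then asked to predict `cancelled` for a new, unlabeled customer like Rock.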



DATA STRUCTURE

How you structure your data often depends on which technique you’re using:
• A supervised learning model takes in labeled data (the outcome you want to predict)
• An unsupervised learning model takes in unlabeled data

EXAMPLE – Unsupervised Learning: Clustering customers based on listening behavior

x variables = inputs = features (what goes into a model)

Each row represents a customer:

Customer | Daily Listening Hours | # Pop Songs | # Soundtracks | # Podcasts
Aria     | 10                    | 50          | 3             | 4
Chord    | 3                     | 28          | 0             | 0
Melody   | 2                     | 0           | 0             | 3
Rock     | 1                     | 4           | 0             | 2

When shown a new customer, figure out which customers it’s most similar to
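To illustrate the similarity idea that unsupervised techniques rely on, here’s a sketch using the slide’s values and plain Euclidean distance; a real project would use a clustering algorithm like k-means rather than this hand-rolled comparison:

```python
import numpy as np

# Unlabeled feature table from the slide:
# [daily listening hours, # pop songs, # soundtracks, # podcasts]
features = np.array([
    [10, 50, 3, 4],  # Aria
    [3, 28, 0, 0],   # Chord
    [2, 0, 0, 3],    # Melody
    [1, 4, 0, 2],    # Rock
])
names = ["Aria", "Chord", "Melody", "Rock"]

# A new (hypothetical) customer: which existing customer is most similar?
new_customer = np.array([2, 3, 0, 2])
distances = np.linalg.norm(features - new_customer, axis=1)
most_similar = names[int(np.argmin(distances))]

print(most_similar)
```

Notice there is no label column here; the only signal is how close rows are to each other in feature space, which is exactly what clustering algorithms exploit.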



MODEL FEATURES

To identify relevant model features, brainstorm any potential variables that might be useful as model inputs based on your goal, for example:
• Supervised learning – features that will do a good job predicting cancellations
• Unsupervised learning – features that will do a good job differentiating customers

Debbie Dayda (Data Science Team): What factors might help us predict cancellations?
Carter Careswell (Customer Care Team):
• Well, we recently increased the monthly subscription rate
• It could also have something to do with auto-renewals
• I wonder if certain demographics are more likely to cancel
• Maybe a competitor launched a new offer or promotion?


DATA SOURCES

Once you’ve identified a list of relevant features, brainstorm data sources you can use to create or engineer those features:

• Monthly rate, Auto-renew – from customer subscription history and customer listening history (easy to access: internal data)
• Age, Urban vs. Rural – from customer demographics (easy to access: internal data)
• Competitor promotion – from the customer’s other subscriptions and competitor promotion history (hard to access: external data)


DATA SCOPE

Next, narrow the scope of your data to remove sources that are difficult to obtain and prioritize the rest
• Remember that more data doesn’t necessarily mean a better model!

Data Sources: customer subscription history, customer listening history
Timeframe: past three months of data
Customers: only customers who actively subscribed, not those who were grandfathered in

PRO TIP: Aim to produce a minimum viable product (MVP) at this stage – if you go too deep without feedback, you risk heading down unproductive paths or over-complicating the solution


SUMMARIZE THE SCOPE & OBJECTIVES

The final step is to clearly summarize the project scope and objectives:
• What techniques and data do you plan to leverage?
• What specific impact are you trying to make?

Debbie Dayda (Data Science Team): We plan to use supervised learning to predict which customers are likely to cancel their subscription, using the past three months of subscription and listening history. This will allow us to:
• Identify the top predictors for cancellation and figure out how to address them
• Use the model to flag customers who are likely to cancel and take proactive steps to keep them subscribed
Our goal is to reduce cancellations by 2% over the next year.

Carter Careswell (Customer Care Team): Sweet!


KEY TAKEAWAYS

When scoping a project, start by thinking like an end user


• Take time to sit down with end users or stakeholders to understand the situation and business context,
brainstorm potential problems and solutions, and align on key objectives

Don’t limit yourself when brainstorming ideas for problems, solutions, or data
• Start by thinking big and whittling ideas down from there, and keep in mind that many potential solutions
likely won’t require data science

Supervised & unsupervised models require different data structures


• Supervised models use labeled data which includes a target variable for the model to predict, while
unsupervised models use unlabeled data to find patterns or relationships

Leverage the MVP approach when working on data science projects


• Start with something relatively quick and simple as a proof of concept, then continuously iterate to refine and
improve your model once you know you’re on the right track



INSTALLING JUPYTER NOTEBOOK



INSTALLING JUPYTER NOTEBOOK

In this section we’ll install Anaconda and introduce Jupyter Notebook, a user-friendly
coding environment where we’ll be coding in Python

TOPICS WE’LL COVER:
• Why Python?
• Installation & Setup
• Notebook Interface
• Code vs Markdown Cells
• Helpful Resources

GOALS FOR THIS SECTION:
• Install Anaconda and launch Jupyter Notebook
• Get comfortable with the Jupyter Notebook environment and interface


WHY PYTHON?

Python is the most popular programming language used by data scientists around the world due to its:

• Scalability – Unlike some data tools or self-service platforms, Python is open source, free, and built for scale
• Versatility – With powerful libraries and packages, Python can add value at every stage of the data science workflow, from data prep to data viz to machine learning
• Automation – Python can automate workflows & complex tasks out of the box, without complicated integrations or plug-ins
• Community – Become part of a large and active Python user community, where you can share resources, get help, offer support, and connect with other users


INSTALL ANACONDA (MAC)

1) Go to anaconda.com/products/distribution and click the download button
2) Launch the downloaded Anaconda pkg file
3) Follow the installation steps (default settings are OK)


INSTALL ANACONDA (PC)

1) Go to anaconda.com/products/distribution and click the download button
2) Launch the downloaded Anaconda exe file
3) Follow the installation steps (default settings are OK)


LAUNCHING JUPYTER NOTEBOOK

1) Launch Anaconda Navigator
2) Find Jupyter Notebook and click “Launch”


YOUR FIRST JUPYTER NOTEBOOK

1) Once inside the Jupyter interface, create a folder to store your notebooks for the course
NOTE: You can rename your folder by clicking “Rename” in the top left corner

2) Open your new coursework folder and launch your first Jupyter notebook!
NOTE: You can rename your notebook by clicking on the title at the top of the screen


THE NOTEBOOK SERVER

NOTE: When you launch a Jupyter notebook, a terminal window may pop up as well; this is called a notebook server, and it powers the notebook interface

If you close the server window, your notebooks will not run!

Depending on your OS and method of launching Jupyter, one may not open. As long as you can run your notebooks, don’t worry!


THE NOTEBOOK INTERFACE

• Menu Bar – Options to manipulate the way the notebook functions
• Toolbar – Buttons for the most-used actions within the notebook
• Mode Indicator – Displays whether you are in Edit Mode or Command Mode
• Code Cell – Input field where you will write and edit new code to be executed


MENU OPTIONS

• File – Save or revert, make a copy, open a notebook, download, etc.
• Edit – Edit cells within your notebook (while in command mode)
• View – Edit cosmetic options for your notebook
• Insert – Insert new cells into your notebook


MENU OPTIONS

• Cell – Access options for running the cells in your notebook
• Kernel – Interact with the instance of Python that runs your code
• Widgets – Manage interactive elements, or “widgets”, in your notebook
• Help – View or edit keyboard shortcuts and access Python reference pages


THE TOOLBAR

The toolbar provides easy access to the most-used notebook actions
• These actions can also be performed using hotkeys (keyboard shortcuts):

Save & Checkpoint: S
Insert cell below: B
Cut, Copy & Paste: X, C, V
Move cells up/down
Run cell & select below: SHIFT + ENTER
Interrupt the kernel: I, I
Restart kernel & rerun code: 0, 0
Open the command palette: CTRL + SHIFT + F
Change cell type: Y (code), M (markdown), R (raw)

Shortcuts may differ depending on which mode you are in


EDIT & COMMAND MODES

EDIT MODE is for editing content within cells, and is indicated by green highlights and a pen icon

COMMAND MODE is for editing the notebook, and is indicated by blue highlights and no icon


THE CODE CELL

The code cell is where you’ll write and execute Python code
In edit mode, the cell will be highlighted green and a pencil icon will appear

Type some code, and click Run to execute
• In[]: Our code (input)
• Out[]: What our code produced (output)*
*Note: not all code has an output!


THE CODE CELL

The code cell is where you’ll write and execute Python code
Click back into the cell (or use the up arrow) and press SHIFT + ENTER to rerun the code

Note that our output hasn’t changed, but the number in the brackets increased from 1 to 2. This is a cell execution counter, which indicates how many cells you’ve run in the current session.

If the cell is still processing, you’ll see In[*]

The cell counter will continue to increment as you run additional cells


COMMENTING CODE

Comments are lines of code that start with ‘#’ and do not run
Why Python? • They are great for explaining portions of code for others who may use or review it
• They can also serve as reminders for yourself when you revisit your code in the future
Think about your audience when commenting your code (you may not need to explain basic arithmetic to an experienced Python programmer)

Be conscious of over-commenting, which can actually make your code even more difficult to read

Comments should explain individual cells or lines of code, NOT your entire workflow – we have better tools for that!


THE MARKDOWN CELL

Markdown cells let you write structured text passages to explain your workflow,
provide additional context, and help users navigate the notebook
To create a markdown cell:

1. Create a new cell above the top cell (press A with the cell selected)
2. Select “Markdown” in the cell menu (or press M)

This is now a markdown cell (notice that the In[]: disappeared)


MARKDOWN SYNTAX

Markdown cells use a special text formatting syntax






HELPFUL RESOURCES

Google your questions – odds are someone else has asked the same thing and it has been answered (include Python in the query!)

Stack Overflow is a public coding forum that will likely have the answers to most of the questions you’ll search for on Google
https://stackoverflow.com/

The Official Python Documentation is a great “cheat sheet” for library and language references
https://docs.python.org/3/

There are many quality Python & Analytics blogs on the web, and you can learn a lot by subscribing and reviewing the concepts and underlying code
https://towardsdatascience.com/


KEY TAKEAWAYS

Jupyter Notebook is a user-friendly coding environment


• Jupyter Notebook is popular among data scientists since it allows you to create and document entire
machine learning workflows and render outputs and visualizations on screen

Code cells are where you write and execute Python code
• Make sure that you know how to run, add, move, and remove cells, as well as how to restart your kernel or
stop the code from executing

Use comments and markdown cells for documentation


• Comments should be used to explain specific portions of your code, and markdown should be used to
document your broader workflow and help users navigate the notebook



GATHERING DATA



GATHERING DATA

In this section we’ll cover the steps for gathering data, including reading files into Python,
connecting to databases within Python, and storing data in Pandas DataFrames

TOPICS WE’LL COVER:
• Data Gathering Process
• Reading Files Into Python
• Connecting to a Database
• Exploring DataFrames

GOALS FOR THIS SECTION:
• Understand the data gathering process and the various ways that data can be stored and structured
• Identify the characteristics of a Pandas DataFrame
• Use Pandas to read in flat files and Excel spreadsheets, and connect to SQL databases
• Quickly explore a Pandas DataFrame


DATA GATHERING PROCESS

The data gathering process involves finding data, reading it into Python,
transforming it if necessary, and storing the data in a Pandas DataFrame
1) Find the data – Raw data can come in many shapes and forms
2) Read in the data – Apply transformations using Python, if necessary
3) Store the data – Tables in Python are stored as Pandas DataFrames (more on this later!)


DATA SOURCES

Data can come from a variety of sources:

Local files – You can read data from a file stored on your computer
Common examples: flat files (.csv, .txt, etc.), spreadsheets (.xlsx, etc.)

Databases – You can connect to a database and write queries
Common examples: SQL databases, NoSQL databases

Web access – You can programmatically extract data from websites
Common examples: web scraping (.html, etc.), APIs (.json, .xml, etc.)

Python is a great tool for data gathering due to its ability to read in and transform data coming from a wide variety of sources and formats


STRUCTURED VS UNSTRUCTURED DATA

Structured – Already stored as tables (.xlsx, .db)
Semi-Structured – Easily converted to tables (.csv, .json)
Unstructured – Not easily converted to tables (.pdf, .jpg)

To do analysis, data needs to be structured as a table with rows and columns
• Most data sources are already organized in this way and are ready for analysis
• Unstructured data sources require additional transformations before doing analysis


THE PANDAS DATAFRAME

To do analysis in Python, data must reside within a Pandas DataFrame


• DataFrames look just like regular tables with rows and columns, but with additional features
• External structured and semi-structured data can be read directly into a DataFrame

The row index assigns a unique ID to each row (the index starts at 0)

Each column contains data of a single data type (integer, text, etc.)


READING DATA INTO PYTHON

Pandas allows you to read in data from multiple file formats:


pd.read_csv – Reads in data from a delimiter-separated flat file: pd.read_csv(file path, sep, header)
pd.read_excel – Reads in data from a Microsoft Excel file: pd.read_excel(file path, sheet_name)
pd.read_json – Reads in data from a JSON file: pd.read_json(file path)

‘pd’ is the standard alias for the Pandas library


READING FLAT FILES

pd.read_csv() lets you read flat files by specifying the file path

pd.read_csv(file path, sep, header)

• file path – The file location, name and extension (examples: ‘data.csv’, ‘sales/october.tsv’)
• sep – The column delimiter (examples: sep=‘,’ (default), sep=‘\t’)
• header – If the data has column headers in the first row (examples: header=‘infer’ (default), header=None (no headers))

PRO TIP: Place the file in the same folder as the Jupyter Notebook
so you don’t have to specify the precise file location

For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
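The original slide shows a screenshot of this call; here is a minimal sketch of pd.read_csv() instead. It reads from an in-memory text buffer rather than a real file, and the column names and values are made up for illustration:

```python
# Minimal sketch of pd.read_csv(); a real call would pass a file path
# like 'data.csv' in place of the StringIO buffer (illustrative data)
import io
import pandas as pd

csv_text = "country,score\nFinland,7.8\nDenmark,7.6\n"
df = pd.read_csv(io.StringIO(csv_text), sep=",", header="infer")

print(df.shape)  # (2, 2) -> 2 rows, 2 columns
```

The sep and header arguments are shown with their default values here, so they could be omitted.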


READING EXCEL FILES

pd.read_excel() lets you read Excel files by specifying the file path
• You can use the “sheet_name” argument to specify the worksheet (default is 0 – first sheet)


CONNECTING TO A SQL DATABASE

To connect to a SQL database, import a database driver and specify the database
connection, then use pd.read_sql() to query the database using SQL code
SQL Software → Database Driver:
• SQLite → sqlite3
• MySQL → mysql.connector
• Oracle → cx_Oracle
• PostgreSQL → psycopg2
• SQL Server → pyodbc
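As a hedged sketch of this pattern, the built-in sqlite3 driver can be paired with pd.read_sql(). The table and column names below are made up for illustration, and a throwaway in-memory database stands in for a real one:

```python
# Sketch: import a database driver, specify the connection, then query
# with pd.read_sql() (in-memory SQLite; 'customers' is a hypothetical table)
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob')")

df = pd.read_sql("SELECT * FROM customers", conn)
conn.close()

print(df.shape)  # (2, 2)
```

For other SQL software, only the driver import and connection line change; the pd.read_sql() call stays the same.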



QUICKLY EXPLORE A DATAFRAME

After reading data into Python, it’s common to quickly explore the DataFrame to
make sure the data was imported correctly
• df.head() – Display the first five rows of a DataFrame
• df.shape – Display the number of rows and columns of a DataFrame
• df.count() – Display the number of values in each column
• df.describe() – Display summary statistics like mean, min and max
• df.info() – Display the non-null values and data types of each column
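A quick sketch of these methods on a tiny made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Tiny illustrative DataFrame (values are made up)
df = pd.DataFrame({"age": [25, 30, np.nan], "city": ["NYC", "LA", "NYC"]})

print(df.head())      # first (up to five) rows
print(df.shape)       # (3, 2) -> (rows, columns)
print(df.count())     # non-missing values per column: age 2, city 3
print(df.describe())  # summary stats for the numeric column
df.info()             # data types and non-null counts
```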





ASSIGNMENT: READ A FILE INTO PYTHON

Key Objectives
1. Read in data from a .csv file
2. Store the data in a DataFrame
3. Quickly explore the data in the DataFrame

NEW MESSAGE
June 1, 2023
From: Anna Analysis (Senior Data Scientist)
Subject: New Survey Data

Hi there,


I hear you’re the new junior data scientist on the team.
Welcome!

Our polling partners just finished collecting survey data on the


happiness levels in each country. I’ve attached the file.

Can you load the data into Python and confirm the number of
rows in the data and the range of the happiness scores? Then
we can walk through next steps together.

Thanks!
Anna

happiness_survey_data.csv



KEY TAKEAWAYS

Python can read data from your computer, a database, or the internet
• Flat files & spreadsheets are typically stored locally on a computer, but as a data scientist you’ll likely need to
connect to and query a SQL database, and potentially collect data through web scraping or an API

Data needs to be organized into rows & columns for analysis


• Flat files, spreadsheets and SQL tables are already in this format, but if your data is unstructured you would
need to extract the relevant data and transform it into rows and columns yourself

Python tables are known as Pandas DataFrames


• DataFrames include a unique row index and each column must be the same data type (integer, text, etc.)
• There are several ways to quickly explore the characteristics of a DataFrame (head, shape, describe, etc.)



CLEANING DATA



CLEANING DATA

In this section we’ll cover the steps for cleaning data, including converting columns to the
correct data type, handling data issues and creating new columns for analysis

TOPICS WE’LL COVER:
• Data Cleaning Overview
• Data Types
• Data Issues
• Creating New Columns

GOALS FOR THIS SECTION:
• Identify and convert between Python data types
• Find common “messy” data issues and become familiar with the different ways to resolve them
• Create new calculated columns from existing data


DATA CLEANING OVERVIEW

The goal of data cleaning is to get raw data into a format that’s ready for analysis
This includes:
• Converting columns to the correct data types for analysis
• Handling data issues that could impact the results of your analysis
• Creating new columns from existing columns that are useful for analysis

The order in which you complete these data cleaning tasks will vary by dataset, but this is a good starting point

Even though there are automated tools available, doing some manual data cleaning provides a good opportunity to start understanding and getting a good feel for your data


DATA TYPES

When using Pandas to read data, columns are automatically assigned a data type
• Use the .dtypes attribute to view the data type for each DataFrame column
• Note that sometimes numeric columns (int, float) and date & time columns (datetime) aren’t recognized properly by Pandas and get read in as text columns (object) – these need to be converted to be analyzed!

Default data types:
• bool – Boolean, True/False
• int64 – Integers
• float64 – Floating point, decimals
• object – Text or mixed values
• datetime64 – Date and time values

PRO TIP: Use the .info() method as an alternative to show additional information along with the data types
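For example (a made-up DataFrame; note how each column gets exactly one dtype):

```python
import pandas as pd

df = pd.DataFrame({
    "flag": [True, False],   # bool
    "count": [1, 2],         # int64
    "price": [1.5, 2.5],     # float64
    "label": ["a", "b"],     # object (text)
})
print(df.dtypes)
```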



CONVERTING TO DATETIME

Use pd.to_datetime() to convert object columns into datetime columns

Note that the missing values are now NaT instead of NaN (more on missing data later!)

PRO TIP: Pandas does a pretty good job of detecting the datetime values in an object column if the text is in a standard format (like “YYYY-MM-DD”), but you can also manually specify the format using the format argument within the pd.to_datetime() function: pd.to_datetime(dt_col, format='%Y-%m-%d')

For a full list of formats, visit: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
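A minimal sketch of the conversion on a made-up object column (note how the missing value becomes NaT):

```python
import pandas as pd

s = pd.Series(["2023-06-01", "2023-06-15", None])     # object column
dates = pd.to_datetime(s)                             # format inferred
explicit = pd.to_datetime(s, format="%Y-%m-%d")       # format specified

print(dates.dtype)         # datetime64[ns]
print(dates.isna().sum())  # 1 -> the None became NaT
```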
CONVERTING TO NUMERIC

Use pd.to_numeric() to convert object columns into numeric columns


• To remove non-numeric characters ($, %, etc.), use str.replace()
Currency data is read in as an object by Python due to the dollar sign ($) and comma separator (,)

An alternative is to use Series.astype() to convert to more specific data types like ‘int’, ‘float’, ‘object’ and ‘bool’, but pd.to_numeric() can handle missing values (NaN), while Series.astype() cannot
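A hedged sketch of the currency case with made-up values:

```python
import pandas as pd

price = pd.Series(["$1,200", "$950", None])  # object due to $ and ,
clean = pd.to_numeric(
    price.str.replace("$", "", regex=False)
         .str.replace(",", "", regex=False)
)
print(clean)  # 1200.0, 950.0, NaN (float64 because of the missing value)
```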



ASSIGNMENT: CONVERTING DATA TYPES

Key Objectives
1. Read in data from the Excel spreadsheet and store it in a Pandas DataFrame
2. Check the data type of each column
3. Convert object columns into numeric or datetime columns as needed

NEW MESSAGE
June 12, 2023
From: Alan Alarm (Researcher)
Subject: Data in Python Request

Hi there,
We just finished collecting survey data from a few thousand
customers who’ve purchased our alarm clocks (see attached).

Can you read the data into Python and make sure the data
type of each column makes sense? We’d like to do quite a few
calculations using the data, so if a column can be converted to
a numeric or datetime column, please do so.

Thanks!
Alan

Alarm Survey Data.xlsx



DATA ISSUES OVERVIEW

Data issues need to be identified and corrected upfront in order to not impact or
skew the results of your analysis
Common “messy” data issues include:
• Missing data
• Inconsistent text & typos
• Duplicates
• Outliers


MISSING DATA

There are various ways to represent missing data in Python


• np.NaN – NumPy’s NaN is the most common representation (values are stored as floats)
• pd.NA – Pandas’ NA is a newer missing data type (values can be stored as integers)
• None – Base Python’s default missing data type (doesn’t allow numerical calculations)

Missing values are treated as np.NaN when data is read into Pandas


IDENTIFYING MISSING DATA

The easiest way to identify missing data is with the .isna() method
• You can also use .info() or .value_counts(dropna=False)
.isna() returns True for any missing values

Use sum() to return the missing values by column, or any(axis=1) to select the rows with missing values
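A small sketch of both patterns on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40], "city": ["NYC", "LA", None]})

print(df.isna())                  # True wherever a value is missing
print(df.isna().sum())            # missing values per column: age 1, city 1
print(df[df.isna().any(axis=1)])  # only the rows containing a missing value
```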



HANDLING MISSING DATA

There are multiple ways to handle missing data:


• Keep the missing data as is
• Remove an entire row or column with missing data
• Impute missing numerical data with a 0 or a substitute like the average, mode, etc.
• Resolve the missing data based on your domain expertise

Examples: missing age values are likely ok to keep; a customer we have no data on may be removed; a missing city that is likely Jacksonville, FL can be resolved; for a missing income, can we calculate an approximate income?

There is no right or wrong way to deal with missing data, which is why it’s important to be thoughtful and deliberate in how you handle it


KEEPING MISSING DATA

You can still perform calculations if you choose to keep missing data

These calculations ignore the missing values


REMOVING MISSING DATA

The .dropna() method removes rows with missing data

Our original data set with all of its rows, including all missing values


REMOVING MISSING DATA

The .dropna() method removes rows with missing data

Removes any rows with NaN values: df.dropna()

Note that the row index is now skipping values, but you can reset the index with df.dropna().reset_index()






REMOVING MISSING DATA

The .dropna() method removes rows with missing data

• Removes any rows with NaN values: df.dropna()
• Removes rows that only have NaN values: df.dropna(how='all')
• Removes rows that don’t have at least “n” values: df.dropna(thresh=n)
• Removes rows with NaN values in a specified column: df.dropna(subset=['column'])
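The variations above can be sketched on a made-up DataFrame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, np.nan, 4], "b": [5, np.nan, 7, 8]})

print(df.dropna())                 # drop rows with any NaN -> 2 rows left
print(df.dropna(how="all"))        # drop rows that are entirely NaN -> 3 rows
print(df.dropna(thresh=2))         # keep rows with at least 2 non-NaN values
print(df.dropna(subset=["b"]))     # drop rows with NaN in column 'b' -> 3 rows
print(df.dropna().reset_index(drop=True))  # re-number the surviving rows
```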



PRO TIP: KEEPING NON-MISSING DATA

You can use .notna() to keep non-missing data instead

Note that using .dropna() or .notna() to remove rows with missing data does not make permanent changes to the DataFrame, so you need to save the output to a new DataFrame (or the same one) or set the argument inplace=True


IMPUTING MISSING DATA

The .fillna() method imputes missing data with an appropriate value


• There are many ways to impute data in a column (zero, mean, mode, etc.), so take a moment to decide the best one – if you’re unsure, reach out to a subject matter expert

Using the median removes the impact of the outlier

What value can be used here?
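A hedged sketch of median imputation with made-up income values (the median is robust to the $1M outlier, unlike the mean):

```python
import numpy as np
import pandas as pd

income = pd.Series([52_000, 48_000, np.nan, 1_000_000])

# Impute the missing value with the median, which the outlier barely affects
filled = income.fillna(income.median())

print(filled.isna().sum())  # 0 -> no missing values remain
```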



RESOLVING MISSING DATA

An alternative to imputing is to resolve missing data with domain expertise


• You can use the .loc[] accessor to update a specific value


ASSIGNMENT: MISSING DATA

Key Objectives
1. Find any missing data
2. Deal with the missing data

NEW MESSAGE
June 13, 2023
From: Alan Alarm (Researcher)
Subject: Missing Data Check

Hi again,

Can you check the file I sent you yesterday for missing data?

Please use your best judgement when choosing how to handle


the missing data and let me know what approaches you
decide to take.

Thanks!
Alan



INCONSISTENT TEXT & TYPOS

Inconsistent text & typos in a data set are represented by values that are either:
• Incorrect by a few digits or characters
• Inconsistent with the rest of a column

Finding inconsistent text and typos within a large data set in Python is not straightforward, as there is no function that will automatically identify these situations

These are full state names, while the rest of the column has abbreviations


IDENTIFYING INCONSISTENT TEXT & TYPOS

While there is no specific method to identify inconsistent text & typos, you can
take the following two approaches to check a column depending on its data type:
Categorical data – Look at the unique values in the column (e.g., values that represent the same thing!)

Numerical data – Look at the descriptive stats of the column (e.g., does this look like a realistic age range?)


HANDLING INCONSISTENT TEXT & TYPOS

You can fix inconsistent text & typos by using:


1. .loc[] to update a value at a particular location
2. np.where() to update values in a column based on a conditional statement
3. .map() to map a set of values to another set of values
4. String methods like str.lower(), str.strip() & str.replace() to clean text data

We’ve already covered the .loc[] accessor to resolve missing data using domain expertise



UPDATE VALUES BASED ON A LOGICAL CONDITION

Use np.where() to update values based on a logical condition


np.where(condition, if_true, if_false)

• np – Calls the NumPy library
• condition – A logical expression that evaluates to True or False
• if_true – Value to return when the expression is True
• if_false – Value to return when the expression is False

This is different from the Pandas where method, which has similar functionality, but different syntax. The NumPy function is used more often than the Pandas method because np.where is vectorized, meaning it executes faster


UPDATE VALUES BASED ON A LOGICAL CONDITION

Use np.where() to update values based on a logical condition


If a State value is equal to ‘Utah’, then replace it with ‘UT’; otherwise, keep the value that was already in the State column



MAP VALUES

Use .map() to map values from one set of values to another set of values

You can pass a dictionary with existing values as the keys, and new values as the values – here, full state names were mapped to state abbreviations

This is similar to creating a lookup table in Excel and using VLOOKUP to search for a value in a column of the table and retrieve a corresponding value from another column
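A hedged sketch with made-up state values (note that values missing from the dictionary would become NaN):

```python
import pandas as pd

states = pd.Series(["Utah", "Florida", "Utah"])

# Keys are the existing values, dictionary values are the replacements
abbrev = states.map({"Utah": "UT", "Florida": "FL"})

print(abbrev.tolist())  # ['UT', 'FL', 'UT']
```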



PRO TIP: CLEANING TEXT

String methods are commonly used to clean text and standardize it for analysis

There is a lot more you can do to clean text data, which will be covered in the Natural Language Processing course
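A small sketch chaining the string methods named above on made-up messy values:

```python
import pandas as pd

s = pd.Series(["  Utah ", "UTAH", "utah."])
clean = (s.str.strip()                         # trim surrounding whitespace
          .str.lower()                         # standardize case
          .str.replace(".", "", regex=False))  # drop stray punctuation

print(clean.tolist())  # ['utah', 'utah', 'utah']
```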



ASSIGNMENT: INCONSISTENT TEXT & TYPOS

Key Objectives
1. Find any inconsistent text and typos
2. Deal with the inconsistent text and typos

NEW MESSAGE
June 14, 2023
From: Alan Alarm (Researcher)
Subject: Inconsistent Text Check

Hi again,

Thanks for your help with the missing data yesterday. I like
how you decided to handle those missing values.

Can you check the same file for inconsistencies in the text and
resolve any issues that you find?

Thanks!
Alan



DUPLICATE DATA

Duplicate data represents the presence of one or more redundant rows that
contain the same information as another, and can therefore be removed
This is duplicate data on the same customer

These are the same values, but they’re not considered duplicate data



IDENTIFYING DUPLICATE DATA

Use .duplicated() to identify duplicate rows of data

This returns True for every row that is a duplicate of a previous row

You can use keep=False to return all the duplicate rows



REMOVING DUPLICATE DATA

Use .drop_duplicates() to remove duplicate rows of data

You may need to reset the index
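A sketch of both methods together on a made-up DataFrame with one redundant row:

```python
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ann", "Bob"],
                   "city": ["NYC", "NYC", "LA"]})

print(df.duplicated())                # True only for the second 'Ann' row
print(df[df.duplicated(keep=False)])  # both copies of the duplicate

deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)  # 2 rows, index renumbered 0..1
```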



ASSIGNMENT: DUPLICATE DATA

Key Objectives
1. Find any duplicate data
2. Deal with the duplicate data

NEW MESSAGE
June 15, 2023
From: Alan Alarm (Researcher)
Subject: Duplicate Data Check

Hi again,

I know the last task around finding inconsistent text was a bit
tricky. This should be more straightforward!

Can you check the same file for duplicate data and resolve any
issues?

Thanks!
Alan



OUTLIERS

An outlier is a value in a data set that is much bigger or smaller than the others

Average income (including outlier) = $4.1M
Average income (excluding outlier) = $82K

If outliers are not identified and dealt with, they can have a notable impact on calculations and models


IDENTIFYING OUTLIERS

You can identify outliers in different ways using plots and statistics

EXAMPLE: Identifying outliers in student grades from a college class, using a histogram, a box plot, or the standard deviation

It’s also important to define what you’ll consider an outlier in each scenario – should we flag 3 or 2 outliers? Use your domain expertise!
HISTOGRAMS

Histograms are used to visualize the distribution (or shape) of a numerical column
• They help identify outliers by showing which values fall outside of the normal range
You can create them with the seaborn library

The height of each bar is how often the value occurs, and the x-axis shows the range of values in the data set – the isolated bars far from the rest are the potential outliers
BOXPLOTS

Boxplots are used to visualize the descriptive statistics of a numerical column


• They automatically plot outliers as dots outside of the min/max data range
The width of the “box” is the interquartile range (IQR), which is the middle 50% of the data (from Q1 to Q3), with the median marked inside

Any value farther away than 1.5*IQR from each side of the box is considered an outlier, plotted as a dot beyond the min/max whiskers


STANDARD DEVIATION

The standard deviation is a measure of the spread of a data set from the mean

A standard deviation of 8.2 indicates a large spread, while a standard deviation of 5.8 indicates a small spread
STANDARD DEVIATION

The standard deviation is a measure of the spread of a data set from the mean
Values at least 3 standard deviations away from the mean are considered outliers
• This is meant for normally distributed, or bell shaped, data
• The threshold of 3 standard deviations can be changed to 2 or 4+ depending on the data

This returns a list of values, or outliers, at least 3 standard deviations away from the mean
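A hedged sketch of the 3-standard-deviation rule on made-up student grades:

```python
import pandas as pd

grades = pd.Series([88, 90, 85, 92, 87, 91, 89, 86, 90, 88, 91, 87, 12])

mean, std = grades.mean(), grades.std()
# Flag values more than 3 standard deviations from the mean
outliers = grades[(grades - mean).abs() > 3 * std]

print(outliers.tolist())  # [12] -> only the failing grade is flagged
```

Note that a single extreme value inflates the standard deviation itself, which is one reason the threshold sometimes needs to be lowered to 2.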



HANDLING OUTLIERS

Like with missing data, there are multiple ways to handle outliers:
• Keep outliers
• Remove an entire row or column with outliers
• Impute outliers with NaN or a substitute like the average, mode, max, etc.
• Resolve outliers based on your domain expertise

How would you handle this?



ASSIGNMENT: OUTLIERS & REVIEW CLEANED DATA

Key Objectives
1. Find any outliers
2. Deal with the outliers
3. Quickly explore the updated DataFrame. How do things look now after handling the data issues compared to the original DataFrame?

NEW MESSAGE
June 16, 2023
From: Alan Alarm (Researcher)
Subject: Outlier Check

Hi again,
I have one last request for you and then I think our data is
clean enough for now.

Can you check the file for outliers and resolve any issues?

Thanks for all your help this week – you rock!

Best,
Alan



CREATING NEW COLUMNS

After cleaning data types & issues, you may still not have the exact data that you
need, so you can create new columns from existing data to aid your analysis
• Numeric columns – calculating percentages, applying conditional calculations, etc.
• Datetime columns – extracting datetime components, applying datetime calculations, etc.
• Text columns – extracting text, splitting into multiple columns, finding patterns, etc.

Original data:
Name     Cost   Date      Notes
Alexis   $0     4/15/23   Coach: great job!
Alexis   $0     4/22/23   Coach: keep it up
Alexis   $25    5/10/23   PT: add strength training
David    $20    5/1/23    Trainer: longer warm up
David    $20    5/10/23   Trainer: pace yourself

After adding 8% tax, extracting the month, and splitting the notes into two columns:
Name     Cost + Tax   Month   Person    Note
Alexis   $0           April   Coach     great job!
Alexis   $0           April   Coach     keep it up
Alexis   $27.00       May     PT        add strength training
David    $21.60       May     Trainer   longer warm up
David    $21.60       May     Trainer   pace yourself

Data is ready for further analysis!
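The three transformations above can be sketched in pandas as follows (the DataFrame below just rebuilds the example values from the slide):

```python
import pandas as pd

# Rebuild the example table (values from the slide)
df = pd.DataFrame({
    "Name": ["Alexis", "Alexis", "Alexis", "David", "David"],
    "Cost": [0, 0, 25, 20, 20],
    "Date": pd.to_datetime(["2023-04-15", "2023-04-22", "2023-05-10",
                            "2023-05-01", "2023-05-10"]),
    "Notes": ["Coach: great job!", "Coach: keep it up", "PT: add strength training",
              "Trainer: longer warm up", "Trainer: pace yourself"],
})

df["Cost + Tax"] = df["Cost"] * 1.08                               # numeric: add 8% tax
df["Month"] = df["Date"].dt.month_name()                           # datetime: extract the month
df[["Person", "Note"]] = df["Notes"].str.split(": ", expand=True)  # text: split into two columns
```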



CALCULATING PERCENTAGES

To calculate a percentage, you can set up two columns with the numerator and
denominator values and then divide them (you can also multiply by 100 if desired)
EXAMPLE: Finding the percentage of total spend for each item

This will be the numerator



CALCULATING PERCENTAGES

EXAMPLE: Finding the percentage of total spend for each item

This will be the denominator (sum = 9.17)



CALCULATING PERCENTAGES

EXAMPLE: Finding the percentage of total spend for each item

These add up to 100%
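A minimal sketch of this calculation in pandas; the item names and costs are hypothetical (chosen so the total spend is 9.17, matching the example):

```python
import pandas as pd

# Hypothetical items and their costs; total spend = 9.17
df = pd.DataFrame({"item": ["tea", "honey", "lemon"],
                   "cost": [4.50, 3.25, 1.42]})

# Divide each cost (numerator) by the total spend (denominator),
# multiplying by 100 to express it as a percentage
df["pct_of_spend"] = df["cost"] / df["cost"].sum() * 100
```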



CALCULATING BASED ON A CONDITION

Use np.where() to create a new column based on a logical condition

If Location is equal to ‘gym’, then increase the Fee by 8% in the Fee with Tax column; otherwise, set it to the existing Fee
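That logic can be sketched with np.where() like this (the Location values and fees are made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Location": ["gym", "home", "gym"],
                   "Fee": [25.0, 10.0, 50.0]})

# If Location is 'gym', add 8% tax to the Fee; otherwise keep the existing Fee
df["Fee with Tax"] = np.where(df["Location"] == "gym",
                              df["Fee"] * 1.08,
                              df["Fee"])
```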



ASSIGNMENT: CREATE COLUMNS FROM NUMERIC DATA

NEW MESSAGE
July 5, 2023
From: Peter Penn (Sales Rep)
Subject: Pen Sales Data

Hello,

I’ve attached the data on our June pen sales.

Can you create two new columns?
• A “Total Spend” column that includes both the pen cost and shipping cost for each sale
• A “Free Shipping” column that says yes if the sale included free shipping, and no otherwise

Thanks!
Peter

Pen Sales Data.xlsx

Key Objectives
1. Read data into Python
2. Check the data type of each column
3. Create a numeric column using arithmetic
4. Create a numeric column using conditional logic



EXTRACTING DATETIME COMPONENTS

Use dt.component to extract a component from a datetime value (day, month, etc.)

Component      Output
dt.date        Date (without time component)
dt.year        Year
dt.month       Numeric month (1-12)
dt.day         Day of the month
dt.dayofweek   Numeric weekday (Mon=0, Sun=6)
dt.time        Time (without date component)
dt.hour        Hour (0-23)
dt.minute      Minute (0-59)
dt.second      Second (0-59)
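For instance, a few of these accessors applied to a small made-up datetime column:

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2023-04-15 09:30:00",
                                           "2023-05-01 18:05:10"])})

df["year"] = df["Date"].dt.year              # e.g. 2023
df["month"] = df["Date"].dt.month            # numeric month, 1-12
df["day_of_week"] = df["Date"].dt.dayofweek  # Mon=0, Sun=6
df["hour"] = df["Date"].dt.hour              # 0-23
```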



DATETIME CALCULATIONS

Datetime calculations between columns can be done using basic arithmetic


• Use pd.to_timedelta() to add or subtract a particular timeframe

Note that the data type changed

Time delta units:
• D = day
• W = week
• H = hour
• T = minute
• S = second
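A small sketch of both ideas, using hypothetical purchase and delivery dates:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase": pd.to_datetime(["2023-06-01", "2023-06-03"]),
    "delivery": pd.to_datetime(["2023-06-05", "2023-06-10"]),
})

# Subtracting two datetime columns yields a timedelta; .dt.days extracts a number
df["days_to_deliver"] = (df["delivery"] - df["purchase"]).dt.days

# pd.to_timedelta() shifts dates by a fixed amount, e.g. one week (unit "W")
df["follow_up"] = df["delivery"] + pd.to_timedelta(1, unit="W")
```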



ASSIGNMENT: CREATE COLUMNS FROM DATETIME DATA

NEW MESSAGE
July 12, 2023
From: Peter Penn (Sales Rep)
Subject: Delivery Time of Pens?

Hello again,

Using the data I sent over last week, can you calculate the number of days between the purchase and delivery date for each sale and save it as a new column called “Delivery Time”?

What were the average days from purchase to delivery?

Thanks!
Peter

Key Objectives
1. Calculate the difference between two datetime columns and save it as a new column
2. Take the average of the new column



EXTRACTING TEXT

You can use .str[start:end] to extract characters from a text field


• Note that the position of each character in a string is 0-indexed, and the “end” is non-inclusive

A blank “start” defaults to 0

A negative “start” counts from the end of the string, and a blank “end” goes to the end of the text string
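Both cases can be sketched on a small made-up Series of ID codes:

```python
import pandas as pd

s = pd.Series(["AB-1234", "CD-5678"])

first_two = s.str[:2]   # blank "start" defaults to 0 -> "AB", "CD"
last_four = s.str[-4:]  # negative "start" counts from the end -> "1234", "5678"
```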



SPLITTING INTO MULTIPLE COLUMNS

Use str.split() to split a column by a delimiter into multiple columns

Splitting text returns a list

This is now a DataFrame with two columns
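As a sketch, passing expand=True turns the lists of pieces into DataFrame columns (the example notes are hypothetical):

```python
import pandas as pd

s = pd.Series(["Coach: great job!", "Trainer: pace yourself"])

# expand=True returns a DataFrame with one column per piece
parts = s.str.split(": ", expand=True)
parts.columns = ["Person", "Note"]
```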



FINDING PATTERNS

Use str.contains() to find words or patterns within a text field

Regex stands for regular expression, which is a way of finding patterns within text (more on this topic will be covered in the Natural Language Processing course)
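For example, a regex pattern with "|" can flag text containing either of two words (the review text is made up for illustration):

```python
import pandas as pd

reviews = pd.Series(["Great pen!", "It started to leak", "Ink spilled everywhere"])

# The pattern is treated as a regular expression by default,
# so "leak|spill" matches reviews containing either word
has_issue = reviews.str.contains("leak|spill", case=False)
```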



ASSIGNMENT: CREATE COLUMNS FROM TEXT DATA

NEW MESSAGE
July 19, 2023
From: Peter Penn (Sales Rep)
Subject: Pen Reviews

Hello again,

You may have noticed that the data I sent over a few weeks ago also includes pen reviews.

Can you split the reviews on the “|” character to create two new columns: “User Name” and “Review Text”?

Can you also create a “Leak or Spill” column that flags the reviews that mention either “leak” or “spill”?

Thanks!
Peter

Key Objectives
1. Split one column into multiple columns
2. Create a Boolean column (True/False) to show whether a text field contains particular words



KEY TAKEAWAYS

Always check that the data type for each column matches its intended use
• Sometimes all columns are read in as objects into a DataFrame, so they may need to be converted into
numeric or datetime columns to be able to do any appropriate calculations down the line

It’s important to resolve any messy data issues that could impact your analysis
• When you’re looking over a data set for the first time, check for missing data, inconsistent text & typos,
duplicate data, and outliers, then be thoughtful and deliberate about how you handle each

You can create new columns based on existing columns in your DataFrame
• Depending on the data type, you can apply numeric, datetime, or string calculations on existing columns to
create new columns that can be useful for your analysis

Your goal should be to make the data clean enough for your analysis
• It’s difficult to identify all the issues in your data in order to make it 100% clean (especially if working with text
data), so spending extra time trying to do so can be counterproductive – remember the MVP approach!



EXPLORATORY DATA ANALYSIS



EXPLORATORY DATA ANALYSIS

In this section we’ll cover exploratory data analysis (EDA), which includes a variety of
techniques used to better understand a dataset and discover hidden patterns & insights

TOPICS WE’LL COVER: EDA Overview, Exploring Data, Visualizing Data, EDA Tips

GOALS FOR THIS SECTION:
• Learn Python techniques for exploring a new data set, like filtering, sorting, grouping, and visualizing
• Practice finding patterns and drawing insights from data by exploring it from multiple angles
• Understand that while EDA focuses on technical aspects, the goal is to get a good feel for the data



EXPLORATORY DATA ANALYSIS

The exploratory data analysis (EDA) phase gives data scientists a chance to:
• Get a better sense of the data by viewing it from multiple angles
• Discover patterns, relationships, and insights from the data

From data… EDA …to insights: “Students with more sleep got higher grades!”



COMMON EDA TECHNIQUES

These are some common EDA techniques used by data scientists:

Exploring data: viewing & summarizing data from multiple angles (examples: filtering, sorting, grouping)

Visualizing data: using charts to identify trends & patterns (examples: histograms, scatterplots, pair plots)

There is no particular order that these tasks need to be completed in; you are free to mix and match them depending on your data set



FILTERING

You can filter a DataFrame by passing a logical test into the loc[] accessor
• Apply multiple filters by using the “&” and “|” operators (AND/OR)
This returns rows where “type” is equal to “herbal”

Use a Boolean mask to apply multiple filters with complex logic

This returns rows where “type” is equal to “herbal” OR “temp” is greater than or equal to 200
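A minimal sketch of both filters, using a made-up tea DataFrame in the spirit of the example:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["herbal", "green", "black", "herbal"],
    "temp": [208, 175, 200, 212],
})

# Single condition: rows where "type" is equal to "herbal"
herbal = df.loc[df["type"] == "herbal"]

# Combine conditions with & (AND) or | (OR), wrapping each one in parentheses
hot_or_herbal = df.loc[(df["type"] == "herbal") | (df["temp"] >= 200)]
```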



SORTING

You can sort a DataFrame by using the .sort_values() method


• This sorts in ascending order by default
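For example, sorting a small made-up tea DataFrame both ways:

```python
import pandas as pd

df = pd.DataFrame({"type": ["green", "black", "herbal"],
                   "temp": [175, 200, 208]})

asc = df.sort_values("temp")                    # coolest first (ascending is the default)
desc = df.sort_values("temp", ascending=False)  # hottest first
```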



GROUPING

The “split-apply-combine” approach is used to group data in a DataFrame and apply calculations to each group

EXAMPLE: Finding the average temperature for each type of tea
1. Split the data by “type”
2. Apply a calculation (average) on the “temp” in each group (e.g. 180, 205, 199.75, 190)
3. Combine the “type” and average “temp” for each group into a final table



GROUPING

You can group a DataFrame by using the .groupby() method

df.groupby(col)[col].aggregation()

• The DataFrame to group
• The column(s) to group by (unique values determine the rows in the output)
• The column(s) to apply the calculation to (these become the new columns in the output)
• The calculation(s) to apply for each group (these become the new values in the output)

Example aggregations: mean(), sum(), min(), max(), count(), nunique()
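Putting the pieces together on a small made-up tea DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["black", "black", "green", "herbal"],
    "temp": [200, 210, 175, 208],
})

# Split by "type", apply the mean to "temp", combine into one row per type;
# reset_index() turns the grouped result back into a regular DataFrame
avg_temp = df.groupby("type")["temp"].mean().reset_index()
```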



GROUPING

You can group a DataFrame by using the .groupby() method

This returns the average “temp” by “type”

Use reset_index() to return a DataFrame


MULTIPLE AGGREGATIONS

Chain the .agg() method to .groupby() to apply multiple aggregations to each group

This returns the minimum & maximum temperatures, the number of teas, and the unique temperatures for each tea type

You can also write the code this way!

PRO TIP: When chaining multiple methods together, wrap the code in parentheses so you can place each method on a separate line (this makes reading the code easier!)
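A sketch of multiple aggregations on the same made-up tea data, using the parenthesized multi-line style from the PRO TIP:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["black", "black", "green", "herbal"],
    "temp": [200, 210, 175, 208],
})

# Apply several aggregations to each group at once
summary = (
    df.groupby("type")["temp"]
      .agg(["min", "max", "count", "nunique"])
      .reset_index()
)
```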



PRO TIP: HEAD & TAIL

In addition to aggregating grouped data, you can also use the .head() and .tail()
methods to return the first or last “n” records for each group
Examples: return the first row of each group with .head(1), or the last 3 rows of each group with .tail(3)
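A quick sketch on hypothetical grouped data:

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["black", "black", "green", "green", "green"],
    "temp": [200, 210, 170, 175, 180],
})

first_per_group = df.groupby("type").head(1)     # first row of each group
last_two_per_group = df.groupby("type").tail(2)  # last 2 rows of each group
```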



ASSIGNMENT: EXPLORING DATA
NEW MESSAGE
August 1, 2023
From: Anna Analysis (Senior Data Scientist)
Subject: Happiest Countries of the 2010s?

Hi,

The marketing team would like to share out the five happiest countries of the 2010s on social media.

I’ve attached a notebook that another data scientist started with happiness data inside. I would recommend:
• Creating a list of each country’s highest happiness score, and then sorting it from happiest to least happy country
• Creating a list of each country’s average happiness score, and then sorting it from happiest to least happy country

Are there any differences between the two lists?

Thanks!

section06_exploring_data_assignment.ipynb

Key Objectives
1. Filter out any data before 2010 and after 2019
2. Group the data by country and calculate the maximum happiness score for each one
3. Sort the grouped countries by happiness score and return the top five
4. Group the data by country and calculate the average happiness score for each one
5. Sort the grouped countries by happiness score and return the top five
6. Compare the two lists



VISUALIZING DATA

Visualizing data as part of the exploratory data analysis process lets you:
• More easily identify patterns and trends in the data
• Quickly spot anomalies to further investigate

Scatter plot example: students with more sleep got higher grades, while one student barely slept but tested very well

Data visualization is also used later in the data science process to communicate insights to stakeholders



DATA VISUALIZATION IN PANDAS

You can use the .plot() method to create quick and simple visualizations directly
from a Pandas DataFrame



PRO TIP: PAIR PLOTS

Use sns.pairplot() to create a pair plot that shows all the scatterplots and
histograms that can be made using the numeric variables in a DataFrame
PRO TIP: Create a pair plot as your first visual to identify general patterns that you can dig into later individually

1. Outlier – This student barely studied and still aced the test (look into them)
2. Relationship – The hours spent studying and sleeping don’t seem related at all (this is a surprising insight)
3. Relationship – The test grade is highly correlated with the class grade (can ignore one of the two fields for the analysis)
4. Data Type – The cups of coffee field only has integers (keep this in mind)
5. Outlier – This student drinks a lot of coffee (check on them)



DISTRIBUTIONS

A distribution shows all the possible values in a column and how often each occurs
It can be shown in two ways:
Frequency table / Histogram

1. Most students drink 1 cup of coffee daily (67)
2. A few students drink 8+ cups of coffee daily (2)

PRO TIP: Distributions can be used to find inconsistencies and outliers



FREQUENCY TABLES

You can create a frequency table with the .value_counts() method


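For example, on a made-up column of daily coffee cups, .value_counts() returns each unique value and its count, most frequent first:

```python
import pandas as pd

cups = pd.Series([1, 1, 2, 1, 3, 2, 1])

freq = cups.value_counts()  # counts per unique value, most frequent first
```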



HISTOGRAMS

You can create a histogram by passing a DataFrame column to sns.histplot()




COMMON DISTRIBUTIONS

These are some common distributions you’ll encounter as a data scientist

Normal distribution: if you collect the height of 500 women, most will be ~64” with fewer being much shorter or taller than that

Binomial distribution: knowing that 10% of all ad views result in a click, these are the clicks you’ll get if you show an ad to 20 customers

Uniform distribution: if you roll a die 500 times, the chances of getting any of the 6 numbers is the same

Poisson distribution: knowing that you typically get 2 cancellations each day, these are the number of cancellations you will see on any given day

While it’s not necessary to memorize the formulas for each distribution, being able to recognize the shapes and data types for these distributions will be helpful for future data science modeling and analysis



NORMAL DISTRIBUTION

Many numeric variables naturally follow a normal distribution, or “bell curve”


• Normal distributions are described by two values: the mean (μ) & standard deviation (σ)
• The standard deviation measures, on average, how far each value lies from the mean

EXAMPLE: Women’s Heights (in) – a sample of 500 women shows a mean height of 5’4” (64 inches) and a standard deviation of 2.2 inches

The empirical rule outlines where most values fall in a normal distribution:
• 68% fall within 1σ from the mean
• 95% fall within 2σ from the mean
• 99.7% fall within 3σ from the mean
(this is why data points over 3σ away from the mean are considered outliers)



SKEW

The skew represents the asymmetry of a normal distribution around its mean
• For example, data on household income is typically right skewed
Left skew | Normal distribution | Right skew

There are techniques that can deal with skewed data, such as taking the log of the data set, to turn it into normally distributed data; more on this will be discussed in the prep for modeling section



ASSIGNMENT: DISTRIBUTIONS

NEW MESSAGE
August 8, 2023
From: Sarah Song (Music Analysis Team)
Subject: Musical Attribute Distributions

Hi,

Our music analysis team has labeled 100 songs with their musical attributes, such as danceability and energy.

Can you tell us more about the distributions of the musical attributes? I’m guessing that since most of the musical attributes are numeric, they should be normally distributed, but let me know if you see otherwise.

Thanks!

section06_visualizing_data_assignments.ipynb

Key Objectives
1. Import the seaborn library
2. Create a pair plot
3. Interpret the histograms
• Which fields are normally distributed?
• Which fields have other distributions?
• Do you see any skew?
• Do you see any outliers?
• Any other observations?
• Do your interpretations make sense?



SCATTERPLOTS

Scatterplots are used to visualize the relationship between numerical variables


• sns.scatterplot(data=df, x="x axis column", y="y axis column")



CORRELATION

A correlation describes the relationship between two numerical columns (-1 to 1)


• -1 is a perfect negative correlation, 0 is no correlation, and 1 is a perfect positive correlation
Correlation = –0.7: a negative correlation (students who spent more time on social media got worse grades)
Correlation = 0: no correlation (no relationship exists between hours talking with friends and test grades)
Correlation = 1: a perfect positive correlation (this is likely an error and the grades are exact copies of each other)

Correlation does not imply causation! Just because two variables are related does not necessarily mean that changes to one cause changes to the other



CORRELATION

Use the .corr() method to calculate the correlation coefficient between each pair of
numerical variables in a DataFrame
EDA Overview • A correlation under 0.5 is weak, between 0.5 and 0.8 is moderate, and over 0.8 is strong

1. The correlation between a variable with itself will always be 1 – you can ignore these values across the diagonal
2. While this looks like a negative correlation, this value is very close to zero, meaning that hours studied and cups of coffee are uncorrelated
3. The only strong correlation in this table is that the grade on the test is positively correlated with the grade in the class (keep this in mind for future modeling or insights)
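A sketch of .corr() on a small DataFrame (the study-hours and grade values are made up, chosen so the two grade columns move together):

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "test_grade": [60, 65, 72, 80, 88],
    "class_grade": [62, 66, 71, 81, 87],
})

# Pairwise Pearson correlation coefficients, each between -1 and 1
corr_matrix = df.corr()
```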



ASSIGNMENT: CORRELATIONS

NEW MESSAGE
August 9, 2023
From: Sarah Song (Music Analysis Team)
Subject: Musical Attribute Relationships

Hi again,

Thanks for looking into those distributions earlier. Can you also tell us more about the correlations of the musical attributes?

Can you tell us about any relationships between the attributes, like if some are positively or negatively correlated?

Thanks!

section06_visualizing_data_assignments.ipynb

Key Objectives
1. Look at the previously created pair plot
2. Interpret the scatterplots
• Which fields are highly correlated?
• Which fields are uncorrelated?
• Which fields have positive or negative correlations?
• Any other observations?
• Do your interpretations make sense?



DATA VISUALIZATION IN PRACTICE

Data visualizations can be difficult to tweak and polish in Python, so feel free to
export the data as a CSV file and import it into another tool (Excel, Tableau, etc.)
PRO TIP: Before sharing a visual, take a moment to think about what you want your audience to take away from it. If possible, modify, remove, or highlight specific parts to emphasize your points



EDA TIPS

Before diving into EDA:


• Remind yourself of the original question(s) that you’re trying to answer

As you go through EDA:
• Keep a running list of observations and questions for both yourself and your client
• Apply the techniques in any order that makes the most sense for your data
• You may have to go back to earlier steps and gather more data or further clean your data as you discover more about it during EDA

You know you’ve completed your initial EDA once you’ve:
• Investigated any initial idea or question that comes to mind about the data and gathered some meaningful insights (you can always come back to EDA after modeling!)

Working with a large data set? Look at a subset, apply EDA techniques, then extrapolate to the whole data set
Already answered your question? Sometimes EDA is all you need, and you may not need to apply any algorithms



KEY TAKEAWAYS

EDA is all about looking at and exploring data from multiple angles
• You can get a better understanding of your data just from filtering, sorting and grouping your data in Python

It’s often helpful to visualize data to more easily see patterns in the data
• One of the first plots that data scientists create to visualize their data is a pair plot, which includes scatter
plots for looking at correlations and histograms for looking at distributions

By exploring and visualizing data, you’ll start to discover insights


• These insights can be saved for the end of the project to share with stakeholders, or they can be used as
context when preparing the data for modeling



MID-COURSE PROJECT



MID-COURSE PROJECT: MOVIE RATINGS

NEW MESSAGE
June 1, 2023
From: Katniss Potter (Podcast Host)
Subject: Movie Ratings Exploration

Hi there,

I’m the host of a movie reviews podcast and I’m currently making an episode about movie review aggregators.

I found this data set from Rotten Tomatoes (inside the .ipynb file that I’ve attached). Could you dig into the data and share any interesting insights that you find? My audience loves fun facts about movies.

Thank you!
KP

section07_midcourse_project.ipynb

Key Objectives
1. Explore the data by filtering, sorting, and grouping the data
2. Create new columns to aid in analysis
3. Visualize the data
4. Interpret the aggregations & plots and share interesting insights

(detailed steps and questions in the Jupyter Notebook)



PREPARING FOR MODELING



PREPARING FOR MODELING

In this section we’ll learn to prepare data for modeling, including merging data into a
single table, finding the right row granularity for analysis, and engineering new features

TOPICS WE’LL COVER: Data Prep for Modeling, Creating a Single Table, Preparing Rows, Preparing Columns, Feature Engineering, Preview: Modeling

GOALS FOR THIS SECTION:
• Switch from an exploratory to a modeling mindset
• Become familiar with key modeling terms
• Understand the data structures required as inputs for machine learning models
• Learn common feature engineering techniques

We will NOT cover any modeling techniques in depth (these will be taught in the rest of the courses in this series)



CASE STUDY: PREPARING FOR MODELING

You’ve just been hired as a Data Science Intern for Maven Mega Mart, a large
ecommerce website that sells everything from clothing to pet supplies

From: Candice Canine (Senior Data Scientist)
Subject: Email Targeting

Hi!

We’re launching a new dog food brand and would like to send out an email blast to a subset of customers that are most likely to purchase dog food.

Can you help me prep our data so I can identify those customers? You’ll need to:
1. Create a single table with the appropriate row & column formats
2. Engineer features

Scope: Identify customers that are most likely to purchase dog food
Technique: Supervised learning
Label (y): Whether a customer has purchased dog food recently or not
Features (x):
• What other items a customer has purchased
• How much money a customer has spent
• etc.


DATA PREP: EDA VS MODELING

So far, we’ve gathered and cleaned data to prepare for EDA, but a few additional
steps are required to prepare for modeling
Data ready for EDA:
Address          City      Color    Price ($)
200 N Michigan   Chicago   Blue     350,000
10 S State       Chicago   Blue     500,000
123 Main St      Evanston  White    180,000
50 Dempster      Evanston  Yellow   270,000

Data ready for modeling:
Index   Walk Score   Blue   White   Price ($)
0       97           1      0       350,000
1       92           1      0       500,000
2       70           0      1       180,000
3       64           0      0       270,000

Here, “modeling” refers to applying an algorithm, which is different from “data modeling”, where the goal is to visually show the relationship between the tables in a relational database (a data model)



PREPARING FOR MODELING

To prepare for modeling means to transform the data into a structure and format that can be used as a direct input for a machine learning algorithm

You can prepare your data for modeling by:
1. Creating a single table
2. Setting the correct row granularity
3. Ensuring each column is non-null and numeric
4. Engineering features for the model



CREATING A SINGLE TABLE

There are two ways to combine multiple tables into a single table:
• Appending stacks the rows from multiple tables with the same column structure
• Joining adds related columns from one table to another, based on common values

Appending two tables with the same columns adds the rows from one to the other; joining two tables can add a region column based on matching store values



APPENDING

Use pd.concat() to append, or vertically stack, multiple DataFrames


• The columns for the DataFrames must be identical
• pd.concat([df_1, df_2]) will stack the rows from “df_2” at the bottom of “df_1”

pd.concat([df1, df2, df3, …])

PRO TIP: You can also use .concat() to combine DataFrames horizontally by setting axis = 1

Chain .reset_index() to the code to make the index go from 0 to 5
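A sketch of appending two hypothetical monthly sales tables with identical columns:

```python
import pandas as pd

april = pd.DataFrame({"store": ["A", "B"], "sales": [100, 150]})
may = pd.DataFrame({"store": ["A", "B"], "sales": [120, 160]})

# Stack "may" beneath "april"; drop=True discards the old repeating 0,1 index
q2 = pd.concat([april, may]).reset_index(drop=True)
```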



JOINING

Use .merge() to join two DataFrames based on common values in a column(s)


• The DataFrames must have at least one column with matching values
• This is different from the Pandas .join() method, which joins DataFrames on their indices

left_df.merge(right_df, how, left_on, right_on)

• left_df: the “Left” DataFrame to join with “Right”
• right_df: the “Right” DataFrame to join with “Left”
• how: the type of join
• left_on: column(s) in the “Left” DataFrame to join by
• right_on: column(s) in the “Right” DataFrame to join by

For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
JOINING

This added the region from the “regions” table to the “sales” table based on the “store” field
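A sketch of that join with made-up sales and regions tables (note the left join keeps stores with no matching region):

```python
import pandas as pd

sales = pd.DataFrame({"store": ["A", "B", "C"], "sales": [100, 150, 90]})
regions = pd.DataFrame({"store": ["A", "B"], "region": ["East", "West"]})

# Left join keeps every sales row; stores without a match get NaN for region
combined = sales.merge(regions, how="left", on="store")
```

When the join column has the same name in both DataFrames, a single on= argument can replace left_on/right_on.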



TYPES OF JOINS

These are the most common types of joins you can use with .merge()
how = ‘inner’ – returns records that exist in BOTH tables, and excludes unmatched records from either table (most commonly used)

how = ‘left’ – returns ALL records from the LEFT table, and any matching records from the RIGHT table (most commonly used)

how = ‘right’ – returns ALL records from the RIGHT table, and any matching records from the LEFT table (rarely used – in practice, switch the tables and use a left join instead)

how = ‘outer’ – returns ALL records from BOTH tables, including non-matching records



TYPES OF JOINS
[Venn diagram example: Left Table (n=10) joined with Right Table (n=5) – Left Join returns n=10, Inner Join returns n=6, Outer Join returns n=11]
ASSIGNMENT: CREATING A SINGLE TABLE

NEW MESSAGE
July 3, 2023
From: Brooke Reeder (Owner, Maven Books)
Subject: Combine data sets

Hi there,

We just finished collecting our Q2 book sales data. Can you help us create one giant table that includes:
• April, May and June’s book sales
• Customer data

The customer_id field links all the tables together.

Thanks!
Brooke

Book_Sales_April.xlsx, Book_Sales_May.xlsx, Book_Sales_June.xlsx, Book_Customers.csv

Key Objectives
1. Read all four files into a Jupyter Notebook
2. Append the May and June book sales to the April DataFrame
3. Join the newly created book sales DataFrame with the customers DataFrame on customer_id
• Which type of join would work best here?



PREPARING ROWS FOR MODELING

To prepare rows for modeling, you need to think about the question you’re trying
to answer and determine what one row (observation) of your table will look like
• In other words, you need to determine the granularity of each row

GOAL: Predict which customers are most likely to buy dog food in June

Because we are predicting something for a customer, one row of data in the table should represent one customer. Group by customer to build the x variables (features to be input into a model, such as the number of pet supplies purchased before June and how much was spent on all items before June) and the y variable (label / output)



ASSIGNMENT: PREPARE ROWS FOR MODELING

NEW MESSAGE
July 5, 2023
From: Brooke Reeder (Owner, Maven Books)
Subject: Please format data for analysis

Hi again,

We’re trying to predict which customers will purchase a book this month.

Can you reformat the data you compiled earlier this week so that it’s ready to be input into a model, with each row representing a customer instead of a purchase?

Thanks!
Brooke

Key Objectives
1. Determine the row granularity needed
2. Create a column called “June Purchases” that sums all purchases in June
3. Create a column called “Total Spend” that sums the prices of the books purchased in April & May
4. Combine the “June Purchases” and “Total Spend” columns into a single DataFrame for modeling



PREPARING COLUMNS FOR MODELING

Once you have the data in a single table with the right row granularity, you’ll move
on to preparing the columns for modeling:
Data Prep for
Modeling

Creating a
1• All values should be non-null
Single Table
• Use df.info() or df.isna() to identify null values and either remove them, impute
them, or resolve them based on your domain expertise
Preparing Rows

2. All values should be numeric


Preparing
Columns • Turn text fields into numeric fields using dummy variables
• Turn datetime fields into numeric fields using datetime calculations
Feature
Engineering

Preview: PRO TIP: There are some algorithms that can handle null and non-numeric values, including tree-
Modeling based models and some classification models, but it is still best practice to prepare the data this way
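A minimal sketch of the null check and one possible fix, median imputation, using a tiny hypothetical DataFrame (whether to remove, impute, or otherwise resolve nulls depends on your domain expertise):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, np.nan, 29], "spend": [120.0, 80.0, np.nan]})

# Identify how many null values each column contains
print(df.isna().sum())

# One option: impute numeric nulls with each column's median
df_filled = df.fillna(df.median())
```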



DUMMY VARIABLES

A dummy variable is a field that only contains zeros and ones to represent the
presence (1) or absence (0) of a value, also known as one-hot encoding
Data Prep for
Modeling • They are used to transform a categorical field into multiple numeric fields

Creating a
Single Table These dummy variables are numeric
representations of the “Color” field

Preparing Rows
House ID Price Color House ID Price Color Blue White Yellow

Preparing 1 $350,000 Blue 1 $350,000 Blue 1 0 0


Columns
2 $500,000 Blue 2 $500,000 Blue 1 0 0

Feature 3 $180,000 White 3 $180,000 White 0 1 0


Engineering
4 $270,000 Yellow 4 $270,000 Yellow 0 0 1

Preview: 5 $245,000 White 5 $245,000 White 0 1 0


Modeling
It only takes 2 columns to distinguish 3 values, so
you might sometimes see one less column



DUMMY VARIABLES

Use pd.get_dummies() to create dummy variables in Python


Data Prep for
Modeling

Creating a
Single Table

Preparing Rows

Preparing
Columns

Feature
Engineering Set drop_first=True to remove one of
the dummy variable columns
(it does not matter which column is
Preview: removed, as long as one is dropped)
Modeling
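For example, the house colors from the previous page could be encoded like this (passing `dtype=int` returns 0/1 values rather than the booleans newer pandas versions produce by default):

```python
import pandas as pd

houses = pd.DataFrame({
    "House ID": [1, 2, 3, 4, 5],
    "Price": [350000, 500000, 180000, 270000, 245000],
    "Color": ["Blue", "Blue", "White", "Yellow", "White"],
})

# One new 0/1 column per color value
dummies = pd.get_dummies(houses, columns=["Color"], dtype=int)

# drop_first=True keeps only two of the three color columns
dummies_dropped = pd.get_dummies(houses, columns=["Color"], drop_first=True, dtype=int)
```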



PRO TIP: PREPARING DATETIME COLUMNS

To prepare datetime columns for modeling they need to be converted to


numeric columns, but extracting datetime components isn’t enough
Data Prep for
Modeling
The “month” field is not an appropriate numeric input because a
Creating a model would interpret May (5) as being greater than April (4)
Single Table

Preparing Rows

Instead, you can prepare them using:


Preparing The average time
Columns
Dummy variables The days from “today” between dates

Feature
Engineering

Preview:
Modeling

Number of purchases by month Days since the latest purchase Average days between purchases



PRO TIP: PREPARING DATETIME COLUMNS

EXAMPLE Using dummy variables to prepare a date column for modeling

Data Prep for


Modeling

Creating a
Single Table

Preparing Rows

Preparing
Columns

Feature
Engineering

Preview:
Modeling
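A minimal sketch of this approach with hypothetical data: extract the month name, one-hot encode it, then sum the dummies per customer to get the number of purchases by month:

```python
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "purchase_date": pd.to_datetime(["2023-04-10", "2023-05-02", "2023-04-21"]),
})

# Extract the month name, then one-hot encode it
month_dummies = pd.get_dummies(sales["purchase_date"].dt.month_name(), dtype=int)

# Summing the dummies per customer gives the number of purchases by month
purchases_by_month = month_dummies.groupby(sales["customer_id"]).sum()
```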



PRO TIP: PREPARING DATETIME COLUMNS

EXAMPLE Using the days from “today” to prepare a date column for modeling

Data Prep for


Modeling

Creating a
Single Table

Preparing Rows

Preparing
Columns

Feature
Engineering

Preview:
Modeling
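One way to sketch this with hypothetical data (a fixed "today" is used so the example is reproducible; in practice you might use `pd.Timestamp.today()`):

```python
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "purchase_date": pd.to_datetime(["2023-04-10", "2023-05-02", "2023-04-21"]),
})

today = pd.Timestamp("2023-06-01")

# Convert each date to a number of days before "today"
sales["days_ago"] = (today - sales["purchase_date"]).dt.days

# Days since each customer's most recent purchase
days_since_latest = sales.groupby("customer_id")["days_ago"].min()
```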



PRO TIP: PREPARING DATETIME COLUMNS

EXAMPLE Using the average time between dates to prepare a date column for modeling

Data Prep for


Modeling

Creating a
Single Table

Preparing Rows

Preparing
Columns

Feature
Engineering

A variation of .diff() is .shift(), which will


Preview: shift all the rows of a column up or down
Modeling
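A minimal sketch of the `.diff()` approach with hypothetical purchase dates: sort the rows, take the gap between consecutive purchases per customer, then average those gaps:

```python
import pandas as pd

sales = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "purchase_date": pd.to_datetime(
        ["2023-04-01", "2023-04-11", "2023-05-01", "2023-04-05", "2023-04-25"]
    ),
})

# Sort, then .diff() gives the gap between consecutive purchases per customer
sales = sales.sort_values(["customer_id", "purchase_date"])
sales["days_between"] = sales.groupby("customer_id")["purchase_date"].diff().dt.days

# Average days between purchases for each customer (the first purchase has no gap)
avg_gap = sales.groupby("customer_id")["days_between"].mean()
```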



ASSIGNMENT: PREPARE COLUMNS FOR MODELING

Key Objectives
NEW MESSAGE
July 7, 2023 1. Create dummy variables from the “Audience”
From: Brooke Reeder (Owner, Maven Books) field
Subject: Please help make data numeric 2. Using the “Audience” dummy variables, create
three new columns that contain the number of
Hi again, Adult, Children, and Teen books purchased by
each customer
Thanks for your help earlier this week!
3. Combine the three new columns back with the
We just learned that we also have to make all the data
numeric before inputting it into a predictive model. customer-level data

Can you turn the “Audience” text field into a numeric field?

Thanks!
Brooke



FEATURE ENGINEERING

Feature engineering is the process of creating columns that you think will be
helpful inputs for improving a model (help predict, segment, etc.)
Data Prep for
Modeling • When preparing rows & columns for modeling you’re already feature engineering!

Creating a
Single Table

Preparing Rows

Preparing
Columns

Feature Other feature engineering techniques: Once you have your data in a single table and
Engineering
have prepared the rows & columns, it’s ready for
• Transformations (log transform) modeling
Preview:
Modeling
• Scaling (normalization & standardization) However, being deliberate about engineering
new features can be the difference between a
• Proxy variables good model and a great one



TRANSFORMATIONS: LOG TRANSFORMS

Transformations map one set of values to another in a consistent way


Data Prep for • Log transforms turn skewed data into more normally-distributed data
Modeling
• You can use the np.log() function to apply a log transform to a Series

Creating a
Single Table

Right-skewed data Normally-distributed data


Preparing Rows

Preparing
Columns

Feature
Engineering

Preview:
Modeling
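A small sketch with hypothetical right-skewed values, showing that the log transform compresses the long right tail (note that `np.log()` requires positive values; `np.log1p()` is a common variant when the data contains zeros):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values (e.g., prices or spend amounts)
skewed = pd.Series([1, 2, 2, 3, 5, 10, 50, 200], dtype=float)

# The log transform compresses the long right tail
logged = np.log(skewed)

print(skewed.skew(), logged.skew())
```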



SCALING: NORMALIZATION

Scaling, as its name implies, requires setting all input features on a similar scale
• Normalization transforms all values to be between 0 and 1 (or between -1 and 1)
Data Prep for
Modeling

Creating a
Single Table Normalization equation: (𝑥 − 𝑥𝑚𝑖𝑛) / (𝑥𝑚𝑎𝑥 − 𝑥𝑚𝑖𝑛)

You can also use the MinMaxScaler()
Preparing Rows class from the sklearn library

Preparing
Columns

Feature PRO TIP: Normalization is typically used


Engineering when the distribution of the data is unknown

Preview:
Modeling
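A minimal sketch of both options on hypothetical data, applying the normalization equation directly and then via sklearn's MinMaxScaler:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({"spend": [10.0, 50.0, 100.0], "purchases": [1, 3, 5]})

# Applying the normalization equation directly
normalized = (X - X.min()) / (X.max() - X.min())

# Equivalent result using sklearn (returns a NumPy array)
normalized_sk = MinMaxScaler().fit_transform(X)
```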



SCALING: STANDARDIZATION

Scaling, as its name implies, requires setting all input features on a similar scale
• Normalization transforms all values to be between 0 and 1 (or between -1 and 1)
Data Prep for
Modeling • Standardization transforms all values to have a mean of 0 and standard deviation of 1

Creating a
Single Table Standardization equation: (𝑥 − 𝑥𝑚𝑒𝑎𝑛) / 𝑥𝑠𝑡𝑑

You can also use the StandardScaler()
Preparing Rows class from the sklearn library

Preparing
Columns

Feature PRO TIP: Standardization is typically used


Engineering when the distribution is normal (bell curve)

Preview:
Modeling
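A minimal sketch on hypothetical data, applying the standardization equation directly and then via sklearn's StandardScaler (note that sklearn uses the population standard deviation, so `ddof=0` is needed to match it):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"spend": [10.0, 50.0, 100.0]})

# Applying the standardization equation directly
# (ddof=0 matches sklearn's population standard deviation)
standardized = (X - X.mean()) / X.std(ddof=0)

# Equivalent result using sklearn (returns a NumPy array)
standardized_sk = StandardScaler().fit_transform(X)
```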



PROXY VARIABLES

A proxy variable is a feature meant to approximately represent another


Data Prep for
• They are used when a feature is difficult to gather directly or to engineer from existing data
Modeling

Creating a Zip codes may look numeric, but


Single Table should not be input into a model
Instead of turning the zip code into dummy
(60202 is not better than 60201) variables, you can use a proxy variable like
the median income for the zip code, or its
Preparing Rows
distance from the city center

Preparing
Columns

You may not be able to engineer


Feature
proxy variables from existing
Engineering
data, but they can be gathered
from external sources
Preview:
Modeling
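The zip code example can be sketched with a simple lookup: map each zip code to a numeric proxy gathered from an external source. The income figures below are hypothetical, purely for illustration:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3], "zip": ["60201", "60202", "60614"]})

# Hypothetical lookup gathered from an external source (values are made up)
median_income_by_zip = {"60201": 82000, "60202": 74000, "60614": 115000}

# Replace the zip code with a numeric proxy variable
customers["median_income"] = customers["zip"].map(median_income_by_zip)
```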



FEATURE ENGINEERING TIPS

1. Anyone can apply an algorithm, but only someone with domain expertise
Data Prep for can engineer relevant features, which is what makes a great model
Modeling

Creating a
2. You want your data to be long, not wide (many rows, few columns)
Single Table

3. If you're working with customer data, a popular marketing technique is to
Preparing Rows engineer features related to the recency, frequency, and monetary value
(RFM) of a customer’s transactions
Preparing
Columns
4. Once you start modeling, you're bound to find things you missed during
Feature
data prep and will continue to engineer features and gather, clean,
Engineering explore, and visualize the data
Preview:
Modeling
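The RFM technique from tip 3 can be sketched in one groupby over hypothetical transaction data (a fixed "today" keeps the example reproducible):

```python
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "date": pd.to_datetime(["2023-05-01", "2023-06-15", "2023-06-20"]),
    "amount": [25.0, 40.0, 10.0],
})

today = pd.Timestamp("2023-07-01")  # fixed date for reproducibility

rfm = transactions.groupby("customer_id").agg(
    recency=("date", lambda d: (today - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                         # number of purchases
    monetary=("amount", "sum"),                          # total spend
)
```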



ASSIGNMENT: FEATURE ENGINEERING

Key Objectives
NEW MESSAGE
July 10, 2023 1. Brainstorm features that would make good
From: Brooke Reeder (Owner, Maven Books) predictors for a model
Subject: Please create new features 2. Engineer two new features
3. Add them to the non-null, numeric DataFrame
Hi again,
that is ready for modeling
I have one final request for you.

As a reminder, our goal is to try and predict which customers


will purchase a book this month.

Can you create new features that you think will do a good job
making a prediction?

Thanks!
Brooke



PREVIEW: APPLYING ALGORITHMS

You can input the prepared data into a supervised learning model to predict
which customers are most likely to purchase dog food
Data Prep for
Modeling
x y

Creating a
Single Table

Preparing Rows

Preparing x test
Columns

Feature
Engineering

Preview:
Modeling Whether a customer purchased pet Yvonne is the
supplies recently is a good predictor of
whether they will buy dog food buy dog food



KEY TAKEAWAYS

Preparing data for EDA is different than preparing data for modeling
• The goal of EDA is exploration while the goal of modeling is to use an algorithm to answer a question, so to prep
for modeling means to get the data in a format that can be directly input into a model

All data needs to be in a single table with non-null, numeric values for modeling
• Join multiple tables together with .merge() and .concat(), remove null values with .isna(), create dummy variables
to turn text into numeric values, and use datetime calculations to turn datetimes into numeric values
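A minimal sketch of that combination step with hypothetical monthly tables: stack them with concat, then join in customer details with merge:

```python
import pandas as pd

april = pd.DataFrame({"customer_id": [1, 2], "purchases": [3, 1]})
may = pd.DataFrame({"customer_id": [1, 3], "purchases": [2, 4]})
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})

# Stack the monthly tables on top of each other, then join in customer details
sales = pd.concat([april, may], ignore_index=True)
combined = sales.merge(customers, on="customer_id", how="left")
```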

Engineering features turns good models into great models


• Use the techniques learned and intuition built during EDA to create meaningful features for modeling, along with
feature transformations, feature scaling, and proxy variables



FINAL PROJECT



RECAP: THE COURSE PROJECT

You’ve just been hired as a Jr. Data Scientist for Maven Music, a streaming service
THE that’s been losing more customers than usual the past few months and would like
SITUATION to use data science to figure out how to reduce customer churn

You’ll have access to data on Maven Music’s customers, including subscription


THE details and music listening history
ASSIGNMENT Your task is to gather, clean, and explore the data to provide insights about the
recent customer churn issues, then prepare it for modeling in the future

1. Scope the data science project


THE
2. Gather the data in Python
OBJECTIVES
3. Clean the data
4. Explore & visualize the data
5. Prepare the data for modeling



FINAL PROJECT: MAVEN MUSIC

Key Objectives
NEW MESSAGE
June 8, 2023 1. Revisit the project scope
From: Carter Careswell (Customer Care) 2. Read the data files into Python
Subject: Customer Churn Data Prep
3. Clean the data by converting data types,
resolving data issues, and creating new columns
Hi again,

Thanks for scoping the customer churn project with us earlier. 4. Explore the data independently, then join the
We’ve identified the location of the data sets for the project tables for further exploration
that we talked about – customer and listening history.
5. Create a non-null, numeric DataFrame and
Can you read the data into Python, clean it, explore it and engineer features that could be good predictors
prepare it for modeling? More details can be found in the of customer churn
attached Jupyter Notebook.
6. Visualize and interpret the data in the final
Thanks for all your help!
DataFrame that is ready for modeling
Carter

section09_final_project.ipynb

