Data Science in Python – Data Prep & EDA
PART 1:
Data Prep &
Exploratory
Data Analysis
With Expert Data Science Instructor Alice Zhao
This is Part 1 of a 5-Part series designed to take you through several applications of data science
using Python, including data prep & EDA, regression, classification, unsupervised learning & NLP
1 Intro to Data Science – Introduce the field of data science, review essential skills, and introduce each phase of the data science workflow
2 Scoping a Project – Review the process of scoping a data science project, including brainstorming problems and solutions, choosing techniques, and setting clear goals
4 Gathering Data – Read flat files into a Pandas DataFrame in Python, and review common data sources & formats, including Excel spreadsheets and SQL databases
5 Cleaning Data – Identify and convert data types, find and fix common data issues like missing values, duplicates, and outliers, and create new columns for analysis
7 MID-COURSE PROJECT – Put your skills to the test by cleaning, exploring and visualizing data from a brand-new data set containing Rotten Tomatoes movie ratings
8 Preparing for Modeling – Structure your data so that it's ready for machine learning models by creating a numeric, non-null table and engineering new features
9 FINAL COURSE PROJECT – Apply all the skills learned throughout the course by gathering, cleaning, exploring, and preparing multiple data sets for Maven Music
You’ve just been hired as a Jr. Data Scientist for Maven Music, a streaming service
THE that’s been losing more customers than usual the past few months and would like
SITUATION to use data science to figure out how to reduce customer churn
This course covers data gathering, cleaning and exploratory data analysis
• We’ll review common techniques for gathering, cleaning and analyzing data with Python, but will not
cover more complex data formats or advanced statistical tools
In this section we’ll introduce the field of data science, discuss how it compares to
other data fields, and walk through each phase of the data science workflow
Is data science different from other data fields? Yes! The differences lie in the types of problems you solve, and the tools and techniques you use to solve them:

What happened?             What's going to happen?
• Descriptive Analytics    • Predictive Analytics
• Data Analysis            • Data Mining
• Business Intelligence    • Data Science

Machine learning enables computers to learn and make decisions from data
How can a computer learn? By using algorithms, which are sets of instructions for a computer to follow.

How does this compare with data science? Data scientists know how to apply algorithms, meaning they're able to tell a computer how to learn from data.

These are some of the most common machine learning algorithms that data scientists use in practice. [MACHINE LEARNING algorithms diagram]
The data science workflow consists of scoping a project, gathering, cleaning and exploring the data, building models, and sharing insights with end users:

1. Scoping a Project
2. Gathering Data
3. Cleaning Data
4. Exploring Data
5. Modeling Data
6. Sharing Insights
This is not a linear process! You’ll likely go back to further gather, clean and explore your data
Projects don’t start with data, they start with a clearly defined scope:
• Who are your end users or stakeholders?
• What business problems are you trying to help them solve?
• Is this a supervised or unsupervised learning problem? (do you even need data science?)
• What data do you need for your analysis?
A project is only as strong as the underlying data, so gathering the right data is
essential to set a proper foundation for your analysis
Data can come from a variety of sources, including:
• Files (flat files, spreadsheets, etc.)
• Databases
• Websites
• APIs
A popular saying within data science is "garbage in, garbage out", which means that cleaning data properly is key to producing accurate and reliable results.
Data cleaning tasks may include:
• Correcting data types
• Imputing missing data
• Dealing with data inconsistencies
• Reformatting the data

Building models is the flashy part of data science; cleaning data is less fun, but very important (data scientists estimate that around 50-80% of their time is spent here!)
Exploratory data analysis (EDA) is all about exploring and understanding the data you're working with before building models.
Modeling data involves structuring and preparing data for specific modeling
techniques, and applying algorithms to make predictions or discover patterns
Data modeling tasks may include:
• Restructuring the data
• Feature engineering (adding new fields)
• Applying machine learning algorithms

With fancy new algorithms introduced every year, you may feel the need to learn and apply the latest and greatest techniques. In practice, simple is best; businesses & leadership teams appreciate solutions that are easy to understand, interpret and implement.
The final step of the workflow involves summarizing your key findings and sharing insights with end users or stakeholders:
• Reiterate the problem
• Interpret the results of your analysis
• Share recommendations and next steps

Even with all the technical work that's been done, it's important to remember that the focus here is on non-technical solutions.
NOTE: Another way to share results is to deploy your model, or put it into production
Data prep and EDA is a critical part of every data science project, and should
always come first before applying machine learning algorithms
Data scientists have both coding and math skills along with domain expertise
• In addition to technical expertise, soft skills like communication, problem-solving, curiosity, creativity, grit,
and Googling prowess round out a data scientist’s skillset
Much of a data scientist’s time is spent cleaning & preparing data for analysis
• Properly cleaning and preparing your data ensures that your results are accurate and meaningful (garbage in,
garbage out!)
In this section we’ll discuss the process of scoping a data science project, from
understanding your end users to deciding which tools and techniques to deploy
Scoping a data science project means clearly defining the goals, techniques, and data sources you plan to use for your analysis.

Scoping steps:
1. Think like an end user
2. Brainstorm problems
3. Brainstorm solutions
4. Determine the techniques
5. Identify data requirements
6. Summarize the scope & objectives

Project scoping is one of the most difficult yet important steps of the data science workflow. If a project is not properly scoped, a lot of time can be wasted going down a path to create a solution that solves the wrong problem.

We'll be scoping the course project in this section together to set you up for success throughout the rest of the course.
An end user (also known as a stakeholder) is a person, team, or business that will
ultimately benefit from the results of your analysis
When introduced to a new project, start by asking questions that help you:
• Empathize with the end user – what do they care about?
• Focus on impact – what metrics are important to them?

EXAMPLE Debbie Dayda (Data Scientist, Data Science Team) interviews Carter Careswell (End User, Customer Care Team):
• "What's the situation?" – "People keep cancelling their streaming music subscriptions"
• "Why is this a major issue?" – "Our monthly revenue growth is slowing"
• "What would success look like for you?" – "Decreasing cancellation rate by 2% would be a big win"
Next, brainstorm problems with the end user, for example:
• Is our product differentiated from competitors?
• Are technical bugs or limitations to blame?
Once you settle on a problem, the next step is to brainstorm potential solutions.

Data science can be one potential solution, but it's not the only one:
• PROS: Solutions are backed by data, may lead to hidden insights, let you make predictions
• CONS: Projects take more time and specialized resources, and can lead to complex solutions

POTENTIAL SOLUTIONS
• Ask the product team to add a survey to capture cancellation feedback
• Speak with customer reps about any changes they've noticed in recent customer interactions
• Conduct customer interviews to gather qualitative data and insights
• Research external factors that may be impacting cancellations (competitive landscape, news, etc.)
• Suggest that the leadership team speak with other leaders in the space to compare notes
• Use data to identify the top predictors for account cancellations (this is the only data science solution!)
If you decide to take a data science approach to solving the problem, the next step is to determine which techniques are most suitable:
• Do you need supervised or unsupervised learning?

EXAMPLE Classifying potential analyses:

Supervised Learning:
• Estimating how many customers will visit your website on New Year's Day
• Flagging which customers are most likely to cancel their membership

Unsupervised Learning:
• Identifying the main themes mentioned in customer reviews
• Clustering customers into different groups based on their preferences

Neither (exploratory analysis rather than machine learning):
• Looking at the products purchased by the highest-spend customers
• Visualizing cancellation rate over time for various customer segments
How you structure your data often depends on which technique you're using:
• A supervised learning model takes in labeled data (the outcome you want to predict)
• An unsupervised learning model takes in unlabeled data

MACHINE LEARNING
• Supervised Learning – focused on using historical data to predict the future; has a "label"
• Unsupervised Learning – focused on finding patterns or relationships in the data; does NOT have a "label"

A "label" is an observed variable which you are trying to predict (house price ($), spam or not (1/0), chance of failure (probability), etc.)
EXAMPLE Supervised Learning: Predicting which customers are likely to cancel

EXAMPLE Unsupervised Learning: Clustering customers based on listening behavior
• When shown a new customer, figure out which existing customers it's most similar to

Brainstorming data requirements:
• "I wonder if certain demographics are more likely to cancel" (Debbie Dayda, Data Science Team)
• "Maybe a competitor launched a new offer or promotion?" (Carter Careswell, Customer Care Team)
Once you’ve identified a list of relevant features, brainstorm data sources you
can use to create or engineer those features
Project Scoping
Steps
Thinking Like an
Features: Potential Sources: Ease of Access:
End User
Monthly rate
Customer subscription history,
Problems &
Solutions
customer listening history
Auto-renew
Easy
Modeling
(internal data)
Age
Techniques
Customer demographics
Urban vs. Rural
Data
Requirements
Customer’s other subcriptions, Hard
Competitor Promotion
Summarizing the competitor promotion history (external data)
Scope
Next, narrow the scope of your data to remove sources that are difficult to obtain and prioritize the rest
• Remember that more data doesn't necessarily mean a better model!

Data Sources:
• Customer subscription history
• Customer listening history

Timeframe: Past three months of data
Customers: Only customers who actively subscribed, not those who were grandfathered in

PRO TIP: Aim to produce a minimum viable product (MVP) at this stage – if you go too deep without feedback, you risk heading down unproductive paths or over-complicating the solution
The final step is to clearly summarize the project scope and objectives:
• What techniques and data do you plan to leverage?
• What specific impact are you trying to make?
Don’t limit yourself when brainstorming ideas for problems, solutions, or data
• Start by thinking big and whittling ideas down from there, and keep in mind that many potential solutions
likely won’t require data science
In this section we’ll install Anaconda and introduce Jupyter Notebook, a user-friendly
coding environment where we’ll be coding in Python
Why Python? Python is the most popular programming language used by data scientists around the world due to its:
• Scalability – Unlike some data tools or self-service platforms, Python is open source, free, and built for scale
• Versatility – With powerful libraries and packages, Python can add value at every stage of the data science workflow, from data prep to data viz to machine learning
• Automation – Python can automate workflows & complex tasks out of the box, without complicated integrations or plug-ins
• Community – Become part of a large and active Python user community, where you can share resources, get help, offer support, and connect with other users
Installing Anaconda on a Mac:
2) Launch the downloaded Anaconda pkg file
3) Follow the installation steps (default settings are OK)

Installing Anaconda on Windows:
2) Launch the downloaded Anaconda exe file
3) Follow the installation steps (default settings are OK)
1) Once inside the Jupyter interface, create a folder to store your notebooks for the course
NOTE: You can rename your folder by clicking "Rename" in the top left corner

2) Open your new coursework folder and launch your first Jupyter notebook!
NOTE: You can rename your notebook by clicking on the title at the top of the screen

NOTE: When you launch a Jupyter notebook, a terminal window may pop up as well; this is called a notebook server, and it powers the notebook interface
If you close the server window, your notebooks will not run!

Depending on your OS and method of launching Jupyter, one may not open. As long as you can run your notebooks, don't worry!
Code Cell: the input field where you will write and edit new code to be executed
Command mode keyboard shortcuts:
• Save & Checkpoint: S
• Insert cell below: B
• Cut, Copy & Paste cells: X (cut), C (copy), V (paste)
• Move cells up/down
• Run cell & select below: SHIFT + ENTER
• Interrupt the kernel: I, I
• Restart kernel & rerun code: 0, 0
• Change cell type: Y (code), M (markdown), R (raw)
• Open the command palette: CTRL + SHIFT + F
EDIT MODE is for editing content within cells, and is indicated by green highlights and a pen icon

COMMAND MODE is for editing the notebook, and is indicated by blue highlights and no icon
The code cell is where you'll write and execute Python code
• In edit mode, the cell will be highlighted green and a pencil icon will appear
• Type some code, and click Run to execute
• In[]: Our code (input)
• Out[]: What our code produced (output)*
*Note: not all code has an output!
Click back into the cell (or use the up arrow) and press SHIFT + ENTER to rerun the code
• Note that the output hasn't changed, but the number in the brackets increased from 1 to 2. This is a cell execution counter, which indicates how many cells you've run in the current session.
• If the cell is still processing, you'll see In[*]
Comments are lines of code that start with ‘#’ and do not run
Why Python? • They are great for explaining portions of code for others who may use or review it
• They can also serve as reminders for yourself when you revisit your code in the future
Think about your audience when commenting your code (you may not need to explain basic arithmetic to an experienced Python programmer)

Be conscious of over-commenting, which can actually make your code even more difficult to read
Comments should explain individual cells or lines of code, NOT your entire workflow – we have better tools for that!
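For example, a lightly commented cell might look like this (a minimal sketch with hypothetical values):

```python
# calculate the total cost of an order including 8% sales tax
subtotal = 25.00
total = subtotal * 1.08  # multiply by 1.08 to add the 8% tax

total
```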
Markdown cells let you write structured text passages to explain your workflow,
provide additional context, and help users navigate the notebook
Code cells are where you write and execute Python code
• Make sure that you know how to run, add, move, and remove cells, as well as how to restart your kernel or
stop the code from executing
In this section we’ll cover the steps for gathering data, including reading files into Python,
connecting to databases within Python, and storing data in Pandas DataFrames
The data gathering process involves finding data, reading it into Python,
transforming it if necessary, and storing the data in a Pandas DataFrame
1) Find the data – raw data can come in many shapes and forms
2) Read in the data – apply transformations using Python, if necessary
3) Store the data – tables in Python are stored as Pandas DataFrames (more on this later!)

Data can come from:
• Local files – you can read data from a file stored on your computer
• Databases – you can connect to a database and write queries
• Web access – you can programmatically extract data from websites
Python is a great tool for data gathering due to its ability to read in and
transform data coming from a wide variety of sources and formats
(common file formats include .xlsx, .csv, .pdf, and more)

pd.read_csv – Reads in data from a delimiter-separated flat file: pd.read_csv(file path, sep, header)
pd.read_excel – Reads in data from a Microsoft Excel file: pd.read_excel(file path, sheet_name)
pd.read_csv() lets you read flat files by specifying the file path
• file path – the file location, name and extension (e.g. 'data.csv', 'sales/october.tsv')
• sep – the column delimiter (e.g. sep=',' (default), sep='\t')
• header – if the data has column headers in the first row (e.g. header='infer' (default), header=None (no headers))

PRO TIP: Place the file in the same folder as the Jupyter Notebook so you don't have to specify the precise file location

For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
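A minimal sketch of reading a flat file, assuming a hypothetical tab-delimited file named october.tsv sitting in the same folder as the notebook:

```python
import pandas as pd

# read a tab-delimited flat file with headers in the first row
# ("october.tsv" is a hypothetical file name)
df = pd.read_csv("october.tsv", sep="\t", header="infer")

df.head()  # confirm the import by viewing the first five rows
```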
pd.read_excel() lets you read Excel files by specifying the file path
• You can use the “sheet_name” argument to specify the worksheet (default is 0 – first sheet)
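A quick sketch, assuming a hypothetical workbook named survey.xlsx where the data lives on the second sheet:

```python
import pandas as pd

# sheet_name defaults to 0 (the first sheet); pass an index or sheet name otherwise
df = pd.read_excel("survey.xlsx", sheet_name=1)
```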
To connect to a SQL database, import a database driver and specify the database
connection, then use pd.read_sql() to query the database using SQL code
SQL Software – Database Driver:
• SQLite – sqlite3
• MySQL – mysql.connector
• Oracle – cx_Oracle
• PostgreSQL – psycopg2
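A minimal sketch using SQLite's built-in driver, assuming a hypothetical database file maven_music.db containing a customers table:

```python
import sqlite3

import pandas as pd

# specify the database connection with the driver
conn = sqlite3.connect("maven_music.db")

# query the database using SQL code and store the result in a DataFrame
customers = pd.read_sql("SELECT * FROM customers", conn)

conn.close()  # close the connection when you're done
```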
After reading data into Python, it's common to quickly explore the DataFrame to make sure the data was imported correctly:
• df.head() – Display the first five rows of a DataFrame
• df.shape – Display the number of rows and columns of a DataFrame
• df.count() – Display the number of values in each column
• df.info() – Display the non-null values and data types of each column
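A quick sketch of these checks, using the happiness survey file referenced in the assignment that follows:

```python
import pandas as pd

df = pd.read_csv("happiness_survey_data.csv")

df.head()   # first five rows
df.shape    # (rows, columns)
df.count()  # number of values in each column
df.info()   # non-null counts and data types
```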
NEW MESSAGE (June 1, 2023)
From: Anna Analysis (Senior Data Scientist)
Subject: New Survey Data

Can you load the data into Python and confirm the number of rows in the data and the range of the happiness scores? Then we can walk through next steps together.

Thanks!
Anna

happiness_survey_data.csv

Key Objectives:
1. Read in data from a .csv file
2. Store the data in a DataFrame
Python can read data from your computer, a database, or the internet
• Flat files & spreadsheets are typically stored locally on a computer, but as a data scientist you’ll likely need to
connect to and query a SQL database, and potentially collect data through web scraping or an API
In this section we’ll cover the steps for cleaning data, including converting columns to the
correct data type, handling data issues and creating new columns for analysis
The goal of data cleaning is to get raw data into a format that's ready for analysis. This includes:
• Converting columns to the correct data types for analysis
• Handling data issues that could impact the results of your analysis
• Creating new columns from existing columns that are useful for analysis

The order in which you complete these data cleaning tasks will vary by dataset, but this is a good starting point.

Even though there are automated tools available, doing some manual data cleaning provides a good opportunity to start understanding and getting a good feel for your data.
When using Pandas to read data, columns are automatically assigned a default data type
• Use the .dtypes attribute to view the data type for each DataFrame column
• Note that sometimes numeric columns (int, float) and date & time columns (datetime) aren't recognized properly by Pandas and get read in as text columns (object)
Use pd.to_datetime() to convert an object column into a datetime column
• Note that the missing values are now NaT instead of NaN (more on missing data later!)

PRO TIP: Pandas does a pretty good job at detecting the datetime values from an object if the text is in a standard format (like "YYYY-MM-DD"), but you can also manually specify the format using the format argument within the pd.to_datetime() function: pd.to_datetime(dt_col, format='%Y-%m-%d')

For a full list of formats, visit: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
CONVERTING TO NUMERIC
Use pd.to_numeric() to convert an object column into a numeric column
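A short sketch of both conversions, assuming a hypothetical DataFrame whose "order_date" and "price" columns were read in as objects:

```python
import pandas as pd

df = pd.DataFrame({
    "order_date": ["2023-06-01", "2023-06-02", None],
    "price": ["19.99", "24.99", "n/a"],
})

# object -> datetime (missing values become NaT)
df["order_date"] = pd.to_datetime(df["order_date"], format="%Y-%m-%d")

# object -> numeric; errors="coerce" turns unparseable values into NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

df.dtypes  # confirm the new data types
```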
NEW MESSAGE (June 12, 2023)
From: Alan Alarm (Researcher)
Subject: Data in Python Request

Hi there,
We just finished collecting survey data from a few thousand customers who've purchased our alarm clocks (see attached).
Can you read the data into Python and make sure the data type of each column makes sense? We'd like to do quite a few calculations using the data, so if a column can be converted to a numeric or datetime column, please do so.
Thanks!
Alan

Key Objectives:
1. Read in data from the Excel spreadsheet and store it in a Pandas DataFrame
2. Check the data type of each column
3. Convert object columns into numeric or datetime columns as needed
Data issues need to be identified and corrected upfront in order to not impact or skew the results of your analysis.

Common "messy" data issues include:
• Missing data
• Inconsistent text & typos
• Duplicates
• Outliers
The easiest way to identify missing data is with the .isna() method
• You can also use .info() or .value_counts(dropna=False)
• Use .isna().sum() to return the missing values by column
• Or use .isna().any(axis=1) to select the rows with missing values
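A minimal sketch on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ben", None], "score": [90, np.nan, 75]})

df.isna().sum()            # missing values by column
df[df.isna().any(axis=1)]  # rows with at least one missing value
```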
There are multiple ways to handle missing data:
• Keep the missing data
• Remove the rows or columns with missing data
• Impute the missing data with a substitute like the average, median, mode, etc.
• Resolve the missing data based on your domain expertise

There is no right or wrong way to deal with missing data, which is why it's important to be thoughtful and deliberate in how you handle it.
You can still perform calculations if you choose to keep missing data.
Removing rows with missing values using .dropna():
• df.dropna() – removes any rows with NaN values
• df.dropna(how='all') – removes rows that only have NaN values
• df.dropna(thresh=n) – removes rows that don't have at least "n" values
• df.dropna(subset=['column']) – removes rows with NaN values in a specified column

Note that the row index will skip values after rows are removed, but you can reset the index with df.dropna().reset_index()

Note that using .dropna() or .notna() to remove rows with missing data does not make permanent changes to the DataFrame, so you need to save the output to a new DataFrame (or the same one) or set the argument inplace=True
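A short sketch of removing versus imputing, on a small hypothetical DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Ben", None], "score": [90, np.nan, 75]})

# remove rows with any missing values and rebuild a clean 0..n index
clean = df.dropna().reset_index(drop=True)

# or impute instead: fill missing scores with the column average
df["score"] = df["score"].fillna(df["score"].mean())
```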
NEW MESSAGE (June 13, 2023)
From: Alan Alarm (Researcher)
Subject: Missing Data Check

Hi again,
Can you check the file I sent you yesterday for missing data?
Thanks!
Alan

Key Objectives:
1. Find any missing data
2. Deal with the missing data
Inconsistent text & typos in a data set are represented by values that are either:
• Incorrect by a few digits or characters
• Inconsistent with the rest of a column

Finding inconsistent text and typos within a large data set in Python is not straightforward, as there is no function that will automatically identify these situations.
While there is no specific method to identify inconsistent text & typos, you can take the following two approaches to check a column depending on its data type:
• Text columns – review the unique values and their counts to spot entries that represent the same thing
• Numeric columns – review summary statistics to confirm the values are realistic (e.g. "This looks like a realistic age range")
You can resolve inconsistent text & typos with:
1. The loc[] accessor – we've already covered using loc[] to resolve missing data using domain expertise
2. np.where() – calls the NumPy library with a logical expression that evaluates to True or False, the value to return when the expression is True, and the value to return when the expression is False
3. .map() – to map a set of values to another set of values
4. String methods – like str.lower(), str.strip() & str.replace() to clean text data

np.where() is different from the Pandas where method, which has similar functionality but different syntax. The NumPy function is used more often than the Pandas method because np.where is vectorized, meaning it executes faster.
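A minimal np.where sketch, assuming a hypothetical "state" column with one known inconsistency:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"state": ["NY", "new york", "CA"]})

# replace values where the logical expression is True, keep the rest as-is
df["state"] = np.where(df["state"].str.lower() == "new york", "NY", df["state"])
```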
Use .map() to map values from one set of values to another set of values
• You can pass a dictionary with existing values as the keys, and new values as the values (e.g. state names mapped to state abbreviations)

This is similar to creating a lookup table in Excel and using VLOOKUP to search for a value in a column of the table and retrieve a corresponding value from another column.
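A quick sketch, assuming a hypothetical "state" column of full names:

```python
import pandas as pd

df = pd.DataFrame({"state": ["New York", "California", "Texas"]})

# keys are the existing values, values are the replacements
# note: any value missing from the dictionary becomes NaN
df["state"] = df["state"].map({"New York": "NY", "California": "CA", "Texas": "TX"})
```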
String methods are commonly used to clean text and standardize it for analysis
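A short sketch of chaining common string cleanups on a hypothetical "name" column:

```python
import pandas as pd

df = pd.DataFrame({"name": ["  Alice ", "BOB", "ca-rol"]})

df["name"] = (
    df["name"]
    .str.strip()           # remove leading/trailing whitespace
    .str.lower()           # standardize casing
    .str.replace("-", "")  # remove stray characters
)
```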
NEW MESSAGE (June 14, 2023)
From: Alan Alarm (Researcher)
Subject: Inconsistent Text Check

Hi again,
Thanks for your help with the missing data yesterday. I like how you decided to handle those missing values.
Can you check the same file for inconsistencies in the text and resolve any issues that you find?
Thanks!
Alan

Key Objectives:
1. Find any inconsistent text and typos
2. Deal with the inconsistent text and typos
Duplicate data represents the presence of one or more redundant rows that
contain the same information as another, and can therefore be removed
Use the .duplicated() method to identify duplicate rows:
• It returns True for every row that is a duplicate of a previous row
• You can use keep=False to return all the duplicate rows
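A minimal sketch on a small hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2], "name": ["Ann", "Ann", "Ben"]})

df.duplicated()                # True for each repeat of a previous row
df[df.duplicated(keep=False)]  # view every row involved in duplication
df = df.drop_duplicates()      # remove the redundant rows
```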
NEW MESSAGE (June 15, 2023)
From: Alan Alarm (Researcher)
Subject: Duplicate Data Check

Hi again,
I know the last task around finding inconsistent text was a bit tricky. This should be more straightforward!
Can you check the same file for duplicate data and resolve any issues?
Thanks!
Alan

Key Objectives:
1. Find any duplicate data
2. Deal with the duplicate data
An outlier is a value in a data set that is much bigger or smaller than the others
• EXAMPLE: Average income (including outlier) = $4.1M; Average income (excluding outlier) = $82K

If outliers are not identified and dealt with, they can have a notable impact on calculations and models.
You can identify outliers in different ways using plots and statistics.

EXAMPLE Identifying outliers in student grades from a college class (e.g. values more than 3σ from the mean)

It's also important to define what you'll consider an outlier in each scenario.
Histograms are used to visualize the distribution (or shape) of a numerical column
• They help identify outliers by showing which values fall outside of the normal range
• The height of each bar is how often the value occurs
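A quick sketch of plotting a histogram of a numeric column (hypothetical grades):

```python
import pandas as pd

grades = pd.Series([88, 92, 75, 95, 41, 89, 93])

# histogram of the distribution; the 41 will stand out from the rest
grades.plot(kind="hist")
```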
A box plot summarizes a numeric column with its min, Q1, median, Q3, and max:
• The width of the "box" is the interquartile range (IQR), which is the middle 50% of the data (from Q1 to Q3)
• Any value farther away than 1.5*IQR from each side of the box is considered an outlier
The standard deviation is a measure of the spread of a data set from the mean
• EXAMPLE: a standard deviation of 8.2 indicates a large spread, while a standard deviation of 5.8 indicates a small spread
Values at least 3 standard deviations away from the mean are considered outliers
• This is meant for normally distributed, or bell-shaped, data
• The threshold of 3 standard deviations can be changed to 2 or 4+ depending on the data
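A minimal sketch of flagging values more than 3 standard deviations from the mean:

```python
import pandas as pd

s = pd.Series([88, 92, 75, 95, 41, 89, 93])

z_scores = (s - s.mean()) / s.std()  # distance from the mean in standard deviations
outliers = s[z_scores.abs() > 3]     # adjust the threshold (2, 4+) to fit the data
```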
Like with missing data, there are multiple ways to handle outliers:
• Keep outliers
• Remove an entire row or column with outliers
• Impute outliers with NaN or a substitute like the average, mode, max, etc.
• Resolve outliers based on your domain expertise
NEW MESSAGE (June 16, 2023)
From: Alan Alarm (Researcher)
Subject: Outlier Check

Hi again,
I have one last request for you and then I think our data is clean enough for now.
Can you check the file for outliers and resolve any issues?
Best,
Alan

Key Objectives:
1. Find any outliers
2. Deal with the outliers
3. Quickly explore the updated DataFrame. How do things look now after handling the data issues compared to the original DataFrame?
After cleaning data types & issues, you may still not have the exact data that you need, so you can create new columns from existing data to aid your analysis
• Numeric columns – calculating percentages, applying conditional calculations, etc.
• Datetime columns – extracting datetime components, applying datetime calculations, etc.
• Text columns – extracting text, splitting into multiple columns, finding patterns, etc.

EXAMPLE Before (add 8% tax, extract the month, split the Notes into two columns):

Name    Cost  Date     Notes
Alexis  $0    4/15/23  Coach: great job!
Alexis  $0    4/22/23  Coach: keep it up
Alexis  $25   5/10/23  PT: add strength training
David   $20   5/1/23   Trainer: longer warm up
David   $20   5/10/23  Trainer: pace yourself

After (data is ready for further analysis!):

Name    Cost + Tax  Month  Person   Note
Alexis  $0          April  Coach    great job!
Alexis  $0          April  Coach    keep it up
Alexis  $27.00      May    PT       add strength training
David   $21.60      May    Trainer  longer warm up
David   $21.60      May    Trainer  pace yourself
To calculate a percentage, you can set up two columns with the numerator and denominator values and then divide them (you can also multiply by 100 if desired)
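A minimal sketch, assuming a hypothetical spend column where each row is divided by the column total:

```python
import pandas as pd

df = pd.DataFrame({"spend": [2.50, 6.67]})

# each row's share of the column total, multiplied by 100 for a percentage
df["pct_of_total"] = df["spend"] / df["spend"].sum() * 100
```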
NEW MESSAGE (July 5, 2023)
From: Peter Penn (Sales Rep)
Subject: Pen Sales Data

Hello,
I've attached the data on our June pen sales.
Thanks!
Peter

Key Objectives:
1. Read data into Python
2. Check the data type of each column
3. Create a numeric column using arithmetic
4. Create a numeric column using conditional logic
Use dt.component to extract a component from a datetime value (day, month, etc.)
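A quick sketch of extracting components from a datetime column:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2023-06-01", "2023-06-15"]))

dates.dt.year   # 2023, 2023
dates.dt.month  # 6, 6
dates.dt.day    # 1, 15
```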
NEW MESSAGE
From: Peter Penn (Sales Rep)
Subject: Delivery Time of Pens?

Hello again,
Using the data I sent over last week, can you calculate the number of days between the purchase and delivery date for each sale and save it as a new column called "Delivery Time"?
Thanks!
Peter

Key Objectives:
1. Calculate the difference between two datetime columns and save it as a new column
2. Take the average of the new column
Use str.split() to split a text column into multiple columns
• Setting expand=True returns a DataFrame with two columns (instead of a single column of lists)
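A minimal sketch of splitting and flagging text, using hypothetical review strings shaped like the assignment that follows:

```python
import pandas as pd

reviews = pd.Series(["penfan99|Great pen, no leaks", "inklover|It started to spill"])

# split on "|" into two columns
split_df = reviews.str.split("|", expand=True)
split_df.columns = ["User Name", "Review Text"]

# flag reviews that mention either "leak" or "spill" (the "|" acts as a regex OR)
split_df["Leak or Spill"] = split_df["Review Text"].str.contains("leak|spill")
```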
NEW MESSAGE (July 19, 2023)
From: Peter Penn (Sales Rep)
Subject: Pen Reviews

Hello again,
You may have noticed that the data I sent over a few weeks ago also includes pen reviews.
Can you split the reviews on the "|" character to create two new columns: "User Name" and "Review Text"?
Can you also create a "Leak or Spill" column that flags the reviews that mention either "leak" or "spill"?
Thanks!
Peter

Key Objectives:
1. Split one column into multiple columns
2. Create a Boolean column (True/False) to show whether a text field contains particular words
Always check that the data type for each column matches its intended use
• Sometimes all columns are read in as objects into a DataFrame, so they may need to be converted into
numeric or datetime columns to be able to do any appropriate calculations down the line
It’s important to resolve any messy data issues that could impact your analysis
• When you’re looking over a data set for the first time, check for missing data, inconsistent text & typos,
duplicate data, and outliers, then be thoughtful and deliberate about how you handle each
You can create new columns based on existing columns in your DataFrame
• Depending on the data type, you can apply numeric, datetime, or string calculations on existing columns to
create new columns that can be useful for your analysis
Your goal should be to make the data clean enough for your analysis
• It’s difficult to identify all the issues in your data in order to make it 100% clean (especially if working with text
data), so spending extra time trying to do so can be counterproductive – remember the MVP approach!
In this section we’ll cover exploratory data analysis (EDA), which includes a variety of
techniques used to better understand a dataset and discover hidden patterns & insights
The exploratory data analysis (EDA) phase gives data scientists a chance to:
• Get a better sense of the data by viewing it from multiple angles
• Discover patterns, relationships, and insights from the data

Common EDA techniques:
• Exploring the data – filtering, sorting, grouping
• Visualizing the data – histograms, scatterplots, pair plots
You can filter a DataFrame by passing a logical test into the loc[] accessor
• Apply multiple filters by using the "&" and "|" operators (AND/OR)
• EXAMPLE: return the rows where "type" is equal to "herbal"

You can also sort a DataFrame by one or more columns, in ascending or descending order.
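A short sketch of filtering and sorting, using a hypothetical tea DataFrame like the one in these examples:

```python
import pandas as pd

teas = pd.DataFrame({"type": ["herbal", "green", "herbal"], "temp": [208, 175, 199]})

herbal = teas.loc[teas["type"] == "herbal"]                        # single filter
hot = teas.loc[(teas["type"] == "herbal") & (teas["temp"] > 200)]  # multiple filters
by_temp = teas.sort_values("temp", ascending=False)                # sort descending
```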
df.groupby(col)[col].aggregation()
• df – the DataFrame to group
• groupby(col) – the column(s) to group by (unique values determine the rows in the output)
• [col] – the column(s) to apply the calculation to (these become the new columns in the output)
• aggregation() – the calculation(s) to apply for each group (these become the new values in the output)

Example aggregations: mean(), sum(), min(), max(), count(), nunique()
EXAMPLE: df.groupby("type")["temp"].mean() returns the average "temp" by "type"
• Use reset_index() to return a DataFrame instead of a Series
Chain the .agg() method to .groupby() to apply multiple aggregations to each group
• EXAMPLE: return the minimum & maximum temperatures, the number of teas, and the unique temperatures for each tea type
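A minimal sketch, continuing the hypothetical tea data:

```python
import pandas as pd

teas = pd.DataFrame({"type": ["herbal", "green", "herbal"], "temp": [208, 175, 199]})

summary = (
    teas.groupby("type")["temp"]
    .agg(["min", "max", "count", "nunique"])  # multiple aggregations per group
    .reset_index()                            # return a DataFrame
)
```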
In addition to aggregating grouped data, you can also use the .head() and .tail() methods to return the first or last "n" records for each group
• EXAMPLE: df.groupby("type").head(1) returns the first row of each group, and df.groupby("type").tail(3) returns the last 3 rows of each group

ASSIGNMENT: section06_exploring_data_assignment.ipynb
Visualizing data as part of the exploratory data analysis process lets you:
• More easily identify patterns and trends in the data
• Quickly spot anomalies to further investigate (e.g. one student barely slept but tested very well)
You can use the .plot() method to create quick and simple visualizations directly
from a Pandas DataFrame
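A quick sketch, assuming hypothetical student data like the examples in this section:

```python
import pandas as pd

students = pd.DataFrame({
    "hours_slept": [8, 6, 4, 7],
    "test_grade": [88, 75, 95, 82],
})

# scatterplot straight from the DataFrame
students.plot(kind="scatter", x="hours_slept", y="test_grade")
```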
Use sns.pairplot() to create a pair plot that shows all the scatterplots and histograms that can be made using the numeric variables in a DataFrame.

Insights from the example pair plot of student data:
1. Outlier – This student barely studied and still aced the test (look into them)
2. Relationship – The hours spent studying and sleeping don't seem related at all (this is a surprising insight)
3. Relationship – The test grade is highly correlated with the class grade (can ignore one of the two fields for the analysis)
4. Data Type – The cups of coffee field only has integers (keep this in mind)
5. Outlier – This student drinks a lot of coffee (check on them)

PRO TIP: Create a pair plot as your first visual to identify general patterns that you can dig into later individually
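A minimal sketch (seaborn plots every pairwise scatterplot plus a histogram per numeric column):

```python
import pandas as pd
import seaborn as sns

students = pd.DataFrame({
    "hours_studied": [2, 5, 1, 4],
    "hours_slept": [8, 6, 4, 7],
    "test_grade": [70, 90, 95, 85],
})

sns.pairplot(students)  # scatterplots off the diagonal, histograms on it
```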
A distribution shows all the possible values in a column and how often each occurs.
Normal distribution – If you collect the height of 500 women, most will be ~64" with fewer being much shorter or taller than that

Binomial distribution – Knowing that 10% of all ad views result in a click, these are the clicks you'll get if you show an ad to 20 customers

While it's not necessary to memorize the formulas for each distribution, being able to recognize the shapes and data types for these distributions will be helpful for future data science modeling and analysis
EXAMPLE Women's Heights (in)

The empirical rule outlines where most values fall in a normal distribution:
• 68% fall within 1σ from the mean
• 95% fall within 2σ from the mean
• 99.7% fall within 3σ from the mean
(this is why data points over 3σ away from the mean are considered outliers)
The skew represents the asymmetry of a normal distribution around its mean
• For example, data on household income is typically right skewed
There are techniques that can deal with skewed data, such as taking the log of the data set, to turn it
into normally distributed data; more on this will be discussed in the prep for modeling section
NEW MESSAGE (August 8, 2023)
From: Sarah Song (Music Analysis Team)
Subject: Musical Attribute Distributions

Hi,
Our music analysis team has labeled 100 songs with their musical attributes, such as danceability and energy.
Can you tell us more about the distributions of the musical attributes? I'm guessing that since most of the musical attributes are numeric, they should be normally distributed, but let me know if you see otherwise.
Thanks!

Key Objectives:
1. Import the seaborn library
2. Create a pair plot
3. Interpret the histograms
• Which fields are normally distributed?
• Which fields have other distributions?
• Do you see any skew?
• Do you see any outliers?
• Any other observations?
• Do your interpretations make sense?

section06_visualizing_data_assignments.ipynb
• Negative correlation – students who spent more time on social media got worse grades
• No correlation – no relationship exists between hours talking with friends and test grades
• Perfect positive correlation – this is likely an error, and the grades are exact copies of each other
Correlation does not imply causation! Just because two variables are related
does not necessarily mean that changes to one cause changes to the other
Use the .corr() method to calculate the correlation coefficient between each pair of numerical variables in a DataFrame
• A correlation under 0.5 is weak, between 0.5 and 0.8 is moderate, and over 0.8 is strong

1. The correlation between a variable with itself will always be 1 – you can ignore these values across the diagonal
2. While this looks like a negative correlation, this value is very close to zero, meaning that hours studied and cups of coffee are uncorrelated
3. The only strong correlation in this table is that the grade on the test is positively correlated with the grade in the class (keep this in mind for future modeling or insights)
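A quick sketch on hypothetical student data:

```python
import pandas as pd

students = pd.DataFrame({
    "hours_studied": [2, 5, 1, 4],
    "test_grade": [70, 90, 65, 85],
})

students.corr()  # pairwise correlation matrix; the diagonal is always 1
```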
NEW MESSAGE (August 9, 2023)
From: Sarah Song (Music Analysis Team)
Subject: Musical Attribute Relationships

Hi again,
Thanks for looking into those distributions earlier. Can you also tell us more about the correlations of the musical attributes?
Thanks!

Key Objectives:
1. Look at the previously created pair plot
2. Interpret the scatterplots
• Which fields are highly correlated?
• Which fields are uncorrelated?
• Which fields have positive or negative correlations?
• Any other observations?
• Do your interpretations make sense?

section06_visualizing_data_assignments.ipynb
Data visualizations can be difficult to tweak and polish in Python, so feel free to
export the data as a CSV file and import it into another tool (Excel, Tableau, etc.)
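For example, a one-line export (hypothetical column and file names):

```python
import pandas as pd

df = pd.DataFrame({"segment": ["A", "B"], "churn_rate": [0.12, 0.08]})

# write the DataFrame to a CSV for polishing visuals in Excel or Tableau
df.to_csv("eda_results.csv", index=False)
```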
You know you've completed your initial EDA once you've:
• Investigated any initial idea or question that comes to mind about the data and gathered some meaningful insights (you can always come back to EDA after modeling!)
EDA is all about looking at and exploring data from multiple angles
• You can get a better understanding of your data just from filtering, sorting and grouping your data in Python
It’s often helpful to visualize data to more easily see patterns in the data
• One of the first plots that data scientists create to visualize their data is a pair plot, which includes scatter
plots for looking at correlations and histograms for looking at distributions
NEW MESSAGE (June 1, 2023)
From: Katniss Potter (Podcast Host)
Subject: Movie Ratings Exploration

Hi there,
I'm the host of a movie reviews podcast and I'm currently making an episode about movie review aggregators.
I found this data set from Rotten Tomatoes (inside the .ipynb file that I've attached). Could you dig into the data and share any interesting insights that you find? My audience loves fun facts about movies.
Thank you!
KP

Key Objectives:
1. Explore the data by filtering, sorting, and grouping the data
2. Create new columns to aid in analysis
3. Visualize the data
4. Interpret the aggregations & plots and share interesting insights
(detailed steps and questions in the Jupyter Notebook)

section07_midcourse_project.ipynb
In this section we’ll learn to prepare data for modeling, including merging data into a
single table, finding the right row granularity for analysis, and engineering new features
• Understand the data structures required as inputs for machine learning models
You’ve just been hired as a Data Science Intern for Maven Mega Mart, a large
ecommerce website that sells everything from clothing to pet supplies
NEW MESSAGE
From: Candice Canine (Senior Data Scientist)
Subject: Email Targeting

Hi!
We're launching a new dog food brand and would like to send out an email blast to a subset of customers that are most likely to purchase dog food.
Can you help me prep our data so I can identify those customers? You'll need to:
1. Create a single table with the appropriate row & column formats
2. Engineer features

Project scope:
• Scope – Identify customers that are most likely to purchase dog food
• Technique – Supervised learning
• Label (y) – Whether a customer has purchased dog food recently or not
• Features (x) – What other items a customer has purchased, how much money a customer has spent, etc.
So far, we’ve gathered and cleaned data to prepare for EDA, but a few additional
steps are required to prepare for modeling
Here, "modeling" refers to applying an algorithm, which is different from "data modeling", where the goal is to visually show the relationship between the tables in a relational database (a data model).
To prepare for modeling means to transform the data into a structure and format that can be used as a direct input for a machine learning algorithm:
1. Combining the data into a single table
2. Setting the right row granularity
3. Ensuring each column is non-null and numeric
There are two ways to combine multiple tables into a single table:
• Appending stacks the rows from multiple tables with the same column structure
• Joining adds related columns from one table to another, based on common values

EXAMPLE: appending two tables with the same columns adds the rows from one to the other; joining two tables can add a region column based on matching store values
pd.concat([df1, df2, df3, …]) appends DataFrames by stacking their rows

PRO TIP: You can also use .concat() to combine DataFrames horizontally by setting axis=1

Chain .reset_index() to the code to make the index go from 0 to n instead of repeating the original indices
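A minimal sketch, using hypothetical monthly sales tables with the same columns:

```python
import pandas as pd

april = pd.DataFrame({"book": ["A", "B"], "price": [10, 12]})
may = pd.DataFrame({"book": ["C", "D"], "price": [9, 14]})

# stack the rows, then rebuild a clean 0..n index
sales = pd.concat([april, may]).reset_index(drop=True)
```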
left_df.merge(right_df, how, left_on, right_on)
• right_df – the "Right" DataFrame to join with the "Left" DataFrame
• how – the type of join to perform
• left_on / right_on – the columns from each DataFrame to join by

For a full list of arguments, visit: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
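A short sketch of a left join, using hypothetical sales and customer tables:

```python
import pandas as pd

sales = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 12, 9]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["East", "West"]})

# keep every sales row and pull in the matching region
joined = sales.merge(customers, how="left", on="customer_id")
```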
These are the most common types of joins you can use with .merge(): inner, left, right, and outer (each can return a different number of rows from the same two tables)
ASSIGNMENT: CREATING A SINGLE TABLE

NEW MESSAGE (July 3, 2023)
From: Brooke Reeder (Owner, Maven Books)
Subject: Combine data sets

Hi there,
We just finished collecting our Q2 book sales data. Can you help us create one giant table that includes all of the attached files?
Thanks!
Brooke

Book_Sales_April.xlsx, Book_Sales_May.xlsx, Book_Sales_June.xlsx, Book_Customers.csv

Key Objectives:
1. Read all four files into a Jupyter Notebook
2. Append the May and June book sales to the April DataFrame
3. Join the newly created book sales DataFrame with the customers DataFrame on customer_id
• Which type of join would work best here?
To prepare rows for modeling, you need to think about the question you're trying to answer and determine what one row (observation) of your table will look like
• In other words, you need to determine the granularity of each row

GOAL Predict which customers are most likely to buy dog food in June
• Because we are predicting something for a customer, one row of data in the table should represent one customer
• Group by customer to build the x variables (features to be input into a model), like the number of pet supplies purchased before June and how much was spent on all items before June, alongside the y variable (label / output)
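A minimal sketch of rolling purchase-level rows up to one row per customer (hypothetical columns):

```python
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "category": ["pet", "food", "pet", "pet", "toys"],
    "amount": [12.0, 8.0, 20.0, 5.0, 9.0],
})

# one row per customer: total spend plus a count of pet-supply purchases
customer_level = purchases.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    pet_purchases=("category", lambda c: (c == "pet").sum()),
).reset_index()
```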
NEW MESSAGE (July 5, 2023)
From: Brooke Reeder (Owner, Maven Books)
Subject: Please format data for analysis

Hi again,
We're trying to predict which customers will purchase a book this month.
Can you reformat the data you compiled earlier this week so that it's ready to be input into a model, with each row representing a customer instead of a purchase?
Thanks!
Brooke

Key Objectives:
1. Determine the row granularity needed
2. Create a column called "June Purchases" that sums all purchases in June
3. Create a column called "Total Spend" that sums the prices of the books purchased in April & May
4. Combine the "June Purchases" and "Total Spend" columns into a single DataFrame for modeling
Once you have the data in a single table with the right row granularity, you'll move on to preparing the columns for modeling:
1. All values should be non-null
• Use df.info() or df.isna() to identify null values and either remove them, impute them, or resolve them based on your domain expertise
2. All values should be numeric

PRO TIP: There are some algorithms that can handle null and non-numeric values, including tree-based models and some classification models, but it is still best practice to prepare the data this way
A dummy variable is a field that only contains zeros and ones to represent the presence (1) or absence (0) of a value, also known as one-hot encoding
• They are used to transform a categorical field into multiple numeric fields
• EXAMPLE: the dummy variables Blue, White & Yellow are numeric representations of the "Color" field in a table of houses (House ID, Price, Color)
• Set drop_first=True to remove one of the dummy variable columns (it does not matter which column is removed, as long as one is dropped)
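In Pandas this is typically done with pd.get_dummies(); a minimal sketch on the house example:

```python
import pandas as pd

houses = pd.DataFrame({"price": [300, 250, 400], "color": ["Blue", "White", "Yellow"]})

# one column of zeros and ones per color; drop_first removes one redundant column
dummies = pd.get_dummies(houses, columns=["color"], drop_first=True)
```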
Datetime columns can be turned into numeric features such as the number of purchases by month, the days since the latest purchase, or the average days between purchases
EXAMPLE Using the days from “today” to prepare a date column for modeling
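A quick sketch of the days-from-"today" idea, with a hypothetical reference date:

```python
import pandas as pd

df = pd.DataFrame({"last_purchase": pd.to_datetime(["2023-06-01", "2023-05-15"])})

today = pd.Timestamp("2023-06-30")  # hypothetical "today"
df["days_since_purchase"] = (today - df["last_purchase"]).dt.days
```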
EXAMPLE Using the average time between dates to prepare a date column for modeling
NEW MESSAGE (July 7, 2023)
From: Brooke Reeder (Owner, Maven Books)
Subject: Please help make data numeric

Hi again,
Thanks for your help earlier this week!
We just learned that we also have to make all the data numeric before inputting it into a predictive model.
Can you turn the "Audience" text field into a numeric field?
Thanks!
Brooke

Key Objectives:
1. Create dummy variables from the "Audience" field
2. Using the "Audience" dummy variables, create three new columns that contain the number of Adult, Children, and Teen books purchased by each customer
3. Combine the three new columns back with the customer-level data
Feature engineering is the process of creating columns that you think will be helpful inputs for improving a model (help predict, segment, etc.)
• When preparing rows & columns for modeling you're already feature engineering!

Other feature engineering techniques:
• Transformations (log transform)
• Scaling (normalization & standardization)
• Proxy variables

Once you have your data in a single table and have prepared the rows & columns, it's ready for modeling. However, being deliberate about engineering new features can be the difference between a good model and a great one.
Scaling, as its name implies, requires setting all input features on a similar scale
• Normalization transforms all values to be between 0 and 1 (or between -1 and 1)

Normalization equation: x_scaled = (x − x_min) / (x_max − x_min)
• You can also use the MinMaxScaler() function from the sklearn library
• Standardization transforms all values to have a mean of 0 and standard deviation of 1

Standardization equation: x_scaled = (x − x_mean) / x_std
• You can also use the StandardScaler() function from the sklearn library
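A minimal sketch of both scalers from sklearn:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = pd.DataFrame({"spend": [10.0, 50.0, 100.0]})

X_norm = MinMaxScaler().fit_transform(X)   # rescaled to [0, 1]
X_std = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
```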
Feature engineering tips:
1. Anyone can apply an algorithm, but only someone with domain expertise can engineer relevant features, which is what makes a great model
2. You want your data to be long, not wide (many rows, few columns)
3. If you're working with customer data, a popular marketing technique is to engineer features related to the recency, frequency, and monetary value (RFM) of a customer's transactions
4. Once you start modeling, you're bound to find things you missed during data prep and will continue to engineer features and gather, clean, explore, and visualize the data
NEW MESSAGE (July 10, 2023)
From: Brooke Reeder (Owner, Maven Books)
Subject: Please create new features

Hi again,
I have one final request for you.
Can you create new features that you think will do a good job making a prediction?
Thanks!
Brooke

Key Objectives:
1. Brainstorm features that would make good predictors for a model
2. Engineer two new features
3. Add them to the non-null, numeric DataFrame that is ready for modeling
You can input the prepared data into a supervised learning model to predict which customers are most likely to purchase dog food (x = features, y = label)
• In this example, whether a customer purchased pet supplies recently is a good predictor of whether they will buy dog food, and Yvonne is the most likely to buy dog food
Preparing data for EDA is different than preparing data for modeling
• The goal of EDA is exploration while the goal of modeling is to use an algorithm to answer a question, so to prep
for modeling means to get the data in a format that can be directly input into a model
All data needs to be in a single table with non-null, numeric values for modeling
• Join together multiple tables with .merge and .concat, identify and remove null values with .isna and .dropna, create dummy variables to turn text into numeric values, and use datetime calculations to turn datetimes into numeric values
You’ve just been hired as a Jr. Data Scientist for Maven Music, a streaming service
THE that’s been losing more customers than usual the past few months and would like
SITUATION to use data science to figure out how to reduce customer churn
NEW MESSAGE (June 8, 2023)
From: Carter Careswell (Customer Care)
Subject: Customer Churn Data Prep

Hi again,
Thanks for scoping the customer churn project with us earlier. We've identified the location of the data sets for the project that we talked about – customer and listening history.
Can you read the data into Python, clean it, explore it and prepare it for modeling? More details can be found in the attached Jupyter Notebook.
Thanks for all your help!
Carter

Key Objectives:
1. Revisit the project scope
2. Read the data files into Python
3. Clean the data by converting data types, resolving data issues, and creating new columns
4. Explore the data independently, then join the tables for further exploration
5. Create a non-null, numeric DataFrame and engineer features that could be good predictors of customer churn
6. Visualize and interpret the data in the final DataFrame that is ready for modeling

section09_final_project.ipynb