The Data Science Process
3.1. Cleansing data
REDUNDANT WHITESPACE
• Whitespace, though hard to detect, can cause significant errors in data processing, such as mismatches when joining on keys. In Python you can use the strip() method to remove leading and trailing whitespace.
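• For example, a minimal sketch of the idea (the key values are hypothetical):

    raw_key = "  BRA-001 "       # key polluted with leading/trailing whitespace
    clean_key = raw_key.strip()  # remove leading and trailing whitespace

    print(raw_key == "BRA-001")    # False: the stray spaces break the match
    print(clean_key == "BRA-001")  # True once the key is cleaned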
FIXING CAPITAL LETTER MISMATCHES
• Capital letter mismatches are common. Most programming languages
make a distinction between “Brazil” and “brazil”. In this case you can
solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python.
• "Brazil".lower() == "brazil".lower() evaluates to True.
IMPOSSIBLE VALUES AND SANITY CHECKS
• Sanity checks are another valuable type of data check. Here you check
the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 299 years. Sanity
checks can be directly expressed with rules:
• check = 0 <= age <= 120
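• A minimal sketch of applying such a rule in Python (the age values are hypothetical):

    ages = [25, 43, 299, 17, -3]  # 299 and -3 are physically impossible

    # Flag every observation that violates the sanity rule 0 <= age <= 120
    violations = [age for age in ages if not (0 <= age <= 120)]
    print(violations)  # [299, -3]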
OUTLIERS
• An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a different
logic or generative process than the other observations. The easiest way
to find outliers is to use a plot or a table with the minimum and maximum
values.
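• A small sketch of the min/max table approach, assuming pandas and a hypothetical column of heights in meters:

    import pandas as pd

    heights = pd.Series([1.65, 1.80, 1.72, 4.50, 1.78], name="height_m")

    # A quick min/max summary often reveals suspicious observations at a glance
    print(heights.describe()[["min", "max"]])  # the max of 4.50 m stands out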
DEALING WITH MISSING VALUES
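• The slide gives no techniques here; common options (an assumption, not from the slide) are omitting the incomplete observations or imputing a value such as the mean. A minimal pandas sketch with a hypothetical price column:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"price": [10.0, np.nan, 12.5, np.nan, 11.0]})  # hypothetical data

    dropped = df.dropna()                               # option 1: omit incomplete rows
    imputed = df.fillna({"price": df["price"].mean()})  # option 2: impute the mean
    print(dropped, imputed, sep="\n")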
DEVIATIONS FROM A CODE BOOK
• You can detect errors in large data sets by using set operations to compare the data against a code book, which serves as metadata describing the data.
A code book includes details like the number of variables, observations,
and meanings of encoded values (e.g., “0” for negative, “5” for very
positive). By comparing data sets, you can identify values in the data that
don't match the code book, signaling errors. Using tables and difference
operators can help streamline this process, especially when working with
large amounts of data.
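• A minimal sketch of the set-difference idea, assuming a hypothetical column encoded on the code book's 0-5 scale:

    allowed_codes = {0, 1, 2, 3, 4, 5}  # valid values according to the code book
    observed = [0, 3, 5, 7, 2, 9, 1]    # hypothetical column from the data set

    # Values present in the data but absent from the code book signal errors
    unknown = set(observed) - allowed_codes
    print(unknown)  # {9, 7}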
DIFFERENT UNITS OF MEASUREMENT
• When integrating two data sets, it's crucial to account for differences in
units of measurement. For example, when analyzing global gasoline
prices, some data sets may report prices per gallon, while others use
prices per liter. In such cases, a simple unit conversion can resolve the problem.
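• A minimal sketch of such a conversion (the prices are hypothetical):

    LITERS_PER_US_GALLON = 3.785411784

    prices_per_gallon = [3.50, 3.80, 4.10]          # hypothetical prices in USD per gallon
    prices_per_liter = [p / LITERS_PER_US_GALLON    # convert to USD per liter
                        for p in prices_per_gallon]
    print([round(p, 3) for p in prices_per_liter])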
DIFFERENT LEVELS OF AGGREGATION
• Different levels of aggregation in data sets, like weekly data versus
work-week data, are similar to measurement differences. These
discrepancies are usually easy to spot and can be resolved by
summarizing or expanding the data. Cleaning data early is crucial before
combining information from various sources.
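• A small sketch of summarizing to a coarser level with pandas (the daily sales are hypothetical):

    import pandas as pd

    daily = pd.DataFrame(
        {"sales": range(14)},
        index=pd.date_range("2024-01-01", periods=14, freq="D"),
    )

    # Summarize daily observations into weekly totals
    weekly = daily.resample("W").sum()
    print(weekly)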
CORRECT ERRORS AS EARLY AS POSSIBLE
• Data errors should be fixed as early as possible in the collection process to
prevent costly mistakes and repeated corrections in multiple projects.
These errors can reveal issues like faulty business processes, defective
equipment, or software bugs. However, data scientists may not always
control data collection, so handling errors in code becomes necessary. It's
essential to keep a copy of original data to avoid losing valuable
information during cleaning. Combining data from different sources is often
more challenging than cleaning individual data sets.
3.2 Integrating data
• This section focuses on integrating data from various sources, which can
differ in size, type, and structure, such as databases, Excel files, and text
documents. For simplicity, the chapter concentrates on table-structured
data.
• There are two main ways to combine data:
1. Joining – Enriches data by merging information from one table with
another.
2. Appending/Stacking – Adds rows from one table to another.
• You can either create a new physical table or a virtual table (view). A view
saves disk space, as it doesn't store data separately.
JOINING TABLES
• Joining tables allows you to combine information from two tables to enrich individual
observations. For example, you can merge customer purchase data with their regional
information by using a common field, known as a key, such as a customer name or
Social Security number. Keys that uniquely identify records are called primary keys.
Joining tables is similar to using a lookup function in Excel. The number of rows in the
output depends on the type of join used, which will be explained later.
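• A minimal sketch of such a join with pandas, assuming hypothetical purchase and region tables that share a customer_id key:

    import pandas as pd

    purchases = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [50, 75, 20]})
    regions = pd.DataFrame({"customer_id": [1, 2, 3], "region": ["North", "South", "East"]})

    # Enrich each purchase with the customer's region via the common key
    enriched = purchases.merge(regions, on="customer_id", how="left")
    print(enriched)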
Appending
• Appending or stacking tables involves adding the observations from one table to
another, resulting in a larger table. For example, appending January's data with
February's creates a combined table with observations from both months. This
operation is similar to the union operation in set theory, and in SQL, it's performed
using the UNION command. Other set operations, like set difference and intersection,
are also used in data science.
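• A minimal sketch of appending with pandas (the SQL equivalent is a UNION); the monthly tables are hypothetical:

    import pandas as pd

    january = pd.DataFrame({"month": ["Jan", "Jan"], "sales": [100, 150]})
    february = pd.DataFrame({"month": ["Feb", "Feb"], "sales": [120, 130]})

    # Stack February's observations under January's
    combined = pd.concat([january, february], ignore_index=True)
    print(combined)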
USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
• To avoid data duplication, you can use views to virtually combine data.
Unlike creating a new physical table, which requires additional storage
space, a view acts as a virtual layer that combines data from multiple
tables without duplicating it. For example, sales data from different months
can be virtually combined into a yearly sales table. However, views have a
drawback: they recreate the join each time they are queried, consuming
more processing power than a pre-calculated table.
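• A minimal sketch using SQLite from Python, with hypothetical monthly sales tables combined into a virtual yearly table:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE sales_jan (product TEXT, amount REAL);
        CREATE TABLE sales_feb (product TEXT, amount REAL);
        INSERT INTO sales_jan VALUES ('A', 100), ('B', 150);
        INSERT INTO sales_feb VALUES ('A', 120), ('B', 130);

        -- The view stores no data; the UNION ALL is re-run on every query
        CREATE VIEW sales_year AS
            SELECT * FROM sales_jan
            UNION ALL
            SELECT * FROM sales_feb;
    """)
    print(con.execute("SELECT * FROM sales_year").fetchall())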
ENRICHING AGGREGATED MEASURES
• Data enrichment involves adding calculated information to a table, such as total sales
or the percentage of total stock sold in a specific region. This aggregated data provides
additional insights, enabling the calculation of each product's participation within its
category. While useful for data exploration, it's especially beneficial when creating data
models. Generally, models that use relative measures, like percentage sales, tend to
perform better than those using raw numbers.
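• A small sketch of adding such aggregated measures with pandas (the product table is hypothetical):

    import pandas as pd

    df = pd.DataFrame({
        "category": ["fruit", "fruit", "dairy", "dairy"],
        "product":  ["apple", "pear", "milk", "cheese"],
        "sales":    [100, 300, 250, 250],
    })

    # Add the category total, then each product's share within its category
    df["category_sales"] = df.groupby("category")["sales"].transform("sum")
    df["share_of_category"] = df["sales"] / df["category_sales"]
    print(df)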
3.3 Transforming data
• Relationships between an input variable and an output variable aren't always linear. Taking the log of the independent variables can simplify the estimation problem dramatically: transforming the input this way often turns a nonlinear relationship into one that a linear model can fit.
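• A minimal sketch of the idea with NumPy, assuming hypothetical data generated as y = 2 + 3*log(x) plus noise:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(1, 100, 200)
    y = 2.0 + 3.0 * np.log(x) + rng.normal(scale=0.1, size=x.size)  # nonlinear in x

    # After taking log(x) the relationship is linear, so a least-squares line fits well
    slope, intercept = np.polyfit(np.log(x), y, deg=1)
    print(round(intercept, 2), round(slope, 2))  # close to 2.0 and 3.0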
REDUCING THE NUMBER OF VARIABLES
• Sometimes you have too many variables and need to reduce the number because they don't add new information to the model. Having too many variables makes the model difficult to handle, and certain techniques don't perform well when you overload them with too many input variables. The process of reducing the number of variables in the dataset is known as dimensionality reduction. Principal component analysis (PCA) is a commonly used dimensionality reduction technique.
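• A minimal sketch with scikit-learn (the library choice is an assumption), reducing hypothetical correlated variables to two principal components:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    # Five columns built from two underlying signals, so two components capture most variance
    X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1],
                         2 * base[:, 0], base[:, 1] - base[:, 0]])

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)
    print(X_reduced.shape)                      # (100, 2)
    print(pca.explained_variance_ratio_.sum())  # close to 1.0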
TURNING VARIABLES INTO DUMMIES
Dummy variables convert categorical variables into binary indicators that can take values of 1 (true) or 0 (false). This technique creates separate columns for each category; for example, a "Weekdays" column can be transformed into individual columns for Monday through Sunday, where a 1 indicates the presence of that day and a 0 indicates its absence. This method is commonly used in modeling.
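A minimal sketch with pandas (the weekday column is hypothetical):

    import pandas as pd

    df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

    # One binary indicator column per category
    dummies = pd.get_dummies(df["weekday"], prefix="day")
    print(dummies)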
Step 4: Exploratory data analysis
During exploratory data analysis (EDA), you thoroughly examine your data, primarily
using graphical techniques to visualize and understand the interactions between
variables. This phase emphasizes exploration, so it's important to remain open-minded and attentive. While the main goal is understanding the data, you may still discover previously overlooked anomalies that require corrective action.
Bar charts, line plots, and distribution plots such as the histogram and boxplot are some of the graphs used in exploratory analysis.
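A small sketch of two of these plots with pandas and Matplotlib (the data is hypothetical):

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd

    values = pd.Series(np.random.default_rng(0).normal(size=500), name="value")

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
    values.plot.hist(ax=ax1, bins=30, title="Histogram")  # shape of the distribution
    values.plot.box(ax=ax2, title="Boxplot")              # median, quartiles, outliers
    plt.tight_layout()
    plt.show()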
Step 5: Build the models
In this phase, you have clean data and a clear understanding of it. Now, you're ready to build models to achieve specific goals like making better predictions, classifying objects, or understanding the system you're analyzing. This step is more focused compared to earlier exploration because you already know what you're trying to find and what you want to achieve.

Building a model is an iterative process, meaning you'll refine it over time. The process may vary depending on whether you use traditional statistics or modern machine learning. Most models follow these steps:
1. Choose a technique and select the variables to use.
2. Run the model.
3. Diagnose and compare models to find the best one.
5.1 Model and variable selection
• When building a model, you need to choose the right variables and a modeling
technique. Your earlier exploratory analysis should help you figure out which variables
will be useful. There are many modeling techniques, and picking the right one requires
good judgment. You should also consider factors like:
• Will the model be easy to implement in a production environment?
• How hard will it be to maintain, and how long will it stay relevant without changes?
• Does the model need to be easy to explain?
• Once you've thought about these things, you're ready to take action and start building
the model.
5.2 Model execution
• Once you've chosen a model, you'll need to implement it in code.
• Linear regression tries to fit a line while minimizing the distance to each point.
• Confusion matrix: shows how many cases were correctly and incorrectly classified by comparing the predictions with the real values.
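• A minimal sketch of both pieces with scikit-learn (the library choice and data are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))

    # Linear regression: fit a line that minimizes the distance to each point
    y = 1.5 * X[:, 0] + rng.normal(scale=0.2, size=100)
    reg = LinearRegression().fit(X, y)
    print(reg.coef_, reg.intercept_)

    # Confusion matrix: compare predicted classes with the real values
    labels = (X[:, 0] > 0).astype(int)
    clf = LogisticRegression().fit(X, labels)
    print(confusion_matrix(labels, clf.predict(X)))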
5.3 Model diagnostics and model comparison
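• The slide gives no detail here; a common approach (an assumption, not from the slide) is to hold out part of the data and compare candidate models on an error measure such as the mean squared error. A minimal sketch:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

    # Hold out 30% of the data so models are judged on observations they never saw
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model_a = LinearRegression().fit(X_train[:, :1], y_train)  # uses only 1 variable
    model_b = LinearRegression().fit(X_train, y_train)         # uses all 3 variables

    print(mean_squared_error(y_test, model_a.predict(X_test[:, :1])))
    print(mean_squared_error(y_test, model_b.predict(X_test)))  # lower error is better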
Step 6: Presenting findings and building applications on top of them
• After building a well-performing model, it's important to present the findings
to stakeholders. This stage often requires automating model predictions or
creating tools to update reports and presentations. Automation helps avoid
repeating manual tasks. Finally, soft skills are crucial for effectively
communicating insights, as it's essential to ensure that stakeholders
understand and value your work.