
Module 2

THE DATA SCIENCE PROCESS: Overview of the data science process; defining research goals and creating a project charter; retrieving data; cleansing, integrating and transforming data; exploratory data analysis; building the models; presenting findings and building applications on top of them.
Overview of the data science process
Step 1: Defining research goals and creating a
project charter
•A project starts by understanding the what, the why, and the
how of your project. What does the company expect you to do?
•Answering these three questions (what, why, how) is the goal
of the first phase, so that everybody knows what to do and can
agree on the best course of action.
•1.1 Spend time understanding the goals and context of
your research
•1.2 Create a project charter
Clients want to know what they are paying for right from the start. Once you
understand their business problem, it's important to agree on exactly what
you'll deliver to them. All these details should be written down in a project
charter.
Step 1: Defining research goals and
creating a project charter
A project charter requires teamwork, and your input covers at
least the following:
■ A clear research goal
■ The project mission and context
■ How you’re going to perform your analysis
■ What resources you expect to use
■ Proof that it’s an achievable project, or proof of concepts
■ Deliverables and a measure of success
■ A timeline
Step 2: Retrieving data
Sometimes you need to go into the field and design a data collection process yourself, but most of the time you won't be involved in this step. Many companies will have already collected and stored the data for you, and what they don't have can often be bought from third parties.
Data can be stored in many forms, ranging from simple text files to tables in a database. The objective now is acquiring all the data you need. This may be difficult, and even if you succeed, data is often like a diamond in the rough: it needs polishing to be of any use to you.
2.1. Start with data stored within the company
Data is typically stored in databases, data marts, data warehouses, or data lakes.
Databases are for data storage, while data warehouses are for analysis. A data mart
is a smaller subset of a data warehouse for specific business units. Data warehouses
and marts hold preprocessed data, while data lakes contain raw data. Sometimes,
important data may still exist in Excel files on someone’s computer.
Step 2: Retrieving data
2.2 Don't be afraid to shop around
If data isn't available inside your organization, look outside your organization's walls.
2.3. Do data quality checks now to prevent problems later
Most of the errors you'll encounter during the data gathering phase are easy to spot, but being too careless will make you spend many hours solving data issues that could have been prevented during data import.
Step 3: Cleansing, integrating, and
transforming data
• 3.1. Cleansing data
• Data cleansing is a key step in the data science process that involves
fixing errors to ensure the data accurately represents the real-world
processes it comes from. There are two main types of errors to address:
interpretation errors (e.g., unrealistic values like an age of 300 years)
and inconsistencies (e.g., representing "Female" as both "Female" and
"F" in different tables). Common issues include physically impossible
values, typos, outliers, missing data, and inconsistent units (like Pounds
vs. Dollars). Cleansing the data ensures it is ready for accurate analysis
and modeling.
3.1. Cleansing data
• DATA ENTRY ERRORS

• REDUNDANT WHITESPACE
Whitespaces, though hard to detect, can cause significant errors in data
processing, like mismatches when joining data keys. In Python you can use the
strip() function to remove leading and trailing spaces.
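As a minimal sketch (the customer_id column is hypothetical), the snippet below shows how a stray trailing space breaks a key match and how strip(), or pandas' str.strip(), fixes it:

# A trailing space from a data entry error makes two identical keys mismatch
key_a = "BR001 "
key_b = "BR001"
print(key_a == key_b)             # False: the keys no longer match
print(key_a.strip() == key_b)     # True: strip() removes leading/trailing whitespace

# The same idea applied to a whole column with pandas
import pandas as pd
df = pd.DataFrame({"customer_id": [" C01", "C02 ", "C03"]})
df["customer_id"] = df["customer_id"].str.strip()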
3.1. Cleansing data
FIXING CAPITAL LETTER MISMATCHES
• Capital letter mismatches are common. Most programming languages
make a distinction between “Brazil” and “brazil”. In this case you can
solve the problem by applying a function that returns both strings in
lowercase, such as .lower() in Python.
• “Brazil”.lower() == “brazil”.lower() should result in True.
IMPOSSIBLE VALUES AND SANITY CHECKS
• Sanity checks are another valuable type of data check. Here you check
the value against physically or theoretically impossible values such as
people taller than 3 meters or someone with an age of 299 years. Sanity
checks can be directly expressed with rules:
• check = 0 <= age <= 120
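A hedged sketch of the same rule applied with pandas (the table and its columns are made up for illustration):

import pandas as pd

# Hypothetical data with one physically impossible age
people = pd.DataFrame({"name": ["Ann", "Bob", "Cleo"], "age": [34, 299, 57]})

# The sanity-check rule from above, applied row by row
check = people["age"].between(0, 120)
print(people[~check])    # rows that fail the check (Bob, age 299)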
3.1. Cleansing data
OUTLIERS
• An outlier is an observation that seems to be distant from other
observations or, more specifically, one observation that follows a different
logic or generative process than the other observations. The easiest way
to find outliers is to use a plot or a table with the minimum and maximum
values.
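A small illustration, on hypothetical measurements, of both approaches: a minimum/maximum summary and a boxplot:

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical measurements with one suspicious value
values = pd.Series([1.2, 0.9, 1.1, 1.3, 0.8, 9.7])

# A table with minimum and maximum values is often enough to spot outliers
print(values.describe()[["min", "max"]])

# A boxplot makes the distant observation visible at a glance
values.plot(kind="box")
plt.show()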
3.1. Cleansing data
• DEALING WITH MISSING VALUES
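The slide gives no code here, but as a minimal, assumption-laden sketch, two common strategies in pandas are omitting incomplete observations or imputing a value such as the column mean:

import pandas as pd
import numpy as np

# Hypothetical table with missing entries
df = pd.DataFrame({"age": [34, np.nan, 57], "income": [40000, 52000, np.nan]})

print(df.isna().sum())             # count missing values per column
dropped = df.dropna()              # option 1: omit incomplete observations
imputed = df.fillna(df.mean())     # option 2: impute, here with the column mean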
3.1. Cleansing data
DEVIATIONS FROM A CODE BOOK
• This section explains how to detect errors in large data sets by using set operations
to compare the data against a code book, which serves as metadata describing the data.
A code book includes details like the number of variables, the number of observations,
and the meanings of encoded values (e.g., “0” for negative, “5” for very
positive). By comparing the data set with the code book, you can identify values in the data that
don't match the code book, signaling errors. Using tables and difference
operators can help streamline this process, especially when working with
large amounts of data.
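A minimal sketch of this idea using Python sets (the encoded values are hypothetical):

# The code book says a rating is encoded as 0-5
codebook_values = {0, 1, 2, 3, 4, 5}
observed_values = {0, 1, 2, 3, 4, 5, 9, -1}

# The set difference reveals values in the data that the code book doesn't describe
errors = observed_values - codebook_values
print(errors)   # {9, -1} -> likely data errors or undocumented codes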
DIFFERENT UNITS OF MEASUREMENT
• When integrating two data sets, it's crucial to account for differences in
units of measurement. For example, when analyzing global gasoline
prices, some data sets may report prices per gallon, while others use
prices per liter. In such cases, a simple unit conversion can resolve the issue.
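For illustration only (the prices are made up), a simple per-gallon to per-liter conversion in pandas:

import pandas as pd

# Hypothetical gasoline prices reported per gallon in one source
prices_per_gallon = pd.Series([3.60, 3.75, 3.50])

# Convert to prices per liter so both data sets use the same unit
LITERS_PER_GALLON = 3.78541
prices_per_liter = prices_per_gallon / LITERS_PER_GALLON
print(prices_per_liter)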
3.1. Cleansing data
DIFFERENT LEVELS OF AGGREGATION
• Different levels of aggregation in data sets, like weekly data versus
work-week data, are similar to measurement differences. These
discrepancies are usually easy to spot and can be resolved by
summarizing or expanding the data. Cleaning data early is crucial before
combining information from various sources.
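A small sketch, assuming hypothetical daily sales, of summarizing fine-grained data to a coarser weekly level with pandas:

import pandas as pd

# Hypothetical daily sales; the other data set is aggregated per week
daily = pd.Series(
    range(14),
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Summarize the finer-grained data to the coarser level before combining
weekly = daily.resample("W").sum()
print(weekly)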
CORRECT ERRORS AS EARLY AS POSSIBLE
• Data errors should be fixed as early as possible in the collection process to
prevent costly mistakes and repeated corrections in multiple projects.
These errors can reveal issues like faulty business processes, defective
equipment, or software bugs. However, data scientists may not always
control data collection, so handling errors in code becomes necessary. It's
essential to keep a copy of original data to avoid losing valuable
information during cleaning. Combining data from different sources is often
more challenging than cleaning individual data sets.
3.2 Integrating data
• This section focuses on integrating data from various sources, which can
differ in size, type, and structure, such as databases, Excel files, and text
documents. For simplicity, the chapter concentrates on table-structured
data.
• There are two main ways to combine data:
1. Joining – Enriches data by merging information from one table with
another.
2. Appending/Stacking – Adds rows from one table to another.
• You can either create a new physical table or a virtual table (view). A view
saves disk space, as it doesn't store data separately.
3.2 Integrating data
JOINING TABLES
• Joining tables allows you to combine information from two tables to enrich individual
observations. For example, you can merge customer purchase data with their regional
information by using a common field, known as a key, such as a customer name or
Social Security number. Keys that uniquely identify records are called primary keys.
Joining tables is similar to using a lookup function in Excel. The number of rows in the
output depends on the type of join used, which will be explained later.
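A minimal sketch of such a join in pandas (table and column names are hypothetical):

import pandas as pd

# Hypothetical tables: purchases and regional information share a key
purchases = pd.DataFrame({"customer": ["Ann", "Bob"], "amount": [120, 80]})
regions = pd.DataFrame({"customer": ["Ann", "Bob"], "region": ["North", "South"]})

# Enrich each purchase with the customer's region, like a lookup in Excel
enriched = purchases.merge(regions, on="customer", how="left")
print(enriched)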
3.2 Integrating data
Appending
• Appending or stacking tables involves adding the observations from one table to
another, resulting in a larger table. For example, appending January's data with
February's creates a combined table with observations from both months. This
operation is similar to the union operation in set theory, and in SQL, it's performed
using the UNION command. Other set operations, like set difference and intersection,
are also used in data science.
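A minimal sketch of appending two hypothetical monthly tables with pandas, the equivalent of a SQL UNION ALL:

import pandas as pd

# Hypothetical monthly tables with the same columns
january = pd.DataFrame({"product": ["A", "B"], "sales": [10, 20]})
february = pd.DataFrame({"product": ["A", "B"], "sales": [15, 5]})

# Appending/stacking adds February's rows below January's
both_months = pd.concat([january, february], ignore_index=True)
print(both_months)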
3.2 Integrating data
• USING VIEWS TO SIMULATE DATA JOINS AND APPENDS
• To avoid data duplication, you can use views to virtually combine data.
Unlike creating a new physical table, which requires additional storage
space, a view acts as a virtual layer that combines data from multiple
tables without duplicating it. For example, sales data from different months
can be virtually combined into a yearly sales table. However, views have a
drawback: they recreate the join each time they are queried, consuming
more processing power than a pre-calculated table.
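As an illustrative sketch (tables and values are hypothetical), a view can be created from Python with sqlite3; the yearly table exists only virtually and the join is recomputed on every query:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_jan (product TEXT, amount REAL);
    CREATE TABLE sales_feb (product TEXT, amount REAL);
    INSERT INTO sales_jan VALUES ('A', 10), ('B', 20);
    INSERT INTO sales_feb VALUES ('A', 15), ('B', 5);

    -- A view that virtually appends both months; no extra data is stored
    CREATE VIEW sales_year AS
        SELECT * FROM sales_jan
        UNION ALL
        SELECT * FROM sales_feb;
""")
print(conn.execute("SELECT product, SUM(amount) FROM sales_year GROUP BY product").fetchall())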
3.2 Integrating data
ENRICHING AGGREGATED MEASURES
• Data enrichment involves adding calculated information to a table, such as total sales
or the percentage of total stock sold in a specific region. This aggregated data provides
additional insights, enabling the calculation of each product's participation within its
category. While useful for data exploration, it's especially beneficial when creating data
models. Generally, models that use relative measures, like percentage sales, tend to
perform better than those using raw numbers.
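A small sketch, with made-up sales figures, of adding an aggregated measure and a relative measure with pandas:

import pandas as pd

# Hypothetical sales table
sales = pd.DataFrame({
    "category": ["fruit", "fruit", "dairy", "dairy"],
    "product":  ["apple", "pear", "milk", "cheese"],
    "sales":    [50, 150, 80, 20],
})

# Add the category total and each product's share within its category
sales["category_total"] = sales.groupby("category")["sales"].transform("sum")
sales["share_of_category"] = sales["sales"] / sales["category_total"]
print(sales)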
3.3 Transforming data
Relationships between an input variable and an output variable aren't always linear. Transforming the input variables, for example by taking the log of the independent variables, can greatly simplify the estimation problem.
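A minimal illustration with made-up numbers: after a log transform of the input, a simple linear fit captures the relationship:

import numpy as np

# Hypothetical input variable with a nonlinear relationship to the output
x = np.array([1, 10, 100, 1000, 10000], dtype=float)
y = np.array([0.1, 1.0, 2.1, 2.9, 4.0])

# After the log transform the relationship is approximately linear,
# so a straight-line fit works well
x_log = np.log(x)
slope, intercept = np.polyfit(x_log, y, deg=1)
print(slope, intercept)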
3.3 Transforming data
REDUCING THE NUMBER OF VARIABLES
• Sometimes you have too many variables and need to reduce the number
because they don’t add new information to the model. Having too many
variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input
variables. The method used to reduce the number of variables in a dataset
is known as dimensionality reduction. Principal component
analysis (PCA) is a commonly used dimensionality reduction technique.
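A hedged sketch of PCA with scikit-learn on randomly generated, made-up data:

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data set with 10 input variables
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

# Reduce the 10 variables to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variance captured by each component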
3.3 Transforming data
TURNING VARIABLES INTO DUMMIES
Dummy variables convert categorical
variables into binary indicators that
can take values of 1 (true) or 0
(false). This technique creates
separate columns for each category;
for example, a "Weekdays" column
can be transformed into individual
columns for Monday through
Sunday, where a 1 indicates the
presence of that day and a 0
indicates its absence. This method is
commonly used in modeling.
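A minimal example with a hypothetical weekday column, using pandas' get_dummies:

import pandas as pd

# Hypothetical categorical variable
df = pd.DataFrame({"weekday": ["Monday", "Tuesday", "Monday", "Sunday"]})

# Each category becomes its own 0/1 indicator column
dummies = pd.get_dummies(df["weekday"])
print(dummies)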
Step 4: Exploratory data analysis
During exploratory data analysis (EDA), you thoroughly examine your data, primarily
using graphical techniques to visualize and understand the interactions between
variables. This phase emphasizes exploration, so it's important to remain
open-minded and attentive. The main goal is to discover previously overlooked
anomalies that may require corrective action.
The bar chart, a line plot, and a distribution are some of the graphs used in exploratory analysis.
[Figures: histogram and boxplot]
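A small sketch, on randomly generated values, of the two plots named above:

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical variable to explore
rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=500)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=30)      # histogram: shape of the distribution
ax1.set_title("Histogram")
ax2.boxplot(values)            # boxplot: median, spread, and potential outliers
ax2.set_title("Boxplot")
plt.show()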
Step 5: Build the models
In this phase, you have clean data and a clear understanding of it. Now you're ready to build models to achieve specific goals like making better predictions, classifying objects, or understanding the system you're analyzing. This step is more focused compared to earlier exploration because you already know what you're trying to find and what you want to achieve.
Building a model is an iterative process, meaning you'll refine it over time. The process may vary depending on whether you use traditional statistics or modern machine learning. Most models follow these steps:
1. Choose a technique and select the variables to use.
2. Run the model.
3. Diagnose and compare models to find the best one.
5.1 Model and variable selection
• When building a model, you need to choose the right variables and a modeling
technique. Your earlier exploratory analysis should help you figure out which variables
will be useful. There are many modeling techniques, and picking the right one requires
good judgment. You should also consider factors like:
• Will the model be easy to implement in a production environment?
• How hard will it be to maintain, and how long will it stay relevant without changes?
• Does the model need to be easy to explain?
• Once you've thought about these things, you're ready to take action and start building
the model.
5.2 Model execution
• Once you’ve chosen a model you’ll need to implement it in code.

Linear regression tries to fit a line while minimizing the distance to each point.
Confusion matrix: it shows how many cases were correctly classified and incorrectly classified by comparing the prediction with the real values.
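An illustrative sketch on synthetic data of both ideas: fitting a linear regression and computing a confusion matrix with scikit-learn (the data and model choices are assumptions, not from the slides):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(2)

# Linear regression: fit a line that minimizes the distance to each point
x = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * x[:, 0] + rng.normal(scale=1.0, size=100)
reg = LinearRegression().fit(x, y)
print(reg.coef_, reg.intercept_)

# Confusion matrix: compare predicted classes with the real values
labels = (x[:, 0] > 5).astype(int)
clf = LogisticRegression().fit(x, labels)
print(confusion_matrix(labels, clf.predict(x)))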
5.3 Model diagnostics and model comparison
Step 6: Presenting findings and building applications on top of them
•After building a well-performing model, it's important to present the findings
to stakeholders. This stage often requires automating model predictions or
creating tools to update reports and presentations. Automation helps avoid
repeating manual tasks. Finally, soft skills are crucial for effectively
communicating insights, as it's essential to ensure that stakeholders
understand and value your work.
