Data Science Process
Freebase.org: an open database that retrieves information from sites like
Wikipedia, MusicBrainz, and the SEC archive
Value counts can reveal data entry errors; in the frequency table below, "Goood" and "Bda" are misspellings of "Good" and "Bad".
Value     Count
Good      1678763
Bad       1274648
Goood     15
Bda       5
Redundant Whitespace
• Whitespaces at the end of a string are hard to detect and can lead to
wrong results from the data analysis.
• During the ETL process, whitespace may lead to errors in the dataset.
• For example, there is a difference between “AB ” and “AB”.
• During the data cleaning process, most programming languages can
easily remove leading and trailing whitespace.
• In Python, leading and trailing whitespace can be removed using the
strip() function, as shown in the sketch below.
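A minimal sketch of removing redundant whitespace in Python; the DataFrame df and the column name customer_code are hypothetical:

raw_value = "AB  "

# strip() removes both leading and trailing whitespace
clean_value = raw_value.strip()
print(repr(clean_value))   # 'AB'

# For a whole pandas column (hypothetical column name), use the .str accessor:
# df["customer_code"] = df["customer_code"].str.strip()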
Capital Letter Mismatch
• Capital letter mismatches in data are common.
• For most programming languages there is a difference between “India”
and “india”.
• In Python, you can solve this problem by applying the .lower() method,
which converts both strings to lowercase, as in the sketch below.
“India”.lower() == “india”.lower()
• should evaluate to True.
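A minimal sketch of the case-insensitive comparison in Python:

# Lowercasing both sides makes the comparison case insensitive
print("India".lower() == "india".lower())   # True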
Impossible Values
• Another valuable type of data check is a sanity check for impossible
values.
• In this case, you check whether there are any physically or
theoretically impossible values in the data set.
• People taller than 3 meters or people with an age of 299 years are
examples of impossible values.
• You can perform the sanity check with simple rules, for example
check = 0 <= age <= 120, as in the sketch below.
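A minimal sketch of such a rule-based sanity check; the pandas DataFrame df and its age column are hypothetical:

age = 45

# Rule-based sanity check: age must lie in a plausible range
check = 0 <= age <= 120
print(check)   # True

# For a pandas column, flag the rows that violate the rule:
# invalid_rows = df[~df["age"].between(0, 120)]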
Outliers
• An observation that seems distant from the other observations is
known as an outlier.
• Outliers follow a different generative process or logic than the rest of
the observations.
• You can easily find outliers using a plot or a table with the minimum
and maximum values.
• The first figure below shows no outliers, whereas the second figure
shows the presence of outliers.
Outliers
• The first plot shows that most of the observations are around the
mean value of the distribution.
• The frequency of observations decreases as they move away from the
mean value.
• For a normal distribution, the high values on the right-hand side of the
bottom graph show the presence of outliers.
• Outliers can influence the results of your data model, so you need to
investigate them first (a simple check is sketched below).
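A minimal sketch of spotting outliers with a minimum/maximum table and a simple rule of thumb; the data is synthetic and the three-standard-deviation cutoff is only one common convention:

import numpy as np
import pandas as pd

# Synthetic data with a few extreme values appended (illustration only)
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=50, scale=5, size=1000), [120, 150])
s = pd.Series(values, name="measurement")

# A table with the minimum and maximum quickly reveals suspicious points
print(s.describe()[["min", "max"]])

# Flag points more than 3 standard deviations away from the mean
z_scores = (s - s.mean()) / s.std()
print(s[z_scores.abs() > 3])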
Missing Values
• In your dataset, you may find several missing values.
• They are not always wrong, but you still need to handle them.
• Several data modeling techniques cannot handle missing values
effectively.
• Sometimes they are the result of a faulty data collection process or
an error during the ETL process.
• The common techniques for handling missing values are listed below
with their advantages and disadvantages, followed by a short pandas sketch.
• Data scientists and machine learning engineers use them according to
their data and the requirements of the model.
Missing Values
• Omit the values. Advantage: easy to perform. Disadvantage: you lose the
information from an observation.
• Set value to null. Advantage: easy to perform. Disadvantage: not all data
modeling techniques can handle null values.
• Insert a static value such as 0 or the mean. Advantages: easy to perform;
information from the other variables in the observation is not lost.
Disadvantage: may result in false estimations from the model.
• Model the value (nondependent). Advantage: the model is not disturbed.
Disadvantages: can lead to too much confidence in the model; can artificially
raise dependence among the variables; harder to execute; you make data
assumptions.
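A minimal pandas sketch of the first three techniques from the list above; the DataFrame and its columns are hypothetical:

import numpy as np
import pandas as pd

# Hypothetical data frame with missing values
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [50000, 62000, np.nan, 58000]})

# Omit the observations that contain missing values
df_omitted = df.dropna()

# Insert a static value such as 0
df_zero = df.fillna(0)

# Insert the mean of each column
df_mean = df.fillna(df.mean())
print(df_mean)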
Units of Measurement
• While integrating two different data sets, you should check the
respective units of measurement.
• For example, the distance between two cities can be provided by
different data providers.
• One dataset may contain the distance in miles, whereas the other
dataset may contain the distance in kilometers.
• You can use a simple conversion formula to fix such issues, as in the
sketch below.
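A minimal sketch of such a conversion; the helper function miles_to_km is only an illustration:

MILES_TO_KM = 1.609344   # kilometers per mile

def miles_to_km(distance_miles):
    # Convert a distance in miles to kilometers
    return distance_miles * MILES_TO_KM

print(miles_to_km(100))   # roughly 160.9 km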
Correct Errors as early as possible
• Many key organizational decisions depend on insights from the data.
• Hence, organizations spend a good amount of money on the data
retrieval process.
• But the data retrieval process is a difficult, expensive, and error-prone
task.
• Hence, as soon as the data is retrieved, it should be corrected for the
following reasons.
Correct Errors as early as possible
• Spotting data anomalies is not easy for everyone. The organization may
end up making costly and wrong decisions if those decisions are based on
models that use incorrect data.
• If data cleansing is not done at an early stage, all the downstream
projects have to clean the data, which is a costly and time-consuming process.
• Data errors may be due to a fault in the business process that needs to be
corrected; otherwise it may result in a loss of revenue for the organization.
• Data errors may be due to defects in equipment, such as defective
sensors.
• Data errors may be due to bugs in software that may be critical for the
company.
Correct Errors as early as possible
• Ideally, data errors should be fixed as soon as they are captured.
• But data scientists do not control every data source.
• They may point out the data errors, but it is up to the owner of the data to
fix the issue.
• If you cannot fix the data at the source, you have to handle it inside your
data model and code.
• It is always recommended to keep a copy of the original data.
• On several occasions, when you start the data cleaning process, you may
make mistakes.
• You may insert wrong values, delete an outlier that contained useful
information, or alter the data due to misinterpretation.
Correct Errors as early as possible
• If you have a copy of the original data, you can start again.
• Real-time data is manipulated at the time of arrival, so it is not
possible to keep a copy of the original data.
• In that case, you can use data from a certain timeframe for tweaking
before you start using the data.
• Cleansing an individual data set is not the most difficult task.
• However, combining the data from different sources is the real
challenge.
Combine Data from Different Sources
• In this step, we focus on integrating the data that we get from different sources.
• This data varies in size, type, and structure.
• It is available in several formats, ranging from databases and Excel files to text
documents.
• In this section, for the sake of brevity, we will focus on data in a table structure.
• In order to combine data from two different sources, you can perform two
operations: joining and appending (or stacking).
• In the joining operation, we enrich an observation from one table with
information from another table.
• In the second operation, appending or stacking, we add the observations of one
table to those of another table.
• During these operations, you can either create a new physical table or a new
virtual table, called a view. Views do not consume much disk space.
Joining Tables
• Joining tables helps you combine two corresponding observations from
two tables and enrich a single observation.
• For example, one table may contain information about the purchases
of a customer.
• Another table may contain information about the region where the
customer lives.
• You can join the tables to combine the information and use it for your
model.
Joining Tables
• In order to join tables, you need to use the columns or variables that
represent the same observation in both tables, such as customer
name, customer ID, or social security number.
• These common fields are known as keys.
• When the keys store unique, non-null information, they are called
primary keys.
• In the picture above, you can see that the Client variable is common
to both tables.
• Using the join, the region variable has been included with the
purchase information, as in the pandas sketch below.
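A minimal pandas sketch of such a join; the table and column names are hypothetical:

import pandas as pd

# Hypothetical tables sharing the key column "Client"
purchases = pd.DataFrame({"Client": ["A", "B", "C"],
                          "Amount": [120, 85, 300]})
regions = pd.DataFrame({"Client": ["A", "B", "C"],
                        "Region": ["North", "South", "East"]})

# Join on the common key to enrich the purchase information with the region
enriched = purchases.merge(regions, on="Client", how="left")
print(enriched)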
Appending Tables
• In the appending operation, you can append or stack the observations
of one table onto another table.
• In the example below, one table contains the data from January,
whereas another table contains the data from February.
• As a result of the append operation, you get a larger table with
data from January as well as February (see the sketch below).
• This operation is known as “Union” in set theory as well as in SQL.
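A minimal pandas sketch of appending (stacking) two monthly tables; the tables are hypothetical:

import pandas as pd

# Hypothetical monthly tables with identical columns
january = pd.DataFrame({"Client": ["A", "B"], "Amount": [120, 85]})
february = pd.DataFrame({"Client": ["C", "D"], "Amount": [300, 40]})

# Stack the observations, the equivalent of a UNION in SQL
yearly = pd.concat([january, february], ignore_index=True)
print(yearly)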
Using Views
• We use views to avoid the duplication of data.
• In the previous example, we took the monthly data from two tables and
combined it into a new physical table.
• Because of this, the data was duplicated and more storage space was required.
• If each table contains terabytes of data, the duplication of data would be an
issue.
• In such situations, you can use views.
• A view creates a virtual layer that combines the tables for you.
• The figure below shows how the sales data from the different months is
combined virtually into a yearly sales table instead of duplicating the data.
• Views use more processing power than a pre-calculated table because,
every time the view is queried, the join that creates the view is recreated.
Transforming Data
• Certain models need the data in a certain format.
• After the cleansing and integrating steps, you focus on transforming the
data.
• The objective of this step is to ensure that the data takes the shape
required for data modeling.
Data Transformation
• On various occasions, the relationship between the input and output
variables is not linear.
• For example, consider the relationship y = a·e^(bx).
• Taking the logarithm of both sides, log(y) = log(a) + bx, simplifies the
estimation problem significantly.
• Combining two variables can also simplify the estimation problem
(see the sketch below).
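A minimal sketch of the log transformation on synthetic data; the parameter values a = 2.0 and b = 0.8 are chosen only for illustration:

import numpy as np

# Synthetic data following y = a * exp(b * x), with a little noise
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 100)
y = 2.0 * np.exp(0.8 * x) * rng.normal(1.0, 0.05, size=x.size)

# Taking the log turns the problem into a simple linear fit: log(y) = log(a) + b*x
b, log_a = np.polyfit(x, np.log(y), deg=1)
print("estimated a:", np.exp(log_a), "estimated b:", b)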
Reduce the number of variables
• Too many input variables make a model complex and difficult to
handle.
• You need to reduce the number of variables without losing
information.
• You can do so by removing variables that do not add new information
to the model.
• You can also use special methods that reduce the number of variables
but retain the maximum amount of information.
• One such method, known as principal component analysis (PCA), is given
below.
Reduce the number of variables
• In this method, we can see that two variables account for 50.6% of the
variation within the data set (component1 = 27.8% + component2 =
22.8%).
• These variables, called “component1” and “component2,” are both
combinations of the original variables.
• They are the principal components of the underlying data structure
(a scikit-learn sketch follows below).
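A minimal scikit-learn sketch of principal component analysis on synthetic data; the data and the choice of two components are only an illustration, so the explained-variance percentages will differ from the 27.8% and 22.8% quoted above:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic data set with a few correlated variables (illustration only)
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
X[:, 1] = X[:, 0] + rng.normal(scale=0.3, size=500)   # introduce correlation

# Keep the first two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(X)

# Fraction of the total variation explained by each component
print(pca.explained_variance_ratio_)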
Dummy Variables
• Dummy variables are used for categorical variables in which the different
values have no real numerical relationship with each other.
• Dummy variables take only two values: true (1) or false (0).
• For example, if we encode the values male and female on an application
form as 0 and 1, it does not mean that male has zero influence among the
input variables and female has one.
• In such a case, we insert two dummy variables: one for male and
another for female.
• For the male dummy variable, we insert the value 1 for the rows
corresponding to male and 0 otherwise.
• For the female dummy variable, we insert the value 1 for the rows
corresponding to female and 0 otherwise (see the sketch below).
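A minimal pandas sketch of creating the two dummy variables; the column name gender is hypothetical:

import pandas as pd

# Hypothetical categorical column
df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})

# One 0/1 dummy variable per category
dummies = pd.get_dummies(df["gender"], prefix="gender", dtype=int)
print(dummies)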
Step 4: Exploratory Data Analysis
Exploratory Data Analysis (EDA)
• During exploratory data analysis (EDA), you take a deep dive into the
data.
• You can easily gain information about the data from graphs.
• Therefore, you use graphical techniques to understand the data and the
relationships among the variables.
• We explore the data in this step.
• However, we still try to find any anomalies left in the data even
after the data cleansing and transformation steps.
• If you discover anomalies in this step, you should go back and fix them before
moving to the next step.
• You can use several visualization techniques, including bar charts, line
plots, and distributions, as given below.
Exploratory Data Analysis (EDA)
• Sometimes, you can compose a composite graph from simple
graphs to gain better insights from the data.
• The graph below helps you understand the relationships among
various variables and their structure.
• You can also make animated and interactive graphs.
Exploratory Data Analysis (EDA)
• You can also overlay several plots.
• In the example below, we have combined several graphs into a Pareto
chart, or 80:20 diagram.
• The Pareto diagram is a combination of the individual values and the
cumulative distribution.
• From the diagram we can see that the first 50% of the
countries contain approximately 80% of the total amount.
• If this data represents the sales of a multinational company, we can
conclude that 80% of the sales come from 50% of the countries
(a plotting sketch follows below).
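A minimal matplotlib sketch of a Pareto chart on synthetic sales data; the data and figure layout are only an illustration:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales per country (illustration only)
rng = np.random.default_rng(3)
sales = pd.Series(rng.pareto(a=2.0, size=20) * 100).sort_values(ascending=False)
cumulative = sales.cumsum() / sales.sum() * 100

fig, ax1 = plt.subplots()
ax1.bar(range(len(sales)), sales.values)      # individual values
ax1.set_ylabel("Sales")

ax2 = ax1.twinx()                             # cumulative distribution on a second axis
ax2.plot(range(len(sales)), cumulative.values, color="red", marker="o")
ax2.set_ylabel("Cumulative share (%)")

plt.title("Pareto chart of sales per country")
plt.show()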
Exploratory Data Analysis (EDA)
• The boxplot and the histogram are among the most important graphs.
• In a histogram, a variable is cut into discrete categories, and the number
of occurrences in each category is summed up and shown in the graph.
Exploratory Data Analysis (EDA)
• The boxplot, on the other hand, doesn’t show how many observations are
present but does offer an impression of the distribution within categories.
• It can show the maximum, minimum, median, and other characterizing
measures at the same time (see the sketch below).
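A minimal matplotlib sketch of a histogram and a boxplot side by side, using synthetic data:

import numpy as np
import matplotlib.pyplot as plt

# Synthetic variable (illustration only)
rng = np.random.default_rng(4)
data = rng.normal(loc=50, scale=10, size=1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: cut the variable into bins and count the occurrences per bin
ax1.hist(data, bins=30)
ax1.set_title("Histogram")

# Boxplot: median, quartiles, whiskers, and potential outliers
ax2.boxplot(data)
ax2.set_title("Boxplot")

plt.show()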
EDA
• So far, we have covered only the visual techniques. In practice, we also
use other techniques, such as tabulation, clustering, and other
modeling techniques, in this step.
• Sometimes we also build simple models in this step.
Step 5: Data Modeling
Data Modeling
• By now, you have clean data in place and you have developed a good
understanding of it.
• Now you can focus on building models with the goal of making better
predictions, classifying objects, or achieving any other research goal.
• This step is more focused than the exploratory data analysis step
because you understand the data, the requirements, and the expected
outcomes better.
• You will be using machine learning, statistical modeling, and data
mining techniques while building the models.
• Data modeling is an iterative process.
Data Modeling
• Data modeling is an iterative process.
• You can use either classic statistical techniques or recent machine
learning and deep learning models.
• You can also use both, according to the nature of the data and the
research objectives.
• Data modeling consists of the following steps:
• Selection of a modeling technique and the variables to enter in the model
• Execution of the model
• Diagnosis and model comparison
Model and Variable Selection
• In this step, you select the variables that you want to include in your model
and the data modeling technique.
• With the help of exploratory data analysis, you should get an
understanding of the variables required for your model.
• For selecting the modeling technique, you can use your judgement based
on the data types, the variables, and the project objective.
• You need to consider the performance of the model and the project
requirements. Other factors that you need to consider are:
• Whether the model will be moved to production and, if yes, how it will be
implemented
• How the maintenance of the model will be done
• Whether the model is explainable
Model Execution
• Once you have decided upon your model, you have to write code to
implement it.
• Python has libraries such as StatsModels and Scikit-learn.
• Using these packages, you can easily and quickly model several of the
most popular techniques.
• As you can see in the following code, it’s fairly easy to use linear
regression with StatsModels or Scikit-learn.
• The following listing shows the execution of a linear prediction model.
• The linear regression model tries to fit a line while minimizing the
distance to each point.
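A minimal sketch of such a listing with StatsModels; the data is synthetic, with coefficients chosen close to the ones discussed later (0.7658 and 1.1252), so the fitted numbers will differ slightly:

import numpy as np
import statsmodels.api as sm

# Synthetic data with a known, roughly linear relationship (illustration only)
rng = np.random.default_rng(5)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 0.77 * x1 + 1.12 * x2 + rng.normal(scale=0.3, size=100)

# Add a constant term and fit an ordinary least squares model
X = sm.add_constant(np.column_stack([x1, x2]))
results = sm.OLS(y, X).fit()
print(results.summary())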
Model Execution
• The results.summary() call outputs the table given below.
Model Fit
• We use the R-squared or adjusted R-squared to analyze the model fit.
• R-squared explains the degree to which your input variables explain
the variation of your output (predicted) variable.
• So, if R-squared is 0.8, it means that 80% of the variation in the output
variable is explained by the input variables.
• In simple terms, the higher the R-squared, the more variation is
explained by your input variables and hence the better your model.
Model Fit
• However, the problem with R-squared is that it will either stay the same or
increase with the addition of more variables, even if they have no
relationship with the output variable.
• This is where adjusted R-squared comes to help.
• Adjusted R-squared penalizes you for adding variables that do not improve
your existing model.
• Hence, if you are building a linear regression on multiple variables, it is
always suggested that you use adjusted R-squared to judge the goodness of
the model (a sketch of the formula follows below).
• If you have only one input variable, R-squared and adjusted R-squared
would be exactly the same.
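A minimal sketch of the adjusted R-squared formula, adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors; the numbers below are only an illustration:

def adjusted_r_squared(r_squared, n_observations, n_predictors):
    # Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
    return 1 - (1 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Adding predictors that do not improve the fit lowers the adjusted value relative to R-squared
print(adjusted_r_squared(0.80, n_observations=100, n_predictors=2))   # about 0.796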
Model Fit
• Typically, the more non-significant variables you add to the model, the
larger the gap between R-squared and adjusted R-squared becomes.
• A model gets complex when many variables (or features) are introduced.
• You don’t need a complex model if a simple model is available, so the
adjusted R-squared punishes you for overcomplicating.
• At any rate, 0.893 is high, and it should be, because we created the data
with the relationship in mind.
• Rules of thumb exist, but for models in business, values above 0.85 are
often considered good.
• If you want to win a competition you need a value in the high 90s. For research,
however, very low model fits (even <0.2) are often found.
Predictor variables have a coefficient
• Coefficients are the numbers by which the variables in an equation
are multiplied.
• For example, in the equation y = 0.7658x1 + 1.1252x2, the
variables x1 and x2 are multiplied by 0.7658 and 1.1252,
respectively, so the coefficients are 0.7658 and 1.1252.
• The size and sign of a coefficient affect the graph of the equation. In a
simple linear equation (with only one x variable), the coefficient
is the slope of the line.
• For a linear model this is easy to interpret: in our example, if you
increase x1 by 1, y changes by 0.7658, provided x2 is held constant.
Predictor Significance
• Coefficients provide useful information about the relationship
between a predictor and the response.
• But sometimes we need evidence that the influence is really there.
• We need the p-value for this.
• If the p-value of a coefficient is less than the chosen significance level,
such as 0.05, the relationship between the predictor and the
response is statistically significant.
K-Nearest Neighbour
• We use linear regression to predict a value, but we use classification
models to classify observations.
• One of the best-known classification methods is k-nearest neighbors.
• The k-nearest neighbors model looks at labeled points that are near
an unlabeled point and tries to predict the label of the unlabeled point.
K-Nearest Neighbour
• In this case too, we constructed random correlated data, and 85%
of the cases were correctly classified. knn.score() returns the
model accuracy.
• However, to score the model, we first apply it to predict the
values:
• prediction = knn.predict(predictors)
• Now we can compare the prediction with the real values using
a confusion matrix:
• metrics.confusion_matrix(target, prediction)
• We get a 3 × 3 matrix, as given below (a full sketch follows below).
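A minimal scikit-learn sketch of the whole k-nearest neighbors workflow described above; the data is synthetic, so the accuracy and the counts in the confusion matrix will differ from the 85% and the 17 + 405 + 5 quoted in the text:

import numpy as np
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, correlated data with three classes (illustration only)
rng = np.random.default_rng(6)
target = rng.integers(0, 3, size=500)
predictors = np.column_stack([target + rng.normal(scale=0.8, size=500),
                              target + rng.normal(scale=0.8, size=500)])

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(predictors, target)

prediction = knn.predict(predictors)                   # predicted labels
print(knn.score(predictors, target))                   # fraction correctly classified
print(metrics.confusion_matrix(target, prediction))    # 3 x 3 confusion matrix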
Confusion Matrix
• The confusion matrix shows how many cases were correctly and
incorrectly classified by comparing the predictions with the real
values.
• The confusion matrix shows that we have correctly predicted
17 + 405 + 5 = 427 cases.
Model Diagnostics and Model Comparison
• In this process, you will build multiple models, from which you will
choose the best model based on multiple criteria.
• You can split the data into training and testing data.
• A fraction of the data is used to train the model.
• The remaining data, which is unseen by the model, is then used to
evaluate the performance of the model.
• Your model should work on the unseen data.
• You can use several error measures for evaluating and comparing the
performance of the models.
Model Diagnostics and Model Comparison
• Mean Square Error (MSE) is one of the most commonly used
measures. The formula for mean square error is:

MSE = (1/n) * Σ_{i=1}^{n} (Y_i - Ŷ_i)^2

• Mean square error is a simple measure.
• For every prediction, it checks how far the prediction is from the truth,
squares this error, and adds up the errors of all predictions (a small
sketch follows below).
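A minimal sketch of the MSE calculation; the numbers are only an illustration:

import numpy as np

def mean_squared_error(y_true, y_pred):
    # MSE = (1/n) * sum((Y_i - Y_hat_i)^2)
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 3.0]))   # about 0.167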
Model Diagnostics and Model Comparison
• The figure below compares the performance of two models predicting the
order size from the price.
• The first model is size = 3 × price and the second model is size = 10.
• To estimate the models, we use 800 randomly chosen observations out of
1,000 (80%), without showing the other 20% of the data to the model.
• Once the model is trained, we predict the values for the other 20% of the
observations, for which we already know the true value, and
calculate the model error with an error measure.
• Then we choose the model with the lowest error. In this example we chose
model 1 because it has the lowest total error.
Model Diagnostics and Model Comparison
• Many models make strong assumptions, such as independence of the
inputs, and you have to verify that these assumptions are indeed met.
This is called model diagnostics.
Step 6: Presenting findings and building applications on top of them
Presenting findings and building applications on top of them
• After you’ve successfully analyzed the data and built a well-performing model,
you’re ready to present your findings to the world. This is an exciting part; all your
hours of hard work have paid off and you can explain what you found to the
stakeholders.
• Sometimes people get so excited about your work that you’ll need to repeat it
over and over again, because they value the predictions of your models or the
insights that you produced. For this reason, you need to automate your models.
This doesn’t always mean that you have to redo all of your analysis all the time.
Sometimes it’s sufficient that you implement only the model scoring; other times
you might build an application that automatically updates reports, Excel
spreadsheets, or PowerPoint presentations.
• The last stage of the data science process is where your soft skills will be most
useful, and yes, they’re extremely important.
Thanks
Samatrix Consulting Pvt Ltd