Module 2 Data Science
Lecture Notes
on
Module 2
INTRODUCTION TO DATA SCIENCE
(21CS754)
2021 Scheme
Prepared By,
Mrs. Prathibha S,
Assistant Professor,
Department of CSE, PESITM
MODULE 2
Data Science Process
Data Cleansing
Data cleansing is a sub process of the data science process that focuses on removing
errors in your data so your data becomes a true and consistent representation of the
processes it originates from.
Two types of errors:
Interpretation errors, such as taking a value for granted without checking it.
Example: the age of a person is greater than 300 years.
Errors pointing to inconsistencies, either false values within one data set or mismatches between data sets.
Typical fixes: check values against a code book, match records on keys, use manual overrules, or aggregate the data to a common level.
Data entry and collection often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain.
Data collected from machines or computers isn’t free from errors either.
Some errors arise from human sloppiness, whereas others are due to machine or hardware failure.
Example:
When you have a variable that can take only two values: “Good” and “Bad”, you can
create a frequency table and see if those are truly the only two values present.
The values “Godo” and “Bade” point out that something went wrong in at least 16 cases.
Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:
if x == "Godo":
    x = "Good"
elif x == "Bade":
    x = "Bad"
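The frequency-table check described above can be sketched with only the standard library; the sample values below are invented for illustration:

```python
from collections import Counter

# Hypothetical sample: a variable that should contain only "Good" and
# "Bad", but data-entry typos slipped in.
values = ["Good", "Bad", "Godo", "Good", "Bade", "Bad", "Good"]

# A frequency table reveals the unexpected categories.
freq = Counter(values)

# Simple assignment rules fix the known typos.
corrections = {"Godo": "Good", "Bade": "Bad"}
cleaned = [corrections.get(v, v) for v in values]
```

After the correction step, a new frequency table contains only the two intended categories.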
Whitespaces tend to be hard to detect but cause errors like other redundant characters would. Many programming languages provide string functions that will remove leading and trailing whitespace; in Python, strip() does this.
Capital-letter mismatches are common as well; in this case you can solve the problem by applying a function that returns both strings in lowercase, such as lower().
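A minimal sketch of both fixes, using invented example strings:

```python
# Hypothetical examples: redundant whitespace and capital-letter
# mismatches make equal-looking values compare as different.
a = "  Brazil "
b = "brazil"

assert a != b                   # raw strings do not match
assert a.strip() != b           # whitespace removed, case still differs
assert a.strip().lower() == b   # both fixes applied: values now match
```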
Sanity checks: check the value against physically or theoretically impossible values, such as people older than 300 years.
Outliers
An outlier is an observation that seems distant from the other observations or, more precisely, that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
Missing values aren’t necessarily wrong, but you still need to handle them separately. Common techniques are omitting the observations, setting the value to null, imputing a static value such as 0 or the mean, or adding an indicator variable that flags whether the value was missing.
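Mean imputation, one common way to handle missing values, can be sketched as follows; the data set is invented for illustration:

```python
# Hypothetical sample with missing values encoded as None.
heights = [170, None, 165, 180, None, 175]

# Compute the mean over the observed values only.
observed = [h for h in heights if h is not None]
mean = sum(observed) / len(observed)   # (170+165+180+175)/4 = 172.5

# Replace each missing value with that mean.
imputed = [mean if h is None else h for h in heights]
```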
Deviations from a code book
Detecting errors in larger data sets against a code book or against standardized values can be done with set operations. A code book is a description of your data, a form of metadata; it specifies what each value of a variable means (for instance, “0” equals “negative”, “5” stands for “very positive”).
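Checking observed values against a code book with set operations can be sketched as follows; the codes and observed values are invented:

```python
# Hypothetical code book: the only valid codes for a sentiment variable,
# where "0" means "negative" and "5" means "very positive".
code_book = {"0", "1", "2", "3", "4", "5"}

# Values actually found in the data set.
observed = {"0", "2", "5", "9", "x"}

# The set difference flags every value the code book does not allow.
invalid = observed - code_book
```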
Different units of measurement
When integrating two data sets, you have to pay attention to their respective units of measurement.
Example: when you study the prices of gasoline in the world, some data sets contain prices per gallon and others contain prices per liter. A simple conversion will do the trick here.
Different levels of aggregation
Example: a data set containing data per week versus one containing data per work week. This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it.
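The gasoline example can be sketched as a simple unit conversion before merging the data sets:

```python
# Exact US definition: one US gallon is 3.785411784 liters.
LITERS_PER_US_GALLON = 3.785411784

def per_gallon_to_per_liter(price_per_gallon):
    """Convert a price per US gallon to a price per liter."""
    return price_per_gallon / LITERS_PER_US_GALLON
```

Converting every price to the same unit before the merge avoids mixing incomparable numbers in one column.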
Joining Tables
Appending Tables
Joining Tables
Joining tables allows you to combine the information of one observation found in one table
with the information that you find in another table. The focus is on enriching a single
observation.
Appending Tables
Appending or stacking tables is effectively adding observations from one table to
another table.
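Both operations can be sketched with plain dictionaries instead of a database or pandas; the tables and field names below are invented for illustration:

```python
# Two hypothetical tables sharing the key "id".
clients = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
regions = [{"id": 1, "region": "North"}, {"id": 2, "region": "South"}]

# Joining: enrich each client observation with its region via the key.
region_by_id = {r["id"]: r["region"] for r in regions}
joined = [{**c, "region": region_by_id[c["id"]]} for c in clients]

# Appending: stack observations of two tables with the same columns.
new_clients = [{"id": 3, "name": "Cy"}]
appended = clients + new_clients
```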
Growth, sales by product class, and rank sales are examples of derived and aggregate
measures.
Transforming the input variables (for example, taking the logarithm) can greatly simplify the estimation problem.
Having too many variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input variables.
Data scientists use special methods to reduce the number of variables while retaining the maximum amount of information.
Example:
Principal Component Analysis (PCA) is a well-known dimension reduction technique. It transforms the variables into a new set of uncorrelated variables called principal components, ordered so that the first few components capture most of the variance in the data.
PCA is well suited for multidimensional data.
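A minimal PCA sketch via eigen-decomposition of the covariance matrix, using NumPy; the toy data set (two nearly collinear variables) is invented for illustration:

```python
import numpy as np

# Toy data: the second variable is almost a multiple of the first,
# so a single principal component should explain nearly all variance.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=100)])

centered = data - data.mean(axis=0)          # center each variable
cov = np.cov(centered, rowvar=False)         # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues

order = np.argsort(eigvals)[::-1]            # sort descending
components = eigvecs[:, order]               # principal components
explained = eigvals[order] / eigvals.sum()   # variance explained by each
```

In practice you would use a library implementation such as scikit-learn's PCA rather than coding this by hand.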
Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
The visualization techniques you use in this phase range from simple line graphs and histograms to more complex multi-plot views:
Bar chart
Line plot
Distribution plot
Multiple Plots can help you understand the structure of your data over
multiple variables.
Link and brush allows you to select observations in one plot and highlight the
same observations in the other plots.
Histogram: shows how many observations fall into each of a number of value ranges (bins).
Boxplot: shows, for each user category, the distribution of the appreciation each has for a certain picture on a photography website.
Building a model is an iterative process. The way you build your model depends on
whether you go with classic statistics or the somewhat more recent machine learning
school, and the type of technique you want to use.
Phases of Model Building
Model and Variable selection
Model Execution
Model diagnostics and model comparison
Model Execution
Most programming languages, such as Python, already have libraries such as
StatsModels or Scikit-learn. Coding a model is a nontrivial task in most cases, so having
these libraries available can speed up the process.
In the model execution phase you evaluate, among other things, how well the model fits the data.
Model Fit
For model fit, the R-squared or adjusted R-squared is used.
This measure is an indication of the amount of variation in the data that
gets captured by the model.
The difference between the adjusted R-squared and the R-squared is minimal here because the adjusted R-squared is the normal one with a penalty for model complexity: adding variables lowers it unless they genuinely improve the fit.
In research, however, very low model fits (even below 0.2) are often found.
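Both measures can be computed from scratch; the observed values and model predictions below are invented for illustration:

```python
# Toy data: observed values and predictions from some fitted model
# with p = 1 predictor.
y     = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.0]
p = 1
n = len(y)

mean_y = sum(y) / n
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
ss_tot = sum((yi - mean_y) ** 2 for yi in y)              # total sum of squares

r2 = 1 - ss_res / ss_tot
# Adjusted R-squared: R-squared with a penalty for model complexity.
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

The adjusted value is always at most the plain R-squared; the gap grows as more predictors are added.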
A holdout sample is a part of the data you leave out of the model building so it can be used to evaluate the model afterward.
The principle here is simple: the model should work on unseen data.
Choose the model with the lowest error on the holdout sample.
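A minimal holdout sketch: fit a trivial "model" (the training mean) on one part of the data and score it on the part that was held out; the data set is invented for illustration:

```python
# Hypothetical observations; the last third is held out for evaluation.
data = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
train, holdout = data[:4], data[4:]

# The "model": always predict the mean of the training data.
prediction = sum(train) / len(train)

# Mean absolute error on unseen data tells us how the model generalizes;
# the model with the lowest holdout error would be chosen.
mae = sum(abs(v - prediction) for v in holdout) / len(holdout)
```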
Many models make strong assumptions, such as independence of the inputs, and you have to verify that these assumptions are indeed met. This is called model diagnostics.
Model evaluation is not a one-time task; it is an iterative process. If the model
falls short of expectations, data scientists go back to previous stages, adjust
parameters, or even reconsider the algorithm choice. This iterative
refinement is crucial for achieving optimal model performance.
The last stage of the data science process, presenting your findings, is where your soft skills will be most useful, and yes, they’re extremely important.