
PES Institute of Technology and Management

NH-206, Sagar Road, Shivamogga - 577204

Department of Computer Science and Engineering


Affiliated to

VISVESVARAYA TECHNOLOGICAL UNIVERSITY


Jnana Sangama, Belagavi, Karnataka – 590018

Lecture Notes
on

Module 2
INTRODUCTION TO DATA SCIENCE
(21CS754)
2021 Scheme
Prepared By,
Mrs. Prathibha S,
Assistant Professor,
Department of CSE, PESITM

MODULE -2
Data Science Process

Step 1: Setting the research goal


A project starts by understanding the questions:
• What does the company expect you to do?
• Why does management place such a value on your research?
• Is it part of a bigger strategic picture or a "lone wolf" project originating from an opportunity someone detected?
The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a plan of action with a timetable, captured in a project charter.

A project charter requires teamwork, and covers at least the following:

 A clear research goal


 The project mission and context
 How you’re going to perform your analysis
 What resources you expect to use
 Proof that it’s an achievable project, or proof of concepts
 Deliverables and a measure of success
 A timeline


Step 2: Data Retrieval


The objective is to acquire all the data you need, which may be internal or external.
 Internal Data (within the company)
Data within the company can be collected from internal sources in the form of:
 Databases - storage of data
 Data warehouse - for reading and analysing data
 Data mart - a subset of a data warehouse
 Data lake - data in its natural (raw) format

 External Data (outside the company)


Governments, non-governmental organizations, and companies share their data to enrich their services and ecosystems; examples include Twitter, LinkedIn, and Facebook.
This data is helpful when you want to enrich proprietary data, and it is also convenient when training your data science skills at home.

Step 3: Data Preparation


This stage has three phases.
 Data Cleansing
 Removal of interpretation errors
 Removal of consistency errors
 Combining Data
 Joining Tables
 Appending Tables
 Using Views to simulate Joins and Appends
 Transforming Data
 Transforming Variables
 Reducing the number of variables
 Transforming variables into dummies


 Data Cleansing
Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
Two types of errors:
Interpretation errors, such as when you take a value for granted.
Example: the age of a person is recorded as greater than 300 years.

Consistency errors are inconsistencies between data sources or against standardized company values.
Example: one table records amounts in Pounds while another uses Dollars.

 Errors pointing to false values within one data set (Interpretation Errors)

Error Description                  Possible Solution
Mistakes during data entry         Manual overrules
Redundant white space              Use string functions
Impossible values                  Manual overrules
Missing values                     Remove observation or value
Outliers                           Validate and, if erroneous, treat as missing value (remove or insert)

 Errors pointing to inconsistencies between data sets (Consistency Errors)

Error Description                  Possible Solution
Deviations from a code book        Match on keys or else use manual overrules
Different units of measurement     Recalculate
Different levels of aggregation    Bring to the same level of measurement by aggregation

Data Entry Errors

 Data collection and data entry are error-prone processes.

 They often require human intervention, and because humans are only human, they make typos or lose their concentration for a second and introduce an error into the chain.

 Data collected from machines or computers isn't free from errors either.

 Errors can arise from human sloppiness, whereas others are due to machine or hardware failure.

Example:

When you have a variable that can take only two values, "Good" and "Bad", you can create a frequency table and see if those are truly the only two values present. The values "Godo" and "Bade" point out that something went wrong in at least 16 cases.

 Most errors of this type are easy to fix with simple assignment statements and if-then-else rules:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
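
A minimal sketch of the frequency-table check and fix described above, assuming pandas is available and using a hypothetical column named quality:

import pandas as pd

# Hypothetical data; in practice the column would come from the retrieved data set.
df = pd.DataFrame({"quality": ["Good", "Bad", "Godo", "Bade", "Good"]})

# Frequency table: reveals unexpected values such as "Godo" and "Bade".
print(df["quality"].value_counts())

# Fix the typos with a simple mapping, equivalent to the if-then-else rules above.
df["quality"] = df["quality"].replace({"Godo": "Good", "Bade": "Bad"})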


Redundant White Spaces

Whitespaces tend to be hard to detect but cause errors like other redundant characters would. Many programming languages provide string functions that will remove the leading and trailing whitespaces.

 For instance, in Python you can use the strip() function to remove leading and trailing spaces.
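
A small illustration of the problem and the fix, using a value made up for the example:

value = "  FR "                   # country code with redundant leading/trailing whitespace
print(value == "FR")              # False: the extra spaces make the comparison fail
print(value.strip() == "FR")      # True: strip() removes leading and trailing whitespace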

FIXING CAPITAL LETTER MISMATCHES

Capital letter mismatches are common.

 Most programming languages make a distinction between “Brazil” and “brazil”.

In this case you can solve the problem by applying a function that returns both strings in lowercase, such as .lower() in Python. "Brazil".lower() == "brazil".lower() should evaluate to True.

Impossible Values And Sanity Checks

Check the value against physically or theoretically impossible values, such as people taller than 3 meters or someone with an age of 299 years.

• Sanity checks can be directly expressed with rules: check = 0 <= age <= 120

Outliers

An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations. The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
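
A minimal sketch of the minimum/maximum check with pandas, using a small made-up series of ages:

import pandas as pd

ages = pd.Series([23, 31, 27, 45, 299, 38])      # 299 is an impossible value / outlier
print(ages.describe()[["min", "max"]])           # the maximum of 299 stands out immediately

# Validate the outlier and, if erroneous, treat it as a missing value.
ages = ages.where(ages.between(0, 120))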


Dealing with Missing Values

Missing values aren't necessarily wrong, but you still need to handle them separately. Certain modeling techniques can't handle missing values, and missing values might be an indicator that something went wrong in your data collection.

An overview of common techniques to handle missing data: omit the observations with missing values, set the value to null, impute a static value such as 0 or the mean, or model the missing value from the other variables.
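
A small sketch of two of these options with pandas, using a made-up table:

import pandas as pd

df = pd.DataFrame({"age": [25, None, 41], "income": [52000, 61000, None]})

df_dropped = df.dropna()                                 # option 1: omit observations with missing values
df_imputed = df.fillna(df.median(numeric_only=True))     # option 2: impute the column median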

Deviations From A Code Book

 Detecting errors in larger data sets against a code book or against standardized values can be done with the help of set operations.

 A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means.

(For instance, "0" equals "negative" and "5" stands for "very positive".)
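
A minimal sketch of checking data against a code book with set operations, using hypothetical encodings:

codebook_values = {"0", "1", "2", "3", "4", "5"}   # allowed encodings from the code book
observed_values = {"0", "3", "5", "9"}             # distinct values found in the data set

# Set difference: values present in the data but not described in the code book.
unknown = observed_values - codebook_values
print(unknown)                                      # {'9'} -> needs a key match or manual overrule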

Different Units Of Measurement

 When integrating two data sets, you have to pay attention to their respective units of measurement.

Example: when you study gasoline prices around the world, some data sets may contain prices per gallon and others prices per liter. A simple conversion will do the trick in this case.
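
A small sketch of such a conversion, using a hypothetical price in US dollars per gallon:

LITERS_PER_US_GALLON = 3.78541

price_per_gallon = 3.60                                      # hypothetical price (USD per gallon)
price_per_liter = price_per_gallon / LITERS_PER_US_GALLON    # recalculate to USD per liter
print(round(price_per_liter, 3))                             # approximately 0.951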

Different Levels Of Aggregation

 Having different levels of aggregation is similar to having different types of measurement.

 Example: a data set containing data per week versus one containing data per work week. This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it.
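
A minimal sketch of bringing daily data to a weekly level with pandas, using made-up sales figures:

import pandas as pd

daily = pd.DataFrame(
    {"sales": range(14)},
    index=pd.date_range("2024-01-01", periods=14, freq="D"),
)

# Summarize the daily observations to the weekly level so both data sets match.
weekly = daily.resample("W").sum()
print(weekly)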

 Combining the Data


Different operations to combine information from different data sets.

 Joining Tables

 Appending Tables

 Using Views to simulate Joins and Appends

 Joining Tables


Joining tables allows you to combine the information of one observation found in one table with the information that you find in another table. The focus is on enriching a single observation.
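
A minimal sketch of a join with pandas, using hypothetical customer and region tables that share a key:

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Ann", "Raj"]})
regions = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# Join on the shared key to enrich each customer observation with its region.
enriched = customers.merge(regions, on="customer_id", how="left")
print(enriched)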

 Appending Tables
Appending or stacking tables is effectively adding observations from one table to
another table.
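
A minimal sketch of appending (stacking) tables with pandas, assuming two monthly tables with identical columns:

import pandas as pd

jan = pd.DataFrame({"product": ["A", "B"], "sales": [10, 12]})
feb = pd.DataFrame({"product": ["A", "B"], "sales": [11, 15]})

# Stack the observations of February underneath those of January.
combined = pd.concat([jan, feb], ignore_index=True)
print(combined)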

 Using Views To Simulate Data Joins And Appends


To avoid duplication of data, you virtually combine data with views. In the previous
example we took the monthly data and combined it in a new physical table. The problem is
that we duplicated the data and therefore needed more storage space. A view behaves as if
you're working on a table, but this table is nothing but a virtual layer that combines the tables.


 Combining the Data

 Enriching Aggregated Measures


Data enrichment can also be done by adding calculated information to the table, such
as the number of sales or what percentage of total stock has been sold in a certain
region. Extra measures such as these can add perspective.

Growth, sales by product class, and rank sales are examples of derived and aggregate
measures.
 Transforming Variables
Transforming the input variables can greatly simplify the estimation problem.

Example: a relationship of the form y = a·e^(b·x).


Taking the logarithm of y linearizes the relationship, ln(y) = ln(a) + b·x, which simplifies the estimation problem dramatically.
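
A small sketch of this linearization with NumPy, using synthetic data generated for the example with a = 2 and b = 0.5:

import numpy as np

# Synthetic data following y = a * exp(b * x), with a = 2 and b = 0.5.
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x)

# After taking the log, the relationship is linear: ln(y) = ln(a) + b * x.
b_est, ln_a_est = np.polyfit(x, np.log(y), deg=1)
print(b_est, np.exp(ln_a_est))    # approximately 0.5 and 2.0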


 Reducing the number of variables

Having too many variables in your model makes the model difficult to handle, and certain
techniques don’t perform well when you overload them with too many input variables.
Data scientists use special methods to reduce the number of variables but retain the
maximum amount of data.
Example:
Principal Component Analysis (PCA) is a well-known dimension reduction technique. It transforms the variables into a new set of variables called principal components. PCA is well suited for multidimensional data.
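
A minimal sketch of PCA with scikit-learn, reducing a hypothetical four-variable data set to two principal components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # hypothetical data: 100 observations, 4 variables

pca = PCA(n_components=2)                # keep only the first two principal components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)     # variation captured by each component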

 Turning Variables Into Dummies


Dummy variables can only take two values: true (1) or false (0). They're used to indicate the absence or presence of a categorical effect that may explain the observation.
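
A minimal sketch of turning a categorical variable into dummies with pandas, using a hypothetical weekday column:

import pandas as pd

df = pd.DataFrame({"sales": [10, 14, 9], "weekday": ["Mon", "Tue", "Mon"]})

# One dummy column per category: 1 when the category applies, 0 otherwise.
dummies = pd.get_dummies(df["weekday"], prefix="day", dtype=int)
df = pd.concat([df.drop(columns="weekday"), dummies], axis=1)
print(df)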

Step 4: Exploratory Data Analysis


 The Exploratory Data Analysis (EDA) phase is where data scientists dive deep into
the dataset, unravelling its patterns, trends, and characteristics. This phase employs
statistical analysis and visualisation techniques to gain insights that will inform
subsequent modelling decisions.


 Information becomes much easier to grasp when shown in a picture, therefore you
mainly use graphical techniques to gain an understanding of your data and the
interactions between variables.
 The visualization techniques you use in this phase range from simple line graphs or histograms to more complex diagrams, for example the chart types listed below (a short plotting sketch follows the list).

 Barchart

 Lineplot


 Distribution Plot

 Multiple Plots can help you understand the structure of your data over
multiple variables.

 Link and brush allows you to select observations in one plot and highlight the
same observations in the other plots.


 Histogram

 Boxplot: shows the distribution of a variable per category, for example the appreciation each user category has for a certain picture on a photography website.
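
A small sketch of a few of the plot types listed above with Matplotlib, using synthetic data created for the example:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
values = rng.normal(loc=50, scale=10, size=200)    # synthetic numeric variable

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(np.sort(values))                      # line plot
axes[1].hist(values, bins=20)                      # histogram / distribution plot
axes[2].boxplot(values)                            # boxplot
plt.show()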

Step 5: Building the Model


Building a model is an iterative process. The way you build your model depends on
whether you go with classic statistics or the somewhat more recent machine learning
school, and the type of technique you want to use.
Phases of Model Building
 Model and Variable selection
 Model Execution
 Model diagnostics and model comparison

 Model and Variable selection


 Picking the correct algorithm is crucial, contingent on the issue, data type, and goal.
Options include classification, regression, clustering, and deep learning.
 Select the variables you want to include in your model and a modeling technique; the exploratory analysis should already give you an idea of what variables will help you construct a good model.
 Many modeling techniques are available, and choosing the right model for a problem requires judgment on your part. One needs to consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
 Must the model be moved to a production environment and, if so, would it be
easy to implement?
 How difficult is the maintenance on the model: how long will it remain
relevant if left untouched?
 Does the model need to be easy to explain?

 Model Execution
Most programming languages, such as Python, already have libraries such as
StatsModels or Scikit-learn. Coding a model is a nontrivial task in most cases, so having
these libraries available can speed up the process.
The model execution phase has three parts:
 Model Fit


 Predictor variables have a coefficient


 Predictor significance

 Model Fit
 For model fit, the R-squared or adjusted R-squared is used.
 This measure is an indication of the amount of variation in the data that
gets captured by the model.
 The difference between the adjusted R-squared and the R-squared is usually minimal, because the adjusted R-squared is the normal R-squared plus a penalty for model complexity.
 In research, however, very low model fits (even below 0.2) are often found.

 Predictor variables have a coefficient


 For a linear model this is easy to interpret: in the example, if you add 1 to x1, y changes by 0.7658.
 Detecting influences is more important in scientific studies than perfectly
fitting models.
 If, for instance, you determine that a certain gene is significant as a
cause for cancer, this is important knowledge, even if that gene in
itself doesn’t determine whether a person will get cancer.
 Predictor significance
 The significance of a predictor is usually judged from the p-value of its coefficient; a common rule of thumb is that a predictor is significant when p < 0.05.
 We compared the prediction with the real values, true, but we never predicted based on fresh data.
 The prediction was done using the same data as the data used to build the model.
 This is all fine and dandy to make yourself feel good, but it gives you no indication of whether your model will work when it encounters truly new data.
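
A minimal sketch of model execution with StatsModels, fitting an ordinary least squares model on synthetic data so the model fit, coefficients, and significance discussed above can be inspected:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                        # two hypothetical predictors
y = 0.7658 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

X = sm.add_constant(X)                               # add the intercept term
results = sm.OLS(y, X).fit()

print(results.rsquared, results.rsquared_adj)        # model fit
print(results.params)                                # predictor coefficients
print(results.pvalues)                               # predictor significance (p-values)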
 Model diagnostics and model comparison
 Working with a holdout sample helps you pick the best-performing model.


 A holdout sample is a part of the data you leave out of the model building so
it can be used to evaluate the model afterward.
 The principle here is simple: the model should work on unseen data.
 Choose the model with the lowest error.
 Many models make strong assumptions, such as independence of the inputs, and you have to verify that these assumptions are indeed met. This is called model diagnostics.
 Model evaluation is not a one-time task; it is an iterative process. If the model
falls short of expectations, data scientists go back to previous stages, adjust
parameters, or even reconsider the algorithm choice. This iterative
refinement is crucial for achieving optimal model performance.
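
A small sketch of model comparison with a holdout sample, using scikit-learn and synthetic data; the model with the lowest error on the unseen holdout data would be preferred:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Keep 30% of the data out of model building as the holdout sample.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_train, y_train)
    error = mean_squared_error(y_hold, model.predict(X_hold))
    print(type(model).__name__, error)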

Step 6: Presentation and Automation


 Sometimes it is sufficient to implement only the model scoring; other times you might build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations.

 The last stage of the data science process is where your soft skills will be most useful,
and yes, they’re extremely important.

MORE INFORMATION ON DATA SCIENCE


1. Data Collection and Cleaning: Python’s pandas library is
instrumental in handling data at this stage. It provides
powerful data structures for efficient data manipulation and
analysis. Additionally, tools like NumPy complement pandas
for numerical operations.

2. Exploratory Data Analysis (EDA): Libraries such as
Matplotlib and Seaborn make visualising data straightforward.
Jupyter Notebooks, an interactive computing environment,
further enhance the exploratory data analysis process,
allowing for a step-by-step examination of the data.

3. Feature Engineering: Python provides a host of libraries for


feature engineering, including Scikit-learn and Feature-engine.
These tools assist in transforming raw data into a format
suitable for model training.

4. Model Building and Training: Scikit-learn stands out as a


comprehensive machine learning library in Python. Its
simplicity and extensive documentation make it a go-to choice
for implementing various machine learning algorithms.

5. Model Evaluation and Fine-Tuning: Python offers tools such


as Scikit-learn and Statsmodels for model evaluation. Grid
search and randomised search techniques help fine-tune
model hyperparameters, optimising performance.

6. Model Deployment: Flask and Django, popular web


frameworks in Python, facilitate the deployment of machine
learning models as web services. This integration ensures
seamless interaction between models and applications.
7. Communication of Results: Python's visualisation libraries,
coupled with Jupyter Notebooks, enable data scientists to
create compelling visualisations and narratives, making it
easier to communicate findings to diverse audiences.

EXAMPLE OF A PROJECT UNDER THE DATA SCIENCE PROCESS

Step 1: Problem Identification

Public health officials identify high rates of lead poisoning in children in their jurisdiction, but current practice only remediates issues in homes after a child has tested positive for elevated blood lead levels. They would like to reduce lead poisoning in children by proactively identifying children who may be at risk before poisoning occurs.

Step 2: Data Collection


A scoping session is held including public health officials, clinicians, lead hazard inspection teams, and data scientists to understand the data available and how risk scores would be put into use. Because of the need to work with private health information and data pertaining to children, the decision is made to restrict all analytical work to the Department of Public Health's secure server environment. The primary intervention is identified as lead hazard inspections in homes with a high risk of lead hazards and the presence of a child younger than 12 months. The key goal identified in the scoping phase was to effectively reduce childhood lead poisoning in an equitable manner across underserved communities.

The Department of Public Health provides a database and server for analysis in their environment with an extract of individual-level blood lead test results as well as inspection reports from lead hazard inspections. Data from additional sources are imported into the environment, including census data, childhood nutrition benefit program data (to identify potentially vulnerable children), and information about buildings from the county assessor website. Address normalization and geocoding allow data to be linked across these sources, and data scientists work closely with the owner of each data source to ensure they understand the data structures and fields.

The data scientists use a combination of descriptive statistics, bivariate correlations, and spatial and temporal analysis to begin to understand the relationships in the data and its limitations. Missing values in the childhood nutrition benefit data set identify an error in the extract, transform, load process that is corrected with a new data extract, while a sharp decrease in the number of blood tests in data older than 17 years reflects a change in policy around testing that defines a limitation of the historical training data.

Drawing on what they learned in exploring the data, the data scientists work with the public health officials to formulate a classification problem at the address level using blood lead levels above a specific level as a training label. Monthly risk scores will be produced for every house with a child younger than 12 months to correspond with the planning cycle of the department's housing inspection team, and evaluated on the basis of precision (positive predictive value) among the top 250 highest risk addresses, consistent with their monthly capacity for lead inspections, as well as the representativeness of underserved communities in the results.

The data scientists run a grid of thousands of model specifications, including several families of classifiers and hyperparameters. Based on its ability to both achieve high precision in the top 250 and balance false omission rates across race and socioeconomic status, a random forest model was chosen to test in a field trial.

A 1-year field trial was developed, during which a random 50% of the 250 highest risk addresses were inspected for the presence of lead each month, and remediated where hazards were found. The trial confirmed the performance of the model in identifying children at risk of poisoning because of the presence of lead in their homes as well as its representativeness across communities.

Although the number of households with lead issues remediated was too small to have a significant impact on the number of children diagnosed with lead poisoning during the trial period, calculations suggested that deploying the model could appreciably impact lead poisoning over the following decade. The Department of Public Health decided to move forward with putting it into practice, committing resources to maintain and periodically refresh and reevaluate the model.
