FDS UNIT 1 Part2


Step 1: Defining research goals and creating a project charter
2.2.1 Spend time understanding the goals and context of your research
• State the purpose of your research in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
2.2.2 Create a project charter
• A project charter gives a good understanding of the business problem and a formal agreement on the deliverables.
• For any significant project this is mandatory.
Step 2: Retrieving data
• Data can be stored in many forms, ranging from simple text files to tables in a database.
• The objective now is acquiring all the data you need.
2.3.1 Start with data stored within the company
Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories and is maintained by a team of IT professionals, such as:
• Databases - the primary goal of a database is data storage.
• Data warehouses - designed for reading and analyzing the data.
• Data marts - a data mart is a subset of the data warehouse, geared toward serving a specific business unit.
• Data warehouses and data marts are home to preprocessed data.
• Data lakes - contain data in its natural or raw format.

• But the possibility exists that your data still resides in Excel files on the desktop of a domain expert.
Step 2: Retrieving data
2.3.1 Start with data stored within the company
• Finding data even within your own company can be a challenge.
• As companies grow, their data becomes scattered around many places.
• Organizations understand the value and sensitivity of data.
• They set up policies to govern access to the data.
• These policies translate into physical and digital barriers called Chinese walls.
• These "walls" are mandatory and well-regulated for customer data in most countries.
Step 2: Retrieving data

2.3.2 Don't be afraid to shop around
If the data you need isn't available inside your organization, look outside it; other companies, government bodies, and open-data providers can supply high-quality data sets.
2.3.3 Do data quality checks now to prevent problems later
During data retrieval, you check to see if the data is equal to the data in the source document and
look to see if you have the right data types. This shouldn’t take too long; when you
have enough evidence that the data is similar to the data you find in the source document,
you stop.
With data preparation, you do a more elaborate check. If you did a good job during the previous
phase, the errors you find now are also present in the source document. The focus is on the
content of the variables: you want to get rid of
typos and other data entry errors and bring the data to a common standard among the data sets.
For example, you might correct USQ to USA and United Kingdom to UK.
During the exploratory phase your focus shifts to what you can learn from the data. Now
you assume the data to be clean and look at the statistical properties such as distributions,
correlations, and outliers.
Step 3: Cleansing, integrating, and transforming data

3.1 Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
By "true and consistent representation" we imply that at least two types of errors exist:
• Interpretation errors - a value in your data is taken for granted, for example a person's age greater than 300 years.
• Inconsistencies - between data sources or against your company's standardized values, for example "Female" in one table and "F" in another when they represent the same thing (the person is female), or Pounds in one table and Dollars in another.
Step 3: Cleansing, integrating, and transforming data
Advanced methods, such as simple modeling, can be used to find and identify data errors; diagnostic plots can be especially insightful.
Step 3: Cleansing, integrating, and transforming data

A. DATA ENTRY ERRORS
Data collection and data entry are error-prone processes. They often require human intervention, and humans make typos or lose their concentration for a second and introduce an error into the chain. Simple data entry errors can be fixed with explicit rules, for example:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
Step 3: Cleansing, integrating, and transforming data

B. REDUNDANT WHITESPACE
Redundant whitespace can cause a mismatch of keys such as "FR " - "FR", dropping the observations that couldn't be matched. In Python you can use the strip() method to remove leading and trailing spaces.

i) FIXING CAPITAL LETTER MISMATCHES
Capital letter mismatches are common, e.g. "Brazil" and "brazil". Comparing the lowercased values avoids this: "Brazil".lower() == "brazil".lower() should result in True.

C. IMPOSSIBLE VALUES AND SANITY CHECKS
Sanity checks compare the value against physically or theoretically impossible values, for example:
check = 0 <= age <= 120
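A minimal sketch of these checks with pandas (the DataFrame, column names, and values below are invented for illustration):

import pandas as pd

# Hypothetical data showing the issues described above
df = pd.DataFrame({"country": ["FR ", "brazil", "Brazil"],
                   "age": [25, 42, 310]})

# Remove leading/trailing whitespace and normalize capitalization
df["country"] = df["country"].str.strip().str.lower()

# Sanity check: flag physically impossible ages
df["age_ok"] = df["age"].between(0, 120)
print(df)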
Step 3: Cleansing, integrating, and transforming data

D. OUTLIERS
• An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
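A minimal sketch of both approaches, assuming pandas and matplotlib are available (the price column and its values are invented):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data in which one value follows a different generative process
df = pd.DataFrame({"price": [10, 12, 11, 13, 9, 250]})

# Table with minimum, maximum, and other summary statistics
print(df["price"].describe())

# A simple plot often makes the outlier stand out immediately
df["price"].plot(kind="box")
plt.show()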
Step 3: Cleansing, integrating, and transforming data

F. DEALING WITH MISSING VALUES
• Missing values aren't necessarily wrong, but they still need to be handled separately, because certain modelling techniques cannot cope with them.
• They can also be an indicator that something went wrong during data collection or that an error happened in the ETL process.
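A hedged sketch of common ways to handle them with pandas (the income column and its values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000, np.nan, 62000, np.nan, 48000]})

print(df["income"].isna().sum())                      # count the missing values first

df["income_missing"] = df["income"].isna()            # keep an indicator column
df_drop = df.dropna(subset=["income"])                # option 1: drop the incomplete rows
df_fill = df.fillna({"income": df["income"].mean()})  # option 2: impute the mean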
Step 3: Cleansing, integrating, and transforming data

G. DEVIATIONS FROM A CODE BOOK
Detecting errors in larger data sets against a code book or against standardized values can be done with the help of set operations.
A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. For instance, "0" equals "negative" and "5" stands for "very positive".
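One way to express such a check with Python set operations (the code-book values and observed codes below are assumed for illustration):

# Codes allowed by the (hypothetical) code book: 0 = negative ... 5 = very positive
codebook_values = {0, 1, 2, 3, 4, 5}

# Codes actually observed in the data set
observed_values = {0, 2, 5, 7, 9}

# Set difference: anything observed that the code book does not describe
deviations = observed_values - codebook_values
print(deviations)   # the codes 7 and 9 deviate from the code book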
Step 3: Cleansing, integrating, and transforming data

H. DIFFERENT UNITS OF MEASUREMENT
When integrating two data sets, you have to pay attention to their respective units of measurement. For example, some data sets can contain prices per gallon while others contain prices per liter.
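A minimal sketch of harmonizing the units before combining such tables (table and column names are invented; 1 US gallon is about 3.785 liters):

import pandas as pd

# Hypothetical tables: one priced per gallon, one priced per liter
per_gallon = pd.DataFrame({"fuel": ["diesel"], "price_per_gallon": [4.20]})
per_liter = pd.DataFrame({"fuel": ["petrol"], "price_per_liter": [1.05]})

LITERS_PER_GALLON = 3.785
per_gallon["price_per_liter"] = per_gallon["price_per_gallon"] / LITERS_PER_GALLON

combined = pd.concat([per_gallon[["fuel", "price_per_liter"]],
                      per_liter[["fuel", "price_per_liter"]]],
                     ignore_index=True)
print(combined)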

I. DIFFERENT LEVELS OF AGGREGATION
Having different levels of aggregation is similar to having different types of measurement. This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it.
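A minimal sketch of summarizing the more detailed data set to a coarser level with pandas (daily sales aggregated to monthly totals; the data is synthetic):

import pandas as pd

# Hypothetical daily sales that need to be aggregated to the monthly level
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)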
Step 3: Cleansing, integrating, and transforming data

3.2 Correct errors as early as possible
A good practice is to mediate data errors as early as possible in the data collection chain and to fix as little as possible inside your program while fixing the origin of the problem.
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

a. THE DIFFERENT WAYS OF COMBINING DATA: JOINING TABLES

To join tables, you use variables that represent the same object in both tables, such as a date, a country name, or a Social Security number. These common fields are known as keys; when they also uniquely define the records in a table they are called primary keys.
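A minimal sketch of such a join with pandas, using the country name as the key (the tables and columns are invented for illustration):

import pandas as pd

clients = pd.DataFrame({"country": ["FR", "DE", "US"],
                        "clients": [120, 200, 310]})
revenue = pd.DataFrame({"country": ["FR", "DE", "US"],
                        "revenue": [1.2, 2.5, 4.1]})

# Join the two tables on their common key column
combined = clients.merge(revenue, on="country", how="inner")
print(combined)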
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

b. THE DIFFERENT WAYS OF COMBINING DATA: APPENDING TABLES

Appending or stacking tables is effectively adding observations from one table to another table.
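With pandas, appending observations is a concatenation along the row axis, as in this sketch (the monthly tables are invented):

import pandas as pd

sales_jan = pd.DataFrame({"region": ["North", "South"], "sales": [100, 80]})
sales_feb = pd.DataFrame({"region": ["North", "South"], "sales": [110, 95]})

# Stack February's observations under January's
all_sales = pd.concat([sales_jan, sales_feb], ignore_index=True)
print(all_sales)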
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

c. USING VIEWS TO SIMULATE DATA JOINS AND APPENDS

• To avoid duplication of data, you can virtually combine data with views.
• In the previous example we took the monthly data and combined it in a new physical table. The problem is that we duplicated the data and therefore needed more storage space.
• A view behaves like a real table, but the underlying data is not copied, so no extra storage is needed.
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

d. ENRICHING AGGREGATED MEASURES

Data enrichment can also be done by adding calculated information to the table, such as the total number of sales or what percentage of total stock has been sold in a certain region.
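A hedged sketch of adding such calculated columns with pandas (the sales table below is made up):

import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South"],
                      "product": ["A", "B", "A"],
                      "sold": [50, 30, 20]})

# Total sales per region, repeated on every row of that region
sales["region_total"] = sales.groupby("region")["sold"].transform("sum")

# Each sale as a percentage of its region's total
sales["pct_of_region"] = 100 * sales["sold"] / sales["region_total"]
print(sales)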
Step 3: Cleansing, integrating, and transforming data

3.4 Transforming data

• Certain models require their data to be in a certain shape.
• So transform your data so it takes a suitable form for data modeling.

Methods
1. TRANSFORMING DATA
2. REDUCING THE NUMBER OF VARIABLES
3. TURNING VARIABLES INTO DUMMIES

1. TRANSFORMING DATA
Example: taking the log of the independent variables simplifies the estimation problem dramatically, as in the sketch below.
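For example, a relationship of the form y = a·x^b becomes a straight line after taking logs. A minimal sketch with NumPy, using synthetic data:

import numpy as np

# Synthetic data following y = 2 * x**1.5
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 2 * x ** 1.5

# After the log transform the relationship is linear, which is much easier to estimate
log_x, log_y = np.log(x), np.log(y)
slope, intercept = np.polyfit(log_x, log_y, 1)
print(slope, intercept)   # roughly 1.5 and log(2)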
Step 3: Cleansing, integrating, and transforming data

3.4 Transforming data

2. REDUCING THE NUMBER OF VARIABLES
Data scientists use special methods to reduce the number of variables while retaining the maximum amount of information.

3. TURNING VARIABLES INTO DUMMIES

Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0).

Example: turning one column named Weekdays into the columns Monday through Sunday. For each observation you put a 1 in the column of the day it falls on (for instance Monday) and a 0 in all the other day columns, as in the sketch below.
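A minimal sketch with pandas' get_dummies (the weekday column and its values are invented):

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Sunday", "Monday", "Friday"]})

# One column per weekday: 1 if the observation fell on that day, 0 otherwise
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)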
Step 4: Exploratory data analysis
• Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
• The visualization techniques you use in this phase range from simple line graphs and histograms to more complex diagrams.
• Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into the data.
Step 4: Exploratory data analysis

Brushing and linking
Combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
Step 4: Exploratory data analysis

Histogram
In a histogram a variable is cut into discrete categories, and the number of occurrences in each category is summed up and shown in the graph.
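A minimal matplotlib sketch of such a histogram (the values are randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=1000)

# Cut the variable into discrete bins and count the occurrences per bin
plt.hist(values, bins=20)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()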
Step 4: Exploratory data analysis

Boxplot
• The boxplot doesn't show how many observations are present, but it offers an impression of the distribution within categories.
• It can show the maximum, minimum, median, and other characterizing measures at the same time.
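A minimal matplotlib sketch of boxplots per category (the two groups are synthetic):

import numpy as np
import matplotlib.pyplot as plt

group_a = np.random.normal(50, 5, 200)
group_b = np.random.normal(60, 15, 200)

# One box per category: median, quartiles, whiskers, and outliers at a glance
plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Group A", "Group B"])
plt.show()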
Step 5: Build the models
3 Steps to build a model
•Model and variable selection
•Model execution
•Model diagnostics and model comparison
Step 5: Build the models
1. Model and variable selection
• Select the variables of the model and a modelling technique.
• Choose the right model for the problem.
• Consider model performance and whether the project meets all the requirements to use the model.
Other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to implement?
■ How much maintenance will the model need: how long will it remain relevant if left untouched?
■ Does the model need to be easy to explain?
Step 5: Build the models
2. Model execution
• Python already has libraries such as StatsModels and Scikit-learn.
• These packages implement several of the most popular techniques.
• Example: linear regression with StatsModels or Scikit-learn.
Step 5: Build the models – Linear Regression

>>> import numpy as np                                       # Import packages and classes
>>> from sklearn.linear_model import LinearRegression

>>> x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))   # Provide data (features must be 2-D)
>>> y = np.array([5, 20, 14, 32, 22, 38])

>>> model = LinearRegression()                               # Create a model and fit it
>>> model.fit(x, y)
>>> # or, instead of the last two statements: model = LinearRegression().fit(x, y)

>>> y_pred = model.predict(x)
>>> print(f"predicted response:\n{y_pred}")
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]

>>> r_sq = model.score(x, y)
>>> print(f"coefficient of determination: {r_sq}")
coefficient of determination: 0.7158756137479542

Model fitting is a measurement of how well a machine learning model adapts to data that is similar to the data on which it was trained.
Step 5: Build the models – Linear Regression
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()
Step 5: Build the models – KNN
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Provide data (features must be 2-D for scikit-learn)
X = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])   # each distinct value is treated as a class label here

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train the k-nearest-neighbours classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on both sets and report the training accuracy
y_pred_train = knn.predict(X_train)
predictions = knn.predict(X_test)
train_accuracy = accuracy_score(y_train, y_pred_train)
print("Training Accuracy:", train_accuracy * 100)
Step 6: Presenting findings and building applications on top of them

Build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations.
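As a hedged sketch, a small script that refreshes an Excel report with pandas might look like this (the file, column, and sheet names are hypothetical, and an Excel writer such as openpyxl is assumed to be installed):

import pandas as pd

# Pull the latest figures (here read from a CSV; in practice often a database query)
latest = pd.read_csv("sales_latest.csv")

summary = latest.groupby("region")["sales"].sum().reset_index()

# Overwrite the spreadsheet the business users open every morning
summary.to_excel("sales_report.xlsx", sheet_name="summary", index=False)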
Data Process Summary
Total Information Awareness
Terrorist attack of Sept. 11, 2001: four people enrolled in different flight schools to learn how to pilot commercial aircraft.
The information needed to predict and foil the attack was available in data, but the data was not examined and the suspicious events were not detected.
The response was a program called TIA, or Total Information Awareness.
TIA was intended to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information, in order to track terrorist activity.
Bonferroni's Principle
Suppose you have a certain amount of data and you look for events of a certain type within it. Such events can occur even when the data is completely random, and the number of these occurrences grows as the size of the data grows. These occurrences are "bogus": unusual features that look significant but aren't.
Bonferroni correction
A theorem of statistics that provides a statistically sound way to avoid most of these bogus positive responses to a search through the data.
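As a brief illustration (standard statistics, not taken from these slides): if m hypotheses are tested and an overall false-positive level α is desired, the Bonferroni correction holds each individual test to a stricter threshold,

\alpha_{\text{per test}} = \frac{\alpha}{m},

so that, by the union bound, the probability of at least one bogus positive across all m tests is at most α.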
Bonferroni's Principle
The principle helps avoid treating random occurrences as if they were real: calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.
