FDS UNIT 1 Part2


Step 1: Defining research goals and creating a project charter
2.2.1 Spend time understanding the goals and context of your research
• State the purpose of your research in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
2.2.2 Create a project charter
• A project charter gives a good understanding of the business problem and a formal agreement on the deliverables.
• For any significant project this is mandatory.
Step 2: Retrieving data
• Data can be stored in many forms, ranging from simple text files to tables in a database.
• The objective now is acquiring all the data you need.
2.3.1 Start with data stored within the company
Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories and is maintained by a team of IT professionals, such as:
• Databases - the primary goal of a database is data storage.
• Data warehouses - designed for reading and analyzing the data.
• Data marts - a data mart is a subset of the data warehouse, geared toward serving a specific business unit.
• Data warehouses and data marts are home to preprocessed data.
• Data lakes - contain data in its natural or raw format.

• But the possibility exists that your data still resides in Excel files on the desktop of a domain expert.
Step 2: Retrieving data
2.3.1 Start with data stored within the company
• Finding data even within your own company can be a challenge.
• As companies grow, their data becomes scattered around many places.
• Organizations understand the value and sensitivity of data.
• They set up policies to govern access to the data.
• These policies translate into physical and digital barriers called Chinese walls.
• These "walls" are mandatory and well-regulated for customer data in most countries.
Step 2: Retrieving data

2.3.2 Don't be afraid to shop around
If the data you need isn't available inside your organization, look outside it; other companies, government bodies, and open-data providers can supply high-quality data sets.
2.3.3 Do data quality checks now to prevent problems later
During data retrieval, you check to see if the data is equal to the data in the source document and
look to see if you have the right data types. This shouldn’t take too long; when you
have enough evidence that the data is similar to the data you find in the source document,
you stop.
With data preparation, you do a more elaborate check. If you did a good job during the previous
phase, the errors you find now are also present in the source document. The focus is on the
content of the variables: you want to get rid of
typos and other data entry errors and bring the data to a common standard among the data sets.
For example, you might correct USQ to USA and United Kingdom to UK.
During the exploratory phase your focus shifts to what you can learn from the data. Now
you assume the data to be clean and look at the statistical properties such as distributions,
correlations, and outliers.
Step 3: Cleansing, integrating, and transforming data

3.1 Data cleansing is a subprocess of the data science process that focuses on removing errors in your data so your data becomes a true and consistent representation of the processes it originates from.
By "true and consistent representation" we imply that at least two types of errors exist:
• Interpretation errors - a value in your data is taken for granted, for example a person's age greater than 300 years.
• Inconsistencies - between data sources or against your company's standardized values, for example "Female" in one table and "F" in another when they represent the same thing (the person is female), or Pounds in one table and Dollars in another.
Step 3: Cleansing, integrating, and transforming data
Advanced methods, such as simple modeling, can be used to find and identify data errors; diagnostic plots can be especially insightful.
Step 3: Cleansing, integrating, and transforming data

A. DATA ENTRY ERRORS
Data collection and data entry are error-prone processes. They often require human intervention, and humans make typos or lose their concentration for a second and introduce an error into the chain. Simple data entry errors can be fixed with explicit rules, for example:

if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
Step 3: Cleansing, integrating, and transforming data

B. REDUNDANT WHITESPACE
Redundant whitespace can cause a mismatch of keys such as "FR " - "FR", dropping the observations that couldn't be matched. In Python you can use the strip() method to remove leading and trailing spaces.

i) FIXING CAPITAL LETTER MISMATCHES
Capital letter mismatches are common, e.g. "Brazil" and "brazil". Comparing the lowercased values avoids this: "Brazil".lower() == "brazil".lower() should result in True.

C. IMPOSSIBLE VALUES AND SANITY CHECKS
Sanity checks compare the value against physically or theoretically impossible values, for example:
check = 0 <= age <= 120
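A minimal sketch of these checks with pandas (the DataFrame, column names, and values below are invented for illustration):

import pandas as pd

# Hypothetical data showing the issues described above
df = pd.DataFrame({"country": ["FR ", "brazil", "Brazil"],
                   "age": [25, 42, 310]})

# Remove leading/trailing whitespace and normalize capitalization
df["country"] = df["country"].str.strip().str.lower()

# Sanity check: flag physically impossible ages
df["age_ok"] = df["age"].between(0, 120)
print(df)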
Step 3: Cleansing, integrating, and transforming data

D. OUTLIERS
• An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
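A minimal sketch of both approaches, assuming pandas and matplotlib are available (the price column and its values are invented):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data in which one value follows a different generative process
df = pd.DataFrame({"price": [10, 12, 11, 13, 9, 250]})

# Table with minimum, maximum, and other summary statistics
print(df["price"].describe())

# A simple plot often makes the outlier stand out immediately
df["price"].plot(kind="box")
plt.show()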
Step 3: Cleansing, integrating, and transforming data

F. DEALING WITH MISSING VALUES
• Missing values aren't necessarily wrong, but they still need to be handled separately, because certain modelling techniques cannot cope with them.
• They can also be an indicator that something went wrong during data collection or that an error happened in the ETL process.
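A hedged sketch of common ways to handle them with pandas (the income column and its values are made up):

import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50000, np.nan, 62000, np.nan, 48000]})

print(df["income"].isna().sum())                      # count the missing values first

df["income_missing"] = df["income"].isna()            # keep an indicator column
df_drop = df.dropna(subset=["income"])                # option 1: drop the incomplete rows
df_fill = df.fillna({"income": df["income"].mean()})  # option 2: impute the mean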
Step 3: Cleansing, integrating, and transforming data

G. DEVIATIONS FROM A CODE BOOK
Detecting errors in larger data sets against a code book or against standardized values can be done with the help of set operations.
A code book is a description of your data, a form of metadata. It contains things such as the number of variables per observation, the number of observations, and what each encoding within a variable means. For instance, "0" equals "negative" and "5" stands for "very positive".
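One way to express such a check with Python set operations (the code-book values and observed codes below are assumed for illustration):

# Codes allowed by the (hypothetical) code book: 0 = negative ... 5 = very positive
codebook_values = {0, 1, 2, 3, 4, 5}

# Codes actually observed in the data set
observed_values = {0, 2, 5, 7, 9}

# Set difference: anything observed that the code book does not describe
deviations = observed_values - codebook_values
print(deviations)   # the codes 7 and 9 deviate from the code book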
Step 3: Cleansing, integrating, and transforming data

H. DIFFERENT UNITS OF MEASUREMENT
When integrating two data sets, you have to pay attention to their respective units of measurement. For example, some data sets can contain prices per gallon while others contain prices per liter.
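A minimal sketch of harmonizing the units before combining such tables (table and column names are invented; 1 US gallon is about 3.785 liters):

import pandas as pd

# Hypothetical tables: one priced per gallon, one priced per liter
per_gallon = pd.DataFrame({"fuel": ["diesel"], "price_per_gallon": [4.20]})
per_liter = pd.DataFrame({"fuel": ["petrol"], "price_per_liter": [1.05]})

LITERS_PER_GALLON = 3.785
per_gallon["price_per_liter"] = per_gallon["price_per_gallon"] / LITERS_PER_GALLON

combined = pd.concat([per_gallon[["fuel", "price_per_liter"]],
                      per_liter[["fuel", "price_per_liter"]]],
                     ignore_index=True)
print(combined)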

I. DIFFERENT LEVELS OF AGGREGATION
Having different levels of aggregation is similar to having different types of measurement. This type of error is generally easy to detect, and summarizing (or the inverse, expanding) the data sets will fix it.
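A minimal sketch of summarizing the more detailed data set to a coarser level with pandas (daily sales aggregated to monthly totals; the data is synthetic):

import pandas as pd

# Hypothetical daily sales that need to be aggregated to the monthly level
daily = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "sales": range(90),
})

monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)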
Step 3: Cleansing, integrating, and transforming data

3.2 Correct errors as early as possible
A good practice is to mediate data errors as early as possible in the data collection chain and to fix as little as possible inside your program while fixing the origin of the problem.
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

a. THE DIFFERENT WAYS OF COMBINING DATA: JOINING TABLES

To join tables, you use variables that represent the same object in both tables, such as a date, a country name, or a Social Security number. These common fields are known as keys; when they also uniquely define the records in a table they are called primary keys.
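A minimal sketch of such a join with pandas, using the country name as the key (the tables and columns are invented for illustration):

import pandas as pd

clients = pd.DataFrame({"country": ["FR", "DE", "US"],
                        "clients": [120, 200, 310]})
revenue = pd.DataFrame({"country": ["FR", "DE", "US"],
                        "revenue": [1.2, 2.5, 4.1]})

# Join the two tables on their common key column
combined = clients.merge(revenue, on="country", how="inner")
print(combined)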
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

b. THE DIFFERENT WAYS OF COMBINING DATA: APPENDING TABLES

Appending or stacking tables is effectively adding observations from one table to another table.
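With pandas, appending observations is a concatenation along the row axis, as in this sketch (the monthly tables are invented):

import pandas as pd

sales_jan = pd.DataFrame({"region": ["North", "South"], "sales": [100, 80]})
sales_feb = pd.DataFrame({"region": ["North", "South"], "sales": [110, 95]})

# Stack February's observations under January's
all_sales = pd.concat([sales_jan, sales_feb], ignore_index=True)
print(all_sales)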
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

c. USING VIEWS TO SIMULATE DATA JOINS AND APPENDS

• To avoid duplication of data, you can virtually combine data with views.
• In the previous example we took the monthly data and combined it in a new physical table. The problem is that we duplicated the data and therefore needed more storage space.
• A view behaves like a real table, but the underlying data is not copied, so no extra storage is needed.
Step 3: Cleansing, integrating, and transforming data

3.3 Combining data from different data sources

d. ENRICHING AGGREGATED MEASURES

Data enrichment can also be done by adding calculated information to the table, such as the total number of sales or what percentage of total stock has been sold in a certain region.
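A hedged sketch of adding such calculated columns with pandas (the sales table below is made up):

import pandas as pd

sales = pd.DataFrame({"region": ["North", "North", "South"],
                      "product": ["A", "B", "A"],
                      "sold": [50, 30, 20]})

# Total sales per region, repeated on every row of that region
sales["region_total"] = sales.groupby("region")["sold"].transform("sum")

# Each sale as a percentage of its region's total
sales["pct_of_region"] = 100 * sales["sold"] / sales["region_total"]
print(sales)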
Step 3: Cleansing, integrating, and transforming data

3.4 Transforming data

• Certain models require their data to be in a certain shape.
• So transform your data so it takes a suitable form for data modeling.

Methods
1. TRANSFORMING DATA
2. REDUCING THE NUMBER OF VARIABLES
3. TURNING VARIABLES INTO DUMMIES

1. TRANSFORMING DATA
Example: taking the log of the independent variables simplifies the estimation problem dramatically, as in the sketch below.
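For example, a relationship of the form y = a·x^b becomes a straight line after taking logs. A minimal sketch with NumPy, using synthetic data:

import numpy as np

# Synthetic data following y = 2 * x**1.5
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 2 * x ** 1.5

# After the log transform the relationship is linear, which is much easier to estimate
log_x, log_y = np.log(x), np.log(y)
slope, intercept = np.polyfit(log_x, log_y, 1)
print(slope, intercept)   # roughly 1.5 and log(2)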
Step 3: Cleansing, integrating, and transforming data

3.4 Transforming data

2. REDUCING THE NUMBER OF VARIABLES
Data scientists use special methods to reduce the number of variables while retaining the maximum amount of information.

3. TURNING VARIABLES INTO DUMMIES

Variables can be turned into dummy variables. Dummy variables can only take two values: true (1) or false (0).

Example: turning one column named Weekdays into the columns Monday through Sunday. For each observation you put a 1 in the column of the day it falls on (for instance Monday) and a 0 in all the other day columns, as in the sketch below.
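A minimal sketch with pandas' get_dummies (the weekday column and its values are invented):

import pandas as pd

df = pd.DataFrame({"weekday": ["Monday", "Sunday", "Monday", "Friday"]})

# One column per weekday: 1 if the observation fell on that day, 0 otherwise
dummies = pd.get_dummies(df["weekday"], dtype=int)
print(dummies)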
Step 4: Exploratory data analysis
• Information becomes much easier to grasp when shown in a picture; therefore you mainly use graphical techniques to gain an understanding of your data and the interactions between variables.
• The visualization techniques you use in this phase range from simple line graphs and histograms to more complex diagrams.
• Sometimes it's useful to compose a composite graph from simple graphs to get even more insight into the data.
Step 4: Exploratory data analysis

Brushing and linking
Combine and link different graphs and tables (or views) so changes in one graph are automatically transferred to the other graphs.
Step 4: Exploratory data analysis

Histogram
In a histogram a variable is cut into discrete categories, and the number of occurrences in each category is summed up and shown in the graph.
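A minimal matplotlib sketch of such a histogram (the values are randomly generated for illustration):

import numpy as np
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=1000)

# Cut the variable into discrete bins and count the occurrences per bin
plt.hist(values, bins=20)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()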
Step 4: Exploratory data analysis

Boxplot
• The boxplot doesn't show how many observations are present, but it offers an impression of the distribution within categories.
• It can show the maximum, minimum, median, and other characterizing measures at the same time.
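A minimal matplotlib sketch of boxplots per category (the two groups are synthetic):

import numpy as np
import matplotlib.pyplot as plt

group_a = np.random.normal(50, 5, 200)
group_b = np.random.normal(60, 15, 200)

# One box per category: median, quartiles, whiskers, and outliers at a glance
plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["Group A", "Group B"])
plt.show()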
Step 5: Build the models
3 Steps to build a model
•Model and variable selection
•Model execution
•Model diagnostics and model comparison
Step 5: Build the models
1. Model and variable selection
• Select the variables of the model and a modelling technique.
• Choose the right model for the problem.
• Consider model performance and whether the project meets all the requirements to use the model.
Other factors:
■ Must the model be moved to a production environment and, if so, would it be easy to implement?
■ How much maintenance will the model need: how long will it remain relevant if left untouched?
■ Does the model need to be easy to explain?
Step 5: Build the models
2. Model execution
• Python already has libraries such as StatsModels and Scikit-learn.
• These packages implement several of the most popular techniques.
• Example: linear regression with StatsModels or Scikit-learn.
Step 5: Build the models – Linear Regression

>>> import numpy as np                                       # Import packages and classes
>>> from sklearn.linear_model import LinearRegression

>>> x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))   # Provide data (features must be 2-D)
>>> y = np.array([5, 20, 14, 32, 22, 38])

>>> model = LinearRegression()                               # Create a model and fit it
>>> model.fit(x, y)
>>> # or, instead of the last two statements: model = LinearRegression().fit(x, y)

>>> y_pred = model.predict(x)
>>> print(f"predicted response:\n{y_pred}")
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]

>>> r_sq = model.score(x, y)
>>> print(f"coefficient of determination: {r_sq}")
coefficient of determination: 0.7158756137479542

Model fitting is a measurement of how well a machine learning model adapts to data that is similar to the data on which it was trained.
Step 5: Build the models – Linear Regression
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()
Step 5: Build the models – KNN
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Provide data (features must be 2-D for scikit-learn)
X = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])   # each distinct value is treated as a class label here

# Hold out part of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Create and train the k-nearest-neighbours classifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict on both sets and report the training accuracy
y_pred_train = knn.predict(X_train)
predictions = knn.predict(X_test)
train_accuracy = accuracy_score(y_train, y_pred_train)
print("Training Accuracy:", train_accuracy * 100)
Step 6: Presenting findings and building applications on top of them

Build an application that automatically updates reports, Excel spreadsheets, or PowerPoint presentations.
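As a hedged sketch, a small script that refreshes an Excel report with pandas might look like this (the file, column, and sheet names are hypothetical, and an Excel writer such as openpyxl is assumed to be installed):

import pandas as pd

# Pull the latest figures (here read from a CSV; in practice often a database query)
latest = pd.read_csv("sales_latest.csv")

summary = latest.groupby("region")["sales"].sum().reset_index()

# Overwrite the spreadsheet the business users open every morning
summary.to_excel("sales_report.xlsx", sheet_name="summary", index=False)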
Data Process Summary
Total Information Awareness
Terrorist attack of Sept. 11, 2001: four people enrolled in different flight schools to learn how to pilot commercial aircraft.
The information needed to predict and foil the attack was available in data, but the data was not examined and the suspicious events were not detected.
The response was a program called TIA, or Total Information Awareness.
TIA was intended to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information, in order to track terrorist activity.
Bonferroni's Principle
Suppose you have a certain amount of data and you look for events of a certain type within it. Such events can occur even when the data is completely random, and the number of these occurrences grows as the size of the data grows. These occurrences are "bogus": unusual features that look significant but aren't.
Bonferroni correction
A theorem of statistics that provides a statistically sound way to avoid most of these bogus positive responses to a search through the data.
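As a brief illustration (standard statistics, not taken from these slides): if m hypotheses are tested and an overall false-positive level α is desired, the Bonferroni correction holds each individual test to a stricter threshold,

\alpha_{\text{per test}} = \frac{\alpha}{m},

so that, by the union bound, the probability of at least one bogus positive across all m tests is at most α.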
Bonferroni's Principle
The principle helps avoid treating random occurrences as if they were real: calculate the expected number of occurrences of the events you are looking for, on the assumption that the data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.
