FDS UNIT 1 Part2
Step 1: Defining research goals and creating a project charter
2.2.1 Spend time understanding the goals and context of your research
• State the purpose of your research in a clear and focused manner.
• Understanding the business goals and context is critical for project success.
2.2.2 Create a project charter
A project charter gives a good understanding of the business problem and a formal agreement on the deliverables. For any significant project this would be mandatory.
Step 2: Retrieving data
• Data can be stored in many forms, ranging from simple text files to tables in a database.
• The objective now is acquiring all the data you need.
2.3.1 Start with data stored within the company
Most companies have a program for maintaining key data, so much of the cleaning work may already be done. This data can be stored in official data repositories that are maintained by a team of IT professionals, such as:
• Databases – the primary goal of a database is data storage.
• Data warehouses – designed for reading and analyzing that data.
• Data marts – a data mart is a subset of the data warehouse and geared toward serving a specific business unit.
• Data warehouses and data marts are home to preprocessed data.
• Data lakes – a data lake contains data in its natural or raw format.
• But the possibility exists that your data still resides in Excel files on the desktop of a domain expert.
Step 2: Retrieving data
2.3.1 Start with data stored within the company
• Finding data even within your own company can be a challenge.
• When companies grow, their data becomes scattered around many places
• Organizations understand the value and sensitivity of data.
• They put policies in place to regulate access to data.
• These policies translate into physical and digital barriers called Chinese walls.
• These “walls” are mandatory and well-regulated for customer data in most countries.
Step 3: Cleansing, integrating, and transforming data
3.1 Data cleansing is a subprocess of the data science process that focuses on removing
errors in your data so your data becomes a true and consistent representation of the
processes it originates from.
By “true and consistent representation” we imply that at least two types of errors exist:
• Interpretation error – a value in your data is taken for granted, e.g. a person’s age is greater than 300 years.
• Consistency error – inconsistencies between data sources or against standardized values, e.g. “Female” in one table and “F” in another.
A. DATA ENTRY ERRORS
Data collection and data entry are error-prone processes. They often require human intervention, and people make typos or lose their concentration for a second and introduce an error into the chain.
# Correct known data entry typos (illustrative values)
if x == "Godo":
    x = "Good"
if x == "Bade":
    x = "Bad"
Step 3: Cleansing, integrating, and transforming data
B. REDUNDANT WHITESPACE
Redundant whitespace can cause a mismatch of keys such as “FR ” – “FR”, dropping the observations that couldn’t be matched. In Python you can use the strip() method to remove leading and trailing spaces.
i) FIXING CAPITAL LETTER MISMATCHES
Capital letter mismatches are common, e.g. “Brazil” and “brazil”. Comparing the lowercased values solves this: “Brazil”.lower() == “brazil”.lower() results in True.
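A minimal sketch of both fixes (stripping whitespace and lowercasing), assuming plain Python strings; the variable name country_code is hypothetical:

country_code = " FR "
country_code = country_code.strip()             # remove leading and trailing whitespace -> "FR"
print(country_code == "FR")                     # True: the keys match again
print("Brazil".lower() == "brazil".lower())     # True: compare values in a single case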
C. IMPOSSIBLE VALUES AND SANITY CHECKS
Check the value against physically or theoretically impossible values; sanity checks can be expressed directly as rules, for example:
check = 0 <= age <= 120
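A minimal sketch of the same check applied to a whole column, assuming pandas and a hypothetical DataFrame df with an age column:

import pandas as pd

df = pd.DataFrame({"age": [25, 340, 58, -3]})   # 340 and -3 are impossible values
valid = df["age"].between(0, 120)               # same rule as the check above
print(df[~valid])                               # rows that fail the sanity check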
Step 3: Cleansing, integrating, and transforming data
D. OUTLIERS
• An outlier is an observation that seems to be distant from other observations or, more specifically, one observation that follows a different logic or generative process than the other observations.
• The easiest way to find outliers is to use a plot or a table with the minimum and maximum values.
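A minimal sketch of both approaches, assuming NumPy and matplotlib with made-up data containing one injected outlier:

import numpy as np
import matplotlib.pyplot as plt

values = np.append(np.random.normal(0, 1, 100), 8.5)   # 8.5 is the injected outlier
print(values.min(), values.max())                       # the maximum stands far from the rest
plt.boxplot(values)                                     # points beyond the whiskers are suspects
plt.show()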
Step 3: Cleansing, integrating, and transforming data
To join tables, you use variables that represent the same object in both tables, such as a date, a country name, or a Social Security number. These common fields are known as keys; they often serve as primary keys in a database.
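A minimal sketch of such a join, assuming pandas; the table and column names (clients, sales, client_id) are hypothetical:

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2, 3], "name": ["Ann", "Bob", "Eve"]})
sales = pd.DataFrame({"client_id": [1, 1, 3], "amount": [100, 250, 80]})
joined = clients.merge(sales, on="client_id", how="left")   # client_id is the key
print(joined)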
Step 3: Cleansing, integrating, and transforming data
Appending or stacking tables is effectively adding observations from one table to another table
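A minimal sketch of appending tables, assuming pandas and two hypothetical monthly tables with the same columns:

import pandas as pd

jan = pd.DataFrame({"region": ["EU", "US"], "sales": [10, 20]})
feb = pd.DataFrame({"region": ["EU", "US"], "sales": [15, 25]})
stacked = pd.concat([jan, feb], ignore_index=True)   # observations of feb stacked below jan
print(stacked)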
Step 3: Cleansing, integrating, and transforming data
Data enrichment can also be done by adding calculated information to the table, such
as the total number of sales or what percentage of total stock has been sold in a certain
region
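A minimal sketch of adding calculated information, assuming pandas and hypothetical region and stock_sold columns:

import pandas as pd

stock = pd.DataFrame({"region": ["EU", "US", "ASIA"], "stock_sold": [120, 300, 180]})
stock["pct_of_total"] = 100 * stock["stock_sold"] / stock["stock_sold"].sum()   # calculated column
print(stock)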
Step 3: Cleansing, integrating, and transforming data
Methods
1. TRANSFORMING DATA
2. REDUCING THE NUMBER OF VARIABLES
3. TURNING VARIABLES INTO DUMMIES
1. TRANSFORMING DATA
Example: taking the log of the independent variables can simplify the estimation problem dramatically, because a nonlinear (e.g. exponential) relationship between x and y becomes approximately linear after the transform.
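A minimal sketch of such a transform, assuming NumPy and a made-up, strictly positive independent variable x:

import numpy as np

x = np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
x_log = np.log(x)   # an exponential relationship with y becomes roughly linear in x_log
print(x_log)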
Step 5: Build the models – Linear Regression
Model fitting is a measurement of how well a machine learning model adapts to data that is similar to the data on which it was trained.

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))   # provide data
>>> y = np.array([5, 20, 14, 32, 22, 38])
>>> model = LinearRegression()    # create a model
>>> model.fit(x, y)               # and fit it
>>> # or, instead of the last two statements: model = LinearRegression().fit(x, y)
>>> r_sq = model.score(x, y)      # coefficient of determination (R²)
>>> print(f"coefficient of determination: {r_sq}")
coefficient of determination: 0.7158756137479542
Step 5: Build the models – Linear Regression
import matplotlib.pyplot as plt

# Scatter plot to visually inspect the relationship between x and y
x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]
plt.scatter(x, y)
plt.show()
Step 5: Build the models – KNN
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Provide data: features must be 2-D; the class labels here are illustrative
X_train = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([10, 50]).reshape((-1, 1))
knn = KNeighborsClassifier(n_neighbors=3)     # classify by the 3 nearest neighbours
knn.fit(X_train, y_train)                     # train the classifier
y_pred_train = knn.predict(X_train)           # predictions on the training set
predictions = knn.predict(X_test)             # predictions on unseen data
train_accuracy = accuracy_score(y_train, y_pred_train)
print("Training Accuracy:", train_accuracy * 100)
Step 6: Presenting findings and building applications on top of them