The Data Science Process

The document describes the standard data science process, which involves determining the business problem, obtaining and exploring data, cleaning and preprocessing the data, selecting machine learning algorithms, training models and evaluating their performance, and deploying the best model. It provides a Python example of applying linear regression to predict housing prices. The standard processes of CRISP-DM, SEMMA, and KDD are also overviewed, as well as the machine learning canvas framework.

PHUONG NGUYEN

THE DATA SCIENCE PROCESS


HOW to embed Machine Learning into business
CONTENT
1. A SIMPLE EXAMPLE IN PYTHON

2. STANDARD DATA SCIENCE PROCESSES

3. MACHINE LEARNING CANVAS


DATA SCIENCE PROCESS

https://github.com/nnbphuong/datascience4biz/blob/master/Overview_of_the_Data_Science_Process.ipynb
THE DATA SCIENCE PROCESS
1. DETERMINE THE PURPOSE


2. OBTAIN THE DATA


import pandas as pd

# Load data
housing_df = pd.read_csv('WestRoxbury.csv')
housing_df.shape   # find the dimensions of the data frame
housing_df.head()  # show the first five rows
print(housing_df)  # show all the data

# Rename columns: replace spaces with '_'

housing_df = housing_df.rename(
    columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
housing_df.columns = [s.strip().replace(' ', '_')
                      for s in housing_df.columns]  # all columns

# Show first four rows of the data


housing_df.loc[0:3] # loc[a:b] gives rows a to b, inclusive
housing_df.iloc[0:4] # iloc[a:b] gives rows a to b-1
# Different ways of showing the first 10
# values in column TOTAL_VALUE

housing_df['TOTAL_VALUE'].iloc[0:10]
housing_df.iloc[0:10]['TOTAL_VALUE']
housing_df.iloc[0:10].TOTAL_VALUE
# use dot notation if the column name has no spaces

# Show the fifth row of the first 10 columns


housing_df.iloc[4][0:10]
housing_df.iloc[4, 0:10]
housing_df.iloc[4:5, 0:10]
# use a slice to return a data frame
# Use pd.concat to combine non-consecutive columns into a
# new data frame. Axis argument specifies dimension along
# which concatenation happens, 0=rows, 1=columns.
pd.concat([housing_df.iloc[4:6,0:2],
housing_df.iloc[4:6,4:6]], axis=1)

# To specify a full column, use:

housing_df.iloc[:, 0:1]
housing_df.TOTAL_VALUE

# show the first 10 rows of the first column


housing_df['TOTAL_VALUE'][0:10]
# Descriptive statistics

# show length of first column


print('Number of rows ', len(housing_df['TOTAL_VALUE']))

# show mean of column


print('Mean of TOTAL_VALUE ',
housing_df['TOTAL_VALUE'].mean())

# show summary statistics for each column


housing_df.describe()
# random sample of 5 observations
housing_df.sample(5)

# oversample houses with over 10 rooms


weights = [0.9 if rooms > 10 else 0.01
for rooms in housing_df.ROOMS]
housing_df.sample(5, weights=weights)
3. EXPLORE, CLEAN, AND PRE-PROCESS THE DATA





housing_df.columns # print a list of variables

Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT',


'YR_BUILT', 'GROSS_AREA','LIVING_AREA',
'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH',
'HALF_BATH','KITCHEN', 'FIREPLACE',
'REMODEL'], dtype='object')


HANDLING VARIABLES







# REMODEL needs to be converted to a categorical variable
housing_df.REMODEL = housing_df.REMODEL.astype('category')
housing_df.REMODEL.cat.categories  # show the categories
housing_df.REMODEL.dtype  # check type of converted variable

# use drop_first=True to drop the first dummy variable


housing_df = pd.get_dummies(housing_df,
prefix_sep='_', drop_first=True)
housing_df.columns
housing_df.loc[:,'REMODEL_Old':'REMODEL_Recent'].head(5)

['None', 'Old', 'Recent']

REMODEL_Old REMODEL_Recent
0 0 0
1 0 1
2 0 0
3 0 0
4 0 0
DETECTING OUTLIERS


housing_df.plot.scatter(x='ROOMS', y='FLOORS', legend=False)
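The scatter plot flags outliers visually; a simple numeric rule is also common, such as the 1.5 × IQR fence. A minimal sketch on a toy series standing in for a column such as ROOMS (the housing data is not loaded here):

```python
import pandas as pd

# Toy stand-in for a numeric column such as ROOMS
s = pd.Series([5, 6, 7, 6, 5, 8, 7, 6, 30])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [30]
```

Unlike a z-score rule, the IQR fence is not inflated by the outlier itself, which matters on small samples like this one.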
HANDLING MISSING DATA







import numpy as np

# To illustrate missing data procedures, we first convert a few
# entries for BEDROOMS to NA's. Then we impute these missing
# values using the median of the remaining values.

missingRows = housing_df.sample(10).index
housing_df.loc[missingRows, 'BEDROOMS'] = np.nan
print('Number of rows with valid BEDROOMS values after setting to NAN:',
      housing_df['BEDROOMS'].count())

medianBedrooms = housing_df['BEDROOMS'].median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(value=medianBedrooms)
print('Number of rows with valid BEDROOMS values after filling NA values:',
      housing_df['BEDROOMS'].count())
NORMALIZING/RESCALING DATA



# Normalizing a data frame (z-scores)
norm_df = (housing_df - housing_df.mean()) / housing_df.std()

# Rescaling a data frame to the [0, 1] range

norm_df = ((housing_df - housing_df.min()) /
           (housing_df.max() - housing_df.min()))
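The same two transformations are available through scikit-learn's scalers; a sketch on a toy frame (not the housing data). One caveat: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas .std() defaults to the sample version (ddof=1), so the z-scores differ slightly.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})

# z-score normalization (population std, ddof=0)
norm = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# min-max rescaling to [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled['a'].tolist())  # [0.0, 0.5, 1.0]
```

A fitted scaler can also be reused to transform validation data with the training set's statistics, which the plain pandas expressions above do not do.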
4. REDUCE THE DATA DIMENSION
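The deck gives no code for this step. One common approach is principal component analysis (PCA); the sketch below uses scikit-learn on synthetic data, since applying it to the housing frame is only assumed here, not shown in the source:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for correlated numeric housing columns
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X[:, 5] = X[:, 0] + 0.1 * rng.normal(size=100)  # make column 5 track column 0

# Standardize first so each variable contributes equally
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)  # fewer than the original 6 columns
```

On the housing data, the same two calls would be applied to the numeric predictor columns before modeling.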


5. DETERMINE THE DATA SCIENCE TASK


6. PARTITION THE DATA

from sklearn.model_selection import train_test_split

# set random_state for reproducibility

# training (60%) and validation (40%)

trainData, validData = train_test_split(housing_df,
    test_size=0.40, random_state=1)

# produces Training: 3481 Validation: 2321

# training (50%), validation (30%), and test (20%)


trainData, temp = train_test_split(housing_df, test_size=0.5, random_state=1)
# now split temp into validation and test
validData, testData = train_test_split(temp, test_size=0.4, random_state=1)

# produces Training: 2901 Validation: 1741 Test: 1160


7. CHOOSE THE TECHNIQUES

8. PERFORM THE TASK

from sklearn.linear_model import LinearRegression

# create list of predictors and outcome
excludeColumns = ('TOTAL_VALUE', 'TAX')
predictors = [s for s in housing_df.columns
              if s not in excludeColumns]
outcome = 'TOTAL_VALUE'

# partition data
X = housing_df[predictors]
y = housing_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.4, random_state=1)

model = LinearRegression()
model.fit(train_X, train_y)
train_pred = model.predict(train_X)
train_results = pd.DataFrame({
'TOTAL_VALUE': train_y,
'predicted': train_pred,
'residual': train_y - train_pred
})

# show sample of predictions


train_results.head()

TOTAL_VALUE predicted residual


2024 392.0 387.726258 4.273742
5140 476.3 430.785540 45.514460
5259 367.4 384.042952 -16.642952
421 350.3 369.005551 -18.705551
1401 348.1 314.725722 33.374278
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
'TOTAL_VALUE': valid_y,
'predicted': valid_pred,
'residual': valid_y - valid_pred
})

valid_results.head()

TOTAL_VALUE predicted residual


1822 462.0 406.946377 55.053623
1998 370.4 362.888928 7.511072
5126 407.4 390.287208 17.112792
808 316.1 382.470203 -66.370203
4034 393.2 434.334998 -41.134998
# import the utility function regressionSummary
from dmba import regressionSummary

# training set
regressionSummary(train_results.TOTAL_VALUE,
train_results.predicted)

# validation set
regressionSummary(valid_results.TOTAL_VALUE,
valid_results.predicted)
OUTPUT
Regression statistics (training)

Mean Error (ME) : -0.0000


Root Mean Squared Error (RMSE) : 43.0306
Mean Absolute Error (MAE) : 32.6042
Mean Percentage Error (MPE) : -1.1116
Mean Absolute Percentage Error (MAPE) : 8.4886

Regression statistics (validation)

Mean Error (ME) : -0.1463


Root Mean Squared Error (RMSE) : 42.7292
Mean Absolute Error (MAE) : 31.9663
Mean Percentage Error (MPE) : -1.0884
Mean Absolute Percentage Error (MAPE) : 8.3283
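The regressionSummary utility comes from the dmba package, which may not be installed; the same five statistics can be computed directly with NumPy. A sketch, using the same actual-minus-predicted residual convention as above (the toy numbers are illustrative, not from the housing data):

```python
import numpy as np

def regression_stats(y_true, y_pred):
    # ME, RMSE, MAE, MPE, MAPE, matching dmba.regressionSummary's report
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred  # residual = actual - predicted
    return {
        'ME': err.mean(),
        'RMSE': np.sqrt((err ** 2).mean()),
        'MAE': np.abs(err).mean(),
        'MPE': 100 * (err / y_true).mean(),
        'MAPE': 100 * np.abs(err / y_true).mean(),
    }

stats = regression_stats([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print(round(stats['MAPE'], 1))  # 5.0
```

Computing the statistics by hand also makes the units explicit: ME, RMSE, and MAE are in thousands of dollars here, while MPE and MAPE are percentages.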
9. ASSESS AND INTERPRET THE RESULTS


10. DEPLOY THE BEST MODEL


new_data = pd.DataFrame({
'LOT_SQFT': [4200, 6444, 5035],
'YR_BUILT': [1960, 1940, 1925],
'GROSS_AREA': [2670, 2886, 3264],
'LIVING_AREA': [1710, 1474, 1523],
'FLOORS': [2.0, 1.5, 1.9],
'ROOMS': [10, 6, 6],
'BEDROOMS': [4, 3, 2],
'FULL_BATH': [1, 1, 1],
'HALF_BATH': [1, 1, 0],
'KITCHEN': [1, 1, 1],
'FIREPLACE': [1, 1, 0],
'REMODEL_Old': [0, 0, 0],
'REMODEL_Recent': [0, 0, 1],
})
print('Predictions: ', model.predict(new_data))

> Predictions: [384.47210285 378.06696706 386.01773842]
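Deployment usually starts by persisting the fitted model so it can be reloaded later for scoring. A minimal sketch with joblib (shipped with scikit-learn); the tiny model here is a stand-in for the fitted housing model, and the file name is made up:

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the fitted housing model
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# Persist the fitted model, then reload it to score new rows
joblib.dump(model, 'housing_model.joblib')
reloaded = joblib.load('housing_model.joblib')
print(reloaded.predict([[4.0]]))  # approximately [8.0]
```

Any preprocessing fitted on the training data (scalers, dummy encodings) must be persisted and reapplied the same way, or new rows like new_data above will be scored on the wrong scale.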


THE STANDARD DATA SCIENCE PROCESS






THE STANDARD DATA SCIENCE PROCESS
THE STANDARD DATA SCIENCE PROCESS: CRISP-DM
THE STANDARD DATA SCIENCE PROCESS: SEMMA
THE STANDARD DATA SCIENCE PROCESS: KDD
THE STANDARD DATA SCIENCE PROCESS
THE MACHINE LEARNING CANVAS



THE MACHINE LEARNING CANVAS: GOAL




THE MACHINE LEARNING CANVAS: LEARN



THE MACHINE LEARNING CANVAS: PREDICT


THE MACHINE LEARNING CANVAS: EVALUATE
