The Data Science Process

The document describes the standard data science process, which involves determining the business problem, obtaining and exploring data, cleaning and preprocessing the data, selecting machine learning algorithms, training models and evaluating their performance, and deploying the best model. It provides a Python example of applying linear regression to predict housing prices. The standard processes of CRISP-DM, SEMMA, and KDD are also overviewed, as well as the machine learning canvas framework.

PHUONG NGUYEN

THE DATA SCIENCE PROCESS


HOW to embed Machine Learning into business
CONTENT
1. A SIMPLE EXAMPLE IN PYTHON

2. STANDARD DATA SCIENCE PROCESSES

3. MACHINE LEARNING CANVAS


DATA SCIENCE PROCESS

https://github.com/nnbphuong/datascience4biz/blob/master/Overview_of_the_Data_Science_Process.ipynb
THE DATA SCIENCE PROCESS
1. DETERMINE THE PURPOSE


2. OBTAIN THE DATA


import pandas as pd

# Load data
housing_df = pd.read_csv('WestRoxbury.csv')
housing_df.shape   # find the dimensions of the data frame
housing_df.head()  # show the first five rows
print(housing_df)  # show all the data

# Rename columns: replace spaces with '_'

housing_df = housing_df.rename(
    columns={'TOTAL VALUE ': 'TOTAL_VALUE'})  # explicit
housing_df.columns = [s.strip().replace(' ', '_')
                      for s in housing_df.columns]  # all columns

# Show first four rows of the data


housing_df.loc[0:3] # loc[a:b] gives rows a to b, inclusive
housing_df.iloc[0:4] # iloc[a:b] gives rows a to b-1
# Different ways of showing the first 10
# values in column TOTAL_VALUE

housing_df['TOTAL_VALUE'].iloc[0:10]
housing_df.iloc[0:10]['TOTAL_VALUE']
housing_df.iloc[0:10].TOTAL_VALUE
# use dot notation if the column name has no spaces

# Show the fifth row of the first 10 columns


housing_df.iloc[4][0:10]
housing_df.iloc[4, 0:10]
housing_df.iloc[4:5, 0:10]
# use a slice to return a data frame
# Use pd.concat to combine non-consecutive columns into a
# new data frame. Axis argument specifies dimension along
# which concatenation happens, 0=rows, 1=columns.
pd.concat([housing_df.iloc[4:6,0:2],
housing_df.iloc[4:6,4:6]], axis=1)

# To specify a full column, use:

housing_df.iloc[:, 0:1]
housing_df.TOTAL_VALUE

# show the first 10 rows of the first column


housing_df['TOTAL_VALUE'][0:10]
# Descriptive statistics

# show length of first column


print('Number of rows ', len(housing_df['TOTAL_VALUE']))

# show mean of column


print('Mean of TOTAL_VALUE ',
housing_df['TOTAL_VALUE'].mean())

# show summary statistics for each column


housing_df.describe()
# random sample of 5 observations
housing_df.sample(5)

# oversample houses with over 10 rooms


weights = [0.9 if rooms > 10 else 0.01
for rooms in housing_df.ROOMS]
housing_df.sample(5, weights=weights)
3. EXPLORE, CLEAN, AND PRE-PROCESS THE DATA





housing_df.columns # print a list of variables

Index(['TOTAL_VALUE', 'TAX', 'LOT_SQFT',


'YR_BUILT', 'GROSS_AREA','LIVING_AREA',
'FLOORS', 'ROOMS', 'BEDROOMS', 'FULL_BATH',
'HALF_BATH','KITCHEN', 'FIREPLACE',
'REMODEL'], dtype='object')


HANDLING VARIABLES







# REMODEL needs to be converted to a categorical variable
housing_df.REMODEL = housing_df.REMODEL.astype('category')
housing_df.REMODEL.cat.categories  # show the categories
housing_df.REMODEL.dtype  # check type of converted variable

# use drop_first=True to drop the first dummy variable


housing_df = pd.get_dummies(housing_df,
prefix_sep='_', drop_first=True)
housing_df.columns
housing_df.loc[:,'REMODEL_Old':'REMODEL_Recent'].head(5)

['None', 'Old', 'Recent']

REMODEL_Old REMODEL_Recent
0 0 0
1 0 1
2 0 0
3 0 0
4 0 0
DETECTING OUTLIERS


housing_df.plot.scatter(x='ROOMS', y='FLOORS', legend=False)
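The scatter plot flags outliers visually; a simple numeric rule is also common, such as the 1.5 × IQR fence. A minimal sketch on a toy series standing in for a column such as ROOMS (the housing data is not loaded here):

```python
import pandas as pd

# Toy stand-in for a numeric column such as ROOMS
s = pd.Series([5, 6, 7, 6, 5, 8, 7, 6, 30])

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [30]
```

Unlike a z-score rule, the IQR fence is not inflated by the outlier itself, which matters on small samples like this one.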
HANDLING MISSING DATA







import numpy as np

# To illustrate missing data procedures, we first convert a few
# entries for BEDROOMS to NA's. Then we impute these missing
# values using the median of the remaining values.

missingRows = housing_df.sample(10).index
housing_df.loc[missingRows, 'BEDROOMS'] = np.nan
print('Number of rows with valid BEDROOMS values after setting to NAN:',
      housing_df['BEDROOMS'].count())

medianBedrooms = housing_df['BEDROOMS'].median()
housing_df.BEDROOMS = housing_df.BEDROOMS.fillna(value=medianBedrooms)
print('Number of rows with valid BEDROOMS values after filling NA values:',
      housing_df['BEDROOMS'].count())
NORMALIZING/RESCALING DATA



# Normalizing a data frame (z-scores)
norm_df = (housing_df - housing_df.mean()) / housing_df.std()

# Rescaling a data frame to the [0, 1] range

norm_df = ((housing_df - housing_df.min()) /
           (housing_df.max() - housing_df.min()))
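The same two transformations are available through scikit-learn's scalers; a sketch on a toy frame (not the housing data). One caveat: StandardScaler divides by the population standard deviation (ddof=0), whereas pandas .std() defaults to the sample version (ddof=1), so the z-scores differ slightly.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 40.0]})

# z-score normalization (population std, ddof=0)
norm = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# min-max rescaling to [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled['a'].tolist())  # [0.0, 0.5, 1.0]
```

A fitted scaler can also be reused to transform validation data with the training set's statistics, which the plain pandas expressions above do not do.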
4. REDUCE THE DATA DIMENSION
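The deck gives no code for this step. One common approach is principal component analysis (PCA); the sketch below uses scikit-learn on synthetic data, since applying it to the housing frame is only assumed here, not shown in the source:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for correlated numeric housing columns
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
X[:, 5] = X[:, 0] + 0.1 * rng.normal(size=100)  # make column 5 track column 0

# Standardize first so each variable contributes equally
X_std = StandardScaler().fit_transform(X)

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_std)
print(X_reduced.shape)  # fewer than the original 6 columns
```

On the housing data, the same two calls would be applied to the numeric predictor columns before modeling.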


5. DETERMINE THE DATA SCIENCE TASK


6. PARTITION THE DATA

from sklearn.model_selection import train_test_split

# set random_state for reproducibility

# training (60%) and validation (40%)

trainData, validData = train_test_split(housing_df,
    test_size=0.40, random_state=1)

# produces Training: 3481 Validation: 2321

# training (50%), validation (30%), and test (20%)


trainData, temp = train_test_split(housing_df, test_size=0.5, random_state=1)
# now split temp into validation and test
validData, testData = train_test_split(temp, test_size=0.4, random_state=1)

# produces Training: 2901 Validation: 1741 Test: 1160


7. CHOOSE THE TECHNIQUES

8. PERFORM THE TASK

from sklearn.linear_model import LinearRegression

# create list of predictors and outcome
excludeColumns = ('TOTAL_VALUE', 'TAX')
predictors = [s for s in housing_df.columns
              if s not in excludeColumns]
outcome = 'TOTAL_VALUE'

# partition data
X = housing_df[predictors]
y = housing_df[outcome]
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.4, random_state=1)

model = LinearRegression()
model.fit(train_X, train_y)
train_pred = model.predict(train_X)
train_results = pd.DataFrame({
'TOTAL_VALUE': train_y,
'predicted': train_pred,
'residual': train_y - train_pred
})

# show sample of predictions


train_results.head()

TOTAL_VALUE predicted residual


2024 392.0 387.726258 4.273742
5140 476.3 430.785540 45.514460
5259 367.4 384.042952 -16.642952
421 350.3 369.005551 -18.705551
1401 348.1 314.725722 33.374278
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
'TOTAL_VALUE': valid_y,
'predicted': valid_pred,
'residual': valid_y - valid_pred
})

valid_results.head()

TOTAL_VALUE predicted residual


1822 462.0 406.946377 55.053623
1998 370.4 362.888928 7.511072
5126 407.4 390.287208 17.112792
808 316.1 382.470203 -66.370203
4034 393.2 434.334998 -41.134998
# import the utility function regressionSummary
from dmba import regressionSummary

# training set
regressionSummary(train_results.TOTAL_VALUE,
train_results.predicted)

# validation set
regressionSummary(valid_results.TOTAL_VALUE,
valid_results.predicted)
OUTPUT
Regression statistics (training)

Mean Error (ME) : -0.0000


Root Mean Squared Error (RMSE) : 43.0306
Mean Absolute Error (MAE) : 32.6042
Mean Percentage Error (MPE) : -1.1116
Mean Absolute Percentage Error (MAPE) : 8.4886

Regression statistics (validation)

Mean Error (ME) : -0.1463


Root Mean Squared Error (RMSE) : 42.7292
Mean Absolute Error (MAE) : 31.9663
Mean Percentage Error (MPE) : -1.0884
Mean Absolute Percentage Error (MAPE) : 8.3283
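The regressionSummary utility comes from the dmba package, which may not be installed; the same five statistics can be computed directly with NumPy. A sketch, using the same actual-minus-predicted residual convention as above (the toy numbers are illustrative, not from the housing data):

```python
import numpy as np

def regression_stats(y_true, y_pred):
    # ME, RMSE, MAE, MPE, MAPE, matching dmba.regressionSummary's report
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred  # residual = actual - predicted
    return {
        'ME': err.mean(),
        'RMSE': np.sqrt((err ** 2).mean()),
        'MAE': np.abs(err).mean(),
        'MPE': 100 * (err / y_true).mean(),
        'MAPE': 100 * np.abs(err / y_true).mean(),
    }

stats = regression_stats([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
print(round(stats['MAPE'], 1))  # 5.0
```

Computing the statistics by hand also makes the units explicit: ME, RMSE, and MAE are in thousands of dollars here, while MPE and MAPE are percentages.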
9. ASSESS AND INTERPRET THE RESULTS


10. DEPLOY THE BEST MODEL


new_data = pd.DataFrame({
'LOT_SQFT': [4200, 6444, 5035],
'YR_BUILT': [1960, 1940, 1925],
'GROSS_AREA': [2670, 2886, 3264],
'LIVING_AREA': [1710, 1474, 1523],
'FLOORS': [2.0, 1.5, 1.9],
'ROOMS': [10, 6, 6],
'BEDROOMS': [4, 3, 2],
'FULL_BATH': [1, 1, 1],
'HALF_BATH': [1, 1, 0],
'KITCHEN': [1, 1, 1],
'FIREPLACE': [1, 1, 0],
'REMODEL_Old': [0, 0, 0],
'REMODEL_Recent': [0, 0, 1],
})
print('Predictions: ', model.predict(new_data))

> Predictions: [384.47210285 378.06696706 386.01773842]
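Deployment usually starts by persisting the fitted model so it can be reloaded later for scoring. A minimal sketch with joblib (shipped with scikit-learn); the tiny model here is a stand-in for the fitted housing model, and the file name is made up:

```python
import numpy as np
import joblib
from sklearn.linear_model import LinearRegression

# Tiny stand-in for the fitted housing model
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
model = LinearRegression().fit(X, y)

# Persist the fitted model, then reload it to score new rows
joblib.dump(model, 'housing_model.joblib')
reloaded = joblib.load('housing_model.joblib')
print(reloaded.predict([[4.0]]))  # approximately [8.0]
```

Any preprocessing fitted on the training data (scalers, dummy encodings) must be persisted and reapplied the same way, or new rows like new_data above will be scored on the wrong scale.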


THE STANDARD DATA SCIENCE PROCESS






THE STANDARD DATA SCIENCE PROCESS
THE STANDARD DATA SCIENCE PROCESS: CRISP-DM
THE STANDARD DATA SCIENCE PROCESS: SEMMA
THE STANDARD DATA SCIENCE PROCESS: KDD
THE STANDARD DATA SCIENCE PROCESS
THE MACHINE LEARNING CANVAS



THE MACHINE LEARNING CANVAS: GOAL




THE MACHINE LEARNING CANVAS: LEARN



THE MACHINE LEARNING CANVAS: PREDICT


THE MACHINE LEARNING CANVAS: EVALUATE
