Webinar 2: Data Science for Predictive Modelling (Tessy Badriyah)
Introduction
Data Science
Predictive Modeling
Workshop: Predictive Modeling with Python
Courtesy: https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/lifecycle
INTRODUCTION
• Many everyday activities require data management, whether directly or
indirectly.
• Examples:
• Banking: savings, transfers, deposits.
• Reservations: hotels, flights, trains.
• Shopping: shops, malls, supermarkets.
• And so on.
DATA EVOLUTION
Raw Data → Information → Knowledge → Decision
FROM INFORMATION TO KNOWLEDGE
Data Mining Stages
DATA FORMATS
• There are three types of data format:
• Structured: relational databases (RDBMS).
• Semi-structured: XML, JSON.
• Unstructured: documents, journals, metadata, images, video, text files, audio,
ebooks, email messages, social media, etc.
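The three formats can be sketched with the Python standard library alone. The records, names, and amounts below are made-up illustrations, not data from the webinar.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields (like an RDBMS table).
csv_text = "id,name,balance\n1,Ani,2500\n2,Budi,4000\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing keys; fields may be nested or missing.
json_text = '{"id": 3, "name": "Citra", "accounts": [{"type": "savings", "balance": 1200}]}'
record = json.loads(json_text)

# Unstructured: free text with no schema; any structure has to be extracted.
email_body = "Hello, please transfer 500 to my savings account tomorrow."
word_count = len(email_body.split())

print(rows[0]["name"], record["accounts"][0]["balance"], word_count)
```

Note that the CSV reader returns every field as a string, while the JSON parser keeps numbers as numbers; that difference is part of what "self-describing" means for semi-structured data.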
BIG DATA AND DATA SCIENCE
• Data Science deals with solving complex problems using data: not only
structured data such as SQL tables, but also the unstructured and
semi-structured data that emerged with the Big Data era.
# Class distribution
from pandas import read_csv
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
# Count how many rows fall into each class label
class_counts = data.groupby('class').size()
print(class_counts)
Output:
class
0    500
1    268
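Counts are easier to judge as proportions. With 500 rows in class 0 and 268 in class 1, class 0 makes up 500/768, about 65.1% of the data, so a model that always predicts class 0 would already reach roughly that accuracy. A minimal sketch of the same calculation, using a small made-up label column instead of the CSV file:

```python
import pandas as pd

# Toy labels standing in for the 'class' column (hypothetical values).
data = pd.DataFrame({"class": [0, 0, 0, 1, 0, 1, 0, 1]})

# Absolute counts per class, as on the slide.
class_counts = data.groupby("class").size()

# Relative share per class; the largest share is the majority-class baseline.
class_share = data["class"].value_counts(normalize=True).sort_index()

print(class_counts)
print(class_share.round(3))
```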
CORRELATION BETWEEN ATTRIBUTES
# Pairwise Pearson correlation
from pandas import read_csv
from pandas import set_option
filename = "pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(filename, names=names)
set_option('display.width', 100)
# The full option name is required in current pandas versions
set_option('display.precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
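Each cell of the correlation matrix is a Pearson coefficient between two columns. The sketch below computes one coefficient by hand and checks it against NumPy; the `pearson` helper and the sample values are illustrative assumptions, not part of the webinar material.

```python
import numpy as np

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values for two attributes.
age = [21, 30, 45, 50, 62]
glucose = [85, 95, 120, 118, 140]

r_manual = pearson(age, glucose)
r_numpy = float(np.corrcoef(age, glucose)[0, 1])
print(round(r_manual, 3), round(r_numpy, 3))
```

Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship.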
Modeling with Logistic Regression
Sampling technique: train_test_split
# Evaluation using training data and testing data
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# First 8 columns are the input features, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
test_size = 0.33
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size,
random_state=seed)
# liblinear was the default solver before scikit-learn 0.22
model = LogisticRegression(solver='liblinear')
model.fit(X_train, Y_train)
# score() returns the accuracy on the held-out test set
result = model.score(X_test, Y_test)
print("Accuracy: %.3f%%" % (result*100.0))
Output:
Accuracy: 75.591%
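To see what the hold-out split does on its own, the sketch below applies `train_test_split` to placeholder arrays with the same shape as the Pima data (768 rows, 8 features). The array contents are made up; only the shapes and the effect of `random_state` are of interest.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 768 rows and 8 feature columns, plus a dummy binary label.
X = np.arange(768 * 8).reshape(768, 8)
Y = np.arange(768) % 2

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=7
)
print(X_train.shape, X_test.shape)

# Repeating the call with the same random_state reproduces the same split.
_, X_test_again, _, _ = train_test_split(X, Y, test_size=0.33, random_state=7)
same_split = np.array_equal(X_test, X_test_again)
print(same_split)
```

Fixing `random_state` makes the accuracy figure repeatable, but the figure still depends on which rows happened to land in the test set.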
Sampling technique: k-fold Cross Validation
# Evaluation using Cross Validation
from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
num_folds = 10
# random_state is omitted: current scikit-learn rejects it unless shuffle=True
kfold = KFold(n_splits=num_folds)
# liblinear was the default solver before scikit-learn 0.22
model = LogisticRegression(solver='liblinear')
results = cross_val_score(model, X, Y, cv=kfold)
# Report the mean accuracy and its standard deviation across the folds
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
Output:
Accuracy: 76.951% (4.841%)
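The key property of k-fold cross validation is that every row is used for testing exactly once and for training in all other folds. A small sketch on ten made-up rows and five folds (both numbers chosen only for readability) makes this visible:

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder rows with two features each.
X = np.arange(20).reshape(10, 2)
kfold = KFold(n_splits=5)

test_counts = np.zeros(len(X), dtype=int)  # how often each row is a test row
fold_sizes = []
for train_idx, test_idx in kfold.split(X):
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(fold_sizes)
print(test_counts)
```

Because each fold yields its own accuracy, the result is reported as a mean together with a standard deviation, which gives a sense of how much the estimate varies between folds.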
Parameter Tuning for the
Support Vector Machine (SVM) Algorithm
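One common way to tune SVM parameters in scikit-learn is an exhaustive grid search with cross validation. The sketch below is an assumption about how such tuning could look, not the procedure shown in the webinar: the synthetic data, the feature scaling step, and the candidate values for `C` and `kernel` are all illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with 8 features (stand-in for a real dataset).
X, Y = make_classification(n_samples=200, n_features=8, random_state=7)

# SVMs are sensitive to feature scale, so scale inside the pipeline
# to keep the scaling fitted on the training folds only.
pipeline = make_pipeline(StandardScaler(), SVC())

# Hypothetical candidate values; "svc__" addresses the SVC step of the pipeline.
param_grid = {
    "svc__C": [0.1, 1.0, 10.0],
    "svc__kernel": ["linear", "rbf"],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid.fit(X, Y)

print(grid.best_params_, round(grid.best_score_, 3))
```

The search fits one model per parameter combination and fold (here 6 combinations times 5 folds) and keeps the combination with the best mean cross-validated accuracy.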
CLOSING
This concludes webinar series #2 on the topic Data Science for
Predictive Modeling, which covered the following topics:
• Introduction
• Data Science
• Predictive Modeling
• Workshop: Predictive Modeling with Python