Statistical Machine Learning With Python Week #1
Outline
Week #1
Week #2
Week #3
As with engineering problems, though, there are sets of common tasks that underlie business problems.
Data scientists usually decompose a business problem into subtasks. The solutions to the
subtasks can then be composed to solve the overall problem. (Decomposition of business
problems and re-composition of solutions)
Some of these subtasks are unique to the particular business problem, but others are common
data mining tasks.
Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different types of tasks these algorithms address. The following explains these basic tasks.
Data reduction takes a large set of data and replaces it with a smaller set that contains much of the important information in the larger set.
Similarity matching attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. (The basic logic of problem solving: find the similarity among different objects and discover the dissimilarity among similar things.)
For example, IBM is interested in finding companies similar to their best business customers,
in order to focus their sales force on the best opportunities. They use similarity matching based
on “firmographic” data describing characteristics of the companies.
Similarity matching is the basis for one of the most popular methods for making product
recommendations (finding people who are similar to you in terms of the products they have
liked or have purchased).
Similarity measures underlie certain solutions to other data mining tasks, such as classification,
regression, and clustering.
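As a concrete illustration (ours, not from the original slides), one widely used similarity measure is cosine similarity between two customers' liked-product vectors:

import numpy as np

a = np.array([1, 0, 1, 1, 0])  # customer A: which of five products A has liked
b = np.array([1, 1, 1, 0, 0])  # customer B
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # values near 1 mean very similar taste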
An example profiling question would be: “What is the typical cell phone usage of this customer
segment?”
Behavior may not have a simple description; profiling cell phone usage might require a
complex description of night and weekend airtime averages, international usage, roaming
charges, text minutes, and so on.
Profiling is often used to establish behavioral norms for anomaly detection applications such as
fraud detection and monitoring for intrusions to computer systems.
For example, if we know what kind of purchases a person typically makes on a credit card, we
can determine whether a new charge on the card fits that profile or not. We can use the degree
of mismatch as a suspicion score and issue an alarm if it is too high.
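A minimal sketch of such a suspicion score (our illustration; the charge history, the new charge, and the alarm threshold are all made up):

import numpy as np

history = np.array([42.0, 18.5, 61.2, 25.0, 38.7, 55.1])  # hypothetical past charges
mu, sigma = history.mean(), history.std()

def suspicion_score(charge):
    """Degree of mismatch between a new charge and the customer's profile."""
    return abs(charge - mu) / sigma

if suspicion_score(980.0) > 3.0:  # alarm threshold is an assumption
    print("alarm: charge deviates strongly from this customer's profile")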
Task 4. Clustering
An example clustering question would be: “Do our customers form natural groups or
segments?”
Clustering is useful in preliminary domain exploration to see which natural groups exist
because these groups in turn may suggest other data mining tasks or approaches.
Clustering also is used as input to decision-making processes.
For example: What products should we offer or develop? How should our customer care teams (or sales teams) be structured?
Co-occurrence grouping, also known as frequent itemset mining, association rule discovery, and market-basket analysis, attempts to find associations between entities based on transactions involving them.
Classification problem: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?” In this example the two classes could be called will respond and will not respond.
For a classification task, a data mining procedure produces a model that, given a new
individual, determines which class that individual belongs to.
A closely related task is scoring or class probability estimation. A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class.
Classification and scoring are very closely related; as we shall see, a model that can do one
can usually be modified to do the other.
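A minimal scikit-learn sketch (ours; the toy data are invented) showing how one fitted model supports both classification and scoring:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy feature, e.g. past usage
y = np.array([0, 0, 1, 1])                  # 0: will not respond, 1: will respond

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))        # classification: a hard class label
print(clf.predict_proba([[2.5]]))  # scoring: probability of each class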
A regression question: “How much will a given customer use the service?” The property (variable) to be predicted here is service usage, and a model could be generated by looking at other, similar individuals in the population and their historical usage.
Regression is related to classification, but the two are different.
Classification: predicts whether something will happen
Regression: predicts how much something will happen
Link prediction is common in social networking systems: “Since you and Karen share 10
friends, maybe you’d like to be Karen’s friend?”
Link prediction can also estimate the strength of a link. For example, for recommending movies
to customers one can think of a graph between customers and the movies they’ve watched or
rated. Within the graph, we search for links that do not exist between customers and movies,
but that we predict should exist and should be strong. These links form the basis for
recommendations.
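A toy illustration (ours) of the simplest link-prediction score, counting common neighbors between two users:

friends = {
    "you":   {"ann", "bob", "carol"},   # hypothetical friend sets
    "karen": {"bob", "carol", "dave"},
}
shared = friends["you"] & friends["karen"]
print(len(shared))  # more shared friends suggest a stronger candidate link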
Suppose we use predictive modeling to target advertisements to consumers, and we observe that the targeted consumers indeed purchase at a higher rate after being targeted.
Was this because the advertisements influenced the consumers to purchase? Or did the
predictive models simply do a good job of identifying those consumers who would have
purchased anyway? (ha ha!)
Causal modeling includes experiments (A/B tests) and observational methods; both attempt to understand what the difference would be between two situations that cannot both happen: one where the “treatment” event occurs, and one where it does not.
In all cases, a careful data scientist should always include with a causal conclusion the exact assumptions that must be made in order for that conclusion to hold (there are always such assumptions, so always ask).
$u(t) = K_P \left( e(t) + \frac{1}{T_I} \int_0^t e(\tau)\, d\tau + T_D \frac{de(t)}{dt} \right)$
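As a bridge from the formula to code, here is a minimal discrete-time PID step (our illustration; the gains and sampling interval dt are arbitrary):

def pid_step(error, state, Kp=1.0, Ti=10.0, Td=0.1, dt=0.01):
    integral, prev_error = state
    integral += error * dt                  # accumulate the integral term
    derivative = (error - prev_error) / dt  # approximate de(t)/dt
    u = Kp * (error + integral / Ti + Td * derivative)
    return u, (integral, error)

state = (0.0, 0.0)
u, state = pid_step(error=0.5, state=state)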
From PID control to data-driven closed-loop control
Data sensitive
What kinds of computation and visualization can we do under nominal, ordinal, interval, and ratio scales? Under structured and ill-structured data? One more step towards modeling approaches and algorithms.
Data mashups
Get the right meaning out of different kinds of data (record, graph, sequence, text, audio, video) and try to mix them together in your analysis.
Model mashups
From statistics and machine learning to the backdrop hung by operations research.
Prototyping tools
Hands-on through R, Python, Julia … Learning by doing, doing along with learning.
Fusion with other information technologies (IT)
Linux, Web, Cloud, Hadoop, Spark, NoSQL … Learning never ends.
All built on business understanding!
The data mining (or machine learning, or predictive modeling) process is inherently hands-on.
An article about computational science in a scientific publication is not the scholarship itself, it
is merely advertising of the scholarship. The actual scholarship is the complete software
development environment and the complete set of instructions which generated the figures.
from Buckheit, J. and Donoho, D.L. (1995), “WaveLab and Reproducible Research”, in A.
Antoniadis, G. Oppenheim (eds.), “Wavelets in Statistics”, pp. 55-82, Springer-Verlag, New
York.
PHP, R, Python, …
They are dynamic and evolving. Don't be surprised: there may be new packages uploaded just after our classes end.
Assembler: whole groups of bit operations are assembled.
Machine code
You must highlight the differences between fourth- and third-generation programming languages while learning data-driven programming.
Define the model you want to build. It’s a model with unknown parameters.
from sklearn.naive_bayes import MultinomialNB  # import added for completeness (assumed; not on the slide)
clf = MultinomialNB()
Input the training data and Python will fit the model. We then have a parameterized model.
clf.fit(sms_dtm_train, sms_raw_train['type'])
pred = clf.predict(sms_dtm_test)
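A plausible next step (our addition; it assumes held-out labels sms_raw_test['type'] matching sms_dtm_test) is to evaluate the predictions:

from sklearn.metrics import accuracy_score, confusion_matrix

print(confusion_matrix(sms_raw_test['type'], pred))  # rows: true classes, columns: predicted
print(accuracy_score(sms_raw_test['type'], pred))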
Functional Programming in R
Most R scripts look like the following:
kmeans.results <- kmeans(x = iris2, centers = 3)
object <- function(argument1 = value1, argument2 = value2, ...)
?reserved
Model outputs in R are usually lists (S3 objects); in Python (scikit-learn), the attributes learned from data end with an underscore (e.g. labels_).
The class (or type) name of the output is usually the same as the name of the function that creates that object.
Most R scripts involve several functions; please comprehend the script part by part and read it from the inside out.
Do not neglect error messages: reading and understanding them will improve you a lot.
[Table: models and the tasks they address (supervised, unsupervised, or both)]
Unsupervised learning: As opposed to predictive models (i.e. supervised learning) that predict
a target of interest, in unsupervised learning, no single feature is more important than any
other. In fact, because there is no target to learn, the process of training a descriptive model is
called unsupervised learning.
Model: k-means; Task: Clustering
Cluster labels capture some underlying aspect of similarity among the individuals within the group.
Clustering is an unsupervised machine learning task that automatically divides the data into
clusters, or groups of similar items. It does this without having been told how the groups
should look ahead of time.
As we may not even know what we’re looking for, clustering is used for knowledge discovery
rather than prediction. It provides an insight into the natural groupings found within data.
Clustering - 1
Clustering is guided by the principle that items or objects inside a cluster should be very similar
to each other, but very different from those outside.
The definition of similarity might vary across applications, but the basic idea is always the same: group the data so that related elements are placed together.
Clustering methods are employed in the following applications:
Segmenting customers into groups with similar demographics or buying patterns for
targeted marketing campaigns
Detecting anomalous behavior, such as unauthorized network intrusions, by identifying
patterns of use falling outside the known clusters
Simplifying extremely large datasets by grouping features with similar values into a
smaller number of homogeneous categories
Clustering - 2
Types of Clustering
Partitional clustering
Hierarchical clustering
Types of Clusters
Well-separated
Center-based
Contiguous
Density-based
Property or conceptual
Described by an Objective Function
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
Graph-based clustering
K-means Clustering
Pseudocode of K-means
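The pseudocode itself did not survive extraction; the following NumPy sketch of the standard Lloyd's algorithm (an illustration, not the course's own code) spells out the usual four steps:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # 1. pick k initial centroids
    for _ in range(n_iter):
        # 2. assign each point to its nearest centroid (squared Euclidean distance)
        labels = ((X[:, None, :] - centers) ** 2).sum(axis=-1).argmin(axis=1)
        # 3. recompute each centroid as the mean of its assigned points
        #    (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # 4. stop when centroids stabilize
            break
        centers = new_centers
    return labels, centers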
Limitations of K-means
Sizes
Densities
Non-globular shapes
Outliers
K-means algorithm uses a heuristic process that finds locally optimal solutions. Put simply, this
means that it starts with an initial guess for the cluster assignments, and then modifies the
assignments slightly to see whether the changes improve the homogeneity within the clusters.
These teenagers are a coveted demographic for businesses hoping to sell snacks, beverages,
electronics, and hygiene products.
One way to gain this edge is to identify segments of teenagers who share similar tastes, so that marketers can avoid targeting advertisements to teens with no interest in the product being sold.
Given the text of teenagers’ SNS pages, we can identify groups that share common interests
such as sports, religion, or music.
Clustering can automate the process of discovering the natural segments in this population.
However, it will be up to us to decide whether or not the clusters are interesting and how we
can use them for advertising.
Download a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS; each teen's gender, age, and number of SNS friends were recorded.
A text mining tool was used to divide the remaining SNS page content into words (i.e. word
segmentation). From the top 500 words appearing across all the pages, 36 words were chosen
to represent five categories of interests: namely extracurricular activities, fashion, religion,
romance, and antisocial behavior.
Collecting Data
Set up the configuration so we can program in R and Python interchangeably.
import numpy as np
import pandas as pd
teens = pd.read_csv('./snsdata.csv', encoding = 'utf-8')
Data understanding
print(teens.dtypes)
## gradyear int64
## gender object
## age float64
## friends int64
## basketball int64
## football int64
## soccer int64
## softball int64
## volleyball int64
## swimming int64
## cheerleading int64
## baseball int64
## tennis int64
## sports int64
## cute int64
## sex int64
## sexy int64
## hot int64
## kissed int64
## dance int64
## band int64
## marching int64
## music int64
## rock int64
## god int64
## church int64
## jesus int64
## bible int64
## hair int64
## dress int64
## blonde int64
## mall int64
## shopping int64
## clothes int64
## hollister int64
## abercrombie int64
## die int64
## death int64
## drunk int64
## drugs int64
## dtype: object
print(teens.describe(include='all'))
teens.isnull().sum()
## gradyear 0
## gender 2724
## age 5086
## friends 0
## basketball 0
## football 0
## soccer 0
## softball 0
## volleyball 0
## swimming 0
## cheerleading 0
## baseball 0
## tennis 0
## sports 0
## cute 0
## sex 0
## sexy 0
## hot 0
## kissed 0
## dance 0
## band 0
## marching 0
## music 0
## rock 0
## god 0
## church 0
## jesus 0
## bible 0
## hair 0
## dress 0
## blonde 0
## mall 0
## shopping 0
## clothes 0
## hollister 0
## abercrombie 0
## die 0
## death 0
## drunk 0
## drugs 0
## dtype: int64
Training Model - 1
Feature selection: run the cluster analysis on the 36 interest keywords.

interests = teens.loc[:, 'basketball':'drugs']
Training Model - 2
The scales of the term frequencies differ, so normalization (centering and scaling) helps to prevent some features from dominating others.

from sklearn import preprocessing  # import is assumed; it was not on the slide
teens_z = preprocessing.scale(interests)
teens_z = pd.DataFrame(teens_z)
teens_z.head(6)
## 0 1 2 ... 33 34 35
## 0 -0.332217 -0.357697 -0.242874 ... -0.261530 -0.220403 -0.174908
## 1 -0.332217 1.060049 -0.242874 ... -0.261530 -0.220403 -0.174908
## 2 -0.332217 1.060049 -0.242874 ... 2.027908 -0.220403 -0.174908
## 3 -0.332217 -0.357697 -0.242874 ... -0.261530 -0.220403 -0.174908
## 4 -0.332217 -0.357697 -0.242874 ... -0.261530 2.285122 2.719316
## 5 -0.332217 -0.357697 -0.242874 ... -0.261530 2.285122 -0.174908
##
## [6 rows x 36 columns]
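Note that preprocessing.scale returns a bare NumPy array, which is why the columns above are labeled 0 to 35. An optional tweak (our suggestion, not on the slide) keeps the keyword names as column labels:

teens_z = pd.DataFrame(preprocessing.scale(interests), columns=interests.columns)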
Training Model - 3
k-means clustering: why n_clusters = 5?
The Breakfast Club (John Hughes, 1985) has five stereotypes: a Brain, an Athlete, a Basket Case, a Princess, and a Criminal. So let's start with k = 5.
from sklearn.cluster import KMeans  # import is assumed; it was not on the slide
mdl = KMeans(n_clusters = 5)
In R's kmeans output, tot.withinss is the sum of the five withinss values, and totss is the sum of tot.withinss and betweenss (how can we check this here?).
mdl.fit(teens_z)
pre = dir(mdl)
mdl.fit(teens_z)
post = dir(mdl)
# show a slice of the attribute list again
print(post[51:56])
print(list(set(post) - set(pre)))

## []

The difference is empty because the model had already been fitted once before pre was recorded, so refitting adds no new attributes.
Quantitative vs. qualitative evaluation (cluster validity, Sum of Squares Within / Sum of Squares Between).
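As a rough quantitative check, a minimal sketch (assuming the fitted mdl and teens_z from above) recovers R's three quantities from scikit-learn:

ssw = mdl.inertia_                                          # tot.withinss: within-cluster sum of squares
sst = ((teens_z - teens_z.mean(axis=0)) ** 2).values.sum()  # totss: total sum of squares
ssb = sst - ssw                                             # betweenss
print(ssw, ssb, sst)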
If the groups are too large or too small, they are not likely to be very useful.
pd.Series(mdl.labels_).value_counts()
## 1 22440
## 3 5827
## 2 1124
## 0 608
## 4 1
## dtype: int64
mdl.labels_[:10]
mdl.cluster_centers_.shape  # reconstructed statement (assumed) producing the output below

## (5, 36)
# cen is assumed to be a DataFrame of the five cluster centers over the 36 keywords, e.g.:
# cen = pd.DataFrame(mdl.cluster_centers_, columns=interests.columns)
# ax = cen.T.plot()
ax.set_xticklabels(list(cen.T.index), rotation=90)
fig = ax.get_figure()
fig.tight_layout()
# fig.savefig('./_img/sns_lineplot.png')
teens['cluster'] = mdl.labels_  # assumed step (lost in extraction): attach labels to the data
teens[['gender','age','friends','cluster']][0:5]
Look at the mean age of each cluster; the differences are small because valid ages for teenagers lie between 13 and 20.
teens.groupby('cluster').aggregate({'age': np.mean})
## age
## cluster
## 0 18.137230
## 1 18.129906
## 2 17.533893
## 3 17.567499
## 4 18.119000
# 'female' is assumed to be a dummy column (1 if gender == 'F') created in a step not shown here
teens.groupby('cluster').aggregate({'female': np.mean})
teens.groupby('cluster').aggregate({'friends': np.mean})
## friends
## cluster
## 0 32.804276
## 1 27.843048
## 2 30.250000
## 3 38.887249
## 4 44.000000
import pandas as pd
import numpy as np
cell = pd.read_csv('segmentationOriginal.csv')
cell.head(2)
cell.info()  # reconstructed statement (assumed) producing the output below

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2019 entries, 0 to 2018
## Columns: 119 entries, Cell to YCentroid
## dtypes: float64(49), int64(68), object(2)
## memory usage: 1.8+ MB
cell.shape
## (2019, 119)
cell.dtypes
## Cell int64
## Case object
## Class object
## AngleCh1 float64
## AngleStatusCh1 int64
## AreaCh1 int64
## AreaStatusCh1 int64
## AvgIntenCh1 float64
## AvgIntenCh2 float64
## AvgIntenCh3 float64
## AvgIntenCh4 float64
## AvgIntenStatusCh1 int64
## AvgIntenStatusCh2 int64
## AvgIntenStatusCh3 int64
## AvgIntenStatusCh4 int64
## ConvexHullAreaRatioCh1 float64
## ConvexHullAreaRatioStatusCh1 int64
## ConvexHullPerimRatioCh1 float64
## ConvexHullPerimRatioStatusCh1 int64
## DiffIntenDensityCh1 float64
## DiffIntenDensityCh3 float64
## DiffIntenDensityCh4 float64
## DiffIntenDensityStatusCh1 int64
## DiffIntenDensityStatusCh3 int64
## DiffIntenDensityStatusCh4 int64
## EntropyIntenCh1 float64
## EntropyIntenCh3 float64
## EntropyIntenCh4 float64
## EntropyIntenStatusCh1 int64
## EntropyIntenStatusCh3 int64
## ...
## ShapeP2ACh1 float64
## ShapeP2AStatusCh1 int64
## SkewIntenCh1 float64
## SkewIntenCh3 float64
## SkewIntenCh4 float64
## SkewIntenStatusCh1 int64
## SkewIntenStatusCh3 int64
## SkewIntenStatusCh4 int64
## SpotFiberCountCh3 int64
## SpotFiberCountCh4 int64
## SpotFiberCountStatusCh3 int64
## SpotFiberCountStatusCh4 int64
## TotalIntenCh1 int64
## TotalIntenCh2 int64
## TotalIntenCh3 int64
## TotalIntenCh4 int64
## TotalIntenStatusCh1 int64
## TotalIntenStatusCh2 int64
## TotalIntenStatusCh3 int64
## TotalIntenStatusCh4 int64
## VarIntenCh1 float64
## VarIntenCh3 float64
## VarIntenCh4 float64
## VarIntenStatusCh1 int64
## VarIntenStatusCh3 int64
## VarIntenStatusCh4 int64
## WidthCh1 float64
## WidthStatusCh1 int64
## XCentroid int64
## YCentroid int64
## Length: 119, dtype: object
cell.describe(include = "all")
# cell.isnull()
# type(cell.isnull()) # pandas.core.frame.DataFrame, so .index, .columns, and .values are the three important attributes
# cell.isnull().values
# type(cell.isnull().values) # numpy.ndarray
cell.isnull().values.any()  # reconstructed statement (assumed) producing the output below

## False
cell.isnull().sum()
#cell['Case'].nunique()
cell['Case'].unique()
cell.Case.value_counts()
# select the training set
## Test 1010
## Train 1009
## Name: Case, dtype: int64
cell['Case'][:10]
## 0 Test
## 1 Train
## 2 Train
## 3 Train
## 4 Test
## 5 Test
## 6 Test
## 7 Test
## 8 Test
## 9 Test
## Name: Case, dtype: object
## <class 'pandas.core.series.Series'>
cell[['Case']][:10]
## Case
## 0 Test
## 1 Train
## 2 Train
## 3 Train
## 4 Test
## 5 Test
## 6 Test
## 7 Test
## 8 Test
## 9 Test
## <class 'pandas.core.frame.DataFrame'>
cell_train = cell[cell['Case'] == 'Train']  # assumed step (lost in extraction): keep only training rows
cell_data = cell_train.drop(cell_train.columns[0:3], axis=1)  # drop Cell, Case, Class
cell_data.head()
Create class label vector (y) (label encoding and one-hot encoding)
# label encoding
from sklearn.preprocessing import LabelEncoder  # import is assumed; it was not on the slide
le_class = LabelEncoder().fit(cell['Class'])     # 'PS': 0, 'WS': 1
Class_label = le_class.transform(cell['Class'])  # 0: PS, 1: WS
Class_label.shape # (2019,)

## (2019,)

# one-hot encoding
from sklearn.preprocessing import OneHotEncoder  # import is assumed
# the encoder construction below is assumed; it did not survive extraction:
ohe_class = OneHotEncoder(sparse=False).fit(Class_label.reshape(-1, 1))
## /opt/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py:414: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
## If you want the future behaviour and silence this warning, you can specify "categories='auto'".
## In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
## warnings.warn(msg, FutureWarning)
ohe_class.get_params()
#{'categorical_features': 'all',
# 'dtype': float,
# 'handle_unknown': 'error',
# 'n_values': 'auto',
# 'sparse': False}
#ohe_class.categorical_features
Class_ohe=ohe_class.transform(Class_label.reshape(-1,1)) # (2019, 2)
## (2019, 1)
## (2019, 2)
Class_ohe
## array([[1., 0.],
## [1., 0.],
## [0., 1.],
## ...,
## [1., 0.],
## [0., 1.],
## [0., 1.]])
# assumed generating line: label the one-hot columns with the class names
pd.DataFrame(Class_ohe, columns=le_class.classes_).head()

## PS WS
## 0 1 0
## 1 1 0
## 2 0 1
## 3 1 0
## 4 1 0
print(cell_data.columns)
type(cell_data.columns) # pandas.core.indexes.base.Index
## <class 'pandas.core.indexes.base.Index'>
dir(pd.Series.str)
## 0 False
## 1 True
## 2 False
## 3 True
## 4 False
## dtype: bool
cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")] # again pandas.core.indexes.base.Index
# type(cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")]) # pandas.core.indexes.base.Index
len(cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")]) # 58 features with "Status"
## 58
cell_num = cell_data.drop(cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")], axis=1)
cell_num.head()
# dr is assumed to be a PCA model fitted on the 58 scaled numeric features, e.g.:
# from sklearn.decomposition import PCA
# dr = PCA().fit(preprocessing.scale(cell_num))
dir(dr)
type(dr.components_) # numpy.ndarray

## <class 'numpy.ndarray'>

dr.components_.shape  # reconstructed statement (assumed) producing the output below

## (58, 58)
# scree plot
dr.explained_variance_ratio_
import matplotlib.pyplot as plt
plt.plot(range(1, 26), dr.explained_variance_ratio_[:25], '-o')
plt.xlabel('# of components')
plt.ylabel('ratio of variance explained')
# list(range(1, 59))
# range(1, 59).tolist() # AttributeError: 'range' object has no attribute 'tolist'
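A quick follow-up (our suggestion, not from the slides) shows how much variance the first few components retain, which motivates keeping five components below:

import numpy as np
print(np.cumsum(dr.explained_variance_ratio_)[:5])  # cumulative variance explained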
# cell_pca is assumed to hold the PCA scores, e.g. cell_pca = dr.transform(...)
cell_dr = cell_pca[:, :5]  # keep the first five principal components
cell_dr
# pd.DataFrame(cell_dr).to_csv('cell_dr.csv')
References:
Hastie, T., Tibshirani, R., Friedman, J. (2001), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer.
Ledolter, J. (2013), Data Mining and Business Analytics with R, John Wiley & Sons.
Provost, F. and Fawcett, T. (2013), Data Science for Business: What You Need to Know About
Data Mining and Data-Analytic Thinking, O’Reilly.
Raschka, S. (2015), Python Machine Learning: Unlock deeper insights into machine learning with this vital guide to cutting-edge predictive analytics, Packt Publishing.
Tan, P.-N., Steinbach, M., and Kumar, V. (2006), Introduction to Data Mining, Pearson.
Torgo, L. (2011), Data Mining with R: Learning with Case Studies, CRC Press.
Williams, G. (2011), Data Mining with Rattle and R: The Art of Excavating Data for Knowledge
Discovery, Springer.
Zhao, Y. (2013), R and Data Mining: Examples and Case Studies, Academic Press.