
Statistical Machine Learning with Python Week #1

Ching-Shih Tsou (Ph.D.), Distinguished Prof. at the Department of Mechanical Engineering/Director of the Center for Artificial Intelligence & Data Science, Ming Chi University of Technology

August/2020 at MCUT

Outline
Week #1

Business Problems and Data Mining Tasks


Dimensionality Reduction and Principal Component Analysis
Clustering Analysis

Week #2

Association Rule Mining


k Nearest Neighbors
Tree-Based Models (Classification Trees, Regression Trees, and Model Trees incl.)

Week #3

Naïve Bayes Classification (Text Mining incl.)

Support Vector Machines

*Bagging, Boosting, and Random Forest

Confusion Matrix [Slides from ROC]

ROC Curve, AUC, Lift Chart

Prob. & Stats. + Data Mining + Machine Learning = Data Analysis + Computer Programming

Business Problems and Data Mining Tasks

Each data-driven business decision-making problem is unique, comprising its own combination of goals, desires, constraints, and even personalities.

Similar to engineering problems, though, there are sets of common tasks that underlie the
business problems.

Data scientists usually decompose a business problem into subtasks. The solutions to the
subtasks can then be composed to solve the overall problem. (Decomposition of business
problems and re-composition of solutions)

Some of these subtasks are unique to the particular business problem, but others are common
data mining tasks.

Despite the large number of specific data mining algorithms developed over the years, there are only a handful of fundamentally different types of tasks these algorithms address. The following will explain these basic tasks.

Task 1. Data reduction

take a large set of data and replace it with a smaller set of data that contains much of the important information in the larger set.

The smaller dataset may be easier to deal with or to process.


For example, a massive dataset on consumer movie-viewing preferences may be reduced to a
much smaller dataset revealing the consumer taste preferences that are latent in the viewing
data (for example, viewer genre preferences).
Data reduction usually involves loss of information. What is important is the trade-off for
improved insight.

Task 2. Similarity matching

attempts to identify similar individuals based on data known about them. Similarity matching can be used directly to find similar entities. (The basic logic of problem-solving: find the similarity among different objects and discover the dissimilarity from similar things.)

For example, IBM is interested in finding companies similar to their best business customers,
in order to focus their sales force on the best opportunities. They use similarity matching based
on “firmographic” data describing characteristics of the companies.

Similarity matching is the basis for one of the most popular methods for making product
recommendations (finding people who are similar to you in terms of the products they have
liked or have purchased).

Similarity measures underlie certain solutions to other data mining tasks, such as classification,
regression, and clustering.
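
To make this concrete, a minimal sketch of similarity matching with NumPy (the "firmographic" feature vectors below are made up for illustration):

import numpy as np

# Two made-up descriptive features per company, e.g. years in business and employees
companies = np.array([[5.0, 120.0],
                      [3.0, 40.0],
                      [6.0, 110.0]])
best_customer = np.array([5.2, 118.0])

# Euclidean distance to every row; the smallest distance marks the most similar company
dists = np.linalg.norm(companies - best_customer, axis=1)
print(dists.argmin())

## 0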

Task 3. Profiling (aka. behavior description)


attempts to characterize the typical behavior of an individual, group, or population.

An example profiling question would be: “What is the typical cell phone usage of this customer
segment?”

Behavior may not have a simple description; profiling cell phone usage might require a
complex description of night and weekend airtime averages, international usage, roaming
charges, text minutes, and so on.

Profiling is often used to establish behavioral norms for anomaly detection applications such as
fraud detection and monitoring for intrusions to computer systems.

For example, if we know what kind of purchases a person typically makes on a credit card, we
can determine whether a new charge on the card fits that profile or not. We can use the degree
of mismatch as a suspicion score and issue an alarm if it is too high.
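
A tiny sketch of such a suspicion score (the charges and the alarm threshold below are made-up assumptions):

import numpy as np

past_charges = np.array([23.0, 41.5, 18.2, 35.0, 27.9])  # the customer's typical profile
new_charge = 420.0

mu, sigma = past_charges.mean(), past_charges.std()
suspicion = abs(new_charge - mu) / sigma   # degree of mismatch as a z-score
if suspicion > 3:                          # threshold chosen arbitrarily
    print('alarm: the charge deviates strongly from the profile')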

Task 4. Clustering

attempts to group individuals in a population together by their similarity.

An example clustering question would be: “Do our customers form natural groups or
segments?”
Clustering is useful in preliminary domain exploration to see which natural groups exist
because these groups in turn may suggest other data mining tasks or approaches.
Clustering also is used as input to decision-making processes.
For example: What products should we offer or develop? How should our customer care teams (or sales teams) be structured?

Task 5. Co-occurrence grouping (or affinity grouping)

also known as frequent itemset mining, association rule discovery, and market-basket analysis,
attempts to find associations between entities based on transactions involving them.

For example: what items are commonly purchased together?


While clustering looks at similarity between objects based on the objects’ attributes, co-occurrence grouping considers similarity of objects based on their appearing together (by frequency counting) in transactions.
Deciding how to act upon this discovery might require some creativity, but it could suggest a
special promotion, product display, or combination offer.
Some recommendation systems also perform a type of affinity grouping by finding, for
example, pairs of books that are purchased frequently by the same people.
Co-occurrence analysis considers both co-occurrence frequency (the support of an association rule) and surprisingness (the novelty of an association rule).
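
A minimal sketch of the frequency counting behind support (toy transactions, made up for illustration):

# Each transaction is the set of items in one basket
transactions = [
    {'beer', 'diapers', 'chips'},
    {'beer', 'diapers'},
    {'bread', 'milk'},
    {'beer', 'chips'},
]
itemset = {'beer', 'diapers'}
# Support = fraction of transactions containing the whole itemset
support = sum(itemset <= t for t in transactions) / len(transactions)
print(support)

## 0.5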

Task 6. Classification and class probability estimation


attempt to predict, for each individual in a population, which of a (small) set of classes this individual belongs to. Usually the classes are mutually exclusive.

Classification problem: “Among all the customers of MegaTelCo, which are likely to respond to a given offer?” In this example the two classes could be called will respond and will not respond.
For a classification task, a data mining procedure produces a model that, given a new
individual, determines which class that individual belongs to.

A closely related task is scoring or class probability estimation. A scoring model applied to an individual produces, instead of a class prediction, a score representing the probability (or some other quantification of likelihood) that that individual belongs to each class.
Classification and scoring are very closely related; as we shall see, a model that can do one
can usually be modified to do the other.
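
A hedged scikit-learn sketch of this relationship: the same fitted model yields hard class predictions via predict() and class probabilities via predict_proba() (the toy data below is made up):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])    # 0 = will not respond, 1 = will respond

clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5]]))         # classification: a hard class label
print(clf.predict_proba([[2.5]]))   # scoring: a probability for each class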

Task 7. Value estimation

attempts to estimate or predict, for each individual, the numerical value of some variable for that individual.

Regression question: How much will a given customer use the service? The property (variable) to be predicted here is service usage, and a model could be generated by looking at other or similar individuals in the population and their historical usage.
Regression is related to classification, but the two are different.
Classification : predicts whether something will happen
Regression : predicts how much something will happen

Task 8. Link prediction (one of the applications of graph mining)

attempts to predict connections between data items, usually by suggesting that a link should exist, and possibly also estimating the strength of the link.

Link prediction is common in social networking systems: “Since you and Karen share 10
friends, maybe you’d like to be Karen’s friend?”
Link prediction can also estimate the strength of a link. For example, for recommending movies
to customers one can think of a graph between customers and the movies they’ve watched or
rated. Within the graph, we search for links that do not exist between customers and movies,
but that we predict should exist and should be strong. These links form the basis for
recommendations.

Task 9. Causal modeling


attempts to help us understand what events or actions actually influence others.

Suppose we use predictive modeling to target advertisements to consumers, and we observe that the targeted consumers indeed purchase at a higher rate subsequent to having been targeted.
Was this because the advertisements influenced the consumers to purchase? Or did the
predictive models simply do a good job of identifying those consumers who would have
purchased anyway? (ha ha!)
Causal modeling includes experiments (A/B tests) and observational methods; both attempt to understand the difference between the two situations—which cannot both happen—where the “treatment” event were to happen, and were not to happen.
In all cases, a careful data scientist should always include with a causal conclusion the exact assumptions that must be made in order for the causal conclusion to hold (there always are such assumptions—always ask).

Statistics & Industrial Control


Open-loop PID (proportional-integral-derivative) control:

$$u(t) = K_P \left( e(t) + \frac{1}{T_I} \int_0^t e(t)\,dt + T_D \frac{de(t)}{dt} \right)$$

From PID control to data-driven closed-loop control.

Capability for Analyzing Big Data

What do we need to have?

Data sensitivity
What kind of computation and visualization can we do under nominal, ordinal, interval, and ratio scales? Under structured and ill-structured data? One more step towards modeling approaches and algorithms.
Data mashups
Get the right meaning from different kinds of data - record, graph, sequence, text, audio, and video - and try to mix them together in your analysis.
Model mashups
From Statistics and Machine Learning to the backdrop hung by Operations Research.
Prototyping tools
Hands-on through R, Python, Julia … Learning by doing, doing along with learning.
Fusion with other information technologies (IT)
Linux, Web, Cloud, Hadoop, Spark, NoSQL … Learning never ends.
All built on business understanding!


Three Types of Model

Hands-on Cases Oriented Course Design

Motivation

The data mining (or machine learning, or predictive modeling) process is inherently hands-on.

An article about computational science in a scientific publication is not the scholarship itself, it
is merely advertising of the scholarship. The actual scholarship is the complete software
development environment and the complete set of instructions which generated the figures.

from Buckheit, J. and Donoho, D.L. (1995), “WaveLab and Reproducible Research”, in A.
Antoniadis, G. Oppenheim (eds.), “Wavelets in Statistics”, pp. 55-82, Springer-Verlag, New
York.

Programming Languages in Computer Science

Fourth Generation Language

PHP, R, Python, …
They are dynamic and evolving. Don't be surprised: new packages may be uploaded just after our classes end.


Third Generation Language

Fortran, Basic, Pascal, C/C++, Java, …

Static, with not much change even after ten years.

Second Generation Language

Assembler
Whole groups of bit operations are assembled.

First Generation Language

Machine code

Programming a computer directly at the level of 0 and 1 states, or bits.

Pay particular attention to the differences between fourth- and third-generation programming languages while learning data-driven programming.

Object-Oriented Programming in Python


Most Python scripts look as follows:
Import the relevant class or function first.

from sklearn.naive_bayes import MultinomialNB

Define the model you want to build. It’s a model with unknown parameters.

clf = MultinomialNB()

Input the training data and Python will fit the model; we then have a parameterized model.

clf.fit(sms_dtm_train, sms_raw_train['type'])

Use the model to find the predictions.

pred = clf.predict(sms_dtm_test)
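
The four fragments above refer to objects (sms_dtm_train, sms_raw_train, sms_dtm_test) built elsewhere in the course materials. A self-contained sketch of the same flow, with toy messages standing in for the SMS data:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_msgs = ['win cash now', 'meeting at noon', 'free prize win', 'see you at lunch']
train_type = ['spam', 'ham', 'spam', 'ham']

vec = CountVectorizer()
dtm_train = vec.fit_transform(train_msgs)  # document-term matrix

clf = MultinomialNB()                      # a model with unknown parameters
clf.fit(dtm_train, train_type)             # now a parameterized model

dtm_test = vec.transform(['free cash prize'])
print(clf.predict(dtm_test))               # the prediction for the new message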

Please carefully differentiate the following things for each line of Python script:

Class name or type name
Object name defined by ourselves (usually located on the left-hand side of an assignment operator, such as "=")
Method name
Function name
Argument names in a function call are often omitted, but that is not good practice; pay attention to the argument values.
Other reserved names

Functional Programming in R

Most R scripts look as follows:

kmeans.results <- kmeans(x = iris2, centers = 3)
object <- function_name(argument1 = value1, argument2 = value2, …)

Please carefully differentiate the following things for each line of R script:

Object name defined by ourselves (usually located on the left-hand side of an assignment operator, such as "<-" or "=")
Function name
Argument names in a function call are often omitted, but that is not good practice; pay attention to the argument values.
Other reserved names

?reserved

Data-Driven Programming in R and Python

What is the class (or type) and structure of the input object?
Pay attention to variations in argument settings.
Different settings mean different types of transformation.

Always look up the documentation to understand different settings.

What is the class (or type) and structure of the output object?

Results (or output) are usually encapsulated in a list in R, or exposed as attributes with a suffix _ in Python.


The class (or type) name of the output is usually the same as the name of the function that creates the object.

Reminders Before Hands-On Practice

How do R and Python work? There are some key concepts.

Object-oriented, especially the data objects
A package encloses data, functions, and documentation.
Both R and Python support functional programming, as mentioned above (input(s) -> arguments -> output(s))
As data science tools, data objects are, of course, the main focus.
So always pay attention to the class of a data object and its dimensions.

Most R scripts involve several functions; comprehend the script part by part and read it from the inside out.
Do not neglect error messages; reading and understanding them will improve you a lot.

It is better to type the scripts yourself than to copy and paste.

R is case-sensitive, and all parentheses should be in pairs.

Continuously improve your English and self-learning ability.

The Steps of Using Machine Learning to Analyze Data

Collecting data
Exploring and preparing the data
Training a model
Evaluating model performance
Improving model performance


Algorithm Selection According to the Characteristics of Data

Supervised learning: Supervised learning starts with a set of observations containing values for both the predictor variables and the outcome. The dataset is then divided into a training sample and a validation sample. A predictive model is developed using the data in the training sample and tested for accuracy using the data in the validation sample (see the sketch below).
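
A hedged sketch of that split with scikit-learn (X, y, and the 70/30 proportion are arbitrary choices):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 observations, 2 predictor variables
y = np.array([0, 1] * 5)           # the outcome

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=1)
print(X_train.shape, X_valid.shape)

## (7, 2) (3, 2)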

Model                                   Task
Nearest Neighbor Method                 Classification
Naive Bayes                             Classification
Decision Tree                           Classification
Classification Rule Learning Algorithm  Classification
Linear Regression                       Numerical Prediction
Regression Tree                         Numerical Prediction
Model Tree                              Numerical Prediction
Neural Network                          Both
Support Vector Machine                  Both

Unsupervised learning: As opposed to predictive models (i.e. supervised learning) that predict
a target of interest, in unsupervised learning, no single feature is more important than any
other. In fact, because there is no target to learn, the process of training a descriptive model is
called unsupervised learning.

Model              Task
Association Rules  Pattern Detection
k-means            Clustering

Prologue - An Interesting Observation Related to Clustering
Have you ever spent time watching a large crowd? If so, you are likely to have seen some
recurring personalities.
Perhaps a certain type of person, identified by a freshly pressed suit and a briefcase,
comes to typify the “fat cat” business executive.
A twenty-something wearing skinny jeans, a flannel shirt, and sunglasses might be
dubbed a “hipster,” while a woman unloading children from a minivan may be labeled a
“soccer mom.”
Of course, applying stereotypes to individuals is dangerous, as no two people are exactly alike. Yet, understood as a way to describe a collective (which is what clustering aims to do), the

labels capture some underlying aspect of similarity among the individuals within the group.
Clustering is an unsupervised machine learning task that automatically divides the data into
clusters, or groups of similar items. It does this without having been told how the groups
should look ahead of time.
As we may not even know what we’re looking for, clustering is used for knowledge discovery
rather than prediction. It provides an insight into the natural groupings found within data.

Clustering - 1
Clustering is guided by the principle that items or objects inside a cluster should be very similar
to each other, but very different from those outside.
The definition of similarity might vary across applications, but the basic idea is always the same: group the data so that related elements are placed together.
Clustering methods are employed in applications such as:
Segmenting customers into groups with similar demographics or buying patterns for
targeted marketing campaigns
Detecting anomalous behavior, such as unauthorized network intrusions, by identifying
patterns of use falling outside the known clusters
Simplifying extremely large datasets by grouping features with similar values into a
smaller number of homogeneous categories

Clustering - 2
Type of Clusterings
Partitional clustering
Hierarchical clustering
Types of Clusters
Well-separated
Center-based
Contiguous
Density-based
Property or conceptual
Described by an Objective Function
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Density-based clustering
Graph-based clustering

The Optimization Problem

$$\arg\min_{S} \; \sum_{j=1}^{k} \sum_{x_i \in S_j} \left\| x_i - \bar{x}_j \right\|^2$$

K-means Clustering


Pseudocode of K-means

Select K points as the initial centroids
repeat
    Form K clusters by assigning all points to the closest centroid (corresponds to Expectation)
    Recompute the centroid of each cluster (corresponds to Maximization)
until the centroids don't change
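
A minimal NumPy sketch of this loop (assuming a numeric 2-D array X; the seed and the iteration cap are arbitrary choices, and a production version would also guard against empty clusters):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]   # initial centroids
    for _ in range(max_iter):
        # Expectation: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Maximization: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):         # centroids don't change
            break
        centroids = new_centroids
    return labels, centroids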

Limitations of K-means

Sizes

Densities

Non-globular shapes

Outliers

The K-means algorithm uses a heuristic process that finds locally optimal solutions. Put simply, it starts with an initial guess for the cluster assignments, and then modifies the assignments slightly to see whether the changes improve the homogeneity within the clusters.

Animation Demonstration of the K-Means Algorithm
# not run here, please execute the following by yourself
library(animation)
kmeans.ani()

Hands-On Case: Teenage Market Segmentation
Interacting with friends on a social networking service (SNS), such as Facebook, has become a rite of passage for teenagers around the world.

These teenagers are a coveted demographic for businesses hoping to sell snacks, beverages,
electronics, and hygiene products.

One way to gain this edge is to identify segments of teenagers who share similar tastes, so that marketers can avoid targeting advertisements to teens with no interest in the product being sold.

Given the text of teenagers’ SNS pages, we can identify groups that share common interests
such as sports, religion, or music.

Clustering can automate the process of discovering the natural segments in this population.

However, it will be up to us to decide whether or not the clusters are interesting and how we
can use them for advertising.


Download a dataset representing a random sample of 30,000 U.S. high school students who had profiles on a well-known SNS; each teen's gender, age, and number of SNS friends was recorded.

A text mining tool was used to divide the remaining SNS page content into words (i.e. word
segmentation). From the top 500 words appearing across all the pages, 36 words were chosen
to represent five categories of interests: namely extracurricular activities, fashion, religion,
romance, and antisocial behavior.

Such as football, sexy, kissed, bible, shopping, death, and drugs…

Collecting Data
Set up the configuration for us to program in R and Python interchangeably.

import numpy as np
import pandas as pd
teens = pd.read_csv('./snsdata.csv', encoding = 'utf-8')

Data understanding

print(teens.dtypes)


## gradyear int64
## gender object
## age float64
## friends int64
## basketball int64
## football int64
## soccer int64
## softball int64
## volleyball int64
## swimming int64
## cheerleading int64
## baseball int64
## tennis int64
## sports int64
## cute int64
## sex int64
## sexy int64
## hot int64
## kissed int64
## dance int64
## band int64
## marching int64
## music int64
## rock int64
## god int64
## church int64
## jesus int64
## bible int64
## hair int64
## dress int64
## blonde int64
## mall int64
## shopping int64
## clothes int64
## hollister int64
## abercrombie int64
## die int64
## death int64
## drunk int64
## drugs int64
## dtype: object

Exploring and Preparing the Data - 1

For an R data.frame, the generic function summary() would tell us that NAs appear only in 'gender' and 'age'; here pandas' describe() serves the same purpose.

print(teens.describe(include='all'))


## gradyear gender ... drunk drugs


## count 30000.000000 27276 ... 30000.000000 30000.000000
## unique NaN 2 ... NaN NaN
## top NaN F ... NaN NaN
## freq NaN 22054 ... NaN NaN
## mean 2007.500000 NaN ... 0.087967 0.060433
## std 1.118053 NaN ... 0.399125 0.345522
## min 2006.000000 NaN ... 0.000000 0.000000
## 25% 2006.750000 NaN ... 0.000000 0.000000
## 50% 2007.500000 NaN ... 0.000000 0.000000
## 75% 2008.250000 NaN ... 0.000000 0.000000
## max 2009.000000 NaN ... 8.000000 16.000000
##
## [11 rows x 40 columns]

Exploring and Preparing the Data - 2


Check the nullity of the whole dataset.

teens.isnull().sum()


## gradyear 0
## gender 2724
## age 5086
## friends 0
## basketball 0
## football 0
## soccer 0
## softball 0
## volleyball 0
## swimming 0
## cheerleading 0
## baseball 0
## tennis 0
## sports 0
## cute 0
## sex 0
## sexy 0
## hot 0
## kissed 0
## dance 0
## band 0
## marching 0
## music 0
## rock 0
## god 0
## church 0
## jesus 0
## bible 0
## hair 0
## dress 0
## blonde 0
## mall 0
## shopping 0
## clothes 0
## hollister 0
## abercrombie 0
## die 0
## death 0
## drunk 0
## drugs 0
## dtype: int64

Training Model - 1
Feature selection: cluster analysis on the 36 keyword columns.

interests = teens.loc[:, 'basketball':'drugs']

Training Model - 2


The scales of the term frequencies differ, so normalization (centering and scaling) helps to prevent some features from dominating others.

from sklearn import preprocessing

teens_z=preprocessing.scale(interests)
teens_z = pd.DataFrame(teens_z)
teens_z.head(6)

## 0 1 2 ... 33 34 35
## 0 -0.332217 -0.357697 -0.242874 ... -0.261530 -0.220403 -0.174908
## 1 -0.332217 1.060049 -0.242874 ... -0.261530 -0.220403 -0.174908
## 2 -0.332217 1.060049 -0.242874 ... 2.027908 -0.220403 -0.174908
## 3 -0.332217 -0.357697 -0.242874 ... -0.261530 -0.220403 -0.174908
## 4 -0.332217 -0.357697 -0.242874 ... -0.261530 2.285122 2.719316
## 5 -0.332217 -0.357697 -0.242874 ... -0.261530 2.285122 -0.174908
##
## [6 rows x 36 columns]
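
Note that preprocessing.scale() returns a plain NumPy array, so the keyword names are lost (hence the 0 … 35 column headers above). A small sketch of one way to keep them:

# Same standardized values as teens_z above, just with named columns
teens_z = pd.DataFrame(preprocessing.scale(interests),
                       columns=interests.columns)
teens_z.head(6)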

Training Model - 3

k-means clustering: why n_clusters = 5?
Five stereotypes in The Breakfast Club by John Hughes (1985) - a Brain, an Athlete, a Basket Case, a Princess, and a Criminal - so let's start with k = 5.

from sklearn.cluster import KMeans

mdl = KMeans(n_clusters = 5)

In R's kmeans output, tot.withinss is the sum of the five withinss values, and totss is the sum of tot.withinss and betweenss (how to check this in Python? see the sketch after the fitting code below).

# Input the standardized document-term matrix for model fitting
mdl.fit(teens_z)

## KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
##        n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
##        random_state=None, tol=0.0001, verbose=0)

# Get the attributes and methods after the first fit
pre = dir(mdl)
# Show a few of them
print(pre[51:56])

## ['score', 'set_params', 'tol', 'transform', 'verbose']

# Fit again
mdl.fit(teens_z)

## KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
##        n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
##        random_state=None, tol=0.0001, verbose=0)

# Get the attributes and methods after refitting
post = dir(mdl)
# Show again
print(post[51:56])

## ['score', 'set_params', 'tol', 'transform', 'verbose']

# Difference set between 'post' and 'pre' (empty: both snapshots were taken after fitting)
print(list(set(post) - set(pre)))

## []
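
A hedged sketch of the check promised above: in scikit-learn, inertia_ plays the role of R's tot.withinss, and the identity totss = tot.withinss + betweenss should hold up to rounding:

tot_withinss = mdl.inertia_   # sum of the within-cluster sums of squares
totss = ((teens_z - teens_z.mean(axis=0))**2).values.sum()   # total sum of squares
betweenss = totss - tot_withinss
print(tot_withinss, betweenss, totss)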

Model Performance Evaluation

Unsupervised learning results can be somewhat subjective, so they are difficult to evaluate.

Quantitative vs. qualitative evaluation (cluster validity: within-cluster sum of squares / between-cluster sum of squares)

If the groups are too large or too small, they are not likely to be very useful.

Saving the sklearn model and reading it in again

# not run here


import pickle
filename = './_data/kmeans.sav'
# pickle.dump(mdl, open(filename, 'wb'))
res = pickle.load(open(filename, 'rb'))

pd.Series(mdl.labels_).value_counts()

## 1 22440
## 3 5827
## 2 1124
## 0 608
## 4 1
## dtype: int64


Cluster analysis results need human interpretation

mdl.labels_[:10]

## array([1, 3, 1, 1, 2, 1, 1, 1, 1, 3], dtype=int32)

# Check the shape of the cluster centers matrix
print(mdl.cluster_centers_.shape)

## (5, 36)

# Create a pandas DataFrame with the keywords as column names for better presentation
cen = pd.DataFrame(mdl.cluster_centers_, index = range(5),
                   columns=teens.iloc[:,4:40].columns)
print(cen)

## basketball football soccer ... death drunk drugs


## 0 -0.086946 0.062030 -0.101212 ... 0.054774 -0.088533 -0.065422
## 1 -0.147462 -0.149321 -0.080005 ... -0.077514 -0.084582 -0.111042
## 2 0.362595 0.378270 0.135084 ... 0.906592 1.748383 2.580394
## 3 0.506195 0.494311 0.292165 ... 0.115203 -0.005238 -0.063782
## 4 -0.332217 2.477795 -0.242874 ... 13.475099 14.812744 -0.174908
##
## [5 rows x 36 columns]


Line Plot of Term Frequency for the Five Clusters

Interpretation of the Clustering

A line plot of the 36 words for each cluster:

# Transpose the cluster centers matrix for plotting


ax = cen.T.plot()
# x-axis ticks position setting
ax.set_xticks(list(range(36)))
# x-axis labels setting (low-level plotting)


## [<matplotlib.axis.XTick object at 0x7ffe3888a890>, <matplotlib.axis.XTick object at 0x7ffe3888eb50>, ...]  (echoed list of the 36 XTick objects; truncated)

ax.set_xticklabels(list(cen.T.index), rotation=90)

## [Text(0, 0, 'basketball'), Text(0, 0, 'football'), ..., Text(0, 0, 'drugs')]  (echoed list of the 36 tick labels; truncated)

fig = ax.get_figure()
fig.tight_layout()
# fig.savefig('./_img/sns_lineplot.png')

Model Performance Improvement - 1


Append the clustering results to the original data frame 'teens'.

teens = pd.concat([teens,pd.Series(mdl.labels_).rename('cluster')], axis=1)


Some columns of new table

teens[['gender','age','friends','cluster']][0:5]

## gender age friends cluster


## 0 M 18.982 7 1
## 1 F 18.801 0 3
## 2 M 18.335 69 1
## 3 F 18.875 0 1
## 4 NaN 18.995 10 2

Look at the mean age of each cluster; the differences are small because valid teenager ages range from 13 to 20.

teens.groupby('cluster').aggregate({'age': np.mean})

## age
## cluster
## 0 18.137230
## 1 18.129906
## 2 17.533893
## 3 17.567499
## 4 18.119000

Model Performance Improvement - 2


Find the female ratio of each cluster.
Notice the Princesses cluster.

# The dataset itself has no 'female' column; create a dummy first
# (treating a missing gender as not female - an assumption)
teens['female'] = (teens['gender'] == 'F').astype(int)
teens.groupby('cluster').aggregate({'female': np.mean})

Calculate the average number of friends for each cluster.
Notice the Criminals and Basket Cases.

teens.groupby('cluster').aggregate({'friends': np.mean})

## friends
## cluster
## 0 32.804276
## 1 27.843048
## 2 30.250000
## 3 38.887249
## 4 44.000000

Dimensionality Reduction and Principal Component Analysis

Matrix scheme for PCA

Cell Segmentation Case

import pandas as pd
import numpy as np
cell = pd.read_csv('segmentationOriginal.csv')

Data Understanding and Missing Value Identification

cell.head(2)

## Cell Case Class ... WidthStatusCh1 XCentroid YCentroid


## 0 207827637 Test PS ... 2 42 14
## 1 207932307 Train PS ... 1 215 347
##
## [2 rows x 119 columns]

cell.info() # RangeIndex, Columns, dtypes, memory usage


## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2019 entries, 0 to 2018
## Columns: 119 entries, Cell to YCentroid
## dtypes: float64(49), int64(68), object(2)
## memory usage: 1.8+ MB

cell.shape

## (2019, 119)

cell.columns.values # 119 variable names

cell.dtypes


## Cell int64
## Case object
## Class object
## AngleCh1 float64
## AngleStatusCh1 int64
## AreaCh1 int64
## AreaStatusCh1 int64
## AvgIntenCh1 float64
## AvgIntenCh2 float64
## AvgIntenCh3 float64
## AvgIntenCh4 float64
## AvgIntenStatusCh1 int64
## AvgIntenStatusCh2 int64
## AvgIntenStatusCh3 int64
## AvgIntenStatusCh4 int64
## ConvexHullAreaRatioCh1 float64
## ConvexHullAreaRatioStatusCh1 int64
## ConvexHullPerimRatioCh1 float64
## ConvexHullPerimRatioStatusCh1 int64
## DiffIntenDensityCh1 float64
## DiffIntenDensityCh3 float64
## DiffIntenDensityCh4 float64
## DiffIntenDensityStatusCh1 int64
## DiffIntenDensityStatusCh3 int64
## DiffIntenDensityStatusCh4 int64
## EntropyIntenCh1 float64
## EntropyIntenCh3 float64
## EntropyIntenCh4 float64
## EntropyIntenStatusCh1 int64
## EntropyIntenStatusCh3 int64
## ...
## ShapeP2ACh1 float64
## ShapeP2AStatusCh1 int64
## SkewIntenCh1 float64
## SkewIntenCh3 float64
## SkewIntenCh4 float64
## SkewIntenStatusCh1 int64
## SkewIntenStatusCh3 int64
## SkewIntenStatusCh4 int64
## SpotFiberCountCh3 int64
## SpotFiberCountCh4 int64
## SpotFiberCountStatusCh3 int64
## SpotFiberCountStatusCh4 int64
## TotalIntenCh1 int64
## TotalIntenCh2 int64
## TotalIntenCh3 int64
## TotalIntenCh4 int64
## TotalIntenStatusCh1 int64
## TotalIntenStatusCh2 int64
## TotalIntenStatusCh3 int64
## TotalIntenStatusCh4 int64
## VarIntenCh1 float64


## VarIntenCh3 float64
## VarIntenCh4 float64
## VarIntenStatusCh1 int64
## VarIntenStatusCh3 int64
## VarIntenStatusCh4 int64
## WidthCh1 float64
## WidthStatusCh1 int64
## XCentroid int64
## YCentroid int64
## Length: 119, dtype: object

cell.describe(include = "all")

##                 Cell  Case Class  ...  WidthStatusCh1    XCentroid    YCentroid
## count   2.019000e+03  2019  2019  ...     2019.000000  2019.000000  2019.000000
## unique           NaN     2     2  ...             NaN          NaN          NaN
## top              NaN  Test    PS  ...             NaN          NaN          NaN
## freq             NaN  1010  1300  ...             NaN          NaN          NaN
## mean    2.084024e+08   NaN   NaN  ...        0.271421   260.727093   177.343239
## std     2.790457e+05   NaN   NaN  ...        0.607706   140.365593   107.720132
## min     2.078276e+08   NaN   NaN  ...        0.000000     9.000000     8.000000
## 25%     2.083325e+08   NaN   NaN  ...        0.000000   142.000000    88.000000
## 50%     2.083843e+08   NaN   NaN  ...        0.000000   262.000000   165.000000
## 75%     2.084052e+08   NaN   NaN  ...        0.000000   382.000000   253.000000
## max     2.109641e+08   NaN   NaN  ...        2.000000   501.000000   501.000000
##
## [11 rows x 119 columns]

cell.isnull().any() # check NA by column


cell.isnull().values.any() # False means no missing values; check the difference between the two calls above

#cell.isnull()
#type(cell.isnull()) # pandas.core.frame.DataFrame, so .index, .columns, and .values are three important attributes

#cell.isnull().values
#type(cell.isnull().values) # numpy.ndarray

## False

cell.isnull().sum()

Select the training set

#cell['Case'].nunique()
cell['Case'].unique()

## array(['Test', 'Train'], dtype=object)

cell.Case.value_counts()
#select the training set

## Test 1010
## Train 1009
## Name: Case, dtype: int64

cell_train = cell.loc[cell['Case']=='Train'] # same as cell[cell['Case']=='Train']
cell_train.head()

## Cell Case Class ... WidthStatusCh1 XCentroid YCentroid


## 1 207932307 Train PS ... 1 215 347
## 2 207932463 Train WS ... 0 371 252
## 3 207932470 Train PS ... 0 487 295
## 11 207932484 Train WS ... 0 211 495
## 14 207932459 Train PS ... 0 172 207
##
## [5 rows x 119 columns]

cell['Case'][:10]


## 0 Test
## 1 Train
## 2 Train
## 3 Train
## 4 Test
## 5 Test
## 6 Test
## 7 Test
## 8 Test
## 9 Test
## Name: Case, dtype: object

type(cell['Case']) # <class 'pandas.core.series.Series'>

## <class 'pandas.core.series.Series'>

cell[['Case']][:10]

## Case
## 0 Test
## 1 Train
## 2 Train
## 3 Train
## 4 Test
## 5 Test
## 6 Test
## 7 Test
## 8 Test
## 9 Test

type(cell[['Case']]) # <class 'pandas.core.frame.DataFrame'>

## <class 'pandas.core.frame.DataFrame'>

Create feature matrix (X)

cell_data = cell_train.drop(['Cell','Class','Case'], axis=1)


cell_data.head()

# alternative way to do the same thing


## AngleCh1 AngleStatusCh1 AreaCh1 ... WidthStatusCh1 XCentroid YC


entroid
## 1 133.752037 0 819 ... 1 215
347
## 2 106.646387 0 431 ... 0 371
252
## 3 69.150325 0 298 ... 0 487
295
## 11 109.416426 0 256 ... 0 211
495
## 14 104.278654 0 258 ... 0 172
207
##
## [5 rows x 116 columns]

cell_data = cell_train.drop(cell_train.columns[0:3], axis=1)
cell_data.head()

## AngleCh1 AngleStatusCh1 AreaCh1 ... WidthStatusCh1 XCentroid YC


entroid
## 1 133.752037 0 819 ... 1 215
347
## 2 106.646387 0 431 ... 0 371
252
## 3 69.150325 0 298 ... 0 487
295
## 11 109.416426 0 256 ... 0 211
495
## 14 104.278654 0 258 ... 0 172
207
##
## [5 rows x 116 columns]

Create class label vector (y) (label encoding and one-hot encoding)

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder # encodes labels with values between 0 and n_classes-1

# label encoding
le_class = LabelEncoder().fit(cell['Class']) # 'PS': 0, 'WS': 1
Class_label = le_class.transform(cell['Class']) # 0: PS, 1: WS
Class_label.shape # (2019,)

## (2019,)

# one-hot encoding


ohe_class = OneHotEncoder(sparse=False).fit(Class_label.reshape(-1,1)) # sparse: boolean, default=True; returns a sparse matrix if True, else an array
#help(OneHotEncoder)

## /opt/anaconda3/lib/python3.7/site-packages/sklearn/preprocessing/_encoders.py:414: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.
## If you want the future behaviour and silence this warning, you can specify "categories='auto'".
## In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
## warnings.warn(msg, FutureWarning)

ohe_class.get_params()
#{'categorical_features': 'all',
# 'dtype': float,
# 'handle_unknown': 'error',
# 'n_values': 'auto',
# 'sparse': False}
#ohe_class.categorical_features

## {'categorical_features': None, 'categories': None, 'drop': None, 'dtype': <class 'numpy.float64'>, 'handle_unknown': 'error', 'n_values': None, 'sparse': False}

Class_ohe=ohe_class.transform(Class_label.reshape(-1,1)) # (2019, 2)

Class_label.reshape(-1,1).shape # (2019, 1) different to 1darray (2019,)

## (2019, 1)

Class_ohe.shape # (2019, 2) 2darray

## (2019, 2)

Class_ohe


## array([[1., 0.],
## [1., 0.],
## [0., 1.],
## ...,
## [1., 0.],
## [0., 1.],
## [0., 1.]])

# Fast way to do one-hot encoding or dummy encoding


Class_dum = pd.get_dummies(cell['Class'])
print (Class_dum.head())

## PS WS
## 0 1 0
## 1 1 0
## 2 0 1
## 3 1 0
## 4 1 0

Differentiate categorical features from numeric features

print(cell_data.columns)

## Index(['AngleCh1', 'AngleStatusCh1', 'AreaCh1', 'AreaStatusCh1', 'AvgIntenCh1',
##        'AvgIntenCh2', 'AvgIntenCh3', 'AvgIntenCh4', 'AvgIntenStatusCh1',
##        'AvgIntenStatusCh2',
##        ...
##        'VarIntenCh1', 'VarIntenCh3', 'VarIntenCh4', 'VarIntenStatusCh1',
##        'VarIntenStatusCh3', 'VarIntenStatusCh4', 'WidthCh1', 'WidthStatusCh1',
##        'XCentroid', 'YCentroid'],
##       dtype='object', length=116)

type(cell_data.columns) # pandas.core.indexes.base.Index

## <class 'pandas.core.indexes.base.Index'>

dir(pd.Series.str)


## ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__',
## '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__',
## '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__',
## '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_freeze',
## '_get_series_list', '_make_accessor', '_validate', '_wrap_result', 'capitalize', 'cat', 'center',
## 'contains', 'count', 'decode', 'encode', 'endswith', 'extract', 'extractall', 'find', 'findall',
## 'get', 'get_dummies', 'index', 'isalnum', 'isalpha', 'isdecimal', 'isdigit', 'islower', 'isnumeric',
## 'isspace', 'istitle', 'isupper', 'join', 'len', 'ljust', 'lower', 'lstrip', 'match', 'normalize',
## 'pad', 'partition', 'repeat', 'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit',
## 'rstrip', 'slice', 'slice_replace', 'split', 'startswith', 'strip', 'swapcase', 'title',
## 'translate', 'upper', 'wrap', 'zfill']

pd.Series(cell_data.columns).str.contains("Status").head() # logical indices after making cell_data.columns a pandas.Series
#type(pd.Series(cell_data.columns).str.contains("Status")) # pandas.core.series.Series

## 0 False
## 1 True
## 2 False
## 3 True
## 4 False
## dtype: bool

cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")] # again a pandas.core.indexes.base.Index
#type(cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")]) # pandas.core.indexes.base.Index


## Index(['AngleStatusCh1', 'AreaStatusCh1', 'AvgIntenStatusCh1',
##        'AvgIntenStatusCh2', 'AvgIntenStatusCh3', 'AvgIntenStatusCh4',
##        'ConvexHullAreaRatioStatusCh1', 'ConvexHullPerimRatioStatusCh1',
##        'DiffIntenDensityStatusCh1', 'DiffIntenDensityStatusCh3',
##        'DiffIntenDensityStatusCh4', 'EntropyIntenStatusCh1',
##        'EntropyIntenStatusCh3', 'EntropyIntenStatusCh4', 'EqCircDiamStatusCh1',
##        'EqEllipseLWRStatusCh1', 'EqEllipseOblateVolStatusCh1',
##        'EqEllipseProlateVolStatusCh1', 'EqSphereAreaStatusCh1',
##        'EqSphereVolStatusCh1', 'FiberAlign2StatusCh3', 'FiberAlign2StatusCh4',
##        'FiberLengthStatusCh1', 'FiberWidthStatusCh1', 'IntenCoocASMStatusCh3',
##        'IntenCoocASMStatusCh4', 'IntenCoocContrastStatusCh3',
##        'IntenCoocContrastStatusCh4', 'IntenCoocEntropyStatusCh3',
##        'IntenCoocEntropyStatusCh4', 'IntenCoocMaxStatusCh3',
##        'IntenCoocMaxStatusCh4', 'KurtIntenStatusCh1', 'KurtIntenStatusCh3',
##        'KurtIntenStatusCh4', 'LengthStatusCh1', 'MemberAvgAvgIntenStatusCh2',
##        'MemberAvgTotalIntenStatusCh2', 'NeighborAvgDistStatusCh1',
##        'NeighborMinDistStatusCh1', 'NeighborVarDistStatusCh1',
##        'PerimStatusCh1', 'ShapeBFRStatusCh1', 'ShapeLWRStatusCh1',
##        'ShapeP2AStatusCh1', 'SkewIntenStatusCh1', 'SkewIntenStatusCh3',
##        'SkewIntenStatusCh4', 'SpotFiberCountStatusCh3',
##        'SpotFiberCountStatusCh4', 'TotalIntenStatusCh1', 'TotalIntenStatusCh2',
##        'TotalIntenStatusCh3', 'TotalIntenStatusCh4', 'VarIntenStatusCh1',
##        'VarIntenStatusCh3', 'VarIntenStatusCh4', 'WidthStatusCh1'],
##       dtype='object')

len(cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")]) # 58 features with "Status"

## 58

cell_num = cell_data.drop(cell_data.columns[pd.Series(cell_data.columns).str.contains("Status")], axis=1)
cell_num.head()

## AngleCh1 AreaCh1 AvgIntenCh1 ... WidthCh1 XCentroid YCentroid


## 1 133.752037 819 31.923274 ... 32.161261 215 347
## 2 106.646387 431 28.038835 ... 21.185525 371 252
## 3 69.150325 298 19.456140 ... 13.392830 487 295
## 11 109.416426 256 18.828571 ... 17.546861 211 495
## 14 104.278654 258 17.570850 ... 17.660339 172 207
##
## [5 rows x 58 columns]

Dimensionality Reduction (dr) by PCA


from sklearn.decomposition import PCA


dr = PCA() # Principal Components Analysis

cell_pca = dr.fit_transform(cell_num) # PCA only for numeric


cell_pca

## array([[ 8.47983436e+04, -1.09206431e+05, 3.08230196e+04, ...,


## -1.85584989e-02, -1.86032858e-02, 1.47400180e-11],
## [-2.14595263e+04, -1.77417260e+04, -1.78468558e+03, ...,
## -2.00628855e-02, 5.54777551e-03, 1.46113095e-11],
## [-5.35770250e+04, 1.03968298e+04, 9.15969476e+03, ...,
## -1.08847740e-02, 2.82838575e-02, 6.42386175e-14],
## ...,
## [-2.57298792e+04, -2.43260560e+04, -5.99996296e+04, ...,
## -1.03950386e-02, 1.48297048e-02, -1.54181159e-12],
## [-2.71587740e+04, -1.94760869e+04, -4.94019505e+04, ...,
## -1.42084059e-02, 3.19415826e-03, -6.61852643e-12],
## [ 1.14504120e+03, -4.19702178e+04, -2.30591893e+04, ...,
## -5.69891526e-03, 2.15411583e-03, 9.98910549e-13]])

dir(dr)

## ['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__',
## '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__',
## '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__',
## '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__',
## '__subclasshook__', '__weakref__', '_abc_impl', '_fit', '_fit_full', '_fit_svd_solver',
## '_fit_truncated', '_get_param_names', '_get_tags', 'components_', 'copy', 'explained_variance_',
## 'explained_variance_ratio_', 'fit', 'fit_transform', 'get_covariance', 'get_params',
## 'get_precision', 'inverse_transform', 'iterated_power', 'mean_', 'n_components', 'n_components_',
## 'n_features_', 'n_samples_', 'noise_variance_', 'random_state', 'score', 'score_samples',
## 'set_params', 'singular_values_', 'svd_solver', 'tol', 'transform', 'whiten']

dr.components_[:2] # [:2] can be removed if you want to see more of the rotation matrix


## array([[-1.20772373e-05, 1.11800234e-03, 1.39935119e-03,


## 9.52570715e-04, 2.48690529e-04, 9.40628085e-04,
## -4.15401765e-07, 9.14069295e-08, 3.98507719e-04,
## 1.40289157e-04, 5.07769261e-04, 6.87236820e-06,
## 8.13559199e-07, 8.52284533e-06, 2.85013735e-05,
## -2.64088142e-06, 3.85927180e-03, 2.52352309e-03,
## 4.47249694e-03, 3.06992719e-02, -4.70489024e-07,
## -3.72382194e-07, 1.89140825e-05, 2.36813742e-05,
## 1.07032223e-07, -3.77790202e-07, -2.60762862e-05,
## -2.88382099e-06, -2.04797676e-06, 4.59901695e-06,
## 3.40979143e-07, -5.41351578e-07, -3.63271978e-06,
## 2.40700604e-06, -7.66636701e-06, 2.44217672e-05,
## 1.14568916e-05, 1.49472962e-05, -1.16156582e-05,
## 8.51909135e-05, 2.09939072e-07, -1.70034827e-06,
## -1.96609334e-06, -1.79799998e-06, 9.67078769e-07,
## -1.88217019e-06, 5.24739478e-06, 8.04863458e-06,
## 7.40593712e-01, 4.69191466e-01, 1.61895667e-01,
## 4.51862380e-01, 7.26861406e-04, 3.71841771e-04,
## 7.59918222e-04, 2.98077199e-05, -1.43658866e-04,
## -1.37681271e-04],
## [ 2.12581608e-05, -2.68374511e-04, 1.30045966e-03,
## -1.12786669e-03, -3.64008269e-04, -1.92919185e-03,
## 3.04717255e-07, 1.00102791e-07, 3.37747427e-04,
## -2.04776688e-04, -1.08017057e-03, 2.11861099e-06,
## -9.51419445e-06, -1.36747734e-05, -1.12801572e-05,
## 4.48336024e-06, 1.82882051e-04, -4.00217100e-04,
## -1.07262609e-03, -2.74478711e-03, 2.09941789e-08,
## 1.05117916e-08, -3.23766369e-06, -1.07306291e-05,
## 6.66252157e-07, 6.27804271e-07, 1.49711626e-05,
## -1.62046940e-06, -6.35220368e-06, -7.85177205e-06,
## 7.38650679e-07, 9.91619257e-07, -5.63744748e-06,
## 2.25840662e-05, 1.66775434e-05, 9.16663635e-06,
## -1.88640693e-05, -2.37173687e-05, -1.61950483e-05,
## -2.79365856e-05, -1.50508544e-07, 3.06601353e-06,
## 1.33954125e-06, -1.33644490e-06, 3.10996567e-06,
## 4.63405918e-06, 9.05566152e-07, 6.86521445e-06,
## 6.57513374e-01, -3.71191529e-01, -1.50167780e-01,
## -6.38218269e-01, 6.82321741e-04, -2.83726353e-04,
## -1.55158886e-03, -1.82824730e-05, -2.05493644e-04,
## 2.14671008e-05]])

type(dr.components_) # numpy.ndarray

## <class 'numpy.ndarray'>

dr.components_.shape # (58, 58)

## (58, 58)


# scree plot
dr.explained_variance_ratio_
import matplotlib.pyplot as plt
plt.plot(range(1, 26), dr.explained_variance_ratio_[:25], '-o')
plt.xlabel('# of components')
plt.ylabel('ratio of variance explained')

Scree Plot of PCA

# list(range(1,59))
# range(1,59).tolist() # AttributeError: 'range' object has no attribute 'tolist'

cell_dr = cell_pca[:,:5]
cell_dr
# pd.DataFrame(cell_dr).to_csv('cell_dr.csv')

## array([[ 84798.34361854, -109206.43140231, 30823.01955784,


## 17203.74850499, 10043.47510989],
## [ -21459.52630159, -17741.72601452, -1784.68558447,
## -1034.14605652, 2397.90486914],
## [ -53577.02497608, 10396.82979117, 9159.69476304,
## -4910.95656292, 959.5703947 ],
## ...,
## [ -25729.8792278 , -24326.05603993, -59999.62957106,
## -6529.78382658, -3576.79204107],
## [ -27158.77402526, -19476.08687044, -49401.95049235,
## -12577.75498609, -961.91880638],
## [ 1145.04120403, -41970.21778886, -23059.18929957,
## -6122.01345777, 1546.57100978]])
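
A hedged check on how much variance the first few components retain, using the explained_variance_ratio_ attribute shown above (the 90% threshold is an arbitrary choice):

cum = np.cumsum(dr.explained_variance_ratio_)
print(cum[:5])                     # variance retained by the first 1..5 components
print(np.argmax(cum >= 0.9) + 1)   # smallest number of components reaching 90%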

References:
Hastie, T., Tibshirani, R., Friedman, J. (2001), The Elements of Statistical Learning: Data Mining,
Inference, and Prediction, Springer.

Kuhn, M. and Johnson, K. (2013), Applied Predictive Modeling, Springer.


Lantz, B. (2013), Machine Learning with R, Packt Publishing.

Ledolter, J. (2013), Data Mining and Business Analytics with R, John Wiley & Sons.

Provost, F. and Fawcett, T. (2013), Data Science for Business: What You Need to Know About
Data Mining and Data-Analytic Thinking, O’Reilly.

Raschka, S. (2015), Python Machine Learning: Unlock deeper insights into machine learning with this vital guide to cutting-edge predictive analytics, Packt Publishing.

Tan, P.-N., Steinbach, M., and Kumar, V. (2006), Introduction to Data Mining, Pearson.

Torgo, L. (2011), Data Mining with R: Learning with Case Studies, CRC Press.

Varmuza, K. and Filzmoser, P. (2009), Introduction to Multivariate Statistical Analysis in Chemometrics, CRC Press.

Williams, G. (2011), Data Mining with Rattle and R: The Art of Excavating Data for Knowledge
Discovery, Springer.

Zhao, Y. (2013), R and Data Mining: Examples and Case Studies, Academic Press.

Thanks for your attention.

Email me at cstsou @ mail.mcut.edu.tw or cstsou @ ntub.edu.tw if there is anything I can help.

