PYTHON MACHINE
LEARNING
Why Python?
A large community
tons of machine learning specific libraries
easy to learn
TensorFlow makes Python the leading language in the data science community.
About Python
It is case sensitive
Text indentation is important in logic flow!
Use # symbol to add a code comment
Use ''' ''' to comment a block of code
Prepare your machine
Install Python 2.7.xx
pip install pandas
pip install matplotlib sklearn statistics scipy seaborn ipython
pip install --no-cache-dir pystan  # forces a re-download of an already-installed package in case of errors
Get Data set for training through
http://archive.ics.uci.edu/ml/datasets.html
How to Run Python ML application
Open "IDLE (Python GUI)", then File → New File
Write your code in the new window, then Run → Run Module
Alt+3 : comment the selected lines
Alt+4 : uncomment the selected lines
Anaconda
Anaconda is a Python distribution for data scientists.
It has around 270 packages including the most important ones for most scientific
applications, data analysis, and machine learning such as NumPy, SciPy, Pandas,
IPython, matplotlib, and scikit-learn.
Download it for free from https://www.continuum.io/downloads
Import Statistical libraries
import time
import random
import datetime
import pandas as pd
import matplotlib.pyplot as plt
import statistics
from scipy import stats
import sklearn
import seaborn
from IPython.display import Image
import numpy as np
import pandas as pd
A pandas DataFrame is similar to a Python list, but it has an extra index column, and any operation performed on a DataFrame is applied to every single element inside it.
Example of converting a Python list to a DataFrame:
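A minimal sketch (the values and the column name 'price' are illustrative):
import pandas as pd
prices = [3.5, 4.0, 2.75]                       # a plain Python list
df = pd.DataFrame(prices, columns=['price'])    # one data column plus the extra index column
print(df)
df['price'] = df['price'] * 2                   # the operation is applied to every element
print(df)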
Common Operations
1) Set the dataset column names: data.columns = ['blue', 'green', 'red']
2) Get the value of any cell using the row index and column index/name, like this: data.ix[2, 'red']
3) Replace NaN with zero using: data = data.replace(np.nan, 0)
4) Select a subset of columns into a new DataFrame: NewDF = data[['blue', 'red']]
5) Count the rows that have data in a column: data['red'].value_counts().sum()
(missing values are denoted by NaN)
6) Count the distinct values in a column: data['blue'].nunique()
7) Drop duplicate rows: data = data.drop_duplicates()
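A small sketch that ties a few of the operations above together (the column names and values are made up):
import numpy as np
import pandas as pd
data = pd.DataFrame([[1, np.nan, 3], [1, 5, 3], [1, 5, 3]])
data.columns = ['blue', 'green', 'red']      # 1) set the column names
data = data.replace(np.nan, 0)               # 3) replace NaN with zero
NewDF = data[['blue', 'red']]                # 4) subset of columns in a new DataFrame
print(data['red'].value_counts().sum())      # 5) count of rows that have data in a column
print(data['blue'].nunique())                # 6) count of distinct values in a column
data = data.drop_duplicates()                # 7) drop duplicate rows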
Basic commands
Reading the Data from disk into Memory
data = pd.read_csv('examples/trip.csv')
Reading the Data from HDFS into Memory
# `hd` is assumed to be an HDFS client handle (e.g. from the hdfs3 or pydoop library)
with hd.open("/home/file.csv") as f:
    data = pd.read_csv(f)
Printing Size of the Dataset and Printing First/Last Few Rows
print len(data)
data.head()
data.tail()
data.describe()  # data summary: count, mean, std, min/max and quartiles
Order Dataset and get first and last element of a column
data = data.sort_values(by='birthyear')
data = data.reset_index(drop=True)
print data.ix[0, 'birthyear']   # first element
print data.ix[len(data)-1, 'birthyear']   # last element
DataFrame cleaning
If you need to convert a column with negative values to absolute values:
data['AwaitingTime'] = data['AwaitingTime'].apply(lambda x: abs(x))
Convert string values to integer values (DayOfTheWeek to Integers)
Day_mapping = {'Monday' : 0, 'Tuesday' : 1, 'Wednesday' : 2, 'Thursday' : 3, 'Friday' : 4, 'Saturday' : 5, 'Sunday' : 6}
data['DayOfTheWeek'] = data['DayOfTheWeek'].map(Day_mapping)
Remove rows that have an Age value less than zero
data = data[data['Age'] >= 0]
Delete column from Dataset
del data['from_station_id']
Filter data to remove rows that contain NA/0
df = pd.DataFrame({'a':[0,0,1,1], 'b':[0,1,0,1]})
df = df[(df.T != 0).any()]   # keep a row if ANY feature contains a value
print df
df = df[(df.T != 0).all()]   # keep a row if ALL features contain values
print df
Original DataFrame:
   a  b
0  0  0
1  0  1
2  1  0
3  1  1
Keep row if ANY feature contains a value, df[(df.T != 0).any()]:
   a  b
1  0  1
2  1  0
3  1  1
Keep row if ALL features contain values, df[(df.T != 0).all()]:
   a  b
3  1  1
Drop the rows where any/all elements are NaN
data.dropna(how='any')   # by default drops rows; pass axis=1 to drop columns instead
how : {'any', 'all'}
any : if any NA values are present, drop that row/column
all : if all values are NA, drop that row/column
Drop duplicate rows
Consider a row a duplicate if the entire row is the same; return the result in a new DataFrame
dataframe1 = df.drop_duplicates(subset=None, inplace=False)
Consider a row a duplicate if Column1 and Column3 are the same, keep the last of the duplicates, and change the DataFrame in place
df.drop_duplicates(subset=['Column1', 'Column3'], keep='last', inplace=True)
Save unique data
df.to_csv(file_name_output)
Split Dataset, Create training, and test dataset
How do we split the dataset 70-30?
[70% of the observations fall into the training dataset and 30% into the test dataset.]
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data['x'], data['y'], test_size=0.3, random_state=1)
import matplotlib.pyplot as plt
plt.bar – creates a bar chart
plt.scatter – makes a scatter plot
plt.boxplot – makes a box and whisker plot
plt.hist – makes a histogram
plt.plot – creates a line plot
Plot for Data array
import numpy as np
import matplotlib.pyplot as plt
Data=np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
Column1=0
Column2=1
for i in range( len(Data) ):
    plt.plot( Data[i][Column1], Data[i][Column2], 'r.')   # 'r.' = red dot marker
plt.show()
Bar plot for Data array
import numpy as np
import matplotlib.pyplot as plt
Data=np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
Column1=0
Column2=1
BarWidth=1/1.5
for i in range( len(Data) ):
    plt.bar( Data[i][Column1], Data[i][Column2], BarWidth, color="blue")
plt.show()
Scatter plot for Data
import numpy as np
import matplotlib.pyplot as plt
Data=np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
Column1=0
Column2=1
plt.scatter( Data[:, Column1], Data[:, Column2], marker="x", s=150, linewidths=5, zorder=1)
plt.show()
#Where s is the size of the marker
Calculating Pair Plot Between All Features
import numpy as np
import matplotlib.pyplot as plt
data = np.array([ [1,2], [3,2], [4,6], [7,2], [1,4], [9,1] ])
import pandas
import seaborn
df = pandas.DataFrame(data)
# Calculating pair plot between all features
seaborn.pairplot(df)
plt.show()
Basic commands
Sort, Filter, group, and Plotting the Data
data = data.sort_values(by='birthyear')
data = data[(data['birthyear'] >= 1931) & (data['birthyear']<=1999)]
groupby_birthyear = data.groupby('birthyear').size()
groupby_birthyear.plot.bar(title = 'Distribution of birth years', figsize = (15,4))
plt.show()
Stacked column chart
data1 = data.groupby(['birthyear', 'gender']).size().unstack('gender').fillna(0)
data1.plot.bar(title ='Distribution of birth years by Gender', stacked=True, figsize = (15,4))
Convert string to DateTime
Create new column as date/time
data['StartTime1'] = data['starttime'].apply(lambda x: datetime.datetime.strptime(x, "%m/%d/%Y %H:%M") )
Extract Year, Month, Day, and Hour
data['year'] = data['StartTime1'].apply(lambda x: x.year )
data['month'] = data['StartTime1'].apply(lambda x: x.month )
data['day'] = data['StartTime1'].apply(lambda x: x.day )
data['hour'] = data['StartTime1'].apply(lambda x: x.hour )
Split String column to create new columns for Year-Month-Day
for index, component in enumerate(['year', 'month', 'day']):
    data[component] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[index]))
OR
data['year'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[0]))
data['month'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[1]))
data['day'] = data['AppointmentRegistration'].apply(lambda x: int(x.split('T')[0].split('-')[2]))
Notes
'abcde'[0:2]  →  'ab'
'abcde'[:2]   →  'ab'
'abcde'[:-1]  →  'abcd'
Concepts
Mean : Average
Median: the middle data point when the observations are sorted; when the number of observations is even, it is the average of the two middle values.
Mode: returns the observation in the dataset with the highest frequency.
Variance: represents variability of data points about the mean.
Standard deviation: just like variance, also captures the spread of data along
the mean. The only difference is that it is a square root of the variance.
Normal distribution (Gaussian distribution): the mean lies at the center of this
distribution with a spread (i.e., standard deviation) around it. Some 68% of the
observations lie within 1 standard deviation from the mean; 95% of the
observations lie within 2 standard deviations from the mean, whereas 99.7% of
the observations lie within 3 standard deviations from the mean.
Outliers: values that are distinct from the majority of the observations. They occur either naturally, due to equipment failure, or because of entry mistakes.
Plot mean
Plot the mean of 'tripduration' grouped by 'starttime_date'
data.groupby('starttime_date')['tripduration'].mean() \
    .plot.bar(title = 'Distribution of Trip duration by date', figsize = (15,4))
Calculate Mean, Standard deviation, Median
print data['tripduration_mean'].mean()
print data['tripduration_mean'].std()
print data['tripduration_mean'].median()
Seasonal pattern vs Cyclic Pattern vs Trend
Seasonal: a pattern that repeats over a fixed period, such as a monthly pattern.
Cycle: a pattern that repeats over non-periodic time spans; we check whether patterns recur without fixing a period.
Trend: whether the continuous variable increases or decreases in the long term.
[Figure: example of a seasonal pattern, a cyclic pattern, and a trend]
Correlation directions
Calculate the correlation
pd.set_option('precision', 3)
correlations = data[['tripduration','age']].corr()
print(correlations)
.corr(method='pearson')
method : {‘pearson’, ‘kendall’, ‘spearman’}
pearson : standard correlation coefficient
kendall : Kendall Tau correlation coefficient
spearman : Spearman rank correlation
For more information http://www.statisticssolutions.com/correlation-pearson-kendall-spearman/
Rename the dataset's column names
Get the data from the following URL and save it as "concrete_data.csv"
http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength
data = pd.read_csv('examples/concrete_data.csv')
print len(data)
data.head()
Renaming the Columns
data.columns = ['cement_component', 'furnace_slag', 'fly_ash', \
                'water_component', 'superplasticizer', 'coarse_aggregate', \
                'fine_aggregate', 'age', 'concrete_strength']
Loop through dataset
Draw relation between each feature and concrete_strength
plt.figure(figsize=(15,10.5))
plot_count = 1
for feature in list(data.columns)[:-1]:
    plt.subplot(3, 3, plot_count)
    plt.scatter(data[feature], data['concrete_strength'])
    plt.xlabel(feature.replace('_', ' ').title())
    plt.ylabel('Concrete strength')
    plot_count += 1
plt.show()
# Get the correlations between the features
pd.set_option('display.width', 100)
pd.set_option('precision', 3)
correlations = data.corr(method='pearson')
print(correlations)
Dimensionality Reduction
PCA Dimensionality Reduction
Used when we have a big dataset and need to reduce the number of columns for faster training (removing redundant features).
Example: the dataset (data) contains 20 features and we need to reduce them to 4 features.
from sklearn.decomposition import PCA
NewDataFrame = PCA(n_components = 4).fit_transform(data)   # returns a NumPy array with 4 columns
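As a hedged follow-up sketch, fitting the PCA object separately also lets you check how much of the variance the 4 kept components explain (assuming data holds only numeric features):
from sklearn.decomposition import PCA
pca = PCA(n_components=4)
reduced = pca.fit_transform(data)               # same rows, reduced to 4 columns
print(pca.explained_variance_ratio_)            # share of the variance captured by each component
print(pca.explained_variance_ratio_.sum())      # total variance kept by the 4 components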
Association Algorithm
Association Rules
Association rules analysis is a technique to uncover how items are associated with each other.
There are three common ways to measure association:
1) Support: This says how popular an itemset is, as measured by the
proportion of transactions in which an itemset appears.
2) Confidence: This says how likely item Y is purchased when item X is purchased (misleading when Y is popular on its own).
3) Lift: This says how likely item Y is purchased when item X is purchased, while controlling for how popular Y is; a lift greater than 1 means X actually boosts Y. A small worked example follows below.
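A tiny worked example of the three measures for the rule beer → nuts (the transactions below are made up):
transactions = [
    ['beer', 'nuts'],
    ['beer', 'cheese'],
    ['beer', 'nuts'],
    ['milk', 'nuts'],
]
n = float(len(transactions))
support_beer_nuts = sum(1 for t in transactions if 'beer' in t and 'nuts' in t) / n   # 2/4 = 0.50
support_beer = sum(1 for t in transactions if 'beer' in t) / n                         # 3/4 = 0.75
support_nuts = sum(1 for t in transactions if 'nuts' in t) / n                         # 3/4 = 0.75
confidence = support_beer_nuts / support_beer     # P(nuts | beer) = 0.67
lift = confidence / support_nuts                  # 0.89 < 1, so beer does not actually boost nuts
print('support=%.2f confidence=%.2f lift=%.2f' % (support_beer_nuts, confidence, lift))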
Association Rules (Apriori Algorithm)
from apyori import apriori
transactions = [
['beer', 'nuts'],
['beer', 'cheese'],
]
results = list(apriori(transactions))
Association Rules (Example)
pip install apyori
# Import the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Import Dataset
dataset = pd.read_csv('apriori_data2.csv', header = None)
records = []
for i in range(0, 11):
    records.append([str(dataset.values[i,j]) for j in range(0, 10)])
# Train Apriori Model
from apyori import apriori
rules = apriori(records, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)
# Visualising the results
results = list(rules)
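The returned results are apyori RelationRecord objects; a minimal sketch of printing them in readable form (field names as exposed by apyori's namedtuples):
for record in results:
    # each record has .items, .support and a list of .ordered_statistics
    for stat in record.ordered_statistics:
        print('%s -> %s  support=%.3f  confidence=%.3f  lift=%.3f' % (
            list(stat.items_base), list(stat.items_add),
            record.support, stat.confidence, stat.lift))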
Time Series
Supervised machine learning
Time Series
A time series forecast is different from regression in that time acts as an explanatory variable, and the observations should be continuous along equal intervals.
Algorithms
1) Autoregressive Forecast Model: uses observations at previous time steps to predict the observation at a future time step.
2) ARIMA Forecast Model: a linear model that predicts after removing trend and seasonality.
3) Prophet: Facebook's library for forecasting time series data; it is quick and gives very good results.
Related packages: pip install fbprophet
The input to Prophet is always a dataframe with two columns: ds and y.
The ds (datestamp) column must contain a date or datetime (either is fine).
The y column must be numeric and represents the measurement we wish to forecast.
It is good to use log(y) rather than the raw y to damp trends and noise.
FB Prophet
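A minimal sketch of the Prophet workflow; the file name and the original column names are illustrative, while the two-column ds/y format is what Prophet expects:
import numpy as np
import pandas as pd
from fbprophet import Prophet

df = pd.read_csv('examples/daily_counts.csv')          # hypothetical file with 'date' and 'count' columns
df = df.rename(columns={'date': 'ds', 'count': 'y'})
df['y'] = np.log(df['y'])                              # log-transform, as suggested above

model = Prophet()
model.fit(df)
future = model.make_future_dataframe(periods=30)       # extend 30 days past the last date
forecast = model.predict(future)
print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())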
COMPUTER VISION
Open CV for computer vision
To install OpenCV: pip install opencv-python
Images are arrays in which each pixel has up to 4 channels (Blue, Green, Red, and Alpha).
It is common to process images in grayscale for faster performance.
A video is just a loop of images, so any code that processes images can process video too.
For more information about open CV please visit
https://www.tutorialspoint.com/opencv/index.htm
https://www.youtube.com/playlist?list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq
Common OpenCV functions
image = cv2.imread(ImgPATH)   # read an image from disk
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)   # convert to grayscale
cv2.imshow('Window Title', image)   # show the image in a window
cv2.imwrite(ImgPATH, image)   # write the image to disk
_, image = cv2.threshold(image, threshold, maxval, thresholdType)   # set a pixel to maxval if its value is greater than threshold
image = cv2.adaptiveThreshold(image, maxValue, adaptiveMethod, thresholdType, blockSize, C)   # the threshold is calculated automatically
thresholdType : {THRESH_BINARY, THRESH_BINARY_INV, THRESH_TRUNC, THRESH_TOZERO, THRESH_TOZERO_INV}
[Figure: the original image and the result of each thresholdType]
adaptiveMethod
ADAPTIVE_THRESH_MEAN_C : threshold value is the mean of neighborhood area.
ADAPTIVE_THRESH_GAUSSIAN_C : threshold value is the weighted sum of neighborhood values where weights are a Gaussian window.
blockSize : A variable of the integer type representing the size of the pixel neighborhood used to calculate the threshold value.
C : A variable of double type representing the constant used in the both methods (subtracted from the mean or weighted mean).
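A small sketch that puts the threshold calls above together (the image path and the parameter values are illustrative):
import cv2
image = cv2.imread('c:/sample.jpg')                           # hypothetical path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, fixed = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY)   # pixels above 127 become 255, the rest 0
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                 cv2.THRESH_BINARY, 11, 2)    # threshold computed per 11x11 neighborhood, minus C=2
cv2.imshow('Fixed threshold', fixed)
cv2.imshow('Adaptive threshold', adaptive)
cv2.waitKey(0)
cv2.destroyAllWindows()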
OpenCV Video Stream Example
import cv2
# VideoStream1 = cv2.VideoCapture('c:/boxed-correct.avi')   # read the stream from an avi video file
VideoStream1 = cv2.VideoCapture(0)   # read the stream from the first connected video cam
while True:
    IsReturnFrame1, ImgFrame1 = VideoStream1.read()
    if IsReturnFrame1 == 0: break
    GrayImgFrame1 = cv2.cvtColor(ImgFrame1, cv2.COLOR_BGR2GRAY)
    cv2.imshow('ImageTitle1', ImgFrame1)
    cv2.imshow('ImageTitle2', GrayImgFrame1)
    if cv2.waitKey(1) & 0xFF == ord('q'): break   # press q to stop
VideoStream1.release()
cv2.destroyAllWindows()
Image processing using OpenCV
[Figure: the original image and its Threshold_Color, ThresholdGray, Threshold_GAUSSIAN, and Threshold_MEAN versions]
Object Recognition (template matching)
Search a main image for a template image that matches with the exact lighting/scale/angle.
[Figure: main image, template, and the matched result]
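A minimal template-matching sketch (paths and the matching method are illustrative); as noted above, this only finds the template when lighting, scale, and angle match:
import cv2
main_image = cv2.imread('c:/main.jpg', cv2.IMREAD_GRAYSCALE)       # hypothetical paths
template = cv2.imread('c:/template.jpg', cv2.IMREAD_GRAYSCALE)
h, w = template.shape
result = cv2.matchTemplate(main_image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)                      # location of the best match
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
output = cv2.cvtColor(main_image, cv2.COLOR_GRAY2BGR)
cv2.rectangle(output, top_left, bottom_right, (0, 0, 255), 2)       # draw the match in red
print('match score: %.2f' % max_val)
cv2.imshow('Result', output)
cv2.waitKey(0)
cv2.destroyAllWindows()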
NLP
Natural Language Processing
NLP (Natural Language Processing)
Why do we need NLP packages?
NLP packages handle a wide range of tasks such as named-entity recognition, part-of-speech (POS) tagging, sentiment analysis, document classification, topic modeling, and much more.
Named-entity recognition: extracts the names of persons, organizations, locations, times, quantities, percentages, etc.
Example: Jim bought 300 shares of Acme Corp. in 2006. → [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.
Part-of-speech tagging: identifies words as nouns, verbs, adjectives, adverbs, etc.
Top 5 Python NLP libraries
NLTK is good as an education and research tool. Its modularized structure makes it excellent for learning and exploring NLP concepts, but it's not meant for production.
TextBlob is built on top of NLTK and is more easily accessible. It is a good library for fast prototyping or for building applications that don't require highly optimized performance. Beginners should start here.
Stanford's CoreNLP is a Java library with Python wrappers. It's in many existing production systems due to its speed.
SpaCy is a new NLP library that's designed to be fast, streamlined, and production-ready. It's not as widely adopted.
Gensim is most commonly used for topic modeling and similarity detection. It's not a general-purpose NLP library, but for the
tasks it does handle, it does them well.
NLTK Package for Natural Language Processing
pip install nltk
To download all NLTK data packages through a GUI, run nltk.download()
NLTK concepts
Tokenizing
Word tokenizers : split text into words
Sentence tokenizers : split text into sentences
Lexicon
The words and their actual meanings in the given context
Corpora
A body of text, e.g. medical journals, presidential speeches
Lemmatizing (stemming)
Reduce a word to its root, e.g. (gone, went, going) → go
WordNet
A list of different words that have the same meaning as the given word
NLTK Simple example
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."
SentencesArray = sent_tokenize(EXAMPLE_TEXT)   # split the text into sentences
print(SentencesArray)
ListOfAvailableStopWords= set(stopwords.words("english"))
print(ListOfAvailableStopWords)
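A short follow-up sketch (continuing the same example) that word-tokenizes the text, removes the stop words, and stems what is left:
words = word_tokenize(EXAMPLE_TEXT)
filtered_words = [w for w in words if w.lower() not in ListOfAvailableStopWords]
print(filtered_words)
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print([ps.stem(w) for w in filtered_words])   # reduce each word to its stem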
textblob
pip install textblob
python -m textblob.download_corpora
Sentiment Analysis Example
We have a sample of 3,000 random user comments from the imdb.com, amazon.com, and yelp.com websites
http://archive.ics.uci.edu/ml/machine-learning-databases/00331/
Each comment has a score; the score is either 1 (positive) or 0 (negative)
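A minimal TextBlob sketch for this task, assuming one of the downloaded files is tab-separated as sentence<TAB>score (the file name is illustrative):
import pandas as pd
from textblob import TextBlob

data = pd.read_csv('imdb_labelled.txt', sep='\t', header=None, names=['sentence', 'score'])
# TextBlob polarity ranges from -1 (negative) to +1 (positive); map it to the 0/1 labels
data['predicted'] = data['sentence'].apply(lambda s: 1 if TextBlob(s).sentiment.polarity > 0 else 0)
accuracy = (data['predicted'] == data['score']).mean()
print('Accuracy: %.2f%%' % (accuracy * 100))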
Python Deep Learning
Common Deep learning fields
Main fields:
Computer vision
Speech recognition
Natural language processing
Audio recognition
Social network filtering
Machine translation
Visual examples:
Colorization of black and white images: https://www.youtube.com/watch?v=_MJU8VK2PI4
Adding sounds to silent movies: https://www.youtube.com/watch?v=0FW99AQmMc8
Automatic machine translation
Object classification in photographs
Automatic handwriting generation
Character text generation
Image caption generation
Automatic game playing: https://www.youtube.com/watch?v=TmPfTpjtdgg
Common Deep learning development tool (Keras)
Keras is a free Artificial Neural Networks (ANN) library (a deep learning library).
It is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research.
Deep learning for computer vision (image processing)
Three common ways to detect objects:
median-based features
edge-based features
threshold-based features
How to use Neural Network Models in Keras
Five steps
1. Define Network.
2. Compile Network.
3. Fit Network.
4. Evaluate Network.
5. Make Predictions.
Step 1. Define Network
Neural networks are defined in Keras as a sequence of layers.
The first layer in the network must define the number of inputs to expect. For a Multilayer Perceptron model this is specified by the input_dim attribute.
Example of a small Multilayer Perceptron model (2 inputs, one hidden layer with 5 neurons, 1 output)
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(5, input_dim=2))
model.add(Dense(1))
Re-write after add activation function
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
Available Activation Functions
Optional; an activation function acts like a filter on a layer's output, used to match the network to the predictive modeling problem, and often gives a significant boost in performance.
Sigmoid: used for Binary Classification (2 classes), with one neuron in the output layer. Whatever the input, it maps it to a value between 0 and 1.
Softmax: used for Multiclass Classification (>2 classes), with one output neuron per class value.
Linear: used for Regression, with the number of neurons matching the number of outputs.
Tanh: whatever the input, it converts it to a number between -1 and 1.
ReLU: 0 for a<0 and a for a>0, so it simply removes negative values and passes positive values unchanged.
LeakyReLU: scales negative values down by a small factor and passes positive values unchanged.
elu
selu
softplus
softsign
hard_sigmoid
PReLU
ELU
ThresholdedReLU
Step 2. Compile Network
Compiling the network specifies the optimization algorithm used to train it and the loss function that the optimization algorithm minimizes to evaluate it.
model.compile(optimizer='sgd', loss='mse')
Optimizers are the tools that minimize the loss between the prediction and the real value. Commonly used optimization algorithms:
'sgd' (Stochastic Gradient Descent) requires the tuning of a learning rate and momentum.
ADAM requires the tuning of a learning rate.
RMSprop requires the tuning of a learning rate.
Loss functions:
Regression: Mean Squared Error or 'mse'.
Binary Classification (2 class): 'binary_crossentropy'.
Multiclass Classification (>2 class): 'categorical_crossentropy'.
Finally, you can also specify metrics to collect while fitting the model in addition to the loss function. Generally, the most useful
additional metric to collect is accuracy.
model.compile(optimizer='sgd', loss='mse', metrics=['accuracy'])
Step 3. Fit Network
history = model.fit(X, y, batch_size=10, epochs=100)
The network is trained using the backpropagation algorithm
Batch size is the number of samples that will be propagated through the network at a time.
Epochs is the number of passes over ALL the training examples.
Example: if you have 1000 training examples and your batch size is 500, then it will take 2 iterations to complete 1 epoch.
Step 4. Evaluate Network
The model evaluates the loss across all of the test patterns, as well as any other metrics
specified when the model was compiled.
For example, for a model compiled with the accuracy metric, we could evaluate it on
a new dataset as follows:
loss, accuracy = model.evaluate(X, Y)
print("Loss: %.2f, Accuracy: %.2f% %" % (loss, accuracy*100))
Step 5. Make Predictions
probabilities = model.predict(X)
predictions = [float(round(x)) for x in probabilities]
accuracy = numpy.mean(predictions == Y) #count the number of True and divide by the total size
print("Prediction Accuracy: %.2f% %" % (accuracy*100))
Binary classification using Neural Network in Keras
Diabetes Data Set
Detect Diabetes Disease based on analysis
Dataset Attributes:
1. Number of times pregnant
2. Plasma glucose concentration
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)
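A compact, hedged sketch of the five steps on this dataset (the file name, layer sizes, and training settings are illustrative; the CSV is assumed to hold the 8 inputs followed by the class column):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

dataset = np.loadtxt('pima-indians-diabetes.csv', delimiter=',')   # hypothetical file: 8 inputs + class
X, Y = dataset[:, 0:8], dataset[:, 8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

# 1. Define Network
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# 2. Compile Network
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# 3. Fit Network
model.fit(X_train, Y_train, batch_size=10, epochs=100)
# 4. Evaluate Network
loss, accuracy = model.evaluate(X_test, Y_test)
print("Loss: %.2f, Accuracy: %.2f%%" % (loss, accuracy * 100))
# 5. Make Predictions
probabilities = model.predict(X_test)
predictions = [float(round(p)) for p in probabilities.flatten()]
print("Prediction Accuracy: %.2f%%" % (np.mean(np.array(predictions) == Y_test) * 100))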
Save prediction model
After we train our model, i.e. model.fit(X_train, Y_train), we can save the trained model to use later.
This can be done with the pickle package (Python Object Serialization Library), using its dump and load methods. Pickle can save any object, not just the prediction model.
import pickle
…………….
model.fit(X_train, Y_train)
# save the model to disk
pickle.dump(model, open("c:/data.dump", 'wb'))   # wb = write bytes
# some time later... load the model from disk
model = pickle.load(open("c:/data.dump", 'rb'))   # rb = read bytes