F# For Machine Learning Essentials - Sample Chapter
Sudipta Mukherjee
Preface
Machine learning (ML) is more prevalent now than ever before. Every day, a lot of
data is being generated, and machine learning algorithms perform heavy-duty
number crunching on it to improve our lives. The following image captures the
major tasks that machine learning algorithms perform; these are the classes, or
types, of problems that ML algorithms solve.
Our lives are driven by the output of these ML algorithms more than we
care to admit. Let me walk you through the image once:
Classification: This is the ML algorithm that works hard to keep your spam
e-mails away from your priority inbox. The same algorithm can be used to
identify objects in images or videos and, surprisingly, it can also be used
to predict whether a patient has cancer or not. Generally, a lot of
data is provided to the algorithm, from which it learns. That's why this set
of algorithms is sometimes referred to as supervised learning algorithms,
and this constitutes the vast majority of machine learning algorithms.
Recommendations: Every time you visit Amazon and rate a product, the site
recommends some items to you. Under the hood, a clever machine learning
algorithm called collaborative filtering is in action, taking cues from other
users who purchase items similar to yours. Recommender systems are a very
active research topic now, and several other algorithms are being considered.
Sentiment analysis: Whenever a product hits the market, the company that
brought it to market wants to know how the market is reacting to it:
is the reaction positive or negative? Sentiment analysis techniques help to
identify these reactions. Also, on review websites, people post many comments,
and the website might be interested in publishing a generalized positive
or negative rating for the item under review. Here too, sentiment analysis
techniques can be quite helpful.
Information retrieval: Whenever you hit the search button on your favorite
search engine, a plethora of information retrieval algorithms are used under
the hood. These algorithms are also used in the content-based filtering that is
used in recommender systems.
Now that you have a top-level idea of what ML algorithms can do for you, let's see
why F# is the perfect fit for the implementations. Here are my reasons for using F#
to implement machine learning algorithms:
Introduction to Machine Learning
"To learn is to discover patterns."
You have been using products that employ machine learning, though maybe you've
never realized that the systems or programs you use rely on machine learning
under the hood. Much of what machine learning does today is inspired by
sci-fi movies, and machine learning scientists and researchers are on a perpetual
quest to make the gap between the sci-fi movies and reality disappear. Learning
about machine learning algorithms can be fun.
This is going to be a very practical book about machine learning. Throughout the
book, I will be using several machine learning frameworks adopted by the industry,
so I will keep the theory of machine learning short, covering just enough to
implement the algorithms. My objective in this chapter is to get you excited about
machine learning by showing how you can use these techniques to solve real-world
problems.
Objective
After reading this chapter, you will be able to understand the different terminologies
used in machine learning and the process of performing machine learning activities.
You will also be able to look at a problem statement and immediately identify which
problem domain it belongs to, such as whether it is a classification or a regression
problem, and you will find connections between seemingly disparate sets of
problems. You will also see the basic intuition behind some of the major algorithms
used in machine learning today. Finally, I wrap up this chapter with a motivating
example of identifying handwritten digits using a supervised learning algorithm.
This is analogous to your "Hello world" program.
Getting in touch
I have created the following Twitter account for you (my dear reader) to get in touch
with me. If you want to ask a question, post errata, or just make a suggestion, tag this
Twitter ID and I will surely get back to you as soon as I can.
https://twitter.com/fsharpforml
I will post contents here that will augment the content in the book.
The preceding image shows some of the areas where machine learning techniques
are used extensively. In this book, you will learn about most of these usages.
Machines learn almost the same way as we humans do. We learn in three different
ways.
As kids, our parents taught us the alphabet, and thus we can distinguish between
the A's and the H's. The same is true with machines: they are taught in the same
way to recognize characters. This is known as supervised learning.
While growing up, we taught ourselves the difference between a teddy bear
and an actual bear. This is known as unsupervised learning, because no
supervision is required in the learning process. The main type of unsupervised
learning is called clustering: the art of finding groups in unlabeled datasets.
Clustering has several applications, one of them being customer base segmentation.
Chapter 1
Remember those days when you first learned how to take the stairs? You probably
fell many times before successfully climbing them. However, each time you fell,
you learned something useful that helped you later. So your learning got reinforced
every time you fell. This process is known as reinforcement learning. Have you ever
seen those funky robots crawling over uneven terrain like humans? That's the result
of reinforcement learning. This is a very active topic of research.
Whenever you shop online at Amazon or other sites, the site recommends other
items that you might be interested in. This is done by a set of algorithms
known as recommender systems.
Machine learning is very heavily used to determine whether suspicious credit
card transactions are fraudulent or not. The technique used is popularly known as
anomaly detection. Anomaly detection works on the assumption that most of the
entries are proper, and that an entry that lies far from the others (also called
an outlier) is probably fraudulent.
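The outlier intuition described above can be sketched in a few lines of F#. This is only an illustrative sketch, not a production fraud detector; the data, the 2-standard-deviation threshold, and the function name are all made up for illustration.

```fsharp
// A sketch of the outlier idea behind anomaly detection:
// flag entries that lie more than `threshold` standard deviations
// from the mean of all entries.
let flagOutliers (threshold : float) (xs : float list) =
    let mean = List.average xs
    let std =
        xs
        |> List.averageBy (fun x -> (x - mean) * (x - mean))
        |> sqrt
    xs |> List.filter (fun x -> abs (x - mean) > threshold * std)

// Mostly ordinary transaction amounts, plus one suspicious spike.
let amounts = [12.0; 9.0; 11.0; 10.0; 13.0; 10.0; 950.0]
let suspicious = flagOutliers 2.0 amounts   // [950.0]
```

Real anomaly detectors model the data distribution far more carefully, but the "far from everything else" idea is the same.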
In the coming decade, machine learning is going to be very commonplace, and it's
about time to democratize machine learning techniques. In the next few sections,
I will give you a few examples where these different types of machine learning
algorithms are used to solve several problems.
Code written in F# is generally very expressive and stays close to the actual
algorithm description, which is why you will see more and more mathematically
inclined domains adopting F#.
At every stage of a machine learning activity, F# has a feature or an API to help.
Following are the major steps in a machine learning activity:
Data acquisition
Data scrubbing/data cleansing
Learning the model
Deedle (http://bluemountaincapital.github.io/Deedle/) is an API written in F#,
primarily for exploratory data analysis. This framework also has a lot of
features that can help in the data cleansing phase.
F# has a way to name a variable however you want if you wrap the name in double
backticks, like ``my variable``. This feature can make the code much more readable.
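As a quick illustration of this naming feature (the names and values here are made up):

```fsharp
// Double-backtick identifiers let names contain spaces,
// which can make domain code read almost like prose.
let ``number of bedrooms`` = 3
let ``price per square foot`` = 2000.0

// Backtick identifiers are used like any other value.
let estimatedPrice = float ``number of bedrooms`` * ``price per square foot``
// estimatedPrice is 6000.0
```

The same trick is popular for giving unit tests readable, sentence-like names.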
Classification
Regression
Decision Tree
Linear Regression
Logistic Regression
Neural Networks
Anomaly Detection
Sentiment Analysis
In the next few sections, I will walk you through an overview of a few of these
algorithms and their mathematical basis. However, we will get away with as
little math as possible, since the objective of the book is to help you use
machine learning in real settings.
"Malignant" cases (which are represented as M in the dataset), or were fortunate and
diagnosed as "Benign" (non-harmful/non-cancerous) cases (which are represented as
B in the dataset). If you want to understand what all the other fields mean, take a look
at https://archive.ics.uci.edu/ml/machine-learning-databases/breastcancer-wisconsin/wdbc.names.
Now the question is: given a new entry with all the other fields but without the tag
M or B, can we predict it? In ML terminology, this value "M" or "B" is sometimes
referred to as the "class tag" or just the "class". The task of a classification
algorithm is to determine this class for a new data point. k-NN does this in the
following way: it measures the distance from the given data point to all the
training data, and then takes into consideration the classes of only the k nearest
neighbors to determine the class of the new entry. So for the current case, if more
than 50% of the k nearest neighbors are of class "B", then k-NN will conclude that
the new entry is of type "B".
Distance metrics
The distance metric generally used is the Euclidean distance that you learned in
high school. For example, given two points p and q in 3D:

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2}
In this preceding example, p1 and q1 denote their values in the X axis, p2 and q2
denote their values in the Y axis, and p3 and q3 denote their values in the z axis.
Extrapolating this, we get the following formula for calculating the distance in N
dimensions:

d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \cdots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
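The n-dimensional distance formula above translates almost directly into F#. The following is a minimal sketch; the function name and the sample points are illustrative.

```fsharp
// The n-dimensional Euclidean distance from the formula above.
// p and q are assumed to be float lists of the same length.
let distance (p : float list) (q : float list) =
    List.zip p q
    |> List.sumBy (fun (pi, qi) -> (qi - pi) * (qi - pi))
    |> sqrt

// A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
let d = distance [0.0; 0.0] [3.0; 4.0]   // 5.0
```

Note how closely the pipeline mirrors the math: pair up the coordinates, sum the squared differences, take the square root.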
Thus, after calculating the distance from all the training set data, we can create a list
of tuples with the distance and the class, as follows. This list is made for the sake of
demonstration. This is not calculated from the actual data.
Distance from test/new data    Class/Tag/Category
0.34235                        B
0.45343                        B
1.34233                        B
6.23433                        M
66.3435                        M
Let's assume that k is set to 4. We now take into consideration the classes of the
four nearest entries: for the first three, the class is B, and for the fourth, it is M.
Since the number of B's is greater than the number of M's, k-NN will conclude that
the new patient's data is of type B.
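The whole k-NN decision rule described above can be sketched in F#. Everything here, the Entry type, the helper names, and the toy training set mirroring the table, is made up for illustration; the chapter's real implementation against the Kaggle data comes later.

```fsharp
// A sketch of the k-NN decision rule: measure distances, keep the
// k nearest neighbors, and let the majority class win.
type Entry = { Label : string; Values : float list }

let euclid (p : float list) (q : float list) =
    List.zip p q |> List.sumBy (fun (a, b) -> (b - a) ** 2.0) |> sqrt

let classify k (training : Entry list) (newValues : float list) =
    training
    |> List.map (fun e -> (e.Label, euclid e.Values newValues))
    |> List.sortBy snd                 // nearest first
    |> List.truncate k                 // keep the k nearest neighbors
    |> List.countBy fst                // histogram of the neighbors' labels
    |> List.maxBy snd                  // the majority label wins
    |> fst

// Toy data: three nearby "B" points, two faraway "M" points.
let training =
    [ { Label = "B"; Values = [1.0; 1.0] }
      { Label = "B"; Values = [1.2; 0.9] }
      { Label = "B"; Values = [0.9; 1.1] }
      { Label = "M"; Values = [5.0; 5.0] }
      { Label = "M"; Values = [6.0; 6.0] } ]

let predicted = classify 4 training [1.0; 1.0]   // "B"
```

With k = 4, the four nearest neighbors are B, B, B, and M, so the vote goes to B, exactly as in the worked example above.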
A decision tree is a classification algorithm that uses this approach to determine
the class of an unknown entry. As the name suggests, a decision tree is a tree where
the nodes depict the questions asked and the edges represent the decisions (yes/
no). Leaf nodes determine the final class of the unknown entry. Following is a classic
textbook example of a decision tree:
The preceding figure depicts the decision of whether we can play lawn tennis or
not, based on several attributes such as Outlook, Humidity, and Wind. Now, the
question you may have is why Outlook was chosen as the root node of the tree.
The reason is that splitting the dataset on Outlook separates the outcomes more
evenly than splitting on the other attributes, such as Humidity or Wind.
The process of finding the attribute that splits the dataset more evenly than the
others is guided by entropy: the lower the entropy after a split, the better the
attribute. Entropy is a measure of the information content (or impurity) of a
dataset. It is calculated by the following formula:

H(X) = \sum_{i} P(x_i) I(x_i) = -\sum_{i} P(x_i) \log_b P(x_i)

Here, P(x_i) stands for the probability of x_i, and I(x_i) = -\log_b P(x_i) denotes
its information content.
Let's take the example of tennis dataset from Weka. Following is the file in the
CSV format:
outlook,temperature,humidity,wind,playTennis
sunny, hot, high, weak, no
sunny, hot, high, strong, no
overcast, hot, high, weak, yes
rain, mild, high, weak, yes
rain, cool, normal, weak, yes
rain, cool, normal, strong, no
overcast, cool, normal, strong, yes
sunny, mild, high, weak, no
sunny, cool, normal, weak, yes
rain, mild, normal, weak, yes
sunny, mild, normal, strong, yes
overcast, mild, high, strong, yes
overcast, hot, normal, weak, yes
rain, mild, high, strong, no
You can see from the dataset that out of 14 instances (there are 14 rows in the file), 5
instances had the value no for playTennis and 9 instances had the value yes. Thus,
the overall information is given by the following formula:
-\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14}
This roughly evaluates to 0.94. For the next step, we must pick the attribute
that maximizes the information gain. Information gain is the difference
between the total entropy and the expected entropy after each possible split.
Let's go with one example. For the outlook attribute, there are three possible values:
rain, sunny, and overcast, and for each of these values, the value of the attribute
playTennis is either no or yes.
For rain, out of 5 instances, 3 instances have the value yes for the attribute
playTennis; thus, the entropy is as follows:
-\frac{3}{5} \log_2 \frac{3}{5} - \frac{2}{5} \log_2 \frac{2}{5}
This is equal to 0.97.
For overcast, every instance has the value yes:
-\frac{4}{4} \log_2 \frac{4}{4}
This is equal to 0.0.
For sunny, out of 5 instances, 2 instances have the value yes for the attribute
playTennis; thus, the entropy is as follows:

-\frac{2}{5} \log_2 \frac{2}{5} - \frac{3}{5} \log_2 \frac{3}{5}

This is also equal to 0.97.
So the expected new entropy is given by the following formula:
\frac{4}{14} (0.0) + \frac{5}{14} (0.97) + \frac{5}{14} (0.97)
This is roughly equal to 0.69. If you follow these steps for the other attributes,
you will find that the new entropies are as follows:
Attribute       Entropy    Information gain
outlook         0.69       0.25
temperature     0.91       0.03
humidity        0.724      0.216
windy           0.87       0.07
So the highest information gain is attained if we split the dataset based on the
outlook attribute.
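The entropy arithmetic walked through above can be checked with a small F# sketch. The function names are made up, but the yes/no counts are exactly the ones from the tennis dataset.

```fsharp
// Entropy of a dataset from the counts of each class,
// matching the formula H = -sum p_i * log2 p_i.
let entropy (counts : int list) =
    let total = counts |> List.sum |> float
    counts
    |> List.sumBy (fun c ->
        if c = 0 then 0.0                    // 0 * log 0 is taken as 0
        else
            let p = float c / total
            -p * (log p / log 2.0))          // log base 2

let totalEntropy = entropy [9; 5]            // 9 yes, 5 no: roughly 0.94

// Expected entropy after splitting on outlook:
// overcast (4 rows, all yes), rain (3 yes / 2 no), sunny (2 yes / 3 no).
let outlookEntropy =
    (4.0 / 14.0) * entropy [4; 0]
    + (5.0 / 14.0) * entropy [3; 2]
    + (5.0 / 14.0) * entropy [2; 3]

let informationGain = totalEntropy - outlookEntropy   // roughly 0.25
```

Running the same computation for temperature, humidity, and windy reproduces the table above, confirming that outlook gives the highest gain.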
Sometimes multiple trees are constructed by generating a random subset of all the
available features. This technique is known as random forest.
Linear regression
Regression is used to predict the value of a real-valued target variable. For
example, let's say we have data about the number of bedrooms and the total area
of many houses in a locality. We also have their prices listed as follows:
Total area (square feet)    Price
1150                        2300000
2500                        5600000
1780                        4571030
3000                        9000000
Now let's say we have this data in a real estate site's database, and we want to
create a feature to predict the price of a new house with three bedrooms and a
total area of 1650 square feet.
Linear regression is used to solve these types of problems, and as you can see,
such problems are pretty common.
In linear regression, you start with a model in which the target variable, the
variable whose value you want to predict, is represented in terms of the input
variables. A polynomial model is selected that minimizes the least squared error
(this will be explained later in the chapter). Let me walk you through this example.
Each row of the available data can be represented as a tuple where the first
few elements hold the values of the known/input parameters and the last
element holds the value of the price (the target variable). Borrowing notation
from mathematics, we represent the inputs with x and the target with y. Thus,
each row can be represented as

\left( x_1, x_2, x_3, \ldots, x_n \mid y \right)

where x_1 to x_n represent the input parameters (the total area and the number of
bedrooms) and y represents the target value (the price of the house). Linear
regression works on a model where y is represented in terms of the x values:
h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2
Note that this hypothesis is a polynomial model of degree one, and we are just
using two features: the number of bedrooms and the total area, represented
by x_1 and x_2.
So the squared error is calculated by the following formula:

\left( h_{\theta}(x) - y \right)^2
The task of linear regression is to choose a set of values for the coefficients
\theta that minimizes this error. The algorithm that minimizes this error is called
gradient descent, or batch gradient descent. You will learn more about it in
Chapter 2, Linear Regression.
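A minimal sketch of batch gradient descent for a single-feature hypothesis h(x) = t0 + t1 * x follows. The learning rate, iteration count, and toy data are arbitrary illustrative choices; Chapter 2 gives the real treatment.

```fsharp
// Batch gradient descent for h(x) = t0 + t1 * x: on each step,
// average the prediction errors over all data points and move the
// coefficients a small step against the gradient.
let gradientDescent (data : (float * float) list) (alpha : float) (iters : int) =
    let n = float (List.length data)
    let step (t0, t1) =
        let errs = data |> List.map (fun (x, y) -> (t0 + t1 * x - y, x))
        let g0 = (errs |> List.sumBy fst) / n
        let g1 = (errs |> List.sumBy (fun (e, x) -> e * x)) / n
        (t0 - alpha * g0, t1 - alpha * g1)
    let rec loop i thetas = if i = 0 then thetas else loop (i - 1) (step thetas)
    loop iters (0.0, 0.0)

// The data follows y = 2x exactly, so the fit should recover
// t0 close to 0 and t1 close to 2.
let (t0, t1) = gradientDescent [ (1.0, 2.0); (2.0, 4.0); (3.0, 6.0) ] 0.05 2000
```

Each iteration uses the whole dataset to compute the gradient, which is exactly what makes this the "batch" variant.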
Logistic regression
Unlike linear regression, logistic regression predicts a Boolean value indicating
the class/tag/category of the target variable. Logistic regression is one of the most
popular binary classifiers and is modelled by the equation that follows, where x_i
and y_i stand for the independent input variables and their classes/tags, respectively:
L(\theta) = \prod_{i=1}^{n} \left( \frac{1}{1 + e^{-\theta^{T} x_i}} \right)^{y_i} \left( 1 - \frac{1}{1 + e^{-\theta^{T} x_i}} \right)^{1 - y_i}
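The sigmoid at the heart of this model is easy to sketch in F#. The weights below are illustrative, not learned; fitting them to data is what logistic regression training actually does.

```fsharp
// The sigmoid squashes a linear score into (0, 1); thresholding the
// result at 0.5 turns the probability into a class label.
let sigmoid z = 1.0 / (1.0 + exp (-z))

let predict (weights : float list) (xs : float list) =
    let score = List.map2 (*) weights xs |> List.sum
    if sigmoid score >= 0.5 then 1 else 0

// score = 1.5 * 2.0 + (-2.0) * 0.5 = 2.0, sigmoid(2.0) is about 0.88,
// which is above 0.5, so the predicted class is 1.
let label = predict [1.5; -2.0] [2.0; 0.5]
```

Note that sigmoid 0.0 is exactly 0.5, which is why 0.5 is the natural decision boundary.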
Recommender systems
Whenever you buy something from the web (say, from Amazon), it recommends items
that you might find interesting and might eventually buy as well. This is the result
of a recommender system. Let's take the following example of movie ratings:
Movie           | Bob | Lucy | Jane | Jennifer | Jacob
Paper Towns     |     |      |      |          |
Focus           |     |      |      |          |
Cinderella      |     |      |      |          |
Jurassic World  |     |      |      |          |
Die Hard        |     |      |      |          |
So in this toy example, we have 5 users and they have rated 5 movies. But not all the
users have rated all the movies. For example, Jane hasn't rated "Focus" and Jacob
hasn't rated "Jurassic World". The task of a recommender system is to initially guess
what would be the ratings for the movies that aren't rated by the user and then
recommend movies that have a guessed rating which is beyond a threshold (say 3).
There are several algorithms to solve this problem. One popular algorithm is known
as collaborative filtering where the algorithm takes clues from the other user ratings.
You will learn more about this in Chapter 5, Collaborative Filtering.
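One building block of collaborative filtering is scoring how similar two users are based on the movies both have rated. Here is a minimal cosine-similarity sketch; the users and ratings below are made up for illustration, not taken from the table above.

```fsharp
// Cosine similarity between two users, computed only over the
// movies that both users have rated.
let cosineSimilarity (a : Map<string, float>) (b : Map<string, float>) =
    let common =
        a
        |> Map.toList
        |> List.choose (fun (movie, ra) ->
            match Map.tryFind movie b with
            | Some rb -> Some (ra, rb)
            | None -> None)
    let dot = common |> List.sumBy (fun (x, y) -> x * y)
    let norm xs = xs |> List.sumBy (fun x -> x * x) |> sqrt
    let na = common |> List.map fst |> norm
    let nb = common |> List.map snd |> norm
    if na = 0.0 || nb = 0.0 then 0.0 else dot / (na * nb)

// Two users who agree closely on their three common movies.
let bob  = Map.ofList [ "Focus", 5.0; "Die Hard", 4.0; "Cinderella", 1.0 ]
let lucy = Map.ofList [ "Focus", 4.0; "Die Hard", 5.0; "Cinderella", 2.0
                        "Paper Towns", 3.0 ]
let sim = cosineSimilarity bob lucy
```

A collaborative filter would use such similarity scores to weight other users' ratings when guessing the missing cells of the rating matrix.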
Unsupervised learning
As the name suggests, unlike supervised learning, unsupervised learning works
on data that is not labeled or that doesn't have a category associated with each
training example.
Consider a supermarket that wants to segment its customers based on two attributes:
The amount of the bill (how much the customer spends per visit)
The number of visits per month (the number of times the customer shows up)
The initial data that the supermarket had might look like the following in a
spreadsheet:
So the data plotted in these 2 dimensions, after being clustered, might look like this
following image:
Here you can see that there are four types of people, with two extreme cases
annotated in the preceding image. Those who are thorough and disciplined, and
know what they want, go to the store very few times, buy what they want, and
generally their bills are very high. The vast majority fall into the basket where
people make many trips (kind of like darting into a supermarket for a packet of
chips, maybe) but their bills are really low. This type of information is crucial
for the supermarket because it can optimize its operations based on this data.
This type of segmenting task has a special name in machine learning: it is called
"clustering". There are several clustering algorithms, and k-means clustering
is quite popular. The only flip side of k-means clustering is that the number of
clusters has to be specified at the beginning.
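A single iteration of the k-means idea, assign each point to its nearest centroid, then move each centroid to the mean of its cluster, can be sketched as follows. The points, seed centroids, and function names are illustrative; real k-means repeats this step until the assignments stop changing.

```fsharp
// Squared distance is enough for picking the nearest centroid.
let dist (ax, ay) (bx, by) = (ax - bx) ** 2.0 + (ay - by) ** 2.0

// Assign a point to the centroid it is closest to.
let assign centroids point =
    centroids |> List.minBy (dist point)

// One k-means step: group points by nearest centroid, then
// recompute each centroid as the mean of its group.
let kmeansStep (centroids : (float * float) list) (points : (float * float) list) =
    points
    |> List.groupBy (assign centroids)
    |> List.map (fun (_, members) ->
        (members |> List.averageBy fst, members |> List.averageBy snd))

// Two obvious groups, with one seed centroid near each group.
let points = [ (1.0, 1.0); (1.2, 0.8); (9.0, 9.0); (8.8, 9.2) ]
let centroids = kmeansStep [ (0.0, 0.0); (10.0, 10.0) ] points
```

After this one step, the centroids have already moved to roughly (1.1, 0.9) and (8.9, 9.1), the centers of the two groups.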
Several machine learning frameworks can be used from F#, including the following:
Accord.NET (http://accord-framework.net/)
WekaSharp (an F# wrapper around the Weka machine learning library)
You are much better off using these frameworks than writing your own, because a
lot of work has gone into them and they are used in the industry. So if you pick up
these frameworks along the way while learning about machine learning algorithms
in general, that's a great thing.
The next section gets you started with a Kaggle competition: getting the data and
solving the problem.
An image can be represented as a 2-D array where each cell holds one pixel.
Any 2-D array can be unwrapped into a 1-D array whose length is the product of
the length and the breadth of the array. For example, for an 8 by 8 matrix, the
size of the single-dimensional array will be 64.
Now if we store several images and their 2D matrix representations, we will have
something as shown in the following spreadsheet:
The header Label denotes the digit, and the remaining values are the pixel values.
The lower the pixel value, the darker the cell in the pictorial representation of
the number 2, as shown previously.
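The unwrapping of a 2-D pixel matrix into a 1-D array described above can be sketched as follows; the tiny 2 by 3 "image" is made up for illustration.

```fsharp
// Flatten a 2-D pixel matrix row by row into a 1-D array.
// An n-by-m image becomes an array of length n * m.
let flatten (image : int [,]) =
    [| for r in 0 .. Array2D.length1 image - 1 do
         for c in 0 .. Array2D.length2 image - 1 do
             yield image.[r, c] |]

// A tiny 2x3 "image"; the flattened array has length 6.
let img = array2D [ [ 0; 10; 20 ]; [ 30; 40; 50 ] ]
let pixels = flatten img    // [| 0; 10; 20; 30; 40; 50 |]
```

This row-by-row ordering is exactly how the Kaggle CSV lays out the pixel columns after the Label column.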
In this program, you will write code to solve the digit recognizer challenge from
Kaggle, available at https://www.kaggle.com/c/digit-recognizer.
Once you get there, download the data and save it in some folder. We will be using
the train.csv file (you can get the file from www.kaggle.com/c/digit-recognizer/
data) to train our classifier. In this example, you will implement the k-nearest
neighbor algorithm from scratch, and then deploy this algorithm to recognize
the digits.
sudipto80/72e6e56d07110baf4d4d.
4. Once you create the project by clicking "OK", your Program.fs file will look
like the following image:
d^2(p, q) = (p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_i - q_i)^2 + \cdots + (p_n - q_n)^2
Here p and q denote the two vectors. In this case, p might denote one example
from the training set and q might denote the test example or the new uncategorized
data that we have depicted by newEntry in the preceding code.
The loadValues function loads the pixel values and the category for each training/
test data, and creates a list of Entry types from the CSV file.
The k-NN algorithm is implemented in the kNN function. Refer to the following line
of code:
|> List.map (fun x -> (x.Label, distance (x.Values, newEntry.Values)))
This preceding code creates a list of tuples where the first element is the category
of the entry and the second is the squared distance of the test data from that
training entry. So it might look as follows:
It sorts this list of tuples by increasing distance from the test data. Thus,
the preceding list will become as shown in the following image:
As you can see, there are four 9s and three 4s in this list. The following line
transforms this list into a histogram:
|> Seq.countBy (fun x -> fst x)
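A sketch of what that line produces, using made-up (label, distance) pairs with four 9s and three 4s like the list described above, and how the most frequent label is then picked:

```fsharp
// Seven (label, squared-distance) pairs, already sorted by distance,
// standing in for the k nearest neighbors of the test digit.
let nearest =
    [ ("9", 0.1); ("4", 0.2); ("9", 0.3); ("9", 0.4);
      ("4", 0.5); ("9", 0.6); ("4", 0.7) ]

let winner =
    nearest
    |> Seq.countBy (fun x -> fst x)   // seq [("9", 4); ("4", 3)]
    |> Seq.maxBy snd                  // the most frequent label
    |> fst                            // "9"
```

Seq.countBy collapses the pairs into a label histogram, and taking the entry with the largest count implements the majority vote that finishes the k-NN prediction.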
Summary
In this chapter, you have learnt about several different types of machine learning
techniques and their possible usages. Try to spot probable machine learning
algorithms that might be deployed deep inside some applications. Following are
some examples of machine learning. Your mailbox is curated by an automatic
spam protector and it learns every time you move an e-mail from your inbox to the
spam folder. This is an example of a supervised classification algorithm. When you
apply for health insurance, then based on several parameters, the insurance
company tries to fit your data and predict the premium you might have to pay.
This is an example of linear regression. Sometimes when people buy baby diapers
at supermarkets, they get a discount coupon for buying beer. Sounds crazy, right?
But machine learning algorithms figured out that people who buy diapers often buy
beer too, so stores want to encourage those users to buy more. There is a lot of
buzz right now about predictive analytics, which is nothing but predicting a future
event and associating a probability score with it. For example, figuring out how
long a shopper will take to return to the store for her next purchase. This data is
extracted from visit patterns. That's unsupervised learning working in the background.
Sometimes one simple algorithm doesn't provide the needed accuracy, so several
methods are used together, and a special class of algorithms, known as ensemble
methods, is used to combine the individual results. In loose terms, it resonates
with the phrase "wisdom of the crowd". You will learn about some ensemble methods
in a later chapter.
I want to wrap up this chapter with the following tip. When you have a problem that
you want to solve and you think machine learning can help, follow these steps:
break the problem into smaller chunks, try to locate a class of machine learning
problem domain for each smaller problem, and then find the best method in that
class to solve it. Iterate over and over until your error rates are within permissible
limits, and then wrap it all in a nice application/user interface.