Building Machine Learning Systems With Python - Second Edition - Sample Chapter
Using machine learning to gain deeper insights from data is a key skill required by
modern application developers and analysts alike. Python is a wonderful language
in which to develop machine learning applications. As a dynamic language, it allows
for fast exploration and experimentation. With its excellent collection of open source
machine learning libraries, you can focus on the task at hand while being able to
quickly try out many ideas.
This book shows you exactly how to find patterns in your raw data. You will start by
brushing up on your Python machine learning knowledge and getting an introduction
to the libraries used throughout the book. You'll quickly get to grips with serious,
real-world projects on datasets, covering modeling and the creation of recommendation
systems. Later on, the book covers advanced topics such as topic modeling, basket
analysis, and cloud computing. These will extend your abilities and enable you to
create large, complex systems.
Chapter 7, Regression, explains how to use the classical topic of regression in handling
data, which is still relevant today. You will also learn about advanced regression
techniques such as the Lasso and ElasticNets.
Chapter 8, Recommendations, builds recommendation systems based on customer
product ratings. We will also see how to build recommendations just from shopping data,
without the need for ratings data (which users do not always provide).
Chapter 9, Classification: Music Genre Classification, makes us pretend that someone
has scrambled our huge music collection, and our only hope of restoring order is to let a
machine learner classify our songs. It will turn out that it is sometimes better to trust
someone else's expertise than to create features ourselves.
Chapter 10, Computer Vision, teaches how to apply classification in the specific context
of handling images by extracting features from data. We will also see how these methods
can be adapted to find similar images in a collection.
Chapter 11, Dimensionality Reduction, teaches us what other methods exist that can help
us in downsizing data so that it is chewable by our machine learning algorithms.
Chapter 12, Bigger Data, explores some approaches to deal with larger data by taking
advantage of multiple cores or computing clusters. We also have an introduction to using
cloud computing (using Amazon Web Services as our cloud provider).
Appendix, Where to Learn More Machine Learning, lists many wonderful resources
available to learn more about machine learning.
Classification: Detecting Poor Answers
Now that we are able to extract useful features from text, we can take on the
challenge of building a classifier using real data. Let's come back to our imaginary
website from Chapter 3, Clustering: Finding Related Posts, where users can submit
questions and get them answered.
A continuous challenge for owners of these Q&A sites is to maintain a decent level of
quality in the posted content. Sites such as StackOverflow make considerable efforts
to achieve this: they give users diverse possibilities to score content, and they offer
badges and bonus points to encourage users to spend more energy on carving out
the question or crafting a possible answer.
One particularly successful incentive is the ability for the asker to flag one answer
to their question as the accepted answer (again, there are incentives for the asker
to flag answers as such). This results in more score points for the author of
the flagged answer.
Would it not be very useful for users to immediately see how good their answer is
while they are typing it in? That is, the website would continuously evaluate the
work-in-progress answer and provide feedback as to whether it shows signs of being
a poor one. This would encourage users to put more effort into writing their answers
(providing a code example? including an image?), and thus improve the
overall system.
Let's build such a mechanism in this chapter.
Name              Type              Description
Id                Integer
PostTypeId        Integer           Question or answer (other values will be ignored)
ParentId          Integer
CreationDate      DateTime
Score             Integer
ViewCount         Integer or empty
Body              String
OwnerUserId       Id
Title             String
AcceptedAnswerId  Id
CommentCount      Integer
ViewCount, in contrast, is most likely of no use for our task. Even if it could help the
classifier to distinguish between good and bad, we would not have this information
at the time an answer is being submitted. Drop it!

The Title attribute is also ignored here, although it could add some more
information about the question.

CommentCount is also ignored. Similar to ViewCount, it could help the classifier
with posts that have been out there for a while (more comments = more ambiguous post?).
It will, however, not help the classifier at the time an answer is posted.

As we will access this per answer, instead of keeping this attribute, we will create
the new attribute IsAccepted, which is 0 or 1 for answers and is ignored for questions
(ParentId=-1).
We end up with the following format:
Id <TAB> ParentId <TAB> IsAccepted <TAB> TimeToAnswer <TAB> Score
<TAB> Text
For the concrete parsing details, please refer to so_xml_to_tsv.py and
choose_instance.py. Suffice it to say that in order to speed up processing, we split the
data into two files: in meta.json, we store a dictionary mapping a post's Id value to
its other data (except Text) in JSON format so that we can read it in the proper format.
For example, the score of a post would reside at meta[Id]['Score']. In data.tsv, we
store the Id and Text values, which we can easily read with the following method:
def fetch_posts():
    # stream (Id, Text) pairs from the tab-separated file created above
    for line in open("data.tsv", "r"):
        post_id, text = line.split("\t")
        yield int(post_id), text.strip()
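As a quick sanity check (not part of the book's code), the generator can be consumed like this:
# peek at the first post returned by the generator defined above
posts = fetch_posts()
first_id, first_text = next(posts)
print(first_id, first_text[:60])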
KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski', n_neighbors=2, p=2, weights='uniform')
It provides the same interface as all other estimators in sklearn: we train it using
fit(), after which we can predict the class of new data instances using predict():
>>> knn.fit([[1], [2], [3], [4], [5], [6]], [0, 0, 0, 1, 1, 1])
>>> knn.predict([[1.5]])
array([0])
>>> knn.predict([[37]])
array([1])
>>> knn.predict([[3]])
array([0])
To get the class probabilities, we can use predict_proba(). In this case of having
two classes, 0 and 1, it will return an array of two elements:
>>> knn.predict_proba([[1.5]])
array([[ 1.,  0.]])
>>> knn.predict_proba([[37]])
array([[ 0.,  1.]])
>>> knn.predict_proba([[3.5]])
array([[ 0.5,  0.5]])
What we could do is check the number of HTML links in the answer as a proxy for
quality. Our hypothesis would be that more hyperlinks in an answer indicate better
answers and thus a higher likelihood of being up-voted. Of course, we want to only
count links in normal text and not code examples:
import re
code_match = re.compile('<pre>(.*?)</pre>',
                        re.MULTILINE | re.DOTALL)
link_match = re.compile('<a href="http://.*?".*?>(.*?)</a>',
                        re.MULTILINE | re.DOTALL)

def extract_features_from_body(s):
    link_count_in_code = 0
    # count links in code to later subtract them
    for match_str in code_match.findall(s):
        link_count_in_code += len(link_match.findall(match_str))
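As a self-contained sketch of where this is heading (the function name extract_link_count is hypothetical; the book keeps extending extract_features_from_body instead), the complete link-count extraction could look like this:
def extract_link_count(s):
    # hypothetical helper, not the book's code: count links outside of code blocks
    link_count_in_code = 0
    for match_str in code_match.findall(s):
        link_count_in_code += len(link_match.findall(match_str))
    # total links minus those that only appear inside <pre>...</pre> examples
    return len(link_match.findall(s)) - link_count_in_code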
With this in place, we can generate one feature per answer. But before we train
the classifier, let's first have a look at what we will train it with. We can get a first
impression with the frequency distribution of our new feature. This can be done by
plotting the percentage of how often each value occurs in the data. Have a look at the
following plot:
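A plot like this can be produced with matplotlib; the following is only a sketch and assumes that X is an array holding one link count per post:
import numpy as np
import matplotlib.pyplot as plt

values, counts = np.unique(X.ravel(), return_counts=True)
plt.bar(values, 100.0 * counts / counts.sum())  # percentage of posts per link count
plt.xlabel("Number of links in post")
plt.ylabel("Fraction of posts [%]")
plt.show()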
With the majority of posts having no link at all, we know now that this feature will
not make a good classifier alone. Let's nevertheless try it out to get a first estimation
of where we are.
Using the standard parameters, we just fitted a 5NN (meaning NN with k=5) to
our data. Why 5NN? Well, at the current state of our knowledge about the data, we
really have no clue what the right k should be. Once we have more insight, we will
have a better idea of how to set k.
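A sketch of this experiment (module paths follow current scikit-learn rather than the book's exact code; X holds the LinkCount feature and Y the good/bad labels, both assumed to be prepared as described earlier):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)
# 10-fold cross-validation over the single LinkCount feature
scores = cross_val_score(knn, X, Y, cv=10)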
print("Mean(scores)=%.5f\tStddev(scores)=%.5f"\
%(np.mean(scores), np.std(scores)))
Stddev(scores)=0.055591
Now that is far from being usable. With only 55 percent accuracy, it is not much
better than tossing a coin. Apparently, the number of links in a post is not a very
good indicator of the quality of a post. So, we can say that this feature does not
have much discriminative power, at least not for kNN with k=5.
    links = link_match.findall(s)
    link_count = len(links)
    link_count -= link_count_in_code

    html_free_s = re.sub(" +", " ",
                         tag_match.sub('', code_free_s)).replace("\n", "")
    link_free_s = html_free_s
Looking at them, we notice that at least the number of words in a post shows
higher variability:
Stddev(scores)=0.02600
But still, this would mean that we would classify roughly 4 out of 10 posts wrongly.
At least we are going in the right direction. More features lead to higher accuracy,
which leads us to add more features. Therefore, let's extend the feature space with
the following features (a sketch of how they can be computed follows below):

- AvgSentLen: the average number of words per sentence
- AvgWordLen: the average number of characters per word
- NumAllCaps: the number of words written in uppercase
- NumExclams: the number of exclamation marks

The following charts show the value distributions for average sentence and word
lengths and the numbers of uppercase words and exclamation marks:
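A minimal sketch of how these four features could be computed (the tokenization below is an assumption, not necessarily the book's exact code):
import re

def extract_text_features(text):
    # simplified tokenization: words and sentences via regular expressions
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?]", text) if s.strip()]
    avg_sent_len = (float(sum(len(re.findall(r"\w+", s)) for s in sentences)) /
                    max(len(sentences), 1))
    avg_word_len = float(sum(len(w) for w in words)) / max(len(words), 1)
    num_all_caps = sum(1 for w in words if len(w) > 1 and w.isupper())
    num_exclams = text.count("!")
    return avg_sent_len, avg_word_len, num_all_caps, num_exclams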
With these four additional features, we now have seven features representing the
individual posts. Let's see how we progress:
Mean(scores)=0.61400
Stddev(scores)= 0.02154
Now, that's interesting. We added four more features and don't get anything in
return. How can that be?
To understand this, we have to remind ourselves how kNN works. Our 5NN
classifier determines the class of a new post by calculating the seven aforementioned
features, LinkCount, NumTextTokens, NumCodeLines, AvgSentLen, AvgWordLen,
NumAllCaps, and NumExclams, and then finds the five nearest other posts. The new
post's class is then the majority of the classes of those nearest posts. The nearest
posts are determined by calculating the Euclidean distance (as we did not specify
it, the classifier was initialized with the default p=2, which is the parameter in the
Minkowski distance). That means that all seven features are treated similarly. kNN
does not really learn that, for instance, NumTextTokens is good to have but much less
important than NumLinks. Let's consider the following two posts A and B that only
differ in the following features and how they compare to a new post:
Post    NumLinks    NumTextTokens
A       2           20
B       0           25
new     1           23
Although we would think that links provide more value than mere text, post B
would be considered more similar to the new post than post A.
Clearly, kNN has a hard time correctly using the available data.
We basically have the following options to improve the situation:

- Add more data: Maybe there is just not enough data for the learning algorithm
  and we should simply add more training data.
- Play with the model complexity: Maybe the model is not complex enough?
  Or maybe it is already too complex? In this case, we could decrease k so
  that it takes fewer nearest neighbors into account and is thus better at
  predicting non-smooth data. Or we could increase it to achieve the opposite.
- Modify the feature space: Maybe we do not have the right set of features?
  We could, for example, change the scale of our current features or design
  even more new features. Or should we rather remove some of our current
  features in case some features are aliasing others?
- Change the model: Maybe kNN is generally not a good fit for our use case,
  such that it will never be capable of achieving good prediction performance,
  no matter how complex we allow it to be and how sophisticated the feature
  space becomes?
In real life, at this point, people often try to improve the current performance by
randomly picking one of these options and trying them out in no particular
order, hoping to find the golden configuration by chance. We could do the same
here, but it would surely take longer than making informed decisions. Let's take the
informed route, for which we need to introduce the bias-variance tradeoff.
The only possibilities we have in this case are to get more features, make the model
more complex, or change the model.
Looking at the graph, we immediately see that adding more training data will
not help, as the dashed line corresponding to the test error seems to stay above 0.4.
The only option we have is to decrease the complexity, either by increasing k or by
reducing the feature space.
Reducing the feature space does not help here. We can easily confirm this by plotting
the graph for a simplified feature space of only LinkCount and NumTextTokens:
We get similar graphs for other smaller feature sets. No matter what subset of
features we take, the graph would look similar.
At least reducing the model complexity by increasing k shows some positive impact:
k     mean(scores)    stddev(scores)
40    0.62800         0.03750
10    0.62000         0.04111
5     0.61400         0.02154
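These numbers can be reproduced with a small loop (a sketch, reusing KNeighborsClassifier and cross_val_score as in the earlier sketch):
for k in [40, 10, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, Y, cv=10)
    print("k=%d\tmean=%.5f\tstddev=%.5f" % (k, scores.mean(), scores.std()))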
But this is not enough, and it also comes at the price of lower classification runtime
performance. Take, for instance, k=40, where we have a very low test error. To
classify a new post, we would need to find the 40 nearest other posts to decide
whether the new post is a good one or not:

Clearly, there seems to be an issue with using nearest neighbors in our scenario. And it
has another real disadvantage. Over time, we will get more and more posts into our
system. As the nearest neighbor method is an instance-based approach, we will have
to store all posts in our system. The more posts we get, the slower prediction will become.
This is different with model-based approaches, which try to derive a model
from the data.

There we are, with enough reasons now to abandon the nearest neighbor approach
and look for better places in the classification world. Of course, we will never know
whether there is the one golden feature we just did not happen to think of. But for
now, let's move on to another classification method that is known to work great in
text-based classification scenarios.
Let's say a feature has the probability of 0.9 of belonging to class 1, P(y=1) = 0.9. The
odds ratio is then P(y=1)/P(y=0) = 0.9/0.1 = 9. We could say that the chance is 9:1 that
this feature maps to class 1. If P(y=1) = 0.5, we would consequently have a 1:1 chance
that the instance is of class 1. The odds ratio is bounded below by 0 but goes to infinity
(the left graph in the following set of graphs). If we now take the logarithm of it, we
can map all probabilities between 0 and 1 to the full range from negative to positive
infinity (the right graph in the following set of graphs). The nice thing is that we still
maintain the relationship that higher probability leads to a higher log of odds, just
not limited to 0 and 1 anymore.
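A quick numeric illustration of this mapping (not from the book):
import numpy as np

p = np.array([0.1, 0.5, 0.9, 0.99])
odds = p / (1 - p)        # [ 0.11   1.     9.    99.  ]
log_odds = np.log(odds)   # [-2.20   0.     2.20   4.60]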
This means that we can now fit linear combinations of our features (OK, we only
have one and a constant, but that will change soon) to the log(odds) values. In a
sense, we replace the linear model from Chapter 1, Getting Started with Python Machine
Learning, yi = c0 + c1*xi, with log(pi / (1 - pi)) = c0 + c1*xi. We can solve this for pi,
so that we have pi = 1 / (1 + e^-(c0 + c1*xi)).
We simply have to find the right coefficients, such that the formula gives the lowest
errors for all our (xi, pi) pairs in our dataset, but that will be done by scikit-learn.
After fitting, the formula will give, for every new data point x, the probability that
it belongs to class 1:
>>> from sklearn.linear_model import LogisticRegression
>>> clf = LogisticRegression()
>>> print(clf)
P(x=7)=0.85
You might have noticed that scikit-learn exposes the first coefficient through the
special field intercept_.
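As a sketch of what the text describes, the probability at x=7 can be recomputed by hand from the fitted coefficients (assuming clf has been fitted on a single feature, as described above; lr_prob is a hypothetical helper, not the book's code):
import numpy as np

def lr_prob(clf, x):
    # P(class=1 | x) = 1 / (1 + exp(-(c0 + c1*x))), using the fitted coefficients
    return 1.0 / (1.0 + np.exp(-(clf.intercept_[0] + clf.coef_.ravel()[0] * x)))

# should agree with clf.predict_proba([[7]])[0, 1]
print("P(x=7)=%.2f" % lr_prob(clf, 7))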
If we plot the fitted model, we see that it makes perfect sense given the data:
Comparing it to the best nearest neighbor classifier (k=40) as a baseline, we see that it
performs a bit better, but also won't change the situation a whole lot.
Method           mean(scores)    stddev(scores)
LogReg C=0.1     0.64650         0.03139
LogReg C=1.00    0.64650         0.03155
LogReg C=10.00   0.64550         0.03102
LogReg C=0.01    0.63850         0.01950
40NN             0.62800         0.03750
We have shown the accuracy for different values of the regularization parameter
C. With it, we can control the model complexity, similar to the parameter k for the
nearest neighbor method. Smaller values for C result in more penalization of the
model complexity.
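The accuracies above can be obtained with a loop like this (a sketch, assuming the seven-feature matrix X and the labels Y from before):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for c in [0.01, 0.1, 1.0, 10.0]:
    clf = LogisticRegression(C=c)
    scores = cross_val_score(clf, X, Y, cv=10)
    print("LogReg C=%.2f\tmean=%.5f\tstddev=%.5f"
          % (c, scores.mean(), scores.std()))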
A quick look at the bias-variance chart for one of our best candidates, C=0.1, shows
that our model has high bias: test and train error curves approach each other closely but
stay at unacceptably high values. This indicates that logistic regression with the current
feature space is under-fitting and cannot learn a model that captures the data correctly:
So what now? We switched the model and tuned it as much as we could with our
current state of knowledge, but we still have no acceptable classifier.
More and more it seems that either the data is too noisy for this task or that our set of
features is still not appropriate to discriminate the classes well enough.
                      Classified as positive    Classified as negative
In reality positive   True positive (TP)        False negative (FN)
In reality negative   False positive (FP)       True negative (TN)
For instance, if the classifier predicts an instance to be positive and the instance
indeed is positive in reality, this is a true positive instance. If on the other hand the
classifier misclassified that instance, saying that it is negative while in reality it was
positive, that instance is said to be a false negative.
What we want is to have a high success rate when we are predicting a post as either
good or bad, but not necessarily both. That is, we want as many true positives as
possible. This is what precision captures:

Precision = TP / (TP + FP)
If instead our goal had been to detect as many good or bad answers as
possible, we would be more interested in recall:

Recall = TP / (TP + FN)
In terms of the following graphic, precision is the fraction of the right circle covered
by the intersection, while recall is the fraction of the left circle covered by the intersection:
So, how can we now optimize for precision? Up to now, we have always used 0.5
as the threshold to decide whether an answer is good or not. What we can do now
is count the number of TP, FP, and FN while varying that threshold between 0 and 1.
With those counts, we can then plot precision over recall.
The handy function precision_recall_curve() from the metrics module does all
the calculations for us:
>>> from sklearn.metrics import precision_recall_curve
>>> precision, recall, thresholds = precision_recall_curve(y_test,
        clf.predict_proba(X_test)[:, 1])
Predicting one class with acceptable performance does not always mean that
the classifier is also acceptable at predicting the other class. This can be seen in the
following two plots, where we plot the precision/recall curves for classifying bad
(the left graph) and good (the right graph) answers:

We see that we can basically forget predicting bad answers (the left plot). Precision
drops already at very low recall values and stays at an unacceptably low 60 percent.
Predicting good answers, however, shows that we can get above 80 percent precision
at a recall of almost 40 percent. Let's find out what threshold we need for that. As we
trained many classifiers on different folds (remember, we iterated over KFold() a
couple of pages back), we need to retrieve the classifier that was neither too bad nor
too good in order to get a realistic view. Let's call it the medium clone:
>>> medium = np.argsort(scores)[int(len(scores) / 2)]
>>> thresholds = np.hstack(([0], thresholds[medium]))
>>> idx80 = precisions >= 0.8
>>> print("P=%.2f R=%.2f thresh=%.2f" % (precisions[idx80][0],
        recalls[idx80][0], thresholds[idx80][0]))
P=0.80 R=0.37 thresh=0.59
Setting the threshold at 0.59, we see that we can still achieve a precision of 80
percent detecting good answers when we accept a low recall of 37 percent. That
means that we would detect only one in three good answers as such. But from that
third of good answers that we manage to detect, we would be reasonably sure that
they are indeed good. For the rest, we could then politely display additional hints on
how to improve answers in general.
To apply this threshold in the prediction process, we have to use predict_proba(),
which returns per-class probabilities, instead of predict(), which returns the
class itself:
>>> thresh80 = thresholds[idx80][0]
>>> probs_for_good = clf.predict_proba(answer_features)[:, 1]
>>> answer_class = probs_for_good > thresh80
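The per-class numbers below can be produced with classification_report (a sketch, assuming a held-out X_test/y_test split and the threshold determined above):
>>> from sklearn.metrics import classification_report
>>> pred = (clf.predict_proba(X_test)[:, 1] > thresh80).astype(int)
>>> print(classification_report(y_test, pred,
        target_names=["not accepted", "accepted"]))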
              precision    recall  f1-score   support

not accepted       0.59      0.85      0.70       101
    accepted       0.73      0.40      0.52        99

 avg / total       0.66      0.63      0.61       200
Note that using this threshold will not guarantee that we always stay
above the precision and recall values that we determined above
together with the threshold.
We see that LinkCount, AvgWordLen, NumAllCaps, and NumExclams have the biggest
impact on the overall classification decision, while NumImages (a feature that we
sneaked in just for demonstration purposes a second ago) and AvgSentLen play a
rather minor role. While the feature importance overall makes sense intuitively, it is
surprising that NumImages is basically ignored. Normally, answers containing images
are always rated high. In reality, however, answers very rarely have images. So,
although in principle it is a very powerful feature, it is too sparse to be of any value.
We could easily drop that feature and retain the same classification performance.
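The relative impact of the features can be read off the magnitude of the learned coefficients (a sketch; the feature order in feature_names is an assumption about how the feature matrix was assembled):
feature_names = ['LinkCount', 'NumTextTokens', 'NumCodeLines', 'AvgSentLen',
                 'AvgWordLen', 'NumAllCaps', 'NumExclams', 'NumImages']
# sort features by absolute coefficient size, largest impact first
for name, coef in sorted(zip(feature_names, clf.coef_.ravel()),
                         key=lambda nc: abs(nc[1]), reverse=True):
    print("%-15s %+.3f" % (name, coef))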
Ship it!
Let's assume we want to integrate this classifier into our site. What we definitely do
not want is to train the classifier each time we start the classification service. Instead,
we can simply serialize the classifier after training and then deserialize it on the site:
>>> import pickle
>>> pickle.dump(clf, open("logreg.dat", "wb"))
>>> clf = pickle.load(open("logreg.dat", "rb"))
Congratulations, the classifier is now ready to be used as if it had just been trained.
Summary
We made it! For a very noisy dataset, we built a classifier that suits a part of our goal.
Of course, we had to be pragmatic and adapt our initial goal to what was achievable.
But on the way we learned about strengths and weaknesses of nearest neighbor
and logistic regression. We learned how to extract features such as LinkCount,
NumTextTokens, NumCodeLines, AvgSentLen, AvgWordLen, NumAllCaps, NumExclams,
and NumImages, and how to analyze their impact on the classifier's performance.
But what is even more valuable is that we learned an informed way of debugging
badly performing classifiers. That will help us in the future to come up with usable
systems much faster.
After having looked into nearest neighbor and logistic regression, in the next
chapter we will get familiar with yet another simple yet powerful classification
algorithm: Naïve Bayes. Along the way, we will also learn about some more
convenient tools from scikit-learn.