Module 1
Some algorithms can deal with partially labeled training data — usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning.
Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all
your family photos to the service, it automatically recognizes that the same person A shows up in
photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised
part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are.
Just one label per person, and it is able to name everyone in every photo, which is useful for searching
photos.
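The label-propagation step above can be sketched in a few lines. This is a minimal illustration, not the service's actual algorithm: the cluster IDs and the names "Alice"/"Bob" are made-up data standing in for the output of the unsupervised clustering step.

```python
# Hypothetical output of the clustering step: cluster ID -> photo numbers.
# Photo 5 appears in both clusters because it contains both people.
photo_clusters = {"A": {1, 5, 11}, "B": {2, 5, 7}}

# The only supervision needed: one label per cluster
cluster_names = {"A": "Alice", "B": "Bob"}

# Propagate each cluster's label to every photo in that cluster
photo_labels = {}
for cluster, photos in photo_clusters.items():
    for photo in photos:
        photo_labels.setdefault(photo, set()).add(cluster_names[cluster])

print(photo_labels[5])  # photo 5 is labeled with both names
```

With two labels supplied, all seven photos end up named — the clustering did the heavy lifting.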
If you want a batch learning system to know about new data (such as a new type of spam), you
need to train a new version of the system from scratch on the full dataset (not just the new data,
but also the old data), then stop the old system and replace it with the new one.
• Training using the full set of data can take many hours.
• Training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.
• Finally, if your system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.
Fortunately, a better option in all these cases is to use algorithms that are capable of
learning incrementally.
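One way to see incremental learning in practice is scikit-learn's `partial_fit` interface, which updates a model from mini-batches without ever revisiting old data. The sketch below uses `SGDClassifier` on synthetic data; the batch sizes and the labeling rule are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(42)
model = SGDClassifier(random_state=42)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Feed the data in small mini-batches instead of retraining from scratch
for step in range(10):
    X_batch = rng.randn(20, 3)
    y_batch = (X_batch[:, 0] > 0).astype(int)  # simple synthetic rule
    model.partial_fit(X_batch, y_batch, classes=classes)

# The model has learned from each batch without storing any of them
X_new = rng.randn(5, 3)
print(model.predict(X_new))
```

Each call to `partial_fit` takes one gradient step per sample, so memory use stays constant no matter how much data streams through — exactly the property the batch-learning drawbacks above call for.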
A big challenge with online learning is that if bad data is fed to the system, the system’s performance will
gradually decline. If we are talking about a live system, your clients will notice. For example, bad data could
come from a malfunctioning sensor on a robot, or from someone spamming a search engine to try to rank high in
search results. To reduce this risk, you need to monitor your system closely and promptly switch learning off (and
possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor
the input data and react to abnormal data.
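The monitoring idea can be sketched as a simple rolling check: track recent performance scores and pause learning when their average drops below a cutoff. The threshold and window size here are made-up values — in a real system you would tune them to your metric.

```python
from collections import deque

def should_stop_learning(recent_scores, threshold=0.8):
    """Pause learning if average recent performance falls below threshold.
    (The 0.8 cutoff is an arbitrary value for this sketch.)"""
    return sum(recent_scores) / len(recent_scores) < threshold

scores = deque(maxlen=3)           # rolling window of recent scores
scores.extend([0.95, 0.93, 0.90])
print(should_stop_learning(scores))  # healthy: keep learning

scores.extend([0.60, 0.55, 0.50])  # bad data degrades performance
print(should_stop_learning(scores))  # drop detected: pause learning
```

A production system would pair this with saved model snapshots so it can revert to the last known-good state when the check fires.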
Instead of just flagging emails that are identical to known spam emails, your
spam filter could be programmed to also flag emails that are very similar to
known spam emails.
This requires a measure of similarity between two emails.
A (very basic) similarity measure between two emails could be to count the
number of words they have in common.
The system would flag an email as spam if it has many words in common with
a known spam email.
This is called instance-based learning: the system learns the examples by
heart, then generalizes to new cases by comparing them to the learned
examples (or a subset of them), using a similarity measure.
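The word-overlap similarity measure described above fits in a few lines. The spam threshold below is a made-up number for illustration; a real filter would use a far better similarity measure and a tuned cutoff.

```python
def common_words(email_a, email_b):
    """Count the distinct words two emails share (the basic measure above)."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

SPAM_THRESHOLD = 4  # arbitrary cutoff for this sketch

def looks_like_spam(email, known_spam_emails):
    """Flag an email if it shares many words with any known spam email."""
    return any(common_words(email, spam) >= SPAM_THRESHOLD
               for spam in known_spam_emails)

known_spam = "win a free prize now click here"
incoming   = "click here now to win your free prize"

print(common_words(known_spam, incoming))
print(looks_like_spam(incoming, [known_spam]))
```

Note that the "learning" here is just storing the known spam examples — generalization happens at prediction time via the similarity comparison, which is the defining trait of instance-based learning.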