Machine Learning - Data
Machine learning is a sub-domain of computer science that evolved from the study of pattern
recognition in data and from computational learning theory in artificial intelligence. It is a
first-class ticket to many of the most interesting careers in data analytics today.
As data sources proliferate along with the computing power to process them, going straight to the data
is one of the most straightforward ways to quickly gain insights and make predictions. Machine Learning
can be thought of as the study of a list of sub-problems, viz: decision making, clustering, classification,
forecasting, deep-learning, inductive logic programming, support vector machines, reinforcement
learning, similarity and metric learning, genetic algorithms, sparse dictionary learning, etc. Supervised
learning, or classification, is the machine learning task of inferring a function from labeled data.
In supervised learning, we have a training set and a test set. Both consist of examples made up of
input and output vectors, and the goal of the supervised learning algorithm is to infer a function
that maps the input vectors to the output vectors with minimal error. In an optimal scenario, a model
trained on a set of examples will classify an unseen example correctly, which requires the model to
generalize from the training set in a reasonable way. In layman's terms, supervised learning can be
described as concept learning, where a brain is exposed to a set of inputs and corresponding results
and learns the concept that relates those inputs to outputs.
A wide array of supervised machine learning algorithms is available to the machine learning
enthusiast, for example Neural Networks, Decision Trees, Support Vector Machines, Random Forests, the
Naïve Bayes Classifier, Bayes Nets, and the Majority Classifier [4,7,8,9], each with its own merits
and demerits. There is no single algorithm that works best for all cases, as implied by the No Free
Lunch theorem [3]. In this project, we try to find patterns in a dataset [2], a sample of males from a
high-risk heart-disease region of South Africa, by applying several carefully chosen algorithms to the
data and seeing what sticks.
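To make this comparison concrete, the sketch below shows how a handful of scikit-learn classifiers could be scored against one another with cross-validation. It is only an illustration: the file name SAheart.csv, the label column chd, and the assumption that all remaining columns are numeric are placeholders, not the actual schema of dataset [2].

# Hedged sketch: compare a few candidate classifiers with 5-fold cross-validation.
# "SAheart.csv" and the "chd" label column are placeholders for dataset [2].
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("SAheart.csv")              # placeholder path
X = df.drop(columns=["chd"])                 # features (assumed numeric here)
y = df["chd"]                                # binary heart-disease label

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=4),
    "Naive Bayes": GaussianNB(),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

No single model is expected to dominate here; in line with the No Free Lunch theorem, the point is simply to measure several reasonable candidates on the same data.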
At a high level, machine learning is simply the study of teaching a computer program or algorithm how
to progressively improve at a task it is given. On the research side, machine learning can be viewed
through the lens of theoretical and mathematical modeling of how this process works. More practically,
however, it is the study of how to build applications that exhibit this iterative improvement. There
are many ways to frame this idea, but there are three major recognized categories: supervised
learning, unsupervised learning, and reinforcement learning.
1) Supervised Learning
Supervised learning is the most popular paradigm for machine learning. It is the easiest to understand
and the simplest to implement. It is very similar to teaching a child with the use of flash cards.
Given data in the form of examples with labels, we can feed a learning algorithm these example-label
pairs one by one, allowing the algorithm to predict the label for each example, and giving it feedback as
to whether it predicted the right answer or not. Over time, the algorithm will learn to approximate the
exact nature of the relationship between examples and their labels. When fully-trained, the supervised
learning algorithm will be able to observe a new, never-before-seen example and predict a good label
for it.
Supervised learning is often described as task-oriented because of this. It is highly focused on a singular
task, feeding more and more examples to the algorithm until it can accurately perform on that task. This
is the learning type that you will most likely encounter, as it is exhibited in many of the following
common applications:
Advertisement Popularity: Selecting advertisements that will perform well is often a supervised learning
task. Many of the ads you see as you browse the internet are placed there because a learning algorithm
judged them to be reasonably popular (and clickable). Furthermore, the fact that an ad is placed on a
certain site or alongside a certain query (if you find yourself using a search engine) is largely due to a
learned algorithm saying that the match between ad and placement will be effective.
Spam Classification: If you use a modern email system, chances are you’ve encountered a spam filter.
That spam filter is a supervised learning system. Fed email examples and labels (spam/not spam), these
systems learn how to preemptively filter out malicious emails so that their user is not harassed by them.
Many of these also behave in such a way that a user can provide new labels to the system and it can
learn user preference.
Face Recognition: Do you use Facebook? Most likely your face has been used in a supervised learning
algorithm that is trained to recognize your face. Having a system that takes a photo, finds faces, and
guesses who that is in the photo (suggesting a tag) is a supervised process. It has multiple layers to it,
finding faces and then identifying them, but is still supervised nonetheless.
Supervised machine learning algorithms
Decision Trees
K Nearest Neighbours
Linear SVC (Support Vector Classifier)
Logistic Regression
Naive Bayes
Neural Networks
Linear Regression
Support Vector Regression (SVR)
Regression Trees
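As a small illustration of the spam-classification example above, and of the Naive Bayes entry in this list, the following sketch trains a classifier on a tiny, made-up set of labelled e-mails; the messages and labels are invented purely for demonstration.

# Minimal sketch of the spam-filter idea: Naive Bayes over word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "win a free prize now", "cheap loans click here",     # spam examples
    "meeting agenda for monday", "lunch with the team",   # non-spam examples
]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(emails, labels)                      # learn from example-label pairs

print(model.predict(["free prize inside"]))    # expected: ['spam']
print(model.predict(["agenda for the team"]))  # expected: ['ham']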
2) Unsupervised Learning
Unsupervised learning is very much the opposite of supervised learning. It features no labels. Instead,
our algorithm would be fed a lot of data and given the tools to understand the properties of the data.
From there, it can learn to group, cluster, and/or organize the data in a way such that a human (or other
intelligent algorithm) can come in and make sense of the newly organized data.
What makes unsupervised learning such an interesting area is that an overwhelming majority of data in
this world is unlabeled. Having intelligent algorithms that can take our terabytes and terabytes of
unlabeled data and make sense of it is a huge source of potential profit for many industries. That alone
could help boost productivity in a number of fields.
For example, what if we had a large database of every research paper ever published and an
unsupervised learning algorithm that knew how to group them in such a way that you were always
aware of the current progression within a particular domain of research? Now you begin a research
project yourself, hooking your work into this network that the algorithm can see. As you write
your work up and take notes, the algorithm makes suggestions to you about related works, works you
may wish to cite, and works that may even help you push that domain of research forward. With such a
tool, your productivity could be greatly boosted.
Because unsupervised learning is based upon the data and its properties, we can say that unsupervised
learning is data-driven. The outcomes of an unsupervised learning task are controlled by the data and
the way it is formatted. Some areas where you might see unsupervised learning crop up are:
Recommender Systems: If you've ever used YouTube or Netflix, you've most likely encountered a video
recommendation system. These systems are often placed in the unsupervised domain. We know
things about videos, maybe their length, their genre, etc. We also know the watch history of many users.
By taking into account users who have watched similar videos to you and then enjoyed other videos that
you have yet to see, a recommender system can see this relationship in the data and prompt you with
such a suggestion.
Buying Habits: It is likely that your buying habits are contained in a database somewhere and that data
is being bought and sold actively at this time. These buying habits can be used in unsupervised learning
algorithms to group customers into similar purchasing segments. This helps companies market to these
grouped segments and can even resemble recommender systems.
Unsupervised machine learning algorithms
K-means clustering
Dimensionality Reduction
Neural networks / Deep Learning
Principal Component Analysis
Singular Value Decomposition
Independent Component Analysis
Distribution models
Hierarchical clustering
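As a sketch of the buying-habits example above, the snippet below clusters synthetic customers into two purchasing segments with k-means; the two features (monthly spend and number of purchases) and their distributions are invented for illustration.

# Sketch: k-means groups customers with similar (synthetic) spending patterns
# without any labels being provided.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two made-up features per customer: monthly spend and number of purchases.
customers = np.vstack([
    rng.normal(loc=[20, 2],  scale=2, size=(50, 2)),   # occasional low spenders
    rng.normal(loc=[80, 10], scale=5, size=(50, 2)),   # frequent big spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.cluster_centers_)   # approximate "profile" of each segment
print(kmeans.labels_[:5])        # segment assignment of the first five customers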
3) Reinforcement Learning
Reinforcement learning is fairly different when compared to supervised and unsupervised learning.
Where we can easily see the relationship between supervised and unsupervised (the presence or
absence of labels), the relationship to reinforcement learning is a bit murkier. Some people try to tie
reinforcement learning closer to the two by describing it as a type of learning that relies on a time-
dependent sequence of labels; however, my opinion is that that simply makes things more
confusing.
So long as we provide some sort of signal to the algorithm that associates good behaviors with a
positive signal and bad behaviors with a negative one, we can reinforce our algorithm to prefer good
behaviors over bad ones. Over time, our learning algorithm learns to make fewer mistakes than it used
to.
For any reinforcement learning problem, we need an agent and an environment as well as a way to
connect the two through a feedback loop. To connect the agent to the environment, we give it a set
of actions that it can take that affect the environment. To connect the environment to the agent, we
have it continually issue two signals to the agent: an updated state and a reward (our reinforcement
signal for behavior).
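The loop below is a minimal sketch of that agent/environment interaction: tabular Q-learning on an invented five-cell corridor where the agent is rewarded for reaching the right-hand end. The environment, rewards, and hyperparameters are all toy assumptions.

# Toy agent/environment loop: tabular Q-learning on a 5-cell corridor.
import random

N_STATES, ACTIONS = 5, [-1, +1]            # actions: move left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1      # learning rate, discount, exploration

def step(state, action):
    """Environment: returns (next_state, reward); reaching cell 4 pays +1."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for episode in range(200):
    state = 0
    while state != N_STATES - 1:
        # Agent: epsilon-greedy choice between exploring and exploiting.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward = step(state, action)   # environment returns new state + reward
        best_next = max(Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = nxt

# After training, the greedy policy should move right in every non-terminal cell.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})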
Applications of reinforcement learning in the real world.
Video Games: One of the most common places to look at reinforcement learning is in learning to
play games. Look at Google DeepMind's reinforcement learning applications AlphaGo and AlphaZero, which
learned to play the game Go. Learning to play Super Mario is another common example. Currently, I don't
know of any production-grade game that has a reinforcement learning agent deployed as its game AI, but I
can imagine that this will soon be an interesting option for game devs to employ.
Industrial Simulation: For many robotic applications (think assembly lines), it is useful to have our
machines learn to complete their tasks without having to hardcode their processes. This can be a
cheaper and safer option; it can even be less prone to failure. We can also incentivize our machines
to use less electricity, so as to save money. More than that, we can start all of this within a
simulation so as not to waste money if we potentially break a machine.
Resource Management: Reinforcement learning is good for navigating complex environments and handling
the need to balance competing requirements. Take, for example, Google's data centers. They used
reinforcement learning to satisfy their power requirements as efficiently as possible, cutting major
costs. How does this affect the average person? Cheaper data storage costs and less of an impact on the
environment we all share.
Root Mean Square Error (RMSE)
The Root Mean Square Error, or RMSE, is a frequently applied measure of the differences between
values predicted by an estimator or a model and the values actually observed. The RMSE describes the
sample standard deviation of the differences between the predicted and observed values. These
differences are known as residuals when the calculations are done over the data sample that was used
to fit the model, and as prediction errors when calculated out of sample. The RMSE aggregates the
magnitudes of the errors in the predictions for various data points into a single measure of
predictive power.
• Statistically, the root mean square (RMS) is the square root of the mean square, which is the
arithmetic mean of the squares of a group of values. RMS is also called the quadratic mean and is
a special case of the generalized mean with exponent 2. For a continuously varying function, the
root mean square is defined in terms of an integral of the squares of the instantaneous values
over a cycle.
• In other words, the RMS of a group of numbers is the square root of the arithmetic mean of their
squares, and the RMS of a continuous waveform is the square root of the mean of the square of the
function that defines it.
The RMSE of a model's predicted values x_model with respect to the observed values x_obs is defined
as the square root of the mean squared error:

RMSE = sqrt( (1/n) * Σ_i (x_obs,i - x_model,i)² )
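A direct NumPy translation of that definition, using a few made-up observed and predicted values, looks as follows.

# Sketch: RMSE of predictions against observed values (toy numbers).
import numpy as np

observed  = np.array([3.0, 5.0, 2.5, 7.0])
predicted = np.array([2.5, 5.0, 3.0, 8.0])

rmse = np.sqrt(np.mean((observed - predicted) ** 2))
print(rmse)   # square root of the mean of the squared residuals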
Cascade Classifier
Object Detection using Haar feature-based cascade classifiers is an effective object detection
method proposed by Paul Viola and Michael Jones in their paper, "Rapid Object Detection using a
Boosted Cascade of Simple Features" in 2001. It is a machine learning based approach where a
cascade function is trained from a lot of positive and negative images. It is then used to detect
objects in other images.
Here we will work with face detection. Initially, the algorithm needs a lot of positive images (images
of faces) and negative images (images without faces) to train the classifier. Then we need to extract
features from them. For this, Haar features are used. They are just like our convolutional kernels.
Each feature is a single value obtained by subtracting the sum of pixels under the white rectangle
from the sum of pixels under the black rectangle.
Now, all possible sizes and locations of each kernel are used to calculate lots of features. (Just
imagine how much computation it needs: even a 24x24 window results in over 160,000 features.) For
each feature calculation, we need to find the sum of the pixels under the white and black rectangles.
To solve this, they introduced the integral image. However large your image, it reduces the
calculation of any rectangle sum to an operation involving just four pixels. Nice, isn't it? It makes
things super-fast.
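The trick can be sketched in a few lines of NumPy: once the (zero-padded) integral image is built, the sum of any rectangle is recovered from just its four corner entries. The image here is random and used only for illustration.

# Sketch of the integral-image trick behind fast Haar-feature evaluation.
import numpy as np

img = np.random.randint(0, 256, size=(24, 24))

# Integral image, padded with a leading row and column of zeros.
ii = np.zeros((25, 25), dtype=np.int64)
ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)

def rect_sum(r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] using only four integral-image lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# A Haar feature is just a difference of such rectangle sums.
assert rect_sum(5, 5, 12, 20) == img[5:12, 5:20].sum()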
But most of the features we calculate are irrelevant. Consider, for example, two good features for
faces: the first focuses on the property that the region of the eyes is often darker than the region
of the nose and cheeks; the second relies on the property that the eyes are darker than the bridge of
the nose. But the same windows applied to the cheeks or any other place are irrelevant. So how do we
select the best features out of the 160,000+?
For this, we apply each and every feature to all the training images. For each feature, we find the
best threshold that separates the faces (positives) from the non-faces (negatives). Obviously, there
will be errors or misclassifications. We select the features with the minimum error rate, which means
they are the features that most accurately separate the face and non-face images. (The process is not
as simple as this. Each image is given an equal weight at the beginning. After each round of
classification, the weights of misclassified images are increased. Then the same process is repeated:
new error rates and new weights are calculated. The process continues until the required accuracy or
error rate is achieved, or the required number of features has been found.)
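That reweighting loop can be sketched as follows for a single one-dimensional "feature"; the numbers are toy values, and real Viola-Jones training applies the same idea across all 160,000+ features.

# Simplified AdaBoost sketch: threshold (stump) weak classifiers on toy data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # toy feature values
y = np.array([ 1,   1,   1,  -1,  -1,   1])    # toy labels: +1 face, -1 non-face
w = np.ones(len(x)) / len(x)                   # every image starts with equal weight

def best_stump(x, y, w):
    """Pick the threshold and sign with the lowest weighted error."""
    best = None
    for t in x:
        for sign in (+1, -1):
            pred = np.where(x <= t, sign, -sign)
            err = np.sum(w[pred != y])
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

stumps = []
for _ in range(3):                             # a few boosting rounds
    err, t, sign = best_stump(x, y, w)
    err = max(err, 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)      # vote weight of this weak classifier
    pred = np.where(x <= t, sign, -sign)
    w *= np.exp(-alpha * y * pred)             # increase weights of misclassified images
    w /= w.sum()                               # renormalise
    stumps.append((alpha, t, sign))

# The final strong classifier is a weighted vote of the weak classifiers.
strong = np.sign(sum(a * np.where(x <= t, s, -s) for a, t, s in stumps))
print(strong, y)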
The final classifier is a weighted sum of these weak classifiers. It is called weak because it alone can't
classify the image, but together with others forms a strong classifier. The paper says even 200
features provide detection with 95% accuracy. Their final setup had around 6000 features. (Imagine
a reduction from 160000+ features to 6000 features. That is a big gain).
So now you take an image, take each 24x24 window, apply the 6000 features to it, and check whether it
is a face or not. Isn't that a little inefficient and time consuming? Yes, it is, and the authors have
a good solution for that.
In an image, most of the image is non-face region. So it is a better idea to have a simple method to
check if a window is not a face region. If it is not, discard it in a single shot, and don't process it
again. Instead, focus on regions where there can be a face. This way, we spend more time checking
possible face regions.
For this, they introduced the concept of a Cascade of Classifiers. Instead of applying all 6000
features to a window, the features are grouped into different stages of classifiers and applied one by
one. (Normally the first few stages contain far fewer features.) If a window fails the first stage,
discard it; we don't consider the remaining features on it. If it passes, apply the second stage of
features and continue the process. A window that passes all stages is a face region. How is that for a
plan!
The authors' detector had 6000+ features distributed over 38 stages, with 1, 10, 25, 25, and 50
features in the first five stages. (The two eye-region features mentioned earlier are actually the
best two features selected by AdaBoost.) According to the authors, on average 10 features out of the
6000+ are evaluated per sub-window.
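In practice, OpenCV ships a pre-trained frontal-face cascade, so using the detector takes only a few lines. The sketch below assumes the opencv-python package, which exposes the bundled cascade files through cv2.data.haarcascades; "test.jpg" is a placeholder image path.

# Usage sketch: OpenCV's pre-trained Haar cascade for frontal faces.
import cv2

cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("test.jpg")                   # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Slide the cascade over the image at multiple scales.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (255, 0, 0), 2)
cv2.imwrite("faces.jpg", img)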
About Python
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built-in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a scripting
or glue language to connect existing components together. Python's simple, easy to learn syntax
emphasizes readability and therefore reduces the cost of program maintenance. Python supports
modules and packages, which encourages program modularity and code reuse. The Python
interpreter and the extensive standard library are available in source or binary form without charge
for all major platforms, and can be freely distributed.
Often, programmers fall in love with Python because of the increased productivity it provides. Since
there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs
is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter
discovers an error, it raises an exception. When the program doesn't catch the exception, the
interpreter prints a stack trace. A source level debugger allows inspection of local and global
variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line
at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective
power. On the other hand, often the quickest way to debug a program is to add a few print
statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.
What Are IDEs and Code Editors?
An IDE (Integrated Development Environment) combines an editor designed to handle code (with, for
example, syntax highlighting and auto-completion) with build, execution, and debugging tools.
Most IDEs support many different programming languages and contain many more features. They
can, therefore, be large and take time to download and install. You may also need advanced
knowledge to use them properly.
In contrast, a dedicated code editor can be as simple as a text editor with syntax highlighting and
code formatting capabilities. Most good code editors can execute code and control a debugger. The
very best ones interact with source control systems as well. Compared to an IDE, a good dedicated
code editor is usually smaller and quicker, but often less feature rich.
Python code editors are designed to help developers write and debug programs easily. Using these
Python IDEs (Integrated Development Environments), you can manage a large codebase and achieve
quick deployment.
Developers can use these editors to create desktop or web applications. The Python IDEs can also be
used by DevOps engineers for continuous integration.
Following is a handpicked list of top Python code editors, with their popular features. The list
contains both open-source (free) and premium tools.
1) IDLE
IDLE (Integrated Development and Learning Environment) is a default editor that comes with
Python. It is one of the best Python IDEs for helping a beginner learn Python easily. The IDLE
software package is optional in many Linux distributions. The tool can be used on Windows,
macOS, and Unix.
Price: Free
2) PyCharm
PyCharm is a cross-platform IDE used for Python programming. It is one of the best Python IDE
editors and can be used on Windows, macOS, and Linux. This software provides an API that developers
can use to write their own Python plugins to extend the basic functionality.
Price: Free
Features:
It is an intelligent Python code editor with support for CoffeeScript, JavaScript, CSS, and
TypeScript.
Provides smart search to jump to any file, symbol, or class.
Smart Code Navigation
This Python editor offers quick and safe refactoring of code.
It allows you to access PostgreSQL, Oracle, MySQL, SQL Server, and many other databases
from the IDE.
3) Spyder
Spyder is a scientific integrated development environment written in Python. The software is
designed by and for scientists and engineers, and it integrates with Matplotlib, SciPy, NumPy,
Pandas, Cython, IPython, SymPy, and other open-source packages. Spyder is available through the
Anaconda open-source distribution on Windows, macOS, and Linux.
Price: Free
Features:
It is one of the best Python IDEs for Windows and allows you to run Python code by cell,
line, or file.
Plot a histogram or time series, or make changes in a dataframe or NumPy array.
It offers automatic code completion and horizontal/vertical splitting.
Find and eliminate bottlenecks.
An interactive way to trace each step of Python code execution.