The objective of this briefing is to present an overview of the machine learning techniques
currently in use or in consideration at statistical agencies worldwide. Section I outlines the main
reason why statistical agencies should start exploring the use of machine learning techniques.
Section II outlines what machine learning is, by comparing a well-known statistical technique
(logistic regression) with a (non-statistical) machine learning counterpart (support vector
machines). Sections III, IV, and V discuss current research or applications of machine learning
techniques within the field of official statistics in the areas of automatic coding, editing and
imputation, and record linkage, respectively. The material presented in this paper is the result of
a literature review, of direct contacts with authors during conferences, and more importantly of
an international call for input that was distributed on July 18, 2014 to participants from the 2014
MSIS Meeting, participants from the 2014 Work Session on Statistical Data Editing, and
members of the Modernization Committee on Production and Methods. Section VI contains a list
of machine learning applications in official statistics outside of the three areas mentioned above.
What Machine Learning means:
In the statistical context, Machine Learning is defined as an application of artificial intelligence where available information is used through algorithms to process or assist the processing of statistical data. While Machine Learning involves concepts of automation, it still requires human guidance. Machine Learning involves a high level of generalization in order to obtain a system that performs well on as-yet-unseen data instances.
Machine learning is a relatively new discipline within Computer Science that provides a
collection of data analysis techniques. Some of these techniques are based on well established
statistical methods (e.g. logistic regression and principal component analysis) while many others
are not. Most statistical techniques follow the paradigm of determining a particular probabilistic
model that best describes observed data among a class of related models. Similarly, most
machine learning techniques are designed to find models that best fit data (i.e. they solve certain
optimization problems), except that these machine learning models are no longer restricted to
probabilistic ones. An advantage of machine learning techniques over statistical ones is therefore that the former do not require an underlying probabilistic model while the latter do. Although some machine learning techniques do use probabilistic models, classical statistical techniques are often too stringent for the oncoming Big Data era, because data sources are increasingly complex and multi-faceted. Prescribing probabilistic models relating variables from disparate
data sources that are plausible and amenable to statistical analysis might be extremely difficult if
not impossible. Machine learning might be able to provide a broader class of more flexible
alternative analysis methods better suited to modern sources of data. It is imperative for
statistical agencies to explore the possible use of machine learning techniques to determine
whether their future needs might be better met with such techniques than with traditional ones.
There are two main classes of machine learning techniques: supervised machine learning and
unsupervised machine learning.
Logistic regression, when used for prediction purposes, is an example of supervised machine
learning. In logistic regression, the values of a binary response variable (with values 0 or 1, say)
as well as a number of predictor variables (covariates) are observed for a number of observation
units. These are called training data in machine learning terminology. The main hypotheses are
that the response variable follows a Bernoulli distribution (a class of probabilistic models), and
the link between the response and predictor variables is the relation that the logarithm of the
posterior odds of the response is a linear function of the predictors. The response variables of the
units are assumed to be independent of each other, and the method of maximum likelihood is
applied to their joint probability distribution to find the optimal values for the coefficients (these
parameterise the aforementioned joint distribution) in this linear function. The particular model
with these optimal coefficient values is called the “fitted model,” and can be used to “predict”
the value of the response variable for a new unit (or, “classify” the new unit as 0 or 1) for which
only the predictor values are known.
Support Vector Machines (SVMs) are an example of a non-statistical supervised machine learning technique; they have the same goal as the logistic regression classifier just described: given training data, find the best-fitting SVM model, and then use the fitted SVM model to classify new units. The difference is that the underlying models for SVM are the collection of hyperplanes in the space of the predictor variables. The optimization problem that needs to be solved is finding the hyperplane that best separates, in the
predictor space, the units with response value 0 from those with response value 1. The logistic
regression optimization problem comes from probability theory whereas that of SVM comes
from geometry.
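To make the comparison concrete, here is a small sketch (not part of the original briefing) using the scikit-learn library; the data and parameter choices below are purely illustrative:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Illustrative training data: six units, two predictor variables, binary response
X_train = np.array([[0.2, 1.1], [1.0, 0.8], [0.5, 0.4],
                    [2.2, 2.9], [3.0, 2.5], [2.7, 3.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Fit a logistic regression model (probabilistic) and a linear SVM (geometric)
logit = LogisticRegression().fit(X_train, y_train)
svm = SVC(kernel='linear').fit(X_train, y_train)

# Classify a new unit for which only the predictor values are known
X_new = np.array([[1.5, 1.5]])
print(logit.predict(X_new))  # class predicted by the fitted logistic model
print(svm.predict(X_new))    # class predicted by the fitted SVM hyperplane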
Other supervised machine learning techniques mentioned later in this briefing include decision
trees, neural networks, and Bayesian networks.
The main example of an unsupervised machine learning technique that comes from classical
statistics is principal component analysis, which seeks to “summarize” a set of data points in
high-dimensional space by finding orthogonal one-dimensional subspaces along which most of
the variation in the data points is captured. The term “unsupervised” simply refers to the fact that
there is no longer a response variable in the current setting.
Cluster analysis and association analysis are examples of non-statistical unsupervised machine
learning techniques. The former seeks to determine inherent grouping structure in given data,
whereas the latter seeks to determine co-occurrence patterns of items.
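Both principal component analysis and cluster analysis are available in standard libraries; the following sketch (not part of the original briefing, with synthetic data) illustrates them with scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic unlabeled data: 100 points in a 5-dimensional space
rng = np.random.RandomState(0)
X = rng.randn(100, 5)

# Principal component analysis: project onto the two directions of largest variation
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # share of the variation captured by each component

# Cluster analysis: group the points into three clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])  # cluster assignment of the first ten points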
Automatic Coding:
The Central Statistics Office of Ireland has reported they are developing an automatic coding
system for Classification of Individual Consumption by Purpose (COICOP) assignment for their
Household Budget Survey, using previously coded records as training data. Their method is
based on the open-source indexing and searching tool Apache Lucene (http://lucene.apache.org).
Automatic coding of census variables via Support Vector Machines (New Zealand)
Statistics New Zealand investigated the potential of using Support Vector Machines (SVM) to
improve coding of item responses in their Census. They applied SVM to code the variables
Occupation and Post-school Qualification, using two disjoint sets of observations, each of size
10,000, from Census 2013 data for training and testing. They reported a 50% correctness rate on the testing data for both variables, and concluded that further investigations would be necessary to evaluate SVM as an automatic coding methodology.
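The type of pipeline involved in such an application can be sketched as follows; this is an illustration only (not the method used by Statistics New Zealand), and the example responses and codes are made up:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Made-up training data: free-text occupation responses with manually assigned codes
responses = ["primary school teacher", "secondary teacher", "dairy farmer",
             "sheep and beef farmer", "software developer", "web programmer"]
codes = ["2341", "2330", "1211", "1211", "2512", "2513"]

# TF-IDF text features feeding a linear SVM classifier
coder = make_pipeline(TfidfVectorizer(), LinearSVC())
coder.fit(responses, codes)

# Automatically code a new, previously unseen response
print(coder.predict(["high school teacher"]))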
About python:
The Python language has a substantial body of documentation, much of it contributed by various
authors. The markup used for the Python documentation is reStructuredText, developed by the docutils project, amended by custom directives, and using a toolset named Sphinx to post-process the HTML output.
This document describes the style guide for our documentation as well as the custom reStructuredText markup introduced by Sphinx to support Python documentation and how it should be used.
The documentation in HTML, PDF or EPUB format is generated from text files written using
the reStructuredText format and contained in the CPython Git repository.
Introduction
Python’s documentation has long been considered to be good for a free programming language.
There are a number of reasons for this, the most important being the early commitment of
Python’s creator, Guido van Rossum, to providing documentation on the language and its
libraries, and the continuing involvement of the user community in providing assistance for
creating and maintaining documentation.
The involvement of the community takes many forms, from authoring to bug reports to just plain
complaining when the documentation could be more complete or easier to use.
This document is aimed at authors and potential authors of documentation for Python. More
specifically, it is for people contributing to the standard documentation and developing
additional documents using the same tools as the standard documents. This guide will be less
useful for authors using the Python documentation tools for topics other than Python, and less
useful still for authors not using the tools at all.
If your interest is in contributing to the Python documentation, but you don’t have the time or
inclination to learn reStructuredText and the markup structures documented here, there’s a
welcoming place for you among the Python contributors as well. Any time you feel that you can
clarify existing documentation or provide documentation that’s missing, the existing
documentation team will gladly work with you to integrate your text, dealing with the markup
for you. Please don’t let the material in this document stand between the documentation and your
desire to help out!
Style guide:
Use of whitespace
All reST files use an indentation of 3 spaces; no tabs are allowed. The maximum line length is 80
characters for normal text, but tables, deeply indented code samples and long links may extend
beyond that. Code example bodies should use normal Python 4-space indentation.
Make generous use of blank lines where applicable; they help group things together.
A sentence-ending period may be followed by one or two spaces; while reST ignores the second
space, it is customarily put in by some users, for example to aid Emacs’ auto-fill mode.
Footnotes
Footnotes are generally discouraged, though they may be used when they are the best way to
present specific information. When a footnote reference is added at the end of the sentence, it
should follow the sentence-ending punctuation. The reST markup should appear something like
this:
This sentence has a footnote reference. [#]_ This is the next sentence.
Footnotes should be gathered at the end of a file, or if the file is very long, at the end of a section.
The docutils will automatically create backlinks to the footnote reference.
Capitalization
Sentence case
Sentence case is a set of capitalization rules used in English sentences: the first word is always
capitalized and other words are only capitalized if there is a specific rule requiring it.
In the Python documentation, the use of sentence case in section titles is preferable, but
consistency within a unit is more important than following this rule. If you add a section to a
chapter where most sections are in title case, you can either convert all titles to sentence case or
use the dominant style in the new section title.
Sentences that start with a word for which specific rules require starting it with a lower case
letter should be avoided.
Many special names are used in the Python documentation, including the names of operating
systems, programming languages, standards bodies, and the like. Most of these entities are not
assigned any special markup, but the preferred spellings are given here to aid authors in
maintaining the consistency of presentation in the Python documentation.
Other terms and words deserve special mention as well; these conventions should be used to
ensure consistency throughout the documentation:
CPU
For “central processing unit.” Many style guides say this should be spelled out on the first use
(and if you must use it, do so!). For the Python documentation, this abbreviation should be
avoided since there’s no reasonable way to predict which occurrence will be the first seen by the
reader. It is better to use the word “processor” instead.
POSIX
The name assigned to a particular group of standards. This is always uppercase.
Python
The name of our favorite programming language is always capitalized.
reST
For “reStructuredText,” an easy to read, plaintext markup syntax used to produce Python documentation. When spelled out, it is always one word and both forms start with a lowercase ‘r’.
Unicode
The name of a character coding system. This is always written capitalized.
Unix
The name of the operating system developed at AT&T Bell Labs in the early 1970s.
Affirmative Tone
The documentation focuses on affirmatively stating what the language does and how to use it
effectively.
Except for certain security or segfault risks, the docs should avoid wording along the lines of
“feature x is dangerous” or “experts only”. These kinds of value judgments belong in external
blogs and wikis, not in the core documentation.
Bad example (creating worry in the mind of a reader):
Warning: failing to explicitly close a file could result in lost data or excessive resource consumption. Never rely on reference counting to automatically close a file.
Good example (establishing confident knowledge in the effective use of the language):
A best practice for using files is to use a try/finally pair to explicitly close a file after it is used.
Alternatively, using a with-statement can achieve the same effect. This assures that files are
flushed and file descriptor resources are released in a timely manner.
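A minimal sketch of both patterns described above (the file name here is hypothetical):
# try/finally: the file is closed even if an exception occurs while reading it
f = open('data.txt')
try:
    data = f.read()
finally:
    f.close()

# with-statement: achieves the same effect more concisely
with open('data.txt') as f:
    data = f.read()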
Economy of Expression
More documentation is not necessarily better documentation. Err on the side of being succinct. It
is an unfortunate fact that making documentation longer can be an impediment to understanding
and can result in even more ways to misread or misinterpret the text. Long descriptions full of
corner cases and caveats can create the impression that a function is more complex or harder to
use than it actually is.
Some modules provided with Python are inherently exposed to security issues (e.g. shell
injection vulnerabilities) due to the purpose of the module (e.g. ssl). Littering the documentation
of these modules with red warning boxes for problems that are due to the task at hand, rather
than specifically to Python’s support for that task, doesn’t make for a good reading experience.
Instead, these security concerns should be gathered into a dedicated “Security Considerations”
section within the module’s documentation, and cross-referenced from the documentation of
affected interfaces with a note similar to: "Please refer to the :ref:`security-considerations` section for important information on how to avoid common mistakes."
Similarly, if there is a common error that affects many interfaces in a module (e.g. OS level pipe
buffers filling up and stalling child processes), these can be documented in a “Common Errors”
section and cross-referenced rather than repeated for every affected interface.
Code Examples
Short code examples can be a useful adjunct to understanding. Readers can often grasp a simple
example more quickly than they can digest a formal description in prose.
People learn faster with concrete, motivating examples that match the context of a typical use
case. For instance, the str.rpartition() method is better demonstrated with an example splitting the
domain from a URL than it would be with an example of removing the last word from a line of
Monty Python dialog.
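A small illustrative session in that spirit (not the exact example from the documentation), splitting the final path component off a URL at the last slash:
>>> 'www.example.com/docs/tutorial.html'.rpartition('/')
('www.example.com/docs', '/', 'tutorial.html')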
The ellipsis for the sys.ps2 secondary interpreter prompt should only be used sparingly, where it
is necessary to clearly differentiate between input lines and output lines. Besides contributing
visual clutter, it makes it difficult for readers to cut-and-paste examples so they can experiment
with variations.
Code Equivalents
Giving pure Python code equivalents (or approximate equivalents) can be a useful adjunct to a
prose description. A documenter should carefully weigh whether the code equivalent adds value.
A good example is the code equivalent for all(). The short 4-line code equivalent is easily
digested; it re-emphasizes the early-out behavior; and it clarifies the handling of the corner-case
where the iterable is empty. In addition, it serves as a model for people wanting to implement a
commonly requested alternative where all() would return the specific object evaluating to False
whenever the function terminates early.
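For reference, the documented pure-Python equivalent of all() is along these lines:
def all(iterable):
    # Return False as soon as any element is falsy (the early-out behavior);
    # an empty iterable never enters the loop, so the result is True.
    for element in iterable:
        if not element:
            return False
    return True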
A more questionable example is the code for itertools.groupby(). Its code equivalent borders on
being too complex to be a quick aid to understanding. Despite its complexity, the code
equivalent was kept because it serves as a model to alternative implementations and because the
operation of the “grouper” is more easily shown in code than in English prose.
An example of when not to use a code equivalent is for the oct() function. The exact steps in converting a number to octal don’t add value for a user trying to learn what the function does.
Audience
The tone of the tutorial (and all the docs) needs to be respectful of the reader’s intelligence.
Don’t presume that the readers are stupid. Lay out the relevant information, show motivating use
cases, provide glossary links, and do your best to connect-the-dots, but don’t talk down to them
or waste their time.
The tutorial is meant for newcomers, many of whom will be using the tutorial to evaluate the
language as a whole. The experience needs to be positive and not leave the reader with worries
that something bad will happen if they make a misstep. The tutorial serves as a guide for intelligent and curious readers, saving details for the how-to guides and other sources.
Be careful accepting requests for documentation changes from the rare but vocal category of
reader who is looking for vindication for one of their programming errors (“I made a mistake,
therefore the docs must be wrong …”). Typically, the documentation wasn’t consulted until after
the error was made. It is unfortunate, but typically no documentation edit would have saved the
user from making false assumptions about the language (“I was surprised by …”).
OpenCV introduction:
OpenCV was started at Intel in 1999 by Gary Bradski, and the first release came out in 2000. Vadim Pisarevsky joined Gary Bradski to manage Intel’s Russian software OpenCV team. In 2005, OpenCV was used on Stanley, the vehicle that won the 2005 DARPA Grand Challenge. Later its active development continued under the support of Willow Garage, with Gary Bradski and Vadim Pisarevsky leading the project. Right now, OpenCV supports a lot of algorithms related to Computer Vision and Machine Learning, and it is expanding day by day. Currently OpenCV supports a wide variety of programming languages like C++, Python, and Java, and is available on different platforms including Windows, Linux, OS X, Android, and iOS. Interfaces based on CUDA and OpenCL are also under active development for high-speed GPU operations.
OpenCV-Python is the Python API of OpenCV. It combines the best qualities of the OpenCV C++ API and the Python language. Python is a general purpose programming language started by Guido van Rossum, which became very popular in a short time mainly because of its simplicity and code readability. It enables the programmer to express ideas in fewer lines of code without reducing readability. Compared to languages like C/C++, Python is slower. But another important feature of Python is that it can be easily extended with C/C++. This feature helps us to write computationally intensive code in C/C++ and create a Python wrapper for it so that we can use these wrappers as Python modules. This gives us two advantages: first, our code is as fast as the original C/C++ code (since it is the actual C++ code working in the background), and second, it is very easy to code in Python. This is how OpenCV-Python works: it is a Python wrapper around the original C++ implementation. The support of Numpy makes the task even easier. Numpy is a highly optimized library for numerical operations. It gives a MATLAB-style syntax. All the OpenCV array structures are converted to and from Numpy arrays. So whatever operations you can do in Numpy, you can combine with OpenCV, which increases the number of weapons in your arsenal. Besides that, several other libraries that support Numpy, such as SciPy and Matplotlib, can be used with it. So OpenCV-Python is an appropriate tool for fast prototyping of computer vision problems.
Since OpenCV is an open source initiative, all are welcome to make contributions to this library, and the same goes for this tutorial. So, if you find any mistake in this tutorial (whether it be a small spelling mistake or a big error in code or concepts, whatever), feel free to correct it. That will be a good task for newcomers who begin to contribute to open source projects. Just fork OpenCV on GitHub, make the necessary corrections and send a pull request to OpenCV. OpenCV developers will check your pull request, give you important feedback and, once it passes the approval of the reviewer, it will be merged into OpenCV. Then you become an open source contributor. The same applies to other tutorials, documentation, etc. As new modules are added to OpenCV-Python, this tutorial will have to be expanded. So those who know about a particular algorithm can write up a tutorial which includes a basic theory of the algorithm and code showing basic usage of the algorithm, and submit it to OpenCV. Remember, together we can make this project a great success!
Contributors
Below is the list of contributors who submitted tutorials to OpenCV-Python.
Additional Resources
1. A Quick guide to Python - A Byte of Python
2. OpenCV Documentation
3. OpenCV Forum
We will learn to set up OpenCV-Python on your Windows system. The steps below were tested on a Windows 7 (64-bit) machine with Visual Studio 2010 and Visual Studio 2012. The screenshots show VS2012.
1. The Python packages below are to be downloaded and installed to their default locations.
1.1. Python-2.7.x.
1.2. Numpy.
1.3. Matplotlib (Matplotlib is optional, but recommended since we use it a lot in our tutorials).
2. Install all packages into their default locations. Python will be installed to C:/Python27/.
3. After installation, open Python IDLE. Enter import numpy and make sure Numpy is working
fine.
4. Download the latest OpenCV release from the SourceForge site and double-click to extract it.
5. Go to the opencv/build/python/2.7 folder.
If the results are printed out without any errors, congratulations! You have installed OpenCV-Python successfully.
1. Python 3.6.8.x
2. Numpy
3. Matplotlib (Matplotlib is optional, but recommended since we use it a lot in our tutorials.)
4. Download OpenCV source. It can be from Sourceforge (for official release version) or from
Github (for latest source).
7.2. Click on Browse Build... and locate the build folder we created.
7.3. Click on Configure.
7.4. It will open a new window to select the compiler. Choose appropriate compiler
(here, Visual Studio 11) and click Finish.
8. You will see all the fields are marked in red. Click on the WITH field to expand it. It decides
what extra features you need. So mark appropriate fields. See the below image:
9. Now click on BUILD field to expand it. First few fields configure the build method. See the
below image:
10. The remaining fields specify what modules are to be built. Since GPU modules are not yet supported by OpenCV-Python, you can skip them to save time (but if you work with them, keep them there). See the image below:
11. Now click on ENABLE field to expand it. Make sure ENABLE_SOLUTION_FOLDERS is
unchecked (Solution folders are not supported by Visual Studio Express edition). See the image
below:
12. Also make sure that in the PYTHON field, everything is filled. (Ignore
PYTHON_DEBUG_LIBRARY). See image below:
14. Now go to our opencv/build folder. There you will find OpenCV.sln file. Open it with Visual
Studio.
16. In the solution explorer, right-click on the Solution (or ALL_BUILD) and build it. It will
take some time to finish.
17. Again, right-click on INSTALL and build it. Now OpenCV-Python will be installed.
18. Open Python IDLE and enter import cv2. If no error occurs, it is installed correctly.
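A quick check along these lines confirms the installation (the exact version string will depend on the release you built):
import cv2
print(cv2.__version__)  # prints the installed OpenCV version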
Use the function cv2.imread() to read an image. The image should be in the working directory, or a full path to the image should be given. The second argument is a flag which specifies the way the image should be read.
import numpy as np
import cv2
# Load a color image in grayscale
img = cv2.imread('messi5.jpg',0)
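The commonly used flag values are listed below for reference (these constants are part of the OpenCV Python API; the numeric value 0 used above corresponds to grayscale):
# cv2.IMREAD_COLOR (1): load a color image, ignoring any alpha channel (default)
# cv2.IMREAD_GRAYSCALE (0): load the image in grayscale mode
# cv2.IMREAD_UNCHANGED (-1): load the image as-is, including the alpha channel
img_color = cv2.imread('messi5.jpg', cv2.IMREAD_COLOR)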
Warning: Even if the image path is wrong, it won’t throw any error, but print img will give you
None
Display an image
Use the function cv2.imshow() to display an image in a window. The window automatically fits to the image size. The first argument is a window name, which is a string. The second argument is our image. You can create as many windows as you wish, but with different window names.
cv2.imshow('image',img)
cv2.waitKey(0)
cv2.destroyAllWindows()
Write an image
Use the function cv2.imwrite() to save an image. First argument is the file name, second
argument is the image you want to save.
cv2.imwrite('messigray.png',img)
This will save the image in PNG format in the working directory.
Sum it up
The program below loads an image in grayscale, displays it, saves the image if you press ‘s’ and exits, or simply exits without saving if you press the ESC key.
import numpy as np
import cv2
img = cv2.imread('messi5.jpg',0)
cv2.imshow('image',img)
k = cv2.waitKey(0)
if k == 27:         # wait for ESC key to exit
    cv2.destroyAllWindows()
elif k == ord('s'): # wait for 's' key to save and exit
    cv2.imwrite('messigray.png',img)
    cv2.destroyAllWindows()
Using Matplotlib
Matplotlib is a plotting library for Python which gives you a wide variety of plotting methods. You will see them in coming articles. Here, you will learn how to display an image with Matplotlib. You can zoom images, save them, etc. using Matplotlib.
import numpy as np
import cv2
from matplotlib import pyplot as plt
img = cv2.imread('messi5.jpg',0)
plt.imshow(img, cmap = 'gray', interpolation = 'bicubic')
plt.xticks([]), plt.yticks([]) # to hide tick values on X and Y axis
plt.show()
Drawing Rectangle
To draw a rectangle, you need the top-left corner and bottom-right corner of the rectangle. This time we will draw a green rectangle at the top-right corner of the image.
img = cv2.rectangle(img,(384,0),(510,128),(0,255,0),3)
To put text in images, you need to specify the text data you want to write, the position coordinates of where you want to put it (i.e. the bottom-left corner where the text starts), the font type (check the cv2.putText() docs for supported fonts), the font scale, and regular things like color, thickness, lineType, etc. For a better look, lineType = cv2.LINE_AA is recommended.
font = cv2.FONT_HERSHEY_SIMPLEX
cv2.putText(img,'OpenCV',(10,500), font, 4,(255,255,255),2,cv2.LINE_AA)
Result
So it is time to see the final result of our drawing. As you studied in previous articles, display the image to see it.
Object Tracking
Now that we know how to convert a BGR image to HSV, we can use this to extract a colored object. In HSV, it is easier to represent a color than in the RGB color-space. In our application, we will try to extract a blue colored object. So here is the method:
import cv2
import numpy as np
cap = cv2.VideoCapture(0)
while(1):
    # Take each frame
    _, frame = cap.read()
    # Convert BGR to HSV
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    # define range of blue color in HSV
    lower_blue = np.array([110,50,50])
    upper_blue = np.array([130,255,255])
    # Threshold the HSV image to get only blue colors
    mask = cv2.inRange(hsv, lower_blue, upper_blue)
    # Bitwise-AND mask and original image
    res = cv2.bitwise_and(frame,frame, mask= mask)
    cv2.imshow('frame',frame)
    cv2.imshow('mask',mask)
    cv2.imshow('res',res)
    k = cv2.waitKey(5) & 0xFF
    if k == 27:
        break
cv2.destroyAllWindows()
Numpy:
NumPy, which stands for Numerical Python, is a library consisting of multidimensional array
objects and a collection of routines for processing those arrays. Using NumPy, mathematical and
logical operations on arrays can be performed. This tutorial explains the basics of NumPy such
as its architecture and environment. It also discusses the various array functions, types of
indexing, etc. An introduction to Matplotlib is also provided. All this is explained with the help
of examples for better understanding.
The best way to enable NumPy is to use an installable binary package specific to your operating
system. These binaries contain the full SciPy stack (inclusive of NumPy, SciPy, matplotlib, IPython,
SymPy and nose packages along with core Python).
Building from Source
Core Python (2.6.x, 2.7.x and 3.2.x onwards) must be installed with distutils, and the zlib module should be enabled.
GNU gcc (4.2 and above) C compiler must be available.
To check whether NumPy is installed, try to import it from the Python prompt:
import numpy
If it is not installed, an error message similar to the following will be displayed:
Traceback (most recent call last):
  ...
ImportError: No module named 'numpy'
By convention, NumPy is imported under the alias np:
import numpy as np
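As a brief taste of the multidimensional array object described above, here is a minimal sketch (the values are purely illustrative):
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # a 2 x 3 multidimensional array
print(a.shape)        # (2, 3): two rows, three columns
print(a[0, 1])        # indexing: row 0, column 1 -> 2
print(a * 2)          # mathematical operations act elementwise on the whole array
print(a.sum(axis=0))  # column sums: [5 7 9]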
Pandas:
Pandas is an open-source, BSD-licensed Python library providing high-performance, easy-to-use
data structures and data analysis tools for the Python programming language. Python with Pandas is used in a wide range of fields, including academic and commercial domains such as finance, economics, statistics, and analytics. In this tutorial, we will learn the various features of Python Pandas and how to use them in practice.
Pandas is an open-source Python library providing high-performance data manipulation and analysis tools built on its powerful data structures. The name Pandas is derived from “Panel Data”, an econometrics term for multidimensional data.
In 2008, developer Wes McKinney started developing pandas when in need of a high-performance, flexible tool for the analysis of data.
Prior to Pandas, Python was majorly used for data munging and preparation. It had very little
contribution towards data analysis. Pandas solved this problem. Using Pandas, we can
accomplish five typical steps in the processing and analysis of data, regardless of the origin of
data — load, prepare, manipulate, model, and analyze.
Key Features of Pandas
● Fast and efficient DataFrame object with default and customized indexing.
● Tools for loading data into in-memory data objects from different file formats.
● Data alignment and integrated handling of missing data.
● Reshaping and pivoting of data sets.
● Label-based slicing, indexing and subsetting of large data sets.
● Columns from a data structure can be deleted or inserted.
● Group by data for aggregation and transformations.
● High performance merging and joining of data.
● Time Series functionality.
The standard Python distribution doesn't come bundled with the Pandas module. A lightweight alternative is to install Pandas using the popular Python package installer, pip.
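For example (assuming pip is available on your system):
pip install pandas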
Let us now create a Series and walk through its basic attributes (axes, empty, ndim, size, values) and the head() and tail() methods.
Example
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s
0 0.967853
1 -0.148368
2 -1.395906
3 -1.758394
dtype: float64
axes
Returns the list of the labels of the series.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s.axes
empty
Returns the Boolean value saying whether the object is empty or not.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s.empty
ndim
Returns the number of dimensions of the underlying data; a Series is 1-dimensional.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s
print s.ndim
0 0.175898
1 0.166197
2 -0.609712
3 -1.377000
dtype: float64
size
Returns the size (number of elements) of the series.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(2))
print s
print s.size
0 3.078058
1 -1.207803
dtype: float64
values
Returns the actual data in the series as an array.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s
print s.values
0 1.787373
1 -0.605159
2 0.180477
3 -0.140922
dtype: float64
head()
Returns the first n rows of the series.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s
print s.head(2)
Its output is as follows −
tail()
Returns the last n rows of the series.
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(4))
print s
print s.tail(2)
Deep learning is a subfield of machine learning concerned with algorithms inspired by the structure and function of the brain.
TensorFlow is the second machine learning framework that Google created and used to design, build, and train deep learning models. You can use the TensorFlow library to do numerical computations, which in itself doesn’t seem all too special, but these computations are done with data flow graphs. In these graphs, nodes represent mathematical operations, while the edges represent the data, usually multidimensional data arrays or tensors, that is communicated between the nodes.
You see? The name “TensorFlow” is derived from the operations which neural networks perform
on multidimensional data arrays or tensors! It’s literally a flow of tensors. For now, this is all you
need to know about tensors, but you’ll go deeper into this in the next sections!
Today’s TensorFlow tutorial for beginners will introduce you to performing deep learning in an
interactive way:
● First, the tutorial briefly goes over some of the ways that you can install TensorFlow on your system so that you’re able to get started and load data in your workspace;
● After this, you’ll go over some of the TensorFlow basics: you’ll see how you can easily
get started with simple computations.
● After this, you get started on the real work: you’ll load in data on Belgian traffic signs and explore it with simple statistics and plotting.
● In your exploration, you’ll see that there is a need to manipulate your data in such a way
that you can feed it to your model. That’s why you’ll take the time to rescale your images
and convert them to grayscale.
● Next, you can finally get started on your neural network model! You’ll build up your
model layer per layer;
● Once the architecture is set up, you can use it to train your model interactively and to
eventually also evaluate it by feeding some test data to it.
● Lastly, you’ll get some pointers for further improvements that you can do to the model
you just constructed and how you can continue your learning with TensorFlow.
Also, you could be interested in a course on Deep Learning in Python, DataCamp's Keras
tutorial or the keras with R tutorial.
Introducing Tensors
To understand tensors well, it’s good to have some working knowledge of linear algebra and
vector calculus. You already read in the introduction that tensors are implemented in TensorFlow
as multidimensional data arrays, but some more introduction is maybe needed in order to
completely grasp tensors and their use in machine learning.
Plane Vectors
Before you go into plane vectors, it’s a good idea to shortly revise the concept of “vectors”;
Vectors are special types of matrices, which are rectangular arrays of numbers. Because vectors
are ordered collections of numbers, they are often seen as column matrices: they have just one
column and a certain number of rows. In other terms, you could also consider vectors as scalar
magnitudes that have been given a direction.
Remember: an example of a scalar is “5 meters” or “60 m/sec”, while a vector is, for example,
“5 meters north” or “60 m/sec East”. The difference between these two is obviously that the
vector has a direction. Nevertheless, these examples that you have seen up until now might seem
far off from the vectors that you might encounter when you’re working with machine learning
problems. This is normal; The length of a mathematical vector is a pure number: it is absolute.
The direction, on the other hand, is relative: it is measured relative to some reference direction
and has units of radians or degrees. You usually assume that the direction is positive and in
counterclockwise rotation from the reference direction.
Visually, of course, you represent vectors as arrows, as you can see in the picture above. This
means that you can consider vectors also as arrows that have direction and length. The direction
is indicated by the arrow’s head, while the length is indicated by the length of the arrow.
Plane vectors are the most straightforward setup of tensors. They are much like regular vectors as
you have seen above, with the sole difference that they find themselves in a vector space. To
understand this better, let’s start with an example: you have a vector that is 2 X 1. This means
that the vector belongs to the set of real numbers that come paired two at a time. Or, stated
differently, they are part of two-space. In such cases, you can represent vectors on the
coordinate (x,y) plane with arrows or rays.
Working from this coordinate plane in a standard position where vectors have their endpoint at
the origin (0,0), you can derive the x coordinate by looking at the first row of the vector, while
you’ll find the y coordinate in the second row. Of course, this standard position doesn’t always
need to be maintained: vectors can move parallel to themselves in the plane without experiencing
changes.
Note that similarly, for vectors that are of size 3 X 1, you talk about the three-space. You can
represent the vector as a three-dimensional figure with arrows pointing to positions in the vector space: they are drawn on the standard x, y and z axes.
It’s nice to have these vectors and to represent them on the coordinate plane, but in essence, you
have these vectors so that you can perform operations on them and one thing that can help you in
doing this is by expressing your vectors as bases or unit vectors.
Unit vectors are vectors with a magnitude of one. You’ll often recognize the unit vector by a
lowercase letter with a circumflex, or “hat”. Unit vectors will come in convenient if you want to
express a 2-D or 3-D vector as a sum of two or three orthogonal components, such as the x− and
y−axes, or the z−axis.
And when you are talking about expressing one vector, for example, as sums of components,
you’ll see that you’re talking about component vectors, which are two or more vectors whose
sum is that given vector.
Tip: watch this video, which explains what tensors are with the help of simple household
objects!
Tensors
Besides plane vectors, covectors and linear operators are two other cases; all three have one thing in common: they are specific cases of tensors. You still remember how a vector was characterized in the previous section as scalar magnitudes that have been given a direction. A tensor, then, is the mathematical representation of a physical entity that may be characterized by magnitude and multiple directions.
And, just like you represent a scalar with a single number and a vector with a sequence of three numbers in a 3-dimensional space, for example, a tensor can be represented by an array of 3^R numbers in a 3-dimensional space.
The “R” in this notation represents the rank of the tensor: this means that in a 3-dimensional
space, a second-rank tensor can be represented by 3 to the power of 2 or 9 numbers. In an
N-dimensional space, scalars will still require only one number, while vectors will require N
numbers, and tensors will require N^R numbers. This explains why you often hear that scalars
are tensors of rank 0: since they have no direction, you can represent them with one number.
With this in mind, it’s relatively easy to recognize scalars, vectors, and tensors and to set them
apart: scalars can be represented by a single number, vectors by an ordered set of numbers, and
tensors by an array of numbers.
What makes tensors so unique is the combination of components and basis vectors: basis vectors
transform one way between reference frames and the components transform in just such a way as
to keep the combination between components and basis vectors the same.
Installing TensorFlow
Now that you know more about TensorFlow, it’s time to get started and install the library. Here,
it’s good to know that TensorFlow provides APIs for Python, C++, Haskell, Java, Go, Rust, and
there’s also a third-party package for R called tensorflow.
Tip: if you want to know more about deep learning packages in R, consider checking out
DataCamp’s keras: Deep Learning in R Tutorial.
In this tutorial, you will download a version of TensorFlow that will enable you to write the code
for your deep learning project in Python. On the TensorFlow installation webpage, you’ll see
some of the most common ways and latest instructions to install TensorFlow
using virtualenv, pip, or Docker; lastly, there are also some other ways of installing TensorFlow on your personal computer.
Note You can also install TensorFlow with Conda if you’re working on Windows. However,
since the installation of TensorFlow is community supported, it’s best to check the official
installation instructions.
Now that you have gone through the installation process, it’s time to double check that you have
installed TensorFlow correctly by importing it into your workspace under the alias tf:
import tensorflow as tf
Note that the alias that you used in the line of code above is sort of a convention - It’s used to
ensure that you remain consistent with other developers that are using TensorFlow in data
science projects on the one hand, and with open-source TensorFlow projects on the other hand.
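As a small sketch of the simple computations mentioned earlier, here is an elementwise multiplication of two tensors (written for TensorFlow 2.x, where eager execution is the default; with 1.x releases you would evaluate the result inside a tf.Session instead):
import tensorflow as tf

# Two constant tensors (one-dimensional data arrays)
x1 = tf.constant([1, 2, 3, 4])
x2 = tf.constant([5, 6, 7, 8])

# Elementwise multiplication expressed as an operation on tensors
result = tf.multiply(x1, x2)
print(result)  # under eager execution: tf.Tensor([ 5 12 21 32], shape=(4,), dtype=int32)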
Keras:
Two of the top numerical platforms in Python that provide the basis for Deep Learning research
and development are Theano and TensorFlow.
Both are very powerful libraries, but both can be difficult to use directly for creating deep
learning models.
In this post, you will discover the Keras Python library that provides a clean and convenient way
to create a range of deep learning models on top of Theano or TensorFlow.
Keras is a minimalist Python library for deep learning that can run on top of Theano or
TensorFlow.
It was developed to make implementing deep learning models as fast and easy as possible for
research and development.
It runs on Python 2.7 or 3.5 and can seamlessly execute on GPUs and CPUs given the underlying
frameworks. It is released under the permissive MIT license.
Keras was developed and is maintained by François Chollet, a Google engineer, using four guiding principles:
Modularity: A model can be understood as a sequence or a graph alone. All the concerns of a
deep learning model are discrete components that can be combined in arbitrary ways.
Minimalism: The library provides just enough to achieve an outcome, no frills and maximizing
readability.
Extensibility: New components are intentionally easy to add and use within the framework,
intended for researchers to trial and explore new ideas.
Python: No separate model files with custom file formats. Everything is native Python.
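As a minimal sketch of what this convenience looks like in practice (the layer sizes and the number of input features below are made up purely for illustration):
from keras.models import Sequential
from keras.layers import Dense

# A small fully-connected network for binary classification of 8 input features
model = Sequential()
model.add(Dense(32, activation='relu', input_dim=8))  # hidden layer
model.add(Dense(1, activation='sigmoid'))             # output layer

# Compile with an optimizer, a loss and a metric, then inspect the architecture
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()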
Keras is relatively straightforward to install if you already have a working Python and SciPy
environment.
You must also have an installation of Theano or TensorFlow on your system already.
You can check your version of Keras on the command line with a one-liner like the following:
python -c "import keras; print(keras.__version__)"
which will print the installed version, for example:
1.1.0
You can upgrade your installation of Keras with pip using the same method:
pip install --upgrade keras
Sklearn:
In general, a learning problem considers a set of n samples of data and then tries to predict
properties of unknown data. If each sample is more than a single number and, for instance, a
multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
supervised learning, in which the data comes with additional attributes that we want to predict (click here to go to the scikit-learn supervised learning page). This problem can be either:
classification: samples belong to two or more classes and we want to learn from already labeled
data how to predict the class of unlabeled data. An example of a classification problem would be
handwritten digit recognition, in which the aim is to assign each input vector to one of a finite
number of discrete categories. Another way to think of classification is as a discrete (as opposed
to continuous) form of supervised learning where one has a limited number of categories and for
each of the n samples provided, one is to try to label them with the correct category or class.
regression: if the desired output consists of one or more continuous variables, then the task is
called regression. An example of a regression problem would be the prediction of the length of a
salmon as a function of its age and weight.
unsupervised learning, in which the training data consists of a set of input vectors x without any
corresponding target values. The goal in such problems may be to discover groups of similar
examples within the data, where it is called clustering, or to determine the distribution of data
within the input space, known as density estimation, or to project the data from a
high-dimensional space down to two or three dimensions for the purpose of visualization (Click
here to go to the Scikit-Learn unsupervised learning page).
scikit-learn comes with a few standard datasets, for instance the iris and digits datasets for
classification and the boston house prices dataset for regression.
In the following, we start a Python interpreter from our shell and then load
the iris and digits datasets. Our notational convention is that $ denotes the shell prompt
while >>> denotes the Python interpreter prompt:
$ python
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some metadata about the data.
This data is stored in the .data member, which is an (n_samples, n_features) array. In the case of a supervised problem, one or more response variables are stored in the .target member. More
details on the different datasets can be found in the dedicated section.
For instance, in the case of the digits dataset, digits.data gives access to the features that can be
used to classify the digits samples:
>>> print(digits.data)
[[ 0. 0. 5. ... 0. 0. 0.]
[ 0. 0. 0. ... 10. 0. 0.]
[ 0. 0. 0. ... 16. 9. 0.]
...
[ 0. 0. 1. ... 6. 0. 0.]
[ 0. 0. 2. ... 12. 0. 0.]
[ 0. 0. 10. ... 12. 1. 0.]]
and digits.target gives the ground truth for the digit dataset, that is the number corresponding to
each digit image that we are trying to learn:
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
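Continuing the session above, a simple classifier can be fit to the digits data following the standard scikit-learn estimator pattern; the hyper-parameter values here are only illustrative:
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
>>> clf = clf.fit(digits.data[:-1], digits.target[:-1])  # learn from all but the last image
>>> clf.predict(digits.data[-1:])                        # predict the digit of the last image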