Big Data Analytics Tutorial
About the Tutorial
Big Data Analytics largely involves collecting data from different sources, munging it so that it can be consumed by analysts, and finally delivering data products that are useful to the organization's business.
The process of converting large amounts of unstructured raw data, retrieved from different
sources to a data product useful for organizations forms the core of Big Data Analytics.
In this tutorial, we will discuss the most fundamental concepts and methods of Big Data
Analytics.
Audience
This tutorial has been prepared for software professionals aspiring to learn the basics of Big Data Analytics. Professionals who work in analytics in general may also use this tutorial to good effect.
Prerequisites
Before you start proceeding with this tutorial, we assume that you have prior exposure to
handling huge volumes of unprocessed data at an organizational level.
Throughout this tutorial, we will develop a mini project to provide exposure to a real-world problem and how to solve it using Big Data Analytics. You can download the necessary files of this project from this link: http://www.tools.tutorialspoint.com/bda/
Copyright & Disclaimer
All the content and graphics published in this e-book are the property of Tutorials Point (I) Pvt. Ltd. The user of this e-book is prohibited from reusing, retaining, copying, distributing or republishing any contents or a part of the contents of this e-book in any manner without the written consent of the publisher.
We strive to update the contents of our website and tutorials as timely and as precisely as possible; however, the contents may contain inaccuracies or errors. Tutorials Point (I) Pvt. Ltd. provides no guarantee regarding the accuracy, timeliness or completeness of our website or its contents, including this tutorial. If you discover any errors on our website or in this tutorial, please notify us at [email protected]
Table of Contents
About the Tutorial
Audience
Prerequisites
Copyright & Disclaimer
Table of Contents
19. Big Data Analytics Machine Learning for Data Analysis
Supervised Learning
Unsupervised Learning
Big Data Analytics Overview
The volume of data that one has to deal with has exploded to unimaginable levels in the past decade, and at the same time, the price of data storage has systematically reduced. Private companies and research institutions capture terabytes of data about their users' interactions, business, social media, and also sensors from devices such as mobile phones and automobiles. The challenge of this era is to make sense of this sea of data. This is where big data analytics comes into the picture.
Big Data Analytics largely involves collecting data from different sources, munging it so that it can be consumed by analysts, and finally delivering data products that are useful to the organization's business.
The process of converting large amounts of unstructured raw data, retrieved from different
sources to a data product useful for organizations forms the core of Big Data Analytics.
Big Data Analytics Data Life Cycle
CRISP-DM Methodology
The CRISP-DM methodology, which stands for Cross Industry Standard Process for Data Mining, is a cycle that describes the approaches commonly used by data mining experts to tackle problems in traditional BI data mining. It is still being used in traditional BI data mining teams.
Take a look at the following illustration. It shows the major stages of the cycle as described
by the CRISP-DM methodology and how they are interrelated.
CRISP-DM was conceived in 1996, and the next year it got underway as a European Union project under the ESPRIT funding initiative. The project was led by five companies: SPSS, Teradata, Daimler AG, NCR Corporation, and OHRA (an insurance company). The project was finally incorporated into SPSS. The methodology is extremely detail-oriented in how a data mining project should be specified.
Let us now learn a little more about each of the stages involved in the CRISP-DM life cycle:
Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary plan.
Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data Preparation: The data preparation phase covers all activities to construct
the final dataset (data that will be fed into the modeling tool(s)) from the initial
raw data. Data preparation tasks are likely to be performed multiple times, and not
in any prescribed order. Tasks include table, record, and attribute selection as well
as transformation and cleaning of data for modeling tools.
Modeling: In this phase, various modeling techniques are selected and applied
and their parameters are calibrated to optimal values. Typically, there are several
techniques for the same data mining problem type. Some techniques have specific
requirements on the form of data. Therefore, it is often required to step back to
the data preparation phase.
Evaluation: At this stage in the project, you have built a model (or models) that
appears to have high quality, from a data analysis perspective. Before proceeding
to final deployment of the model, it is important to evaluate the model thoroughly
and review the steps executed to construct the model, to be certain it properly
achieves the business objectives.
A key objective is to determine if there is some important business issue that has
not been sufficiently considered. At the end of this phase, a decision on the use of
the data mining results should be reached.
Deployment: Creation of the model is generally not the end of the project. Even
if the purpose of the model is to increase knowledge of the data, the knowledge
gained will need to be organized and presented in a way that is useful to the
customer.
In many cases, it will be the customer, not the data analyst, who will carry out the
deployment steps. Even if the analyst deploys the model, it is important for the customer
to understand upfront the actions which will need to be carried out in order to actually
make use of the created models.
SEMMA Methodology
SEMMA is another methodology developed by SAS for data mining modeling. It stands for Sample, Explore, Modify, Model, and Assess. Here is a brief description of its stages:
Sample: The process starts with data sampling, e.g., selecting the dataset for modeling. The dataset should be large enough to contain sufficient information to retrieve, yet small enough to be used efficiently. This phase also deals with data partitioning.
Explore: This phase covers the understanding of the data by discovering anticipated and unanticipated relationships between the variables, with the help of data visualization.
Modify: The Modify phase contains methods to select, create and transform variables in preparation for data modeling.
Model: In the Model phase, the focus is on applying various modeling (data mining)
techniques on the prepared variables in order to create models that possibly
provide the desired outcome.
Assess: The evaluation of the modeling results shows the reliability and usefulness
of the created models.
The main difference between CRISP-DM and SEMMA is that SEMMA focuses on the modeling aspect, whereas CRISP-DM gives more importance to the stages of the cycle prior to modeling, such as understanding the business problem to be solved and understanding and preprocessing the data to be used as input to, for example, machine learning algorithms.
A big data analytics life cycle can be described by the following stages:
Research
Data Acquisition
Data Munging
Data Storage
Modeling
Implementation
In this section, we will throw some light on each of these stages of the big data life cycle.
Research
Analyze what other companies have done in the same situation. This involves looking for solutions that are reasonable for your company, even if it means adapting other solutions to the resources and requirements your company has. In this stage, a methodology for the future stages should be defined.
Data Acquisition
This stage is key in a big data life cycle; it also defines which types of profiles will be needed to deliver the resultant data product. Data gathering is a non-trivial step of the process; it normally involves gathering unstructured data from different sources. To give an example, it could involve writing a crawler to retrieve reviews from a website. This involves dealing with text, perhaps in different languages, and normally requires a significant amount of time to be completed.
Data Munging
Once the data is retrieved, for example, from the web, it needs to be stored in an easy-to-use format. To continue with the reviews example, let's assume the data is retrieved from different sites, each with a different display of the data.
Suppose one data source gives reviews in terms of a rating in stars; therefore it is possible to read this as a mapping for the response variable y ∈ {1, 2, 3, 4, 5}. Another data source gives reviews using a two-arrow system, one for up voting and the other for down voting. This would imply a response variable of the form y ∈ {positive, negative}.
In order to combine both data sources, a decision has to be made to make these two response representations equivalent. This can involve converting the first representation to the second one, considering one star as negative and five stars as positive. This process often requires a large time allocation to be delivered with good quality. A minimal sketch of such a conversion is shown below.
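The following R sketch illustrates the idea; the vectors and the mapping rule used here are assumptions for illustration, not part of the tutorial's project files.
# Toy reviews from the star-rating source and from the arrow-voting source
stars = c(1, 4, 5, 2, 5)
votes = c('up', 'down', 'up')
# Map both representations to a common {positive, negative} response
y1 = ifelse(stars >= 4, 'positive', 'negative')   # assumed threshold: 4+ stars is positive
y2 = ifelse(votes == 'up', 'positive', 'negative')
# A single response vector combining both sources
y = c(y1, y2)
table(y)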
Data Storage
Once the data is processed, it sometimes needs to be stored in a database. Big data technologies offer plenty of alternatives regarding this point. The most common alternative is using the Hadoop Distributed File System (HDFS) for storage, which, combined with Hive, provides users a limited version of SQL known as the Hive Query Language (HQL). This allows most analytics tasks to be done in similar ways
as would be done in traditional BI data warehouses, from the user perspective. Other
storage options to be considered are MongoDB, Redis, and SPARK.
This stage of the cycle is related to the knowledge of the human resources team in terms of their abilities to implement different architectures. Modified versions of traditional data warehouses are still being used in large-scale applications. For example, Teradata and IBM offer SQL databases that can handle terabytes of data; open source solutions such as PostgreSQL and MySQL are still being used for large-scale applications.
Even though there are differences in how the different storage solutions work in the background, from the client side most of them provide a SQL API. Hence, having a good understanding of SQL is still a key skill to have for big data analytics.
A priori, this stage seems to be the most important topic; in practice, this is not true. It is not even an essential stage. It is possible to implement a big data solution that works with real-time data, in which case we only need to gather data to develop the model and then implement it in real time. So there would not be a need to formally store the data at all.
Modeling
The prior stage should have produced several datasets for training and testing, for example, for a predictive model. This stage involves trying different models with a view to solving the business problem at hand. In practice, it is normally desired that the model gives some insight into the business. Finally, the best model or combination of models is selected by evaluating its performance on a left-out dataset.
Implementation
In this stage, the data product developed is implemented in the data pipeline of the company. This involves setting up a validation scheme while the data product is working, in order to track its performance. For example, in the case of implementing a predictive model, this stage would involve applying the model to new data and, once the response is available, evaluating the model.
Big Data Analytics Methodology
In terms of methodology, big data analytics differs significantly from the traditional statistical approach of experimental design. Analytics starts with data. Normally we model the data in a way that explains a response. The objective of this approach is to predict the response behavior or understand how the input variables relate to a response. Normally in statistical experimental designs, an experiment is developed and data is retrieved as a result. This allows generating data in a way that can be used by a statistical model, where certain assumptions hold, such as independence, normality, and randomization.
In big data analytics, we are presented with the data. We cannot design an experiment
that fulfills our favorite statistical model. In large-scale applications of analytics, a large
amount of work (normally 80% of the effort) is needed just for cleaning the data, so it can
be used by a machine learning model.
One of the most important tasks in big data analytics is statistical modeling, meaning supervised and unsupervised classification or regression problems. Once the data is cleaned and preprocessed, and available for modeling, care should be taken in evaluating different models with reasonable loss metrics; then, once a model is implemented, further evaluation and results should be reported. A common pitfall in predictive modeling is to just implement the model and never measure its performance.
Big Data Analytics Core Deliverables
As mentioned in the big data life cycle, the data products that result from developing a big data product are, in most cases, some of the following:
Big Data Analytics Key Stakeholders
Check who and where are the sponsors of other projects similar to the one that
interests you.
Having personal contacts in key management positions helps, so any contact can
be triggered if the project is promising.
Who would benefit from your project? Who would be your client once the project is
on track?
Develop a simple, clear, and exciting proposal and share it with the key players in your organization.
The best way to find sponsors for a project is to understand the problem and what would
be the resulting data product once it has been implemented. This understanding will give
an edge in convincing the management of the importance of the big data project.
Big Data Analytics Data Analyst
Many organizations struggle hard to find competent data scientists in the market. It is, however, a good idea to select prospective data analysts and teach them the relevant skills to become data scientists. This is by no means a trivial task and would normally involve the person doing a master's degree in a quantitative field, but it is definitely a viable option.
The basic skills a competent data analyst must have are listed below:
Business understanding
SQL programming
Dashboard development
Big Data Analytics Data Scientist
The role of a data scientist is normally associated with tasks such as predictive modeling,
developing segmentation algorithms, recommender systems, A/B testing frameworks and
often working with raw unstructured data.
The nature of their work demands a deep understanding of mathematics, applied statistics
and programming. There are a few skills common between a data analyst and a data
scientist, for example, the ability to query databases. Both analyze data, but the decision
of a data scientist can have a greater impact in an organization.
In big data analytics, people normally confuse the role of a data scientist with that of a data architect. In reality, the difference is quite simple. A data architect defines the tools and the architecture in which the data will be stored, whereas a data scientist uses this architecture. Of course, a data scientist should be able to set up new tools if needed for ad-hoc projects, but the infrastructure definition and design should not be a part of their task.
Big Data Analytics Problem Definition
Throughout this tutorial, we will develop a project. Each subsequent chapter deals with a part of the larger project in its mini-project section. This is intended to be an applied tutorial section that will provide exposure to a real-world problem. In this case, we start with the problem definition of the project.
Project Description
The objective of this project would be to develop a machine learning model to predict the
hourly salary of people using their curriculum vitae (CV) text as input.
Using the framework defined above, it is simple to define the problem. We can define X = {x1, x2, ..., xn} as the CVs of the users, where each feature can be, in the simplest way possible, the number of times a given word appears. The response is real valued: we are trying to predict the hourly salary of individuals in dollars.
These two considerations are enough to conclude that the problem presented can be
solved with a supervised regression algorithm.
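As a toy illustration (not part of the project files; the CV strings and salaries below are invented), the word-count feature matrix and the real-valued response could be built in R as follows:
cvs = c('data analyst with sql and r experience',
        'senior software engineer with java and sql experience')
salary = c(28, 40)   # hypothetical hourly salaries in dollars
# Split each CV into words and count how often each vocabulary word appears
words = strsplit(cvs, '\\W+')
vocab = unique(unlist(words))
X = t(sapply(words, function(w) table(factor(w, levels = vocab))))
X
# The problem is then a supervised regression of salary on the columns of X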
Problem Definition
Problem definition is probably one of the most complex and most heavily neglected stages in the big data analytics pipeline. In order to define the problem a data product would solve, experience is mandatory. Most aspiring data scientists have little or no experience in this stage. Most data analytics problems fall into one of the following categories:
Supervised classification
Supervised regression
Unsupervised learning
Learning to rank
Supervised Classification
Given a matrix of features X = {x1, x2, ..., xn}, we develop a model M to predict different classes defined as y = {c1, c2, ..., cn}. For example: given transactional data of customers of an insurance company, it is possible to develop a model that predicts whether a client will churn or not. The latter is a binary classification problem, where there are two classes or target values: churn and not churn. A minimal sketch of such a model is shown below.
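The sketch uses a simulated dataset and hypothetical column names (n_claims, tenure_years), since no churn data ships with the tutorial:
set.seed(1)
customers = data.frame(n_claims = rpois(1000, 2),
                       tenure_years = runif(1000, 0, 10))
# Simulate the churn label from the predictors
p = plogis(-1 + 0.5 * customers$n_claims - 0.3 * customers$tenure_years)
customers$churn = factor(rbinom(1000, 1, p), labels = c('not_churn', 'churn'))
# Fit a logistic regression and inspect the predicted churn probabilities
fit = glm(churn ~ n_claims + tenure_years, data = customers, family = binomial)
head(predict(fit, type = 'response'))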
Other problems involve predicting more than one class. For example, we could be interested in doing digit recognition, in which case the response vector would be defined as y = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}; a state-of-the-art model would be a convolutional neural network, and the matrix of features would be defined as the pixels of the image.
Supervised Regression
In this case, the problem definition is rather similar to the previous example; the difference
relies on the response. In a regression problem, the response y , this means the
response is real valued. For example, we can develop a model to predict the hourly salary
of individuals given the corpus of their CV.
Unsupervised Learning
Management is often thirsty for new insights. Segmentation models can provide this insight for the marketing department to develop products for different segments. A good approach for developing a segmentation model, rather than thinking of algorithms, is to select the features that are relevant to the desired segmentation.
Learning to Rank
This problem can be considered a regression problem, but it has particular characteristics and deserves a separate treatment. Given a collection of documents, the problem is to find the most relevant ordering for a given query. In order to develop a supervised learning algorithm, it is necessary to label how relevant an ordering is, given a query.
Big Data Analytics Data Collection
Data collection plays the most important role in the big data cycle. The Internet provides almost unlimited sources of data on a variety of topics. The importance of this area depends on the type of business, but traditional industries can acquire diverse sources of external data and combine them with their transactional data.
For example, let's assume we would like to build a system that recommends restaurants. The first step would be to gather data, in this case, reviews of restaurants from different websites, and store them in a database. As we are interested in raw text and would use that for analytics, it is not that relevant where the data for developing the model would be stored. This may sound contradictory with the main big data technologies, but in order to implement a big data application, we simply need to make it work in real time.
First of all, create a Twitter account, and then follow the instructions in the twitteR package vignette to create a Twitter developer account. This is a summary of those instructions:
After filling in the basic info, go to the "Settings" tab and select "Read, Write and Access direct messages"
In the "Details" tab, take note of your consumer key and consumer secret
In your R session, you'll be using the API key and API secret values
Finally, run the following script. This will install the twitteR package from its repository on GitHub.
library(devtools)
install_github("geoffjentry/twitteR")
We are interested in getting data where the string "big mac" is included and finding out which topics stand out about it. In order to do this, the first step is collecting the data from Twitter. Below is our R script to collect the required data from Twitter. This code is also available in the bda/part1/collect_data/collect_data_twitter.R file.
library(twitteR)
### Replace the xxx's with the values you got from the previous instructions
# consumer_key = "xxxxxxxxxxxxxxxxxxxx"
# consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# access_token = "xxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
# access_token_secret= "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Authenticate (uncomment and fill in the keys above first) and retrieve tweets
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_token_secret)
tweets <- searchTwitter('big mac', n = 2000, lang = 'en')

# Convert the list of tweets to a data.frame
df <- twListToDF(tweets)
head(df)
# Which clients were used to post the tweets?
sources <- sapply(tweets, function(x) x$getStatusSource())
source_table = table(sources)
freq = sort(source_table, decreasing = TRUE)
as.data.frame(freq)
# Frequency
# recognia 20
Big Data Analytics Cleansing Data
Once the data is collected, we normally have diverse data sources with different characteristics. The most immediate step would be to make these data sources homogeneous and continue to develop our data product. However, it depends on the type of data; we should ask ourselves if it is practical to homogenize the data.
Maybe the data sources are completely different, and the information loss will be large if the sources are homogenized. In this case, we can think of alternatives. Can one data source help me build a regression model and the other one a classification model? Is it possible to work with the heterogeneity to our advantage rather than just lose information? Taking these decisions is what makes analytics interesting and challenging.
In the case of reviews, it is possible to have a language for each data source. Again, we
have two choices:
For example, after getting the tweets we get these strange characters:
"<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>". These are probably
emoticons, so in order to clean the data, we will just remove them using the following
script. This code is also available in bda/part1/collect_data/cleaning_data.R file.
source('collect_data_twitter.R')
# Some tweets
head(df$text)
[1] "Im not a big fan of turkey but baked Mac & cheese
<ed><U+00A0><U+00BD><ed><U+00B8><U+008B>"
# A minimal text-cleaning function (the original script may apply more rules)
clean.text <- function(tx) {
   tx = gsub("[^[:alnum:]# ]", " ", tx)  # keep only alphanumerics, hashes and spaces
   return(tx)
}
clean_tweets <- clean.text(df$text)
# Cleaned tweets
head(clean_tweets)
[1] " WeNeedFeminlsm MAC s new make up line features men woc and big girls "
[1] " TravelsPhoto What Happens To Your Body One Hour After A Big Mac "
The final step of the data cleansing mini project is to have clean text that we can convert to a matrix and apply an algorithm to. From the text stored in the clean_tweets vector, we can easily convert it to a bag-of-words matrix and apply an unsupervised learning algorithm, as sketched below.
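A possible sketch of this step uses the tm package (an assumption; the tutorial does not prescribe a specific library):
library(tm)
# Build a bag-of-words (document-term) matrix from the cleaned tweets
corpus = Corpus(VectorSource(clean_tweets))
dtm = DocumentTermMatrix(corpus, control = list(tolower = TRUE,
   removePunctuation = TRUE, stopwords = TRUE))
X = as.matrix(dtm)
# Apply an unsupervised learning algorithm, e.g. k-means with 5 clusters
fit = kmeans(X, centers = 5)
table(fit$cluster)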
Big Data Analytics Summarizing Data
Reporting is very important in big data analytics. Every organization must have a regular provision of information to support its decision-making process. This task is normally handled by data analysts with SQL and ETL (extract, transform, and load) experience.
The team in charge of this task has the responsibility of spreading the information
produced in the big data analytics department to different areas of the organization.
The following example demonstrates what summarization of data means. Navigate to the
folder bda/part1/summarize_data and inside the folder, open the
summarize_data.Rproj file by double clicking it. Then, open the summarize_data.R
script and take a look at the code, and follow the explanations presented.
# Install the packages used in this chapter (an assumed list matching the libraries below)
pkgs = c('nycflights13', 'ggplot2', 'data.table', 'reshape2')
install.packages(pkgs)
The ggplot2 package is great for data visualization. The data.table package is a great option for fast and memory-efficient summarization in R. A recent benchmark shows it is even faster than pandas, the Python library used for similar tasks.
Take a look at the data using the following code. This code is also available in
bda/part1/summarize_data/summarize_data.Rproj file.
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)
DT <- as.data.table(flights)
dim(DT)
head(DT)
# Overall mean arrival delay
DT[, list(mean_arrival_delay = mean(arr_delay, na.rm = TRUE))]
#   mean_arrival_delay
# 1:          6.895377
# The same value, computed for each carrier
mean1 = DT[, list(mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]
print(mean1)
# carrier mean_arrival_delay
# 1: UA 3.5580111
# 2: AA 0.3642909
# 3: B6 9.4579733
# 4: DL 1.6443409
# 5: EV 15.7964311
# 6: MQ 10.7747334
# 7: US 2.1295951
# 8: WN 9.6491199
# 9: VX 1.7644644
# 10: FL 20.1159055
# 11: AS -9.9308886
# 12: 9E 7.3796692
# 13: F9 21.9207048
# 14: HA -6.9152047
# 15: YV 15.5569853
# 16: OO 11.9310345
mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
   mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]
print(mean2)
# 1: UA 12.106073 3.5580111
# 2: AA 8.586016 0.3642909
# 3: B6 13.022522 9.4579733
# 4: DL 9.264505 1.6443409
# 5: EV 19.955390 15.7964311
# 6: MQ 10.552041 10.7747334
# 7: US 3.782418 2.1295951
# 8: WN 17.711744 9.6491199
# 9: VX 12.869421 1.7644644
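The print(median_gain) call below presumably follows the creation of a gain variable; a minimal data.table sketch of that step, assuming gain is defined as the difference between arrival and departure delay, would be:
# Create a gain column and compute its median by carrier (assumed definition)
DT[, gain := arr_delay - dep_delay]
median_gain = DT[, list(median_gain = median(gain, na.rm = TRUE)), by = carrier]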
print(median_gain)
Big Data Analytics Data Exploration
Exploratory data analysis is a concept developed by John Tukey (1977) that consists of a new perspective on statistics. Tukey's idea was that in traditional statistics the data was not being explored graphically; it was just being used to test hypotheses. The first attempt to develop a tool was made at Stanford; the project was called PRIM-9. The tool was able to visualize data in nine dimensions, and therefore it was able to provide a multivariate perspective of the data.
Nowadays, exploratory data analysis is a must and has been included in the big data analytics life cycle. The ability to find insight and communicate it effectively in an organization is fueled by strong EDA capabilities.
Based on Tukey's ideas, Bell Labs developed the S programming language in order to provide an interactive interface for doing statistics. The idea of S was to provide extensive graphical capabilities with an easy-to-use language. In today's world, in the context of big data, R, which is based on the S programming language, is the most popular software for analytics.
The following is an example of exploratory data analysis. This code is also available in
part1/eda/exploratory_data_analysis.R file.
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)
DT <- as.data.table(flights)
mean2 = DT[, list(mean_departure_delay = mean(dep_delay, na.rm = TRUE),
   mean_arrival_delay = mean(arr_delay, na.rm = TRUE)),
   by = carrier]
# In order to plot data in R usign ggplot, it is normally needed to reshape the data
# We want to have the data in long format for plotting with ggplot
dt = melt(mean2, id.vars = 'carrier')
print(head(dt))
# Take a look at the help for ?geom_point and ?geom_line to find similar examples
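The object p printed below is the ggplot built from this reshaped data; a minimal sketch of how it could be constructed (the exact aesthetics in the original script may differ) is:
p = ggplot(dt, aes(x = carrier, y = value, colour = variable, group = variable)) +
   geom_point() +
   geom_line() +
   theme_bw() +
   labs(x = 'Carrier', y = 'Mean delay (minutes)')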
print(p)
ggsave('mean_delay_by_carrier.png', p,
width=10.4, height=5.07)
Big Data Analytics Data Visualization
In order to understand data, it is often useful to visualize it. Normally in big data applications, the interest lies in finding insight rather than just making beautiful plots. The following are examples of different approaches to understanding data using plots.
To start analyzing the flights data, we can start by checking if there are correlations
between numeric variables. This code is also available in
bda/part1/data_visualization/data_visualization.R file.
install.packages('corrplot')
library(corrplot)
library(nycflights13)
library(ggplot2)
library(data.table)
library(reshape2)
DT <- as.data.table(flights)
# Keep only some numeric variables (an assumed selection)
numeric_vars = c('dep_time', 'dep_delay', 'arr_time', 'arr_delay',
   'air_time', 'distance')
flights_num = na.omit(DT[, numeric_vars, with = FALSE])
cor_mat = cor(flights_num)
print(cor_mat)
# save it to disk
png('corrplot.png')
corrplot(cor_mat, method = 'ellipse')
dev.off()
We can see in the plot that there is a strong correlation between some of the variables in the dataset. For example, arrival delay and departure delay seem to be highly correlated. We can see this because the ellipse shows an almost linear relationship between both variables; however, it is not simple to infer causation from this result. We cannot say that, because two variables are correlated, one has an effect on the other. We also find in the plot a strong correlation between air time and distance, which is fairly reasonable to expect, as with more distance the flight time should grow.
We can also do univariate analysis of the data. A simple and effective way to visualize distributions is box-plots. The following code demonstrates how to produce box-plots and trellis charts using the ggplot2 library. This code is also available in the bda/part1/data_visualization/boxplots.R file.
source('data_visualization.R')
# Distance as a function of carrier (a reconstruction of the original plot call)
p = ggplot(DT, aes(x = carrier, y = distance, fill = carrier)) +
   geom_boxplot() +
   theme_bw() +
   guides(fill = FALSE) +
   labs(title = 'Distance as a function of carrier',
      x = 'Carrier', y = 'Distance')
# Save to disk
png('boxplot_carrier.png')
print(p)
dev.off()
# Now add the month variable to produce a trellis plot
p = ggplot(DT, aes(x = carrier, y = distance, fill = carrier)) +
   geom_boxplot() +
   theme_bw() +
   guides(fill = FALSE) +
   facet_wrap(~month) + # This creates the trellis plot with the by month variable
   labs(title = 'Distance as a function of carrier by month',
      x = 'Carrier', y = 'Distance')
# The plot shows there aren't clear differences between distance in different months
# Save to disk
png('boxplot_carrier_by_month.png')
print(p)
dev.off()
Big Data Analytics Introduction to R
This section is devoted to introducing the user to the R programming language. R can be downloaded from the CRAN website. For Windows users, it is useful to install Rtools and the RStudio IDE.
Navigate to the folder of the book zip file bda/part2/R_introduction and open the R_introduction.Rproj file. This will open an RStudio session. Then open the 01_vectors.R file. Run the script line by line and follow the comments in the code. Another useful option in order to learn is to just type the code; this will help you get used to R syntax. In R, comments are written with the # symbol.
In order to display the results of running R code in the book, after code is evaluated, the results R returns are commented. This way, you can copy-paste the code in the book and directly try sections of it in R.
numbers = c(1, 2, 3, 4, 5)
print(numbers)
# [1] 1 2 3 4 5
# Create a character vector
ltrs = c('a', 'b', 'c', 'd', 'e')
print(ltrs)
# [1] "a" "b" "c" "d" "e"
# Concatenate both
mixed_vec = c(numbers, ltrs)
print(mixed_vec)
# [1] "1" "2" "3" "4" "5" "a" "b" "c" "d" "e"
Let's analyze what happened in the previous code. We can see it is possible to create vectors with numbers and with letters. We did not need to tell R what data type we wanted beforehand. Finally, we were able to create a vector with both numbers and letters. The vector mixed_vec has coerced the numbers to character; we can see this by visualizing how the values are printed inside quotes.
The following code shows the data types of different vectors as returned by the function class. It is common to use the class function to "interrogate" an object, asking it what its class is.
# Integer vector
num = 1:10
class(num)
# [1] "integer"
# Numeric vector
num = as.numeric(num)
class(num)
# [1] "numeric"
# Character vector
ltrs = letters[1:10]
class(ltrs)
# [1] "character"
# Factor vector
fac = as.factor(ltrs)
class(fac)
# [1] "factor"
R supports two-dimensional objects also. In the following code, there are examples of the
two most popular data structures used in R: the matrix and data.frame.
# Matrix
M = matrix(1:12, ncol=4)
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
lM = matrix(letters[1:12], ncol=4)
cbind(M, lM)
class(M)
# [1] "matrix"
class(lM)
# [1] "matrix"
# data.frame
# One of the main objects of R, handles different data types in the same object.
df = data.frame(n=1:5, l=letters[1:5])
df
# n l
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
As demonstrated in the previous example, it is possible to use different data types in the same object. In general, this is how data is presented in databases and APIs: part of the data is text or character vectors and other parts are numeric. It is the analyst's job to determine which statistical data type to assign and then use the correct R data type for it. In statistics we normally consider that variables are of the following types:
Numeric
Nominal or categorical
Ordinal
In R, these correspond to the following classes:
Numeric - Integer
Factor
Ordered Factor
R provides a data type for each statistical type of variable. The ordered factor is rarely used, but can be created with the functions factor and ordered, as shown below.
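As a quick illustration (not part of the tutorial scripts), an ordered factor for an ordinal variable can be created as follows:
sizes = factor(c('small', 'large', 'medium'),
   levels = c('small', 'medium', 'large'), ordered = TRUE)
class(sizes)
# [1] "ordered" "factor"
sizes[1] < sizes[2]   # ordinal comparisons are now meaningful
# [1] TRUE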
The following section treats the concept of indexing. This is a quite common operation,
and deals with the problem of selecting sections of an object and making transformations
to them.
df = data.frame(numbers=1:26, letters)
head(df)
# numbers letters
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 6 f
# str gives the structure of a data.frame; it's a good summary to inspect an object
str(df)
# The latter shows the letters character vector was coerced as a factor.
class(df)
# [1] "data.frame"
### Indexing
df[1, ]
# numbers letters
# 1 1 a
# $numbers
# [1] 1
# $letters
# [1] a
# Levels: a b c d e f g h i j k l m n o p q r s t u v w x y z
df[5:7, ]
# numbers letters
# 5 5 e
# 6 6 f
# 7 7 g
### Add one column that mixes the numeric column with the factor column
df$mixed = paste(df$numbers, df$letters, sep = '')
str(df)
df[, 1]
# Select a subset of columns
df2 = df[, c('numbers', 'letters')]
head(df2)
# numbers letters
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
# 6 6 f
df3 = df[, c('numbers', 'mixed')]
df3[1:3, ]
# numbers mixed
# 1 1 1a
# 2 2 2b
# 3 3 3c
names(df)
# It is also possible to select columns by name.
# This is the best practice in programming, as indices often change, but
# variable names don't
df4 = df[, c('numbers', 'mixed')]
head(df4)
# numbers mixed
# 1 1 1a
# 2 2 2b
# 3 3 3c
# 4 4 4d
# 5 5 5e
# 6 6 6f
df5 = df4[1:5, ]  # select the first five rows (an assumed selection)
df5
# numbers mixed
# 1 1 1a
# 2 2 2b
# 3 3 3c
# 4 4 4d
# 5 5 5e
df6 = df4[df4$numbers < 10, ]  # select rows with a logical condition (an assumed filter)
df6
# numbers mixed
# 1 1 1a
# 2 2 2b
# 3 3 3c
# 4 4 4d
# 5 5 5e
# 6 6 6f
# 7 7 7g
# 8 8 8h
# 9 9 9i
Big Data Analytics Introduction to SQL
SQL stands for Structured Query Language. It is one of the most widely used languages for extracting data from databases in traditional data warehouses and big data technologies. In order to demonstrate the basics of SQL, we will be working with examples. In order to focus on the language itself, we will be using SQL inside R. In terms of writing SQL code, this is exactly how it would be done in a database.
The core of SQL consists of three statements: SELECT, FROM and WHERE. The following examples make use of the most common use cases of SQL. Navigate to the folder bda/part2/SQL_introduction and open the SQL_introduction.Rproj file. Then open the 01_select.R script. In order to write SQL code in R, we need to install the sqldf package as demonstrated in the following code.
install.packages('sqldf')
library('sqldf')
library(nycflights13)
str(flights)
# $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
# $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
# $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
# $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
# $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
The select statement is used to retrieve columns from tables and do calculations on them.
The simplest SELECT statement is demonstrated in ej1. We can also create new variables
as shown in ej2.
ej1 = sqldf("
SELECT
dep_time
,dep_delay
,arr_time
,carrier
,tailnum
FROM
flights")
head(ej1)
# In R we can use SQL with the sqldf function. It works exactly the same as in
a database
# The data.frame (in this case flights) represents the table we are querying
and goes in the FROM statement
# We can also compute new variables in the select statement using the syntax:
# old_variables as new_variable
ej2 = sqldf("
  SELECT
  arr_delay - dep_delay as gain,
  carrier
  FROM
  flights")
ej2[1:5, ]
# gain carrier
# 1 9 UA
# 2 16 UA
# 3 31 AA
# 4 -17 B6
# 5 -19 DL
One of the most commonly used features of SQL is the GROUP BY statement. It allows computing a numeric value for different groups of another variable. Open the script 02_group_by.R.
### GROUP BY
ej3 = sqldf("
SELECT
avg(arr_delay) as mean_arr_delay,
avg(dep_delay) as mean_dep_delay,
carrier
FROM
flights
GROUP BY
carrier
")
# 1 7.3796692 16.725769 9E
# 2 0.3642909 8.586016 AA
# 3 -9.9308886 5.804775 AS
# 4 9.4579733 13.022522 B6
# 5 1.6443409 9.264505 DL
# 6 15.7964311 19.955390 EV
# 7 21.9207048 20.215543 F9
# 8 20.1159055 18.726075 FL
# 9 -6.9152047 4.900585 HA
# 10 10.7747334 10.552041 MQ
# 11 11.9310345 12.586207 OO
# 12 3.5580111 12.106073 UA
# 13 2.1295951 3.782418 US
# 14 1.7644644 12.869421 VX
# 15 9.6491199 17.711744 WN
# 16 15.5569853 18.996330 YV
# Other aggregations
ej4 = sqldf("
SELECT
avg(arr_delay) as mean_arr_delay,
min(dep_delay) as min_dep_delay,
max(dep_delay) as max_dep_delay,
carrier
FROM
flights
GROUP BY
carrier
")
# We can compute the minimum, mean, and maximum values of a numeric variable
ej4
### We could also be interested in knowing how many observations each carrier has
ej5 = sqldf("
  SELECT
  carrier, count(*) as count
  FROM
  flights
  GROUP BY
  carrier
  ")
ej5
# carrier count
# 1 9E 18460
# 2 AA 32729
# 3 AS 714
# 4 B6 54635
# 5 DL 48110
# 6 EV 54173
# 7 F9 685
# 8 FL 3260
# 9 HA 342
# 10 MQ 26397
# 11 OO 32
# 12 UA 58665
# 13 US 20536
# 14 VX 5162
# 15 WN 12275
# 16 YV 601
One of the most useful features of SQL is joins. A join means that we want to combine table A and table B into one table, using one column to match the values of both tables. There are different types of joins; in practical terms, the most useful ones to get started with are the inner join and the left outer join.
A = data.frame(c1=1:4, c2=letters[1:4])
B = data.frame(c1=c(2,4,5,6), c2=letters[c(2:5)])
# c1 c2
# 1 a
# 2 b
# 3 c
# 4 d
# c1 c2
# 2 b
# 4 c
# 5 d
# 6 e
# This means to match the observations of the column we would join the tables by.
inner = sqldf("
SELECT
A.c1, B.c2
FROM
A INNER JOIN B
ON A.c1 = B.c1
")
inner
# c1 c2
# 2 b
# 4 c
# The left outer join, sometimes just called left join, will return
# all the values of the join column from table A first
left = sqldf("
  SELECT
  A.c1, B.c2
  FROM
  A LEFT OUTER JOIN B
  ON A.c1 = B.c1
")
left
# c1 c2
# 1 <NA>
# 2 b
# 3 <NA>
# 4 c
Big Data Analytics Charts & Graphs
The first approach to analyzing data is to analyze it visually. The objectives of doing this are normally to find relations between variables and univariate descriptions of the variables. We can divide these strategies into:
Univariate analysis
Multivariate analysis
Box-Plots
Box-plots are normally used to compare distributions. They are a great way to visually inspect if there are differences between distributions. In the following example, we check if there are differences between the price of diamonds for different cuts.
library(ggplot2)
data("diamonds")
head(diamonds)
### Box-Plots
p = ggplot(diamonds, aes(x = cut, y = price, fill = cut)) +
   geom_boxplot() +
   theme_bw()
print(p)
We can see in the plot that there are differences in the distribution of diamond prices for different types of cut.
Histograms
source('01_box_plots.R')
# We can plot histograms for each level of the cut factor variable using facet_grid
p = ggplot(diamonds, aes(x = price, fill = cut)) +
   geom_histogram() +
   facet_grid(cut ~ .) +
   theme_bw()

# The previous plot doesn't allow visualizing the data correctly because of
# the differences in scale, so we free the scale of each panel
p = ggplot(diamonds, aes(x = price, fill = cut)) +
   geom_histogram() +
   facet_grid(cut ~ ., scales = 'free') +
   theme_bw()

png('02_histogram_diamonds_cut.png')
print(p)
dev.off()
In order to demonstrate multivariate graphical methods, we will use the diamonds dataset. To follow the code, open the script bda/part2/charts/03_multivariate_analysis.R.
library(ggplot2)
data(diamonds)

# Keep some numeric variables (an assumed selection that matches the plots below)
keep_vars = c('carat', 'depth', 'price', 'table')
df = diamonds[, keep_vars]
M_cor = cor(df)

# plots
heatmap(M_cor)
This is a summary; it tells us that there is a strong correlation between price and carat, and not much among the other variables.
A correlation matrix can be useful when we have a large number of variables in which case
plotting the raw data would not be practical. As mentioned, it is possible to show the raw
data also:
library(GGally)
ggpairs(df)
We can see in the plot that the results displayed in the heat-map are confirmed; there is a 0.922 correlation between the price and carat variables. It is possible to visualize this relationship in the price-carat scatterplot located in the (3, 1) index of the scatterplot matrix.
Big Data Analytics Data Analysis Tools
There are a variety of tools that allow a data scientist to analyze data effectively. Normally the engineering aspect of data analysis focuses on databases, whereas data scientists focus on tools that can implement data products. The following section discusses the advantages of different tools, with a focus on the statistical packages data scientists use most often in practice.
R Programming Language
R is an open source programming language with a focus on statistical analysis. It is competitive with commercial tools such as SAS and SPSS in terms of statistical capabilities. It can also be used as an interface to other programming languages such as C, C++ or Fortran. Another advantage of R is the large number of open source libraries that are available. In CRAN there are more than 6000 packages that can be downloaded for free, and on GitHub there is a wide variety of R packages available.
In terms of performance, R can be slow for intensive operations; however, given the large number of libraries available, the slow sections of the code are written in compiled languages. But if you are intending to do operations that require writing deep for loops, then R would not be your best alternative. For data analysis purposes, there are nice libraries such as data.table, glmnet, ranger, xgboost, ggplot2, and caret that allow using R as an interface to faster programming languages.
Most of what is available in R can also be done in Python, but we have found that R is simpler to use. In case you are working with large datasets, normally Python is a better choice than R. Python can be used quite effectively to clean and process data line by line. This is possible from R, but it is not as efficient as Python for scripting tasks.
For machine learning, scikit-learn is a nice environment that has a large number of algorithms available that can handle medium-sized datasets without a problem. Compared to R's equivalent library (caret), scikit-learn has a cleaner and more consistent API.
Julia
Julia is a high-level, high-performance dynamic programming language for technical computing. Its syntax is quite similar to R or Python, so if you are already working with R or Python, it should be quite simple to write the same code in Julia. The language is quite new and has grown significantly in recent years, so it is definitely an option at the moment.
We would recommend Julia for prototyping algorithms that are computationally intensive, such as neural networks. It is a great tool for research. In terms of implementing a model in production, Python probably has better alternatives. However, this is becoming less of a
problem as there are web services that do the engineering of implementing models in R,
Python and Julia.
SAS
SAS is a commercial language that is still being used for business intelligence. It has a base language that allows the user to program a wide variety of applications. It contains quite a few commercial products that give non-expert users the ability to use complex tools, such as a neural network library, without the need for programming.
Beyond the obvious disadvantage of commercial tools, SAS doesn't scale well to large datasets. Even medium-sized datasets will have problems with SAS and can make the server crash. SAS is to be recommended only if you are working with small datasets and the users aren't expert data scientists. For advanced users, R and Python provide a more productive environment.
SPSS
SPSS is currently a product of IBM for statistical analysis. It is mostly used to analyze survey data, and for users that are not able to program, it is a decent alternative. It is probably as simple to use as SAS, but in terms of implementing a model it is simpler, as it provides SQL code to score a model. This code is normally not efficient, but it's a start, whereas SAS sells the product that scores models for each database separately. For small data and an inexperienced team, SPSS is an option as good as SAS.
The software is, however, rather limited, and experienced users will be orders of magnitude more productive using R or Python.
Matlab, Octave
There are other tools available such as Matlab and its open source version, Octave. These tools are mostly used for research. In terms of capabilities, R or Python can do all that's available in Matlab or Octave. It only makes sense to buy a license of the product if you are interested in the support it provides.
Big Data Analytics Statistical Methods
When analyzing data, it is possible to have a statistical approach. The basic tools that are
needed to perform basic analysis are:
Correlation analysis
Analysis of Variance
Hypothesis Testing
When working with large datasets, these methods are not a problem, as they aren't computationally intensive, with the exception of correlation analysis. In that case, it is always possible to take a sample, and the results should be robust, as sketched below.
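For instance, a minimal sketch of correlating two variables on a random sample (using the diamonds dataset as a stand-in for a large table; this is an illustration, not part of the tutorial code):
library(ggplot2)   # for the diamonds dataset
set.seed(1)
idx = sample(nrow(diamonds), 5000)   # take a random sample of rows
cor(diamonds$carat[idx], diamonds$price[idx])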
Correlation Analysis
Correlation analysis seeks to find linear relationships between numeric variables. This can be of use in different circumstances. One common use is exploratory data analysis; in section 16.0.2 of the book there is a basic example of this approach. First of all, the correlation metric used in the mentioned example is based on the Pearson coefficient. There is, however, another interesting metric of correlation that is not affected by outliers. This metric is called the Spearman correlation.
The Spearman correlation metric is more robust to the presence of outliers than the Pearson method and gives better estimates of the linear relations between numeric variables when the data is not normally distributed.
library(ggplot2)

# Select a few numeric variables (the printed output below suggests x, y, z and price)
x = diamonds[, c('x', 'y', 'z', 'price')]

# In this case, as the variables are clearly not normally distributed,
# the spearman correlation is a better estimate of their linear relation

# Plot the distribution of each variable
par(mfrow = c(2, 2))
colnm = names(x)
for(i in 1:4){
   hist(x[[i]], col = 'deepskyblue3', main = paste('Histogram of', colnm[i]))
}
par(mfrow = c(1, 1))
From the histograms in the following figure, we can expect differences in the correlations of both metrics. In this case, as the variables are clearly not normally distributed, the Spearman correlation is a better estimate of the linear relation among the numeric variables.
cor_pearson <- cor(x, method = 'pearson')
print(cor_pearson)
#          x      y      z  price
cor_spearman <- cor(x, method = 'spearman')
print(cor_spearman)
#          x      y      z  price
Chi-squared Test
The chi-squared test allows us to test if two random variables are independent. This means that the probability distribution of each variable doesn't influence the other. In order to evaluate the test in R, we first need to create a contingency table, and then pass the table to the chisq.test R function.
For example, let's check if there is an association between the variables cut and color from the diamonds dataset. The test is formally defined as H0: cut and color are independent, versus H1: cut and color are not independent. We would assume there is a relationship between these two variables by their names, but the test can give an objective "rule" saying how significant this result is or not.
In the following code snippet, we find that the p-value of the test is 2.2e-16; this is almost zero in practical terms. Then, after running the test with a Monte Carlo simulation, we find that the p-value is 0.0004998, which is still quite a bit lower than the threshold of 0.05. This result means that we reject the null hypothesis (H0), so we believe the variables cut and color are not independent.
library(ggplot2)

# Build a contingency table of cut vs color
tbl = table(diamonds$cut, diamonds$color)
tbl
#             D    E    F    G    H    I    J

chisq.test(tbl)
# Pearson's Chi-squared test
# data: tbl

# The same test, estimating the p-value with a Monte Carlo simulation
chisq.test(tbl, simulate.p.value = TRUE)
# data: tbl
T-test
The idea of the t-test is to evaluate if there are differences in the distribution of a numeric variable between different groups of a nominal variable. In order to demonstrate this, we will select the Fair and Ideal levels of the factor variable cut, and then compare the values of a numeric variable between those two groups.
# Subset the Fair and Ideal levels of the cut factor
data = droplevels(diamonds[diamonds$cut %in% c('Fair', 'Ideal'), ])
# We can see the price means are different for each group
tapply(data$price, data$cut, mean)
#     Fair    Ideal
# 4358.758 3457.542
The t-test is implemented in R with the t.test function. The formula interface to t.test is the simplest way to use it; the idea is that a numeric variable is explained by a group variable.
From a statistical perspective, we are testing if there are differences in the distributions of
the numeric variable among two groups. Formally the hypothesis test is described with a
null (H0) hypothesis and an alternative hypothesis (H1).
H0: There are no differences in the distributions of the price variable among the Fair and Ideal groups
H1: There are differences in the distributions of the price variable among the Fair and Ideal groups
t.test(price ~ cut, data = data)
# 95 percent confidence interval:
#   719.9065 1082.5251
# sample estimates:
#  mean in group Fair mean in group Ideal
#            4358.758            3457.542
plot(price ~ cut, data = data, col = 'deepskyblue3')
We can analyze the test result by checking if the p-value is lower than 0.05. If this is the case, we keep the alternative hypothesis. This means we have found differences in price between the two levels of the cut factor. By the names of the levels we would have expected this result, but we wouldn't have expected that the mean price in the Fair group would be higher than in the Ideal group. We can see this by comparing the means of each factor level.
The plot command produces a graph that shows the relationship between the price and cut variables. It is a box-plot; we have covered this plot in section 16.0.1, but it basically shows the distribution of the price variable for the two levels of cut we are analyzing.
Analysis of Variance
Analysis of Variance (ANOVA) is a statistical model used to analyze the differences among group distributions by comparing the mean and variance of each group. The model was developed by Ronald Fisher. ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups.
ANOVA is useful for comparing three or more groups for statistical significance because doing multiple two-sample t-tests would result in an increased chance of committing a statistical type I error.
The ANOVA technique is based on decomposing each observation as:
x_ij = x̄ + (x̄_i - x̄) + (x_ij - x̄_i)
which corresponds to the model:
x_ij = μ + α_i + ε_ij
where μ is the grand mean and α_i is the effect of the i-th group (the deviation of the i-th group mean from the grand mean). The error term ε_ij is assumed to be iid from a normal distribution. The null hypothesis of the test is that:
α_1 = α_2 = ... = α_k
The test compares the between-groups variability, SSD_B = Σ_i n_i (x̄_i - x̄)², with the within-groups variability, SSD_W = Σ_i Σ_j (x_ij - x̄_i)², where SSD_B has k - 1 degrees of freedom and SSD_W has N - k degrees of freedom. Then we can define the mean squared differences for each metric:
MS_B = SSD_B / (k - 1)
MS_W = SSD_W / (N - k)
Finally, the test statistic in ANOVA is defined as the ratio of the above two quantities:
F = MS_B / MS_W
which, under the null hypothesis, follows an F distribution with k - 1 and N - k degrees of freedom.
Basically, ANOVA examines the two sources of the total variance and sees which part
contributes more. This is why it is called analysis of variance although the intention is to
compare group means.
library(ggplot2)
head(mtcars)

# Let's see if there are differences between the groups of cyl in the mpg variable.
fit = lm(mpg ~ as.factor(cyl), data = mtcars)
anova(fit)
# Analysis of Variance Table
# Response: mpg
The p-value we get in the example is significantly smaller than 0.05, so R returns the
symbol '***' to denote this. It means we reject the null hypothesis and that we find
differences between the mpg means among the different groups of the cyl variable.
Big Data Analytics Machine Learning for Data Analysis
Machine learning is a subfield of computer science that deals with tasks such as pattern recognition, computer vision, speech recognition and text analytics, and has a strong link with statistics and mathematical optimization. Applications include the development of search engines, spam filtering, and Optical Character Recognition (OCR), among others. The boundaries between data mining, pattern recognition and the field of statistical learning are not clear, and basically all refer to similar problems.
Machine learning tasks are normally classified into two broad categories:
Supervised Learning
Unsupervised Learning
Supervised Learning
Supervised learning refers to a type of problem where there is input data defined as a matrix X and we are interested in predicting a response y, where X = {x1, x2, ..., xn} has n predictors and y takes, for example, two values, y = {c1, c2}.
An example application would be to predict the probability of a web user clicking on ads, using demographic features as predictors. This is often called predicting the click-through rate (CTR). Then y = {click, doesn't click} and the predictors could be the user's IP address, the day they entered the site, and the user's city and country, among other features that could be available.
Unsupervised Learning
Unsupervised learning deals with the problem of finding groups that are similar within themselves, without having a class to learn from. There are several approaches to the task of learning a mapping from predictors to groups that share similar instances within each group and are different from each other.
Big Data Analytics Naive Bayes Classifier
Despite the oversimplified assumptions mentioned previously, naive Bayes classifiers have
good results in complex real-world situations. An advantage of naive Bayes is that it only
requires a small amount of training data to estimate the parameters necessary for
classification and that the classifier can be trained incrementally.
The problem with the above formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it simpler. Using Bayes theorem, the conditional probability can be decomposed as:
p(Ck | x1, ..., xn) = p(Ck) p(x1, ..., xn | Ck) / p(x1, ..., xn)
This means that under the above independence assumptions, the conditional distribution over the class variable C is proportional to:
p(Ck | x1, ..., xn) ∝ p(Ck) p(x1 | Ck) p(x2 | Ck) ... p(xn | Ck)
# Install the required packages (an assumed list matching the libraries below)
pkgs = c('ElemStatLearn', 'klaR', 'caret')
install.packages(pkgs)
library('ElemStatLearn')
library("klaR")
library("caret")
# Load the spam dataset and split it into train and test sets
data(spam)
set.seed(1)
inx = sample(nrow(spam), round(nrow(spam) * 0.9))
train = spam[inx,]
test = spam[-inx,]
# The 58th column contains the class label (spam / email)
X_train = train[,-58]
y_train = train$spam
X_test = test[,-58]
y_test = test$spam
# Train a naive Bayes model using 3-fold cross-validation
fit = train(X_train, y_train, method = 'nb',
   trControl = trainControl(method = 'cv', number = 3))
# Predict on the test set and compute the accuracy
preds = predict(fit, newdata = X_test)
tbl = table(y_test, preds)
sum(diag(tbl)) / sum(tbl)
# 0.7217391
As we can see from the result, the accuracy of the Naive Bayes model is 72%. This means
the model correctly classifies 72% of the instances.
Big Data Analytics K-Means Clustering
The objective function that is minimized in order to find the optimal prototypes in k-means clustering is the within-cluster sum of squares:
arg min over S of Σ_i Σ_{x in S_i} ||x - μ_i||²
where S = {S_1, ..., S_k} is the set of clusters and μ_i is the mean of the points in S_i. The intuition of the formula is that we would like to find groups that are different from each other, while each member of a group should be similar to the other members of its cluster.
The following example demonstrates how to run the k-means clustering algorithm in R.
library(ggplot2)
# Prepare Data
data = mtcars
# We need to scale the data to have zero mean and unit variance
data = scale(data)
# Compute the within-groups sum of squares for an increasing number of clusters
wss = (nrow(data) - 1) * sum(apply(data, 2, var))
for (i in 2:dim(data)[2]) {
   wss[i] = sum(kmeans(data, centers = i)$withinss)
}
In order to find a good value for K, we can plot the within-groups sum of squares for different values of K. This metric normally decreases as more groups are added; we would like to find a point where the decrease in the within-groups sum of squares starts to slow down. In the plot, this value is best represented by K = 6.
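A minimal sketch of that plot, based on the wss vector computed above (the exact plotting call in the original script may differ):
plot(1:length(wss), wss, type = 'b',
   xlab = 'Number of Clusters', ylab = 'Within groups sum of squares')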
Now that the value of K has been defined, we need to run the algorithm with that value.
fit = kmeans(data, 6)  # K chosen from the elbow plot above (an assumption)
aggregate(data, by = list(fit$cluster), FUN = mean)
Big Data Analytics Association Rules
Let I = {i_1, i_2, ..., i_n} be a set of n binary attributes called items. Let D = {t_1, t_2, ..., t_m}
be a set of transactions called the database. Each transaction in D has a unique transaction
ID and contains a subset of the items in I. A rule is defined as an implication of the form
X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (for short item-sets) X and Y are called antecedent (left-hand-side or
LHS) and consequent (right-hand-side or RHS) of the rule.
To illustrate the concepts, we use a small example from the supermarket domain. The set
of items is I = {milk, bread, butter, beer} and a small database containing the items is
shown in the following table.
Transaction ID   Items
1                milk, bread
2                bread, butter
3                beer
4                milk, bread, butter
5                bread, butter
An example rule for the supermarket could be {milk, bread} ⇒ {butter}, meaning that if
milk and bread are bought, customers also buy butter. To select interesting rules from the
set of all possible rules, constraints on various measures of significance and interest can
be used. The best-known constraints are minimum thresholds on support and confidence.
The support supp(X) of an item-set X is defined as the proportion of transactions in the
database which contain the item-set; in the table above, the item-set {milk, bread} has a
support of 2/5 = 0.4. The confidence of a rule is defined as conf(X ⇒ Y) = supp(X ∪ Y) / supp(X).
For example, the rule {milk, bread} ⇒ {butter} has a confidence of 0.2/0.4 = 0.5 in the
database above, which means that the rule is correct for 50% of the transactions containing
milk and bread. Confidence can be interpreted as an estimate of the probability P(Y|X), the
probability of finding the RHS of the rule in transactions under the condition that these
transactions also contain the LHS.
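These support and confidence values can be checked directly in R. The following sketch builds the five-transaction toy database from the table above with the arules package (which is also used in the next example); the thresholds supp = 0.2 and conf = 0.5 are chosen here only for illustration.

library(arules)
# The toy supermarket database from the table above
txns = list(c("milk", "bread"),
            c("bread", "butter"),
            c("beer"),
            c("milk", "bread", "butter"),
            c("bread", "butter"))
trans = as(txns, "transactions")

# Mine rules with minimum support 0.2 and minimum confidence 0.5
rules = apriori(trans, parameter = list(supp = 0.2, conf = 0.5, minlen = 2))
inspect(rules)

In the following example, the same package is applied to the Adult census dataset.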
# install.packages('arules')
library(arules)
# Data preprocessing
data("AdultUCI")
AdultUCI[1:2,]
In order to generate rules using the apriori algorithm, we need to create a transaction
matrix. The following code shows how to do this in R.
# The numeric columns of AdultUCI must be removed or discretized before the
# data frame can be coerced to transactions; for simplicity we load the
# ready-made transactions version of the dataset that ships with arules
data("Adult")
Adult
summary(Adult)

# generate rules
min_support = 0.01
confidence = 0.6
rules = apriori(Adult, parameter = list(support = min_support,
   confidence = confidence))
inspect(rules[100:110, ])
Big Data Analytics - Decision Trees
A Decision Tree is an algorithm used for supervised learning problems such as classification
or regression. A decision tree or a classification tree is a tree in which each internal (non-
leaf) node is labeled with an input feature. The arcs coming from a node labeled with a
feature are labeled with each of the possible values of the feature. Each leaf of the tree is
labeled with a class or a probability distribution over the classes.
A tree can be "learned" by splitting the source set into subsets based on an attribute value
test. This process is repeated on each derived subset in a recursive manner called
recursive partitioning. The recursion is completed when the subset at a node has all the
same value of the target variable, or when splitting no longer adds value to the predictions.
This process of top-down induction of decision trees is an example of a greedy algorithm,
and it is the most common strategy for learning decision trees.
Decision trees used in data mining are of two main types:
Classification tree: when the predicted outcome is the class to which the data belongs.
Regression tree: when the predicted outcome can be considered a real number
(e.g. the salary of a worker).
Decision trees are a simple method, and as such they have some problems. One of these
issues is the high variance of the models that decision trees produce. In order to alleviate
this problem, ensemble methods of decision trees were developed. Two groups of ensemble
methods are currently used extensively:
Bagging decision trees: multiple decision trees are built by repeatedly resampling the
training data with replacement, and the trees vote for a consensus prediction. The
best-known algorithm in this family is the random forest.
Boosting decision trees: gradient boosting combines weak learners (in this case, decision
trees) into a single strong learner in an iterative fashion. It fits a weak tree to the data
and iteratively keeps fitting weak learners in order to correct the errors of the previous
model. A brief sketch of both approaches is shown after this list.
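The sketch below uses the randomForest and gbm packages, which are common choices for these two approaches but are not part of this tutorial's project files; the datasets and tuning values are illustrative only.

# install.packages(c('randomForest', 'gbm'))
library(randomForest)
library(gbm)

# Bagging / random forest: many trees grown on bootstrap resamples of the
# training data, combined by voting (or averaging for regression)
rf = randomForest(Species ~ ., data = iris, ntree = 200)
print(rf)

# Boosting: trees are added iteratively, each one correcting the errors of
# the current ensemble
gb = gbm(Sepal.Length ~ ., data = iris, distribution = 'gaussian',
   n.trees = 200, interaction.depth = 2, shrinkage = 0.05)
summary(gb)

The following example returns to single trees and uses the party package to fit a conditional inference tree on the diamonds dataset.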
# install.packages('party')
library(party)
library(ggplot2)
head(diamonds)

# We will predict the cut of diamonds using the features available in the
# diamonds dataset
ct = ctree(cut ~ ., data = diamonds)
# Example output (truncated)
# Response: cut
# ...
# 6) color > E
#   7)* weights = 82

# Table of prediction errors
table(predict(ct), diamonds$cut)

# Estimated class probabilities for one example row (index chosen arbitrarily)
probs = treeresponse(ct, newdata = diamonds[500, ])
head(probs)
Big Data Analytics - Logistic Regression
The following code demonstrates how to fit a logistic regression model in R. We use the
spam dataset here to demonstrate logistic regression, the same dataset that was used for
the Naive Bayes classifier.
From the prediction results in terms of accuracy, we find that the regression model
achieves 92.5% accuracy on the test set, compared to the 72% achieved by the Naive
Bayes classifier.
library(ElemStatLearn)
head(spam)

# Split the dataset into training and testing sets
inx = sample(nrow(spam), round(nrow(spam) * 0.8))
train = spam[inx,]
test = spam[-inx,]

# Fit a logistic regression model
fit = glm(spam ~ ., data = train, family = binomial())
summary(fit)
# Call:
# glm(formula = spam ~ ., family = binomial(), data = train)
# Deviance Residuals:
# Coefficients: (output truncated)

# Make predictions on the test set
preds = predict(fit, test, type = 'response')
preds = ifelse(preds > 0.5, 1, 0)
tbl = table(target = test$spam, preds)
tbl
#         preds
# target    0   1
#   email 535  23
#   spam   46 316
sum(diag(tbl)) / sum(tbl)
# 0.925
Big Data Analytics - Time Series Analysis
A time series is a sequence of observations of a numeric variable indexed by a timestamp,
for example a value of 90 recorded at 2015-10-11 12:00:00.
Normally, the first step in time series analysis is to plot the series; this is typically done
with a line chart.
The most common application of time series analysis is forecasting future values of a
numeric variable using the temporal structure of the data. This means the available
observations are used to predict values in the future.
The temporal ordering of the data implies that traditional regression methods are not
useful. In order to build robust forecasts, we need models that take the temporal ordering
of the data into account.
The most widely used model for Time Series Analysis is called Autoregressive Moving
Average (ARMA). The model consists of two parts, an autoregressive (AR) part and a
moving average (MA) part. The model is usually then referred to as the ARMA(p, q)
model where p is the order of the autoregressive part and q is the order of the moving
average part.
Autoregressive Model
The AR(p) model is read as an autoregressive model of order p. Mathematically it is
written as:

X_t = c + φ_1 X_{t−1} + ... + φ_p X_{t−p} + ε_t

where φ_1, ..., φ_p are the parameters of the model, c is a constant, and ε_t is a white
noise error term.
Moving Average
The notation MA(q) refers to the moving average model of order q:

X_t = μ + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}

where θ_1, ..., θ_q are the parameters of the model, μ is the expectation of X_t, and
ε_t, ε_{t−1}, ... are white noise error terms.
We can see that the ARMA(p, q) model is a combination of AR(p) and MA(q) models.
To give some intuition of the model, consider that the AR part of the equation seeks to
estimate parameters for the X_{t−i} observations in order to predict the value of the
variable at X_t; it is, in the end, a weighted average of the past values. The MA part uses
the same approach but with the errors of previous observations, ε_{t−i}. So in the end,
the result of the model is a weighted average of past values and past errors.
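Putting the two parts together with the notation used above, the ARMA(p, q) model can be written as:

X_t = c + ε_t + φ_1 X_{t−1} + ... + φ_p X_{t−p} + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}

The forecast package in R implements this family of models, as shown in the following example.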
# install.packages("forecast")
library("forecast")
# Read the data and convert it into a monthly time series object
data = scan('fancy.dat')
ts_data = ts(data, frequency = 12)
ts_data
plot.ts(ts_data)
Plotting the data is normally the first step to find out if there is a temporal structure in the
data. We can see from the plot that there are strong spikes at the end of each year.
The following code fits an ARIMA model to the data using the auto.arima function, which
tries several combinations of models and selects the one that fits the data best.
fit = auto.arima(ts_data)
summary(fit)
# Series: ts_data
# ARIMA(1,1,1)(0,1,1)[12]
# Coefficients:
Big Data Analytics - Text Analytics
In this chapter, we will be using the data scraped in part 1 of the book. The data contains
text that describes the profiles of freelancers, and the hourly rate they charge in USD. The
idea of the following section is to fit a model that, given the skills of a freelancer, is able
to predict his or her hourly rate.
The following code shows how to convert the raw text, which in this case contains the
skills of a user, into a bag-of-words matrix. For this we use an R library called tm. This
means that for each word in the corpus we create a variable with the number of
occurrences of that word.
library(tm)
library(data.table)
source('text_analytics/text_analytics_functions.R')
data = fread('text_analytics/data/profiles.txt')
rate = as.numeric(data$rate)
keep = !is.na(rate)
rate = rate[keep]
X_all = bag_words(data$user_skills[keep])
X_all
# Sparsity : 99%
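The bag_words helper used above is part of the project files downloaded earlier and is not reproduced here. For reference, the following standalone sketch, with a made-up three-profile corpus, shows the kind of term-count matrix that the tm package produces.

library(tm)
# A toy corpus of three skill descriptions (made-up examples)
corpus = VCorpus(VectorSource(c("java sql hadoop", "python spark", "java python")))
dtm = DocumentTermMatrix(corpus)
# Each row is a profile, each column the number of occurrences of a term
as.matrix(dtm)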
Now that we have the text represented as a sparse matrix we can fit a model that will give
a sparse solution. A good alternative for this case is using the LASSO (least absolute
shrinkage and selection operator). This is a regression model that is able to select the
most relevant features to predict the target.
train_inx = 1:200
X_train = X_all[train_inx, ]
y_train = rate[train_inx]
X_test = X_all[-train_inx, ]
y_test = rate[-train_inx]

# Train a LASSO regression model with cross-validation
library(glmnet)
fit = cv.glmnet(x = X_train, y = y_train,
   family = 'gaussian', alpha = 1)
plot(fit)

# Make predictions
predictions = predict(fit, newx = X_test)
predictions = as.vector(predictions[,1])
head(predictions)

# We can compute the mean absolute error for the test data
mean(abs(y_test - predictions))
# 15.02175
Now we have a model that, given a set of skills, is able to predict the hourly rate of a
freelancer. If more data is collected, the performance of the model will improve, but the
code to implement this pipeline would remain the same.
Big Data Analytics - Online Learning
Online learning is a subfield of machine learning that allows supervised learning models
to scale to massive datasets. The basic idea is that we don't need to read all the data into
memory to fit a model; we only need to read one instance at a time.
In this case, we will show how to implement an online learning algorithm using logistic
regression. As in most supervised learning algorithms, there is a cost function that is
minimized. In logistic regression, the cost function is defined as:

J(θ) = −(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 − y^(i)) log(1 − h_θ(x^(i))) ]

where J(θ) represents the cost function and h_θ(x) represents the hypothesis. In the case
of logistic regression, the hypothesis is defined with the following formula:

h_θ(x) = 1 / (1 + e^(−θᵀx))
Now that we have defined the cost function, we need to find an algorithm to minimize it.
The simplest algorithm for achieving this is called stochastic gradient descent. The update
rule of the algorithm for the weights of the logistic regression model is defined as:

θ_j := θ_j − α (h_θ(x^(i)) − y^(i)) x_j^(i)

where α is the learning rate and the update is applied for each training instance i in turn.
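To make the update rule concrete, the following is a minimal sketch of stochastic gradient descent for logistic regression in R on simulated data (not part of the project files); it processes one instance at a time, which is what makes the approach suitable for data that does not fit in memory.

# Simulate a binary classification problem
set.seed(1)
n = 10000
X = cbind(1, matrix(rnorm(n * 2), ncol = 2))   # intercept plus two features
theta_true = c(-1, 2, -3)
y = rbinom(n, 1, 1 / (1 + exp(-X %*% theta_true)))

# One pass of stochastic gradient descent using the update rule above
sigmoid = function(z) 1 / (1 + exp(-z))
theta = rep(0, 3)
alpha = 0.05                                   # learning rate
for (i in 1:n) {
   h = sigmoid(sum(X[i, ] * theta))            # hypothesis for instance i
   theta = theta - alpha * (h - y[i]) * X[i, ] # update each weight
}
theta   # the estimates should move toward theta_true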
There are several implementations of this algorithm, but the one provided by the vowpal
wabbit library is by far the most developed. The library allows training of large-scale
regression models and uses small amounts of RAM. In the creators' own words it is
described as: "The Vowpal Wabbit (VW) project is a fast out-of-core learning system
sponsored by Microsoft Research and (previously) Yahoo! Research".
We will be working with the titanic dataset from a kaggle competition. The original data
can be found in the bda/part3/vw folder. Here, we have two files: a training file and a file
with unlabeled data for making new predictions.
In order to convert the CSV format to the vowpal wabbit input format, use the
csv_to_vowpal_wabbit.py Python script. You will obviously need to have Python
installed for this. Navigate to the bda/part3/vw folder, open the terminal and execute
the following command:
python csv_to_vowpal_wabbit.py
Note that for this section, if you are using Windows you will need to install a Unix command
line; the cygwin website explains how to set one up.
Open the terminal, navigate to the folder bda/part3/vw, and execute the following
command:
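The exact command from the original project files is not reproduced in this text; a plausible sketch is shown below, where train_titanic.vw is an assumed name for the converted training file and the options used (-f, --binary, --passes, -c, --loss_function, --learning_rate) are standard vowpal wabbit flags:

vw train_titanic.vw -f model.vw --binary --passes 20 -c \
   --loss_function logistic --learning_rate 0.5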
-f model.vw: means that we are saving the model in the model.vw file for making
predictions later
--learning_rate 0.5: the learning rate, as defined in the update rule formula
The following output shows the results of running the regression model on the command
line. In the results, we get the average log-loss and a small report of the algorithm's
performance.
-loss_function logistic
final_regressor = model.vw
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
num sources = 1
finished run
passes used = 11
Now we can use the model.vw file we trained to generate predictions on new data.
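A sketch of this prediction step is shown below, assuming the new data has been converted to a file named test_titanic.vw (an assumed filename); -t runs vowpal wabbit in test-only mode, -i loads the trained model, and -p writes the raw predictions to predictions.txt:

vw test_titanic.vw -t -i model.vw -p predictions.txt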
The predictions generated in the previous command are not normalized to fit between the
[0, 1] range. In order to do this, we use a sigmoid transformation.
library(data.table)
preds = fread('vw/predictions.txt')
# Map the raw scores to the [0, 1] range with a sigmoid transformation
sigmoid = function(x){
   1 / (1 + exp(-x))
}
probs = sigmoid(preds[[1]])
# Threshold the probabilities to obtain class labels
preds = ifelse(probs > 0.5, 1, 0)
head(preds)
# [1] 0 1 0 0 1 0