Big Data Analytics - Complete Notes
1. Unstructured data: This is data that does not conform to a data
model or is not in a form that can be used easily by a computer program.
About 80% of an organization's data is in this format; for example, memos,
chat rooms, PowerPoint presentations, images, videos, letters, research
reports, white papers, the body of an email, etc.
Data falls into three categories: unstructured data, semi-structured data, and structured data.
The "Internet of Things" and its widely ultra-connected nature are leading
to a burgeoning rise in big data. There is no dearth of data for today's
enterprise. On the contrary, they are mired in data and quite deep at that.
That brings us to the following questions:
1. Why is it that we cannot forego big data?
2. How has it come to assume such magnanimous importance in running
business?
3. How does it compare with the traditional Business Intelligence (BI)
environment?
4. Is it here to replace the traditional, relational database management
system and data warehouse environment or is it likely to complement
their existence?
As of 2014, LinkedIn has more than 250 million user accounts and
has added many additional features and data-related products, such as
recruiting, job seeker tools, advertising, and InMaps, which shows a social
graph of a user's professional network.
CHARACTERISTICS OF DATA
Data has three key characteristics: composition, condition, and context.
Table 1.1 The evolution of big data
DEFINITION OF BIG DATA
• Big data is high-volume, high-velocity, and high-variety information assets that
demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.
• Big data refers to datasets whose size is typically beyond the storage
capacity of, and too complex for, traditional database software tools.
• Big data is anything beyond the human and technical infrastructure
needed to support storage, processing and analysis.
• It is data that is big in volume, velocity and variety. Refer to Figure 1.3.
Part III of the definition: "enhanced insight and decision making" talks
about deriving deeper, richer and meaningful insights and then using these
insights to make faster and better decisions to gain business value and thus
a competitive edge.
Data → Information → Actionable intelligence → Better decisions → Enhanced business value
Data retention: How long should one retain this data? Some data may
be required for long-term decisions, but some data may quickly become irrelevant
and obsolete.
Other challenges: Other challenges of big data are with respect to capture,
storage, search, analysis, transfer and security of big data.
Visualization: Big data refers to datasets whose size is typically beyond the
storage capacity of traditional database software tools. There is no explicit
definition of how big the dataset should be for it to be considered big data.
Data visualization (computer graphics) is becoming popular as a separate
discipline. There are very few data visualization experts.
TRADITIONAL BUSINESS INTELLIGENCE (BI)
VERSUS BIG DATA
Many compliance and regulatory laws have been in existence for decades,
but additional requirements are added every year, which represent
additional complexity and data requirements for organizations.
Tables 1.3 and 1.4 compare BI and Data Science.
Table 1.4: Business Intelligence
Typical Techniques and Data Types:
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable datasets
Common Questions:
• What happened last quarter?
• How many units sold?
• Where is the problem? In which situations?
1.9.1 Current Analytical Architecture: Figure 1.9 explains a typical
analytical architecture.
1. For data sources to be loaded into the data warehouse, data needs to
be well understood, structured and normalized with the appropriate data
type definitions.
2. As a result of this level of control on the EDW (enterprise data
warehouse, on a server or in the cloud), additional local systems may emerge
in the form of departmental warehouses and local data marts that
business users create to accommodate their need for flexible analysis.
However, these local systems reside in isolation, often are not
synchronized or integrated with other data stores and may not be backed
up.
3. In the data warehouse, data is read by additional applications across the
enterprise for BI and reporting purposes.
4. At the end of this workflow, analysts get data from the server. Because users
generally are not allowed to run custom or intensive analytics on
production databases, analysts create data extracts from the EDW to
analyze the data offline in R or other local analytical tools, while the EDW
continues to store and process critical data, supporting enterprise
applications and enabling corporate reporting activities.
Although reports and dashboards are still important for
organizations, most traditional data architectures prevent data exploration
and more sophisticated analysis.
Figure 1.10: Data Evolution and the Rise of Big Data Sources
The following decade (the 2000s) saw different kinds of data sources,
mainly productivity and publishing tools such as content management
repositories and network-attached storage systems, used to manage this kind of
information; the data began to increase in size and started to be
measured at petabyte scales.
1. Data devices and the "SensorNet" gather data from multiple locations
and continuously generate new data about this data. For each gigabyte
of new data created, an additional petabyte of data is created about that
data.
2. Data collectors include sample entities that collect data from the device
and users.
Retail stores tracking the path a customer takes through their store
while pushing a shopping cart with an RFID chip so they can gauge which
products get the most foot traffic using geospatial data collected from the
RFID chips
3. Data aggregators make sense of the data collected from the various
entities from the "SensorNet" or the "Internet of Things." These
organizations compile data from the devices and usage patterns
collected by government agencies, retail stores and websites. In turn,
they can choose to transform and package the data as products to sell
to list brokers, who may want to generate marketing lists of people who
may be good targets for specific ad campaigns.
4. Data users / buyers: These groups directly benefit from the data
collected and aggregated by others within the data value chain. Retail
banks, acting as a data buyer, may want to know which customers have
the highest likelihood to apply for a second mortgage or a home equity
line of credit.
To provide input for this analysis, retail banks may purchase data
from a data aggregator. This kind of data may include demographic
information about people living in specific locations; people who appear to
have a specific level of debt, yet still have solid credit scores (or other
characteristics such as paying bills on time and having savings accounts)
that can be used to infer credit worthiness; and those who are searching
the web for information about paying off debts or doing home remodeling
projects. Obtaining data from these various sources and aggregators will
enable a more targeted marketing campaign, which would have been more
challenging before Big Data due to the lack of information or high-performing
technologies.
1. Deep Analytical Talent is technically savvy, with strong analytical
skills. Members possess a combination of skills to handle raw,
unstructured data and to apply complex analytical techniques at massive
scales.
These three groups must work together closely to solve complex Big
Data challenges.
Most organizations are familiar with people in the latter two groups
mentioned, but the first group, Deep Analytical Talent, tends to be the
newest role for most and the least understood.
Figure 1.13 - Data scientist
Data scientists are generally comfortable using this blend of skills to acquire,
manage, analyze, and visualize data and tell compelling stories about it.
*****
BIG DATA ANALYTICS
Big Data is creating significant new opportunities for organizations
to derive new value and create competitive advantage from their most
valuable asset: information. For businesses, Big Data helps drive
efficiency, quality, and personalized products and services, producing
improved levels of customer satisfaction and profit. For scientific efforts,
Big Data analytics enable new avenues of investigation with potentially
richer results and deeper insights than previously available. In many cases,
Big Data analytics integrate structured and unstructured data with real-time
feeds and queries, opening new paths to innovation and insight.
CLASSIFICATION OF ANALYTICS
Basic analytics: This primarily is slicing and dicing of data to help with
basic business insights. This is about reporting on historical data, basic
visualization, etc.
How can we make it happen?
Figure 2.1 Analytics 1.0, 2.0 and 3.0
Let us take a closer look at analytics 1.0, analytics 2.0, and analytics
3.0. Refer to Table 2.1. Figure 2.1 shows the gradual growth of analytics
from Descriptive → Diagnostic → Predictive → Prescriptive analytics.
Table 2.1: Analytics 1.0, 2.0 and 3.0
• Analytics 1.0: Small and structured data from sources such as ERP, CRM, and 3rd-party applications. Data stored in enterprise data warehouses or data marts.
• Analytics 2.0: Big data is being taken up seriously. Data is mainly unstructured, arriving at a much higher pace. This fast flow of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop.
• Analytics 3.0: A blend of big data and traditional analytics to yield insights and offerings with speed and impact.
Security: Most of the NoSQL big data platforms have poor security
mechanisms (lack of proper authentication and authorization mechanisms)
when it comes to safeguarding big data. This is a gap that cannot be ignored,
given that big data carries credit card information, personal information and
other sensitive data.
Continuous availability: The big question here is how to provide 24/7
support, because almost all RDBMS and NoSQL big data platforms have a
certain amount of downtime built in.
Reactive - Big Data Analytics: Here the analysis is done on huge datasets
but the approach is still reactive as it is still based on static data.
(to distribute processing to a number of machines), high connectivity,
and high throughput (the rate at which something is processed).
• Cloud computing and other flexible resource allocation arrangements.
DATA SCIENCE
It employs techniques and theories drawn from many fields within the
broad areas of mathematics, statistics and information technology, including
machine learning, data engineering, probability models, statistical learning,
pattern recognition and learning, etc.
A data scientist should have the following abilities to play the role of data
scientist effectively.
• Understanding of domain
• Business strategy
• Problem solving
• Communication
• Presentation
• Keenness
Mathematics Expertise:
The following are the key skills that a data scientist must have to
comprehend, interpret, and analyze data.
• Mathematics.
• Statistics.
• Artificial Intelligence (AI).
• Algorithms.
• Machine learning.
• Pattern recognition.
• Natural Language Processing.
• To sum it up, the data science process is
• Collecting raw data from multiple different data sources.
• Processing the data.
• Integrating the data and preparing clean datasets.
• Engaging in explorative data analysis using model and algorithms.
• Preparing presentations using data visualizations.
• Communicating the findings to all stakeholders.
• Making faster and better decisions.
RESPONSIBILITIES
Analytical Techniques: Depending on the business questions which we are
trying to find answers to and the type of data available at hand, the data
scientist employs a blend of analytical techniques to develop models and
algorithms to understand the data, interpret relationships, spot trends, and
reveal patterns.
Basically Available: This constraint states that the system does guarantee
the availability of the data as regards CAP Theorem; there will be a response
to any request. But, that response could still be ‘failure’ to obtain the requested
data or the data may be in an inconsistent or changing state, much like
waiting for a check to clear in your bank account.
Soft state: The state of the system could change over time, so even during
times without input there may be changes going on due to 'eventual
consistency'; thus the state of the system is always 'soft.'
Figure 2.4 - Overview of Data Analytical Lifecycle
In this phase, the team also needs to familiarize itself with the data thoroughly
and take steps to condition the data.
Phase 4-Model building: In Phase 4, the team develops data sets for
testing, training, and production purposes. In addition, in this phase the team
builds and executes models based on the work done in the model planning
phase. The team also considers whether its existing tools will suffice for
running the models, or if it will need a more robust environment for
executing models and workflows (for example, fast hardware and parallel
processing, if applicable).
Phase 6 - Operationalize: In Phase 6, the team delivers final reports,
briefings, code and technical documents. In addition, the team may run a
pilot project to implement the models in a production environment.
*****
ANALYTICAL THEORY AND METHODS
DECISION TREES
Internal nodes are the decision or test points. Each internal node refers
to an input variable or an attribute. The top internal node is called the root.
The decision tree in Figure 7-1 is a binary tree in that each internal node has
no more than two branches. The branching of a node is referred to as a split.
The decision tree in Figure 7-1 shows that females with income
less than or equal to $45,000 and males 40 years old or younger are
classified as people who would purchase the product. In traversing this tree,
age does not matter for females, and income does not matter for males.
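To make this concrete, here is a minimal sketch that fits a small decision tree on made-up data shaped like the Figure 7-1 example; the column names, values, and use of scikit-learn are assumptions, not the book's code.

# Minimal sketch: synthetic data mimicking the Figure 7-1 purchase example.
from sklearn.tree import DecisionTreeClassifier, export_text
import pandas as pd

data = pd.DataFrame({
    "female": [1, 1, 1, 0, 0, 0],                            # 1 = female, 0 = male
    "income": [30000, 60000, 40000, 80000, 50000, 20000],
    "age":    [25, 35, 50, 38, 45, 30],
    "buys":   [1, 0, 1, 1, 0, 1],                            # hypothetical labels
})
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data[["female", "income", "age"]], data["buys"])
print(export_text(tree, feature_names=["female", "income", "age"]))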
Where are decision trees used?
• Decision trees are widely used in practice.
• To classify animals, questions (like cold-blooded or warm-blooded,
mammal or not mammal) are answered to arrive at a certain
classification.
• A checklist of symptoms during a doctor's evaluation of a patient.
• The artificial intelligence engine of a video game commonly uses
decision trees to control the autonomous actions of a character in
response to various scenarios.
• Retailers can use decision trees to segment customers or predict
response rates to marketing and promotions.
• Financial institutions can use decision trees to help decide if a loan
application should be approved or denied. In the case of loan approval,
computers can use the logical if - then statements to predict whether the
customer will default on the loan.
For example, when tossing a fair coin, the probabilities of heads and tails are
both 0.5, so the entropy is -(0.5 x log2 0.5 + 0.5 x log2 0.5) = 1. On the other
hand, if the coin is not fair, the probabilities
of heads and tails would not be equal and there would be less uncertainty.
As an extreme case, when the probability of tossing a head is equal to 0 or
1, the entropy is minimized to 0. Therefore, the entropy for a completely
pure variable is 0 and is 1 for a set with equal occurrences for both the
classes (head and tail, or yes and no).
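A quick sketch to verify these entropy values directly (illustrative only, not the book's code):

import math

def entropy(p):
    # Entropy (in bits) of a binary variable with P(head) = p.
    if p in (0.0, 1.0):
        return 0.0                      # completely pure variable
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

print(entropy(0.5))   # 1.0 for a fair coin
print(entropy(0.9))   # less uncertainty for an unfair coin
print(entropy(1.0))   # 0.0 when the outcome is certain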
ID3 Algorithm:
Evaluating a Decision Tree:
Decision trees use greedy algorithms, in that they always choose the
option that seems the best available at that moment. At each step, the
algorithm selects which attribute to use for splitting the remaining records.
This selection may not be the best overall, but it is guaranteed to be the best
at that step. This characteristic reinforces the efficiency of decision trees.
However, once a bad split is taken, it is propagated through the rest of the
tree. To address this problem, an ensemble technique (such as random
forest) may randomize the splitting or even randomize the data and build
multiple tree structures. These trees then vote for each class, and the
class with the most votes is chosen as the predicted class.
Having too many layers and obtaining nodes with few members
might be signs of overfitting. In overfitting, the model fits the training set
well, but it performs poorly on the new samples in the testing set. For
decision tree learning, overfitting can be caused by either the lack of training
data or the biased data in the training set. Two approaches can help avoid
overfitting in decision tree learning.
• Stop growing the tree early before it reaches the point where all the
training data is perfectly classified.
• Grow the full tree, and then post-prune the tree with methods such as
reduced-error pruning and rule- based post pruning.
Decision trees are not a good choice if the dataset contains many
irrelevant variables. This is different from the notion that they are robust
with redundant variables and correlated variables. If the dataset contains
redundant variables, the resulting decision tree ignores all but one of these
variables because the algorithm cannot detect information gain by including
more redundant variables. On the other hand, if the dataset contains
irrelevant variables and if these variables are accidentally chosen as splits
in the tree, the tree may grow too large and may end up with less data at
every split, where overfitting is likely to occur. To address this problem,
feature selection can be introduced in the data preprocessing phase to
eliminate the irrelevant variables.
NAIVE BAYES
• Middle Class: $50,000 < income < $1,000,000
• Upper Class: income >$1,000,000
Naive Bayes classifiers can also be used for fraud detection. In the
domain of auto insurance, for example, based on a training set with
attributes such as driver's rating, vehicle age, vehicle price, historical claims
by the policy holder, police report status, and claim genuineness, naive
Bayes can provide probability-based classification of whether a new claim
is genuine.
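As a rough, library-free sketch of this counting idea, using made-up claim records and only two of the attributes mentioned above (all values are assumptions):

from collections import Counter, defaultdict

# Hypothetical training claims: (driver_rating, police_report, label)
claims = [
    ("good", "filed", "genuine"), ("good", "filed", "genuine"),
    ("poor", "missing", "fraud"), ("poor", "filed", "genuine"),
    ("poor", "missing", "fraud"), ("good", "missing", "genuine"),
]

label_counts = Counter(label for _, _, label in claims)
attr_counts = defaultdict(Counter)                  # per-label attribute/value counts
for rating, report, label in claims:
    attr_counts[label][("rating", rating)] += 1
    attr_counts[label][("report", report)] += 1

def score(rating, report, label):
    # Unnormalized P(label) * P(rating|label) * P(report|label)
    n = label_counts[label]
    prior = n / len(claims)
    return (prior
            * attr_counts[label][("rating", rating)] / n
            * attr_counts[label][("report", report)] / n)

new_claim = ("poor", "missing")
for label in label_counts:
    print(label, score(new_claim[0], new_claim[1], label))   # pick the larger score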
BAYES' THEOREM
DIAGNOSTICS
The model is simple to implement even without using libraries. The
prediction is based on counting the occurrences of events, making the
classifier efficient to run. Naive Bayes is computationally efficient and is
able to handle high-dimensional data efficiently. In some cases naive Bayes
even outperforms other methods. Unlike logistic regression, the naive
Bayes classifier can handle categorical variables with many levels. Recall
that decision trees can handle categorical variables as well, but too many
levels may result in a deep tree. The naive Bayes classifier overall performs
better than decision trees on categorical values with many levels. Compared
to decision trees, naive Bayes is more resistant to overfitting, especially with
the presence of a smoothing technique.
DIAGNOSTICS OF CLASSIFIERS
A receiver operating characteristic (ROC) curve evaluates the performance
of a classifier based on the TP and FP rates, regardless of other factors such as
class distribution and error costs.
Related to the ROC curve is the area under the curve (AUC). The
AUC is calculated by measuring the area under the ROC curve. Higher
AUC scores mean the classifier performs better. The score can range from
0.5 (for the diagonal line TPR=FPR) to 1.0 (with ROC passing through the
top-left corner).
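A minimal sketch of computing the ROC curve and AUC with scikit-learn, using made-up labels and scores:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]                          # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.3]         # classifier probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))               # 0.5 = diagonal, 1.0 = perfect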
• Internal nodes are the decision or test points. Each internal node refers
to an input variable or an attribute
• Decision trees use greedy algorithms, in that they always choose the
option that seems the best available at that moment. At each step, the
algorithm selects which attribute to use for splitting the remaining
records.
QUESTIONS
*****
TIME SERIES AND TEXT ANALYSIS
(a) The expected value (mean) of yt, is a constant for all values of t.
(b) The variance of yt, is finite.
(c) The covariance of yt and yt+h depends only on the value of h = 0, 1, 2,
... for all t.
So the constant variance coupled with part (a), E[yt] = μ for all t and
some constant μ, suggests that a stationary time series can look like Figure
8-2. In this plot, the points appear to be centered about a fixed constant,
zero, and the variance appears to be somewhat constant over time.
In an MA(q) model, the value of a time series is a linear combination
of the current white noise term and the prior q white noise terms. So earlier
random shocks directly affect the current value of the time series. For
MA(q) models, the behavior of the ACF and PACF plots are somewhat
swapped from the behavior of these plots for AR(p) models. For a simulated
MA(3) time series of the form yt = εt + 0.4εt-1 + 1.1εt-2 - 2.5εt-3,
Figure 8-6 provides the ACF plot for the simulated data. Again, the
ACF(0) equals 1, because any variable is perfectly correlated with itself. At
lags 1, 2, and 3, the value of the ACF is relatively large in absolute value
compared to the subsequent terms. In an autoregressive model, the ACF
slowly decays, but for an MA(3) model, the ACF somewhat abruptly cuts
off after lag 3. In general, this pattern can be extended to any MA(q) model.
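A small simulation sketch of this MA(3) process and its ACF, assuming NumPy and statsmodels are available (not the book's code):

import numpy as np
from statsmodels.tsa.stattools import acf

np.random.seed(0)
eps = np.random.normal(size=1003)                    # white noise shocks
# y_t = e_t + 0.4*e_{t-1} + 1.1*e_{t-2} - 2.5*e_{t-3}
y = eps[3:] + 0.4 * eps[2:-1] + 1.1 * eps[1:-2] - 2.5 * eps[:-3]

print(np.round(acf(y, nlags=6), 2))                  # large at lags 1-3, then cuts off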
The combination of these two models for a stationary time series results in an
Autoregressive Moving Average model, ARMA(p, q), which is expressed
as shown in Equation 8-15.
ADDITIONAL METHODS
TEXT ANALYSIS
Text analysis suffers from the curse of high dimensionality. Text
analysis often deals with textual data that is far more complex. A corpus
(plural: corpora) is a large collection of texts used for various purposes in
Natural Language Processing (NLP). Another major challenge with text
analysis is that most of the time the text is not structured.
Text mining uses the terms and indexes produced by the prior two
steps to discover meaningful insights pertaining to domains or problems of
interest. With the proper representation of the text, many techniques,
such as clustering and classification, can be adapted to text mining. For
example, k-means can be modified to cluster text documents into
groups, where each group represents a collection of documents with a
similar topic. The distance of a document to a centroid represents how
closely the document talks about that topic. Classification tasks such as
sentiment analysis and spam filtering are prominent use cases for the naive
Bayes classifier. Text mining may utilize methods and techniques from various fields
of study, such as statistical analysis, information retrieval, data mining, and
natural language processing.
Note that, in reality, all three steps do not have to be present in a text
analysis project. If the goal is to construct a corpus or provide a catalog
service, for example, the focus would be the parsing task using oneor more
text preprocessing techniques, such as part-of-speech (POS) tagging, named
entity recognition, lemmatization, or stemming. Furthermore, the three
tasks do not have to be sequential. Sometimes their orders might even look
like a tree.
A TEXT ANALYSIS EXAMPLE
1. Collect raw text: This corresponds to Phase 1 and Phase 2 of the Data
Analytic Lifecycle. In this step, the Data Science team at ACME
monitors websites for references to specific products. The websites may
include social media and review sites. The team could interact with
social network application programming interfaces (APIs), process data
feeds, or scrape pages and use product names as keywords to get the raw
data. Regular expressions are commonly used in this case to identify
text that matches certain patterns. Additional filters can be applied to the
raw data for a more focused study. For example, only retrieving the
reviews originating in New York instead of the entire United States
would allow ACME to conduct regional studies on its products.
Generally, it is a good practice to apply filters during the data collection
phase. They can reduce I/O workloads and minimize the storage
requirements.
6. Review the results and gain greater insights: This step corresponds to
Phases 5 and 6 of the Data Analytic Lifecycle. Marketing gathers the
results from the previous steps. Find out what exactly makes people love
or hate a product. Use one or more visualization techniques to report the
findings. Test the soundness of the conclusions and operationalize the
findings if applicable.
In Data Analytic Lifecycle discovery is the first phase. In it, the Data
Science team investigates the problem, understands the necessary data
sources, and formulates initial hypotheses. Correspondingly, for text
analysis, data must be collected before anything can happen. The Data
Science team starts by actively monitoring various websites for user-
generated contents. The user-generated contents being collected could be
related articles from news portals and blogs, comments on ACME's
products from online shops or review sites, or social media posts that
contain the keywords bPhone or bEbook. Regardless of where the data comes
from, it's likely that the team would deal with semi-structured data such as
HTML web pages, Really Simple Syndication (RSS) feeds, XML, or
JavaScript Object Notation (JSON) files. Enough structure needs to be
imposed to find the part of the raw text that the team really cares about. In
the brand management example, ACME is interested in what the reviews
say about bPhone or bEbook and when the reviews are posted. Therefore,
the team will actively collect such information.
The team can then construct the web scraper based on the identified
patterns. The scraper can use the curl tool to fetch HTML source code given
specific URLs, use XPath and regular expressions to select and extract the
data that match the patterns, and write them into a data store.
Regular expressions can find words and strings that match particular
patterns in the text effectively and efficiently. The general idea is that once
text from the fields of interest is obtained, regular expressions can help
identify if the text is really interesting for the project. In this case, do those
fields mention bPhone, bEbook, or ACME? When matching the text, regular
expressions can also take into account capitalizations, common
misspellings, common abbreviations, and special formats for e-mail
addresses, dates, and telephone numbers.
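For instance, a hedged sketch of such a regular-expression filter (the pattern and sample reviews are illustrative, not the book's code):

import re

# Case-insensitive pattern for the products of interest.
pattern = re.compile(r"\b(bphone|bebook|acme)\b", re.IGNORECASE)

reviews = [
    "Loving my new bPhone!",
    "This toaster is terrible.",
    "ACME's bEbook reader froze again :(",
]
relevant = [r for r in reviews if pattern.search(r)]
print(relevant)          # keeps only the text that mentions the products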
REPRESENTING TEXT
In this data representation step, raw text is first transformed with text
normalization techniques such as tokenization and case folding. Then it is
represented in a more structured way for analysis.
Tokenization is the task of separating words from the body of text. Raw
text is converted into collections of tokens after the tokenization, where
each token is generally a word.
Consider the tweet "I once had a gf back in the day. Then the bPhone came out lol".
Tokenization based on spaces would output this list of tokens:
{I, once, had, a, gf, back, in, the, day., Then, the, bPhone, came, out, lol}
Another way is to tokenize the text based on punctuation marks and
spaces. In this case, the previous tweet would become:
{I, once, had, a, gf, back, in, the, day, ., Then, the, bPhone, came, out, lol}
However, tokenizing based on punctuation marks might not be well
suited to certain scenarios. For example, if the text contains
suited to certain scenarios. For example, if the text contains
contractions such as we'll, tokenizing based on punctuation will split them
into the separate words we and ll.
Tokenization is a much more difficult task than one may expect. For
example, should words like state-of-the-art, Wi-Fi, and San Francisco
be considered one token or more?
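A minimal Python sketch of the two tokenization strategies applied to the example tweet (an illustration, not the book's code):

import re

tweet = "I once had a gf back in the day. Then the bPhone came out lol"

tokens_by_space = tweet.split()                      # splits on whitespace only
tokens_by_punct = re.findall(r"\w+|[^\w\s]", tweet)  # words and punctuation marks

print(tokens_by_space)   # ['I', 'once', ..., 'day.', 'Then', ...]
print(tokens_by_punct)   # ['I', 'once', ..., 'day', '.', 'Then', ...]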
Case folding reduces all letters to lowercase; after case folding, the tweet becomes:
i once had a gf back in the day. then the bphone came out lol
One needs to be cautious applying case folding to tasks such as
information extraction, sentiment analysis, and machine translation. For
example, when General Motors becomes general and motors, the
downstream analysis may very likely consider them as separated words
rather than the name of a company.
TERM FREQUENCY-INVERSE DOCUMENT
FREQUENCY (TFIDF)
The TFIDF (or TF-IDF) is a measure that considers both the
prevalence of a term within a document (TF) and the scarcity of the term
over the entire corpus (IDF). The TFIDF of a term t in a document d is
defined as the term frequency of t in d multiplied by the inverse document
frequency of t in the corpus, as shown in Equation 9-7: TFIDF(t, d) = TF(t, d) x IDF(t).
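A small, self-contained sketch of the TF and IDF calculations on a toy corpus (the exact weighting formulas vary by source; this is one common variant):

import math

corpus = [
    "the bphone is great",
    "the bebook is okay",
    "great screen great battery",
]
docs = [doc.split() for doc in corpus]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term):
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf("great", docs[2]))   # frequent here, rare elsewhere -> higher score
print(tfidf("the", docs[0]))     # common across documents -> lower score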
Figure 9-4 illustrates the intuitions behind LDA. The left side of the
figure shows four topics built from a corpus, where each topic contains a
list of the most important words from the vocabulary. The four example
topics are related to problem, policy, neural, and report. For each document,
a distribution over the topics is chosen, as shown in the histogram on the
right. Next, a topic assignment is picked for each word in the document, and
the word from the corresponding topic (colored discs) is chosen. In reality,
only the documents (as shown in the middle of the figure) are available. The
goal of LDA is to infer the underlying topics, topic proportions, and topic
assignments for every document.
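As a hedged illustration, a topic model can be fit with scikit-learn's LDA implementation; the toy documents and the choice of two topics below are assumptions, not the book's example:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market trading prices fell",
    "neural network training deep learning",
    "policy report government spending",
    "market prices rose on trading volume",
]
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))     # per-document topic proportions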
DETERMINING SENTIMENTS
Depending on the classifier, the data may need to be split into
training and testing sets. One way for splitting data is to produce a training
set much bigger than the testing set. For example, an 80/20 split would
produce 80% of the data as the training set and 20% as the testing set.
Next, one or more classifiers are trained over the training set to learn
the characteristics or patterns residing in the data. The sentiment tags in the
testing data are hidden away from the classifiers. After the training,
classifiers are tested over the testing set to infer the sentiment tags. Finally,
the result is compared against the original sentiment tags to evaluate the
overall performance of the classifier.
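A compact sketch of this train/test workflow with scikit-learn and a naive Bayes classifier (toy reviews and labels; the 80/20 split mirrors the example above):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

reviews = ["love the bphone", "hate this product", "great battery life",
           "terrible screen", "really love it", "awful and slow"]
labels  = [1, 0, 1, 0, 1, 0]                          # 1 = positive, 0 = negative

X = CountVectorizer().fit_transform(reviews)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=0)         # the 80/20 split

clf = MultinomialNB().fit(X_train, y_train)            # tags hidden from testing data
print(accuracy_score(y_test, clf.predict(X_test)))     # compare to original tags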
SUMMARY
covariance of the variables in the time series and its underlying
structure.
• Text analysis is text mining, the process of discovering relationships
and interesting patterns in large text collections.
• A corpus (plural: corpora) is a large collection of texts used for various
purposes in Natural Language Processing (NLP).
• Parsing is the process that takes unstructured text and imposes a
structure for further analysis.
• Search and retrieval is the identification of the documents in a corpus
that contain search items such as specific words, phrases, topics, or
entities like people or organizations.
• Text mining uses the terms and indexes produced by the prior two steps
to discover meaningful insights pertaining to domains or problems of
interest.
QUESTIONS
*****
DATA PRODUCT & BIG DATA
OPERATING SYSTEM
INTRODUCTION TO DATA PRODUCT
• Data products are economic engines that self-adapt and use data to
acquire their value; in the process they create additional data as they
make inferences or predictions on new data, influencing human
behavior with this very data.
• Data products are no longer just programs that run on a web interface;
they are becoming an important part of every domain of activity in
the modern world.
• In this era of data products, the job of the data scientist is to build them.
The experimental methodology is the typical analytical workflow that
data scientists follow in creating a data product:
Ingestion → Wrangling → Modeling → Reporting & Visualization.
• The data science pipeline is human-designed and augmented by
the use of languages like R and Python.
• When we create a data product, it allows data to become big in size
and fast in execution, and it enables a larger variety of data for computation,
which in turn helps to derive insights without human interaction.
• Using Large Datasets as an Advantage
• Humans have extraordinary vision for large-scale patterns, such as
woods and clearings visible through the foliage.
• Statistical methodologies allow us to deal with both noisy and
meaningful data by defining them with aggregations and indices
or inferentially by performing the analysis directly.
• As our ability to gather data has increased, so has the requirement
for more generalization.
• Smart grids, quantified selves, mobile technology, sensors, and
wired homes all require personalised statistical inference.
• Scale is measured by the number of facets that must be explored
in addition to the amount of data—a forest view for individual
trees.
• Hadoop is distinct due to the economics of data processing as well
as the fact that it is a platform.
• Hadoop's release was interesting in that it came at a time when the
world needed a solution for large-scale data analytics.
• Data issues are no longer limited to tech behemoths; they also impact
commercial and public organisations of all sizes, from large companies
to startups, federal agencies to cities, and perhaps even individuals.
• Computing services are also becoming more available and affordable.
• Data scientists can get on-demand, instant access to clusters of large
sizes by using cloud computing platforms like Google Compute Engine
or Amazon EC2, at a fraction of the cost of traditional data centres and
with no data centre management required.
• Big data computing is being made democratic and more open to
everyone by Hadoop
• Data analytics at large scale have historically been available only to
social networks such as Facebook and Twitter, but now they are also
available to individual brands or artists.
• Connected homes and mobile devices, as well as other personal sensors,
are producing vast quantities of personal data, raising questions about
privacy, among other items.
• In 2015, British researchers founded the Hub of All Things (HAT). It
is a customised data collection that tackles the problem of data
ownership and offers a solution for personal data aggregation.
• New data problems are emerging, and a data product is needed to
address these questions.
• Applications like ShotSpotter & Location and HAT offer an application
interface and decision-making tools to help people derive value from
data and create new data.
• Conventional software development workflows are insufficient for
working with large datasets, but Big Data workflows and Hadoop have
allowed and personalised these applications.
• An analyst takes in a large volume of data, performs operations
on it to convert it into a normal form so that different calculations
can be performed, and finally presents the results in a visual
manner.
• With the overwhelming growth rate in the volume and velocity at
which many businesses are now generating data, this human-powered
model is not scalable.
• A feedback system is often needed during the workflow management
stage, so that the output of one job can be automatically fed in as
the data input for the next, allowing for self-adaptation.
• The ingestion phase involves both the model's initialization and the
model's interaction with users.
• Users may define data source locations or annotate data during the
initialization process
• While interacting, users will receive the predictions given by the model
and in turn give important feedback to strengthen the model
• The staging step requires executing transformations on data to make it
usable and storable, allowing it to be processed.
• The tasks of staging include data normalization, standardization & data
management.
• The computation phase takes maximum time while executing the key
responsibilities of extracting insights from data, conducting
aggregations or reports, and developing machine learning models for
recommendations, regressions, clustering, or classification.
• The workflow management phase involves tasks such as abstraction,
orchestration, and automation, which allow the workflow steps to be
operationalized. The final output is supposed to be an automated
program that can be run as desired.
• Hadoop systems ensure that the criteria for a distributed Big Data
Operating System are met, as well as that Hadoop is a data management
system that works as expected while processing analytical data.
• Hadoop has mainly been used to store and compute massive,
heterogeneous datasets stored in data lakes rather than warehouses, as
well as for rapid data processing and prototyping.
• Basic knowledge of distributed computing and storage is needed to fully
understand the working of Hadoop and how to build data processing
algorithms and workflows.
• Hadoop distributes the computational processing of a large dataset to
several machines that each run on their own chunk of data in parallel
to perform computation at scale.
• The following conditions must be fulfilled by a distributed system:
• Fault tolerance - A failure of one part of the system does not result in the
whole system failing. The system should be able to degrade into a less
productive state in a graceful manner. The failed part
should be able to rejoin the system if it recovers.
• Recoverability - No data should be lost when a malfunction
occurs no matter how big or small.
• Scalability - As the load increases (data and computation), performance
decreases gracefully rather than failing; increasing resources should result in a
proportional increase in capacity.
• Continuity - The failure of one job or task should not affect the
final result.
• Hadoop tackles the above specifications using a variety of abstract
principles such as:
• Clusters - working out how to manage data storage and distributed
computing in a cluster.
• Data distribution - As data is applied to the cluster and stored on
several nodes, it is distributed instantly. To reduce network traffic,
each node processes locally stored data
• Data Storage - Data is held in typically 128 MB fixed-size blocks,
and copies of each block are made several times for achieving
redundancy and data protection.
• Jobs - In Hadoop, a job is any computation performed; jobs may
be divided into several tasks, with each node performing the work
on a single block of data.
• Programming Language - Jobs written in high level allow us to
ignore low level details, allowing developers to concentrate their
attention only on data and computation.
• Fault tolerance - When task replication is used, jobs are fault
tolerant, ensuring that the final computation is not incorrect or
incomplete if a single node or task fails.
• Communication - The amount of communication between nodes
should be kept to a minimum and should be handled transparently by
the system. To avoid inter-process dependencies leading to deadlock,
every task should be executed independently, and nodes should not
communicate during processing.
• Work Allocation - Master programmes divide work among worker
nodes so that they can all run in parallel on their own slice of the
larger dataset.
HADOOP ARCHITECTURE
• Together, HDFS and YARN form a platform that can be used
for creating big data applications, as it provides an operating system for
big data. The two collaborate to reduce network traffic in
the cluster, mainly by guaranteeing that data is kept local to the
necessary computation. Both data and tasks are duplicated to ensure
error tolerance, recoverability, and accuracy. The cluster is managed
centrally to provide scalability and to abstract away low-level cluster
programming details.
Hadoop Cluster:
Fig : A cluster in Hadoop containing two master & four workers
nodes together implementing the six primary Hadoop services
(Ref - Chapter 2, Fig 2.2 - Data Analytics with Hadoop - An Introduction
for Data Scientists)
HDFS has the following master, worker services:
• NameNode (Master service)
• Keeps the file system's directory tree, file metadata, and the
locations of all files in the cluster.
• Clients who want to use HDFS must first request information from
the NameNode in order to find the required storage nodes.
• Secondary NameNode (Master service)
• On behalf of the NameNode, conducts housekeeping and
checkpointing.
• It is not a backup NameNode, despite its name.
• DataNode (Worker service)
• Stores and manages HDFS blocks on the local disk.
• Reports health and status of individual data stores back to the
NameNode.
• When a client application requests data from HDFS, it must first make
a request to the NameNode for the data to be located on disc.
• Instead of storing data or transferring data from DataNode to client, the
NameNode simply functions as a traffic cop, guiding clients to the
necessary DataNodes.
• Following are the master and worker services provided by YARN:
• ResourceManager (Master service)
• Controls job scheduling on the cluster by allocating and monitoring
available cluster resources, such as physical assets like memory and
processor cores, to applications.
• ApplicationMaster (Master service)
• The ResourceManager schedules the execution of a particular
program on the cluster, and this portion coordinates its execution.
• NodeManager (Worker service)
• On each individual node, it runs and manages processing
tasks and reports on their health and status.
• Similar to how HDFS works, clients that wish to execute a job must first
request resources from the ResourceManager, which assigns an
application-specific ApplicationMaster for the duration of the job. The
ApplicationMaster is responsible for tracking the execution of the job,
while the ResourceManager is responsible for tracking the status of the
nodes, and each individual NodeManager creates containers and
executes tasks within them.
• Pseudo-distributed mode is a single node cluster. All Hadoop daemons
are run on a single machine as if it were a cluster, but network traffic is
routed via the local loopback network interface. The advantages of a
distributed architecture aren't realised in this mode, but it's a great way
to build without having to worry about handling multiple machines.
• HDFS is optimized for storing a relatively small number of very large files
rather than billions of smaller files that would otherwise occupy
the same amount of space.
• HDFS follows the WORM (write once, read many) pattern and
does not permit random file appends or writes.
• HDFS is designed for large-scale, continuous file reading rather
than random reading or collection.
• HDFS Blocks
• HDFS files are divided into blocks, which are usually 64 MB or 128
MB in size, but this is configurable at runtime, and high-
performance systems typically use 256 MB block sizes.
• Equivalent to the block size on a single disc file system, the block
size in HDFS is the smallest amount of data that can be read or
written to. Files that are smaller than the block size, unlike blocks
on a single disc, do not fill the entire block.
• Blocks allow very large files to be split across multiple machines
and distributed at runtime. To allow for more efficient distributed
processing, separate blocks from the same file will be stored on
different machines.
• The DataNodes replicate the blocks. The replication factor is three
by default, but this can be modified at runtime. As a result, each
block of data resides on three different computers and three different
discs, and the data will not be lost even if two nodes fail.
• The cluster's potential data storage capacity is just a third of the
available disc space due to replication.
• HDFS Data Management
• The master NameNode keeps track of the file's blocks and their
locations.
• The NameNode communicates with the DataNodes, which are
processes that house the blocks in the cluster.
• Each file's metadata is stored in the NameNode master's memory for
fast lookups, and if the NameNode stops or fails, the entire cluster
becomes unavailable.
• The Secondary NameNode is not a substitute for the NameNode;
rather, it handles the NameNode's housekeeping such as periodically
combining a snapshot of the current data space with the edit log to
prevent the edit log from becoming too large.
• The edit log is used to maintain data integrity and
avoid data loss; if the NameNode fails, this combined record
can be used to restore the state of the DataNodes.
Workload & Resource Manager (YARN):
• The original version of Hadoop offered MapReduce on HDFS where the
MapReduce job/workload management functions were highly coupled
to the cluster/resource management functions. As a result, other
computing models or applications were unable to use the cluster
infrastructure for execution of distributed workloads.
• YARN separates workload and resource management so that many
applications can share a single, unified resource management service.
Hadoop is no longer a uniquely oriented MapReduce platform, but a
full-fledged multi-application, big data operating system, thanks to
YARN's generalised job and resource management capabilities.
• The basic concept behind YARN is to separate the resource management
and workload management roles into separate daemons.
Fig - A map function
(Ref - Chapter 2, Fig 2.3 - Data Analytics with Hadoop - An Introduction
for Data Scientists)
• Reduce Function
• Any emitted key/value pairs will be grouped by key after the map
phase, and those key/value groups will be used as input for per-key
reduce functions.
• When a reduce function is applied to an input set, the output is a
single, aggregated value.
• MapReduce Framework
• Hadoop MapReduce is a software framework for composing jobs
that run in parallel on a cluster and process large quantities of data,
and is the native distributed processing framework that ships with
Hadoop.
• As a job configuration, the system exposes a Java API that allows
developers to define HDFS input and output positions, map and
reduce functions, and other job parameters.
• Jobs are compiled and packaged into a JAR, which is submitted to
the Resource Manager by the job client—usually via the command
line. The Resource Manager will then schedule the tasks, monitor
them, and provide the status back to the client.
• Typically, a MapReduce application is composed of three Java
classes: a Job, a Mapper, and a Reducer.
• Mappers and reducers handle the details of computation on
key/value pairs and are connected through a shuffle and sort phase.
The Job is responsible for configuring the input and output data
format by specifying the InputFormat and OutputFormat classes of
data being serialized to and from HDFS.
QUESTIONS
1. Explain the concept of Data Product.
7. Explain the different master and worker services in Hadoop
HADOOP STREAMING & IN-MEMORY
COMPUTATION WITH SPARK
INTRODUCTION
HADOOP STREAMING
• Following figure demonstrates the streaming process in a MapReduce
context.
• When Streaming executes a job, each mapper task will launch the
supplied executable inside of its own process.
• The mapper then converts the input data into lines of text and pipes it
to the stdin of the external process while simultaneously collecting
output from stdout.
• The input conversion is usually a simple and straightforward
serialization of the value as data is read from HDFS, with each
line treated as a new value.
• The mapper expects output to be in a string key/value format, where the
key is separated from the value by some separator character, tab (\t) by
default. If there is no separator, then the mapper considers the output to
only be a key with a null value.
• The reducer is launched as its own separate executable once the output from
the mappers is shuffled and sorted, ensuring that each key is sent to the
same reducer.
• The key/value strings output by the mappers are then streamed to
the reducer as input through stdin, matching the output format of the mapper,
and are guaranteed to be grouped by key.
• The output given by the reducer to stdout is expected to have the same
key, separator, and value format as that of the mapper.
• To write Hadoop jobs using Python, we create two separate Python
files, mapper.py and reducer.py. Inside each of these files we include
the statement import sys to enable access to stdin and stdout (see the
word-count sketch after this list).
• The code will accept input as a string and parse it; after computing
with numbers or more complex data types, it needs to serialize the
output back to a string.
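As a hedged illustration of the two files described above, here is a classic word-count pair; the job itself would be submitted through the Hadoop Streaming JAR (a minimal sketch, not the book's code):

# mapper.py -- reads lines from stdin, emits "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py -- input arrives sorted by key, so counts can be accumulated per word
import sys

current, count = None, 0
for line in sys.stdin:
    word, value = line.strip().split("\t", 1)
    if word != current:
        if current is not None:
            print(current + "\t" + str(count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print(current + "\t" + str(count))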
ADVANCED MAP REDUCE CONCEPTS
Combiners:
• Mappers produce a lot of intermediate data that must be sent over the
network to be shuffled, sorted, and reduced. Since the network is a
physical resource, transmitting large amounts of data can lead to
job delays and memory bottlenecks.
• Combiners are the primary mechanism to solve this problem, and are
essentially intermediate reducers that are associated with the mapper
output. Combiners reduce network traffic by performing a mapper- local
reduction of the data before forwarding it on to the appropriate reducer.
Partitioners:
• Partitioners control how keys and their values get sent to individual
reducers by dividing up the keyspace.
• The default behaviour is the HashPartitioner, which is often all that is
needed; it computes the hash of the key and assigns the key to a keyspace
determined by the number of reducers, allocating keys evenly across the
reducers.
• Given a uniformly distributed keyspace, each reducer will get a
relatively equal workload. The problem occurs when there is a key
imbalance caused when a large number of values are associated with
one key. In such a situation, a major portion of the reducers are
unutilized, and the benefit of reduction using parallelism is lost.
• A custom partitioner can ease this problem by dividing the keyspace
according to some other semantic structure besides hashing.
Job Chaining:
• Most complex algorithms cannot be described as a simple map and
reduce, so in order to implement more complex analytics, a technique
called job chaining is required.
• If a complex algorithm can be decomposed into several smaller
MapReduce tasks, then these tasks can be chained together to produce
a complete output.
• Job chaining is therefore the combination of many smaller jobs into a
complete computation by sending the output of one or more previous
jobs into the input of another.
• Linear job chaining produces complete computations by sending the
output of one or more MapReduce jobs as the input to another
SPARK BASICS
Fig. Spark framework
(Ref - Chapter 4, Fig 4.1 - Data Analytics with Hadoop - An Introduction
for Data Scientists)
• Spark Streaming
• Enables real time processing and manipulation of unbounded
streams of data
• There are many streaming data libraries available for handling real-
time data.
• Spark Streaming gives programmers the advantage of interacting
with streaming data in a manner similar to interacting with a
normal RDD, as the data comes in.
• MLlib
• A library containing machine learning algorithms implemented as
Spark operations on RDD.
• MLlib provides developers with scalable machine learning algorithms
such as neural networks, decision trees, classification,
regression, etc.
• GraphX
• It provides a set of algorithms and tools for manipulating graphs
and performing parallel operations on them.
• It extends the RDD API with functions for manipulating graphs,
creating subgraphs, or accessing all vertices in a path.
• Spark does not deal with distributed data storage; it depends upon
Hadoop to provide storage functionality, and it uses Resilient
Distributed Datasets (RDDs) to make distributed computation more reliable.
• An RDD is a read-only collection of objects partitioned across a set
of machines.
• RDD can be recreated using knowledge of the sequence of applications
of transformations to earlier RDD and are hence fault tolerant and can
be accessed using parallel operations. RDD can be read and written to
distributed storages and also provide ability to be cached in the memory
of worker nodes for future iterations.
• The feature of in-memory caching helps achieve massive speedups in
execution and facilitates iterative computing required in the case of
machine learning analyses.
• RDDs are operated on using functional programming concepts such as
map and reduce. New RDDs can be created simply by loading data or
by applying a transformation to an existing collection of data, which
generates a new RDD.
• The sequence of transformations applied to an RDD defines its lineage;
because RDDs are immutable, these transformations can be reapplied to
the complete collection to recover from failure.
• Spark API is a collection of operations that is used for creation,
transformation and export of RDD
• RDD can be operated upon using transformations and actions.
• Transformations – These consist of operations that are applied to an
existing RDD for the creation of a new one—for example, application
of a filter operation to an RDD for generation of a smaller RDD.
• Actions are operations that return a computed result to the Spark
driver program; this involves coordinating or aggregating all
partitions of an RDD.
• In the context of the MapReduce model, map is a transformation while
reduce is an action. A map transformation passes a function to each
object in the RDD and produces a new RDD that is a mapping of the
old one. In a reduce action, the RDD has to be repartitioned and an
aggregate value such as a sum or average is computed and returned to
the driver (see the sketch after this list).
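A minimal PySpark sketch of a map transformation followed by a reduce action, assuming a local Spark installation (illustrative only):

from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")       # the driver creates the SparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])    # new RDD from a collection
squares = numbers.map(lambda x: x * x)       # transformation: builds a new RDD lazily
total   = squares.reduce(lambda a, b: a + b) # action: aggregates across partitions
print(total)                                 # 55

sc.stop()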
• The Spark Execution Model
• The Spark Execution model brings about its execution through the
interaction of the following components: driver, YARN, and
workers.
• The SparkContext in a driver program coordinates the independent
execution of processes in Spark applications
• The context in SparkContext will connect to a cluster manager for
allocation of system resources.
• Management of every worker in the cluster is done by an executor,
and management of the executor is done by the SparkContext.
• The executor coordinates computation, storage and caching on every
machine.
QUESTIONS
5. Explain the concept of Resilient Distributed Datasets
6. Write short note on : A typical Spark Application
7. Write short note on : The Spark Execution model
8. What is data science pipeline? Explain in detail with a neat diagram.
9. How to refactor the data science pipeline into an iterative model?
Explain all its phases with a neat diagram.
10. List the requirements of distributed system in order to perform
computation at scale.
11. How Hadoop addresses these requirements?
12. Write a short note on Hadoop architecture.
13. Explain with a neat diagram a small Hadoop cluster with two master
nodes and four worker nodes that implements all six primary
Hadoop services.
24. Write a short note on Partitioners in advanced MapReduce context.
25. Write a short note on Job Chaining in advanced MapReduce context.
26. Write in brief about Spark. Also write and explain its primary
components.
DISTRIBUTED ANALYSIS AND
PATTERNS
Compound Keys:
Keys need not be simple primitives such as integers or strings;
instead, they can be compound or complex types so long as they are both
hashable and comparable. Comparable types must at the very least expose
some mechanism to determine equality and some method of ordering.
Comparison is usually accomplished by mapping some type to a numeric
value (e.g., months of the year to the integers 1-12) or through a lexical
ordering. Hashable types in Python are any immutable type, the most
notable of which is the tuple. Tuples can contain mutable types (e.g., a tuple
of lists), however, so a hashable tuple is one that is composed of immutable
types. Mutable types such as lists and dictionaries can betransformed into
immutable tuples:
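For example (illustrative values only):

# Lists and dicts are not hashable, so they cannot be used directly as keys.
key_list = ["2014-02-18", "local"]
key_dict = {"date": "2014-02-18", "kind": "local"}

key_from_list = tuple(key_list)                       # ('2014-02-18', 'local')
key_from_dict = tuple(sorted(key_dict.items()))       # (('date', ...), ('kind', ...))

counts = {key_from_list: 1, key_from_dict: 2}         # both now work as dict keys
print(counts)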
Compound keys are used in two primary ways: to facet the keyspace
across multiple dimensions and to carry key-specific information forward
through computational stages that involve the values alone.
Consider web log records of the following form:
Web log records are a typical data source of big data computations
on Hadoop, as they represent per-user clickstream data that can be easily
mined for insight in a variety of domains; they also tend to be very large,
dynamic semistructured datasets, well suited to operations in Spark and
MapReduce. Initial computation on this dataset requires a frequency
analysis; for example, we can decompose the text into two daily time series,
one for local traffic and the other for remote traffic using a compound key:
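A hedged PySpark sketch of such a mapper; since the record layout is not reproduced here, the field positions, input path, and timestamp format below are assumptions:

from pyspark import SparkContext

sc = SparkContext("local", "weblog-keys")
logs = sc.textFile("weblogs/")            # assumed path to the raw log lines

# Assumed layout: "<host> <ISO timestamp> <request> ..."; adjust to the real format.
def keyed(line):
    host, timestamp = line.split()[:2]
    date = timestamp[:10]                                     # e.g. '2014-02-18'
    kind = "local" if host.startswith(("10.", "192.168.")) else "remote"
    return ((date, kind), 1)              # compound key faceted by date and traffic type

keyed_logs = logs.map(keyed)
print(keyed_logs.take(5))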
Mapping the preceding dataset yields one such compound-keyed pair per record.
Compound data serialization:
The final consideration when using compound keys (and complex
values) is to understand serialization and deserialization of the compound
data. Serialization is the process of turning an object in memory into a
stream of bytes such that it can be written to disk or transmitted across the
network (deserialization is the reverse process). This process is essential,
particularly in MapReduce, as keys and values are written (usually as
strings) to disk between map and reduce phases.
By default in Spark, the Python API uses the pickle module for
serialization, which means that any data structures you use must be pickle-
able. With MapReduce Streaming, you must serialize both the key and the
value as a string, separated by a specified character, by default a tab (\t).
One common first attempt is to simply serialize an immutable type
(e.g., a tuple) using the built-in str function, converting the tuple into a string
that can be easily pickled or streamed. The problem then shifts to
deserialization; using the ast (abstract syntax tree) module in the Python standard
library, we can use the literal_eval function to evaluate stringified tuples
back into Python tuple types as follows:
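A minimal sketch of that round trip:

import ast

key = ('local', '2014-06-01')
serialized = str(key)                      # "('local', '2014-06-01')"
deserialized = ast.literal_eval(serialized)
assert deserialized == key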
As both keys and values get more complex, another useful data structure for
serialization is Base64-encoded JSON, because it is compact, uses only
ASCII characters, and is easily serialized and deserialized with the standard
library as follows:
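A minimal sketch (names are illustrative):

import base64
import json

def serialize(obj):
    return base64.b64encode(json.dumps(obj).encode('utf-8'))

def deserialize(data):
    return json.loads(base64.b64decode(data).decode('utf-8'))

record = {"key": ["local", "2014-06-01"], "value": 42}
assert deserialize(serialize(record)) == record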
Keyspace Patterns:
The notion of computing with keys allows you to manage sets of
data and their relations. However, keys are also a primary piece of the
computation, and as such, they must be managed in addition to the data.
There are several patterns that impact the keyspace, specifically the explode,
filter, transform, and identity patterns.
The following example applies the keyspace transformations identified earlier.
This example is perhaps a bit verbose for the required task, but it does
demonstrate each type of transformation as follows:
1. First, the dataset is loaded from a CSV using the split method.
2. At this point, orders is only a collection of lists, so we assign keys by
breaking the value into the IDs and date as the key, and associate it with
the list of products as the value.
3. The next step is to get the length of the products list (number of products
ordered) and to parse the date, using a closure that wraps a date format
for datetime.strptime; note that this step splits the compound key
and eliminates the customer ID, which is unnecessary.
4. In order to sort by order size, we need to invert the size value with the
key, also splitting the date from the key so we can also sort by date.
5. After performing the sort, this function reinverts so that each order can
be identified by size and date.
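A minimal PySpark sketch of these five steps, assuming a running SparkContext sc and a hypothetical orders.csv whose columns are customer ID, order ID, date, and then one column per product:

from datetime import datetime

def date_parser(fmt):
    # closure that wraps a date format for datetime.strptime
    def parse(datestr):
        return datetime.strptime(datestr, fmt).date()
    return parse

parse_date = date_parser("%Y-%m-%d")

# 1. load the dataset from the CSV using the split method
orders = sc.textFile("orders.csv").map(lambda line: line.split(","))

# 2. assign keys: (customer ID, order ID, date) -> list of products
orders = orders.map(lambda row: ((row[0], row[1], row[2]), row[3:]))

# 3. keep (order ID, parsed date) as the key and the order size as the value,
#    dropping the customer ID
sizes = orders.map(lambda kv: ((kv[0][1], parse_date(kv[0][2])), len(kv[1])))

# 4. invert the size with the key and split out the date so we can sort by both
inverted = sizes.map(lambda kv: ((kv[1], kv[0][1]), kv[0][0])).sortByKey(False)

# 5. reinvert so that each order is identified by its size and date
result = inverted.map(lambda kv: (kv[1], kv[0]))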
The explode mapper:
An explode mapper generates multiple intermediate key/value pairs
for a single input key. Generally, this is done by a combination of a key shift
and splitting of the value into multiple parts. An explode mapper can also
generate many intermediate pairs by dividing a value into its constituent
parts and reassigning them with the key. We can explode the list of products
per order value to order/product pairs, as in the following code:
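A minimal sketch of the explode mapping, assuming orders is an RDD of (order ID, product list) pairs:

# emit one (order ID, product) pair per product in the order's list
pairs = orders.flatMap(
    lambda kv: [(kv[0], product) for product in kv[1]]
)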
Note the use of the flatMap operation on the RDD, which is
specifically designed for explode mapping. It operates similarly to the
regular map; however, the function can yield a sequence instead of a single
item, which is then flattened into a single collection (rather than an RDD of
lists).
The filter mapper:
The word co-occurrence matrix tends to be sparse even without
stopword filtering, because most words only co-occur with very few other
words on a regular basis.
Figure 9.2 A word co-occurrence matrix demonstrates the frequency of
terms appearing together in the same block of text, such as a sentence
The pairs approach maps every cell in the matrix to a particular
value, where the pair is the compound key i, j. Reducers therefore work on
per-cell values to produce a final, cell-by-cell matrix. This is a reasonable
approach, which yields output where each Wij is computed upon and stored
separately. Using a sum reducer, the mapper is as follows:
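A minimal sketch of such a pairs mapper in Python (the sum reducer simply totals the 1s emitted for each cell):

from itertools import permutations

def pairs_mapper(sentence):
    # emit ((word_i, word_j), 1) for every ordered pair of words in the sentence
    for w1, w2 in permutations(sentence.split(), 2):
        yield (w1, w2), 1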
The stripes approach is not only more compact in its representation,
but also generates fewer and simpler intermediate keys, thus reducing the
amount of sorting and shuffling of data. However, the stripes object is
heavier, both in terms of processing time and serialization requirements,
particularly if the stripes get very large. There is a limit to the size of a
stripe, particularly in very dense matrices, which may require a lot of
memory to track individual occurrences.
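For comparison, a minimal sketch of a stripes-style mapper, which emits one dictionary ("stripe") of co-occurrence counts per word:

from collections import Counter

def stripes_mapper(sentence):
    words = sentence.split()
    for i, word in enumerate(words):
        # one stripe per word: counts of every other word in the sentence
        yield word, Counter(w for j, w in enumerate(words) if j != i)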
DESIGN PATTERNS
Donald Miner and Adam Shook explore 23 design patterns for
common MapReduce jobs. They loosely categorize them as follows:
Summarization:
Provide a summary view of a large dataset in terms of aggregations,
grouping, statistical measures, indexing, or other high-level views of the
data.
Filtering:
Create subsets or samples of the data based on a fixed set of criteria,
without modifying the original data in any way.
Data Organization:
Reorganize records into a meaningful pattern in a way that doesn’t
necessarily imply grouping. This task is useful as a first step to further
computations.
Joins:
Collect related data from disparate sources into a unified whole.
Metapatterns:
Implement job chaining and job merging for complex or optimized
computations. These are patterns associated with other patterns.
Summarization:
Summarization attempts to describe the largest amount of
information about a dataset as simply as possible. We are accustomed to
executive summaries that highlight the primary take-aways of a longer
document without getting into the details. Similarly, descriptive statistics
attempt to summarize the relationships between observations by measuring
their central tendency (mean, median), their dispersion (standard deviation),
the shape of their distribution (skewness), or the dependence of variables on
each other (correlation).
Aggregation:
Statistical summarization:
In this case, the three operations that will be directly reduced are
count, sum, and sum of squares. Therefore, this mapper emits, on a per-key
basis, a 1 for the count, the value for the summation, and the square of the
value for the sum of squares. The reducer uses the count and sum to compute
the mean, the minimum and maximum of the values to compute the range,
and the count, sum, and sum of squares to compute the standard deviation
as follows:
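A minimal MapReduce-style sketch of this idea (the per-record value layout is an assumption):

import math

def mapper(key, value):
    # per key: a count of 1, the value, and the squared value
    yield key, (1, value, value ** 2)

def reducer(key, records):
    count = total = squares = 0
    minimum, maximum = float('inf'), float('-inf')
    for c, v, sq in records:
        count += c
        total += v
        squares += sq
        minimum, maximum = min(minimum, v), max(maximum, v)
    mean = total / count
    stddev = math.sqrt((squares - total ** 2 / count) / count)
    yield key, (count, mean, maximum - minimum, stddev)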
In Spark, by contrast, we can't simply perform our final computation during
the aggregation; another map is needed to finalize the summarization across
the (much smaller) aggregated RDD.
The describe example provides a useful pattern for computing
multiple features simultaneously and returning them as a vector. This
pattern is reused often, particularly in the machine learning context, where
multiple procedures might be required in order to produce an instance to
train on (e.g., quadratic computations, normalization, imputation, joins, or
more specific machine learning tasks). Understanding the difference
between aggregation implementations in MapReduce versus Spark can
make a lot of difference in tracking down bugs and porting code from
MapReduce to Spark and vice versa.
Indexing:
Inverted index:
The search example shows the most common use case for an
inverted index: it quickly allows the search algorithm to retrieve the subset
of documents that it must rank and return without scanning every single
document. For example, for the query “running bear”, the index can be used
to look up the intersection of documents that contain the term “running” and
the term “bear”. A simple ranking system might then be employed to return
documents where the search terms are close together
rather than far apart in the document (though obviously modern search
ranking systems are far more complex than this).
To build the inverted index, we would use an identity reducer and the
following mapper (note that the same algorithm is easily implemented with Spark):
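A minimal sketch of such a mapper, assuming each input record is a (document ID, text) pair:

def inverted_index_mapper(doc_id, text):
    # emit (term, document ID) so the shuffle groups document IDs by term
    for term in text.lower().split():
        yield term, doc_id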
TF-IDF:
The TF-IDF score is the product of two terms: the first is the Term Frequency
(TF), computed as the number of times a word appears in a document divided
by the total number of words in that document; the second is the Inverse
Document Frequency (IDF), computed as the logarithm of the number of
documents in the corpus divided by the number of documents in which the
specific term appears.
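A minimal sketch of that computation for a single term/document pair:

import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    tf = term_count / doc_length               # term frequency
    idf = math.log(num_docs / docs_with_term)  # inverse document frequency
    return tf * idf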
Filtering:
Top n records:
The primary benefit of this methodology is that a complete sort does
not have to occur over the entire dataset. Instead, the mappers each sort their
own subset of the data, and the reducer sees only n times the number of
mappers worth of data.
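A minimal PySpark sketch of the pattern, assuming records is an RDD of (key, numeric value) pairs:

import heapq

n = 10
# each partition keeps only its local top n, so the final selection sees
# at most n records per partition
candidates = records.mapPartitions(
    lambda part: heapq.nlargest(n, part, key=lambda r: r[1])
).collect()
top_n = heapq.nlargest(n, candidates, key=lambda r: r[1])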
Bloom filtering:
In order to use a bloom filter, you will first have to build it.
Bloom filters work by applying several hashes to input data, then by setting
bits in a bit array according to the hash. Once the bit array is constructed, it
can be used to test membership by applying hashes to the test data and
seeing if the relevant bits are 1 or not. The bit array construction can either
be parallelized by using rules to map distinct values to a reducer that
constructs the bloom filter, or it can be a living, versioned data structure that
is maintained by other processes.
After reading our hashtags and Twitter handles from files on disk, our
bloom filter will be written to disk in a file called twitter.bloom.
To employ this in a Spark context:
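A minimal sketch, assuming twitter.bloom holds a pickled bloom filter object that supports membership tests with the in operator, a running SparkContext sc, and an RDD tweets of dictionaries with a 'handle' field:

import pickle

with open('twitter.bloom', 'rb') as f:
    bloom_filter = pickle.load(f)

# broadcast the filter so every executor can test membership locally
bloom = sc.broadcast(bloom_filter)

matched = tweets.filter(lambda tweet: tweet['handle'] in bloom.value)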
Bloom filters are potentially the most complex data structure that
you will use on a regular basis when performing analytics in Hadoop.
TOWARD LAST-MILE ANALYTICS
Fitting a Model:
2. Create an index of comments/commenters to blog post associated with
a timestamp.
3. Use the index to create instances for our model, where an instance is a
blog post and the comments in a 24-hour sliding window.
4. Join the instances with the primary text data (for both comments and
blog text).
5. Extract the features of each instance (e.g., number of comments in the
first 24 hours, the length of the blog post, bag of words features, day of
week, etc.).
6. Sample the instance features.
7. Build a linear model in memory using Scikit-Learn or Statsmodels.
8. Compute the mean squared error or coefficient of determination across
the entire dataset of instance features.
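A minimal sketch of fitting the in-memory linear model (step 7), assuming the sampled instances were written to a hypothetical tab-delimited file sample_instances.tsv whose first column is the target value:

import numpy as np
from sklearn.linear_model import LinearRegression

data = np.loadtxt('sample_instances.tsv', delimiter='\t')
y, X = data[:, 0], data[:, 1:]

clf = LinearRegression()
clf.fit(X, y)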
This snippet of code uses the np.loadtxt function to load our sample
data from disk, which in this case must be a tab-delimited file of instances
where the first column is the target value and the remaining columns are the
features. This type of output matches what might happen when key/value
pairs are written to disk from Spark or MapReduce, although you will have
to collect the data from the cluster into a single file, and ensure it is correctly
formatted.
Validating Models:
One option is to write the linear model properties, clf.coef_ (the coefficients)
and clf.intercept_ (the intercept), to disk and then load those parameters into
our MapReduce or Spark job and compute the error ourselves. However, this
requires us to implement a prediction function for every single model we may
want to use. Instead, we will use the pickle module to dump the model to disk,
then load it on every node in the cluster to make our predictions.
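A minimal sketch of that approach, assuming a fitted scikit-learn model clf, a running SparkContext sc, and an RDD instances whose rows have the target value first and the features after it:

import pickle

with open('model.pickle', 'wb') as f:
    pickle.dump(clf, f)

with open('model.pickle', 'rb') as f:
    model = sc.broadcast(pickle.load(f))

# squared error per instance, then the mean squared error over the dataset
errors = instances.map(
    lambda row: (model.value.predict([row[1:]])[0] - row[0]) ** 2
)
mse = errors.mean()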
*****
DATA MINING AND WAREHOUSING
Hive provides its own dialect of SQL called the Hive Query
Language, or HQL. HQL supports many commonly used SQL statements,
including data definition statements (DDLs), data manipulation statements
(DMSs), and data retrieval queries. Hive also supports integration of custom
user-defined functions, which can be written in Java or any language
supported by Hadoop Streaming, that extend the built-in functionality of
HQL.
Running the hive command will initiate the CLI, bootstrap the logger and
Hive history file, and finally display a Hive CLI prompt:
hive>
At any time, you can exit the Hive CLI using the following command:
hive> exit;
Hive can also run in non-interactive mode directly from the command line
by passing the filename option, -f, followed by the path to the script to
execute:
~$ hive -f ~/hadoop-fundamentals/hive/init.hql
~$ hive -f ~/hadoop-fundamentals/hive/top_50_players_by_homeruns.hql
>> ~/homeruns.tsv
You can view the full list of Hive options for the CLI by using the -H flag:
…..
Hive Query Language (HQL):
However, because Hive data is stored in the file system, usually in
HDFS or the local file system, the CREATE TABLE command also takes
an optional ROW FORMAT clause that tells Hive how to read each row in
the file and map it to our columns. For example, we could indicate that the
data is in a delimited file with fields delimited by the tab character:
To handle a more complex format such as Apache log records, we can use
the contributed RegexSerDe library to specify a regular expression with
which to deserialize and map the fields into columns for our table. We'll
need to manually add the hive-serde JAR from the lib folder to the current
Hive session in order to use the RegexSerDe package:
And now let's drop the apache_log table that we created previously, and re-
create it to use our custom serializer:
Table 10.1 Hive primitive data type
Loading data:
Hive does not perform any verification of the data for compliance
with the table schema, nor does it perform any transformations when loading
the data into a table. Data loading in Hive is done in a batch-oriented fashion
using a bulk LOAD DATA command or by inserting results from another
query with the INSERT command. To start,
let’s copy our Apache log data file to HDFS and then load it into the table
we created earlier:
You can verify that the apache.log file was successfully uploaded
to HDFS with the tail command:
Once the file has been uploaded to HDFS, return to the Hive CLI and use
the log_data database:
The INPATH clause takes an argument that is a path on the default
file system (in this case, HDFS). We can also specify a path on the local file
system by using LOCAL INPATH instead.
We can then easily perform other ad hoc queries on any of the other fields,
for example:
hive> SELECT host, count(1) AS count FROM apache_log GROUP BY
host ORDER BY count;
HBase
HBase stores its data as key/value pairs, where all table lookups
are performed via the table's row key, the unique identifier of the stored
record data. Data within a row is grouped into column families, which
consist of related columns. Visually, you can picture an HBase table that
holds census data for a given population, where each row represents a person
and is accessed via a unique row key, with a column family for personal
data, containing columns for name and address, and one for demographic
info, containing columns for birthdate and gender. This example is shown in
Figure 10.3.
Figure 10.3 Census data as an HBase schema
Storing data in columns rather than rows has particular benefits for
data warehouses and analytical databases where aggregates are computed
over large sets of data with potentially sparse values, where not all column
values are present. However, the actual columns that make up a row can be
determined and created on an as-needed basis. In fact, each row can have
a different set of columns. Figure 6-2 shows an example HBase table with
two rows, where the first row key utilizes three column families and the
second row key utilizes just one column.
Figure 10.5 HBase timestamp versioning
Generating a schema:
When designing schemas in HBase, it's important to think in terms
of the column-family structure of the data model and how it affects data
access patterns. Furthermore, because HBase doesn't support joins and
provides only a single indexed rowkey, we must be careful to ensure that
the schema can fully support all use cases. Often this involves
denormalization and data duplication with nested entities.
Row keys:
Inserting data with put:
Scan rows:
Filters:
HBase provides a number of filter classes that can be applied to
further filter the row data returned from a get or scan operation. These filters
can provide a much more efficient means of limiting the row data returned
by HBase and offloading the row-filtering operations from the client to the
server. Some of HBase’s available filters include:
• RowFilter: Used for data filtering based on row key values
• ColumnRangeFilter: Allows efficient intra-row scanning; can be used
to get a slice of the columns of a very wide row.
• SingleColumnValueFilter: Used to filter cells based on a column value.
• RegexStringComparator: Used to test if a given regular expression
matches a cell value in the column.
DATA INGESTION
While Sqoop works very well for bulk-loading data that already
resides in a relational database into Hadoop, many new applications and
systems involve fast-moving data streams like application logs, GPS
tracking, social media updates, and sensor data that we'd like to load
directly into HDFS to process in Hadoop. In order to handle and process the
high throughput of event-based data produced by these systems, we need
the ability to support continuous ingestion of data from multiple sources
into Hadoop.
Apache Flume was designed to efficiently collect, aggregate, and
move large amounts of log data from many different sources into a
centralized data store. While Flume is most often used to direct streaming
log data into Hadoop, usually HDFS or HBase, Flume data sources are
actually quite flexible and can be customized to transport many types of
event data, including network traffic data, social media-generated data,
and sensor data into any Flume-compatible consumer.
Before we proceed to run the sqoop import command, verify that HDFS
and YARN are started with the jps command:
Importing from MySQL to Hive:
Sqoop provides a couple of ways to do this: either importing the data
into HDFS first and then loading it into Hive using the LOAD DATA HQL
command in the Hive shell, or using Sqoop to directly create the tables
and load the relational database data into the corresponding tables in Hive.
Sqoop can generate a Hive table and load data based on the defined
schema and table contents from a source database, using the import
command. However, because Sqoop still actually utilizes MapReduce to
implement the data load operation, we must first delete any preexisting data
directory with the same output name before running the import tool:
Importing from MySQL to HBase:
HBase is designed to handle large volumes of data for a large
number of concurrent clients that need real-time access to row-level data.
Sqoop’s import tool allows us to import data from a relational database to
HBase. As with Hive, there are two approaches to importing this data. We
can import to HDFS first and then use the HBase CLI or API to load the
data into an HBase table, or we can use the --hbase-table option to instruct
Sqoop to directly import to a table in HBase.
INGESTING STREAMING DATA WITH FLUME
Flume sinks eventually read and remove events from the channel
and forward them to their next hop or final destination. Sinks can thus be
configured to write their output as a streaming source for another Flume
agent, or to a data store like HDFS or HBase.
Figure 10.8. Multi-agent Flume data flow
10.2.1 Pig:
Pig, like Hive, is an abstraction of MapReduce, allowing users to
express their data processing and analysis operations in a higher-level
language that then compiles into a MapReduce job. Pig is now a top-level
Apache Project that includes two main platform components:
• Pig Latin, a procedural scripting language used to express data flows.
• The Pig execution environment to run Pig Latin programs, which can
be run in local or MapReduce mode and includes the Grunt command-
line interface.
Pig Latin scripts start with data, apply transformations to the data
until the script describes the desired results, and execute the entire data
processing flow as an optimized MapReduce job. Additionally, Pig supports
the ability to integrate custom code with user-defined functions (UDFs) that
can be written in Java, Python, or JavaScript, among other supported
languages. Pig thus enables us to perform near arbitrary transformations and
ad hoc analysis on our big data using comparatively simple constructs.
Pig Latin:
The following script loads Twitter tweets with the hashtag
#unitedairlines over the course of a single week. The data file,
united_airlines_tweets.tsv, provides the tweet ID, permalink, date posted,
tweet text, and Twitter username. The script loads a dictionary,
dictionary.tsv, of known “positive” and “negative” words along with
sentiment scores (1 and -1, respectively) associated to each word. The script
then performs a series of Pig transformations to generate a sentiment score
and classification, either POSITIVE or NEGATIVE, for each computed
tweet:
Data Types in Pig:
Table 10.1. Pig scalar types
User-Defined Functions:
Wrapping Up:
Pig can be a powerful tool for users who prefer a procedural
programming model. It provides the ability to control data checkpoints in
the pipeline, as well as fine-grained controls over how the data is processed
at each step. This makes Pig a great choice when you require more
flexibility in controlling the sequence of operations in a data
flow (e.g., an extract, transform, and load, or ETL, process), or when you are
working with semi-structured data that may not lend itself well to Hive’s
SQL syntax.
• A unified programming interface that includes several built-in higher-
level libraries to support a broad range of data processing tasks,
including complex interactive analysis, structured querying, stream
processing, and machine learning.
Spark SQL:
Spark SQL is a module in Apache Spark that provides a relational
interface to work with structured data using familiar SQL-based operations
in Spark. It can be accessed through JDBC/ODBC connectors, a built-in
interactive Hive console, or via its built-in APIs. The last method of access
is the most interesting and powerful aspect of Spark SQL; because Spark
SQL actually runs as a library on top of Spark's Core engine and APIs, we
can access the Spark SQL API using the same programming interface that
we use for Spark's RDD APIs, as shown in Figure 10.9.
Figure 10.9. Spark SQL interface
Let’s write a simple program that uses the Spark SQL API to load
JSON data and query it. You can enter these commands directly in a running
pyspark shell or in a Jupyter notebook that is using a pyspark kernel; in
either case, ensure that you have a running SparkContext, which we’ll
assume is referenced by the variable sc.
parking = sqlContext.read.json('../data/sf_parking/sf_parking_clean.json')
parking.registerTempTable("parking")
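A minimal sketch of querying the registered table (the column name is an assumption, and sqlContext is assumed to have been created from sc, for example with SQLContext(sc)):

results = sqlContext.sql("""
    SELECT neighborhood, COUNT(*) AS lot_count
    FROM parking
    GROUP BY neighborhood
    ORDER BY lot_count DESC
    LIMIT 10
""")
results.show()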
DataFrames:
DataFrames are the underlying data abstraction in Spark SQL. The
data frame concept should be very familiar to users of Python’s Pandas or
R, and in fact, Spark’s DataFrames are interoperable with native Pandas
(using pyspark) and R data frames (using SparkR). In Spark, a DataFrame
also represents a tabular collection of data with a defined schema. The key
difference between a Spark DataFrame and a dataframe in Pandas or R is
that a Spark DataFrame is a distributed collection that actually wraps an
RDD; you can think of it as an RDD of row objects.
Example of chaining several simple DataFrame operations:
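A minimal sketch of such a chain on the parking DataFrame (the column names are assumptions):

from pyspark.sql import functions as F

(parking
    .groupBy('neighborhood')
    .agg(F.count('*').alias('lots'),
         F.round(F.avg('spaces'), 1).alias('avg_spaces'))
    .orderBy('avg_spaces', ascending=False)
    .show(10))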
The advantage of this approach over raw SQL is that we can easily
iterate on a complex query by successively chaining and testing operations.
Additionally, we have access to a rich collection of built-in functions from
the DataFrames API, including the count, round, and avg aggregation
functions that we used previously. The pyspark.sql.functions module also
contains several mathematical and statistical utilities that include functions
for:
• Random data generation
• Summary and descriptive statistics
• Sample covariance and correlation
• Cross tabulation (a.k.a. contingency table)
• Frequency computation
• Mathematical functions
QUESTIONS
*****