Notes - KCS 061 Big Data Unit 1
Analytics types:-
Descriptive Analytics:- aims to answer - What has happened?
o For example, computing the total number of likes for a particular post, computing the average
monthly rainfall or finding the average number of visitors per month on a website. What was
the sales volume over the past 12 months? What is the number of support calls received as
categorized by severity and geographic location? What is the monthly commission earned by
each sales agent?
o Analysing past data to describe patterns in the data and present it in a summarized form, using statistical functions such as counts, maximum, minimum, mean, top-N and percentages (a small Java sketch of such summary statistics appears after this list).
Diagnostic analytics: - Aims to answer - Why did it happen?
o Analysis of past data to diagnose the reasons as to why certain events happened.
o Example: - a system that collects and analyses sensor data from machines for monitoring their
health and predicting failures. Why were Q2 sales less than Q1 sales? Why have there been
more support calls originating from the Eastern region than from the Western region? Why
was there an increase in patient re-admission rates over the past three months?
o Descriptive analytics can be useful for summarizing the data by computing various statistics. Diagnostic analytics can provide more insight into why a certain fault has occurred, based on the patterns in the sensor data for previous faults.
Predictive analytics: - Aims to answer - What is likely to happen?
o Predicting the occurrence of an event or the likely outcome of an event or forecasting the
future values using prediction models.
o Predictive analytics is done using predictive models which are trained on existing data. These models learn patterns and trends from the existing data and predict the occurrence of an event or the likely outcome of an event (classification models) or forecast numbers (regression models).
Prescriptive Analytics: - Aims to answer - What can we do to make it happen?
o Prescriptive analytics uses multiple prediction models to predict various outcomes and the best course of action for each outcome.
o It prescribes actions, or the best option to follow, from the available options. Example: suggesting the best mobile data plan for a customer based on the customer’s browsing patterns.
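As a small illustration of the summary statistics used in descriptive analytics, the Java sketch below computes the count, minimum, maximum and mean of a set of monthly website-visitor figures (the class name and the numbers are made-up example values, not from the source):

import java.util.Arrays;
import java.util.IntSummaryStatistics;

public class DescriptiveStatsExample {

    public static void main(String[] args) {
        // Illustrative monthly visitor counts for a website
        int[] monthlyVisitors = {1200, 950, 1340, 1100, 1275, 990};

        // Count, minimum, maximum and average computed in a single pass
        IntSummaryStatistics stats = Arrays.stream(monthlyVisitors).summaryStatistics();

        System.out.println("Months observed : " + stats.getCount());
        System.out.println("Minimum visitors: " + stats.getMin());
        System.out.println("Maximum visitors: " + stats.getMax());
        System.out.println("Average visitors: " + stats.getAverage());
    }
}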
Characteristics of Big Data:-
Volume: - Big data is a form of data whose volume is so large that it would not fit on a single machine; therefore, specialized tools and frameworks are required to store, process and analyse such data. There is no fixed threshold for the volume of data to be considered big data; however, the term big data is typically used for massive-scale data that is difficult to store, manage and process using traditional databases and data processing architectures.
Velocity of data refers to how fast the data is generated. Data generated by certain sources can arrive
at very high velocities, for example, social media data or sensor data. Specialized tools are required to
ingest such high velocity data into the big data infrastructure and analyse the data in real-time.
Variety refers to the forms of the data. Big data comes in different forms such as structured, unstructured or semi-structured data, including text, image, audio, video and sensor data.
Veracity refers to how accurate the data is. To extract value from the data, it needs to be cleaned to remove noise. Data-driven applications can reap the benefits of big data only when the data is meaningful and accurate.
Value of data refers to the usefulness of data for the intended purpose. The value of the data is also
related to the veracity or accuracy of the data. For some applications value also depends on how fast
we are able to process the data.
Figure 1-1: IBM characterizes Big Data by its volume, velocity, and variety, or simply, V3. (Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, Paul C. Zikopoulos et al.)
Figure: Big Data Analytics Flow (Big Data Science & Analytics: A Hands-On Approach, 1st ed.)
Datasets
Collections or groups of related data are generally referred to as datasets. Each group or dataset
member (datum) shares the same set of attributes or properties as others in the same dataset.
Examples:- tweets stored in a flat file, a collection of image files in a directory, an extract of rows from
a database table stored in a CSV formatted file, historical weather observations that are stored as XML
files.
Data Analysis
Data analysis is the process of examining data to find facts, relationships, patterns, insights and/or
trends.
The overall goal of data analysis is to support better decision making.
Example - analysis of ice cream sales data in order to determine how the number of ice cream cones
sold is related to the daily temperature. The results of such an analysis would support decisions related
to how much ice cream a store should order in relation to weather forecast information.
Data Analytics
Data analytics is a discipline that includes the management of the complete data lifecycle, which
encompasses collecting, cleansing, organizing, storing, analyzing and governing data.
Different kinds of organizations use data analytics tools and techniques in different ways.
o In business-oriented environments, data analytics results can lower operational costs and
facilitate strategic decision-making.
o In the scientific domain, data analytics can help identify the cause of a phenomenon to
improve the accuracy of predictions.
o In service-based environments like public sector organizations, data analytics can help
strengthen the focus on delivering high-quality services by driving down costs.
Big data architecture refers to the logical and physical structure that dictates how high volumes of
data are ingested, processed, stored, managed, and accessed.
Big data architecture is the foundation for big data analytics.
The big data architecture framework serves as a reference blueprint for big data infrastructures and
solutions, logically defining how big data solutions will work, the components that will be used, how
information will flow, and security details.
References:-
https://builtin.com/big-data/big-data-examples-applications
Arshdeep Bahga and Vijay Madisetti. 2016. Big Data Science & Analytics: A Hands-On Approach (1st ed.). VPT.
Thomas Erl, Wajid Khattak and Paul Buhler. Big Data Fundamentals: Concepts, Drivers & Techniques (The Pearson Service Technology Series from Thomas Erl).
https://www.omnisci.com/technical-glossary/big-data-architecture
Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch and George Lapis. Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data.
HADOOP
Hadoop (http://hadoop.apache.org/) is a top-level Apache project in the Apache Software Foundation
that’s written in Java.
Hadoop is a computing environment built on top of a distributed clustered file system that was
designed specifically for very large-scale data operations.
Hadoop was inspired by Google’s work on its Google (distributed) File System (GFS).
Hadoop uses the MapReduce programming paradigm, in which work is broken down into mapper and reducer tasks that manipulate data stored across a cluster of servers for massive parallelism.
Hadoop is designed to scan through large data sets to produce its results through a highly scalable,
distributed batch processing system.
Hadoop is actually the name that creator Doug Cutting’s son gave to his stuffed toy elephant. In
thinking up a name for his project, Cutting was apparently looking for something that was easy to say
and stands for nothing in particular, so the name of his son’s toy seemed to make perfect sense.
Hadoop is generally seen as having two parts:
o a file system (the Hadoop Distributed File System)
o and a programming paradigm (MapReduce)
One of the key components of Hadoop is the redundancy built into the environment.
o data redundantly stored in multiple places across the cluster
o programming model - failures are expected and resolved automatically by running portions of
the program on various servers in the cluster.
o It is well known that commodity hardware components will fail (especially when you have
very large numbers of them), but this redundancy provides fault tolerance and a capability for
the Hadoop cluster to heal itself.
o scale out workloads across large clusters of inexpensive machines to work on Big Data
problems.
Hadoop-related projects
o Apache Avro (for data serialization),
o Cassandra and HBase (databases),
o Chukwa (a monitoring system specifically designed with large distributed systems in mind),
o Hive (provides ad hoc SQL-like queries for data aggregation and summarization),
o Mahout (a machine learning library),
o Pig (a high-level Hadoop programming language that provides a data-flow language and
execution framework for parallel computation),
o ZooKeeper (provides coordination services for distributed applications),
o and more.
Components of Hadoop
The Hadoop project comprises three pieces:
o Hadoop Distributed File System (HDFS),
o Hadoop MapReduce model,
o and Hadoop Common.
Hadoop Distributed File System
o Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and distributed
throughout the cluster. Copies of these blocks are stored on other servers in the Hadoop
cluster.
o an individual file is actually stored as smaller blocks that are replicated across multiple servers
in the entire cluster.
o the map and reduce functions can be executed on smaller subsets of your larger data sets,
and this provides the scalability that is needed for Big Data processing.
o Hadoop uses commonly available servers in a very large cluster, where each server has a set of inexpensive internal disk drives (a small Java sketch of reading and writing files in HDFS follows the figure below).
o MapReduce tries to assign workloads to these servers where the data to be processed is
stored. (Data Locality)
Figure:- example of how data blocks are written to HDFS. Notice how (by default) each block is written three times and at least one block
is written to a different server rack for redundancy. (Understanding Big Data Analytics, Paul C. Zikopoulos)
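HDFS can be used programmatically through the org.apache.hadoop.fs.FileSystem API. The sketch below is a minimal illustrative example (the HDFS path and the file contents are made up for this example): it writes a small file to HDFS and reads it back, while block splitting and replication happen transparently behind the API.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteExample {

    public static void main(String[] args) throws Exception {
        // The Configuration picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/data/cities.txt");  // illustrative HDFS path

        // Write a small file; HDFS splits it into blocks and replicates them for us
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeBytes("Toronto, 20\nRome, 33\n");
        }

        // Read the file back; the client does not need to know where the blocks live
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }

        fs.close();
    }
}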
MapReduce
MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster.
The term MapReduce actually refers to two separate and distinct tasks that Hadoop programs
perform.
o map job
o reduce job
The first is the map job, which takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples into a smaller
set of tuples.
As the sequence of the name MapReduce implies, the reduce job is always performed after the map
job.
Example
o Five files, where each file contains two columns (a key and a value in Hadoop terms).
o Here the key represents a city and the value represents the corresponding temperature recorded in that city on various measurement days.
o Find the maximum temperature for each city across all of the data files.
o The following snippet shows a sample of the data from one of the test files:
Toronto, 20
Whitby, 25
New York, 22
Rome, 32
Toronto, 4
Rome, 33
New York, 18
o The MapReduce framework can break this down into five map tasks, where each mapper works on one of the five files; the mapper task goes through the data and returns the maximum temperature for each city in that file.
o For the file shown above, the mapper would produce the intermediate results (Toronto, 20), (Whitby, 25), (New York, 22) and (Rome, 33); the other four mappers (working on files not shown here) would each produce a similar per-file list of city maxima.
o All five of these output streams would be fed into the reduce tasks, which combine the input results and output a single value for each city: the overall maximum temperature recorded for that city across all five files.
Mapper Code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapClass extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Text city = new Text();
    private final IntWritable temperature = new IntWritable();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        // Each input line looks like "Toronto, 20"
        String line = value.toString();
        String[] parts = line.split(",");

        if (parts.length == 2) {
            city.set(parts[0].trim());
            temperature.set(Integer.parseInt(parts[1].trim()));
            // Emit (city, temperature) as the intermediate key/value pair
            context.write(city, temperature);
        }
    }
}
Reducer Code
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // Find the highest temperature reported for this city across all mappers
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }

        context.write(key, new IntWritable(max));
    }
}
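A driver class is also needed to configure and submit the job for the mapper and reducer above. The sketch below is a minimal illustrative example using the standard org.apache.hadoop.mapreduce.Job API; the class name MaxTemperatureDriver and the command-line input/output paths are assumptions, not from the source.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureDriver {

    public static void main(String[] args) throws Exception {
        // args[0] = HDFS input directory, args[1] = HDFS output directory
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "max temperature per city");

        job.setJarByClass(MaxTemperatureDriver.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(ReduceClass.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}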
Hive
Hive, a runtime Hadoop support structure developed at Facebook, allows anyone who is already fluent with SQL to leverage the Hadoop platform.
Hive allows SQL developers to write Hive Query Language (HQL) statements that are similar to
standard SQL statements.
HQL statements are broken down by the Hive service into MapReduce jobs and executed across a
Hadoop cluster.
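As an illustration only, the sketch below submits an HQL statement from Java through the HiveServer2 JDBC driver; the connection URL, the credentials and the readings(city, temperature) table are assumptions made for this example, and the hive-jdbc driver jar is assumed to be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {

    public static void main(String[] args) throws Exception {
        // Assumed HiveServer2 endpoint and database; adjust for the actual cluster
        String url = "jdbc:hive2://localhost:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HQL looks like standard SQL; Hive turns it into MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT city, MAX(temperature) FROM readings GROUP BY city");

            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
            }
        }
    }
}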
Jaql
Jaql, developed by IBM, is primarily a query language for JavaScript Object Notation (JSON) that allows processing of both structured and non-traditional data.
Jaql allows you to select, join, group, and filter data that is stored in HDFS.
Jaql’s query language was inspired by many programming and query languages, including Lisp, SQL, XQuery, and Pig.
ZooKeeper
Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud
applications.
ZooKeeper is essentially a service for distributed systems offering a hierarchical key-value store, which
is used to provide a distributed configuration service, synchronization service, and naming registry for
large distributed systems.
ZooKeeper is an open source Apache project that provides a centralized infrastructure and services
that enable synchronization across a cluster.
ZooKeeper maintains common objects needed in large cluster environments.
Examples of these objects include configuration information, hierarchical naming space, and so on.
Applications can leverage these services to coordinate distributed processing across large clusters.
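A minimal illustrative sketch of the ZooKeeper Java client is shown below; the ensemble address, the session timeout and the /batch-size znode are assumptions made for this example. It stores a piece of shared configuration in ZooKeeper's hierarchical key-value store and reads it back, which is the kind of coordination object described above.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperConfigExample {

    public static void main(String[] args) throws Exception {
        // Connect to an assumed ZooKeeper ensemble with a 3-second session timeout
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // Store a shared configuration value under a znode
        String path = "/batch-size";
        if (zk.exists(path, false) == null) {
            zk.create(path, "128".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Any process in the cluster can read the same value through the same path
        byte[] data = zk.getData(path, false, null);
        System.out.println("batch-size = " + new String(data));

        zk.close();
    }
}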
HBase
HBase is a column-oriented database management system that runs on top of HDFS.
Unlike relational database systems, HBase does not support a structured query language like SQL.
An HBase system comprises a set of tables. Each table contains rows and columns, much like a
traditional database.
Each table must have an element defined as a primary key (the row key in HBase), and all access attempts to HBase tables must use this key.
An HBase column represents an attribute of an object; for example, if the table is storing diagnostic logs from servers in your environment, where each row might be a log record, a typical column in such a table would be the timestamp of when the log record was written, or the server name where the record originated.
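A minimal sketch of the HBase Java client API follows; the server_logs table, the log column family and the column names are illustrative, not from the source. Note how both the write (Put) and the read (Get) go through the row key.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLogExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("server_logs"))) {

            // The row key plays the role of the primary key described above
            byte[] rowKey = Bytes.toBytes("web01-2016-03-01-00001");

            // Write one log record: each column lives in the "log" column family
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("timestamp"),
                    Bytes.toBytes("2016-03-01T10:15:00Z"));
            put.addColumn(Bytes.toBytes("log"), Bytes.toBytes("servername"),
                    Bytes.toBytes("web01"));
            table.put(put);

            // All reads also go through the row key
            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("log"), Bytes.toBytes("servername"))));
        }
    }
}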
Text/CSV -
o A text file is the most basic and human-readable file format. It can be read or written in any programming language and is usually delimited by a comma or a tab.
o The text file format consumes more space when a numeric value needs to be stored as a
string. It is also difficult to represent binary data such as an image.
o A plain text file or CSV is the most common format both outside and within the Hadoop
ecosystem.
o The disadvantage of this format is that it does not support block compression, so compressing a CSV file in Hadoop can impose a high cost when reading.
o The plain text format or CSV would only be recommended in case of extractions of data
from Hadoop or a massive data load from a file.
SequenceFile –
o The SequenceFile format stores the data in binary format.
o The sequencefile format can be used to store an image in the binary format.
o They store key-value pairs in a binary container format and are more efficient than a text file. However, sequence files are not human-readable (a small write sketch follows this list).
o This format supports compression; however, it does not store metadata, and the only option for evolving its schema is to add new fields at the end.
o This is usually used to store intermediate data in the input and output of MapReduce
processes.
o The SequenceFile format is recommended in case of storing intermediate data in
MapReduce jobs.
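As a small illustration of the format (the output path and the sample records are made up for this example), the sketch below writes Text/IntWritable key-value pairs into a SequenceFile using the Hadoop API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/data/temperatures.seq");  // illustrative HDFS path

        // The writer options declare the key and value classes stored in the binary container
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {

            writer.append(new Text("Toronto"), new IntWritable(20));
            writer.append(new Text("Rome"), new IntWritable(33));
        }
    }
}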
Parquet -
o Parquet is a column-oriented (columnar) binary storage format that can store nested data structures.
o This format is very efficient in terms of disk input/output operations when the necessary columns are specified.
o This format is highly optimized for use with Cloudera Impala.
o Parquet is a columnar format developed by Cloudera and Twitter.
o It is supported in Spark, MapReduce, Hive, Pig, Impala, Crunch, and so on.
o Parquet file format uses advanced optimizations described in Google’s Dremel paper.
These optimizations reduce the storage space and increase performance.
o This Parquet file format is considered the most efficient for adding multiple records at a
time. Some optimizations rely on identifying repeated patterns.
Hadoop – Streaming
Hadoop streaming is a utility that comes with the Hadoop distribution.
It enables you to create and run MapReduce jobs with scripts in any language, Java or non-Java, as the mapper and/or the reducer.
The Hadoop MapReduce framework is written in Java and, by default, supports writing map/reduce programs in Java only.
However, Hadoop provides an API for writing MapReduce programs in languages other than Java.
Hadoop Streaming is the utility that allows us to create and run MapReduce jobs with any script or
executable as the mapper or the reducer.
It uses Unix streams as the interface between Hadoop and our MapReduce program, so any language that can read from standard input and write to standard output can be used to write the MapReduce program.
Hadoop Streaming supports the execution of both Java and non-Java MapReduce jobs over the Hadoop cluster.
It supports the Python, Perl, R, PHP, and C++ programming languages.
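Streaming mappers and reducers are usually written in scripting languages such as Python or Perl; to stay in the same language as the rest of these notes, the sketch below shows the same stdin/stdout contract in Java, reusing the city/temperature data from the MapReduce example (the class name is illustrative). Any executable that reads records from standard input and writes tab-separated key/value pairs to standard output can be plugged in as a Streaming mapper or reducer.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class StreamingMaxTempMapper {

    public static void main(String[] args) throws Exception {
        // Hadoop Streaming feeds input records to this process on standard input
        BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));

        String line;
        while ((line = in.readLine()) != null) {
            // Input lines look like "Toronto, 20" (as in the earlier example)
            String[] parts = line.split(",");
            if (parts.length == 2) {
                // Emit key<TAB>value on standard output (the Streaming default format)
                System.out.println(parts[0].trim() + "\t" + parts[1].trim());
            }
        }
    }
}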
HADOOP PIPES –
Hadoop Pipes is the name of the C++ interface to Hadoop MapReduce.
Unlike Streaming, which uses standard input and output to communicate with the map and reduce
code, Pipes uses sockets as the channel over which the tasktracker communicates with the process
running the C++ map or reduce function.
The application links against the Hadoop C++ library, which is a thin wrapper for communicating with
the tasktracker child process.
The map and reduce functions are defined by extending the Mapper and Reducer classes defined in
the HadoopPipes namespace and providing implementations of the map() and reduce() methods in
each case.
These methods take a context object (of type MapContext or ReduceContext), which provides the
means for reading input and writing output, as well as accessing job configuration information via the
JobConf class.
Unlike the Java interface, keys and values in the C++ interface are byte buffers, represented as Standard Template Library (STL) strings. This makes the interface simpler, although it does put a slightly greater burden on the application developer, who has to convert to and from richer domain-level types. This is evident in MaxTemperatureReducer, where we have to convert the input value into an integer (using a convenience method in HadoopUtils) and then convert the maximum value back into a string before it is written out. In some cases, we can save on doing the conversion, such as in MaxTemperatureMapper, where the airTemperature value is never converted to an integer since it is never processed as a number in the map() method.
The main() method is the application entry point. It calls HadoopPipes::runTask, which connects to the
Java parent process and marshals data to and from the Mapper or Reducer.
The runTask() method is passed a Factory so that it can create instances of the Mapper or Reducer.
Which one it creates is controlled by the Java parent over the socket connection.
There are overloaded template factory methods for setting a combiner, partitioner, record reader, or
record writer.
HADOOP ECOSYSTEM
Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.
It includes a number of services for ingesting, storing, analyzing and maintaining data.
A number of Hadoop components (such as HDFS, MapReduce, Hive, Pig, HBase and ZooKeeper, described above) together form the Hadoop ecosystem.
References:-
Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, Paul C. Zikopoulos, Chris Eaton, Dirk deRoos, Thomas Deutsch, George Lapis.
https://en.wikipedia.org/wiki/Apache_ZooKeeper
https://blog.bi-geek.com/en/formatos-de-ficheros-en-hadoop/
http://hadoop.apache.org/docs/r1.2.1/streaming.html
https://data-flair.training/blogs/hadoop-streaming/
https://www.wisdomjobs.com/e-university/hadoop-tutorial-484/hadoop-pipes-14765.html
https://www.edureka.co/blog/hadoop-ecosystem