Bigdata Notes
DIGITAL DATA
Digital data is information stored on a computer system as a series of 0’s and 1’s
in binary form. Digital data jumps from one value to the next in a step-by-step
sequence.
Example: Whenever we send an email, read a social media post, or take pictures
with our digital camera, we are working with digital data.
a. Unstructured Data: The data which does not conform to a data model or is not
in a form that can be used easily by a computer program is categorized as
unstructured data. About 80-90% of an organization's data is in this format.
b. Semi-Structured Data: The data which does not conform to a data model but
has some structure is categorized as semi-structured data. However, it is not in a
form that can be used easily by a computer program.
Example : Emails, XML, markup languages like HTML, etc. Metadata for this
data is available but is not sufficient.
c. Structured Data: The data which is in an organized form (i.e. in rows and
columns) and can be easily used by a computer program is categorized as
structured data. Relationships exist between entities of data, such as
classes and their objects.
IT has become an integral part of daily life as well as of various industries like
health, education, entertainment, science and technology, genetics, and business
operations. These industries generate a lot of data, and this data can be called Big
Data.
Big Data consists of large datasets that cannot be managed efficiently by the
common database management systems.
Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and
social networking platforms create huge amounts of data that may reside
unutilized at unknown servers for many years.
And with the evolution of Big Data, this data can be accessed and analyzed on a
regular basis to generate useful information.
“Big Data” is a relative term depending on who is discussing it. For Example, Big
Data to Amazon or Google is very different from Big Data to a medium-sized
insurance organization.
A big data platform is a type of IT solution that combines the features and
capabilities of several big data applications and utilities within a single solution,
this is then used further for managing as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive
datasets.
The users of such platforms can custom-build applications according to their use
case, for example to calculate customer loyalty (an e-commerce use case), and so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability,
Performance, and Security.
Example: Some of the most commonly used Big Data Platforms are :
Big Data has quickly risen to become one of the most desired topics in the
industry.
The main business drivers for such rising demand for Big Data Analytics are :
Example: A number of companies that have Big Data at the core of their strategy,
such as Apple, Amazon, Facebook and Netflix, became very successful at the
beginning of the 21st century.
Data sources: All big data solutions start with one or more data sources.
Example,
Data storage: Data for batch processing operations is stored in a distributed file
store that can hold high volumes of large files in various formats (also called data
lake).
Example,
Batch processing: Because the data sets are so large, a big data solution
must process data files using long-running batch jobs to filter, aggregate, and
prepare the data for analysis.
Real-time message ingestion: If a solution includes real-time sources, the
architecture must include a way to capture and store real-time messages for
stream processing.
Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using
analytical tools. Example: Azure Synapse Analytics provides a managed service
for large-scale, cloud-based data warehousing.
Analysis and reporting: The goal of most big data solutions is to provide insights
into the data through analysis and reporting. To empower users to analyze the
data, the architecture may include a data modelling layer. Analysis and reporting
can also take the form of interactive data exploration by data scientists or data
analysts.
● Volume
● Variety
● Velocity
5 Vs of Big Data, Big Data technology components
5 Vs of Big Data :
1. Volume :
Big Data is a vast “volumes” of data generated from many sources daily, such as
business processes, machines, social media platforms, networks, human
interactions, and so on.
2. Variety :
Big Data can be structured, unstructured, and semi-structured that are being
collected from different sources.
Data was only collected from databases and spreadsheets in the past, but these days
data comes in an array of forms, i.e. PDFs, emails, audio, social media
posts, photos, videos, etc.
3. Velocity :
Velocity refers to the speed at which incoming data sets arrive, their rate of
change, and bursts of activity.
4. Veracity :
Veracity refers to the quality and trustworthiness of the data, since data collected
from so many sources can be messy, incomplete, or inconsistent.
5. Value :
It is not the data that we process or store that matters in itself; it is the valuable
and reliable data that we actually store, process, and analyze that delivers value.
1. Ingestion :
The ingestion layer is the very first step of pulling in raw data.
Data can arrive in two modes: Batch, in which large groups of data are gathered and
delivered together, and Streaming, in which data is passed along continuously as it
is generated.
2. Storage :
Storage is where the converted data is stored in a data lake or warehouse and
eventually processed.
3. Analysis :
In the analysis layer, data gets passed through several tools, shaping it into
actionable insights.
4. Consumption :
The final big data component is presenting the information in a format digestible
to the end-user.
This can be in the forms of tables, advanced visualizations and even single
numbers if requested.
The most important thing in this layer is making sure the intent and meaning of
the output is understandable.
The importance of Big Data doesn't revolve around the amount of data a company has
but lies in how the company utilizes the gathered data.
Every company uses its collected data in its own way. The more effectively a
company uses its data, the more rapidly it grows.
By analysing their big data pools effectively, companies can get answers to :
Cost Savings :
o Some tools of Big Data like Hadoop can bring cost advantages to business
when large amounts of data are to be stored.
Time Reductions :
o The high speed of tools like Hadoop and in-memory analytics makes it easy to
identify new sources of data, which helps businesses analyze data immediately.
o Therefore, you can get feedback about who is saying what about your
company.
o If you want to monitor and improve the online presence of your business,
big data tools can help with all of this.
o No single business can claim success without first having to establish a solid
customer base.
o If a business is slow to learn what customers are looking for, then it is very
likely to deliver poor quality products.
Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing
Insights :
In today’s world big data have several applications, some of them are listed
below :
In big retail stores, the management team has to keep data about customers'
spending habits, shopping behaviour, most-liked products, and which products are
being searched/sold the most; based on that data, the production/collection rate of
those products gets fixed.
Recommendation :
All such data are analyzed, and routes that are jam-free, or less congested and
less time-consuming, are recommended.
These sensors capture data like the speed of flight, moisture, temperature, and
other environmental conditions.
In the various spots of the car camera, a sensor is placed that gathers data like
the size of the surrounding car, obstacle, distance from those, etc.
These data are being analyzed, then various calculations are carried out.
Big data analysis helps virtual personal assistant tools like Siri, Cortana and
Google Assistant to provide the answer to the various questions asked by users.
This tool tracks the location of the user, their local time, season, other data
related to questions asked, etc.
IoT :
Analyzing such data, it can be predicted how long a machine will work without
any problem and when it will require repair.
Data like what type of videos and music users are watching or listening to the most,
and how long users are spending on the site, are collected and analyzed to set up
personalized recommendations.
Auditors can use big data to expand the scope of their projects and draw
comparisons over larger populations of data.
Big data also helps financial auditors to streamline the reporting process and
detect fraud.
These professionals can identify business risks in time and conduct more
relevant and accurate audits.
That's why data privacy is there to protect not only those customers but also
companies and their employees from security breaches.
Big Data Analytics enables enterprises to analyze their data in full context quickly
and some also offer real-time analysis.
Organizations use big data analytics systems and software to make data-driven
decisions that can improve business-related outcomes and give them an edge over
rivals.
Big Data Analytics tools also help businesses save time and money and aid in
gaining insights to inform data-driven decisions.
Big Data Analytics enables enterprises to narrow their Big Data to the most
relevant information and analyze it to inform critical business decisions.
Intelligent Data Analysis (IDA) is one of the most important approaches in the
field of data mining.
Based on the basic principles of IDA and the features of datasets that IDA
handles, the development of IDA is briefly summarized from three aspects :
● Algorithm principle
● The scale
● Type of the dataset
Intelligent Data Analysis (IDA) is one of the major topics in artificial intelligence
and information science.
Intelligent data analysis discloses hidden facts that are not known previously and
provides potentially important information or facts from large quantities of data.
These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to bring
more efficiency in their work environment.
Many big data tools and processes are being utilised by companies these days in
the processes of discovering insights and supporting decision making.
Below is the list of some of the data analytics tools used most in the industry :
Analysis vs reporting
Reporting :
● Once data is collected, it will be organized using tools such as graphs and
tables.
● The process of organizing this data is called reporting.
● Reporting translates raw data into information.
● Reporting helps companies to monitor their online business and be alerted
when data falls outside of expected ranges.
● Good reporting should raise questions about the business from its end
users.
Analysis :
● Analytics is the process of taking the organized data and analyzing it.
● This helps users to gain valuable insights on how businesses can improve
their performance.
● Analysis transforms data and information into insights.
● The goal of the analysis is to answer questions by interpreting the data at a
deeper level and providing actionable recommendations.
Conclusion :
● These days, organizations are realising the value they get out of big data
analytics and hence they are deploying big data tools and processes to
bring more efficiency to their work environment.
● Many big data tools and processes are being utilised by companies these
days in the processes of discovering insights and supporting decision
making.
● Data Analytics tools are types of application software that retrieve data
from one or more systems and combine it in a repository, such as a data
warehouse, to be reviewed and analysed.
● Most organizations use more than one analytics tool including
spreadsheets with statistical functions, statistical software packages, data
mining tools, and predictive modelling tools.
● Together, these Data Analytics Tools give the organization a complete
overview of the company to provide key insights and understanding of the
market/business so smarter decisions may be made.
● Data analytics tools not only report the results of the data but also explain
why the results occurred to help identify weaknesses, fix potential problem
areas, alert decision-makers to unforeseen events and even forecast future
results based on decisions the company might make.
● Below is a list of some data analytics tools :
● R Programming (Leading Analytics Tool in the industry)
● Python
● Excel
● SAS
● Apache Spark
● Splunk
● RapidMiner
● Tableau Public
● KNIME
UNIT 2
HADOOP
History of Hadoop :
Hadoop is an open-source framework written in Java.
The traditional approach like RDBMS is not sufficient due to the heterogeneity of
the data.
So Hadoop comes as the solution to the problem of big data i.e. storing and
processing the big data with some extra capabilities.
Its co-founder Doug Cutting named it after his son’s toy elephant.
Apache Hadoop :
It is written in Java.
Applications built using HADOOP are run on large data sets distributed across
clusters of commodity computers.
Commodity computers are cheap and widely available; they are useful for
achieving greater computational power at a low cost.
The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on
commodity hardware.
Commodity hardware is cheap and widely available; it is useful for
achieving greater computational power at a low cost.
Hadoop Common: These are Java libraries and utilities required by other
Hadoop modules.
Hadoop YARN: This is a framework for job scheduling and cluster resource
management.
Hadoop Distributed File System (HDFS): This splits files into blocks and sends them
across various nodes in the form of large clusters.
Hadoop MapReduce: This is a YARN-based system for parallel processing of large
data sets.
Hadoop YARN –
YARN helps to open up Hadoop by allowing data stored in HDFS to be processed by
engines for batch processing, stream processing, interactive processing and graph
processing.
DATA FORMAT :
The big problem in the performance of applications that use HDFS is the
information search time and the writing time.
The choice of an appropriate file format can produce the following benefits:
Some of the most commonly used formats of the Hadoop ecosystem are :
● Text/CSV: A plain text file or CSV is the most common format both outside and
within the Hadoop ecosystem.
● SequenceFile: The SequenceFile format stores the data in binary format; this
format accepts compression but does not store metadata (a short writer sketch
follows this list).
● Avro: Avro is a row-based storage format. This format includes the definition of
the scheme of your data in JSON format. Avro allows block compression along
with its divisibility, making it a good choice for most cases when using Hadoop.
● RCFile (Record Columnar File): RCFile is a columnar format that divides data
into groups of rows, and inside it, data is stored in columns.
● ORC (Optimized Row Columnar): ORC is considered an evolution of the
RCFile format and has all its benefits alongside some improvements such as
better compression, allowing faster queries.
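A minimal sketch of writing a SequenceFile with Hadoop's SequenceFile.Writer API,
as mentioned in the list above (the output file name and the IntWritable/Text
key-value choice are placeholders for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path path = new Path("numbers.seq");            // placeholder output file
    SequenceFile.Writer writer = null;
    try {
      // Keys and values are Writable types; the records are stored in binary form.
      writer = SequenceFile.createWriter(conf,
          SequenceFile.Writer.file(path),
          SequenceFile.Writer.keyClass(IntWritable.class),
          SequenceFile.Writer.valueClass(Text.class));
      for (int i = 0; i < 5; i++) {
        writer.append(new IntWritable(i), new Text("record-" + i));
      }
    } finally {
      IOUtils.closeStream(writer);                  // flush and close the writer
    }
  }
}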
There are several choices available for writing data analysis jobs.
The Hive and Pig projects are popular choices that provide SQL-like and
procedural data flow-like languages, respectively.
MapReduce jobs can read and write data in HBase’s table format, but data
processing is often done via HBase’s own client API.
Once a decision has been made for data scaling, the specific scaling approach
must be chosen.
1. Up
2. Out
Scaling up, or vertical scaling :
It involves obtaining a faster server with more powerful processors and more
memory.
This solution uses less network hardware and consumes less power; but ultimately,
for many platforms, it may only provide a short-term fix, especially if continued
growth is expected.
The scale-out technique is a long-term solution, as more and more servers may
be added when needed.
But going from one monolithic system to this type of cluster may be difficult,
although it is an extremely effective solution.
Hadoop Streaming :
Hadoop Streaming lets us write the map and reduce functions in any language that
can read from standard input (STDIN) and write to standard output (STDOUT).
In the diagram,
We have an Input Reader which is responsible for reading the input data and
produces the list of key-value pairs. We can read data in .csv format, in delimiter
format, from a database table, image data(.jpg, .png), audio data etc.
This list of key-value pairs is fed to the Map phase, and the Mapper will work on
each of these key-value pairs and generate some intermediate key-value
pairs.
After shuffling and sorting, the intermediate key-value pairs are fed to the
Reducer; the final output produced by the reducer is then written to
HDFS. This is how a simple MapReduce job works.
Hadoop Pipes :
Hadoop Pipes is the C++ interface to Hadoop MapReduce. Unlike Streaming, which
uses standard input and output to communicate with the map and reduce code,
Pipes uses sockets as the channel over which the tasktracker communicates
with the process running the C++ map or reduce function.
MapReduce Framework and Basics :
Prior to Hadoop 2.0, MapReduce was the only way to process data in Hadoop.
A MapReduce job usually splits the input data set into independent chunks,
which are processed by the map tasks in a completely parallel manner.
The core idea behind MapReduce is mapping your data set into a collection of
<key, value> pairs, and then reducing over all pairs with the same key.
The framework sorts the outputs of the maps, which are then input to the
reduce tasks.
Both the input and the output of the job are stored in a file system.
The framework takes care of scheduling tasks, monitors them, and re-executes
the failed tasks.
Your keys and values may be of any type: strings, integers, dummy types and,
of course, pairs themselves.
1. Mapper component
2. Reducer component
MapReduce integrates with HDFS to provide the exact same benefits for parallel
data processing.
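A minimal word count sketch showing these two components; it follows the classic
example from the Hadoop documentation, though the class names here are our own:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper component: emits (word, 1) for every word in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer component: sums the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  // Driver program: wires the mapper and reducer into a job.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}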
You start by writing your map and reduce functions, ideally with unit tests to make
sure they do what you expect.
Then you write a driver program to run a job, which can run from your IDE using
a small subset of the data to check that it is working.
If it fails, you can use your IDE’s debugger to find the source of the problem.
When the program runs as expected against the small dataset, you are ready to
unleash it on a cluster.
Running against the full dataset is likely to expose some more issues, which you
can fix by expanding your tests and altering your mapper or reducer to handle
the new cases.
Profiling distributed programs is not easy, but Hadoop has hooks to aid in the
process.
Hadoop MapReduce jobs have a unique code architecture that follows a specific
template with specific constructs.
With MRUnit, you can craft test input, push it through your mapper and/or
reducer, and verify its output all in a JUnit test.
As do other JUnit tests, this allows you to debug your code using the JUnit test
as a driver.
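A minimal sketch of such a test, assuming MRUnit 1.x and the TokenizerMapper from
the word count sketch above (the test class name and inputs are placeholders):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class TokenizerMapperTest {
  private MapDriver<Object, Text, Text, IntWritable> mapDriver;

  @Before
  public void setUp() {
    // Wrap the mapper under test in an MRUnit driver.
    mapDriver = MapDriver.newMapDriver(new WordCount.TokenizerMapper());
  }

  @Test
  public void emitsOnePerWord() throws Exception {
    mapDriver.withInput(new LongWritable(1), new Text("big data big"))
             .withOutput(new Text("big"), new IntWritable(1))
             .withOutput(new Text("data"), new IntWritable(1))
             .withOutput(new Text("big"), new IntWritable(1))
             .runTest();   // fails the JUnit test if the actual output differs
  }
}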
Example: We’re processing road surface data used to create maps. The input
contains both linear surfaces and intersections. The mapper takes a collection of
these mixed surfaces as input, discards anything that isn’t a linear road surface,
i.e., intersections, and then processes each road surface and writes it out to
HDFS. We can keep count and eventually print out how many non-road
surfaces are input. For debugging purposes, we can additionally print out how
many road surfaces were processed.
Anatomy of a Map-Reduce Job Run :
A typical Hadoop MapReduce job is divided into a set of Map and Reduce tasks
that execute on a Hadoop cluster.
Job Scheduling :
Early versions of Hadoop had a very simple approach to scheduling users’ jobs:
they ran in order of submission, using a FIFO scheduler.
Typically, each job would use the whole cluster, so jobs had to wait their turn.
Although a shared cluster offers great potential for offering large resources to
many users, the problem of sharing resources fairly between users requires a
better scheduler.
Production jobs need to complete in a timely manner while allowing users who
are making smaller ad hoc queries to get results back in a reasonable time.
The ability to set a job’s priority was added, via the mapred.job.priority property
or the setJobPriority() method on JobClient.
When the job scheduler is choosing the next job to run, it selects the one with the
highest priority.
The default is the original FIFO queue-based scheduler, and there are also
multiuser schedulers called the Fair Scheduler and the Capacity Scheduler :
The Fair Scheduler :
This scheduler aims to give every user a fair share of the cluster capacity
over time.
As more jobs are submitted, free task slots are given to the jobs in such a way as
to give each user a fair share of the cluster.
A short job belonging to one user will complete in a reasonable time even while
another user’s long job is running, and the long job will still make progress.
Jobs are placed in pools, and by default, each user gets their own pool.
The Fair Scheduler supports preemption, so if a pool has not received its fair
share for a certain period of time, then the scheduler will kill tasks in pools
running over capacity in order to give the slots to the pool running under capacity.
The Capacity Scheduler :
This is like the Fair Scheduler, except that within each queue, jobs are scheduled
using FIFO scheduling (with priorities).
The Fair Scheduler, by contrast, enforces fair sharing within each pool, so
running jobs share the pool’s resources.
Task Execution :
After the task tracker assigns a task, the next step is for it to run the task.
First, it localizes the job JAR by copying it from the shared filesystem to the
tasktracker’s filesystem.
It also copies any files needed from the distributed cache by the application to
the local disk.
Second, it creates a local working directory for the task and un-jars the contents
of the JAR into this directory.
TaskRunner launches a new Java Virtual Machine to run each task so that any
bugs in the user-defined map and reduce functions don’t affect the task tracker
(by causing it to crash or hang, for example).
This way it informs the parent of the task’s progress every few seconds until the
task is complete.
Hadoop uses the MapReduce programming model for data processing, with the input
and output of the map and reduce functions represented as key-value
pairs.
They are subject to the parallel execution of datasets situated in a wide array of
machines in a distributed architecture.
Mapping is the core technique of processing a list of data elements that come in
pairs of keys and values.
The general idea of the map and reduce functions of Hadoop can be illustrated
as follows:
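In the notation commonly used for Hadoop (the combiner and partition forms are
included because they are discussed below):
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
partition: (K2, V2) → integer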
The input parameters of the key and value pair, represented by K1 and V1
respectively, are different from the output pair type: K2 and V2.
The reduce function accepts the same format output by the map, but the type of
output again of the reduce operation is different: K3 and V3.
These outputs are nothing but the intermediate output of the job.
If the combine function is used, it has the same form as the reduce function and
the output is fed to the reduce function.
Note that the combine and reduce functions use the same form, except in the
type variables, where for the combiner K3 is K2 and V3 is V2.
The total number of partitions is the same as the number of reduce tasks for the
job.
The partition is determined only by the key ignoring the value.
Input Format :
Hadoop has to accept and process a variety of formats, from text files to
databases.
Each split is further divided into logical records given to the map to process in
key-value pair.
In the context of a database, the split means reading a range of tuples from an
SQL table, as done by the DBInputFormat and producing LongWritables
containing record numbers as keys and DBWritables as values.
It returns the length in bytes and has a reference to the input data.
It is the responsibility of the InputFormat to create the input splits and divide them
into records.
The jobtracker schedules map tasks for the tasktrackers using the storage locations
of the splits.
The task tracker then passes the split by invoking the getRecordReader() method
on the InputFormat to get RecordReader for the split.
The FileInputFormat is the base class for the file data source.
It has the responsibility to identify the files that are to be included as the job input
and the definition for generating the split.
Hadoop also includes the processing of unstructured data that often comes in
textual format; the TextInputFormat is the default InputFormat for such data.
Output Format :
The output format classes are similar to their corresponding input format classes
and work in the reverse direction.
For example, TextOutputFormat is the default OutputFormat; it writes records as
plain text lines, with the key and value separated by a tab, mirroring
TextInputFormat on the input side.
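A minimal sketch of wiring input and output formats into a job using the standard
org.apache.hadoop.mapreduce library classes (the class name FormatConfigDemo and
the command-line paths are placeholders; the mapper and reducer would be set here
as usual):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "format-demo");
    job.setJarByClass(FormatConfigDemo.class);

    // Input: plain text, one record per line (this is also the default).
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Output: tab-separated key/value text lines (also the default).
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}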
Fast: Even if we are dealing with large volumes of unstructured data, Hadoop
MapReduce just takes minutes to process terabytes of data. It can process
petabytes of data in just an hour.
Availability: If any particular node suffers from a failure, then there are always
other copies present on other nodes that can still be accessed whenever needed.
Resilient nature: One of the major features offered by Apache Hadoop is its fault
tolerance. The Hadoop MapReduce framework has the ability to quickly
recognize faults that occur.
UNIT 3
Design of HDFS :
● HDFS is a filesystem designed for storing very large files with streaming
data access patterns, running on clusters of commodity hardware.
● There are Hadoop clusters running today that store petabytes of data.
● HDFS is built around the idea that the most efficient data processing
pattern is a write-once, read-many-times pattern.
● A dataset is typically generated or copied from a source, and then various
analyses are performed on that dataset over time.
HDFS Concepts :
● Blocks :
● HDFS has the concept of a block, but it is a much larger unit: 64 MB by
default in older releases (128 MB from Hadoop 2 onwards).
● Files in HDFS are broken into block-sized chunks, which are stored as
independent units.
● Having a block abstraction for a distributed filesystem brings several
benefits. :
1. A file can be larger than any single disk in the network. Nothing requires
the blocks from a file to be stored on the same disk, so they can take
advantage of any of the disks in the cluster.
2. Making the unit of abstraction a block rather than a file simplifies the
storage subsystem. It simplifies storage management (since blocks are
a fixed size, it is easy to calculate how many can be stored on a given disk)
and eliminates metadata concerns.
3. Blocks fit well with replication for providing fault tolerance and availability.
To insure against corrupted blocks and disk and machine failure, each
block is replicated to a small number of physically separate machines.
● HDFS blocks are large compared to disk blocks, and the reason is to
minimize the cost of seeks.
● Namenodes and Datanodes :
● An HDFS cluster has two types of nodes operating in a master-worker
pattern:
1. A Namenode (the master) and
2. A number of datanodes (workers).
● The namenode manages the filesystem namespace.
● It maintains the filesystem tree and the metadata for all the files and
directories in the tree.
● This information is stored persistently on the local disk in the form of two
files:
● The namespace image
● The edit log.
● The namenode also knows the datanodes on which all the blocks for a
given file are located.
Benefits of HDFS (it addresses the problems of storing data on a single machine):
● A single machine does not give any reliability if that machine goes down.
● An enormous number of clients must be handled if all the clients need the
data stored on a single machine.
● Clients need to copy the data to their local machines before they can
operate on it.
Limitations of HDFS:
● Applications that require low-latency access to data, in the tens of
milliseconds range, will not work well with HDFS.
● Since the namenode holds filesystem metadata in memory, the limit to the
number of files in a filesystem is governed by the amount of memory on
the namenode.
● Files in HDFS may be written by a single writer. Writes are always made
at the end of the file.
● There is no support for multiple writers, or for modifications at arbitrary
offsets in the file.
File Size :
1.
● You can use the hadoop fs -ls command to list files in the current directory
as well as their details.
● The 5th column in the command output contains the file size in bytes.
● For example, the command hadoop fs -ls input gives the following output :
Found 1 item
2.
● You can also find file size using hadoop fs -dus <path>.
● For example, if a directory on HDFS named "/user/frylock/input" contains
100 files and you need the total size for all of those files, you could run:
hadoop fs -dus /user/frylock/input
● This returns the total size (in bytes) of all of the files in the
"/user/frylock/input" directory.
3.
● You can also use the following function to find the file size :
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GetflStatus
{
    // Returns the total length in bytes of the file or directory at the given path.
    public long getflSize(String args) throws IOException, FileNotFoundException
    {
        Configuration config = new Configuration();
        Path path = new Path(args);
        FileSystem hdfs = path.getFileSystem(config);
        ContentSummary cSummary = hdfs.getContentSummary(path);
        long length = cSummary.getLength();
        return length;
    }
}
Replication :
● The number of times you make a copy of a particular file block is expressed
as its Replication Factor.
● As HDFS stores the data in the form of various blocks at the same time
Hadoop is also configured to make a copy of those file blocks.
● By default, the Replication Factor for Hadoop is set to 3 which can be
configured.
● We need this replication for our file blocks because for running Hadoop we
are using commodity hardware (inexpensive system hardware) which can
crash at any time.
● We are not using a supercomputer for our Hadoop setup.
● That is why we need such a feature in HDFS that can make copies of that
file blocks for backup purposes, this is known as fault tolerance.
● For big brand organizations, the data is much more important than the
storage cost, so nobody minds this extra storage.
● You can configure the Replication factor in your hdfs-site.xml file.
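A minimal sketch of that hdfs-site.xml entry, using the standard dfs.replication
property (the value shown is just the default written out explicitly):
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>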
How Does Hadoop Store, Read and Write Files :
1. Read Files :
Step 1: The client opens the file it wishes to read by calling open() on
the File System Object (an instance of DistributedFileSystem).
Step 2: DistributedFileSystem calls the name node, using RPC, to determine the
locations of the first blocks of the file; for each block, the name node
returns the addresses of the data nodes that hold a copy, and the client is
handed an FSDataInputStream to read from.
Step 3: The client calls read() on the stream, and DFSInputStream connects to
the closest data node holding the first block of the file.
Step 4: Data is streamed from the data node back to the client, which calls
read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close
the connection to the data node, then finds the best data node for the next
block.
Step 6: When the client has finished reading the file, a function is called,
close() on the FSDataInputStream.
2. Write Files :
Step 1: The client creates the file by calling create() on
DistributedFileSystem(DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in
the file system’s namespace, with no blocks associated with it. The name
node performs various checks to make sure the file doesn’t already exist
and that the client has the right permissions to create the file. If these
checks pass, the name node prepares a record of the new file; otherwise,
the file can’t be created. The DFS returns an FSDataOutputStream for the
client to start out writing data to the file.
Step 3: As the client writes data, the DFSOutputStream splits it into
packets, which it writes to an internal queue called the data queue. The data
queue is consumed by the DataStreamer, which is liable for asking the
name node to allocate new blocks by picking an inventory of suitable data
nodes to store the replicas. The list of data nodes forms a pipeline. The
DataStreamer streams the packets to the primary data node within the
pipeline, which stores each packet and forwards it to the second data node
within the pipeline.
Step 4: Similarly, the second data node stores the packet and forwards it to
the third (and last) data node in the pipeline.
Step 5: The DFSOutputStream also maintains an internal queue of packets that are
waiting to be acknowledged by data nodes, called the ack queue; a packet is removed
from the ack queue only when it has been acknowledged by all the data nodes in the
pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
This action flushes all the remaining packets to the data node pipeline and waits
for acknowledgements before contacting the name node to signal that the file is
complete.
3. Store Files :
● HDFS divides files into blocks and stores each block on a DataNode.
● Multiple DataNodes are linked to the master node in the cluster, the
NameNode.
● The master node distributes replicas of these data blocks across the
cluster.
● It also instructs the user where to locate wanted information.
● Before the NameNode can help you store and manage the data, it first
needs to partition the file into smaller, manageable data blocks.
● This process is called data block splitting.
Java Interfaces to HDFS :
HDFS Interfaces :
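A minimal sketch of the main Java interface to HDFS, the org.apache.hadoop.fs.FileSystem
API, reading a file whose URI is passed on the command line (the class name HdfsCat
and the example URI in the comment are placeholders):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    String uri = args[0];                              // e.g. hdfs://namenode:8020/user/data/file.txt (hypothetical)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));                     // returns an FSDataInputStream
      IOUtils.copyBytes(in, System.out, 4096, false);  // copy the file contents to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}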
Data Flow :
● Input Reader :
● The input reader reads the upcoming data and splits it into the data blocks
of the appropriate size (64 MB to 128 MB).
● Once input reads the data, it generates the corresponding key-value pairs.
● The input files reside in HDFS.
● Map Function :
● The map function processes the incoming key-value pairs and generates the
corresponding output key-value pairs.
● The map input and output types may be different from each other.
● Partition Function :
● The partition function assigns the output of each Map function to the
appropriate reducer.
● It is given the key and the value, and it returns the index of the reducer
(a sketch of a custom partitioner follows this list).
● Shuffling and Sorting :
● The data are shuffled between nodes so that they move out of the map phase
and get ready to be processed by the reduce function.
● A sorting operation is performed on the input data for the Reduce function.
● Reduce Function :
● The Reduce function is assigned to each unique key.
● These keys are already arranged in sorted order.
● The values associated with each key are iterated over by the Reduce function,
which generates the corresponding output.
● Output Writer :
● Once the data has flowed through all the above phases, the Output writer executes.
● The role of the Output writer is to write the Reduce output to stable
storage.
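A minimal sketch of a custom partition function, assuming Text keys and IntWritable
values as in the word count sketch; the first-letter routing rule is only an
illustration (the default HashPartitioner hashes the whole key):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Returns the index of the reducer a given key-value pair should go to.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0) {
      return 0;                               // map-only job: everything goes to partition 0
    }
    String k = key.toString();
    char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
    return first % numReduceTasks;            // spread keys across reducers by first letter
  }
}
// It is wired into a job with: job.setPartitionerClass(FirstLetterPartitioner.class);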
Data Ingestion :
Hadoop Archives :
● Hadoop Archive is a facility that packs up small files into one compact
HDFS block to avoid memory wastage of name nodes.
● Name node stores the metadata information of the HDFS data.
● If a 1 GB file is broken into 1000 pieces, then the namenode will have to store
metadata about all those 1000 small files.
● In that manner, namenode memory is wasted in storing and managing a lot of
metadata.
● A HAR is created from a collection of files, and the archiving tool runs a
MapReduce job.
● This MapReduce job processes the input files in parallel to create the
archive file.
● Hadoop is created to deal with large files, so small files are problematic
and have to be handled efficiently.
● As a large input file is split into a number of small input files and stored
across all the data nodes, all these huge numbers of records are to be
stored in the name node which makes the name node inefficient.
● To handle this problem, Hadoop Archive has been created which packs the
HDFS files into archives and we can directly use these files as input to the
MR jobs.
● It always comes with *.har extension.
● HAR Syntax :
hadoop archive -archiveName NAME -p <parent path> <src>* <dest>
Example (with hypothetical paths) :
hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo
I/O Compression :
● In the Hadoop framework, where large data sets are stored and processed,
you will need storage for large files.
● These files are divided into blocks and those blocks are stored in different
nodes across the cluster so lots of I/O and network data transfer is also
involved.
● In order to reduce the storage requirements and to reduce the time spent
in-network transfer, you can have a look at data compression in the
Hadoop framework.
● Using data compression in Hadoop you can compress files at various
steps; at all of these steps it helps to reduce the storage used and the quantity
of data transferred.
● You can compress the input file itself.
● That will help you reduce storage space in HDFS.
● You can also configure that the output of a MapReduce job is compressed
in Hadoop.
● That helps in reducing storage space if you are archiving the output or sending
it to some other application for further processing.
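A minimal sketch of enabling output compression on a MapReduce job; GzipCodec is
used only as an example codec and the helper class name is a placeholder:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
  public static void configure(Job job) {
    // Compress the final job output written by the reducers.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    // Optionally compress the intermediate map output as well,
    // which reduces the data shuffled across the network.
    job.getConfiguration().setBoolean("mapreduce.map.output.compress", true);
  }
}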
I/O Serialization :
Avro :
Security in Hadoop :
Administering Hadoop :
UNIT 4
Hadoop-Schedulers
1. Fifo scheduler
As the name suggests, FIFO means First In First Out, so the tasks or
applications that come first are served first. This is the default
scheduler used in Hadoop. The tasks are placed in a queue and
are performed in their submission order. In this method, once the work
is scheduled, no intervention is allowed. So sometimes a high-priority process
has to wait a long time, since the priority of the task doesn't matter
in this method.
2.Capacity schedulers
In the Capacity Scheduler we have multiple job queues for scheduling our tasks. The
Capacity Scheduler allows multiple occupants to share a large Hadoop
cluster. For every job queue in the Capacity Scheduler, we offer
some slots or cluster resources for performing job operations. Each job queue
has its own slots to perform its tasks. In case we have tasks to perform in only
one queue, the tasks of that queue can also access the slots of other queues while
they are free; and when a new task enters another queue, jobs running in its
borrowed slots are replaced by that queue's own jobs.
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler. The
priority of each job is kept in consideration.
YARN applications can share the resources in the large Hadoop Cluster and
these resources are maintained dynamically so no need for prior capacity. The
resources are distributed in such a manner that all applications within a cluster
get an equal amount of time. Fair Scheduler takes Scheduling decisions based
on memory, we can configure it to work with CPU also.
As we told you it is similar to Capacity Scheduler but the major thing to notice is
that in Fair Scheduler whenever any high priority job arises in the same queue,
the task is processed in parallel by replacing some portion from the already
dedicated slots.
High Availability was a new feature added to Hadoop 2.x to solve the Single point
of failure problem in the older versions of Hadoop.
Before Hadoop 2.0.0, the NameNode was a single point of failure (SPOF) in an
HDFS cluster. Each cluster had a single NameNode, and if the NameNode failed, the
cluster as a whole would be out of service. The cluster would be unavailable until
the NameNode was restarted or brought up on a separate machine.
Hadoop 2.0 overcomes this SPOF by providing support for multiple NameNodes.
HDFS NameNode High Availability architecture provides the option of running
two redundant NameNodes in the same cluster in an active/passive configuration
with a hot standby.
If Active NameNode fails, then passive NameNode takes all the responsibility of
active node and the cluster continues to work.
● Active and Standby NameNode should always be in sync with each other,
i.e. they should have the same metadata. This permits restoring the
Hadoop cluster to the same namespace state at which it crashed, and
it gives us fast failover.
● There should be only one NameNode active at a time. Otherwise, two
active NameNodes will lead to corruption of the data. We call this scenario a
“Split-Brain Scenario”, where a cluster gets divided into smaller clusters,
each one believing that it is the only active cluster. “Fencing” avoids such
scenarios. Fencing is a process of ensuring that only one NameNode
remains active at a particular time.
HDFS Federation
To scale the name service horizontally, the federation uses multiple independent
namenodes/namespaces. The namenodes are federated, that is, the namenodes
are independent and don’t require coordination with each other. The datanodes
are used as common storage for blocks by all the namenodes. Each datanode
registers with all the namenodes in the cluster. Datanodes send periodic
heartbeats and block reports and handle commands from the namenodes.
A Namespace and its block pool together are called Namespace Volume. It is a
self-contained unit of management. When a namenode/namespace is deleted,
the corresponding block pool at the datanodes is deleted. Each namespace
volume is upgraded as a unit, during cluster upgrade.
MRv2
MRv2 (aka YARN, "Yet Another Resource Negotiator") has a Resource Manager
for each cluster, and each data node runs a Node Manager. For each job, one
slave node will act as the Application Master, monitoring resources/tasks, etc.
The MapReduce framework in the Hadoop 1.x version is also known as MRv1.
The MRv1 framework includes client communication, job execution and
management, resource scheduling and resource management. The Hadoop
daemons associated with MRv1 are JobTracker and TaskTracker as shown in the
following figure:
YARN
YARN stands for “Yet Another Resource Negotiator“. It was introduced in Hadoop
2.0 to remove the bottleneck on Job Tracker which was present in Hadoop 1.0.
YARN was described as a “Redesigned Resource Manager” at the time of its
launching, but it has now evolved to be known as a large-scale distributed
operating system used for Big Data processing. YARN also allows different data
processing engines like graph processing, interactive processing, stream
processing as well as batch processing to run and process data stored in HDFS
(Hadoop Distributed File System), thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources
and schedule the application processing. For large volume data processing, it is
quite necessary to manage the available resources properly so that every
application can leverage them.
The Resource Manager is the core component of YARN – Yet Another Resource
Negotiator. In analogy, it occupies the place of JobTracker of MRV1. Hadoop
YARN is designed to provide a generic and flexible framework to administer the
computing resources in the Hadoop cluster.
In this direction, the YARN Resource Manager Service (RM) is the central
controlling authority for resource management, and it makes allocation decisions.
The ResourceManager has two main components: the Scheduler and the
ApplicationsManager.
NoSQL Databases:
Introduction to NoSQL
NoSQL is a term for any type of database that does not use SQL for the primary
retrieval of data from the database. NoSQL databases have limited traditional
functionality and are designed for scalability and high-performance retrieval and
appends.
Typically, NoSQL databases store data as key-value pairs, which works well for
data that is unrelated.
NoSQL databases can store relationship data—they just store it differently than
relational databases do. When compared with SQL databases, many find
modeling relationship data in NoSQL databases to be easier than in SQL
databases, because related data doesn’t have to be split between tables.
NoSQL data models allow related data to be nested within a single data
structure.
● Document databases
A document database stores data in JSON, BSON, or XML documents (not Word
documents or Google docs, of course). In a document database, documents can
be nested. Particular elements can be indexed for faster querying.
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications, which means less translation is required to use the
data in an application. SQL data must often be assembled and disassembled
when moving back and forth between applications and storage.
● Key-value stores
Key-value databases are a simpler type of database where each item contains
keys and values. A value can typically only be retrieved by referencing its key, so
learning how to query for a specific key-value pair is typically simple. Key-value
databases are great for use cases where you need to store large amounts of
data but you don’t need to perform complex queries to retrieve it. Common use
cases include storing user preferences or caching. Redis and DynamoDB are
popular key-value databases.
● Column-oriented databases
Column-oriented (wide-column) databases store data organized by column families
rather than by rows, which makes it fast to read and aggregate specific columns
across many records. Cassandra and HBase are popular column-oriented databases.
● Graph databases
Graph databases store data in nodes and edges. Nodes typically store
information about people, places, and things while edges store information about
the relationships between the nodes. Graph databases excel in use cases where
you need to traverse relationships to look for patterns such as social networks,
fraud detection, and recommendation engines. Neo4j and JanusGraph are
examples of graph databases.
MongoDB:
Introduction:
MongoDB is an open-source document database and leading NoSQL database.
MongoDB is written in C++. This tutorial will give you a great understanding of
MongoDB concepts needed to create and deploy a highly scalable and
performance-oriented database.
Data Types:
MongoDB supports many data types. Some of them are −
● String − This is the most commonly used datatype to store the data. String
in MongoDB must be UTF-8 valid.
● Integer − This type is used to store a numerical value. Integer can be 32 bit
or 64 bit depending upon your server.
● Boolean − This type is used to store a boolean (true/ false) value.
● Double − This type is used to store floating-point values.
● Min/ Max keys − This type is used to compare a value against the lowest
and highest BSON elements.
● Arrays − This type is used to store arrays or lists or multiple values into
one key.
● Timestamp − timestamp. This can be handy for recording when a
document has been modified or added.
● Object − This data type is used for embedded documents.
● Null − This type is used to store a Null value.
● Symbol − This datatype is used identically to a string; however, it's
generally reserved for languages that use a specific symbol type.
● Date − This data type is used to store the current date or time in UNIX time
format. You can specify your own date time by creating an object of Date
and passing day, month, a year into it.
● Object ID − This data type is used to store the document’s ID.
● Binary data − This data type is used to store binary data.
● Code − This data type is used to store JavaScript code into the document.
● Regular expression − This data type is used to store regular expression.
Creating Document:
Insert a Single Document
db.collection.insertOne()
Inserts a single document into a collection.
db.collection.updateOne()
Updates at most a single document that matches a specified filter, even though
multiple documents may match the specified filter.
db.collection.replaceOne()
Replaces at most a single document that matches a specified filter, even though
multiple documents may match the specified filter.
db.collection.update()
Deleting Documents
db.collection.deleteMany()
db.collection.deleteOne()
Delete at most a single document that matches a specified filter even though
multiple documents may match the specified filter.
db.collection.remove()
db.collection.findOneAndDelete()
findOneAndDelete() provides a sort option. The option allows for the deletion of
the first document sorted by the specified order.
db.collection.findAndModify()
db.collection.bulkWrite()
Querying:
find() Method
To query data from MongoDB collection, you need to use MongoDB's find()
method.
Syntax
>db.COLLECTION_NAME.find()
pretty() Method
To display the results in a formatted way, you can use pretty() method.
Syntax
>db.COLLECTION_NAME.find().pretty()
findOne() method
Apart from the find() method, there is findOne() method, that returns only one
document.
Syntax
>db.COLLECTIONNAME.findOne()
AND in MongoDB
Syntax
To query documents based on the AND condition, you need to use the $and
keyword. Following is the basic syntax of AND −
>db.mycol.find(
{
$and: [
{key1: value1}, {key2:value2}
]
}
).pretty()
OR in MongoDB
Syntax
To query documents based on the OR condition, you need to use $or keyword.
Following is the basic syntax of OR −
>db.mycol.find(
{
$or: [
{key1: value1}, {key2:value2}
]
}
).pretty()
NOR in MongoDB
Syntax
To query documents based on the NOR condition, you need to use the $nor
keyword. Following is the basic syntax of NOR −
>db.COLLECTION_NAME.find(
{
$nor: [
{key1: value1}, {key2:value2}
]
}
)
NOT in MongoDB
Syntax
To query documents based on the NOT condition, you need to use the $not
operator; $not is applied to a field's condition rather than to a list of clauses.
Following is the basic syntax of NOT −
>db.COLLECTION_NAME.find(
{
key1: { $not: { <operator-expression> } }
}
).pretty()
Introduction to indexing
Indexes are special data structures that store a small portion of the collection's
data set in an easy to traverse form. The index stores the value of a specific field
or set of fields, ordered by the value of the field. The ordering of the index entries
supports efficient equality matches and range-based query operations. In
addition, MongoDB can return sorted results by using the ordering in the index.
Types of Indexes
MongoDB creates a unique index on the _id field during the creation of a
collection.
The _id index prevents clients from inserting two documents with the same value
for the
_id field.
→ Create an Index
db.collection.createIndex()
MongoDB provides several different index types to support specific types of data
and
queries.
→ Single Field
db.collection.createIndex( { orderDate: 1 } )
→ Compound Index
The order of fields listed in a compound index has significance. For instance, if a
compound index consists of { userid: 1, score: -1 }, the index sorts first by userid
and then, within each userid value, sorts by score.
The following example creates a compound index on the orderDate field (in
ascending order) and the zipcode field (in descending order):
db.collection.createIndex( { orderDate: 1, zipcode: -1 } )
→ Multikey Index
If you index a field that holds an array value, MongoDB creates separate index
entries for every element of the array.
These multikey indexes allow queries to select documents that contain arrays by
matching on elements or elements of the arrays.
→ Index Use
Covered Queries
When the query criteria and the projection of a query includes only the indexed
fields,
MongoDB will return results directly from the index without scanning any
documents or
bringing documents into memory. These covered queries can be very efficient.
For queries that specify compound query conditions, if one index can fulfil a
part of a query condition, and another index can fulfil another part of the query
condition, then MongoDB can use the intersection of the two indexes to fulfil the
query.
To illustrate index intersection, consider collection orders that have the following
indexes:
{ qty: 1 }
{ item: 1 }
→ Remove Indexes
db.collection.dropIndex() method
db.accounts.dropIndex( { "tax-id": 1 } )
The above operation removes an ascending index on the tax-id field in the
accounts collection.
db.collection.dropIndexes()
To remove all indexes barring the _id index from a collection, use the operation
above.
→ Modify Indexes
To modify an index, first, drop the index and then recreate it.
Drop Index: Execute the query given below to return a document showing the
operation status.
Recreate the Index: Execute the query given below to return a document
showing the status of the results.
→ Rebuild Indexes
db.collection.reIndex()
This will drop all indexes, including _id, and rebuild all indexes in a single
operation.
Capped Collections:
Procedures
→ Create a Capped Collection
A capped collection is created with db.createCollection() by setting the capped
option and a maximum size in bytes, for example:
db.createCollection("log", { capped: true, size: 100000 } )
To retrieve documents in reverse insertion order, issue find() along with the sort()
method with the $natural parameter set to -1, as shown in the following example:
db.cappedCollection.find().sort( { $natural: -1 } )
db.collection.isCapped()
The size parameter specifies the size of the capped collection in bytes.
Spark:
Installing spark
1. Choose a Spark release: 3.1.2 (Jun 01 2021) / 3.0.3 (Jun 23 2021)
2. Choose a package type: Pre-built for Apache Hadoop 3.2 and later /
Pre-built for Apache Hadoop 2.7 / Pre-built with user-provided Apache
Hadoop / Source Code
3. Download Spark: spark-3.1.2-bin-hadoop3.2.tgz
4. Verify this release using the 3.1.2 signatures, checksums and project
release KEYS.
Note that, Spark 2.x is pre-built with Scala 2.11 except version 2.4.2, which is
pre-built with Scala 2.12. Spark 3.0+ is pre-built with Scala 2.12.
Spark Applications:
1. Processing Streaming Data
The most wonderful aspect of Apache Spark is its ability to process streaming
data. Every second, an unprecedented amount of data is generated globally. This
pushes companies and businesses to process data in large bulks and analyze it
in real-time. The Spark Streaming feature can efficiently handle this function. By
unifying disparate data processing capabilities, Spark Streaming allows
developers to use a single framework to accommodate all their processing
requirements. Some of the best features of Spark Streaming are:
Streaming ETL – Spark’s Streaming ETL continually cleans and aggregates the
data before pushing it into data repositories, unlike the complicated process of
conventional ETL (extract, transform, load) tools used for batch processing in
data warehouse environments – they first read the data, then convert it to a
database compatible format, and finally, write it to the target database.
Data enrichment – This feature helps to enrich the quality of data by combining it
with static data, thus, promoting real-time data analysis. Online marketers use
data enrichment capabilities to combine historical customer data with live
customer behaviour data for delivering personalized and targeted ads to
customers in real-time.
Trigger event detection – The trigger event detection feature allows you to
promptly detect and respond to unusual behaviours or “trigger events” that could
compromise the system or create a serious problem within it.
Complex session analysis – Spark Streaming allows you to group live sessions
and events ( for example, user activity after logging into a website/application)
together and also analyze them. Moreover, this information can be used to
update ML models continually. Netflix uses this feature to obtain real-time
customer behavior insights on the platform and to create more targeted show
recommendations for the users.
2. Machine Learning
When the packets arrive in the repository, they are further analyzed by other
Spark components (for instance, MLlib). In this way, Spark helps security
providers to identify and detect threats as they emerge, thereby enabling them to
solidify client security.
3. Fog Computing
Stage - Each job gets divided into smaller sets of tasks called stages that depend
on each other. As part of the DAG nodes, stages are created based on what
operations can be performed serially or in parallel. Not all Spark operations can
happen in a single stage, so they may be divided into multiple stages. Often
stages are delineated on the operator’s computation boundaries, where they
dictate data transfer among Spark executors.
Task - A single unit of work or execution that will be sent to a Spark executor.
Each stage is comprised of Spark tasks (a unit of execution), which are then
federated across each Spark executor; each task maps to a single core and
works on a single partition of data. As such, an executor with 16 cores can have
16 or more tasks working on 16 or more partitions in parallel, making the
execution of Spark’s tasks exceedingly parallel!
There are two ways to create RDDs − parallelizing an existing collection in your
driver program or referencing a dataset in an external storage system, such as a
shared file system, HDFS, HBase, or any data source offering a Hadoop Input
Format.
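A minimal sketch of both ways using Spark's Java API (the local master setting and
the HDFS path are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddCreationDemo {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("rdd-demo").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // 1. Parallelize an existing collection in the driver program.
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    System.out.println("sum = " + numbers.reduce(Integer::sum));

    // 2. Reference a dataset in an external storage system (any Hadoop-supported source).
    JavaRDD<String> lines = sc.textFile("hdfs:///tmp/input.txt");  // hypothetical path
    System.out.println("line count = " + lines.count());

    sc.close();
  }
}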
Spark makes use of the concept of RDD to achieve faster and efficient
MapReduce operations. Let us first discuss how MapReduce operations take
place and why they are not so efficient.
A Spark application contains several components, all of which exist whether you
are running Spark on a single machine or across a cluster of hundreds or
thousands of nodes.
The components of a Spark application are the Driver, the Master, the Cluster
Manager and the Executors.
All of the Spark components, including the driver, master and executor processes,
run in Java virtual machines (JVMs). A JVM is a cross-platform runtime engine that
executes the instructions compiled into Java bytecode. Scala, which Spark is
written in, compiles into bytecode and runs on JVMs.
Spark on YARN:
When running Spark on YARN, each Spark executor runs as a YARN container.
Where MapReduce schedules a container and fires up a JVM for each task,
Spark hosts multiple tasks within the same container. This approach enables
several orders of magnitude faster task startup time.
Spark supports two modes for running on YARN, “yarn-cluster” mode and
“yarn-client” mode. Broadly, yarn-cluster mode makes sense for production jobs,
while yarn-client mode makes sense for interactive and debugging uses where
you want to see your application’s output immediately.
In yarn-cluster mode, the driver runs in the Application Master. This means that
the same process is responsible for both driving the application and requesting
resources from YARN, and this process runs inside a YARN container. The client
that starts the app doesn’t need to stick around for its entire lifetime.
The yarn-cluster mode, however, is not well suited to using Spark interactively.
Spark applications that require user input, like spark-shell and PySpark, need the
Spark driver to run inside the client process that initiates the Spark application. In
yarn-client mode, the Application Master is merely present to request executor
containers from YARN. The client communicates with those containers to
schedule work after they start:
In yarn-cluster mode, the Spark client submits the application to YARN, and both the Spark driver and the Spark executors run under the supervision of YARN. In yarn-client mode, only the Spark executors are under the supervision of YARN: the YARN ApplicationMaster requests resources just for the executors, while the driver program runs in the client process, outside of YARN.
SCALA:
Introduction
Scala is a modern multi-paradigm programming language designed to express
common programming patterns in a concise, elegant, and type-safe way. It
seamlessly integrates features of object-oriented and functional languages.
Class
Following is a simple syntax to define a basic class in Scala. This class defines two variables, x and y, and a method, move, which does not return a value. Class variables are called fields of the class, and methods are called class methods.
The class name works as a class constructor, which can take several parameters. The code below defines two constructor arguments, xc and yc; they are both visible in the whole body of the class.
Syntax
class Point(xc: Int, yc: Int) {
  var x: Int = xc
  var y: Int = yc
  def move(dx: Int, dy: Int): Unit = { x = x + dx; y = y + dy }
}
You can create objects using the keyword new and then access the class fields and methods.
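A minimal usage sketch of the Point class above (the coordinate values are arbitrary):
val pt = new Point(10, 20)
pt.move(10, 10)
println("x = " + pt.x + ", y = " + pt.y)   // x = 20, y = 30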
Extending a Class
You can extend a base Scala class and you can design an inherited class in the
same way you do it in Java (use extends keyword), but there are two restrictions:
method overriding requires the override keyword, and only the primary
constructor can pass parameters to the base constructor.
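A minimal sketch of extending Point (Location is an illustrative name); note that only the primary constructor passes xc and yc up to the base constructor:
class Location(xc: Int, yc: Int, val zc: Int) extends Point(xc, yc) {
  var z: Int = zc
  def move(dx: Int, dy: Int, dz: Int): Unit = { x = x + dx; y = y + dy; z = z + dz }
}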
Implicit Classes
Implicit classes allow implicit conversions using the class’s primary constructor when the class is in scope. An implicit class is a class marked with the ‘implicit’ keyword. This feature was introduced in Scala 2.10.
Syntax − The following is the syntax for implicit classes. Here implicit class is
always in the object scope where all method definitions are allowed because the
implicit class cannot be a top-level class.
Syntax
object <object name> {
  implicit class <class name>(<variable>: <data type>) {
    def <method>(): Unit = ...
  }
}
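A small sketch of an implicit class (Helpers, IntTimes, and times are illustrative names): the implicit class adds a times method to Int, so the integer is implicitly converted when times is called.
object Helpers {
  implicit class IntTimes(x: Int) {
    def times(f: => Unit): Unit = {
      var i = 1
      while (i <= x) { f; i += 1 }
    }
  }
}
import Helpers._
4 times println("Hello")   // prints Hello four times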
Singleton Objects
Scala is more object-oriented than Java because, in Scala, we cannot have static
members. Instead, Scala has singleton objects. A singleton is a class that can
have only one instance, i.e., Object. You create a singleton using the keyword
object instead of the class keyword. Since you can't instantiate a singleton object,
you can't pass parameters to the primary constructor.
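A minimal singleton sketch (Logger is an illustrative name); the object is accessed directly by name, with no new:
object Logger {
  def log(message: String): Unit = println("LOG: " + message)
}
Logger.log("application started")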
Data Types :
Byte: 8 bit signed value. Range from -128 to 127.
Operators:
● Arithmetic Operators
● Relational Operators
● Logical Operators
● Bitwise Operators
● Assignment Operators
→ Arithmetic Operators
The following arithmetic operators are supported by the Scala language.
Operator Description
+ Adds the two operands.
- Subtracts the second operand from the first.
* Multiplies both operands.
/ Divides the numerator by the denominator.
% Modulus operator; returns the remainder after division.
→ Relational Operators
The following relational operators are supported by the Scala language.
Operator Description
== Checks if the values of two operands are equal or not, if yes then the
condition becomes true.
!= Checks if the values of two operands are equal or not, if values are not
equal then the condition becomes true.
> Checks if the value of the left operand is greater than the value of the
right operand, if yes then the condition becomes true.
< Checks if the value of the left operand is less than the value of the right
operand, if yes then the condition becomes true.
>= Checks if the value of the left operand is greater than or equal to the
value of the right operand, if yes then the condition becomes true.
<= Checks if the value of the left operand is less than or equal to the value
of the right operand, if yes then the condition becomes true.
→ Logical Operators
The following logical operators are supported by the Scala language.
Operator Description
&& It is called the Logical AND operator. If both operands are non-zero, then the condition becomes true.
|| It is called the Logical OR operator. If either of the two operands is non-zero, then the condition becomes true.
! It is called the Logical NOT operator. It reverses the logical state of its operand: if a condition is true, the Logical NOT operator makes it false.
→ Bitwise Operators
Bitwise operator works on bits and performs bit by bit operation. The truth tables
for &, |, and ^ are as follows −
p q p & q p | q p ^ q
0 0 0 0 0
0 1 0 1 1
1 1 1 1 0
1 0 0 1 1
Operator Description
& Binary AND Operator; copies a bit to the result if it exists in both operands.
| Binary OR Operator; copies a bit to the result if it exists in either operand.
^ Binary XOR Operator; copies the bit if it is set in one operand but not both.
<< Binary Left Shift Operator. The bit positions of the value of the left
operand are moved left by the number of bits specified by the right
operand.
>> Binary Right Shift Operator. The bit positions of the left operand value
are moved right by the number of bits specified by the right operand.
>>> Shift right zero-fill operator. The left operands value is moved right by
the number of bits specified by the right operand and shifted values are
filled up with zeros.
→ Assignment Operators
The following assignment operators are supported by the Scala language −
Operator Description
+= Add AND assignment operator, It adds the right operand to the left
operand and assigns the result to the left operand
/= Divide AND assignment operator, It divides left operand with the right
operand and assigns the result to left operand
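A short Scala sketch tying a few of the operators above together (the values a = 12 and b = 5 are arbitrary):
val a = 12
val b = 5
println(a + b)                          // 17   (arithmetic)
println(a % b)                          // 2
println(a > b)                          // true (relational)
println(s"${a & b} ${a | b} ${a ^ b}")  // 4 13 9 (bitwise)
println(a << 2)                         // 48   (left shift)
var c = a
c += b                                  // assignment operator: c is now 17
println(c)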
Scala has only a handful of built-in control structures. The only control structures
are if, while, for, try, match, and function calls. The reason Scala has so few is
that it has included function literals since its inception. Instead of accumulating
one higher-level control structure after another in the base syntax, Scala
accumulates them in libraries.
1. If expressions
Scala's if works just like in many other languages. It tests a condition and then
executes one of two code branches depending on whether the condition holds
true. Here is a common example, written in an imperative style:
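A sketch of the example described below, assuming it runs as a Scala script so that args holds the command-line arguments (the default file name is illustrative):
var filename = "default.txt"
if (!args.isEmpty)
  filename = args(0)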
This code declares a variable, filename, and initializes it to a default value. It then uses an if expression to check whether any arguments were supplied to the program. If so, it changes the variable to hold the value specified in the argument list. If no arguments were supplied, it leaves the variable set to the default value.
2. While loops
Scala's while loop behaves as in other languages. It has a condition and a body,
and the body is executed over and over as long as the condition holds true.
example:
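A sketch of a while loop, assuming a function that computes the greatest common divisor with vars and a loop:
def gcdLoop(x: Long, y: Long): Long = {
  var a = x
  var b = y
  while (a != 0) {
    val temp = a
    a = b % a
    b = temp
  }
  b
}
println(gcdLoop(66, 42))   // 6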
Scala also has a do-while loop. This works like the while loop except that it tests
the condition after the loop body instead of before.
Below shows a Scala script that uses a do-while to echo lines read from the
standard input until an empty line is entered:
var line = ""
do {
line = readLine()
println("Read: "+ line)
} while (line != "")
3. For expressions
Scala's for expression is a Swiss army knife of iteration. It lets you combine a few
simple ingredients in different ways to express a wide variety of iterations. Simple
uses enable common tasks such as iterating through a sequence of integers.
More advanced expressions can iterate over multiple collections of different
kinds, can filter out elements based on arbitrary conditions, and can produce new
collections.
Iteration through collections
The simplest thing you can do is to iterate through all the elements of a
collection.
For example, below shows some code that prints out all files in the current
directory. The I/O is performed using the Java API. First, we create a java.io.File
on the current directory, ".", and call its listFiles method. This method returns an
array of File objects, one per directory and file contained in the current directory.
We store the resulting array in the filesHere variable.
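A sketch of the code described above (it lists whatever happens to be in the current working directory):
val filesHere = (new java.io.File(".")).listFiles
for (file <- filesHere)
  println(file)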
Filtering
Sometimes you do not want to iterate through a collection in its entirety. You want
to filter it down to some subset. You can do this with a for expression by adding a
filter: an if clause inside the for's parentheses.
For example, the code shown below lists only those files in the current directory
whose names end with ".scala":
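A sketch of such a filter, assuming the filesHere array from the previous example:
for (file <- filesHere if file.getName.endsWith(".scala"))
  println(file)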
Nested iteration
If you add multiple <- clauses, you will get nested "loops." For example, the for
expression shown below has two nested loops. The outer loop iterates through
filesHere, and the inner loop iterates through fileLines(file) for any file that ends
with .scala.
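A sketch of that nested iteration, assuming filesHere from above and an illustrative helper fileLines that reads a file's lines; the call that runs it follows on the next line:
def fileLines(file: java.io.File) =
  scala.io.Source.fromFile(file).getLines().toList
def grep(pattern: String) =
  for {
    file <- filesHere
    if file.getName.endsWith(".scala")
    line <- fileLines(file)
    if line.trim.matches(pattern)
  } println(s"$file: ${line.trim}")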
grep(".*gcd.*")
Note that the previous code repeats the expression line.trim. This is a non-trivial
computation, so you might want to only compute it once. You can do this by
binding the result to a new variable using an equals sign (=). The bound variable
is introduced and used just like a val, only with the val keyword left out.
Below is an example.
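A sketch of the same grep with the trimmed line bound once mid-stream (trimmed is an illustrative name); again the call follows on the next line:
def grep(pattern: String) =
  for {
    file <- filesHere
    if file.getName.endsWith(".scala")
    line <- fileLines(file)
    trimmed = line.trim
    if trimmed.matches(pattern)
  } println(s"$file: $trimmed")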
grep(".*gcd.*")
While all of the examples so far have operated on the iterated values and then
forgotten them, you can also generate a value to remember for each iteration. To
do so, you prefix the body of the for expression by the keyword yield. For
example, here is a function that identifies the .scala files and stores them in an
array:
def scalaFiles =
  for {
    file <- filesHere
    if file.getName.endsWith(".scala")
  } yield file
Each time the body of the for expression executes it produces one value, in this
case simply file. When the for expression completes, the result will include all of
the yielded values contained in a single collection. The type of the resulting
collection is based on the kind of collections processed in the iteration clauses. In
this case, the result is an Array[File], because filesHere is an array and the type
of the yielded expression is File.
4. Try expressions
Catching exceptions
You catch exceptions using the syntax shown below. The syntax for catch clauses was chosen for its consistency with pattern matching, an important and powerful feature of Scala.
import java.io.FileReader
import java.io.FileNotFoundException
import java.io.IOException
try {
val f = new FileReader("input.txt") // Use and close file
} catch {
case ex: FileNotFoundException => // Handle missing file
case ex: IOException => // Handle other I/O error
}
You can wrap an expression with a finally clause if you want to cause some code
to execute no matter how the expression terminates. For example, you might
want to be sure an open file gets closed even if a method exits by throwing an
exception. Below is an example.
import java.io.FileReader
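// A minimal sketch of the finally example (assumes a file named input.txt exists):
val file = new FileReader("input.txt")
try {
  // use the file here
} finally {
  file.close()   // the file is closed no matter how the try block exits
}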
Yielding a value
As with most other Scala control structures, a try-catch-finally expression results in a value. For example, the imports below set up an example that parses a URL but falls back to a default value if the URL is badly formed.
import java.net.URL
import java.net.MalformedURLException
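// A sketch of a try-catch that yields a value (the fallback URL is illustrative):
def urlFor(path: String): URL =
  try {
    new URL(path)
  } catch {
    case e: MalformedURLException => new URL("http://www.scala-lang.org")
  }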
5. Match expressions
Scala's match expression lets you select from several alternatives, just like
switch statements in other languages. In general, a match expression lets you
select using arbitrary patterns. For now, just consider using match to select
among several alternatives.
As an example, the script below reads a food name from the argument list and
prints a companion to that food. This match expression examines firstArg, which
has been set to the first argument out of the argument list. If it is the string "salt",
it prints "pepper", while if it is the string "chips", it prints "salsa", and so on. The
default case is specified with an underscore (_), a wildcard symbol frequently
used in Scala as a placeholder for a completely unknown value.
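A sketch of the script described above, assuming args holds the argument list (the eggs/bacon pair and the "huh?" default are illustrative):
val firstArg = if (!args.isEmpty) args(0) else ""
firstArg match {
  case "salt"  => println("pepper")
  case "chips" => println("salsa")
  case "eggs"  => println("bacon")
  case _       => println("huh?")
}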
Functions
Scala has both functions and methods and we use the terms method and
function interchangeably with a minor difference. A Scala method is a part of a
class that has a name, a signature, optionally some annotations, and some
bytecode whereas a function in Scala is a complete object which can be
assigned to a variable. In other words, a function, which is defined as a member
of some object, is called a method.
Function Declarations
A Scala function declaration has the following form −
Methods are implicitly declared abstract if you don’t use the equals sign and the
method body.
Function Definitions
A Scala function definition has the following form −
Syntax
Here, the return type can be any valid Scala data type, the list of parameters is a comma-separated list of variables, and both the parameter list and the return type are optional.
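A minimal sketch of a function definition (addInt is an illustrative name):
def addInt(a: Int, b: Int): Int = {
  val sum = a + b
  sum
}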
Calling Functions
Scala provides several syntactic variations for invoking methods. Following is the
standard way to call a method −
If a function is being called using an instance of the object, then we would use
dot notation similar to Java as follows −
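A sketch of both call styles, reusing the illustrative addInt above and a hypothetical Calculator object:
println(addInt(5, 7))              // standard call: prints 12
object Calculator {
  def addInt(a: Int, b: Int): Int = a + b
}
println(Calculator.addInt(5, 7))   // calling a method through an object, using dot notation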
Closures
Scala closures are functions that use one or more free variables, and the return value of the function depends on these variables. The free variables are defined outside of the closure function and are not included as parameters of the function. So the difference between a closure function and a normal function is the free variable: a free variable is any variable that is neither defined within the function nor passed to it as a parameter. The function itself does not bind the free variable to a value; the value is picked up from the surrounding scope.
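A minimal closure sketch (factor and multiplier are illustrative names); the anonymous function closes over the free variable factor:
var factor = 3
val multiplier = (i: Int) => i * factor
println(multiplier(10))   // 30
factor = 5
println(multiplier(10))   // 50 (the closure sees the updated value of factor)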
Inheritance
Important terminology: the class whose features are inherited is called the superclass (the base or parent class), and the class that inherits those features is called the subclass (the derived or child class).
Syntax:
class child_class_name extends parent_class_name {
// Methods and fields
}
Type of inheritance
Below are the different types of inheritance which are supported by Scala.
Single Inheritance: In single inheritance, derived class inherits the features of one
base class. In the image below, class A serves as a base class for the derived
class B.
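A minimal single-inheritance sketch (class names A and B follow the description above; the method names are illustrative):
class A {
  def show(): Unit = println("inside class A")
}
class B extends A {
  def greet(): Unit = println("inside class B")
}
val obj = new B
obj.show()    // inherited from A
obj.greet()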
1. Pig :
Applications :
2. Hive :
It resides on top of Hadoop to summarize Big Data and makes querying and
analyzing easy.
Benefits :
1. Ease of use
2. Accelerated initial insertion of data
3. Superior scalability, flexibility, and cost-efficiency
4. Streamlined security
5. Low overhead
6. Exceptional working capacity
3. HBase :
HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.
HBase supports writing applications in Apache Avro, REST, and Thrift.
Application :
1. Medical
2. Sports
3. Web
4. Oil and petroleum
5. e-commerce
PIG
Introduction to PIG :
Pig is a high-level platform or tool which is used to process large datasets.
Pig Latin and Pig Engine are the two main components of the Apache Pig tool.
One limitation of MapReduce is that the development cycle is very long: writing the mapper and reducer, compiling and packaging the code, submitting the job, and retrieving the output is a time-consuming task.
Apache Pig reduces the time of development using the multi-query approach.
Pig is beneficial for programmers who are not from Java backgrounds.
200 lines of Java code can often be expressed in only 10 lines of Pig Latin.
Programmers who have SQL knowledge need less effort to learn Pig Latin.
In this shell, you can enter the Pig Latin statements and get the output (using the
Dump operator).
You can run Apache Pig in Batch mode by writing the Pig Latin script in a single
file with the .pig extension.
Apache Pig provides the provision of defining our own functions (User Defined
Functions) in programming languages such as Java and using them in our script.
PIG vs SQL :
· The data model in Apache Pig is nested relational, whereas the data model used in SQL is flat relational.
· Apache Pig provides limited opportunity for query optimization, whereas there is more opportunity for query optimization in SQL.
Grunt :
The Grunt shell of Apache Pig is mainly used to write Pig Latin scripts.
Pig scripts can be executed from the Grunt shell, which is a native shell provided by Apache Pig to execute Pig queries.
Syntax of sh command :
grunt> sh ls
Syntax of fs command :
grunt>fs -ls
Pig Latin :
The Pig Latin is a data flow language used by Apache Pig to analyze the data in
Hadoop.
User-Defined Functions :
Using these UDFs, we can define our own functions and use them.
The UDF support is provided in six programming languages:
· Java
· Jython
· Python
· JavaScript
· Ruby
· Groovy
For writing UDFs, complete support is provided in Java, and limited support is provided in all the remaining languages.
Using Java, you can write UDFs involving all parts of the processing, such as data load/store, column transformation, and aggregation.
Since Apache Pig itself is written in Java, UDFs written in Java work more efficiently than those written in the other languages.
Filter Functions :
Eval Functions :
Algebraic Functions :
Apache Pig operators belong to Pig Latin, a high-level procedural language for querying large data sets using Hadoop and the MapReduce platform.
A Pig Latin statement is an operator that takes a relation as input and produces
another relation as output.
These operators are the main tools Pig Latin provides to operate on the data.
Relational Operators :
Relational operators are the main tools Pig Latin provides to operate on the data.
LOAD: The LOAD operator is used to load data from the file system or HDFS
storage into a Pig relation.
JOIN: The JOIN operator is used to perform an inner equijoin of two or more relations based on common field values.
ORDER BY: Order By is used to sort a relation based on one or more fields in
either ascending or descending order using ASC and DESC keywords.
GROUP: The GROUP operator groups together the tuples with the same group
key (key field).
COGROUP: COGROUP is the same as the GROUP operator. For readability, programmers usually use GROUP when only one relation is involved and COGROUP when multiple relations are involved.
Diagnostic Operator :
The load statement will simply load the data into the specified relation in Apache
Pig.
To verify the execution of the Load statement, you have to use the Diagnostic
Operators.
DUMP: The DUMP operator is used to run Pig Latin statements and display the
results on the screen.
EXPLAIN: The EXPLAIN operator is used to display the logical, physical, and
MapReduce execution plans of a relation.
Hive
1. Hive Client
2. Hive Services
3. Processing Framework and Resource Management
4. Distributed Storage
HIVE CLIENT :
Hive supports applications written in any language, such as Python, Java, C++, or Ruby, using JDBC, ODBC, and Thrift drivers to perform queries on Hive. Hence, one can easily write a Hive client application in the language of their choice.
1. Thrift Clients : The Hive server is based on Apache Thrift so that it can
serve the request from a thrift client.
2. JDBC client : Hive allows for the Java applications to connect to it using the
JDBC driver. JDBC driver uses Thrift to communicate with the Hive Server.
3. ODBC client : Hive ODBC driver allows applications based on the ODBC
protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses
Thrift to communicate with the Hive Server.
HIVE SERVICE :
To perform all queries, Hive provides various services like the Hive server2,
Beeline, etc.
1. Beeline
2. Hive Server 2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat
PROCESSING FRAMEWORK AND RESOURCE MANAGEMENT :
Hive internally uses the MapReduce framework to execute queries: a MapReduce job works by splitting data into chunks, which are processed by map-reduce tasks.
DISTRIBUTED STORAGE :
Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File
System for the distributed storage.
Hive Shell :
In the Hive shell, the up and down arrow keys are used to scroll through previous commands.
The Tab key will autocomplete (provide suggestions as you type) Hive keywords and functions.
Non-Interactive mode :
The Hive shell can run in non-interactive mode with the -f option, which executes the HiveQL statements contained in a script file.
Example: $ hive -f script.q (where script.q is an illustrative file of HiveQL statements)
Interactive mode :
The hive can work in interactive mode by directly typing the command “hive” in
the terminal.
Example:
$hive
Hive Services :
· Hive CLI: The Hive CLI (Command Line Interface) is a shell where we can
execute Hive queries and commands.
· Hive Web User Interface: The Hive Web UI is just an alternative of Hive CLI.
It provides a web-based GUI for executing Hive queries and commands.
· Hive Compiler: The purpose of the compiler is to parse the query and
perform semantic analysis on the different query blocks and expressions. It
converts HiveQL statements into MapReduce jobs.
· Hive Execution Engine: Optimizer generates the logical plan in the form of
DAG of map-reduce tasks and HDFS tasks. In the end, the execution engine
executes the incoming tasks in the order of their dependencies.
MetaStore :
Hive metastore (HMS) is a service that stores Apache Hive and other metadata
in a backend RDBMS, such as MySQL or PostgreSQL.
The connections to and from HMS include HiveServer, Ranger, and the
NameNode, which represents HDFS.
Beeline, Hue, JDBC, and Impala shell clients make requests through thrift or
JDBC to HiveServer.
All connections are routed to a single RDBMS service at any given time.
HMS talks to the NameNode over thrift and functions as a client to HDFS.
HMS connects directly to Ranger and the NameNode (HDFS), and so does
HiveServer.
One or more HMS instances on the backend can talk to other services, such as
Ranger.
RDBMS vs HIVE :
HiveQL :
Even though it is based on SQL, HiveQL does not strictly follow the full SQL-92 standard.
HiveQL offers extensions not in SQL, including multi-table inserts and create table as select.
HiveQL initially lacked support for transactions and materialized views, and offered only limited subquery support.
Support for insert, update, and delete with full ACID functionality was made available with release 0.14.
Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.
Example :
The first statement checks whether the table docs exists and drops it if it does. It then creates a new table called docs with a single column of type STRING called line.
Loads the specified file or directory (In this case “input_file”) into the table.
OVERWRITE specifies that the target table to which the data is being loaded is
to be re-written; Otherwise, the data would be appended.
This query draws its input from the inner query (SELECT explode(split(line, '\s')) AS word FROM docs) temp.
This query serves to split the input words into different rows of a temporary table
aliased as temp.
This results in the count column holding the number of occurrences for each
word of the word column.
Managed Tables :
In a managed table, both the table data and the table schema are managed by Hive.
The data will be located in a folder named after the table within the Hive data
warehouse, which is essentially just a file location in HDFS.
External Tables :
An external table is one where only the table schema is controlled by Hive.
In most cases, the user will set up the folder location within HDFS and copy the
data file(s) there.
When an external table is deleted, Hive will only delete the schema associated
with the table.
The data files are not affected.
Querying Data :
SQL, the most well-known and widely-used query language, is familiar to most
database administrators (DBAs)
User-Defined Functions :
In Hive, the users can define their own functions to meet certain client
requirements.
The developer will develop these functions in Java and integrate those UDFs
with the Hive.
During the Query execution, the developer can directly use the code, and UDFs
will return outputs according to the user-defined tasks.
The general type of UDF will accept a single input value and produce a single
output value.
We can use two different interfaces for writing Apache Hive User-Defined
Functions :
1. Simple API
2. Complex API
When a globally sorted result is not required (and in many cases it isn't), you can use Hive's nonstandard extension SORT BY instead, which produces a sorted file per reducer.
If you want to control which reducer a particular row goes to (typically so you can perform some subsequent aggregation), you can use Hive's DISTRIBUTE BY clause.
· Example output (year, temperature pairs, with rows for the same year sent to the same reducer and sorted by descending temperature) :
1949 111
1949 78
1950 22
1950 0
1950 -11
Similar to any other scripting language, Hive scripts are used to execute a set of
Hive commands collectively.
Hive scripting helps us to reduce the time and effort invested in writing and
executing the individual commands manually.
JOINS :
· Left outer Join: Returns all the rows from the left table even though there
are no matches in the right table.
· Right Outer Join: Returns all the rows from the Right table even though
there are no matches in the left table.
· Full Outer Join: It combines the records of both tables based on the JOIN condition given in the query. It returns all the records from both tables and fills in NULL values for the columns where no matching value is found on either side.
SUBQUERIES :
The main query will depend on the values returned by the subqueries.
When to use :
· To get a particular value combined from two column values from different
tables.
Syntax :
Subquery in FROM clause
SELECT <column names 1, 2...n> FROM (SubQuery) <TableName_Main>;
Subquery in WHERE clause
SELECT <column names 1, 2...n> FROM <TableName_Main> WHERE col1 IN (SubQuery);
HBASE
HBase Concepts :
HBase is a data model that is similar to Google’s big table designed to provide
quick random access to huge amounts of structured data.
It leverages the fault tolerance provided by the Hadoop File System (HDFS).
One can store the data in HDFS either directly or through HBase.
HBase sits on top of the Hadoop File System and provides read and write
access.
HBase Vs RDBMS :
· An RDBMS is row-oriented, whereas HBase is column-oriented.
· An RDBMS is not easily scalable, whereas HBase is scalable.
Schema Design :
HBase table can scale to billions of rows and any number of columns based on
your requirements.
The HBase table supports the high read and writes throughput at low latency.
A single value in each row is indexed; this value is known as the row key.
The HBase schema design is very different compared to the relational database
schema design.
Some of the general concepts that should be followed while designing schema in
Hbase:
· Row key: Each table in the HBase table is indexed on the row key. There
are no secondary indices available on the HBase table.
· Atomicity: Avoid designing a table that requires atomicity across all rows; all operations on HBase rows are atomic at the row level.
· Even distribution: Reads and writes should be uniformly distributed across all nodes available in the cluster. Design the row key in such a way that related entities are stored in adjacent rows, to increase read efficiency.
Zookeeper :
For instance, to track the status of distributed data, Apache HBase uses
ZooKeeper.
· Reliability: The system keeps performing, even if more than one node fails.
· Speed: ZooKeeper is especially fast in read-dominant workloads, where reads outnumber writes by a ratio of around 10:1.
The company offered solutions to store, manage, and analyze the huge amounts of data generated daily, and equipped large and small companies to make informed business decisions.
The company believed that its Big Data and analytics products and services
would help its clients become more competitive and drive growth.
Issues :
· Understand the concept of Big Data and its importance to large, medium,
and small companies in the current industry scenario.
· Understand the need for implementing a Big Data strategy and the various
issues and challenges associated with this.
· Explore ways in which IBM’s Big Data strategy could be improved further.
Introduction to InfoSphere :
InfoSphere Information Server provides a single platform for data integration and
governance.
The components in the suite combine to create a unified foundation for enterprise
information architectures, capable of scaling to meet any information volume
requirements.
You can use the suite to deliver business results faster while maintaining data
quality and integrity throughout your information landscape.
InfoSphere Information Server helps your business and IT personnel collaborate
to understand the meaning, structure, and content of information across a wide
variety of sources.
By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency, and
lower risk.
BigInsights :
Big Sheets :
These deep insights help you to filter and manipulate data from sheets even
further.
A Big SQL query can quickly access a variety of data sources including HDFS,
RDBMS, NoSQL databases, object stores, and WebHDFS by using a single
database connection or single query for best-in-class analytic capabilities.
Big SQL provides tools to help you manage your system and your databases,
and you can use popular analytic tools to visualize your data.
Big SQL's robust engine executes complex queries for relational data and
Hadoop data.
Big SQL provides an advanced SQL compiler and a cost-based optimizer for
efficient query execution.
Combining these with a massively parallel processing (MPP) engine helps distribute query execution across nodes in a cluster.