Module 1

Digital Data

• Today, data is undoubtedly an invaluable asset of any enterprise, big or small. Even though professionals work with data all the time, the understanding, management and analysis of data from heterogeneous sources remain a serious challenge.
• This lecture explains the various formats of digital data (structured, semi-structured and unstructured), data storage mechanisms, data access methods, data management, the process of extracting desired information from data, and the challenges posed by the various formats of data.
• Data growth has accelerated exponentially since the advent of the computer and the Internet. In fact, the computer and Internet together have given data its digital form. Digital data can be classified into three forms: unstructured, semi-structured and structured.
• Usually, data is in an unstructured format, which makes extracting information from it difficult.
• According to Merrill Lynch, 80–90% of business data is either unstructured
or semi-structured.
• Gartner also estimates that unstructured data constitutes 80% of the whole
enterprise data.
Data Forms Defined –
Unstructured data: This is data which does not conform to a data model or is not in a form which can be used easily by a computer program. About 80–90% of an organization's data is in this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, research reports, white papers, the body of an email, etc.
Semi-structured data: This is data which does not conform to a data model but has some structure. However, it is not in a form which can be used easily by a computer program; for example, emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient.
Structured data: This is the data which is in an organized form (e.g., in rows
and columns) and can be easily used by a computer program. Relationships
exist between entities of data, such as classes and their objects. Data stored in
databases is an example of structured data.
Unstructured Data
Getting to Know
• Dr. Ben, Dr. Stanley, and Dr. Mark work at the medical facility of "GoodLife". Over the past few days, Dr. Ben and Dr. Stanley have been exchanging long emails about a particular case of gastro-intestinal problems. Dr. Stanley has chanced upon a particular combination of drugs that has cured gastro-intestinal disorders in his patients. He has written an email about this combination of drugs to Dr. Ben.
• Dr. Mark has a patient in the "GoodLife" emergency unit with quite a similar case of gastro-intestinal disorder to the one whose cure Dr. Stanley has chanced upon. Dr. Mark has already tried the regular drugs, but with no positive results so far. He quickly searches the organization's database for answers, but with no luck. The information he wants is tucked away in the email conversation between two other "GoodLife" doctors, Dr. Ben and Dr. Stanley. Dr. Mark would have accessed the solution with a few mouse clicks had the storage and analysis of unstructured data been undertaken by "GoodLife".
• As is the case at "GoodLife", 80–85% of the data in any organization is unstructured, and it is growing at an alarming rate. An enormous amount of knowledge is buried in this data. In the example above, Dr. Stanley's email to Dr. Ben had not been captured by the medical system because it was in an unstructured format.
• Unstructured data, thus, is data that cannot be stored in the form of rows and columns as in a database and does not conform to any data model, i.e., it is difficult to determine the meaning of the data. It does not follow any rules or semantics. It can be of any type and is hence unpredictable.
Semi-structured Data
Semi-structured data does not conform to any data model, i.e., it is difficult to determine the meaning of the data, nor can the data be stored in rows and columns as in a database. However, semi-structured data has tags and markers which help to group the data and describe how it is stored. These give some metadata, but it is not sufficient for the management and automation of data.
• Similar entities in the data are grouped and organized in a hierarchy. The attributes or properties within a group may or may not be the same. For example, two addresses may or may not contain the same number of properties (compare Address 1 and Address 2); a small sketch of this follows at the end of this section.
• For example, an e-mail follows a standard format: To, From, Subject, CC, Body.
• The tags give us some metadata, but the body of the e-mail has no structure that conveys the meaning of the data it contains.
• There is a very fine line between unstructured and semi-structured data.
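As a minimal sketch of the address example above (the field names below are invented purely for illustration), semi-structured records can carry the same label yet differ in which attributes they contain:

# Two "address" records grouped under the same label but with differing attributes.
# The keys (tags) give some metadata, yet there is no fixed schema the records must follow.
addresses = [
    {"type": "address", "street": "12 Main St", "city": "Pune", "pin": "411001"},
    {"type": "address", "city": "Mumbai", "landmark": "Near Central Mall"},  # no street, no pin
]

for i, addr in enumerate(addresses, start=1):
    # Each record describes itself through its keys, but the keys vary from record to record.
    print(f"Address {i} has fields: {sorted(addr.keys())}")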
Structured Data
• Structured data is organized in semantic chunks (entities).
• Similar entities are grouped together (relations or classes).
• Entities in the same group have the same descriptions (attributes).
• Descriptions for all entities in a group (the schema):
– have the same defined format,
– have a predefined length,
– are all present,
– and follow the same order.
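As an illustrative sketch of these points (the entity below is modelled on the Employee example given later in this module, not on any real system), every record of a structured entity carries the same attributes, in the same order, with defined types:

from dataclasses import dataclass

@dataclass
class Employee:
    # Every Employee entity has exactly these attributes, in this order.
    employee_id: int
    employee_name: str
    gender: str
    department: str
    salary_in_lacs: int

# All instances of the group share the same description (schema).
e1 = Employee(2365, "Rajesh Kulkarni", "Male", "Finance", 650000)
e2 = Employee(3398, "Pratibha Joshi", "Female", "Admin", 650000)
print(e1.department, e2.department)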
What is Big Data? Introduction, Types, Characteristics, Examples
What is Data?
Data refers to quantities, characters, or symbols on which operations are performed by a computer, and which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Now, let's learn the definition of Big Data.
What is Big Data?
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so large that no traditional data management tool can store or process it efficiently. In short, Big Data is still data, but of enormous size.
What is an Example of Big Data?
Social Media
Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types Of Big Data
Following are the types of Big Data:
1.Structured
2.Unstructured
3.Semi-structured
Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value out of it. However, nowadays we foresee issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Do you know? 10^21 bytes equal 1 zettabyte; in other words, one billion terabytes form a zettabyte.
Looking at these figures one can easily understand why the name Big Data is
given and imagine the challenges involved in its storage and processing.
Do you know? Data stored in a relational database management system is one example of 'structured' data.
Examples Of Structured Data
An ‘Employee’ table in a database is an example of Structured Data
Employee_ID Employee_Name Gender Department Salary_In_lacs
2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000
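To make the idea concrete, the same Employee table can be created and queried with Python's built-in sqlite3 module. This is only an illustrative sketch added here, not part of the original example:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns, known in advance.
cur.execute("""CREATE TABLE Employee (
    Employee_ID INTEGER PRIMARY KEY,
    Employee_Name TEXT,
    Gender TEXT,
    Department TEXT,
    Salary_In_lacs INTEGER)""")

rows = [
    (2365, "Rajesh Kulkarni", "Male", "Finance", 650000),
    (3398, "Pratibha Joshi", "Female", "Admin", 650000),
    (7465, "Shushil Roy", "Male", "Admin", 500000),
    (7500, "Shubhojit Das", "Male", "Finance", 500000),
    (7699, "Priya Sane", "Female", "Finance", 550000),
]
cur.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", rows)

# Because the format is well known, querying is straightforward.
cur.execute("SELECT Employee_Name FROM Employee WHERE Department = 'Finance'")
print([name for (name,) in cur.fetchall()])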
Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to the size being huge, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays, organizations have a wealth of data available to them, but unfortunately they don't know how to derive value out of it, since this data is in its raw form or unstructured format.
Examples Of Un-structured Data
The output returned by ‘Google Search’
Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, for example, a table definition as in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
Personal data stored in an XML file-
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>

<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
Characteristics Of Big Data
Big data can be described by the following characteristics:
1.Volume
2.Variety
3.Velocity
4.Variability
(i) Volume – The name Big Data itself is related to a size which is enormous.
Size of data plays a very crucial role in determining value out of data. Also,
whether a particular data can actually be considered as a Big Data or not, is
dependent upon the volume of data. Hence, ‘Volume’ is one characteristic which
needs to be considered while dealing with Big Data solutions.
(ii) Variety – The next aspect of Big Data is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured
and unstructured. During earlier days, spreadsheets and databases were the
only sources of data considered by most of the applications. Nowadays, data in
the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. are
also being considered in the analysis applications. This variety of unstructured
data poses certain issues for storage, mining and analyzing data.
(iii) Velocity – The term ‘velocity’ refers to the speed of generation of data. How
fast the data is generated and processed to meet the demands, determines real
potential in the data.
Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites, sensors,
Mobile devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the
data at times, thus hampering the process of being able to handle and manage
the data effectively.
Advantages Of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
1. Businesses can utilize outside intelligence while taking decisions.
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.

2.Improved customer service
Traditional customer feedback systems are getting replaced by new systems
designed with Big Data technologies. In these new systems, Big Data and
natural language processing technologies are being used to read and evaluate
consumer responses.
3.Early identification of risk to the product/services, if any
4.Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for
new data before identifying what data should be moved to the data warehouse.
In addition, such integration of Big Data technologies and data warehouse helps
an organization to offload infrequently accessed data.
What is Big Data Analytics?
Big Data analytics is a process used to extract meaningful insights, such as hid-
den patterns, unknown correlations, market trends, and customer preferences.
Big Data analytics provides various advantages—it can be used for better deci-
sion making, preventing fraudulent activities, among other things.

Why is big data analytics important?


In today’s world, Big Data analytics is fueling everything we do online—in every
industry.
Take the music streaming platform Spotify for example. The company has nearly
96 million users that generate a tremendous amount of data every day. Through
this information, the cloud-based platform automatically generates suggested
songs—through a smart recommendation engine—based on likes, shares, search
history, and more. What enables this is the techniques, tools, and frameworks
that are a result of Big Data analytics.
If you are a Spotify user, then you must have come across the top recommendation section, which is based on your likes, past history, and other things. The recommendation engine leverages data filtering tools that collect data and then filter it using algorithms. This is what Spotify does.
But, let’s get back to the basics first.

What is Big Data?


Big Data is a massive amount of data sets that cannot be stored, processed, or
analyzed using traditional tools.

Today, there are millions of data sources that generate data at a very rapid rate.
These data sources are present across the world. Some of the largest sources
of data are social media platforms and networks. Let’s use Facebook as an
example—it generates more than 500 terabytes of data every day. This data
includes pictures, videos, messages, and more.
Data also exists in different formats, like structured data, semi-structured data,
and unstructured data. For example, in a regular Excel sheet, data is classified
as structured data—with a definite format. In contrast, emails fall under semi-
structured, and your pictures and videos fall under unstructured data. All this
data combined makes up Big Data.

Uses and Examples of Big Data Analytics


There are many different ways that Big Data analytics can
be used in order to improve businesses and organizations.
Here are some examples:
• Using analytics to understand customer behavior in order to
optimize the customer experience
• Predicting future trends in order to make better business deci-
sions
• Improving marketing campaigns by understanding what works
and what doesn’t
• Increasing operational efficiency by understanding where bottle-
necks are and how to fix them
• Detecting fraud and other forms of misuse sooner
These are just a few examples — the possibilities are really endless when it
comes to Big Data analytics. It all depends on how you want to use it in order
to improve your business.

History of Big Data Analytics


The history of Big Data analytics can be traced back to the early days of com-
puting, when organizations first began using computers to store and analyze
large amounts of data. However, it was not until the late 1990s and early 2000s
that Big Data analytics really began to take off, as organizations increasingly
turned to computers to help them make sense of the rapidly growing volumes
of data being generated by their businesses.
Today, Big Data analytics has become an essential tool for organizations of all
sizes across a wide range of industries. By harnessing the power of Big Data,
organizations are able to gain insights into their customers, their businesses, and
the world around them that were simply not possible before.

As the field of Big Data analytics continues to evolve, we can expect to see even
more amazing and transformative applications of this technology in the years
to come.

Benefits and Advantages of Big Data Analytics


1. Risk Management
Use Case: Banco de Oro, a Philippine banking company, uses Big Data analytics to identify fraudulent activities and discrepancies. The organization
leverages it to narrow down a list of suspects or root causes of problems.

2. Product Development and Innovations


Use Case: Rolls-Royce, one of the largest manufacturers of jet engines for airlines
and armed forces across the globe, uses Big Data analytics to analyze how
efficient the engine designs are and if there is any need for improvements.

3. Quicker and Better Decision Making Within Organizations


Use Case: Starbucks uses Big Data analytics to make strategic decisions. For
example, the company leverages it to decide if a particular location would be
suitable for a new outlet or not. They will analyze several different factors, such
as population, demographics, accessibility of the location, and more.

4. Improve Customer Experience


Use Case: Delta Air Lines uses Big Data analysis to improve customer experi-
ences. They monitor tweets to find out their customers’ experience regarding
their journeys, delays, and so on. The airline identifies negative tweets and does
what’s necessary to remedy the situation. By publicly addressing these issues
and offering solutions, it helps the airline build good customer relations.
The Lifecycle Phases of Big Data Analytics
Now, let's review how Big Data analytics works (a small illustrative sketch follows the list below):
• Stage 1 - Business case evaluation - The Big Data analytics
lifecycle begins with a business case, which defines the reason
and goal behind the analysis.
• Stage 2 - Identification of data - Here, a broad variety of data
sources are identified.

• Stage 3 - Data filtering - All of the identified data from the
previous stage is filtered here to remove corrupt data.
• Stage 4 - Data extraction - Data that is not compatible with the
tool is extracted and then transformed into a compatible form.
• Stage 5 - Data aggregation - In this stage, data with the same
fields across different datasets are integrated.
• Stage 6 - Data analysis - Data is evaluated using analytical and
statistical tools to discover useful information.
• Stage 7 - Visualization of data - With tools like Tableau, Power
BI, and QlikView, Big Data analysts can produce graphic visu-
alizations of the analysis.
• Stage 8 - Final analysis result - This is the last step of the Big
Data analytics lifecycle, where the final results of the analysis
are made available to business stakeholders who will take action.
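The toy sketch below mirrors a few of these stages (filtering, extraction, aggregation, analysis) on a small in-memory dataset; the field names and records are purely illustrative assumptions, not taken from any real source:

# Hypothetical raw records: some are corrupt (missing values) and must be filtered out.
raw_records = [
    {"region": "North", "sales": "1200"},
    {"region": "South", "sales": "950"},
    {"region": "North", "sales": None},      # corrupt record
    {"region": "South", "sales": "1100"},
]

# Stage 3: data filtering - drop corrupt records.
filtered = [r for r in raw_records if r["sales"] is not None]

# Stage 4: data extraction - transform values into a compatible form (strings -> integers).
extracted = [(r["region"], int(r["sales"])) for r in filtered]

# Stage 5: data aggregation - integrate records that share the same field.
totals = {}
for region, sales in extracted:
    totals[region] = totals.get(region, 0) + sales

# Stage 6: data analysis - derive a simple summary statistic.
best_region = max(totals, key=totals.get)
print(totals, "best:", best_region)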

Different Types of Big Data Analytics


Here are the four types of Big Data analytics:

1. Descriptive Analytics
This summarizes past data into a form that people can easily read. This helps
in creating reports, like a company’s revenue, profit, sales, and so on. Also, it
helps in the tabulation of social media metrics.

Use Case: The Dow Chemical Company analyzed its past data to increase facil-
ity utilization across its office and lab space. Using descriptive analytics, Dow
was able to identify underutilized space. This space consolidation helped the
company save nearly US $4 million annually.

2. Diagnostic Analytics
This is done to understand what caused a problem in the first place. Techniques
like drill-down, data mining, and data recovery are all examples. Organizations
use diagnostic analytics because they provide an in-depth insight into a
particular problem.

Use Case: An e-commerce company’s report shows that their sales have
gone down, although customers are adding products to their carts. This can
be due to various reasons like the form didn’t load correctly, the shipping fee
is too high, or there are not enough payment options available. This is where
you can use diagnostic analytics to find the reason.

3. Predictive Analytics
This type of analytics looks into the historical and present data to make pre-
dictions of the future. Predictive analytics uses data mining, AI, and machine
learning to analyze current data and make predictions about the future. It
works on predicting customer trends, market trends, and so on.

Use Case: PayPal determines what kind of precautions they have to take to protect their clients against fraudulent transactions. Using predictive analytics, the company uses all the historical payment data and user behavior data and builds an algorithm that predicts fraudulent activities.

4. Prescriptive Analytics
This type of analytics prescribes the solution to a particular problem. Prescriptive analytics works with both descriptive and predictive analytics. Most of the time, it relies on AI and machine learning.

Use Case: Prescriptive analytics can be used to maximize an airline's profit. This type of analytics is used to build an algorithm that will automatically adjust the flight fares based on numerous factors, including customer demand, weather, destination, holiday seasons, and oil prices.
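A toy, purely hypothetical sketch of such rule-based fare adjustment follows; the factors, weights, and numbers are invented for illustration and do not describe any real airline's pricing system:

def adjust_fare(base_fare, demand_ratio, is_holiday_season, oil_price_index):
    """Return an adjusted fare from a base fare and a few illustrative factors."""
    fare = base_fare
    fare *= 1.0 + 0.5 * (demand_ratio - 0.5)      # raise fares when demand is high
    if is_holiday_season:
        fare *= 1.15                               # holiday premium (assumed 15%)
    fare *= 1.0 + 0.10 * (oil_price_index - 1.0)   # pass on part of fuel-cost changes
    return round(fare, 2)

# Example: high demand, holiday season, oil 20% above its reference level.
print(adjust_fare(200.0, demand_ratio=0.9, is_holiday_season=True, oil_price_index=1.2))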
Big Data Analytics Tools
Here are some of the key big data analytics tools :
• Hadoop - helps in storing and analyzing data
• MongoDB - used on datasets that change frequently
• Talend - used for data integration and management
• Cassandra - a distributed database used to handle chunks of
data
• Spark - used for real-time processing and analyzing large
amounts of data
• STORM - an open-source real-time computational system
• Kafka - a distributed streaming platform that is used for fault-
tolerant storage
Big Data Industry Applications
Here are some of the sectors where Big Data is actively used:
• Ecommerce - Predicting customer trends and optimizing prices
are a few of the ways e-commerce uses Big Data analytics
• Marketing - Big Data analytics helps to drive high ROI market-
ing campaigns, which result in improved sales

• Education - Used to develop new and improve existing courses
based on market requirements
• Healthcare - With the help of a patient’s medical history, Big
Data analytics is used to predict how likely they are to have
health issues
• Media and entertainment - Used to understand the demand of
shows, movies, songs, and more to deliver a personalized recom-
mendation list to its users
• Banking - Customer income and spending patterns help to pre-
dict the likelihood of choosing various banking offers, like loans
and credit cards
• Telecommunications - Used to forecast network capacity and
improve customer experience
• Government - Big Data analytics helps governments in law en-
forcement, among other things

History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper published by Google.

Let's trace the history of Hadoop through the following steps:
1. In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch, an open source web crawler software project.
2. While working on Apache Nutch, they were dealing with big data. Storing that data was very costly, and this became a problem for the project. The problem became one of the important reasons for the emergence of Hadoop.
3. In 2003, Google introduced a file system known as GFS (Google File System), a proprietary distributed file system developed to provide efficient access to data.
4. In 2004, Google released a white paper on MapReduce. This technique simplifies data processing on large clusters.
5. In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS (Nutch Distributed File System). This file system also included MapReduce.
6. In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, he introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released in this year.
7. Doug Cutting named his project Hadoop after his son's toy elephant.
8. In 2007, Yahoo ran two clusters of 1,000 machines.
9. In 2008, Hadoop became the fastest system to sort 1 terabyte of data, doing so on a 900-node cluster within 209 seconds.
10. In 2013, Hadoop 2.2 was released.
11. In 2017, Hadoop 3.0 was released.

Year | Event
2003 | Google released the Google File System (GFS) paper.
2004 | Google released a white paper on MapReduce.
2006 | Hadoop introduced. Hadoop 0.1.0 released. Yahoo deploys 300 machines and reaches 600 machines within the year.
2007 | Yahoo runs 2 clusters of 1,000 machines. Hadoop includes HBase.
2008 | YARN JIRA opened. Hadoop becomes the fastest system to sort 1 terabyte of data on a 900-node cluster within 209 seconds. Yahoo clusters are loaded with 10 terabytes per day. Cloudera is founded as a Hadoop distributor.
2009 | Yahoo runs 17 clusters of 24,000 machines. Hadoop becomes capable of sorting a petabyte. MapReduce and HDFS become separate subprojects.
2010 | Hadoop adds support for Kerberos. Hadoop operates 4,000 nodes with 40 petabytes. Apache Hive and Pig released.
2011 | Apache ZooKeeper released. Yahoo has 42,000 Hadoop nodes and hundreds of petabytes of storage.
2012 | Apache Hadoop 1.0 released.
2013 | Apache Hadoop 2.2 released.
2014 | Apache Hadoop 2.6 released.
2015 | Apache Hadoop 2.7 released.
2017 | Apache Hadoop 3.0 released.

What is Apache Hadoop?


Apache Hadoop software is an open source framework that allows for the dis-
tributed storage and processing of large datasets across clusters of computers
using simple programming models. Hadoop is designed to scale up from a single
computer to thousands of clustered computers, with each machine offering local
computation and storage. In this way, Hadoop can efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data.
Hadoop defined
Hadoop is an open source framework based on Java that manages the storage
and processing of large amounts of data for applications. Hadoop uses dis-
tributed storage and parallel processing to handle big data and analytics jobs,
breaking workloads down into smaller workloads that can be run at the same
time.

Four modules comprise the primary Hadoop framework and work collectively
to form the Hadoop ecosystem:
Hadoop Distributed File System (HDFS): As the primary component of the
Hadoop ecosystem, HDFS is a distributed file system in which individual
Hadoop nodes operate on data that resides in their local storage. This removes
network latency, providing high-throughput access to application data. In
addition, administrators don’t need to define schemas up front.
Yet Another Resource Negotiator (YARN): YARN is a resource-management
platform responsible for managing compute resources in clusters and using them
to schedule users’ applications. It performs scheduling and resource allocation
across the Hadoop system.
MapReduce: MapReduce is a programming model for large-scale data process-
ing. In the MapReduce model, subsets of larger datasets and instructions for
processing the subsets are dispatched to multiple different nodes, where each
subset is processed by a node in parallel with other processing jobs. After
processing the results, individual subsets are combined into a smaller, more
manageable dataset.
Hadoop Common: Hadoop Common includes the libraries and utilities used and
shared by other Hadoop modules.
Beyond HDFS, YARN, and MapReduce, the entire Hadoop open source ecosys-
tem continues to grow and includes many tools and applications to help collect,
store, process, analyze, and manage big data. These include Apache Pig, Apache
Hive, Apache HBase, Apache Spark, Presto, and Apache Zeppelin.
How does Hadoop work?
Hadoop allows for the distribution of datasets across a cluster of commodity
hardware. Processing is performed in parallel on multiple servers simultane-
ously.
Software clients input data into Hadoop. HDFS handles metadata and the dis-
tributed file system. MapReduce then processes and converts the data. Finally,
YARN divides the jobs across the computing cluster.
All Hadoop modules are designed with a fundamental assumption that hardware
failures of individual machines or racks of machines are common and should be
automatically handled in software by the framework.
What are the benefits of Hadoop?
Scalability
Hadoop is important as one of the primary tools to store and process huge
amounts of data quickly. It does this by using a distributed computing model
which enables the fast processing of data that can be rapidly scaled by adding
computing nodes.

Low cost
As an open source framework that can run on commodity hardware and has
a large ecosystem of tools, Hadoop is a low-cost option for the storage and
management of big data.
Flexibility
Hadoop allows for flexibility in data storage, as data does not require preprocessing before being stored, which means that an organization can store as much data as it likes and then utilize it later.
Resilience
As a distributed computing model, Hadoop allows for fault tolerance and system resilience, meaning that if one of the hardware nodes fails, jobs are redirected to other nodes. Data stored on one Hadoop cluster is replicated across other nodes within the system to fortify against the possibility of hardware or software failure.
What are the challenges of Hadoop?
MapReduce complexity and limitations
As a file-intensive system, MapReduce can be a difficult tool to utilize for com-
plex jobs, such as interactive analytical tasks. MapReduce functions also need
to be written in Java and can require a steep learning curve. The MapReduce
ecosystem is quite large, with many components for different functions that can
make it difficult to determine what tools to use.
Security
Data sensitivity and protection can be issues as Hadoop handles such large
datasets. An ecosystem of tools for authentication, encryption, auditing, and
provisioning has emerged to help developers secure data in Hadoop.
Governance and management
Hadoop does not have many robust tools for data management and governance,
nor for data quality and standardization.
Talent gap
Like many areas of programming, Hadoop has an acknowledged talent gap.
Finding developers with the combined requisite skills in Java to program MapRe-
duce, operating systems, and hardware can be difficult. In addition, MapReduce
has a steep learning curve, making it hard to get new programmers up to speed
on its best practices and ecosystem.
Why is Hadoop important?
Research firm IDC estimated that 62.4 zettabytes of data were created or repli-
cated in 2020, driven by the Internet of Things, social media, edge computing,
and data created in the cloud. The firm forecasted that data growth from 2020
to 2025 was expected at 23% per year. While not all that data is saved (it is
either deleted after consumption or overwritten), the data needs of the world
continue to grow.
Hadoop tools
Hadoop has a large ecosystem of open source tools that can augment and extend
the capabilities of the core module. Some of the main software tools used with
Hadoop include:
Apache Hive: A data warehouse that allows programmers to work with data in
HDFS using a query language called HiveQL, which is similar to SQL
Apache HBase: An open source non-relational distributed database often paired
with Hadoop
Apache Pig: A tool used as an abstraction layer over MapReduce to analyze
large sets of data and enables functions like filter, sort, load, and join
Apache Impala: Open source, massively parallel processing SQL query engine
often used with Hadoop
Apache Sqoop: A command-line interface application for efficiently transferring
bulk data between relational databases and Hadoop
Apache ZooKeeper: An open source server that enables reliable distributed coordination in Hadoop; a service for "maintaining configuration information, naming, providing distributed synchronization, and providing group services"
Apache Oozie: A workflow scheduler for Hadoop jobs
What is Apache Hadoop used for?
Here are some common use cases for Apache Hadoop:
Analytics and big data
A wide variety of companies and organizations use Hadoop for research, pro-
duction data processing, and analytics that require processing terabytes or
petabytes of big data, storing diverse datasets, and data parallel processing.
Data storage and archiving
As Hadoop enables mass storage on commodity hardware, it is useful as a low-
cost storage option for all kinds of data, such as transactions, click streams, or
sensor and machine data.
Data lakes
Since Hadoop can help store data without preprocessing, it can be used to complement data lakes, where large amounts of unrefined data are stored.
Marketing analytics
Marketing departments often use Hadoop to store and analyze customer rela-
tionship management (CRM) data.

Risk management
Banks, insurance companies, and other financial services companies use Hadoop
to build risk analysis and management models.
AI and machine learning
Hadoop ecosystems help with the processing of data and model training opera-
tions for machine learning applications.
Data Analysis with Unix tools
To understand how to work with Unix tools on data, a weather dataset is used.
Weather sensors collect data continuously at numerous locations across the globe and accumulate an enormous volume of log data. This is a good candidate for analysis with MapReduce, because all of the data must be processed and the data is record-oriented and semi-structured.
The data used is from the National Climatic Data Center (NCDC). It is stored in a line-oriented ASCII format, in which each line is a record. The format supports a rich set of meteorological elements, many of which are optional or of variable length. For simplicity, we focus on the basic elements, such as temperature, which are always present and of fixed width.
Use of Unix
Now we will find the highest recorded global temperature in the dataset (for each year) using Unix.
The classic tool for processing line-oriented data is awk.
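Since awk itself is not covered in detail here, the rough Python sketch below shows the same idea of line-oriented processing: scan each fixed-width record, pull out the year and temperature, and keep the running maximum per year. The character offsets and quality-code check are assumptions about the NCDC record layout, so treat them as illustrative rather than authoritative:

import sys

def max_temp_per_year(lines):
    """Line-oriented scan: track the highest temperature seen for each year."""
    max_by_year = {}
    for line in lines:
        if len(line) < 93:
            continue                 # skip records too short for the assumed layout
        year = line[15:19]           # assumed offset of the year field
        temp = line[87:92]           # assumed offset of the air-temperature field
        quality = line[92:93]        # assumed offset of the quality code
        if temp != "+9999" and quality in "01459":   # skip missing or suspect readings
            t = int(temp)
            if year not in max_by_year or t > max_by_year[year]:
                max_by_year[year] = t
    return max_by_year

if __name__ == "__main__":
    # Usage (hypothetical file name): python max_temperature.py < 1901.txt
    for year, temp in sorted(max_temp_per_year(sys.stdin).items()):
        print(year, temp)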
Analyzing the Data with Unix Tools
To take advantage of the parallel processing that Hadoop provides, we need to
express our query as a MapReduce job. After some local, small-scale testing,
we will be able to run it on a cluster of machines.
Unix tools defined the modern computing landscape. Originally created in the 1960s, they still give users an avenue to solve many of the engineering and business analytics problems professionals face in today's fast-moving marketplace.
Although by the standards of shiny IDEs some may find the interface of these
tools arcane, their power for exploring and prototyping big data processing
workflows remains unmatched. Their versatility makes them the first choice
for obtaining a quick answer and the last resort for tackling difficult problems.
Compared to scripting languages, another great productivity booster, Unix tools uniquely allow an interactive, exploratory programming style, which is ideal for efficiently solving many of the engineering and business analytics problems that we face every day.
Natively available on all flavors of Unix-like operating systems, including
GNU/Linux and Mac OS X, the tools are nowadays also easy to install under
Windows.
While many Unix-like systems have come and gone over the years, there’s still
plenty of reasons why the original operating system has outlasted the competi-
tion.

Analyzing data with Hadoop


Hadoop is an open-source framework that provides distributed storage and pro-
cessing of large data sets. It consists of two main components: Hadoop Dis-
tributed File System (HDFS) and MapReduce. HDFS is a distributed file sys-
tem that allows data to be stored across multiple machines, while MapReduce
is a programming model that enables large-scale distributed data processing.
To analyze data with Hadoop, you first need to store your data in HDFS. This
can be done by using the Hadoop command line interface or through a web-
based graphical interface like Apache Ambari or Cloudera Manager.
Once your data is stored in HDFS, you can use MapReduce to perform dis-
tributed data processing. MapReduce breaks down the data processing into two
phases: the map phase and the reduce phase.
In the map phase, the input data is divided into smaller chunks and processed
independently by multiple mapper nodes in parallel. The output of the map
phase is a set of key-value pairs.
In the reduce phase, the key-value pairs produced by the map phase are aggre-
gated and processed by multiple reducer nodes in parallel. The output of the
reduce phase is typically a summary of the input data, such as a count or an
average.
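A minimal, purely local sketch of these two phases (not actual Hadoop code), using word count as the example: the map step emits (word, 1) key-value pairs and the reduce step aggregates the values for each key.

from collections import defaultdict

def map_phase(chunks):
    """Map: each input chunk is processed independently into (key, value) pairs."""
    pairs = []
    for chunk in chunks:
        for word in chunk.split():
            pairs.append((word.lower(), 1))
    return pairs

def reduce_phase(pairs):
    """Reduce: pairs with the same key are aggregated into a summary value."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

chunks = ["Hadoop stores data", "Hadoop processes data in parallel"]
print(reduce_phase(map_phase(chunks)))   # e.g. {'hadoop': 2, 'data': 2, ...}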

Hadoop also provides a number of other tools for analyzing data, including
Apache Hive, Apache Pig, and Apache Spark. These tools provide higher-level
abstractions that simplify the process of data analysis.
Apache Hive provides a SQL-like interface for querying data stored in HDFS. It
translates SQL queries into MapReduce jobs, making it easier for analysts who
are familiar with SQL to work with Hadoop.

Apache Pig is a high-level scripting language that enables users to write data
processing pipelines that are translated into MapReduce jobs. Pig provides a
simpler syntax than MapReduce, making it easier to write and maintain data
processing code.
Apache Spark is a distributed computing framework that provides a fast and
flexible way to process large amounts of data. It provides an API for work-
ing with data in various formats, including SQL, machine learning, and graph
processing.
In summary, Hadoop provides a powerful framework for analyzing large amounts
of data. By storing data in HDFS and using MapReduce or other tools like
Apache Hive, Apache Pig, or Apache Spark, you can perform distributed data
processing and gain insights from your data that would be difficult or impossible
to obtain using traditional data analysis tools.
Hadoop Streaming
Hadoop Streaming uses Unix standard streams as the interface between Hadoop and your program, so you can write MapReduce programs in any language that can read from standard input and write to standard output. Hadoop offers several mechanisms to support non-Java development.
The primary mechanisms are Hadoop Pipes, which gives a native C++ interface to Hadoop, and Hadoop Streaming, which permits any program that uses standard input and output to be used for map tasks and reduce tasks.
Features of Hadoop Streaming
Some of the key features associated with Hadoop Streaming are as follows:
1. Hadoop Streaming is a part of the Hadoop distribution.
2. It facilitates ease of writing MapReduce programs and code.
3. Hadoop Streaming supports almost all programming languages, such as Python, C++, Ruby, Perl, etc.
4. The Hadoop Streaming framework itself runs on Java; however, the mapper and reducer code may be written in different languages, as mentioned in the above point.
5. The Hadoop Streaming process uses Unix streams that act as an interface between Hadoop and the MapReduce programs.
6. Hadoop Streaming uses various streaming command options; the two mandatory ones are -input (directory name or file name) and -output (directory name).

The Hadoop Streaming architecture has eight key parts:
A. Input Reader/Format
B. Key Value
C. Mapper Stream
D. Key-Value Pairs
E. Reduce Stream
F. Output Format
G. Map External
H. Reduce External
The involvement of these components is discussed in detail when we explain how Hadoop Streaming works. To summarize the Hadoop Streaming architecture briefly: the starting point of the entire process is when the mapper reads the input value from the input reader/format. Once the input data is read, it is mapped by the mapper as per the logic given in the code. It then passes through the reducer stream, and the data is transferred to the output after aggregation is done. A more detailed description is given in the section below on the working of Hadoop Streaming.
How does Hadoop Streaming Work?
Input is read from standard input and the output is emitted to standard output
by Mapper and the Reducer. The utility creates a Map/Reduce job, submits
the job to an appropriate cluster, and monitors the progress of the job until
completion.
When a script is specified for mappers, every mapper task launches the script as a separate process when the mapper is initialized. Mapper task inputs are converted into lines and fed to the standard input of the process, and line-oriented outputs are collected from its standard output; every line is converted into a key/value pair, which is collected as the output of the mapper.
Similarly, when a script is specified for reducers, each reducer task launches the script as a separate process when the reducer is initialized. As the reducer task runs, its input key/value pairs are converted into lines and fed to the standard input (STDIN) of the process.
Each line of the line-oriented outputs is converted into a key/value pair after it
is collected from the standard output (STDOUT) of the process, which is then
collected as the output of the reducer.
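As an illustration, a word-count job could use a pair of small Python scripts as mapper and reducer. These are only a sketch of the streaming convention described above (read lines from STDIN, write tab-separated key/value lines to STDOUT); the script names are hypothetical, and they would typically be passed to the Hadoop Streaming JAR with the -mapper, -reducer, -input and -output options.

# mapper.py (hypothetical file name): reads raw lines from STDIN, emits "word<TAB>1"
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (hypothetical file name): receives lines sorted by key, sums the counts per word
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")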
Hadoop Ecosystem
Overview: Apache Hadoop is an open source framework intended to make interaction with big data easier. However, for those who are not acquainted with this technology, one question arises: what is big data? Big data is a term given to data sets which can't be processed in an efficient manner with the help of traditional methodology such as an RDBMS. Hadoop has made its place in the industries and companies that need to work on large data sets which are sensitive and need efficient handling. Hadoop is a framework that enables the processing of large data sets which reside in the form of clusters. Being a framework, Hadoop is made up of several modules that are supported by a large ecosystem of technologies.
Introduction: Hadoop Ecosystem is a platform or a suite which provides vari-
ous services to solve the big data problems. It includes Apache projects and
various commercial tools and solutions. There are four major elements of
Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop Common. Most of
the tools or solutions are used to supplement or support these major elements.
All these tools work collectively to provide services such as absorption, analysis,
storage and maintenance of data etc.
Following are the components that collectively form a Hadoop ecosystem:

• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: Programming based Data Processing
• Spark: In-Memory data processing
• PIG, HIVE: Query based processing of data services
• HBase: NoSQL Database
• Mahout, Spark MLLib: Machine Learning algorithm
libraries
• Solr, Lucene: Searching and Indexing
• Zookeeper: Managing cluster
• Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other
components too that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing: data. That's the beauty of Hadoop: it revolves around data, which makes processing it easier.
HDFS:
HDFS is the primary or major component of the Hadoop ecosystem and is responsible for storing large data sets of structured or unstructured data across various nodes, and for maintaining the metadata in the form of log files.
HDFS consists of two core components:
1. Name Node
2. Data Node
The Name Node is the prime node; it contains the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost effective.
HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
YARN:
Yet Another Resource Negotiator, as the name implies, is the component that helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
YARN consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and the Node Managers and performs negotiations as per the requirements of the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic over to the data and helps in writing applications which transform big data sets into manageable ones.
MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
Map() performs sorting and filtering of the data, thereby organizing it into groups. Map generates key-value pair results, which are later processed by the Reduce() method.
Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
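The idea can be mimicked with Python's own map() and functools.reduce() on a tiny, invented dataset; this is only an analogy to the function names used above, not Hadoop's actual implementation:

from functools import reduce
from itertools import groupby

records = ["north,5", "south,3", "north,7", "south,2"]   # illustrative (region, amount) lines

# Map(): filter/organize each record into a (key, value) pair.
pairs = list(map(lambda rec: (rec.split(",")[0], int(rec.split(",")[1])), records))

# Shuffle/sort: group pairs by key (Hadoop does this between the two functions).
pairs.sort(key=lambda kv: kv[0])
grouped = {key: [v for _, v in group] for key, group in groupby(pairs, key=lambda kv: kv[0])}

# Reduce(): combine each group's values into a smaller summary.
totals = {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}
print(totals)   # {'north': 12, 'south': 5}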
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.

It is a platform for structuring the data flow, processing and analyzing huge
data sets.
Pig does the work of executing commands, and in the background all the activities of MapReduce are taken care of. After the processing, Pig stores the result in HDFS.
Pig Latin language is specially designed for this framework which runs on Pig
Runtime. Just the way Java runs on the JVM.
Pig helps to achieve ease of programming and optimization and hence is a major
segment of the Hadoop Ecosystem.
HIVE:
With the help of SQL methodology and an SQL-like interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it supports both real-time and batch processing. Also, all the SQL data types are supported by Hive, making query processing easier.
Similar to other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line.
JDBC, along with ODBC drivers, establishes the data storage permissions and connection, whereas the HIVE command line helps in the processing of queries.
Mahout:
• Mahout allows machine learning capability to be added to a system or application. Machine learning, as the name suggests, helps a system to develop itself based on patterns, user/environment interaction, or algorithms.
• It provides various libraries or functionalities such as collabo-
rative filtering, clustering, and classification which are nothing
but concepts of Machine learning. It allows invoking algorithms
as per our need with the help of its own libraries.
Apache Spark:

• It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization.
• It consumes in-memory resources, and is hence faster than the prior approach in terms of optimization.

• Spark is best suited for real-time data whereas Hadoop is best
suited for structured data or batch processing, hence both are
used in most of the companies interchangeably.
Apache HBase:

• It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable, and is thus able to work on big data sets effectively.
• At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
Other Components: Apart from all of these, there are some other components
too that carry out a huge task in order to make Hadoop capable of processing
large datasets. They are as follows:

• Solr, Lucene: These are two services that perform the task of searching and indexing with the help of Java libraries. Lucene is based on Java and also provides a spell-check mechanism. Solr is built on top of Lucene.
• Zookeeper: Management of coordination and synchronization among the resources or components of Hadoop was a huge issue, often resulting in inconsistency. Zookeeper overcame these problems by performing synchronization, inter-component communication, grouping, and maintenance.
• Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
IBM Big Data Strategy :
• IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data strategy.
• The company offered solutions to store, manage, and analyze the huge amounts of data generated daily, and equipped large and small companies to make informed business decisions.
• The company believed that its Big Data and analytics products and ser-
vices would help its clients become more competitive and drive growth.
Issues :
• Understand the concept of Big Data and its importance to large, medium, and small companies in the current industry scenario.
• Understand the need for implementing a Big Data strategy and the various issues and challenges associated with this.
• Analyze the Big Data strategy of IBM.
• Explore ways in which IBM's Big Data strategy could be improved further.
Introduction to InfoSphere :
• InfoSphere Information Server provides a single platform for data integra-
tion and governance.
• The components in the suite combine to create a unified foundation for
enterprise information architectures, capable of scaling to meet any infor-
mation volume requirements.
• You can use the suite to deliver business results faster while maintaining
data quality and integrity throughout your information landscape.
• InfoSphere Information Server helps your business and IT personnel col-
laborate to understand the meaning, structure, and content of information
across a wide variety of sources.
• By using InfoSphere Information Server, your business can access and
use information in new ways to drive innovation, increase operational effi-
ciency, and lower risk.
BigInsights :
• BigInsights is a software platform for discovering, analyzing, and visualiz-
ing data from disparate sources.
• The flexible platform is built on an Apache Hadoop open-source framework
that runs in parallel on commonly available, low-cost hardware.
Big Sheets :
• BigSheets is a browser-based analytic tool included in the InfoSphere Bi-
gInsights Console that you use to break large amounts of unstructured
data into consumable, situation-specific business contexts.
• These deep insights help you to filter and manipulate data from sheets
even further.
