
 Introduction to Big Data

 What is Data?
Data refers to the quantities, characters, or symbols on which a computer
performs operations. It may be stored and transmitted as electrical signals and
recorded on magnetic, optical, or mechanical recording media.
 What is Big Data?
Big Data is data of enormous size: a collection of data that is huge in volume
and keeps growing exponentially over time. Such data is so large and complex
that no traditional data management tool can store or process it efficiently.

 “Extremely large data sets that may be analyzed computationally to reveal
patterns, trends, and associations, especially relating to human behavior and
interaction, are known as Big Data.”
 Social Media
Statistics show that 500+ terabytes of new data are ingested into the databases
of the social media site Facebook every day. This data is mainly generated
through photo and video uploads, message exchanges, comments, and so on.


 A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With
many thousands of flights per day, data generation reaches many petabytes.
 Characteristics Of Big Data

• The following are known as the “Big Data characteristics”:

1. Volume
2. Velocity
3. Variety
4. Veracity

1. Volume:
Volume means “how much data is generated”. Nowadays, organizations, people, and
systems generate or receive vast amounts of data, ranging from terabytes (TB) to
petabytes (PB) to exabytes (EB) and beyond.

2. Velocity:
Velocity means “how fast data is generated and must be processed”, for example
continuous streams of social media posts, sensor readings, or transaction logs.

3. Variety:
Variety means “how many different forms the data takes”, covering structured,
semi-structured, and unstructured formats such as tables, logs, text, images,
and video.

4. Veracity:
Veracity means “the quality, correctness, or accuracy of the captured data”.

Of the four Vs, veracity is the most important for any Big Data solution, because
without correct data there is no point in storing large amounts of it at high
speed and in different formats. The data must deliver correct business value.
The challenges in Big Data
 The challenges in Big Data are the real implementation hurdles. They require
immediate attention, because if they are not handled the technology may fail
and produce unwanted results. Big Data challenges include storing and analyzing
extremely large and fast-growing data.
 Some of the Big Data challenges are:
 Sharing and Accessing Data:

◦ Perhaps the most frequent challenge in Big Data efforts is the inaccessibility
of data sets from external sources.
◦ Sharing data can itself cause substantial difficulties, including the need for
inter- and intra-institutional legal agreements.
◦ Accessing data from public repositories brings further complications.
◦ Data must be available in an accurate, complete, and timely manner; otherwise
the company's information systems cannot be used to make accurate, timely
decisions.
 Privacy and Security:
◦ Privacy and security form another major challenge with Big Data, with
sensitive, conceptual, technical, and legal dimensions.
◦ Most organizations are unable to maintain regular checks because of the large
amounts of data being generated, yet security checks and monitoring should be
performed in real time, where they are most effective.
◦ Some personal information, when combined with large external data sets, can
reveal facts about a person that they consider private and would not want the
data owner to know.
◦ Some organizations collect information about people in order to add value to
their business, by deriving insights into people's lives that those people are
unaware of.
 Analytical Challenges:
◦ Big Data raises huge analytical challenges and questions, such as: how do we
deal with a problem when the data volume gets too large?
◦ How do we identify the important data points?
◦ How do we use the data to the best advantage?
◦ The data on which this analysis is performed can be structured (organized),
semi-structured (partly organized), or unstructured (unorganized). There are
two broad approaches to decision making:
 either incorporate the massive data volumes in the analysis, or
 determine upfront which Big Data is relevant.
 Technical challenges: Quality of data:
◦ Collecting and storing large amounts of data comes at a cost, yet big
companies, business leaders, and IT leaders always want large data storage.
◦ For better results and conclusions, Big Data focuses on storing quality data
rather than irrelevant data.
◦ This raises further questions: how can we ensure the data is relevant, how
much data is enough for decision making, and is the stored data accurate?
 Fault tolerance:
◦ Fault tolerance is another technical challenge: fault-tolerant computing is
extremely hard and involves intricate algorithms.
◦ Newer technologies such as cloud computing and Big Data aim to keep the damage
from any failure within an acceptable threshold, so that the whole task does
not have to start again from scratch.
 Scalability:
◦ Big Data projects can grow and evolve rapidly. The scalability problem of Big
Data has led towards cloud computing.
◦ Scaling raises challenges such as how to run and schedule the various jobs so
that each workload meets its goal cost-effectively.
◦ It also requires handling system failures efficiently, which again raises the
question of what kinds of storage devices should be used.
What are the big data technologies?

Big Data technologies are the software used for data mining, data storage, data
sharing, and data visualization. The term broadly covers the data itself
together with the frameworks, tools, and techniques used to investigate and
transform it.

Big Data technologies can be split into two categories:
 1. Operational Big Data Technologies:



 This refers to the data generated on a daily basis, such as online
transactions, social media activity, or any other data from a specific firm
that is later analyzed with Big Data software. It acts as the raw data that
feeds the Analytical Big Data Technologies.

 A few examples of Operational Big Data include details of executives in an
MNC, online trading and purchasing on Amazon, Flipkart, Walmart, etc., and
online ticket booking for movies, flights, railways, and more.

 2. Analytical Big Data Technologies:

 This refers to the more advanced side of Big Data technologies, somewhat more
complex than Operational Big Data. The actual analysis of massive data that is
crucial for business decisions falls under this category. Examples in this
domain include stock market analysis, weather forecasting, time-series
analysis, and medical health records.

Top Big Data Technologies

 Top big data technologies are divided into 4 fields which are classified as follows:
 Data Storage
 Data Mining
 Data Analytics
 Data Visualization
Big Data Technologies used in Data Storage
Big Data Technologies used in Data Mining
Big Data Technologies used in Data Analytics
BlockChain
 Blockchain is used in essential functions such as payment, escrow, and title
transfer; it can also reduce fraud, increase financial privacy, speed up
transactions, and internationalize markets.
 In a business network environment, blockchain can be used to achieve the
following:
 Shared ledger: an append-only, distributed system of records shared across
the business network.
 Smart contract: business terms are embedded in the transaction database and
executed together with transactions.
 Privacy: transactions are secure, authenticated, and verifiable, with
appropriate visibility.
 Consensus: all parties in the business network agree on network-verified
transactions.

 Developed by: Bitcoin


 Written in: JavaScript, C++, Python
 Current stable version: Blockchain 4.0
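
To make the shared ledger and consensus ideas concrete, below is a minimal, illustrative Java sketch (the class and field names are invented for this example and do not come from any particular blockchain library). Each block stores the hash of the previous block, so any party holding a copy of the ledger can recompute the hashes and detect tampering.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Illustrative shared-ledger sketch: a chain of hash-linked blocks.
public class SimpleLedger {

    static String sha256(String input) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(input.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    static class Block {
        final String previousHash; // link to the prior block makes the ledger tamper-evident
        final String transaction;  // payload, e.g. a payment record
        final String hash;         // hash over previousHash + transaction

        Block(String previousHash, String transaction) throws Exception {
            this.previousHash = previousHash;
            this.transaction = transaction;
            this.hash = sha256(previousHash + transaction);
        }
    }

    public static void main(String[] args) throws Exception {
        List<Block> ledger = new ArrayList<>();
        ledger.add(new Block("0", "genesis"));
        ledger.add(new Block(ledger.get(0).hash, "Alice pays Bob 10"));
        ledger.add(new Block(ledger.get(1).hash, "Bob pays Carol 4"));

        // Verification: editing an earlier transaction changes its hash and
        // breaks the link stored in the next block.
        for (int i = 1; i < ledger.size(); i++) {
            boolean linked = ledger.get(i).previousHash.equals(ledger.get(i - 1).hash);
            System.out.println("Block " + i + " linked correctly: " + linked);
        }
    }
}
```

A real business network adds the pieces listed above on top of this structure: smart-contract execution, access control for privacy, and a consensus protocol so that all parties agree on the same chain.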
Big Data Technologies used in Data Visualization
The History of Big Data
 The story begins in the 1960s and '70s, when the world of data was just
getting started with the first data centers and the development of the
relational database.
 Around 2005, people began to realize just how much data users
generated through Facebook, YouTube, and other online services.
Hadoop (an open-source framework created specifically to store and
analyze big data sets) was developed that same year. NoSQL also
began to gain popularity during this time.
 The development of open-source frameworks, such as Hadoop (and
more recently, Spark) was essential for the growth of big data because
they make big data easier to work with and cheaper to store. In the
years since then, the volume of big data has skyrocketed.
 Users are still generating huge amounts of data—but it’s not just
humans who are doing it. With the advent of the Internet of Things
(IoT), more objects and devices are connected to the internet,
gathering data on customer usage patterns and product performance.
 The emergence of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning.
Cloud computing has expanded big data possibilities even further. The
cloud offers truly elastic scalability, where developers can simply spin
up ad hoc clusters to test a subset of data.
 Apache Hadoop consists of two sub-projects –

 Hadoop MapReduce: MapReduce is a computational model and software framework
for writing applications that run on Hadoop. These MapReduce programs can
process enormous amounts of data in parallel on large clusters of computation
nodes.
 HDFS (Hadoop Distributed File System): HDFS takes care
of the storage part of Hadoop applications. MapReduce
applications consume data from HDFS. HDFS creates multiple
replicas of data blocks and distributes them on compute
nodes in a cluster. This distribution enables reliable and
extremely rapid computations.
 Although Hadoop is best known for MapReduce and its
distributed file system- HDFS, the term is also used for a
family of related projects that fall under the umbrella of
distributed computing and large-scale data processing. Other
Hadoop-related projects at Apache include Hive, HBase,
Mahout, Sqoop, Flume, and ZooKeeper.
Hadoop Architecture
 Hadoop has a Master-Slave Architecture for data storage and
distributed data processing using MapReduce and HDFS methods.
 NameNode:
 The NameNode represents every file and directory used in the HDFS namespace.
 DataNode:
 A DataNode manages the storage attached to an HDFS node and lets clients
interact with the data blocks.
 MasterNode:
 The master node lets you conduct parallel processing of data using Hadoop
MapReduce.
 Slave node:
 The slave nodes are the additional machines in the Hadoop cluster that store
data and carry out the computations. Each slave node runs a TaskTracker and a
DataNode, which synchronize their work with the JobTracker and the NameNode
respectively.
 In Hadoop, the master and slave systems can be set up in the cloud or
on-premises.
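
As a small illustration of this master/slave split, the sketch below uses the Hadoop FileSystem Java API; the NameNode host, port, and replication factor are placeholder values for a hypothetical deployment. The client only needs the master's (NameNode's) address, while the capacity and usage figures it prints are aggregated from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

// Sketch: connect to a cluster through its NameNode and print storage totals.
public class ClusterStatus {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // master (NameNode) address - placeholder
        conf.set("dfs.replication", "3");                      // copies of each block kept on DataNodes

        FileSystem fs = FileSystem.get(conf);
        FsStatus status = fs.getStatus();                      // figures reported by the DataNodes
        System.out.println("Capacity : " + status.getCapacity());
        System.out.println("Used     : " + status.getUsed());
        System.out.println("Remaining: " + status.getRemaining());
        fs.close();
    }
}
```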
Features Of ‘Hadoop’
• Suitable for Big Data Analysis

Because Big Data tends to be distributed and unstructured in nature, Hadoop
clusters are well suited to its analysis. Since it is the processing logic (not
the actual data) that flows to the computing nodes, less network bandwidth is
consumed. This concept, called data locality, helps increase the efficiency of
Hadoop-based applications.

• Scalability

Hadoop clusters can easily be scaled to any extent by adding additional cluster
nodes, which allows them to keep up with the growth of Big Data. Scaling does
not require any modifications to the application logic.

• Fault Tolerance

◦ The Hadoop ecosystem replicates the input data onto other cluster nodes.
That way, in the event of a cluster node failure, data processing can still
proceed using the data stored on another cluster node.
Hadoop Architecture
At its core, Hadoop has two major layers
namely −
1) Processing/Computation layer (MapReduce), and
2) Storage layer (Hadoop Distributed File System, HDFS).
MapReduce

• MapReduce is a parallel programming model for writing distributed
applications, devised at Google for the efficient processing of large amounts
of data (multi-terabyte data sets) on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner.
• MapReduce programs run on Hadoop, which is an Apache open-source framework.
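
As a concrete illustration of the model, here is a sketch of the classic word-count job written against the Hadoop Java MapReduce API; the input and output locations are assumed to be HDFS paths passed on the command line. The map phase emits (word, 1) pairs in parallel across the cluster, and the reduce phase sums the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Word count: map emits (word, 1); reduce sums the counts per word.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);                 // emit (word, 1)
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));     // emit (word, total)
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);        // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```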
Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is based on the Google File System
(GFS) and provides a distributed file system that is designed to run on commodity
hardware. It has many similarities with existing distributed file systems. However,
the differences from other distributed file systems are significant. It is highly fault-
tolerant and is designed to be deployed on low-cost hardware. It provides high
throughput access to application data and is suitable for applications having large
datasets.
HDFS splits files into blocks and distributes them across the nodes of a large
cluster. In case of a node failure, the system keeps operating, and the
required data transfer between nodes is handled by HDFS.
Apart from the above-mentioned two core components, Hadoop framework also
includes the following two modules −
Hadoop Common − These are Java libraries and utilities required by other
Hadoop modules.
Hadoop YARN − This is a framework for job scheduling and cluster resource
management.
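
To illustrate how an application consumes HDFS, here is a minimal sketch using the Hadoop FileSystem Java API; the file path is a placeholder, and the cluster address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration. Note that the client never addresses individual DataNodes: the NameNode resolves which replicas hold each block.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Write a small file into HDFS, then stream it back.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");      // placeholder path

        // Write: the file is split into blocks and replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("Hello, HDFS".getBytes(StandardCharsets.UTF_8));
        }

        // Read: blocks are streamed back from whichever replicas are available.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```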
How Does Hadoop Work?

• It is quite expensive to build bigger servers with heavy configurations that
handle large-scale processing. As an alternative, you can tie together many
commodity single-CPU computers into a single functional distributed system;
in practice the clustered machines can read the dataset in parallel and
provide much higher throughput than one high-end server, at lower cost. This
is the first motivation behind Hadoop: it runs across clustered, low-cost
machines.
• Hadoop runs code across a cluster of computers. This process includes the
following core tasks that Hadoop performs −
1) Data is initially divided into directories and files. Files are divided
into uniformly sized blocks of 128 MB or 64 MB (preferably 128 MB).
2) These files are then distributed across various cluster nodes for further
processing.
3) HDFS, sitting on top of the local file system, supervises the processing.
4) Blocks are replicated to handle hardware failure.
5) Hadoop checks that the code was executed successfully.
6) It performs the sort that takes place between the map and reduce stages.
7) It sends the sorted data to a certain computer.
8) It writes the debugging logs for each job.
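
A rough back-of-the-envelope sketch of the block split and replication described above, using an assumed 1 GB input file, a 128 MB block size, and a replication factor of 3 (illustrative figures, not fixed defaults for every Hadoop version):

```java
// Illustrative arithmetic only: how many blocks a file becomes, and the raw
// storage consumed once each block is replicated.
public class BlockLayout {
    public static void main(String[] args) {
        long fileSizeBytes = 1L * 1024 * 1024 * 1024;  // 1 GB input file (assumed)
        long blockSizeBytes = 128L * 1024 * 1024;      // 128 MB block size (assumed)
        int replicationFactor = 3;                     // copies of each block (assumed)

        long blocks = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
        long rawStorageMB = fileSizeBytes * replicationFactor / (1024 * 1024);

        System.out.println("Blocks stored        : " + blocks);               // 8
        System.out.println("Raw storage consumed : " + rawStorageMB + " MB"); // 3072 MB
    }
}
```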
Advantages of Hadoop
• The Hadoop framework allows the user to quickly write and test distributed
systems. It is efficient, and it automatically distributes the data and work
across the machines, exploiting the underlying parallelism of the CPU cores.
• Hadoop does not rely on hardware to provide fault tolerance and high
availability (FTHA); rather, the Hadoop library itself has been designed to
detect and handle failures at the application layer.
• Servers can be added to or removed from the cluster dynamically, and Hadoop
continues to operate without interruption.
• Another big advantage of Hadoop is that, apart from being open source, it is
compatible with all platforms since it is Java based.
• It is inexpensive, stores data reliably and immutably, tolerates faults,
scales well, is block structured, can process large amounts of data
simultaneously, and more.
Disadvantages of HDFS:

Its biggest disadvantage is that it is not well suited to small quantities of
data. It also has potential stability issues and can be restrictive and rough
around the edges.
 Hadoop also supports a wide range of software packages such as Apache Flume,
Apache Oozie, Apache HBase, Apache Sqoop, Apache Spark, Apache Storm, Apache
Pig, Apache Hive, Apache Phoenix, and Cloudera Impala.
Advantages and Disadvantages of Hadoop

 Advantages:
 Ability to store a large amount of data.
 High flexibility.
 Cost effective.
 High computational power.
 Tasks are independent.
 Linear scaling.
 Disadvantages:
 Not very effective for small data.
 Hard cluster management.
 Has stability issues.
 Security concerns.
Comparison with Other Systems (RDBMS)
 Seeking is the process of moving the disk’s head to a
particular place on the disk to read or write data. It
characterizes the latency of a disk operation, whereas the
transfer rate corresponds to a disk’s bandwidth.

 If the data access pattern is dominated by seeks, it will take longer to read
or write large portions of the dataset than streaming through it, which
operates at the transfer rate. On the other hand, for updating a small
proportion of records in a database, a traditional B-Tree (the data structure
used in relational databases, which is limited by the rate at which it can
perform seeks) works well. For updating the majority of a database, a B-Tree
is less efficient than MapReduce, which uses Sort/Merge to rebuild the
database.
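
A rough, illustrative calculation of why seek-dominated access loses to streaming, under assumed figures (10 ms average seek time, 100 MB/s transfer rate, 100-byte records, 1 GB dataset):

```java
// Illustrative arithmetic only; the hardware figures are assumptions.
public class SeekVsStream {
    public static void main(String[] args) {
        double seekTimeSec = 0.010;              // 10 ms per seek (assumed)
        double transferRateBytesPerSec = 100e6;  // 100 MB/s transfer rate (assumed)
        long datasetBytes = 1_000_000_000L;      // 1 GB dataset
        long recordBytes = 100;                  // record size (assumed)
        long records = datasetBytes / recordBytes;

        // Streaming the whole dataset at the transfer rate:
        double streamSec = datasetBytes / transferRateBytesPerSec;

        // Seeking individually to just 1% of the records:
        double seekSec = (records * 0.01) * seekTimeSec;

        System.out.printf("Stream entire dataset : %.1f s%n", streamSec); // ~10 s
        System.out.printf("Seek to 1%% of records: %.1f s%n", seekSec);   // ~1000 s
    }
}
```

With these numbers, streaming the whole 1 GB takes about 10 seconds, while seeking to only 1% of the records already takes about 1,000 seconds, which is why rewriting most of a database is better done with a sort/merge-style rebuild than with per-record seeks.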
 MapReduce is a good fit for problems that
need to analyze the whole dataset, in a
batch fashion, particularly for ad hoc
analysis. An RDBMS is good for point
queries or updates, where the dataset has
been indexed to deliver low-latency
retrieval and update times of a relatively
small amount of data. MapReduce suits
applications where the data is written once,
and read many times, whereas a relational
database is good for datasets that are
continually updated.
What is Hadoop?
 Apache Hadoop is an open source software framework used
to develop data processing applications which are executed
in a distributed computing environment.
 Applications built using HADOOP are run on large data sets
distributed across clusters of commodity computers.
Commodity computers are cheap and widely available. These
are mainly useful for achieving greater computational power
at low cost.
 Similar to data residing in the local file system of a personal computer, in
Hadoop data resides in a distributed file system called the Hadoop Distributed
File System. The processing model is based on the ‘Data Locality’ concept,
wherein computational logic is sent to the cluster nodes (servers) containing
the data. This computational logic is nothing but a compiled version of a
program written in a high-level language such as Java, which processes data
stored in Hadoop HDFS.
 Hadoop is an Apache open-source framework written in Java that allows
distributed processing of large datasets across clusters of computers using
simple programming models.
 The Hadoop framework application works in an environment that provides
distributed storage and computation across clusters of computers.
 Hadoop is designed to scale up from a single server to thousands of machines,
each offering local computation and storage.
