BDA Unit 1

Unit 1

INTRODUCTION
Introduction to Big Data
• Big data refers to collections of data sets so massive and
complex that they demand specialized data management
capabilities.
• Large amounts of heterogeneous digital data exist today.
• With the growth of technology, ever larger volumes of data
are produced.
• This data can be structured, semi-structured or
unstructured, and comes from different sources.
Introduction to Big Data
• Big data is about data volumes and large data sets
measured in terabytes or petabytes.
• Data analytics technologies and techniques analyze these
data sets and gather new information about the data.
• Big data analytics is a form of advanced analytics, which
involves complex applications with elements such as
predictive models and statistical algorithms.
• Traditional SQL queries and RDBMS systems cannot
work with big data, so a wide variety of scalable tools
and techniques has evolved to work with big data.
Introduction to Big Data
• Big Data also names the techniques used to store, process,
manage, analyze, and report on a huge amount of varied
data, at the required speed and within the required time to
allow real-time analysis.
• The need for big data first came from big companies like
Google and Facebook.
Evolution of Big Data

[Figure: Evolution of Big Data]
Types of Big Data
• Big data is classified into three types:
1. Structured Data
2. Unstructured Data
3. Semi-Structured Data
Types of Big Data
• The structure of the data determines how to work with
the data and also indicates what insights it can produce.
• All data goes through a process called extract,
transform, load (ETL) before it can be analyzed (a
minimal sketch of the idea follows below).
• The data is collected, formatted, converted to be
readable by an application, and then stored for use.
• The ETL process for each structure of data varies.
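To make the ETL idea concrete, here is a minimal, illustrative Java sketch; the records, field names, and the print-based "load" step are invented for the example, and a real pipeline would write to a warehouse or lake instead:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SimpleEtl {
        public static void main(String[] args) {
            // Extract: raw lines as they might arrive from a source system
            List<String> raw = List.of("alice,34", "bob,29");

            // Transform: parse each line into a structured record
            List<Map<String, String>> records = new ArrayList<>();
            for (String line : raw) {
                String[] parts = line.split(",");
                Map<String, String> rec = new HashMap<>();
                rec.put("name", parts[0]);
                rec.put("age", parts[1]);
                records.add(rec);
            }

            // Load: printed here; a real pipeline would persist the records
            records.forEach(System.out::println);
        }
    }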
Types of Big Data
1. Structured Data
• Any data that can be processed, accessed, and stored in a
fixed format is called structured data.
• Advances in software engineering have produced many
techniques for working with structured data.
• Structured data is the easiest to work with.
• It is highly organized, with dimensions defined by a set of
parameters.
• It is quantitative data such as age, billing details, contact
information, addresses, expenses, debit/credit card numbers, etc.
Types of Big Data
• Structured data is the easiest type of data to analyze
because it requires little preparation before processing;
a user only needs to cleanse the data and fit it to the relevant points.
• One of the major benefits of using structured data is the
streamlined process of merging enterprise data within a
relational database.
• The ETL process for structured data stores the finished
product in a data warehouse.
• These databases are highly structured and filtered for a
specific analytics purpose.
Types of Big Data
2. Unstructured Data
• This type of big data covers the whole multitude of
unstructured file formats,
 for example, image files, audio files, log files, and video files.
• Any data with an unfamiliar structure or model is
classed as unstructured data.
• Because the size of this data is huge, it poses distinct
difficulties for processing and for determining its value.
• The analytical processes needed for unstructured data take
time and effort to convert it into some level of readability.
• For unstructured data, the second phase of the ETL process
(transform) is the complicated one.
Types of Big Data
• It is difficult to analyze unstructured data and feed the
information extracted from it into an application.
• Doing so means translating it into some form of structured data.
• Methods like text parsing, natural language processing, and
developing content hierarchies are needed to do so.
• Unstructured data is placed in data lakes, which preserve the raw
format of the data and all of the information it holds.
• In warehouses, the data is limited to its defined schema; in
data lakes, the data is more flexible and comes in a variety of formats.
Types of Big Data
3. Semi-Structured Data
• Semi-structured data sits on the line between structured and
unstructured data.
• Most of the time, this translates to unstructured data with
metadata attached to it.
• The metadata can be inherent to the collected data, such as a
time, location, or device ID stamp or an email address, or it
can be a semantic tag attached to the data later.
• Semi-structured data bridges the gap between structured and
unstructured data.
• It can inform AI training and machine learning by
associating patterns with metadata.
• Semi-structured data has no set schema.
Characteristics of Big Data
• Since 1997, many attributes have been added to Big
Data.
• Among these attributes, three are the most popular and
have been widely cited and adopted.
• In 2001, analyst Doug Laney (then at META Group, which
Gartner later acquired) listed the 3 Vs of Big Data:
• Variety, Velocity, and Volume
Characteristics of Big Data
1) Variety
• Variety refers to the structured, unstructured, and
semi-structured data that is gathered from multiple sources.
• Today data comes in an array of forms such as emails, PDFs,
photos, videos, audio, social media posts, and much more.
• It signifies the variety of incompatible and inconsistent data
formats and data structures.
2) Velocity
• Velocity refers to the speed at which data is being
created in real time.
• It represents the pace at which data is generated by, and
used to support, interactions.
Characteristics of Big Data
3) Volume
• Volume refers to the incoming data stream and the
cumulative volume of data.
• Big Data implies huge volumes of data being
generated on a daily basis from various sources like
social media platforms, business processes, machines,
networks, human interactions, etc.
• Such large amounts of data are stored in data
warehouses.
Characteristics of Big Data
• IBM — 4Vs definition
• IBM added another attribute, or "V", for "Veracity" on
top of Doug Laney's 3Vs notation.
• This is known as the 4Vs of Big Data.
• It defines each "V" as follows:
• 1. Volume stands for the scale of data
• 2. Velocity denotes the analysis of streaming data
• 3. Variety indicates different forms of data
• 4. Veracity implies the uncertainty of data
Characteristics of Big Data
• Microsoft — 6Vs definition
• To maximize business value, Microsoft extended Doug
Laney's 3Vs attributes to 6 Vs by adding variability,
veracity, and visibility:
• 1. Volume stands for scale of data
• 2. Velocity denotes the analysis of streaming data
• 3. Variety indicates different forms of data
• 4. Veracity focuses on trustworthiness of data sources
Characteristics of Big Data
• Microsoft — 6Vs definition
• 5. Variability refers to the complexity of a data set. In
comparison with "Variety" (different data formats), it
means the number of variables in the data set.
• 6. Visibility emphasizes having a clear and full picture of the
data in order to make informed decisions.
Challenges of Big Data
1. Lack of proper understanding of Big Data
2. Storing huge sets of data properly is difficult
3. Need for synchronization across data sources
4. Constantly changing and updating data
5. Lack of data professionals
6. Data searching, sharing, and transferring is complicated
7. Securing huge sets of data is one of the most
overwhelming challenges
8. Finding and fixing data quality issues
9. Scaling big data systems efficiently and cost-effectively
Applications of Big Data
• Education Industry:- helps to analyze and study huge
amounts of data, which can be used to improve the
operational effectiveness and working of educational institutes.
• Healthcare Industry:- reduces the cost of treatment,
helps avoid preventable diseases by detecting them in the
early stages, and helps in deciding preventive measures.
• Government Sector:- helps in making faster and better-informed
decisions, identifying areas that need attention, and
overcoming national challenges such as unemployment,
terrorism, energy resources, etc.
• Media and Entertainment:- helps in predicting the interests
of audiences and in optimized or on-demand scheduling of
media streams.
Applications of Big Data
• Weather Patterns:- used in weather forecasting, in studying global
warming, in understanding the patterns of natural disasters,
and in making necessary preparations in case of crises.
• Transportation Industries:- helps in route planning,
congestion management and traffic control, and in increasing
the safety level of traffic.
• Banking Sector:- helps in detecting misuse of credit/debit
cards, managing credit risk, improving business clarity,
tracking changes in customer statistics, and detecting
money laundering.
Applications of Big Data
• Marketing:- helps to collect huge amounts of data and learn
the choices of millions of customers in a few seconds.
Analyzing the data helps marketers run campaigns, increase
click-through rates, place relevant advertisements, and improve
the product.
• Business Insights:- helps to solve many problems related
to profits, customer satisfaction, and product development.
• Space Sector:- helps to manage the data received from
satellites orbiting the earth, probes studying outer space,
and rovers on other planets, and analyzes it to run
simulations.
Enabling Technologies for Big Data
• The term big data is used for collections of data sets.
• These data sets are so large and complex that they are
difficult to process using traditional tools.
• A recent survey says that 80% of the data created in the
world is unstructured.
• Traditional tools are not able to handle data at such
a scale.
• One challenge, then, is simply to store and process this big data.
• To do so, enabling technologies and frameworks for
processing big data are needed.
Enabling Technologies for Big Data
• 1. Operational Big Data Technologies:
• These deal with the data generated on a daily basis,
such as online transactions, social media activity, or any sort
of data from a specific firm, used for analysis by software
based on big data technologies.
• This acts as raw data to feed the Analytical Big Data
Technologies.
• For example: Operational Big Data Technologies cover
executives' particulars in an MNC; online trading and
purchasing on Amazon, Flipkart, Walmart, etc.; online ticket
booking for movies, flights, railways, and many more.
Enabling Technologies for Big Data
• 2. Analytical Big Data Technologies:
• These refer to the more advanced adaptation of Big Data Technologies.
• The real investigation of massive data that is crucial for business
decisions comes under this part.
• For example: stock marketing, weather forecasting, time-series
analysis, and medical health records.
Enabling Technologies for Big Data
• In both types of big data technologies, the most commonly used
techniques that facilitate the practical use of big data are as
follows:
• 1) Predictive Analytics
• One of the prime tools for businesses to avoid risks in decision
making.
• Predictive analytics hardware and software solutions can be utilised for
the discovery, evaluation, and deployment of predictive scenarios by
processing big data.
• This data can help companies solve problems by analyzing and
understanding them.
• 2) NoSQL Databases
• These databases are utilised for reliable and efficient data management
across a scalable number of storage nodes.
• Rather than relational database tables, NoSQL databases store data as
JSON documents, key-value pairs, wide columns, or graphs.
Enabling Technologies for Big Data
• 3) Knowledge Discovery Tools
• These are tools that allow businesses to mine big data (structured
and unstructured) stored across multiple sources.
• These sources can be different file systems, APIs, DBMS or similar
platforms.
• With search and knowledge discovery tools, businesses can isolate
and utilise the information to their benefit.
• 4) Stream Analytics
• Organizations need to process data that may be stored on
multiple platforms and in multiple formats.
• Stream analytics software is useful for the filtering, aggregation, and
analysis of such big data.
• This software also allows connection to external data sources and
their integration into the application flow.
Enabling Technologies for Big Data
• 5) In-memory Data Fabric
• This technology helps distribute large quantities of data
across system resources such as dynamic RAM, flash storage, or
solid-state drives.
• This enables low-latency access and processing of big data on the
connected nodes.
• 6) Distributed Storage
• This technology provides a way to tolerate independent node
failures and the loss or corruption of big data sources.
• Distributed file stores contain replicated data; sometimes the data
is also replicated for low-latency, quick access on large computer
networks.
• These are generally non-relational databases.
Enabling Technologies for Big Data
• 7) Data Virtualization
• It enables applications to retrieve data without being constrained
by technical restrictions such as data formats, the physical location of
the data, etc.
• This technology is used by Apache Hadoop and other distributed
data stores for real-time or near real-time access to data stored on
various platforms.
• 8) Data Integration
• A key operational challenge for most organizations handling big
data is to process terabytes (or petabytes) of data in a way that can
be useful for customer deliverables.
• Data integration tools allow businesses to streamline data across a
number of big data solutions such as Amazon EMR, Apache Hive,
Apache Pig, Apache Spark, Hadoop, MapReduce, MongoDB and
Couchbase.
Enabling Technologies for Big Data
• 9) Data Preprocessing
• These software solutions are used to manipulate data into a
format that is consistent and can be used for further analysis.
• Data preparation tools accelerate the data sharing process by
formatting and cleansing unstructured data sets.
• A limitation of data preprocessing is that not all of its tasks can be
automated; they require human oversight, which can be tedious and
time-consuming.
• 10) Data Quality
• An important parameter for big data processing is data quality.
Data quality software can cleanse and enrich large data sets by
utilising parallel processing.
• These tools are widely used for getting consistent and reliable
outputs from big data processing.
Enabling Technologies for Big Data
Apart from the above-mentioned techniques, some other big data
enabling technologies are as follows:

• Apache Hadoop
• Hadoop Ecosystem
• HDFS Architecture
• YARN
• NoSQL
• Hive
• MapReduce
• Apache Spark
• ZooKeeper
• Cassandra
• HBase
• Spark Streaming
• Kafka
• Spark MLlib
• GraphX
Enabling Technologies for Big Data
• Apache Hadoop is the tool most commonly used for big
data computation.
• It is an open-source software framework for big data.
• The Hadoop framework was designed to store and process data
in a distributed data processing environment, on
commodity hardware, with a simple programming model.
• It can store and analyse data held on different
machines at high speed and low cost.
• In particular, Hadoop can process extremely large volumes of
data with varying structures (or no structure at all).
• Developed by: the Apache Software Foundation (Hadoop 1.0
was released in December 2011).
• Written in: Java
Enabling Technologies for Big Data
• Hadoop Ecosystem is used for big data computation.
• The Hadoop architecture comprises three layers.
1. Storage layer (HDFS)
2. Resource Management layer (YARN)
3. Processing layer (MapReduce)
Enabling Technologies for Big Data

[Figure: Hadoop Ecosystem]
Enabling Technologies for Big Data
• HDFS:-
• HDFS is one of the major components of Apache Hadoop.
• It is a distributed file system that handles large data sets
running on commodity hardware.
• It is used to scale a single Apache Hadoop cluster to
hundreds (and even thousands) of nodes.
• HDFS has been built to detect faults and automatically
recover quickly.
• It is designed for high data throughput rates.
• HDFS manages all the nodes and their corresponding
storage.
Enabling Technologies for Big Data
• HDFS:-
• HDFS is designed to be portable across multiple hardware
platforms and to be compatible with a variety of underlying
operating systems.
• HDFS supports applications that have data sets typically
ranging from gigabytes to terabytes in size.
• It provides high aggregate data bandwidth and can scale to
hundreds of nodes in a single cluster.
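As a rough illustration of how applications talk to HDFS, the sketch below uses the Hadoop FileSystem Java API; the NameNode address and file paths are assumptions for the example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally read from core-site.xml; hard-coded here for the example
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; its blocks are replicated across DataNodes
            fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                                 new Path("/user/demo/input.txt"));

            // List the target directory to confirm the upload
            for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
                System.out.println(status.getPath());
            }
            fs.close();
        }
    }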
Enabling Technologies for Big Data
• YARN:-(Yet Another Resource Negotiator)
• It is one of Apache Hadoop's core components.
• Apache Hadoop YARN is the resource management and job
scheduling technology in the open
source Hadoop distributed processing framework.
• YARN is responsible for allocating system resources to the
various applications running in a Hadoop cluster and
scheduling tasks to be executed on different cluster nodes.
• YARN can dynamically allocate resources to applications as
needed.
Enabling Technologies for Big Data
• MapReduce:-
• MapReduce is a programming model for distributed computing,
based on Java.
• The MapReduce algorithm contains two important tasks:
Map and Reduce.
• Map takes a set of data, converts it into another set of
data, and breaks the individual elements into tuples
(key/value pairs).
• Reduce takes the output of a map as its input and
combines those data tuples into a smaller set of tuples.
• As the name MapReduce implies, the reduce task is
always performed after the map task.
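The canonical word-count example shows the two tasks in code. This sketch uses the standard Hadoop MapReduce Java API; the job driver that wires the classes together and sets input/output paths is omitted for brevity:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit a (word, 1) pair for every word in the input line
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts emitted for each word
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values,
                                  Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }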
Enabling Technologies for Big Data
• Apache Hive is an open-source data warehousing tool for
performing distributed processing and data analysis.
• It was developed by Facebook to reduce the work of writing
Java MapReduce programs.
• Apache Hive uses the Hive Query Language (HiveQL), a
declarative language similar to SQL.
• Hive translates HiveQL queries into MapReduce programs.
• Hive supports applications written in languages such as
Python, Java, C++, and Ruby, which use JDBC, ODBC, and Thrift
drivers to run queries against Hive.
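As a hedged sketch of the JDBC route mentioned above, the Java snippet below runs a HiveQL query against HiveServer2; the host, credentials, and sales table are assumptions, and the hive-jdbc driver must be on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer2 conventionally listens on port 10000
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "user", "");
            Statement stmt = conn.createStatement();
            // HiveQL looks like SQL; Hive compiles it into MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT category, COUNT(*) FROM sales GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
            conn.close();
        }
    }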
Enabling Technologies for Big Data
• Apache Pig is a data flow language.
• It is an abstraction over MapReduce.
• It is a tool used to analyze large data sets by
representing them as data flows.
• All data manipulation operations in Hadoop can be
performed using Apache Pig.
• To write data analysis programs, Pig provides a high-level
language known as Pig Latin.
• To analyze data using Apache Pig, programmers need to
write scripts using Pig Latin language.
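A hedged sketch of a Pig Latin word count follows, embedded in Java via Pig's PigServer API so the script can be driven programmatically; the file paths are assumptions, and the registerQuery strings are the actual Pig Latin:

    import org.apache.pig.PigServer;

    public class PigWordCount {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer("local"); // use "mapreduce" on a cluster
            // Each registerQuery call adds one Pig Latin statement to the data flow
            pig.registerQuery("lines = LOAD 'input.txt' AS (line:chararray);");
            pig.registerQuery(
                "words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
            pig.registerQuery("grouped = GROUP words BY word;");
            pig.registerQuery(
                "counts = FOREACH grouped GENERATE group, COUNT(words);");
            pig.store("counts", "word_counts"); // triggers execution of the flow
            pig.shutdown();
        }
    }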
Enabling Technologies for Big Data
• HBase stands for Hadoop Database.
• HBase is a Java-based NoSQL (Not Only SQL) database that
runs on top of Hadoop.
• Data is stored in table format in HDFS.
• HBase stores data as key/value pairs.
• HBase is flexible and convenient for multiple reads and
writes of data stored in HDFS.
• HBase is a data model similar to Google's Bigtable,
designed to provide quick random access to huge amounts
of structured data.
• It provides random real-time read/write access to data in
the Hadoop File System.
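The sketch below shows this key/value style of access through the HBase Java client; the table name, column family, and row key are assumptions, and the table is presumed to already exist:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyValue {
        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(
                         HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("users"))) {
                // Write: row key "user1", column family "info", qualifier "city"
                Put put = new Put(Bytes.toBytes("user1"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                              Bytes.toBytes("Pune"));
                table.put(put);

                // Random real-time read by row key, the pattern HBase targets
                Result result = table.get(new Get(Bytes.toBytes("user1")));
                System.out.println(Bytes.toString(result.getValue(
                        Bytes.toBytes("info"), Bytes.toBytes("city"))));
            }
        }
    }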
Enabling Technologies for Big Data
• A NoSQL database (sometimes called Not Only SQL) is a
database that provides a mechanism to store and retrieve data
modeled in ways other than the tabular relations used in
relational databases.
• These databases are schema-free, support easy replication,
have a simple API, and are typically eventually consistent
rather than strictly consistent.
• They can handle huge amounts of data.
• They are mainly used for:-
• simplicity of design
• horizontal scaling
• easy availability.
Enabling Technologies for Big Data
• Cassandra is a distributed database from Apache that is
highly scalable and designed to manage very large amounts
of structured data.
• It is a type of NoSQL, column-oriented database.
• It is highly scalable, with tunable consistency.
• It is used to provide high availability with no single
point of failure.
• Cassandra accommodates all possible data formats,
including structured, semi-structured, and unstructured.
• It can dynamically accommodate changes to data
structures according to the needs of the application.
• Its writes are atomic, isolated, and durable at the row
level, but consistency is tunable per operation rather than
offering the full ACID transactions of an RDBMS.
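A minimal sketch using the DataStax Java driver for Cassandra; the keyspace and table are assumptions and are presumed to already exist, and the driver's contact points come from its default configuration (localhost:9042):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;
    import com.datastax.oss.driver.api.core.cql.Row;

    public class CassandraDemo {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // CQL resembles SQL, but data is distributed by partition key
                session.execute(
                    "INSERT INTO shop.users (id, name) VALUES (1, 'alice')");
                ResultSet rs = session.execute(
                    "SELECT name FROM shop.users WHERE id = 1");
                Row row = rs.one();
                System.out.println(row.getString("name"));
            }
        }
    }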
Enabling Technologies for Big Data
• MongoDB is an open-source document database and a leading
NoSQL database, written in C++.
• MongoDB is a cross-platform, document-oriented database.
• It provides high performance, high availability, and easy
scalability.
• MongoDB works on the concepts of collections and documents.
• A single MongoDB server typically has multiple databases.
• Collection is a group of MongoDB documents. It is the equivalent
of an RDBMS table.
• A document is a set of key-value pairs. Documents have dynamic
schema.
• MongoDB handles a wide variety of data types flexibly, at
high volumes, and across distributed architectures.
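The sketch below, using the MongoDB Java sync driver, shows the collection/document model in practice; the database, collection, and field names are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class MongoDemo {
        public static void main(String[] args) {
            try (MongoClient client =
                         MongoClients.create("mongodb://localhost:27017")) {
                // A collection is the MongoDB equivalent of an RDBMS table
                MongoCollection<Document> users =
                        client.getDatabase("shop").getCollection("users");

                // Documents are sets of key-value pairs with a dynamic schema
                users.insertOne(new Document("name", "alice").append("age", 34));

                Document found = users.find(new Document("name", "alice")).first();
                System.out.println(found.toJson());
            }
        }
    }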
Enabling Technologies for Big Data
• Apache ZooKeeper is a service used by a cluster (group of nodes)
to coordinate between themselves and maintain shared data with
robust synchronization techniques.
• ZooKeeper is itself a distributed application and also
provides services for writing distributed applications.
• It has simple architecture and API.
• ZooKeeper allows developers to focus on core application logic
without thinking about the distributed nature of the application.
• The ZooKeeper framework was originally built at Yahoo for
accessing their applications in an easy and robust way.
• ZooKeeper resolves race conditions and deadlocks using a
fail-safe synchronization approach, and handles data
inconsistency with atomicity.
• It provides mutual exclusion and cooperation between server
processes, which helps HBase with configuration management.
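As a hedged sketch of the coordination idea, the snippet below uses the ZooKeeper Java client to publish and read back a small piece of shared configuration; the znode path and ensemble address are assumptions, and a production client would wait for the connection event before issuing requests:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZkConfigDemo {
        public static void main(String[] args) throws Exception {
            // Connect to the ensemble; the watcher ignores events for brevity
            ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {});

            // Create a znode holding a shared configuration value
            zk.create("/demo-config", "v1".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any node in the cluster can now read the same coordinated value
            byte[] data = zk.getData("/demo-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }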
Big Data Stack

[Figure: Big Data Stack]
• The Data Layer:-
• At the bottom of the stack are technologies that store
masses of raw data.
• The data comes from traditional sources like OLTP
databases, and less structured sources like log files,
sensors, web analytics, document and media archives.
Big Data Stack
• Data Storage Systems:-
• Following are some examples of data storage systems:
• Hadoop HDFS—the classic big data file system. It became
popular due to its robustness and limitless scale on
commodity hardware.
• Amazon S3—create buckets and load data using a variety of
integrations.
• MongoDB—a mature open-source document database,
built to handle data at scale with proven
performance.
• Cassandra—a distributed database from Apache that is
highly scalable and designed to manage very large amounts
of structured data.
Big Data Stack
• The Data Integration Layer:-
• To create a big data store, the data must be imported from
its original sources into the data layer.
• In many applications, data needs to be ingested into
specialized tools, such as data warehouses.
• This needs a data pipeline.
• To do this, a rich ecosystem of big data integration tools,
including powerful open source integration tools, is needed.
• These tools need to pull the data from sources, transform it,
and load it to a target system.
Big Data Stack
• Big Data Ingestion Tools
• Stitch—a lightweight ETL (Extract, Transform, Load) tool
which pulls data from multiple pre-integrated data sources,
transforms and cleans it as necessary.
• Stitch is easy to set up, seamless, and integrates multiple
sources of data.
• Blendo—a cloud data integration tool that lets you connect
data sources with a few clicks and pipe them to a destination server.
• Blendo provides schemas and optimization for email
marketing, eCommerce, and other big data use cases.
Big Data Stack
• Big Data Ingestion Tools
• Apache Kafka—an open-source streaming
messaging bus that can create a feed from your data
sources, partition the data, and stream it to passive
listeners.
• Apache Kafka is a powerful solution used in
production on data at huge scale.
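A minimal producer sketch with Kafka's Java client; the broker address and topic name are assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KafkaFeed {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // Records with the same key land in the same partition, preserving order
            try (KafkaProducer<String, String> producer =
                         new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("clicks", "user1", "page=/home"));
            } // close() flushes any buffered records
        }
    }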
Big Data Stack
• The Data Processing Layer:-
• At this layer, data arrives at its destination.
• Now a technology is needed that can crunch the data to help
data analysis.
• The data processing layer should optimize the data to
facilitate more efficient analysis, and provide a compute
engine to run the queries.
• Data warehouse tools are optimal for processing data at
large scale, while a data lake is more appropriate for storage;
a lake relies on other technologies when its data needs to be
processed and analyzed.
Big Data Stack
• Data Processing Tools
• Following are examples of data processing tools (see the
sketch after this list).
• Apache Spark—similar to MapReduce but faster.
• Runs parallelized queries on unstructured, distributed data
in Hadoop. Spark also provides a SQL interface, but is not a
SQL engine.
• PostgreSQL—used to pipeline the data to facilitate
queries. PostgreSQL can be scaled by partitioning the data,
and it is very reliable.
• Amazon Redshift—a cloud-based data warehouse that
offers huge query processing speeds and can also be used as
a relational database.
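As promised above, here is a hedged sketch of Spark's Java and SQL interfaces; the input file and its fields are assumptions, and local[*] runs Spark on all local cores for testing:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SparkQuery {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("demo").master("local[*]").getOrCreate();

            // Load a dataset and query it through Spark's SQL interface
            Dataset<Row> sales = spark.read().json("sales.json");
            sales.createOrReplaceTempView("sales");
            spark.sql("SELECT category, SUM(amount) FROM sales GROUP BY category")
                 .show();
            spark.stop();
        }
    }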
Big Data Stack
• Data Analytics & BI Layer :-
• The data layer collected the raw materials for the analysis,
the integration layer mixed them all together, and the data
processing layer optimized and organized the data and executed
the queries.
• The analytics & BI layer is the application layer which, with
the help of that data, enables data-driven decisions.
• The technology in this layer helps run queries to
answer questions, slice and dice the data, build dashboards,
create visualizations, etc.
Big Data Stack
• Data Analytics Tools
• Tableau—a powerful BI and data visualization tool that
connects to the data and allows users to perform complex
analysis and build charts and dashboards.
• Chartio—a cloud BI service that allows you to connect data
sources, explore data, build SQL queries, transform the data as
needed, and create live, auto-refreshing dashboards.
• Looker—a cloud-based BI platform that allows you to query and
analyze large data sets via SQL, set up visualizations, and
define metrics that elaborate on the data.
Hadoop Distributions

Apache Hadoop (hadoop.apache.org): completely free and open source
• The Hadoop source; the oldest distro
• No packaging except TAR balls; no extra tools

Cloudera (www.cloudera.com): free / premium model (depending on cluster size)
• Very polished
• Comes with good tools to install and manage a Hadoop cluster

HortonWorks (www.hortonworks.com): completely open source
• Newer distro; tracks Apache Hadoop closely
• Comes with tools to manage and administer a cluster

MapR (www.mapr.com): free / premium model
• Has its own file system (an alternative to HDFS) and boasts higher performance
• Nice set of tools to manage and administer a cluster
• Does not suffer from a single point of failure
• Offers features such as mirroring, snapshots, etc.

Intel (hadoop.intel.com): premium
• Encryption support
• Hardware acceleration added to some layers of the stack to boost performance
• Admin tools to deploy and manage Hadoop

Pivotal HD (gopivotal.com): premium
• Fast SQL on Hadoop
• Software only or appliance
