
Big Data Analytics

Dr. U. Vinay Kumar


Associate Professor, FCE
Poornima University
Unit 1:
Introduction to Big Data and
Hadoop
Introduction to Big Data
• Big Data refers to the immense
volume of structured and
unstructured data generated by
various sources at an
unprecedented speed.
• This data comes from a wide range
of channels, including social media,
sensors, mobile devices, business
transactions, and more.
Introduction to Big Data
• The concept of Big Data is characterized by three primary
dimensions often referred to as the "3Vs":
1. Volume
2. Velocity
3. Variety
Big Data Characteristics: Volume
• Big Data involves large
amounts of data.
• Traditional data
management tools may
struggle to process and store
such massive volumes.
• The sheer size of the data
sets is a key aspect of what
makes it "big."
Big Data Characteristics: Volume
➢Walmart handles 1 million customer transactions per hour.
➢Facebook and Instagram add 10 PB of new data every day.
➢A single flight generates about 1 PB of data over a 2-4 hour flight.
➢More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.
Big Data Characteristics: Velocity
• Data is generated at an
incredibly high speed in real-
time or near-real-time.
• Examples include social
media posts, online
transactions, and sensor data.
• The ability to handle and
process data at this pace is
crucial for extracting
meaningful insights.
Big Data Characteristics: Variety
• Big Data comes in various
formats and types, including
structured data (such as
databases and tables),
unstructured data (like text and
images), and semi-structured
data (such as XML or JSON
files).
• Managing and analyzing
diverse data types is a
significant challenge in the
realm of Big Data.
Big Data Characteristics
• Additionally, two more Vs are sometimes added to the
definition:
➢Variability:
This refers to inconsistency in the data flow: data arrival rates and loads can be unpredictable and can vary over time.
➢Veracity:
Veracity deals with the quality of the data. With the vast amount of data
being generated, there is often uncertainty about its accuracy and
reliability.
Sources of Big Data
An example of Big Data
• Real-time traffic information
Challenges and Opportunities of Big Data
Challenges:
• Storage: Managing and storing large volumes of data efficiently.
• Processing: Analyzing and processing data quickly to derive meaningful insights.
• Analysis: Extracting relevant information from diverse and complex data sets.
• Privacy and Security: Ensuring the confidentiality and protection of sensitive data.
Opportunities:
• Innovation: Big Data analytics can lead to innovative solutions, products, and services.
• Efficiency: Improved decision-making and operational efficiency through data-driven insights.
• Competitive Advantage: Organizations can gain a competitive edge by harnessing Big Data effectively.
What's driving Big Data?
Tools and Technologies
• Numerous tools and
technologies have emerged to
handle the challenges posed by
Big Data.
• These include distributed
storage systems like Hadoop,
in-memory processing
frameworks like Apache Spark,
and various data analytics and
machine learning tools.
Introduction to Hadoop
• Hadoop is an open-source
framework for distributed storage
and processing of large sets of data
using a cluster of commodity
hardware.
• It is designed to scale from single
servers to thousands of machines,
offering a cost-effective solution for
managing and analyzing massive
amounts of data.
• The project is inspired by Google's
MapReduce and Google File
System (GFS) papers.
Introduction to Hadoop
• Key components of Hadoop include:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)
4. Hadoop Ecosystem
Introduction to Hadoop: HDFS
• HDFS is a distributed file system
that provides high-throughput
access to data.
• It breaks large files into smaller
blocks (typically 128 MB or
256 MB) and distributes them
across the nodes in a Hadoop
cluster.
• HDFS is fault-tolerant, with data
replication across multiple
nodes to ensure data durability.
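As a rough back-of-the-envelope sketch (assuming the common defaults of a 128 MB block size and a replication factor of 3, both of which are configurable), the following Python snippet shows how many blocks and how much raw storage a single file occupies in HDFS:

# Illustration only: blocks and raw storage for one file in HDFS,
# assuming a 128 MB block size and a replication factor of 3 (common, configurable defaults).
import math

file_size_mb = 1024        # a 1 GB file
block_size_mb = 128        # HDFS block size
replication_factor = 3     # number of copies kept of each block

num_blocks = math.ceil(file_size_mb / block_size_mb)   # 8 blocks of up to 128 MB each
raw_storage_mb = file_size_mb * replication_factor     # 3072 MB stored across the cluster

print(f"Blocks: {num_blocks}, raw storage used: {raw_storage_mb} MB")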
Introduction to Hadoop: MapReduce
• MapReduce is a programming
model and processing engine for
parallel and distributed data
processing.
• It consists of two main steps: the Map phase, where input data is transformed into intermediate key-value pairs, and the Reduce phase, where the values for each key are grouped and aggregated into the final results (a small sketch follows below).
• MapReduce allows for scalable and
efficient processing of large
datasets across a Hadoop cluster.
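The sketch below imitates that flow in a single Python process for a word count, purely to make the Map, shuffle, and Reduce steps concrete; a real MapReduce job distributes the same logic across the nodes of a Hadoop cluster.

# A toy, single-process imitation of the MapReduce word-count flow.
from collections import defaultdict

lines = ["big data needs big tools", "hadoop processes big data"]

# Map phase: emit (word, 1) pairs for every word in every input line.
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle: group all values by key (done automatically by the framework).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
result = {word: sum(counts) for word, counts in grouped.items()}
print(result)   # {'big': 3, 'data': 2, 'needs': 1, 'tools': 1, 'hadoop': 1, 'processes': 1}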
Introduction to Hadoop: YARN
• YARN is a resource
management layer in Hadoop
that manages and schedules
resources across the cluster.
• It allows multiple applications
to share resources effectively,
enabling more flexible and
dynamic allocation of
resources.
Introduction to Hadoop: Ecosystem
• Hadoop has a rich ecosystem
of related projects and tools
that extend its capabilities.
• Some notable components
include Apache Hive (data
warehouse infrastructure),
Apache Pig, Apache HBase
(distributed, scalable, and big
data store), Apache Spark and
more.
Introduction to Hadoop
• Hadoop is widely used in various industries for processing
and analyzing large datasets, including log files, social media
data, sensor data, and more.
• Its ability to handle massive amounts of data across a
distributed cluster makes it a crucial tool for organizations
seeking to gain insights and make informed decisions based
on their data.
Types of Digital Data
• Digital data refers to information
that is stored and transmitted in a
form that is composed of discrete
elements.
• There are various types of digital
data, and they can be broadly
categorized based on their
formats, structures, and
characteristics.
Types of Digital Data: Text Data
• Plain Text: Unformatted text
without any styling or
formatting.
• Rich Text Format (RTF): Text
with formatting options such as
bold, italic, and font changes.
• HTML (Hypertext Markup
Language): Used for creating
and structuring web content.
Types of Digital Data: Numeric Data
• Integers: Whole numbers
without decimal points.
• Floating-Point Numbers:
Numbers with decimal
points or in scientific
notation.
• Complex Numbers:
Numbers with both real and
imaginary parts.
Types of Digital Data: Audio Data
• Digital Audio: Recorded or
synthesized sound stored in
digital format (e.g., MP3,
WAV).
• Speech Data: Transcribed or
recorded human speech.
Types of Digital Data: Image Data
• Bitmap Images: Pixel-
based images (e.g., JPEG,
PNG, BMP).
• Vector Images: Graphics
represented by
mathematical equations
(e.g., SVG).
Types of Digital Data: Video Data
• Digital Video: Sequences of
images presented in a rapid
succession (e.g., MP4, AVI).
• Streaming Video: Video
content transmitted in real-
time over the internet.
Types of Digital Data: Binary Data
• Executable Files: Programs
and applications in binary
format (e.g., EXE, ELF).
• Binary Code: Machine code
instructions for computer
processors.
Types of Digital Data: Geospatial Data
• Geographic Information
System (GIS) Data:
Information related to
geographic locations.
• Global Positioning System
(GPS) Data: Location data
obtained from GPS devices.
Types of Digital Data: Meta Data
• Descriptive Metadata:
Information describing other
data (e.g., file size, creation
date).
• Structural Metadata:
Information about the
structure and relationships
within data.
Types of Digital Data: Sensor Data
• Environmental Sensor
Data: Information collected
from sensors measuring
environmental parameters.
• Biometric Data: Data
related to physiological or
behavioral characteristics
(e.g., fingerprints, heart
rate).
Types of Digital Data: Social Media Data
• Text Posts: Messages,
tweets, or status updates.
• Multimedia Posts: Images,
videos, and audio shared on
social media platforms.
Types of Digital Data: Machine Learning Data
• Training Data: Examples
used to train machine
learning models.
• Testing Data: Examples
used to evaluate the
performance of machine
learning models.
Relationships and Representations
Relationships in Big Data
1. Inter-Data Relationships
2. Temporal Relationships
3. Graph Relationships
Inter-Data Relationships
• Big data often involves diverse
datasets that may have complex
relationships.
• Understanding how different
datasets relate to each other is
crucial for deriving meaningful
insights.
• For example, in a retail setting,
you might explore the relationship
between customer demographics
and purchasing behavior to target
specific market segments.
Temporal Relationships
• Many big data applications
involve time-series data.
Analyzing temporal
relationships helps uncover
patterns and trends over time.
• This is valuable in various
domains, such as finance (stock
market trends), healthcare
(patient monitoring), and
manufacturing (predictive
maintenance).
Graph Relationships
• Some big data scenarios involve
data with intricate network or
graph structures.
• Social networks, for instance, can
be represented as graphs where
individuals are nodes and
relationships between them are
edges.
• Understanding these relationships
is vital for social network analysis,
recommendation systems, and
fraud detection.
Representations in Big Data: Data Models
• Choosing the right data model
is crucial in big data systems.
• Whether it's a relational
database model, NoSQL
models like document or
graph databases, or
specialized models for specific
data types, the representation
of data affects how it can be
stored, queried, and
processed.
Data Formats
• Big data is often stored in
various formats, such as
JSON, XML, Parquet, or Avro.
• The choice of data format
impacts data storage
efficiency, query
performance, and ease of
integration with different
tools and systems.
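As a small, hedged illustration of two of these formats, the snippet below writes the same records as JSON and as XML using only Python's standard library (columnar formats such as Parquet or Avro would typically require additional libraries like pyarrow or fastavro, which are not shown here):

# Serialize the same records as JSON and as XML (standard library only).
import json
import xml.etree.ElementTree as ET

records = [
    {"id": 1, "name": "sensor-a", "reading": 21.5},
    {"id": 2, "name": "sensor-b", "reading": 19.8},
]

# JSON: one self-describing document for the whole collection.
with open("readings.json", "w") as f:
    json.dump(records, f, indent=2)

# XML: the same data as nested elements with attributes.
root = ET.Element("readings")
for rec in records:
    item = ET.SubElement(root, "reading", id=str(rec["id"]), name=rec["name"])
    item.text = str(rec["reading"])
ET.ElementTree(root).write("readings.xml")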
Visualization Representations
• Converting big data into
visual representations, such
as charts, graphs, and
dashboards, is essential for
making the data accessible
and understandable.
• Visualization aids in
identifying patterns, trends,
and outliers, facilitating
better decision-making.
Feature Representations
• In machine learning and
data analytics, representing
data features effectively is
crucial.
• Feature engineering involves
transforming raw data into a
format that machine learning
algorithms can understand.
• This process influences the
model's performance and
accuracy.
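A minimal sketch of one common feature-engineering step, one-hot encoding a categorical field by hand so a numeric model can consume it (the field names are invented for illustration; libraries such as scikit-learn provide production-ready encoders):

# One-hot encode a categorical "device_type" feature by hand (illustrative data).
raw_rows = [
    {"device_type": "mobile", "session_minutes": 12},
    {"device_type": "desktop", "session_minutes": 30},
    {"device_type": "tablet", "session_minutes": 7},
]

categories = sorted({row["device_type"] for row in raw_rows})  # ['desktop', 'mobile', 'tablet']

feature_vectors = []
for row in raw_rows:
    one_hot = [1 if row["device_type"] == c else 0 for c in categories]
    feature_vectors.append(one_hot + [row["session_minutes"]])

print(feature_vectors)  # [[0, 1, 0, 12], [1, 0, 0, 30], [0, 0, 1, 7]]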
Graph Databases
• Graph databases are a type of
NoSQL database that is designed
to store and manage data using
graph structures.
• In a graph database, data is
represented as nodes, edges, and
properties.
• Nodes represent entities, edges
represent relationships between
entities, and properties provide
additional information about
nodes and edges.
Graph Databases
1. Nodes: Nodes are entities in the graph,
and each node can have properties that
describe its attributes.
• For example, in a social network graph, a
node could represent a person, and
properties could include the person's
name, age, and location.
2. Edges: Edges are the relationships
between nodes. They connect nodes and
can also have properties to describe the
nature of the relationship.
• In a social network graph, an edge could
represent a friendship between two
people.
Graph Databases
3. Properties: Nodes and edges can have
associated properties, which are key-value
pairs providing additional information about
the entity or relationship.
• For instance, a property on a person node
might be "gender" with values like "male"
or "female."
4. Graph Query Language: Graph
databases often use a specialized query
language to navigate and retrieve data from
the graph.
• Common graph query languages include Cypher (used in Neo4j) and Gremlin (used in Apache TinkerPop); a small Cypher example appears after this list.
5. Schema-less: Unlike traditional relational
databases, graph databases are typically
schema-less, allowing for flexibility in
adding new types of nodes and relationships
without modifying a predefined schema.
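A hedged sketch of querying a graph with Cypher from Python using the official neo4j driver; the connection URI, credentials, and the Person/FRIEND labels are assumptions made up for the example:

# Sketch: find the friends of a person using Cypher through the Neo4j Python driver.
# The URI, credentials, and the Person/FRIEND schema below are illustrative assumptions.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

query = """
MATCH (p:Person {name: $name})-[:FRIEND]->(friend:Person)
RETURN friend.name AS friend_name
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["friend_name"])

driver.close()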
Graph Databases
Use Cases:
• Social Networks: Modeling relationships between users in a social network.
• Recommendation Engines: Analyzing user preferences and recommending items based on connections.
• Fraud Detection: Detecting patterns and connections in financial transactions.
• Network Analysis: Analyzing and visualizing complex relationships in various domains.
Examples of Graph Databases:
• Neo4j: A popular open-source graph database.
• Amazon Neptune: A fully managed graph database service by Amazon Web Services.
• ArangoDB: A multi-model database that supports graph, document, and key-value data models.
History of Apache Hadoop
• Hadoop is an open-source framework designed for distributed storage and
processing of large data sets using a cluster of commodity hardware.
• The history of Hadoop dates back to the early 2000s, and it has since become a
fundamental tool in the field of big data.
Google's MapReduce Paper (2004):
• The roots of Hadoop can be traced back to a paper titled "MapReduce:
Simplified Data Processing on Large Clusters," published by Google
researchers Jeffrey Dean and Sanjay Ghemawat in 2004.
• The paper described a programming model for processing and generating
large datasets that could be distributed across a cluster of computers.
History of Apache Hadoop
• Creation of Hadoop (2005): Doug Cutting, along with Mike Cafarella,
created an open-source implementation of MapReduce in the
programming language Java.
• The project was named Hadoop after Doug's son's toy elephant. Hadoop
aimed to provide an open-source, distributed computing framework that
could process large datasets.
• Nutch and Yahoo! (2006): Hadoop grew out of the Apache Nutch project, an open-source web search engine, and was split out of Nutch into its own subproject in 2006.
• Yahoo! showed early interest in Hadoop and became a major contributor to its development.
• Formation of the Apache Hadoop Project (2008): In January 2008, the
Apache Software Foundation (ASF) established the Apache Hadoop
project, and Hadoop became a top-level Apache project. This move
facilitated collaboration and contributions from a broader community.
History of Apache Hadoop
• Hadoop Distributed File System (HDFS): HDFS, a distributed file system designed to
store vast amounts of data across multiple machines, was developed as part of the
Hadoop project.
• HDFS follows the principles outlined in the Google File System (GFS) paper.
• Expansion of Hadoop Ecosystem: Over time, the Hadoop ecosystem expanded with
the introduction of various projects that complemented the core Hadoop framework.
• Apache Hive (data warehousing), Apache HBase (NoSQL database), Apache Pig (data
flow language), Apache Spark (cluster computing), and many others became integral
components of the Hadoop ecosystem.
• Hadoop 2.0 and YARN (2013): Hadoop 2.0, released in 2013, introduced the YARN (Yet
Another Resource Negotiator) framework.
• YARN separated resource management and job scheduling/monitoring functions,
making Hadoop more versatile and capable of running a broader range of applications.
Analysing Data with Hadoop
• Analyzing data with Hadoop involves processing and deriving insights
from large datasets using the Hadoop ecosystem, which is a set of open-
source tools designed for distributed storage and processing of big
data.
1. Set up a Hadoop cluster:
Install and configure Hadoop on a cluster of machines. Hadoop follows a distributed computing model, and a cluster typically consists of multiple nodes.
2. Store data in the Hadoop Distributed File System (HDFS):
HDFS is the primary storage system in Hadoop. Upload your datasets into HDFS to distribute and replicate the data across the cluster (a small sketch follows below).
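A minimal sketch of step 2, calling the standard hdfs dfs commands from Python to create a directory and upload a local file into HDFS (the local and HDFS paths are placeholders; in practice these commands are usually run directly from a shell):

# Sketch: upload a local dataset into HDFS using the standard `hdfs dfs` CLI.
# The local and HDFS paths below are illustrative placeholders.
import subprocess

local_file = "/tmp/sales_2024.csv"
hdfs_dir = "/user/analyst/sales"

subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)   # create target directory
subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)  # upload the file
subprocess.run(["hdfs", "dfs", "-ls", hdfs_dir], check=True)            # confirm it is there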
Analysing Data with Hadoop
• MapReduce Programming Model:
• Write MapReduce programs to process and analyze data.
MapReduce is a programming model that allows you to
process large datasets in parallel across a distributed cluster.
• Mapper: Processes input data and produces intermediate
key-value pairs.
• Reducer: Aggregates and processes the intermediate key-
value pairs to produce the final result.
Analysing Data with Hadoop
• Hive: Hive provides a high-level SQL-like language called HiveQL,
allowing you to query data stored in Hadoop. It translates queries
into MapReduce jobs.
• Pig: Pig is a scripting language designed for processing and
analyzing large datasets. Pig scripts are translated into a series of
MapReduce jobs.
• Apache Spark: Apache Spark is a fast and general-purpose cluster
computing system that can be integrated with Hadoop. It provides
higher-level APIs in Java, Scala, Python, and R.
• Spark enables in-memory data processing and supports interactive
queries, iterative algorithms, and stream processing.
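Since Spark exposes Python APIs, a brief PySpark sketch of such an analysis might look like the following; the HDFS path and the user_id column are assumptions made for illustration:

# Sketch: aggregate event counts per user with PySpark.
# The HDFS path and the "user_id" column are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EventCounts").getOrCreate()

events = spark.read.json("hdfs:///data/events/*.json")    # load JSON records from HDFS
counts = events.groupBy("user_id").count()                 # events per user
counts.orderBy("count", ascending=False).show(10)          # top 10 most active users

spark.stop()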
Analysing Data with Hadoop
Data Visualization:
Use tools like Apache Zeppelin or Jupyter notebooks to create visualizations
and reports based on the analyzed data.
Hadoop Ecosystem Tools:
Leverage other tools in the Hadoop ecosystem for specific tasks:
1. HBase: A NoSQL database for real-time read/write access to Hadoop data.
2. Sqoop: Transfers data between Hadoop and relational databases.
3. Flume: Collects, aggregates, and moves large amounts of log data to Hadoop.
Analysing Data with Hadoop
Optimization and Performance Tuning:
Tune Hadoop configurations, optimize MapReduce jobs, and adjust
cluster settings for better performance.
Scaling:
Hadoop is designed to scale horizontally. As data volumes grow, add
more nodes to the cluster to handle the increased processing
demands.
IBM Big Data Strategy
• IBM has been a significant
player in the big data and
analytics space, offering a
range of solutions and services
to help organizations manage,
analyze, and derive insights
from large volumes of data.
InfoSphere BigInsights
• IBM InfoSphere BigInsights is
an analytics platform designed
for processing and analyzing
large volumes of structured
and unstructured data.
• It is built on open-source
Apache Hadoop and includes
additional tools and
capabilities to simplify big
data analytics.
InfoSphere BigInsights
• Hadoop Ecosystem Integration: BigInsights leverages the Apache
Hadoop ecosystem, providing distributed storage and processing
capabilities for big data.
• Analytical Tools: The platform includes various tools for data
exploration, analysis, and visualization, allowing users to derive insights
from diverse datasets.
• Advanced Analytics: BigInsights supports advanced analytics,
including machine learning and predictive analytics, to uncover
patterns and trends in data.
• Security and Governance: It includes features for securing data and
ensuring compliance with regulatory requirements. This includes access
controls, encryption, and auditing capabilities.
• Integration with IBM and Open Source Tools: BigInsights integrates
with other IBM products and open-source tools, providing flexibility in
the choice of programming languages and frameworks.
BigSheets
• IBM BigSheets is a component of InfoSphere BigInsights
designed to simplify the exploration and analysis of large
datasets without requiring extensive programming skills. It
provides a spreadsheet-like interface for users to interact with
and analyze data.
• Spreadsheet Interface: BigSheets offers a familiar
spreadsheet-like interface that enables users to perform data
exploration and analysis using point-and-click interactions.
• Data Exploration: Users can explore and analyze large
datasets by applying filters, aggregations, and
transformations through a visual interface.
BigSheets
• Integration with BigInsights: BigSheets is tightly integrated with
InfoSphere BigInsights, allowing users to leverage the underlying
Hadoop-based analytics platform for processing and querying
large volumes of data.
• Visualization: Users can create visualizations of data directly
within BigSheets to better understand patterns and trends.
• Data Enrichment: BigSheets supports the enrichment of data
through external sources, enhancing the analysis capabilities.
• Collaboration: Users can share and collaborate on BigSheets
workbooks, facilitating collaborative data analysis within a team.
Hadoop Streaming
• Hadoop Streaming is a utility
that comes with Apache
Hadoop, a distributed
storage and processing
framework.
• It allows users to create and
run MapReduce jobs with
any executable or script as
the mapper and/or reducer.
Hadoop Streaming
• Mapper and Reducer Execution: In Hadoop Streaming, mappers and
reducers can be implemented using any executable or script (e.g.,
Python, Perl, Ruby, etc.). This flexibility allows users to leverage their
preferred programming languages for processing data.
• Input and Output Formats: Hadoop Streaming uses standard input
and output streams for communication between the Hadoop
framework and the user's mapper and reducer scripts. Each line of
input to the mapper is treated as a separate record, and the output of
the mapper is likewise treated as input for the reducer.
• Command-Line Interface: Users specify the mapper and reducer
scripts along with input and output paths using the Hadoop Streaming
command-line interface.
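Below is a minimal streaming word-count pair written in Python; the hadoop launch command is shown as a comment, and the jar path and HDFS paths are installation-specific placeholders:

# mapper.py -- reads raw lines from standard input and emits "word<TAB>1" pairs.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- input arrives sorted by key, so counts can be summed per word.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Launched roughly as (paths are placeholders):
# hadoop jar hadoop-streaming.jar \
#   -input /data/input -output /data/wordcount-output \
#   -mapper mapper.py -reducer reducer.py \
#   -file mapper.py -file reducer.py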
Hadoop Streaming
• Data Flow:
• Input Data: Input data is typically stored in the Hadoop
Distributed File System (HDFS) and is divided into fixed-size
blocks. Each block is processed by a separate mapper.
• Intermediate Data: The output of each mapper is partitioned
and sorted based on keys. This sorted intermediate data is then
passed to the reducers for further processing.
• Output Data: The final output is stored in HDFS or another
specified location. Each reducer produces a part of the final
output, and these parts are combined to form the complete
result.
Hadoop Streaming
• Use Cases:
• Hadoop Streaming is particularly useful when existing code
or scripts can be easily adapted to the MapReduce paradigm
without the need for a full Java implementation.
• It provides a bridge between traditional, non-Java
applications and the Hadoop ecosystem, making it accessible
to a wider range of users.
