Big Data
Big Data refers to extremely large and complex datasets that traditional data processing
tools cannot handle efficiently. These datasets often come from various sources such as
social media, sensors, IoT devices, e-commerce platforms, and more, and they grow
exponentially in size over time. Big Data enables organizations to extract valuable insights
for decision-making, improving customer experiences, and optimizing operations.
Examples
Discovering consumer shopping habits.
Finding new customer leads.
Monitoring health conditions through data from wearables.
Predictive inventory ordering.
Semi-Structured Data
• Definition: Data that does not have a strict structure but uses tags or markers to
separate elements and enforce hierarchies.
• Characteristics:
• Does not fit neatly into relational databases but has some organizational
properties.
• Often stored in formats that allow flexibility and scalability.
• Examples:
• JSON and XML files.
• NoSQL database records.
• Email metadata (e.g., sender, recipient, timestamp).
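To make these formats concrete, here is a minimal Python sketch that parses a JSON record; the field names and values are invented for the example.
```python
import json

# A hypothetical semi-structured record: elements are tagged by name, and
# nested or optional elements are allowed (no fixed relational schema).
record = '''
{
  "sender": "alice@example.com",
  "recipient": "bob@example.com",
  "timestamp": "2024-01-15T09:30:00Z",
  "attachments": [
    {"name": "report.pdf", "size_kb": 412}
  ]
}
'''

data = json.loads(record)

# The tags (keys) let us navigate the hierarchy even though the structure
# can vary from record to record.
print(data["sender"], "->", data["recipient"])
for att in data.get("attachments", []):   # optional element
    print("attachment:", att["name"], att["size_kb"], "KB")
```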
Quasi-Structured Data
Quasi-structured data refers to data that does not have a rigid structure like structured
data but still contains some organizational elements or markers that make it easier to parse
and analyze. It is often considered a subset of semi-structured data.
Characteristics:
1. Lacks a Fixed Schema:
• No strict organization into tables or rows but has some identifiable patterns
or tags.
2. Requires Parsing:
• Tools or algorithms are often needed to extract meaningful information.
3. Flexible Formats:
• The structure can vary within the same dataset, making it dynamic.
4. Intermediate Complexity:
• More structured than unstructured data but less structured than fully
structured data.
Examples:
• Sensor Logs
• Social Media Comments
• Email Logs
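A minimal sketch of parsing quasi-structured data: the sensor-log line format below is invented for the example, and a regular expression extracts the identifiable patterns from each line.
```python
import re

# Hypothetical sensor log lines: no fixed schema, but there are recognizable
# markers (timestamp, sensor id, key=value readings).
log_lines = [
    "2024-01-15 09:30:01 sensor=temp-01 value=21.7 unit=C",
    "2024-01-15 09:30:05 sensor=hum-02 value=48 unit=%",
]

pattern = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"sensor=(?P<sensor>\S+) value=(?P<value>\S+) unit=(?P<unit>\S+)"
)

for line in log_lines:
    match = pattern.match(line)
    if match:   # parsing is required; malformed lines are simply skipped
        print(match.group("sensor"), float(match.group("value")), match.group("unit"))
```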
Unstructured Data
• Definition: Data that lacks a predefined format or organization, making it difficult to
store and process using traditional tools.
• Characteristics:
• Cannot be stored in rows and columns.
• Requires advanced tools (e.g., Hadoop, Spark) for processing and analysis.
• Examples:
• Text files, social media posts, and customer reviews.
• Multimedia content like images, videos, and audio files.
Hadoop
Hadoop is an open-source framework developed by the Apache Software Foundation. It is
designed to store, process, and analyze large-scale datasets efficiently. Hadoop uses a
distributed computing model, enabling it to handle Big Data across clusters of commodity
hardware.
Benefits of Hadoop
• Scalability: Easily scales from a single node to thousands of nodes.
• Fault Tolerance: Automatically replicates data to ensure availability even in case of
node failures.
• Cost-Effectiveness: Runs on inexpensive commodity hardware.
• Flexibility: Handles structured, semi-structured, and unstructured data.
Applications of Hadoop
• Fraud detection in financial services
• Customer sentiment analysis in marketing
• Recommendation systems in e-commerce
• Log analysis in IT and cybersecurity
• Genomic data analysis in healthcare
HDFS is not a good fit in the following scenarios:
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications
Apache Hadoop
Apache Hadoop is an open-source framework designed for distributed storage and
processing of large datasets across clusters of computers. It is particularly suited for
handling Big Data and is developed and maintained by the Apache Software Foundation.
Hadoop Streaming
Hadoop Streaming is a utility that allows developers to write MapReduce programs in any
programming language that can read from standard input (stdin) and write to standard
output (stdout). This means you can use languages such as Python or Ruby instead of Java
to implement MapReduce logic in a Hadoop environment.
Basic Workflow
1. Input: Data is passed to the Mapper via stdin (standard input).
2. Mapper: The script processes each line and outputs key-value pairs.
3. Shuffle and Sort: Hadoop sorts and groups the intermediate key-value pairs from
the mappers.
4. Reducer: The script processes grouped key-value pairs and outputs the final results
to stdout (standard output).
5. Output: The final output is written to HDFS or any other output location.
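The workflow above can be implemented with two small Python scripts. The word-count sketch below assumes Python 3 is available on the cluster nodes; the script names, input/output paths, and streaming jar location are illustrative.
```python
#!/usr/bin/env python3
# mapper.py -- reads lines from stdin and emits "word<TAB>1" pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```
```python
#!/usr/bin/env python3
# reducer.py -- Hadoop delivers the mapper output sorted by key, so the
# count for each word is accumulated and emitted when the key changes.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

# Submitting the job (illustrative paths; the streaming jar location varies
# by installation, and the scripts must be executable or wrapped as
# "python3 mapper.py" / "python3 reducer.py"):
#   hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
#     -files mapper.py,reducer.py \
#     -mapper mapper.py -reducer reducer.py \
#     -input /data/input -output /data/output
```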
Hadoop Ecosystem Components
2. MapReduce
• MapReduce is the programming model used for processing large datasets in parallel.
• Map Phase: Input data is processed in parallel by mappers, outputting key-
value pairs.
• Reduce Phase: The key-value pairs are aggregated by reducers to produce the
final output.
• This framework is used for batch processing and handles tasks like sorting, counting,
and filtering data.
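To illustrate the two phases without a cluster, the following plain-Python sketch mimics what MapReduce does: a map step emits key-value pairs, a shuffle/sort step groups them by key, and a reduce step aggregates each group (the input documents are made up).
```python
from collections import defaultdict

documents = ["big data needs big tools", "data tools for big data"]

# Map phase: each input record is turned into (key, value) pairs.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle and sort: group all values by key (Hadoop does this between phases).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate the values of each key into the final output.
for key in sorted(groups):
    print(key, sum(groups[key]))
```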
4. Hadoop Common
• Hadoop Common refers to the shared utilities, libraries, and APIs needed by other
Hadoop modules. It includes components that are essential for HDFS and
MapReduce to function.
5. Apache Hive
• Hive is a data warehouse system that provides a SQL-like interface for querying and
managing data stored in HDFS. It abstracts the complexities of MapReduce
programming by allowing users to write queries in HiveQL, a SQL-like language.
• Use Case: Useful for business intelligence (BI) and data analytics.
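As a rough sketch of how such a query might be issued from a script, the example below calls the Hive command-line client's -e option from Python; the sales table and the query itself are invented, and a configured Hive installation is assumed.
```python
import subprocess

# Hypothetical HiveQL query; the "sales" table is assumed to already exist.
query = "SELECT region, SUM(amount) FROM sales GROUP BY region"

# The hive CLI's -e flag runs a single query and prints the result.
result = subprocess.run(["hive", "-e", query], capture_output=True, text=True)
print(result.stdout)
```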
6. Apache HBase
• HBase is a NoSQL database that provides real-time, random access to large datasets.
• Built on top of HDFS, HBase is designed to handle very large datasets that
require quick read and write operations.
• Use Case: Real-time analytics and operational workloads requiring low-
latency data access.
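The sketch below shows random reads and writes from Python, assuming the happybase client library and an HBase Thrift gateway are available; the table name, column family, and row key are made up for the example.
```python
import happybase

# Connect through the HBase Thrift gateway (host and port are assumptions).
connection = happybase.Connection("localhost", port=9090)
table = connection.table("user_events")   # hypothetical table with family "d"

# Low-latency random write: one row keyed by user id.
table.put(b"user123", {b"d:last_login": b"2024-01-15T09:30:00Z"})

# Low-latency random read of the same row.
row = table.row(b"user123")
print(row.get(b"d:last_login"))
```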
7. Apache Pig
• Pig is a high-level platform for processing data using a language called Pig Latin. It
abstracts the complexities of writing MapReduce jobs, providing a simpler way to
express data transformations.
• Use Case: ETL (Extract, Transform, Load) processes and data transformation tasks.
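As an illustrative sketch, a small Pig Latin word-count script can be written out and run in local mode from Python, assuming Pig is installed; the file paths are placeholders.
```python
import subprocess

# A minimal, illustrative Pig Latin script: load text, split into words,
# group them, and count (input/output paths are placeholders).
pig_script = """
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO 'word_counts';
"""

with open("wordcount.pig", "w") as f:
    f.write(pig_script)

# Run in local mode for testing; on a cluster the "-x local" flag is dropped.
subprocess.run(["pig", "-x", "local", "wordcount.pig"])
```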
8. Apache Spark
• Spark is an in-memory processing engine that provides faster data processing than
MapReduce. It can be used for both batch and stream processing, and it supports
complex analytics, machine learning, and graph processing.
• RDDs (Resilient Distributed Datasets): Core abstraction that enables
distributed data processing.
• Use Case: Faster processing and real-time analytics.
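A brief PySpark sketch of the RDD abstraction, assuming a local Spark installation; the sample lines are made up.
```python
from pyspark import SparkContext

# Local SparkContext for experimentation; on a cluster the master URL differs.
sc = SparkContext("local[*]", "rdd-sketch")

# An RDD is a distributed collection; transformations are lazy and are
# executed in memory across partitions when an action is called.
lines = sc.parallelize(["big data needs big tools", "data tools for big data"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print(counts.collect())   # action: triggers the computation
sc.stop()
```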
9. Apache Flume
• Flume is a distributed tool for collecting, aggregating, and moving large amounts of
log data from various sources to Hadoop (HDFS).
• Use Case: Streaming data ingestion, particularly for log data.
IBM’s Big Data Strategy
IBM has developed a comprehensive strategy for Big Data, focusing on enabling businesses
to harness the value of their data through advanced analytics, artificial intelligence (AI),
and cloud computing. IBM’s Big Data strategy emphasizes the integration of various
technologies to help businesses manage, analyze, and gain insights from massive volumes
of data.
Core Components of IBM’s Big Data Strategy
1. IBM Cloud
• IBM Cloud provides businesses with the tools and infrastructure to manage, store,
and process Big Data in a highly scalable and flexible cloud environment. It includes
both public and private cloud services, as well as hybrid cloud solutions.
• IBM Cloud Pak for Data is an integrated data and AI platform that allows businesses
to collect, organize, and analyze data from various sources, while providing tools for
governance and security.
3. Analytics and AI
• IBM Watson: A suite of AI-powered analytics tools that help businesses gain insights
from data. Watson includes solutions like:
• Watson Studio: A platform for data scientists and developers to build and
train machine learning models.
• Watson Machine Learning: Provides tools for automating model deployment
and management.
• Watson Knowledge Catalog: An AI-driven data governance and cataloging
solution.
• IBM SPSS: A powerful tool for advanced statistical analysis, used for predictive
analytics and decision-making.
• IBM Cognos Analytics: A suite for data exploration, reporting, and visualization,
empowering businesses to make data-driven decisions.
5. Real-Time Analytics
• IBM Streams: A real-time data analytics platform for analyzing high-velocity data
streams, enabling businesses to process data as it arrives and take immediate
actions.
6. Data Governance and Security
• IBM Watson Knowledge Catalog: Helps businesses organize, govern, and manage
their data assets, ensuring compliance with industry regulations.
• IBM Guardium: A data security and protection tool that monitors and protects
sensitive data.
• IBM Cloud Pak for Security: Offers solutions for securing data and workloads across
hybrid and multicloud environments.
IBM BigSheets
IBM BigSheets is a tool designed to help businesses analyze large datasets directly within
IBM Cloud using a familiar, spreadsheet-like interface. It is part of the IBM InfoSphere
BigInsights platform and leverages Hadoop's power to process and analyze Big Data
without the need for specialized programming skills.
HDFS Federation
HDFS Federation is a feature introduced in Hadoop 2.x to enhance the scalability and
performance of the Hadoop Distributed File System (HDFS). It addresses the limitations of
the single NameNode architecture in earlier Hadoop versions, where a single NameNode
could become a bottleneck as the number of files and the size of the cluster grew.
Data Replication
• HDFS is designed to reliably store very large files across machines in a large cluster. It
stores each file as a sequence of blocks; all blocks in a file except the last block are
the same size. The blocks of a file are replicated for fault tolerance. The block size
and replication factor are configurable per file. An application can specify the
number of replicas of a file. The replication factor can be specified at file creation
time and can be changed later. Files in HDFS are write-once and have strictly one
writer at any time.
• Large HDFS instances run on a cluster of computers that is commonly spread across
many racks. For the common case, when the replication factor is three, HDFS’s
placement policy is to put one replica on one node in the local rack, another on a
node in a different (remote) rack, and the last on a different node in the same
remote rack.
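For example, the per-file replication factor described above can be changed after a file is written using the HDFS shell. The sketch below drives two such commands from Python; the file path is a placeholder and the hdfs client is assumed to be on the PATH.
```python
import subprocess

# Set the replication factor of one file to 3 (the path is illustrative).
# The -w flag waits until the replication target is actually reached.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/data/events/part-00000"])

# Inspect block-level details, including current replication, for the file.
subprocess.run(["hdfs", "fsck", "/data/events/part-00000", "-files", "-blocks"])
```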
Flume
Apache Flume is a robust, distributed system designed for ingesting large volumes of
streaming data into systems like HDFS, Hive, or HBase. It is widely used for collecting,
aggregating, and transferring log data or other event-based data from multiple sources to
Big Data storage solutions.
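As a rough sketch, if a Flume agent were configured with an HTTP source (an assumption; the host, port, and event contents below are illustrative), an application could push events to it from Python, and the agent's channel and sink would then deliver them to HDFS, Hive, or HBase.
```python
import json
import urllib.request

# Flume's HTTP source (with its JSON handler) accepts a JSON array of
# events, each with optional headers and a body.
events = [
    {"headers": {"host": "web-01"}, "body": "GET /index.html 200"},
    {"headers": {"host": "web-02"}, "body": "GET /login 302"},
]

req = urllib.request.Request(
    "http://localhost:44444",                 # assumed agent host and port
    data=json.dumps(events).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```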
Hadoop Archives
Hadoop Archives (HAR) is a feature of Hadoop that aggregates multiple small files into a
single archive to optimize storage and metadata management in HDFS. While not a direct
data ingestion tool like Flume or Sqoop, HAR can be used as part of the ingestion workflow
when managing large datasets with numerous small files.
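The sketch below creates an archive from a directory of small files by invoking the hadoop archive tool from Python; the paths and archive name are placeholders.
```python
import subprocess

# Pack the small files under /data/logs/2024 into one HAR file, reducing
# the number of objects the NameNode must track.
subprocess.run([
    "hadoop", "archive",
    "-archiveName", "logs-2024.har",
    "-p", "/data/logs",        # parent directory of the sources
    "2024",                    # source directory, relative to the parent
    "/data/archives",          # destination directory for the archive
])

# Archived files remain readable through the har:// scheme, e.g.:
#   hdfs dfs -ls har:///data/archives/logs-2024.har
```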
Avro
Apache Avro is a popular data serialization framework in the Hadoop ecosystem. It
provides a compact, fast, and efficient way to serialize structured data, with built-in
support for schema evolution and interoperability across programming languages.
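A small serialization sketch, assuming the fastavro Python package; the schema and the record are invented for the example.
```python
from io import BytesIO
from fastavro import parse_schema, writer, reader

# The schema is defined explicitly and stored with the data, which is what
# enables schema evolution and cross-language interoperability.
schema = parse_schema({
    "type": "record",
    "name": "ClickEvent",
    "fields": [
        {"name": "user_id", "type": "string"},
        {"name": "url", "type": "string"},
        {"name": "duration_ms", "type": "int"},
    ],
})

records = [{"user_id": "u1", "url": "/home", "duration_ms": 532}]

buf = BytesIO()
writer(buf, schema, records)          # compact binary serialization

buf.seek(0)
for rec in reader(buf):               # the schema travels with the file
    print(rec)
```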
Failures in Hadoop
2. Node Failures
A node failure occurs when a node (or a machine) in the Hadoop cluster becomes unavailable due
to hardware failure, network issues, or node shutdown.
3. JobTracker/ResourceManager Failures
The JobTracker (MapReduce v1) or ResourceManager (in YARN) is responsible for coordinating
job execution and scheduling tasks. If these components fail, the entire job might fail unless
handled properly.
Types of Machine Learning
1. Supervised Learning:
• Data: Labeled data (input-output pairs).
• Goal: Predict outcomes for new data based on past examples.
• Examples: Spam detection, house price prediction.
2. Unsupervised Learning:
• Data: Unlabeled data.
• Goal: Find patterns or groupings in the data.
• Examples: Customer segmentation, anomaly detection.
3. Semi-Supervised Learning:
• Data: A mix of labeled and unlabeled data.
• Goal: Improve learning when labeled data is scarce.
• Examples: Image classification with few labeled images.
4. Reinforcement Learning:
• Data: Agent interacts with an environment and receives feedback
(rewards/penalties).
• Goal: Learn actions that maximize cumulative reward over time.
• Examples: Self-driving cars, game playing AI.
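As a tiny illustration of supervised learning from the list above, the sketch below fits a scikit-learn regression model on made-up, labeled house data and predicts a price for a new house (scikit-learn is assumed to be installed).
```python
from sklearn.linear_model import LinearRegression

# Labeled data: inputs (house size in m^2, number of rooms) and output prices.
X = [[50, 2], [80, 3], [120, 4], [200, 5]]
y = [150_000, 220_000, 310_000, 480_000]

model = LinearRegression()
model.fit(X, y)                        # learn from past labeled examples

# Predict the outcome for a new, unseen house.
print(model.predict([[100, 3]]))
```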