Big Data Analysis PDF 2
Overview
1. Data Storage:
• Each data node in HDFS is responsible for storing actual blocks of data.
• Files in HDFS are broken into blocks (128 MB by default, often configured to 256 MB) and
distributed across the data nodes.
2. Directory Structure:
• A data node has a structured directory where data blocks are stored. The structure
varies slightly depending on the configuration and file system being used.
Use Cases
1. Fault Tolerance:
• If a disk fails, other storage directories (from dfs.datanode.data.dir) can still hold
the data.
2. Efficient Access:
• Organizing blocks into subdirectories prevents the overhead of managing thousands
of files in a single directory.
3. Data Integrity:
• The VERSION file and logs ensure the data node can verify block integrity and
maintain consistency.
2. Processing Power
• Definition: The computational capacity required to process and analyze large
datasets efficiently.
• Examples:
• Multi-core CPUs and GPUs for parallel processing.
• Distributed computing frameworks like Apache Spark and MapReduce.
• Features:
• Supports real-time and batch processing.
• Enables machine learning and advanced analytics.
• Handles complex transformations and computations on Big Data.
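As a small illustration of the parallel processing described above, here is a minimal Python sketch (not tied to any particular Big Data framework) that spreads work across CPU cores using the standard multiprocessing module:

```python
# Minimal sketch of multi-core parallel processing in plain Python.
# Splits a large range of numbers across CPU cores and sums the squares.
from multiprocessing import Pool

def square(x: int) -> int:
    return x * x

if __name__ == "__main__":
    data = range(1_000_000)
    with Pool() as pool:                       # one worker process per CPU core by default
        total = sum(pool.map(square, data, chunksize=10_000))
    print(total)
```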
3. Memory (RAM)
• Definition: High-speed, temporary storage used for processing data during
computations.
• Examples:
• In-memory frameworks like Apache Spark or Apache Flink utilize RAM for
faster data processing.
• Memory-intensive tasks like caching and real-time analytics depend on large
RAM capacities.
• Features:
• Enhances processing speed by reducing reliance on disk I/O.
• Crucial for applications requiring low latency.
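To illustrate how in-memory frameworks reduce disk I/O, here is a minimal PySpark sketch. It assumes pyspark is installed and uses a hypothetical input file events.txt; because the dataset is cached, the second query is served from RAM rather than re-read from disk:

```python
# Minimal PySpark sketch of in-memory processing: cache a dataset in RAM so
# repeated queries avoid disk I/O. "events.txt" is a hypothetical input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").master("local[*]").getOrCreate()

logs = spark.read.text("events.txt")
logs.cache()                                                  # keep the data in memory after first use

errors = logs.filter(logs.value.contains("ERROR")).count()    # first action materializes the cache
warnings = logs.filter(logs.value.contains("WARN")).count()   # served from RAM, not disk

print(errors, warnings)
spark.stop()
```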
4. Network Bandwidth
• Definition: The capacity of the network to transfer data between nodes in a
distributed system or between systems and storage.
• Examples:
• High-speed Ethernet or fiber optic connections in data centers.
• Cloud services with optimized networking infrastructure (AWS Direct Connect,
Google Cloud Interconnect).
• Features:
• Essential for data replication, distribution, and retrieval.
• Impacts the performance of real-time processing and distributed computing
tasks.
• Supports integration of diverse data sources.
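A rough back-of-the-envelope sketch of why bandwidth matters: the Python snippet below estimates how long it takes to move 1 TB over links of different speeds (illustrative numbers only, ignoring protocol overhead and replication traffic):

```python
# Back-of-the-envelope sketch: time to move 1 TB of data over different links.
def transfer_seconds(data_bytes: float, link_gbps: float) -> float:
    bits = data_bytes * 8
    return bits / (link_gbps * 1e9)

one_tb = 1e12  # 1 TB in bytes
for gbps in (1, 10, 100):
    print(f"{gbps:>3} Gbps link: {transfer_seconds(one_tb, gbps) / 60:.1f} minutes")
# 1 Gbps ~ 133 min, 10 Gbps ~ 13 min, 100 Gbps ~ 1.3 min
```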
5. Explain YARN.
YARN (Yet Another Resource Negotiator) is a key component of the Apache Hadoop
ecosystem, introduced in Hadoop 2.0. It serves as a resource management and job scheduling
framework for distributed computing systems. YARN decouples resource management and job
scheduling/monitoring functions, enhancing scalability and flexibility.
Components of YARN
1. ResourceManager (RM)
• A central authority that manages resources in the cluster.
• Responsibilities:
• Allocating resources to various applications.
• Ensuring fair resource distribution among competing applications.
• Monitoring and handling failures.
2. NodeManager (NM)
• A per-node service that manages resources and execution on individual nodes.
• Responsibilities:
• Monitoring resource usage (CPU, memory, etc.) on the node.
• Reporting resource availability to the ResourceManager.
3. ApplicationMaster (AM)
• A per-application entity responsible for managing the execution of tasks within an
application.
• Responsibilities:
• Negotiating containers (resources) from the ResourceManager.
• Monitoring task progress and handling failures.
4. Container
• A logical unit of resources (CPU, memory, etc.) allocated to a specific task.
• Containers are created and managed by the NodeManager.
6. What is the MapReduce programming model?
MapReduce Programming Model
MapReduce is a programming model and a processing technique for handling large-scale data sets
in distributed systems. Developed by Google, it became widely known as part of the Apache Hadoop
ecosystem. MapReduce simplifies parallel data processing by breaking tasks into smaller, manageable
sub-tasks that can be processed independently across a distributed cluster.
Steps in MapReduce
1. Input Splitting:
• The input data is divided into fixed-size chunks (splits), typically matching the HDFS
block size (128 MB by default).
2. Mapping:
• Each split is processed by a mapper that generates intermediate key-value pairs.
• Example: In a word count program, the input "cat cat dog" would produce
intermediate pairs like (cat, 1), (cat, 1), (dog, 1).
3. Shuffling and Sorting:
• Intermediate data is shuffled to group all values with the same key and sorted for
efficient processing.
• Example: After shuffling, (cat, 1), (cat, 1) becomes (cat, [1, 1]).
4. Reducing:
• Reducers process the grouped data, applying an aggregation function to produce the
final output.
• Example: For (cat, [1, 1]), the reducer outputs (cat, 2).
5. Output:
• The final results are written to the output location, typically in a distributed file system
like HDFS.
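The word-count example above can be sketched end to end in plain Python, simulating the map, shuffle/sort, and reduce phases on a single machine (a teaching sketch, not the Hadoop API):

```python
# Minimal sketch of the MapReduce word-count flow in plain Python.
from collections import defaultdict

def map_phase(line: str):
    # Mapper: emit (word, 1) for every word in the input split.
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort: group all values belonging to the same key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: aggregate the grouped values per key.
    return {key: sum(values) for key, values in grouped.items()}

splits = ["cat cat dog", "dog bird"]
intermediate = [pair for split in splits for pair in map_phase(split)]
print(reduce_phase(shuffle(intermediate)))   # {'cat': 2, 'dog': 2, 'bird': 1}
```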
Components of Hive
1. MetaStore:
• Central repository storing metadata about the data, including table
schemas, partitions, and data locations.
2. Driver:
• Manages query lifecycle, including compilation, optimization, and
execution.
3. Query Compiler:
• Translates HiveQL into execution plans for underlying processing
engines like MapReduce or Spark.
4. Execution Engine:
• Executes the query on Hadoop's processing framework (e.g.,
MapReduce, Tez, Spark).
5. Hive CLI/Beeline:
• Interfaces for running Hive queries interactively or in batch mode.
6. Storage:
• Data is stored in HDFS or other compatible storage systems in various
formats (e.g., ORC, Parquet, Text, Avro).
Types of Sharding
1. Range-Based Sharding:
• Data is divided into shards based on a range of values.
• Example: User IDs 1–1000 on Shard 1, 1001–2000 on Shard 2.
• Advantage: Easy to implement and understand.
• Disadvantage: Can lead to uneven data distribution (hot spots).
2. Hash-Based Sharding:
• A hash function is applied to the sharding key to determine which shard the data
belongs to.
• Example: hash(user_id) % number_of_shards = shard_number (see the sketch after this list).
• Advantage: Ensures even distribution of data across shards.
• Disadvantage: Difficult to re-shard when adding servers.
3. Geographical Sharding:
• Data is partitioned based on geographical locations.
• Example: Data for users in Asia on Shard 1, Europe on Shard 2.
• Advantage: Useful for applications with region-specific data.
• Disadvantage: Can result in uneven shard sizes.
4. Directory-Based Sharding:
• A lookup table maps data to specific shards.
• Example: A table indicating that User IDs 1–100 are on Shard 1.
• Advantage: Flexible and allows for custom distribution.
• Disadvantage: Adds overhead due to maintaining the lookup table.
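As referenced under hash-based sharding, here is a minimal Python sketch of that approach. The shard count and user IDs are hypothetical, and md5 is used instead of Python's built-in hash() because the latter is randomized between runs:

```python
# Minimal sketch of hash-based sharding: a stable hash of the sharding key
# decides which shard a record belongs to.
import hashlib

NUM_SHARDS = 4

def shard_for(user_id: str) -> int:
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for uid in ("user-17", "user-42", "user-99"):
    print(uid, "->", "shard", shard_for(uid))
```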
1. Data Ingestion: Data is ingested into Hadoop using tools like HDFS, Apache
Flume, Kafka, Sqoop, or Apache NiFi for batch and real-time data processing.
2. Data Storage: Data is stored in HDFS, which distributes large datasets across
multiple nodes for scalability and fault tolerance. Data is typically stored in
formats like Avro, Parquet, ORC, and Sequence files.
3. Data Processing: Hadoop provides processing frameworks like MapReduce,
Apache Spark, Apache Hive, and Apache Pig to perform data transformations
and analytics, supporting both batch and real-time processing.
4. Data Querying: Tools like Apache Hive, Impala, and Apache Drill allow for
SQL-like querying and real-time analysis of large datasets in HDFS or NoSQL
databases like HBase.
5. Visualization and Advanced Analytics: After analysis, tools like Tableau,
Power BI, and QlikView are used to visualize data, while machine learning
libraries in Spark and Mahout allow for advanced analytics on the processed
data.
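A minimal PySpark sketch tying steps 2-4 together: read a Parquet dataset stored in HDFS and run a SQL-style query over it. The HDFS URL, path, table, and column names are hypothetical, and pyspark is assumed to be installed:

```python
# Minimal sketch: query a Parquet dataset stored in HDFS with Spark SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-query").getOrCreate()

sales = spark.read.parquet("hdfs://namenode:9000/data/sales")  # hypothetical path
sales.createOrReplaceTempView("sales")

top_regions = spark.sql("""
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY total DESC
    LIMIT 10
""")
top_regions.show()
spark.stop()
```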
11. Explain in detail about HDFS.
What is HDFS (Hadoop Distributed File System)?
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop to store
large volumes of data in a distributed manner across multiple machines. It is designed to be scalable,
fault-tolerant, and optimized for high-throughput access to data, making it suitable for handling the
large datasets common in big data applications.
Components of HDFS
1. NameNode: Manages metadata (file names, block locations) but does not
store actual data.
2. DataNode: Stores actual data blocks and handles read/write requests.
3. Secondary NameNode: Periodically checkpoints the NameNode's metadata by merging
the edit log into the fsimage; it is not a hot standby.
4. Client: Interacts with HDFS for data access, relying on NameNode and
DataNodes.
HDFS Architecture
• Single NameNode:
• A classic HDFS cluster has a single active NameNode, making it a single point of failure.
The Secondary NameNode periodically checkpoints the metadata (merging the edit log into
the fsimage), but it does not act as a failover.
• For high availability, a NameNode HA configuration with two NameNodes can be
implemented.
• Block Size and Replication Factor:
• The default block size in HDFS is 128MB, which is much larger than the typical block
size in traditional file systems (e.g., 4KB).
• The replication factor of 3 (by default) ensures redundancy, so data is replicated
across multiple DataNodes.
• Client Interaction:
• Clients interact directly with DataNodes to read and write data but rely on the
NameNode for metadata information.
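A quick worked example for the block size and replication factor described above, assuming the defaults of 128 MB blocks and a replication factor of 3:

```python
# Worked example: how many blocks does a 1 GB file occupy in HDFS, and how
# much raw cluster storage does it consume once replicated?
import math

file_size_mb = 1024
block_size_mb = 128
replication = 3

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication

print(f"{blocks} blocks, {raw_storage_mb} MB of raw cluster storage")  # 8 blocks, 3072 MB
```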
Advantages of HDFS
1. Fault Tolerance:
• Data is replicated across multiple DataNodes, ensuring that data remains available
even in case of hardware failure.
2. High Throughput:
• Optimized for large-scale data processing and throughput, HDFS is designed for
applications like data analysis and batch processing.
3. Scalability:
• HDFS is designed to scale horizontally by adding more machines to the cluster,
supporting the growth of data over time.
4. Cost Efficiency:
• It uses commodity hardware, making it cost-effective compared to traditional storage
systems.
Challenges of HDFS
1. Single Point of Failure:
• The NameNode is a single point of failure, though high availability configurations can
mitigate this.
2. Not Suitable for Small Files:
• HDFS is optimized for large files, and managing many small files is inefficient due to
overhead and metadata storage.
3. Latency:
• HDFS is optimized for throughput, not low-latency access. It is not ideal for
applications requiring real-time data access.
Cluster analysis, a type of unsupervised machine learning, requires several key components and
conditions to perform effectively. Here are the main requirements for cluster analysis:
1. Data Representation:
• Data should be represented in a form that can be processed, typically as numerical
vectors or matrices, where each data point or object is represented by a set of
features.
2. Distance/Similarity Measure:
• A method for measuring the similarity or distance between data points is crucial.
Common measures include Euclidean distance, Manhattan distance, or cosine
similarity, depending on the type of data.
3. Appropriate Clustering Algorithm:
• The choice of clustering algorithm (e.g., K-means, Hierarchical Clustering, DBSCAN)
should match the characteristics of the data, such as the number of clusters, the
shape of the clusters, and whether the data is noisy or has outliers.
4. Data Preprocessing:
• Data should be cleaned, normalized, and transformed (if needed) to ensure that
features are comparable, and irrelevant or redundant information is removed. This
may include handling missing values, scaling features, and reducing dimensionality.
5. Scalability:
• The algorithm should be scalable to handle large datasets, especially in big data
contexts. Some clustering algorithms may struggle with high-dimensional or very
large datasets, so computational efficiency and scalability are important.
6. Evaluation Criteria:
• A method to assess the quality of the clusters is necessary, such as silhouette
score, within-cluster sum of squares, or Davies-Bouldin index. Evaluation ensures
the clustering is meaningful and provides useful insights.
7. Domain Knowledge:
• Understanding the data and the problem domain helps in selecting the right features,
interpreting the results, and making sense of the clusters formed.
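A minimal Python sketch pulling these requirements together (numerical feature vectors, preprocessing, a clustering algorithm, and an evaluation criterion), assuming scikit-learn and numpy are installed; the toy data is made up for illustration:

```python
# Minimal clustering sketch: scale features, run K-means, evaluate with silhouette.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two features per point, two obvious groups.
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [8.0, 8.5], [8.2, 8.1], [7.9, 8.4]])

X_scaled = StandardScaler().fit_transform(X)          # preprocessing / normalization
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

print("labels:", labels)
print("silhouette:", silhouette_score(X_scaled, labels))  # closer to 1 = better-separated clusters
```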
2. Windowing
• Windowing is the process of dividing a continuous stream of data into
manageable chunks or windows, enabling the system to apply operations like
aggregation or analysis on subsets of data at a time.
Importance:
• Windowing is essential for dealing with continuous data streams, allowing
meaningful analyses over a finite subset of data, and helps in aggregating
data for further processing.
Challenges:
• Handling late data and window updates: Late-arriving events can change a window's result
after it has been computed, so techniques such as watermarks or allowed-lateness policies
are needed to update windows correctly and keep results accurate.
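A minimal Python sketch of the simplest windowing scheme, tumbling (fixed, non-overlapping) windows over timestamped events; the timestamps and values are made up for illustration:

```python
# Minimal sketch of tumbling windows: group timestamped events into fixed-size,
# non-overlapping windows and aggregate per window.
from collections import defaultdict

WINDOW_SECONDS = 60

# (timestamp_in_seconds, value) pairs, e.g. sensor readings in a stream.
events = [(5, 10), (42, 7), (61, 3), (95, 12), (130, 1)]

windows = defaultdict(list)
for ts, value in events:
    window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(value)

for start in sorted(windows):
    print(f"[{start}, {start + WINDOW_SECONDS}) -> sum = {sum(windows[start])}")
```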
1. Personal Networks:
2. Professional Networks:
3. Educational Networks:
• Designed to connect students, educators, and institutions for learning and academic
purposes.
• Examples: Coursera, Edmodo, Khan Academy.
• Focus: Sharing educational resources, taking online courses, and collaborating in
academic settings.
4. Media-Sharing Networks:
5. Discussion Forums:
6. Dating Networks:
Web Analytics involves the collection, measurement, and analysis of data related to websites, apps,
and online platforms to understand user behavior, enhance website performance, and achieve
business goals. It provides insights into how users interact with websites, enabling businesses to
optimize user experience and improve marketing strategies.
The structure of web analytics can be broken down into the following key components:
1. Data Collection:
• Web Tracking: Captures user activities on a website (e.g., page views, clicks).
• Event Tracking: Monitors specific user actions like button clicks.
• E-Commerce Tracking: Tracks transactions and customer actions.
• Session Tracking: Records user interactions during a single visit.
2. Data Processing:
• Data Cleaning: Removes irrelevant data (e.g., bots, internal traffic).
• Data Aggregation: Combines raw data into useful statistics.
• Segmentation: Divides users into groups based on characteristics.
• Data Enrichment: Adds extra information, like location or demographics.
3. Data Analysis:
• Traffic Analysis: Identifies sources of website traffic.
• Behavior Analysis: Understands user actions (e.g., pages viewed, time spent).
• Conversion Analysis: Measures goal achievement like purchases or sign-ups.
• Cohort Analysis: Groups users by behaviors or acquisition date.
• A/B Testing: Tests different webpage versions for better performance.
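As a small illustration of conversion analysis, the pandas sketch below computes the conversion rate per traffic source from hypothetical session records (the column names and data are made up; pandas is assumed to be installed):

```python
# Minimal sketch of conversion analysis: conversion rate by traffic source.
import pandas as pd

sessions = pd.DataFrame({
    "source":    ["organic", "organic", "paid", "paid", "email"],
    "converted": [True,      False,     True,   True,   False],
})

conversion_rate = sessions.groupby("source")["converted"].mean()
print(conversion_rate)   # share of sessions that converted, per source
```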
4. Reporting and Visualization:
• Dashboards: Real-time displays of key metrics (e.g., traffic, conversions).
• Custom Reports: Tailored reports for specific business needs.
• Alerts and Notifications: Automated alerts for significant changes in metrics.
5. Optimization and Action:
• CRO: Improves conversion rates by optimizing website elements.
• Personalization: Customizes user experience based on behavior.
• SEO Optimization: Improves website ranking using organic search data.
• Marketing Strategy Refinement: Adjusts strategies based on data insights.
6. Tools Used:
• Google Analytics, Adobe Analytics, Mixpanel, Hotjar, Crazy Egg are common
tools used for tracking, analyzing, and optimizing website performance.