Big Data Analytics Unit Wise Short Note

The document provides an overview of Big Data Analytics, covering its fundamentals, characteristics, and challenges, as well as the Hadoop framework and Spark for data processing. It details the architecture, components, and programming models used in Big Data, emphasizing the need for scalable and efficient systems. Additionally, it discusses GPU computing and the advantages of Spark over traditional MapReduce methods.


Big Data Analytics

Unit 1: Fundamentals of Big Data Analysis

1. Data Storage and Analysis


• Big Data refers to large, complex datasets that cannot be processed using
traditional data management tools. They require scalable storage solutions and
processing power.
• Data storage is the first step, involving distributed systems like Hadoop
Distributed File System (HDFS) to handle vast amounts of data.
• Data analysis involves cleaning, processing, and interpreting data to extract
valuable insights.

2. Characteristics of Big Data

Big Data is characterized by the following:


• Volume: The sheer size of the data, ranging from terabytes to petabytes.
• Variety: The different types of data (structured, semi-structured, and
unstructured) from various sources.
• Velocity: The speed at which data is generated and needs to be processed.
• Veracity: The quality and accuracy of the data.
• Value: The meaningful insights that can be derived from the data.

3. Big Data Analytics

Big Data analytics involves the use of advanced analytic techniques to process and analyze large
datasets. This can include statistical analysis, machine learning, and predictive modeling.

4. Typical Analytical Architecture


The architecture for Big Data analytics typically consists of:
• Data Ingestion: Collecting data from various sources.
• Data Storage: Storing data in distributed file systems.
• Data Processing: Using frameworks like Hadoop or Spark.
• Data Analysis and Visualization: Extracting insights and
presenting them for decision-making.

5. Requirements for New Analytical Architecture


Traditional architectures are not equipped to handle Big Data; a new architecture must provide:
• Scalability: Need to scale systems horizontally across
multiple nodes.
• Flexibility: Support for different types of data sources and
formats.
• Real-time Processing: Ability to process data as it is
generated.

6. Challenges in Big Data Analytics


• Data Privacy: Ensuring privacy and security of sensitive data.
• Data Integration: Merging data from different sources.
• Data Quality: Ensuring accuracy, completeness, and consistency.
• Processing Power: Handling large volumes of data in real-time.

Unit 2: Hadoop Framework

1. Hadoop

Hadoop is an open-source framework that allows for distributed storage and processing of large
datasets across clusters of computers. It is based on the following components:
• Hadoop Distributed File System (HDFS): A distributed file system designed
to store large files across multiple machines.
• MapReduce: A programming model that allows for the processing of large
datasets in parallel.

2. Requirements of the Hadoop Framework


• Scalability: The ability to scale across multiple nodes.
• Fault Tolerance: Data replication and task re-execution in case of node
failure.
• Cost Efficiency: It runs on commodity hardware, making it cost-effective.

3. Design Principles of Hadoop
• Distributed Computing: Split large datasets into smaller chunks that are
processed in parallel.
• Data Locality: Processing data where it is stored to minimize data transfer.
• Fault Tolerance: Ensure reliability by replicating data and tasks.

4. Hadoop Components
• HDFS: Stores data across multiple nodes.
• YARN: Resource manager that allocates resources to applications.
• MapReduce: Executes the data processing tasks.

5. Hadoop 1 vs. Hadoop 2


• Hadoop 1 uses a single JobTracker for resource management, which can be
a bottleneck.
• Hadoop 2 introduces YARN (Yet Another Resource Negotiator), allowing for
better resource management and improved scalability.

6. Hadoop Daemons
• NameNode: Manages the HDFS metadata.
• DataNode: Stores the actual data.
• JobTracker: Manages MapReduce jobs (Hadoop 1).
• TaskTracker: Executes MapReduce tasks (Hadoop 1).

7. MapReduce Programming

MapReduce is a programming model for processing large datasets. It has two main stages:
• Map: Transforms input data into key-value pairs.
• Reduce: Aggregates the key-value pairs to produce final results.
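
As a concrete illustration, here is a minimal word-count sketch in the Hadoop Streaming style, where Python scripts read from stdin and write tab-separated key-value pairs to stdout. The file names mapper.py and reducer.py are illustrative.

    # mapper.py -- emit (word, 1) for every word in the input
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop sorts mapper output by key, so equal words arrive together
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")

Such scripts are typically submitted with the Hadoop Streaming jar, passing them as the -mapper and -reducer options.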

8. MapReduce Job Variants


• Map-side Join: Join performed during the map phase.
• Reduce-side Join: Join performed during the reduce phase.
• Secondary Sorting: Ordering the values that each reducer sees for a given key, achieved by sorting on a composite key during the shuffle.
• Pipelining MapReduce Jobs: Chaining multiple MapReduce jobs together for
complex operations.
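
The reduce-side join can be sketched in the same Hadoop Streaming style. The record layouts below (customers as id,name and orders as order_id,customer_id,amount) are invented for illustration: the mapper tags each record with its source so that the reducer, which receives records grouped by the join key, can pair them up.

    # join.py -- reduce-side join sketch (invented record layouts)
    import sys

    def mapper():
        for line in sys.stdin:
            fields = line.rstrip("\n").split(",")
            if len(fields) == 2:                      # customer record: id,name
                cust_id, name = fields
                print(f"{cust_id}\tC\t{name}")
            else:                                     # order record: order_id,customer_id,amount
                order_id, cust_id, amount = fields
                print(f"{cust_id}\tO\t{order_id}:{amount}")

    def reducer():
        key, names, orders = None, [], []
        def emit():
            for n in names:                           # cross the two buffered sides for one key
                for o in orders:
                    print(f"{key}\t{n}\t{o}")
        for line in sys.stdin:
            k, tag, value = line.rstrip("\n").split("\t", 2)
            if k != key:
                if key is not None:
                    emit()
                key, names, orders = k, [], []
            (names if tag == "C" else orders).append(value)
        if key is not None:
            emit()

    if __name__ == "__main__":
        # run as "python join.py map" or "python join.py reduce"
        (mapper if sys.argv[1] == "map" else reducer)()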

Unit 3: HDFS (Hadoop Distributed File System)

1. The Design of HDFS

HDFS is designed to store large datasets across multiple nodes in a cluster. It uses block-level
replication to ensure fault tolerance.

2. HDFS Concepts
• Blocks: Data is split into blocks, typically 128 MB or 256 MB in size.
• Replication: Data is replicated across multiple nodes (default is 3 copies).
• NameNode: Keeps track of file metadata and block locations.
• DataNode: Stores the actual data blocks.
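
A small worked example of how these settings interact (the file size is arbitrary):

    # With 128 MB blocks and the default replication factor of 3,
    # a 1 GiB file is split into 8 blocks stored as 24 block replicas.
    import math

    file_size = 1 * 1024**3          # 1 GiB file
    block_size = 128 * 1024**2       # 128 MiB block size
    replication = 3                  # default replication factor

    blocks = math.ceil(file_size / block_size)
    print(blocks, "blocks,", blocks * replication, "replicas stored across the cluster")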

3. Command Line Interface (CLI)


• hdfs dfs -ls: List files in HDFS.
• hdfs dfs -put: Upload data to HDFS.
• hdfs dfs -get: Download data from HDFS.

4. Hadoop File System Interfaces

HDFS can be accessed using command-line tools, APIs, or frameworks such as Hive and Pig.

5. Data Flow

Data is first ingested into HDFS, processed with MapReduce, and the results are written back to HDFS (or another distributed store) for downstream analysis.

6. Data Ingestion with Flume and Sqoop


• Flume: Used to collect log data from various sources and store it in HDFS.
• Sqoop: Transfers bulk data between relational databases and HDFS.

7. Hadoop I/O
• Compression: Reduces storage space.
• Serialization: Formats data for efficient transfer.
• Avro: A serialization framework used in Hadoop.
• File-Based Data Structures: Formats such as SequenceFile and MapFile that organize records for efficient storage and retrieval.
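
As a small, hedged illustration of Avro serialization from Python, the sketch below uses the third-party fastavro package; the schema and file name are made up for the example.

    # Write and read records with an Avro schema (fastavro assumed installed).
    from fastavro import writer, reader, parse_schema

    schema = parse_schema({
        "name": "ClickEvent",
        "type": "record",
        "fields": [
            {"name": "user", "type": "string"},
            {"name": "timestamp", "type": "long"},
        ],
    })

    records = [{"user": "alice", "timestamp": 1700000000}]

    with open("events.avro", "wb") as out:    # the schema is stored alongside the data
        writer(out, schema, records)

    with open("events.avro", "rb") as inp:
        for rec in reader(inp):
            print(rec)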

Unit 4: Spark Framework and Data Analysis with Spark Shell

1. Introduction to GPU Computing

GPU computing leverages the parallel processing capabilities of GPUs to accelerate computations. CUDA (Compute Unified Device Architecture) is NVIDIA's programming model for general-purpose GPU programming.

2. CUDA Programming Model

CUDA allows developers to write code that runs on NVIDIA GPUs. It uses parallel threads to
divide the workload and maximize performance.

3. CUDA API

The CUDA API provides functions to manage memory, launch kernels, and synchronize
operations. Key functions include:
• cudaMalloc(): Allocates memory on the GPU.
• cudaMemcpy(): Transfers data between CPU and GPU.
• cudaFree(): Frees GPU memory.
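
Since the other sketches in these notes use Python, here is a rough mapping of those C API calls onto Numba's CUDA support (an assumption; the notes themselves describe the C API). The array contents are arbitrary.

    # Rough Python (Numba) counterparts of the CUDA memory-management calls above.
    import numpy as np
    from numba import cuda

    host_a = np.arange(1024, dtype=np.float32)

    d_a = cuda.to_device(host_a)          # ~ cudaMalloc + cudaMemcpy (host -> device)
    d_out = cuda.device_array_like(d_a)   # ~ cudaMalloc (uninitialised device buffer)

    # ... launch kernels that read d_a and write d_out ...

    result = d_out.copy_to_host()         # ~ cudaMemcpy (device -> host)
    del d_a, d_out                        # device buffers are released when dereferenced (~ cudaFree)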

4. Simple Matrix Multiplication in CUDA

Matrix multiplication is a common CUDA application: each thread typically computes one element of the output matrix. Tiling the inputs through shared memory reduces global-memory traffic and improves performance.
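
A minimal matrix-multiplication sketch, written with Numba's CUDA support to stay in Python rather than CUDA C (an assumption of these notes' examples): one thread computes one output element, without the shared-memory tiling optimisation mentioned above.

    # Naive CUDA matrix multiplication: one thread per element of C.
    import numpy as np
    from numba import cuda

    @cuda.jit
    def matmul(A, B, C):
        row, col = cuda.grid(2)                      # global thread coordinates
        if row < C.shape[0] and col < C.shape[1]:
            acc = 0.0
            for k in range(A.shape[1]):
                acc += A[row, k] * B[k, col]
            C[row, col] = acc

    n = 256
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)
    C = np.zeros((n, n), dtype=np.float32)

    threads = (16, 16)                               # threads per block
    blocks = ((n + 15) // 16, (n + 15) // 16)        # enough blocks to cover the matrix
    matmul[blocks, threads](A, B, C)                 # Numba copies the arrays to and from the GPU
    print(np.allclose(C, A @ B, atol=1e-3))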

5. CUDA Memory Model

CUDA memory includes several types:


• Global Memory: Slow but accessible by all threads.
• Shared Memory: Fast memory shared between threads in a block.
• Local Memory: Private to each thread, but despite its name it resides in device memory, so access is relatively slow; registers are the fastest per-thread storage.

6. Spark Framework Overview

Spark is an open-source distributed computing system for processing Big Data. It supports batch
processing and real-time analytics.

7. Components of Spark
• Spark Core: Handles scheduling and memory management.
• Spark SQL: Supports querying structured data.
• Spark Streaming: Processes real-time data.
• MLlib: Provides machine learning algorithms.
• GraphX: Offers graph processing.

8. Writing Spark Applications

Spark applications can be written in Scala, Python (PySpark), Java, or R. The process typically
involves:
1. Create a SparkSession to interact with Spark.
2. Load data from sources such as HDFS or local files.
3. Apply transformations such as map() and filter().
4. Trigger actions such as collect() or saveAsTextFile().
5. Stop the SparkSession.
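
A minimal PySpark sketch of these steps; the file name, input path, and word-count logic are illustrative, not taken from the notes.

    # word_count.py -- run locally with python, or on a cluster via spark-submit
    from pyspark.sql import SparkSession

    # 1. Create a SparkSession to interact with Spark.
    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # 2. Load data (here a text file, which could equally be an HDFS path).
    lines = spark.read.text("hdfs:///data/input.txt").rdd.map(lambda r: r[0])

    # 3. Transformations: split lines into words and count occurrences.
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    # 4. Action: bring the results back to the driver (fine for small outputs).
    for word, count in counts.collect():
        print(word, count)

    # 5. Stop the SparkSession.
    spark.stop()

On a cluster, the same script would be launched with spark-submit as described in the next section.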

9. Spark Execution

Spark applications can run in various cluster environments like YARN, Mesos, or Kubernetes. The
application is executed using the spark-submit command.

10. Advantages of Spark Over MapReduce


• Spark keeps intermediate results in memory, making it significantly faster than Hadoop MapReduce for iterative and interactive workloads.
• Spark supports both batch and real-time processing.
• It provides high-level APIs for easier development.
