Big Data Analytics Unit Wise Short Note
Big Data analytics involves the use of advanced analytic techniques to process and analyze large
datasets. This can include statistical analysis, machine learning, and predictive modeling.
1. Hadoop
Hadoop is an open-source framework that allows for distributed storage and processing of large
datasets across clusters of computers. It is based on the following components:
• Hadoop Distributed File System (HDFS): A distributed file system designed
to store large files across multiple machines.
• MapReduce: A programming model that allows for the processing of large
datasets in parallel.
4. Hadoop Components
• HDFS: Stores data across multiple nodes.
• YARN: Resource manager that allocates resources to applications.
• MapReduce: Executes the data processing tasks.
6. Hadoop Daemons
• NameNode: Manages the HDFS metadata.
• DataNode: Stores the actual data.
• JobTracker: Manages MapReduce jobs (Hadoop 1).
• TaskTracker: Executes MapReduce tasks (Hadoop 1).
7. MapReduce Programming
MapReduce is a programming model for processing large datasets. It has two main stages:
• Map: Transforms input data into key-value pairs.
• Reduce: Aggregates the key-value pairs to produce final results.
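As a concrete sketch of the two stages, the scripts below implement word count as Hadoop Streaming jobs in Python (an assumption made to keep all examples in one language; production MapReduce jobs are more often written against the Java API). The mapper emits (word, 1) pairs and the reducer sums the counts for each word, relying on the framework to sort the pairs by key between the two stages.

    # mapper.py -- Map stage: emit a (word, 1) pair for every word in the input
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Reduce stage: input arrives sorted by key, so counts for
    # the same word are adjacent and can be summed in a single pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")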
HDFS is designed to store large datasets across multiple nodes in a cluster. It uses block-level
replication to ensure fault tolerance.
2. HDFS Concepts
• Blocks: Data is split into blocks, typically 128 MB or 256 MB in size.
• Replication: Data is replicated across multiple nodes (default is 3 copies).
• NameNode: Keeps track of file metadata and block locations.
• DataNode: Stores the actual data blocks.
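A rough back-of-the-envelope sketch of how block size and replication interact, assuming the defaults above and a hypothetical 1 GB file:

    import math

    file_size_mb = 1024     # hypothetical 1 GB file
    block_size_mb = 128     # default HDFS block size
    replication = 3         # default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)   # 8 logical blocks
    replicas = blocks * replication                    # 24 stored block copies
    raw_storage_mb = file_size_mb * replication        # about 3 GB of raw cluster space
    print(blocks, replicas, raw_storage_mb)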
HDFS can be accessed using command-line tools, APIs, or frameworks such as Hive and Pig.
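For example, a Python client can read a file from HDFS through the pyarrow HadoopFileSystem API. This is only a sketch: the host, port, and path are placeholders, and pyarrow needs the native libhdfs library available on the client machine.

    from pyarrow import fs

    # connect to the NameNode (placeholder host and port)
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

    # stream the first kilobyte of a hypothetical file stored in HDFS
    with hdfs.open_input_stream("/data/logs/part-00000.txt") as f:
        head = f.read(1024)
    print(head[:80])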
5. Data Flow
Data is first ingested into HDFS, processed with MapReduce, and the results are written back to HDFS or another distributed storage system.
7. Hadoop I/O
• Compression: Reduces storage space.
• Serialization: Formats data for efficient transfer.
• Avro: A serialization framework used in Hadoop (see the sketch after this list).
• File-Based Data Structures: Formats such as SequenceFile and MapFile that organize records for efficient storage and retrieval.
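As a small illustration of Avro serialization, the sketch below writes and reads a few records with the fastavro package (an assumption; inside Hadoop, Avro data is usually handled by the Java Avro libraries or by tools such as Hive). The schema and field names are invented for the example.

    from fastavro import parse_schema, reader, writer

    # a made-up record schema for click events
    schema = parse_schema({
        "name": "Click",
        "type": "record",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "url", "type": "string"},
            {"name": "timestamp", "type": "long"},
        ],
    })

    records = [
        {"user_id": "u1", "url": "/home", "timestamp": 1700000000},
        {"user_id": "u2", "url": "/cart", "timestamp": 1700000060},
    ]

    # serialize the records to a compact binary Avro file ...
    with open("clicks.avro", "wb") as out:
        writer(out, schema, records)

    # ... and read them back
    with open("clicks.avro", "rb") as inp:
        for record in reader(inp):
            print(record)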
CUDA allows developers to write code that runs on NVIDIA GPUs, dividing the workload across many parallel threads to maximize performance.
3. CUDA API
The CUDA API provides functions to manage memory, launch kernels, and synchronize
operations. Key functions include:
• cudaMalloc(): Allocates memory on the GPU.
• cudaMemcpy(): Transfers data between CPU and GPU.
• cudaFree(): Frees GPU memory.
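The functions above belong to the C API. As a rough sketch of the same allocate / copy / launch / copy-back pattern, the snippet below uses Numba's CUDA bindings from Python (an assumption made to keep all examples in one language): cuda.to_device and copy_to_host play the roles of cudaMalloc()/cudaMemcpy(), and device arrays are released automatically instead of through an explicit cudaFree().

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)                  # global thread index
        if i < out.shape[0]:              # guard the extra threads in the last block
            out[i] = a[i] + b[i]

    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)

    d_a = cuda.to_device(a)               # allocate on the GPU and copy host -> device
    d_b = cuda.to_device(b)
    d_out = cuda.device_array_like(a)     # allocate the output on the GPU only

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](d_a, d_b, d_out)   # kernel launch

    result = d_out.copy_to_host()         # copy the result device -> host
    print(result[:5])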
Matrix multiplication is a common CUDA application in which each thread computes a small part of the result. Staging tiles of the input matrices in shared memory reduces global-memory traffic and improves performance.
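A sketch of such a tiled kernel, again with Numba's CUDA bindings (an assumption): each thread computes one element of C, and every block stages TPB x TPB tiles of A and B in shared memory so each global-memory element is read far fewer times.

    import numpy as np
    from numba import cuda, float32

    TPB = 16   # tile width = threads per block in each dimension

    @cuda.jit
    def matmul_tiled(A, B, C):
        # shared-memory tiles reused by every thread in the block
        sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
        sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

        x, y = cuda.grid(2)                        # row and column of C for this thread
        tx, ty = cuda.threadIdx.x, cuda.threadIdx.y

        acc = float32(0.0)
        for t in range((A.shape[1] + TPB - 1) // TPB):
            # each thread loads one element of the current A and B tiles
            if x < A.shape[0] and t * TPB + ty < A.shape[1]:
                sA[tx, ty] = A[x, t * TPB + ty]
            else:
                sA[tx, ty] = 0.0
            if t * TPB + tx < B.shape[0] and y < B.shape[1]:
                sB[tx, ty] = B[t * TPB + tx, y]
            else:
                sB[tx, ty] = 0.0
            cuda.syncthreads()                     # wait until the whole tile is loaded
            for j in range(TPB):
                acc += sA[tx, j] * sB[j, ty]
            cuda.syncthreads()                     # wait before the tile is overwritten
        if x < C.shape[0] and y < C.shape[1]:
            C[x, y] = acc

    M, K, N = 256, 256, 256
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    d_C = cuda.device_array((M, N), dtype=np.float32)

    blocks = ((M + TPB - 1) // TPB, (N + TPB - 1) // TPB)
    matmul_tiled[blocks, (TPB, TPB)](cuda.to_device(A), cuda.to_device(B), d_C)
    print(np.allclose(d_C.copy_to_host(), A @ B, rtol=1e-3))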
Spark is an open-source distributed computing system for processing Big Data. It supports batch
processing and real-time analytics.
7. Components of Spark
• Spark Core: Handles scheduling and memory management.
• Spark SQL: Supports querying structured data.
• Spark Streaming: Processes real-time data.
• MLlib: Provides machine learning algorithms.
• GraphX: Offers graph processing.
Spark applications can be written in Scala, Python (PySpark), Java, or R. The process typically
involves:
1. Create a SparkSession to interact with Spark.
2. Load data from sources like HDFS or local files.
3. Apply transformations like map() and filter().
4. Trigger actions like collect() and save().
5. Stop the SparkSession.
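A minimal PySpark word-count sketch of those five steps (the input path is hypothetical; take() is used here in place of collect() to keep the output small):

    from pyspark.sql import SparkSession

    # 1. Create a SparkSession
    spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()

    # 2. Load data (a local text file here; an hdfs:// path works the same way)
    lines = spark.read.text("data/sample.txt")

    # 3. Transformations: split lines into words, drop empties, count each word
    counts = (lines.rdd
              .flatMap(lambda row: row.value.split())
              .filter(lambda word: word != "")
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    # 4. Action: bring the first few counts back to the driver
    for word, count in counts.take(10):
        print(word, count)

    # 5. Stop the SparkSession
    spark.stop()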
9. Spark Execution
Spark applications can run on cluster managers such as YARN, Mesos, or Kubernetes. An application is submitted for execution using the spark-submit command.