Big Data Analytics Unit Wise Short Note
Big Data analytics involves the use of advanced analytic techniques to process and analyze large
datasets. This can include statistical analysis, machine learning, and predictive modeling.
1. Hadoop
Hadoop is an open-source framework that allows for distributed storage and processing of large
datasets across clusters of computers. It is based on the following components:
• Hadoop Distributed File System (HDFS): A distributed file system designed
to store large files across multiple machines.
• MapReduce: A programming model that allows for the processing of large
datasets in parallel.
4. Hadoop Components
• HDFS: Stores data across multiple nodes.
• YARN: Resource manager that allocates resources to applications.
• MapReduce: Executes the data processing tasks.
6. Hadoop Daemons
• NameNode: Manages the HDFS metadata.
• DataNode: Stores the actual data.
• JobTracker: Manages MapReduce jobs (Hadoop 1).
• TaskTracker: Executes MapReduce tasks (Hadoop 1).
7. MapReduce Programming
MapReduce is a programming model for processing large datasets. It has two main stages:
• Map: Transforms input data into key-value pairs.
• Reduce: Aggregates the key-value pairs to produce final results.
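As a concrete sketch of the two stages, the scripts below implement word count as Hadoop Streaming jobs in Python (an assumption made to keep all examples in one language; production MapReduce jobs are more often written against the Java API). The mapper emits (word, 1) pairs and the reducer sums the counts for each word, relying on the framework to sort the pairs by key between the two stages.

    # mapper.py -- Map stage: emit a (word, 1) pair for every word in the input
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py -- Reduce stage: input arrives sorted by key, so counts for
    # the same word are adjacent and can be summed in a single pass
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")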
HDFS is designed to store large datasets across multiple nodes in a cluster. It uses block-level
replication to ensure fault tolerance.
2. HDFS Concepts
• Blocks: Data is split into blocks, typically 128 MB or 256 MB in size.
• Replication: Data is replicated across multiple nodes (default is 3 copies).
• NameNode: Keeps track of file metadata and block locations.
• DataNode: Stores the actual data blocks.
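A rough back-of-the-envelope sketch of how block size and replication interact, assuming the defaults above and a hypothetical 1 GB file:

    import math

    file_size_mb = 1024     # hypothetical 1 GB file
    block_size_mb = 128     # default HDFS block size
    replication = 3         # default replication factor

    blocks = math.ceil(file_size_mb / block_size_mb)   # 8 logical blocks
    replicas = blocks * replication                    # 24 stored block copies
    raw_storage_mb = file_size_mb * replication        # about 3 GB of raw cluster space
    print(blocks, replicas, raw_storage_mb)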
HDFS can be accessed using command-line tools, APIs, or frameworks such as Hive and Pig.
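For example, a Python client can read a file from HDFS through the pyarrow HadoopFileSystem API. This is only a sketch: the host, port, and path are placeholders, and pyarrow needs the native libhdfs library available on the client machine.

    from pyarrow import fs

    # connect to the NameNode (placeholder host and port)
    hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

    # stream the first kilobyte of a hypothetical file stored in HDFS
    with hdfs.open_input_stream("/data/logs/part-00000.txt") as f:
        head = f.read(1024)
    print(head[:80])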
5. Data Flow
Data is first ingested into HDFS, processed with MapReduce, and the results are written back to HDFS or another distributed storage system.
7. Hadoop I/O
• Compression: Reduces storage space.
• Serialization: Formats data for efficient transfer.
• Avro: A serialization framework used in Hadoop (see the sketch after this list).
• File-Based Data Structures: Formats such as SequenceFile and MapFile that organize records for efficient storage and retrieval.
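As a small illustration of Avro serialization, the sketch below writes and reads a few records with the fastavro package (an assumption; inside Hadoop, Avro data is usually handled by the Java Avro libraries or by tools such as Hive). The schema and field names are invented for the example.

    from fastavro import parse_schema, reader, writer

    # a made-up record schema for click events
    schema = parse_schema({
        "name": "Click",
        "type": "record",
        "fields": [
            {"name": "user_id", "type": "string"},
            {"name": "url", "type": "string"},
            {"name": "timestamp", "type": "long"},
        ],
    })

    records = [
        {"user_id": "u1", "url": "/home", "timestamp": 1700000000},
        {"user_id": "u2", "url": "/cart", "timestamp": 1700000060},
    ]

    # serialize the records to a compact binary Avro file ...
    with open("clicks.avro", "wb") as out:
        writer(out, schema, records)

    # ... and read them back
    with open("clicks.avro", "rb") as inp:
        for record in reader(inp):
            print(record)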
CUDA allows developers to write code that runs on NVIDIA GPUs, dividing the workload across many parallel threads to maximize performance.
3. CUDA API
The CUDA API provides functions to manage memory, launch kernels, and synchronize
operations. Key functions include:
• cudaMalloc(): Allocates memory on the GPU.
• cudaMemcpy(): Transfers data between CPU and GPU.
• cudaFree(): Frees GPU memory.
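The functions above belong to the C API. As a rough sketch of the same allocate / copy / launch / copy-back pattern, the snippet below uses Numba's CUDA bindings from Python (an assumption made to keep all examples in one language): cuda.to_device and copy_to_host play the roles of cudaMalloc()/cudaMemcpy(), and device arrays are released automatically instead of through an explicit cudaFree().

    import numpy as np
    from numba import cuda

    @cuda.jit
    def vector_add(a, b, out):
        i = cuda.grid(1)                  # global thread index
        if i < out.shape[0]:              # guard the extra threads in the last block
            out[i] = a[i] + b[i]

    n = 1_000_000
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)

    d_a = cuda.to_device(a)               # allocate on the GPU and copy host -> device
    d_b = cuda.to_device(b)
    d_out = cuda.device_array_like(a)     # allocate the output on the GPU only

    threads_per_block = 256
    blocks = (n + threads_per_block - 1) // threads_per_block
    vector_add[blocks, threads_per_block](d_a, d_b, d_out)   # kernel launch

    result = d_out.copy_to_host()         # copy the result device -> host
    print(result[:5])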
Matrix multiplication is a common CUDA application in which each thread computes a small part of the result. Staging tiles of the input matrices in shared memory reduces global-memory traffic and improves performance.
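A sketch of such a tiled kernel, again with Numba's CUDA bindings (an assumption): each thread computes one element of C, and every block stages TPB x TPB tiles of A and B in shared memory so each global-memory element is read far fewer times.

    import numpy as np
    from numba import cuda, float32

    TPB = 16   # tile width = threads per block in each dimension

    @cuda.jit
    def matmul_tiled(A, B, C):
        # shared-memory tiles reused by every thread in the block
        sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
        sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

        x, y = cuda.grid(2)                        # row and column of C for this thread
        tx, ty = cuda.threadIdx.x, cuda.threadIdx.y

        acc = float32(0.0)
        for t in range((A.shape[1] + TPB - 1) // TPB):
            # each thread loads one element of the current A and B tiles
            if x < A.shape[0] and t * TPB + ty < A.shape[1]:
                sA[tx, ty] = A[x, t * TPB + ty]
            else:
                sA[tx, ty] = 0.0
            if t * TPB + tx < B.shape[0] and y < B.shape[1]:
                sB[tx, ty] = B[t * TPB + tx, y]
            else:
                sB[tx, ty] = 0.0
            cuda.syncthreads()                     # wait until the whole tile is loaded
            for j in range(TPB):
                acc += sA[tx, j] * sB[j, ty]
            cuda.syncthreads()                     # wait before the tile is overwritten
        if x < C.shape[0] and y < C.shape[1]:
            C[x, y] = acc

    M, K, N = 256, 256, 256
    A = np.random.rand(M, K).astype(np.float32)
    B = np.random.rand(K, N).astype(np.float32)
    d_C = cuda.device_array((M, N), dtype=np.float32)

    blocks = ((M + TPB - 1) // TPB, (N + TPB - 1) // TPB)
    matmul_tiled[blocks, (TPB, TPB)](cuda.to_device(A), cuda.to_device(B), d_C)
    print(np.allclose(d_C.copy_to_host(), A @ B, rtol=1e-3))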
Spark is an open-source distributed computing system for processing Big Data. It supports batch
processing and real-time analytics.
7. Components of Spark
• Spark Core: Handles scheduling and memory management.
• Spark SQL: Supports querying structured data.
• Spark Streaming: Processes real-time data.
• MLlib: Provides machine learning algorithms.
• GraphX: Offers graph processing.
Spark applications can be written in Scala, Python (PySpark), Java, or R. The process typically
involves:
1. Create a SparkSession to interact with Spark.
2. Load data from sources like HDFS or local files.
3. Apply transformations like map() and filter().
4. Trigger actions like collect() and save().
5. Stop the SparkSession.
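A minimal PySpark word-count sketch of those five steps (the input path is hypothetical; take() is used here in place of collect() to keep the output small):

    from pyspark.sql import SparkSession

    # 1. Create a SparkSession
    spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()

    # 2. Load data (a local text file here; an hdfs:// path works the same way)
    lines = spark.read.text("data/sample.txt")

    # 3. Transformations: split lines into words, drop empties, count each word
    counts = (lines.rdd
              .flatMap(lambda row: row.value.split())
              .filter(lambda word: word != "")
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    # 4. Action: bring the first few counts back to the driver
    for word, count in counts.take(10):
        print(word, count)

    # 5. Stop the SparkSession
    spark.stop()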
9. Spark Execution
Spark applications can run on cluster managers such as YARN, Mesos, or Kubernetes. An application is submitted for execution using the spark-submit command.