Big Data Analytics
Course Objectives
Course Outcome
Big data analytics is the process of examining large and varied datasets, commonly called
"big data," to uncover hidden patterns, correlations, trends, and other insights that help
organizations make better-informed decisions. It relies on advanced analytical techniques,
technologies, and tools to process, analyze, and interpret vast amounts of data from diverse
sources.
Motivation for Hadoop, Big Data Characteristics, Challenges with Traditional Systems, Hadoop’s
History, Core Hadoop Concepts, Hadoop Clusters, What Hadoop Is, Features Provided by the Hadoop
Distributed File System (HDFS), Architecture, Features, Goals and Advantages of HDFS
7 Hours
Unit 2: Map Reduce
Why MapReduce Is Essential in Hadoop, Processing Daemons of Hadoop, Input Split, MapReduce
Life Cycle, MapReduce Programming Model, Job Tracker, Task Tracker, Communication Mechanism
of Job Tracker and Task Tracker, Input Format Class, Record Reader Class, Different Phases of
the MapReduce Algorithm.
7 Hours
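As an illustration of the map, shuffle, and reduce phases listed in this unit, the following minimal sketch counts words in plain Python. It runs in a single process and does not use Hadoop or its APIs; the function names map_phase, shuffle_phase, reduce_phase, and word_count are hypothetical names chosen for this example only.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_phase(pairs):
    """Shuffle and sort: group all emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(records):
    """Run the three phases in sequence over an iterable of text records."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    lines = ["big data big insights", "data drives decisions"]
    print(word_count(lines))
```

In an actual Hadoop job the records would come from input splits, and the map and reduce functions would run as separate tasks on different nodes; here each phase is an ordinary function call so the data flow between phases is easy to follow.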
Introduction to Apache Spark, Features of Spark, Spark built on Hadoop, Components of Spark,
Resilient Distributed Datasets, Data Sharing using Spark RDD, Iterative Operations on Spark RDD,
Interactive Operations on Spark RDD, Spark Shell
7 Hours
Unit 5: PySpark
Introduction to SparkContext, Spark RDD, Spark Caching, Common Transformations and Actions,
Spark Functions, Key-Value Pairs, Aggregate Functions, Joins in Spark, Spark DataFrame, Getting
Started with Spark SQL, Basic Transformations using Spark SQL
7 Hours
Reference Books
1. “Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale”, Tom White, 4th
Edition, O’Reilly-Shroff Publishers, 2015
2. “Spark: The Definitive Guide: Big Data Processing Made Simple”, Bill Chambers, Matei
Zaharia, O’Reilly Media, 2nd Edition, 2020
3. “Interactive Spark using PySpark”, Benjamin Bengfort, Jenny Kim, O’Reilly Media, 2nd
Edition, 2016