
Big Data Analytics

The document outlines a course on big data analytics using Hadoop and Spark. It covers setting up Hadoop clusters, using MapReduce for data intensive problems, Spark for iterative processing integrated with Hadoop, and connecting Hadoop with Python using PySpark.

Uploaded by

Tamal Dey
Copyright
© All Rights Reserved

Big Data Analytics

Course Objectives

The objectives of the course are to:

• Provide a comprehensive understanding of Hadoop and its ecosystem.
• Provide an in-depth understanding of MapReduce and its essential role in Hadoop.
• Introduce Spark, a fast and flexible big data processing framework that can run on top of Hadoop.
• Develop an understanding of PySpark and the Spark ecosystem.

Course Outcome

At the end of the course, the student should be able to:

• Set up Hadoop clusters and manage and manipulate data using HDFS commands.
• Design algorithms to solve data-intensive problems using the MapReduce paradigm.
• Use Spark for iterative processing and integrate it with Hadoop.
• Connect Hadoop with Python using PySpark.

Course Overview

Big data analytics is the process of examining large and varied datasets, often called "big data,"
to uncover hidden patterns, correlations, trends, and other insights that help organizations make
better-informed decisions. It relies on advanced analytical techniques, technologies, and tools to
process, analyze, and interpret vast amounts of data from diverse sources.

Unit 1: Introduction to the Hadoop Ecosystem

Motivation for Hadoop, Big Data Characteristics, Challenges with Traditional Systems, Hadoop’s
History, Core Hadoop Concepts, Hadoop Clusters, What Hadoop Is, What Features the Hadoop
Distributed File System (HDFS) Provides, Architecture, Features, Goals and Advantages of HDFS.

7 Hours

Unit 2: MapReduce

Why MapReduce Is Essential in Hadoop, Processing Daemons of Hadoop, Input Split, MapReduce
Life Cycle, MapReduce Programming Model, Job Tracker, Task Tracker, Communication
Mechanism of Job Tracker and Task Tracker, Input Format Class, Record Reader Class,
Different Phases of the MapReduce Algorithm.

7 Hours
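
The phases of the MapReduce algorithm listed in Unit 2 can be illustrated with a minimal word-count sketch. This is plain single-process Python, not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made for illustration, and a real Hadoop job would distribute each phase across the cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(record: str) -> Iterator[Tuple[str, int]]:
    """Map: emit one (word, 1) pair for every word in an input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle and sort: group all emitted values by key, ordered by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the list of values for one key into a single count."""
    return key, sum(values)


def word_count(records: Iterable[str]) -> Dict[str, int]:
    """Run the three phases in sequence over an in-memory list of records."""
    mapped = (pair for record in records for pair in map_phase(record))
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical input; in Hadoop each record would come from an input split.
    lines = ["big data big insights", "data at scale"]
    print(word_count(lines))
```

Keeping map, shuffle, and reduce as separate functions mirrors the way the framework separates the phases: only the map and reduce logic is written by the programmer, while grouping by key is handled by the framework.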

Unit 3: Spark for Hadoop

Introduction to Apache Spark, Features of Spark, Spark built on Hadoop, Components of Spark,
Resilient Distributed Datasets, Data Sharing using Spark RDD, Iterative Operations on Spark RDD,
Interactive Operations on Spark RDD, Spark shell

7 Hours

Unit 5: PySpark

Introduction to SparkContext, Spark RDD, Spark Caching, Common Transformations and Actions,
Spark Functions, Key-Value Pairs, Aggregate Functions, Joins in Spark, Spark DataFrame, Getting
Started with Spark SQL, Basic Transformations using Spark SQL

7 Hours

Reference Books

1. “Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale”, Tom White,
O’Reilly / Shroff Publishers, 4th Edition, 2015

2. “Spark: The Definitive Guide: Big Data Processing Made Simple”, Bill Chambers, Matei
Zaharia, O’Reilly Media, 2nd Edition, 2020

3. “Interactive Spark using PySpark”, Benjamin Bengfort, Jenny Kim, O’Reilly Media, 2nd
Edition, 2016
