BDDA - Course Outline
Course Name: Big Data and Data Analytics for Managers (Using Python)
Credit: 3.0
Term: 4
Academic Year: 2022-2023
Faculty: Prof. Ashok Kumar Harnal and Mr. Anuj Saini (20 hrs, for BDA-02)
Office Contact No.: 8750893093
Email: [email protected]
Introduction
This course has two objectives. The first is to build up each student's project profile on Kaggle/GitHub using
important techniques. The second is to analyze big data on Spark, a unified platform for data analytics. We begin by
covering two very important machine learning techniques that are often used in the data analytics community.
Learning to optimize hyperparameters, especially when there are many of them, is very important in any
model-building exercise. We briefly cover Hadoop, a big-data storage platform, and then move on to analyzing data
on this platform using Spark. We also cover streaming analytics, that is, analyzing data in motion. Streaming
analytics has numerous applications (for example, in social media analytics), and a number of business models (for
example, those of Uber or of smart cities) are built entirely on streaming technologies. This course assumes some
prior basic working knowledge of two Python libraries, pandas and NumPy. This is a project-oriented Big Data course
with Python as the primary language.
Students are expected to have laptops with a minimum of 8 GB of RAM and are strongly advised to upgrade to 16 GB.
Windows 10, macOS, or Ubuntu is acceptable as the operating system.
Text Book:
1. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition (2019), by Aurélien Géron
Reference Books:
1. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists, by Alice Zheng and
Amanda Casari
2. Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia
3. Hadoop: The Definitive Guide, by Tom White
Course Pedagogy: This is a project-based and lab-oriented course. For every topic there is a project. Students are
first exposed to a problem, then study the data and learn the techniques and tools needed to solve the problem, and
finally build a model and present a solution. For projects related to Big Data and streaming analytics we will use
virtual machines.
Evaluation Components:
Session Plan: (Each session is 90 minutes long unless otherwise specified)
2-3 | Using pipes in modeling, stacking classifiers | Otto project from Kaggle | Learn to use pipelines in any predictive modeling project
3-4 | Structure in data: t-SNE and UMAP | Otto project from Kaggle | Learn how to discover whether data has some structure or is mostly random
15-minute online quiz
18-19 | Spark-Kafka data pipeline | Analyzing streaming data over a pipeline | Develop a simple pipeline for streaming data
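The idea behind the streaming sessions, analyzing data while it is in motion rather than after it has been stored, can be illustrated without any big-data tooling. The following is only a minimal sketch in plain Python and is not the course's Spark-Kafka pipeline: the event source `event_stream`, its sample readings, and the function `sliding_average` are all hypothetical names introduced for illustration. It computes a running average over the most recent few seconds of events, updating the result as each event arrives.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, observed value)


def event_stream() -> Iterator[Event]:
    """Hypothetical stand-in for a message source such as a Kafka topic."""
    readings = [(0, 10.0), (1, 12.0), (2, 11.0), (5, 20.0), (6, 22.0), (9, 5.0)]
    for event in readings:
        yield event  # events are handed over one at a time, never as a full list


def sliding_average(events: Iterable[Event], window_seconds: int) -> Iterator[Event]:
    """Yield (timestamp, average of values seen in the last window_seconds)."""
    window = deque()  # events currently inside the time window
    total = 0.0       # running sum of the values in the window
    for ts, value in events:
        window.append((ts, value))
        total += value
        # Evict events that have fallen out of the time window.
        while window and window[0][0] <= ts - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total / len(window)


if __name__ == "__main__":
    for ts, avg in sliding_average(event_stream(), window_seconds=3):
        print(f"t={ts}s  average over last 3s = {avg}")
```

The sketch keeps only the events inside the current window in memory, which is the property that lets streaming computations run over unbounded data. A streaming engine such as Spark applies the same windowing idea at scale and across machines.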
For official use: -
As benchmarked against the course content of the previous year, the contents of this course: (Please mark the
right option below)
(a) Is totally new
(b) Has not changed at all
(c) Has undergone less than/equal to 20% change
(d) Has undergone more than 20% change
Faculty – Prof. Ashok Harnal
Area Chair – Prof. Shilpi Jain
Manager (Academics-1)
Dean (Academics)