BDA2023 Outline
[Figure: Google Trends interest over time for “big data”]

The Google Trends chart above shows how “big data” is a phenomenon that is just a decade old as of this writing, and whose search volume has already peaked. This course looks beyond the hype of big data², into the opportunities and challenges for businesses in a world driven by information that must be extracted from large, heterogeneous data sources.

² Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.
Big data has several dimensions, the most obvious among them being volume (as in big!).
Starting in the 1970s, data stored on computers was organised into tabular “entities”, with linking relationships. Commercial database systems such as Oracle’s RDBMS and the open-source MySQL implemented this entity-relationship (E-R) model of “structured” data. During the 1980s, data warehouses helped store and analyse large amounts of data, so that managers could make informed business decisions.
The advent of the Internet during the 1990s was a big bang event in the world of data. Notably,
web pages did not follow a set format. The first decade of the 2000s witnessed the rise of social
networks such as Facebook and Twitter, whose messages did not adhere to a structure.
Consequently, special analytical techniques had to be devised to accommodate this variety of “unstructured” data, consisting of text, audio, imagery, video, and so on.
In the present decade, watches, phones, and a host of other sensors have come to dominate
the realm of data generation. With storage on the cloud becoming cheap and communication
scaling up to 5G speeds, the age of IoT has finally dawned upon us. Computational platforms
will have to deal with real-time data that arrive at tremendous velocity. Handling this aspect
of big data is key to the success of companies such as Uber and Tesla.
The ubiquity of mobile devices such as phones cuts both ways. On the one hand, information about anything is literally at one’s fingertips. On the other hand, recent advances in AI have made it effortless to create content that appears credible and can be passed off as real news. The 2016 US election as well as the Brexit vote were influenced by targeted campaigns on Facebook whose veracity was in question.
This course, titled Big Data Analytics, has been designed to cover these dimensions:
a. Explore the 4 traditional Vs of big data – volume, velocity, variety and veracity
b. Execute computations on distributed platforms that facilitate big data analytics
c. Examine applications of big data to retail, finance, and healthcare
Pedagogy
The course employs a mix of lectures and hands-on exploration in class. The book by Davy Cielen et al., Introducing Data Science, Manning (2016), will serve as a companion text.
We shall use a session to cover the basics of Python; students may strengthen their proficiency
in the language via a suitable resource like a MOOC.
We now have a fully functioning Big Data Cluster on the IIMB Campus, on which you shall be
provisioned an account. Any code that cannot be run on a laptop (due to the size of the data or
other considerations) can be run on the cluster.
To facilitate even larger implementations, you could open a guest account on Microsoft Azure/
Google Cloud/Amazon AWS.
A term project (max. group size of 4) will help bring all the concepts together.
Evaluation
Midterm: 35%
Final: 35%
Project: 20%
Class Participation: 10%
Constructing exams is an arduous task; there are no make-ups for any component.
Bookshelf
1. Easley, David, & Kleinberg, Jon. Networks, Crowds, and Markets. Cambridge (2010)
2. Mayer-Schönberger, V., & Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt (2013)
3. Ryza, Sandy, et al. Advanced Analytics with Spark, 2nd Edition. O’Reilly (2017)
The book Big Data by Viktor Mayer-Schönberger & Kenneth Cukier provides a good backdrop for our course discussions. Major shifts in our thinking around data have been facilitated by cheap access to abundant computational power and storage for large volumes of data; by greater variety in the types of data analysed, structured as well as unstructured; by the velocity with which data arrive, thanks to IoT devices; and finally, by the question of veracity, when targeted news and advertising campaigns can twist the truth at will.
These four Vs shall be discussed in detail during the rest of the course.
3 Introduction to Python
All four Vs of big data can be demonstrated and better understood on a laptop. The lingua franca for our demos will be Python, because commercial-grade open-source libraries support all these aspects. We will use Jupyter to run Python code on our personal laptops. We will also show how to run the same code inside a browser using Google Colab, which makes no installation demands.
We will cover coding basics such as statements, functions, and branching logic, and survey the capabilities of the NumPy library. We will use the Pandas library to work with structured, tabular data. To visualise the results, we will begin with the basic Matplotlib library and proceed to examine the richer features of the Plotly and Dash libraries.
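As a taste of this toolchain, here is a minimal sketch; the file name sales.csv and its columns date and revenue are hypothetical stand-ins for whatever dataset we use in class.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a (hypothetical) CSV of daily sales into a Pandas DataFrame.
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Aggregate with Pandas: total revenue per calendar month.
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()
print(monthly.describe())  # quick numeric summary

# Visualise with Matplotlib; Plotly and Dash offer interactive equivalents.
monthly.plot(kind="bar", title="Monthly revenue")
plt.tight_layout()
plt.show()
```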
Large datasets and lengthy computations pose hurdles when we are constrained to work with a single machine or a stack of servers in a datacentre. Businesses such as UPS and Vodafone, which must support users at scale, are moving their applications to the “elastic” cloud.
In this session, we go over structured data representation and learn how to query
tabular data using SQL. Next, we discuss Google’s cloud offerings, and execute
rich SQL-style queries on large, structured datasets with BigQuery.
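To give a flavour of what such a query looks like from Python, here is a sketch using Google's official google-cloud-bigquery client library; it assumes the library is installed and Google Cloud credentials are configured, and it queries one of Google's free public sample tables.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up configured credentials and project

# Standard SQL over a public table: the ten most common US baby names.
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# The query runs on Google's servers; we only iterate over the results.
for row in client.query(sql).result():
    print(row["name"], row["total"])
```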
The task of handling large volumes of data makes the case for parallelisation.
Through the 1980s, companies like Sequent supported this requirement by
placing multiple processors on a single motherboard. The 1990s and 2000s saw
the rise of companies like VMware, whose virtualisation platforms enabled
resource sharing across servers in a datacentre.
Over the last decade, businesses large and small (e.g., Vodafone) have begun
migrating their operations into public clouds such as GCP, AWS and Azure.
These platforms “elastically” spread storage and compute across thousands of
machines. In the first session, we describe the MapReduce architecture, which
is foundational to these business models.
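The idea is easiest to see in miniature. Below is a single-machine sketch of the two phases using word count, the canonical MapReduce example; a real cluster shards these same phases across thousands of machines.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one ("word", 1) pair per occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return (key, sum(values))

docs = ["big data is big", "data about data"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```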
10 Spark Overview
While Hadoop is excellent at parallelising data storage, its tasks are executed in batch mode, i.e., at one go. A more versatile approach is to execute tasks in stages, because the data may have to be processed further to derive insights. Spark is a powerful platform that helps us work with data loaded into memory, so that intermediate results need not be written back to disk between stages.
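A minimal PySpark sketch of this style of work follows, assuming the pyspark package is installed; the file trips.csv and its columns city and fare are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) CSV of trips; the schema is inferred from the data.
trips = spark.read.csv("trips.csv", header=True, inferSchema=True)

# Transformations are lazy; nothing runs until an action is called.
by_city = trips.groupBy("city").agg(F.avg("fare").alias("avg_fare"))

by_city.cache()   # keep the result in memory for subsequent stages
by_city.show(10)  # action: triggers the actual computation

spark.stop()
```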
13 Applications to Finance
In this session, we learn how to estimate financial risk for a mutual fund using the Value at Risk (VaR) metric, calculated empirically via a Monte Carlo simulation of stock returns for thousands of instruments, with a large number of trials.
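Here is a stripped-down sketch of the empirical approach using NumPy; the return distribution and its parameters are illustrative assumptions, and the classroom version simulates returns per instrument across thousands of instruments.

```python
import numpy as np

rng = np.random.default_rng(42)

n_trials = 100_000
portfolio_value = 1_000_000

# Assume (for illustration only) that one-day portfolio returns are
# normal with 0.05% mean and 2% standard deviation.
returns = rng.normal(loc=0.0005, scale=0.02, size=n_trials)
pnl = portfolio_value * returns  # simulated one-day profit and loss

# 95% VaR: the loss exceeded in only 5% of the simulated trials.
var_95 = -np.percentile(pnl, 5)
print(f"1-day 95% VaR: {var_95:,.0f}")
```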
Across two sessions, we use the Gephi package to load and explore networks from various business spheres. Gephi also provides a mechanism to construct random networks.
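Gephi is a point-and-click desktop tool, but the same random networks can also be generated programmatically. The sketch below uses the networkx package (our addition for illustration, not a course requirement) and writes a GEXF file that Gephi can open.

```python
import networkx as nx

# Build an Erdős–Rényi random network: 100 nodes, each possible edge
# present independently with probability 0.05.
G = nx.erdos_renyi_graph(n=100, p=0.05, seed=7)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Export in GEXF format, which Gephi can load and lay out visually.
nx.write_gexf(G, "random_network.gexf")
```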
16 Social Networks
We have witnessed the emergence of social network platforms since the turn of the century. In contrast with random networks, real-world networks tend to contain hubs. We examine the role of selection and socialisation in link and community formation.
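A small sketch of this contrast, again with networkx (an illustrative assumption): an Erdős–Rényi random network versus a preferential-attachment network, which grows hubs the way many real social networks do.

```python
import networkx as nx

# Two 1000-node networks with a comparable number of edges.
random_net = nx.erdos_renyi_graph(n=1000, p=0.004, seed=1)  # random
pref_net = nx.barabasi_albert_graph(n=1000, m=2, seed=1)    # grows hubs

for name, g in [("random", random_net), ("preferential", pref_net)]:
    degrees = [d for _, d in g.degree()]
    print(name, "edges:", g.number_of_edges(), "max degree:", max(degrees))

# The preferential-attachment network's maximum degree is far larger:
# a handful of hub nodes accumulate a disproportionate share of links.
```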
The GraphX library allows us to handle networks small and large. We can obtain quantities such as triangle counts, shortest paths, and connected components, and apply them to identify hubs in a large bikeshare dataset.
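GraphX itself exposes a Scala API; to stay in Python, here is a sketch using the GraphFrames package, which is our assumption for illustration (the graphframes package and its Spark jar must be installed).

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphs").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # needed below

# Toy graph: vertices need an "id" column; edges need "src" and "dst".
v = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
e = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                          ["src", "dst"])
g = GraphFrame(v, e)

g.triangleCount().show()        # triangles each vertex belongs to
g.connectedComponents().show()  # component label per vertex
g.inDegrees.show()              # high in-degree vertices are hub candidates

spark.stop()
```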
Reading: IoT: Real-time Data Processing and Analytics using Spark / Kafka
Around the turn of this century, the field of mass communications took a sharp turn: user-generated content was at the heart of this revolution. The mechanism of targeted advertising pioneered by the likes of Google was improved upon by Facebook, with its incisive insights into the intimate details of individuals.
However, the user’s privacy was sacrificed along the way, and one’s personal
data became a marketable commodity. An outfit by the name of Cambridge
Analytica capitalised on Facebook’s knowledge of its users, and played a
divisive role in the 2016 US elections as well as the Brexit campaign.
Reading: Grassegger, H. & Krogerus, M. The Data That Turned the World Upside Down. Motherboard (Jan 2017)