
Big Data Analytics (BDA)

Name of the Faculty: Shankar Venkatagiri

Affiliation (Institution / Organisation & Designation): Associate Professor, IIM Bangalore

Teaching Area (such as Finance & Accounting; Marketing; Production & Operations Management;
Strategy): Information Systems

This course may be offered to (PGP, FPM, PGPEM, PGPPM, EPGP; see
http://www.iimb.ernet.in/programmes): PGP BA; open to all programmes

Credits (3 credits = 30 classroom hours; 1.5 credits = 15 classroom hours;
session = 90 minutes): 3 credits

Term / Quarter (starting April / June / September / December): Term 4 of PGP-BA

Course Type (Regular: staggered across the term; Workshop¹: 3-5 continuous days): Regular

Additional information required

Are there any financial implications to this course?

¹ Workshop course: Please provide reasons as to why the course is being offered in workshop
mode and why it cannot be offered as a regular course (that is, spread over 10 weeks). As an
institution, IIMB prefers courses offered in the regular mode, since this results in a better
learning experience for the students and avoids overlapping of courses.

Course Summary

A Google Trends plot of searches for "big data" shows that the term is barely a decade old as
of this writing, and that its usage has already peaked. This course looks beyond the hype of
big data², into the opportunities and challenges for businesses in a world driven by
information that must be extracted from large, heterogeneous data sources.

There are several dimensions to big data, the most obvious among them being volume (as in big!).
Starting in the 1970s, data stored on computers was organised into tabular "entities", with
linking relationships. Commercial database systems such as Oracle’s RDBMS and open-source
MySQL implemented this E-R model of “structured” data. During the 1980s, data warehouses
helped store and analyse large amounts of data, so that managers could make informed
business decisions.

The advent of the Internet during the 1990s was a big bang event in the world of data. Notably,
web pages did not follow a set format. The first decade of the 2000s witnessed the rise of social
networks such as Facebook and Twitter, whose messages did not adhere to a structure.
Consequently, special analytical techniques had to be devised to accommodate this variety of
“unstructured” data, consisting of text, audio, imagery, video and so on.

In the present decade, watches, phones, and a host of other sensors have come to dominate
the realm of data generation. With storage on the cloud becoming cheap and communication
scaling up to 5G speeds, the age of IoT has finally dawned upon us. Computational platforms
will have to deal with real-time data that arrive at tremendous velocity. Handling this aspect
of big data is key to the success of companies such as Uber and Tesla.

The ubiquity of mobile devices like phones is a mixed blessing. On the one hand, information
about anything is literally at one's fingertips. On the other hand, recent advances in AI have
made it effortless to create content that appears credible and can be passed off as real news.
The 2016 US election as well as the Brexit vote were influenced by targeted campaigns on
Facebook, whose veracity was in question.

This course, titled Big Data Analytics, has been designed to cover these dimensions.

Pre-requisites, inclusion/exclusion criteria (if any): None

² Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and
analytics. International Journal of Information Management, 35(2), 137-144.

Learning Objectives
The course is designed with the following specific objectives and learning outcomes:

a. Explore the 4 traditional Vs of big data – volume, velocity, variety and veracity
b. Execute computations on distributed platforms that facilitate big data analytics
c. Examine applications of big data to retail, finance, and healthcare

Pedagogy
The course employs a mix of lectures and hands-on exploration in class. The book by Davy
Cielen et al., Introducing Data Science, Manning (2016), will serve as a companion.

We shall use a session to cover the basics of Python; students may strengthen their proficiency
in the language via a suitable resource like a MOOC.

We now have a fully functioning Big Data Cluster on the IIMB Campus, on which you shall be
provisioned an account. Any code that cannot be run on a laptop (due to the size of the data or
other considerations) can be run on the cluster.

To facilitate even larger implementations, you could open a guest account on Microsoft Azure,
Google Cloud, or Amazon AWS.

A term project (max. group size of 4) will help bring all the concepts together.

Course Evaluation & Grading Pattern

Midterm 35%
Final 35%
Project 20%
Class Participation 10%

Constructing exams is an arduous task. There are no make-ups for any component.

Bookshelf

1. Easley, D., & Kleinberg, J. Networks, Crowds, and Markets. Cambridge University Press (2010)
2. Mayer-Schönberger, V., & Cukier, K. Big Data: A Revolution That Will Transform How We
   Live, Work, and Think. Houghton Mifflin Harcourt (2013)
3. Ryza, S., et al. Advanced Analytics with Spark, 2nd Edition. O'Reilly (2017)

Session-wise plan
Sessions 1 & 2: An Overview of Big Data

The book Big Data by Viktor Mayer-Schönberger & Kenneth Cukier provides a good backdrop
for our course discussions. Major shifts in our thinking around data have been facilitated by
cheap access to abundant computational power and storage for large volumes of data; greater
variety in the types of data analysed, structured as well as unstructured; the velocity with
which data arrives, thanks to IoT devices; and finally, the question of veracity, when
targeted news and advertising campaigns can twist the truth at will.

These four Vs shall be discussed in detail during the rest of the course.

Video: Big Data: A revolution that will transform how we live


http://www.youtube.com/watch?v=bYS_4CWu3y8

Session 3: Introduction to Python

All four Vs of big data can be demonstrated and better understood on a laptop. The lingua
franca for our demos will be Python, because there are commercial-grade open-source libraries
to support all aspects. We will use Jupyter to run Python code on our personal laptops. We
also show how to run the same code inside a browser using Google Colab, which makes no
installation demands.

We will cover coding basics such as statements, functions, and branching logic. We then
detail the provisions of the NumPy library and use the Pandas library to work with
structured, tabular data. To visualise the results, we will begin with the basic Matplotlib
library and proceed to examine the superior features of the Plotly and Dash libraries.
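
To give a flavour of this toolchain, here is a minimal sketch that loads a table with Pandas,
summarises it with NumPy, and plots it with Matplotlib. The file name ratings.csv and its
columns are illustrative assumptions, not the in-class dataset.

```python
# A minimal sketch of the Session 3 toolchain: NumPy, Pandas, Matplotlib.
# "ratings.csv" and its columns are placeholders for whatever data we use in class.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("ratings.csv")                 # structured, tabular data
print(df.head())                                # inspect the first few rows
print("mean rating:", np.mean(df["rating"]))    # a quick NumPy summary

df["rating"].hist(bins=10)                      # distribution of ratings
plt.xlabel("rating")
plt.ylabel("count")
plt.show()
```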

Sessions 4 & 5 (VOLUME): Google Cloud and BigQuery

Large datasets and lengthy computations pose hurdles when we are constrained to work with a
single machine or a stack of servers in a datacentre. Businesses such as UPS and Vodafone,
which must support users at scale, are moving their applications to the "elastic" cloud.

In these sessions, we go over structured data representation and learn how to query
tabular data using SQL. Next, we discuss Google’s cloud offerings, and execute
rich SQL-style queries on large, structured datasets with BigQuery.
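
As a taste of what this looks like from Python, the sketch below runs an aggregate SQL query
on one of Google's public BigQuery tables using the google-cloud-bigquery client library. It
assumes the library is installed and that cloud credentials have been configured.

```python
# Sketch: running SQL on a large, structured public table with BigQuery.
# Assumes `pip install google-cloud-bigquery` and configured credentials.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT year, COUNT(*) AS births
    FROM `bigquery-public-data.samples.natality`
    GROUP BY year
    ORDER BY year
"""
for row in client.query(sql).result():   # the heavy lifting happens on Google's servers
    print(row.year, row.births)
```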

Video: Google BigQuery introduction by Jordan Tigani


https://www.youtube.com/watch?v=kKBnFsNWwYM

Big Data Analytics


Sessions 6 & 7: MapReduce and Hadoop

The task of handling large volumes of data makes the case for parallelisation.
Through the 1980s, companies like Sequent supported this requirement by
placing multiple processors on a single motherboard. The 1990s and 2000s saw
the rise of companies like VMware, whose virtualisation platforms enabled
resource sharing across servers in a datacentre.

Over the last decade, businesses large and small (e.g., Vodafone) have begun
migrating their operations into public clouds such as GCP, AWS and Azure.
These platforms “elastically” spread storage and compute across thousands of
machines. In the first session, we describe the MapReduce architecture, which
is foundational to these business models.

In the second session, we outline the Hadoop ecosystem, which is a realisation of this
architecture. Hadoop enables data tasks to be executed in parallel on a cluster of machines.
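
To make the MapReduce idea concrete, here is a sketch of the classic word-count example in
the Hadoop Streaming style, where plain Python scripts read from stdin and emit tab-separated
key-value pairs; Hadoop handles the splitting, sorting, and shuffling between the two stages.
The exact demo we run in class may differ.

```python
# mapper.py -- emits ("word", 1) for every word in its input split.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- receives mapper output sorted by key; sums counts per word.
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{count}")   # flush the finished word
        current, count = word, 0
    count += int(n)
if current is not None:
    print(f"{current}\t{count}")           # flush the last word
```

The same pair of scripts can be tested locally with a shell pipeline such as
`cat input.txt | python mapper.py | sort | python reducer.py`, which mimics what the cluster
does at scale.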

Video Series: Introduction to Hadoop and MapReduce


https://www.youtube.com/watch?v=44K_bzTL_SM

Reading: Dean, J., & Ghemawat, S. MapReduce: simplified data processing on large clusters.
Communications of the ACM 51(1) (2008): 107-113.

Sessions 8 & 9: Hive Overview

We dedicate these sessions to Hive, a database developed at Facebook to facilitate SQL-like
calls on large datasets stored in Hadoop.

Across two sessions, we take up the exploration of the popular MovieLens datasets and query
them for various ends. By building these ideas into a MovieTuner, we show how to develop
ideas around big data into a business proposition.
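
One way to issue such Hive-style queries from Python is through PySpark's built-in Hive
support, sketched below. The table and column names assume MovieLens-style tables (ratings,
movies) already registered in the metastore; that setup is an assumption for illustration.

```python
# Sketch: Hive-style SQL from Python, via a SparkSession with Hive support.
# Assumes "ratings" and "movies" tables exist in the Hive metastore.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("movielens-hive")
         .enableHiveSupport()
         .getOrCreate())

top = spark.sql("""
    SELECT m.title, AVG(r.rating) AS avg_rating, COUNT(*) AS n
    FROM ratings r JOIN movies m ON r.movieId = m.movieId
    GROUP BY m.title
    HAVING COUNT(*) > 100        -- ignore rarely rated movies
    ORDER BY avg_rating DESC
    LIMIT 10
""")
top.show()   # the ten best-rated movies with at least 100 ratings
```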

An in-class quiz shall be held to assess the understanding of what is discussed in the video
series.

Reading: Chapter 5 of Cielen et al.

Session 10: Spark Overview

While Hadoop is excellent at parallelising data storage, its tasks are executed in batch mode,
i.e., at one go. A more versatile approach is to chain tasks in sequence, because the data may
have to be processed further to derive insights. Spark is a powerful platform that helps us
work with data loaded into memory.

Big Data Analytics


Businesses dealing with customer data often grapple with data duplication
before they even proceed to analyse the data. Entire businesses have been built
around software that tackles this issue. In this session, we use Spark to interpret
the results of de-duplication of patient data for a German hospital.
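
As a simplified illustration (not the hospital case itself, which calls for fuzzy record
linkage rather than exact matching), the sketch below removes exact duplicates with PySpark.
The file name and column names are assumptions.

```python
# Sketch: a crude de-duplication pass in PySpark. Real patient-record
# linkage scores fuzzy matches; here we drop only exact repeats.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()
patients = spark.read.csv("patients.csv", header=True)   # assumed input file

deduped = patients.dropDuplicates(["last_name", "first_name", "birth_date"])
print(patients.count(), "->", deduped.count(), "records after de-duplication")
```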

Sessions 11 & 12: Machine Learning with Big Data

We begin by illustrating basic ideas of machine learning and discuss the challenges that
arise from scale. Spark's ML and MLlib libraries support a vast collection of supervised and
unsupervised algorithms at scale.

Recommendations are the most popular use of ML in the commercial world. The Alternating Least
Squares (ALS) method works on a latent factor model, which relies on explicit or implicit
ratings. We apply this model to the MovieLens and Audioscrobbler datasets to make movie and
music recommendations. These approaches drive the business models of Netflix and Spotify.
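
A minimal sketch of ALS with Spark's ML library appears below. The column names follow the
public MovieLens convention, and the hyperparameters are illustrative rather than tuned.

```python
# Sketch: ALS recommendations on MovieLens-style ratings with Spark ML.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-demo").getOrCreate()
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)

als = ALS(userCol="userId", itemCol="movieId", ratingCol="rating",
          rank=10, regParam=0.1, coldStartStrategy="drop")
model = als.fit(ratings)                 # learn the latent factors

model.recommendForAllUsers(5).show()     # top-5 movies for every user
```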

Readings: Chapters 3 & 8 of Cielen et al.

Session 13: Applications to Finance

In this session, we learn how to estimate financial risk for a mutual fund using the Value at
Risk (VaR) metric, calculated empirically via a Monte Carlo simulation of stock returns for
thousands of instruments, with a large number of trials.
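
The sketch below illustrates the idea with plain NumPy, simulating one-day returns for an
equal-weight portfolio. The normal-returns assumption and every parameter here are purely
illustrative; the in-class treatment follows the reading.

```python
# Sketch: empirical Value at Risk (VaR) via Monte Carlo simulation.
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_instruments = 100_000, 1_000

# Simulate one-day returns per instrument per trial (illustrative parameters).
returns = rng.normal(loc=0.0005, scale=0.02, size=(n_trials, n_instruments))

weights = np.full(n_instruments, 1.0 / n_instruments)  # equal-weight fund
portfolio = returns @ weights                          # one return per trial

# 95% VaR: the loss exceeded in only 5% of the simulated trials.
var_95 = -np.percentile(portfolio, 5)
print(f"one-day 95% VaR: {var_95:.4%} of portfolio value")
```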

Reading: Chapter 3 of Cielen et al.

Sessions 14 & 15 (VARIETY): Unstructured Data

A large proportion of real-world big data is unstructured. One classic example is a network
dataset, which arises from diverse domains such as social media, epidemiology, and so on.

Across two sessions, we use the Gephi package to load and explore networks from various
business spheres. Gephi also provides a mechanism to construct random networks.
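
Although Gephi itself is a point-and-click tool, random networks of the same kind can also be
generated programmatically. The sketch below uses the networkx library (an assumption; it is
not required for the Gephi exercises) and exports a file that Gephi can open.

```python
# Sketch: build a random (Erdos-Renyi) network and export it for Gephi.
import networkx as nx

g = nx.erdos_renyi_graph(n=100, p=0.05, seed=7)   # each edge exists w.p. 0.05
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")

nx.write_gexf(g, "random_network.gexf")           # GEXF is Gephi's native format
```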

Session 16: Social Networks

We have witnessed the emergence of social network platforms since the turn of the century. In
contrast with random networks, real-world networks tend to contain hubs. We examine the role
of selection and socialisation in link and community formation.

Reading: Borgatti, S.P., Mehra, A., Brass, D.J., & Labianca, G. Network analysis in the
social sciences. Science 323(5916), 892-895.

Sessions 17 & 18: Networks at Scale

We begin with an overview of the PageRank algorithm, which allows us to systematically
identify the "important" nodes in a large network. We apply it to a small network, and make
meaning of the results.

The GraphX library allows us to handle networks small and large. We can obtain quantities
such as triangle counts, shortest paths, and connected components, and apply these to
identify hubs in a large bikeshare dataset.
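
To see PageRank at toy scale before turning to GraphX, here is a sketch with the networkx
library; the five-node graph is invented purely for illustration.

```python
# Sketch: PageRank on a tiny directed network. Node C, which collects
# links from several others, should surface with the highest score.
import networkx as nx

g = nx.DiGraph([("A", "B"), ("B", "C"), ("C", "A"), ("D", "C"), ("E", "C")])
scores = nx.pagerank(g, alpha=0.85)   # 0.85 is the customary damping factor

for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```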

Reading: Chapter 7 of Cielen et al.

Session 19 (VELOCITY): Streaming Applications

Dynamic business environments require stream processing, which is the act of continually
ingesting real-time data to compute or update a result. This is a key requirement for tasks
such as generating notifications, processing credit card activity and, more recently,
handling IoT sensor data.

We then demonstrate the features of Kafka, a distributed streaming platform.
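
A minimal consumer sketch with the kafka-python client is shown below; the topic name and
broker address are illustrative assumptions, not the class setup.

```python
# Sketch: continually ingesting a real-time stream from Kafka.
# Assumes `pip install kafka-python` and a broker on localhost:9092.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",                       # hypothetical IoT topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: b.decode("utf-8"),
)

for message in consumer:                     # blocks, handling records as they arrive
    print(message.timestamp, message.value)
```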

Reading: IoT: Real-time Data Processing and Analytics using Spark / Kafka

Session 20 (VERACITY): Targeted Marketing

Around the turn of this century, the field of mass communications took a sharp turn:
user-generated content was at the heart of this revolution. The mechanism of targeted
advertising by the likes of Google was improved upon by Facebook, with its incisive insights
into the intimate details of an individual.

However, the user’s privacy was sacrificed along the way, and one’s personal
data became a marketable commodity. An outfit by the name of Cambridge
Analytica capitalised on Facebook’s knowledge of its users, and played a
divisive role in the 2016 US elections as well as the Brexit campaign.

Reading: Grassegger, H., & Krogerus, M. The Data That Turned the World Upside Down.
Motherboard (Jan 2017)
