BDA2023 Outline
[Figure: Google Trends interest over time for “big data”]

The Google Trends chart above shows how “big data” is a phenomenon that is just a decade old as of this writing, and whose search volume has already peaked. This course looks beyond the hype of big data², into the opportunities and challenges for businesses in a world driven by information that must be extracted from large, heterogeneous data sources.

² Gandomi, A., & Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137-144.
Big data has several dimensions, the most obvious among them being volume (as in big!).
Starting in the 1970s, data stored on computers was organised into tabular “entities”, with linking relationships. Commercial database systems such as Oracle’s RDBMS and the open-source MySQL implemented this entity-relationship (E-R) model of “structured” data. During the 1980s, data warehouses helped store and analyse large amounts of data, so that managers could make informed business decisions.
The advent of the Internet during the 1990s was a big bang event in the world of data. Notably,
web pages did not follow a set format. The first decade of the 2000s witnessed the rise of social
networks such as Facebook and Twitter, whose messages did not adhere to a structure.
Consequently, special analytical techniques had to be devised to accommodate this variety of “unstructured” data, consisting of text, audio, imagery, video, and so on.
In the present decade, watches, phones, and a host of other sensors have come to dominate
the realm of data generation. With storage on the cloud becoming cheap and communication
scaling up to 5G speeds, the age of IoT has finally dawned upon us. Computational platforms
will have to deal with real-time data that arrive at tremendous velocity. Handling this aspect
of big data is key to the success of companies such as Uber and Tesla.
The ubiquity of mobile devices such as phones cuts both ways. On the one hand, information about anything is literally at one’s fingertips. On the other hand, recent advances in AI have made it effortless to create content that appears credible and can be passed off as real news. The 2016 US election as well as the Brexit vote were influenced by targeted campaigns on Facebook whose veracity was in question.
This course, titled Big Data Analytics, has been designed to cover these dimensions:
a. Explore the 4 traditional Vs of big data – volume, velocity, variety and veracity
b. Execute computations on distributed platforms that facilitate big data analytics
c. Examine applications of big data to retail, finance, and healthcare
Pedagogy
The course employs a mix of lectures and hands-on exploration in class. The book by Davy Cielen et al., Introducing Data Science, Manning (2016), will serve as a companion text.
We shall use a session to cover the basics of Python; students may strengthen their proficiency
in the language via a suitable resource like a MOOC.
We now have a fully functioning Big Data Cluster on the IIMB Campus, on which you shall be
provisioned an account. Any code that cannot be run on a laptop (due to the size of the data or
other considerations) can be run on the cluster.
To facilitate even larger implementations, you could open a guest account on Microsoft Azure/
Google Cloud/Amazon AWS.
A term project (max. group size of 4) will help bring all the concepts together.
Evaluation
Midterm: 35%
Final: 35%
Project: 20%
Class Participation: 10%
Constructing exams is an arduous task; there are no make-ups for any component.
Bookshelf
1. Easley, David, & Kleinberg, Jon. Networks, Crowds, and Markets. Cambridge (2010)
2. Mayer-Schönberger, V., & Cukier, K. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt (2013)
3. Ryza, Sandy, et al. Advanced Analytics with Spark, 2nd Edition. O’Reilly (2017)
The book Big Data by Viktor Mayer-Schönberger & Kenneth Cukier provides a good backdrop for our course discussions. Major shifts in our thinking around data have been facilitated by cheap access to abundant computational power and storage for large volumes of data; by greater variety in the types of data analysed, structured as well as unstructured; by the velocity with which data arrive, thanks to IoT devices; and finally, by the question of veracity, when targeted news and advertising campaigns can twist the truth at will.
These four Vs shall be discussed in detail during the rest of the course.
3 Introduction to Python
All four Vs of big data can be demonstrated and better understood on a laptop. The lingua franca for our demos will be Python, because commercial-grade open-source libraries support all these aspects. We will use Jupyter to run Python code on our personal laptops. We will also show how to run the same code inside a browser using Google Colab, which makes no installation demands.
We will cover coding basics such as statements, functions, and branching logic, and survey the capabilities of the NumPy library. We will use the Pandas library to work with structured, tabular data. To visualise the results, we will begin with the basic Matplotlib library and proceed to examine the richer features of the Plotly and Dash libraries.
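As a taste of this toolchain, here is a minimal sketch; the file name sales.csv and its columns date and revenue are hypothetical stand-ins for whatever dataset we use in class.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load a (hypothetical) CSV of daily sales into a Pandas DataFrame.
df = pd.read_csv("sales.csv", parse_dates=["date"])

# Aggregate with Pandas: total revenue per calendar month.
monthly = df.groupby(df["date"].dt.to_period("M"))["revenue"].sum()
print(monthly.describe())  # quick numeric summary

# Visualise with Matplotlib; Plotly and Dash offer interactive equivalents.
monthly.plot(kind="bar", title="Monthly revenue")
plt.tight_layout()
plt.show()
```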
Large datasets and lengthy computations pose hurdles when we are constrained to work with a single machine or a stack of servers in a datacentre. Businesses such as UPS and Vodafone, which must support users at scale, are moving their applications to the “elastic” cloud.
In this session, we go over structured data representation and learn how to query
tabular data using SQL. Next, we discuss Google’s cloud offerings, and execute
rich SQL-style queries on large, structured datasets with BigQuery.
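To give a flavour of what such a query looks like from Python, here is a sketch using Google's official google-cloud-bigquery client library; it assumes the library is installed and Google Cloud credentials are configured, and it queries one of Google's free public sample tables.

```python
from google.cloud import bigquery

client = bigquery.Client()  # picks up configured credentials and project

# Standard SQL over a public table: the ten most common US baby names.
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""

# The query runs on Google's servers; we only iterate over the results.
for row in client.query(sql).result():
    print(row["name"], row["total"])
```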
The task of handling large volumes of data makes the case for parallelisation.
Through the 1980s, companies like Sequent supported this requirement by
placing multiple processors on a single motherboard. The 1990s and 2000s saw
the rise of companies like VMware, whose virtualisation platforms enabled
resource sharing across servers in a datacentre.
Over the last decade, businesses large and small (e.g., Vodafone) have begun
migrating their operations into public clouds such as GCP, AWS and Azure.
These platforms “elastically” spread storage and compute across thousands of
machines. In the first session, we describe the MapReduce architecture, which
is foundational to these business models.
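The idea is easiest to see in miniature. Below is a single-machine sketch of the two phases using word count, the canonical MapReduce example; a real cluster shards these same phases across thousands of machines.

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit one ("word", 1) pair per occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return (key, sum(values))

docs = ["big data is big", "data about data"]
pairs = (pair for doc in docs for pair in map_phase(doc))
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```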
10 Spark Overview
While Hadoop is excellent at parallelising data storage, its tasks are executed in batch mode, i.e., at one go. A more versatile approach is to execute tasks in stages, because the data may have to be processed further to derive insights. Spark is a powerful platform that helps us work with data loaded into memory, so that intermediate results need not be written back to disk between stages.
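A minimal PySpark sketch of this style of work follows, assuming the pyspark package is installed; the file trips.csv and its columns city and fare are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) CSV of trips; the schema is inferred from the data.
trips = spark.read.csv("trips.csv", header=True, inferSchema=True)

# Transformations are lazy; nothing runs until an action is called.
by_city = trips.groupBy("city").agg(F.avg("fare").alias("avg_fare"))

by_city.cache()   # keep the result in memory for subsequent stages
by_city.show(10)  # action: triggers the actual computation

spark.stop()
```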
13 Applications to Finance
In this session, we learn how to estimate financial risk for a mutual fund using the Value at Risk (VaR) metric, calculated empirically via a Monte Carlo simulation of stock returns for thousands of instruments, with a large number of trials.
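Here is a stripped-down sketch of the empirical approach using NumPy; the return distribution and its parameters are illustrative assumptions, and the classroom version simulates returns per instrument across thousands of instruments.

```python
import numpy as np

rng = np.random.default_rng(42)

n_trials = 100_000
portfolio_value = 1_000_000

# Assume (for illustration only) that one-day portfolio returns are
# normal with 0.05% mean and 2% standard deviation.
returns = rng.normal(loc=0.0005, scale=0.02, size=n_trials)
pnl = portfolio_value * returns  # simulated one-day profit and loss

# 95% VaR: the loss exceeded in only 5% of the simulated trials.
var_95 = -np.percentile(pnl, 5)
print(f"1-day 95% VaR: {var_95:,.0f}")
```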
Across two sessions, we use the Gephi package to load and explore networks from various business spheres. Gephi also provides a mechanism to construct random networks.
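Gephi is a point-and-click desktop tool, but the same random networks can also be generated programmatically. The sketch below uses the networkx package (our addition for illustration, not a course requirement) and writes a GEXF file that Gephi can open.

```python
import networkx as nx

# Build an Erdős–Rényi random network: 100 nodes, each possible edge
# present independently with probability 0.05.
G = nx.erdos_renyi_graph(n=100, p=0.05, seed=7)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")

# Export in GEXF format, which Gephi can load and lay out visually.
nx.write_gexf(G, "random_network.gexf")
```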
16 Social Networks
We have witnessed the emergence of social network platforms since the turn of the century. In contrast with random networks, real-world networks tend to contain hubs. We examine the role of selection and socialisation in link and community formation.
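A small sketch of this contrast, again with networkx (an illustrative assumption): an Erdős–Rényi random network versus a preferential-attachment network, which grows hubs the way many real social networks do.

```python
import networkx as nx

# Two 1000-node networks with a comparable number of edges.
random_net = nx.erdos_renyi_graph(n=1000, p=0.004, seed=1)  # random
pref_net = nx.barabasi_albert_graph(n=1000, m=2, seed=1)    # grows hubs

for name, g in [("random", random_net), ("preferential", pref_net)]:
    degrees = [d for _, d in g.degree()]
    print(name, "edges:", g.number_of_edges(), "max degree:", max(degrees))

# The preferential-attachment network's maximum degree is far larger:
# a handful of hub nodes accumulate a disproportionate share of links.
```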
The GraphX library allows us to handle networks small and large. We can obtain quantities such as triangle counts, shortest paths, and connected components, and apply them to identify hubs in a large bikeshare dataset.
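GraphX itself exposes a Scala API; to stay in Python, here is a sketch using the GraphFrames package, which is our assumption for illustration (the graphframes package and its Spark jar must be installed).

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphs").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/gf-checkpoints")  # needed below

# Toy graph: vertices need an "id" column; edges need "src" and "dst".
v = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
e = spark.createDataFrame([("a", "b"), ("b", "c"), ("c", "a")],
                          ["src", "dst"])
g = GraphFrame(v, e)

g.triangleCount().show()        # triangles each vertex belongs to
g.connectedComponents().show()  # component label per vertex
g.inDegrees.show()              # high in-degree vertices are hub candidates

spark.stop()
```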
Reading: IoT: Real-time Data Processing and Analytics using Spark / Kafka
Around the turn of this century, the field of mass communications took a sharp turn: user-generated content was at the heart of this revolution. The mechanism of targeted advertising pioneered by the likes of Google was improved upon by Facebook, with its incisive insights into the intimate details of individuals.
However, the user’s privacy was sacrificed along the way, and one’s personal
data became a marketable commodity. An outfit by the name of Cambridge
Analytica capitalised on Facebook’s knowledge of its users, and played a
divisive role in the 2016 US elections as well as the Brexit campaign.
Reading: Grassegger, H. & Krogerus, M. The Data That Turned the World Upside Down. Motherboard (Jan 2017)