
BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES

COURSE HANDOUT

DSEID ZG522 (5 units)


Instructor-in-Charge: Shan Balasubramaniam
Instructor: Pravin Y. Pawar

Course Description
The course introduces students to the concepts of Systems for Analytics, with particular emphasis on
processing Big Data. It introduces distributed computing models for the storage and processing of Big Data,
with specific coverage of block storage, file systems, and databases on the one hand and batch processing,
in-memory distributed processing, and stream processing on the other. Hadoop (along with associated
technologies such as Hive and Pig), Spark, and Amazon's storage and database services are used as
exemplar platforms.

Course Objectives
CO1  Enable students to understand requirements for and constraints in storing and processing Big Data.

CO2  Enable students to leverage commodity infrastructure (such as scale-out clusters, distributed datastores, and the cloud) and the appropriate platforms and services for storing and processing Big Data.

CO3  Enable students to implement solutions for Big Data processing.

CO4  Enable students to develop a working knowledge of stream processing.

Text Book(s)
T1  Seema Acharya and Subhashini Chellappan. Big Data Analytics. Wiley India Pvt. Ltd., 2015.

Reference Book(s) & other resources


R1  DT Editorial Services. Big Data - Black Book. DreamTech Press, 2016.
R2  Kai Hwang, Jack Dongarra, and Geoffrey C. Fox. Distributed and Cloud Computing: From Parallel Processing to the Internet of Things. Morgan Kaufmann, 2011.
R3  Additional Reading (AR), as assigned for specific topics.


Learning Outcomes:
No Learning Outcomes

LO1  A comprehensive understanding of the Big Data ecosystem along with the typical technologies involved.

LO2  Apply concepts from distributed computing and use the Hadoop/Map-reduce framework for solving typical Big Data problems.

LO3 Identify and use appropriate storage / database platforms for Big data storage along with
appropriate querying mechanisms / interfaces for retrieval.

LO4 Use in-memory processing and stream processing techniques for building Big Data
systems.
 

Session Plan

Session # / Contact Hour(s) - List of Topic / Title [Text/Ref Book / external resource]

Session 1, Hour 1 - Different Types of Data and Storage for Data: Structured data (relational databases), semi-structured data (object stores), and unstructured data (file systems). What is Big Data? Characteristics of Big Data. Systems perspective on processing: in-memory vs. (from) secondary storage vs. (over the) network. [T1 Ch. 1 and Ch. 2]
Session 1, Hour 2 - Storage Models and Cost: Memory hierarchy, access costs, I/O costs (i.e. number of disk blocks accessed). Locality of Reference: principle and examples. [Any textbook on Computer Architecture / Operating Systems]
Session 2, Hour 3 - Impact of Latency: Algorithms and data structures that leverage locality; data organization on disk for better locality. [N.A.]
Session 2, Hour 4 - Parallel and Distributed Processing: Motivation (size of data and complexity of processing); storing data in parallel and distributed systems: shared memory vs. message passing; strategies for data access: partition, replication, and messaging. [R2 Sec. 1.2, 1.3.4, and 1.4.1]
Session 3, Hour 5 - Memory Hierarchy in Distributed Systems: In-node vs. over-the-network latencies, locality, communication cost. [N.A.]
Session 3, Hour 6 - Distributed Systems: Motivation (size, scalability, cost-benefit), client-server vs. peer-to-peer models, cluster computing: components and architecture. [R2 Sec. 2.1 to 2.3]
Session 4, Hour 7 - Big Data Analytics: Requirements, constraints, approaches, and technologies. [R2 Sec. 3.1 to 3.11; R1 Ch. 3 and Ch. 6]
Session 4, Hour 8 - Big Data Systems – Characteristics: Failures; reliability and availability; consistency – notions of consistency. [T1 Ch. 4; AR]
Session 5, Hour 9 - CAP Theorem and its implications for Big Data analytics. [T1 Sec. 3.12 and 3.13; AR]
Session 5, Hour 10 - Big Data Lifecycle: Data acquisition, data extraction (validation and cleaning), data loading, data transformation, data analysis and visualization. Case study: a Big Data application. [T1 Sec. 2.9 to 2.12; R1 Ch. 6 and Ch. 7]
Session 6, Hours 11-12 - Distributed Computing – Design Strategy: Divide-and-conquer for parallel / distributed systems – basic scenarios and implications. Programming Patterns: data-parallel programs and map as a construct; tree-parallelism and reduce as a construct; the map-reduce model: examples (of map, reduce, map-reduce combinations, and iterative map-reduce). [AR]
Session 7, Hours 13-14 - Hadoop: Introduction, architecture, and map-reduce programming on Hadoop. [T1 Sec. 5.1 and 5.2, Sec. 5.7, Sec. 5.11, and Ch. 8; R1 Ch. 5 and Ch. 9; R2 Sec. 1.4.3 and 6.2.2; AR]
Session 8, Hours 15-16 - Hadoop: Hadoop Distributed File System (HDFS), scheduling in Hadoop (using YARN). Example: a Hadoop application. [T1 Sec. 5.10 and 5.12; R1 Ch. 4 (sections on HDFS and YARN) and Ch. 12; AR]
Session 9, Hours 17-18 - Hadoop Ecosystem: Databases and querying (HBase, Pig, and Hive). [T1 Sec. 5.13; R1 Ch. 4 (sections on HBase, Hive, and Pig) and Ch. 5 (section on HBase)]
Session 10, Hours 19-20 - NoSQL Databases: Introduction, architecture, querying, variants, case study. [T1 Sec. 4.2, Ch. 6, and Ch. 7]
Session 11, Hour 21 - Cloud Computing: A brief overview: motivation, structure and components; characteristics and advantages – elasticity; services on the cloud. [AR]
Session 11, Hour 22 - Storage as a Service: Forms of storage on the cloud, databases on the cloud. [AR]
Session 12, Hour 23 - Amazon's storage services: Block storage, file system, and database; EBS, SimpleDB, S3. [AR (sourced from Amazon)]
Session 12, Hour 24 - Case study – Amazon DynamoDB (access/querying model, database architecture, and applications on the cloud).
Session 13, Hour 25 - Spark: Introduction, architecture, and features. [AR]
Session 13, Hour 26 - Programming on Spark: Resilient Distributed Datasets, transformations, examples. [AR (Apache Spark docs)]
Session 14, Hours 27-28 - Machine Learning (on Spark): Regression, classification, collaborative filtering, and clustering. [AR (Apache Spark docs)]
Session 15, Hours 29-30 - Streaming: Stream processing – motivation, examples, constraints, and approaches. [AR]
Session 16, Hours 31-32 - Streaming on Spark: Architecture of Spark Streaming, stream processing model, example. [AR (Apache Spark docs)]

Select Topics for experiential learning


Topic No.  Select Topics in Syllabus for experiential learning

1  ● Exercises on Distributed Systems – Hadoop;
   ● Exercises using the Map-reduce model: map-only and reduce-only jobs, standard patterns in map-reduce models.
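
For the map-reduce exercises above, a minimal word-count sketch in the Hadoop Streaming style is given below. It is illustrative only: the file names mapper.py and reducer.py and the plain-text input are assumptions, and the path to the streaming jar varies by installation.

    # mapper.py - emits (word, 1) pairs from plain-text lines on stdin
    import sys
    for line in sys.stdin:
        for word in line.strip().split():
            print(word + "\t1")

    # reducer.py - sums counts per word; Hadoop Streaming delivers the
    # mapper output sorted by key, so all counts for a word arrive together
    import sys
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(current_word + "\t" + str(current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(current_word + "\t" + str(current_count))

A map-only job simply omits the reduce step (number of reduce tasks set to 0); the two scripts can also be tested locally with: cat input.txt | python3 mapper.py | sort | python3 reducer.py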

2  ● Exercises on NoSQL;
   ● Exercises on a NoSQL database – simple CRUD operations and failure / consistency tests;
   ● Exercises to implement a Web-based application that uses NoSQL databases.
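
For the CRUD portion of the NoSQL exercises above, one possible minimal sketch is shown below; it assumes MongoDB as the document store, a local server on the default port, and the pymongo client, with illustrative database and collection names. The same four operations map naturally onto stores such as HBase or DynamoDB.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # assumed local instance
    coll = client["bds_demo"]["students"]                # illustrative names

    coll.insert_one({"_id": 1, "name": "Asha", "score": 82})       # Create
    print(coll.find_one({"_id": 1}))                               # Read
    coll.update_one({"_id": 1}, {"$set": {"score": 90}})           # Update
    coll.delete_one({"_id": 1})                                    # Delete

The failure / consistency tests can reuse the same operations while nodes of a replicated deployment are stopped and restarted.
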
3  ● Exercises with Pig queries to perform a Map-reduce job and understand how to build queries and the underlying principles;
   ● Exercises on creating Hive databases and operations on Hive, exploring built-in functions, partitioning, and data analysis.
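
The Hive exercises above are normally run from the Hive shell (and the Pig exercises from the Grunt shell); as one possible Python route, the same HiveQL-style statements can also be issued through Spark's Hive support. A minimal sketch, assuming a Spark installation built with Hive support and purely illustrative table and column names:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-demo")
             .enableHiveSupport()     # requires Spark built with Hive support
             .getOrCreate())

    # Illustrative table; built-in functions, partitioning, etc. can be explored the same way.
    spark.sql("CREATE TABLE IF NOT EXISTS sales (item STRING, amount DOUBLE) STORED AS TEXTFILE")
    spark.sql("INSERT INTO sales VALUES ('pen', 10.0), ('pen', 5.0), ('book', 20.0)")
    spark.sql("SELECT item, SUM(amount) AS total FROM sales GROUP BY item").show()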

4  ● Exercises on Spark to demonstrate RDDs and operations such as Map, FlatMap, Filter, and PairRDD;
   ● Typical Spark programming idioms such as: selecting Top N, sorting, and joins;
   ● Exercises on Spark SQL and DataFrames.
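
A minimal PySpark sketch for the RDD and DataFrame exercises above; the input strings and the top-N value are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
    sc = spark.sparkContext

    # RDD operations: flatMap, filter, map into a pair RDD, then aggregate by key.
    lines = sc.parallelize(["big data systems", "spark and hadoop", "spark streaming"])
    counts = (lines.flatMap(lambda s: s.split())
                   .filter(lambda w: len(w) > 2)
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))

    # Typical idiom: top N by count.
    print(counts.takeOrdered(2, key=lambda kv: -kv[1]))

    # The same data as a DataFrame, queried with Spark SQL.
    df = counts.toDF(["word", "cnt"])
    df.createOrReplaceTempView("wc")
    spark.sql("SELECT word, cnt FROM wc ORDER BY cnt DESC").show()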

5  ● Exercises using Spark MLlib: Regression, Classification, Collaborative Filtering, and Clustering.
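
A minimal sketch for the Spark MLlib exercises above, using linear regression on a tiny made-up dataset; the numbers and column names are purely illustrative, and classification, collaborative filtering, and clustering follow the same fit/transform pattern in pyspark.ml.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Tiny illustrative dataset: label is roughly 2*x1 + 3*x2.
    data = spark.createDataFrame(
        [(1.0, 1.0, 5.1), (2.0, 1.0, 7.0), (3.0, 2.0, 12.2), (4.0, 2.0, 13.9)],
        ["x1", "x2", "label"])

    features = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(data)
    model = LinearRegression(featuresCol="features", labelCol="label").fit(features)
    print(model.coefficients, model.intercept)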

6  ● Exercises on Analytics on the Cloud – using AWS, AWS Map-Reduce, AWS data stores / databases.
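
A minimal sketch for the AWS storage part of the exercise above, using boto3 against S3. It assumes AWS credentials are already configured, and the bucket name is illustrative (replace it with an existing bucket of your own). DynamoDB and AWS's managed Map-Reduce service are driven through the same boto3 client interface.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bds-demo-bucket"   # illustrative; use an existing bucket you own

    s3.put_object(Bucket=bucket, Key="samples/hello.txt", Body=b"hello big data")
    obj = s3.get_object(Bucket=bucket, Key="samples/hello.txt")
    print(obj["Body"].read().decode())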

[Note: A few of these topics for experiential learning will be covered by video demonstrations and/or participatory lab sessions operated remotely. The rest will be assigned as homework and may be included for evaluation – see below. End of Note.]

Evaluation Scheme
Legend: EC = Evaluation Component
EC-1  Assignment I, Assignment II, Assignment III. Type: take-home, programming and use of platforms. Weight: (10 + 10 + 20 =) 40%. Day, Date, Session, Time: to be announced.
EC-2  Mid-Semester Test. Type: Closed Book. Duration: 2 hours. Weight: 24%. Day, Date, Session, Time: to be announced.
EC-3  Comprehensive Exam. Type: Open Book. Duration: 3 hours. Weight: 36%. Day, Date, Session, Time: to be announced.

Important Information
Syllabus for Mid-Semester Test (Closed Book): Topics in Weeks 1-7
Syllabus for Comprehensive Exam (Open Book): All topics given in plan of study

Evaluation Guidelines:
1. EC-1 consists of three Assignments. Announcements regarding the same will be made in a timely
manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
3. For Open Book exams: Use of the prescribed and reference text books, in original (not photocopies), is permitted. Class notes/slides as reference material, in filed or bound form, are permitted. All other additional reading materials in filed / bound form are also permitted. However, loose sheets of paper will not be allowed. Use of calculators is permitted in all exams. Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam. The genuineness of the reason for
absence in the Regular Exam shall be assessed prior to giving permission to appear for the Make-up
Exam. Make-Up Test/Exam will be conducted only at selected exam centres on the dates to be
announced later.
It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, to attend the lectures, and to take all the prescribed evaluation components such as
the Assignments/Quizzes, Mid-Semester Test, and Comprehensive Exam according to the evaluation scheme
provided in the handout.
