Birla Institute of Technology & Science, Pilani: Work Integrated Learning Programmes
COURSE HANDOUT
Course Description
Learn the fundamentals of data engineering: the basics of systems and techniques for data processing,
comprising relevant database, cloud computing, and distributed computing concepts.
Course Objectives
CO1 Introduce students to a systems perspective of data analytics: to leverage systems effectively,
understand, measure, and improve performance while performing data analytics tasks
CO2 Enable students to develop a working knowledge of how to use parallel and distributed systems for
data analytics
CO3 Enable students to apply best practices in storing and retrieving data for analytics
CO4 Enable students to leverage commodity infrastructure (such as scale-out clusters, distributed data-
stores, and the cloud) for data analytics.
Text Book(s)
T1 Kai Hwang, Geoffrey C. Fox, and Jack J. Dongarra. Distributed and Cloud Computing:
From Parallel Processing to the Internet of Things. Morgan Kaufmann
# Topics
Storage for Data: Structured Data (Relational Databases), Semi-structured Data (Object
Stores), Unstructured Data (File Systems)
Processing: In-memory vs. (from) secondary storage vs. (over the) network
Storage Models and Cost: Memory Hierarchy, Access costs, I/O Costs (i.e. number of disk
blocks accessed);
Impact of Latency: Algorithms and data structures that leverage locality, data organization
on disk for better locality
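As an illustration of the I/O cost and locality topics above, the following minimal sketch counts block fetches for two traversal orders of a matrix stored in row-major order. The function name `blocks_read`, the one-block buffer, and the matrix and block sizes are assumptions chosen for illustration; they are not part of the course material.

```python
def blocks_read(rows, cols, block_size, order):
    """Count block fetches when traversing a rows x cols matrix.

    Assumptions (illustrative only): the matrix is stored row-major,
    each block holds `block_size` elements, and only one block is
    buffered at a time, so a fetch occurs whenever the block changes.
    """
    if order == "row":
        indices = ((r, c) for r in range(rows) for c in range(cols))
    elif order == "col":
        indices = ((r, c) for c in range(cols) for r in range(rows))
    else:
        raise ValueError("order must be 'row' or 'col'")

    fetches = 0
    current_block = None
    for r, c in indices:
        # Position of element (r, c) in the row-major layout.
        block = (r * cols + c) // block_size
        if block != current_block:
            fetches += 1
            current_block = block
    return fetches


row_cost = blocks_read(64, 64, 16, "row")  # traversal matches the layout
col_cost = blocks_read(64, 64, 16, "col")  # traversal strides across rows
print(row_cost, col_cost)
```

Under these assumptions the row-order traversal fetches each block once, while the column-order traversal fetches a block for every element, which is the kind of gap that data organization for locality aims to close.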
1.2 Systems Attributes for Data Analytics - Parallel and Distributed Systems
Storing data in parallel and distributed systems: Shared Memory vs. Message Passing
Strategies for data access: Partition, Replication, and Messaging
Memory Hierarchy in Parallel Systems: Shared memory access and memory contention;
shared data access and mutual exclusion
Memory Hierarchy in Distributed Systems: In-node vs. over the network latencies, Locality,
Communication Cost
Parallel Architectures and Programming Models: Flynn’s Taxonomy (SIMD, MISD, MIMD)
and Parallel Programming (SPMD, MPSD, MPMD)
Parallel vs. Distributed Systems: Shared Memory vs. Distributed Memory (i.e. message
passing)
Motivation for distributed systems (large size, easy scalability, cost-benefit)
Reliability (for distributed systems): MTTF and MTTR, Serial vs. Parallel Connections,
Single Point-of-Failure
Map-reduce model: Examples (of map, reduce, map-reduce combinations, Iterative map-
reduce)
● Partitioning vs. Replication and Communication vs. Locality for Data Mining
algorithms like k-means, DBSCAN, Nearest Neighbor
● Using data structures (such as kd-trees) for partitioning
● Matrices and Locality - Row-major vs. Column major vs. Blocking in distributed
context
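The map-reduce model listed above can be sketched on a single machine as a word count over partitioned input. This is a minimal illustration and not a distributed implementation; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample partitions are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce


def map_phase(partition):
    """Map: emit a (word, 1) pair for every word in one partition."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical input, split into two partitions as if held on two nodes.
partitions = [
    ["big data systems", "data locality"],
    ["data systems scale out"],
]

# Each partition is mapped independently; only the shuffle needs communication.
mapped = [pair for partition in partitions for pair in map_phase(partition)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

In this sketch the map step touches only local data, so the communication cost is confined to the shuffle, which ties the model back to the partitioning and locality topics above.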
Learning Outcomes:
No.  Learning Outcomes
LO1 Ability to identify the right storage model to use for a given dataset
LO2 Ability to apply the appropriate parallel programming model to a given dataset
LO3 Ability to identify and tune some common quality attributes of a distributed system
LO4 Ability to choose the relevant consistency model for data stores based on the application
LO5 Ability to apply data mining algorithms such as k-means clustering to an appropriate dataset
LO6 Ability to design and develop an n-tier data mining system in a cloud environment
Course Contents
2    1.1  Impact of Latency: Algorithms and data structures that leverage locality, data organization on disk for better locality. Reference: Class Slides
3-4  1.2  Storing data in parallel and distributed systems: Shared Memory vs. Message Passing. Reference: T1 Sec. 1.4.3, R12, Class Slides
9    2.2
# The above contact hours and topics can be adapted for non-specific and specific WILP programs
depending on the requirements and class interests.
Select Topics for experiential learning [Tutorials]
Topic No.  Select Topics in Syllabus for experiential learning  Resources (Need Weka or equivalent software)
Evaluation Scheme
Legend: EC = Evaluation Component
No    Name                Type         Duration  Weight  Day, Date, Session, Time
EC-1  Assignment-I        Take Home    -         12      To be announced
EC-1  Quiz-II             Take Home    -         5       To be announced
EC-1  Assignment-II       Take Home    -         13      To be announced
EC-2  Mid-Semester Test   Closed Book  90 Min    30      To be announced
EC-3  Comprehensive Exam  Open Book    120 Min   40      To be announced
Note - Evaluation components can be tailored depending on the proposed model.
Important Information
Syllabus for Mid-Semester Test (Closed Book): Topics in Weeks 1-7
Syllabus for Comprehensive Exam (Open Book): All topics given in plan of study
Evaluation Guidelines:
1. EC-1 consists of two Assignments and a Quiz. Announcements regarding the same will be made in a
timely manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
3. For Open Book exams: Use of prescribed and reference text books, in original (not photocopies), is
permitted. Class notes/slides as reference material in filed or bound form are permitted. However,
loose sheets of paper will not be allowed. Use of calculators is permitted in all exams.
Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam. The genuineness of the reason for
absence from the Regular Exam shall be assessed prior to giving permission to appear for the Make-Up
Exam. The Make-Up Test/Exam will be conducted only at selected exam centres on dates to be
announced later.
It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the lectures, and take all the prescribed evaluation components such as
Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.