Scalable Data Processing

(CSE 511)
Note: The outline below is subject to modifications and updates.

About this Course


Database systems are used to provide convenient access to disk-resident data through efficient
query processing, indexing structures, concurrency control, and recovery. This course delves
into new frameworks for processing and generating large-scale datasets with parallel and
distributed algorithms, covering the design, deployment and use of state-of-the-art data
processing systems, which provide scalable access to data.

Specific topics covered include:

• Efficient query processing
• Indexing structures
• Distributed database design
• Parallel query execution
• Concurrency control in distributed parallel database systems
• Data management in cloud computing environments
• Data management in Map/Reduce-based environments
• NoSQL database systems

Learning Outcomes

Learners completing this course will be able to:

• Differentiate among major data models such as relational, spatial, and NoSQL
• Perform queries (e.g., SQL) and analytics tasks in state-of-the-art database systems
• Apply leading-edge techniques to design/tune distributed and parallel database systems
• Utilize existing NoSQL database systems as appropriate for specified cases
• Perform database operations (e.g., selection, projection, join, and groupby) in state-of-the-art
cluster computing systems such as Hadoop/Spark
• Perform scalable data processing operations (e.g., selection, projection, join, and groupby) in
cloud computing environments, including Amazon AWS

Scalable Data Processing


Lead: Mohamed Sarwat, Ph.D. | Updated 12/28/2017
Projects
• Project 1: Movie Recommendation Database
• Project 2: Distributed Movie Recommendation Database
• Project 3: Location-Aware Twitter Analytics
• Project 4: Spatial Data Processing using Apache Spark
• Project 5: SQL queries on Amazon EC2

Course Content
Instruction
• Video Lectures
• Other Videos
• Readings
• Interactive Learning Objects
• Live office hours
• Webinars

Assessments
• Practice activities and quizzes (auto-graded)
• Practice assignments (instructor- or peer-reviewed)
• Team and/or individual project(s) (instructor-graded)
• Final exam (graded)

Estimated Workload/Time Commitment Per Week


Approximately 9 hours per week

Required Prior Knowledge and Skills


• Basic statistics and computer science knowledge including computer organization and
architecture, discrete mathematics, data structures, and algorithms
• Knowledge of high-level programming languages (e.g., C++, Java) and scripting
languages (e.g., Python)

Technology Requirements

Hardware
• Standard with major OS

Software and Other


• To complete course projects, some of the following software may be required: Amazon AWS
Cloud, Hadoop/Spark, GitHub, PostgreSQL, MongoDB, Neo4j.

Course Outline
Unit 1: Basic Data Processing Concepts

Learning Objectives
1.1 Explain data models and data processing concepts
1.2 Utilize the relational model and relational algebra
1.3 Utilize the SQL query language
• Unit Introduction
• Module 1: Big Data and Data Processing
• Introduction to Data and Data Processing
• Database Management Systems
• Data Models
• Module 2: Basic Data Concepts
• Database Systems - What and Why?
• Database Management Systems
• Data Model
• Database Design: Entity Relationship Model to Relational Model
• Entity Relationship Model
• ER to Relational Model
• Assignment: Create a Movie Database
• Relational Model and Relational Algebra
• Relational Data Model
• Relational Algebra: Query Language
• Query Language: Union
• Query Language: Difference
• Query Language: Cartesian Product
• Query Language: Selection
• Query Language: Projection
• Query Language: Intersection
• Query Language: θ-Join
• SQL Query Language:
• Part 1: SQL Query Language
• Part 2: SQL Query Language
• Assignment: SQL Query for Movie Recommendation
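
As an informal illustration of the relational algebra operators listed in this unit, the sketch below models relations as Python lists of dictionaries. It is not taken from the course materials: the function names, the `movies` and `ratings` relations, and their attributes are all hypothetical, and the Cartesian product assumes the two relations have disjoint attribute names.

```python
# Illustrative sketch only: relations are lists of dicts, and every name and
# value below is hypothetical rather than part of the course materials.

def select(rel, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rel if pred(row)]

def project(rel, attrs):
    """Projection: keep only the given attributes and drop duplicate rows."""
    out = []
    for row in rel:
        slim = {a: row[a] for a in attrs}
        if slim not in out:
            out.append(slim)
    return out

def product(r, s):
    """Cartesian product: pair every row of r with every row of s."""
    return [{**a, **b} for a in r for b in s]

def theta_join(r, s, theta):
    """Theta-join: Cartesian product followed by selection on theta."""
    return select(product(r, s), theta)

def union(r, s):
    """Union: rows in r or s, without duplicates."""
    out = []
    for row in r + s:
        if row not in out:
            out.append(row)
    return out

def difference(r, s):
    """Difference: rows of r that do not appear in s."""
    return [row for row in r if row not in s]

def intersection(r, s):
    """Intersection: rows of r that also appear in s."""
    return [row for row in r if row in s]

movies = [
    {"movie_id": 1, "title": "Alpha"},
    {"movie_id": 2, "title": "Beta"},
]
ratings = [
    {"user_id": 10, "rated_movie": 1, "score": 5},
    {"user_id": 11, "rated_movie": 2, "score": 2},
    {"user_id": 12, "rated_movie": 1, "score": 4},
]

# Join movies to their ratings, keep scores of 4 or more, project the title.
joined = theta_join(movies, ratings, lambda t: t["movie_id"] == t["rated_movie"])
liked = project(select(joined, lambda t: t["score"] >= 4), ["title"])
print(liked)  # [{'title': 'Alpha'}]
```

The final query corresponds roughly to the SQL statement SELECT DISTINCT title FROM movies JOIN ratings ON movie_id = rated_movie WHERE score >= 4, under the same hypothetical schema.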

Unit 2: Data Storage and Indexing

Learning Objectives
2.1 Recognize major data storage layouts
2.2 Identify major indexing schemes in Database Systems
• Unit Introduction
• Module 1: Major Storage Layouts
• Introduction to Data Storage
• Alternative File Organizations
• Module 2: Major Indexing Schemes in Database Systems
• Hash-based Indexes
• Index Classification

Unit 3: Transactions and Recovery

Learning Objectives
3.1 Examine the ACID properties
3.2 Explain Transactions and Concurrency Control concepts
3.3 Describe how recovery from failures happens in database systems
• Unit Introduction
• Module 1: ACID Properties
• Principles of Transactions: ACID Properties
• Module 2: Concurrency Control Concepts
• Concurrency Control
• Module 3: Lock-based Concurrency Control and Recovery from Failures
• Lock-Based Concurrency Control
• Database Recovery

Unit 4: Principles of Distributed and Parallel Database Systems

Learning Objectives
4.1 Describe data fragmentation and replication models
4.2 Describe the components of a distributed database
4.3 Apply skills learned to complete an assignment using data partitioning
• Unit Introduction
• Module 1: Distributed Databases: Why, What?
• Why Distribution?
• Module 2: Data Fragmentation and Replication Model
• Introduction to Fragmentation
• Introduction to Replication
• Assignment: Data Fragmentation

• Module 3: Advanced Distributed Database Systems
• Query Processing and Optimization in Distributed Databases
• Distributed Query Processing
• Total Cost of Query Execution Plan
• Assignment: Query Processing
• Module 4: Parallel Database Systems
• Parallel Data Architecture
• Introduction to Parallel DBMS
• The Different Types of DBMS Parallelism
• Parallel Sorting and Joins
• Assignment: Parallel Sort and Joins

Unit 5: NoSQL Database Systems

Learning Objectives
• Unit Introduction
• Module 1: NoSQL Database Systems
• Key-Value Stores
• Graph Databases
• Document Databases
• Module 2: Big Data Analytics Systems
• Intro Map-Reduce / Spark
• Data Analytics in Map-Reduce / Spark
• Graph Processing Engines
• Module 3: Data Processing on Modern HW

PROJECT: Distributed Movie Recommendation Database

Unit 6: Big Data Tools

PROJECT: Location-Aware Twitter Analytics


PROJECT: Spatial Data Processing using Apache Spark

Unit 7: Additional Tools Used for Data Visualization

Learning Objectives
7.1 Explain data processing in the cloud
7.2 Evaluate service models
7.3 Evaluate deployment models
• Unit Introduction
• Module 1: Introduction to Cloud Computing
• Introduction to Cloud Computing
• Module 2: Service Models
• Service Models
• Module 3: Deployment Models
• Deployment Models

Unit 8: Cloud-based Data Management

Learning Objectives
8.1 Explain AWS
• Unit Introduction
• Module 1: Amazon Web Services
• Introduction to Amazon Web Services
• AWS Computing
• AWS Storage
• AWS Queueing Services
• Module 2: Build an Elastic Cloud Application
• AWS Interfaces
• Auto-Scaling
• Module 3: Build a MapReduce Cloud Application
• Scalable Data Processing
• AWS Security

PROJECT: SQL queries on Amazon EC2

Creators
Established in Tempe in 1885, Arizona State University (ASU) has developed a new model
for the American Research University, creating an institution that is committed to access,
excellence and impact.

As the prototype for a New American University, ASU pursues research that contributes to the
public good, and ASU assumes major responsibility for the economic, social and cultural vitality
of the communities that surround it. Recognizing the university’s groundbreaking initiatives,
partnerships, programs and research, U.S. News and World Report has named ASU as the
most innovative university all three years it has had the category.

The innovation ranking is due at least in part to a more than 80 percent improvement in ASU’s
graduation rate in the past 15 years, the fact that ASU is the fastest-growing research university
in the country and the emphasis on inclusion and student success that has led to more than 50
percent of the school’s in-state freshmen coming from minority backgrounds.

Mohamed Sarwat is an Assistant Professor of Computer Science and the director of the
Data Systems (DataSys) lab at Arizona State University (ASU). He is also an affiliate member
of the Center for Assured and Scalable Data Engineering (CASCADE). Before joining ASU,
Mohamed obtained his MSc and PhD degrees in computer science from the University of
Minnesota. His research interest lies in the broad area of data management systems.

Ming Zhao is an associate professor of the ASU School of Computing, Informatics, and
Decision Systems Engineering. Before joining ASU, he was an associate professor of the
School of Computing and Information Sciences (SCIS) at Florida International University.
He directs the Research Laboratory for Virtualized Infrastructure, Systems, and Applications
(VISA). His research interests are in distributed/cloud computing, big data, high-performance
computing, autonomic computing, virtualization, storage systems and operating systems.
