Syllabus E63 2018 Fall PDF
Optional Online Sections: Saturdays, starting September 8th, 2018 at 12:00 noon (EST) with an
introduction to AWS Cloud, Linux OS and Python Pandas.
The emphasis of this course is on mastering the most important big data technology: Spark 2.
Spark is an evolution of Hadoop and Map/Reduce with massive speedup and scalability
improvements. The explosion of social media and the computerization of every aspect of social
and economic activity results in the creation of large volumes of semi-structured data: web
logs, videos, speech recordings, photographs, e-mails, Tweets, and similar. In a parallel
development, computers keep getting ever more powerful and storage ever cheaper. Today,
with Spark 2, we can reliably and cheaply store huge volumes of data, efficiently analyze them,
and extract business and socially relevant information. We will examine the most important Spark
APIs: Spark Core, the Spark MLlib (machine learning) API, and Spark Streaming, which allows
analysis of data in flight, that is, in near real time. We will learn how to use Spark GraphX and
GraphFrames, in-memory graph databases, to analyze highly connected data. We will acquire
practical skills in scalable messaging systems like Kafka and/or Akka and learn to integrate Spark
with NoSQL systems. We will conduct some of our exercises in Amazon Cloud, so the students
will master the most important AWS services, EC2 and S3 among others. At the end of the
course, students will be able to initiate and design highly scalable systems that can accept,
store, and analyze large volumes of unstructured data in batch mode and/or real time.
Prerequisites: Proficiency in Python or Java or Scala or R. Some familiarity with Linux or Mac OS
is helpful. No familiarity with AWS is assumed. Students need access to a computer with a 64-
bit operating system and at least 8 GB of RAM (32 GB is highly recommended).
Lectures: Lectures will be delivered live and made available afterwards for
online viewing through the Zoom Web Conferencing tool. Recordings of the video streams will also be
available. Zoom links to recorded lectures and lab sessions will be accessible on the course
Web site one or two hours after the end of the lecture or lab session. Streaming videos of
recorded lectures will become available with a delay, hopefully by Saturday morning.
References: Detailed handouts with references to material on the Web will be handed out
every week. There is no required textbook. “Spark: The Definitive Guide” by Bill Chambers and
Matei Zaharia (O’Reilly, 2018) is a recommended text.
Assessment: Solutions to homework assignments in any of the four languages will be accepted, though
we encourage students to use Python. At the end of the class, students will implement an
individual final project using one of the Big Data technologies or use cases not covered in class.
Grading: Practically every class will be followed by a homework assignment. Grades on the
solutions for class assignments constitute approximately 85% of the final grade. 15% of the
grade will be earned through the final project. Final projects will be assigned a few weeks
before the end of the class. For the final project, you will produce a paper (10+ pages of MS
Word text), 10+ PowerPoint slides, a working demo, a 15-minute YouTube video of your
presentation, and a brief 2-minute YouTube video that might be presented to the whole class on
the day of the final presentations. Some 30+ students will be invited to present their final
projects live to the entire class. All final project materials will be made available to the entire
class. Each solution of every homework assignment and each final project is an individual effort.
Grades: 95% or higher cumulative grade on all assignments and the final project gives you an A
as the final grade in the course, 90-94.9% gives you an A-, 85-89.9% a B+, 80-84.9% a B, etc.
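As a rough illustration of the arithmetic above, the following Python sketch combines the homework and final project scores with the stated weights and maps the result to a letter. The function names and the example scores are hypothetical, the 85% weight is treated as exact although the syllabus says "approximately", and grades below B are left unmapped because the syllabus covers them only with "etc."

```python
# Illustrative sketch only; names and example scores are hypothetical.

HOMEWORK_WEIGHT = 0.85  # assumption: "approximately 85%" treated as exactly 85%
PROJECT_WEIGHT = 0.15   # 15% of the grade comes from the final project

# Lower bounds for the letter grades the syllabus spells out.
# Grades below B are covered only by "etc." and are deliberately not guessed here.
LETTER_CUTOFFS = [(95.0, "A"), (90.0, "A-"), (85.0, "B+"), (80.0, "B")]


def cumulative_grade(homework_pct, project_pct):
    """Weighted cumulative percentage from homework and final project scores."""
    return HOMEWORK_WEIGHT * homework_pct + PROJECT_WEIGHT * project_pct


def letter_grade(cumulative_pct):
    """Return the letter for a cumulative percentage, or None below the listed ranges."""
    for lower_bound, letter in LETTER_CUTOFFS:
        if cumulative_pct >= lower_bound:
            return letter
    return None


if __name__ == "__main__":
    total = cumulative_grade(homework_pct=92.0, project_pct=88.0)
    print(f"{total:.1f}% -> {letter_grade(total)}")
```

With a 92% homework average and an 88% final project, the weighted total is 0.85 × 92 + 0.15 × 88 = 91.4%, which falls in the A- range.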
Communications: [email protected], Canvas class site and Piazza, once class starts.
Academic Integrity: Students are responsible for understanding Harvard Extension School policies on
academic integrity (www.extension.harvard.edu/resources-policies/student-conduct/academic-integrity)
and how to use sources responsibly. Not knowing the rules, misunderstanding the rules,
running out of time, submitting the wrong draft, or being overwhelmed with multiple demands are not
acceptable excuses. There are no excuses for failure to uphold academic integrity. To learn more
about academic citation rules, students should visit the Harvard Extension School Tips to Avoid
Plagiarism (www.extension.harvard.edu/resources-policies/resources/tips-avoid-plagiarism), where
you'll find links to the Harvard Guide to Using Sources and two free online 15-minute tutorials to test
your knowledge of academic citation policy. The tutorials are anonymous open learning tools.
Accessibility: The Extension School is committed to providing an accessible academic community. The
Accessibility Office offers a variety of accommodations and services to students with documented
disabilities. Please visit www.extension.harvard.edu/resources-policies/resources/disability-services-accessibility
for more information.
Relationships and Representations, Graph Databases. We will use the Neo4J
graph database to represent relationships among objects in IT space, as
well as concepts and words in spoken languages. Various types of
relationship discovery and representations are essential in solving many of
our data analysis problems. Neo4J and similar technologies make our
understanding of complex problems much easier.
4 09/28/2018 Spark SQL and Datasets. Interaction of Spark with RDBMS and NoSQL:
Cassandra, Hive, Parquet files. Spark SQL queries, tables, views, databases.
Dataset actions and transformations, joins, grouping and aggregations.
5 10/05/2018 Resilient Distributed Datasets (RDDs). Low level APIs, creating and
manipulating RDDs, actions, saving files, caching, piping, key-value RDDs,
partitions, controlling partitions.
6 10/12/2018 Lifecycle of Spark Application. Architecture of Spark application,
developing Spark applications, testing, configuring, launching, job
scheduling. Deploying Spark in the Cloud. Cluster management.
7 10/19/2018 Analysis of Streaming Data with Spark 2. Not all applications can rely
only on batch processing of large volumes of data. Some applications must
process large data volumes in real time. Spark provides a Streaming API for
such scenarios. We will introduce special messaging systems (Kafka, Akka
and/or AWS Kinesis) which can act as buffers between actual data
sources and Spark.
8 10/26/2018 Event-Time and Stateful Processing. Events, stateful processing, windows
on event time, removing duplicates in streams, structured streaming in
production, fault tolerance and checkpointing, metrics and monitoring,
advanced monitoring and streaming listener.
9 11/02/2018 Advanced Analytics and Machine Learning and Spark MLlib API. Use cases
for machine learning with Spark. Applications of Spark MLlib to performing
many ML tasks at Spark speed.
10 11/09/2018 Feature Engineering and ETL Tasks. Estimators for preprocessing, high-level
transformations, categorical and continuous feature processing. Text
preprocessing and analysis. Natural Language Processing (NLP) common
tasks, Word2Vec, Principal Component Analysis (PCA).
11 11/16/2018 Classification tasks with Spark. Use cases, types of classifications,
classifications in MLlib, logistic regression, decision trees, random forests,
naïve Bayes, evaluators for classifications and automated model testing.
11/23/2018 Thanksgiving Holiday
12 11/30/2018 Graph Analysis. Various types of relationship discovery and
representations are essential in solving many data analysis problems.
We will use GraphFrames to represent relationships among objects,
including concepts and words in spoken languages. Comparison to Neo4J
and similar technologies.
13 12/07/2018 Deep Learning with Spark. Deep Learning APIs on Spark, TensorFlow,
Deeplearning4j, TensorFrames. MLlib neural network support, Deep
Learning Pipelines. Examples.
14 12/14/2018 Final Project Presentations
01/08/2019 Grades available online