Introduction To Data Science 439
Source: Scaling Language Model Training to a Trillion Parameters Using Megatron (NVIDIA)
Data contains value and knowledge
But to extract the knowledge, data needs to be:
▪ Stored (Database Systems)
▪ Managed (Data Management for Data Science)
▪ Analyzed: this class (Data Science and Data Mining)
Given lots of data
Discover patterns and models that are:
▪ Valid: hold on new data with some certainty
▪ Useful: should help us to learn something from data
▪ Unexpected: non-obvious to us
▪ Understandable: humans should be able to
interpret the pattern
Data: Tycho Brahe (1546-1610)
Time    Position
1       (a, b)
2       (c, d)
3       (e, f)
Pattern: Johannes Kepler (1571-1630): $\tau^2 / r^3 = K$
Explanation: Isaac Newton (1643-1727): why $\tau^2 / r^3 = K$ holds
[1] Z. Li, J. Ji, Y. Zhang. "From Kepler to Newton: Explainable AI for Science Discovery." arXiv:2111.12210.
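The Kepler pattern on this slide says that the squared orbital period divided by the cubed orbital radius is a constant K. As a rough check, the sketch below computes that ratio for six planets. The orbital values are approximate textbook figures, not data from the slides, and the function name `kepler_ratio` is chosen only for illustration. With radius in astronomical units and period in years, K should come out close to 1.

```python
# Approximate orbital data (an assumption, not from the slides):
# name -> (semi-major axis in AU, orbital period in years)
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def kepler_ratio(radius_au, period_years):
    """Return tau^2 / r^3, which Kepler's third law says is constant."""
    return period_years ** 2 / radius_au ** 3


ratios = {name: kepler_ratio(r, t) for name, (r, t) in PLANETS.items()}

for name, k in ratios.items():
    print(f"{name:8s} K = {k:.3f}")
```

The ratio stays nearly the same across very different orbits, which is the kind of valid, non-obvious, and understandable pattern the previous slide asks for.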
Massive data:
- Search engine logs
- E-commerce transactions
- Social network data
- Impossible to manually analyze the data and derive conclusions from it
Use massive computing facilities and develop advanced algorithms to analyze the data and make predictions.
Advanced algorithms: machine learning, data mining, etc.
Descriptive methods
▪ Find human-interpretable patterns that
describe the data
▪ Example: Clustering
Predictive methods
▪ Use some variables to predict unknown
or future values of other variables
▪ Example: Recommender systems, Computational
Advertising, etc.
This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
▪ Scalability (big data and big models)
▪ Algorithms
▪ Computing architectures
▪ Automation for handling large data
Figure labels: Statistics, Machine Learning, Data Mining, Database systems
Figure labels: Usage, Quality, Context, Streaming, Scalability
We will learn to analyze different types of data:
▪ Data is high dimensional (dimensionality reduction)
▪ Data is a graph (social and web link graph analytics)
▪ Data is infinite/never-ending (stream processing)
▪ Data is labeled (supervised) or unlabeled (unsupervised)
We will learn to use different models of
computation:
▪ Single machine in-memory
▪ Streams and online algorithms
▪ Map-Reduce, Hadoop, Spark, PyTorch, Hugging Face
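To give a feel for the Map-Reduce model of computation, here is a minimal single-machine sketch of word counting written as map, shuffle, and reduce steps. It is plain Python, not Hadoop or Spark code, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big models", "data mining"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on different machines; the sketch only shows how the work is split into the three steps.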
We will learn to solve real-world problems:
▪ Recommender systems (dimensionality reduction, latent
factor models, Collaborative Filtering, Collaborative
Reasoning, Large Language Models, Amazon)
▪ Market Basket Analysis (frequent item set mining, the beer
and diaper example, Walmart)
▪ Spam detection (web graph analysis, PageRank, Google)
▪ Social network analysis (Facebook)
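As a small illustration of market basket analysis, the sketch below counts how often each pair of items appears together and keeps the pairs that reach a minimum support. The baskets are made up for this example, and the brute-force pair counting is a simplification rather than a scalable frequent itemset algorithm.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that co-occur in at least min_support baskets."""
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


# Hypothetical baskets, invented for illustration only.
BASKETS = [
    ["beer", "diapers", "milk"],
    ["beer", "diapers"],
    ["milk", "bread"],
    ["beer", "diapers", "bread"],
]

result = frequent_pairs(BASKETS, min_support=3)
print(result)
```

Counting every pair this way grows quadratically with basket size, which is why large-scale settings need more careful algorithms.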
We will learn various “tools”:
▪ Linear algebra (SVD, LFM, Community analysis)
▪ Optimization (stochastic gradient descent)
▪ Dynamic programming (frequent itemsets)
▪ Hashing (LSH, Bloom filters)
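As one example of the hashing tools listed above, here is a minimal Bloom filter sketch. The class name, the default sizes, and the use of seeded SHA-256 as the hash family are assumptions made for illustration, not choices taken from the course. A Bloom filter never reports a stored item as missing, but it may occasionally report an item that was never stored as present.

```python
import hashlib


class BloomFilter:
    """A small Bloom filter; sizes and hash scheme are illustrative."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        """Yield num_hashes bit positions for item using seeded SHA-256."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position that item hashes to."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for url in ["example.com/a", "example.com/b"]:
    seen.add(url)

print(seen.might_contain("example.com/a"))
```

The filter uses a fixed amount of memory no matter how many items are added, at the cost of a false positive rate that rises as more bits are set.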
Course topics by area:
▪ High-dimensional data: Locality sensitive hashing, Clustering, Dimension reduction, Neural Network
▪ Graph data: PageRank, TrustRank; Spam Detection; Community Detection; Graph Representation Learning
▪ Infinite data: Filtering data streams, Web advertising, Queries on streams
▪ Machine learning: SVM, SVD, SGD; Decision Trees, kNN; Perceptron, Neural network; Neural Reasoning, Large Language Model
▪ Apps: Recommender systems, Search Engine, Social Networks
Instructor:
Yongfeng Zhang, [email protected]
TA:
Xi Zhu (Section 1), Email: [email protected]
Wujiang Xu (Section 2), Email: [email protected]
Devanshi Patel (Section 3), Email: [email protected]
Sai Samhith Thatikonda (Section 4), Email: [email protected]
Graders:
Satyam Saini (Section 1&2), Email: [email protected]
Sasank Chindirala (Section 3&4), Email: [email protected]
Data Structures (CS 112)
▪ Arrays, lists, sets, maps, queues, linked lists, etc.
Basic probability
▪ Moments, typical distributions, MLE, …
Programming
▪ Your choice, but C++/Java/Python will be very useful
Infrastructure (optional)
▪ Linux / Hadoop / Spark / PyTorch / Jupyter / Hugging Face
Fundamental Algorithms (CS 344, optional but preferred)
▪ Linear Algebra, basic data structures
Slides, lecture notes, homework answers, etc. -> Files
Course Website
▪ Homework Assignments will also be released and submitted on
Canvas.
Be sure to enable your course notifications!
Canvas Chat:
▪ You may post questions and participate in discussions in the Canvas Chat room.
▪ Use Canvas for technical questions and public communication with the course staff.
4 homework assignments: 40 points
▪ Theoretical and programming questions (10 points
each)
▪ Assignments take lots of time. Start early!
Midterm: 20 points
▪ Wednesday, Oct 30, 12-hour take-home exam (no
class on that day)
▪ Detailed instructions will be provided on Canvas.
Final Project: 40 points
▪ Complete as a team of at most 3 students.
▪ The team can either choose from a set of provided project topics (to be posted on Canvas later this semester) or propose its own topic (subject to approval from the instructor).
▪ Write the project report as a “mini-paper” and prepare slides for the project.
▪ Submit the paper, slides, code, and data. (40%)
Bonus: 10 points
▪ We will select 20 teams, each of which can give a 15-minute presentation to the class. Each member of the selected teams will get 10 bonus points.
▪ The final two weeks are presentation days.
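The grading components above add up as follows. This is only an arithmetic sketch: the constant names are made up, and it assumes that bonus points are added on top of the 100 base points, which the slides do not state explicitly.

```python
# Point values as listed on the grading slides.
HOMEWORK_COUNT = 4
POINTS_PER_HOMEWORK = 10
MIDTERM_POINTS = 20
PROJECT_POINTS = 40
BONUS_POINTS = 10  # assumption: added on top of the base total

base_total = HOMEWORK_COUNT * POINTS_PER_HOMEWORK + MIDTERM_POINTS + PROJECT_POINTS
max_total = base_total + BONUS_POINTS

print(f"Base total: {base_total} points")
print(f"Maximum with bonus: {max_total} points")
```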