
http://www.mmds.org

Introduction to Data Science


Yongfeng Zhang, Rutgers University
Not only massive data, but also massive models and massive computation, e.g., Large Language Models.

Source: Scaling Language Model Training to a Trillion Parameters Using Megatron (NVIDIA)

Data contains value and knowledge
But to extract the knowledge, data needs to be:
▪ Stored (Database Systems)
▪ Managed (Data Management for Data Science)
▪ Analyzed → this class (Data Science and Data Mining)

Given lots of data, discover patterns and models that are:
▪ Valid: hold on new data with some certainty
▪ Useful: should help us learn something from the data
▪ Unexpected: non-obvious to us
▪ Understandable: humans should be able to interpret the pattern

Data Mining ≈ Big Data (from 2012)

Big Data + Predictive Analytics ≈ Data Science
Tycho Brahe (1546-1601): We can Observe it!
▪ Danish astronomer
▪ Good at astronomical observation
▪ Observed and recorded a lot of data about how planets circle around the sun, as (Time, Position) records: 1, (a, b); 2, (c, d); 3, (e, f); …

Johannes Kepler (1571-1630): We can Predict it!
▪ German astronomer, student of Tycho Brahe
▪ Analyzed Tycho's data and discovered a rule hidden in the data, Kepler's third law of planetary motion:

    τ² / r³ = K

  where τ is the period of circling around the sun and r is the radius of the orbit.

[1] Z. Li, J. Ji, Y. Zhang. "From Kepler to Newton: Explainable AI for Science Discovery." arXiv:2111.12210.
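Kepler's rule is easy to rediscover from data today. A minimal sketch, using the well-known orbital periods (in Earth years) and orbit radii (in AU) of five planets; in these units the constant K comes out near 1:

```python
# Check Kepler's third law, tau^2 / r^3 ~ constant, on real orbital data.
# tau: orbital period in Earth years, r: semi-major axis in AU.
planets = {
    "Mercury": (0.241, 0.387),
    "Venus":   (0.615, 0.723),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.881, 1.524),
    "Jupiter": (11.862, 5.203),
}

# The "pattern hidden in the data": the ratio is the same for every planet.
ratios = {name: tau**2 / r**3 for name, (tau, r) in planets.items()}
for name, k in ratios.items():
    print(f"{name:8s} tau^2/r^3 = {k:.4f}")
```

All five ratios agree to within about one percent, which is exactly the kind of valid, useful, unexpected, and understandable pattern data mining looks for.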
Johannes Kepler (1571-1630): We can Predict it!
▪ German astronomer, student of Tycho Brahe
▪ Analyzed Tycho's data and discovered the rules hidden in the data, Kepler's laws of planetary motion: τ² / r³ = K, where τ is the period of circling around the sun and r is the radius of the orbit.

Isaac Newton (1643-1727): We Understand it! We know Why!
▪ English mathematician, physicist, astronomer, theologian, and author
▪ Proposed Newton's law of universal gravitation + differential calculus, which naturally derives Kepler's laws of planetary motion: τ² / r³ = K holds because of universal gravitation.
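How gravitation "naturally derives" Kepler's law can be sketched in a few lines for the simplest case of a circular orbit (a standard textbook derivation, included here for completeness):

```latex
% Circular orbit of radius r and period tau around the sun of mass M.
% Gravity supplies the centripetal force:
\[
  \frac{G M m}{r^{2}} \;=\; m\,\omega^{2} r \;=\; m \left(\frac{2\pi}{\tau}\right)^{2} r
\]
% Cancel the planet's mass m and rearrange:
\[
  \frac{\tau^{2}}{r^{3}} \;=\; \frac{4\pi^{2}}{G M} \;=\; K
\]
% K depends only on the sun's mass M, not on the planet -- exactly
% Kepler's observation that the ratio is the same for every planet.
```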
▪ Tycho Brahe (1546-1601): Data Collection. Almost automatic.
▪ Johannes Kepler (1571-1630): Data Analytics (what). Many available methods; the main part of this course.
▪ Isaac Newton (1643-1727): Data Interpretation (why). Still needs much exploration; we will touch some topics.
Massive data:
▪ search engine logs
▪ e-commerce transactions
▪ social network data

It is impossible to manually analyze such data and derive conclusions from it. Instead, we use massive computing facilities and develop advanced algorithms (machine learning, data mining, etc.) to analyze the data and make predictions.
Descriptive methods
▪ Find human-interpretable patterns that describe the data
▪ Example: Clustering

Predictive methods
▪ Use some variables to predict unknown or future values of other variables
▪ Example: Recommender systems, Computational Advertising, etc.
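Clustering, the descriptive example above, can be sketched in a few lines of plain Python. This is a minimal, illustrative k-means (not the course's reference implementation); the data points are made up:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """A minimal k-means sketch: alternate assignment and centroid update."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            j = min(range(k),
                    key=lambda c: (x - centroids[c][0])**2 + (y - centroids[c][1])**2)
            clusters[j].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids, clusters

# Two obvious groups of 2-D points; k-means should find one centroid per group.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(data, k=2)
```

The result is human-interpretable: each centroid summarizes one group of the data, which is what makes clustering a descriptive method.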

This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
▪ Scalability (big data and big models)
▪ Algorithms
▪ Computing architectures
▪ Automation for handling large data

[Venn diagram: Data Mining at the intersection of Statistics, Machine Learning, and Database systems]
Usage

Quality

Context

Streaming
Scalability

We will learn to analyze different types of data:
▪ Data that is high dimensional (dim reduction)
▪ Data that is a graph (social and web link graph analytics)
▪ Data that is infinite/never-ending (streaming data processing)
▪ Data that is labeled (supervised) or unlabeled (unsupervised)

We will learn to use different models of computation:
▪ Single machine in-memory
▪ Streams and online algorithms
▪ Map-Reduce, Hadoop, Spark, PyTorch, Huggingface
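The "streams and online algorithms" model means processing each element once, in constant memory, without ever storing the whole stream. A minimal illustrative sketch (not from the course materials) is the running mean, updated incrementally per element:

```python
def running_mean(stream):
    """Online mean over a stream: one pass, O(1) memory."""
    mean, n = 0.0, 0
    for x in stream:
        n += 1
        mean += (x - mean) / n   # incremental update; the stream is never stored
        yield mean

# Each yielded value is the mean of everything seen so far.
means = list(running_mean([2, 4, 6, 8]))
# means == [2.0, 3.0, 4.0, 5.0]
```

The same one-pass, bounded-memory discipline underlies the more sophisticated streaming algorithms covered later (sampling, filtering, queries on streams).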
We will learn to solve real-world problems:
▪ Recommender systems (dimensionality reduction, latent factor models, Collaborative Filtering, Collaborative Reasoning, Large Language Models; e.g., Amazon)
▪ Market Basket Analysis (frequent itemset mining, the beer-and-diapers example; e.g., Walmart)
▪ Spam detection (web graph analysis, PageRank; e.g., Google)
▪ Social network analysis (e.g., Facebook)
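To give a taste of the recommender-systems topic, here is a tiny sketch of the core of user-based collaborative filtering: measuring how similar two users' tastes are via cosine similarity. The names and ratings are made-up toy data:

```python
import math

# Hypothetical user-item rating matrix (1-5 stars).
ratings = {
    "Alice": {"A": 5, "B": 3, "C": 4},
    "Bob":   {"A": 4, "B": 3, "C": 5},
    "Carol": {"A": 1, "B": 5, "C": 2},
}

def cosine(u, v):
    """Cosine similarity of two users over the items both have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i]**2 for i in common))
    norm_v = math.sqrt(sum(v[i]**2 for i in common))
    return dot / (norm_u * norm_v)

sim_ab = cosine(ratings["Alice"], ratings["Bob"])
sim_ac = cosine(ratings["Alice"], ratings["Carol"])
# Alice's tastes are closer to Bob's than to Carol's: sim_ab > sim_ac
```

A real recommender would then predict Alice's missing ratings as a similarity-weighted average over her most similar users; latent factor models replace the raw rating vectors with learned low-dimensional ones.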
We will learn various "tools":
▪ Linear algebra (SVD, LFM, community analysis)
▪ Optimization (stochastic gradient descent)
▪ Dynamic programming (frequent itemsets)
▪ Hashing (LSH, Bloom filters)
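Of these tools, stochastic gradient descent is the one that recurs most often. A minimal sketch on a made-up 1-D least-squares problem, assuming nothing beyond the standard SGD update rule:

```python
import random

def sgd_linear(data, lr=0.05, epochs=500, seed=0):
    """Stochastic gradient descent for a 1-D least-squares fit y ~ w*x + b."""
    random.seed(seed)
    data = list(data)          # local copy so shuffling does not touch the caller's list
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)   # "stochastic": visit examples in random order
        for x, y in data:
            err = (w * x + b) - y   # error on this single example
            w -= lr * err * x       # gradient step for the slope
            b -= lr * err           # gradient step for the intercept
    return w, b

# Noiseless points from y = 2x + 1; SGD should recover w close to 2, b close to 1.
pts = [(x, 2 * x + 1) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
w, b = sgd_linear(pts)
```

The point of updating on one example at a time, rather than on the full dataset, is that each step is cheap, which is what makes the method viable at the data scales this course targets.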

Course topics, by data type, technique, and application:
▪ High-dimensional data: Locality sensitive hashing; Clustering; Dimension reduction
▪ Graph data: PageRank, TrustRank; Community Detection; Graph Representation Learning
▪ Infinite data (streams): Filtering data streams; Web advertising; Queries on streams
▪ Machine learning: SVM, SVD, SGD; Decision Trees, kNN; Perceptron, Neural Networks; Neural Reasoning, Large Language Models
▪ Apps: Recommender systems; Spam Detection; Search Engines; Social Networks
Instructor:
▪ Yongfeng Zhang, [email protected]

TAs:
▪ Xi Zhu (Section 1), Email: [email protected]
▪ Wujiang Xu (Section 2), Email: [email protected]
▪ Devanshi Patel (Section 3), Email: [email protected]
▪ Sai Samhith Thatikonda (Section 4), Email: [email protected]

Graders:
▪ Satyam Saini (Sections 1 & 2), Email: [email protected]
▪ Sasank Chindirala (Sections 3 & 4), Email: [email protected]
Data Structures (CS 112)
▪ Arrays, lists, sets, maps, queues, linked lists, etc.
Basic probability
▪ Moments, typical distributions, MLE, …
Programming
▪ Your choice, but C++/Java/Python will be very useful
Infrastructure (optional)
▪ Linux / Hadoop / Spark / PyTorch / Jupyter / Hugging Face
Fundamental Algorithms (CS 344, optional but preferred)
▪ Linear algebra, basic data structures

We provide some background, but the class will be fast paced.
Course website:
▪ Canvas: please join our Canvas homepage
▪ Lecture slides, homework, solutions, readings

Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman
▪ Free online: http://www.mmds.org
Slides, lecture notes, homework answers, etc. will be posted under Canvas → Files.
Course Website
▪ Homework assignments will also be released and submitted on Canvas.
▪ Be sure to enable your course notifications!
Canvas Chat:
▪ You may post questions and participate in discussions in the Canvas Chat room.
▪ Use Canvas for technical questions and public communication with the course staff.

For private questions, please email the instructor or a TA.

We will post course announcements on Canvas (make sure you check it regularly).
4 homework assignments: 40 points
▪ Theoretical and programming questions (10 points each)
▪ Assignments take lots of time. Start early!

Time and Submission
▪ Homework assignments are posted on Fridays, and you have 2 weeks to finish (due two Fridays later at 11:59pm).
▪ Homework can be submitted on Canvas.
▪ The late policy will be shown on the homework assignment: a 90% discount factor for each day late.
Midterm: 20 points
▪ Wednesday, Oct 30, 12-hour take-home exam (no class on that day)
▪ Detailed instructions will be provided on Canvas.
Final Project: 40 points
▪ Complete as a team of at most 3 students. The team can either choose from a set of provided project topics (to be posted on Canvas later this semester) or propose its own topic (subject to approval from the instructor); write the project report as a "mini-paper," make slides for the project, and submit the paper, slides, code, and data. (40%)

Bonus: 10 points
▪ We will select 20 teams; each can give a 15-minute presentation to the class, and each member of the selected teams will get 10 bonus credits.
▪ The final two weeks are presentation days.

It's going to be fun and hard work. ☺
3 to-do items for you:
▪ Register on Canvas
▪ Begin thinking about teaming up for the class project

Additional details/instructions can be found in the course syllabus (posted on Canvas).
