Introduction To Data Science 439

Introduction presentation to data science and understanding of concepts

Uploaded by

Surya Bhardwaj


Introduction to Data Science

Yongfeng Zhang, Rutgers University
Not only massive data, but also massive models and massive computation,
e.g., Large Language Models.

Source: Scaling Language Model Training to a Trillion Parameters Using Megatron (NVIDIA)

Data contains value and knowledge.

• But to extract the knowledge, data needs to be:
▪ Stored (Database Systems)
▪ Managed (Data Management for Data Science)
▪ And ANALYZED → this class (Data Science and Data Mining)

• Given lots of data
• Discover patterns and models that are:
▪ Valid: hold on new data with some certainty
▪ Useful: should help us learn something from the data
▪ Unexpected: non-obvious to us
▪ Understandable: humans should be able to interpret the pattern

Data Mining ≈ Big Data (from 2012)

Big Data + Predictive Analytics ≈ Data Science
Tycho Brahe (1546-1601): "We can observe it!"
▪ Danish astronomer
▪ Good at astronomical observation
▪ Observed and recorded a lot of data about how the planets circle around the sun

Johannes Kepler (1571-1630): "We can predict it!"
▪ German astronomer, assistant to Tycho Brahe
▪ Analyzed Tycho's data and discovered a rule hidden in the data
▪ Kepler's laws of planetary motion (the third law): $\frac{\tau^2}{r^3} = K$
▪ $\tau$: period of circling around the sun, $r$: radius of the orbit

From Tycho's records to Kepler's rule:
Time, Position
1, (a, b)
2, (c, d)
3, (e, f)
…
⇒ $\frac{\tau^2}{r^3} = K$

[1] Z. Li, J. Ji, Y. Zhang. "From Kepler to Newton: Explainable AI for Science Discovery." arXiv:2111.12210.
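To make the step from recorded data to a rule concrete, here is a minimal sketch of how the exponent in Kepler's rule could be recovered from a table of measurements. It is an illustration only: the planetary values are approximate modern figures (not Tycho's records), and the names `planets` and `fit_power_law` are hypothetical. The sketch fits a straight line to log(period) against log(radius); a slope near 1.5 means $\tau \propto r^{1.5}$, which is the same statement as $\tau^2 / r^3 = K$.

```python
import math

# Approximate modern values (assumption: NOT Tycho's original records).
# Each entry is (r, tau): orbital radius in astronomical units, period in years.
planets = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}

def fit_power_law(pairs):
    """Least-squares fit of log(tau) = slope * log(r) + intercept."""
    pairs = list(pairs)
    xs = [math.log(r) for r, _ in pairs]
    ys = [math.log(tau) for _, tau in pairs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_power_law(planets.values())

# With these units, tau^2 / r^3 should be close to the same constant for every planet.
ratios = {name: tau ** 2 / r ** 3 for name, (r, tau) in planets.items()}

print(f"fitted exponent: {slope:.3f}")
for name, k in ratios.items():
    print(f"{name}: tau^2 / r^3 = {k:.3f}")
```

The fit only describes what the data does; it says nothing about why the exponent takes that value.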
Johannes Kepler (1571-1630): "We can predict it!"
▪ German astronomer, assistant to Tycho Brahe
▪ Analyzed Tycho's data and discovered the rules hidden in the data
▪ Kepler's laws of planetary motion (the third law): $\frac{\tau^2}{r^3} = K$
▪ $\tau$: period of circling around the sun, $r$: radius of the orbit

Isaac Newton (1643-1727): "We understand it! We know why!"
▪ English mathematician, physicist, astronomer, theologian, and author
▪ Proposed Newton's law of universal gravitation and differential calculus
▪ Kepler's laws of planetary motion follow naturally from them, which explains why $\frac{\tau^2}{r^3} = K$ holds

Time, Position
1, (a, b)
2, (c, d)
3, (e, f)
⇒ $\frac{\tau^2}{r^3} = K$
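As a sketch of the "why" step, the rule can be derived for the simplified case of a circular orbit. The symbols $G$ (gravitational constant), $M$ (mass of the sun), $m$ (mass of the planet), and $\omega$ (angular velocity) are not in the slides and are introduced here as assumptions, along with $m \ll M$. Gravity supplies the centripetal force:

```latex
\frac{G M m}{r^2} = m \omega^2 r = m \left( \frac{2\pi}{\tau} \right)^2 r
\quad\Longrightarrow\quad
\frac{\tau^2}{r^3} = \frac{4\pi^2}{G M} = K
```

Under these assumptions the constant $K$ depends only on the mass of the sun, which is why the same value fits every planet.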
Tycho Brahe (1546-1601): Data Collection
▪ Time, Position records: 1, (a, b); 2, (c, d); 3, (e, f)
▪ Almost automatic

Johannes Kepler (1571-1630): Data Analytics (what)
▪ $\frac{\tau^2}{r^3} = K$
▪ Many available methods
▪ Main part of this course

Isaac Newton (1643-1727): Data Interpretation (why)
▪ Still needs much exploration
▪ We will touch on some topics
Massive data
- Search engine logs
- E-commerce transactions
- Social network data
- Impossible to manually analyze the data and derive conclusions from it

Use massive computing facilities and develop advanced algorithms to analyze the data and make predictions.
Advanced algorithms: machine learning, data mining, etc.

• Descriptive methods
▪ Find human-interpretable patterns that describe the data
▪ Example: Clustering

• Predictive methods
▪ Use some variables to predict unknown or future values of other variables
▪ Example: Recommender systems, computational advertising, etc.
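As a small illustration of a descriptive method, the sketch below clusters one-dimensional points with k-means (Lloyd's algorithm). The slides name clustering only as an example and do not prescribe an algorithm, so the choice of k-means, the function name `kmeans_1d`, and the toy data are all assumptions made for illustration.

```python
def kmeans_1d(points, centers, iterations=10):
    """Lloyd's algorithm on 1-D points, starting from the given centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Toy data (hypothetical): two visually obvious groups.
points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)
print(clusters)
```

The output is a description of the data (two groups and their centers), not a prediction about new values, which is the distinction the two bullet groups draw.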

• This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
▪ Scalability (big data and big models)
▪ Algorithms
▪ Computing architectures
▪ Automation for handling large data

[Figure: Data Mining at the intersection of Statistics, Machine Learning, and Database systems]

[Figure labels: Usage, Quality, Context, Streaming, Scalability]

• We will learn to analyze different types of data:
▪ Data is high dimensional (dimensionality reduction)
▪ Data is a graph (social and web link graph analytics)
▪ Data is infinite/never-ending (streaming data processing)
▪ Data is labeled (supervised) or unlabeled (unsupervised)
• We will learn to use different models of computation:
▪ Single machine in-memory
▪ Streams and online algorithms
▪ Map-Reduce, Hadoop, Spark, PyTorch, Hugging Face
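To give a feel for the Map-Reduce model of computation, here is a single-machine sketch of the classic word-count job. It only imitates the three phases (map, shuffle, reduce) in plain Python; it does not use Hadoop or Spark, and the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

# Toy input (hypothetical): each string stands for one document.
documents = ["big data big models", "big computation"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; the programming model stays the same.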
• We will learn to solve real-world problems:
▪ Recommender systems (dimensionality reduction, latent factor models, collaborative filtering, collaborative reasoning, large language models; e.g., Amazon)
▪ Market basket analysis (frequent itemset mining, the beer and diaper example; e.g., Walmart)
▪ Spam detection (web graph analysis, PageRank; e.g., Google)
▪ Social network analysis (e.g., Facebook)
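Market basket analysis is the easiest of these problems to sketch in a few lines. The example below counts how often each pair of items appears in the same basket and keeps the pairs that meet a support threshold. This is a brute-force illustration, not the scalable algorithms covered later in the course; the baskets and the function name `frequent_pairs` are made up for this example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort the distinct items so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Toy baskets (hypothetical data).
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]

pairs = frequent_pairs(baskets, min_support=2)
print(pairs)
```

Counting every pair in every basket does not scale to very large catalogs, which is the motivation for dedicated frequent itemset mining algorithms.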
• We will learn various "tools":
▪ Linear algebra (SVD, LFM, community analysis)
▪ Optimization (stochastic gradient descent)
▪ Dynamic programming (frequent itemsets)
▪ Hashing (LSH, Bloom filters)
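Of the tools listed, a Bloom filter is compact enough to sketch directly. The class below is a minimal illustration, assuming a fixed-size bit array and positions derived from salted SHA-256 digests; the class name `BloomFilter`, its parameters, and the example URLs are hypothetical. A Bloom filter never reports a stored item as missing, but it can occasionally report an unseen item as present.

```python
import hashlib

class BloomFilter:
    """Approximate set membership with a fixed-size bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-256 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely not added"; True means "possibly added".
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for url in ["a.example", "b.example"]:
    seen.add(url)

print(seen.might_contain("a.example"))
print(seen.might_contain("c.example"))
```

The trade-off is memory for certainty: the filter stores no items at all, only bits, so it fits in memory even when the set of items does not.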

Course topics by type of data:
▪ High dim. data: Locality sensitive hashing; Clustering; Dimension reduction; Neural Network
▪ Graph data: PageRank, TrustRank; Spam Detection; Community Detection; Graph Representation Learning
▪ Infinite data: Filtering data streams; Web advertising; Queries on streams
▪ Machine learning: SVM, SVD, SGD; Decision Trees, kNN; Perceptron, Neural network; Neural Reasoning, Large Language Model
▪ Apps: Recommender systems; Search Engine; Social Networks

• Instructor:
▪ Yongfeng Zhang, [email protected]

• TA:
▪ Xi Zhu (Section 1), Email: [email protected]
▪ Wujiang Xu (Section 2), Email: [email protected]
▪ Devanshi Patel (Section 3), Email: [email protected]
▪ Sai Samhith Thatikonda (Section 4), Email: [email protected]

• Graders:
▪ Satyam Saini (Sections 1 & 2), Email: [email protected]
▪ Sasank Chindirala (Sections 3 & 4), Email: [email protected]

• Data Structures (CS 112)
▪ Arrays, lists, sets, maps, queues, linked lists, etc.
• Basic probability
▪ Moments, typical distributions, MLE, …
• Programming
▪ Your choice, but C++/Java/Python will be very useful
• Infrastructure (optional)
▪ Linux / Hadoop / Spark / PyTorch / Jupyter / Hugging Face
• Fundamental Algorithms (CS 344, optional but preferred)
▪ Linear algebra, basic data structures

• We provide some background, but the class will be fast paced.
• Course website:
▪ Canvas: Please join our Canvas homepage
▪ Lecture slides, homework, solutions, readings

• Textbook: (LRU) Mining of Massive Datasets by J. Leskovec, A. Rajaraman, J. D. Ullman
Free online: http://www.mmds.org

• Slides, lecture notes, homework answers, etc. -> Files

• Course Website
▪ Homework assignments will also be released and submitted on Canvas.
• Be sure to enable your course notifications!

• Canvas Chat:
▪ You may post questions and participate in discussions in the Canvas Chat room.
▪ Use Canvas for technical questions and public communication with the course staff.

• For private questions, please email me or your TA.

• We will post course announcements on Canvas (make sure you check it regularly).

• 4 homework assignments: 40 points
▪ Theoretical and programming questions (10 points each)
▪ Assignments take lots of time. Start early!

• Time and Submission
▪ Homework assignments are posted on Fridays, and you have 2 weeks to finish (due two Fridays later at 11:59pm).
▪ Homework can be submitted on Canvas.
▪ The late policy will be shown on each homework assignment: a 90% discount factor for each day late.
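One possible reading of the late policy is that the score is multiplied by 0.9 once for each day late. The sketch below works through that arithmetic under this assumption only; the authoritative rule is the one shown on each homework assignment, and the function name `late_score` is hypothetical.

```python
def late_score(raw_score, days_late, factor=0.9):
    """Apply a multiplicative discount once per late day (assumed interpretation)."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    return raw_score * factor ** days_late

# A 10-point homework submitted on time, one day late, and two days late.
for days in (0, 1, 2):
    print(days, round(late_score(10, days), 2))
```

Under this reading, a full-score homework submitted two days late would keep 81% of its points.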

• Midterm: 20 points
▪ Wednesday, Oct 30: 12-hour take-home exam (no class on that day)
▪ Detailed instructions will be provided on Canvas.

• Final Project: 40 points
▪ Complete as a team of at most 3 students.
▪ The team can either choose from a set of provided project topics (to be posted on Canvas later this semester) or propose its own topic (subject to approval from the instructor).
▪ Write the project report as a "mini-paper," make slides for the project, and submit the paper, slides, code, and data. (40%)
• Bonus: 10 points
▪ We will select 20 teams, each of which can give a 15-minute presentation to the class; each member of the selected teams will get 10 bonus credits.
▪ The final two weeks are presentation days.

• It's going to be fun and hard work. ☺

• 3 To-do items for you:
▪ Register on Canvas
▪ Begin to think about your team for the class project

• Additional details/instructions can be found in the course syllabus (posted on Canvas).
