CS429: Data Mining: About Instructor

The document provides an overview of the CS429: Data Mining course taught by instructor Sibt ul Hussain. It discusses the instructor's background and research interests in deep learning and machine learning. It then provides a high-level overview of data science, including definitions, examples of data products from companies like Google, Netflix, and Facebook, and contrasts data science with related fields like databases and scientific computing.

Uploaded by

Abdullah Ammar

1/27/2017

CS429: Data Mining


(Data Science / Machine Learning / Data Analysis)

Sibt ul Hussain
http://sites.google.com/SibtulHussain

About Instructor
• Instructor: Sibt ul Hussain (Leading Research Scientist of ReVeaL Research Group)
Experience:
2006-2007 SUPELEC, Paris, France
2007 INRIA (LEAR), Grenoble, France
2007-2012 LJK, Grenoble, France
2012-2013 GREYC, Caen, France
2013-now Assistant Professor, FAST
Research Grants:
2014-2016 NVIDIA, ~US$11,000 for research in deep learning

• My major areas of research are deep learning (visual recognition),
machine learning, and data mining, and I have extensive experience (11+
years) of building these systems from scratch.


Data Science Overview


• Why? And why all the excitement?
• What?
• How?

Why Data Science


Building Telescope(s) for the data


• “Just as there are odors that dogs can smell
and we cannot, as well as sounds that dogs
can hear and we cannot, so too there are
wavelengths of light we cannot see and
flavors we cannot taste.”
• These sounds that we can’t hear, this light
that we can’t see, how do we even know
about these things in the first place?

• Richard Hamming, The Unreasonable Effectiveness of Mathematics

Building Telescope(s) for the data


• Well, we built tools. We built tools that
adapt these things that are outside of our
senses, to our human bodies, our human
senses.
• We can’t hear ultrasonic sound, but you hook a
microphone up to an oscilloscope and there it is.
You’re seeing that sound with your plain old
monkey eyes. We can’t see cells and we can’t
see galaxies, but we build microscopes and
telescopes and these tools adapt the world to our
human bodies, to our human senses.


Building Telescope(s) for the data


• As data scientists (machine learning researchers or
data miners), our job is basically to struggle with data
that is incomprehensible – literally impossible for the
human mind to comprehend – and try to build tools to
think about it and work with it.

• These tools are a bit like a telescope. Just as a
telescope transforms the sky into something we can see,
the tools we will learn transform the data into a more
accessible form. Remember: one learns about the telescope
by observing how it magnifies the night sky, but the
really remarkable thing is what one learns about
the stars.

Why Data Science ?


• Google, Yahoo today
▫ Web Search and Computational advertising
▫ Google: 35,000 searches/sec
▫ Yahoo! scale: 600 million users per month, 4 billion clicks
per day, 25 terabytes of data collected every day
• Netflix 2007
▫ Movie recommendations, Netflix prize
▫ 100 million ratings, 500,000 users, 18,000 movies
• Amazon 2003
▫ Product recommendations, reviews
▫ 29 million customers, millions of products
• Kaggle
▫ Hundreds of competitions from a variety of
companies.
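
To get a feel for the scale of the Netflix Prize figures above, here is a small back-of-the-envelope sketch. It uses only the rounded numbers quoted on the slide; the variable names are chosen for illustration. It shows that even 100 million ratings fill only a small fraction of the full user-by-movie table, which is part of why recommendation is a prediction problem rather than a lookup.

```python
# Rounded figures quoted on the slide for the Netflix Prize data.
ratings = 100_000_000
users = 500_000
movies = 18_000

possible_pairs = users * movies        # every user rating every movie
density = ratings / possible_pairs     # fraction of (user, movie) pairs actually rated
ratings_per_user = ratings / users     # average number of ratings per user

print(f"possible (user, movie) pairs: {possible_pairs:,}")
print(f"observed fraction: {density:.2%}")
print(f"average ratings per user: {ratings_per_user:.0f}")
```

Under these rounded figures, roughly one pair in ninety has a rating; the rest is what a recommender has to estimate.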


Some recent Data Science Competitions

The Obama Administration’s Roadmap for AI Policy

(President Obama’s Executive Office) Report says
▫ AI policy should be an urgent concern.
▫ The U.S. government is not designing policy for
general intelligence or “strong AI.”
▫ AI isn’t a science project; it’s commercially
important.
▫ The U.S. government has no clear vision regarding
where to focus research funding.
▫ AI can help governments do their jobs better.
▫ China is a leader, not a follower or a copycat.


What is data science – A Definition


• Data science is the discipline that uses
computer science, statistics, machine learning,
visualization, and human-computer interaction
to collect, clean, integrate, analyze, visualize,
and interact with data in order to create data
products.


Data Science – One Definition

Hal Varian Explains


• The ability to take data— to be able to
understand it, to process it, to extract value
from it, to visualize it, to communicate it—that’s
going to be a hugely important skill in the next
decades, not only at the professional level but
even at the educational level for elementary
school kids, for high school kids, for college kids.
Because now we really do have essentially free
and ubiquitous data. So the complementary
scarce factor is the ability to understand that
data and extract value from it.


Goal of Data Science ?

Turn data into data products.

Data Products – Google


• Web Search
• Google Ads
• News Recommendation Engine
• Google Maps
• PlayStore
• In fact, almost all the products and services
Google provides are now data-science driven.


Data Products – Netflix


• Personalized Movie Ratings
• Movie Recommendations
• Similar Movies
• Movie Categories (e.g., 80’s movie with a strong
female lead, Kung Fu movies).

• These products have put other businesses out of business.

Data Products – Facebook/LinkedIn


• People you may know
• Applications you may like
• Jobs/Events you might be interested in
• Classifier for bad users and bad content
• With high accuracy, Facebook can guess whether
you are single or married


Data Products – Twitter


• Text Analysis – Spam Filter/Similarity Search
• User Sentiment/Satisfaction/Feedback
• News Breakout
• Trend and Topics

• 200 million users as of 2011, generating over
200 million tweets and handling over 1.6 billion
search queries per day.


Contrast: Databases
Databases | Data Science
Querying the past | Querying the future

• Querying the future:
▫ What happens if I show this ad?
▫ Or recommend this product?
▫ Or filter this email?
▫ Microsoft lost an estimated $1.7B on Surface
computers (past), but what do they expect to make
in the future?
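
The contrast above can be sketched in a few lines. The monthly sales numbers below are invented for illustration, and the straight-line fit is only one simple stand-in for "a model"; nothing here comes from the Microsoft example. The first query aggregates rows that already exist, while the second asks about a month for which no row exists yet.

```python
from statistics import mean

# Hypothetical monthly unit sales (months 1..6); the numbers are invented.
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 13, 15, 16, 18]

# Querying the past: a database-style aggregate over rows that already exist.
total_past = sum(sales)

def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x, returned as (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

# Querying the future: fit a model to the past and ask about month 7,
# which has no row in the data yet.
a, b = fit_line(months, sales)
forecast = a + b * 7

print(f"total sold so far: {total_past}")
print(f"forecast for month 7: {forecast:.1f}")
```

The past query has one right answer; the forecast is an estimate whose quality depends on how well the chosen model matches the data.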

Contrast: Scientific Computing


[Figure: an image is fed to a general-purpose classifier that outputs “Supernova” or “Not”. Credit: Nugent group / C3 LBL]

Scientific Modeling | Data-Driven Approach
Physics-based models | General inference engine replaces model
Problem-structured | Structure not related to problem
Mostly deterministic, precise | Statistical models handle true randomness and unmodeled complexity
Run on supercomputer or high-end computing cluster | Run on cheaper computer clusters


Contrast: Machine Learning


Machine Learning (old) | Data Science
Develop new (individual) models | Explore many models, build and tune hybrids
Prove mathematical properties of models | Understand empirical properties of models
Improve/validate on a few, relatively clean, small datasets | Develop/use tools that can handle massive datasets
Publish a paper | Take action!

The life of the data

Data Sources → Collect → Clean → Integrate → Analysis → Visualization → Interface → Users


Data Scientist’s Practice


[Diagram with the stages: Digging Around in Data; Clean, prep; Hypothesize Model; Large Scale Exploitation; Evaluate, Interpret]


What is a Data Scientist?


• A Data Scientist should be good at data analysis,
math, and statistics, but should also be able to write
code that handles huge amounts of data and use the
extracted information to build data products.

How to become a Data Scientist?


• There is no royal road.


Data Science Myths


• A Magic Trick:

Input Data → Model(s) → Product

• If that were the case, there would be no need for you
or even me.
• Kaggle would stop working …

Hardest part of Data Science


• Problem understanding: a problem should be well
defined
▫ What is your business requirement?
• Getting (or generating) good data:
▫ Remember: data beats algorithms
• Predictive Modeling:
▫ Finding the right set of techniques and algorithms
▫ A thorough understanding can help
• Defining Evaluation Metrics
▫ One of the main reasons for failure
▫ The complete system depends on the right evaluation metric.
• Infrastructure (somewhat solved)
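
A minimal sketch of why the choice of evaluation metric matters, using invented labels for a spam-filter setting. The data and the "always predict not spam" model are assumptions made up for this example; the two helper functions are written by hand rather than taken from any library.

```python
# Hypothetical labels: 1 = spam, 0 = not spam. Only 5 of 100 emails are spam.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a "model" that always predicts "not spam"

def accuracy(truth, pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def recall(truth, pred, positive=1):
    """Fraction of truly positive items that the model found."""
    tp = sum(t == positive and p == positive for t, p in zip(truth, pred))
    fn = sum(t == positive and p != positive for t, p in zip(truth, pred))
    return tp / (tp + fn) if (tp + fn) else 0.0

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)

print(f"accuracy: {acc:.2f}")
print(f"recall on spam: {rec:.2f}")
```

Judged by accuracy alone, this useless model looks excellent; judged by recall on the class we care about, it catches nothing. A system tuned against the wrong metric inherits that blind spot.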


What Will You Learn?


• A complete set of tools and algorithms, from data
collection, cleaning, and munging to analysis and
visualization.
• Specifically,
▫ Use Python and other tools to scrape, clean, and
process data
▫ Use statistical methods and visualization to quickly
explore data
▫ Apply statistics and computational analysis to make
predictions based on data
▫ Effectively communicate the outcome of data analysis
using descriptive statistics and visualizations
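
As a small taste of the "clean, process, and summarize" steps listed above, here is a sketch that uses only the Python standard library (the course tools such as Pandas would do the same job with less code). The raw data, the column names, and the `clean_ages` helper are all hypothetical.

```python
import csv
import io
from statistics import mean, median, stdev

# Hypothetical raw export with typical problems:
# stray whitespace, a missing value, and a non-numeric entry.
raw = """name,age
 Ali ,23
Sara,
Omar,abc
Hina, 31
Bilal,27
"""

def clean_ages(text):
    """Parse the CSV text and keep only ages that are valid whole numbers."""
    ages = []
    for row in csv.DictReader(io.StringIO(text)):
        value = (row["age"] or "").strip()
        if value.isdigit():
            ages.append(int(value))
    return ages

ages = clean_ages(raw)
summary = {
    "n": len(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": stdev(ages),
}
print(summary)
```

Note that two of the five rows are silently dropped here; in a real analysis, how many rows were discarded and why is itself worth reporting.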

What problems you will work on


• Different problems, based on these types of data:

▫ Medical Data
▫ Reviews Data
▫ Mobile Sensor Data
▫ Company Consumer Data
▫ …


What do we expect from you?


• Remain proactive and curious
▫ By asking questions and giving insights.

• Interest & Hard work ☺

• Prerequisite: you should have basic programming
knowledge.


In Labs

Course Contents
• Data
▫ Geometric & Algebraic View
  Numeric Attributes
  Categorical Attributes
▫ Probabilistic View
• Classification
▫ Bayes Networks (Naive Bayes)
▫ Decision Trees
▫ Random Forests and Boosting
▫ Linear Regression
▫ Perceptrons
▫ SVM
▫ Logistic Regression and Softmax
▫ Neural Networks
• Clustering
▫ K-means
▫ Hierarchical
• Pattern Mining
▫ Apriori


Textbooks and learning resources


Recommended textbooks:
1. Data Science From Scratch
2. Doing Data Science by Rachel Schutt and Cathy O’Neil
3. Data Mining and Analysis: Fundamental Concepts and
Algorithms (Mohammed J. Zaki)
4. Python for Data Analysis: Data Wrangling with Pandas,
NumPy, and IPython by Wes McKinney
5. Introduction to Data Mining by Pang-Ning Tan, Michael
Steinbach, Vipin Kumar
Suggested textbooks:
1. Probability for Computer Scientists
(http://luthuli.cs.uiuc.edu/~daf/courses/Probcourse/Probcourse-2015/book-9-midJan15.pdf)

Resources
• Quora
▫ What is Data Science?
▫ How do I become a Data Scientist?
▫ How does Data Science differ from traditional statistical analysis?
• Related Courses
▫ Concepts in Computing with Data, Berkeley
▫ Practical Machine Learning, Berkeley
▫ Artificial Intelligence, Berkeley
▫ Visualization, Berkeley
▫ Data Mining and Analytics in Intelligent Business Services, Berkeley
▫ Statistical Learning Theory and Applications, MIT
▫ Data Literacy, MIT
▫ Prediction, MIT
▫ Introduction to Data Mining, UIUC
▫ Learning from Data, Caltech
▫ Introduction to Statistics, Harvard
▫ Introduction to Computing, Modeling, and Visualization, Harvard
▫ Data-Intensive Information Processing Applications, University of Maryland
▫ Statistical Inference, UPenn
▫ Introduction to Data Science,


Resources

• Coursera
▫ Data Analysis, Johns Hopkins
▫ Computing for Data Analysis, Johns Hopkins
▫ Machine Learning, Stanford
▫ Introduction to Data Science, University of Washington
▫ Computational Methods for Data Analysis, University of Washington
▫ Machine Learning, University of Washington
• Related Workshops
▫ Data Bootcamp, Strata 2011
▫ Machine Learning Summer School, Purdue 2011
▫ Looking at Data
• Books
▫ Competing on Analytics
▫ Analytics at Work
▫ Super Crunchers
▫ The Numerati
▫ Data Driven
▫ Data Source Handbook
▫ Programming Collective Intelligence
▫ Mining the Social Web
▫ Data Analysis with Open Source Tools
▫ Visualizing Data
▫ The Visual Display of Quantitative Information
▫ Envisioning Information
▫ Visual Explanations: Images and Quantities, Evidence and Narrative
▫ Beautiful Evidence

Software and Tools


•Python
•IPython
•Pandas
•NumPy
•Matplotlib
•SciPy
•BeautifulSoup
•Scikit-Learn
•Linux Shell
•Cython

I recommend using Ubuntu 14.04 and the Anaconda Python distribution
(it works on both Ubuntu and Windows).


Prerequisites
1. Linear Algebra
2. Calculus
3. Probability
4. Solid knowledge of programming (if you are not
comfortable with programming, do not take this
course)


General Information
Grading
Programming Assignments --- 20% to 30%
Quizzes --- 10%
Mid-term exams --- 20% to 30%
Final projects (most probably on individual basis) --- 10%
Final --- 40%
Lab --- 15%
Piazza group (up to 2 bonus marks):
For open communication, questions, feedback, polls, suggestions, etc.
https://piazza.com/class/ (Slate integrated)
Please make the best use of Piazza for learning among yourselves.

Work Submission
IPython Notebooks and Kaggle Competition Score.
Soft copy only (we will run your program).
Plagiarism: If our system finds that your code is plagiarized, you
risk:
Zero in all assignments.
Referral to DC committee.
A straight F in the course.
Warning: The majority of failure cases will be due to
plagiarism.
Feel free to discuss assignments with each other (on Piazza), but coding
must be done individually (except for the final project).
Late submissions of assignments are not allowed.


Assignments (you might be reporting results on a Kaggle competition)

•Probabilistic Models
•Data Scraping
•Data Summarization and Visualization
•Bayes & Naive Bayes (Sentiment Analysis)
•Decision Trees & Random Forests
•Linear and Logistic Regression
•SVM, Neural Networks, Softmax
•Clustering
•Apriori

Who Should take this course


•Students who are comfortable with
programming and who want:
1. To learn the complete chain of data analysis tools
& techniques.
2. To learn machine learning algorithms.
3. To compete in Kaggle competitions.
4. To solve real-world data problems.
5. To build applications using the techniques learned.


Who MUST NOT take this course


•Students who think that electives exist just to be
passed. (It's a tough course, so be prepared.)
•Students on warning. (Be careful: you risk getting
another warning. Simply don’t take the course.)
•Students who cannot program or cannot think
logically.
•Students who struggle with basic mathematical
concepts like Bayes' rule, derivatives, norms,
etc.
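
As a rough self-check on the mathematical concepts named above, the following sketch computes one instance of each in plain Python. The probabilities in the Bayes' rule part are invented for illustration, and the derivative is a numerical approximation, not a symbolic one. If every line here reads naturally, the mathematical background is probably sufficient.

```python
import math

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers
# for a test that detects a rare condition.
p_cond = 0.01               # P(condition)
p_pos_given_cond = 0.90     # P(positive test | condition)
p_pos_given_healthy = 0.05  # P(positive test | no condition)

p_pos = p_pos_given_cond * p_cond + p_pos_given_healthy * (1 - p_cond)
p_cond_given_pos = p_pos_given_cond * p_cond / p_pos

# Derivative: central-difference approximation of f'(x).
def derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

slope = derivative(lambda x: x ** 2, 3.0)  # exact answer is 2 * 3 = 6

# Norms of the vector v = (3, -4).
v = [3.0, -4.0]
l1_norm = sum(abs(x) for x in v)            # |3| + |-4|
l2_norm = math.sqrt(sum(x * x for x in v))  # sqrt(9 + 16)

print(f"P(condition | positive) = {p_cond_given_pos:.3f}")
print(f"d/dx x^2 at x=3 is about {slope:.4f}")
print(f"L1 norm = {l1_norm}, L2 norm = {l2_norm}")
```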
