
Introduction To Data Mining Unit 1

The document outlines an introduction to a data mining course, including the course objectives, topics, software, books, and schedule. It discusses applications of data mining and machine learning. It also provides a brief overview of data mining and related fields such as data analytics, machine learning, and big data. Key trends in big data from articles are summarized, including the shortage of data science talent and the volume, variety, and velocity of big data.


9/22/2020

INTRODUCTION TO DATA MINING


UNIT # 1

SPRING 2020 Sajjad Haider 1

TODAY’S AGENDA

 Course management
 Brief overview of Data Mining and allied fields
 Summary of a few impactful articles and recent trends


COURSE MANAGEMENT


LEARNING OBJECTIVES

 Learn the art of modeling and interpreting large, complicated data sets
via predictive and descriptive data mining methods.
 Get to know several online data repositories and how to participate in
data analytics competitions held at Kaggle.com and other sites.
 Gain advanced-level expertise in data analytics software and languages
such as KNIME and Python.


COURSE OVERVIEW

 Data Preparation
 Classification Techniques
 Clustering
 Text Analytics
 Regression Analysis
 Principal Component Analysis
 Association Rule Mining


SOFTWARE AND DATA REPOSITORIES

 KNIME
 Python
 Data on Kaggle Website
 http://www.kaggle.com/


BOOKS

 Data Mining for Business Analytics: Concepts, Techniques, and
Applications in R (2017)
 Data Mining and Data Warehousing: Principles and Practical Techniques
(2019)
 Learning Data Mining with Python (2017)
 Data Mining: Practical Machine Learning Tools and Techniques by Witten
and Frank (2016)


ACKNOWLEDGEMENT

 Although I am not following the two books below closely, their slides
are still very popular in academia, and I will be using them
occasionally:
 Data Mining: Concepts and Techniques (2011)
 Introduction to Data Mining (2018)


(TENTATIVE) MARKS DISTRIBUTION

 Final 40
 Project 15
 Assignments + Quizzes 45


MEETING HOURS

 Office Hours:
 Monday/Wednesday: noon – 1 PM and 4 – 5 PM
 or by appointment (by e-mailing me at [email protected]).

 Note: I DO NOT entertain SMS/WhatsApp messages. E-mail is the
official medium of correspondence.


OVERVIEW OF DATA MINING AND ALLIED FIELDS


APPLICATIONS OF DATA MINING/MACHINE LEARNING

 Traffic Predictions
 Google Maps
 Online Transportation Networks
 Uber/Careem for price prediction
 Video Surveillance
 Crime detection
 Fraud Detection
 Financial institutions

APPLICATIONS OF DATA MINING/MACHINE LEARNING (CONT’D)

 Social Media Services
 Face recognition by Facebook
 Hate speech detection by Facebook/Twitter
 Inappropriate content by YouTube
 Emails
 Product Recommendation
 Amazon, YouTube, and others
 Machine Translation
 Autonomous Vehicles

MACHINE LEARNING

‘A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance
at tasks in T, as measured by P, improves with experience E.’
(Tom Mitchell, 1997)
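
The definition can be made concrete with a toy sketch in Python. Everything in it is an assumption chosen for illustration: the task T is labelling integers 0..99 as "large" (50 or more) or "small", the experience E is a growing list of labelled examples, and the performance measure P is accuracy over all 100 integers. The learner is a deliberately simple midpoint-threshold rule, not a method taken from the course material.

```python
from typing import List, Tuple

# (x, label): label is 1 for "large" and 0 for "small" (hypothetical toy task)
Example = Tuple[int, int]


def true_label(x: int) -> int:
    """Ground truth for task T: integers of 50 or more are 'large'."""
    return 1 if x >= 50 else 0


def learn_threshold(examples: List[Example]) -> float:
    """Learner: place the cut midway between the largest 'small' example
    and the smallest 'large' example seen so far."""
    largest_small = max(x for x, label in examples if label == 0)
    smallest_large = min(x for x, label in examples if label == 1)
    return (largest_small + smallest_large) / 2


def accuracy(threshold: float) -> float:
    """Performance measure P: fraction of 0..99 labelled correctly."""
    test_points = range(100)
    correct = sum(
        1 for x in test_points
        if (1 if x >= threshold else 0) == true_label(x)
    )
    return correct / len(test_points)


# Experience E arrives in three batches of labelled examples.
batches = [[(0, 0), (80, 1)], [(30, 0), (60, 1)], [(48, 0), (52, 1)]]

seen: List[Example] = []
accuracies: List[float] = []
for batch in batches:
    seen.extend(batch)
    accuracies.append(accuracy(learn_threshold(seen)))

print(accuracies)  # [0.9, 0.95, 1.0]
```

As more examples are seen, the learned threshold moves from 40 to 45 to 50, so accuracy on the task improves with experience, which is what the definition asks for.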


A SIMPLIFIED TAXONOMY

 Data Science > Data Analytics > Data Mining > Machine Learning
 Data Analytics also deals with Visualization
 Data Science also deals with data acquisition and management of data
 Besides machine learning, data mining also makes use of statistical models

 Because of a significant overlap and due to the popularity of different terms in
different communities, the boundaries of these terms are not as crisp as
shown in this slide.


DATA MINING

 Data mining is a process of automated discovery of previously unknown
patterns in large volumes of data.
 This large volume of data is usually the historical data of an organization,
stored in what is known as a data warehouse.
 Data mining deals with large volumes of data, typically gigabytes or
terabytes and sometimes as much as zettabytes (in the case of big data).
 Patterns must be valid, novel, useful and understandable.
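
A minimal sketch of what "automated discovery of patterns" can look like, written in Python. The shopping baskets below are made-up data, and `frequent_pairs` and `min_count` are hypothetical names introduced only for this illustration. The sketch counts how often pairs of items occur together and keeps the pairs that appear in at least `min_count` baskets. It shows only the discovery step; judging whether a pattern is valid, novel, useful and understandable still needs evaluation and domain knowledge.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transaction data: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "jam", "milk"},
    {"butter", "milk"},
]


def frequent_pairs(baskets, min_count):
    """Return item pairs that occur together in at least min_count baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort items so each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


pairs = frequent_pairs(transactions, min_count=2)
print(pairs)  # {('bread', 'milk'): 3, ('butter', 'milk'): 2}
```

On real data the same idea has to scale to millions of baskets, which is why dedicated association rule mining algorithms are used instead of counting every pair directly.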


DATA MINING LIFE CYCLE (CRISP-DM)

1. Statistical Models
2. Machine learning


SUMMARY OF A FEW ARTICLES


HBR ARTICLE (CONT’D)

 Data scientists are the people who understand how to fish out answers
to important business questions from today’s tsunami of unstructured
information.
 As companies rush to capitalize on the potential of big data, the largest
constraint many face is the scarcity of this special talent.


BIG DATA: THE NEXT FRONTIER FOR INNOVATION (MCKINSEY 2011)

 Big Data refers to datasets whose size is beyond the ability of typical
database software tools to capture, store, manage and analyze.
 The demand for deep analytical positions in a big data world could exceed
the supply being produced on current trends by 140K to 190K positions.
 There is also a need for 1.5 million additional managers and analysts in
the US who can ask the right questions and consume the results of the
analysis of big data effectively.


WHAT IS BIG DATA?

 There is no consensus as to how to define big data

“Big data exceeds the reach of commonly used hardware environments and
software tools to capture, manage, and process it within a tolerable elapsed
time for its user population.” - Teradata Magazine article, 2011

“Big data refers to data sets whose size is beyond the ability of
typical database software tools to capture, store, manage and analyze.”
- The McKinsey Global Institute, 2011

 One reasonable definition is that it’s data which can’t comfortably be
processed on a single machine.

3 V’S

 Doug Laney was the first to talk about the 3 V's in Big Data management:
 Volume: there is more data than ever before, and its size keeps
increasing, but the percentage of data that our tools can process does not
 Variety: there are many different types of data, such as text, sensor
data, audio, video, graph, and more
 Velocity: data is arriving continuously as
streams of data, and we are interested in
obtaining useful information from it in real
time
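
The Velocity point can be illustrated with a small Python sketch that processes a stream one value at a time instead of loading it all first. The function names and the sensor readings are hypothetical, chosen only for illustration. The running mean is updated incrementally, so an up-to-date answer is available after every arriving value and memory use stays constant however long the stream runs.

```python
from typing import Iterable, Iterator, Tuple


def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean so far) after each value, storing no history."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: shift the mean toward the new value.
        mean += (value - mean) / count
        yield count, mean


def sensor_readings() -> Iterator[float]:
    """Stand-in for a live data source (made-up temperature readings)."""
    for reading in [20.0, 22.0, 21.0, 25.0]:
        yield reading


results = list(running_mean(sensor_readings()))
for count, mean in results:
    print(count, mean)
```

The same one-pass pattern applies to other summaries, such as counts, minimums and maximums, which is why it is a common starting point for stream processing.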

4 V’S (IBM 2014)
