Course - Data Science Foundations - Data Mining

Data reduction techniques like principal component analysis (PCA) can be used to simplify large datasets by focusing on the most meaningful variables and reducing noise. This involves projecting the data from a high-dimensional space onto a lower-dimensional space to more easily identify patterns while still representing the overall data. Clustering algorithms group similar observations together based on distance or density between data points. Classification algorithms take clustered data and assign new observations to the appropriate clusters/buckets. Anomaly detection identifies outliers that may distort statistics and correlations in the data.

Uploaded by

Imtiaz N

Data Reduction:

Simplify the dataset to focus on the variables or constructs that carry the most meaning and separate
them from noise. Here we are generally talking about reducing variables or fields (as opposed to
observations). Possible reasons:

- Storage (hard drive)
- Memory (RAM)
- Time
- Reduce noise / distractions
- Focus on patterns
- Easier to interpret

The analogy is projecting a shadow: we take data from a high-dimensional space (each variable in the
dataset is a dimension) and project it onto a lower-dimensional space. Think of a three-dimensional
object casting a shadow on a two-dimensional surface: you can still tell what the object is. One way
of doing this is PCA (Principal Component Analysis). Tools used may be:

- R
- Python
- Orange
- RapidMiner
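The shadow analogy can be sketched in Python. This is a minimal illustration, assuming NumPy is available; the function name `pca_project` and the made-up dataset are hypothetical and not part of the course material. It centers the data, finds the principal directions with a singular value decomposition, and projects three variables down to two.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)  # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered from most to
    # least variance explained.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    scores = centered @ vt[:n_components].T
    return scores, explained[:n_components]

# Made-up data: three variables, where the third is almost the sum of
# the first two, so the points lie close to a two-dimensional plane.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + b + 0.01 * rng.normal(size=200)])

scores, explained = pca_project(X, n_components=2)
print(scores.shape)     # two columns instead of three
print(explained.sum())  # share of total variance kept by the projection
```

Because the third variable is nearly redundant here, the two-dimensional "shadow" keeps almost all of the variance in the original data.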

Clustering:

The idea is to sort the entire set of observations or cases into groups so that “like goes with like”.
This is a grouping of convenience rather than some sort of natural/universal grouping: we group the
cases to accomplish a specific purpose. For example, in marketing, similar customers are grouped
together for offers.

Algorithms used for clustering can be:

- Distance between points
  o Measure the distance from every point to every other point
  o Cons: works well only for convex clusters, very slow for big data
- Distance from a centroid
  o K-means
- Density of data
- Distribution models
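The centroid-based approach can be sketched as follows. This is a bare-bones k-means written for illustration only, using just the standard library; the function name `kmeans`, the fixed iteration count, and the toy points are assumptions rather than anything prescribed by the course.

```python
import math
import random

def kmeans(points, k, iterations=10, seed=0):
    """Very small k-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Made-up customers described by two numeric features.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.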

Classification:

Choosing the right bucket for data. Examples:

- Spam filters
- Fraud detection
- Genetic testing

Classification complements clustering. Clustering creates buckets and classification puts new cases into
them. Algorithms used for classification:

- K-nearest neighbors (k-NN)
- Naïve Bayes
- Decision trees
- Random forests
- Support vector machines (SVM)
- Artificial neural networks (ANN)
- K-means (new cases are assigned to the nearest cluster centroid)
- Logistic regression
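As an illustration of putting a new case into an existing bucket, here is a minimal k-nearest-neighbors sketch using only the standard library. The function name `knn_predict`, the two features, and the labeled examples are made up for this example.

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors.

    train is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda item: math.dist(item[0], new_point))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up labeled cases: two numeric features per message.
train = [
    ((1, 1), "ham"), ((1, 2), "ham"), ((2, 1), "ham"),
    ((8, 8), "spam"), ((9, 8), "spam"), ((8, 9), "spam"),
]

print(knn_predict(train, (2, 2)))  # ham
print(knn_predict(train, (7, 9)))  # spam
```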

Anomaly Detection:

Anomalies distort statistics, correlations, etc. There are a few ways to deal with them:

- Delete them, while making sure this does not invalidate the analysis
- Transform the data (log, squares, etc., to make the distribution symmetrical)
- Use robust methods that are not strongly influenced by anomalies (e.g., median rather than mean)
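The three options above can be compared on a small made-up sample. This sketch uses only the standard library. The 1.5 × IQR rule used to flag values for deletion is a common convention chosen here as an assumption; the notes do not name a specific rule.

```python
import math
import statistics

values = [10, 12, 11, 13, 12, 11, 250]  # made-up data; 250 is the anomaly

# Robust: the median barely notices the anomaly, the mean is pulled upward.
mean = statistics.mean(values)
median = statistics.median(values)

# Delete: flag values outside 1.5 * IQR of the quartiles (assumed rule).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
cleaned = [v for v in values if v not in outliers]

# Transform: a log transform compresses the long right tail.
logged = [math.log(v) for v in values]

print(mean, median)
print(outliers, cleaned)
```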

Association Analysis:

- Powerful method of finding associations (items that go together)
- Gives the probability of an item (or set of items) given the presence of another item (or set
of items)

This may be used on a purchasing website where associated items may be shown to customers.
Packages in R: arules, arulesViz.
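The probability of one item given another can be computed directly from transaction counts. The sketch below uses plain Python rather than the R packages mentioned above; the baskets and the function names `support` and `confidence` are made up for illustration.

```python
# Made-up shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def count(itemset):
    """Number of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in baskets)

def support(itemset):
    """Share of all baskets that contain the itemset."""
    return count(itemset) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated probability of consequent given antecedent."""
    return count(antecedent | consequent) / count(antecedent)

print(support({"bread", "butter"}))       # 3 of 5 baskets
print(confidence({"bread"}, {"butter"}))  # 3 of the 4 bread baskets
```

A shopping site could use rules with high confidence to decide which associated items to show next to the one a customer is viewing.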

Regression Analysis:

Use many variables to predict one. An example is least squares regression (for inference, the usual
assumption is that the residuals follow a normal distribution).
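A least squares fit with several predictors might look like the following. This is a sketch assuming NumPy is available. The data are simulated with known coefficients (3, 2, and -1, chosen arbitrarily) so that the recovered estimates can be checked against them.

```python
import numpy as np

# Simulated data: two predictors and one outcome with small normal noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Add a column of ones so the model includes an intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least squares: choose coefficients that minimize the squared residuals.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
residuals = y - design @ coef

print(coef)             # intercept and two slopes
print(residuals.std())  # spread of what the model leaves unexplained
```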

Correlated predictors: multicollinearity arises when the predictor variables are associated with each
other.

Sequence Mining:

Sequence mining is like association analysis, but here the sequence/order of events matters. Examples
are recommendation engines (if a person does a and b, then they are likely to do c…)
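The "a and b, then c" idea can be sketched by counting which event follows each ordered pair of events. This is a toy illustration using only the standard library; the sessions and the function name `most_likely_next` are made up.

```python
from collections import Counter, defaultdict

# Made-up event sequences, one list per user session.
sessions = [
    ["a", "b", "c"],
    ["a", "b", "c", "d"],
    ["a", "b", "d"],
    ["b", "a", "c"],
]

# For every ordered pair of consecutive events, count what came next.
followers = defaultdict(Counter)
for events in sessions:
    for first, second, third in zip(events, events[1:], events[2:]):
        followers[(first, second)][third] += 1

def most_likely_next(first, second):
    """Return (event, estimated probability) or None if the pair is unseen."""
    counts = followers.get((first, second))
    if not counts:
        return None
    event, n = counts.most_common(1)[0]
    return event, n / sum(counts.values())

print(most_likely_next("a", "b"))  # "c" followed a, b in 2 of 3 cases
print(most_likely_next("b", "a"))  # order matters: counted separately
```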

Text Mining:

Unlike the other types, this works on unstructured data: instead of rows and columns of numeric data,
we have a blob of text. E.g.,

- Assessing authorship and voice
- Sentiment analysis for social media (figuring out whether people are saying good or bad things about
something without actually reading it)
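A very rough form of sentiment analysis counts positive and negative words. The word lists below are tiny, hypothetical lexicons invented for this sketch; real systems rely on much larger lexicons or trained models.

```python
import re

# Hypothetical mini-lexicons, for illustration only.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "poor"}

def sentiment_score(text):
    """Positive word count minus negative word count."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

posts = [
    "I love this phone, the camera is great",
    "Terrible battery and poor support",
    "It arrived on Tuesday",
]
for post in posts:
    print(sentiment_score(post), post)
```

A score above zero suggests a favorable post, below zero an unfavorable one, and zero means no lexicon word was found or the counts cancel out.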
