CSE6242 400 AnalyticsConcepts
edu/cse6242
Partly based on materials by Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
8 non-mutually exclusive
concept classes
Free for GT students
1. Classification
(or Probability Estimation)
2. Regression (“value estimation”)
Predict the numerical value of some variable for an
entity.
•point value of wine (50-100)
•credit score
•stock prices
•relationship between price and sales
•weather
•sports and game scores
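
A minimal sketch of value estimation for the price/sales example above, assuming a straight-line relationship fitted by ordinary least squares. The data points and the names fit_line and predict are made up for illustration.

```python
# Hypothetical (price, units sold) observations, invented for this sketch.
prices = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [9.0, 7.0, 5.0, 3.0, 1.0]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance sums (the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(prices, sales)


def predict(price):
    """Estimate the numerical target (sales) for a new entity (price)."""
    return slope * price + intercept
```

The output is a number, not a class label, which is what separates regression from classification.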
3. Similarity Matching
Find similar entities (from a large dataset)
based on what we know about them.
•online dating
•patent search
•carpool matching (find people to carpool)
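
One possible way to sketch similarity matching: describe each entity as a numeric feature vector and rank candidates by cosine similarity to a query. Cosine similarity is only one of many measures; the profiles, feature meanings, and function names below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, entities, k=1):
    """Return the names of the k entities most similar to the query."""
    ranked = sorted(
        entities,
        key=lambda name: cosine_similarity(query, entities[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical profiles: [likes hiking, likes movies, likes cooking]
profiles = {
    "ana": [5.0, 1.0, 0.0],
    "ben": [0.0, 4.0, 5.0],
    "cy": [4.0, 2.0, 1.0],
}
best = most_similar([5.0, 1.0, 1.0], profiles, k=2)
```

Sorting every entity is fine for a toy dataset; for a large dataset an index or approximate nearest-neighbor search would usually replace the full sort.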
4. Clustering (unsupervised learning)
Group entities together by their similarity.
(For most algorithms, user provides # of clusters)
•groupings of similar bugs in code
•topical analysis (tweets?)
•land cover: tree/road/…
•for advertising: grouping users for marketing
purposes
•cluster people by accents (y’all, you all)
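
A rough sketch of one common clustering algorithm, k-means, where the user supplies the number of clusters k. To keep the sketch deterministic it seeds the centroids with the first k points, which is an assumption made here; real implementations usually pick starting centroids more carefully. The points are made up.

```python
def kmeans(points, k, iterations=20):
    """Plain k-means: alternate assignment and centroid-update steps."""
    # Assumption for this sketch: seed centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            assignments[i] = dists.index(min(dists))
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Hypothetical 2D points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
labels, centers = kmeans(points, k=2)
```

No labels are given to the algorithm, which is why clustering counts as unsupervised learning.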
5. Co-occurrence grouping
(Many names: frequent itemset mining, association rule
discovery, market-basket analysis)
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
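
A small sketch of the simplest form of market-basket analysis: count how often each pair of items appears in the same basket and keep the pairs whose support (fraction of baskets containing both) meets a threshold. The baskets, the threshold, and the name frequent_pairs are invented for illustration; full frequent itemset miners also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {item pair: support} for pairs at or above min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort and deduplicate so each pair has one canonical form per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Hypothetical purchase baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "beer"],
]
pairs = frequent_pairs(baskets, min_support=0.75)
```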
6. Profiling / Pattern Mining /
Anomaly Detection (unsupervised)
Characterize typical behaviors of an entity (person,
computer router, etc.) so you can find trends and outliers.
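
A minimal sketch of outlier detection, assuming the "typical behavior" of an entity can be summarized by the mean and standard deviation of a single measurement. Values more than a chosen number of standard deviations from the mean are flagged. The traffic numbers and the threshold of 2.0 are made up.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return (index, value) pairs whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # every value is identical, so nothing stands out
    return [
        (i, v) for i, v in enumerate(values) if abs(v - mean) / stdev > threshold
    ]


# Hypothetical daily request counts seen by one router.
traffic = [100, 102, 98, 101, 99, 100, 103, 97, 100, 250]
outliers = find_outliers(traffic)
```

An extreme value inflates the mean and standard deviation it is judged against, so robust statistics such as the median are often preferred in practice.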
8. Data reduction (“dimensionality reduction”)
Shrink a large dataset into a smaller one, with as
little loss of information as possible
1. if you want to visualize the data (in 2D/3D)
Most popular: UMAP, t-SNE
2. faster computation/less storage
3. reduce noise
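
A rough sketch of the idea using principal component analysis (PCA), a simpler linear method than the UMAP and t-SNE named above, chosen here because it fits in a few lines without extra libraries. It projects each row onto the single direction of greatest variance, found by power iteration. The data and function name are hypothetical, and the sketch assumes the data is not constant.

```python
import math


def first_principal_component(rows, iterations=100):
    """Project rows onto their direction of greatest variance."""
    n = len(rows)
    dim = len(rows[0])
    # Center the data so the covariance is measured around the mean.
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Each row is reduced from dim numbers to one coordinate along v.
    projected = [sum(c[j] * v[j] for j in range(dim)) for c in centered]
    return v, projected


# Hypothetical 2D data lying on the line y = 2x.
rows = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, reduced = first_principal_component(rows)
```

Here two columns collapse into one with no information lost, because the second column is fully determined by the first.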
Start Thinking About Project!