
CSE6242 400 AnalyticsConcepts

CSE6242/CX4242 at Georgia Tech focuses on Data and Visual Analytics, covering eight key data analytics concepts including classification, regression, similarity matching, clustering, co-occurrence grouping, profiling, link prediction, and data reduction. The course emphasizes practical applications and encourages students to think about real-world problems they want to solve using large datasets and appropriate techniques. It is free for Georgia Tech students and is partly based on materials from various experts in the field.


http://poloclub.gatech.edu/cse6242

CSE6242/CX4242: Data & Visual Analytics

Data Analytics Concepts


Duen Horng (Polo) Chau
Professor, College of Computing
Associate Director, MS Analytics
Georgia Tech

Partly based on materials by Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
8 concepts (non-mutually exclusive classes)
Free for GT students
1. Classification
(or Probability Estimation)

Predict which of a (small) set of classes an entity belongs to.



•email spam (y, n)
•sentiment analysis (+, -, neutral)
•news (politics, sports, …)
•medical diagnosis (cancer or not)
•shirt size (s, m, l)
•cat detection
•face detection (baby, middle-aged, etc.)
•buy / not buy (commerce)
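As a minimal sketch of classification with probability estimation, the shirt-size example can be written as a k-nearest-neighbors vote. The training data, the feature choice (height, weight), and the function names are hypothetical, chosen only for illustration; the slides do not prescribe an algorithm.

```python
import math
from collections import Counter

# Hypothetical training data: (height_cm, weight_kg) -> shirt size.
TRAIN = [
    ((160, 55), "s"), ((165, 60), "s"),
    ((172, 70), "m"), ((175, 74), "m"),
    ((183, 88), "l"), ((188, 95), "l"),
]

def nearest_labels(point, k):
    """Labels of the k training points closest to `point` (Euclidean distance)."""
    ranked = sorted(TRAIN, key=lambda item: math.dist(item[0], point))
    return [label for _, label in ranked[:k]]

def class_probabilities(point, k=3):
    """Estimate class probabilities as the share of votes among k neighbors."""
    votes = Counter(nearest_labels(point, k))
    return {label: count / k for label, count in votes.items()}

def classify(point, k=3):
    """Predict the single most probable class."""
    probs = class_probabilities(point, k)
    return max(probs, key=probs.get)

print(classify((174, 72)))             # predicted size
print(class_probabilities((174, 72)))  # vote shares per size
```

The same function answers both questions on the slide: `classify` returns one class, while `class_probabilities` returns an estimate of how likely each class is.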
2. Regression (“value estimation”)
Predict the numerical value of some variable for
an entity.

•point value of wine (50-100)
•credit score
•stock prices
•relationship between price and sales
•weather
•sports and game scores
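The "relationship between price and sales" example can be sketched as a one-variable least-squares fit. The numbers below are made up for illustration, and ordinary least squares is only one of many regression methods; the slides do not name a specific one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: unit price vs. units sold.
prices = [1.0, 2.0, 3.0, 4.0]
sales = [90.0, 80.0, 70.0, 60.0]

slope, intercept = fit_line(prices, sales)
predicted_sales = slope * 2.5 + intercept  # numeric prediction for a new price
print(slope, intercept, predicted_sales)
```

Unlike classification, the output is a number on a continuous scale rather than one of a small set of classes.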
3. Similarity Matching
Find similar entities (from a large dataset)
based on what we know about them.


•find similar gene sequences (that may be repeating, or do similar things)
•online dating
•patent search
•carpool matching (find people to carpool)
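One way to make "similar" concrete is a set-overlap score. The sketch below ranks profiles by Jaccard similarity of their interests; the profile data and function names are hypothetical, and Jaccard is just one possible similarity measure (the slides do not specify one).

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical profiles: person -> set of interests.
PROFILES = {
    "ana": {"hiking", "jazz", "cooking"},
    "ben": {"hiking", "jazz", "chess"},
    "cy": {"football", "cooking"},
    "dee": {"opera", "chess"},
}

def most_similar(name, k=2):
    """Return the k profiles most similar to `name`, best match first."""
    scores = [
        (other, jaccard(PROFILES[name], interests))
        for other, interests in PROFILES.items()
        if other != name
    ]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]

print(most_similar("ana"))
```

On a large dataset, scoring every pair this way becomes expensive, which is why similarity matching is usually paired with indexing or approximate search.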
4. Clustering (unsupervised learning)
Group entities together by their similarity.
(For most algorithms, user provides # of clusters)

•groupings of similar bugs in code
•topical analysis (tweets?)
•land cover: tree/road/…
•for advertising: grouping users for marketing purposes
•cluster people by accents (y’all, you all)
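The note that the user provides the number of clusters can be seen in a small k-means sketch, where `k` is an explicit argument. This is an illustrative implementation with made-up points and a naive deterministic initialization (the first `k` points); it is an assumption that k-means is a representative algorithm here, since the slides do not name one.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2D points forming two visible groups.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(POINTS, k=2)
print(centroids)
print(clusters)
```

No labels are supplied anywhere, which is what makes this unsupervised: the groups come only from distances between the points.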
5. Co-occurrence grouping
(Many names: frequent itemset mining, association rule
discovery, market-basket analysis)

Find associations between entities based on transactions that involve them
(e.g., bread and milk often bought together)

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
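The bread-and-milk example can be sketched by counting how often item pairs appear in the same basket. The baskets are invented, and the terms support and confidence are standard in market-basket analysis rather than taken from the slides; a real frequent-itemset miner would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each basket is a set of purchased items.
BASKETS = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
]

def frequent_pairs(baskets, min_support=0.5):
    """Item pairs whose share of baskets (support) is at least min_support."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

def confidence(baskets, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share also containing `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

print(frequent_pairs(BASKETS))
print(confidence(BASKETS, "bread", "milk"))
```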
6. Profiling / Pattern Mining /
Anomaly Detection (unsupervised)
Characterize typical behaviors of an entity (person,
computer router, etc.) so you can find trends and outliers.

• Google sign-in alert
• Computer instruction prediction
• Removing noisy data (data cleaning)
• Detect anomalies in network traffic
• Moneyball
• Smart security camera
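A minimal sketch of the profile-then-flag idea, using the network traffic example: summarize typical behavior as a mean and standard deviation, then flag values far from that profile. The traffic numbers and the threshold of 3 standard deviations are assumptions for illustration; real systems use richer profiles.

```python
import statistics

def build_profile(history):
    """Summarize typical behavior as (mean, sample standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def is_anomaly(value, profile, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean, std = profile
    if std == 0:
        return value != mean
    return abs(value - mean) / std > threshold

# Hypothetical baseline: requests per minute seen on a router.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
profile = build_profile(baseline)

for observed in (104, 107, 250):
    print(observed, is_anomaly(observed, profile))
```

The profile is learned from unlabeled history, so no one has to mark past events as normal or abnormal in advance.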
7. Link Prediction / Recommendation
Predict if two entities should be connected, and how
strongly that link should be.
LinkedIn/Facebook: people you may know
Amazon/Netflix/Pandora: because you like Terminator… suggest other movies you may also like
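The "people you may know" case can be sketched with a common-neighbors score: two unconnected people get a predicted link whose strength is the number of friends they share. The graph and names are hypothetical, and common neighbors is one simple heuristic, not necessarily what any of the named services use.

```python
# Hypothetical undirected friendship graph: person -> set of friends.
GRAPH = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "eve"},
    "dan": {"ann", "bob"},
    "eve": {"cat"},
}

def suggest(user, graph):
    """Rank non-friends of `user` by number of shared friends (link strength)."""
    scores = {}
    for other, friends in graph.items():
        if other == user or other in graph[user]:
            continue  # skip self and existing links
        common = len(graph[user] & friends)
        if common:
            scores[other] = common
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(suggest("dan", GRAPH))
print(suggest("eve", GRAPH))
```

The score covers both halves of the definition: whether a link is predicted at all (score above zero) and how strong it should be (the count itself).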
8. Data reduction (“dimensionality reduction”)
Shrink a large dataset into a smaller one, with as
little loss of information as possible
1. if you want to visualize the data (in 2D/3D)
Most popular: UMAP, t-SNE
2. faster computation/less storage
3. reduce noise
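As a rough illustration of shrinking columns while keeping most of the information, the sketch below projects 3-column rows onto their first principal component. Note that this swaps in PCA (computed by power iteration) because it fits in a few lines of standard-library Python; it is not UMAP or t-SNE, which need dedicated libraries. The data and function names are made up, and the starting vector is assumed not to be orthogonal to the main direction.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (column means, unit vector of greatest variance) via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d  # assumed not orthogonal to the top eigenvector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, v):
    """Map each row to a single coordinate along direction v."""
    return [sum((r[j] - means[j]) * v[j] for j in range(len(v))) for r in rows]

# Hypothetical rows: second column is twice the first, third carries no signal.
ROWS = [(1.0, 2.0, 0.0), (2.0, 4.0, 0.0), (3.0, 6.0, 0.0), (4.0, 8.0, 0.0)]
means, direction = first_principal_component(ROWS)
reduced = project(ROWS, means, direction)
print(direction)
print(reduced)  # one number per row instead of three
```

Because the three columns here are perfectly redundant, one coordinate per row preserves the ordering and spacing of the original data, which is the "little loss of information" the slide refers to.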
Start Thinking About Project!

• What problems do you want to solve?


• Using what large, real datasets?
• What techniques do you need?
