What Is... Data Mining?
Mauro Maggioni

Data collected from a variety of sources has been accumulating rapidly. Many fields of science have gone from being data-starved to being data-rich and needing to learn how to cope with large data sets. The rising tide of data also directly affects our daily lives, in which computers surrounding us use data-crunching algorithms to help us in tasks ranging from finding the quickest route to our destination considering current traffic conditions to automatically tagging our faces in pictures; from updating in near real time the prices of sale items to suggesting the next movie we might want to watch. The general aim of data mining is to find useful and interpretable patterns in data. The term can encompass many diverse methods and therefore means different things to different people. Here we discuss some aspects of data mining potentially of interest to a broad audience of mathematicians.

Assume a sample data point x_i (e.g., a picture) may be cast in the form of a long vector of numbers (e.g., the pixel intensities in an image): we represent it as a point in R^D. Two types of related goals exist. One is to detect patterns in this set of points, and the other is to predict a function on the data: given a training set (x_i, f(x_i))_i, we want to predict f at points outside the training set. In the case of text documents or webpages, we might want to automatically label each document as belonging to an area of research; in the case of pictures, we might want to recognize faces; when suggesting the next movie to watch given past ratings of movies by a viewer, f consists of ratings of unseen movies.
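As one concrete instance of the prediction task just described (the article does not prescribe any particular method; this is a hypothetical illustration), a k-nearest-neighbor predictor estimates f at a new point by averaging f over the closest training points:

```python
import numpy as np

def knn_predict(X_train, f_train, x, k=3):
    """Estimate f(x) as the average of f over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return f_train[nearest].mean()

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (200, 2))   # training points x_i in [0,1]^2
f = np.sin(X[:, 0]) + X[:, 1] ** 2    # a smooth f observed at the x_i
x_new = np.array([0.5, 0.5])
print(knn_predict(X, f, x_new, k=5))  # should be near sin(0.5) + 0.25
```

This crude averaging already exhibits the issues discussed below: its accuracy degrades rapidly as the dimension D grows unless the data has additional low-dimensional or smoothness structure.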
Mauro Maggioni is assistant professor of mathematics and computer science at Duke University. His email address is [email protected]. DOI: http://dx.doi.org/10.1090/noti831

Typically, x_i is noisy (e.g., noisy pixel values), and so is f(x_i) (e.g., mislabeled samples in the training set).

Of course mathematicians have long concerned themselves with high-dimensional problems. One example is studying solutions of PDEs as functions in infinite-dimensional function spaces and performing efficient computations by projecting the problem onto low-dimensional subspaces (via discretizations, finite elements, or operator compression) so that the reduced problem may be numerically solved on a computer. In the case of solutions of a PDE, the model for the data is specified: a lot of information about the PDE is known, and that information is exploited to predict the properties of the data and to construct low-dimensional projections. For the digital data discussed above, however, typically we have little information and poor models. We may start with crude models, measure their fitness to the data and predictive ability, and, if these are not satisfactory, improve the models. This is one of the key processes in statistical modeling and data mining. It is not unlike what an applied mathematician does when modeling a complex physical system: he may start with simplifying assumptions to construct a tractable model, derive consequences of such a model (e.g., properties of the solutions) analytically and/or with simulations, and compare the results to the properties exhibited by the real-world physical system. New measurements and real-world simulations may be performed, and the fitness of the model reassessed and improved as needed for the next round of validation. While physics drives the modeling in applied mathematics, a new type of intuition, built on experiences in the world of high-dimensional data sets rather than in the world of physics, drives the intuition of the

[Notices of the AMS, Volume 59, Number 4 (April 2012), p. 532]

mathematician set to analyze high-dimensional data sets, where tractable models are geometric or statistical models with a small number of parameters. One of the reasons for focusing on reducing the dimension is to enable computations, but a fundamental motivation is the so-called curse of dimensionality. One of its manifestations arises in the approximation of a 1-Lipschitz function on the unit cube, f : [0,1]^D → R satisfying |f(x) − f(y)| ≤ ‖x − y‖ for x, y ∈ [0,1]^D. To achieve uniform error ε, given samples (x_i, f(x_i)), in general one needs at least one sample in each cube of side ε, for a total of ε^{−D} samples, which is too large even for, say, ε = 10^{−1} and D = 100 (a rather small dimension in applications).

A common assumption is that either the samples x_i lie on a low-dimensional subset of [0,1]^D and/or f is not simply Lipschitz but has a smoothness that is suitably large, depending on D (see references in [3]). Taking the former route, one assumes that the data lies on a low-dimensional subset of the high-dimensional ambient space, such as a low-dimensional hyperplane or unions thereof, or low-dimensional manifolds or rougher sets. Research problems require ideas from different areas of mathematics, including geometry, geometric measure theory, topology, and graph theory, with their tools for studying manifolds or rougher sets; probability and geometric functional analysis for studying random samples and measures in high dimensions; harmonic analysis and approximation theory, with their ideas of multiscale analysis and function approximation; and numerical analysis, because we need efficient algorithms to analyze real-world data.

As a concrete example, consider the following construction. Given n points {x_i}_{i=1}^n ⊂ R^D and σ > 0, construct the weight matrix

    W_ij = exp(−‖x_i − x_j‖² / σ²).
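Returning briefly to the curse-of-dimensionality estimate above (one sample per cube of side ε, hence about ⌈1/ε⌉^D samples in total), the count can be checked directly:

```python
import math

# Curse of dimensionality: approximating a 1-Lipschitz f on [0,1]^D to
# uniform error eps requires roughly one sample per cube of side eps,
# i.e. about ceil(1/eps)**D samples in total.
def samples_needed(eps: float, D: int) -> int:
    """Number of side-eps cubes needed to tile the unit cube [0,1]^D."""
    return math.ceil(1.0 / eps) ** D

print(samples_needed(0.1, 2))    # 100 samples: feasible for D = 2
print(samples_needed(0.1, 100))  # 10^100 samples: hopeless for D = 100
```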



Let D_ii = Σ_j W_ij be the diagonal degree matrix, and let L = I − D^{−1/2} W D^{−1/2} be the Laplacian matrix on the weighted graph G with vertices {x_i} and edges weighted by W. When the x_i are sampled from a manifold M and n tends to infinity, L approximates (in a suitable sense) the Laplace–Beltrami operator on M [2], which is a completely intrinsic object. The random walk on G, with transition matrix P = D^{−1} W, approximates Brownian motion on M. Consider, for a time t > 0, the so-called diffusion distance d_t(x, y) := ‖P^t(x, ·) − P^t(y, ·)‖_{L²(G)} (see [2]). This distance is particularly useful for capturing clusters/groupings in the data, which are regions of fast diffusion connected by bottlenecks that slow diffusion. Let 1 = λ_0 ≥ λ_1 ≥ · · · be the eigenvalues of P and φ_i the corresponding eigenvectors (φ_0, when G is a web graph, is related to Google's PageRank). Consider a diffusion map Φ_d^t that embeds the graph in Euclidean space, where

    Φ_d^t(x) := (λ_1^t φ_1(x), . . . , λ_d^t φ_d(x)),

for some t > 0 [2]. One can show that the Euclidean distance between Φ_d^t(x) and Φ_d^t(y) approximates d_t(x, y), the diffusion distance at time scale t between x and y on the graph G.

Figure 1. Top: Diffusion map embedding of the set of configurations of a small biomolecule (alanine dipeptide) from its 36-dimensional state space. The color encodes one of the dihedral angles of the molecule, known to be essential to the dynamics [4]. This is a physical system where (approximate) equations of motion are known, but their structure is too complicated and the state space too high-dimensional to be amenable to analysis. Bottom: Diffusion map of a data set consisting of 1161 Science News articles, each modeled by a 1153-dimensional vector of word frequencies, embedded in a low-dimensional space with diffusion maps, as described in the text and in [2] (legend: Anthropology, Astronomy, Social Sciences, Earth Sciences, Biology, Mathematics, Medicine, Physics).
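The graph construction and diffusion map just described can be sketched numerically. The following toy example (two well-separated Gaussian clusters, with arbitrary illustrative choices of σ, t, and d not taken from the article) recovers the cluster structure in the first diffusion coordinate:

```python
# Toy diffusion map: affinities W, degrees D_ii, random walk P = D^{-1} W,
# spectral embedding. Data, sigma, t, and d are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, D = 60, 10
# Two Gaussian clusters in R^10, 30 points each.
X = np.vstack([rng.normal(0.0, 0.3, (n // 2, D)),
               rng.normal(2.0, 0.3, (n // 2, D))])

sigma = 2.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / sigma**2)         # W_ij = exp(-||x_i - x_j||^2 / sigma^2)
deg = W.sum(axis=1)                # degrees D_ii = sum_j W_ij

# D^{-1/2} W D^{-1/2} is symmetric and shares its eigenvalues with P = D^{-1} W,
# so we can diagonalize it with eigh and map eigenvectors back to those of P.
S = W / np.sqrt(np.outer(deg, deg))
lam, U = np.linalg.eigh(S)         # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]     # descending: lam[0] = 1
phi = U / np.sqrt(deg)[:, None]    # right eigenvectors of P

t, d = 3, 2
# Diffusion map: Phi(x) = (lam_1^t phi_1(x), ..., lam_d^t phi_d(x)),
# skipping the trivial constant eigenvector phi_0.
emb = (lam[1:d + 1] ** t) * phi[:, 1:d + 1]

# The sign of the first diffusion coordinate separates the two clusters:
# within each cluster diffusion is fast, across the gap it is slow.
```

Euclidean distances between rows of `emb` then approximate the diffusion distance d_t at the chosen time scale, which is what makes this embedding useful for clustering.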



In Figure 1 we apply this technique to two completely different data sets. The first one is a set of configurations of a small peptide, obtained by a molecular dynamics simulation: a point x_i ∈ R^36 contains the coordinates in R^3 of the 12 atoms in the alanine dipeptide molecule (represented as an inset in Figure 1). The forces between the atoms in the molecule constrain the trajectories to lie close to low-dimensional sets in the 36-dimensional state space. In Figure 1 we apply the construction above¹ and represent the diffusion map embedding of the configurations collected [4]. The second one is a set of text documents (articles from Science News), each represented as a vector in R^1153 whose kth coordinate is the frequency of the kth word in a 1153-word dictionary. The diffusion embedding in low dimensions reveals even lower-dimensional geometric structures, which turn out to be useful for understanding the dynamics of the peptide considered in the first data set and for automatically clustering documents by topic in the case of the second data set. Ideas from probability (random samples), harmonic analysis (Laplacian), and geometry (manifolds) come together in these types of constructions.

This is only the beginning of one of many research avenues explored in the last few years. Many other exciting opportunities exist, for example the study of stochastic dynamic networks, where a sample is a network and multiple samples are collected in time: quantifying and modeling change requires introducing sensible and robust metrics between graphs. Further reading: [5, 3, 1] and the references therein.
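The word-frequency representation used above for the Science News articles can be illustrated with a toy dictionary (the five words and the document below are made up for this sketch; the article's actual dictionary has 1153 words):

```python
from collections import Counter

# Hypothetical 5-word dictionary standing in for the 1153-word one.
dictionary = ["galaxy", "neuron", "protein", "theorem", "orbit"]

def to_vector(text: str) -> list[float]:
    """Represent a document as frequencies of the dictionary words."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in dictionary]

doc = "the galaxy has a distant orbit and the orbit decays"
print(to_vector(doc))  # [0.1, 0.0, 0.0, 0.0, 0.2]
```

Each document thus becomes a point in R^D (here D = 5; in the article D = 1153), to which the graph and diffusion-map construction can be applied directly.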


References

[1] Science: Special issue: Dealing with data, February 2011, pp. 639–806.
[2] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, Proc. Natl. Acad. Sci. USA 102 (2005), no. 21, 7426–7431.
[3] D. Donoho, High-dimensional data analysis: The curses and blessings of dimensionality, Math Challenges of the 21st Century, AMS, 2000.
[4] M. A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi, Determination of reaction coordinates via locally scaled diffusion map, J. Chem. Phys. 134 (2011), 124116.
[5] J. W. Tukey, The future of data analysis, Ann. Math. Statist. 33 (1962), no. 1, 1–67.

¹We use here a slightly different definition of the weight matrix W, which uses distances between molecular configurations up to rigid affine transformations, instead of Euclidean distances.

