What Is... Data Mining?
Mauro Maggioni

Data collected from a variety of sources has been accumulating rapidly. Many fields of science have gone from being data-starved to being data-rich and needing to learn how to cope with large data sets. The rising tide of data also directly affects our daily lives, in which computers surrounding us use data-crunching algorithms to help us in tasks ranging from finding the quickest route to our destination considering current traffic conditions to automatically tagging our faces in pictures; from updating in near real time the prices of sale items to suggesting the next movie we might want to watch. The general aim of data mining is to find useful and interpretable patterns in data. The term can encompass many diverse methods and therefore means different things to different people. Here we discuss some aspects of data mining potentially of interest to a broad audience of mathematicians.

Assume a sample data point x_i (e.g., a picture) may be cast in the form of a long vector of numbers (e.g., the pixel intensities in an image): we represent it as a point in R^D. Two types of related goals exist. One is to detect patterns in this set of points, and the other is to predict a function on the data: given a training set (x_i, f(x_i))_i, we want to predict f at points outside the training set. In the case of text documents or webpages, we might want to automatically label each document as belonging to an area of research; in the case of pictures, we might want to recognize faces; when suggesting the next movie to watch given past ratings of movies by a viewer, f consists of ratings of unseen movies.
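As one concrete instance of the prediction task just described (the article does not prescribe any particular method; this is a hypothetical illustration), a k-nearest-neighbor predictor estimates f at a new point by averaging f over the closest training points:

```python
import numpy as np

def knn_predict(X_train, f_train, x, k=3):
    """Estimate f(x) as the average of f over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return f_train[nearest].mean()

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, (200, 2))   # training points x_i in [0,1]^2
f = np.sin(X[:, 0]) + X[:, 1] ** 2    # a smooth f observed at the x_i
x_new = np.array([0.5, 0.5])
print(knn_predict(X, f, x_new, k=5))  # should be near sin(0.5) + 0.25
```

This crude averaging already exhibits the issues discussed below: its accuracy degrades rapidly as the dimension D grows unless the data has additional low-dimensional or smoothness structure.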
Mauro Maggioni is assistant professor of mathematics and computer science at Duke University. His email address is [email protected]. DOI: http://dx.doi.org/10.1090/noti831

Typically, x_i is noisy (e.g., noisy pixel values), and so is f(x_i) (e.g., mislabeled samples in the training set).

Of course mathematicians have long concerned themselves with high-dimensional problems. One example is studying solutions of PDEs as functions in infinite-dimensional function spaces and performing efficient computations by projecting the problem onto low-dimensional subspaces (via discretizations, finite elements, or operator compression) so that the reduced problem may be numerically solved on a computer. In the case of solutions of a PDE, the model for the data is specified: a lot of information about the PDE is known, and that information is exploited to predict the properties of the data and to construct low-dimensional projections. For the digital data discussed above, however, typically we have little information and poor models. We may start with crude models, measure their fitness to the data and predictive ability, and, if these are not satisfactory, improve the models. This is one of the key processes in statistical modeling and data mining. It is not unlike what an applied mathematician does when modeling a complex physical system: he may start with simplifying assumptions to construct a tractable model, derive consequences of such a model (e.g., properties of the solutions) analytically and/or with simulations, and compare the results to the properties exhibited by the real-world physical system. New measurements and real-world simulations may be performed, and the fitness of the model reassessed and improved as needed for the next round of validation. While physics drives the modeling in applied mathematics, a new type of intuition, built on experiences in the world of high-dimensional data sets rather than in the world of physics, drives the intuition of the

[Notices of the AMS, Volume 59, Number 4 (April 2012), p. 532]

mathematician set to analyze high-dimensional data sets, where tractable models are geometric or statistical models with a small number of parameters. One of the reasons for focusing on reducing the dimension is to enable computations, but a fundamental motivation is the so-called curse of dimensionality. One of its manifestations arises in the approximation of a 1-Lipschitz function on the unit cube, f : [0,1]^D → R satisfying |f(x) − f(y)| ≤ ‖x − y‖ for x, y ∈ [0,1]^D. To achieve uniform error ε, given samples (x_i, f(x_i)), in general one needs at least one sample in each cube of side ε, for a total of ε^{−D} samples, which is too large even for, say, ε = 10^{−1} and D = 100 (a rather small dimension in applications).

A common assumption is that either the samples x_i lie on a low-dimensional subset of [0,1]^D and/or f is not simply Lipschitz but has a smoothness that is suitably large, depending on D (see references in [3]). Taking the former route, one assumes that the data lies on a low-dimensional subset of the high-dimensional ambient space, such as a low-dimensional hyperplane or unions thereof, or low-dimensional manifolds or rougher sets. Research problems require ideas from different areas of mathematics, including geometry, geometric measure theory, topology, and graph theory, with their tools for studying manifolds or rougher sets; probability and geometric functional analysis for studying random samples and measures in high dimensions; harmonic analysis and approximation theory, with their ideas of multiscale analysis and function approximation; and numerical analysis, because we need efficient algorithms to analyze real-world data.

As a concrete example, consider the following construction. Given n points {x_i}_{i=1}^n ⊂ R^D and σ > 0, construct the weight matrix

    W_ij = exp(−‖x_i − x_j‖² / σ²).
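Returning briefly to the curse-of-dimensionality estimate above (one sample per cube of side ε, hence about ⌈1/ε⌉^D samples in total), the count can be checked directly:

```python
import math

# Curse of dimensionality: approximating a 1-Lipschitz f on [0,1]^D to
# uniform error eps requires roughly one sample per cube of side eps,
# i.e. about ceil(1/eps)**D samples in total.
def samples_needed(eps: float, D: int) -> int:
    """Number of side-eps cubes needed to tile the unit cube [0,1]^D."""
    return math.ceil(1.0 / eps) ** D

print(samples_needed(0.1, 2))    # 100 samples: feasible for D = 2
print(samples_needed(0.1, 100))  # 10^100 samples: hopeless for D = 100
```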



Let D_ii = Σ_j W_ij be the diagonal degree matrix, and let L = I − D^{−1/2} W D^{−1/2} be the Laplacian matrix on the weighted graph G with vertices {x_i} and edges weighted by W. When the x_i are sampled from a manifold M and n tends to infinity, L approximates (in a suitable sense) the Laplace–Beltrami operator on M [2], which is a completely intrinsic object. The random walk on G, with transition matrix P = D^{−1} W, approximates Brownian motion on M. Consider, for a time t > 0, the so-called diffusion distance d_t(x, y) := ‖P^t(x, ·) − P^t(y, ·)‖_{L²(G)} (see [2]). This distance is particularly useful for capturing clusters/groupings in the data, which are regions of fast diffusion connected by bottlenecks that slow diffusion. Let 1 = λ_0 ≥ λ_1 ≥ · · · be the eigenvalues of P and φ_i the corresponding eigenvectors (φ_0, when G is a web graph, is related to Google's PageRank). Consider a diffusion map Φ_d^t that embeds the graph in Euclidean space, where

    Φ_d^t(x) := (λ_1^t φ_1(x), . . . , λ_d^t φ_d(x)),

for some t > 0 [2]. One can show that the Euclidean distance between Φ_d^t(x) and Φ_d^t(y) approximates d_t(x, y), the diffusion distance at time scale t between x and y on the graph G.

Figure 1. Top: Diffusion map embedding of the set of configurations of a small biomolecule (alanine dipeptide) from its 36-dimensional state space. The color encodes one of the dihedral angles of the molecule, known to be essential to the dynamics [4]. This is a physical system where (approximate) equations of motion are known, but their structure is too complicated and the state space too high-dimensional to be amenable to analysis. Bottom: Diffusion map of a data set consisting of 1161 Science News articles, each modeled by a 1153-dimensional vector of word frequencies, embedded in a low-dimensional space with diffusion maps, as described in the text and in [2] (legend: Anthropology, Astronomy, Social Sciences, Earth Sciences, Biology, Mathematics, Medicine, Physics).
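The graph construction and diffusion map just described can be sketched numerically. The following toy example (two well-separated Gaussian clusters, with arbitrary illustrative choices of σ, t, and d not taken from the article) recovers the cluster structure in the first diffusion coordinate:

```python
# Toy diffusion map: affinities W, degrees D_ii, random walk P = D^{-1} W,
# spectral embedding. Data, sigma, t, and d are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, D = 60, 10
# Two Gaussian clusters in R^10, 30 points each.
X = np.vstack([rng.normal(0.0, 0.3, (n // 2, D)),
               rng.normal(2.0, 0.3, (n // 2, D))])

sigma = 2.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq / sigma**2)         # W_ij = exp(-||x_i - x_j||^2 / sigma^2)
deg = W.sum(axis=1)                # degrees D_ii = sum_j W_ij

# D^{-1/2} W D^{-1/2} is symmetric and shares its eigenvalues with P = D^{-1} W,
# so we can diagonalize it with eigh and map eigenvectors back to those of P.
S = W / np.sqrt(np.outer(deg, deg))
lam, U = np.linalg.eigh(S)         # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]     # descending: lam[0] = 1
phi = U / np.sqrt(deg)[:, None]    # right eigenvectors of P

t, d = 3, 2
# Diffusion map: Phi(x) = (lam_1^t phi_1(x), ..., lam_d^t phi_d(x)),
# skipping the trivial constant eigenvector phi_0.
emb = (lam[1:d + 1] ** t) * phi[:, 1:d + 1]

# The sign of the first diffusion coordinate separates the two clusters:
# within each cluster diffusion is fast, across the gap it is slow.
```

Euclidean distances between rows of `emb` then approximate the diffusion distance d_t at the chosen time scale, which is what makes this embedding useful for clustering.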



In Figure 1 we apply this technique to two completely different data sets. The first one is a set of configurations of a small peptide, obtained by a molecular dynamics simulation: a point x_i ∈ R^36 contains the coordinates in R^3 of the 12 atoms in the alanine dipeptide molecule (represented as an inset in Figure 1). The forces between the atoms in the molecule constrain the trajectories to lie close to low-dimensional sets in the 36-dimensional state space. In Figure 1 we apply the construction above¹ and represent the diffusion map embedding of the configurations collected [4]. The second one is a set of text documents (articles from Science News), each represented as a vector in R^1153 whose kth coordinate is the frequency of the kth word in a 1153-word dictionary. The diffusion embedding in low dimensions reveals even lower-dimensional geometric structures, which turn out to be useful for understanding the dynamics of the peptide considered in the first data set and for automatically clustering documents by topic in the case of the second data set. Ideas from probability (random samples), harmonic analysis (Laplacian), and geometry (manifolds) come together in these types of constructions.

This is only the beginning of one of many research avenues explored in the last few years. Many other exciting opportunities exist, for example the study of stochastic dynamic networks, where a sample is a network and multiple samples are collected in time: quantifying and modeling change requires introducing sensible and robust metrics between graphs. Further reading: [5, 3, 1] and the references therein.
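The word-frequency representation used above for the Science News articles can be illustrated with a toy dictionary (the five words and the document below are made up for this sketch; the article's actual dictionary has 1153 words):

```python
from collections import Counter

# Hypothetical 5-word dictionary standing in for the 1153-word one.
dictionary = ["galaxy", "neuron", "protein", "theorem", "orbit"]

def to_vector(text: str) -> list[float]:
    """Represent a document as frequencies of the dictionary words."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in dictionary]

doc = "the galaxy has a distant orbit and the orbit decays"
print(to_vector(doc))  # [0.1, 0.0, 0.0, 0.0, 0.2]
```

Each document thus becomes a point in R^D (here D = 5; in the article D = 1153), to which the graph and diffusion-map construction can be applied directly.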


References

[1] Science: Special issue: Dealing with data, February 2011, pp. 639–806.
[2] R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, Proc. Natl. Acad. Sci. USA 102 (2005), no. 21, 7426–7431.
[3] D. Donoho, High-dimensional data analysis: The curses and blessings of dimensionality, Math Challenges of the 21st Century, AMS, 2000.
[4] M. A. Rohrdanz, W. Zheng, M. Maggioni, and C. Clementi, Determination of reaction coordinates via locally scaled diffusion map, J. Chem. Phys. 134 (2011), 124116.
[5] J. W. Tukey, The future of data analysis, Ann. Math. Statist. 33 (1962), no. 1, 1–67.

¹We use here a slightly different definition of the weight matrix W, which uses distances between molecular configurations up to rigid affine transformations, instead of Euclidean distances.

