k-nearest neighbor algorithm
Algorithm
The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the classification phase, k is a user-defined constant, and an unlabelled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point. Usually Euclidean distance is used as the distance metric; however, this is only applicable to continuous variables. In cases such as text classification, another metric such as the overlap metric (or Hamming distance) can be used. Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood Components Analysis.
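To make the two phases concrete, the following is a minimal Python sketch (assuming NumPy is available; the function and variable names are illustrative, not from any particular library): the "training" step simply stores the data, and classification computes Euclidean distances and takes a majority vote among the k closest stored vectors.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Classify one query point by majority vote among its k nearest
    training points under Euclidean distance."""
    # Euclidean distance from the query to every stored training vector.
    dists = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training samples.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those neighbors.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two classes in a 2-D feature space.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [4.0, 4.2], [4.1, 3.9]])
y = np.array(["blue", "blue", "blue", "red", "red"])
print(knn_classify(X, y, np.array([1.1, 1.0]), k=3))  # -> "blue"
```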
A drawback of the basic "majority voting" classification is that classes with more frequent examples tend to dominate the prediction of the new vector, since they tend to appear among the k nearest neighbors simply because of their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. k-NN is a special case of a variable-bandwidth, kernel density "balloon" estimator with a uniform kernel.[2][3]
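A common form of such distance weighting is to let each neighbor vote with weight 1/d, where d is its distance to the query point. A sketch under that assumption (again plain NumPy; the names and the 1/d choice are illustrative, not prescribed by the text):

```python
import numpy as np

def weighted_knn_classify(train_X, train_y, query, k=5, eps=1e-12):
    """Distance-weighted k-NN: each of the k nearest neighbors votes
    with weight 1 / distance instead of a flat vote of 1."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = {}
    for i in nearest:
        # eps guards against division by zero when the query point
        # coincides exactly with a training point.
        votes[train_y[i]] = votes.get(train_y[i], 0.0) + 1.0 / (dists[i] + eps)
    # Return the class with the largest total weight.
    return max(votes, key=votes.get)
```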
Figure: Example of k-NN classification. The test sample (green circle) should be classified either to the first class of blue squares or to the second class of red triangles. If k = 3 it is classified to the second class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is classified to the first class (3 squares vs. 2 triangles inside the outer circle).
Parameter selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm. The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular approach is the use of evolutionary algorithms to optimize feature scaling.[4] Another popular approach is to scale features by the mutual information of the training data with the training classes. In binary (two-class) classification problems, it is helpful to choose k to be an odd number, as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is the bootstrap method.[5]
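As one illustration of selecting k by cross-validation, the sketch below scores a range of candidate values with 5-fold cross-validation. It assumes scikit-learn and its bundled Iris dataset are available and is meant as an example procedure, not the method prescribed by the cited work:

```python
# Illustrative only: choose k by 5-fold cross-validation.
# Assumes scikit-learn is installed; the Iris data stands in for "the data".
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = {}
for k in range(1, 26):                        # candidate neighborhood sizes
    clf = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(clf, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("best k:", best_k, "CV accuracy:", round(scores[best_k], 3))
```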
Properties
The naive version of the algorithm is easy to implement by computing the distances from the test sample to all stored vectors, but it is computationally intensive, especially when the size of the training set grows. Many nearest neighbor search algorithms have been proposed over the years; these generally seek to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes k-NN computationally tractable even for large data sets. The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data).[6] k-nearest neighbor is guaranteed to approach the Bayes error rate, for some value of k (where k increases as a function of the number of data points). Various improvements to k-nearest neighbor methods are possible by using proximity graphs.[7]
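For example, replacing the brute-force distance scan with a spatial index such as a k-d tree reduces the number of distance evaluations per query. A sketch assuming SciPy's cKDTree, which is just one of many possible nearest neighbor search structures:

```python
# Sketch: use a k-d tree index instead of scanning all stored vectors.
# Assumes SciPy and NumPy are available; data here is random and illustrative.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
train_X = rng.random((100_000, 3))            # stored training vectors
train_y = rng.integers(0, 2, size=100_000)    # their class labels

tree = cKDTree(train_X)                       # built once, at "training" time
query = np.array([0.5, 0.5, 0.5])
dists, idx = tree.query(query, k=5)           # k nearest neighbors of the query
predicted = np.bincount(train_y[idx]).argmax()  # majority vote over their labels
print(predicted)
```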
For many datasets, the empirically optimal k is 10 or more [8], which produces much better results than 1-NN. Results can also be improved significantly by using a distance-weighted k-NN, in which the class (or, in regression problems, the value) of each of the k nearest points is weighted in proportion to the inverse of its distance from the point whose class is to be predicted.
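A sketch of this inverse-distance weighting in the regression setting, using k = 10 in line with the rule of thumb above (illustrative NumPy code, not from the cited source):

```python
import numpy as np

def weighted_knn_regress(train_X, train_y, query, k=10, eps=1e-12):
    """Predict a continuous value as the inverse-distance-weighted
    average of the k nearest training targets."""
    dists = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(dists)[:k]
    w = 1.0 / (dists[nearest] + eps)   # closer neighbors weigh more
    return float(np.sum(w * train_y[nearest]) / np.sum(w))

# Toy example: noisy samples of y = 2x on the unit interval.
rng = np.random.default_rng(1)
X = rng.random((200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.05, size=200)
print(weighted_knn_regress(X, y, np.array([0.4])))  # approximately 0.8
```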
References
[1] Bremner D, Demaine E, Erickson J, Iacono J, Langerman S, Morin P, Toussaint G (2005). "Output-sensitive algorithms for computing nearest-neighbor decision boundaries". Discrete and Computational Geometry 33 (4): 593–604. doi:10.1007/s00454-004-1152-0.
[2] G. R. Terrell; D. W. Scott (1992). "Variable kernel density estimation". Annals of Statistics 20: 1236–1265. doi:10.1214/aos/1176348768.
[3] Mills, Peter. "Efficient statistical classification of satellite measurements". International Journal of Remote Sensing.
[4] Nigsch, F.; A. Bender, B. van Buuren, J. Tissen, E. Nigsch & J. B. O. Mitchell (2006). "Melting Point Prediction Employing k-nearest Neighbor Algorithms and Genetic Parameter Optimization". Journal of Chemical Information and Modeling 46 (6): 2412–2422. doi:10.1021/ci060149f. PMID 17125183.
[5] P. Hall; B. U. Park; R. J. Samworth (2008). "Choice of neighbor order in nearest-neighbor classification". Annals of Statistics 36: 2135–2152. doi:10.1214/07-AOS537.
[6] Cover TM, Hart PE (1967). "Nearest neighbor pattern classification". IEEE Transactions on Information Theory 13 (1): 21–27. doi:10.1109/TIT.1967.1053964.
[7] Toussaint GT (April 2005). "Geometric proximity graphs for improving nearest neighbor methods in instance-based learning and data mining". International Journal of Computational Geometry and Applications 15 (2): 101–150. doi:10.1142/S0218195905001622.
[8] Franco-Lopez et al., 2001
Further reading
When Is "Nearest Neighbor" Meaningful? (http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31. 1422) Belur V. Dasarathy, ed (1991). Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. ISBN0-8186-8930-7. Shakhnarovish, Darrell, and Indyk, ed (2005). Nearest-Neighbor Methods in Learning and Vision. MIT Press. ISBN0-262-19547-X. Mkel H Pekkarinen A (2004-07-26). "Estimation of forest stand volumes by Landsat TM imagery and stand-level field-inventory data". Forest Ecology and Management 196 (2-3): 245255. doi:10.1016/j.foreco.2004.02.049. Franco-Lopez H, Ek AR, Bauer ME (September 2001). "Estimation and mapping of forest stand density, volume, and cover type using the k-nearest neighbors method". Remote Sensing of Environment 77 (3): 251274. doi:10.1016/S0034-4257(01)00209-7. Fast k nearest neighbor search using GPU. In Proceedings of the CVPR Workshop on Computer Vision on GPU, Anchorage, Alaska, USA, June 2008. V. Garcia and E. Debreuve and M. Barlaud.
External links
k-nearest neighbor algorithm in C++ and Boost (http://codingplayground.blogspot.com/2010/01/nearest-neighbour-on-kd-tree-in-c-and.html) by Antonio Gulli
k-nearest neighbor algorithm in Java (Applet) (http://www.leonardonascimento.com/en/knn.html) (includes source code) by Leonardo Nascimento Ferreira
k-nearest neighbor algorithm in Visual Basic and Java (http://paul.luminos.nl/documents/show_document.php?d=197) (includes executable and source code)
k-nearest neighbor tutorial using MS Excel (http://people.revoledu.com/kardi/tutorial/KNN/index.html)
STANN: A simple threaded approximate nearest neighbor C++ library that can compute Euclidean k-nearest neighbor graphs in parallel (http://compgeom.com/~stann)
TiMBL: a fast C++ implementation of k-NN with feature and distance weighting, specifically suited to symbolic feature spaces (http://ilk.uvt.nl/timbl/)
libAGF: A library for multivariate, adaptive kernel estimation, including KNN and Gaussian kernels (http://libagf.sourceforge.net)
OBSearch: A library for similarity search in metric spaces created during Google Summer of Code 2007 (http://obsearch.net)
ANN: A Library for Approximate Nearest Neighbor Searching (http://www.cs.umd.edu/~mount/ANN/)
License
Creative Commons Attribution-Share Alike 3.0 Unported (http://creativecommons.org/licenses/by-sa/3.0/)