Feature Selection for Clustering
FSFC is a library of feature selection algorithms for clustering.
It's based on the article "Feature Selection for Clustering: A Review." by S. Alelyani, J. Tang and H. Liu
The algorithms are covered by tests that check their correctness and compute some clustering metrics. The tests use open datasets:
- Generic data - High-dimensional points datasets
- Text data - SMS Spam Collection
Project documentation is available on Read the Docs
Implemented algorithms:
- Generic Data:
  - SPEC family - NormalizedCut, ArbitraryClustering, FixedClustering
  - Sparse clustering - Lasso
  - Localised feature selection - LFSBSS algorithm
  - Multi-Cluster Feature Selection
  - Weighted K-means
- Text Data:
  - Text clustering - Chi-R algorithm, Feature Set-Based Clustering (FTC)
  - Frequent itemset extraction - Apriori
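To give a feel for the frequent itemset extraction listed above, here is a minimal pure-Python sketch of the Apriori idea. It is not FSFC's implementation: the function name `apriori`, its `min_support` parameter, and the sample `docs` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate, keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = count({frozenset([item]) for item in items})
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        current = count(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy "documents", each reduced to a set of terms.
docs = [
    {"free", "win", "prize"},
    {"free", "win"},
    {"free", "call"},
    {"meeting", "call"},
]
result = apriori(docs, min_support=2)
```

With these four toy documents and `min_support=2`, the frequent itemsets are `{free}`, `{win}`, `{call}` and `{free, win}`. Text-oriented methods can then treat such frequent term sets as candidate features.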
Dependencies:
- numpy
- scikit-learn
- scipy
How to use:
The project is currently in an early alpha stage, so it isn't published on PyPI and can't be installed with pip yet.
Because of this, installation is a bit more involved. To use FSFC you should:
- Clone the repository to your computer.
- Run make init to install dependencies.
- Copy the contents of the fsfc folder to the source root of your project.
After that, you can use the feature selectors as follows:
import numpy as np
from fsfc.generic import NormalizedCut
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
data = np.array([...])
pipeline = Pipeline([
    ('select', NormalizedCut(3)),
    ('cluster', KMeans())
])
pipeline.fit_predict(data)
How to support:
You can support development by testing the library, reporting bugs, or opening pull requests.
The project has tests; run them with the command make test.
There is also Sphinx documentation for the code; build it with the command make html.
The documentation uses numpydoc, so it must be installed on the system. To install it, run pip install numpydoc.
References:
- Alelyani, Salem, Jiliang Tang, and Huan Liu. "Feature Selection for Clustering: A Review." Data Clustering: Algorithms and Applications 29 (2013): 110-121.
- Zhao, Zheng, and Huan Liu. "Spectral feature selection for supervised and unsupervised learning." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
- Witten, Daniela M., and Robert Tibshirani. "A framework for feature selection in clustering." Journal of the American Statistical Association 105.490 (2010): 713-726.
- Li, Yuanhong, Ming Dong, and Jing Hua. "Localized feature selection for clustering." Pattern Recognition Letters 29.1 (2008): 10-18.
- Cai, Deng, Chiyuan Zhang, and Xiaofei He. "Unsupervised feature selection for multi-cluster data." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
- Huang, Joshua Zhexue, et al. "Automated variable weighting in k-means type clustering." IEEE Transactions on Pattern Analysis and Machine Intelligence 27.5 (2005): 657-668.
- Li, Yanjun, Congnan Luo, and Soon M. Chung. "Text clustering with feature selection by using statistical data." IEEE Transactions on Knowledge and Data Engineering 20.5 (2008): 641-652.
- Agrawal, Rakesh, and Ramakrishnan Srikant. "Fast algorithms for mining association rules." Proceedings of the 20th International Conference on Very Large Data Bases (VLDB). Vol. 1215. 1994.
- Beil, Florian, Martin Ester, and Xiaowei Xu. "Frequent term-based text clustering." Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2002.
