Fast, Efficient Machine Learning Methods

Machine learning and data mining algorithms, including Support Vector Machines (SVMs), spectral clustering, and semi-supervised learning, are useful for a wide range of applications in areas such as information retrieval, computer vision, pattern recognition, and social network analysis. However, most existing learning algorithms have high computational complexity and require all data to reside at a central site for analysis. The computation and/or communication resources they require to process large-scale (distributed) data are therefore often prohibitively high, and practitioners frequently must approximate the original data in various ways (quantization, filtering, down-sampling, etc.) before invoking the data mining algorithms.

In this project, we aim to develop a general framework for efficient learning and inference on large-scale data. The main idea is to perform a lightweight, distortion-minimizing local transformation that maps the data to a small number of representative points, from which the cluster membership of all data points can be derived. There are three steps involved:

  • A distortion-minimizing local transformation of the data that produces a small set of representative points preserving the "clusterness" of the original data.
  • Machine learning on the set of representative points.
  • Recovery of the learning results for all data points based on their correspondence to the representative points.
Since the distortion-minimizing local transformation can be computed with fast procedures, while the expensive machine learning is performed only on the representative points, a large speedup can be achieved. Our theory predicts that if the distortion resulting from the data "compression" is small, then algorithms implemented within our framework will incur very little loss in clustering "accuracy".
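The three steps above can be sketched as follows. This is a minimal illustration using scikit-learn, with k-means as the distortion-minimizing transformation and spectral clustering as the expensive algorithm; all names and parameter choices here are illustrative, not the project's released software.

```python
# Sketch of the three-step framework: compress -> learn -> recover.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic data: 3,000 points in 3 Gaussian clusters.
X, _ = make_blobs(n_samples=3000, centers=3, cluster_std=0.8, random_state=0)

# Step 1: distortion-minimizing local transformation -- here, k-means with
# many more centroids than target clusters, so each centroid represents a
# small neighborhood of the data.
pre = KMeans(n_clusters=100, n_init=10, random_state=0).fit(X)
reps = pre.cluster_centers_     # representative points
membership = pre.labels_        # map: original point -> its representative

# Step 2: run the expensive algorithm (spectral clustering) on the
# 100 representatives instead of all 3,000 points.
sc = SpectralClustering(n_clusters=3, random_state=0).fit(reps)

# Step 3: recover the result for every original point via its representative.
labels = sc.labels_[membership]
print(labels.shape, np.unique(labels))
```

Because spectral clustering scales poorly in the number of points (it requires an eigendecomposition of an n-by-n affinity matrix), shrinking n from 3,000 to 100 before that step is where the speedup comes from.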

Within this framework, we currently: 1) study the effects of data transformation on the performance of spectral clustering and semi-supervised learning on graphs, showing that the approximation error of these algorithms is closely related to the perturbation of the eigenvectors or the matrix norm of the Laplacian matrix; and 2) study the tradeoff between data reduction and the loss in SVM classification performance, deriving approximate upper bounds on the perturbation of the SVM classifier. We show that the bound is empirically tight, making it practical to determine the amount of data reduction given a permissible loss in classification performance.

From these results we derive approximate upper bounds on learning accuracy. They can be used in practical settings to determine the amount of data transformation (reduction) allowed in order to meet a specified permissible loss in clustering performance. They can also serve as a guideline for developing a class of fast approximate kernel-based learning algorithms -- those based on pre-grouping neighboring points and approximating a (large) data set by a reduced set of "representative" points.
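The data-reduction tradeoff for SVMs can be demonstrated in a few lines: train one SVM on all points and one on a small set of per-class k-means centroids, then compare test accuracy. This is a hedged sketch using scikit-learn on synthetic data; the reduction scheme and parameters are illustrative assumptions, not the method from the ECML paper.

```python
# Compare an SVM trained on full data vs. one trained on k-means centroids.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Baseline: SVM trained on all 1,000 training points.
full_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

# Reduced set: replace each class's points by 20 k-means centroids.
reps, rep_labels = [], []
for cls in np.unique(y_tr):
    km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_tr[y_tr == cls])
    reps.append(km.cluster_centers_)
    rep_labels.append(np.full(20, cls))
X_red, y_red = np.vstack(reps), np.concatenate(rep_labels)

# SVM trained on only 40 representative points.
red_acc = SVC(kernel="rbf").fit(X_red, y_red).score(X_te, y_te)
print(f"full: {full_acc:.3f}  reduced ({len(X_red)} points): {red_acc:.3f}")
```

In practice one would choose the number of centroids using a bound of the kind described above: pick the smallest reduced set whose predicted accuracy loss stays within the permitted budget.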


  • Online Semi-Supervised Learning on Quantized Graphs

    Michal Valko, Branislav Kveton, Daniel Ting, Ling Huang. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence (UAI), July 2010. [pdf]
  • Fast Approximate Spectral Clustering

    Donghui Yan, Ling Huang and Michael I. Jordan. In Proceedings of the 15th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD'09), Paris, France, June 2009. [pdf] [Long version] [Software]
  • Spectral Clustering with Perturbed Data

    Ling Huang, Donghui Yan, Michael I. Jordan and Nina Taft. In Advances in Neural Information Processing Systems (NIPS) 21, Vancouver, B.C., December 2008. [pdf]
  • Support Vector Machines, Data Reduction and Approximate Kernel Matrices

    XuanLong Nguyen, Ling Huang, and Anthony D. Joseph. In Proceedings of the European Conference on Machine Learning (ECML), Belgium, September 2008. [pdf]