Decentralized Machine Learning and Efficient Anomaly Detection

Distributed computing systems are a promising domain for machine learning methods, with applications in information retrieval, anomaly detection, social network analysis, and other areas. However, most existing learning algorithms have high computational complexity and require all data to reside at a central site for analysis. As a result, the computation and communication resources needed to process large-scale data in distributed systems are often prohibitively high, and practitioners must often approximate the original data in various ways (quantization, filtering, downsampling, etc.) before invoking data mining algorithms.

In this project, we aim to develop a general framework for efficient learning and anomaly detection in distributed (mobile) systems. This framework involves in-network processing at distributed sites (devices) and approximate mining at the Network Operation Center (NOC). The combination of distributed local processing strategies, sophisticated learning algorithms, and theoretical analysis tools enables our approach to perform in-network inference and mining that achieve high accuracy with low communication overhead.

Our system combines intelligent data filtering at distributed monitors with Support Vector Machine (SVM) and Principal Component Analysis (PCA) detection at the Network Operation Center (NOC). Our approximation scheme uses a set of local monitors, each maintaining a parameterized sliding filter. These sliding filters yield quantized data streams that are sent to the NOC, which makes global decisions based on them.
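The monitor-and-NOC split described above can be illustrated with a minimal sketch. It is not the project's implementation: the class `LocalMonitor`, the slack parameter `delta`, the helper `pca_residual_energy`, and the synthetic traffic are all hypothetical, and the filter shown (send a reading only when it drifts more than `delta` from the last value sent) is just one simple way to realize a parameterized filter. The NOC side uses the standard PCA residual-subspace score.

```python
import numpy as np


class LocalMonitor:
    """Hypothetical monitor: forwards a reading to the NOC only when it
    differs by more than `delta` from the last value it sent."""

    def __init__(self, delta):
        self.delta = delta
        self.last_sent = None
        self.messages = 0

    def observe(self, value):
        if self.last_sent is None or abs(value - self.last_sent) > self.delta:
            self.last_sent = value
            self.messages += 1
        # The NOC's current view of this monitor's stream.
        return self.last_sent


def pca_residual_energy(X, k):
    """Squared norm of each row after removing its projection onto the
    top-k principal components (the residual-subspace score)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    residual = Xc - Xc @ Vt[:k].T @ Vt[:k]
    return (residual ** 2).sum(axis=1)


def simulate(n_steps=200, n_monitors=8, delta=0.5, seed=0):
    """Synthetic example (assumed data): one shared trend, small noise,
    and a single injected anomaly at time step 150 on monitor 3."""
    rng = np.random.default_rng(seed)
    trend = 10 * np.sin(np.linspace(0, 6 * np.pi, n_steps))
    weights = rng.uniform(0.5, 1.5, n_monitors)
    X = np.outer(trend, weights) + 0.1 * rng.standard_normal((n_steps, n_monitors))
    X[150, 3] += 8.0

    monitors = [LocalMonitor(delta) for _ in range(n_monitors)]
    # X_hat is what the NOC sees: the last value each monitor sent.
    X_hat = np.array([[m.observe(v) for m, v in zip(monitors, row)] for row in X])
    sent = sum(m.messages for m in monitors)
    return X, X_hat, sent


if __name__ == "__main__":
    X, X_hat, sent = simulate()
    scores = pca_residual_energy(X_hat, k=1)
    print("messages sent:", sent, "of", X.size)
    print("max filtering error:", np.abs(X - X_hat).max())
    print("most anomalous time step:", int(np.argmax(scores)))
```

In this sketch, `delta` is the knob that trades accuracy for communication: by construction the NOC's view never differs from the true reading by more than `delta`, and a larger `delta` means fewer messages but a coarser view.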

We derive analytical results based on stochastic matrix perturbation theory to balance the tradeoff between detection accuracy and the amount of data communicated over the network. By avoiding the expensive step of centralizing all traffic data, our solution enables SVM- and PCA-based anomaly tracking in real time with minimal data communication. This overcomes the key scalability limitations of the state-of-the-art network-wide anomaly detection solution. Experiments with traffic data from an ISP backbone network demonstrate that our methods yield significant communication benefits while simultaneously achieving high detection accuracy.
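To give a sense of the kind of bound that matrix perturbation theory provides, the classical deterministic result (Weyl's inequality) is shown here. It is only an illustration: the project's stochastic analysis is not given in this text, and the symbols $A$, $\hat{A}$, $\Delta$, and $\varepsilon$ are notation assumed for this example. Let $A$ be the symmetric matrix (for instance, a covariance matrix) computed from the full data, and $\hat{A} = A + \Delta$ the one computed at the NOC from filtered data.

```latex
\[
  \bigl|\lambda_i(\hat{A}) - \lambda_i(A)\bigr| \;\le\; \|\Delta\|_2
  \qquad \text{for every } i,
\]
% Consequently, a target eigenvalue error \varepsilon is guaranteed whenever
\[
  \|\Delta\|_2 \;\le\; \varepsilon .
\]
```

Here $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue and $\|\cdot\|_2$ the spectral norm. A bound of this form links the error introduced by local filtering to the error in the spectral quantities on which PCA-based detection relies.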