Estimating the covariance matrix is crucial to understanding the complex relationships among variables in "big data" from applications such as social networks, biomedical data and finance. Our goal is to implement CONCORD, a state-of-the-art algorithm for high-dimensional covariance estimation, in the ECL Machine Learning Library. This will enable us to leverage the computational power of HPCC and scale the algorithm to datasets with up to millions of variables.
Additional info: http://onlinelibrary.wiley.com/doi/10.1111/rssb.12088/abstract
The availability of high-dimensional data (or “big data”) has touched almost every field of science and industry. Such data, where the number of variables (features) is often much higher than the number of samples, is now more pervasive than it has ever been. Discovering meaningful relationships between the variables in such data is one of the major challenges that modern-day data scientists have to contend with.
The covariance matrix of the variables is the most fundamental quantity that can help us understand the complex multivariate relationships in the data. The CONCORD algorithm (to appear in Journal of the Royal Statistical Society) is a state-of-the-art high-dimensional covariance estimation algorithm that substantially improves on previous methods in both computational efficiency and theoretical properties.
The goal of this project is to implement the CONCORD algorithm in ECL, using the PB-BLAS infrastructure. Currently, CONCORD has been implemented in R, and it performs well computationally when the number of variables (features) is in the thousands. This is good enough for many applications (such as finance or climate sciences). However, many other applications (social networks, genetic data) have hundreds of thousands, sometimes millions, of variables. Implementing CONCORD in ECL will allow us to do the following:
1. Leverage the computational power of HPCC, and target such datasets.
2. Use the PB-BLAS infrastructure in ECL to further speed up the matrix operations used in CONCORD.
3. Capitalize on the aspects of CONCORD that are easily parallelizable.
The CONCORD algorithm minimizes a convex objective function by cyclic coordinate minimization: it updates one matrix entry at a time, holding the others fixed, and cycles through all entries until they stop changing. It is theoretically guaranteed to converge to a global minimum of the objective function. In our experience, it has always converged in 50 iterations or fewer. A more detailed description of the basic algorithm can be found here: https://www.dropbox.com/s/duzopk3ntndyqxp/concord.pdf?dl=0
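To make the idea of cyclic coordinate minimization concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the R implementation or the planned ECL one. It assumes the objective has the form Q(Ω) = -sum_i log(ω_ii) + (1/2) tr(Ω S Ω) + λ sum_{i<j} |ω_ij|, where S is the sample covariance matrix and Ω is the symmetric matrix being estimated. This form is recalled from the published paper and is not stated in this description, so it should be checked against the linked document. The function names (soft_threshold, objective, coordinate_descent) and the parameters lam, max_sweeps and tol are hypothetical. Each coordinate update below is the exact minimizer of the assumed objective in that one entry, which is why a sweep can never increase the objective.

```python
import math


def soft_threshold(x, lam):
    """Shrink x toward zero by lam (the minimizer map for a lam * |x| penalty)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def objective(omega, s, lam):
    """Assumed objective: -sum log(w_ii) + 0.5 * tr(W S W) + lam * sum_{i<j} |w_ij|."""
    p = len(s)
    value = 0.0
    for i in range(p):
        value -= math.log(omega[i][i])
        # Row i of omega contributes 0.5 * (row_i)^T S (row_i) to the trace term.
        for a in range(p):
            for b in range(p):
                value += 0.5 * omega[i][a] * s[a][b] * omega[i][b]
        for j in range(i + 1, p):
            value += lam * abs(omega[i][j])
    return value


def coordinate_descent(s, lam, max_sweeps=100, tol=1e-8):
    """Cycle through the entries of a symmetric matrix, minimizing one at a time.

    s must be symmetric with a strictly positive diagonal.
    Returns the estimate and the number of sweeps performed.
    """
    p = len(s)
    # Start from the identity matrix (an arbitrary choice for this sketch).
    omega = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    for sweep in range(1, max_sweeps + 1):
        max_change = 0.0
        for i in range(p):
            for j in range(i, p):
                old = omega[i][j]
                if i == j:
                    # Diagonal entry: positive root of s_ii * w^2 + b * w - 1 = 0.
                    b = sum(omega[i][k] * s[i][k] for k in range(p) if k != i)
                    new = (-b + math.sqrt(b * b + 4.0 * s[i][i])) / (2.0 * s[i][i])
                else:
                    # Off-diagonal entry: penalized one-dimensional quadratic,
                    # solved in closed form by soft-thresholding.
                    a = sum(omega[i][k] * s[j][k] for k in range(p) if k != j)
                    a += sum(omega[j][k] * s[i][k] for k in range(p) if k != i)
                    new = soft_threshold(-a, lam) / (s[i][i] + s[j][j])
                omega[i][j] = omega[j][i] = new  # keep the matrix symmetric
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            return omega, sweep
    return omega, max_sweeps


if __name__ == "__main__":
    # Toy 3-variable covariance matrix, made up for illustration only.
    s = [[1.0, 0.5, 0.0],
         [0.5, 1.0, 0.3],
         [0.0, 0.3, 1.0]]
    omega, sweeps = coordinate_descent(s, lam=0.1)
    print("sweeps:", sweeps)
    for row in omega:
        print([round(v, 4) for v in row])
```

Each sweep costs on the order of p^3 operations in this naive form, because every one of the roughly p^2/2 entries needs an inner sum of length p. Those inner sums are the kind of matrix operation a distributed implementation would aim to speed up.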