Codes & Data
ML Group
ML Seminar


My principal research interests lie in the development of efficient algorithms and systems which can learn from a massive volume of complex (high dimensional, nonlinear, multi-modal, skewed, and structured) data arising from both artificial and natural systems, reveal trends and patterns too subtle for humans to detect, and automate decision making processes in uncertain and dynamic possible world.

I am also interested in developing machine learning algorithms to address interdisciplinary problems. Projects I've worked on range from the modeling of robotic systems based on sensor measurements, to the discovery of evolving gene regulatory networks based on microarray time-series, to the management of information diffusion based on temporal sequences of social events, to the understanding of disease progression based on longitudinal medical records, to the extraction of topics based on online document feeds, and to the prediction of materials properties based on computational and lab experiments.

Big Nonlinear Models: New Representations and Scalable Algorithms

Modern machine learning applications, such as image classification, speech recognition, and materials discovery, are facing datasets with increasing volume, velocity and variety. Nonlinear and nonparametric machine learning models, such as kernel methods, are needed to capture the increasing complexity of the data. For these nonparametric models to work, the sizes of the models typically grow with the volume of the data, making them run in time quadratic or even cubic in the number of data points. How to make the training of such large scale models possible and efficient? My recent research focuses on scaling up the learning of nonparametric kernel models using randomized and distributed algorithms. With these developments, kernel methods are able scale up to the regime which is only possible by neural nets before, and achieve the accuracy of neural networks in a range of applications such as classifying millions of images from IMAGENET dataset, understanding many hours of speech from TIMIT datasets, and exploring millions of solar panel materials from Clean Energy project. Furthermore, these new algorithms have provable guarantees, and can lead to models more compact than neural networks.

Probabilistic Graphical Models

Probabilistic graphical models are good tools for representing structured dependencies between random variables in challenging tasks in social networks, natural language processing, computer vision, and beyond. Most existing applications of graphical models are restricted to cases where each random variable can take on only a relatively small number of values, or, in continuous domains, where the joint distributions are Gaussians.

I developed a novel nonparametric representation for graphical models based on the concept of kernel embeddings of distributions. This new representation allows one to conduct learning and inference in graphical model with much more general distributions.

Nonparametric graphical models have been applied to various learning problems, such as cross-language document retrieval, estimating depth from a single image, classification and forecast for dynamical system models of video, speech and sensor time series. In these applications, this new method outperforms state-of-the-art techniques.

Modeling, Analyzing and Visualizing Networks and Spatial/Temporal Dynamics

Much of the world's information has a relational structure and can be modelled mathematically as networks and graphs. Examples include biological networks, webgraphs and social networks. Many of these large and complex networks exhibit rich spatial and temporal phenomena. Traditional graph modeling, analysis and visualization algorithms are not able to capture this complex spatial and temporal behavior. I designed new modeling, analyzing and visualizing tools to better understand complex networks.

Dynamic processes occur in social, biological, biomedical and sustainability contexts over networks, space and time. I am interested in modeling and analyzing problems and data arising from these contexts. For instance, I have been studying information diffusion in social networks using continuous-time diffusion models, and using nonparametric estimation algorithms to understand the modality of social interactions, and developing methods to control or steer dynamics of social events based on these models.

Learning via Kernel Embedding of Distributions

In this work, distributions are embedded into Hilbert spaces via expected feature map of a kernel, and then all subsequent operations on distributions are carried out in the Hilbert space. This allows one to compute distances between distributions in terms of distances between their embeddings. We have developed a framework of learning based on this which includes density estimation, clustering, feature selection, two sample tests, independence tests, nonparametric sorting, and dimensional reduction. A large number of existing methods appear in this framework as special cases. Furthermore, this often leads to algorithms which are simpler and more effective than information theoretic methods in a broad range of applications.

Healthcare Analytics, Computational Biology and Neuroscience

I bring the state-of-the-art statistical learning and modeling techniques to study complex data in real world applications and accelerate the understanding of increasingly challenging modern science problems. For instance, in healthcare analytics, I developed point processes models for understanding disease co-morbidity and progression, making prediction on the outcome of diseases, and recommending effective treatments. In life science, the deluge of inter-related genome-transcriptome-phenome data offers an unprecedented opportunity for statistical modelings to explore questions such as how higher organism functions respond to molecular-level alterations. Clues to these questions are essential to the understanding, diagnoses and treatments of complex disease such as asthma and cancer. I developed methods to address problems such as selecting informative genes, understaning time varying gene regulatory networks, analyzing dynamic mental processes.

Interdisplinary Problems

I am very open to new collaborations.