
Codes and Data


Scalable Influence Estimation and Maximization in Information Diffusion

We are surrounded by social and information-sharing networks, over which diffusions of information, events, and viruses take place constantly. We often observe that after some influential users adopt a new product or idea, they actively influence the behavior of their friends, who in turn lead friends of friends to adopt the product through word of mouth. The specific questions we address in this NIPS 2013 paper are: how to accurately estimate the number of follow-ups that can be triggered by a given set of early influential users, and how to identify a set of influential users, to whom we will give promotions, in order to trigger the largest expected number of follow-ups as soon as possible. These questions matter in practice: advertisers, for instance, want an efficient and effective campaign for their new products.
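The paper itself develops scalable continuous-time estimation; as a much simplified illustration of the underlying idea, here is a sketch of Monte Carlo influence estimation and greedy seed selection under a discrete-time independent cascade model. The function names and toy graph are hypothetical and not from the released code.

```python
import random

def simulate_ic(graph, seeds, p=0.1, rng=random):
    """One independent-cascade run; returns the number of activated nodes.
    graph: adjacency dict with every node as a key, seeds: initial adopters."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def estimate_influence(graph, seeds, p=0.1, runs=200, seed=0):
    """Monte Carlo estimate of the expected number of follow-ups."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, p, rng) for _ in range(runs)) / runs

def greedy_seeds(graph, k, p=0.1, runs=200):
    """Repeatedly add the node with the largest estimated marginal gain."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, -1.0
        for v in set(graph) - set(chosen):
            gain = estimate_influence(graph, chosen + [v], p, runs)
            if gain > best_gain:
                best, best_gain = v, gain
        chosen.append(best)
    return chosen
```

Because the expected influence is monotone and submodular, this greedy heuristic carries a (1 - 1/e) approximation guarantee; the paper's contribution is making the estimation step scale.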

Kernel Embedding of Hidden Markov Models

Hidden Markov Models (HMMs) are important tools for modeling sequence data. However, they are restricted to discrete latent states, and are largely restricted to Gaussian or discrete observations. Moreover, learning algorithms for HMMs have predominantly relied on local search heuristics, with the exception of spectral methods such as those described below. We propose a nonparametric HMM that extends traditional HMMs to structured and non-Gaussian continuous observation distributions. Furthermore, we derive a local-minimum-free kernel spectral algorithm for learning these HMMs. We apply our method to robot vision data, slot-car inertial sensor data, and audio event classification data, and show that in these applications embedded HMMs exceed the previous state-of-the-art performance.


We introduce kernel reweighted logistic regression (KELLER) for reverse-engineering the dynamic interactions between genes from the time series of their expression values. We apply the proposed method to estimate the latent sequence of temporally rewiring networks of 588 genes involved in the developmental process during the life cycle of Drosophila melanogaster. Our results offer a first glimpse into the temporal evolution of gene networks in a living organism over its full developmental course. They also show that many genes exhibit distinctive functions at different stages of the developmental cycle.
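As a rough illustration of the reweighting idea (not the released KELLER code, which uses an l1 penalty to induce sparse neighborhoods), here is a sketch that fits a kernel-weighted, ridge-penalized logistic regression for one target gene at one time point; all names are hypothetical.

```python
import numpy as np

def time_weights(times, t_star, bandwidth):
    """Gaussian kernel weights centered at time t_star, normalized to sum to one."""
    w = np.exp(-0.5 * ((times - t_star) / bandwidth) ** 2)
    return w / w.sum()

def weighted_logistic(X, y, w, lam=0.1, lr=0.5, iters=500):
    """Kernel-reweighted logistic regression for one target gene.
    X: (T, d) binarized expression of candidate neighbor genes,
    y: (T,) target gene's on/off state, w: (T,) time weights.
    A ridge penalty stands in for the l1 penalty used by KELLER proper."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # predicted probabilities
        grad = X.T @ (w * (p - y)) + lam * beta      # weighted gradient + ridge
        beta -= lr * grad
    return beta
```

Fitting this at every time point t* (and for every target gene) yields a sequence of estimated neighborhoods, i.e. a time-varying network.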


Elefant (Efficient Learning, Large-scale Inference, and Optimization Toolkit) is an open-source Python library for machine learning, licensed under the Mozilla Public License. The aim is to develop an open-source machine learning platform that will become the platform of choice for prototyping and deploying machine learning algorithms.
This toolkit is the common platform for software development in the machine learning team at NICTA. Not all of the tools have been released yet, but many can be found in the developers' version with SVN access.


Feature selectors for unconventional data (such as strings and graph labels). A versatile framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximize this dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm. Written in Python.
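A minimal numpy sketch of the backward-elimination idea with linear kernels on both features and labels (the released package supports richer kernels for strings and graphs; the function names here are hypothetical):

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2, with H = I - (1/n) 11^T."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

def backward_elimination(X, y, n_keep):
    """Drop, one at a time, the feature whose removal hurts HSIC(features, labels) least.
    X: (n, d) data matrix, y: (n,) labels; returns indices of the kept features."""
    L = np.outer(y, y).astype(float)          # linear kernel on labels
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        scores = []
        for f in remaining:
            keep = [g for g in remaining if g != f]
            K = X[:, keep] @ X[:, keep].T     # linear kernel on remaining features
            scores.append(hsic(K, L))
        remaining.pop(int(np.argmax(scores))) # remove the least useful feature
    return remaining
```

With a linear kernel this ranks features by their squared centered correlation with the labels; swapping in string or graph kernels is what makes the framework handle unconventional data.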


Clustering with a metric on labels. A family of clustering algorithms based on the maximization of dependence between the input variables and their cluster labels, as expressed by the Hilbert-Schmidt Independence Criterion (HSIC). Under this framework, we unify the geometric, spectral, and statistical dependence views of clustering, and subsume many existing algorithms as special cases (e.g. k-means and spectral clustering). Distinctive to our framework is that kernels can also be applied to the labels, which endows them with a particular structure. Written in C, with examples in Matlab.
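A toy sketch of the dependence-maximization objective with a linear kernel on one-hot cluster labels, optimized by greedy local search (the released code implements more general label kernels and optimizers; the names here are hypothetical):

```python
import numpy as np

def hsic_objective(K, labels, k):
    """Dependence tr(Y^T H K H Y) between the data kernel K and one-hot labels Y."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Y = np.eye(k)[labels]
    return np.trace(Y.T @ H @ K @ H @ Y)

def cluster_by_hsic(K, k, sweeps=20, seed=0):
    """Greedy local search: move each point to the cluster that maximizes the objective."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=K.shape[0])
    for _ in range(sweeps):
        changed = False
        for i in range(len(labels)):
            old = labels[i]
            vals = []
            for c in range(k):
                labels[i] = c
                vals.append(hsic_objective(K, labels, k))
            labels[i] = int(np.argmax(vals))
            changed |= labels[i] != old
        if not changed:
            break
    return labels
```

With a linear data kernel K = XX^T this objective rewards clusters whose means are far from the global mean, which is how the framework recovers k-means-like behavior as a special case.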


Dimensionality reduction with side information. Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information. This method is also called maximum unfolding via Hilbert-Schmidt Independence Criterion (MUHSIC) or maximum covariance unfolding (MCU). Written in a mix of Matlab and C.


Some essential procedures for machine learning


Feature Selection

In the data file (data.txt), each row is a data point and each column is a feature. For the label file (y.txt), there are two cases. For binary data, y has a single column, with 1 for the positive class and 0 for the negative class. For multiclass data, y has one column per class: the entry in row i, column j is 1 if data point i belongs to class j, and 0 otherwise.
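Assuming the format above, a loader might look like the following sketch (the function name is hypothetical and not part of the release):

```python
import numpy as np

def load_dataset(data_path, label_path):
    """Load data.txt (rows = points, columns = features) and y.txt.
    A single-column y holds binary 0/1 labels; a multi-column y is one-hot,
    with a 1 in row i, column j when point i belongs to class j."""
    X = np.loadtxt(data_path)
    y = np.loadtxt(label_path)
    if y.ndim == 1:                  # binary: one column of 0/1
        classes = y.astype(int)
    else:                            # multiclass: one-hot matrix -> class indices
        classes = y.argmax(axis=1)
    return X, classes
```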


Dimensionality Reduction