# Codes and Data

## Codes

### Deep Learning over Graphs, Networks and Structures
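As a rough illustration of the mean-field-style feature extraction described in this section, the sketch below repeatedly updates each node embedding from its own features and the sum of its neighbours' embeddings, then sums the node embeddings into a graph embedding. The function names, the ReLU nonlinearity, and the hand-set weight matrices `W1` and `W2` are assumptions made for illustration. This is not the package's API, and in practice the weights would be learned from discriminative information rather than fixed.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def embed_graph(features, neighbors, W1, W2, n_rounds):
    """Return (node_embeddings, graph_embedding) after n_rounds synchronous updates.

    features[i]  : feature vector of node i
    neighbors[i] : indices of the neighbours of node i
    W1, W2       : hypothetical weight matrices (hand-set here, learned in practice)
    """
    dim = len(W1)
    mu = [[0.0] * dim for _ in features]
    for _ in range(n_rounds):
        new_mu = []
        for i, x in enumerate(features):
            # Aggregate the current embeddings of the neighbours of node i.
            agg = [sum(mu[j][d] for j in neighbors[i]) for d in range(dim)]
            pre = [a + b for a, b in zip(matvec(W1, x), matvec(W2, agg))]
            new_mu.append(relu(pre))
        mu = new_mu
    # Pool the node embeddings into one vector for the whole graph.
    graph = [sum(m[d] for m in mu) for d in range(dim)]
    return mu, graph


# Two connected nodes with one-dimensional features and embeddings.
node_embeddings, graph_embedding = embed_graph(
    features=[[1.0], [2.0]], neighbors=[[1], [0]],
    W1=[[1.0]], W2=[[0.5]], n_rounds=2)
```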

An efficient C++ package for deep learning over graphs, networks and structures. Our approach, structure2vec, is an effective and scalable method for structured data representation. It embeds latent variable models into feature spaces and learns these feature spaces using discriminative information. Interestingly, structure2vec extracts features by performing a sequence of function mappings in a way similar to graphical model inference procedures, such as mean field and belief propagation. In applications involving millions of data points, we showed that structure2vec runs 2 times faster and produces models that are 10,000 times smaller, while achieving state-of-the-art predictive performance.

### High Dimensional Point Process Package

PtPack is a C++ software library for high-dimensional temporal point processes. It aims to provide flexible modeling, learning, and inference for general multivariate temporal point processes, in order to capture the latent dynamics governing the large volume of temporal events arising from social networks, online media, financial trading, modern health care, recommender systems, and similar domains.

### Doubly Stochastic Gradient Descent for Large Scale Kernel Methods
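As a rough orientation for the doubly stochastic gradient method described in this section, the sketch below draws one random training point and one random kernel feature per iteration and stores only one coefficient per iteration. It assumes a Gaussian kernel approximated with random Fourier features, a squared loss, and a constant step size; all function and parameter names are hypothetical and do not come from the released code. Each random feature is regenerated from its index, so neither support vectors nor feature parameters are kept in memory.

```python
import math
import random


def random_feature(index, x, sigma):
    """Evaluate the index-th random Fourier feature of a Gaussian kernel at x.

    The feature is regenerated from its index, so it never has to be stored.
    """
    rng = random.Random(f"feature-{index}")
    w = [rng.gauss(0.0, 1.0 / sigma) for _ in x]
    b = rng.uniform(0.0, 2.0 * math.pi)
    return math.sqrt(2.0) * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)


def predict(alphas, x, sigma):
    """f(x) = sum_j alpha_j * phi_j(x), one term per past iteration."""
    return sum(a * random_feature(j, x, sigma) for j, a in enumerate(alphas))


def doubly_sgd(data, n_iters, sigma=1.0, reg=0.0, step=0.5, seed=0):
    """Return one coefficient per iteration (assumed loss: 0.5 * (f(x) - y)^2)."""
    rng = random.Random(seed)
    alphas = []
    for i in range(n_iters):
        x, y = rng.choice(data)                   # randomness 1: training point
        phi = random_feature(i, x, sigma)         # randomness 2: kernel feature
        residual = predict(alphas, x, sigma) - y  # derivative of the loss at f(x)
        alphas = [(1.0 - step * reg) * a for a in alphas]  # shrinkage from the regulariser
        alphas.append(-step * residual * phi)
    return alphas


# Toy usage: fit a single (input, target) pair.
toy_data = [([0.3, -1.2], 2.0)]
toy_alphas = doubly_sgd(toy_data, n_iters=60)
```

Training cost in this naive form grows quadratically with the number of iterations, because every prediction re-evaluates all earlier features.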

The general perception is that kernel methods are not scalable, and that neural nets are the methods of choice for nonlinear learning problems. Or have we simply not tried hard enough with kernel methods? Here we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Our approach relies on the fact that many kernel methods can be expressed as convex optimization problems. We solve these problems by making two unbiased stochastic approximations to the functional gradient, one using random training points and another using random functions associated with the kernel, and then descending along this noisy functional gradient. We show that a function produced by this procedure after t iterations converges to the optimal function in the reproducing kernel Hilbert space at a rate of O(1/t), and achieves a generalization performance of O(1/sqrt(t)). This double stochasticity also allows us to avoid keeping the support vectors and to implement the algorithm with a small memory footprint, which is linear in the number of iterations and independent of the data dimension. Our approach can readily scale kernel methods up to regimes that are dominated by neural nets. We show that our method achieves performance competitive with neural nets on datasets such as 8 million handwritten digits from MNIST, 2.3 million energy materials from MolecularSpace, and 1 million photos from ImageNet.

### Shaping Social Activities Efficiently using Convex Optimization
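To make the Hawkes-process model used in this section more concrete, the sketch below evaluates the intensity of a multivariate Hawkes process and its long-run average activity. It assumes an exponential kernel normalised so that `A[i][j]` is the expected number of events at node `i` triggered directly by one event at node `j`; the function names and parameters are illustrative assumptions. The second function is the stationary limit of the standard Hawkes mean-intensity equation, which is linear in the exogenous rates `mu`. The time-dependent relation derived in the paper is not reproduced here.

```python
import math


def intensity(t, events, mu, A, beta):
    """Intensity of every node at time t, given past (time, node) events."""
    lam = list(mu)  # exogenous part
    for s, j in events:
        if s < t:
            decay = beta * math.exp(-beta * (t - s))
            for i in range(len(mu)):
                lam[i] += A[i][j] * decay  # endogenous part
    return lam


def stationary_activity(mu, A, n_terms=200):
    """Long-run mean intensity sum_k A^k mu, i.e. (I - A)^{-1} mu when the series converges."""
    total = list(mu)
    term = list(mu)
    for _ in range(n_terms):
        # Next "generation" of events triggered by the previous one.
        term = [sum(a * v for a, v in zip(row, term)) for row in A]
        total = [x + y for x, y in zip(total, term)]
    return total


# Node 1 influences node 0; each event at node 1 triggers 0.5 events at node 0 on average.
mu = [1.0, 1.0]
A = [[0.0, 0.5], [0.0, 0.0]]
activity = stationary_activity(mu, A)
```

Because the map from `mu` to the average activity is linear, choosing `mu` to reach a target activity level can be posed as a convex problem, which is the kind of connection the section exploits.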

Events in an online social network can be categorized roughly into endogenous events, where users just respond to the actions of their neighbors within the network, and exogenous events, where users take actions due to drives external to the network. How much external drive should be provided to each user so that the network activity can be steered towards a target state? In this paper, we model social events using multivariate Hawkes processes, which can capture both endogenous and exogenous event intensities, and derive a time-dependent linear relation between the intensity of exogenous events and the overall network activity. Exploiting this connection, we develop a convex optimization framework for determining the level of external drive required for the network to reach a desired activity level. We experimented with event data gathered from Twitter, and show that our method can steer the activity of the network more accurately than alternatives.

### Scalable Influence Estimation and Maximization in Information Diffusion
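The two questions addressed in this section, estimating influence within a time window and then choosing the most influential users, can be illustrated with the small sketch below. It assumes a continuous-time diffusion model with independent exponential transmission delays on the edges, a naive Monte Carlo influence estimator, and plain greedy selection. These modelling choices and all names are assumptions for illustration. In particular, the naive estimator is far slower than a scalable method and is not meant to reflect what the released code implements.

```python
import heapq
import random


def sample_infection_times(graph, seeds, rng):
    """One random cascade: earliest infection time of every reached node (Dijkstra)."""
    times = {s: 0.0 for s in seeds}
    heap = [(0.0, s) for s in seeds]
    heapq.heapify(heap)
    done = set()
    while heap:
        t, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        for v, rate in graph.get(u, ()):
            tv = t + rng.expovariate(rate)  # random transmission delay on edge u -> v
            if tv < times.get(v, float("inf")):
                times[v] = tv
                heapq.heappush(heap, (tv, v))
    return times


def estimate_influence(graph, seeds, horizon, n_samples=200, seed=0):
    """Average number of nodes infected no later than `horizon`."""
    if not seeds:
        return 0.0
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        times = sample_infection_times(graph, seeds, rng)
        total += sum(1 for t in times.values() if t <= horizon)
    return total / n_samples


def greedy_seeds(graph, k, horizon, n_samples=200):
    """Pick up to k seed nodes, each time adding the node with the largest estimated influence."""
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v, _ in nbrs})
    chosen = []
    for _ in range(min(k, len(nodes))):
        candidates = [v for v in nodes if v not in chosen]
        best = max(candidates, key=lambda v: estimate_influence(
            graph, chosen + [v], horizon, n_samples))
        chosen.append(best)
    return chosen


# Hypothetical network: node -> list of (neighbour, transmission rate).
network = {
    "a": [("b", 10.0), ("c", 10.0), ("d", 10.0), ("e", 10.0)],
    "b": [("c", 0.1)],
}
best_seed = greedy_seeds(network, k=1, horizon=1.0)
```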

We are surrounded by social and information sharing networks, over which diffusions of information, events, and viruses take place constantly. We often observe that after some influential users adopt a new product or idea, they actively influence the behaviors of their friends, which in turn makes more friends of friends adopt the product through word of mouth. The specific questions we seek to address in this NIPS 2013 paper are how to accurately estimate the number of follow-ups that can be triggered by a given set of earlier influential users, and how to identify a set of influential users, to whom we will give promotions, in order to trigger the largest expected number of follow-ups as soon as possible. These questions are interesting because, for instance, advertisers want to run efficient and effective campaigns for their new products.

### Kernel Embedding of Hidden Markov Models

Hidden Markov Models (HMMs) are important tools for modeling sequence data. However, they are restricted to discrete latent states, and are largely restricted to Gaussian and discrete observations. In addition, learning algorithms for HMMs have predominantly relied on local search heuristics, with the exception of spectral methods such as those described below. We propose a nonparametric HMM that extends traditional HMMs to structured and non-Gaussian continuous distributions. Furthermore, we derive a local-minimum-free kernel spectral algorithm for learning these HMMs. We apply our method to robot vision data, slot car inertial sensor data, and audio event classification data, and show that in these applications, embedded HMMs exceed the previous state-of-the-art performance.

### KELLER

We introduce kernel reweighted logistic regression (KELLER) for reverse engineering the dynamic interactions between genes based on their time series of expression values. We apply the proposed method to estimate the latent sequence of temporal rewiring networks of 588 genes involved in the developmental process during the life cycle of Drosophila melanogaster. Our results offer the first glimpse into the temporal evolution of gene networks in a living organism during its full developmental course. Our results also show that many genes exhibit distinctive functions at different stages along the developmental cycle. The data are also available for download.

### Elefant

Elefant (Efficient Learning, Large-scale Inference, and Optimization Toolkit) is an open source Python library for machine learning, licensed under the Mozilla Public License. The aim is to develop an open source machine learning platform that will become the platform of choice for prototyping and deploying machine learning algorithms. This toolkit is the common platform for software development in the machine learning team at NICTA. Not all tools are currently released, but many can be found in the developer version with SVN access.

### BAHSIC
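As a rough illustration of the HSIC-based feature filtering described in this section, the sketch below computes the standard biased empirical HSIC estimate, trace(KHLH) / (n - 1)^2, and uses it in a simple backward elimination loop. The Gaussian kernel on features, the delta kernel on labels, the one-feature-at-a-time elimination, and all function names are assumptions made for illustration; they are not taken from the released Python code.

```python
import math


def gaussian_kernel(X, features, sigma=1.0):
    """Gaussian kernel matrix computed on the selected feature indices only."""
    def k(a, b):
        d2 = sum((a[f] - b[f]) ** 2 for f in features)
        return math.exp(-d2 / (2.0 * sigma ** 2))
    return [[k(a, b) for b in X] for a in X]


def delta_kernel(labels):
    """Label kernel: 1 when two labels are equal, 0 otherwise."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]


def hsic(K, L):
    """Biased empirical HSIC for symmetric kernel matrices K and L."""
    n = len(K)
    row = [sum(r) / n for r in K]
    grand = sum(row) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            # (HKH)_ij * L_ij, summed over i and j, equals trace(KHLH).
            total += (K[i][j] - row[i] - row[j] + grand) * L[i][j]
    return total / (n - 1) ** 2


def backward_elimination(X, labels, sigma=1.0):
    """Return feature indices in elimination order (least relevant first)."""
    L = delta_kernel(labels)
    remaining = list(range(len(X[0])))
    removed = []
    while len(remaining) > 1:
        def score_without(f):
            subset = [g for g in remaining if g != f]
            return hsic(gaussian_kernel(X, subset, sigma), L)
        # Drop the feature whose removal keeps the dependence highest.
        drop = max(remaining, key=score_without)
        remaining.remove(drop)
        removed.append(drop)
    removed.extend(remaining)
    return removed


# Feature 0 determines the label; feature 1 is unrelated to it.
X = [[1.0, 0.3], [1.0, -0.3], [-1.0, 0.3], [-1.0, -0.3]]
labels = ["pos", "pos", "neg", "neg"]
order = backward_elimination(X, labels)
```

Because the label kernel is passed in as a matrix, the same loop would apply to string or graph labels given a suitable kernel on them.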

Feature selectors for unconventional data (such as string and graph labels). A versatile framework for filtering features that employs the Hilbert-Schmidt Independence Criterion (HSIC) as a measure of dependence between the features and the labels. The key idea is that good features should maximise such dependence. Feature selection for various supervised learning problems (including classification and regression) is unified under this framework, and the solutions can be approximated using a backward-elimination algorithm. Written in Python.

### CLUHSIC

Clustering with a metric on labels. A family of clustering algorithms based on the maximization of dependence between the input variables and their cluster labels, as expressed by the Hilbert-Schmidt Independence Criterion (HSIC). Under this framework, we unify the geometric, spectral, and statistical dependence views of clustering, and subsume many existing algorithms as special cases (e.g. k-means and spectral clustering). Distinctive to our framework is that kernels can also be applied on the labels, which endows them with a particular structure. Written in C, with examples in Matlab.

### MUHSIC

Dimensionality reduction with side information. Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g. subject to class labels or other side information. This method is also called maximum unfolding via Hilbert-Schmidt Independence Criterion (MUHSIC) or maximum covariance unfolding (MCU). Written in a mix of Matlab and C.

### Others

Some essential procedures for machine learning:

- Incomplete Cholesky Decomposition: linearize the kernel matrix for a nonlinear kernel.
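
The sketch below shows one common form of this procedure, a pivoted incomplete Cholesky decomposition that builds a low-rank factor `G` with `K ≈ G Gᵀ` while evaluating only the kernel entries it needs. The function names, the Gaussian kernel, and the stopping tolerance are assumptions for illustration and are not taken from the released code. The rows of `G` can then serve as explicit feature vectors for linear methods, which is the sense in which the kernel matrix is linearized.

```python
import math


def gaussian(a, b, sigma=1.0):
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def incomplete_cholesky(points, kernel, max_rank, tol=1e-10):
    """Return (G, pivots) where G is a list of rows and K is approximately G G^T."""
    n = len(points)
    diag = [kernel(p, p) for p in points]  # diagonal of the residual K - G G^T
    G = [[] for _ in range(n)]
    pivots = []
    while len(pivots) < max_rank:
        i = max(range(n), key=diag.__getitem__)  # pivot with the largest residual
        if diag[i] <= tol:
            break  # remaining residual is negligible
        root = math.sqrt(diag[i])
        column = []
        for j in range(n):
            explained = sum(a * b for a, b in zip(G[j], G[i]))
            column.append((kernel(points[j], points[i]) - explained) / root)
        for j in range(n):
            G[j].append(column[j])
            diag[j] -= column[j] ** 2
        pivots.append(i)
    return G, pivots


# Four points with two exact duplicates: a rank-2 factor already reproduces K.
pts = [[0.0], [0.0], [1.0], [1.0]]
G, pivots = incomplete_cholesky(pts, gaussian, max_rank=4)
```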