Research 

 

      Introduction

         Representation

Similarity Metric

Clustering

Typical Class Member

Activity Classification

Anomaly Detection

Anomaly Explanation

Experiments

 

 

Bags of Event n-Grams - A Multi-Resolution Representation of Activities

To start with, let us consider an activity as a sequence of discrete events. Looking at an activity from this perspective, two important quantities emerge:

1- Content - which events span the activity.

2- Order - what is the order of the set of events constituting an activity.

This treatment of  an activity is very similar to the treatment of a document as a sequence of words - also known as the Vector Space Model (VSM) [6], in which a document is represented as a vector of word-counts occurring in that document, in a high dimensional space of all possible words. In order to use such a scheme, we must define a set of all possible events (event vocabulary) that could take place in the situation under consideration. To make the proposal more crisp, let us assume we are given a set of 10 events:

E = {1,2,.....10}

Consider a particular sequence of events given as:

S1 = {1,2,3,8,9,10}

The VSM representation of such an activity would be:

Figure 1

The resolution at which these events should be defined is an interesting problem since almost any event can be defined by a set of lower level events. This problem is essentially related to linguistics and how human beings abstract information in terms of nouns and verbs to communicate and understand each other in an efficient manner. Since the everyday activities that we are concerned with in this work have humans as agents, it is reasonable to use a vocabulary of events which is defined by a human being. We therefore use a human-defined vocabulary of events for processing information further. (The problem of discovering events at a granularity understandable to human beings is an open question and is not a contribution of this work).

While this approach captures the content in an efficient way, the order of the sequence is captured at a very global scale. To illustrate this point, let us consider a different sequence of events given as:

S2 = {8,9,10,1,2,3}

The VSM representation of such an activity would be:

Figure 2

As can be observed that while S1 and S2 are two completely different sequences, their VSM models are entirely the same. Therefore we see how VSM can only capture a global notion of event order. Since everyday activities do not generally follow a particular set of grammatical rules very strictly, therefore a better modeling of the order of events at a more local scale is needed.

One possible extension is to consider higher order n-grams histogram of events instead of simply considering event mono-grams (n = 1). Therefore, an activity could be represented by a vector of n-gram counts in a (very) high dimensional space of all possible n-gram counts for a given sequence of events (we will come back to the issues of dimensional complexity in a bit). To understand the notion of histograms of n-grams, let us consider S1 and S2 again:

S1 = {1,2,3,8,9,10}

S2 = {8,9,10,1,2,3}

The tir-gram representation of S1 and S2, (given as S1(3) and S2(3) respectively) can be given as:

S1(3) = {1 2 3, 2 3 8, 3 8 9, 8 9 10}

S2(3) = {8 9 10, 9 10 1, 10 1 2, 1 2 3}

The histograms of S1(3) can be shown as:

Figure 3

 

The histograms of S2(3) can be shown as:

Figure 4

 

As can be seen from figure 3 and 4, the orders of S1 and S2 are better represented by the histograms of S1(3) and S2(3) as they appear to be more different from each other than the histograms of mono-grams of S1 and S2. It should be evident that as we go higher in the value of n, the scale of the order of events become more and more local. We therefore have a multi-resolution representation of an activity, where every activity is represented as a set of histograms of n-grams of events. Such a multi-resolution representation could be used in a cascaded manner to filter out false negatives in a better manner (similar to [8]).

While our proposed representation of treating activities as bags of n-grams of events captures both content and order of events at a multi-resolution scale, it does pose a problem of dealing with sparse very high dimensional data. If we defined K events that could take place in a situation, and we considered n-grams with n = 1,3 and 5, our activity would be living in a space of order O(K^5). For even moderate values of K, learning and estimation in such a space can be infeasible.

We could partially solve the curse of dimensionality of our space using any of the plethora of dimensionality reduction techniques available. Such an approach is very similar to the Latent Semantic Analysis [7] of documents in the Information Retrieval community, where documents are mapped to a lower dimensional subspace (using Principal Component Analysis - PCA), and document similarity is computed based on the inner product between vectors in the reduced dimensional subspace. However, since PCA would lose some information (least significant in terms of L2 norm) in the process of reducing dimensionality, this lost information might be critical to detect something as an anomaly.

To understand this point in a better way, let use consider the situation where we have a total of X sequences, each of length L (we are considering equal length sequences for the sake of simplicity here). The event vocabulary consists of 10 events, E = {1,2,3,...10}. Computing histograms of trig-grams on this event space would take us to a space of 10^3 = 1,000 dimensional space - thus transforming each of X sequences from 10 dimensional space of events to 1,000 dimensional space of tri-gram counts. We can perform the principal component analysis of X vectors of histograms (of tri-grams of X sequences). The percentage of total variance of the data (sorted in an ascending order of eigen values) can be shown as:

Figure 5

 

While there would be 1,000 eigen vectors of our space, only the first 20 most significant vectors have been shown in figure 5, the rest are too weak to show. The first 20 eigen vectors capture 95.756% variance of the entire data, while the first 7 eigen vectors capture 90.765% variance. We therefore project the entire data from 1,000 dimensional space to 7 dimensional data, still preserving 90.756% variance of the entire data. Once this data is projected in a low dimensional space, we can re-project it back in our original 1,000 dimensional space to see how much data have we actually lost in the process of re-construction. The following figures show one of the sequences in the original 1,000 dimensions and its corresponding re-projection from 7 dimensional space:

Figure 6

Figure 7

 

As can be seen from figure 6 and 7, there is some aliasing in the re-projected vector, i.e. the vector components with smaller magnitudes have been removed (as expected). While most of the significant (more frequent) tri-grams are still preserved, even the tri-grams which occur less frequently could make an activity anomalous. Therefore, while this approach could be very useful for anomaly detection where anomalies are defined to be "very different" from regular, it can not be used for detecting relatively more subtle anomalies.

Getting back to the problem of dimensionality reduction, a potential solution to the problem is to consider only those dimensions which are of significant importance, where the dimension importance can be estimated by using some inductive learning techniques based on some training data as proposed in some of the work in Network Intrusion Detection community [9]. Unfortunately, unlike the problem of network intrusion detection, where large sets of network activity data are usually available, we do not have such large data sets, and therefore using inductive learning techniques for our problem would not provide a good estimation of dimension importance.

We propose a solution to this problem by constructing a similarity metric that encompasses all the non-zero components of the activity vector, hence giving equal importance to all the dimensions with values greater than zero, and no importance to dimensions with values equal to zero. Since the activity vector is highly sparse, this allows us to reduce the dimensionality dramatically without loosing any information, allowing us to be able to use higher values of n-grams for our analysis.

 

Recap: In this section we have described a novel approach for activity representation as bags of n-grams of events. The representation is influenced by the work done in Information Retrieval and Network Intrusion Detection community, and has been proved to be effective in those domains. To the best of our knowledge, we have not seen the use of such a representation of activities in Visual Activity Recognition community.

 

 

[6] G. Salton, "The SMART Retrieval System - Experiment in Automatic Documentation Processing, " 1971.

[7] Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to Latent Semantic Analysis. Discourse Processes, 25, 259-284.

[8] Paul A. Viola, Michael J. Jones: Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade. NIPS 2001: 1311-1318

[9] A Data Mining Framework for Building Intrusion Detection Models, Wenke Lee, Sal Stolfo, and Kui Mok In Proceedings of the 1999 IEEE Symposium on Security and Privacy, Oakland, CA, May 1999.