Models for Natural Language and Vision

10-1-2001
Note taker: Brian Lee

Natural Language Models

Verbs can be grouped into categories of primitive actions to help with lexical analysis.

Example: Eve ate the apple.

Semantic Lexicon:
Eve noun human, proper name
ate verb INGEST
the determiner
apple noun fruit, juicy, food, edible

The parser first does a small amount of syntactic analysis from left to right to find the first verb.

A couple of data structures are needed.
C-list: concept list - kept in short-term memory.
R-list: request list - comes from long-term memory.
e.g. - INGEST is stored as a frame, a logical grouping of a large number of attributes.
  actor: slot 0
  object: slot 1
  to body part: (part of slot 0)
  from: slot 2

Now that we have the verb, processing begins again at the front of the sentence. "Eve" is put into the concept list, and we try to satisfy requests of the verb as soon as possible. INGEST's requests are:

  1. req0: if there's a concept on the c-list before this one and it is animate, then slot 0 gets that concept.
  2. req1: if there's a concept on the c-list after this one, and it is edible, put it in slot 1.
  3. req2: if there's a concept on the c-list after this one, and it's a location outside slot 0, put it in slot 2.

Only req0 has been satisfied (by "Eve") so far, so we begin to process the rest of the sentence. "Apple" is found to be able to fill req1, but req2 remains unsatisfied since there's nothing left in the sentence.
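
The walkthrough above can be sketched in code. This is a minimal illustration, not the parser from the lecture: the names LEXICON, REQUESTS and parse are hypothetical, each request is tested as a concept is added to the C-list rather than by scanning the whole list, and "human" is assumed to imply "animate" because the lexicon above lists Eve only as human.

```python
# Hypothetical lexicon mirroring the semantic lexicon in the notes.
LEXICON = {
    "eve":   {"pos": "noun", "features": {"human", "proper name"}},
    "ate":   {"pos": "verb", "primitive": "INGEST"},
    "the":   {"pos": "determiner"},
    "apple": {"pos": "noun", "features": {"fruit", "juicy", "food", "edible"}},
}

# Requests per primitive: (slot, side of the verb, test on the features).
# Assumption: "human" implies "animate"; "location" is a made-up feature.
REQUESTS = {
    "INGEST": [
        ("actor", "before", lambda f: "human" in f),    # req0
        ("object", "after", lambda f: "edible" in f),   # req1
        ("from", "after", lambda f: "location" in f),   # req2
    ]
}


def parse(sentence):
    words = sentence.lower().rstrip(".").split()

    # Step 1: scan left to right for the first verb and pull its frame.
    verb_index = next(i for i, w in enumerate(words)
                      if LEXICON[w]["pos"] == "verb")
    primitive = LEXICON[words[verb_index]]["primitive"]
    frame = {"primitive": primitive, "actor": None, "object": None, "from": None}
    pending = list(REQUESTS[primitive])   # R-list: requests still open
    c_list = []                           # C-list: concepts seen so far

    # Step 2: restart at the front and satisfy requests as early as possible.
    for i, word in enumerate(words):
        entry = LEXICON[word]
        if entry["pos"] != "noun":
            continue
        c_list.append(word)
        side = "before" if i < verb_index else "after"
        for request in pending:
            slot, wanted_side, test = request
            if wanted_side == side and test(entry["features"]):
                frame[slot] = word
                pending.remove(request)
                break

    unsatisfied = [slot for slot, _, _ in pending]
    return frame, unsatisfied


frame, unsatisfied = parse("Eve ate the apple.")
print(frame)
print(unsatisfied)
```

For "Eve ate the apple." the actor and object slots are filled and the "from" request is left open, matching the description above.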

Basically, verbs are used as keys to extract models from memory so top-down processing can occur.

There's a great amount of disagreement as to how many primitives there are, but one theory suggests the following:

The theory says that this is a set of language-independent primitives in which everyone thinks. To put it another way, this is the grammar of the language of thought.

Some words can fall into multiple categories. How do we resolve the ambiguity?

  1. pull both frames and see what fits
  2. don't pick the primitive immediately. Do a little bit more syntactic analysis.

The difference here is when more processing is done: before or after the primitive is picked.
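
The first strategy (pull both frames and see what fits) might look like the sketch below. Everything in it is an assumption made for illustration: the ambiguous verb is imagined to map to both INGEST and PTRANS, the feature names are made up, and "fit" is scored simply as the number of slots some concept in the sentence can fill.

```python
# Hypothetical slot requirements for the two candidate frames of one
# ambiguous verb. The feature names are illustrative only.
CANDIDATES = {
    "INGEST": {"actor": "animate", "object": "edible"},
    "PTRANS": {"actor": "animate", "object": "movable"},
}


def fit_score(requirements, concepts):
    """Count the slots that at least one concept could fill."""
    return sum(
        any(required in features for features in concepts.values())
        for required in requirements.values()
    )


def pick_primitive(concepts):
    """Pull every candidate frame, score each, and keep the best fit."""
    scores = {name: fit_score(reqs, concepts)
              for name, reqs in CANDIDATES.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Concepts found in the sentence, each with its lexicon features.
print(pick_primitive({"eve": {"animate"}, "pill": {"edible"}}))
print(pick_primitive({"eve": {"animate"}, "box": {"movable"}}))
```

The second strategy would instead delay the call to pick_primitive until more of the sentence has been analyzed, so that only one frame is ever retrieved.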

What if there are two verbs? (e.g. - I imagined he ran.) Ans: Break up the sentence into one frame per verb, with one frame filling a slot of the other.

MBUILD
  actor: I
  object: (ptr to PTRANS)
PTRANS
  actor: he
  instrument:
  from:
  to:
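
The two frames for "I imagined he ran" can be written as nested data, with the MBUILD object slot holding a reference to the PTRANS frame. This is only a sketch: the Frame class and the describe helper are hypothetical, and empty slots are represented as None.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    primitive: str
    slots: dict = field(default_factory=dict)


# Inner frame: "he ran". Slots the sentence does not fill stay None.
ptrans = Frame("PTRANS", {"actor": "he", "instrument": None,
                          "from": None, "to": None})

# Outer frame: "I imagined ...". Its object slot points at the inner frame.
mbuild = Frame("MBUILD", {"actor": "I", "object": ptrans})


def describe(frame, depth=0):
    """Return indented text lines for a frame and any frames nested in it."""
    lines = ["  " * depth + frame.primitive]
    for name, value in frame.slots.items():
        if isinstance(value, Frame):
            lines.append("  " * (depth + 1) + name + ":")
            lines.extend(describe(value, depth + 2))
        else:
            lines.append("  " * (depth + 1) + f"{name}: {value}")
    return lines


print("\n".join(describe(mbuild)))
```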

In the context of a general theory of frames, the number of slots isn't fixed, but in the context of a theory of natural language processing, the number is fixed.

Types of slots:
actor                    from
object                   time
recipient/beneficiary    location
to body part             instrument

This theory appears attractive because it says a lot about top-down processing, but it's unattractive because the question of learning is left open.

Models for Vision

After labeling of an image occurs, the labels can be used to get models from long-term memory. The models must allow for two things:

  1. the capability to zoom in and out
  2. abstraction

One model used that meets these requirements is the model of generalized cylinders, where a human would be represented by something like the following:



(Lewd drawing by Asok omitted.)
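
In place of the drawing, the idea can be sketched as a tree of cylinders in which each cylinder may be broken into smaller ones. The Cylinder class, the part names and all dimensions below are assumptions for illustration, not values from the lecture. Zooming out stops at a coarse cylinder (abstraction); zooming in expands it into its parts.

```python
from dataclasses import dataclass, field


@dataclass
class Cylinder:
    name: str
    length: float      # illustrative units only
    radius: float
    parts: list = field(default_factory=list)

    def view(self, zoom):
        """Names of the cylinders visible at a given zoom level."""
        if zoom == 0 or not self.parts:
            return [self.name]
        names = []
        for part in self.parts:
            names.extend(part.view(zoom - 1))
        return names


# Hypothetical human model: one coarse cylinder, refined into parts.
arm = Cylinder("arm", 0.7, 0.05, [
    Cylinder("upper arm", 0.3, 0.05),
    Cylinder("forearm", 0.3, 0.04),
    Cylinder("hand", 0.1, 0.04),
])
human = Cylinder("human", 1.8, 0.25, [
    Cylinder("head", 0.25, 0.1),
    Cylinder("torso", 0.6, 0.2),
    arm,
    Cylinder("leg", 0.9, 0.08),
])

print(human.view(0))
print(human.view(1))
print(human.view(2))
```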

So processing on an image progresses from the raw image to lines to labeled surfaces. From there, the labels are used to retrieve models from long-term memory to a short-term memory so a full 3D representation of the image can be constructed.

Retrieval of a model is based on partial pattern recognition. If the correct model is retrieved from memory, expectations can be drawn even if the entire object isn't visible. Expectations and zooming can be helpful for disambiguation.
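
A rough sketch of retrieval by partial match follows. The model store, the part labels and the scoring rule (fraction of a model's parts that were observed) are all assumptions chosen for illustration. The parts of the retrieved model that were not observed are the expectations.

```python
# Hypothetical long-term memory: each model is a set of labeled parts.
MODELS = {
    "human": {"head", "torso", "arm", "leg"},
    "table": {"top", "leg"},
}


def retrieve(observed):
    """Return the best-matching model and the parts still expected."""
    def score(name):
        parts = MODELS[name]
        return len(parts & observed) / len(parts)

    best = max(MODELS, key=score)
    expected = MODELS[best] - observed
    return best, expected


# Only a head and torso are visible: the human model is retrieved and
# the hidden arm and leg become expectations.
print(retrieve({"head", "torso"}))

# A lone "leg" is ambiguous between the two models; one more label
# (for example from zooming out) settles the choice.
print(retrieve({"leg", "top"}))
```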