NEURAL NETWORKS and CONNECTIONISM

%----------------------------------------------------------------------------

CONNECTIONISM:

Connectionism is a major school of thought in AI. Its basic tenets are:

o Intelligent behavior is an emergent phenomenon. It can be understood in
  terms of simple associative interactions among a large number of small,
  simple computing elements. Thus the problem of understanding intelligent
  behavior reduces to understanding the behavior of the individual elements
  and the interactions between them.

o Mental representations are distributed and continuous (rather than
  discrete or digital). Connectionist representations are in the FORM of
  numerical functions that (i) can take on values over a continuous space,
  and (ii) may be distributed over the computing elements and the
  CONNECTIONS between them.

o Cognitive reasoning is massively parallel. Connectionist reasoning is
  highly parallel since a large number of computing elements can be active
  simultaneously.

o Memory and learning are key elements of intelligence. Connectionist
  networks are associative memories capable of simple learning.

%----------------------------------------------------------------------------

NEURAL NETWORKS:

Neural networks are connectionist networks inspired by the neurobiology of
the brain.

Why look at the neurobiology of the brain? The brain is the only example of
intelligence we know, and, according to some neurobiologists, it can be
modeled as a ``neural network''.

%----------------------------------------------------------------------------

REAL NEURONS

Simplified schematic of a real neuron:

  Cell body
  Dendrites (branched protrusions)
  Axon (a single branch)
  Synapses

Dendrites receive signals from other neurons. If the combined input impulses
exceed a certain threshold, the neuron fires and a signal (impulse) passes
down the axon. Branches at the end of an axon form synapses with axons and
dendrites of other neurons. A synapse is the point of contact between
neurons.
The connections between neurons may be excitatory or inhibitory.

%----------------------------------------------------------------------------

ARTIFICIAL NEURONS:

Highly simplified models of real neurons.

The connections between neurons have numerical weights. Neurons are simple
computing devices whose output depends on the input and an ACTIVATION
FUNCTION.

In some models (e.g., the perceptron model), the input to a neuron may be the
linear weighted sum of the outputs of neurons in the previous layer.

Also, in some models (e.g., the perceptron model), the activation function
may be just a threshold. If the weighted input to a neuron is greater than
the threshold, it fires, i.e., its output is 1; otherwise the neuron does
not fire, i.e., its output is 0.

%----------------------------------------------------------------------------

PERCEPTRONS:

Small networks of simple threshold logic units.

[Figure: perceptron schematic. A retina (R x R) feeds the input units
phi_1 ... phi_n. Each phi_i is multiplied by its weight alpha_i, and the
products are passed to the output unit, which consists of an adder (sigma)
followed by a threshold (theta) and produces the output.]

The units are organized in two layers: an input layer and an output layer.
The output layer contains only one unit.

Activation Function:

  Output = 1  iff  Sigma i=1,n (alpha_i x phi_i) > theta
  Output = 0  otherwise

Diameter-limited perceptrons: for each phi_i, all inputs come from an area
with diameter d << R.

%----------------------------------------------------------------------------

PERCEPTRON LEARNING:

Let us suppose that a teacher wants to train a perceptron to classify visual
patterns such as the letter A.

Training Procedure:

1.
Present the perceptron with both positive and negative examples, for
   instance, by flashing images of A (positive examples) and other letters
   (negative examples).

2. For each example, check the output of the perceptron, which may be a 0
   (no, it is not an A) or a 1 (yes, it is an A).

3. If the output is incorrect, supply the correct output at the output unit.

4. Let the perceptron adjust the weights of its connections, and then repeat
   the process until the perceptron learns to recognize the pattern of A and
   correctly classifies the images into A and not A.

Gradient Descent (Delta) Rule, an error-correcting variant of Hebb's rule:

Let i be a unit in the input layer and let j be the unit in the output
layer. Let wij be the weight of the connection between i and j. The gradient
descent rule gives the adjustment to wij when the perceptron gives an
incorrect output. Let oi be the output of unit i and let oj be the
(incorrect) output of unit j. Let tj be the (correct) output supplied by the
teacher. Then,

  Delta(wij) = Eta x oi x (tj - oj)

where Delta(wij) is the adjustment in wij, and Eta, called the learning
rate, is a constant of proportionality.

The gradient descent method searches for the ``right'' wij that produce the
smallest error term, (tj - oj). This method is equivalent to the weak method
of hill climbing, and, thus, offers all its advantages and suffers from all
its drawbacks.

%----------------------------------------------------------------------------

Limitations of perceptrons (Minsky and Papert):

Theorem: No diameter-limited perceptron can determine whether or not all the
parts of a geometric figure are connected to one another.

Theorem: Perceptrons whose output unit thresholds a weighted sum of the raw
inputs cannot learn to correctly classify patterns that are not linearly
separable, such as XOR, no matter how long they are trained.

%----------------------------------------------------------------------------

Nonlinear Multilayer networks (Rumelhart and McClelland):

Networks of neurons with input units, output units, and one or more layers
of intermediate units (called hidden units).
The output layer may contain more than one unit.

Hidden layers/units allow the system to construct more complex patterns of
internal weights, and the network can therefore learn more complex features.

Activation functions of units in the hidden layers are nonlinear. A commonly
used activation function is the logistic (sigmoid) function, written here in
the form of a thermodynamic potential:

  oj = 1 / [1 + exp { -(Sigma i=1,n (wij x oi) - theta) / T }]

where T = ``temperature'' (just a parameter of the network), n is the number
of units in the preceding layer, and the rest of the terms have their usual
meanings.

Note that this function gives a value between 0.5 and 1 when the term
(Sigma i=1,n (wij x oi) - theta) is positive, and a value between 0 and 0.5
when the term is negative. Thus, this nonlinear function does not give just
0s and 1s as output; instead, its output is some value between 0 and 1.

The units in the output layer continue to have the linear threshold
activation function, and, thus, continue to give 0s and 1s as output.

%----------------------------------------------------------------------------

BACKPROPAGATION:

The training procedure in nonlinear multilayer networks is the same as the
procedure for perceptrons. The learning rule, however, is a generalization
of the gradient descent rule for perceptrons, because in multilayer networks
the corrective feedback is BACKPROPAGATED through the layers.

The rule for the connection weights between the last hidden layer and the
output layer is the same as for the perceptron. Let oi be the output of unit
i in the last hidden layer and let oj be the (incorrect) output of unit j in
the output layer. Let tj be the (correct) output supplied by the teacher to
unit j. Then,

  Delta(wij) = Eta x oi x (tj - oj)

However, the rule for connection weights between the last hidden layer and
the preceding layer is different because there is no teacher-supplied output
in this case. Let ok be the output of unit k in the preceding layer and let
oi be the output of unit i in the hidden layer.
Let delta_j = (tj - oj) be the error at output unit j, and let m be the
number of output units. The error assigned to hidden unit i is the sum of
the errors of the output units it feeds, each weighted by the connection
wij, and scaled by the slope of the activation function of unit i, which is
oi x (1 - oi) / T for the nonlinear activation function of the hidden units:

  delta_i = [oi x (1 - oi) / T] x (Sigma j=1,m (wij x delta_j))

Then,

  Delta(wki) = Eta x ok x delta_i

%----------------------------------------------------------------------------

Advantages of neural networks:

o Computational models of brains
o Highly parallel processing
o Distributed representations
o Can learn some new ``concepts''
o Robust in the sense of graceful degradation
o Softer, fuzzier inputs/outputs
o Limited success in reactive control, speech recognition

Drawbacks:

o Neural nets are not brains, though they look superficially similar.
o Neural nets cannot (at least not yet) model higher-level cognitive
  mechanisms (symbols, reference, focus of attention).
o Some reasoning may not be parallel (e.g., problem solving).
o Computationally much too complex. How to implement, or even simulate,
  systems with trillions of elements?

%----------------------------------------------------------------------------
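
WORKED SKETCH: PERCEPTRON LEARNING

The perceptron training procedure and the XOR limitation described in these
notes can be tried out with a few lines of Python. This is only a minimal
sketch under stated assumptions, not a definitive implementation: the names
output, train, eta, and max_epochs are illustrative, the weight update uses
the delta rule in its standard form Delta(wij) = Eta x oi x (tj - oj), and
the threshold theta is assumed to be trained along with the weights (as if
it were a weight on a constant input of -1), which the notes do not specify.

```python
def output(weights, theta, inputs):
    """Threshold unit: fire (1) iff the weighted sum of inputs exceeds theta."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > theta else 0


def train(examples, eta=0.5, max_epochs=100):
    """Repeat the training procedure until every example is classified
    correctly, or give up after max_epochs passes over the examples.

    Returns (weights, theta, converged).
    """
    n = len(examples[0][0])
    weights = [0.0] * n
    theta = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            error = target - output(weights, theta, inputs)  # (tj - oj)
            if error != 0:
                mistakes += 1
                # Delta(wij) = Eta x oi x (tj - oj)
                weights = [w + eta * x * error
                           for w, x in zip(weights, inputs)]
                # Assumption: theta is adjusted too, like a weight on a
                # constant input of -1.
                theta -= eta * error
        if mistakes == 0:
            return weights, theta, True
    return weights, theta, False


# Each example is (inputs, teacher-supplied output).
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and, theta_and, and_ok = train(AND)
w_xor, theta_xor, xor_ok = train(XOR)
print("AND learned:", and_ok)
print("XOR learned:", xor_ok)
```

Under these assumptions the run reports that AND is learned and XOR is not.
AND is linearly separable, so the error-driven updates settle on a correct
set of weights. For XOR no choice of two weights and a threshold classifies
all four patterns, so at least one example is always misclassified, however
large max_epochs is made.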