VISION and PERCEPTION %---------------------------------------------------------------------------- The vision problem: Given a 2-dimensionsal image (typically, a bitmap), to infer the objects that produced it, including their shapes, positions, colors, sizes, ... Some aspects of vision: Shape (central aspect) Color Motion (Optical flow) Texture Shading and shadows Free space Stereo Reflectance Perspective Illusions History of AI research in Vision: Pattern recognition (eg, blood cell counts; assembly line inspection) Vision for limited domains (eg, polyhedra; uniform albedo (gray level)) Scene analysis (eg, object and shape recognition) Modeling perception (eg, optical illusions; clustering) Some current research topics in vision: edge detection shape from shading relaxation labelling generalized cylinders intrinsic images (2-1/2 D images) model-driven vision (top-down) optical flow %---------------------------------------------------------------------------- Why is vision hard? - computational complexity: example: 1024x1024 image x 8 bits x 3 planes for color = 24M edge detection: 10 operations per pixel therefore, 240 million operations per image - confounding factors ambient light shadows atmospheric conditions camera angle lens distortions - need for a priori world knowledge to provide context and expectations - poor understanding of how human vision works - underconstrained: insufficient information in image to provide scene interpretation (image ambiguity) missing edges (brightness; occlusion) extra edges (shadows) semantic confusion For example, the image: --------- | | --------- may be produced by any of the following objects: /--------/| --------- /-------/\ / / | | | / / \ /--------/ / | | /-------/----\ | | / | | ----------/ --------- %---------------------------------------------------------------------------- Subtasks of vision SIGNAL PROCESSING convert image into one with more desirable properties enhance signal-to-noise ratio enhance edges RECOGNITION/CLASSIFICATION map the image into space of possible categories group image components into predetermined categories INTERPRETATION/UNDERSTANDING build description of image and the scene it represents %---------------------------------------------------------------------------- Stages in vision processing scene local object processing interpret- processing ation ----> ----> ----> ----> image low intermediate high semantic sensor level level level description Levels of representations scene (usually semantic net of some kind) object region subregion patch pixel (gray image or intensity bitmap) %---------------------------------------------------------------------------- Representations for vision processing (Marr) Raw 2-D Primal 2-1/2 D 3-D Image ----------> Sketch ----------> Image ----------> Image (pixels) (edges) (surfaces) (volumes) low-level intermediate high-level processing processing processing 2-1/2 D Images are also called intrinsic images Motion Textures Stereo Primal ----> ----> Line ----> Intrinsic Sketch Photometry Interpretation Image %---------------------------------------------------------------------------- Control of Processing: mostly bottom-up in low-level vision both bottom-up and top-down in high-level vision Another surprise: low-level vision involves mostly local operations (e.g., discrete convolution, constraint propagation) but provides global information (e.g., blocks; edges). %---------------------------------------------------------------------------- LOW LEVEL VISION: Example: EDGE DETECTION Assumption: Gray image (no color) What is an "edge"? How can we detect an "edge" in a pixel array? By comparing the intensities of neighbouring pixels. Example: If the intesities were 9 8 3 2 2 3 9 8 3 2 3 4 9 9 4 3 3 2 9 7 3 4 2 2 9 8 3 3 2 3 then an edge between the second and third columns is possible (because of the ``large'' difference in the intensities). %---------------------------------------------------------------------------- Operators: Simple neighbor comparison doesn't always work (digitization error; noise) Need larger but still local "operators" Combinations of LOCAL OPERATIONS give rise to GLOBAL INFORMATION: a central theme in low-level vision. Example of an operator: --------------- |+1|+1|0|-1|-1| each square represents a pixel --------------- Example, if the intensities were: 22 21 19 18 16 the "value" of the operator would be +9 (which suggests an edge because there is a large "difference" in "average" intensity across the width of the operator). Problem: what if the intensities went up and down? What you really want is not a "value" of "average intensity", but rather some way of detecting where the intensities rise and fall. You want to find where the change in intensity changes direction, i.e., "zero crossings" of the derivatives of intensity values. Types of edges: ^ intensity | ----- --- /\ /\ I | | / / \ / \ -----> ---- --/ -- -- / \ linear step blurred peak roof difference step X Theory of edge detection: classic paper: Marr and Hildreth 1979 also other neurophysiology papers: Lettvin et al 1960: "bug detectors" in frog's eye Hubel and Wiesel 1962-68: mammalian vision (cats) Use Gaussian and Laplacian operators for edge detection Band pass filters for low level vision to filter spatial frequencies Finding zero crossings: Central aspect of Marr's theory ------------- Example: Use a simple edge mask |-1|-1|+1|+1| ------------- "Move" the mask across the image (discrete convolution) ^ I | ------ | | |------| | . -------------> Edge . X . ^ . dI/dX | . | = | =.= | = . = -====-----===> First spatial derivative . X . ^ . d2I/dX2 | . | | = . | = =. -===---=---=> Second spatial derivative .= = . = . zero crossing Even if edge is smeared and difficult to identify, the zero crossing is unique. Low frequency detector: High frequency detector: ------------------------- ------- |-1|-1|-1|-1|+1|+1|+1|+1| |-1|+1| ------------------------- ------- Averages over a larger range Looks at smaller range Less susceptible to local changes Picks up sharp local variations Exercise: Try convolving the following image using both detectors: ^ The high frequency detector | --| will pick up all the jagged I | | --| edges, whereas the low | | | frequency detector will | -----------| ---------- gradually rise and fall. | --| | --| |-----| -----------------------------------> X Convolution: It is easier (and equivalent) to convolve the derivative of the operator with the scene than to convolve the operator twice, one with the scene and then again with the derivative obtained. Derivative of operator: Convolve with itself ----------- ^ | -1 | +1 | | ----------- I | | ====== | = = -=======----=----=======----> X = = ====== ^ dI/dx | = "Mexican hat operator" | = = |-===------=-----=------===-> X | = = = = | = = | Marr-Hildreth operator: The three dimensional version of the above operator. Corresponds well with evidence from neurophysiology. Varying sizes of operators to get high-frequency and low-frequency responses, each double the size of the previous one. Each "channel" provides a band pass filter for a particular spatial frequency. Small (high frequency) filters: Less blurred; sharp, detailed images. But also pick up noise; extra "fake" edges. Wider (lower frequency) filters: Less noise, but image more blurred. Run all of these over the image to get the edges (zero crossings). Edge orientation: Orientation of the edge (in a 3-d scene) can be found by finding the maximum response of the 1-d operator. %---------------------------------------------------------------------------- So edge detection is done using discrete convolution of MH (Marr-Hildreth or "Mexican Hat") operators with the intensity patterns of the image. Different channel sizes are used, ranging from low frequency to high frequency. Primal sketch: Raw primal sketch: Build up edge segments by joining adjacent zero crossings for a given "channel", i.e., an operator with a certain size. This Full primal sketch Combine channels to get items (which cats seem to recognize directly too): - edges - bars (two close edges; disappear in large channels) - blobs (joinings of bars; relatively sharp curvature) - terminations (joinings of bars; relatively sharp curvature) Supported by evidence from: Optical illusions: - Cafe wall illusion: Zero crossings are trapezoidal - Mueller-Lyer line illusion: zero crossings are in fact unequal. (When the lengths look equal, the zero crossings are also equal). Neurophysiology Information theory Actual performance of programs Algorithm: gray level image | 1. Pass Marr-Hildreth operators of different sizes over the scene | zero crossings, several channels (3 or 5) | 2. Combine, via relaxation, information in each channel | several zero crossing plots, one per channel | 3. Join channels together, high frequency first | raw primal sketch | 4. Find bars, blobs, terminations from above | full primal sketch (2-d sketch) | 5. Compute surface directions using brightness, shading, smoothness, reflectance | 2-1/2-d sketch All this takes a lot of compute power, but it can be done. Now comes the hard step: | 6. Find objects | symbolic model of scene (3-d) %---------------------------------------------------------------------------- INTERMEDIATE LEVEL VISION Example: Line Labelling Assumptions: Only polyhedral objects (no curved surfaces); no more than three lines meeting at one point Fold: A line where two surfaces meet; both surfaces visible Blade: A line where two surfaces meet; only one surface visible Example: ---------- / /| / / | ---------- | | fold | | | | / | | / | |/ ---------- blade TASK: Given a set of edges, like in the image of a cube, determine which of them are folds and which are blades. Combinatorics of the task: Number of edges in a cube: 9 Number of ways to label each edge: 2 Total number of combinations: 2**9 %---------------------------------------------------------------------------- METHOD of CONSTRAINT PROPAGATION (Waltz) If only certain types of edge junctions are possible, then we can use knowledge of types of junctions to "constrain" interpretations of possible edge labels. Types of junctions between folds and blades: Two edges: L Three edges: Y, W, T Example: L ---------- W / /| / Y / | W ---------- | | | | L | | / | | / | |/ L ---------- W No T junctions are possible in the image of a (single) cube. Enumeration of all edge/junction combinations: f=fold b=blade L junctions: b \/ b; f \/ b; b \/ f Y junctions: f \/ f; b \/ b f| f| W junctions: f\f|f/ ; b\f|b/ \|/ \|/ T junctions: --|-- ; --|-- b b b b f b An additional rule: An edge next to background is a blade (backgrounds have no folds). Example: b ---------- b / /| / / | b ---------- | | | | b | | / | | / b | |/ ---------- b By propagating constraints from the W junction on the left b ---------- b / /| / f / | b ---------- | | | | b | | / | | / b | |/ ---------- b By propagating constraints from the W junction on the top-right b ---------- b / /| / f f / | b ---------- | | | | b | | / | | / b | |/ ---------- b By propagating constraints from the Y junction in the middle: b ---------- b / /| / f f / | b ---------- | | | | b | f| / | | / b | |/ ---------- b %---------------------------------------------------------------------------- In Waltz's original formulation: Types of junctions: Two edges: L Three edges: Arrow, Fork, T Four edges: X, Peak, K, Psi (we can ignore these in scenes consisting only of trihedral (3-faced) vertices) Example: > arrows (>, <, ^, v) mark boundary lines ------------| + marks convex interior edges / +/ |v - marks concave interior edges ^/ ------| | / +/| + | | ----/ |- +| / | + | | |/v ^| +| |-----/ | | / < ------/v < Combinatorics of the problem: Number of ways to label each line: 4 boundary (two directions), interior (convex, concave) Number of ways to label: L 4x4 = 16 Arrow, Fork, T 4x4x4 = 64 Total number of junction labels: 16+64+64+64 = 208 Actual number of types of junctions that can occur naturally: only 18 Example: Some junctions not found in scenes containing polyhedra with trihedral vertices Waltz's line labelling algorithm: Start with placing arrow labels clockwise along the border of the drawing. Next, label the shafts of the Arrow junctions whose barbs lie in the border. Then propagate constraints and label all the junctions (possible labels for each is constrained by labels assigned to neighboring junctions). There are so few natural labellings that this process is very efficient and requires almost no search. %---------------------------------------------------------------------------- Waltz showed that a combinatorially explosive problem (such as trying all possible assignments of labels to all edges in a large scene) could be done very easily and very efficiently by exploiting natural constraints. Why start from the boundaries? Because they add the most constraints. "Impossible" objects: Easy to detect through constraint propagation because no labelling is possible. Shadows and cracks: Need more types of vertices to deal with scenes with shadows and cracks, as well as scenes which allow different kinds of illumination. But these introduce more constraints that can be useful in disambiguating scenes. Waltz's set of junction labels: Vertex Number of Number of Ratio Ratio for type combinatorially physically (%) trihedral possible junctions possible junctions scenes L 2,500 80 3.2 37.5 Fork 120,000 500 0.4 7.8 T 120,000 500 0.4 6.2 Arrow 120,000 70 0.056 4.7 Psi 6,200,000 300 0.0048 K 6,200,000 100 0.0016 X 6,200,000 100 0.0016 Multi 6,200,000 100 0.0016 Peak 6,200,000 10 0.00016 Kk 310,000,000 30 0.0000096 Although the set is large, it is still manageable, and far more so than the unconstrained set. %---------------------------------------------------------------------------- HIGH LEVEL VISION Scene analysis (interpretation) Model-based vision uses stored geometric and topological models Why use a model? To generate expectations that can help interpret images Suppose that an image shows a ball with two holes in it, one clear, corresponding to the thumb-hole on a bowling ball, one blurred, corresponding to the middle-finger-hole on a bowling ball, (the ring-finger-hole may be invisible). This information may be enough to evoke the model of the bowling ball in memory. Once evoked, the model generates expectations and helps in the task of image interpretation. %---------------------------------------------------------------------------- Example: ACRONYM (Rod Brooks) Model Image ------------------------------ | Function <=> Function| | vv ^^ | | Object <=> Object | | vv ^^ | | Volume <=> Volume | Brooks (mostly top-down, some bottom-up) | vv --- ^^ ----- | vv | - ^^ ----- | Surface <=> || Surface | | vv || ^^ | | Shape <=> || Shape | | vv || ^^ | | Edge <=> || Edge | | vv || ^^ | | Pixel <=> || Pixel | Marr (bottom-up) ------------------ ---------- <=> denotes info-sharing, mostly left-to-right Uses a set of generalized cones or cylinders to represent primitive volume elements. Each is defined by: An axis A cross-section A sweeping rule These "volumes" are then combined to form representations of objects. Uses KRL (Knowledge Representation Language, a frame system language) to represent a hierarchy of objects whose shapes are defined using sets of generalized cones or cylinders. Algorithm: 1. Find lines 2. Link edges to find ribbons (parallel lines; bars) and ellipses (2-d). This step is model-directed. 3. Look for constraints to identify parts. Constraints are specified as part of the frame definitions of the objects, as above. 4. Look for constraints to identify objects (worked on electric motors; airplanes). Can hypothesize distance of camera from scene based on what it thinks it's looking at and how large the picture is. Efficient; can be used for assembly line robots. Speed comes from using higher level constraints to constrain considered features. Uses a 3-d model, which can be used in any orientation. %---------------------------------------------------------------------------- Copyright (c) Ashwin Ram, 1990-93 Assistant Professor, College of Computing Georgia Institute of Technology, Atlanta, Georgia 30332-0280 E-mail: