The question we're facing is "How to pipeline it?" If we just cut all the links between stages and put registers in them, we get garbage!
So how to construct a valid pipeline? Here are a few rules:
Rule #0: You have to start somewhere, so start with the outputs (or you can start with the inputs)!
Rule #1: We can duplicate pipeline banks safely.
Rule #2: We can "slide" registers from all outputs to all inputs of a unit safely.
Finally:
Note that we have to have the same number of registers along every path from input to output!
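The "same number of registers along every path" condition can be checked mechanically. Below is a minimal sketch, assuming the circuit is an acyclic graph given as a list of (source, destination, registers-on-that-wire) triples. The function name pipeline_depth and the tiny example circuit (one input x, two units F and G) are made up for illustration.

```python
def pipeline_depth(edges, inputs, outputs):
    """Return the register count shared by every input-to-output path,
    or None if two paths disagree (i.e. the pipeline is not well formed).

    edges: list of (src, dst, registers_on_wire); assumed to be acyclic.
    """
    incoming = {}
    for src, dst, regs in edges:
        incoming.setdefault(dst, []).append((src, regs))

    seen = {}  # node -> set of register counts over all paths reaching it

    def counts_at(node):
        if node not in seen:
            if node in inputs:
                seen[node] = {0}
            else:
                counts = set()
                for src, regs in incoming.get(node, []):
                    counts |= {c + regs for c in counts_at(src)}
                seen[node] = counts
        return seen[node]

    all_counts = set()
    for out in outputs:
        all_counts |= counts_at(out)
    return all_counts.pop() if len(all_counts) == 1 else None


# Hypothetical circuit: x feeds F and also bypasses F to feed G directly.
# A register only on the F->G wire leaves the bypass path one register short.
bad = [("x", "F", 0), ("F", "G", 1), ("x", "G", 0)]
# Adding a register on the bypass wire balances the two paths.
good = [("x", "F", 0), ("F", "G", 1), ("x", "G", 1)]

print(pipeline_depth(bad, {"x"}, {"G"}))
print(pipeline_depth(good, {"x"}, {"G"}))
```

The first circuit is the "garbage" case: G would combine a value of x from one clock cycle with a result computed from the previous one.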
Alternatively (to the three rules above), we can use the Cheaters' Rule: Redraw the circuit!
Then, it becomes obvious how to pipeline the circuit.
After we've pipelined a circuit we need to figure out how quickly we can run the thing.
Latency = time to compute one result
Throughput = number of results per unit time
Imagine the following circuit:
Its latency is 250 and its throughput is 1/250.
1-pipeline:
Latency is 260, throughput is 1/260. This already made latency worse and lowered the throughput at the same time!!!
2-pipeline:
Tclock = 160, Latency = 320, Throughput = 1/160. Latency got even worse, but throughput is increasing.
3-pipeline: (the best we can do here)
Tclock = 110, Latency = 330, Throughput = 1/110. Latency worse, throughput
increased.
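The numbers above follow from two formulas: Tclock is the slowest stage plus the register overhead, and latency is the number of stages times Tclock. The sketch below assumes a register overhead of 10 (which is what the 250 -> 260 step implies) and assumes stage delays of 150/100 and 100/100/50 for the 2- and 3-stage versions. The original figure is not reproduced here, so those partitions are a guess that happens to match the quoted numbers.

```python
def pipeline_timing(stage_delays, register_overhead):
    """Return (t_clock, latency, throughput) for a simple linear pipeline."""
    t_clock = max(stage_delays) + register_overhead  # slowest stage sets the clock
    latency = len(stage_delays) * t_clock            # one clock period per stage
    throughput = 1 / t_clock                         # one result per clock period
    return t_clock, latency, throughput


OVERHEAD = 10  # assumed register overhead, inferred from 250 -> 260

# Assumed partitions of the 250-unit circuit (not taken from the figure).
for stages in ([250], [150, 100], [100, 100, 50]):
    t_clock, latency, throughput = pipeline_timing(stages, OVERHEAD)
    print(len(stages), "stage(s): Tclock =", t_clock,
          "latency =", latency, "throughput = 1/%d" % t_clock)
```

Because every stage pays for the slowest one plus the register overhead, adding stages can only push the latency up, while the throughput follows 1/Tclock.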
A major conclusion from this exercise is: When pipelining, latency only gets worse!
So is a pipelined circuit better than its vanilla version?
What do we do if we have pipelined the obvious stages, yet, we still
want to run the circuit at an even higher clock rate? We can:
If we duplicate the "100" stage from the example above, we get something like this:
The FSM at the bottom controls which copy of the hardware works in which clock cycle; the two copies alternate under its control. The FSM also selects which copy's output the MUX passes on as the output of the circuit. In effect, this halves the rate at which each copy has to operate. Note, however, that the MUX adds some latency overhead of its own.
For the above duplicated piece of hardware we can make Tclock = 60.
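One way to see where 60 comes from: a stage with N interleaved copies gets N clock periods to finish each result. The sketch below is a simplified model and not a statement about the original figure. It assumes a register overhead of 10, ignores the MUX delay, and assumes the rest of the circuit is cut into stages of delay 50.

```python
def clock_with_duplication(stages, register_overhead):
    """Minimum clock period when some stages are duplicated and interleaved.

    stages: list of (delay, copies). Each copy of a stage has `copies`
    clock periods to produce its result. MUX delay is ignored here.
    """
    return max((delay + register_overhead) / copies for delay, copies in stages)


OVERHEAD = 10  # assumed register overhead

# Assumed partition: one 100-unit stage and three 50-unit stages.
plain = [(100, 1), (50, 1), (50, 1), (50, 1)]
duplicated = [(100, 2), (50, 1), (50, 1), (50, 1)]

print(clock_with_duplication(plain, OVERHEAD))
print(clock_with_duplication(duplicated, OVERHEAD))
```

Under these assumptions the duplicated stage stops being the bottleneck, and the 50-unit stages plus the register overhead set the clock.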
Macro-pipelining
Wholesale pipelining -- pipelining in which the stages are whole microprocessors. Because of the possible hazards associated with the timing of reads and writes to the scratch RAM, a technique called "double buffering" is usually used. Two buffers are used, one being produced while the other is being consumed. Once one has been consumed (and is free) and the other has been produced (and is full), the buffers switch roles, ad infinitum.
This technique is fundamental in macro-pipelining.
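A minimal software sketch of double buffering, with two threads standing in for two pipeline stages. The function name run_double_buffered and the use of one "free" and one "full" semaphore per buffer are choices made for this illustration only.

```python
import threading


def run_double_buffered(items, consume):
    """Pass items from a producer thread to a consumer thread via two buffers."""
    buffers = [None, None]
    # free[i]: buffer i may be overwritten; full[i]: buffer i holds fresh data.
    free = [threading.Semaphore(1), threading.Semaphore(1)]
    full = [threading.Semaphore(0), threading.Semaphore(0)]
    done = object()  # end-of-stream marker
    results = []

    def producer():
        for n, item in enumerate(list(items) + [done]):
            slot = n % 2              # alternate between the two buffers
            free[slot].acquire()      # wait until the consumer is finished with it
            buffers[slot] = item
            full[slot].release()      # hand the filled buffer over

    def consumer():
        n = 0
        while True:
            slot = n % 2
            full[slot].acquire()      # wait for the producer to fill this buffer
            item = buffers[slot]
            if item is done:
                break
            results.append(consume(item))  # producer may be filling the other buffer
            free[slot].release()      # give the emptied buffer back
            n += 1

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_double_buffered(range(5), lambda x: x * x))
```

With only two buffers the producer can never run more than one item ahead of the consumer, which keeps the latency between the two stages bounded.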
Goal:
Capture NTSC video from a camera, compress it, and send it over the net.
Software Architecture:
Three software stages, pipelined to work in parallel.
1st stage: V4L kernel video-capture driver -- uses DMA to capture frames from the video device and dump them into a buffer of memory.
2nd stage: Compression code -- reads the captured frames and compresses them (cheesy, quick-and-dirty homebrew compression).
3rd stage: Just writes the data to a socket.
The V4L kernel driver uses DMA transfers from the capture device and double buffering to pass data to the user-space reader. This allows the two stages to really work in parallel. The reader (compression code) always has one buffer outstanding with a request for a frame to the driver, while the code is compressing the other buffer (double buffering). The use of DMA (by V4L) frees the CPU for the compression code.
Writing the compressed data to a socket turns out to be trouble because the semantics of the send() call are synchronous. To work around the problem, the network send (third stage) is separated into a thread of its own so that it can block on send() without locking up the whole application. The compressed data is passed from the compressor to the net stage with double buffering again.
If instead of using double buffering and pipelining we just put queues
in between stages then latency goes south and the code is unusable.