McDonalds Drive-through Architecture

[For those who don't know, some McDonalds have a new setup where two ordering stations merge into the drive-through, so two people can order at a time.]

It seems to me that McDonalds could use a serious lesson in pipeline architecture.

They started out well. They recognized that their classic 2-stage drive-through pipeline (order, execute) could be optimized by extending it to a 3-deep pipeline, splitting the execute stage into pay and pickup stages, since the execute stage typically had a higher latency than the order stage (assuming well-behaved instructions whose operands are known at order time, thus avoiding an order stall condition). So they had a high-performance 3-stage pipeline (order, pay, pickup), which worked rather well in my humble opinion.

But they couldn't leave well enough alone. They decided to get really fancy with a full-blown out-of-order machine (apart from the catastrophe-avoidance out-of-order "penalty box" for inordinately high-latency instructions, which has existed for as long as I can remember). Now they have a 2-wide order unit feeding the rest of the pipeline, allowing them to process the order stage of 2 instructions simultaneously (I guess they thought that the order stage had become the critical latency in the pipe by a large enough amount to warrant such an architectural change...).

The problem is that the execution engine cannot keep up with a 2-wide front end. Invariably the pipeline stalls in execution and leaves instructions which have already completed the order stage stuck in the pipe, preventing subsequent instructions from entering. With the older system, during peak processing times the execution engine already had trouble matching the throughput of a single order unit, making any benefit from a higher-throughput front end minuscule at best during high-load times.
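The bottleneck argument above can be sketched with a toy model. This is only an illustration under stated assumptions, not a measurement of any real drive-through: the function name `simulate`, the stage times, and the "no buffering between order and pay" rule are all hypothetical choices made for the example. It assumes a permanently full queue (peak load), one or two order stations, and a single in-order pay/pickup lane.

```python
def simulate(num_cars, order_units, order_time, exec_time):
    """Return the time at which the last car leaves the pickup window.

    Hypothetical model: the queue is always full, there are `order_units`
    parallel order stations, and one pay/pickup lane serves cars in
    arrival order, one at a time. A car blocks its order station until
    the pay/pickup lane can accept it (no buffer in between).
    """
    order_free = [0] * order_units  # time each order station next frees up
    exec_free = 0                   # time the pay/pickup lane next frees up
    finish = 0
    for _ in range(num_cars):
        # The next car takes whichever order station frees up first.
        station = min(range(order_units), key=lambda k: order_free[k])
        order_done = order_free[station] + order_time
        # The car stalls at the station until the execution lane is free.
        start_exec = max(order_done, exec_free)
        finish = start_exec + exec_time
        exec_free = finish
        # The station is released only once the car has moved forward.
        order_free[station] = start_exec
    return finish


if __name__ == "__main__":
    # Made-up stage times, in seconds, purely for illustration.
    for label, order_t, exec_t in [("execute-bound", 30, 60),
                                   ("order-bound", 60, 30)]:
        one_wide = simulate(100, 1, order_t, exec_t)
        two_wide = simulate(100, 2, order_t, exec_t)
        print(f"{label}: 1-wide={one_wide}s  2-wide={two_wide}s")
```

In this model, when the pay/pickup stage is the slower one, the second order station changes nothing about when the last car leaves; it only changes where the cars sit while they wait. The wider front end helps only in the case where ordering is the slower stage.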
In an almost empty system (perhaps that is the condition they are designing for) there may be a moderate improvement in throughput, but I can't imagine the overall benefit being worthwhile.

Another problem is the lack of an arbitration mux to schedule instructions from the front end into the execution pipeline. The combination of a lack of collision-avoidance logic with the fact that there are almost always 2 instructions stalled and waiting to enter the execution pipeline is a fault waiting to happen... Additionally, the absence of scheduled control of entry for instructions from the front end into the execution pipe prevents the instruction queue from being able to correlate operands with instructions in the (frequent) event of a collision. This results in an increased risk of instructions executing with other instructions' operands and an added "confirm" logic latency in the pay pipestage.

All in all, the new design strikes me as an overly ambitious attempt at optimization which has resulted in a drive-through model that offers no significant benefit over the simpler model but plenty of drawbacks.

Of course, the analogy does have (at least) one gaping hole. In a CPU pipeline, an instruction doesn't get irritated when 3 other instructions which were issued behind it end up executing before it, dramatically increasing its latency through the system. *grumble*

I'd be really interested in seeing some system latency, throughput, and error rate data comparing the old system to the new. I can't imagine the new system shows any real benefit (although maybe I just don't have a wild enough imagination). I suspect it is simply an overarchitected mess. I hope they have someone doing performance validation on it.

Jim Vaught (jvaught@ichips.intel.com)