arch-beer

Weekly Reading
 


Sam is presenting


Samantika Subramaniam and Gabriel H. Loh
"Fire-and-Forget: Load/Store Scheduling with No Store Queue at All"
MICRO 2006
PDF copy (accessible within GT network only)


Modern processors use CAM-based load and store queues (LQ/SQ) to support out-of-order memory scheduling and store-to-load forwarding. However, the LQ and SQ scale poorly for the sizes required for large-window, high-ILP processors. Past research has proposed ways to make the SQ more scalable by reorganizing the CAMs or using non-associative structures. In particular, the Store Queue Index Prediction (SQIP) approach allows load instructions to predict the exact SQ index of a sourcing store and access the SQ in a much simpler and more scalable RAM-based fashion. The reason why SQIP works is that loads that receive data directly from stores will usually receive the data from the same store each time. In our work, we take a slightly different view on the underlying observation used by SQIP: a store that forwards data to a load usually forwards to the same load each time. This subtle change in perspective leads to our “Fire-and-Forget” (FnF) scheme for load/store scheduling and forwarding that results in the complete elimination of the store queue. The idea is that stores issue out of the reservation stations like regular instructions, and any store that forwards data to a load will use a predicted LQ index to directly write the value to the LQ entry without any associative logic. Any mispredictions/misforwardings are detected by a low-overhead pre-commit re-execution mechanism. Our original goal for FnF was to design a more scalable memory scheduling microarchitecture than the previously proposed approaches without degrading performance. The relative infrequency of store-to-load forwarding, accurate LQ index prediction, and speculative cloaking actually combine to enable FnF to slightly outperform the competition. Specifically, our simulation results show that our SQ-less Fire-and-Forget provides a 3.3% speedup over a processor using a conventional fully-associative SQ.
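
As a quick primer before the discussion, here is a very rough Python sketch of the forwarding idea. It is not the paper's implementation: the names and structures (LoadQueueEntry, FnFPredictor, issue_store, precommit_check) are invented purely to make the mechanism concrete. A store consults a PC-indexed predictor for the LQ index of the load it last fed, writes its value straight into that entry with no associative store queue, and a cheap pre-commit re-execution catches any misforwarding.

from dataclasses import dataclass

@dataclass
class LoadQueueEntry:
    pc: int                   # load's program counter
    addr: int = 0             # resolved effective address
    value: int | None = None  # data the load obtained (cache fill or forwarded store)
    forwarded: bool = False   # set if a store wrote this entry speculatively

class FnFPredictor:
    """PC-indexed table mapping a store PC to the LQ index it last forwarded to."""
    def __init__(self) -> None:
        self.table: dict[int, int] = {}

    def predict(self, store_pc: int) -> int | None:
        return self.table.get(store_pc)

    def train(self, store_pc: int, lq_index: int) -> None:
        self.table[store_pc] = lq_index

def issue_store(store_pc: int, value: int,
                lq: list[LoadQueueEntry], pred: FnFPredictor) -> None:
    """Store issues from the reservation stations like any other instruction.
    If the predictor names a consumer load, the value is written straight into
    that LQ entry; there is no store queue and no associative search."""
    idx = pred.predict(store_pc)
    if idx is not None and idx < len(lq):
        lq[idx].value = value
        lq[idx].forwarded = True
    # Otherwise the store simply updates the cache at commit.

def precommit_check(entry: LoadQueueEntry, memory: dict[int, int]) -> bool:
    """Low-overhead re-execution before commit: reload the address and compare.
    A mismatch flags a misprediction/misforwarding and triggers recovery."""
    return entry.value == memory.get(entry.addr, 0)

The real design does all of this in hardware, of course; the point of the sketch is only that the forwarding path becomes a RAM write indexed by a prediction rather than a CAM match against a store queue.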