From kenmac Wed Mar 26 10:15:49 -0500 2003 From: "Kenneth M. Mackenzie" To: gtg824c@prism, gtg546c@prism, gtg445d@prism, gte667i@prism, gtg251f@prism, gte389w@prism, gte009y@prism, gte489q@prism, gte184x@prism, gtg119d@prism, gte455u@prism, gte959h@prism, gte768q@prism, gte855t@prism, gte537p@prism, gte342k@prism, gtg993b@prism, gte336w@prism, gtg784c@prism, gtg128j@prism, gte990w@prism, gtg455c@prism, gte334i@prism, gte309s@prism, gte103i@prism, gte602j@prism, gtg018k@prism, gtg669c@prism, gtg985b@prism, gtg734c@prism, gtg643d@prism, gte640j@prism, gtg761i@prism, gtg664d@prism, gte227u@prism, dzook, gtg141a@prism, gte996v@prism, gtg281a@prism, gte910w@prism, gte859v@prism, gte363v@prism, gte079q@prism, gte978m@prism, gte757u@prism, gte394n@prism, gte763n@prism, gte089p@prism Subject: prj2-note1 Reply-to: kenmac@cc.gatech.edu Hi all, I've had many questions about how to count the number of bits in the predictors in order to estimate the time. Here's a attempt to satisfy all those questions and we can pursue it further in class this afternoon if need be. I'll save this note on the project2 web page as well. Two points here: (1) descriptions of the predictor options and (2) how to find nbits. -------------------- 1. predictor options -------------------- o simplescalar's "bimodal" predictor uses two bits per entry (the usual saturating counter scheme) and you specify the number of entries. --- | | | | log2(nentries) bits from PC ---->| | | | | | --- 2 x nentries array o The "2lev" predictor uses a two tables (figure 6.2 in Burger97): a pattern history table and then a table of 2-bit predictors just like the bimodal above. The idea is to use something other than the plain PC to index into the table of 2-bit predictors. You can configure this structure to represent multiple forms of predictors: GAg: Called a "global" predictor in class: The pattern history has exactly one entry with hist_size bits in it. It remembers the result (taken/not taken) of the last hist_size branches anywhere in the program. PAg: Called a "local" predictor in class: The pattern history has l1_size entries, indexed by log2(l1_size) bits from the PC. The idea is to remember the result of of *particular* branches in the program. there are a couple of other possibilities with this two-level structure as well. We didn't talk about two of them! ghare: Mentioned in class. It's like the global predictor (keeps 1 entry of hist_size bits of history), but then it XORs those bits with hist_size bits taken from the PC before indexing into the second-level table. This one works quite well -- it was the best performer in Emer's paper. GAp: not mentioned in class This one uses hist_size bits of global history and then *appends* additional bits from the PC in order to construct the index into the l2 table (the table of 2-bit predictors). You specify this structure to simplescalar in an odd way: you make l2_size *bigger* than 2^hist_size. I.e. for GAg, l2_size = 2^hist_size because you have hist_size bits from the first level and you're using it to index into the second level. If you make l2_size bigger than that, it implies that extra index bits have to come from somewhere. They come from the low bits of the PC. PAp: also not mentioned in class PAp is the local-history equivalent of GAp: it has a multi-entry first-level table that provides hist_size bits. The l2_size is made bigger than 2^hist_size so it need additional index bits for the second level index and it picks those up from the PC. Table 2 indicates that for PAp, l2_size has to be exactly 2^(l1_size + hist_size). I see no reason for that to be true -- like GAp, it seems like l2_size could be any power of two bigger than 2^hist_size and it would make sense. This restriction may be another wierdo restriction in simplescalar or maybe just an assumption in Burger97 that's not true in the code. o Finally, the "comb" predictor is (more or less) the Alpha 21264 predictor described in class. It makes *two* predictions, one using the simple bimodal technique and the other using a two-level predictor. The output of those two predictors are then multiplexed to give the ultimate output. The mux requires a selector input. If you recall from the 21264 discussion (in slides or in Emer's paper), this selector input comes from a *third* predictor: a predictor to predict which of the other two predictors is going to be correct. The simplescalar folks call the storage for this third predictor the "meta-table". It's 2-bits wide and you set the number of entries with the argument to -bpred:comb o In summary, there's a lot of predictor options. The 21264 predictor and the gshare predictor are known to be good but even within those there are a number of options for tuning sizes. If I read Burger97 correctly, it seems that you can use gshare as one of the inputs to the combined predictor (where the 21264 used a local ("PAg") predictor). ------------------ 2. computing nbits ------------------ The timing model is crude ... much in the flavor of Project 1. It's based on the idea that storage is responsible for the bulk of the delay. Other features are added as a penalty atop the storage cost. o Basic idea: sum *all* the bits of storage used in the branch predictor, assume it's one big square array and compute its access time. Ignore the topology entirely at this point. - for the bimodal predictor, there's just one array - for the 2-level predictor, there are two (l1, l2) - for the combining predictor, there are four arrays: one for the bimodal part, two more for the 2-level part and a fourth for the meta table. add 'em all up and convert to pS using the same model as project 1: 2 * 2ps * sqrt(nbits). o Having done that, we deal with topology by adding 10%, cumulative penaties for each goodie you add. 1.1x for two-level 1.1x for using an XOR (in gshare) 1.1x for using a combining predictor. I.e, - a simple bimodel predictor would have no penalty. Just the time implied by the nbits calculation - a 2-level local (PAg) would have a 1.1x penalty - gshare would have a 1.1x1.1 (= 1.21x) penalty: it's two-level and uses an XOR - 21264-type combining would have a 1.1x1.1 penalty: it uses the two-level trick and a combining mux. - combining with embedded gshare would have 1.1x1.1x1.1 penalty! ---------------- Hopefully this info helps. I can draw pix on the board in class. I believe the numbers in the project are such that the branch predictors are *not* the bottleneck in setting Tclock. You still have to prove that, though. Also, the many variations make them interesting to tune. -- Ken