For this project, your group is to implement several sub-projects selected from the list included below. Each "option" has an attached number of points. You must implement a total of 6 points per person in your group; thus, for instance, a 4-person group must implement 24 points' worth of work from the list below. You will not be awarded more than the number of points possible for your group (except through the special extra-credit section); however, you may choose to implement additional sub-projects to increase your chances at extra credit or to guard against mistakes. All projects must maintain compatibility with the standard MIPS instruction set.
Near the end of classes, your group must give a final presentation. Your presentation should be 15 minutes long (20 minutes including questions). Everybody in your group must present; your individual grade will include your presentation. Since you will present before your final project is due, you will not be expected to present a completed, fully functional project. Good presentations (and write-ups, for that matter) will cover the specific sub-projects you chose to implement and how they affected your processor's performance. Detailed descriptions of your project datapath are not appropriate for a 15-minute presentation; however, high-level datapaths might be appropriate.
Your final lab write-up is due Friday 5/16, after the contest. Please make sure that your write-up is clear, concise, and complete. To receive full credit for a sub-project, you must include a discussion of how it affects the performance of your project. You will not get a chance to discuss your final project with your TA, so make sure that you effectively communicate the key points of your project. One good technique is to analyze the performance of select benchmarks with and without your new features.
This document contains only a brief summary of the projects. For more information about any of the specific sub-projects, please consult with your TA. All projects are weighted in accordance with their estimated difficulty. You should not implement a project unless it is relevant to your design (e.g., branch prediction is useless for the single-issue MIPS pipeline).
Important note: many of these options can get quite complicated. Thus, plan out a testing philosophy along with your architecture. One thing that you learned this term is how to make interesting tracer modules out of Verilog. Use this technique freely, since ASCII traces (possibly written to files) of pipeline state are likely to assist in debugging your design.
As you will see below, at minimum we require you to maintain a disassembly tracer as in previous labs. This required instruction tracer should only show committed instructions, i.e., the instructions that actually finish execution. Note that in this lab we now require that you include a cycle-count value in the tracer output (at the beginning of each output line). This will show exactly which cycle instructions finish on. Since it is part of the debugging code, all of the logic for the cycle counter can be included in the Verilog monitor code; it should be reset to zero on reset and increment with each clock cycle (since CPI is not necessarily 1, you will actually see the delay in your instruction traces).
Should you decide to use other tracer modules (say ones that watch individual reservation stations, etc), then you should also include cycle counts in those trace outputs as well. Each tracer module can write its output to an individual file, with the cycle counts helping you to figure out what order things happened in.
Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream. Among other things, make sure to include the current cycle count in your trace.
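As a concrete illustration, here is a minimal sketch of how such a cycle counter and trace output might live entirely inside a Verilog monitor module; the signal names (clk, reset, instr_committed, commit_pc, commit_instr) and the trace file name are placeholders, not requirements.

    // Cycle counter kept inside the (non-synthesized) monitor code.
    reg [31:0] cycle_count;
    integer    trace_file;

    initial trace_file = $fopen("commit_trace.txt");

    always @(posedge clk) begin
        if (reset)
            cycle_count <= 32'd0;               // reset to zero on reset
        else
            cycle_count <= cycle_count + 32'd1; // increment every clock cycle
    end

    // One line per committed instruction, prefixed with the cycle count.
    always @(posedge clk) begin
        if (!reset && instr_committed)
            $fdisplay(trace_file, "%0d: pc=%h instr=%h", cycle_count, commit_pc, commit_instr);
    end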
For more information, see the companion document on the handouts page (under Lab 7)
Out-of-order Execution [22]:
Implement a processor which uses the Tomasulo algorithm to perform out-of-order execution. Include at least 4 different units: an instruction fetch/branch unit, a load unit, a store unit, and an integer ALU. Note that the instruction fetch unit is tricky, since you need to deal with branches whose values may be coming from the broadcast results bus. Also, see if you can be more clever than we were in class about the broadcast results bus so that a chain of instructions that depend on each other can have an execution throughput of one per cycle. For more information, look at the end of chapter 6, which talks about the PowerPC 604 and shows a possible organization. Incidentally, if done properly, the store unit could serve as a write buffer....
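To make the bookkeeping concrete, here is a rough sketch of the state one reservation-station entry might hold and how it could snoop the broadcast results bus; the field widths and the signal names (cdb_valid, cdb_tag, cdb_data) are assumptions for illustration, not part of the requirements.

    // State for one reservation-station entry (widths are illustrative only).
    reg        busy;        // entry holds an instruction waiting to execute
    reg [5:0]  op;          // operation to perform
    reg [3:0]  q_j, q_k;    // tags of the producers of each operand (0 = value already present)
    reg [31:0] v_j, v_k;    // operand values, valid once the matching tag has cleared
    reg [3:0]  dest_tag;    // tag this entry drives onto the broadcast results bus when done

    // Snoop the broadcast results bus every cycle to capture pending operands.
    always @(posedge clk) begin
        if (busy && cdb_valid && (q_j == cdb_tag)) begin v_j <= cdb_data; q_j <= 4'd0; end
        if (busy && cdb_valid && (q_k == cdb_tag)) begin v_k <= cdb_data; q_k <= 4'd0; end
    end

    // The entry may request its functional unit once both operands are present.
    wire ready = busy && (q_j == 4'd0) && (q_k == 4'd0);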
Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream. Among other things, make sure to include the current cycle count in your trace. Suggestion: you might consider outputting two separate traces (to files): (1) a trace of dispatched instructions and (2) a trace of committed instructions (possibly triggered from the broadcast results bus). Only (2) is required, however.
Multiprocessing [18]:
Hook two of your processors together to implement a multi-processor. To preserve correct behavior, the system must implement cache coherency. This means that changes to the data cache on one processor must be visible to the other processor. One way to do this is with write-through at the first level and a second-level cache. To allow a thread to distinguish which one of the two processors it is running on, a multi-processing system should start one thread at address 0 and the other thread at the byte address 0x08.
Synchronization mechanisms: In order to handle synchronization between multiple processors, you must implement the following synchronization mechanism. Treat addresses in the range 0xFFFFFFE0 to 0xFFFFFFEF specially: these 16 addresses form 16 one-bit "synchronization variables", and loads and stores to these addresses should go to your synchronization module instead of to memory. Further, stores to synchronization variables have "fence" behavior, meaning that the processor performing the store to the synchronization variable must stall until the hardware can guarantee that all previous stores to memory are visible to the other processor. For example, a lock can be grabbed and released as follows:
Grablock:
    lw   $r1, -18($zero)        ; Try to grab synchronization variable #14
    bne  $r1, $zero, Grablock   ; Oops. Someone has it, wait..

    ; Do other stuff here....

Releaselock:
    sw   $zero, -18($zero)      ; Release the lock!
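For illustration only, here is one possible Verilog sketch of the synchronization module. It assumes test-and-set semantics for loads (a load returns the old value of the bit and then sets it to 1, while a store simply writes the low bit of the data), which is consistent with the Grablock/Releaselock example above; check the companion document for the exact required behavior, and note that the fence stall is not shown.

    module sync_vars (
        input         clk,
        input         reset,
        input         mem_read,     // load strobe from the memory stage
        input         mem_write,    // store strobe from the memory stage
        input  [31:0] addr,         // byte address of the access
        input  [31:0] write_data,
        output        is_sync,      // high when addr falls in 0xFFFFFFE0-0xFFFFFFEF
        output [31:0] read_data
    );
        reg  [15:0] bits;                  // the 16 one-bit synchronization variables
        wire [3:0]  index = addr[3:0];     // which variable is being addressed

        assign is_sync   = (addr[31:4] == 28'hFFFFFFE);
        assign read_data = {31'b0, bits[index]};

        always @(posedge clk) begin
            if (reset)
                bits <= 16'b0;                     // all locks start out free
            else if (is_sync && mem_read)
                bits[index] <= 1'b1;               // test-and-set: the load grabs the lock
            else if (is_sync && mem_write)
                bits[index] <= write_data[0];      // a store of 0 releases the lock
        end
    endmodule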
For more information, see the companion document on the handouts page (under Lab 7)
Multithreading [20]:
Multithreading uses two hardware contexts to handle delays in the system -- it will have 64 registers instead of the normal 32. Multithreading allows the system to immediately switch thread contexts on a cache miss to avoid the associated delay. Cache coherency is not an issue with multithreading because both executing threads use the same memory path; however, the caches should be able to handle reads at the same time as they are filling a cache line. To allow two threads to execute independently, a multi-threaded system should start one thread at address 0 and the other thread at the byte address 0x08.
Be very careful to properly squash instructions from each thread when switching from one thread to another.
You must also implement 16 synchronization variables just as above for Multiprocessing. One additional thing: a load to a synchronizing variable that returns "1" should cause a switch to the other thread, just like with a cache miss (since this corresponds to trying to get a lock that the other processor currently holds).
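As a small sketch of that switch condition (signal names such as dcache_miss, sync_load, and sync_read_data are assumed for illustration):

    // Switch hardware contexts on a data-cache miss, or when a load of a
    // synchronization variable returns 1 (the lock is already held).
    wire switch_thread = dcache_miss || (sync_load && sync_read_data[0]);

    reg active_thread;   // selects which half of the 64-entry register file is in use
    always @(posedge clk) begin
        if (reset)
            active_thread <= 1'b0;
        else if (switch_thread)
            active_thread <= ~active_thread;
    end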
Make sure to update your instruction tracer unit from previous labs to give useful information about the two instruction streams (independent threads). To avoid confusion, you might output each of the traces to separate files. Among other things, make sure to include the current cycle count in your traces.
For more information, see the companion document on the handouts page (under Lab 7)
Deep pipelining [18]:
Deep pipelining breaks the pipeline up into more than five stages, decreasing the cycle time of your processor. Lengthening the pipeline, however, will increase the difficulties that hazards cause in the system. Some components, such as the ALU, will have to be broken up into two stages (you will have to implement it using individual gates). To get full credit for this part, you need to implement at least 8 stages: 2 IF, 1 DEC, 2 EXE, 2 MEM, and 1 WB (see also 'not-so-deep' pipelining below). Further, the cycle time of your processor must be 20 ns or lower (50 MHz or higher).
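As one hedged example of splitting work across the two EXE stages, the ALU's adder could compute the low 16 bits in the first EXE stage, latch the carry, and finish the high 16 bits in the second EXE stage; the register and signal names below are illustrative only.

    // EXE1: add the low halves of operands a and b; latch the carry and the high halves.
    reg  [15:0] sum_lo_q, a_hi_q, b_hi_q;
    reg         carry_q;
    always @(posedge clk) begin
        {carry_q, sum_lo_q} <= a[15:0] + b[15:0];
        a_hi_q <= a[31:16];
        b_hi_q <= b[31:16];
    end

    // EXE2: add the high halves plus the saved carry to form the full 32-bit sum.
    wire [31:0] sum = {a_hi_q + b_hi_q + carry_q, sum_lo_q};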
Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream. Among other things, make sure to include the current cycle count in your trace.
Branch Prediction [8]:
Branch prediction attempts to reduce the branch penalty for super-scalar, out-of-order, or deep-pipelining designs. When a branch is mis-predicted, instructions in the pipeline will have to be squashed and correct execution resumed. Typically, branches are predicted based on the address of the branch instruction.
Note that you need to think carefully about what it means to have a predicted instruction stream and how you will recover from a mis-prediction.
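For instance, a simple table of 2-bit saturating counters indexed by the branch address might look like the sketch below; the 64-entry table size and the signal names are assumptions, and initialization of the counters (e.g., clearing them on reset) is omitted.

    // 64-entry branch history table of 2-bit saturating counters.
    reg [1:0] bht [0:63];

    // Predict at fetch time: the counter's high bit means "predict taken".
    wire [5:0] fetch_idx     = fetch_pc[7:2];        // word-aligned PC bits
    wire [1:0] fetch_counter = bht[fetch_idx];
    wire       predict_taken = fetch_counter[1];

    // Update when the branch resolves, saturating at 00 and 11.
    wire [5:0] resolve_idx = branch_pc[7:2];
    always @(posedge clk) begin
        if (branch_resolved) begin
            if (branch_taken && bht[resolve_idx] != 2'b11)
                bht[resolve_idx] <= bht[resolve_idx] + 2'b01;
            else if (!branch_taken && bht[resolve_idx] != 2'b00)
                bht[resolve_idx] <= bht[resolve_idx] - 2'b01;
        end
    end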
Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream. Among other things, make sure to include the current cycle count in your trace. Your instruction tracer should only show committed instructions (i.e. not the squashed ones). A useful debugging option would be to make another tracer which showed instructions as they were dispatched -- this would show more instructions than your primary instruction tracer.
Load-Value Predictor [8]:
Load-value predictors attempt to predict the result of a LW instruction (based on the instruction address). This prediction allows execution to continue for a short time using the predicted value. Meanwhile, the load value is fetched from memory and compared to the predicted value. On a mis-prediction, instructions in the pipeline must be squashed and execution restarted with the correct value.
The simplest type of load-value predictor is a "last-value" predictor, which keeps a table of instruction addresses for loads as well as the previous value actually loaded. Further, a small "confidence" counter (2 bits) is kept for each entry. To predict, we use the instruction address of the load to index into the table and predict the previous value as long as the value of the confidence counter is high enough (say at least "2"). When the load is actually resolved, we check to see if (1) it is in the table and (2) whether we predicted the right thing or not. If it is in the table and we predicted the right thing, we increment the confidence (without overflowing). If in the table and we didn't predict the right thing, then we decrement the confidence (without going below zero).
Note that you need to think carefully about what it means to have a predicted value in your pipeline and how you will recover from problems. Also, some sort of tracer output which indicates which loads are predicted (and whether those predictions turn out to be correct) will be very useful for debugging.
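A sketch of such a last-value predictor appears below. The 16-entry direct-mapped, PC-tagged table organization, the allocation policy for loads not yet in the table, and the choice to update the stored value on a mis-prediction are all assumptions made for illustration; signal names are placeholders.

    // Last-value predictor: tag, last value, and 2-bit confidence per entry.
    reg [31:0] lvp_tag   [0:15];
    reg [31:0] lvp_value [0:15];
    reg  [1:0] lvp_conf  [0:15];

    // Prediction: use the stored value when the tag matches and confidence >= 2.
    wire [3:0]  p_idx           = load_pc[5:2];
    wire        lvp_hit         = (lvp_tag[p_idx] == load_pc);
    wire        lvp_predict     = lvp_hit && (lvp_conf[p_idx] >= 2'd2);
    wire [31:0] predicted_value = lvp_value[p_idx];

    // Update when the load resolves with the actual value from memory.
    wire [3:0] r_idx = resolve_pc[5:2];
    always @(posedge clk) begin
        if (load_resolved) begin
            if (lvp_tag[r_idx] == resolve_pc) begin
                if (actual_value == lvp_value[r_idx]) begin
                    if (lvp_conf[r_idx] != 2'b11)
                        lvp_conf[r_idx] <= lvp_conf[r_idx] + 2'b01;  // correct: bump confidence
                end else begin
                    if (lvp_conf[r_idx] != 2'b00)
                        lvp_conf[r_idx] <= lvp_conf[r_idx] - 2'b01;  // wrong: lose confidence
                    lvp_value[r_idx] <= actual_value;                // remember the new last value
                end
            end else begin
                lvp_tag[r_idx]   <= resolve_pc;                      // allocate a fresh entry
                lvp_value[r_idx] <= actual_value;
                lvp_conf[r_idx]  <= 2'b00;
            end
        end
    end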
Jump Target Predictor [8]:
Similar to a branch predictor and a load-value predictor, a jump target predictor predicts the destination of JR instructions based on the address of the JR instruction itself. If the destination of the jump is mis-predicted, then instructions in the pipeline must be squashed (similar to branch prediction).
Reorder buffer [12]:
Implement a reorder buffer of at least 8 instructions. This option makes sense to combine with a number of the other options here. It must be the case that you have precise exceptions as a result (i.e., that you could stop the system and easily recover good state). It would certainly help with the out-of-order execution model or the non-blocking-load option. It might also help with branch prediction, load-value prediction, etc. Even in the non-out-of-order pipeline, it could give you a longer time to let your prediction take effect; i.e., you might predict a load on a cache miss and let execution continue. The number of entries in your reorder buffer would directly affect the number of instructions after the load that you could let execute before you absolutely had to know whether the prediction was correct or not...
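A sketch of what one 8-entry reorder buffer might store is given below; the field names and widths are illustrative, not a required format.

    // Per-entry state for an 8-entry reorder buffer.
    reg        rob_valid     [0:7];   // entry is allocated
    reg        rob_done      [0:7];   // result has arrived; entry may commit at the head
    reg [31:0] rob_pc        [0:7];   // handy for the committed-instruction trace
    reg  [4:0] rob_dest      [0:7];   // architectural register written at commit
    reg [31:0] rob_result    [0:7];   // value written at commit
    reg        rob_exception [0:7];   // held until commit so exceptions stay precise

    reg [2:0] rob_head, rob_tail;     // commit from the head, dispatch at the tail

    // In-order commit: only the head entry may commit, and only once it is done.
    wire can_commit = rob_valid[rob_head] && rob_done[rob_head];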
Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream. Among other things, make sure to include the current cycle count in your trace. The trace should show committed instructions. One useful debugging option would be to make another tracer which shows dispatched instructions (as they enter the reorder buffer).
Non-Blocking Loads [8]:
Non-blocking loads allow the pipeline to continue execution during a data-cache miss until the fetched data is actually needed. This can be accomplished by adding a full/empty bit for every register in the register file, and keeping a small table (2-4 entries) for outstanding loads. The entries in this table are called Miss-Status Handling Registers (MSHRs). Each MSHR is used to keep information about one outstanding cache miss (for a load). An MSHR contains (1) a valid bit indicating the MSHR is in use, (2) the address that has missed in the cache, and (3) the register that the result must go back to. Initially the full/empty bits for all registers are set to "full" (this is a reset condition) and all of the MSHRs are set to invalid.
When a load reaches the memory stage, you check to see if it misses in the cache. If so, you see if you have a free MSHR and stall if all of them are in use. If you have a free MSHR, you enter the info for the load into the MSHR, and set the F/E bit for the destination register to empty. Now, the simplest thing to do is flush all following instructions and restart fetching at the instruction after the load. The reason is as follows. In the decode stage, you need to check to see if you are decoding an instruction that has its F/E bit set to empty; if so, you stall until the F/E bit becomes full again. Make sure you understand WAR and WAW hazards for loads. Data returning from memory checks the MSHR and updates the appropriate F/E bit, register, and MSHR (making it invalid).
Note that there are more clever ways to implement blocking (without the F/E bits): hint - the only registers that could possibly be empty are in the MSHRs.
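For concreteness, a sketch of the MSHR state and the decode-stage stall check is shown below; the two-entry size, field widths, and decode signal names (rs, rt, dest) are assumptions.

    // Miss-Status Handling Registers plus per-register full/empty bits.
    reg        mshr_valid [0:1];   // (1) entry in use
    reg [31:0] mshr_addr  [0:1];   // (2) address that missed in the data cache
    reg  [4:0] mshr_dest  [0:1];   // (3) register the returning data must be written to

    reg [31:0] reg_full;           // one full/empty bit per register; reset to all "full" (ones)

    // Decode stalls whenever a source (or destination) register is still empty.
    wire stall_decode = ~reg_full[rs] || ~reg_full[rt] || ~reg_full[dest];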
Victim Cache [4]:
Implement a Victim Cache which contains 4 cache-lines worth of information, and which sits between the data cache and the DRAM.
Stream Buffer [4]:
Use a small FIFO of 2 cache lines to prefetch instructions from memory. On an instruction cache miss, you should go ahead and fill the missing line in the cache, then proceed to start fetching additional lines, which you place into the stream buffer. Then, on subsequent misses, you check the stream buffer to potentially fetch data much more quickly. Note that one possible implementation would take advantage of burst-mode DRAM operation to fill the stream buffer more quickly...
Multiplier/Divider [4]:
Implement a combined multiplier/divider unit that handles both signed and unsigned multiplies/divides. As in the extra credit to Lab 5, this should be implemented as a coprocessor so that the actual instructions don't block the pipeline. Only attempts to execute mflo/mfhi instructions should block if the operation is not yet ready.
Second-level Cache [4]:
Implement a single second-level cache with 64K of data and 64-byte cache lines.
One test program will be released early on, to allow you to target your optimizations. The other program, a mystery program, will not be released until a few days before the project deadline. A good understanding of the different project options will help you decide in advance which solutions will have the greatest impact on performance. You may change the order of instructions in the programs, if you desire, to take advantage of your specific processor implementation; adding or removing instructions other than NOPs (e.g., unrolling loops) is not allowed. A separate, yet functionally similar, program will be provided for multi-threaded and multi-processing projects. Single-threaded and multi-threaded projects will be judged together, but the multi-threaded version of the programs will include synchronization and communication overhead.
Increasing the amount of first-level cache in your system is not allowed: the total amount of first-level cache in your system may not exceed 128 words per processor (a total of 256 words is allowed for a multi-processor design). The 128-word limit is derived from Lab 6, which has 64 words each for the I and D caches; although you may not exceed 128 words total, you may redistribute the memory if you wish (e.g., 32 words of instruction cache and 96 words of data cache).
To aid in determining when the test programs are done executing, we will place an undefined instruction at the end. Your Verilog controller should assert an error when it tries to decode an undefined instruction. The time to completion for the program, therefore, will be the time at which this undefined instruction is decoded.