Lab #7
Final Project

CS152 - Computer Architecture
Spring 2003, Prof John Kubiatowicz

There is no homework with this lab.  Breathe easy!
Problem 0: (Team Evaluation) due Thursday, 5/1 by midnight via EMail to your TA.
Lab organizational EMail due to your TA by midnight, Thursday 5/1.  ALSO, you should enter your project on the projects page by this time.
Lab status update report due to your TA in Section on Thursday 5/8.
Oral reports will be on Thursday 5/15.
We will all gather at 5pm on Thursday 5/15 in the lab for the contest.  Your reports will be due at that time.

Problem 0 (Team Evaluation) is due Thursday, 5/1 by midnight.  Same description as previously.


Problem 1: The final project.

For this project, your group is to implement several sub-projects selected from the list included below.  Each "option" has an attached number of points.  You must implement a total of 6 points per person in your group.  Thus, for instance, a 4-person group must implement 24 points worth of work from the list below.  You will not be awarded more than the number of points possible for your group (except through the special extra-credit section); however, you may choose to implement additional sub-projects to increase your chances at extra credit or guard against mistakes.  All projects must maintain compatibility with the standard MIPS instruction set.

Near the end of classes, your group must do a final presentation. Your presentation should be 15 minutes long (20 minutes including questions). Everybody in your group must present; your individual grade will include your presentation. Since you will present before your final project is due, you will not be expected to present a completed, fully functional project. Good presentations (and write-ups, for that matter) will cover the specific sub-projects you chose to implement and how they affected your processor's performance. Detailed descriptions of your project datapath are not appropriate for a 15-minute presentation.  However, high-level datapaths might be appropriate.

Your final lab write-up is due Friday 5/16, after the contest.   Please make sure that your write-up is clear, concise, and complete. To receive full credit for a sub-project, you must include a discussion of how it affects the performance of your project. You will not get a chance to discuss your final project with your TA, so make sure that you effectively communicate the key points of your project. One good technique is to analyze the performance of select benchmarks with and without your new features.

This document contains only a brief summary of the projects. For more information about any of the specific sub-projects, please consult with your TA. All projects are weighted in accordance with their estimated difficulty. You should not implement a project unless it is relevant to your design (e.g., branch prediction is useless for the single-issue MIPS pipeline).

Important note: many of these options can get quite complicated.  Thus, plan out a testing philosophy along with your architecture. One thing that you learned this term is how to make interesting tracer modules in Verilog.  Use this technique freely, since ASCII traces (possibly written to files) of pipeline state are likely to assist in debugging your design.

As you will see below, at minimum, we require you to maintain a disassembly tracer as in previous labs.  This required instruction tracer should only show committed instructions, i.e. show the instructions that actually finish execution.  Note that in this lab, we now require that you include a cycle count value in the tracer output (at the beginning of each output line).  This will show exactly which cycle each instruction finishes on.  Since it is part of the debugging code, all of the logic for the cycle counter can be included in the Verilog monitor code; it should be reset to zero on reset and increment with each clock cycle (since CPI is not necessarily 1, you will actually see the delays in your instruction traces).

Should you decide to use other tracer modules (say ones that watch individual reservation stations, etc), then you should also include cycle counts in those trace outputs as well.  Each tracer module can write its output to an individual file, with the cycle counts helping you to figure out what order things happened in.
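
To make this concrete, here is a minimal sketch of what such a monitor might look like in Verilog.  All of the port names (commit_valid, commit_pc, and so on) are placeholders for whatever signals your pipeline actually exposes, not a required interface:

    // Minimal sketch of a tracer monitor with the required cycle counter.
    module trace_monitor(input clk, input reset,
                         input        commit_valid, // an instruction commits this cycle
                         input [31:0] commit_pc,    // PC of the committing instruction
                         input [31:0] commit_inst); // raw 32-bit instruction word

      reg [31:0] cycle_count;
      integer    tfile;

      initial tfile = $fopen("commit.trace");

      // Cycle counter: zero on reset, +1 on every clock edge.
      always @(posedge clk)
        if (reset) cycle_count <= 0;
        else       cycle_count <= cycle_count + 1;

      // One trace line per committed instruction, cycle count first.
      // A real tracer would disassemble commit_inst rather than
      // printing it in hex.
      always @(posedge clk)
        if (!reset && commit_valid)
          $fdisplay(tfile, "%0d: pc=%h inst=%h",
                    cycle_count, commit_pc, commit_inst);
    endmodule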

Subprojects [#points]

Super-Scalar [18]:
To increase performance, convert your pipeline into a 2-way super-scalar pipeline. Because super-scalar implementations increase the instruction bandwidth, this project will entail changes throughout the entire memory system.

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.

For more information, see the companion document on the handouts page (under Lab 7)

Out-of-order Execution [22]:
Implement a processor which uses the Tomasulo algorithm to perform out-of-order execution.  Include at least 4 different units: instruction fetch/branch unit, load unit, store unit, and integer ALU.  Note that the instruction fetch unit is tricky, since you need to deal with branches whose values may be coming from the broadcast results bus.  Also, see if you can be more clever than we were in class about the broadcast results bus, so that a chain of instructions that depend on each other can have an execution throughput of one per cycle.  For more information, look at the end of chapter 6, which talks about the PowerPC 604 and shows a possible organization.  Incidentally, if done properly, the store unit could serve as a write buffer....
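
To give a flavor of what this involves, below is a minimal sketch of a single reservation-station entry snooping the broadcast results bus.  The 4-bit tags, the "tag 0 means operand ready" convention, and all of the port names are illustrative assumptions, not a required design:

    // Sketch of one reservation-station entry.
    module rs_entry(input clk, input reset,
                    // dispatch port: a new instruction enters this station
                    input         dispatch,
                    input  [3:0]  d_qj, d_qk,   // producer tags (0 = value ready)
                    input  [31:0] d_vj, d_vk,   // operand values (when ready)
                    // common (broadcast) results bus
                    input         cdb_valid,
                    input  [3:0]  cdb_tag,
                    input  [31:0] cdb_value,
                    // handshake with the functional unit
                    input         issue,        // FU accepts this entry
                    output        ready,
                    output [31:0] vj, vk);

      reg        busy;
      reg [3:0]  qj, qk;     // tags of still-outstanding producers
      reg [31:0] rvj, rvk;   // captured operand values

      assign ready = busy && (qj == 4'd0) && (qk == 4'd0);
      assign vj = rvj;
      assign vk = rvk;

      always @(posedge clk) begin
        if (reset) busy <= 0;
        else if (dispatch) begin
          busy <= 1;
          // Capture straight off the bus if an operand is being
          // broadcast in the dispatch cycle, so it is not lost.
          if (cdb_valid && d_qj != 4'd0 && cdb_tag == d_qj) begin
            qj <= 4'd0;  rvj <= cdb_value;
          end else begin
            qj <= d_qj;  rvj <= d_vj;
          end
          if (cdb_valid && d_qk != 4'd0 && cdb_tag == d_qk) begin
            qk <= 4'd0;  rvk <= cdb_value;
          end else begin
            qk <= d_qk;  rvk <= d_vk;
          end
        end
        else if (busy) begin
          if (issue && ready) busy <= 0;   // handed off to the FU
          if (cdb_valid && qj != 4'd0 && cdb_tag == qj) begin
            qj <= 4'd0;  rvj <= cdb_value;
          end
          if (cdb_valid && qk != 4'd0 && cdb_tag == qk) begin
            qk <= 4'd0;  rvk <= cdb_value;
          end
        end
      end
    endmodule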

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.  Suggestion: you might consider outputting two separate traces (to files): (1) a trace of dispatched instructions and (2) a trace of committed instructions (possibly triggered from the broadcast results bus).  Only (2) is required, however.

Multiprocessing [18]:
Hook two of your processors together to implement a multi-processor. To preserve correct behavior, the system must implement cache coherency. This means that changes to the data cache on one processor must be visible to the other processor.  One way to do this is with write-through at the first level and a shared second-level cache.  To allow a thread to distinguish which one of the two processors it is running on, a multi-processing system should start one thread at address 0, and the other thread at the byte address 0x08.

Synchronization mechanisms: In order to handle synchronization between multiple processors, you must implement the following synchronization mechanism: Treat addresses in the range 0xFFFFFFE0 to 0xFFFFFFEF specially.  These 16 addresses form 16 one-bit "synchronization variables".  Loads and stores to these addresses should go to your synchronization module instead of to memory.  So, memory operations to these addresses should have the following behavior: a load returns the current value of the synchronization variable and then sets the variable to one (i.e., a test-and-set), while a store sets the variable to the low bit of the stored value (so storing zero releases a lock).
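
Below is a minimal sketch of such a synchronization module, assuming the test-and-set semantics just described and assuming your memory stage derives a 4-bit variable index from the low address bits; all names here are placeholders:

    // Sketch of the 16-entry synchronization module (test-and-set).
    module sync_vars(input clk, input reset,
                     input         req,        // access hits 0xFFFFFFE0-EF
                     input         is_store,
                     input  [3:0]  index,      // which of the 16 variables
                     input         store_bit,  // low bit of the store data
                     output [31:0] load_value);

      reg [15:0] vars;

      assign load_value = {31'b0, vars[index]};

      always @(posedge clk) begin
        if (reset)        vars <= 16'b0;             // all locks free on reset
        else if (req) begin
          if (is_store)   vars[index] <= store_bit;  // storing 0 releases
          else            vars[index] <= 1'b1;       // load = test-and-set
        end
      end
    endmodule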

Make sure to update your instruction tracer unit from previous labs to give useful information about the two instruction streams.  To avoid confusion, you might output each of the traces to separate files. You probably want special trace output for operations on synchronization variables. Among other things, make sure to include the current cycle count in your traces.

For more information, see the companion document on the handouts page (under Lab 7)

Multithreading [20]:
Multithreading uses two hardware contexts to handle delays in the system -- it will have 64 registers instead of the normal 32.  Multi-threading allows the system to immediately switch thread contexts on a cache miss to avoid the associated delay.  Cache coherency is not an issue with multi-threading because both executing threads use the same memory path; however, the caches should be able to handle reads at the same time as they are filling a cache line.  To allow two threads to execute independently, a multi-threaded system should start one thread at address 0, and the other thread at the byte address 0x08.

Be very careful to properly squash instructions from each thread when switching from one thread to another.

You must also implement 16 synchronization variables just as above for Multiprocessing.  One additional thing: a load from a synchronization variable that returns "1" should cause a switch to the other thread, just like a cache miss does (since this corresponds to trying to get a lock that the other thread currently holds).

Make sure to update your instruction tracer unit from previous labs to give useful information about the two instruction streams (independent threads).  To avoid confusion, you might output each of the traces to separate files.  Among other things, make sure to include the current cycle count in your traces.

For more information, see the companion document on the handouts page (under Lab 7)

Deep pipelining [18]:
Deep pipelining breaks the pipeline up into more than five stages, decreasing the cycle time of your processor. Lengthening the pipeline, however, will increase the difficulties that hazards cause in the system. Some components, such as the ALU, will have to be broken up into two stages (you will have to implement the ALU using individual gates).  To get full credit for this part, you need to implement at least 8 stages: 2 IF, 1 DEC, 2 EXE, 2 MEM, and 1 WB.

Further, the cycle time of your processor must be 20ns or lower (50MHz or higher).
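
As an illustration of the staging (not of the gate-level implementation ultimately required), here is a behavioral sketch of a 32-bit add split across the two EXE stages, with the low half added in EXE1 and the high half, plus the saved carry, in EXE2:

    // Sketch of a 32-bit adder split across two pipeline stages.
    // Only ADD is shown; a real split ALU must pipeline every
    // operation (and its control bits) the same way.
    module pipelined_adder(input clk,
                           input  [31:0] a, b,
                           output [31:0] sum);

      // EXE1/EXE2 pipeline registers.
      reg [15:0] lo_sum, lo_sum_q;   // low-half sum, delayed one stage
      reg [15:0] a_hi, b_hi;         // high halves waiting for EXE2
      reg [15:0] hi_sum;
      reg        carry;              // carry out of the low half

      // Stage EXE1: add the low 16 bits, latch the high halves.
      always @(posedge clk) begin
        {carry, lo_sum} <= a[15:0] + b[15:0];
        a_hi <= a[31:16];
        b_hi <= b[31:16];
      end

      // Stage EXE2: add the high 16 bits plus the saved carry.
      always @(posedge clk) begin
        hi_sum   <= a_hi + b_hi + carry;
        lo_sum_q <= lo_sum;          // keep the two halves aligned
      end

      assign sum = {hi_sum, lo_sum_q};
    endmodule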

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.

Branch Prediction [8]:
Branch prediction attempts to reduce the branch penalty for super-scalar, out-of-order, or deep-pipelining designs. When a branch is mis-predicted, instructions in the pipeline will have to be squashed and correct execution resumed.  Typically, branches are predicted based on the address of the branch instruction.

Note that you need to think carefully about what it means to have a predicted instruction stream and how you will recover from a mis-prediction.
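
As one example of a predictor, a direct-mapped table of 2-bit saturating counters indexed by low PC bits might look like the sketch below.  The 64-entry size and the port names are assumptions:

    // Sketch of a branch history table of 2-bit saturating counters.
    module bht(input clk, input reset,
               // prediction port (used in the fetch stage)
               input  [31:0] fetch_pc,
               output        predict_taken,
               // update port (when the branch actually resolves)
               input         update,
               input  [31:0] branch_pc,
               input         taken);

      reg [1:0] counters [0:63];     // 64-entry table of 2-bit counters
      integer i;

      wire [1:0] entry = counters[fetch_pc[7:2]];
      // High bit set = weakly or strongly taken.
      assign predict_taken = entry[1];

      wire [5:0] uidx = branch_pc[7:2];

      always @(posedge clk) begin
        if (reset)
          // Start every counter at "weakly not-taken".
          for (i = 0; i < 64; i = i + 1) counters[i] <= 2'b01;
        else if (update) begin
          if (taken && counters[uidx] != 2'b11)
            counters[uidx] <= counters[uidx] + 1;  // saturate toward taken
          else if (!taken && counters[uidx] != 2'b00)
            counters[uidx] <= counters[uidx] - 1;  // saturate toward not-taken
        end
      end
    endmodule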

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace. Your instruction tracer should only show committed instructions (i.e. not the squashed ones).  A useful debugging option would be to make another tracer which showed instructions as they were dispatched -- this would show more instructions than your primary instruction tracer.

Load-Value Predictor [8]:
Load-value predictors attempt to predict the result of a LW instruction (based on the instruction address). This prediction allows execution to continue for a short time using the predicted value. Meanwhile, the load value is fetched from memory and compared to the predicted value. On a mis-prediction, instructions in the pipeline must be squashed and execution restarted with the correct value.

The simplest type of load-value predictor is a "last-value" predictor, which keeps a table of instruction addresses for loads as well as the previous value actually loaded.  Further, a small "confidence" counter (2 bits) is kept for each entry.  To predict, we use the instruction address of the load to index into the table and predict the previous value as long as the value of the confidence counter is high enough (say at least "2").  When the load is actually resolved, we check to see if (1) it is in the table and (2) whether we predicted the right thing or not.  If it is in the table and we predicted the right thing, we increment the confidence (without overflowing).  If in the table and we didn't predict the right thing, then we decrement the confidence (without going below zero).
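
The sketch below shows one way this might look.  The 16-entry table size, field widths, and port names are assumptions, and note that it also replaces the stored value on a misprediction -- a common refinement that the description above does not spell out:

    // Sketch of a direct-mapped last-value predictor with 2-bit
    // confidence counters.
    module lvp(input clk, input reset,
               // prediction port: a LW is being decoded/issued
               input  [31:0] load_pc,
               output        predict_valid,   // confident enough to predict
               output [31:0] predict_value,
               // resolve port: the load's real value has returned
               input         resolve,
               input  [31:0] resolve_pc,
               input  [31:0] actual_value);

      reg [31:0] last_value [0:15];
      reg [25:0] tag        [0:15];  // upper PC bits: "is it in the table?"
      reg        valid      [0:15];
      reg  [1:0] conf       [0:15];
      integer i;

      wire [3:0] pidx = load_pc[5:2];
      wire [3:0] ridx = resolve_pc[5:2];
      wire       hit  = valid[ridx] && (tag[ridx] == resolve_pc[31:6]);

      // Predict only when the entry matches and confidence >= 2.
      assign predict_valid = valid[pidx] && (tag[pidx] == load_pc[31:6])
                             && (conf[pidx] >= 2);
      assign predict_value = last_value[pidx];

      always @(posedge clk) begin
        if (reset)
          for (i = 0; i < 16; i = i + 1) begin
            valid[i] <= 0;  conf[i] <= 0;
          end
        else if (resolve) begin
          if (hit) begin
            if (actual_value == last_value[ridx]) begin
              if (conf[ridx] != 2'b11) conf[ridx] <= conf[ridx] + 1; // saturate up
            end else begin
              if (conf[ridx] != 2'b00) conf[ridx] <= conf[ridx] - 1; // saturate down
              last_value[ridx] <= actual_value;  // refinement: learn new value
            end
          end else begin
            // Allocate a fresh entry for this load.
            valid[ridx] <= 1;  tag[ridx] <= resolve_pc[31:6];
            last_value[ridx] <= actual_value;  conf[ridx] <= 0;
          end
        end
      end
    endmodule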

Note that you need to think carefully about what it means to have a predicted value in your pipeline and how you will recover from problems.  Also, include some sort of tracer output which indicates which loads are predicted, etc.

Jump Target Predictor [8]:
Similar to a branch predictor and load-value predictor, a jump target predictor predicts the destination of JR instructions based on the address of the JR instruction itself. If the destination of the jump is mis-predicted, then instructions in the pipeline must be squashed (similar to branch prediction).

Reorder buffer [12]:
Implement a reorder buffer of at least 8 instructions.  This option makes sense combined with a number of the other options here.  It must be the case that you have precise exceptions as a result (i.e. that you could stop the system and easily recover good state).  It would certainly help with the out-of-order execution model or the non-blocking-load option. It might also help with branch prediction, load-value prediction, etc.  Even in the non-out-of-order pipeline, it could give you a longer time to let your prediction take effect, i.e. you might predict a load on a cache miss and let execution continue.  The number of entries in your reorder buffer would directly affect the number of instructions after the load that you could let execute before you absolutely had to know whether the prediction was correct or not...
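
The sketch below shows the skeleton of such a buffer as an 8-entry circular queue: instructions are allocated at the tail in program order and retired from the head only once complete, which is what gives you recoverable state.  Exception handling, forwarding of values out of the buffer, and all port names are left as assumptions:

    // Sketch of an 8-entry reorder buffer as a circular queue.
    module rob(input clk, input reset,
               // allocate at dispatch
               input         alloc,
               input  [4:0]  alloc_dest,     // architectural dest register
               output [2:0]  alloc_tag,      // ROB index given to the instruction
               output        full,
               // a result comes back, tagged with its ROB index
               input         complete,
               input  [2:0]  complete_tag,
               input  [31:0] complete_value,
               // in-order retirement into the register file
               output        retire,
               output [4:0]  retire_dest,
               output [31:0] retire_value);

      reg [2:0]  head, tail;
      reg [3:0]  count;
      reg        done  [0:7];
      reg [4:0]  dest  [0:7];
      reg [31:0] value [0:7];

      assign full         = (count == 8);
      assign alloc_tag    = tail;
      assign retire       = (count != 0) && done[head];
      assign retire_dest  = dest[head];
      assign retire_value = value[head];

      always @(posedge clk) begin
        if (reset) begin
          head <= 0;  tail <= 0;  count <= 0;
        end else begin
          if (alloc && !full) begin
            dest[tail] <= alloc_dest;  done[tail] <= 0;
            tail <= tail + 1;
          end
          if (complete) begin
            value[complete_tag] <= complete_value;
            done[complete_tag]  <= 1;
          end
          if (retire) head <= head + 1;
          count <= count + (alloc && !full) - retire;
        end
      end
    endmodule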

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace. The trace should show committed instructions.  One useful debugging option would be to make another tracer which shows dispatched instructions (as they enter the reorder buffer).

Non-Blocking Loads [8]:
Non-blocking loads allow the pipeline to continue execution during a data-cache miss until the fetched data is actually needed.  This can be accomplished by adding a full/empty bit for every register in the register file, and keeping a small table (2-4 entries) for outstanding loads.  The entries in this table are called Miss-Status Handling Registers (MSHRs).  Each MSHR is used to keep information about one outstanding cache miss (for a load).  An MSHR contains (1) a valid bit indicating the MSHR is in use, (2) the address that has missed in the cache, and (3) the register that the result must go back to.  Initially the full/empty bits for all registers are set to "full" (this is a reset condition) and all of the MSHRs are set to invalid.

When a load reaches the memory stage, you check to see if it misses in the cache.  If so, you see if you have a free MSHR, and stall if all of them are in use.  If you have a free MSHR, you enter the info for the load into the MSHR, and set the F/E bit for the destination register to empty.  Now, the simplest thing to do is flush all following instructions and restart fetching at the instruction after the load.  The reason is as follows: in the decode stage, you need to check whether you are decoding an instruction that reads a register whose F/E bit is set to empty; if so, you stall until the F/E bit becomes full again.  Make sure you understand WAR and WAW hazards for loads.  Data returning from memory checks the MSHRs and updates the appropriate F/E bit, register, and MSHR (making it invalid).

Note that there are more clever ways to implement this check (without the F/E bits): hint - the only registers that could possibly be empty are those named in the MSHRs.
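
Here is a minimal sketch of the F/E-bit scheme with two MSHRs.  All port names and widths are assumptions, and the integration with your actual stall and flush logic is not shown:

    // Sketch of a 2-entry MSHR file plus register full/empty bits.
    module mshr_file(input clk, input reset,
                     // a load misses in the memory stage
                     input         miss,
                     input  [31:0] miss_addr,
                     input  [4:0]  miss_dest,
                     output        mshr_full,   // stall the load if set
                     // data returns from DRAM
                     input         refill,
                     input  [31:0] refill_addr,
                     output        wb_valid,    // write refill data to wb_dest
                     output [4:0]  wb_dest,
                     // decode asks: is this source register empty?
                     input  [4:0]  src_reg,
                     output        src_empty);  // stall decode if set

      reg        valid [0:1];
      reg [31:0] addr  [0:1];
      reg [4:0]  dest  [0:1];
      reg [31:0] fe;             // full/empty bits, 1 = empty

      wire hit0 = valid[0] && (addr[0] == refill_addr);
      wire hit1 = valid[1] && (addr[1] == refill_addr);

      assign mshr_full = valid[0] && valid[1];
      assign wb_valid  = refill && (hit0 || hit1);
      assign wb_dest   = hit0 ? dest[0] : dest[1];
      assign src_empty = fe[src_reg];

      always @(posedge clk) begin
        if (reset) begin
          valid[0] <= 0;  valid[1] <= 0;
          fe <= 32'b0;                    // all registers start "full"
        end else begin
          if (miss && !mshr_full) begin   // allocate a free MSHR
            if (!valid[0]) begin
              valid[0] <= 1;  addr[0] <= miss_addr;  dest[0] <= miss_dest;
            end else begin
              valid[1] <= 1;  addr[1] <= miss_addr;  dest[1] <= miss_dest;
            end
            fe[miss_dest] <= 1;           // mark destination empty
          end
          if (refill && hit0) begin valid[0] <= 0;  fe[dest[0]] <= 0; end
          if (refill && hit1) begin valid[1] <= 0;  fe[dest[1]] <= 0; end
        end
      end
    endmodule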

Victim Cache [4]:
Implement a Victim Cache which contains 4 cache-lines worth of information, and which sits between the data cache and the DRAM.

Stream Buffer[4]:
Use a small FIFO of 2 cache lines to prefetch instructions from memory.  On an instruction cache miss, you should go ahead and fill the missing line in the cache, then proceed to fetch additional lines, which you place into the stream buffer.  Then, on subsequent misses, you check the stream buffer to potentially fetch data much more quickly.  Note that one possible implementation would take advantage of burst-mode DRAM operation to fill the stream buffer more quickly...

Multiplier/Divider[4]:
Implement a combined multiplier/divider unit that handles both signed and unsigned multiplies/divides.  As in the extra credit for Lab 5, this should be implemented as a coprocessor so that the actual instructions don't block the pipeline.  Only attempts to execute mflo/mfhi instructions should block if the operation is not yet ready.

Second-level Cache[4]:
Implement a single second-level cache with 64K of data and 64-byte cache lines.

Extra-Credit

Extra credit will be assigned for this lab based on a friendly competition between groups for performance. The cycle time for all VHDL components will be fixed, and the total combined completion time for a small suite of benchmark programs will be measured. If you use a VHDL component that is not on the 'standard' list (from Lab 6), email your TA a description of the component and he or she will assign a delay value for it. The VHDL delays will be somewhat pessimistic; therefore, to reduce your cycle time, you may choose to implement any VHDL components using discrete gates. Your design does not need to be 'the best' to receive extra credit, but obviously the best design will receive the most extra points.

One test program will be released early on, to allow you to target your optimizations. The other program, a mystery program, will not be released until a few days before the project deadline. A good understanding of the different project options will help you decide in advance which solutions will have the greatest impact on performance. You may change the order of instructions in the programs, if you desire, to take advantage of your specific processor implementation; adding or removing instructions other than NOPs (e.g., by unrolling loops) is not allowed. A separate, yet functionally similar, program will be provided for multi-threaded and multi-processing projects. Single-threaded and multi-threaded projects will be judged together, but the multi-threaded version of the programs will include synchronization and communication overhead.

Increasing the amount of first-level cache in your system is not allowed: the total amount of first-level cache in your system may not exceed 128 words per processor (a total of 256 words is allowed for a multi-processor design).  The 128 words is derived from Lab 6, which has 64 words each for the I and D caches; although you may not exceed 128 words total, you may redistribute the memory if you wish (e.g., a 32/96 split between the I and D caches).

To aid in determining when the test programs are done executing, we will place an undefined instruction at the end.  Your VHDL controller should assert an error when it tries to decode an undefined instruction. The time to completion for the program, therefore, will be the time at which this undefined instruction is decoded.
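
In simulation, a small monitor along the following lines is one way to capture that completion time; the illegal_instr signal from your decoder and the cycle_count input are hypothetical placeholders:

    // Sketch of a completion monitor: report the cycle at which the
    // terminating undefined instruction reaches decode.
    module done_monitor(input clk, input reset,
                        input illegal_instr,      // decoder's error signal
                        input [31:0] cycle_count);
      always @(posedge clk)
        if (!reset && illegal_instr) begin
          $display("Program completed at cycle %0d", cycle_count);
          $finish;
        end
    endmodule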