CS152 Computer Architecture and Engineering

Lab #6: Final Project

Spring 2004, Prof. John Kubiatowicz


Problem 0 (Team Evaluation) for Lab 5 is due Wednesday, 4/28, by 9 pm. The description is the same as before.


Problem 1: Memory-mapped I/O

Here is the specification for the memory-mapped I/O space for your processor. Note that there will probably be some simple additions to this specification for specialized output for the processor race.
Address                Description
---------------------  ------------------------------------
0x80000000-0x80001FFC  TFTP Memory Source (See 1b)
0x80002000-0xFFFFFE6C  Reserved for future use
0xFFFFFE70-0xFFFFFE90  TFTP Memory Source (See 1b)
0xFFFFFE94             Reserved for future use
0xFFFFFE98-0xFFFFFEDC  Synchronization (See 1c)
0xFFFFFEE0-0xFFFFFEE8  ASCII Display (unchanged from Lab 5)
0xFFFFFEEC             Reserved for future use
0xFFFFFEF0-0xFFFFFEFC  Ginormous Cycle Counter (See 1a)
0xFFFFFF00-0xFFFFFFEC  Boot ROM (unchanged from Lab 5)
0xFFFFFFF0-0xFFFFFFFC  Basic I/O (unchanged from Lab 5)

Problem 1a: 64-bit Cycle Counter

For the final project we will need to time some longer-running programs whose cycle counts can overflow 32 bits. You must therefore implement a new 64-bit cycle counter.

Address     Reads                               Writes
----------  ----------------------------------  -----------------------
0xFFFFFEF0  Lower 32 bits of big cycle counter  Nothing
0xFFFFFEF4  Upper 32 bits of big cycle counter  Nothing
0xFFFFFEF8  Timer state                         Timer command
0xFFFFFEFC  Reserved for future use             Reserved for future use

Your 64-bit cycle counter has two states, running and stopped. A load from the Timer State address should return 1 if the counter is running and 0 if it is stopped. You may issue three different commands to the cycle counter by storing words to the Timer Command location.

Value       Action
----------  -----------------
0x00000001  Reset the counter
0x00000002  Start the counter
0x00000004  Stop the counter

Resetting the counter should NOT start or stop the counter.

Note that this counter should be independent of your 32-bit cycle counter and that both should be functional.
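For example, a timing harness might drive the counter as in the sketch below (this is an illustration under the encodings above, not required code; recall that MIPS sign-extends 16-bit load/store offsets, so 0xFEF8($0) addresses 0xFFFFFEF8):

        addiu $t0, $0, 1          # Command 1: reset the counter (does not start it)
        sw    $t0, 0xFEF8($0)     # 0xFEF8($0) = 0xFFFFFEF8, Timer Command
        addiu $t0, $0, 2          # Command 2: start the counter
        sw    $t0, 0xFEF8($0)
        # ... code being timed ...
        addiu $t0, $0, 4          # Command 4: stop the counter
        sw    $t0, 0xFEF8($0)
        lw    $v0, 0xFEF0($0)     # Lower 32 bits of the count
        lw    $v1, 0xFEF4($0)     # Upper 32 bits of the count

Because the counter is stopped before it is read, the two halves are consistent; a program reading the counter while it is running would need to re-read the upper word to guard against rollover between the two loads.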

Problem 1b: TFTP Memory Source

We have provided you with a new version of the TFTP Memory Source that allows you to store data into it. You can now fill the TFTP black box with data and then download it off the board to the computer. This means that stores in the range 0x80000000-0x80001FFC should now write back to the corresponding address of the TFTP black box. Bits [12:2] of the memory address should be used as the input to the ExternalAddress_ port of the black box. The new black boxes can be found in m:\lab6.

In addition, there is a new FPGA_TOP2 with a test circuit that will test the writability of the black boxes. After a file has been loaded onto the board, select a length using the dipswitches and push button 2; the data in the black box will be replaced with words of the format DEADxxxx, where xxxx is the memory location of the word. You can then download the file off the board and see that it has changed, both in length and in content.

0x80000000-0x80001FFC
  Reads:  Read from the corresponding address of the black box.
  Writes: Write to the corresponding address of the black box.

0xFFFFFE70
  Reads:  The first four characters of the filename in the black box, in
          ASCII. The high bits of the word correspond to the earlier
          characters.
  Writes: Nothing.

0xFFFFFE74
  Reads:  The fifth and sixth characters of the filename in the black box,
          in ASCII. The high bits of the word correspond to the earlier
          characters; the lower sixteen bits of the word should be 0.
  Writes: Nothing.

0xFFFFFE78-0xFFFFFE8C
  Reads:  Reserved for future use.
  Writes: Reserved for future use.

0xFFFFFE90
  Reads:  Nothing.
  Writes: Sets the length of the file in bytes. The maximum legal value is
          0x00002000; higher values have undefined results.
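As a sketch (not required code) of how a program might produce a downloadable file, the following stores 16 words into the TFTP data region and then sets the file length. The 16-word length, the index-valued data, and the label wloop are all illustrative:

        lui   $t0, 0x8000         # $t0 = 0x80000000, base of TFTP data region
        addiu $t1, $0, 0          # Word index
wloop:
        sll   $t2, $t1, 2         # Byte offset = index * 4
        addu  $t2, $t0, $t2
        sw    $t1, 0($t2)         # Write the index itself as the data word
        addiu $t1, $t1, 1
        slti  $t3, $t1, 16        # More words to write?
        bne   $t3, $0, wloop
        nop                       # Branch delay slot
        addiu $t1, $0, 64         # 16 words = 64 bytes
        sw    $t1, 0xFE90($0)     # Set the file length (must be <= 0x00002000)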

Problem 1c: Synchronization

Your processor must provide synchronization support for multithreading. If you are not attempting the multithreaded or multiprocessor options from problem 2a, then you only have to support reads from address 0xFFFFFE98. If you are attempting multithreading or multiprocessing, then you should designate one thread/processor as the master and the other as the slave. Only the master executes code following the level-0 boot; the slave is then activated through I/O space.

Address                Reads                             Writes
---------------------  --------------------------------  --------------------------------
0xFFFFFE98             State of slave processor/thread   Slave processor/thread command
0xFFFFFE9C             Starting address of slave         Starting address of slave
                       processor/thread                  processor/thread
0xFFFFFEA0-0xFFFFFEDC  Test-and-set registers 0-15       Test-and-set registers 0-15

A slave processor/thread can be in one of several states. The first is unsupported: if your processor cannot handle multithreading or multiprocessing, then your slave processor/thread state should always be unsupported. The second is idle, meaning the slave processor/thread is ready to begin executing; upon exiting the level-0 boot, the slave processor/thread should be left in this state. The third is active, meaning the slave processor/thread is actively running (or, in the case of multithreading, trying to run).

Value       State
----------  ----------------------------------
0x00000000  Slave processor/thread unsupported
0x00000001  Slave processor/thread idle
0x00000002  Slave processor/thread active

Two commands may be issued to a slave processor/thread. The first is the start command, which tells the thread to start executing from the starting address stored at 0xFFFFFE9C. The second is the stop command, which forces the slave processor/thread to stop executing immediately: all instructions before the stop command should commit, and all instructions after it should be ignored.

Value       Action
----------  ---------------
0x00000001  Start execution
0x00000002  Stop execution

The use of these states and commands would allow you to do the following:
.count words sbegin send
.word 0x00000000
.address 0x00000000
sbegin:
        addiu $t0, $0, 0x0040     # Set starting address for second thread
        sw    $t0, 0xFE9C($0)
        addiu $t0, $0, 1          # Command 1: start second thread
        sw    $t0, 0xFE98($0)
        j     t1start             # Start executing first thread
        nop                       # Branch delay slot
send:
.count words t1start t1end
.word 0x00010000
.address 0x00010000
t1start:
        # ... do thread 1 stuff ...
        addiu $t1, $0, 1          # State value for an idle thread is 1
waitfort2:
        lw    $t0, 0xFE98($0)     # Check whether thread 2 is done (idle again)
        bne   $t0, $t1, waitfort2
        nop                       # Branch delay slot
        # ... finish-up code / start another thread, etc. ...
t1end:
.count words t2start t2end
.word 0x00000040
.address 0x00000040
t2start:
        # ... do thread 2 stuff ...
        addiu $t0, $0, 2          # Command 2: stop a thread
        sw    $t0, 0xFE98($0)     # Stop executing second thread
t2end:
.word 0x00000000                  # A value of 0 terminates the blocks

In order to handle synchronization between multiple processors/threads, you must implement the following synchronization mechanism: treat addresses in the range 0xFFFFFEA0-0xFFFFFEDC specially. These 16 words form 16 one-bit "synchronization variables". Loads and stores to these addresses should go to your synchronization module instead of to memory. Memory operations to these addresses should have the following behavior: a load atomically returns the current value of the synchronization variable (0 or 1) and sets it to 1 (i.e., a test-and-set), and a store of 0 clears the variable, releasing it.
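Given these test-and-set semantics, acquiring and releasing a lock built on synchronization variable 0 (address 0xFFFFFEA0) might look like this sketch; the label and register choices are illustrative:

getlock:
        lw    $t0, 0xFEA0($0)     # Atomically read sync variable 0 and set it to 1
        bne   $t0, $0, getlock    # Returned 1: someone else holds it, so retry
        nop                       # Branch delay slot
        # ... critical section ...
        sw    $0, 0xFEA0($0)      # Store 0 to release the variable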


Problem 2: The Final Project

The final project consists of a number of subprojects. Each subproject has an attached number of points. You must implement a total of 7.5 points per person in your group; thus, for instance, a 4-person group must implement 30 points worth of work from the list below. You will not be awarded more than the number of points possible for your group (except through the special extra-credit section); however, you may choose to implement additional sub-projects to increase your chances at extra credit or to guard against mistakes.

Aside from the point requirements, you must complete problem 2a. You only have to complete enough of problems 2b-2d to meet your minimum point total. All projects must maintain compatibility with the standard MIPS instruction set.

If you have ideas on other options, talk to one of the TAs or the professor for approval, as well as for an assignment of points for your option. For additional information on these options, please consult the graduate textbook, websites, or your TAs.

Near the end of classes, your group must give a final presentation. Your presentation should be 20 minutes long, with an additional 10 minutes for questions from the professor, the TAs, and your fellow classmates. Everybody in your group must present; your individual grade will include your presentation. Good presentations (and write-ups, for that matter) will cover the specific sub-projects you chose to implement, and how they affected your processor's performance. What design decisions were made and why? How successful were you? How did you measure your performance? Detailed descriptions of your project datapath are not appropriate for a 20-minute presentation. However, high-level datapaths and explanations of specific implementations might be appropriate.

Your final lab write-up is due Tuesday 5/18, after the contest. Please make sure that your write-up is clear, concise, and complete. To receive full credit for a sub-project, you must include a discussion of how it affects the performance of your project. You will not get a chance to discuss your final project with your TA, so make sure that you effectively communicate the key points of your project. One good technique is to analyze the performance of select benchmarks with and without your new features.

This document contains only a brief summary of the projects. For more information about any of the specific sub-projects, please consult with your TA. All projects are weighted in accordance with their estimated difficulty. You should not implement a project unless it is relevant to your design (e.g., branch prediction is useless for the single-issue MIPS pipeline).

Important note: many of these options can get quite complicated. Thus, plan out a testing philosophy along with your architecture. One thing that you learned this term is how to make interesting tracer modules in Verilog. Use this technique freely, since ASCII traces (possibly written to files) of pipeline state are likely to assist in debugging your design.

As you will see below, at minimum, we require you to maintain a disassembly-tracer as in previous labs.  This required  instruction tracer should only show committed instructions, i.e. show the instructions that actually finish execution.  Note that in this lab, we now require that you include a cycle count value in the tracer output (at the beginning of each output line).  This will show exactly which cycle instructions finish on.  Since it is part of the debugging code, all of the logic for the cycle counter can be included in the Verilog monitor code; it should be reset to zero on reset, and increment with each clock cycle (since CPI is not necessarily 1, you will actually see the delay in your instruction traces).

Should you decide to use other tracer modules (say ones that watch individual reservation stations, etc), then you should also include cycle counts in those trace outputs as well.  Each tracer module can write its output to an individual file, with the cycle counts helping you to figure out what order things happened in.


Problem 2a: Overall Architecture

You must complete at least one of the following options. More can be completed for additional points; however, this will be very difficult.

Super-Scalar [18]:

To increase performance, convert your pipeline into a 2-way super-scalar pipeline. Because super-scalar implementations increase the required instruction bandwidth, this project may entail changes throughout the entire memory system.

The IF stage needs to fetch two instructions, and then determine if both instructions can be issued in parallel (hazards?). Additional complexity is added because data must be forwarded between the pipes. Note that unless you choose to implement an enhanced superscalar pipeline (see the Full Superscalar Pipeline option in problem 2d), only one of your pipes will have a real memory stage, so two lw/sw instructions cannot be issued in parallel.

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.

Out-of-order Execution [22]:

Implement a processor which uses the Tomasulo algorithm to perform out-of-order execution.  Include at least 5 different units: instruction fetch/branch unit, load unit, store unit, integer ALU, and multiply unit.  Note that the instruction fetch unit is tricky, since you need to deal with branches whose values may be coming from the broadcast results bus.  Also, see if you can be more clever than we were in class about the broadcast results bus, so that a chain of instructions that depend on each other can have an execution throughput of one per cycle.  For more information, look at the end of chapter 6, which talks about the PowerPC 604 and shows a possible organization.  Incidentally, if done properly, the store unit could serve as a write buffer....

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.  Suggestion: you might consider outputting two separate traces (to files): (1) a trace of dispatched instructions and (2) a trace of committed instructions, possibly triggered from the broadcast results bus.  Only (2) is required, however.

Deep pipelining [22]:

Deep pipelining breaks the pipeline up into more than five stages, decreasing the cycle time of your processor. Lengthening the pipeline, however, will increase the difficulties that hazards cause in the system. Some components, such as the ALU, may have to be broken up into two stages.

You may use as many or as few stages as you like, but the cycle time of your processor MUST be ~28.57 ns or lower (35 MHz or higher).

Please note that this project will involve HEAVY USE of the CAD tools in order to achieve the 35 MHz. If you feel that heavily exploring the Xilinx CAD tools should not be part of a Computer Architecture course, do not do this option.

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace.

Multiprocessing [18]:

Hook two of your processors together to implement a multiprocessor. To preserve correct behavior, the system must implement cache coherency: changes to the data cache on one processor must be visible to the other processor.  One way to do this is with write-through at the first level and a second-level cache. Make sure to update your instruction tracer unit from previous labs to give useful information about the two instruction streams.  To avoid confusion, you might output each of the traces to separate files. You probably want special trace output for operations on synchronization variables. Among other things, make sure to include the current cycle count in your traces.

Multithreading [20]:

Multithreading uses two hardware contexts to handle delays in the system -- the processor will have 64 registers instead of the normal 32.  Multithreading allows the system to immediately switch thread contexts on a cache miss, to avoid the associated delay.  Cache coherency is not an issue with multithreading because both executing threads use the same memory path; however, the cache should be able to handle reads at the same time as it is filling a cache line.

Be very careful to properly squash instructions from each thread when switching from one thread to another.

You must also implement the 16 synchronization variables just as above for multiprocessing.  One additional thing: a load from a synchronization variable that returns "1" should cause a switch to the other thread, just like a cache miss does (since this corresponds to trying to get a lock that the other thread currently holds).

Make sure to update your instruction tracer unit from previous labs to give useful information about the two instruction streams (independent threads).  To avoid confusion, you might output each of the traces to separate files.  Among other things, make sure to include the current cycle count in your traces.


Problem 2b: Predictors

Branch Prediction [8]:

Branch prediction attempts to reduce the branch penalty for super-scalar, out-of-order, or deep-pipelining designs. When a branch is mis-predicted, instructions in the pipeline will have to be squashed and correct execution resumed.  Typically, branches are predicted based on the address of the branch instruction.

Note that you need to think carefully about what it means to have a predicted instruction stream and how you will recover from a mis-prediction.

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace. Your instruction tracer should only show committed instructions (i.e. not the squashed ones).  A useful debugging option would be to make another tracer which showed instructions as they were dispatched -- this would show more instructions than your primary instruction tracer.

Note: Implementing branch prediction is highly recommended if you are building a super-scalar, out-of-order, or deep-pipelined design.

For more info, see the branch prediction papers under "Final Project" on the handouts page. The Bi-Mode predictor is not necessary to implement, but is considered one of the state-of-the-art predictors. Note also that the GShare predictor (discussed in the Bi-Mode paper) is pretty good.

Load-Value Predictor [8]:

Load-value predictors attempt to predict the result of a LW instruction (based on the instruction address). This prediction allows execution to continue for a short time using the predicted value. Meanwhile, the load value is fetched from memory and compared to the predicted value. On a mis-prediction, instructions in the pipeline must be squashed and execution restarted with the correct value.

The simplest type of load-value predictor is a "last-value" predictor, which keeps a table of instruction addresses for loads as well as the previous value actually loaded.  Further, a small "confidence" counter (2 bits) is kept for each entry.  To predict, we use the instruction address of the load to index into the table and predict the previous value as long as the value of the confidence counter is high enough (say at least "2").  When the load is actually resolved, we check to see if (1) it is in the table and (2) whether we predicted the right thing or not.  If it is in the table and we predicted the right thing, we increment the confidence (without overflowing).  If in the table and we didn't predict the right thing, then we decrement the confidence (without going below zero).

Note that you need to think carefully about what it means to have a predicted value in your pipeline and how you will recover from problems.  Also, provide some sort of tracer output that indicates which loads are predicted, whether the predictions were correct, and so on.

Jump Target Predictor [8]:

Similar to a branch predictor and load-value predictor, a jump target predictor predicts the destination of JR instructions based on the address of the JR instruction itself. If the destination of the jump is mis-predicted, then instructions in the pipeline must be squashed (similar to branch prediction).


Problem 2c: Caches

Non-Blocking Loads [7]:

Non-blocking loads allow the pipeline to continue execution during a data-cache miss until the fetched data is actually needed.  This can be accomplished by adding a full/empty (F/E) bit for every register in the register file, and keeping a small table (2-4 entries) for outstanding loads.  The entries in this table are called Miss-Status Handling Registers (MSHRs).  Each MSHR is used to keep information about one outstanding cache miss (for a load).  An MSHR contains (1) a valid bit indicating the MSHR is in use, (2) the address that has missed in the cache, and (3) the register that the result must go back to.  Initially the full/empty bits for all registers are set to "full" (this is a reset condition) and all of the MSHRs are set to invalid.

When a load reaches the memory stage, you check to see if it misses in the cache.  If so, you see if you have a free MSHR and stall if all of them are in use.  If you have a free MSHR, you enter the info for the load into the MSHR, and set the F/E bit for the destination register to empty.  Now, the simplest thing to do is flush all following instructions and restart fetching at the instruction after the load.  The reason is as follows.  In the decode stage, you need to check to see if you are decoding an instruction that has its F/E bit set to empty; if so, you stall until the F/E bit becomes full again.  Make sure you understand WAR and WAW hazards for loads.  Data returning from memory checks the MSHR and updates the appropriate F/E bit, register, and MSHR (making it invalid).
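For intuition, here is the kind of sequence that benefits (an illustrative sketch, not required code; the registers are arbitrary). With blocking loads, the pipeline stalls at the lw on a miss; with non-blocking loads, it stalls only when the empty destination register is actually read:

        lw    $t0, 0($a0)         # On a miss: allocate an MSHR, mark $t0 empty
        addiu $t1, $t1, 1         # Independent work overlaps the miss
        addiu $t2, $t2, 4
        addu  $t3, $t3, $t2
        addu  $v0, $t0, $t1       # Consumer: decode stalls until $t0 is full again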

For more info, see paper by Kroft under Final Project on the handouts page.

Note that there are more clever ways to implement non-blocking loads (without the F/E bits); hint: the only registers that could possibly be empty are the ones named in the MSHRs.

Write-Back Cache [7]:

Implement a write-back scheme as discussed in class. Note that this may interact in a strange way with your write buffer.  In this new write policy, your write buffer will only be used to hold writes until they make it into the cache.  This means that a "write miss" will now load the complete cache line, then allow the write to go forward.

Further, both write and read cache misses need to read a complete 8-word cache line.  Where this gets complicated is that you may need to kick out a dirty cache line in the process.  Make sure that you get this correct!  Remember that writing to memory is a long, painful penalty, and a sloppy write policy can make things much worse.

Also note that you need to enable write bursts as well (for write-back of dirty lines).  To do that, set bit M9 of the mode register to 0, which enables write bursts and makes them the same length as read bursts.

Victim Cache [4]:

Implement a Victim Cache which contains 4 cache-lines worth of information, and which sits between the data cache and the DRAM.

For more info, see paper by Jouppi under Final Project on the handouts page.

Second-level Cache [4]:

Implement a single second-level cache with 32K bytes of data and 64-byte cache lines. If you don't have enough block RAMs, you may implement a 24K-byte L2 cache.

Mutual Exclusion [2]:

As discussed in class, one idea that AMD used was mutual exclusion: cache data is either in the L1 or the L2, but not in both. As an extension to the second-level cache, implement mutual exclusion.


Problem 2d: Miscellaneous

Reorder buffer [12]:

Implement a reorder buffer of at least 8 instructions.  This option makes sense to be combined with a number of other options here.  It must be the case that you have precise exceptions as a result (i.e. that you could stop the system and easily recover good state).  It would certainly help with the out-of-order execution model or the non-blocking-load option. It might also help with branch prediction, load-value prediction, etc.  Even in the non-out-of-order pipeline, it could give you a longer time to let your prediction take effect, i.e. you might predict a load on a cache miss and let the execution continue.  The number of entries in your reorder buffer would directly affect the number of instructions after the load that you could let execute before you absolutely had to know whether the prediction was correct or not...

Make sure to update your instruction tracer unit from previous labs to give useful information about the instruction stream.  Among other things, make sure to include the current cycle count in your trace. The trace should show committed instructions.  One useful debugging option would be to make another tracer which shows dispatched instructions (as they enter the reorder buffer).

Explicit Register Renaming [12]:

Expand your physical register file to 64 registers, and implement register renaming to eliminate WAR and WAW hazards. You will need to implement a translation table, which maps the ISA registers to your physical registers. Make sure you free a physical register when it is no longer being used, or you risk deadlocking your processor because it cannot issue. You must be able to recover from speculation errors (mispredictions) by restoring your translation table and free list.

Full Superscalar Pipeline [8]:

Modify your instruction and data caches to support simultaneous memory operations. Your caches should be able to perform two simultaneous accesses (to either the same cache line or different ones). Simultaneous hits must not stall the processor. You must be able to handle simultaneous misses.
The only exceptions to this rule are interdependent data access instructions such as:
lw $t0, 0($t9)
lw $t1, 0($t0)
or
lw $t1, 0($t9)
sw $t1, 0($t9)
Note: You may (and probably should) use dual-ported block RAMs for this purpose.

Stream Buffer (Prefetch) [4]:

Use a small FIFO of 2 cache lines to prefetch instructions from memory.  On an instruction cache miss, you should go ahead and fill the missing line in the cache, then proceed to fetch additional lines, which you place into the stream buffer.  Then, on subsequent misses, you check the stream buffer to potentially fetch data much more quickly.  Note that one possible implementation would take advantage of burst-mode DRAM operation to fill the stream buffer more quickly...

For more info, see paper by Jouppi under Final Project on the handouts page.

Multiplier/Divider [4]:

Implement a combined multiplier/divider unit that handles both signed and unsigned multiplies/divides.  Once again, this should be implemented as a coprocessor so that the actual instructions do not block the pipeline; only mflo/mfhi instructions should block, and only if the result is not yet ready. You must improve your multiplier to use a more advanced scheme than the one used in Lab 2.
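The payoff looks something like this sketch (the surrounding instructions are arbitrary): independent instructions overlap the multiply, and only the mflo/mfhi can stall:

        mult  $a0, $a1            # Start a signed multiply in the coprocessor
        addiu $t0, $t0, 1         # Independent instructions execute during the multiply
        addu  $t1, $t1, $t0
        mflo  $v0                 # Blocks only if the low word is not yet ready
        mfhi  $v1                 # Likewise for the high word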


Problem 3: Unfair Benchmarks

Hopefully, in this class, you have learned how benchmarks can be manipulated to misrepresent the performance of different processors. As part of your final project, each group needs to write a benchmark. This benchmark should be designed to make your processor perform as well as possible, while making other processors perform poorly. You may want to look at the project description website to see what options other groups are implementing, and target those options. Of course, the best benchmark will be one that breaks other processors (just be careful that it doesn't break yours!).

Your benchmark must conform to the following guidelines. It must consist of at most 384 words of instructions and preallocated static storage, all of which must be below address 0x00800000. It must not be self-modifying. The stack pointer should be initialized to 0x03FFFFFC and the heap should be initialized to 0x00800000. If you use any static storage, it should be clearly labeled, and we should be able to easily modify your code to remap the addresses of your static storage. You must save any $s registers before using them, and you must leave the high 32 bits of the cycle count in $v1 and the low 32 bits of the cycle count in $v0. Finally, you must conform to the instruction set given in problem 5. Failure to follow these conventions will definitely mean that your program will not be used in the final race, and it will also mean a loss of credit for this problem.
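A conforming benchmark might be shaped like the following sketch, which borrows the block format from the problem 1c example. The load address of 0x00000000, the labels, and the empty kernel are placeholders, and whether your benchmark also resets/starts the counter itself is up to you:

.count words bstart bend
.word 0x00000000
.address 0x00000000
bstart:
        # ... benchmark kernel: at most 384 words of code plus static
        #     storage, all below 0x00800000; save any $s registers used ...
        lw    $v0, 0xFEF0($0)     # Low 32 bits of the 64-bit cycle count in $v0
        lw    $v1, 0xFEF4($0)     # High 32 bits of the cycle count in $v1
bend:
.word 0x00000000                  # A value of 0 terminates the blocks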

We will incorporate some of the better benchmarks in a student test suite, and use them in our final testing / competition. Thus, the benchmarks will be due Thursday, 5/6.


Extra-Credit

Extra credit will be assigned for this lab based on a friendly competition between groups for performance. Your design does not need to be 'the best' to receive extra credit, but obviously the best design will receive the most extra points. We will have competitions for: best CPI, highest clock rate, best performance per hardware resource usage (LUT count), and best overall performance, based on total execution time on the benchmarks.

One test program will be released early on, to allow you to target your optimizations. The other benchmark program, a mystery program, will not be released until a few days before the project deadline. Finally, the final benchmark suite will include student-submitted benchmarks from problem 3 above. A good understanding of the different project options will help you decide in advance which solutions will have the greatest impact on performance.

Increasing the amount of first-level cache in your system is not allowed: the total amount of first-level cache in your system may not exceed 4096 words. This figure is derived from Lab 5, which has 2048 words each for the I- and D-caches; although you may not exceed 4096 words total, you may redistribute the memory if you wish (e.g., 1024 words of I-cache and 3072 words of D-cache).