Homework University of California, Berkeley

Homework #5 / Lab #5

CS152 - Computer Architecture
Spring 2003, Prof John Kubiatowicz

Homework 5 due Wednesday 4/9 in class. There will be a short quiz in class on that day.
Problem 0: (Team Evaluation) due by Monday 3/31 at midnight via EMail to your TA.
Lab organizational EMail due to your TA by Tuesday 4/1 at midnight.
Lab 5 due Thursday 4/10 by midnight via the submit program. You will demonstrate your pipelined processor to your TA on Thursday 4/10 in Section.

Like Lab 4, this is a long lab! MAKE SURE TO START EARLY!

Please put the TIME or TA NAME of the DISCUSSION section that you attend as well as your NAME and STUDENT ID. Homeworks and labs will be handed back in discussion.

Homework Policy: Homework assignments are due in class. No late homeworks will be accepted. There will be a short quiz in lecture the day the assignment is due; the quiz will be based on the homework. Study groups are encouraged, but what you turn in must be your own work.

Lab Policy: Labs are due at midnight on Thursdays via the submit program. You will demonstrate your lab to your TA in section on that same Thursday.

As decided in class, the penalty for cheating on homework or labs is no credit for the full assignment.

Homework 5

Please do the following problems from P&H: 6.2, 6.3, 6.4, 6.6, 6.9, 6.18, 6.19, 6.20, 6.23, 6.26, 6.27, 6.28, 6.29, 6.30
Homework assignments should continue to be done individually.

Lab 5: In this assignment, you will pipeline your processor. You will still use the same memory structure as in the last assignment and will get a chance to enhance that feature of your processor in the next lab. This assignment is non-trivial so get started early!

This lab assignment is to be completed with your project partners. To help you build a successful machine, we will give you intermediate milestones for this assignment. By Tuesday, April 1st, you must e-mail your TA the spokesperson for your group, the assignments for each team member, and copies of your on-line log thus far. The responsibility of the spokesperson is to communicate questions to the TA and reply to questions from the TA. Please choose a different spokesperson than the one you had for Lab 4.

Problem 0: Team Evaluation for Lab 4.

To help us understand how your team is functioning we require you to evaluate yourself and each of your team members individually.

To evaluate yourself, give us a list of the portions of Lab 4 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).

Next, based on your own observations, evaluate the performance of the other members of your team during the last lab assignment. You do not evaluate yourself. Assume an average team member would receive a score of 20 points. Top performers would receive more, sliders less. There are two constraints your point distribution must satisfy:

1. The total points distributed to your team members must equal 20 times one less than the team size (i.e., 20 points for every team member other than yourself).

2. The maximum score you can assign to any one person is 30 points.

Note that each evaluation should have a one or (at most) two sentence justification for the evaluation.

For example, suppose you were on a 5 person team with the following other members: Sue Superstar, Ole Outstanding, Annie Average, and Ned Neverthere. You would have 80 points (4 times 20) to distribute. Your distribution might look like the following:

Name	Score	Reasoning
Sue Superstar	23	Both Sue and Ole really helped along the group. Sue in particular figured out how to handle interlocks in the pipeline.
Ole Outstanding	23	Both Sue and Ole really helped along the group. Ole figured out how to save 50% of the registers and how to frost donuts in the writeback stage.
Annie Average	20	Annie did a good job.
Ned Neverthere	14	Ned never showed up to group meetings. We ended up reimplementing the one piece that he did give us.
total	80

You should reevaluate your team members after every lab assignment and base your evaluation only on their performance during that lab assignment. These scores will be used for grading. Be honest and fair as you would hope others will be.

Please email this information to your TA by Monday, March 31 at midnight. Each team member must turn a report in (separately).

Problem 1: Pipelining Your Design

Please read this section through completely before starting any of it!

Problem 1a: pipeline it.

For this design, you must use a new RAM file for your instruction and data memories. Note that the Verilog files that we refer to in this lab are all in the m:\lib\high-level\ directory. Start by reading the Readme in m:\lib\high-level\Lab5Help.pdf. The new RAM (called sramblock2048.v) is fully synchronous. This means that you must set up the address and any data to be written before the edge of the clock. Both reads and writes are synchronous in this way. Keep this in mind when you work on your pipeline. One way to view the result is that some of the registers that you see in the pipeline diagrams that we show you (for instance, the PC or the S and D registers) are partially duplicated in the RAM block. This means, for instance, that you still need to keep a separate PC register, but that you also need to route the address from before the PC register (i.e. the PC register's input) to the actual RAM block; on a clock edge, the new address will be clocked into both the PC register and the internal address registers of the RAM.
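To make the "partially duplicated register" idea concrete, here is a minimal sketch of a fetch stage built around a synchronous RAM. The module sync_ram_sketch is only a hypothetical stand-in: the real sramblock2048.v may have different port names and widths, so check Lab5Help.pdf before copying anything. The signal names next_pc and pc_in are also assumptions.

```verilog
// Hypothetical stand-in for a fully synchronous RAM.
// The ports of the real sramblock2048.v may differ.
module sync_ram_sketch (
    input  wire        clk,
    input  wire        we,      // write enable, sampled at the clock edge
    input  wire [10:0] addr,    // word address (2048 words), sampled at the edge
    input  wire [31:0] din,
    output wire [31:0] dout
);
    reg [31:0] mem [0:2047];
    reg [10:0] addr_q;          // internal address register

    always @(posedge clk) begin
        if (we) mem[addr] <= din;
        addr_q <= addr;
    end

    // Data for the address that was captured at the last clock edge
    assign dout = mem[addr_q];
endmodule

module fetch_sketch (
    input  wire        clk,
    input  wire        reset,
    input  wire [31:0] next_pc,      // assumed: computed by your next-PC logic
    output reg  [31:0] pc,
    output wire [31:0] instruction
);
    // The value on the *input* side of the PC register
    wire [31:0] pc_in = reset ? 32'b0 : next_pc;

    always @(posedge clk)
        pc <= pc_in;

    // The RAM is fed pc_in, not pc, so the PC register and the RAM's
    // internal address register load the same value on the same edge.
    sync_ram_sketch imem (
        .clk(clk), .we(1'b0), .addr(pc_in[12:2]),
        .din(32'b0), .dout(instruction)
    );
endmodule
```

With this arrangement, instruction always corresponds to the address currently held in pc, which is what the pipeline diagrams assume.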

You have to implement the following five-stage pipeline:

IF Instruction Fetch: Access the instruction cache for the instruction to be executed.

ID Instruction Decode: Decode the instruction and read the operands from the register file. For branch instructions, calculate the branch target instruction address and compare the registers.

EX Execute: Bypass operands from other pipeline stages. One of the following activities can occur:

For computational instructions, the ALU or the shifter performs the arithmetic or logical instructions.
For load or store instructions, the data address is calculated.

MEM Data Memory Access: Access the data cache for load or store instructions.

WB Write Back: Write result to register file.

Add the appropriate pipeline registers to your single cycle design. Assume that memory accesses take only one cycle in your implementation.

Here are the instructions you must implement:

Type	Instructions
arithmetic	addu, subu, addiu
logical	and, andi, or, ori, xor, xori, lui
shift	sll, sra, srl
compare	slt, slti, sltu, sltiu
control	beq, bne, bgez, bltz, j, jr, jal
data transfer	lw, sw
other	break

You should implement the components you need in Verilog. All Verilog models should correspond to realistic components (e.g. register, comparator, etc). No super-composite components, e.g. a branch unit that takes in the opcode, operands, and PC, and outputs a new PC, or something like that. You should use the delays from Lab 4 for the blocks that you make in Lab 5.

All modules should be rising-edge triggered.

All control instructions should have a single delay slot (i.e. the following instruction is always executed after the control instruction).

Make sure to verify the coding of instructions such as bgez in Appendix A of P&H. Note that the "rt" field is actually used to distinguish bgez and bltz.

These are the only operations that the ALU should support. As in Lab 4, there should be an arithmetic/logical multibit (up to 31 bits) shifter external to the ALU. Furthermore, instructions such as SLT should be handled outside the ALU as well. For SLT, the ALU should subtract the two operands; logic outside the ALU then uses the ALU status flags (Zero, Neg, Ovf) to compute the result, and the correct value is written back to the destination register.
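As a rough illustration of the SLT handling, the signed comparison can be derived from the flags of the subtraction a - b: a is less than b exactly when the sign of the result and the overflow flag disagree. The sketch below covers only the signed case; the module and port names are hypothetical, and it assumes your ALU reports Neg and Ovf for the subtraction.

```verilog
// Sketch only: computes the SLT result from ALU flags of (a - b).
// Module and port names are hypothetical.
module slt_sketch (
    input  wire        neg,         // sign bit of (a - b)
    input  wire        ovf,         // signed overflow of (a - b)
    output wire [31:0] slt_result   // 1 if a < b (signed), else 0
);
    // Without overflow, a < b iff the difference is negative.
    // With overflow, the sign bit is wrong, so it must be inverted.
    assign slt_result = {31'b0, neg ^ ovf};
endmodule
```

The unsigned compares (sltu, sltiu) cannot be derived from Neg and Ovf alone; you would need something like the adder's carry out, which is left for you to work out.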

The break instruction is special. See A-75 of P&H for its coding. Although this is normally an exception-causing instruction, you should treat it more like a halt instruction. After being decoded, the break instruction should freeze the pipeline from advancing further. This means that the PC will not advance further, the break instruction will stay in the decode stage, and later instructions will drain from the pipeline. The proper terminology for this is that the break instruction will "stall" in the decode stage. Assume that there will be a single input signal called "release" that comes from outside. When it is high, you should release a blocked break instruction exactly once (you need to build a small circuit that generates a single, one-clock-cycle pulse when release is high, then ignores release until it goes low again). When we map our pipeline to the board (Problem 4), the break instruction will stop the pipeline and potentially display its code on the LEDs. Further, we will have the option to "unfreeze" the pipeline with a debounced switch.
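The small one-pulse circuit mentioned above could look something like the following sketch. It assumes the incoming signal has already been debounced and synchronized to the processor clock. Note that release is a reserved word in Verilog, so the port is called release_in here; that name, like the module name, is just an assumption.

```verilog
// Sketch of a one-shot: emits a single one-cycle pulse each time
// release_in goes high, then ignores it until it goes low again.
// Assumes release_in is synchronous to clk.
module release_pulse_sketch (
    input  wire clk,
    input  wire reset,
    input  wire release_in,   // "release" itself is a Verilog keyword
    output wire pulse
);
    reg seen;  // remembers that the current high period was already used

    always @(posedge clk)
        if (reset) seen <= 1'b0;
        else       seen <= release_in;

    // High only during the first cycle in which release_in is high
    assign pulse = release_in & ~seen;
endmodule
```

The pulse output would then be used to let the stalled break instruction leave the decode stage.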

Make sure to produce an 8-bit output signal (STAT) defined as follows: if the processor is not stopped, STAT=0. Otherwise, the high bit (bit 7) of STAT = 1 and the low 7 bits = the low 7 bits of the break code (the break code is in bits 6-25 of the break instruction). Whenever this signal changes, make sure that you have a monitor output that prints "STATUS Changed: 0xvalue" on the console.
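One possible shape for the STAT logic is sketched below. The inputs stopped and id_instruction are hypothetical names for "a break is currently holding the decode stage" and "the instruction word in decode". The translate_off/translate_on comments are the usual way to hide simulation-only code from the synthesizer, but check the synplify manual for the exact directive your tool expects.

```verilog
// Sketch of the STAT output. Signal names are assumptions.
module stat_sketch (
    input  wire        stopped,         // 1 while a break is stalled in decode
    input  wire [31:0] id_instruction,  // instruction word in the decode stage
    output wire [7:0]  STAT
);
    // The break code occupies instruction bits 25:6,
    // so its low 7 bits are instruction bits 12:6.
    assign STAT = stopped ? {1'b1, id_instruction[12:6]} : 8'h00;

    // synthesis translate_off
    always @(STAT)
        $display("STATUS Changed: 0x%h", STAT);
    // synthesis translate_on
endmodule
```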

Problem 1b: Memory-Mapped Input/Output

Since we are interested in displaying information on the HEX LEDs, and getting information from switches, we need to have I/O for our basic processor. Build a memory-mapped I/O module. Whenever the processor writes to addresses 0xFFFFFFF0 - 0xFFFFFFFC (i.e. 4 words at top of address space), don't write to your data memory. Instead, write to the I/O module. Further, when reading from this address space, you will not actually read data from memory. Instead, you will read data from the I/O module.

Assume that the difference between regular memory and I/O memory is completely selected by the high bit of the address (sign bit). This means that writes to your data memory will be disabled when the high bit of the address is set. It also means that reads from the data memory will be ignored when the high bit is set and instead data will be gotten from the I/O module (hint: think a 32-bit mux at output of memory controlled by high bit).

Build the I/O module. It should have a 32-bit address and 32-bit data input and a 32-bit data output for the processor, just like memory. It should also have 2 I/O buses: a 32-bit input data bus and a 32-bit output data bus and a 1-bit output selector. Other control signals are probably necessary as well. Internally, this I/O module should have two 32-bit I/O registers.

Behavior is as follows: Writes to 0xFFFFFFF0 go to one 32-bit register (call it DP0). Writes to 0xFFFFFFF4 go to the other (call it DP1). Writes to 0xFFFFFFF8 or 0xFFFFFFFC are ignored. Reads from any of these addresses come directly from the input I/O bus. The output I/O bus will be DP0 if the output selector is 0, or DP1 if the selector is 1.
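The behavior above might be sketched as follows. This is one possible arrangement, not the required one: all module and port names are hypothetical, the registers are assumed to reset to zero, and the console/trace-file output described later in this problem is left out.

```verilog
// Sketch of the memory-mapped I/O module. Names are assumptions.
module io_module_sketch (
    input  wire        clk,
    input  wire        reset,
    input  wire        we,       // assumed: a store is in the MEM stage
    input  wire [31:0] addr,
    input  wire [31:0] din,      // store data from the processor
    output wire [31:0] dout,     // load data back to the processor
    input  wire [31:0] io_in,    // input I/O bus
    output wire [31:0] io_out,   // output I/O bus
    input  wire        out_sel   // output selector
);
    reg [31:0] DP0, DP1;

    always @(posedge clk)
        if (reset) begin
            DP0 <= 32'b0;
            DP1 <= 32'b0;
        end else if (we && addr[31]) begin   // I/O space = high address bit
            case (addr[3:2])
                2'b00:   DP0 <= din;         // 0xFFFFFFF0
                2'b01:   DP1 <= din;         // 0xFFFFFFF4
                default: ;                   // 0xFFFFFFF8/C: ignored
            endcase
        end

    assign dout   = io_in;                   // reads come from the input bus
    assign io_out = out_sel ? DP1 : DP0;
endmodule

// Sketch of the memory/I-O selection controlled by the high address bit.
module mem_io_select_sketch (
    input  wire        addr_hi,    // address bit 31
    input  wire        store,      // assumed: a store is in the MEM stage
    input  wire [31:0] mem_dout,
    input  wire [31:0] io_dout,
    output wire        mem_we,
    output wire [31:0] load_data
);
    assign mem_we    = store & ~addr_hi;     // no memory writes in I/O space
    assign load_data = addr_hi ? io_dout : mem_dout;
endmodule
```

Keep in mind that with a synchronous RAM the read data shows up after the clock edge, so the mux select may need to be registered to line up with it.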

Note that you can read/write I/O space with normal loads and stores with negative offsets:

      lw $r1, -16($r0)    ; Read input => $r1
      sw $r7, -16($r0)    ; Write $r7 to DP0
      sw $r8, -12($r0)    ; Write $r8 to DP1

This works because offsets are sign-extended. Thus, for instance, -12($r0) means address 0xFFFFFFF4.
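The sign extension can be checked with a tiny simulation like the sketch below. The module names are hypothetical; the only thing being shown is that a 16-bit offset of -12 (0xFFF4) added to a base of zero lands on 0xFFFFFFF4.

```verilog
// Sketch: effective address = base + sign-extended 16-bit offset.
module ea_sketch (
    input  wire [31:0] base,
    input  wire [15:0] imm16,
    output wire [31:0] addr
);
    assign addr = base + {{16{imm16[15]}}, imm16};
endmodule

module ea_sketch_tb;
    reg  [31:0] base;
    reg  [15:0] imm16;
    wire [31:0] addr;

    ea_sketch dut (.base(base), .imm16(imm16), .addr(addr));

    initial begin
        base  = 32'b0;      // $r0
        imm16 = 16'hFFF4;   // -12 as a 16-bit two's complement value
        #1 $display("%h", addr);   // prints fffffff4
    end
endmodule
```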

Finally, within this module (marked to be non-synthesized -- see synplify manual), output a message to the console whenever a change is written (i.e. something like: "I/O Write to DP0: 0x44455523"). This message should also be written to a file called "iooutput.trace". Further, whenever the module inputs a value, arrange for that value to come from the next entry in an input file called "ioinput.trace".

Problem 1c: update your monitor module

Make sure to update your disassembly monitoring module from Lab4:

Add the new instructions
Make sure to account for pipeline effects.

To do the latter, don't output instructions until they have reached the memory stage (since you won't be able to print out load instructions until the memory stage where you finally know the value of the destination register). In order to do this, introduce a number of signal arrays in your monitor which hold on to values until they are needed. For instance, to hold onto the instruction itself, you would have a series of statements like:

EXCinstruction <= instruction;       // instruction moving from decode to execute
MEMinstruction <= EXCinstruction;    // same instruction, one clock later

In this way, you have the value of the instruction word when it is at the end of the memory stage. This is like a mini pipeline. Values of input registers wouldn't have to be kept as long, etc. Think through this carefully...
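A slightly fuller version of this mini pipeline is sketched below. The input names (id_instruction, id_pc, mem_result) are hypothetical, the printout is a placeholder for your real disassembly code, and stalls are ignored: when the real pipeline stalls, these shadow registers would have to stall with it.

```verilog
// synthesis translate_off
// Sketch of a monitor that delays values until the memory stage.
// All port names are assumptions.
module monitor_delay_sketch (
    input wire        clk,
    input wire [31:0] id_instruction,  // instruction word in decode
    input wire [31:0] id_pc,           // PC of that instruction
    input wire [31:0] mem_result       // value known by the end of MEM
);
    reg [31:0] EXCinstruction, MEMinstruction;
    reg [31:0] EXCpc, MEMpc;

    // Shadow pipeline: each value moves one stage per clock
    always @(posedge clk) begin
        EXCinstruction <= id_instruction;
        MEMinstruction <= EXCinstruction;
        EXCpc          <= id_pc;
        MEMpc          <= EXCpc;
    end

    // Print late in the cycle, after the MEM-stage values have settled
    always @(negedge clk)
        $display("PC=0x%h instr=0x%h result=0x%h",
                 MEMpc, MEMinstruction, mem_result);
endmodule
// synthesis translate_on
```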

Problem 1d: Top-level module integration: chip mapping

As with Lab4, you will have a top-level schematic module that ties everything together. Now, however, you will have several I/O pins left over: a clock net, 1 reset signal, 1 release signal (for break instructions), 1 output select signal (from the I/O module), 1 8-bit output (from the break logic), 1 32-bit output (I/O) and 32-bit input (I/O).

Use the TopLevel.v module in m:\lib\high-level\ as the top-level integration for your design. Briefly read the description of the Calinx boards (manual off the handouts page) to see what the various pins mean. You will be modifying the verilog for this top-level module to integrate all the pieces you need.

You should assume that the following is true:

There are 2 sets of 4 pushbuttons. We will only use group 1 (although you are free to use the others if you wish). Button 1 of the first set should be the RESET signal. Button 2 of the first set should be used to release the break instruction. Button 3 of the first set should be used to select the output of your I/O module. Button 4 will be a special "SINGLE_CLOCK" signal. We will mention this below. Since these are buttons, they will be naturally bouncy, so you should include the debouncer module from common_mods.v in m:\lib\high-level\. Assume that the clock for these debouncing modules comes from the LAB_CLK on the chip.

The break instruction outputs an 8-bit signal called STAT. This should be mapped to the 8 individual LEDs.

The output from your memory-mapped I/O module should go to the HEX LEDs. To drive these properly, you need to use the HEX decoder module available in common_mods.v.

The input of the I/O module should come from the first group of 8-bit dip switches. Assume that the value on these switches goes to the lowest 8-bits of the input bus and that the top 24-bits are set to zero. If you like (possibly a good idea), you can consider switching in the current PC to the HEX leds when a switch is set. Consider the second switch of the second set of 8 dipswitches as controlling this. You can use other switches to indicate what is going to be displayed other than the normal I/O.

Finally, the CLOCK net for your pipeline should be connected either to the LAB_CLK net you chose on the XILINX board or to your debounced clock. Let the first switch of the second set of 8 dipswitches be the choice (call this signal "CLK_SOURCE"):

assign processor_clock = CLK_SOURCE ? LAB_CLK : SINGLE_CLOCK;

Problem 1e: Initial testing

Build one or more test-bench modules around the FPGA top-level module above that provides clock (as in LAB 4), and prints output to the console when I/O changes, and perhaps "pushes" the buttons for testing. Note that debouncing of the switches is tricky when interacting with the clock. Think carefully before trying to test the single-stepping clock feature. You may also want to design test-benches to wrap around your processor module (inside the FPGA top-level module) just to make things easier to test.

To test your processor, you should write diagnostic programs, similar to those that you wrote in lab 2 (broken spim). Remember, the instructions that you are implementing here are still not the complete MIPS set, so you still cannot run your lab 2 programs yet. The general structure of the programs should be some calculations, data manipulations, etc, followed by SW's to the main memory. Then, at the end of the program, and at points during the execution, you can dump memory to show that the correct values were correctly written to memory.

Think carefully about the I/O features. How will you test these? For demonstrating your pipeline in simulation, create a test module that recognizes when break has been asserted, waits 10 cycles, then asserts the release line -- printing something to the console in the process. This will allow us to run programs that utilize the I/O features to output results.
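The break-release test module described above might be sketched like this. It is simulation-only code. It assumes that bit 7 of STAT indicates a stopped processor (as in Problem 1a) and that your one-pulse circuit listens to a signal here called release_in; both the module name and that port name are assumptions.

```verilog
// Sketch of a simulation-only helper: when the processor stops on a
// break, wait 10 cycles, then raise release_in for one cycle.
module break_release_tb_sketch (
    input  wire       clk,
    input  wire [7:0] STAT,
    output reg        release_in
);
    integer count;

    initial begin
        release_in = 1'b0;
        count      = 0;
    end

    always @(posedge clk) begin
        if (STAT[7] && !release_in) begin
            if (count == 10) begin
                $display("Break code 0x%h seen, asserting release", STAT[6:0]);
                release_in <= 1'b1;
                count      <= 0;
            end else begin
                count <= count + 1;
            end
        end else begin
            release_in <= 1'b0;   // drop release after one cycle
            count      <= 0;
        end
    end
endmodule
```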

Keep in mind that you haven't handled hazards yet (Problem 3), so you must be careful that your test programs don't try to use values too soon after they are generated.

Problem 2: Pipelining Gain

Calculate the cycle time for your pipelined processor. In your writeup, compare the cycle time from Lab 4 to that in Lab 5. Is this what we would have expected from our knowledge of pipelining? If we just took our Lab 4 single-cycle processor and added pipeline registers at key points, we would expect the cycle time to equal the delay through the longest block (ALU? Next PC? Memory?), so that the clock rate is the inverse of that delay. Is this the performance that you were able to achieve? Why?

Problem 3: Handling Hazards

Problem 3a: Handle hazards

Here is a list of the hazards you must handle:

Data Hazards: Data hazards occur when the data produced by an instruction is used as an operand by subsequent instructions.

The ALU is one source of data hazards. Add the necessary forwarding logic and buses so that the result of an ALU operation can immediately be used by the following three instructions without waiting for the data to be written in the register file.
Load instructions also present a data hazard because the data is not available until the end of the MEM phase. The MIPS instruction set specifies that load instructions have a one cycle delay. That means that the compiler cannot generate code sequences where the data of a load will be used during the following cycle. Implement your datapath so that it has one load delay slot.

Control Hazards: Branch instructions present a common case of control hazards. Comply with the MIPS instruction set definition and implement your processor so that it has one branch delay slot that is always executed regardless of the result of the branch.
Structural Hazards: Are there any in this design? If so, explain what they are and how you are handling them. If not, why not?

Write or modify programs to test all the different hazard cases. Remember that hazards do not necessarily occur between two adjacent instructions. They can happen between two instructions that are separated by another instruction (or two?). Consider the following lines of code:

ADD    $1, $2, $3
SLL    $5, $6, 2
SUB    $6, $1, $7

The ADD and SUB instructions have a data hazard, yet there is an SLL between them. Be sure to check these kinds of cases.
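For the ALU forwarding in Problem 3a, the select logic for one EX-stage operand could be sketched as below. All names are hypothetical. This covers results from the two instructions immediately ahead (currently in MEM and WB); the third following instruction still needs attention, for example a register file that writes before it reads or one more bypass path. Since branches compare registers in the decode stage, they may need their own forwarding as well.

```verilog
// Sketch of forwarding selection for one EX-stage operand.
// Module and port names are assumptions.
module forward_sel_sketch (
    input  wire [4:0]  ex_rs,           // source register of instruction in EX
    input  wire [31:0] ex_regval,       // value read from the register file
    input  wire        mem_reg_write,   // instruction in MEM writes a register
    input  wire [4:0]  mem_dest,
    input  wire [31:0] mem_alu_result,
    input  wire        wb_reg_write,    // instruction in WB writes a register
    input  wire [4:0]  wb_dest,
    input  wire [31:0] wb_result,
    output reg  [31:0] ex_operand
);
    always @(*) begin
        // The most recent producer wins; register 0 is never forwarded
        if (mem_reg_write && mem_dest != 5'd0 && mem_dest == ex_rs)
            ex_operand = mem_alu_result;
        else if (wb_reg_write && wb_dest != 5'd0 && wb_dest == ex_rs)
            ex_operand = wb_result;
        else
            ex_operand = ex_regval;
    end
endmodule
```

You would need one of these per source operand.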

Problem 3b: Pipeline interlocks/Interlocking Loads

Now that you have basic hazards dealt with, you should figure out how to handle pipeline stalls. Your current processor deals with the load delay slot in the same way as the original version of MIPS: If the compiler generates a code sequence in which a value is loaded from memory and used by the next instruction, the following instruction gets the wrong value. Of course, the instruction specification explicitly disallowed such code sequences; if no other options were available, the compiler would have to introduce a nop in the load delay slot to avoid getting the wrong answer.

As your final exercise, introduce a pipeline stall so that a value can be used by the compiler in the very next cycle after it is loaded from memory. This feature was added to later versions of the MIPS instruction set. To be clear, we want the following code sequence to do the "obvious" thing, i.e. the add should use the value loaded from memory:

LW $1, 4($2)
ADD $2, $1, $3

Make sure to rerun all of your tests from part 3a to verify that you haven't broken anything. Write tests that try several different distances between loads and their following values. Hint: the mechanism for this single-cycle stall is very similar to what you need for the break instruction...
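The detection half of the load interlock might be sketched as follows. The names are hypothetical, and id_uses_rs/id_uses_rt are assumed to come from your decoder (not every instruction reads both registers). When stall is high, the PC and the fetched instruction are held for one cycle and a bubble is sent into the execute stage.

```verilog
// Sketch of load-use hazard detection. Names are assumptions.
module load_use_stall_sketch (
    input  wire       ex_is_load,   // instruction in EX is a lw
    input  wire [4:0] ex_rt,        // destination register of that lw
    input  wire [4:0] id_rs,        // source registers of instruction in ID
    input  wire [4:0] id_rt,
    input  wire       id_uses_rs,   // instruction in ID actually reads rs
    input  wire       id_uses_rt,   // instruction in ID actually reads rt
    output wire       stall
);
    wire hit_rs = id_uses_rs && (id_rs == ex_rt);
    wire hit_rt = id_uses_rt && (id_rt == ex_rt);

    // Loads into register 0 never cause a stall
    assign stall = ex_is_load && (ex_rt != 5'd0) && (hit_rs || hit_rt);
endmodule
```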

Problem 4: Map it to the board

As a last step, map your processor design directly down to the boards. Make sure to read the information in Lab5Help.pdf about changes to get versions of the RAMs that map to the on-chip XILINX boards. Also, until we fully integrate this information into our readme files, please check out the information about the tool flow from the CS150 class: http://inst.eecs.berkeley.edu/~cs150/handouts/3/Lab/lab2Writeup.pdf and http://inst.eecs.berkeley.edu/~cs150/handouts/3/Lab/lab2slides.pdf. You should be using the same design flow.

Note that you should be able to put the processor in single-step mode (first dipswitch of second set put to zero). Then you should be able to use push-button #4 as a single-stepping clock. You should also be able to spread break instructions in your code and debug code this way. Note also that you should be able to use a loop at the very end of your execution with a combination of break instructions and writes to the I/O (address to 0xFFFFFFF0, data to 0xFFFFFFF4) to dump the contents of your memory to the HEX display when you are done. Make sure that this works!

Make sure that the RESET line causes important processor state to be reset! Remember that "initial" blocks in Verilog will be ignored by the synthesizer. Many bugs can be introduced when registers contain random initial state! One obvious thing that must be reset is the PC. Are there other things?

Don't try to debug everything at once. Start with extremely simple examples. Possibly divert the HEX LEDs to display PC information during debugging (feel free to divert other things as well). For instance, what about a simple program with break as the first instruction and a bunch of nops? Can you get that to work? What about simple I/O examples? If you make sure that your simplest debugging mechanisms work, then you can move on to more complicated examples.

Please include information in your writeup about the total number of FPGA slices used for your design and the fraction of the Xilinx part that has been used for your design. This information should be available in the log files post place-and-route.

Extra Credit: Add a multiplier unit

For extra credit, add a multiplier unit to your design. This means adding three new instructions: multu, mflo, mfhi.
In keeping with the original MIPS multiplier, make a separate control machine that takes the multiplier and multiplicand during the execute stage and starts executing. Sometime later (independent of the rest of the pipeline), it will finish. This means that:

Your HI and LO registers should be contained in your multiplier
In the decode stage, you should stall the pipeline on multu, mflo, or mfhi if the multiplier is still running. (Actually, if you are clever, you can stall if it will still be executing on the next cycle, since that is when we would try to start another multiply/need to bypass the ALU with the value of the HI or LO register).
Since you are doing an unsigned multiply, make sure to do the right thing with respect to carry out of your adder (you can use an ALU that is hard-wired to do add).

Can you make your multiplier take only 32 or 33 cycles? This means combining the conditional addition phase with a shift.
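One way to combine the conditional add with the shift is sketched below: the 33-bit sum (including the carry out) and the upper bits of LO are shifted into {HI, LO} in a single step. Names are hypothetical, and start is assumed to be a one-cycle pulse that the decode-stage stall never issues while busy is high. As written this takes one cycle to load plus 32 cycles to iterate.

```verilog
// Sketch of an unsigned shift-add multiplier with HI and LO inside.
// Module and port names are assumptions.
module multu_sketch (
    input  wire        clk,
    input  wire        reset,
    input  wire        start,         // one-cycle pulse, operands valid
    input  wire [31:0] multiplicand,
    input  wire [31:0] multiplier,
    output reg  [31:0] HI,
    output reg  [31:0] LO,
    output wire        busy
);
    reg [31:0] mcand;
    reg [5:0]  count;                 // iterations remaining

    // 33 bits wide so the carry out of the 32-bit add is kept
    wire [32:0] sum = {1'b0, HI} + (LO[0] ? {1'b0, mcand} : 33'b0);

    assign busy = (count != 6'd0);

    always @(posedge clk) begin
        if (reset) begin
            HI    <= 32'b0;
            LO    <= 32'b0;
            mcand <= 32'b0;
            count <= 6'd0;
        end else if (start) begin
            HI    <= 32'b0;
            LO    <= multiplier;      // multiplier bits are consumed from LO[0]
            mcand <= multiplicand;
            count <= 6'd32;
        end else if (busy) begin
            {HI, LO} <= {sum, LO[31:1]};   // add and shift right together
            count    <= count - 6'd1;
        end
    end
endmodule
```

When busy drops, {HI, LO} holds the 64-bit product.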

Make sure to test this with a number of different distances between the multu instruction and mflo or mfhi instructions.

PLEASE NOTE: Credit will only be given to WORKING extra credit! Therefore, DO NOT start on extra credit until you have a WORKING standard design!

Wrap it up: Turn in a copy of your Verilog code, processor schematic, diagnostic program(s) and your on-line logs. Also turn in simulation logs that show correct operation of the processor. These logs should show the operations that were performed, and then the contents of memory with the correct values in it. Do not turn in waveforms.

As part of your writeup, explain to us your testing philosophy. Why do you think that you have a working processor?

How much time did your team spend on this assignment?
