Homework #6 / Lab #6
Cache and Main Memory

CS152 - Computer Architecture
Spring 2003, Prof John Kubiatowicz

Homework 6 due Wednesday 4/23  in class.  There will be a short quiz in class on that day.
Problem 0: (Team Evaluation) due by Wednesday 4/16 at Midnight via EMail to your TA.
Lab organizational EMail due to TA by Midnight Wednesday 4/16.
Lab 6 due Tuesday 4/29 by Midnight via the Submit program and demoed to TA by that day as well.

Like Lab 5, this is a long lab!  MAKE SURE TO START NOW!



Please put the TIME or TA NAME of the DISCUSSION section that you attend as well as your NAME and STUDENT ID. Homeworks and labs will be handed back in discussion.

Homework Policy: Homework assignments are due in class. No late homeworks will be accepted. There will be a short quiz in lecture the day the assignment is due; the quiz will be based on the homework. Study groups are encouraged, but what you turn in must be your own work.

Lab Policy: Lab reports are due by midnight via the submit program. No late labs will be accepted.

As decided in class, the penalty for cheating on homework or labs is no credit for the full assignment.


Homework 6: Please do the following problems from P&H: 7.7, 7.9, 7.11, 7.15, 7.16, 7.18, 7.23, 7.27, 7.32, 7.36, 8.8, 8.13, 8.19, 8.30
Homework assignments should continue to be done individually.


Lab 6:

In this lab, you will be designing a memory system for your pipelined processor. The previous memory module was far from practical, and you will never have separate, dedicated DRAM banks for instructions and data. Using a realistic main memory system will cause two problems in your pipelined processor: (1) your cycle time will dramatically increase as a result of the main memory write and read latency and (2) you must handle conflicts when both data and instruction accesses occur in the same clock cycle. As you most likely have learned in lecture and in the book, the solution to these problems is the addition of cache memory.

All code and help files for this lab are (as usual) in the m:\lib\high-level directory, including a new TopLevel.v file (with some sample things in it) and a DRAM simulation called mt48lc8m16a2.v. Further, there is a Lab6Help.pdf file.  You probably want to use the debouncer in common_mods.v.

There are some extra-credit options at the end of the lab.  If you choose to try these extra-credit options, they may significantly affect your cache architecture.  So, make sure to read through the whole lab first. If you choose to build the extra credit options, you are not required to demonstrate a working version with the base parameters.

Note that Lab6 builds on Lab5.  Thus, unless otherwise noted, you should have the same features as the previous lab.  We are adding to the functionality.

THIS LAB CAN BE EXTREMELY DIFFICULT, SO AN EARLY START WOULD BE A VERY GOOD IDEA.



Problem 0: Team Evaluation for Lab 5

As before, you will be evaluating your own and your team members' performance on the last lab assignment (Lab #5). Remember that points are not earned only by doing work, but by how well you work in a team. So if one person does all the work, that certainly does not mean he/she should get all the points!

To evaluate yourself, give us a list of the portions of Lab 5 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).

You may give a total of 20 points per person in the group (besides yourself), but do not give any more than 30 points to one person. Submit your evaluations to your TA by email as soon as possible.  Make sure to include a one or two line justification of your evaluation.

Email the result to your TA by Wednesday 4/16 at Midnight.



Problem 1: Producing a memory controller

NOTE: Read these instructions carefully. They contain all the information you will need to write your report for this lab, including some requirements not directly related to the cache and memory additions!

The first step in creating a new memory system is changing the model for our main memory bank (hereafter called the DRAM). We will be using a model of the 128-megabit DRAM from Micron that is on the hardware boards.  Please start by looking over the DRAM spec off the handouts page (http://inst.eecs.berkeley.edu/~cs152/handouts/128msdram_e.pdf).  We will be using RAMs in the 2 Meg x 16 x 4-bank configuration.  Further, the board has two side-by-side for a total of 8 megabytes of RAM (2 Meg x 32 x 4 banks).
All of the control signals for the two physical RAMs are ganged together as if you had a single 256-megabit DRAM.

There is a model for one of the physical DRAMs called mt48lc8m16a2.v in the m:\lib\high-level\ directory.  You should use this for simulation.  Note that you will be building a boardlevel.v module that includes TopLevel.v and a couple of DRAMs for your design.

For this problem, design a DRAM controller which has the following entity specification:

module memory_control (RESET,PROCCLK,request,r_w,address,data_in,data_out,mem_wait,
                       DRAMCLK,Data,Addr,Ba,CKE,CS,RAS,CAS,WE,DQM,CAS_TIME);
    input RESET;              // System reset
    input PROCCLK;            // Processor clock
    input request;            // Set to 1 when processor making request
    input r_w;                // Set to 1 when processor writing
    input [22:0] address;     // 23 bits = 8 megabytes of *words*
    input [31:0] data_in;     // input data
    output [31:0] data_out;   // output data
    output mem_wait;          // Tells the processor to wait for next value
                              // (the "wait" signal in the text; renamed here
                              // because "wait" is a reserved word in Verilog)

    input  DRAMCLK;           // Clock used to control the DRAM
    inout  [31:0] Data;       // 32 bits of data bus (bi-directional)
    output [11:0] Addr;       // 12-bit Row/Column address
    output [1:0] Ba;          // Bank address
    output CKE,CS,RAS,CAS,WE; // Clock enable and DRAM command control
    output [1:0] DQM;         // Data control (shared between DRAMs, a bit strange!)
    input  CAS_TIME;          // CAS latency select (declaration missing in the
                              // original handout; direction and width assumed)

    // Your code here

endmodule // memory_control

The processor part of this interface is very simple.   When the "request" line is asserted on a rising edge, it is assumed that the "address" and "r_w" signals are stable.  In addition, if the "r_w" signal is asserted (equal to "1"), it is then assumed that a write will happen, and the "data_in" signal is stable as well.  Immediately afterwards, the "wait" output will be asserted for every cycle until the final cycle of the access (accesses will take more than one cycle).  If a read is requested, then you should arrange so that the read data is ready (on the "data_out" bus) on the same clock edge that "wait" is deasserted.

After the final cycle (falling edge with "wait" deasserted), the processor must set "request" to low or risk starting another access (you have several timing options here -- choose them carefully).  One possible timing for a processor read is shown here:

Figure 1: Processor Read Timing (sample)

Keep in mind that this memory module is word addressed (i.e. the DRAM data bus produces a complete 32-bit word).  Don't forget to adjust your processor addresses accordingly (i.e. processor addresses are byte-oriented).
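Since the controller is word addressed, the byte-to-word conversion is just a matter of dropping the two low-order address bits. A minimal sketch of the glue logic (the signal names here are illustrative, not part of the required interface):

```verilog
// Hypothetical glue logic: the processor's byte-oriented address
// drops its two low-order bits to form the 23-bit word address.
wire [31:0] proc_byte_addr;   // byte address from the processor
wire [22:0] mem_word_addr;    // word address for memory_control

assign mem_word_addr = proc_byte_addr[24:2];   // bits [1:0] are the byte offset
```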

Problem 1a: READ and WRITE

Now, consider the interface to the DRAM.  Make sure to read the DRAM specification.  Although this DRAM has many options, we are going to start with a very simple interface that reads or writes one 32-bit word for each activation.  Extra-credit options allow you to get more sophisticated.  In this section, you should control the DRAM with the DRAMCLK signal.  Assume that the processor interface may be running single stepped (the PROCCLK may take many DRAMCLKs).  Keep this in mind throughout your design!

While the processor interface has a 23-bit address, the DRAM interface has only a 12-bit address + 2-bit bank select.  This is because the DRAM takes its address in two pieces: the ROW/Bank address (the top 14 bits) and the COLUMN address (the bottom 9 bits).  The four primary DRAM control signals, CS, RAS, CAS, and WE, together combine to form a "DRAM Command" as follows (from page 12 of the manual):
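One possible decomposition of the 23-bit word address is shown below. The exact placement of the bank bits within the top 14 bits is a design choice (check the spec and the board wiring before committing to it):

```verilog
// Hypothetical split of the 23-bit word address into bank, row, and
// column fields.  Bank placement here is an assumption, not a requirement.
wire [22:0] address;
wire [1:0]  bank   = address[22:21];   // 2-bit bank select
wire [11:0] row    = address[20:9];    // 12-bit row address
wire [8:0]  column = address[8:0];     // 9-bit column address
```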
 
 

Figure 2: Command Encodings (Page 12)

For basic functionality, we will only use the ACTIVE, READ, WRITE, and REFRESH commands.  You can use others for extra credit.

Design a state machine that takes processor commands and sequences the DRAM to perform basic READ and WRITE operations.   We will be talking more about DRAMs in class, but the key timing diagrams are shown in the following figures:

Figure 3: DRAM Read Timing (Page 45)

 
Figure 4: DRAM Write Timing (Page 52)

Notice, in particular, that we are using the "auto-precharge" feature of the DRAM.  This means that we are not (at least for read or write) issuing PRECHARGE commands.

Note that we will be designing with a 2-cycle CAS latency.  Also, assume that other timing parameters are adjusted appropriately.  Note that these times are given in the spec at the bottom of the referenced pages.  Consider, for instance, T_RCD.  In the spec it is quoted as 20 ns for the slower part; with a clock of 25 MHz, this is easily handled by 1 cycle.  At 100 MHz, you would need 2 cycles.

Given the above timing, the minimum number of edges for a DRAM access becomes three at 25 MHz: one for the ACTIVE command, two for CAS latency, and one for the AutoPrecharge (which is overlapped with the last cycle of the CAS latency).  If your clock cycle time is too short (as it would be if you optimized your processor well), then you need to use more cycles than the minimum.  Further, since you probably won't know exactly how long your clock will be (or may be varying it during debugging), you may want to figure out how to lengthen the number of cycles for events automatically when the clock is too short.  (Hint -- look at the time between successive clock edges and add additional cycles until you have met each spec).  Note also that your controller must be able to handle both "FAST" and "SLOW" DRAMs.
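The hint above amounts to ceiling division of each spec time by the clock period. If the clock period is known at elaboration time, the cycle counts can be derived with parameters; the parameter names below are illustrative, not from the handout:

```verilog
// Sketch: derive the DRAMCLK cycle count needed to satisfy a timing
// parameter by ceiling division.  Parameter names are illustrative.
parameter CLK_PERIOD_NS = 40;   // example: 25 MHz DRAMCLK
parameter T_RCD_NS      = 20;   // ACTIVE-to-READ/WRITE delay (slow part)

localparam TRCD_CYCLES = (T_RCD_NS + CLK_PERIOD_NS - 1) / CLK_PERIOD_NS;
// at 25 MHz: (20+39)/40 = 1 cycle;  at 100 MHz (10 ns): (20+9)/10 = 2 cycles
```

Measuring the clock period dynamically in simulation (the time between successive edges) lets the same controller stretch its counts when the clock is shortened during debugging.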

It is important to note that the DRAM data bus is bidirectional. Thus, if you are writing to the DRAM, data should be driven on the data bus. If you are reading, the data bus should be left in a high-impedance state (this can be accomplished by assigning the bus the value 32'hzzzzzzzz).
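This is the standard Verilog conditional-drive idiom for an inout port. A sketch, assuming a drive_data control signal asserted by your state machine during the data phase of a write (the signal names are not from the handout):

```verilog
// Inside memory_control: drive the bidirectional bus only on writes,
// otherwise release it to high impedance so the DRAM can drive it.
reg         drive_data;    // asserted by the controller state machine on writes
reg  [31:0] write_data;    // word being written to the DRAM

assign Data = drive_data ? write_data : 32'hzzzzzzzz;
wire [31:0] read_data = Data;   // value sampled from the bus during reads
```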

When writing up this part of the problem, make sure to include timing diagrams for your processor interface (read and write), as well as your DRAM interface (read and write).  Describe the state-machine that you used to control your DRAM and include the code for your controller. Explain your methodology for testing of the DRAM.

Problem 1b: Initialization

You need to be able to perform initialization properly after reset.  Among other things, you need to set the mode register to "0b00 1 00 010 0 000": This means Write Burst=only 1, CAS latency = 2, Burst type = Sequential and Burst length = 1.
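The mode-register value can be assembled field by field so that each setting stays visible in the code. A sketch (the constant name is illustrative):

```verilog
// Mode-register value from the handout, field by field:
// M11:10 reserved, M9 write burst = single location, M8:7 operating mode,
// M6:4 CAS latency = 2, M3 burst type = sequential, M2:0 burst length = 1.
localparam [11:0] MODE_REG = {2'b00, 1'b1, 2'b00, 3'b010, 1'b0, 3'b000};
// Drive MODE_REG on Addr[11:0] while issuing the LOAD MODE REGISTER command.
```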

Make sure to read the section on initialization.  Following is the Initialization diagram:
 
 

Figure 5: Initialization (Page 37)

This initialization timing should be triggered by a system reset.  During the reset (which will be many cycles), you should ignore all processor requests (i.e. the WAIT line will be asserted).

Problem 1c: REFRESH

Finally, your DRAM must periodically execute refresh cycles.  Your controller must somehow execute 4096 AUTO REFRESH commands every 64 ms.  One way to do this is to have a counter that starts at a value of 64 ms/DRAMCLKPERIOD/4096 and counts down to zero over and over again.  Every time it hits zero, you increment a counter of "missed refreshes" (this count can be small, just a few bits worth).

Every time your DRAM controller hits idle (all banks precharged), it checks the count of pending refreshes and performs AUTO REFRESH cycles until this count is decremented to zero.
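The counter scheme above might be sketched as follows. REFRESH_INTERVAL and the register names are illustrative; the example value assumes a 25 MHz DRAMCLK (64 ms / 40 ns / 4096 ≈ 390):

```verilog
// Sketch of the refresh bookkeeping: an interval counter that reloads
// itself, plus a small count of refreshes owed to the DRAM.
parameter REFRESH_INTERVAL = 390;   // 64 ms / DRAMCLK period / 4096 (25 MHz example)

reg [15:0] interval_ctr;
reg [3:0]  pending_refreshes;       // the "missed refreshes" count

always @(posedge DRAMCLK) begin
  if (RESET) begin
    interval_ctr      <= REFRESH_INTERVAL;
    pending_refreshes <= 0;
  end else if (interval_ctr == 0) begin
    interval_ctr      <= REFRESH_INTERVAL;
    pending_refreshes <= pending_refreshes + 1;  // decremented as AUTO REFRESH issues
  end else
    interval_ctr      <= interval_ctr - 1;
end
```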

Here is the simple refresh diagram:
 
 

Figure 6: Refresh (Page 40)

Problem 1d: Testing

How will you test your controller?  Come up with test benches to handle as many conditions as you can imagine.  Be careful.  Assume that the processor may request things at a much slower clock rate than the DRAM clock.  However, you can assume that the processor clock is synchronized to the DRAM clock, i.e. that the processor clock gets an edge only right after a DRAM clock edge!

How can you make sure that refresh will happen even if the DRAM controller is always busy?
 

Problem 2: Designing your Cache

The next step is to design your cache. You should design only one cache module, which will be duplicated and used for both data and instruction accesses. This means you should not for any reason design your data cache differently than your instruction cache, regardless of any potential performance benefits you may find. The cache you design must have the following properties (unless you do extra credit, mentioned below): it should be direct-mapped, with a total size of 8K bytes and 8-word cache lines, and it should use a write-through policy together with the write buffer described in Problem 2b.

Any of the components listed above that you create must have realistic delay values. You must include a comprehensive description of your delay estimates for the project (meaning you should only need to add new components to your previous list). You do not need to go into great detail (we don't want to hear about transistors and circuit design style!), but a simple statement of the component's structure and the size and number of gate levels is enough. You will not receive more credit for pretending to design and use really fast components (that is better appreciated in EE141).  Use reasonable simulation delays for any new basic components.

Problem 2a: TAGs file

Build the TAGs file for the cache.  Make sure that you can reset all the valid bits to zero after reset.  You are allowed to use smaller SRAM components for tags, etc, if you wish (make sure that they compile properly to SRAMs that work with the board).

Problem 2b: Write buffer for DATA CACHE

Processor writes to the data cache will go into the cache (if a particular cache line is cached) and will also go directly to DRAM.  This can be a serious bottleneck, since it means that every processor write takes a complete DRAM write cycle.  To ameliorate this problem, design a 4-entry write buffer for your system.  Think of this buffer as sitting between the processor and the cache.  This buffer should take writes from the processor and hold them until the DRAM is free.  Whenever the buffer is full, you must stall the processor from writing.

Each entry will have a 32-bit address, a 32-bit data word, and a valid bit.  Make sure to make this 4-entry write buffer  fully-associative so that values sitting in the buffer will be returned properly from load instructions.

Processor load instructions must do the following: First, they check in the write buffer.  If there is an entry there with the right address, the load should simply return the value directly from the write buffer.  Otherwise, it will look in the cache.  If there is a value available, then it will return the value immediately.  Otherwise, it will stall the load and request a cache fill from the DRAM.

Processor store instructions should do the following: First, they check in the write buffer.  If there is an entry there with the right address, then the store will simply overwrite the entry in the buffer.  Otherwise, if there is a free entry, then use it for the store.  Otherwise, stall the store until an entry is free.

Now, the DRAM controller will have 2 different inputs from the data cache.  Either (1) it will fulfill a complete cache-line read during a cache miss or (2) it will handle a single-word write to DRAM.  Note that, when we decide to empty a single-word write from the write buffer, we will write it both to the DRAM and to the cache if the word is properly cached.

The write buffer should be emptied in FIFO order (oldest write first).
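The fully-associative lookup can be sketched as a small comparator array; all of the names below are illustrative, not required:

```verilog
// Sketch of the fully-associative lookup for a 4-entry write buffer.
reg  [31:0] wb_addr  [0:3];   // one address per entry
reg  [31:0] wb_data  [0:3];   // one data word per entry
reg  [3:0]  wb_valid;         // one valid bit per entry
wire [31:0] lookup_addr;      // address of the load/store being checked

wire [3:0] wb_match;          // one-hot hit vector
genvar i;
generate
  for (i = 0; i < 4; i = i + 1) begin : match
    assign wb_match[i] = wb_valid[i] && (wb_addr[i] == lookup_addr);
  end
endgenerate

wire wb_hit = |wb_match;      // loads return the matching wb_data entry on a hit
```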

Problem 2c: TESTING

Once your cache is designed, you may test it in one of two ways. The first way: leave your cache as a separate component and test it using vectors and manually assigned signals. If you use this method, you must be careful to keep the testing readable and concise for grading. Remember, we don't like looking at waveforms in the lab report.

The second way: you can hook up a new DRAM controller and DRAM TO EACH CACHE and begin testing your design.  Keep in mind that the FETCH stage does not have a write buffer.  You do not need to worry about simultaneous requests in this problem, which is why we allow a memory block for each cache.  Further, force load instructions to stall until the write-buffer is empty.  This way, the DRAM controller for the memory stage will always have only one request at a time (write from write buffer vs read from cache).

At this point (regardless of which testing method you use), you should begin to evaluate how the addition of cache affected your critical path (since it is required in the report!).

Problem 3: Adding a Single DRAM and Arbitration to Your Processor

After you have your cache ready to go, there is one more problem that needs to be fixed. Both the data and instruction caches could quite possibly need to access the DRAM at the same time. You will need to design an arbitration method for handling simultaneous DRAM requests.  Depending on your cache timing and how efficient you try to be, this can be the most difficult and trickiest part of this lab. There is no recommended way to accomplish this task, but you are certainly allowed to design a single state machine in Verilog and use it as a memory arbiter/controller; you are also allowed to modify your DRAM controller from Problem 1. Try to be as realistic as possible with the delay through this module, and keep in mind the speed it should have compared to the other components listed above.

To be more specific, your arbiter is an entity that takes requests from (1) the instruction cache (for cache misses), (2) the data cache (for cache misses) and (3) the write buffer.  It should trade-off accesses to the DRAM in a fair way.  A suggested priority:

1) If the write-buffer is full, it is highest-priority.
2) Next, if either the instruction or data cache is stalled on a read miss, let it go forward.  Make sure that, if both are requesting, you alternate between them.
3) If nothing else is requesting, try to empty the write buffer.
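The suggested priority might be encoded as combinational grant logic like the following. The signal names are illustrative; "alternate" is a flip-flop you would toggle after each granted read miss to keep arbitration fair:

```verilog
// Sketch of arbiter grant logic for the suggested priority scheme.
reg  alternate;      // toggled after each granted read miss when both caches request
wire wb_full;        // write buffer is full (highest priority)
wire imiss_req;      // instruction cache read-miss request
wire dmiss_req;      // data cache read-miss request
wire wb_nonempty;    // write buffer has entries to drain

wire grant_wb_urgent = wb_full;
wire grant_imiss     = !wb_full && imiss_req && (!dmiss_req ||  alternate);
wire grant_dmiss     = !wb_full && dmiss_req && (!imiss_req || !alternate);
wire grant_wb_drain  = !wb_full && !imiss_req && !dmiss_req && wb_nonempty;
```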

TESTING: How will you test this?  Devise an extensive test methodology and tell us about it.  How can you make sure that the write buffer works and that values are written properly through to memory?

Problem 4: Enhancing the I/O module

Finally, we have several enhancements that you must make to the I/O module.  We will be adding to the address space we used in Lab 5.  There are 3 distinct address ranges to handle.

Problem 4a: Miscellaneous I/O

In the previous lab, we set I/O space in the top-4 words.  Now, we would like the following:
 

Address      Reads            Writes
0xFFFFFFF0   DP0              DP0
0xFFFFFFF4   DP1              DP1
0xFFFFFFF8   Input switches   Nothing
0xFFFFFFFC   Cycle Counter    Nothing

As in Lab 5, DP0 and DP1 are the registers whose outputs appear on the HEX LEDs.  Notice now that your I/O should allow you to read back values that have been written.  As before, 0xFFFFFFF8 reads from the switches as in Lab 5.  The new entity is the cycle counter.  This is a 32-bit counter that increments once per cycle.  It will be used to measure statistics.  Notice that it should be reset to zero on processor RESET and just count from that point on.
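A minimal sketch of the cycle counter itself (how it is decoded into your existing I/O module is up to you):

```verilog
// Free-running 32-bit cycle counter, cleared on processor RESET.
reg [31:0] cycle_counter;

always @(posedge PROCCLK)
  if (RESET) cycle_counter <= 0;
  else       cycle_counter <= cycle_counter + 1;

// In the I/O decoder: a load from 0xFFFFFFFC returns cycle_counter;
// stores to that address are ignored.
```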

Describe tests that will demonstrate that these features work.

Problem 4b: Level 0 boot

We will have a special 24-word ROM that appears from address 0xFFFFFF00 - 0xFFFFFF5C.  You can build this ROM any way you wish.  Compiling it into hex and producing a logic block will work as well as anything.
Reads from this address range will return the corresponding instruction.  Writes to this range will do nothing.  Arrange your RESET sequence so that the first PC is always 0xFFFFFF00.
 
 

Address
Instruction
0xFFFFFF00     lui   $r8, 0xDEAD      ;Initial Display
0xFFFFFF04     ori   $r8, $r8, 0xBEEF
0xFFFFFF08     sw    $r8, -16($r0)    ;Put "DEADBEEF" on display
0xFFFFFF0C     lui   $r1, 0x8000      ;Instruction I/O Space
0xFFFFFF10     ori   $r7, $r0, 0x2000 ;Limit of 8K
0xFFFFFF14     j     L3               ;Go copy first block
0xFFFFFF18     lw    $r8, 4($r1)      ;Save instruction address
0xFFFFFF1C L1: addiu $r1, $r1, 8      ;Skip block header
0xFFFFFF20 L2: lw    $r4, 0($r1)      ;Next word
0xFFFFFF24     sw    $r4, 0($r2)      ;Copy to memory
0xFFFFFF28     addiu $r3, $r3, -1     ;Decrement count
0xFFFFFF2C     addiu $r1, $r1, 4      ;Increment source
0xFFFFFF30     bne   $r3, $r0, L2     ;Not done
0xFFFFFF34     addiu $r2, $r2, 4      ;Increment destination
0xFFFFFF38     slt   $r5, $r1, $r7    ;Run over limit?
0xFFFFFF3C     beq   $r5, $r0, BADEND ;Yes.  Format problem?
0xFFFFFF40 L3: lw    $r3, 0($r1)      ;Get next length
0xFFFFFF44     bne   $r3, $r0, L1     ;Non-zero? Yes, copy
0xFFFFFF48     lw    $r2, 4($r1)      ;Get next address
0xFFFFFF4C END: sw   $r8, -16($r0)    ;Put execution addr in DP0
0xFFFFFF50     jr    $r8              ;Start Executing
0xFFFFFF54     break 0xAA             ;Pause with 10101010
0xFFFFFF58 BADEND: j BADEND           ;Loop forever
0xFFFFFF5C     break 0x7F             ;Indicate problem!

You can add more instructions if you wish, but you should do this basic functionality.  Notice that what happens here is that the code looks for a compact description of instructions starting at address 0x80000000.  The format of the block of memory at that address is:

Block 0:  Length0
          Address0
          Block0[0]
          block0[1]
            ...
          Block0[Length0-1]
Block 1:  Length1
          Address1
          Block1[0]
          block1[1]
            ...
          Block1[Length1-1]
Block 2:  Length2

            ...

This sequence is terminated with a zero Length field.   It is assumed that Block 0 is a block of instructions and that the system should start executing at Address0 after it is finished copying data.  The idea here is that you can have a sequence of instructions that is copied to one place in memory and a sequence of data that is copied elsewhere.

Problem 4c: Data Source

At a later time, we will produce the data source off the network.  For now, you should use one of the 2Kx32 blocks from Lab 5 for this.  Assume that you produce data in the above format.  Compile it into a 2Kx32 block, then download it to your board.  Note that reads from address 0x80000000 onward should read from this block.  To reproduce what we had in Lab 5, you will simply add a length header and an address of 0 to the front of your instructions output from mipsasm.  Then you will compile that into this module and you will be good to go.  (Hint: there may be a chance that we will be able to get the contents of such a ram from TFTP packets, which would allow you to download data/instructions to a working board.).  Writes to this address range should do nothing for now.

Problem 5: Tying it all together!

Finally, tie your Lab all together.  You should have all of the I/O from your previous lab.  Further, you should make a new file called "boardlevel.v" that includes (1) the new TopLevel.v from the high-level directory and (2) two copies of the DRAM simulation tied together as on the board.  You should be able to demonstrate all of the new features in simulation, just as before.

Produce an extensive test suite to make sure that everything still works.  You should use the same test programs from last Lab plus a bunch of new ones.  Tell us about your test philosophy.

Then, you should be able to run things on the board.  Can you produce a long-lived test that verifies that DRAM refresh still works?

Your lab report should contain a description of (a) how your DRAM controller operates, (b) how your cache operates, (c) how you handle DRAM request arbitration, (d) how the addition of your cache and main memory affected your critical path, and (e) how you determined your component delay values.  Make sure to describe your testing methodology: how did you verify that your components worked properly?  In addition, you should include legible schematics, all Verilog code, all diagnostic programs (in assembly language), and simulation logs.  If you do the extra credit below, also include a description of how the additional improvements affect your performance and critical path with respect to the minimum requirements.

Make sure to give us complete information about the physical design: (1) how many slices did you use/what fraction of the Xilinx chip did you use?  (2) What is your critical path?  What is the fastest clock that you think you can run with?
 

Extra Credit: Optimizing and Improving Your Cache

Instead of following the bare requirements listed above, you may complete any number of the following problems for extra credit. This means that you do not need to design the basic cache structure above, and then proceed to do the extra credit. You should now consider yourself warned that most of these cannot be easily added onto the requirements above; they are typically entirely different cache architectures. These improvements are not trivial, and may impact your ability to finish the lab on time. ONLY WORKING PROJECTS CAN RECEIVE FULL OR EXTRA CREDIT.

  1. 2-way set-associative cache -- Instead of the direct-mapped lookup policy suggested above, you can potentially increase the hit rate of your cache by using a 2-way set-associative cache. Keep the total size of your cache constant at 8K bytes with 8-word cache lines.  Note that once you have more than a direct-mapped cache, you must deal with the issue of cache replacement.  A simple flip-flop that toggles its state every clock cycle (i.e. between one and zero and back again) can be used as a random-number generator to choose which entry is replaced.

  2. Burst Memory Requesting -- It turns out that you can make your DRAM access consecutive data words without going through a complete precharge cycle.  This is the so-called "BURST MODE":

    Figure 7: Burst Read (Page 43)

    This diagram shows bursts of 4 words.  You should arrange to allow bursts of 8 so that you can load a complete cache line in a single burst.  To make this work, you need to set the burst length in the mode register to 8.  Describe the required modifications to the processor side of your DRAM interface and to the DRAM state machine.

    Write a test program that shows that this works and improves your performance.  Can you use the cycle-counter to show this?

  3. Improved Write Policy -- A write-through policy can be a real performance killer, especially if you are modifying the same cluster of addresses over and over.  Build a better write policy, namely a write-back cache.  What is required to make this work?  How can you test it?
    Note now that this may interact in a strange way with your write buffer.  In this new write policy, your write buffer will only be used to hold writes until they make it into the cache.  This means that a "write miss" will now load the complete cache line, then allow the write to go forward.

    Further, both write and read cache misses need to read a complete 8-word cache line.  Where this gets complicated is that you may need to kick out a dirty cache line in the process.  Make sure that you get this correct!  Remember that writing to memory is a long, painful penalty, and a sloppy write policy can make things much worse.

    If you combine this option with the previous one, you need to enable write-bursts as well (for write back of dirty lines).  To do that, you need to set bit M9 of the mode register to 0 to enable write bursts and read bursts to be the same length.

    If you implement this improved write policy, you will need to be able to flush instructions out of the data cache in the boot-level 0 code.  Why?  One possible way to do this is to implement a flush instruction that takes an address and causes a matching line at that cached address to be dumped to memory.

    Write a test program that shows that this works and improves your performance.  Can you use the cycle-counter to show this?

Please don't be misled by the extra credit. We are not grading the projects based on their performance relative to other groups in the class. The extra credit just represents more advanced features that are more difficult to design and deserve more points for the additional effort. Not doing the extra credit has no effect beyond not getting bonus points. You can still get a perfect score without it.