CS152 Computer Architecture and Engineering

Lab #5: The Memory Subsystem

Spring 2004, Prof. John Kubiatowicz

A formal design document and a division of labor are due via e-mail to your TA Wednesday 4/7 by 9pm. This will allow the TAs time to go over your design document with you during section on Thursday 4/8. You are encouraged to finish your design document and turn it in as early as possible, so that the TA can review it sooner and you can begin working on this lab sooner.

Lab reports for Lab 5 are due Thursday 4/22 at 11:59pm via the submit program. You will demonstrate your lab to your TA on Thursday 4/15 and Thursday 4/22 in section. During these demos, the TA will provide you with secret test code. If you are able to pass these tests on your first try, you will receive bonus points. If you fail the first time, then your TA will give you the source code to the tests and you will have until the end of the day to complete the tests for full credit. If you are not done by the end of the day, then you will lose points for each additional day that you are late.

We like to call this lab the "Widowmaker" (just ask Jack and John what happened last Spring), so you should get started now!!!!

Lab Policy: Labs (including final reports) must be submitted by 11:59pm on the day that your lab is checked off. To submit your lab report, run m:\bin\submit-spring2004.exe or, at a command prompt, type "submit-spring2004.exe" and then follow the instructions. Make sure you input the correct section number and the correct directory to submit. Remember that you can only submit once, so make sure to submit only when you're ready. Otherwise your lab/project grade will NOT be correctly recorded. The required format for lab reports is shown on the handouts page.

Lab 5:

In this lab, you will be designing a memory system for your pipelined processor. The previous memory module was far from practical, and you will never have separate, dedicated DRAM banks for instructions and data. Using a realistic main memory system will cause two problems in your pipelined processor: (1) your cycle time will dramatically increase as a result of the main memory write and read latency and (2) you must handle conflicts when both data and instruction accesses occur in the same clock cycle. As you most likely have learned in lecture and in the book, the solution to these problems is the addition of cache memory.

All code and helpfiles for this lab are (as usual) in m:\lab5, including a new TopLevel.v file (with some sample things in it) and a DRAM simulation called mt48lc8m16a2.v.

There are some extra-credit options at the end of the lab. If you choose to try these extra-credit options, they may significantly affect your cache architecture. So, make sure to read through the whole lab first. If you choose to build the extra credit options, you are not required to demonstrate a working version with the base parameters.

Note that Lab 5 builds on Lab 4. Thus, unless otherwise noted, you should have the same features as the previous lab. We are adding to the functionality.

Because this lab is so monstrous we are scheduling TWO separate checkoffs to make sure that you stay on the right track. The schedule of due dates is as follows.

Wednesday 4/7, 9 pm: Design doc and Problem 0 (Team Evaluation) due
Thursday 4/8: Design doc review with your TA
Thursday 4/15: Demo a fully-functional instruction cache, IO space, level-0 boot, and write buffer. (This is the bare minimum to run code that does not contain any loads or stores to any other location besides IO space. This means you do not have to have stall arbitration or memory arbitration.)
Thursday 4/22: Demo fully-functional memory subsystem
Thursday 4/22, 11:59 pm: Project write-up due.

Important: Your processor does not have to handle self-modifying code (except for the level-0 boot); i.e. if it receives self-modifying code, the behavior is undefined.

Problem 0: Team Evaluation for Lab 4

As before, you will be evaluating your own and your team members' performance on the last lab assignment (Lab #4). Remember that points are not earned only by doing work, but by how well you work in a team. So if one person does all the work, that certainly does not mean he/she should get all the points!

To evaluate yourself, give us a list of the portions of Lab 4 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).

Use the same point system as last time, but please remember to include a one or two line justification of your evaluation.

Email the result to your TA by Wednesday 4/7, 9 pm.

Problem 1: Modifying a memory controller

NOTE: Read these instructions carefully. They contain all the information you will need to write your report for this lab, including some requirements not directly related to the cache and memory additions!

The first step in creating a new memory system is changing the model for our main memory bank (hereafter called the DRAM). We will be using a model of the 128 Mega-bit DRAM from Micron that is on the hardware boards. Please start by looking over the DRAM spec off the handouts page (http://inst.eecs.berkeley.edu/~cs152/handouts/128msdram_e.pdf). We will be using RAMs that are in the 2Megx16x4 bank configuration. Further, the board has 2 of them side-by-side for a total of 32 megabytes of RAM (2 Megx32x4 banks). All of the control signals for the two physical RAMs are ganged together as if you had a single 256 Mega-bit DRAM.

There is a model for one of the physical DRAMs called mt48lc8m16a2.v in the m:\lab5_6 directory. You should use this for simulation. Note that you will be building a boardlevel.v module that includes TopLevel.v and a couple of DRAMs for your design.

For this problem, design a DRAM controller which has the following entity specification:

module memory_control (RESET,PROCCLK,request,r_w,address,data_in,data_out,wait,
                       DRAMCLK,Data,Addr,Ba,CKE,CS,RAS,CAS,WE,DQM,CAS_TIME);
    input RESET;              // System reset
    input PROCCLK;            // Processor clock
    input request;            // Set to 1 when processor is making a request
    input r_w;                // Set to 1 when processor is writing
    input [22:0] address;     // 23 bits = 8M *words* (32 megabytes)
    input [31:0] data_in;     // input data
    output [31:0] data_out;   // output data
    output wait;              // Tells the processor to wait for next value
                              // (note: "wait" is a reserved word in Verilog, so
                              //  your source will need a different signal name)
    input DRAMCLK;            // Clock used to control the DRAM
    inout [31:0] Data;        // 32-bits of data bus (bi-directional)
    output [11:0] Addr;       // 12-bits Row/Column address
    output [1:0] Ba;          // Bank address
    output CKE,CS,RAS,CAS,WE; // Clock enable and DRAM Command control
    output [1:0] DQM;         // Data control (shared between DRAMs, a bit strange!)
    // CAS_TIME appears in the port list; its direction and width are not given here

    // Your code here

endmodule // memory_control

The processor part of this interface is very simple. When the "request" line is asserted on a rising edge, it is assumed that the "address" and "r_w" signals are stable. In addition, if the "r_w" signal is asserted (equal to "1"), it is then assumed that a write will happen, and the "data_in" signal is stable as well. Immediately afterwards, the "wait" output will be asserted for every cycle until the final cycle of the access (accesses will take more than one cycle). If a read is requested, then you should arrange so that the read data is ready (on the "data_out" bus) on the same clock edge that "wait" is deasserted.

After the final cycle (falling edge with "wait" deasserted), the processor must set "request" to low or risk starting another access (you have several timing options here -- choose them carefully). One possible timing for a processor read is shown here:

Figure 1: Processor Read Timing (sample)

Keep in mind that this memory module is word addressed (i.e. the DRAM data bus produces a complete 32-bit word). Don't forget to adjust your processor addresses accordingly (i.e. processor addresses are byte-oriented).

Problem 1a: READ and WRITE

Now, consider the interface to the DRAM. Make sure to read the DRAM specification. Although this DRAM has many options, we are going to start with a very simple memory controller interface that reads or writes one 32-bit word for each activation. Extra-credit options allow you to get more sophisticated. In this section, you should control the DRAM with the DRAMCLK signal. Assume that the processor interface may be running single stepped (one PROCCLK period may span many DRAMCLK cycles). Keep this in mind throughout your design!

While the processor interface has a 23 bit address, the DRAM interface has only a 12 bit address + 2 bit bank select. This is because the DRAM takes its address in two pieces: the ROW/Bank address (which is the top 14 bits) and the COLUMN address (which is the bottom 9 bits). The four primary DRAM control signals, CS, RAS, CAS, and WE, together combine to form a "DRAM Command" as follows (from page 12 of manual):

**Figure 2: Command Encodings (Page 12)**

For basic functionality, we will only use the ACTIVE, READ, WRITE, and REFRESH commands. You can use others for extra credit.
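The address split described above (byte address to word address, then 14 bits of ROW/Bank and 9 bits of COLUMN) can be sketched as a small Python reference model, which may be handy as a golden model in a test bench. This is only a sketch: the handout does not say which 2 of the top 14 bits select the bank, so the placement below (bank in the low 2 bits of the ROW/Bank field) is an assumption, and the names split_address, BANK_BITS, etc. are made up for illustration.

```python
WORD_ADDR_BITS = 23   # processor-side word address width (from the module spec)
COLUMN_BITS = 9       # bottom 9 bits of the word address
BANK_BITS = 2         # ASSUMPTION: bank = low 2 bits of the 14-bit row/bank field


def split_address(byte_addr):
    """Return (row, bank, column) for a byte-oriented processor address."""
    # Processor addresses are byte-oriented; the DRAM is word addressed.
    word_addr = (byte_addr >> 2) & ((1 << WORD_ADDR_BITS) - 1)
    column = word_addr & ((1 << COLUMN_BITS) - 1)   # bottom 9 bits
    row_bank = word_addr >> COLUMN_BITS             # top 14 bits
    bank = row_bank & ((1 << BANK_BITS) - 1)        # assumed placement (see above)
    row = row_bank >> BANK_BITS                     # remaining 12 bits
    return row, bank, column


if __name__ == "__main__":
    for addr in (0x00000000, 0x00000004, 0x00000800, 0x00002000):
        print(hex(addr), split_address(addr))
```

If you choose a different bank placement (for example, bank in the top 2 bits), only the two lines that compute bank and row change.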

Design a state machine that takes processor commands and sequences the DRAM to perform basic READ and WRITE operations. We will be talking more about DRAMs in class, but the key timing diagrams are shown in the following figures:

Figure 3: DRAM Read Timing (Page 45)

Figure 4: DRAM Write Timing (Page 52)

Notice, in particular, that we are using the "auto-precharge" feature of the DRAM. This means that we are not (at least for read or write) issuing PRECHARGE commands.

Note that we will be designing with a 2-cycle CAS latency. Also, assume that other timing parameters are adjusted appropriately. Note that these times are given in the spec at the bottom of the referenced pages. Consider, for instance, T_RCD. In the spec it is quoted as 20ns with the slower part; with a clock frequency of 25MHz (a 40ns period), this would easily be handled by 1 cycle. At 100MHz (a 10ns period), you would need 2 cycles.

Given the above timing, the minimum number of edges for a DRAM access becomes three at 25MHz: one for the ACTIVE command, two for CAS latency, and one for the AutoPrecharge (which is overlapped with the last cycle of the CAS latency). If your clock cycle time is too short (as it would be if you optimized your processor well), then you need to use more cycles than the minimum. Further, since you probably won't know exactly how long your clock period will be (or may be varying it during debugging), you may want to figure out how to lengthen the number of cycles for events automatically when the clock is too short. (Hint -- look at the time between successive clock edges and add additional cycles until you have met each spec). Note also that your controller must be able to handle both "FAST" and "SLOW" DRAMs.
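The arithmetic behind "1 cycle at 25MHz, 2 cycles at 100MHz" is just a ceiling division of the spec time by the clock period. Here is a minimal Python sketch of that calculation, assuming integer nanoseconds and integer MHz; the function names are hypothetical, and the read-access count deliberately ignores every timing parameter other than T_RCD and the CAS latency.

```python
T_RCD_NS = 20      # quoted in the handout for the slower part
CAS_LATENCY = 2    # fixed at 2 cycles for this lab


def cycles_needed(t_spec_ns, clock_mhz):
    """Smallest whole number of clock cycles covering t_spec_ns (at least 1)."""
    # period_ns = 1000 / clock_mhz, so cycles = ceil(t_spec_ns * clock_mhz / 1000).
    # Integer arithmetic avoids floating-point rounding surprises.
    cycles = (t_spec_ns * clock_mhz + 999) // 1000
    return max(1, cycles)


def min_read_edges(clock_mhz):
    """ACTIVE-to-READ delay plus CAS latency (other parameters ignored)."""
    return cycles_needed(T_RCD_NS, clock_mhz) + CAS_LATENCY


if __name__ == "__main__":
    for mhz in (25, 50, 100):
        print(mhz, "MHz:", cycles_needed(T_RCD_NS, mhz), "cycle(s) for T_RCD,",
              min_read_edges(mhz), "edges per read")
```

In hardware you would do the equivalent with counters loaded from constants (or from a measured clock period), but the rounding rule is the same: always round up, never down.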

It is important to note that the DRAM databus is bidirectional. Thus, if you are writing to the DRAM, data should be driven on the data bus. If you are reading, the data bus should be left in a high-impedance state (this can be accomplished by using a tristate buffer).

When writing up this part of the problem, make sure to include timing diagrams for your processor interface (read and write), as well as your DRAM interface (read and write). Describe the state-machine that you used to control your DRAM and include the code for your controller. Explain your methodology for testing of the DRAM.

Problem 1b: Initialization

You need to be able to perform initialization properly after reset. Among other things, you need to set the mode register to "0b00 1 00 010 0 000": This means Write Burst=only 1, CAS latency = 2, Burst type = Sequential and Burst length = 1.
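To double-check the mode register value, it can help to build it field by field. The following Python sketch packs the fields in the order they appear in the bit string above (reading left to right as bits 11 down to 0); the function name and argument names are invented for illustration, and the field positions are inferred from the grouping of the bit string, so confirm them against the DRAM spec.

```python
def mode_register(write_burst_single, cas_latency, burst_type, burst_length_code):
    """Pack the 12-bit mode register value, MSB first as written in the handout."""
    value = 0                                   # bits 11:10, left at 00
    value |= (write_burst_single & 0x1) << 9    # bit 9, 1 = write burst of only 1
    value |= 0 << 7                             # bits 8:7, 00 in the handout value
    value |= (cas_latency & 0x7) << 4           # bits 6:4, CAS latency
    value |= (burst_type & 0x1) << 3            # bit 3, 0 = sequential
    value |= burst_length_code & 0x7            # bits 2:0, 000 = burst length 1
    return value


# The value required for the basic lab: write burst of 1, CAS latency 2,
# sequential bursts, burst length 1.
LAB_MODE = mode_register(write_burst_single=1, cas_latency=2,
                         burst_type=0, burst_length_code=0)

if __name__ == "__main__":
    print(format(LAB_MODE, "012b"), hex(LAB_MODE))
```

The packed value is what your controller drives onto the DRAM address pins during the mode register load in the initialization sequence.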

Make sure to read the section on initialization. Following is the Initialization diagram:

**Figure 5: Initialization (Page 37)**

This initialization timing should be triggered by a system reset. During the reset (which will be many cycles), you should ignore all processor requests (i.e. the WAIT line will be asserted).

Problem 1c: REFRESH

Finally, your DRAM must periodically execute refresh cycles. Your controller must somehow execute 4096 AUTO Refresh commands every 64 ms. One way to do this is to have a counter that starts at a value of 64ms/DRAMCLKPERIOD/4096 and counts to zero over and over again. Every time that this hits zero, you increment a counter of "missed refreshes" (this count can be small, just a few bits worth).

Every time your DRAM controller hits idle (all banks precharged), it checks the count of pending refreshes and performs auto-refresh cycles until this count is decremented to zero.
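The two-counter scheme above can be modeled in a few lines of Python before you commit it to hardware. This is a simplified sketch under stated assumptions: the class and method names are hypothetical, the pending count saturates at an arbitrary 3-bit maximum, and each AUTO REFRESH is treated as taking a single tick (a real refresh occupies the DRAM for several cycles).

```python
REFRESH_WINDOW_MS = 64
REFRESHES_PER_WINDOW = 4096


def refresh_interval_cycles(clock_mhz):
    """DRAMCLK cycles between refresh requests: 64ms / DRAMCLKPERIOD / 4096."""
    cycles_per_window = REFRESH_WINDOW_MS * 1000 * clock_mhz  # ms -> us -> cycles
    return cycles_per_window // REFRESHES_PER_WINDOW          # round down (safe side)


class RefreshScheduler:
    def __init__(self, clock_mhz, max_pending=7):
        self.reload = refresh_interval_cycles(clock_mhz)
        self.counter = self.reload
        self.pending = 0                 # count of "missed refreshes"
        self.max_pending = max_pending   # ASSUMPTION: 3-bit saturating counter

    def tick(self, controller_idle):
        """Advance one DRAMCLK; return True if an AUTO REFRESH should be issued."""
        self.counter -= 1
        if self.counter == 0:
            self.counter = self.reload
            self.pending = min(self.pending + 1, self.max_pending)
        if controller_idle and self.pending > 0:
            self.pending -= 1
            return True
        return False


if __name__ == "__main__":
    print(refresh_interval_cycles(25), refresh_interval_cycles(100))
```

Note that if the pending count ever saturates, refreshes are being lost, which is exactly the situation your arbitration has to prevent when the controller is busy for long stretches.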

Here is the simple refresh diagram:

**Figure 6: Refresh (Page 40)**

Problem 1d: Testing

How will you test your controller? Come up with test benches to handle as many conditions as you can imagine. Be careful. Assume that the processor may request things at a much slower clock rate than the DRAM clock. However, you can assume that the processor clock is synchronized to the DRAM clock, i.e. that the processor clock gets an edge only right after a DRAM clock edge!

How can you make sure that refresh will happen even if the DRAM controller is always busy?

Problem 2: Designing your Cache

The next step is to design your cache. You may design two separate modules, one for data and the other for instructions, if you find this to be easier than writing a single cache module. The cache you design must have the following properties (unless you do extra credit, mentioned below):

Total cache size of 8K bytes (not including TAGS, etc)
Block size of 8 words (32 bytes)
Direct-mapped cache policy
Write-through write policy with a 4-entry write buffer
Build the SRAM portion of the cache from the same 2Kx32-bit blocks that we used for Lab 4 (with no initialization).
You need something else for tags, etc.
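For the base parameters above (8K bytes, 8-word blocks, direct-mapped), the address breaks down into a byte offset within the block, a line index, and a tag. The Python sketch below derives the field widths; it assumes a full 32-bit byte address for the tag width (you may be able to store fewer tag bits, since the DRAM is smaller than 4 GB), and the names are illustrative only.

```python
CACHE_BYTES = 8 * 1024      # total data capacity, not including tags
BLOCK_BYTES = 8 * 4         # 8 words of 4 bytes each
ADDRESS_BITS = 32           # ASSUMPTION: tag computed over a 32-bit byte address

NUM_LINES = CACHE_BYTES // BLOCK_BYTES            # 256 lines
OFFSET_BITS = (BLOCK_BYTES - 1).bit_length()      # 5 bits: 3 word-select + 2 byte
INDEX_BITS = (NUM_LINES - 1).bit_length()         # 8 bits
TAG_BITS = ADDRESS_BITS - INDEX_BITS - OFFSET_BITS  # 19 bits


def cache_fields(byte_addr):
    """Return (tag, index, offset) for a direct-mapped lookup."""
    offset = byte_addr & (BLOCK_BYTES - 1)
    index = (byte_addr >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = byte_addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(NUM_LINES, OFFSET_BITS, INDEX_BITS, TAG_BITS)
    print(cache_fields(0x00002024))
```

A lookup hits when the valid bit for the indexed line is set and the stored tag equals the tag field of the address.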

If you wish you can use CoreGen to generate your own custom ram components with the following restrictions:

NO DUAL PORTING! We are well aware that dual-ported components can make your life really easy when crossing clock boundaries, but that is why you are not allowed to use them on this lab. You should learn how to cross clock boundaries yourself.
The instruction and data cache sizes must conform to the size constraints given above.
You may generate asynchronous RAMs if you like (these can be especially useful for LRU bits...). However, you should keep in mind that the asynchronous RAMs are built of LUTs, and that if you attempt to build your entire cache system out of asynchronous RAM, you will likely exhaust all the LUTs on the board.

Problem 2a: TAGs file

Build the TAGs file for the cache. Make sure that you can reset all the valid bits to zero after reset. You are allowed to use smaller SRAM components for tags, etc, if you wish (make sure that they compile properly to SRAMs that work with the board).

Problem 2b: Write buffer for DATA CACHE

Processor writes to the data cache will go into the cache (if a particular cache line is cached) and will also go directly to DRAM. This can be a serious bottleneck, since it means that every processor write takes a complete DRAM write cycle. To ameliorate this problem, design a 4-entry write buffer for your system. Think of this buffer as sitting between the processor and the cache. This buffer should take writes from the processor and hold them until the DRAM is free. Whenever the buffer is full, you must stall the processor from writing.

Each entry will have a 32-bit address, a 32-bit data word, and a valid bit. Make sure to make this 4-entry write buffer fully-associative so that values sitting in the buffer will be returned properly from load instructions.

Processor load instructions must do the following: First, they check in the write buffer. If there is an entry there with the right address, the load should simply return the value directly from the write buffer. Otherwise, it will look in the cache. If there is a value available, then it will return the value immediately. Otherwise, it will stall the load and request a cache fill from the DRAM. Why can't we simply just look in the cache when we receive a load?

Processor store instructions should do the following: First, they check in the write buffer. If there is an entry there with the right address, then the store will simply overwrite the entry in the buffer. Otherwise, if there is a free entry, then use it for the store. Otherwise, stall the store until an entry is free.

Now, the DRAM controller will have 2 different inputs from the data cache. Either (1) it will fulfill a complete cache-line read during cache miss or (2) it will handle a single-word write to DRAM. Note that, when we decide to empty a single-word write from the write buffer, we will write it both to the DRAM and write it to the cache if the word is properly cached.

The write buffer should be emptied in FIFO order (oldest write first).
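The load, store, and drain rules above can be captured in a short behavioral model, which is useful both for thinking through corner cases and as a reference for a test bench. This Python sketch is one possible reading of the rules, not a required design: the class and method names are hypothetical, and a store that hits an existing entry is assumed to keep that entry's original place in the FIFO order.

```python
class WriteBuffer:
    """Behavioral model of a small fully-associative, FIFO-drained write buffer."""

    def __init__(self, entries=4):
        self.capacity = entries
        self.fifo = []          # [address, data] pairs, oldest first

    def store(self, addr, data):
        """Return True if the store was accepted, False if the processor must stall."""
        for entry in self.fifo:
            if entry[0] == addr:        # matching address: overwrite in place
                entry[1] = data
                return True
        if len(self.fifo) < self.capacity:
            self.fifo.append([addr, data])
            return True
        return False                    # buffer full: stall the store

    def load(self, addr):
        """Return buffered data for addr, or None if the cache must be checked."""
        for entry_addr, data in self.fifo:
            if entry_addr == addr:
                return data
        return None

    def drain(self):
        """Remove and return the oldest (address, data) pair, or None if empty."""
        if not self.fifo:
            return None
        addr, data = self.fifo.pop(0)
        return addr, data

    def is_full(self):
        return len(self.fifo) == self.capacity
```

A typical use in a test bench is to apply the same sequence of loads and stores to this model and to your hardware, then compare the load results and the order of writes that reach the DRAM.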

Problem 2c: TESTING

Once your cache is designed, you may test it in one of two ways. The first way: leave your cache as a separate component and test it using vectors and manually assigned signals. If you use this method, you must be careful to keep the testing readable and concise for grading. Remember, we don't like looking at waveforms in the lab report.

The second way: you can hook up a new DRAM controller and DRAM TO EACH CACHE and begin testing your design. Keep in mind that the FETCH stage does not have a write buffer. You do not need to worry about simultaneous requests in this problem, which is why we allow a memory block for each cache. Further, force load instructions to stall until the write-buffer is empty. This way, the DRAM controller for the memory stage will always have only one request at a time (write from write buffer vs read from cache).

At this point (regardless of which testing method you use), you should begin to evaluate how the addition of cache affected your critical path (since it is required in the report!).

Problem 3: Adding a Single DRAM and Arbitration to Your Processor

After you have your cache ready to go, there is one more problem that needs to be fixed. Both the data and instruction caches could quite possibly need to access the DRAM at the same time. You will need to design an arbitration method for handling simultaneous DRAM requests. Depending on your cache timing and how efficient you try to be, this can be the trickiest part of this lab. There is no recommended way to accomplish this task, but you are certainly allowed to design a single state machine in Verilog and use it as a memory arbiter/controller; you are also allowed to modify your DRAM controller as needed.

To be more specific, your arbiter is an entity that takes requests from (1) the instruction cache (for cache misses), (2) the data cache (for cache misses) and (3) the write buffer. It should trade-off accesses to the DRAM in a fair way. A suggested priority:

1) If the write-buffer is full, it is highest-priority.
2) Next, if either the instCache or dataCache is busy with a read-miss, let them go forward. Make sure that, if both are requesting, you alternate between them.
3) If nothing else is requesting, try to empty the write buffer.
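The suggested priority above amounts to a small decision function. Here is a Python sketch of it, assuming the arbiter remembers which cache it granted most recently so that it can alternate; the function name, argument names, and the string labels for the requesters are all made up for illustration.

```python
def arbitrate(wb_full, wb_nonempty, icache_miss, dcache_miss, last_cache_grant):
    """Pick the next DRAM requester: "wbuf", "icache", "dcache", or None.

    last_cache_grant is "icache", "dcache", or None, and records which cache
    was served most recently (used only to alternate when both are missing).
    """
    if wb_full:                          # 1) a full write buffer goes first
        return "wbuf"
    if icache_miss and dcache_miss:      # 2) both caches missing: alternate
        return "dcache" if last_cache_grant == "icache" else "icache"
    if icache_miss:
        return "icache"
    if dcache_miss:
        return "dcache"
    if wb_nonempty:                      # 3) otherwise drain the write buffer
        return "wbuf"
    return None                          # nothing is requesting


if __name__ == "__main__":
    print(arbitrate(False, True, True, True, "icache"))
```

In hardware, the same decision is made only when the DRAM controller is ready to accept a new request, and the grant must be held until that access completes.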

TESTING: How will you test this? Devise an extensive test methodology and tell us about it. How can you make sure that the write buffer works and that values are written properly through to memory?

Problem 4: Enhancing the I/O module

Finally, we have several enhancements that you must make to the I/O module. We will be adding to the address space we used in Lab 4. There are 4 distinct address ranges to handle.

Problem 4a: Miscellaneous I/O

In the previous lab, we set I/O space in the top-4 words. Now, we are using some of the areas that were "Reserved for future use".

Address	Reads	Writes
0x80000000-0x80000800	See 4c	See 4c
0x80000804-0xFFFFFEDC	Reserved for future use	Reserved for future use
0xFFFFFEE0-0xFFFFFEE8	See 4d	See 4d
0xFFFFFEEC-0xFFFFFEFC	Reserved for future use	Reserved for future use
0xFFFFFF00-0xFFFFFFEC	See 4b	See 4b
0xFFFFFFF0	DP0	DP0
0xFFFFFFF4	DP1	DP1
0xFFFFFFF8	Input switches	Nothing
0xFFFFFFFC	Cycle Counter	Nothing

As in Lab 4, DP0 and DP1 are the registers whose outputs appear on the HEX LEDs. The new entity is the cycle counter. This is a 32-bit counter that counts once per cycle. It will be used to measure statistics. Notice that it should be reset to zero on processor RESET and just count from that point on.

Describe tests that will demonstrate that these features work.

Problem 4b: Level 0 boot

We will have a special 28-word ROM that appears from Address 0xFFFFFF00 - 0xFFFFFF6C. You can build this ROM any way you wish. Compiling it into hex and producing a logic block will work as well as anything. Instruction reads from this address range will return the corresponding instruction. All data memory accesses to this range will have undefined results. Arrange your RESET sequence so that the first PC is always 0xFFFFFF00.

`Address`	`Instruction`
`0xFFFFFF00`	`lui $8, 0x4849 #Initial Display`
`0xFFFFFF04`	`ori $8, $8, 0x2045`
`0xFFFFFF08`	`sw $8, -48($0)`
`0xFFFFFF0C`	`lui $8, 0x4152`
`0xFFFFFF10`	`ori $8, $8, 0x5448`
`0xFFFFFF14`	`sw $8, -44($0)`
`0xFFFFFF18`	`sw $8, -16($0) #Put "DEADBEEF" on display`
`0xFFFFFF1C`	`lui $1, 0x8000 #Instruction I/O Space`
`0xFFFFFF20`	`ori $7, $0, 0x2000 #Limit of 8K`
`0xFFFFFF24`	`j L3 #Go copy first block`
`0xFFFFFF28`	`lw $8, 4($1) #Save instruction address`
`0xFFFFFF2C`	`L1: addiu $1, $1, 8 #Skip block header`
`0xFFFFFF30`	`L2: lw $4, 0($1) #Next word`
`0xFFFFFF34`	`sw $4, 0($2) #Copy to memory`
`0xFFFFFF38`	`addiu $3, $3, -1 #Decrement count`
`0xFFFFFF3C`	`addiu $1, $1, 4 #Increment source`
`0xFFFFFF40`	`bne $3, $0, L2 #Not done`
`0xFFFFFF44`	`addiu $2, $2, 4 #Increment destination`
`0xFFFFFF48`	`slt $5, $1, $7 #Run over limit?`
`0xFFFFFF4C`	`beq $5, $0, BADEND #Yes. Format problem?`
`0xFFFFFF50`	`L3: lw $3, 0($1) #Get next length`
`0xFFFFFF54`	`bne $3, $0, L1 #Non-zero? Yes, copy`
`0xFFFFFF58`	`lw $2, 4($1) #Get next address`
`0xFFFFFF5C`	`END:sw $8, -16($0) #Put execution addr in DP0`
`0xFFFFFF60`	`jr $8 #Start Executing`
`0xFFFFFF64`	`break 0xAA #Pause with 10101010`
`0xFFFFFF68`	`BADEND: j BADEND #Loop forever`
`0xFFFFFF6C`	`break 0x7F #Indicate problem!`

You can modify the level 0 boot ROM any way you like, but it must fit within the following address range 0xFFFFFF00-0xFFFFFFEC and it must incorporate the basic functionality laid out above. Notice that what happens here is that the code looks for a compact description of instructions starting at address 0x80000000. The format of the block of memory at that address is:

Block 0: Length0
          Address0
          Block0[0]
          Block0[1]
            ...
          Block0[Length0-1]
Block 1: Length1
          Address1
          Block1[0]
          Block1[1]
            ...
          Block1[Length1-1]
Block 2: Length2

...

This sequence is terminated with a zero Length field. It is assumed that Block 0 is a block of instructions and that the system should start executing at Address0 after it is finished copying data. The idea here is that you can have a sequence of instructions that is copied one place in memory and a sequence of data that is copied elsewhere.
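To make the block format concrete, here is a small Python model of what the copy loop is expected to accomplish: walk the Length/Address/data blocks until a zero Length, copy each block to its destination, and start execution at Address0. It is a sketch only; the function name is hypothetical, memory is modeled as a dictionary from byte address to word, and it does not reproduce the ROM's 8K limit check.

```python
def load_image(words):
    """Interpret a list of 32-bit words laid out in the boot block format.

    Returns (entry_address, memory), where memory maps byte addresses to
    words. entry_address is Address0, or None if the image has no blocks.
    """
    memory = {}
    entry = None
    i = 0
    while i < len(words) and words[i] != 0:     # a zero Length ends the sequence
        length, address = words[i], words[i + 1]
        if entry is None:
            entry = address                     # Block 0 holds the instructions
        for k in range(length):
            memory[address + 4 * k] = words[i + 2 + k]
        i += 2 + length                         # skip header plus data
    return entry, memory


if __name__ == "__main__":
    image = [2, 0x00000000, 0xAAAA0000, 0xBBBB0000,   # Block 0: two instructions
             1, 0x00100000, 0x00000001,               # Block 1: one data word
             0]                                       # terminator
    entry, memory = load_image(image)
    print(hex(entry), {hex(a): hex(w) for a, w in memory.items()})
```

Running your assembled image through a model like this before downloading it is a cheap way to catch a wrong length header or a missing terminator.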

Problem 4c: Data Source

You should use either the TFTP Blackbox (highly recommended) or one of the synchronous RAM blocks (only if the black box doesn't work) from Lab 4 for your data source. Assume that you produce data in the above format. Compile it into a 2Kx32 block, then download it to your board. Reads from addresses 0x80000000 - 0x80000800 should read from this block. To reproduce what we had in Lab 4, you will simply add a length header and an address of 0x00000000 to the front of your instructions output from MIPSASM and a word of 0x00000000 to the end of the output. You may find some of MIPSASM's more advanced features useful in specifying higher address ranges and automatically calculating the length of the code.

For example here is a code sample that will generate the proper header and footer for a simple code block:

.count words begin end # Count the number of words between the begin and end labels
.word 0x00000000 # Place the starting address
.address 0x00000000 # Direct MIPSASM to use 0x00000000 as the address when performing jumps
begin:
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
end:
.word 0x00000000 # A trailing 0x00000000 terminates the instruction stream

Something more fancy might be:

.count words begin1 end1 # Count the number of words in the instruction segment
.word 0x00000000 # Instruction segment starting address
.address 0x00000000 # Direct MIPSASM to use 0x00000000 as the address when performing jumps
begin1:
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
srl $0, $0, 0
end1:
.count words begin2 end2 # Count the number of words in the data segment
.word 0x00100000 # Data segment starting address
.address 0x00100000 # Direct MIPSASM to use 0x00100000 as the address for this segment
begin2:
.word 0x00000001
.word 0x00000002
.word 0x00000003

...

.word 0x00000500
end2:
.word 0x00000000 # A trailing 0x00000000 terminates the instruction stream

Note that writes to this address range are undefined.

Problem 4d: ASCII Text conversion

The TAs have provided an ASCII conversion tool that takes an ASCII code in 7 bits and outputs 7 bits of coding information for the hexadecimal LEDs. Note that not all characters can be displayed on the LEDs, so some characters will be converted to a blank space.

ASCII_REG1 is a 56 bit register that contains the ASCII-converted display information for hexadecimal LEDs 1-4.
ASCII_REG2 is a 56 bit register that contains the ASCII-converted display information for hexadecimal LEDs 5-8.
Both ASCII_REG1 and ASCII_REG2 store the lower-numbered LEDs in the higher bits.
POINT_REG is an 8 bit register that contains the display information for the LED decimal point segments. The high bit corresponds to LED point 1. A high value indicates that the LED point is turned on.

Address	Reads	Writes
0xFFFFFEE0	Nothing	Convert the low 7 bits in each byte of the word stored into ASCII and store it in ASCII_REG1.
0xFFFFFEE4	Nothing	Convert the low 7 bits in each byte of the word stored into ASCII and store it in ASCII_REG2.
0xFFFFFEE8	Nothing	Store the low-order bit of each nibble into POINT_REG. The high-order nibble of the stored word corresponds to the high bit of POINT_REG.
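The POINT_REG rule in the last row (low-order bit of each nibble, with the high-order nibble mapping to the high bit) is easy to get backwards, so here is a minimal Python sketch of the packing. The function name is invented for illustration, and the sketch covers only POINT_REG, not the ASCII conversion performed by the TA-provided tool.

```python
def point_reg_value(word):
    """Pack the low-order bit of each of the 8 nibbles of a 32-bit word.

    Nibble 7 (bits 31..28) supplies the high bit of the 8-bit result,
    which corresponds to LED point 1.
    """
    value = 0
    for nibble in range(7, -1, -1):             # from high nibble to low nibble
        bit = (word >> (4 * nibble)) & 0x1      # low-order bit of this nibble
        value = (value << 1) | bit
    return value


if __name__ == "__main__":
    for w in (0x10000000, 0x00000001, 0x11111111, 0x10100001):
        print(hex(w), format(point_reg_value(w), "08b"))
```

So a store of 0x10000000 would light only LED point 1, and a store of 0x00000001 would light only LED point 8.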

Switch 7 should still be used to toggle between DP0 and DP1.
Switch 1 should be used to toggle between displaying DP0 or DP1 and the ASCII output.
In simulation stores to these registers should be written to the iooutput.trace file.
Note that there is currently a library bug that prevents ModelSim (in 119 Cory at least) from simulating the ASCII conversion module properly. So the raw hexadecimal values of the words stored to the ASCII registers should be written to the iooutput.trace file instead.

Problem 5: Tying it all together!

Finally, tie your Lab all together. You should have all of the I/O from your previous lab. Further, you should make a new file called "boardlevel.v" that includes (1) the new TopLevel.v from the high-level directory and (2) two copies of the DRAM simulation tied together as on the board. You should be able to demonstrate all of the new features in simulation, just as before.

Produce an extensive test suite to make sure that everything still works. You should use the same test programs from last Lab plus a bunch of new ones. Tell us about your test philosophy.

Then, you should be able to run things on the board. Can you produce a long-lived test that verifies that DRAM refresh still works?

Your lab report should contain a description of (a) how your DRAM controller operates, (b) how your cache operates, (c) how you handle DRAM request arbitration, (d) how the addition of your cache and main memory affected your critical path, and (e) how you determined your component delay values. Make sure to describe your testing methodology: how did you verify that your components worked properly? In addition, you should include legible schematics, all Verilog code, all diagnostic programs (in assembly language), and simulation logs. If you do the extra credit below, also include a description of how the additional improvements affect your performance and critical path with respect to the minimum requirements.

Make sure to give us complete information about the physical design: (1) how many slices did you use/what fraction of the Xilinx chip did you use? (2) What is your critical path? What is the fastest clock that you think you can run with? (3) Can your memory subsystem run at the same speed as your processor clock?

Extra Credit: Optimizing and Improving Your Cache

Instead of following the bare requirements listed above, you may complete any number of the following problems for extra credit. This means that you do not need to design the basic cache structure above, and then proceed to do the extra credit. You should now consider yourself warned that most of these cannot be easily added onto the requirements above; they are typically entirely different cache architectures. These improvements are not trivial, and may impact your ability to finish the lab on time. ONLY WORKING PROJECTS CAN RECEIVE FULL OR EXTRA CREDIT.

2-way set-associative cache -- Instead of the direct-mapped lookup policy suggested above, you can potentially increase the hit rate of your cache by using a 2-way set-associative cache. Keep the total size of your cache constant at 8K bytes with 8-word cache lines. Note that once you have more than a direct-mapped cache, you must deal with the issue of cache replacement. A simple flip-flop that flips its state every clock cycle (i.e. between one and zero and back again) can be used as a random-number generator to choose which entry is evicted on a replacement.

Burst Memory Requesting -- It turns out that you can make your DRAM access consecutive data words without going through a complete precharge cycle. This is the so-called "BURST MODE":

**Figure 7: Burst Read (Page 43)**

This diagram shows bursts of 4 words. You should arrange to allow bursts of 8 so that you can load a complete cache line in a single burst. To make this work, you need to set the burst-length in the mode register to 8. Describe the required modifications to the processor side of your DRAM interface and to the DRAM state machine.

Write a test program that shows that this works and improves your performance. Can you use the cycle-counter to show this?

Please don't be misled by the extra credit. We are not grading the projects based on their performance relative to other groups in the class. The extra credit just represents more advanced features that are more difficult to design and deserve more points for the additional effort. Not doing the extra credit has no effect beyond not getting bonus points. You can still get a perfect score without it.