Like Lab 5, this is a long lab! MAKE SURE TO
START NOW!
Homework Policy: Homework assignments are due in class. No late homeworks will be accepted. There will be a short quiz in lecture the day the assignment is due; the quiz will be based on the homework. Study groups are encouraged, but what you turn in must be your own work.
Lab Policy: Lab reports are due by midnight via the submit program. No late labs will be accepted.
As decided in class, the penalty for cheating
on homework or labs is no credit for the full assignment.
Homework 6: Please
do the following problems from P&H: 7.7, 7.9, 7.11, 7.15, 7.16, 7.18,
7.23, 7.27, 7.32, 7.36, 8.8, 8.13, 8.19, 8.30
Homework assignments should continue to be done
individually.
Lab 6:
In this lab, you will be designing a memory system for your pipelined processor. The previous memory module was far from practical, and you will never have separate, dedicated DRAM banks for instructions and data. Using a realistic main memory system will cause two problems in your pipelined processor: (1) your cycle time will dramatically increase as a result of the main memory write and read latency and (2) you must handle conflicts when both data and instruction accesses occur in the same clock cycle. As you most likely have learned in lecture and in the book, the solution to these problems is the addition of cache memory.
All code and help files for this lab are (as usual) in the m:\lib\high-level directory, including a new TopLevel.v file (with some sample things in it) and a DRAM simulation called mt48lc8m16a2.v. Further, there is a Lab6Help.pdf file. You probably want to use the debouncer in common_mods.v.
There are some extra-credit options at the end of the lab. If you choose to try these extra-credit options, they may significantly affect your cache architecture. So, make sure to read through the whole lab first. If you choose to build the extra credit options, you are not required to demonstrate a working version with the base parameters.
Note that Lab 6 builds on Lab 5. Thus, unless otherwise noted, you should have the same features as the previous lab. We are adding to the functionality.
THIS LAB CAN BE EXTREMELY DIFFICULT, SO AN EARLY START WOULD BE A VERY
GOOD IDEA.
As before, you will be evaluating your own and your team members' performance on the last lab assignment (Lab #5). Remember that points are not earned only by doing work, but by how well you work in a team. So if one person does all the work, that certainly does not mean he/she should get all the points!
To evaluate yourself, give us a list of the portions of Lab 5 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).
You may give a total of 20 points per person in the group (besides yourself), but do not give any more than 30 points to one person. Submit your evaluations to your TA by email as soon as possible. Make sure to include a one or two line justification of your evaluation.
Email the result to your TA by Wednesday 4/16 at Midnight.
NOTE: Read these instructions carefully. They contain all the information you will need to write your report for this lab, including some requirements not directly related to the cache and memory additions!
The first step in creating a new memory system is changing the model
for our main memory bank (hereafter called the DRAM). We will be using
a model of the 128-megabit DRAM from Micron that is on the hardware boards.
Please start by looking over the DRAM spec off the handouts page (http://inst.eecs.berkeley.edu/~cs152/handouts/128msdram_e.pdf).
We will be using RAMs in the 2 Meg x 16 x 4-bank configuration. Further, the board has two of them side by side, for a total of 8 Meg words (32 megabytes) of RAM organized as 2 Meg x 32 x 4 banks. All of the control signals for the two physical RAMs are ganged together, as if you had a single 256-megabit DRAM.
There is a model for one of the physical DRAMs called mt48lc8m16a2.v in the m:\lib\high-level\ directory. You should use this for simulation. Note that you will be building a boardlevel.v module that includes TopLevel.v and a couple of DRAMs for your design.
For this problem, design a DRAM controller which has the following entity specification:
module memory_control (RESET, PROCCLK, request, r_w, address, data_in, data_out, wait,
                       DRAMCLK, Data, Addr, Ba, CKE, CS, RAS, CAS, WE, DQM, CAS_TIME);
   input         RESET;     // System reset
   input         PROCCLK;   // Processor clock
   input         request;   // Set to 1 when the processor is making a request
   input         r_w;       // Set to 1 when the processor is writing
   input  [22:0] address;   // 23 bits = 8 Meg *words*
   input  [31:0] data_in;   // input data
   output [31:0] data_out;  // output data
   output        wait;      // Tells the processor to wait for the next value
                            // ("wait" is a Verilog keyword; rename it if your tools object)
   input         DRAMCLK;   // Clock used to control the DRAM
   inout  [31:0] Data;      // 32 bits of data bus (bi-directional)
   output [11:0] Addr;      // 12-bit Row/Column address
   output [1:0]  Ba;        // Bank address
   output        CKE, CS, RAS, CAS, WE; // Clock enable and DRAM command controls
   output [1:0]  DQM;       // Data control (shared between the two DRAMs, a bit strange!)
   input         CAS_TIME;  // Presumably selects FAST vs. SLOW DRAM timing (see below)

   // Your code here

endmodule // memory_control

The processor part of this interface is very simple. When the "request" line is asserted on a rising edge, it is assumed that the "address" and "r_w" signals are stable. In addition, if the "r_w" signal is asserted (equal to "1"), it is then assumed that a write will happen, and the "data_in" signal is stable as well. Immediately afterwards, the "wait" output will be asserted for every cycle until the final cycle of the access (accesses will take more than one cycle). If a read is requested, then you should arrange so that the read data is ready (on the "data_out" bus) on the same clock edge that "wait" is deasserted.
After the final cycle (falling edge with "wait" deasserted), the processor must set "request" to low or risk starting another access (you have several timing options here -- choose them carefully). One possible timing for a processor read is shown here:
[Timing diagram: one processor-side read access]
Keep in mind that this memory module is word addressed (i.e. the DRAM data bus produces a complete 32-bit word). Don't forget to adjust your processor addresses accordingly (i.e. processor addresses are byte-oriented).
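For example, one possible mapping is sketched below (proc_byte_addr is a made-up name for the processor's 32-bit byte address):

    // The low two bits select a byte within the word; the next 23 bits are the word address.
    wire [22:0] word_address = proc_byte_addr[24:2];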
Problem 1a: READ and WRITE
Now, consider the interface to the DRAM. Make sure to read the DRAM specification. Although this DRAM has many options, we are going to start with a very simple interface that reads or writes one 32-bit word for each activation. Extra-credit options allow you to get more sophisticated. In this section, you should control the DRAM with the DRAMCLK signal. Assume that the processor interface may be running single-stepped (one PROCCLK cycle may span many DRAMCLK cycles). Keep this in mind throughout your design!
While the processor interface has a 23-bit address, the DRAM interface has only a 12-bit address plus a 2-bit bank select. This is because the DRAM takes its address in two pieces: the ROW/Bank address (the top 14 bits) and the COLUMN address (the bottom 9 bits).
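One possible split of the 23-bit word address is sketched below. The exact placement of the two bank bits is up to you; this sketch simply assumes they are the top two bits.

    // "address" is the 23-bit word address from the processor interface above.
    wire [1:0]  bank_addr = address[22:21];  // drives Ba[1:0]
    wire [11:0] row_addr  = address[20:9];   // drives Addr[11:0] with the ACTIVE command
    wire [8:0]  col_addr  = address[8:0];    // drives Addr[8:0] with READ/WRITE (A10 high for auto precharge)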
The four primary DRAM control signals, CS, RAS, CAS, and WE, together combine to form a "DRAM Command" as follows (from page 12 of the manual):
For basic functionality, we will only use the ACTIVE, READ, WRITE, and REFRESH commands. You can use others for extra credit.
Design a state machine that takes processor commands and sequences the DRAM to perform basic READ and WRITE operations. We will be talking more about DRAMs in class, but the key timing diagrams are shown in the following figures:
[DRAM READ and WRITE timing diagrams]
Notice, in particular, that we are using the "auto-precharge" feature of the DRAM. This means that we are not (at least for read or write) issuing PRECHARGE commands.
Note that we will be designing with a 2-cycle CAS latency. Also, assume that the other timing parameters are adjusted appropriately; these times are given in the spec at the bottom of the referenced pages. Consider, for instance, T_RCD. The spec quotes it as 20 ns for the slower part; with a 25 MHz clock (40 ns period), this is easily handled in 1 cycle, while at 100 MHz you would need 2 cycles.
Given the above timing, the minimum number of edges for a DRAM access becomes three at 25 MHz: one for the ACTIVE command, two for the CAS latency, and one for the auto precharge (which is overlapped with the last cycle of the CAS latency). If your clock cycle time is too short (as it would be if you optimized your processor well), then you need to use more cycles than the minimum. Further, since you probably won't know exactly how long your clock will be (or may be varying it during debugging), you may want to figure out how to lengthen the number of cycles for events automatically when the clock is too short. (Hint: look at the time between successive clock edges and add additional cycles until you have met each spec.) Note also that your controller must be able to handle both "FAST" and "SLOW" DRAMs.
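One way to approximate the hint statically is sketched below, assuming you pass your clock period in as a parameter (all names here are made up): each cycle count is computed with a rounding-up division, so a shorter clock period automatically yields more cycles.

    // Inside your controller module -- illustrative values only.
    parameter  CLK_PERIOD_NS = 40;   // e.g. a 25 MHz DRAMCLK; change to match your design
    parameter  T_RCD_NS      = 20;   // ACTIVE-to-READ/WRITE delay for the SLOW part
    localparam T_RCD_CYCLES  = (T_RCD_NS + CLK_PERIOD_NS - 1) / CLK_PERIOD_NS;  // rounds up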
It is important to note that the DRAM data bus is bidirectional. Thus, if you are writing to the DRAM, data should be driven on the data bus. If you are reading, the data bus should be left in a high-impedance state (this can be accomplished by assigning the bus the value 32'hzzzzzzzz).
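A minimal sketch of the usual idiom (drive_enable and write_data are assumed internal signals of your controller):

    // Drive the DRAM data bus only during writes; float it otherwise.
    assign Data = drive_enable ? write_data : 32'hzzzzzzzz;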
When writing up this part of the problem, make sure to include timing diagrams for your processor interface (read and write), as well as your DRAM interface (read and write). Describe the state-machine that you used to control your DRAM and include the code for your controller. Explain your methodology for testing of the DRAM.
Problem 1b: Initialization
You need to be able to perform initialization properly after reset. Among other things, you need to set the mode register to "0b00 1 00 010 0 000": this means write burst = single location, CAS latency = 2, burst type = sequential, and burst length = 1.
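Written as a Verilog constant (the name is illustrative), that value is:

    // Mode register: single-location write bursts, CAS latency 2, sequential, burst length 1.
    localparam [11:0] MODE_REG = 12'b00_1_00_010_0_000;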
Make sure to read the section on initialization. Following is
the Initialization diagram:
This initialization timing should be triggered by a system reset. During the reset (which will be many cycles), you should ignore all processor requests (i.e. the WAIT line will be asserted).
Problem 1c: REFRESH
Finally, your DRAM must periodically execute refresh cycles. Your controller must somehow execute 4096 AUTO REFRESH commands every 64 ms. One way to do this is to have a counter that starts at a value of 64 ms / DRAMCLKPERIOD / 4096 and counts down to zero over and over again. Every time it hits zero, you increment a counter of "missed refreshes" (this count can be small, just a few bits worth).
Every time your DRAM controller hits idle (all banks precharged), it checks the count of pending refreshes and performs auto-refresh cycles until this count is decremented to zero.
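Here is a minimal sketch of that bookkeeping, assuming a 100 MHz DRAMCLK (64 ms / 10 ns / 4096 is roughly 1562 cycles between refreshes). The module and signal names are made up; your main state machine would pulse refresh_done each time it actually issues an AUTO REFRESH command.

    module refresh_bookkeeping(DRAMCLK, RESET, refresh_done, refresh_pending);
       parameter REFRESH_INTERVAL = 1562;     // 64 ms / DRAMCLK period / 4096 (assumes 100 MHz)
       input  DRAMCLK, RESET;
       input  refresh_done;                   // pulsed when an AUTO REFRESH command is issued
       output reg [3:0] refresh_pending;      // count of "missed refreshes"

       reg [10:0] interval_count;
       wire       interval_expired = (interval_count == 0);

       always @(posedge DRAMCLK) begin
          if (RESET) begin
             interval_count  <= REFRESH_INTERVAL;
             refresh_pending <= 0;
          end else begin
             interval_count <= interval_expired ? REFRESH_INTERVAL : interval_count - 1;
             case ({interval_expired, refresh_done})
                2'b10:   refresh_pending <= refresh_pending + 1;  // another refresh is owed
                2'b01:   refresh_pending <= refresh_pending - 1;  // one refresh was performed
                default: ;                                        // both or neither: no net change
             endcase
          end
       end
    endmodule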
Here is the simple refresh diagram:
Problem 1d: Testing
How will you test your controller? Come up with test benches to handle as many conditions as you can imagine. Be careful. Assume that the processor may request things at a much slower clock rate than the DRAM clock. However, you can assume that the processor clock is synchronized to the DRAM clock, i.e. that the processor clock gets an edge only right after a DRAM clock edge!
How can you make sure that refresh will happen even if the DRAM controller
is always busy?
Problem 2a: TAGs file
Build the TAGs file for the cache. Make sure that you can reset all the valid bits to zero after reset. You are allowed to use smaller SRAM components for tags, etc, if you wish (make sure that they compile properly to SRAMs that work with the board).
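As a starting point, here is a rough sketch of a tag file with resettable valid bits. The geometry (64 direct-mapped lines, 14-bit tags) and the port names are purely illustrative; substitute your own cache parameters, and note that a behavioral register array like this may or may not compile to the SRAMs you want.

    module tag_file(clk, reset, index, tag_in, write_en, tag_out, valid_out);
       parameter LINES    = 64;               // assumed number of cache lines
       parameter TAG_BITS = 14;               // assumed tag width
       input                 clk, reset, write_en;
       input  [5:0]          index;           // log2(LINES) bits wide
       input  [TAG_BITS-1:0] tag_in;
       output [TAG_BITS-1:0] tag_out;
       output                valid_out;

       reg [TAG_BITS-1:0] tags [0:LINES-1];
       reg [LINES-1:0]    valid;              // one valid bit per line

       always @(posedge clk) begin
          if (reset)
             valid <= {LINES{1'b0}};          // invalidate every line after reset
          else if (write_en) begin
             tags[index]  <= tag_in;
             valid[index] <= 1'b1;
          end
       end

       assign tag_out   = tags[index];
       assign valid_out = valid[index];
    endmodule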
Problem 2b: Write buffer for DATA CACHE
Processor writes to the data cache will go into the cache (if a particular cache line is cached) and will also go directly to DRAM. This can be a serious bottleneck, since it means that every processor write takes a complete DRAM write cycle. To ameliorate this problem, design a 4-entry write buffer for your system. Think of this buffer as sitting between the processor and the cache. This buffer should take writes from the processor and hold them until the DRAM is free. Whenever the buffer is full, you must stall the processor from writing.
Each entry will have a 32-bit address, a 32-bit data word, and a valid bit. Make sure to make this 4-entry write buffer fully-associative so that values sitting in the buffer will be returned properly from load instructions.
Processor load instructions must do the following: First, they check in the write buffer. If there is an entry there with the right address, the load should simply return the value directly from the write buffer. Otherwise, it will look in the cache. If there is a value available, then it will return the value immediately. Otherwise, it will stall the load and request a cache fill from the DRAM.
Processor store instructions should do the following: First, they check in the write buffer. If there is an entry there with the right address, then the store will simply overwrite the entry in the buffer. Otherwise, if there is a free entry, use it for the store. Otherwise, stall the store until an entry is free.
Now, the DRAM controller will have 2 different inputs from the data cache: either (1) it will fulfill a complete cache-line read during a cache miss, or (2) it will handle a single-word write to DRAM. Note that, when we decide to empty a single-word write from the write buffer, we will write it to the DRAM and also to the cache (if the word is currently cached).
The write buffer should be emptied in FIFO order (oldest write first).
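A rough sketch of one way to organize such a buffer is shown below: head/tail pointers give FIFO order, and a combinational match over all four entries gives the fully-associative lookup. All port names are made up, the cache-update path for drained entries is not shown, and a real version must also decide what happens when a store merges into the very entry being drained that cycle.

    module write_buffer(clk, reset,
                        store_req, store_addr, store_data, store_stall,
                        load_addr, load_hit, load_data,
                        drain_req, drain_grant, drain_addr, drain_data);
       input         clk, reset;
       input         store_req;               // processor store this cycle
       input  [31:0] store_addr, store_data;
       output        store_stall;             // buffer full and no matching entry
       input  [31:0] load_addr;               // processor load lookup
       output        load_hit;
       output [31:0] load_data;
       output        drain_req;               // oldest entry ready for the DRAM side
       input         drain_grant;             // DRAM side accepts the oldest entry now
       output [31:0] drain_addr, drain_data;

       reg [31:0] addr [0:3];
       reg [31:0] data [0:3];
       reg [3:0]  valid;
       reg [1:0]  head, tail;                 // oldest entry / next free slot

       // Fully-associative match against every valid entry.
       wire [3:0] store_match = { valid[3] & (addr[3] == store_addr),
                                  valid[2] & (addr[2] == store_addr),
                                  valid[1] & (addr[1] == store_addr),
                                  valid[0] & (addr[0] == store_addr) };
       wire [3:0] load_match  = { valid[3] & (addr[3] == load_addr),
                                  valid[2] & (addr[2] == load_addr),
                                  valid[1] & (addr[1] == load_addr),
                                  valid[0] & (addr[0] == load_addr) };

       wire store_hit = |store_match;
       wire full      = &valid;

       assign load_hit    = |load_match;
       assign load_data   = load_match[0] ? data[0] : load_match[1] ? data[1] :
                            load_match[2] ? data[2] : data[3];
       assign store_stall = store_req & full & ~store_hit;
       assign drain_req   = |valid;
       assign drain_addr  = addr[head];
       assign drain_data  = data[head];

       integer i;
       always @(posedge clk) begin
          if (reset) begin
             valid <= 4'b0000;  head <= 2'd0;  tail <= 2'd0;
          end else begin
             if (drain_grant & drain_req) begin       // retire the oldest entry (FIFO order)
                valid[head] <= 1'b0;
                head        <= head + 2'd1;
             end
             if (store_req & store_hit) begin         // merge: overwrite the matching entry
                for (i = 0; i < 4; i = i + 1)
                   if (store_match[i]) data[i] <= store_data;
             end else if (store_req & ~full) begin    // otherwise allocate at the tail
                addr[tail]  <= store_addr;
                data[tail]  <= store_data;
                valid[tail] <= 1'b1;
                tail        <= tail + 2'd1;
             end
          end
       end
    endmodule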
Problem 2c: TESTING
Once your cache is designed, you may test it in one of two ways. The first way: leave your cache as a separate component and test it using vectors and manually assigned signals. If you use this method, you must be careful to keep the testing readable and concise for grading. Remember, we don't like looking at waveforms in the lab report.
The second way: you can hook up a new DRAM controller and DRAM TO EACH CACHE and begin testing your design. Keep in mind that the FETCH stage does not have a write buffer. You do not need to worry about simultaneous requests in this problem, which is why we allow a memory block for each cache. Further, force load instructions to stall until the write-buffer is empty. This way, the DRAM controller for the memory stage will always have only one request at a time (write from write buffer vs read from cache).
At this point (regardless of which testing method you use), you should begin to evaluate how the addition of cache affected your critical path (since it is required in the report!).
Problem 3: Adding a Single DRAM and Arbitration to Your Processor
After you have your cache ready to go, there is one more problem that needs to be fixed. Both the data and instruction caches could quite possibly need to access the DRAM at the same time. You will need to design an arbitration method for handling simultaneous DRAM requests. Depending on your cache timing and how efficient you try to be, this can be the most difficult and trickiest part of this lab. There is no recommended way to accomplish this task, but you are certainly allowed to design a single state machine in Verilog and use it as a memory arbiter/controller; you are also allowed to modify your DRAM controller from Problem 1. Try to be as realistic as possible with the delay through this module, and keep in mind the speed it should have compared to the other components listed above.
To be more specific, your arbiter is an entity that takes requests from (1) the instruction cache (for cache misses), (2) the data cache (for cache misses) and (3) the write buffer. It should trade off accesses to the DRAM in a fair way. A suggested priority (a rough Verilog sketch follows the list):
1) If the write-buffer is full, it is highest-priority.
2) Next, if either the processor or DRAM is busy with a read-miss,
let them go forward. Make sure that, if both are requesting, you
alternate between them.
3) If nothing else is requesting, try to empty the write buffer.
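A rough Verilog sketch of that policy is shown below. The two-bit grant encoding and the port names are made up, and a real arbiter must also hold its grant until the DRAM controller finishes the access, which this combinational sketch does not show.

    module dram_arbiter(clk, reset, wb_full, wb_req, icache_req, dcache_req, grant);
       input            clk, reset;
       input            wb_full;       // write buffer is completely full
       input            wb_req;        // write buffer has at least one entry to drain
       input            icache_req;    // instruction-cache miss pending
       input            dcache_req;    // data-cache miss pending
       output reg [1:0] grant;         // 00 = none, 01 = write buffer, 10 = icache, 11 = dcache

       reg last_was_icache;            // remembers which cache went last, for alternation

       always @(*) begin
          if (wb_full)                        grant = 2'b01;                            // rule 1
          else if (icache_req && dcache_req)  grant = last_was_icache ? 2'b11 : 2'b10;  // rule 2: alternate
          else if (icache_req)                grant = 2'b10;
          else if (dcache_req)                grant = 2'b11;
          else if (wb_req)                    grant = 2'b01;                            // rule 3
          else                                grant = 2'b00;
       end

       always @(posedge clk) begin
          if (reset)                last_was_icache <= 1'b0;
          else if (grant == 2'b10)  last_was_icache <= 1'b1;
          else if (grant == 2'b11)  last_was_icache <= 1'b0;
       end
    endmodule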
TESTING: How will you test this? Devise an extensive test methodology and tell us about it. How can you make sure that the write buffer works and that values are written properly through to memory?
Problem 4: Enhancing the I/O module
Finally, we have several enhancements that you must make to the I/O module. We will be adding to the address space we used in Lab 5. There are 3 distinct address ranges to handle.
Problem 4a: Miscellaneous I/O
In the previous lab, we set I/O space in the top-4
words. Now, we would like the following:
Address    | Read           | Write
-----------+----------------+--------
0xFFFFFFF0 | DP0            | DP0
0xFFFFFFF4 | DP1            | DP1
0xFFFFFFF8 | Input switches | Nothing
0xFFFFFFFC | Cycle counter  | Nothing
As in Lab 5, DP0 and DP1 are the registers whose outputs appear on the HEX LEDs. Notice that your I/O should now allow you to read back values that have been written. As before, 0xFFFFFFF8 reads from the switches as in Lab 5. The new entity is the cycle counter: a 32-bit counter that counts once per cycle and will be used to measure statistics. Notice that it should be reset to zero on processor RESET and just count from that point on.
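A minimal sketch of the counter itself (hook its output into your I/O read path at 0xFFFFFFFC):

    module cycle_counter(clk, reset, count);
       input             clk, reset;
       output reg [31:0] count;

       always @(posedge clk)
          if (reset) count <= 32'd0;           // cleared on processor RESET
          else       count <= count + 32'd1;   // counts once per cycle thereafter
    endmodule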
Describe tests that will demonstrate that these features work.
Problem 4b: Level 0 boot
We will have a special 24-word ROM that appears at addresses 0xFFFFFF00 - 0xFFFFFF5C. You can build this ROM any way you wish. Compiling it into hex and producing a logic block will work as well as anything.
Reads from this address range will return the corresponding
instruction. Writes to this range will do nothing. Arrange
your RESET sequence so that the first PC is always 0xFFFFFF00.
Address    | Instruction
-----------+-----------------------------------------------
0xFFFFFF00 | lui $r8, 0xDEAD ;Initial Display |
0xFFFFFF04 | ori $r8, $r8, 0xBEEF |
0xFFFFFF08 | sw $r8, -16($r0) ;Put "DEADBEEF" on display |
0xFFFFFF0C | lui $r1, 0x8000 ;Instruction I/O Space |
0xFFFFFF10 | ori $r7, $r0, 0x2000 ;Limit of 8K |
0xFFFFFF14 | j L3 ;Go copy first block |
0xFFFFFF18 | lw $r8, 4($r1) ;Save instruction address |
0xFFFFFF1C | L1: addiu $r1, $r1, 8 ;Skip block header |
0xFFFFFF20 | L2: lw $r4, 0($r1) ;Next word |
0xFFFFFF24 | sw $r4, 0($r2) ;Copy to memory |
0xFFFFFF28 | addiu $r3, $r3, -1 ;Decrement count |
0xFFFFFF2C | addiu $r1, $r1, 4 ;Increment source |
0xFFFFFF30 | bne $r3, $r0, L2 ;Not done |
0xFFFFFF34 | addiu $r2, $r2, 4 ;Increment destination |
0xFFFFFF38 | slt $r5, $r1, $r7 ;Run over limit? |
0xFFFFFF3C | beq $r5, $r0, BADEND ;Yes. Format problem? |
0xFFFFFF40 | L3: lw $r3, 0($r1) ;Get next length |
0xFFFFFF44 | bne $r3, $r0, L1 ;Non-zero? Yes, copy |
0xFFFFFF48 | lw $r2, 4($r1) ;Get next address |
0xFFFFFF4C | END:sw $r8, -16($r0) ;Put execution addr in DP0 |
0xFFFFFF50 | jr $r8 ;Start Executing |
0xFFFFFF54 | break 0xAA ;Pause with 10101010 |
0xFFFFFF58 | BADEND: j BADEND ;Loop forever |
0xFFFFFF5C | break 0x7F ;Indicate problem! |
You can add more instructions if you wish, but you should do this basic functionality. Notice that what happens here is that the code looks for a compact description of instructions starting at address 0x80000000. The format of the block of memory at that address is:
Block 0:  Length0
          Address0
          Block0[0]
          Block0[1]
          ...
          Block0[Length0-1]
Block 1:  Length1
          Address1
          Block1[0]
          Block1[1]
          ...
          Block1[Length1-1]
Block 2:  Length2
...
This sequence is terminated with a zero Length field. It is assumed that Block 0 is a block of instructions and that the system should start executing at Address0 after it is finished copying data. The idea here is that you can have a sequence of instructions that is copied one place in memory and a sequence of data that is copied elsewhere.
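For example (a made-up image, using byte addresses), three instructions to be copied to address 0x00000000 followed by two data words for address 0x00000100 would be laid out starting at 0x80000000 as:

    0x80000000:  0x00000003    ; Length0 = 3 instruction words
    0x80000004:  0x00000000    ; Address0 = 0 (copy destination and eventual jump target)
    0x80000008:  <instruction 0>
    0x8000000C:  <instruction 1>
    0x80000010:  <instruction 2>
    0x80000014:  0x00000002    ; Length1 = 2 data words
    0x80000018:  0x00000100    ; Address1 = copy destination for the data
    0x8000001C:  <data word 0>
    0x80000020:  <data word 1>
    0x80000024:  0x00000000    ; Length = 0 terminates the sequence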
Problem 4c: Data Source
At a later time, we will produce the data source off the network. For now, you should use one of the 2Kx32 blocks from Lab 5 for this. Assume that you produce data in the above format. Compile it into a 2Kx32 block, then download it to your board. Note that reads from address 0x80000000 onward should read from this block. To reproduce what we had in Lab 5, you will simply add a length header and an address of 0 to the front of the instruction output from mipsasm; compile that into this module and you will be good to go. (Hint: there is a chance that we will be able to get the contents of such a RAM from TFTP packets, which would allow you to download data/instructions to a working board.) Writes to this address range should do nothing for now.
Problem 5: Tying it all together!
Finally, tie your Lab all together. You should have all of the I/O from your previous lab. Further, you should make a new file called "boardlevel.v" that includes (1) the new TopLevel.v from the high-level directory and (2) two copies of the DRAM simulation tied together as on the board. You should be able to demonstrate all of the new features in simulation, just as before.
Produce an extensive test suite to make sure that everything still works. You should use the same test programs from last Lab plus a bunch of new ones. Tell us about your test philosophy.
Then, you should be able to run things on the board. Can you produce a long-lived test that verifies that DRAM refresh still works?
Your lab report should contain a description of (a) how your DRAM controller operates, (b) how your cache operates, (c) how you handle DRAM request arbitration, (d) how the addition of your cache and main memory affected your critical path, and (e) how you determined your component delay values. Make sure to describe your testing methodology: how did you verify that your components worked properly? In addition, you should include legible schematics, all Verilog code, all diagnostic programs (in assembly language), and simulation logs. If you do the extra credit below, also include a description of how the additional improvements affect your performance and critical path with respect to the minimum requirements.
Make sure to give us complete information about the physical design:
(1) how many slices did you use/what fraction of the Xilinx chip did you
use? (2) What is your critical path? What is the fastest clock
that you think you can run with?
Extra Credit: Optimizing and Improving Your Cache
Instead of following the bare requirements listed above, you may complete any number of the following problems for extra credit. This means that you do not need to build the basic cache structure above first and then add the extra credit on top of it. You should consider yourself warned, though, that most of these options cannot be easily added onto the requirements above; they typically require entirely different cache architectures. These improvements are not trivial and may impact your ability to finish the lab on time. ONLY WORKING PROJECTS CAN RECEIVE FULL OR EXTRA CREDIT.
The read-burst timing diagram in the DRAM spec shows bursts of 4 words. You should arrange to allow bursts of 8 so that you can load a complete cache line in a single burst. To make this work, you need to set the burst length in the mode register to 8. Describe the required modifications to the processor side of your DRAM interface and to the DRAM state machine.
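For reference, a sketch of the corresponding mode-register values, using the same bit layout as in Problem 1b (M2:0 = 011 selects a burst length of 8):

    localparam [11:0] MODE_BL1 = 12'b00_1_00_010_0_000;  // Problem 1b: single-word accesses
    localparam [11:0] MODE_BL8 = 12'b00_1_00_010_0_011;  // read bursts of 8 (writes still single-word)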
Write a test program that shows that this works and improves your performance.
Can you use the cycle-counter to show this?
Further, both write and read cache misses need to read a complete 8-word cache line. Where this gets complicated is that you may need to kick out a dirty cache line in the process. Make sure that you get this correct! Remember that writing to memory is a long, painful penalty, and a sloppy write policy can make things much worse.
If you combine this option with the previous one, you need to enable write bursts as well (for write-back of dirty lines). To do that, set bit M9 of the mode register to 0, which makes write bursts use the same burst length as reads.
If you implement this improved write policy, you will need to be able to flush instructions out of the data cache in the boot-level 0 code. Why? One possible way to do this is to implement a flush instruction that takes an address and causes a matching line at that cached address to be dumped to memory.
Write a test program that shows that this works and improves your performance. Can you use the cycle-counter to show this?