A formal design document and a division of labor are due via e-mail to your TA by 9pm on Wednesday 4/7. This will give the TAs time to go over your design document with you during section on Thursday 4/8. You are encouraged to finish your design document and turn it in as early as possible, so that the TA can review it sooner and you can begin working on this lab sooner.
Lab reports for Lab 4 are due Thursday 4/22 at 11:59pm via the submit program. You will demonstrate your lab to your TA on Thursday 4/15 and Thursday 4/22 in section. During these demos, the TA will provide you with secret test code. If you are able to pass these tests on your first try, you will receive bonus points. If you fail the first time, your TA will give you the source code to the tests and you will have until the end of the day to pass the tests for full credit. If you are not done by the end of the day, you will lose points for each additional day that you are late.
We like to call this lab the "Widowmaker" (just ask Jack and John what happened last Spring), so you should get started now!
Lab Policy: Labs (including final reports) must be submitted by 11:59pm on the day that your lab is checked off. To submit your lab report, run m:\bin\submit-spring2004.exe or, at a command prompt, type "submit-spring2004.exe" and follow the instructions. Make sure you input the correct section number and the directory to submit. Remember that you can only submit once, so submit only when you're ready. Otherwise your lab/project grade will NOT be correctly recorded. The required format for lab reports is shown on the handouts page.
Lab 5:
In this lab, you will be designing a memory system for your pipelined processor. The previous memory module was far from practical, and you will never have separate, dedicated DRAM banks for instructions and data. Using a realistic main memory system will cause two problems in your pipelined processor: (1) your cycle time will dramatically increase as a result of the main memory write and read latency and (2) you must handle conflicts when both data and instruction accesses occur in the same clock cycle. As you most likely have learned in lecture and in the book, the solution to these problems is the addition of cache memory.
All code and help files for this lab are (as usual) in m:\lab5, including a new TopLevel.v file (with some sample things in it) and a DRAM simulation called mt48lc8m16a2.v.
There are some extra-credit options at the end of the lab. If you choose to try these extra-credit options, they may significantly affect your cache architecture. So, make sure to read through the whole lab first. If you choose to build the extra credit options, you are not required to demonstrate a working version with the base parameters.
Note that Lab 5 builds on Lab 4. Thus, unless otherwise noted, you should have the same features as the previous lab. We are adding to the functionality.
Because this lab is so monstrous, we are scheduling TWO separate checkoffs to make sure that you stay on the right track. The schedule of due dates is as follows.
Important: Your processor does not have to handle self modifying code (except for the level-0 boot); i.e. if it receives self modifying code the behavior is undefined.
As before, you will be evaluating your own and your team members' performance on the last lab assignment (Lab #4). Remember that points are earned not only by doing work, but also by how well you work in a team. So if one person does all the work, that certainly does not mean he/she should get all the points!
To evaluate yourself, give us a list of the portions of Lab 4 that you were originally assigned to work on, as well as the set of things that you eventually ended up doing (these are not necessarily the same, and we realize this).
Use the same point system as last time, but please remember to include a one or two line justification of your evaluation.
Email the result to your TA by Wednesday 4/7, 9 pm.
NOTE: Read these instructions carefully. They contain all the information you will need to write your report for this lab, including some requirements not directly related to the cache and memory additions!
The first step in creating a new memory system is changing the model for our main memory bank (hereafter called the DRAM). We will be using a model of the 128 Mbit DRAM from Micron that is on the hardware boards.
Please start by looking over the DRAM spec off the handouts page (http://inst.eecs.berkeley.edu/~cs152/handouts/128msdram_e.pdf).
We will be using RAMs that are in the 2 Meg x 16 x 4 bank configuration. Further, the board has 2 of them side-by-side for a total of 32 megabytes of RAM (2 Meg x 32 x 4 banks). All of the control signals for the two physical RAMs are ganged together as if you had a single 256 Mbit DRAM.
There is a model for one of the physical DRAMs called mt48lc8m16a2.v in the m:\lab5_6 directory. You should use this for simulation. Note that you will be building a board-level.v module that includes toplevel.v and a couple of DRAMs for your design.
For this problem, design a DRAM controller which has the following entity specification:
    module memory_control (RESET,PROCCLK,request,r_w,address,data_in,data_out,wait,
                           DRAMCLK,Data,Addr,Ba,CKE,CS,RAS,CAS,WE,DQM,CAS_TIME);
      input RESET;                // System reset
      input PROCCLK;              // Processor clock
      input request;              // Set to 1 when processor is making a request
      input r_w;                  // Set to 1 when processor is writing
      input [22:0] address;       // 23 bits = 8M *words* (32 megabytes)
      input [31:0] data_in;       // input data
      output [31:0] data_out;     // output data
      output wait;                // Tells the processor to wait for next value

      input DRAMCLK;              // Clock used to control the DRAM
      inout [31:0] Data;          // 32 bits of data bus (bi-directional)
      output [11:0] Addr;         // 12-bit Row/Column address
      output [1:0] Ba;            // Bank address
      output CKE,CS,RAS,CAS,WE;   // Clock enable and DRAM Command control
      output [1:0] DQM;           // Data control (shared between DRAMs, a bit strange!)

      // Your code here
    endmodule // memory_control

The processor part of this interface is very simple. When the "request" line is asserted on a rising edge, it is assumed that the "address" and "r_w" signals are stable. In addition, if the "r_w" signal is asserted (equal to "1"), it is assumed that a write will happen and that the "data_in" signal is stable as well. Immediately afterwards, the "wait" output will be asserted for every cycle until the final cycle of the access (accesses will take more than one cycle). If a read is requested, you should arrange for the read data to be ready (on the "data_out" bus) on the same clock edge on which "wait" is deasserted.
After the final cycle (falling edge with "wait" deasserted), the processor must set "request" to low or risk starting another access (you have several timing options here -- choose them carefully). One possible timing for a processor read is shown here:
[Figure: processor read timing diagram]
Keep in mind that this memory module is word addressed (i.e. the DRAM data bus produces a complete 32-bit word). Don't forget to adjust your processor addresses accordingly (i.e. processor addresses are byte-oriented).
Problem 1a: READ and WRITE
Now, consider the interface to the DRAM. Make sure to read the DRAM specification. Although this DRAM has many options, we are going to start with a very simple memory controller interface that reads or writes one 32-bit word for each activation. Extra-credit options allow you to get more sophisticated. In this section, you should control the DRAM with the DRAMCLK signal. Assume that the processor interface may be running single stepped (one PROCCLK cycle may span many DRAMCLK cycles). Keep this in mind throughout your design!
While the processor interface has a 23-bit address, the DRAM interface has only a 12-bit address plus a 2-bit bank select. This is because the DRAM takes its address in two pieces: the ROW/Bank address (which is the top 14 bits) and the COLUMN address (which is the bottom 9 bits).
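The address split described above can be sketched as a small Python model. This is only an illustration, not part of the required interface: the function name `split_address` is hypothetical, and the placement of the bank bits above the row bits within the top 14 bits is an assumption (the handout only fixes the 14/9 split).

```python
def split_address(byte_addr):
    """Split a processor byte address into (bank, row, column) DRAM fields."""
    # Processor addresses are byte-oriented; the memory is word addressed,
    # so drop the 2-bit byte offset and keep the 23 word-address bits.
    word_addr = (byte_addr >> 2) & 0x7FFFFF
    column = word_addr & 0x1FF      # bottom 9 bits: COLUMN address
    row_bank = word_addr >> 9       # top 14 bits: ROW/Bank address
    row = row_bank & 0xFFF          # ASSUMPTION: low 12 of those bits are the row
    bank = row_bank >> 12           # ASSUMPTION: top 2 bits select the bank
    return bank, row, column


if __name__ == "__main__":
    print(split_address(0x00000804))
```

The row and bank fields would be presented with the ACTIVE command and the column field with the READ or WRITE command.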
The four primary DRAM control signals, CS, RAS, CAS, and WE, together combine to form a "DRAM Command" as follows (from page 12 of the manual):
[Table: DRAM command truth table]
For basic functionality, we will only use the ACTIVE, READ, WRITE, and REFRESH commands. You can use others for extra credit.
Design a state machine that takes processor commands and sequences the DRAM to perform basic READ and WRITE operations. We will be talking more about DRAMs in class, but the key timing diagrams are shown in the following figures:
[Figures: DRAM READ and WRITE timing diagrams]
Notice, in particular, that we are using the "auto-precharge" feature of the DRAM. This means that we are not (at least for read or write) issuing PRECHARGE commands.
Note that we will be designing with a 2-cycle CAS latency. Also, assume that other timing parameters are adjusted appropriately. Note that these times are given in the spec at the bottom of the referenced pages. Consider, for instance, T_RCD. In the spec it is quoted as 20 ns for the slower part; with a 25 MHz clock (40 ns period), this is easily handled by 1 cycle. At 100 MHz (10 ns period), you would need 2 cycles.
Given the above timing, the minimum number of edges for a DRAM access becomes three at 25 MHz: one for the ACTIVE command, two for CAS latency, and one for the auto-precharge (which is overlapped with the last cycle of the CAS latency). If your clock cycle time is too short (as it would be if you optimized your processor well), then you need to use more cycles than the minimum. Further, since you probably won't know exactly how long your clock will be (or may be varying it during debugging), you may want to figure out how to lengthen the number of cycles for events automatically when the clock is too short. (Hint: look at the time between successive clock edges and add additional cycles until you have met each spec.) Note also that your controller must be able to handle both "FAST" and "SLOW" DRAMs.
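The hint above boils down to a ceiling division of each timing parameter by the clock period. A minimal Python sketch, assuming all times are given as whole nanoseconds (the helper name `cycles_needed` is hypothetical):

```python
def cycles_needed(t_spec_ns, clk_period_ns):
    """Smallest whole number of clock cycles that covers t_spec_ns."""
    # Integer ceiling division; every parameter costs at least one cycle.
    return max(1, -(-t_spec_ns // clk_period_ns))


if __name__ == "__main__":
    # T_RCD = 20 ns, the value quoted in the text for the slower part.
    print(cycles_needed(20, 40))  # 25 MHz clock
    print(cycles_needed(20, 10))  # 100 MHz clock
```

In hardware the same calculation would typically be done once, for each timing parameter, and the results used as counter reload values in the state machine.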
It is important to note that the DRAM databus is bidirectional. Thus, if you are writing to the DRAM, data should be driven on the data bus. If you are reading, the data bus should be left in a high-impedance state (this can be accomplished by using a tristate buffer).
When writing up this part of the problem, make sure to include timing diagrams for your processor interface (read and write), as well as your DRAM interface (read and write). Describe the state-machine that you used to control your DRAM and include the code for your controller. Explain your methodology for testing of the DRAM.
Problem 1b: Initialization
You need to be able to perform initialization properly after reset. Among other things, you need to set the mode register to "0b00 1 00 010 0 000": This means Write Burst=only 1, CAS latency = 2, Burst type = Sequential and Burst length = 1.
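As a cross-check on the bit pattern above, the mode register value can be assembled field by field. This is a sketch under assumptions: the field positions follow the grouping of the bit string in the text, and the burst-length codes other than 1 come from the usual SDRAM encoding, so verify them against the data sheet. The function name `mode_register` is hypothetical.

```python
# ASSUMPTION: burst-length codes other than 1 follow the usual SDRAM encoding.
BURST_LENGTH_CODES = {1: 0b000, 2: 0b001, 4: 0b010, 8: 0b011}


def mode_register(burst_length=1, interleaved=False, cas_latency=2,
                  single_write=True):
    """Assemble the 12-bit mode register value from its fields."""
    value = BURST_LENGTH_CODES[burst_length]   # bits 2:0, burst length
    value |= int(interleaved) << 3             # bit 3, 0 = sequential
    value |= cas_latency << 4                  # bits 6:4, CAS latency
    # bits 8:7 (operating mode) and 11:10 stay 00
    value |= int(single_write) << 9            # bit 9, write burst = single location
    return value


if __name__ == "__main__":
    print(format(mode_register(), "012b"))
```

With the default arguments this reproduces the pattern given in the text, 0b001000100000 (0x220).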
Make sure to read the section on initialization. The initialization diagram follows:
[Figure: DRAM initialization timing diagram]
This initialization timing should be triggered by a system reset. During the reset (which will be many cycles), you should ignore all processor requests (i.e. the WAIT line will be asserted).
Problem 1c: REFRESH
Finally, your DRAM must periodically execute refresh cycles. Your controller must somehow execute 4096 AUTO REFRESH commands every 64 ms. One way to do this is to have a counter that starts at a value of 64ms/DRAMCLKPERIOD/4096 and counts to zero over and over again. Every time that this counter hits zero, you increment a counter of "missed refreshes" (this count can be small, just a few bits' worth).
Every time your DRAM controller hits idle (all banks precharged), it checks the count of pending refreshes and performs auto-refresh cycles until this count is decremented to zero.
Here is the simple refresh diagram:
[Figure: auto-refresh timing diagram]
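The two-counter refresh scheme can be modeled in a few lines of Python. This is a behavioral sketch only: the class name `RefreshTracker`, the saturation limit of 7 pending refreshes, and the use of whole-nanosecond clock periods are all assumptions made for illustration.

```python
REFRESH_WINDOW_NS = 64_000_000   # 64 ms
REFRESH_COMMANDS = 4096          # AUTO REFRESH commands required per window


def refresh_interval_cycles(dramclk_period_ns):
    """DRAMCLK cycles between refresh requests (64ms / 4096 / period)."""
    return REFRESH_WINDOW_NS // REFRESH_COMMANDS // dramclk_period_ns


class RefreshTracker:
    def __init__(self, dramclk_period_ns, max_pending=7):
        self.interval = refresh_interval_cycles(dramclk_period_ns)
        self.countdown = self.interval
        self.pending = 0                 # the "missed refreshes" counter
        self.max_pending = max_pending   # ASSUMPTION: a 3-bit saturating count

    def tick(self):
        """Call once per DRAMCLK edge."""
        self.countdown -= 1
        if self.countdown == 0:
            self.countdown = self.interval
            self.pending = min(self.pending + 1, self.max_pending)

    def take_refresh(self):
        """Call when the controller is idle; True means issue AUTO REFRESH."""
        if self.pending == 0:
            return False
        self.pending -= 1
        return True


if __name__ == "__main__":
    print(refresh_interval_cycles(40))  # cycles between refreshes at 25 MHz
```

Note that a saturating counter silently drops refreshes if the controller stays busy for too long, which is one reason to think about how refresh gets priority over processor requests.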
Problem 1d: Testing
How will you test your controller? Come up with test benches to handle as many conditions as you can imagine. Be careful. Assume that the processor may request things at a much slower clock rate than the DRAM clock. However, you can assume that the processor clock is synchronized to the DRAM clock, i.e. that the processor clock gets an edge only right after a DRAM clock edge!
How can you make sure that refresh will happen even if the DRAM controller is always busy?
Problem 2a: TAGs file
Build the TAGs file for the cache. Make sure that you can reset all the valid bits to zero after reset. You are allowed to use smaller SRAM components for tags, etc, if you wish (make sure that they compile properly to SRAMs that work with the board).
Problem 2b: Write buffer for DATA CACHE
Processor writes to the data cache will go into the cache (if a particular cache line is cached) and will also go directly to DRAM. This can be a serious bottleneck, since it means that every processor write takes a complete DRAM write cycle. To ameliorate this problem, design a 4-entry write buffer for your system. Think of this buffer as sitting between the processor and the cache. This buffer should take writes from the processor and hold them until the DRAM is free. Whenever the buffer is full, you must stall the processor from writing.
Each entry will have a 32-bit address, a 32-bit data word, and a valid bit. Make sure to make this 4-entry write buffer fully-associative so that values sitting in the buffer will be returned properly from load instructions.
Processor load instructions must do the following: First, they check in the write buffer. If there is an entry there with the right address, the load should simply return the value directly from the write buffer. Otherwise, it will look in the cache. If there is a value available, then it will return the value immediately. Otherwise, it will stall the load and request a cache fill from the DRAM. Why can't we simply just look in the cache when we receive a load?
Processor store instructions should do the following: First, they check in the write buffer. If there is an entry there with the right address, then the store will simply overwrite the entry in the buffer. Otherwise, if there is a free entry, then use it for the store. Otherwise, stall the store until an entry is free.
Now, the DRAM controller will have 2 different inputs from the data cache. Either (1) it will fulfill a complete cache-line read during cache miss or (2) it will handle a single-word write to DRAM. Note that, when we decide to empty a single-word write from the write buffer, we will write it both to the DRAM and write it to the cache if the word is properly cached.
The write buffer should be emptied in FIFO order (oldest write first).
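The load, store, and drain rules above can be captured in a short behavioral model, which may be useful as a reference when writing test vectors. It is a sketch, not the required hardware structure: the class name `WriteBuffer` and its method names are hypothetical, and a Python dict stands in for the four address/data/valid entries.

```python
class WriteBuffer:
    """Behavioral model of a small fully-associative, FIFO-drained write buffer."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        # address -> data; a dict keeps insertion order, so the first key
        # is always the oldest write. Presence of a key plays the valid bit.
        self.entries = {}

    def store(self, address, data):
        """Return True if the store was accepted, False if it must stall."""
        if address in self.entries:
            self.entries[address] = data   # overwrite keeps the FIFO position
            return True
        if len(self.entries) == self.capacity:
            return False                   # buffer full: stall the processor
        self.entries[address] = data
        return True

    def load(self, address):
        """Return buffered data, or None to fall through to the cache."""
        return self.entries.get(address)

    def drain_oldest(self):
        """Remove and return (address, data) of the oldest write, or None."""
        if not self.entries:
            return None
        address = next(iter(self.entries))
        return address, self.entries.pop(address)


if __name__ == "__main__":
    wb = WriteBuffer()
    wb.store(0x100, 0xDEADBEEF)
    print(hex(wb.load(0x100)))
```

A drained entry would be written to the DRAM and, if the word is cached, to the cache as well.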
Problem 2c: TESTING
Once your cache is designed, you may test it in one of two ways. The first way: leave your cache as a separate component and test it using vectors and manually assigned signals. If you use this method, you must be careful to keep the testing readable and concise for grading. Remember, we don't like looking at waveforms in the lab report.
The second way: you can hook up a new DRAM controller and DRAM TO EACH CACHE and begin testing your design. Keep in mind that the FETCH stage does not have a write buffer. You do not need to worry about simultaneous requests in this problem, which is why we allow a memory block for each cache. Further, force load instructions to stall until the write-buffer is empty. This way, the DRAM controller for the memory stage will always have only one request at a time (write from write buffer vs read from cache).
At this point (regardless of which testing method you use), you should begin to evaluate how the addition of cache affected your critical path (since it is required in the report!).
Problem 3: Adding a Single DRAM and Arbitration to Your Processor
After you have your cache ready to go, there is one more problem that needs to be fixed. Both the data and instruction caches could quite possibly need to access the DRAM at the same time. You will need to design an arbitration method for handling simultaneous DRAM requests. Depending on your cache timing and how efficient you try to be, this can be the trickiest part of this lab. There is no recommended way to accomplish this task, but you are certainly allowed to design a single state machine in Verilog and use it as a memory arbiter/controller; you are also allowed to modify your DRAM controller as needed.
To be more specific, your arbiter is an entity that takes requests from (1) the instruction cache (for cache misses), (2) the data cache (for cache misses) and (3) the write buffer. It should trade-off accesses to the DRAM in a fair way. A suggested priority:
1) If the write-buffer is full, it is highest-priority.
2) Next, if either the instCache or dataCache is busy with a read-miss,
let them go forward. Make sure that, if both are requesting, you
alternate between them.
3) If nothing else is requesting, try to empty the write buffer.
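The suggested priority can be written down as a pure decision function, which is handy for checking an arbiter test bench against. This is a sketch of one possible reading: the function name `arbitrate`, its argument names, and the choice to start alternation with the instruction cache are assumptions.

```python
def arbitrate(wb_full, wb_nonempty, icache_miss, dcache_miss, last_cache_grant):
    """Pick the next DRAM requester.

    Returns (grant, last_cache_grant), where grant is "wb", "icache",
    "dcache", or None, and last_cache_grant records which cache went last.
    """
    if wb_full:                              # 1) a full write buffer goes first
        return "wb", last_cache_grant
    if icache_miss and dcache_miss:          # 2) both caches: alternate
        grant = "dcache" if last_cache_grant == "icache" else "icache"
        return grant, grant
    if icache_miss:
        return "icache", "icache"
    if dcache_miss:
        return "dcache", "dcache"
    if wb_nonempty:                          # 3) otherwise drain the write buffer
        return "wb", last_cache_grant
    return None, last_cache_grant


if __name__ == "__main__":
    print(arbitrate(False, True, True, True, "icache"))
```

The function would be evaluated each time the DRAM controller becomes free, with the returned `last_cache_grant` fed back in on the next call.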
TESTING: How will you test this? Devise an extensive test methodology and tell us about it. How can you make sure that the write buffer works and that values are written properly through to memory?
Problem 4: Enhancing the I/O module
Finally, we have several enhancements that you must make to the I/O module. We will be adding to the address space we used in Lab 4. There are 4 distinct address ranges to handle.
Problem 4a: Miscellaneous I/O
In the previous lab, we set I/O space in the top 4 words. Now, we are using some of the areas that were "Reserved for future use".
Address | Read | Write |
0x80000000-0x80000800 | See 4c | See 4c |
0x80000804-0xFFFFFEDC | Reserved for future use | Reserved for future use |
0xFFFFFEE0-0xFFFFFEE8 | See 4d | See 4d |
0xFFFFFEEC-0xFFFFFEFC | Reserved for future use | Reserved for future use |
0xFFFFFF00-0xFFFFFFEC | See 4b | See 4b |
0xFFFFFFF0 | DP0 | DP0 |
0xFFFFFFF4 | DP1 | DP1 |
0xFFFFFFF8 | Input switches | Nothing |
0xFFFFFFFC | Cycle Counter | Nothing |
As in Lab 4, DP0 and DP1 are the registers whose outputs appear on the HEX LEDs. The new entity is the cycle counter. This is a 32-bit counter that counts once per cycle. It will be used to measure statistics. Notice that it should be reset to zero on processor RESET and just count from that point on.
Describe tests that will demonstrate that these features work.
Problem 4b: Level 0 boot
We will have a special 28-word ROM that appears from address 0xFFFFFF00 to 0xFFFFFF6C. You can build this ROM any way you wish. Compiling it into hex and producing a logic block will work as well as anything. Instruction reads from this address range will return the corresponding instruction. All data memory accesses to this range will have undefined results. Arrange your RESET sequence so that the first PC is always 0xFFFFFF00.
0xFFFFFF00 | lui $8, 0x4849 #Initial Display |
0xFFFFFF04 | ori $8, $8, 0x2045 |
0xFFFFFF08 | sw $8, -48($0) |
0xFFFFFF0C | lui $8, 0x4152 |
0xFFFFFF10 | ori $8, $8, 0x5448 |
0xFFFFFF14 | sw $8, -44($0) |
0xFFFFFF18 | sw $8, -16($0) #Put "DEADBEEF" on display |
0xFFFFFF1C | lui $1, 0x8000 #Instruction I/O Space |
0xFFFFFF20 | ori $7, $0, 0x2000 #Limit of 8K |
0xFFFFFF24 | j L3 #Go copy first block |
0xFFFFFF28 | lw $8, 4($1) #Save instruction address |
0xFFFFFF2C | L1: addiu $1, $1, 8 #Skip block header |
0xFFFFFF30 | L2: lw $4, 0($1) #Next word |
0xFFFFFF34 | sw $4, 0($2) #Copy to memory |
0xFFFFFF38 | addiu $3, $3, -1 #Decrement count |
0xFFFFFF3C | addiu $1, $1, 4 #Increment source |
0xFFFFFF40 | bne $3, $0, L2 #Not done |
0xFFFFFF44 | addiu $2, $2, 4 #Increment destination |
0xFFFFFF48 | slt $5, $1, $7 #Run over limit? |
0xFFFFFF4C | beq $5, $0, BADEND #Yes. Format problem? |
0xFFFFFF50 | L3: lw $3, 0($1) #Get next length |
0xFFFFFF54 | bne $3, $0, L1 #Non-zero? Yes, copy |
0xFFFFFF58 | lw $2, 4($1) #Get next address |
0xFFFFFF5C | END:sw $8, -16($0) #Put execution addr in DP0 |
0xFFFFFF60 | jr $8 #Start Executing |
0xFFFFFF64 | break 0xAA #Pause with 10101010 |
0xFFFFFF68 | BADEND: j BADEND #Loop forever |
0xFFFFFF6C | break 0x7F #Indicate problem! |
You can modify the level 0 boot ROM any way you like, but it must fit within the following address range 0xFFFFFF00-0xFFFFFFEC and it must incorporate the basic functionality laid out above. Notice that what happens here is that the code looks for a compact description of instructions starting at address 0x80000000. The format of the block of memory at that address is:
Block 0: Length0
Address0
Block0[0]
Block0[1]
...
Block0[Length0-1]
Block 1: Length1
Address1
Block1[0]
Block1[1]
...
Block1[Length1-1]
Block 2: Length2
...
This sequence is terminated with a zero Length field. It is assumed that Block 0 is a block of instructions and that the system should start executing at Address0 after it is finished copying data. The idea here is that you can have a sequence of instructions that is copied one place in memory and a sequence of data that is copied elsewhere.
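To make the block format concrete, here is a small Python sketch that builds such an image and then replays the copy the boot code performs. The function names `build_image` and `load_image` are hypothetical, and the model ignores the 8K limit check and the display writes done by the boot ROM.

```python
def build_image(blocks):
    """Flatten [(address, [words]), ...] into the Length/Address/data format."""
    image = []
    for address, words in blocks:
        if not words:
            raise ValueError("a zero-length block would read as the terminator")
        image += [len(words), address] + list(words)
    image.append(0)   # a zero Length field terminates the sequence
    return image


def load_image(image):
    """Replay the copy loop; return (memory keyed by byte address, start address)."""
    memory = {}
    # Execution starts at Address0, the address field of the first block.
    start = image[1] if len(image) > 1 else None
    i = 0
    while image[i] != 0:
        length, address = image[i], image[i + 1]
        for k in range(length):
            memory[address + 4 * k] = image[i + 2 + k]
        i += 2 + length   # skip the header and the data just copied
    return memory, start


if __name__ == "__main__":
    image = build_image([(0x00000000, [0xA, 0xB]), (0x00100000, [1, 2, 3])])
    print([hex(w) for w in image])
```

A model like this can also serve as the reference when checking that the boot sequence left the expected words in DRAM.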
Problem 4c: Data Source
You should use either the TFTP Blackbox (highly recommended) or one of the synchronous RAM blocks (only if the black box doesn't work) from Lab 4 for your data source. Assume that you produce data in the above format. Compile it into a 2Kx32 block, then download it to your board. Reads from addresses 0x80000000 - 0x80000800 should read from this block. To reproduce what we had in Lab 4, you will simply add a length header and an address of 0x00000000 to the front of your instructions output from MIPSASM and a word of 0x00000000 to the end of the output. You may find some of MIPSASM's more advanced features useful in specifying higher address ranges and automatically calculating the length of the code.
For example here is a code sample that will generate the proper header and footer for a simple code block:
    .count words begin end   # Count the number of words between the begin and end labels
    .word 0x00000000         # Place the starting address
    .address 0x00000000      # Direct MIPSASM to use 0x00000000 as the address when performing jumps
    begin:
        srl $0, $0, 0
        srl $0, $0, 0
        srl $0, $0, 0
        srl $0, $0, 0
        srl $0, $0, 0
    end:
    .word 0x00000000         # A trailing 0x00000000 terminates the instruction stream
Something more fancy might be:
    .count words begin1 end1   # Count the number of words in the instruction segment
    .word 0x00000000           # Instruction segment starting address
    .address 0x00000000        # Direct MIPSASM to use 0x00000000 as the address when performing jumps
    begin1:
        srl $0, $0, 0
        srl $0, $0, 0
        srl $0, $0, 0
        srl $0, $0, 0
        srl $0, $0, 0
    end1:
    .count words begin2 end2   # Count the number of words in the data segment
    .word 0x00100000           # Data segment starting address
    .address 0x00100000        # Direct MIPSASM to use 0x00100000 as the address when performing jumps
    begin2:
        .word 0x00000001
        .word 0x00000002
        .word 0x00000003
    end2:
    .word 0x00000000           # A trailing 0x00000000 terminates the sequence
Note that writes to this address range are undefined.
Problem 4d: ASCII Text conversion
The TAs have provided an ASCII conversion tool that takes an ASCII code in 7 bits and outputs 7 bits of coding information for the hexadecimal LEDs. Note that not all characters can be displayed on the LEDs, so some characters will be converted to a blank space.
ASCII_REG1 is a 56-bit register that contains the ASCII-converted display information for hexadecimal LEDs 1-4.
ASCII_REG2 is a 56-bit register that contains the ASCII-converted display information for hexadecimal LEDs 5-8.
Both ASCII_REG1 and ASCII_REG2 store the lower-numbered LEDs in the higher bits.
POINT_REG is an 8-bit register that contains the display information for the LED decimal point segments. The high bit corresponds to LED point 1. A high value indicates that the LED point is turned on.
Address | Read | Write |
0xFFFFFEE0 | Nothing | Convert the low 7 bits in each byte of the word stored into ASCII and store it in ASCII_REG1. |
0xFFFFFEE4 | Nothing | Convert the low 7 bits in each byte of the word stored into ASCII and store it in ASCII_REG2. |
0xFFFFFEE8 | Nothing | Store the low-order bit of each nibble into POINT_REG. The high-order nibble of the stored word corresponds to the high bit of POINT_REG. |
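The bit selection described in the table can be checked with a short Python sketch. Both function names are hypothetical, and returning the ASCII codes with the high byte first is an assumption based on the statement that lower-numbered LEDs live in the higher bits.

```python
def point_reg_from_word(word):
    """Pack the low-order bit of each nibble of a 32-bit word into 8 bits."""
    value = 0
    for nibble in range(7, -1, -1):          # nibble 7 is the high-order nibble
        bit = (word >> (4 * nibble)) & 1     # low-order bit of that nibble
        value = (value << 1) | bit           # high nibble ends up in the high bit
    return value


def ascii_codes_from_word(word):
    """Return the low 7 bits of each byte, high byte first (ASSUMED order)."""
    return [(word >> shift) & 0x7F for shift in (24, 16, 8, 0)]


if __name__ == "__main__":
    print(bin(point_reg_from_word(0x10000001)))
    print([chr(c) for c in ascii_codes_from_word(0x48492045)])
```

Each 7-bit code returned by `ascii_codes_from_word` would then go through the TA-provided conversion tool before being stored in ASCII_REG1 or ASCII_REG2.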
Problem 5: Tying it all together!
Finally, tie your Lab all together. You should have all of the I/O from your previous lab. Further, you should make a new file called "boardlevel.v" that includes (1) the new TopLevel.v from the high-level directory and (2) two copies of the DRAM simulation tied together as on the board. You should be able to demonstrate all of the new features in simulation, just as before.
Produce an extensive test suite to make sure that everything still works. You should use the same test programs from last Lab plus a bunch of new ones. Tell us about your test philosophy.
Then, you should be able to run things on the board. Can you produce a long-lived test that verifies that DRAM refresh still works?
Your lab report should contain a description of (a) how your DRAM controller operates, (b) how your cache operates, (c) how you handle DRAM request arbitration, (d) how the addition of your cache and main memory affected your critical path, and (e) how you determined your component delay values. Make sure to describe your testing methodology: how did you verify that your components worked properly? In addition, you should include legible schematics, all Verilog code, all diagnostic programs (in assembly language), and simulation logs. If you do the extra credit below, also include a description of how the additional improvements affect your performance and critical path with respect to the minimum requirements.
Make sure to give us complete information about the physical design: (1) How many slices did you use, and what fraction of the Xilinx chip is that? (2) What is your critical path? What is the fastest clock that you think you can run with? (3) Can your memory subsystem run at the same speed as your processor clock?
Extra Credit: Optimizing and Improving Your Cache
Instead of following the bare requirements listed above, you may complete any number of the following problems for extra credit. This means that you do not need to design the basic cache structure above, and then proceed to do the extra credit. You should now consider yourself warned that most of these cannot be easily added onto the requirements above; they are typically entirely different cache architectures. These improvements are not trivial, and may impact your ability to finish the lab on time. ONLY WORKING PROJECTS CAN RECEIVE FULL OR EXTRA CREDIT.
[Figure: DRAM burst read timing diagram]
This diagram shows bursts of 4 words. You should arrange to allow bursts of 8 so that you can load a complete cache line in a single burst. To make this work, you need to set the burst-length in the mode register to 8. Describe the required modifications to the processor side of your DRAM interface and to the DRAM state machine.
Write a test program that shows that this works and improves your performance. Can you use the cycle-counter to show this?
Please don't be misled by the extra credit. We are not grading the projects based on their performance relative to other groups in the class. The extra credit just represents more advanced features that are more difficult to design and deserve more points for the additional effort. Not doing the extra credit has no effect beyond not getting bonus points. You can still get a perfect score without it.