CS 252 - Project Suggestions Page
Here is a list of suggested CS 252 projects for Fall 1996.
If you use an item that was suggested by someone, you should at
least tell the person that you are using it, and hopefully tell them the result of the
investigation.
Note that you are not limited to the projects listed here. You may
work on anything that you find interesting that is deemed worthy of being
a CS 252 project.
If you want more of an idea of just what a CS 252 project entails, check out
the projects from last year:
Spring 1996 and Fall 1995.
Please try to find a project partner and have a rough idea of what you want to
work on by Monday, September 23. Send mail to
patterson@cs.berkeley.edu
and
rfromm@cs.berkeley.edu
indicating who you are working with and what you are interested in working on.
Please, just one message per group.
Feel free to contact us if you have any questions or are looking for suggestions
and/or clarifications.
A more detailed survey concerning your project choice will be coming shortly.
It is important to get started thinking about this early. Trust me, you don't
want to let this all go until the last minute.
XSPEC '02
eXperimental SPEC for 2002: create benchmark set of programs,
data size of future apps
- Why design computers of future using programs of past?
- e.g., Speech understanding, Word processing of book, Video, 3D animation, sound,
Object data base, Encryption, Just-In-Time compiler, Network apps, Games, ...
- Public domain and third party programs
- At least 2 programs, 3 data set sizes ('96, '99, '02), ported to at least 2 instruction sets
(with optimizing compilers); characterize using performance monitors (cache
misses, CPI, instruction types, I/O traffic, paging, ...)
- Success requires portability and publishing results
- Can be done by several groups
Java vs. SPEC
JAVA 1: characterization of cache behavior, register usage, branch behavior, and depth of
calls (to name a few interesting data points) for Java applications.
- Contrast these parameters with SPEC benchmarks.
- Analyze where differences come from
- Compare with vanilla byte codes and Just-In-Time compilers
Suggested by Dileep Bhandarkar (Dileep_Bhandarkar@ccm.sc.intel.com)
SPEC95 across different architectures
Since you have a variety of architectures, take the gcc compiler and measure SPECint95 on
some public version of Unix such as Linux or FreeBSD.
- You already know about ATOM (alpha) and EEL (sparc). Now there
is also ETCH for x86. Look at:
http://www.cs.washington.edu/homes/bershad/Etch/index.html
Having similar tools for 3 architectures might allow you to have 3
groups of students look at similar stuff on 3 architectures.
- Compare path lengths for the various architectures
- Do static code size comparisons
- See impact of optimizations
Suggested by Dileep Bhandarkar (Dileep_Bhandarkar@ccm.sc.intel.com)
SPEC95 cache analysis
Repeat Mark Hill's SPEC92 cache analysis for SPEC95 (a minimal simulator sketch appears below).
- Gee, J.D.; Hill, M.D.; Pnevmatikatos, D.N.; Smith, A.J. "Cache performance of the
SPEC92 benchmark suite." IEEE Micro, Aug. 1993, vol. 13, no. 4, pp. 17-27.
Abstract: The authors consider whether SPECmarks, the figures of merit obtained from
running the SPEC benchmarks under certain specified conditions, accurately
indicate the performance to be expected from real, live work loads. Miss
ratios for the entire set of SPEC92 benchmarks are measured.
- This may need multiple teams, N benchmarks per team.
- See if it varies for x86 vs. RISC? Use NOW, PC clusters.
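For a sense of the machinery involved, here is a minimal sketch of a trace-driven
miss-ratio simulator in C. The cache parameters (direct-mapped, 32-byte blocks,
32 KB) and the trace format (one hex address per line on stdin) are illustrative
assumptions, not part of the suggestion:

    /* Minimal trace-driven cache simulator: direct-mapped, 32-byte blocks,
       32 KB total. Reads one hex address per line from stdin and prints
       the miss ratio. All parameters are illustrative assumptions. */
    #include <stdio.h>

    #define BLOCK_BITS 5                /* 32-byte blocks */
    #define NUM_SETS   1024             /* 1024 sets * 32 bytes = 32 KB */

    int main(void)
    {
        static unsigned long tags[NUM_SETS];
        static int valid[NUM_SETS];
        unsigned long addr, refs = 0, misses = 0;

        while (scanf("%lx", &addr) == 1) {
            unsigned long block = addr >> BLOCK_BITS;
            unsigned long set = block % NUM_SETS;
            refs++;
            if (!valid[set] || tags[set] != block) {   /* miss: fill the set */
                misses++;
                valid[set] = 1;
                tags[set] = block;
            }
        }
        if (refs)
            printf("refs=%lu misses=%lu miss ratio=%.4f\n",
                   refs, misses, (double)misses / refs);
        return 0;
    }

The real study would add set associativity, a range of cache sizes, and the full
SPEC95 traces; this loop is only the innermost kernel of that work.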
Suggested by Dileep Bhandarkar (Dileep_Bhandarkar@ccm.sc.intel.com)
Measure OS primitives
Pick some set of OS primitives (process creation, synchronization,
etc.) and measure the times for NT vs. Unix on the same platform; a rough
timing sketch appears below.
- See if it varies for x86 vs. Alpha?
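On the Unix side, a measurement harness might look like the following (a minimal
sketch; NT would need CreateProcess and a high-resolution counter instead, and the
iteration count is an arbitrary assumption):

    /* Rough timing of one OS primitive: process creation on Unix,
       measured as fork + child exit + wait, averaged over many iterations. */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define ITERS 1000

    int main(void)
    {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        for (int i = 0; i < ITERS; i++) {
            pid_t pid = fork();
            if (pid == 0)
                _exit(0);               /* child exits immediately */
            waitpid(pid, NULL, 0);      /* parent waits for it */
        }
        gettimeofday(&t1, NULL);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_usec - t0.tv_usec);
        printf("fork+exit+wait: %.1f us per iteration\n", us / ITERS);
        return 0;
    }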
Suggested by Dileep Bhandarkar (Dileep_Bhandarkar@ccm.sc.intel.com)
A "voting" data-prefetch engine
This H/W device has the following characteristics:
- For any data reference stream, TWO independent prefetch devices make predictions: one is
a standard load address stride predictor (which predicts strided accesses), and the
other is a stream buffer, which basically reacts to cache miss history.
- The challenge is to design a voting function that dynamically selects one or the other of the
prefetch addresses to issue to the higher levels of the memory hierarchy (a sketch appears after this list).
- This selection should be made conditional on whichever of the two predictors is currently
generating the more accurate future address stream.
- Accuracy is defined as the ability to reduce future data cache misses.
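One plausible voting function (an assumption for illustration, not the required
design) is a small saturating counter nudged toward whichever predictor last
matched an actual miss address:

    /* Sketch of a voting prefetch selector. The stride predictor and the
       stream buffer themselves are stubbed out; only the vote is shown.
       The 2-bit saturating counter is one plausible choice, not the only one. */
    #include <stdio.h>

    static int vote = 2;   /* 0-1: favor stream buffer; 2-3: favor stride */

    unsigned long select_prefetch(unsigned long stride_pred,
                                  unsigned long stream_pred)
    {
        return (vote >= 2) ? stride_pred : stream_pred;
    }

    /* Called when the actual next miss address becomes known. */
    void update_vote(unsigned long stride_pred, unsigned long stream_pred,
                     unsigned long actual)
    {
        int stride_hit = (stride_pred == actual);
        int stream_hit = (stream_pred == actual);
        if (stride_hit && !stream_hit && vote < 3) vote++;
        if (stream_hit && !stride_hit && vote > 0) vote--;
    }

    int main(void)
    {
        /* toy check: the stream buffer is right twice, so the vote flips */
        update_vote(0x100, 0x200, 0x200);
        update_vote(0x100, 0x200, 0x200);
        printf("issue 0x%lx\n", select_prefetch(0x100, 0x200));
        return 0;
    }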
Suggested by Sharad Mehrotra (Sharad.Mehrotra@Eng.Sun.COM)
Architecture Archeology/Endangered Species Act
A project documenting architectural history might attempt to either collect or
construct emulators for machines which are disappearing.
- The real wonder of the ARPAnet for me in 1973 was the diversity of architecture. I
started on an IBM 360/75, and I believed at that time that the world revolved around
EBCDIC. Over the next couple of years I encountered my first DEC-10,
ILLIAC-IV, CDC-6600, ...
- The value of emulation history is going to take on interesting significance in the future.
The challenge will be to preserve this software history as the base emulation
machines themselves pass into history.
- Write emulators in Java so they can run anywhere? A simple assembler so
you can write programs?
Suggested by Eugene Miya (eugene@pioneer.arc.nasa.gov)
Evaluate usefulness of VIS/MMX instructions
Both Intel (MMX) and SPARC (VIS) have recently extended their instruction sets
with mini-vector-like operations that allow the simultaneous operation on groups
of values (i.e. vectors). One area which is being specifically targeted by
this is multimedia. Both companies claim significant speedups from the use
of these instructions. How valid are these claims? Evaluate the usefulness
of these additional instructions for some application(s). Are the instructions
worthwhile? Can you suggest any different new instructions which would be
better? What really matters
is complete, end-to-end performance of an entire application, not just an
impressive speedup on a small micro-kernel. Some claims and/or previous
evaluations may have been published in past issues of
Microprocessor Report.
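For a concrete sense of what these extensions do, here is one representative
operation, a saturating add of eight packed unsigned bytes (MMX's PADDUSB),
written out as the plain C loop it replaces; the test values are arbitrary:

    /* What a single MMX/VIS-style instruction does: a saturating add of
       eight packed unsigned bytes. In hardware this is one instruction;
       here it is a loop -- which is exactly the claimed speedup. */
    #include <stdio.h>

    typedef unsigned char u8;

    void paddusb(u8 dst[8], const u8 a[8], const u8 b[8])
    {
        for (int i = 0; i < 8; i++) {
            int sum = a[i] + b[i];
            dst[i] = (sum > 255) ? 255 : (u8)sum;   /* saturate, don't wrap */
        }
    }

    int main(void)
    {
        u8 a[8] = {250, 10, 128, 0, 200, 100, 55, 1};
        u8 b[8] = {10, 10, 128, 0, 100, 100, 200, 1};
        u8 d[8];
        paddusb(d, a, b);
        for (int i = 0; i < 8; i++)
            printf("%u ", d[i]);
        printf("\n");
        return 0;
    }

The evaluation question is then how often an application actually spends its time
in loops of this shape, end to end.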
The following several suggestions deal with the IRAM project. See the
IRAM web page
and/or contact
Dave Patterson
or
Rich Fromm
for more details about the project.
Investigation of logic circuits in memory processes
One problem with building a processor in a DRAM manufacturing process is that
the logic and memory processes are optimized for different factors, and logic
in a DRAM process is likely to run slower. How much slower is a very important
question. It would be very helpful to simulate (with SPICE) various logic
circuits in a memory process to try to quantify these differences. Other
related topics (besides speed differences) include area and power
considerations. Studying the performance of an SRAM cache built in a DRAM
process would also be useful, since an IRAM with a large DRAM main memory
is still likely to have an L1 cache, which is likely to be built from SRAM.
This project would investigate such physical phenomena and apply this knowledge
to some simple architectural evaluations.
Stelios Perisakis (sper@cs.berkeley.edu) has already investigated some of these
issues, so it would probably be helpful to consult with him before undertaking
this project.
IRAM with processor redundancy
Memory chips are typically built with redundancy to improve yield. Can this
make sense for logic as well, including logic combined with memory?
- Model processor redundancy to improve yield of IRAM
- Dynamic pipeline and multiple functional units
- Multiple small processors?
On-the-fly compression/decompression
An IRAM will have a tremendous bandwidth to the on-chip memory, but there is
still the problem of accesses that must go off-chip. The off-chip memory
bandwidth is likely to be orders of magnitude less than the on-chip bandwidth.
Perhaps the most sensible solution is to store programs and data in compressed
form, decompress them as they are read into the chip, and compress data as
it is written off of the chip. Evaluate the usefulness of this idea (with
both standard and novel compression schemes) with real programs and data,
paying careful attention to the
architectural feasibility of any implementation, not just how well some
algorithm might perform in theory.
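As a stand-in for a first cut (real work would compare stronger schemes against
the decompression latency the off-chip interface can hide), even a byte-wise
run-length encoder is enough to start measuring compression ratios on real
off-chip traffic:

    /* Byte-wise run-length encoding, emitting (count, byte) pairs.
       A deliberately simple stand-in compression scheme for first
       measurements; out must hold up to 2*n bytes. */
    #include <stdio.h>
    #include <string.h>

    size_t rle_compress(const unsigned char *in, size_t n, unsigned char *out)
    {
        size_t o = 0, i = 0;
        while (i < n) {
            unsigned char c = in[i];
            size_t run = 1;
            while (i + run < n && in[i + run] == c && run < 255)
                run++;
            out[o++] = (unsigned char)run;
            out[o++] = c;
            i += run;
        }
        return o;
    }

    int main(void)
    {
        unsigned char buf[] = "aaaaaabbbbccd";
        unsigned char out[2 * sizeof buf];
        size_t m = rle_compress(buf, strlen((char *)buf), out);
        printf("%zu bytes -> %zu bytes\n", strlen((char *)buf), m);
        return 0;
    }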
IRAM for large programs/data
Get traces of large programs. Compare and contrast:
- Cache Only Memory Architecture, all IRAMs
- IRAM as cache, external DRAM as main memory
- IRAM as low physical memory, external DRAM as high physical memory;
paging policy (CS 262)?
Evaluation of existing cache design(s) implemented as an IRAM
Take the memory hierarchy of an existing microprocessor and evaluate its
effectiveness if implemented as an IRAM. As mentioned above, a processor
running in a DRAM process is likely to be slower. Can you quantify how much
slower the processor can be while still having a performance advantage for an IRAM
vs. a conventional solution? Traces of real programs and/or the utilization
of on-chip performance counters on processors in existing systems could be
used as the data upon which to base the analysis.
Design and evaluation of a new microarchitecture for IRAM
Similar to above, but instead of using an existing design of a memory hierarchy,
come up with your own design that is optimized for an IRAM and evaluate its
effectiveness.
IRAM as a single-chip multiprocessor
Integrating the processor and memory provides for a tremendous amount of on-chip
bandwidth. An important question is how to best utilize this. One possibility
is by placing more than one processor on the same chip. Perhaps this is not
the right idea, since if bandwidth is the problem, having multiple processors
will just make the problem worse. But if there is enough available bandwidth
in an IRAM context, perhaps this would make sense.
A paper at this fall's ASPLOS,
The Case for a Single-Chip Multiprocessor,
investigated integrating multiple processors onto a single die. Would the
results of this investigation be any different looking into the future if this
were implemented in the 1 Gb DRAM generation, with the main memory on chip?
(The paper assumed a 256 KB L2 SRAM cache on-chip, implemented in a logic
process.)
IRAM as a building block of a distributed shared memory multiprocessor
Besides viewing an IRAM as a single uniprocessor, or a single multiprocessor
(see above), another potential is that an IRAM could be an ideal building
block for a larger multiprocessor system. Need more memory? Just throw
together multiple IRAMs, and you have the added benefit of having multiple
processors too.
There is a substantial body of literature on parallel processing architectures.
What if the building block of a distributed shared memory multiprocessor, which
is typically a processor with some local memory, were all contained on the same
chip? How many of the conclusions drawn from past studies would still apply,
and what might be different?
If you want to do any project (whether or not related to IRAM) that is
related to parallel processing, I would advise talking to David Culler
(culler@cs.berkeley.edu) to get some
more advice and insights. One paper that might be a good starting point for
ideas comes from this summer's ISCA, "STING: A CC-NUMA Computer System for the
Commercial Marketplace".
Treat registers like a cache
Have a (fully?) associative array of readout registers, and only
do the writeback once the readout register is replaced. This means that
a logical read turns into a physical read. It also implies that the
physical DRAM copy is wrong. All the standard cache questions apply:
stream buffers, speculation, etc. In addition, there is the question of
pre-writeback of dirty lines, as in the virtual memory subpiece of
operating systems.
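A minimal C sketch of the mechanism, assuming a small fully associative file of
readout registers with FIFO replacement in front of a DRAM array (the sizes and
the FIFO policy are illustrative assumptions):

    /* "Registers as a cache": a fully associative array of readout
       registers in front of DRAM. A miss allocates a readout register
       (the physical read); the DRAM copy is only updated when a dirty
       readout register is replaced (the deferred writeback). */
    #include <stdio.h>

    #define NREGS 8

    struct rreg { unsigned long addr; long val; int valid, dirty; };
    static struct rreg rf[NREGS];
    static long dram[1024];          /* stand-in for the DRAM array */
    static int victim = 0;           /* FIFO replacement pointer */

    static struct rreg *lookup(unsigned long addr)
    {
        for (int i = 0; i < NREGS; i++)
            if (rf[i].valid && rf[i].addr == addr)
                return &rf[i];
        return NULL;
    }

    static struct rreg *allocate(unsigned long addr)
    {
        struct rreg *r = &rf[victim];
        victim = (victim + 1) % NREGS;
        if (r->valid && r->dirty)
            dram[r->addr] = r->val;  /* writeback only on replacement */
        r->addr = addr; r->val = dram[addr];
        r->valid = 1; r->dirty = 0;
        return r;
    }

    long reg_read(unsigned long addr)
    {
        struct rreg *r = lookup(addr);
        if (!r) r = allocate(addr);  /* logical read -> physical DRAM read */
        return r->val;
    }

    void reg_write(unsigned long addr, long v)
    {
        struct rreg *r = lookup(addr);
        if (!r) r = allocate(addr);
        r->val = v;
        r->dirty = 1;                /* DRAM copy is now stale */
    }

    int main(void)
    {
        reg_write(5, 42);
        printf("%ld\n", reg_read(5));
        return 0;
    }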
Suggested by Eric Anderson (eanders@cs.berkeley.edu)
Programmable prefetch unit
Many people are building more and more
complicated prefetch units that try to learn indirect references, etc.
(See Sharad Mehrotra's work at UIUC:
Examination of a memory
access classification scheme for pointer-intensive and numeric programs and
Quantifying the
performance potential of a data prefetch mechanism for pointer-intensive and
numeric programs).
A potentially better idea is to let the compiler/user program the prefetch
unit. This would have a second little state machine that can inspect the
state of the primary CPU and issue memory prefetches based on the primary
CPU's state.
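A toy model of what programming the unit might mean: the compiler hands it a table
of (watch PC, offset, stride) rules, and the unit issues a prefetch whenever the
main CPU executes a load at a watched PC. The rule format and the addresses are
invented purely for illustration:

    /* Sketch of a compiler-programmed prefetch unit: a tiny state machine
       driven by a rule table. Everything here (rule format, PCs) is a
       made-up example of the idea, not a proposed ISA. */
    #include <stdio.h>

    struct rule { unsigned long pc; long offset; long stride; };

    static struct rule prog[] = {
        { 0x400100, 64, 8 },   /* e.g. run ahead of a pointer walk */
    };

    static void prefetch(unsigned long addr)
    {
        printf("prefetch 0x%lx\n", addr);   /* would go to the memory system */
    }

    /* Called (conceptually by hardware) on every load the main CPU does. */
    void prefetch_unit_observe(unsigned long pc, unsigned long addr)
    {
        for (unsigned i = 0; i < sizeof prog / sizeof prog[0]; i++)
            if (prog[i].pc == pc) {
                prefetch(addr + prog[i].offset);
                prog[i].offset += prog[i].stride;   /* unit-local state */
            }
    }

    int main(void)
    {
        prefetch_unit_observe(0x400100, 0x10000);
        prefetch_unit_observe(0x400100, 0x10008);
        return 0;
    }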
Suggested by Eric Anderson (eanders@cs.berkeley.edu)
Compressed linked lists vs. fully expanded arrays
When does compressing to a linked representation provide a performance
improvement vs. a fully expanded array representation? Linked lists allow
for easier insertions into the middle, and more flexible resizing of queues
and stacks. Some initial work on compressed representations for linked
lists, providing a middle ground between these two extremes, is the work on
Cache-Friendly Lists.
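The middle ground can be shown in a single data structure: an "unrolled" list node
that packs several elements contiguously (array-like cache behavior) while keeping
list-like insertion. A sketch in C; the node capacity is a tuning assumption:

    /* An unrolled list node: contiguous elements inside each node, so one
       cache miss covers many elements, with node splitting on overflow. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NODE_CAP 14   /* tune so a node spans a few cache lines */

    struct unode {
        int n;                  /* elements used in this node */
        int elem[NODE_CAP];     /* contiguous storage */
        struct unode *next;
    };

    /* Insert x at position i within node u, splitting when full. */
    void unode_insert(struct unode *u, int i, int x)
    {
        if (u->n == NODE_CAP) {             /* split: move half to new node */
            struct unode *v = malloc(sizeof *v);
            v->n = NODE_CAP / 2;
            v->next = u->next;
            u->next = v;
            u->n = NODE_CAP - v->n;
            for (int j = 0; j < v->n; j++)
                v->elem[j] = u->elem[u->n + j];
            if (i > u->n) { unode_insert(v, i - u->n, x); return; }
        }
        for (int j = u->n; j > i; j--)      /* shift within one node only */
            u->elem[j] = u->elem[j - 1];
        u->elem[i] = x;
        u->n++;
    }

    int main(void)
    {
        struct unode head = { 0, {0}, NULL };
        for (int k = 0; k < NODE_CAP; k++)
            unode_insert(&head, k, k);
        unode_insert(&head, 7, 99);          /* forces a split */
        printf("head now holds %d elements\n", head.n);
        return 0;
    }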
Suggested by Eric Anderson (eanders@cs.berkeley.edu)
Width and depth of circuits
Is transistor depth/cycle relatively constant? If so, then circuits
need to be "widenable" for performance improvement. Even with IRAM, CPUs
are still likely to improve faster than memory in latency, and hence to
keep from being limited by memory, you need to be able to reduce the miss
rate arbitrarily by wasting space and width -- but not depth -- in the circuits.
Some of the theory work on circuit depth may be applicable here.
Suggested by Eric Anderson (eanders@cs.berkeley.edu)
Cache replacement policies
Is farthest future use an optimal cache replacement policy? Given a
memory reference stream, and an optimal replacement policy, if the cache
does not do well on that stream, then no cache can do well. However, if
that cache can do well, and in addition conventional caches do poorly,
then there are still questions about improving caching. There is also the question of
how much buffering is needed to do perfect prefetching with the optimal
replacement policy. Define optimal as least total misses.
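A minimal C version of the farthest-future-use (Belady) policy over a toy
reference stream, with a fully associative cache of two blocks, makes the
definition concrete:

    /* Belady/farthest-future-use replacement over an in-memory reference
       stream. "Optimal" is defined as least total misses, as above. */
    #include <stdio.h>

    #define CAP 2   /* fully associative cache of two blocks (toy size) */

    static int next_use(const int *ref, int n, int from, int block)
    {
        for (int t = from; t < n; t++)
            if (ref[t] == block) return t;
        return n + 1;                       /* never referenced again */
    }

    int main(void)
    {
        int ref[] = {1, 2, 3, 1, 2, 4, 1, 2};
        int n = sizeof ref / sizeof ref[0];
        int cache[CAP], filled = 0, misses = 0;

        for (int t = 0; t < n; t++) {
            int hit = 0;
            for (int i = 0; i < filled; i++)
                if (cache[i] == ref[t]) hit = 1;
            if (hit) continue;
            misses++;
            if (filled < CAP) { cache[filled++] = ref[t]; continue; }
            /* evict the block whose next use is farthest in the future */
            int victim = 0, far = -1;
            for (int i = 0; i < filled; i++) {
                int nu = next_use(ref, n, t + 1, cache[i]);
                if (nu > far) { far = nu; victim = i; }
            }
            cache[victim] = ref[t];
        }
        printf("OPT misses: %d of %d refs\n", misses, n);
        return 0;
    }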
Suggested by Eric Anderson (eanders@cs.berkeley.edu)
The following several suggestions deal with the BRASS project. See the
BRASS web page
and/or contact
John Wawrzynek
or
André DeHon
for more details about the project.
Comparison between full-custom, FPGA, and processor implementations
There are a few instances (e.g. multipliers, FIRs) where we have
custom IC, FPGA, and processor implementations and can compare the
area-time efficiency of each. It would be beneficial to expand this set
with additional computational tasks/kernels/applications. So, an
interesting project might be to pick an application which has a plausible
custom implementation to compare against, and to develop/estimate a good
processor and array (FPGA/GARP-array/etc.) implementation (perhaps including
the processor+FPGA hybrid as well). In the long run, it will be
interesting to have a large set of tasks with
{custom,processor,fpga,processor+fpga,... matrix,vector...}
implementations to provide a better feel for the tradeoffs involved.
I've been collecting papers on (semi)-custom implementations of
various tasks to provide the custom datapoint, including:
path planning, DFT/FFT, Viterbi, DES, search/matching, template matching,
IDEA, Reed-Solomon codec.
Suggested by André DeHon (amd@CS.Berkeley.edu)
FPGAs to accelerate common APIs
John Hauser suggested the use of the coupled array to implement various
standard APIs which are being used to allow separate software/hardware
implementations (OpenGL, Direct{3D,Draw,X}, Photoshop API?, others?). It
might be interesting to pick one (or a subset of operations from one of
these) and look at the library implementation one could provide with a
coupled FPGA-accelerator. Presumably, we already have (or can easily
measure) the performance of the pure software implementations of these APIs
for comparison. In some cases, there might be custom hardware accelerators
to compare against, as well.
Suggested by André DeHon (amd@CS.Berkeley.edu)
Specialization Opportunity
We know the FPGA implementation is going to be larger/slower than a custom
implementation of the *exact* same task. However, the custom
implementation must often be heavily generalized over the task one wants to
solve at one particular time. The FPGA implementation can be built to
solve a more specialized task and in the process recover some of the
area-time penalty (e.g. implementing a multiply by an 8-bit constant
rather than having to implement a full 32x32 multiplier).
The goal would be to estimate the opportunity for specialization.
How often and how tightly are things known such that a more specialized
implementation can be used? And how much smaller/faster is the specialized
implementation?
More specialized versions may be possible because:
-
the structure of the program constrains it (now I multiply by a constant)
-
the value is bound early and then remains constant
(e.g. value changed outside a loop, or parameter
calculated only during initialization)
-
the PDF on a value is highly localized
(e.g. this value is almost always 8)
Looking at traces of program runs including their data values, we
can identify when things are bound and estimate how much tighter an
implementation would be possible if these values were treated as constants.
Similarly, for the localized PDFs of values, when they are sufficiently
localized, we can look at how much tighter the implementation would be for
the particular cases which occur most frequently.
Estimating actual savings might require some finesse. For some
operations, we might be able to pull out an equivalent logic implementation
of the task and run that through Synopsys to get a comparison.
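The flavor of the win shows up even in C: multiplying by a known 8-bit constant
such as 37 (= 32 + 4 + 1) reduces to three shift-adds, where the general operation
must provision a full 32x32 multiplier. A specialized FPGA datapath gets sized the
same way:

    /* Specialization example: general multiply vs. multiply by the
       known constant 37 = 32 + 4 + 1, i.e. three shift-adds. */
    #include <stdio.h>

    unsigned mul_general(unsigned x, unsigned k) { return x * k; }

    unsigned mul_by_37(unsigned x)            /* specialized for k = 37 */
    {
        return (x << 5) + (x << 2) + x;       /* 32x + 4x + 1x */
    }

    int main(void)
    {
        for (unsigned x = 0; x < 5; x++)
            printf("%u: general=%u specialized=%u\n",
                   x, mul_general(x, 37), mul_by_37(x));
        return 0;
    }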
Suggested by André DeHon (amd@CS.Berkeley.edu)
Coping with Finite Array Size
One issue which comes up regularly is how we deal with a finite array size.
Several models include:
-
map everything assuming an infinitely large array and perform
demand-driven fetching of operations
-
select a subset of the task which fits onto the array
and map the rest onto the coupled processor
[and this can be further broken down to:
a) single, fixed set
b) explicitly swapped overlays ]
-
"generalize" the structure built on the array
until it fits onto the space available
As Seth suggested, the optimum probably lies somewhere in a mix of these
techniques.
A project might pick an application or two and try out one or more
of these techniques to begin to understand the merits of each and the
situations under which each technique is preferable.
[point of reference: BYU's DISC architecture is the only one
currently doing demand paging. I think I've only seen an application or
two on it. I asked Michael Wirthlin at FCCM this year how things would
differ if, instead of demand paging the entire application, he simply
picked a subset of ops to implement on the array and implemented the rest
on the host processor -- he agreed with me that it would probably run
faster had he done that (but that wasn't what *he* was studying).]
[A closely related question of interest which also shows up here is
that of "how large" is the configurable array. In discussions with Tom
McDermott, he advocated that there are probably natural breakpoints --
i.e. there's a set of things you can do with ~100 configurable elements
(e.g. Razdan style, flow control and data conversion), another set you
might be able to do with thousands (small loop kernels), and other things
you can do when you can map large portions of the application into the
array (e.g. SPLASH/PAM apps?). The analog here might be like cache sizes
-- you get one break when the inner loop fits into the I-cache, another
when the program fits into cache, and yet another when the entire working
data set fits into real memory.
So, the question is: are there discontinuous breaks in the benefit
curve for configurable logic? And are there natural places where those
breaks line up between applications? (If they don't line up, the average
across many applications may be continuous as they all break at different
points). Natural breaks could argue for building arrays in particular
size ranges, whereas, a continuous average might argue that you just want
as much as you can justify without worrying about any magic sizes.]
Suggested by André DeHon (amd@CS.Berkeley.edu)
Multitasking and Configurable Arrays
We need to look at the interactions of multitasking and configurable
arrays.
Issues:
-
fetch/instruction management scheme
(array instruction faults in some schemes)
-
security
-
dealing w/ (non-array) memory faults
-
overheads and swapping costs (comparison w/ other swapping costs in os)
-
guaranteeing forward progress
One could approach this from various directions:
-
build a software layer on top of the GARP mechanisms John defined
-
design management scheme from scratch based on goals/constraints
(what should the hardware support to make it work well?)
-
consider what you might do w/ a pure array scenario
There are a number of performance issues associated with the details
of the instruction fetch scheme (explicit/implicit?, granularity of
swapping?,...) and the data swapping schemes. This might be a good tie-in
for someone looking for a pair of closely related 262/252 projects.
Suggested by André DeHon (amd@CS.Berkeley.edu)
Important Program Characteristics
There are also a number of program characteristics we'd like to start
trying to extract/estimate:
-
(derivable) datasizes (i.e. is this operation a 1-bit, 8-bit,
16-bit,... operation?) --
A fine-grained array can narrow the width of deployed resources
to the operations (and modern multimedia extensions allow
efficient handling of small ops when they can be performed in
SIMD sets).
-
retiming distances (how long between the production and
final consumption of data items -- this tells us how
much space we really need to hold intermediate data. A PDF
on retiming distances can help us understand the mix of
retiming depths/resources which will be necessary to run
programs; a sketch appears after this list).
-
richness of interconnect (I like to think of it as
calculating the Rent parameter for a program -- i.e.
how much locality is there in the program, how much
physical interconnect will we need to support a given
level of parallelism for this application.)
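As a sketch of extracting the second characteristic, here is a toy
retiming-distance histogram. The trace format (dynamic op t consumes the results
of two earlier ops, given as indices) is an invented stand-in for whatever the
real tracing tool would produce:

    /* Histogram (PDF) of retiming distances: dynamic ops elapsed between
       a value's production and its final consumption. Trace format is a
       made-up placeholder: op t reads the results of ops src1[t], src2[t]. */
    #include <stdio.h>

    #define MAXDIST 64

    int main(void)
    {
        int src1[] = {-1, 0, 0, 1, 2, 3};    /* -1 = no such operand */
        int src2[] = {-1, -1, 1, 2, 0, 4};
        int n = 6, hist[MAXDIST] = {0}, last_use[16] = {0};

        for (int t = 0; t < n; t++) {        /* find final consumption */
            if (src1[t] >= 0) last_use[src1[t]] = t;
            if (src2[t] >= 0) last_use[src2[t]] = t;
        }
        for (int v = 0; v < n; v++)
            if (last_use[v] > v)
                hist[last_use[v] - v]++;     /* retiming distance of v */
        for (int d = 1; d < MAXDIST; d++)
            if (hist[d])
                printf("distance %d: %d values\n", d, hist[d]);
        return 0;
    }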
Suggested by André DeHon (amd@CS.Berkeley.edu)
Usefulness of Vector IRAM and/or reconfigurable logic for real applications
One possible use of the available bandwidth for an IRAM is with a vector unit.
Applications that parallelize well are likely to be good matches for either
vectorizing or mapping to reconfigurable logic. What are some of these
applications, and just how much of a speedup can be obtained using either
a vector processor or an array of reconfigurable logic? A 294 project last
spring,
MPEG Encoding on a Vector Processor, looked at the performance speedup
from using vector processing on a portion of the MPEG encoding process. A
more complete project would be concerned with total end-to-end performance.
It is much harder to claim impressive speedups on entire applications than
just on small micro-kernels.
If you want to do any project related to vector computing, I would strongly suggest
talking to Krste Asanovic (krste@icsi.berkeley.edu), who
worked on the
T0 Vector Microprocessor.
Multiprocessor Simulator
We need to understand how different interconnects for large
multiprocessors (order 1K-10K nodes, 100 TFLOPS) impact their performance on
scientific codes which solve coupled-physics problems with multiple domain
decompositions. In these codes, different decompositions are used for
different physics, and all physics must be done at every time step. Note that
these codes are more complicated than many others due to the multiple domain
decompositions.
We wish to determine the dependence of computational performance on
interconnect details such as topology, latency, bandwidth, dynamic
reconfigurability, and cut-through vs. store-and-forward routing. Because the
codes to be simulated are large and complex, the instruction-level simulator
we're currently using is much too slow. We'd like to switch to a trace-based
simulator.
The immediate job of the student would be to build a simulator code which
can accept a code trace, a simplified description of the compute nodes, and a
detailed interconnect model, and provide output which includes total
time-to-solution plus details of traffic flow in the interconnect. These
details include throughput statistics, latency statistics, and some contention
details (where in the code, where in the interconnect, how much added delay).
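One way to pin the inputs down before writing any simulation logic is to draft
them as plain C structures; every field below is a guess to be refined with the
suggester:

    /* Draft input structures for the trace-driven interconnect simulator.
       All fields are assumptions to be negotiated, not a fixed format. */
    typedef struct {
        double work_rate;        /* work units per second per node */
        int    nodes;            /* order 1K-10K */
    } node_desc;

    typedef struct {
        char   topology[16];     /* "mesh", "fat-tree", ... */
        double hop_latency_us;   /* per-hop latency */
        double link_bw_mbs;      /* per-link bandwidth, MB/s */
        int    cut_through;      /* 1 = cut-through, 0 = store-and-forward */
    } interconnect_desc;

    typedef struct {             /* one record of the code trace */
        double timestamp;
        int    src, dst;         /* communicating nodes */
        long   bytes;            /* message size; 0 = pure compute event */
    } trace_event;

    int main(void) { return 0; } /* placeholder so the draft compiles */

Drafting the structures early also forces decisions about what the interconnect
description can and cannot express (e.g. whether reconfigurability is a field or
a sequence of descriptions).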
This project requires a creative soul to define a generic interconnect
input description and handling of the interconnect within the simulation, plus
the programming to define the simulator. The student would be expected to
talk with us about the relevant interconnect features which must be included,
and then to (rather independently) build a simulator for this purpose.
Obviously, the work needs to be documented well enough so that others can use
the simulator.
Once the simulator is completed, the student can either go off to other
research, or continue on this project -- becoming involved with simulating
different interconnects and understanding what (detailed) attributes are
needed for the codes of interest. Either would be acceptable to me.
I'd like to get a simulation code in place by January of '97 at the latest.
Suggested by Bob Deri (Bob.Deri@quickmail.llnl.gov)
Practical Implementations of ILP
Determine the degree of scalarity that would be practical
under realistic physical design constraints for a commercial
microarchitecture such as x86, PowerPC, SPARC, Alpha or MIPS.
Use traces or write a simulator that can execute binary code
in the target architecture.
Compare results for both in-order and out-of-order execution
in hardware. Out-of-order would require the assumption of a
certain size reorder window for load/stores and a certain size
reorder window for non-load/stores, or a single reorder window
for both (the HP PA8000 has separate windows).
Assume a certain pipeline (number of stages, etc.) and a
certain number of execution units of each type (branch, compare,
integer, load/store, floating-point); a toy sketch appears after the constraint list below.
Consider constraints such as:
- 1 vs. 2 load/store units
- maximum number of unresolved branches allowed
- whether more than one branch could be resolved per cycle
- ability to compress two dependent adds into a single cycle
(or to compress an add plus the address calculation
for a dependent load/store into a single cycle)
- 1 vs. 2 instructions per cycle that can set condition codes
- availability or lack of rename registers in the fp unit
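A toy version of the in-order case: scan a trace and greedily pair consecutive
independent instructions up to an issue width of 2, then report IPC. The trace
encoding is an assumption, and the real study would layer on the constraints
above:

    /* Greedy in-order dual-issue over a toy register-dependence trace. */
    #include <stdio.h>

    struct insn { int dest, src1, src2; };   /* register numbers, -1 = none */

    static int depends(struct insn a, struct insn b)  /* b reads a's result? */
    {
        return a.dest >= 0 && (b.src1 == a.dest || b.src2 == a.dest);
    }

    int main(void)
    {
        struct insn trace[] = {
            {1, 2, 3}, {4, 1, -1}, {5, 6, 7},
            {8, 5, 1}, {9, 2, 3}, {2, 9, -1},
        };
        int n = sizeof trace / sizeof trace[0], i = 0, cycles = 0;

        while (i < n) {
            int issued = 1;                      /* first slot always fills */
            if (i + 1 < n && !depends(trace[i], trace[i + 1]))
                issued = 2;                      /* pair if independent */
            i += issued;
            cycles++;
        }
        printf("IPC = %.2f\n", (double)n / cycles);
        return 0;
    }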
Suggested by George Taylor (taylor@exp.com)
The following suggestions were from David
Culler for CS 262. Some may be applicable as well for CS 252, some (i.e. many) may not.
Use your own discretion.
Stack vs. Register Machines
A re-evaluation of stack vs. register machines,
a la Amdahl, Blaauw, and Brooks, in the presence of multiple issue and
serious renaming.
Suggested by David Culler (culler@cs.berkeley.edu)
New Programming models on NOW
Shared virtual memory systems have mostly been developed in the context of
TCP/IP class networks (or highly specialized devices). Identify the best
candidate SVM system and port to Active Message on NOW. Run and compare
on Splash.
Suggested by David Culler (culler@cs.berkeley.edu)
Port HPF, PC++, CC++, NESL environments to NOW
Understanding system behavior in a large-scale cluster. Instrumentation and
visualization of system performance and utilization.
Suggested by David Culler (culler@cs.berkeley.edu)
Toolbox for Persistent Applications
Most applications that run for a long time are not monolithic. They
often evaluate complex operations over a region of a parameter space,
which may be determined dynamically or even via search. Extend the
smart-clients model to one of interacting with persistent "application
servers". Adaptation to load and configuration changes, as well as
fault resilience, are inherent in the toolbox.
Suggested by David Culler (culler@cs.berkeley.edu)
Resource allocation and starvation in a "successful NOW" with
persistent applications
Suggested by David Culler (culler@cs.berkeley.edu)
Resource Allocation, negotiation, coordination in a serverless server
Suggested by David Culler (culler@cs.berkeley.edu)
Scheduling for Large Scale I/O problems
BIG I/O problems need to be well coordinated, and there is time to do it.
Suggested by David Culler (culler@cs.berkeley.edu)
Parallel interface to XFS
Vesta, HPFS, PFS may be the wrong direction.
Suggested by David Culler (culler@cs.berkeley.edu)
Network Interface Pooling in Clusters of SMPs
In a NOW there are many factors that lead toward a system
configuration of one network interface per processor. SMP nodes offer
a new degree of freedom, since multiple processors share multiple
network interfaces. This allows a single processor to burst data at a
much larger rate by striping communication across the interfaces. It
also allows the network cost to be reduced by sharing interfaces in a
ratio of less than one-to-one. The effects of these factors have not
yet been well characterized.
Suggested by David Culler (culler@cs.berkeley.edu)
Network Design in Clusters of SMPs
Investigate how this degree of freedom at the nodes impacts the
design of the network itself. With multiple interfaces, each node
attaches to the network fabric at several points. This dramatically
changes the fault tolerance properties of the network as well as its
behavior under contention. It appears likely that many of the
benefits obtained by using randomness within the network can be
realized in practice by simply arbitrating among multiple ports into
the network at each node.
Suggested by David Culler (culler@cs.berkeley.edu)
Load balancing in Clusters of SMPs
Statistical analysis shows that systems on the scale of a thousand
processors will lose substantial efficiency from even very slight
variations in workload, such as might arise from cache effects,
floating-point unit effects, pipeline stalls and the like. In
practice, numerous factors make it difficult to achieve a near-perfect
workload distribution. Balancing work within an SMP is considerably
easier and cheaper than balancing work between nodes. Our initial
results suggest that the ability to dynamically balance work within
an SMP node significantly improves the efficiency of the overall
system. In addition to balancing work between processors within an
SMP, there is an opportunity to dedicate processors to communication
tasks. This potentially reduces synchronization costs while improving
attentiveness to the network. To date, there is little systematic
exploration of the potential alternatives.
Suggested by David Culler (culler@cs.berkeley.edu)
Is it fundamental that NT does not virtualize well?
Suggested by David Culler (culler@cs.berkeley.edu)
If the desktop is nothing but Java stations (NCs), what is the OS on
the server?
Suggested by David Culler (culler@cs.berkeley.edu)
What are the macrolevel (entire division) internet service requirements?
- email(t), web (t)
How does it scale with organizational size?
Caching, prefetching, watchdogs
- Departmental file service?
Suggested by David Culler (culler@cs.berkeley.edu)
On-line web tracking of AC Transit via GPS
Suggested by David Culler (culler@cs.berkeley.edu)
Prototype design for the home computer (not the office or the game room)
Suggested by David Culler (culler@cs.berkeley.edu)
Fundamental challenges underlying gigabit ethernet
Suggested by David Culler (culler@cs.berkeley.edu)
Floating-point registers: caller vs. callee save
One interesting project would be the division of floating-point registers into
caller-saved and callee-saved registers.
My survey on comp.compilers indicated that all other system vendors have at
least some callee-saved registers, and Sun is alone in having all registers saved
by the caller. This has a performance impact. It would be interesting to find
out how large it is.
It doesn't have to be a SPARC-centric study; it could be a general evaluation based on
a generic RISC processor like DLX, perhaps.
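A small example of where the cost appears: with all FP registers caller-saved, a
floating-point value live across a call must be spilled and reloaded around every
call site. Compiling something like the following with -S on SPARC vs. another
RISC shows the difference (f is just a stand-in for an external function):

    /* sum is live across every call to f(). Under an all-caller-saved FP
       ABI the compiler must store/reload sum around each call; with some
       callee-saved FP registers it can park sum in one and pay a single
       save/restore in f's prologue/epilogue instead. */
    #include <stdio.h>

    double f(double x) { return x * 0.5; }   /* stand-in for a real call */

    double accumulate(const double *a, int n)
    {
        double sum = 0.0;                    /* live across each call */
        for (int i = 0; i < n; i++)
            sum += f(a[i]);
        return sum;
    }

    int main(void)
    {
        double a[] = {1, 2, 3, 4};
        printf("%g\n", accumulate(a, 4));
        return 0;
    }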
Suggested by SME architecture group at Sun, via Robert Garner (robert.garner@Eng.Sun.COM)
Prepare-to-branch vs. branch-prediction (static vs. dynamic; fairness)
Suggested by SME architecture group at Sun, via Robert Garner (robert.garner@Eng.Sun.COM)
Out-of-order vs. in-order machines: Tradeoffs in hardware/software
complexity
Complex machines such as the HP PA8000 use a large window of available
instructions and hardware scheduling with register renaming in an
attempt to issue more instructions per clock. Is the extra complexity
in such processors justified? Can compilers close the gap between
such machines and simpler in-order machines through profile feedback
and static scheduling?
If you can run traces through appropriate in-order/out-of-order
models (simulators), it would be interesting to see what you come up with.
Suggested by SME architecture group at Sun, via Robert Garner (robert.garner@Eng.Sun.COM)
Do out-of-order machines need small L1 caches that are very
close (low L1 load latency) as much as in-order machines do?
What is the (perf) sensitivity of an out-of-order machine to L1 load latency?
If a latency of, say, up to 5 clocks can be hidden well, then should we just
build the largest cache we can at the 5-clock latency that the machine
can tolerate?
If the machine is capable of hiding the extra latency, why have a smaller
cache that is closer?
This problem can also be studied with the traces and simulators you have.
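A back-of-envelope average-memory-access-time comparison frames the tradeoff; the
miss rates and penalty below are invented numbers for illustration only:

    /* AMAT = hit time + miss rate * miss penalty, for a small/close L1
       vs. a larger/slower one. All numbers are invented for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double penalty = 50.0;                   /* cycles to memory */
        double small = 1.0 + 0.05 * penalty;     /* 1-cycle hit, 5% misses */
        double large = 5.0 + 0.01 * penalty;     /* 5-cycle hit, 1% misses */
        printf("small/close L1:  %.1f cycles\n", small);   /* 3.5 */
        printf("large/slower L1: %.1f cycles\n", large);   /* 5.5 */
        /* If out-of-order execution hides most of the 5-cycle hit time,
           the effective comparison is closer to 3.5 vs. roughly 1.5. */
        return 0;
    }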
Suggested by SME architecture group at Sun, via Robert Garner (robert.garner@Eng.Sun.COM)
Back to CS252 home page...