U.C. Berkeley CS267: Comments on Assn 4

Overall, the quality of this assignment was much better than the previous one. Partly responsible, I'm sure, was the better temperament of rodin and moore, who'll hopefully stay cheerful for the remainder of the semester. On the other hand, this was also a more interesting assignment. I do have a couple of comments, however.

Several groups apparently didn't read the handout carefully enough to see what it asked for. In particular, you should have given formulas for "number of calls to collide()", "number of messages", and "number of bytes", each for version 1 and version 2 of the algorithm. Admittedly, these questions (and others) were buried in the assignment text and had to be searched for, but making one pass through the handout just for that purpose wouldn't have been too hard.

Several people seemed to be confused about what a message is. A message, as we meant it, is a CM-5 hardware message (otherwise, it would have been pointless to ask for both messages and bytes). Remember, a message is not equal to an interaction. Assuming the standard active message size (4 words/message), if a particle takes 6 words, you'll need to send 2 messages per particle if you send the particles one at a time using stores. If you pack particles together, 2 particles (12 words) take 3 messages instead of 4.
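To make the counting concrete, here is a minimal sketch of the arithmetic, assuming the figures quoted above (4 words per message, 6 words per particle). The function name and its parameters are made up for illustration and are not part of the assignment code.

```python
def messages_needed(n_particles, words_per_particle=6, words_per_message=4,
                    packed=False):
    """Count hardware messages needed to send n_particles.

    packed=False: each particle is sent on its own, so every particle is
                  rounded up to a whole number of messages.
    packed=True:  all particle words are laid end to end first, so only the
                  total is rounded up.
    """
    def ceil_div(a, b):
        # Integer ceiling division, avoids floating point.
        return -(-a // b)

    if packed:
        return ceil_div(n_particles * words_per_particle, words_per_message)
    return n_particles * ceil_div(words_per_particle, words_per_message)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, messages_needed(n), messages_needed(n, packed=True))
```

The gap between the two counts grows with the number of particles, which is why "number of messages" and "number of bytes" are separate questions.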

There was a large variety of solutions, and many people chose to ignore inter-processor collisions. Although I didn't penalize for this, I thought that you all should have handled them (it's not that bad if you have a good algorithm). Only one group "found" the algorithm I think is easiest *and* best; a short description of that algorithm follows.

We'll use the terminology in the assignment (so please excuse all the "sub"-ing). Inter-subsubblock interactions are computed by scanning all subblocks on each processor. An interaction between two subsubblocks is the usual check for collisions, update, etc. For each subsubblock, we follow an interaction path. The path, relative to the current subsubblock, is: right subsubblock, upper right subsubblock, upper subsubblock, upper left subsubblock (or its symmetric equivalents). Since each pair of neighboring subsubblocks lies on the path of exactly one of the two, scanning all subsubblocks finds all inter-subsubblock interactions.
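The claim that this four-entry path covers every neighboring pair can be checked with a small sketch. It assumes a single rectangular grid of subsubblocks with no wraparound at the edges (the assignment's boundary conditions may differ), and all names here are hypothetical.

```python
# Path offsets (dx, dy) relative to the current subsubblock:
# right, upper right, upper, upper left.
PATH = [(1, 0), (1, 1), (0, 1), (-1, 1)]


def path_pairs(width, height):
    """List every (current, path) subsubblock pair visited by the scan."""
    pairs = []
    for y in range(height):
        for x in range(width):
            for dx, dy in PATH:
                nx, ny = x + dx, y + dy
                # Assumption: no wraparound, so entries off the grid are skipped.
                if 0 <= nx < width and 0 <= ny < height:
                    pairs.append(((x, y), (nx, ny)))
    return pairs


def all_adjacent_pairs(width, height):
    """Brute force: every unordered pair of subsubblocks that touch,
    including diagonally."""
    cells = [(x, y) for y in range(height) for x in range(width)]
    return {frozenset((a, b))
            for a in cells for b in cells
            if a != b and abs(a[0] - b[0]) <= 1 and abs(a[1] - b[1]) <= 1}


if __name__ == "__main__":
    visited = path_pairs(3, 3)
    # Every touching pair is visited, and none is visited twice.
    print({frozenset(p) for p in visited} == all_adjacent_pairs(3, 3))
    print(len(visited) == len({frozenset(p) for p in visited}))
```

Using only half of the eight neighbors is what keeps each interaction from being computed twice.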

If the path goes off processor, a switching algorithm is employed. Here is a rough sketch of the switching algorithm.

=========
    current_subsubblock = The current subsubblock to operate on
    tmp_subsubblock = a temporary unused subsubblock buffer

    for all entries in path
        if path is off processor
            swapWithNeighborProc(current_subsubblock, tmp_subsubblock)
        endif
        path_subsubblock = path at (path mod number of horizontal subsubblocks)
        inter_subsubblock_interactions(current_subsubblock, path_subsubblock)
    endfor
=========
Therefore, when the path goes off processor, the current subsubblock is sent to the neighboring processor rather than fetching the path subsubblocks. This is a good idea because, in general, when a path goes off processor, more than one path subsubblock is off processor (in the right and top cases). If we instead brought the path subsubblocks to the local processor, we would incur additional and unnecessary communication. In other words, we bring the subsubblock to the mountain rather than the mountain to the subsubblock. Other advantages should be obvious. Further (and complete) details of this algorithm may be found here, at Los Alamos.
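The communication argument can be illustrated with a rough count. This is only a sketch under stated assumptions, not the Los Alamos implementation: each processor is assumed to own an n-by-n patch of subsubblocks, "fetch" is charged one transfer per off-processor path entry, "send" is charged one transfer per distinct neighboring processor the path touches, and return trips are ignored for both. The function names are hypothetical.

```python
# Path offsets (dx, dy): right, upper right, upper, upper left.
PATH = [(1, 0), (1, 1), (0, 1), (-1, 1)]


def owner(x, y, n):
    """Processor coordinates owning subsubblock (x, y), assuming each
    processor holds an n-by-n patch (a hypothetical layout)."""
    return (x // n, y // n)


def transfer_counts(x, y, n):
    """Return (fetch, send) transfer counts for subsubblock (x, y).

    fetch: bring each off-processor path subsubblock to the local processor.
    send:  ship the current subsubblock once to each neighboring processor
           that holds part of its path.
    """
    home = owner(x, y, n)
    remote = [owner(x + dx, y + dy, n) for dx, dy in PATH]
    remote = [p for p in remote if p != home]
    return len(remote), len(set(remote))


if __name__ == "__main__":
    n = 4  # patch size; the processor at (1, 1) owns x, y in 4..7
    for label, (x, y) in [("interior", (5, 5)), ("right edge", (7, 5)),
                          ("top edge", (5, 7)), ("top right corner", (7, 7))]:
        fetch, send = transfer_counts(x, y, n)
        print(f"{label}: fetch={fetch} send={send}")
```

Under these assumptions, sending the current subsubblock never costs more transfers than fetching, and it costs fewer in the right and top cases mentioned above.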