First, most solutions showed fairly decent performance. However, none came close to ESSL, mainly because they lacked cache blocking, which is why performance dropped for large matrices. Only a single group implemented two-level blocking (cache blocking plus register blocking). Since the objective of the assignment was to optimize matrix multiplication for the memory hierarchy, I felt obligated to take a few points off for solutions that used register blocking only.
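For reference, here is a minimal sketch of what two-level blocking can look like in C. It assumes square n x n row-major matrices of doubles and computes C += A * B. The outer three loops tile the iteration space into BS x BS cache blocks. Inside each block, a 2x2 register tile keeps four accumulators in registers. The names `mm_naive` and `mm_two_level`, the block size BS = 64, and the 2x2 register tile are all illustrative assumptions. They are not the parameters any group used or what ESSL does internally, and both sizes should be tuned for the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache-block size in elements; tune per machine so that
   the working tiles of A, B and C stay resident in cache. */
#define BS 64

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Reference triple loop (C += A * B), used only as a correctness check. */
void mm_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double s = 0.0;
            for (size_t k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] += s;
        }
}

/* Two-level blocked C += A * B: cache tiles outside, 2x2 register tile inside. */
void mm_two_level(size_t n, const double *A, const double *B, double *C)
{
    assert(A != NULL && B != NULL && C != NULL);
    /* Level 1: cache blocking over (i, k, j) tiles. */
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t kk = 0; kk < n; kk += BS)
            for (size_t jj = 0; jj < n; jj += BS) {
                size_t ie = min_sz(ii + BS, n);
                size_t ke = min_sz(kk + BS, n);
                size_t je = min_sz(jj + BS, n);
                size_t i = ii;
                /* Level 2: 2x2 register blocking within the cache tile. */
                for (; i + 1 < ie; i += 2) {
                    size_t j = jj;
                    for (; j + 1 < je; j += 2) {
                        double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
                        for (size_t k = kk; k < ke; k++) {
                            double a0 = A[i*n + k], a1 = A[(i+1)*n + k];
                            double b0 = B[k*n + j], b1 = B[k*n + j + 1];
                            c00 += a0 * b0; c01 += a0 * b1;
                            c10 += a1 * b0; c11 += a1 * b1;
                        }
                        C[i*n + j]       += c00; C[i*n + j + 1]     += c01;
                        C[(i+1)*n + j]   += c10; C[(i+1)*n + j + 1] += c11;
                    }
                    for (; j < je; j++) {          /* leftover odd column */
                        double c0 = 0.0, c1 = 0.0;
                        for (size_t k = kk; k < ke; k++) {
                            c0 += A[i*n + k]     * B[k*n + j];
                            c1 += A[(i+1)*n + k] * B[k*n + j];
                        }
                        C[i*n + j] += c0; C[(i+1)*n + j] += c1;
                    }
                }
                for (; i < ie; i++)                /* leftover odd row */
                    for (size_t j = jj; j < je; j++) {
                        double c = 0.0;
                        for (size_t k = kk; k < ke; k++)
                            c += A[i*n + k] * B[k*n + j];
                        C[i*n + j] += c;
                    }
            }
}
```

The BS x BS tiling is what keeps performance from falling off for large matrices, because each tile is reused from cache instead of being streamed from memory. The register tile alone only helps the innermost loop. A real tuned kernel would also pick the loop order, tile sizes, and possibly copy tiles into contiguous buffers based on measurements.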
Overall, you could get 25 points for performance (the winning group got all 25) and 75 points for your design justification: the sequence of things you tried, even if they did not work quite as expected, your explanation of why some optimizations helped and others did not, etc.
The groups that used IBM's performance monitor got a few extra points, depending on how extensively they used it.
You can examine last year's solutions that beat ESSL. Their coefficients, computed the same way as ours, are 1.085 and 1.069. A routine automatically generated by PHiPAC is available here.