SelfManage Workshop at FCRC03
E. Lassettre et al., Dynamic Surge Protection...
deals with trying to predict workload based on both short-term and long-term
forecasts. The goal is to use this to "protect" against surges by
predicting them a little in advance. Pedram, you may find this useful given
your rejuvenation research. Keith/Jim, you may find it of interest as well.
Things of note:
- handles some surges pretty well, but to handle the "CNN surge" they also
  had to tune the DB to do group commit
- didn't seem to be familiar with other academic literature on resource
  provisioning from SOSP 2001
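The prediction idea, as I understood it, is roughly "compare a fast-reacting forecast against a slow baseline." Here's a toy sketch of that idea; the smoothing parameters, threshold, and function names are my own inventions, not the paper's algorithm:

```python
# Toy sketch: blend a short-term and a long-term forecast to flag an
# impending surge. NOT the paper's actual method; alpha values and the
# 1.5x threshold are made-up illustrative parameters.

def ewma_forecast(history, alpha):
    """One-step-ahead forecast via exponential smoothing."""
    f = history[0]
    for x in history[1:]:
        f = alpha * x + (1 - alpha) * f
    return f

def surge_predicted(history, threshold=1.5):
    short = ewma_forecast(history, alpha=0.8)  # reacts quickly to spikes
    long = ewma_forecast(history, alpha=0.1)   # slow-moving baseline
    # Flag a surge when the short-term forecast far exceeds the baseline.
    return short > threshold * long

reqs = [100, 105, 98, 102, 400, 800]  # requests/sec, sudden spike at end
print(surge_predicted(reqs))  # True: spike dominates the slow baseline
```

The point of keeping two horizons is that the long-term forecast absorbs diurnal drift while the short-term one catches the onset of a surge early enough to provision for it.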
Quantifying the benefits of resource multiplexing in on-demand data centers
What's the benefit to apps of doing resource allocation in a datacenter at
finer-than-node granularity? (I.e., beyond just dynamically assigning and
removing whole nodes from services, also dynamically assigning and removing
fractional node resources to services.)
- assumptions: apps can be assigned to any node (no affinity), and
  resources are homogeneous. This could be sort of true with VM migration.
- they define an "optimal allocation" scheme that is infinitely
  fine-grained in both resource allocation and time (continuous adjustments
  by infinitesimal amounts). "Capacity overhead" is the ratio of the
  "wasted" capacity (since the grain is not so fine in practice) to the
  optimal capacity.
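A toy illustration of what I take "capacity overhead" to mean; the granularity model (round each app's demand up to whole grains) and the example numbers are my own, not the talk's:

```python
import math

# Hedged sketch: overhead of a coarse-grained allocator relative to an
# infinitely fine-grained ("optimal") one. My own model, not the talk's.

def capacity_overhead(demand, grain):
    """Ratio of wasted capacity to optimal capacity when each app's
    demand is rounded up to a whole number of allocation grains."""
    optimal = sum(demand)  # a fluid allocator gives exactly the demand
    allocated = sum(math.ceil(d / grain) * grain for d in demand)
    return (allocated - optimal) / optimal

# Three apps with fractional-node demands (in units of one node).
demands = [0.3, 1.2, 0.6]
print(capacity_overhead(demands, grain=1))     # whole-node allocation
print(capacity_overhead(demands, grain=0.25))  # finer grain, less waste
```

Under this model the overhead shrinks monotonically as the grain gets finer, which is presumably the effect the paper is quantifying.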
- evidently, they assume that the requirements of apps sharing a node add
  up linearly? I.e., there's no performance crosstalk across apps sharing a
  node? According to Yahoo, this is not the case for them.
- do they assume that resource (re)allocation doesn't impact performance at
  all, i.e., that it's free? (Yes, they do.) That's not realistic: in
  practice, reallocation has a cost in app performance, since someone's
  resources have to be used to do the reallocation work. (Think scheduler
  activations.)
- what is the metric for app performance?
- Yahoo doesn't do this because of cross-app performance effects and the
  complication of recovery management. Comment?
- how much of this is just an artifact of the workload? In practice there
  is affinity in any nontrivial system.
- basically, this talk made way too many assumptions that are just wrong in
  practice, and the strength of the result seemed to rest pretty heavily on
  those assumptions.
Nonintrusive remote healing using backdoors
Insight: a system can't be expected to heal itself if it has corrupted
state, etc. So, do remote healing from another system that can
remote-access the target system. Eliminate reliance on target resources,
just like crash-only.
- the remote system must have remote memory access and remote I/O to the
  target, without involving the target system's processors or being
  "visible" to the target. Other architectural principles for backdoors are
  in the paper.
- remote DMA reads/writes via InfiniBand, VIA, etc. (just requires the
  target's NIC to be working, since the NIC does the DMA)
- use RDMA for both remote monitoring and remote repair
- monitoring detects stalls in progress counters
- recovery writes a new remote copy of per-session server state to
  reinstate broken sessions and resume service
- they mention the accuracy vs overhead tradeoff for remote monitoring
- other challenges: fault propagation; making sure the repair doesn't make
  things worse; security
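The stall-detection part of the monitoring could be sketched roughly like this; the RDMA plumbing is omitted (pretend the samples were read out of the target's memory), and the window heuristic is my guess, not the paper's:

```python
# Hedged sketch of stall detection on a progress counter. In the real
# system the samples would be fetched by RDMA reads of the target's
# memory, without involving the target's CPUs; here they're just a list.

def detect_stall(samples, window=3):
    """Return True if the counter failed to advance across `window`
    consecutive monitoring intervals (i.e., window+1 equal samples)."""
    if len(samples) < window + 1:
        return False  # not enough history to judge
    recent = samples[-(window + 1):]
    return all(a == b for a, b in zip(recent, recent[1:]))

# Counter advances, then freezes -> stall detected.
print(detect_stall([10, 14, 19, 19, 19, 19]))  # True
print(detect_stall([10, 14, 19, 25, 31, 36]))  # False
```

The accuracy/overhead tradeoff they mention shows up here as the choice of sampling interval and window: sample faster and detect stalls sooner, at the cost of more RDMA traffic against the target's NIC.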