SelfManage Workshop at FCRC03
E. Lassettre et al., Dynamic Surge Protection...
deals with trying to predict workload based on both short-term and long-term
forecasts. The goal is to use this to "protect" against surges by
predicting them a little in advance. Pedram, you may find this useful given
your rejuvenation research. Keith/Jim, you may find it of interest as well.
Things of note:
- handles some surges pretty well, but to handle the "CNN surge" they also
  had to tune the DB to do group commit
- didn't seem to be familiar with other academic literature on resource
  provisioning from SOSP 2001
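The prediction idea, as I understood it, is roughly "compare a fast-reacting forecast against a slow baseline." Here's a toy sketch of that idea; the smoothing parameters, threshold, and function names are my own inventions, not the paper's algorithm:

```python
# Toy sketch: blend a short-term and a long-term forecast to flag an
# impending surge. NOT the paper's actual method; alpha values and the
# 1.5x threshold are made-up illustrative parameters.

def ewma_forecast(history, alpha):
    """One-step-ahead forecast via exponential smoothing."""
    f = history[0]
    for x in history[1:]:
        f = alpha * x + (1 - alpha) * f
    return f

def surge_predicted(history, threshold=1.5):
    short = ewma_forecast(history, alpha=0.8)  # reacts quickly to spikes
    long = ewma_forecast(history, alpha=0.1)   # slow-moving baseline
    # Flag a surge when the short-term forecast far exceeds the baseline.
    return short > threshold * long

reqs = [100, 105, 98, 102, 400, 800]  # requests/sec, sudden spike at end
print(surge_predicted(reqs))  # True: spike dominates the slow baseline
```

The point of keeping two horizons is that the long-term forecast absorbs diurnal drift while the short-term one catches the onset of a surge early enough to provision for it.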
Quantifying the benefits of resource multiplexing in on-demand data centers
What's the benefit to apps of doing resource allocation in a datacenter at
finer-than-node granularity? (I.e., beyond just dynamically assigning and
removing whole nodes from services, also dynamically assigning and removing
fractional node resources to services.)
- assumptions: apps can be assigned to any node (no affinity), and
  resources are homogeneous. This could be sort of true with VM migration.
- they define an "optimal allocation" scheme that is infinitely
  fine-grained in both resource allocation and time (continuous adjustments
  by infinitesimal amounts). "Capacity overhead" is the ratio of the
  "wasted" capacity (since the grain is not so fine in practice) to the
  optimal capacity.
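A toy illustration of what I take "capacity overhead" to mean; the granularity model (round each app's demand up to whole grains) and the example numbers are my own, not the talk's:

```python
import math

# Hedged sketch: overhead of a coarse-grained allocator relative to an
# infinitely fine-grained ("optimal") one. My own model, not the talk's.

def capacity_overhead(demand, grain):
    """Ratio of wasted capacity to optimal capacity when each app's
    demand is rounded up to a whole number of allocation grains."""
    optimal = sum(demand)  # a fluid allocator gives exactly the demand
    allocated = sum(math.ceil(d / grain) * grain for d in demand)
    return (allocated - optimal) / optimal

# Three apps with fractional-node demands (in units of one node).
demands = [0.3, 1.2, 0.6]
print(capacity_overhead(demands, grain=1))     # whole-node allocation
print(capacity_overhead(demands, grain=0.25))  # finer grain, less waste
```

Under this model the overhead shrinks monotonically as the grain gets finer, which is presumably the effect the paper is quantifying.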
- evidently, they assume that the requirements of apps sharing a node add
  up linearly? I.e., there's no performance crosstalk across apps sharing a
  node? According to Yahoo, this is not the case for them.
- do they assume that resource (re)allocation doesn't impact performance at
  all, i.e., that it's free? (Yes, they do.) That's not realistic: in
  practice, reallocation has a cost in app performance, since someone's
  resources have to be used to do the reallocation work. (Think scheduler
  activations.)
- what is the metric for app performance?
- Yahoo doesn't do this because of cross-app performance effects and the
  complication of recovery management. Comment?
- how much of this is just an artifact of the workload? In practice there
  is affinity in any nontrivial system.
- basically, this talk made way too many assumptions that are just wrong in
  practice, and the strength of the result seemed to rest pretty heavily on
  those assumptions.
Nonintrusive remote healing using backdoors
Insight: a system can't be expected to heal itself if it has corrupted
state, etc. So, do remote healing from another system that can
remote-access the target system. Eliminate reliance on target resources,
just like crash-only.
- the remote system must have remote memory access and remote I/O to the
  target, without involving the target system's processors or being
  "visible" to the target. Other architectural principles for backdoors are
  in the paper.
- remote DMA reads/writes via InfiniBand, VIA, etc. (just requires the
  target's NIC to be working, since the NIC does the DMA)
- use RDMA for both remote monitoring and remote repair
- monitoring detects stalls in progress counters
- recovery writes a new remote copy of per-session server state to
  reinstate broken sessions and resume service
- they mention the accuracy vs overhead tradeoff for remote monitoring
- other challenges: fault propagation; making sure the repair doesn't make
  things worse; security
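The stall-detection part of the monitoring could be sketched roughly like this; the RDMA plumbing is omitted (pretend the samples were read out of the target's memory), and the window heuristic is my guess, not the paper's:

```python
# Hedged sketch of stall detection on a progress counter. In the real
# system the samples would be fetched by RDMA reads of the target's
# memory, without involving the target's CPUs; here they're just a list.

def detect_stall(samples, window=3):
    """Return True if the counter failed to advance across `window`
    consecutive monitoring intervals (i.e., window+1 equal samples)."""
    if len(samples) < window + 1:
        return False  # not enough history to judge
    recent = samples[-(window + 1):]
    return all(a == b for a, b in zip(recent, recent[1:]))

# Counter advances, then freezes -> stall detected.
print(detect_stall([10, 14, 19, 19, 19, 19]))  # True
print(detect_stall([10, 14, 19, 25, 31, 36]))  # False
```

The accuracy/overhead tradeoff they mention shows up here as the choice of sampling interval and window: sample faster and detect stalls sooner, at the cost of more RDMA traffic against the target's NIC.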