Do not post, archive or circulate outside of ROC-group.

I talked to Allan Vermeulen, CTO; also got to meet Jeff Bezos briefly.

They deployed a statistical correlation engine, but they roll software too fast for it to keep up.  They roll software really often - sometimes multiple times a day, for specific subfeatures.  "There's no steady-state behavior," because the workload fluctuates so much.  They do "big" rolls at night to minimize impact (see below: some kinds of rollouts have an unavoidable large performance hit while caches re-warm with new data, etc.).  The product they were using was neither fine-grained enough to focus on individual subsystems nor coarse-grained enough to capture "high-level business metrics" at high scale (end-to-end user latency, order rate - it turns out order rate is the most predictable metric!  Even during large spikes, the ordering rate stays predictable).

They have statistical models for the order rate, and deviations from these can by themselves trigger a trouble ticket within 30 seconds.  Law of large numbers at work: some of their other sites aren't big enough to have models this reliable.
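
A minimal sketch of the kind of order-rate check they described (hypothetical names and thresholds - their actual statistical models weren't shared): compare the observed order rate over a short window against the model's prediction, and open a ticket when the deviation is too large.

    def check_order_rate(observed_rate, predicted_rate, tolerance=0.15,
                         file_ticket=print):
        """Flag the order rate if it strays too far from the model's prediction.

        observed_rate  : orders/sec measured over the last ~30-second window
        predicted_rate : the model's expected orders/sec for this time of day
        tolerance      : allowed relative deviation before alarming (assumed value)
        file_ticket    : callable that opens a trouble ticket
        """
        deviation = abs(observed_rate - predicted_rate) / predicted_rate
        if deviation > tolerance:
            file_ticket(f"Order rate {observed_rate:.1f}/s deviates "
                        f"{deviation:.0%} from predicted {predicted_rate:.1f}/s")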

Config problems are their nightmare: I told them the Akamai horror story.  One problem is that latent bugs don't show up till after deployment, so they need fast rollback.  In some cases, decent online statistical analysis would have prevented the problem.  Examples:

"Config changes should be treated like SW changes" - or even more carefully - because it could enable some new code path to be traversed.  Or, when a change in one area induces a problem in a downstream system, eg increase the rate at which something sends data; sometime much later, the recipient can't handle a spike and the combination of the spike and the increased rate kills it.

Another category of failures occurs when a single event triggers a really large amount of activity in a short time:

  1. Ex: a wrong page-data file resulted in one rogue server telling all clients to invalidate their caches; then a reconnect storm happened when clients actually did this and needed to refresh.  Seems there should be a way to bound how fast a particular behavior can fluctuate; e.g., if the server told every 10th client at a time, rather than every client at once, to invalidate its cache, the invalidation would get done in a "rolling" manner (see the sketch after this list).
  2. When they push new content out, caches have to get warmed up; this takes a long time (hours), so they have to do it at night.  If they had some system that allowed online upgrade/repartitioning (DStore does), that would be better; but given their current architecture of a DB with caches in the appservers in front of it, there's not much you can do.
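
A minimal sketch of the rolling-invalidation idea from item 1 (hypothetical interface; their actual server/client protocol wasn't described at this level): instead of broadcasting "invalidate now" to every client, the server tells one cohort at a time and pauses between cohorts, bounding how fast the reconnect/refresh storm can build.

    import time

    def rolling_invalidate(clients, batch_fraction=0.10, interval_s=30, send=print):
        """Tell clients to invalidate their caches in cohorts, not all at once.

        clients        : list of connected client handles/ids
        batch_fraction : fraction of clients invalidated per round (every 10th ~= 0.10)
        interval_s     : pause between rounds, bounding the refresh-storm rate
        send           : callable(client) that delivers the invalidation message
        """
        batch_size = max(1, int(len(clients) * batch_fraction))
        for start in range(0, len(clients), batch_size):
            for client in clients[start:start + batch_size]:
                send(client)           # each client then refreshes from the backing store
            time.sleep(interval_s)     # let this cohort's reconnects drain first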

They do databases with front-ended caching (compare to Yahoo, which does dedicated simple stores and separate offline relational-like analysis tools).  They rely on the DB to be the "golden" store, but mostly it's for durability.  It might be useful to have something that factors out durability.  They have a class of data where they have to do interesting read-only queries very often, but writes are very infrequent.  For example, one- or two-key access, or a small subset of read-only SQL (disallow joins, disallow "interesting" queries, etc.).
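
A minimal sketch of the "restricted read-only queries" idea, assuming a hypothetical interface (nothing like this was shown to us): the front end exposes only one- and two-key lookups, and the relational DB behind it stays the durable golden copy, touched rarely on the write path.

    class ReadMostlyStore:
        """Read-optimized front end; the relational DB remains the golden copy.

        Only keyed lookups are exposed - no joins, no "interesting" queries -
        so online performance and recovery stay predictable.
        """
        def __init__(self, golden_db):
            self.golden_db = golden_db   # durable backing store (hypothetical handle)
            self.cache = {}              # in-memory copy keyed by (key1, key2)

        def get(self, key1, key2=None):
            """One- or two-key read; falls back to the golden store on a miss."""
            k = (key1, key2)
            if k not in self.cache:
                self.cache[k] = self.golden_db.lookup(key1, key2)
            return self.cache[k]

        def put(self, key1, key2, value):
            """Infrequent write path: durable write first, then refresh the cache."""
            self.golden_db.write(key1, key2, value)
            self.cache[(key1, key2)] = value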

Their catalog changes hundreds of times a second, because it aggregates data from many retailers.  (In fact, they're spinning off a technology division to be a platform supplier for e-tailers.)  Most online queries against the catalog are read-mostly and fairly simple, but it has to handle a pretty high write rate (though still a lot lower than the read rate).  Also, a lot of the fancy analysis they need to do is offline.  We talked about a 2-layer system, where the base layer (simple queries, fast performance and recovery, etc.) is used online, and a higher layer (which uses the lower layer as its access methods) handles offline analysis tasks.
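
A sketch of how the 2-layer split might look (our speculation from the conversation, not something Amazon showed us): the base layer answers simple keyed reads/writes online, and the analysis layer runs offline passes using only the base layer's access methods.

    class BaseLayer:
        """Online layer: simple keyed reads/writes, fast performance and recovery."""
        def __init__(self):
            self.records = {}

        def read(self, key):
            return self.records.get(key)

        def write(self, key, value):
            self.records[key] = value

        def scan(self):
            """Access method the offline layer builds on."""
            return iter(self.records.items())

    class AnalysisLayer:
        """Offline layer: fancier analysis built on the base layer's access methods."""
        def __init__(self, base):
            self.base = base

        def aggregate(self, group_fn, reduce_fn):
            """Example offline pass: group-by plus reduction over a full scan."""
            groups = {}
            for key, value in self.base.scan():
                groups.setdefault(group_fn(key, value), []).append(value)
            return {g: reduce_fn(vs) for g, vs in groups.items()}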

They're interested in DStore - their catalog server already uses Berkeley DB! - and would like to have an intern work on this.

Also interested in applying Pinpoint; they do collect a lot of stats now, and are deploying multicast for connecting them, which makes the incremental cost very low, but the collection hooks are ad hoc.  Also, they don't do online monitoring of it now; instead, when something's wrong they start doing ad-hoc inspections/analyses of the data to localize the problem.  Pinpoint could easily automate this.
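
A toy sketch of the kind of automation Pinpoint would provide here (hypothetical metric layout, and not Pinpoint's actual algorithm, which correlates failures with request paths): compare each component's current stats against its recent baseline and rank components by deviation to localize the likely culprit, instead of inspecting the data by hand.

    from statistics import mean, stdev

    def localize(baselines, current, min_samples=10, threshold=3.0):
        """Rank components by how far their current metric deviates from baseline.

        baselines : dict of component -> list of historical metric samples
        current   : dict of component -> latest metric value
        Returns components whose z-score exceeds `threshold`, worst first.
        """
        suspects = []
        for component, history in baselines.items():
            if len(history) < min_samples or component not in current:
                continue
            mu, sigma = mean(history), stdev(history)
            if sigma == 0:
                continue
            z = abs(current[component] - mu) / sigma
            if z > threshold:
                suspects.append((z, component))
        return [component for _, component in sorted(suspects, reverse=True)]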

High-level messages:

  1. Configuration is evil - config changes are hard to test and can enable new code paths
  2. Changes that cause a "big" change all at once are evil - should structure components to do these kinds of things incrementally
  3. Separate online from offline requirements - can you use common tools for both?