Also: www.autonomic-conference.org (June 13-16, 2005 @ Seattle)

Intel Autonomic Seminar
-----------------------
speakers:
8:45 – 9:30    Dr. Jeffrey Kephart, IBM Research, "Research Challenges of Autonomic Computing"
9:45 – 10:30   Prof. Ken Birman, Cornell Univ., "Bringing Autonomic Technology Into Large Data Centers"
10:30 – 11:15  Prof. Greg Ganger, CMU, "Self-* Storage"
11:15 – 12:00  Prof. Jeffrey Chase, Duke University, "Revisiting the Invisible Hand: Autonomic Resource Diffusion for Federated Load Management"
1:00 – 1:45    Prof. Karsten Schwan, Georgia Tech, "Resource- and Needs-Aware Autonomic Systems"
1:45 – 2:30    Dr. Moises Goldszmidt, HP Labs, "Towards automated diagnosis and control of IT systems"
2:30 – 3:15    Dr. Yi-Min Wang, Microsoft Research, "STRIDER: A New Approach to Configuration and Security Management"
3:30 – 4:00    Emre Kiciman, Stanford Univ., "Combining Statistical Monitoring and Fast Recovery for Self Management"
4:00 – 5:00    Panel and Q&A
               Moderator: Dr. Milan Milenkovic, Intel Corporation
               Panel presentations on research vectors and hard problems in autonomic computing,
               considering the platform support aspect.
               Anders Vinberg, Ken Birman, Jeff Chase, Moises Goldszmidt, and Jeff Kephart

=========

8:30 – 8:45 Welcome address: Bill Sayles & Raj Yavatkar

* purpose of this seminar: to bring together researchers in a variety of disciplines,
  to see how we can work together... and also so that Intel can learn...
* "Platform Manageability Problem"
  - complexity of infrastructure; day-to-day issues require manual intervention
  - ...
* Goal: make self-managing infrastructure
  - move intelligence from the console to the enterprise
  - automate the mechanics (rather than replace the operator)
  - intelligence, control, visibility
* Research Areas
  - large-scale agent-based systems, resource provisioning, self-healing, self-tuning,
    self-configuration, application of ML techniques, anomaly detection, fault mgt.

==============================================

8:45 – 9:30 Dr. Jeffrey Kephart, IBM Research
"Research Challenges of Autonomic Computing"

* what is autonomic computing? (plagiarized Armando Fox, and used "googlism.com" for a definition)
  "Computing systems that manage themselves in accordance w/ high-level objectives from humans"
  - systems are complex and fragile
* Towards an autonomic computing system
  1. manual
  2. instrument & monitor
  3. automate analysis
  4. closed loop
  5. closed loop w/ business priorities
* Autonomic Element Behaviors
  - honor your commitments
  - don't accept commitments you cannot fulfill
* alphaWorks has an "autonomic mgr toolset": a toolkit approach to building autonomic elements
* Architecture Challenges
  - how to coordinate multiple threads of activity (where is the 'ego')?
  - how to detect/resolve conflicts arising from internal and external decisions, directives, and policies
  - system-level: enable more flexible service-oriented patterns of interaction;
    representation of requests, policies, etc.
* Human-System interface challenges
  - how to express high-level goals and objectives to systems; how to make them expressive,
    yet avoid specification errors and keep them easy for people to use
* Challenge: Policy
  - deriving lower-level policies; deriving actions from goals (e.g., planning, optimization, etc.)
  - utility functions for making trade-offs (a toy sketch follows at the end of this talk's notes)
* Challenge: Planning
  - more than just planning: there's a whole planning engine, with a scheduler, repositories of
    metadata, partial plans and complete plans, and an analyzer
* slide on ROC, crash-only software, and microreboots
* Challenge: Negotiation
  - methods for expressing preferences, plus a theoretical foundation and algorithms for
    negotiation; when to apply bilateral, multilateral, supply-chain negotiation, etc.
* Challenge: Emergent Behavior
  - develop a theory of interacting feedback loops

Questions
- q: what's the appropriate granularity / level for an autonomic computing element?
  a: unclear. good q for the panel. is it the OS? a machine? a chip?
  a: we should figure out what the criteria are for choosing.
- q: (ken birman) what's the status of using ML for problem detection?
  a: at IBM most ML has been focused ...
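A toy sketch of the utility-function idea mentioned under "Challenge: Policy" (my own
illustration, not from the talk; the workload names, utility shapes, and numbers are all
made up): each workload exposes a utility as a function of the resources it receives,
and a controller picks the allocation that maximizes total utility.

    # hypothetical example: split N identical servers between two workloads
    # by maximizing the sum of their (assumed) utility functions
    def web_utility(servers):        # diminishing returns for the web tier
        return 10 * (1 - 0.5 ** servers)

    def batch_utility(servers):      # batch throughput valued roughly linearly
        return 1.5 * servers

    def best_split(total_servers):
        # exhaustive search is fine at this toy scale
        return max(range(total_servers + 1),
                   key=lambda web: web_utility(web) + batch_utility(total_servers - web))

    if __name__ == "__main__":
        web = best_split(10)
        print(f"give {web} servers to web, {10 - web} to batch")   # -> 2 and 8

The point is just that trade-offs between competing objectives reduce to comparing scalar
utilities, which is what makes high-level goals machine-actionable.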
===================================================

9:45 – 10:30 Prof. Ken Birman, Cornell Univ.
"Bringing Autonomic Technology Into Large Data Centers"

[not 100% sure he considers this to be autonomic computing, or that he believes in it]

* q: peek under the covers of the toughest, most powerful systems (e.g., google, amazon).
  can we discern a research agenda? (Werner Vogels is Ken's ex-student, now heading
  systems research at Amazon)
* wireless sensor networks are driving many of the same research questions
  (autonomy, adaptation, etc.)
* a glimpse inside amazon's services & pub-sub architecture
  - hierarchy of sets: a set of data centers, w/ sets of services, partitions, programs, machines...
* Jim Gray: A RAPS of RACS
  - RAPS: a reliable array of partitioned services
  - RACS: a reliable array of cluster-structured server processes
* autonomic technology needs:
  - programs will need a way to find the "members" of a service and apply the partitioning
    function to find contacts within a desired partition
  - dynamic resource mgt, adaptation of RACS size & mapping to hw
  - fault detection
  - within a RACS we also need to replicate data for scalability
* Scalability makes this hard!
  - membership, resource mgt, comm, fault-tolerance, consistency
  - we have "classic" solutions to all the above, but no one has really engaged the scalability
    issues. e.g., no one has looked at multicast to thousands of overlapping groups
    simultaneously, w/ each group having 3-4 members (this is what, e.g., amazon needs).
* quicksilver project
  - we're building a scalable infrastructure addressing these needs
  - consists of: some existing tech (astrolabe, gossip "repair" protocols) plus new tech
* gossip 101 (a toy simulation follows at the end of this talk's notes)
  - push epidemic for spreading info (also push-pull epidemic)
  - scalability: a participant's load is independent of system size; network load is linear
    in system size; data spreads in log(system size) * inter-gossip-interval time
* we can gossip many ways
  - about membership, data replication, failures, new members, updates, ...
  - gossip & multicast (gossip about what messages you've gotten, so that everyone eventually
    finds out about msgs they've missed)
* ex. reliable multicast
  - as you add more nodes to a multicast, the performance drops. 30 nodes works great;
    90+ nodes slows miserably. gets worse faster as you perturb nodes (e.g., some are a bit slower).
  - if you use gossip for reliability, it scales.
* Kelips (gossip-based DHT)
  - per-participant loads are constant
  - space required grows in O(sqrt(N)) [-- what's N?]
  - gives data in 1 hop, resistant to churn
* [database w/ gossip]
  - tables are divided up among nodes
  - small sets of nodes are responsible for some of the data in a table
  - build up a hierarchy using a P2P protocol so that all the small sets of nodes are connected
* quicksilver current work
  - goal is to offer scalable support for pub-sub
* also looking at ways to map sensor network problems onto the same architecture
* Summary
  - focus centers on scalability
  - autonomic in the sense that they believe this is an important tool for robustness
  - might be easier to use these tools for newer systems that you can migrate to pub-sub
    (rather than a 30-yr-old legacy system)
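A toy push-gossip simulation of the "gossip 101" point above (my own illustration, not
Birman's code; the node counts are arbitrary): one node starts with a rumor, and each
round every informed node pushes it to one random peer. The number of rounds needed to
reach everyone grows roughly logarithmically in the system size.

    import math
    import random

    def rounds_to_spread(n_nodes, seed=0):
        random.seed(seed)
        informed = {0}                 # node 0 starts with the rumor
        rounds = 0
        while len(informed) < n_nodes:
            # each currently-informed node pushes to one random peer this round
            for _ in list(informed):
                informed.add(random.randrange(n_nodes))
            rounds += 1
        return rounds

    if __name__ == "__main__":
        for n in (100, 1000, 10000):
            print(n, "nodes:", rounds_to_spread(n), "rounds;  log2(n) ~", round(math.log2(n), 1))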
===================================================

10:30 – 11:15 Prof. Greg Ganger, CMU
"Self-* Storage"

[10-15 minutes on storage, then the rest of the time is a random walk through
"what we think we think we know"]

* administration needs to be simplified
  - 1 human admin per 1-10 TB
  - scary, since 1 disk holds .5 TB (1 admin per 20 disks!)
* where is industry going?
  - trying to build tools to manage existing stuff
  - not going to work. way too complicated.
  - academics are going back to the drawing board to design manageability into the architecture
* building large storage: going to offer 1 PB for CMU researchers
  - goal: research into administration and survivability
  - (note: they don't have nearly enough people to admin at the current rate of ~1 admin per TB,
    so they have to succeed!)
* Management challenges & issues
  - data protection: mistake recovery, device failure recovery, disaster tolerance
  - capacity planning and acquisitions: initial provisioning and expansion
  - tuning and load balancing: dataset placement, device params, etc.
  - problem diagnosis and healing (disaster-strikes scenario)
  - monitoring and record keeping
  [aside: he visited Yahoo's data center; the admin walking him around stopped and said,
   "the temperature doesn't feel right here", and made the operators look into it; sure
   enough, the cooling pipe there wasn't working...]
* Self-* storage aspirations
  - tape-less data protection
  - automated perf tuning
  - automated integration of new components
  - automated problem diagnosis
  - automated healing to the extent possible
  - no rushed or complex repairs by humans; restore data first, so the system is less
    vulnerable to human error
* toward Ursa Major (1 PB)
  - building Ursa Minor first, at 10% of the size
* Self-* storage architecture
  - workers tune themselves
  - a management hierarchy works on the broader goals of the system
* Some early realizations
  - versatility is crucial
    - "anything, anywhere": important for easier fault-tolerance, bottleneck avoidance, tuning
    - many-purposed hardware requires less expertise when deciding what to purchase,
      what to install & maintain, and what to reuse
  - want heavy doses of fault-tolerance
    - not enough people to admin, much less QA!
  - monitor and record everything
    - collecting it, recalling it, remembering activity traces
  - decision-making mechanisms
    - assertion: workload characterization and device modeling *will* remain *open* problems;
      don't depend on them working
    - one other option: create classes of devices automatically, then use one device's behavior
      to predict the behavior of others in its class (rather than explicitly modeling each
      device) [-- see the sketch right after this list]
  - complaint-driven tuning
    - bad at generating policies that describe what you want a priori
    - but great at complaining when it's not doing what you want
  - eyes on the prize (surprising timesinks)
    - physical planning of the datacenter
    - chilled water "issues" in the datacenter
    - boot sequencing of everything
    - writing "sensor access" drivers
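A tiny sketch of the "classes of devices" idea (my own illustration, not Ganger's; the
class names, features, and numbers are invented): group devices by observed performance,
then predict a new device's behavior from its class centroid instead of building an
explicit model of it.

    # made-up profiles: (random-read IOPS, sequential MB/s) per device class
    class_centroids = {
        "slow-seek": (117.5, 57.5),    # average of a few measured slow disks
        "fast-seek": (302.5, 71.0),    # average of a few measured fast disks
    }

    def distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def classify(sample, centroids):
        # nearest-centroid: pick the class whose average profile is closest
        return min(centroids, key=lambda name: distance(sample, centroids[name]))

    if __name__ == "__main__":
        new_disk = (305, 68)           # a few quick probes of an unmodeled device
        cls = classify(new_disk, class_centroids)
        print(f"new disk looks {cls}; predict behavior ~{class_centroids[cls]}")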
* academic initiative
  - CITA: Center for IT Automation, an NSF ERC proposal
  - broad, long-term vision for up to 11 yrs of big NSF support
  - Self-* generalized to the entire IT infrastructure
  - being organized by Garth Gibson, Greg Ganger, Jeff Chase & others
  - broad-based academic activity led by CMU

===================================================

11:15 – 12:00 Prof. Jeffrey Chase, Duke University
"Revisiting the Invisible Hand: Autonomic Resource Diffusion for Federated Load Management"

[market-based approach to resource provisioning]

* "Utility OS" (autonomic?)
* Adaptive Resource Management
  - use SLAs to define performance / dependability targets for customer workloads
  - requires mechanisms for isolation & differentiated service, hard partitions
    (or virtual machines), and interposed request scheduling
  - requires *policies* for
    1. provisioning (how much?): demand estimation (local) and resource arbitration (global)
    2. placement or assignment (where?)
* Outline
  - SHARP framework: XML resource contracts diffuse through a network of brokers
  - use *credits* as currency, distributed by global policy
  - actors bid credits for resources to match the load / demand curve (based on a utility
    function, etc.)
  - approximate utility as bid vectors, w/ brokers to aggregate demand / value
  - distribute bids across suppliers according to supplier price
  [lots of questions on how robust credit-based resource scheduling is... the answer basically
   boils down to "it's not a full economy": all credits revert to their original owner once the
   owner releases the resource it was using, so we avoid the problems of real-life economies
   (together w/ the fact that there appears to be an assumption of central control over the
   distribution of credits in the first place). I'm not sure this solves all problems, though,
   e.g., deadlock-related issues...]
* Leases v. Tickets
  - leases are "hard" contracts; only the authority can lease
  - brokers/agents give tickets
  - a service mgr requests resources from an agent; the agent gives a ticket; the service then
    takes the ticket to the authority (site) to get a lease and use the resources. note that
    tickets are not a guarantee, so agents can overbook (like airlines, etc.)
* division of knowledge & function
  - service mgr knows the application
  - agent/broker guesses global status (availability of resources, what kind, how much, where,
    proximity, etc.)
  - authority knows the resources
* the agent is an arbiter of resources
  - goal: to sell to the highest bidder
* The Problem of Currency
  - the problem is recycling $$; the credit system avoids that

Q: where does system policy enter the picture? (e.g., what if you really want to make sure
   that you never give out single nodes)
A: policy is pluggable in the agents, and resources have control over which agents they empower.
Q: have you considered bidding for contracts (instead of resources)?
A: ...

===================================================

12:00 – 1:00 Lunch

- spoke w/ Jeff Chase about the deadlock issue, and he agreed that this is a serious problem.
  Auctions work well when you have one commodity, but not when you have to grab more than one.
  The solution is to have agents bundle resources and auction off sets of resources that will
  satisfy a client's constraints (a toy sketch of that idea follows).
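A toy sketch of the bundling idea from that lunch conversation (my own illustration, not
SHARP's actual protocol; the resource names, bids, and amounts are made up): an agent auctions
whole bundles, so a client either gets everything it needs to run or nothing, avoiding the
hold-one-resource-while-waiting-for-another deadlock pattern.

    available = {"cpu": 8, "disk": 4}

    # each request: (client, bid in credits, the bundle it needs to run at all)
    requests = [
        ("analytics", 12, {"cpu": 4, "disk": 2}),
        ("web",        9, {"cpu": 2, "disk": 1}),
        ("batch",      7, {"cpu": 6, "disk": 3}),
    ]

    def fits(bundle, pool):
        return all(pool.get(r, 0) >= amt for r, amt in bundle.items())

    def allocate(requests, pool):
        granted = []
        # consider the highest bid first; grant a request only if its whole bundle fits
        for client, bid, bundle in sorted(requests, key=lambda r: -r[1]):
            if fits(bundle, pool):
                for r, amt in bundle.items():
                    pool[r] -= amt
                granted.append(client)
        return granted

    if __name__ == "__main__":
        print(allocate(requests, dict(available)))   # ['analytics', 'web']; batch waits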
===================================================

1:00 – 1:45 Prof. Karsten Schwan, Georgia Tech
"Resource- and Needs-Aware Autonomic Systems"

* what do people need? people want/need systems that are information-centric: they need data
  reduction and fusion, information creation and manipulation
  - implications for autonomic services: the world is not 3-tier web services!
  - services consist of data producers, consumers, and transformers
  - deployed dynamically, cooperating in a heterogeneous environment
  - application-dependent
* necessary elements of solution approaches
  1. middleware / app agility (the mishmash of middleware that runs the system)
  2. middleware must be needs- and resource-aware
  3. ...
* Solution Approach 1: Agile Information Flows
  - services composed via pub-sub-based agile flows
  - agility: dynamic handler deployment / code modification and adaptation, dynamic overlays
  - they've also done "agility" for gridftp, rmi, soap (not just pub-sub)

===================================================

1:45 – 2:30 Dr. Moises Goldszmidt, HP Labs
"Towards automated diagnosis and control of IT systems"

* the talk is about: it is possible to build models on the fly and do something useful with them
* we have lots of data
  - just look at CPU utilization... (doesn't work! no one element is predictive of high
    round-trip time)
* what are the causes of violations?
* can we predict SLO violations?
* need something that correlates system metrics and SLO state. how?
  - we can now munge the available data in milliseconds; we couldn't do that before...
[...interesting talk, I got caught up in listening...]

===================================================

2:30 – 3:15 Dr. Yi-Min Wang, Microsoft Research
"STRIDER: A New Approach to Configuration and Security Management"

[great talk on diagnosing windows problems, malware, etc.]

===================================================

3:30 – 4:00 Emre Kiciman, Stanford Univ.
"Combining Statistical Monitoring and Fast Recovery for Self Management"

===================================================

4:00 – 5:00 Panel and Q&A
Moderator: Dr. Milan Milenkovic, Intel Corporation
Panel presentations on research vectors and hard problems in autonomic computing,
considering the platform support aspect.

Anders Vinberg

* model-based management
  - system definition model (SDM)
  - models come from initial development (visual studio), mgt tools, discovery,
    operational monitoring, error reporting
  - the model lives in a manifest and travels with the application
* moving to webservices-based mgt protocols
  - defines protocols, not the mgt model
  - instrumentation, events, settings, actions, scripting
* self-* systems
  - must have knowledge of themselves and their environment (self-* includes self-aware)
  - long-term DSI direction: a model on every system
* mgt is about reconciling what the system "ought to be" and what "it is"
  - models say what the sys "ought to be"; monitoring says what "it is"
  - propagate the two views of the system towards each other and try to make them match
  - put mgt logic at different points in the system: some things will be done inside the
    system, others closer to the operator or developer
  - (a minimal sketch of this reconciliation loop follows at the end of this section)

  ------------------------------ "oughtness" ------------------------------>
  DEVELOPER/IT ADMIN -> SDM -> SDM SERVICE -> CENTRAL MGR -> REMOTE MANAGER -> LOCAL SDM -> MANAGED SYSTEM
  <------------------------------- "is-ness" -------------------------------

* autonomic computing wants to put things at the local layer, near the managed system.
  it's really a difference in degree, not a difference in kind, as we move...
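A minimal sketch of that "ought to be" vs. "is" reconciliation loop (my own illustration;
the setting names and values are invented, and this is not the actual SDM/DSI API): the
model supplies the desired state, monitoring supplies the observed state, and management
logic emits actions that drive the two toward each other.

    # desired state, as a model might declare it
    ought_to_be = {"web_replicas": 3, "logging": "enabled", "patch_level": 42}

    def observe():
        # stand-in for real monitoring; returns the "is" view of the system
        return {"web_replicas": 2, "logging": "enabled", "patch_level": 41}

    def reconcile(desired, actual):
        # compute the drift and emit a corrective action for each mismatch
        actions = []
        for key, want in desired.items():
            have = actual.get(key)
            if have != want:
                actions.append(f"set {key}: {have} -> {want}")
        return actions

    if __name__ == "__main__":
        for action in reconcile(ought_to_be, observe()):
            print(action)          # e.g. "set web_replicas: 2 -> 3"

Whether this loop runs near the operator or down on the managed system itself is exactly
the "difference in degree" point above.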
==============================================================

Ken Birman

-> key concern: potential for market failure
   - the autonomic computing premise is intrinsically high risk
   - many of the core "issues" stem from limitations of the dominant platforms and the internet
   - by definition, it is hard to intervene in the legacy tech base
   - and despite the cost, the human element of sys mgt is entrenched
-> dual research challenges
   - yes, pursue new technologies (that improve self-diagnosis and repair, automate
     configuration mgt); funding opportunities are key
   - but also pursue a deployment path: need to identify a rapidly growing market w/ an
     intrinsic need for AC solutions
-> possible AC testbed goals
   1. create a massive, long-lived wireless sensor deployment and open it to researchers
      (e.g., environmental monitoring)
   2. create AC hardware "assists" (wireless support for out-of-band configuration repair,
      or unobtrusive monitoring)
   3. create a national scalability-challenges testbed for deploying/testing new network
      technologies
   4. open the lower levels of the network for AC: enable "underlays" that time-share raw
      routers, links
-> autonomic computing research topics
   1. seek "mainstream" home runs (hurry up and solve a visible problem)
   2. develop out-of-band problem-diagnosis tools
   3. monitor healthy systems to learn "good state", then identify "possible problems" when
      shown an unhealthy state (a toy sketch of this idea closes these notes)
      - instead of trying to say "it's acting strangely, let's reboot it", wait for someone
        to complain, then say "the hypothesis is..."
   4. develop AC network overlays with rapid self-repair after disruption, QoS guarantees
   5. promote an open-source-style (but not GPL) toolkit into which researchers can
      contribute technology

==========================================================

Jeff Chase

-> closing control loops is really interesting, but not something ready to deploy...
   it's a bit scary. is it safe?
-> the low-hanging fruit is in empowering the human experts
   - one great thing is using SLT to filter the masses of data being captured,
     e.g., Yi-Min's STRIDER, Moises' project.

==========================================================

Moises

-> you can already buy closed control loops in products: e.g., oracle 10g, SAP, etc.
-> Ken touched on a great point: funding.
-> the problems aren't going to be solved until we all have experience with large systems
-> the way Moises approached choosing a problem was to pick one that was too "high-risk"
   for the product groups.

==========================================================

Jeff Kephart

-> when are we ready for primetime?
   - there is stuff out there that one can call "self-managed", but it's single-node,
     small-scale systems, not large-scale
-> the impact of AC is already here, with much more to come

==========================================================
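A toy version of Birman's "learn good state from healthy systems" suggestion (my own
illustration; the metrics, values, and threshold are invented): learn a per-metric
mean/stddev envelope from a healthy period, then, when someone complains, report which
metrics sit far outside that envelope as the hypothesis.

    import statistics

    healthy = {   # made-up samples collected while the system was known-good
        "cpu":     [0.30, 0.35, 0.28, 0.33, 0.31],
        "latency": [120, 130, 125, 118, 127],
    }

    def learn(samples):
        return {m: (statistics.mean(v), statistics.stdev(v)) for m, v in samples.items()}

    def hypotheses(observation, model, threshold=3.0):
        # flag metrics more than `threshold` standard deviations from the healthy mean
        odd = []
        for metric, value in observation.items():
            mean, stdev = model[metric]
            if stdev and abs(value - mean) / stdev > threshold:
                odd.append(f"{metric}={value} (healthy ~{mean:.2f})")
        return odd

    if __name__ == "__main__":
        model = learn(healthy)
        print(hypotheses({"cpu": 0.32, "latency": 480}, model))   # latency gets flagged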