Also: www.autonomic-conference.org (June 13-16, 2005 @ Seattle)

Intel Autonomic Seminar
-----------------------
speakers:
8:45 – 9:30    Dr. Jeffrey Kephart, IBM Research, "Research Challenges of Autonomic Computing"
9:45 – 10:30   Prof. Ken Birman, Cornell Univ., "Bringing Autonomic Technology Into Large Data Centers"
10:30 – 11:15  Prof. Greg Ganger, CMU, "Self-* Storage"
11:15 – 12:00  Prof. Jeffrey Chase, Duke University, "Revisiting the Invisible Hand: Autonomic Resource Diffusion for Federated Load Management"
1:00 – 1:45    Prof. Karsten Schwan, Georgia Tech, "Resource- and Needs-Aware Autonomic Systems"
1:45 – 2:30    Dr. Moises Goldszmidt, HP Labs, "Towards automated diagnosis and control of IT systems"
2:30 – 3:15    Dr. Yi-Min Wang, Microsoft Research, "STRIDER: A New Approach to Configuration and Security Management"
3:30 – 4:00    Emre Kiciman, Stanford Univ., "Combining Statistical Monitoring and Fast Recovery for Self Management"
4:00 – 5:00    Panel and Q&A
               Moderator: Dr. Milan Milenkovic, Intel Corporation
               Panel presentations on research vectors and hard problems in autonomic computing,
               considering the platform support aspect.
               Anders Vinberg, Ken Birman, Jeff Chase, Moises Goldszmidt, and Jeff Kephart

=========

8:30 – 8:45 Welcome address: Bill Sayles & Raj Yavatkar

* purpose of this seminar: to bring together researchers in a variety of disciplines,
  to see how we can work together... and also so that Intel can learn...
* "Platform Manageability Problem"
  - complexity of infrastructure; day-to-day issues require manual intervention
  - ...
* Goal: make self-managing infrastructure
  - move intelligence from the console to the enterprise
  - automate the mechanics (rather than replace the operator)
  - intelligence, control, visibility
* Research Areas
  - large-scale agent-based systems, resource provisioning, self-healing, self-tuning,
    self-configuration, application of ML techniques, anomaly detection, fault mgt.

==============================================

8:45 – 9:30 Dr. Jeffrey Kephart, IBM Research
"Research Challenges of Autonomic Computing"

* what is autonomic computing? (plagiarized Armando Fox, and used "googlism.com" for a definition)
  "Computing systems that manage themselves in accordance w/ high-level objectives from humans"
  - systems are complex and fragile
* Towards an autonomic computing system
  1. manual
  2. instrument & monitor
  3. automate analysis
  4. closed loop
  5. closed loop w/ business priorities
* Autonomic Element Behaviors
  - honor your commitments
  - don't accept commitments you cannot fulfill
* alphaWorks has an "autonomic mgr toolset": a toolkit approach to building autonomic elements
* Architecture Challenges
  - how to coordinate multiple threads of activity (where is the 'ego')?
  - how to detect/resolve conflicts arising from internal and external decisions, directives, and policies
  - system-level: enable more flexible service-oriented patterns of interaction;
    representation of requests, policies, etc.
* Human-System interface challenges
  - how to express high-level goals and objectives to systems; how to make them expressive,
    yet avoid specification errors and keep them easy for people to use
* Challenge: Policy
  - deriving lower-level policies; deriving actions from goals (e.g., planning, optimization, etc.)
  - utility functions for making trade-offs (a toy sketch follows at the end of this talk's notes)
* Challenge: Planning
  - more than just planning: there's a whole planning engine, with a scheduler, repositories of
    metadata, partial plans and complete plans, and an analyzer
* slide on ROC, crash-only software, and microreboots
* Challenge: Negotiation
  - methods for expressing preferences, plus a theoretical foundation and algorithms for
    negotiation; when to apply bilateral, multilateral, supply-chain negotiation, etc.
* Challenge: Emergent Behavior
  - develop a theory of interacting feedback loops

Questions
- q: what's the appropriate granularity / level for an autonomic computing element?
  a: unclear. good q for the panel. is it the OS? a machine? a chip?
  a: we should figure out what the criteria are for choosing.
- q: (ken birman) what's the status of using ML for problem detection?
  a: at IBM most ML has been focused ...
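A toy sketch of the utility-function idea mentioned under "Challenge: Policy" (my own
illustration, not from the talk; the workload names, utility shapes, and numbers are all
made up): each workload exposes a utility as a function of the resources it receives,
and a controller picks the allocation that maximizes total utility.

    # hypothetical example: split N identical servers between two workloads
    # by maximizing the sum of their (assumed) utility functions
    def web_utility(servers):        # diminishing returns for the web tier
        return 10 * (1 - 0.5 ** servers)

    def batch_utility(servers):      # batch throughput valued roughly linearly
        return 1.5 * servers

    def best_split(total_servers):
        # exhaustive search is fine at this toy scale
        return max(range(total_servers + 1),
                   key=lambda web: web_utility(web) + batch_utility(total_servers - web))

    if __name__ == "__main__":
        web = best_split(10)
        print(f"give {web} servers to web, {10 - web} to batch")   # -> 2 and 8

The point is just that trade-offs between competing objectives reduce to comparing scalar
utilities, which is what makes high-level goals machine-actionable.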
===================================================

9:45 – 10:30 Prof. Ken Birman, Cornell Univ.
"Bringing Autonomic Technology Into Large Data Centers"

[not 100% sure he considers this to be autonomic computing, or that he believes in it]

* q: peek under the covers of the toughest, most powerful systems (e.g., google, amazon).
  can we discern a research agenda? (Werner Vogels is Ken's ex-student, now heading
  systems research at Amazon)
* wireless sensor networks are driving many of the same research questions
  (autonomy, adaptation, etc.)
* a glimpse inside amazon's services & pub-sub architecture
  - hierarchy of sets: a set of data centers, w/ sets of services, partitions, programs, machines...
* Jim Gray: A RAPS of RACS
  - RAPS: a reliable array of partitioned services
  - RACS: a reliable array of cluster-structured server processes
* autonomic technology needs:
  - programs will need a way to find the "members" of a service and apply the partitioning
    function to find contacts within a desired partition
  - dynamic resource mgt, adaptation of RACS size & mapping to hw
  - fault detection
  - within a RACS we also need to replicate data for scalability
* Scalability makes this hard!
  - membership, resource mgt, comm, fault-tolerance, consistency
  - we have "classic" solutions to all the above, but no one has really engaged the scalability
    issues. e.g., no one has looked at multicast to thousands of overlapping groups
    simultaneously, w/ each group having 3-4 members (this is what, e.g., amazon needs).
* quicksilver project
  - we're building a scalable infrastructure addressing these needs
  - consists of: some existing tech (astrolabe, gossip "repair" protocols) plus new tech
* gossip 101 (a toy simulation follows at the end of this talk's notes)
  - push epidemic for spreading info (also push-pull epidemic)
  - scalability: a participant's load is independent of system size; network load is linear
    in system size; data spreads in log(system size) * inter-gossip-interval time
* we can gossip many ways
  - about membership, data replication, failures, new members, updates, ...
  - gossip & multicast (gossip about what messages you've gotten, so that everyone eventually
    finds out about msgs they've missed)
* ex. reliable multicast
  - as you add more nodes to a multicast, the performance drops. 30 nodes works great;
    90+ nodes slows miserably. gets worse faster as you perturb nodes (e.g., some are a bit slower).
  - if you use gossip for reliability, it scales.
* Kelips (gossip-based DHT)
  - per-participant loads are constant
  - space required grows in O(sqrt(N)) [-- what's N?]
  - gives data in 1 hop, resistant to churn
* [database w/ gossip]
  - tables are divided up among nodes
  - small sets of nodes are responsible for some of the data in a table
  - build up a hierarchy using a P2P protocol so that all the small sets of nodes are connected
* quicksilver current work
  - goal is to offer scalable support for pub-sub
* also looking at ways to map sensor network problems onto the same architecture
* Summary
  - focus centers on scalability
  - autonomic in the sense that they believe this is an important tool for robustness
  - might be easier to use these tools for newer systems that you can migrate to pub-sub
    (rather than a 30-yr-old legacy system)
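A toy push-gossip simulation of the "gossip 101" point above (my own illustration, not
Birman's code; the node counts are arbitrary): one node starts with a rumor, and each
round every informed node pushes it to one random peer. The number of rounds needed to
reach everyone grows roughly logarithmically in the system size.

    import math
    import random

    def rounds_to_spread(n_nodes, seed=0):
        random.seed(seed)
        informed = {0}                 # node 0 starts with the rumor
        rounds = 0
        while len(informed) < n_nodes:
            # each currently-informed node pushes to one random peer this round
            for _ in list(informed):
                informed.add(random.randrange(n_nodes))
            rounds += 1
        return rounds

    if __name__ == "__main__":
        for n in (100, 1000, 10000):
            print(n, "nodes:", rounds_to_spread(n), "rounds;  log2(n) ~", round(math.log2(n), 1))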
===================================================

10:30 – 11:15 Prof. Greg Ganger, CMU
"Self-* Storage"

[10-15 minutes on storage, then the rest of the time is a random walk through
"what we think we think we know"]

* administration needs to be simplified
  - 1 human admin per 1-10 TB
  - scary, since 1 disk holds .5 TB (1 admin per 20 disks!)
* where is industry going?
  - trying to build tools to manage existing stuff
  - not going to work. way too complicated.
  - academics are going back to the drawing board to design manageability into the architecture
* building large storage: going to offer 1 PB for CMU researchers
  - goal: research into administration and survivability
  - (note: they don't have nearly enough people to admin at the current rate of ~1 admin per TB,
    so they have to succeed!)
* Management challenges & issues
  - data protection: mistake recovery, device failure recovery, disaster tolerance
  - capacity planning and acquisitions: initial provisioning and expansion
  - tuning and load balancing: dataset placement, device params, etc.
  - problem diagnosis and healing (disaster-strikes scenario)
  - monitoring and record keeping
  [aside: he visited Yahoo's data center; the admin walking him around stopped and said,
   "the temperature doesn't feel right here", and made the operators look into it; sure
   enough, the cooling pipe there wasn't working...]
* Self-* storage aspirations
  - tape-less data protection
  - automated perf tuning
  - automated integration of new components
  - automated problem diagnosis
  - automated healing to the extent possible
  - no rushed or complex repairs by humans; restore data first, so the system is less
    vulnerable to human error
* toward Ursa Major (1 PB)
  - building Ursa Minor first, at 10% of the size
* Self-* storage architecture
  - workers tune themselves
  - a management hierarchy works on the broader goals of the system
* Some early realizations
  - versatility is crucial
    - "anything, anywhere": important for easier fault-tolerance, bottleneck avoidance, tuning
    - many-purposed hardware requires less expertise when deciding what to purchase,
      what to install & maintain, and what to reuse
  - want heavy doses of fault-tolerance
    - not enough people to admin, much less QA!
  - monitor and record everything
    - collecting it, recalling it, remembering activity traces
  - decision-making mechanisms
    - assertion: workload characterization and device modeling *will* remain *open* problems;
      don't depend on them working
    - one other option: create classes of devices automatically, then use one device's behavior
      to predict the behavior of others in its class (rather than explicitly modeling each
      device) [-- see the sketch right after this list]
  - complaint-driven tuning
    - bad at generating policies that describe what you want a priori
    - but great at complaining when it's not doing what you want
  - eyes on the prize (surprising timesinks)
    - physical planning of the datacenter
    - chilled water "issues" in the datacenter
    - boot sequencing of everything
    - writing "sensor access" drivers
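A tiny sketch of the "classes of devices" idea (my own illustration, not Ganger's; the
class names, features, and numbers are invented): group devices by observed performance,
then predict a new device's behavior from its class centroid instead of building an
explicit model of it.

    # made-up profiles: (random-read IOPS, sequential MB/s) per device class
    class_centroids = {
        "slow-seek": (117.5, 57.5),    # average of a few measured slow disks
        "fast-seek": (302.5, 71.0),    # average of a few measured fast disks
    }

    def distance(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    def classify(sample, centroids):
        # nearest-centroid: pick the class whose average profile is closest
        return min(centroids, key=lambda name: distance(sample, centroids[name]))

    if __name__ == "__main__":
        new_disk = (305, 68)           # a few quick probes of an unmodeled device
        cls = classify(new_disk, class_centroids)
        print(f"new disk looks {cls}; predict behavior ~{class_centroids[cls]}")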
* academic initiative
  - CITA: Center for IT Automation, an NSF ERC proposal
  - broad, long-term vision for up to 11 yrs of big NSF support
  - Self-* generalized to the entire IT infrastructure
  - being organized by Garth Gibson, Greg Ganger, Jeff Chase & others
  - broad-based academic activity led by CMU

===================================================

11:15 – 12:00 Prof. Jeffrey Chase, Duke University
"Revisiting the Invisible Hand: Autonomic Resource Diffusion for Federated Load Management"

[market-based approach to resource provisioning]

* "Utility OS" (autonomic?)
* Adaptive Resource Management
  - use SLAs to define performance / dependability targets for customer workloads
  - requires mechanisms for isolation & differentiated service, hard partitions
    (or virtual machines), and interposed request scheduling
  - requires *policies* for
    1. provisioning (how much?): demand estimation (local) and resource arbitration (global)
    2. placement or assignment (where?)
* Outline
  - SHARP framework: XML resource contracts diffuse through a network of brokers
  - use *credits* as currency, distributed by global policy
  - actors bid credits for resources to match the load / demand curve (based on a utility
    function, etc.)
  - approximate utility as bid vectors, w/ brokers to aggregate demand / value
  - distribute bids across suppliers according to supplier price
  [lots of questions on how robust credit-based resource scheduling is... the answer basically
   boils down to "it's not a full economy": all credits revert to their original owner once the
   owner releases the resource it was using, so we avoid the problems of real-life economies
   (together w/ the fact that there appears to be an assumption of central control over the
   distribution of credits in the first place). I'm not sure this solves all problems, though,
   e.g., deadlock-related issues...]
* Leases v. Tickets
  - leases are "hard" contracts; only the authority can lease
  - brokers/agents give tickets
  - a service mgr requests resources from an agent; the agent gives a ticket; the service then
    takes the ticket to the authority (site) to get a lease and use the resources. note that
    tickets are not a guarantee, so agents can overbook (like airlines, etc.)
* division of knowledge & function
  - service mgr knows the application
  - agent/broker guesses global status (availability of resources, what kind, how much, where,
    proximity, etc.)
  - authority knows the resources
* the agent is an arbiter of resources
  - goal: to sell to the highest bidder
* The Problem of Currency
  - the problem is recycling $$; the credit system avoids that

Q: where does system policy enter the picture? (e.g., what if you really want to make sure
   that you never give out single nodes)
A: policy is pluggable in the agents, and resources have control over which agents they empower.
Q: have you considered bidding for contracts (instead of resources)?
A: ...

===================================================

12:00 – 1:00 Lunch

- spoke w/ Jeff Chase about the deadlock issue, and he agreed that this is a serious problem.
  Auctions work well when you have one commodity, but not when you have to grab more than one.
  The solution is to have agents bundle resources and auction off sets of resources that will
  satisfy a client's constraints (a toy sketch of that idea follows).
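A toy sketch of the bundling idea from that lunch conversation (my own illustration, not
SHARP's actual protocol; the resource names, bids, and amounts are made up): an agent auctions
whole bundles, so a client either gets everything it needs to run or nothing, avoiding the
hold-one-resource-while-waiting-for-another deadlock pattern.

    available = {"cpu": 8, "disk": 4}

    # each request: (client, bid in credits, the bundle it needs to run at all)
    requests = [
        ("analytics", 12, {"cpu": 4, "disk": 2}),
        ("web",        9, {"cpu": 2, "disk": 1}),
        ("batch",      7, {"cpu": 6, "disk": 3}),
    ]

    def fits(bundle, pool):
        return all(pool.get(r, 0) >= amt for r, amt in bundle.items())

    def allocate(requests, pool):
        granted = []
        # consider the highest bid first; grant a request only if its whole bundle fits
        for client, bid, bundle in sorted(requests, key=lambda r: -r[1]):
            if fits(bundle, pool):
                for r, amt in bundle.items():
                    pool[r] -= amt
                granted.append(client)
        return granted

    if __name__ == "__main__":
        print(allocate(requests, dict(available)))   # ['analytics', 'web']; batch waits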
===================================================

1:00 – 1:45 Prof. Karsten Schwan, Georgia Tech
"Resource- and Needs-Aware Autonomic Systems"

* what do people need? people want/need systems that are information-centric: they need data
  reduction and fusion, information creation and manipulation
  - implications for autonomic services: the world is not 3-tier web services!
  - services consist of data producers, consumers, and transformers
  - deployed dynamically, cooperating in a heterogeneous environment
  - application-dependent
* necessary elements of solution approaches
  1. middleware / app agility (the mishmash of middleware that runs the system)
  2. middleware must be needs- and resource-aware
  3. ...
* Solution Approach 1: Agile Information Flows
  - services composed via pub-sub-based agile flows
  - agility: dynamic handler deployment / code modification and adaptation, dynamic overlays
  - they've also done "agility" for gridftp, rmi, soap (not just pub-sub)

===================================================

1:45 – 2:30 Dr. Moises Goldszmidt, HP Labs
"Towards automated diagnosis and control of IT systems"

* the talk is about: it is possible to build models on the fly and do something useful with them
* we have lots of data
  - just look at CPU utilization... (doesn't work! no one element is predictive of high
    round-trip time)
* what are the causes of violations?
* can we predict SLO violations?
* need something that correlates system metrics and SLO state. how?
  - we can now munge the available data in milliseconds; we couldn't do that before...
[...interesting talk, I got caught up in listening...]

===================================================

2:30 – 3:15 Dr. Yi-Min Wang, Microsoft Research
"STRIDER: A New Approach to Configuration and Security Management"

[great talk on diagnosing windows problems, malware, etc.]

===================================================

3:30 – 4:00 Emre Kiciman, Stanford Univ.
"Combining Statistical Monitoring and Fast Recovery for Self Management"

===================================================

4:00 – 5:00 Panel and Q&A
Moderator: Dr. Milan Milenkovic, Intel Corporation
Panel presentations on research vectors and hard problems in autonomic computing,
considering the platform support aspect.

Anders Vinberg

* model-based management
  - system definition model (SDM)
  - models come from initial development (visual studio), mgt tools, discovery,
    operational monitoring, error reporting
  - the model lives in a manifest and travels with the application
* moving to webservices-based mgt protocols
  - defines protocols, not the mgt model
  - instrumentation, events, settings, actions, scripting
* self-* systems
  - must have knowledge of themselves and their environment (self-* includes self-aware)
  - long-term DSI direction: a model on every system
* mgt is about reconciling what the system "ought to be" and what "it is"
  - models say what the sys "ought to be"; monitoring says what "it is"
  - propagate the two views of the system towards each other and try to make them match
  - put mgt logic at different points in the system: some things will be done inside the
    system, others closer to the operator or developer
  - (a minimal sketch of this reconciliation loop follows at the end of this section)

  ------------------------------ "oughtness" ------------------------------>
  DEVELOPER/IT ADMIN -> SDM -> SDM SERVICE -> CENTRAL MGR -> REMOTE MANAGER -> LOCAL SDM -> MANAGED SYSTEM
  <------------------------------- "is-ness" -------------------------------

* autonomic computing wants to put things at the local layer, near the managed system.
  it's really a difference in degree, not a difference in kind, as we move...
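A minimal sketch of that "ought to be" vs. "is" reconciliation loop (my own illustration;
the setting names and values are invented, and this is not the actual SDM/DSI API): the
model supplies the desired state, monitoring supplies the observed state, and management
logic emits actions that drive the two toward each other.

    # desired state, as a model might declare it
    ought_to_be = {"web_replicas": 3, "logging": "enabled", "patch_level": 42}

    def observe():
        # stand-in for real monitoring; returns the "is" view of the system
        return {"web_replicas": 2, "logging": "enabled", "patch_level": 41}

    def reconcile(desired, actual):
        # compute the drift and emit a corrective action for each mismatch
        actions = []
        for key, want in desired.items():
            have = actual.get(key)
            if have != want:
                actions.append(f"set {key}: {have} -> {want}")
        return actions

    if __name__ == "__main__":
        for action in reconcile(ought_to_be, observe()):
            print(action)          # e.g. "set web_replicas: 2 -> 3"

Whether this loop runs near the operator or down on the managed system itself is exactly
the "difference in degree" point above.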
==============================================================

Ken Birman

-> key concern: potential for market failure
   - the autonomic computing premise is intrinsically high risk
   - many of the core "issues" stem from limitations of the dominant platforms and the internet
   - by definition, it is hard to intervene in the legacy tech base
   - and despite the cost, the human element of sys mgt is entrenched
-> dual research challenges
   - yes, pursue new technologies (that improve self-diagnosis and repair, automate
     configuration mgt); funding opportunities are key
   - but also pursue a deployment path: need to identify a rapidly growing market w/ an
     intrinsic need for AC solutions
-> possible AC testbed goals
   1. create a massive, long-lived wireless sensor deployment and open it to researchers
      (e.g., environmental monitoring)
   2. create AC hardware "assists" (wireless support for out-of-band configuration repair,
      or unobtrusive monitoring)
   3. create a national scalability-challenges testbed for deploying/testing new network
      technologies
   4. open the lower levels of the network for AC: enable "underlays" that time-share raw
      routers, links
-> autonomic computing research topics
   1. seek "mainstream" home runs (hurry up and solve a visible problem)
   2. develop out-of-band problem-diagnosis tools
   3. monitor healthy systems to learn "good state", then identify "possible problems" when
      shown an unhealthy state (a toy sketch of this idea closes these notes)
      - instead of trying to say "it's acting strangely, let's reboot it", wait for someone
        to complain, then say "the hypothesis is..."
   4. develop AC network overlays with rapid self-repair after disruption, QoS guarantees
   5. promote an open-source-style (but not GPL) toolkit into which researchers can
      contribute technology

==========================================================

Jeff Chase

-> closing control loops is really interesting, but not something ready to deploy...
   it's a bit scary. is it safe?
-> the low-hanging fruit is in empowering the human experts
   - one great thing is using SLT to filter the masses of data being captured,
     e.g., Yi-Min's STRIDER, Moises' project.

==========================================================

Moises

-> you can already buy closed control loops in products: e.g., oracle 10g, SAP, etc.
-> Ken touched on a great point: funding.
-> the problems aren't going to be solved until we all have experience with large systems
-> the way Moises approached choosing a problem was to pick one that was too "high-risk"
   for the product groups.

==========================================================

Jeff Kephart

-> when are we ready for primetime?
   - there is stuff out there that one can call "self-managed", but it's single-node,
     small-scale systems, not large-scale
-> the impact of AC is already here, with much more to come

==========================================================
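A toy version of Birman's "learn good state from healthy systems" suggestion (my own
illustration; the metrics, values, and threshold are invented): learn a per-metric
mean/stddev envelope from a healthy period, then, when someone complains, report which
metrics sit far outside that envelope as the hypothesis.

    import statistics

    healthy = {   # made-up samples collected while the system was known-good
        "cpu":     [0.30, 0.35, 0.28, 0.33, 0.31],
        "latency": [120, 130, 125, 118, 127],
    }

    def learn(samples):
        return {m: (statistics.mean(v), statistics.stdev(v)) for m, v in samples.items()}

    def hypotheses(observation, model, threshold=3.0):
        # flag metrics more than `threshold` standard deviations from the healthy mean
        odd = []
        for metric, value in observation.items():
            mean, stdev = model[metric]
            if stdev and abs(value - mean) / stdev > threshold:
                odd.append(f"{metric}={value} (healthy ~{mean:.2f})")
        return odd

    if __name__ == "__main__":
        model = learn(healthy)
        print(hypotheses({"cpu": 0.32, "latency": 480}, model))   # latency gets flagged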