Adaptive Data Management and Query Processing

Faculty: Franklin and Hellerstein


Societal Information Systems (SIS) present challenges that cannot be met by existing data management technology. These challenges stem from their large scale, highly distributed nature, and the need to actively assist users in wading through vast quantities of data in a near-real-time manner. A key requirement for data management in SIS is adaptability. That is, the data management components of the SIS infrastructure must be able to quickly evolve and adapt to radical changes in data availability and content, systems and network characteristics, and user needs and context.

Traditional database query processing systems break down in such environments for a number of reasons: First, they are based on static approaches to query optimization and planning. Database systems produce query plans using simple cost models and statistics about the data to estimate the cost of running particular plans. In a dynamic dataflow environment, this approach simply does not work because there are typically no reliable statistics about the data and because the arrival rates, order, and behavior of the data streams are too unpredictable [UFA98] [HFCD+00].

Second, the existing approaches cannot adequately cope with failures that arise during the processing of a query. In current database systems, if the failure of a data source goes undetected, the query processor simply blocks, waiting for the data to arrive. If a failure is detected, then a query is simply aborted and restarted. Neither of these situations is appropriate in an SIS environment in which sources and streams behave unpredictably, and queries can be extremely long-running (so called "continuous queries").

Third, existing approaches are optimized for a batch style of processing in which the goal is to deliver an entire answer (i.e., they are optimized for the delivery of the last result). In an SIS environment, where users will be interacting with the system in a fine-grained fashion, such approaches are unacceptable. Processed data (e.g., query results, event notifications, etc.) must be passed on to the user as soon as they are available. Furthermore, because the system is interactive, a user may choose to modify her queries on the basis of previously returned information or other factors. Thus, the system must be able to gracefully adjust to changes in the needs of the users [HACO+99].

The research topics we plan to address in data management for SIS are the following:

Adaptive Data Flow Processing: The Telegraph project at UC Berkeley is developing an adaptive dataflow processing engine. Telegraph uses a novel approach to query execution based on "eddies", which are dataflow control structures that route data to query operators on an item-by-item basis [AH00]. Telegraph does not rely upon a traditional query plan, but rather allows the "plan" to develop and adapt during execution. For queries over continuous streams of data, the system can continually adapt to changes in data arrival rates, data characteristics, and the availability of processing, storage, and communication resources. An initial prototype of Telegraph has been built, but much remains to be done. The challenges to be addressed include: 1) the development of cluster-based and wide-area implementations of the processing engine, 2) the design of fault-tolerance mechanisms, particularly for long-running queries, 3) support for continuous queries over streaming data from sensors and web-based sources, and 4) the development of appropriate user interfaces for manipulating data flows.
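To make the item-by-item routing idea concrete, the following is a minimal sketch of eddy-style execution for a conjunction of filters. It is an illustration only, not the Telegraph implementation: the ticket-weighted routing policy is a simplified stand-in for the lottery-based policy of [AH00], and the operator representation is an assumption for exposition.

```python
import random

# Sketch: an eddy routes each tuple to one remaining operator at a time.
# Operators that reject many tuples accumulate "tickets" and are chosen
# earlier for later tuples, so the effective plan adapts as data flows.

def make_filter(name, predicate):
    return {"name": name, "apply": predicate}

def eddy(tuples, operators):
    """Route each tuple through all operators in an adaptively chosen order."""
    tickets = {op["name"]: 1 for op in operators}
    results = []
    for t in tuples:
        remaining = list(operators)
        alive = True
        while remaining and alive:
            # Pick the next operator for this tuple, weighted by tickets.
            weights = [tickets[op["name"]] for op in remaining]
            op = random.choices(remaining, weights=weights)[0]
            remaining.remove(op)
            if op["apply"](t):
                tickets[op["name"]] = max(1, tickets[op["name"]] - 1)
            else:
                tickets[op["name"]] += 1  # reward the selective operator
                alive = False  # tuple rejected; stop routing it
        if alive:
            results.append(t)
    return results
```

Because the filters are conjunctive, the final answer is the same regardless of the routing order; only the work performed per tuple changes, which is what the adaptivity optimizes.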

Sensor Query Processing: Much of the data to be processed in SIS will be continually streaming in from tiny, low-power sensors. Techniques for querying these sensor data streams will be crucial. These techniques must not only be efficient, but must also be tolerant of the power limitations and error characteristics of the sensors. We plan to extend the data flow query processing architecture with two techniques for dealing with sensors: 1) the "Fjords" operator architecture, and 2) power-sensitive "sensor proxy" operators. The Fjords architecture provides the functionality and interfaces necessary to integrate erratic, streaming dataflows into query plans. It allows streaming data to be pushed through operators that pull from traditional data sources, efficiently merging streams and local data as samples flow past. Fjords also allow processing from multiple queries to share the same data stream, thereby providing huge scalability improvements. Sensor proxies are specialized query operators that serve as mediators between sensors and query plans, using sensors to facilitate query processing while adapting to their power, processor, and communications limitations.
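The push/pull combination at the heart of the Fjords idea can be sketched as follows. This is a hypothetical simplification, not the Fjords API: a non-blocking queue stands in for the push-based sensor stream, an in-memory table stands in for the pull-based local source, and the class and function names are assumptions.

```python
from collections import deque

class PushQueue:
    """Non-blocking queue that a sensor driver pushes samples into."""
    def __init__(self):
        self._q = deque()

    def push(self, item):
        self._q.append(item)

    def poll(self):
        # Never blocks: returns None when no sample is ready, so the
        # operator can yield instead of stalling on a silent sensor.
        return self._q.popleft() if self._q else None

def merge_stream_with_table(push_queue, table, key):
    """Join streaming samples against a local table as samples flow past."""
    index = {row[key]: row for row in table}  # pull side, built once
    out = []
    while True:
        sample = push_queue.poll()
        if sample is None:
            break  # no data ready; a real Fjord would yield to other work
        row = index.get(sample[key])
        if row is not None:
            out.append({**row, **sample})
    return out
```

The non-blocking `poll` is the essential point: the operator makes progress when samples arrive and cedes control when they do not, rather than blocking the whole plan on an erratic source.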

Context-Aware Data and Event Dissemination: Due to the huge volume of data managed by the SIS, such a system must provide special support for the targeted and timely delivery of relevant data and notifications to users based on their interests, roles, and context at a particular time. Such dissemination must be driven by user profiles, which contain information about user requirements, priorities, and information needs [CFG00] [CFZ01]. We envision a user profile language that allows the specification of three types of information: 1) Domain specification: a specification of the kinds of data that are of interest to the user. This description must be declarative in nature, so that it can encompass newly created data in addition to existing data. The description must also be flexible enough to express predicates over different types of data and media. 2) Utility specification: because of limitations on bandwidth, device-local storage, and human attention, only a small portion of available information can be sent to a user. Thus, the profile must also express the user's preferences in terms of priorities among data items, desired resolutions of multi-resolution items, consistency requirements, and other properties. 3) Context specification: user context can be dynamically incorporated into the dissemination process by parameterizing the user profile with user context information. This context information may be obtained through on-line observation of users, or from other sources such as data stored in Personal Information Management (PIM) applications such as the calendar, contact list, and To Do list. The challenges here involve language development, profile processing issues, and delivery scheduling. We plan to draw on our earlier work on large-scale XML document filtering [AF00] for this purpose.
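The three-part profile described above can be sketched as a simple data model. The field names, the priority-ranked selection, and the context-parameterized predicate are all assumptions made for illustration; they are not the profile language we propose to design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Any

@dataclass
class Profile:
    # Domain specification: a declarative predicate over data items.
    domain: Callable[[dict], bool]
    # Utility specification: higher scores are delivered first.
    priority: Callable[[dict], float]
    # Context: dynamic parameters, e.g. drawn from PIM applications.
    context: Dict[str, Any] = field(default_factory=dict)

def disseminate(items: List[dict], profile: Profile, budget: int) -> List[dict]:
    """Select up to `budget` matching items, highest utility first.

    The budget models limits on bandwidth, storage, and attention: only
    a small portion of matching information is actually delivered.
    """
    matching = [i for i in items if profile.domain(i)]
    matching.sort(key=profile.priority, reverse=True)
    return matching[:budget]
```

Context parameterization falls out naturally in this sketch: the `domain` predicate can close over `profile.context` (say, a location taken from the user's calendar), so that the same profile selects different data as the user's situation changes.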

References

[AF00] Mehmet Altinel, Michael J. Franklin. Efficient Filtering of XML Documents for Selective Dissemination of Information, Proceedings of the International Conference on Very Large Data Bases, Cairo, September 2000.

[AH00] Ron Avnur, Joseph M. Hellerstein. Eddies: Continuously Adaptive Query Processing, Proceedings of the ACM SIGMOD Conference, Philadelphia, PA, June 2000.

[CFG00] Ugur Cetintemel, Michael J. Franklin, and C. Lee Giles. Self-Adaptive User Profiles for Large-Scale Data Delivery, Proceedings of the International Conference on Data Engineering, San Diego, CA, February 2000, pp. 622-633.

[CFZ01] Mitch Cherniack, Michael J. Franklin, Stan Zdonik. Expressing User Profiles for Data Recharging, IEEE Personal Communications, April 2001 (to appear).

[HACO+99] Joseph M. Hellerstein, Ron Avnur, Andy Chou, Chris Olston, Vijayshankar Raman, Tali Roth, Christian Hidber, Peter J. Haas. Interactive Data Analysis with CONTROL, IEEE Computer 32(8), August 1999, pp. 51-59.

[HFCD+00] Joseph M. Hellerstein, Michael J. Franklin, Sirish Chandrasekaran, Amol Deshpande, Kris Hildrum, Sam Madden, Vijayshankar Raman, Mehul Shah. Adaptive Query Processing: Technology in Evolution, IEEE Data Engineering Bulletin, June 2000, pp. 7-18.

[UFA98] Tolga Urhan, Michael J. Franklin, Laurent Amsaleg. Cost Based Query Scrambling for Initial Delays, Proceedings of the ACM SIGMOD Conference, Seattle, WA, June 1998, pp. 130-141.