Chapter 13: Database Management Systems

The ERL Research Summary for 2002


Query Processing for Streaming Sensor Data

Samuel Madden
(Professor Michael Franklin)
ITR grant IIS00-86057 and DARPA Contract N66001-99-2-8913

Over the past few years, a great deal of attention in the networking and mobile-computing communities has been directed towards building networks of ad-hoc collections of sensors scattered throughout the world. MIT, UCLA, and UC Berkeley have all embarked on projects to produce small, wireless, battery-powered sensors and low-level networking protocols. These projects have brought us close to the vision of ubiquitous computing, in which computers and sensors assist in every aspect of our lives. To fully realize this vision, however, it will be necessary to combine and query the readings produced by these collections of sensors. An obvious approach would be to apply traditional data-processing techniques and operators from the database community. Unfortunately, standard DBMS assumptions about the characteristics of data sources do not apply to sensors, so a significantly different architecture is needed. Determining the nature of this architecture is the goal of my research.

More generally, my work can be summarized as "data processing for streams of sensor data", and can be divided into two related areas:

- Server Side Issues: Modifications to traditional query processor architectures to enable them to access and efficiently query streaming sensor data.

- Sensor Side Issues: Modifications to the software that runs on the sensors themselves to enable database-style queries involving joins, aggregates, and other operators to be partially executed within sensor networks; a toy sketch of this in-network aggregation idea follows below.
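
As a rough illustration of the sensor-side idea (illustrative names, not the actual sensor software): an aggregate such as an average can be computed from mergeable partial state records that flow up a routing tree, so each node transmits one small partial result instead of every raw reading.

    # Hedged sketch: in-network partial aggregation of an AVG over sensor readings.
    # Class and method names are illustrative, not the project's actual API.
    import random

    class PartialAvg:
        """Partial state record for AVG: a mergeable (sum, count) pair."""
        def __init__(self, total=0.0, count=0):
            self.total, self.count = total, count

        def add_reading(self, value):
            self.total += value
            self.count += 1

        def merge(self, other):
            # Merging a child's partial costs one small message, instead of
            # forwarding every raw reading to the base station.
            self.total += other.total
            self.count += other.count

        def result(self):
            return self.total / self.count if self.count else None

    class SensorNode:
        def __init__(self, node_id, children=()):
            self.node_id = node_id
            self.children = list(children)

        def read_sensor(self):
            # Placeholder for a real ADC read (e.g., light or temperature).
            return 20.0 + random.random() * 5.0

        def epoch(self):
            """One sampling epoch: combine the local reading with children's partials."""
            partial = PartialAvg()
            partial.add_reading(self.read_sensor())
            for child in self.children:
                partial.merge(child.epoch())   # child sends its partial up the tree
            return partial

    # A tiny routing tree: the root's partial is what the base station would see.
    leaves = [SensorNode(i) for i in range(1, 4)]
    root = SensorNode(0, children=leaves)
    print("network-wide average:", root.epoch().result())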


Send mail to the author : (madden@cs.berkeley.edu)

Streaming Views over Streaming Data

Sirish Chandrasekaran
(Professor Michael Franklin)
Department of Defense Space and Naval Warfare Systems Command : N66001-99-2-891301

Systems gurus predict that sensors will soon be all around us, recording interesting data and streaming it to base stations. There is a need to query these streams adaptively and efficiently. Without sacrificing the richness of the workload, a query engine over streams must scale along the following dimensions:

1. Number of data streams - Different types of sensors (e.g., home-appliance usage sensors, machine wear-and-tear sensors, weather sensors, traffic sensors) produce tuples with different schemas. Also, there are typically numerous instances of each type of sensor (numerous homes, machines, geographical locations, roadways, etc.).

2. Frequency of data arrival - This could vary from seconds (traffic sensor) to days (weather sensor).

3. Number of queries - Potentially a few thousand per second.

Most current query engines err on the side of either reduced query functionality or reduced performance. In this paper, we show how shared, incrementally refined materialized views allow good scaling behavior over a rich query workload:

- Materialization and incremental maintenance of results allow us to support a richer class of queries than NiagaraCQ [1], XFilter [2], and CACQ [3].

- Shared storage and computation of different views allows us to overcome the performance limitations of traditional data-warehousing-oriented materialized view solutions, such as the Chronicle data model [4], in this new streaming-data environment. For single-table views, we further optimize memory requirements by sharing storage between the views and the base data. A toy sketch of this incremental, shared maintenance follows below.
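
As a rough illustration (not our actual implementation), the sketch below maintains a grouped running sum incrementally as tuples arrive, and lets several registered queries share the one materialized result instead of each re-scanning the stream.

    # Hedged sketch: an incrementally maintained, shared materialized view over a
    # stream. Names (RunningSumView, the query predicates) are illustrative only.
    from collections import defaultdict

    class RunningSumView:
        """Materialized view: per-key running SUM, refined as tuples arrive."""
        def __init__(self):
            self.sums = defaultdict(float)

        def insert(self, key, value):
            self.sums[key] += value          # O(1) incremental maintenance

        def lookup(self, key):
            return self.sums.get(key, 0.0)

    # One shared view serves many registered queries (here, simple thresholds).
    view = RunningSumView()
    queries = {
        "q1": lambda v: v.lookup("sensor-7") > 100.0,
        "q2": lambda v: v.lookup("sensor-7") > 250.0,
    }

    for key, value in [("sensor-7", 80.0), ("sensor-7", 30.0), ("sensor-3", 10.0)]:
        view.insert(key, value)
        fired = [q for q, pred in queries.items() if pred(view)]
        print(key, value, "->", fired)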

[1]
Jianjun Chen, David J. DeWitt, Feng Tian and Yuan Wang, "NiagaraCQ: A Scalable Continuous Query System for Internet Databases", Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA
[2]
Mehmet Altinel and Michael J. Franklin, "Efficient Filtering of XML Documents for Selective Dissemination of Information", VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt
[3]
Sam Madden, Mehul Shah and Joseph Hellerstein, "CACQ: Continuous Queries over Eddies", Submitted for publication.
[4]
H. V. Jagadish, Inderpal Singh Mumick and Abraham Silberschatz, "View Maintenance Issues for the Chronicle Data Model", Proceedings of the Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 22-25, 1995, San Jose, California

Send mail to the author : (sirish@cs.berkeley.edu)

Path Sharing and Predicate Evaluation for High-Performance XML Filtering

Yanlei Diao
(Professor Michael Franklin)

Soon, much of the data exchanged over the Internet will be encoded in XML. This encoding can be used as the basis for sophisticated filtering and content-based routing of data using XML queries. Such queries can involve constraints on both the structure and the content of XML data. Structure-based queries are often written using path expressions. Some filtering systems represent XML path expressions as Finite State Machines and index them. In contrast, work on continuous query processing for database systems has focused on content-based queries in a relational style and has proposed grouping common parts of queries. It is clear that scalable, efficient XML filtering must support both types of constraints and will need both optimizations. We take a new approach based on Nondeterministic Finite Automata (NFA) that indexes path expressions naturally while capturing the commonality between queries, and processes path expressions using the index. Experimental results show that the NFA-based approach dramatically improves filtering performance and system scalability with very low maintenance cost. Equipped with the capability to process path expressions, we evaluate additional constraints, written as predicates, by applying selections to elements of path expressions and by processing nested path expressions. We propose alternative orderings of these operations and study their impact on system performance.
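
To make the shared-automaton idea concrete, here is a minimal sketch (illustrative, not the actual system): path expressions that share prefixes share NFA states, and a single pass over a document's element events advances all registered queries at once. Only the child axis and the wildcard are handled; descendant steps and predicate evaluation are omitted.

    # Hedged sketch of shared NFA-based path matching over element events.
    class SharedPathNFA:
        def __init__(self):
            self.transitions = [{}]   # state -> {label: next_state}
            self.accepting = {}       # state -> list of query ids

        def add_path(self, query_id, path):
            """Insert a path like '/a/*/c'; common prefixes reuse existing states."""
            state = 0
            for step in path.strip("/").split("/"):
                nxt = self.transitions[state].get(step)
                if nxt is None:
                    self.transitions.append({})
                    nxt = len(self.transitions) - 1
                    self.transitions[state][step] = nxt
                state = nxt
            self.accepting.setdefault(state, []).append(query_id)

        def run(self, events):
            """events: ('start', tag) / ('end', tag) pairs, e.g. from a SAX parser."""
            stack = [{0}]                  # active state sets, one per open element
            matches = set()
            for kind, tag in events:
                if kind == "start":
                    active = set()
                    for state in stack[-1]:
                        for label in (tag, "*"):
                            nxt = self.transitions[state].get(label)
                            if nxt is not None:
                                active.add(nxt)
                                matches.update(self.accepting.get(nxt, ()))
                    stack.append(active)
                else:
                    stack.pop()
            return matches

    nfa = SharedPathNFA()
    nfa.add_path("q1", "/book/title")
    nfa.add_path("q2", "/book/*/price")
    events = [("start", "book"), ("start", "title"), ("end", "title"),
              ("start", "edition"), ("start", "price"), ("end", "price"),
              ("end", "edition"), ("end", "book")]
    print(nfa.run(events))   # {'q1', 'q2'}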


More information (http://www.cs.berkeley.edu/~diaoyl) or

Send mail to the author : (diaoyl@eecs.berkeley.edu)

Selection Ordering and Correlated Predicates

Amol Deshpande
(Professor Joseph M. Hellerstein)

Given a set of correlated predicates on a database table or a data stream, we want to find the optimal order in which to apply these predicates to the data. We assume that the correlations between predicates are captured by a statistical model such as a Markov network. To the best of our knowledge, the complexity of this problem in the general case is open. We are also interested in extending this problem to the Eddy framework for query processing [1]. Eddies form an adaptive framework for query processing in which the query plan is not fixed in advance but changes dynamically based on data characteristics. For correlated predicates, we need to determine what kind of probability distributions to learn from the data and how to use this knowledge effectively during query processing.
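
As a small illustration of why correlations matter (toy numbers and a made-up conditional model, not results from this work): for independent predicates the classic rule of ordering by cost / (1 - selectivity) is optimal, but once selectivities become conditional on the predicates already applied, that ordering can be beaten.

    # Hedged sketch: ordering selection predicates under correlation.
    from itertools import permutations

    # (name, cost per tuple, selectivity if evaluated first) -- made-up numbers
    preds = [("p1", 1.0, 0.5), ("p2", 2.0, 0.1), ("p3", 0.5, 0.9)]

    def rank_order(preds):
        """Optimal for independent predicates: ascending cost / (1 - selectivity)."""
        return sorted(preds, key=lambda p: p[1] / (1.0 - p[2]))

    def expected_cost(order, cond_sel):
        """Expected cost per tuple, given conditional selectivities from a
        (hypothetical) statistical model such as a Markov network."""
        cost, pass_prob, seen = 0.0, 1.0, frozenset()
        for name, c, _ in order:
            cost += pass_prob * c
            pass_prob *= cond_sel(name, seen)
            seen = seen | {name}
        return cost

    # Toy conditional model: p2 is highly correlated with p1.
    def cond_sel(name, seen):
        base = {"p1": 0.5, "p2": 0.1, "p3": 0.9}[name]
        if name == "p2" and "p1" in seen:
            return 0.8        # most tuples that pass p1 also pass p2
        return base

    greedy = rank_order(preds)
    best = min(permutations(preds), key=lambda o: expected_cost(o, cond_sel))
    print("greedy (independence):", [p[0] for p in greedy],
          expected_cost(greedy, cond_sel))
    print("exhaustive (correlated):", [p[0] for p in best],
          expected_cost(best, cond_sel))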

[1]
Ron Avnur and Joseph Hellerstein, "Eddies: Continuously Adaptive Query Processing", SIGMOD, 2000

More information (http://www.cs.berkeley.edu/~amol) or

Send mail to the author : (amol@cs.berkeley.edu)

Using Histograms to Estimate Intermediate Result Sizes

Amol Deshpande
(Professor Joseph M. Hellerstein)

Histograms are commonly used to estimate intermediate result sizes during query optimization, but the problem of using these histograms optimally has not received much attention. The cost of estimation is not very high for simple one-dimensional histograms; however, with the proposed use of more complex multi-dimensional histograms such as MHists [1], wavelet-based histograms [2], and dependency-based histograms [3], it becomes important, since the join and project operations on these histograms can be very expensive. We are investigating the complexity of this problem and developing algorithms to solve it.
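
For context, the sketch below shows the simplest form of the estimation itself: an equi-join size estimated from two aligned one-dimensional equi-width histograms, under the usual assumption that values are spread uniformly within each bucket. The expensive multi-dimensional analogue of this computation is what motivates the optimization question above. (Data and numbers are made up.)

    # Hedged sketch: join result size estimation from two 1-D equi-width histograms.
    import random

    def build_hist(values, lo, hi, nbuckets):
        width = (hi - lo) / nbuckets
        counts = [0] * nbuckets
        for v in values:
            b = min(int((v - lo) / width), nbuckets - 1)
            counts[b] += 1
        return counts

    def estimate_join_size(hist_r, hist_s, distinct_per_bucket):
        """Sum over buckets of f_R * f_S / d: the expected number of matches when
        each bucket holds d equally likely distinct values."""
        return sum(fr * fs / distinct_per_bucket
                   for fr, fs in zip(hist_r, hist_s))

    # Toy data: join R.a = S.a over integer keys in [0, 100), 10 buckets of 10 values.
    random.seed(0)
    R = [random.randrange(100) for _ in range(1000)]
    S = [random.randrange(50) for _ in range(500)]     # S is skewed to low keys

    hist_r = build_hist(R, 0, 100, 10)
    hist_s = build_hist(S, 0, 100, 10)
    est = estimate_join_size(hist_r, hist_s, distinct_per_bucket=10)
    true = sum(1 for r in R for s in S if r == s)
    print("estimated:", round(est), "actual:", true)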

[1]
V. Poosala and Yannis Ioannidis, "Selectivity Estimation Without the Attribute Value Independence Assumption", VLDB, 1997
[2]
Y. Matias, J. Vitter and Min Wang, "Wavelet-based Histograms for Selectivity Estimation", SIGMOD, 1998
[3]
A. Deshpande, M. Garofalakis and R. Rastogi, "Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data", SIGMOD, 2001

More information (http://www.cs.berkeley.edu/~amol) or

Send mail to the author : (amol@cs.berkeley.edu)

Eddies: Mechanism and Policy

Fred Reiss
(Professor Joseph M. Hellerstein)

Database systems operate in a world that changes rapidly, yet traditional query processing engines operate under the assumption that the world always stays the same. Eddies [1] are a query processing architecture that allows for continuously adaptive reordering of relational operators. Previous work on Eddies concentrated on identifying the ways in which operators can be reordered and defining a general framework for continuous adaptivity.

Our current work aims to make Eddies a viable substitute for a traditional static query engine. Naive implementations of the Eddy framework achieve adaptivity, but with a high overhead for each tuple processed. We are working to reduce the overhead of adaptivity through various novel mechanisms. We are also exploring different policies for routing tuples among relational operators in an effort to understand which policies adapt best to different kinds of change.
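
To give a flavor of the mechanism/policy split, here is a toy illustration (not the Telegraph implementation, and a simplification of the lottery policy in [1]): each tuple carries the set of operators already applied to it, and the eddy routes it to a remaining operator with probability proportional to tickets earned by dropping tuples, so selective predicates tend to be applied first.

    # Hedged sketch: a toy eddy routing selection operators with a lottery-style policy.
    import random
    random.seed(1)

    operators = {
        "cheap_but_weak":  lambda t: t % 2 == 0,     # passes ~50% of tuples
        "very_selective":  lambda t: t % 10 == 0,    # passes ~10% of tuples
    }
    tickets = {name: 1 for name in operators}

    def eddy(tuples):
        output = []
        for t in tuples:
            done = set()
            alive = True
            while alive and len(done) < len(operators):
                # Route to one not-yet-applied operator, weighted by tickets.
                pending = [n for n in operators if n not in done]
                weights = [tickets[n] for n in pending]
                name = random.choices(pending, weights=weights)[0]
                done.add(name)
                if operators[name](t):
                    pass                      # tuple survives, keeps circulating
                else:
                    tickets[name] += 1        # reward the operator for dropping it
                    alive = False
            if alive:
                output.append(t)
        return output

    result = eddy(range(1000))
    print("qualifying tuples:", len(result), "tickets:", tickets)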

[1]
R. Avnur and J. Hellerstein, "Eddies: Continuously Adaptive Query Processing," SIGMOD 2000.

More information (http://www.cs.berkeley.edu/~phred) or

Send mail to the author : (phred@eecs.berkeley.edu)

Boolean Bounding Predicates for Spatial Access Methods

Megan Thomas
(Professor Joseph M. Hellerstein)
National Physical Sciences Consortium Fellowship and Lawrence Livermore National Labs

We are working on a general, experimentally validated framework for access method optimization. We propose algorithms to improve the performance of tree-structured multi-dimensional indexes such as R-trees and SR-trees. Our approach uses empty space on index node pages to reduce imprecision in the bounding predicates. We have created APIs in the Generalized Search Trees (GiST) infrastructure to allow an access method designer to use our algorithms. We are implementing and testing a variety of heuristic algorithms that perform the proposed access method optimization.
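
One plausible reading of how a refined bounding predicate can pay off (an illustration of the general idea only, not the algorithms under development): if an index entry stores, alongside its minimum bounding rectangle, a few rectangles known to be empty, then a query whose overlap with the MBR falls entirely inside known-empty space need not visit the child node.

    # Hedged sketch: an MBR refined by "empty space" rectangles kept in spare page space.
    def overlaps(a, b):
        (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
        return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

    def contains(outer, inner):
        (ox1, oy1, ox2, oy2), (ix1, iy1, ix2, iy2) = outer, inner
        return ox1 <= ix1 and oy1 <= iy1 and ix2 <= ox2 and iy2 <= oy2

    class Entry:
        def __init__(self, mbr, empty_rects=()):
            self.mbr = mbr                       # (x1, y1, x2, y2)
            self.empty_rects = list(empty_rects) # refinements stored in spare space

        def must_visit(self, query):
            if not overlaps(self.mbr, query):
                return False
            # Clip the query to the MBR, then see whether the remaining region
            # lies entirely within a rectangle known to be empty.
            inter = (max(self.mbr[0], query[0]), max(self.mbr[1], query[1]),
                     min(self.mbr[2], query[2]), min(self.mbr[3], query[3]))
            return not any(contains(e, inter) for e in self.empty_rects)

    entry = Entry(mbr=(0, 0, 10, 10), empty_rects=[(6, 6, 10, 10)])
    print(entry.must_visit((7, 7, 9, 9)))   # False: overlap is all empty space
    print(entry.must_visit((1, 1, 3, 3)))   # True: may contain data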


More information (http://www.cs.berkeley.edu/~mct/) or

Send mail to the author : (mct@eecs.berkeley.edu)

Robust, Scalable Dataflows in Telegraph

Mehul Shah
(Professor Joseph M. Hellerstein)

There are a number of data-intensive applications that involve long-running dataflow programs. Examples include data mining queries, crawls over "deep-web" databases, and publish/subscribe data dissemination systems. The last application is the most challenging because it requires scaling to millions of user-registered "continuous" queries running over a number of sources. We must apply newly generated data to these queries and quickly disseminate the results.

We can scale these applications by running them across a shared-nothing cluster of computers. However, the runtime environment on this platform is unpredictable. Node failures can occur due to Heisenbugs in the underlying system, and load imbalances can arise due to heterogeneity in the hardware. Moreover, in long-running dataflows the underlying source data distributions can change, leading to further imbalances.

The Telegraph system provides a dataflow programming environment that is robust in the face of this volatility. The mechanism for achieving this robustness is the Flux operator, which we interpose between stages of a dataflow pipeline. The Flux operator is a generalization of the exchange operator. It encapsulates the logic for detecting and correcting load imbalances at runtime and for rapidly recovering from node failures. We are currently investigating policies for state migration (for load balancing) and for replication (for fault tolerance).
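
The sketch below is a toy illustration, not Telegraph code, of the kind of logic such an operator encapsulates: tuples are hash-partitioned across nodes as in a classic exchange, per-partition load and operator state are tracked, and a hot partition can be moved, together with its state, to a lightly loaded node while the dataflow keeps running.

    # Hedged sketch: a load-balanced, partitioned exchange in the spirit of Flux.
    NUM_PARTITIONS = 16

    class FluxSketch:
        def __init__(self, nodes):
            self.nodes = list(nodes)
            # Initially spread partitions round-robin across nodes.
            self.placement = {p: self.nodes[p % len(self.nodes)]
                              for p in range(NUM_PARTITIONS)}
            self.load = {n: 0 for n in self.nodes}
            self.state = {p: {} for p in range(NUM_PARTITIONS)}  # per-partition operator state

        def route(self, key, tuple_):
            part = hash(key) % NUM_PARTITIONS
            node = self.placement[part]
            self.load[node] += 1
            # e.g., maintain a per-key count as the operator state for this partition
            self.state[part][key] = self.state[part].get(key, 0) + 1
            return node

        def rebalance(self):
            """Move the hottest node's largest partition to the coolest node."""
            hot = max(self.nodes, key=lambda n: self.load[n])
            cool = min(self.nodes, key=lambda n: self.load[n])
            if self.load[hot] <= 2 * self.load[cool]:
                return
            victims = [p for p, n in self.placement.items() if n == hot]
            part = max(victims, key=lambda p: sum(self.state[p].values()))
            # In the real system this is where state migration (and, for fault
            # tolerance, replication) happens while the dataflow keeps running.
            self.placement[part] = cool

    flux = FluxSketch(nodes=["node0", "node1", "node2"])
    for i in range(9000):
        key = 0 if i % 2 == 0 else i % 40     # skew: one key carries half the stream
        flux.route(key, ("reading", i))
    print("per-node load before rebalancing:", flux.load)
    flux.rebalance()                          # hot partition (and its state) moves
    print("hot partition now on:", flux.placement[hash(0) % NUM_PARTITIONS])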


More information (http://www.cs.berkeley.edu/~mashah) or

Send mail to the author : (mashah@eecs.berkeley.edu)