Chapter 12: Database Management Systems

The ERL Research Summary for 1999


CLOUDS: Online Database Visualization

Chris Olston1, Tali Roth2, and Andy Chou3
(Professor Joseph M. Hellerstein)
MICRO

Data visualization is a hot topic in the database community because of its potential to make databases easier to use. Visualization systems present graphical representations of large data sets in order to make the data easy to understand [1]. Typically, these images can be produced only after processing every row in a large data table, forcing the visualization system to wait for a long time before it can display a useful picture.

A visualization system can instead produce the graphical representation one row at a time, in an online, constantly updating fashion. This improves interactivity by displaying a progressive sample of the data table in graphical form. We are researching a new visualization algorithm called CLOUDS [2], which improves upon this approach. The goal of CLOUDS is to make visualization more interactive by quickly displaying an accurate approximation of the final image. In addition to displaying a progressive sample, CLOUDS uses statistical techniques to predict in advance what the final picture will look like once it has been completed. As rows are retrieved from the database table, CLOUDS displays the graphical representations of the rows retrieved so far, along with translucent "clouds" that indicate where the remaining graphical objects are predicted to lie. This produces an approximate image that is refined as more data is processed.
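
The prediction step can be sketched in a few lines of Python. This is an illustration of the general idea only, not the CLOUDS implementation: it assumes a fixed uniform grid over the plotting area, assumes the retrieved rows are a representative sample, and all names are hypothetical.

```python
def cloud_densities(sample, total_rows, grid=4, bounds=(0.0, 1.0, 0.0, 1.0)):
    """Predict where unseen points will land: estimate the density of
    each grid cell from the sampled points, then scale up to the number
    of rows not yet retrieved. Each cell's "cloud" opacity would be
    proportional to its predicted count of unplotted points."""
    xmin, xmax, ymin, ymax = bounds
    counts = [[0] * grid for _ in range(grid)]
    for x, y in sample:
        cx = min(int((x - xmin) / (xmax - xmin) * grid), grid - 1)
        cy = min(int((y - ymin) / (ymax - ymin) * grid), grid - 1)
        counts[cy][cx] += 1
    remaining = total_rows - len(sample)
    n = len(sample)
    return [[counts[r][c] / n * remaining for c in range(grid)]
            for r in range(grid)]
```

As more rows arrive, the sample grows, the per-cell predictions shrink toward zero, and the clouds fade into the final image of plotted points.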

Figure 1 shows an in-progress visualization of US cities; each row in the database is displayed as one black point on the screen. Figure 2 shows the same visualization using the CLOUDS algorithm, with translucent "clouds" indicating the density of cities that have not yet been plotted.

As data points are being retrieved from the database, CLOUDS approximates the final picture more closely than does the conventional display algorithm (see Figure 3). Consequently, visualization systems using CLOUDS can display closer approximations to the final image faster than conventional visualization systems.


Figure 1: A partially completed visualization of US cities without CLOUDS

Figure 2: A partially completed visualization of US cities with CLOUDS. Note that the "clouds" indicate the density of cities that have not yet been plotted.

Figure 3: A graph of the mean squared error over time. The CLOUDS visualization algorithms (pink, red lines) have lower error than the non-CLOUDS algorithm (blue line) for the first half of the rendering time. Eventually, the non-CLOUDS algorithm "catches up" with CLOUDS. However, CLOUDS achieves the goal of quickly displaying a closer approximation of the final visualization than the conventional algorithm.

[1] C. Olston, A. Woodruff, A. Aiken, M. Chu, V. Ercegovac, M. Lin, M. Spalding, and M. R. Stonebraker, "DataSplash," Proc. SIGMOD, Seattle, WA, June 1998.
[2] R. Avnur, J. M. Hellerstein, B. Lo, C. Olston, B. Raman, V. Raman, T. Roth, and K. Wylie, "CONTROL: Continuous Output and Navigation Technology with Refinement On-Line," Proc. SIGMOD, Seattle, WA, June 1998.
1Undergraduate (EECS)
2Undergraduate (EECS)
3Undergraduate (EECS)

More information: http://datasplash.cs.berkeley.edu/online or send mail to the author: cao@cs.berkeley.edu

An Analysis Framework for Access Methods

Marcel Kornacker and Mehul Shah
(Professor Joseph M. Hellerstein)
(NASA) 1996-MTPE-00099 and (NSF) IRI-97-03972

Designing and tuning access methods (AMs) has always been more of a black art than a rigorous discipline, with performance assessments mostly reduced to presenting bottom-line runtime or I/O numbers. This work presents an analysis framework for AMs that gives detailed feedback about observed page access performance and the performance implications of individual aspects of an AM design. The analysis process takes a workload--a tree and a set of queries--as input and provides metrics that characterize the performance of each query as well as that of the tree structure and the structure-shaping aspects of the AM implementation. Central to the framework is the use of the optimal behavior--which can be approximated relatively efficiently--as a point of reference against which the actual observed performance is measured. The performance metrics themselves reflect the fundamental performance-relevant properties of the input tree. The framework applies to most balanced tree-structured AMs and is not restricted to particular types of data or queries. It is implemented in amdb, an analysis, visualization, and debugging tool for AMs that is constructed on top of the Generalized Search Tree abstraction.
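
The central idea of measuring observed behavior against an approximated optimum can be illustrated with a toy metric. This is a sketch under my own simplifying assumptions, not one of amdb's actual metrics: it compares the fewest pages that could possibly hold a query's result set against the pages the AM actually touched.

```python
import math

def page_access_overhead(result_items, item_to_page, page_capacity):
    """Compare observed page accesses for a query against an
    approximated optimum: the fewest pages that could hold the
    result set if it were perfectly clustered."""
    observed = len({item_to_page[i] for i in result_items})
    optimal = math.ceil(len(result_items) / page_capacity)
    return observed, optimal, optimal / observed  # utilization in (0, 1]
```

A utilization near 1 means the tree clusters this query's result well; a low value points at structure-shaping decisions (splits, insertion placement) worth examining.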


More information: http://www.cs.berkeley.edu/~marcel or send mail to the author: marcel@cs.berkeley.edu

Indexing Blobworld

Megan Thomas
(Professor Joseph M. Hellerstein)
National Physical Sciences Consortium Fellowship sponsored by LLNL

Blobworld separates images into homogeneous regions in order to improve content-based image retrieval. We investigate whether indexing image region features leads to increased speed and accuracy in image retrieval, over more traditional techniques that index image features based on the entire image. We also explore the image indexing and query optimization issues that arise when users can query based on the color, texture, shape, and location of multiple image regions, and vary the relative importance of each feature for each query.
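One way such multi-feature, weighted region queries could be scored is sketched below. This is a hypothetical illustration of the query model, not Blobworld's actual distance function: each feature (color, texture, shape, location) is represented as a vector, and per-query weights vary each feature's contribution.

```python
def region_distance(query_region, db_region, weights):
    """Weighted distance between two image regions, each described by
    per-feature vectors. The weights dict carries the user's per-query
    importance for each feature; features with no weight are ignored."""
    total = 0.0
    for feature, w in weights.items():
        q, d = query_region[feature], db_region[feature]
        # Euclidean distance within each feature space, scaled by weight.
        total += w * sum((a - b) ** 2 for a, b in zip(q, d)) ** 0.5
    return total
```

Because the weights change per query, the index and optimizer cannot assume a single fixed distance function, which is exactly what makes the indexing and query optimization questions interesting.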

[1]
C. Carson, M. Thomas, S. Belongie, J. M. Hellerstein, and J. Malik, "Blobworld: A System for Region-based Image Indexing and Retrieval," VISual, 1999 (submitted for publication).

More information: http://www.cs.berkeley.edu/~mct or send mail to the author: mct@cs.berkeley.edu

CONTROL: Online Query Processing

Ron Avnur and Vijayshankar Raman
(Professor Joseph M. Hellerstein)
Informix Corporation, MICRO, (NSF) IIS-98-02051, (NSF) CDA-94-01156, and Sloan Foundation Fellowship

We have implemented an online query processing system that enables a user to issue an SQL query, see results immediately, and adjust the processing as the query runs. For aggregation queries, the user sees refining estimates of the final results. For non-aggregation queries, the user receives an ever-growing collection of result records. This behavior allows users to provide feedback to the system while the query is running, such as expressing preference for a particular group of an aggregation query. In addition, once users have received enough information, they can halt the query and advance to the next step in their data exploration.
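
The refining-estimate behavior for a simple aggregate such as AVG can be sketched with a running mean and a rough confidence bound. This is an illustrative sketch only; the hypothetical class below uses a textbook normal-approximation interval, whereas the actual system uses more careful statistical estimators.

```python
import math

class OnlineAvg:
    """Refining estimate of AVG(col) as rows stream in, using
    Welford's method for a numerically stable running variance."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    def estimate(self):
        """Return (current estimate, approximate 95% half-width)."""
        if self.n < 2:
            return self.mean, float('inf')
        var = self.m2 / (self.n - 1)
        return self.mean, 1.96 * math.sqrt(var / self.n)
```

After each batch of rows, the interface would redraw the estimate with its shrinking error bar, letting the user decide when the answer is good enough to stop.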

We implemented the online query processing system in Informix's Dynamic Server with Universal Data Option, a commercial object-relational database management system [1]. In order to provide sensitivity to user preferences, we implemented operators for preferential data access and delivery. These operators differ from traditional data scans in that they associate higher priority with records that the user identifies as more interesting.

Blocking query operators, such as traditional hash-based joins, cannot be used by online query processing systems because of the system's need to report immediate results. We have implemented a family of non-blocking join algorithms, called Ripple Joins. Simply put, these algorithms take a set of previously unseen tuples from one relation and join them with all the previously seen tuples of another, alternating between the relations for sources of new records.
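
The alternation described above can be sketched in a few lines. This is a simplified in-memory illustration of the idea, not the server implementation, which must also manage memory, disk, and sampling-based estimation.

```python
def ripple_join(left, right, pred):
    """Non-blocking ripple join sketch: alternately pull one new tuple
    from each input, join it against all tuples already seen from the
    other side, and yield matches as soon as they are found."""
    seen_l, seen_r = [], []
    li, ri = iter(left), iter(right)
    l_done = r_done = False
    turn_left = True
    while not (l_done and r_done):
        if turn_left and not l_done:
            try:
                l = next(li)
                seen_l.append(l)
                for r in seen_r:
                    if pred(l, r):
                        yield (l, r)
            except StopIteration:
                l_done = True
        elif not r_done:
            try:
                r = next(ri)
                seen_r.append(r)
                for l in seen_l:
                    if pred(l, r):
                        yield (l, r)
            except StopIteration:
                r_done = True
        turn_left = not turn_left
```

Because results stream out as soon as each new tuple is processed, the operator never blocks the way a traditional hash or sort join does.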

Our implementation of these algorithms gives careful consideration to the end-to-end issues required to deliver useful behavior to users. The various algorithms must be incorporated inside the server alongside their traditional counterparts, and must be made to work together. In addition, interfaces must be added to the system to handle interactive behaviors that are not offered by traditional query execution plans. Our work has resulted in a complete, interactive query processing system that supports a large portion of SQL. We are currently working on parallelizing these algorithms and online query optimization techniques.

[1] J. M. Hellerstein, R. Avnur, and V. Raman, "Informix under CONTROL: Online Query Processing," Data Mining and Knowledge Discovery J. (submitted for publication).

More information: http://control.cs.berkeley.edu or send mail to the author: ronathan@cs.berkeley.edu

Online Dynamic Reordering for Interactive Data Processing

Vijayshankar Raman and Bhaskaran Raman
(Professor Joseph M. Hellerstein)
Informix, MICRO, (NSF) CDA-94-01156, and (NSF) IIS-98-02051

We have designed and implemented a pipelining, dynamically controllable reorder operator for use in data-intensive applications. Reordering data on the fly is useful in several interactive contexts such as online aggregation and scalable spreadsheets: it allows the user to control the processing of data by dynamically specifying preferences for different data items, so that data of interest is prioritized for early processing. Surprisingly, it is also useful for improving the performance of traditional batch query processing in certain instances, because it can provide a rough version of "interesting orders" without introducing blocks into a data pipeline.

We have developed an efficient, non-blocking algorithm for reordering, which can be used over arbitrary data streams from files, indexes, and continuous data feeds. We have also formulated a general model for reordering based on the performance goals of various typical applications. We have implemented this algorithm in three different contexts: online aggregation in the Informix Dynamic Server with Universal Data Option, traditional query processing in Informix, and sorting and scrolling in a large-scale spreadsheet. Preliminary experiments demonstrate that for a variety of data distributions and applications, reordering is responsive to dynamic preference changes, imposes minimal overheads in overall completion time, and provides dramatic improvements in the quality of the feedback over time.
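
The core of such a reorder operator can be sketched as a prioritized buffer. This is an illustrative simplification under my own naming, not the implemented operator, which must additionally spool to disk when buffers fill and remain non-blocking under skewed inputs.

```python
class Reorder:
    """Pipelining reorder sketch: buffer incoming items by group and
    deliver next from whichever group currently has the highest
    user-assigned priority. Priorities may change at any time."""
    def __init__(self, group_of):
        self.group_of = group_of  # maps an item to its group key
        self.priority = {}        # group -> current user priority
        self.buffers = {}         # group -> pending items

    def set_priority(self, group, p):
        self.priority[group] = p

    def put(self, item):
        self.buffers.setdefault(self.group_of(item), []).append(item)

    def get(self):
        """Pop an item from the highest-priority non-empty group."""
        best = max((g for g in self.buffers if self.buffers[g]),
                   key=lambda g: self.priority.get(g, 0), default=None)
        if best is None:
            return None
        return self.buffers[best].pop(0)
```

When the user raises a group's priority mid-run, subsequent deliveries immediately favor that group, which is the "dynamically controllable" behavior described above.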

[1] J. M. Hellerstein, P. J. Haas, and H. J. Wang, "Online Aggregation," Proc. ACM-SIGMOD Int. Conf. Management of Data, Tucson, AZ, May 1997.

More information: http://www.cs.berkeley.edu/~rshankar/online or send mail to the author: rshankar@cs.berkeley.edu

Algorithms for Index-assisted Selectivity Estimation

Paul M. Aoki
(Professor Michael R. Stonebraker)
(NASA) FD-NAG5-6587, (NASA) FD-NAGW-5198, and (NSF) IRI-94-00773

High-performance relational database management systems rely heavily on statistical performance prediction models. Some of these models (e.g., CPU and I/O cost estimates) can be reused in extensible, or object-relational, database management systems. Those having to do with specific data types (i.e., the query selectivity estimation algorithms) cannot be easily reused. This is because the standard mechanisms for query selectivity estimation used in relational database systems rely on properties specific to SQL data types. For example, mechanisms based on histograms and quantile values rely on the ability to define a well-ordering over the attribute values. The query optimizer in an object-relational database system will, in general, be unable to exploit such mechanisms for user-defined types, requiring the database extender to create entirely new estimation mechanisms. Worse, writing the selectivity estimation routines is extremely difficult because the software interfaces provided by vendors are relatively low-level.

My approach in this work is to use extensions [1] of the generalized search tree, or GiST, to support user-defined selectivity estimators in a variety of ways. The problem of estimating query selectivity has many dualities with the problem of building good index structures. The approach leverages the work of access method implementors: by making searches cheap, they also make selectivity estimation cheap. I describe algorithms for this problem and present promising preliminary results in [2].
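The duality can be illustrated with a toy one-dimensional tree: the same bounding predicates that guide a search can also apportion entry counts to a query range. This is a hypothetical sketch of the general idea, not the algorithms of [2]; it assumes interval bounding predicates and uniformly distributed entries within each leaf.

```python
class Node:
    """A tree node with a bounding key interval [lo, hi); interior
    nodes carry children, leaves carry an entry count."""
    def __init__(self, lo, hi, children=None, count=0):
        self.lo, self.hi = lo, hi
        self.children = children or []
        self.count = count

def estimate(node, q_lo, q_hi):
    """Estimate how many entries fall in [q_lo, q_hi) by descending
    only into subtrees whose bounds overlap the query -- the same
    pruning a search would perform."""
    if not node.children:  # leaf: assume entries uniform within bounds
        span = node.hi - node.lo
        overlap = max(0.0, min(node.hi, q_hi) - max(node.lo, q_lo))
        return node.count * (overlap / span if span else 1.0)
    return sum(estimate(c, q_lo, q_hi) for c in node.children
               if c.lo < q_hi and c.hi > q_lo)
```

Descent can be cut off at any level and the uniformity assumption applied there, trading estimation cost against accuracy, which is why cheap searches make cheap estimates.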

[1] P. M. Aoki, "Generalizing 'Search' in Generalized Search Trees," Proc. Int. Conf. Data Engineering, Orlando, FL, February 1998.
[2] P. M. Aoki, Algorithms for Index-assisted Selectivity Estimation, UC Berkeley Computer Science Division, Report No. UCB/CSD 98/1021, October 1998.

More information: http://www.cs.berkeley.edu/~aoki or send mail to the author: aoki@cs.berkeley.edu