Chapter 15: Database Management

The ERL Research Summary for 1998


Database Visualization in DataSplash

Vuk Ercegovac1, Chris Olston2, Mybrid Spalding, and Allison Woodruff
(Professors Alexander Aiken and Michael R. Stonebraker)
(NSF) IRI-94-00773 and (NSF) IRI-94-11334

DataSplash is a database visualization system developed by the Tioga research group. DataSplash has two primary goals: (1) to simplify the visualization process so that naive users can design custom visualizations; and (2) to provide an intuitive browsing model for navigating in visualizations.

To achieve the first goal, we have designed a paint program that is integrated with a database management system, POSTGRES95. This allows the user to program by directly manipulating the visualization. Our paint program contains most of the features of a conventional paint program. Unlike a standard paint program, however, our paint program accesses tables from the database management system. Objects in the visualization can be based on values from these tables [1].
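
The sketch below illustrates the basic idea in Python: rows fetched from a database table are mapped one-to-one onto graphical objects whose visual properties are computed from column values. The table and column names are hypothetical; DataSplash itself binds paint-program objects to POSTGRES95 tables through its GUI.

```python
# Sketch: deriving visualization objects from database rows.
# Hypothetical table "cities(name, longitude, latitude, avg_commute)";
# this only illustrates the data-to-object binding, not DataSplash itself.

from dataclasses import dataclass

@dataclass
class Circle:
    x: float       # screen position derived from longitude
    y: float       # screen position derived from latitude
    radius: float  # visual property derived from a data value
    label: str

def circles_from_rows(rows):
    """Map each data row to one graphical object, the way a
    DataSplash canvas bases object properties on table columns."""
    return [
        Circle(x=lon, y=lat, radius=commute / 10.0, label=name)
        for (name, lon, lat, commute) in rows
    ]

rows = [("Baltimore", -76.6, 39.3, 29.0), ("Berkeley", -122.3, 37.9, 24.0)]
for c in circles_from_rows(rows):
    print(f"{c.label}: circle at ({c.x}, {c.y}), radius {c.radius}")
```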

To achieve the second goal (that of providing an intuitive browsing interface), we have incorporated advanced features in the paint program. For example, objects in a canvas may be designated as visual links to other canvases. Furthermore, we provide a mechanism with which naive users can graphically program how objects in a visualization behave as the user browses [2].

The design and implementation of the first version of DataSplash are complete. A beta version of the software (released into the public domain) can be obtained from the DataSplash web page (http://datasplash.cs.berkeley.edu).


Figure 1: Visualization of average commute time in the United States

Figure 2: A visual link to another canvas

Figure 3: Histogram of commute time in Baltimore

[1]
A. Aiken, J. Chen, M. R. Stonebraker, and A. Woodruff, "Tioga-2: A Direct Manipulation Database Visualization Environment," Proc. Int. Conf. Data Engineering, New Orleans, LA, February 1996.
[2]
A. Woodruff, A. Su, M. R. Stonebraker, C. Paxson, et al., "Navigation and Coordination Primitives for Multidimensional Browsers," Proc. IFIP 2.6 Working Conf. Visual Database Systems, Lausanne, Switzerland, March 1995.
1Undergraduate (EECS)
2Undergraduate (EECS)

More information (http://datasplash.cs.berkeley.edu) or

Send mail to Vuk Ercegovac : (datasplash@cs.berkeley.edu)

Amdb: An Access Method Design and Debugging Tool; Indexing Genomic Sequence Data

Marcel Kornacker
(Professor Joseph M. Hellerstein)
(NASA) FDNAGW-5198, (NSF) FD7-03972, and Schlumberger Foundation Fellowship

Amdb: An Access Method Design and Debugging Tool

The design and tuning of new access methods (AMs) for non-traditional data types and application areas has always been more of a black art than a rigorous discipline. The designer can rely only on intuition to arrive at an effective design; evaluating that design requires tedious instrumentation of complex code, and automating the profiling requires a host of hand-written scripts.

To address these issues, we are developing amdb, a visual AM "debugging" tool to support the AM design and implementation process. It is based on the GiST (Generalized Search Tree) framework for AM construction, which offers the designer an abstracted view of a tree-structured AM and factors out the mechanical aspects of an AM implementation, such as tree traversal, concurrency control, and recovery. Amdb is a visual analysis, debugging, and profiling tool for AMs written as extensions of libgist, a public-domain C++ implementation of stand-alone GiSTs.
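
As a hint of the kind of measurement amdb automates, the following sketch instruments a generic tree search to count the nodes it touches. This is purely illustrative Python, not the libgist C++ interface.

```python
# Sketch of one metric an AM profiler gathers: the number of index
# nodes touched by a search.  Illustration only, not the libgist API.

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # interior: bounding keys; leaf: values
        self.children = children  # None for a leaf

def search(node, matches, stats):
    """Generic tree-AM search that records how many nodes it visits."""
    stats["nodes_visited"] += 1
    if node.children is None:
        return [k for k in node.keys if matches(k)]
    results = []
    for key, child in zip(node.keys, node.children):
        if matches(key):          # descend only into consistent subtrees
            results += search(child, matches, stats)
    return results

leaf1 = Node([1, 3, 5])
leaf2 = Node([7, 9, 11])
root = Node([(1, 5), (7, 11)], [leaf1, leaf2])

# One predicate that understands leaf values (ints) and interior
# bounding intervals (tuples): here, "find the value 9".
stats = {"nodes_visited": 0}
hits = search(root, lambda k: (k == 9 if isinstance(k, int)
                               else k[0] <= 9 <= k[1]), stats)
print(hits, stats)   # [9] {'nodes_visited': 2}
```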

Indexing Genomic Sequence Data

The field of molecular genetics has received increasing attention in recent years and is widely recognized as one of the key technologies of this century. One of its main focal points is the accumulation of vast genomic sequence databases, containing sequences composed of nucleotides or amino acids, and the development of computer-oriented techniques for managing and searching these databases. As sequencing techniques have improved and sequence databases have grown exponentially, the importance of fast sequence search techniques has increased. Nevertheless, the existing search techniques, although highly refined and very effective, still require a scan of the entire database; it is unlikely that they will be able to handle the expected future volume of data. What is needed is a way to index genomic sequence data, much as text and numerical data can be indexed by today's database management systems. We are working on such an access method, which will offer logarithmic search time and therefore improve on the linear search time of the existing techniques.
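
The access method itself is still under development, so the following Python sketch is only a hypothetical illustration of the contrast between logarithmic index probes and a full scan: a sorted table of fixed-length subsequences (k-mers) can be probed with binary search.

```python
# Illustration only: the summary does not specify the access method,
# so this sketch merely contrasts O(log n) index probes against an
# O(n) scan, using a sorted table of fixed-length k-mers.

import bisect

def build_kmer_index(sequence, k):
    """Sorted list of (k-mer, position) pairs -- binary-searchable."""
    return sorted((sequence[i:i + k], i) for i in range(len(sequence) - k + 1))

def lookup(index, kmer):
    """Find all positions of a k-mer in O(log n + hits) time."""
    lo = bisect.bisect_left(index, (kmer, -1))
    hits = []
    while lo < len(index) and index[lo][0] == kmer:
        hits.append(index[lo][1])
        lo += 1
    return hits

genome = "ACGTACGTTACG"
idx = build_kmer_index(genome, 4)
print(lookup(idx, "ACGT"))   # positions 0 and 4
```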


More information (http://www.cs.berkeley.edu/~marcel) or

Send mail to Marcel Kornacker : (marcel@cs.berkeley.edu)

GiST: Generalized Search Trees for Indexing Complex Workloads

Paul Aoki, Marcel Kornacker, Mehul Shah, and Megan Thomas
(Professor Joseph M. Hellerstein)
(NASA) 1996-MTPE-00099 and (NSF) IRI-97-03972

The GiST project builds upon a data structure we developed called the Generalized Search Tree (GiST) [1], an indexing framework that subsumes most of the prior indexing research in database and geographic information systems. In a single data structure, the GiST provides all the basic search tree logic required by a database system, thereby unifying disparate structures such as B+-trees, R-trees, and RD-trees in a single piece of code. We have implemented the GiST as a standalone C++ library, available on the web, which runs over UNIX and NT files or can be patched into the SHORE object-oriented database storage manager.
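
The extension interface is small: in the terms of [1], an extension supplies a handful of key methods (Consistent, Union, Penalty, PickSplit, plus key compression), and the framework supplies the rest. The conceptual Python sketch below (libgist itself is C++, and these names are illustrative) shows how one generic search routine, parameterized only by a key class's consistency test, yields B+-tree-like behavior for interval keys; substituting bounding-box keys would yield R-tree-like behavior with no change to the search code.

```python
# Conceptual sketch of the GiST idea: a single generic search routine,
# specialized only by a key class's consistent() method.

class Interval:
    """B+-tree-style key: a 1-D range."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def consistent(self, query):
        return self.lo <= query <= self.hi   # might the subtree hold `query`?

class Node:
    def __init__(self, entries):
        self.entries = entries   # list of (key, child-node-or-value)

def gist_search(node, query):
    """Generic GiST search: descend every subtree whose key is
    consistent with the query, whatever the key type is."""
    for key, item in node.entries:
        if key.consistent(query):
            if isinstance(item, Node):
                yield from gist_search(item, query)
            else:
                yield item

leaves = Node([(Interval(1, 1), "a"), (Interval(4, 4), "b")])
root = Node([(Interval(1, 4), leaves)])
print(list(gist_search(root, 4)))   # ['b']
```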

Perhaps more significant than its unification of existing structures is the GiST's ability to allow any dataset to be indexed in a manner supporting arbitrary queries; the power of the structure is essentially unlimited. The GiST provides no guarantees, however, on the efficiency with which queries are answered. Without very careful tuning for each particular workload, the GiST is likely to perform even worse than scanning through all of the data; with careful tuning, it can be orders of magnitude faster. In essence, the GiST provides power, but only crude controls for exploiting that power. In our current work, we are pursuing two interrelated directions toward harnessing this power: a theoretical analysis of the efficiency of indexing schemes [3], and the construction of tools that help access method designers analyze and tune their GiST extensions.

We expect these two research directions to work in a synergistic fashion, where the theoretical results will guide the development of tools that are indeed useful, and the constructive tools we develop will be able to provide intuition for the theoretical analysis.

Since the GiST framework is really an abstraction of many earlier ideas, our results in this effort should be of interest to users of a broad class of indexing techniques, not merely those who choose to use GiST software. We intend to apply our results within the GiST framework to a variety of indexing challenges--currently we are focusing on sequence data (e.g., for bioinformatics applications), point data (for geographic and higher-dimensional applications), and image data (e.g., for earth science applications using satellite imagery).

[1]
J. M. Hellerstein, J. F. Naughton, and A. Pfeffer, "Generalized Search Trees for Database Systems," Proc. Int. Conf. Very Large Data Bases, Zurich, Switzerland, September 1995.
[2]
M. Kornacker, C. Mohan, and J. M. Hellerstein, "Concurrency and Recovery in Generalized Search Trees," Proc. ACM-SIGMOD Int. Conf. Management of Data, Tucson, AZ, May 1997.
[3]
J. M. Hellerstein, E. Koutsoupias, and C. H. Papadimitriou, "Towards an Analysis of Indexing Schemes," ACM SIGACT-SIGMOD-SIGART Symp. Principles of Database Systems, Tucson, AZ, May 1997.
[4]
P. M. Aoki, "Generalizing 'Search' in Generalized Search Trees," Proc. Int. Conf. Data Engineering, Orlando, FL, February 1998 (to appear).

More information (http://gist.cs.berkeley.edu) or

Send mail to Paul Aoki : (jmh@cs.berkeley.edu)

Online Permutation: Process the Interesting Data First!

Vijayshankar Raman and Bhaskaran Raman
(Professor Joseph M. Hellerstein)

Many time-consuming data processing applications are more usable when they are interactive. Rather than making the user wait a long time for a final result, the system should allow him to guide the processing based on intermediate, approximate estimates. In this project we aim to give the user control over the rates at which different data items are processed, based on his interest. We permute the data before processing, so that interesting items cross the barrier first (Figure 1). This barrier could be complex processing or a slow network.

An immediate application of our work is in "Online Aggregation" [1], where the barrier is typically a time-consuming relational operation such as a multi-table join. Here, the input is divisible into groups, and "interestingness" is defined at the granularity of a group over which an aggregate is being computed. The user indicates his interest in a group by specifying a numerical preference, which determines the rate at which items for that group are processed. Thus, results for the more interesting groups are updated faster.
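
A minimal sketch of this rate control follows, assuming a simple lottery-style scheduler; the group names and weights are hypothetical, and the sketch omits the disk-based buffering of our actual prototype.

```python
# Sketch of preference-weighted delivery: items from each group are
# emitted at a rate proportional to the user's numerical preference,
# so interesting groups cross the "barrier" first.

import random

def permute_by_preference(groups, prefs, seed=0):
    """Yield items, choosing among the non-empty groups with
    probability proportional to each group's preference weight."""
    rng = random.Random(seed)
    queues = {g: list(items) for g, items in groups.items()}
    while any(queues.values()):
        live = [g for g in queues if queues[g]]
        weights = [prefs[g] for g in live]
        g = rng.choices(live, weights=weights)[0]
        yield g, queues[g].pop(0)

groups = {"CA": list(range(6)), "NY": list(range(6))}
prefs = {"CA": 3.0, "NY": 1.0}   # user is ~3x more interested in CA
for item in permute_by_preference(groups, prefs):
    print(item)                   # CA items tend to arrive first
```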

We have built a prototype "online permuter" using a two-phase approach in which the data items are quickly streamed through a buffer onto an auxiliary disk. We have integrated our prototype with an instructional DBMS. Preliminary results from testing with TPC-D queries show that our technique is successful in satisfying user preferences under a wide range of input data distributions. We intend to build online permutation into a commercial DBMS for a better understanding of the issues involved.

We are currently investigating the extension of our approach to handle GUI widgets such as listboxes over large data sets. Here the barrier is a slow network between the user and the data source, and we want to send the user only the items he is interested in. We are also looking at techniques for randomly sampling the input data while it is being permuted, so that good statistical estimates can be made.


Figure 1: Online permutation

[1]
J. M. Hellerstein, P. J. Haas, and H. J. Wang, "Online Aggregation," Proc. ACM-SIGMOD Int. Conf. Management of Data, Tucson, AZ, May 1997.
[2]
J. M. Hellerstein, "Online Processing Redux," IEEE Data Engineering Bulletin, September 1997.
[3]
"Online Permutation," http://www.cs.berkeley.edu/~rshankar/online.

More information (http://www.cs.berkeley.edu/~rshankar/online) or

Send mail to Vijayshankar Raman : (rshankar@cs.berkeley.edu)

CONTROL for Data-Intensive Processing

Vijayshankar Raman, Bruce Lo1, Chris Olston2, Kirk Wylie3, and Ron Avnur4
(Professor Joseph M. Hellerstein)

Despite significant advances in data processing over the years, many data-intensive applications still run in batch mode: they require a long period of time to execute, and during that time they provide neither feedback nor control to the user. This state of affairs holds in a number of important domains, from the desktop (e.g., spreadsheets) to the back office (e.g., decision support) to new high-end applications (e.g., data mining and visualization tools). In each of these domains, batch processing is frustrating or even unacceptable for serious users. In order to investigate large data sets, users require on-line behavior: they need to get meaningful feedback while the system runs, and they need to be able to act on the feedback by controlling the system while it runs.

We are addressing these problems with CONTROL: Continuous Output and Navigation Technology with Refinement On-Line. We are developing a set of core technologies for building applications that provide CONTROL behavior, and deploying those technologies in a variety of applications including database query processing, data mining, user-interface widgets, and data visualization.
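
For example, in online aggregation [1] an aggregate such as an average is re-estimated continuously, together with a confidence interval that shrinks as tuples stream in. The sketch below shows the statistical core of such an estimator under an assumed random delivery order; the data are synthetic.

```python
# Sketch of online-aggregation-style feedback: a running average plus
# a shrinking ~95% confidence interval, updated as tuples stream in
# (assumes the tuples arrive in random order; data are synthetic).

import math, random

def online_avg(stream, z=1.96, report_every=1000):
    """Report (n, estimate, CI half-width) as data arrives."""
    n = 0
    mean = m2 = 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)          # Welford's running variance
        if n % report_every == 0:
            half = z * math.sqrt(m2 / (n - 1) / n)
            yield n, mean, half

rng = random.Random(42)
grades = (rng.gauss(3.0, 0.5) for _ in range(10000))
for n, est, half in online_avg(grades):
    print(f"after {n} tuples: avg = {est:.3f} +/- {half:.3f}")
```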


Figure 1: A CONTROL interface for a relational aggregation query to find the average grade of students at a university, grouped by college.

[1]
J. M. Hellerstein, P. J. Haas, and H. J. Wang, "Online Aggregation," Proc. ACM-SIGMOD Int. Conf. Management of Data, Tucson, AZ, May 1997.
[2]
J. M. Hellerstein, "Online Processing Redux," IEEE Data Engineering Bulletin, September 1997.
1Undergraduate (EECS)
2Undergraduate (EECS)
3Undergraduate (EECS)
4Staff

More information (http://control.cs.berkeley.edu) or

Send mail to Vijayshankar Raman : (jmh@cs.berkeley.edu)

Research on Database System Performance

Windsor Hsu
(Professor Alan J. Smith)
Fujitsu Microelectronics, IBM Corporation, MICRO, Sony Research Laboratories, Sun Microsystems, and Toshiba Corporation

Processor performance has been increasing at a phenomenal rate of more than 50% per year. While I/O capacity has been improving by 60% per year, I/O performance, limited by mechanical delays, has been improving by less than 5% per year. A widely used technique for bridging this widening performance gap is to cache in electronic storage the data that is likely to be needed soon. The capacity, cost, and performance of electronic storage follow yet another set of improvement curves. All of these different rates of improvement have strong implications for the design of balanced computer systems. In this research, we consider these implications and explore ways of improving I/O performance, with special emphasis on database systems. Our methodology is trace-driven simulation, and an ongoing part of this research is the collection of high-quality I/O traces.
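
A back-of-the-envelope calculation makes the compounding effect of these rates concrete; the snippet below simply compounds the quoted annual improvements.

```python
# Back-of-the-envelope view of the widening gap described above:
# >50%/yr processor improvement compounded against <5%/yr I/O improvement.

for years in (1, 5, 10):
    cpu = 1.50 ** years   # processor performance, >50%/yr
    io = 1.05 ** years    # I/O performance, <5%/yr
    print(f"{years:2d} yr: CPU x{cpu:6.1f}, I/O x{io:4.2f}, gap x{cpu / io:5.1f}")
# After a decade the gap is roughly 35x, which is why caching matters.
```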

In the first phase of this research, we perform a comprehensive characterization of the I/O behavior of modern database systems. In addition, we compare the I/O behavior of widely accepted benchmarks such as TPC-C and TPC-D with that of production database workloads. The second phase will focus on I/O improvements at the application level; specifically, we are interested in managing huge database buffer pools and in using application knowledge to perform prefetching. In the third phase, we will examine I/O improvements at the storage-system and device level, including the management of cache and non-volatile memory. Since commodity hardware is becoming increasingly prevalent, we will also consider how techniques that improve I/O performance for database systems affect workloads that lack aggressive application-level caching.


More information (http://www.cs.berkeley.edu/~windsorh/) or

Send mail to Windsor Hsu : (windsorh@cs.berkeley.edu)

The Mariposa Agoric Distributed Database Management System

Jeff Sidell
(Professor Michael R. Stonebraker)
Hitachi, Microsoft, and NASA

The Mariposa agoric distributed database management system (D-DBMS) was designed to address the shortcomings of traditional D-DBMSs. First among these is scalability. Traditional D-DBMSs, which rely on an exhaustive, cost-based query optimizer, cannot scale to a large number of processing sites. Mariposa takes a different approach to distributed DBMS design. Mariposa is an example of an agoric system. Agoric systems cast distributed systems problems in terms of economics: programs and servers become buyers and sellers of computational resources such as memory, disk I/O, CPU time, and network bandwidth, as well as data. In agoric systems, each buyer and seller is autonomous; there is no central control, as in traditional DBMSs. This approach is inherently scalable and also provides a framework in which many of the other shortcomings of traditional D-DBMSs can be addressed simply. In Mariposa, sites compete against one another by bidding for work, and the buyer chooses the lowest bidders to perform the work. By having bidders adjust their prices, this approach addresses the following shortcomings (a minimal sketch of the bidding loop appears after the list).

Load imbalance: A traditional D-DBMS processes every query as if it were the only one in the system, ignoring current conditions. This can result in one server being overloaded with work while others remain relatively idle. In Mariposa, sites simply raise their prices as they become busier, so more work goes to lower-priced, less busy bidders.

Resource constraints: Traditional D-DBMSs ignore current availability of resources such as disk space and memory. Mariposa addresses this problem by allowing each site to ascertain resource availability before bidding on work.

Heterogeneous environments: Traditional D-DBMSs have assumed that all processors are identical, as is the network connecting them. In Mariposa, a fast processor can bid differently than a slower one. Furthermore, a Mariposa processing site can take network bandwidth availability into account when formulating its bid.

User cost and time constraints: Traditional D-DBMSs have ignored the fact that some users need an answer quickly while others are willing to have theirs run overnight.
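
The following sketch distills the bidding loop described above; the pricing function and numbers are hypothetical. Each site prices a unit of work as a function of its current load, and the buyer hands the work to the lowest bidder, which in turn becomes busier and raises its next bid.

```python
# Sketch of the agoric idea: sites price work by current load, and
# the buyer takes the lowest bid.  Pricing and numbers are made up.

def bid(base_price, load):
    """A busy site raises its price; an idle one bids low."""
    return base_price * (1.0 + load)

def choose_seller(sites):
    """The buyer picks the cheapest bidder and hands it the work."""
    name, info = min(sites.items(),
                     key=lambda s: bid(s[1]["base"], s[1]["load"]))
    info["load"] += 0.5    # winning the work makes the site busier
    return name

sites = {
    "fast-idle": {"base": 1.0, "load": 0.0},
    "fast-busy": {"base": 1.0, "load": 0.9},
    "slow-idle": {"base": 2.0, "load": 0.0},
}
for _ in range(4):
    print("work goes to:", choose_seller(sites))
# Work starts at the idle fast site, then shifts as its bids rise.
```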

We have implemented Mariposa. The code and published papers are available from our web site.


More information (http://mariposa.cs.berkeley.edu) or

Send mail to Jeff Sidell : (jsidell@eecs.berkeley.edu)

Selectivity Estimation for Object-Relational Database Management Systems

Paul M. Aoki
(Professor Michael R. Stonebraker)
(ARO) FD-DAAH04-94-G-0223 and (NSF) IRI-94-00773

High-performance relational database management systems rely heavily on statistical performance prediction models. Some of these models (e.g., CPU and I/O cost estimates) can be reused in extensible, or object-relational, database management systems [1]. Those having to do with specific data types (i.e., query selectivity estimation algorithms) cannot be easily reused.

My work uses an approach based on the generalized search tree, or GiST [2]. The problem of estimating query selectivity turns out to have many dualities with the problem of building good index structures (which is basically the clustering problem). The approach therefore leverages the work of access method implementors: by making searches cheap, they also make selectivity estimation cheap.

In the course of my research on extensible selectivity estimation, I have developed a software framework that permits extenders to create their own tree traversal algorithms [3]. This framework supports many disparate types of traversals proposed in the database literature, such as nearest-neighbor search and acceptance/rejection sampling.
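
The sketch below illustrates one traversal-based sampling strategy in this spirit: random root-to-leaf walks with fanout weighting give an unbiased estimate of a predicate's selectivity without scanning the data. The tree layout and predicate are hypothetical, not the framework's actual interface.

```python
# Sketch: estimating predicate selectivity with random root-to-leaf
# walks over a search tree.  Hypothetical tree and predicate.

import random

class Node:
    def __init__(self, children=None, values=None):
        self.children, self.values = children, values

def sample_estimate(root, total, pred, trials=2000, seed=1):
    """Unbiased selectivity estimate: walk down choosing children
    uniformly, weighting each sampled value by the product of fanouts."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        node, weight = root, 1.0
        while node.children is not None:
            weight *= len(node.children)
            node = rng.choice(node.children)
        weight *= len(node.values)
        if pred(rng.choice(node.values)):
            acc += weight
    return acc / trials / total     # estimated fraction of matches

leaves = [Node(values=list(range(i, i + 5))) for i in range(0, 40, 5)]
root = Node(children=[Node(children=leaves[:4]), Node(children=leaves[4:])])
est = sample_estimate(root, total=40, pred=lambda v: v < 10)
print(f"estimated selectivity ~ {est:.2f} (true 0.25)")
```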

[1]
M. Stonebraker and D. Moore, Object-Relational DBMSs: The Next Great Wave, Morgan Kaufmann, 1996.
[2]
J. M. Hellerstein, J. F. Naughton, and A. Pfeffer, "Generalized Search Trees for Database Systems," Proc. Int. Conf. Very Large Data Bases, Zürich, Switzerland, September 1995.
[3]
P. M. Aoki, "Generalizing 'Search' in Generalized Search Trees," Proc. Int. Conf. Data Engineering, Orlando, FL, February 1998. [PDF]

More information (http://www.cs.berkeley.edu/~aoki) or

Send mail to Paul M. Aoki : (aoki@cs.berkeley.edu)