Chapter 16. Database Management

Introduction

The INGRES project was initiated in 1973 to design and implement a full-function relational database management system. When the INGRES system was first released in 1976, it contained over 100,000 lines of code in the C programming language and ran on PDP-11s under the UNIX operating system. INGRES, along with several other research prototypes, proved that the relational model was a suitable basis for an efficient, easy-to-use database management system. The system was distributed to over 150 sites around the world and provided the foundation for several commercial products.

In the late 1970s, the INGRES project shifted its focus to distributed database systems and application-development environments. A distributed database management system allows users to query and update data stored on different host computers in a computer network. The prototype distributed INGRES system, running on two VAX machines, was demonstrated in 1982. During this period, a group also worked on the problem of connecting heterogeneous database systems (e.g., hierarchical, network, and relational) that might be geographically dispersed. Also demonstrated in 1982 was a prototype forms-based programming environment called the Forms Application Development System (FADS) that allowed end users to code their applications by filling in forms displayed on a terminal screen. In the 1980s, all of this research was transferred to commercial products. Distributed databases, gateways to heterogeneous database systems, and forms-based application development tools are now available from many vendors.

In the mid-1980s, the focus of the INGRES project again shifted to a new topic: database support for complex, dynamic data. Early database systems, particularly those based on the relational model, could solve many of the problems facing data-processing organizations that dealt with rigidly structured, regular data (e.g., business data). However, they did not solve the problems posed by more dynamic data, such as documents, geographical data, CAD/CAM data, and programs.

In 1985, the INGRES project embarked on the design and implementation of a new database management system called POSTGRES (for "after INGRES") and a new application development environment called PICASSO to provide support for these applications. In 1987 the focus of POSTGRES was expanded to include effective execution on a shared-memory multiprocessor. The resulting system, XPRS (eXtended POSTGRES on RAID and Sprite), attempts to provide both high performance and high availability. The major design objectives for these systems and for other research on database management systems are described in the research summaries below.

Mariposa: An Economic Paradigm for Query Processing and Data Migration

Paul M. Aoki, Marcel Kornacker, Avi Pfeffer, Adam Sah, Jeff Sidell, and Andrew Yu

(Professor M. R. Stonebraker)

(ARO) DAAL03-91-G-0183, (DARPA) DABT63-92-C-0007, and (NSF) IRI-9107455

Mariposa is a distributed database system designed to provide high performance in an environment of high data mobility, over thousands of autonomous sites, and on memory hierarchies with very large capacity. The complexity of scheduling distributed actions in such a large system stems from the combinatorially large number of possible choices for each action, the expense of global synchronization, and the dynamically changing environment.

To deal with the complexity of these issues, we have elected to reformulate all issues relating to shared resources (query optimization and processing, storage management, and naming services) into a microeconomic framework. The advantages of this approach over traditional solutions are: (1) the decision process is inherently decentralized, which is a prerequisite for achieving scalability; (2) prices in a market system fluctuate in accordance with the demand for and supply of resources, allowing the system to adapt dynamically to resource contention; and (3) queries, storage managers, and name servers can be integrated into a single market-based economy, simplifying resource management algorithms. The market mechanism should also lead to an efficient allocation of the available resources.
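To make the bidding formulation concrete, the following sketch shows how a broker might choose among per-site bids for a query fragment, taking the cheapest bid that meets a client-specified delay bound. This is a minimal illustration only: the struct fields, the sample prices, and the select_winner() policy are assumptions, not Mariposa's actual bidding protocol.

    /*
     * Hypothetical bid selection for one query fragment.  A broker collects
     * bids (price, promised delay) from candidate sites and picks the
     * cheapest bid that meets the client's delay bound.
     */
    #include <stddef.h>
    #include <stdio.h>

    struct bid {
        const char *site;   /* bidding site                    */
        double      price;  /* price quoted to run the query   */
        double      delay;  /* promised completion time (sec)  */
    };

    /* Return the cheapest bid within the delay bound, or NULL if none. */
    static const struct bid *select_winner(const struct bid *bids, size_t n,
                                           double max_delay)
    {
        const struct bid *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (bids[i].delay > max_delay)
                continue;                          /* misses the deadline */
            if (best == NULL || bids[i].price < best->price)
                best = &bids[i];
        }
        return best;
    }

    int main(void)
    {
        struct bid bids[] = {
            { "site-a", 12.0, 3.0 },
            { "site-b",  7.5, 9.0 },   /* cheapest, but too slow */
            { "site-c",  9.0, 4.5 },
        };
        const struct bid *w = select_winner(bids, 3, 5.0);
        if (w != NULL)
            printf("winner: %s at price %.2f\n", w->site, w->price);
        return 0;
    }

As noted above, prices in such a market fluctuate with the demand for and supply of resources, so repeated auctions of this kind steer work away from contended sites without any global coordination.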

We expect to have a functioning initial system by the end of 1994. Further details can be found in [1].

[1] M. Stonebraker, R. Devine, M. Kornacker, W. Litwin, A. Pfeffer, A. Sah, and C. Staelin, "An Economic Paradigm for Query Processing and Data Migration in Mariposa," Proc. Int. Conf. Parallel and Distributed Information Systems, Austin, TX, September 1994, pp. 58-67.

Query Processing and Caching in Tertiary Memory Databases

Sunita Sarawagi

(Professor M. R. Stonebraker)

(ARO) DAAL03-91-G-0183, (ARPA) DABT63-92-C-0007, Digital Equipment Corporation, and (NSF) IRI-9107455

With the rapid increase in the number of applications that require access to large amounts of data, the design of efficient tertiary memory databases is emerging as a challenging problem. The characteristics of tertiary memory devices are very different from those of conventional secondary storage devices. The worst-case access time of a magnetic disk is only a factor of four larger than its best-case access time, whereas some tape-oriented devices exhibit three orders of magnitude of variation. This makes it crucial to carefully optimize the order in which data blocks are accessed on these devices and to rethink design decisions made in conventional database systems. Tertiary memory devices differ widely not only from typical secondary memory devices, but also among themselves: in some devices the seek time is the dominant cost, while in others the transfer time or the media switch time dominates. We therefore parameterize our optimization decisions by the characteristics of the storage device.
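The following sketch makes this device parameterization concrete: a small per-device cost model in which the media switch dominates access time on a tape jukebox, while the transfer dominates on a magnetic disk. The field names and the sample timing values are illustrative assumptions, not measurements from this work.

    /* Illustrative per-device access-cost parameters (the values below are
     * assumptions for illustration, not measurements). */
    #include <stdio.h>

    struct device_cost {
        const char *name;
        double switch_s;    /* media (platter/tape) switch time, seconds */
        double seek_s;      /* positioning/seek time, seconds            */
        double xfer_mb_s;   /* sustained transfer rate, MB/s             */
    };

    /* Estimated time to read one block of the given size, depending on
     * whether the access requires a media switch and/or a seek. */
    static double access_time(const struct device_cost *d, double block_mb,
                              int needs_switch, int needs_seek)
    {
        double t = block_mb / d->xfer_mb_s;
        if (needs_seek)
            t += d->seek_s;
        if (needs_switch)
            t += d->switch_s;
        return t;
    }

    int main(void)
    {
        struct device_cost tape = { "tape-jukebox",  60.0, 20.0, 1.0 };
        struct device_cost disk = { "magnetic-disk",  0.0,  0.01, 5.0 };

        /* A cold read that forces a tape switch costs minutes; the same
         * read on disk costs a fraction of a second. */
        printf("%s: %.1f s\n", tape.name, access_time(&tape, 1.0, 1, 1));
        printf("%s: %.2f s\n", disk.name, access_time(&disk, 1.0, 0, 1));
        return 0;
    }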

We studied methods of scheduling queries, caching, and controlling the order of data retrieval for efficient operation in a tertiary memory environment. We showed how careful interspersing of queries and informed cache management can achieve substantial reductions in access time compared to conventional methods. The main techniques we used for optimizing accesses to tertiary memory are:

(1) Divide a relation into fragments based on how it is laid out on tertiary memory. By performing query execution and data movement in units of fragments, we eliminate the small random I/Os that are disastrous on high-latency tertiary memory devices; a sketch of such fragment-level request ordering appears after this list.

(2) Design policies for fetching fragments from tertiary memory into the disk cache and evicting them again. To design these policies, we analyzed the performance bottlenecks of different tertiary devices through simulation and arrived at policies that perform well over a wide range of tertiary memory types and workload parameters.
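As an illustration of technique (1), the sketch below orders pending fragment reads by the tertiary medium that holds them, so that each platter or tape is mounted at most once per batch and fragments on the same medium are read in storage order. The data structures and the ordering rule are assumptions made for illustration; they are not the exact scheduling algorithm developed in this work.

    /* Hypothetical fragment-level request ordering: pending fragment reads
     * are grouped by the tertiary medium (platter or tape) that holds them
     * and read in storage order within each medium, so each medium is
     * mounted at most once per batch. */
    #include <stdio.h>
    #include <stdlib.h>

    struct frag_req {
        int  medium;    /* id of the platter/tape holding the fragment */
        long offset;    /* position of the fragment on that medium     */
        int  frag_id;   /* fragment being requested                    */
    };

    /* Order by medium first, then by position within the medium. */
    static int cmp_req(const void *a, const void *b)
    {
        const struct frag_req *x = a, *y = b;
        if (x->medium != y->medium)
            return x->medium - y->medium;
        return (x->offset > y->offset) - (x->offset < y->offset);
    }

    int main(void)
    {
        struct frag_req reqs[] = {
            { 2, 500, 7 }, { 1, 100, 3 }, { 2, 120, 9 }, { 1, 900, 4 },
        };
        size_t n = sizeof reqs / sizeof reqs[0];

        qsort(reqs, n, sizeof reqs[0], cmp_req);   /* one mount per medium */

        for (size_t i = 0; i < n; i++)
            printf("read fragment %d (medium %d, offset %ld)\n",
                   reqs[i].frag_id, reqs[i].medium, reqs[i].offset);
        return 0;
    }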

We are in the process of extending the POSTGRES database system to handle the new cache management and query optimization strategies. Two tertiary memory devices, a Sony optical jukebox and an HP magneto-optical jukebox, have already been attached to POSTGRES.

Buffer Management for Tertiary Storage Devices

Andrew Yu

(Professor M. R. Stonebraker)

UC Regents Fellowship

As more and more scientific and commercial DBMS applications have massive storage demands, large-capacity storage systems are gradually becoming an active part of the storage hierarchy. To handle these massive amounts of data, DBMSs must explicitly manage databases stored on tertiary memory.

In a typical storage hierarchy, higher-level storage acts as a cache for lower-level storage. However, the size of the magnetic disk cache for a tertiary storage device is often too small for effective caching. On the other hand, computation time is insignificant compared to the access time, so more complex fetching and replacement algorithms can be used.

Initial results of trace-driven simulation using the Sequoia 2000 benchmark indicate that replacement algorithms based on a time history of references, as well as algorithms that take tape switches into account, are much more effective than traditional methods (e.g., least recently used).
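A minimal sketch of such a policy is given below. It assumes each cached block records its last two reference times and that the policy knows whether the block's home tape is currently mounted; the scoring formula, the penalty constant, and the field names are hypothetical and are not the algorithms evaluated in the simulations.

    /* Hypothetical replacement scoring that looks beyond pure recency:
     * each cached block keeps its last two reference times, and eviction
     * avoids blocks whose re-fetch would require a tape switch. */
    #include <stddef.h>
    #include <stdio.h>

    struct cached_block {
        long last_ref;       /* time of most recent reference         */
        long prev_ref;       /* time of the reference before that     */
        int  tape_mounted;   /* is the block's home tape mounted now? */
    };

    /* Higher score = better eviction candidate. */
    static double evict_score(const struct cached_block *b, long now)
    {
        double cold = (double)(now - b->last_ref)        /* recency        */
                    + (double)(now - b->prev_ref);       /* longer history */
        double penalty = b->tape_mounted ? 0.0 : 1000.0; /* future switch  */
        return cold - penalty;
    }

    /* Scan the cache and return the index of the best eviction victim. */
    static size_t choose_victim(const struct cached_block *blks, size_t n,
                                long now)
    {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (evict_score(&blks[i], now) > evict_score(&blks[best], now))
                best = i;
        return best;
    }

    int main(void)
    {
        struct cached_block cache[] = {
            { 90, 80, 1 },   /* hot block, tape mounted           */
            { 10,  5, 0 },   /* coldest block, but needs a switch */
            { 40, 20, 1 },   /* lukewarm block, tape mounted      */
        };
        printf("evict block %zu\n", choose_victim(cache, 3, 100));
        return 0;
    }

Under this scoring, a cold block whose tape is unmounted can be retained while a warmer block on a mounted tape is evicted, a distinction that a purely recency-based policy such as LRU cannot make.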

