Most existing query optimizers for federated/distributed database systems perform query optimization statically and assume that the cost models of the underlying systems are known. This approach suffers from two major problems: (1) the cost of a plan depends heavily on runtime conditions such as the loads on the underlying systems and the communication costs; (2) underlying data sources may have ad hoc cost models that cannot be exported to the optimizer. Hence, to do correct cost-based optimization, the optimization must be performed at runtime, and the optimizer must contact the sites involved in an operation to learn its cost. This can involve a wide-area message, so the optimization costs can be considerable. There is a tradeoff between the number of costs requested (and consequently, the number of messages) and the quality of the plan found; optimization algorithms that gather the most information in the fewest messages are the most attractive. We are evaluating the applicability of various well-known optimization techniques, such as dynamic programming, randomized algorithms, two-phase optimization, and parametric optimization, in this scenario.
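The tradeoff between messages and plan quality can be illustrated with a minimal sketch, assuming invented sites and costs: the optimizer holds stale static cost estimates, and each runtime probe of a site's current cost counts as one wide-area message against a budget.

```python
# Hypothetical static estimates (free to consult) and the "true" runtime
# costs that can only be learned by messaging each site. Site A is
# cheap on paper but currently heavily loaded.
STATIC_COST = {"A": 10, "B": 12, "C": 15}
RUNTIME_COST = {"A": 30, "B": 11, "C": 14}

messages_sent = 0

def runtime_cost(site):
    """Fetch a site's current cost -- one wide-area message each time."""
    global messages_sent
    messages_sent += 1
    return RUNTIME_COST[site]

def pick_site(candidates, budget):
    """Probe at most `budget` sites at runtime, cheapest static estimate
    first; unprobed sites keep their (possibly stale) static costs."""
    by_static = sorted(candidates, key=STATIC_COST.get)
    probed = {s: runtime_cost(s) for s in by_static[:budget]}
    costs = {s: probed.get(s, STATIC_COST[s]) for s in candidates}
    return min(costs, key=costs.get)

# With no message budget the optimizer trusts static costs and picks the
# loaded site A; two probes reveal A's load and it picks B instead.
print(pick_site({"A", "B", "C"}, budget=0))  # A
print(pick_site({"A", "B", "C"}, budget=2))  # B
```

A larger budget buys a better-informed plan at the price of more wide-area round trips, which is exactly the tradeoff the candidate optimization algorithms must navigate.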
The field of molecular genetics has received increasing attention in recent years and is widely recognized as one of the key technologies of this century. One of its main focal points is the accumulation of vast genomic sequence databases containing sequences composed of nucleotides or amino acids, and the development of computer-oriented techniques for managing and searching these databases. As sequencing techniques have improved and sequence databases have grown exponentially, the importance of fast sequence search techniques has increased. Nevertheless, existing search techniques such as BLAST, although highly refined and very effective, still require a scan of the entire database; it is unlikely that these techniques will be able to handle the expected future volume of data. What is needed is a way to index genomic sequence data, similar to how text and numerical data can be indexed by today's database management systems. I am working on such an access method with the goal of achieving logarithmic search time, which would be a substantial improvement over the linear search time offered by existing techniques.
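A classic way to see how an index can replace a full scan is a suffix array: binary search over the sorted suffixes of a sequence answers substring queries in a logarithmic number of probes rather than a linear scan. This is only a textbook illustration of the idea, not the access method under development.

```python
def build_suffix_array(seq):
    """Naive construction, fine for illustration; a real index
    would use a faster build and live on disk."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])

def contains(seq, sa, pattern):
    """Lower-bound binary search over the sorted suffixes:
    O(|pattern| * log n) character comparisons instead of a scan."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and seq[sa[lo]:].startswith(pattern)

genome = "ACGTACGGTACG"   # toy nucleotide sequence
sa = build_suffix_array(genome)
print(contains(genome, sa, "GGTA"))  # True
print(contains(genome, sa, "TTTT"))  # False
```

Similarity search as performed by BLAST is far harder than exact substring lookup, but the search-time gap between scanning and indexed lookup is the same in spirit.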
The development process for access methods (AMs) in database systems is complex and tedious. Traditional tools such as programming language debuggers and profilers are cumbersome and uninformative for developing access methods. Amdb is a graphical tool that facilitates the design and tuning process for height-balanced tree-structured AMs. It allows a designer to single-step through search tree operations, and provides detailed performance metrics that flag deficiencies in the structure and design of an AM. Visualizations of the structure and contents of the search tree complement these facilities, allowing a designer to browse and pinpoint deficiencies more naturally. Currently, we are extending Amdb to make it more customizable.
GiST is a tool for the development of tree-structured access methods with a minimum of design and coding effort. We are designing a general, experimentally validated framework for access method optimization. For static optimization, we use performance metrics derived from an analysis of the access method with Amdb. Currently, we are focusing on extending the GiST framework to allow automatic per-node re-optimization in an access method.
Developers of large-scale applications do not have an easy storage solution if neither file systems nor database systems meet their requirements exactly. Database management systems are complex, heavyweight pieces of software that are highly optimized for specific workloads and are used mainly in mission-critical applications requiring high levels of consistency, recovery, and scalability. Most other applications, such as search engines, email servers, web servers, and data mining systems, either develop their own custom data storage alternatives or depend on file systems for their storage needs, implementing any application-specific requirements on top. Such applications can greatly benefit from a unified and customizable storage solution.
The BUDS (Berkeley Universal Data Storage) system is a high-performance, scalable storage manager that exports a flexible and extensible API. Using this API, applications with varying storage requirements can configure the consistency and recovery support provided by the storage system. Applications under development include a transactional record manager, a simple file system, and distributed data structures for large-scale Internet infrastructure services.
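The flavor of such a configurable API can be sketched as follows. All names here are invented for illustration and are not the actual BUDS interface; the point is that each client chooses only the consistency and recovery machinery it needs.

```python
from dataclasses import dataclass

@dataclass
class StoreConfig:
    """Hypothetical per-application knobs: each client enables only
    the consistency/recovery support it actually needs."""
    locking: str = "none"         # "none" | "page" | "record"
    logging: bool = False         # write-ahead logging for recovery
    sync_on_commit: bool = False  # force the log to disk at commit

class Store:
    def __init__(self, config):
        self.config = config
        self.data = {}

    def put(self, key, value):
        if self.config.logging:
            pass  # a real store would append a WAL record here
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

# A file-system-like client skips transactional overhead entirely...
fs_store = Store(StoreConfig())
# ...while a transactional record manager turns everything on.
txn_store = Store(StoreConfig(locking="record", logging=True,
                              sync_on_commit=True))
txn_store.put("emp:42", "Smith")
print(txn_store.get("emp:42"))  # Smith
```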
Interactive responses and natural, intuitive controls are important for data analysis. We are building A-B-C, a scalable spreadsheet for data analysis that combines exploration, grouping, and aggregation. We focus on interactive responses in place of long delays, and intuitive, direct-manipulation operations in place of complex queries. A-B-C allows analysts to interactively explore the data at varying granularities of detail using a spreadsheet. Hypotheses that arise during the exploration can be checked by dynamically computing aggregates on interactively chosen groups of data items.
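The interaction style of grouping and on-the-fly aggregation can be sketched in a few lines; the column names and data below are invented for the example and do not reflect the actual system.

```python
from collections import defaultdict

# Toy dataset standing in for the analyst's spreadsheet rows.
rows = [
    {"region": "west", "product": "widget", "sales": 120},
    {"region": "west", "product": "gadget", "sales": 80},
    {"region": "east", "product": "widget", "sales": 200},
]

def aggregate(rows, group_by, agg_col, agg=sum):
    """Recompute an aggregate over an interactively chosen grouping."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[agg_col])
    return {key: agg(values) for key, values in groups.items()}

# The analyst regroups to move between granularities of detail.
print(aggregate(rows, "region", "sales"))   # {'west': 200, 'east': 200}
print(aggregate(rows, "product", "sales"))  # {'widget': 320, 'gadget': 80}
```

In the real system the challenge is doing this at scale with interactive response times, rather than the aggregation logic itself.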
Real-world data often has discrepancies in structure and content. Traditional methods for "cleaning" the data involve many iterations of time-consuming "data quality" analysis to find discrepancies, and long-running transformations to fix them. This process requires users to endure long waits and often write complex transformation programs.
We have developed an interactive framework for data cleaning that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown immediately on the records visible on screen. Discrepancy detection is done incrementally in the background on the current transformed version of the data, and discrepancies are flagged as they are found. This interactive combination of discrepancy detection and transformation allows users to gradually develop transformations as discrepancies are found, and to clean the data without having to write complex programs or endure long waits. Balancing the goals of power, ease of specification, and interactive application, we are developing a set of transforms that can be used for transformations within data records as well as for higher-order transformations. We are also studying the compilation of sequences of transforms into optimized programs.
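The add/undo transform model described above can be sketched as a pipeline applied only to the records currently on screen; the specific transforms below are invented examples, not the system's actual transform vocabulary.

```python
class Pipeline:
    """Minimal sketch of a gradually built cleaning transformation."""
    def __init__(self):
        self.transforms = []

    def add(self, fn):
        self.transforms.append(fn)

    def undo(self):
        if self.transforms:
            self.transforms.pop()

    def apply(self, visible_records):
        # Only the records visible on screen are transformed, which is
        # what keeps each step's feedback immediate.
        out = visible_records
        for fn in self.transforms:
            out = [fn(r) for r in out]
        return out

p = Pipeline()
p.add(lambda r: r.strip())               # trim whitespace
p.add(lambda r: r.replace("N/A", ""))    # drop a null marker
print(p.apply(["  Alice  ", "N/A Bob"])) # ['Alice', ' Bob']
p.undo()                                 # second transform was wrong; undo it
print(p.apply(["  Alice  ", "N/A Bob"])) # ['Alice', 'N/A Bob']
```

Once the user is satisfied, the accumulated sequence of transforms is exactly what would be compiled into an optimized program and run over the full dataset.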