The collaborative filtering problem, predicting a user's preferences from the preferences and recommendations of a community of users, has typically been addressed in front-end applications. Prototype systems have addressed such problems as selecting Usenet posts [1] or recommending movies [2]; production implementations provide sales support and augment cross-selling tools (e.g., [3]).
Some recent work has applied graphical-model formalisms [4,5] to collaborative filtering, and current results indicate that incorporating temporal aspects of user preference histories can greatly improve predictive accuracy. Bayesian models provide a natural mechanism for integrating explicit user preferences (user profiles) with user histories, yielding a more robust representation of usage patterns. We intend to apply these results to data caching and prefetching for distributed databases, with an eye toward more efficient bandwidth use for wireless devices. We also plan to consider recommender systems for server-side administration tasks (as opposed to the client/server interface), where better usage models might support more efficient virtual data representations.
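To make the prediction task concrete, the sketch below shows classic memory-based collaborative filtering rather than the graphical-model approach described above: a rating is predicted as a correlation-weighted average of other users' deviations from their own mean ratings. The array layout and function names are our own illustrative choices.

    import numpy as np

    def pearson(u, v):
        # Correlation between two users over the items both have rated.
        both = ~np.isnan(u) & ~np.isnan(v)
        if both.sum() < 2:
            return 0.0
        a, b = u[both] - u[both].mean(), v[both] - v[both].mean()
        denom = np.sqrt((a * a).sum() * (b * b).sum())
        return float((a * b).sum() / denom) if denom > 0 else 0.0

    def predict(ratings, user, item):
        # ratings: users x items array with np.nan for unrated items.
        base = np.nanmean(ratings[user])
        num = den = 0.0
        for other in range(ratings.shape[0]):
            r = ratings[other, item]
            if other == user or np.isnan(r):
                continue
            w = pearson(ratings[user], ratings[other])
            num += w * (r - np.nanmean(ratings[other]))
            den += abs(w)
        return base + num / den if den > 0 else base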
The Telegraph project is focused on running database-style queries over wide-area data, particularly non-traditional sources such as websites and streams of sensor data. During Summer '00, the Telegraph group built a system that uses eddies to provide good performance on queries over websites with highly variable latencies. This system, however, doesn't provide any facilities for querying streaming data. There are a number of architectural issues involved in extending the current system to handle such queries, and this will be the focus of my research for the next year.
Current database research has focused on queries over streams of data only in very limited contexts. Typically, these systems are limited in one of several ways. In some cases, as with the dQUOB system from Georgia Tech, they have no notion of history; that is, they only allow queries over the data that is currently passing through the stream, so that queries can't react to past behavior in any way. In other cases, as with D. Stott Parker's research at UCLA, the focus is on theoretical notions of streams and sets, without consideration for practical implementations or real performance.
Given this background, my research is directed toward characterizing how various streaming applications make use of history, and how to enable multiple queries to execute for long periods of time over a single stream of data.
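As a minimal illustration of the role of history, the toy example below (our own sketch, not Telegraph code) runs a continuous query that keeps a bounded window of past readings; a system with no notion of history could compute only over the single tuple currently in flight.

    from collections import deque

    def windowed_average(stream, window_size):
        # Continuous query over a stream that retains bounded history:
        # after each arrival, emit the average of the last `window_size`
        # readings.  A history-free system sees only the current tuple.
        window = deque(maxlen=window_size)
        for reading in stream:
            window.append(reading)
            yield sum(window) / len(window)

    # e.g. list(windowed_average(iter([1, 2, 4, 8]), 2)) == [1.0, 1.5, 3.0, 6.0]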
Rapidly increasing connectivity among computer systems and the explosion of data sources and applications (henceforth jointly referred to as services) on the web call for a change in traditional distributed programming models. We believe that the composition of services to create new services will be the dominant programming paradigm on the web. This model will need new techniques for handling faults. Fault mitigation, which reduces the impact of faults only to the extent required by the user, will be more important in this environment than traditional fault tolerance techniques that seek to make faults completely transparent to the user.
Traditionally, fault tolerance research has concentrated on providing exactly-once transmission semantics in distributed systems and ACID semantics for the execution of applications. These approaches lack the notion of masking errors from users only to the extent that the user requires. On the web this is an important requirement, since the time spent masking the effects of a failure may not be acceptable to the user. More importantly, the effort may be wasted: the effects of the failure may not be visible to the user (because of the inherent semantics of the component services), or the user may only be looking for an approximate answer. Fault mitigating systems, which are aware of the relaxed semantics of applications, are much better suited to this environment. Fault mitigation is not only a more desirable fault model; it can also be used to deliberately degrade one query in order to more aggressively process other queries competing for the same resources.
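The sketch below illustrates the fault mitigation idea with a hypothetical wrapper of our own design: several sources are queried in parallel, and whatever arrives before a user-specified deadline is returned, provided it meets the user's coverage tolerance, rather than retrying until an exact and complete answer is available.

    from concurrent.futures import ThreadPoolExecutor, wait

    def mitigated_query(sources, fetch, deadline_s, min_coverage):
        # Query all sources in parallel; accept a partial answer if enough
        # of them respond before the deadline.  Faults are masked only to
        # the extent the user's tolerance (min_coverage) requires.
        pool = ThreadPoolExecutor(max_workers=len(sources))
        futures = [pool.submit(fetch, s) for s in sources]
        done, _ = wait(futures, timeout=deadline_s)
        pool.shutdown(wait=False)  # stragglers are abandoned, not awaited
        answers = [f.result() for f in done if f.exception() is None]
        if answers and len(answers) / len(sources) >= min_coverage:
            return answers  # an approximate answer acceptable to the user
        raise RuntimeError("too many failures for the requested accuracy")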
Rapidly increasing amounts of data motivate the need to compress data stored in databases. The main challenge is that common database operations (e.g., selections and joins) must not be significantly slowed by compression; in particular, they should not require decompressing all the data. For this reason, well-known compression techniques such as Lempel-Ziv encoding are not well suited to this scenario. At the same time, the semantics of databases offer additional opportunities for better compression: (1) relations are highly structured and information about that structure is available; (2) the tuples in a relation can be reordered without changing the semantics of the data; and (3) many data processing tasks can trade inaccuracies in the query results for better response times. We are also exploring the use of compression to reduce communication in wide-area query processing in the context of the Telegraph project.
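As one simple example of compression that keeps queries cheap, the sketch below (our own illustration, not our actual scheme) dictionary-encodes a column with a sorted dictionary, so equality and range predicates can be evaluated directly on the small integer codes without decompressing the column.

    import bisect

    def dictionary_encode(column):
        # Order-preserving dictionary encoding: each distinct value maps to
        # a small integer code; sorting the dictionary keeps comparisons on
        # the codes meaningful for range predicates as well.
        dictionary = sorted(set(column))
        code = {v: i for i, v in enumerate(dictionary)}
        return [code[v] for v in column], dictionary

    def select_eq(codes, dictionary, constant):
        # Evaluate `column = constant` on the compressed codes directly.
        j = bisect.bisect_left(dictionary, constant)
        if j == len(dictionary) or dictionary[j] != constant:
            return []
        return [row for row, c in enumerate(codes) if c == j]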
Most existing query optimizers for federated/distributed database systems perform query optimization statically and assume that the cost models of the underlying systems are known. This approach suffers from two major problems: (1) the cost of a plan depends heavily on runtime conditions such as the loads on the underlying systems and the communication costs; (2) the underlying data sources may have ad hoc cost models that cannot be exported to the optimizer. To do correct cost-based optimization, the optimization must be performed at runtime, and the optimizer must contact the sites involved in an operation to find the cost of that operation. Since this may involve a message over a (possibly wide-area) network, the optimization cost can be significant. We have adapted various well-known optimization techniques to this scenario, and our initial experiments support our claim that aggressive optimization techniques are required. One surprising phenomenon we observed in these experiments is that two-phase optimization, which breaks the optimization process into two phases (finding the best static plan, then scheduling it at runtime), works very well under varying load conditions and communication costs. We are currently trying to analyze this phenomenon in a more general setting.
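The sketch below gives a stylized picture of two-phase optimization; the site model and load accounting are hypothetical. Phase one is assumed to have already fixed the operator order from static estimates, and phase two assigns each operator, at runtime, to the currently least-loaded site that can execute it.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Site:
        name: str
        capabilities: frozenset  # operators this site can execute

    def schedule_at_runtime(static_plan, sites, load):
        # Phase two only: keep the statically chosen operator order, but
        # place each operator on the currently least-loaded capable site.
        placement = {}
        for op in static_plan:
            capable = [s for s in sites if op in s.capabilities]
            best = min(capable, key=lambda s: load[s.name])
            placement[op] = best.name
            load[best.name] += 1  # crude accounting of the newly assigned work
        return placement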
GiST is a tool for the development of tree-structured access methods with a minimum of design and coding effort. We are designing a general, experimentally validated framework for access method optimization. Currently, we are focusing on extending the GiST framework to allow adaptive, online re-optimization of access methods.
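The sketch below indicates the kind of extension interface GiST exposes: an access-method developer supplies a few key methods, and the generic tree code handles search, insertion, and node splits. The method names follow the GiST design; the Python signatures are simplified for illustration.

    from abc import ABC, abstractmethod

    class GiSTExtension(ABC):
        # The handful of methods an access-method developer supplies;
        # everything else (search, insert, splits) is generic tree code.

        @abstractmethod
        def consistent(self, key, query):
            """May the subtree under `key` contain entries matching `query`?"""

        @abstractmethod
        def union(self, keys):
            """Smallest key covering all of `keys` (used on insert and split)."""

        @abstractmethod
        def penalty(self, existing_key, new_key):
            """Cost of inserting `new_key` beneath `existing_key`."""

        @abstractmethod
        def pick_split(self, entries):
            """Partition an overflowing node's entries into two groups."""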
An important step in data warehousing and enterprise data integration is cleaning data of discrepancies in structure and content. Current commercial solutions for data cleaning involve many iterations of time-consuming "auditing" to find errors, and long-running transformations to fix them. Users need to endure long waits and often write complex transformation programs.
We present Potter's Wheel A-B-C, an interactive framework for data cleaning that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown immediately on the records visible on screen. In the background, the system automatically infers the structure of the data in terms of user-defined domains and applies suitable algorithms to check it for discrepancies, flagging them as they are found. This allows users to gradually construct a transformation as discrepancies are found, and to clean the data without writing complex programs or enduring long delays. We choose and adapt a small set of transforms from the existing literature and describe methods for their graphical specification and interactive application. We combine the minimum description length principle with the traditional database notion of user-defined types to automatically extract suitable structures for data values in an extensible fashion. Such structure extraction is also applied in the graphical specification of transforms, to infer transforms from examples. We also describe methods for optimizing the final sequence of transforms to reduce memory allocations and copies.
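To give a flavor of the structure extraction step, the toy sketch below picks, among a few hypothetical user-defined domains, the one that describes a column's sample values at the lowest total description length. The domains and per-value bit costs are invented for illustration and are much cruder than the system's actual cost model.

    import re

    # Hypothetical domains: (pattern, per-value description length in bits).
    DOMAINS = {
        "integer": (re.compile(r"\d+$"), lambda v: 4 + 3.4 * len(v)),
        "word":    (re.compile(r"[A-Za-z]+$"), lambda v: 4 + 4.8 * len(v)),
        "any":     (re.compile(r".+$"), lambda v: 4 + 8.0 * len(v)),
    }

    def infer_structure(sample_values):
        # MDL-style choice: the domain that matches the sample and yields
        # the shortest total description wins; values elsewhere in the
        # column that fail to match it are then flagged as discrepancies.
        best, best_cost = None, float("inf")
        for name, (pattern, bits) in DOMAINS.items():
            if all(pattern.match(v) for v in sample_values):
                cost = sum(bits(v) for v in sample_values)
                if cost < best_cost:
                    best, best_cost = name, cost
        return best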
Database systems nowadays access data from diverse sources with unpredictable response times, throughputs, and data distributions. Often, especially with data accessed from Internet sources, there are many competing sources and access methods with varying speeds for getting the same data. The execution environment is also volatile, with resources like memory varying with the system load. Traditional static query optimization is often ineffective in such situations, prompting recent efforts to dynamically adapt join and selection orders.
We have developed global state modules (GSMs) as a mechanism for adapting to competing access methods and data sources, and to changing memory availability. GSMs extract state that is hidden as private members inside query processing modules and expose it as first-class query processing modules. An implementation of GSMs in the Telegraph tuple-level adaptive dataflow system [1] demonstrates that it is possible to dynamically learn good access methods without wasted effort or extra memory. Additionally, GSMs provide a simple way of exploiting common subexpressions in query workloads, without hindering pipelined execution or requiring static multi-query optimization.
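A minimal sketch of a GSM follows, with names of our own choosing: the hash table that a join operator would normally keep private becomes a standalone module with insert and probe entry points, so several queries can share one build side and the system can ask the module to release memory under pressure.

    class GlobalStateModule:
        # State pulled out of an operator and exposed as a first-class
        # module: a hash-indexed collection of tuples.

        def __init__(self, key_fn):
            self.key_fn = key_fn
            self.table = {}

        def insert(self, tup):
            self.table.setdefault(self.key_fn(tup), []).append(tup)

        def probe(self, key):
            return self.table.get(key, [])

        def shrink(self, fraction):
            # Release part of the state under memory pressure; a real
            # system would spill or re-fetch the dropped tuples.
            victims = list(self.table)[: int(len(self.table) * fraction)]
            for k in victims:
                del self.table[k]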
We use the eddy operator [2] as a central point for adaptively routing tuples between operators in Telegraph. In this setting, many typical query processing operators can be viewed as particular ways of routing tuples between GSMs. This software architecture has two benefits. First, exposing state to the eddy as a global resource allows the eddy to perform global resource management. In particular, the eddy can effectively adapt to changing memory availability in the environment by suitably populating or depopulating the GSMs. This has been a hard problem in query processing because each module distributes its allocated memory among its private data structures for its own local benefit, rather than for the overall benefit of the query or the system's query workload. Second, since many join algorithms can be expressed as routing policies, the eddy can automatically learn hybrid variants of join algorithms by varying the routing policy to suit system conditions. We have found promising initial evidence that even simple routing policies can learn useful hybrid join strategies.
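The toy routing loop below (again with invented names) shows how a join emerges from a routing policy: each arriving tuple is first inserted into its own side's state and then probes the other side's, which is exactly a pipelined symmetric hash join; delaying or reordering those routing steps for one side yields hybrid variants without touching any operator code.

    def eddy_symmetric_hash_join(stream, key_r, key_s):
        # `stream` yields ('R', tuple) or ('S', tuple) items in arrival order.
        state = {"R": {}, "S": {}}
        key_of = {"R": key_r, "S": key_s}
        for source, tup in stream:
            other = "S" if source == "R" else "R"
            k = key_of[source](tup)
            # Routing step 1: remember this tuple for future matches.
            state[source].setdefault(k, []).append(tup)
            # Routing step 2: probe the opposite side's state seen so far.
            for match in state[other].get(k, []):
                yield (tup, match) if source == "R" else (match, tup)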