Query Processing, Resource Management, and Approximation
in a Data Stream Management System
Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu,
Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, Rohit Varma
Stanford University
http://www-db.stanford.edu/stream
Abstract
This paper describes our ongoing work developing the
Stanford Stream Data Manager (STREAM), a system for
executing continuous queries over multiple continuous
data streams. The STREAM system supports a declar-
ative query language, and it copes with high data rates
and query workloads by providing approximate answers
when resources are limited. This paper describes specific
contributions made so far and enumerates our next steps
in developing a general-purpose Data Stream Manage-
ment System.
1 Introduction
At Stanford we are building a Data Stream Management
System (DSMS) that we call STREAM. The new chal-
lenges in building a DSMS instead of a traditional DBMS
arise from two fundamental differences:
1. In addition to managing traditional stored data such
as relations, a DSMS must handle multiple contin-
uous, unbounded, possibly rapid and time-varying
data streams.
2. Due to the continuous nature of the data, a DSMS
typically supports long-running continuous queries,
which are expected to produce answers in a continu-
ous and timely fashion.
Our goal is to build and evaluate a general-purpose
DSMS that supports a declarative query language and
can cope with high data rates and thousands of continu-
ous queries. In addition to the obvious need for multi-
This work was supported by NSF Grants IIS-0118173
and IIS-9817799, by Stanford Graduate Fellowships from
Chambers, 3Com and Rambus, by a Microsoft graduate fel-
lowship, and by grants from Microsoft, Veritas, and the Okawa
Foundation.
Permission to copy without fee all or part of this material is
granted provided that the copies are not made or distributed
for direct commercial advantage, the VLDB copyright notice
and the title of the publication and its date appear, and notice
is given that copying is by permission of the Very Large Data
Base Endowment. To copy otherwise, or to republish, requires
a fee and/or special permission from the Endowment.
Proceedings of the 2003 CIDR Conference
query optimization, judicious resource allocation, and
sophisticated scheduling to achieve high performance,
we are targetingenvironmentswheredata rates and query
load may exceed available resources. In these cases our
system is designed to provide approximate answers to
continuous queries. Managing the interaction between
resource availability and approximation is an important
focus of our project. We are developing both static tech-
niques and techniques for adapting as run-time condi-
tions change.
This paper presents a snapshot of our languagedesign,
algorithms, system design, and system implementation
efforts as of autumn 2002. Clearly we are not presenting
a finished prototype in any sense, e.g., our query lan-
guage is designed but only a subset is implemented, and
our approximation techniques have been identified but
are not exploited fully by our resource allocation algo-
rithms. However, there are a number of concrete contri-
butions to report on at this point:
An extension of SQL suitable for a general-purpose
DSMS with a precisely-defined semantics (Section 2)
Structure of query plans, accounting for plan sharing
and approximation techniques (Section 3)
An algorithm for exploiting constraints on data
streams to reduce memory overhead during query
processing (Section 4.1)
A near-optimal scheduling algorithm for reducing
inter-operator queue sizes (Section 4.2)
A set of techniques for static and dynamic approxi-
mation to cope with limited resources (Section 5)
An algorithm for allocating resources to queries (in
a limited environment) that maximizes query result
precision (Section 5.3)
A software architecture designed for extensibility
and for easy experimentation with DSMS query pro-
cessing techniques (Section 6)
Some current limitations are:
Our DSMS is centralized and based on the relational
model. We believe that distributed query processing
will be essential for many data stream applications,
and we are designing our query processor with a mi-