
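Title: Stream Cube

Description: An article by Professor Jiawei Han that presents a range of methods for building data cubes over stream data.

The paper "Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams" by Jiawei Han and co-authors examines the concept and implementation of the stream cube, an architecture for efficient multi-dimensional analysis of the massive (potentially infinite) stream data produced by real-time surveillance systems, telecommunication systems, and other dynamic environments. The data volume is too large to be scanned multiple times, and most of the data resides at a low level of abstraction, whereas analysts care more about high-level dynamic changes such as trends and outliers. Discovering these high-level characteristics requires on-line, multi-level, multi-dimensional analytical processing of the stream.

### Stream Cube architecture overview

The stream cube architecture is designed to support on-line, multi-dimensional, multi-level analysis of stream data. To make this analysis fast, the paper proposes three techniques for computing stream cubes efficiently:

1. **Tilted time frame model**: a multi-resolution model for registering time-related data. More recent data are registered at finer resolution, and more distant data at coarser resolution. This reduces the overall storage of time-related data and fits the analysis tasks commonly encountered in practice.
2. **Critical layer maintenance**: instead of materializing cuboids at all levels, only a small number of critical layers are maintained. Flexible analysis is performed efficiently on the basis of the observation layer and the minimal interesting layer.
3. **Efficient stream data cubing algorithm**: an algorithm that computes only the layers (cuboids) along a popular path and leaves the other cuboids to query-driven, on-line computation.

### Tilted time frame model

The tilted time frame adjusts storage resolution according to the age of the data. Recent data are kept at fine resolution so that current trends and patterns are captured accurately. Older data are kept at coarser resolution, which saves space while preserving the historical picture.

### Critical layers: observation layer and minimal interesting layer

Traditional data cube construction often precomputes every possible cuboid, which is unrealistic for stream data because the volume is huge and keeps growing. The stream cube stores only the critical layers. The observation layer is the level at which an analyst watches the data and decides whether to signal an exception or drill down. The minimal interesting layer is the finest level the analyst would still want to examine. Restricting storage to these layers avoids computing large amounts of detail that nobody inspects.

### Efficient stream data cubing algorithm

The stream cube also comes with a cubing algorithm designed for stream data. It materializes only the cuboids along one popular drilling path between the critical layers and computes the remaining cuboids on demand, which reduces computation time and memory use compared with materializing every cuboid.

Taken together, the stream cube architecture and its three techniques address the storage and computation bottlenecks of conventional cube-based analysis when applied to stream data.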


Distributed and Parallel Databases, 18, 173–197, 2005



2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

DOI: 10.1007/s10619-005-3296-1

Stream Cube: An Architecture for Multi-Dimensional Analysis of Data Streams

JIAWEI HAN hanj@cs.uiuc.edu

University of Illinois

YIXIN CHEN chen@cse.wustl.edu

Washington University, St. Louis

GUOZHU DONG gdong@cs.wright.edu

Wright State University

JIAN PEI jpei@cs.sfu.ca

Simon Fraser University, B. C., Canada

BENJAMIN W. WAH b-wah@uiuc.edu

University of Illinois

JIANYONG WANG jianyong@tsinghua.edu.cn

Tsinghua University, Beijing, China

Y. DORA CAI ycai@ncsa.uiuc.edu

University of Illinois

Recommended by: Ahmed Elmagarmid

Published online: 20 September 2005

Abstract. Real-time surveillance systems, telecommunication systems, and other dynamic environments often generate tremendous (potentially infinite) volume of stream data: the volume is too huge to be scanned multiple times. Much of such data resides at rather low level of abstraction, whereas most analysts are interested in relatively high-level dynamic changes (such as trends and outliers). To discover such high-level characteristics, one may need to perform on-line multi-level, multi-dimensional analytical processing of stream data. In this paper, we propose an architecture, called stream cube, to facilitate on-line, multi-dimensional, multi-level analysis of stream data.

For fast online multi-dimensional analysis of stream data, three important techniques are proposed for efficient and effective computation of stream cubes. First, a tilted time frame model is proposed as a multi-resolution model to register time-related data: the more recent data are registered at finer resolution, whereas the more distant data are registered at coarser resolution. This design reduces the overall storage of time-related data and adapts nicely to the data analysis tasks commonly encountered in practice. Second, instead of materializing cuboids at all levels, we propose to maintain a small number of critical layers. Flexible analysis can be efficiently performed based on the concept of observation layer and minimal interesting layer. Third, an efficient stream data cubing algorithm is developed which computes only the layers (cuboids) along a popular path and leaves the other cuboids for query-driven, on-line computation. Based on this design methodology, stream data cube can be constructed and maintained incrementally with a reasonable amount of memory, computation cost, and query response time. This is verified by our substantial performance study.


Stream data cube architecture facilitates online analytical processing of stream data. It also forms a preliminary data structure for online stream data mining. The impact of the design and implementation of stream data cube in the context of stream data mining is also discussed in the paper.

1. Introduction

With years of research and development of data warehousing and OLAP technology [9, 15], a large number of data warehouses and data cubes have been successfully constructed and deployed in applications. The data cube has become an essential component in most data warehouse systems and in some extended relational database systems, and has been playing an increasingly important role in data analysis and intelligent decision support. The data warehouse and OLAP technology is based on the integration and consolidation of data in multi-dimensional space to facilitate powerful and fast on-line data analysis. Data are aggregated either completely or partially in multiple dimensions and multiple levels, and are stored in the form of either relations or multi-dimensional arrays [1, 29]. The dimensions in a data cube are of categorical data, such as products, region, time, etc., and the measures are numerical data, representing various kinds of aggregates, such as sum, average, variance of sales or profits, etc.

The success of OLAP technology naturally leads to its possible extension from the analysis of static, pre-integrated, historical data to that of current, dynamically changing data, including time-series data, scientific and engineering data, and data produced in other dynamic environments, such as power supply, network traffic, stock exchange, telecommunication data flow, Web click streams, weather or environment monitoring, etc.

A fundamental difference in the analysis of stream data from that of relational and warehouse data is that the stream data is generated in huge volume, flowing in and out dynamically, and changing rapidly. Due to limited memory or disk space and processing power available in today’s computers, most data streams may only be examined in a single pass. These characteristics of stream data have been emphasized and investigated by many researchers, such as [6, 7, 12, 14, 16], and efficient stream data querying, clustering and classification algorithms have been proposed recently (such as [12, 14, 16, 17, 20]). However, there is another important characteristic of stream data that has not drawn enough attention: most stream data resides at a rather low level of abstraction, whereas an analyst is often more interested in higher and multiple levels of abstraction. Similar to OLAP analysis of static data, multi-level, multi-dimensional on-line analysis should be performed on stream data as well.

The requirement for multi-level, multi-dimensional on-line analysis of stream data, though desirable, raises a challenging research issue: “Is it feasible to perform OLAP analysis on huge volumes of stream data since a data cube is usually much bigger than the original data set, and its construction may take multiple database scans?”

In this paper, we examine this issue and present an interesting architecture for on-line analytical analysis of stream data. Stream data is generated continuously in a dynamic environment, with huge volume, infinite flow, and fast changing behavior. As collected, such data is almost always at rather low level, consisting of various kinds of detailed temporal and other features. To find interesting or unusual patterns, it is essential to perform analysis on some useful measures, such as sum, average, or even more sophisticated measures, such as regression, at certain meaningful abstraction level, discover critical changes of data, and drill down to some more detailed levels for in-depth analysis, when needed.

To illustrate our motivation, let’s examine the following examples.

Example 1. A power supply station can watch infinite streams of power usage data, with the lowest granularity as individual user, location, and second. Given a large number of users, it is only realistic to analyze the fluctuation of power usage at certain high levels, such as by city or street district and by quarter (of an hour), making timely power supply adjustments and handling unusual situations.

Conceptually, for multi-dimensional analysis, one can view such stream data as a virtual data cube, consisting of one or a few measures and a set of dimensions, including one time dimension, and a few other dimensions, such as location, user-category, etc. However, in practice, it is impossible to materialize such a data cube, since the materialization requires a huge amount of data to be computed and stored. Some efficient methods must be developed for systematic analysis of such data.

Example 2. Suppose that a Web server, such as Yahoo.com, receives a huge volume of Web click streams requesting various kinds of services and information. Usually, such stream data resides at rather low level, consisting of time (down to subseconds), Web page address (down to concrete URL), user IP address (down to detailed machine IP address), etc. However, an analyst may often be interested in changes, trends, and unusual patterns, happening in the data streams, at certain high levels of abstraction. For example, it is interesting to find that the Web clicking traffic in North America on sports in the last 15 minutes is 40% higher than the last 24 hours’ average.
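The comparison at the end of Example 2 (recent traffic against a longer-term average) can be sketched as follows. This is a minimal illustration and not the paper's method: the function name `relative_change`, the per-minute bucketing, and the sample data are all assumptions made for the sketch.

```python
def relative_change(minute_counts, recent=15, baseline=24 * 60):
    """Relative change of the recent per-minute rate over the baseline rate.

    minute_counts: per-minute click counts, oldest first, newest last
    (hypothetical input format). A result of 0.4 means "40% higher".
    """
    window = list(minute_counts)[-baseline:]
    if len(window) < recent:
        raise ValueError("need at least `recent` minutes of data")
    recent_rate = sum(window[-recent:]) / recent
    baseline_rate = sum(window) / len(window)
    if baseline_rate == 0:
        raise ValueError("baseline rate is zero")
    return recent_rate / baseline_rate - 1.0


# Hypothetical data: steady traffic, then a burst in the last 15 minutes.
flat = [100] * (24 * 60)
burst = [100] * (24 * 60 - 15) + [200] * 15
print(relative_change(flat))
print(relative_change(burst))
```

This sketch keeps one counter per minute for a whole day and for a single cell only. Doing the same for every combination of region and topic at full time resolution is the storage cost that the paper's architecture aims to avoid.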

From the point of view of a Web analysis provider, given a large volume of fast changing Web click streams, and with limited resource and computational power, it is only realistic to analyze the changes of Web usage at certain high levels, discover unusual situations, and drill down to some more detailed levels for in-depth analysis, when needed, in order to make timely responses.

Interestingly, both the analyst and the analysis provider share a similar view on such stream data analysis: instead of getting bogged down in every detail of the data stream, the pressing requirement is to provide on-line analysis of changes, trends and other patterns at high levels of abstraction, with low cost and fast response time.

In this study, we take Example 2 as a typical scenario and study how to perform efficient and effective multi-dimensional analysis of stream data, with the following contributions.

1. For on-line stream data analysis, both space and time are critical. In order to avoid imposing unrealistic demand on space and time, instead of computing a fully materialized cube, we suggest computing a partially materialized data cube, with a tilted time frame as its time dimension model. In the tilted time frame, time is registered at different levels of granularity. The most recent time is registered at the finest granularity; the more distant time is registered at coarser granularity; the level of coarseness depends on the application requirements and on how old the time point is. This model is sufficient for most analysis tasks, and at the same time it also ensures that the total amount of data to retain in memory or to be stored on disk is small.


2. Due to limited memory space in stream data analysis, it is often too costly to store a precomputed cube, even with the tilted time frame, which substantially compresses the storage space. We propose to compute and store only two critical layers (which are essentially cuboids) in the cube: (1) an observation layer, called o-layer, which is the layer that an analyst would like to check and make decisions for either signaling the exceptions or drilling on the exception cells down to lower layers to find their corresponding lower level exceptions; and (2) the minimal interesting layer, called m-layer, which is the minimal layer that an analyst would like to examine, since it is often neither cost-effective nor practically interesting to examine the minute detail of stream data. For example, in Example 2, we assume that the o-layer is user-region, theme, and quarter, while the m-layer is user, sub-theme, and minute.

3. Storing a cube at only two critical layers leaves a lot of room for deciding what to compute and how to compute it for the cuboids between the two layers. We propose one method, called popular-path cubing, which rolls up the cuboids from the m-layer to the o-layer by following one popular drilling path, materializes only the layers along the path, and leaves other layers to be computed only when needed. Our performance study shows that this method achieves a reasonable trade-off between space, computation time, and flexibility, and has both quick aggregation time and exception detection time.
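The tilted time frame from the first contribution can be sketched as a small cascade of buffers. This is a simplified illustration under stated assumptions, not the paper's algorithm: the class name `TiltedTimeFrame`, the level sizes (4 quarters per hour, 24 hours per day, 31 retained days), and the choice of summation as the aggregate are all hypothetical.

```python
class TiltedTimeFrame:
    """Keeps recent data at fine granularity and older data at coarse granularity.

    levels: (name, capacity) pairs from finest to coarsest (assumed values).
    When a level fills up, its slots are summed into one slot of the next
    coarser level. The coarsest level forgets its oldest slot when it overflows.
    """

    def __init__(self, levels=(("quarter", 4), ("hour", 24), ("day", 31))):
        self.names = [name for name, _ in levels]
        self.capacity = [cap for _, cap in levels]
        self.slots = [[] for _ in levels]

    def add(self, value, level=0):
        slots = self.slots[level]
        slots.append(value)
        if level + 1 < len(self.slots):
            if len(slots) == self.capacity[level]:
                merged = sum(slots)          # collapse the full level
                slots.clear()
                self.add(merged, level + 1)  # register it one level coarser
        elif len(slots) > self.capacity[level]:
            slots.pop(0)                     # coarsest level: drop the oldest

    def snapshot(self):
        return {name: list(s) for name, s in zip(self.names, self.slots)}


# Feed 101 quarter-hour measurements of value 1 (hypothetical data).
frame = TiltedTimeFrame()
for _ in range(101):
    frame.add(1)
print(frame.snapshot())
```

With these assumed level sizes the frame never holds more than 3 + 23 + 31 slots, however long the stream runs. One simplification to be aware of: a level is emptied when it rolls up, so right after a rollover the finest level holds no slots.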

The rest of the paper is organized as follows. In Section 2, we define the basic concepts and introduce the research problem. In Section 3, we present an architectural design for online analysis of stream data by defining the problem and introducing the concepts of tilted time frame and critical layers. In Section 4, we present the popular-path cubing method, an efficient algorithm for stream data cube computation that supports on-line analytical processing of stream data. Our experiments and performance study of the proposed methods are presented in Section 5. The related work and possible extensions of the model are discussed in Section 6, and our study is concluded in Section 7.

2. Problem definition

In this section, we introduce the basic concepts related to data cubes, multi-dimensional analysis of stream data, and stream data cubes, and define the problem of research.

The concept of data cube [15] was introduced to facilitate multi-dimensional, multi-level analysis of large data sets.

Let D be a relational table, called the base table, of a given cube. The set of all attributes A in D is partitioned into two subsets, the dimensional attributes DIM and the measure attributes M (so DIM ∪ M = A and DIM ∩ M = ∅). The measure attributes functionally depend on the dimensional attributes in D and are defined in the context of data cube using some typical aggregate functions, such as COUNT, SUM, AVG, or some more sophisticated computational functions, such as standard deviation, regression, etc.
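To make the base table definition concrete, the following sketch aggregates one measure over a chosen subset of the dimensional attributes, which yields one cuboid. It is an illustration only: the function name `compute_cuboid`, the attribute names `region`, `theme`, and `clicks`, and the sample rows are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def compute_cuboid(base_table, group_by, measure):
    """Return SUM and COUNT of `measure` per combination of `group_by` values.

    base_table: iterable of dict rows (the table D).
    group_by:   tuple of dimensional attribute names (a subset of DIM).
    measure:    name of one measure attribute (an element of M).
    """
    cells = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for row in base_table:
        key = tuple(row[d] for d in group_by)
        cells[key]["sum"] += row[measure]
        cells[key]["count"] += 1
    return dict(cells)


# Hypothetical base table: DIM = {region, theme}, M = {clicks}.
rows = [
    {"region": "NA", "theme": "sports", "clicks": 3},
    {"region": "NA", "theme": "sports", "clicks": 5},
    {"region": "NA", "theme": "news", "clicks": 2},
    {"region": "EU", "theme": "sports", "clicks": 4},
]
by_region_theme = compute_cuboid(rows, ("region", "theme"), "clicks")
by_region = compute_cuboid(rows, ("region",), "clicks")
apex = compute_cuboid(rows, (), "clicks")  # every dimension aggregated away
print(by_region_theme[("NA", "sports")], by_region[("NA",)], apex[()])
```

AVG is not stored directly here because it can be derived as sum divided by count, and keeping the two parts separately lets coarser cuboids be combined from finer ones.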

A tuple with schema A in a multi-dimensional space (i.e., in the context of data cube) is called a cell. Given three distinct cells c₁, c₂ and c₃, c₁ is an ancestor of c₂, and c₂ a descendant of c₁ iff on every dimensional attribute, either c₁ and c₂ share the same value, or c₁’s value is a generalized value of c₂’s in the dimension’s concept hierarchy. c₂ is a
