Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing

safaribooksonline.com/library/view/streaming-systems/9781491983867/ch02.html

Chapter 2. The What, Where, When, and How of Data Processing

Okay party people, it’s time to get concrete!

Chapter 1 focused on three main areas: terminology, defining precisely

what I mean when I use overloaded terms like “streaming”; batch

versus streaming, comparing the theoretical capabilities of the two

types of systems, and postulating that only two things are necessary to

take streaming systems beyond their batch counterparts: correctness

and tools for reasoning about time; and data processing patterns,

looking at the conceptual approaches taken with both batch and

streaming systems when processing bounded and unbounded data.

In this chapter, we’re going to look at the data processing

patterns from Chapter 1 in more detail, and within the context of

concrete examples. By the time we’re finished, we’ll have covered what

I consider to be the core set of principles and concepts required for

robust out-of-order data processing; these are the tools for reasoning

about time that truly get you beyond classic batch processing.

To give you a sense of what things look like in action, I use snippets of

Apache Beam code, coupled with time-lapse diagrams to provide a

visual representation of the concepts. Apache Beam is a unified

programming model and portability layer for batch and stream

processing, with a set of concrete SDKs in various languages (e.g.,

Java and Python). Pipelines written with Apache Beam can then be

portably run on any of the supported execution engines (Apache Apex,

Apache Flink, Apache Spark, Cloud Dataflow, etc.).

I use Apache Beam here for examples not because this is a Beam book

(it’s not), but because it most completely embodies the concepts

described in this book. Back when “Streaming 102” was originally

written (back when it was still the Dataflow Model from Google Cloud

Dataflow and not the Beam Model from Apache Beam), it was literally

the only system in existence that provided the amount of

expressiveness necessary for all the examples we’ll cover here. A year

and a half later, I’m happy to say much has changed, and most of the

major systems out there have moved or are moving toward supporting a

model that looks a lot like the one described in this book. So rest

assured that the concepts we cover here, though informed through the

Beam lens, as it were, will apply equally across most other systems

you’ll come across.

Roadmap

To help set the stage for this chapter, I want to lay out the five main

concepts that will underpin all of the discussions therein, and really, for

most of the rest of Part I. We’ve already covered two of them.

In Chapter 1, I first established the critical distinction between event

time (the time that events happen) and processing time (the time they

are observed during processing). This provides the foundation for one of

the main theses put forth in this book: if you care about both correctness

and the context within which events actually occurred, you must analyze

data relative to their inherent event times, not the processing time at

which they are encountered during the analysis itself.

I then introduced the concept of windowing (i.e., partitioning a dataset

along temporal boundaries), which is a common approach used to cope

with the fact that unbounded data sources technically might never end.

Some simpler examples of windowing strategies are fixed and sliding

windows, but more sophisticated types of windowing, such as sessions

(in which the windows are defined by features of the data themselves;

for example, capturing a session of activity per user followed by a gap of

inactivity) also see broad usage.
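The three windowing strategies named above can be sketched in plain Python. This is a minimal illustration, not Apache Beam code: the function names, the alignment of windows to the Unix epoch, and the sample event times are all assumptions made for the sketch.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def fixed_window(ts, size):
    """Return the [start, end) fixed window of length `size` that contains `ts`."""
    start = ts - (ts - EPOCH) % size  # align to multiples of `size` since EPOCH
    return start, start + size

def sliding_windows(ts, size, period):
    """Return every [start, end) window of length `size`, one starting each
    `period`, that contains `ts`."""
    start, _ = fixed_window(ts, period)  # latest window start at or before ts
    windows = []
    while start + size > ts:
        windows.append((start, start + size))
        start -= period
    return windows

def session_windows(timestamps, gap):
    """Split timestamps into sessions; a new session starts after `gap` or
    more of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical event times for one user (the date is arbitrary).
events = [datetime(2024, 1, 1, 12, 0, 26),
          datetime(2024, 1, 1, 12, 1, 10),
          datetime(2024, 1, 1, 12, 7, 46)]
two_min = timedelta(minutes=2)

print(fixed_window(events[0], two_min))
print(sliding_windows(events[0], two_min, timedelta(minutes=1)))
print(session_windows(events, gap=timedelta(minutes=5)))
```

Note that fixed and sliding windows depend only on the timestamp, whereas session boundaries depend on the data themselves, which is why the session function needs the whole set of timestamps.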

In addition to these two concepts, we’re now going to look closely at

three more:

Triggers

A trigger is a mechanism for declaring when the output for a window

should be materialized relative to some external signal. Triggers provide

flexibility in choosing when outputs should be emitted. In some sense,

you can think of them as a flow control mechanism for dictating when

results should be materialized. Another way of looking at it is that

triggers are like the shutter-release on a camera, allowing you to

declare when to take a snapshot in time of the results being computed.

Triggers also make it possible to observe the output for a window

multiple times as it evolves. This in turn opens up the door to refining

results over time, which allows for providing speculative results as data

arrive, as well as dealing with changes in upstream data (revisions) over

time or data that arrive late (e.g., mobile scenarios, in which someone’s

phone records various actions and their event times while the person is

offline and then proceeds to upload those events for processing upon

regaining connectivity).

Watermarks

A watermark is a notion of input completeness with respect to event

times. A watermark with value of time X makes the statement: “all input

data with event times less than X have been observed.” As such,

watermarks act as a metric of progress when observing an unbounded

data source with no known end. We touch upon the basics of

watermarks in this chapter, and then Slava goes super deep on the

subject in Chapter 3.
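To make the watermark statement concrete, here is a small sketch in plain Python. It assumes an idealized setting in which the full input is known in advance, so the watermark can be computed exactly; real systems usually have to estimate it. The records, the helper `t`, and both function names are hypothetical.

```python
from datetime import datetime

def t(hms):
    """Parse an HH:MM:SS string into a datetime (the date part is irrelevant here)."""
    return datetime.strptime(hms, "%H:%M:%S")

# Hypothetical (event_time, processing_time) pairs.
RECORDS = [
    (t("12:00:10"), t("12:00:40")),
    (t("12:00:50"), t("12:03:00")),  # delayed: observed long after it happened
    (t("12:01:30"), t("12:01:45")),
    (t("12:02:20"), t("12:02:30")),
]

def perfect_watermark(records, now):
    """Earliest event time NOT yet observed at processing time `now`.

    Returns None once every record has been observed. Assumes full knowledge
    of the input, which is an idealization.
    """
    unseen = [event for event, proc in records if proc > now]
    return min(unseen) if unseen else None

def claim_holds(records, now, watermark):
    """Check the watermark's statement: every record with an event time less
    than `watermark` has been observed by processing time `now`."""
    return all(proc <= now for event, proc in records if event < watermark)

now = t("12:02:00")
print(perfect_watermark(RECORDS, now))          # held back by the delayed record
print(claim_holds(RECORDS, now, t("12:00:50")))
print(claim_holds(RECORDS, now, t("12:02:00")))  # too optimistic a watermark
```

A watermark that advances past the event time of a record still in flight makes a claim that does not hold, and that record then shows up as late data.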

Accumulation

An accumulation mode specifies the relationship between multiple

results that are observed for the same window. Those results might be

completely disjoint; that is, representing independent deltas over

time, or there might be overlap between them. Different accumulation

modes have different semantics and costs associated with them and

thus find applicability across a variety of use cases.
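The difference between accumulation modes is easiest to see on a tiny example. The sketch below, in plain Python, takes the inputs that arrived between successive trigger firings for a single window and shows what each mode would emit for an integer sum. The mode names follow the text (discarding, accumulating, accumulating and retracting); the function names, the pane values, and the representation of a retraction as the previously emitted value are assumptions of this sketch.

```python
def discarding(panes):
    """Each result covers only the values that arrived since the last firing."""
    return [sum(pane) for pane in panes]

def accumulating(panes):
    """Each result builds on the previous one: a running total."""
    total, results = 0, []
    for pane in panes:
        total += sum(pane)
        results.append(total)
    return results

def accumulating_and_retracting(panes):
    """Each result pairs the new running total with the previously emitted
    total to retract (None for the first firing)."""
    results, previous = [], None
    for total in accumulating(panes):
        results.append((total, previous))
        previous = total
    return results

# Hypothetical inputs for one window, grouped by trigger firing.
PANES = [[5, 7], [9], [3]]

print(discarding(PANES))                   # independent deltas
print(accumulating(PANES))                 # overlapping totals
print(accumulating_and_retracting(PANES))  # totals plus retractions
```

In discarding mode a consumer must add the results up itself to get the total, whereas in accumulating mode the last result already is the total; the retraction lets a downstream consumer undo the earlier value before applying the new one.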

Also, because I think it makes it easier to understand the relationships

between all of these concepts, we revisit the old and explore the new

within the structure of answering four questions, all of which I propose

are critical to every unbounded data processing problem:

What results are calculated? This question is answered by the

types of transformations within the pipeline. This includes things

like computing sums, building histograms, training machine

learning models, and so on. It’s also essentially the question

answered by classic batch processing.

Where in event time are results calculated? This question is

answered by the use of event-time windowing within the pipeline.

This includes the common examples of windowing from Chapter 1

(fixed, sliding, and sessions); use cases that seem to have no

notion of windowing (e.g., time-agnostic processing; classic batch

processing also generally falls into this category); and other, more

complex types of windowing, such as time-limited auctions. Also

note that it can include processing-time windowing, as well, if you

assign ingress times as event times for records as they arrive at

the system.

When in processing time are results materialized? This question is

answered by the use of triggers and (optionally) watermarks. There

are infinite variations on this theme, but the most common patterns

are those involving repeated updates (i.e., materialized view

semantics), those that utilize a watermark to provide a single

output per window only after the corresponding input is believed to

be complete (i.e., classic batch processing semantics applied on a

per-window basis), or some combination of the two.

How do refinements of results relate? This question is answered by

the type of accumulation used: discarding (in which results are all

independent and distinct), accumulating (in which later results build

upon prior ones), or accumulating and retracting (in which both the

accumulating value plus a retraction for the previously triggered

value(s) are emitted).
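The two most common answers to the "When" question can be contrasted with a short plain-Python sketch for a single window. It is only an illustration under stated assumptions: time is measured in whole minutes, the watermark value after each arrival is supplied by hand, and the function names are invented for the sketch rather than taken from any SDK.

```python
def repeated_update_trigger(values):
    """Fire after every record, yielding the running sum each time
    (materialized-view style)."""
    total = 0
    for value in values:
        total += value
        yield total

def watermark_trigger(arrivals, window_end):
    """Fire once, when the watermark first reaches the end of the window.

    `arrivals` holds (value, watermark) pairs in processing-time order, where
    `watermark` is the assumed watermark right after that arrival. Returns the
    single sum emitted, or None if the watermark never reaches `window_end`.
    """
    total = 0
    for value, watermark in arrivals:
        total += value
        if watermark >= window_end:
            return total  # anything arriving after this point is late data
    return None

# Hypothetical arrivals for a window ending at minute 2.
ARRIVALS = [(5, 1), (7, 1), (9, 2), (3, 2)]

print(list(repeated_update_trigger(value for value, _ in ARRIVALS)))
print(watermark_trigger(ARRIVALS, window_end=2))
```

The repeated-update version produces a result for every arrival, while the watermark version produces one result per window and misses the final value here, because the (deliberately optimistic) watermark reached the end of the window before that value arrived.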

We look at each of these questions in much more detail throughout the

rest of the book. And, yes, I’m going to run this color scheme thing into

the ground in an attempt to make it abundantly clear which concepts

relate to which question in the What/Where/When/How idiom. You’re

welcome <winky-smiley/>.

Batch Foundations: What and Where

Okay, let’s get this party started. First stop: batch processing.

What: Transformations

The transformations applied in classic batch processing answer the

question: “What results are calculated?” Even though you are likely

already familiar with classic batch processing, we’re going to start there

anyway because it’s the foundation on top of which we add all of the

other concepts.

In the rest of this chapter (and indeed, through much of the book), we

look at a single example: computing keyed integer sums over a simple

dataset consisting of nine values. Let’s imagine that we’ve written a

team-based mobile game and we want to build a pipeline that

calculates team scores by summing up the individual scores reported by

users’ phones. If we were to capture our nine example scores in a SQL

table named “UserScores,” it might look something like this:

> SELECT * FROM UserScores ORDER BY EventTime;
------------------------------------------------
| Name  | Team  | Score | EventTime | ProcTime |
------------------------------------------------
| Julie | TeamX |     5 |  12:00:26 | 12:05:19 |
| Frank | TeamX |     9 |  12:01:26 | 12:08:19 |
| Ed    | TeamX |     7 |  12:02:26 | 12:05:39 |
| Julie | TeamX |     8 |  12:03:06 | 12:07:06 |
| Amy   | TeamX |     3 |  12:03:39 | 12:06:13 |
| Fred  | TeamX |     4 |  12:04:19 | 12:06:39 |
| Naomi | TeamX |     3 |  12:06:39 | 12:07:19 |
| Becky | TeamX |     8 |  12:07:26 | 12:08:39 |
| Naomi | TeamX |     1 |  12:07:46 | 12:09:00 |
------------------------------------------------
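The "What" of this example, a keyed integer sum, can be sketched in plain Python before any notion of time enters the picture. The rows are copied from the table above; the tuple layout and the function name `team_totals` are choices made for this sketch and are not Apache Beam API.

```python
from collections import defaultdict

# Rows from the UserScores table: user, team, score, event time, processing time.
USER_SCORES = [
    ("Julie", "TeamX", 5, "12:00:26", "12:05:19"),
    ("Frank", "TeamX", 9, "12:01:26", "12:08:19"),
    ("Ed",    "TeamX", 7, "12:02:26", "12:05:39"),
    ("Julie", "TeamX", 8, "12:03:06", "12:07:06"),
    ("Amy",   "TeamX", 3, "12:03:39", "12:06:13"),
    ("Fred",  "TeamX", 4, "12:04:19", "12:06:39"),
    ("Naomi", "TeamX", 3, "12:06:39", "12:07:19"),
    ("Becky", "TeamX", 8, "12:07:26", "12:08:39"),
    ("Naomi", "TeamX", 1, "12:07:46", "12:09:00"),
]

def team_totals(rows):
    """Sum the individual scores per team, ignoring both time columns."""
    totals = defaultdict(int)
    for _user, team, score, _event_time, _proc_time in rows:
        totals[team] += score
    return dict(totals)

print(team_totals(USER_SCORES))  # {'TeamX': 48}
```

Because the whole dataset is available up front and neither time column is consulted, this is classic batch processing: it answers only the question of what is computed.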

Note that all the scores in this example are from users on the same

team; this is to keep the example simple, given that we have a limited

number of dimensions in our diagrams that follow. And because we’re

grouping by team, we really just care about the last three columns:

Score

The individual user score associated with this event

EventTime

The event time for the score; that is, the time at which the score

occurred
