flume1.8.rar: Flume 1.8 download resource (CSDN Library)

Uploaded 2022-10-12 08:27:21. RAR archive, 1.11 MB, containing 4 PDF files.

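Flume is a distributed, reliable, and available tool in the Apache Hadoop ecosystem for efficiently collecting, aggregating, and moving large amounts of log data. Version 1.8 continues to provide efficient data transport suited to real-time streaming of large data volumes. The key points about Flume 1.8 are:

1. **Flume components**:
   - **Agent**: the basic unit of work in Flume, a JVM process that collects, stages, and forwards data. An agent hosts three kinds of component: Source, Channel, and Sink.
   - **Source**: receives data produced by external sources such as log files or network sockets.
   - **Channel**: holds the events received by the Source as a temporary buffer until a Sink consumes them.
   - **Sink**: delivers events from the Channel to a destination such as HDFS, HBase, Kafka, or the Source of another Flume agent.

2. **Flume characteristics**:
   - **Reliability**: channel transactions ensure that an event is removed from one hop only after it has been stored at the next hop or in the terminal repository. The file channel additionally persists events, with checkpoints, on local disk.
   - **Fault tolerance**: a flow can define backup routes (fail-over), so that events are redirected when a hop fails.
   - **Scalability**: Flume supports multi-hop chains of agents as well as fan-in and fan-out flows, so complex distributed topologies can be built.
   - **Configuration reloading**: the agent periodically checks its configuration file for changes and reloads it, so a flow can be adjusted without restarting the process by hand.

3. **Flume 1.8 features and improvements**:
   - **Fixes and improvements**: the release carries fixes and improvements to existing components. The release notes included in this archive (Version 1.8.0 — Apache Flume.pdf) are the authoritative list.
   - **Sources and sinks**: the bundled integrations include the Kafka Source and the Elasticsearch Sink. Both were introduced in earlier 1.x releases and are not new in 1.8.
   - **Monitoring**: Flume exposes component metrics through JMX and can also report them to Ganglia or as JSON over HTTP.
   - **Configuration**: values in the configuration file can reference environment variables (see "Using environment variables in configuration files" in the User Guide).

4. **Flume configuration**:
   - The configuration file uses the Java properties format (not YAML). It names the agent and its Sources, Channels, and Sinks, and sets the properties of each.
   - For example, a simple agent that reads files from a spooling directory and writes to HDFS (each `...` stands for properties elided in the original):

```
agentName.sources = sourceName
agentName.channels = channelName
agentName.sinks = sinkName

agentName.sources.sourceName.type = spooldir
...
agentName.channels.channelName.type = memory
...
agentName.sinks.sinkName.type = hdfs
...
```

5. **Typical uses**:
   - In log collection, Flume gathers application logs from many servers and stores them centrally in HDFS for later analysis and mining.
   - In real-time stream processing, Flume can be combined with other big-data tools such as Storm or Spark.

6. **Best practices**:
   - Size the Channel capacity to balance memory use against throughput.
   - Use Avro RPC (an Avro Sink feeding an Avro Source) for transport between agents. Avro offers good cross-language compatibility and efficient serialization and deserialization.
   - With the file channel, keep the checkpoint and data directories on reliable storage and back them up to guard against data loss.

7. **Troubleshooting**:
   - When data transfer fails, check the agent logs and use the error messages to locate the problem.
   - Run `flume-ng agent` in the foreground with `-Dflume.root.logger=DEBUG,console` to get diagnostic output on the console.

Flume 1.8 is a capable log collection tool and a common part of big-data environments. Understanding its configuration, operation, and best practices is important for optimizing log management and real-time data analysis pipelines.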


flume1.8.rar (4 files)

flume1.8

Version 1.8.0 — Apache Flume.pdf 174KB

Flume 1.8.0 User Guide — Apache Flume.pdf 676KB

Overview (Apache Flume 1.8.0 API).pdf 67KB

Flume 1.8.0 Developer Guide — Apache Flume.pdf 307KB

2022/10/11 Flume 1.8.0 User Guide — Apache Flume

https://flume.apache.org/releases/content/1.8.0/FlumeUserGuide.html 1/55

Flume 1.8.0 User Guide

Introduction

Overview

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log

data from many different sources to a centralized data store.

The use of Apache Flume is not only restricted to log data aggregation. Since data sources are customizable, Flume can be used to

transport massive quantities of event data including but not limited to network traffic data, social-media-generated data, email

messages and pretty much any data source possible.

Apache Flume is a top level project at the Apache Software Foundation.

There are currently two release code lines available, versions 0.9.x and 1.x.

Documentation for the 0.9.x track is available at the Flume 0.9.x User Guide.

This documentation applies to the 1.x track.

New and existing users are encouraged to use the 1.x releases so as to leverage the performance improvements and configuration

flexibilities available in the latest architecture.

System Requirements

1. Java Runtime Environment - Java 1.8 or later

2. Memory - Sufficient memory for configurations used by sources, channels or sinks

3. Disk Space - Sufficient disk space for configurations used by channels or sinks

4. Directory Permissions - Read/Write permissions for directories used by agent

Architecture

Data flow model

A Flume event is defined as a unit of data flow having a byte payload and an optional set of string attributes. A Flume agent is a

(JVM) process that hosts the components through which events flow from an external source to the next destination (hop).

A Flume source consumes events delivered to it by an external source like a web server. The external source sends events to Flume

in a format that is recognized by the target Flume source. For example, an Avro Flume source can be used to receive Avro events

from Avro clients or other Flume agents in the flow that send events from an Avro sink. A similar flow can be defined using a Thrift

Flume Source to receive events from a Thrift Sink or a Flume Thrift Rpc Client or Thrift clients written in any language generated

from the Flume thrift protocol. When a Flume source receives an event, it stores it into one or more channels. The channel is a

passive store that keeps the event until it’s consumed by a Flume sink. The file channel is one example – it is backed by the local

filesystem. The sink removes the event from the channel and puts it into an external repository like HDFS (via Flume HDFS sink) or

forwards it to the Flume source of the next Flume agent (next hop) in the flow. The source and sink within the given agent run

asynchronously with the events staged in the channel.

Complex flows


Flume allows a user to build multi-hop flows where events travel through multiple agents before reaching the final destination. It

also allows fan-in and fan-out flows, contextual routing and backup routes (fail-over) for failed hops.
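
As a rough sketch of a two-hop flow, the fragment below chains two agents: the first forwards events through an Avro sink, and the second receives them on an Avro source. The agent names (edge, collector), the component names, the host collector.example.com and port 4545 are illustrative assumptions, and the first agent's source and the second agent's sink are omitted.

```properties
# Hop 1: agent "edge" forwards events to the next agent over Avro RPC
edge.sinks = avroForward
edge.channels = c1
edge.channels.c1.type = memory
edge.sinks.avroForward.type = avro
edge.sinks.avroForward.hostname = collector.example.com
edge.sinks.avroForward.port = 4545
edge.sinks.avroForward.channel = c1

# Hop 2: agent "collector" receives those events on an Avro source
collector.sources = avroIn
collector.channels = c1
collector.channels.c1.type = memory
collector.sources.avroIn.type = avro
collector.sources.avroIn.bind = 0.0.0.0
collector.sources.avroIn.port = 4545
collector.sources.avroIn.channels = c1
```

Fan-in follows the same pattern: several first-hop agents can point their Avro sinks at the same second-hop source.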

Reliability

The events are staged in a channel on each agent. The events are then delivered to the next agent or terminal repository (like

HDFS) in the flow. The events are removed from a channel only after they are stored in the channel of next agent or in the terminal

repository. This is how the single-hop message delivery semantics in Flume provide end-to-end reliability of the flow.

Flume uses a transactional approach to guarantee the reliable delivery of the events. A source stores events into a channel, and a

sink retrieves events from a channel, inside a transaction provided by that channel. This

ensures that the set of events is reliably passed from point to point in the flow. In the case of a multi-hop flow, the sink from the

previous hop and the source from the next hop both have their transactions running to ensure that the data is safely stored in the

channel of the next hop.

Recoverability

The events are staged in the channel, which manages recovery from failure. Flume supports a durable file channel which is backed

by the local file system. There’s also a memory channel which simply stores the events in an in-memory queue, which is faster but

any events still left in the memory channel when an agent process dies can’t be recovered.
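
The trade-off between the two channel types can be sketched in configuration as follows. The agent name a1, the channel names and the two directory paths are assumptions chosen for illustration.

```properties
a1.channels = durable fast

# Durable: events are written to the local file system and survive an agent restart
a1.channels.durable.type = file
a1.channels.durable.checkpointDir = /var/lib/flume/checkpoint
a1.channels.durable.dataDirs = /var/lib/flume/data

# Fast: events sit in an in-memory queue and are lost if the agent process dies
a1.channels.fast.type = memory
a1.channels.fast.capacity = 1000
```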

Setup

Setting up an agent

Flume agent configuration is stored in a local configuration file. This is a text file that follows the Java properties file format.

Configurations for one or more agents can be specified in the same configuration file. The configuration file includes properties of

each source, sink and channel in an agent and how they are wired together to form data flows.

Configuring individual components

Each component (source, sink or channel) in the flow has a name, type, and set of properties that are specific to the type and

instantiation. For example, an Avro source needs a hostname (or IP address) and a port number to receive data from. A memory

channel can have max queue size (“capacity”), and an HDFS sink needs to know the file system URI, path to create files, frequency

of file rotation (“hdfs.rollInterval”) etc. All such attributes of a component need to be set in the properties file of the hosting Flume

agent.
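
A minimal sketch of such per-component properties is shown below. The agent and component names (a1, r1, c1, k1), the port, the capacity and the HDFS path are hypothetical values, not recommendations.

```properties
# Avro source: address and port to receive data on
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141

# Memory channel: maximum number of events held in the queue
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# HDFS sink: target path (including the file system URI) and rotation interval in seconds
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/webdata
a1.sinks.k1.hdfs.rollInterval = 30
```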

Wiring the pieces together

The agent needs to know what individual components to load and how they are connected in order to constitute the flow. This is

done by listing the names of each of the sources, sinks and channels in the agent, and then specifying the connecting channel for

each sink and source. For example, an agent flows events from an Avro source called avroWeb to HDFS sink hdfs-cluster1 via a file

channel called file-channel. The configuration file will contain names of these components and file-channel as a shared channel for

both avroWeb source and hdfs-cluster1 sink.
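
The wiring for that example might look like the fragment below. The agent name agent1 is a placeholder, and the type-specific properties of each component are left out.

```properties
# Components hosted by the agent
agent1.sources = avroWeb
agent1.sinks = hdfs-cluster1
agent1.channels = file-channel

# Both the source and the sink point at the shared channel
agent1.sources.avroWeb.channels = file-channel
agent1.sinks.hdfs-cluster1.channel = file-channel
```

Note the asymmetry in the key names: a source takes `channels` (it may write to several), while a sink takes a single `channel`.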

Starting an agent

An agent is started using a shell script called flume-ng which is located in the bin directory of the Flume distribution. You need to

specify the agent name, the config directory, and the config file on the command line:

$ bin/flume-ng agent -n $agent_name -c conf -f conf/flume-conf.properties.template

Now the agent will start running source and sinks configured in the given properties file.

A simple example

Here, we give an example configuration file, describing a single-node Flume deployment. This configuration lets a user generate

events and subsequently logs them to the console.

# example.conf: A single-node Flume configuration

# Name the components on this agent

a1.sources = r1

a1.sinks = k1

a1.channels = c1

# Describe/configure the source

a1.sources.r1.type = netcat

a1.sources.r1.bind = localhost

a1.sources.r1.port = 44444

# Describe the sink

a1.sinks.k1.type = logger


# Use a channel which buffers events in memory

a1.channels.c1.type = memory

a1.channels.c1.capacity = 1000

a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel

a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1

This configuration defines a single agent named a1. a1 has a source that listens for data on port 44444, a channel that buffers

event data in memory, and a sink that logs event data to the console. The configuration file names the various components, then

describes their types and configuration parameters. A given configuration file might define several named agents; when a given

Flume process is launched a flag is passed telling it which named agent to manifest.

Given this configuration file, we can start Flume as follows:

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console

Note that in a full deployment we would typically include one more option: --conf=<conf-dir>. The <conf-dir> directory would include a

shell script flume-env.sh and potentially a log4j properties file. In this example, we pass a Java option to force Flume to log to the

console and we go without a custom environment script.

From a separate terminal, we can then telnet port 44444 and send Flume an event:

$ telnet localhost 44444

Trying 127.0.0.1...

Connected to localhost.localdomain (127.0.0.1).

Escape character is '^]'.

Hello world! <ENTER>

The original Flume terminal will output the event in a log message.

12/06/19 15:32:19 INFO source.NetcatSource: Source starting

12/06/19 15:32:19 INFO source.NetcatSource: Created serverSocket:sun.nio.ch.ServerSocketChannelImpl[/127.0.0.1:44444]

12/06/19 15:32:34 INFO sink.LoggerSink: Event: { headers:{} body: 48 65 6C 6C 6F 20 77 6F 72 6C 64 21 0D Hello world!. }

Congratulations - you’ve successfully configured and deployed a Flume agent! Subsequent sections cover agent configuration in

much more detail.

Using environment variables in configuration files

Flume has the ability to substitute environment variables in the configuration. For example:

a1.sources = r1

a1.sources.r1.type = netcat

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = ${NC_PORT}

a1.sources.r1.channels = c1

NB: it currently works for values only, not for keys (i.e., only on the “right side” of the = mark of the config lines).

This can be enabled via Java system properties on agent invocation by setting propertiesImplementation =

org.apache.flume.node.EnvVarResolverProperties.

For example:

$ NC_PORT=44444 bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=INFO,console -DpropertiesImplementation=org.apache.flume.node.EnvVarResolverProperties

Note that the above is just an example; environment variables can be configured in other ways, including being set in

conf/flume-env.sh.
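
The values-only restriction can be illustrated with the fragment below. The variable names NC_BIND and SOURCE_NAME are hypothetical; NC_PORT is the variable from the example above.

```properties
# Supported: the variable appears on the value side of the = mark
a1.sources.r1.bind = ${NC_BIND}
a1.sources.r1.port = ${NC_PORT}

# Not supported: a variable on the key side is not substituted
# a1.sources.${SOURCE_NAME}.type = netcat
```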

Logging raw data

Logging the raw stream of data flowing through the ingest pipeline is not desired behaviour in many production environments

because this may result in leaking sensitive data or security related configurations, such as secret keys, to Flume log files. By

default, Flume will not log such information. On the other hand, if the data pipeline is broken, Flume will attempt to provide clues

for debugging the problem.

One way to debug problems with event pipelines is to set up an additional Memory Channel connected to a Logger Sink, which will

output all event data to the Flume logs. In some situations, however, this approach is insufficient.
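
One way to set this up, sketched under the assumption of an existing agent a1 with source r1, channel c1 and sink k1, is to have the source write to a second channel that feeds a Logger Sink. The names debugChannel and debugSink are placeholders.

```properties
# Add the debug channel and sink alongside the existing components
a1.channels = c1 debugChannel
a1.sinks = k1 debugSink

# The source now writes each event to both channels
a1.sources.r1.channels = c1 debugChannel

# Memory channel drained by a Logger Sink, which writes events to the Flume log
a1.channels.debugChannel.type = memory
a1.sinks.debugSink.type = logger
a1.sinks.debugSink.channel = debugChannel
```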

In order to enable logging of event- and configuration-related data, some Java system properties must be set in addition to log4j

properties.

To enable configuration-related logging, set the Java system property -Dorg.apache.flume.log.printconfig=true. This can either be passed on

the command line or by setting this in the JAVA_OPTS variable in flume-env.sh.


To enable data logging, set the Java system property -Dorg.apache.flume.log.rawdata=true in the same way described above. For most

components, the log4j logging level must also be set to DEBUG or TRACE to make event-specific logging appear in the Flume logs.

Here is an example of enabling both configuration logging and raw data logging while also setting the Log4j loglevel to DEBUG for

console output:

Zookeeper based Configuration

Flume supports Agent configurations via Zookeeper. This is an experimental feature. The configuration file needs to be uploaded to

Zookeeper, under a configurable prefix. The configuration file is stored in Zookeeper Node data. The Zookeeper Node tree for

agents a1 and a2 would look like this:

- /flume

|- /a1 [Agent config file]

|- /a2 [Agent config file]

Once the configuration file is uploaded, start the agent with the following options:

$ bin/flume-ng agent --conf conf -z zkhost:2181,zkhost1:2181 -p /flume --name a1 -Dflume.root.logger=INFO,console

Argument Name    Default    Description

z                –          Zookeeper connection string. Comma separated list of hostname:port

p                /flume     Base Path in Zookeeper to store Agent configurations

Installing third-party plugins

Flume has a fully plugin-based architecture. While Flume ships with many out-of-the-box sources, channels, sinks, serializers, and

the like, many implementations exist which ship separately from Flume.

While it has always been possible to include custom Flume components by adding their jars to the FLUME_CLASSPATH variable in

the flume-env.sh file, Flume now supports a special directory called plugins.d which automatically picks up plugins that are packaged

in a specific format. This allows for easier management of plugin packaging issues as well as simpler debugging and troubleshooting

of several classes of issues, especially library dependency conflicts.

The plugins.d directory

The plugins.d directory is located at $FLUME_HOME/plugins.d. At startup time, the flume-ng start script looks in the plugins.d directory for plugins

that conform to the below format and includes them in proper paths when starting up java.

Directory layout for plugins

Each plugin (subdirectory) within plugins.d can have up to three sub-directories:

1. lib - the plugin’s jar(s)

2. libext - the plugin’s dependency jar(s)

3. native - any required native libraries, such as .so files

Example of two plugins within the plugins.d directory:

plugins.d/

plugins.d/custom-source-1/

plugins.d/custom-source-1/lib/my-source.jar

plugins.d/custom-source-1/libext/spring-core-2.5.6.jar

plugins.d/custom-source-2/

plugins.d/custom-source-2/lib/custom.jar

plugins.d/custom-source-2/native/gettext.so

Data ingestion

Flume supports a number of mechanisms to ingest data from external sources.

RPC

An Avro client included in the Flume distribution can send a given file to a Flume Avro source using the Avro RPC mechanism:

$ bin/flume-ng avro-client -H localhost -p 41414 -F /usr/logs/log.10

The above command will send the contents of /usr/logs/log.10 to the Flume source listening on that port.
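
For the command above to succeed, an agent must already be running an Avro source on that host and port. A minimal sketch of the receiving side follows; the agent and component names are assumptions, and the sink that drains the channel is omitted.

```properties
a1.sources = avroIn
a1.channels = c1

# Avro source listening where the avro-client sends its data
a1.sources.avroIn.type = avro
a1.sources.avroIn.bind = localhost
a1.sources.avroIn.port = 41414
a1.sources.avroIn.channels = c1

a1.channels.c1.type = memory
```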

Executing commands

$ bin/flume-ng agent --conf conf --conf-file example.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apach