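The title "Presto-SQL-on-Everything.pdf" indicates that the document discusses the open-source distributed query engine Presto and how it runs SQL queries over a variety of data sources. The description notes that Presto supports much of the SQL analytics workload at Facebook and is adaptive, flexible, and extensible. The tag "SQL engine" confirms that the focus is an in-depth discussion of Presto as a SQL engine. The following key points can be drawn from the content:

1. Presto is an open-source distributed query engine. Its source code is public, so developers in the community are free to view, modify, and distribute it. As a distributed engine, it can handle large datasets by splitting them into smaller parts and running a query on many nodes at once.
2. Presto runs in production at Facebook, which indicates stability and reliability in real-world use. Facebook is a data-intensive company with large volumes of data to process and analyze, so operating in that environment speaks to Presto's performance and scalability.
3. Presto supports many data sources, including Hadoop data warehouses, relational database management systems (RDBMSs), NoSQL systems, and stream processing systems. This makes it a flexible tool that can meet varied data processing needs.
4. Presto's Connector API lets developers write plugins that provide a high-performance I/O interface. The design allows third-party developers to extend Presto so that it can interact efficiently with different data sources.
5. The document mentions Presto's use at several large companies, such as Uber, Netflix, Airbnb, Bloomberg, and LinkedIn. Qubole, Treasure Data, and Starburst Data offer commercial products based on Presto. Presto therefore has broad influence in the technical community and real commercial value.
6. The Amazon Athena interactive querying service is also built on Presto, which suggests that Presto is mature and reliable enough to serve as the foundation of a service run by a company the size of Amazon.
7. Presto has over a hundred contributors on GitHub, which points to a strong open-source community. That community matters for Presto's continued development and improvement, and brings opportunities for knowledge sharing and technical collaboration.
8. The document stresses that Presto is designed to be adaptive, flexible, and extensible. Adaptivity may refer to coping with different data sizes and types; flexibility may concern how Presto executes queries and how users can tailor them; extensibility refers to adding support for new data sources and functionality through plugins.
9. Presto supports use cases ranging from user-facing reporting applications to multi-hour ETL jobs that aggregate or join terabytes of data. The user-facing reporting applications have sub-second latency requirements, while the multi-hour ETL jobs show that Presto can also handle long-running extract, transform, and load tasks.
10. The document also covers Presto's architecture, implementation, features, and performance optimizations, together with performance results. These sections likely describe the technical details of Presto, how its internal components are organized to execute queries efficiently, and how specific features affect the measured performance.

Presto is increasingly important for big data and data warehouse processing. It gives organizations a fast, easy-to-use, and flexible query tool, and it lets data analysts work in the widely used SQL language, which makes extracting insights from data easier. Through its plugin-based design and broad connectivity, Presto plays an important role in modern data processing environments and has become a key analytics tool for many companies and organizations.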

Presto: SQL on Everything

Raghav Sethi, Martin Traverso∗, Dain Sundstrom∗, David Phillips∗, Wenlei Xie, Yutian Sun, Nezih Yigitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte∗, Christopher Berner∗

Facebook, Inc.

Abstract—Presto is an open source distributed query engine that supports much of the SQL analytics workload at Facebook. Presto is designed to be adaptive, flexible, and extensible. It supports a wide variety of use cases with diverse characteristics. These range from user-facing reporting applications with sub-second latency requirements to multi-hour ETL jobs that aggregate or join terabytes of data. Presto’s Connector API allows plugins to provide a high performance I/O interface to dozens of data sources, including Hadoop data warehouses, RDBMSs, NoSQL systems, and stream processing systems. In this paper, we outline a selection of use cases that Presto supports at Facebook. We then describe its architecture and implementation, and call out features and performance optimizations that enable it to support these use cases. Finally, we present performance results that demonstrate the impact of our main design decisions.

Index Terms—SQL, query engine, big data, data warehouse

I. INTRODUCTION

The ability to quickly and easily extract insights from large amounts of data is increasingly important to technology-enabled organizations. As it becomes cheaper to collect and store vast amounts of data, it is important that tools to query this data become faster, easier to use, and more flexible. Using a popular query language like SQL can make data analytics accessible to more people within an organization. However, ease-of-use is compromised when organizations are forced to deploy multiple incompatible SQL-like systems to solve different classes of analytics problems.

Presto is an open-source distributed SQL query engine that has run in production at Facebook since 2013 and is used today by several large companies, including Uber, Netflix, Airbnb, Bloomberg, and LinkedIn. Organizations such as Qubole, Treasure Data, and Starburst Data have commercial offerings based on Presto. The Amazon Athena interactive querying service is built on Presto. With over a hundred contributors on GitHub, Presto has a strong open source community.

Presto is designed to be adaptive, flexible, and extensible. It provides an ANSI SQL interface to query data stored in Hadoop environments, open-source and proprietary RDBMSs, NoSQL systems, and stream processing systems such as Kafka. A ‘Generic RPC’ connector makes adding a SQL interface to proprietary systems as easy as implementing a half dozen RPC endpoints. Presto exposes an open HTTP API, ships with JDBC support, and is compatible with several industry-standard business intelligence (BI) and query authoring tools. The built-in Hive connector can natively read from and write to distributed file systems such as HDFS and Amazon S3; and supports several popular open-source file formats including ORC, Parquet, and Avro.

∗ Author was affiliated with Facebook, Inc. during the contribution period.

Amazon Athena: https://aws.amazon.com/athena

The ‘Generic RPC’ connector uses Thrift, an interface definition language and RPC protocol used for defining and creating services in multiple languages.
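As an illustration of the open HTTP API mentioned in the introduction, the following is a minimal client sketch. It assumes the client protocol as described in the Presto documentation: a `POST` to `/v1/statement` with the SQL text as the body and an `X-Presto-User` header, followed by `GET` requests to each `nextUri` until none is returned. The coordinator address and the helper names `build_statement_request` and `run_query` are hypothetical and not taken from the paper.

```python
import json
import urllib.request

# Hypothetical coordinator address; replace with a real deployment's host and port.
COORDINATOR = "http://presto-coordinator.example.com:8080"


def build_statement_request(sql: str, user: str) -> urllib.request.Request:
    """Build (but do not send) the initial request that submits a SQL statement."""
    return urllib.request.Request(
        url=f"{COORDINATOR}/v1/statement",
        data=sql.encode("utf-8"),         # the request body is the SQL text itself
        headers={"X-Presto-User": user},  # identifies the submitting user
        method="POST",
    )


def run_query(sql: str, user: str) -> list:
    """Submit a statement and follow `nextUri` links until the result is complete."""
    rows = []
    with urllib.request.urlopen(build_statement_request(sql, user)) as resp:
        page = json.load(resp)
    while True:
        if "error" in page:
            raise RuntimeError(page["error"])
        rows.extend(page.get("data", []))  # pages without rows omit "data"
        next_uri = page.get("nextUri")
        if next_uri is None:               # no further pages: the query is finished
            return rows
        with urllib.request.urlopen(next_uri) as resp:
            page = json.load(resp)
```

Calling `run_query("SELECT 1", "alice")` against a reachable coordinator would return the result rows as a list; nothing is sent over the network until that call is made.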

As of late 2018, Presto is responsible for supporting much of the SQL analytic workload at Facebook, including interactive/BI queries and long-running batch extract-transform-load (ETL) jobs. In addition, Presto powers several end-user facing analytics tools, serves high performance dashboards, provides a SQL interface to multiple internal NoSQL systems, and supports Facebook’s A/B testing infrastructure. In aggregate, Presto processes hundreds of petabytes of data and quadrillions of rows per day at Facebook.

Presto has several notable characteristics:

• It is an adaptive multi-tenant system capable of concurrently running hundreds of memory, I/O, and CPU-intensive queries, and scaling to thousands of worker nodes while efficiently utilizing cluster resources.

• Its extensible, federated design allows administrators to set up clusters that can process data from many different data sources even within a single query. This reduces the complexity of integrating multiple systems.

• It is flexible, and can be configured to support a vast variety of use cases with very different constraints and performance characteristics.

• It is built for high performance, with several key related features and optimizations, including code-generation. Multiple running queries share a single long-lived Java Virtual Machine (JVM) process on worker nodes, which reduces response time, but requires integrated scheduling, resource management and isolation.

The primary contribution of this paper is to describe the design of the Presto engine, discussing the specific optimizations and trade-offs required to achieve the characteristics we described above. The secondary contributions are performance results for some key design decisions and optimizations, and a description of lessons learned while developing and maintaining Presto.

Presto was originally developed to enable interactive querying over the Facebook data warehouse. It evolved over time to support several different use cases, a few of which we describe in Section II. Rather than studying this evolution, we describe both the engine and use cases as they exist today, and call out main features and functionality as they relate to these use cases. The rest of the paper is structured as follows. In Section III, we provide an architectural overview, and then dive into system design in Section IV. We then describe some important performance optimizations in Section V, present performance results in Section VI, and engineering lessons we learned while developing Presto in Section VII. Finally, we outline key related work in Section VIII, and conclude in Section IX. Presto is under active development, and significant new functionality is added frequently. In this paper, we describe Presto as of version 0.211, released in September 2018.

II. USE CASES

At Facebook, we operate numerous Presto clusters (with sizes up to ∼1000 nodes) and support several different use cases. In this section we select four diverse use cases with large deployments and describe their requirements.

A. Interactive Analytics

Facebook operates a massive multi-tenant data warehouse as an internal service, where several business functions and organizational units share a smaller set of managed clusters. Data is stored in a distributed filesystem and metadata is stored in a separate service. These systems have APIs similar to that of HDFS and the Hive metastore service, respectively. We refer to this as the ‘Facebook data warehouse’, and use a variant of the Presto ‘Hive’ connector to read from and write to it.

Facebook engineers and data scientists routinely examine small amounts of data (∼50GB-3TB compressed), test hypotheses, and build visualizations or dashboards. Users often rely on query authoring tools, BI tools, or Jupyter notebooks. Individual clusters are required to support 50-100 concurrent running queries with diverse query shapes, and return results within seconds or minutes. Users are highly sensitive to end-to-end wall clock time, and may not have a good intuition of query resource requirements. While performing exploratory analysis, users may not require that the entire result set be returned. Queries are often canceled after initial results are returned, or use LIMIT clauses to restrict the amount of result data the system should produce.
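The benefit of a LIMIT clause for exploratory queries can be illustrated with a toy sketch in which rows are produced lazily, so that the upstream scan stops as soon as the limit is reached. This is not Presto's implementation; `scan_rows` and `limit` are hypothetical stand-ins for a table scan and a limit operator.

```python
from itertools import islice
from typing import Iterator


def scan_rows(total: int, produced: list) -> Iterator[int]:
    """Hypothetical table scan: yields rows lazily and records each row it produces."""
    for row in range(total):
        produced.append(row)
        yield row


def limit(rows: Iterator[int], n: int) -> Iterator[int]:
    """Stop pulling from the upstream scan once n rows have been returned."""
    return islice(rows, n)


produced: list = []
first_rows = list(limit(scan_rows(1_000_000, produced), 5))
```

Although the scan could yield a million rows, only the five rows requested by the limit are ever produced, which is the behavior an interactive user relies on when inspecting a sample of a large table.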

B. Batch ETL

The data warehouse we described above is populated with fresh data at regular intervals using ETL queries. Queries are scheduled by a workflow management system that determines dependencies between tasks and schedules them accordingly. Presto supports users migrating from legacy batch processing systems, and ETL queries now make up a large fraction of the Presto workload at Facebook by CPU. These queries are typically written and optimized by data engineers. They tend to be much more resource intensive than queries in the Interactive Analytics use case, and often involve performing CPU-heavy transformations and memory-intensive (multiple TBs of distributed memory) aggregations or joins with other large tables. Query latency is somewhat less important than resource efficiency and overall cluster throughput.
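The paper does not describe the workflow management system itself. As a generic sketch of dependency-based scheduling, the example below orders hypothetical ETL tasks with a topological sort from the Python standard library; the task names and the `schedule_in_waves` helper are assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical ETL tasks: each key depends on the tasks in its value set.
dependencies = {
    "load_raw_events": set(),
    "load_user_dim": set(),
    "sessionize_events": {"load_raw_events"},
    "daily_user_metrics": {"sessionize_events", "load_user_dim"},
}


def schedule_in_waves(deps: dict) -> list:
    """Return batches of tasks; every task in a batch has all dependencies satisfied."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that could run concurrently
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished to unlock dependents
    return waves


waves = schedule_in_waves(dependencies)
```

Here the two load tasks form the first wave, `sessionize_events` the second, and `daily_user_metrics` the third. A real scheduler would also account for cluster capacity, retries, and data arrival times.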

C. A/B Testing

A/B testing is used at Facebook to evaluate the impact of product changes through statistical hypothesis testing. Much of the A/B test infrastructure at Facebook is built on Presto. Users expect test results to be available in hours (rather than days) and the data to be complete and accurate. It is also important for users to be able to perform arbitrary slice and dice on their results at interactive latency (∼5-30s) to gain deeper insights. It is difficult to satisfy this requirement by pre-aggregating data, so results must be computed on the fly. Producing results requires joining multiple large data sets, which include user, device, test, and event attributes. Query shapes are restricted to a small set since queries are programmatically generated.
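To make the idea of programmatically generated queries concrete, the sketch below builds a slice-and-dice query from a whitelist of dimensions and metrics, so only a small family of query shapes can be produced. The table name `ab_test_events`, the column names, and the `build_slice_query` helper are all hypothetical; the paper does not specify Facebook's actual generator or schema.

```python
# Hypothetical whitelist of dimensions and metrics; names are illustrative only.
ALLOWED_DIMENSIONS = {"country", "device_type", "app_version"}
ALLOWED_METRICS = {"clicks": "SUM(clicks)", "users": "COUNT(DISTINCT user_id)"}


def build_slice_query(test_id: int, dimensions: list, metric: str) -> str:
    """Generate one of a small, fixed family of query shapes for an A/B test."""
    unknown = set(dimensions) - ALLOWED_DIMENSIONS
    if unknown:
        raise ValueError(f"unsupported dimensions: {sorted(unknown)}")
    if metric not in ALLOWED_METRICS:
        raise ValueError(f"unsupported metric: {metric}")
    cols = ", ".join(["test_group"] + list(dimensions))
    return (
        f"SELECT {cols}, {ALLOWED_METRICS[metric]} AS {metric} "
        f"FROM ab_test_events "  # hypothetical table name
        f"WHERE test_id = {int(test_id)} "
        f"GROUP BY {cols}"
    )


query = build_slice_query(42, ["country"], "clicks")
```

Because every generated query follows the same template, the set of shapes the engine has to handle stays small even though users can choose the dimensions freely.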

D. Developer/Advertiser Analytics

Several custom reporting tools for external developers and advertisers are built on Presto. One example deployment of this use case is Facebook Analytics, which offers advanced analytics tools to developers that build applications which use the Facebook platform. These deployments typically expose a web interface that can generate a restricted set of query shapes. Data volumes are large in aggregate, but queries are highly selective, as users can only access data for their own applications or ads. Most query shapes contain joins, aggregations or window functions. Data ingestion latency is on the order of minutes. There are very strict query latency requirements (∼50ms-5s) as the tooling is meant to be interactive. Clusters must have 99.999% availability and support hundreds of concurrent queries given the volume of users.
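To put the 99.999% availability requirement in perspective, the short calculation below converts an availability fraction into permitted downtime. It assumes a 365-day year and that availability is measured as a fraction of wall-clock time; the paper does not state the measurement window.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds, ignoring leap years


def allowed_downtime_seconds(availability: float,
                             period_seconds: int = SECONDS_PER_YEAR) -> float:
    """Maximum downtime permitted over a period for a given availability fraction."""
    return (1.0 - availability) * period_seconds


downtime = allowed_downtime_seconds(0.99999)
```

Under these assumptions, the budget comes to about 315 seconds, a little over five minutes of downtime per year.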

III. ARCHITECTURE OVERVIEW

A Presto cluster consists of a single coordinator node and one or more worker nodes. The coordinator is responsible for admitting, parsing, planning and optimizing queries as well as query orchestration. Worker nodes are responsible for query processing. Figure 1 shows a simplified view of Presto architecture.
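The division of labor between the coordinator and the workers can be illustrated with a toy sketch. It is a loose analogy rather than Presto's design: threads stand in for worker nodes, parsing and optimization are omitted, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_split(split: list) -> int:
    """Worker role: process one split of the data and return a partial sum."""
    return sum(split)


def merge_partials(partials: list) -> int:
    """Worker role: combine partial sums into the final aggregate."""
    return sum(partials)


def orchestrate(data: list, num_workers: int) -> int:
    """Coordinator role: divide the work into splits and dispatch tasks to workers."""
    splits = [data[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partials = list(workers.map(process_split, splits))
        return workers.submit(merge_partials, partials).result()


total = orchestrate(list(range(1, 101)), num_workers=4)
```

The orchestrating function never touches the data values itself; it only decides how the work is divided and where each task runs, while all computation happens in the worker tasks.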

[Figure 1 labels: Coordinator, Queue, Planner/Optimizer, Scheduler, Metadata API, Data Location API, Worker, Processor, Data Source API, External Storage System, Query, Results]

Fig. 1. Presto Architecture

The client sends an HTTP request containing a SQL statement to the coordinator. The coordinator processes the request

Facebook Analytics: https://analytics.facebook.com
