Presto: SQL on Everything
Raghav Sethi, Martin Traverso
∗
, Dain Sundstrom
∗
, David Phillips
∗
, Wenlei Xie, Yutian Sun,
Nezih Yigitbasi, Haozhun Jin, Eric Hwang, Nileema Shingte
∗
, Christopher Berner
∗
Facebook, Inc.
Abstract—Presto is an open source distributed query engine
that supports much of the SQL analytics workload at Facebook.
Presto is designed to be adaptive, flexible, and extensible. It
supports a wide variety of use cases with diverse characteristics.
These range from user-facing reporting applications with sub-
second latency requirements to multi-hour ETL jobs that aggre-
gate or join terabytes of data. Presto’s Connector API allows
plugins to provide a high performance I/O interface to dozens
of data sources, including Hadoop data warehouses, RDBMSs,
NoSQL systems, and stream processing systems. In this paper, we
outline a selection of use cases that Presto supports at Facebook.
We then describe its architecture and implementation, and call
out features and performance optimizations that enable it to
support these use cases. Finally, we present performance results
that demonstrate the impact of our main design decisions.
Index Terms—SQL, query engine, big data, data warehouse
I. INTRODUCTION
The ability to quickly and easily extract insights from large
amounts of data is increasingly important to technology-
enabled organizations. As it becomes cheaper to collect and
store vast amounts of data, it is important that tools to query
this data become faster, easier to use, and more flexible. Using
a popular query language like SQL can make data analytics
accessible to more people within an organization. However,
ease-of-use is compromised when organizations are forced
to deploy multiple incompatible SQL-like systems to solve
different classes of analytics problems.
Presto is an open-source distributed SQL query engine that
has run in production at Facebook since 2013 and is used today
by several large companies, including Uber, Netflix, Airbnb,
Bloomberg, and LinkedIn. Organizations such as Qubole,
Treasure Data, and Starburst Data have commercial offerings
based on Presto. The Amazon Athena
1
interactive querying
service is built on Presto. With over a hundred contributors
on GitHub, Presto has a strong open source community.
Presto is designed to be adaptive, flexible, and extensible.
It provides an ANSI SQL interface to query data stored in
Hadoop environments, open-source and proprietary RDBMSs,
NoSQL systems, and stream processing systems such as
Kafka. A ‘Generic RPC’
2
connector makes adding a SQL
interface to proprietary systems as easy as implementing a
half dozen RPC endpoints. Presto exposes an open HTTP
API, ships with JDBC support, and is compatible with sev-
eral industry-standard business intelligence (BI) and query
∗
Author was affiliated with Facebook, Inc. during the contribution period.
1
https://aws.amazon.com/athena
2
Using Thrift, an interface definition language and RPC protocol used for
defining and creating services in multiple languages.
authoring tools. The built-in Hive connector can natively read
from and write to distributed file systems such as HDFS and
Amazon S3; and supports several popular open-source file
formats including ORC, Parquet, and Avro.
As of late 2018, Presto is responsible for supporting much
of the SQL analytic workload at Facebook, including interac-
tive/BI queries and long-running batch extract-transform-load
(ETL) jobs. In addition, Presto powers several end-user facing
analytics tools, serves high performance dashboards, provides
a SQL interface to multiple internal NoSQL systems, and
supports Facebook’s A/B testing infrastructure. In aggregate,
Presto processes hundreds of petabytes of data and quadrillions
of rows per day at Facebook.
Presto has several notable characteristics:
• It is an adaptive multi-tenant system capable of concur-
rently running hundreds of memory, I/O, and CPU-intensive
queries, and scaling to thousands of worker nodes while
efficiently utilizing cluster resources.
• Its extensible, federated design allows administrators to
set up clusters that can process data from many different
data sources even within a single query. This reduces the
complexity of integrating multiple systems.
• It is flexible, and can be configured to support a vast variety
of use cases with very different constraints and performance
characteristics.
• It is built for high performance, with several key related
features and optimizations, including code-generation. Mul-
tiple running queries share a single long-lived Java Virtual
Machine (JVM) process on worker nodes, which reduces
response time, but requires integrated scheduling, resource
management and isolation.
The primary contribution of this paper is to describe the design
of the Presto engine, discussing the specific optimizations and
trade-offs required to achieve the characteristics we described
above. The secondary contributions are performance results for
some key design decisions and optimizations, and a description
of lessons learned while developing and maintaining Presto.
Presto was originally developed to enable interactive query-
ing over the Facebook data warehouse. It evolved over time to
support several different use cases, a few of which we describe
in Section II. Rather than studying this evolution, we describe
both the engine and use cases as they exist today, and call
out main features and functionality as they relate to these use
cases. The rest of the paper is structured as follows. In Section
III, we provide an architectural overview, and then dive into
system design in Section IV. We then describe some important