Parallel Grouped Aggregation in DuckDB
https://duckdb.org/2022/03/07/aggregate-hashtable.html
TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently
aggregate over millions of groups.
Grouped aggregation is a core data analysis operation. It is particularly important for
large-scale data analysis ("OLAP") because it is used to compute statistical
summaries of huge tables. DuckDB contains a highly optimized parallel aggregation
capability for fast and scalable summarization.
Jump straight to the benchmarks?
Introduction
GROUP BY changes the result set cardinality: instead of returning the same number of
rows as the input (like a normal SELECT), GROUP BY returns as many rows as there are
groups in the data. Consider this (weirdly familiar) example query:
```sql
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_extendedprice),
    avg(l_quantity)
FROM
    lineitem
GROUP BY
    l_returnflag,
    l_linestatus;
```
GROUP BY is followed by two column names, l_returnflag and l_linestatus. Those are
the columns to compute the groups on, and the resulting table will contain all
combinations of values in those columns that occur in the data. We refer to the columns in the
GROUP BY clause as the "grouping columns" and all occurring combinations of values
therein as "groups". The SELECT clause contains four expressions: references
to the grouping columns, and two aggregates: the sum over l_extendedprice and the avg
over l_quantity. We refer to those as the "aggregates". If executed, the result of this
query looks something like this:
| l_returnflag | l_linestatus | sum(l_extendedprice) | avg(l_quantity) |
|--------------|--------------|----------------------|-----------------|
| N            | O            | 114935210409.19      | 25.5            |
| R            | F            | 56568041380.9        | 25.51           |
| A            | F            | 56586554400.73       | 25.52           |
| N            | F            | 1487504710.38        | 25.52           |
In general, SQL allows only columns that are mentioned in the GROUP BY clause to
appear directly in the SELECT expressions; all other columns must be wrapped in one of
the aggregate functions like sum, avg, etc. Many more aggregate functions are available
depending on which SQL system you use.
How should a query processing engine compute such an aggregation? There are many
design decisions involved, and we will discuss those below and in particular the
decisions made by DuckDB. The main issue when computing grouping results is that
the groups can occur in the input table in any order. Were the input already sorted on
the grouping columns, computing the aggregation would be trivial, as we could just
compare the current values for the grouping columns with the previous ones. If a
change occurs, the next group begins and a new aggregation result needs to be
computed. Since the sorted case is easy, one straightforward way of computing
grouped aggregates is to sort the input table on the grouping columns first, and then
use the trivial approach. But sorting the input is unfortunately still a computationally
expensive operation despite our best efforts. In general, sorting has a computational
complexity of O(n log n), with n being the number of rows sorted.
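The sort-then-scan approach described above can be sketched in Python. This is an illustrative toy, not DuckDB's implementation; the row layout and column names follow the example query, and the input rows are made up:

```python
# Sort-based grouped aggregation: sort on the grouping columns, then scan once,
# emitting a result row whenever the group key changes.
rows = [
    # (l_returnflag, l_linestatus, l_extendedprice, l_quantity)
    ("N", "O", 100.0, 20),
    ("A", "F", 50.0, 30),
    ("N", "O", 200.0, 40),
    ("A", "F", 150.0, 10),
]

def sorted_aggregate(rows):
    # O(n log n): sort on the grouping columns first.
    rows = sorted(rows, key=lambda r: (r[0], r[1]))
    results = []
    current_key, price_sum, qty_sum, count = None, 0.0, 0, 0
    for flag, status, price, qty in rows:
        key = (flag, status)
        if key != current_key:
            # Key changed: the previous group is complete, emit its result.
            if current_key is not None:
                results.append((*current_key, price_sum, qty_sum / count))
            current_key, price_sum, qty_sum, count = key, 0.0, 0, 0
        price_sum += price
        qty_sum += qty
        count += 1
    if current_key is not None:
        results.append((*current_key, price_sum, qty_sum / count))
    return results
```

The scan itself is a single O(n) pass; the sort dominates the overall cost.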
Hash Tables for Aggregation
A better way is to use a hash table. Hash tables are a foundational data structure in
computing that allow us to find entries with a computational complexity of O(1) . A full
discussion on how hash tables work is far beyond the scope of this post. Below we try
to focus on a very basic description and considerations related to aggregate
computation.
Figure: O(n) plotted against O(n log n) to illustrate scaling behavior
To add n rows to a hash table we are looking at a complexity of O(n), much, much
better than O(n log n) for sorting, especially when n goes into the billions. The figure
above illustrates how the complexity develops as the table size increases. Another big
advantage is that we do not have to make a sorted copy of the input first, which is going
to be just as large as the input. Instead, the hash table will have at most as many
entries as there are groups, which can be (and usually are) dramatically fewer than
input rows. The overall process is thus this: Scan the input table, and for each row,
update the hash table accordingly. Once the input is exhausted, we scan the hash table
to provide rows to upstream operators or the query result directly.
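The two-phase process just described can be sketched in Python, using a built-in dict as a stand-in for the aggregate hash table. Again an illustrative toy, with made-up input rows and the column names from the example query:

```python
def hash_aggregate(rows):
    # Phase 1: scan the input once; each row updates its group's aggregate
    # state in O(1), so the whole pass is O(n).
    states = {}  # (l_returnflag, l_linestatus) -> [price_sum, qty_sum, count]
    for flag, status, price, qty in rows:
        state = states.setdefault((flag, status), [0.0, 0, 0])
        state[0] += price
        state[1] += qty
        state[2] += 1
    # Phase 2: scan the hash table to produce one result row per group.
    return [
        (flag, status, price_sum, qty_sum / count)
        for (flag, status), (price_sum, qty_sum, count) in states.items()
    ]

rows = [
    ("N", "O", 100.0, 20),
    ("A", "F", 50.0, 30),
    ("N", "O", 200.0, 40),
    ("A", "F", 150.0, 10),
]
```

Note the table holds one entry per group, not per input row, so its size is bounded by the number of distinct groups.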
Collision Handling
So, hash table it is then! We build a hash table on the input with the groups as keys and
the aggregates as the entries. Then, for every input row, we compute a hash of the
group values, find the entry in the hash table, and either create or update the aggregate
states with the values from the row? It's unfortunately not that simple: two rows with
different values for the grouping columns may produce a hash that points to the same
hash table entry, which would lead to incorrect results.
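One standard way to handle such collisions is open addressing with linear probing: each slot stores the group key next to its aggregate state, so a colliding entry can be recognized and skipped. The sketch below illustrates the idea for a single sum/count aggregate; it is a simplified stand-in, not DuckDB's actual implementation (which, among other things, must also grow the table):

```python
class ProbingAggregateTable:
    """Open-addressing hash table with linear probing. Storing the group key
    alongside the aggregate state lets us tell colliding keys apart."""

    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [None] * capacity  # None = empty; else (key, [sum, count])

    def update(self, key, price):
        idx = hash(key) % self.capacity
        while True:
            slot = self.slots[idx]
            if slot is None:
                # Empty slot: create a fresh aggregate state for this group.
                self.slots[idx] = (key, [price, 1])
                return
            if slot[0] == key:
                # Same group key: update its running sum and count.
                slot[1][0] += price
                slot[1][1] += 1
                return
            # Collision with a *different* key: probe the next slot.
            idx = (idx + 1) % self.capacity
```

On a collision, the key comparison is what prevents two different groups from being merged into one aggregate state.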