# RDMA-bench
A framework to understand RDMA performance. This is the source code for our
[USENIX ATC paper](http://www.cs.cmu.edu/~akalia/doc/atc16/rdma_bench_atc.pdf).
## Required hardware and software
* InfiniBand HCAs. Some C++ benchmarks work with RoCE HCAs.
* Linux-based OS with RDMA drivers (Mellanox OFED or upstream OFED). Ubuntu,
RHEL, and CentOS have been tested.
* Required packages: cmake, memcached, gflags, libmemcached-dev, libnuma-dev
* Root access is required only for hugepages.
## Required settings
All benchmarks require one server machine and multiple client machines. Every
benchmark is contained in one directory.
* The number of client machines required is described in each benchmark's README
file. The server will wait for all clients to launch, so the benchmarks won't
make progress until the correct number of clients are launched.
* Modify `HRD_REGISTRY_IP` in `run-servers.sh` and `run-machines.sh` to the IP
address of the server machine. The server runs a memcached instance that is
used as a queue pair registry.
* Allocate hugepages on all machines, and set unlimited SHM limits:
```
sudo bash -c "echo 8192 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages"
sudo bash -c "echo kernel.shmmax = 9223372036854775807 >> /etc/sysctl.conf"
sudo bash -c "echo kernel.shmall = 1152921504606846720 >> /etc/sysctl.conf"
sudo sysctl -p /etc/sysctl.conf
```
## Benchmark description
The benchmarks used in the paper are described below. This repository contains
other benchmarks as well.
| Benchmark | Description |
| ------------- | ------------- |
| `herd` | An improved implementation of the [HERD key-value cache](http://www.cs.cmu.edu/~akalia/doc/sigcomm14/herd_readable.pdf). |
| `mica` | A simplified implementation of [MICA](https://github.com/efficient/mica). |
| | |
| `atomics-sequencer` | Sequencer using one-sided fetch-and-add. Also emulates DrTM-KV. |
| `ws-sequencer` | Sequencer using HERD RPCs (UC WRITE requests, UD SEND responses). |
| `ss-sequencer` | Sequencer using header-only datagram RPCs (i.e., UD SENDs only). |
| | |
| `rw-tput-sender` | Microbenchmark to measure throughput of outbound READs and WRITEs. |
| `rw-tput-receiver` | Microbenchmark to measure throughput of inbound READs and WRITEs. |
| `ud-sender` | Microbenchmark to measure throughput of outbound SENDs. |
| `ud-receiver` | Microbenchmark to measure throughput of inbound SENDs. |
| | |
| `rw-allsig` | WQE cache misses for outbound READs and WRITEs. |
| | |
| `write-incomplete` | This PoC shows that a completed WRITE can be invisible to the remote CPU. |
| `write-reordering` | A test for left-to-right ordering of WRITEs. |
## Implementation details
### `libhrd`
The `libhrd` library is used to implement all benchmarks. It consists of
convenience functions for initial RDMA setup, such as creating and connecting
QPs, and allocating hugepage memory.
### Memcached
Distributing QP information (required for connection setup in connected
transports, and routing in datagram transports) requires a temporary out-of-band
communication channel. To simplify this process, we use a `memcached` instance
to publish QP information (e.g., `hrd_publish_conn_qp()`) and pull it (e.g.,
`hrd_get_published_qp()`) using global QP names.
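The publish/pull pattern can be illustrated with a small in-memory table standing in for `memcached`. This is only a sketch of the idea, not `libhrd` code: the `qp_record` fields, the `registry_publish()` and `registry_get()` helpers, and the name format are all hypothetical, and a real registry would be reached over the network rather than through a local array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NAME_SIZE 64    /* Hypothetical limit on a global QP name */
#define MAX_ENTRIES 128 /* Hypothetical registry capacity */

/* Hypothetical record: the minimum a peer needs to address a QP. */
struct qp_record {
  char name[NAME_SIZE]; /* Global QP name, used as the lookup key */
  uint16_t lid;         /* Port LID of the publishing machine */
  uint32_t qpn;         /* Queue pair number */
};

/* In-memory stand-in for the memcached instance. */
struct registry {
  struct qp_record entries[MAX_ENTRIES];
  size_t count;
};

/* Publish a QP under a global name. Returns 0 on success, -1 on failure. */
static int registry_publish(struct registry *r, const char *name,
                            uint16_t lid, uint32_t qpn) {
  assert(r != NULL && name != NULL);
  if (r->count == MAX_ENTRIES || strlen(name) >= NAME_SIZE) return -1;

  struct qp_record *e = &r->entries[r->count++];
  strcpy(e->name, name); /* Safe: length was checked above */
  e->lid = lid;
  e->qpn = qpn;
  return 0;
}

/* Look up a QP by name. NULL means "not published yet"; a real client
 * would retry until its peer has published. */
static const struct qp_record *registry_get(const struct registry *r,
                                            const char *name) {
  assert(r != NULL && name != NULL);
  for (size_t i = 0; i < r->count; i++) {
    if (strcmp(r->entries[i].name, name) == 0) return &r->entries[i];
  }
  return NULL;
}
```

A server thread would call `registry_publish()` once per QP it creates, and each client would call `registry_get()` in a loop until the name it expects shows up.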
### Client connection logic
The code was written for a cluster with dual-port NICs whose switch
connectivity does not allow cross-port communication. Using both ports in this
constrained environment slightly complicates the initial QP connection setup.
All benchmarks also work on single-port NICs. We usually use the following
logic to set up connections:
* There are `N` client threads in the system and each client thread uses `Q` QPs.
* The server has `num_server_ports` ports starting from port `base_port_index`.
Similarly, clients have `num_client_ports`. The `base_port_index` may be
different for server and clients.
* On the CIB cluster, port `i` on a NIC can only communicate with port `i`
on other NICs, so `base_port_index` must be the same for clients and server,
and `num_client_ports == num_server_ports`.
* One server thread (the master thread in case there are worker threads) creates
`N * Q` QPs on each server port. For applications requiring a request region,
only one memory region is created and registered with all of the
`num_server_ports` control blocks. Only some of these QPs actually get used by
clients.
* Client threads have a global index `clt_i`. Each client thread uses a single
control block and creates all its QPs on the client port with index
`clt_i % num_client_ports`, counted from the client's `base_port_index`. It
connects all these QPs to QPs on the server port with index
`clt_i % num_server_ports`, counted from the server's `base_port_index`. This
works both on CIB and on the Apt and Intel clusters, which support any-to-any
communication between ports.
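The port selection described in the list above can be sketched as follows. The `port_pair` struct and the `pick_ports()` and `same_index_ok()` helpers are hypothetical names used for illustration; only the modulo mapping and the `base_port_index` offsets come from the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: the physical ports used by one client thread. */
struct port_pair {
  size_t client_port; /* Port on the client's NIC */
  size_t server_port; /* Port on the server's NIC */
};

/* Map a client thread's global index to a client port and a server port.
 * Each side counts its ports from its own base_port_index. */
static struct port_pair pick_ports(size_t clt_i,
                                   size_t client_base_port_index,
                                   size_t num_client_ports,
                                   size_t server_base_port_index,
                                   size_t num_server_ports) {
  assert(num_client_ports > 0 && num_server_ports > 0);
  struct port_pair p;
  p.client_port = client_base_port_index + clt_i % num_client_ports;
  p.server_port = server_base_port_index + clt_i % num_server_ports;
  return p;
}

/* On a cluster where port i can only talk to port i, a mapping is usable
 * only if both ends land on the same port index. */
static bool same_index_ok(struct port_pair p) {
  return p.client_port == p.server_port;
}
```

With equal base indices and equal port counts, `same_index_ok()` holds for every `clt_i`, which is why those two conditions are required on CIB. On clusters with any-to-any port connectivity the check is unnecessary.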
### Selective signaling logic
Most benchmarks post one signaled work request per `UNSIG_BATCH` work requests.
This is done to reduce CQE DMAs. With `UNSIG_BATCH = 4`, a sequence of work
requests looks as follows. Note that a work request is **not** `post()`ed
immediately; it is added to a list and posted when the number of work requests
in the list equals `postlist`.
```
wr 0 -> signaled
wr 1 -> unsignaled
wr 2 -> unsignaled
wr 3 -> unsignaled
Poll for wr 0's completion. A postlist should have ended.
wr 4 -> signaled
wr 5 -> unsignaled
...
Poll for wr 4's completion. Another postlist should have ended.
```
This scheme imposes two requirements:
* **Postlist check:** `postlist <= UNSIG_BATCH`. We poll for a completion
**before** queueing the (`UNSIG_BATCH + 1`)-th work request (`wr 4` in the
example above). If `postlist > UNSIG_BATCH`, nothing will have been posted at
this point, so polling will get stuck.
* **Queue capacity check:** `HRD_Q_DEPTH >= 2 * UNSIG_BATCH`. With the above
scheme, up to `2 * UNSIG_BATCH - 1` work requests can be un-ACKed by the QP.
With a QP of size `N`, the InfiniBand/RoCE specification allows `N - 1` work
requests to be un-ACKed.
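The two checks can be written down as a small model of the scheme. This is a minimal sketch, not the benchmarks' code: the helper names are hypothetical, work requests are numbered from 0 as in the example above, and the list is assumed to be posted exactly when it holds `postlist` work requests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if work request wr_i carries the signaled flag. */
static bool wr_is_signaled(size_t wr_i, size_t unsig_batch) {
  assert(unsig_batch > 0);
  return wr_i % unsig_batch == 0;
}

/* True if we poll for a completion before queueing work request wr_i. */
static bool poll_before_wr(size_t wr_i, size_t unsig_batch) {
  assert(unsig_batch > 0);
  return wr_i > 0 && wr_i % unsig_batch == 0;
}

/* Work requests actually posted after wr_count have been queued, when the
 * list is posted every time it reaches postlist entries. */
static size_t num_posted(size_t wr_count, size_t postlist) {
  assert(postlist > 0);
  return (wr_count / postlist) * postlist;
}

/* Postlist check for one poll point: the signaled work request we wait for
 * (wr_i - unsig_batch) must already have been posted. */
static bool poll_can_complete(size_t wr_i, size_t unsig_batch,
                              size_t postlist) {
  assert(poll_before_wr(wr_i, unsig_batch));
  return num_posted(wr_i, postlist) > wr_i - unsig_batch;
}

/* Both requirements from the text, checked on the configuration itself. */
static bool config_is_valid(size_t postlist, size_t unsig_batch,
                            size_t q_depth) {
  return postlist > 0 && unsig_batch > 0 && postlist <= unsig_batch &&
         q_depth >= 2 * unsig_batch;
}
```

For example, with `UNSIG_BATCH = 4` and `postlist = 5`, `poll_can_complete(4, 4, 5)` is false because no list has been posted when the first poll happens. A check like `config_is_valid()` could be run once at startup, before any work request is queued.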
## Work in progress
The benchmarks are being ported to use C++ and CMake. Some benchmarks will
continue to use C (i.e., `libhrd`); others will move to C++ (i.e., `libhrd_cpp`).
## Contact
Anuj Kalia (akalia@cs.cmu.edu)
## License
Copyright 2016, Carnegie Mellon University
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.