DBCSR: Distributed Block Compressed Sparse Row Matrix Library (Fortran/C)

398 files in total, including 114 F (Fortran) files, 45 md files, and 27 py files.


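DBCSR (Distributed Block Compressed Sparse Row) is a library designed for efficient computation with large sparse matrices. It is used mainly in scientific computing, engineering simulation, and data analysis, and it performs particularly well in parallel computing environments. The library supports both Fortran and C, which makes it widely applicable.

In high-performance computing, sparse matrices strain memory and compute resources because storing and operating on their many zero elements is wasteful. DBCSR uses a block-compressed storage scheme: the nonzero elements of a matrix are stored in blocks, which lowers memory consumption and improves data access patterns during computation, and therefore speed.

Core features of DBCSR:

1. **Distributed memory**: sparse matrices are stored and processed across multiple processors or cluster nodes, which makes large-scale computations feasible.
2. **Block compression**: the matrix is divided into small blocks, each stored in compressed form, which reduces memory requirements and suits parallel computation.
3. **Parallel algorithms**: efficient parallel algorithms, such as matrix-matrix multiplication and solving linear systems, make full use of multi-core processors or distributed computing resources.
4. **Interface compatibility**: Fortran and C APIs make it easy to integrate with scientific computing software and frameworks.
5. **Flexibility**: matrix blocks of different sizes are supported, to fit different applications and hardware environments.
6. **Extensibility**: the library can be extended to take advantage of newer hardware features as hardware evolves (the archive includes CUDA, HIP, and OpenCL accelerator sources).

Points to keep in mind when using DBCSR:

1. **Matrix partitioning**: choose a block size that suits the problem and the available resources, balancing memory use against computational efficiency.
2. **Parallel strategy**: distribute the work sensibly and minimize communication overhead to improve parallel efficiency.
3. **Performance tuning**: adjust library parameters such as cache size and degree of parallelism for the target platform.
4. **Error handling**: anticipate failures such as running out of memory or communication errors, and handle them appropriately.

In practice, DBCSR is usually combined with other numerical solvers, for example finite element or finite difference methods for partial differential equations. It can also be applied to graph algorithms in machine learning, for example in social network analysis or recommender systems.

To get started, extract the provided "dbcsr-develop" archive and compile the source code. The build may depend on other libraries, such as MPI (Message Passing Interface) for parallel computation. Once the build is complete, the examples and API documentation shipped with the library show how to call DBCSR from your own code.

DBCSR is a powerful tool for large sparse matrix problems: its storage and computation strategies make such problems more convenient and efficient to handle in distributed environments. Researchers and engineers alike can use it to improve the efficiency of their scientific computing work.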
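The block-compressed storage idea described above can be illustrated with a minimal, single-process sketch. It is not DBCSR's implementation or API: the names `BlockCSR`, `from_dense`, and `matvec` are hypothetical, the blocks are assumed to be uniform and square, and no distribution across processes is modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockCSR:
    """Hypothetical block compressed sparse row container (uniform square blocks)."""
    block_size: int
    row_ptr: List[int]               # row_ptr[i]..row_ptr[i+1] index the blocks of block-row i
    col_idx: List[int]               # block-column index of each stored block
    blocks: List[List[List[float]]]  # dense contents of each stored block


def from_dense(dense, block_size):
    """Keep only the blocks that contain at least one nonzero element."""
    bs = block_size
    n_rows, n_cols = len(dense), len(dense[0])
    if n_rows % bs or n_cols % bs:
        raise ValueError("matrix extents must be multiples of block_size")
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(n_rows // bs):
        for bj in range(n_cols // bs):
            block = [row[bj * bs:(bj + 1) * bs]
                     for row in dense[bi * bs:(bi + 1) * bs]]
            if any(v != 0 for r in block for v in r):
                col_idx.append(bj)
                blocks.append(block)
        row_ptr.append(len(blocks))
    return BlockCSR(bs, row_ptr, col_idx, blocks)


def matvec(a, x):
    """Multiply the block-sparse matrix by a dense vector, visiting stored blocks only."""
    bs = a.block_size
    n_block_rows = len(a.row_ptr) - 1
    y = [0.0] * (n_block_rows * bs)
    for bi in range(n_block_rows):
        for k in range(a.row_ptr[bi], a.row_ptr[bi + 1]):
            bj = a.col_idx[k]
            for r in range(bs):
                y[bi * bs + r] += sum(a.blocks[k][r][c] * x[bj * bs + c]
                                      for c in range(bs))
    return y


dense = [[1, 2, 0, 0],
         [3, 4, 0, 0],
         [0, 0, 5, 0],
         [0, 0, 0, 6]]
a = from_dense(dense, 2)
y = matvec(a, [1, 1, 1, 1])
```

In this example only two of the four 2x2 blocks are stored, and the multiplication touches just those two. Working on small dense blocks rather than individual elements is what gives the regular memory access pattern mentioned above.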


Contents of the DBCSR (Fortran/C) download archive (.zip, 398 files):

AUTHORS 1KB

generate.bash 23KB

Dockerfile.build-env-latest-gcc 923B

Dockerfile.build-env-rocm 1KB

Dockerfile.build-env-ubuntu 2KB

Dockerfile.build-env-ubuntu-cuda 2KB

opencl_libsmm.c 91KB

acc_opencl.c 63KB

acc_bench_smm.c 24KB

acc_opencl_mem.c 23KB

acc_opencl_stream.c 15KB

acc_bench_trans.c 11KB

acc_opencl_event.c 9KB

dbcsr_acc_test.c 8KB

c_cpp 808B

cray.cce 1KB

.ccls 40B

multiply.cl 20KB

transpose.cl 2KB

.clang-format 698B

FindBLAS.cmake 29KB

FindLAPACK.cmake 15KB

CompilerConfiguration.cmake 7KB

CheckFortranSourceRuns.cmake 7KB

GetGitRevisionDescription.cmake 5KB

fypp-sources.cmake 3KB

CustomTargets.cmake 2KB

CheckCompilerSupport.cmake 726B

libsmm_acc.cpp 20KB

dbcsr_tensor_test.cpp 16KB

libsmm_acc_benchmark.cpp 14KB

dbcsr_test.cpp 11KB

dbcsr_tensor_example_2.cpp 10KB

dbcsr_example_3.cpp 6KB

acc_mem.cpp 5KB

libsmm_acc_init.cpp 4KB

acc_event.cpp 4KB

libsmm_acc_unittest_transpose.cpp 4KB

acc_stream.cpp 3KB

dbcsr_cuda_nvtx_cu.cpp 2KB

acc_blas.cpp 2KB

acc_hip.cpp 2KB

acc_dev.cpp 2KB

acc_init.cpp 2KB

acc_cuda.cpp 2KB

acc_error.cpp 1KB

tune_multiply_PVC.csv 34KB

tune_multiply_A100-80GB.csv 23KB

tune_multiply_H100.csv 22KB

tune_multiply_P100.csv 21KB

tune_multiply_A100-40GB.csv 17KB

tune_multiply_V100.csv 13KB

dbcsr_mpiwrap.F 186KB

dbcsr_operations.F 142KB

dbcsr_mm_3d.F 135KB

dbcsr_transformations.F 116KB

dbcsr_mm_cannon.F 104KB

dbcsr_tensor.F 102KB

dbcsr_block_operations.F 87KB

dbcsr_work_operations.F 80KB

dbcsr_tensor_types.F 73KB

dbcsr_tas_mm.F 72KB

dbcsr_api.F 67KB

dbcsr_csr_conversions.F 64KB

dbcsr_api_c.F 62KB

dbcsr_index_operations.F 60KB

dbcsr_tensor_api_c.F 53KB

dbcsr_io.F 52KB

dbcsr_data_methods_low.F 51KB

dbcsr_block_access.F 49KB

dbcsr_mm.F 47KB

dbcsr_tas_base.F 47KB

dbcsr_mm_dist_operations.F 46KB

dbcsr_tensor_example_1.F 39KB

dbcsr_tensor_split.F 38KB

dbcsr_mm_csr.F 38KB

dbcsr_test_multiply.F 38KB

dbcsr_tensor_unittest.F 38KB

dbcsr_mp_operations.F 37KB

dbcsr_tas_reshape_ops.F 37KB

dbcsr_iterator_operations.F 35KB

dbcsr_dist_util.F 35KB

dbcsr_performance_multiply.F 34KB

dbcsr_tensor_test.F 34KB

dbcsr_methods.F 32KB

dbcsr_dist_operations.F 32KB

dbcsr_mm_multrec.F 31KB

dbcsr_log_handling.F 31KB

dbcsr_types.F 29KB

dbcsr_test_methods.F 28KB

dbcsr_mm_hostdrv.F 28KB

dbcsr_mm_sched.F 28KB

dbcsr_config.F 27KB

dbcsr_tensor_block.F 27KB

dbcsr_dist_methods.F 27KB

dbcsr_test_add.F 26KB

dbcsr_btree.F 26KB

dbcsr_mm_common.F 25KB

dbcsr_tas_split.F 25KB


# LIBSMM (OpenCL)

## Overview

The LIBSMM library implements the [ACC LIBSMM interface](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc_libsmm.h), and depends on the [OpenCL backend](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/README.md).

## Customization

Compile-time settings are (implicitly) documented and can be adjusted by editing [opencl_libsmm.h](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/smm/opencl_libsmm.h), e.g., `OPENCL_LIBSMM_VALIDATE` is disabled by default but can be enabled for debugging purposes. The `OPENCL_LIBSMM_VALIDATE` compile-time setting enables side-by-side validation of matrix transpose and multiply operations between device and host. For example, running DBCSR's unit tests with `OPENCL_LIBSMM_VALIDATE` enabled produces console output that makes it possible to pinpoint a kernel that fails validation.

Runtime settings are made by means of environment variables. The OpenCL backend provides `acc_getenv.sh` to list all occurrences of `getenv` categorized into "OpenCL Backend environment variables" and "OpenCL LIBSMM environment variables". Common backend related settings are:

* `ACC_OPENCL_DEVSPLIT`: integer enabling devices to be split into subdevices (non-zero/default: subdevices, zero: aggregated).
* `ACC_OPENCL_DEVTYPE`: character string selecting "cpu", "gpu", "all" (unfiltered), or any other string (neither CPU nor GPU).
* `ACC_OPENCL_DEVICE`: non-negative integer number to select a device from the (internally enumerated) list of devices.
* `ACC_OPENCL_VENDOR`: character string matching the vendor of the OpenCL device in a case-insensitive fashion, e.g., "intel".
* `ACC_OPENCL_VERBOSE`: verbosity level (integer) with console output on `stderr`.
    * `ACC_OPENCL_VERBOSE=1`: outputs the number of devices found and the name of the selected device.
    * `ACC_OPENCL_VERBOSE=2`: outputs the duration needed to generate a requested kernel.
    * `ACC_OPENCL_VERBOSE=3`: outputs device-side performance of kernels (every launch profiled).
* `ACC_OPENCL_DUMP`: dump preprocessed kernel source code (1) or dump compiled OpenCL kernels (2).
    * `ACC_OPENCL_DUMP=1`: dump preprocessed kernel source code and use it for JIT compilation. Instantiates the original source code using preprocessor definitions (`-D`) and collapses the code accordingly.
    * `ACC_OPENCL_DUMP=2`: dump compiled OpenCL kernels (depends on OpenCL implementation), e.g., PTX code on Nvidia.

There are two categories for the two domains in OpenCL based LIBSMM, i.e., matrix transpose (`OPENCL_LIBSMM_TRANS_*`) and matrix multiplication (`OPENCL_LIBSMM_SMM_*`). For transposing matrices, the settings are:

* `OPENCL_LIBSMM_TRANS_BUILDOPTS`: character string with build options (compile and link) supplied to the OpenCL runtime compiler.
* `OPENCL_LIBSMM_TRANS_INPLACE`: Boolean value (zero or non-zero integer) for in-place matrix transpose (no local memory needed).
* `OPENCL_LIBSMM_TRANS_BM`: non-negative integer number (less than or equal to the M-extent) denoting the blocksize in M-direction.

The most common settings for multiplying matrices are:

* `OPENCL_LIBSMM_SMM_BUILDOPTS`: character string with build options (compile and link) supplied to the OpenCL runtime compiler.
* `OPENCL_LIBSMM_SMM_ATOMICS`: selects the kind of atomic operation used for global memory updates (`xchg`, `cmpxchg`, `cmpxchg2`), attempts to force atomic instructions, or disables atomic instructions (`0`). The latter is useful, for instance, to quantify the impact of atomic operations.
* `OPENCL_LIBSMM_SMM_PARAMS`: disable embedded/auto-tuned parameters (`0`), or load a CSV-file (e.g., `path/to/tune_multiply.csv`).
* `OPENCL_LIBSMM_SMM_BS`: non-negative integer number denoting the intra-kernel (mini-)batchsize mainly used to amortize atomic updates of data in global/main memory. The remainder with respect to the "stacksize" is handled by the kernel.
* `OPENCL_LIBSMM_SMM_BM`: non-negative integer number (less than or equal to the M-extent) denoting the blocksize in M-direction.
* `OPENCL_LIBSMM_SMM_BN`: non-negative integer number (less than or equal to the N-extent) denoting the blocksize in N-direction.
* `OPENCL_LIBSMM_SMM_AP`: specifies access to array of parameters (batch or "stack").
* `OPENCL_LIBSMM_SMM_AA`: specifies access to array of A-matrices.
* `OPENCL_LIBSMM_SMM_AB`: specifies access to array of B-matrices.
* `OPENCL_LIBSMM_SMM_AC`: specifies access to array of C-matrices.

The full list of tunable parameters and some explanation can be obtained with `smm/tune_multiply.py --help`, i.e., short description, default settings, and accepted values.

**NOTE**: LIBSMM's tunable runtime settings can be non-smooth, i.e., neighboring values can produce distinct code-paths, e.g., `OPENCL_LIBSMM_SMM_BS=1` vs. `OPENCL_LIBSMM_SMM_BS=2`.

## Auto Tuning

Auto tuning code for performance is a practical way to find the "best" setting for parameterized code (e.g., GPU kernels). Introducing effective parameters is a prerequisite, and exploring the (potentially) high-dimensional parameter space in an efficient way is an art. It is desirable to have reasonable defaults even without auto-tuning the parameters. It would be even better to avoid auto-tuning if best performance was possible right away.

For the OpenCL based LIBSMM, a variety of parameters are explored using [OpenTuner](http://opentuner.org/). The script [tune_multiply.py](https://github.com/cp2k/dbcsr/blob/develop/src/acc/opencl/smm/tune_multiply.py) (or tune_multiply.sh) leverages the `acc_bench_smm` by parsing console output (timing, data type, etc.). This way, the tuning is implemented without being intermingled with the subject being tuned. The "communication" between the tuner and the executable is solely based on environment variables.
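The environment-variable based coupling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of `tune_multiply.py`: the helper names `make_env` and `run_candidate` are hypothetical, and the command to launch is left to the caller because the benchmark's path and arguments are not given here.

```python
import os
import subprocess


def make_env(params, base_env=None):
    """Return a copy of the environment with the candidate settings applied.

    `params` maps variable names (e.g. "OPENCL_LIBSMM_SMM_BM") to values;
    values are converted to strings since environments only carry strings.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in params.items():
        env[name] = str(value)
    return env


def run_candidate(cmd, params, base_env=None):
    """Run `cmd` with the candidate settings and return its console output.

    The caller is expected to parse the returned text (e.g. timings);
    the output format is deliberately not assumed in this sketch.
    """
    env = make_env(params, base_env)
    result = subprocess.run(cmd, env=env, capture_output=True,
                            text=True, check=True)
    return result.stdout


candidate = {"OPENCL_LIBSMM_SMM_BM": 4, "OPENCL_LIBSMM_SMM_BN": 2}
env = make_env(candidate, base_env={"ACC_OPENCL_VERBOSE": "2"})
```

Because the settings travel only through the child's environment, the tuner never needs to link against or modify the benchmark, which is the separation the text above describes.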

**NOTE**: If `tune_multiply.py` (or `tune_multiply.sh`) is called with an environment variable already set, the respective parameter (e.g., `OPENCL_LIBSMM_SMM_BM` or `OPENCL_LIBSMM_SMM_BN`) is considered fixed (and not tuned automatically). This way, the parameter space is reduced in size and effort can be directed more intensely towards the remaining parameters.

To toggle the benchmarks between tuning single precision (SP) and double precision (DP), `make ELEM_TYPE=float` can be used when building the benchmark drivers (`ELEM_TYPE` can also be edited directly in [acc_bench_smm.c](https://github.com/cp2k/dbcsr/blob/develop/src/acc/acc_bench_smm.c#L26)). Auto-tuned parameters for SP and DP can be embedded into the same final application and are considered correctly at runtime.

To build the benchmarks in double precision (`ELEM_TYPE=double` is default):

```bash
cd src/acc/opencl
make
```

To build the benchmarks in single precision (SP):

```bash
cd src/acc/opencl
make ELEM_TYPE=float
```

To auto-tune, please install the Python `wheel` and `opentuner` packages:

```bash
cd src/acc/opencl/smm
pip install -r requirements.txt
```

The OpenTuner script supports several command line arguments (`tune_multiply.py --help`). For example, `--stop-after=300` can be of interest to finish in five minutes (without limit, OpenTuner decides when the auto-tuning process is finished). A single kernel can be selected by M, N, and K parameters (GEMM), e.g., `M=13`, `N=5`, and `K=7`:

```bash
./tune_multiply.py 13x5x7
```

**NOTE**: If multiple different kernels are tuned using `tune_multiply.py`, it is advisable to delete the `opentuner.db` directory prior to tuning a different kernel since otherwise auto-tuning is potentially (mis-)guided by information which was collected for a different kernel (`tune_multiply.sh` does this automatically).

The OpenTuner script implements multiple objectives ("cost"), primarily "accuracy" (maximized) and a secondary objective "size" (minimized).
The former represents the achieved performance (GFLOPS) while the latter represents an artificial kernel requirement (just to prefer one parameter set over another in case of similar performance). The console output looks like ("accuracy" denotes performance in GFLOP
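
Three ideas from the tuning workflow above are easy to sketch: reading a kernel specification such as `13x5x7`, treating parameters already present in the environment as fixed, and preferring higher "accuracy" before smaller "size". The helpers `parse_mnk`, `split_fixed`, and `pick_best` are hypothetical and are not taken from `tune_multiply.py`. In particular, the strict lexicographic ordering used here is an assumption, since how "similar performance" is judged is not specified in this text.

```python
def parse_mnk(spec):
    """Parse a kernel specification of the form "MxNxK" into three integers."""
    parts = spec.split("x")
    if len(parts) != 3:
        raise ValueError("expected a specification of the form MxNxK")
    m, n, k = (int(p) for p in parts)
    if min(m, n, k) <= 0:
        raise ValueError("M, N, and K must be positive")
    return m, n, k


def split_fixed(tunables, env):
    """Separate parameters pinned by the environment from those left to tune."""
    fixed = {name: env[name] for name in tunables if name in env}
    free = [name for name in tunables if name not in env]
    return fixed, free


def pick_best(results):
    """Pick the best (params, accuracy, size) tuple.

    Accuracy is maximized first; size is minimized only to break exact ties
    (an assumption made for this sketch).
    """
    return max(results, key=lambda r: (r[1], -r[2]))


mnk = parse_mnk("13x5x7")
fixed, free = split_fixed(
    ["OPENCL_LIBSMM_SMM_BM", "OPENCL_LIBSMM_SMM_BN"],
    {"OPENCL_LIBSMM_SMM_BM": "4"},
)
best = pick_best([
    ({"bs": 1}, 100.0, 3),
    ({"bs": 2}, 100.0, 1),
    ({"bs": 4}, 90.0, 0),
])
```

Here `OPENCL_LIBSMM_SMM_BM` is pinned by the environment, so only `OPENCL_LIBSMM_SMM_BN` remains to be explored, and the second result wins because it matches the top accuracy with a smaller size.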
