Fast matrix multiplication
--------
Austin R. Benson and Grey Ballard
This software contains implementations of fast matrix multiplication algorithms for
sequential and shared-memory parallel environments.
To cite this work, please use:
Austin R. Benson and Grey Ballard. "A framework for practical parallel fast matrix multiplication". In Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2015.
An extended version of the paper is available on [arxiv](http://arxiv.org/pdf/1409.2908v1.pdf).
For references to APA algorithms, see
Grey Ballard, Jack Weissenberger, and Luoping Zhang. "Accelerating Neural Network Training using Arbitrary Precision Approximating Matrix Multiplication Algorithms". In Proceedings of the 50th International Conference on Parallel Processing Workshops (ICPP-W), 2021.
License
--------
Copyright 2014 Sandia Corporation. Under the terms of Contract DE-AC04-94AL85000 with Sandia Corporation, the U.S. Government retains certain rights in this software.
This software is released under the BSD 2-Clause license. Please see the LICENSE file.
Setup
--------
The code requires:
* Intel MKL
* Compiler supporting C++11 and OpenMP
The Makefile depends on an included file that specifies the compiler and the run-time mode.
You must specify this file in the first line of the Makefile.
For an example, see the file `make.incs/make.inc.edison`, which contains the information for running
on NERSC's Edison machine.
The `MODE` variable specifies sequential or parallel mode.
The `DEFINES` variable can specify the type of parallelism if running in parallel mode.
The `DEFINES` variable also controls the naming convention for BLAS routines.
By default, names are assumed to look like "dgemm".
Defining BLAS_POST switches to names with a trailing underscore, like "dgemm_".
The `MKL_ROOT` variable must be set for your machine.
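Putting these together, a make include file might look like the following sketch (the compiler choice and MKL path are illustrative; adjust them for your machine, and see `make.incs/make.inc.edison` for a real example):

```make
# Illustrative make.inc fragment -- paths and compiler are assumptions.
CXX = icpc                     # C++11 compiler with OpenMP support
MODE = openmp                  # sequential or parallel mode
DEFINES += -D_PARALLEL_=1      # DFS parallelism (omit for sequential builds)
DEFINES += -DBLAS_POST         # BLAS routines have a trailing underscore (dgemm_)
MKL_ROOT = /opt/intel/mkl      # path to your MKL installation
```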
We did most of our testing with the Intel compiler (icpc).
Depending on the g++ version, the OpenMP task constructs can behave differently, and the HYBRID shared-memory
parallel code may crash. Sequential mode, DFS parallel, and BFS parallel should work with g++.
Building examples
--------
First, use the code generator to generate the algorithms:
    cd codegen
    bash gen_all_algorithms.sh 0
Some simple codes that use the fast algorithms are in the `examples` directory.
For example, you can build and run the (4, 3, 3) algorithm:
    make fast433
    ./build/fast433
Building tests
--------
We now assume that all of the algorithms have been generated with the code generator (see above).
The tests are built and run with:
    make matmul_tests
    ./build/matmul_tests -all 1
The tests are just for correctness of the algorithms, not for performance.
You should see output like:
    STRASSEN_1: 257, 500, 55
    Maximum relative difference: 3.87253e-15
This test runs one step of Strassen's algorithm, multiplying a 257 x 500 matrix with a 500 x 55 matrix.
The maximum relative difference is an error measure:
    max_{ij} |C_{ij} - D_{ij}| / |C_{ij}|,
where C is the result computed with the fast algorithm and D is the result computed with the classical algorithm.
For all of the exact fast algorithms, the error should be around 1e-14 or 1e-15.
The approximate algorithms (e.g., Bini's) have larger error.
Typically, additional recursive steps lead to larger error.
Building with different parallel methods
--------
The BFS, DFS, and HYBRID parallel algorithms are compile-time options.
To use DFS, set the following in your make include file in the `make.incs` directory:

    MODE=openmp
    DEFINES += -D_PARALLEL_=1
Switch the `_PARALLEL_` define to 2 for BFS or 3 for HYBRID.
For an example, run
    make fast424
    ./build/fast424
DGEMM curve benchmarks
--------
Build and run the benchmark for the dgemm curves:
    make dgemm_curves
    ./build/dgemm_curves 1 # N x N x N
    ./build/dgemm_curves 2 # N x 800 x N
    ./build/dgemm_curves 3 # N x 800 x 800
The output is a semicolon-separated list, where each item looks like:

    M K N num_trials time;
The M, K, and N terms specify the matrix dimensions: M x K multiplied by K x N.
The time is in milliseconds and is the total time to run num_trials multiplies.
For example,
    1200 800 1200 5 104.87;
means that it took 104.87 milliseconds total to multiply a 1200 x 800 matrix by an 800 x 1200 matrix five times.
To build with parallelism enabled, you need to define `_PARALLEL_` (see `make.incs/make.inc.linux`).
To run without dynamic threads (i.e., mkl_set_dynamic(0)), append a second argument, e.g.:
    ./build/dgemm_curves 1 1 # Square timings without dynamic thread allocation
Fast algorithms benchmarks
--------
Build benchmarking code for all of the fast algorithms:
    make matmul_benchmarks
The build takes a while because we are compiling all of the algorithms.
To run a small test to benchmark MKL against Strassen with one, two, and three levels of recursion:
    ./build/matmul_benchmarks -square_test 1
The output format is specified in `data/README.md`.
To run all of the benchmarks for the tall-and-skinny matrix multiplied by a small square matrix (N x k x k for fixed k):
    ./build/matmul_benchmarks -ts_square_like 1