alg_eng_matrix-master_matrix source code resource - CSDN文库

22 files in total, including 10 .hpp, 6 .cpp, and 1 .gitignore.

112 views, uploaded 2021-10-01 00:37:54, 389 KB ZIP

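Matrix multiplication is a fundamental computational task in IT and plays a central role in graphics processing, machine learning, and high-performance computing. CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform, which lets programmers use the GPU (graphics processing unit) for general-purpose high-performance computing. This resource, "alg_eng_matrix-master_matrix_源码", is an open-source matrix multiplication project taken from the fhvilshoj/alg_eng_matrix repository on GitHub.

The archive itself contains C++ implementations for the CPU (naive, transposed, cache-oblivious, and tiled variants, some parallelized with OpenMP) and no CUDA code. The CUDA concepts below are therefore general background on how the same problem is approached on a GPU, where the goal is to turn a serial computation into a parallel one:

1. **CUDA threads**: Computation in CUDA is carried out by threads organized in a hierarchy of thread blocks and grids. Threads within a block can synchronize and share data efficiently, while a grid contains many thread blocks and is used for larger tasks.
2. **Global memory and shared memory**: Global memory is accessible to all threads but is comparatively slow. Shared memory is visible only to the threads of one block but is much faster. Using shared memory well can significantly improve matrix multiplication performance.
3. **CUDA kernels**: The multiplication logic is usually defined in a kernel function that is executed by a large number of threads at once. Each thread computes one part of the product.
4. **Synchronization and atomic operations**: Threads sometimes need to synchronize, for example to make sure all related computation has finished before the next step. Atomic operations (such as atomic add) update values in shared or global memory safely and avoid race conditions.
5. **Optimization strategies**: To maximize performance, developers must decide how to partition work across threads and thread blocks and how to use memory effectively, for example by choosing tile and block dimensions that match the GPU's hardware characteristics and by managing device memory allocations carefully.
6. **CUDA C++ programming**: CUDA is usually programmed in C++, using templates and classes to abstract and encapsulate GPU computation. Libraries such as cuBLAS and cuSPARSE provide pre-optimized matrix routines that simplify common operations.
7. **CUDA tools and debugging**: NVIDIA provides a toolchain, including NVIDIA Nsight Eclipse Edition and a Visual Studio plugin, for editing, compiling, debugging, and profiling code. These tools help locate performance bottlenecks.

Studying the source code in "alg_eng_matrix-master" shows how efficient matrix multiplication is implemented in practice, in particular how memory access patterns affect performance. That understanding carries over directly to GPU parallel computing and lays a foundation for tackling more complex problems.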

alg_eng_matrix-master.zip (22 files)

alg_eng_matrix-master

cache-oblivious-algorithms.pdf 317KB

benchmark

benchmark2.cpp 21KB

benchmark.cpp 17KB

print_result_cols.py 658B

test

test.cpp 51B

tests

test_naive.cpp 8KB

lib

catch.hpp 371KB

src

oblivious.hpp 3KB

helper.hpp 5KB

oblivious_cores.hpp 4KB

oblivious_s.hpp 3KB

naive.hpp 2KB

oblivious_s_flip.hpp 4KB

main.cpp 997B

naive_flip.hpp 2KB

tiled.hpp 3KB

tiled_flip.hpp 3KB

.gitignore 2KB

CMakeLists.txt 2KB

README.md 5KB

.editorconfig 114B

datagen

datagen.cpp 2KB

# AlgEng - Matrix Multiplication

This C++ project was built as part of the Algorithm Engineering course at Aarhus University. It includes implementations of several algorithms for multiplying matrices.

## Algorithms

The project contains the following algorithms.

Algorithm | Description
--- | ---
**Naive** | Three nested for-loops iterating through the two matrices.
**Naive:x** | Same as **Naive**, but uses OpenMP to parallelize.
**Naive:T** | Transposes B in A * B before multiplying the two, lowering cache misses from $n^3$ to $\frac{n^3}{\text{cache-line-size}}$.
**Naive:T:x** | Same as **Naive:T**, but uses OpenMP to parallelize.
**Obl** | Uses a cache-oblivious approach to lower cache misses by recursing on ever smaller subproblems.
**Obl:x** | Uses the same recursion as **Obl**, but stops when the problem size falls below a threshold *x*, where it switches to the **Naive** algorithm.
**Obl:T:x** | Combines the benefits of **Naive:T** and **Obl:x** by transposing B, recursing down to a threshold *x*, and then switching to **Naive:T**.
**Tile:x** | Uses a tile-based approach: it divides A and B into tiles that fit into cache and then computes every subproblem using **Naive**. It is thus a cache-aware algorithm.
**Tile:T:x** | Same as **Tile:x**, but uses **Naive:T** for the subproblems.

## Building

To build the project, create a folder called `output` in the root of the project and run `cmake` and `make` to build it.

**Note**: this project depends on Intel's PCM library as well as PAPI. Both are listed under Dependencies.

The build will generate four executables:

- `testing`: Used to run unit tests of all the implementations.
- `datagen`: Used to generate data for the benchmark programs.
- `benchmark`: Benchmarks algorithms using the PAPI library.
- `benchmark2`: Benchmarks algorithms using the PCM library.

## Testing

The `testing` executable takes no arguments and runs all tests in the test folder.

```commandline
> ./testing
```

## Datagen

The `datagen` executable takes four arguments and generates a data file for two matrices, A of size m x n and B of size n x p, filled with random numbers between -1000 and 1000.

```commandline
> ./datagen <m> <n> <p> <output-file>
```

## Benchmark

The `benchmark` executable takes several arguments and benchmarks all the algorithms using PAPI.

**Arguments**:

- `-r:<refresh>`: an unsigned int giving the number of refresh iterations to run before measuring. Default 2.
- `-l:<loop>`: an unsigned int giving the number of iterations to measure and average over. Default 5.
- `-i:<input-file>`: a file generated by `datagen` to benchmark the algorithms against. Multiple files can be specified.
- `-o:<output-file>`: the prefix of the resulting output files.

Example:

```commandline
> sudo taskset -c 0 nice -n -20 ./benchmark -i:data/42_42_42.data -r:5 -l:10 -o:my_run
```

*Note* that all benchmark result data will be located in the *output* folder in the root of the project.

## Benchmark2

The `benchmark2` executable takes several arguments and benchmarks all the algorithms using PCM. The arguments are the same as for `benchmark`, with one addition:

- `-c:<core>`: the core to measure events from (should be the same core as set with `taskset` on Linux).

Example:

```commandline
> sudo taskset -c 0 nice -n -20 ./benchmark2 -i:data/42_42_42.data -r:5 -l:10 -o:my_run -c:0
```

*Note* that all benchmark result data will be located in the *output* folder in the root of the project.

## Extra tools

The project includes `print_result_cols.py`, which can be used to print a single column from the result files in the following way.

**Arguments**:

- `<col-idx>`: 0-based index of the column to print from every file.
- `<data-file>`: One or more files to print data from.

```commandline
> python print_result_cols.py 0 my_run.nai1.data my_run.obl.160.data
```

This prints the first column of `my_run.nai1.data` and `my_run.obl.160.data` in a table-like style.

## Dependencies

- [Catch test framework](https://github.com/philsquared/Catch): Catch is included in the project and should work out of the box.
- [Intel Performance Counter Monitoring (PCM)](https://github.com/opcm/pcm): For PCM to work, it has to be cloned from its repository and built. Afterwards it can be linked in the `CMakeLists.txt` file of this project. Note that the project is set up to work with the Intel® Core™ i7-5600U Processor and is not guaranteed to work with others.
- [Performance API (PAPI)](http://icl.utk.edu/papi/): PAPI works only on Linux and has to be installed such that `/usr/local/lib/libpapi.a` is present at this exact location. An installation guide can be found on the webpage.
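
## Sketch of the algorithms

The **Naive**, **Naive:T**, and **Tile:x** entries in the Algorithms section can be illustrated with a minimal sketch. This is not the project's code: the `Matrix` struct, the `zeros` helper, the function names, and the row-major layout are all assumptions made for this example, and the headers under `src` may be organized quite differently. The sketch only shows the access patterns the descriptions refer to: column-wise reads of B in the naive version, row-wise reads of a transposed copy, and blocking of all three loops by a tile size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix type used only for this sketch.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;  // element (i, j) is stored at data[i * cols + j]

    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Helper (assumed name): a rows x cols matrix filled with zeros.
Matrix zeros(std::size_t rows, std::size_t cols) {
    return Matrix{rows, cols, std::vector<double>(rows * cols, 0.0)};
}

// Naive: three nested loops. B is read column-wise, i.e. with stride B.cols.
Matrix multiply_naive(const Matrix& A, const Matrix& B) {
    assert(A.cols == B.rows);
    Matrix C = zeros(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < B.cols; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < A.cols; ++k)
                sum += A.at(i, k) * B.at(k, j);
            C.at(i, j) = sum;
        }
    return C;
}

// Naive:T idea: transpose B first so the inner loop reads both operands row-wise.
Matrix multiply_transposed(const Matrix& A, const Matrix& B) {
    assert(A.cols == B.rows);
    Matrix Bt = zeros(B.cols, B.rows);
    for (std::size_t k = 0; k < B.rows; ++k)
        for (std::size_t j = 0; j < B.cols; ++j)
            Bt.at(j, k) = B.at(k, j);
    Matrix C = zeros(A.rows, B.cols);
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < B.cols; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < A.cols; ++k)
                sum += A.at(i, k) * Bt.at(j, k);
            C.at(i, j) = sum;
        }
    return C;
}

// Tile:x idea: block all three loops by `tile` and run the naive kernel per tile.
Matrix multiply_tiled(const Matrix& A, const Matrix& B, std::size_t tile) {
    assert(A.cols == B.rows && tile > 0);
    Matrix C = zeros(A.rows, B.cols);
    for (std::size_t ii = 0; ii < A.rows; ii += tile)
        for (std::size_t jj = 0; jj < B.cols; jj += tile)
            for (std::size_t kk = 0; kk < A.cols; kk += tile) {
                // Clamp tile bounds so edge tiles may be smaller than `tile`.
                const std::size_t i_end = std::min(ii + tile, A.rows);
                const std::size_t j_end = std::min(jj + tile, B.cols);
                const std::size_t k_end = std::min(kk + tile, A.cols);
                for (std::size_t i = ii; i < i_end; ++i)
                    for (std::size_t j = jj; j < j_end; ++j) {
                        double sum = C.at(i, j);  // accumulate across k-tiles
                        for (std::size_t k = kk; k < k_end; ++k)
                            sum += A.at(i, k) * B.at(k, j);
                        C.at(i, j) = sum;
                    }
            }
    return C;
}
```

For example, with `Matrix A{2, 3, {1, 2, 3, 4, 5, 6}}` and `Matrix B{3, 2, {7, 8, 9, 10, 11, 12}}`, all three functions return a 2 x 2 matrix whose `data` is `{58, 64, 139, 154}`. The OpenMP-parallelized variants would typically add a `#pragma omp parallel for` over an outer loop; that is left out here to keep the sketch free of dependencies.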
