cuBLAS is CUDA's library for linear algebra, dedicated mainly to matrix computation. Its routines fall into three levels: Level 1 (vector-vector), Level 2 (matrix-vector), and Level 3 (matrix-matrix), alongside a set of helper and state-management functions. It supports multiple precisions, including single and double precision. For matrix work, cuBLAS is usually far more efficient than hand-written kernels; note, however, that unlike C/C++ it uses column-major storage.
High performance: cuBLAS accelerates linear algebra on the GPU, whose parallelism makes it much faster than traditional CPU-only code.
Rich functionality: the library provides the complete BLAS (Basic Linear Algebra Subprograms) function set, covering matrix multiplication, vector operations, and more.
Ease of use: a friendly API exposes highly optimized routines, so users can focus on their own logic rather than low-level implementation details.
Good compatibility: cuBLAS is tightly integrated with the CUDA platform and works seamlessly with CUDA features such as streams and events inside CUDA programs.
DU-06702-001_v11.8 | October 2022
cuBLAS Library
User Guide
Chapter 1. Introduction
The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA® runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs).
The cuBLAS Library exposes three sets of APIs:
‣ The cuBLAS API, which is simply called cuBLAS API in this document (starting with CUDA 6.0),
‣ The cuBLASXt API (starting with CUDA 6.0), and
‣ The cuBLASLt API (starting with CUDA 10.1)
To use the cuBLAS API, the application must allocate the required matrices and vectors in the GPU memory space, fill them with data, call the sequence of desired cuBLAS functions, and then copy the results from the GPU memory space back to the host. The cuBLAS API also provides helper functions for writing and retrieving data from the GPU.
To use the cuBLASXt API, the application may have the data on the Host or any of the devices
involved in the computation, and the Library will take care of dispatching the operation to, and
transferring the data to, one or multiple GPUs present in the system, depending on the user
request.
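As an illustration of this host-data model, here is a minimal sketch that runs a single-precision GEMM through cuBLASXt on GPU 0 with all matrices resident in host memory; the sizes and values are arbitrary and the fragment is not taken from this guide:
//Sketch: cuBLASXt GEMM on host-resident data (illustrative, not from this guide)
#include <stdio.h>
#include <stdlib.h>
#include "cublasXt.h"
int main (void){
    const size_t n = 512;
    float *A = (float*)malloc(n*n*sizeof(*A));   // host matrices; the library
    float *B = (float*)malloc(n*n*sizeof(*B));   // moves tiles to the GPU(s)
    float *C = (float*)calloc(n*n, sizeof(*C));
    for (size_t k = 0; k < n*n; k++) { A[k] = 1.0f; B[k] = 2.0f; }
    cublasXtHandle_t handle;
    if (cublasXtCreate(&handle) != CUBLAS_STATUS_SUCCESS) return EXIT_FAILURE;
    int devices[1] = {0};                        // use GPU 0 only
    cublasXtDeviceSelect(handle, 1, devices);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha*A*B + beta*C, column-major, all pointers on the host
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                  &alpha, A, n, B, n, &beta, C, n);
    cublasXtDestroy(handle);
    printf("C[0] = %f\n", C[0]);                 // expect 1024.0
    free(A); free(B); free(C);
    return EXIT_SUCCESS;
}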
The cuBLASLt is a lightweight library dedicated to GEneral Matrix-to-matrix Multiply (GEMM)
operations with a new flexible API. This library adds flexibility in matrix data layouts, input
types, compute types, and also in choosing the algorithmic implementations and heuristics
through parameter programmability. After a set of options for the intended GEMM operation
are identified by the user, these options can be used repeatedly for different inputs. This is
analogous to how cuFFT and FFTW first create a plan and then reuse it for FFTs of the same size and type with different input data.
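As a rough illustration of that plan-like reuse, the fragment below creates the operation and layout descriptors once and then calls cublasLtMatmul repeatedly for several same-shaped problems. It is a hedged sketch under stated assumptions (FP32 data, column-major layouts, no workspace, default heuristics), not an excerpt from this guide:
//Sketch: reusing cuBLASLt descriptors across calls (illustrative)
#include "cublasLt.h"
// devA, devB, devC are arrays of device pointers to m x k, k x n, and m x n
// column-major FP32 matrices; 'count' is how many problems to run.
static cublasStatus_t run_gemms(cublasLtHandle_t lt,
                                float **devA, float **devB, float **devC,
                                int count, int m, int n, int k){
    const float alpha = 1.0f, beta = 0.0f;
    cublasLtMatmulDesc_t op;
    cublasLtMatrixLayout_t layA, layB, layC;
    // Describe the operation and the three operand layouts once.
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);
    cublasLtMatrixLayoutCreate(&layA, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&layB, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&layC, CUDA_R_32F, m, n, m);
    cublasStatus_t stat = CUBLAS_STATUS_SUCCESS;
    for (int i = 0; i < count && stat == CUBLAS_STATUS_SUCCESS; i++){
        // NULL algo lets the library pick a heuristic; no workspace is given.
        stat = cublasLtMatmul(lt, op, &alpha, devA[i], layA, devB[i], layB,
                              &beta, devC[i], layC, devC[i], layC,
                              NULL, NULL, 0, 0);
    }
    cublasLtMatrixLayoutDestroy(layC);
    cublasLtMatrixLayoutDestroy(layB);
    cublasLtMatrixLayoutDestroy(layA);
    cublasLtMatmulDescDestroy(op);
    return stat;
}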
1.1. Data Layout
For maximum compatibility with existing Fortran environments, the cuBLAS library uses column-major storage and 1-based indexing. Since C and C++ use row-major storage, applications written in these languages cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this
case, the array index of a matrix element in row “i” and column “j” can be computed via the
following macro
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
Here, ld refers to the leading dimension of the matrix, which in the case of column-major
storage is the number of rows of the allocated matrix (even if only a submatrix of it is being
used). For natively written C and C++ code, one would most likely choose 0-based indexing, in
which case the array index of a matrix element in row “i” and column “j” can be computed via
the following macro
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
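For example, with 0-based indexing, filling an m-by-n column-major submatrix that lives inside a larger allocation might look like the fragment below (an illustrative sketch, not one of the guide's examples):
//Sketch: addressing a column-major submatrix with IDX2C (illustrative)
#define IDX2C(i,j,ld) (((j)*(ld))+(i))
// 'a' points to an allocation of ld*n floats; only the top m rows are used,
// so ld (>= m) is the leading dimension even though the submatrix is m x n.
static void fill (float *a, int ld, int m, int n){
    for (int j = 0; j < n; j++)          // each column is a contiguous block of ld elements
        for (int i = 0; i < m; i++)
            a[IDX2C(i,j,ld)] = (float)(i + j * m);
}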
1.2. New and Legacy cuBLAS API
Starting with version 4.0, the cuBLAS Library provides a new API, in addition to the existing
legacy API. This section discusses why a new API is provided, the advantages of using it, and
the differences with the existing legacy API.
WARNING: The legacy cuBLAS API is deprecated and will be removed in a future release.
The new cuBLAS library API can be used by including the header file “cublas_v2.h”. It has
the following features that the legacy cuBLAS API does not have:
‣ The handle to the cuBLAS library context is initialized using the cublasCreate() function and is explicitly passed to every subsequent library function call. This allows the user to have more control over the library setup when using multiple host threads and multiple GPUs. It also allows the cuBLAS APIs to be reentrant. (A minimal sketch follows this list.)
‣ The scalars alpha and beta can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This change allows library functions to execute asynchronously using streams even when alpha and beta are generated by a previous kernel.
‣ When a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only being allowed to be returned by value on the host. This change allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.
‣ The error status cublasStatus_t is returned by all cuBLAS library function calls. This change facilitates debugging and simplifies software development. Note that cublasStatus was renamed cublasStatus_t to be more consistent with other types in the cuBLAS library.
‣ The cublasAlloc() and cublasFree() functions have been deprecated. This change removes these unnecessary wrappers around cudaMalloc() and cudaFree(), respectively.
‣ The function cublasSetKernelStream() was renamed cublasSetStream() to be more consistent with the other CUDA libraries.
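The fragment below illustrates these points under stated assumptions (a caller-supplied stream and a device-resident scalar used with cublasSscal); it is a sketch, not an excerpt from this guide:
//Sketch: handle, stream, and device-side scalar with the cublas_v2 API (illustrative)
#include <cuda_runtime.h>
#include "cublas_v2.h"
// Scales a device vector x of length n by a device-resident scalar d_alpha.
static cublasStatus_t scale_on_stream(float *d_x, const float *d_alpha,
                                      int n, cudaStream_t stream){
    cublasHandle_t handle;
    cublasStatus_t stat = cublasCreate(&handle);     // explicit context handle
    if (stat != CUBLAS_STATUS_SUCCESS) return stat;
    cublasSetStream(handle, stream);                 // run on the caller's stream
    // Tell the library that alpha lives in device memory, so this call can be
    // queued even before a previous kernel has produced d_alpha.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    stat = cublasSscal(handle, n, d_alpha, d_x, 1);  // x = alpha * x
    cublasDestroy(handle);
    return stat;
}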
The legacy cuBLAS API, explained in more detail in Appendix A, can be used by including the header file “cublas.h”. Since the legacy API is identical to the previously released cuBLAS library API, existing applications will work out of the box and automatically use this legacy API without any source code changes.
The current and the legacy cuBLAS APIs cannot be used simultaneously in a single translation
unit: including both “cublas.h” and “cublas_v2.h” header files will lead to compilation
errors due to incompatible symbol redeclarations.
In general, new applications should not use the legacy cuBLAS API, and existing applications should convert to using the new API if they require sophisticated and optimal stream parallelism, or if they call cuBLAS routines concurrently from multiple threads.
For the rest of the document, the new cuBLAS Library API will simply be referred to as the
cuBLAS Library API.
As mentioned earlier, the interfaces to the legacy and the cuBLAS library APIs are the header files “cublas.h” and “cublas_v2.h”, respectively. In addition, applications using the cuBLAS library need to link against:
‣ the DSO cublas.so for Linux,
‣ the DLL cublas.dll for Windows, or
‣ the dynamic library cublas.dylib for Mac OS X.
Note: The same dynamic library implements both the new and legacy cuBLAS APIs.
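For instance, on Linux a program using the cuBLAS API could typically be compiled and linked with a command along these lines (an illustrative invocation; the source file name is hypothetical and the exact flags depend on the installation):
nvcc example1.c -lcublas -o example1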
1.3. Example Code
The following code examples show an application written in C using the cuBLAS library API
with two indexing styles. Example 1 shows 1-based indexing and Example 2 shows 0-based
indexing.
//Example 1. Application Using C and cuBLAS: 1-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n,
                               int p, int q, float alpha, float beta){
    cublasSscal (handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
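The excerpt breaks off here, partway through main(). As a rough guide to how such an example typically continues — allocate and fill the host matrix, copy it to the device with cublasSetMatrix(), call modify(), copy the result back with cublasGetMatrix(), and release resources — a hedged sketch of the remaining body follows (it is not the guide's verbatim text, and error handling is abbreviated):
    int i, j;
    float *devPtrA;
    float *a = (float *)malloc (M * N * sizeof (*a));
    if (!a) return EXIT_FAILURE;
    // Fill the matrix column by column using the 1-based IDX2F macro.
    for (j = 1; j <= N; j++)
        for (i = 1; i <= M; i++)
            a[IDX2F(i,j,M)] = (float)((i-1) * N + j);

    cudaStat = cudaMalloc ((void**)&devPtrA, M * N * sizeof (*a));
    if (cudaStat != cudaSuccess) { free (a); return EXIT_FAILURE; }
    stat = cublasCreate (&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) { cudaFree (devPtrA); free (a); return EXIT_FAILURE; }

    // Copy the host matrix to the device, scale a row and a column, copy back.
    cublasSetMatrix (M, N, sizeof (*a), a, M, devPtrA, M);
    modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    cublasGetMatrix (M, N, sizeof (*a), devPtrA, M, a, M);

    cudaFree (devPtrA);
    cublasDestroy (handle);
    for (j = 1; j <= N; j++){
        for (i = 1; i <= M; i++) printf ("%7.0f", a[IDX2F(i,j,M)]);
        printf ("\n");
    }
    free (a);
    return EXIT_SUCCESS;
}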