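Summary: Engineering problems often require solving systems of linear or nonlinear equations, and the large systems frequently encountered in practical applications take considerable time to solve. Graphics processing units (GPUs) are used in place of the traditional CPU, multiple GPUs are coordinated through the operating system, and the PBi-CGstab method and the Inexact Newton method are adapted for multi-GPU parallelism to serve as the core algorithms of a multi-GPU solver that accelerates the solution of large linear and nonlinear systems. The multi-GPU solver presented in this paper multiplies the problem size that a single-GPU solver can handle while achieving a satisfactory speedup ratio.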
Sept. 2011    Transactions of Nanjing University of Aeronautics & Astronautics    Vol. 28 No. 3
SOLVERS FOR SYSTEMS OF LARGE SPARSE LINEAR AND NONLINEAR EQUATIONS BASED ON MULTI-GPUS
Liu Sha1, Zhong Chengwen1,2, Chen Xiaopeng3
(1. National Key Laboratory of Science and Technology on Aerodynamic Design and Research, Northwestern Polytechnical University, Xi'an, 710072, P. R. China;
2. Center for High Performance Computing, Northwestern Polytechnical University, Xi'an, 710072, P. R. China;
3. School of Mechanics, Civil Engineering and Architecture, Northwestern Polytechnical University, Xi'an, 710072, P. R. China)
Abstract: Numerical treatment of engineering application problems often eventually results in the solution of systems of linear or nonlinear equations. The solution process using digital computational devices usually takes tremendous time due to the extremely large problem sizes encountered in most real-world engineering applications. Therefore, practical solvers for systems of linear and nonlinear equations based on multiple graphics processing units (GPUs) are proposed in order to accelerate the solving process. In the linear and nonlinear solvers, the preconditioned bi-conjugate gradient stable (PBi-CGstab) method and the Inexact Newton method are used to achieve fast and stable convergence behavior. Multi-GPUs are utilized to obtain the additional data storage that large problems need.
Key words: general purpose graphics processing unit (GPGPU); compute unified device architecture (CUDA); system of linear equations; system of nonlinear equations; Inexact Newton method; bi-conjugate gradient stable (Bi-CGstab) method
CLC number: TP391    Document code: A    Article ID: 1005-1120(2011)03-0300-09
INTRODUCTION
Mathematical modeling of engineering problems often leads to systems of linear or nonlinear equations. Solving the resulting equations with numerical tools on digital computational devices is usually very time-consuming, because most real-world engineering applications are extremely large in size. In this paper, linear and nonlinear solvers based on multiple graphics processing units (GPUs) are proposed for large-scale problems, using the Inexact Newton method and the preconditioned bi-conjugate gradient stable (PBi-CGstab) method.
The general purpose GPU (GPGPU) technique denotes the implementation of general purpose computing on programmable GPUs[1]. It has been widely applied in many computational areas owing to its more powerful floating-point computing capability and higher bandwidth compared with the traditional CPU[2-3]. Furthermore, the inherent single-instruction-multiple-data (SIMD) mechanism of GPGPU operation renders this technique suitable for heavily loaded calculations.
A series of GPU-based linear algebra operations was proposed by Krüger et al. in 2003[4]. The first GPU-based conjugate gradient solver for unstructured matrices was proposed by Bolz et al.[5]. Buatois et al. proposed a general sparse linear solver using the CG method in 2009[6]. Cevahir et al. developed a fast CG solver based on GPUs with some novel optimization techniques[7]. These articles report large speedup ratios of GPU over CPU. In earlier work, Zhong and Liu proposed a fast solver that achieves a speedup ratio of about 30 on a single GPU[8].
In this paper, multi-GPUs are used to obtain the additional data storage space that large problems need. For the linear solver, the Bi-CGstab method is able to solve systems of linear equations with non-symmetric matrices, which cannot be solved by the CG method. Better convergence of the method is achieved by using a preconditioning strategy. For the nonlinear solver, the Inexact Newton method is utilized. A grid generation project in computational fluid dynamics is used to test the practicability of the linear and nonlinear solvers, in which systems of linear and nonlinear equations are solved in order to obtain the coordinates of the grid nodes.

Received date: 2010-10-13; revision received date: 2011-03-10
E-mail: virgilius@mail.nwpu.edu.cn
1 COMPUTE UNIFIED DEVICE ARCHITECTURE
The compute unified device architecture (CUDA)[9] is a GPU architecture developed by NVIDIA. A CUDA GPU contains a number of SIMD multiprocessors. Each multiprocessor contains its own shared memory and read-only constant and texture caches, which are accessible by all processors on that multiprocessor. The GPU also has a device memory, which is accessible by all multiprocessors.
CUDA GPU devices run a large number of threads in parallel. Threads are grouped together as thread blocks. Each block of threads is executed on the same multiprocessor, and its threads can communicate through fast shared memory. Threads in different blocks can communicate only through device memory. Access to the device memory is very slow compared with the shared memory. Device memory accesses should be avoided as much as possible, and the remaining accesses should be coalesced to attain high performance. Coalescing is possible if the threads access consecutive memory addresses of 4, 8 or 16 bytes and the base address of the coalesced access is a multiple of 16 (the half warp[9] size) times the size of the memory type accessed by each thread.
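The alignment condition stated above can be made concrete with a small host-side check. The following is only a sketch of the rule as worded in this section; the names kHalfWarp, is_coalescing_candidate and thread_address are hypothetical helpers introduced for illustration, not part of CUDA or of the paper's solver, and actual coalescing behavior depends on the hardware generation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Half-warp size named in the text (assumption: 16 threads per half warp).
const int kHalfWarp = 16;

// Returns true when an access pattern satisfies the rule quoted above:
// each thread reads a 4-, 8- or 16-byte word, and the base address is a
// multiple of 16 * word_size. Hypothetical helper, host-side only.
bool is_coalescing_candidate(std::uintptr_t base_address, std::size_t word_size) {
    if (word_size != 4 && word_size != 8 && word_size != 16) {
        return false;  // word size not covered by the rule
    }
    return base_address % (kHalfWarp * word_size) == 0;
}

// Address touched by thread k of the half warp when the threads access
// consecutive words starting at base_address.
std::uintptr_t thread_address(std::uintptr_t base_address, std::size_t word_size, int k) {
    assert(k >= 0 && k < kHalfWarp);
    return base_address + static_cast<std::size_t>(k) * word_size;
}
```

For 4-byte words the base address must therefore be a multiple of 64 bytes, so a base of 128 qualifies while a base of 96 does not.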
2 SYNCHRONIZATION

Using the Win64 API, a thread is created in the program for each GPU on board. Each thread manages the data input and output of its GPU, calls the GPU kernel functions, and synchronizes with the other threads. When creating and ending threads, the CUDA codes "cutStartThread" and "cutWaitForThreads", which are based on the Win64 API, can also be used for simplification.

The multi-GPU solver works in the following pattern. Firstly, the CPU distributes data and tasks to the GPUs. Then, barriers are set for the GPU managing threads. A GPU managing thread that has reached its barrier indicates that its GPU has finished its computational work, has renewed its own data in device memory along with the host memory counterpart, and is now waiting for the data to be renewed by the other GPUs. When all threads have reached their barriers, the barriers are released; each GPU then obtains the renewed data it needs and continues to work. The whole process of setting and releasing barriers constitutes one synchronization.

In this paper, semaphores are used to manage the synchronization among threads. The synchronization process is shown in Fig. 1. Define an array of semaphores (Sem[GPU_NUM]) for the GPU managing threads with initial value 0 and activation value GPU_NUM:

HANDLE Sem[GPU_NUM];
Sem[device_num] = CreateSemaphore(NULL, 0, GPU_NUM, NULL);

When a thread reaches its barrier, its semaphore is activated by adding GPU_NUM to the initial value 0:

ReleaseSemaphore(Sem[device_num], GPU_NUM, NULL);

Then the thread waits for the activation of the semaphores corresponding to the other threads:

WaitForMultipleObjects(GPU_NUM, Sem, true, INFINITE);

The first parameter of this function is the number of semaphores. The second parameter is "Sem", the address of the first element of the semaphore array. The third parameter is set to "true", which means that the function cannot return until all semaphores are activated. The fourth parameter is the maximum waiting time, which is set to "INFINITE" to ensure the logical correctness of the multi-GPU solvers. When the demand of this function is fulfilled, the synchronization process is finished.