Multicore and GPU Programming: An Integrated Approach

Dedicated to my late parents for making it possible, and my loving wife and children for making it worthwhile.

PREFACE

Parallel computing has been given a fresh breath of life since the emergence of multicore architectures in the first decade of the new century. The new platforms demand a new approach to software development, one that blends the tools and established practices of the network-of-workstations era with emerging software platforms such as CUDA.

This book tries to address this need by covering the dominant contemporary tools and techniques, both in isolation and, most importantly, in combination with each other. We strive to provide examples where multiple platforms and programming paradigms (e.g., message passing with threads) are effectively combined. "Hybrid" computation, as it is usually called, is a new trend in high-performance computing, one that could possibly allow software to scale to the "millions of threads" required for exascale performance.

All chapters are accompanied by extensive examples and practice problems, with an emphasis on putting them to work while comparing alternative design scenarios. All the little details that can make the difference between productive software development and a stressed exercise in futility are presented in an orderly fashion.

The book covers the latest advances in tools that have been inherited from the 1990s (e.g., the OpenMP and MPI standards), but also more cutting-edge platforms, such as the Qt library with its sophisticated thread management, and the Thrust template library with its capability to deploy the same software over diverse multicore architectures, including both CPUs and Graphical Processing Units (GPUs).

We could never accomplish the feat of covering all the tools available for multicore development today.
Even some of the industry-standard ones, like POSIX threads, are omitted. Our goal is to sample the dominant paradigms (ranging from OpenMP's semi-automatic parallelization of sequential code to the explicit communication "plumbing" that underpins MPI), while at the same time explaining the rationale and the how-to behind efficient multicore program development.

WHAT IS IN THIS BOOK

This book can be separated into the following logical units, although no such distinction is made in the text.

Introduction, designing multicore software: Chapter 1 introduces multicore hardware and examines influential instances of this architectural paradigm. Chapter 1 also introduces speedup and efficiency, which are essential metrics used in the evaluation of multicore and parallel software. Amdahl's law and Gustafson-Barsis's rebuttal cap the chapter, providing estimates of what can be expected from the exciting new developments in multicore and many-core hardware. Chapter 2 is all about the methodology and the design patterns that can be employed in the development of parallel and multicore software. Both work decomposition patterns and program structure patterns are examined.

Shared-memory programming: Two different approaches for shared-memory parallel programming are examined: explicit and implicit parallelization. On the explicit side, Chapter 3 covers threads and two of the most commonly used synchronization mechanisms, semaphores and monitors. Frequently encountered design patterns, such as producers-consumers and readers-writers, are explained thoroughly and applied in a range of examples. On the implicit side, Chapter 4 covers the OpenMP standard, which has been specifically designed for parallelizing existing sequential code with minimum effort. Development time is significantly reduced as a result.
There are still complications, such as loop-carried dependencies, which are also addressed.

Distributed memory programming: Chapter 5 introduces the de facto standard for distributed memory parallel programming, i.e., the Message Passing Interface (MPI). MPI is relevant to multicore programming as it is designed to scale from a shared-memory multicore machine to a million-node supercomputer. As such, MPI provides the foundation for utilizing multiple disjoint multicore machines as a single virtual platform. The features that are covered include both point-to-point and collective communication, as well as one-sided communication. A section is dedicated to the Boost.MPI library, as it does simplify the proceedings of using MPI, although it is not yet feature-complete.

GPU programming: GPUs are one of the primary reasons why this book was put together. In a similar fashion to shared-memory programming, we examine the problem of developing GPU-specific software from two perspectives: on the one hand, we have the "nuts-and-bolts" approach of Nvidia's CUDA, where memory transfers, data placement, and thread execution configuration have to be carefully planned. CUDA is examined in Chapter 6. On the other hand, we have the high-level, algorithmic approach of the Thrust template library, which is covered in Chapter 7. The STL-like approach to program design affords Thrust the ability to target both CPU and GPU platforms, a unique feature among the tools we cover.

Load balancing: Chapter 8 is dedicated to an often underestimated aspect of multicore development. In general, load balancing has to be seriously considered once heterogeneous computing resources come into play. For example, a CPU and a GPU constitute such a set of resources, so we should not think only of clusters of dissimilar machines as fitting this requirement.
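The kind of explicit planning CUDA demands can be made concrete with a minimal sketch (illustrative only, not one of the book's listings; it assumes an Nvidia GPU and the CUDA toolkit, so it cannot run elsewhere): the programmer chooses where data lives and spells out the grid/block execution configuration of every kernel launch.

```cuda
// A kernel launch with an explicit execution configuration; compile with nvcc.
#include <cuda_runtime.h>

__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) d[i] *= f;                           // guard against overshoot
}

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));    // data placement: device global memory
    int block = 256;                      // threads per block
    int grid  = (N + block - 1) / block;  // enough blocks to cover N items
    scale<<<grid, block>>>(d, N, 2.0f);   // the execution configuration
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

Thrust would express the same operation as a one-line transform over device iterators, hiding the configuration arithmetic entirely.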
Chapter 8 briefly discusses the Linda coordination language, which can be considered a high-level abstraction of dynamic load balancing. The main focus is on static load balancing and the mathematical models that can be used to drive load partitioning and data communication sequences. A well-established methodology known as Divisible Load Theory (DLT) is explained and applied in a number of scenarios. A simple C++ library that implements parts of the DLT research results, which have been published over the past two decades, is also presented.

USING THIS BOOK AS A TEXTBOOK

The material covered in this book is appropriate for senior undergraduate or postgraduate course work. The required student background includes programming in C and C++ (both languages are used throughout this book), basic operating system concepts, and at least elementary knowledge of computer architecture. Depending on the desired focus, an instructor may choose to follow one of the suggested paths listed below. The first two chapters lay the foundations for the other chapters, so they are included in all sequences.

Emphasis on parallel programming (undergraduate):
- Chapter 1: Flynn's taxonomy, contemporary multicore machines, performance metrics. Sections 1.1-1.5
- Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1-2.5
- Chapter 3: Threads, semaphores, monitors. Sections 3.1-3.7
- Chapter 4: OpenMP basics, work-sharing constructs. Sections 4.1-4.4
- Chapter 5: MPI, point-to-point communications, collective operations, object/structure communications, debugging and profiling. Sections 5.1-5.12, 5.15-5.18, 5.20
- Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1-6.6, 6.7.1, 6.7.3, 6.7.6, 6.9-6.11, 6.12.1
- Chapter 7: Thrust basics. Sections 7.1-7.4
- Chapter 8: Load balancing.
Sections 8.1-8.3

Emphasis on multicore programming (undergraduate):
- Chapter 1: Flynn's taxonomy, contemporary multicore machines, performance metrics. Sections 1.1-1.5
- Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1-2.5
- Chapter 3: Threads, semaphores, monitors. Sections 3.1-3.10
- Chapter 4: OpenMP basics, work-sharing constructs, correctness and performance issues. Sections 4.1-4.8
- Chapter 5: MPI, point-to-point communications, collective operations, debugging and profiling. Sections 5.1-5.12, 5.16-5.18, 5.21
- Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1-6.10, 6.12.1
- Chapter 7: Thrust basics. Sections 7.1-7.4
- Chapter 8: Load balancing. Sections 8.1-8.3

Advanced multicore programming:
- Chapter 1: Flynn's taxonomy, contemporary multicore machines, performance metrics. Sections 1.1-1.5
- Chapter 2: Design, PCAM methodology, decomposition patterns, program structure patterns. Sections 2.1-2.5
- Chapter 3: Threads, semaphores, monitors, advanced thread management. Sections 3.1-3.10
- Chapter 4: OpenMP basics, work-sharing constructs, correctness and performance issues. Sections 4.1-4.8
- Chapter 5: MPI, point-to-point communications, collective operations, object/structure communications, debugging and profiling. Sections 5.1-5.12, 5.15-5.18, 5.21-5.22
- Chapter 6: CUDA programming model, memory hierarchy, GPU-specific optimizations. Sections 6.1-6.12
- Chapter 7: Thrust datatypes and algorithms. Sections 7.1-7.7
- Chapter 8: Load balancing, DLT-based partitioning. Sections 8.1-8.5

SOFTWARE AND HARDWARE REQUIREMENTS

The book examples have been developed and tested on Ubuntu Linux. All the software used throughout this book is available in free or open-source form. These include:
- GNU C/C++ Compiler Suite 4.8.x (for CUDA compatibility) and 4.9.x (for OpenMP 4.0 compatibility)
- Digia's Qt 4.x or 5.x library
- OpenMPI 1.6
- MPE
- Nvidia's CUDA SDK 6.5
- Thrust library 1.7

A reasonably recent Linux installation, with the above or newer versions of the listed software, should have no problem running the sample code provided. Although we do not provide makefiles or instructions for compiling and executing them using Visual Studio on a Windows platform, users without access to a Linux installation should be able to port the examples with minimum changes. Given that we use standard C/C++ libraries, the changes, if any, should affect only header files, i.e., which ones to include. Linux can also be easily installed, without even modifying the configuration of a machine, via virtualization technology: the freely available VirtualBox software from Oracle can handle running Linux on a host Windows system with minimal resource consumption.

In terms of hardware, the only real restriction is the need to have a Compute Capability 2.x or newer Nvidia GPU. Earlier generation chips may be used, but their peculiarities, especially regarding global memory access, are not explained in the text. Users without an Nvidia GPU may have some success in running CUDA programs via the workarounds explained in Appendix E.

SAMPLE CODE

The programs presented in the pages of this book are made available in a compressed archive form from the publisher's Web site (http://store.elsevier.com/9780124171374). The programs are organized in dedicated folders, identified by the chapter name, as shown in Figure 1. Each listing in the book is headed by the location of the corresponding file relative to the chapter's directory. Single-file programs contain the command that compiles and links them in their first-line comments. Multifile projects reside in their own directories, which also contain a makefile or a project (.pro) file.
Sample input data are also provided wherever needed.

FIGURE 1: Screenshot showing how sample code is organized in chapter-specific folders.

Acquiring Editor: Todd Green
Developmental Editor: Nate McFadden
Project Manager: Punithavathy Govindaradjane
Designer: Mark Rogers

Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA

Copyright © 2015 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies, and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-12-417137-4

British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library.

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress.

For information on all MK publications, visit our website at www.mkp.com.

Working together to grow libraries in developing countries: www.elsevier.com | www.bookaid.org

LIST OF TABLES

Table 1.1: The top nine most powerful supercomputers as of June 2014, sorted in descending order of their TFlop/KW ratio.
Table 2.1: Decomposition patterns and the most suitable program structure patterns for implementing them.
Table 3.1: Qt classes for implementing binary and counting semaphores and their most important methods. The acquire(n), release(n), and tryAcquire(n) are convenience methods that increment or decrement a general semaphore n times, without the need for a loop.
Table 3.2: Software patterns employed for using a semaphore in each of its three distinct roles.
Table 3.3: A summary of the required number and type of semaphores needed to solve the producers-consumers problem.
Table 3.4: A list of the functions provided by the QtConcurrent namespace.
T represents the type of element to which the map/filter/reduce functions apply.
Table 4.1: List of the available operators for the reduction clause, along with the initial value of the reduction variable's private copies [37].
Table 5.1: A sample of the possible outputs that can be generated from a run of the program in Listing 5.3 using four processes.
Table 5.2: A partial list of MPI datatypes.
Table 5.3: List of predefined reduction operators in MPI.
Table 5.4: The color and key parameters to MPI_Comm_split as they are calculated in Listing 5.24.
Table 5.5: A list of file opening modes that can be used in MPI_File_open.
Table 5.6: A selection of collective operations supported by Boost.MPI.
Table 5.7: Average encryption cracking time for the program of Listing 5.36 on a third-generation i7 CPU clocked at 4.2 GHz. The message was encrypted with the key 107481429. Reported numbers are averaged over 100 runs.
Table 6.1: Compute Capabilities and associated limits on block and grid sizes.
Table 6.2: A sample list of GPU chips and their SM capabilities.
Table 6.3: Compute Capabilities and associated limits on kernel and thread scheduling.
Table 6.4: Possible values for the -arch and -code parameters of the nvcc command.
Table 6.5: Summary of the memory hierarchy characteristics.
Table 6.6: Pairs of /D and /D values for each of 6 warps running the program of Listing 6.18 for N=3.
Table 6.7: An estimation of the shared memory needed per SM to provide conflict-free access during a histogram calculation of an image.
