论文研究-A Software-Controlled Cache Coherence Optimization for Snoopy-based SMP system.pdf


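A Software-Controlled Cache Coherence Optimization for Shared-Memory Processors Using a Bus Snooping Protocol. Zhang Youhui, Ziqiang Qian. In current shared-memory systems based on bus snooping protocols, studies have shown that on average 67% of bus broadcast messages are unnecessary. To reduce these unnecessary broadcast messages, this paper proposes ...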
中国科技论文在线 (Sciencepaper Online) http://www.paper.edu.cn

there are always some control registers or coprocessor registers in most popular processor architectures. For example, MIPS and ARM provide coprocessors, and Ultra-SPARC provides ASRs (ancillary state registers). Related instructions are defined to access these registers. Generally speaking, these registers contain some reserved bits for future extension, so these bits can be used to switch the cache coherence mechanism on and off.

Regardless of the concrete micro-architecture, the simplest hardware design can be presented as follows. One control or coprocessor register with reserved bits is employed to record the current status of the cache coherence mechanism. In detail, if the special bit of this register is 1, coherence is enabled; otherwise it is disabled. The related register access instruction can modify the bit as needed. The bit is then checked before any cache operation to decide whether the snoopy-based protocol should be enabled.

However, this design does not suit a multi-task environment. For example, if one task, a, enables the mechanism (that is, sets the related bit to 1) before it is suspended, and the next running task, b, sets the bit to 0, cache coherence will be violated when a is re-executed. There are two solutions for this case: one is to swap the register out and in as part of the task's context; the other is to add an internal register that records the accumulated number of enable operations. When a disable instruction is committed, the register is decreased by 1. Only if the internal register is 0 is the reserved bit set to 0; otherwise it is 1.

In this paper, we adopt the second solution for simplicity. Moreover, our target is the embedded system, which often has only one task running on a processor at a time.

3. Evaluation Methodology

3.1 Evaluation System

Detailed timing evaluation is performed with a multiprocessor simulator, Sparc-Sulima [11]. It is a machine simulator for the Ultra-SPARC implementation of the SPARC V9 processor architecture. Sparc-Sulima can be used to give accurate analysis of memory behavior for thread interactions in the SMP context, where all CPUs are linked by one system bus.

Table 1. Cache Configurations

Cache       Read/Write Scheme                Size/line/block                          Replace Scheme
L1 DCache   Write-through, read-allocation   16KB, two 16 byte sub blocks per line    direct mapped
L1 ICache   read-allocation                  16KB, 32 byte block per line             two-way set-associative
L2 Cache    Write-back                       1MB, 64 byte line                        direct mapped

As mentioned in [11], Sparc-Sulima implements the level-1 ICache/DCache, the combined level-2 cache, and the cache-coherency related functions for intra-CPU and SMP coherency. In version 0.3, which we used, the snoopy-based mechanism is employed to implement its MOESI protocol. Moreover, some configurations of the simulated system can be adjusted by a script file, including the CPU number, the related memory transaction latencies and some parameters of the cache configurations.

We modify Sparc-Sulima so that the instructions that read and write ASRs, RDASR and WRASR, are enhanced as described in Section 2. That is, whether cache-coherence operations are disabled depends on the control bit in the ASR.

In our evaluation, the example programs bundled with the simulator package are tested in the simplest fashion, which is to boot the simulated Ultra-SPARC SMP directly from the main functions of these programs using a specially compiled executable. The tests are run under the system cache configurations presented in Table 1.
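
The counter-backed control bit from Section 2, which the modified simulator consults before coherence operations, can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware or the Sparc-Sulima code: the class and method names are hypothetical, and the behavior of a disable with no outstanding enable (ignored here) is not specified in the text.

```python
class CoherenceControl:
    """Toy model of the reserved control bit backed by an enable counter.

    All names are hypothetical; the paper describes hardware registers,
    not a software class.
    """

    def __init__(self):
        # Internal register: number of outstanding enable operations.
        self.enable_count = 0

    def enable(self):
        # An enable instruction increments the internal register.
        self.enable_count += 1

    def disable(self):
        # A committed disable instruction decrements the internal register.
        # Assumption: a disable with no outstanding enable is ignored.
        if self.enable_count > 0:
            self.enable_count -= 1

    @property
    def coherence_bit(self):
        # The reserved bit is 0 only when the internal register is 0.
        return 1 if self.enable_count > 0 else 0


def interleaved_tasks():
    """Replay the two-task scenario and record the bit after each step."""
    ctl = CoherenceControl()
    trace = []
    ctl.enable()                     # task a enables, then is suspended
    trace.append(ctl.coherence_bit)
    ctl.enable()                     # task b enables
    trace.append(ctl.coherence_bit)
    ctl.disable()                    # task b disables; a still needs coherence
    trace.append(ctl.coherence_bit)
    ctl.disable()                    # task a resumes and disables
    trace.append(ctl.coherence_bit)
    return trace
```

In this model the bit stays at 1 after task b's disable, so task a does not resume with coherence switched off, which is the failure the single-bit design would exhibit.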
The related operation latencies (in cycles) are as follows:

R/W latency of L1 Cache: 1
R/W latency of L2 Cache: 3
R/W latency of main memory: 32
Latency of the completion of broadcasting invalidation/read requests: 3n (n is the number of CPUs)

3.2 Results

The following command is used to launch the test program, myprog, on the simulator.

run myprog/-pp//t/tm//

Here t and tm are the memory system trace levels, and p is the number of CPUs to be linked to the single system bus. For p > 1, each CPU boots from main() in SPMD style using independent stacks, with any shared data being declared static. For SMP programs, libsmp in the package provides all necessary calls.

The test cases are run under the configurations in Table 1, and all running times are recorded in Table 2. We can see that the performance improvement ranges from 3.4% to 9.7% for these programs. The improvement is larger for programs with more memory operations and for programs running in a larger SMP system.

Table 2. Test results
Data in parentheses are the running time when the optimization is disabled and the ratio of the two results. Unit: 32 cycles. Cells that cannot be recovered from the source are marked [illegible].

Program   CPU Number = 2               CPU Number = 3               CPU Number = 4
Hel       5742 (5952, 96%)             [illegible]                  5792 (6048, 95.8%)
Meminst   4710 (5120, 92.0%)           4967 (5408, 91.8%)           5056 (5600, 90.3%)
Ins       [illegible] (64640, 96.0%)   62900 (66720, 94.3%)         64064 (68416, 93.6%)
Intops    3405783 (3698560, 92.1%)     3455894 (3768768, 91.7%)     3495900 (3853056, 90.7%)

4. Conclusions

This software/hardware hybrid cache coherence optimization requires programmers to insert special instructions into the program to enable and disable the cache coherence mechanism. For an instruction block that contains accesses to shared variable(s), one pair of enable/disable instructions is inserted to maintain data coherence. Although the insertion is manual, it accords with the natural SMP programming model.
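
The pairing of enable/disable instructions around a block that touches shared data can be sketched as follows. This is a toy model under loud assumptions rather than the paper's implementation: `ToyCPU`, `coherent`, and `run_example` are hypothetical names, every write made while coherence is enabled is assumed to trigger one broadcast (a real MOESI protocol depends on the cache line state), and the broadcast cost is read as 3n cycles from the latency list in Section 3.1.

```python
from contextlib import contextmanager


class ToyCPU:
    """Hypothetical CPU model that only tracks bus broadcast cycles."""

    def __init__(self, num_cpus):
        self.num_cpus = num_cpus
        self.coherence_enabled = False
        self.broadcast_cycles = 0

    def write(self, memory, key, value):
        memory[key] = value
        if self.coherence_enabled:
            # Assumption: each coherent write costs one broadcast of 3n cycles.
            self.broadcast_cycles += 3 * self.num_cpus


@contextmanager
def coherent(cpu):
    """Stand-in for one enable/disable instruction pair."""
    cpu.coherence_enabled = True       # "enable" instruction
    try:
        yield cpu
    finally:
        cpu.coherence_enabled = False  # "disable" instruction


def run_example(num_cpus=4):
    cpu = ToyCPU(num_cpus)
    private, shared = {}, {}
    # Private data: coherence is off, so no broadcast is charged.
    for i in range(10):
        cpu.write(private, i, i * i)
    # Shared data: wrap the access in one enable/disable pair.
    with coherent(cpu):
        cpu.write(shared, "total", sum(private.values()))
    return cpu.broadcast_cycles, shared["total"]
```

Under these assumptions, only the single shared write is charged a broadcast, while the ten private writes are not; this is the kind of saving the optimization targets.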
Test results show that the optimization can reduce the average memory access latency in a snoopy-based multiprocessor system, and hence improve overall performance.

References

[1] J. Huh, D. Burger, and S. W. Keckler. Exploring the design space of future CMPs. In Proc. 10th International Conference on Parallel Architectures and Compilation Techniques, September 2001.
[2] D. A. Wood and M. D. Hill. Cost-effective parallel computing. IEEE Computer Magazine, 28(2), Feb. 1995.
[3] A. Charlesworth. The Sun Fireplane System Interconnect. Proceedings of SC2001.
[4] J. Tendler, S. Dodson, and S. Fields. IBM eServer Power4 System Microarchitecture. Technical White Paper, IBM Server Group, 2001.
[5] R. Kalla, B. Sinharoy, and J. Tendler. IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro, 2004.
[6] Jason F. Cantin, Mikko H. Lipasti, and James E. Smith. Improving Multiprocessor Performance with Coarse-Grain Coherence Tracking. Proceedings of the 32nd Annual International Symposium on Computer Architecture. Pages: 246-257. 2005.
[7] Andreas Moshovos. RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based Coherence. Proceedings of the 32nd Annual International Symposium on Computer Architecture. Pages: 234-245. 2005.
[8] John L. Hennessy, David A. Patterson. Chapter 3, Computer Architecture: A Quantitative Approach (3rd Edition). Pub. Date: June 200
[9] P. Sweazey, A. J. Smith. A Class of Compatible Cache Consistency Protocols and their Support by the IEEE Futurebus. Proceedings of the 13th Annual International Symposium on Computer Architecture. Pages: 414-423. 1986.
[10] Kai Baukus, Ron van der Meyden. A Knowledge Based Analysis of Cache Coherence. ICFEM 2004: 99-114.
[11] Bill Clarke, Andrew Over and Peter Strazdins. The Sparc-Sulima Manual. http://cap.anu.edu.au/cap/projects/sulima. The Australian National University, 2004.

Author brief introduction

Zhang, Youhui is an Associate Professor in the Department of Computer Science at the University of Tsinghua, China. His research interests include portable computing, network storage and microprocessor architecture. He received his Ph.D. degree in Computer Science from the same university in 2002.
