{"id":102343,"date":"2010-02-07T00:00:00","date_gmt":"2010-02-07T00:00:00","guid":{"rendered":"https:\/\/www.deberes.net\/tesis\/sin-categoria\/on-the-programmability-of-heterogeneous-massively-parallel-computing-systems\/"},"modified":"2010-02-07T00:00:00","modified_gmt":"2010-02-07T00:00:00","slug":"on-the-programmability-of-heterogeneous-massively-parallel-computing-systems","status":"publish","type":"post","link":"https:\/\/www.deberes.net\/tesis\/ciencia-de-los-ordenadores\/on-the-programmability-of-heterogeneous-massively-parallel-computing-systems\/","title":{"rendered":"On the programmability of heterogeneous massively-parallel computing systems"},"content":{"rendered":"<h2>Doctoral thesis by <strong> Isaac Gelado Fernandez <\/strong><\/h2>\n<p>Heterogeneous parallel computing combines general-purpose processors with accelerators to efficiently execute both sequential control-intensive and data-parallel phases of applications. Existing programming models for heterogeneous parallel computing impose added coding complexity when compared to traditional sequential shared-memory programming models for homogeneous systems. This extra code complexity is acceptable in supercomputing environments, where programmability is sacrificed in pursuit of high performance. However, heterogeneous parallel systems are reaching the desktop market on a massive scale (e.g., 425.4 million GPU cards were sold in 2009), where the trade-off between performance and programmability is the opposite. The code complexity required when using accelerators and the lack of compatibility prevent programmers from exploiting the full computing capabilities of heterogeneous parallel systems in general-purpose applications.<\/p>\n<p>This dissertation aims to increase the programmability of CPU&#8211;accelerator systems without introducing major performance penalties. The key insight is that general-purpose application programmers tend to favor programmability at the cost of system performance. 
This fact is illustrated by the tendency to use high-level programming languages, such as C++, to ease the task of programming at the cost of minor performance penalties. Moreover, many general-purpose applications are currently being developed using interpreted languages, such as Java, C# or Python, which raise the abstraction level even further while introducing relatively large performance overheads. This dissertation also takes the approach of raising the level of abstraction for accelerators to improve programmability, and investigates hardware and software mechanisms to efficiently implement these high-level abstractions without introducing major performance overheads.<\/p>\n<p>Heterogeneous parallel systems typically implement separate memories for CPUs and accelerators, although commodity systems might use a shared memory at the cost of lower performance. However, in these commodity shared-memory systems, coherence between accelerators and CPUs is not guaranteed. This system architecture implies that CPUs can only access system memory, and accelerators can only access their own local memory. This dissertation assumes separate system and accelerator memories and shows that low-level abstractions for these disjoint address spaces are the source of the poor programmability of heterogeneous parallel systems.<\/p>\n<p>A first consequence of having separate system and accelerator memories is the current data transfer models for heterogeneous parallel systems. In this dissertation, two data transfer paradigms are identified: per-call and double-buffered. In both models, data structures used by accelerators are allocated in both system and accelerator memories. The models differ in how data is managed between accelerator and system memories. The per-call model transfers the input data needed by accelerators before accelerator calls, and transfers back the output data produced by accelerators on accelerator call return. 
The per-call model is quite simple, but might impose unacceptable performance penalties due to data transfer overheads. The double-buffered model aims to overlap data communication with CPU and accelerator computation. This model requires relatively complex code due to parallel execution and the need for synchronization between data communication and processing tasks. The extra code required for data transfers in these two models is necessary due to the lack of by-reference parameter passing to accelerators. This dissertation presents a novel accelerator-hosted data transfer model. In this model, data used by accelerators is hosted in the accelerator memory, so when the CPU accesses this data, it is effectively accessing the accelerator memory. Such a model cleanly supports by-reference parameter passing to accelerator calls, removing the need for explicit data transfers.<\/p>\n<p>The second consequence of separate system and accelerator memories is that current programming models export separate virtual system and accelerator address spaces to application programmers. This dissertation identifies the double-pointer problem as a direct consequence of these separate virtual memory spaces. The double-pointer problem is that data structures used by both accelerators and CPUs are referenced by different virtual memory addresses (pointers) in the CPU and accelerator code. The double-pointer problem requires programmers to add extra code to ensure that both pointers contain consistent values (e.g., when reallocating a data structure). Keeping system and accelerator pointers consistent might penalize accelerator performance and increase the accelerator memory requirements when pointers are embedded within data structures (e.g., a linked list). For instance, the double-pointer problem increases the number of global memory accesses by 2x in GPU code that reconstructs a linked list. 
This dissertation argues that a unified virtual address space that includes both system and accelerator memories is an efficient solution to the double-pointer problem. Moreover, such a unified virtual address space cleanly complements the accelerator-hosted data model previously discussed.<\/p>\n<p>This dissertation introduces the Non-Uniform Accelerator Memory Access (NUAMA) architecture as a hardware implementation of the accelerator-hosted data transfer model and the unified virtual address space. In NUAMA, an Accelerator Memory Collector (AMC) is included within the system memory controller to identify memory requests for accelerator-hosted data. The AMC buffers and coalesces such memory requests to efficiently transfer data from the CPU to the accelerator memory. NUAMA also implements a hybrid L2 cache memory. The L2 cache in NUAMA follows a write-through\/write-non-allocate policy for accelerator-hosted data. This policy ensures that the contents of the accelerator memory are updated eagerly and, therefore, when the accelerator is called, most of the data has already been transferred. The eager update of the accelerator memory contents effectively overlaps data communication and CPU computation. A write-back\/write-allocate policy is used for the data hosted by the system memory, so the performance of applications that do not use accelerators is not affected. In NUAMA, accelerator-hosted data is identified using a TLB-assisted mechanism. The page table entries are extended with a bit, which is set for those memory pages that are hosted by the accelerator memory. NUAMA increases the average bandwidth requirements for the L2 cache memory and the interconnection network between the CPU and accelerators, but the instantaneous bandwidth requirements, which are the limiting factor, are lower than in traditional DMA-based architectures. The NUAMA architecture is compared to traditional DMA systems using cycle-accurate simulations. 
Experimental results show that NUAMA and traditional DMA-based architectures perform equally well. However, the application source code complexity of NUAMA is much lower than in DMA-based architectures.<\/p>\n<p>A software implementation of the accelerator-hosted model and the unified virtual address space is also explored. This dissertation presents the Asymmetric Distributed Shared Memory (ADSM) model. ADSM maintains a shared logical memory space for CPUs to access data in the accelerator physical memory but not vice versa. The asymmetry allows lightweight implementations that avoid common pitfalls of symmetrical distributed shared memory systems. ADSM allows programmers to assign data structures to performance-critical methods. When a method is selected for accelerator execution, its associated data objects are allocated within the shared logical memory space, which is hosted in the accelerator physical memory and transparently accessible by the methods executed on CPUs. ADSM reduces programming effort for heterogeneous parallel computing systems and enhances application portability. The design and implementation of an ADSM run-time, called GMAC, on top of CUDA in a GNU\/Linux environment is presented. Experimental results show that applications written in ADSM and running on top of GMAC achieve performance comparable to their counterparts using programmer-managed data transfers. This dissertation presents the GMAC system, evaluates different design choices, and further suggests additional architectural support that will likely allow GMAC to achieve higher application performance than the current CUDA model.<\/p>\n<p>Finally, the execution model of heterogeneous parallel systems is considered. Accelerator execution is abstracted in different ways in existing programming models. This dissertation explores three approaches implemented by existing programming models. 
OpenCL and the NVIDIA CUDA driver API use file descriptor semantics to abstract accelerators: user processes access accelerators through descriptors. This approach increases the complexity of using accelerators because accelerator descriptors are needed in any call involving the accelerator (e.g., memory allocations or passing a parameter to the accelerator). The IBM Cell SDK abstracts accelerators as separate execution threads. This approach requires adding the necessary code to create new execution threads and synchronization primitives to use accelerators. Finally, the NVIDIA CUDA run-time API abstracts accelerators as remote procedure calls (RPC). This approach is fundamentally incompatible with ADSM, because it assumes separate virtual address spaces for accelerator and CPU code. The Heterogeneous Parallel Execution (HPE) model is presented in this dissertation. This model extends the execution thread abstraction to incorporate different execution modes. Execution modes define the capabilities (e.g., accessible virtual address space, code ISA, etc.) of the code being executed. In this execution model, accelerator calls are implemented as execution mode switches, analogously to system calls. Accelerator calls in HPE are synchronous, unlike in CUDA, OpenCL and the IBM Cell SDK. Synchronous accelerator calls provide full compatibility with the existing sequential execution model provided by most operating systems. Moreover, abstracting accelerator calls as execution mode switches allows applications that use accelerators to run on systems without accelerators. In these systems, the execution mode switch falls back to an emulation layer, which emulates the accelerator execution on the CPU. This dissertation further presents different design and implementation choices for the HPE model in GMAC. The necessary hardware support for an efficient implementation of this model is also presented. 
Experimental results show that HPE introduces a low execution-time overhead while offering a clean and simple programming interface to applications.<\/p>\n<p>&nbsp;<\/p>\n<h3>Academic details of the doctoral thesis \u00ab<strong>On the programmability of heterogeneous massively-parallel computing systems<\/strong>\u00bb<\/h3>\n<ul>\n<li><strong>Thesis title:<\/strong>\u00a0 On the programmability of heterogeneous massively-parallel computing systems <\/li>\n<li><strong>Author:<\/strong>\u00a0 Isaac Gelado Fernandez <\/li>\n<li><strong>University:<\/strong>\u00a0 Polit\u00e9cnica de Catalunya<\/li>\n<li><strong>Thesis defense date:<\/strong>\u00a0 02\/07\/2010<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>Supervision and committee<\/h3>\n<ul>\n<li><strong>Thesis supervisor<\/strong>\n<ul>\n<li>Nacho Navarro Mas<\/li>\n<\/ul>\n<\/li>\n<li><strong>Committee<\/strong>\n<ul>\n<li>Committee chair: Yale Patt <\/li>\n<li>Avi Mendelson (member)<\/li>\n<li>David B. Kirk (member)<\/li>\n<li>Mateo Valero Cort\u00e9s (member)<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Doctoral thesis by Isaac Gelado Fernandez Heterogeneous parallel computing combines general purpose processors with accelerators to efficiently execute both sequential 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-gradient":""}},"footnotes":""},"categories":[4810,1890,15596],"tags":[207649,207650,207647,13321,194998,207648],"class_list":["post-102343","post","type-post","status-publish","format-standard","hentry","category-arquitectura-de-ordenadores","category-ciencia-de-los-ordenadores","category-politecnica-de-catalunya","tag-avi-mendelson","tag-david-b-kirk","tag-isaac-gelado-fernandez","tag-mateo-valero-cortes","tag-nacho-navarro-mas","tag-patt-yale"],"_links":{"self":[{"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/posts\/102343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/comments?post=102343"}],"version-history":[{"count":0,"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/posts\/102343\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/media?parent=102343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"hre
f":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/categories?post=102343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.deberes.net\/tesis\/wp-json\/wp\/v2\/tags?post=102343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}