
Design of HPC Multithread Accelerators

As analyzed in Section 6.4, HLS tools can boost the performance of certain critical tasks within a control system with limited extra design effort. An alternative and, in many cases, advantageous solution is to define parallelism explicitly in the source code of the algorithm to be accelerated. Languages with explicit parallelism, such as CUDA or OpenCL, share the same model of computation, that is, the same way of writing efficient code to achieve significant acceleration. While CUDA is specific to NVIDIA GPUs, OpenCL is becoming a de facto standard due to its portability to different platforms, such as multiprocessors, GPUs (and GPU clusters), FPGAs, or even heterogeneous systems formed by combinations of these platforms.

OpenCL ensures code portability between different computing devices, although performance is not guaranteed. It is clear that the computation model underneath the code and the hardware architecture on which it is executed play a crucial role in the resulting performance. As a matter of fact, if the code is not written carefully enough, performance can be degraded to the extent that it can be worse than that achieved using a single processor. 

The computation model provided by this type of language relies on a multithread approach, based on the parallel execution of multiple basic elements (called work items in OpenCL and threads in CUDA), with different levels of interaction between them. Each work item/thread has its own private memory for independent computing, ensuring that the maximum possible bandwidth is achieved.

Work items/threads can be bundled into work-groups/thread blocks. All bundled elements share a second memory level, called local memory, which can be accessed by any of them, but with some restrictions. Local memory is multibank, so it has multiple ports for parallel access from all elements at the same time. However, each memory bank can only be accessed by one element at a time, unless several elements access the same memory position within the bank. Special care must be taken with this type of access, since good exploitation of parallelism comes from parallel coalesced accesses to this memory, with no congestion caused by irregular access patterns.
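The effect of the access pattern on a multibank local memory can be illustrated with a minimal sketch. It assumes a simple interleaved mapping in which word address a belongs to bank a mod B, with B = 16 banks; both the mapping and the bank count are assumptions made for illustration, not properties of any particular device, and the function name is hypothetical.

```python
from collections import defaultdict

def bank_access_rounds(addresses, num_banks=16):
    """Count the serialized rounds needed to serve one access per work item.

    Assumed model (for illustration only): word address a maps to bank
    a % num_banks. Work items reading the same word in a bank are served
    together; distinct words in the same bank are served one after another.
    """
    per_bank = defaultdict(set)
    for addr in addresses:
        per_bank[addr % num_banks].add(addr)
    # The busiest bank determines how long the whole access takes.
    return max((len(words) for words in per_bank.values()), default=0)

unit_stride = list(range(16))            # each work item hits a different bank
stride_16 = [i * 16 for i in range(16)]  # all work items hit bank 0, different words
same_word = [5] * 16                     # all work items read the same position

print(bank_access_rounds(unit_stride))   # 1
print(bank_access_rounds(stride_16))     # 16
print(bank_access_rounds(same_word))     # 1
```

Under this model, the unit-stride pattern and the single shared word are served in one round, whereas a stride equal to the number of banks serializes all 16 accesses.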

Each work-group/thread block is expected to be fully executed in the same computing unit (CU), but since the computing models dictate that different work-groups/thread blocks execute independently, each of them may run in a different CU. All work-groups to be executed are bundled into a kernel. A kernel is invoked whenever there is a need to execute multiple work items/threads in parallel. If the number of work-groups to be executed is higher than the number of CUs available, execution is sequenced until all work-groups/thread blocks in the kernel have finished.
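The sequencing of work-groups over a limited number of CUs can be sketched as follows. The sketch assumes that every work-group takes the same time and that work-groups are dispatched in index order, which a real runtime does not guarantee; the function name is hypothetical.

```python
import math

def schedule_work_groups(num_groups, num_cus):
    """Return a list of rounds; each round lists (work_group, cu) pairs.

    Assumption for illustration: equal run time per work-group and
    in-order dispatch, so groups are served in ceil(num_groups / num_cus)
    consecutive rounds.
    """
    rounds = [[] for _ in range(math.ceil(num_groups / num_cus))]
    for group in range(num_groups):
        rounds[group // num_cus].append((group, group % num_cus))
    return rounds

# 10 work-groups on 4 CUs need 3 rounds; the last one keeps only 2 CUs busy.
for index, batch in enumerate(schedule_work_groups(10, 4)):
    print(index, batch)
```

In the last round only two of the four CUs are busy, which shows why a work-group count that is a multiple of the CU count tends to use the hardware more evenly.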

All work-groups/thread blocks in a kernel also share a third memory level, called global memory, which they access through burst transactions so that data throughput at this level is maximized. Every work-group/thread block and work item/thread has a numeric identifier that enables each of them to access its own set of data in this memory. These identifiers can be one, two, or three dimensional, so that the organization of work-groups/thread blocks and work items/threads can match the memory needs of the algorithm. For instance, one-dimensional partitioning is adequate for dealing with single signals, whereas two-dimensional (2D) access is more efficient for 2D image processing, and three-dimensional identifiers are the best solution for finite-element analysis (e.g., mechanical) of a 3D structure.
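The way identifiers select data can be sketched with the usual index arithmetic: the global identifier in each dimension is the group identifier times the work-group size plus the local identifier. This mirrors what OpenCL returns from get_global_id (with a zero global offset) and the common CUDA expression blockIdx * blockDim + threadIdx. The helper names and the image width below are assumptions chosen for illustration.

```python
def global_id(group_id, local_id, local_size):
    """Combine per-dimension group and local identifiers into a global one."""
    return tuple(g * s + l for g, l, s in zip(group_id, local_id, local_size))

def flat_index(gid, width):
    """Map a 2D global identifier (x, y) to a row-major buffer offset."""
    x, y = gid
    return y * width + x

# Work item (3, 5) of work-group (2, 1), with 8 x 8 work items per group.
gid = global_id(group_id=(2, 1), local_id=(3, 5), local_size=(8, 8))
print(gid)                        # (19, 13)
print(flat_index(gid, width=64))  # 851, the pixel offset in a 64-pixel-wide image
```

The same function works unchanged for one or three dimensions, since it simply iterates over however many components the identifiers have.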

Kernels are invoked from a host, which executes serial code containing kernel invocations. Kernels are then executed on the so-called device, which contains the CUs required to accelerate kernel execution. Host and device each have their own memory, so memory transactions are required between them for data provision and result collection. If kernels are not computing intensive enough, the time saved by parallel computing may be counteracted by the time spent in memory transactions. Another possible cause of performance degradation is the need for synchronization between parallel work items/threads, equivalent to a barrier in multiprocessing terminology, which might cause some CUs to be underutilized.
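The trade-off between computing time saved and transfer time added can be made concrete with a rough first-order model. It assumes that transfers and kernel execution do not overlap and that transfer time is a fixed latency plus bytes divided by bandwidth; the function name, parameters, and the figures in the example are hypothetical and chosen only to show the two regimes.

```python
def offload_speedup(t_serial, parallel_speedup, bytes_moved, bandwidth, latency=0.0):
    """Estimate end-to-end speedup of offloading a task to the device.

    Assumed model (illustrative only): no overlap between transfers and
    computation; times in seconds, bandwidth in bytes per second.
    """
    t_kernel = t_serial / parallel_speedup
    t_transfer = latency + bytes_moved / bandwidth
    return t_serial / (t_kernel + t_transfer)

# Same 20x kernel speedup, 1 GB/s link, different amounts of data moved.
heavy = offload_speedup(t_serial=1.0, parallel_speedup=20, bytes_moved=1e9, bandwidth=1e9)
light = offload_speedup(t_serial=1.0, parallel_speedup=20, bytes_moved=1e7, bandwidth=1e9)
print(heavy < 1.0)  # True: the transfer time cancels the benefit
print(light > 10)   # True: computation dominates, so offloading pays off
```

With these figures, moving 1 GB makes the accelerated version slower than the serial one, whereas moving 10 MB preserves most of the kernel speedup.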

A host program may, of course, invoke more than one kernel during its serial execution. It is also possible to specify how many accelerators (CUs) are to be allocated in the FPGA for each kernel. Additionally, it is possible to modify the logic inside the FPGA by partially reconfiguring the area devoted to the accelerators so that different combinations of CUs can be used during host program execution. An example of the use of partial reconfiguration for this purpose is described in Section 8.3.3.

The main FPGA vendors offer tools for accelerating OpenCL kernels implemented in FPGAs. They are suited to powerful high-end FPGA boards (the devices) hosted in personal computers (the hosts) and connected through PCIe interfaces. It is also possible to run the same OpenCL programs in the host processor to verify functionality, verify them with hardware simulators, test them in real hardware using just one CU in the FPGA, or fully verify them.

Although multithread acceleration tools are intended to support software designers, some knowledge of hardware acceleration and, more importantly, of the characteristics of the computing model and their impact on acceleration is required. CUs capable of executing one work-group/thread block at a time are obtained by means of an HLS process, but specific directives (or pragma declarations) are required to customize the number of work items/threads per work-group/thread block, which in turn determines the size of every CU, as well as the number of CUs to be allocated in the FPGA fabric.

As with stand-alone HLS tools, OpenCL acceleration environments offer estimation tools for exploring the design space (basically area and performance) before going into the detailed design process, which is quite time-consuming. Estimates can be obtained for the latency, throughput, and resource utilization of each CU, among others.
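The kind of trade-off such an exploration weighs can be sketched with a toy model: a larger work-group size yields a larger CU, so fewer CUs fit in the fabric. Everything in the sketch is a placeholder assumption (a linear area model, a fixed cycle count per work-group regardless of its size, arbitrary area units, and a hypothetical function name); real figures would come from the vendor estimation tools.

```python
import math

def explore(total_items, area_budget, wg_sizes, area_per_item, area_fixed, cycles_per_group):
    """Rank (work-group size, CU count) options that fit an area budget.

    Placeholder model, not tool output: CU area grows linearly with the
    work-group size, and every work-group takes cycles_per_group cycles.
    """
    options = []
    for wg_size in wg_sizes:
        cu_area = area_fixed + area_per_item * wg_size
        num_cus = area_budget // cu_area
        if num_cus == 0:
            continue  # a single CU of this size does not fit
        groups = math.ceil(total_items / wg_size)
        rounds = math.ceil(groups / num_cus)
        options.append({"wg_size": wg_size, "cus": num_cus,
                        "area": num_cus * cu_area,
                        "cycles": rounds * cycles_per_group})
    return sorted(options, key=lambda option: option["cycles"])

for option in explore(total_items=4096, area_budget=1000, wg_sizes=[16, 64, 256],
                      area_per_item=2, area_fixed=40, cycles_per_group=100):
    print(option)
```

With these placeholder numbers the intermediate work-group size ranks first, since neither the smallest CUs (many rounds of small groups) nor the largest CU (only one fits) gives the lowest cycle estimate.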
