As introduced in Section 6.1, HLS tools can map algorithms into hardware from descriptions that are not time explicit; in other words, descriptions that contain no information about the transactions between registers in each clock cycle. These tools schedule the operations onto a given set of operators, which may be reused over time for different purposes.
The algorithm specification defines relationships between variables containing data and operations, so data are transformed as the algorithm executes. The tools identify the operators involved and the data dependencies, so that the modules in charge of these operations can be reused at different times during execution. They also generate the multiplexing schemes and the associated control required to select the appropriate data path(s) at every point in time during execution. As a result, they provide
• A data flow graph, which contains the registers required to hold data, the multiplexing schemes required to feed operators with these data, and the operators themselves
• A control state machine, which controls the data flow graph in order for the required operations to be performed in the required sequences
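The scheduling step behind these two outputs can be sketched in a few lines of C++. This is only an illustration under simplifying assumptions, not the algorithm used by any particular HLS tool: every operation is taken to last one cycle, any operator is assumed able to execute any operation, and the names `Op`, `schedule`, `latency`, and `example_graph` are hypothetical. Each control step returned by `schedule` corresponds to one state of the control state machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One node of a hypothetical data flow graph: an operation plus the
// indices of the operations whose results it consumes.
struct Op {
    std::vector<int> deps;
};

// Assigns every operation to a control step (one state of the control
// state machine), using at most `num_operators` operators per step.
// Assumptions: every operation takes one cycle, any operator can execute
// any operation, and `ops` is in topological order (dependencies first).
std::vector<int> schedule(const std::vector<Op>& ops, int num_operators) {
    assert(num_operators > 0);
    std::vector<int> step(ops.size(), -1);
    std::vector<int> busy;  // busy[s] = operators already claimed in step s
    for (std::size_t i = 0; i < ops.size(); ++i) {
        int s = 0;
        for (int d : ops[i].deps) {
            assert(d >= 0 && static_cast<std::size_t>(d) < i);
            s = std::max(s, step[d] + 1);  // operands must be ready first
        }
        // Slide to the first step that still has a free operator.
        while (static_cast<std::size_t>(s) < busy.size() &&
               busy[s] == num_operators) {
            ++s;
        }
        if (static_cast<std::size_t>(s) >= busy.size()) busy.resize(s + 1, 0);
        ++busy[s];
        step[i] = s;
    }
    return step;
}

// Total number of control steps needed by a schedule.
int latency(const std::vector<int>& step) {
    return step.empty() ? 0 : *std::max_element(step.begin(), step.end()) + 1;
}

// Example graph: four independent products followed by a two-level
// adder tree, as in (a*b + c*d) + (e*f + g*h).
std::vector<Op> example_graph() {
    return {Op{{}}, Op{{}}, Op{{}}, Op{{}},
            Op{{0, 1}}, Op{{2, 3}}, Op{{4, 5}}};
}
```

Calling `latency(schedule(example_graph(), k))` for different values of `k` exposes the area/latency trade-off: the fewer operators are made available, the more control steps the same graph needs.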
Reuse of operators can be maximized to reduce logic resource utilization, usually at the expense of longer latency. Alternatively, if execution speed has higher priority than size, the circuit may be “widened” by instantiating operators multiple times so that parallelism can be exploited. In this case, pipelining, loop unrolling, parallel memory access (memory reshaping), and I/O adaptation are the main techniques used to speed up algorithm execution. These techniques are briefly described here:
• Pipelined structures achieve high execution speeds at the expense of a high number of registers and long latencies. In a well-designed pipelined circuit, every stage performs operations and holds data in every cycle, with the data in different stages belonging to different execution cycles, for as long as signals are propagating through the pipeline. Pipelined structures are therefore incompatible with resource reuse, since sharing an operator among stages would produce structural hazards.
• Loop unrolling is a technique that uses several functional instances for the inner loops of the code so that all iterations within the loop are executed in parallel. For this to be feasible, the loop must have a fixed, predefined number of iterations (i.e., the count depends on a constant, not on a variable). If loops are nested, more than one loop may be unrolled, but the risk of excessive resource utilization grows. In general, this technique requires high resource utilization but few additional registers, and it should be complemented with memory reshaping and I/O adaptation, because all resources must be fed with the appropriate data simultaneously and at high speed; otherwise, no performance improvement would be achieved.
• Fast access to memories by the functional resources is crucial to achieve high computing bandwidth. To this end, memories (in particular those storing vectors or arrays) may be set to use wide parallel buses, capable of providing data to the possibly replicated computing resources at the required speeds. Since memory contents are the same, memory utilization inside the FPGA remains unchanged and the only overhead is that caused by parallel wiring. Because only the shape of the memory changes, this technique is called memory reshaping.
• Data from external elements must be fed to the blocks designed through HLS techniques fast enough for all required data to be available at the right times. Similarly, these blocks must be capable of delivering output values to their destinations at the right times. High data throughput may be achieved by using DMA engines on dedicated ports. They may be embedded into the system under design so that the control flow part can produce the proper transactions at the right times.
Apart from traditional HLS tools, which are being integrated into design suites, there are also tools aimed at embedding (in a somewhat automated way) hardware accelerators within SoPC systems. They target a restricted set of devices or families and are conceived to support software designers with little expertise in hardware development.
A special case of this approach is the development of hardware accelerators from programming languages that allow parallelism to be described explicitly. OpenCL is becoming a widely used standard for such specifications because it suits a variety of devices, such as GPGPUs, multicore systems, or SoPCs. It also supports heterogeneous computing, in the sense that different portions of the code may run on different computing platforms, as discussed in Section 3.1.1.1. This is very convenient for the newest FPGA families, which integrate several different hard processing fabrics in the same device. Because of its expected increased significance, the issues related to the design of these particular accelerators are discussed in Section 6.5.
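Returning to the speed-up techniques listed earlier, the following C++ sketch illustrates loop unrolling combined with memory reshaping, together with rough cycle estimates for the rolled, unrolled, and pipelined cases. It is ordinary software written to mirror the hardware structure, not the output of an HLS tool. The trip count `N`, the replication factor `FACTOR`, and all function names are assumptions made for the example, and the cycle formulas assume an idealized circuit in which each operator completes one iteration per cycle and a pipeline accepts one new input per cycle.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t N = 16;      // fixed trip count, known at compile time
constexpr std::size_t FACTOR = 4;  // hypothetical number of replicated multipliers
static_assert(N % FACTOR == 0, "sketch assumes FACTOR divides N");

using Vec = std::array<int, N>;
using Word = std::array<int, FACTOR>;          // one wide memory word
using WideMem = std::array<Word, N / FACTOR>;  // same contents, wider bus

// Rolled form: a single multiplier is reused for all N iterations.
int dot_rolled(const Vec& a, const Vec& b) {
    int acc = 0;
    for (std::size_t i = 0; i < N; ++i) acc += a[i] * b[i];
    return acc;
}

// Memory reshaping: pack FACTOR consecutive elements into each word so
// that one read can feed every replicated operator at once.
WideMem reshape(const Vec& v) {
    WideMem m{};
    for (std::size_t i = 0; i < N; ++i) m[i / FACTOR][i % FACTOR] = v[i];
    return m;
}

// Unrolled form: the inner loop over k stands for FACTOR multipliers
// working side by side, each with its own accumulator (lane).
int dot_unrolled(const WideMem& a, const WideMem& b) {
    Word lane{};
    for (std::size_t w = 0; w < N / FACTOR; ++w) {
        for (std::size_t k = 0; k < FACTOR; ++k) {
            lane[k] += a[w][k] * b[w][k];
        }
    }
    int acc = 0;
    for (int partial : lane) acc += partial;  // final reduction of the lanes
    return acc;
}

// Idealized cycle estimates (assumptions, not measurements).
std::size_t cycles_rolled(std::size_t n) { return n; }

std::size_t cycles_unrolled(std::size_t n, std::size_t factor) {
    assert(factor > 0);
    return (n + factor - 1) / factor;  // ceil(n / factor)
}

std::size_t cycles_pipelined(std::size_t n, std::size_t depth) {
    assert(n > 0 && depth > 0);
    return depth + n - 1;  // fill the pipeline once, then one result per cycle
}
```

Without the reshaped memory, the four lanes of `dot_unrolled` would have to wait for four sequential reads per step, which is why unrolling alone brings no improvement when the data cannot be delivered in parallel.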