Date: Aug 17, 2021
The growing demand for data acceleration places ever higher demands on hardware platforms, and FPGAs, as programmable and customizable high-performance hardware, are playing an increasingly important role. In recent years, high-end FPGAs have adopted more and more hard IP to increase the data transfer bandwidth and memory bandwidth at the FPGA periphery. Inside the FPGA, however, although the programmable logic improves with each process node, data exchange between the periphery and the internal logic has not improved nearly as much, so moving data within the FPGA is increasingly becoming the bottleneck for data transmission.
To address this issue, Achronix includes an innovative two-dimensional network-on-chip (2D NoC) in its latest Speedster7t FPGA devices, which are built on TSMC's 7nm FinFET process. This 2D NoC acts as a highway network running on top of the FPGA programmable logic fabric, providing up to approximately 27 Tbps of bandwidth for data transfer between the FPGA's external high-speed interfaces and the internal programmable logic.
As one of the key innovations in the Speedster7t FPGA device, the 2D NoC offers several important benefits to FPGA designs, including:
Improving design performance by removing data transfer within the FPGA as a bottleneck.
Saving FPGA programmable logic resources and simplifying logic design, because the NoC replaces the traditional logic for high-speed data transfer and data bus management.
Adding routing resources to the FPGA, which reduces the risk of place-and-route congestion in designs with high resource utilization.
Enabling truly modular design and reducing the debugging workload for FPGA designers.
This article uses a specific FPGA design example to demonstrate the benefits of the NoC listed above. The main purpose of the design is to show how logic inside the FPGA can access off-chip memory. As shown in Figure 1, the design contains 8 read/write modules that need to access 8 GDDR6 channels. It therefore requires an 8x8 AXI interconnect module, as well as clock domain crossing logic to convert each GDDR6 user interface clock to the main logic clock. Apart from the 8 read/write modules, all of the logic in the red area of Figure 1 must also be implemented in FPGA programmable logic.
Figure 1 Traditional FPGA Implementation Architecture
For the AXI interconnect module, we use an open-source AXI4 interconnect from GitHub. In its original form it connects four AXI4 bus masters to eight AXI4 bus slaves; the source code can be downloaded from the link in Reference 2. We extended this code to connect 8 AXI4 bus masters to 8 AXI4 bus slaves and added the clock domain crossing logic.
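To make the role of the 8x8 interconnect concrete, the following is a minimal behavioral sketch in Python, not the RTL from Reference 2. It assumes a hypothetical address map in which each GDDR6 channel owns one contiguous region (`CHANNEL_SIZE`) and a simple round-robin arbiter per slave; the names `decode_slave`, `RoundRobinArbiter`, and `crossbar_cycle` are illustrative only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

NUM_MASTERS = 8          # the 8 read/write modules
NUM_SLAVES = 8           # the 8 GDDR6 channels
CHANNEL_SIZE = 1 << 30   # assumption: 1 GiB of address space per channel


def decode_slave(addr: int) -> int:
    """Map an address to a slave index using the assumed contiguous map."""
    slave = addr // CHANNEL_SIZE
    if not 0 <= slave < NUM_SLAVES:
        raise ValueError(f"address {addr:#x} is outside the mapped range")
    return slave


@dataclass
class RoundRobinArbiter:
    """Grants one requesting master per cycle and rotates priority afterwards."""
    num_masters: int = NUM_MASTERS
    last_grant: int = -1

    def grant(self, requests: Set[int]) -> Optional[int]:
        # Start searching just after the master that was granted last time.
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_grant + offset) % self.num_masters
            if candidate in requests:
                self.last_grant = candidate
                return candidate
        return None


def crossbar_cycle(pending: Dict[int, int],
                   arbiters: List[RoundRobinArbiter]) -> Dict[int, int]:
    """Resolve one cycle. `pending` maps master -> address.

    Returns a dict mapping each requested slave to the master it granted.
    """
    requests_per_slave: Dict[int, Set[int]] = {}
    for master, addr in pending.items():
        requests_per_slave.setdefault(decode_slave(addr), set()).add(master)
    return {slave: arbiters[slave].grant(masters)
            for slave, masters in requests_per_slave.items()}


if __name__ == "__main__":
    arbiters = [RoundRobinArbiter() for _ in range(NUM_SLAVES)]
    # Masters 0 and 1 contend for channel 0; master 2 targets channel 3.
    pending = {0: 0x0, 1: 0x100, 2: 3 * CHANNEL_SIZE}
    print(crossbar_cycle(pending, arbiters))
```

Even in this simplified form, every master needs an address decoder and every slave needs an arbiter, and in hardware each of the 64 master-to-slave paths also needs routing. That is the logic that occupies programmable resources in the traditional architecture.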
For comparison, we built a second design that also uses these 8 read/write modules to access the 8 GDDR6 channels. The difference is that this time the 8 read/write modules connect to the 2D NoC of Achronix's Speedster7t FPGA device and reach the 8 GDDR6 channels through it, as shown in Figure 2.
Figure 2 Speedster7t 1500 implementation architecture
First, let's do a comparison in terms of resources and performance, as shown in Figure 3.
Figure 3 Resource Usage and Performance Comparison
In terms of resource usage, the AXI interconnect design consumes far more resources than the 2D NoC design, because the AXI interconnect and the clock domain crossing logic must be built from programmable logic. Note also that this open-source AXI interconnect is a very basic implementation and does not support all the features the 2D NoC provides, such as address table mapping and priority configuration.
Most importantly, this AXI interconnect supports only blocking access, not non-blocking access. With blocking access, once a read or write request has been issued, the next request cannot be issued until the current operation completes. With non-blocking access, read or write requests can be issued back to back without waiting for the previous operation to complete. For GDDR6, blocking access significantly reduces read/write efficiency.
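The effect of blocking access can be illustrated with a small cycle-count model. This is a sketch under simplifying assumptions: every request has the same fixed latency, at most one request is issued per cycle, and the latency value used below is purely illustrative, not a measured GDDR6 figure. The function name `cycles_to_complete` is hypothetical.

```python
def cycles_to_complete(num_requests: int, latency: int, max_outstanding: int) -> int:
    """Return the cycle at which the last of `num_requests` completes.

    At most one request is issued per cycle and each request completes
    `latency` cycles after it is issued. `max_outstanding` limits how many
    requests may be in flight; a value of 1 models blocking access.
    """
    in_flight = []   # completion cycle of each outstanding request
    issued = 0
    completed = 0
    cycle = 0
    while completed < num_requests:
        # Retire every request whose response has arrived by this cycle.
        completed += sum(1 for t in in_flight if t <= cycle)
        in_flight = [t for t in in_flight if t > cycle]
        # Issue a new request if one remains and the limit allows it.
        if issued < num_requests and len(in_flight) < max_outstanding:
            in_flight.append(cycle + latency)
            issued += 1
        if completed < num_requests:
            cycle += 1
    return cycle


if __name__ == "__main__":
    n, latency = 100, 20   # illustrative values only
    blocking = cycles_to_complete(n, latency, max_outstanding=1)
    non_blocking = cycles_to_complete(n, latency, max_outstanding=32)
    print(f"blocking:     {blocking} cycles, {n / blocking:.0%} of peak issue rate")
    print(f"non-blocking: {non_blocking} cycles, {n / non_blocking:.0%} of peak issue rate")
```

In this model the blocking case pays the full latency for every request, while the non-blocking case pays it only once and then overlaps the remaining requests, which is why allowing multiple outstanding requests matters for memory efficiency.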
Implementing a complete 2D NoC function in FPGA programmable logic, with 64 access points, a 128-bit data width, and a 400 MHz clock, would require roughly 850k LE, or about 56% of the programmable resources of the Speedster7t 1500 FPGA device. The built-in 2D NoC, by contrast, provides 80 access points, a 256-bit data width, and a 2 GHz clock, and consumes no FPGA programmable logic.
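The figures in this comparison can be checked with some simple arithmetic. The sketch below uses only the numbers quoted above. The per-access-point figure is the raw product of data width and clock rate; it is an assumption of this sketch that this is a fair basis for comparison, and it is not the aggregate NoC bandwidth, which depends on the NoC topology.

```python
SOFT_NOC_LE = 850_000    # LEs needed for a soft NoC (from the text)
SOFT_NOC_SHARE = 0.56    # fraction of the device that represents (from the text)

# Total device capacity implied by the two figures above.
implied_device_le = SOFT_NOC_LE / SOFT_NOC_SHARE


def port_bandwidth_gbps(width_bits: int, clock_mhz: int) -> float:
    """Raw data rate of one access point: width x clock, in Gbit/s."""
    return width_bits * clock_mhz / 1000


soft_port = port_bandwidth_gbps(128, 400)     # soft NoC access point
hard_port = port_bandwidth_gbps(256, 2000)    # built-in 2D NoC access point

if __name__ == "__main__":
    print(f"implied device size: {implied_device_le / 1e6:.2f} M LE")
    print(f"soft NoC access point:     {soft_port:.1f} Gbps")
    print(f"built-in NoC access point: {hard_port:.1f} Gbps")
    print(f"ratio: {hard_port / soft_port:.1f}x")
```

Under these assumptions, each built-in access point carries about ten times the raw data rate of a soft one, in addition to the built-in NoC offering more access points and costing no programmable logic.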
In terms of performance, the design using the AXI interconnect can run at only up to 157 MHz, while the design using the NoC can run at up to 500 MHz. The place-and-route results help explain the difference. Figure 4 shows the place-and-route view of the design using the AXI interconnect.
Figure 4 Place-and-route view of the design using the AXI interconnect
As the figure shows, the GDDR6 controllers are located on both sides of the device (the colored, highlighted areas in the figure), so the tool places the AXI interconnect roughly in the middle of the device, close to neither the left nor the right side. The resulting long routes limit the achievable clock frequency. Adding pipeline registers can improve performance, but this consumes a large number of registers and adds considerable latency to GDDR6 accesses.
The place-and-route view in Figure 5, which uses the 2D NoC, shows a clear contrast. First, because the functions of the AXI interconnect and the clock domain crossing modules are handled by the 2D NoC, a large amount of programmable logic is saved. In addition, because the 2D NoC spans the entire device with a total of 80 access points, the tool can place the 8 read/write modules anywhere on the device without affecting the performance of the design.
Figure 5 Place-and-route view of the design using the 2D NoC
Across the whole design flow, using the 2D NoC greatly simplifies the design, improves performance, and saves a large amount of resources. FPGA design engineers can devote more effort to their core modules or algorithm modules and leave bus transfers, arbitration of external interface access, and asynchronous clock domain conversion at the interfaces to the 2D NoC.