Date: Jun 22, 2020
Instead of engineering algorithms by hand, computers can now learn composable systems automatically from large amounts of data, which has led to major breakthroughs in key areas such as computer vision, speech recognition, and natural language processing. Deep learning is the technology most commonly used in these fields, and it has attracted great attention from industry. However, deep learning models require extremely large amounts of data and computing power, and only better hardware acceleration can keep up with the continued growth of data and models. Existing solutions use clusters of graphics processing units (GPUs) for general-purpose computing (GPGPU), but field programmable gate arrays (FPGAs) offer another solution worth exploring. Increasingly popular FPGA design tools are making FPGAs more compatible with the high-level software commonly used in deep learning, and therefore more accessible to the people who build and deploy models.
The FPGA architecture is flexible, allowing researchers to explore model optimizations beyond what fixed architectures such as GPUs permit. FPGAs also deliver better performance per watt, which is crucial for large-scale server deployments and for resource-constrained embedded applications. This article examines deep learning and FPGAs from the perspective of hardware acceleration, points out the trends and innovations that make these technologies a good match, and aims to stimulate discussion on how FPGAs can help the development of deep learning.
Machine learning has a profound impact on daily life. Whether you are clicking on a personalized recommendation, using voice input on a smartphone, or taking photos with face recognition, some form of artificial intelligence is involved. This new wave of artificial intelligence is accompanied by a change in how algorithms are designed. In the past, data-driven machine learning relied mostly on domain-specific expertise to hand-craft the features to be learned. Now that computers can learn composable feature extraction systems automatically from large amounts of data, key areas such as computer vision, speech recognition, and natural language processing have achieved major performance breakthroughs. Research on these data-driven techniques is called deep learning, and it is receiving attention from two important groups in the technical community: researchers who want to use and train these models to achieve very high performance across tasks, and application scientists who want to deploy these models in new real-world applications. Both groups face the same limitation: hardware acceleration must still improve to keep up with the growing scale of data and algorithms.
For deep learning, hardware acceleration currently relies mainly on clusters of graphics processing units (GPUs) used for general-purpose computing (GPGPU). Compared with a traditional general-purpose processor (GPP), a GPU offers several orders of magnitude more core computing power and is better suited to parallel computation. In particular, NVIDIA CUDA, the mainstream GPGPU programming platform, is used by all major deep learning tools for GPU acceleration. More recently, the open parallel programming standard OpenCL has attracted attention as an alternative for programming heterogeneous hardware, and enthusiasm for it is growing. Although OpenCL trails CUDA slightly in the deep learning field, it has two distinctive features. First, OpenCL is an open, royalty-free standard, in contrast to CUDA's single-vendor approach. Second, OpenCL supports a range of hardware, including GPUs, GPPs, field programmable gate arrays (FPGAs), and digital signal processors (DSPs).
OpenCL's immediate support for FPGAs is particularly important, because FPGAs are a strong competitor to GPUs for algorithm acceleration. FPGAs differ from GPUs in that their hardware configuration is flexible, and when running key deep learning subroutines (such as sliding-window calculations) they usually deliver better performance per watt than GPUs. However, programming an FPGA requires specific hardware knowledge that many researchers and application scientists do not have. Because of this, FPGAs are often seen as an architecture for specialists. Recently, FPGA tools have begun to adopt software-level programming models, including OpenCL, which makes them increasingly attractive to users trained in mainstream software development.
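The sliding-window calculation mentioned above is the core operation of a convolutional layer. The following is a minimal sketch in plain Python, not taken from the article; the function name `sliding_window_2d` and the sample data are illustrative assumptions. It shows that every output value is an independent weighted sum over a small window, which is the kind of regular, data-reusing workload that maps well onto custom hardware.

```python
def sliding_window_2d(image, kernel):
    """Slide `kernel` over `image` and return the weighted sum at each position.

    Hypothetical helper for illustration: only positions where the kernel
    fits entirely inside the image are computed (no padding, stride 1).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # Multiply-accumulate over the window anchored at (r, c).
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = sliding_window_2d(image, kernel)
print(result)  # [[6, 8], [12, 14]]
```

Neighbouring windows overlap, so most input values are read several times. Hardware that keeps those values in small on-chip buffers can avoid re-fetching them from external memory.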
Researchers evaluating design tools usually choose them based on three criteria: user-friendly software development tools, flexible and scalable ways to design models, and fast computation that shortens the training time of large models. As highly abstract design tools make FPGAs easier to program, as their reconfigurability makes customized architectures possible, and as their high degree of parallelism speeds up execution, FPGAs will bring real advantages to deep learning researchers.
For application scientists, the tool-level options are similar, but the focus of hardware selection is maximizing performance per watt in order to reduce the cost of large-scale operation. FPGAs can therefore benefit deep learning application scientists through their strong performance per watt and the ability to customize the architecture for a specific application.
Because FPGAs can meet the needs of both audiences, they are a logical choice. This article examines the current state of deep learning on FPGAs and the technological developments that are closing the gap between the two. It has three main purposes. First, it points out that there is an opportunity to explore a new hardware acceleration platform for deep learning, and that FPGAs are an ideal candidate. Second, it outlines the current state of FPGA support for deep learning and points out potential limitations. Finally, it makes key recommendations on the future direction of FPGA hardware acceleration to help solve the problems deep learning will face.
Traditionally, evaluating a hardware platform for acceleration means weighing flexibility against performance. On the one hand, general-purpose processors (GPPs) provide a high degree of flexibility and ease of use, but relatively low performance. These platforms are more accessible, can be produced cheaply, and are suitable for many uses and for reuse. On the other hand, application-specific integrated circuits (ASICs) provide high performance, but at the cost of flexibility. These circuits are dedicated to a single application and are expensive and time-consuming to produce.
FPGAs are a compromise between these two extremes. They belong to the more general class of programmable logic devices (PLDs) and are, in simple terms, reconfigurable integrated circuits. An FPGA can therefore offer the performance advantages of an integrated circuit together with the reconfigurable flexibility of a GPP. At the lowest level, FPGAs implement sequential logic with flip-flops (FFs) and combinational logic with lookup tables (LUTs). Modern FPGAs also contain hardened components for common functions, such as full processor cores, communication cores, arithmetic cores, and block RAM (BRAM). In addition, FPGAs are trending toward system-on-chip (SoC) designs, in which an ARM coprocessor and the FPGA fabric typically sit on the same chip. The FPGA market is dominated by Xilinx and Altera, which together hold about 85% of the market. FPGAs are also rapidly replacing ASICs and application-specific standard products (ASSPs) for fixed-function logic. The FPGA market was projected to reach US$10 billion by 2016.
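To make the roles of lookup tables and flip-flops concrete, here is a small behavioural sketch in Python. It is an illustration under our own assumptions, not vendor code: the class names `LUT` and `FlipFlop` and the function `running_parity` are hypothetical. A LUT is modelled as a truth table indexed by its input bits (combinational logic), and a flip-flop as one bit of state updated once per clock tick (sequential logic).

```python
class LUT:
    """Behavioural model of a k-input lookup table (hypothetical helper).

    The truth table has 2**k entries; "reconfiguring" the LUT simply
    means loading a different table.
    """

    def __init__(self, truth_table):
        self.table = list(truth_table)

    def __call__(self, *bits):
        # Pack the input bits into an index, first argument as the MSB.
        index = 0
        for b in bits:
            index = (index << 1) | (b & 1)
        return self.table[index]


class FlipFlop:
    """Behavioural model of a D flip-flop: stores one bit per clock tick."""

    def __init__(self):
        self.q = 0

    def clock(self, d):
        self.q = d & 1
        return self.q


def running_parity(bits):
    """Combine one LUT (configured as XOR) and one flip-flop.

    On each clock tick the flip-flop stores XOR(previous state, input),
    so its output is the parity of all bits seen so far.
    """
    xor_lut = LUT([0, 1, 1, 0])  # truth table for inputs 00, 01, 10, 11
    ff = FlipFlop()
    return [ff.clock(xor_lut(ff.q, b)) for b in bits]


print(running_parity([1, 0, 1, 1]))  # [1, 1, 0, 1]
```

Swapping the table for `[0, 0, 0, 1]` turns the same LUT into an AND gate, which is the essence of reconfigurability: the structure stays fixed while the stored configuration changes.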
For deep learning, FPGAs offer significantly more acceleration potential than traditional GPPs. Software running on a GPP relies on the traditional von Neumann architecture, in which instructions and data are stored in external memory and fetched when needed. This motivated the introduction of caches, which greatly reduce the number of expensive external memory operations. The bottleneck of this architecture is the communication between processor and memory, which seriously limits GPP performance, especially for the memory-bound techniques that deep learning frequently relies on.
In comparison, the programmable logic elements of an FPGA can implement the data and control paths of common logic functions directly, without relying on the von Neumann architecture. FPGAs can also exploit distributed on-chip memory and deep pipeline parallelism, which fits naturally with feed-forward deep learning methods. Modern FPGAs also support partial dynamic reconfiguration: while one part of the FPGA is being reconfigured, the rest can remain in use. This matters for large-scale deep learning models, because individual layers can be reconfigured on the FPGA without disturbing ongoing computation in other layers. This makes it possible to run models that do not fit on a single FPGA, while avoiding costly global memory reads by keeping intermediate results in local memory.
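The fit between pipeline parallelism and feed-forward models can be illustrated with a cycle-by-cycle simulation. This is a simplified sketch under stated assumptions, not a model of any real device: `run_pipeline` is a hypothetical helper, each stage stands in for one layer, every stage takes exactly one cycle, and `None` is reserved to mark an empty pipeline register.

```python
def run_pipeline(stages, inputs):
    """Simulate a linear pipeline in which all stages work in the same cycle.

    stages: list of one-argument functions (stand-ins for layers).
    inputs: values to feed in, one per cycle (must not be None).
    Returns (outputs, cycles).
    """
    n = len(stages)
    regs = [None] * n          # regs[i] holds the latest output of stage i
    pending = list(inputs)
    outputs = []
    cycles = 0
    # Run while there is input left or a value is still in flight.
    while pending or any(r is not None for r in regs[:-1]):
        new_regs = [None] * n
        # Every stage consumes the register of the stage before it.
        for i in range(1, n):
            if regs[i - 1] is not None:
                new_regs[i] = stages[i](regs[i - 1])
        # The first stage accepts a new input each cycle, if available.
        if pending:
            new_regs[0] = stages[0](pending.pop(0))
        regs = new_regs
        cycles += 1
        if regs[-1] is not None:
            outputs.append(regs[-1])
    return outputs, cycles


layers = [lambda x: x * 2, lambda x: x + 1, lambda x: x * x]
outputs, cycles = run_pipeline(layers, [1, 2, 3, 4])
print(outputs, cycles)  # [9, 25, 49, 81] 6
```

With 3 stages and 4 inputs the pipeline finishes in 3 + 4 - 1 = 6 cycles, whereas running the stages one after another for each input would take 12 stage steps. Once the pipeline is full, one result emerges per cycle.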
GPUs and other fixed architectures follow a software execution model and are built around autonomous computing units that perform tasks in parallel. The goal when developing deep learning techniques for GPUs is therefore to adapt the algorithm to this model, so that computation is done in parallel and data interdependencies are properly handled. In contrast, FPGA architectures are customized for the application. When developing deep learning techniques for FPGAs, less emphasis is placed on adapting algorithms to a fixed computing structure, which leaves more freedom to explore optimizations at the algorithm level. Techniques that require many complex low-level hardware control operations are difficult to implement in high-level software languages, but are particularly attractive for FPGA implementation. However, this flexibility comes at the cost of long compilation (place and route) times, which is often a problem for researchers who need to iterate quickly through the design cycle.
Beyond compilation time, attracting researchers and application scientists who prefer high-level programming languages to FPGA development is particularly difficult. Fluency in one software language often makes it easy to learn another, but the same skills do not transfer to hardware description languages. The most commonly used languages for FPGAs are Verilog and VHDL, both of which are hardware description languages (HDLs). The main difference from traditional software languages is that an HDL describes hardware, whereas a software language such as C describes sequential instructions and does not require knowledge of how they are executed at the hardware level. Describing hardware effectively requires expertise in digital design and circuits. Although some low-level implementation decisions can be left to automatic synthesis tools, the resulting designs are often inefficient. Researchers and application scientists therefore tend to prefer software design, which is very mature and offers a wealth of abstractions and convenience libraries that improve programmer productivity. These trends have pushed the FPGA field toward highly abstract design tools.
The future of deep learning, whether on FPGAs or in general, depends mainly on scalability. For these technologies to solve future problems, they must scale to support data sizes and architectures that continue to grow rapidly. FPGA technology is adapting to this trend: hardware is moving toward larger memories, smaller feature sizes, and better interconnects to accommodate multi-FPGA configurations. Intel's acquisition of Altera and IBM's partnership with Xilinx are both signs of change in the FPGA field. FPGAs may soon be integrated into both consumer and data center applications. In addition, algorithm design tools are likely to move toward further abstraction and a more software-like experience, attracting users from a wider range of technical backgrounds.
Compared with GPUs and GPPs, FPGAs are an attractive alternative for meeting the hardware needs of deep learning. With their pipeline parallelism and energy efficiency, FPGAs offer advantages in general deep learning applications that GPUs and GPPs do not. At the same time, algorithm design tools are maturing, and it is now possible to integrate FPGAs into commonly used deep learning frameworks. Looking ahead, FPGAs can adapt effectively to the trends in deep learning and provide the architectural freedom needed to realize related applications and research.