
DSP-BASED DESIGN FLOWS


Digital signal processing includes compression, decompression, modulation,  error correction, filtering, and otherwise manipulating audio (voice, music, etc.),  video, image, and similar data for such applications as telecommunications,  radar, and image processing (including medical imaging). In many cases, the  data to be processed starts out as a signal in the real (analog) world. This analog  signal is periodically sampled, with each sample being converted into a digital  equivalent by means of an analog-to-digital (A/D) converter ( Figure 6-9 ).

Figure: What is DSP

These samples are then processed in the digital domain. In many cases, the  processed digital samples are subsequently converted into an analog equivalent  by means of a digital-to-analog (D/A) converter.   

DSP occurs all over the place: in cell phones and telephone systems; CD, DVD, and MP3 players; cable set-top boxes; wireless and medical equipment; electronic vision systems; … the list goes on. This means that the overall DSP market is huge.

Alternative DSP Implementations

As usual, nothing is simple because DSP tasks can be implemented in a number of different ways:

● A general-purpose microprocessor (µP): This may also be referred to as a central processing unit (CPU) or a microprocessor unit (MPU). The processor can perform DSP by running an appropriate DSP algorithm.  

● A digital signal processor (DSP): This is a special form of microprocessor  chip (or core, as discussed below) that has been designed to perform DSP  tasks much faster and more efficiently than can be achieved by means of a general-purpose microprocessor.  

● Dedicated ASIC hardware : For the purposes of these discussions, we will  assume that this refers to a custom hardware implementation that executes  the DSP task. However, we should also note that the DSP task could be  implemented in software by including a microprocessor or DSP core on the  ASIC.  

● Dedicated FPGA hardware : For the purposes of these discussions, we will  assume that this refers to a custom hardware implementation that executes  the DSP task. Once again, however, we should also note that the DSP  functionality could be implemented in software by means of an embedded  microprocessor core on the FPGA.

System-level Evaluation and Algorithmic Verification

Irrespective of the final implementation technology (µP, DSP, ASIC, FPGA), if  one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment (we consider this in more detail later in  this chapter).   

Although this book attempts to avoid focusing on companies and products as far as possible, it is incumbent on us to mention that, at the time of this writing, the de facto industry standard for DSP algorithmic verification is MATLAB® from The MathWorks (www.mathworks.com).

For the purposes of these discussions, therefore, we shall refer to MATLAB as necessary. However, it should be noted that there are a number of other very powerful tools and environments available to DSP developers. For example, Simulink® from The MathWorks has a certain following; the Signal Processing Worksystem (SPW) environment from CoWare (www.coware.com) is very popular, especially in telecom markets; and tools from Elanix (www.elanix.com) also find favor with many designers.

Software Running on a DSP Core

Let’s assume that our new DSP algorithm is to be implemented using a microprocessor or DSP chip (or core). In this case, the flow might be as shown in   Figure 6-10 .  

Figure: A simple design flow for a software DSP realization

● The process commences with someone having an idea for a new algorithm or suite of algorithms. This new concept typically undergoes verification using tools such as MATLAB as discussed above. In some cases, one might leap directly from the concept into handcrafting C/C++ (or assembly language).

● Once the algorithms have been verified, they have to be regenerated in C/C++ or in assembly language. MATLAB can be used to generate C/C++ tuned for the target DSP core automatically, but in some cases, design teams may prefer to perform this translation step by hand because they feel that they can achieve a more optimal representation this way. As yet another alternative, one might first auto-generate C/C++ code from the algorithmic verification environment, analyze and profile this code to determine any performance bottlenecks, and then recode the most critical portions by hand.

● Once you have your C/C++ (or assembly language) representation, you compile it (or assemble it) into the machine code that will ultimately be executed by the microprocessor or DSP core.

This type of implementation is very flexible because any desired changes can  be addressed relatively quickly and easily by simply modifying and recompiling the source code. However, this also results in the slowest performance for  the DSP algorithm because microprocessor and DSP chips are both classed as Turing machines. This means that their primary role in life is to process  instructions, so both of these devices operate as follows:  

● Fetch an instruction.  

● Decode the instruction.  

● Fetch a piece of data.  

● Perform an operation on the data.  

● Store the result somewhere.  

● …

● Fetch another instruction and start all over again.
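The instruction cycle listed above can be sketched as a toy interpreter. This is only an illustration, not a model of any real processor: the three-field instruction format, the opcode names (LOAD, MUL, ADD, STORE), and the register names are all hypothetical. It evaluates a sum of four products and counts how many instructions have to pass through the loop to get there.

```python
# Toy fetch-decode-execute loop (hypothetical instruction set, for illustration only).

def run(program, memory):
    """Execute `program` one instruction at a time; return (memory, instruction count)."""
    regs = {"r0": 0, "r1": 0, "acc": 0}
    pc = 0          # program counter
    executed = 0
    while pc < len(program):
        instr = program[pc]            # fetch an instruction
        op, dst, src = instr           # decode the instruction
        if op == "LOAD":               # fetch a piece of data
            regs[dst] = memory[src]
        elif op == "MUL":              # perform an operation on the data
            regs[dst] = regs[dst] * regs[src]
        elif op == "ADD":
            regs[dst] = regs[dst] + regs[src]
        elif op == "STORE":            # store the result somewhere
            memory[dst] = regs[src]
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1                        # fetch another instruction and start over
        executed += 1
    return memory, executed


def sum_of_products_program(pairs):
    """Build a program computing Y = sum(a * b) over the named operand pairs."""
    program = []
    for a, b in pairs:
        program += [("LOAD", "r0", a), ("LOAD", "r1", b),
                    ("MUL", "r0", "r1"), ("ADD", "acc", "r0")]
    program.append(("STORE", "Y", "acc"))
    return program


memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8}
program = sum_of_products_program([("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")])
memory, executed = run(program, memory)
print(memory["Y"], executed)
```

Even in this simplified model, four multiplications and four additions cost seventeen trips around the loop, which is the overhead that dedicated hardware avoids.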

Dedicated DSP Hardware

There are myriad ways in which one might implement a DSP algorithm in an  ASIC or FPGA—the latter option being the focus of this chapter, of course.  But before we hurl ourselves into the mire, let’s first consider how different  architectures can affect the speed and area (in terms of silicon real estate) of  the implementation.   

DSP algorithms typically require huge numbers of multiplications and  additions. As a really simple example, let’s assume that we have a new DSP  algorithm that contains an expression something like the following:

Y = (A * B) + (C * D) + (E * F) + (G * H);

As usual, this is a generic syntax that does not favor any particular HDL and  is used only for the purposes of these discussions. Of course, this would be a  minuscule element in a horrendously complex algorithm, but DSP algorithms  tend to contain a lot of this type of thing.   

The point is that we can exploit the parallelism inherent in hardware to  perform DSP functions much more quickly than can be achieved by means of  software running on a DSP core. For example, suppose that all of the multiplications were performed in parallel (simultaneously) followed by two stages of  additions ( Figure 6-11 ).   

Figure: A parallel implementation of the function

Figure: An in-between implementation of the function

Remembering that multipliers are relatively large and complex and that  adders are sort of large, this implementation will be very fast, but will consume a correspondingly large amount of chip resources. 
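As a rough behavioral sketch of the fully parallel arrangement (the function name and the resource bookkeeping are assumptions made for illustration, and Python is standing in for what would really be hardware), each line below corresponds to operators that exist as separate blocks and work at the same time:

```python
def parallel_sum_of_products(a, b, c, d, e, f, g, h):
    """Model of a fully parallel datapath: every operator is its own hardware block."""
    # Stage 1: four multipliers operate simultaneously.
    p0, p1, p2, p3 = a * b, c * d, e * f, g * h
    # Stage 2: two adders combine the products pairwise.
    s0, s1 = p0 + p1, p2 + p3
    # Stage 3: a final adder produces the result.
    y = s0 + s1
    resources = {"multipliers": 4, "adders": 3}
    stages = 3  # operator levels between the inputs and the output
    return y, resources, stages


y, resources, stages = parallel_sum_of_products(1, 2, 3, 4, 5, 6, 7, 8)
print(y, resources, stages)
```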

As an alternative, we might employ resource sharing (sharing some of the  multipliers and adders between multiple operations) and opt for a solution that  is a mixture of parallel and serial ( Figure 6-12 ).

This solution requires the addition of four 2:1 multiplexers and a register (remember that each of these will be the same multibit width as their respective signal paths). However, multiplexers and registers consume much less area than the two multipliers and adder that are no longer required as compared to our initial solution.

On the downside, this approach is slower, because we must first perform the  (A * B) and (C * D) multiplications, add the results together, add this total to  the existing contents of the register (which will have been initialized to contain  zero), and store the result in the register. Next, we must perform the (E * F) and  (G * H) multiplications, add these results together, add this total to the existing  contents of the register (which currently contains the results from the first set of  multiplications and additions), and store this result in the register.   
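The two-pass sequence just described can be sketched behaviorally as follows. This is a minimal model under stated assumptions: the function name and the resource dictionary are hypothetical, and each loop iteration stands for one pass through the shared hardware.

```python
def hybrid_sum_of_products(a, b, c, d, e, f, g, h):
    """Model of the mixed parallel/serial datapath with shared multipliers."""
    register = 0  # accumulator register, initialized to zero
    # On each pass the four 2:1 multiplexers select which operands
    # reach the two shared multipliers.
    passes = [((a, b), (c, d)), ((e, f), (g, h))]
    for (x0, y0), (x1, y1) in passes:
        p0 = x0 * y0                    # multiplier 1
        p1 = x1 * y1                    # multiplier 2
        partial = p0 + p1               # adder 1: sum of the two products
        register = register + partial   # adder 2: accumulate into the register
    resources = {"multipliers": 2, "adders": 2, "multiplexers_2to1": 4, "registers": 1}
    return register, resources, len(passes)


y, resources, num_passes = hybrid_sum_of_products(1, 2, 3, 4, 5, 6, 7, 8)
print(y, resources, num_passes)
```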

As yet another alternative, we might decide to use a fully serial solution  ( Figure 6-13 ).   

Figure: A serial implementation of the function

This latter implementation is very efficient in terms of area because it requires only a single multiplier and a single adder. This is the slowest implementation, however, because we must first perform the (A * B) multiplication, add the result to the existing contents of the register (which will have been initialized to contain zero), and store the total in the register. Next, we must perform the (C * D) multiplication, add this result to the existing contents of the register, and store this new total in the register. And so forth for the remaining multiplication operations. (Note that when we say “this is the slowest implementation,” we are referring to these hardware solutions, but even the slowest hardware implementation remains much, much faster than a software equivalent running on a microprocessor or DSP.)
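A behavioral sketch of the fully serial arrangement might look like the following (again, the names are hypothetical and the loop merely stands in for successive passes through one multiplier and one adder):

```python
def serial_sum_of_products(operand_pairs):
    """Model of the serial datapath: one multiplier, one adder, one register."""
    register = 0   # accumulator register, initialized to zero
    steps = 0
    for x, y in operand_pairs:
        product = x * y                 # the single shared multiplier
        register = register + product   # the single adder feeds the register
        steps += 1
    return register, steps


y, steps = serial_sum_of_products([(1, 2), (3, 4), (5, 6), (7, 8)])
print(y, steps)
```

The result is the same as for the other arrangements; what changes is that four passes are needed, in exchange for the smallest amount of arithmetic hardware.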

DSP-related Embedded FPGA Resources  

As previously discussed, some functions like multipliers are inherently slow  if they are implemented by connecting a large number of programmable logic blocks together inside an FPGA. Since many applications require these functions, many FPGAs incorporate special hard-wired multiplier blocks. (These  are typically located in close proximity to embedded RAM blocks because  these functions are often used in conjunction with each other.)   

Similarly, some FPGAs offer dedicated adder blocks. One operation that is very common in DSP-type applications is the multiply-and-accumulate. As its name would suggest, this function multiplies two numbers together and adds the result into a running total stored in an accumulator (register). Hence, it is commonly referred to as a MAC, which stands for multiply, add, and accumulate (Figure 6-14).

Figure: The functions forming a MAC

Note that the multiplier, adder, and register portions of the serial implementation of our function shown in Figure 6-13 offer a classic example of a  MAC. If the FPGA you are working with supplies only embedded multipliers,  you would be obliged to implement this function by combining the multiplier  with an adder formed from a number of programmable logic blocks, while the  result would be stored in a block RAM or in a number of distributed RAMs.  Life becomes a little easier if the FPGA also provides embedded adders, and  some FPGAs provide entire MACs as embedded functions.
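To make the MAC's behavior concrete, here is a minimal software model, assuming a hypothetical `Mac` class with `clear` and `step` methods (these names are invented for illustration and do not correspond to any vendor's primitive). It is used to compute one output sample of a small filter as a sum of coefficient-times-sample products; the coefficient and sample values are arbitrary.

```python
class Mac:
    """Behavioral model of a MAC: multiplier + adder + accumulator register."""

    def __init__(self):
        self.acc = 0

    def clear(self):
        """Reset the accumulator register to zero."""
        self.acc = 0

    def step(self, x, y):
        """Multiply two numbers and add the product into the running total."""
        self.acc += x * y
        return self.acc


def filter_output(mac, coeffs, samples):
    """Compute one output sample by feeding each (coefficient, sample) pair to the MAC."""
    mac.clear()
    for c, s in zip(coeffs, samples):
        mac.step(c, s)
    return mac.acc


mac = Mac()
y = filter_output(mac, coeffs=[1, 2, 3], samples=[4, 5, 6])
print(y)
```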

FPGA-centric Design Flows for DSPs

At the time of this writing, using FPGAs to perform DSP is still relatively new. Thus, there really are no definitive design flows or methodologies here; everyone seems to have his or her unique way of doing things, and whichever option you choose, you’ll almost certainly end up breaking new ground one way or another.

Domain-specific Languages

The way of the world is that electronic designs increase in size and complexity over time. To manage this problem while maintaining (or, more usually, increasing) productivity, it is necessary to keep raising the level of abstraction used to capture the design’s functionality and verify its intent.

For this reason, gate-level schematics were superseded by RTL representations in VHDL and Verilog, as discussed in Chapter 5. Similarly, the drive toward C-based flows as discussed earlier is powered by the desire to capture complex concepts quickly and easily while facilitating architectural analysis and exploration.

In the case of specialist areas such as DSPs, system architects and design engineers can achieve a dramatic improvement in productivity by means of domain-specific languages (DSLs), which provide more concise ways of representing specific tasks than do general-purpose languages such as C/C++ and SystemC.

One such language is MATLAB, which allows DSP designers to represent a signal transformation, such as an FFT, that can potentially take up an entire FPGA, using a single line of code along the lines of

y = fft(x);   

Actually, the term MATLAB refers both to a language and an algorithmic-level simulation environment. To avoid confusion, it is common to talk about M-code (meaning “MATLAB code”) and M-files (files containing MATLAB code).
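To get a feel for how much arithmetic hides behind a one-line transform such as the FFT call shown earlier, here is a naive discrete Fourier transform written out in a general-purpose language. This is a sketch using only the Python standard library; the function name `dft` is an assumption, and a real FFT produces the same values with far fewer operations than this direct form.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform: N*N complex multiply-adds for N samples."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n))
        for i in range(n)
    ]


impulse = [1.0, 0.0, 0.0, 0.0]    # a unit impulse has a flat spectrum
constant = [1.0, 1.0, 1.0, 1.0]   # a constant signal has only a DC component
y_impulse = dft(impulse)
y_constant = dft(constant)
multiply_adds = len(impulse) ** 2  # cost of the direct form for this input
print(y_impulse, y_constant, multiply_adds)
```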

Insider Info  

Some engineers in the trenches occasionally refer to the “M language,” but this is not argot favored by the folks at The MathWorks.

In addition to sophisticated transformation operators like the FFT shown  above, there are also much simpler transformations like adders, subtractors,  multipliers, logical operators, matrix arithmetic, and so forth. The more complex transformations like an FFT can be formed from these fundamental entities if required. The output from each transformation can be used as the input  to one or more downstream transformations, and so forth, until the entire system has been represented at this high level of abstraction.   

One important point is that such a system-level representation does not initially imply a hardware or software implementation. In the case of a DSP core, for example, it could be that the entire function is implemented in software as discussed earlier in this chapter. Alternatively, the system architects could partition the design such that some functions are implemented in software, while other performance-critical tasks are implemented in hardware using dedicated ASIC or FPGA fabric. In this case, one typically needs to have access to a hardware/software codesign environment. For the purposes of these discussions, however, we shall assume pure hardware implementations.

System-level Design and Simulation Environments  

System-level design and simulation environments are conceptually at a higher  level than DSLs. One well-known example of this genre is Simulink from The  MathWorks. Depending on who you’re talking to, there may be a perception  that Simulink is simply a graphical user interface to MATLAB. In reality,  however, it is an independent dynamic modeling application that works with MATLAB.   

If you are using Simulink, you typically commence the design process by creating a graphical block diagram of your system showing a schematic of functional blocks and the connections between them. Each of these blocks may be user-defined, or they may originate in one of the libraries supplied with Simulink (these include DSP, communications, and control function block sets). In the case of a user-defined block, you can “push” into that block and represent its contents as a new graphical block diagram. You can also create blocks containing MATLAB functions, M-code, C/C++, FORTRAN … the list goes on.

Once you’ve captured the design’s intent, you use Simulink to simulate and  verify its functionality. As with MATLAB, the input stimulus to a Simulink  simulation might come from one or more mathematical functions, such as sinewave generators, or it might be provided in the form of real-world data such as  audio or video files. In many cases, it comes as a mixture of both; for example,  real-world data might be augmented with pseudorandom noise supplied by a  Simulink block.

—Technology Trade-offs—  

● The point here is that there’s no hard-and-fast rule. Some DSP designers  prefer to use MATLAB as their starting point, while others opt for Simulink  (this latter case is much rarer in the scheme of things). Some folks say that  this preference depends on the user’s background (software DSP development versus ASIC/FPGA designs), but others say that this is a load of tosh.   

Floating-point versus Fixed-point Representations   

Irrespective of whether one opts for Simulink or MATLAB (or a similar environment from another vendor) as a starting point, the first-pass model of the system is almost invariably described using floating-point representations. In the context of the decimal number system, this refers to numbers like 1.235 × 10^3 (that is, a fractional number multiplied by some power of 10). In the context of applications like MATLAB, equivalent binary values are represented inside the computer using the IEEE standard for double-precision floating-point numbers.
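The binary equivalent of "a fraction times a power of the base" can be inspected directly. The sketch below assumes that Python's `float` is an IEEE 754 double (true on mainstream platforms) and pulls apart the sign, exponent, and fraction fields of the decimal example; the variable names are illustrative only.

```python
import math
import struct

x = 1.235e3  # the decimal example, i.e. 1235.0

# Binary analogue of "fraction times a power of the base": x == m * 2**e,
# with 0.5 <= m < 1.
m, e = math.frexp(x)

# Raw IEEE 754 double-precision fields: 1 sign bit, 11 exponent bits,
# 52 fraction bits.
bits = struct.unpack(">Q", struct.pack(">d", x))[0]
sign = bits >> 63
biased_exponent = (bits >> 52) & 0x7FF
fraction = bits & ((1 << 52) - 1)

# Reconstruct the value from its fields (this formula is valid for normal numbers).
rebuilt = (-1) ** sign * (1 + fraction / 2**52) * 2.0 ** (biased_exponent - 1023)
print(m, e, sign, biased_exponent, rebuilt)
```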

Floating-point numbers of this type have the advantage of providing  extremely accurate values across a tremendous dynamic range. However, implementing floating-point calculations of this type in dedicated FPGA or ASIC  hardware requires a humongous amount of silicon resources, and the result  is painfully slow (in hardware terms). Thus, at some stage, the design will be  migrated over to use fixed-point representations, which refers to numbers having a fixed number of bits to represent their integer and fractional portions. This  process is commonly referred to as quantization .   

How many bits to use is totally system/algorithm dependent, and it may take a considerable amount of experimentation to determine the optimum balance between using the fewest number of bits to represent a set of values (thereby decreasing the amount of silicon resources required and speeding the calculations), while maintaining sufficient accuracy to perform the task in hand. (One can think of this trade-off in terms of how much noise the designer is willing to accept for a given number of bits.) In some cases, designers may spend days deciding “should we use 14, 15, or 16 bits to represent these particular values?” And, just to increase the fun, it may be best to vary the number of bits used to represent values at different locations in the system/algorithm.
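The bits-versus-noise trade-off can be explored with a few lines of code. The following is a minimal sketch, not the quantizer of any particular tool: the `quantize` helper, its signed fixed-point format (one sign bit, the rest fractional), the saturation behavior, and the test signal are all assumptions chosen for illustration.

```python
import math


def quantize(value, frac_bits, total_bits):
    """Round `value` to signed fixed point with `frac_bits` fractional bits, saturating."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(value * scale)               # nearest integer code
    code = max(lowest, min(highest, code))    # saturate instead of wrapping around
    return code / scale


def max_error(values, frac_bits, total_bits):
    """Worst-case quantization error over a set of values."""
    return max(abs(v - quantize(v, frac_bits, total_bits)) for v in values)


# One cycle of a sine wave with amplitude 0.9, so every sample lies within [-1, 1).
signal = [0.9 * math.sin(2 * math.pi * n / 64) for n in range(64)]

errors = {}
for total_bits in (14, 15, 16):
    frac_bits = total_bits - 1   # one sign bit, all remaining bits fractional
    errors[total_bits] = max_error(signal, frac_bits, total_bits)
    print(total_bits, errors[total_bits])
```

Each extra bit halves the bound on the rounding error, which is one way of putting a number on how much noise a given word length costs.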

Things start to get really fun in that the conversion from floating-point to fixed-point representations may take place upstream in the system/algorithmic design and verification environment, or downstream in the C/C++ code. This is shown in more detail in the “System/algorithmic level to C/C++” section below. Suffice it to say that if one is working in a MATLAB environment, these conversions can be performed by passing the floating-point signals through special transformation functions called quantizers. Alternatively, if one is working in a Simulink environment, the conversions can be performed by running the floating-point signals through special fixed-point blocks.

System/algorithmic Level to RTL (Manual Translation)  

At the time of this writing, many DSP design teams commence by performing their system-level evaluations and algorithmic validation in MATLAB (or the equivalent) using floating-point representations. (It is also very common to include an intermediate step in which a fixed-point C/C++ model is created for use in rapid simulation/validation.) At this point, many design teams bounce directly into hand-coding fixed-point RTL equivalents of the design in VHDL or Verilog (Figure 6-14a). Alternatively, they may first transition the floating-point representations into their fixed-point counterparts at the system/algorithmic level, and then hand-code the RTL in VHDL or Verilog (Figure 6-14b).

Figure: Manual RTL generation

Of course, once an RTL representation of the design has been created, we  can assume the use of the downstream logic-synthesis-based flows that were  introduced in Chapter 5.

—Technology Trade-offs—  

● There are a number of problems with this flow, not the least being that  there is a significant conceptual and representational divide between the  system architects working at the system/algorithmic level and the hardware  design engineers working with RTL representations in VHDL or Verilog.  

● Because the system/algorithmic and RTL domains are so different, manual  translation from one to the other is time-consuming and prone to error.  

● There is also the fact that the resulting RTL is implementation specific  because realizing the optimal design in an FPGA requires a different RTL  coding style from that used for an optimal ASIC implementation.  

● Another consideration is that manually modifying and reverifying RTL to perform a series of what-if evaluations of alternative microarchitecture implementations is extremely time-consuming. Such evaluations may include performing certain operations in parallel versus sequentially, pipelining portions of the design versus not pipelining them, and sharing common resources (for example, two operations sharing a single multiplier) versus using dedicated resources.

● Similarly, if any changes are made to the original specification during the course of the project, it’s relatively easy to implement and evaluate these changes in the system-/algorithmic-level representations, but  subsequently folding these changes into the RTL by hand can be painful  and time-consuming.

System/Algorithmic Level to RTL (Automatic-generation)  

As was noted in the previous section, performing system-/algorithmic-level-to-RTL translation manually is time-consuming and prone to error. There are alternatives, however, because some system-/algorithmic-level design environments offer direct VHDL or Verilog RTL code generation (Figure 6-15).

As usual, the system-/algorithmic-level design would commence by using floating-point representations. In one version of the flow, the system/algorithmic environment is used to migrate these representations into their fixed-point counterparts and then to generate the equivalent RTL in VHDL or Verilog automatically (Figure 6-15a).

Alternatively, a third-party environment might be used to take the floating-point system-/algorithmic-level representation, auto-interactively quantize it into its fixed-point counterpart, and then automatically generate the equivalent RTL in VHDL or Verilog (Figure 6-15b).

Figure: Direct RTL generation

As before, once an RTL representation of the design has been created, we  can assume the use of the downstream logic-synthesis-based flows that were  introduced in Chapter 5.

System/Algorithmic Level to C/C ++

Due to the problems associated with exploring the design at the RTL level, there is an increasing trend to use a stepping-stone approach. This involves transitioning from the system-/algorithmic-level domain into some sort of C/C++ representation, which itself is subsequently migrated into an RTL equivalent. One reason this is attractive is that the majority of DSP design teams already generate a C/C++ model for use as a golden (reference) model, in which case this sort of comes for free as far as the downstream RTL design engineer is concerned.

Of course, the first thing to decide is when and where in the flow one should  transition from floating-point to fixed-point representations ( Figure 6-16 ).   

Frighteningly enough, Figure 6-16 shows only a subset of the various potential flows. For example, in the case of the handcrafted options, as opposed to first hand-coding the C/C++ and then gradually transmogrifying this representation into Handel-C or SystemC, one could hand-code directly into these languages.

Figure: Migrating from floating point to fixed point

Block-level IP Environments  

Nothing is simple in this world because there is always just one more way to  do things. As an example, one might create a library of DSP functional blocks  at the system/algorithmic level of abstraction along with a one-to-one equivalent library of blocks at the RTL level of abstraction in VHDL or Verilog.   

The idea here is that you could then capture and verify your design using  a hierarchy of functional blocks specified at the system/algorithmic level of  abstraction. Once you were happy with your design, you could then generate a  structural netlist instantiating the RTL-level blocks, and use this to drive downstream simulation and synthesis tools. (These blocks would have to be parameterized at all levels of abstraction to allow you to specify such things as bus  widths and so forth.)   

As an alternative, the larger FPGA vendors typically offer IP core generators (in this context, the term core is considered to refer to a block that performs a specific logical function; it does not refer to a microprocessor or DSP core). In several cases, these core generators have been integrated into system-/algorithmic-level environments. This means that you can create a design based on a collection of these blocks in the system-/algorithmic-level environment, specify any parameters associated with these blocks, and perform your system-/algorithmic-level verification.

Later, when you’re ready to rock and roll, the core generator will automatically generate the hardware models corresponding to each of these blocks.  (The system-/algorithmic-level models and the hardware models ensuing from  the core generator are bit identical and cycle identical.) In some cases the  hardware blocks will be generated as synthesizable RTL in VHDL or Verilog.  Alternatively, they may be presented as firm cores at the LUT/CLB level of  abstraction, thereby making the maximum use of the targeted FPGA’s internal  resources.

—Technology Trade-offs—  

● One big drawback associated with this approach is that, by their very nature, IP blocks are based on hard-coded microarchitectures. This means that the ability to create highly tuned implementations to address specific design goals is somewhat diminished. The result is that IP-based flows may achieve an implementation faster with less risk, but such an implementation may be less optimal in terms of area, performance, and power as compared to a custom hardware implementation.

Don’t Forget the Testbench!

One point the folks selling you DSP design tools often neglect to mention is the testbench. For example, let’s assume that your flow involves taking your system-/algorithmic-level design and hand-translating it into RTL. In that case, you are going to have to do the same with your testbench. In many cases, this is a nontrivial task that can take days or weeks!

Or let’s say that your flow is based on taking your floating-point system-/algorithmic-level design and hand-translating it into floating-point C/C++, at which point you will wish to verify this new representation. Then you might take your floating-point C/C++ and hand-translate it into fixed-point C/C++, at which point you will wish to verify this representation. And then you might take your fixed-point C/C++ and (hopefully) automatically synthesize an equivalent RTL representation, at which point … but you get my drift.

The problem is that at each stage you are going to have to do the same thing with your testbench (unless you do something cunning as discussed in the next (and last, hurray!) section).
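The repeated "translate, then verify again" steps described above can be pictured with a small sketch in which one set of stimuli and one checking routine are applied to two representations of the same calculation. Everything here is hypothetical: the model functions, the stimuli, and the tolerance are invented for illustration, and a real flow would be checking C/C++ or RTL rather than Python.

```python
def float_model(samples, coeffs):
    """Floating-point reference ("golden") model: a plain sum of products."""
    return sum(c * s for c, s in zip(coeffs, samples))


def fixed_model(samples, coeffs, frac_bits=15):
    """Fixed-point version of the same calculation, using integer arithmetic."""
    scale = 1 << frac_bits
    acc = sum(round(c * scale) * round(s * scale) for c, s in zip(coeffs, samples))
    return acc / (scale * scale)   # convert the integer total back to a real value


def run_testbench(model, golden, stimuli, tolerance):
    """Apply every stimulus to both models; return the stimuli on which they disagree."""
    failures = []
    for samples, coeffs in stimuli:
        if abs(model(samples, coeffs) - golden(samples, coeffs)) > tolerance:
            failures.append((samples, coeffs))
    return failures


stimuli = [
    ([0.5, -0.25, 0.125], [0.3, 0.3, 0.3]),
    ([0.9, 0.8, -0.7], [0.1, -0.2, 0.3]),
]
failures = run_testbench(fixed_model, float_model, stimuli, tolerance=1e-3)
print("mismatches:", failures)
```

Because the stimuli and the checker do not depend on how a model is written, only the model itself has to be redone at each step in this simplified picture.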

Mixed DSP and VHDL/Verilog etc. Environments  

In the previous chapter, we noted that a number of EDA companies can provide mixed-level design and verification environments that can support the cosimulation of models specified at multiple levels of abstraction. For example, one might start with a graphical block-based editor showing the design’s major functional units, where the contents of each block can be represented using any of the following:

● VHDL  

● Verilog  

● SystemVerilog  

● SystemC  

● Handel-C  

● Pure C/C++

In this case, the top-level design might be in a traditional HDL that calls submodules represented in the various HDLs and in one or more flavors of C/C++. Alternatively, the top-level design might be in one of the flavors of C/C++ that calls submodules in the other languages.

More recently, integrations between system-/algorithmic-level and implementation-level environments have become available. The way in which this works depends on who is doing what and what that person is trying to do. For example, a system architect working at the system/algorithmic level (e.g., in MATLAB) might decide to replace one or more blocks with equivalent representations in VHDL or Verilog at the RTL level of abstraction. Alternatively, a design engineer working in VHDL or Verilog at the RTL level of abstraction might decide to call one or more blocks at the system/algorithmic level of abstraction.

Both of these cases require cosimulation between the system-/algorithmic-level environment and the VHDL/Verilog environment, the main difference being who calls whom. Of course, this sounds easy if you say it quickly,  but there is a whole host of considerations to be addressed, such as synchronizing the concept of time between the two domains and specifying how different signal types are translated as they pass from one domain to the other  (and back again).

Insider Info  

Treat any canned demonstration with a healthy amount of suspicion. If you are planning on doing this sort of thing, you need to sit down with the vendor’s engineer and work your own example through from beginning to end. Call me an old cynic if you will, but my advice is to let their engineer guide you, while keeping your hands firmly on the keyboard and mouse. (You’d be amazed how much activity can go on in just a few seconds should you turn your head in response to the age-old question, “Good grief! Did you see what just flew by the window?”)














