Digital signal processing includes compression, decompression, modulation, error correction, filtering, and otherwise manipulating audio (voice, music, etc.), video, image, and similar data for such applications as telecommunications, radar, and image processing (including medical imaging). In many cases, the data to be processed starts out as a signal in the real (analog) world. This analog signal is periodically sampled, with each sample being converted into a digital equivalent by means of an analog-to-digital (A/D) converter ( Figure 6-9 ).
These samples are then processed in the digital domain. In many cases, the processed digital samples are subsequently converted into an analog equivalent by means of a digital-to-analog (D/A) converter.
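The sampling and conversion steps described above can be sketched in a few lines of Python. This is only an illustrative model: the 1 kHz test tone, the 8 kHz sample rate, the 8-bit resolution, and the function names are all assumptions made for the example, not values taken from the text.

```python
import math

def sample_and_quantize(signal, sample_rate_hz, duration_s, bits, full_scale=1.0):
    """Model an A/D converter: sample periodically, map each sample to a signed code."""
    levels = 2 ** (bits - 1)                  # codes span -levels .. levels - 1
    n_samples = int(round(sample_rate_hz * duration_s))
    codes = []
    for n in range(n_samples):
        t = n / sample_rate_hz                # periodic sampling instant
        code = int(round(signal(t) / full_scale * levels))
        codes.append(max(-levels, min(levels - 1, code)))  # clip to the code range
    return codes

def reconstruct(codes, bits, full_scale=1.0):
    """Model a D/A converter: map each code back to an analog-equivalent value."""
    levels = 2 ** (bits - 1)
    return [c * full_scale / levels for c in codes]

def tone(t):
    """Hypothetical analog input: a 1 kHz sine at half of full scale."""
    return 0.5 * math.sin(2 * math.pi * 1000.0 * t)

codes = sample_and_quantize(tone, 8000.0, 0.001, bits=8)   # 8 samples of 1 ms
analog = reconstruct(codes, bits=8)
```

Each reconstructed value differs from the original signal by at most half of one quantization step, which is the price paid for moving into the digital domain.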
DSP occurs all over the place: in cell phones and telephone systems; CD, DVD, and MP3 players; cable set-top boxes; wireless and medical equipment; electronic vision systems; and the list goes on. This means that the overall DSP market is huge.
As usual, nothing is simple because DSP tasks can be implemented in a number of different ways:
● A general-purpose microprocessor (µP): This may also be referred to as a central processing unit (CPU) or a microprocessor unit (MPU). The processor can perform DSP by running an appropriate DSP algorithm.
● A digital signal processor (DSP): This is a special form of microprocessor chip (or core, as discussed below) that has been designed to perform DSP tasks much faster and more efficiently than can be achieved by means of a general-purpose microprocessor.
● Dedicated ASIC hardware : For the purposes of these discussions, we will assume that this refers to a custom hardware implementation that executes the DSP task. However, we should also note that the DSP task could be implemented in software by including a microprocessor or DSP core on the ASIC.
● Dedicated FPGA hardware : For the purposes of these discussions, we will assume that this refers to a custom hardware implementation that executes the DSP task. Once again, however, we should also note that the DSP functionality could be implemented in software by means of an embedded microprocessor core on the FPGA.
Irrespective of the final implementation technology (µP, DSP, ASIC, FPGA), if one is creating a product that is to be based on a new DSP algorithm, it is common practice to first perform system-level evaluation and algorithmic verification using an appropriate environment (we consider this in more detail later in this chapter).
Although this book attempts to avoid focusing on companies and products as far as possible, it is incumbent on us to mention that, at the time of this writing, the de facto industry standard for DSP algorithmic verification is MATLAB® from The MathWorks (www.mathworks.com).
For the purposes of these discussions, therefore, we shall refer to MATLAB as necessary. However, it should be noted that there are a number of other very powerful tools and environments available to DSP developers. For example, Simulink® from The MathWorks has a certain following; the Signal Processing Worksystem (SPW) environment from CoWare (www.coware.com) is very popular, especially in telecom markets; and tools from Elanix (www.elanix.com) also find favor with many designers.
Let’s assume that our new DSP algorithm is to be implemented using a microprocessor or DSP chip (or core). In this case, the flow might be as shown in Figure 6-10 .
● The process commences with someone having an idea for a new algorithm or suite of algorithms. This new concept typically undergoes verification using tools such as MATLAB as discussed above. In some cases, one might leap directly from the concept into handcrafting C/C++ (or assembly language).
● Once the algorithms have been verified, they have to be regenerated in C/C++ or in assembly language. MATLAB can be used to generate C/C++ tuned for the target DSP core automatically, but in some cases, design teams may prefer to perform this translation step by hand because they feel that they can achieve a more optimal representation this way. As yet another alternative, one might first auto-generate C/C++ code from the algorithmic verification environment, analyze and profile this code to determine any performance bottlenecks, and then recode the most critical portions by hand.
● Once you have your C/C++ (or assembly language) representation, you compile it (or assemble it) into the machine code that will ultimately be executed by the microprocessor or DSP core.
This type of implementation is very flexible because any desired changes can be addressed relatively quickly and easily by simply modifying and recompiling the source code. However, this also results in the slowest performance for the DSP algorithm because microprocessor and DSP chips are both classed as Turing machines. This means that their primary role in life is to process instructions, so both of these devices operate as follows:
● Fetch an instruction.
● Decode the instruction.
● Fetch a piece of data.
● Perform an operation on the data.
● Store the result somewhere.
● :
● Fetch another instruction and start all over again.
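To make the cost of this loop concrete, here is a minimal sketch of an instruction-processing machine evaluating a small sum of four products. The four-opcode instruction set, the register names, and the memory layout are hypothetical and exist only for this illustration; no real processor is being modeled.

```python
def run(program, memory):
    """Execute (opcode, dst, src1, src2) tuples one at a time; return the count executed."""
    regs = {}
    executed = 0
    pc = 0
    while pc < len(program):
        op, dst, src1, src2 = program[pc]          # fetch and decode the instruction
        if op == "LOAD":
            regs[dst] = memory[src1]               # fetch a piece of data
        elif op == "MUL":
            regs[dst] = regs[src1] * regs[src2]    # perform an operation on the data
        elif op == "ADD":
            regs[dst] = regs[src1] + regs[src2]
        elif op == "STORE":
            memory[dst] = regs[src1]               # store the result somewhere
        else:
            raise ValueError(f"unknown opcode: {op}")
        executed += 1
        pc += 1                                    # move on to the next instruction
    return executed

memory = {"ZERO": 0, "A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "H": 8}
program = [("LOAD", "acc", "ZERO", None)]
for x, y in [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H")]:
    program += [
        ("LOAD", "r0", x, None),
        ("LOAD", "r1", y, None),
        ("MUL", "r2", "r0", "r1"),
        ("ADD", "acc", "acc", "r2"),
    ]
program.append(("STORE", "Y", "acc", None))

instruction_count = run(program, memory)
```

Under these assumptions, four multiplications and their additions cost 18 sequential instructions, each of which must be fetched and decoded before any useful arithmetic happens.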
There are myriad ways in which one might implement a DSP algorithm in an ASIC or FPGA—the latter option being the focus of this chapter, of course. But before we hurl ourselves into the mire, let’s first consider how different architectures can affect the speed and area (in terms of silicon real estate) of the implementation.
DSP algorithms typically require huge numbers of multiplications and additions. As a really simple example, let’s assume that we have a new DSP algorithm that contains an expression something like the following:
Y = (A * B) + (C * D) + (E * F) + (G * H);
As usual, this is a generic syntax that does not favor any particular HDL and is used only for the purposes of these discussions. Of course, this would be a minuscule element in a horrendously complex algorithm, but DSP algorithms tend to contain a lot of this type of thing.
The point is that we can exploit the parallelism inherent in hardware to perform DSP functions much more quickly than can be achieved by means of software running on a DSP core. For example, suppose that all of the multiplications were performed in parallel (simultaneously) followed by two stages of additions ( Figure 6-11 ).
Remembering that multipliers are relatively large and complex and that adders are sort of large, this implementation will be very fast, but will consume a correspondingly large amount of chip resources.
As an alternative, we might employ resource sharing (sharing some of the multipliers and adders between multiple operations) and opt for a solution that is a mixture of parallel and serial ( Figure 6-12 ).
This solution requires the addition of four 2:1 multiplexers and a register (remember that each of these will be the same multibit width as their respective signal paths). However, multiplexers and registers consume much less area than the two multipliers and adder that are no longer required as compared to our initial solution.
On the downside, this approach is slower, because we must first perform the (A * B) and (C * D) multiplications, add the results together, add this total to the existing contents of the register (which will have been initialized to contain zero), and store the result in the register. Next, we must perform the (E * F) and (G * H) multiplications, add these results together, add this total to the existing contents of the register (which currently contains the results from the first set of multiplications and additions), and store this result in the register.
As yet another alternative, we might decide to use a fully serial solution ( Figure 6-13 ).
This latter implementation is very efficient in terms of area because it requires only a single multiplier and a single adder. This is the slowest implementation, however, because we must first perform the (A * B) multiplication, add the result to the existing contents of the register (which will have been initialized to contain zero), and store the total in the register. Next, we must perform the (C * D) multiplication, add this result to the existing contents of the register, and store this new total in the register. And so forth for the remaining multiplication operations. (Note that when we say “this is the slowest implementation,” we are referring to these hardware solutions, but even the slowest hardware implementation remains much, much faster than a software equivalent running on a microprocessor or DSP.)
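The three alternatives differ only in how many multipliers work at once, which a small behavioral model can show. This is a sketch under stated assumptions: the function name and the notion of a "pass" are ours, the operand values are arbitrary, and a pass is not claimed to equal any particular number of clock cycles. For simplicity the model keeps an accumulator even in the fully parallel case, where the hardware described above needs none.

```python
def sum_of_products(pairs, multipliers):
    """Evaluate sum(a * b) using `multipliers` multiplier units per pass.

    Returns (result, passes). More multipliers means fewer passes (faster)
    but more silicon; fewer multipliers means more passes but less area.
    """
    acc = 0                                        # register initialized to zero
    passes = 0
    for i in range(0, len(pairs), multipliers):
        group = pairs[i:i + multipliers]
        partial = sum(a * b for a, b in group)     # multiplications in one pass run side by side
        acc += partial                             # add to the register contents and store
        passes += 1
    return acc, passes

# Hypothetical operand values for (A, B), (C, D), (E, F), (G, H).
pairs = [(1, 2), (3, 4), (5, 6), (7, 8)]

results = {m: sum_of_products(pairs, m) for m in (4, 2, 1)}
# 4 multipliers: fully parallel; 2: mixed parallel/serial; 1: fully serial
```

All three configurations produce the same value of Y; only the number of passes changes (one, two, and four respectively), mirroring the speed-versus-area trade-off described above.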
As previously discussed, some functions like multipliers are inherently slow if they are implemented by connecting a large number of programmable logic blocks together inside an FPGA. Since many applications require these functions, many FPGAs incorporate special hard-wired multiplier blocks. (These are typically located in close proximity to embedded RAM blocks because these functions are often used in conjunction with each other.)
Similarly, some FPGAs offer dedicated adder blocks. One operation that is very common in DSP-type applications is the multiply-and-accumulate. As its name would suggest, this function multiplies two numbers together and adds the result into a running total stored in an accumulator (register). Hence, it is commonly referred to as a MAC, which stands for multiply, add, and accumulate ( Figure 6-14 ).
Note that the multiplier, adder, and register portions of the serial implementation of our function shown in Figure 6-13 offer a classic example of a MAC. If the FPGA you are working with supplies only embedded multipliers, you would be obliged to implement this function by combining the multiplier with an adder formed from a number of programmable logic blocks, while the result would be stored in a block RAM or in a number of distributed RAMs. Life becomes a little easier if the FPGA also provides embedded adders, and some FPGAs provide entire MACs as embedded functions.
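A MAC is simple enough to model directly. The sketch below assumes a signed accumulator of a fixed width that wraps around on overflow (two's complement); that overflow behavior, the default 32-bit width, and the class name are assumptions for illustration, since real embedded MACs may instead saturate or carry extra guard bits.

```python
class Mac:
    """Behavioral model of a multiply-and-accumulate unit with a fixed-width register."""

    def __init__(self, acc_bits=32):
        self.acc_bits = acc_bits
        self.acc = 0                               # accumulator register

    def clear(self):
        self.acc = 0

    def step(self, a, b):
        """Multiply a by b, add the product to the running total, store it back."""
        total = (self.acc + a * b) & ((1 << self.acc_bits) - 1)   # keep acc_bits bits
        if total >= 1 << (self.acc_bits - 1):      # reinterpret as a signed value
            total -= 1 << self.acc_bits
        self.acc = total
        return self.acc

mac = Mac()
for a, b in [(1, 2), (3, 4), (5, 6), (7, 8)]:      # hypothetical operand pairs
    mac.step(a, b)
mac_result = mac.acc

narrow = Mac(acc_bits=8)                           # deliberately too narrow
narrow.step(100, 1)
overflowed = narrow.step(100, 1)                   # 200 does not fit in 8 signed bits
```

The narrow example shows why accumulator width matters: a running total that exceeds the register's range silently wraps to a wrong value in this model.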
At the time of this writing, using FPGAs to perform DSP is still relatively new. Thus, there really are no definitive design flows or methodologies here; everyone seems to have his or her unique way of doing things, and whichever option you choose, you’ll almost certainly end up breaking new ground one way or another.
Domain-specific Languages
The way of the world is that electronic designs increase in size and complexity over time. To manage this problem while maintaining (or, more usually, increasing) productivity, it is necessary to keep raising the level of abstraction used to capture the design’s functionality and verify its intent.
For this reason the gate-level schematics were superseded by the RTL representations in VHDL and Verilog, as discussed in Chapter 5. Similarly, the drive toward C-based flows as discussed earlier is powered by the desire to capture complex concepts quickly and easily while facilitating architectural analysis and exploration.
In the case of specialist areas such as DSPs, system architects and design engineers can achieve a dramatic improvement in productivity by means of domain-specific languages (DSLs), which provide more concise ways of representing specific tasks than do general-purpose languages such as C/C++ and SystemC. One such language is MATLAB, which allows DSP designers to represent a signal transformation, such as an FFT, that can potentially take up an entire FPGA, using a single line of code along the lines of
y = fft(x);
Actually, the term MATLAB refers both to a language and an algorithmic-level simulation environment. To avoid confusion, it is common to talk about M-code (meaning “MATLAB code”) and M-files (files containing MATLAB code).
Some engineers in the trenches occasionally refer to the “M language,” but this is not argot favored by the folks at The MathWorks.
In addition to sophisticated transformation operators like the FFT shown above, there are also much simpler transformations like adders, subtractors, multipliers, logical operators, matrix arithmetic, and so forth. The more complex transformations like an FFT can be formed from these fundamental entities if required. The output from each transformation can be used as the input to one or more downstream transformations, and so forth, until the entire system has been represented at this high level of abstraction.
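As an illustration of a complex transformation being formed from fundamental entities, the sketch below builds an FFT out of nothing but complex additions, subtractions, and multiplications, and checks it against a direct evaluation of the discrete Fourier transform. It is written in Python rather than M-code, the radix-2 recursion is just one of many possible decompositions, and the function names are ours.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform: n*n multiply-and-add operations."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
        for m in range(n)
    ]

def fft(x):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])                            # transform of even-indexed samples
    odd = fft(x[1::2])                             # transform of odd-indexed samples
    out = [0j] * n
    for m in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * m / n) * odd[m]   # one complex multiplier
        out[m] = even[m] + t                       # one adder
        out[m + n // 2] = even[m] - t              # one subtractor
    return out

x = [1, 2, 3, 4, 0, 0, 0, 0]                       # hypothetical input samples
y = fft(x)
reference = dft(x)
```

Each pass through the inner loop is a "butterfly" of one multiply, one add, and one subtract, which is exactly the kind of fundamental block that a system-level environment can chain together.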
One important point is that such a system-level representation does not initially imply a hardware or software implementation. In the case of a DSP core, for example, it could be that the entire function is implemented in software as discussed earlier in this chapter. Alternatively, the system architects could partition the design such that some functions are implemented in software, while other performance-critical tasks are implemented in hardware using dedicated ASIC or FPGA fabric. In this case, one typically needs to have access to a hardware/software codesign environment. For the purposes of these discussions, however, we shall assume pure hardware implementations.
System-level design and simulation environments are conceptually at a higher level than DSLs. One well-known example of this genre is Simulink from The MathWorks. Depending on who you’re talking to, there may be a perception that Simulink is simply a graphical user interface to MATLAB. In reality, however, it is an independent dynamic modeling application that works with MATLAB.
If you are using Simulink, you typically commence the design process by creating a graphical block diagram of your system showing a schematic of functional blocks and the connections between them. Each of these blocks may be user-defined, or they may originate in one of the libraries supplied with Simulink (these include DSP, communications, and control function block sets). In the case of a user-defined block, you can “push” into that block and represent its contents as a new graphical block diagram. You can also create blocks containing MATLAB functions, M-code, C/C++, FORTRAN, and the list goes on.
Once you’ve captured the design’s intent, you use Simulink to simulate and verify its functionality. As with MATLAB, the input stimulus to a Simulink simulation might come from one or more mathematical functions, such as sinewave generators, or it might be provided in the form of real-world data such as audio or video files. In many cases, it comes as a mixture of both; for example, real-world data might be augmented with pseudorandom noise supplied by a Simulink block.
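The idea of augmenting a clean input with pseudorandom noise can be sketched as follows. This is plain Python rather than a Simulink block; the sine-wave stand-in for real-world data, the noise amplitude, and the seed are all assumptions made for the example.

```python
import math
import random

def noisy_stimulus(clean, noise_amplitude, seed=0):
    """Add uniformly distributed pseudorandom noise to each clean sample.

    A fixed seed makes the stimulus repeatable, so the same input can be
    replayed against different versions of the design.
    """
    rng = random.Random(seed)
    return [s + rng.uniform(-noise_amplitude, noise_amplitude) for s in clean]

# Stand-in for recorded data: two periods of a sine wave, 32 samples per period.
clean = [math.sin(2 * math.pi * n / 32) for n in range(64)]
stimulus = noisy_stimulus(clean, 0.05, seed=1)
```

Seeding the generator is the important design choice here: a simulation that cannot be repeated exactly is very hard to debug.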
—Technology Trade-offs—
● The point here is that there’s no hard-and-fast rule. Some DSP designers prefer to use MATLAB as their starting point, while others opt for Simulink (this latter case is much rarer in the scheme of things). Some folks say that this preference depends on the user’s background (software DSP development versus ASIC/FPGA designs), but others say that this is a load of tosh.
Irrespective of whether one opts for Simulink or MATLAB (or a similar environment from another vendor) as a starting point, the first-pass model of the system is almost invariably described using floating-point representations. In the context of the decimal number system, this refers to numbers like 1.235 × 10^3 (that is, a fractional number multiplied by some power of 10). In the context of applications like MATLAB, equivalent binary values are represented inside the computer using the IEEE standard for double-precision floating-point numbers.
Floating-point numbers of this type have the advantage of providing extremely accurate values across a tremendous dynamic range. However, implementing floating-point calculations of this type in dedicated FPGA or ASIC hardware requires a humongous amount of silicon resources, and the result is painfully slow (in hardware terms). Thus, at some stage, the design will be migrated over to use fixed-point representations, which refers to numbers having a fixed number of bits to represent their integer and fractional portions. This process is commonly referred to as quantization .
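A minimal sketch of quantization, assuming a signed fixed-point format in which `int_bits` (including the sign bit) and `frac_bits` are chosen by the designer, and assuming out-of-range values saturate rather than wrap. The function names and the 4.12 format used in the example are illustrative only.

```python
def to_fixed(value, int_bits, frac_bits):
    """Quantize a float to a signed fixed-point integer code, saturating on overflow."""
    total_bits = int_bits + frac_bits              # int_bits includes the sign bit
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = int(round(value * (1 << frac_bits)))    # scale, then round to nearest code
    return max(lowest, min(highest, code))

def to_float(code, frac_bits):
    """Convert a fixed-point integer code back to the real value it represents."""
    return code / (1 << frac_bits)

code = to_fixed(1.235, int_bits=4, frac_bits=12)   # 16-bit word, 12 fractional bits
approx = to_float(code, frac_bits=12)
error = abs(approx - 1.235)                        # no more than half a step, 2**-13

saturated = to_fixed(100.0, int_bits=4, frac_bits=12)   # too large for this format
```

The stored value is just an integer; the position of the binary point is a convention that the designer has to track through every subsequent operation.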
This is totally system/algorithm dependent, and it may take a considerable amount of experimentation to determine the optimum balance between using the fewest number of bits to represent a set of values (thereby decreasing the amount of silicon resources required and speeding the calculations), while maintaining sufficient accuracy to perform the task in hand. (One can think of this trade-off in terms of how much noise the designer is willing to accept for a given number of bits.) In some cases, designers may spend days deciding “should we use 14, 15, or 16 bits to represent these particular values?” And, just to increase the fun, it may be best to vary the number of bits used to represent values at different locations in the system/algorithm.
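The "noise per bit" view of this trade-off can be explored numerically. The sketch below measures the signal-to-quantization-noise ratio of a test sine wave at 14, 15, and 16 bits; the test signal, its amplitude and frequency, and the choice of one sign bit with the remaining bits fractional are assumptions made for the example.

```python
import math

def sqnr_db(samples, total_bits):
    """Signal-to-quantization-noise ratio in dB for a signed format
    with one sign bit and (total_bits - 1) fractional bits."""
    scale = 1 << (total_bits - 1)
    signal_power = sum(s * s for s in samples)
    noise_power = sum((s - round(s * scale) / scale) ** 2 for s in samples)
    if noise_power == 0:
        return math.inf                            # every sample was exactly representable
    return 10 * math.log10(signal_power / noise_power)

# Hypothetical test signal: a sine at 90% of full scale, 1000 samples.
samples = [0.9 * math.sin(2 * math.pi * 0.01237 * n) for n in range(1000)]

sqnr = {bits: sqnr_db(samples, bits) for bits in (14, 15, 16)}
```

For a signal like this one, each additional bit buys roughly 6 dB of signal-to-noise ratio, so the question "14, 15, or 16 bits?" becomes a question of how many decibels the rest of the system actually needs.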
Things start to get really fun in that the conversion from floating-point to fixed-point representations may take place upstream in the system/algorithmic design and verification environment, or downstream in the C/C++ code. This is shown in more detail in the “System/algorithmic level to C/C++” section below. Suffice it to say that if one is working in a MATLAB environment, these conversions can be performed by passing the floating-point signals through special transformation functions called quantizers. Alternatively, if one is working in a Simulink environment, the conversions can be performed by running the floating-point signals through special fixed-point blocks.
At the time of this writing, many DSP design teams commence by performing their system-level evaluations and algorithmic validation in MATLAB (or the equivalent) using floating-point representations. (It is also very common to include an intermediate step in which a fixed-point C/C++ model is created for use in rapid simulation/validation.) At this point, many design teams bounce directly into hand-coding fixed-point RTL equivalents of the design in VHDL or Verilog ( Figure 6-14a ). Alternatively, they may first transition the floating-point representations into their fixed-point counterparts at the system/algorithmic level, and then hand-code the RTL in VHDL or Verilog ( Figure 6-14b ).
Of course, once an RTL representation of the design has been created, we can assume the use of the downstream logic-synthesis-based flows that were introduced in Chapter 5.
—Technology Trade-offs—
● There are a number of problems with this flow, not the least being that there is a significant conceptual and representational divide between the system architects working at the system/algorithmic level and the hardware design engineers working with RTL representations in VHDL or Verilog.
● Because the system/algorithmic and RTL domains are so different, manual translation from one to the other is time-consuming and prone to error.
● There is also the fact that the resulting RTL is implementation specific because realizing the optimal design in an FPGA requires a different RTL coding style from that used for an optimal ASIC implementation.
● Another consideration is that manually modifying and reverifying RTL to perform a series of what-if evaluations of alternative microarchitecture implementations is extremely time-consuming (such evaluations may include performing certain operations in parallel versus sequential, pipelining portions of the design versus nonpipelining, sharing common resources—for example, two operations sharing a single multiplier—versus using dedicated resources, etc.)
● Similarly, if any changes are made to the original specification during the course of the project, it’s relatively easy to implement and evaluate these changes in the system-/algorithmic-level representations, but subsequently folding these changes into the RTL by hand can be painful and time-consuming.
As was noted in the previous section, performing system-/algorithmic-level-to-RTL translation manually is time-consuming and prone to error. There are alternatives, however, because some system-/algorithmic-level design environments offer direct VHDL or Verilog RTL code generation ( Figure 6-15 ). As usual, the system-/algorithmic-level design would commence by using floating-point representations. In one version of the flow, the system/algorithmic environment is used to migrate these representations into their fixed-point counterparts and then to generate the equivalent RTL in VHDL or Verilog automatically ( Figure 6-15a ). Alternatively, a third-party environment might be used to take the floating-point system-/algorithmic-level representation, auto-interactively quantize it into its fixed-point counterpart, and then automatically generate the equivalent RTL in VHDL or Verilog ( Figure 6-15b ).
As before, once an RTL representation of the design has been created, we can assume the use of the downstream logic-synthesis-based flows that were introduced in Chapter 5.
Due to the problems associated with exploring the design at the RTL level, there is an increasing trend to use a stepping-stone approach. This involves transitioning from the system-/algorithmic-level domain into some sort of C/C++ representation, which itself is subsequently migrated into an RTL equivalent. One reason this is attractive is that the majority of DSP design teams already generate a C/C++ model for use as a golden (reference) model, in which case this sort of comes for free as far as the downstream RTL design engineer is concerned.
Of course, the first thing to decide is when and where in the flow one should transition from floating-point to fixed-point representations ( Figure 6-16 ).
Frighteningly enough, Figure 6-16 shows only a subset of the various potential flows. For example, in the case of the handcrafted options, as opposed to first hand-coding the C/C++ and then gradually transmogrifying this representation into Handel-C or SystemC, one could hand-code directly into these languages.
Nothing is simple in this world because there is always just one more way to do things. As an example, one might create a library of DSP functional blocks at the system/algorithmic level of abstraction along with a one-to-one equivalent library of blocks at the RTL level of abstraction in VHDL or Verilog.
The idea here is that you could then capture and verify your design using a hierarchy of functional blocks specified at the system/algorithmic level of abstraction. Once you were happy with your design, you could then generate a structural netlist instantiating the RTL-level blocks, and use this to drive downstream simulation and synthesis tools. (These blocks would have to be parameterized at all levels of abstraction to allow you to specify such things as bus widths and so forth.)
As an alternative, the larger FPGA vendors typically offer IP core generators (in this context, the term core is considered to refer to a block that performs a specific logical function; it does not refer to a microprocessor or DSP core). In several cases, these core generators have been integrated into system-/algorithmic-level environments. This means that you can create a design based on a collection of these blocks in the system-/algorithmic-level environment, specify any parameters associated with these blocks, and perform your system-/algorithmic-level verification.
Later, when you’re ready to rock and roll, the core generator will automatically generate the hardware models corresponding to each of these blocks. (The system-/algorithmic-level models and the hardware models ensuing from the core generator are bit identical and cycle identical.) In some cases the hardware blocks will be generated as synthesizable RTL in VHDL or Verilog. Alternatively, they may be presented as firm cores at the LUT/CLB level of abstraction, thereby making the maximum use of the targeted FPGA’s internal resources.
—Technology Trade-offs—
● One big drawback associated with this approach is that, by their very nature, IP blocks are based on hard-coded microarchitectures. This means that the ability to create highly tuned implementations to address specific design goals is somewhat diminished. The result is that IP-based flows may achieve an implementation faster with less risk, but such an implementation may be less optimal in terms of area, performance, and power as compared to a custom hardware implementation.
One point the folks selling you DSP design tools often neglect to mention is the testbench. For example, let’s assume that your flow involves taking your system-/algorithmic-level design and hand-translating it into RTL. In that case, you are going to have to do the same with your testbench. In many cases, this is a nontrivial task that can take days or weeks!
Or let’s say that your flow is based on taking your floating-point system-/algorithmic-level design and hand-translating it into floating-point C/C++, at which point you will wish to verify this new representation. Then you might take your floating-point C/C++ and hand-translate it into fixed-point C/C++, at which point you will wish to verify this representation. And then you might take your fixed-point C/C++ and (hopefully) automatically synthesize an equivalent RTL representation, at which point … but you get my drift. The problem is that at each stage you are going to have to do the same thing with your testbench (unless you do something cunning as discussed in the next, and last, hurray!, section).
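One way to picture the verification burden is a check that replays the same stimulus vectors through a floating-point reference and a fixed-point translation and flags any disagreement beyond a tolerance. The sketch below is an assumption-laden illustration, not a description of any vendor's flow: the tiny two-product function, the 12 fractional bits, the test vectors, and the tolerance are all invented for the example.

```python
def golden(a, b, c, d):
    """Floating-point reference (golden) model."""
    return a * b + c * d

def fixed_point(a, b, c, d, frac_bits=12):
    """Hand-translated fixed-point version of the same calculation."""
    scale = 1 << frac_bits
    qa, qb, qc, qd = (int(round(v * scale)) for v in (a, b, c, d))
    acc = qa * qb + qc * qd                        # products carry 2 * frac_bits fractional bits
    return acc / (scale * scale)

def check(model, reference, vectors, tolerance):
    """Run every stimulus vector through both models; return the mismatches."""
    failures = []
    for vec in vectors:
        got, expected = model(*vec), reference(*vec)
        if abs(got - expected) > tolerance:
            failures.append((vec, got, expected))
    return failures

vectors = [
    (0.5, 0.25, -0.75, 0.125),
    (0.1, 0.2, 0.3, 0.4),
    (-0.9, 0.9, 0.33, -0.66),
]

failures = check(fixed_point, golden, vectors, tolerance=1e-3)
strict_failures = check(fixed_point, golden, vectors, tolerance=1e-9)
```

With a tolerance that reflects the chosen word length, the fixed-point model passes; with an unrealistically tight tolerance, the vectors whose operands are not exactly representable are flagged. Keeping the stimulus vectors and the comparison separate from the models is what allows them to be reused at each stage of translation.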
In the previous chapter, we noted that a number of EDA companies can provide mixed-level design and verification environments that can support the cosimulation of models specified at multiple levels of abstraction. For example, one might start with a graphical block-based editor showing the design’s major functional units, where the contents of each block can be represented using
● VHDL
● Verilog
● SystemVerilog
● SystemC
● Handel-C
● Pure C/C++
In this case, the top-level design might be in a traditional HDL that calls submodules represented in the various HDLs and in one or more flavors of C/C++. Alternatively, the top-level design might be in one of the flavors of C/C++ that calls submodules in the other languages.
More recently, integrations between system-/algorithmic-level and implementation-level environments have become available. The way in which this works depends on who is doing what and what that person is trying to do. For example, a system architect working at the system/algorithmic level (e.g., in MATLAB) might decide to replace one or more blocks with equivalent representations in VHDL or Verilog at the RTL level of abstraction. Alternatively, a design engineer working in VHDL or Verilog at the RTL level of abstraction might decide to call one or more blocks at the system/algorithmic level of abstraction.
Both of these cases require cosimulation between the system-/algorithmic-level environment and the VHDL/Verilog environment, the main difference being who calls whom. Of course, this sounds easy if you say it quickly, but there is a whole host of considerations to be addressed, such as synchronizing the concept of time between the two domains and specifying how different signal types are translated as they pass from one domain to the other (and back again).
Treat any canned demonstration with a healthy amount of suspicion. If you are planning on doing this sort of thing, you need to sit down with the vendor’s engineer and work your own example through from beginning to end. Call me an old cynic if you will, but my advice is to let their engineer guide you, while keeping your hands firmly on the keyboard and mouse. (You’d be amazed how much activity can go on in just a few seconds should you turn your head in response to the age-old question, “Good grief! Did you see what just flew by the window?”)