Examining the nine control states in the main loop, and relating them to the mapping of the control graph onto the dataflow graph, showed that the last eight cycles were performing the S-block substitutions and the first two cycles were mainly transforming the key. The second state is an overlap state in which both key and data transforms take place. The problem with the last eight cycles was fairly self-evident: there are eight substitutions and eight control states to perform them. Clearly something was locking each substitution into a separate control state and so preventing optimization with respect to latency. It wasn't difficult to see what each of these states contained: just register assignments, concatenations and a ROM read operation. The ROM read is the problem. The ROM implementation being targeted is a synchronous circuit, so the S-block ROM can only be accessed once per clock cycle, in other words once per control state. This is what prevents the datapath operations from being performed in parallel. Attacking this problem is beyond the capabilities of behavioral synthesis because it requires knowledge of the dataflow at a much higher level than can be extracted automatically. The solution therefore requires modification of the original design.
There are two obvious solutions to this problem: either split the S-block into eight smaller ROMs that can be accessed in parallel, or make the S-block a non-ROM so that the array is expanded into a decoder block once for each access, giving eight decoders. The latter appears simplest, but it results in eight 512-way decoders, which is a very large implementation. Splitting the ROM is more likely to yield a useful result. The substitute function was rewritten to use eight mini-ROMs:
function substitute(data : vec48) return vec32 is
  --moods inline
  type S_block_type is
    array(0 to 63) of natural range 0 to 15;
  constant S_block0 : S_block_type := ( ... );
  --moods ROM
  ...
  constant S_block7 : S_block_type := ( ... );
  --moods ROM
begin
  return std_logic_vector(to_unsigned(S_block0(to_integer(
           unsigned(data(1) & data(6) & data(2 to 5)))),4)) &
         ...
         std_logic_vector(to_unsigned(S_block7(to_integer(
           unsigned(data(43) & data(48) & data(44 to 47)))),4));
end;
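The address expression used for each mini-ROM can be isolated as a small helper. The following is a minimal sketch, not the book's code: the package name sblock_sketch, the subtypes vec4 and vec6, the functions sblock_address and sblock_lookup, and the table S_block_demo are all hypothetical, and the table contents are zeros because the real values are elided in the listing. It shows how bits 1 and 6 of a 6-bit group form the top two address bits while bits 2 to 5 form the lower four.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package illustrating the addressing of one 64-entry mini-ROM.
package sblock_sketch is
  subtype vec4 is std_logic_vector(1 to 4);
  subtype vec6 is std_logic_vector(1 to 6);
  type S_block_type is array(0 to 63) of natural range 0 to 15;

  -- Placeholder contents only: the real table values are elided in the text.
  constant S_block_demo : S_block_type := (others => 0);
  --moods ROM

  -- Address into a mini-ROM for one 6-bit group of the expanded data.
  function sblock_address(bits : vec6) return natural;
  -- One 4-bit substitution: a single read of the given mini-ROM.
  function sblock_lookup(rom : S_block_type; bits : vec6) return vec4;
end package sblock_sketch;

package body sblock_sketch is
  function sblock_address(bits : vec6) return natural is
  begin
    -- Same bit order as the listing: bit 1, bit 6, then bits 2 to 5.
    return to_integer(unsigned(
      std_logic_vector'(bits(1) & bits(6) & bits(2 to 5))));
  end function sblock_address;

  function sblock_lookup(rom : S_block_type; bits : vec6) return vec4 is
  begin
    return vec4(to_unsigned(rom(sblock_address(bits)), 4));
  end function sblock_lookup;
end package body sblock_sketch;
```

With this helper, sblock_address("100001") evaluates to 48, because bits 1 and 6 supply the two most significant address bits. Since each of the eight groups reads a different constant, no ROM is accessed more than once per call.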
This was resynthesized and resulted in the control graph shown in Figure 19.3. The inner loop was found to have been reduced to two states, and examination of the last state confirmed that all of the S-block substitutions were being carried out in the one state c4. The key transformations were still split across the two inner states c3 and c4.
One interesting side-effect of this optimization is that it is also a smaller design. MOODS predicts that this design has the area and delay characteristics shown in Table 19.1 in the line labeled (2).
Figure 19.3 Control state machine for optimized S-blocks.
Optimizing the Key Transformations
Examination of the two control states in the main loop, which both contain key transformations, showed that both of these states were performing ROM access and rotate operations. Examination of the original key_rotate function showed that the shift distance ROMs are accessed twice per call, so this turned out to be exactly the same problem as with the S-block ROM. Since ROMs are synchronous, they can only be accessed once per cycle and this forces at least two cycles to be used for the rotate. To solve this, the function can be rewritten to only access the ROMs once per call:
if encrypt = 1 then
  distance := encrypt_shift_distance(round);
  result :=
    vec28(unsigned(key(1 to 28)) rol distance) &
    vec28(unsigned(key(29 to 56)) rol distance);
else
  distance := decrypt_shift_distance(round);
  result :=
    vec28(unsigned(key(1 to 28)) ror distance) &
    vec28(unsigned(key(29 to 56)) ror distance);
end if;
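To make the difference concrete, the fragment can be placed in a complete function next to a two-read variant. This is a sketch under stated assumptions rather than the original source: the package name, both function names, the round_type and distance_rom types, and the encrypt parameter type are hypothetical; the two-read form is only a guess at the kind of code that causes the problem; and the shift tables are taken from the published DES key schedule, which the design is assumed to follow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package contrasting two ROM reads per call with one.
package key_rotate_sketch is
  subtype vec28 is std_logic_vector(1 to 28);
  subtype vec56 is std_logic_vector(1 to 56);
  subtype round_type is natural range 0 to 15;
  type distance_rom is array(round_type) of natural range 0 to 2;

  -- Assumed contents: the standard DES per-round shift distances.
  constant encrypt_shift_distance : distance_rom :=
    (1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1);
  --moods ROM
  constant decrypt_shift_distance : distance_rom :=
    (0, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1);
  --moods ROM

  function key_rotate_two_reads(key : vec56; round : round_type;
                                encrypt : natural range 0 to 1) return vec56;
  function key_rotate_one_read(key : vec56; round : round_type;
                               encrypt : natural range 0 to 1) return vec56;
end package key_rotate_sketch;

package body key_rotate_sketch is
  -- Guessed problem form: the selected ROM is indexed once per key half.
  function key_rotate_two_reads(key : vec56; round : round_type;
                                encrypt : natural range 0 to 1) return vec56 is
    variable result : vec56;
  begin
    if encrypt = 1 then
      result :=
        vec28(unsigned(key(1 to 28)) rol encrypt_shift_distance(round)) &
        vec28(unsigned(key(29 to 56)) rol encrypt_shift_distance(round));
    else
      result :=
        vec28(unsigned(key(1 to 28)) ror decrypt_shift_distance(round)) &
        vec28(unsigned(key(29 to 56)) ror decrypt_shift_distance(round));
    end if;
    return result;
  end function key_rotate_two_reads;

  -- Rewritten form: one ROM read into a variable, reused for both halves.
  function key_rotate_one_read(key : vec56; round : round_type;
                               encrypt : natural range 0 to 1) return vec56 is
    variable distance : natural range 0 to 2;
    variable result   : vec56;
  begin
    if encrypt = 1 then
      distance := encrypt_shift_distance(round);
      result :=
        vec28(unsigned(key(1 to 28)) rol distance) &
        vec28(unsigned(key(29 to 56)) rol distance);
    else
      distance := decrypt_shift_distance(round);
      result :=
        vec28(unsigned(key(1 to 28)) ror distance) &
        vec28(unsigned(key(29 to 56)) ror distance);
    end if;
    return result;
  end function key_rotate_one_read;
end package body key_rotate_sketch;
```

Both functions compute the same value. Only the number of accesses to a synchronous ROM differs, and that is what decides how many control states the rotate needs.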
This was resynthesized and resulted in the control graph shown in Figure 19.4. The inner loop was found to have been reduced to one state (c3) containing both the key and data transformations, which are repeated 16 times. As before, states c1 and c2 implement the input handshake.
This optimization achieved the target of one clock cycle per round of the core: the 16 rounds now take 16 cycles in the inner loop, compared with 144 (16 × 9) for the original nine-state loop. MOODS predicts that this design has the area and delay characteristics shown in Table 19.1 in the line labeled (3).
Figure 19.4 Control state machine for optimized key rotate.