Speedgoat IO3xx Coprocessor Example
This example showcases several implementations of the coprocessor DMA engine.
The model is called IO3xx_coprocessor_hdlc.slx.
The list of basic software requirements is provided in the prerequisites section of the
Getting Started page.
In addition, you must also install the DSP System
The example uses the following interfaces:
The example uses the following user blocks:
Direct Memory Access (DMA) is used to reduce the latency of data transferred
between the FPGA and the target CPU domain, especially if larger amounts of data
need to be transferred. There are different use cases depending on whether the data
transfer is in the direction FPGA to CPU (data logging), or in the direction CPU to
FPGA (playback) or bidirectional (coprocessor mode). A simplified setup including
the basic blocks required to discuss the DMA use cases in a system comprising an
FPGA-based I/O module and a Real-Time Target x86 CPU is illustrated below.
The FPGA I/O Module consists of the I/O channels
(digital, analog, multi-gigabit transceiver, etc.), the FPGA itself, an external RAM
(typically DDR3/DDR4 SDRAM), the design under test (DUT), the DMA engine and the
PCIe Endpoint (used to communicate with the Target x86 CPU).
The Motherboard (x86) consists of the Target CPU
(x86), the System Memory and the Solid-State Drive (SSD) for persistent data
To test the HDL interface functionality, dedicated
examples are included in the downloaded archive file. To open the examples, navigate
to the corresponding folder. Note that the examples only test I/O channels for which
the loopback test method is possible. The terminal board provided must be wired as
described. Examples do not test I/O channels that require external hardware (for
some examples a function generator or an oscilloscope is required), but running this
example will still provide sufficient confirmation of the correct setup of this
implementation. The examples only test interface channels which are provided by the
base functionality of the I/O module. Please note that the examples provided have
been color coded. The green colored subsystem (FPGA domain) is the part of the model
which is actually compiled using HDL Coder and ultimately runs on the FPGA. The FPGA
domain usually has a sample frequency in the range of 100 MHz and is set in the
HDL Workflow Advisor (FPGA Synthesis Software
Settings). The blue blocks (CPU domain) which surround the green
subsystem are interfaces to the processor section of the model. The CPU domain
usually has a sample frequency in the range of 1 kHz. The interrupt subsystem
has been given another color (magenta), as its functionality is asynchronous to both
the processor and FPGA. The interrupt source can be selected in the generated model
in the Interrupt Setup block once the model has been run through the HDL Coder
The DMA engine is well suited to enabling the FPGA and the CPU to complement one
another by acting in a coprocessing capacity. The engine is used to transfer data
back and forth between the CPU and FPGA. There are several methods which can be used
to sequence the DMA transfers, but in general, the data transfer is either
uncontrolled, in which case the model step size must be long enough to
ensure the completion of a transaction, or controlled, in which case
model execution is kept in lock-step using interrupts that signal completion.
Uncontrolled execution ensures deterministic model time steps, at the cost of
a reduced data rate.
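The step-size requirement for uncontrolled transfers can be illustrated with a small budget check. This is only a sketch: the function names, the 20 % safety margin, and the timing figures are hypothetical and are not taken from the example model or from measurements.

```python
def min_safe_step(write_s, fpga_s, read_s, margin=0.2):
    """Smallest fixed step (seconds) that covers one DMA round trip plus headroom.

    write_s, fpga_s, read_s are assumed worst-case durations of the DMA write,
    the FPGA processing, and the DMA read. The margin is an arbitrary assumption.
    """
    if min(write_s, fpga_s, read_s) < 0 or margin < 0:
        raise ValueError("durations and margin must be non-negative")
    return (write_s + fpga_s + read_s) * (1.0 + margin)


def step_is_long_enough(step_s, write_s, fpga_s, read_s, margin=0.2):
    """Return True if a fixed model step leaves room for the whole transaction."""
    return step_s >= min_safe_step(write_s, fpga_s, read_s, margin)


# Hypothetical worst-case timings (not measured values).
budget = min_safe_step(write_s=40e-6, fpga_s=15e-6, read_s=45e-6)
ok_1khz = step_is_long_enough(1e-3, 40e-6, 15e-6, 45e-6)   # 1 ms step
ok_10khz = step_is_long_enough(1e-4, 40e-6, 15e-6, 45e-6)  # 100 us step
```

With these assumed numbers, a 1 ms step leaves ample room for the transaction, while a 100 us step does not.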
DMA coprocessor mode
Uncontrolled DMA transfers with deterministic execution times
Uncontrolled DMA transfers with deterministic execution times and IRQ for
Coprocessor modes vary only in the manner in which the model execution is
driven forward. In uncontrolled execution, the model executes with a fixed model step
rate, and it is the responsibility of the modeler to ensure either that the step
size is long enough for both the read and the write execution to complete,
or that it is acceptable for new results not to be available at each data step. To
keep data read and write operations aligned, the write task must
complete before the read task is enabled. The other variation is to use the
interrupt (IRQ) of the DMA read task to trigger the next execution of the model. This
way, the model runs as fast as possible, enabling the maximum data throughput.
However, as DMA transfers are not a deterministic process, there will be
considerable jitter in the model execution times.
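The trade-off between the two ways of driving the model can be illustrated with a toy timing simulation. This is a sketch under stated assumptions: the round-trip model (80 us base plus up to 60 us of random delay), the 200 us fixed step, and all names are hypothetical and do not describe the real DMA engine.

```python
import random
import statistics


def simulate(n_steps, fixed_step_s, seed=0):
    """Compare fixed-step and interrupt-driven execution on synthetic timings."""
    rng = random.Random(seed)
    # Hypothetical DMA round trip: 80 us base plus up to 60 us of random delay.
    round_trips = [80e-6 + rng.uniform(0.0, 60e-6) for _ in range(n_steps)]

    # Uncontrolled: every step lasts exactly fixed_step_s, whatever the transfer took.
    fixed_steps = [fixed_step_s] * n_steps
    # A result is missed when the transfer did not finish within the step.
    missed = sum(1 for t in round_trips if t > fixed_step_s)

    # Interrupt-driven: the next step starts as soon as the read completes.
    irq_steps = round_trips

    return {
        "fixed_jitter": statistics.pstdev(fixed_steps),
        "irq_jitter": statistics.pstdev(irq_steps),
        "fixed_rate": 1.0 / fixed_step_s,
        "irq_rate": n_steps / sum(irq_steps),
        "missed": missed,
    }


result = simulate(n_steps=1000, fixed_step_s=200e-6)
```

In this toy model the fixed-step run shows no step-time jitter but a lower step rate, while the interrupt-driven run achieves a higher rate with non-zero jitter.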
Calculation models a simple square root
calculation performed on a stream of floating-point data values.
FIFO Buffer handles the brief buffering of the
data results to compensate for any delays in initializing the DMA transfer out of
the design under test (DUT). It is worth noting that the data values are a sequence
of scalars in the FPGA, since they come in as a stream, but they are aggregated into
a vector by the DMA engine for use in the CPU domain.
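The relationship between the scalar stream in the FPGA and the vector seen in the CPU domain can be sketched in software. The following is an illustrative model only: the function names and the frame size of 3 are hypothetical, and the real buffering is done in FPGA logic and by the DMA engine, not in Python.

```python
import math
from collections import deque


def fpga_stream_sqrt(samples):
    """Yield one square root per incoming scalar, as a streaming DUT would."""
    for x in samples:
        yield math.sqrt(x)


def aggregate_frames(stream, frame_size):
    """Buffer scalars in a FIFO and emit a vector once frame_size values are queued.

    Returns the completed frames and any values still waiting in the FIFO.
    """
    fifo = deque()
    frames = []
    for value in stream:
        fifo.append(value)
        if len(fifo) == frame_size:
            # Drain the FIFO in arrival order to form one frame.
            frames.append([fifo.popleft() for _ in range(frame_size)])
    return frames, list(fifo)


frames, leftover = aggregate_frames(
    fpga_stream_sqrt([0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0]), frame_size=3
)
# frames   -> [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]
# leftover -> [6.0]
```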
The example includes several variations of implementations of coprocessor DMA
engine handling, all in the same model, using variant subsystems (Variant Subsystem).
Variant subsystems allow you to provide multiple implementations for a subsystem
where only one implementation is active during simulation. You can programmatically
swap out the active implementation and replace it with one of the other
implementations without modifying the model. The provided script, when executed,
first runs a simulation of the data transfer and co-processing, then runs through all
four design variants provided. Each type of coprocessor execution includes a variant
in which the CPU and the FPGA perform the same calculation (a square root operation)
on a vector of values. There is also a variant where the calculation is performed
exclusively on the FPGA. After all the variants are built and executed, a plot is
generated which shows the agreement between the CPU and FPGA calculations. Finally,
a plot is generated which shows the average total execution time for each design
variant. This allows the user to examine the effect of IRQ vs. standard execution.
Additionally, the computation time saved by performing the square root operation on
the FPGA instead of the CPU can be seen in this second plot.
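The kind of agreement check behind the first plot can be sketched as follows. This is an assumption-laden illustration: it models the FPGA result as single precision and the CPU result as double precision, which the source does not state, and the tolerance of 1e-6 and all names are hypothetical.

```python
import math
import struct


def to_single(x):
    """Round a Python float to IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def cpu_sqrt(values):
    """Reference square root in double precision."""
    return [math.sqrt(v) for v in values]


def fpga_sqrt_model(values):
    """Assumed FPGA behavior: single-precision input and output."""
    return [to_single(math.sqrt(to_single(v))) for v in values]


def max_abs_error(a, b):
    """Largest element-wise absolute difference between two result vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


values = [0.5 * k for k in range(1, 65)]  # hypothetical test vector, 0.5 to 32.0
err = max_abs_error(cpu_sqrt(values), fpga_sqrt_model(values))
agree = err < 1e-6  # hypothetical tolerance
```

A small but non-zero difference is expected under this assumption, since the two results are rounded to different precisions.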
Open the example model by navigating to the
folder containing the "*.slx" model file and double-clicking the file. If the
example is provided as a Simulink Project, navigate to the corresponding example
folder and extract the Simulink project zip file. Then double-click the "*.prj" icon
to open the project. After opening the project, open the model by double-clicking
the "*.slx" file. The model is shown as follows:
The design under test (DUT) subsystem is shown as follows:
The simulation is run automatically when the script executes. As specific changes
must be written to the data dictionary, it is best to run the simulation by running
Target CPU driver
In the CPU domain of the model, the system labelled DMA_Engine contains all the
DMA structural variants. The subsystem enabled is determined by a setting in the
data dictionary. This setting will be automatically changed as required by the run
script. Each of these subsystem variants (except for the simulation) contains the DMA
read and the DMA write blocks. The size of the DMA transfer is saved in the
frameSize variable, defined in the IO3xx_coprocessor.sldd data dictionary.
Running HDL Workflow Advisor
Before the example can be deployed and run on the real-time target machine, you
must run through the HDL Coder Workflow Advisor steps to generate
HDL code and an FPGA bitstream using HDL Coder (FPGA Synthesis Software
New: Reference design parameters, set at step 1.2,
now control which interfaces are available to target in step 1.3 of the
workflow. This has reduced both the total number of reference designs and the list of
interfaces available. Please remember to select the front plug-in and rear plug-in
setting that is appropriate for your module, as well as the Aurora settings that
should be used for your model (if applicable). These additional reference design
parameter settings are further described in the interface sections for which they
New: Prior to running the HDL Workflow Advisor, be
sure to double-click the Select Module block in the demo model. If one or more of
your modules support the model (due to available interface compatibility), a pop-up
is displayed prompting you to select the module you would like to target. If only a
single module is installed and it is compatible, it is automatically
selected when the block is double-clicked.
Upon completion, a newly generated model containing the Simulink Real-Time
interface subsystem appears. At first sight, this subsystem resembles the FPGA
subsystem. However, inside, the Simulink algorithm has been removed and replaced
with blocks that the real-time application will use to communicate with the FPGA
during simulation execution. The newly generated model is now ready to be deployed
to a real-time target machine. To download the FPGA bitstream and the Simulink model
to the target, click the Build Model button on the
Simulink Editor toolbar. The real-time application loads on the Speedgoat target
machine and the FPGA algorithm bitstream loads on the FPGA. If you are using I/O
lines, check that you have connected the lines to the external hardware under test.
Please note that some example models have Global Delay
Balancing intentionally disabled. If an error about
delay balancing is displayed in step 2.3 of the HDL Coder Workflow Advisor, it can be safely
ignored by checking the Ignore warnings checkbox.
Running the Example
To run the generated model, simply run the example script IO3xx_coprocessor_run.m.
The script configures, builds, and runs the model through all of the design variants.
Finally, it produces plots which show the results of the execution in the various
design variants, as well as the effect of various coprocessor schemas on the total
Data results from the various design variants
Design variant average task execution times (TETs)