Unlocking FPGA Performance with High-Level Synthesis

Field-Programmable Gate Arrays (FPGAs) have traditionally demanded deep expertise in hardware description languages (HDLs) like VHDL and Verilog. High-Level Synthesis (HLS) flips that model, allowing developers to write algorithms in C, C++, or SystemC and automatically generate RTL code. This shift makes FPGA development accessible to software engineers while shortening iteration cycles from concept to working hardware. Used well, HLS tools can deliver productivity gains of 10× or more, and with careful optimization the resulting performance and resource utilization can approach those of hand-coded HDL. This guide covers the essential techniques to make HLS work for your next FPGA project, including a detailed walkthrough of a FIR filter example.

What Is High-Level Synthesis?

High-level synthesis is a compilation process that converts an untimed behavioral description—typically in C/C++—into a timed hardware implementation. Unlike software compilers that target a fixed instruction set, HLS must schedule operations into clock cycles, allocate functional units, bind operations to specific hardware resources, and generate a finite-state machine with datapath. The process accounts for the FPGA’s logic blocks, DSP slices, and memory architecture, guided by user-specified timing constraints and optimization directives.

The critical advantage is abstraction: loops, arrays, and function calls are synthesized directly without manually crafting state machines or pipelining datapaths. The tool infers parallelism, generates interface protocols, and optimizes resource sharing. For example, the same C function can map to an AXI4-Stream interface, a memory-mapped AXI4 slave, or both, simply by changing pragmas. This makes HLS particularly valuable for video processing, machine learning inference, digital signal processing, and network packet processing, where algorithm refinement is rapid and hardware performance is non-negotiable. By raising the abstraction level, HLS enables more thorough design-space exploration early in the development cycle, reducing the risk of late-stage rework.

Choosing the Right HLS Tool

Several mature HLS tools are available, each tightly integrated with a vendor ecosystem or offered by third-party EDA companies. Selection often depends on target device family and design complexity.

  • AMD Vitis HLS (formerly Vivado HLS): The flagship for AMD Xilinx devices, supporting C and C++ synthesis (OpenCL kernel support dates from earlier releases and has since been deprecated). It generates RTL that plugs directly into Vivado IP integrator and works with the Vitis unified software platform for accelerated applications. More details are available on the AMD Vitis HLS product page.
  • Intel High-Level Synthesis Compiler (HLS Compiler): Integrated into Intel Quartus Prime, this tool synthesizes C++ for Intel Agilex, Stratix, and Arria FPGAs. It excels at datapath-intensive designs and supports task parallelism and fine-grained loop pipelining. Reference materials are on the Intel HLS Compiler page.
  • Siemens Catapult HLS: A vendor-agnostic tool that synthesizes from SystemC or C++ for both ASIC and FPGA targets. It is widely used in aerospace and automotive applications and offers formal equivalence checking, making it suitable for safety-critical systems.
  • Open-Source Options: The Bambu HLS tool from Politecnico di Milano is an actively maintained open-source framework that accepts standard C and generates Verilog. While not as performance-driven as vendor tools, it is excellent for teaching and research, and it supports flexible exploration of HLS algorithms.

Each tool has its own pragma syntax and optimization philosophy, but the core HLS concepts remain consistent. The examples in this article use AMD Vitis HLS pragma syntax, but the underlying techniques apply broadly across platforms.

The HLS-Based Design Flow

Adopting HLS means shifting from an RTL-centric workflow to a software-like cycle of coding, simulation, and incremental refinement. The following steps outline a complete flow from algorithm to bitstream.

Step 1: Algorithm Specification and C-Level Validation

Start by implementing your algorithm entirely in C or C++ as a "golden model." This model should be bit-accurate and self-checking, with test vectors that cover all corner cases. Because HLS synthesis is sensitive to coding style, separate the synthesizable functionality from non-synthesizable test harness code, typically by placing the core algorithm in a dedicated function. Avoid dynamic memory allocation, recursion, and system calls inside synthesizable code. Use fixed-size arrays, fixed-point data types where needed, and compile-time loop bounds. Pay special attention to data types: prefer integer types or the arbitrary-precision ap_int and ap_fixed types from the HLS library over float or double unless floating point is genuinely required, as floating-point arithmetic imposes heavy resource costs.

Validate the golden model with standard C compilation and simulation (e.g., using GCC or MSVC). This catches algorithmic errors early, long before hardware simulation begins. The HLS tool will later use the same testbench for C/RTL co-simulation, so investing effort here pays off handsomely. Consider adding randomized testing to stress the model.
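A minimal sketch of this split is shown below, assuming a hypothetical four-sample moving-sum kernel. The names moving_sum4 and run_testbench, and the sizes N and TAPS, are illustrative and not taken from any tool. Plain integer types are used so the code builds with an ordinary C++ compiler; in an HLS project the types could be swapped for ap_int or ap_fixed.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t N = 8;     // hypothetical frame length
constexpr std::size_t TAPS = 4;  // hypothetical window size

// Synthesizable core: fixed-size arrays, compile-time loop bounds,
// no dynamic allocation, recursion, or system calls.
void moving_sum4(const int16_t in[N], int32_t out[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        int32_t acc = 0;
        for (std::size_t k = 0; k < TAPS; ++k) {
            // Samples before the start of the frame are treated as zero.
            if (i >= k) acc += in[i - k];
        }
        out[i] = acc;
    }
}

// Non-synthesizable test harness. It returns the number of mismatches,
// so a testbench main() can return this value (0 means pass).
int run_testbench() {
    const int16_t in[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    const int32_t expected[N] = {1, 3, 6, 10, 14, 18, 22, 26};
    int32_t out[N] = {0};

    moving_sum4(in, out);

    int errors = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (out[i] != expected[i]) ++errors;
    }
    return errors;
}
```

Keeping the check in a function that returns an error count makes it easy to reuse the same harness for C simulation and, later, for co-simulation.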

Step 2: Tool Configuration and Target Specification

Create a new HLS project in your chosen tool (Vitis HLS, Intel HLS Compiler, etc.). You must define:

  • The top function to synthesize.
  • The target FPGA part or board, which determines available resources, clock frequency, and device architecture.
  • The clock period constraint, typically in nanoseconds. This drives scheduling and pipelining decisions.
  • Simulation settings and, for Vitis HLS, whether to use C simulation or co-simulation with an external RTL simulator.

Proper configuration ensures the tool's optimizations align with physical timing capabilities. A common mistake is setting an overly optimistic clock period, which leads to timing failures during implementation. Start with a conservative target (e.g., 10 ns / 100 MHz) and tighten gradually after reviewing scheduling reports.

Step 3: Code Optimization Using Pragmas and Directives

Pragmas are the primary mechanism for guiding the HLS tool. Without them, the tool synthesizes a safe but under-optimized design—sequential loops, fully shared resources, minimal parallelism. Key optimization directives include:

  • Loop pipelining: #pragma HLS PIPELINE causes loop iterations to overlap, initiating a new iteration every II (initiation interval) cycles. An II=1 pipeline delivers one result per clock cycle after initial latency, maximizing throughput.
  • Loop unrolling: #pragma HLS UNROLL replicates loop bodies to execute multiple iterations in parallel, exchanging area for performance. Partial unrolling balances resource usage.
  • Array partitioning and reshaping: #pragma HLS ARRAY_PARTITION splits arrays into smaller memory banks for parallel access. ARRAY_RESHAPE combines split data into a wider single memory word.
  • Function inlining: #pragma HLS INLINE merges function hierarchies, giving the tool more scope for cross-boundary optimization.
  • Interface pragmas: Specify how the top function connects—ap_ctrl_none for streaming, s_axilite for a memory-mapped control interface, m_axi for external DDR memory access, etc.
  • Dataflow: #pragma HLS DATAFLOW enables task-level parallelism, allowing a sequence of functions or loops to run concurrently as a pipeline with streaming channels.
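  • Resource allocation: #pragma HLS ALLOCATION limits how many instances of an operator or function the tool may create, trading area against possible serialization. Binding an operation or array to a specific implementation (for example, a DSP-based multiplier or a particular RAM type) is done with BIND_OP and BIND_STORAGE in recent Vitis HLS releases, which replace the older RESOURCE directive.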

Well-chosen pragmas can mean the difference between a design that misses its throughput target and one that meets it with resources to spare. The optimization process is iterative: apply directives, synthesize, inspect performance and utilization reports, and refine. Keep a log of which pragmas were tried and their effect on area and latency.
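As an illustration of how several of these directives combine, here is a hedged sketch of a dot product. The function name dot64, the length LEN, and the factor of 4 are assumptions chosen for the example. A standard C++ compiler ignores the HLS pragmas, so the function can be checked in software first; the throughput actually achieved depends on the tool version and target device.

```cpp
#include <cstdint>

constexpr int LEN = 64;  // hypothetical vector length

int32_t dot64(const int16_t a[LEN], const int16_t b[LEN]) {
    // Local copies so the partitioning applies to on-chip memory.
    int16_t a_buf[LEN];
    int16_t b_buf[LEN];
    // Four cyclic banks: elements i, i+1, i+2, i+3 land in different
    // banks, matching the unroll factor used below.
#pragma HLS ARRAY_PARTITION variable=a_buf cyclic factor=4 dim=1
#pragma HLS ARRAY_PARTITION variable=b_buf cyclic factor=4 dim=1

    for (int i = 0; i < LEN; ++i) {
#pragma HLS PIPELINE II=1
        a_buf[i] = a[i];
        b_buf[i] = b[i];
    }

    int32_t acc = 0;
    for (int i = 0; i < LEN; ++i) {
        // Unroll by 4, then pipeline the resulting 16-iteration loop.
#pragma HLS UNROLL factor=4
#pragma HLS PIPELINE II=1
        acc += static_cast<int32_t>(a_buf[i]) * b_buf[i];
    }
    return acc;
}
```

The partition factor and unroll factor are kept equal on purpose: unrolling without matching memory ports usually leaves the extra multipliers waiting on array reads.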

Step 4: Synthesis and Analysis

Run HLS synthesis to produce RTL code and comprehensive reports. The most important report is the performance profile, showing each loop’s latency, initiation interval, and pipeline depth. The resource utilization report breaks down LUTs, flip-flops, DSPs, and block RAM usage. Cross-reference these with your target device’s capacity and clock constraint.
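The relationship between the reported numbers can be sketched with a small helper, assuming the usual model in which the first iteration takes the full pipeline depth and each later iteration starts II cycles after the previous one. The function names are hypothetical, and real reports may add a few cycles for loop entry and exit.

```cpp
#include <cstdint>

// Cycles for a pipelined loop: the first iteration takes `depth` cycles,
// and each later iteration starts `ii` cycles after the previous one.
constexpr std::uint64_t pipelined_loop_cycles(std::uint64_t trip_count,
                                              std::uint64_t ii,
                                              std::uint64_t depth) {
    return trip_count == 0 ? 0 : depth + ii * (trip_count - 1);
}

// Cycles for the same loop without pipelining: iterations run back to back.
constexpr std::uint64_t sequential_loop_cycles(std::uint64_t trip_count,
                                               std::uint64_t depth) {
    return trip_count * depth;
}
```

For example, a 1024-iteration loop with a depth of 5 takes about 1028 cycles at II=1 and about 2051 cycles at II=2, compared with 5120 cycles when not pipelined. This is why the initiation interval usually matters more than the pipeline depth for long loops.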

Modern HLS tools also generate a schedule viewer (a Gantt chart) and a binding map, helping you visualize how operations are distributed across clock cycles and functional units. If the achieved initiation interval or latency is higher than desired, look for "loop-carried dependencies" or memory port conflicts flagged in the report. Often a subtle C construct, such as a floating-point accumulator that depends on its previous value through a multi-cycle adder, prevents achieving II=1 without recoding or array partitioning. Use the schedule viewer to pinpoint stalls.
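One common recoding for an accumulator dependency is to spread the sum over several independent partial accumulators, sketched below. The names sum_naive and sum_interleaved are hypothetical, and LANES is an assumed value that should be at least the latency of the floating-point adder on the target device. Whether the tool then reaches II=1 depends on its dependence analysis, so the schedule report should still be checked.

```cpp
constexpr int N_SAMPLES = 64;  // hypothetical length
constexpr int LANES = 8;       // assumed >= floating-point adder latency in cycles

// Straightforward form: every addition waits for the previous one,
// so the loop-carried dependency on `acc` limits the achievable II.
float sum_naive(const float x[N_SAMPLES]) {
    float acc = 0.0f;
    for (int i = 0; i < N_SAMPLES; ++i) {
#pragma HLS PIPELINE II=1
        acc += x[i];
    }
    return acc;
}

// Recoded form: consecutive iterations update different accumulators,
// so each accumulator is revisited only every LANES iterations.
float sum_interleaved(const float x[N_SAMPLES]) {
    float partial[LANES];
#pragma HLS ARRAY_PARTITION variable=partial complete dim=1
    for (int l = 0; l < LANES; ++l) {
#pragma HLS UNROLL
        partial[l] = 0.0f;
    }
    for (int i = 0; i < N_SAMPLES; ++i) {
#pragma HLS PIPELINE II=1
        partial[i % LANES] += x[i];
    }
    // Final reduction of the partial sums.
    float acc = 0.0f;
    for (int l = 0; l < LANES; ++l) {
#pragma HLS UNROLL
        acc += partial[l];
    }
    return acc;
}
```

Reordering floating-point additions can change rounding, so the two functions may differ in the last bits for general inputs. The testbench tolerance should account for that.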

Step 5: C/RTL Co-Simulation

Before integrating the generated RTL into a larger FPGA design, verify functional equivalence through co-simulation. The tool wraps the generated RTL so that the original C testbench can drive it through an RTL simulator, either the one shipped with the vendor tools (e.g., Vivado Simulator) or a supported third-party simulator (e.g., Xcelium or ModelSim). It applies the same input vectors, and the testbench checks the RTL outputs against the expected results. Because the C model is untimed, this is a comparison of output values, not a cycle-by-cycle match. Co-simulation also reports measured latency and exposes problems that C simulation can hide, such as reads of uninitialized variables, out-of-bounds array accesses, and stalls or deadlocks on streaming interfaces.

If mismatches occur, inspect the waveform or transaction log. Adjust the C source or pragmas (e.g., interface settings or stream depths) until the RTL outputs match the golden model. It is good practice to run co-simulation on small sub-functions before scaling to the full design, reducing debug iterations.

Step 6: Export IP and Integrate into the FPGA Design Flow

Once verified, export the design as a packaged IP core, typically a Vivado IP catalog package (described with IP-XACT) or an Intel Platform Designer (formerly Qsys) component. This IP block can then be instantiated in a block design (e.g., Vivado IP Integrator) alongside other RTL modules, soft processors, or memory controllers. The HLS-generated IP includes timing constraints and is ready for placement and routing.

In the traditional FPGA flow, you then run synthesis and implementation (place-and-route) to generate the final bitstream. Monitor implementation timing reports carefully. HLS tools estimate timing from pre-placement models; real placement may reveal longer routing delays, requiring you to relax the target clock or revisit the HLS constraints. Two failure modes are worth separating: if a loop's target II cannot be scheduled, HLS reports a violation and falls back to a larger II; if the clock target cannot be met after routing, the design fails timing in implementation. Feed both results back into the HLS constraints. Budget extra slack (10–20%) during HLS to account for physical effects.

Practical Example: Implementing a FIR Filter with HLS

To solidify these concepts, consider a finite impulse response (FIR) filter—a common digital signal processing building block. The C code below implements a 16-tap FIR filter with fixed-point coefficients. We’ll apply pragmas to achieve high throughput on an AMD Xilinx FPGA.

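#include <ap_fixed.h>
#include <hls_stream.h>

typedef ap_fixed<16,8> data_t;
typedef ap_fixed<16,8> coeff_t;
// Wide accumulator: a 16x16-bit product needs 32 bits (16 integer bits),
// and summing 16 products adds 4 bits of growth.
typedef ap_fixed<36,20> acc_t;

// Processes one input sample per call and writes one filtered output sample.
void fir(hls::stream<data_t> &in, hls::stream<data_t> &out, coeff_t coeffs[16]) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=coeffs
    // Delay line; static so its contents persist between calls.
    static data_t shift_reg[16];
#pragma HLS ARRAY_PARTITION variable=shift_reg complete dim=1
    acc_t acc = 0;
    // Shift the delay line and accumulate taps 15..1, one tap per iteration.
    ShiftLoop:
    for (int i = 15; i > 0; --i) {
#pragma HLS PIPELINE II=1
        shift_reg[i] = shift_reg[i-1];
        acc += shift_reg[i] * coeffs[i];
    }
    // Insert the newest sample and add tap 0.
    shift_reg[0] = in.read();
    acc += shift_reg[0] * coeffs[0];
    // Convert back to the 16-bit output format (default truncation and wrap-around).
    out.write(data_t(acc));
}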

Key pragmas in this example:

  • INTERFACE axis: Uses AXI4-Stream for input and output, ideal for continuous data flow.
  • ARRAY_PARTITION complete: Splits the shift register into individual registers, enabling parallel access to all taps.
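  • PIPELINE II=1: Placed inside ShiftLoop, it pipelines the loop so that a new tap (loop iteration) starts every clock cycle. Each call to fir therefore takes roughly 16 cycles plus pipeline latency per sample. Sustaining one sample per clock would instead require pipelining the whole function with the loop fully unrolled, and coeffs made accessible in parallel as well.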

After synthesis, check the reports: the shift loop should achieve II=1. Because the loop processes one tap per cycle, expect only one or two DSP-based multipliers rather than 16; a fully unrolled, function-pipelined variant would use about 16 multipliers in exchange for one sample per clock. This design is then exported as an IP core and integrated into a larger system, for example connected to an AXI DMA to stream data from a sensor. This example demonstrates how a few pragmas translate a straightforward C function into a hardware accelerator.
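For the testbench side of this example, a double-precision reference model is a convenient golden reference to compare the fixed-point fir function against. The sketch below is an assumption of how such a model might look; FirRef and NUM_TAPS are hypothetical names, and the comparison tolerance against the fixed-point output is left to the testbench.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t NUM_TAPS = 16;

// Hypothetical golden model of a 16-tap FIR filter in double precision.
struct FirRef {
    std::array<double, NUM_TAPS> coeffs{};
    std::array<double, NUM_TAPS> delay{};  // delay[0] holds the newest sample

    // Consumes one input sample and returns one output sample.
    double step(double sample) {
        for (std::size_t i = NUM_TAPS - 1; i > 0; --i) {
            delay[i] = delay[i - 1];
        }
        delay[0] = sample;

        double acc = 0.0;
        for (std::size_t i = 0; i < NUM_TAPS; ++i) {
            acc += delay[i] * coeffs[i];
        }
        return acc;
    }
};
```

A quick sanity check for any FIR implementation is the impulse response: feeding a single 1.0 followed by zeros should return the coefficients in order, followed by zeros.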

Optimization Strategies for Performance and Area

Effective HLS requires balancing throughput, latency, and resource consumption. Several patterns recur in successful designs.

  • Prefer fixed-point arithmetic: Floating-point operations consume significant resources and limit frequency. Unless dynamic range is critical, use fixed-point types (e.g., ap_fixed in Vitis HLS) to reduce DSP and LUT counts while preserving precision.
  • Stream data instead of random memory access: Hardware is most efficient when data flows through a pipeline. Use hls::stream or similar streaming constructs to connect tasks, avoiding large shared memories that lead to arbitration and buffer stalls.
  • Write perfect loop nests: In a perfect nest, all statements sit in the innermost loop and the loop bounds are constant, which lets the tool flatten the nest and pipeline it as a single loop. Keep loop-carried dependencies to patterns the tool handles well (e.g., reductions). For convolution or matrix multiply, consider local memory buffering and tiling to exploit data reuse.
  • Use template metaprogramming for configurability: C++ templates allow compile-time parameterization of array sizes and data widths, making the same HLS source reusable across devices without performance loss.
  • Balance resource sharing and latency: The #pragma HLS ALLOCATION directive can force sharing of expensive operators like dividers. However, over-sharing may serialize operations and increase latency; weigh against pipeline performance.
  • Leverage bit-accurate types wisely: Using tightly-typed fixed-point representations minimizes hardware cost. For example, ap_ufixed<8,0> holds pixel intensities normalized to [0, 1) in 8 bits, while ap_uint<8> suits raw 0–255 values. Always profile quantization error against algorithmic tolerance.
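Quantization error can be profiled in plain C++ before committing to a bit width. The sketch below emulates truncation toward minus infinity, which is the default quantization mode of ap_fixed; overflow handling is deliberately not modeled, and the function names are hypothetical.

```cpp
#include <cmath>

// Emulates truncation to a fixed-point grid with `frac_bits` fractional
// bits (rounding toward minus infinity). Overflow is not modeled.
double quantize_trunc(double x, int frac_bits) {
    const double scale = std::ldexp(1.0, frac_bits);  // 2^frac_bits
    return std::floor(x * scale) / scale;
}

// Largest absolute quantization error over a set of samples.
double max_quant_error(const double* samples, int count, int frac_bits) {
    double worst = 0.0;
    for (int i = 0; i < count; ++i) {
        const double err =
            std::fabs(samples[i] - quantize_trunc(samples[i], frac_bits));
        if (err > worst) worst = err;
    }
    return worst;
}
```

With 8 fractional bits, the value 0.3 quantizes to 0.296875, and the worst-case truncation error stays below one step (1/256). Comparing that bound with the algorithm's tolerance indicates whether more fractional bits are needed.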

HLS tools also offer “solution” directories where you can maintain multiple optimization sets (e.g., “low area,” “high throughput”) and compare them. This is invaluable for exploring the design space without losing earlier results.
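The streaming pattern recommended in the list above can be sketched as a three-stage load, compute, store chain. To keep the example buildable without vendor headers, it uses a hypothetical Stream class that mirrors the read(), write(), and empty() methods of hls::stream; inside the HLS tool, hls::stream from hls_stream.h would be used instead. The stage names and FRAME length are also assumptions. In software the stages run one after another, while under DATAFLOW the tool is expected to overlap them.

```cpp
#include <cstdint>
#include <queue>

// Hypothetical off-tool stand-in for hls::stream<T>.
template <typename T>
class Stream {
public:
    void write(const T& v) { q_.push(v); }
    T read() { T v = q_.front(); q_.pop(); return v; }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};

constexpr int FRAME = 16;  // hypothetical frame length

// Stage 1: move input data into a stream.
static void load(const int32_t in[FRAME], Stream<int32_t>& s) {
    for (int i = 0; i < FRAME; ++i) {
#pragma HLS PIPELINE II=1
        s.write(in[i]);
    }
}

// Stage 2: per-sample computation, stream in and stream out.
static void scale(Stream<int32_t>& in, Stream<int32_t>& out, int32_t gain) {
    for (int i = 0; i < FRAME; ++i) {
#pragma HLS PIPELINE II=1
        out.write(in.read() * gain);
    }
}

// Stage 3: drain the stream into the output array.
static void store(Stream<int32_t>& s, int32_t out[FRAME]) {
    for (int i = 0; i < FRAME; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = s.read();
    }
}

// Top function: each stream has exactly one producer and one consumer.
void scale_frame(const int32_t in[FRAME], int32_t out[FRAME], int32_t gain) {
#pragma HLS DATAFLOW
    Stream<int32_t> a;
    Stream<int32_t> b;
    load(in, a);
    scale(a, b, gain);
    store(b, out);
}
```

Keeping every channel single-producer and single-consumer, with each stage reading and writing the same number of elements, avoids the most common causes of dataflow stalls.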

Debugging and Verification Best Practices

Debugging HLS designs differs from both software and RTL debugging. Because the source code is C++, traditional debuggers can validate functionality but cannot reveal hardware parallelism or timing bugs. The following practices reduce pain:

  • Maintain a cycle-approximate pure C++ model that uses the same interface protocols (e.g., streaming) so that you can simulate fast.
  • Implement self-checking testbenches with randomized input generation and golden reference outputs.
  • Use the HLS tool’s log and pragma warnings aggressively. Treat non-synthesizable constructs or sub-optimal loop structures as errors.
  • Start co-simulation early on a small sub-module before scaling to the full design. This isolates synthesis issues quickly.
  • Use the HLS tool’s built-in performance analysis to view initiation interval bottlenecks before running long RTL simulations.
  • Inspect the generated RTL code for unexpected structures: for example, large multiplexers often indicate overly complex conditional branches. Simplify conditionals by flattening nested if statements where possible.
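A randomized, self-checking comparison along these lines might look like the following sketch. The saturating 8-bit adder is only a stand-in for a real design, and the names sat_add_ref, sat_add_dut, and random_compare are hypothetical. A fixed seed keeps any failure reproducible.

```cpp
#include <cstdint>
#include <random>

// Golden reference: the obvious, easy-to-review implementation.
uint8_t sat_add_ref(uint8_t a, uint8_t b) {
    const int sum = static_cast<int>(a) + static_cast<int>(b);
    return sum > 255 ? 255 : static_cast<uint8_t>(sum);
}

// Stand-in for the design under test: hardware-style version that
// saturates on the carry-out of a 9-bit addition.
uint8_t sat_add_dut(uint8_t a, uint8_t b) {
    const uint16_t wide = static_cast<uint16_t>(a) + static_cast<uint16_t>(b);
    const bool carry = ((wide >> 8) & 1u) != 0;
    return carry ? 0xFF : static_cast<uint8_t>(wide & 0xFF);
}

// Returns the number of mismatches between the two implementations.
int random_compare(unsigned seed, int trials) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, 255);
    int mismatches = 0;
    for (int t = 0; t < trials; ++t) {
        const uint8_t a = static_cast<uint8_t>(dist(rng));
        const uint8_t b = static_cast<uint8_t>(dist(rng));
        if (sat_add_dut(a, b) != sat_add_ref(a, b)) ++mismatches;
    }
    // Directed corner cases complement the random vectors.
    const uint8_t corners[3] = {0, 1, 255};
    for (uint8_t a : corners) {
        for (uint8_t b : corners) {
            if (sat_add_dut(a, b) != sat_add_ref(a, b)) ++mismatches;
        }
    }
    return mismatches;
}
```

Returning the mismatch count from the testbench's main function lets both C simulation and co-simulation report pass or fail from the same code.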

Common Pitfalls and How to Avoid Them

Even experienced engineers encounter repeat issues when moving to HLS. Recognizing them upfront smooths the transition.

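  • Unbounded loops: Loops whose trip counts are not known at compile time can still be synthesized, but the tool cannot fully unroll them or report a definite latency. Where possible, iterate up to a compile-time maximum and exit or mask early. #pragma HLS LOOP_TRIPCOUNT only supplies min, max, and average values for the latency reports; it does not change the generated hardware.
  • Large memory interfaces with poor bandwidth: Moving large data arrays through an AXI4-Lite interface will bottleneck performance, because AXI4-Lite is a register-style control interface with no burst support. For high throughput, use AXI4-Stream or an AXI4 master (m_axi) with a suitably wide data port and burst access, controlled by appropriate pragmas.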
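  • Ignoring reset and initialization: HLS tools may reset only part of the design by default (often just the control logic), leaving static variables and arrays to rely on their power-up initial values. Ensure you have a clean reset strategy, avoid reading uninitialized local arrays that may infer uninitialized RAMs, and use #pragma HLS RESET where a variable must return to its initial value on reset.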
  • Over-relying on tool auto-optimization: While HLS tools are powerful, they cannot guess design intent. A simple handshake protocol might need explicit ap_ctrl interface selection to match expected behavior; relying on defaults can lead to mismatched interfaces.
  • Neglecting real-world timing constraints: The HLS scheduling uses a simple timing model. Physical placement of high-fanout nets or large multiplexers can cause unexpected timing violations. Budget extra slack: constrain HLS with a clock period 10–20% shorter than the period you need after place-and-route, or set an equivalent clock uncertainty.
  • Forgetting to verify pipeline stalls: In a pipelined loop, if the input stream stalls, the pipeline must be able to drain without deadlock. Use backpressure-aware interfaces and verify stall behavior in co-simulation.
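For the unbounded-loop pitfall, one common recoding is to iterate over a compile-time maximum and exit early, as sketched below. The function name sum_first, the bound MAX_LEN, and the trip-count values are assumptions for illustration. The LOOP_TRIPCOUNT values only feed the latency report.

```cpp
#include <cstdint>

constexpr int MAX_LEN = 256;  // hypothetical compile-time upper bound

// Sums the first `len` elements of `data`; `len` is known only at run time.
// Values of `len` above MAX_LEN are effectively clamped by the loop bound.
int32_t sum_first(const int16_t data[MAX_LEN], int len) {
    int32_t acc = 0;
    for (int i = 0; i < MAX_LEN; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS LOOP_TRIPCOUNT min=1 max=256 avg=64
        if (i >= len) break;  // early exit keeps the run-time cost proportional to len
        acc += data[i];
    }
    return acc;
}
```

If a fixed latency matters more than the average case, the break can be replaced by a guarded update (adding data[i] only when i is less than len), so that every call runs all MAX_LEN iterations.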

Integrating HLS with Heterogeneous Systems

Modern FPGA platforms pair programmable logic with hard processor systems (e.g., ARM Cortex cores in Zynq and Agilex SoC devices). HLS fits naturally into these architectures. A common pattern is to use the processor to control and configure an HLS-generated accelerator via AXI4-Lite, while high-bandwidth data moves through AXI4-Stream or AXI4 master ports. The Vitis HLS documentation provides extensive guidance on integrating with the Xilinx Runtime (XRT) and OpenCL APIs. Similarly, Intel's oneAPI toolchain lets SYCL-based C++ kernel code target both CPUs and FPGAs, simplifying development of reconfigurable accelerators.

For real-time control systems, HLS can generate a custom RTL peripheral that interfaces with the processor’s AXI interconnect, handling time-critical I/O while the processor manages policies and network stacks. This division of labor maximizes performance without sacrificing flexibility. When designing such systems, pay attention to data width matching: an AXI4 master with a 64-bit interface may require burst alignment logic in the HLS kernel.
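The control-plus-data split described above might be expressed as in the following sketch of a vector-add kernel, written with Vitis HLS style interface pragmas. The function name vadd, the bundle names, and the CHUNK size are assumptions. The processor would write the pointers and n through the AXI4-Lite register file and start the kernel, while the arrays are accessed over AXI4 master ports. A standard compiler ignores the pragmas, so the function can be checked in software.

```cpp
#include <cstdint>

constexpr int CHUNK = 64;  // hypothetical local buffer size in words

void vadd(const int32_t* a, const int32_t* b, int32_t* out, int n) {
    // Data ports: AXI4 master interfaces to external memory.
#pragma HLS INTERFACE m_axi port=a   offset=slave bundle=gmem0
#pragma HLS INTERFACE m_axi port=b   offset=slave bundle=gmem1
#pragma HLS INTERFACE m_axi port=out offset=slave bundle=gmem0
    // Control: base addresses, length, and start/done via AXI4-Lite.
#pragma HLS INTERFACE s_axilite port=a
#pragma HLS INTERFACE s_axilite port=b
#pragma HLS INTERFACE s_axilite port=out
#pragma HLS INTERFACE s_axilite port=n
#pragma HLS INTERFACE s_axilite port=return

    int32_t a_buf[CHUNK];
    int32_t b_buf[CHUNK];

    for (int base = 0; base < n; base += CHUNK) {
        const int len = (n - base < CHUNK) ? (n - base) : CHUNK;

        // Sequential accesses from one pointer give the tool a chance
        // to infer burst transfers.
        for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
            a_buf[i] = a[base + i];
        }
        for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
            b_buf[i] = b[base + i];
        }
        for (int i = 0; i < len; ++i) {
#pragma HLS PIPELINE II=1
            out[base + i] = a_buf[i] + b_buf[i];
        }
    }
}
```

Copying into local buffers in simple, sequential loops is a design choice that tends to make burst inference more predictable than mixing reads and writes to external memory inside one loop body.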

The Future of High-Level Synthesis

HLS is rapidly evolving, with improvements in compiler heuristics, formal verification, and library ecosystems. Several trends are shaping the road ahead:

  • Machine-learning-guided optimization: Tools are beginning to incorporate ML models that predict good pragma configurations, reducing manual tuning. Research from both academia and industry aims to build "push-button" synthesis that rivals expert-crafted designs.
  • Standardization around C++17 and beyond: As HLS front-ends adopt modern C++ standards, designers can leverage constexpr, lambdas, and template metaprogramming to write highly parameterized, reusable hardware libraries.
  • Closer integration with high-level verification: Universal Verification Methodology (UVM) and SystemC transaction-level modeling are being combined with HLS to create unified design-and-verification flows, reducing the verification bottleneck.
  • Open-source hardware stacks: Projects like the CHIPS Alliance are fostering open HLS frameworks and libraries, making HLS more accessible beyond the major FPGA vendors.
  • Increased support for dynamic reconfiguration: Future HLS flows may allow run-time swapping of kernels, enabling adaptive systems that reconfigure in response to changing workloads.

As FPGA density continues to grow, managing complexity at the RTL level becomes unsustainable. HLS offers a way to manage this complexity by raising the abstraction level while retaining hardware efficiency. Mastering HLS now positions engineers to build the next generation of high-performance, reconfigurable systems, from edge AI accelerators to high-speed networking equipment.