
Guide to Optimization with Forge

Forge provides easy-to-use model optimization capabilities to enhance the performance, efficiency, and deployability of your machine learning models. This guide will help you understand:

  • How to choose the right optimization path for your use case
  • Common optimization steps (calibration and quantization)
  • Path-specific operations (compilation or export)
  • Best practices and considerations for model optimization

Optimization Workflow

The diagram below shows the optimization workflow, including common steps shared between paths:

flowchart TB
A[Trained Model] --> B{Deployment target};
B -->|Android/CPU| C[Relay Module];
B -->|TensorRT/GPU| D[ONNX Module];
C --> F1[Quantize];
D --> F2[Quantize];
C -->|FP32/FP16| G1[Compile];
F1 -->|INT8| G1[Compile];
D -->|FP32/FP16| G2[Export];
F2 -->|INT8| G2[Export];
G1 -->|modelLibrary.so| H[Optimized Model];
G2 -->|model.onnx| H;

Optimization Paths

Forge provides two distinct optimization paths through different Intermediate Representations (IRs). Each path offers unique advantages and is optimized for specific use cases:

Relay Module is a sophisticated IR powered by TVM that enables:

  • Optimizations for CPUs
  • Fine-grained control over memory layouts and computation patterns
  • Advanced optimizations through the .compile() method

ONNX Module is a standardized IR focused on portability that provides:

  • Hardware-agnostic optimizations through ONNX Runtime
  • Excellent support for NVIDIA GPU targets
  • Ideal for rapid prototyping and flexible deployment: optimize once, deploy anywhere

Relay Module Optimization—Compilation

The RelayModule in Forge is optimized using its compile() method, which enables extensive hardware-aware optimizations via TVM. This method is especially recommended for CPU targets or when you need to optimize for a specific hardware platform. The most important argument to provide is the target, which specifies the desired hardware backend (details below). For reference, the method’s type signature and docstring are included here.

RelayModule Compile Method Docstring

forge.RelayModule.compile

compile(target='llvm', host=None, output_path='./compile_output', opt_level=3, set_float16=False, set_channel_layout=None, export_relay=False, export_metadata=False, force_overwrite=False, uuid=None, encrypt_password=None)

Compiles the model for a specified target with various configuration options.

This method compiles the model for a given target, which can be a string or a dictionary specifying the target attributes. The compilation can be customized through various parameters, including optimization level and data type settings.

Parameters:

Name Type Description Default
target Union[str, Dict[str, Any]]

Can be one of a literal target string, a target tag (pre-defined target alias), a json string describing a configuration, or a dictionary of configuration options. When using a dictionary or json string to configure the target, the possible values are:

kind : str (required) Which codegen path to use, for example "llvm" or "cuda".

keys : List of str (optional) A set of strategies that can be dispatched to. When using "kind=opencl" for example, one could set keys to ["mali", "opencl", "gpu"].

device : str (optional) A single key that corresponds to the actual device being run on. This will be effectively appended to the keys.

libs : List of str (optional) The set of external libraries to use. For example ["cblas", "mkl"].

system-lib : bool (optional) If True, build a module that contains self registered functions. Useful for environments where dynamic loading like dlopen is banned.

mcpu : str (optional) The specific cpu being run on. Serves only as an annotation.

model : str (optional) An annotation indicating what model a workload came from.

runtime : str (optional) An annotation indicating which runtime to use with a workload.

mtriple : str (optional) The llvm triplet describing the target, for example "arm64-linux-android".

mattr : List of str (optional) The llvm features to compile with, for example ["+avx512f", "+mmx"].

mfloat-abi : str (optional) An llvm setting that is one of "hard" or "soft" indicating whether to use hardware or software floating-point operations.

mabi : str (optional) An llvm setting. Generate code for the specified ABI, for example "lp64d".

host : Union[str, Dict[str, Any]] (optional) Description for target host. Can be recursive. Similar to target.

'llvm'
host Optional[Union[str, Dict[str, Any]]]

Similar to target, but for the target host. Can be one of a literal target host string, a target tag (pre-defined target alias), a json string describing a configuration, or a dictionary of configuration options. When using a dictionary or json string to configure the host, the possible values are the same as for target.

None
output_path Optional[Union[str, Path]]

The path to save the compiled output, ./compile_output by default.

'./compile_output'
opt_level int

Optimization level, ranging from 0 to 4. Larger numbers correspond to more aggressive compilation optimizations. Default is 3.

3
set_float16 bool

If True, enables the float16 data type for all operators that permit it. Default is False.

False
set_channel_layout Optional[str]

Optional specification of the channel layout ("first" or "last"). If None, the layout is left unchanged.

None
export_relay bool

If True, exports the Relay text representation of the model. Default is False.

False
export_metadata bool

If True, exports the metadata JSON of the model as a text file. Default is False.

False
force_overwrite bool

If True, the method will overwrite the provided output path if it already exists. If False and the output path already exists, a ValueError is raised. Default is False.

False
uuid Optional[str]

Optional user-supplied UUID, for cases where the model needs a unique identifier set by the user. When this value is not set, a randomly generated UUID is used.

None
encrypt_password Optional[str]

Optional password to use if it is desirable to have the model encrypted. As output, there will be the model file and the key.

None

Returns:

Name Type Description
None None

The method operates in place.

Compilation Arguments

Compilation behavior is controlled by the options you pass. Details on each argument and its effect on compilation are provided below.

target: A string or dictionary that denotes the targeted hardware for compilation (see the docstring above for the accepted dictionary keys). A simpler approach is to leverage the pre-defined hardware "tags" (aliases).
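For illustration, here is a dictionary-form target assembled from the keys documented in the docstring above. The values are illustrative (an aarch64 Linux CPU similar to the "raspberry-pi/5" tag), and `ir` is assumed to be a loaded RelayModule:

```python
# Dictionary-form target built from the documented keys; values here are
# illustrative (an aarch64 Linux CPU similar to the "raspberry-pi/5" tag).
arm_target = {
    "kind": "llvm",                  # codegen path to use (required)
    "mtriple": "aarch64-linux-gnu",  # llvm triplet describing the target
    "mcpu": "cortex-a76",            # annotation for the specific CPU
    "mattr": ["+neon"],              # llvm features to compile with
}

# Equivalent compile call (assuming `ir` is a loaded RelayModule):
# ir.compile(target=arm_target, force_overwrite=True)
print(arm_target["kind"])
```

A literal string form such as "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon" expresses the same configuration.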

Target Tags

There are pre-defined hardware tags (aliases) that can greatly simplify passing the target to the compiler. For example, one only needs to pass "raspberry-pi/5" instead of its detailed target description, "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon -num-cores=4". Tags are broken out into four categories: CUDA, x86, ARM, and Android. See the lists of tags with the provided APIs.

CUDA Tags

forge.list_cuda_tags

list_cuda_tags(verbose=False)

List all tags (pre-defined aliases) of CUDA targets

Parameters:

Name Type Description Default
verbose bool

If True, returns each tag with its corresponding target string literal for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

x86 Tags

forge.list_x86_tags

list_x86_tags(verbose=False)

List all tags (pre-defined aliases) of x86 targets

Parameters:

Name Type Description Default
verbose bool

If True, returns each tag with its corresponding target string literal for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

ARM Tags

forge.list_arm_tags

list_arm_tags(verbose=False)

List all tags (pre-defined aliases) of ARM targets

Parameters:

Name Type Description Default
verbose bool

If True, returns each tag with its corresponding target string literal for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

Android Tags

forge.list_android_tags

list_android_tags(verbose=False)

List all tags (pre-defined aliases) for Android targets

Parameters:

Name Type Description Default
verbose bool

If True, returns each tag with its corresponding target string literal for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

All Tags

forge.list_target_tags

list_target_tags(verbose=False)

List all tags (pre-defined aliases) of all targets

Parameters:

Name Type Description Default
verbose bool

If True, returns each tag with its corresponding target string literal for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags
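To illustrate the shape of the verbose listing, the sketch below mirrors what the `list_*_tags(verbose=True)` helpers return: (tag, target string) pairs. The raspberry-pi/5 expansion is quoted from this guide; real values come from Forge itself:

```python
# Shape of list_*_tags(verbose=True) output: (tag, target string) pairs.
# The raspberry-pi/5 expansion below is quoted from this guide.
example_pairs = [
    ("raspberry-pi/5",
     "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon -num-cores=4"),
]

# verbose=False corresponds to just the tag names:
tags = [tag for tag, _ in example_pairs]
print(tags)
```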

host: An optional string or dictionary that denotes the host hardware containing the targeted hardware. This is relevant for multi-target compilation, e.g. GPU + CPU.

output_path: A directory to write the compiled artifact to. If the directory does not exist, it will be created. If it already exists, a new directory with a numeric suffix will be created.

opt_level: An optimization flag leveraged by the compiler, where the highest level of 4 corresponds to the most aggressive optimizations.

set_float16: An option that will convert any float32 nodes into float16 nodes (operator-permitting). This option is ignored for TensorRT compilation.

set_channel_layout: The data and kernel layout of the model can have a major impact on the final inference latency. There are two channel-layout options: "first" or "last". For quantized models, it is generally recommended that one compile with a channel-last layout. If set, this option will convert the model's layouts to maximize either channel-first or channel-last compute. This option is ignored for TensorRT compilation, which defaults to channel-first.

export_relay: This flag will save the Relay text representation of the model as a text file in the designated output_path.

export_metadata: This flag will save a JSON text file of the metadata leveraged by the Latent Runtime Engine.

force_overwrite: This flag, when set to True, will force an overwrite of the output path if it already exists. Otherwise, it will raise a ValueError.


Compile Examples

Basic CPU and GPU Compilation

# Basic CPU compilation
ir.compile()

# GPU compilation with CPU fallback
ir.compile(
    target="cuda",
    host="llvm -mcpu=skylake",
    force_overwrite=True
)

Hardware-Specific Compilation

# Raspberry Pi 5 compilation
ir.compile(
    target="raspberry-pi/5",
    force_overwrite=True
)

# Android CPU compilation
ir.compile(
    target="android/cpu",
    force_overwrite=True
)

Advanced Optimizations

# CPU compilation with channel layout and Relay export
ir.compile(
    set_channel_layout="first",
    export_relay=True,
    force_overwrite=True
)

# GPU compilation with specific architecture and float16
ir.compile(
    target="cuda -arch=sm_86",
    set_float16=True,
    force_overwrite=True
)

TensorRT Integration

For TensorRT acceleration:

  • Use forge.ONNXModule instead of RelayModule
  • Provides optimized performance on NVIDIA GPUs
  • Maintains model portability across different platforms

Example: Complete Optimization Flow Using Relay IR

import forge
import onnx

# 1. Load your model
onnx_model = onnx.load("path/to/onnx/model")
ir = forge.RelayModule.from_onnx(onnx_model)

# 2. Optimize through calibration and quantization
ir.calibrate(calibration_dataset)  # Analyze model behavior
ir.quantize(
    activation_dtype="uint8",
    kernel_dtype="uint8",
    quant_type="static"
)  # Reduce model size

# 3. Compile with hardware-specific optimizations
target = "llvm"  # or specific CPU target like "llvm -mcpu=cascadelake"
ir.compile(
    target=target,
    output_path="cpu_optimized"
)

For more model ingestion options, see the Load a Model guide.

ONNX Module Optimization—Export

The ONNXModule in Forge is optimized using its export() method, which leverages ONNX Runtime to perform hardware-independent, graph-level optimizations. This approach is ideal for general-purpose deployments and NVIDIA GPU targets, providing fast and portable model transformations. The export() method allows you to save the optimized ONNX model for deployment. For reference, the method’s type signature and docstring are included here.

ONNXModule Export Method Docstring

forge.ONNXModule.export

export(f='./model.onnx', force_overwrite=False, is_tensorrt=False, uuid=None, encrypt_password=None)

Exports the current state of the ONNX model to the specified output with metadata for inference with the LEIP LatentRuntimeEngine (LRE).

This method manages output path validation and enforces the '.onnx' file extension. If the model is quantized, related metadata will be included in the export.

Parameters:

Name Type Description Default
f Union[str, Path]

A string containing a file name or a pathlike object. Defaults to "./model.onnx".

'./model.onnx'
force_overwrite bool

If True, overwrites the output path if it already exists. Defaults to False.

False
is_tensorrt bool

DEPRECATED. If True, exports the model in its unquantized state along with the current state of collected calibration data needed to run the model with TensorRT's 8-bit quantization. See the module's calibrate() and quantize() methods, both necessary steps before any calibration data is exported.

False
uuid Optional[str]

A custom UUID for the export. If not provided, a new UUID4 is generated.

None
encrypt_password Optional[str]

Optional password to use if it is desirable to have the model encrypted. As output, there will be the model file and the key.

None

Returns:

Name Type Description
None None

This method operates in place.

Example: Optimize and Export ONNX Model

import forge

# 1. Load ONNX model
ir = forge.ONNXModule("path/to/your/model.onnx")

# 2. Quantization (optional)
ir.calibrate(calibration_dataset)
ir.quantize(activation_dtype="uint8", kernel_dtype="uint8", quant_type='static')

# 3. Export
ir.export("optimized_model.onnx", uuid='quantized model')

Next Steps

For more details on specific optimization techniques, refer to the related guides in this documentation.

Best Practices

  • Always validate your optimized model's accuracy against the original
  • Consider quantization when targeting resource-constrained devices
  • Use the appropriate channel layout for your target hardware
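
As a sketch of the first best practice, outputs of the original and optimized models can be compared within a tolerance. This is a minimal numpy-based check; the tolerances and arrays are illustrative, and obtaining the two output arrays from your runtimes is left to your deployment setup:

```python
import numpy as np

def outputs_close(reference, optimized, rtol=1e-2, atol=1e-3):
    """True if optimized-model outputs match the reference within tolerance."""
    return bool(np.allclose(reference, optimized, rtol=rtol, atol=atol))

# Illustrative outputs from an original FP32 model and its INT8 counterpart:
fp32_out = np.array([0.10, 0.70, 0.20])
int8_out = np.array([0.101, 0.698, 0.201])
print(outputs_close(fp32_out, int8_out))
```

Tighten or loosen `rtol`/`atol` to match the accuracy budget of your application; quantized models typically need a looser tolerance than FP16 ones.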