Guide to Optimization with Forge¶
Forge provides easy-to-use model optimization capabilities to enhance the performance, efficiency, and deployability of your machine learning models. This guide will help you understand:
- How to choose the right optimization path for your use case
- Common optimization steps (calibration and quantization)
- Path-specific operations (compilation or export)
- Best practices and considerations for model optimization
Optimization Workflow¶
The diagram below shows the optimization workflow, including common steps shared between paths:
```mermaid
flowchart TB
    A[Trained Model] --> B{Deployment target};
    B -->|Android/CPU| C[Relay Module];
    B -->|TensorRT/GPU| D[ONNX Module];
    C --> F1[Quantize];
    D --> F2[Quantize];
    C -->|FP32/FP16| G1[Compile];
    F1 -->|INT8| G1[Compile];
    D -->|FP32/FP16| G2[Export];
    F2 -->|INT8| G2[Export];
    G1 -->|modelLibrary.so| H[Optimized Model];
    G2 -->|model.onnx| H;
```
Optimization Paths¶
Forge provides two distinct optimization paths through different Intermediate Representations (IRs). Each path offers unique advantages and is optimized for specific use cases:
Relay Module is a sophisticated IR powered by TVM that enables:
- Optimizations for CPUs
- Fine-grained control over memory layouts and computation patterns
- Advanced optimizations through the `.compile()` method
ONNX Module is a standardized IR focused on portability that provides:
- Hardware-agnostic optimizations through ONNX Runtime
- Excellent support for NVIDIA GPU targets
- Ideal for rapid prototyping and flexible deployment: optimize once, deploy anywhere
Relay Module Optimization—Compilation¶
The RelayModule in Forge is optimized using its compile() method, which enables extensive hardware-aware optimizations via TVM. This method is especially recommended for CPU targets or when you need to optimize for a specific hardware platform. The most important argument to provide is the target, which specifies the desired hardware backend (details below). For reference, the method’s type signature and docstring are included here.
RelayModule Compile Method Docstring
forge.RelayModule.compile¶
compile(target='llvm', host=None, output_path='./compile_output', opt_level=3, set_float16=False, set_channel_layout=None, export_relay=False, export_metadata=False, force_overwrite=False, uuid=None, encrypt_password=None)
Compiles the model for a specified target with various configuration options.
This method compiles the model for a given target, which can be a string or a dictionary specifying the target attributes. The compilation can be customized through various parameters, including optimization level and data type settings.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| target | Union[str, Dict[str, Any]] | A literal target string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. See the key reference below the table. | 'llvm' |
| host | Optional[Union[str, Dict[str, Any]]] | Similar to target, but for the target host: a literal target host string, a target tag, a JSON string, or a dictionary with the same possible keys as target. | None |
| output_path | Optional[Union[str, Path]] | The path to save the compiled output. | './compile_output' |
| opt_level | int | Optimization level, ranging from 0 to 4. Larger numbers correspond to more aggressive compilation optimizations. | 3 |
| set_float16 | bool | If True, enables the float16 data type for all permitted operators. | False |
| set_channel_layout | Optional[str] | Channel layout specification ("first" or "last"). If None, the layout is left unchanged. | None |
| export_relay | bool | If True, exports the Relay text representation of the model. | False |
| export_metadata | bool | If True, exports the metadata JSON of the model as a text file. | False |
| force_overwrite | bool | If True, overwrites the output path if it already exists. If False and the output path exists, a ValueError is raised. | False |
| uuid | Optional[str] | A user-supplied unique identifier for the model. If not set, the UUID is randomly generated. | None |
| encrypt_password | Optional[str] | A password to encrypt the model with. The output will contain the model file and the key. | None |

When configuring target with a dictionary or JSON string, the possible keys are:
- kind : str (required) — which codegen path to use, for example "llvm" or "cuda".
- keys : List of str (optional) — a set of strategies that can be dispatched to. When using kind="opencl", for example, one could set keys to ["mali", "opencl", "gpu"].
- device : str (optional) — a single key that corresponds to the actual device being run on; effectively appended to keys.
- libs : List of str (optional) — the set of external libraries to use, for example ["cblas", "mkl"].
- system-lib : bool (optional) — if True, build a module that contains self-registered functions. Useful for environments where dynamic loading (such as dlopen) is banned.
- mcpu : str (optional) — the specific CPU being run on. Serves only as an annotation.
- model : str (optional) — an annotation indicating what model a workload came from.
- runtime : str (optional) — an annotation indicating which runtime to use with a workload.
- mtriple : str (optional) — the LLVM triple describing the target, for example "arm64-linux-android".
- mattr : List of str (optional) — the LLVM features to compile with, for example ["+avx512f", "+mmx"].
- mfloat-abi : str (optional) — an LLVM setting, one of "hard" or "soft", indicating whether to use hardware or software floating-point operations.
- mabi : str (optional) — an LLVM setting. Generate code for the specified ABI, for example "lp64d".
- host : Union[str, Dict[str, Any]] (optional) — description of the target host. Can be recursive; similar to target.

Returns:

| Name | Type | Description |
|---|---|---|
| None | None | The method operates in place. |
Compilation Arguments
Compilation is dictated by the passed options. Details on the corresponding arguments and their effects on compilation are provided below.
target: A string or dictionary that denotes the targeted hardware for compilation. Rather than writing a full target description by hand, the simplest approach is to use one of the pre-defined hardware "tags" (aliases).
Target Tags
There are pre-defined hardware tags (aliases) that can greatly simplify passing the target to the compiler. For example, one only needs to pass "raspberry-pi/5" instead of its detailed target description, "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon -num-cores=4". Tags are broken out into four categories: CUDA, x86, ARM, and Android. List the available tags with the provided APIs:

- forge.list_cuda_tags(verbose=False) — list all tags (pre-defined aliases) of CUDA targets
- forge.list_x86_tags(verbose=False) — list all tags of x86 targets
- forge.list_arm_tags(verbose=False) — list all tags of ARM targets
- forge.list_android_tags(verbose=False) — list all tags for Android targets
- forge.list_target_tags(verbose=False) — list all tags of all targets

Each function takes a single verbose flag (bool, default False) and returns Union[List[str], List[Tuple[str, TargetHost]]]: a list of tag strings when verbose is False, or the tags paired with their corresponding target string literals for the TVM compiler when verbose is True.
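As an illustration of what a tag expands to, a resolver might look like the sketch below. This is not Forge's internal implementation; the only mapping shown is the "raspberry-pi/5" expansion quoted above, and Forge's real tag table is internal to the library.

```python
# Illustrative sketch only: a tag-to-target lookup mimicking what pre-defined
# aliases do. The expansion below is the example quoted in the text.
TARGET_TAGS = {
    "raspberry-pi/5": "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 "
                      "-mattr=+neon -num-cores=4",
}

def resolve_target(target: str) -> str:
    """Return the full target string for a known tag; pass literals through unchanged."""
    return TARGET_TAGS.get(target, target)

print(resolve_target("raspberry-pi/5"))
print(resolve_target("llvm -mcpu=skylake"))  # literal target strings pass through
```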
host: An optional string or dictionary that denotes the host hardware containing the targeted hardware. This is relevant for multi-target compilation, e.g. GPU + CPU.
output_path: A directory to write the compiled artifact to. If the directory does not exist, it will be created. If the directory already exists, a new directory with a numeric suffix will be created instead.
opt_level: An optimization flag leveraged by the compiler, where the highest level of 4 corresponds to the most aggressive optimizations.
set_float16: An option that will convert any float32 nodes into float16 nodes (operator-permitting). This option is ignored for TensorRT compilation.
set_channel_layout: The data and kernel layout of the model can have a major impact on the final inference latency. There are two channel-layout options: "first" or "last". For quantized models, it is generally recommended that one compile with a channel-last layout. If set, this option will convert the model's layouts to maximize either channel-first or channel-last compute. This option is ignored for TensorRT compilation, which defaults to channel-first.
export_relay: This flag will save a text file of the Relay text representation of the model to the designated output_path.
export_metadata: This flag will save a JSON text file of the metadata leveraged by the Latent Runtime Engine.
force_overwrite: This flag, when set to True, will force an overwrite of the output path if it already exists. Otherwise, it will raise a ValueError.
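To make the set_channel_layout options concrete: "first" (NCHW) and "last" (NHWC) refer to where the channel axis sits in a 4-D activation tensor. The numpy sketch below shows the conversion on raw arrays for illustration only; Forge performs the equivalent transformation at the graph level during compilation.

```python
import numpy as np

# A batch of 2 RGB images, 4x4 pixels, in channel-first (NCHW) layout.
x_nchw = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)

# Channel-first -> channel-last: move axis 1 (channels) to the end.
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))
print(x_nhwc.shape)  # (2, 4, 4, 3)

# Channel-last -> channel-first: the inverse permutation recovers the original.
x_back = np.transpose(x_nhwc, (0, 3, 1, 2))
assert np.array_equal(x_back, x_nchw)
```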
Compile Examples
Basic CPU and GPU Compilation
```python
# Basic CPU compilation
ir.compile()

# GPU compilation with CPU fallback
ir.compile(
    target="cuda",
    host="llvm -mcpu=skylake",
    force_overwrite=True
)
```
Hardware-Specific Compilation
```python
# Raspberry Pi 5 compilation
ir.compile(
    target="raspberry-pi/5",
    force_overwrite=True
)

# Android CPU compilation
ir.compile(
    target="android/cpu",
    force_overwrite=True
)
```
Advanced Optimizations
```python
# CPU compilation with channel layout and Relay export
ir.compile(
    set_channel_layout="first",
    export_relay=True,
    force_overwrite=True
)

# GPU compilation with specific architecture and float16
ir.compile(
    target="cuda -arch=sm_86",
    set_float16=True,
    force_overwrite=True
)
```
TensorRT Integration
For TensorRT acceleration:
- Use forge.ONNXModule instead of RelayModule
- Provides optimized performance on NVIDIA GPUs
- Maintains model portability across different platforms
Example: Complete Optimization Flow Using Relay IR
```python
import forge
import onnx

# 1. Load your model
onnx_model = onnx.load("path/to/onnx/model")
ir = forge.RelayModule.from_onnx(onnx_model)

# 2. Optimize through calibration and quantization
ir.calibrate(calibration_dataset)  # Analyze model behavior
ir.quantize(
    activation_dtype="uint8",
    kernel_dtype="uint8",
    quant_type="static"
)  # Reduce model size

# 3. Compile with hardware-specific optimizations
target = "llvm"  # or specific CPU target like "llvm -mcpu=cascadelake"
ir.compile(
    target=target,
    output_path="cpu_optimized"
)
```
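The quantize step above maps float32 values to uint8. As a rough intuition for what static quantization computes, here is a minimal sketch of affine (scale/zero-point) quantization arithmetic in plain Python. This is for illustration only; Forge's actual quantizer operates on the model graph using statistics gathered during calibrate(), not on raw lists.

```python
def quantize_uint8(values):
    """Affine-quantize floats to uint8: q = clamp(round(x / scale) + zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0 or 1.0  # guard against constant inputs
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values: x ~= (q - zero_point) * scale."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, 0.0, 0.5, 1.0]
q, scale, zp = quantize_uint8(vals)
approx = dequantize(q, scale, zp)
# Round-trip error is bounded by roughly half a quantization step (scale / 2).
assert all(abs(a - v) <= scale / 2 + 1e-9 for a, v in zip(approx, vals))
```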
ONNX Module Optimization—Export¶
The ONNXModule in Forge is optimized using its export() method, which leverages ONNX Runtime to perform hardware-independent, graph-level optimizations. This approach is ideal for general-purpose deployments and NVIDIA GPU targets, providing fast and portable model transformations. The export() method allows you to save the optimized ONNX model for deployment. For reference, the method’s type signature and docstring are included here.
ONNXModule Export Method Docstring
forge.ONNXModule.export¶
export(f='./model.onnx', force_overwrite=False, is_tensorrt=False, uuid=None, encrypt_password=None)
Exports the current state of the ONNX model to the specified output with metadata for inference with the LEIP LatentRuntimeEngine (LRE).
This method manages output-path validation and enforces the '.onnx' file extension. If the model is quantized, related metadata will be included in the export.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| f | Union[str, Path] | A string containing a file name or a path-like object. | './model.onnx' |
| force_overwrite | bool | If True, overwrites the output path if it already exists. | False |
| is_tensorrt | bool | DEPRECATED. If True, exports the model in its unquantized state along with the current state of collected calibration data needed to run the model with TensorRT's 8-bit quantization. The module's calibrate() and quantize() methods are both necessary steps before any calibration data gets exported. | False |
| uuid | Optional[str] | A custom UUID for the export. If not provided, a new UUID4 is generated. | None |
| encrypt_password | Optional[str] | A password to encrypt the model with. The output will contain the model file and the key. | None |

Returns:

| Name | Type | Description |
|---|---|---|
| None | None | This method operates in place. |
Example: Optimize and Export ONNX Model
```python
import forge

# 1. Load ONNX model
ir = forge.ONNXModule("path/to/your/model.onnx")

# 2. Quantization (optional)
ir.calibrate(calibration_dataset)
ir.quantize(activation_dtype="uint8", kernel_dtype="uint8", quant_type="static")

# 3. Export
ir.export("optimized_model.onnx", uuid="quantized model")
```
Next Steps¶
For more details on specific optimization techniques, refer to:
- Model Loading Guide: Explore different ways to load and prepare models for optimization
- Graph Introspection Guide: Understand how to analyze and debug your optimized models
- Calibration and Quantization Guide: Learn how to reduce model size and improve inference speed through quantization
Best Practices
- Always validate your optimized model's accuracy against the original
- Consider quantization when targeting resource-constrained devices
- Use the appropriate channel layout for your target hardware
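For the first point, accuracy validation can be as simple as comparing outputs from the original and optimized models on the same inputs. The sketch below assumes you have already run both models and collected their outputs as flat float lists; the example logits are made up for illustration.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length outputs."""
    assert len(a) == len(b)
    return max(abs(x - y) for x, y in zip(a, b))

def top1_match(a, b):
    """True if both outputs pick the same argmax class (typical classifier check)."""
    return max(range(len(a)), key=a.__getitem__) == max(range(len(b)), key=b.__getitem__)

# Hypothetical logits from the original fp32 model and the quantized model.
original = [0.10, 2.30, 0.80]
optimized = [0.12, 2.25, 0.83]
assert top1_match(original, optimized)
print(f"max abs diff: {max_abs_diff(original, optimized):.3f}")
```

For a full validation, run both models over a held-out dataset and compare aggregate accuracy, not just single examples.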