Guide to Compilation with Forge¶
This guide will show you how to compile the Forge IRModule for a range of targets.
Load an IRModule
import forge
import onnx

# load the ONNX model and convert it into a Forge IRModule
onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model)
Compiling¶
A Forge IRModule can be compiled with its compile() method. The main argument a user needs to pass is the target designation, detailed below. The method's type signature and docstring are provided below for reference.
IRModule Compile Method Docstring
forge.IRModule.compile(target='llvm', host=None, output_path='./compile_output', opt_level=3, set_float16=False, set_channel_layout=None, export_relay=False, export_metadata=False, force_overwrite=False, uuid=None, encrypt_password=None)¶
Compiles the model for a specified target with various configuration options.
This method compiles the model for a given target, which can be a string or a dictionary specifying the target attributes. The compilation can be customized through various parameters, including optimization level and data type settings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
target | Union[str, Dict[str, Any]] | Can be a literal target string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. When using a dictionary or JSON string to configure the target, the possible values are: kind (str, required): which codegen path to use, for example "llvm" or "cuda"; keys (List of str, optional): a set of strategies that can be dispatched to (when using kind="opencl", for example, one could set keys to ["mali", "opencl", "gpu"]); device (str, optional): a single key that corresponds to the actual device being run on, effectively appended to keys; libs (List of str, optional): the set of external libraries to use, for example ["cblas", "mkl"]; system-lib (bool, optional): if True, build a module that contains self-registered functions, useful for environments where dynamic loading such as dlopen is banned; mcpu (str, optional): the specific CPU being run on, serves only as an annotation; model (str, optional): an annotation indicating what model a workload came from; runtime (str, optional): an annotation indicating which runtime to use with a workload; mtriple (str, optional): the LLVM triple describing the target, for example "arm64-linux-android"; mattr (List of str, optional): the LLVM features to compile with, for example ["+avx512f", "+mmx"]; mfloat-abi (str, optional): an LLVM setting, one of "hard" or "soft", indicating whether to use hardware or software floating-point operations; mabi (str, optional): an LLVM setting to generate code for the specified ABI, for example "lp64d"; host (Union[str, Dict[str, Any]], optional): description of the target host, similar to target and can be recursive. | 'llvm' |
host | Optional[Union[str, Dict[str, Any]]] | Similar to target, but for the target host. Can be a literal target host string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. When using a dictionary or JSON string, the possible values are the same as for target. | None |
output_path | Optional[Union[str, Path]] | The path to save the compiled output. | './compile_output' |
opt_level | int | Optimization level, ranging from 0 to 4. Larger numbers correspond to more aggressive compilation optimizations. Default is 3. | 3 |
set_float16 | bool | If True, enables the Float16 data type for all permitted operators. Default is False. This option is ignored for TensorRT compilation. | False |
set_channel_layout | Optional[str] | Optional specification of the channel layout ("first" or "last"). If None, the layout is left unchanged. This option is ignored for TensorRT compilation, which defaults to channel-first. | None |
export_relay | bool | If True, exports the Relay text representation of the model. Default is False. | False |
export_metadata | bool | If True, exports the metadata JSON of the model as a text file. Default is False. | False |
force_overwrite | bool | If True, the method will overwrite the provided output path if it already exists. A ValueError is raised if False and the output path already exists. Default is False. | False |
uuid | Optional[str] | Optional user-specified UUID for when the model needs a unique identifier set by the user. When this value is not set, a randomly generated UUID is used. | None |
encrypt_password | Optional[str] | Optional password to encrypt the compiled model. When set, the output will contain both the model file and the key. | None |
Returns:
Name | Type | Description |
---|---|---|
None | None | The method operates in place. |
Compilation Arguments
Compilation is dictated by the passed options. The following provides details on each argument and its effect on compilation.
target
: A string or dictionary that denotes the targeted hardware for compilation. See here for more details. A simpler method would be to leverage the pre-defined hardware "tags" (aliases).
Target Tags
There are pre-defined hardware tags (aliases) that can greatly simplify passing the target to the compiler. For example, one only needs to pass "raspberry-pi/5" instead of its detailed target description, "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon -num-cores=4". Tags are broken out into four categories: CUDA, x86, ARM, and Android. See the lists of tags with the provided APIs below.
CUDA Tags
forge.list_cuda_tags(verbose=False)¶
List all tags (pre-defined aliases) of CUDA targets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | If True, return each tag together with its corresponding target string literal for the TVM compiler. Default is False. | False |
Returns:
Type | Description |
---|---|
Union[List[str], List[Tuple[str, TargetHost]]] | A list of tags |
x86 Tags
forge.list_x86_tags(verbose=False)¶
List all tags (pre-defined aliases) of x86 targets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | If True, return each tag together with its corresponding target string literal for the TVM compiler. Default is False. | False |
Returns:
Type | Description |
---|---|
Union[List[str], List[Tuple[str, TargetHost]]] | A list of tags |
ARM Tags
forge.list_arm_tags(verbose=False)¶
List all tags (pre-defined aliases) of ARM targets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | If True, return each tag together with its corresponding target string literal for the TVM compiler. Default is False. | False |
Returns:
Type | Description |
---|---|
Union[List[str], List[Tuple[str, TargetHost]]] | A list of tags |
Android Tags
forge.list_android_tags(verbose=False)¶
List all tags (pre-defined aliases) for Android targets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | If True, return each tag together with its corresponding target string literal for the TVM compiler. Default is False. | False |
Returns:
Type | Description |
---|---|
Union[List[str], List[Tuple[str, TargetHost]]] | A list of tags |
All Tags
forge.list_target_tags(verbose=False)¶
List all tags (pre-defined aliases) of all targets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
verbose | bool | If True, return each tag together with its corresponding target string literal for the TVM compiler. Default is False. | False |
Returns:
Type | Description |
---|---|
Union[List[str], List[Tuple[str, TargetHost]]] | A list of tags |
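As an illustration, the tag-listing APIs can be used to inspect the available aliases (a minimal sketch; the exact tags returned depend on your Forge installation):
import forge
# all pre-defined target tags
print(forge.list_target_tags())
# CUDA tags together with their full TVM target strings
for tag, target in forge.list_cuda_tags(verbose=True):
    print(tag, "->", target)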
host
: An optional string or dictionary that denotes the host hardware containing the targeted hardware. This is relevant for multi-target compilation, e.g. GPU + CPU.
output_path
: A directory to write the compiled artifact to. If the directory does not exist, it will be created. If the directory already exists, a new numbered directory will be created instead.
opt_level
: An optimization flag leveraged by the compiler, where the highest level of 4 corresponds to the most aggressive optimizations.
set_float16
: An option that will convert any float32 nodes into float16 nodes (operator-permitting). This option is ignored for TensorRT compilation.
set_channel_layout
: The data and kernel layout of the model can have a major impact on the final inference latency. There are two channel-layout options: "first" or "last". For quantized models, it is generally recommended that one compile with a channel-last layout. If set, this option will convert the model's layouts to maximize either channel-first or channel-last compute. This option is ignored for TensorRT compilation, which defaults to channel-first.
export_relay
: This flag will save the Relay text representation of the model as a text file to the designated output_path.
export_metadata
: This flag will save a JSON text file of the metadata leveraged by the Latent Runtime Engine.
force_overwrite
: This flag, when set to True, will force an overwrite of the output path if it already exists. Otherwise, it will raise a ValueError.
Example Code¶
# compile for CPU
ir.compile()
# compile for CPU, set channel, and export Relay as a text file
ir.compile(set_channel_layout="first", export_relay=True, force_overwrite=True)
# compile for GPU (targets host GPU)
ir.compile(target="cuda", force_overwrite=True)
# compile for GPU and/or CPU with explicit target strings (gives control to target a specific CPU or GPU)
ir.compile(target="cuda -arch=sm_86", host="llvm -mcpu=skylake", force_overwrite=True)
# compile for GPU with host CPU details (provides CPU acceleration for model sections mapped to CPU)
ir.compile(target="cuda", host="llvm -mcpu=skylake", force_overwrite=True)
# compile for Raspberry Pi using a hardware tag
ir.compile(target="raspberry-pi/5", force_overwrite=True)
# compile for Android SoC using a hardware tag
ir.compile(target="android/cpu", force_overwrite=True)
# compile for CPU with float16
ir.compile(set_float16=True, force_overwrite=True)
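For finer-grained control, the target can also be passed as a dictionary of configuration options rather than a string. A minimal sketch is shown below; the mtriple, mcpu, and mattr values mirror the Raspberry Pi 5 tag above and are illustrative only.
# compile using a dictionary target description instead of a target string
ir.compile(
    target={
        "kind": "llvm",                  # codegen path
        "mtriple": "aarch64-linux-gnu",  # LLVM triple for the device
        "mcpu": "cortex-a76",            # annotation for the specific CPU
        "mattr": ["+neon"],              # LLVM features to compile with
    },
    force_overwrite=True,
)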
Selecting between CUDA and TensorRT
It should be noted that if a model is not partitioned for TensorRT, as described in the following section, the GPU targets provided above will use CUDA libraries to compile the model.
Compiling with TensorRT¶
Compilation with TensorRT involves two steps:
- Partition for TensorRT
- Compile Partitioned Graph
Partition for TensorRT¶
When compiling for NVIDIA chips, it can be greatly beneficial to leverage NVIDIA's TensorRT compiler. Forge partitions an IRModule for TensorRT optimization by identifying and separating TensorRT-compatible subgraphs from the "main" computational graph.
Get a count of graphs
There will always be at least one graph, the "main" graph; i.e., the graph_count property is always a positive integer.
ir.graph_count # int
Partitioning
Partition the graph into TensorRT-compatible subgraphs with the partition_for_tensorrt() method. This modifies the graph to a state that is TensorRT-ready.
ir.partition_for_tensorrt()
ir.mod # observe the partitioned-graph
Confirm the IRModule is partitioned
ir.is_tensorrt # bool
Get a count of subgraphs
After partitioning, any TensorRT-compatible subgraphs will have been separated out of the "main" computational graph. It is desirable to have as few subgraphs as possible, with a single subgraph being the best-case scenario. There can be zero or more subgraphs; i.e., the subgraph_count property is always a non-negative integer.
ir.subgraph_count # int
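Putting these pieces together, a typical partition-and-inspect flow looks like the following (a minimal sketch; the actual counts depend on the model):
print(ir.graph_count)        # always >= 1: the "main" graph
ir.partition_for_tensorrt()  # carve out TensorRT-compatible subgraphs
assert ir.is_tensorrt
print(ir.subgraph_count)     # >= 0; ideally a single subgraph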
Compile Partitioned Graph¶
Compilation with TensorRT proceeds identically to compilation without TensorRT, by invoking the compile method; the only difference is that the graph is already partitioned.
ir.partition_for_tensorrt()
assert ir.is_tensorrt
ir.compile(target="cuda") # target can be optionally omitted if the IRModule is TensorRT-partitioned
Undo partitioning
Undoing any partitioning is done with the inline_partitions() method. This returns the graph to a state that is not TensorRT-ready.
ir.inline_partitions()
ir.mod # observe the single "flattened" or "inlined" graph
Compilation Extras¶
Compiling for GPU without TensorRT¶
One can compile for the GPU without TensorRT by simply compiling for the CUDA target architecture on an un-partitioned graph.
assert not ir.is_tensorrt
ir.compile(target="cuda")
Compiling with Custom UUID¶
All model artifacts contain metadata. By default, each model is assigned a randomly generated UUID, but the user can define and assign a custom UUID to the compiled model artifact using the API:
ir.compile(
target="cuda",
uuid="123e4567-e89b-12d3-a456-426614174000",
)
Compilation Artifacts¶
The compiled object will be a .so file placed in the designated output_path directory, along with optional Relay-text and runtime-metadata text/JSON files.
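For reference, the compiled library can then be loaded with the Latent Runtime Engine. The following is a minimal sketch that assumes an unencrypted artifact, the default model_library.so file name in output_path, and that the LatentRuntimeEngine constructor accepts just the library path in this case:
from pylre import LatentRuntimeEngine
# load the compiled, unencrypted artifact from the output directory (path assumed)
lre = LatentRuntimeEngine("./compile_output/model_library.so")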
Compiling and Encrypting the Artifact¶
The compile API also offers the option of generating an encrypted compiled object. An example of the API usage is shown below:
ir.compile(
target="cuda",
output_path=output_path,
force_overwrite=True,
encrypt_password="test_password",
)
Encryption Time
Be aware that encrypting the model will result in a longer compilation time.
Encryption will generate an additional file, such that the output_path will contain not only the .so file, but also a .bin key which needs to be provided at runtime.
from pylre import LatentRuntimeEngine
# load the encrypted artifact by providing the model library, the generated key file, and the password
lre = LatentRuntimeEngine(f"{output_path}/model_library.so", f"{output_path}/modelKey.bin", "test_password")
If the key or password is missing or incorrect, the runtime will report errors such as:
[05:04:55] /app/src/runtime/latentai/lre.cpp:166: Model looks encrypted but no key provided
[05:04:55] /app/src/runtime/latentai/lre_cryption_service.cpp:66: Corrupted chunk encountered while decrypting the key, wrong password or corrupted key file