
Guide to Compilation with Forge

This guide will show you how to compile the Forge IRModule for a range of targets.

Load an IRModule

import forge
import onnx

onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model)


Compiling

A Forge IRModule can be compiled with its compile() method. The main argument a user needs to pass is the 'target' designation (detailed below). The type signature and docstring for the method are provided below for reference.

IRModule Compile Method Docstring

forge.IRModule.compile(target='llvm', host=None, output_path='./compile_output', opt_level=3, set_float16=False, set_channel_layout=None, export_relay=False, export_metadata=False, force_overwrite=False, uuid=None, encrypt_password=None)

Compiles the model for a specified target with various configuration options.

This method compiles the model for a given target, which can be a string or a dictionary specifying the target attributes. The compilation can be customized through various parameters, including optimization level and data type settings.

Parameters:

Name Type Description Default
target Union[str, Dict[str, Any]]

Can be one of a literal target string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. When using a dictionary or JSON string to configure the target, the possible keys are:

kind : str (required) Which codegen path to use, for example "llvm" or "cuda".

keys : List of str (optional) A set of strategies that can be dispatched to. When using "kind=opencl" for example, one could set keys to ["mali", "opencl", "gpu"].

device : str (optional) A single key that corresponds to the actual device being run on. This will be effectively appended to the keys.

libs : List of str (optional) The set of external libraries to use. For example ["cblas", "mkl"].

system-lib : bool (optional) If True, build a module that contains self-registered functions. Useful for environments where dynamic loading such as dlopen is banned.

mcpu : str (optional) The specific cpu being run on. Serves only as an annotation.

model : str (optional) An annotation indicating what model a workload came from.

runtime : str (optional) An annotation indicating which runtime to use with a workload.

mtriple : str (optional) The llvm triplet describing the target, for example "arm64-linux-android".

mattr : List of str (optional) The llvm features to compile with, for example ["+avx512f", "+mmx"].

mfloat-abi : str (optional) An llvm setting that is one of "hard" or "soft" indicating whether to use hardware or software floating-point operations.

mabi : str (optional) An llvm setting. Generate code for the specified ABI, for example "lp64d".

host : Union[str, Dict[str, Any]] (optional) Description for target host. Can be recursive. Similar to target.

'llvm'
host Optional[Union[str, Dict[str, Any]]]

Similar to target but for the target host. Can be one of a literal target host string, a target tag (pre-defined target alias), a JSON string describing a configuration, or a dictionary of configuration options. When using a dictionary or JSON string to configure the target host, the possible values are the same as for target.

None
output_path Optional[Union[str, Path]]

The path to save the compiled output, ./compile_output by default.

'./compile_output'
opt_level int

Optimization level, ranging from 0 to 4. Larger numbers correspond to more aggressive compilation optimizations. Default is 3.

3
set_float16 bool

If True, enables the float16 data type for all permitted operators. Default is False. This option is ignored for TensorRT compilation.

False
set_channel_layout Optional[str]

Optional specification of the channel layout ("first" or "last"); if None, the layout is left unchanged. This option is ignored for TensorRT compilation, which defaults to channel-first.

None
export_relay bool

If True, exports the Relay text representation of the model. Default is False.

False
export_metadata bool

If True, exports the metadata JSON of the model as a text file. Default is False.

False
force_overwrite bool

If True, the method will overwrite the provided output path if it already exists. If False and the output path already exists, a ValueError will be raised. Default is False.

False
uuid Optional[str]

Optional user-specified UUID for when the model needs a unique identifier set by the user. When this value is not set, a UUID will be generated randomly.

None
encrypt_password Optional[str]

Optional specification of a password if it is desirable to have the model encrypted. The output will contain both the model file and the key.

None

Returns:

Name Type Description
None None

The method operates in place.

Compilation Arguments

The compilation is dictated by the passed options; the list below provides details on each argument and its effect on compilation.

target: A string or dictionary that denotes the targeted hardware for compilation; see the compile() docstring above for the full set of options. A simpler approach is to leverage the pre-defined hardware "tags" (aliases).
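
As a minimal sketch, a dictionary target for an LLVM CPU build might look like the following (the mcpu and mattr values are illustrative assumptions; substitute values appropriate for your hardware):

# build a target from a dictionary instead of a string literal
cpu_target = {
    "kind": "llvm",           # codegen path (required)
    "mcpu": "skylake",        # annotation for the specific CPU (illustrative)
    "mattr": ["+avx2"],       # LLVM features to compile with (illustrative)
}
ir.compile(target=cpu_target, force_overwrite=True)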

Target Tags

There are pre-defined hardware tags (aliases) that can greatly simplify the passing of the target to the compiler. For example, one only needs to pass "raspberry-pi/5" instead of its detailed target description, "llvm -mtriple=aarch64-linux-gnu -mcpu=cortex-a76 -mattr=+neon -num-cores=4". Tags are broken out into four categories: CUDA, x86, ARM, and Android. The tags in each category can be listed with the provided APIs.

CUDA Tags

forge.list_cuda_tags(verbose=False)

List all tags (pre-defined aliases) of CUDA targets

Parameters:

Name Type Description Default
verbose bool

If True, return all the tags together with their corresponding target string literals for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

x86 Tags

forge.list_x86_tags(verbose=False)

List all tags (pre-defined aliases) of x86 targets

Parameters:

Name Type Description Default
verbose bool

If True, return all the tags together with their corresponding target string literals for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

ARM Tags

forge.list_arm_tags(verbose=False)

List all tags (pre-defined aliases) of ARM targets

Parameters:

Name Type Description Default
verbose bool

If True, return all the tags together with their corresponding target string literals for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

Android Tags

forge.list_android_tags(verbose=False)

List all tags (pre-defined aliases) for Android targets

Parameters:

Name Type Description Default
verbose bool

If True, return all the tags together with their corresponding target string literals for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags

All Tags

forge.list_target_tags(verbose=False)

List all tags (pre-defined aliases) of all targets

Parameters:

Name Type Description Default
verbose bool

If True, return all the tags together with their corresponding target string literals for the TVM compiler. Default is False.

False

Returns:

Type Description
Union[List[str], List[Tuple[str, TargetHost]]]

A list of tags
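
As a usage sketch, the listing APIs can be used to discover a suitable tag before compiling; with verbose=True each entry also carries the underlying target description (the exact tag names available depend on your Forge installation):

# print the available ARM tags, e.g. "raspberry-pi/5"
print(forge.list_arm_tags())

# with verbose=True, each tag is paired with its full target description
for tag, target in forge.list_arm_tags(verbose=True):
    print(tag, target)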

host: An optional string or dictionary that denotes the host hardware containing the targeted hardware. This is relevant for multi-target compilation, e.g. GPU + CPU.

output_path: A directory to write the compiled artifacts to. If the directory does not exist, it will be created. If the directory already exists, a new directory with a numeric suffix will be created.

opt_level: An optimization flag leveraged by the compiler, where the highest level of 4 corresponds to the most aggressive optimizations.

set_float16: An option that will convert any float32 nodes into float16 nodes (operator-permitting). This option is ignored for TensorRT compilation.

set_channel_layout: The data and kernel layout of the model can have a major impact on the final inference latency. There are two channel-layout options: "first" or "last". For quantized models, it is generally recommended that one compile with a channel-last layout. If set, this option will convert the model's layouts to maximize either channel-first or channel-last compute. This option is ignored for TensorRT compilation, which defaults to channel-first.

export_relay: This flag will save a text file of the Relay text representation of the model to the designated output_path.

export_metadata: This flag will save a JSON text file of the metadata leveraged by the Latent Runtime Engine.

force_overwrite: This flag, when set to True, will force an overwrite of the output path if it already exists. Otherwise, it will raise a ValueError.


Example Code

# compile for CPU
ir.compile()

# compile for CPU, set channel, and export Relay as a text file
ir.compile(set_channel_layout="first", export_relay=True, force_overwrite=True)

# compile for GPU (targets host GPU)
ir.compile(target="cuda", force_overwrite=True)

# compile for GPU and/or CPU with explicit target strings (gives control to target a specific CPU or GPU)
ir.compile(target="cuda -arch=sm_86", host="llvm -mcpu=skylake", force_overwrite=True)

# compile for GPU with host CPU details (provides CPU acceleration for model sections mapped to CPU)
ir.compile(target="cuda", host="llvm -mcpu=skylake", force_overwrite=True)

# compile for Raspberry Pi using a hardware tag
ir.compile(target="raspberry-pi/5", force_overwrite=True)

# compile for Android SoC using a hardware tag
ir.compile(target="android/cpu", force_overwrite=True)

# compile for CPU with float16
ir.compile(set_float16=True, force_overwrite=True)
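
The remaining arguments compose in the same way. For instance, a minimal sketch that combines a lower optimization level with the Relay and metadata exports:

# compile for CPU at a reduced optimization level and export Relay text plus metadata JSON
ir.compile(opt_level=2, export_relay=True, export_metadata=True, force_overwrite=True)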

Selecting between CUDA and TensorRT

It should be noted that if a model is not partitioned for TensorRT, as described in the following section, the GPU targets provided above will use CUDA libraries to compile the model.


Compiling with TensorRT

Compilation with TensorRT involves two steps:

  1. Partition for TensorRT

  2. Compile Partitioned Graph

Partition for TensorRT

When compiling for NVIDIA chips, it can be greatly beneficial to leverage NVIDIA's TensorRT compiler. Forge partitions an IRModule for TensorRT optimization by identifying and separating TensorRT-compatible subgraphs from the "main" computational graph.

Get a count of graphs

There will always be at least one graph, the "main" graph; i.e., the graph_count property is always a positive integer.

ir.graph_count  # int

Partitioning

Partition the graph into TensorRT-compatible subgraphs with the partition_for_tensorrt() method. This modifies the graph to a state that is TensorRT-ready.

ir.partition_for_tensorrt()
ir.mod  # observe the partitioned-graph

Confirm the IRModule is partitioned

ir.is_tensorrt  # bool

Get a count of subgraphs

After partitioning, any TensorRT-compatible subgraphs will be partitioned out of the "main" computational graph. It is desirable to have as few subgraphs as possible; a single subgraph is the best-case scenario. There can be zero or more subgraphs, i.e. the subgraph_count property is a non-negative integer.

ir.subgraph_count  # int
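
As a minimal sketch, assuming the partitioning call above has already run, the counts can be checked before compiling:

# report how much of the model was offloaded to TensorRT
print(f"graphs: {ir.graph_count}, TensorRT subgraphs: {ir.subgraph_count}")
if ir.subgraph_count > 1:
    print("note: multiple TensorRT subgraphs; a single subgraph is the best case")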

Compile Partitioned Graph

Compilation with TensorRT proceeds identically to compilation without TensorRT, by invoking the compile method; the only difference is that the graph is already partitioned.

ir.partition_for_tensorrt()
assert ir.is_tensorrt
ir.compile(target="cuda")  # target can be optionally omitted if the IRModule is TensorRT-partitioned

Undo partitioning

Undoing any partitioning is done with the inline_partitions() method. This returns the graph to a state that is not TensorRT-ready.

ir.inline_partitions()
ir.mod  # observe the single "flattened" or "inlined" graph


Compilation Extras

Compiling for GPU without TensorRT

One can compile for the GPU without TensorRT by simply compiling for the CUDA target on an un-partitioned graph.

assert not ir.is_tensorrt
ir.compile(target="cuda")

Compiling with Custom UUID

All the model artifacts contain metadata. By default, each model is assigned a randomly generated UUID, but the user can define and assign a custom UUID to the compiled model artifact using the API:

ir.compile(
    target="cuda",
    uuid="123e4567-e89b-12d3-a456-426614174000",
)
This is useful when parts of the UUID are meant to encode user-defined information such as a host identifier or hardware details.

Compilation Artifacts

The compiled object will be a .so file placed in the designated output_path directory, along with optional Relay-text and runtime-metadata text/JSON files.
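
As a quick sanity check (a sketch; the exact file names such as model_library.so depend on your build), the output directory can be listed after compilation:

from pathlib import Path

# list everything the compiler wrote to the output directory
for artifact in sorted(Path("./compile_output").iterdir()):
    print(artifact.name)  # e.g. model_library.so, plus optional Relay/metadata text files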

Compiling and Encrypting the Artifact

The compile API can also generate an encrypted compiled object. An example of the API usage is shown below:

ir.compile(
    target="cuda",
    output_path=output_path,
    force_overwrite=True,
    encrypt_password="test_password",
)

Encryption Time

Be aware that encrypting the model will increase compilation time.

Encryption will generate an additional file, such that the output_path will contain not only the .so file, but also a .bin key which needs to be provided at runtime.

from pylre import LatentRuntimeEngine
lre = LatentRuntimeEngine(f"{output_path}/model_library.so", f"{output_path}/modelKey.bin", "test_password")
If you attempt to run an encrypted model without providing the key and password to the LRE, the model will not run. An error will be logged:
[05:04:55] /app/src/runtime/latentai/lre.cpp:166: Model looks encrypted but no key provided
or
[05:04:55] /app/src/runtime/latentai/lre_cryption_service.cpp:66: Corrupted chunk encountered while decrypting the key, wrong password or corrupted key file