Guide to Introspection with Forge¶
This guide shows how to introspect the properties of a model through Forge's intermediate representation, the forge.IRModule. Its many readable properties can aid the engineer, scientist, or developer.
Load a Model¶
Following the guide on loading, let's load a Forge IRModule.
import forge
import onnx
onnx_model = onnx.load("path/to/model.onnx")
ir = forge.from_onnx(onnx_model)
What is an IRModule?¶
The forge.IRModule is Forge's intermediate representation module: a framework-agnostic representation of the model that gives the compiler a generalized, standardized abstraction of the model's algorithm. It describes what the algorithm is, not how a device ought to execute it. Because Forge is built atop the open-source TVM project, it adopts TVM's intermediate representation language, Relay. Forge extends TVM by providing a graph backend and a refined API.
Distinction Between Forge and Relay
Note the distinction between the 'Forge IRModule' and the 'Relay IRModule'. The Forge IRModule is an object that wraps the Relay IRModule, TVM's native intermediate representation, and aims to provide a one-to-one parallel to it.
Properties of an IRModule¶
The sections below walk through the readable properties of a forge.IRModule.
See the Intermediate Representation¶
The Relay IRModule can be accessed through the class's mod and typed_mod properties. In a notebook cell, these calls display the Relay graph as text; outside a notebook, use print() to write the graph to the console. The Relay output should look familiar as a representation of your model. Don't be concerned with understanding all the details of the output for now.
ir.mod # Relay IRModule
ir.typed_mod # Relay IRModule w/ static-typing
See the Operators¶
It's simple to get a count of all the distinct operators within a model.
ir.operators # Dict[str, int]
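Conceptually, this property amounts to tallying the operator name of every node in the graph. The sketch below is purely illustrative (it is not Forge's implementation, and the toy operator list is made up) but shows the shape of the result:

```python
from collections import Counter

# Hypothetical sketch: counting distinct operators is a tally of each
# node's operator name across the graph.
def count_operators(op_names):
    """Return a dict mapping each operator name to its occurrence count."""
    return dict(Counter(op_names))

# A toy graph flattened into its operator names:
ops = ["nn.conv2d", "nn.relu", "nn.conv2d", "nn.dense", "nn.relu"]
print(count_operators(ops))  # {'nn.conv2d': 2, 'nn.relu': 2, 'nn.dense': 1}
```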
Get Input and Output Information¶
There are a handful of properties that provide quick access to the inputs and outputs of a model.
# input properties
ir.input_count # int
ir.input_shapes # List[Tuple[int, ...]]
ir.input_dtypes # List[str]
# output properties
ir.output_count # int
ir.output_shapes # List[Tuple[int, ...]]
ir.output_dtypes # List[str]
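These properties are handy for sanity-checking data before inference. The helper below is hypothetical (not part of Forge) and simply compares candidate input metadata against the kind of shape/dtype lists the properties above return:

```python
# Hypothetical helper (not part of Forge): check candidate inputs against
# the shape/dtype metadata an IRModule reports.
def validate_inputs(expected_shapes, expected_dtypes, arrays):
    """arrays: list of (shape, dtype) pairs describing the actual data."""
    if len(arrays) != len(expected_shapes):
        raise ValueError(f"expected {len(expected_shapes)} inputs, got {len(arrays)}")
    for i, ((shape, dtype), exp_shape, exp_dtype) in enumerate(
        zip(arrays, expected_shapes, expected_dtypes)
    ):
        if tuple(shape) != tuple(exp_shape):
            raise ValueError(f"input {i}: shape {shape} != expected {exp_shape}")
        if dtype != exp_dtype:
            raise ValueError(f"input {i}: dtype {dtype} != expected {exp_dtype}")

# Metadata shaped like ir.input_shapes / ir.input_dtypes might return:
validate_inputs([(1, 3, 224, 224)], ["float32"], [((1, 3, 224, 224), "float32")])
```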
Identify your Model¶
Models can be tricky to identify. Sometimes two files may be duplicates, but how can you be sure? In Forge, there are two ways to distinguish a model's identity.
ir.fingerprint # str
The fingerprint property is a deterministic hash of a model's Relay structure and weights. Two Forge IRModules with matching fingerprints can be considered completely identical.
hash(ir) # int
The built-in hash() function computes a deterministic hash of a model's Relay structure, excluding weights; i.e. two models with identical structures, trained on different data sets, will yield matching hashes (but different fingerprints).
Hashing Uniqueness
Both the fingerprint and hash features rely on hashing. Hashing does not guarantee uniqueness, but it is highly improbable for different models to derive matching hashes.
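The fingerprint/hash distinction can be sketched with the standard library. Assume, purely for illustration, that a model reduces to a structure string and a bytes blob of weights; Forge's actual hashing is internal and will differ:

```python
import hashlib

# Illustrative sketch only -- not Forge's implementation.
def structural_hash(structure: str) -> str:
    """Hash of the graph structure alone (analogous to hash(ir))."""
    return hashlib.sha256(structure.encode()).hexdigest()

def fingerprint(structure: str, weights: bytes) -> str:
    """Hash of structure plus weights (analogous to ir.fingerprint)."""
    return hashlib.sha256(structure.encode() + weights).hexdigest()

graph = "conv2d -> relu -> dense"
trained_a = b"\x01\x02\x03"   # weights from data set A
trained_b = b"\x09\x08\x07"   # weights from data set B

# Same structure => matching structural hashes...
assert structural_hash(graph) == structural_hash(graph)
# ...but different weights => different fingerprints.
assert fingerprint(graph, trained_a) != fingerprint(graph, trained_b)
```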
Inference Debugging¶
You may want to quickly get a Python callable that emulates inference of the underlying model (especially when manipulating the underlying graph). The inference function expects NumPy arrays as positional arguments. It is not an optimized compilation of the model and should only be used as a tool for debugging and validating accuracy.
func = ir.get_inference_function()
func(input_data) # func(input0, input1, ..., inputN) for multiple inputs
Partitioning the IRModule¶
Partitioning a Relay graph lets different compiler backends or hardware each execute the parts of a model they handle best. Essentially, it involves:
- Dividing the Graph: Breaking down the computational graph of a model into segments or partitions.
- Targeted Execution: Assigning these partitions to different compiler backends (like TensorRT) or hardware units (like CPUs, GPUs, TPUs) that are best suited for executing them.
- Performance Optimization: Ensuring that each part of the model runs on the most efficient platform for its specific type of computation.
In essence, it's about matching different parts of the model with the most effective resources available for their execution.
Partitioning for TensorRT¶
When compiling for NVIDIA chips, it can be greatly beneficial to leverage NVIDIA's TensorRT compiler. Forge partitions an IRModule for TensorRT optimization by identifying and separating TensorRT-compatible subgraphs from the "main" computational graph.
Get a Count of Graphs¶
There will always be at least one graph, the "main" graph; i.e. the graph_count property is always a positive integer.
ir.graph_count # int
Partition for TensorRT¶
Partition the graph into TensorRT-compatible subgraphs with the partition_for_tensorrt() method.
ir.partition_for_tensorrt()
ir.mod # observe the partitioned-graph
Confirm the IRModule is Partitioned¶
ir.is_tensorrt # bool
Get a Count of Subgraphs¶
After partitioning, any TensorRT-compatible subgraphs will be split out of the "main" computational graph. It is desirable to have as few subgraphs as possible, with a single subgraph being the best-case scenario. There can be zero or more subgraphs; i.e. the subgraph_count property is always a non-negative integer. The "subgraph" count differs from the "graph" count in that it excludes the "main" graph.
ir.subgraph_count # int
Optimal Number of Subgraphs
The optimal number of subgraphs for a TensorRT-partitioned graph is one. The fewer, the better the computational performance. Since not all operators are supported by TensorRT in TVM's integration, unsupported operators result in "breaks" between subgraphs of supported operators.
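The effect of those "breaks" can be illustrated with a toy partitioner (purely hypothetical; Forge and TVM use their own partitioning passes over the real graph). Treating the model as a linear sequence of operators, every maximal run of supported operators becomes one subgraph, so each unsupported operator in the middle adds a break:

```python
def count_subgraphs(op_sequence, supported):
    """Count maximal contiguous runs of supported ops in a linear graph."""
    count = 0
    in_run = False
    for op in op_sequence:
        if op in supported:
            if not in_run:
                count += 1   # a new run of supported ops begins
                in_run = True
        else:
            in_run = False   # an unsupported op "breaks" the run
    return count

supported = {"nn.conv2d", "nn.relu", "nn.dense"}
# All ops supported: a single subgraph (the best case).
print(count_subgraphs(["nn.conv2d", "nn.relu", "nn.dense"], supported))               # 1
# One unsupported op in the middle splits it into two subgraphs.
print(count_subgraphs(["nn.conv2d", "custom_op", "nn.relu", "nn.dense"], supported))  # 2
```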
Undo Partitioning¶
Undo any partitioning with the inline_partitions() method.
ir.inline_partitions()
ir.mod # observe the single "flattened" or "inlined" graph