
Guide to Inference with Forge

GraphExecutor

This guide will show you how to run inference on the compiled binary with Forge using its GraphExecutor.

Inference with Forge

Note that running inference with Forge is not an optimal setup for edge deployment, since Forge is a heavy environment. It does, however, allow you to run inference with the compiled binary locally for convenience and testing purposes.


Loading a GraphExecutor

Loading a compiled binary is simple: provide the path to the compiled .so file and specify the device string (usually "cpu" or "cuda"). An error is raised if an incompatible device string is passed, e.g. device="cpu" for a GPU-compiled binary.

Example Code

import forge
gx = forge.GraphExecutor("path/to/modelLibrary.so", device="cpu")
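For a GPU-compiled binary, the same call is made with the matching device string (the path below is hypothetical):

gx_gpu = forge.GraphExecutor("path/to/modelLibrary_gpu.so", device="cuda")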


GraphExecutor Introspection

There are some properties for high-level introspection.

Example Code

gx.input_count  # number of inputs
gx.input_shapes  # list of input shapes
gx.input_dtypes  # list of input dtypes

gx.output_count  # number of outputs
gx.output_shapes  # list of output shapes
gx.output_dtypes  # list of output dtypes

gx.output_type  # string-type of the inference (see "GraphExecutor Inference")
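These properties can be used to sanity-check or construct inputs before inference. Below is a minimal sketch, assuming input_shapes holds concrete integer shapes and input_dtypes holds NumPy-compatible dtype strings:

import numpy as np

# Build a zero-filled placeholder tensor for each model input
dummy_inputs = [
    np.zeros(shape, dtype=dtype)
    for shape, dtype in zip(gx.input_shapes, gx.input_dtypes)
]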


GraphExecutor Benchmarking

The GraphExecutor provides a wrapper around TVM's benchmarking function. Below is the type-signature and docstring.

GraphExecutor Benchmark Method Docstring

forge.GraphExecutor.benchmark(repeat=5, number=5, min_repeat_ms=None, limit_zero_time_iterations=100, end_to_end=False, cooldown_interval_ms=0, repeats_to_cooldown=1, return_ms=True)

Calculate runtime of a function by repeatedly calling it.

Use this function to get an accurate measurement of the runtime of a function. The function is run multiple times in order to account for variability in measurements, processor speed or other external factors. Mean, median, standard deviation, min and max runtime are all reported. On GPUs, CUDA and ROCm specifically, special on-device timers are used so that synchronization and data transfer operations are not counted towards the runtime. This allows for fair comparison of runtimes across different functions and models. The end_to_end flag switches this behavior to include data transfer operations in the runtime.

The benchmarking loop looks approximately like so:

.. code-block:: python

for r in range(repeat):
    time_start = now()
    for n in range(number):
        func_name()
    time_end = now()
    total_times.append((time_end - time_start)/number)

Parameters:

repeat (int): Number of times to run the outer loop of the timing code (see above). The output will contain repeat number of datapoints. Defaults to 5.

number (int): Number of times to run the inner loop of the timing code. This inner loop is run in between the timer starting and stopping. In order to amortize any timing overhead, number should be increased when the runtime of the function is small (less than a 1/10 of a millisecond). Defaults to 5.

min_repeat_ms (Optional[int]): If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. This can be used to ensure that the function is run enough to get an accurate measurement. Defaults to None.

limit_zero_time_iterations (Optional[int]): The maximum number of repeats when measured time is equal to 0. It helps to avoid hanging during measurements. Defaults to 100.

end_to_end (bool): If enabled, include time to transfer input tensors to the device and time to transfer returned tensors in the total runtime. This will give accurate timings for end to end workloads. Defaults to False.

cooldown_interval_ms (Optional[int]): The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown. Defaults to 0.

repeats_to_cooldown (Optional[int]): The number of repeats before the cooldown is activated. Defaults to 1.

return_ms (bool): A flag to convert all measurements to milliseconds. Defaults to True.

Returns:

timing_results (Dict): Runtime results broken out into the raw "results", along with the computed statistics of "max", "median", "min", "mean", and "std".

Benchmarking Details

It is important to note that this benchmarking does not account for:

  1. the setting of the input

  2. the retrieval of the output

In other words, the setting of inputs and the retrieval of outputs are not part of the benchmark loop.

Example Code

gx.benchmark(repeat=10, number=5, return_ms=False)
gx.benchmark(end_to_end=True)
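A minimal sketch of consuming the returned dictionary, using the statistics keys listed above:

timing = gx.benchmark(repeat=10, number=5)   # return_ms=True by default, so values are in milliseconds
print(timing["mean"], timing["std"])         # aggregate statistics
print(timing["results"])                     # raw per-repeat measurements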


GraphExecutor Inference

Inference Methods

There are a couple of ways to run inference.

Method 1:

gx.infer(input_data)  # runs inference but does not return output
output = gx.get_outputs()  # retrieve a list of output tensors

Method 2:

output = gx(input_data)  # runs inference and returns output

The input_data can be a single positional argument or multiple positional arguments. The inputs should be NumPy, Torch, or TensorFlow tensor objects, i.e. any tensor object that upholds the DLPack protocol.
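As an illustrative sketch (the two-input model, shapes, and dtypes here are hypothetical), tensors from different frameworks can be passed as positional arguments:

import numpy as np
import torch

x0 = np.random.rand(1, 3, 224, 224).astype("float32")  # hypothetical first input
x1 = torch.rand(1, 10)                                  # hypothetical second input

output = gx(x0, x1)  # any mix of DLPack-compatible tensors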

Output Type

By default, the list of outputs returned will be NumPy arrays. However, a user can manually set this to one of three supported options: "numpy", "dlpack", or "torch". By using "dlpack" or "torch", a user can force the resulting output to stay on the target device (e.g. the GPU) and avoid the expense of transferring the output from the target device to the CPU. If "dlpack" is elected, it is up to the user to ingest the object into a framework of their choice.

gx.set_output_type("torch")
gx.output_type  # the string denoting the tensor object type returned
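As a sketch of the "dlpack" option (assuming each returned object is a DLPack capsule), the outputs can be ingested into a framework such as PyTorch:

import torch

gx.set_output_type("dlpack")
outputs = gx(input_data)                            # input_data as in the examples above
tensors = [torch.from_dlpack(o) for o in outputs]   # ingest each DLPack object as a torch.Tensor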

Environment

One must have torch installed in the environment to have the GraphExecutor return torch.Tensor objects.


Inference with a TensorRT-Compiled Model

To use the GraphExecutor on a model compiled with TensorRT, the functional flow is no different from that described above. It should be noted, however, that the inference engines are built by TensorRT at runtime.

The first time a TensorRT-compiled model is loaded and ran, TensorRT will kick off a engine-building process using the compiled .so. This is important because the first inference may appear exceptionally slow for a TensorRT-compiled model if engine-building is required. Built engines are cached in the same directory as the .so and future loading and runs of inference will not result in engine-building so long as the cached engines are found.