
Guide to Inference with Forge

GraphExecutor

This guide will show you how to run inference on the compiled binary with Forge using its GraphExecutor.

Inference with Forge

Note that running inference with Forge is not an optimal setup for edge deployment, since Forge is a heavy environment. It does, however, allow you to run inference with the compiled binary locally for convenience and testing purposes.


Loading a GraphExecutor

Loading a compiled binary is simple: provide the path to the compiled .so file and specify the device string (usually "cpu" or "cuda"). An error is raised if an incompatible device string is passed, e.g. device="cpu" for a GPU-compiled binary.

Example Code

import forge
gx = forge.GraphExecutor("path/to/modelLibrary.so", device="cpu")
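For a GPU-compiled binary, the same call is made with the matching device string (the path below is hypothetical):

gx_gpu = forge.GraphExecutor("path/to/modelLibrary_gpu.so", device="cuda")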


GraphExecutor Introspection

There are some properties for high-level introspection.

Example Code

gx.input_count  # number of inputs
gx.input_shapes  # list of input shapes
gx.input_dtypes  # list of input dtypes

gx.output_count  # number of outputs
gx.output_shapes  # list of output shapes
gx.output_dtypes  # list of output dtypes

gx.output_type  # string-type of the inference (see "GraphExecutor Inference")
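These properties can be used to sanity-check or construct inputs before inference. Below is a minimal sketch, assuming input_shapes holds concrete integer shapes and input_dtypes holds NumPy-compatible dtype strings:

import numpy as np

# Build a zero-filled placeholder tensor for each model input
dummy_inputs = [
    np.zeros(shape, dtype=dtype)
    for shape, dtype in zip(gx.input_shapes, gx.input_dtypes)
]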


GraphExecutor Benchmarking

The GraphExecutor provides a wrapper around TVM's benchmarking function. Below is the type-signature and docstring.

GraphExecutor Benchmark Method Docstring

forge.GraphExecutor.benchmark(repeat=5, number=5, min_repeat_ms=None, limit_zero_time_iterations=100, end_to_end=False, cooldown_interval_ms=0, repeats_to_cooldown=1, return_ms=True)

Calculate runtime of a function by repeatedly calling it.

Use this function to get an accurate measurement of the runtime of a function. The function is run multiple times in order to account for variability in measurements, processor speed or other external factors. Mean, median, standard deviation, min and max runtime are all reported. On GPUs, CUDA and ROCm specifically, special on-device timers are used so that synchronization and data transfer operations are not counted towards the runtime. This allows for fair comparison of runtimes across different functions and models. The end_to_end flag switches this behavior to include data transfer operations in the runtime.

The benchmarking loop looks approximately like so:

.. code-block:: python

for r in range(repeat):
    time_start = now()
    for n in range(number):
        func_name()
    time_end = now()
    total_times.append((time_end - time_start)/number)

Parameters:

repeat (int): Number of times to run the outer loop of the timing code (see above). The output will contain repeat number of datapoints. Defaults to 5.

number (int): Number of times to run the inner loop of the timing code. This inner loop is run in between the timer starting and stopping. In order to amortize any timing overhead, number should be increased when the runtime of the function is small (less than a 1/10 of a millisecond). Defaults to 5.

min_repeat_ms (Optional[int]): If set, the inner loop will be run until it takes longer than min_repeat_ms milliseconds. This can be used to ensure that the function is run enough to get an accurate measurement. Defaults to None.

limit_zero_time_iterations (Optional[int]): The maximum number of repeats when measured time is equal to 0. It helps to avoid hanging during measurements. Defaults to 100.

end_to_end (bool): If enabled, include time to transfer input tensors to the device and time to transfer returned tensors in the total runtime. This will give accurate timings for end to end workloads. Defaults to False.

cooldown_interval_ms (Optional[int]): The cooldown interval in milliseconds between the number of repeats defined by repeats_to_cooldown. Defaults to 0.

repeats_to_cooldown (Optional[int]): The number of repeats before the cooldown is activated. Defaults to 1.

return_ms (bool): A flag to convert all measurements to milliseconds. Defaults to True.

Returns:

timing_results (Dict): Runtime results broken out into the raw "results", along with the computed statistics of "max", "median", "min", "mean", and "std".

Benchmarking Details

It is important to note that this benchmarking does not account for:

  1. the setting of the input

  2. the retrieval of the output

In other words, the setting of inputs and the retrieval of outputs are not part of the benchmark loop.

Example Code

gx.benchmark(repeat=10, number=5, return_ms=False)
gx.benchmark(end_to_end=True)
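A minimal sketch of consuming the returned dictionary, using the statistics keys listed above:

timing = gx.benchmark(repeat=10, number=5)   # return_ms=True by default, so values are in milliseconds
print(timing["mean"], timing["std"])         # aggregate statistics
print(timing["results"])                     # raw per-repeat measurements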


GraphExecutor Inference

Inference Methods

There are a couple of ways to run inference.

Method 1:

gx.infer(input_data)  # runs inference but does not return output
output = gx.get_outputs()  # retrieve a list of output tensors

Method 2:

output = gx(input_data)  # runs inference and returns output

The input_data can be a single positional argument or multiple positional arguments. The inputs should be NumPy, Torch, or TensorFlow tensor objects, i.e. any tensor object that upholds the DLPack protocol.
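As an illustrative sketch (the two-input model, shapes, and dtypes here are hypothetical), tensors from different frameworks can be passed as positional arguments:

import numpy as np
import torch

x0 = np.random.rand(1, 3, 224, 224).astype("float32")  # hypothetical first input
x1 = torch.rand(1, 10)                                  # hypothetical second input

output = gx(x0, x1)  # any mix of DLPack-compatible tensors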

Output Type

By default, the list of outputs returned will be NumPy arrays. However, a user can manually set this to one of three supported options: "numpy", "dlpack", or "torch". By using "dlpack" or "torch", a user can force the resulting output to stay on the target device (e.g. the GPU) and avoid the expense of transferring the output from the target device to the CPU. If "dlpack" is elected, it is up to the user to ingest the object into a framework of their choice.

gx.set_output_type("torch")
gx.output_type  # the string denoting the tensor object type returned
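As a sketch of the "dlpack" option (assuming each returned object is a DLPack capsule), the outputs can be ingested into a framework such as PyTorch:

import torch

gx.set_output_type("dlpack")
outputs = gx(input_data)                            # input_data as in the examples above
tensors = [torch.from_dlpack(o) for o in outputs]   # ingest each DLPack object as a torch.Tensor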

Environment

One must have torch installed in the environment to have the GraphExecutor return torch.Tensor objects.


Inference with a TensorRT-Compiled Model

To use the GraphExecutor on a model compiled with TensorRT, the functional flow is no different from that described above. It should be noted, however, that the inference engines are built by TensorRT at runtime.

The first time a TensorRT-compiled model is loaded and ran, TensorRT will kick off a engine-building process using the compiled .so. This is important because the first inference may appear exceptionally slow for a TensorRT-compiled model if engine-building is required. Built engines are cached in the same directory as the .so and future loading and runs of inference will not result in engine-building so long as the cached engines are found.