Latent Runtime Engine API
pylre.LatentRuntimeEngine
LatentRuntimeEngine(
model_path: Union[str, PathLike],
options: Optional[Union[ONNXOptions, TVMOptions]] = None,
)
A Python wrapper around the C++ LRE.
This class exposes a Python API to the underlying C++ LRE implementation.
The Python LRE can run inference on any tensor inputs that follow the DLPack protocol,
i.e. the tensor objects have a defined __dlpack__ method, e.g. NumPy arrays, PyTorch
tensors, etc. The returned outputs will also be DLPack objects that can be ingested by
common libraries like NumPy, PyTorch, etc.
Initialize a runtime instance
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_path | Union[str, PathLike] | Either an '.onnx' or '.so' artifact generated from LEIP Optimize | required |
| options | Optional[Union[ONNXOptions, TVMOptions]] | The runtime options for the model. Use 'TVMOptions' for TVM-compiled binaries. Use 'ONNXOptions' for ONNX protobufs. | None |
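The pairing between artifact type and options class can be made explicit before the engine is constructed. The following is a minimal sketch: `options_class_for` is a hypothetical helper, not part of pylre, and it assumes the file extension reliably identifies the artifact type.

```python
from os import PathLike
from pathlib import Path
from typing import Union


def options_class_for(model_path: Union[str, PathLike]) -> str:
    """Return the name of the options class matching the artifact type.

    Hypothetical helper, not part of pylre: it only inspects the file
    extension and does not open or validate the file.
    """
    suffix = Path(model_path).suffix.lower()
    if suffix == ".so":
        return "TVMOptions"   # TVM-compiled binary
    if suffix == ".onnx":
        return "ONNXOptions"  # ONNX protobuf
    raise ValueError(f"expected an '.onnx' or '.so' artifact, got {suffix!r}")
```

The returned name can then be looked up on the pylre module, for example with `getattr(pylre, name)`, to obtain the class to instantiate.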
Methods:
| Name | Description |
|---|---|
| __call__ | Method to invoke inference and return outputs by calling the instance. |
| get_metadata | Get a dictionary of the model's metadata. |
| get_output | Get a specific tensor output by index from the last executed inference. |
| get_outputs | Returns all the outputs from the last executed inference. |
| infer | Runs inference upon provided input(s). Outputs are saved to buffers. |
| set_cpu_output | Sets whether the output should be a CPU PyDLPack tensor. |
Attributes:
| Name | Type | Description |
|---|---|---|
| input_dtypes | List[str] | Model's input data types |
| input_shapes | List[Tuple[int, ...]] | Model's input shapes |
| is_cpu_output | bool | Flag is true if the runtime's current output device is CPU |
| is_trt | bool | Flag is true if the runtime session uses TensorRT |
| model_id | str | Model's UUID metadata field |
| number_inputs | int | Model's number of inputs |
| number_outputs | int | Model's number of outputs |
| output_dtypes | List[str] | Model's output data types |
| output_shapes | List[Tuple[int, ...]] | Model's output shapes |
| runtime_options | Union[ONNXOptions, TVMOptions] | Runtime options of the current session |
Attributes
input_dtypes
property
input_dtypes: List[str]
Model's input data types
input_shapes
property
input_shapes: List[Tuple[int, ...]]
Model's input shapes
is_cpu_output
property
is_cpu_output: bool
Flag is true if the runtime's current output device is CPU
is_trt
property
is_trt: bool
Flag is true if the runtime session uses TensorRT
model_id
property
model_id: str
Model's UUID metadata field
number_inputs
property
number_inputs: int
Model's number of inputs
number_outputs
property
number_outputs: int
Model's number of outputs
output_dtypes
property
output_dtypes: List[str]
Model's output data types
output_shapes
property
output_shapes: List[Tuple[int, ...]]
Model's output shapes
runtime_options
property
runtime_options: Union[ONNXOptions, TVMOptions]
Runtime options of the current session
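The shape and dtype attributes are parallel lists, so they can be zipped into one summary of the model's interface. A minimal sketch follows; `describe_io` is a hypothetical helper that only reads the attributes listed above and works on any object exposing them.

```python
from typing import List, Tuple


def describe_io(engine) -> List[Tuple[str, int, Tuple[int, ...], str]]:
    """Collect (kind, index, shape, dtype) rows from an engine's attributes.

    Hypothetical helper, not part of pylre.
    """
    rows = []
    # Inputs: pair each shape with the dtype at the same position.
    for i, (shape, dtype) in enumerate(zip(engine.input_shapes, engine.input_dtypes)):
        rows.append(("input", i, tuple(shape), dtype))
    # Outputs: same pairing for the output side.
    for i, (shape, dtype) in enumerate(zip(engine.output_shapes, engine.output_dtypes)):
        rows.append(("output", i, tuple(shape), dtype))
    return rows
```

Printing the rows before the first inference is a quick way to confirm that the tensors about to be passed in match what the model expects.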
Functions
__call__
__call__(inputs) -> List[PyDLPack]
Method to invoke inference and return outputs by calling the instance. A composition of running inference and getting outputs.
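The composition described here can be spelled out as two explicit calls. This is a sketch under the assumption that calling the instance behaves like `infer` followed by `get_outputs`; `infer_then_collect` is a hypothetical helper name.

```python
def infer_then_collect(engine, inputs):
    """Two-step equivalent of calling the engine directly.

    Assumption: `engine(inputs)` behaves like `infer` followed by
    `get_outputs`, as the description of `__call__` suggests.
    """
    engine.infer(inputs)         # run inference; outputs are saved to buffers
    return engine.get_outputs()  # read back every output from that run
```

Splitting the calls is mainly useful when inference and output retrieval need to happen at different points in a program; otherwise calling the instance is shorter.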
get_metadata
get_metadata() -> dict
Get a dictionary of the model's metadata.
Returns:
| Name | Type | Description |
|---|---|---|
| metadata | dict | Dictionary of metadata key-values |
get_output
get_output(index: int) -> PyDLPack
Get a specific tensor output by index from the last executed inference.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| index | int | The desired output tensor index. | required |
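Since the valid range of `index` is bounded by `number_outputs`, a caller may want to check it before asking for a tensor. The sketch below assumes indices are zero-based; `output_at` is a hypothetical helper, and the range check is its own addition rather than documented pylre behavior.

```python
def output_at(engine, index: int):
    """Fetch one output from the last inference, with an explicit range check.

    The range check is an addition of this sketch; how pylre itself reacts
    to an out-of-range index is not documented here.
    """
    if not 0 <= index < engine.number_outputs:
        raise IndexError(
            f"output index {index} out of range for {engine.number_outputs} outputs"
        )
    return engine.get_output(index)
```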
get_outputs
get_outputs() -> List[PyDLPack]
Returns all the outputs from the last executed inference.
Returns:
| Name | Type | Description |
|---|---|---|
| outputs | List[PyDLPack] | List of DLPack-protocol objects |
infer
infer(inputs) -> None
Runs inference upon provided input(s). Outputs are saved to buffers.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs | | Either a single "DLPack-Tensor" or a list/tuple of "DLPack-Tensor" objects. A "DLPack-Tensor" object is any tensor that implements the DLPack protocol, i.e. has a __dlpack__ method. | required |
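Because `infer` accepts either one tensor or a list/tuple of tensors, a small pre-check can catch non-DLPack objects before they reach the runtime. A minimal sketch, where `as_dlpack_list` is a hypothetical helper that tests only for the presence of a `__dlpack__` method:

```python
from typing import Any, List


def as_dlpack_list(inputs: Any) -> List[Any]:
    """Normalize inputs to a list and check each one exposes __dlpack__.

    Hypothetical helper, not part of pylre.
    """
    # A bare tensor becomes a one-element list; lists and tuples are copied.
    items = list(inputs) if isinstance(inputs, (list, tuple)) else [inputs]
    for position, item in enumerate(items):
        if not hasattr(item, "__dlpack__"):
            raise TypeError(
                f"input {position} ({type(item).__name__}) has no __dlpack__ method"
            )
    return items
```

The check is deliberately shallow: it does not validate dtype, shape, or device, which can be compared separately against `input_dtypes` and `input_shapes`.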
set_cpu_output
set_cpu_output(use_cpu: bool) -> None
Sets whether the output should be a CPU PyDLPack tensor.
This method configures the output to be a CPU PyDLPack tensor if the inference device is CUDA. If the inference device is already set to CPU, this setting has no effect since the output is already on the CPU.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| use_cpu | bool | If set to True, the output will be a CPU PyDLPack tensor when the inference device is CUDA. If set to False, the output will remain on the device used for inference. | required |
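A common pattern is to request CPU outputs once and then run inference, so that downstream code never has to handle device tensors. The sketch below uses only `set_cpu_output` and `__call__` as described above; `infer_to_cpu` is a hypothetical helper name.

```python
def infer_to_cpu(engine, inputs):
    """Run inference and return outputs as CPU tensors.

    Hypothetical helper, not part of pylre.
    """
    engine.set_cpu_output(True)  # no effect if inference already runs on CPU
    return engine(inputs)        # __call__: infer, then return all outputs
```

The returned objects should then be consumable by CPU-only DLPack importers such as `numpy.from_dlpack`, assuming the PyDLPack type implements the full protocol as stated in the class description.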
TVMOptions
A class that provides options for configuring TVM-compiled models.
-
Example:

    import pylre
    from pylre import LatentRuntimeEngine as LRE

    # Create TVMOptions with all possible configurations
    options = pylre.TVMOptions(
        precision="int8",                    # Set precision to INT8
        tensorrt_timing_cache="timing_dir",  # Set TensorRT timing cache directory
        tensorrt_engine_cache="engine_dir",  # Set TensorRT engine cache directory
        device_id=0,                         # Use GPU device ID 0
        password="password",                 # Set encryption password
        key_path="path_to_key"               # Provide encryption key path
    )

    # Initialize the Latent Runtime Engine with the configured options
    lre = LRE(model_path="path_to_model", options=options)

-
precision: Optional[str]
- The precision mode to use during inference. Possible values: "float32", "float16", "int8". Defaults to "float32".
Supported runtime precision
Because TensorRT performs second-level optimization at runtime, it provides some flexibility with respect to precision at execution.
| TRT Precision | int8 at runtime | float16 at runtime | float32 at runtime |
|---|---|---|---|
| Compiled as int8 | X | X | X |
| Compiled as float16 | | X | X |
| Compiled as float32 | | X | X |

| non-TRT Precision | float32 at runtime | int8 at runtime | uint8 at runtime |
|---|---|---|---|
| Compiled as float32 | X | | |
| Compiled as int8 | | X | |
| Compiled as uint8 | | | X |
-
tensorrt_timing_cache: Optional[str]
- A cache path for TensorRT timing data. If not specified, timing data will be stored in memory.
-
tensorrt_engine_cache: Optional[str]
- A cache path for TensorRT engine files. If not specified, engine files will be generated and stored in memory.
-
device_id: Optional[int]
- The ID of the device to use for inference.
- For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.
- On multi-GPU systems, selects a specific GPU (e.g., 0 for GPU 0). Defaults to 0.
-
password: Optional[str]
- Password that was used to encrypt your model. If not specified, no password is required.
-
key_path: Optional[str]
- The path to a key file used for model encryption. If not specified, no key file is required.
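The two TensorRT cache options take paths, and the example above passes directory names. A minimal sketch of preparing those directories up front follows. `trt_cache_kwargs` is a hypothetical helper, the subdirectory names `timing` and `engine` are arbitrary choices, and whether pylre creates missing directories on its own is not stated in this reference.

```python
from pathlib import Path
from typing import Dict, Union


def trt_cache_kwargs(root: Union[str, Path]) -> Dict[str, str]:
    """Create cache directories under `root` and return matching keyword arguments.

    Hypothetical helper, not part of pylre.
    """
    root = Path(root)
    timing_dir = root / "timing"  # arbitrary name chosen for this sketch
    engine_dir = root / "engine"  # arbitrary name chosen for this sketch
    for directory in (timing_dir, engine_dir):
        directory.mkdir(parents=True, exist_ok=True)
    return {
        "tensorrt_timing_cache": str(timing_dir),
        "tensorrt_engine_cache": str(engine_dir),
    }
```

The resulting dictionary can be unpacked into the options constructor, for example `pylre.TVMOptions(precision="float16", **trt_cache_kwargs("trt_cache"))`.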
ONNXOptions
A class that provides options for configuring ONNX-exported models.
-
Example:

    import pylre
    from pylre import LatentRuntimeEngine as LRE

    # Create ONNXOptions with all possible configurations
    options = pylre.ONNXOptions(
        execution_provider="cuda",           # Use CUDA for execution
        precision="int8",                    # Set precision to INT8
        tensorrt_timing_cache="timing_dir",  # Set TensorRT timing cache directory
        tensorrt_engine_cache="engine_dir",  # Set TensorRT engine cache directory
        device_id=0,                         # Use GPU device ID 0
        password="password",                 # Set encryption password
        key_path="path_to_key"               # Provide encryption key path
    )

    # Initialize the Latent Runtime Engine with the configured options
    lre = LRE(model_path="path_to_model", options=options)

-
execution_provider: Optional[str]
- The provider to use for model execution. Possible values: "cpu", "cuda", "tensorrt". Defaults to "cpu".
-
precision: Optional[str]
- The precision mode to use during inference. Possible values: "float32", "int8". Defaults to "float32".
Supported runtime precision
| TRT Precision | int8 at runtime | float16 at runtime | float32 at runtime |
|---|---|---|---|
| Exported as int8 | X | X | X |
| Exported as float16 | | X | X |
| Exported as float32 | | X | X |

| non-TRT Precision | float32 at runtime | int8 at runtime | uint8 at runtime |
|---|---|---|---|
| Exported as float32 | X | | |
| Exported as int8 | | X | |
| Exported as uint8 | | | X |
-
tensorrt_timing_cache: Optional[str]
- A cache path for TensorRT timing data. If not specified, timing data will be stored in memory.
-
tensorrt_engine_cache: Optional[str]
- A cache path for TensorRT engine files. If not specified, engine files will be generated and stored in memory.
-
device_id: Optional[int]
- The ID of the device to use for inference.
- For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.
- On multi-GPU systems, selects a specific GPU (e.g., 0 for GPU 0). Defaults to 0.
-
password: Optional[str]
- Password that was used to encrypt your model. If not specified, no password is required.
-
key_path: Optional[str]
- The path to a key file used for model encryption. If not specified, no key file is required.