Latent Runtime Engine API

pylre.LatentRuntimeEngine

LatentRuntimeEngine(
    model_path: Union[str, PathLike],
    options: Optional[Union[ONNXOptions, TVMOptions]] = None,
)

A Python wrapper around the C++ LRE.

This class provides a Python API to the underlying C++ LRE implementation. The Python LRE can run inference on any tensor inputs that follow the DLPack protocol, i.e. tensor objects that define a __dlpack__ method, such as NumPy arrays and PyTorch tensors. The returned outputs are also DLPack objects that common libraries such as NumPy and PyTorch can ingest.

Initialize a runtime instance

Parameters:

Name Type Description Default
model_path Union[str, PathLike]

Path to either an '.onnx' or '.so' artifact generated by LEIP Optimize.

required
options Optional[Union[ONNXOptions, TVMOptions]]

The runtime options for the model. Use 'TVMOptions' for TVM-compiled binaries. Use 'ONNXOptions' for ONNX protobufs.

None
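The pairing in the table above (an '.onnx' protobuf with ONNXOptions, a '.so' binary with TVMOptions) can be checked before the engine is constructed. The following is a minimal sketch using only the standard library; `options_class_name` is a hypothetical helper, not part of pylre.

```python
from pathlib import Path

def options_class_name(model_path):
    """Return the name of the options class that matches the artifact type."""
    suffix = Path(model_path).suffix.lower()
    if suffix == ".onnx":
        return "ONNXOptions"  # ONNX protobufs
    if suffix == ".so":
        return "TVMOptions"   # TVM-compiled binaries
    raise ValueError(f"expected an '.onnx' or '.so' artifact, got {model_path!r}")

print(options_class_name("model.onnx"))
print(options_class_name(Path("build") / "modelLibrary.so"))
```

The file names are placeholders. The returned name could then be looked up on the pylre module, for example with getattr, to pick the options class.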

Methods:

Name Description
__call__

Method to invoke inference and return outputs by calling the instance.

get_metadata

Get a dictionary of the model's metadata.

get_output

Get a specific tensor output by index from the last executed inference.

get_outputs

Returns all the outputs from the last executed inference.

infer

Runs inference upon provided input(s). Outputs are saved to buffers.

set_cpu_output

Sets whether the output should be a CPU PyDLPack tensor.

Attributes:

Name Type Description
input_dtypes List[str]

Model's input data types

input_shapes List[Tuple[int, ...]]

Model's input shapes

is_cpu_output bool

Flag is true if the runtime's current output device is CPU

is_trt bool

Flag is true if the runtime session uses TensorRT

model_id str

Model's UUID metadata field

number_inputs int

Model's number of inputs

number_outputs int

Model's number of outputs

output_dtypes List[str]

Model's output data types

output_shapes List[Tuple[int, ...]]

Model's output shapes

runtime_options Union[ONNXOptions, TVMOptions]

Runtime options of the current session

Attributes

input_dtypes property

input_dtypes: List[str]

Model's input data types

input_shapes property

input_shapes: List[Tuple[int, ...]]

Model's input shapes

is_cpu_output property

is_cpu_output: bool

Flag is true if the runtime's current output device is CPU

is_trt property

is_trt: bool

Flag is true if the runtime session uses TensorRT

model_id property

model_id: str

Model's UUID metadata field

number_inputs property

number_inputs: int

Model's number of inputs

number_outputs property

number_outputs: int

Model's number of outputs

output_dtypes property

output_dtypes: List[str]

Model's output data types

output_shapes property

output_shapes: List[Tuple[int, ...]]

Model's output shapes

runtime_options property

runtime_options: Union[ONNXOptions, TVMOptions]

Runtime options of the current session
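The shape and dtype properties above can be combined into a quick summary of a model's interface. This is a sketch under assumptions: `describe_io` is a hypothetical helper, and the stand-in object with its values (including the "float32" dtype string) is made up so that the example runs without pylre.

```python
from types import SimpleNamespace

def describe_io(engine):
    """Build one summary line per model input and output."""
    lines = []
    for i, (dtype, shape) in enumerate(zip(engine.input_dtypes, engine.input_shapes)):
        lines.append(f"input {i}: {dtype} {tuple(shape)}")
    for i, (dtype, shape) in enumerate(zip(engine.output_dtypes, engine.output_shapes)):
        lines.append(f"output {i}: {dtype} {tuple(shape)}")
    return lines

# Hypothetical stand-in exposing the documented properties; values are illustrative.
engine = SimpleNamespace(
    input_dtypes=["float32"], input_shapes=[(1, 3, 224, 224)],
    output_dtypes=["float32"], output_shapes=[(1, 1000)],
)
for line in describe_io(engine):
    print(line)
```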

Functions

__call__

__call__(inputs) -> List[PyDLPack]

Method to invoke inference and return outputs by calling the instance. A composition of running inference and getting outputs.
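Read literally, the description above says that calling the instance amounts to infer followed by get_outputs. The sketch below spells out that composition as two explicit steps. It is not the library's implementation, and `EchoEngine` is a hypothetical stand-in that only mimics the documented method names.

```python
def run_inference(engine, inputs):
    """Run inference, then fetch every output, as two explicit steps."""
    engine.infer(inputs)         # outputs are saved to the engine's buffers
    return engine.get_outputs()  # list of DLPack-protocol objects

class EchoEngine:
    """Hypothetical stand-in, not part of pylre: it returns its inputs as outputs."""
    def infer(self, inputs):
        self._buffers = list(inputs) if isinstance(inputs, (list, tuple)) else [inputs]

    def get_outputs(self):
        return self._buffers

print(run_inference(EchoEngine(), ["tensor-a", "tensor-b"]))
```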

get_metadata

get_metadata() -> dict

Get a dictionary of the model's metadata.

Returns:

Name Type Description
metadata dict

Dictionary of metadata key-values

get_output

get_output(index: int) -> PyDLPack

Get a specific tensor output by index from the last executed inference.

Parameters:

Name Type Description Default
index int

The desired output tensor index.

required
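An index can be validated against number_outputs before it is passed to get_output. The sketch below assumes zero-based indices, which this reference does not state. `get_output_checked` and the stand-in engine are hypothetical.

```python
from types import SimpleNamespace

def get_output_checked(engine, index):
    """Fetch one output after validating the index against number_outputs."""
    count = engine.number_outputs
    # Assumption: indices are zero-based; the reference does not state this.
    if not 0 <= index < count:
        raise IndexError(f"output index {index} out of range for {count} output(s)")
    return engine.get_output(index)

# Hypothetical stand-in exposing only the two documented members used above.
engine = SimpleNamespace(number_outputs=2, get_output=lambda i: ["boxes", "scores"][i])
print(get_output_checked(engine, 1))
```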

get_outputs

get_outputs() -> List[PyDLPack]

Returns all the outputs from the last executed inference.

Returns:

Name Type Description
outputs List[PyDLPack]

List of DLPack-protocol objects

infer

infer(inputs) -> None

Runs inference upon provided input(s). Outputs are saved to buffers.

Parameters:

Name Type Description Default
inputs

Either a single "DLPack-Tensor" or a list/tuple of "DLPack-Tensor" objects. A "DLPack-Tensor" object is any tensor that implements the DLPack protocol, i.e. has a __dlpack__ method defined.

required
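The accepted input forms described above (a single tensor, or a list/tuple of tensors, each with a __dlpack__ method) can be checked up front. This is a minimal sketch; `as_input_list` and `FakeTensor` are hypothetical names, and real inputs would be NumPy arrays, PyTorch tensors, or similar.

```python
def as_input_list(inputs):
    """Return inputs as a list, checking each item for a __dlpack__ method."""
    items = list(inputs) if isinstance(inputs, (list, tuple)) else [inputs]
    for position, item in enumerate(items):
        if not hasattr(item, "__dlpack__"):
            raise TypeError(
                f"input {position} ({type(item).__name__}) has no __dlpack__ method"
            )
    return items

class FakeTensor:
    """Hypothetical placeholder that only advertises the DLPack method."""
    def __dlpack__(self, stream=None):
        raise NotImplementedError("placeholder only")

tensor = FakeTensor()
print(len(as_input_list(tensor)), len(as_input_list((tensor, tensor))))
```

Only lists and tuples are unpacked, so an array-like object that happens to be iterable is still treated as one input.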

set_cpu_output

set_cpu_output(use_cpu: bool) -> None

Sets whether the output should be a CPU PyDLPack tensor.

This method configures the output to be a CPU PyDLPack tensor if the inference device is CUDA. If the inference device is already set to CPU, this setting has no effect since the output is already on the CPU.

Parameters:

Name Type Description Default
use_cpu bool

If set to True, the output will be a CPU PyDLPack tensor when the inference device is CUDA. If set to False, the output will remain on the device used for inference.

required
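Because set_cpu_output changes state on the engine, it may be convenient to scope the change and restore the previous value afterwards, using is_cpu_output to remember it. The following is a sketch of that pattern; `cpu_outputs` and `FlagEngine` are hypothetical names, and the stand-in only tracks the flag.

```python
from contextlib import contextmanager

@contextmanager
def cpu_outputs(engine):
    """Temporarily request CPU outputs, restoring the previous setting on exit."""
    previous = engine.is_cpu_output
    engine.set_cpu_output(True)
    try:
        yield engine
    finally:
        engine.set_cpu_output(previous)

class FlagEngine:
    """Hypothetical stand-in, not part of pylre: it only records the flag."""
    def __init__(self):
        self.is_cpu_output = False

    def set_cpu_output(self, use_cpu):
        self.is_cpu_output = use_cpu

engine = FlagEngine()
with cpu_outputs(engine):
    print(engine.is_cpu_output)  # True inside the block
print(engine.is_cpu_output)      # back to False afterwards
```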

TVMOptions

A class that provides options for configuring TVM-compiled models.

  • Example:

    import pylre
    from pylre import LatentRuntimeEngine as LRE
    
    # Create TVMOptions with all possible configurations
    options = pylre.TVMOptions(
        precision="int8",                    # Set precision to INT8
        tensorrt_timing_cache="timing_dir",  # Set TensorRT timing cache directory
        tensorrt_engine_cache="engine_dir",  # Set TensorRT engine cache directory
        device_id=0,                         # Use GPU device ID 0
        password="password",                 # Set encryption password
        key_path="path_to_key"               # Provide encryption key path
    )
    
    # Initialize the Latent Runtime Engine with the configured options
    lre = LRE(model_path="path_to_model", options=options)
    

  • precision: Optional[str]

    • The precision mode to use during inference. Possible values: "float32", "float16", "int8". Defaults to "float32".
Supported runtime precision

Because TensorRT performs second-level optimization at runtime, it provides some flexibility with respect to precision at execution.

TRT Precision int8 at runtime float16 at runtime float32 at runtime
Compiled as int8 X X X
Compiled as float16 X X
Compiled as float32 X X
non-TRT Precision float32 at runtime int8 at runtime uint8 at runtime
Compiled as float32 X
Compiled as int8 X
Compiled as uint8 X
  • tensorrt_timing_cache: Optional[str]

    • A cache path for TensorRT timing data. If not specified, timing data will be stored in memory.
  • tensorrt_engine_cache: Optional[str]

    • A cache path for TensorRT engine files. If not specified, engine files will be generated and stored in memory.
  • device_id: Optional[int]

    • The ID of the device to use for inference.
      • For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.
      • On multi-GPU systems, this selects a specific GPU (e.g., 0 for GPU 0). Defaults to 0.
  • password: Optional[str]

    • Password that was used to encrypt your model. If not specified, no password is required.
  • key_path: Optional[str]

    • The path to a key file used for model encryption. If not specified, no key file is required.

ONNXOptions

A class that provides options for configuring ONNX-exported models.

  • Example:

    import pylre
    from pylre import LatentRuntimeEngine as LRE
    
    # Create ONNXOptions with all possible configurations
    options = pylre.ONNXOptions(
        execution_provider="cuda",        # Use CUDA for execution
        precision="int8",                 # Set precision to INT8
        tensorrt_timing_cache="timing_dir",  # Set TensorRT timing cache directory
        tensorrt_engine_cache="engine_dir",  # Set TensorRT engine cache directory
        device_id=0,                      # Use GPU device ID 0
        password="password",               # Set encryption password
        key_path="path_to_key"             # Provide encryption key path
    )
    
    # Initialize the Latent Runtime Engine with the configured options
    lre = LRE(model_path="path_to_model", options=options)
    

  • execution_provider: Optional[str]

    • The provider to use for model execution. Possible values: "cpu", "cuda", "tensorrt". Defaults to "cpu".
  • precision: Optional[str]

    • The precision mode to use during inference. Possible values: "float32", "int8". Defaults to "float32".
Supported runtime precision
TRT Precision int8 at runtime float16 at runtime float32 at runtime
Exported as int8 X X X
Exported as float16 X X
Exported as float32 X X
non-TRT Precision float32 at runtime int8 at runtime uint8 at runtime
Exported as float32 X
Exported as int8 X
Exported as uint8 X
  • tensorrt_timing_cache: Optional[str]

    • A cache path for TensorRT timing data. If not specified, timing data will be stored in memory.
  • tensorrt_engine_cache: Optional[str]

    • A cache path for TensorRT engine files. If not specified, engine files will be generated and stored in memory.
  • device_id: Optional[int]

    • The ID of the device to use for inference.
      • For vanilla CPU memory, pinned memory, or managed memory, this is set to 0.
      • On multi-GPU systems, this selects a specific GPU (e.g., 0 for GPU 0). Defaults to 0.
  • password: Optional[str]

    • Password that was used to encrypt your model. If not specified, no password is required.
  • key_path: Optional[str]

    • The path to a key file used for model encryption. If not specified, no key file is required.
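The string-valued ONNXOptions fields can be validated before the options object is built, so that a typo is caught early. The sketch below uses only the values listed in the execution_provider and precision bullets above. `onnx_option_kwargs` is a hypothetical helper, not part of pylre.

```python
# Values copied from the execution_provider and precision bullets above.
ONNX_EXECUTION_PROVIDERS = ("cpu", "cuda", "tensorrt")
ONNX_PRECISIONS = ("float32", "int8")

def onnx_option_kwargs(execution_provider="cpu", precision="float32", **extra):
    """Validate the two string fields and return keyword arguments for ONNXOptions."""
    if execution_provider not in ONNX_EXECUTION_PROVIDERS:
        raise ValueError(f"unknown execution_provider: {execution_provider!r}")
    if precision not in ONNX_PRECISIONS:
        raise ValueError(f"unknown precision: {precision!r}")
    # Other documented fields (device_id, password, ...) pass through unchanged.
    return {"execution_provider": execution_provider, "precision": precision, **extra}

print(onnx_option_kwargs("cuda", "int8", device_id=0))
```

The resulting dictionary could then be unpacked into the constructor, as in pylre.ONNXOptions(**kwargs).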