Quantize and Compile an ONNX Model¶
Forge can ingest transformer models through its ONNX backend, which is useful for leveraging state-of-the-art ONNX models and execution providers. This tutorial will guide you through the steps required to quantize a float32 (FP32) DETR model from Hugging Face to run with 8-bit integer (INT8) precision.
To learn how to use the Forge TVM backend, see the Forge TVM tutorial.
Environment Setup¶
First, let’s ensure that your environment is set up correctly.
You have two options:
- Set up a Docker container.
- Create a conda environment.
Follow the installation guide for step-by-step instructions.
To run this tutorial, you’ll need the following Python packages:
- torch
- torchvision
- huggingface_hub
- optimum
- timm
- colorama
Let’s ensure these dependencies are installed in your environment.
!pip install torch==2.4.1 torchvision==0.19.1 --extra-index-url https://download.pytorch.org/whl/cu121
!pip install huggingface_hub colorama optimum timm
import os
import urllib.request
import zipfile
import numpy as np
from optimum.exporters.onnx import main_export
import forge
from torch.utils.data import Dataset
from torchvision import transforms
from PIL import Image
Model and Dataset Setup¶
In this step, we’ll set up the necessary resources for optimization. This includes organizing your working directories, downloading the DETR
ONNX model from Hugging Face, and getting the COCO dataset to use for quantization.
We need to create a clear folder structure to maintain artifacts during the tutorial:
- Place the input ONNX model in the models directory.
- Download the COCO dataset for calibration.
- Export the optimized ONNX model to the optimized_outputs directory.
Run the code below to create the folder structure:
model_path = "models"
if not os.path.exists(model_path):
    os.makedirs(model_path)

# Export the DETR model from Hugging Face to ONNX if it is not already present
if not os.path.exists(f"{model_path}/detr_resnet50/model.onnx"):
    model_id = "facebook/detr-resnet-50"
    output_dir = f"{model_path}/detr_resnet50"
    main_export(
        model_name_or_path=model_id,
        output=output_dir,
        task="object-detection",
        opset=12,
        device="cpu"
    )

# Download and extract the COCO val2017 images used for calibration
dataset_dir = "val2017"
if not os.path.exists(dataset_dir):
    print("Downloading val2017 dataset from COCO. This is required to quantize our model with INT8 precision")
    url = "http://images.cocodataset.org/zips/val2017.zip"
    file_path = "val2017.zip"
    urllib.request.urlretrieve(url, file_path)
    with zipfile.ZipFile(file_path, 'r') as zip_ref:
        zip_ref.extractall(".")

# Directory where the optimized (quantized) ONNX models will be written
optimized_output_dir = "optimized_outputs"
if not os.path.exists(optimized_output_dir):
    os.makedirs(optimized_output_dir)
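As an optional sanity check (not part of the original workflow), you can verify that the exported ONNX file is structurally valid with the standard onnx package before handing it to Forge:

import onnx

# Load the exported DETR model and run the ONNX checker on it
exported_path = f"{model_path}/detr_resnet50/model.onnx"
onnx.checker.check_model(onnx.load(exported_path))
print(f"{exported_path} passed the ONNX checker")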
Loading the Model¶
The Forge ONNX backend supports ingesting models either as an ONNX model object or directly from an .onnx file. For more information, consult the guide on loading models.
Important Note:¶
Forge does not support ONNX models exported using torch.dynamo. Only models exported from torch.jit are supported for ingestion.
ir = forge.ONNXModule(f"{model_path}/detr_resnet50/model.onnx")
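Alternatively, if you already have the model in memory as an onnx.ModelProto, you could pass the object itself instead of a file path. This is a minimal sketch based on the description above; consult the loading-models guide for the exact constructor behavior:

import onnx

# Sketch: ingest an in-memory ONNX model rather than a path on disk
onnx_model = onnx.load(f"{model_path}/detr_resnet50/model.onnx")
ir_from_object = forge.ONNXModule(onnx_model)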
Now that the model has been successfully ingested as an IR object, we’re ready to perform transformations and quantization on the model.
Setting Static Inputs¶
Forge supports exporting models only with static inputs, meaning input shapes must be defined and remain constant during inference. To do this, use the .set_static_input() method, which takes an input shape dictionary whose keys are input names and whose values are the corresponding shapes.
input_shape_dict = {
    "pixel_values": [1, 3, 800, 1333]
}
ir.set_static_input(input_shape_dict)
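If you are unsure which input names and shapes your model expects, you can inspect the ONNX graph directly with the standard onnx package (plain ONNX introspection, not a Forge API). The output should include the pixel_values input used above:

import onnx

# List the graph inputs and their (possibly dynamic) dimensions
graph = onnx.load(f"{model_path}/detr_resnet50/model.onnx").graph
for inp in graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)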
Quantizing the Model¶
Let's now quantize the model to optimize it for efficient inference. Forge supports two types of quantization:
- Static Quantization: A calibration step is performed where the model runs on a representative dataset. During calibration, we gather the distribution statistics of the activations, which are then used to determine the optimal scaling factors (quantization parameters) for each layer.
- Dynamic Quantization: Weights are quantized ahead of time, while activation quantization parameters are computed on the fly at runtime. No calibration step is required, which provides more flexibility but potentially lower performance (see the sketch below).
You can find more details and advanced quantization options in the guide on quantization.
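This tutorial uses static quantization. For reference, a dynamic-quantization call might look like the commented sketch below; the quant_type='dynamic' value is an assumption mirroring the static call used later, so check the quantization guide for the exact arguments, and note that dynamic-quantization activations are limited to UINT8 (see the format notes below):

# Hypothetical alternative (not run in this tutorial): dynamic quantization,
# which skips the calibration step entirely.
# ir.quantize(activation_dtype="uint8", kernel_dtype="uint8", quant_type='dynamic')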
Loading a Calibration Dataset¶
In this tutorial, we’ll run calibration on 20 images from the COCO dataset. In practice, you should use a larger dataset better suited to your specific model.
Pro Tip: Tailor the calibration dataset to your model, or reuse a subset of images from your validation dataloader.
class CustomImageDataset(Dataset):
    def __init__(self, img_dir, end_index=20, transform=None):
        self.img_dir = img_dir
        self.transform = transform
        self.img_labels = [f for f in os.listdir(img_dir) if f.endswith(('jpg', 'jpeg', 'png'))]
        # Cap the dataset at end_index images (or fewer, if the directory is smaller)
        self.end_index = end_index if end_index <= len(self.img_labels) else len(self.img_labels)
        self.img_labels = self.img_labels[:self.end_index]

    def __len__(self):
        return self.end_index

    def __getitem__(self, idx):
        img_path = os.path.join(self.img_dir, self.img_labels[idx])
        image = Image.open(img_path).convert("RGB")
        if self.transform:
            image = self.transform(image).unsqueeze(0)  # add a batch dimension
        return image
transform = transforms.Compose([
    transforms.Resize((800, 1333)),  # Resize the images to a fixed size
    transforms.ToTensor(),  # Convert the images to tensors
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images
])
coco_dataset = CustomImageDataset(img_dir=dataset_dir, transform=transform)
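As a quick optional check, fetch one calibration sample and confirm it matches the [1, 3, 800, 1333] static input shape configured earlier:

sample = coco_dataset[0]
print(sample.shape)  # expected: torch.Size([1, 3, 800, 1333])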
Calibrating the Model¶
ir.calibrate(coco_dataset)
Static Quantization¶
Forge supports different quantization formats for activations and kernels during static quantization. You can choose between INT8 and UINT8 formats based on your model's needs:
- Activations: Supported formats are INT8 and UINT8.
- Note: For dynamic quantization, only UINT8 is supported for activations.
- Kernels (Weights): Supported formats include INT8 and UINT8.
For more advanced options, consult the guide on quantization.
In this tutorial, we will use UINT8 for both activations and kernels.
ir.quantize(activation_dtype="uint8", kernel_dtype="uint8", quant_type='static')
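If your target runtime prefers signed tensors, the supported formats listed above indicate INT8 is also available. A hedged variant of the same call is sketched below; the "int8" strings are assumptions mirroring the "uint8" usage, and this call would replace, not follow, the one above:

# Hypothetical alternative: signed INT8 for both activations and kernels.
# ir.quantize(activation_dtype="int8", kernel_dtype="int8", quant_type='static')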
Exporting the Model¶
After calibrating and quantizing the model, we can export it using Forge for deployment. Forge supports two export options:
- TensorRT Export: Recommended if TensorRT is available on your target device, as it offers performance optimizations for NVIDIA GPUs. Use the is_tensorrt flag to indicate that the model should be exported with TensorRT optimizations.
- Non-TensorRT Export: A general export for devices where TensorRT is unavailable.
ir.export(f"{optimized_output_dir}/quantized_model.onnx", uuid='quantized model', force_overwrite=True)
ir.export(f"{optimized_output_dir}/quantized_trt_model.onnx", uuid='tensorrt calibrated model', is_tensorrt=True, force_overwrite=True)
Head over to the LEIP Deploy documentation to learn how to deploy this exported model on your target!