{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compile a Model for an NVIDIA Target\n",
    "\n",
    "The [Forge TVM tutorial](/optimize/content/notebooks/BYOMwithForgeTutorialTVM/) covers the basic compilation steps. Different hardware backends, however, offer device-specific optimizations that you can leverage to get significantly better performance. This tutorial gives step-by-step instructions for targeting NVIDIA GPUs with Forge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Environment Setup\n",
    "\n",
    "To get started, set up a Forge environment and install the additional tools this tutorial needs.\n",
    "Follow the [installation instructions](/optimize/content/getting-started/install/) to set up a Docker container or conda environment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Dependencies for this tutorial:\n",
    "- ultralytics\n",
    "- onnx\n",
    "- torch\n",
    "- torchvision\n",
    "- PIL (provided by the `pillow` package)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install torch\n",
    "!pip install torchvision\n",
    "!apt-get update\n",
    "!apt-get install -y libgl1\n",
    "!pip install ultralytics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import urllib.request\n",
    "import zipfile\n",
    "\n",
    "from ultralytics import YOLO\n",
    "\n",
    "import onnx\n",
    "import torch\n",
    "\n",
    "import forge\n",
    "\n",
    "from torch.utils.data import Dataset\n",
    "from torchvision import transforms\n",
    "from PIL import Image"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model and Dataset Setup\n",
    "\n",
    "Next, we'll acquire some useful inputs for the compilation process and create a folder structure to maintain artifacts. We'll place our input traced model from Ultralytics in a `models` directory, download a COCO dataset for quantization as `val2017`, and save our compiled models inside an `optimized_outputs` directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Export YOLOv8n to ONNX unless a previous run already produced the file\n",
    "model_path = \"models\"\n",
    "os.makedirs(model_path, exist_ok=True)\n",
    "if not os.path.exists(f\"{model_path}/yolov8n.onnx\"):\n",
    "    model = YOLO(f\"{model_path}/yolov8n.pt\")\n",
    "    model.export(format=\"onnx\")\n",
    "\n",
    "# Download and unpack the COCO val2017 images used for calibration\n",
    "dataset_dir = \"val2017\"\n",
    "if not os.path.exists(dataset_dir):\n",
    "    print(\"Downloading val2017 dataset from COCO. This is required to quantize our model with INT8 precision\")\n",
    "    url = \"http://images.cocodataset.org/zips/val2017.zip\"\n",
    "    file_path = \"val2017.zip\"\n",
    "    urllib.request.urlretrieve(url, file_path)\n",
    "    with zipfile.ZipFile(file_path, 'r') as zip_ref:\n",
    "        zip_ref.extractall(\".\")\n",
    "\n",
    "# Directory that will hold the compiled artifacts\n",
    "optimized_output_dir = \"optimized_outputs\"\n",
    "os.makedirs(optimized_output_dir, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the Model\n",
    "\n",
    "Forge supports loading models from various frameworks. The [guide on loading models](/optimize/content/guide/load/) describes the available methods.\n",
    "\n",
    "Here, we load the exported model with `onnx.load()` and then ingest it into Forge with `forge.from_onnx()`. This creates an IR object that Forge uses for all subsequent transformations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "onnx_model = onnx.load(f\"{model_path}/yolov8n.onnx\")\n",
    "ir = forge.from_onnx(onnx_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can use the [IR Object](/optimize/content/api/relay/irmodule/) to inspect the model. For instance, its `operators` attribute shows which operators the model uses. Knowing the operators helps you understand how a model will compile into machine code for a particular target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir.operators"
   ]
  },
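  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a cross-check that does not depend on Forge, you can also count operator types directly on the ONNX graph. The cell below is only a sketch: `count_onnx_ops` is a hypothetical helper name, and the tiny two-node model exists solely to keep the cell self-contained. In this notebook you could pass `onnx_model` instead. The names may differ from what `ir.operators` reports, because Forge works on its own IR rather than on the ONNX graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "from onnx import TensorProto, helper\n",
    "\n",
    "\n",
    "def count_onnx_ops(model):\n",
    "    \"\"\"Count how often each operator type appears in the top-level ONNX graph.\"\"\"\n",
    "    return Counter(node.op_type for node in model.graph.node)\n",
    "\n",
    "\n",
    "# Tiny stand-in model (Relu followed by Sigmoid), used only for illustration\n",
    "demo_in = helper.make_tensor_value_info(\"x\", TensorProto.FLOAT, [1, 4])\n",
    "demo_out = helper.make_tensor_value_info(\"y\", TensorProto.FLOAT, [1, 4])\n",
    "demo_nodes = [\n",
    "    helper.make_node(\"Relu\", [\"x\"], [\"h\"]),\n",
    "    helper.make_node(\"Sigmoid\", [\"h\"], [\"y\"]),\n",
    "]\n",
    "demo_model = helper.make_model(helper.make_graph(demo_nodes, \"demo\", [demo_in], [demo_out]))\n",
    "\n",
    "op_counts = count_onnx_ops(demo_model)\n",
    "print(op_counts.most_common())"
   ]
  },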
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An IR object is a graph, and graphs are easy to manipulate. IR objects can also be subjected to optimizations&mdash;quantization, for example&mdash;and eventually compiled to machine code. This makes an IR graph quite powerful, as it can be used to design, manipulate, and translate a representation of a machine learning model to create optimal machine code. Consult the [LEIP Optimize how-to guides](/optimize/content/guide/) for further details. It is important to note that some of these transforms are irreversible."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compiling the Model\n",
    "You can compile any IR graph to machine code, although some hardware targets may not support every model. Forge employs hardware-specific compiler toolchains to do this translation. For NVIDIA GPU targets, Forge uses [nvcc](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/) and the hardware-optimized libraries [cudnn](https://docs.nvidia.com/cudnn/index.html) and [tensorrt](https://docs.nvidia.com/deeplearning/tensorrt/index.html). In this tutorial, we'll target the GPU on the machine you're working on right now. For more compilation details, consult the [compilation guide](/optimize/content/guide/compile/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "target = \"cuda\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the target set to `cuda`, Forge automatically targets the accessible GPU and compiles the model with accelerated kernels from the CUDA libraries."
   ]
  },
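  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before compiling, it can help to confirm that a CUDA device is actually visible from this environment. The check below uses PyTorch, which this tutorial already installs. It is a sanity check only and says nothing about how Forge itself selects the device; `describe_cuda_device` is a hypothetical helper name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "\n",
    "\n",
    "def describe_cuda_device():\n",
    "    \"\"\"Return a short description of the first CUDA device visible to PyTorch, if any.\"\"\"\n",
    "    if not torch.cuda.is_available():\n",
    "        return \"No CUDA device visible to PyTorch\"\n",
    "    name = torch.cuda.get_device_name(0)\n",
    "    major, minor = torch.cuda.get_device_capability(0)\n",
    "    return f\"{name} (compute capability {major}.{minor})\"\n",
    "\n",
    "\n",
    "cuda_summary = describe_cuda_device()\n",
    "print(cuda_summary)"
   ]
  },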
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir.compile(target=target, output_path=f\"{optimized_output_dir}/cuda_fp32\", force_overwrite=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Pro Tip**: Some operations over a graph are irreversible; it's good practice to make a copy before you do such transforms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir_trt = ir.copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "NVIDIA also provides TensorRT, a more specialized library that offers additional device-specific acceleration. TensorRT does not support every operator, so we first partition the graph into subgraphs that TensorRT supports and subgraphs that it does not."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir_trt.partition_for_tensorrt()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once partitioned, you can use the same command as before to compile to TensorRT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir_trt.compile(target=target, output_path=f\"{optimized_output_dir}/trt_fp32\", force_overwrite=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both compilations above use the [`compile` function](/optimize/content/api/relay/irmodule/#forge.IRModule.compile). Note that the compiled outputs are written to `optimized_outputs/cuda_fp32` and `optimized_outputs/trt_fp32`. By default, Forge avoids overwriting compiled artifacts that already exist. We're forcing Forge to overwrite so we can observe the compilation process."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can take this compiled model to a deployment environment and [deploy it](/deploy/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also proceed further and attempt an optimization&mdash;in this case, quantization&mdash;on your model before deployment."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quantizing the Model\n",
    "Where quantization happens depends on the backend. For CUDA, it happens at compile time: you calibrate the model and then quantize it before compiling. For TensorRT, it happens at runtime: at compile time you only calibrate, and the quantized engine is generated when the model runs.\n",
    "\n",
    "For more details, consult the [how-to guide on quantization](/optimize/content/guide/quantize/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading a Calibration Dataset\n",
    "Let's conduct the same calibration we did in the [basics tutorial](/optimize/content/notebooks/BYOMwithForgeTutorialTVM/)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "class CustomImageDataset(Dataset):\n",
    "    \"\"\"Serve the first `end_index` images of a directory as calibration samples.\"\"\"\n",
    "\n",
    "    def __init__(self, img_dir, end_index=20, transform=None):\n",
    "        self.img_dir = img_dir\n",
    "        self.transform = transform\n",
    "        # Sort the file names so the calibration subset is the same on every run\n",
    "        self.img_labels = sorted(f for f in os.listdir(img_dir) if f.endswith(('jpg', 'jpeg', 'png')))\n",
    "        # Never index past the number of images actually present\n",
    "        self.end_index = min(end_index, len(self.img_labels))\n",
    "        self.img_labels = self.img_labels[:self.end_index]\n",
    "\n",
    "    def __len__(self):\n",
    "        return self.end_index\n",
    "\n",
    "    def __getitem__(self, idx):\n",
    "        img_path = os.path.join(self.img_dir, self.img_labels[idx])\n",
    "        image = Image.open(img_path).convert(\"RGB\")\n",
    "        if self.transform:\n",
    "            image = self.transform(image).unsqueeze(0)\n",
    "        return image"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "transform = transforms.Compose([\n",
    "    transforms.Resize((640, 640)),  # Resize the images to a fixed size\n",
    "    transforms.ToTensor(),          # Convert the images to tensors\n",
    "    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # Normalize the images\n",
    "])\n",
    "coco_dataset = CustomImageDataset(img_dir=dataset_dir, transform=transform)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Calibrating the Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will first calibrate for CUDA."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir.calibrate(coco_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For CUDA we have to do quantization at compile time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir.quantize(activation_dtype=\"uint8\", quant_type=\"static\")"
   ]
  },
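  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `quantize` call above requests static `uint8` activation quantization. As a rough illustration of the arithmetic that usually sits behind such a scheme, the sketch below maps a calibrated floating-point range onto 0..255 with a scale and a zero point. This is generic affine quantization in plain Python, not Forge's implementation; the function names and the example range of -1.0 to 3.0 are assumptions made for the example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def affine_qparams(x_min, x_max, qmin=0, qmax=255):\n",
    "    \"\"\"Derive scale and zero point from an observed (calibrated) value range.\"\"\"\n",
    "    # Widen the range to include 0.0 so that zero is exactly representable\n",
    "    x_min = min(x_min, 0.0)\n",
    "    x_max = max(x_max, 0.0)\n",
    "    scale = (x_max - x_min) / (qmax - qmin)\n",
    "    if scale == 0.0:\n",
    "        scale = 1.0  # Degenerate all-zero range\n",
    "    zero_point = int(round(qmin - x_min / scale))\n",
    "    return scale, max(qmin, min(qmax, zero_point))\n",
    "\n",
    "\n",
    "def quantize_value(x, scale, zero_point, qmin=0, qmax=255):\n",
    "    \"\"\"Map a float to an integer code, clamping values outside the calibrated range.\"\"\"\n",
    "    q = int(round(x / scale)) + zero_point\n",
    "    return max(qmin, min(qmax, q))\n",
    "\n",
    "\n",
    "def dequantize_value(q, scale, zero_point):\n",
    "    \"\"\"Map an integer code back to its approximate float value.\"\"\"\n",
    "    return (q - zero_point) * scale\n",
    "\n",
    "\n",
    "# Example range, as if calibration had observed activations in [-1.0, 3.0]\n",
    "scale, zero_point = affine_qparams(-1.0, 3.0)\n",
    "for value in (-1.0, 0.0, 1.0, 3.0, 10.0):\n",
    "    code = quantize_value(value, scale, zero_point)\n",
    "    print(value, code, dequantize_value(code, scale, zero_point))"
   ]
  },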
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once the model is quantized, we can compile it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir.compile(target=target, output_path=f\"{optimized_output_dir}/notebook_cuda_int8\", force_overwrite=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we will calibrate for TensorRT."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir_trt.calibrate(coco_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can compile a TensorRT model that allows us to generate a quantized engine during runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ir_trt.compile(target=target, output_path=f\"{optimized_output_dir}/trt_int8\", force_overwrite=True)"
   ]
  },
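  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With all four variants compiled, a quick way to compare them is to total the size of each output directory. The helper below is a hypothetical convenience, not part of Forge. It assumes only that each `output_path` used above is a directory, and a missing path simply reports 0. A smaller size is not guaranteed for the INT8 variants, in particular for TensorRT, where the quantized engine is generated at runtime."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "\n",
    "def dir_size_bytes(path):\n",
    "    \"\"\"Sum the sizes of all files under path; returns 0 if path does not exist.\"\"\"\n",
    "    total = 0\n",
    "    for root, _dirs, files in os.walk(path):\n",
    "        for name in files:\n",
    "            total += os.path.getsize(os.path.join(root, name))\n",
    "    return total\n",
    "\n",
    "\n",
    "# Output directory names match the output_path values used in this notebook\n",
    "artifact_sizes = {}\n",
    "for variant in (\"cuda_fp32\", \"trt_fp32\", \"notebook_cuda_int8\", \"trt_int8\"):\n",
    "    artifact_sizes[variant] = dir_size_bytes(os.path.join(\"optimized_outputs\", variant))\n",
    "    print(f\"{variant}: {artifact_sizes[variant] / 1e6:.2f} MB\")"
   ]
  },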
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Head over to [LEIP Deploy](https://docs.latentai.io/leip/deploy/latest/) to learn how to deploy this compiled model on your target with NVIDIA GPUs!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
