engine optimization service

Commit 6d1bcfcd authored Jun 08, 2023 by Sikhin VC
27 changed files
--- a/README.md
+++ b/README.md
-# yolo_model_optimization
-
--- a/jk_v5_cam_47.wts
+++ b/jk_v5_cam_47.wts
--- a/schemas/api_schema.py
+++ b/schemas/api_schema.py
+from __future__ import annotations
+
+from typing import Any, List, Optional
+
+from pydantic import BaseModel
+
+
+class optimization(BaseModel):
+    num_class: str
+    image_size: str
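Review note: the schema above takes both fields as strings, and the handler later casts them with `int()`. The sketch below shows checks that could run before the header is patched, using only the standard library. The names `OptimizationRequest` and `parse_request` are hypothetical and not part of this commit. The divisible-by-32 rule comes from the comment on `INPUT_H` in `yololayer.h`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimizationRequest:  # hypothetical name, not in the commit
    num_class: int
    image_size: int

    def __post_init__(self) -> None:
        # At least one class is needed for a detector head.
        if self.num_class < 1:
            raise ValueError("num_class must be >= 1")
        # yololayer.h notes that input height and width must be divisible by 32.
        if self.image_size <= 0 or self.image_size % 32 != 0:
            raise ValueError("image_size must be a positive multiple of 32")


def parse_request(num_class: str, image_size: str) -> OptimizationRequest:
    """Convert the string fields of the API schema into validated integers."""
    # int() raises ValueError for non-numeric input such as "abc".
    return OptimizationRequest(int(num_class), int(image_size))
```

For example, `parse_request("2", "416")` is accepted, while `parse_request("2", "400")` raises `ValueError` because 400 is not a multiple of 32.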
--- a/.gitignore
+++ b/.gitignore
--- a/Dockerfile
+++ b/Dockerfile
--- a/LICENSE
+++ b/LICENSE
--- a/tensorrtx/README.md
+++ b/tensorrtx/README.md
+# TensorRTx
+
+TensorRTx implements popular deep learning networks with the TensorRT network definition APIs. TensorRT has built-in parsers, including caffeparser, uffparser and onnxparser, but these parsers often fail with "unsupported operations or layers" errors, especially for state-of-the-art models that use new layer types.
+
+So why not skip the parsers altogether? We use the TensorRT network definition APIs to build the whole network; it is not that complicated.
+
+I wrote this project to get familiar with the TensorRT API, and also to share with and learn from the community.
+
+All the models are first implemented in pytorch/mxnet/tensorflow and exported to a weights file xxx.wts; TensorRT is then used to load the weights, define the network and run inference. Some pytorch implementations can be found in my repo [Pytorchx](https://github.com/wang-xinyu/pytorchx), the rest are from popular open-source implementations.
+
+## News
+
+- `23 May 2022`. [yhpark](https://github.com/yester31): Real-ESRGAN, Practical Algorithms for General Image/Video Restoration.
+- `15 Mar 2022`. [sky_hole](https://github.com/wdhao): Swin Transformer - Semantic Segmentation.
+- `19 Oct 2021`. [liuqi123123](https://github.com/liuqi123123) added cuda preprocessing for yolov5; preprocessing + inference is 3x faster when batchsize=8.
+- `18 Oct 2021`. [xupengao](https://github.com/xupengao): YOLOv5 updated to v6.0, supporting n/s/m/l/x/n6/s6/m6/l6/x6.
+- `31 Aug 2021`. [FamousDirector](https://github.com/FamousDirector): update retinaface to support TensorRT 8.0.
+- `27 Aug 2021`. [HaiyangPeng](https://github.com/HaiyangPeng): add a python wrapper for hrnet segmentation.
+- `1 Jul 2021`. [freedenS](https://github.com/freedenS): DE⫶TR: End-to-End Object Detection with Transformers. First Transformer model!
+- `10 Jun 2021`. [upczww](https://github.com/upczww): EfficientNet b0-b8 and l2.
+- `23 May 2021`. [SsisyphusTao](https://github.com/SsisyphusTao): CenterNet DLA-34 with DCNv2 plugin.
+- `17 May 2021`. [ybw108](https://github.com/ybw108): arcface LResNet100E-IR and MobileFaceNet.
+- `6 May 2021`. [makaveli10](https://github.com/makaveli10): scaled-yolov4 yolov4-csp.
+- `29 Apr 2021`. [upczww](https://github.com/upczww): hrnet segmentation w18/w32/w48, ocr branch also.
+- `28 Apr 2021`. [aditya-dl](https://github.com/aditya-dl): mobilenetv2, alexnet, densenet121, mobilenetv3 with python API.
+- `26 Apr 2021`. [makaveli10](https://github.com/makaveli10) add Inceptionv4.
+- `25 Apr 2021`. YOLOv5 updated to v5.0, supporting s/m/l/x/s6/m6/l6/x6.
+
+## Tutorials
+
+- [Install the dependencies.](./tutorials/install.md)
+- [A guide for quickly getting started, taking lenet5 as a demo.](./tutorials/getting_started.md)
+- [The .wts file content format](./tutorials/getting_started.md#the-wts-content-format)
+- [Frequently Asked Questions (FAQ)](./tutorials/faq.md)
+- [Migrating from TensorRT 4 to 7](./tutorials/migrating_from_tensorrt_4_to_7.md)
+- [How to implement multi-GPU processing, taking YOLOv4 as example](./tutorials/multi_GPU_processing.md)
+- [Check if Your GPU supports FP16/INT8](./tutorials/check_fp16_int8_support.md)
+- [How to Compile and Run on Windows](./tutorials/run_on_windows.md)
+- [Deploy YOLOv4 with Triton Inference Server](https://github.com/isarsoft/yolov4-triton-tensorrt)
+- [From pytorch to trt step by step, hrnet as example(Chinese)](./tutorials/from_pytorch_to_trt_stepbystep_hrnet.md)
+
+## Test Environment
+
+1. TensorRT 7.x
+2. TensorRT 8.x (some of the models support 8.x)
+
+## How to run
+
+Each folder has a readme inside, which explains how to run the models inside.
+
+## Models
+
+The following models are implemented.
+
+|Name | Description |
+|-|-|
+|[mlp](./mlp) | the very basic model for starters, properly documented |
+|[lenet](./lenet) | the simplest, as a "hello world" of this project |
+|[alexnet](./alexnet)| easy to implement, all layers are supported in tensorrt |
+|[googlenet](./googlenet)| GoogLeNet (Inception v1) |
+|[inception](./inception)| Inception v3, v4 |
+|[mnasnet](./mnasnet)| MNASNet with depth multiplier of 0.5 from the paper |
+|[mobilenet](./mobilenet)| MobileNet v2, v3-small, v3-large |
+|[resnet](./resnet)| resnet-18, resnet-50 and resnext50-32x4d are implemented |
+|[senet](./senet)| se-resnet50 |
+|[shufflenet](./shufflenetv2)| ShuffleNet v2 with 0.5x output channels |
+|[squeezenet](./squeezenet)| SqueezeNet 1.1 model |
+|[vgg](./vgg)| VGG 11-layer model |
+|[yolov3-tiny](./yolov3-tiny)| weights and pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3) |
+|[yolov3](./yolov3)| darknet-53, weights and pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3) |
+|[yolov3-spp](./yolov3-spp)| darknet-53, weights and pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3) |
+|[yolov4](./yolov4)| CSPDarknet53, weights from [AlexeyAB/darknet](https://github.com/AlexeyAB/darknet#pre-trained-models), pytorch implementation from [ultralytics/yolov3](https://github.com/ultralytics/yolov3) |
+|[yolov5](./yolov5)| yolov5 v1.0-v6.0, pytorch implementation from [ultralytics/yolov5](https://github.com/ultralytics/yolov5) |
+|[retinaface](./retinaface)| resnet50 and mobilenet0.25, weights from [biubug6/Pytorch_Retinaface](https://github.com/biubug6/Pytorch_Retinaface) |
+|[arcface](./arcface)| LResNet50E-IR, LResNet100E-IR and MobileFaceNet, weights from [deepinsight/insightface](https://github.com/deepinsight/insightface) |
+|[retinafaceAntiCov](./retinafaceAntiCov)| mobilenet0.25, weights from [deepinsight/insightface](https://github.com/deepinsight/insightface), retinaface anti-COVID-19, detect face and mask attribute |
+|[dbnet](./dbnet)| Scene Text Detection, weights from [BaofengZan/DBNet.pytorch](https://github.com/BaofengZan/DBNet.pytorch) |
+|[crnn](./crnn)| pytorch implementation from [meijieru/crnn.pytorch](https://github.com/meijieru/crnn.pytorch) |
+|[ufld](./ufld)| pytorch implementation from [Ultra-Fast-Lane-Detection](https://github.com/cfzd/Ultra-Fast-Lane-Detection), ECCV2020 |
+|[hrnet](./hrnet)| hrnet-image-classification and hrnet-semantic-segmentation, pytorch implementation from [HRNet-Image-Classification](https://github.com/HRNet/HRNet-Image-Classification) and [HRNet-Semantic-Segmentation](https://github.com/HRNet/HRNet-Semantic-Segmentation) |
+|[psenet](./psenet)| PSENet Text Detection, tensorflow implementation from [liuheng92/tensorflow_PSENet](https://github.com/liuheng92/tensorflow_PSENet) |
+|[ibnnet](./ibnnet)| IBN-Net, pytorch implementation from [XingangPan/IBN-Net](https://github.com/XingangPan/IBN-Net), ECCV2018 |
+|[unet](./unet)| U-Net, pytorch implementation from [milesial/Pytorch-UNet](https://github.com/milesial/Pytorch-UNet) |
+|[repvgg](./repvgg)| RepVGG, pytorch implementation from [DingXiaoH/RepVGG](https://github.com/DingXiaoH/RepVGG) |
+|[lprnet](./lprnet)| LPRNet, pytorch implementation from [xuexingyu24/License_Plate_Detection_Pytorch](https://github.com/xuexingyu24/License_Plate_Detection_Pytorch) |
+|[refinedet](./refinedet)| RefineDet, pytorch implementation from [luuuyi/RefineDet.PyTorch](https://github.com/luuuyi/RefineDet.PyTorch) |
+|[densenet](./densenet)| DenseNet-121, from torchvision.models |
+|[rcnn](./rcnn)| FasterRCNN and MaskRCNN, model from [detectron2](https://github.com/facebookresearch/detectron2) |
+|[tsm](./tsm)| TSM: Temporal Shift Module for Efficient Video Understanding, ICCV2019 |
+|[scaled-yolov4](./scaled-yolov4)| yolov4-csp, pytorch from [WongKinYiu/ScaledYOLOv4](https://github.com/WongKinYiu/ScaledYOLOv4) |
+|[centernet](./centernet)| CenterNet DLA-34, pytorch from [xingyizhou/CenterNet](https://github.com/xingyizhou/CenterNet) |
+|[efficientnet](./efficientnet)| EfficientNet b0-b8 and l2, pytorch from [lukemelas/EfficientNet-PyTorch](https://github.com/lukemelas/EfficientNet-PyTorch) |
+|[detr](./detr)| DE⫶TR, pytorch from [facebookresearch/detr](https://github.com/facebookresearch/detr) |
+|[swin-transformer](./swin-transformer)| Swin Transformer - Semantic Segmentation, only support Swin-T. The Pytorch implementation is [microsoft/Swin-Transformer](https://github.com/microsoft/Swin-Transformer.git) |
+|[real-esrgan](./real-esrgan)| Real-ESRGAN. The Pytorch implementation is [real-esrgan](https://github.com/xinntao/Real-ESRGAN) |
+
+## Model Zoo
+
+The .wts files can be downloaded from the model zoo for quick evaluation. However, it is recommended to convert the .wts from a pytorch/mxnet/tensorflow model, so that you can retrain your own model.
+
+[GoogleDrive](https://drive.google.com/drive/folders/1Ri0IDa5OChtcA3zjqRTW57uG6TnfN4Do?usp=sharing) | [BaiduPan](https://pan.baidu.com/s/19s6hO8esU7-TtZEXN7G3OA) pwd: uvv2
+
+## Tricky Operations
+
+Some tricky operations were encountered in these models. They are already solved, but there might be better solutions.
+
+|Name | Description |
+|-|-|
+|BatchNorm| Implemented by a scale layer, used in resnet, googlenet, mobilenet, etc. |
+|MaxPool2d(ceil_mode=True)| use a padding layer before maxpool to solve ceil_mode=True, see googlenet. |
+|average pool with padding| use setAverageCountExcludesPadding() when necessary, see inception. |
+|relu6| use `Relu6(x) = Relu(x) - Relu(x-6)`, see mobilenet. |
+|torch.chunk()| implement the 'chunk(2, dim=C)' by tensorrt plugin, see shufflenet. |
+|channel shuffle| use two shuffle layers to implement `channel_shuffle`, see shufflenet. |
+|adaptive pool| use fixed input dimension, and use regular average pooling, see shufflenet. |
+|leaky relu| I wrote a leaky relu plugin, but PRelu in `NvInferPlugin.h` can be used, see yolov3 in branch `trt4`. |
+|yolo layer v1| yolo layer is implemented as a plugin, see yolov3 in branch `trt4`. |
+|yolo layer v2| three yolo layers implemented in one plugin, see yolov3-spp. |
+|upsample| replaced by a deconvolution layer, see yolov3. |
+|hsigmoid| hard sigmoid is implemented as a plugin, hsigmoid and hswish are used in mobilenetv3 |
+|retinaface output decode| implement a plugin to decode bbox, confidence and landmarks, see retinaface. |
+|mish| mish activation is implemented as a plugin, mish is used in yolov4 |
+|prelu| mxnet's prelu activation with trainable gamma is implemented as a plugin, used in arcface |
+|HardSwish| hard_swish = x * hard_sigmoid, used in yolov5 v3.0 |
+|LSTM| Implemented pytorch nn.LSTM() with tensorrt api |
+
+## Speed Benchmark
+
+| Models | Device | BatchSize | Mode | Input Shape(HxW) | FPS |
+|-|-|:-:|:-:|:-:|:-:|
+| YOLOv3-tiny | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 333 |
+| YOLOv3(darknet53) | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 39.2 |
+| YOLOv3(darknet53) | Xeon E5-2620/GTX1080 | 1 | INT8 | 608x608 | 71.4 |
+| YOLOv3-spp(darknet53) | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 38.5 |
+| YOLOv4(CSPDarknet53) | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 35.7 |
+| YOLOv4(CSPDarknet53) | Xeon E5-2620/GTX1080 | 4 | FP32 | 608x608 | 40.9 |
+| YOLOv4(CSPDarknet53) | Xeon E5-2620/GTX1080 | 8 | FP32 | 608x608 | 41.3 |
+| YOLOv5-s v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 142 |
+| YOLOv5-s v3.0 | Xeon E5-2620/GTX1080 | 4 | FP32 | 608x608 | 173 |
+| YOLOv5-s v3.0 | Xeon E5-2620/GTX1080 | 8 | FP32 | 608x608 | 190 |
+| YOLOv5-m v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 71 |
+| YOLOv5-l v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 43 |
+| YOLOv5-x v3.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 29 |
+| YOLOv5-s v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 142 |
+| YOLOv5-m v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 71 |
+| YOLOv5-l v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 40 |
+| YOLOv5-x v4.0 | Xeon E5-2620/GTX1080 | 1 | FP32 | 608x608 | 27 |
+| RetinaFace(resnet50) | Xeon E5-2620/GTX1080 | 1 | FP32 | 480x640 | 90 |
+| RetinaFace(resnet50) | Xeon E5-2620/GTX1080 | 1 | INT8 | 480x640 | 204 |
+| RetinaFace(mobilenet0.25) | Xeon E5-2620/GTX1080 | 1 | FP32 | 480x640 | 417 |
+| ArcFace(LResNet50E-IR) | Xeon E5-2620/GTX1080 | 1 | FP32 | 112x112 | 333 |
+| CRNN | Xeon E5-2620/GTX1080 | 1 | FP32 | 32x100 | 1000 |
+
+Help wanted: if you have speed results, please open an issue or PR.
+
+## Acknowledgments & Contact
+
+Any contributions, questions and discussions are welcome; contact me using the information below.
+
+E-mail: wangxinyu_es@163.com
+
+WeChat ID: wangxinyu0375 (add me on WeChat to join the tensorrtx discussion group, **note: tensorrtx**)
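Review note: several rows in the README's "Tricky Operations" table are arithmetic identities that can be checked numerically. The sketch below, in plain Python, covers three of them: BatchNorm folded into a per-channel scale and shift, `Relu6(x) = Relu(x) - Relu(x-6)`, and `hard_swish = x * hard_sigmoid`. All function names are illustrative. The hard sigmoid is assumed to follow PyTorch's definition `relu6(x + 3) / 6`, which the README does not state.

```python
import math


def relu(x: float) -> float:
    return max(0.0, x)


def relu6_via_relu(x: float) -> float:
    # Identity from the table: Relu6(x) = Relu(x) - Relu(x - 6)
    return relu(x) - relu(x - 6.0)


def hard_sigmoid(x: float) -> float:
    # Assumption: PyTorch-style hard sigmoid, relu6(x + 3) / 6
    return min(max(x + 3.0, 0.0), 6.0) / 6.0


def hard_swish(x: float) -> float:
    # Identity from the table: hard_swish = x * hard_sigmoid
    return x * hard_sigmoid(x)


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    # Reference inference-time BatchNorm for a single channel.
    return (x - mean) / math.sqrt(var + eps) * gamma + beta


def fold_batchnorm(gamma, beta, mean, var, eps=1e-5):
    # Rewrite BatchNorm as y = scale * x + shift, the form a scale layer computes.
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift
```

With `scale, shift = fold_batchnorm(gamma, beta, mean, var)`, the value `scale * x + shift` agrees with `batchnorm(x, gamma, beta, mean, var)` up to floating-point rounding, which is why a scale layer can stand in for BatchNorm at inference time.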
--- a/yolov5/CMakeLists.txt
+++ b/yolov5/CMakeLists.txt
--- a/yolov5/README.md
+++ b/yolov5/README.md
--- a/yolov5/calibrator.cpp
+++ b/yolov5/calibrator.cpp
--- a/yolov5/calibrator.h
+++ b/yolov5/calibrator.h
--- a/yolov5/common.hpp
+++ b/yolov5/common.hpp
--- a/yolov5/cuda_utils.h
+++ b/yolov5/cuda_utils.h
--- a/yolov5/gen_wts.py
+++ b/yolov5/gen_wts.py
--- a/yolov5/logging.h
+++ b/yolov5/logging.h
--- a/yolov5/macros.h
+++ b/yolov5/macros.h
--- a/yolov5/preprocess.cu
+++ b/yolov5/preprocess.cu
--- a/yolov5/preprocess.h
+++ b/yolov5/preprocess.h
--- a/yolov5/samples
+++ b/yolov5/samples
--- a/yolov5/utils.h
+++ b/yolov5/utils.h
--- a/yolov5/yololayer.cu
+++ b/yolov5/yololayer.cu
--- a/yolov5/yololayer.h
+++ b/yolov5/yololayer.h
@@ -17,9 +17,9 @@ namespace Yolo
        float anchors[CHECK_COUNT * 2];
    };
    static constexpr int MAX_OUTPUT_BBOX_COUNT = 1000;
-    static constexpr int CLASS_NUM = 80;
-    static constexpr int INPUT_H = 640;  // yolov5's input height and width must be divisible by 32.
-    static constexpr int INPUT_W = 640;
+    static constexpr int CLASS_NUM = 2;
+    static constexpr int INPUT_H = 416;
+    static constexpr int INPUT_W = 416;

    static constexpr int LOCATIONS = 4;
    struct alignas(float) Detection {
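Review note: `yolo_optimization.py` rewrites these constants by fixed line index (`data[19]` to `data[21]`), which breaks silently if the header gains or loses a line. A minimal sketch of a name-based alternative follows. `set_constant` and `patch_header` are hypothetical helpers, and the pattern assumes the `static constexpr int NAME = <digits>;` layout shown in this hunk.

```python
import re


def set_constant(header_text: str, name: str, value: int) -> str:
    """Replace the integer assigned to `static constexpr int <name>`."""
    pattern = re.compile(
        rf"^([ \t]*static constexpr int {re.escape(name)}[ \t]*=[ \t]*)\d+([ \t]*;.*)$",
        re.MULTILINE,
    )
    # Keep everything around the number (indentation, trailing comment) intact.
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}{m.group(2)}", header_text
    )
    if count != 1:
        # Fail loudly instead of writing a header with a missing or duplicated constant.
        raise ValueError(f"expected exactly one definition of {name}, found {count}")
    return new_text


def patch_header(header_text: str, num_class: int, image_size: int) -> str:
    """Set CLASS_NUM, INPUT_H and INPUT_W in the header text."""
    text = set_constant(header_text, "CLASS_NUM", num_class)
    text = set_constant(text, "INPUT_H", image_size)
    text = set_constant(text, "INPUT_W", image_size)
    return text
```

Because each constant is looked up by name and must match exactly once, a header that defines `INPUT_H` twice or lacks `INPUT_W` raises an error rather than producing a file that fails to compile later.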

--- a/yolov5/yololayer_bkp.h
+++ b/yolov5/yololayer_bkp.h
--- a/yolov5/yolov5.cpp
+++ b/yolov5/yolov5.cpp
--- a/yolov5/yolov5_trt.py
+++ b/yolov5/yolov5_trt.py
--- a/yolov5/yolov5_trt_cuda_python.py
+++ b/yolov5/yolov5_trt_cuda_python.py
--- a/yolo_optimization.py
+++ b/yolo_optimization.py
+import glob
 import os
 import subprocess
 from loguru import logger
 import shutil
+from fastapi import FastAPI
+from schemas.api_schema import optimization

+app = FastAPI()

 class ModelOptimization:
-    def __init__(self, num_class, weight_path, image_size=416):
+    def __init__(self, num_class, image_size=416):
        self.num_class = num_class
        self.image_size = image_size
-        self.weight_path = weight_path

    def change_configurations(self):
        logger.info(f"Provided number of classes and image size are :  {self.num_class} and {self.image_size}")
        try:

-            with open('yolov5/yololayer.h', 'r') as file:
+            with open('tensorrtx/yolov5/yololayer.h', 'r') as file:
                # read a list of lines into data
                data = file.readlines()

            data[19] = f"    static constexpr int CLASS_NUM = {self.num_class};\n"
            data[20] = f"    static constexpr int INPUT_H = {self.image_size};\n"
-            data[21] = f"    static constexpr int INPUT_W = {self.image_size};\n"
+            data[21] = f"    static constexpr int INPUT_W = {self.image_size};\n"  # width constant, not a second INPUT_H

            # and write everything back
-            with open('yolov5/yololayer.h', 'w') as file:
+            with open('tensorrtx/yolov5/yololayer.h', 'w') as file:
                file.writelines(data)
            logger.info("Successfully changed configurations")
        except Exception as e:
            logger.info(f"Failed to change configurations :  {e}")

-    def optimize_model(self):
+    def optimize_model(self, weight_path):
        try:
-
-
+            shutil.copy(weight_path, 'tensorrtx/yolov5/build')
+            weight_name_with_extension = os.path.basename(weight_path)
+            weight_name, extension = os.path.splitext(weight_name_with_extension)
            current_directory = os.getcwd()
-
            logger.info(f"Current directory is :  {current_directory}")
-            build_path = os.path.join(current_directory, "yolov5", "build")
-            if os.path.isdir('yolov5/build'):
-                logger.info("build directory exists. Removing build directory!!")
-                shutil.rmtree('yolov5/build')
-            os.mkdir(build_path)
-            weight_name_with_extension = os.path.basename(self.weight_path)
-            weight_name, extension = os.path.splitext(weight_name_with_extension)
-            src = f"{current_directory}/yolov5/build/"
-            shutil.copy(self.weight_path, src)
-            logger.info(f"Created build folder")
-            os.chdir('yolov5/build')
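+            build_files = glob.glob("tensorrtx/yolov5/build/*")
+            for file in build_files:
+                # Keep the weights file copied above; skip directories, which os.remove cannot delete.
+                if os.path.isfile(file) and os.path.basename(file) != weight_name_with_extension:
+                    os.remove(file)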
+
+            # build_path = os.path.join(current_directory, "yolov5", "build")
+            # os.mkdir(build_path)
+            # logger.info(f"Created build folder")
+            os.chdir('tensorrtx/yolov5/build')
            logger.info("Running CMake command")
            subprocess.run(['cmake', '..'])
            logger.info("Running Make command")
            subprocess.run(['make'])
            logger.info("Optimizing model")
-            engine_name = weight_name + ".engine"
+            engine_name = "best.engine"
            subprocess.run(["sudo", "./yolov5", "-s", weight_name_with_extension, engine_name, "c", "0.33", "0.50"])

        except Exception as e:
            logger.info(f"Failed to optimize model :  {e}")

-
-obj = ModelOptimization(num_class=1,weight_path="/home/ilens/cam_42_best.wts", image_size=416)
-obj.change_configurations()
-obj.optimize_model()
+@app.post("/optimize")
+async def root(content: optimization):
+    # print(content.dict())
+    obj = ModelOptimization(num_class=int(content.num_class), image_size=int(content.image_size))
+    obj.change_configurations()
+    obj.optimize_model(weight_path = "jk_v5_cam_47.wts")
+    return {"message": "successful"}
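Review note: `optimize_model` switches directories with `os.chdir` and never switches back, so a second request to `/optimize` would resolve `tensorrtx/yolov5/...` relative to the build folder. It also ignores the exit status of `cmake`, `make` and `./yolov5`. The sketch below shows one way to address both using documented `subprocess.run` parameters (`cwd`, `check`). The helper name `run_in_dir` is hypothetical.

```python
import subprocess
from typing import List


def run_in_dir(command: List[str], directory: str) -> str:
    """Run `command` inside `directory` without changing the caller's working directory.

    check=True raises subprocess.CalledProcessError on a non-zero exit status,
    so a failed build step is not reported as success.
    """
    completed = subprocess.run(
        command,
        cwd=directory,        # only the child process runs in this directory
        check=True,
        capture_output=True,  # collect stdout/stderr so they can be logged
        text=True,
    )
    return completed.stdout
```

Under that assumption, the build steps in the handler would become calls such as `run_in_dir(["cmake", ".."], "tensorrtx/yolov5/build")` followed by `run_in_dir(["make"], "tensorrtx/yolov5/build")`, and the existing `except` block would then see real failures.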