Model Quantization and Pruning¶

Complex models are conducive to better model performance, but they may also lead to certain redundancy. This section presents ways to streamline the model, including model quantization (quantization training and offline quantization) and model pruning.

Model quantization reduces the full precision to a fixed number of points to lower the redundancy and achieve the purpose of simplifying the model computation and improving model inference performance. Model quantization can reduce the size of model parameters by converting its precision from FP32 to Int8 without losing model precision, followed by accelerated computation, creating a quantized model with more speed advantages when deployed on mobile devices.

Model pruning decreases the number of model parameters by cutting out the unimportant convolutional kernels in the CNN, thus bringing down the computational complexity.

This tutorial explains how to use PaddleSlim, PaddlePaddle's model compression library, for PaddleClas compression, i.e., pruning and quantization. PaddleSlim integrates a variety of common and leading model compression functions such as model pruning, quantization (including quantization training and offline quantization), distillation, and neural network search. If you are interested, please follow us and learn more.

To start with, you are recommended to learn PaddleClas Training and PaddleSlim, see Model Pruning and Quantization Algorithms for related pruning and quantization methods.

Catalogue¶

1. Prepare the Environment¶

Once a model has been trained, you can adopt quantization or pruning to further compress the model size and speed up the inference.

Five steps are included：

Install PaddleSlim
Prepare the trained the model
Compress the model
Export quantized inference model
Inference and deployment of the quantized model

1.1 Install PaddleSlim¶

You can adopt pip install for installation.

pip install paddleslim -i https://pypi.tuna.tsinghua.edu.cn/simple

You can also install it from the source code with the latest features of PaddleSlim.

git clone https://github.com/PaddlePaddle/PaddleSlim.git
cd Paddleslim
python3.7 setup.py install

1.2 Prepare the Trained Model¶

PaddleClas offers a list of trained models. If the model to be quantized is not in the list, you need to follow the regular training method to get the trained model.

2. Quick Start¶

Go to PaddleClas root directory

cd PaddleClas

Related code for slim training has been integrated under ppcls/engine/, and the offline quantization code can be found in deploy/slim/quant_post_static.py.

2.1 Model Quantization¶

Quantization training includes offline and online training. Online quantitative training, the more effective one, requires loading a pre-trained model, which can be quantized after defining the strategy.

2.1.1 Online Quantization Training¶

Try the following command：

CPU/Single GPU

Take CPU for example, if you use GPU, change the cpu to gpu.

python3.7 tools/train.py -c ppcls/configs/slim/ResNet50_vd_quantization.yaml -o Global.device=cpu

The parsing of the yaml file is described in reference document. For accuracy, the pretrained model has already been adopted by the yaml file.

Launch in single-machine multi-card/ multi-machine multi-card mode

export CUDA_VISIBLE_DEVICES=0,1,2,3
python3.7 -m paddle.distributed.launch \
    --gpus="0,1,2,3" \
      tools/train.py \
      -c ppcls/configs/slim/ResNet50_vd_quantization.yaml

2.1.2 Offline Quantization¶

Note: Currently, the inference model exported from the trained model is a must for offline quantization. See the tutorial for general export of the inference model.

Normally, offline quantization may lose more accuracy.

After generating the inference model, the offline quantization is run as follows:

python3.7 deploy/slim/quant_post_static.py -c ppcls/configs/ImageNet/ResNet/ResNet50_vd.yaml -o Global.save_inference_dir=./deploy/models/class_ResNet50_vd_ImageNet_infer

The inference model is stored inGlobal.save_inference_dir.

Successfully executed, the quant_post_static_model folder is created in the Global.save_inference_dir, where the generated offline quantization models are stored and can be deployed directly without re-exporting the models.

2.2 Model Pruning¶

Trying the following command：

CPU/Single GPU

Take CPU for example, if you use GPU, change the cpu to gpu.

python3.7 tools/train.py -c ppcls/configs/slim/ResNet50_vd_prune.yaml -o Global.device=cpu

Launch in single-machine single-card/ single-machine multi-card/ multi machine multi-card mode

export CUDA_VISIBLE_DEVICES=0,1,2,3
python3.7 -m paddle.distributed.launch \
    --gpus="0,1,2,3" \
      tools/train.py \
      -c ppcls/configs/slim/ResNet50_vd_prune.yaml

3. Export the Model¶

Having obtained the saved model after online quantization training and pruning, it can be exported as an inference model for inference deployment. Here we take model pruning as an example:

python3.7 tools/export.py \
    -c ppcls/configs/slim/ResNet50_vd_prune.yaml \
    -o Global.pretrained_model=./output/ResNet50_vd/best_model \
    -o Global.save_inference_dir=./inference

4. Deploy the Model¶

The exported model can be deployed directly using inference, please refer to inference deployment.

You can also use PaddleLite's opt tool to convert the inference model to a mobile model for its mobile deployment. Please refer to Mobile Model Deployment for more details.

5. Hyperparameter Training¶

For quantization and pruning training, it is recommended to load the pre-trained model obtained from conventional training to accelerate the convergence of quantization training.
For quantization training, it is recommended to modify the initial learning rate to 1/20~1/10 of the conventional training and the number of training epochs to 1/5~1/2, while adding Warmup to the learning rate strategy. Please make no other modifications to the configuration information.
For pruning training, the hyperparameter configuration is recommended to remain the same as the regular training.