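[Model Inference] Quantization Implementation Series, Part 1: A Detailed Look at the min-max Symmetric Quantization Algorithm
Hi everyone, I'm 極智視界. This article dissects the implementation of the min-max symmetric quantization algorithm, using Tengine's implementation as the example.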
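Tengine is an excellent open-source edge-side deep learning inference framework from OpenAILab. Its core is written mainly in C, and the functional code wrapped around it mixes in C++. Quantization is an indispensable optimization step for inference acceleration. Mature inference frameworks usually split the quantization module out into a standalone toolset, as Tengine, NCNN, Ascend, and Cambricon all do. The main reason is that the quantization process is not tightly coupled to the hardware, so decoupling it makes the tool more broadly useful.
min-max and KL are the basic, standard quantization algorithms that hardware vendors support when adapting an inference engine. KL quantization is especially popular; NVIDIA's TensorRT, for example, uses a KL-based strategy. min-max, the subject here, has simple logic and gives good results, which makes it a suitable opener for this series on quantization implementations. Let's walk through how Tengine implements the min-max quantization strategy.
Quantization splits into activation (dynamic) quantization and weight & bias (static) quantization. Weight & bias quantization has the larger impact on accuracy. Activation quantization affects the overall result less, but it still has to be done for the whole pipeline to reach a satisfactory result. In general, weights & biases are quantized per channel (perChannel), while activations are quantized per layer (perLayer). Here is why: convolution dominates the quantization workload, and quantizing activations per channel would conflict with the parameter sharing of convolution kernels. Each output value sums products over all input channels, so a separate scale per input channel could not be factored out of the integer accumulation. Activations therefore normally use per-layer quantization, which matches convolution's parameter sharing.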
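To make the per-layer versus per-channel distinction concrete, here is a minimal C++ sketch. It is not Tengine code: the helper names (abs_max, per_layer_scale, per_channel_scales) are made up for illustration, and it assumes the symmetric rule scale = max(|x|) / 127 and a channel-major layout in which every channel occupies one contiguous, equally sized block.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute value in [first, last); 0 for an empty range.
static float abs_max(const float* first, const float* last) {
    float m = 0.f;
    for (const float* p = first; p != last; ++p) m = std::max(m, std::fabs(*p));
    return m;
}

// Per-layer: a single symmetric scale shared by the whole tensor.
float per_layer_scale(const std::vector<float>& data) {
    return abs_max(data.data(), data.data() + data.size()) / 127.f;
}

// Per-channel: one symmetric scale for each channel.
// Assumption: `data` holds channel_num contiguous blocks of equal size.
std::vector<float> per_channel_scales(const std::vector<float>& data, int channel_num) {
    assert(channel_num > 0 && data.size() % static_cast<std::size_t>(channel_num) == 0);
    const std::size_t cstep = data.size() / static_cast<std::size_t>(channel_num);
    std::vector<float> scales(static_cast<std::size_t>(channel_num));
    for (std::size_t ch = 0; ch < scales.size(); ++ch) {
        const float* begin = data.data() + ch * cstep;
        scales[ch] = abs_max(begin, begin + cstep) / 127.f;
    }
    return scales;
}
```

With the values {0.5, -1.27, 12.7, -2.54} split into two channels, the per-layer scale is about 0.1, while the per-channel scales are about 0.01 and 0.1, so the small-valued first channel keeps roughly ten times more resolution.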
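Here are the arguments that the Tengine quantization tool takes:
Input model: the input fp32 tmfile model file;
Output model: the generated int8 tmfile model file;
Calib images: the calibration images used for activation quantization;
Scale file: the generated calibration table file;
Algorithm: the quantization algorithm; the options are MIN-MAX, KL, ACIQ, DFQ, and EQ;
Dims: the shape of the calibration input, given as three dimensions c h w; n is hard-coded to 1 in the code;
Center crop: image preprocessing, cropping;
Letter box: image preprocessing, resizing the image while preserving its aspect ratio;
YOLOv5 focus: preprocessing modeled on the focus step in YOLOv5;
Thread num: the number of threads used for quantization.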
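2. min-max quantization
min-max is the simplest quantization algorithm. Its main logic is: find the minimum and maximum of a tensor, take scale = max(|min|, |max|) / 127, and map each fp32 value to int8 as round(value / scale) clipped to [-127, 127]. Because the range is symmetric around zero, the zero point is always 0.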
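The min-max rule can be sketched in a few lines of C++. This is an illustrative sketch, not Tengine's code: the function names minmax_scale, quantize, and dequantize are hypothetical, and the formulas follow the scale = max(|min|, |max|) / 127 and clip(-127, 127) comments that appear in the Tengine excerpts later in this article.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <utility>
#include <vector>

// Symmetric min-max scale: scale = max(|min|, |max|) / 127.
float minmax_scale(const std::vector<float>& x) {
    assert(!x.empty());
    auto mm = std::minmax_element(x.begin(), x.end());
    return std::max(std::fabs(*mm.first), std::fabs(*mm.second)) / 127.f;
}

// q = clip(round(x / scale), -127, 127); the zero point is always 0.
int8_t quantize(float x, float scale) {
    if (scale == 0.f) return 0;  // all-zero tensor: every value maps to 0
    float q = std::round(x / scale);
    q = std::min(127.f, std::max(-127.f, q));
    return static_cast<int8_t>(q);
}

// Map an int8 value back to an approximate fp32 value.
float dequantize(int8_t q, float scale) {
    return static_cast<float>(q) * scale;
}
```

For a tensor {-2.0, 0.5, 1.0} the scale is 2 / 127, so -2.0 maps to -127 and 0.5 maps to 32. Dequantizing 32 gives about 0.504, an error below half a quantization step.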
The main code that implements the min-max method in Tengine is as follows:
if (quant_tool.scale_file.empty()){
    quant_tool.scale_file = "table_minmax.scale";
    quant_tool.activation_quant_tool();
}
save_graph_i8_perchannel(quant_tool.model_file.c_str(), quant_tool.scale_file.c_str(), quant_tool.output_file, quant_tool.inplace, false);
/* Evaluate quantitative losses */
if (quant_tool.evaluate){
    fprintf(stderr, "[Quant Tools Info]: Step Evaluate, evaluate quantitative losses\n");
}
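The two key quantization interfaces here are quant_tool.activation_quant_tool() and save_graph_i8_perchannel. For min-max, they do two things respectively:
(1) Activation quantization, which generates table_minmax.scale;
(2) Weight & bias quantization, which generates scale_weight.txt and scale_bias.txt.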
2.1 Activation quantization
When reading the Tengine source, always keep hold of struct graph* ir_graph; the graph struct is the heart of it.
// Bind input_tensor to the address of input_data; input_tensor points into ir_graph->tensor_list.
// Note: do not skip this step, otherwise the code that follows is hard to understand.
tensor_t input_tensor = get_graph_input_tensor(ir_graph, 0, 0);
if (set_tensor_shape(input_tensor, dims, 4) < 0){
    fprintf(stderr, "Set input tensor shape failed\n");
    return -1;
}
if (set_tensor_buffer(input_tensor, input_data.data(), img_size * sizeof(float)) < 0){
    fprintf(stderr, "Set input tensor buffer failed\n");
    return -1;
}
// prerun graph: perform some initialization
if (prerun_graph_multithread(ir_graph, this->opt) < 0){
    fprintf(stderr, "Prerun multithread graph failed.\n");
    return -1;
}
// Image preprocessing writes into input_data, which is bound to input_tensor (the graph input in ir_graph->tensor_list[0]).
// Modifying input_data therefore modifies ir_graph->tensor_list.
get_input_data_cv(imgs_list[nums].c_str(), input_data.data(), img_c, img_h, img_w, mean, scale, sw_RGB, center_crop, letterbox_rows, letterbox_cols, focus);
Then run the graph once, which records the intermediate activations in ir_graph->tensor_list[i]:
if (run_graph(ir_graph, 1) < 0){
    fprintf(stderr, "Run graph failed\n");
    return -1;
}
Collect the min and max values of the activations:
/* get the min/max value of activation tensor */
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* act_tensor = ir_graph->tensor_list[i];
    if (act_tensor->tensor_type == TENSOR_TYPE_VAR || act_tensor->tensor_type == TENSOR_TYPE_INPUT){
        float* start_addr = (float*)act_tensor->data;
        float* end_addr = (float*)act_tensor->data + act_tensor->elem_num;
        max_activation[i] = std::max(max_activation[i], *std::max_element(start_addr, end_addr));
        min_activation[i] = std::min(min_activation[i], *std::min_element(start_addr, end_addr));
    }
}
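The std::max / std::min updates above keep a running range for each tensor across all calibration images. The same idea for a single tensor can be sketched as follows. RangeTracker is a hypothetical helper, not a Tengine type, and the initial values (largest and lowest float) are an assumption, since the excerpt does not show how max_activation and min_activation are initialized.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Running value range of one tensor across all calibration images (hypothetical helper).
struct RangeTracker {
    float min_v = std::numeric_limits<float>::max();
    float max_v = std::numeric_limits<float>::lowest();

    // Fold the activations from one forward pass into the running range.
    void update(const std::vector<float>& act) {
        if (act.empty()) return;
        auto mm = std::minmax_element(act.begin(), act.end());
        min_v = std::min(min_v, *mm.first);
        max_v = std::max(max_v, *mm.second);
    }

    // Symmetric scale computed from the accumulated range.
    float scale() const {
        assert(min_v <= max_v);  // update() must have seen at least one value
        return std::max(std::fabs(max_v), std::fabs(min_v)) / 127.f;
    }
};
```

After two passes producing {0.1, 0.9} and {-2.54, 0.3}, the tracked range is [-2.54, 0.9] and the scale is 2.54 / 127 = 0.02. A single outlier in any calibration image widens the range for good, which is the main weakness of min-max compared with KL.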
Compute the activation quantization scale. For the Softmax layer, the scale defaults to 1 / 127.f:
/* save the calibration file with min-max algorithm */
FILE* fp_minmax = fopen("table_minmax.scale", "wb");
for (int i = 0; i < ir_graph->tensor_num; i++){
    struct tensor* t = ir_graph->tensor_list[i];
    if (t->tensor_type == TENSOR_TYPE_VAR || t->tensor_type == TENSOR_TYPE_INPUT){
        float act_scale = 1.f;
        int act_zero_point = 0;
        act_scale = std::max(std::abs(max_activation[i]), std::abs(min_activation[i])) / 127.f;
        /* the scale of softmax is always scale = 1 / 127.f */
        for (int j = 0; j < ir_graph->node_num; j++){
            struct node* noden = ir_graph->node_list[j];
            struct tensor* tensor_tmp = get_ir_graph_tensor(ir_graph, noden->output_tensors[0]);
            /* skip nodes whose output is not an activation tensor */
            if (!(tensor_tmp->tensor_type == TENSOR_TYPE_INPUT || tensor_tmp->tensor_type == TENSOR_TYPE_VAR))
                continue;
            std::string tmp_op_name = get_op_name_from_type(noden->op.type);
            std::string cur_name = t->name;
            std::string tmp_name = tensor_tmp->name;
            if ((cur_name == tmp_name) && tmp_op_name == "Softmax"){
                act_scale = 1 / 127.f;
                break;
            }
        }
        fprintf(fp_minmax, "%s %f %d\n", ir_graph->tensor_list[i]->name, act_scale, act_zero_point);
    }
}
fclose(fp_minmax);
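Each line of the calibration table written above has the form "name scale zero_point". A small round-trip sketch shows that format and how it might be read back. ScaleEntry, format_entry, and parse_entry are hypothetical names, not Tengine functions, and the parser assumes tensor names contain no whitespace.

```cpp
#include <cassert>
#include <cstdio>
#include <sstream>
#include <string>

// One line of the calibration table (hypothetical type).
struct ScaleEntry {
    std::string name;
    float scale;
    int zero_point;
};

// Format an entry with the same "%s %f %d\n" pattern used by the fprintf above.
std::string format_entry(const ScaleEntry& e) {
    char buf[512];
    std::snprintf(buf, sizeof(buf), "%s %f %d\n", e.name.c_str(), e.scale, e.zero_point);
    return std::string(buf);
}

// Parse one line back into an entry; returns false if any field is missing.
bool parse_entry(const std::string& line, ScaleEntry& out) {
    std::istringstream iss(line);
    return static_cast<bool>(iss >> out.name >> out.scale >> out.zero_point);
}
```

Note that %f keeps only six decimal places, so a very small scale is rounded when it is written to the table.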
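2.2 Weight & bias quantization
Weight & bias quantization differs from activation quantization. Activation quantization needs inference on calibration images to capture the dynamic distribution of the input data, whereas weights & biases are static, so quantizing them does not require a forward pass.
2.2.1 Creating the graph
Load the tmfile and build the graph: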
struct graph* ir_graph = (struct graph*)create_graph(nullptr, "tengine", model_file);
if (nullptr == ir_graph){
    fprintf(stderr, "Create graph failed.\n");
    return -1;
}
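2.2.2 Optimizing the activation quantization scale
This step mainly applies the quant.inplace optimization, a quantization strategy for non-convolution operators. For Flatten, Reshape, Squeeze, Clip, Slice, Pooling with pool_method == 0, and ReLU with negative_slope == 0, the scale of the output tensor is handed to recursion_pass_through together with the operator's input tensor, so that both sides of the operator can use the same scale.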
if (inplace == 0){
    /* no inplace optimization: every activation tensor uses its own calibrated scale */
    for (int i = 0; i < ir_graph->tensor_num; i++){
        struct tensor* ir_tensor = ir_graph->tensor_list[i];
        if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
            ir_tensor->scale = layer_scale[ir_tensor->name];
            ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
        }
    }
}
else{
    std::tr1::unordered_map<std::string, bool> layer_pass;
    /* walk the tensors from the last one back to the first */
    for (int i = ir_graph->tensor_num - 1; i >= 0; i--){
        struct tensor* ir_tensor = ir_graph->tensor_list[i];
        if (ir_tensor->tensor_type == TENSOR_TYPE_VAR || ir_tensor->tensor_type == TENSOR_TYPE_INPUT){
            if (layer_pass[ir_tensor->name] == false){
                uint32_t ir_node_idx = ir_tensor->producer;
                struct node* t_node = ir_graph->node_list[ir_node_idx];
                std::string op_name = get_op_name_from_type(t_node->op.type);
                bool poolTrue = false;
                bool reluTrue = false;
                if (op_name == "Pooling"){
                    struct pool_param* pool_param = (struct pool_param*)t_node->op.param_mem;
                    if (pool_param->pool_method == 0)
                        poolTrue = true;
                }
                else if (op_name == "ReLU"){
                    struct relu_param* relu_param = (struct relu_param*)t_node->op.param_mem;
                    if (relu_param->negative_slope == 0.f)
                        reluTrue = true;
                }
                if (op_name == "Flatten" || op_name == "Reshape" || op_name == "Squeeze" || op_name == "Clip" || op_name == "Slice" || poolTrue || reluTrue){
                    struct tensor* t_in_tensor = ir_graph->tensor_list[t_node->input_tensors[0]];
                    if (layer_scale[ir_tensor->name] != 0){
                        ir_tensor->scale = layer_scale[ir_tensor->name];
                        ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
                        if (t_in_tensor->tensor_type == TENSOR_TYPE_VAR || t_in_tensor->tensor_type == TENSOR_TYPE_INPUT){
                            recursion_pass_through(ir_graph, ir_tensor->name, t_in_tensor, layer_used, layer_scale, layer_zeropoint, layer_pass);
                        }
                    }
                }
                else{
                    ir_tensor->scale = layer_scale[ir_tensor->name];
                    ir_tensor->zero_point = layer_zeropoint[ir_tensor->name];
                }
                layer_pass[ir_tensor->name] = true;
            }
        }
    }
}
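The pass-through idea is easier to see on a toy model. The sketch below is a deliberate simplification and not Tengine's recursion_pass_through, whose body is not shown in this article. It assumes a linear chain in which op i consumes tensor i-1 and produces tensor i, it assumes the output scale is what gets propagated backwards (which is how the call above reads), and ChainOp, is_pass_through, share_scales, and the op name "MaxPool" are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified model: ops[i] consumes tensor i-1 and produces tensor i.
struct ChainOp {
    std::string type;  // e.g. "Convolution", "ReLU", "Reshape"
    float out_scale;   // calibrated scale of this op's output tensor
};

// Ops treated as pass-through in this sketch: input and output can share one scale.
bool is_pass_through(const std::string& type) {
    return type == "ReLU" || type == "Reshape" || type == "Flatten" ||
           type == "Squeeze" || type == "Clip" || type == "Slice" || type == "MaxPool";
}

// Walk from the last op to the first; a pass-through op hands its
// output scale to the tensor it consumes.
void share_scales(std::vector<ChainOp>& ops) {
    for (std::size_t i = ops.size(); i-- > 1;) {
        if (is_pass_through(ops[i].type)) ops[i - 1].out_scale = ops[i].out_scale;
    }
}
```

For the chain Convolution (0.10), ReLU (0.08), MaxPool (0.07), Convolution (0.05), the first three tensors all end up with scale 0.07, so under these assumptions the ReLU and pooling steps can work directly on int8 data without requantizing.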
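2.2.3 Weight & bias quantization
The overall process resembles activation quantization: first search for the min and max values, then clip and scale. Here, computing the scale is not enough; the values themselves must be clipped and scaled, because an int8 tmfile quantized model file has to be generated. Note also that weights are quantized to int8 while biases are quantized to int32. The result of the weight matrix multiplication can easily overflow int8, so it is stored as int32 and then added to the int32 bias.
Another difference from activation quantization is that activations are quantized perLayer, while weights & biases are quantized perChannel.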
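A short sketch shows why the accumulator and the bias need int32. It is illustrative only: quantize_bias and dot_i8 are hypothetical names, and the bias scale act_scale * weight_scale is the common convention for int8 inference, assumed here rather than taken from the excerpts in this article.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantize a bias to int32 so it lives in the same units as the accumulator.
// Assumed convention: bias_scale = act_scale * weight_scale.
int32_t quantize_bias(float bias, float act_scale, float weight_scale) {
    const float bias_scale = act_scale * weight_scale;
    if (bias_scale == 0.f) return 0;
    return static_cast<int32_t>(std::round(bias / bias_scale));
}

// int8 x int8 dot product accumulated in int32, starting from the int32 bias.
int32_t dot_i8(const std::vector<int8_t>& a, const std::vector<int8_t>& w, int32_t bias_i32) {
    assert(a.size() == w.size());
    int32_t acc = bias_i32;
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += static_cast<int32_t>(a[i]) * static_cast<int32_t>(w[i]);
    return acc;
}
```

Two products of 127 * 127 already sum to 32258, far outside the int8 range, which is why the intermediate result cannot stay in int8.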
Weight int8 quantization:
/* quantize the weight data from fp32 to int8 */
if (op_name == "Convolution" || op_name == "FullyConnected" || op_name == "Deconvolution"){
    struct tensor* weight_tensor = ir_graph->tensor_list[noden->input_tensors[1]];
    int channel_num = weight_tensor->dims[0];
    int cstep = int(weight_tensor->elem_num / channel_num);
    float* weight_data = (float*)weight_tensor->data;
    int8_t* i8_weight_data = (int8_t*)sys_malloc(weight_tensor->elem_num * sizeof(int8_t));
    float* weight_scale_list = (float*)sys_malloc(channel_num * sizeof(float));
    int* weight_zp_list = (int*)sys_malloc(channel_num * sizeof(int));
    fprintf(fp_weight, "%s ", weight_tensor->name);
    /* calculate the quant scale value of weight perchannel, scale = max(abs(min), abs(max)) / 127 */
    if (internal){
        /* reuse the per-channel scales already stored in the tensor */
        for (int ch = 0; ch < channel_num; ch++){
            weight_scale_list[ch] = weight_tensor->scale_list[ch];
            weight_zp_list[ch] = 0;
        }
    }
    else{
        for (int ch = 0; ch < channel_num; ch++){
            float* weight_data_ch_start = weight_data + ch * cstep;
            float* weight_data_ch_end = weight_data + (ch + 1) * cstep;
            float weight_max = *std::max_element(weight_data_ch_start, weight_data_ch_end);
            float weight_min = *std::min_element(weight_data_ch_start, weight_data_ch_end);
            weight_scale_list[ch] = std::max(std::abs(weight_max), std::abs(weight_min)) / 127.f;
            weight_zp_list[ch] = 0;
            fprintf(fp_weight, "%8.8f ", weight_scale_list[ch]);
        }
        fprintf(fp_weight, "\n");
    }
    /* quantize the value of weight from Float32 to Int8, value_i8 = (value_fp32 / scale).round().clip(-127, 127) */
    for (int ch = 0; ch < channel_num; ch++){
        for (int j = 0; j < cstep; j++){
if (weight_data[ch * cstep + j] == 0 ||