Deploying SegFormer on the AX650N

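1. Background
Semantic segmentation is a fundamental task in computer vision. Whereas image classification assigns one label to a whole image, semantic segmentation performs classification at the pixel level. It underpins many downstream applications, notably the deployment of intelligent-driving technology in recent years.
This article briefly introduces the basic principles of SegFormer, then shows how to export an ONNX model and deploy it on the AX650N, a high-performance edge AI chip. We hope it offers new ideas to those in the industry who are interested in deploying Transformer models on edge devices.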
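2. Introduction to SegFormer
The SegFormer paper proposes a simple and efficient semantic segmentation scheme that combines Transformers with a lightweight multilayer perceptron (MLP) decoder. SegFormer has two attractive features:
1. SegFormer contains a novel hierarchical Transformer encoder that outputs multi-scale features. It needs no positional encoding, so positional codes never have to be interpolated.
2. SegFormer avoids complex decoders. The proposed MLP decoder aggregates information from different layers, thereby combining local attention and global attention to produce a powerful representation. This simple, lightweight design is the key to efficient segmentation with Transformers.
The paper scales this approach into a series of models of different sizes, from SegFormer-B0 to SegFormer-B5, which achieve better performance and efficiency than previous segmentation models. For example, SegFormer-B4 reaches 50.3% mIoU on ADE20K with 64M parameters, and the best model, SegFormer-B5, reaches 84.0% mIoU on the Cityscapes validation set.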
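Since the results above are reported as mIoU, a minimal sketch of the metric may help. For each class, the IoU is the number of pixels where prediction and label both equal that class, divided by the number of pixels where either does; mIoU averages this over classes. The function name and toy data below are illustrative assumptions, and benchmark implementations usually accumulate the counts over a whole dataset rather than a single image.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat lists of integer class IDs, one per pixel.
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and label
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy 2x3 "image" flattened to 6 pixels, 3 classes (hypothetical data).
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.3f}")
```

Here the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.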
2.1 Backbone
[Figure: backbone]
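2.2 Hierarchical Transformer encoder
The paper proposes a series of Mix Transformer (MiT) encoders, MiT-B0 to MiT-B5, which share the same architecture but differ in size. MiT-B0 is a lightweight model for fast inference, while MiT-B5 is the largest model, built for best performance. The MiT design is partly inspired by ViT, but it is tailored and optimized for semantic segmentation.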
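To make "multi-scale features" concrete, the sketch below computes the shape of each encoder stage output for a 640x1280 input. It assumes the four stages downsample by 4, 8, 16, and 32 and uses the channel widths 32/64/160/256 of the public MiT-B0 configuration; neither set of numbers is stated in this article, and the helper name is hypothetical.

```python
# Assumed strides and channel widths of the four MiT-B0 stages.
STRIDES = (4, 8, 16, 32)
MIT_B0_CHANNELS = (32, 64, 160, 256)


def stage_shapes(height, width, channels=MIT_B0_CHANNELS, strides=STRIDES):
    """Return the (C, H, W) shape of each encoder stage output."""
    return [(c, height // s, width // s) for c, s in zip(channels, strides)]


shapes = stage_shapes(640, 1280)
for i, shape in enumerate(shapes, start=1):
    print(f"stage {i}: {shape}")
```

Under these assumptions the finest feature map is 160x320 and the coarsest is 20x40, which is the pyramid the decoder has to merge.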
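2.3 Lightweight All-MLP decoder
SegFormer integrates a lightweight decoder that consists only of MLP layers, which avoids the hand-crafted and computationally demanding components typically used in other methods. What makes such a simple decoder workable is that the hierarchical Transformer encoder has a larger effective receptive field (ERF) than traditional CNN encoders.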
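The data flow of such a decoder can be sketched in a few lines of NumPy: project every scale to a common channel width, upsample to the finest scale, concatenate, fuse, and classify per pixel. This is a rough illustration, not the paper's implementation: the weights are random, nearest-neighbour upsampling stands in for bilinear interpolation, normalization and activation layers are omitted, and `embed_dim=256` and `num_classes=19` are assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)


def linear(x, out_channels):
    """Per-pixel linear layer on a (C, H, W) map, with random illustrative weights."""
    weight = rng.standard_normal((out_channels, x.shape[0])) * 0.01
    return np.einsum("oc,chw->ohw", weight, x)


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def all_mlp_decoder(features, embed_dim=256, num_classes=19):
    target_h = features[0].shape[1]
    unified = []
    for feat in features:
        proj = linear(feat, embed_dim)                   # 1) unify channel width
        factor = target_h // proj.shape[1]
        unified.append(upsample_nearest(proj, factor))   # 2) bring to the finest scale
    fused = linear(np.concatenate(unified, axis=0), embed_dim)  # 3) fuse all scales
    return linear(fused, num_classes)                    # 4) per-pixel class logits


# Stand-in feature maps for a 64x128 input at strides 4/8/16/32.
features = [
    rng.standard_normal((c, 64 // s, 128 // s))
    for c, s in zip((32, 64, 160, 256), (4, 8, 16, 32))
]
logits = all_mlp_decoder(features)
print(logits.shape)
```

The output keeps the resolution of the finest feature map, so the class logits come out at 1/4 of the input size in this sketch.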
[Figure: benchmark]
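3. AX650N
The AX650N is an SoC that combines high compute with high energy efficiency. It integrates an octa-core Cortex-A55 CPU, a 10.8 TOPS@INT8 NPU (with custom optimizations for Transformer models), an ISP that supports 8K@30fps, and a VPU for H.264 and H.265 encoding and decoding. In terms of interfaces, the AX650N supports 64-bit LPDDR4x, multiple MIPI inputs, Gigabit Ethernet, USB, and HDMI 2.0b output, and it can decode 32 channels of 1080p@30fps video. This performance lets the AX650N deliver more value to users in fields such as smart cities, smart education, and intelligent manufacturing.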
4. Model conversion
This article uses segformer-b0-cityscapes-640-1280 as the example.
4.1 Model download
This time we recommend downloading the model from the Hugging Face model zoo.
[Figure: huggingface]
● ONNX export script

    import os
    from pathlib import Path

    import torch
    from transformers import SegformerForSemanticSegmentation


    def export_model(model_name: str, export_dir: str, input_sample: torch.Tensor):
        model = SegformerForSemanticSegmentation.from_pretrained(model_name)
        model.eval()
        export_path = os.path.join(export_dir, model_name)
        Path(export_path).mkdir(parents=True, exist_ok=True)
        onnx_path = os.path.join(export_path, "model.onnx")

        # Export the model to ONNX. No dynamic_axes are passed, so the input
        # shape is fixed to the shape of input_sample.
        torch.onnx.export(
            model,
            input_sample,
            onnx_path,
            export_params=True,
            opset_version=11,
            input_names=["input"],
            output_names=["output"],
        )


    export_dir = "./hf_export/"
    model_name = "nvidia/segformer-b0-finetuned-cityscapes-640-1280"
    export_model(model_name, export_dir, torch.randn([1, 3, 640, 1280]))

    # model_name = "nvidia/segformer-b1-finetuned-ade-512-512"
    # export_model(model_name, export_dir, torch.randn([1, 3, 512, 512]))
    # check_models(model_name, export_dir, image_paths)
● Simplification with onnxsim

    $ onnxsim segformer-b0-cityscapes-640-1280.onnx segformer-b0-cityscapes-640-1280-sim.onnx
    Simplifying...
    Finish! Here is the difference:
    ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
    ┃            ┃ Original Model ┃ Simplified Model ┃
    ┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
    │ Add        │ 136            │ 136              │
    │ Concat     │ 21             │ 1                │
    │ Constant   │ 176            │ 0                │
    │ Conv       │ 20             │ 20               │
    │ Div        │ 46             │ 46               │
    │ Erf        │ 8              │ 8                │
    │ MatMul     │ 68             │ 68               │
    │ Mul        │ 46             │ 46               │
    │ Pow        │ 30             │ 30               │
    │ ReduceMean │ 60             │ 60               │
    │ Relu       │ 1              │ 1                │
    │ Reshape    │ 76             │ 76               │
    │ Resize     │ 4              │ 4                │
    │ Shape      │ 20             │ 0                │
    │ Slice      │ 20             │ 0                │
    │ Softmax    │ 8              │ 8                │
    │ Sqrt       │ 30             │ 30               │
    │ Sub        │ 30             │ 30               │
    │ Transpose  │ 76             │ 76               │
    │ Model Size │ 14.3MiB        │ 14.3MiB          │
    └────────────┴────────────────┴──────────────────┘
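● Adding an ArgMax output head
Because the AX650N NPU supports the ArgMax operator, ArgMax can be appended to the model's output head so that the model directly returns the ID of the highest-confidence class for each pixel.
First install the onnx_graphsurgeon dependency:

    pip install onnx_graphsurgeon --index-url https://pypi.ngc.nvidia.com

Then run the script below to add the ArgMax op: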

    import numpy as np
    import onnx
    import onnx_graphsurgeon as gs

    model_path = "./segformer-b0-cityscapes-640-1280-sim.onnx"
    output_model_path = "./segformer-b0-cityscapes-640-1280-sim-argmax.onnx"

    onnx_model = onnx.load(model_path)
    onnx_graph = gs.import_onnx(onnx_model)

    # The last node in the graph is the final Conv of the decode head
    node_last_conv = onnx_graph.nodes[-1]

    # Attributes for ArgMax: reduce over the class axis and keep that dimension
    axis = 1
    keepdims = 1
    argmax_out_shape = node_last_conv.outputs[0].shape.copy()
    argmax_out_shape[axis] = 1

    argmax_out = gs.Variable(
        "argmax_output",
        dtype=np.int64,
        shape=argmax_out_shape,
    )
    argmax_node = gs.Node(
        op="ArgMax",
        name="decode_head_argmax",
        inputs=[node_last_conv.outputs[0]],
        outputs=[argmax_out],
        attrs={"axis": axis, "keepdims": keepdims},
    )

    # Append the node and make its output the only graph output
    onnx_graph.nodes.append(argmax_node)
    onnx_graph.outputs.clear()
    onnx_graph.outputs = [argmax_out]
    onnx_graph.cleanup().toposort()

    # Keep the IR version of the original model when saving
    onnx_model_with_argmax = gs.export_onnx(onnx_graph)
    onnx_model_with_argmax.ir_version = onnx_model.ir_version
    onnx.save(onnx_model_with_argmax, output_model_path)
Comparison of the two ONNX models before and after adding ArgMax:
[Figure: add-argmax]
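As a small illustration of what an ArgMax head with axis=1 and keepdims=1 computes, the NumPy sketch below reduces a toy logits tensor over its class axis. The tensor values and the three-class setup are hypothetical; the real model output has one channel per Cityscapes class.

```python
import numpy as np

# Hypothetical logits of shape (1, 3, 2, 2): batch 1, 3 classes, 2x2 pixels.
logits = np.array([[
    [[0.1, 2.0], [0.3, 0.2]],   # class 0 scores
    [[1.5, 0.4], [0.1, 0.9]],   # class 1 scores
    [[0.2, 0.3], [2.2, 0.5]],   # class 2 scores
]])

# Equivalent of ONNX ArgMax(axis=1, keepdims=1): pick the best class per
# pixel, then restore the reduced axis with size 1.
class_ids = np.expand_dims(np.argmax(logits, axis=1), axis=1)

print(class_ids.shape)   # (1, 1, 2, 2)
print(class_ids[0, 0])
```

The class axis collapses from 3 to 1, so the host receives one integer per pixel instead of a full score vector, which reduces both the output size and the post-processing work.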
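4.2 Model compilation
Pulsar2, the AI toolchain that accompanies the AX650N, performs graph optimization, offline quantization, compilation, and accuracy comparison in a single command.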

    $ pulsar2 build --input model/segformer-b0-cityscapes-640-1280-sim-argmax.onnx --output_dir segformer/ --config config/segformer_config.json --npu_mode NPU3
    Building onnx ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
    patool: Extracting ./dataset/coco_4.tar ...
    patool: running /usr/bin/tar --extract --file ./dataset/coco_4.tar --directory segformer/quant/dataset/input
    patool: ... ./dataset/coco_4.tar extracted to `segformer/quant/dataset/input'.
                                                  Quant Config Table
    ┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
    ┃ Input ┃ Shape             ┃ Dataset Directory ┃ Data Format ┃ Tensor Format ┃ Mean                                          ┃ Std                                          ┃
    ┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
    │ input │ [1, 3, 640, 1280] │ input             │ Image       │ RGB           │ [123.67500305175781, 116.27999877929688,      │ [58.39500045776367, 57.119998931884766,      │
    │       │                   │                   │             │               │ 103.52999877929688]                           │ 57.356998443603516]                          │
    └───────┴───────────────────┴───────────────────┴─────────────┴───────────────┴───────────────────────────────────────────────┴──────────────────────────────────────────────┘
    4 File(s) Loaded.
    [14:17:44] AX LSTM Operation Format Pass Running ...      Finished.
    [14:17:44] AX Refine Operation Config Pass Running ...    Finished.
    [14:17:44] AX Transformer Optimize Pass Running ...       Finished.
    [14:17:45] AX Reset Mul Config Pass Running ...           Finished.
    [14:17:45] AX Tanh Operation Format Pass Running ...      Finished.
    [14:17:45] AX Quantization Config Refine Pass Running ... Finished.
    [14:17:45] AX Quantization Fusion Pass Running ...        Finished.
    [14:17:45] AX Quantization Simplify Pass Running ...      Finished.
    [14:17:45] AX Parameter Quantization Pass Running ...     Finished.
    Calibration Progress(Phase 1): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:23<00:00, 5.86s/it]
    Finished.
    [1411] AX Passive Parameter Quantization Running ...      Finished.
    [1411] AX Parameter Baking Pass Running ...               Finished.
    [1412] AX Refine Int Parameter Pass Running ...           Finished.
    Network Quantization Finished.
    quant.axmodel export success: segformer/quant/quant_axmodel.onnx
    Building native ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 000
    ......
    2023-05-30 1418.661 | INFO | yasched.test_onepass1615 - max_cycle = 37883954
    2023-05-30 14:35:04.545 | INFO | yamain.command.build904 - fuse 1 subgraph(s)
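The quant config table in the log lists per-channel mean and std values for an RGB image input. When checking the exported ONNX model on a host machine, the same normalization would presumably have to be applied by hand. The sketch below shows one way to do it in NumPy, with the statistics rounded from the log; the function name and the placeholder image are assumptions.

```python
import numpy as np

# Per-channel statistics in RGB order, rounded from the quant config table.
MEAN = np.array([123.675, 116.28, 103.53], dtype=np.float32)
STD = np.array([58.395, 57.12, 57.357], dtype=np.float32)


def preprocess(image_rgb):
    """Convert an HxWx3 uint8 RGB image into a normalized 1x3xHxW float32 tensor."""
    x = (image_rgb.astype(np.float32) - MEAN) / STD   # broadcasts over the channel axis
    return x.transpose(2, 0, 1)[np.newaxis, ...]       # HWC -> NCHW


image = np.full((640, 1280, 3), 128, dtype=np.uint8)   # placeholder gray image
tensor = preprocess(image)
print(tensor.shape, tensor.dtype)
```

The resulting tensor has the [1, 3, 640, 1280] shape shown in the table and can be fed to the input named "input".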
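5. On-board deployment
5.1 ax-samples
The open-source project ax-samples provides sample code for running common open-source deep learning algorithms on Axera's AI SoCs, so that community developers can evaluate and adapt them quickly. The latest version has begun to provide NPU examples for the AX650 series, including the SegFormer reference code described in this article.
https://github.com/axera-tech/ax-samples/blob/main/examples/ax650/ax_segformer_steps.cc
5.2 Running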

    # ./ax_segformer -m segformer-b0-cityscapes-640-1280-argmax.axmodel -i segformer_test.png
    --------------------------------------
    model file : segformer-b0-cityscapes-640-1280-argmax.axmodel
    image file : segformer_test.png
    img_h, img_w : 640 1280
    --------------------------------------
    post process cost time:7.07 ms
    --------------------------------------
    Repeat 1 times, avg time 48.15 ms, max_time 48.15 ms, min_time 48.15 ms
    --------------------------------------
    --------------------------------------
[Figure: SegFormer inference result]
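The post-processing step turns the per-pixel class IDs into a viewable image. A minimal NumPy sketch of that idea follows: look up a color for each class ID, then enlarge the map by an integer factor. It is not the ax-samples implementation. The three-entry palette is illustrative, and the default scale of 4 assumes that the model emits its class map at 1/4 of the input resolution, as the Hugging Face SegFormer head does.

```python
import numpy as np

# Illustrative three-entry RGB palette; a Cityscapes palette would have 19 entries.
PALETTE = np.array([
    [128, 64, 128],   # class 0
    [244, 35, 232],   # class 1
    [70, 70, 70],     # class 2
], dtype=np.uint8)


def colorize(class_ids, scale=4):
    """Map an HxW class-ID array to an RGB image and enlarge it by `scale`."""
    color = PALETTE[class_ids]                               # (H, W, 3) lookup
    return color.repeat(scale, axis=0).repeat(scale, axis=1)  # nearest-neighbour upsample


class_ids = np.array([[0, 1], [2, 0]])
image = colorize(class_ids)
print(image.shape)   # (8, 8, 3)
```

Nearest-neighbour enlargement is used here because interpolating between class IDs or palette colors would create values that correspond to no class.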
6. Future plans
● Try deploying the large vision model DINOv2. Stay tuned!
