From Transformers to vLLM: A Full-Framework Deployment Guide for MiniCPM-V-4.6-AWQ - 编程实验室 (Programming Lab)

[Free download link] MiniCPM-V-4.6-AWQ project page: https://ai.gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQ

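MiniCPM-V-4.6-AWQ is a lightweight multimodal model from the OpenBMB open-source community, quantized with AWQ for efficient deployment. This guide walks through deploying it with the Transformers and vLLM frameworks, so that image and video understanding can run on a consumer-grade GPU.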

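Model overview: why choose MiniCPM-V-4.6-AWQ?

As the AWQ-quantized build of MiniCPM-V 4.6, MiniCPM-V-4.6-AWQ inherits three core strengths of the original model:

Highly efficient architecture: built on LLaVA-UHD v4, it cuts visual-encoding compute by more than 50% and reaches roughly 1.5x the token throughput of Qwen3.5-0.8B
Multimodal capability: on benchmarks including OpenCompass and RefCOCO it performs at the level of Qwen3.5 2B, and it supports single-image, multi-image, and video understanding
Broad deployment support: works with mainstream inference frameworks such as vLLM, SGLang, and llama.cpp, and ships in several quantized formats including GGUF, BNB, AWQ, and GPTQ

The model is particularly well suited to edge deployment. It has been adapted to iOS, Android, and HarmonyOS, and all of the edge adaptation code is open source.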

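Environment setup: what to configure before deploying

Before you start, make sure your environment meets the following requirements:

Python 3.8+
PyTorch 2.0+
CUDA 11.7+ (12.1 or later recommended for best performance)
At least 4 GB of GPU memory (for the quantized version)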
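
The requirements above can be checked from Python before installing anything else. The following is a minimal sketch, not part of the project: the helper names `parse_version`, `meets_minimum`, and `report` are made up for illustration, and the PyTorch checks only run if PyTorch is already installed.

```python
import re
import sys


def parse_version(text):
    """Turn '2.1.0+cu121' into (2, 1, 0), ignoring any non-numeric suffix."""
    parts = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True when the installed version string is at least `minimum`."""
    have, need = parse_version(installed), tuple(minimum)
    width = max(len(have), len(need))
    # Pad with zeros so that "3.8" and (3, 8, 0) compare as equal.
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def report():
    """Print one line per requirement that can be checked locally."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    print("Python >= 3.8:", meets_minimum(python_version, (3, 8)))
    try:
        import torch  # optional: only inspected when PyTorch is installed
    except ImportError:
        print("PyTorch: not installed")
        return
    print("PyTorch >= 2.0:", meets_minimum(torch.__version__, (2, 0)))
    cuda = torch.version.cuda  # None on CPU-only builds
    print("CUDA >= 11.7:", cuda is not None and meets_minimum(cuda, (11, 7)))
    if torch.cuda.is_available():
        total = torch.cuda.get_device_properties(0).total_memory
        print(f"GPU memory: {total / 1024**3:.1f} GiB (the article suggests at least 4 GB)")


if __name__ == "__main__":
    report()
```

The GPU memory line prints the measured value rather than a pass/fail result, because cards sold as 4 GB can report slightly less than that.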

First, clone the project repository:

    git clone https://gitcode.com/OpenBMB/MiniCPM-V-4.6-AWQ
    cd MiniCPM-V-4.6-AWQ

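Method 1: deploying with Transformers

Transformers is Hugging Face's open-source library; it provides a simple API for loading and running pretrained models.

Install dependencies

    pip install "transformers[torch]>=5.7.0" torchvision torchcodec

CUDA compatibility note: torchcodec can be incompatible with some CUDA versions. If you hit RuntimeError: Could not load libtorchcodec, you can either:
Use PyAV instead: pip install "transformers[torch]>=5.7.0" torchvision av
Install a build for a specific CUDA version: pip install "transformers>=5.7.0" torchvision torchcodec --index-url https://download.pytorch.org/whl/cu128 (--index-url replaces PyPI as the package source, so if pip cannot resolve transformers from that index, pass --extra-index-url instead)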
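
To see which of the two video backends is usable in a given environment, one option is to try importing each in order of preference. This is a sketch under the assumption that the torchcodec failure surfaces at import time; `first_importable` and `simulated_import` are hypothetical helpers, and the demo uses a simulated importer so that it runs without either package installed.

```python
import importlib


def first_importable(candidates, importer=importlib.import_module):
    """Return the first module name that imports cleanly, or None."""
    for name in candidates:
        try:
            importer(name)
        except Exception:
            # ImportError: not installed. RuntimeError: native library failed to load.
            continue
        return name
    return None


def simulated_import(name):
    """Stand-in importer for the demo: pretends torchcodec cannot load its library."""
    if name == "torchcodec":
        raise RuntimeError("Could not load libtorchcodec")
    return object()


# Demo with the simulated importer, so no real package is touched.
print(first_importable(["torchcodec", "av"], importer=simulated_import))  # av
```

Calling `first_importable(["torchcodec", "av"])` with the default importer checks the real environment instead.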

Load the model

    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "openbmb/MiniCPM-V-4.6-AWQ"

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Flash Attention 2 is recommended for speed.
    # It requires the flash-attn package and `import torch`.
    # model = AutoModelForImageTextToText.from_pretrained(
    #     model_id,
    #     torch_dtype=torch.bfloat16,
    #     attn_implementation="flash_attention_2",
    #     device_map="auto",
    # )

Image inference example

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"},
                {"type": "text", "text": "What causes this phenomenon?"},
            ],
        }
    ]

    downsample_mode = "16x"  # use "4x" for finer detail

    # Build model inputs from the chat messages and move them to the model's device.
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        downsample_mode=downsample_mode,
        max_slice_nums=36,
    ).to(model.device)

    # downsample_mode must be passed to generate() as well.
    generated_ids = model.generate(**inputs, downsample_mode=downsample_mode, max_new_tokens=512)

    # Keep only the newly generated tokens by dropping the prompt prefix.
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print(output_text[0])
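
The trimming step in the example is easy to misread, so here it is in isolation with plain lists. The token IDs are made up for illustration; the point is only that generate() returns the prompt followed by the new tokens, and the slice removes the prompt.

```python
def trim_prompt(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token IDs: the prompt is 4 tokens long, and the generated sequence
# contains those same 4 tokens followed by 3 new ones.
prompt = [[101, 7, 8, 9]]
full_output = [[101, 7, 8, 9, 42, 43, 2]]
print(trim_prompt(prompt, full_output))  # [[42, 43, 2]]
```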

Start a Transformers server

Transformers ships a lightweight OpenAI-compatible server, suitable for quick tests and moderate-load deployments:

    pip install "transformers[serving]>=5.7.0"
    transformers serve openbmb/MiniCPM-V-4.6-AWQ --port 8000 --host 0.0.0.0 --continuous-batching

Example request:

    curl -s http://localhost:8000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "openbmb/MiniCPM-V-4.6-AWQ",
        "messages": [{
          "role": "user",
          "content": [
            {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
            {"type": "text", "text": "What causes this phenomenon?"}
          ]
        }]
      }'
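
The same request can be sent from Python using only the standard library. This is a sketch rather than an official client: `build_chat_payload`, `post_chat`, and `main` are illustrative names, and reading the answer from `choices[0].message.content` assumes the server follows the usual OpenAI chat-completions response shape.

```python
import json
import urllib.request


def build_chat_payload(model, image_url, question):
    """Build the same JSON body as the curl example: one image plus one question."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    }


def post_chat(base_url, payload, timeout=120):
    """POST the payload to /v1/chat/completions and return the decoded JSON reply."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_chat_payload(
    "openbmb/MiniCPM-V-4.6-AWQ",
    "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png",
    "What causes this phenomenon?",
)


def main():
    """Send the request; call this only once the server is listening on port 8000."""
    reply = post_chat("http://localhost:8000", payload)
    print(reply["choices"][0]["message"]["content"])
```

Nothing is sent until `main()` is called, so the block can be imported or pasted into a session before the server is up.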

Method 2: deploying with vLLM

vLLM is a high-performance LLM serving library. Its PagedAttention technique can significantly raise throughput and lower latency.

Install vLLM

    pip install vllm

Start the vLLM server

    vllm serve openbmb/MiniCPM-V-4.6-AWQ \
      --port 8000 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --default-chat-template-kwargs '{"enable_thinking": false}'

Tip: if you do not need tool calling, the command simplifies to: vllm serve openbmb/MiniCPM-V-4.6-AWQ --port 8000
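
When the server is launched from a script, it can help to assemble the command from options instead of editing a long string. The sketch below uses only the flags that appear in the commands above; `build_vllm_command` is a hypothetical helper, not part of vLLM.

```python
import json
import shlex


def build_vllm_command(model, port=8000, enable_tools=False, enable_thinking=None):
    """Return the `vllm serve` argument list for the chosen options."""
    command = ["vllm", "serve", model, "--port", str(port)]
    if enable_tools:
        # The two tool-calling flags from the full command in the article.
        command += ["--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder"]
    if enable_thinking is not None:
        kwargs = json.dumps({"enable_thinking": enable_thinking})
        command += ["--default-chat-template-kwargs", kwargs]
    return command


# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(build_vllm_command("openbmb/MiniCPM-V-4.6-AWQ")))
print(shlex.join(build_vllm_command(
    "openbmb/MiniCPM-V-4.6-AWQ", enable_tools=True, enable_thinking=False
)))
```

The returned list can also be passed directly to `subprocess.run`, which avoids shell quoting altogether.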

Send an inference request

    curl -s http://localhost:8000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "openbmb/MiniCPM-V-4.6-AWQ",
        "messages": [{"role": "user", "content": [
          {"type": "image_url", "image_url": {"url": "https://huggingface.co/datasets/openbmb/DemoCase/resolve/main/refract.png"}},
          {"type": "text", "text": "What causes this phenomenon?"}
        ]}]
      }'

Tool-calling example

vLLM supports automatic tool calling when the server is started with the tool-calling flags. For example:

    curl -s http://localhost:8000/v1/chat/completions \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "openbmb/MiniCPM-V-4.6-AWQ",
        "messages": [{"role": "user", "content": [
          {"type": "text", "text": "Weather in Beijing"}
        ]}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {"type": "string", "description": "City name"}
              },
              "required": ["location"]
            }
          }
        }]
      }'
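
The article does not show what comes back from a tool-calling request. Assuming the reply follows the OpenAI chat-completions format, where each tool call carries a function name and a JSON-encoded argument string, the calls could be extracted as sketched below. The `sample_response` is hand-written for illustration and is not real model output.

```python
import json


def extract_tool_calls(response):
    """Return a list of (function_name, arguments_dict) from a chat completion."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON string and must be decoded before use.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written example in the assumed response format.
sample_response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"Beijing\"}",
                },
            }],
        }
    }]
}

print(extract_tool_calls(sample_response))  # [('get_weather', {'location': 'Beijing'})]
```

An application would then look up each name in its own table of functions, call it with the decoded arguments, and send the result back to the model in a follow-up message.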

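Advanced parameter configuration

Whether you use Transformers or vLLM, you can tune the following parameters to trade speed against quality:

Parameter	Default	Applies to	Description
`downsample_mode`	`"16x"`	Images & video	Downsampling mode for visual tokens. "16x" favors efficiency; "4x" preserves more detail. Must also be passed to generate()
`max_slice_nums`	`9`	Images & video	Maximum number of slices when splitting a high-resolution image. 36 is recommended for images, 1 for video
`max_num_frames`	`128`	Video	Maximum number of video frames. Short videos are sampled at 1 FPS by default; long videos are sampled uniformly
`stack_frames`	`1`	Video	Sampling points per second. 1 is recommended for short videos, 3 or 5 for long videos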
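
The recommendations in the table can be collected into one small helper so that image and video calls stay consistent. This is a sketch only: `recommended_settings` is a hypothetical function that returns the values from the table, and the article shows how such values are passed for images but not for video, so wiring them into a video pipeline is left open.

```python
def recommended_settings(kind, fine_detail=False, long_video=False):
    """Return the table's recommended parameter values for an image or a video."""
    if kind not in ("image", "video"):
        raise ValueError(f"kind must be 'image' or 'video', got {kind!r}")
    # "16x" favors efficiency; "4x" keeps more visual detail.
    settings = {"downsample_mode": "4x" if fine_detail else "16x"}
    if kind == "image":
        settings["max_slice_nums"] = 36
    else:
        settings["max_slice_nums"] = 1
        settings["max_num_frames"] = 128  # the default from the table
        settings["stack_frames"] = 3 if long_video else 1  # the table suggests 3 or 5 for long videos
    return settings


print(recommended_settings("image"))
print(recommended_settings("video", long_video=True))
```

For images, the result matches the keyword arguments used in the inference example, where `downsample_mode` is given to both `apply_chat_template` and `generate()`.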

Other deployment options

Beyond Transformers and vLLM, MiniCPM-V-4.6-AWQ can be deployed with several other frameworks:

SGLang

    pip install sglang
    python -m sglang.launch_server --model openbmb/MiniCPM-V-4.6-AWQ --port 30000

llama.cpp

    # First obtain the model in GGUF format
    llama-server -m MiniCPM-V-4.6-Q4_K_M.gguf --port 8080

Ollama

    ollama run minicpm-v-4.6

In the interactive session, paste an image path or URL to chat with the model about it.
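
Each server in this guide listens on a different port, which is easy to mix up when switching between them. The sketch below records the ports used in the commands above. That SGLang and llama.cpp expose the same OpenAI-style `/v1/chat/completions` route as the other two is an assumption of this sketch, not something the article states, and Ollama is left out because the article only shows its interactive mode.

```python
# Ports as used in this article's commands.
DEFAULT_PORTS = {
    "transformers": 8000,
    "vllm": 8000,
    "sglang": 30000,
    "llama.cpp": 8080,
}


def chat_endpoint(framework, host="localhost"):
    """Return the assumed chat-completions URL for a server started as shown above."""
    try:
        port = DEFAULT_PORTS[framework]
    except KeyError:
        raise ValueError(f"unknown framework: {framework!r}") from None
    return f"http://{host}:{port}/v1/chat/completions"


for name in DEFAULT_PORTS:
    print(f"{name:<12} {chat_endpoint(name)}")
```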

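Summary and outlook

With its efficient architecture and broad framework support, MiniCPM-V-4.6-AWQ is a strong choice for running a multimodal model on edge devices and consumer-grade GPUs. Whether you integrate it quickly through Transformers or use vLLM for higher throughput, local deployment is straightforward.

Because the mobile deployment code is open source, developers can also explore deployment on iOS, Android, and HarmonyOS devices. For scenarios that need customization, tools such as LLaMA-Factory or ms-swift can be used to fine-tune the model for new domains and tasks.

With the methods described here, you can have MiniCPM-V-4.6-AWQ deployed within a few minutes and start building multimodal AI applications.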

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
