Accessing the QwQ-32B Model Service via Open WebUI
Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform that supports a variety of LLM runtimes. This article walks through how to call the QwQ-32B API service through Open WebUI and enable its web search feature.
Prerequisites
Deploying the QwQ-32B Model Service
QwQ-32B delivers excellent reasoning performance at the 32B scale. We provide several deployment options:
- 1x A800: single-request throughput of 23 tokens/s
- 2x A800: single-request throughput of 40 tokens/s, supporting higher concurrency
- 4x A800: single-request throughput of 62 tokens/s, supporting even higher concurrency
- This article uses a single A800 for demonstration. Create and edit the inference.tpl.yaml file (vi inference.tpl.yaml). It contains the model service Deployment and a Service with a public IP; a sample YAML follows:
apiVersion: apps/v1
kind: Deployment
metadata:
name: qwq-32b-1
namespace: default
labels:
app: qwq-32b-1
spec:
replicas: 1
selector:
matchLabels:
app: qwq-32b-1
template:
metadata:
labels:
app: qwq-32b-1
spec:
      affinity: # Pod scheduling affinity: pick a node with the required GPU model
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: cloud.ebtech.com/gpu
operator: In
values:
- A800_NVLINK_80GB
volumes:
      # Mount the shared model data disk (if needed)
- name: models
hostPath:
path: /public
- name: shm
emptyDir:
medium: Memory
sizeLimit: "50Gi"
containers:
- name: qwq-32b
image: registry-cn-huabei1-internal.ebcloud.com/docker.io/lmsysorg/sglang:v0.4.5-cu125
command:
- bash
- "-c"
- |
python3 -m sglang.launch_server \
--model-path /public/huggingface-models/Qwen/QwQ-32B \
--tp "1" \
--host 0.0.0.0 --port 8000 \
--trust-remote-code \
--context-length 65536 \
--served-model-name qwq-32b \
--tool-call-parser qwen25
env:
- name: HF_DATASETS_OFFLINE
value: "1"
- name: TRANSFORMERS_OFFLINE
value: "1"
- name: HF_HUB_OFFLINE
value: "1"
ports:
- containerPort: 8000
resources:
limits:
cpu: "10"
memory: 100G
nvidia.com/gpu: "1"
requests:
cpu: "10"
memory: 100G
nvidia.com/gpu: "1"
volumeMounts:
- name: shm
mountPath: /dev/shm
        # Mount the shared model data disk (if needed)
- name: models
mountPath: /public
---
apiVersion: v1
kind: Service
metadata:
name: qwq-32b-1
namespace: default
spec:
ports:
- name: http-qwq-32b-1
port: 80
protocol: TCP
targetPort: 8000
  # The label selector must match the Deployment's pod labels
selector:
app: qwq-32b-1
sessionAffinity: None
  # Use type LoadBalancer to expose the service externally; a public IP is assigned automatically
type: LoadBalancer
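The YAML above is sized for the 1x A800 option. For the 2x or 4x A800 options listed earlier, a minimal sketch of the fields that need to change together (the tensor-parallel degree must match the GPU count; everything else can stay as-is):
# In the sglang launch command:
--tp "2"              # or "4" for 4x A800
# In resources.limits and resources.requests:
nvidia.com/gpu: "2"   # or "4"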
# Deploy the QwQ-32B model service; this also creates a Service used for subsequent API calls
kubectl apply -f inference.tpl.yaml
# List the deployed pods
kubectl get pod
# Get the public IP. In the output of the command below, the EXTERNAL-IP of the qwq-32b-1 service is the assigned public IP.
kubectl get service
# Tail the pod logs
kubectl logs -f <YOUR-POD-NAME>
The model service has finished starting once logs like the following appear:
Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:35<00:02, 2.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:38<00:00, 2.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:38<00:00, 2.75s/it]
[2025-04-09 13:43:07 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=17.75 GB, mem usage=61.17 GB.
[2025-04-09 13:43:07 TP0] KV Cache is allocated. #tokens: 33901, K size: 4.14 GB, V size: 4.14 GB
[2025-04-09 13:43:07 TP0] Memory pool end. avail mem=8.72 GB
[2025-04-09 13:43:07 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=8.09 GB
Capturing batches (avail_mem=6.62 GB): 100%|██████████| 23/23 [00:07<00:00, 3.09it/s]
[2025-04-09 13:43:14 TP0] Capture cuda graph end. Time elapsed: 7.45 s. avail mem=6.60 GB. mem usage=1.49 GB.
[2025-04-09 13:43:15 TP0] max_total_num_tokens=33901, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=65536
[2025-04-09 13:43:15] INFO: Started server process [1]
[2025-04-09 13:43:15] INFO: Waiting for application startup.
[2025-04-09 13:43:15] INFO: Application startup complete.
[2025-04-09 13:43:15] INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
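Instead of tailing logs, you can also block until the Deployment reports Ready; a minimal sketch using the Deployment name from the YAML above:
# Wait up to 15 minutes for all replicas to become available
kubectl rollout status deployment/qwq-32b-1 --timeout=15m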
- Test the API to verify that the service can be called:
# Exec into the container for debugging
kubectl exec -it <pod-name> -- bash
# Non-streaming completion call (the model name matches --served-model-name above)
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwq-32b",
    "prompt": "What is GitHub?",
    "max_tokens": 100,
    "temperature": 0
  }'
# Streaming completion call
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwq-32b",
    "prompt": "What is GitHub?",
    "max_tokens": 100,
    "temperature": 0,
    "stream": true
  }'
# Test that the service exposed via the public IP is reachable
curl http://<YOUR-PUBLIC-IP>:80/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwq-32b",
    "prompt": "What is GitHub?",
    "max_tokens": 100,
    "temperature": 0
  }'
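Open WebUI talks to the OpenAI-compatible chat endpoint rather than /v1/completions, so it is worth sanity-checking that path as well; a minimal sketch (model name as registered via --served-model-name):
curl http://<YOUR-PUBLIC-IP>:80/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwq-32b",
    "messages": [{"role": "user", "content": "What is GitHub?"}],
    "max_tokens": 100
  }'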
- The QwQ-32B model service is now deployed. If the service is no longer needed, delete all deployed resources (including the public IP) by running:
kubectl delete -f inference.tpl.yaml
Deploying Open WebUI
We use the open-source Open WebUI to provide a frontend page for interactive chat.
- Create owebui.tpl.yaml for the Open WebUI deployment. It contains a Deployment, a Service with a public IP, and a 1 GiB shared-storage PVC; a sample YAML follows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: owebui-data # name of the storage volume
spec:
accessModes:
  - ReadWriteMany # access mode of the volume
resources:
requests:
      storage: 1Gi # capacity of the volume
  storageClassName: shared-nvme-cn-huabei1 # name of the StorageClass used to create the volume
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: owebui-1
namespace: default
labels:
app: owebui-1
spec:
replicas: 1
selector:
matchLabels:
app: owebui-1
template:
metadata:
labels:
app: owebui-1
spec:
volumes:
      # Mount the shared model data disk (if needed)
- name: models
hostPath:
path: /public
      # Mount the distributed storage volume (if needed)
- name: data
persistentVolumeClaim:
claimName: owebui-data
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
containers:
- name: owebui
image: registry-cn-huabei1-internal.ebcloud.com/ghcr.io/open-webui/open-webui:ollama
command:
- bash
- "-c"
- |
# pip install huggingface_hub[hf_xet] -i https://pypi.tuna.tsinghua.edu.cn/simple
# prepare open-webui data dir
mkdir -p /data/owebui-data/data
rm -rf /app/backend/data
ln -s /data/owebui-data/data /app/backend/
# prepare ollama models dir
mkdir -p /data/ollama-data
rm -rf /root/.ollama
ln -s /data/ollama-data /root/.ollama
# echo "pulling llama3.2:1b ..."
# ollama pull llama3.2:1b
# prepare open-webui secret key
if [ ! -f /app/backend/data/.webui_secret_key ]; then
echo $(head -c 12 /dev/random | base64) > /app/backend/data/.webui_secret_key
fi
ln -s /app/backend/data/.webui_secret_key /app/backend/.webui_secret_key
# comment out lines containing "application/json" in images.py
sed -i '/application\/json/s/^/#/' /app/backend/open_webui/routers/images.py
# start open-webui
bash /app/backend/start.sh
env:
- name: HF_ENDPOINT
value: "https://hf-mirror.com"
- name: ENABLE_EVALUATION_ARENA_MODELS
value: "false"
- name: RAG_EMBEDDING_MODEL
value: "/public/huggingface-models/sentence-transformers/all-MiniLM-L6-v2"
        # - name: ENABLE_OPENAI_API # When starting without a model service (no OPENAI_API_BASE_URL/OPENAI_API_BASE_URLS), uncomment this to avoid a dependency on the official OpenAI API
        #   value: "false"
        - name: OPENAI_API_KEYS # separate multiple API keys with semicolons
          value: "sk-foo-bar"
        - name: OPENAI_API_BASE_URLS # separate multiple base URLs with semicolons
          value: "http://<YOUR-PUBLIC-IP>/v1"
- name: ENABLE_WEB_SEARCH
value: "true"
- name: WEB_SEARCH_ENGINE
value: "searxng"
- name: SEARXNG_QUERY_URL
value: "http://searxng-1/search?q=<query>"
ports:
- containerPort: 8080
resources:
limits:
cpu: "2"
memory: 4G
requests:
cpu: "2"
memory: 4G
volumeMounts:
- name: shm
mountPath: /dev/shm
        # Mount the shared model data disk (if needed)
- name: models
mountPath: /public
        # Mount the distributed storage volume (if needed)
- name: data
mountPath: /data
---
apiVersion: v1
kind: Service
metadata:
name: owebui-1
namespace: default
spec:
ports:
- name: http-owebui-1
port: 80
protocol: TCP
targetPort: 8080
  # The label selector must match the Deployment's pod labels
selector:
app: owebui-1
sessionAffinity: None
  # Use type LoadBalancer to expose the service externally; a public IP is assigned automatically
type: LoadBalancer
# type: ClusterIP
# Deploy Open WebUI; this also requests a public IP so we can access it from a local browser
kubectl apply -f owebui.tpl.yaml
# Get the public IP. In the output of the command below, the EXTERNAL-IP of the owebui-1 service is the assigned public IP.
kubectl get service
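Optionally, confirm the page is reachable from your machine before opening the browser; a simple check that expects an HTTP 200 status line:
curl -sI http://<YOUR-PUBLIC-IP>/ | head -n 1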
Once you have the public IP, open it directly in a browser to reach the deployed Open WebUI page.
The Open WebUI YAML above is preconfigured to use the QwQ-32B model service started earlier and the SearXNG service deployed in the next section; to customize, simply adjust the corresponding parameters:
containers:
  env:
  - name: OPENAI_API_KEYS # separate multiple API keys with semicolons
    value: "sk-foo-bar"
  - name: OPENAI_API_BASE_URLS # separate multiple base URLs with semicolons
    value: "http://<YOUR-PUBLIC-IP>/v1"
  - name: ENABLE_WEB_SEARCH
    value: "true"
  - name: WEB_SEARCH_ENGINE
    value: "searxng"
  - name: SEARXNG_QUERY_URL
    value: "http://searxng-1/search?q=<query>"
Deploying the Web Search Feature
Open WebUI supports a rich set of companion features, and web search is one of the most commonly used. You can request and configure API access for search engines such as Google, Baidu, or Bing yourself; here we instead build on the open-source SearXNG to provide an easy-to-deploy, self-hosted search option with a more privacy-friendly search experience.
- Create searxng.tpl.yaml for the web search deployment. It contains a Deployment, a Service with a public IP, and a 1 GiB shared-storage PVC; a sample YAML follows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: searxng-data # name of the storage volume
spec:
accessModes:
  - ReadWriteMany # access mode of the volume
resources:
requests:
      storage: 1Gi # capacity of the volume
  storageClassName: shared-nvme-cn-huabei1 # name of the StorageClass used to create the volume
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: searxng-1
namespace: default
labels:
app: searxng-1
spec:
replicas: 1
selector:
matchLabels:
app: searxng-1
template:
metadata:
labels:
app: searxng-1
spec:
volumes:
      # Mount the distributed storage volume (if needed)
- name: data
persistentVolumeClaim:
claimName: searxng-data
- name: shm
emptyDir:
medium: Memory
sizeLimit: "2Gi"
- name: public-data
hostPath:
path: /public
containers:
- name: searxng
image: registry-cn-huabei1-internal.ebcloud.com/docker.io/searxng/searxng:latest
command:
- sh
- "-c"
- |
rm -f /etc/searxng/settings.yml /etc/searxng/limiter.toml
if [ ! -f /root/.searxng/settings.yml ]; then
cp /public/shared-resources/searxng-config/settings.yml /root/.searxng/
fi
if [ ! -f /root/.searxng/limiter.toml ]; then
cp /public/shared-resources/searxng-config/limiter.toml /root/.searxng/
fi
cp /root/.searxng/settings.yml /etc/searxng/
cp /root/.searxng/limiter.toml /etc/searxng/
# export SEARXNG_BIND_ADDRESS=0.0.0.0
# export SEARXNG_PORT=8080
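            # NOTE: replace this placeholder secret with a strong random value in production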
export SEARXNG_SECRET=ABABAB
# export SEARXNG_LIMITER=true
# export SEARXNG_PUBLIC_INSTANCE=true
/usr/local/searxng/dockerfiles/docker-entrypoint.sh
env:
- name: HF_ENDPOINT
value: "https://hf-mirror.com"
ports:
- containerPort: 8080
resources:
limits:
cpu: "4"
memory: 8G
requests:
cpu: "2"
memory: 4G
volumeMounts:
- name: shm
mountPath: /dev/shm
- name: data
mountPath: /root/.searxng/
subPath: searxng-data-mount
- name: public-data
mountPath: /public
---
apiVersion: v1
kind: Service
metadata:
name: searxng-1
namespace: default
spec:
ports:
- name: http-searxng-1
port: 80
protocol: TCP
targetPort: 8080
  # The label selector must match the Deployment's pod labels
selector:
app: searxng-1
sessionAffinity: None
  # Use type LoadBalancer to expose the service externally; a public IP is assigned automatically
type: LoadBalancer
# Deploy the SearXNG service; it usually starts within about 30 seconds. A set of search backends that are reliably reachable from mainland China is preconfigured by default.
kubectl apply -f searxng.tpl.yaml
Once deployed, the "Web Search" feature can be used from the Open WebUI page.
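To verify that SearXNG is reachable at the SEARXNG_QUERY_URL configured for Open WebUI, you can query it from inside the Open WebUI pod; a sketch assuming curl is available in that image and substituting a literal query for <query>:
kubectl exec -it <YOUR-OWEBUI-POD-NAME> -- curl -s "http://searxng-1/search?q=test" | head -n 20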
