Accessing the QwQ-32B Model Service through Open WebUI

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI platform that supports a variety of LLM runtimes. This article walks through how to call the QwQ-32B API service efficiently through Open WebUI and enable the web search feature.

Prerequisites

  • Register an account on the 英博云 platform and complete real-name verification, account top-up, and other preparations. For details, see: Preparation
  • Your local machine is already connected to the cluster. For details, see: Connecting to the Cluster

Deploying the QwQ-32B Model Service

QwQ-32B delivers excellent reasoning ability at the 32B scale. Several deployment options are available:

  • 1xA800: single-request throughput of 23 tokens/s
  • 2xA800: single-request throughput of 40 tokens/s; supports higher concurrency
  • 4xA800: single-request throughput of 62 tokens/s; supports even higher concurrency
  1. This article uses a single A800 for the demonstration. Create and edit the file inference.tpl.yaml (vi inference.tpl.yaml). It contains the model-service Deployment and a Service that provisions a public IP; an example YAML follows:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: qwq-32b-1
  namespace: default
  labels:
    app: qwq-32b-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: qwq-32b-1
  template:
    metadata:
      labels:
        app: qwq-32b-1
    spec:
      affinity: # Pod scheduling affinity: select the appropriate GPU model
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cloud.ebtech.com/gpu
                    operator: In
                    values:
                      - A800_NVLINK_80GB
      volumes:
        # Mount the shared model data volume (if needed)
        - name: models
          hostPath:
            path: /public
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "50Gi"
      containers:
      - name: qwq-32b
        image: registry-cn-huabei1-internal.ebcloud.com/docker.io/lmsysorg/sglang:v0.4.5-cu125
        command:
          - bash
          - "-c"
          - |
            python3 -m sglang.launch_server \
                  --model-path /public/huggingface-models/Qwen/QwQ-32B \
                  --tp "1" \
                  --host 0.0.0.0 --port 8000 \
                  --trust-remote-code \
                  --context-length 65536 \
                  --served-model-name qwq-32b \
                  --tool-call-parser qwen25
        env:
          - name: HF_DATASETS_OFFLINE
            value: "1"
          - name: TRANSFORMERS_OFFLINE
            value: "1"
          - name: HF_HUB_OFFLINE
            value: "1"
        ports:
        - containerPort: 8000
        resources:
          limits:
            cpu: "10"
            memory: 100G
            nvidia.com/gpu: "1"
          requests:
            cpu: "10"
            memory: 100G
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        # Mount the shared model data volume (if needed)
        - name: models
          mountPath: /public
---
apiVersion: v1
kind: Service
metadata:
  name: qwq-32b-1
  namespace: default
spec:
  ports:
  - name: http-qwq-32b-1
    port: 80
    protocol: TCP
    targetPort: 8000
  # The label selector should match the Deployment labels; it also matters for the prefix-caching feature
  selector:
    app: qwq-32b-1
  sessionAffinity: None
  # Use type LoadBalancer to expose the service externally; a public IP is assigned automatically
  type: LoadBalancer
# Deploy the QwQ-32B model service; this also creates a Service used for subsequent API calls
kubectl apply -f inference.tpl.yaml
# List the deployed pods
kubectl get pod
# Check the public IP: in the output of the command below, the EXTERNAL-IP of the qwq-32b-1 service is the assigned public IP
kubectl get service
# Follow the pod logs
kubectl logs -f <YOUR-POD-NAME>

When logs like the following appear, the model service has started successfully:

Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:35<00:02,  2.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:38<00:00,  2.81s/it]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:38<00:00,  2.75s/it]

[2025-04-09 13:43:07 TP0] Load weight end. type=Qwen2ForCausalLM, dtype=torch.bfloat16, avail mem=17.75 GB, mem usage=61.17 GB.
[2025-04-09 13:43:07 TP0] KV Cache is allocated. #tokens: 33901, K size: 4.14 GB, V size: 4.14 GB
[2025-04-09 13:43:07 TP0] Memory pool end. avail mem=8.72 GB
[2025-04-09 13:43:07 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=8.09 GB
Capturing batches (avail_mem=6.62 GB): 100%|██████████| 23/23 [00:07<00:00,  3.09it/s]
[2025-04-09 13:43:14 TP0] Capture cuda graph end. Time elapsed: 7.45 s. avail mem=6.60 GB. mem usage=1.49 GB.
[2025-04-09 13:43:15 TP0] max_total_num_tokens=33901, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=2049, context_len=65536
[2025-04-09 13:43:15] INFO:     Started server process [1]
[2025-04-09 13:43:15] INFO:     Waiting for application startup.
[2025-04-09 13:43:15] INFO:     Application startup complete.
[2025-04-09 13:43:15] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
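As a sanity check on these numbers, the logged K-cache size follows directly from the model shape. The sketch below assumes QwQ-32B inherits the Qwen2.5-32B geometry (64 layers, 8 KV heads, head dim 128) with bf16 entries; those shape numbers are an assumption, not something the log states:

```python
# Reproduce the "K size: 4.14 GB" figure from the startup log.
# Assumed model shape (Qwen2.5-32B geometry; not read from the log):
num_layers = 64
num_kv_heads = 8
head_dim = 128
bytes_per_elem = 2          # bf16
tokens = 33901              # "#tokens" from the log line

# Bytes of K cache needed per token, summed over all layers
bytes_per_token_k = num_layers * num_kv_heads * head_dim * bytes_per_elem
k_cache_gib = tokens * bytes_per_token_k / 2**30
print(f"K size: {k_cache_gib:.2f} GB")  # K size: 4.14 GB
```

The V cache has the same shape, which is why the log reports an identical 4.14 GB for "V size".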
  2. Make a few test calls to verify the service responds correctly:
# Open a shell inside the container for debugging
kubectl exec -it <pod-name> -- bash
# Non-streaming (batch) completion call
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/public/huggingface-models/Qwen/QwQ-32B",
        "prompt": "github是什么",
        "max_tokens": 100,
        "temperature": 0
    }'
# Streaming completion call
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/public/huggingface-models/Qwen/QwQ-32B",
        "prompt": "github是什么",
        "max_tokens": 100,
        "temperature": 0,
        "stream": 1
    }'
# Test that the service exposed on the public IP is reachable
curl http://<YOUR-PUBLIC-IP>:80/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "/public/huggingface-models/Qwen/QwQ-32B",
        "prompt": "github是什么",
        "max_tokens": 100,
        "temperature": 0,
    }'
  3. The QwQ-32B model service is now fully deployed. If you no longer need it later, delete the resources and release the public IP by running: kubectl delete -f inference.tpl.yaml
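The streaming call in step 2 returns Server-Sent Events rather than a single JSON body. Below is a minimal sketch of collecting the generated text from such a stream; the sample chunks are illustrative, not captured server output:

```python
import json

def parse_sse_stream(lines):
    """Collect generated text from an OpenAI-compatible /v1/completions
    SSE stream: each event line is 'data: {...}', and the stream ends
    with 'data: [DONE]'."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        pieces.append(chunk["choices"][0]["text"])
    return "".join(pieces)

# Illustrative chunks, not real server output
sample = [
    'data: {"choices": [{"text": "GitHub"}]}',
    '',
    'data: {"choices": [{"text": " is a code hosting platform."}]}',
    'data: [DONE]',
]
print(parse_sse_stream(sample))  # GitHub is a code hosting platform.
```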

Deploying Open WebUI

We use the open-source Open WebUI project to deploy a front-end page that covers the interactive chat use case.

  1. Example owebui.tpl.yaml for deploying Open WebUI. The file contains a Deployment, a Service that provisions a public IP, and a 1 GiB shared PersistentVolumeClaim; an example YAML follows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: owebui-data # name of the volume
spec:
  accessModes:
    - ReadWriteMany # access mode of the volume
  resources:
    requests:
      storage: 1Gi # requested volume capacity
  storageClassName: shared-nvme-cn-huabei1 # name of the StorageClass used to provision the volume
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: owebui-1
  namespace: default
  labels:
    app: owebui-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: owebui-1
  template:
    metadata:
      labels:
        app: owebui-1
    spec:
      volumes:
        # Mount the shared model data volume (if needed)
        - name: models
          hostPath:
            path: /public
        # Mount the distributed storage volume (if needed)
        - name: data
          persistentVolumeClaim:
            claimName: owebui-data
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
      containers:
      - name: owebui
        image: registry-cn-huabei1-internal.ebcloud.com/ghcr.io/open-webui/open-webui:ollama
        command:
          - bash
          - "-c"
          - |
            # pip install huggingface_hub[hf_xet] -i https://pypi.tuna.tsinghua.edu.cn/simple
            # prepare open-webui data dir
            mkdir -p /data/owebui-data/data
            rm -rf /app/backend/data
            ln -s /data/owebui-data/data /app/backend/

            # prepare ollama models dir
            mkdir -p /data/ollama-data
            rm -rf /root/.ollama
            ln -s /data/ollama-data /root/.ollama
            # echo "pulling  llama3.2:1b ..."
            # ollama pull llama3.2:1b
            
            # prepare open-webui secret key
            if [ ! -f /app/backend/data/.webui_secret_key ]; then
              echo $(head -c 12 /dev/random | base64) > /app/backend/data/.webui_secret_key
            fi
            ln -s /app/backend/data/.webui_secret_key /app/backend/.webui_secret_key
            
            # comment out lines containing "application/json" in images.py
            sed -i '/application\/json/s/^/#/' /app/backend/open_webui/routers/images.py
            
            # start open-webui
            bash /app/backend/start.sh
        env:
          - name: HF_ENDPOINT
            value: "https://hf-mirror.com"
          - name: ENABLE_EVALUATION_ARENA_MODELS
            value: "false"
          - name: RAG_EMBEDDING_MODEL
            value: "/public/huggingface-models/sentence-transformers/all-MiniLM-L6-v2"
          # - name: ENABLE_OPENAI_API # When starting without a model service (no OPENAI_API_BASE_URL/OPENAI_API_BASE_URLS), uncomment this to avoid depending on the official OpenAI API
          #   value: "false"
          - name: OPENAI_API_KEYS # separate multiple API keys with semicolons
            value: "sk-foo-bar"
          - name: OPENAI_API_BASE_URLS # separate multiple API base URLs with semicolons
            value: "http://<YOUR-PUBLIC-IP>/v1"
          - name: ENABLE_WEB_SEARCH
            value: "true"
          - name: WEB_SEARCH_ENGINE
            value: "searxng"
          - name: SEARXNG_QUERY_URL
            value: "http://searxng-1/search?q=<query>"
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "2"
            memory: 4G
          requests:
            cpu: "2"
            memory: 4G
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        # Mount the shared model data volume (if needed)
        - name: models
          mountPath: /public
        # Mount the distributed storage volume (if needed)
        - name: data
          mountPath: /data
---
apiVersion: v1
kind: Service
metadata:
  name: owebui-1
  namespace: default
spec:
  ports:
  - name: http-owebui-1
    port: 80
    protocol: TCP
    targetPort: 8080
  # The label selector should match the Deployment labels
  selector:
    app: owebui-1
  sessionAffinity: None
  # Use type LoadBalancer to expose the service externally; a public IP is assigned automatically
  type: LoadBalancer
  # type: ClusterIP
# Deploy Open WebUI; a public IP is provisioned along with it so it can be opened in a local browser
kubectl apply -f owebui.tpl.yaml
# Check the public IP: in the output of the command below, the EXTERNAL-IP of the Open WebUI service is the assigned public IP
kubectl get service

Once you have the public IP, open it directly in a browser to reach the deployed Open WebUI page.

The Open WebUI YAML above is preconfigured with the QwQ-32B model service deployed earlier and the SearXNG service deployed in the next section; to customize, just adjust the corresponding parameters:

      containers:
        env:
          - name: OPENAI_API_KEYS # separate multiple API keys with semicolons
            value: "sk-foo-bar"
          - name: OPENAI_API_BASE_URLS # separate multiple API base URLs with semicolons
            value: "http://<YOUR-PUBLIC-IP>/v1"
          - name: ENABLE_WEB_SEARCH
            value: "true"
          - name: WEB_SEARCH_ENGINE
            value: "searxng"
          - name: SEARXNG_QUERY_URL
            value: "http://searxng-1/search?q=<query>"

Deploying the Web Search Feature

Open WebUI supports a rich set of companion features, and web search is one of the most commonly used. You can apply for and configure API keys for search engines such as Google, Baidu, or Bing yourself; here we instead build on the open-source SearXNG project, an easy-to-deploy private search option that offers a more privacy-friendly search experience.

  1. Example searxng.tpl.yaml for deploying the web search feature. The file contains a Deployment, a Service that provisions a public IP, and a 1 GiB shared PersistentVolumeClaim; an example YAML follows:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: searxng-data # name of the volume
spec:
  accessModes:
    - ReadWriteMany # access mode of the volume
  resources:
    requests:
      storage: 1Gi # requested volume capacity
  storageClassName: shared-nvme-cn-huabei1 # name of the StorageClass used to provision the volume
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: searxng-1
  namespace: default
  labels:
    app: searxng-1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: searxng-1
  template:
    metadata:
      labels:
        app: searxng-1
    spec:
      volumes:
        # Mount the distributed storage volume (if needed)
        - name: data
          persistentVolumeClaim:
            claimName: searxng-data
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "2Gi"
        - name: public-data
          hostPath:
            path: /public
      containers:
      - name: searxng
        image: registry-cn-huabei1-internal.ebcloud.com/docker.io/searxng/searxng:latest
        command:
          - sh
          - "-c"
          - |
            rm -f /etc/searxng/settings.yml /etc/searxng/limiter.toml

            if [ ! -f /root/.searxng/settings.yml ]; then
              cp /public/shared-resources/searxng-config/settings.yml /root/.searxng/
            fi
            if [ ! -f /root/.searxng/limiter.toml ]; then
              cp /public/shared-resources/searxng-config/limiter.toml /root/.searxng/
            fi

            cp /root/.searxng/settings.yml /etc/searxng/
            cp /root/.searxng/limiter.toml /etc/searxng/
            # export SEARXNG_BIND_ADDRESS=0.0.0.0
            # export SEARXNG_PORT=8080
            export SEARXNG_SECRET=ABABAB
            # export SEARXNG_LIMITER=true
            # export SEARXNG_PUBLIC_INSTANCE=true
            /usr/local/searxng/dockerfiles/docker-entrypoint.sh
        env:
          - name: HF_ENDPOINT
            value: "https://hf-mirror.com"
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "4"
            memory: 8G
          requests:
            cpu: "2"
            memory: 4G
        volumeMounts:
        - name: shm
          mountPath: /dev/shm
        - name: data
          mountPath: /root/.searxng/
          subPath: searxng-data-mount
        - name: public-data
          mountPath: /public

---
apiVersion: v1
kind: Service
metadata:
  name: searxng-1
  namespace: default
spec:
  ports:
  - name: http-searxng-1
    port: 80
    protocol: TCP
    targetPort: 8080
  # The label selector should match the Deployment labels
  selector:
    app: searxng-1
  sessionAffinity: None
  # Use type LoadBalancer to expose the service externally; a public IP is assigned automatically
  type: LoadBalancer
# Deploy the SearXNG service; it usually finishes starting within 30 s. Several search backends that are stably reachable from mainland China are configured by default
kubectl apply -f searxng.tpl.yaml

Once deployment completes, the Web Search feature can be used on the Open WebUI page.
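For reference, the SEARXNG_QUERY_URL configured earlier contains a <query> placeholder that the caller fills with a URL-encoded search term before issuing the request. A minimal sketch of that substitution (the helper name is hypothetical, and the exact encoding Open WebUI applies may differ):

```python
from urllib.parse import quote_plus

def build_search_url(template: str, query: str) -> str:
    """Replace the <query> placeholder in a SEARXNG_QUERY_URL-style
    template with a URL-encoded search term."""
    return template.replace("<query>", quote_plus(query))

print(build_search_url("http://searxng-1/search?q=<query>", "open webui search"))
# http://searxng-1/search?q=open+webui+search
```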