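How to launch spot instances
A spot instance is a special form of pay-as-you-go billing: compared with regular pay-as-you-go, the instance's compute resources are sold at a discount. A cloud server instance purchased this way performs the same as a regular one, but when inventory runs short it can be preempted by instances that do not have spot enabled, which interrupts it.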
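Prerequisites
Enabling spot for a dev machine
This section describes how to request a spot dev machine instance in the console.
- Log in to the 英博云 console.
- In the left navigation bar, select Dev Machines (开发机) to open the dev machine list page.
- On the dev machine list page, click Create Dev Machine (创建开发机) in the upper-left corner, then turn on the spot toggle (是否竞价).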

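Enabling spot for a training job
This section describes how to launch a spot training job in the cluster, using an nccl-test on two H800 nodes (16 GPUs in total) with RDMA enabled as the example:
- Install Kubeflow Training Operator (v1.8.0) in the cluster beforehand: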
kubectl apply --force-conflicts --server-side -k "https://ghfast.top/github.com/kubeflow/training-operator.git/manifests/overlays/standalone?ref=v1.8.0"
An example nccl.yaml:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: nccl-test-16
  annotations:
    # "false": spot disabled; the job can preempt spot-enabled instances.
    # "true": spot enabled; the price is lower, but the system may interrupt the job.
    eks.ebcloud.com/enable-spot: "true"
    # gang-min-member = sum of all replicas in spec (1 Launcher + 2 Workers = 3)
    eks.ebcloud.com/gang-min-member: "3"
spec:
  slotsPerWorker: 8 # each Worker provides 8 slots (one per GPU)
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1 # start one Launcher Pod
      template:
        spec:
          affinity:
            nodeAffinity: # Pod scheduling affinity
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: cloud.ebtech.com/cpu # label of the CPU nodes
                        operator: In
                        values:
                          - amd-epyc-milan
          containers:
            #- image: registry-cn-huabei1-internal.ebcloud.com/job-template/nccl-tests:v2.13.8-nccl2.23.4-ibperf24.07.0-cuda12.0.1-cudnn8-devel-ubuntu20.04-1
            - image: registry-cn-huabei1-internal.ebcloud.com/job-template/nccl-tests:12.2.2-cudnn8-devel-ubuntu20.04-nccl2.21.5-1-2ff05b2
              name: mpi-launcher
              command: ["/bin/bash", "-c"]
              args: [
                "sleep 20 && \
                mpirun \
                --mca btl_tcp_if_include bond0 \
                -np 16 \
                --allow-run-as-root \
                -bind-to none \
                -x LD_LIBRARY_PATH \
                -x NCCL_IB_DISABLE=0 \
                -x NCCL_IB_HCA=mlx5_100,mlx5_101,mlx5_102,mlx5_103,mlx5_104,mlx5_105,mlx5_106,mlx5_107 \
                -x NCCL_SOCKET_IFNAME=bond0 \
                -x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1 \
                -x NCCL_COLLNET_ENABLE=0 \
                -x NCCL_ALGO=NVLSTREE \
                -x NCCL_DEBUG=INFO \
                -x NCCL_DEBUG_SUBSYS=all \
                -x NCCL_DEBUG_FILE=/data/nccl.%h.%p.log \
                -x NCCL_TOPO_DUMP_FILE=/data/a_topo.xml \
                -x NCCL_GRAPH_DUMP_FILE=/data/a_graph.xml \
                /opt/nccl_tests/build/all_reduce_perf -b 512M -e 8G -f 2 -g 1 #-n 200 #-w 2 -n 20
                ",
                ]
              resources:
                limits:
                  cpu: "1"
                  memory: "2Gi"
    Worker:
      replicas: 2 # start 2 Worker Pods
      template:
        spec:
          hostNetwork: true
          hostPID: true
          affinity:
            nodeAffinity: # Pod scheduling affinity
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: cloud.ebtech.com/gpu # label of the GPU nodes
                        operator: In
                        values:
                          - H800_NVLINK_80GB
                      #- key: ring
                      #operator: In
                      #values:
                      #- ff
          volumes:
            - emptyDir:
                medium: Memory
              name: dshm
            - name: file
              persistentVolumeClaim:
                claimName: train
          containers:
            #- image: registry-cn-huabei1-internal.ebcloud.com/job-template/nccl-tests:v2.13.8-nccl2.23.4-ibperf24.07.0-cuda12.0.1-cudnn8-devel-ubuntu20.04-1
            - image: registry-cn-huabei1-internal.ebcloud.com/job-template/nccl-tests:12.2.2-cudnn8-devel-ubuntu20.04-nccl2.21.5-1-2ff05b2
              name: mpi-worker
              command: ["/bin/bash", "-c"]
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - mountPath: /data
                  name: file
              securityContext:
                capabilities:
                  add:
                    - IPC_LOCK
                    # - SYS_RESOURCE
              args:
                - |
                  echo "Starting SSH Server..."
                  /usr/sbin/sshd -De &
                  sleep infinity
              resources:
                limits:
                  nvidia.com/gpu: 8 # each Worker requests 8 GPUs
                  rdma/hca_shared_devices_ib: 8 # enable RDMA
- Run the spot training job:
kubectl apply -f nccl.yaml
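
Several numbers in nccl.yaml have to agree with each other when the job is resized: the gang-min-member annotation is the sum of all replicas, and the mpirun -np value in this example equals the Worker count times slotsPerWorker. The Python sketch below recomputes those values for the two-node example. The function names are hypothetical helpers written for illustration, not part of any 英博云 or Kubeflow API, and the size sweep assumes all_reduce_perf treats M and G as binary units (1024-based) and multiplies the message size by the -f factor at each step.

```python
# Hypothetical helpers for sanity-checking the values used in nccl.yaml.
# They do not call any cluster API; they only redo the arithmetic.

MIB = 1024 ** 2
GIB = 1024 ** 3


def derive_job_parameters(launcher_replicas, worker_replicas, slots_per_worker):
    """Return the values that must stay consistent with the replica counts."""
    return {
        # eks.ebcloud.com/gang-min-member: sum of all replicas in spec
        "gang_min_member": launcher_replicas + worker_replicas,
        # mpirun -np: one MPI rank per GPU slot across all Workers
        "mpi_ranks": worker_replicas * slots_per_worker,
    }


def sweep_sizes(start_bytes, end_bytes, factor):
    """List the message sizes of a sweep from start to end, multiplying by factor."""
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    sizes = []
    size = start_bytes
    while size <= end_bytes:
        sizes.append(size)
        size *= factor
    return sizes


if __name__ == "__main__":
    # Values from the example: 1 Launcher, 2 Workers, slotsPerWorker: 8
    print(derive_job_parameters(1, 2, 8))
    # Values from the example: -b 512M -e 8G -f 2
    print([s // MIB for s in sweep_sizes(512 * MIB, 8 * GIB, 2)])
```

For a different cluster size, for example four Workers, the same helper gives the values to put into the annotation and the -np flag before applying the manifest.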