教程：在 OpenBayes 上使用 SGLang 部署 DeepSeek V3

环境准备

使用「工作空间集群」启动包含两个节点的 batch workspace：

每个 job 都使用 8 x H800 的算力。
镜像使用 sglang 0.4.1
数据集绑定 DeepSeek V3 模型

启动成功后，进入 master 容器。

启动服务

准备以下两个命令，存放在 jupyter 默认开启的 /home 目录下：

`run.sh`

export NCCL_IB_DISABLE=0
export NCCL_IB_HCA=mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1
export NCCL_IB_GID_INDEX=5
export NCCL_SOCKET_IFNAME="eth0"
export NCCL_DEBUG=info

python -m sglang.launch_server \
    --model-path /input0/DeepSeek-V3 --served-model-name deepseek-v3 --tp 16 \
    --nccl-init $MASTER_IP:5000 --nnodes $NNODES --node-rank $NODE_RANK \
    --trust-remote-code \
    --host 0.0.0.0 --port 8080

`master_run.py`

import json
import subprocess

# Step 1: Read hostfile.json
with open('/hostfile.json') as f:
    hosts = json.load(f)

MASTER_IP = hosts[0]['ip']
NNODES = len(hosts)

# Step 2: Set environment variables on master node and execute run.sh
subprocess.run([
    'tmux', 'new-session', '-d', '-s', 'node_0', '-n', 'run_tab',
    f'bash -c "export MASTER_IP={MASTER_IP} && export NNODES={NNODES} && export NODE_RANK=0 && bash /openbayes/home/run.sh; exec bash"'
])

print(f"Master IP: {MASTER_IP}, NNODES: {NNODES}, NODE_RANK: 0")

# Step 3: Iterate over worker nodes and configure them
for rank, node in enumerate(hosts[1:], start=1):
    node_ip = node['ip']
    print(f"Configuring node {rank} at {node_ip}")
    
    # Copy run.sh to the remote node
    subprocess.run([
        'scp', 'run.sh', f'root@{node_ip}:/openbayes/home/run.sh'
    ])
    
    # Set environment variables on remote node and start the script in a tmux session with a new window
    subprocess.run([
        'ssh', f'root@{node_ip}',
        f'tmux new-session -d -s node_{rank} \; new-window -n run_tab bash -c "export MASTER_IP={MASTER_IP} && export NNODES={NNODES} && export NODE_RANK={rank} && bash /openbayes/home/run.sh; exec bash"'
    ])

print("All nodes have been configured and started in tmux sessions with a dedicated window for run.sh.")

执行命令 python master_run.py，即可启动服务。通过命令 tmux a 可以查看服务的启动流程和服务的运行状态。

注意由于 DeepSeek V3 模型较大，启动服务需要较长时间，通常需要 30 - 40 分钟的启动时间，请耐心等待。

当看到如下信息时说明服务启动成功了：

[2025-01-07 09:38:20] INFO:     Started server process [9209]
[2025-01-07 09:38:20] INFO:     Waiting for application startup.
[2025-01-07 09:38:20] INFO:     Application startup complete.
[2025-01-07 09:38:20] INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
[2025-01-07 09:38:21] INFO:     127.0.0.1:46856 - "GET /get_model_info HTTP/1.1" 200 OK

效果测试

服务的 api 地址在开启 jupyter 服务后右侧边栏的「API地址」中可以看到，这里所使用的 sglang 提供一个 OpenAI 兼容的 API 接口，可以参考 OpenAI 的 API 文档进行调用。

环境准备​

启动服务​

run.sh​

master_run.py​

效果测试​

环境准备

启动服务

`run.sh`

`master_run.py`

效果测试