This article describes how to use LMDeploy in a Jupyter Notebook to quantize and deploy large language models.

Environment Setup

Run the following commands in a terminal:

conda create -n lmdeploy --clone /share/conda_envs/internlm-base
conda activate lmdeploy

pip install ipykernel
python -m ipykernel install --user --name lmdeploy --display-name lmdeploy

Create a notebook, select the lmdeploy kernel, and run the following code:

# Set up the notebook environment
import os, sys

PATH = os.environ['PATH']
basedir = os.path.dirname(os.path.dirname(sys.exec_prefix))

# $PATH here could also be written as {os.environ['PATH']}; this just shows that the $VARIABLE form works too
%env CONDA_EXE={os.path.join(basedir, 'bin/conda')}
%env CONDA_PREFIX={sys.exec_prefix}
%env CONDA_PYTHON_EXE={os.path.join(basedir, 'bin/python')}
%env PATH={os.path.join(sys.exec_prefix, 'bin')}:$PATH
env: CONDA_EXE=/root/.conda/bin/conda
env: CONDA_PREFIX=/root/.conda/envs/lmdeploy
env: CONDA_PYTHON_EXE=/root/.conda/bin/python
env: PATH=/root/.conda/envs/lmdeploy/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
!wget  https://github.com/Dao-AILab/flash-attention/releases/download/v2.3.6/flash_attn-2.3.6+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
%pip install flash_attn-2.3.6+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
%pip install -q 'lmdeploy[all]==v0.1.0'

Model Conversion

To run inference with TurboMind, the model must first be converted to TurboMind's format. Two modes are currently supported: online conversion and offline conversion. Online conversion loads a Huggingface model directly, while offline conversion saves the converted model first and then loads it.

TurboMind is an efficient inference engine for LLMs, developed on top of NVIDIA's FasterTransformer. Its main features include support for LLaMA-style models, a persistent-batch inference mode, and an extensible KV cache manager.

Online Conversion

Online conversion automatically downloads the model files (or uses locally downloaded ones), converts them on the fly, and then starts the service.

# # These two commands require a network environment that can reach Huggingface

# # Load a version quantized with lmdeploy
# !lmdeploy chat turbomind internlm/internlm-chat-20b-4bit --model-name internlm-chat-20b

# # Load another LLM model
# !lmdeploy chat turbomind Qwen/Qwen-7B-Chat --model-name qwen-7b
# Load a locally downloaded model
!lmdeploy chat turbomind /share/temp/model_repos/internlm-chat-7b/ --model-name internlm-chat-7b

Offline Conversion

Offline conversion first converts the model to lmdeploy's TurboMind format, and the service is then started manually.

# Convert to TurboMind format
!lmdeploy convert internlm-chat-7b /root/share/temp/model_repos/internlm-chat-7b/
create workspace in directory ./workspace
copy triton model templates from "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models"
copy service_docker_up.sh from "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/turbomind/service_docker_up.sh" to "./workspace"
model_name             internlm-chat-7b
model_format           None
inferred_model_format  hf
model_path             /root/share/temp/model_repos/internlm-chat-7b/
tokenizer_path         /root/share/temp/model_repos/internlm-chat-7b/tokenizer.model
output_format          fp16
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([4096, 12288]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([4096, 4096]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.bias, shape=torch.Size([1, 12288]), split_dim=-1, tp=1
### copying layers.0.attention.wo.bias, shape=torch.Size([4096])                
*** splitting layers.0.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0, tp=1
... (the same splitting and copying messages repeat for layers 1 through 31) ...
Convert to turbomind format: 100%|██████████████| 32/32 [00:25<00:00,  1.23it/s]

After the command finishes, a workspace folder is created in the current directory. It contains the files that TurboMind and Triton need for "model inference".

!ls workspace
model_repository  service_docker_up.sh	triton_models

TurboMind Inference

All inference from here on uses the workspace produced by the offline conversion.

Several backends are supported, such as TurboMind, PyTorch, and DeepSpeed. However, the PyTorch and DeepSpeed backends both call Huggingface's Transformers package under the hood: PyTorch means the plain Transformers path, and DeepSpeed means DeepSpeed is used as the inference framework. Both are currently limited in functionality, not production-ready, and not recommended.

Command-Line Chat

# Turbomind + Bash Local Chat
!lmdeploy chat turbomind ./workspace

API Service

For "model inference/serving", two serving options are currently provided: TurboMind and TritonServer. In this setup the Server is TurboMind or TritonServer, and the API Server exposes the external API. We recommend TurboMind.

# ApiServer+Turbomind   api_server => AsyncEngine => TurboMind
# server_name and server_port set the service address and port
# tp is the degree of tensor parallelism
# instance_num is the number of instances, which can be thought of as the batch size
!lmdeploy serve api_server ./workspace \
--server_name 0.0.0.0 \
--server_port 23333 \
--instance_num 64 \
--tp 1

Then open a new window and run the client command below:

# ChatApiClient + ApiServer (note: this is the HTTP protocol, so the http:// prefix is required)
!lmdeploy serve api_client http://localhost:23333
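
Besides the CLI client, the service can also be called over HTTP from Python. The snippet below is only an illustrative sketch: it assumes the server started above is reachable at localhost:23333 and exposes an OpenAI-style /v1/chat/completions route; check the endpoint list on the page described below and adjust the URL and payload to what is actually served.

import requests

# Hypothetical request following an OpenAI-style schema; verify the actual
# routes on http://localhost:23333 before relying on them.
url = "http://localhost:23333/v1/chat/completions"
payload = {
    "model": "internlm-chat-7b",  # should match the served model name
    "messages": [{"role": "user", "content": "请介绍下你自己吧"}],
}
resp = requests.post(url, json=payload, timeout=60)
print(resp.json())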

Alternatively, use the API service directly and browse the list of API endpoints:

# Replace 33449 with your own SSH port
ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 33449

Then open http://localhost:23333 to view it.

Web Demo

This part uses Gradio as the front-end demo. Since the Gradio UI needs to be accessed locally, the traffic also has to be forwarded to the local machine via ssh. The command is as follows:

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p <your ssh port>

TurboMind API Service as the Backend

That is, Gradio connects directly to the API Server and only indirectly to TurboMind. The alternative is for Gradio to connect directly to TurboMind, with no API Server needed.

# Start the API service first
!lmdeploy serve api_server ./workspace \
--server_name 0.0.0.0 \
--server_port 23333 \
--instance_num 64 \
--tp 1
# Gradio + ApiServer. The Server must be started first; here Gradio acts as the Client
!lmdeploy serve gradio http://0.0.0.0:23333 \
--server_name 0.0.0.0 \
--server_port 6006 \
--restful_api True

TurboMind Inference as the Backend

# Gradio+Turbomind(local)
!lmdeploy serve gradio ./workspace

TurboMind Inference + Python Integration

from lmdeploy import turbomind as tm

# load model
model_path = "/root/share/temp/model_repos/internlm-chat-7b/"
tm_model = tm.TurboMind.from_pretrained(model_path, model_name='internlm-chat-7b')
generator = tm_model.create_instance()

# process query
# The prompt simply adds a <|System|> message and a <|User|> message (the user's query), plus a <|Bot|> marker indicating that the model should produce its response next.
query = "请介绍下你自己吧"
prompt = tm_model.model.get_prompt(query)
input_ids = tm_model.tokenizer.encode(prompt)

# inference
for outputs in generator.stream_infer(
        session_id=0,
        input_ids=[input_ids]):
    res, tokens = outputs[0]

response = tm_model.tokenizer.decode(res.tolist())
print(response)
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html
  from .autonotebook import tqdm as notebook_tqdm
model_source: hf_model


WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.


model_config:
{
  "model_name": "internlm-chat-7b",
  "tensor_para_size": 1,
  "head_num": 32,
  "kv_head_num": 32,
  "vocab_size": 103168,
  "num_layer": 32,
  "inter_size": 11008,
  "norm_eps": 1e-06,
  "attn_bias": 1,
  "start_id": 1,
  "end_id": 2,
  "session_len": 2056,
  "weight_type": "fp16",
  "rotary_embedding": 128,
  "rope_theta": 10000.0,
  "size_per_head": 128,
  "group_size": 0,
  "max_batch_size": 64,
  "max_context_token_num": 1,
  "step_length": 1,
  "cache_max_entry_count": 0.5,
  "cache_block_seq_len": 128,
  "cache_chunk_size": 1,
  "use_context_fmha": 1,
  "quant_policy": 0,
  "max_position_embeddings": 2048,
  "rope_scaling_factor": 0.0,
  "use_logn_attn": 0
}
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 2056.
[TM][WARNING] [LlamaTritonModel] `num_tokens_per_iter` is not set, default to `max_context_token_num` (2056).
get 323 model params
                                                                            

[WARNING] gemm_config.in is not found; using default GEMM algo


[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 64 MB
[TM][INFO] [BlockManager] max_block_count = 159
[TM][INFO] [BlockManager] chunk_size = 1
[TM][INFO] LlamaBatch<T>::Start()
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 0 received.
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 108, max_q = 108, max_k = 108
[TM][INFO] ------------------------- step = 110 -------------------------
[TM][INFO] ------------------------- step = 120 -------------------------
[TM][INFO] ------------------------- step = 130 -------------------------
[TM][INFO] ------------------------- step = 140 -------------------------
[TM][INFO] ------------------------- step = 150 -------------------------
[TM][INFO] ------------------------- step = 160 -------------------------
[TM][INFO] ------------------------- step = 170 -------------------------
[TM][INFO] ------------------------- step = 180 -------------------------
[TM][INFO] ------------------------- step = 190 -------------------------
[TM][INFO] ------------------------- step = 200 -------------------------
[TM][INFO] ------------------------- step = 210 -------------------------
[TM][INFO] ------------------------- step = 220 -------------------------


书生·浦语,上海人工智能实验室开发的人工智能语言模型,致力于通过执行常见的基于语言的任务和提供建议来帮助人类。我的设计理念是有用、诚实并且无害。我可以使用汉语和英语进行交流。我能够回答问题、提供定义和解释、将文本从一种语言翻译成另一种语言、总结文本、生成文本、编写故事、分析情感、提供推荐、开发算法、编写代码以及其他任何基于语言的任务。但是我不能看、听、尝、触摸、闻、移动、与物理世界交互、感受情感或体验感官输入、执行需要身体能力的任务。


[TM][INFO] ------------------------- step = 230 -------------------------
[TM][INFO] [Interrupt] slot = 0, id = 0
[TM][INFO] [forward] Request complete for 0, code 0

Summary

  • To expose an OpenAI-style HTTP API to the outside world: use TurboMind inference + the API service.
  • To build a demo, Gradio is clearly friendlier than local chat: use Gradio with TurboMind inference as the backend.
  • To use the model directly inside your own Python project: use TurboMind inference + Python.

Model Quantization

This part covers how to quantize the model, mainly KV Cache quantization and weight quantization. In general, quantization trades reduced precision of parameters or intermediate results for memory savings (and the performance gains that come with them).

Before formally introducing LMDeploy's quantization schemes, two concepts need to be introduced:

  • Compute-bound: most of the inference time is spent on numerical computation; compute-bound workloads can be sped up with faster compute hardware.
  • Memory-bound: most of the inference time is spent reading data; memory-bound workloads are usually optimized by reducing the number of memory accesses, increasing the compute-to-memory-access ratio, or reducing the amount of data moved.

Because of their decoder-only architecture, common LLMs spend most of their inference time in the token-by-token generation (decoding) phase, which is a typical memory-bound scenario.

So how do we address the memory-bound nature of LLM inference? We can use KV Cache quantization and 4-bit weight-only quantization (W4A16).

  • KV Cache quantization quantizes the intermediate K and V produced during token-by-token (decoding) generation to INT8 (dequantizing at compute time), reducing the GPU memory used during generation.
  • 4-bit weight quantization quantizes the FP16 model weights to INT4, so the kernels' memory traffic drops to 1/4 of the FP16 model, greatly reducing memory-access cost. "Weight only" means only the weights are quantized; computation still runs in FP16 (the INT4 weights are dequantized first).

KV Cache Quantization

Quantization Steps

KV Cache quantization converts the K and V of the already generated sequence to Int8. The procedure has three steps.

Step 1: compute min/max statistics. The idea is to collect statistics of the intermediate results at different positions of every layer for a given set of input samples.

  • For the K and V of attention: take the per-head, per-dimension maximum, minimum, and absolute maximum over all tokens. For each layer, these three sets of statistics are (num_heads, head_dim) matrices. They are used for the KV Cache quantization in this subsection (see the sketch after this list).
  • For the input of each layer: take the maximum, minimum, mean, absolute maximum, and absolute mean along the corresponding dimension. Every position of every layer has its own statistics, most of which are one-dimensional vectors of shape (hidden_dim,); in the FFN, however, the dimensions differ because the layer first widens and then narrows back. These statistics are used for the weight quantization in the next subsection, mainly in the scaling step (recall the slides).
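
To make the shapes concrete, here is a minimal sketch (not the lmdeploy implementation) of the per-head K statistics for a single layer, assuming a toy K tensor of shape (num_tokens, num_heads, head_dim):

import torch

# Toy K tensor for one layer: (num_tokens, num_heads, head_dim)
num_tokens, num_heads, head_dim = 512, 32, 128
k = torch.randn(num_tokens, num_heads, head_dim)

# Reduce over the token dimension -> each statistic is (num_heads, head_dim)
k_min    = k.min(dim=0).values
k_max    = k.max(dim=0).values
k_absmax = k.abs().max(dim=0).values

print(k_min.shape, k_max.shape, k_absmax.shape)  # three torch.Size([32, 128]) matrices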

The command for step 1 is:

# Compute min/max statistics
!lmdeploy lite calibrate \
--model /root/share/temp/model_repos/internlm-chat-7b/ \
--calib_dataset "c4" \
--calib_samples 128 \
--calib_seqlen 2048 \
--work_dir ./quant_output

This command selects 128 input samples of length 2048 each from the C4 dataset; feeding them through the model yields the statistics described above. Note that if GPU memory is insufficient, you can reduce the number of samples (calib_samples) or the sample length (calib_seqlen).

By default this step downloads the dataset from Huggingface, which often fails inside mainland China. We have therefore exported the required data, and you need to replace the dataset-loading code file. This takes two steps:

  • Step 1: copy calib_dataloader.py into the installation directory to replace the original file: cp /root/share/temp/datasets/c4/calib_dataloader.py /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/lite/utils/

  • Step 2: copy the dataset used (c4) into the following directory: cp -r /root/share/temp/datasets/c4/ /root/.cache/huggingface/datasets/

Step 2: derive the quantization parameters from the min/max statistics. This uses the formulas below to obtain the zero point (zp) and scale for the K and V of each layer.

zp = (min+max) / 2
scale = (max-min) / 255
quant: q = round( (f-zp) / scale)
dequant: f = q * scale + zp

With these two values you can quantize and dequantize. Concretely, the historical K and V are stored as quantized values and dequantized again when used.
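
The following toy sketch applies the formulas above to show the quantize/dequantize round trip (the asymmetric case; the symmetric variant selected by kv_sym instead uses scale = absmax / 127 with zp = 0):

import torch

f = torch.randn(8) * 3             # toy FP values standing in for K/V entries
zp = (f.min() + f.max()) / 2       # zero point
scale = (f.max() - f.min()) / 255  # scale

q = torch.round((f - zp) / scale).clamp(-128, 127).to(torch.int8)  # quant
f_hat = q.float() * scale + zp                                     # dequant

print((f - f_hat).abs().max())  # reconstruction error is on the order of scale/2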

The command for step 2 is:

# Derive quantization parameters from min/max statistics
!lmdeploy lite kv_qparams \
--work_dir ./quant_output \
--turbomind_dir workspace/triton_models/weights/ \
--kv_sym False \
--num_tp 1

In this command, num_tp was explained earlier: it is the tensor parallelism degree. The zero points and scales of each layer are stored in the workspace weights directory for later use. When kv_sym is True, an alternative (symmetric) quantization method is used, which relies on the absolute maximum stored in step 1 instead of the maximum and minimum.

Step 3: modify the configuration, i.e. the weights/config.ini file, which was already mentioned in "2.6.2 Model Configuration in Practice" (the KV int8 switch): just set quant_policy to 4.

One extra note for this step: if you are using TurboMind 1.0, you also need to change the use_context_fmha parameter to 0.
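
If you prefer to do this from the notebook, the following sketch edits the file with configparser; it assumes workspace/triton_models/weights/config.ini is a standard INI file that already contains the quant_policy key (the section name is not assumed here):

import configparser

cfg_path = "./workspace/triton_models/weights/config.ini"
cfg = configparser.ConfigParser()
cfg.read(cfg_path)

for section in cfg.sections():
    if "quant_policy" in cfg[section]:
        cfg[section]["quant_policy"] = "4"   # enable KV int8
    # For TurboMind 1.0 you would additionally set use_context_fmha to "0" here.

with open(cfg_path, "w") as f:
    cfg.write(f)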

After that, all of the services above can be run as usual, except that we are now using KV Cache quantization, which saves more (runtime) GPU memory.

Quantization Results

The official comparison of GPU memory usage for internlm-chat-7b before and after KV Cache quantization shows that it saves roughly 20% of GPU memory.

Accuracy before and after quantization was also measured on the OpenCompass platform. Not only is there no obvious drop in accuracy, on quite a few tasks there is even a slight improvement. A possible reason is that quantization introduces a certain amount of error, which can sometimes reduce overfitting to the training data and thus improve generalization; quantization can be viewed as a regularizer that injects mild noise. Alternatively, the quantized model may simply happen to perform better on some datasets.

In summary, KV Cache quantization clearly reduces GPU memory usage and may even improve accuracy at the same time.

W4A16 Quantization

Quantization Steps

The A in W4A16 refers to activations, which stay in FP16; only the weights are quantized to 4 bits. The procedure can again be seen as three steps.

Step 1: identical to step 1 of KV Cache quantization, so it is not repeated here.

Step 2: quantize the model weights. The statistics from step 1 are used to quantize the parameters, in two sub-steps:

  • Scale the parameters (mainly for performance reasons).
  • Quantize everything.

The command for step 2 is:

# Quantize the model weights
!lmdeploy lite auto_awq \
--model /root/share/temp/model_repos/internlm-chat-7b/ \
--w_bits 4 \
--w_group_size 128 \
--work_dir ./quant_output

In the command, w_bits is the number of quantization bits, w_group_size is the group size used for per-group statistics, and work_dir is where the quantized model is written. One thing worth pointing out: since there is no torch.int4, eight 4-bit weights are packed into a single int32 value when stored, so if you load these quantized parameters you will find they are of type int32.
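
As an illustration of that storage format, here is a minimal sketch of packing eight 4-bit values into one 32-bit integer (the actual nibble ordering used by the lmdeploy/AWQ kernels may differ):

# Eight 4-bit weight values (0-15 each)
w4 = [1, 7, 0, 15, 3, 9, 12, 5]

# Pack: each value occupies one 4-bit nibble of a single 32-bit word
packed = 0
for i, v in enumerate(w4):
    packed |= (v & 0xF) << (4 * i)

# Unpack to verify the round trip
unpacked = [(packed >> (4 * i)) & 0xF for i in range(8)]
print(hex(packed), unpacked)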

Step 3: convert to TurboMind format.

# Convert the model layout; the result goes to the default path ./workspace
!lmdeploy convert internlm-chat-7b ./quant_output \
--model-format awq \
--group-size 128

This group-size is the w_group_size from the previous step. If you do not want to overwrite the earlier workspace, you can specify an output directory with --dst_path, for example:

!lmdeploy convert  internlm-chat-7b ./quant_output \
--model-format awq \
--group-size 128 \
--dst_path ./workspace_quant

After that, just as in the previous section, all of the services above can be run as usual, except that we are now using the quantized model.

One final note: weight quantization and KV Cache quantization can also be combined to save as much GPU memory as possible.

Quantization Results

  • TurboMind has a very significant speed advantage over other frameworks, nearly 30% faster than mlc-llm.
  • The 4-bit model reduces GPU memory usage by 50-60%, a very noticeable saving.

In short, W4A16 weight quantization drastically reduces GPU memory usage, while inference speed retains a clear advantage over other frameworks.

First, it is important to understand that serving and quantization are not directly coupled. The main purpose of quantization is to reduce GPU memory usage, which has two components: the model parameters and the intermediate computation results. The former corresponds to "W4A16 Quantization" and the latter to "KV Cache Quantization".

Besides reducing memory, quantization usually also improves performance, because lower-precision floating point is more efficient to compute with than higher-precision floating point, and integer arithmetic is much more efficient still.

Our recommendation is therefore to try the various configurations and check whether the quality meets your needs, which usually requires testing on your own dataset. The steps are as follows.

  • Step 1: try the normal (non-quantized) version first and evaluate the quality.
    • If the quality is not good enough, try a larger model or fine-tuning.
    • If the quality is acceptable, go to the next step.
  • Step 2: try the normal version + KV Cache quantization and evaluate the quality.
    • If the quality is not good enough, go back to the previous step.
    • If the quality is acceptable, go to the next step.
  • Step 3: try the quantized version and evaluate the quality.
    • If the quality is not good enough, go back to the previous step.
    • If the quality is acceptable, go to the next step.
  • Step 4: try the quantized version + KV Cache quantization and evaluate the quality.
    • If the quality is not good enough, go back to the previous step.
    • If the quality is acceptable, adopt this configuration.

One more note: besides the workflow above, which quantized version to use and which features to enable also depends on framework and GPU support. For example, some frameworks may not support W4A16 inference, in which case the converted model cannot be used even if the conversion succeeds.

Based on practical experience, in general:

  • Higher precision uses more GPU memory and has lower inference efficiency, but usually gives better quality.
  • Server-side inference usually uses the non-quantized version or half-precision, BF16, or Int8 quantized versions; lower-precision quantization is rarely used.
  • On-device inference almost always uses quantized versions, mostly low-precision ones, mainly because of limited compute resources.

Example

Deploy the InternLM-Chat-7B model with LMDeploy using one of local chat, the Gradio web UI, or the API service, and generate a 300-character short story (screenshot required).

# Set up the notebook environment
import os, sys

PATH = os.environ['PATH']
basedir = os.path.dirname(os.path.dirname(sys.exec_prefix))

# $PATH here could also be written as {os.environ['PATH']}; this just shows that the $VARIABLE form works too
%env CONDA_EXE={os.path.join(basedir, 'bin/conda')}
%env CONDA_PREFIX={sys.exec_prefix}
%env CONDA_PYTHON_EXE={os.path.join(basedir, 'bin/python')}
%env PATH={os.path.join(sys.exec_prefix, 'bin')}:$PATH
env: CONDA_EXE=/root/.conda/bin/conda
env: CONDA_PREFIX=/root/.conda/envs/lmdeploy
env: CONDA_PYTHON_EXE=/root/.conda/bin/python
env: PATH=/root/.conda/envs/lmdeploy/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# Convert the model
%cd ~
!lmdeploy convert internlm-chat-7b /root/share/temp/model_repos/internlm-chat-7b/
/root


/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/IPython/core/magics/osm.py:393: UserWarning: using bookmarks requires you to install the `pickleshare` library.
  bkms = self.shell.db.get('bookmarks', {})
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/IPython/core/magics/osm.py:417: UserWarning: using dhist requires you to install the `pickleshare` library.
  self.shell.db['dhist'] = compress_dhist(dhist)[-100:]


create workspace in directory ./workspace
copy triton model templates from "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/turbomind/triton_models" to "./workspace/triton_models"
copy service_docker_up.sh from "/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/lmdeploy/serve/turbomind/service_docker_up.sh" to "./workspace"
model_name             internlm-chat-7b
model_format           None
inferred_model_format  hf
model_path             /root/share/temp/model_repos/internlm-chat-7b/
tokenizer_path         /root/share/temp/model_repos/internlm-chat-7b/tokenizer.model
output_format          fp16
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
*** splitting layers.0.attention.w_qkv.weight, shape=torch.Size([4096, 12288]), split_dim=-1, tp=1
*** splitting layers.0.attention.wo.weight, shape=torch.Size([4096, 4096]), split_dim=0, tp=1
*** splitting layers.0.attention.w_qkv.bias, shape=torch.Size([1, 12288]), split_dim=-1, tp=1
### copying layers.0.attention.wo.bias, shape=torch.Size([4096])                
*** splitting layers.0.feed_forward.w1.weight, shape=torch.Size([4096, 11008]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w3.weight, shape=torch.Size([4096, 11008]), split_dim=-1, tp=1
*** splitting layers.0.feed_forward.w2.weight, shape=torch.Size([11008, 4096]), split_dim=0, tp=1
... (the same splitting and copying messages repeat for layers 1 through 31) ...
Convert to turbomind format: 100%|██████████████| 32/32 [00:22<00:00,  1.44it/s]
# Start the Gradio + TurboMind service
!lmdeploy serve gradio ./workspace
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/gradio/components/button.py:89: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Button(...)` instead of `return gr.Button.update(...)`.
  warnings.warn(
model_source: workspace
WARNING: Can not find tokenizer.json. It may take long time to initialize the tokenizer.
[TM][WARNING] [LlamaTritonModel] `max_context_token_num` = 2056.
[TM][WARNING] [LlamaTritonModel] `num_tokens_per_iter` is not set, default to `max_context_token_num` (2056).
[WARNING] gemm_config.in is not found; using default GEMM algo
[TM][INFO] NCCL group_id = 0
[TM][INFO] [BlockManager] block_size = 64 MB
[TM][INFO] [BlockManager] max_block_count = 159
[TM][INFO] [BlockManager] chunk_size = 1
[TM][INFO] LlamaBatch<T>::Start()
server is gonna mount on: http://0.0.0.0:6006
Running on local URL:  http://0.0.0.0:6006

Could not create share link. Missing file: /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/gradio/frpc_linux_amd64_v0.2. 

Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps: 

1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /root/.conda/envs/lmdeploy/lib/python3.10/site-packages/gradio
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/gradio/components/textbox.py:163: UserWarning: Using the update method is deprecated. Simply return a new object instead, e.g. `return gr.Textbox(...)` instead of `return gr.Textbox.update(...)`.
  warnings.warn(
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 117, max_q = 117, max_k = 117
[TM][INFO] ------------------------- step = 120 -------------------------
[TM][INFO] ------------------------- step = 130 -------------------------
[TM][INFO] [Interrupt] slot = 0, id = 1
[TM][INFO] [forward] Request complete for 1, code 0
[TM][INFO] [forward] Enqueue requests
[TM][INFO] [forward] Wait for requests to complete ...
[TM][WARNING] [ProcessInferRequests] Request for 1 received.
[TM][INFO] [Forward] [0, 1), dc_bsz = 0, pf_bsz = 1, n_tok = 19, max_q = 19, max_k = 158
[TM][INFO] ------------------------- step = 160 -------------------------
[TM][INFO] ------------------------- step = 170 -------------------------
[TM][INFO] ------------------------- step = 180 -------------------------
[TM][INFO] ------------------------- step = 190 -------------------------
[TM][INFO] ------------------------- step = 200 -------------------------
[TM][INFO] ------------------------- step = 210 -------------------------
[TM][INFO] ------------------------- step = 220 -------------------------
[TM][INFO] ------------------------- step = 230 -------------------------
[TM][INFO] ------------------------- step = 240 -------------------------
[TM][INFO] ------------------------- step = 250 -------------------------
[TM][INFO] ------------------------- step = 260 -------------------------
[TM][INFO] ------------------------- step = 270 -------------------------
[TM][INFO] ------------------------- step = 280 -------------------------
[TM][INFO] ------------------------- step = 290 -------------------------
[TM][INFO] ------------------------- step = 300 -------------------------
[TM][INFO] ------------------------- step = 310 -------------------------
[TM][INFO] ------------------------- step = 320 -------------------------
[TM][INFO] ------------------------- step = 330 -------------------------
[TM][INFO] ------------------------- step = 340 -------------------------
[TM][INFO] ------------------------- step = 350 -------------------------
[TM][INFO] ------------------------- step = 360 -------------------------
[TM][INFO] ------------------------- step = 370 -------------------------
[TM][INFO] ------------------------- step = 380 -------------------------
[TM][INFO] [Interrupt] slot = 0, id = 1
[TM][INFO] [forward] Request complete for 1, code 0
^C
Keyboard interruption in main thread... closing server.

Port forwarding:

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p 33449

The result is shown in the screenshot below.

References

  1. InternLM/lmdeploy: LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
  2. Deploying the InternLM-20B model efficiently with a single RTX 3090 (仅需一块 3090 显卡,高效部署 InternLM-20B 模型) - Zhihu
  3. Quantization and Deployment with LMDeploy (LMDeploy 的量化和部署)