# rtp-llm

**Repository Path**: alibaba/rtp-llm

## Basic Information

- **Project Name**: rtp-llm
- **Description**: RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 3
- **Forks**: 0
- **Created**: 2024-10-31
- **Last Updated**: 2025-12-15

## Categories & Tags

**Categories**: ai

**Tags**: None

## README

[English](README.md) [中文](README_cn.md)

## Introduction

* rtp-llm is an LLM inference acceleration engine developed by Alibaba's large model inference team. It is widely used inside Alibaba and serves the LLM inference workloads of many business units, including Taobao, Tmall, Xianyu, Cainiao, Amap, Ele.me, AliExpress (AE), and Lazada.
* The rtp-llm project is a sub-project of [havenask](https://github.com/alibaba/havenask).

## Features

### Proven in production

Applied and validated in numerous LLM scenarios, including:

* Taobao Wenwen
* Alibaba's international AI platform [Aidge](https://aidc-ai.com/)
* [OpenSearch LLM Smart Q&A Edition](https://www.aliyun.com/activity/bigdata/opensearch/llmsearch)
* [Large Language Model based Long-tail Query Rewriting in Taobao Search](https://arxiv.org/abs/2311.03758)

### High performance

* High-performance CUDA kernels, including PagedAttention, FlashAttention, FlashDecoding, and more
* WeightOnly INT8 quantization, applied automatically at load time; WeightOnly INT4 quantization with [GPTQ](https://github.com/AutoGPTQ/AutoGPTQ) / [AWQ](https://github.com/casper-hansen/AutoAWQ) support
* Adaptive KVCache quantization
* Carefully optimized framework overhead for dynamic batching
* Dedicated optimizations for V100 GPUs

### Flexible and easy to use

* Seamless integration with HuggingFace; supports multiple weight formats such as SafeTensors, PyTorch, and Megatron
* Serves multiple LoRA adapters from a single model instance
* Multimodal input (mixed images and text)
* Multi-node / multi-GPU tensor parallelism
* P-tuning models

### Advanced acceleration

* Loading of irregular models produced by pruning
* Prefix Cache for multi-turn dialogue contexts
* System Prompt Cache
* Speculative Decoding

## Usage

### Requirements

* OS: Linux
* Python: 3.10
* NVIDIA GPU: Compute Capability 7.0 or higher (e.g., RTX20xx, RTX30xx, RTX40xx, V100, T4, A10/A30/A100, L4, H100, etc.)

### Installation and startup

#### Launching from Docker

See the [Docker image release history](docs/DockerHistory.md) for available image versions.

```bash
cd rtp-llm/docker
# IMAGE_NAME =
#   if cuda11: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:{version}_cuda11
#   if cuda12: registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:{version}_cuda12
sh ./create_container.sh
sh CONTAINER_NAME/sshme.sh

cd ../
# start the http service
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m rtp_llm.start_server

# send a request to the server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
```

#### Launching from the wheel package

```bash
# Install rtp-llm
cd rtp-llm
# For a cuda12 environment, use requirements_torch_gpu_cuda12.txt instead
pip3 install -r ./open_source/deps/requirements_torch_gpu.txt
# Use the whl that matches your release and CUDA version; the example below is a cuda11 build.
# For the cuda12 whl package, please check the release page.
pip3 install rtp_llm-0.1.9+cuda118-cp310-cp310-manylinux1_x86_64.whl

# start the http service
cd ../
TOKENIZER_PATH=/path/to/tokenizer CHECKPOINT_PATH=/path/to/model MODEL_TYPE=your_model_type FT_SERVER_TEST=1 python3 -m rtp_llm.start_server

# send a request to the server
curl -XPOST http://localhost:8088 -d '{"prompt": "hello, what is your name", "generate_config": {"max_new_tokens": 1000}}'
```
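Once the service is running (via either launch method above), requests can also be sent from Python instead of `curl`. The snippet below is a minimal sketch using the `requests` library: the endpoint (`http://localhost:8088`) and the JSON fields (`prompt`, `generate_config.max_new_tokens`) come from the curl example above, while the `generate` helper, timeout, and error handling are illustrative assumptions rather than part of rtp-llm's API.

```python
# Minimal sketch: query a running rtp-llm HTTP service from Python.
# Assumes the server was started as shown above and listens on localhost:8088;
# everything beyond the fields shown in the curl example is illustrative.
import json

import requests  # third-party dependency: pip3 install requests


def generate(prompt: str, max_new_tokens: int = 1000) -> dict:
    """Send one generation request and return the parsed JSON response."""
    payload = {
        "prompt": prompt,
        "generate_config": {"max_new_tokens": max_new_tokens},
    }
    resp = requests.post("http://localhost:8088", json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()


if __name__ == "__main__":
    result = generate("hello, what is your name")
    print(json.dumps(result, ensure_ascii=False, indent=2))
```

The full request schema and the OpenAI-compatible interface are covered by the [built-in request format](docs/Request.md) and the [serving example](docs/OpenAI-Tutorial.md) listed in the documentation below.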
#### Building from source and launching

See [Build from source](docs/Build.md).

### FAQ

1. libcufft.so

   **Error log**: `OSError: libcufft.so.11: cannot open shared object file: No such file or directory`

   **Resolution**: Check that the CUDA version and the rtp-llm version match.

2. libth_transformer.so

   **Error log**: `OSError: /rtp-llm/rtp_llm/libs/libth_transformer.so: cannot open shared object file: No such file or directory`

   **Resolution**: If rtp-llm was installed from the whl package or a development image, do not run tests from inside the rtp-llm source directory; otherwise Python will pick up the rtp-llm package from the relative path instead of the installed one.

3. Bazel build timeout

   **Error log**: `ERROR: no such package '@pip_gpu_cuda12_torch//': rules_python_external failed: (Timed out)`

   **Resolution**:

   1. Switch the pip index by adding an `extra_pip_args=["--index_url=xxx"]` entry in `open_source/deps/pip.bzl`.
   2. Install the Python dependencies manually, especially PyTorch, because Bazel's default 600-second timeout may not be enough to download it.

## Documentation

* [Docker image release history](docs/DockerHistory.md)
* [Serving example](docs/OpenAI-Tutorial.md)
* [RWKV-Runner example](docs/RWKV-Runner.md)
* [Python library example](docs/HF.md)
* [Using RTP-LLM on Aliyun ECS](https://zhuanlan.zhihu.com/p/679610919)
* [Configuration parameters](docs/Config.md)
* [Built-in request format](docs/Request.md)
* [Multi-GPU inference](docs/MultiGPU.md)
* [LoRA](docs/LoRA-Tutorial.md)
* [PTuning](docs/PTuning-Tutorial.md)
* [SystemPrompt](docs/SystemPrompt-Tutorial.md)
* [Embedding/Reranker model deployment](docs/Embedding.md)
* [Multi-turn conversation (KV cache reuse)](docs/ReuseKVCache-Tutorial.md)
* [Multimodality](docs/Multimodal-Tutorial.md)
* [Structured pruning](docs/Sparse-Tutorial.md)
* [Quantization](docs/Quantization.md)
* [Speculative decoding](docs/SpeculativeDecoding-Tutroial.md)
* [Multi-process frontend server mode](docs/MultiFrontendServer.md)
* [Roadmap](docs/Roadmap.md)
* [Contributing](docs/Contributing.md)
* [Benchmark & Performance](benchmark/README.md)

## Acknowledgements

Our project is primarily based on [FasterTransformer](https://github.com/NVIDIA/FasterTransformer), on top of which we have integrated some kernel implementations from [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). FasterTransformer and TensorRT-LLM give us reliable performance guarantees. [Flash-Attention2](https://github.com/Dao-AILab/flash-attention) and [cutlass](https://github.com/NVIDIA/cutlass) have also been a great help in our ongoing performance optimization. Our continuous batching and incremental decoding draw on the implementation in [vllm](https://github.com/vllm-project/vllm); sampling draws on [transformers](https://github.com/huggingface/transformers); and the multimodal part integrates implementations from [llava](https://github.com/haotian-liu/LLaVA) and [qwen-vl](https://github.com/QwenLM/Qwen-VL). We thank these projects for their inspiration and help.

## Supported models

### LLM

* Aquila and Aquila2 (BAAI/AquilaChat2-7B, BAAI/AquilaChat2-34B, BAAI/Aquila-7B, BAAI/AquilaChat-7B, etc.)
* Baichuan and Baichuan2 (baichuan-inc/Baichuan2-13B-Chat, baichuan-inc/Baichuan-7B)
* Bloom (bigscience/bloom, bigscience/bloomz)
* ChatGlm (THUDM/chatglm2-6b, THUDM/chatglm3-6b)
* Falcon (tiiuae/falcon-7b, tiiuae/falcon-40b, tiiuae/falcon-rw-7b, etc.)
* GptNeox (EleutherAI/gpt-neox-20b)
* GPT BigCode (bigcode/starcoder, bigcode/starcoder2, etc.)
* LLaMA and LLaMA-2 (meta-llama/Llama-2-7b, meta-llama/Llama-2-13b-hf, meta-llama/Llama-2-70b-hf, lmsys/vicuna-33b-v1.3, 01-ai/Yi-34B, xverse/XVERSE-13B, etc.)
* MPT (mosaicml/mpt-30b-chat, etc.)
* Phi (microsoft/phi-1_5, etc.)
* Qwen (Qwen/Qwen-7B, Qwen/Qwen-7B-Chat, Qwen/Qwen-14B, Qwen/Qwen-14B-Chat, Qwen1.5, etc.)
* InternLM (internlm/internlm-7b, internlm/internlm-chat-7b, etc.)
* Gemma (google/gemma-it, etc.)
* Mixtral (mistralai/Mixtral-8x7B-v0.1, etc.)

### LLM + Multimodal

* LLAVA (liuhaotian/llava-v1.5-13b, liuhaotian/llava-v1.5-7b)
* Qwen-VL (Qwen/Qwen-VL)

## Contact us

#### DingTalk group

#### WeChat group