Running Unsloth's Kimi K2 Thinking Locally: A Hands-On Guide

sbf_2000

Published 2025-11-10 07:30


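Unsloth worked with the Kimi team to fix a chat template issue in K2 Thinking that prevented the default system prompt "You are Kimi, an AI assistant created by Moonshot AI." from being added on the first turn of a conversation. An issue with custom jinja delimiters during llama.cpp tool calling was fixed as well.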

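You can now run Kimi K2 Thinking locally with Unsloth's Dynamic 1-bit GGUF. Unsloth compressed the 1T-parameter model to 245GB (a 62% reduction) while retaining about 85% of its accuracy, and it can run on 247GB of RAM. Unsloth also worked with the Kimi team on the system prompt fix.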


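Kimi-K2 and Kimi-K2-Thinking achieve SOTA performance on knowledge, reasoning, coding, and agentic tasks. Kimi's full 1T-parameter model requires 1.09TB of disk space (and at least 8 H200 GPUs), while the quantized Unsloth Dynamic 1.8-bit version cuts this to just 230GB (an 80% reduction in size): https://huggingface.co/unsloth/Kimi-K2-Instruct-GGUF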
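As a quick sanity check on the size figures quoted above, the reduction can be computed directly. This is only a minimal sketch: it assumes decimal units (1 TB = 1000 GB), and the helper name is made up for illustration.

```python
def size_reduction_percent(original_gb: float, quantized_gb: float) -> float:
    """Return how much smaller the quantized file is, as a percentage."""
    return (1.0 - quantized_gb / original_gb) * 100.0


# Figures quoted in the text: 1.09 TB for the full model,
# 230 GB for the Dynamic 1.8-bit quant.
# Assumption: 1 TB = 1000 GB.
full_model_gb = 1.09 * 1000
dynamic_1_8bit_gb = 230

reduction = size_reduction_percent(full_model_gb, dynamic_1_8bit_gb)
print(f"Reduction: {reduction:.1f}%")
```

Under that unit assumption the result is a little under 79%, which is in line with the rounded 80% figure.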

All uploaded files use Unsloth Dynamic 2.0 to achieve SOTA Aider Polyglot and 5-shot MMLU performance.

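Officially recommended settings for Kimi-K2-Thinking inference:

- Set temperature to 1.0 to reduce repetition and incoherence.

- Suggested context length = 98,304 (up to 256K).

- Note: different tools may require different settings.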
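One way to keep these recommendations in a single place is to hold them in a small settings object and turn them into llama.cpp command-line arguments. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes "256K" means 256 * 1024 tokens. The `--temp` and `--ctx-size` flags are the same ones used in the llama.cpp commands in this article.

```python
from dataclasses import dataclass


@dataclass
class SamplingSettings:
    temperature: float = 1.0      # recommended value from the text
    ctx_size: int = 98_304        # suggested context length from the text
    max_ctx_size: int = 262_144   # assumption: "256K" read as 256 * 1024 tokens


def to_llama_args(settings: SamplingSettings) -> list:
    """Translate the settings into llama.cpp style command-line arguments."""
    if settings.ctx_size > settings.max_ctx_size:
        raise ValueError(
            f"ctx_size {settings.ctx_size} exceeds the maximum of {settings.max_ctx_size}"
        )
    return [
        "--temp", str(settings.temperature),
        "--ctx-size", str(settings.ctx_size),
    ]


print(" ".join(to_llama_args(SamplingSettings())))
```

A smaller context, such as the 16384 used in the commands in this article, can be passed as `SamplingSettings(ctx_size=16384)` when memory is tight.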

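Running Kimi K2 Thinking in llama.cpp

You can now run the model with the latest llama.cpp:

(1) Get the latest llama.cpp from https://github.com/ggml-org/llama.cpp. You can also follow the build instructions below. If you have no GPU or only want CPU inference, change -DGGML_CUDA=ON to -DGGML_CUDA=OFF.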

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split llama-mtmd-cli
cp llama.cpp/build/bin/llama-* llama.cpp

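(2) To load the model directly with llama.cpp, run the command below; the :UD-TQ1_0 suffix is the quantization type. You can also download the model via Hugging Face (step 4). This works much like ollama run. Use export LLAMA_CACHE="folder" to make llama.cpp save to a specific location.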

export LLAMA_CACHE="unsloth/Kimi-K2-Thinking-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Kimi-K2-Thinking-GGUF:UD-TQ1_0 \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --min-p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU"

(3) The configuration above uses about 8GB of GPU memory. If you have about 360GB of combined GPU memory, remove -ot ".ffn_.*_exps.=CPU" for maximum speed.

(4) Download the model as follows (after running pip install huggingface_hub hf_transfer). The 2-bit dynamic quant UD-Q2_K_XL is recommended to balance size and accuracy. All versions: huggingface.co/unsloth/Kimi-K2-Thinking-GGUF

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # hf_transfer is sometimes rate-limited, so set this to 0 to disable it
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/Kimi-K2-Thinking-GGUF",
    local_dir = "unsloth/Kimi-K2-Thinking-GGUF",
    allow_patterns = ["*UD-TQ1_0*"],  # use "*UD-Q2_K_XL*" for Dynamic 2-bit (381GB)
)

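(5) Run any prompt.

(6) Edit --threads -1 to set the number of CPU threads (by default it uses the maximum number of CPU threads), --ctx-size 16384 for the context length, and --n-gpu-layers 99 for the number of layers offloaded to the GPU. Set it to 99 together with MoE CPU offloading for the best performance. If your GPU runs out of memory, try lowering it; if you are doing CPU-only inference, you can remove it.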

./llama.cpp/llama-cli \
    --model unsloth/Kimi-K2-Thinking-GGUF/UD-TQ1_0/Kimi-K2-Thinking-UD-TQ1_0-00001-of-00006.gguf \
    --n-gpu-layers 99 \
    --temp 1.0 \
    --min-p 0.01 \
    --ctx-size 16384 \
    --seed 3407 \
    -ot ".ffn_.*_exps.=CPU"
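The -ot ".ffn_.*_exps.=CPU" argument in the commands above pairs a regular expression with a target device, so tensors whose names match are kept on the CPU. The sketch below illustrates which names that pattern would select. It rests on assumptions: the tensor names are example names following the usual GGUF naming scheme rather than a dump of this model, and Python's `re.search` stands in for llama.cpp's own matching, so treat the output as illustrative.

```python
import re

# The override string from the commands above: "<regex>=<device>".
override = ".ffn_.*_exps.=CPU"
pattern, device = override.rsplit("=", 1)

# Hypothetical tensor names, assumed for illustration only.
example_tensors = [
    "blk.3.ffn_gate_exps.weight",   # MoE expert weights
    "blk.3.ffn_up_exps.weight",
    "blk.3.ffn_down_exps.weight",
    "blk.3.attn_q.weight",          # attention weights
    "blk.3.ffn_gate_shexp.weight",  # shared expert weights
]


def placement(name: str) -> str:
    """Return the override device if the pattern matches, else assume GPU offload."""
    return device if re.search(pattern, name) else "GPU"


for name in example_tensors:
    print(f"{name:30s} -> {placement(name)}")
```

In this sketch only the three `_exps` tensors land on the CPU. That matches the idea behind the setting: the large MoE expert weights stay in system RAM while the remaining layers are offloaded to the GPU.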

Deploying with llama-server and OpenAI's completion library

After installing llama.cpp as described above, you can start an OpenAI-compatible server with the following command:

./llama.cpp/llama-server \
    --model unsloth/Kimi-K2-Thinking-GGUF/UD-TQ1_0/Kimi-K2-Thinking-UD-TQ1_0-00001-of-00006.gguf \
    --alias "unsloth/Kimi-K2-Thinking" \
    --threads -1 \
    -fa on \
    --n-gpu-layers 999 \
    -ot ".ffn_.*_exps.=CPU" \
    --min-p 0.01 \
    --ctx-size 16384 \
    --port 8001 \
    --jinja

Then, after running pip install openai, use OpenAI's Python library:

from openai import OpenAI
openai_client = OpenAI(
    base_url = "http://127.0.0.1:8001/v1",
    api_key = "sk-no-key-required",
)
completion = openai_client.chat.completions.create(
    model = "unsloth/Kimi-K2-Thinking",
    messages = [{"role": "user", "content": "What is 2+2?"},],
)
print(completion.choices[0].message.content)
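Because llama-server exposes an OpenAI-style HTTP API, the same request can also be made without the openai package. The following is a minimal sketch using only the Python standard library. The function names are hypothetical. It passes temperature 1.0 explicitly on the assumption that the server was started without --temp, as in the command above, and it reuses the port and alias from that command.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8001/v1"   # port from the llama-server command
MODEL = "unsloth/Kimi-K2-Thinking"      # alias from the llama-server command


def build_chat_request(prompt: str, temperature: float = 1.0) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,  # recommended setting for Kimi-K2-Thinking
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to the local server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Requires the llama-server instance above to be running:
# print(ask("What is 2+2?"))
```

The generous timeout is a deliberate choice in this sketch, since a model of this size with experts offloaded to the CPU can take a while to answer.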

Unsloth's Kimi K2 tokenizer notes and bug fixes

November 7, 2025: Unsloth notified the Kimi team and fixed the issue where the default system prompt "You are Kimi, an AI assistant created by Moonshot AI." did not appear on the first user prompt. See https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/12

July 16, 2025: Kimi K2 updated its tokenizer to support multiple tool calls. See https://x.com/Kimi_Moonshot/status/1945050874067476962

July 18, 2025: Unsloth fixed the system prompt. Kimi also tweeted about the fix here: https://x.com/Kimi_Moonshot/status/1946130043446690030. The fix is also described here: https://huggingface.co/moonshotai/Kimi-K2-Instruct/discussions/28

Original article title: Kimi K2 Thinking: How to Run Locally

Link: https://docs.unsloth.ai/models/kimi-k2-thinking-how-to-run-locally#run-kimi-k2-thinking-in-llama.cpp

Reposted from AI帝國, author: 無影寺


Last modified 2025-11-10 07:30:49
