LLaMA Factory in Practice: Efficiently Fine-Tuning Llama 3 on Domestic DCUs

Published 2025-6-5 06:55


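I. Introduction

As large language models (LLMs) advance rapidly, fine-tuning a pretrained model efficiently for a specific domain or task has become a central concern in the industry. LLaMA Factory, a capable and approachable LLM fine-tuning framework, has attracted wide attention. This article walks through LoRA fine-tuning of Llama 3 with LLaMA Factory on a domestic DCU platform and shares the key steps and lessons learned.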

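A hands-on Hygon DCU project is here, to help you get started with large models and HPC development.

To make it easier for developers to work with large models (training, fine-tuning, inference) and scientific computing on Hygon DCUs, I have built a ready-to-use hands-on project, "dcu-in-action", with the support of the Hygon DCU developer community.

It aims to provide:

• Code examples and practical guides you can use right away

• A faster development and deployment workflow on Hygon DCUs

Developers are welcome to:

• Visit the project's GitHub repository to try it out, contribute, and help improve it: https://github.com/FlyAIBox/dcu-in-action

• Give the project a Star if you find it helpful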

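II. Environment Setup and LLaMA Factory Installation

The environment for this walkthrough is a domestic Hygon DCU K100-AI with DTK 25.04. The core software stack is Python 3.10 plus a DCU-optimized PyTorch build (torch==2.4.1+das.opt2.dtk2504) and its companion deep learning libraries (DCU-specific builds of lmslim, flash-attn, vllm, and deepspeed).

1. Create a virtual environment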

conda create -n dcu_llm_fine python=3.10
conda activate dcu_llm_fine

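2. Install the DCU-specific deep learning libraries

Following the documentation, download the whl packages for PyTorch, lmslim, flash-attn, vllm, and deepspeed built for the DCU K100-AI (DTK 25.04, Python 3.10) from the 光合 (Guanghe) developer community and install them. Make sure the component versions match each other exactly.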
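Before moving on, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch that uses only the Python standard library; the distribution names in the list are assumptions based on the packages named above and may differ from the names the DCU wheels register, so adjust them as needed.

```python
from importlib import metadata


def report_versions(packages):
    """Return {package: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; edit to match the wheels you installed.
    wanted = ["torch", "lmslim", "flash-attn", "vllm", "deepspeed"]
    for name, version in report_versions(wanted).items():
        print(f"{name:12s} {version or 'NOT INSTALLED'}")
```

For the stack described above, the torch line would be expected to show the 2.4.1+das.opt2.dtk2504 build.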

3. Install LLaMA Factory

git clone http://developer.hpccube.com/codes/OpenDAS/llama-factory.git
cd llama-factory
pip install -e ".[torch,metrics]"

Note: if you run into package conflicts, try pip install --no-deps -e . instead.

III. Llama 3 LoRA Fine-Tuning in Practice

We use the Meta-Llama-3-8B-Instruct model as the example and apply LoRA (Low-Rank Adaptation) for supervised fine-tuning (SFT).

1. Fine-tuning configuration file (llama3_lora_sft.yaml)

The core configuration parameters are:

### model
model_name_or_path: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct  # model path
trust_remote_code: true

### method
stage: sft                       # training stage: supervised fine-tuning
do_train: true
finetuning_type: lora            # fine-tuning method: LoRA
lora_rank: 8                     # LoRA rank
lora_target: all                 # LoRA targets: all linear layers

### dataset
dataset: identity,alpaca_en_demo # datasets to use
template: llama3                 # chat template
cutoff_len: 2048                 # sequence truncation length
max_samples: 1000                # maximum samples per dataset
overwrite_cache: true
preprocessing_num_workers: 16    # number of preprocessing processes

### output
output_dir: saves/llama3-8b/lora/sft  # output directory
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false           # save full checkpoints, not just the model weights

### train
per_device_train_batch_size: 1   # batch size per GPU
gradient_accumulation_steps: 8   # gradient accumulation steps
learning_rate: 1.0e-4            # learning rate
num_train_epochs: 3.0            # number of training epochs
lr_scheduler_type: cosine        # learning-rate scheduler
warmup_ratio: 0.1                # warmup ratio
bf16: true                       # use bf16 mixed precision
ddp_timeout: 180000000
resume_from_checkpoint: null
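To see how lr_scheduler_type: cosine and warmup_ratio: 0.1 shape the learning rate, here is a small sketch of a linear-warmup plus cosine-decay schedule. It assumes the warmup length is the ceiling of warmup_ratio times the total step count, and it uses 51 total optimization steps, the figure reported in the training log below. It illustrates the shape of the schedule and is not the framework's own code.

```python
import math


def cosine_lr(step, total_steps, base_lr, warmup_ratio):
    """Learning rate after `step` optimizer steps (linear warmup, then cosine decay)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: rounded up
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from base_lr down to 0.
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (10, 20, 30, 40, 50):
        print(step, f"{cosine_lr(step, 51, 1.0e-4, 0.1):.4e}")
```

Under these assumptions the values at steps 10 through 50 line up with the learning_rate figures in the training log, from about 9.8063e-05 at step 10 down to about 1.2179e-07 at step 50.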

2. Launch fine-tuning

llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml

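3. Key log output during fine-tuning, with commentary

Environment initialization and distributed setup (log time: 21:16:40 - 21:16:51)

• Setting ds_accelerator to cuda (auto detect)

• Initializing 8 distributed tasks at: 127.0.0.1:54447

• Each GPU process (e.g. [PG 0 Rank 2]) initializes NCCL; the log shows size: 8, global rank: 2, TIMEOUT(ms): 180000000000, which is the configured ddp_timeout of 180000000 seconds expressed in milliseconds.

• Each process reports its settings, for example Process rank: 2, world size: 8, device: cuda:2, distributed training: True, compute dtype: torch.bfloat16, confirming that bf16 mixed precision is enabled.

• Set ddp_find_unused_parameters to False in DDP training since LoRA is enabled.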

Tokenizer and model configuration loading (log time: 21:16:51 - 21:16:52)

• Loads tokenizer.json, tokenizer.model, and related files.

• Loads the model configuration file /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct/config.json, which confirms the model architecture, e.g. hidden_size: 4096, num_hidden_layers: 32, torch_dtype: "bfloat16".

Dataset loading and preprocessing (log time: 21:16:52 - 21:17:01)

• Loads the datasets identity.json (91 samples) and alpaca_en_demo.json (1000 samples).

• Converting format of dataset (num_proc=16) and Running tokenizer on dataset (num_proc=16), processing 1091 samples in total.

• Prints one processed training example, including input_ids, inputs (with the chat template applied), and label_ids (prompt tokens are set to -100 so that they are excluded from the loss).

Base model weight loading and LoRA adapter setup (log time: 21:17:01 - 21:17:16)

• KV cache is disabled during training.

• Loads the model weights via /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct/model.safetensors.index.json, 4 shards in total.

• A warning appears: Using the SDPA attention implementation on multi-gpu setup with ROCM may lead to performance issues due to the FA backend. Disabling it to use alternative backends.

• Gradient checkpointing enabled.

• Fine-tuning method: LoRA

• Found linear modules: v_proj,q_proj,k_proj,down_proj,o_proj,gate_proj,up_proj (the layers selected by lora_target: all).

• trainable params: 20,971,520 || all params: 8,051,232,768 || trainable%: 0.2605, which shows the number and share of trainable parameters that LoRA introduces.
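The trainable parameter count can be reproduced by hand. LoRA adds two low-rank matrices to each adapted linear layer, which contributes rank × (in_features + out_features) parameters per layer. The sketch below assumes the published Llama 3 8B dimensions: hidden_size 4096 and 32 layers appear in the log above, while the MLP width of 14336 and the key/value projection width of 1024 come from the model's public config.json and are not shown in this article.

```python
def lora_params(in_features, out_features, rank):
    """Parameters added by one LoRA pair: A is (rank x in), B is (out x rank)."""
    return rank * (in_features + out_features)


# Dimensions of Llama 3 8B (14336 and 1024 are assumptions from the public config).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS, RANK = 4096, 14336, 1024, 32, 8

# (in_features, out_features) of each linear module found by lora_target: all.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def total_lora_params(shapes, layers, rank):
    """Sum the LoRA parameters over all adapted modules in all decoder layers."""
    per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
    return per_layer * layers


if __name__ == "__main__":
    total = total_lora_params(SHAPES, LAYERS, RANK)
    print(f"{total:,}")  # 20,971,520
    print(round(100 * total / 8_051_232_768, 4))  # share of all params, in percent
```

Doubling lora_rank doubles this count, which makes the function a quick way to estimate the adapter size for other ranks.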

Trainer initialization and training loop (log time: 21:17:16 - 21:22:15)

• ***** Running training *****

• Num examples = 1,091, Num Epochs = 3

• Instantaneous batch size per device = 1, Total train batch size (w. parallel, distributed & accumulation) = 64

• Gradient Accumulation steps = 8, Total optimization steps = 51

• Training logs are printed periodically (logging_steps: 10, that is, every 10 optimization steps, each of which aggregates 8 gradient-accumulation micro-batches):

{'loss': 1.4091, 'grad_norm': 1.0385..., 'learning_rate': 9.8063...e-05, 'epoch': 0.58}

{'loss': 1.0404, 'grad_norm': 0.6730..., 'learning_rate': 7.7959...e-05, 'epoch': 1.17}

{'loss': 0.9658, 'grad_norm': 0.4174..., 'learning_rate': 4.4773...e-05, 'epoch': 1.75}

{'loss': 0.9389, 'grad_norm': 0.3942..., 'learning_rate': 1.4033...e-05, 'epoch': 2.34}

{'loss': 0.894, 'grad_norm': 0.4427..., 'learning_rate': 1.2179...e-07, 'epoch': 2.92}

• The following warning appears repeatedly during training: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:627.)
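The batch size, step count, and fractional epoch values in the trainer header can be cross-checked with a few lines of arithmetic. This is a sketch under assumptions: each of the 8 ranks receives ceil(1091 / 8) samples, and leftover micro-batches that do not fill a whole accumulation window are dropped when counting steps per epoch. It approximates the trainer's bookkeeping and is not its actual code.

```python
import math


def training_plan(num_examples, world_size, per_device_batch, grad_accum, epochs):
    """Reproduce the effective batch size and total optimizer steps."""
    per_rank_samples = math.ceil(num_examples / world_size)       # samples per GPU
    micro_batches = math.ceil(per_rank_samples / per_device_batch)  # per GPU, per epoch
    steps_per_epoch = max(micro_batches // grad_accum, 1)         # assumption: floor
    return {
        "total_batch_size": per_device_batch * world_size * grad_accum,
        "micro_batches_per_epoch": micro_batches,
        "total_steps": math.ceil(epochs * steps_per_epoch),
    }


def epoch_at_step(step, grad_accum, micro_batches_per_epoch):
    """Fractional epoch reached after `step` optimizer steps."""
    return step * grad_accum / micro_batches_per_epoch


if __name__ == "__main__":
    plan = training_plan(1091, world_size=8, per_device_batch=1, grad_accum=8, epochs=3)
    print(plan["total_batch_size"], plan["total_steps"])  # 64 51
    for step in (10, 20, 30, 40, 50, 51):
        print(step, round(epoch_at_step(step, 8, plan["micro_batches_per_epoch"]), 4))
```

Because 17 steps per epoch consume only 136 of the 137 micro-batches on each GPU, the run ends slightly short of three full epochs, which is consistent with the fractional epoch values in the log.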

Training completion and model saving (log time: 15:22:15 - 15:22:17)

• Saving model checkpoint to saves/llama3-8b/lora/sft/checkpoint-51

• Final training metrics (***** train metrics *****):

epoch = 2.9781

train_loss = 1.0481

train_runtime = 0:04:56.32 (i.e. 296.3281 seconds)

train_samples_per_second = 11.045

train_steps_per_second = 0.172

• Figure saved at: saves/llama3-8b/lora/sft/training_loss.png

• The NCCL communicators are shut down and each process cleans up its resources.
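The two throughput figures follow directly from the other numbers in the log. As a quick sanity check, the sketch below assumes that samples per second is computed as examples × epochs divided by runtime, and steps per second as total optimization steps divided by runtime.

```python
def throughput(num_examples, epochs, total_steps, runtime_s):
    """Return (samples per second, optimizer steps per second), rounded to 3 places."""
    samples_per_second = num_examples * epochs / runtime_s
    steps_per_second = total_steps / runtime_s
    return round(samples_per_second, 3), round(steps_per_second, 3)


if __name__ == "__main__":
    # 1091 examples, 3 epochs, 51 steps, 296.3281 s: all taken from the log above.
    print(throughput(1091, 3, 51, 296.3281))  # (11.045, 0.172)
```

Note that these figures cover all 8 devices together; dividing by 8 gives a rough per-device rate.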

IV. Model Inference Test

After fine-tuning completes, we load the LoRA adapter and run an inference test.

1. Inference configuration file (llama3_lora_sft.yaml for inference)

model_name_or_path: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft  # load the fine-tuned LoRA adapter
template: llama3
infer_backend: huggingface  # inference backend
trust_remote_code: true

2. Launch inference

llamafactory-cli chat examples/inference/llama3_lora_sft.yaml

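3. Key log output during inference and test results

Model loading (log time: 17:30:16 - 17:31:18)

• Loads the base model tokenizer, config (torch_dtype: "bfloat16", use_cache: true), and weights (model.safetensors.index.json, 4 shards).

• KV cache is enabled for faster generation.

• The SDPA on ROCm performance warning appears again.

• Loads the LoRA adapter: Loaded adapter(s): saves/llama3-8b/lora/sft.

• Merged 1 adapter(s)., confirming that the LoRA weights have been merged into the base model.

• Parameter count after loading: all params: 8,030,261,248. This equals the training-time total of 8,051,232,768 minus the 20,971,520 LoRA parameters, because the merged adapter no longer adds separate weights.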

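Interactive test result

• User:

你是誰 ("Who are you?")

A:

I am {{name}}, an AI assistant trained by {{author}}. I am here to help you, answer questions, and complete tasks. (translated from the Chinese reply)

Commentary: the {{name}} and {{author}} placeholders in the output show that the model learned the template format of the fine-tuning dataset identity.json. They also indicate that the placeholders were not replaced with a real name and author before training, so the model reproduces them literally.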
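If a concrete identity is wanted instead, the placeholders can be filled in before training. The following is a minimal sketch that assumes the placeholders appear verbatim as {{name}} and {{author}} inside the string fields of the dataset, as the output above suggests. The sample record, its field names, and the replacement values are hypothetical and only illustrate the idea.

```python
import json


def fill_placeholders(obj, values):
    """Recursively replace {{key}} tokens in every string of a JSON-like object."""
    if isinstance(obj, str):
        for key, value in values.items():
            obj = obj.replace("{{" + key + "}}", value)
        return obj
    if isinstance(obj, list):
        return [fill_placeholders(item, values) for item in obj]
    if isinstance(obj, dict):
        return {k: fill_placeholders(v, values) for k, v in obj.items()}
    return obj  # numbers, booleans, and None pass through unchanged


# Hypothetical record; the real dataset layout may differ.
sample = [
    {
        "instruction": "Who are you?",
        "input": "",
        "output": "I am {{name}}, an AI assistant developed by {{author}}.",
    }
]
# Hypothetical identity values.
filled = fill_placeholders(sample, {"name": "DCU-Bot", "author": "Example Lab"})

if __name__ == "__main__":
    print(json.dumps(filled, ensure_ascii=False, indent=2))
```

To apply this to the real dataset, load it with json.load, pass the result through fill_placeholders, and write it back with json.dump before launching the training run.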

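V. Model Export

This step merges the fine-tuned LoRA weights into the base model and exports the result as a standalone model.

1. Export configuration file (llama3_lora_sft.yaml for export)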

### Note: DO NOT use quantized model or quantization_bit when merging lora adapters

### model
model_name_or_path: /root/.cache/modelscope/hub/models/LLM-Research/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
trust_remote_code: true

### export
export_dir: output/llama3_lora_sft  # export directory
export_size: 5                      # maximum size of each model shard (GB)
export_device: cpu                  # device used during export
export_legacy_format: false         # do not use the legacy format; prefer safetensors

Important: as the note in the configuration file states, do not use a quantized model when merging LoRA adapters.

2. Launch the export

llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml

3. Key log output during export (log time: 18:06:54 - 18:08:22)

• Loads the base model tokenizer, config (torch_dtype: "bfloat16"), and weights (4 shards).

• Loads the LoRA adapter: Loaded adapter(s): saves/llama3-8b/lora/sft.

• Merged 1 adapter(s)., merging the LoRA weights into the base model.

• Convert model dtype to: torch.bfloat16.

• Configuration files saved: Configuration saved in output/llama3_lora_sft/config.json and output/llama3_lora_sft/generation_config.json.

• Model weights saved: The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 4 checkpoint shards. You can find where each parameters has been saved in the index located at output/llama3_lora_sft/model.safetensors.index.json. (per the export_size: 5 setting)

• Tokenizer files saved: tokenizer config file saved in output/llama3_lora_sft/tokenizer_config.json and special_tokens_map.json.

• Extra feature: Ollama modelfile saved in output/llama3_lora_sft/Modelfile.
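The shard count in the log can be roughly estimated from the parameter count. The sketch below assumes 2 bytes per parameter for bfloat16 and treats 1 GB as 10^9 bytes. It gives only a lower bound, since weights are split on whole-tensor boundaries and the real count can be higher.

```python
import math


def min_shards(num_params, bytes_per_param, shard_size_gb):
    """Lower bound on the number of shards needed for the given size limit."""
    total_bytes = num_params * bytes_per_param
    shard_bytes = shard_size_gb * 10**9  # assumption: 1 GB = 10^9 bytes
    return math.ceil(total_bytes / shard_bytes)


if __name__ == "__main__":
    # 8,030,261,248 parameters (from the inference log) stored as bfloat16.
    print(min_shards(8_030_261_248, 2, 5))  # 4
```

For this model the bound matches the 4 shards reported in the export log.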

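VI. Summary and Outlook

This walkthrough covered the full workflow of LoRA fine-tuning, inference, and export for Llama 3 with LLaMA Factory on a domestic DCU platform. With its clear configuration files and convenient command-line tools, LLaMA Factory substantially lowers the barrier to LLM fine-tuning. Reading the key log output and test results at each stage gives a more direct view of how the model learns during training, how it behaves at inference time, and how it is structured after export.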

Reposted from 螢火AI百寶箱. Author: 螢火AI百寶箱
