Machine Learning | Building a Large Model from Scratch: A Translation of llama3-from-scratch

Published 2025-2-19 12:48

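I have recently been reading a write-up on GitHub that implements llama3 from scratch. It looked useful for my series "Building a Large Model from Scratch", so I translated it, and found that much of it overlaps with topics the series has already covered.

Original: https://github.com/naklecha/llama3-from-scratch

The meta-llama/Meta-Llama-3-8B files are at: https://huggingface.co/meta-llama/Meta-Llama-3-8B/tree/main/original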

1. Tokenizer

The original code does not implement a tokenizer; it loads llama3's tokenizer.model instead. The code is as follows:

# Run: pip install blobfile
# Run: pip install tiktoken

from pathlib import Path
import tiktoken
from tiktoken.load import load_tiktoken_bpe

tokenizer_path = "Meta-Llama-3-8B/tokenizer.model"
special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",
            "<|end_header_id|>",
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # end of turn
        ] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",
    mergeable_ranks=mergeable_ranks,
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},
)

print(tokenizer.decode(tokenizer.encode("hello world!")))

## Output
hello world!

This uses byte pair encoding (BPE), the same approach as the tokenizer we trained earlier in this series.
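
To make the merge idea concrete, here is a minimal sketch of greedy BPE merging in plain Python. It is only an illustration: it works on characters rather than bytes, and the merge table `toy_ranks` is made up for this example, not taken from llama3's tokenizer.model.

```python
def bpe_encode(word, ranks):
    """Repeatedly merge the adjacent pair with the lowest rank until no pair is in the table."""
    parts = list(word)
    while len(parts) > 1:
        # Look up the rank of every adjacent pair; pairs missing from the table are skipped
        candidates = []
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None:
                candidates.append((rank, i))
        if not candidates:
            break
        _, i = min(candidates)  # lowest rank wins, ties broken by position
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


# Hypothetical merge table: a lower rank means the merge was learned earlier
toy_ranks = {"he": 0, "ll": 1, "hell": 2, "hello": 3}
pieces = bpe_encode("hello", toy_ranks)
print(pieces)
```

With this table, "hello" collapses into a single piece, while a word such as "help" only gets as far as "he", "l", "p" because no further merges apply.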

2. Reading the model file

Download the model files into the Meta-Llama-3-8B folder, then load them. The code is as follows:

import torch
import json
model = torch.load("Meta-Llama-3-8B/consolidated.00.pth")
print(json.dumps(list(model.keys())[:20], indent=4))

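with open("Meta-Llama-3-8B/params.json", "r") as f:
    config = json.load(f)
print(config)

# Unpack the hyperparameters that the later snippets refer to by name
dim, n_layers, n_heads, n_kv_heads = (config[key] for key in ("dim", "n_layers", "n_heads", "n_kv_heads"))
vocab_size, norm_eps, rope_theta = (config[key] for key in ("vocab_size", "norm_eps", "rope_theta"))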

## Output
[
    "tok_embeddings.weight",
    "layers.0.attention.wq.weight",
    "layers.0.attention.wk.weight",
    "layers.0.attention.wv.weight",
    "layers.0.attention.wo.weight",
    "layers.0.feed_forward.w1.weight",
    "layers.0.feed_forward.w3.weight",
    "layers.0.feed_forward.w2.weight",
    "layers.0.attention_norm.weight",
    "layers.0.ffn_norm.weight",
    "layers.1.attention.wq.weight",
    "layers.1.attention.wk.weight",
    "layers.1.attention.wv.weight",
    "layers.1.attention.wo.weight",
    "layers.1.feed_forward.w1.weight",
    "layers.1.feed_forward.w3.weight",
    "layers.1.feed_forward.w2.weight",
    "layers.1.attention_norm.weight",
    "layers.1.ffn_norm.weight",
    "layers.2.attention.wq.weight"
]

{
    'dim': 4096,
    'n_layers': 32,
    'n_heads': 32,
    'n_kv_heads': 8,
    'vocab_size': 128256,
    'multiple_of': 1024,
    'ffn_dim_multiplier': 1.3,
    'norm_eps': 1e-05,
    'rope_theta': 500000.0
}

Looking at the configuration printed above:

n_layers=32: the model has 32 Transformer layers
n_heads=32: each Transformer layer has 32 attention heads
vocab_size=128256: the vocabulary contains 128256 tokens
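
A few derived quantities follow directly from this configuration. The sketch below uses the printed values as a literal dict. The names `head_dim`, `heads_per_kv_group` and `ffn_hidden_dim` are my own, and the rounding rule for the feed-forward size is how I remember Meta's reference implementation, so treat that part as an assumption.

```python
# Values copied from the params.json output shown above
config = {
    "dim": 4096, "n_layers": 32, "n_heads": 32, "n_kv_heads": 8,
    "vocab_size": 128256, "multiple_of": 1024, "ffn_dim_multiplier": 1.3,
    "norm_eps": 1e-05, "rope_theta": 500000.0,
}

# Each attention head works on an equal slice of the model dimension
head_dim = config["dim"] // config["n_heads"]

# Number of query heads that share a single key/value head
heads_per_kv_group = config["n_heads"] // config["n_kv_heads"]

# Assumed rule: start from 4*dim, take 2/3, scale, then round up to a multiple of multiple_of
ffn_hidden_dim = int(2 * (4 * config["dim"]) / 3)
ffn_hidden_dim = int(config["ffn_dim_multiplier"] * ffn_hidden_dim)
ffn_hidden_dim = config["multiple_of"] * (
    (ffn_hidden_dim + config["multiple_of"] - 1) // config["multiple_of"]
)

print(head_dim, heads_per_kv_group, ffn_hidden_dim)
```

The first two values match the shapes that appear later: query vectors of length 128, and key/value weights shared across 4 heads.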

3. Converting text to tokens

We use tiktoken (OpenAI's library) as the tokenizer:

prompt = "the answer to the ultimate question of life, the universe, and everything is "
tokens = [128000] + tokenizer.encode(prompt)
print(tokens)
tokens = torch.tensor(tokens)
prompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]
print(prompt_split_as_tokens)

## Output
[128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']

Here 128000 is the token ID of <|begin_of_text|>. The special tokens also include:

"<|begin_of_text|>",
"<|end_of_text|>",
"<|reserved_special_token_0|>",
"<|reserved_special_token_1|>",
"<|reserved_special_token_2|>",
"<|reserved_special_token_3|>",
"<|start_header_id|>",
"<|end_header_id|>",
"<|reserved_special_token_4|>",
"<|eot_id|>"

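The IDs of these special tokens are simply appended after the ordinary BPE vocabulary. A minimal sketch of that numbering, assuming the base vocabulary has 128000 entries (which is what the ID 128000 for <|begin_of_text|> implies); `base_vocab_size` is a name I introduce here in place of `len(mergeable_ranks)`:

```python
special_tokens = [
    "<|begin_of_text|>",
    "<|end_of_text|>",
    "<|reserved_special_token_0|>",
    "<|reserved_special_token_1|>",
    "<|reserved_special_token_2|>",
    "<|reserved_special_token_3|>",
    "<|start_header_id|>",
    "<|end_header_id|>",
    "<|reserved_special_token_4|>",
    "<|eot_id|>",  # end of turn
] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]

base_vocab_size = 128000  # assumed size of the ordinary BPE vocabulary

# Special tokens are numbered consecutively after the ordinary vocabulary
special_token_ids = {
    token: base_vocab_size + i for i, token in enumerate(special_tokens)
}

print(special_token_ids["<|begin_of_text|>"], special_token_ids["<|eot_id|>"])
print(base_vocab_size + len(special_tokens))
```

The total comes out equal to the vocab_size in params.json, which is a useful sanity check on the assumption.
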
4. Converting tokens to embeddings

The tokens above are passed through the embedding layer, which turns the [17x1] token vector into [17x4096]: 17 embeddings (one per token), each of length 4096.

The code is as follows:

embedding_layer = torch.nn.Embedding(vocab_size, dim)
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)
token_embeddings_unnormalized.shape

## Output
torch.Size([17, 4096])

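5. Normalizing the embeddings with RMS norm

This step does not change the shape; it only normalizes the values. It needs norm_eps from the config so that we never divide by zero when the RMS happens to be 0. The code is as follows: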

def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
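
To see what this function does to a single vector, here is the same computation in plain Python on a toy 4-dimensional embedding. This is a sketch for intuition only; the values and unit weights are made up.

```python
import math


def rms_norm(vector, weights, eps=1e-5):
    """Divide by the root mean square of the vector, then scale element-wise by the weights."""
    mean_square = sum(x * x for x in vector) / len(vector)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [x * scale * w for x, w in zip(vector, weights)]


toy_embedding = [3.0, -4.0, 0.0, 0.0]  # made-up values
normalized = rms_norm(toy_embedding, [1.0, 1.0, 1.0, 1.0])

# After normalization (with unit weights) the root mean square is approximately 1
rms_after = math.sqrt(sum(x * x for x in normalized) / len(normalized))
print(normalized, rms_after)
```

The direction of the vector is unchanged; only its overall scale is fixed, and the learned weights then rescale each dimension.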

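6. Building the first Transformer layer

6.1. Normalization

Take the weights of the model's first layer (layers.0) and use them to normalize the embeddings: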

token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])
token_embeddings.shape

## Output
torch.Size([17, 4096])

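6.2. Attention

First load the attention heads of the first Transformer layer:

Loading wq, wk, wv and wo from the model gives shapes [4096x4096], [1024x4096], [1024x4096] and [4096x4096]
This looks odd at first, because ideally we would want separate q, k, v and o weights for each head
The authors of the code bundled them together because that makes it easier to parallelize the attention-head multiplications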

print(
    model["layers.0.attention.wq.weight"].shape,
    model["layers.0.attention.wk.weight"].shape,
    model["layers.0.attention.wv.weight"].shape,
    model["layers.0.attention.wo.weight"].shape
)

## Output
torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])

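6.3. Unwrapping the query

Unwrapping the query weights across the attention heads gives shape [32x128x4096], where 32 is the number of attention heads in llama3, 128 is the size of a query vector, and 4096 is the size of a token embedding. The code is as follows: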

q_layer0 = model["layers.0.attention.wq.weight"]
head_dim = q_layer0.shape[0] // n_heads
q_layer0 = q_layer0.view(n_heads, head_dim, dim)
q_layer0.shape

## Output
torch.Size([32, 128, 4096])

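6.4. Implementing the first head of the first layer

Access the first head of the first layer's query weight matrix. Its size is [128x4096]; print the shape to check: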

q_layer0_head0 = q_layer0[0]
q_layer0_head0.shape

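## Output
torch.Size([128, 4096])

6.5. Multiplying the query weights by the token embeddings

Multiplying the token embeddings by the query weights gives one query per token. The result has shape [17x128], because we have 17 tokens and each token gets a query of length 128. The code is as follows: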

q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)
q_per_token.shape

## Output
torch.Size([17, 128])

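6.6. Positional encoding

At this point every token has a query vector, but if you think about it, an individual query vector has no idea where its token sits in the prompt. For example:

query: "the answer to the ultimate question of life, the universe, and everything is "

The prompt uses "the" three times, and each "the" token needs a different query vector (each of length 128) depending on its position in the prompt. RoPE (rotary position embedding) does exactly this. For the details of RoPE, see the earlier article in this series or this video: https://www.youtube.com/watch?v=o29P0Kpobz0&t=530s.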

q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
q_per_token_split_into_pairs.shape

## Output
torch.Size([17, 64, 2])

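In the step above, the query vector is split into pairs, and a rotation is applied to each pair. We now have a tensor of size [17x64x2]: the 128-length query of each token in the prompt, split into 64 pairs. Each of the 64 pairs is rotated by m*theta, where m is the position of the token whose query we are rotating and theta is the frequency assigned to that pair.

6.7. Rotating vectors with complex-number multiplication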

zero_to_one_split_into_64_parts = torch.tensor(range(64))/64
zero_to_one_split_into_64_parts

## Output
tensor([0.0000, 0.0156, 0.0312, 0.0469, 0.0625, 0.0781, 0.0938, 0.1094, 0.1250,
        0.1406, 0.1562, 0.1719, 0.1875, 0.2031, 0.2188, 0.2344, 0.2500, 0.2656,
        0.2812, 0.2969, 0.3125, 0.3281, 0.3438, 0.3594, 0.3750, 0.3906, 0.4062,
        0.4219, 0.4375, 0.4531, 0.4688, 0.4844, 0.5000, 0.5156, 0.5312, 0.5469,
        0.5625, 0.5781, 0.5938, 0.6094, 0.6250, 0.6406, 0.6562, 0.6719, 0.6875,
        0.7031, 0.7188, 0.7344, 0.7500, 0.7656, 0.7812, 0.7969, 0.8125, 0.8281,
        0.8438, 0.8594, 0.8750, 0.8906, 0.9062, 0.9219, 0.9375, 0.9531, 0.9688,
        0.9844])

freqs = 1.0 / (rope_theta ** zero_to_one_split_into_64_parts)
freqs

## Output
tensor([1.0000e+00, 8.1462e-01, 6.6360e-01, 5.4058e-01, 4.4037e-01, 3.5873e-01,
        2.9223e-01, 2.3805e-01, 1.9392e-01, 1.5797e-01, 1.2869e-01, 1.0483e-01,
        8.5397e-02, 6.9566e-02, 5.6670e-02, 4.6164e-02, 3.7606e-02, 3.0635e-02,
        2.4955e-02, 2.0329e-02, 1.6560e-02, 1.3490e-02, 1.0990e-02, 8.9523e-03,
        7.2927e-03, 5.9407e-03, 4.8394e-03, 3.9423e-03, 3.2114e-03, 2.6161e-03,
        2.1311e-03, 1.7360e-03, 1.4142e-03, 1.1520e-03, 9.3847e-04, 7.6450e-04,
        6.2277e-04, 5.0732e-04, 4.1327e-04, 3.3666e-04, 2.7425e-04, 2.2341e-04,
        1.8199e-04, 1.4825e-04, 1.2077e-04, 9.8381e-05, 8.0143e-05, 6.5286e-05,
        5.3183e-05, 4.3324e-05, 3.5292e-05, 2.8750e-05, 2.3420e-05, 1.9078e-05,
        1.5542e-05, 1.2660e-05, 1.0313e-05, 8.4015e-06, 6.8440e-06, 5.5752e-06,
        4.5417e-06, 3.6997e-06, 3.0139e-06, 2.4551e-06])
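
The frequency table above can be reproduced without torch, which makes the formula easier to see: pair i gets frequency 1 / rope_theta^(i/64). A minimal sketch in plain Python, using rope_theta from params.json; `num_pairs` is my own name for the 64 pairs per head.

```python
rope_theta = 500000.0  # from params.json
num_pairs = 64         # a 128-dim head is split into 64 (x, y) pairs

# Pair 0 rotates fastest (1 radian per position); later pairs rotate ever more slowly
freqs = [1.0 / (rope_theta ** (i / num_pairs)) for i in range(num_pairs)]

print(freqs[0], freqs[1], freqs[32])
```

The wide spread of frequencies lets the fast pairs distinguish neighbouring positions while the slow pairs barely change across the whole prompt.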

freqs_for_each_token = torch.outer(torch.arange(17), freqs)
freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)
freqs_cis.shape

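## Output
torch.Size([17, 64])

6.8. We now have a complex number (the angle-change vector) for each query element of every token

We can convert the query (already split into pairs) into complex numbers and then multiply element-wise by freqs_cis to rotate each query according to its position. The code is as follows: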

q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
q_per_token_as_complex_numbers.shape

## Output
torch.Size([17, 64])

q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis
q_per_token_as_complex_numbers_rotated.shape

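## Output
torch.Size([17, 64])

6.9. Getting the rotated vector

Viewing the complex numbers as real numbers again gives us back the query as pairs. The code is as follows: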

q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)
q_per_token_split_into_pairs_rotated.shape

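## Output
torch.Size([17, 64, 2])

The rotated pairs are now merged, giving a new query vector (the rotated query vector) of shape [17x128], where 17 is the number of tokens and 128 is the dimension of the query vector.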

q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)
q_per_token_rotated.shape

## Output
torch.Size([17, 128])
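
Why rotate at all? Once queries and keys are rotated this way, their dot product depends only on the distance between the two positions, not on where they sit in the prompt. Here is a small sketch of that property using Python's built-in complex numbers. The 4-dimensional vectors and the two frequencies are made up for illustration.

```python
import cmath


def rotate_pairs(vector, position, freqs):
    """Treat consecutive (x, y) pairs as complex numbers and rotate pair i by position * freqs[i]."""
    rotated = []
    for i, freq in enumerate(freqs):
        pair = complex(vector[2 * i], vector[2 * i + 1])
        pair *= cmath.rect(1.0, position * freq)  # unit-length complex number at that angle
        rotated.extend([pair.real, pair.imag])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


toy_freqs = [1.0, 0.1]      # hypothetical frequencies for two pairs
q = [1.0, 2.0, 3.0, 4.0]    # made-up query
k = [0.5, -1.0, 2.0, 0.0]   # made-up key

# Both cases have the query two positions after the key
score_near_start = dot(rotate_pairs(q, 3, toy_freqs), rotate_pairs(k, 1, toy_freqs))
score_shifted = dot(rotate_pairs(q, 10, toy_freqs), rotate_pairs(k, 8, toy_freqs))
print(score_near_start, score_shifted)
```

The two scores agree up to floating-point error, and the rotation leaves the length of each vector unchanged.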

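6.10. Keys (handled almost the same way as queries)

I'm too lazy to go through the math for keys, so the only things to remember are:

Keys also produce key vectors of dimension 128
Keys have only 1/4 as many weights as queries, because each key head's weights are shared across 4 query heads at a time to reduce the amount of computation
Keys are also rotated to add positional information, just like queries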

k_layer0 = model["layers.0.attention.wk.weight"]
k_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0] // n_kv_heads, dim)
k_layer0.shape

## Output
torch.Size([8, 128, 4096])

k_layer0_head0 = k_layer0[0]
k_layer0_head0.shape

## Output
torch.Size([128, 4096])

k_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)
k_per_token.shape

## Output
torch.Size([17, 128])

k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
k_per_token_split_into_pairs.shape

## Output
torch.Size([17, 64, 2])

k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
k_per_token_as_complex_numbers.shape

## Output
torch.Size([17, 64])

k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)
k_per_token_split_into_pairs_rotated.shape

## Output
torch.Size([17, 64, 2])

k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)
k_per_token_rotated.shape

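## Output
torch.Size([17, 128])

After this step, the rotated queries and the rotated keys each have shape [17x128].

6.11. Multiplying the queries and keys

This gives a score that maps every token to every other token. The score describes how strongly each token's query relates to each token's key; this is self-attention. The attention score matrix (qk_per_token) has shape [17x17], where 17 is the number of tokens in the prompt.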

qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)/(head_dim)**0.5
qk_per_token.shape

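## Output
torch.Size([17, 17])

7. Masking the QK scores

7.1. Masking future tokens

During llama3 training, the qk scores of future tokens are masked. Why? Because during training the model only learns to predict a token from past tokens. As a result, during inference we mask the future tokens as well: their scores are set to -inf, so they end up with zero weight after softmax.

Running the following code plots the scores as a heatmap: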

import matplotlib.pyplot as plt

def display_qk_heatmap(qk_per_token):
    _, ax = plt.subplots()
    im = ax.imshow(qk_per_token.to(float).detach(), cmap='viridis')
    ax.set_xticks(range(len(prompt_split_as_tokens)))
    ax.set_yticks(range(len(prompt_split_as_tokens)))
    ax.set_xticklabels(prompt_split_as_tokens)
    ax.set_yticklabels(prompt_split_as_tokens)
    ax.figure.colorbar(im, ax=ax)
    
display_qk_heatmap(qk_per_token)

mask = torch.full((len(tokens), len(tokens)), float("-inf"), device=tokens.device)
mask = torch.triu(mask, diagonal=1)
mask

## Output
tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

qk_per_token_after_masking = qk_per_token + mask
display_qk_heatmap(qk_per_token_after_masking)

qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)
display_qk_heatmap(qk_per_token_after_masking_after_softmax)
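
The combined effect of the mask and the softmax is easy to check by hand on a tiny example. The sketch below does both steps in plain Python for a made-up 3x3 score matrix; it is meant to show the behaviour, not to replace the torch code above.

```python
import math


def causal_softmax(scores):
    """Row-wise softmax in which token i may only attend to tokens j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf, exactly like adding the upper-triangular mask
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        row_max = max(masked)  # subtracting the max keeps exp() numerically stable
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


toy_scores = [  # made-up qk scores for 3 tokens
    [0.5, 2.0, -1.0],
    [1.0, 1.0, 3.0],
    [0.0, 0.0, 0.0],
]
attention_weights = causal_softmax(toy_scores)
for row in attention_weights:
    print(row)
```

Every row still sums to 1, the first token can only attend to itself, and large scores in the masked region (such as the 3.0 in the second row) have no influence at all.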

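7.2. Values (almost the end of attention)

The attention scores (between 0 and 1 after softmax) determine how much of each token's value vector is used. Like the keys, the value weights are shared across every 4 attention heads, so the value weight matrix has shape [8x128x4096].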

v_layer0 = model["layers.0.attention.wv.weight"]
v_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] // n_kv_heads, dim)
v_layer0.shape

## Output
torch.Size([8, 128, 4096])

The value weight matrix of the first head of the first layer is:

v_layer0_head0 = v_layer0[0]
v_layer0_head0.shape

## Output
torch.Size([128, 4096])

7.3. Value vectors

We now use the value weights to get a value vector for each token. The result has size [17x128], where 17 is the number of tokens in the prompt and 128 is the size of each token's value vector.

v_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)
v_per_token.shape

## Output
torch.Size([17, 128])

7.4. The final attention output

Multiplying the attention weights by the per-token values gives an attention vector of shape [17x128]:

qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
qkv_attention.shape

## Output
torch.Size([17, 128])
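
In other words, each token's attention output is a weighted average of the value vectors it is allowed to see. A toy sketch in plain Python with made-up weights and 2-dimensional values:

```python
def attend(weights, values):
    """Each output row is the weighted sum of the value rows."""
    value_dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, values)) for d in range(value_dim)]
        for row in weights
    ]


toy_weights = [  # rows of a masked softmax: each row sums to 1
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.25, 0.25, 0.5],
]
toy_values = [  # one made-up 2-dim value vector per token
    [1.0, 0.0],
    [3.0, 2.0],
    [5.0, 4.0],
]
attention_output = attend(toy_weights, toy_values)
print(attention_output)
```

The first token simply gets its own value vector back, since it has nothing earlier to attend to.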

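8. Multi-head attention

We now have the attention output for the first head of the first layer. All that remains is to repeat this for each of the n_heads heads, running exactly the same math as in the cells above for every head in the first layer.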

qkv_attention_store = []

for head in range(n_heads):
    q_layer0_head = q_layer0[head]
    k_layer0_head = k_layer0[head//4] # key weights are shared across 4 heads
    v_layer0_head = v_layer0[head//4] # value weights are shared across 4 heads
    q_per_token = torch.matmul(token_embeddings, q_layer0_head.T)
    k_per_token = torch.matmul(token_embeddings, k_layer0_head.T)
    v_per_token = torch.matmul(token_embeddings, v_layer0_head.T)

    q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
    q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
    q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis[:len(tokens)])
    q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)

    k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
    k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
    k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis[:len(tokens)])
    k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)

    qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)/(128)**0.5
    mask = torch.full((len(tokens), len(tokens)), float("-inf"), device=tokens.device)
    mask = torch.triu(mask, diagonal=1)
    qk_per_token_after_masking = qk_per_token + mask
    qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)
    qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
    qkv_attention_store.append(qkv_attention)

len(qkv_attention_store)

## Output
32
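
The `head//4` indexing in the loop is what implements the key/value sharing. A small sketch of the mapping, with the group size derived from the config instead of hard-coded; `kv_head_for` is a helper name I made up:

```python
n_heads = 32     # from params.json
n_kv_heads = 8   # from params.json
heads_per_kv_group = n_heads // n_kv_heads  # 4 query heads share one key/value head


def kv_head_for(query_head):
    """Index of the key/value head used by a given query head."""
    return query_head // heads_per_kv_group


kv_head_assignment = [kv_head_for(head) for head in range(n_heads)]
print(kv_head_assignment)
```

Query heads 0 to 3 use key/value head 0, heads 4 to 7 use head 1, and so on up to key/value head 7.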

We now have the qkv_attention matrix for all 32 heads of the first layer. Next, merge all the per-head attention outputs into one large matrix of size [17x4096]:

stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
stacked_qkv_attention.shape

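## Output
torch.Size([17, 4096])

9. The final step: the output weight matrix

One of the last things to do for the layer-0 attention is to multiply by the output weight matrix: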

w_layer0 = model["layers.0.attention.wo.weight"]
w_layer0.shape

## Output
torch.Size([4096, 4096])

9.1. It's a simple linear layer, so all we need is a matmul

embedding_delta = torch.matmul(stacked_qkv_attention, w_layer0.T)
embedding_delta.shape

## Output
torch.Size([17, 4096])

We now have the change in the embedding values after attention, which is added to the original token embeddings:

embedding_after_edit = token_embeddings_unnormalized + embedding_delta
embedding_after_edit.shape

## Output
torch.Size([17, 4096])

9.2. Normalize, then run a feed-forward network on the embedding delta

embedding_after_edit_normalized = rms_norm(embedding_after_edit, model["layers.0.ffn_norm.weight"])
embedding_after_edit_normalized.shape

## Output
torch.Size([17, 4096])

9.3. Loading the FFN weights and implementing the feed-forward network

llama3 uses a SwiGLU feed-forward network. This architecture is good at adding non-linearity when the model needs it, and it is now fairly standard in LLMs.

w1 = model["layers.0.feed_forward.w1.weight"]
w2 = model["layers.0.feed_forward.w2.weight"]
w3 = model["layers.0.feed_forward.w3.weight"]
output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
output_after_feedforward.shape

## Output
torch.Size([17, 4096])
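
The one-line expression above packs three matrix multiplications together. Unrolled in plain Python for a single toy vector, it looks roughly like this. The 2-dimensional input and 3-dimensional hidden layer are made-up sizes, and the weights are arbitrary.

```python
import math


def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a [rows x cols] matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w1, w2, w3):
    gate = [silu(g) for g in matvec(w1, x)]     # gated branch: silu(x @ w1.T)
    up = matvec(w3, x)                          # linear branch: x @ w3.T
    hidden = [g * u for g, u in zip(gate, up)]  # element-wise product of the two branches
    return matvec(w2, hidden)                   # project back down: hidden @ w2.T


# Arbitrary toy weights: w1 and w3 are [3 x 2], w2 is [2 x 3]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w3 = [[1.0, 1.0], [1.0, -1.0], [0.5, 0.5]]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

ffn_output = swiglu_ffn([1.0, 2.0], w1, w2, w3)
print(ffn_output)
```

The output has the same length as the input, which is what lets the result be added back onto the embedding in the next step.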

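10. After the first layer, we finally have a new embedding for every token

Only 31 more layers to go before we are done (one for loop away). You can think of this edited embedding as carrying information about all the queries asked in the first layer. Each subsequent layer encodes more and more complex queries over the questions asked so far, until we end up with an embedding that knows everything it needs about the next token.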

layer_0_embedding = embedding_after_edit+output_after_feedforward
layer_0_embedding.shape

## Output
torch.Size([17, 4096])

The for loop runs the same logic for every layer and finally yields final_embedding:

final_embedding = token_embeddings_unnormalized
for layer in range(n_layers):
    qkv_attention_store = []
    layer_embedding_norm = rms_norm(final_embedding, model[f"layers.{layer}.attention_norm.weight"])
    q_layer = model[f"layers.{layer}.attention.wq.weight"]
    q_layer = q_layer.view(n_heads, q_layer.shape[0] // n_heads, dim)
    k_layer = model[f"layers.{layer}.attention.wk.weight"]
    k_layer = k_layer.view(n_kv_heads, k_layer.shape[0] // n_kv_heads, dim)
    v_layer = model[f"layers.{layer}.attention.wv.weight"]
    v_layer = v_layer.view(n_kv_heads, v_layer.shape[0] // n_kv_heads, dim)
    w_layer = model[f"layers.{layer}.attention.wo.weight"]
    for head in range(n_heads):
        q_layer_head = q_layer[head]
        k_layer_head = k_layer[head//4]
        v_layer_head = v_layer[head//4]
        q_per_token = torch.matmul(layer_embedding_norm, q_layer_head.T)
        k_per_token = torch.matmul(layer_embedding_norm, k_layer_head.T)
        v_per_token = torch.matmul(layer_embedding_norm, v_layer_head.T)
        q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
        q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)
        q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis)
        q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)
        k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)
        k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)
        k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)
        k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)
        qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)/(128)**0.5
        mask = torch.full((len(token_embeddings_unnormalized), len(token_embeddings_unnormalized)), float("-inf"))
        mask = torch.triu(mask, diagonal=1)
        qk_per_token_after_masking = qk_per_token + mask
        qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)
        qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)
        qkv_attention_store.append(qkv_attention)

    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)
    w_layer = model[f"layers.{layer}.attention.wo.weight"]
    embedding_delta = torch.matmul(stacked_qkv_attention, w_layer.T)
    embedding_after_edit = final_embedding + embedding_delta
    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f"layers.{layer}.ffn_norm.weight"])
    w1 = model[f"layers.{layer}.feed_forward.w1.weight"]
    w2 = model[f"layers.{layer}.feed_forward.w2.weight"]
    w3 = model[f"layers.{layer}.feed_forward.w3.weight"]
    output_after_feedforward = torch.matmul(torch.nn.functional.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
    final_embedding = embedding_after_edit+output_after_feedforward
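
Stripped of the per-head details, every layer in the loop above follows the same pattern: normalize, apply attention, add the result back, then normalize, apply the feed-forward network, and add that back too. The sketch below shows only that control flow. The three stand-in functions are hypothetical placeholders so it can run without model weights, and unlike the real model it uses one norm for both steps.

```python
def transformer_block(x, norm, attention, feed_forward):
    """Pre-norm residual block: x + attention(norm(x)), then the same pattern for the FFN."""
    after_attention = [a + b for a, b in zip(x, attention(norm(x)))]
    ffn_out = feed_forward(norm(after_attention))
    return [a + b for a, b in zip(after_attention, ffn_out)]


# Hypothetical stand-ins; the real versions are rms_norm, multi-head attention and SwiGLU
def identity_norm(vector):
    return list(vector)


def toy_attention(vector):
    return [0.5 * value for value in vector]


def toy_feed_forward(vector):
    return [1.0 for _ in vector]


hidden = [2.0, 4.0]
for _ in range(2):  # two toy layers instead of 32
    hidden = transformer_block(hidden, identity_norm, toy_attention, toy_feed_forward)
print(hidden)
```

Because each sub-layer only adds to the running embedding, the original token information is never overwritten, just progressively edited.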

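11. We have final_embedding, the model's best guess at the next token

The embedding has the same shape as the token embeddings, [17x4096], where 17 is the number of tokens and 4096 is the embedding dimension.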

final_embedding = rms_norm(final_embedding, model["norm.weight"])
final_embedding.shape

## Output
torch.Size([17, 4096])

12. Finally, decoding the embedding into a token

We use the output decoder to convert the final embedding into a token:

model["output.weight"].shape

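## Output
torch.Size([128256, 4096])

We hope the answer to the prompt "the answer to the ultimate question of life, the universe, and everything is " will be 42. We use the embedding of the last token to predict the next one; run the following code: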

logits = torch.matmul(final_embedding[-1], model["output.weight"].T)
logits.shape

## Output
torch.Size([128256])

The code to predict the next token:

next_token = torch.argmax(logits, dim=-1)
next_token

## Output
tensor(2983)
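
Taking the argmax of the logits is greedy decoding: pick the single highest-scoring entry in the vocabulary. A toy sketch in plain Python, with a made-up 4-entry vocabulary and made-up logits standing in for the real 128256 values:

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    best_index = 0
    for i, value in enumerate(values):
        if value > values[best_index]:
            best_index = i
    return best_index


toy_vocab = ["41", "42", "43", "the"]  # hypothetical vocabulary
toy_logits = [0.3, 4.2, 1.1, -0.5]     # hypothetical scores, one per vocabulary entry

next_token_id = argmax(toy_logits)
next_token_text = toy_vocab[next_token_id]
print(next_token_id, next_token_text)
```

In the real model, the index returned is a token ID, which the tokenizer then decodes back into text.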

The final output:

tokenizer.decode([next_token.item()])

## Output
42

I also cleaned up the code and collected it here: https://github.com/linkxzhou/mylib/blob/master/llm/0-llama3-from-scratch.py

References

(1) https://github.com/naklecha/llama3-from-scratch

Reposted from 周末程序猿; author: 周末程序猿

Last modified 2025-2-19 17:58:59
