
Machine Learning | Building a Large Model from Scratch: Model Pretraining

During training, you typically call scaler.scale(loss).backward() to compute gradients of the scaled loss, then scaler.step(optimizer) to update the model parameters, and finally scaler.update() to adjust the scaling factor; this keeps training both stable and efficient.

1. Parameter initialization

The configuration template used for initialization:

from transformers import PretrainedConfig

class MyPretrainConfig(PretrainedConfig):
    model_type = "myllm"

    def __init__(
            self,
            dim: int = 512,
            n_layers: int = 8,
            n_heads: int = 16,
            n_kv_heads: int = 8,
            vocab_size: int = 6400,
            hidden_dim: int = None,
            multiple_of: int = 64,
            norm_eps: float = 1e-5,
            max_seq_len: int = 512,
            dropout: float = 0.0,
            flash_attn: bool = True,
            use_moe: bool = False,
            num_experts_per_tok=2,
            n_routed_experts=4,
            n_shared_experts: bool = True,
            scoring_func='softmax',
            aux_loss_alpha=0.01,
            seq_aux=True,
            norm_topk_prob=True,
            **kwargs,
    ):
        self.dim = dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.multiple_of = multiple_of
        self.norm_eps = norm_eps
        self.max_seq_len = max_seq_len
        self.dropout = dropout
        self.flash_attn = flash_attn
        self.num_experts_per_tok = num_experts_per_tok  # number of experts selected per token
        self.n_routed_experts = n_routed_experts        # total number of routed experts
        self.n_shared_experts = n_shared_experts        # shared experts
        self.scoring_func = scoring_func                # scoring function, 'softmax' by default
        self.aux_loss_alpha = aux_loss_alpha            # alpha coefficient of the auxiliary loss
        self.seq_aux = seq_aux                          # whether to compute the auxiliary loss at the sequence level
        self.norm_topk_prob = norm_topk_prob            # whether to normalize the top-k probabilities
        super().__init__(**kwargs)

This relies on PretrainedConfig from the transformers library. The MyPretrainConfig parameters are:

  • dim: int = 512: model (embedding) dimension, default 512
  • n_layers: int = 8: number of Transformer layers, default 8
  • n_heads: int = 16: number of attention heads, default 16
  • n_kv_heads: int = 8: number of key/value heads, default 8
  • vocab_size: int = 6400: vocabulary size, default 6400
  • hidden_dim: int = None: hidden dimension of the feed-forward layer, default None (when unset, FeedForward derives it from dim)
  • multiple_of: int = 64: the derived feed-forward hidden dimension is rounded up to a multiple of this value, default 64
  • norm_eps: float = 1e-5: epsilon used by the normalization layers, default 1e-5
  • max_seq_len: int = 512: maximum sequence length, default 512
  • dropout: float = 0.0: dropout probability, default 0.0
  • flash_attn: bool = True: whether to use Flash Attention, default True
  • num_experts_per_tok=2: number of experts selected per token, default 2
  • n_routed_experts=4: total number of routed experts, default 4
  • n_shared_experts: bool = True: whether to use shared experts, default True
  • scoring_func='softmax': scoring function, default 'softmax'
  • aux_loss_alpha=0.01: alpha coefficient of the auxiliary loss, default 0.01
  • seq_aux=True: whether to compute the auxiliary loss at the sequence level, default True
  • norm_topk_prob=True: whether to normalize the top-k probabilities, default True
  • **kwargs: any extra keyword arguments, forwarded to the parent class constructor

PretrainedConfig provides the template for a model's pretraining configuration. Since every model is different, the configuration is usually serialized to a file and published together with the model.
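
Because MyPretrainConfig inherits from PretrainedConfig, it can be serialized and reloaded with the standard transformers API. A minimal sketch using the class defined above (the ./my_llm directory is just an illustrative path, not from the original article):

config = MyPretrainConfig(dim=512, n_layers=8, max_seq_len=512)
config.save_pretrained("./my_llm")              # writes ./my_llm/config.json
reloaded = MyPretrainConfig.from_pretrained("./my_llm")
print(reloaded.dim, reloaded.n_layers)          # 512 8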

2. Loading the preprocessed data

Load the preprocessed data produced in the previous article:

data_path_list = [f'./pretrain_data.bin']
train_ds = PretrainDataset(data_path_list, max_length=max_seq_len, memmap=True)
train_sampler = None
num_workers = 16  # adjust to the number of CPU cores on your machine
train_loader = DataLoader(
    train_ds,
    batch_size=batch_size,
    pin_memory=True,
    drop_last=False,
    shuffle=False,
    num_workers=num_workers,
    sampler=train_sampler
)

PretrainDataset does the actual loading; its main job is to map the data into memory so the DataLoader can fetch samples from it:

class PretrainDataset(Dataset):
    def __init__(self, data_path_lst, max_length=512, memmap=False):
        super().__init__()
        if memmap:
            with open(data_path_lst[0], 'r') as f:
                nbytes = f.seek(0, 2)
                flen = f.tell() // np.dtype('uint16').itemsize
            self.data = np.memmap(data_path_lst[0], dtype=np.dtype('uint16'), shape=(flen // max_length, max_length))
        else:
            data_lst = []
            for data_path in data_path_lst:
                with open(data_path, 'rb') as f:
                    data = np.fromfile(f, dtype=np.uint16)
                    data_lst.append(data)
            data = np.concatenate(data_lst)
            data = data[:max_length * int(len(data) / max_length)]
            self.data = data.reshape(-1, max_length)
        print("memmap:{} train data.shape:{}".format(memmap, self.data.shape))
        print("downloading finished.....")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index: int):
        sample = self.data[index]
        X = np.array(sample[:-1]).astype(np.int64)
        Y = np.array(sample[1:]).astype(np.int64)

        return torch.from_numpy(X), torch.from_numpy(Y)

Here Dataset is the standard base class from torch.utils.data; in __getitem__, X is the input sequence and Y is the same sequence shifted by one token, i.e. the next-token targets.
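
As a quick sanity check (not part of the original script), you can pull one batch from the train_loader defined above and verify the shapes; with max_seq_len=512 each sample yields 511 input tokens and 511 shifted targets:

X, Y = next(iter(train_loader))
print(X.shape, Y.shape, X.dtype)  # torch.Size([8, 511]) torch.Size([8, 511]) torch.int64, with batch_size=8
# Y[i, t] is the token that should follow X[i, t], i.e. the next-token prediction target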

3. Initializing the model

The model initialization borrows from the llama2.c code (https://github.com/karpathy/llama2.c/blob/master/model.py) and uses only the decoder stage of the Transformer, i.e. a Decoder-Only architecture. The main logic is:

  • __init__: create tok_embeddings, dropout, the stack of layers, the output head, and a CausalLMOutputWithPast container
  • forward: run the tokens through the layers and return the output

The code is as follows:

class Transformer(PreTrainedModel):
    last_loss: Optional[torch.Tensor]

    def __init__(self, params: MyPretrainConfig):
        super().__init__(params)
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
        self.dropout = nn.Dropout(params.dropout)
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = nn.Linear(params.dim, params.vocab_size, bias=False)

        # share the unembedding parameters with the embedding parameters
        self.tok_embeddings.weight = self.output.weight # https://paperswithcode.com/method/weight-tying

        # some useful precompute for the RoPE relative positional embeddings
        freqs_cos, freqs_sin = precompute_freqs_cis(self.params.dim // self.params.n_heads, self.params.max_seq_len)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('w3.weight') or pn.endswith('wo.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * params.n_layers))

        # Initialize attribute for the loss of the last forward call. This will be set if the forward is called with a targets tensor.
        self.last_loss = None
        self.OUT = CausalLMOutputWithPast()

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, tokens: torch.Tensor, targets: Optional[torch.Tensor] = None) -> torch.Tensor:
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        h = self.dropout(h)
        freqs_cos = self.freqs_cos[:seqlen]
        freqs_sin = self.freqs_sin[:seqlen]

        for layer in self.layers:
            h = layer(h, freqs_cos, freqs_sin)
        h = self.norm(h)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.output(h)
            self.last_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the output on the very last position
            logits = self.output(h[:, [-1], :]) # note: using list [-1] to preserve the time dim
            self.last_loss = None

        self.OUT.__setitem__('logits', logits)
        self.OUT.__setitem__('last_loss', self.last_loss)
        return self.OUT
...

Then instantiate the model defined above and print it:

def init_model():
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    model = Transformer(lm_config).to(device)
    print(f'Total LLM parameters: {count_parameters(model) / 1e6:.3f} million')
    return model

model = init_model()
print(model)

The output looks like this:

Transformer(
  (tok_embeddings): Embedding(6400, 512)
  (dropout): Dropout(p=0.0, inplace=False)
  (layers): ModuleList(
    (0-7): 8 x TransformerBlock(
      (attention): Attention(
        (wq): Linear(in_features=512, out_features=512, bias=False)
        (wk): Linear(in_features=512, out_features=256, bias=False)
        (wv): Linear(in_features=512, out_features=256, bias=False)
        (wo): Linear(in_features=512, out_features=512, bias=False)
        (attn_dropout): Dropout(p=0.0, inplace=False)
        (resid_dropout): Dropout(p=0.0, inplace=False)
      )
      (feed_forward): FeedForward(
        (w1): Linear(in_features=512, out_features=1408, bias=False)
        (w2): Linear(in_features=1408, out_features=512, bias=False)
        (w3): Linear(in_features=512, out_features=1408, bias=False)
        (dropout): Dropout(p=0.0, inplace=False)
      )
      (attention_norm): RMSNorm()
      (ffn_norm): RMSNorm()
    )
  )
  (norm): RMSNorm()
  (output): Linear(in_features=512, out_features=6400, bias=False)
)
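
As a rough cross-check of the printed parameter count (my own back-of-the-envelope arithmetic, not from the original article): the tied embedding/output matrix contributes 6400 × 512 ≈ 3.28M parameters; each of the 8 blocks has about 0.79M attention weights (wq and wo are 512×512, wk and wv are 512×256), about 2.16M feed-forward weights (three 512×1408 matrices), and two 512-dim RMSNorm weights, i.e. roughly 2.95M per block. The total is therefore about 3.28M + 8 × 2.95M ≈ 26.9M parameters, which should roughly match the figure printed by init_model.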

I won't go into the model-building details here; a later article in this series will analyze the llama2.c source code and explain how the model is constructed.

4. Choosing the optimizer

Once the model is initialized, choose the optimizer:

scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))  # gradient scaling is only needed for float16; it is a no-op for bfloat16
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

4.1 GradScaler

GradScaler in PyTorch performs gradient scaling for automatic mixed precision (AMP) training. Its main roles are:

  • Preventing gradient underflow: with mixed precision, weights and activations may be stored in lower precision (e.g. FP16), so the gradients computed during backpropagation can become small enough to underflow to zero. GradScaler automatically adjusts a scaling factor so that gradients do not underflow before the parameter update;
  • Speeding up training: mixed precision reduces memory usage and computation time; by adjusting the scaling factor dynamically, GradScaler lets you exploit these savings while keeping the numerics stable;
  • Simplifying the code: with GradScaler you do not have to manage the scaling and unscaling factors by hand;

In practice, you call scaler.scale(loss).backward() to backpropagate through the scaled loss, scaler.step(optimizer) to update the model parameters, and scaler.update() to adjust the scale factor; this keeps training both stable and efficient.
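
Putting those three calls together with autocast, a minimal AMP training step looks roughly like this (a generic sketch rather than the article's exact loop; it assumes model, optimizer, train_loader, and device already exist in scope):

scaler = torch.cuda.amp.GradScaler()
for X, Y in train_loader:
    X, Y = X.to(device), Y.to(device)
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):
        loss = model(X, Y).last_loss      # forward pass runs in reduced precision
    scaler.scale(loss).backward()         # backward on the scaled loss
    scaler.step(optimizer)                # unscales the gradients, then calls optimizer.step()
    scaler.update()                       # adjusts the scale factor for the next step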

4.2 optimizer

The optimizer is a central component in deep learning: its job is to update the model parameters so as to minimize the loss function. Concretely, it is responsible for:

  • Parameter updates: using the computed gradients to update the model's weights and biases so that the model fits the training data better;
  • Controlling the learning rate: the learning rate is a hyperparameter that sets the step size of each update, i.e. how far the model moves toward the optimum per iteration;
  • Implementing different optimization algorithms: PyTorch ships many optimizers (SGD, Adam, RMSprop, ...), each with its own update rule; the choice affects convergence speed and final performance;
  • Momentum and adaptive learning rates: optimizers such as Adam and RMSprop use momentum and per-parameter adaptive learning rates to speed up convergence and improve stability;
  • Regularization support: some optimizers integrate regularization (e.g. L2 weight decay) to reduce overfitting;

In the training loop below, its main job is to apply the parameter update derived from the loss:

# backward pass
scaler.scale(loss).backward()

# gradient clipping and parameter update
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
scaler.step(optimizer)
scaler.update()

# zero the gradients
optimizer.zero_grad(set_to_none=True)

5. The training loop

With the preprocessed data loaded, the model initialized, and the optimizer created, we can start the training loop. The most important detail is the learning-rate schedule: a new learning rate is computed at every step and the parameters are updated from the loss (the shape of the schedule is sketched after the loop). The code is as follows:

for epoch in range(epochs):
    start_time = time.time()

    for step, (X, Y) in enumerate(train_loader):
        X = X.to(device)
        Y = Y.to(device)

        # set the learning rate for this step (cosine schedule, see get_lr in the appendix)
        lr = get_lr(epoch * iter_per_epoch + step, epochs * iter_per_epoch)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        # forward pass and loss computation
        with ctx:
            out = model(X, Y)
            loss = out.last_loss

        # backward pass; dividing the loss by accumulation_steps keeps the
        # accumulated gradient equal to the average over the accumulation window
        scaler.scale(loss / accumulation_steps).backward()

        # every accumulation_steps steps: clip gradients, update parameters, then zero the gradients
        if (step + 1) % accumulation_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)

        if step % 100 == 0:
            spend_time = time.time() - start_time
            print(
                'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.7f} epoch_Time:{}min:'.format(
                    epoch,
                    epochs,
                    step,
                    iter_per_epoch,
                    loss.item(),
                    optimizer.param_groups[-1]['lr'],
                    spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
            model.eval()
            ckp = f'{save_dir}/pretrain_{lm_config.dim}.pth'
            state_dict = model.state_dict()
            torch.save(state_dict, ckp)
            model.train()
  • out = model(X, Y): forward pass, producing the logits and the loss
  • scaler.scale(loss).backward(): backward pass, accumulating gradients; the parameters are only updated every accumulation_steps steps
  • model.eval() / model.train() switch the model between evaluation and training mode around saving the current checkpoint to the output directory
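
The learning rate set at each step comes from get_lr (shown in full in the appendix): with warmup_iters = 0 it is a cosine decay from learning_rate down to min_lr = learning_rate / 10 over the whole run. A quick way to see the shape of the schedule (my own snippet, assuming the variables from the appendix script are in scope):

total_iters = epochs * iter_per_epoch
for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    it = int(frac * total_iters)
    print(f"{frac:.0%} of training -> lr = {get_lr(it, total_iters):.2e}")
# with learning_rate = 1e-4 this prints roughly 1.0e-04, 8.7e-05, 5.5e-05, 2.3e-05, 1.0e-05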

On a T4 GPU, the full training run took me 30+ hours; on CPU, expect roughly 4x that. The complete code is in the appendix if you want to run it yourself.
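
Once a checkpoint such as out/pretrain_512.pth has been saved, it can be loaded back for a quick sanity check. A rough sketch (my own, assuming the classes and variables from the appendix script are in scope, and that token id 1 is a reasonable start token for your tokenizer):

ckpt_model = Transformer(lm_config).to(device)
ckpt_model.load_state_dict(torch.load(f'{save_dir}/pretrain_{lm_config.dim}.pth', map_location=device))
ckpt_model.eval()

idx = torch.randint(0, lm_config.vocab_size, (1, 16), device=device)  # dummy prompt of random token ids
with torch.no_grad():
    print(ckpt_model(idx).logits.shape)   # torch.Size([1, 1, 6400]): only the last position at inference time

out = ckpt_model.generate(torch.tensor([[1]], device=device), max_new_tokens=50, temperature=0.8, top_k=30)
print(out[0].tolist())                    # decode with your tokenizer to get text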

Appendix

The complete code:

import os
import time
import math
import warnings
import inspect
import numpy as np
import torch
from torch import optim
from torch.utils.data import DataLoader
from contextlib import nullcontext
from model.model import Transformer
from torch.utils.data import Dataset
from transformers import PretrainedConfig
from typing import Any, Optional, Tuple
import torch.nn.functional as F
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_outputs import CausalLMOutputWithPast
os.environ["TOKENIZERS_PARALLELISM"] = "false"

warnings.filterwarnings('ignore')
basepath = "../datasets"

class MyPretrainConfig(PretrainedConfig):
    model_type = "myllm"

    def __init__(
            self,
            dim: int = 512,
            n_layers: int = 8,
            n_heads: int = 16,
            n_kv_heads: int = 8,
            vocab_size: int = 6400,
            hidden_dim: int = None,
            multiple_of: int = 64,
            norm_eps: float = 1e-5,
            max_seq_len: int = 512,
            dropout: float = 0.0,
            flash_attn: bool = True,
            num_experts_per_tok=2,
            n_routed_experts=4,
            n_shared_experts: bool = True,
            scoring_func='softmax',
            aux_loss_alpha=0.01,
            seq_aux=True,
            norm_topk_prob=True,
            **kwargs,
    ):
        self.dim = dim
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.n_kv_heads = n_kv_heads
        self.vocab_size = vocab_size
        self.hidden_dim = hidden_dim
        self.multiple_of = multiple_of
        self.norm_eps = norm_eps
        self.max_seq_len = max_seq_len
        self.dropout = dropout
        self.flash_attn = flash_attn
        self.num_experts_per_tok = num_experts_per_tok  # number of experts selected per token
        self.n_routed_experts = n_routed_experts        # total number of routed experts
        self.n_shared_experts = n_shared_experts        # shared experts
        self.scoring_func = scoring_func                # scoring function, 'softmax' by default
        self.aux_loss_alpha = aux_loss_alpha            # alpha coefficient of the auxiliary loss
        self.seq_aux = seq_aux                          # whether to compute the auxiliary loss at the sequence level
        self.norm_topk_prob = norm_topk_prob            # whether to normalize the top-k probabilities
        super().__init__(**kwargs)

class PretrainDataset(Dataset):
    def __init__(self, data_path_lst, max_length=512, memmap=False):
        super().__init__()
        if memmap:
            with open(data_path_lst[0], 'r') as f:
                nbytes = f.seek(0, 2)
                flen = f.tell() // np.dtype('uint16').itemsize
            self.data = np.memmap(data_path_lst[0], dtype=np.dtype('uint16'), shape=(flen // max_length, max_length))
        else:
            data_lst = []
            for data_path in data_path_lst:
                with open(data_path, 'rb') as f:
                    data = np.fromfile(f, dtype=np.uint16)
                    data_lst.append(data)
            data = np.concatenate(data_lst)
            data = data[:max_length * int(len(data) / max_length)]
            self.data = data.reshape(-1, max_length)
        print("memmap:{} train data.shape:{}".format(memmap, self.data.shape))
        print("downloading finished.....")

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, index: int):
        sample = self.data[index]
        X = np.array(sample[:-1]).astype(np.int64)
        Y = np.array(sample[1:]).astype(np.int64)

        return torch.from_numpy(X), torch.from_numpy(Y)
    
class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device)  # type: ignore
    freqs = torch.outer(t, freqs).float()  # type: ignore
    freqs_cos = torch.cos(freqs)  # real part
    freqs_sin = torch.sin(freqs)  # imaginary part
    return freqs_cos, freqs_sin

def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(shape)

def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cos: torch.Tensor,
    freqs_sin: torch.Tensor
) -> Tuple[torch.Tensor, torch.Tensor]:

    # reshape xq and xk to match the complex representation
    xq_r, xq_i = xq.float().reshape(xq.shape[:-1] + (-1, 2)).unbind(-1)
    xk_r, xk_i = xk.float().reshape(xk.shape[:-1] + (-1, 2)).unbind(-1)

    # reshape freqs_cos and freqs_sin for broadcasting
    freqs_cos = reshape_for_broadcast(freqs_cos, xq_r)
    freqs_sin = reshape_for_broadcast(freqs_sin, xq_r)

    # apply rotation using real numbers
    xq_out_r = xq_r * freqs_cos - xq_i * freqs_sin
    xq_out_i = xq_r * freqs_sin + xq_i * freqs_cos
    xk_out_r = xk_r * freqs_cos - xk_i * freqs_sin
    xk_out_i = xk_r * freqs_sin + xk_i * freqs_cos

    # flatten last two dimensions
    xq_out = torch.stack([xq_out_r, xq_out_i], dim=-1).flatten(3)
    xk_out = torch.stack([xk_out_r, xk_out_i], dim=-1).flatten(3)

    return xq_out.type_as(xq), xk_out.type_as(xk)

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )

class Attention(nn.Module):
    def __init__(self, args: MyPretrainConfig):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        assert args.n_heads % self.n_kv_heads == 0
        model_parallel_size = 1
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads
        self.wq = nn.Linear(args.dim, args.n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(args.dim, self.n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(args.n_heads * self.head_dim, args.dim, bias=False)
        self.attn_dropout = nn.Dropout(args.dropout)
        self.resid_dropout = nn.Dropout(args.dropout)
        self.dropout = args.dropout

        # use flash attention or a manual implementation?
        self.flash = hasattr(torch.nn.functional, 'scaled_dot_product_attention')
        if not self.flash:
            print("WARNING: using slow attention. Flash Attention requires PyTorch >= 2.0")
            mask = torch.full((1, 1, args.max_seq_len, args.max_seq_len), float("-inf"))
            mask = torch.triu(mask, diagonal=1)
            self.register_buffer("mask", mask)

    def forward(
        self,
        x: torch.Tensor,
        freqs_cos: torch.Tensor,
        freqs_sin: torch.Tensor,
    ):
        bsz, seqlen, _ = x.shape

        # QKV
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)
        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        # RoPE relative positional embeddings
        xq, xk = apply_rotary_emb(xq, xk, freqs_cos, freqs_sin)

        # grouped multiquery attention: expand out keys and values
        xk = repeat_kv(xk, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)
        xv = repeat_kv(xv, self.n_rep)  # (bs, seqlen, n_local_heads, head_dim)

        # make heads into a batch dimension
        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        xk = xk.transpose(1, 2)
        xv = xv.transpose(1, 2)

        # flash implementation
        if self.flash:
            output = torch.nn.functional.scaled_dot_product_attention(xq, xk, xv, attn_mask=None, dropout_p=self.dropout if self.training else 0.0, is_causal=True)
        else:
            # manual implementation
            scores = torch.matmul(xq, xk.transpose(2, 3)) / math.sqrt(self.head_dim)
            assert hasattr(self, 'mask')
            scores = scores + self.mask[:, :, :seqlen, :seqlen]   # (bs, n_local_heads, seqlen, cache_len + seqlen)
            scores = F.softmax(scores.float(), dim=-1).type_as(xq)
            scores = self.attn_dropout(scores)
            output = torch.matmul(scores, xv)  # (bs, n_local_heads, seqlen, head_dim)

        # restore time as batch dimension and concat heads
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)

        # final projection into the residual stream
        output = self.wo(output)
        output = self.resid_dropout(output)
        return output

class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden_dim: int, multiple_of: int, dropout: float):
        super().__init__()
        if hidden_dim is None:
            hidden_dim = 4 * dim
            hidden_dim = int(2 * hidden_dim / 3)
            hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.dropout(self.w2(F.silu(self.w1(x)) * self.w3(x)))

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: MyPretrainConfig):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=args.hidden_dim,
            multiple_of=args.multiple_of,
            dropout=args.dropout,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(self, x, freqs_cos, freqs_sin):
        h = x + self.attention.forward(self.attention_norm(x), freqs_cos, freqs_sin)
        out = h + self.feed_forward.forward(self.ffn_norm(h))
        return out

class Transformer(PreTrainedModel):
    last_loss: Optional[torch.Tensor]

    def __init__(self, params: MyPretrainConfig):
        super().__init__(params)
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = nn.Embedding(params.vocab_size, params.dim)
        self.dropout = nn.Dropout(params.dropout)
        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))
        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = nn.Linear(params.dim, params.vocab_size, bias=False)

        # share the unembedding parameters with the embedding parameters
        self.tok_embeddings.weight = self.output.weight # https://paperswithcode.com/method/weight-tying

        # some useful precompute for the RoPE relative positional embeddings
        freqs_cos, freqs_sin = precompute_freqs_cis(self.params.dim // self.params.n_heads, self.params.max_seq_len)
        self.register_buffer("freqs_cos", freqs_cos, persistent=False)
        self.register_buffer("freqs_sin", freqs_sin, persistent=False)

        # init all weights
        self.apply(self._init_weights)
        # apply special scaled init to the residual projections, per GPT-2 paper
        for pn, p in self.named_parameters():
            if pn.endswith('w3.weight') or pn.endswith('wo.weight'):
                torch.nn.init.normal_(p, mean=0.0, std=0.02/math.sqrt(2 * params.n_layers))

        # Initialize attribute for the loss of the last forward call. This will be set if the forward is called with a targets tensor.
        self.last_loss = None
        self.OUT = CausalLMOutputWithPast()

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, tokens: torch.Tensor, targets: Optional[torch.Tensor] = None) -> torch.Tensor:
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        h = self.dropout(h)
        freqs_cos = self.freqs_cos[:seqlen]
        freqs_sin = self.freqs_sin[:seqlen]

        for layer in self.layers:
            h = layer(h, freqs_cos, freqs_sin)
        h = self.norm(h)

        if targets is not None:
            # if we are given some desired targets also calculate the loss
            logits = self.output(h)
            self.last_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)
        else:
            # inference-time mini-optimization: only forward the output on the very last position
            logits = self.output(h[:, [-1], :]) # note: using list [-1] to preserve the time dim
            self.last_loss = None

        self.OUT.__setitem__('logits', logits)
        self.OUT.__setitem__('last_loss', self.last_loss)
        return self.OUT

    def configure_optimizers(self, weight_decay, learning_rate, betas, device_type):
        # start with all of the candidate parameters
        param_dict = {pn: p for pn, p in self.named_parameters()}
        # filter out those that do not require grad
        param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
        # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
        # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
        decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
        nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
        optim_groups = [
            {'params': decay_params, 'weight_decay': weight_decay},
            {'params': nodecay_params, 'weight_decay': 0.0}
        ]
        num_decay_params = sum(p.numel() for p in decay_params)
        num_nodecay_params = sum(p.numel() for p in nodecay_params)
        print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
        print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
        # Create AdamW optimizer and use the fused version if it is available
        fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
        use_fused = fused_available and device_type == 'cuda'
        extra_args = dict(fused=True) if use_fused else dict()
        optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=betas, **extra_args)
        print(f"using fused AdamW: {use_fused}")

        return optimizer

    def estimate_mfu(self, fwdbwd_per_iter, dt):
        """ estimate model flops utilization (MFU) in units of A100 bfloat16 peak FLOPS """
        # first estimate the number of flops we do per iteration.
        # see PaLM paper Appendix B as ref: https://arxiv.org/abs/2204.02311
        N = sum(p.numel() for p in self.parameters())
        cfg = self.params
        L, H, Q, T = cfg.n_layers, cfg.n_heads, cfg.dim//cfg.n_heads, cfg.max_seq_len
        flops_per_token = 6*N + 12*L*H*Q*T
        flops_per_fwdbwd = flops_per_token * T
        flops_per_iter = flops_per_fwdbwd * fwdbwd_per_iter
        # express our flops throughput as ratio of A100 bfloat16 peak flops
        flops_achieved = flops_per_iter * (1.0/dt) # per second
        flops_promised = 312e12 # A100 GPU bfloat16 peak flops is 312 TFLOPS
        mfu = flops_achieved / flops_promised
        return mfu

    @torch.inference_mode()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        """
        Take a conditioning sequence of indices idx (LongTensor of shape (b,t)) and complete
        the sequence max_new_tokens times, feeding the predictions back into the model each time.
        Most likely you'll want to make sure to be in model.eval() mode of operation for this.
        Also note this is a super inefficient version of sampling with no key/value cache.
        """
        for _ in range(max_new_tokens):
            # if the sequence context is growing too long we must crop it at block_size
            idx_cond = idx if idx.size(1) <= self.params.max_seq_len else idx[:, -self.params.max_seq_len:]
            # forward the model to get the logits for the index in the sequence
            logits = self(idx_cond).logits  # forward() returns a CausalLMOutputWithPast; take the logits tensor
            logits = logits[:, -1, :] # crop to just the final time step
            if temperature == 0.0:
                # "sample" the single most likely index
                _, idx_next = torch.topk(logits, k=1, dim=-1)
            else:
                # pluck the logits at the final step and scale by desired temperature
                logits = logits / temperature
                # optionally crop the logits to only the top k options
                if top_k is not None:
                    v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                    logits[logits < v[:, [-1]]] = -float('Inf')
                # apply softmax to convert logits to (normalized) probabilities
                probs = F.softmax(logits, dim=-1)
                idx_next = torch.multinomial(probs, num_samples=1)
            # append sampled index to the running sequence and continue
            idx = torch.cat((idx, idx_next), dim=1)

        return idx

def get_lr(it, all):
    warmup_iters = 0
    lr_decay_iters = all
    min_lr = learning_rate / 10

    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))
    return min_lr + coeff * (learning_rate - min_lr)

def init_model():
    def count_parameters(model):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)

    model = Transformer(lm_config).to(device)
    print(f'Total LLM parameters: {count_parameters(model) / 1e6:.3f} million')
    return model


if __name__ == "__main__":
    # -----------------------------------------------------------------------------
    lm_config = MyPretrainConfig()
    max_seq_len = lm_config.max_seq_len
    out_dir = 'out'
    epochs = 20             # number of training epochs
    batch_size = 8          # batch size
    learning_rate = 1e-4    # learning rate
    device = 'cuda:0'       # or cpu
    dtype = 'bfloat16'
    save_dir = os.path.join(out_dir)
    os.makedirs(save_dir, exist_ok=True)
    os.makedirs(out_dir, exist_ok=True)
    tokens_per_iter = batch_size * max_seq_len
    torch.manual_seed(1337)
    device_type = device if "cuda" in device else "cpu"
    print(f"device_type: {device_type}")
    ctx = (
        nullcontext()
        if device_type == "cpu"
        else torch.cuda.amp.autocast()
    )
    # -----------------------------------------------------------------------------

    # -----init dataloader------
    data_path_list = [f'{basepath}/pretrain_data.bin']
    train_ds = PretrainDataset(data_path_list, max_length=max_seq_len, memmap=True)
    train_sampler = None
    num_workers = 16  # adjust to the number of CPU cores on your machine
    train_loader = DataLoader(
        train_ds,
        batch_size=batch_size,
        pin_memory=True,
        drop_last=False,
        shuffle=False,
        num_workers=num_workers,
        sampler=train_sampler
    )

    # init model
    model = init_model()
    print(model)
    scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))  # gradient scaling is only needed for float16; it is a no-op for bfloat16
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # training loop
    accumulation_steps = 8
    iter_per_epoch = len(train_loader)
    for epoch in range(epochs):
        start_time = time.time()

        for step, (X, Y) in enumerate(train_loader):
            X = X.to(device)
            Y = Y.to(device)

            # set the learning rate for this step (cosine schedule)
            lr = get_lr(epoch * iter_per_epoch + step, epochs * iter_per_epoch)
            for param_group in optimizer.param_groups:
                param_group['lr'] = lr

            # forward pass and loss computation
            with ctx:
                out = model(X, Y)
                loss = out.last_loss

            # backward pass; dividing the loss by accumulation_steps keeps the
            # accumulated gradient equal to the average over the accumulation window
            scaler.scale(loss / accumulation_steps).backward()

            # every accumulation_steps steps: clip gradients, update parameters, then zero the gradients
            if (step + 1) % accumulation_steps == 0:
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)

            if step % 100 == 0:
                spend_time = time.time() - start_time
                print(
                    'Epoch:[{}/{}]({}/{}) loss:{:.3f} lr:{:.7f} epoch_Time:{}min:'.format(
                        epoch,
                        epochs,
                        step,
                        iter_per_epoch,
                        loss.item(),
                        optimizer.param_groups[-1]['lr'],
                        spend_time / (step + 1) * iter_per_epoch // 60 - spend_time // 60))
                model.eval()
                ckp = f'{save_dir}/pretrain_{lm_config.dim}.pth'
                state_dict = model.state_dict()
                torch.save(state_dict, ckp)
                model.train()

References

(1)https://github.com/jingyaogong/minimind?tab=readme-ov-file#%E6%95%B0%E6%8D%AE%E9%9B%86%E4%B8%8B%E8%BD%BD%E5%9C%B0%E5%9D%80
(2)https://github.com/karpathy/llama2.c/blob/master/train.py
