A Guide to Tokenization: Byte Pair Encoding, WordPiece, and Other Methods, Explained with Python Code

Author: Anonymous 2024-01-17 16:29:59

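Large language models (LLMs) became hugely popular after OpenAI released ChatGPT in November 2022. Their use has grown explosively since then, helped in part by libraries such as Hugging Face's Transformers and PyTorch.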

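Before a computer can process language, the text must first be converted into numbers. This process is called tokenization.

Tokenization consists of two steps:

1. Splitting the input text into tokens

The tokenizer first takes the text and splits it into smaller pieces, which can be words, parts of words, or individual characters. These smaller pieces of text are called tokens. The Stanford NLP Group[2] defines a token more rigorously as:

An instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.

2. Assigning an ID to each token

Once the tokenizer has split the text into tokens, each token can be assigned an integer called a token ID. For example, the word cat might be assigned the value 15, so every cat token in the input text is represented by the number 15. Replacing text tokens with their numeric representation is called encoding. Likewise, converting encoded tokens back into text is called decoding.

Representing a token with a single number has drawbacks, so these encodings are processed further to create word embeddings. That is outside the scope of this article and will be covered later.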
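The two steps can be sketched in a few lines of plain Python. This is only an illustration: the sentence, the whitespace split, and the helper names build_vocab, encode, and decode are assumptions made for this example, and the IDs are simply assigned in order of first appearance.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every token with its ID."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """Map IDs back to their token strings."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return [id_to_token[idx] for idx in ids]


# Step 1: split the text into tokens (a plain whitespace split here)
tokens = "the cat sat on the cat mat".split()

# Step 2: assign an ID to each token, then encode and decode
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(vocab)
print(ids)
print(decode(ids, vocab))
```

Every occurrence of the same token receives the same ID, and decoding the IDs returns the original token list.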

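Tokenization Methods

There are three main methods for splitting text into tokens:

1. Word-based

Word-based tokenization is the simplest of the three methods. The tokenizer splits a sentence into words either at every space character (sometimes called whitespace-based tokenization) or by a similar set of rules, such as punctuation-based tokenization[12].

For example, the sentence:

Cats are great, but dogs are better!

can be split on whitespace into:

['Cats', 'are', 'great,', 'but', 'dogs', 'are', 'better!']

Splitting on punctuation as well as whitespace gives:

['Cats', 'are', 'great', ',', 'but', 'dogs', 'are', 'better', '!']

This shows how much the splitting rules matter. The whitespace approach produces the potentially rare token better!, whereas splitting on punctuation produces two less rare tokens, better and !. Punctuation should not be removed entirely, because it can carry very specific meaning. The apostrophe is one example: it distinguishes the plural form of a word from the possessive form. For instance, "book's" refers to some property of a book, while "books" refers to many books.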
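Both splitting rules can be reproduced with the standard library. The regular expression below is just one possible punctuation rule, chosen for this sketch; real tokenizers use more elaborate rules.

```python
import re

sentence = "Cats are great, but dogs are better!"

# Whitespace-based: split on runs of whitespace only
whitespace_tokens = sentence.split()

# Punctuation-based: runs of word characters, or any single character
# that is neither a word character nor whitespace
punctuation_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(punctuation_tokens)

# The same rule keeps the apostrophe as a token of its own
print(re.findall(r"\w+|[^\w\s]", "book's"))
```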

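After the tokens are generated, each one can be assigned a number. The next time the tokenizer produces a token it has already seen, it simply assigns the number already designated for that word. For example, if the token great is assigned the value 1 in the sentence above, every subsequent instance of great is also assigned 1[3].

Pros and cons:

Tokens produced by word-based methods carry a high degree of information, because each token contains semantic and contextual information. One of the biggest drawbacks, however, is that very similar words are treated as completely separate tokens. For example, the connection between cat and cats is lost, and the two are treated as unrelated words. This becomes a problem in large-scale applications that contain many words, because the number of possible tokens in the model's vocabulary (the total set of tokens the model has seen) can grow very large. English has roughly 170,000 words, which leads to the so-called exploding vocabulary problem. One example is the TransformerXL tokenizer, which uses whitespace-based splitting and ends up with a vocabulary of more than 250,000 tokens[4].

One way to address this is to impose a hard limit on the number of tokens the model can learn (for example, 10,000). Any word outside the 10,000 most frequent tokens is then classed as out-of-vocabulary (OOV) and is assigned the token value UNKNOWN (often shortened to UNK) instead of a numeric value. Performance suffers when the text contains many unknown words, but this can be a suitable compromise if the data consists mostly of common words.[5]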
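A capped vocabulary with an unknown token might look like the sketch below. The helper names, the "[UNK]" string, and the choice to reserve ID 0 for it are assumptions made for this example; the cap is set to 2 instead of 10,000 so the effect is visible on a tiny word list.

```python
from collections import Counter


def build_capped_vocab(tokens, max_size, unk_token="[UNK]"):
    """Keep only the max_size most common tokens; ID 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for token, _count in Counter(tokens).most_common(max_size):
        vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown ID."""
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


training_tokens = "cats eat food cats eat cats".split()
vocab = build_capped_vocab(training_tokens, max_size=2)

print(vocab)
print(encode("cats eat fish".split(), vocab))
```

Here food falls outside the cap and fish was never seen, so both would be encoded with the unknown ID.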

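2. Character-based tokenizers

Character-based tokenization splits text on every character, including letters, digits, and special characters such as punctuation. This greatly reduces the vocabulary size: English can be represented with about 256 tokens instead of the 170,000-plus needed by a word-based approach[5]. Even East Asian languages such as Chinese and Japanese see a significant reduction in vocabulary size, despite having thousands of unique characters in their writing systems.

With a character-based tokenizer, the sentence:

Cats are great, but dogs are better!

is split into:

['C', 'a', 't', 's', ' ', 'a', 'r', 'e', ' ', 'g', 'r', 'e', 'a', 't', ',', ' ', 'b', 'u', 't', ' ', 'd', 'o', 'g', 's', ' ', 'a', 'r', 'e', ' ', 'b', 'e', 't', 't', 'e', 'r', '!']

Pros and cons:

Compared with word-based methods, character-based methods have a much smaller vocabulary and far fewer out-of-vocabulary tokens. They can even tokenize misspelled words (although the result differs from that of the correctly spelled word).

This approach has drawbacks too. A single token produced by a character-based method stores very little information, because unlike tokens in word-based methods, it captures no semantic or contextual meaning (especially in languages with alphabet-based writing systems, such as English). The approach also limits the size of the tokenized input that can be fed into a language model, because many numbers are needed to encode the input text.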
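The trade-off can be seen on the example sentence itself. The snippet below is a minimal illustration using only built-in functions: the character vocabulary is tiny, but the encoded sequence is several times longer than the word-level one.

```python
sentence = "Cats are great, but dogs are better!"

# Character-based: every character, including spaces, is a token
char_tokens = list(sentence)
char_vocab = sorted(set(char_tokens))

# Word-based (whitespace split) for comparison
word_tokens = sentence.split()

print(f"character tokens: {len(char_tokens)}")
print(f"word tokens:      {len(word_tokens)}")
print(f"distinct characters in the vocabulary: {len(char_vocab)}")
```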

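3. Subword-based tokenizers

Subword-based tokenization aims to get the benefits of both word-based and character-based methods while minimizing their drawbacks. It takes a middle ground by splitting text within words, creating tokens that carry semantic meaning even when they are not complete words. For example, the tokens ing and ed are not words in themselves, but they carry grammatical meaning.

The resulting vocabulary is smaller than that of a word-based method but larger than that of a character-based method. The same holds for the amount of information stored in each token, which also lies between that of the two previous methods.

Only uncommon words are split, which lets conjugations, plural forms, and so on be decomposed into their constituent parts while preserving the relationships between tokens. For example, cat may be a very common word in the dataset while cats is less common. In that case cats is split into cat and s, where cat is assigned the same value as every other cat token and s is assigned a different value that can encode the meaning of plurality. Another example is the word tokenization, which can be split into the root token and the suffix ization. This approach preserves syntactic and semantic similarity[6]. For these reasons, subword-based tokenizers are very widely used in NLP models today.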

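Normalization and Pre-tokenization

Tokenization involves several pre-processing and post-processing steps, which together make up the tokenization pipeline. The tokenization method itself (subword-based, character-based, and so on) is applied in the model step[7].

When a tokenizer from Hugging Face's transformers library is used, all steps of the tokenization pipeline are handled automatically. The whole pipeline is carried out by a single object called Tokenizer. This section looks at the inner workings of code that most users never need to handle manually when working on NLP tasks. It also covers the steps for customizing the base tokenizer class in the tokenizers library, so that a tokenizer can be purpose-built for a specific task when needed.

1. Normalization methods

Normalization is the process of cleaning text before it is split into tokens. It includes steps such as converting every character to lowercase, removing accents from characters, and removing unnecessary whitespace. Given a string such as Thís is áN ExaMPlé sénteNCE, different normalizers perform different steps.

Hugging Face's normalizers package contains several basic normalizers. Commonly used ones include:

NFC: does not convert case or remove accents

Lower: converts case but does not remove accents

BERT: converts case and removes accents

The three can be compared as follows: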

from tokenizers.normalizers import NFC, Lowercase, BertNormalizer

# Text to normalize
text = 'Thís is áN ExaMPlé     sénteNCE'

# Instantiate normalizer objects
NFCNorm = NFC()
LowercaseNorm = Lowercase()
BertNorm = BertNormalizer()

# Normalize the text
print(f'NFC:   {NFCNorm.normalize_str(text)}')
print(f'Lower: {LowercaseNorm.normalize_str(text)}')
print(f'BERT: {BertNorm.normalize_str(text)}')

#NFC:   Thís is áN ExaMPlé     sénteNCE
#Lower: thís is án examplé     séntence
#BERT: this is an example     sentence

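Normalizers can also be accessed from the pretrained tokenizers in the transformers library through backend_tokenizer.normalizer. In the example below, only the FNet normalizer removes the unnecessary whitespace.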

from transformers import FNetTokenizerFast, CamembertTokenizerFast, \
                         BertTokenizerFast

# Text to normalize
text = 'Thís is áN ExaMPlé     sénteNCE'

# Instantiate tokenizers
FNetTokenizer = FNetTokenizerFast.from_pretrained('google/fnet-base')
CamembertTokenizer = CamembertTokenizerFast.from_pretrained('camembert-base')
BertTokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Get the normalizer that each tokenizer uses internally
fnet_norm = FNetTokenizer.backend_tokenizer.normalizer
camembert_norm = CamembertTokenizer.backend_tokenizer.normalizer
bert_norm = BertTokenizer.backend_tokenizer.normalizer

# Normalize the text
print(f'FNet Output:      {fnet_norm.normalize_str(text)}')
print(f'CamemBERT Output: {camembert_norm.normalize_str(text)}')
print(f'BERT Output:      {bert_norm.normalize_str(text)}')

#FNet Output:      Thís is áN ExaMPlé sénteNCE
#CamemBERT Output: Thís is áN ExaMPlé     sénteNCE
#BERT Output:      this is an example     sentence
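To make the individual normalization steps more concrete, they can be approximated with Python's standard unicodedata module. This is only a rough sketch of what such normalizers do and not the library's actual implementation; the helper names strip_accents and collapse_whitespace are made up for this example.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def collapse_whitespace(text):
    """Replace every run of whitespace with a single space."""
    return " ".join(text.split())


text = 'Thís is áN ExaMPlé     sénteNCE'

# Lowercase and remove accents, similar in spirit to the BERT normalizer
print(strip_accents(text.lower()))

# Remove the unnecessary whitespace only
print(collapse_whitespace(text))
```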

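2. Pre-tokenization

The pre-tokenization step is the first split of the raw text. The split is performed to give an upper bound on what the final tokens can be. A sentence can be split into several words in the pre-tokenization step, and then in the model step some of those words are split further according to the tokenization method (for example, a subword-based method). The pre-tokenized text therefore represents the largest tokens that can still remain after tokenization.

For example, a sentence can be split on every space, on every space plus some punctuation, or on every space plus every punctuation mark.

The comparison below shows the basic WhitespaceSplit pre-tokenizer and the slightly more complex BertPreTokenizer, both from the pre_tokenizers package. The output of the whitespace pre-tokenizer leaves punctuation intact and still attached to the neighboring word; for example, includes: is treated as a single word. The BERT pre-tokenizer, by contrast, treats each punctuation mark as a separate word[8].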

from tokenizers.pre_tokenizers import WhitespaceSplit, BertPreTokenizer

# Text to pre-tokenize
text = ("this sentence's content includes: characters, spaces, and "
        "punctuation.")

# Define helper function to display pre-tokenized output
def print_pretokenized_str(pre_tokens):
    for pre_token in pre_tokens:
        print(f'"{pre_token[0]}", ', end='')

# Instantiate pre-tokenizers
wss = WhitespaceSplit()
bpt = BertPreTokenizer()

# Pre-tokenize the text
print('Whitespace Pre-Tokenizer:')
print_pretokenized_str(wss.pre_tokenize_str(text))

#Whitespace Pre-Tokenizer:
#"this", "sentence's", "content", "includes:", "characters,", "spaces,",
#"and", "punctuation.",


print('\n\nBERT Pre-Tokenizer:')
print_pretokenized_str(bpt.pre_tokenize_str(text))

#BERT Pre-Tokenizer:
#"this", "sentence", "'", "s", "content", "includes", ":", "characters",
#",", "spaces", ",", "and", "punctuation", ".",

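Pre-tokenization methods can also be called directly from common tokenizers such as the GPT-2 and ALBERT (A Lite BERT) tokenizers. They differ slightly from the standard BERT pre-tokenizer shown above in that whitespace characters are not removed when the tokens are split. Instead, they are replaced with special characters that mark where the spaces were. The advantage is that the space characters can be ignored in further processing, yet the original sentence can still be recovered if needed. The GPT-2 model uses the Ġ character, a capital G with a dot above it. The ALBERT model uses an underscore-like character (▁).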

from transformers import AutoTokenizer

# Text to pre-tokenize
text = ("this sentence's content includes: characters, spaces, and "
        "punctuation.")

# Instantiate the pre-tokenizers
GPT2_PreTokenizer = AutoTokenizer.from_pretrained('gpt2').backend_tokenizer \
                    .pre_tokenizer

Albert_PreTokenizer = AutoTokenizer.from_pretrained('albert-base-v1') \
                      .backend_tokenizer.pre_tokenizer

# Pre-tokenize the text (print_pretokenized_str is the helper defined in the
# previous example)
print('GPT-2 Pre-Tokenizer:')
print_pretokenized_str(GPT2_PreTokenizer.pre_tokenize_str(text))

#GPT-2 Pre-Tokenizer:
#"this", "Ġsentence", "'s", "Ġcontent", "Ġincludes", ":", "Ġcharacters", ",",
#"Ġspaces", ",", "Ġand", "Ġpunctuation", ".",

print('\n\nALBERT Pre-Tokenizer:')
print_pretokenized_str(Albert_PreTokenizer.pre_tokenize_str(text))

#ALBERT Pre-Tokenizer:
#"▁this", "▁sentence's", "▁content", "▁includes:", "▁characters,", "▁spaces,",
#"▁and", "▁punctuation.",

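The result of the BERT pre-tokenization step on the same example sentence is shown below. The returned object is a Python list of tuples. Each tuple corresponds to one pre-token: the first element is the pre-token string, and the second is a tuple containing the start and end indices of that string in the original input text.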

from tokenizers.pre_tokenizers import BertPreTokenizer

# Text to pre-tokenize
text = ("this sentence's content includes: characters, spaces, and "
        "punctuation.")

# Instantiate pre-tokenizer
bpt = BertPreTokenizer()

# Pre-tokenize the text
bpt.pre_tokenize_str(text)

The result:

[('this', (0, 4)),
  ('sentence', (5, 13)),
  ("'", (13, 14)),
  ('s', (14, 15)),
  ('content', (16, 23)),
  ('includes', (24, 32)),
  (':', (32, 33)),
  ('characters', (34, 44)),
  (',', (44, 45)),
  ('spaces', (46, 52)),
  (',', (52, 53)),
  ('and', (54, 57)),
  ('punctuation', (58, 69)),
  ('.', (69, 70))]
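The offsets are ordinary Python slice indices into the original string, with the end index exclusive. The short check below copies the first few tuples from the list above and slices the input text with them; the helper name slice_by_offsets is made up for this example.

```python
text = ("this sentence's content includes: characters, spaces, and "
        "punctuation.")

# (pre-token, (start, end)) tuples copied from the start of the list above
pre_tokens = [('this', (0, 4)), ('sentence', (5, 13)), ("'", (13, 14)),
              ('s', (14, 15)), ('content', (16, 23)), ('includes', (24, 32)),
              (':', (32, 33))]


def slice_by_offsets(text, pre_tokens):
    """Return the substring of text that each (start, end) pair points at."""
    return [text[start:end] for _token, (start, end) in pre_tokens]


recovered = slice_by_offsets(text, pre_tokens)
print(recovered)

# Each slice matches the pre-token string stored next to its offsets
print(recovered == [token for token, _span in pre_tokens])
```

The gaps between consecutive spans, such as index 4 or index 15, are the whitespace characters that the BERT pre-tokenizer dropped.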

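Subword Tokenization Methods

Once the text has been normalized and pre-tokenized, the tokens can be merged. Three methods are commonly used to implement subword-based tokenization for transformer models. They all use slightly different techniques to split uncommon words into smaller tokens.

1. Byte Pair Encoding

The Byte Pair Encoding (BPE) algorithm is a commonly used tokenizer, found for example in the GPT and GPT-2 models (OpenAI) and BART (Lewis et al.)[9-10]. It was originally designed as a text compression algorithm, but it turned out to work very well for tokenization in language models. The BPE algorithm breaks a string of text into subword units that occur frequently in a reference corpus (the text used to train the tokenization model)[11]. A BPE model is trained as follows:

a) Build the corpus

The input text is passed to the normalization and pre-tokenization models to create a clean list of words. The words are then handed to the BPE model, which determines the frequency of each word and stores that number alongside the word in a list called the corpus.

b) Build the vocabulary

The words in the corpus are then broken down into individual characters, which are added to an initially empty list called the vocabulary. The algorithm adds to this vocabulary iteratively, each time it determines which character pairs can be merged.

c) Find the frequency of character pairs

Next, the frequency of character pairs is recorded for every word in the corpus. For example, the word cats has the character pairs ca, at, and ts. Every word is examined in this way and contributes to a global frequency counter, so each instance of ca found in any token increments the counter for the ca pair.

d) Create merge rules

Once the frequency of every character pair is known, the most frequent pair is added to the vocabulary. The vocabulary now consists of every individual character in the tokens plus the most frequent character pair. This also gives the model a merge rule it can use. For example, if the model learns that ca is the most frequent character pair, it has learned that all adjacent instances of c and a in the corpus can be merged to give ca. From then on, ca can be treated as a single character for the remaining steps.

Steps c and d are repeated to find more merge rules and add more character pairs to the vocabulary. This continues until the vocabulary reaches the target size specified at the start of training.

A Python implementation of the BPE algorithm is shown below: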

class TargetVocabularySizeError(Exception):
    def __init__(self, message):
        super().__init__(message)
 
class BPE:
    '''An implementation of the Byte Pair Encoding tokenizer.'''
 
    def calculate_frequency(self, words):
        ''' Calculate the frequency for each word in a list of words.
 
            Take in a list of words stored as strings and return a list of
            tuples where each tuple contains a string from the words list,
            and an integer representing its frequency count in the list.
 
            Args:
                words (list): A list of words (strings) in any order.
 
            Returns:
                corpus (list[tuple(str, int)]): A list of tuples where the
                  first element is a string of a word in the words list, and
                  the second element is an integer representing the frequency
                  of the word in the list.
        '''
        freq_dict = dict()
 
        for word in words:
            if word not in freq_dict:
                freq_dict[word] = 1
            else:
                freq_dict[word] += 1
 
        corpus = [(word, freq_dict[word]) for word in freq_dict.keys()]
 
        return corpus
 
 
    def create_merge_rule(self, corpus):
        ''' Create a merge rule and add it to the self.merge_rules list.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        most_frequent_pair = max(pair_frequencies, key=pair_frequencies.get)
        self.merge_rules.append(most_frequent_pair.split(','))
        self.vocabulary.append(most_frequent_pair)
 
 
    def create_vocabulary(self, words):
        ''' Create a list of every unique character in a list of words.
 
            Args:
                words (list): A list of strings containing the words of the
                    input text.
 
            Returns:
                vocabulary (list): A list of every unique character in the list
                    of input words.
        '''
        vocabulary = list(set(''.join(words)))
        return vocabulary
 
    def find_pair_frequencies(self, corpus):
        ''' Find the frequency of each character pair in the corpus.
 
            Loop through the corpus and calculate the frequency of each pair
            of adjacent characters across every word. Return a dictionary of
            each character pair as the keys and the corresponding frequency as
            the values.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                pair_freq_dict (dict): A dictionary where the keys are the
                    character pairs from the input corpus and the values are an
                    integer representing the frequency of the pair in the
                    corpus.
        '''
        pair_freq_dict = dict()
 
        for word, word_freq in corpus:
            for idx in range(len(word)-1):
 
                char_pair = f'{word[idx]},{word[idx+1]}'
 
                if char_pair not in pair_freq_dict:
                    pair_freq_dict[char_pair] = word_freq
                else:
                    pair_freq_dict[char_pair] += word_freq
 
        return pair_freq_dict
 
 
    def get_merged_chars(self, char_1, char_2):
        ''' Merge the highest score pair and return to the self.merge method.
 
            This method is abstracted so that the BPE class can be used as the
            base class for other Tokenizers, and so the merging method can be
            easily overwritten. For example, in the BPE algorithm the
            characters can simply be concatenated and returned. However in the
            WordPiece algorithm, the # symbols must first be stripped.
 
            Args:
                char_1 (str): The first character in the highest-scoring pair.
                char_2 (str): The second character in the highest-scoring pair.
 
            Returns:
                merged_chars (str): Merged characters.
        '''
        merged_chars = char_1 + char_2
        return merged_chars
 
 
    def initialize_corpus(self, words):
        ''' Split each word into characters and count the word frequency.
 
            Split each word in the input word list on every character. For each
            word, store the split word in a list as the first element inside a
            tuple. Store the frequency count of the word as an integer as the
            second element of the tuple. Create a tuple for every word in this
            fashion and store the tuples in a list called 'corpus', then return
            the corpus list.

            Args:
                words (list[str]): A list of words (strings) in any order.
 
            Returns:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters of the word),
                    and the second element is an integer representing the
                    frequency of the word in the list.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [([*word], freq) for (word, freq) in corpus]
        return corpus
 
 
    def merge(self, corpus):
        ''' Loop through the corpus and perform the latest merge rule.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                new_corpus (list[tuple(list, int)]): A modified version of the
                    input argument where the most recent merge rule has been
                    applied to merge the most frequent adjacent characters.
        '''
        merge_rule = self.merge_rules[-1]
        new_corpus = []
 
        for word, word_freq in corpus:
            new_word = []
            idx = 0
 
            while idx < len(word):
                # If a merge pattern has been found (the bound on idx prevents
                # reading past the last element of the word)
                if (idx < len(word) - 1) and (word[idx] == merge_rule[0]) \
                        and (word[idx+1] == merge_rule[1]):

                    new_word.append(self.get_merged_chars(word[idx], word[idx+1]))
                    idx += 2
                # If a merge pattern has not been found
                else:
                    new_word.append(word[idx])
                    idx += 1
 
            new_corpus.append((new_word, word_freq))
 
        return new_corpus
 
 
    def train(self, words, target_vocab_size):
        ''' Train the model.
 
            Args:
                words (list[str]): A list of words to train the model on.
 
                target_vocab_size (int): The number of words in the vocabulary
                    to be used as the stopping condition when training.
 
            Returns:
                None.
        '''
        self.words = words
        self.target_vocab_size = target_vocab_size
        self.corpus = self.initialize_corpus(self.words)
        self.corpus_history = [self.corpus]
        self.vocabulary = self.create_vocabulary(self.words)
        self.vocabulary_size = len(self.vocabulary)
        self.merge_rules = []
 
        # Iteratively add vocabulary until reaching the target vocabulary size
        if len(self.vocabulary) > self.target_vocab_size:
            raise TargetVocabularySizeError(f'Error: Target vocabulary size \
            must be greater than the initial vocabulary size \
            ({len(self.vocabulary)})')
 
        else:
            while len(self.vocabulary) < self.target_vocab_size:
                try:
                    self.create_merge_rule(self.corpus)
                    self.corpus = self.merge(self.corpus)
                    self.corpus_history.append(self.corpus)
 
                # If no further merging is possible
                except ValueError:
                    print('Exiting: No further merging is possible')
                    break
 
 
    def tokenize(self, text):
        ''' Take in some text and return a list of tokens for that text.
 
            Args:
                text (str): The text to be tokenized.
 
            Returns:
                tokens (list): The list of tokens created from the input text.
        '''
        tokens = [*text]
 
        for merge_rule in self.merge_rules:
 
            new_tokens = []
            idx = 0
 
            while idx < len(tokens):
                # If a merge pattern has been found (the bound on idx prevents
                # reading past the last token)
                if (idx < len(tokens) - 1) and (tokens[idx] == merge_rule[0]) \
                        and (tokens[idx+1] == merge_rule[1]):

                    new_tokens.append(self.get_merged_chars(tokens[idx],
                                                            tokens[idx+1]))
                    idx += 2
                # If a merge pattern has not been found
                else:
                    new_tokens.append(tokens[idx])
                    idx += 1
 
            tokens = new_tokens
 
        return tokens

Step-by-step usage:

# Training set
words = ['cat', 'cat', 'cat', 'cat', 'cat',
         'cats', 'cats',
         'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat', 'eat',
         'eating', 'eating', 'eating',
         'running', 'running',
         'jumping',
         'food', 'food', 'food', 'food', 'food', 'food']

# Instantiate the tokenizer
bpe = BPE()
bpe.train(words, 21)

# Print the corpus at each stage of the process, and the merge rule used
print(f'INITIAL CORPUS:\n{bpe.corpus_history[0]}\n')
for rule, corpus in list(zip(bpe.merge_rules, bpe.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')

Output:

INITIAL CORPUS:
 [(['c', 'a', 't'], 5), (['c', 'a', 't', 's'], 2), (['e', 'a', 't'], 10),
 (['e', 'a', 't', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "a" and "t"
 [(['c', 'at'], 5), (['c', 'at', 's'], 2), (['e', 'at'], 10), 
 (['e', 'at', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "e" and "at"
 [(['c', 'at'], 5), (['c', 'at', 's'], 2), (['eat'], 10), 
 (['eat', 'i', 'n', 'g'], 3), (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "c" and "at"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'i', 'n', 'g'], 3), 
 (['r', 'u', 'n', 'n', 'i', 'n', 'g'], 2), 
 (['j', 'u', 'm', 'p', 'i', 'n', 'g'], 1), (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "i" and "n"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'in', 'g'], 3), 
 (['r', 'u', 'n', 'n', 'in', 'g'], 2), (['j', 'u', 'm', 'p', 'in', 'g'], 1), 
 (['f', 'o', 'o', 'd'], 6)]
 
 NEW MERGE RULE: Combine "in" and "g"
 [(['cat'], 5), (['cat', 's'], 2), (['eat'], 10), (['eat', 'ing'], 3), 
 (['r', 'u', 'n', 'n', 'ing'], 2), (['j', 'u', 'm', 'p', 'ing'], 1), 
 (['f', 'o', 'o', 'd'], 6)]

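The code above is only meant to illustrate the process. In practice, the transformers library can be used directly.

A BPE tokenizer can only recognize characters that appeared in its training data. When it meets a character it has not seen, that character has to be converted to an unknown token. If the tokenizer has no unknown token and no error handling for this case, a model running in production on real data may crash.

The BPE tokenizers used in GPT-2 and RoBERTa do not have this problem. Instead of analyzing the training data in terms of Unicode characters, they analyze the bytes of the characters. This is known as byte-level BPE, and it allows a small base vocabulary to tokenize every character the model might see.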
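The idea behind the byte-level base vocabulary can be illustrated with Python's built-in UTF-8 encoding. This is a simplified sketch and not how any particular library implements it: real byte-level tokenizers typically also map each byte to a printable character before applying merges. The helper names are made up for this example.

```python
def to_byte_tokens(text):
    """Represent text as a list of UTF-8 byte values (each between 0 and 255)."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_values):
    """Rebuild the original text from its byte values."""
    return bytes(byte_values).decode("utf-8")


# An ASCII word, an accented letter, and an emoji
for sample in ["cat", "é", "🙂"]:
    byte_tokens = to_byte_tokens(sample)
    print(sample, byte_tokens, from_byte_tokens(byte_tokens) == sample)
```

Because every character decomposes into values from 0 to 255, a base vocabulary of 256 entries is enough to represent any input, including characters that never appeared in the training data.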

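2. WordPiece

WordPiece is a tokenization method developed by Google for its BERT model, and it is also used in derived models such as DistilBERT and MobileBERT.

The full details of the WordPiece algorithm have not been made public, so the method presented here is based on the explanation given by Hugging Face[12]. WordPiece is similar to BPE but uses a different metric to determine the merge rules. Instead of choosing the most frequent character pair, it calculates a score for each pair, and the pair with the highest score determines which characters are merged. WordPiece is trained as follows:

a) Build the corpus

The input text is passed to the normalization and pre-tokenization models to create clean words.

b) Build the vocabulary

As with BPE, the words in the corpus are then broken down into individual characters and added to an initially empty list called the vocabulary. This time, however, instead of simply storing each individual character, two # symbols are used as a marker to indicate whether the character was found at the start of a word or in the middle or at the end of a word. For example, the word cat would be split into ['c', 'a', 't'] in BPE, but in WordPiece it looks like ['c', '##a', '##t']. A c at the start of a word and a ##c in the middle or at the end of a word are treated as different characters. The algorithm adds to this vocabulary iteratively, each time it determines which character pairs can be merged.

c) Calculate a pair score for each adjacent character pair

Unlike the BPE model, a score is calculated for each character pair. Every adjacent character pair in the corpus is identified (c##a, ##a##t, and so on) and its frequency is counted. The frequency of each individual character is determined as well. With these values known, the pair score can be calculated with the following formula:

$$\text{score} = \frac{\text{freq}(\text{pair})}{\text{freq}(\text{first character}) \times \text{freq}(\text{second character})}$$

This metric assigns higher scores to characters that appear together often but appear less often on their own or with other characters. This is the main difference between WordPiece and BPE, since BPE does not take the overall frequency of the individual characters into account.

d) Create merge rules

A high score indicates a pair of characters that are usually seen together. That is, if c##a has a high pair score, then c and ##a frequently appear together in the corpus rather than separately. As with BPE, the merge rule is determined by the best character pair, but the pair is chosen by its pair score rather than by its frequency.

Steps c and d are then repeated to find more merge rules and add more character pairs to the vocabulary. This continues until the vocabulary reaches the target size specified at the start of training.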
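To see how the pair score from step c differs from a raw frequency count, it can be worked through by hand for two pairs. The counts below are taken from the toy training set used in the BPE example earlier (5 cat, 2 cats, 10 eat, 3 eating, 2 running, 1 jumping, 6 food); the function name pair_score is made up for this sketch.

```python
def pair_score(pair_freq, first_freq, second_freq):
    """Pair frequency divided by the product of the two element frequencies."""
    return pair_freq / (first_freq * second_freq)


# ##a and ##t: each occurs 20 times (cat, cats, eat, eating), always adjacent
score_at = pair_score(pair_freq=20, first_freq=20, second_freq=20)

# ##m and ##p: each occurs once, only in "jumping", and they are adjacent
score_mp = pair_score(pair_freq=1, first_freq=1, second_freq=1)

print(f'score(##a, ##t) = {score_at}')
print(f'score(##m, ##p) = {score_mp}')
```

BPE would merge a and t first because that pair is the most frequent, whereas under this score the rare pair ##m, ##p ranks higher, since its two characters never occur apart.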

A simple code example follows:

class WordPiece(BPE):
 
    def add_hashes(self, word):
        ''' Add # symbols to every character in a word except the first.
 
            Take in a word as a string and add # symbols to every character
            except the first. Return the result as a list where each element is
            a character with # symbols in front, except the first character
            which is just the plain character.
 
            Args:
                word (str): The word to add # symbols to.
 
            Returns:
                hashed_word (list): A list of the characters with # symbols
                    (except the first character which is just the plain
                    character).
        '''
        hashed_word = [word[0]]
 
        for char in word[1:]:
            hashed_word.append(f'##{char}')
 
        return hashed_word
 
 
    def create_merge_rule(self, corpus):
        ''' Create a merge rule and add it to the self.merge_rules list.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                None
        '''
        pair_frequencies = self.find_pair_frequencies(corpus)
        char_frequencies = self.find_char_frequencies(corpus)
        pair_scores = self.find_pair_scores(pair_frequencies, char_frequencies)
 
        highest_scoring_pair = max(pair_scores, key=pair_scores.get)
        self.merge_rules.append(highest_scoring_pair.split(','))
        self.vocabulary.append(highest_scoring_pair)
 
 
    def create_vocabulary(self, words):
        ''' Create a list of every unique character in a list of words.
 
            Unlike the BPE algorithm where each character is stored normally,
            here a distinction is made by characters that begin a word
            (unmarked), and characters that are in the middle or end of a word
            (marked with a '##'). For example, the word 'cat' will be split
            into ['c', '##a', '##t'].
 
            Args:
                words (list): A list of strings containing the words of the
                    input text.
 
            Returns:
                vocabulary (list): A list of every unique character in the list
                    of input words, marked accordingly with ## to denote if the
                    character was featured in the middle/end of a word, instead
                    of as the first character of the word.
        '''
        vocabulary = set()
        for word in words:
            vocabulary.add(word[0])
            for char in word[1:]:
                vocabulary.add(f'##{char}')
 
        # Convert to list so the vocabulary can be appended to later
        vocabulary = list(vocabulary)
        return vocabulary
 
 
    def find_char_frequencies(self, corpus):
        ''' Find the frequency of each character in the corpus.
 
            Loop through the corpus and calculate the frequency of characters.
            Note that 'c' and '##c' are different characters, since the first
            represents a 'c' at the start of a word, and '##c' represents a 'c'
            in the middle/end of a word. Return a dictionary of each character
            pair as the keys and the corresponding frequency as the values.
 
            Args:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters (or subwords in
                    later iterations) of the word), and the second element is
                    an integer representing the frequency of the word in the
                    list.
 
            Returns:
                pair_freq_dict (dict): A dictionary where the keys are the
                    characters from the input corpus and the values are an
                    integer representing the frequency.
        '''
        char_frequencies = dict()
 
        for word, word_freq in corpus:
            for char in word:
                if char in char_frequencies:
                    char_frequencies[char] += word_freq
                else:
                    char_frequencies[char] = word_freq
 
        return char_frequencies
 
 
    def find_pair_scores(self, pair_frequencies, char_frequencies):
        ''' Find the pair score for each character pair in the corpus.
 
            Loops through the pair_frequencies dictionary and calculate the
            pair score for each pair of adjacent characters in the corpus.
            Store the scores in a dictionary and return it.
 
            Args:
                pair_frequencies (dict): A dictionary where the keys are the
                    adjacent character pairs in the corpus and the values are
                    the frequencies of each pair.
 
                char_frequencies (dict): A dictionary where the keys are the
                    characters in the corpus and the values are corresponding
                    frequencies.
 
            Returns:
                pair_scores (dict): A dictionary where the keys are the
                    adjacent character pairs in the input corpus and the values
                    are the corresponding pair score.
        '''
        pair_scores = dict()
 
        for pair in pair_frequencies.keys():
            char_1 = pair.split(',')[0]
            char_2 = pair.split(',')[1]
            denominator = (char_frequencies[char_1]*char_frequencies[char_2])
            score = (pair_frequencies[pair]) / denominator
            pair_scores[pair] = score
 
        return pair_scores
 
 
    def get_merged_chars(self, char_1, char_2):
        ''' Merge the highest score pair and return to the self.merge method.
 
            Remove the # symbols as necessary and merge the highest scoring
            pair then return the merged characters to the self.merge method.
 
 
            Args:
                char_1 (str): The first character in the highest-scoring pair.
                char_2 (str): The second character in the highest-scoring pair.
 
            Returns:
                merged_chars (str): Merged characters.
        '''
        if char_2.startswith('##'):
            merged_chars = char_1 + char_2[2:]
        else:
            merged_chars = char_1 + char_2
 
        return merged_chars
 
 
    def initialize_corpus(self, words):
        ''' Split each word into characters and count the word frequency.
 
            Split each word in the input word list on every character. For each
            word, store the split word in a list as the first element inside a
            tuple. Store the frequency count of the word as an integer as the
            second element of the tuple. Create a tuple for every word in this
            fashion and store the tuples in a list called 'corpus', then return
            the corpus list.

            Args:
                words (list[str]): A list of words (strings) in any order.
 
            Returns:
                corpus (list[tuple(list, int)]): A list of tuples where the
                    first element is a list of a word in the words list (where
                    the elements are the individual characters of the word),
                    and the second element is an integer representing the
                    frequency of the word in the list.
        '''
        corpus = self.calculate_frequency(words)
        corpus = [(self.add_hashes(word), freq) for (word, freq) in corpus]
        return corpus
 
    def tokenize(self, text):
        ''' Take in some text and return a list of tokens for that text.
 
            Args:
                text (str): The text to be tokenized.
 
            Returns:
                tokens (list): The list of tokens created from the input text.
        '''
        # Create cleaned vocabulary list without # and commas to check against
        clean_vocabulary = [word.replace('#', '').replace(',', '') 
                            for word in self.vocabulary]
        clean_vocabulary.sort(key=lambda word: len(word))
        clean_vocabulary = clean_vocabulary[::-1]
 
        # Break down the text into the largest tokens first, then smallest
        remaining_string = text
        tokens = []
        keep_checking = True
 
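        while keep_checking:
            keep_checking = False
            for vocab in clean_vocabulary:
                if remaining_string.startswith(vocab):
                    tokens.append(vocab)
                    remaining_string = remaining_string[len(vocab):]
                    keep_checking = True
                    # Restart from the longest vocabulary entries so the
                    # largest possible token is always tried first
                    break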
 
        if len(remaining_string) > 0:
            tokens.append(remaining_string)
 
        return tokens

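The tokens learned by WordPiece are very different from those learned by BPE. The output below shows clearly that WordPiece favors pairs whose characters appear together more often than they appear on their own. The characters m and p are therefore merged immediately, because in this dataset they only ever occur together and never separately.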

wp = WordPiece()
wp.train(words, 30)

print(f'INITIAL CORPUS:\n{wp.corpus_history[0]}\n')
for rule, corpus in list(zip(wp.merge_rules, wp.corpus_history[1:])):
    print(f'NEW MERGE RULE: Combine "{rule[0]}" and "{rule[1]}"')
    print(corpus, end='\n\n')

Output:

INITIAL CORPUS:
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['j', '##u', '##m', '##p', '##i', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "##m" and "##p"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['j', '##u', '##mp', '##i', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "r" and "##u"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['j', '##u', '##mp', '##i', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "j" and "##u"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['ju', '##mp', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "ju" and "##mp"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), 
 (['jump', '##i', '##n', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "jump" and "##i"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##i', '##n', '##g'], 3), 
 (['ru', '##n', '##n', '##i', '##n', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "##i" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['ru', '##n', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "ru" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['run', '##n', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "run" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['runn', '##in', '##g'], 2), (['jumpi', '##n', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "jumpi" and "##n"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['runn', '##in', '##g'], 2), (['jumpin', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "runn" and "##in"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##in', '##g'], 3), 
 (['runnin', '##g'], 2), (['jumpin', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "##in" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['runnin', '##g'], 2), (['jumpin', '##g'], 1), 
 (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "runnin" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['running'], 2), (['jumpin', '##g'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "jumpin" and "##g"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['running'], 2), (['jumping'], 1), (['f', '##o', '##o', '##d'], 6)]
 
 NEW MERGE RULE: Combine "f" and "##o"
 [(['c', '##a', '##t'], 5), (['c', '##a', '##t', '##s'], 2), 
 (['e', '##a', '##t'], 10), (['e', '##a', '##t', '##ing'], 3), 
 (['running'], 2), (['jumping'], 1), (['fo', '##o', '##d'], 6)]
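
The first merge in the output above can be checked by hand. The following is a minimal sketch, assuming the pair score commonly described for WordPiece: the frequency of the pair divided by the product of the frequencies of its two tokens. The helper `pair_scores` is hypothetical and is not a method of the class used above, whose own scoring may differ in detail.

```python
from collections import Counter

# Initial corpus copied from the output above: (tokens, word frequency)
corpus = [
    (['c', '##a', '##t'], 5),
    (['c', '##a', '##t', '##s'], 2),
    (['e', '##a', '##t'], 10),
    (['e', '##a', '##t', '##i', '##n', '##g'], 3),
    (['r', '##u', '##n', '##n', '##i', '##n', '##g'], 2),
    (['j', '##u', '##m', '##p', '##i', '##n', '##g'], 1),
    (['f', '##o', '##o', '##d'], 6),
]


def pair_scores(corpus):
    """Score each adjacent pair as freq(pair) / (freq(left) * freq(right))."""
    token_freq = Counter()
    pair_freq = Counter()
    for tokens, freq in corpus:
        for token in tokens:
            token_freq[token] += freq
        for left, right in zip(tokens, tokens[1:]):
            pair_freq[(left, right)] += freq
    return {
        (left, right): count / (token_freq[left] * token_freq[right])
        for (left, right), count in pair_freq.items()
    }


scores = pair_scores(corpus)
best_pair = max(scores, key=scores.get)
print(best_pair, scores[best_pair])
# ('##m', '##p') 1.0
```

Under this score, "##m" and "##p" each occur once and always together, so the pair scores 1.0. A frequent pair such as "c" and "##a" scores only 7 / (7 * 20) = 0.05, because "##a" also appears in many other words.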

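Despite the limited training data, the model still managed to learn some useful tokens. Take the word jumper as a test. First, the string is split into ['jump', 'er'], because jump is the largest token from the training set that can be found at the start of the word. Next, the string er is broken into individual characters, because the model has not learned to combine the characters e and r.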

print(wp.tokenize('jumper'))
# ['jump', 'e', 'r']

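3. Unigram

The Unigram tokenizer takes a different approach from BPE and WordPiece: it starts with a large vocabulary and then iteratively shrinks it until the desired size is reached.

The Unigram model uses a statistical approach that considers the probability of each word or character in a sentence. Each element of the resulting lists can be regarded as a token t, and the probability of a series of tokens t1, t2, ..., tn occurring is given by:

$$P(t_1, t_2, \ldots, t_n) = P(t_1) \times P(t_2) \times \cdots \times P(t_n)$$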

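a) Building the corpus

As always, the input text is passed through the normalization and pre-tokenization steps to produce clean words.

b) Building the vocabulary

The vocabulary of a Unigram model starts out very large and is then iteratively reduced until it reaches the desired size. To construct the initial vocabulary, find every possible substring in the corpus. For example, if the first word in the corpus is cats, the substrings ['c', 'a', 't', 's', 'ca', 'at', 'ts', 'cat', 'ats'] are added to the vocabulary.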
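
The substring enumeration can be sketched in a few lines. This is a minimal illustration with hypothetical helper names (`proper_substrings`, `initial_vocabulary`). It follows the list in the text, which leaves out the whole word itself. Whether a real implementation also adds the whole word is not stated here, so treat that as an assumption.

```python
def proper_substrings(word):
    """Return every contiguous substring of word shorter than the word itself."""
    n = len(word)
    return [word[start:start + length]
            for length in range(1, n)
            for start in range(n - length + 1)]


def initial_vocabulary(words):
    """Collect the substrings of all words, dropping duplicates but keeping order."""
    vocabulary = {}
    for word in words:
        for piece in proper_substrings(word):
            vocabulary.setdefault(piece, None)
    return list(vocabulary)


print(proper_substrings('cats'))
# ['c', 'a', 't', 's', 'ca', 'at', 'ts', 'cat', 'ats']
```

The number of substrings grows quadratically with word length, which is why the initial vocabulary is so large and has to be pruned.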

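c) Calculating the probability of each token

The probability of a token can be approximated by counting how many times it occurs in the corpus and dividing by the total number of token occurrences.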
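
A minimal sketch of this estimate follows. It assumes a hypothetical toy corpus of (word, frequency) pairs and counts every substring occurrence, weighted by the frequency of the word it appears in. The function name `token_probabilities` is illustrative only.

```python
from collections import Counter


def token_probabilities(corpus):
    """Estimate P(token) as its occurrence count over all token occurrences."""
    counts = Counter()
    for word, freq in corpus:
        for start in range(len(word)):
            for end in range(start + 1, len(word) + 1):
                counts[word[start:end]] += freq
    total = sum(counts.values())
    return {token: count / total for token, count in counts.items()}


# Hypothetical toy corpus: 'cat' seen 5 times, 'cats' seen twice
corpus = [('cat', 5), ('cats', 2)]
probs = token_probabilities(corpus)

for token in ('c', 'cat', 'cats'):
    print(token, round(probs[token], 2))
# c 0.14
# cat 0.14
# cats 0.04
```

The token c occurs 7 times out of 50 substring occurrences in total, giving 0.14. The probabilities sum to 1 by construction.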

d) Finding all possible segmentations of a word

Suppose one of the words in the training corpus is cat. It can be segmented in the following ways:

['c', 'a', 't']

['ca', 't']

['c', 'at']

['cat']

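e) Calculating the approximate probability of each segmentation occurring in the corpus

Combining this with the equation above gives the probability of each series of tokens.

Because the segmentation ['ca', 't'] has the highest probability score, it is the one used to tokenize the word: cat is tokenized as ['ca', 't']. As you can imagine, for a longer word such as tokenization, splits can occur at several places in the word, for example ['token', 'iza', 'tion'] or ['token', 'ization'].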
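
Steps d) and e) can be sketched together. The token probabilities below are hypothetical values chosen only so that the result matches the text, since the article's actual numbers are not shown. The helper names `segmentations` and `best_segmentation` are likewise illustrative.

```python
import math


def segmentations(word):
    """Yield every split of word into contiguous, non-empty pieces."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest


def best_segmentation(word, probs):
    """Return (score, pieces) for the segmentation with the highest product."""
    scored = [
        (math.prod(probs[piece] for piece in pieces), pieces)
        for pieces in segmentations(word)
        if all(piece in probs for piece in pieces)
    ]
    return max(scored, key=lambda item: item[0])


# Hypothetical token probabilities, for illustration only
probs = {'c': 0.1, 'a': 0.1, 't': 0.2, 'ca': 0.3, 'at': 0.1, 'cat': 0.05}

print(list(segmentations('cat')))
# [['c', 'a', 't'], ['c', 'at'], ['ca', 't'], ['cat']]

score, pieces = best_segmentation('cat', probs)
print(pieces)
# ['ca', 't']
```

A word of n characters has 2^(n-1) segmentations, so this brute-force enumeration is only practical for short words. Production implementations typically use dynamic programming (the Viterbi algorithm) to find the best split without listing them all.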

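f) Calculating the loss

The loss here is a score for the model: if an important token is removed from the vocabulary, the loss increases sharply, but if a less important token is removed, the loss barely changes. By calculating what the loss would be with each token removed, the least useful token in the vocabulary can be found. This is repeated iteratively until the vocabulary has been reduced to only the most useful tokens from the training corpus.

The loss is calculated as follows: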
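
The formula itself is not reproduced in this copy of the article. One formulation commonly used for Unigram training, given here as an assumption rather than as the author's exact expression, is the negative log-likelihood of the corpus, where the probability of each word is summed over all of its possible segmentations:

$$\mathcal{L} = -\sum_{i=1}^{N} \log\left(\sum_{x \in S(w_i)} P(x)\right)$$

Here N is the number of words in the corpus, w_i is the i-th word, S(w_i) is the set of all segmentations of w_i, and P(x) is the product of the token probabilities in segmentation x.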

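Once enough tokens have been removed to bring the vocabulary down to the desired size, training is complete and the model can be used to tokenize words.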
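
The pruning step can be sketched as follows. This is a simplified illustration under several assumptions: the loss is the negative log-likelihood with each word summed over all of its segmentations, probabilities are not re-estimated after a token is dropped, single characters are never removed, and the corpus is a hypothetical toy example. Real implementations are considerably more elaborate.

```python
import math
from collections import Counter


def segmentations(word):
    """Yield every split of word into contiguous, non-empty pieces."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        for rest in segmentations(word[i:]):
            yield [word[:i]] + rest


def corpus_loss(corpus, probs):
    """Negative log-likelihood, summing each word over its valid segmentations."""
    loss = 0.0
    for word, freq in corpus:
        word_prob = sum(
            math.prod(probs[piece] for piece in pieces)
            for pieces in segmentations(word)
            if all(piece in probs for piece in pieces)
        )
        loss -= freq * math.log(word_prob)
    return loss


def removal_costs(corpus, probs):
    """Loss increase caused by dropping each multi-character token."""
    base = corpus_loss(corpus, probs)
    costs = {}
    for token in probs:
        if len(token) == 1:
            continue  # keep single characters so every word stays tokenizable
        reduced = {t: p for t, p in probs.items() if t != token}
        costs[token] = corpus_loss(corpus, reduced) - base
    return costs


# Hypothetical toy corpus of (word, frequency) pairs
corpus = [('cat', 5), ('cats', 2), ('eat', 10)]

# Token probabilities estimated from substring counts, as in step c)
counts = Counter()
for word, freq in corpus:
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            counts[word[start:end]] += freq
total = sum(counts.values())
probs = {token: count / total for token, count in counts.items()}

costs = removal_costs(corpus, probs)
least_useful = min(costs, key=costs.get)
print(f'Base loss: {corpus_loss(corpus, probs):.3f}')
print(f'Least useful token: {least_useful!r}')
```

Removing a token can only eliminate segmentations, so every cost is non-negative. The token with the smallest cost is the natural candidate to prune first.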

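Comparing BPE, WordPiece and Unigram

Depending on the training set and the data to be tokenized, some tokenizers may perform better than others. When choosing a tokenizer for a language model, it is best to experiment with the training set for the specific use case and see which one gives the best results.

Of the three methods, BPE appears to be the most popular choice in current language-model tokenizers, although in such a fast-moving field this may well change in the future. Other subword tokenizers, such as SentencePiece, have also been gaining popularity in recent years [13].

Compared with BPE and Unigram, WordPiece seems to produce more single-word tokens. Regardless of the model chosen, however, all tokenizers seem to produce fewer tokens as the vocabulary size increases [14].

The choice of tokenizer depends on the datasets you intend to use with the model. A reasonable starting point is to experiment with BPE or SentencePiece.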

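Post-processing

The final step of tokenization is post-processing, where any last modifications can be made to the output if necessary. BERT uses this step to add two extra types of token:

[CLS] - This token stands for "classification" and marks the start of the input text. BERT requires it because one of the tasks it was trained on is classification (hence the name). The model still expects the token even when it is not being used for a classification task.

[SEP] - This token stands for "separation" and is used to separate sentences in the input. This is useful for many of the tasks BERT performs, including handling multiple instructions at once in the same prompt [15].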
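
The effect of this step can be illustrated with a small sketch. The function `add_special_tokens` is a hypothetical helper written for this example, not part of any library. It follows BERT's usual layout of one [CLS] at the start and one [SEP] after each sentence.

```python
def add_special_tokens(first, second=None):
    """Wrap one or two token lists in BERT-style [CLS] and [SEP] markers."""
    tokens = ['[CLS]'] + first + ['[SEP]']
    if second is not None:
        tokens += second + ['[SEP]']
    return tokens


print(add_special_tokens(['cats', 'are', 'great']))
# ['[CLS]', 'cats', 'are', 'great', '[SEP]']

print(add_special_tokens(['cats', 'are', 'great'], ['dogs', 'are', 'better']))
# ['[CLS]', 'cats', 'are', 'great', '[SEP]', 'dogs', 'are', 'better', '[SEP]']
```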

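The tokenizers library

The tokenizers library makes it very easy to use a pretrained tokenizer. Simply import the Tokenizer class, call the from_pretrained method, and pass in the name of the model whose tokenizer you want to use. See [16] for a list of models.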

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('bert-base-cased')

The following implementations can be used directly:

BertWordPieceTokenizer - The famous Bert tokenizer, using WordPiece
CharBPETokenizer - The original BPE
ByteLevelBPETokenizer - The byte level version of the BPE
SentencePieceBPETokenizer - A BPE implementation compatible with the one used by SentencePiece

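You can also use the train method for custom training. Once training is complete, use the save method to store the trained tokenizer so that training does not have to be run again.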

# Import a tokenizer
from tokenizers import BertWordPieceTokenizer, CharBPETokenizer, \
                       ByteLevelBPETokenizer, SentencePieceBPETokenizer

# Instantiate the model
tokenizer = CharBPETokenizer()

# Train the model
tokenizer.train(['./path/to/files/1.txt', './path/to/files/2.txt'])

# Tokenize some text
encoded = tokenizer.encode('I can feel the magic, can you?')

# Save the model
tokenizer.save('./path/to/directory/my-bpe.tokenizer.json')

Below is the complete code for a custom training workflow:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, \
                       processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

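Summary

The tokenization pipeline is a key part of a language model, and the choice of tokenizer deserves careful thought. Although Hugging Face handles this part of the work for us, a solid understanding of tokenization methods is very important for fine-tuning models and for the performance obtained on different datasets.

Editor in charge: 華軒. Source: DeepHub IMBA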
