How Far Behind ChatGPT Are the Alpaca-Family Large Models? After a Detailed Evaluation, I Fell Silent

Author: 機器之心 (Jiqizhixin), 2023-05-15 09:39:37

Overall, the test concludes that MPT is not yet ready for real-world use, while Vicuna is a viable alternative to ChatGPT (3.5) for many tasks.

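A while ago, a leaked Google document drew wide attention. In it, a researcher inside Google made one central claim: Google has no moat, and neither does OpenAI.

The researcher argued that although OpenAI and Google appear to be racing each other on large AI models, the real winner may not come from either of them, because a third force is quietly rising.

That force is open source. Around open models such as Meta's LLaMA, the community is rapidly building models whose capabilities are similar to those of OpenAI's and Google's large models. Open-source models also iterate faster, are more customizable, and are more private. "People will not pay for a restricted model when free, unrestricted alternatives are comparable in quality," the author wrote.

These views sparked considerable debate on social media. One of the bigger points of contention: can open-source models really reach a level similar to commercial closed-source models such as OpenAI's ChatGPT or Google's Bard? How large is the gap between the two camps at this stage?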

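To explore this question, a Medium blogger named Marco Tulio Ribeiro tested several models (Vicuna-13B and MPT-7b-Chat vs. ChatGPT 3.5) on a set of complex tasks.

Vicuna-13B is an open-source model proposed by researchers from UC Berkeley, Carnegie Mellon University, Stanford University, and UC San Diego. It is built on the 13B-parameter version of LLaMA and performed impressively in an evaluation scored by GPT-4 (see "Replicating 90% of ChatGPT's capability for $300, with GPT-4 itself as examiner: the 13-billion-parameter open-source model 'Vicuna' arrives").

MPT-7B is a large language model released by MosaicML that follows the training recipe of Meta's LLaMA model. MosaicML says that MPT-7B performs on par with Meta's 7-billion-parameter LLaMA model.

The model they are compared against is, naturally, ChatGPT, the benchmark for large language models.

Marco Tulio Ribeiro is a researcher currently working in the Adaptive Systems and Interaction group at Microsoft Research. He is also an affiliate assistant professor at the University of Washington. This work was done jointly with Scott Lundberg, another Microsoft researcher. In the tests, they used Microsoft's guidance library to help design prompts.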

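Warm-up: solving equations

The first task is solving simple polynomial equations. These problems have definite answers, so correctness is easy to assess.

For the three models, the testers posed the problem of finding the roots of the quadratic equation "x^2+3x=0". They used the following prompt:

The three models performed as follows.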

ChatGPT:

equation = 'x^2 + 3.0x = 0'
roots = [0, -3]
answer_gpt = find_roots(llm=chatgpt, equation=equation)

Vicuna:

answer_vicuna = find_roots(llm=vicuna, equation=equation)

MPT:

answer_mpt = find_roots(llm=mpt, equation=equation)

Clearly, the correct answer is [-3, 0], and only ChatGPT got it right (Vicuna did not even answer in the specified format).
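Scoring a task like this means pulling the roots out of free-form model text and comparing them regardless of order, so that [0, -3] and [-3, 0] both count as correct. The sketch below is one way to do that; the helper names `parse_roots` and `is_correct` are hypothetical, and it assumes the model was asked to answer with a bracketed list of numbers, which the article does not confirm.

```python
import ast
import re


def parse_roots(model_output: str):
    """Return the first bracketed list of numbers in the text, or None.

    Assumption: answers look like "[0, -3]"; the real prompt format is not shown.
    """
    match = re.search(r"\[[^\[\]]*\]", model_output)
    if match is None:
        return None
    try:
        values = ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return None  # brackets were present but did not contain plain literals
    if not all(isinstance(v, (int, float)) for v in values):
        return None
    return [float(v) for v in values]


def is_correct(model_output: str, expected_roots) -> bool:
    """Order-insensitive comparison of parsed roots against the expected roots."""
    roots = parse_roots(model_output)
    if roots is None:
        return False  # a reply that ignores the format is scored as wrong
    return sorted(roots) == sorted(float(r) for r in expected_roots)
```

An answer that ignores the requested format, as Vicuna's did here, is simply scored as wrong by this kind of checker.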

In the notebook that accompanies the article, the testers wrote a function that generates random quadratic equations with integer roots between -20 and 20, and ran the prompt 20 times for each model. The accuracy of the three models was as follows:
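A generator of this kind can be sketched by picking two integer roots and expanding (x - r1)(x - r2). This is only an illustration under that assumption, not the notebook's actual function; the names `monic_coefficients`, `format_equation`, and `random_quadratic` are hypothetical.

```python
import random


def monic_coefficients(r1: int, r2: int):
    """Expand (x - r1)(x - r2) = x^2 + b*x + c and return (b, c)."""
    return -(r1 + r2), r1 * r2


def format_equation(b: int, c: int) -> str:
    """Render x^2 + b*x + c = 0 with explicit signs, e.g. 'x^2 + 3x + 0 = 0'."""
    sign_b = "+" if b >= 0 else "-"
    sign_c = "+" if c >= 0 else "-"
    return f"x^2 {sign_b} {abs(b)}x {sign_c} {abs(c)} = 0"


def random_quadratic(rng: random.Random, low: int = -20, high: int = 20):
    """Pick two integer roots in [low, high]; return (equation text, sorted roots)."""
    r1, r2 = rng.randint(low, high), rng.randint(low, high)
    return format_equation(*monic_coefficients(r1, r2)), sorted([r1, r2])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the 20 problems are reproducible
    for _ in range(20):
        print(random_quadratic(rng))
```

Building the equation from its roots guarantees integer solutions, so every generated problem has a known answer to score against.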

╔═══════════╦══════════╗
║   Model   ║ Accuracy ║
╠═══════════╬══════════╣
║ ChatGPT   ║   80%    ║
║ Vicuna    ║    0%    ║
║ MPT       ║    0%    ║
╚═══════════╩══════════╝

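On the quadratic-equation test, ChatGPT got some problems wrong, but Vicuna and MPT did not get a single one right. They often made mistakes in the intermediate steps (MPT frequently did not write out intermediate steps at all). Below is an example of a ChatGPT error: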

ChatGPT miscalculated in the final step: (13 +- 25)/2 should yield [19, -6], not [19.5, -6.5].
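Written out, the final step of the quadratic formula in that example evaluates as follows (only the last step is quoted in the article, so nothing earlier in the derivation is reconstructed here):

```latex
x = \frac{13 \pm 25}{2}
\quad\Longrightarrow\quad
x_1 = \frac{13 + 25}{2} = \frac{38}{2} = 19,
\qquad
x_2 = \frac{13 - 25}{2} = \frac{-12}{2} = -6 .
```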

Because Vicuna and MPT could not solve quadratic equations at all, the testers gave them simpler problems, such as x-10=0. For these simple equations, they obtained the following results:

╔═══════════╦══════════╗
║   Model   ║ Accuracy ║
╠═══════════╬══════════╣
║ ChatGPT   ║   100%   ║
║ Vicuna    ║    85%   ║
║ MPT       ║    30%   ║
╚═══════════╩══════════╝

Below is an example of an MPT error:

Conclusion

In this very simple test, using the same problems and the same prompt, the testers conclude that ChatGPT far outperforms Vicuna and MPT in accuracy.

Task: extracting snippets + answering questions about a meeting

This task is more realistic. For meeting-related question answering, security and privacy concerns may also make people prefer an open-source model over sending private data to OpenAI.

Below is a meeting transcript (translation from DeepL, for reference only):

The first test question was "What does Steven think about the acquisition?", with the following prompt:

import guidance  # the prompts in this article are guidance templates

qa_attempt1 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Please answer the following question:
Question: {{query}}
Extract from the transcript the most relevant segments for the answer, and then answer the question.
{{/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

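ChatGPT gave the following answer:

The answer is reasonable, but ChatGPT did not extract any conversation segments to support it (so it does not meet the testers' specification). The testers iterated through 5 different prompts in the notebook; here are some examples: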

qa_attempt3 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript.
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

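With this new prompt, ChatGPT did extract relevant segments, but it did not follow the output format the testers specified (it did not summarize each segment, nor did it give the speakers' names).

After the testers built a more elaborate prompt, however, ChatGPT finally followed the instructions: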

qa_attempt5 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: What were the main things that happened in the meeting?
Here is a meeting transcript:
----
Peter: Hey
John: Hey
Peter: John, how is the weather today?
John: It's raining.
Peter: That's too bad. I was hoping to go for a walk later.
John: Yeah, it's a shame.
Peter: John, you are a bad person.
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.
{{/user}}
{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{~/assistant~}}
{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.
{{~/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

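The testers explain that they had to iterate on the prompt several times because the OpenAI API does not allow partial output completion (that is, they cannot specify how the AI assistant begins its answer), which makes it hard to steer the output.

With an open-source model, by contrast, they can guide the output much more explicitly and force the model to use the structure they specify.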
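The idea of forcing a structure can be sketched as a loop in which the program writes the fixed parts of the answer itself and asks the model only for the content of each slot. This is a simplified illustration, not how guidance is implemented; `run_guided`, `TEMPLATE`, and the toy `fake_model` are hypothetical names, and `generate` stands in for whatever local completion call is available.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Each entry is (fixed text written by the program, slot name or None).
TEMPLATE: List[Tuple[str, Optional[str]]] = [
    ("CONVERSATION SEGMENTS:\n", None),
    ("Segment 1: ", "segment1"),
    ("Segment 2: ", "segment2"),
    ("Segment 3: ", "segment3"),
    ("ANSWER: ", "answer"),
]


def run_guided(prompt: str,
               template: List[Tuple[str, Optional[str]]],
               generate: Callable[[str], str]) -> Dict[str, str]:
    """Alternate between appending fixed structure and filling one slot.

    `generate(text)` is assumed to return the model's continuation of `text`
    up to the end of the current slot.
    """
    text = prompt
    slots: Dict[str, str] = {}
    for fixed, slot in template:
        text += fixed  # structure is written by the program, never by the model
        if slot is not None:
            value = generate(text).strip()
            slots[slot] = value
            text += value + "\n"
    return slots


def fake_model(text: str) -> str:
    """Toy stand-in for a model: reports which line it was asked to continue."""
    last_line = text.rsplit("\n", 1)[-1]
    return f"(filled after {last_line.strip()!r})"


if __name__ == "__main__":
    print(run_guided("Question: who wants to sell?\n", TEMPLATE, fake_model))
```

Because the labels are appended by the program, the output can never deviate from the format, and each slot comes back already separated, with no parsing step.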

The new round of tests uses the following prompt:

qa_guided = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract the three segment from the transcript that are the most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns. If you need less than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript:
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: {{gen 'segment1'}}
Segment 2: {{gen 'segment2'}}
Segment 3: {{gen 'segment3'}}
ANSWER: {{gen 'answer'}}
{{~/assistant~}}''')

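Running the prompt above with Vicuna, they get the correct format on the first try, and the format stays correct every time:

The same prompt can of course be run on MPT as well:

MPT followed the format, but it did not answer the question from the given meeting material. Instead, it extracted segments from the format example, which is clearly unacceptable.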

Next comes a comparison of ChatGPT and Vicuna.

The question was "Who wants to sell the company?" Both models appear to answer it well.

Here is ChatGPT's answer:

Here is Vicuna's answer:

Next, the testers switched to different material: a conversation between Elon Musk and a journalist:

The question was: "Did Elon Musk insult the journalist?"

ChatGPT's answer:

Vicuna's answer:

Vicuna produced the correct format and even extracted the right segments. Surprisingly, it still gave the wrong final answer: "Elon musk does not accuse him of lying or insult him in any way".

The testers ran further question-answering tests and concluded that Vicuna is comparable to ChatGPT on most questions but gets answers wrong more often than ChatGPT does.

Completing tasks with bash

The testers tried having the LLMs iteratively use a bash shell to solve problems. Each time the model issues a command, the testers run it and insert the output into the prompt, repeating the process until the task is complete.
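The loop described above can be sketched in plain Python as follows. This is not the testers' guidance program: `solve_with_shell` and `run_shell` are hypothetical helpers, `next_message` stands in for one call to the model, and the `COMMAND:` / `DONE` convention is borrowed from the prompts shown in this article. Running model-generated commands is risky, so a sketch like this belongs in a sandbox or throwaway directory.

```python
import subprocess
from typing import Callable, List, Tuple


def run_shell(command: str, timeout: float = 10.0) -> str:
    """Run one command through the shell and return combined stdout and stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return (result.stdout + result.stderr).strip()


def solve_with_shell(task: str,
                     next_message: Callable[[str], str],
                     max_steps: int = 10) -> List[Tuple[str, str]]:
    """Ask the model for a command, run it, feed the output back, and repeat."""
    transcript = f"Task: {task}\n"
    history: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        message = next_message(transcript).strip()
        if not message.startswith("COMMAND:"):
            break  # DONE, or a reply that ignores the agreed format
        command = message[len("COMMAND:"):].strip()
        if command == "DONE":
            break
        output = run_shell(command)
        history.append((command, output))
        # The command and its output become part of the next prompt.
        transcript += f"COMMAND: {command}\nOutput: {output}\n"
    return history


if __name__ == "__main__":
    scripted = iter(["COMMAND: echo hello", "DONE"])  # stand-in for a real model
    print(solve_with_shell("say hello", lambda _prompt: next(scripted)))
```

Note that the loop stops as soon as a reply does not start with `COMMAND:`, which is why a model that drifts from the format halts the program without finishing the task.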

The prompt for ChatGPT is as follows:

terminal = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
{{~/assistant~}}

{{#user~}}
Output: guidance project
{{/user}}

{{#assistant~}}
The files or folders in the current directory are:
- guidance
- project
DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#geneach 'commands' stop=False}}
{{#assistant~}}
{{gen 'this.command'}}
{{~/assistant~}}

{{~#user~}}
Output: {{shell this.command}}
{{~/user~}}
{{/geneach}}''')

The testers created a dummy repository in ~/work/project that contains a file named license.txt, which is not the standard LICENSE filename.

Without telling ChatGPT about this, they then checked whether it could complete the task: "Find out what license the open source project located in ~/work/project is using".

ChatGPT followed a very natural sequence of commands and solved the problem.

For the open-source models, the testers wrote a simpler (guided) prompt in which commands and their outputs form a single sequence:

guided_terminal = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use the COMMAND: DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
OUTPUT: guidance project
COMMAND: DONE 
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use the COMMAND: DONE.
{{~/user}}

{{#assistant~}}
{{#geneach 'commands' stop=False ~}}
COMMAND: {{gen 'this.command' stop='\\n'}}
OUTPUT: {{shell this.command}}{{~/geneach}}
{{~/assistant~}}''')

Let's look at how Vicuna and MPT handle this task.

Vicuna:

MPT:

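In an interesting twist, Vicuna could not solve the task, but MPT succeeded. Besides privacy, open-source models have a notable advantage here: the entire prompt is passed to the LLM as a single input (the testers even speed things up by not having the model generate output-structure tokens such as COMMAND).

With ChatGPT, by contrast, they have to call the model again for every command, which is slower and more expensive.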

Next, they tried a different instruction: "Find all jupyter notebook files in the ~/work/guidance directory that are currently untracked by git".

Here is ChatGPT's answer:

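The testers ran into a problem again: ChatGPT did not follow the output structure they had specified, which makes it unusable inside a program without human intervention. The program only executes commands, so it stopped after the last ChatGPT message above.

The testers suspected that the empty output caused ChatGPT to stop, so they fixed this particular problem by changing the message when there is no output. They could not, however, solve the general problem of being unable to force ChatGPT to follow a specified output structure.

After this small change, ChatGPT was able to solve the problem. Let's see how Vicuna does:

Vicuna followed the output structure but, unfortunately, ran the wrong commands for the task. MPT called git status over and over, so it failed as well.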

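The testers also ran these programs on a variety of other instructions. They found that ChatGPT almost always produces the correct sequence of commands but sometimes does not follow the specified format (and therefore needs human intervention). The open-source models did not do well here: more prompt engineering might improve them, but they failed on most of the harder instructions.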

Takeaways

The testers also tried several other tasks, including text summarization, question answering, creative generation, and toy string manipulation, and assessed the accuracy of each model. The main findings are:

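Task quality: ChatGPT (3.5) is stronger than Vicuna on every task, while MPT performs poorly on almost all of them, to the point that the testers suspected they were using it incorrectly. Notably, Vicuna's performance is usually close to ChatGPT's.
Ease of use: ChatGPT has trouble following a specified output format, which makes it hard to use inside programs and forces users to write regex parsers for its output. Being able to specify the output structure is a significant advantage of open-source models, so much so that Vicuna is sometimes easier to use than ChatGPT even though its task performance is somewhat worse.
Efficiency: with a locally deployed model, a task can be solved in a single LLM run (guidance keeps the LLM state while the program executes), which is faster and cheaper. This matters most when a substep calls another API or function (for example search or a terminal), because with ChatGPT each such step requires a new call to the OpenAI API. guidance also speeds up generation by not having the model generate the output-structure tokens, which sometimes makes a big difference.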
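For a sense of what the regex parser mentioned under ease of use involves, here is a minimal sketch for the "CONVERSATION SEGMENTS / Segment N / ANSWER" format used in the prompts above. The function name `parse_answer` and the decision to drop segments marked "Blank" are assumptions made for illustration; the testers' own parsing code is not shown in the article.

```python
import re
from typing import Dict, List

# A segment runs from its "Segment N:" label to the next label, "ANSWER:", or the end.
SEGMENT_RE = re.compile(
    r"^Segment (\d+):[ \t]*(.*?)(?=^Segment \d+:|^ANSWER:|\Z)",
    re.MULTILINE | re.DOTALL,
)
ANSWER_RE = re.compile(r"^ANSWER:[ \t]*(.*)\Z", re.MULTILINE | re.DOTALL)


def parse_answer(output: str) -> Dict[str, object]:
    """Split model output into non-blank segments and the final answer.

    Raises ValueError when the output has no ANSWER: line, which is the case
    a caller has to handle whenever the model ignores the requested format.
    """
    segments: List[str] = []
    for _number, body in SEGMENT_RE.findall(output):
        body = body.strip()
        if body and body.lower() != "blank":
            segments.append(body)
    answer = ANSWER_RE.search(output)
    if answer is None:
        raise ValueError("output does not contain an ANSWER: line")
    return {"segments": segments, "answer": answer.group(1).strip()}
```

A parser like this only works when the model keeps to the format, which is exactly the property the testers could not guarantee with ChatGPT and could enforce with the open-source models.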

Overall, the test concludes that MPT is not yet ready for real-world use, while Vicuna is a viable alternative to ChatGPT (3.5) for many tasks. For now, these findings apply only to the tasks and inputs (or prompt types) tried in this test, which is an initial exploration rather than a formal evaluation.

More results are available in the notebook: https://github.com/microsoft/guidance/blob/main/notebooks/chatgpt_vs_open_source_on_harder_tasks.ipynb

Editor in charge: 張燕妮. Source: 機器之心