ML工程師一次微調七個模型，擊敗OpenAI GPT-4

作者：新智元 2024-07-02 13:25:22

「微調你的模型，獲得比GPT-4更好的性能」不只是說說而已，而是真的可操作。最近，一位愿意動手的ML工程師就把幾個開源LLM調教成了自己想要的樣子。

模型微調（Fine Tuning）指的是在已經訓練好的大語言模型（LLM）的基礎上，使用特定的數據集進行進一步訓練，這種「站在巨人肩膀上」的做法，可以用比較小的數據集，比較低的訓練成本獲得更好的收益，避免了重復造輪子。

在大模型時代，提示詞工程（Prompt Engineering）、模型微調和檢索增強生成（RAG）都是非常重要的能力，對于大多數個人使用者來說，掌握提示詞工程就夠用了，但如果想要在自己的服務中接入大模型，模型微調是必由之路，也因為其對于技能的更高要求，成為了ML工程師高手過招之地。

今日，一位ML工程師的微調模型就登上了HN熱榜，「我的微調模型擊敗了OpenAI的GPT-4」！

工程師Alex Strick van Linschoten在博客中指出，就他的測試數據而言，Mistral、Llama3和Solar LLM的微調模型都要比OpenAI的模型更準確。

文章地址：https://mlops.systems/posts/2024-07-01-full-finetuned-model-evaluation.html#start-date-accuracy

對此，作者戲稱道，這篇文章的標題可以直接改為：微調模型擊敗OpenAI，但評估過程實在痛苦。

他坦言，代碼量很多，運行速度也較慢。在這次微調中，他第一次為微調選擇而糾結權衡。如果沒有某種系統來處理這個問題，維護這一切的復雜性就會開始增加。

現在，我們來看一下這位工程師是如何做的。（詳細代碼請參見原文）

加載數據集

所有數據都存儲在Hugging Face Hub的一個公共倉庫中。

為了進行這些評估，作者選擇使用數據集的測試部分，因為模型之前沒有接觸過這些數據，這樣可以更好地評估模型在新數據上的表現。

test_dataset

Dataset({
    features: ['name', 'eventrefnumber', 'text', 'StartDate', 'eventtype', 'province', 'citydistrict', 'village', 'targetgroup', 'commander', 'position', 'minkilled', 'mincaptured', 'capturedcharacterisation', 'killedcharacterisation', 'killq', 'captureq', 'killcaptureraid', 'airstrike', 'noshotsfired', 'dataprocessed', 'flagged', 'glossarymeta', 'minleaderskilled', 'minfacilitatorskilled', 'minleaderscaptured', 'minfacilitatorscaptured', 'leaderq'],
    num_rows: 724
})

首先在DataFrame中添加一個額外的列，然后對數據集中的每一行進行預測。將預測的副本存儲到這一列中，以避免重復執行這個計算密集的步驟。

但首先需要將數據組裝成Pydantic對象，以便處理數據驗證。

[
    IsafEvent(
        name='5',
        text='2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
        start_date=datetime.date(2013, 1, 24),
        event_type={'insurgentskilled'},
        province={'ghazni'},
        target_group={'taliban'},
        min_killed=1,
        min_captured=0,
        killq=True,
        captureq=False,
        killcaptureraid=False,
        airstrike=False,
        noshotsfired=False,
        min_leaders_killed=1,
        min_leaders_captured=0,
        predictions={}
    ),
    IsafEvent(
        name='2',
        text='2011-11-S-034\nISAF Joint Command - Afghanistan\nFor Immediate Release\n\nKABUL, Afghanistan (Nov. 20, 2011)\nA coalition security force detained numerous suspected insurgents during an operation in Marjeh district, Helmand province, yesterday.  The force conducted the operation after receiving information that a group of insurgents were at a compound in the area.  After calling for the men inside to come out peacefully, the insurgents emerged and were detained without incident.',
        start_date=datetime.date(2011, 11, 19),
        event_type={'detention'},
        province={'helmand'},
        target_group={''},
        min_killed=0,
        min_captured=4,
        killq=False,
        captureq=True,
        killcaptureraid=True,
        airstrike=False,
        noshotsfired=False,
        min_leaders_killed=0,
        min_leaders_captured=0,
        predictions={}
     )
]

因此，當進行預測時，我們希望從模型中得到一個類似這樣的JSON字符串：

json_str = events[0].model_dump_json(exclude={"text", "predictions"})
print(json_str)

{"name":"5","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["tali
ban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"nosh
otsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}

從使用GPT模型進行完整評估開始，需要一個更復雜的提示詞才能獲得理想的結果。

由于GPT模型沒有經過訓練或微調來響應微調模型的特定提示詞，因此我們不能直接使用相同的提示詞。

這帶來了一個有趣的問題：我們需要花多少精力在設計GPT提示詞上，才能達到微調模型的準確度？換句話說，是否真的有辦法在接受不同提示詞的模型之間進行公平的比較？

嘗試OpenAI的GPT-4和GPT-4 Turbo可以看到，為了讓GPT模型有機會與微調模型競爭，提示詞需要多長。

理想情況下，作者會在上下文中加入更多的示例，但他也不希望增加token的使用量。

from openai import OpenAI
from rich import print
import json
import os


def query_openai(article_text: str, model: str) -> str:
    query = (
        f"The following is a press release issued by ISAF (formerly operating in Afghanistan):\n{article_text}\n\n"
        "## Extraction request\n"
        "Please extract the following information from the press release:\n"
        "- The name of the event (summarising the event / text as a headline)\n"
        "- The start date of the event\n"
        "- The event type(s)\n"
        "- The province(s) in which the event occurred\n"
        "- The target group(s) of the event\n"
        "- The minimum number of people killed during the event\n"
        "- The minimum number of people captured during the event\n"
        "- Whether someone was killed or not during the event\n"
        "- Whether someone was captured or not during the event\n"
        "- Whether the event was a so-called 'kill-capture raid'\n"
        "- Whether an airstrike was used during the event\n"
        "- Whether no shots were fired during the event\n"
        "- The minimum number of leaders killed during the event\n"
        "- The minimum number of leaders captured during the event\n\n"
        "## Annotation notes:\n"
        "- A 'faciliator' is not a leader.\n"
        "- If a press release states that 'insurgents' were detained without further "
        "details, assign a minimum number of two detained. Interpret 'a couple' as "
        "two. Interpret 'several' as at least three, even though it may sometimes "
        "refer to seven or eight. Classify the terms 'a few', 'some', 'a group', 'a "
        "small group', and 'multiple' as denoting at least three, even if they "
        "sometimes refer to larger numbers. Choose the smaller number if no other "
        "information is available in the press release to come up with a minimally "
        "acceptable figure. Interpret 'numerous' and 'a handful' as at least four, "
        "and 'a large number' as at least five.\n\n"
        "## Example:\n"
        "Article text: 'ISAF Joint Command Evening Operational Update Feb. 19, 2011\nISAF Joint Command - "
        "Afghanistan\u20282011-02-S-143\u2028For Immediate Release \u2028\u2028KABUL, Afghanistan (Feb. 19)\u2028\u2028ISAF "
        "service members at a compound in Sangin district, Helmand province observed numerous insurgents north and south of "
        "their position talking on radios today. After gaining positive identification of the insurgent positions, the "
        "coalition troops engaged, killing several insurgents. Later, the ISAF troops observed more insurgents positioning "
        "in the area with weapons. After positive identification, coalition forces continued firing on the various insurgent "
        "positions, resulting in several more insurgents being killed.'\n\n"
        'Output: `{"name":"Several insurgents killed in '
        'Helmand","start_date":"2011-02-18","event_type":["insurgentskilled"],"province":["helmand"],"target_group":[""],"mi'
        'n_killed":6,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired"'
        ':false,"min_leaders_killed":0,"min_leaders_captured":0}`'
    )


    # set up the prediction harness
    client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": "You are an expert at identifying events in a press release. You are precise "
                "and always make sure you are correct, drawing inference from the text of the "
                "press release.\n\n You always return a JSON string with the following schema: "
                "## JSON Schema details\n"
                "Here is some of the schema for the JSON output string you "
                "should make use of: event_types = ['airstrike', 'detention', "
                "'captureandkill', 'insurgentskilled', 'exchangeoffire', 'civiliancasualty'], "
                "provinces = ['badakhshan', 'badghis', 'baghlan', 'balkh', 'bamyan', "
                "'day_kundi', 'farah', 'faryab', 'ghazni', 'ghor', 'helmand', 'herat', "
                "'jowzjan', 'kabul', 'kandahar', 'kapisa', 'khost', 'kunar', 'kunduz', "
                "'laghman', 'logar', 'nangarhar', 'nimroz', 'nuristan', 'paktya', 'paktika', "
                "'panjshir', 'parwan', 'samangan', 'sar_e_pul', 'takhar', 'uruzgan', "
                "'wardak', 'zabul'], target_groups = ['taliban', 'haqqani', 'criminals', "
                "'aq', 'hig', 'let', 'imu', 'judq', 'iju', 'hik', 'ttp', 'other']\n\n",
            },
            {"role": "user", "content": query},
        ],
        temperature=1,
    )


    return response.choices[0].message.content

可以通過一個簡單的示例來驗證這個函數是否正常工作：

json_str = query_openai(events[0].text, "gpt-4o")
print(json.loads(json_str))

{
    'name': 'Taliban leader Alaudin killed in Ghazni',
    'start_date': '2013-01-24',
    'event_type': ['insurgentskilled'],
    'province': ['ghazni'],
    'target_group': ['taliban'],
    'min_killed': 1,
    'min_captured': 0,
    'killq': True,
    'captureq': False,
    'killcaptureraid': True,
    'airstrike': False,
    'noshotsfired': False,
    'min_leaders_killed': 1,
    'min_leaders_captured': 0
}

模型正常工作（如預期的那樣），并且我們得到了一個JSON字符串。

接下來，構建一個程序來遍歷所有測試數據，獲取預測結果，并將這些預測結果存儲在Pydantic對象中。

對于批量預測，要確保以異步方式進行，因為有大量的事件，所以不希望耗費太多時間。

此外，作者還在函數中添加了一些重試機制，以應對GPT-3.5-turbo模型的速率限制。

正如我們現在所看到的，作者對每個事件都附加了三個預測結果。

print(events[0])

IsafEvent(
    name='5',
    text='2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
    start_date=datetime.date(2013, 1, 24),
    event_type={'insurgentskilled'},
    province={'ghazni'},
    target_group={'taliban'},
    min_killed=1,
    min_captured=0,
    killq=True,
    captureq=False,
    killcaptureraid=False,
    airstrike=False,
    noshotsfired=False,
    min_leaders_killed=1,
    min_leaders_captured=0,
    predictinotallow={'gpt-4o': '{\n  "name": "Taliban leader Alaudin killed in Ghazni",\n  "start_date": "2013-01-24",\n  "event_type": ["insurgentskilled", "captureandkill"],\n  "province": ["ghazni"],\n  "target_group": ["taliban"],\n "min_killed": 1,\n  "min_captured": 0,\n  "killq": true,\n  "captureq": false,\n  "killcaptureraid": true,\n  "airstrike": false,\n  "noshotsfired": false,\n  "min_leaders_killed": 1,\n  "min_leaders_captured": 0\n}',
        'gpt-4-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": true,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-3.5-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni province",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": false,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}'
    }
)

目前，已經將所有預測結果都存儲在內存中，所以現在是時候將它們提交到數據集，并推送到Hugging Face Hub了，以防筆記本崩潰、本地計算機關閉或其他意外情況發生。

作者創建了一個函數來處理這個過程，因為還需要對其他模型重復這個步驟。雖然過程有點冗長，但這樣做更好，這樣方便我們可以清楚地看到每一步的操作。

一個更簡潔和抽象的 convert_to_dataset 函數可能如下所示：

def convert_to_dataset(data: List[BaseModel]) -> Dataset:
    dataset_dict = {}


    for field_name, field_value in data[0].__fields__.items():
        field_type = field_value.outer_type_
        if field_type in [str, int, float, bool, date]:
            dataset_dict[field_name] = [getattr(item, field_name) for item in data]
        elif field_type == set:
            dataset_dict[field_name] = [list(getattr(item, field_name)) for item in data]
        elif issubclass(field_type, BaseModel):
            dataset_dict[field_name] = [getattr(item, field_name).dict() for item in data]
        else:
            dataset_dict[field_name] = [getattr(item, field_name) for item in data]


    dataset = Dataset.from_dict(dataset_dict)
    return dataset

不過現在，要先把數據推送到Hub上。

convert_and_push_dataset(events, "isafpressreleases_with_preds", split_name="test")

添加來自微調模型的預測

在添加完基線OpenAI模型之后，現在再來添加一些之前微調過的模型，包括本地模型以及由一鍵微調服務商托管的模型。

重新加載預測數據集

先加載數據集，然后再添加一些本地模型的預測結果：

from datasets import load_dataset

preds_test_data = load_dataset("strickvl/isafpressreleases_with_preds")[
    "test"
].to_list()

微調TinyLlama的預測

現在，如果我們檢查數據集，就會發現新的模型預測結果已經保存進去了：

from rich import print

print(preds_test_data[0])

{
    'name': '5',
    'text': '2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
    'predictions': {'gpt-3.5-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni province",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": false,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": true,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4o': '{\n  "name": "Taliban leader Alaudin killed in Ghazni",\n  "start_date": "2013-01-24",\n  "event_type": ["insurgentskilled", "captureandkill"],\n  "province": ["ghazni"],\n  "target_group": ["taliban"],\n "min_killed": 1,\n  "min_captured": 0,\n  "killq": true,\n  "captureq": false,\n  "killcaptureraid": true,\n  "airstrike": false,\n  "noshotsfired": false,\n  "min_leaders_killed": 1,\n  "min_leaders_captured": 0\n}',
        'tinyllama-templatefree': '\n{"name":"Taliban leader killed in Ghazni","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'tinyllama-sharegpt': 
'{"name":"2","start_date":"2013-01-24","event_type":["airstrike"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":true,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}'
    },
    'start_date': datetime.date(2013, 1, 24),
    'province': ['ghazni'],
    'target_group': ['taliban'],
    'event_type': ['insurgentskilled'],
    'min_killed': 1,
    'min_captured': 0,
    'killq': True,
    'captureq': False,
    'killcaptureraid': False,
    'airstrike': False,
    'noshotsfired': False,
    'min_leaders_killed': 1,
    'min_leaders_captured': 0
}

微調Mistral的預測

正如作者之前提到的，微調后的Mistral模型無法在本地運行，所以他在Modal上進行推理，在那里他可以使用強大的A100顯卡來進行預測。

我們會看到，模型的表現并不理想，幾乎所有評估都失敗了。在圖表中數值為零的就是mistral-lora-templatefree模型。

微調OpenAI的預測

使用OpenAI的一鍵微調服務對gpt-3.5-turbo-1106模型進行了微調。通過OpenAI SDK 遍歷數據集，生成了這個微調模型的預測結果。

微調Mistral模型（通過OpenPipe）

使用OpenPipe微調了Mistral 7B和Mistral 8x7B模型，以便有一個合理的基準來比較其他模型。

微調Solar LLM（通過Predibase）

大約一周前，Predibase宣布了一個新的頂級微調模型，即Upstage的Solar LLM，所以作者決定試試。

這個模型的優勢在于它被訓練得非常擅長人們常常微調模型的任務，例如結構化數據提取。正如在圖表中呈現的，它表現得相當不錯！

微調Llama3的預測（通過OpenPipe）

作者本地微調的Llama3模型表現并不好，但在OpenPipe上的輸出看起來還可以，所以他使用這些預測進行最終評估。

from rich import print

print(preds_test_data[0])

{
    'name': '5',
    'text': '2013-01-S-025\n\nKABUL, Afghanistan (Jan. 25, 2013)\nDuring a security operation in Andar district, Ghazni province, yesterday, an Afghan and coalition force killed the Taliban leader, Alaudin. Alaudin oversaw a group of insurgents responsible for conducting remote-controlled improvised explosive device and small-arms fire attacks against Afghan and coalition forces. Prior to his death, Alaudin was planning attacks against Afghan National Police in Ghazni province.',
    'predictions': {'finetuned-llama3-7b-32k-openpipe': 
'{"name":"1","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'finetuned-mistral-7b-optimised-openpipe': 
'{"name":"1","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'finetuned-openai-gpt-3.5-turbo-1106': 
'{"name":"4","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'gpt-3.5-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni province",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": false,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4-turbo': '{\n    "name": "Taliban leader Alaudin killed in Ghazni",\n    "start_date": "2013-01-24",\n    "event_type": ["captureandkill"],\n    "province": ["ghazni"],\n    "target_group": ["taliban"],\n    "min_killed": 1,\n    "min_captured": 0,\n    "killq": true,\n    "captureq": false,\n    "killcaptureraid": true,\n    "airstrike": false,\n    "noshotsfired": false,\n    "min_leaders_killed": 1,\n    "min_leaders_captured": 0\n}',
        'gpt-4o': '{\n  "name": "Taliban leader Alaudin killed in Ghazni",\n  "start_date": "2013-01-24",\n  "event_type": ["insurgentskilled", "captureandkill"],\n  "province": ["ghazni"],\n  "target_group": ["taliban"],\n "min_killed": 1,\n  "min_captured": 0,\n  "killq": true,\n  "captureq": false,\n  "killcaptureraid": true,\n  "airstrike": false,\n  "noshotsfired": false,\n  "min_leaders_killed": 1,\n  "min_leaders_captured": 0\n}',
        'mistral-lora-templatefree': '1',
        'tinyllama-sharegpt': 
'{"name":"2","start_date":"2013-01-24","event_type":["airstrike"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":true,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'tinyllama-templatefree': '\n{"name":"Taliban leader killed in Ghazni","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":false,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}',
        'ft-solar-1-mini-chat-240612-predibase': 
'\n\n{"name":"2","start_date":"2013-01-24","event_type":["insurgentskilled"],"province":["ghazni"],"target_group":["taliban"],"min_killed":1,"min_captured":0,"killq":true,"captureq":false,"killcaptureraid":true,"airstrike":false,"noshotsfired":false,"min_leaders_killed":1,"min_leaders_captured":0}'
    },
    'start_date': datetime.date(2013, 1, 24),
    'province': ['ghazni'],
    'target_group': ['taliban'],
    'event_type': ['insurgentskilled'],
    'min_killed': 1,
    'min_captured': 0,
    'killq': True,
    'captureq': False,
    'killcaptureraid': False,
    'airstrike': False,
    'noshotsfired': False,
    'min_leaders_killed': 1,
    'min_leaders_captured': 0
}

現在我們有了來自七個微調模型和三個OpenAI模型的預測結果（用于比較），可以開始進行評估了。

不過在此之前，還需要先進行一個簡單的檢查，看看這些預測中有多少是有效的JSON格式。

JSON有效性測試

from datasets import load_dataset

dataset_with_preds = load_dataset("strickvl/isafpressreleases_test_predictions")[
    "train"
].to_list()

通過對比templatefree和sharegpt生成有效JSON的能力，我們已經能看到它們在TinyLlama微調中的差異，這非常具有指導意義。

OpenAI模型每次都能生成有效的JSON，微調后的Mistral和Llama3模型也是如此。

在編寫評估模型的代碼時，作者注意到有些條目是空白的或者根本沒有預測結果，所以他接下來對這個問題進行了調查。

# find out how many of the predictions are None values or empty strings
missing_values = {
    "gpt-4o": 0,
    "gpt-4-turbo": 0,
    "gpt-3.5-turbo": 0,
    "tinyllama-templatefree": 0,
    "tinyllama-sharegpt": 0,
    "finetuned-openai-gpt-3.5-turbo-1106": 0,
    "finetuned-llama3-7b-32k-openpipe": 0,
    "mistral-lora-templatefree": 0,
    "finetuned-mistral-7b-optimised-openpipe": 0,
    "ft-solar-1-mini-chat-240612-predibase": 0,
}


for row in dataset_with_preds:
    for model in row["predictions"]:
        if row["predictions"][model] is None or row["predictions"][model] == "":
            missing_values[model] += 1


print(missing_values)

{
    'gpt-4o': 0,
    'gpt-4-turbo': 0,
    'gpt-3.5-turbo': 0,
    'tinyllama-templatefree': 0,
    'tinyllama-sharegpt': 38,
    'finetuned-openai-gpt-3.5-turbo-1106': 0,
    'finetuned-llama3-7b-32k-openpipe': 0,
    'mistral-lora-templatefree': 0,
    'finetuned-mistral-7b-optimised-openpipe': 0,
    'ft-solar-1-mini-chat-240612-predibase': 0
}

如果沒有缺失值，tinyllama-sharegpt模型將會有全部724個預測結果，并且都是有效的JSON。

現在我們可以進入我們真正感興趣的部分：準確性。作者將計算所有有意義的屬性的分數，然后展示模型比較的結果。

這些屬性包括：

start_date
province
target_group
event_type
min_killed
min_captured
killq
captureq
killcaptureraid
airstrike
noshotsfired
min_leaders_killed
min_leaders_captured

重要提示，對于接下來的所有圖表，總任務數為724，所以這些數字是從724中得出的。

測試結果

開始日期準確性

{
    'gpt-4o': 527,
    'gpt-4-turbo': 522,
    'gpt-3.5-turbo': 492,
    'tinyllama-templatefree': 231,
    'tinyllama-sharegpt': 479,
    'finetuned-openai-gpt-3.5-turbo-1106': 646,
    'finetuned-llama3-7b-32k-openpipe': 585,
    'mistral-lora-templatefree': 0,
    'finetuned-mistral-7b-optimised-openpipe': 636,
    'ft-solar-1-mini-chat-240612-predibase': 649
}

Solar和微調的GPT-3.5模型在預測事件發生日期方面表現最佳。

省份準確性

{
    'gpt-4o': 649,
    'gpt-4-turbo': 645,
    'gpt-3.5-turbo': 595,
    'tinyllama-templatefree': 335,
    'tinyllama-sharegpt': 660,
    'finetuned-openai-gpt-3.5-turbo-1106': 704,
    'finetuned-llama3-7b-32k-openpipe': 707,
    'mistral-lora-templatefree': 0,
    'finetuned-mistral-7b-optimised-openpipe': 711,
    'ft-solar-1-mini-chat-240612-predibase': 704
}

分析發現，微調后的模型實際上比OpenAI模型表現更好，只犯了少量錯誤。

目標群體準確性

在這里，可能會提到多個目標群體，因此作者會根據模型預測的群體中有多少是正確的來給出一個滿分為1的分數。

微調后的模型在目標群體識別方面明顯優于OpenAI。

不過，作者懷疑如果添加一些訓練數據中沒有的新群體，模型的表現可能會下降。

事件類型準確性

事件類型實際上是最難的類別之一，因為有些類別在語義上存在重疊，有時甚至連人工標注者也難以區分。

再一次，微調后的模型在這方面表現得相當好。

min_killed準確性

在這些數字估計任務中，微調模型和OpenAI模型的表現差距突然縮小了。

雖然Mistral依然表現最佳，但優勢并不明顯！而且OpenAI模型在這方面的表現非常出色，令人印象深刻。

作者猜測這是因為提示中有一整段解釋了用于標注示例的標準：

標注說明：‘facilitator’ 不是領導者。如果新聞稿中提到‘叛亂分子’被拘留而沒有進一步細節，則分配至少兩名被拘留者。將‘a couple’解釋為兩人。將‘several’解釋為至少三人，盡管有時可能指七或八人。將‘a few’、‘some’、‘a group’、‘a small group’和‘multiple’解釋為至少三人，即使有時它們指代更多人數。如果新聞稿中沒有其他信息來提供一個最低可接受的數字，請選擇較小的數字。將‘numerous’和‘a handful’解釋為至少四人，而‘a large number’解釋為至少五人。