構建一套可自我改進的 Agentic RAG 系統 精華
Agentic RAG 系統可以被視為一個“高維向量空間”,其中每個維度都對應一次設計決策,例如 prompt engineering、agent 協同、retrieval 策略等。手動調優這些維度以找到“正確組合”非常困難,而且上線后的未見數據往往會打破測試時有效的配置。
一個更好的方法是讓系統學會“自我優化”。一條典型的、能“自我進化”的 Agentic RAG 流水線,遵循如下思考過程:

Self Improving Agentic RAG System (Created by Fareed Khan)
- 一個由“專家型代理(specialist agents)”組成的協作團隊執行任務。它基于一個高層概念,按照當前 SOP(標準作業程序)生成一份完整的、多來源文檔。
- 一個“多維評價系統(multi-dimensional evaluation system)”對團隊輸出進行評分,度量準確性、可行性、合規性等多個目標,得到一個性能向量。
- 一個“性能診斷代理(diagnostician agent)”分析該向量,像咨詢顧問一樣識別流程中的主要薄弱環節,并追溯根因。
- 一個“SOP 架構代理(SOP architect agent)”基于診斷洞見更新流程,提出專門用于修復薄弱點的新變體。
- 每個“SOP 新版本”都會在團隊重復執行任務時進行測試,每次輸出再被評估,以生成對應的性能向量。
- 系統識別“Pareto front(帕累托前沿)”,即所有已測 SOP 的最優權衡組合,并將這些優化策略呈現給“人類決策者”,從而閉合進化回路。
在這篇博客中,我們將瞄準"醫療健康(healthcare)"領域。該領域的挑戰在于:需要針對輸入查詢或知識庫考慮"多種可能性",同時"最終決策仍由人類掌握"。
我們將構建一條端到端、可自我改進的 Agentic RAG 流水線,用來生成 RAG 系統的不同設計模式。
完整代碼可在我的 GitHub 倉庫獲取:
GitHub - FareedKhan-dev/autonomous-agentic-rag: Self improving agentic rag pipeline
目錄
- 醫學 AI 的知識基礎設施
  - 安裝開源技術棧
  - 環境配置與依賴導入
  - 配置本地大語言模型
  - 準備知識庫
- 構建內部臨床試驗設計網絡
  - 定義標準操作規程(Guild SOP)
  - 定義專業智能體(Specialist Agents)
  - 使用 LangGraph 編排公會
  - 完整運行工作流圖
- 多維度評價體系
  - 為每個參數構建自定義評估器
  - 創建聚合型 LangSmith 評估器
- 進化引擎的外層循環
  - 管理配置
  - 構建主任級智能體(Director-Level Agents)
  - 運行完整的進化循環
- 基于五維的帕累托分析
  - 識別帕累托前沿
  - 可視化前沿并做出決策
- 理解認知工作流
  - 可視化智能體工作流時間線
  - 使用雷達圖剖析輸出結果
- 將其轉變為自主策略
醫學 AI 的知識基礎設施
在動手編寫可自我進化的 agentic RAG 系統之前,我們需要先建立合適的知識庫,并準備好搭建這套架構所需的工具。
一套生產級 RAG 系統通常包含多樣化的數據庫,既包括敏感的組織內部數據,也包含開源數據,用來提升檢索質量,并彌補信息過時或不完整的問題。這個基礎步驟可以說是最關鍵的……
因為數據源的質量將直接決定最終輸出的質量。

Sourcing the knowledge base (Created by Fareed Khan)
本節我們將組裝整套架構的各個組件。計劃如下:
- 安裝開源技術棧(Open-Source Stack):搭建環境并安裝必要庫,堅持本地、開源優先(open-source-first)。
- 配置安全可觀測性(Secure Observability):安全加載 API Key,并配置 LangSmith,從一開始就追蹤和調試復雜的代理交互。
- 搭建本地 LLM 工坊(LLM Foundry):通過 Ollama 構建不同的開源模型組合,為不同任務分配不同模型,以優化表現與成本。
- 獲取并處理多模態數據:下載并準備 4 類真實數據源:PubMed 科學文獻、FDA 監管指南、倫理原則,以及一個大型結構化臨床數據集(MIMIC-III)。
- 索引知識庫(Index the Knowledge Stores):最終,將原始數據處理為高效可檢索的數據庫:對非結構化文本使用 FAISS 向量庫,對結構化臨床數據使用 DuckDB。
安裝開源技術棧
第一步是安裝所需的 Python 庫。可復現的環境是一切嚴肅項目的基石。我們選擇業界標準的開源棧,以便對系統進行完全掌控。包括用于核心 agentic 框架的 langchain 和 langgraph、與本地 LLM 交互的 ollama,以及訪問 PubMed 的 biopython、進行高性能臨床數據分析的 duckdb 等專業庫。
讓我們安裝需要的模塊……
# We use pip's "quiet" (-q) and "upgrade" (-U) flags to install all the required packages.
# - langchain, langgraph, etc.: These form the core of our agentic framework for building and orchestrating agents.
# - ollama: This is the client library that allows our Python code to communicate with a locally running Ollama server.
# - duckdb: An incredibly fast, in-process analytical database perfect for handling our structured MIMIC data without a heavy server setup.
# - faiss-cpu: Facebook AI's library for efficient similarity search, which will power the vector stores for our RAG agents.
# - sentence-transformers: A library for easy access to state-of-the-art models for creating text embeddings.
# - biopython, pypdf, beautifulsoup4: A suite of powerful utilities for downloading and parsing our diverse, real-world data sources.
%pip install -U langchain langgraph langchain_community langchain_openai langchain_core ollama pandas duckdb faiss-cpu sentence-transformers biopython pypdf pydantic lxml html2text beautifulsoup4 matplotlib -qqq
我們一次性準備好所有工具和"建筑材料"。各庫各司其職:從用 langgraph 編排 agent 工作流,到用 duckdb 做數據分析。
模塊安裝完成后,讓我們逐一初始化它們。
環境配置與依賴導入
我們需要安全地配置環境。把 API Key 硬編碼在筆記本里既有安全風險,也不利于共享代碼。
我們使用 ??.env?? 文件管理敏感信息,主要是 LangSmith 的 API Key。從一開始就配置 LangSmith 是不可妥協的要求,這將為我們提供深度可觀測性,以跟蹤、調試并理解 agents 之間的交互。上代碼:
import os
import getpass
from dotenv import load_dotenv
# This function from the python-dotenv library searches for a .env file and loads its key-value pairs
# into the operating system's environment variables, making them accessible to our script.
load_dotenv()
# This is a critical check. We verify that our script can access the necessary API keys from the environment.
if"LANGCHAIN_API_KEY"notin os.environ or"ENTREZ_EMAIL"notin os.environ:
# If the keys are missing, we print an error and halt, as the application cannot proceed.
print("Required environment variables not set. Please set them in your .env file or environment.")
else:
# This confirmation tells us our secrets have been loaded securely and are ready for use.
print("Environment variables loaded successfully.")
# We explicitly set the LangSmith project name. This is a best practice that ensures all traces
# generated by this project are automatically grouped together in the LangSmith user interface for easy analysis.
os.environ["LANGCHAIN_PROJECT"] = "AI_Clinical_Trials_Architect"??load_dotenv()??? 是敏感憑據與代碼之間的一座“安全橋梁”。它讀取 ??.env??(絕不要提交到版本庫),并將密鑰注入環境。
從現在起,我們使用 LangChain 或 LangGraph 的所有操作都會自動被采集,并發送到 LangSmith 的項目中。
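作為參考,一個最小的 .env 文件大致如下(內容均為占位示例,鍵名以上文代碼實際讀取的為準;LANGCHAIN_TRACING_V2 與 LANGCHAIN_ENDPOINT 是 LangSmith 常用的追蹤開關與端點變量):
# .env 示例(占位值,切勿提交到版本庫)
LANGCHAIN_API_KEY=ls__xxxxxxxxxxxxxxxx
LANGCHAIN_TRACING_V2=true
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
ENTREZ_EMAIL=you@example.com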
配置本地大語言模型
在生產級 agentic 系統中,“一刀切”的模型策略往往不是最佳。大型 SOTA 模型計算開銷大且慢,把它用于簡單任務會浪費資源(尤其自托管在 GPU 時)。但小模型雖然快速,卻可能缺乏做關鍵決策所需的深度推理能力。

Configuring Local LLMs (Created by Fareed Khan)
關鍵在于將“合適的模型放在系統的合適位置”。我們將構建一個多模型組合(均由 Ollama 本地服務以保障隱私、可控與成本效益),每個模型在特定角色上發揮所長。
先定義一個配置字典,集中管理每個選定模型的客戶端,便于替換與統一管理。
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings import OllamaEmbeddings
# This dictionary will act as our central registry, or "foundry," for all LLM and embedding model clients.
llm_config = {
# For the 'planner', we use Llama 3.1 8B. It's a modern, highly capable model that excels at instruction-following.
# We set `format='json'` to leverage Ollama's built-in JSON mode, ensuring reliable structured output for this critical task.
"planner": ChatOllama(model="llama3.1:8b-instruct", temperature=0.0, format='json'),
# For the 'drafter' and 'sql_coder', we use Qwen2 7B. It's a nimble and fast model, perfect for
# tasks like text generation and code completion where speed is valuable.
"drafter": ChatOllama(model="qwen2:7b", temperature=0.2),
"sql_coder": ChatOllama(model="qwen2:7b", temperature=0.0),
# For the 'director', the highest-level strategic agent, we use the powerful Llama 3 70B model.
# This high-stakes task of diagnosing performance and evolving the system's own procedures
# justifies the use of a larger, more powerful model.
"director": ChatOllama(model="llama3:70b", temperature=0.0, format='json'),
# For embeddings, we use 'nomic-embed-text', a top-tier, efficient open-source model.
"embedding_model": OllamaEmbeddings(model="nomic-embed-text")
}
我們剛剛創建了 llm_config 字典,作為所有模型初始化的"中央樞紐"。通過為不同角色分配不同模型,構建一套按成本-性能權衡優化的層次結構。
- 快速靈巧(7B–8B):planner、drafter、sql_coder 處理頻繁、定義清晰的任務。使用 Qwen2 7B、Llama 3.1 8B 能保證低延遲與高性價比,同時具備足夠的指令跟隨能力生成計劃、撰寫文本或編寫 SQL。
- 深度策略(70B):director 需要分析多維性能數據并改寫整個 SOP,要求較強的因果推理與全局理解。為這種"低頻高風險"任務分配 Llama 3 70B 是合理的。
打印配置以確認:
# Print the configuration to confirm the clients are initialized and their parameters are set correctly.
print("LLM clients configured:")
print(f"Planner ({llm_config['planner'].model}): {llm_config['planner']}")
print(f"Drafter ({llm_config['drafter'].model}): {llm_config['drafter']}")
print(f"SQL Coder ({llm_config['sql_coder'].model}): {llm_config['sql_coder']}")
print(f"Director ({llm_config['director'].model}): {llm_config['director']}")
print(f"Embedding Model ({llm_config['embedding_model'].model}): {llm_config['embedding_model']}")輸出示例:
#### OUTPUT ####
LLM clients configured:
Planner (llama3.1:8b-instruct): ChatOllama(model='llama3.1:8b-instruct', temperature=0.0, format='json')
Drafter (qwen2:7b): ChatOllama(model='qwen2:7b', temperature=0.2)
SQL Coder (qwen2:7b): ChatOllama(model='qwen2:7b', temperature=0.0)
Director (llama3:70b): ChatOllama(model='llama3:70b', temperature=0.0, format='json')
Embedding Model (nomic-embed-text): OllamaEmbeddings(model='nomic-embed-text')
這表明 ChatOllama 和 OllamaEmbeddings 客戶端已按指定模型與參數成功初始化。接下來連接知識庫。
準備知識庫
RAG 的“靈魂”在于一套豐富的多模態知識基座。面對臨床試驗設計這樣的專業任務,通用的網頁搜索遠遠不夠。我們需要以權威、領域特定的信息作為根基。

Knowledge store creation (Created by Fareed Khan)
為此,我們將構建一個全面的“知識庫”,從四類真實世界數據中采集、下載并處理內容。多源融合對幫助 agents 進行信息綜合至關重要,最終輸出也會更全面更可靠。
先創建數據目錄:
import os
# A dictionary to hold the paths for our different data types. This keeps our file management clean and centralized.
data_paths = {
"base": "./data",
"pubmed": "./data/pubmed_articles",
"fda": "./data/fda_guidelines",
"ethics": "./data/ethical_guidelines",
"mimic": "./data/mimic_db"
}
# This loop iterates through our defined paths and uses os.makedirs() to create any directories that don't already exist.
# This prevents errors in later steps when we try to save files to these locations.
for path in data_paths.values():
if not os.path.exists(path):
os.makedirs(path)
print(f"Created directory: {path}")這確保項目從一開始就擁有干凈、組織良好的文件結構。
接著從 PubMed 獲取真實文獻,為 Medical Researcher 提供核心知識:
from Bio import Entrez
from Bio import Medline
def download_pubmed_articles(query, max_articles=20):
"""Fetches abstracts from PubMed for a given query and saves them as text files."""
# The NCBI API requires an email address for identification. We fetch it from our environment variables.
Entrez.email = os.environ.get("ENTREZ_EMAIL")
print(f"Fetching PubMed articles for query: {query}")
# Step 1: Use Entrez.esearch to find the PubMed IDs (PMIDs) for articles matching our query.
handle = Entrez.esearch(db="pubmed", term=query, retmax=max_articles, sort="relevance")
record = Entrez.read(handle)
id_list = record["IdList"]
print(f"Found {len(id_list)} article IDs.")
print("Downloading articles...")
# Step 2: Use Entrez.efetch to retrieve the full records (in MEDLINE format) for the list of PMIDs.
handle = Entrez.efetch(db="pubmed", id=id_list, rettype="medline", retmode="text")
records = Medline.parse(handle)
count = 0
# Step 3: Iterate through the retrieved records, parse them, and save each abstract to a file.
for i, record in enumerate(records):
pmid = record.get("PMID", "")
title = record.get("TI", "No Title")
abstract = record.get("AB", "No Abstract")
if pmid:
# We name the file after the PMID for easy reference and to avoid duplicates.
filepath = os.path.join(data_paths["pubmed"], f"{pmid}.txt")
with open(filepath, "w") as f:
f.write(f"Title: {title}\n\nAbstract: {abstract}")
print(f"[{i+1}/{len(id_list)}] Fetching PMID: {pmid}... Saved to {filepath}")
count += 1
return count
該函數按 3 步連接 NCBI:檢索符合布爾查詢的 PMID、拉取 MEDLINE 記錄,并將標題與摘要保存為本地文本文件。
執行:
# We define a specific, boolean query to find articles highly relevant to our trial concept.
pubmed_query = "(SGLT2 inhibitor) AND (type 2 diabetes) AND (renal impairment)"
num_downloaded = download_pubmed_articles(pubmed_query)
print(f"PubMed download complete. {num_downloaded} articles saved.")示例輸出:
#### OUTPUT ####
Fetching PubMed articles for query: (SGLT2 inhibitor) AND (type 2 diabetes) AND (renal impairment)
Found 20 article IDs.
Downloading articles...
[1/20] Fetching PMID: 38810260... Saved to ./data/pubmed_articles/38810260.txt
[2/20] Fetching PMID: 38788484... Saved to ./data/pubmed_articles/38788484.txt
...
PubMed download complete. 20 articles saved.
現在 Medical Researcher 具備扎實、最新、領域特定的科學依據。
接下來獲取監管文件,供 Regulatory Specialist 使用:
import requests
from pypdf import PdfReader
import io
def download_and_extract_text_from_pdf(url, output_path):
"""Downloads a PDF from a URL, saves it, and also extracts its text content to a separate .txt file."""
print(f"Downloading FDA Guideline: {url}")
try:
# We use the 'requests' library to perform the HTTP GET request to download the file.
response = requests.get(url)
response.raise_for_status() # This is a good practice that will raise an error if the download fails (e.g., a 404 error).
# We save the raw PDF file, which is useful for archival purposes.
with open(output_path, 'wb') as f:
f.write(response.content)
print(f"Successfully downloaded and saved to {output_path}")
# We then use pypdf to read the PDF content directly from the in-memory response.
reader = PdfReader(io.BytesIO(response.content))
text = ""
# We loop through each page of the PDF and append its extracted text.
for page in reader.pages:
text += page.extract_text() + "\n\n"
# Finally, we save the clean, extracted text to a .txt file. This is the file our RAG system will actually use.
txt_output_path = os.path.splitext(output_path)[0] + '.txt'
with open(txt_output_path, 'w') as f:
f.write(text)
return True
except requests.exceptions.RequestException as e:
print(f"Error downloading file: {e}")
return False
運行下載 FDA 指南并抽取文本:
# This URL points to a real FDA guidance document for developing drugs for diabetes.
fda_url = "https://www.fda.gov/media/71185/download"
fda_pdf_path = os.path.join(data_paths["fda"], "fda_diabetes_guidance.pdf")
download_and_extract_text_from_pdf(fda_url, fda_pdf_path)
#### OUTPUT ####
Downloading FDA Guideline: https://www.fda.gov/media/71185/download
Successfully downloaded and saved to ./data/fda_guidelines/fda_diabetes_guidance.pdf
現在 Regulatory Specialist 擁有法律與監管文本的基礎語料。
接著為 Ethics Specialist 準備一份精要文檔(相當于 Belmont Report 的核心原則摘要),以確保其推理建立在最重要概念之上:
# This multi-line string contains a curated summary of the three core principles of the Belmont Report,
# which is the foundational document for ethics in human subject research in the United States.
ethics_content = """
Title: Summary of the Belmont Report Principles for Clinical Research
1. Respect for Persons: This principle requires that individuals be treated as autonomous agents and that persons with diminished autonomy are entitled to protection. This translates to robust informed consent processes. Inclusion/exclusion criteria must not unduly target or coerce vulnerable populations, such as economically disadvantaged individuals, prisoners, or those with severe cognitive impairments, unless the research is directly intended to benefit that population.
2. Beneficence: This principle involves two complementary rules: (1) do not harm and (2) maximize possible benefits and minimize possible harms. The criteria must be designed to select a population that is most likely to benefit and least likely to be harmed by the intervention. The risks to subjects must be reasonable in relation to anticipated benefits.
3. Justice: This principle concerns the fairness of distribution of the burdens and benefits of research. The selection of research subjects must be equitable. Criteria should not be designed to exclude certain groups without a sound scientific or safety-related justification. For example, excluding participants based on race, gender, or socioeconomic status is unjust unless there is a clear rationale related to the drug's mechanism or risk profile.
"""
# We define the path where our ethics document will be saved.
ethics_path = os.path.join(data_paths["ethics"], "belmont_summary.txt")
# We open the file in write mode and save the content.
with open(ethics_path, "w") as f:
f.write(ethics_content)
print(f"Created ethics guideline file: {ethics_path}")最后是最復雜的數據源:來自 MIMIC-III 的結構化臨床數據,為 ??Patient Cohort Analyst?? 提供真實世界人群數據,用以評估招募可行性。
import duckdb
import pandas as pd
import os
def load_real_mimic_data():
"""Loads real MIMIC-III CSVs into a persistent DuckDB database file, processing the massive LABEVENTS table efficiently."""
print("Attempting to load real MIMIC-III data from local CSVs...")
db_path = os.path.join(data_paths["mimic"], "mimic3_real.db")
csv_dir = os.path.join(data_paths["mimic"], "mimiciii_csvs")
# Define the paths to the required compressed CSV files.
required_files = {
"patients": os.path.join(csv_dir, "PATIENTS.csv.gz"),
"diagnoses": os.path.join(csv_dir, "DIAGNOSES_ICD.csv.gz"),
"labevents": os.path.join(csv_dir, "LABEVENTS.csv.gz"),
}
# Before starting, we check if all the necessary source files are present.
missing_files = [path for path in required_files.values() if not os.path.exists(path)]
if missing_files:
print("ERROR: The following MIMIC-III files were not found:")
for f in missing_files: print(f"- {f}")
print("\nPlease download them as instructed and place them in the correct directory.")
return None
print("Required files found. Proceeding with database creation.")
# Remove any old database file to ensure we are building from scratch.
if os.path.exists(db_path):
os.remove(db_path)
# Connect to DuckDB. If the database file doesn't exist, it will be created.
con = duckdb.connect(db_path)
# Use DuckDB's powerful `read_csv_auto` to directly load data from the gzipped CSVs into SQL tables.
print(f"Loading {required_files['patients']} into DuckDB...")
con.execute(f"CREATE TABLE patients AS SELECT SUBJECT_ID, GENDER, DOB, DOD FROM read_csv_auto('{required_files['patients']}')")
print(f"Loading {required_files['diagnoses']} into DuckDB...")
con.execute(f"CREATE TABLE diagnoses_icd AS SELECT SUBJECT_ID, ICD9_CODE FROM read_csv_auto('{required_files['diagnoses']}')")
# The LABEVENTS table is enormous. To handle it robustly, we use a two-stage process.
print(f"Loading and processing {required_files['labevents']} (this may take several minutes)...")
# 1. Load the data into a temporary 'staging' table, treating all columns as text (`all_varchar=True`).
# This prevents parsing errors with mixed data types. We also filter for only the lab item IDs we
# care about (50912 for Creatinine, 50852 for HbA1c) and use a regex to ensure VALUENUM is numeric.
con.execute(f"""CREATE TABLE labevents_staging AS
SELECT SUBJECT_ID, ITEMID, VALUENUM
FROM read_csv_auto('{required_files['labevents']}', all_varchar=True)
WHERE ITEMID IN ('50912', '50852') AND VALUENUM IS NOT NULL AND VALUENUM ~ '^[0-9]+(\\.[0-9]+)?$'
""")
# 2. Create the final, clean table by selecting from the staging table and casting the columns to their correct numeric types.
con.execute("CREATE TABLE labevents AS SELECT SUBJECT_ID, CAST(ITEMID AS INTEGER) AS ITEMID, CAST(VALUENUM AS DOUBLE) AS VALUENUM FROM labevents_staging")
# 3. Drop the temporary staging table to save space.
con.execute("DROP TABLE labevents_staging")
con.close()
return db_path
這里利用 DuckDB 直接從磁盤處理大型 CSV,而不是用 pandas 全量讀入內存;對 LABEVENTS 采用兩階段清洗(先以 all_varchar 讀入并過濾,再強制轉換類型),以穩健應對數據質量問題,得到干凈、高效可查詢的表。
執行并檢查:
# Execute the function to build the database.
db_path = load_real_mimic_data()
# If the database was created successfully, connect to it and inspect the schema and some sample data.
if db_path:
print(f"\nReal MIMIC-III database created at: {db_path}")
print("\nTesting database connection and schema...")
con = duckdb.connect(db_path)
print(f"Tables in DB: {con.execute('SHOW TABLES').df()['name'].tolist()}")
print("\nSample of 'patients' table:")
print(con.execute("SELECT * FROM patients LIMIT 5").df())
print("\nSample of 'diagnoses_icd' table:")
print(con.execute("SELECT * FROM diagnoses_icd LIMIT 5").df())
con.close()
示例輸出略,顯示三張表均已創建成功,可正常查詢。
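為直觀理解這三張表如何配合使用,下面給出一條假設性的查詢示例(僅演示表結構的用法,并非后文 SQL Coder 的真實輸出;疾病與化驗的映射沿用上文代碼注釋中的 ICD9 25000 與 ITEMID 50912):
# 假設性示例:統計同時患 T2DM(ICD9 25000)且肌酐(ITEMID 50912)落在 1.5-3.0 區間的去重患者數
if db_path:
    con = duckdb.connect(db_path)
    sample_sql = """
        SELECT COUNT(DISTINCT d.SUBJECT_ID) AS eligible_patients
        FROM diagnoses_icd d
        JOIN labevents l ON d.SUBJECT_ID = l.SUBJECT_ID
        WHERE d.ICD9_CODE = '25000'
          AND l.ITEMID = 50912
          AND l.VALUENUM BETWEEN 1.5 AND 3.0
    """
    print("Sample eligible patient count:", con.execute(sample_sql).fetchone()[0])
    con.close()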

Pre-processing Step (Created by Fareed Khan)
最后,將所有非結構化文本數據索引為可檢索的向量庫,以便 RAG 使用:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document
def create_vector_store(folder_path: str, embedding_model, store_name: str):
"""Loads all .txt files from a folder, splits them into chunks, and creates an in-memory FAISS vector store."""
print(f"--- Creating {store_name} Vector Store ---")
# Use DirectoryLoader to efficiently load all .txt files from the specified folder.
loader = DirectoryLoader(folder_path, glob="**/*.txt", loader_cls=TextLoader, show_progress=True)
documents = loader.load()
if not documents:
print(f"No documents found in {folder_path}, skipping vector store creation.")
return None
# Use RecursiveCharacterTextSplitter to break large documents into smaller, 1000-character chunks with a 100-character overlap.
# The overlap helps maintain context between chunks.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
print(f"Loaded {len(documents)} documents, split into {len(texts)} chunks.")
print("Generating embeddings and indexing into FAISS... (This may take a moment)")
# FAISS.from_documents is a convenient function that handles both embedding the text chunks
# and building the efficient FAISS index in one step.
db = FAISS.from_documents(texts, embedding_model)
print(f"{store_name} Vector Store created successfully.")
return db
def create_retrievers(embedding_model):
"""Creates vector store retrievers for all unstructured data sources and consolidates all knowledge stores."""
# Create a separate, specialized vector store for each type of document.
pubmed_db = create_vector_store(data_paths["pubmed"], embedding_model, "PubMed")
fda_db = create_vector_store(data_paths["fda"], embedding_model, "FDA")
ethics_db = create_vector_store(data_paths["ethics"], embedding_model, "Ethics")
# Return a single dictionary containing all configured data access tools.
# The 'as_retriever' method converts the vector store into a standard LangChain Retriever object.
# The 'k' parameter in 'search_kwargs' controls how many top documents are returned by a search.
return {
"pubmed_retriever": pubmed_db.as_retriever(search_kwargs={"k": 3}) if pubmed_db elseNone,
"fda_retriever": fda_db.as_retriever(search_kwargs={"k": 3}) if fda_db elseNone,
"ethics_retriever": ethics_db.as_retriever(search_kwargs={"k": 2}) if ethics_db elseNone,
"mimic_db_path": db_path # We also include the file path to our structured DuckDB database.
}
create_vector_store 封裝了"load -> split -> embed -> index"的標準 RAG 構建流程;create_retrievers 則為每類語料構建獨立向量庫并返回 retriever 字典。我們采用"分域向量庫"而非"大一統",以便各代理只檢索各自相關的知識源(例如 Regulatory Specialist 僅使用 fda_retriever)。
執行創建:
# Execute the function to create all our retrievers.
knowledge_stores = create_retrievers(llm_config["embedding_model"])
print("\nKnowledge stores and retrievers created successfully.")
# Print the final dictionary to confirm all components are present.
for name, store in knowledge_stores.items():
print(f"{name}: {store}")輸出顯示各 retriever 創建成功。
至此,數據(下載、處理、索引)與 LLM(配置)均已就緒,可以開始構建系統的第一大組件:Trial Design Guild(試驗設計工會)。
構建內部臨床試驗設計網絡
隨著知識庫就緒,現在構建系統核心。這不是一個簡單線性的 RAG chain,而是一套基于 LangGraph 的協作式多代理工作流:一支 AI 專家團隊,共同將高層試驗概念轉化為一份詳細的、有數據支撐的入排標準文檔。

Main Inner Loop RAG (Created by Fareed Khan)
整個架構的行為不是硬編碼的,而由一個動態配置對象治理:標準作業程序(Standard Operating Procedure,??GuildSOP??)。
這個 SOP 是我們 RAG 流水線的“基因組(genome)”,也是外層的“AI Research Director”將要進化與優化的對象。
本節計劃:
- 定義 RAG 基因組:創建 Pydantic 模型 GuildSOP,用于驅動整個工作流架構。
- 設計共享工作臺:定義 GuildState,作為代理共享計劃與發現的中央空間。
- 構建專家型代理:將 Planner、Researchers、SQL Analyst、Synthesizer 分別實現為 Python 函數,作為圖中的節點。
- 編排協作:用 LangGraph 將這些 agent 節點接線成完整端到端工作流。
- 全量測試:用 baseline SOP 調用完整的 Guild graph,觀察其實際運行并生成首版標準文檔。
定義公會標準操作規程
先定義控制整體流程行為的結構。我們用 Pydantic BaseModel 創建 GuildSOP,通過強類型、校驗與自文檔化,讓 SOP 既穩定又可進化。

Guild SOP Design (Created by Fareed Khan)
from pydantic import BaseModel, Field
from typing import Literal
class GuildSOP(BaseModel):
"""Standard Operating Procedures for the Trial Design Guild. This object acts as the dynamic configuration for the entire RAG workflow."""
# This field holds the system prompt for the Planner Agent, dictating its strategy.
planner_prompt: str = Field(description="The system prompt for the Planner Agent.")
# This parameter controls how many documents the Medical Researcher retrieves, allowing us to tune the breadth of its search.
researcher_retriever_k: int = Field(description="Number of documents for the Medical Researcher to retrieve.", default=3)
# This is the system prompt for the final writer, the Synthesizer Agent.
synthesizer_prompt: str = Field(description="The system prompt for the Criteria Synthesizer Agent.")
# This allows us to dynamically change the model used for the final drafting stage, trading off speed vs. quality.
synthesizer_model: Literal["qwen2:7b", "llama3.1:8b-instruct"] = Field(description="The LLM to use for the Synthesizer.", default="qwen2:7b")
# These booleans act as "feature flags," allowing the Director to turn entire agent capabilities on or off.
use_sql_analyst: bool = Field(description="Whether to use the Patient Cohort Analyst agent.", default=True)
use_ethics_specialist: bool = Field(description="Whether to use the Ethics Specialist agent.", default=True)
GuildSOP 公開了關鍵參數(如 prompts、researcher_retriever_k,以及各 agent 的開關),使外層 AI Director 能夠拉動這些"策略杠桿",進而調優整體性能。synthesizer_model 使用 Literal 限定取值集合,保證類型安全。
構建 baseline 版本:
import json
baseline_sop = GuildSOP(
planner_prompt="""You are a master planner for clinical trial design...""",
synthesizer_prompt="""You are an expert medical writer...""",
researcher_retriever_k=3,
synthesizer_model="qwen2:7b",
use_sql_analyst=True,
use_ethics_specialist=True
)
打印確認:
print("Baseline GuildSOP (v1.0):")
print(json.dumps(baseline_sop.dict(), indent=4))
輸出顯示 baseline SOP 的全部配置,它代表我們最初"手工工程"的最佳猜測,供 AI Director 后續優化與超越。
定義專業智能體(Specialist Agents)
有了“規則書”(SOP),接下來定義 agents。在 LangGraph 中,agent 是一個節點(Python 函數),輸入為當前圖狀態,輸出為狀態增量。

Specialist Agents (Created by Fareed Khan)
先定義共享狀態 GuildState,充當協作"工作臺",保存初始請求、planner 生成的計劃、各專家的發現,以及最終輸出。
from typing import List, Dict, Any, Optional
from langchain_core.pydantic_v1 import BaseModel
from typing_extensions import TypedDict
class AgentOutput(BaseModel):
"""A structured output for each agent's findings."""
agent_name: str
findings: Any
class GuildState(TypedDict):
"""The state of the Trial Design Guild's workflow, passed between all nodes."""
initial_request: str
plan: Optional[Dict[str, Any]]
agent_outputs: List[AgentOutput]
final_criteria: Optional[str]
sop: GuildSOP
接著實現 planner_agent,它讀取 SOP 中的 planner_prompt 并產出結構化計劃(JSON),用于指導后續 agents:
def planner_agent(state: GuildState) -> GuildState:
"""Receives the initial request and creates a structured plan for the specialist agents."""
print("--- EXECUTING PLANNER AGENT ---")
sop = state['sop']
planner_llm = llm_config['planner'].with_structured_output(schema={"plan": []})
prompt = f"{sop.planner_prompt}\n\nTrial Concept: '{state['initial_request']}'"
print(f"Planner Prompt:\n{prompt}")
response = planner_llm.invoke(prompt)
print(f"Generated Plan:\n{json.dumps(response, indent=2)}")
return {**state, "plan": response}然后實現通用的“檢索型代理”函數 ??retrieval_agent???,供 ??Medical Researcher???、??Regulatory Specialist???、??Ethics Specialist?? 復用:
def retrieval_agent(task_description: str, state: GuildState, retriever_name: str, agent_name: str) -> AgentOutput:
"""A generic agent function that performs retrieval from a specified vector store based on a task description."""
print(f"--- EXECUTING {agent_name.upper()} ---")
print(f"Task: {task_description}")
retriever = knowledge_stores[retriever_name]
if agent_name == "Medical Researcher":
retriever.search_kwargs['k'] = state['sop'].researcher_retriever_k
print(f"Using k={state['sop'].researcher_retriever_k} for retrieval.")
retrieved_docs = retriever.invoke(task_description)
findings = "\n\n---\n\n".join([f"Source: {doc.metadata.get('source', 'N/A')}\n\n{doc.page_content}"for doc in retrieved_docs])
print(f"Retrieved {len(retrieved_docs)} documents.")
print(f"Sample Finding:\n{findings[:500]}...")
return AgentOutput(agent_name=agent_name, findings=findings)
Patient Cohort Analyst 是最復雜的代理:它做 Text-to-SQL,將自然語言轉為有效 SQL 并在 DuckDB 上執行,給出可招募人群的估算:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
def patient_cohort_analyst(task_description: str, state: GuildState) -> AgentOutput:
"""Estimates cohort size by generating and then executing a SQL query against the MIMIC database."""
print("--- EXECUTING PATIENT COHORT ANALYST ---")
if not state['sop'].use_sql_analyst:
print("SQL Analyst skipped as per SOP.")
return AgentOutput(agent_name="Patient Cohort Analyst", findings="Analysis skipped as per SOP.")
con = duckdb.connect(knowledge_stores['mimic_db_path'])
schema_query = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'main' ORDER BY table_name, column_name;
"""
schema = con.execute(schema_query).df()
con.close()
sql_generation_prompt = ChatPromptTemplate.from_messages([
("system", f"You are an expert SQL writer specializing in DuckDB. ... schema:\n{schema.to_string()}\n\nIMPORTANT: All column names ...\n\nKey Mappings:\n- T2DM ... ICD9_CODE '25000'.\n- Moderate renal impairment ... creatinine ... ITEMID 50912 ... VALUENUM 1.5-3.0.\n- Uncontrolled T2D ... HbA1c ... ITEMID 50852 ... VALUENUM > 8.0."),
("human", "Please write a SQL query to count the number of unique patients who meet the following criteria: {task}")
])
sql_chain = sql_generation_prompt | llm_config['sql_coder'] | StrOutputParser()
print(f"Generating SQL for task: {task_description}")
sql_query = sql_chain.invoke({"task": task_description})
sql_query = sql_query.strip().replace("```sql", "").replace("```", "")
print(f"Generated SQL Query:\n{sql_query}")
try:
con = duckdb.connect(knowledge_stores['mimic_db_path'])
result = con.execute(sql_query).fetchone()
patient_count = result[0] if result else 0
con.close()
findings = f"Generated SQL Query:\n{sql_query}\n\nEstimated eligible patient count from the database: {patient_count}."
print(f"Query executed successfully. Estimated patient count: {patient_count}")
except Exception as e:
findings = f"Error executing SQL query: {e}. Defaulting to a count of 0."
print(f"Error during query execution: {e}")
return AgentOutput(agent_name="Patient Cohort Analyst", findings=findings)最后是 ??criteria_synthesizer???,將各專家發現匯織為正式的“入排標準(Inclusion/Exclusion Criteria)”文檔。支持在 SOP 中動態切換 ??synthesizer_model??:
def criteria_synthesizer(state: GuildState) -> GuildState:
"""Synthesizes all the structured findings from the specialist agents into the final criteria document."""
print("--- EXECUTING CRITERIA SYNTHESIZER ---")
sop = state['sop']
drafter_llm = ChatOllama(model=sop.synthesizer_model, temperature=0.2)
context = "\n\n---\n\n".join([f"**{out.agent_name} Findings:**\n{out.findings}"for out in state['agent_outputs']])
prompt = f"{sop.synthesizer_prompt}\n\n**Context from Specialist Teams:**\n{context}"
print(f"Synthesizer is using model '{sop.synthesizer_model}'.")
response = drafter_llm.invoke(prompt)
print("Final criteria generated.")
return {**state, "final_criteria": response.content}使用 LangGraph 編排
將以上 agent 節點用 LangGraph 編排:Planner → 專家并行執行 → Synthesizer。

Guild with langgraph (Created by Fareed Khan)
定義“調度節點”,根據 plan 分派任務:
from langgraph.graph import StateGraph, END
def specialist_execution_node(state: GuildState) -> GuildState:
"""This node acts as a dispatcher, executing all specialist tasks defined in the plan."""
plan_tasks = state['plan']['plan']
outputs = []
for task in plan_tasks:
agent_name = task['agent']
task_desc = task['task_description']
if"Regulatory"in agent_name:
output = retrieval_agent(task_desc, state, "fda_retriever", "Regulatory Specialist")
elif"Medical"in agent_name:
output = retrieval_agent(task_desc, state, "pubmed_retriever", "Medical Researcher")
elif"Ethics"in agent_name and state['sop'].use_ethics_specialist:
output = retrieval_agent(task_desc, state, "ethics_retriever", "Ethics Specialist")
elif"Cohort"in agent_name:
output = patient_cohort_analyst(task_desc, state)
else:
continue
outputs.append(output)
return {**state, "agent_outputs": outputs}構建與編譯 graph:
workflow = StateGraph(GuildState)
workflow.add_node("planner", planner_agent)
workflow.add_node("execute_specialists", specialist_execution_node)
workflow.add_node("synthesizer", criteria_synthesizer)
workflow.set_entry_point("planner")
workflow.add_edge("planner", "execute_specialists")
workflow.add_edge("execute_specialists", "synthesizer")
workflow.add_edge("synthesizer", END)
guild_graph = workflow.compile()
print("Graph compiled successfully.")可選圖形化略。至此,“Inner Loop” 多代理 RAG 管線搭建完畢。
完整運行公會工作流圖
用 baseline SOP 和真實試驗概念進行端到端測試,驗證 agents、數據存儲與編排邏輯是否協作正常,并產出我們的首個“baseline”輸出,供后續評估與進化環路使用。

Run Workflow (Created by Fareed Khan)
test_request = "Draft inclusion/exclusion criteria for a Phase II trial of 'Sotagliflozin', a novel SGLT2 inhibitor, for adults with uncontrolled Type 2 Diabetes (HbA1c > 8.0%) and moderate chronic kidney disease (CKD Stage 3)."
print("Running the full Guild graph with baseline SOP v1.0...")
graph_input = {
"initial_request": test_request,
"sop": baseline_sop
}
final_result = guild_graph.invoke(graph_input)
print("\nFinal Guild Output:")
print("---------------------")
print(final_result['final_criteria'])
輸出日志顯示每個 agent 的執行過程,并最終得到結構良好的入排標準文檔。至此,我們已構建并測試了一套基于真實數據源的多代理 RAG 流水線。
多維度評價體系
一個能自我改進的系統,必須能夠衡量自己的表現。我們需要的不只是單一分數(如 accuracy),而是多維度質量評估。我們將構建一個多維評估套件,對 Guild 輸出在我們最初就確定的“五大支柱”上進行評分。這將為“外層進化環路”提供豐富、可操作的反饋信號。

Multi-dimension Eval (Created by Fareed Khan)
本節計劃:
- LLM-as-a-Judge:用 llama3:70b 構建三個"專家評委",分別評 Scientific Rigor、Regulatory Compliance、Ethical Soundness。
- 程序化評估:用兩段快速、可靠、客觀的程序化函數,評 Recruitment Feasibility 與 Operational Simplicity。
- 匯總評估器:將五個單項評估封裝為一個總評函數,接收 Guild 輸出并生成 5D 性能向量,供 AI Director 決策使用。
為每個參數構建自定義評估器
首先定義 LLM 評委的統一輸出結構:
from langchain_core.pydantic_v1 import BaseModel, Field
class GradedScore(BaseModel):
"""A Pydantic model to structure the output of our LLM-as-a-Judge evaluators."""
score: float = Field(description="A score from 0.0 to 1.0")
reasoning: str = Field(description="A brief justification for the score.")
- Scientific Rigor:
from langchain_core.prompts import ChatPromptTemplate
def scientific_rigor_evaluator(generated_criteria: str, pubmed_context: str) -> GradedScore:
evaluator_llm = llm_config['director'].with_structured_output(GradedScore)
prompt = ChatPromptTemplate.from_messages([
("system", "You are an expert clinical scientist. ..."),
("human", "Evaluate the following criteria:\n\n**Generated Criteria:**\n{criteria}\n\n**Supporting Scientific Context:**\n{context}")
])
chain = prompt | evaluator_llm
return chain.invoke({"criteria": generated_criteria, "context": pubmed_context})- Regulatory Compliance:
def regulatory_compliance_evaluator(generated_criteria: str, fda_context: str) -> GradedScore:
evaluator_llm = llm_config['director'].with_structured_output(GradedScore)
prompt = ChatPromptTemplate.from_messages([
("system", "You are an expert regulatory affairs specialist. ..."),
("human", "Evaluate the following criteria:\n\n**Generated Criteria:**\n{criteria}\n\n**Applicable FDA Guidelines:**\n{context}")
])
chain = prompt | evaluator_llm
return chain.invoke({"criteria": generated_criteria, "context": fda_context})- Ethical Soundness:
def ethical_soundness_evaluator(generated_criteria: str, ethics_context: str) -> GradedScore:
evaluator_llm = llm_config['director'].with_structured_output(GradedScore)
prompt = ChatPromptTemplate.from_messages([
("system", "You are an expert on clinical trial ethics. ..."),
("human", "Evaluate the following criteria:\n\n**Generated Criteria:**\n{criteria}\n\n**Ethical Principles:**\n{context}")
])
chain = prompt | evaluator_llm
return chain.invoke({"criteria": generated_criteria, "context": ethics_context})- Recruitment Feasibility(程序化):
def feasibility_evaluator(cohort_analyst_output: AgentOutput) -> GradedScore:
findings_text = cohort_analyst_output.findings
try:
count_str = findings_text.split("database: ")[1].replace('.', '')
patient_count = int(count_str)
except (IndexError, ValueError):
return GradedScore(score=0.0, reasoning="Could not parse patient count from analyst output.")
IDEAL_COUNT = 150.0
score = min(1.0, patient_count / IDEAL_COUNT)
reasoning = f"Estimated {patient_count} eligible patients. Score is normalized against an ideal target of {int(IDEAL_COUNT)}."
return GradedScore(score=score, reasoning=reasoning)
- Operational Simplicity(程序化):
def simplicity_evaluator(generated_criteria: str) -> GradedScore:
EXPENSIVE_TESTS = ["mri", "genetic sequencing", "pet scan", "biopsy", "echocardiogram", "endoscopy"]
test_count = sum(1 for test in EXPENSIVE_TESTS if test in generated_criteria.lower())
score = max(0.0, 1.0 - (test_count * 0.5))
reasoning = f"Found {test_count} expensive/complex screening procedures mentioned."
return GradedScore(score=score, reasoning=reasoning)
創建聚合型 LangSmith 評估器
定義總評結果模型與匯總函數:
class EvaluationResult(BaseModel):
rigor: GradedScore
compliance: GradedScore
ethics: GradedScore
feasibility: GradedScore
simplicity: GradedScore
def run_full_evaluation(guild_final_state: GuildState) -> EvaluationResult:
"""Orchestrates the entire evaluation process, calling each of the five specialist evaluators."""
print("--- RUNNING FULL EVALUATION GAUNTLET ---")
final_criteria = guild_final_state['final_criteria']
agent_outputs = guild_final_state['agent_outputs']
pubmed_context = next((o.findings for o in agent_outputs if o.agent_name == "Medical Researcher"), "")
fda_context = next((o.findings for o in agent_outputs if o.agent_name == "Regulatory Specialist"), "")
ethics_context = next((o.findings for o in agent_outputs if o.agent_name == "Ethics Specialist"), "")
analyst_output = next((o for o in agent_outputs if o.agent_name == "Patient Cohort Analyst"), None)
print("Evaluating: Scientific Rigor...")
rigor = scientific_rigor_evaluator(final_criteria, pubmed_context)
print("Evaluating: Regulatory Compliance...")
compliance = regulatory_compliance_evaluator(final_criteria, fda_context)
print("Evaluating: Ethical Soundness...")
ethics = ethical_soundness_evaluator(final_criteria, ethics_context)
print("Evaluating: Recruitment Feasibility...")
feasibility = feasibility_evaluator(analyst_output) if analyst_output else GradedScore(score=0, reasoning="Analyst did not run.")
print("Evaluating: Operational Simplicity...")
simplicity = simplicity_evaluator(final_criteria)
print("--- EVALUATION GAUNTLET COMPLETE ---")
return EvaluationResult(rigor=rigor, compliance=compliance, ethics=ethics, feasibility=feasibility, simplicity=simplicity)
對 baseline 輸出運行評估,示例結果顯示"Feasibility"維度明顯偏低(約 0.39),這為外層 AI Director 指出了明確的改進方向。
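這一步的調用方式大致如下(final_result 沿用上文整圖運行的結果;baseline_eval 是本文示意用的變量名):
# 對 baseline 運行結果執行五維評估,并以"性能向量"的形式打印
baseline_eval = run_full_evaluation(final_result)
for dim, graded in baseline_eval.dict().items():
    print(f"{dim:>12}: {graded['score']:.2f} - {graded['reasoning']}")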
進化引擎的外層循環
現在構建系統的“大腦”——“AI Research Director”(外層進化回路)。其職責不是設計試驗,而是改進“設計試驗”的過程:分析 5D 評分、診斷根因、智能改寫 GuildSOP。這是系統學習與自適應的核心。

Outer Loop (Created by Fareed Khan)
本節計劃:
- 創建“基因池(gene pool)”:管理 SOP 演化版本及其評分,形成可追溯的“基因史”。
- 設計 Director 級別代理:Performance Diagnostician 識別弱點;SOP Architect 提出改良方案。
- 架構進化循環:定義完整一代的進化過程:Diagnose → Evolve → Evaluate。
- 運行一次全流程:展示系統如何自主發現“可行性”弱點并產生新的 SOP 變體修復它。
管理配置
定義 SOPGenePool,存儲 SOP、評分與"父版本"信息:
class SOPGenePool:
def __init__(self):
self.pool: List[Dict[str, Any]] = []
self.version_counter = 0
def add(self, sop: GuildSOP, eval_result: EvaluationResult, parent_version: Optional[int] = None):
self.version_counter += 1
entry = {
"version": self.version_counter,
"sop": sop,
"evaluation": eval_result,
"parent": parent_version
}
self.pool.append(entry)
print(f"Added SOP v{self.version_counter} to the gene pool.")
def get_latest_entry(self) -> Optional[Dict[str, Any]]:
return self.pool[-1] if self.pool else None
構建主任級智能體(Director-Level Agents)
先是 Performance Diagnostician,它分析 5D 向量并給出結構化診斷:
class Diagnosis(BaseModel):
primary_weakness: Literal['rigor', 'compliance', 'ethics', 'feasibility', 'simplicity']
root_cause_analysis: str = Field(...)
recommendation: str = Field(...)
def performance_diagnostician(eval_result: EvaluationResult) -> Diagnosis:
print("--- EXECUTING PERFORMANCE DIAGNOSTICIAN ---")
diagnostician_llm = llm_config['director'].with_structured_output(Diagnosis)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a world-class management consultant ..."),
("human", "Please analyze the following performance evaluation report:\n\n{report}")
])
chain = prompt | diagnostician_llm
return chain.invoke({"report": eval_result.json()})再是 ??SOP Architect??,根據診斷與當前 SOP 生成多個“變體” SOP 作為候選:
class EvolvedSOPs(BaseModel):
mutations: List[GuildSOP]
def sop_architect(diagnosis: Diagnosis, current_sop: GuildSOP) -> EvolvedSOPs:
print("--- EXECUTING SOP ARCHITECT ---")
architect_llm = llm_config['director'].with_structured_output(EvolvedSOPs)
prompt = ChatPromptTemplate.from_messages([
("system", f"You are an AI process architect. ... schema: {GuildSOP.schema_json()} ..."),
("human", "Here is the current SOP:\n{current_sop}\n\nHere is the performance diagnosis:\n{diagnosis}\n\nBased on the diagnosis, please generate 2-3 new, improved SOPs.")
])
chain = prompt | architect_llm
return chain.invoke({"current_sop": current_sop.json(), "diagnosis": diagnosis.json()})運行完整的進化循環
封裝一次完整的進化循環:
def run_evolution_cycle(gene_pool: SOPGenePool, trial_request: str):
print("\n" + "="*25 + " STARTING NEW EVOLUTION CYCLE " + "="*25)
current_best_entry = gene_pool.get_latest_entry()
parent_sop = current_best_entry['sop']
parent_eval = current_best_entry['evaluation']
parent_version = current_best_entry['version']
print(f"Improving upon SOP v{parent_version}...")
diagnosis = performance_diagnostician(parent_eval)
print(f"Diagnosis complete. Primary Weakness: '{diagnosis.primary_weakness}'. Recommendation: {diagnosis.recommendation}")
new_sop_candidates = sop_architect(diagnosis, parent_sop)
print(f"Generated {len(new_sop_candidates.mutations)} new SOP candidates.")
for i, candidate_sop in enumerate(new_sop_candidates.mutations):
print(f"\n--- Testing SOP candidate {i+1}/{len(new_sop_candidates.mutations)} ---")
guild_input = {"initial_request": trial_request, "sop": candidate_sop}
final_state = guild_graph.invoke(guild_input)
eval_result = run_full_evaluation(final_state)
gene_pool.add(sop=candidate_sop, eval_result=eval_result, parent_version=parent_version)
print("\n" + "="*25 + " EVOLUTION CYCLE COMPLETE " + "="*26)初始化基因池、加入 baseline、運行一輪進化。示例輸出顯示:診斷識別“Feasibility”為主要弱項;Architect 生成兩個候選 SOP;測試后某個候選(v2)顯著提升 Feasibility(例如 0.81),且僅以輕微 Rigor 代價換取巨大實際可行性收益;另一個候選(v3)則未帶來改進。
基于五維的帕累托分析
進化循環完成一代。現在需要對結果進行多目標優化分析。在多目標問題中往往不存在單一“最好”解,而是存在“帕累托前沿(Pareto Frontier)”。目標是識別這一前沿并呈現給人類決策者。
本節計劃:
- 分析基因池:打印所有 SOP 及其 5D 評分的摘要,以觀察變體的直接影響。
- 識別 Pareto Front:編寫函數程序化識別基因池中的非支配解(non-dominated solutions)。
- 可視化前沿:用并行坐標圖(parallel coordinates plot)展示 5D 維度的權衡,讓 trade-off 一目了然。
打印摘要略。然后識別 Pareto 前沿:
import numpy as np
def identify_pareto_front(gene_pool: SOPGenePool) -> List[Dict[str, Any]]:
pareto_front = []
pool_entries = gene_pool.pool
for i, candidate in enumerate(pool_entries):
is_dominated = False
cand_scores = np.array([s['score'] for s in candidate['evaluation'].dict().values()])
for j, other in enumerate(pool_entries):
if i == j: continue
other_scores = np.array([s['score'] for s in other['evaluation'].dict().values()])
if np.all(other_scores >= cand_scores) and np.any(other_scores > cand_scores):
is_dominated = True
break
if not is_dominated:
pareto_front.append(candidate)
return pareto_front
運行后通常得到 v1 與 v2 共同構成帕累托前沿:v1 是"最大化 Rigor"的策略,v2 是"高 Feasibility"的策略。在實際決策中如何取舍,取決于業務優先級。
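調用方式大致如下(實際的前沿成員取決于你本地的運行結果):
# 在完整基因池上識別非支配解,并打印它們的版本號
pareto_front = identify_pareto_front(gene_pool)
print("SOPs on the Pareto front:", [f"v{entry['version']}" for entry in pareto_front])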
可視化前沿并做出決策
使用 2D 散點圖(Rigor vs. Feasibility)與 5D 并行坐標圖可視化:
import matplotlib.pyplot as plt
import pandas as pd
def visualize_frontier(pareto_sops):
if not pareto_sops:
print("No SOPs on the Pareto front to visualize.")
return
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 7))
labels = [f"v{s['version']}"for s in pareto_sops]
rigor_scores = [s['evaluation'].rigor.score for s in pareto_sops]
feasibility_scores = [s['evaluation'].feasibility.score for s in pareto_sops]
ax1.scatter(rigor_scores, feasibility_scores, s=200, alpha=0.7, c='blue')
for i, txt in enumerate(labels):
ax1.annotate(txt, (rigor_scores[i], feasibility_scores[i]), xytext=(10,-10), textcoords='offset points', fontsize=14)
ax1.set_title('Pareto Frontier: Rigor vs. Feasibility', fontsize=16)
ax1.set_xlabel('Scientific Rigor Score', fontsize=14)
ax1.set_ylabel('Recruitment Feasibility Score', fontsize=14)
ax1.grid(True, linestyle='--', alpha=0.6)
ax1.set_xlim(min(rigor_scores)-0.05, max(rigor_scores)+0.05)
ax1.set_ylim(min(feasibility_scores)-0.1, max(feasibility_scores)+0.1)
data = []
for s in pareto_sops:
eval_dict = s['evaluation'].dict()
scores = {k.capitalize(): v['score'] for k, v in eval_dict.items()}
scores['SOP Version'] = f"v{s['version']}"
data.append(scores)
df = pd.DataFrame(data)
pd.plotting.parallel_coordinates(df, 'SOP Version', colormap=plt.get_cmap("viridis"), ax=ax2, axvlines_kwds={"linewidth": 1, "color": "grey"})
ax2.set_title('5D Performance Trade-offs on Pareto Front', fontsize=16)
ax2.grid(True, which='major', axis='y', linestyle='--', alpha=0.6)
ax2.set_ylabel('Normalized Score', fontsize=14)
ax2.legend(loc='lower center', bbox_to_anchor=(0.5, -0.15), ncol=len(labels))
plt.tight_layout()
plt.show()
渲染結果直觀展示 v1 與 v2 在各維的差異:兩者在 Compliance、Ethics、Simplicity 上幾乎一致,只在 Rigor 與 Feasibility 上形成明顯權衡(典型的"交叉"形態)。
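沿用上文得到的 pareto_front,一行調用即可得到這兩幅圖:
# 繪制帕累托前沿的 2D 散點圖與 5D 并行坐標圖
visualize_frontier(pareto_front)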
理解認知工作流
我們已經從宏觀層面(進化、帕累托前沿)看到了系統如何自我改進。現在從微觀層面理解一次“高表現”運行的內部過程:agents 如何協作?瓶頸在哪里?多維得分如何轉化為可視化剖面?

Understand the Workflow (Created by Fareed Khan)
計劃:
- 對工作流加儀表(instrumentation):精確記錄每個 agent 的開始/結束/耗時。
- 可視化執行時間線:用甘特圖(Gantt chart)呈現工作流,顯示并行與串行階段。
- 用雷達圖(Radar Chart)對比 baseline 與 evolved SOP 的 5D 表現剖面。
可視化智能體工作流時間線
使用 graph 的 .stream() 方法逐節點獲取事件,記錄時間戳:
import time
from collections import defaultdict
def invoke_with_timing(graph, sop, request):
    """Invokes the Guild graph while capturing start and end times for each node."""
    print(f"--- Instrumenting Graph Run for SOP: {sop.dict()} ---")
    timing_data = []
    graph_input = {"initial_request": request, "sop": sop}
    final_state = dict(graph_input)
    prev_time = time.time()
    # With stream_mode="updates", each event maps the node that just finished to its state update,
    # so we can attribute elapsed wall-clock time to individual nodes.
    for event in graph.stream(graph_input, stream_mode="updates"):
        node_name = list(event.keys())[0]
        end_time = time.time()
        start_time = prev_time  # the node effectively started when the previous one finished
        timing_data.append({
            "node": node_name,
            "start_time": start_time,
            "end_time": end_time,
            "duration": end_time - start_time
        })
        prev_time = end_time
        final_state.update(event[node_name] or {})
    # Normalize all timestamps so the first node starts at t=0.
    overall_start_time = min(d['start_time'] for d in timing_data)
    for data in timing_data:
        data['start_time'] -= overall_start_time
        data['end_time'] -= overall_start_time
    return final_state, timing_data
對 v2 執行并捕獲時序數據(示例輸出顯示 execute_specialists 是主要耗時階段,符合預期)。
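調用方式大致如下(假設 v2 對應基因池中的第二個條目;v2_timing 為本文示意用的變量名):
# 取出 v2 的 SOP,帶計時地重跑一次 Guild 工作流
v2_entry = gene_pool.pool[1]
v2_state, v2_timing = invoke_with_timing(guild_graph, v2_entry['sop'], test_request)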
繪制甘特圖:
import matplotlib.pyplot as plt
def plot_gantt_chart(timing_data: List[Dict[str, Any]], title: str):
"""Plots a Gantt chart of the agentic workflow from timing data."""
fig, ax = plt.subplots(figsize=(12, 4))
labels = [d['node'] for d in timing_data]
ax.barh(labels, [d['duration'] for d in timing_data], left=[d['start_time'] for d in timing_data], color='skyblue')
ax.set_xlabel('Time (seconds)')
ax.set_title(title, fontsize=16)
ax.grid(True, which='major', axis='x', linestyle='--', alpha=0.6)
ax.invert_yaxis()
plt.show()
甘特圖清晰展示了串行的頂層流程與內部的并行機會,提示性能優化應聚焦 execute_specialists 階段。
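沿用上一步得到的 v2_timing,調用方式如下(標題文字僅為示意):
# 用甘特圖展示 v2 這次運行中各節點的起止時間與時長
plot_gantt_chart(v2_timing, "Guild Workflow Timeline (SOP v2)")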
使用雷達圖剖析輸出結果
用雷達圖對比 baseline v1 與 evolved v2 的 5D 剖面:
import pandas as pd
def plot_radar_chart(eval_results: List[EvaluationResult], labels: List[str]):
"""Creates a radar chart to compare the 5D performance of multiple SOPs."""
categories = ['Rigor', 'Compliance', 'Ethics', 'Feasibility', 'Simplicity']
num_vars = len(categories)
angles = np.linspace(0, 2 * np.pi, num_vars, endpoint=False).tolist()
angles += angles[:1]
fig, ax = plt.subplots(figsize=(8, 8), subplot_kw=dict(polar=True))
for i, result in enumerate(eval_results):
values = [res['score'] for res in result.dict().values()]
values += values[:1]
ax.plot(angles, values, linewidth=2, linestyle='solid', label=labels[i])
ax.fill(angles, values, alpha=0.25)
ax.set_yticklabels([])
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=12)
ax.set_title('5D Performance Profile Comparison', size=20, color='blue', y=1.1)
plt.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1))
plt.show()
圖中可見兩者在 Compliance、Ethics、Simplicity 上都很強;v1 在 Rigor 上略優,而 v2 在 Feasibility 上顯著占優,清晰呈現了兩者的 trade-off。
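對比 v1 與 v2 的調用大致如下(索引與標簽文字僅為示意):
# 對比 baseline(v1)與進化后(v2)的五維性能剖面
plot_radar_chart(
    eval_results=[gene_pool.pool[0]['evaluation'], gene_pool.pool[1]['evaluation']],
    labels=["SOP v1 (baseline)", "SOP v2 (evolved)"]
)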
將其轉變為自主策略
我們已設計、構建并演示了一套可自我改進的 agentic 系統。這不僅是一個解決方案,更是一套可擴展的基礎架構:分層代理設計、動態 SOP、多維評估、自動進化。這些原則打開了廣闊的未來空間:
- 持續運行進化循環:當前完成一代,未來可連續迭代數百代,以發現更豐富、更多樣的 Pareto Frontier(經過實戰檢驗的 SOP)。
- 將 Director 的推理蒸餾為更小的策略模型:基于成功變體的歷史進行訓練,用更快、更便宜的專用模型替換 70B Director,使進化更高效。
- 讓 AI Director 動態改變 Guild 的結構:根據試驗概念的需求,學習增刪專家(如新增“Biostatistician”),實現團隊層面的進化。
- 用實時 API 替換靜態 MIMIC-III:將 Patient Cohort Analyst 連接到安全的實時 EHR 系統,使可行性評估基于最新患者數據。
- 強化 SOP Architect 的進化操作符:引入"crossover"等機制,融合不同成功 SOP 的優勢,加速新策略發現。
- 融合人類專家反饋:將臨床科學家的評分接入評估回路,用專家判斷作為最終"獎勵信號",引導系統趨向"技術最優 + 實踐卓越"的方案。
原文地址:https://medium.com/gitconnected/building-a-self-improving-agentic-rag-system-f55003af44c4
本文轉載自 PyTorch研習社,作者:AI研究生