
Evaluating RAG Systems with Synthetic Data: A Hands-On DeepEval Guide (Original)

Published 2025-10-17 08:38

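When building a RAG (Retrieval-Augmented Generation) system, many people run into the same doubts:

"The model seems able to answer questions, but is it making things up?" "Is the retriever actually finding the right documents?" "How do I know whether the system as a whole is reliable?"

These questions share one root cause: we lack a systematic evaluation method. Early in a project, before any real user data exists, validating a RAG pipeline is harder still.

This article walks through one practical approach: using DeepEval to generate synthetic data and evaluate your RAG pipeline systematically.

It covers each step in turn: installing dependencies, generating data, controlling complexity, and the evaluation logic. By the end you should be able to stand up an automated evaluation workflow and understand why synthetic data is such a useful starting point for RAG testing.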

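1. Why evaluate RAG with synthetic data?

In real business scenarios, we want a RAG system to have three core capabilities:

  1. Accurate retrieval (Retriever): it finds the documents most relevant to the question;
  2. Reliable generation (LLM): answers must be grounded in a source, not made up;
  3. Appropriate context (Context): input length and information density should be well balanced.

Before launch, however, we rarely have enough real questions and feedback samples, which makes it hard to tell whether the model's answers are actually well grounded.

Synthetic data fills this gap.

By automatically generating simulated user questions plus ideal answers (golden pairs), we can build a repeatable test set ahead of time, one that:

  • does not depend on real users;
  • can systematically cover different question types;
  • can be rerun to verify each optimization of the retriever and generator.

DeepEval is the core tool for this process.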

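2. DeepEval: an open-source framework built for LLM evaluation

DeepEval is an open-source framework dedicated to evaluating large language models, and it supports many scenarios, including RAG pipelines. Its main strengths are:

  • Automatic synthetic test data: the built-in Synthesizer class generates realistic QA pairs from documents;
  • Multi-dimensional metrics: from Grounding (is the answer supported by a source?) and Context Relevance to Faithfulness (factual consistency);
  • Extensible configuration: EvolutionConfig controls the complexity and type of generated samples.

Now for the hands-on part.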

3. Install dependencies and prepare the environment

First, install the required libraries.

pip install deepeval chromadb tiktoken pandas

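After installation, configure your OpenAI API key. DeepEval calls an external model (such as GPT-4) to generate and evaluate data.

Go to the OpenAI API management page, create a new API key, and set it as an environment variable: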

export OPENAI_API_KEY="sk-xxxxxxx"

Tip: first-time OpenAI API users may need to add a payment method and top up about $5 before the key becomes active.
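
Before spending time on generation, it can help to confirm that the key is actually visible to Python. The snippet below is a minimal sketch using only the standard library; `has_openai_key` is a hypothetical helper name, not part of DeepEval or the OpenAI SDK, and it only checks that the variable is non-empty, not that the key is valid.

```python
import os


def has_openai_key(env=None) -> bool:
    """Return True if OPENAI_API_KEY is set to a non-empty value."""
    # Accept an explicit mapping so the check is easy to test without touching os.environ.
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set; export it before running DeepEval.")
```

Running this at the top of an evaluation script gives an immediate, readable message when the key is missing, before any model call is attempted.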

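4. Prepare the source text: raw material for synthetic QA

Next, prepare a source text to serve as the corpus for synthetic data. It should be as varied in content, clear in meaning, and factually accurate as possible.

For example: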

text = """
Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
In contrast, the archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Meanwhile, in the world of physics, superconductors can carry electric current with zero resistance -- a phenomenon
discovered over a century ago but still unlocking new technologies like quantum computers today.
...
"""

Save it as a text file:

with open("example.txt", "w", encoding="utf-8") as f:
    f.write(text)

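Tip: you can swap in your own content, such as a project knowledge base, technical documentation, or an internal FAQ, so the generated evaluation samples are closer to your real workload.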

5. Generate synthetic data automatically (synthetic goldens)

DeepEval's core Synthesizer class can read documents directly and generate QA pairs.

from deepeval.synthesizer import Synthesizer

synthesizer = Synthesizer(model="gpt-4.1-nano")

# Generate synthetic goldens from the document
synthesizer.generate_goldens_from_docs(
    document_paths=["example.txt"],
    include_expected_output=True
)

# Print the first few results
for golden in synthesizer.synthetic_goldens[:3]:
    print(golden, "\n")

Sample output:

Input: Evaluate the cognitive abilities of corvids in facial recognition tasks.
Expected Output: Crows can recognize human faces and remember them for years, showing advanced memory and problem-solving.
Context: "Crows are among the smartest birds..."

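As you can see, each sample contains:

  • the user question (input)
  • the ideal answer (expected output)
  • the source passage (context)

These are our golden pairs, which can be used to verify model performance later.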
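
To make the test set repeatable, the three fields above can be frozen to disk once and reloaded for every later run. The sketch below uses only the standard library; `GoldenRecord`, `save_goldens`, and `load_goldens` are hypothetical names introduced for illustration, not DeepEval classes, and the sample record is hand-written rather than real synthesizer output.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path
from typing import List


@dataclass
class GoldenRecord:
    """Hypothetical container for one synthetic test case (not a DeepEval class)."""
    input: str
    expected_output: str
    context: str


def save_goldens(records: List[GoldenRecord], path: Path) -> None:
    # Store the test set as JSON so every evaluation run uses the same cases.
    payload = [asdict(r) for r in records]
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2), encoding="utf-8")


def load_goldens(path: Path) -> List[GoldenRecord]:
    # Rebuild the records from the JSON file written by save_goldens.
    return [GoldenRecord(**item) for item in json.loads(path.read_text(encoding="utf-8"))]


records = [
    GoldenRecord(
        input="Which birds can recognize human faces?",
        expected_output="Crows can recognize human faces, even after years.",
        context="Crows are among the smartest birds, capable of using tools "
                "and recognizing human faces even after years.",
    )
]
path = Path(tempfile.gettempdir()) / "goldens_example.json"
save_goldens(records, path)
reloaded = load_goldens(path)
print(f"Reloaded {len(reloaded)} golden(s); identical to the originals: {reloaded == records}")
```

Keeping the file under version control means a change in scores can be attributed to the pipeline rather than to a freshly generated, different set of questions.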

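6. Control sample complexity: the power of EvolutionConfig

Generating QA pairs alone is not enough. We also need to control the complexity and diversity of the generated questions so the tests look more like what real users ask.

DeepEval provides EvolutionConfig, which adjusts generation through "evolution" strategies.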

from deepeval.synthesizer.config import EvolutionConfig, Evolution

evolution_config = EvolutionConfig(
    evolutions={
        Evolution.REASONING: 1/5,
        Evolution.MULTICONTEXT: 1/5,
        Evolution.COMPARATIVE: 1/5,
        Evolution.HYPOTHETICAL: 1/5,
        Evolution.IN_BREADTH: 1/5,
    },
    num_evolutions=3
)

synthesizer = Synthesizer(evolution_config=evolution_config)
synthesizer.generate_goldens_from_docs(["example.txt"])

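With this in place, the generated samples go beyond simple QA and cover:

  • reasoning questions (Reasoning)
  • multi-context questions (MultiContext)
  • comparative questions (Comparative)
  • hypothetical scenarios (Hypothetical)
  • broad exploratory questions (InBreadth)

For example:

Q: Compare the significance of Voyager 1's golden record and the Library of Alexandria in human history. A: Both are symbols of human knowledge and civilization; the former travels across space, while the latter witnessed the dawn of civilization.

Data like this tests the model's multi-step reasoning and its ability to integrate information.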
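
To build intuition for what the weights and num_evolutions in the configuration mean, the sketch below simulates drawing a chain of evolution types. It assumes the weights act as sampling probabilities and that one type is drawn per evolution step; this is an illustration of that assumption only, not DeepEval's internal algorithm, and the plain strings stand in for the Evolution enum members.

```python
import random
from collections import Counter

# Same five equal weights as in the configuration; strings replace the enum.
weights = {
    "REASONING": 1 / 5,
    "MULTICONTEXT": 1 / 5,
    "COMPARATIVE": 1 / 5,
    "HYPOTHETICAL": 1 / 5,
    "IN_BREADTH": 1 / 5,
}

# Sanity check: the weights should form a probability distribution.
assert abs(sum(weights.values()) - 1.0) < 1e-9


def sample_evolution_chain(weights, num_evolutions, rng):
    """Draw one evolution type per step, proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=num_evolutions)


rng = random.Random(42)  # fixed seed so the simulation is repeatable
chains = [sample_evolution_chain(weights, 3, rng) for _ in range(1000)]
counts = Counter(step for chain in chains for step in chain)
for name, count in sorted(counts.items()):
    print(f"{name}: {count / 3000:.2%}")
```

With equal weights each type should appear in roughly a fifth of the steps; raising one weight (and lowering the others so the total stays at 1) shifts the mix toward that question style.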

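7. Build an iterative evaluation loop: closing the loop on RAG improvement

Once we have high-quality synthetic data, we can move on to the core step: the RAG evaluation loop.

A typical flow looks like this:

  1. Retriever test: verify the relevance of the retrieved documents;
  2. LLM evaluation: check whether the generated answer is based on the context;
  3. Metric calculation: for example Grounding, Context Relevance, and Faithfulness;
  4. Feedback and optimization: adjust the retrieval strategy or the prompt;
  5. Re-evaluation: see whether the metrics improve.

This is a complete Iterative RAG Improvement Loop.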
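
The five steps can be sketched as a small control loop. The code below is a deliberately simplified illustration: `improvement_loop` and `fake_evaluate` are hypothetical names, only one parameter (top_k) is adjusted, and the toy evaluator simply returns scores that rise with top_k instead of calling a real retriever or LLM.

```python
from typing import Callable, Dict, List, Tuple


def improvement_loop(
    evaluate: Callable[[int], Dict[str, float]],
    initial_top_k: int,
    target: float,
    max_rounds: int,
    max_top_k: int,
) -> List[Tuple[int, Dict[str, float]]]:
    """Evaluate, adjust one retrieval parameter, and re-evaluate until the target is met."""
    history = []
    top_k = initial_top_k
    for _ in range(max_rounds):
        metrics = evaluate(top_k)            # steps 1-3: retrieve, generate, score
        history.append((top_k, metrics))
        if min(metrics.values()) >= target:  # every metric is good enough: stop
            break
        if top_k >= max_top_k:               # nothing left to adjust
            break
        top_k += 1                           # step 4: adjust the retrieval strategy
    return history                           # step 5 is simply the next iteration


# Toy stand-in for a real pipeline: scores rise with top_k. Purely illustrative.
def fake_evaluate(top_k: int) -> Dict[str, float]:
    score = min(1.0, 0.4 + 0.1 * top_k)
    return {"grounding": score, "context_relevance": score, "faithfulness": score}


for top_k, metrics in improvement_loop(fake_evaluate, 1, target=0.65, max_rounds=6, max_top_k=8):
    print(top_k, {name: round(value, 2) for name, value in metrics.items()})
```

In a real project the `evaluate` callable would run the pipeline over the goldens and return averaged metrics, and the adjustment step could also change the prompt or chunking rather than only top_k.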

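The key point is:

You do not have to wait for real users to hit the problems; synthetic data already lets you find the system's weak spots in advance.

As the retriever's recall improves and the LLM's factual consistency strengthens, the risk of taking the system live drops.

The full hands-on code is at the end of this article.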

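8. Practical advice and extensions

If you plan to adopt DeepEval in a real project, consider the following:

  • Corpus selection: prefer structured or knowledge-dense documents, such as product manuals and internal FAQs;
  • Model configuration: use a lightweight model (such as gpt-4.1-nano) while iterating on the evaluation, then switch to a full model for formal validation;
  • Result analysis: pair the evaluation with a vector store such as ChromaDB and track how each metric changes;
  • Automation: embed the evaluation script in your CI/CD pipeline so it runs automatically whenever the retriever or prompt changes.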
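
For the CI/CD point, one possible gate is a small check that compares the latest metrics with minimum scores and produces a non-zero exit code when any of them falls short. The sketch below assumes a report shaped like the one written by the script at the end of this article (a "history" list whose entries carry an "aggregate" dict); `find_failures` and the 0.7 thresholds are illustrative assumptions, and the sample report is made up for the demonstration.

```python
from typing import Dict, List

# Hypothetical minimum scores; tune them for your own project.
THRESHOLDS = {
    "avg_grounding": 0.7,
    "avg_faithfulness": 0.7,
    "avg_context_relevance": 0.7,
}


def find_failures(report: Dict, thresholds: Dict[str, float]) -> List[str]:
    """Compare the last iteration's aggregate metrics with the thresholds."""
    if not report.get("history"):
        return ["report contains no evaluation runs"]
    latest = report["history"][-1]["aggregate"]
    return [
        f"{name}={latest.get(name, 0.0):.3f} is below {minimum:.2f}"
        for name, minimum in thresholds.items()
        if latest.get(name, 0.0) < minimum
    ]


# Made-up report for demonstration; in CI this would be loaded from the JSON file.
sample_report = {
    "history": [
        {
            "iteration": 1,
            "aggregate": {
                "avg_grounding": 0.82,
                "avg_faithfulness": 0.55,
                "avg_context_relevance": 0.74,
            },
        }
    ]
}

failures = find_failures(sample_report, THRESHOLDS)
for failure in failures:
    print(f"[FAIL] {failure}")
exit_code = 1 if failures else 0  # pass this to sys.exit() in a CI job
print("exit code:", exit_code)
```

In a pipeline, the job would load the report with `json.load`, call `find_failures`, and finish with `sys.exit(exit_code)` so that a regression blocks the merge.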

Over time, this moves your RAG system from "it feels like it works" to "the metrics show it works".

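9. Summary: RAG evaluation no longer has to be a black box

The difficulty with RAG evaluation is that the system often "looks right" while the reliability behind it is hard to verify. DeepEval makes this quantifiable, reproducible, and open to continuous improvement.

The value of synthetic data is not that it replaces real users, but that it establishes a controlled test environment ahead of time. With mechanisms such as EvolutionConfig, we can even simulate users asking a range of complex questions and probe the limits of the system's reasoning and retrieval.

In one sentence:

When you have no user data yet, synthetic data is the best evaluation baseline; during continuous optimization, DeepEval is your automated coach.

Appendix: hands-on code: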

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
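"""
rag_iterative_eval_full.py
Full example: iterative evaluation loop (closed-loop RAG improvement)
Features:
  - Generate/read the document
  - Generate synthetic goldens (OpenAI or rule-based; the DeepEval call is skipped in this script)
  - Build a retriever (OpenAI embeddings or TF-IDF)
  - Call an LLM with the retrieved context to generate answers (OpenAI, or a simple sentence-matching fallback)
  - Compute grounding / context_relevance / faithfulness metrics
  - Automatically adjust top_k and temperature based on the metrics (closing the loop)
  - Save and print the results of each round
Author: jilolo
Date: 2025-10
"""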

import os
import json
import time
import math
import random
import hashlib
from typing import List, Dict, Any, Tuple
from collections import defaultdict, Counter

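# optional imports
# NOTE: the OpenAI calls in this script (openai.ChatCompletion.create and
# openai.Embedding.create) use the pre-1.0 `openai` SDK interface. With
# openai>=1.0 they are expected to raise an error, which the except blocks
# catch, so the script prints a warning and falls back to its offline heuristics.
try:
    import openai
except Exception:
    openai = None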

try:
    import numpy as np
    from numpy.linalg import norm
    NUMPY_AVAILABLE = True
except Exception:
    NUMPY_AVAILABLE = False

try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    SKLEARN_AVAILABLE = True
except Exception:
    SKLEARN_AVAILABLE = False

try:
    from tqdm import tqdm
    TQDM_AVAILABLE = True
except Exception:
    TQDM_AVAILABLE = False

# -------------------------
# CONFIG
# -------------------------
CONFIG = {
    "OPENAI_API_KEY": os.getenv("OPENAI_API_KEY", ""),
    "OPENAI_EMBEDDING_MODEL": "text-embedding-3-small",
    "OPENAI_COMPLETION_MODEL": "gpt-4o-mini",  # change to available model
    "DOC_PATH": "example.txt",
    "NUM_GOLDENS": 12,
    "ITERATIONS": 6,
    "INITIAL_TOP_K": 3,
    "MAX_TOP_K": 8,
    "MIN_TOP_K": 1,
    "TEMPERATURE_OPTIONS": [0.0, 0.2, 0.5],
    "SEED": 42,
    "REPORT_FILE": "rag_eval_report.json",
    "SAVE_DIR": "rag_eval_runs",
    "PROMPT_TEMPLATE": (
        "You are a knowledgeable assistant. Use only the provided context snippets to answer the question. "
        "If the information is not present in the context, respond with 'Insufficient information in context.'\n\n"
        "Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"
    ),
    # metric thresholds for increasing/decreasing top_k
    "GROUNDING_GOOD": 0.7,
    "GROUNDING_BAD": 0.45,
    "FAITHFULNESS_GOOD": 0.7,
    "FAITHFULNESS_BAD": 0.45,
    "CONTEXT_RELEVANCE_GOOD": 0.7,
    "CONTEXT_RELEVANCE_BAD": 0.45,
}

random.seed(CONFIG["SEED"])
if openai and CONFIG["OPENAI_API_KEY"]:
    openai.api_key = CONFIG["OPENAI_API_KEY"]

# -------------------------
# Utilities
# -------------------------
def safe_print(*args, **kwargs):
    print(*args, **kwargs)

def ensure_dir(path: str):
    if not os.path.exists(path):
        os.makedirs(path, exist_ok=True)

def sha1_snippet(s: str) -> str:
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:10]

# -------------------------
# Example document (will write if missing)
# -------------------------
SAMPLE_TEXT = """Crows are among the smartest birds, capable of using tools and recognizing human faces even after years.
The archerfish displays remarkable precision, shooting jets of water to knock insects off branches.
Superconductors can carry electric current with zero resistance -- a phenomenon discovered over a century ago but still unlocking new technologies like quantum computers today.
The Library of Alexandria was once the largest center of learning, but much of its collection was lost in fires and wars.
Voyager 1 probe, launched in 1977, has left the solar system, carrying a golden record with sounds and images of Earth.
The Amazon rainforest produces roughly 20% of the world's oxygen.
Coral reefs support nearly 25% of all marine life despite covering less than 1% of the ocean floor.
MRI scanners use strong magnetic fields and radio waves to generate detailed images of organs without harmful radiation.
Moore's Law observed that the number of transistors on microchips doubles roughly every two years.
The Mariana Trench is the deepest part of Earth's oceans, reaching nearly 11,000 meters below sea level.
Ancient civilizations like the Sumerians and Egyptians invented mathematical systems thousands of years ago.
"""

def ensure_example_doc(path: str):
    if not os.path.exists(path):
        with open(path, "w", encoding="utf-8") as f:
            f.write(SAMPLE_TEXT)
        safe_print(f"[INFO] Wrote sample doc to {path}")

# -------------------------
# Synthetic golden generation (fallback-first approach)
# -------------------------
def simple_rule_based_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    """
    Very simple fallback: split document into sentences/paragraphs and craft simple Q/A.
    """
    with open(doc_path, "r", encoding="utf-8") as f:
        txt = f.read()
    paras = [p.strip() for p in txt.split("\n") if p.strip()]
    goldens = []
    for p in paras:
        q = f"What is one key fact from the following sentence: '{p[:120]}...'? "
        a = p
        goldens.append({"input": q, "expected_output": a, "context": p})
        if len(goldens) >= num:
            break
    return goldens

def openai_synthesize_goldens(doc_path: str, num: int = 12, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> List[Dict[str, str]]:
    """
    Try to use OpenAI to synthesize question-answer pairs.
    If OpenAI is not configured or API call fails, fall back to rule-based generation.
    """
    if openai is None or not getattr(openai, "api_key", None):
        safe_print("[WARN] OpenAI key not found - using rule-based goldens")
        return simple_rule_based_goldens(doc_path, num)
    with open(doc_path, "r", encoding="utf-8") as f:
        doc = f.read()

    prompt = (
        f"You are a dataset creator. Given the document below, produce {num} question-answer pairs. "
        f"For each pair, provide 'question', 'answer' (concise and grounded in the doc), and 'context' (the snippet). "
        f"Return a JSON array of objects.\n\nDocument:\n{doc}\n\n"
    )

    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You generate QA pairs."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.0,
            max_tokens=1500
        )
        text = resp["choices"][0]["message"]["content"]
        # find JSON in text
        start = text.find("[")
        if start >= 0:
            json_text = text[start:]
            try:
                arr = json.loads(json_text)
                goldens = []
                for item in arr[:num]:
                    q = item.get("question") or item.get("input") or item.get("q") or ""
                    a = item.get("answer") or item.get("expected_output") or ""
                    c = item.get("context") or ""
                    goldens.append({"input": q.strip(), "expected_output": a.strip(), "context": c.strip()})
                safe_print(f"[INFO] OpenAI synthesized {len(goldens)} goldens.")
                return goldens
            except Exception as e:
                safe_print("[WARN] Failed to parse JSON from OpenAI output:", e)
                return simple_rule_based_goldens(doc_path, num)
        else:
            safe_print("[WARN] OpenAI response lacking JSON - using rule-based fallback.")
            return simple_rule_based_goldens(doc_path, num)
    except Exception as e:
        safe_print("[ERROR] OpenAI call failed:", e)
        return simple_rule_based_goldens(doc_path, num)

def generate_goldens(doc_path: str, num: int = 12) -> List[Dict[str, str]]:
    # Attempt DeepEval if installed (not required here); else OpenAI; else rule-based
    # To keep dependencies light in this script we skip DeepEval auto-call.
    return openai_synthesize_goldens(doc_path, num)

# -------------------------
# Retriever: TF-IDF (fallback) and Embedding based (OpenAI)
# -------------------------
class TFIDFRetriever:
    def __init__(self, docs: List[str]):
        if not SKLEARN_AVAILABLE:
            raise RuntimeError("sklearn not available for TF-IDF retriever.")
        self.docs = docs
        self.vectorizer = TfidfVectorizer(ngram_range=(1,2), stop_words='english')
        self.doc_matrix = self.vectorizer.fit_transform(self.docs)

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        qv = self.vectorizer.transform([query])
        sims = cosine_similarity(qv, self.doc_matrix)[0]
        idx_scores = list(enumerate(sims))
        idx_scores.sort(key=lambda x: x[1], reverse=True)
        return idx_scores[:top_k]

class OpenAIEmbeddingRetriever:
    def __init__(self, docs: List[str], embedding_model: str = CONFIG["OPENAI_EMBEDDING_MODEL"]):
        self.docs = docs
        self.embedding_model = embedding_model
        self.embeddings = []
        # compute embeddings
        self._build()

    def _embed_text(self, text: str):
        if openai is None or not getattr(openai, "api_key", None):
            # fallback: random vector (deterministic via hash)
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()  # fake dim
            else:
                return [random.random() for _ in range(512)]
        try:
            resp = openai.Embedding.create(model=self.embedding_model, input=text)
            return resp["data"][0]["embedding"]
        except Exception as e:
            safe_print("[WARN] OpenAI embedding failed:", e)
            # fallback deterministic pseudo-random
            if NUMPY_AVAILABLE:
                h = int(hashlib_sha1_int(text))
                rng = np.random.RandomState(h % (2**32))
                return rng.normal(size=(1536,)).tolist()
            else:
                return [random.random() for _ in range(512)]

    def _build(self):
        self.embeddings = [self._embed_text(d) for d in self.docs]

    def retrieve(self, query: str, top_k: int = 3) -> List[Tuple[int, float]]:
        q_emb = self._embed_text(query)
        # compute cosine similarities
        if NUMPY_AVAILABLE:
            qv = np.array(q_emb, dtype=float)
            sims = []
            for emb in self.embeddings:
                ev = np.array(emb, dtype=float)
                denom = (norm(qv) * norm(ev))
                sim = float(np.dot(qv, ev) / denom) if denom > 0 else 0.0
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]
        else:
            sims = []
            for emb in self.embeddings:
                sim = sum(a*b for a,b in zip(q_emb, emb)) / (len(q_emb) or 1)
                sims.append(sim)
            idx_scores = list(enumerate(sims))
            idx_scores.sort(key=lambda x: x[1], reverse=True)
            return idx_scores[:top_k]

# helper hashing for fallback embeddings
def hashlib_sha1_int(s: str) -> int:
    return int(hashlib.sha1(s.encode('utf-8')).hexdigest()[:16], 16)

# -------------------------
# Generator (LLM call) with fallback
# -------------------------
def call_openai_chat(question: str, contexts: List[str], temperature: float = 0.0, model: str = CONFIG["OPENAI_COMPLETION_MODEL"]) -> str:
    if openai is None or not getattr(openai, "api_key", None):
        # fallback: naive rule - if any context contains a sentence with overlap words, return that sentence; else "Insufficient"
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return "Insufficient information in context."
    # try call
    prompt = CONFIG["PROMPT_TEMPLATE"].format(context="\n\n".join(contexts), question=question)
    try:
        resp = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a precise assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=512,
        )
        text = resp["choices"][0]["message"]["content"].strip()
        return text
    except Exception as e:
        safe_print("[WARN] OpenAI ChatCompletion failed:", e)
        # fallback naive
        combined = " ".join(contexts)
        q_words = set([w.lower() for w in question.split() if len(w) > 3])
        best_sent = None
        best_overlap = 0
        for s in combined.split("."):
            wset = set([w.lower().strip(" ,;:()[]") for w in s.split() if len(w)>3])
            overlap = len(q_words & wset)
            if overlap > best_overlap:
                best_overlap = overlap
                best_sent = s.strip()
        if best_sent and best_overlap >= 1:
            return best_sent + "."
        return "Insufficient information in context."

# -------------------------
# Metrics implementations
# -------------------------
def compute_context_relevance(retrieved_idxs_scores: List[Tuple[int, float]]) -> float:
    """
    Simple metric: average similarity score (score between 0-1)
    """
    if not retrieved_idxs_scores:
        return 0.0
    scores = [s for _, s in retrieved_idxs_scores]
    # ensure in [0,1]
    clipped = [max(0.0, min(1.0, float(x))) for x in scores]
    return sum(clipped) / len(clipped)

def compute_grounding(answer: str, contexts: List[str]) -> float:
    """
    Heuristic: fraction of answer tokens that have overlap with context tokens.
    Returns 0-1.
    """
    a_words = [w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w) > 2]
    if not a_words:
        return 0.0
    context_text = " ".join(contexts).lower()
    hits = sum(1 for w in a_words if w in context_text)
    return hits / len(a_words)

def compute_faithfulness(answer: str, expected: str) -> float:
    """
    Very simple normalized similarity:
    - overlap ratio of important tokens (set intersection over union)
    """
    a_set = set([w.strip(" ,.;:()[]'\"").lower() for w in answer.split() if len(w)>2])
    e_set = set([w.strip(" ,.;:()[]'\"").lower() for w in expected.split() if len(w)>2])
    if not a_set and not e_set:
        return 1.0
    if not a_set or not e_set:
        return 0.0
    inter = a_set & e_set
    union = a_set | e_set
    return len(inter) / len(union)

# -------------------------
# Single-run RAG evaluation on list of goldens
# -------------------------
def run_rag_eval(
    goldens: List[Dict[str, str]],
    docs: List[str],
    retriever,
    top_k: int,
    temperature: float
) -> Dict[str, Any]:
    """
    Run through goldens, for each:
      - retrieve top_k contexts
      - call generator
      - compute metrics
    Return aggregated metrics and per-sample results
    """
    per_samples = []
    total_grounding = 0.0
    total_context_rel = 0.0
    total_faith = 0.0

    iterator = goldens if not TQDM_AVAILABLE else tqdm(goldens, desc=f"Eval top_k={top_k}, temp={temperature}")

    for g in iterator:
        q = g["input"]
        expected = g.get("expected_output", "")
        # retrieve
        retrieved = retriever.retrieve(q, top_k=top_k)
        contexts = [docs[idx] for idx, _ in retrieved]
        ctx_scores = [score for _, score in retrieved]

        # call generator
        answer = call_openai_chat(q, contexts, temperature=temperature)

        # compute metrics
        context_rel = compute_context_relevance(retrieved)
        grounding = compute_grounding(answer, contexts)
        faith = compute_faithfulness(answer, expected)

        total_context_rel += context_rel
        total_grounding += grounding
        total_faith += faith

        per_samples.append({
            "question": q,
            "expected": expected,
            "answer": answer,
            "retrieved": [{"idx": idx, "score": float(score), "snippet_hash": sha1_snippet(docs[idx])} for idx, score in retrieved],
            "metrics": {"context_relevance": context_rel, "grounding": grounding, "faithfulness": faith}
        })

    n = len(goldens)
    agg = {
        "avg_context_relevance": total_context_rel / n if n else 0.0,
        "avg_grounding": total_grounding / n if n else 0.0,
        "avg_faithfulness": total_faith / n if n else 0.0
    }
    return {"aggregate": agg, "samples": per_samples}

# -------------------------
# Iterative parameter adjustment logic
# -------------------------
def adjust_params(current_top_k: int, metrics: Dict[str, float]) -> int:
    """
    Very simple policy:
      - If grounding is low -> increase top_k by 2 (more context)
      - Else if context relevance is low and grounding is not yet good -> increase top_k by 1
      - Else if grounding and context relevance are both high -> reduce top_k by 1 to save cost
      - Additionally, if faithfulness is low -> increase top_k by 1
    Bound by min/max.
    """
    g = metrics.get("avg_grounding", 0.0)
    cr = metrics.get("avg_context_relevance", 0.0)
    fa = metrics.get("avg_faithfulness", 0.0)
    new_top_k = current_top_k

    # if grounding is very low, expand context
    if g < CONFIG["GROUNDING_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 2)
    elif cr < CONFIG["CONTEXT_RELEVANCE_BAD"] and g < CONFIG["GROUNDING_GOOD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], current_top_k + 1)
    elif g > CONFIG["GROUNDING_GOOD"] and cr > CONFIG["CONTEXT_RELEVANCE_GOOD"]:
        # try shrink to save cost
        new_top_k = max(CONFIG["MIN_TOP_K"], current_top_k - 1)
    # small adjustments if faithfulness very low
    if fa < CONFIG["FAITHFULNESS_BAD"]:
        new_top_k = min(CONFIG["MAX_TOP_K"], new_top_k + 1)
    # ensure bounds
    new_top_k = max(CONFIG["MIN_TOP_K"], min(CONFIG["MAX_TOP_K"], new_top_k))
    return new_top_k

def pick_temperature(candidate_list: List[float], metrics: Dict[str, float]) -> float:
    """
    Simple heuristic: if faithfulness low, use lower temp (more deterministic).
    If faithfulness high and grounding high, allow slightly higher temp for diversity.
    """
    fa = metrics.get("avg_faithfulness", 0.0)
    g = metrics.get("avg_grounding", 0.0)
    if fa < 0.4 or g < 0.4:
        return min(candidate_list)
    if fa > 0.75 and g > 0.7:
        return max(candidate_list)
    return candidate_list[len(candidate_list)//2]

# -------------------------
# Main pipeline
# -------------------------
def main():
    safe_print("=== RAG Iterative Evaluation Demo ===")
    ensure_example_doc(CONFIG["DOC_PATH"])
    ensure_dir(CONFIG["SAVE_DIR"])

    # load docs and split into chunks (naive paragraph chunking)
    with open(CONFIG["DOC_PATH"], "r", encoding="utf-8") as f:
        doc_text = f.read()
    paragraphs = [p.strip() for p in doc_text.split("\n") if p.strip()]
    # if paragraphs too short, split sentences
    if len(paragraphs) < 5:
        # attempt sentence split
        sents = [s.strip() for s in doc_text.replace("\n", " ").split(".") if s.strip()]
        # group per 1-2 sentences
        paragraphs = []
        i = 0
        while i < len(sents):
            chunk = sents[i]
            if i+1 < len(sents):
                if random.random() < 0.5:
                    chunk = chunk + ". " + sents[i+1]
                    i += 2
                else:
                    i += 1
            else:
                i += 1
            paragraphs.append(chunk + ".")
    docs = paragraphs

    safe_print(f"[INFO] Loaded {len(docs)} document chunks for retrieval.")

    # generate goldens
    goldens = generate_goldens(CONFIG["DOC_PATH"], CONFIG["NUM_GOLDENS"])
    safe_print(f"[INFO] Generated {len(goldens)} goldens for evaluation.")

    # choose retriever: prefer OpenAI embeddings if available, else TF-IDF
    retriever = None
    use_embedding = False
    if openai and getattr(openai, "api_key", None) and NUMPY_AVAILABLE:
        try:
            retriever = OpenAIEmbeddingRetriever(docs)
            use_embedding = True
            safe_print("[INFO] Using OpenAI embedding retriever.")
        except Exception as e:
            safe_print("[WARN] OpenAIEmbeddingRetriever failed, falling back to TF-IDF:", e)
    if retriever is None:
        if SKLEARN_AVAILABLE:
            retriever = TFIDFRetriever(docs)
            safe_print("[INFO] Using TF-IDF retriever.")
        else:
            # fallback: naive substring search retriever
            class NaiveRetriever:
                def __init__(self, docs):
                    self.docs = docs
                def retrieve(self, query, top_k=3):
                    qs = query.lower()
                    scores = []
                    for i, d in enumerate(self.docs):
                        s = sum(1 for w in set(qs.split()) if w in d.lower())
                        scores.append((i, float(s)))
                    scores.sort(key=lambda x: x[1], reverse=True)
                    return scores[:top_k]
            retriever = NaiveRetriever(docs)
            safe_print("[INFO] Using naive substring retriever.")

    # iterative loop
    cur_top_k = CONFIG["INITIAL_TOP_K"]
    cur_temp = CONFIG["TEMPERATURE_OPTIONS"][0]
    history = []
    for itr in range(1, CONFIG["ITERATIONS"] + 1):
        safe_print(f"\n--- Iteration {itr} | top_k={cur_top_k} | temp={cur_temp} ---")
        result = run_rag_eval(goldens, docs, retriever, top_k=cur_top_k, temperature=cur_temp)
        agg = result["aggregate"]
        safe_print(f"[RESULT] avg_context_relevance={agg['avg_context_relevance']:.3f}, avg_grounding={agg['avg_grounding']:.3f}, avg_faithfulness={agg['avg_faithfulness']:.3f}")
        # save per-iteration
        run_record = {
            "iteration": itr,
            "top_k": cur_top_k,
            "temperature": cur_temp,
            "aggregate": agg,
            "timestamp": time.time(),
            "samples_count": len(result["samples"])
        }
        history.append(run_record)
        # adapt params
        new_top_k = adjust_params(cur_top_k, agg)
        new_temp = pick_temperature(CONFIG["TEMPERATURE_OPTIONS"], agg)
        safe_print(f"[ADAPT] next_top_k={new_top_k}, next_temp={new_temp}")
        # if no change and already good metrics, we can stop early
        if new_top_k == cur_top_k and new_temp == cur_temp and agg["avg_grounding"] > 0.8 and agg["avg_faithfulness"] > 0.8:
            safe_print("[INFO] Metrics are good and stable - stopping early.")
            break
        cur_top_k = new_top_k
        cur_temp = new_temp

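    # produce final report (the API key is left out so it is never written to disk)
    report = {
        "config": {k: v for k, v in CONFIG.items() if k != "OPENAI_API_KEY"},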
        "docs_count": len(docs),
        "goldens_count": len(goldens),
        "history": history
    }
    report_path = os.path.join(CONFIG["SAVE_DIR"], CONFIG["REPORT_FILE"])
    with open(report_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2, ensure_ascii=False)
    safe_print(f"\n[FINISH] Saved report to {report_path}")
    safe_print("=== End ===")

if __name__ == "__main__":
    main()
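
To see what the script's heuristic metrics actually measure, here is a small worked example. It restates the token-overlap logic of compute_grounding and compute_faithfulness in compact form (the helper names `tokens`, `grounding`, and `jaccard` are introduced here for illustration). These are lexical stand-ins, not DeepEval's LLM-judged metrics, and the example answer is hand-written.

```python
def tokens(text: str):
    """Lowercased words longer than two characters, stripped of punctuation."""
    return [w.strip(" ,.;:()[]'\"").lower() for w in text.split() if len(w) > 2]


def grounding(answer: str, contexts) -> float:
    """Fraction of answer tokens that occur (as substrings) in the joined contexts."""
    words = tokens(answer)
    if not words:
        return 0.0
    context_text = " ".join(contexts).lower()
    return sum(1 for w in words if w in context_text) / len(words)


def jaccard(answer: str, expected: str) -> float:
    """Set overlap between answer tokens and expected-answer tokens."""
    a, e = set(tokens(answer)), set(tokens(expected))
    if not a and not e:
        return 1.0
    if not a or not e:
        return 0.0
    return len(a & e) / len(a | e)


context = ("Crows are among the smartest birds, capable of using tools "
           "and recognizing human faces even after years.")
answer = "Crows can recognize human faces."
expected = "Crows can recognize human faces and remember them for years."

# "crows", "human", "faces" appear in the context; "can" and "recognize" do not
# ("recognizing" does not contain the exact string "recognize").
print("grounding:", grounding(answer, [context]))      # 3 of 5 tokens -> 0.6
# All 5 answer tokens are among the 10 expected tokens.
print("faithfulness:", jaccard(answer, expected))      # 5 / 10 -> 0.5
```

The example shows why these scores are only rough signals: a correct paraphrase is penalized for not matching word forms exactly, and because matching is substring-based, short tokens can also match by accident inside longer words.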


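This article is reposted from Halo咯咯. Author: 基咯咯

Copyright belongs to the author. If you repost this article, please credit the source; otherwise legal action may be pursued.
Last modified 2025-10-17 08:38:05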