Crawl4AI: A Tool for Automated Web Data Collection by AI Agents

Published 2024-11-8 14:59

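Crawl4AI is a free, open-source tool that uses AI techniques to simplify web crawling and data extraction, making information gathering and analysis more efficient. It identifies the content of a web page and converts it into a format that is easy to process. It is full-featured and simple to operate.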

1 Steps for using Crawl4AI

Step 1: Installation and setup

pip install "crawl4ai @ git+https://github.com/unclecode/crawl4ai.git" transformers torch nltk
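
Before moving on, it can help to confirm that the packages named in the install command are importable. The following is a minimal sketch using only the standard library; the helper name `check_installed` is hypothetical and is not part of Crawl4AI.

```python
from importlib.util import find_spec


def check_installed(packages):
    """Return a dict mapping each top-level package name to True if it can be found."""
    return {name: find_spec(name) is not None for name in packages}


if __name__ == "__main__":
    # Package names taken from the pip command in this step.
    status = check_installed(["crawl4ai", "transformers", "torch", "nltk"])
    for name, ok in status.items():
        print(f"{name}: {'installed' if ok else 'missing'}")
```

`find_spec` only locates a package without importing it, so the check stays fast even for large packages such as torch.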

Step 2: Data extraction

Create a Python script that starts the web crawler and extracts data from a URL:

from crawl4ai import WebCrawler

# Create a WebCrawler instance
crawler = WebCrawler()

# Warm up the crawler (loads the required models)
crawler.warmup()

# Run the crawler on a URL
result = crawler.run(url="https://openai.com/api/pricing/")

# Print the extracted content as Markdown
print(result.markdown)
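
The script prints `result.markdown` directly. If the output should be kept for later inspection, it could be written to disk instead. This is a minimal sketch that assumes the value is a plain `str`; the helper `save_markdown` and any file name passed to it are illustrative and not part of Crawl4AI.

```python
from pathlib import Path


def save_markdown(markdown: str, path: str) -> int:
    """Write Markdown text to `path` as UTF-8 and return the number of characters written."""
    target = Path(path)
    # Create the parent directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    return target.write_text(markdown, encoding="utf-8")
```

With the crawler script above, a call might look like `save_markdown(result.markdown, "output/pricing.md")`, where the path is an arbitrary choice.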

Step 3: Structuring the data

Define an extraction strategy backed by an LLM (large language model) to convert the data into a structured format:

import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field

# Schema describing one extracted record
class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input tokens for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output tokens for the OpenAI model.")

url = 'https://openai.com/api/pricing/'
crawler = WebCrawler()
crawler.warmup()

result = crawler.run(
    url=url,
    word_count_threshold=1,
    extraction_strategy=LLMExtractionStrategy(
        provider="openai/gpt-4o",
        api_token=os.getenv('OPENAI_API_KEY'),  # read the API key from the environment
        schema=OpenAIModelFee.schema(),
        extraction_type="schema",
        instruction="""From the crawled content, extract all mentioned model names along with their input and output token fees. Do not miss any model in the entire content. One extracted model JSON object should look like this:
        {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
    ),
    bypass_cache=True,
)

# Print the structured result
print(result.extracted_content)
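
The script prints the extracted content without processing it further. Assuming it arrives as a JSON string holding a list of objects shaped like the schema (an assumption about the output format, not something the article states), it could be loaded into typed records roughly as follows. The sample string, the `ModelFeeRecord` class, and the `parse_fees` helper are made up for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class ModelFeeRecord:
    model_name: str
    input_fee: str
    output_fee: str


def parse_fees(extracted_content: str) -> list[ModelFeeRecord]:
    """Parse a JSON array of fee objects, skipping items that lack a required key."""
    records = []
    for item in json.loads(extracted_content):
        try:
            records.append(
                ModelFeeRecord(item["model_name"], item["input_fee"], item["output_fee"])
            )
        except (KeyError, TypeError):
            continue  # malformed entry: skip it rather than abort the whole parse
    return records


# Hypothetical sample standing in for result.extracted_content.
sample = (
    '[{"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", '
    '"output_fee": "US$30.00 / 1M tokens"}, {"error": true}]'
)
records = parse_fees(sample)
print(records[0].model_name)
```

Skipping malformed items is one possible policy; raising an error instead would be the stricter choice if every record must be accounted for.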

Step 4: Integrating AI agents

Integrate Crawl4AI with Praison CrewAI agents for efficient data processing:

pip install praisonai

Create a tools file (tools.py) that wraps Crawl4AI as a tool:

# tools.py
import os
from crawl4ai import WebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel, Field
from praisonai_tools import BaseTool

# Schema describing one extracted record
class ModelFee(BaseModel):
    llm_model_name: str = Field(..., description="Name of the model.")
    input_fee: str = Field(..., description="Fee for input tokens for the model.")
    output_fee: str = Field(..., description="Fee for output tokens for the model.")

class ModelFeeTool(BaseTool):
    name: str = "ModelFeeTool"
    description: str = "Extracts model fee information from the given pricing page."

    def _run(self, url: str):
        crawler = WebCrawler()
        crawler.warmup()

        result = crawler.run(
            url=url,
            word_count_threshold=1,
            extraction_strategy=LLMExtractionStrategy(
                provider="openai/gpt-4o",
                api_token=os.getenv('OPENAI_API_KEY'),
                schema=ModelFee.schema(),
                extraction_type="schema",
                # The example key matches the schema field name (llm_model_name)
                instruction="""From the crawled content, extract all mentioned model names along with their input and output token fees. Do not miss any model in the entire content. One extracted model JSON object should look like this:
                {"llm_model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
            ),
            bypass_cache=True,
        )
        return result.extracted_content

if __name__ == "__main__":
    # Test ModelFeeTool
    tool = ModelFeeTool()
    url = "https://www.openai.com/pricing"
    result = tool.run(url)
    print(result)
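
The tool reads the API key with `os.getenv('OPENAI_API_KEY')`, which returns `None` when the variable is unset. A small guard can surface that before a crawl starts. This is only a sketch; `require_env` is a hypothetical helper and not part of Crawl4AI or PraisonAI.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In tools.py, the key could then be passed as `api_token=require_env("OPENAI_API_KEY")`, so a missing key is reported before any crawling begins.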

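AI agent configuration

Configure the AI agents to use the Crawl4AI tool for web scraping and data extraction. Under the crewai framework, three core roles are defined, which together extract model pricing information from websites:

Web scraper: scrapes pricing information from sites such as OpenAI, Anthropic, and Cohere, and outputs raw HTML or JSON data.
Data cleaner: ensures the collected data is accurate and organizes it into structured JSON or CSV files.
Data analyst: analyzes the cleaned data, identifies pricing trends and patterns, and compiles a detailed report.

The workflow needs no additional dependencies, and each role completes its own task independently.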
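
To make the division of labor concrete, here is a plain-Python sketch of the three roles as functions chained in sequence. It does not use the PraisonAI or CrewAI APIs. The function names and sample records are hypothetical, and the scraper is a stub that returns hard-coded data in place of a real crawl.

```python
import re


def scrape():
    """Web scraper role (stub): return raw fee records as a crawl might produce them."""
    return [
        {"llm_model_name": "model-a", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"},
        {"llm_model_name": "model-b", "input_fee": "US$0.50 / 1M tokens", "output_fee": "US$1.50 / 1M tokens"},
        {"llm_model_name": "model-c", "input_fee": "n/a", "output_fee": "n/a"},
    ]


def parse_price(text):
    """Extract the first dollar amount from a fee string, or None if there is none."""
    match = re.search(r"\$\s*(\d+(?:\.\d+)?)", text)
    return float(match.group(1)) if match else None


def clean(raw):
    """Data cleaner role: convert fee strings to floats and drop unparseable rows."""
    cleaned = []
    for row in raw:
        inp, out = parse_price(row["input_fee"]), parse_price(row["output_fee"])
        if inp is not None and out is not None:
            cleaned.append({"model": row["llm_model_name"], "input": inp, "output": out})
    return cleaned


def analyze(rows):
    """Data analyst role: name the cheapest model by combined input and output fee."""
    cheapest = min(rows, key=lambda r: r["input"] + r["output"])
    return f"Cheapest of {len(rows)} models: {cheapest['model']}"


report = analyze(clean(scrape()))
print(report)
```

Each function only consumes the previous function's return value, which mirrors the idea that every role owns a single, self-contained task.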

2 AI agent application example

Built on Crawl4AI, Praison-AI agents can carry out web scraping, data cleaning, and analysis. They cooperate to scrape pricing data from multiple websites and compile it into a detailed report that presents the analysis results.
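
The aggregation across websites can be sketched independently of any agent framework. The example below merges per-site fee records into a single CSV report using only the standard library. The helper `build_report`, the site names, and all fee values are placeholders invented for illustration, not real pricing data.

```python
import csv
import io


def build_report(results_by_site):
    """Merge per-site fee records into one CSV string, sorted by site and then model."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["site", "model", "input_fee", "output_fee"])
    for site in sorted(results_by_site):
        rows = sorted(results_by_site[site], key=lambda r: r["llm_model_name"])
        for row in rows:
            writer.writerow([site, row["llm_model_name"], row["input_fee"], row["output_fee"]])
    return buffer.getvalue()


# Placeholder data: site names and fees are invented for illustration.
results = {
    "site-b": [{"llm_model_name": "m2", "input_fee": "US$1.00 / 1M tokens", "output_fee": "US$2.00 / 1M tokens"}],
    "site-a": [{"llm_model_name": "m1", "input_fee": "US$3.00 / 1M tokens", "output_fee": "US$4.00 / 1M tokens"}],
}
print(build_report(results))
```

Sorting by site and model keeps the report stable between runs, which makes successive reports easier to compare.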

3 Conclusion

Crawl4AI is a powerful tool that lets AI agents perform web crawling and data extraction tasks more efficiently and accurately. Its open-source nature, AI-driven capabilities, and versatility make it a valuable asset for building intelligent, data-driven agents.

This article is reposted from AI科技論談. Author: AI科技論談
