feat(community): add SearXNG and Browserless web search/fetch tools (#3451)

* feat(community): add SearXNG and Browserless web search/fetch tools

- SearXNG web_search: privacy-focused meta search engine integration
  with configurable base_url via config.yaml tool settings
- Browserless web_fetch: headless browser page fetching with
  readability article extraction
- Both tools are fully configurable through the tool config section
- No external API keys required for basic operation

* fix: address PR review feedback and add unit tests

- Guard config.model_extra against None values (review #1, #2)
- Coerce max_results to int when reading from config (review #2)
- Fix web_fetch_tool to use direct HTTP fetch instead of reusing
  the web_search client config (review #3)
- Fix misleading docstring for SearxngClient.fetch (review #4)
- Remove unused target_url variable to pass Ruff lint (review #5)
- Normalize bool config values with _normalize_bool helper to
  handle env-resolved string values correctly (review #6)
- Add unit tests for both SearXNG and Browserless client classes
  and their tool functions with mocked httpx (review #7, #8)

* fix: convert to async httpx to avoid blocking I/O on event loop

- Replace httpx.Client with httpx.AsyncClient in both client classes
- Convert tool functions to async def
- Wrap readability_extractor calls in asyncio.to_thread()
- Update all tests to use pytest.mark.asyncio and async mocks
- Fix import sorting to pass Ruff lint
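
The `asyncio.to_thread()` wrapping mentioned above can be illustrated with the standard library alone. In this sketch `extract_article` is a hypothetical stand-in for the blocking readability call, and the thread-id check is only there to show that the work leaves the event loop thread.

```python
import asyncio
import threading


def extract_article(html: str) -> str:
    """Hypothetical stand-in for a blocking, CPU-bound extractor."""
    return html.replace("<p>", "").replace("</p>", "").strip()


async def fetch_and_extract(html: str) -> tuple[str, bool]:
    """Run the blocking extractor in a worker thread."""
    loop_thread = threading.get_ident()

    def work() -> tuple[str, int]:
        # Executed in the default thread pool, not on the event loop.
        return extract_article(html), threading.get_ident()

    text, worker_thread = await asyncio.to_thread(work)
    # The flag is True when the extractor ran off the event loop thread.
    return text, worker_thread != loop_thread


def run_demo() -> tuple[str, bool]:
    """Drive the coroutine from synchronous code."""
    return asyncio.run(fetch_and_extract("<p>hello</p>"))
```

While the worker thread parses, the event loop stays free to serve other coroutines, which is the motivation for this change.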

* fix(browserless): replace deprecated waitUntil with waitForEvent

The Browserless API has deprecated the waitUntil parameter.
Replace with waitForEvent which accepts values like 'networkidle'.
The default is an empty string (no wait); it is configurable via config.yaml.

* fix(browserless): remove deprecated gotoTimeout and bestAttempt params

The Browserless /content API does not accept gotoTimeout or bestAttempt
as top-level payload keys. These were being sent unconditionally,
causing 400 Bad Request errors on current Browserless versions.

Changes:
- Remove goto_timeout_ms parameter and 'gotoTimeout' from payload
- Remove best_attempt parameter and 'bestAttempt' from payload
- Remove _normalize_bool helper (no longer needed)
- Remove goto_timeout_ms and best_attempt config reading in tools.py
- Add tests for waitForSelector and reject params
- Verify no deprecated params are sent in test_fetch_html_success
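
A payload builder consistent with the changes above might look like the sketch below. `browserless_client.py` is not shown here, so `build_content_payload` and the nested value shapes (`{"event": ...}`, `{"selector": ..., "timeout": ...}`) are assumptions for illustration; only the key names `waitForEvent`, `waitForSelector`, `gotoTimeout`, and `bestAttempt` come from the commit message.

```python
from typing import Any

# Keys this commit stops sending, per the description above.
REMOVED_KEYS = ("gotoTimeout", "bestAttempt")


def build_content_payload(
    url: str,
    wait_for_event: str = "",
    wait_for_selector: str = "",
    wait_for_selector_timeout_ms: int = 5000,
) -> dict[str, Any]:
    """Build a request body, adding optional keys only when they are set."""
    payload: dict[str, Any] = {"url": url}
    if wait_for_event:
        # Assumed shape: an object naming the event to wait for.
        payload["waitForEvent"] = {"event": wait_for_event}
    if wait_for_selector:
        # Assumed shape: selector plus a timeout in milliseconds.
        payload["waitForSelector"] = {
            "selector": wait_for_selector,
            "timeout": wait_for_selector_timeout_ms,
        }
    return payload
```

Sending optional keys only when configured keeps the default request minimal, which avoids the kind of 400 response described above when the server rejects unknown top-level keys.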

* refactor(searxng): remove web_fetch_tool, decouple from web_search config

SearXNG is a search engine, so it should only provide web_search_tool.
The web_fetch responsibility belongs to Browserless (headless Chrome)
or Jina AI, not SearXNG.

Changes:
- Remove web_fetch_tool from SearXNG tools.py and __init__.py
- Remove SearxngClient.fetch() method (no longer needed)
- Remove unused asyncio/readability imports from SearXNG tools.py
- Add test for max_results string-to-int coercion from config
- Add test for search with categories parameter
- Add test for httpx.RequestError handling
- Apply ruff format fixes to browserless_client.py and test files
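
For context on the categories test, a search request URL with an optional `categories` parameter can be built as sketched below. `build_search_url` is a hypothetical helper, not the client code from this PR; it assumes the SearXNG instance exposes `/search` with `format=json` enabled.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, query: str, categories: str = "") -> str:
    """Build a SearXNG JSON search URL, adding categories only when given."""
    params = {"q": query, "format": "json"}
    if categories:
        params["categories"] = categories
    # Strip a trailing slash so the path is not doubled.
    return f"{base_url.rstrip('/')}/search?{urlencode(params)}"
```

For example, `build_search_url("http://localhost:8080", "deer flow", "news")` produces a URL whose query string carries `q`, `format`, and `categories` in that order.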
zengxi
2026-06-12 09:45:26 +08:00
committed by GitHub
parent 0367fe6c7a
commit 330a2ff8c5
8 changed files with 663 additions and 0 deletions
@@ -0,0 +1,85 @@
import asyncio
import logging

from langchain.tools import tool

from deerflow.config import get_app_config
from deerflow.utils.readability import ReadabilityExtractor

from .browserless_client import BrowserlessClient

logger = logging.getLogger(__name__)

# readability_extractor runs CPU-bound parsing; always call via asyncio.to_thread
_readability_extractor = ReadabilityExtractor()


def _get_tool_config(tool_name: str) -> dict | None:
    """Get tool config extras safely, returning None if not configured."""
    config = get_app_config().get_tool_config(tool_name)
    if config is None:
        return None
    extras = config.model_extra
    return extras if extras is not None else {}


def _get_browserless_client() -> BrowserlessClient:
    cfg = _get_tool_config("web_fetch")
    base_url = "http://localhost:3032"
    token = ""
    timeout_s = 30.0
    if cfg is not None:
        base_url = cfg.get("base_url", base_url)
        token = cfg.get("token", token)
        raw = cfg.get("timeout_s", timeout_s)
        timeout_s = float(raw) if not isinstance(raw, float) else raw
    return BrowserlessClient(base_url=base_url, token=token, timeout_s=timeout_s)


@tool("web_fetch", parse_docstring=True)
async def web_fetch_tool(url: str) -> str:
    """Fetch the contents of a web page at a given URL using Browserless (headless Chrome).

    Only fetch EXACT URLs that have been provided directly by the user or have been returned in results from the web_search and web_fetch tools.
    This tool can NOT access content that requires authentication, such as private Google Docs or pages behind login walls.
    Do NOT add www. to URLs that do NOT have them.
    URLs must include the schema: https://example.com is a valid URL while example.com is an invalid URL.

    Args:
        url: The URL to fetch the contents of.
    """
    try:
        cfg = _get_tool_config("web_fetch")
        wait_for_event = ""
        wait_for_timeout_ms = 0
        wait_for_selector = ""
        wait_for_selector_timeout_ms = 5000
        reject_resource_types: list[str] | None = None
        reject_request_pattern: list[str] | None = None
        if cfg is not None:
            wait_for_event = cfg.get("wait_for_event", wait_for_event)
            raw_wait = cfg.get("wait_for_timeout_ms", wait_for_timeout_ms)
            wait_for_timeout_ms = int(raw_wait) if not isinstance(raw_wait, int) else raw_wait
            wait_for_selector = cfg.get("wait_for_selector", wait_for_selector)

        client = _get_browserless_client()
        html = await client.fetch_html(
            url=url,
            wait_for_event=wait_for_event,
            wait_for_timeout_ms=wait_for_timeout_ms,
            wait_for_selector=wait_for_selector,
            wait_for_selector_timeout_ms=wait_for_selector_timeout_ms,
            reject_resource_types=reject_resource_types,
            reject_request_pattern=reject_request_pattern,
        )
        if html.startswith("Error:"):
            return html
        article = await asyncio.to_thread(_readability_extractor.extract_article, html)
        return article.to_markdown()[:4096]
    except Exception as e:
        logger.error(f"Error in web_fetch_tool: {e}")
        return f"Error: {str(e)}"
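
The tail of `web_fetch_tool` follows a simple convention: client failures come back as strings starting with `Error:` and are passed through unchanged, while successful output is capped at 4096 characters. The sketch below isolates that step so it can be exercised without Browserless. `finish_fetch` and its `to_markdown` callback are hypothetical names used only for illustration.

```python
from typing import Callable

MAX_CHARS = 4096  # same cap as web_fetch_tool applies to its output


def finish_fetch(html: str, to_markdown: Callable[[str], str]) -> str:
    """Pass error strings through; otherwise convert and truncate."""
    # The client reports failures as plain "Error: ..." strings.
    if html.startswith("Error:"):
        return html
    # Truncate after conversion so the cap applies to the returned text.
    return to_markdown(html)[:MAX_CHARS]
```

Truncating after conversion rather than before means the limit is measured on the Markdown handed to the model, not on the raw HTML.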