feat(community): add SearXNG and Browserless web search/fetch tools (#3451)

* feat(community): add SearXNG and Browserless web search/fetch tools

  - SearXNG web_search: privacy-focused meta search engine integration with configurable base_url via config.yaml tool settings
  - Browserless web_fetch: headless browser page fetching with readability article extraction
  - Both tools are fully configurable through tool config section
  - No external API keys required for basic operation

* fix: address PR review feedback and add unit tests

  - Guard config.model_extra against None values (review #1, #2)
  - Coerce max_results to int when reading from config (review #2)
  - Fix web_fetch_tool to use direct HTTP fetch instead of reusing the web_search client config (review #3)
  - Fix misleading docstring for SearxngClient.fetch (review #4)
  - Remove unused target_url variable to pass Ruff lint (review #5)
  - Normalize bool config values with _normalize_bool helper to handle env-resolved string values correctly (review #6)
  - Add unit tests for both SearXNG and Browserless client classes and their tool functions with mocked httpx (review #7, #8)

* fix: convert to async httpx to avoid blocking I/O on event loop

  - Replace httpx.Client with httpx.AsyncClient in both client classes
  - Convert tool functions to async def
  - Wrap readability_extractor calls in asyncio.to_thread()
  - Update all tests to use pytest.mark.asyncio and async mocks
  - Fix import sorting to pass Ruff lint

* fix(browserless): replace deprecated waitUntil with waitForEvent

  The Browserless API has deprecated the waitUntil parameter. Replace with waitForEvent which accepts values like 'networkidle'. Default is empty (no wait), configurable via config.yaml.

* fix(browserless): remove deprecated gotoTimeout and bestAttempt params

  The Browserless /content API does not accept gotoTimeout or bestAttempt as top-level payload keys. These were being sent unconditionally, causing 400 Bad Request errors on current Browserless versions.

  Changes:
  - Remove goto_timeout_ms parameter and 'gotoTimeout' from payload
  - Remove best_attempt parameter and 'bestAttempt' from payload
  - Remove _normalize_bool helper (no longer needed)
  - Remove goto_timeout_ms and best_attempt config reading in tools.py
  - Add tests for waitForSelector and reject params
  - Verify no deprecated params are sent in test_fetch_html_success

* refactor(searxng): remove web_fetch_tool, decouple from web_search config

  SearXNG is a search engine; it should only provide web_search_tool. The web_fetch responsibility belongs to Browserless (headless Chrome) or Jina AI, not SearXNG.

  Changes:
  - Remove web_fetch_tool from SearXNG tools.py and __init__.py
  - Remove SearxngClient.fetch() method (no longer needed)
  - Remove unused asyncio/readability imports from SearXNG tools.py
  - Add test for max_results string-to-int coercion from config
  - Add test for search with categories parameter
  - Add test for httpx.RequestError handling
  - Apply ruff format fixes to browserless_client.py and test files
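
The async conversion described above keeps CPU-bound extraction off the event loop by running it in a worker thread. A minimal sketch of that pattern follows; `extract_article` here is a hypothetical stand-in that only strips tags, not the project's `ReadabilityExtractor`, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
import re


def extract_article(html: str) -> str:
    """Hypothetical stand-in for a synchronous, CPU-bound extractor."""
    # Strip anything that looks like an HTML tag and trim whitespace.
    return re.sub(r"<[^>]+>", "", html).strip()


async def fetch_and_extract(html: str) -> str:
    # asyncio.to_thread runs the synchronous function in a worker thread,
    # so the event loop stays free to serve other coroutines meanwhile.
    return await asyncio.to_thread(extract_article, html)


if __name__ == "__main__":
    print(asyncio.run(fetch_and_extract("<p>Hello, <b>world</b></p>")))
```

Calling the extractor directly inside the coroutine would also work, but it would block every other task on the loop for the duration of the parse.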
2026-06-12 10:25:58 +00:00 · 2026-06-12 09:45:26 +08:00
parent 0367fe6c7a
commit 330a2ff8c5
8 changed files with 663 additions and 0 deletions
@@ -0,0 +1,85 @@
+import asyncio
+import logging
+
+from langchain.tools import tool
+
+from deerflow.config import get_app_config
+from deerflow.utils.readability import ReadabilityExtractor
+
+from .browserless_client import BrowserlessClient
+
+logger = logging.getLogger(__name__)
+
+# readability_extractor runs CPU-bound parsing; always call via asyncio.to_thread
+_readability_extractor = ReadabilityExtractor()
+
+
+def _get_tool_config(tool_name: str) -> dict | None:
+    """Get tool config extras safely, returning None if not configured."""
+    config = get_app_config().get_tool_config(tool_name)
+    if config is None:
+        return None
+    extras = config.model_extra
+    return extras if extras is not None else {}
+
+
+def _get_browserless_client() -> BrowserlessClient:
+    cfg = _get_tool_config("web_fetch")
+    base_url = "http://localhost:3032"
+    token = ""
+    timeout_s = 30.0
+    if cfg is not None:
+        base_url = cfg.get("base_url", base_url)
+        token = cfg.get("token", token)
+        raw = cfg.get("timeout_s", timeout_s)
+        timeout_s = float(raw) if not isinstance(raw, float) else raw
+    return BrowserlessClient(base_url=base_url, token=token, timeout_s=timeout_s)
+
+
+@tool("web_fetch", parse_docstring=True)
+async def web_fetch_tool(url: str) -> str:
+    """Fetch the contents of a web page at a given URL using Browserless (headless Chrome).
+    Only fetch EXACT URLs that have been provided directly by the user or have been returned in results from the web_search and web_fetch tools.
+    This tool can NOT access content that requires authentication, such as private Google Docs or pages behind login walls.
+    Do NOT add www. to URLs that do NOT have them.
+    URLs must include the scheme: https://example.com is a valid URL while example.com is an invalid URL.
+
+    Args:
+        url: The URL to fetch the contents of.
+    """
+    try:
+        cfg = _get_tool_config("web_fetch")
+
+        wait_for_event = ""
+        wait_for_timeout_ms = 0
+        wait_for_selector = ""
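+        wait_for_selector_timeout_ms = 5000  # fixed default; not read from config below
+        reject_resource_types: list[str] | None = None  # fixed default; not read from config below
+        reject_request_pattern: list[str] | None = None  # fixed default; not read from config below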
+
+        if cfg is not None:
+            wait_for_event = cfg.get("wait_for_event", wait_for_event)
+            raw_wait = cfg.get("wait_for_timeout_ms", wait_for_timeout_ms)
+            wait_for_timeout_ms = int(raw_wait) if not isinstance(raw_wait, int) else raw_wait
+            wait_for_selector = cfg.get("wait_for_selector", wait_for_selector)
+
+        client = _get_browserless_client()
+        html = await client.fetch_html(
+            url=url,
+            wait_for_event=wait_for_event,
+            wait_for_timeout_ms=wait_for_timeout_ms,
+            wait_for_selector=wait_for_selector,
+            wait_for_selector_timeout_ms=wait_for_selector_timeout_ms,
+            reject_resource_types=reject_resource_types,
+            reject_request_pattern=reject_request_pattern,
+        )
+
+        if html.startswith("Error:"):
+            return html
+
+        article = await asyncio.to_thread(_readability_extractor.extract_article, html)
+        return article.to_markdown()[:4096]
+
+    except Exception as e:
+        logger.exception("Error in web_fetch_tool: %s", e)
+        return f"Error: {e}"
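
The config reads above coerce `timeout_s` and `wait_for_timeout_ms` because values resolved from environment variables may arrive as strings. One possible way to factor that out is sketched below; `read_int` is a hypothetical helper that does not appear in this PR, and unlike the code in the diff it falls back to the default on an unparseable value instead of raising.

```python
from __future__ import annotations

from typing import Any, Mapping


def read_int(cfg: Mapping[str, Any] | None, key: str, default: int) -> int:
    """Hypothetical helper: read an int setting, tolerating string values."""
    if cfg is None:
        # Tool section is not configured at all.
        return default
    raw = cfg.get(key, default)
    try:
        return int(raw)
    except (TypeError, ValueError):
        # None or a non-numeric string: fall back rather than fail the call.
        return default


if __name__ == "__main__":
    cfg = {"wait_for_timeout_ms": "1500"}
    print(read_int(cfg, "wait_for_timeout_ms", 0))
```

Whether a silent fallback or a raised error is preferable depends on how strictly the deployment wants misconfiguration surfaced; the sketch only shows the lenient option.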