deer-flow/backend/packages/harness/deerflow/community/browserless/browserless_client.py
zengxi 330a2ff8c5 feat(community): add SearXNG and Browserless web search/fetch tools (#3451)
* feat(community): add SearXNG and Browserless web search/fetch tools

- SearXNG web_search: privacy-focused meta search engine integration
  with configurable base_url via config.yaml tool settings
- Browserless web_fetch: headless browser page fetching with
  readability article extraction
- Both tools are fully configurable through tool config section
- No external API keys required for basic operation

* fix: address PR review feedback and add unit tests

- Guard config.model_extra against None values (review #1, #2)
- Coerce max_results to int when reading from config (review #2)
- Fix web_fetch_tool to use direct HTTP fetch instead of reusing
  the web_search client config (review #3)
- Fix misleading docstring for SearxngClient.fetch (review #4)
- Remove unused target_url variable to pass Ruff lint (review #5)
- Normalize bool config values with _normalize_bool helper to
  handle env-resolved string values correctly (review #6)
- Add unit tests for both SearXNG and Browserless client classes
  and their tool functions with mocked httpx (review #7, #8)

* fix: convert to async httpx to avoid blocking I/O on event loop

- Replace httpx.Client with httpx.AsyncClient in both client classes
- Convert tool functions to async def
- Wrap readability_extractor calls in asyncio.to_thread()
- Update all tests to use pytest.mark.asyncio and async mocks
- Fix import sorting to pass Ruff lint
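
The `asyncio.to_thread()` change above follows a common pattern: a blocking, CPU-bound call is moved to a worker thread so the event loop is not stalled. A minimal sketch, assuming a stand-in extractor (`extract_article` is hypothetical and is not the real `readability_extractor` interface):

```python
import asyncio
import re


def extract_article(html: str) -> str:
    """Hypothetical stand-in for a blocking article extractor."""
    # Crude tag stripping, only so the worker thread has something to do.
    return re.sub(r"<[^>]+>", "", html).strip()


async def fetch_and_extract(html: str) -> str:
    # asyncio.to_thread runs the blocking function in a worker thread and
    # returns an awaitable, so other coroutines keep running meanwhile.
    return await asyncio.to_thread(extract_article, html)
```

A caller would run it with `asyncio.run(fetch_and_extract("<p>Hello</p>"))` or await it from an existing coroutine.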

* fix(browserless): replace deprecated waitUntil with waitForEvent

The Browserless API has deprecated the waitUntil parameter.
Replace with waitForEvent which accepts values like 'networkidle'.
Default is empty (no wait), configurable via config.yaml.

* fix(browserless): remove deprecated gotoTimeout and bestAttempt params

The Browserless /content API does not accept gotoTimeout or bestAttempt
as top-level payload keys. These were being sent unconditionally,
causing 400 Bad Request errors on current Browserless versions.

Changes:
- Remove goto_timeout_ms parameter and 'gotoTimeout' from payload
- Remove best_attempt parameter and 'bestAttempt' from payload
- Remove _normalize_bool helper (no longer needed)
- Remove goto_timeout_ms and best_attempt config reading in tools.py
- Add tests for waitForSelector and reject params
- Verify no deprecated params are sent in test_fetch_html_success
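
The payload rule described above (send only keys that were explicitly requested, and never the removed ones) can be sketched as a pure function that is easy to unit-test. `build_content_payload` and `REMOVED_KEYS` are hypothetical names for illustration; the actual client builds the payload inline inside `fetch_html`.

```python
from typing import Any

# Keys this commit series stopped sending, per the messages above.
REMOVED_KEYS = {"waitUntil", "gotoTimeout", "bestAttempt"}


def build_content_payload(
    url: str,
    wait_for_event: str = "",
    wait_for_selector: str = "",
    wait_for_selector_timeout_ms: int = 5000,
) -> dict[str, Any]:
    """Hypothetical helper: build a /content payload with optional keys only."""
    payload: dict[str, Any] = {"url": url}
    # Empty defaults mean "do not wait", so the key is omitted entirely.
    if wait_for_event:
        payload["waitForEvent"] = wait_for_event
    if wait_for_selector:
        payload["waitForSelector"] = {
            "selector": wait_for_selector,
            "timeout": wait_for_selector_timeout_ms,
        }
    return payload
```

A test in the spirit of the last bullet could then assert `REMOVED_KEYS.isdisjoint(payload)` for any payload the helper produces.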

* refactor(searxng): remove web_fetch_tool, decouple from web_search config

SearXNG is a search engine — it should only provide web_search_tool.
The web_fetch responsibility belongs to Browserless (headless Chrome)
or Jina AI, not SearXNG.

Changes:
- Remove web_fetch_tool from SearXNG tools.py and __init__.py
- Remove SearxngClient.fetch() method (no longer needed)
- Remove unused asyncio/readability imports from SearXNG tools.py
- Add test for max_results string-to-int coercion from config
- Add test for search with categories parameter
- Add test for httpx.RequestError handling
- Apply ruff format fixes to browserless_client.py and test files
2026-06-12 09:45:26 +08:00


import logging
from typing import Any

import httpx

logger = logging.getLogger(__name__)


class BrowserlessClient:
    """Client for Browserless headless Chrome API."""

    def __init__(self, base_url: str, token: str = "", timeout_s: float = 30) -> None:
        # Strip any trailing slash so "/content" can be appended safely.
        self.base_url = base_url.rstrip("/")
        self.token = token
        self.timeout_s = timeout_s

    async def fetch_html(
        self,
        url: str,
        wait_for_event: str = "",
        wait_for_timeout_ms: int = 0,
        wait_for_selector: str = "",
        wait_for_selector_timeout_ms: int = 5000,
        reject_resource_types: list[str] | None = None,
        reject_request_pattern: list[str] | None = None,
    ) -> str:
        """Fetch the rendered HTML of a page using Browserless.

        Only sends parameters accepted by the current Browserless API version.
        The HTTP request as a whole is bounded by ``timeout_s`` (30s by default).
        Failures are returned as a string starting with "Error:" rather than raised.

        Args:
            url: The URL to fetch.
            wait_for_event: Wait for a page event (e.g. "networkidle", "load").
            wait_for_timeout_ms: Extra wait after page load.
            wait_for_selector: CSS selector to wait for.
            wait_for_selector_timeout_ms: Timeout for selector wait.
            reject_resource_types: Resource types to block (e.g. ["image"]).
            reject_request_pattern: URL patterns to block.

        Returns:
            Rendered HTML content, or an "Error: ..." message on failure.
        """
        payload: dict[str, Any] = {
            "url": url,
        }

        # Optional keys are added only when set, so defaults never reach the API.
        if self.token:
            payload["token"] = self.token
        if wait_for_event:
            payload["waitForEvent"] = wait_for_event
        if wait_for_timeout_ms > 0:
            payload["waitForTimeout"] = wait_for_timeout_ms
        if wait_for_selector:
            payload["waitForSelector"] = {
                "selector": wait_for_selector,
                "timeout": wait_for_selector_timeout_ms,
            }
        if reject_resource_types:
            payload["rejectResourceTypes"] = reject_resource_types
        if reject_request_pattern:
            payload["rejectRequestPattern"] = reject_request_pattern

        logger.debug(f"Fetching URL via Browserless: {url}")

        try:
            async with httpx.AsyncClient(timeout=self.timeout_s) as client:
                resp = await client.post(
                    f"{self.base_url}/content",
                    json=payload,
                    headers={
                        "Content-Type": "application/json",
                        "Cache-Control": "no-cache",
                    },
                )

                code = resp.status_code
                # Status of the target page as reported by Browserless (debug only).
                target_code = resp.headers.get("X-Response-Code", "")
                target_status = resp.headers.get("X-Response-Status", "")
                logger.debug(f"Browserless response: code={code}, target_code={target_code}, target_status={target_status}")

                if code != 200:
                    return f"Error: Browserless HTTP {code}: {resp.text[:200]}"

                html = resp.text
                if not html or not html.strip():
                    return "Error: Browserless returned empty response"
                return html

        except httpx.TimeoutException:
            return f"Error: Browserless request timed out after {self.timeout_s}s"
        except httpx.RequestError as e:
            logger.error(f"Browserless request failed: {e}")
            return f"Error: Browserless request failed: {e!s}"
        except Exception as e:
            logger.error(f"Browserless fetch failed: {e}")
            return f"Error: Browserless fetch failed: {e!s}"