feat(eval): add report quality evaluation module and UI integration (#776)

* feat(eval): add report quality evaluation module

Addresses issue #773 - How to evaluate generated report quality objectively.

This module provides two evaluation approaches:
1. Automated metrics (no LLM required):
   - Citation count and source diversity
   - Word count compliance per report style
   - Section structure validation
   - Image inclusion tracking

2. LLM-as-Judge evaluation:
   - Factual accuracy scoring
   - Completeness assessment
   - Coherence evaluation
   - Relevance and citation quality checks
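
The automated metrics listed above can be approximated with a few regular expressions. The following is a minimal sketch, not the code in src/eval/metrics.py: the function name `quick_metrics`, the regex patterns, and the choice to count one citation per inline Markdown link are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumption: a "citation" is an inline Markdown link, [text](http...).
# The negative lookbehind skips image syntax, ![alt](url).
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\((https?://[^)\s]+)\)")
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)\s]+\)")
# ATX headings: one to six '#' characters followed by text.
HEADING_RE = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


def quick_metrics(markdown: str) -> dict:
    """Compute LLM-free metrics for a Markdown report (illustrative only)."""
    urls = LINK_RE.findall(markdown)
    # Source diversity is approximated by the number of distinct domains.
    domains = {urlparse(url).netloc.lower() for url in urls}
    return {
        "word_count": len(markdown.split()),
        "citation_count": len(urls),
        "unique_sources": len(domains),
        "image_count": len(IMAGE_RE.findall(markdown)),
        "section_count": len(HEADING_RE.findall(markdown)),
    }
```

The returned keys mirror a subset of the fields in the EvaluationMetrics response model; the real module also validates section structure per report style.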

The combined evaluator provides a final score (1-10) and letter grade (A+ to F).
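
One way to turn the 1-10 score into a letter grade is a descending threshold table. This is a sketch only: the cut-offs below and the helper name `score_to_grade` are hypothetical, and the actual boundaries used by the evaluator are not stated in this commit message.

```python
# Hypothetical cut-offs, highest first; not taken from src/eval/evaluator.py.
GRADE_THRESHOLDS = [
    (9.5, "A+"),
    (9.0, "A"),
    (8.0, "B"),
    (7.0, "C"),
    (6.0, "D"),
]


def score_to_grade(score: float) -> str:
    """Map a score in [1, 10] to a letter grade using the table above."""
    if not 1.0 <= score <= 10.0:
        raise ValueError(f"score must be between 1 and 10, got {score}")
    for cutoff, grade in GRADE_THRESHOLDS:
        if score >= cutoff:
            return grade
    # Anything below the lowest cut-off falls through to a failing grade.
    return "F"
```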

Files added:
- src/eval/__init__.py
- src/eval/metrics.py
- src/eval/llm_judge.py
- src/eval/evaluator.py
- tests/unit/eval/test_metrics.py
- tests/unit/eval/test_evaluator.py

* feat(eval): integrate report evaluation with web UI

This commit adds the web UI integration for the evaluation module:

Backend:
- Add EvaluateReportRequest/Response models in src/server/eval_request.py
- Add /api/report/evaluate endpoint to src/server/app.py

Frontend:
- Add evaluateReport API function in web/src/core/api/evaluate.ts
- Create EvaluationDialog component with grade badge, metrics display,
  and optional LLM deep evaluation
- Add evaluation button (graduation cap icon) to research-block.tsx toolbar
- Add i18n translations for English and Chinese

The evaluation UI allows users to:
1. View quick metrics-only evaluation (instant)
2. Optionally run deep LLM-based evaluation for detailed analysis
3. See grade (A+ to F), score (1-10), and metric breakdown
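
For callers outside the web UI, the quick and deep modes map onto the `use_llm` flag of the request body. The sketch below builds such a request with the standard library. The field names and endpoint path come from this commit; the helper name `build_evaluate_request` and the base URL in the usage note are assumptions.

```python
import json
import urllib.request


def build_evaluate_request(
    base_url: str,
    content: str,
    query: str,
    report_style: str = "default",
    use_llm: bool = False,
) -> urllib.request.Request:
    """Build a POST request for /api/report/evaluate (illustrative helper)."""
    payload = {
        "content": content,            # report Markdown to evaluate
        "query": query,                # original research query
        "report_style": report_style,  # e.g. "academic", "news"
        "use_llm": use_llm,            # False = metrics only, True = LLM judge
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/report/evaluate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Assuming the backend listens on `http://localhost:8000`, passing the result of `build_evaluate_request("http://localhost:8000", report, query)` to `urllib.request.urlopen` would send the metrics-only evaluation; setting `use_llm=True` requests the slower deep evaluation.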

* feat(eval): improve evaluation reliability and add LLM judge tests

- Extract MAX_REPORT_LENGTH constant in llm_judge.py for maintainability
- Add comprehensive unit tests for LLMJudge class (parse_response,
  calculate_weighted_score, evaluate with mocked LLM)
- Pass reportStyle prop to EvaluationDialog for accurate evaluation criteria
- Add researchQueries store map to reliably associate queries with research
- Add getResearchQuery helper to retrieve query by researchId
- Remove unused imports in test_metrics.py
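
The weighted-score calculation exercised by the new tests can be pictured as a weighted average over the six judge criteria. This is a sketch under stated assumptions: the criterion names match the LLMEvaluationScores model in this commit, but the weights and the function name `weighted_score` are hypothetical and may differ from llm_judge.py.

```python
# Hypothetical weights (they sum to 1.0); not taken from llm_judge.py.
HYPOTHETICAL_WEIGHTS = {
    "factual_accuracy": 0.25,
    "completeness": 0.20,
    "coherence": 0.20,
    "relevance": 0.15,
    "citation_quality": 0.10,
    "writing_quality": 0.10,
}


def weighted_score(scores: dict[str, int]) -> float:
    """Return the weighted average of per-criterion scores, rounded to 2 places.

    Criteria missing from `scores` count as 0, mirroring the 0 defaults
    of the LLMEvaluationScores model.
    """
    total_weight = sum(HYPOTHETICAL_WEIGHTS.values())
    total = sum(
        scores.get(name, 0) * weight
        for name, weight in HYPOTHETICAL_WEIGHTS.items()
    )
    return round(total / total_weight, 2)
```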

* fix(eval): use resolveServiceURL for evaluate API endpoint

The evaluateReport function was using a relative URL '/api/report/evaluate'
which sent requests to the Next.js server instead of the FastAPI backend.
Changed to use resolveServiceURL() consistent with other API functions.

* fix: improve type accuracy and React hooks in evaluation components

- Fix get_word_count_target return type from Optional[Dict] to Dict since it always returns a value via default fallback
- Fix useEffect dependency issue in EvaluationDialog using useRef to prevent unwanted re-evaluations
- Add aria-label to GradeBadge for screen reader accessibility

Author: Willem Jiang
Date: 2025-12-25 21:55:48 +08:00
Committed by: GitHub
Parent: 84a7f7815c
Commit: 8d9d767051

17 changed files with 2103 additions and 2 deletions
src/server/app.py (+35)
@@ -35,6 +35,7 @@ from src.podcast.graph.builder import build_graph as build_podcast_graph
from src.ppt.graph.builder import build_graph as build_ppt_graph
from src.prompt_enhancer.graph.builder import build_graph as build_prompt_enhancer_graph
from src.prose.graph.builder import build_graph as build_prose_graph
from src.eval import ReportEvaluator
from src.rag.builder import build_retriever
from src.rag.milvus import load_examples as load_milvus_examples
from src.rag.qdrant import load_examples as load_qdrant_examples
@@ -47,6 +48,7 @@ from src.server.chat_request import (
    GenerateProseRequest,
    TTSRequest,
)
from src.server.eval_request import EvaluateReportRequest, EvaluateReportResponse
from src.server.config_request import ConfigResponse
from src.server.mcp_request import MCPServerMetadataRequest, MCPServerMetadataResponse
from src.server.mcp_utils import load_mcp_tools
@@ -946,6 +948,39 @@ async def generate_prose(request: GenerateProseRequest):
        raise HTTPException(status_code=500, detail=INTERNAL_SERVER_ERROR_DETAIL)


@app.post("/api/report/evaluate", response_model=EvaluateReportResponse)
async def evaluate_report(request: EvaluateReportRequest):
    """Evaluate report quality using automated metrics and optionally LLM-as-Judge."""
    try:
        evaluator = ReportEvaluator(use_llm=request.use_llm)

        if request.use_llm:
            result = await evaluator.evaluate(
                request.content, request.query, request.report_style or "default"
            )
            return EvaluateReportResponse(
                metrics=result.metrics.to_dict(),
                score=result.final_score,
                grade=result.grade,
                llm_evaluation=result.llm_evaluation.to_dict()
                if result.llm_evaluation
                else None,
                summary=result.summary,
            )
        else:
            result = evaluator.evaluate_metrics_only(
                request.content, request.report_style or "default"
            )
            return EvaluateReportResponse(
                metrics=result["metrics"],
                score=result["score"],
                grade=result["grade"],
            )
    except Exception as e:
        logger.exception(f"Error occurred during report evaluation: {str(e)}")
        raise HTTPException(status_code=500, detail=INTERNAL_SERVER_ERROR_DETAIL)


@app.post("/api/prompt/enhance")
async def enhance_prompt(request: EnhancePromptRequest):
    try:
src/server/eval_request.py (+71)
@@ -0,0 +1,71 @@
# Copyright (c) 2025 Bytedance Ltd. and/or its affiliates
# SPDX-License-Identifier: MIT

"""Request models for report evaluation endpoint."""

from typing import Optional

from pydantic import BaseModel, Field


class EvaluateReportRequest(BaseModel):
    """Request model for report evaluation."""

    content: str = Field(description="Report markdown content to evaluate")
    query: str = Field(description="Original research query")
    report_style: Optional[str] = Field(
        default="default", description="Report style (academic, news, etc.)"
    )
    use_llm: bool = Field(
        default=False,
        description="Whether to use LLM for deep evaluation (slower but more detailed)",
    )


class EvaluationMetrics(BaseModel):
    """Automated metrics result."""

    word_count: int
    citation_count: int
    unique_sources: int
    image_count: int
    section_count: int
    section_coverage_score: float
    sections_found: list[str]
    sections_missing: list[str]
    has_title: bool
    has_key_points: bool
    has_overview: bool
    has_citations_section: bool


class LLMEvaluationScores(BaseModel):
    """LLM evaluation scores."""

    factual_accuracy: int = 0
    completeness: int = 0
    coherence: int = 0
    relevance: int = 0
    citation_quality: int = 0
    writing_quality: int = 0


class LLMEvaluation(BaseModel):
    """LLM evaluation result."""

    scores: LLMEvaluationScores
    overall_score: float
    weighted_score: float
    strengths: list[str]
    weaknesses: list[str]
    suggestions: list[str]


class EvaluateReportResponse(BaseModel):
    """Response model for report evaluation."""

    metrics: EvaluationMetrics
    score: float
    grade: str
    llm_evaluation: Optional[LLMEvaluation] = None
    summary: Optional[str] = None