deer-flow/docs/superpowers/specs/2026-06-08-minimax-generation-providers-design.md
DanielWalnut cd5bedaa74 feat: MiniMax provider for image/video/podcast skills + new music-generation skill (#3437)
* docs(spec): MiniMax integration for generation skills + new music skill

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(plan): MiniMax generation providers implementation plan

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(skills): add importlib loader + FakeResp for skill tests

* test(skills): register loaded module in sys.modules; raise requests.HTTPError in FakeResp

* feat(image-generation): add MiniMax provider with env auto-detect

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(image-generation): guard unknown provider, derive ref MIME, strengthen tests

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(video-generation): add MiniMax provider with async poll/download

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(video-generation): surface base_resp errors while polling; add timeout test

* feat(podcast-generation): add MiniMax t2a_v2 provider with env auto-detect

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor(podcast-generation): restore TTS credential guard; add volcengine + voice tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* feat(music-generation): new MiniMax music skill via skill-creator

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(music-generation): treat empty lyrics as absent; test no-audio-data path

* refactor(skills): add request timeouts to MiniMax network calls

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Potential fix for pull request finding 'Explicit returns mixed with implicit (fall through) returns'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* fix(models): strip inconsistent user-message names for MiniMax chat

DeerFlow middlewares tag user messages with provenance names (user-input, summary, loop_warning); langchain serializes them into the OpenAI-compatible payload and MiniMax rejects mismatched user-message names with "user name must be consistent (2013)". PatchedChatMiniMax now drops the per-message name from user-role messages. Point the config.example MiniMax models at PatchedChatMiniMax so they also get reasoning_content mapping.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(image-generation): MiniMax sends JSON prompt field, guard 1500-char limit

MiniMax image-01 takes one text string capped at 1500 chars, but the skill was sending the whole structured JSON. The MiniMax provider now extracts the JSON `prompt` field (relying on prompt_optimizer to expand it) and fails fast with a clear error before calling the API when that field exceeds 1500 chars. Authoring stays provider-agnostic; Gemini still receives the full JSON.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(podcast-generation): per-provider TTS concurrency and retry/backoff

Each TTS provider owns its concurrency internally — MiniMax runs single-threaded to reduce rate-limit failures, Volcengine keeps 4 workers — with automatic retry and backoff on transient HTTP and base_resp errors. No caller-facing concurrency knob.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(skills): address Copilot review comments on generation skills

- video: add raise_for_status + timeout to the Gemini download/POST/poll calls so non-2xx responses surface as clear HTTP errors instead of JSON/KeyError or hangs
- video: check the task Fail status before the generic base_resp check so the failure keeps its task_id context
- video/image: create the output file parent directory before writing (matching music-generation) so nested output paths do not raise FileNotFoundError
- music: require a non-empty prompt and fail fast with ValueError instead of sending an empty prompt to the API

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(scripts): reclaim dev ports across worktrees in make stop/dev

All deer-flow worktrees (main checkout + linked worktrees) hardcode the same dev ports (8001/3000/2026), so a service started from any worktree must be reclaimable from another. stop_all now resolves the set of worktree roots (DEERFLOW_ROOTS) and treats a process as deer-flow-owned when its open files live under any of them. It also force-kills survivors on 2026 alongside 8001/3000, fixing `make dev` aborting on the nginx port preflight when a prior nginx lingered on 2026.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(view-image): hide the injected image-context message from the UI

ViewImageMiddleware injects a HumanMessage (text + base64 images) so the vision model can see viewed images, but it was the only internal injector that set neither hide_from_ui nor a hidden name, so it leaked into the chat UI (and IM channels) as a user bubble reading "Here are the images you've viewed:". Mark it with additional_kwargs={"hide_from_ui": True}, matching todo/dynamic_context injections, which the frontend isHiddenFromUIMessage and the channel sender already honor. The model still receives the full content.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(minimax): mark M2.7 models as text-only (no vision)

MiniMax M2.7 / M2.7-highspeed do not support vision; only M3 does. The
provider config asserted vision support for M2.7 in four places.

- config.example.yaml: 4 M2.7 entries -> supports_vision: false
- backend/docs/CONFIGURATION.md: M2.7 + highspeed -> supports_vision: false
- wizard: add LLMProvider.model_vision_overrides + extra_config_for() so
  selecting an M2.7 model writes supports_vision: false while M3 (default)
  keeps vision; wire it through setup_wizard.py
- tests: M2.7-highspeed fixture -> supports_vision=False; add
  test_minimax_vision_is_per_model

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Willem Jiang <willem.jiang@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
2026-06-08 22:04:38 +08:00

# Integrating MiniMax into the Generation Skills: Design Document
- Date: 2026-06-08
- Branch: `worktree-feat-minimax-generation`
- Reference: MiniMax Open Platform API (https://platform.minimaxi.com/docs/api-reference)
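## 1. Goals
1. Add MiniMax as an optional provider to the three existing skills `image-generation`, `video-generation`, and `podcast-generation`, alongside the existing Gemini / Volcengine providers.
2. Use the project's own `skill-creator` skill to create a new `music-generation` skill that talks to the MiniMax music generation API.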
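## 2. Background and current state
The three generation skills all live under `skills/public/<name>/`, and each is a **self-contained directory**:
- `SKILL.md`: frontmatter (`name`, `description`) plus usage instructions for the agent. At runtime the skill is available at `/mnt/skills/public/<name>/...`, and outputs are written to `/mnt/user-data/...`.
- `scripts/generate.py`: an `argparse` CLI that calls the external API using plain `requests`.
- Optional `templates/`.

Current providers:
| Skill | Current provider | Endpoint | Credentials |
|---|---|---|---|
| image-generation | Gemini | `generativelanguage.googleapis.com/.../gemini-3-pro-image-preview:generateContent` | `GEMINI_API_KEY` |
| video-generation | Gemini Veo | `.../veo-3.1-generate-preview:predictLongRunning` (long-running task, polled) | `GEMINI_API_KEY` |
| podcast-generation | Volcengine TTS | `openspeech.bytedance.com/api/v1/tts` (one request per line, multithreaded; base64 audio concatenated) | `VOLCENGINE_TTS_APPID` + `VOLCENGINE_TTS_ACCESS_TOKEN`, plus optional `VOLCENGINE_TTS_CLUSTER` |

MiniMax is already integrated as an **LLM chat provider** (`config.example.yaml` + `patched_minimax.py`), but it is **not used** for image, video, or audio generation. The repository has **no** music generation feature.

In the sandbox, skill directories are isolated and do not import from each other, so the MiniMax code is **inlined in each skill** rather than shared through a cross-skill module (a small amount of duplication is acceptable).

`skill-creator` is a real public skill in the repository (`skills/public/skill-creator/`, which includes the `scripts/init_skill.py` scaffolding script). The frontend file `frontend/src/app/mock/api/skills/route.ts` maintains the (mock) skill list shown in the UI.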
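## 3. Provider selection mechanism (confirmed with the user)
Each modified script gains a `_resolve_provider()` function that decides in this order:
1. **Explicit override**: if the environment variable `<SKILL>_PROVIDER` is set (for example `IMAGE_GENERATION_PROVIDER`, `VIDEO_GENERATION_PROVIDER`, or `PODCAST_GENERATION_PROVIDER`, with value `gemini`/`volcengine`/`minimax`), use it directly; it overrides auto-detection.
2. **Existing provider first**: if the existing provider's credentials are complete, use the existing provider (fully backward compatible).
3. **Fall back to MiniMax**: otherwise, if `MINIMAX_API_KEY` is set, use MiniMax.
4. If none of the above applies, raise a clear error explaining how to configure either set of environment variables.
> Design implication: default behavior is unchanged (users who already configured Gemini/Volcengine are unaffected); users who configured only MiniMax are routed to MiniMax automatically; users who configured both but want MiniMax force it with `<SKILL>_PROVIDER`.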
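
The four-step order above can be sketched as a small pure function. This is an illustrative sketch, not the repository's `_resolve_provider()`: the name `resolve_provider_sketch`, the `EXISTING_PROVIDERS` table, the injectable `env` argument, and the rejection of unknown override values are all assumptions made here for clarity and testability.

```python
import os

# Hypothetical table: the existing provider for each skill and the env vars
# that must all be set for its credentials to count as complete.
EXISTING_PROVIDERS = {
    "image_generation": ("gemini", ("GEMINI_API_KEY",)),
    "video_generation": ("gemini", ("GEMINI_API_KEY",)),
    "podcast_generation": (
        "volcengine",
        ("VOLCENGINE_TTS_APPID", "VOLCENGINE_TTS_ACCESS_TOKEN"),
    ),
}


def resolve_provider_sketch(skill, env=None):
    """Return the provider name for `skill`, following the four-step order."""
    env = os.environ if env is None else env
    existing, required = EXISTING_PROVIDERS[skill]

    # 1. Explicit override wins, e.g. IMAGE_GENERATION_PROVIDER=minimax.
    override = (env.get(f"{skill.upper()}_PROVIDER") or "").strip().lower()
    if override:
        # Assumption: values other than the two known providers are rejected.
        if override not in (existing, "minimax"):
            raise ValueError(f"Unknown provider {override!r} for {skill}")
        return override

    # 2. Existing provider first, to stay backward compatible.
    if all(env.get(name) for name in required):
        return existing

    # 3. Fall back to MiniMax when only its key is present.
    if env.get("MINIMAX_API_KEY"):
        return "minimax"

    # 4. Nothing usable is configured.
    raise RuntimeError(
        f"No provider configured for {skill}: set {' and '.join(required)} "
        f"for {existing}, or MINIMAX_API_KEY for MiniMax."
    )
```

Passing a plain dict as `env` keeps the decision logic free of global state, which makes each environment combination easy to exercise in a unit test.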
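## 4. MiniMax API integration details
Common to all endpoints:
- Base URL defaults to `https://api.minimaxi.com` and can be overridden with `MINIMAX_API_HOST` (alternate host: `https://api-bj.minimaxi.com`).
- Headers: `Authorization: Bearer $MINIMAX_API_KEY` and `Content-Type: application/json`.
- Unified error handling: if the response body has `base_resp.status_code != 0`, raise an exception that carries `status_msg`.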
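
A minimal sketch of these three shared conventions, assuming the response body has already been parsed into a dict. The helper names (`minimax_base_url`, `minimax_headers`, `check_base_resp`) and the `MiniMaxAPIError` class are hypothetical and exist only for this illustration.

```python
import os

DEFAULT_HOST = "https://api.minimaxi.com"


class MiniMaxAPIError(RuntimeError):
    """Hypothetical exception type carrying the API's status_msg."""


def minimax_base_url(env=None):
    # MINIMAX_API_HOST overrides the default host; trailing slashes are dropped
    # so that paths such as "/v1/image_generation" can be appended directly.
    env = os.environ if env is None else env
    return (env.get("MINIMAX_API_HOST") or DEFAULT_HOST).rstrip("/")


def minimax_headers(api_key):
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


def check_base_resp(body):
    """Raise if base_resp.status_code is non-zero; otherwise return body."""
    base = body.get("base_resp") or {}
    code = base.get("status_code", 0)
    if code != 0:
        msg = base.get("status_msg", "unknown error")
        raise MiniMaxAPIError(f"MiniMax API error {code}: {msg}")
    return body
```

Treating a missing `base_resp` as success is a choice made in this sketch; a stricter implementation could require the field to be present.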
### 4.1 Image: `POST /v1/image_generation` (synchronous)
Request body:
```json
{
  "model": "image-01",
  "prompt": "<text>",
  "aspect_ratio": "16:9",
  "response_format": "base64",
  "n": 1,
  "prompt_optimizer": true
}
```
- Reference images: convert each to a Data URL (`data:image/jpeg;base64,...`) and place it in `subject_reference: [{"type": "character", "image_file": "<data url>"}]` (supported only by `image-01`; uses the images passed via the existing `--reference-images` option).
- Response: take `data.image_base64[0]`, decode it with `base64.b64decode`, and write the file. When `response_format` is `url`, download `data.image_urls[0]` instead (the implementation uses base64, which saves one download).
- The model can be overridden with `MINIMAX_IMAGE_MODEL` (default `image-01`).
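
The Data URL conversion and base64 response handling described above could look roughly like this. The function names are hypothetical, the MIME type is guessed from the file extension with a JPEG fallback (an assumption), and sending one `subject_reference` entry per reference image is also an assumption; the network call itself is left out.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path):
    """Encode a local image file as a data: URL."""
    mime = mimetypes.guess_type(str(path))[0] or "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_payload(prompt, aspect_ratio="16:9", reference_images=(),
                        model="image-01"):
    payload = {
        "model": model,
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "response_format": "base64",
        "n": 1,
        "prompt_optimizer": True,
    }
    if reference_images:
        # Assumption: one "character" entry per reference image.
        payload["subject_reference"] = [
            {"type": "character", "image_file": to_data_url(p)}
            for p in reference_images
        ]
    return payload


def save_first_image(body, output_file):
    """Decode data.image_base64[0] and write it to output_file."""
    raw = base64.b64decode(body["data"]["image_base64"][0])
    out = Path(output_file)
    out.parent.mkdir(parents=True, exist_ok=True)  # allow nested output paths
    out.write_bytes(raw)
    return out
```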
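### 4.2 Video (asynchronous, three steps)
1. `POST /v1/video_generation`
   ```json
   { "model": "MiniMax-Hailuo-2.3", "prompt": "<text>", "first_frame_image": "<data url, optional>" }
   ```
   → `{ "task_id": "...", "base_resp": {...} }`
2. Poll `GET /v1/query/video_generation?task_id=<id>` → `status ∈ {Preparing, Queueing, Processing, Success, Fail}`; on `Success` the response includes `file_id`.
3. `GET /v1/files/retrieve?file_id=<id>` → `file.download_url`; download the mp4 and write it out.
- Reference images: the first one is converted to a Data URL and used as `first_frame_image`.
- MiniMax video generation has no `aspect_ratio` concept (it uses resolution/duration), so the MiniMax path ignores `--aspect-ratio` and uses the default resolution.
- Poll every 3 s with a cap on the number of attempts (for example 120 attempts, about 6 minutes) to avoid an infinite loop; raise an error on `Fail` or on timeout.
- The model can be overridden with `MINIMAX_VIDEO_MODEL` (default `MiniMax-Hailuo-2.3`).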
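
Step 2, the bounded polling loop, is sketched below. To keep the example free of network calls, the HTTP request is represented by an injected callable `query_status` (hypothetical) that returns the parsed JSON body for a `task_id`; the function name and the choice of exception types are likewise assumptions.

```python
import time


def poll_video_task(query_status, task_id, interval_s=3.0, max_attempts=120,
                    sleep=time.sleep):
    """Poll until the task succeeds; return its file_id.

    query_status(task_id) stands in for
    GET /v1/query/video_generation?task_id=<id> and returns the JSON body.
    """
    for _ in range(max_attempts):
        body = query_status(task_id)
        status = body.get("status")
        if status == "Success":
            return body["file_id"]
        if status == "Fail":
            raise RuntimeError(f"MiniMax video task {task_id} failed")
        # Preparing / Queueing / Processing: wait and ask again.
        sleep(interval_s)
    waited = interval_s * max_attempts
    raise TimeoutError(
        f"MiniMax video task {task_id} did not finish within {waited:.0f}s"
    )
```

With the defaults, the loop gives up after 120 × 3 s = 360 s, which matches the roughly 6-minute cap mentioned above. Injecting `sleep` lets a test run the loop without actually waiting.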
### 4.3 Podcast TTS: `POST /v1/t2a_v2` (synchronous)
Keep the existing "per-line + multithreaded `ThreadPoolExecutor` + concatenation" structure and replace only the single-line synthesis function:
```json
{
  "model": "speech-2.6-hd",
  "text": "<single line of text>",
  "voice_setting": { "voice_id": "<male/female preset>", "speed": 1.0, "vol": 1.0, "pitch": 0 },
  "audio_setting": { "sample_rate": 32000, "bitrate": 128000, "format": "mp3", "channel": 1 },
  "output_format": "hex"
}
```
- The response field `data.audio` is **hex-encoded** → `bytes.fromhex(audio)` (unlike Volcengine, which returns base64).
- Role mapping: `male`/`female` → preset MiniMax voice_id values; the defaults can be overridden with `MINIMAX_TTS_VOICE_MALE` / `MINIMAX_TTS_VOICE_FEMALE`.
- The model can be overridden with `MINIMAX_TTS_MODEL` (default `speech-2.6-hd`).
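
The hex decoding and the role-to-voice mapping can be illustrated as follows. The voice IDs in `DEFAULT_VOICES` are placeholders, since this document does not name the chosen presets, and all function names are hypothetical.

```python
import os

# Placeholder voice IDs; the real preset names are not given in this document.
DEFAULT_VOICES = {"male": "<male-voice-id>", "female": "<female-voice-id>"}


def voice_id_for(speaker, env=None):
    """Map 'male'/'female' to a voice_id, honoring the env overrides."""
    env = os.environ if env is None else env
    override = env.get(f"MINIMAX_TTS_VOICE_{speaker.upper()}")
    return override or DEFAULT_VOICES[speaker]


def decode_hex_audio(body):
    """data.audio is hex text, so bytes.fromhex recovers the raw audio."""
    return bytes.fromhex(body["data"]["audio"])


def join_lines(bodies):
    """Concatenate per-line audio chunks in script order."""
    return b"".join(decode_hex_audio(body) for body in bodies)
```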
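### 4.4 Music: `POST /v1/music_generation` (synchronous, new skill)
Request body:
```json
{
  "model": "music-2.6-free",
  "prompt": "<style / mood / scene>",
  "lyrics": "[verse]\n...\n[chorus]\n...",
  "output_format": "hex",
  "audio_setting": { "sample_rate": 44100, "bitrate": 256000, "format": "mp3" }
}
```
- The response field `data.audio` is **hex** → `bytes.fromhex`, written out as mp3.
- Lyrics rules:
  - `lyrics` provided: use it as-is (including structure tags such as `[Verse]`/`[Chorus]`, with lines separated by `\n`).
  - Not provided and `is_instrumental` is true: send `is_instrumental: true` (no lyrics needed).
  - Not provided and not instrumental: send `lyrics_optimizer: true` (the service writes lyrics from `prompt`).
- Uses only `MINIMAX_API_KEY` (music is provided only by MiniMax, so there is no provider selection); the model can be overridden with `MINIMAX_MUSIC_MODEL` (default `music-2.6-free`; paying users can set `music-2.6`).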
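
The three lyrics rules map naturally onto a payload builder. The sketch below is illustrative: `build_music_payload` is a hypothetical name, and two behaviors are assumptions of this sketch and not part of the rules above, namely treating whitespace-only lyrics as absent and rejecting an empty prompt.

```python
def build_music_payload(spec, model="music-2.6-free"):
    """Build the /v1/music_generation request body from the input JSON dict."""
    prompt = (spec.get("prompt") or "").strip()
    if not prompt:
        # Assumption: fail fast instead of sending an empty prompt.
        raise ValueError("music generation requires a non-empty prompt")

    payload = {
        "model": model,
        "prompt": prompt,
        "output_format": "hex",
        "audio_setting": {"sample_rate": 44100, "bitrate": 256000, "format": "mp3"},
    }

    lyrics = spec.get("lyrics") or ""
    if lyrics.strip():
        # Rule 1: lyrics provided, send them unchanged.
        payload["lyrics"] = lyrics
    elif spec.get("is_instrumental"):
        # Rule 2: instrumental track, no lyrics needed.
        payload["is_instrumental"] = True
    else:
        # Rule 3: let the service write lyrics from the prompt.
        payload["lyrics_optimizer"] = True
    return payload
```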
## 5. Changes by component
### 5.1 `skills/public/image-generation/scripts/generate.py`
- Extract the existing Gemini logic into `_generate_image_gemini(...)`.
- Add `_generate_image_minimax(...)`, `_resolve_provider("image_generation", ...)`, and `_to_data_url(path)`.
- The top-level `generate_image(...)` routes by provider; the CLI and the function signature stay unchanged.
- `SKILL.md`: document the MiniMax provider and the required environment variables (the invocation does not change).
### 5.2 `skills/public/video-generation/scripts/generate.py`
- Same pattern: `_generate_video_gemini`, `_generate_video_minimax` (three-step polling flow), and `_resolve_provider("video_generation", ...)`.
- `SKILL.md`: add a description of the MiniMax provider.
### 5.3 `skills/public/podcast-generation/scripts/generate.py`
- `text_to_speech_volcengine` (the existing function, renamed) plus `text_to_speech_minimax`; inside `_process_line`/`tts_node`, choose the synthesis function and voice mapping via `_resolve_provider("podcast_generation", ...)`.
- Environment variable validation accepts either set of credentials; `SKILL.md` is updated accordingly.
### 5.4 New `skills/public/music-generation/` (via skill-creator)
- Generate the directory skeleton with the `skill-creator/scripts/init_skill.py` scaffold, then fill in:
  - `SKILL.md`: frontmatter `name: music-generation` plus a description; documents the input JSON structure, invocation, environment variables, and examples (following the style of the existing generation skills and the runtime path `/mnt/skills/public/music-generation/...`).
  - `scripts/generate.py`: CLI `--prompt-file <json> --output-file <mp3>`; reads JSON `{title, prompt, lyrics?, is_instrumental?}`; calls `/v1/music_generation`; converts hex → mp3.
- `frontend/src/app/mock/api/skills/route.ts`: add a `music-generation` entry (in alphabetical order, with `category:"public"` and `enabled:true`) so that it appears in the UI skill list.
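
For the `scripts/generate.py` CLI listed in 5.4, argument parsing and input loading might be sketched as below. The function names are hypothetical, and treating `title` and `prompt` as required is this sketch's reading of the `?` markers in `{title, prompt, lyrics?, is_instrumental?}`.

```python
import argparse
import json
from pathlib import Path

REQUIRED_KEYS = ("title", "prompt")  # assumption: keys without "?" are required


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate music with MiniMax (sketch).")
    parser.add_argument("--prompt-file", required=True,
                        help="Path to the input JSON file")
    parser.add_argument("--output-file", required=True,
                        help="Path of the mp3 file to write")
    return parser.parse_args(argv)


def load_spec(prompt_file):
    """Read the input JSON and check that the required keys are non-empty."""
    spec = json.loads(Path(prompt_file).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not spec.get(key)]
    if missing:
        raise ValueError(f"{prompt_file}: missing required field(s): {', '.join(missing)}")
    return spec
```

`argparse` exposes `--prompt-file` and `--output-file` as `args.prompt_file` and `args.output_file`, so the rest of the script can pass them straight to the loader and the writer.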
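## 6. Testing (TDD)
- Framework: pytest. Test directory: `tests/skills/` at the repository root (**not inside the skill directories that get deployed to the sandbox**).
- Load each `generate.py` by path with `importlib.util.spec_from_file_location`.
- Stub every `requests.post` / `requests.get` call with `unittest.mock`; **never hit the real API**.
- Coverage:
  - `_resolve_provider`: each combination of environment variables (existing key only / MiniMax key only / both / neither / `<SKILL>_PROVIDER` override) → the correct provider or the correct error.
  - Request body construction: payload fields for image/video/podcast/music, model defaults and env overrides, and reference image to Data URL conversion.
  - Response parsing: image base64 decoded and written to a file, music/podcast hex decoding, and the three-step video flow (mock task_id → Success → download_url → content written out).
  - Errors: `base_resp.status_code != 0` raises an exception; video `Fail` and timeout branches.
- Write the failing tests first, then implement until they pass.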
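
Because every skill ships a file named `generate.py`, loading by path needs a distinct module name per skill. A minimal loader along those lines is sketched here; `load_module_from_path` is a hypothetical helper name, and registering the module in `sys.modules` before executing it is a choice made in this sketch.

```python
import importlib.util
import sys


def load_module_from_path(name, path):
    """Load the Python file at `path` as a module registered under `name`."""
    spec = importlib.util.spec_from_file_location(name, str(path))
    if spec is None or spec.loader is None:
        raise ImportError(f"Cannot load module {name!r} from {path}")
    module = importlib.util.module_from_spec(spec)
    # Register before executing so that patch targets given as dotted
    # strings (e.g. "<name>.requests.post") can be resolved.
    sys.modules[name] = module
    spec.loader.exec_module(module)
    return module
```

A test can then load, say, the image skill under a unique name and replace its `requests.post` with `unittest.mock.patch.object(module.requests, "post", ...)`, assuming the loaded script imports `requests` at module level.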
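## 7. Backward compatibility
- Existing CLI arguments and default behavior are unchanged; MiniMax is used only when the existing provider's credentials are missing (or `<SKILL>_PROVIDER` is set explicitly).
- The existing LLM-side MiniMax integration is not modified.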
## 8. Summary of new environment variables
| Variable | Purpose | Default |
|---|---|---|
| `MINIMAX_API_KEY` | Reuses the existing LLM key of the same name | Required (when MiniMax is used) |
| `MINIMAX_API_HOST` | MiniMax base URL | `https://api.minimaxi.com` |
| `IMAGE_GENERATION_PROVIDER` / `VIDEO_GENERATION_PROVIDER` / `PODCAST_GENERATION_PROVIDER` | Force a provider | Unset (auto-detect) |
| `MINIMAX_IMAGE_MODEL` | Image model | `image-01` |
| `MINIMAX_VIDEO_MODEL` | Video model | `MiniMax-Hailuo-2.3` |
| `MINIMAX_TTS_MODEL` | TTS model | `speech-2.6-hd` |
| `MINIMAX_TTS_VOICE_MALE` / `MINIMAX_TTS_VOICE_FEMALE` | Podcast voices | The selected male/female system voices |
| `MINIMAX_MUSIC_MODEL` | Music model | `music-2.6-free` |
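## 9. Non-goals (YAGNI)
- No cover songs (`music-cover` / `music_cover_preprocess`), no standalone lyrics generation endpoint (`lyrics_generation`; the built-in `lyrics_optimizer` already covers automatic lyric writing), no voice cloning or voice design, no video template Agent, and no streaming synthesis.
- No unified "GenerationProvider" abstraction across skills (sandbox isolation + YAGNI).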