Synthetic Training Data Generator

This tool generates high-quality synthetic training data for fine-tuning LLMs using an OpenAI-compatible API. Designed for roleplay data with a strict style: Obtuse, Passionate, Absurd (includes mature themes).

Current Status (2024-12-14)

ISSUE: The script is getting intermittent HTTP 400 and 429 errors from the API.

429 errors: Quota exhausted on rotating keys (handled by key rotation)
400 errors: Need to add retry logic to handle transient failures

TODO for next session:

Add retry logic with exponential backoff to generate_training_data.py
Detect when error messages are returned as successful content (the proxy sometimes returns errors inside 200 responses)
Consider filtering out responses that start with 错误: (Chinese for "Error:")

Structure

generate_training_data.py: Main script that processes character cards and generates multi-turn conversations
.env: API configuration (API_KEY, MODEL_NAME, BASE_URL)
chars/: Directory containing character definition files (chara_card_v2 JSON format)
training_data.json: Output file with generated conversations
GEMINI.md: Session memory file with full context history

Setup

Configure API - Edit .env:

API_KEY=your_api_key
MODEL_NAME=claude-opus-4-5-thinking
BASE_URL=http://127.0.0.1:8045/v1

Run on NixOS:

nix-shell -p python3Packages.python-dotenv python3Packages.requests python3Packages.openai --run "python generate_training_data.py"

How It Works

Loads character cards from chars/*.json
Uses an enforced "GameMaster" system prompt (see ENFORCED_SYSTEM_PROMPT in script)
For each character:
- Uses the character's first_mes as the initial assistant message
- Generates 5 turns of User ↔ Character interaction
- User responses are generated by a "User Simulator" prompt
- Character responses use the full system prompt + character description
Saves incrementally to training_data.json

Key Code Sections

Lines 137-197: The ENFORCED_SYSTEM_PROMPT - detailed roleplay instructions
Lines 38-82: generate_user_response() - simulates user input
Lines 84-107: generate_character_response() - generates character replies
Error handling: Currently catches APIStatusError but needs retry logic

API Notes

The local endpoint at 127.0.0.1:8045 is a proxy with rotating API keys
Thinking models (claude-*-thinking) may have special requirements
Error responses sometimes come back as 200 with error text in content

2.6 KiB Raw Blame History