| .. | ||
| generate_training_data.py | ||
| README.md | ||
| requirements.txt | ||
| test_api.py | ||
| test_url.py | ||
| training_data.json | ||
Synthetic Training Data Generator
This tool generates high-quality synthetic training data for fine-tuning LLMs using an OpenAI-compatible API. Designed for roleplay data with a strict style: Obtuse, Passionate, Absurd (includes mature themes).
Current Status (2024-12-14)
ISSUE: The script is getting intermittent HTTP 400 and 429 errors from the API.
- 429 errors: Quota exhausted on rotating keys (handled by key rotation)
- 400 errors: Need to add retry logic to handle transient failures
TODO for next session:
- Add retry logic with exponential backoff to
generate_training_data.py - Detect when error messages are returned as successful content (the proxy sometimes returns errors inside 200 responses)
- Consider filtering out responses that start with
错误:(Chinese for "Error:")
Structure
generate_training_data.py: Main script that processes character cards and generates multi-turn conversations.env: API configuration (API_KEY, MODEL_NAME, BASE_URL)chars/: Directory containing character definition files (chara_card_v2 JSON format)training_data.json: Output file with generated conversationsGEMINI.md: Session memory file with full context history
Setup
-
Configure API - Edit
.env:API_KEY=your_api_key MODEL_NAME=claude-opus-4-5-thinking BASE_URL=http://127.0.0.1:8045/v1 -
Run on NixOS:
nix-shell -p python3Packages.python-dotenv python3Packages.requests python3Packages.openai --run "python generate_training_data.py"
How It Works
- Loads character cards from
chars/*.json - Uses an enforced "GameMaster" system prompt (see
ENFORCED_SYSTEM_PROMPTin script) - For each character:
- Uses the character's
first_mesas the initial assistant message - Generates 5 turns of User ↔ Character interaction
- User responses are generated by a "User Simulator" prompt
- Character responses use the full system prompt + character description
- Uses the character's
- Saves incrementally to
training_data.json
Key Code Sections
- Lines 137-197: The
ENFORCED_SYSTEM_PROMPT- detailed roleplay instructions - Lines 38-82:
generate_user_response()- simulates user input - Lines 84-107:
generate_character_response()- generates character replies - Error handling: Currently catches
APIStatusErrorbut needs retry logic
API Notes
- The local endpoint at
127.0.0.1:8045is a proxy with rotating API keys - Thinking models (
claude-*-thinking) may have special requirements - Error responses sometimes come back as 200 with error text in content