nixos/scripts/data_generator/README.md
2026-01-14 21:24:19 +01:00

61 lines
2.6 KiB
Markdown

# Synthetic Training Data Generator
This tool generates high-quality synthetic training data for fine-tuning LLMs using an OpenAI-compatible API. Designed for roleplay data with a strict style: **Obtuse, Passionate, Absurd** (includes mature themes).
## Current Status (2024-12-14)
**ISSUE**: The script is getting intermittent HTTP 400 and 429 errors from the API.
- **429 errors**: Quota exhausted on rotating keys (handled by key rotation)
- **400 errors**: Need to add retry logic to handle transient failures
**TODO for next session**:
1. Add retry logic with exponential backoff to `generate_training_data.py`
2. Detect when error messages are returned as successful content (the proxy sometimes returns errors inside 200 responses)
3. Consider filtering out responses that start with `错误:` (Chinese for "Error:")
## Structure
- `generate_training_data.py`: Main script that processes character cards and generates multi-turn conversations
- `.env`: API configuration (API_KEY, MODEL_NAME, BASE_URL)
- `chars/`: Directory containing character definition files (chara_card_v2 JSON format)
- `training_data.json`: Output file with generated conversations
- `GEMINI.md`: Session memory file with full context history
## Setup
1. **Configure API** - Edit `.env`:
```ini
API_KEY=your_api_key
MODEL_NAME=claude-opus-4-5-thinking
BASE_URL=http://127.0.0.1:8045/v1
```
2. **Run on NixOS**:
```bash
nix-shell -p python3Packages.python-dotenv python3Packages.requests python3Packages.openai --run "python generate_training_data.py"
```
## How It Works
1. Loads character cards from `chars/*.json`
2. Uses an enforced "GameMaster" system prompt (see `ENFORCED_SYSTEM_PROMPT` in script)
3. For each character:
- Uses the character's `first_mes` as the initial assistant message
- Generates 5 turns of User ↔ Character interaction
- User responses are generated by a "User Simulator" prompt
- Character responses use the full system prompt + character description
4. Saves incrementally to `training_data.json`
## Key Code Sections
- **Lines 137-197**: The `ENFORCED_SYSTEM_PROMPT` - detailed roleplay instructions
- **Lines 38-82**: `generate_user_response()` - simulates user input
- **Lines 84-107**: `generate_character_response()` - generates character replies
- **Error handling**: Currently catches `APIStatusError` but needs retry logic
## API Notes
- The local endpoint at `127.0.0.1:8045` is a proxy with rotating API keys
- Thinking models (`claude-*-thinking`) may have special requirements
- Error responses sometimes come back as 200 with error text in content