nixos/scripts/data_generator/README.md

# Synthetic Training Data Generator

This tool generates high-quality synthetic training data for fine-tuning LLMs using an OpenAI-compatible API. Designed for roleplay data with a strict style: **Obtuse, Passionate, Absurd** (includes mature themes).

## Current Status (2024-12-14)

**ISSUE**: The script is getting intermittent HTTP 400 and 429 errors from the API.

- **429 errors**: Quota exhausted on rotating keys (handled by key rotation)
- **400 errors**: Need to add retry logic to handle transient failures

**TODO for next session**:
1. Add retry logic with exponential backoff to `generate_training_data.py`
2. Detect when error messages are returned as successful content (the proxy sometimes returns errors inside 200 responses)
3. Consider filtering out responses that start with `错误:` (Chinese for "Error:")

## Structure

- `generate_training_data.py`: Main script that processes character cards and generates multi-turn conversations
- `.env`: API configuration (API_KEY, MODEL_NAME, BASE_URL)
- `chars/`: Directory containing character definition files (chara_card_v2 JSON format)
- `training_data.json`: Output file with generated conversations
- `GEMINI.md`: Session memory file with full context history

## Setup

1. **Configure API** - Edit `.env`:
   ```ini
   API_KEY=your_api_key
   MODEL_NAME=claude-opus-4-5-thinking
   BASE_URL=http://127.0.0.1:8045/v1
   ```

2. **Run on NixOS**:
   ```bash
   nix-shell -p python3Packages.python-dotenv python3Packages.requests python3Packages.openai --run "python generate_training_data.py"
   ```

## How It Works

1. Loads character cards from `chars/*.json`
2. Uses an enforced "GameMaster" system prompt (see `ENFORCED_SYSTEM_PROMPT` in script)
3. For each character:
   - Uses the character's `first_mes` as the initial assistant message
   - Generates 5 turns of User ↔ Character interaction
   - User responses are generated by a "User Simulator" prompt
   - Character responses use the full system prompt + character description
4. Saves incrementally to `training_data.json`

## Key Code Sections

- **Lines 137-197**: The `ENFORCED_SYSTEM_PROMPT` - detailed roleplay instructions
- **Lines 38-82**: `generate_user_response()` - simulates user input
- **Lines 84-107**: `generate_character_response()` - generates character replies
- **Error handling**: Currently catches `APIStatusError` but needs retry logic

## API Notes

- The local endpoint at `127.0.0.1:8045` is a proxy with rotating API keys
- Thinking models (`claude-*-thinking`) may have special requirements
- Error responses sometimes come back as 200 with error text in content