nixos/scripts/data_generator/README.md
2026-01-14 21:24:19 +01:00

2.6 KiB

Synthetic Training Data Generator

This tool generates high-quality synthetic training data for fine-tuning LLMs using an OpenAI-compatible API. Designed for roleplay data with a strict style: Obtuse, Passionate, Absurd (includes mature themes).

Current Status (2024-12-14)

ISSUE: The script is getting intermittent HTTP 400 and 429 errors from the API.

  • 429 errors: Quota exhausted on rotating keys (handled by key rotation)
  • 400 errors: Need to add retry logic to handle transient failures

TODO for next session:

  1. Add retry logic with exponential backoff to generate_training_data.py
  2. Detect when error messages are returned as successful content (the proxy sometimes returns errors inside 200 responses)
  3. Consider filtering out responses that start with 错误: (Chinese for "Error:")

Structure

  • generate_training_data.py: Main script that processes character cards and generates multi-turn conversations
  • .env: API configuration (API_KEY, MODEL_NAME, BASE_URL)
  • chars/: Directory containing character definition files (chara_card_v2 JSON format)
  • training_data.json: Output file with generated conversations
  • GEMINI.md: Session memory file with full context history

Setup

  1. Configure API - Edit .env:

    API_KEY=your_api_key
    MODEL_NAME=claude-opus-4-5-thinking
    BASE_URL=http://127.0.0.1:8045/v1
    
  2. Run on NixOS:

    nix-shell -p python3Packages.python-dotenv python3Packages.requests python3Packages.openai --run "python generate_training_data.py"
    

How It Works

  1. Loads character cards from chars/*.json
  2. Uses an enforced "GameMaster" system prompt (see ENFORCED_SYSTEM_PROMPT in script)
  3. For each character:
    • Uses the character's first_mes as the initial assistant message
    • Generates 5 turns of User ↔ Character interaction
    • User responses are generated by a "User Simulator" prompt
    • Character responses use the full system prompt + character description
  4. Saves incrementally to training_data.json

Key Code Sections

  • Lines 137-197: The ENFORCED_SYSTEM_PROMPT - detailed roleplay instructions
  • Lines 38-82: generate_user_response() - simulates user input
  • Lines 84-107: generate_character_response() - generates character replies
  • Error handling: Currently catches APIStatusError but needs retry logic

API Notes

  • The local endpoint at 127.0.0.1:8045 is a proxy with rotating API keys
  • Thinking models (claude-*-thinking) may have special requirements
  • Error responses sometimes come back as 200 with error text in content