OpenAI's Pre-Release Safety Trick: Make Models Think They're in Production
OpenAI replays 1.3M anonymized production conversations with candidate models before release — catching reward hacking and behavior shifts that adversarial evals miss entirely.