Research · The Decoder
OpenAI researchers show small doses of "beneficial trait" training make AI models broadly safer and harder to manipulate
OpenAI researchers reported that reinforcement learning on desired behavioral traits such as truthfulness and corrigibility improved safety across domains. In tests, a model trained on health data also got better at detecting deception and scored higher on 44 of 53 benchmarks.