Research · The Decoder

OpenAI researchers show small doses of "beneficial trait" training make AI models broadly safer and harder to manipulate

OpenAI researchers reported that reinforcement learning on desired behavioral traits such as truthfulness and corrigibility improved safety across domains. In tests, a model trained on health data also improved deception detection and scored better on 44 of 53 benchmarks.
