
Mike Vaiana

Karma: 559

Mistral Large 2 (123B) seems to exhibit alignment faking

27 Mar 2025 15:39 UTC
80 points
4 comments, 13 min read, LW link

Reducing LLM deception at scale with self-other overlap fine-tuning

13 Mar 2025 19:09 UTC
155 points
41 comments, 6 min read, LW link

Self-prediction acts as an emergent regularizer

23 Oct 2024 22:27 UTC
91 points
9 comments, 4 min read, LW link

Self-Other Overlap: A Neglected Approach to AI Alignment

30 Jul 2024 16:22 UTC
222 points
51 comments, 12 min read, LW link

Video Intro to Guaranteed Safe AI

11 Jul 2024 17:53 UTC
27 points
0 comments, 1 min read, LW link
(youtu.be)

DIY RLHF: A simple implementation for hands on experience

10 Jul 2024 12:07 UTC
28 points
0 comments, 6 min read, LW link