Tomek Korbak
Karma: 745
Senior Research Scientist at UK AISI working on AI control
https://tomekkorbak.com/
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak, Mikita Balesni, Buck and Geoffrey Irving
14 Apr 2025 16:45 UTC, 29 points, 1 comment, 2 min read
A sketch of an AI control safety case
Tomek Korbak, joshc, Benjamin Hilton, Buck and Geoffrey Irving
30 Jan 2025 17:28 UTC, 57 points, 0 comments, 5 min read
Eliciting bad contexts
Geoffrey Irving, Joseph Bloom and Tomek Korbak
24 Jan 2025 10:39 UTC, 32 points, 8 comments, 3 min read
Automation collapse
Geoffrey Irving, Tomek Korbak and Benjamin Hilton
21 Oct 2024 14:50 UTC, 72 points, 9 comments, 7 min read
Compositional preference models for aligning LMs
Tomek Korbak
25 Oct 2023 12:17 UTC, 18 points, 2 comments, 5 min read
Towards Understanding Sycophancy in Language Models
Ethan Perez, mrinank_sharma, Meg and Tomek Korbak
24 Oct 2023 0:30 UTC, 66 points, 0 comments, 2 min read (arxiv.org)
Paper: LLMs trained on “A is B” fail to learn “B is A”
lberglund, Owain_Evans, Meg, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland and Tomek Korbak
23 Sep 2023 19:55 UTC, 121 points, 74 comments, 4 min read (arxiv.org)
Paper: On measuring situational awareness in LLMs
Owain_Evans, Daniel Kokotajlo, Mikita Balesni, Tomek Korbak, Asa Cooper Stickland, Meg and Maximilian Kaufmann
4 Sep 2023 12:54 UTC, 109 points, 17 comments, 5 min read (arxiv.org)
Imitation Learning from Language Feedback
Jérémy Scheurer, Tomek Korbak and Ethan Perez
30 Mar 2023 14:11 UTC, 71 points, 3 comments, 10 min read
Pretraining Language Models with Human Preferences
Tomek Korbak, Sam Bowman and Ethan Perez
21 Feb 2023 17:57 UTC, 135 points, 20 comments, 11 min read, 2 reviews
RL with KL penalties is better seen as Bayesian inference
Tomek Korbak and Ethan Perez
25 May 2022 9:23 UTC, 115 points, 17 comments, 12 min read