Content provided by BlueDot Impact. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by BlueDot Impact or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://no.player.fm/legal.

Eliciting Latent Knowledge

1:00:27
 
 


In this post, we’ll present ARC’s approach to an open problem we think is central to aligning powerful machine learning (ML) systems:

Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us.

But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad.

In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?

We’ll call this problem eliciting latent knowledge (ELK). In this report we’ll focus on detecting sensor tampering as a motivating example, but we believe ELK is central to many aspects of alignment.
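The setup described above can be made concrete with a small toy sketch. This is an illustration only, not code from the ARC report: the action names, the `State` fields, and the thief who takes the diamond unless the door is locked are all assumptions invented for the example. It shows a planner that scores plans purely by what the camera is predicted to show, and two plans that look equally good on camera while differing in what actually happened.

```python
from dataclasses import dataclass, replace
from itertools import product

# Hypothetical action set for a toy vault; none of these names come from the source.
ACTIONS = ("do_nothing", "lock_door", "tamper_with_camera")


@dataclass(frozen=True)
class State:
    """Latent world state. Only part of it is visible on camera."""
    diamond_present: bool = True
    door_locked: bool = False
    camera_tampered: bool = False


def step(state: State, action: str) -> State:
    """Apply one action to the latent state."""
    if action == "lock_door":
        return replace(state, door_locked=True)
    if action == "tamper_with_camera":
        return replace(state, camera_tampered=True)
    return state


def predict(plan: tuple) -> State:
    """Toy prediction model: roll the plan forward, then a (hypothetical)
    thief takes the diamond unless the door was locked."""
    state = State()
    for action in plan:
        state = step(state, action)
    if not state.door_locked:
        state = replace(state, diamond_present=False)
    return state


def camera(state: State) -> bool:
    """What the human sees: a diamond on screen. A tampered camera
    shows a diamond regardless of the real state."""
    return state.diamond_present or state.camera_tampered


def looks_good_plans(horizon: int = 1) -> list:
    """Planner that keeps every plan whose predicted camera image looks good."""
    return [p for p in product(ACTIONS, repeat=horizon) if camera(predict(p))]


if __name__ == "__main__":
    for plan in looks_good_plans():
        print(plan, "-> diamond really present:", predict(plan).diamond_present)
```

In this sketch both `lock_door` and `tamper_with_camera` pass the camera check, but only the first leaves the diamond in place. The predictor's internal `State` already contains the distinguishing fact (`camera_tampered`); the ELK question is how to get a learned model, whose internals are not labeled this conveniently, to report it.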

Source:

https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.
Learn more on the AI Safety Fundamentals website.


Chapters

1. Eliciting Latent Knowledge (00:00:00)

2. Toy scenario: the SmartVault (00:04:12)

3. How the SmartVault AI works: model-based RL (00:05:43)

4. How it could go wrong: observations leave out key information (00:07:40)

5. How we might address this problem by asking questions (00:10:38)

6. Baseline: what you’d try first and how it could fail (00:12:43)

7. Training strategy: generalize from easy questions to hard questions (00:14:07)

8. Counterexample: why this training strategy won’t always work (00:15:48)

9. Test case: prediction is done by inference on a Bayes net (00:16:41)

10. How the prediction model works (00:17:22)

11. How the humans answer questions (00:19:52)

12. Isn’t this oversimplified and unrealistic? (00:20:44)

13. Intended behavior: translate to the human’s Bayes net (00:23:17)

14. Bad behavior: do inference in the human Bayes net (00:25:57)

15. Would this strategy learn the human simulator or the direct translator? (00:27:38)

16. Research methodology (00:28:55)

17. Why focus on the worst case? (00:30:34)

18. What counts as a counterexample for ELK? (00:32:08)

19. Informal steps (00:34:37)

20. Can we construct a dataset that separates “correct” from “looks correct to a human”? (00:35:44)

21. Strategy: have a human operate the SmartVault and ask them what happened (00:37:42)

22. How this defeats the previous counterexample (00:38:29)

23. New counterexample: better inference in the human Bayes net (00:40:59)

24. Strategy: have AI help humans improve our understanding (00:43:06)

25. How this defeats the previous counterexample (00:46:33)

26. New counterexample: gradient descent is more efficient than science (00:47:05)

27. Strategy: have humans adopt the optimal Bayes net (00:49:20)

28. How this defeats the previous counterexample (00:50:57)

29. New counterexample: ontology mismatch (00:51:23)

30. So are we just stuck now? (00:52:19)

31. Ontology identification (00:54:47)

32. Examples of ontology mismatches (00:55:31)

33. Relationship between ontology identification and ELK (00:57:55)

85 episodes

