Judging The Judges: Evaluating Alignment And Vulnerabilities In LLMs-as-Judges Deep Papers podcast


Science Tech Math Business Arize AI

Content provided by Arize AI. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Arize AI or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process outlined here: https://no.player.fm/legal.

Deep Papers
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

1M ago 39:05


This week’s paper presents a comprehensive study of how well various LLMs perform as judges. The researchers use TriviaQA as a benchmark for assessing the objective knowledge reasoning of LLMs and evaluate the judges against human annotations, which they find to have high inter-annotator agreement. The study includes nine judge models and nine exam-taker models, both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to explore the strengths and weaknesses of this paradigm and the potential biases it may hold.
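As a rough illustration of what "alignment" between a judge model and human annotators can mean, the sketch below compares two sets of verdicts using percent agreement and Cohen's kappa. This is a minimal sketch under assumptions: the description above does not say which metrics the paper reports, and the function names, the "correct"/"incorrect" labels, and the sample verdicts are all made up for illustration.

```python
from collections import Counter


def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators give the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


def cohens_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(labels_a)
    p_observed = percent_agreement(labels_a, labels_b)
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    p_expected = sum(
        counts_a[label] * counts_b[label]
        for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if p_expected == 1.0:
        # Both annotators always use the same single label.
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical verdicts on six exam-taker answers (illustrative data only).
human = ["correct", "correct", "incorrect", "correct", "incorrect", "incorrect"]
judge = ["correct", "correct", "correct", "correct", "incorrect", "incorrect"]

print(f"percent agreement: {percent_agreement(human, judge):.3f}")
print(f"Cohen's kappa:     {cohens_kappa(human, judge):.3f}")
```

In this toy data the judge matches the human on five of six answers, yet kappa is noticeably lower than raw agreement because half of that agreement would be expected by chance. That gap is one reason a chance-corrected score can be more informative when comparing judges.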

Read it on the blog: https://arize.com/blog/judging-the-judges-llm-as-a-judge/

To learn more about ML observability, join the Arize AI Slack community or get the latest on our LinkedIn and Twitter.


30 episodes


15 subscribers
