LLM-as-a-Judge
LLM-as-a-Judge is a conceptual framework in natural language processing (NLP) that employs large language models (LLMs) as evaluators to assess the performance of other language-based systems or outputs. Instead of relying solely on human annotators, the approach leverages the general language capabilities of advanced language models to serve as automated judges.
LLM-as-a-Judge can be more cost-effective than human annotation and can be integrated into automated evaluation pipelines. Unlike traditional automatic evaluation metrics such as ROUGE and BLEU, which rely on transparent, rule-based comparison of surface-level n-grams, LLM-as-a-Judge relies on the opaque internal reasoning of large language models, offering evaluations that can incorporate deeper semantic understanding at the cost of interpretability.
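The following toy sketch illustrates what a surface-level n-gram comparison looks like and why it can penalize a correct paraphrase; it is a deliberately simplified stand-in (unigram precision only), not a full BLEU or ROUGE implementation, and the example sentences are illustrative assumptions.

```python
# Toy stand-in for a surface-level n-gram metric (no brevity penalty,
# no higher-order n-grams). A correct paraphrase scores poorly because
# it shares few literal tokens with the reference.
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference (clipped counts)."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    matches = sum(min(count, ref_counts[tok]) for tok, count in Counter(cand_tokens).items())
    return matches / len(cand_tokens) if cand_tokens else 0.0

reference = "the cat sat on the mat"
paraphrase = "a feline was resting on the rug"  # same meaning, different words

print(unigram_precision(paraphrase, reference))  # low score despite equivalent meaning
```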
Typically, a more powerful LLM is employed to evaluate the outputs of smaller or less capable language models—for example, using GPT-4 to assess the performance of a 13-billion-parameter LLaMA model.[1]
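A minimal sketch of such a pairwise judging setup is shown below. It loosely follows the pairwise-comparison protocol described in the MT-Bench paper,[1] but the prompt wording, the verdict labels, and the `ask_judge` callable (which would wrap an API call to a strong judge model such as GPT-4) are illustrative assumptions rather than a specific implementation.

```python
# Minimal sketch of pairwise LLM-as-a-Judge evaluation. The prompt wording,
# verdict labels, and the ask_judge callable are illustrative assumptions;
# ask_judge would wrap an API call to a strong judge model such as GPT-4.
from typing import Callable

JUDGE_PROMPT = """You are an impartial judge. Compare the two responses to the
user question below and decide which one is better, considering helpfulness,
relevance, accuracy, and level of detail. Answer with exactly one token:
"A", "B", or "TIE".

[Question]
{question}

[Response A]
{answer_a}

[Response B]
{answer_b}
"""

def judge_pair(question: str, answer_a: str, answer_b: str,
               ask_judge: Callable[[str], str]) -> str:
    """Ask the judge model to pick the better of two candidate answers."""
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    verdict = ask_judge(prompt).strip().upper()
    return verdict if verdict in {"A", "B", "TIE"} else "INVALID"

# Example wiring (hypothetical):
# def ask_judge(prompt: str) -> str:
#     ...  # send `prompt` to the judge model (e.g. GPT-4) and return its reply as text
```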
References
- ^ Zheng, Lianmin; Chiang, Wei-Lin; Sheng, Ying; et al. (9 June 2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685. doi:10.48550/ARXIV.2306.05685.