Draft:LegalBench
LegalBench[1] is an open-source benchmark designed to evaluate the legal reasoning capabilities of large language models (LLMs). Developed as a collaborative initiative in 2023, LegalBench includes over 160 legal tasks contributed by legal scholars, practitioners, and computational researchers. It serves both as a testbed for AI researchers and as a practical resource for legal professionals exploring the capabilities of language models in law-related applications.
Overview
LegalBench comprises a diverse set of tasks intended to assess how well LLMs can understand, reason through, and apply legal principles. Examples of tasks include:
- Determining whether a passage constitutes hearsay
- Identifying whether a statute includes a private right of action
- Answering substantive questions about legal rules and cases
Each task is associated with a dataset of input-output examples and is suitable for evaluation through prompting, fine-tuning, or retrieval-based techniques. Tasks span a wide variety of legal domains, document types, and complexity levels.
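The sketch below illustrates a prompting-based evaluation loop of the kind the benchmark supports. The Hugging Face dataset path (nguha/legalbench), the text and answer field names, and the llm_generate() stub are illustrative assumptions rather than official LegalBench tooling.

```python
# Minimal sketch of few-shot prompting evaluation on one LegalBench task.
# The dataset path, field names, and llm_generate() stub are assumptions,
# not part of the benchmark's published code.
from datasets import load_dataset


def llm_generate(prompt: str) -> str:
    """Stand-in for a call to any large language model API."""
    raise NotImplementedError("Replace with an actual LLM call.")


def evaluate_task(task_name: str = "hearsay") -> float:
    data = load_dataset("nguha/legalbench", task_name)
    train, test = data["train"], data["test"]

    # Use the small labeled split as in-context (few-shot) examples.
    shots = "\n\n".join(
        f"Text: {row['text']}\nAnswer: {row['answer']}" for row in train
    )

    correct = 0
    for row in test:
        prompt = f"{shots}\n\nText: {row['text']}\nAnswer:"
        prediction = llm_generate(prompt).strip().lower()
        correct += int(prediction == row["answer"].strip().lower())
    return correct / len(test)  # task accuracy
```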
Origins
LegalBench was created through a crowdsourced effort involving over 40 contributors, including law professors, practicing attorneys, legal technologists, and public interest legal organizations. Many tasks were newly created for the benchmark, while others were adapted from existing legal NLP datasets such as CUAD[2], ContractNLI[3], MAUD[4], and CaseHold. Contributors were encouraged to submit tasks they deemed "interesting" (i.e., reflective of reasoning challenges) or "useful" (i.e., applicable to real-world legal work).
Applications
LegalBench is intended for two main audiences:
- AI researchers seeking to test the capabilities of LLMs in domains requiring long-context reasoning, complex terminology, and minimal labeled data.
- Legal professionals and organizations evaluating the utility of LLMs for tasks such as legal research, contract analysis, or regulatory compliance.
LegalBench-RAG
In 2024, a derivative benchmark called LegalBench-RAG[5] was introduced by ZeroEntropy. This extension adapts tasks from LegalBench for use in evaluating retrieval-augmented generation (RAG) systems, AI models that combine document retrieval with text generation to improve factual accuracy.
LegalBench-RAG focuses on assessing retrieval quality in legal settings by providing precision and recall metrics for document-level retrieval over unstructured legal corpora. It is used to benchmark systems that rely on vector search, reranking, and prompt augmentation for generating legally accurate responses.
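A minimal sketch of the document-level precision and recall computation that such an evaluation performs is shown below; the retrieve_fn interface and identifiers are hypothetical and not taken from the published LegalBench-RAG code.

```python
# Sketch of document-level retrieval precision/recall used to score retrieval
# quality; retrieve_fn is a placeholder for any retriever (vector search,
# reranker, etc.) that returns document IDs for a query.
from typing import Callable, List, Set, Tuple


def retrieval_precision_recall(
    query: str,
    relevant_ids: Set[str],
    retrieve_fn: Callable[[str, int], List[str]],
    k: int = 10,
) -> Tuple[float, float]:
    retrieved = retrieve_fn(query, k)  # top-k document IDs for the query
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall
```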
Related benchmarks
LegalBench builds on or integrates tasks from several existing legal datasets, including:
- CUAD – Contract Understanding Atticus Dataset
- ContractNLI – Contractual natural language inference
- MAUD – Merger Agreement Understanding Dataset
- CaseHold – Case law entailment classification
- CLAUDETTE – Unfair terms detection in consumer contracts
- PolicyQA – Privacy policy question answering
References
- ^ Guha, Neel; Nyarko, Julian; Ho, Daniel E.; Ré, Christopher; Chilton, Adam; et al. (2023). "LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models". arXiv:2308.11462 [cs.CL].
- ^ Hendrycks, Dan; Burns, Collin; Chen, Anya; Ball, Spencer (2021). "CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review". arXiv:2103.06268 [cs.CL].
- ^ Koreeda, Yuta; Manning, Christopher D. (2021). "ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts". arXiv:2110.01799 [cs.CL].
- ^ Wang, Steven H.; Scardigli, Antoine; Tang, Leonard; Chen, Wei (2023). "MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding". arXiv:2301.00876 [cs.CL].
- ^ Pipitone, Nicholas; Alami, Ghita Houir (2024). "LegalBench-RAG: Evaluating Retrieval-Augmented Generation for Legal Reasoning". arXiv:2408.10343 [cs.AI].