Draft:Language-Theoretic Security

Language-theoretic security, or LangSec, is an approach to software security that focuses on input handling, complexity, and program design as strategies to improve the verifiability of computer programs. It was introduced in 2011 by Len Sassaman and Meredith L. Patterson.[1] It aims to give a formal description of which classes of security vulnerability particular software is likely to have, and why. It considers every program to have a parser component, whether explicit or not, consisting of the part of the program that operates on external input before that input has been fully parsed. A central hypothesis of language-theoretic security is that software vulnerabilities increase with the computational power, in the sense of automata theory, of the notional input-accepting automaton equivalent to this parser. The lower bound on this computational power is the complexity of the program's input language. The extent to which this complexity can be reduced is a function of the specification of the communication protocol or file format the program takes as input.

Parsing as a Security Mechanism

The behaviour of a program is defined with reference to its expected input. A program acting on unexpected input is a factor in numerous security bugs, including the so-called Android master key vulnerability (CVE-2013-4787),[2] because accepting unexpected input renders the program's specification ambiguous. In that instance, the ambiguity took the form of a ZIP file containing duplicate filenames, so that the entry whose signature was verified was not necessarily the entry that was installed.

If a program fully parses its input and acts only on input that unambiguously meets the specification, it follows that the program will avoid these types of vulnerabilities. This is an intentional inversion of the Postel principle of being liberal in what one accepts. Accepting only unambiguous, valid input is a more formal requirement than input validation or sanitization, and it reduces the number of unanticipated program states that user input can induce. Conversely, failure to do this is associated with security vulnerabilities.[3]
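
The inversion can be sketched as follows. The example below, written in Python against a hypothetical line-based "name=number" configuration format invented for illustration, first recognizes the entire input against its specification and only then acts on it; malformed or ambiguous input, such as a duplicate name, is rejected outright rather than partially processed or repaired.

    import re

    # Hypothetical specification: each line is "<name>=<decimal digits>",
    # names are lowercase ASCII letters, and no name may appear twice.
    LINE_PATTERN = re.compile(r"^[a-z]+=[0-9]+$")

    def recognize(text):
        """Return the parsed mapping only if the whole input is valid; otherwise reject."""
        result = {}
        for line in text.splitlines():
            if not LINE_PATTERN.match(line):
                raise ValueError("malformed line: " + repr(line))
            name, value = line.split("=", 1)
            if name in result:
                raise ValueError("ambiguous input: duplicate name " + repr(name))
            result[name] = int(value)
        return result

    def handle_request(text):
        # The program acts on the input only after it has been fully recognized.
        settings = recognize(text)
        return settings

A sanitizing approach might instead strip or "repair" the offending line and continue; under the language-theoretic view, doing so silently changes the input language the program actually accepts.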

Parser Differentials

If the language of accepted program input is sufficiently simple, it is possible to verify that two implementations parse the same input language consistently. This is advantageous because it shows that no parser differential exists between the two implementations. In theory, the requisite level of simplicity is one for which the language equivalence problem is decidable; this is the case for regular and deterministic context-free languages, but not for context-free languages in general. If the two parsers involved in CVE-2013-4787 had been equivalent, that is, if they had produced the same output state for any given input, the vulnerability could not have existed.
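
The Android example above can be reduced to a minimal sketch. In the illustration below (Python, with an invented archive format), two parsers accept exactly the same input but resolve duplicate entry names differently, so a component that trusts one parser's view and a component that acts on the other's can be driven apart; this is a parser differential.

    # Two parsers for the same toy "archive": a list of (name, payload) entries.
    # Both accept every input, but they resolve duplicate names differently.

    def parse_first_wins(entries):
        table = {}
        for name, payload in entries:
            table.setdefault(name, payload)   # keeps the first occurrence
        return table

    def parse_last_wins(entries):
        table = {}
        for name, payload in entries:
            table[name] = payload             # keeps the last occurrence
        return table

    archive = [("app.bin", b"signed, benign payload"),
               ("app.bin", b"unsigned, malicious payload")]

    # A verifier built on one parser and an installer built on the other
    # inspect different payloads for the same archive.
    assert parse_first_wins(archive)["app.bin"] != parse_last_wins(archive)["app.bin"]

A recognizer that simply rejected archives with duplicate names would make the two implementations agree trivially, because neither would accept the ambiguous input in the first place.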

One strategy for achieving this is to publish machine-readable specifications of a format or protocol and then use a parser generator to produce the parser code. An example of a parser generator built for this purpose is DaeDaLus.[4] Established toolchains such as Lex combined with Yacc or GNU Bison, or the standalone generator ANTLR, can also serve this role. However, many parser generators allow general-purpose code to be mixed with the parsing definitions, which weakens the guarantees provided by parsing.[5]

Analysis of Injection Attacks

Injection attacks are generally the result of differences between the serializer (or "unparser") and the corresponding parser at a layer boundary in a system; they are therefore a special case of parser differentials.[6] In a SQL injection attack, for example, an attacker causes the application with which they are interacting to serialize a SQL query whose semantics differ from those intended. In the simplest case, where the payload closes a string literal and appends new code, the payload crosses the code-data boundary of SQL. In language-theoretic security, this is treated as a bug in the serializer of the SQL query.
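
In concrete terms, the flawed serializer is typically string concatenation, which cannot guarantee that attacker-supplied data stays on the data side of the SQL grammar. The sketch below, using Python's standard sqlite3 module with an invented users table, contrasts this with a parameterized query, in which the database driver serializes the value strictly as data.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
    conn.execute("INSERT INTO users VALUES ('alice', 0)")

    attacker_input = "nobody' OR '1'='1"

    # Flawed serializer: the input is spliced into the query text, so it can
    # alter the structure of the query rather than only the data it carries.
    unsafe_query = "SELECT * FROM users WHERE name = '" + attacker_input + "'"
    print(len(conn.execute(unsafe_query).fetchall()))    # 1: a row is returned despite no user "nobody"

    # Parameterized query: the driver serializes the value as a single SQL
    # string literal, so the query's parse tree cannot be changed by the input.
    safe_rows = conn.execute("SELECT * FROM users WHERE name = ?",
                             (attacker_input,)).fetchall()
    print(len(safe_rows))                                 # 0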

Parser Combinators

If a parser generator is not used, it is still possible to avoid implementation bugs by using a parser combinator library such as Nom[7] to implement the parser code. This has the drawback of relying on a programmer to correctly translate the specification into the language of the combinator library, though this task is still less error-prone than hand-coding a parser.[8]
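
Nom itself is a Rust library; the sketch below instead hand-rolls two tiny combinators in Python to illustrate the general idea. Small parsers for literal text and for digit runs are combined into a parser for a hypothetical "timeout=<number>" assignment, and the result is rejected unless the entire input is consumed.

    # Minimal parser-combinator sketch. Each parser takes the input string and a
    # position, and either returns (value, new_position) or raises ValueError.

    def literal(expected):
        def parse(text, pos):
            if not text.startswith(expected, pos):
                raise ValueError("expected %r at position %d" % (expected, pos))
            return expected, pos + len(expected)
        return parse

    def digits(text, pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end == pos:
            raise ValueError("expected digits at position %d" % pos)
        return int(text[pos:end]), end

    def sequence(*parsers):
        def parse(text, pos):
            values = []
            for p in parsers:
                value, pos = p(text, pos)
                values.append(value)
            return values, pos
        return parse

    # Combined parser for a toy assignment such as "timeout=30".
    assignment = sequence(literal("timeout"), literal("="), digits)

    def parse_assignment(text):
        values, pos = assignment(text, 0)
        if pos != len(text):
            raise ValueError("trailing input rejected")
        return values[2]

    print(parse_assignment("timeout=30"))   # 30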

Input Format Complexity

Complexity in computer programs is associated with security vulnerabilities.[9][10][11] Within language-theoretic security, complexity is described with reference to the computational power of the abstract machine needed to implement the program, or more particularly, the parser for its input language. This complexity determines whether it is feasible to show that the program contains no unintended or undesired functionality that an attacker might exploit.
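
The practical effect of these complexity classes can be sketched with two invented formats. A flat, fixed-shape record forms a regular language and can be recognized by a finite automaton, approximated below with Python's re module, while a format permitting arbitrarily nested delimiters requires at least a counter or stack, and correspondingly more code whose correctness must be argued.

    import re

    # Regular input language: "R,<digits>,<digits>" has a fixed, flat shape,
    # so a finite automaton (here a compiled regular expression) recognizes it.
    FLAT_RECORD = re.compile(r"^R,[0-9]+,[0-9]+$")

    def recognize_flat(text):
        return FLAT_RECORD.match(text) is not None

    # Context-free input language: balanced nesting of "(" and ")" around digits.
    # No finite automaton can track unbounded nesting depth; a counter (a
    # degenerate stack) is the minimum extra machinery required.
    def recognize_nested(text):
        depth = 0
        for ch in text:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False
            elif not ch.isdigit():
                return False
        return depth == 0

    print(recognize_flat("R,12,7"))        # True
    print(recognize_nested("((1)(23))"))   # True
    print(recognize_nested("((1)"))        # False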

Weird Machines

A weird machine is a model of computation in a program that exists in parallel with, but is distinct from, the intended abstract model of computation in that program. Some classes of weird machine arise from the multi-layered nature of computer programs, or the context in which the programs run; others result from the unanticipated functionality a program has due to its complexity or to software bugs.

The more complex the computation model of a program, the more likely it is to implement a weird machine. Depending on context, the weird machine may or may not be concretely useful to an attacker. Since the space of weird machines for a given program is the universe of all possible states that lie outside the program's intended states, many exploited states, including remote code execution[12] and injection attacks, belong to the domain of weird machines. Reducing the scope for weird machines is therefore likely to correlate with reduced program vulnerability.
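
A deliberately simplified sketch of the idea, in Python: the intended model of the toy server below has only two states, unauthenticated and authenticated after a correct password, but a leftover debugging command creates a third, unmodelled state that a crafted message sequence can reach and exploit. The protocol and its bug are invented for illustration.

    # Intended model: LOGIN with the correct password, then READ.
    # The unintended, attacker-useful state arises because DEBUG sets a flag
    # that the READ handler trusts without re-checking authentication.

    class ToyServer:
        def __init__(self):
            self.authenticated = False
            self.verbose = False

        def handle(self, message):
            command, _, argument = message.partition(" ")
            if command == "LOGIN":
                self.authenticated = (argument == "secret")
                return "ok" if self.authenticated else "denied"
            if command == "DEBUG":                       # forgotten test hook
                self.verbose = True
                return "debug on"
            if command == "READ":
                if self.authenticated or self.verbose:   # bug: verbose bypasses the check
                    return "confidential data"
                return "denied"
            return "unknown command"

    server = ToyServer()
    print(server.handle("READ"))    # denied, as intended
    print(server.handle("DEBUG"))   # drives the server into an unmodelled state
    print(server.handle("READ"))    # confidential data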

SafeDocs Project

SafeDocs is a DARPA project begun in 2018 to take existing file formats, create safer subsets of them, and develop programming tools that work with those subsets. The initial test case was PDF. The purpose of creating safer subsets in this case is to lower the minimum required parser complexity so that it becomes possible to build tools that generate correct, normative parsers for them.[13]

Relation to Type Systems

Type-Safe Programming Languages

Memory-Safe Programming Languages

Program Analysis

Static Analysis

Dynamic Analysis

Programming Patterns

Recognizer Pattern

Unparser Pattern

References

  1. Sassaman, Len; Patterson, Meredith L. (2011-02-17), Towards a formal theory of computer insecurity: a language-theoretic approach (video), retrieved 2025-05-26
  2. Wang, Haoyu; Liu, Hongxuan; Xiao, Xusheng; Meng, Guozhu; Guo, Yao (November 2019). "Characterizing Android App Signing Issues". 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE: 280–292. doi:10.1109/ASE.2019.00035. ISBN 978-1-7281-2508-4. ISSN 2643-1572.
  3. Ali, Sameed; Anantharaman, Prashant; Lucas, Zephyr; Smith, Sean W. (May 2021). "What We Have Here Is Failure to Validate: Summer of LangSec". IEEE Security & Privacy. 19 (3): 17–23. doi:10.1109/MSEC.2021.3059167. ISSN 1540-7993.
  4. Diatchki, Iavor S.; Dodds, Mike; Goldstein, Harrison; Harris, Bill; Holland, David A.; Razet, Benoit; Schlesinger, Cole; Winwood, Simon (2024-06-20). "Daedalus: Safer Document Parsing". Proceedings of the ACM on Programming Languages. 8 (PLDI): 816–840. doi:10.1145/3656410. ISSN 2475-1421.
  5. Bangert, Julian; Zeldovich, Nickolai (May 2014). "Nail: A Practical Interface Generator for Data Formats". IEEE Security and Privacy Workshops. IEEE: 158–166. doi:10.1109/SPW.2014.31. ISBN 978-1-4799-5103-1.
  6. Sassaman, Len; Patterson, Meredith L.; Bratus, Sergey; Locasto, Michael E. (4 July 2013). "Security Applications of Formal Language Theory". IEEE Systems Journal. 7 (3): 489–500. doi:10.1109/JSYST.2012.2222000. ISSN 1932-8184.
  7. Couprie, Geoffroy (May 2015). "Nom, A Byte oriented, streaming, Zero copy, Parser Combinators Library in Rust". IEEE Security and Privacy Workshops. IEEE: 142–148. doi:10.1109/SPW.2015.31. ISBN 978-1-4799-9933-0.
  8. Isradisaikul, Chinawat; Myers, Andrew C. (2015-08-07). "Finding counterexamples from parsing conflicts". ACM SIGPLAN Notices. 50 (6): 555–564. doi:10.1145/2813885.2737961. ISSN 0362-1340.
  9. "A Plea for Simplicity". Schneier on Security. Retrieved 2025-06-04.
  10. Geer Jr., Daniel E. (November 2008). "Complexity Is the Enemy". IEEE Security & Privacy. 6 (6): 88. doi:10.1109/MSP.2008.139. ISSN 1558-4046.
  11. Hoffer, Gregory (30 April 2023). "Complexity Is Still the Enemy of Security". Cyber Defense Magazine.
  12. Dullien, Thomas (2020-04-01). "Weird Machines, Exploitability, and Provable Unexploitability". IEEE Transactions on Emerging Topics in Computing. 8 (2): 391–403. doi:10.1109/TETC.2017.2785299. ISSN 2168-6750.
  13. "SafeDocs: Safe Documents | DARPA". www.darpa.mil. Retrieved 2025-06-04.