Draft:Language-Theoretic Security

Language-theoretic security, or LangSec, is an approach to software security that focuses on input handling, complexity, and program design as strategies to improve the verifiability of computer programs. It was introduced in 2011 by Len Sassaman and Meredith L. Patterson.[1] It aims to give a formal description of which classes of security vulnerability particular software is likely to have, and why. It considers every program to have a parser component, whether explicit or not, consisting of the part of the program that operates on external input before that input has been fully parsed. A central hypothesis of language-theoretic security is that software vulnerabilities increase with the computational power, in the sense of automata theory, of the notional input-accepting automaton equivalent to this parser. The lower bound on this computational power is the complexity of the program's input language. The extent to which this complexity can be reduced is a function of the specification of the communication protocol or file format the program takes as input.

Parsing as a Security Mechanism

The behaviour of a program is defined with reference to its expected input. A program acting on unexpected input is a factor in numerous security bugs, including the so-called Android master key vulnerability (CVE-2013-4787),[2] because accepting unexpected input renders the program's specification ambiguous. In that instance, the ambiguity took the form of a ZIP file containing duplicate filenames, so that the entry whose signature was verified was not necessarily the entry that was installed.

If a program fully parses its input and acts only on input that unambiguously meets the specification, it follows that the program will avoid these types of vulnerabilities. This is an intentional inversion of the Postel principle of being liberal in what one accepts. Accepting only unambiguous, valid input is a more formal requirement than input validation or sanitization, and it reduces the number of unanticipated program states that user input can induce. Conversely, failure to do this is associated with security vulnerabilities.[3]
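
The inversion can be sketched as follows. The example below, written in Python against a hypothetical line-based "name=number" configuration format invented for illustration, first recognizes the entire input against its specification and only then acts on it; malformed or ambiguous input, such as a duplicate name, is rejected outright rather than partially processed or repaired.

    import re

    # Hypothetical specification: each line is "<name>=<decimal digits>",
    # names are lowercase ASCII letters, and no name may appear twice.
    LINE_PATTERN = re.compile(r"^[a-z]+=[0-9]+$")

    def recognize(text):
        """Return the parsed mapping only if the whole input is valid; otherwise reject."""
        result = {}
        for line in text.splitlines():
            if not LINE_PATTERN.match(line):
                raise ValueError("malformed line: " + repr(line))
            name, value = line.split("=", 1)
            if name in result:
                raise ValueError("ambiguous input: duplicate name " + repr(name))
            result[name] = int(value)
        return result

    def handle_request(text):
        # The program acts on the input only after it has been fully recognized.
        settings = recognize(text)
        return settings

A sanitizing approach might instead strip or "repair" the offending line and continue; under the language-theoretic view, doing so silently changes the input language the program actually accepts.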

Parser Differentials

If the language of accepted program input is sufficiently simple, it is possible to verify that two implementations parse the same input language consistently. This is advantageous because it shows that no parser differential exists between the two implementations. In theory, the requisite level of simplicity is one for which the language equivalence problem is decidable; this is the case for regular and deterministic context-free languages, but not for context-free languages in general. If the two parsers involved in CVE-2013-4787 had been equivalent, that is, if they had produced the same output state for any given input, the vulnerability could not have existed.
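
The Android example above can be reduced to a minimal sketch. In the illustration below (Python, with an invented archive format), two parsers accept exactly the same input but resolve duplicate entry names differently, so a component that trusts one parser's view and a component that acts on the other's can be driven apart; this is a parser differential.

    # Two parsers for the same toy "archive": a list of (name, payload) entries.
    # Both accept every input, but they resolve duplicate names differently.

    def parse_first_wins(entries):
        table = {}
        for name, payload in entries:
            table.setdefault(name, payload)   # keeps the first occurrence
        return table

    def parse_last_wins(entries):
        table = {}
        for name, payload in entries:
            table[name] = payload             # keeps the last occurrence
        return table

    archive = [("app.bin", b"signed, benign payload"),
               ("app.bin", b"unsigned, malicious payload")]

    # A verifier built on one parser and an installer built on the other
    # inspect different payloads for the same archive.
    assert parse_first_wins(archive)["app.bin"] != parse_last_wins(archive)["app.bin"]

A recognizer that simply rejected archives with duplicate names would make the two implementations agree trivially, because neither would accept the ambiguous input in the first place.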

One strategy for achieving this is to publish machine-readable specifications of a format or protocol and then use a parser generator to produce the parser code. An example of a parser generator built for this purpose is DaeDaLus.[4] Established toolchains such as Lex combined with Yacc or GNU Bison, or the standalone generator ANTLR, can also serve this role. However, many parser generators allow general-purpose code to be mixed with the parsing definitions, which weakens the guarantees provided by parsing.[5]

Analysis of Injection Attacks

Injection attacks are generally the result of differences between the serializer (or "unparser") and the corresponding parser at a layer boundary in a system; they are therefore a special case of parser differentials.[6] In a SQL injection attack, for example, an attacker causes the application with which they are interacting to serialize a SQL query whose semantics differ from those intended. In the simplest case, where the payload closes a string literal and appends new code, the payload crosses the code-data boundary of SQL. In language-theoretic security, this is treated as a bug in the serializer of the SQL query.
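
In concrete terms, the flawed serializer is typically string concatenation, which cannot guarantee that attacker-supplied data stays on the data side of the SQL grammar. The sketch below, using Python's standard sqlite3 module with an invented users table, contrasts this with a parameterized query, in which the database driver serializes the value strictly as data.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
    conn.execute("INSERT INTO users VALUES ('alice', 0)")

    attacker_input = "nobody' OR '1'='1"

    # Flawed serializer: the input is spliced into the query text, so it can
    # alter the structure of the query rather than only the data it carries.
    unsafe_query = "SELECT * FROM users WHERE name = '" + attacker_input + "'"
    print(len(conn.execute(unsafe_query).fetchall()))    # 1: a row is returned despite no user "nobody"

    # Parameterized query: the driver serializes the value as a single SQL
    # string literal, so the query's parse tree cannot be changed by the input.
    safe_rows = conn.execute("SELECT * FROM users WHERE name = ?",
                             (attacker_input,)).fetchall()
    print(len(safe_rows))                                 # 0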

Parser Combinators

If a parser generator is not used, it is still possible to avoid implementation bugs by using a parser combinator library such as Nom[7] to implement the parser code. This has the drawback of relying on a programmer to correctly translate the specification into the language of the combinator library, though this task is still less error-prone than hand-coding a parser.[8]
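
Nom itself is a Rust library; the sketch below instead hand-rolls two tiny combinators in Python to illustrate the general idea. Small parsers for literal text and for digit runs are combined into a parser for a hypothetical "timeout=<number>" assignment, and the result is rejected unless the entire input is consumed.

    # Minimal parser-combinator sketch. Each parser takes the input string and a
    # position, and either returns (value, new_position) or raises ValueError.

    def literal(expected):
        def parse(text, pos):
            if not text.startswith(expected, pos):
                raise ValueError("expected %r at position %d" % (expected, pos))
            return expected, pos + len(expected)
        return parse

    def digits(text, pos):
        end = pos
        while end < len(text) and text[end].isdigit():
            end += 1
        if end == pos:
            raise ValueError("expected digits at position %d" % pos)
        return int(text[pos:end]), end

    def sequence(*parsers):
        def parse(text, pos):
            values = []
            for p in parsers:
                value, pos = p(text, pos)
                values.append(value)
            return values, pos
        return parse

    # Combined parser for a toy assignment such as "timeout=30".
    assignment = sequence(literal("timeout"), literal("="), digits)

    def parse_assignment(text):
        values, pos = assignment(text, 0)
        if pos != len(text):
            raise ValueError("trailing input rejected")
        return values[2]

    print(parse_assignment("timeout=30"))   # 30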

Input Format Complexity

Complexity in computer programs is associated with security vulnerabilities.[9][10][11] Within language-theoretic security, complexity is described with reference to the computational power of the abstract machine needed to implement the program, or more particularly, the parser for its input language. This complexity determines whether it is feasible to show that the program contains no unintended or undesired functionality that an attacker might exploit.
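
The practical effect of these complexity classes can be sketched with two invented formats. A flat, fixed-shape record forms a regular language and can be recognized by a finite automaton, approximated below with Python's re module, while a format permitting arbitrarily nested delimiters requires at least a counter or stack, and correspondingly more code whose correctness must be argued.

    import re

    # Regular input language: "R,<digits>,<digits>" has a fixed, flat shape,
    # so a finite automaton (here a compiled regular expression) recognizes it.
    FLAT_RECORD = re.compile(r"^R,[0-9]+,[0-9]+$")

    def recognize_flat(text):
        return FLAT_RECORD.match(text) is not None

    # Context-free input language: balanced nesting of "(" and ")" around digits.
    # No finite automaton can track unbounded nesting depth; a counter (a
    # degenerate stack) is the minimum extra machinery required.
    def recognize_nested(text):
        depth = 0
        for ch in text:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
                if depth < 0:
                    return False
            elif not ch.isdigit():
                return False
        return depth == 0

    print(recognize_flat("R,12,7"))        # True
    print(recognize_nested("((1)(23))"))   # True
    print(recognize_nested("((1)"))        # False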

Weird Machines

A weird machine is a model of computation in a program that exists in parallel with, but is distinct from, the intended abstract model of computation in that program. Some classes of weird machine arise from the multi-layered nature of computer programs, or the context in which the programs run; others result from the unanticipated functionality a program has due to its complexity or to software bugs.

The more complex the computation model of a program, the more likely it is to implement a weird machine. Depending on context, the weird machine may or may not be concretely useful to an attacker. Since the space of weird machines for a given program is the universe of all possible states that lie outside the program's intended states, many exploited states, including remote code execution[12] and injection attacks, belong to the domain of weird machines. Reducing the scope for weird machines is therefore likely to correlate with reduced program vulnerability.
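
A deliberately simplified sketch of the idea, in Python: the intended model of the toy server below has only two states, unauthenticated and authenticated after a correct password, but a leftover debugging command creates a third, unmodelled state that a crafted message sequence can reach and exploit. The protocol and its bug are invented for illustration.

    # Intended model: LOGIN with the correct password, then READ.
    # The unintended, attacker-useful state arises because DEBUG sets a flag
    # that the READ handler trusts without re-checking authentication.

    class ToyServer:
        def __init__(self):
            self.authenticated = False
            self.verbose = False

        def handle(self, message):
            command, _, argument = message.partition(" ")
            if command == "LOGIN":
                self.authenticated = (argument == "secret")
                return "ok" if self.authenticated else "denied"
            if command == "DEBUG":                       # forgotten test hook
                self.verbose = True
                return "debug on"
            if command == "READ":
                if self.authenticated or self.verbose:   # bug: verbose bypasses the check
                    return "confidential data"
                return "denied"
            return "unknown command"

    server = ToyServer()
    print(server.handle("READ"))    # denied, as intended
    print(server.handle("DEBUG"))   # drives the server into an unmodelled state
    print(server.handle("READ"))    # confidential data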

SafeDocs Project

SafeDocs is a DARPA project begun in 2018 to take existing file formats, create safer subsets of them, and develop programming tools that work with those subsets. The initial test case was PDF. The purpose of creating safer subsets in this case is to lower the minimum required parser complexity so that it becomes possible to build tools that generate correct, normative parsers for them.[13]

Relation to Type Systems

Type-Safe Programming Languages

Memory-Safe Programming Languages

Program Analysis

Static Analysis

Dynamic Analysis

Programming Patterns

Recognizer Pattern

Unparser Pattern

References

  1. Sassaman, Len; Patterson, Meredith L. (2011-02-17), Towards a formal theory of computer insecurity: a language-theoretic approach (video), retrieved 2025-05-26
  2. Wang, Haoyu; Liu, Hongxuan; Xiao, Xusheng; Meng, Guozhu; Guo, Yao (November 2019). "Characterizing Android App Signing Issues". 34th IEEE/ACM International Conference on Automated Software Engineering. IEEE: 280–292. doi:10.1109/ASE.2019.00035. ISBN 978-1-7281-2508-4. ISSN 2643-1572.
  3. Ali, Sameed; Anantharaman, Prashant; Lucas, Zephyr; Smith, Sean W. (May 2021). "What We Have Here Is Failure to Validate: Summer of LangSec". IEEE Security & Privacy. 19 (3): 17–23. doi:10.1109/MSEC.2021.3059167. ISSN 1540-7993.
  4. Diatchki, Iavor S.; Dodds, Mike; Goldstein, Harrison; Harris, Bill; Holland, David A.; Razet, Benoit; Schlesinger, Cole; Winwood, Simon (2024-06-20). "Daedalus: Safer Document Parsing". Proceedings of the ACM on Programming Languages. 8 (PLDI): 816–840. doi:10.1145/3656410. ISSN 2475-1421.
  5. Bangert, Julian; Zeldovich, Nickolai (May 2014). "Nail: A Practical Interface Generator for Data Formats". IEEE Security and Privacy Workshops. IEEE: 158–166. doi:10.1109/SPW.2014.31. ISBN 978-1-4799-5103-1.
  6. Sassaman, Len; Patterson, Meredith L.; Bratus, Sergey; Locasto, Michael E. (4 July 2013). "Security Applications of Formal Language Theory". IEEE Systems Journal. 7 (3): 489–500. doi:10.1109/JSYST.2012.2222000. ISSN 1932-8184.
  7. Couprie, Geoffroy (May 2015). "Nom, A Byte oriented, streaming, Zero copy, Parser Combinators Library in Rust". IEEE Security and Privacy Workshops. IEEE: 142–148. doi:10.1109/SPW.2015.31. ISBN 978-1-4799-9933-0.
  8. Isradisaikul, Chinawat; Myers, Andrew C. (2015-08-07). "Finding counterexamples from parsing conflicts". ACM SIGPLAN Notices. 50 (6): 555–564. doi:10.1145/2813885.2737961. ISSN 0362-1340.
  9. "A Plea for Simplicity". Schneier on Security. Retrieved 2025-06-04.
  10. Geer Jr., Daniel E. (November 2008). "Complexity Is the Enemy". IEEE Security & Privacy. 6 (6): 88. doi:10.1109/MSP.2008.139. ISSN 1558-4046.
  11. Hoffer, Gregory (30 April 2023). "Complexity Is Still the Enemy of Security". Cyber Defense Magazine.
  12. Dullien, Thomas (2020-04-01). "Weird Machines, Exploitability, and Provable Unexploitability". IEEE Transactions on Emerging Topics in Computing. 8 (2): 391–403. doi:10.1109/TETC.2017.2785299. ISSN 2168-6750.
  13. "SafeDocs: Safe Documents | DARPA". www.darpa.mil. Retrieved 2025-06-04.