
Text processing

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Cpiral (talk | contribs) at 03:02, 5 June 2013 (+ sections 'History', 'Definition). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computing, text processing is the discipline of mechanizing the creation or manipulation of electronic text. Text here usually means the alphanumeric characters available on the keyboard of the person performing the mechanization; more generally, it means the abstraction layer one level above the standard character encoding of the target text. Processing means automated (or mechanized) processing, as opposed to the same manipulation done manually.

Text processing involves computer commands that invoke content, content changes, and cursor movement, for example to

  • search and replace
  • format
  • generate a processed report of the content of a text file, or
  • filter a text file or a report of a text file.
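
The operations listed above can be illustrated with a minimal Python sketch. The function names and sample data are hypothetical, chosen only for illustration; they are not part of any particular text processing tool.

```python
import re
import textwrap

def search_and_replace(text, pattern, replacement):
    """Replace every match of a regular expression in the text."""
    return re.sub(pattern, replacement, text)

def format_paragraph(text, width):
    """Re-wrap the text so that no line is longer than `width`."""
    return textwrap.fill(text, width=width)

def filter_lines(lines, pattern):
    """Keep only the lines containing a match (a grep-style filter)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def report(lines):
    """Generate a small report: number of lines and number of words."""
    return {"lines": len(lines), "words": sum(len(l.split()) for l in lines)}

# Hypothetical sample input.
sample = ["error: disk full", "ok: backup done", "error: no route"]
errors = filter_lines(sample, r"^error")
print(search_and_replace("\n".join(errors), r"error", "ERROR"))
print(report(errors))
```

Each function takes text in and returns text (or a summary of it) without any manual editing step, which is the sense of "processing" used here.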

Text processing with regular expressions amounts to a virtual editing machine with a primitive programming language that has named registers (identifiers) and named positions in the sequence of characters comprising the text. Using these, the "text processor" can, for example, mark a region of text and then move it. Text processing with a utility takes the form of a filter program, or filter. These two mechanisms together make up text processing.
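
As a rough illustration of named registers and of marking and moving a region, the following Python sketch uses named groups from the standard `re` module. The marker strings and function names are assumptions made for this example; real editors expose registers and marks through their own command languages.

```python
import re

def swap_names(text):
    """Use two named groups as 'registers' and emit them in reverse order."""
    return re.sub(r"(?P<last>\w+), (?P<first>\w+)", r"\g<first> \g<last>", text)

def move_marked_region(text, start_mark, end_mark, target_mark):
    """Cut the text between two markers and paste it at a third marker.

    The markers are hypothetical literal strings standing in for the
    named positions of an editor. The text is returned unchanged if
    the region or the target cannot be found.
    """
    pattern = re.compile(
        re.escape(start_mark) + r"(?P<region>.*?)" + re.escape(end_mark),
        re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return text
    # Remove the marked region together with its two markers.
    remainder = text[:match.start()] + text[match.end():]
    if target_mark not in remainder:
        return text
    # Paste the region at the first occurrence of the target marker.
    return remainder.replace(target_mark, match.group("region"), 1)

print(swap_names("Kleene, Stephen"))
print(move_marked_region("one <<two >>three @", "<<", ">>", "@"))
```

The named group `region` plays the role of a register holding the marked text, while the marker strings play the role of named positions.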

Definition

Because standardized markup such as ANSI escape codes is generally invisible to the editor, it forms a set of transitory properties that can at times make text processing indistinguishable from word processing. The definite distinctions from word processing are that text processing proper:

  • covers "text processing utilities", not just "text editing" applications.
  • initiates edits "the keyboard way" rather than "the mouse way" (e.g. drag and drop, cut and paste).
  • takes a sequential-access rather than a random-access approach.
  • operates directly at the presentation layer rather than indirectly at the application layer.
  • works on standardized raw data, and does so openly rather than tending towards proprietary methods.

In this way markup such as font and color is not really a distinguishing factor, because the character sequences that affect font and color are simply standard characters inserted automatically by a background text processing mode. Compliant text editors make these sequences work transparently, but they become visible as text processing commands when that mode is not in effect. Text processing is therefore defined most basically (but not entirely) around the visible characters (or graphemes) rather than the standard yet invisible characters.
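
The point that color markup consists of ordinary characters can be shown with a short Python sketch. It handles only the SGR (Select Graphic Rendition) form of ANSI escape sequence, `ESC [ parameters m`; the helper names are assumptions made for this example.

```python
import re

# SGR sequences only: ESC, '[', digits and semicolons, then 'm'.
# Other kinds of ANSI escape sequence are deliberately not covered.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def colorize(text, code=31):
    """Wrap the text in an SGR sequence (31 selects a red foreground)."""
    return f"\x1b[{code}m{text}\x1b[0m"

def strip_sgr(text):
    """Remove SGR sequences, leaving only the visible characters."""
    return ANSI_SGR.sub("", text)

styled = colorize("warning")
print(repr(styled))                       # the escape characters shown as text
print(len(styled), len(strip_sgr(styled)))
```

A terminal that honors the sequences shows only the word in color, whereas `repr` shows the same sequences as plain characters that a text processing step can match and remove.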

History

The development of computer text processing started in earnest with Kleene's formalization of regular languages. Once that language was extended, regular expressions could become mini-programs, complete with a compilation process, available to perform any edit. Similarly, filters are extended by evolving particular options.

Basic concepts

An editor essentially invokes an input stream and directs it to the text processing environment, which is either a command shell or a text editor. The resulting output may then be fed to any number of further text processing steps. The totality of these steps is comparable to a single application of an algorithm by a more sophisticated and structured computer program, except that it is carried out as a sequence of simple macros: the pattern-action expressions and filtering mechanisms of text processing. In either case the "programmer's" intention is impressed indirectly upon a given set of textual characters. The result of a text processing step is sometimes only hoped for, and the attempted mechanism often goes through multiple drafts, guided by visual feedback, until the details of the regular expression or markup language, or the options of the utility, are fully mastered.
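
The idea of feeding the output of one step into the next can be sketched in Python as a chain of small filters over a stream of lines. This is only an analogy to a shell pipeline; the step names (`grep`, `substitute`, `upper`) and the `pipeline` helper are hypothetical.

```python
import re
from functools import reduce

def grep(pattern):
    """Return a filter step that keeps lines matching the pattern."""
    regex = re.compile(pattern)
    return lambda lines: (line for line in lines if regex.search(line))

def substitute(pattern, replacement):
    """Return a step that applies a search and replace to every line."""
    regex = re.compile(pattern)
    return lambda lines: (regex.sub(replacement, line) for line in lines)

def upper():
    """Return a step that converts every line to upper case."""
    return lambda lines: (line.upper() for line in lines)

def pipeline(*steps):
    """Compose steps so that each one consumes the previous one's output."""
    return lambda lines: reduce(lambda stream, step: step(stream), steps, lines)

# Keep lines containing a digit, collapse runs of whitespace, upper-case.
process = pipeline(grep(r"\d"), substitute(r"\s+", " "), upper())
result = list(process(["alpha   1", "beta", "gamma  22  x"]))
print(result)
```

Each step is a simple pattern-action rule or filter, and the lines pass through the steps one at a time, in sequence. Changing one pattern and rerunning the chain corresponds to the drafting through visual feedback described above.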

The subject matter of the book Automatic Text Processing by Gerard Salton