Text mining

Text mining, sometimes called text data mining, is the process of deriving high-quality information from text. High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (typically by parsing it, adding some derived linguistic features, removing others, and inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
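The three stages described above (structuring the input, deriving patterns, and evaluating the output) can be illustrated with a minimal sketch. It assumes TF-IDF weighting and cosine similarity as the pattern-finding step, which is only one of many possible choices; the stop-word list and all function names are hypothetical and chosen for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}


def tokenize(text):
    """Structure raw text: lowercase, split into word tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tf_idf_vectors(documents):
    """Represent each document as a sparse {term: TF-IDF weight} dictionary."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar_pair(documents):
    """Derive a simple pattern: the pair of documents with the highest similarity."""
    vectors = tf_idf_vectors(documents)
    best_score, best_pair = -1.0, None
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            score = cosine_similarity(vectors[i], vectors[j])
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair, best_score
```

Given a small collection such as two sentences about protein binding and one about stock markets, `most_similar_pair` would be expected to pair the two biology sentences, because they share weighted terms and the third does not. Interpreting whether such a pairing is relevant or novel is the evaluation step, which remains a human judgement in this sketch.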

History

Labour-intensive manual text mining approaches first surfaced in the mid-1980s, but technological advances have enabled the field to progress swiftly during the past decade. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. Because most information (over 80% by a commonly quoted estimate) is currently stored as text, text mining is believed to have high commercial potential.

Applications

Text mining has recently attracted attention in many areas, most notably security, commerce, and academia.

Security applications

Probably one of the largest text mining applications in existence is the classified ECHELON surveillance system.

Commercial applications

Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes.

Academic applications

Text mining is important to publishers who hold large databases of information that require indexing for retrieval. This is particularly true in scientific disciplines, in which highly specific information is often contained within written text. Initiatives have therefore been launched, such as Nature's proposal for an Open Text Mining Interface (OTMI) and the NIH's common Journal Publishing Document Type Definition (DTD), that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in text mining. The National Centre for Text Mining (NaCTeM), a collaborative effort between the Universities of Manchester, Liverpool and Salford funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils, aims to provide tools, carry out research and offer advice to the academic community, with an initial focus on text mining in the biological and biomedical sciences. In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist bioscience researchers in text mining and analysis.

Implications

Until recently, websites mostly used text-based lexical searches. Text mining is expected to enable searches that can be answered directly by the semantic web. Text mining is also one of the techniques used to fight email spam.
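As an illustration of the spam-filtering use, the following is a minimal sketch of a naive Bayes text classifier, one common statistical approach to the problem. The article does not say which technique any particular filter uses, so the choice of naive Bayes, the smoothing scheme, and the class and method names are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the message and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSpamFilter:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch)."""

    LABELS = ("ham", "spam")  # ties are resolved in favour of "ham"

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.message_counts = {label: 0 for label in self.LABELS}

    def train(self, text, label):
        """Record the words of one labelled message."""
        self.word_counts[label].update(tokenize(text))
        self.message_counts[label] += 1

    def log_score(self, text, label):
        """Log of (smoothed prior * smoothed word likelihoods) for a label."""
        vocabulary = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        total_words = sum(self.word_counts[label].values())
        total_messages = sum(self.message_counts.values())
        # Smoothed prior, so an untrained class never produces log(0).
        score = math.log(
            (self.message_counts[label] + 1) / (total_messages + len(self.LABELS))
        )
        # One extra slot in the denominator is reserved for unseen words.
        denominator = total_words + len(vocabulary) + 1
        for token in tokenize(text):
            score += math.log((self.word_counts[label][token] + 1) / denominator)
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))
```

After training on a handful of labelled messages with `train(text, "spam")` or `train(text, "ham")`, calling `classify` on a new message returns whichever label scores higher. Working in log space avoids numerical underflow when many small probabilities are multiplied together.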

External links

Software:

PreTexT - an environment for pre-processing text for text mining. Several facilities are implemented in PreTexT, such as Luhn cut-offs, stemming, TFIDF, TF, Bool, TFLinear, quadratic and linear normalizations, n-grams, inductive construction, and others.
UIMA standard - integration framework for text technologies, including text mining.
The "Ultimate Research Assistant" - an online text mining tool.
GATE (General Architecture for Text Engineering) - freely available open-source Java library for text engineering and a leading toolkit for text mining, information extraction, and other natural language processing (NLP) tasks used worldwide by thousands of scientists, companies, teachers and students.
YALE (Yet Another Learning Environment) - freely available integrated open-source software environment for knowledge discovery, data mining including text mining, machine learning, visualization (e.g. of text clusterings), etc. featuring a plugin WordVectorTool for text mining tasks like classification, clustering, feature set construction and transformation, etc.
Bow - freely available open-source toolkit for statistical language modeling, text retrieval, classification, and clustering.
Topicalizer - an online text analysis tool for generating text analysis statistics of web pages and other texts.
Textalyser - an online text analysis tool for generating text analysis statistics of web pages and other texts.
Pimiento - a Java-based application framework for text mining.

Resources and information:

Text-Mining.org - information for and about the text mining community
Textengines Quick guide: Text analysis explained
Text mining: Science Digs Deeper
Data Mining Tutorials, Resources

Events:

Kmining - list of text mining, data mining, and knowledge discovery scientific conferences
Annual conference on text analytics/mining

News:

Text mining industry news and analysis
unstruct.org - news about the industry
Data Mining - on industrial applications