Text mining

Text mining, also referred to as text data mining and roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived by discovering patterns and trends through means such as statistical pattern learning. Text mining usually involves structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
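The three stages described above (structuring the text, deriving patterns, and inspecting the output) can be illustrated with a minimal Python sketch. The sample documents, the stop-word list and the function names are all assumptions made for illustration; a real system would use a proper parser and a database rather than in-memory counters.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; the documents are made up for illustration.
DOCUMENTS = [
    "Text mining derives patterns from text.",
    "Data mining derives patterns from structured data.",
    "Sentiment analysis is a text mining task.",
]

# Illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "is", "from", "the"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def structure(documents):
    """Stage 1: turn each raw document into a bag of words (term -> count)."""
    return [
        Counter(tok for tok in tokenize(doc) if tok not in STOP_WORDS)
        for doc in documents
    ]


def document_frequency(bags):
    """Stage 2: count in how many documents each term appears."""
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    return df


bags = structure(DOCUMENTS)
df = document_frequency(bags)
# Stage 3: inspect the output, here the terms shared by more than one document.
shared_terms = sorted(term for term, n in df.items() if n > 1)
print(shared_terms)
```

Document frequency is only one of many possible patterns; it is used here because it needs nothing beyond the standard library.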

History

Labor-intensive manual text mining approaches first surfaced in the mid-1980s, but technological advances over the past decade have enabled the field to progress. Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (common estimates say over 80%)^[1] is currently stored as text, text mining is believed to have high potential commercial value. Increasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.

Applications

Recently, text mining has received attention in many areas.

Security applications

Many text mining software packages are marketed for security applications, especially the analysis of plain text sources such as Internet news. Text mining is also involved in the study of text encryption.

Biomedical applications

A range of text mining applications in the biomedical literature has been described.^[2]

One of the more prominent online text mining applications for the biomedical literature is GoPubMed.^[3] GoPubMed was the first semantic search engine on the Web. Another example is PubGene, which combines biomedical text mining with network visualization as an Internet service.^[4]

Software and applications

Text mining methods and software are also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring terrorist activities.^[5]

Online media applications

Text mining is being used by large media companies, such as the Tribune Company, to disambiguate information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit from being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

Marketing applications

Text mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008)^[6] apply it to improve predictive analytics models for customer churn (customer attrition).^[7]

Sentiment analysis

Sentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie.^[8] Such an analysis may need a labeled data set or labeling of the affectivity of words. Resources for affectivity of words and concepts have been made for WordNet^[9] and ConceptNet,^[10] respectively.

Text has been used to detect emotions in the related area of affective computing.^[11] Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.
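As a rough illustration of how word-level affectivity labels can be used to estimate how favorable a review is, the following Python sketch averages the polarity of the lexicon words found in a review. The lexicon, its scores and the function name are hypothetical; they are not taken from the WordNet or ConceptNet resources mentioned above, and the approach ignores negation and context.

```python
import re

# Hypothetical affect lexicon: word -> polarity score in [-1, 1].
AFFECT_LEXICON = {
    "great": 1.0,
    "good": 0.5,
    "enjoyable": 0.5,
    "boring": -0.5,
    "bad": -0.5,
    "terrible": -1.0,
}


def favorability(review):
    """Return the mean polarity of the lexicon words in the review.

    Returns 0.0 when the review contains no word from the lexicon.
    """
    tokens = re.findall(r"[a-z']+", review.lower())
    scores = [AFFECT_LEXICON[t] for t in tokens if t in AFFECT_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(favorability("A great film with a good cast."))
print(favorability("Terrible and boring."))
```

A learned classifier trained on a labeled data set is the usual alternative when such a lexicon is unavailable or too coarse.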

Academic applications

The issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:

The National Centre for Text Mining (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester^[12] in close collaboration with the Tsujii Lab,^[13] University of Tokyo.^[14] NaCTeM provides customised tools and research facilities and offers advice to the academic community. It is funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.

In the United States, the School of Information at the University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.

Notable software and applications

Text mining computer programs are available from many commercial and open source companies and sources.

Commercial

AeroText – a suite of text mining applications for content analysis. Content used can be in multiple languages.
Attensity – hosted, integrated and stand-alone text mining (analytics) software that uses natural language processing technology
Autonomy – text mining, clustering and categorization software
Basis Technology – provides a suite of text analysis modules to identify language, enable search in more than 20 languages, extract entities, and efficiently search for and translate entities.
Clarabridge – text analytics (text mining) software, including natural language (NLP), machine learning, clustering and categorization
Endeca Technologies – provides software to analyze and cluster unstructured text.
Expert System S.p.A. – suite of semantic technologies and products for developers and knowledge managers.
Fair Isaac – leading provider of decision management solutions powered by advanced analytics (includes text analytics).
iDETECT – software platform providing text analytics, including natural language processing and categorization, and unstructured data visualisation features for investigative purposes.
Inxight – provider of text analytics, search, and unstructured visualization technologies. (Inxight was bought by Business Objects, which was bought by SAP AG in 2008.)
LanguageWare – text analysis libraries and customization software from IBM.
Language Computer Corporation – text extraction and analysis tools, available in multiple languages.
LexisNexis – provider of business intelligence solutions based on an extensive news and company information content set. LexisNexis acquired DataOps to pursue search
Mathematica – provides built-in tools for text alignment, pattern matching, clustering and semantic analysis.
SAS – SAS Text Miner and Teragram; commercial text analytics, natural language processing, and taxonomy software used for Information Management. SAS Text Miner rated as the third most used text mining software (9%) by Rexer's Annual Data Miner Survey in 2010.^[15]
IBM SPSS – provider of IBM SPSS Modeler and IBM SPSS Text Analytics (now called IBM SPSS Modeler Premium).^[16] Rated as the second (17%) and fourth (7%), respectively, most used text mining software by Rexer's Annual Data Miner Survey in 2010.^[15]
StatSoft – provides STATISTICA Text Miner as an optional extension to STATISTICA Data Miner, for Predictive Analytics Solutions. Rated as the top used text mining software (19%) by Rexer's Annual Data Miner Survey in 2010.^[15]
Thomson Data Analyzer – enables complex analysis on patent information, scientific publications and news.
WordStat – Content analysis and text mining software for analyzing large amounts of unstructured information.

Open source

Carrot2 – text and search results clustering framework.
GATE – natural language processing and language engineering tool.
OpenNLP – natural language processing
Natural Language Toolkit (NLTK) – a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python programming language.
RapidMiner with its Text Processing Extension – data and text mining software. Rated as the fifth most used text mining software (6%) by Rexer's Annual Data Miner Survey in 2010.^[15]
Unstructured Information Management Architecture (UIMA) – a component framework to analyze unstructured content such as text, audio and video, originally developed by IBM.
KNIME – open source data mining tool with an experimental text processing extension.^[17]
KH Coder – free software for quantitative content analysis or text mining.^[18]

Implications

Until recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word).

Additionally, text mining software can be used to build large dossiers of information about specific people and events. For example, large datasets based on data extracted from news reports can be built to facilitate social network analysis or counter-intelligence. In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis.
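One simple way to turn news text into input for social network analysis is to count how often two named entities are mentioned in the same report. The Python sketch below assumes the entities are already known and matches them by plain substring search; the names, the reports and the helper functions are invented for illustration, and a real system would rely on an entity extraction step instead.

```python
from collections import Counter
from itertools import combinations

# Hypothetical entity list and news snippets, invented for illustration.
KNOWN_ENTITIES = ["Alice Smith", "Bob Jones", "Acme Corp"]
REPORTS = [
    "Alice Smith met Bob Jones at the Acme Corp headquarters.",
    "Bob Jones resigned from Acme Corp on Monday.",
    "Alice Smith gave a lecture in Manchester.",
]


def entities_in(text):
    """Return the known entities mentioned in the text, in sorted order."""
    return sorted(e for e in KNOWN_ENTITIES if e in text)


def cooccurrence_edges(reports):
    """Count, for each pair of entities, the reports that mention both."""
    edges = Counter()
    for report in reports:
        # Sorting in entities_in keeps each pair in a canonical order.
        for pair in combinations(entities_in(report), 2):
            edges[pair] += 1
    return edges


edges = cooccurrence_edges(REPORTS)
for (first, second), weight in sorted(edges.items()):
    print(first, "<->", second, weight)
```

The resulting weighted pairs can be read as the edges of a graph whose nodes are the people and organisations mentioned in the reports.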

Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material.
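A common textbook technique for this kind of filtering is a naive Bayes classifier over word counts, sketched below in Python. The source does not say which method any particular filter uses, so this is only one plausible illustration; the training messages, labels and function names are assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical labelled training messages, invented for illustration.
TRAINING = [
    ("win a free prize now", "spam"),
    ("cheap offer buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    """Count words per class and messages per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, class_counts


def classify(text, word_counts, class_counts):
    """Return the class with the higher log-posterior under naive Bayes."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    total_messages = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("spam", "ham"):
        score = math.log(class_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, class_counts = train(TRAINING)
print(classify("free prize offer", word_counts, class_counts))
print(classify("project meeting on monday", word_counts, class_counts))
```

Working with log-probabilities rather than raw products keeps the scores numerically stable for long messages.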

Notes

[1] Unstructured Data and the 80 Percent Rule
[2] K. Bretonnel Cohen & Lawrence Hunter (2008). "Getting Started in Text Mining". PLoS Computational Biology. 4 (1): e20. doi:10.1371/journal.pcbi.0040020. PMC 2217579. PMID 18225946.
[3] GoPubMed: exploring PubMed with the Gene Ontology, A. Doms and M. Schroeder, 2005, http://nar.oxfordjournals.org/content/33/suppl_2/W783.long
[4] Tor-Kristian Jenssen, Astrid Lægreid, Jan Komorowski & Eivind Hovig (2001). "A literature network of human genes for high-throughput analysis of gene expression". Nature Genetics. 28 (1): 21–28. doi:10.1038/ng0501-21. PMID 11326270. Summary: Daniel R. Masys (2001). "Linking microarray data to the literature". Nature Genetics. 28 (1): 9–10. doi:10.1038/ng0501-9. PMID 11326264.
[5] Texor
[6] Academic Papers about Analytical Customer Relationship Management
[7] Kristof Coussement and Dirk Van den Poel (2008). "Integrating the Voice of Customers through Call Center Emails into a Decision Support System for Churn Prediction". Information and Management.
[8] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (2002). "Thumbs up? Sentiment Classification using Machine Learning Techniques" (PDF). Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 79–86.
[9] Alessandro Valitutti, Carlo Strapparava, Oliviero Stock (2005). "Developing Affective Lexical Resources" (PDF). Psychology Journal. 2 (1): 61–83.
[10] Erik Cambria (2010). "SenticNet: a Publicly Available Semantic Resource for Opinion Mining" (PDF). Proceedings of AAAI CSK. pp. 14–18.
[11] Rafael A. Calvo, Sidney K. D'Mello (2010). "Affect Detection: An Interdisciplinary Review of Models, Methods, and their Applications". IEEE Transactions on Affective Computing. 1 (1): 18–37.
[12] The University of Manchester
[13] Tsujii Laboratory
[14] The University of Tokyo
[15] Rexer Analytics 4th Annual Data Miner Survey - 2010
[16] IBM - SPSS - Software Products
[17] http://www.tech.knime.org/knime-text-processing-0
[18] http://khc.sourceforge.net/en/

References

Ananiadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and Biomedicine. Artech House Books. ISBN 978-1-58053-984-5
Bilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons. ISBN 978-0470176436
Feldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York: Cambridge University Press. ISBN 9780521836579
Indurkhya, N., and Damerau, F. (2010). Handbook Of Natural Language Processing, 2nd Edition. Boca Raton, FL: CRC Press. ISBN 978-1420085921
Kao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining. Springer. ISBN 184628175X
Konchady, M. Text Mining Application Programming (Programming Series). Charles River Media. ISBN 1584504609
Manning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ISBN 978-0262133609
Miner, G., Elder, J., Hill, T., Nisbet, R., Delen, D. and Fast, A. (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier Academic Press. ISBN 978-0123869791
McKnight, W. (2005). "Building business intelligence: Text data mining in business intelligence". DM Review, 21-22.
Srivastava, A., and Sahami, M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press. ISBN 978-1420059403

External links

Marti Hearst: What Is Text Mining? (October, 2003)

