Text mining

EPPT allows users to encode image-based scholarly editions without having to know XML syntax. It automates or semi-automates repeating attributes, and provides templates to reduce errors and accelerate the encoding process.

Last updated: 9 Aug 2016

TXM

TXM is a free and open-source cross-platform Unicode, XML & TEI based text analysis software, supporting Windows, Mac OS X and Linux. It is also available as a J2EE standard compliant portal software (GWT based) for online access with access control built in (see a demo portal: http://portal.textometrie.org/demo).

Code license: Open source, GNU GPL v3
Last updated: 29 Jun 2016

IBM AeroText is an information extraction system for developing knowledge-based content analysis applications.

Last updated: 15 Jun 2016

A global geographical database that may be used to identify and tag all references to location. The database contains over 8 million entries, each of which possesses a geographic name (in various languages), latitude, longitude, elevation, population, administrative subdivision and postal codes and information on unique features.

Last updated: 7 Jun 2016

GeoParser is a text analysis tool that may be used to identify and tag references to geographic location in a text resource. It uses Natural Language Processing to analyse the composition of a resource and identify words that match its geographic database. The approach is useful for processing names that may refer to one of several locations (e.g. Belfast in Ireland, New Zealand and Canada) and for distinguishing names that may be confused with other common words (e.g. Reading in Berkshire and reading as an activity).

Last updated: 7 Jun 2016

The goal of the Alpheios project is to help people learn how to learn languages as efficiently and enjoyably as possible, and in a way that best helps them understand their own literary heritage and culture, and the literary heritage and culture of other peoples throughout history. One of the principal tools, a Firefox plugin, allows a reader to browse a web page with Latin, ancient Greek, or Arabic, click on a word, and get a definition and morphological analysis of the word.

Code license: Open source, GNU GPL
Last updated: 7 Jun 2016

Part-of-Speech (POS) tagging software for English. POS tagging, also known as wordclass tagging, is the classification of words into one or more categories based upon their definition, relationship with other words, or other context. CLAWS (Constituent Likelihood Automatic Word-tagging System) uses several methods to identify parts of speech, most notably Hidden Markov models (HMMs), which involve counting examples of co-occurrence of words and wordclasses in training data and making a table of the probabilities of certain sequences of words.

Code license: Closed source
Last updated: 3 May 2016
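
The counting step behind an HMM tagger, as described in the entry above, can be sketched in a few lines. This is a simplified illustration only, not the CLAWS implementation: it tabulates tag-to-tag transitions, the function name `transition_probabilities` is hypothetical, and the toy tags are made up rather than taken from the CLAWS tagset.

```python
from collections import Counter, defaultdict

def transition_probabilities(tagged_sentences):
    """Estimate P(next_tag | tag) from sentences of (word, tag) pairs."""
    pair_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        # Count each adjacent pair of tags in the training sentence.
        for prev_tag, next_tag in zip(tags, tags[1:]):
            pair_counts[prev_tag][next_tag] += 1
    table = {}
    for prev_tag, followers in pair_counts.items():
        total = sum(followers.values())
        table[prev_tag] = {tag: n / total for tag, n in followers.items()}
    return table

# Toy training data with invented tags (an assumption for illustration).
training = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("old", "ADJ"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
table = transition_probabilities(training)
print(table["DET"])
```

A full tagger would combine such a table with word-given-tag probabilities and a search over tag sequences; only the table-building step is shown here.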

TokenX is a web-based environment for visualizing, analyzing and playing with texts. Options include word clouds, highlighting words, keywords in context, replacing words with blocks, highlighting punctuation and non-words, counting words in context and decontextualized, and substituting words. A number of sample files are provided, or users can point TokenX to any XML file online.

Last updated: 19 Apr 2016

TAToo is an embeddable Flash widget that displays TAPoR analytics for the page on which it resides.

Code license: Apache License
Last updated: 23 Feb 2016

The TAPoR Portal is an online environment where users can keep track of texts they want to study (uploaded or available online), learn about and try different tools, and run tools on texts.

Last updated: 23 Feb 2016

Philomine is an extension to the Philologic text retrieval engine that supports a variety of machine learning, text mining, and document clustering tasks.

Code license: Open source, GNU GPL
Last updated: 22 Feb 2016

PhiloLine is an add-on for the Philologic text retrieval engine that provides a sequence alignment algorithm for humanities text analysis designed to identify "similar passages" in large collections of texts.

Code license: Open source, GNU GPL
Last updated: 22 Feb 2016

Combined with the Leptonica Image Processing Library, Tesseract can read a wide variety of image formats and convert them to text in over 40 languages.

This code is a raw OCR engine. It has no output formatting and no UI. It can detect fixed pitch vs proportional text. Nevertheless in 1995 this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Training code is included in the open source release.

The core developer on the project is Ray Smith (theraysmith).

Code license: Open source, Apache License
Last updated: 27 Jan 2016

A free iOS app for text analysis. Textal allows you to analyze documents, tweet streams, and webpages, and to create clickable text clouds based on the source data that you choose. It comes pre-loaded with a large number of public domain texts. Text clouds are easily shareable via Twitter and email.

Last updated: 18 Dec 2015

Superfastmatch is designed to find exact duplicates of text strings between documents.

Code license: Open source, GNU GPL
Last updated: 1 Dec 2015

Voyeur is a web-based text analysis environment where users can apply a wide variety of tools to any text they import.

Last updated: 3 Nov 2015

The MONK workbench provides 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare, along with tools to enable literary research through the discovery, exploration, and visualization of patterns.

Users affiliated with CIC (Big Ten) schools can access a larger data set that includes about a thousand works of British literature from the 16th through the 19th century, provided by The Text Creation Partnership (EEBO and ECCO) and ProQuest (Chadwyck-Healey Nineteenth-Century Fiction).

Last updated: 12 Aug 2015

Philologic is a full-text search, retrieval and analysis tool with support for TEI-Lite XML/SGML, Unicode encoding, plaintext, Dublin Core/HTML, and DocBook.

Code license: GNU GPL, Open source
Last updated: 9 Aug 2015

A statistical natural language parser for analyzing text to determine its grammatical structure.

Code license: GNU GPL, Open source
Last updated: 18 Jun 2015

A text-mining system for scientific literature. Textpresso's two major elements are (1) access to full text, so that entire articles can be searched, and (2) introduction of categories of biological concepts and classes that relate to objects (e.g., association, regulation, etc.) or describe one (e.g., methods, etc).

Code license: Open source
Last updated: 28 May 2015

DiscoverText allows users to import data from a variety of sources (including Facebook & Twitter feeds, plain text, Word, Excel, public YouTube comments, blogs/wikis, PDF, etc.), code them, and generate tag clouds and reports.

Last updated: 24 May 2015

"Linguistic Inquiry and Word Count (LIWC) is a text analysis software program...LIWC is able to calculate the degree to which people use different categories of words across a wide array of texts." Free and limited web analysis available.

Last updated: 23 May 2015

Whatizit can ingest up to 500,000 terms pasted into the input box and execute any of the pre-defined text analysis pipelines.

Last updated: 23 May 2015

WordSmith allows users to develop concordances, find keywords, and develop word lists from plain text files.

Last updated: 22 May 2015

Scrapy is an open source programming library for web crawling and web page text extraction, written in Python. You can make calls to Scrapy code from within your own scripts and applications to automate the task of extracting data from websites.

You would typically use Scrapy to automate the task of visiting one or more web pages on a website to which you have access. Alternatively, you could use it to invoke web-based Application Programming Interfaces (APIs).

Code license: Open source
Last updated: 22 May 2015

Diction analyzes texts for language indicating certainty, activity, optimism, realism, and commonality.

Last updated: 19 May 2015

Lexos is an online tool that enables you to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations.

Code license: Open source
Last updated: 17 May 2015
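
A rolling average of a word's frequency, one of the analyses mentioned in the entry above, can be sketched as follows. This is an illustrative assumption about how such a measure might be computed, not Lexos's own code; the function name `rolling_average` and the sample tokens are hypothetical.

```python
def rolling_average(tokens, word, window):
    """Share of tokens equal to `word` in each sliding window of `window` tokens."""
    hits = [1 if token == word else 0 for token in tokens]
    return [
        sum(hits[start:start + window]) / window
        for start in range(len(hits) - window + 1)
    ]

tokens = "a b a a b".split()
print(rolling_average(tokens, "a", window=2))
```

Plotting the returned list against the window start position gives the kind of graph the entry describes.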

CAT is an environment for group coding and analyzing data sets, including computing inter-rater reliability. Users can create a free account, or download the ASP.NET tool suite to run independently.

Last updated: 9 May 2015

AntWordProfiler is free software for analyzing word frequency.

Last updated: 9 May 2015

Juxta is an open-source cross-platform desktop tool for comparing and collating multiple witnesses to a single textual work. The software allows you to set any of the witnesses as the base text, to add or remove witness texts, to switch the base text at will, and to annotate Juxta-revealed comparisons and save the results. New in version 1.6.5 is the ability to upload your comparison sets to a free online workspace called Juxta Commons where you can analyze your data privately or choose to share visualizations of your work with anyone on the web.

Code license: Open source, Creative Commons
Last updated: 4 May 2015
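
Comparing a witness against a base text, as the entry above describes, can be illustrated with Python's standard `difflib` module. This is a minimal word-level sketch under that assumption and does not reflect how Juxta itself works; the function name `collate` and the sample texts are hypothetical.

```python
import difflib

def collate(base, witness):
    """List word-level differences between a base text and one witness."""
    base_words, witness_words = base.split(), witness.split()
    matcher = difflib.SequenceMatcher(a=base_words, b=witness_words, autojunk=False)
    variants = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op != "equal":
            # Record the kind of change plus the base and witness readings.
            variants.append(
                (op, " ".join(base_words[a0:a1]), " ".join(witness_words[b0:b1]))
            )
    return variants

print(collate("the quick brown fox", "the quick red fox"))
```

Switching the base text amounts to calling `collate` with a different first argument.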

This software scans one or many Word DOCX, text and text-like files (e.g. HTML and XML files) and counts the number of occurrences of the different words or phrases. There is no limit on the size of an input text file. The words/phrases which are found can be displayed alphabetically or by frequency. The program can be told to allow or disallow words with numerals, hyphens, apostrophes, underscores or colons, to ignore words which are short or which occur infrequently, and to ignore words (e.g., common words such as 'the', a.k.a. stop words) contained in a specified file.

Code license: Closed source
Last updated: 1 May 2015
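
The kind of word counting described in the entry above (stop-word filtering, minimum length and frequency, alphabetical or frequency ordering) can be sketched with the Python standard library. This is an illustration only, not the software's own code; the function name `word_frequencies`, its parameters, and the tokenizing pattern are assumptions.

```python
import re
from collections import Counter

def word_frequencies(text, stop_words=frozenset(), min_length=1, min_count=1):
    """Count words, skipping stop words, short words and infrequent words."""
    # Assumed tokenization: letters, digits, apostrophes and hyphens.
    words = re.findall(r"[a-z0-9'-]+", text.lower())
    counts = Counter(
        w for w in words if w not in stop_words and len(w) >= min_length
    )
    return {w: n for w, n in counts.items() if n >= min_count}

text = "The cat saw the other cat and the dog"
freqs = word_frequencies(text, stop_words={"the", "and"})
alphabetical = sorted(freqs.items())
by_frequency = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
print(by_frequency)
```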

This software scans a Word DOCX file or a text file (including HTML and XML files) with text encoded via ANSI or UTF-8 and counts the frequencies of different words. The words which are found and displayed can be ordered alphabetically or by frequency.

Code license: Closed source
Last updated: 30 Apr 2015

After creating a free account, users can submit requests for mining and analyzing JSTOR content. By submitting a query, a user will receive a random sample of 1,000 of JSTOR's 4.6 million documents; more documents can be received by contacting JSTOR directly. Users can choose to receive the following results:

  • Citations Only (all requests come with citations by default)
  • Word Counts
  • Bigrams
  • Trigrams
  • Quadgrams
  • Key Terms
  • References
Last updated: 29 Apr 2015
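
For readers unfamiliar with the result types listed above, bigrams, trigrams and quadgrams are runs of two, three and four consecutive words. A minimal sketch of how such counts can be computed follows; it illustrates the concept only and is not JSTOR's implementation, and the function name `ngram_counts` is hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered copies of the token list yields the n-grams.
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "to be or not to be".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
quadgrams = ngram_counts(tokens, 4)
print(bigrams.most_common(1))
```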

MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. Currently MorphAdorner provides methods for adorning text with standard spellings, parts of speech and lemmata. MorphAdorner also provides facilities for tokenizing text, recognizing sentence boundaries, and extracting names and places.

Code license: NCSA, Open source
Last updated: 21 Apr 2015

Bitext provides multilingual semantic technologies in the field of Text Analytics via API, with services like Entity Extraction, Concept Extraction, Sentiment Analysis, and Text Categorisation.

Last updated: 25 Mar 2015

JGAAP is software designed for textual analysis, text categorization, and authorship attribution.

Last updated: 25 Mar 2015

This package allows users to train topic models in MALLET and load results directly into R.

Code license: Open source, MIT License
Last updated: 25 Mar 2015

TAMS Analyzer is a program that works with TAMS to let you assign ethnographic codes to passages of a text just by selecting the relevant text and double clicking the name of the code on a list. It then allows you to extract, analyze, and save coded information.

Code license: Open source, GNU GPL
Last updated: 24 Mar 2015

"TextSTAT is a simple programme for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files. This version includes a web-spider which reads as many pages as you want from a particular website and puts them in a TextSTAT-corpus. The new news-reader, too, puts news messages in a TextSTAT-readable corpus file.
TextSTAT reads MS Word and OpenOffice files. No conversion needed, just add the files to your corpus..."

Last updated: 24 Mar 2015

VARD 2 is an interactive piece of software, written in Java, designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended to be a pre-processor to other corpus linguistic methods such as keyword analysis, collocations and annotation (e.g. POS and semantic tagging), the aim being to improve the accuracy of these tools.

Last updated: 19 Feb 2015

A software tool for performing concordance (the analysis of a set of words within its immediate context) on a body of text. The tool performs full concordance, reading and analysing each and every word in a text. It was initially written for the analysis of English texts, but has since been extended to cater for other Western languages. Limited support is also provided for text in East Asian scripts, such as Chinese and Korean.

Code license: Closed source
Last updated: 11 Feb 2015
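
The idea of a concordance, showing each occurrence of a word within its immediate context, can be sketched as a simple keyword-in-context (KWIC) listing. This is an illustration of the concept only, not the tool's implementation; the function name `kwic`, the `width` parameter and the punctuation handling are assumptions.

```python
def kwic(text, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        # Compare case-insensitively, ignoring surrounding punctuation.
        if word.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(words[max(0, i - width):i])
            right = " ".join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines

for left, match, right in kwic("The cat sat on the mat. The dog sat too.", "sat", width=2):
    print(f"{left:>10} [{match}] {right}")
```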

AntConc is free concordance software. It is multi-platform and easy to deploy and use.

AntConc is part of a suite of related tools for text processing and analysis, including applications for parallel corpus analysis, word profiling, PDF to text conversion, text structure analysis, detecting and converting character encodings, Japanese and Chinese segmentation and tokenization, wordclass tagging, and spelling variant analysis. The developer is currently drafting a more explicit licence for the use of the software.

Last updated: 11 Feb 2015

CATMA (Computer Aided Textual Markup & Analysis) is a free, open source markup and analysis tool from the University of Hamburg's Department of Languages, Literature and Media. It incorporates three interactive modules: (1) The tagger enables flexible and individual textual markup and markup editing. (2) The analyzer incorporates a query language and predefined functions. It also includes a query builder that allows users to construct queries from combinations of pre-defined questions while allowing for manual modification for more specific questions.

Code license: GNU GPL v3
Last updated: 29 Dec 2014

Weka provides machine learning algorithms in Java for data mining and predictive modeling tasks. These algorithms can either be incorporated into other Java code or called from the Weka Workbench, a GUI environment.

Code license: Open source, GNU GPL
Last updated: 29 Dec 2014

PAIR is a sequence alignment algorithm for humanities text analysis designed to identify "similar passages" in large collections of texts. In addition to a Philologic add-on, PAIR is available as Text::Pair, a generalized Perl module that supports one-against-many comparisons. A corpus is indexed and incoming texts are compared against the entire corpus for text reuse.

Code license: Open source, GNU GPL
Last updated: 29 Dec 2014

MONK is a digital environment designed to help humanities scholars discover and analyze patterns in the texts they study.

Last updated: 29 Dec 2014

HyperPo is a user-friendly text exploration and analysis program that allows users to import texts or use texts available online (in English or French), and provides frequency lists of characters, words and series of words, color-coding to indicate repetition, KWIC, co-occurrence and distribution lists, and the ability to simultaneously compare data from multiple texts.

Last updated: 29 Dec 2014

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.

Code license: CPL, Open source
Last updated: 29 Dec 2014

Text analytics and data extraction framework: data and semantic analytics in a suite of business applications.

Last updated: 29 Dec 2014

Basis provides natural language processing technology for the analysis of unstructured multilingual text.

Last updated: 29 Dec 2014

IBM InfoSphere is intended for enterprise-scale data warehouses, delivering access to structured and unstructured information and operational and transactional data.

Last updated: 29 Dec 2014

Netlytic is a web-based system for automated text analysis and discovery of social networks from electronic communication such as emails, forums, blogs and chats.

What can the current version of Netlytic do?

  • Import and clean your data set from an RSS feed, an external database or a text file
  • Find and explore emerging themes of discussions
  • Build and visualize Chain Networks (social networks based on the number of messages exchanged between individuals) and Name Networks (social networks built from mining personal names).

Last updated: 29 Dec 2014
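
A chain network in the sense described above, with ties weighted by the number of messages exchanged between two individuals, can be sketched as an edge count. This is a conceptual illustration, not Netlytic's algorithm; the function name `chain_network` and the (sender, recipient) input format are assumptions.

```python
from collections import Counter

def chain_network(messages):
    """Count messages exchanged between each pair of individuals.

    `messages` is an iterable of (sender, recipient) pairs; the result maps
    an alphabetically ordered pair of names to the number of messages.
    """
    edges = Counter()
    for sender, recipient in messages:
        if sender != recipient:
            # Sort the pair so that A->B and B->A count toward the same tie.
            edges[tuple(sorted((sender, recipient)))] += 1
    return edges

messages = [("ann", "bob"), ("bob", "ann"), ("ann", "cat")]
print(chain_network(messages))
```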

Wmatrix is web-based software for corpus analysis and comparison. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.

Last updated: 29 Dec 2014

MMax2 is a text annotation tool for creating and visualizing annotations. It has advanced and customizable methods for information and relation visualization.
Features:

  • Determination of the word class / part of speech (POS) for words in a text
  • Determination of word senses, including the disambiguation of homonymous and polysemous words
  • Detection of anaphoric expressions and identification of their antecedents
Last updated: 29 Dec 2014

The main programs that make up the Information processor are called the analyst server and the query or knowledge processor. The analyst program can be called from a command line, from an HTML form, or through a TCP/IP socket protocol. The query processor can be accessed with any browser using HTML commands. It analyzes text and allows the user to search it.

Code license: Closed source
Last updated: 29 Dec 2014

CHET-C, or Chapel Hill Electronic Text-Converter, is a browser-based software tool designed to convert digital texts that employ standard epigraphic conventions such as the Leiden sigla into EpiDoc-compliant XML files.

The tool can be accessed online at http://www.stoa.org/projects/epidoc/stable/chetc-js/chetc.html. Fragments of epigraphic text using standard sigla (e.g. Leiden convention markup) are pasted into the tool and EpiDoc-compliant XML is generated.

Code license: Open source, GNU GPL
Last updated: 29 Dec 2014

XSugar is a proof-of-concept tool for mapping textual content between a flat file schema and XML format. It performs static analysis to establish whether transformations between the two formats are bi-directional, enabling content that has been converted into an XML format to be re-exported to the original flat file structure, or vice versa. To validate the conversion, a schema must exist for the source and destination formats, e.g. a bespoke XFlat-encoded XML document that contains a definition of the structure of a class of flat files, or an XML schema.

Code license: GNU GPL, Open source
Last updated: 29 Dec 2014

CollateX-based text collation client. CollateX, run on a server independent from the URL above, is a powerful, fully automatic, baseless text collation engine for multiple witnesses. A second collation technique, ncritic, provides a slightly different baseless text collation. The two engines complement each other. The user can supply different files, or even URLs, and then output the result in GraphML, TEI, JSON, HTML, or SVG. Fuzzy matching is an option.

Last updated: 29 Dec 2014

Pdf-extract is an open source set of tools and libraries for identifying and extracting semantically significant regions of a scholarly journal article (or conference proceeding) PDF.

Last updated: 29 Dec 2014

An online text analysis tool that provides detailed statistics about your text, including analysis of word groups, keyword density, and the prominence of words or expressions.

Last updated: 29 Dec 2014

Voyant Tools is a web-based reading and analysis environment for digital texts.

Code license: Open source
Last updated: 29 Dec 2014

ANNIS2 is an open source, versatile web browser-based search and visualization architecture for complex multilevel linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, has been designed to provide access to the data of the SFB 632 ("Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts").

Code license: Apache License, Open source
Last updated: 29 Dec 2014

This online tool can be used for a wide variety of annotation tasks, including visualization and collaboration.

brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and "interpreted" by a computer. brat also supports the annotation of n-ary associations that can link together any number of other annotations participating in specific roles. brat also implements a number of features relying on natural language processing techniques to support human annotation efforts.

Last updated: 29 Dec 2014

The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. When applied to literature as we do here, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The current suite of lexomics tools comprises:

  • scrubber -- strips tags, removes stop words, applies lemma lists, and prepares texts for diviText
  • diviText -- cuts texts into chunks in one of three ways, counts words, and exports the results
Last updated: 29 Dec 2014

The purpose of ATLAS.ti is to help researchers uncover and systematically analyze complex phenomena hidden in text and multimedia data. The program provides tools that let the user locate, code, and annotate findings in primary data material, to weigh and evaluate their importance, and to visualize complex relations between them.

Last updated: 29 Dec 2014

QDA Miner is an easy-to-use mixed-methods qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner may be used to analyze interview or focus-group transcripts, legal documents, journal articles, and even entire books, as well as drawings, photographs, paintings, and other types of visual documents.

Last updated: 29 Dec 2014

WordStat is a text analysis module for QDA Miner or SimStat. It combines dictionary-based content analysis with exploratory text mining methods. WordStat can apply existing categorization dictionaries to a new text corpus, and may also be used to develop and validate new categorization dictionaries.

Last updated: 29 Dec 2014

WordCruncher is a text retrieval and analysis program that allows users to index or use a text, including very large multilingual Unicode documents. It supports the addition of tags (such as part of speech, definitions, lemma, etc), graphics, and hyperlinks to text or multimedia files. In addition to supporting contextual and tag searching, WordCruncher also includes many analytical reports, including collocation, vocabulary dispersion, frequency distribution, vocabulary usage, and various other reports.

Last updated: 29 Dec 2014

Nomenklatura is a reference data recon server. It is a service that allows users to define and manage lists of canonical entities (e.g. person or organization names) and aliases that connect to one of the canonical entities. This helps to clean up messy data in which a single entity may be referred to by many names. It includes a user interface, an API, and a reconciliation endpoint for OpenRefine for matching data from data sets with the canonical entries.

Code license: Open source
Last updated: 29 Dec 2014
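
The idea of resolving aliases to canonical entities, as described above, can be sketched with a plain lookup table. This illustrates the concept only and does not use or reflect the Nomenklatura API; the function names `build_lookup` and `reconcile`, the normalization rule, and the sample names are all assumptions.

```python
def build_lookup(canonical_to_aliases):
    """Map every normalized name and alias to its canonical entity."""
    lookup = {}
    for canonical, aliases in canonical_to_aliases.items():
        for name in [canonical, *aliases]:
            # Assumed normalization: trim whitespace and ignore case.
            lookup[name.strip().lower()] = canonical
    return lookup

def reconcile(name, lookup):
    """Return the canonical entity for a name, or None if it is unknown."""
    return lookup.get(name.strip().lower())

lookup = build_lookup({"International Business Machines": ["IBM", "I.B.M."]})
print(reconcile(" ibm ", lookup))
```

Unknown names return `None`, which a cleaning workflow could queue for manual review.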

Word and Phrase utilizes the Corpus of Contemporary American English (COCA) to analyze texts for word frequencies, collocations, and concordance lines. Users copy and paste texts into a web interface.

Last updated: 29 Dec 2014