DiscoverText allows users to import data from a variety of sources (including free and premium Gnip Twitter feeds, plain text, Word, Excel, public YouTube comments, blogs/wikis, PDF, etc.), to view, search, filter, deduplicate, code and machine classify the data. This is a collaborative, web-based platform widely used by academics.
ANNIS is an open source, cross platform (Linux, Mac, Windows), web browser-based search and visualization architecture for complex multi-layer linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632 - “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”. It has since then been extended to a large number of projects annotating a variety of phenomena.
EPPT allows users to encode image-based scholarly editions without having to know XML syntax. It automates or semi-automates repeating attributes, and provides templates to reduce errors and accelerate the encoding process.
TXM is a free and open-source cross-platform Unicode, XML & TEI based text analysis software, supporting Windows, Mac OS X and Linux. It is also available as a J2EE standard compliant portal software (GWT based) for online access with access control built in (see a demo portal: http://portal.textometrie.org/demo).
IBM AeroText is an information extraction system for developing knowledge-based content analysis applications.
A global geographical database that may be used to identify and tag all references to location. The database contains over 8 million entries, each of which possesses a geographic name (in various languages), latitude, longitude, elevation, population, administrative subdivision and postal codes and information on unique features.
GeoParser is a text analysis tool that may be used to identify and tag references to geographic location in a text resource using Natural Language Processing to analyse the composition of a resource and identifying words that match its geographic database. The approach is useful for processing names that may have one of several locations (e.g. Belfast in Ireland, New Zealand and Canada) and distinguishing names that may be confused with other common words (e.g. Reading in Berkshire and reading as an activity).
The goal of the Alpheios project is to help people learn how to learn languages as efficiently and enjoyably as possible, and in a way that best helps them understand their own literary heritage and culture, and the literary heritage and culture of other peoples throughout history. One of the principal tools, a Firefox plugin, allows a reader to browse a web page with Latin, ancient Greek, or Arabic, click on a word, and get a definition and morphological analysis of the word.
Part-of-Speech (POS) tagging software for English - the classification of words into one or more categories based upon its definition, relationship with other words, or other context, also known as wordclass tagging. CLAWS (Constituent Likelihood Automatic Word-tagging System) uses several methods to identify parts of speech., most notably a system called Hidden Markov models (HMMs) which involve counting examples of co-occurrence of words and wordclasses in training data and making a table of the probabilities of certain sequences of words.
TokenX is a web-based environment for visualizing, analyzing and playing with texts. Options include word clouds, highlighting words, keywords in context, replacing words with bocks, highlighting punctuation and non-words, counting words in context and decontextualized, and substituting words. A number of sample files are provided, or users can point TokenX to any XML file online.
TAToo is an embeddable Flash widget that displays TAPOR analytics for the page on which it resides.
The TAPoR Portal is an online environment where users can keep track of texts they want to study (uploaded or available online), learn about and try different tools, and run tools on texts.
Philomine is an extension to the Philologic text retrieval engine that supports a variety of machine learning, text mining, and document clustering tasks.
PhiloLine is an add-on for the Philologic text retrieval engine that provides a sequence alignment algorithm for humanities text analysis designed to identify "similar passages" in large collections of texts.
Combined with the Leptonica Image Processing Library Tesseract can read a wide variety of image formats and convert them to text in over 40 languages.
This code is a raw OCR engine. It has no output formatting and no UI. It can detect fixed pitch vs proportional text. Nevertheless in 1995 this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Training code is included in the open source release.
The core developer on the project is Ray Smith (theraysmith).
A free iOS app for text analysis. Textal allows you to analyze documents, tweet streams, and webpages. Create clickable text clouds based on the source data that you choose. It comes pre-loaded with a large number of public domain texts. Text clouds are easily shareable via various Twitter and email.
Superfastmatch is designed to find exact duplicates of text strings between documents.
Voyeur is a web-based text analysis environment where users can apply a wide variety of tools to any text they import.
The MONK workbench provides 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare, along with tools to enable literary research through the discovery, exploration, and visualization of patterns.
Users affiliated with CIC (Big Ten) schools can access a larger data set that includes about a thousand works of British literature from the 16th through the 19th century, provided by The Text Creation Partnership (EEBO and ECCO) and ProQuest (Chadwyck-Healey Nineteenth-Century Fiction).
Philologic is a full-text search, retrieval and analysis tool with support for TEI-Lite XML/SGML, Unicode encoding, plaintext, Dublin Core/HTML, and DocBook.
A statistical natural language parser for analyzing text to determine its grammatical structure.
A text-mining system for scientific literature. Textpresso's two major elements are (1) access to full text, so that entire articles can be searched, and (2) introduction of categories of biological concepts and classes that relate to objects (e.g., association, regulation, etc.) or describe one (e.g., methods, etc).
"Linguistic Inquiry and Word Count (LIWC) is a text analysis software program...LIWC is able to calculate the degree to which people use different categories of words across a wide array of texts." Free and limited web analysis available.
Whatizit can ingest up to 500,000 terms pasted into the input box and execute any of the pre-defined text analysis pipelines.
WordSmith allows users to develop concordances, find keywords, and develop word lists from plain text files.
Scrapy is an open source programming library for web crawling and web page text extraction, written in Python. You can make calls to Scrapy code from within your own scripts and applications to automate the task of extracting data from websites.
You would typically use Scrapy to automate the task of visiting one or more web pages, on a website to which you have access. You could alternately use it to invoke web-based Application Programming Interfaces (APIs).
Diction analyzes texts for language indicating certainty, activity, optimism, realism, and commonality.
Lexos is an online tool that enables you to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations.
CAT is an environment for group coding and analyzing data sets, including computing inter-rater reliability. Users can create a free account, or download the ASP.NET tool suite to run independently.
AntWordProfiler is free software for analyzing word frequency.
Juxta is an open-source cross-platform desktop tool for comparing and collating multiple witnesses to a single textual work. The software allows you to set any of the witnesses as the base text, to add or remove witness texts, to switch the base text at will, and to annotate Juxta-revealed comparisons and save the results. New in version 1.6.5 is the ability to upload your comparison sets to a free online workspace called Juxta Commons where you can analyze your data privately or choose to share visualizations of your work with anyone on the web.
This software scans one or many Word DOCX, text and text-like files (e.g. HTML and XML files) and counts the number of occurrences of the different words or phrases. There is no limit on the size of an input text file. The words/phrases which are found can be displayed alphabetically or by frequency. The program can be told to allow or disallow words with numerals, hyphens, apostrophes, underscores or colons, to ignore words which are short or which occur infrequently, and to ignore words (e.g., common words such as 'the', a.k.a. stop words) contained in a specified file.
This software scans a Word DOCX file or a text file (including HTML and XML files) with text encoded via ANSI or UTF-8 and counts the frequencies of different words. The words which are found and displayed can be ordered alphabetically or by frequency.
After creating a free account, users can submit requests for mining and analyzing JSTOR content. By submitting a query, a user will receive a random sample of 1,000 of JSTOR's 4.6 million documents; more documents can be received by contacting JSTOR directly. Users can choose to receive the following results:
- Citations Only (all requests come with citations by default)
- Word Counts
- Key Terms
MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. Currently MorphAdorner provides methods for adorning text with standard spellings, parts of speech and lemmata. MorphAdorner also provides facilities for tokenizing text, recognizing sentence boundaries, and extracting names and places.
Bitext provides multilingual semantic technologies in the field of Text Analyics via API with services like Entity Extraction, Concept Extraction, Sentiment Analysis, and Text Categorisation.
JGAAP is software designed for textual analysis, text categorization, and authorship attribution
This package allows users to train topic models in MALLET and load results directly into R.
TAMS Analyzer is a program that works with TAMS to let you assign ethnographic codes to passages of a text just by selecting the relevant text and double clicking the name of the code on a list. It then allows you to extract, analyze, and save coded information.
"TextSTAT is a simple programme for the analysis of texts. It reads plain text files (in different encodings) and HTML files (directly from the internet) and it produces word frequency lists and concordances from these files. This version includes a web-spider which reads as many pages as you want from a particular website and puts them in a TextSTAT-corpus. The new news-reader, too, puts news messages in a TextSTAT-readable corpus file.
TextSTAT reads MS Word and OpenOffice files. No conversion needed, just add the files to your corpus...
VARD 2 is an interactive piece of software produced in Java designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended to be a pre-processor to other corpus linguistic methods such as keyword analysis, collocations and annotation (e.g. POS and semantic tagging), the aim being to improve the accuracy of these tools
A software tool for performing concordance – the analysis of a set of words within its immediate context - on a body of text. The tool performs full concordance, reading and analysing each and every word in a text. It was initially written for the analysis of English texts, but has since been extended to cater for other Western languages. Limited support is also provided for text in East Asian scripts, such as Chinese and Korean.
AntConc is free concordance software. It is multi-platform and easy to deploy and use.
AntConc is part of a suite of related tools for text processing and analysis, including applications for parallel corpus analysis, word profiling, PDF to text conversion, text structure analysis, detecting and converting character encodings, Japanese and Chinese segmenter and tokenizer, wordclass tagger, and spelling variant anaysis. The developer is currently drafting a more explicit licence for the use of the software.
CATMA (Computer Aided Textual Markup & Analysis) is a free, open source markup and analysis tool from the University of Hamburg's Department of Languages, Literature and Media. It incorporates three interactive modules: (1) The tagger enables flexible and individual textual markup and markup editing. (2) The analyzer incorporates a query language and predefined functions. It also includes a query builder that allows users to construct queries from combinations of pre-defined questions while allowing for manual modification for more specific questions.
Weka provides machine learning algorithms in Java for data mining and predictive modeling tasks. These algorithms can either be incorporated into other Java code or called from the Weka Workbench, a GUI environment.
PAIR is a sequence alignment algorithm for humanities text analysis designed to identify "similar passages" in large collections of texts. In addition to a Philologic add-on, PAIR is available as Text::Pair, a generalized Perl module that supports one-against-many comparisons. A corpus is indexed and incoming texts are compared against the entire corpus for text reuse.
MONK is a digital environment designed to help humanities scholars discover and analyze patterns in the texts they study.
HyperPo is a user-friendly text exploration and analysis program that allows users to import texts or use texts available online (in English or French), and provides frequency lists of characters, words and series of words, color-coding to indicate repetition, KWIC, co-occurrence and distribution lists, and the ability to simultaneously compare data from multiple texts.
MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
text analytic and data extraction framework: data and semantic analytics in a suite of business applications.
IBM InfoSphere is intended for enterprise-scale data warehouses, delivering access to structured and unstructured information and operational and transactional data.
Netlytic is a web-based system for automated text analysis and discovery of social networks from electronic communication such as emails, forums, blogs and chats.
What can the current version of Netlytic do?
Import and clean your data set from an RSS feed, an external database or a text file
Find and explore emerging themes of discussions
Build and visualize Chain Networks (social networks based on the number of messages exchanged between individuals) and Name Networks (social networks built from mining personal names).
Basis provides natural language processing technology for the analysis of unstructured multilingual text.
Wmatrix is web-based software for corpus analysis and comparison. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic domains.
MMax2 is a text annotation tool for creating and visualizing annotations. It has advanced and customizable methods for information and relation visualization.
- Determination of the word class / part of speech (POS) for words in a text
- Determination of word senses, including the disambiguation of homonymous and polysemous words
- Detection of anaphoric expressions and identification of their antecedents
The main programs that comprise the Information processor are called the analyst server and query or knowledge processor. The analyst program can be called from a command line, from an html form, or through a TCP/IP socket protocol. The query processor can be accessed with any browser using HTML commands. It analyzes text and allows the user to search it.
CHET-C, or Chapel Hill Electronic Text-Converter, is a browser based software tool designed to convert digital texts that employ standard epigraphic conventions such as the Leiden sigla into EpiDoc-compliant XML files.
The tool can be accessed online at http://www.stoa.org/projects/epidoc/stable/chetc-js/chetc.html. Fragments of epigraphic text using standard sigla (eg Leiden convention markup) are pasted into the tool and Epidoc compliant XML is generated.
XSugar is a proof of concept tool for mapping textual content between a flat file schema and XML format. It performs statistical analysis to establish if transformations between the two formats are bi-directional, enabling content that has been converted into an XML format to be re-exported to the original flat file structure, or vice-versa. To validate the conversion, a schema must exist for source and destination formats, e.g. a bespoke XFlat encoded XML document that contains a definition of the structure of a class of flat files, an XML schema.
CollateX-based text collation client. CollateX, run on an server independent from the URL above, is a powerful, fully automatic, baseless text collation engine for multiple witnesses. A second collation technique, ncritic, provides a slightly different baseless text collation. Each engine complements each other nicely. The user can use different files, even URLs, then output the result in GraphML, TEI, JSON, HTML, or SVG. Fuzzy matching is an option.
Pdf-extract is an open source set of tools and libraries for identifying and extracting semantically significant regions of a scholarly journal article (or conference proceeding) PDF.
An online text analysis tool that provides detailed statistics of your text, including features like the anlysis of words groups, finding out keyword density, analysing the prominence of word or expressions.
Voyant Tools is a web-based reading and analysis environment for digital texts.
This online tool can be used for a wide variety of annotation tasks, including visualization and collaboration.
brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and "interpreted" by a computer. brat also supports the annotation of n-ary associations that can link together any number of other annotations participating in specific roles. brat also implements a number of features relying on natural language processing techniques to support human annotation efforts.
QDA Miner is an easy-to-use mixed-methods qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner may be used to analyze interview or focus-group transcripts, legal documents, journal articles, even entire books, as well as drawing, photographs, paintings, and other types of visual documents.
WordStat is a text analysis module for QDA Miner or SimStat. WordStat combines content analysis method by using dictionary approach and many algorithms exploration or various text mining methods. WordStat can apply existing categorization dictionaries to a new text corpus. It also may be used in the development and validation of new categorization dictionaries.
The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. When applied to literature as we do here, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The current suite of lexomics tools are:
- scrubber -- strips tags, removes stop words, applies lemma lists, and prepares texts for diviText
- diviText -- cuts texts into chunks in one of three ways, count words, exports the results
The purpose of ATLAS.ti is to help researchers uncover and systematically analyze complex phenomena hidden in text and multimedia data. The program provides tools that let the user locate, code, and annotate findings in primary data material, to weigh and evaluate their importance, and to visualize complex relations between them.
WordCruncher is a text retrieval and analysis program that allows users to index or use a text, including very large multilingual Unicode documents. It supports the addition of tags (such as part of speech, definitions, lemma, etc), graphics, and hyperlinks to text or multimedia files. In addition to supporting contextual and tag searching, WordCruncher also includes many analytical reports, including collocation, vocabulary dispersion, frequency distribution, vocabulary usage, and various other reports.
Nomenklatura is a reference data recon server. It is a service that allows users to define and manage manage lists of canonical entities (e.g. person or organization names) and aliases that connect to one of the canonical entities. This helps to clean up messy data in which a single entity may be referred to by many names.It includes a user interface, an API, and a reconciliation endpoint for OpenRefine for matching data from data sets with the canonical entries.