Text collections

What kind of data should the tool work with?

Part-of-Speech (POS) tagging software for English. POS tagging, also known as wordclass tagging, is the classification of words into one or more categories based upon their definition, relationship with other words, or other context. CLAWS (Constituent Likelihood Automatic Word-tagging System) uses several methods to identify parts of speech, most notably Hidden Markov models (HMMs), which involve counting examples of co-occurrence of words and wordclasses in training data and making a table of the probabilities of certain sequences of words.
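
The counting step behind an HMM tagger can be illustrated with a minimal sketch. This is not CLAWS's implementation (the software is closed source); the toy training data, the tag names, and the function name `estimate_tables` are all assumptions made for illustration. It counts how often each tag follows another and how often each word occurs with each tag, then turns the counts into probability tables.

```python
from collections import Counter, defaultdict

# Toy hand-tagged training data (hypothetical; not CLAWS data or its tagset).
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def estimate_tables(sentences):
    """Return (transition, emission) probability tables from tagged sentences."""
    transitions = defaultdict(Counter)  # previous tag -> counts of next tag
    emissions = defaultdict(Counter)    # tag -> counts of words seen with it
    for sentence in sentences:
        previous = "<s>"  # sentence-start marker (an assumption of this sketch)
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            previous = tag

    def normalise(table):
        # Convert each row of raw counts into relative frequencies.
        return {
            key: {item: count / sum(row.values()) for item, count in row.items()}
            for key, row in table.items()
        }

    return normalise(transitions), normalise(emissions)


transition_p, emission_p = estimate_tables(TRAINING)
```

A full tagger would combine these two tables with a decoding step (for example the Viterbi algorithm) to choose the most probable tag sequence for a new sentence; that step is omitted here.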

Code license: Closed source
Last updated: 3 May 2016

Combined with the Leptonica Image Processing Library, Tesseract can read a wide variety of image formats and convert them to text in over 40 languages.

This code is a raw OCR engine. It has no output formatting and no UI. It can detect fixed pitch vs proportional text. Nevertheless, in 1995 this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Training code is included in the open source release.

The core developer on the project is Ray Smith (theraysmith).

Code license: Open source, Apache License
Last updated: 27 Jan 2016

The MONK workbench provides 525 works of American literature from the 18th and 19th centuries, and 37 plays and 5 works of poetry by William Shakespeare, along with tools to enable literary research through the discovery, exploration, and visualization of patterns.

Users affiliated with CIC (Big Ten) schools can access a larger data set that includes about a thousand works of British literature from the 16th through the 19th century, provided by The Text Creation Partnership (EEBO and ECCO) and ProQuest (Chadwyck-Healey Nineteenth-Century Fiction).

Last updated: 12 Aug 2015

A statistical natural language parser for analyzing text to determine its grammatical structure.

Code license: GNU GPL, Open source
Last updated: 18 Jun 2015

TextGrid is a virtual research environment (VRE) for the humanities, providing integrated access to specialized tools, services and content, and serving as a long-term archive for research data in the humanities.

Last updated: 22 May 2015

Silk is a platform for sites that contain collections of information. It's like a Tumblr for websites that have structured content, such as software reviews, information about designers, a site with UN datasets, and more.

Last updated: 22 May 2015

The Text Creation Partnership has double-keyed roughly 55,000 titles from ProQuest’s EEBO image product into fully-searchable, TEI-compliant SGML/XML texts. These texts contain rich metadata fields that indicate when the texts include features like alchemical language, bibliographic citations, and epistolary forms.

Last updated: 17 May 2015

After creating a free account, users can submit requests for mining and analyzing JSTOR content. By submitting a query, a user will receive a random sample of 1,000 of JSTOR's 4.6 million documents; more documents can be received by contacting JSTOR directly. Users can choose to receive the following results:

  • Citations Only (all requests come with citations by default)
  • Word Counts
  • Bigrams
  • Trigrams
  • Quadgrams
  • Key Terms
  • References
Last updated: 29 Apr 2015
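
The word-count and n-gram outputs listed for the JSTOR service above can be illustrated with a short sketch. This is not JSTOR's own processing: the whitespace tokenization, the lowercasing, and the function name `ngram_counts` are assumptions made for illustration.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams (as tuples of tokens) in a text.

    n=1 gives word counts, n=2 bigrams, n=3 trigrams, n=4 quadgrams.
    Tokenization is a plain lowercase whitespace split (a simplification).
    """
    tokens = text.lower().split()
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


sample = "to be or not to be"
word_counts = ngram_counts(sample, 1)
bigrams = ngram_counts(sample, 2)
trigrams = ngram_counts(sample, 3)
quadgrams = ngram_counts(sample, 4)
```

A text shorter than `n` tokens simply yields an empty counter.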

CollateX is Java software for collating textual sources, for example, to produce a critical apparatus. As of January 2012 the project was at an early stage of development and lacked thorough documentation.

Code license: GNU GPL v3
Last updated: 25 Mar 2015

The HathiTrust Digital Library offers public domain texts scanned from university libraries for the Google Books Project, in a variety of formats for search and browsing.

Last updated: 29 Dec 2014

The ARTFL Encyclopédie Project has digitized the Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, par une Société de Gens de lettres (published under the direction of Diderot and d'Alembert between 1751 and 1772, containing 74,000 articles written by more than 130 contributors) and made it available online for scholars to use with the Philologic text retrieval engine and the Philomine text mining tools.

Last updated: 29 Dec 2014

"In the WordHoard environment, texts are annotated or tagged by morphological, lexical, prosodic, and narratological criteria. They are mediated through a 'digital page' or user interface that lets scholarly but non-technical users explore the greatly increased query potential of textual data kept in such a form."

Code license: GNU GPL, Open source
Last updated: 29 Dec 2014

The Versioning Machine displays multiple versions of text encoded according to TEI Guidelines and allows for comparisons of annotation and introductory materials. This is a text editor and allows editors "to immediately see the consequences of their editorial decisions." This tool does not appear to have been updated since 2011.

Last updated: 29 Dec 2014

Google Books provides full, free PDF scans (and web-viewable versions) of public domain books and magazines, free previews of some books and magazines under copyright, and the option to purchase digital editions of available items to save and read on Google Play, their e-reading platform. You can also find a local library, based on your zip code, that has the item available. Reader reviews of the items are available and you can contribute your own. Search for authors, titles, or keywords, and Google will return results with matches in the title or text of the item.

Last updated: 29 Dec 2014

SARIT (Search and Retrieval of Indic Texts) is a collection of electronic editions of Sanskrit and other Indian-language texts that have dated and embedded notes about their change history. You can search, retrieve, and analyze works in SARIT, as well as download all the texts and convert them to PDF, HTML, etc.

Last updated: 29 Dec 2014

InDesign is a desktop publishing (DTP) software application which can be used to create periodical publications, posters, flyers, brochures, magazines and books. The latest version, InDesign CS5.5, is the twelfth generation in the product line.

Features:

  • Tightly integrated with other Adobe suite software
  • Dynamic cross-reference support that updates content when moved within a document
  • Spread rotation
Last updated: 30 Jan 2015

Adobe Illustrator is a comprehensive vector graphics environment that is ideal for all creative professionals, including web and interactive designers and developers, multimedia producers, motion graphics and visual effects designers, animators, and video professionals. Adobe Creative Suite products, including Illustrator, have recently moved to a subscription based service model. Adobe Illustrator CS6 was the last version released on disk. Adobe Illustrator Creative Cloud is now available starting at $19.99/month.

Last updated: 29 Dec 2014

Adobe Bridge is a media management application used for organizing, browsing, locating, and viewing creative assets. It was provided as a part of the Adobe Creative Suite, beginning with CS2, and is now in version CS5.

Features:

  • Tightly integrated with other Adobe suite software (except for the standalone version of Adobe Acrobat 8)
  • Extensible through use of JavaScript
Code license: Closed source
Last updated: 29 Dec 2014

CollateX-based text collation client. CollateX, run on a server independent from the URL above, is a powerful, fully automatic, baseless text collation engine for multiple witnesses. A second collation technique, ncritic, provides a slightly different baseless text collation. The two engines complement each other nicely. The user can supply different files, or even URLs, and then output the result in GraphML, TEI, JSON, HTML, or SVG. Fuzzy matching is an option.

Last updated: 29 Dec 2014

LATtice lets you explore and compare texts across entire corpora but also allows you to “drill down” to the level of individual LATs (language action types) to ask exactly what rhetorical categories make texts similar or different.

Last updated: 29 Dec 2014

TEI Boilerplate is a lightweight solution for publishing styled TEI (Text Encoding Initiative) P5 content directly in modern browsers. With TEI Boilerplate, TEI XML files can be served directly to the web without server-side processing or translation to HTML.

Last updated: 29 Dec 2014

Prism is a tool for crowdsourcing interpretation. It is an experiment in crowdsourcing and visualizing many readings of a common set of texts.

Last updated: 29 Dec 2014

An online text analysis tool that provides detailed statistics about your text, including analysis of word groups, keyword density, and the prominence of words or expressions.

Last updated: 29 Dec 2014

Bookworm enables you to graphically explore lexical trends in repositories of digitized texts.

Code license: Open source
Last updated: 29 Dec 2014

Voyant Tools is a web-based reading and analysis environment for digital texts.

Code license: Open source
Last updated: 29 Dec 2014

WordCruncher is a text retrieval and analysis program that allows users to index or use a text, including very large multilingual Unicode documents. It supports the addition of tags (such as part of speech, definitions, lemma, etc.), graphics, and hyperlinks to text or multimedia files. In addition to supporting contextual and tag searching, WordCruncher includes many analytical reports, covering collocation, vocabulary dispersion, frequency distribution, and vocabulary usage, among others.

Last updated: 29 Dec 2014

The Classical Text Editor was designed to enable scholars working on a critical edition or on a text with commentary or translation to prepare a camera-ready copy or electronic publication without bothering much about making up and page proofs. Its features, formed in continuous discussion with editors using the program, meet the practical needs of the scholar concerning text constitution, entries to different apparatus and updating them when the text has been changed, as well as creating and redefining sigla.

Last updated: 29 Dec 2014

Textexture is a tool for visualizing any text as a network. The resulting graph can be used to get a quick visual summary of the text, read the most relevant excerpts (by clicking on the nodes), and find similar texts.

Last updated: 29 Dec 2014

TVE is an interactive Java tool for exploring the effect of window size on three common linguistic measures: type-token ratio, proportion of hapax legomena, and average word length. In addition, TVE can cluster the text fragments according to a user-given set of words by applying principal component analysis (PCA).
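
The three measures named above can be sketched in a few lines. This is not TVE's code (TVE is a Java tool); it is a Python illustration under stated assumptions: windows are consecutive and non-overlapping, the hapax proportion is taken relative to the number of tokens in the window, and the function name `window_measures` is hypothetical.

```python
from collections import Counter


def window_measures(tokens, window_size):
    """Compute per-window linguistic measures over non-overlapping windows.

    window_size must be a positive integer. A trailing fragment shorter
    than window_size is ignored (an assumption of this sketch).
    """
    results = []
    for start in range(0, len(tokens) - window_size + 1, window_size):
        window = tokens[start:start + window_size]
        counts = Counter(window)
        # Hapax legomena: word types occurring exactly once in the window.
        hapaxes = sum(1 for count in counts.values() if count == 1)
        results.append({
            "type_token_ratio": len(counts) / len(window),
            "hapax_proportion": hapaxes / len(window),
            "avg_word_length": sum(len(word) for word in window) / len(window),
        })
    return results


tokens = "a b a c d d e f".split()
by_four = window_measures(tokens, 4)
by_two = window_measures(tokens, 2)
```

Calling the function with different `window_size` values on the same token list shows how the measures shift with window size, which is the effect the tool is designed to explore.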

Last updated: 29 Dec 2014

With ediarum, researchers can comfortably transcribe, encode and edit manuscripts in TEI-XML, as well as publish their results in an online or print edition. The solution, developed by TELOTA, is based on three software components: eXist-db, Oxygen XML Author, and ConTeXt. These are combined, supplemented with additional functions, and tailored to fit a project's needs.

Code license: Open source, GNU GPL, GPL, GNU LGPL
Last updated: 29 Dec 2014