ANNIS is an open source, cross platform (Linux, Mac, Windows), web browser-based search and visualization architecture for complex multi-layer linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of the SFB 632 - “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”. It has since been extended to a large number of projects annotating a variety of phenomena.
TXM is a free and open-source cross-platform Unicode, XML & TEI based text analysis software, supporting Windows, Mac OS X and Linux. It is also available as a J2EE standard compliant portal software (GWT based) for online access with access control built in (see a demo portal: http://portal.textometrie.org/demo).
The goal of the Alpheios project is to help people learn how to learn languages as efficiently and enjoyably as possible, and in a way that best helps them understand their own literary heritage and culture, and the literary heritage and culture of other peoples throughout history. One of the principal tools, a Firefox plugin, allows a reader to browse a web page with Latin, ancient Greek, or Arabic, click on a word, and get a definition and morphological analysis of the word.
Part-of-speech (POS) tagging software for English. POS tagging, also known as wordclass tagging, is the classification of words into one or more categories based upon their definition, relationship with other words, or other context. CLAWS (Constituent Likelihood Automatic Word-tagging System) uses several methods to identify parts of speech, most notably Hidden Markov models (HMMs), which involve counting examples of co-occurrence of words and wordclasses in training data and making a table of the probabilities of certain sequences of words.
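The counting step behind an HMM tagger can be illustrated with a minimal Python sketch. This is not CLAWS's implementation: the toy training sentences, the tag names, and the function names are all assumptions made for illustration. It tallies how often each wordclass follows another and how often each word appears with each wordclass, then converts the counts into probability tables.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data: sentences of (word, wordclass) pairs.
TRAINING = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]


def count_tables(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    transitions = defaultdict(Counter)
    emissions = defaultdict(Counter)
    for sentence in sentences:
        previous = "<START>"  # marks the beginning of each sentence
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word.lower()] += 1
            previous = tag
    return transitions, emissions


def to_probabilities(table):
    """Turn raw counts into relative frequencies, row by row."""
    return {
        row: {key: n / sum(counts.values()) for key, n in counts.items()}
        for row, counts in table.items()
    }


transitions, emissions = count_tables(TRAINING)
transition_probs = to_probabilities(transitions)
emission_probs = to_probabilities(emissions)
```

A full tagger would combine these tables with a decoding step (for example the Viterbi algorithm) to choose the most probable tag sequence for unseen text; that step is omitted here.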
Transana is a computer program that allows researchers to transcribe and analyze large collections of video, audio, and image data.
A statistical natural language parser for analyzing text to determine its grammatical structure.
The Natural Language Toolkit (NLTK) is an open source Python library for text analysis and natural language processing. NLTK can tokenize strings (create a list of words from a set of characters), identify parts of speech, and perform operations based on a word's context.
The Stanford Part-of-Speech Tagger includes English, Arabic, Chinese, and German tagger modules.
TextGrid is a virtual research environment (VRE) for the humanities, providing integrated access to specialized tools, services and content, and serving as a long-term archive for research data in the humanities.
Lexos is an online tool that enables you to "scrub" (clean) your text(s), cut a text(s) into various size chunks, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, making graphs of rolling averages of word frequencies or ratios of words or letters, and playing with visualizations of word frequencies including word clouds and bubble visualizations.
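As a rough illustration of what a rolling average of word frequency means, the following Python sketch slides a fixed-size window across a token list and reports the share of tokens in each window that match a given word. It is not Lexos code; the function name and parameters are hypothetical.

```python
def rolling_frequency(tokens, word, window):
    """Return the relative frequency of `word` in each sliding window."""
    target = word.lower()
    hits = [1 if token.lower() == target else 0 for token in tokens]
    if window <= 0 or window > len(hits):
        return []
    current = sum(hits[:window])
    averages = [current / window]
    # Slide the window one token at a time, updating the running count.
    for i in range(window, len(hits)):
        current += hits[i] - hits[i - window]
        averages.append(current / window)
    return averages


sample_tokens = "a b a a b b a".split()
sample_averages = rolling_frequency(sample_tokens, "a", window=3)
```

Plotting the returned list against window position gives the kind of graph described above; larger windows produce smoother curves.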
This software scans one or many Word DOCX, text and text-like files (e.g. HTML and XML files) and counts the number of occurrences of the different words or phrases. There is no limit on the size of an input text file. The words/phrases which are found can be displayed alphabetically or by frequency. The program can be told to allow or disallow words with numerals, hyphens, apostrophes, underscores or colons, to ignore words which are short or which occur infrequently, and to ignore words (e.g., common words such as 'the', a.k.a. stop words) contained in a specified file.
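The filtering and ordering options described here can be sketched in a few lines of Python. This is a simplified illustration and not the program's actual code: the function names, the parameters, and the tokenizing rule (letters and digits, optionally joined by hyphens or apostrophes) are assumptions.

```python
import re
from collections import Counter

# Letters/digits, optionally joined by internal hyphens or apostrophes.
TOKEN = re.compile(r"[^\W_]+(?:['-][^\W_]+)*")


def word_frequencies(text, stop_words=(), min_length=1, min_count=1,
                     allow_numerals=True):
    """Count words, skipping stop words, short words, and rare words."""
    stop = {w.lower() for w in stop_words}
    counts = Counter(
        w for w in (m.lower() for m in TOKEN.findall(text))
        if len(w) >= min_length
        and w not in stop
        and (allow_numerals or not any(ch.isdigit() for ch in w))
    )
    return {w: n for w, n in counts.items() if n >= min_count}


def by_frequency(freqs):
    """Most frequent first; ties broken alphabetically."""
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))


def alphabetically(freqs):
    return sorted(freqs.items())


sample = "The cat saw the dog. The dog saw 2 cats."
freqs = word_frequencies(sample, stop_words={"the"}, allow_numerals=False)
```

In practice the stop words would be read from a file, one word per line, and passed in as `stop_words`.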
This software scans a Word DOCX file or a text file (including HTML and XML files) with text encoded via ANSI or UTF-8 and counts the frequencies of different words. The words which are found and displayed can be ordered alphabetically or by frequency.
MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. Currently MorphAdorner provides methods for adorning text with standard spellings, parts of speech and lemmata. MorphAdorner also provides facilities for tokenizing text, recognizing sentence boundaries, and extracting names and places.
Praat is software for the phonetic analysis of speech, including support for articulatory and speech synthesis.
VARD 2 is an interactive piece of software produced in Java designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended to be a pre-processor to other corpus linguistic methods such as keyword analysis, collocations and annotation (e.g. POS and semantic tagging), the aim being to improve the accuracy of these tools.
AGTK is a suite of software components for building tools for annotating linguistic signals, time-series data which documents any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs. Annotation Graphs are a formal framework for representing linguistic annotations of time series data.
CorpusSearch 2 allows users to construct and search syntactically annotated corpora, including finding and counting lexical and syntactic patterns, correcting systemic errors, and coding linguistic features.
The software is released under Mozilla Public License 1.1 (MPL 1.1).
A software tool for performing concordance (the analysis of a set of words within its immediate context) on a body of text. The tool performs full concordance, reading and analysing each and every word in a text. It was initially written for the analysis of English texts, but has since been extended to cater for other Western languages. Limited support is also provided for text in East Asian scripts, such as Chinese and Korean.
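To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in Python. It is an illustration only and does not reflect how this tool is implemented; the function name, the `width` parameter, and the simple `\w+` tokenizing rule are assumptions.

```python
import re


def concordance(text, keyword, width=3):
    """Return (left context, keyword, right context) for every match."""
    tokens = re.findall(r"\w+", text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "The quick fox jumps over the lazy dog while the fox sleeps"
kwic = concordance(sample, "fox", width=2)
```

A full concordance, as described above, would repeat this for every distinct word in the text instead of a single keyword.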
cue.language is a Java library that has tokenizing (words/sentences/ngram), string counting, language guessing, and stop word detection capabilities.
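For readers unfamiliar with n-gram tokenizing, the short Python sketch below shows the general technique. It does not use or reproduce the cue.language API (which is a Java library); the function name and the `\w+` word rule are assumptions for illustration.

```python
import re


def ngrams(text, n):
    """Return every run of n consecutive lowercased words as a tuple."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("To be or not to be", 2)
```

Counting the resulting tuples (for example with `collections.Counter`) gives n-gram frequencies.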
Basis provides natural language processing technology for the analysis of unstructured multilingual text.
The Transformer is a tool designed for aligned transcribed linguistic data that can convert data between a variety of formats, as well as organize, search and display the data.
EXMARaLDA (Extensible Markup Language for Discourse Annotation) is a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language, and for the construction and analysis of spoken language corpora.
The Field Linguist's Toolbox is Windows software for maintaining lexical data, and for parsing and interlinearizing text.
A software application for the playback of audio recordings. SoundScriber offers specific functionality for researchers who wish to transcribe a recording. It was originally developed for use in the Michigan Corpus of Academic Spoken English (MICASE) project and released for use by academics performing similar work.
- Audio playback via installed audio codecs (e.g. Wav, MP3)
- Variable speed playback
CollateX-based text collation client. CollateX, run on a server independent from the URL above, is a powerful, fully automatic, baseless text collation engine for multiple witnesses. A second collation technique, ncritic, provides a slightly different baseless text collation. The two engines complement each other nicely. The user can use different files, even URLs, then output the result in GraphML, TEI, JSON, HTML, or SVG. Fuzzy matching is an option.
WebLicht is a service-oriented architecture (SOA) for creating annotated text corpora. Development started in October 2008 as part of CLARIN-D's predecessor project D-SPIN, and further development and enhancement of WebLicht is an important goal of CLARIN-D, aiming to make WebLicht a fully-functional virtual research environment.
This online tool can be used for a wide variety of annotation tasks, including visualization and collaboration.
brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and "interpreted" by a computer. brat also supports the annotation of n-ary associations that can link together any number of other annotations participating in specific roles. brat also implements a number of features relying on natural language processing techniques to support human annotation efforts.
The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. When applied to literature as we do here, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The current suite of lexomics tools is:
- scrubber -- strips tags, removes stop words, applies lemma lists, and prepares texts for diviText
- diviText -- cuts texts into chunks in one of three ways, counts words, and exports the results
QDA Miner is an easy-to-use mixed-methods qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner may be used to analyze interview or focus-group transcripts, legal documents, journal articles, even entire books, as well as drawings, photographs, paintings, and other types of visual documents.
The UAM CorpusTool may be used for annotating a corpus as part of a linguistic study. The tool also allows users to search for annotation forms and compare them to one another. It provides a graphical schema editor and saves annotations in a stand-off XML format.
GATE (General Architecture for Text Engineering) is a sophisticated framework that allows manual and automatic annotation as well as the processing of all kinds of language resources. GATE has a broad community of users and developers, and comes with diverse plugins for specific linguistic tasks.
"WordFreak is a java-based linguistic annotation tool designed to support human, and automatic annotation of linguistic data as well as employ active-learning for human correction of automatically annotated data." (text taken from http://wordfreak.sourceforge.net/)
WordCruncher is a text retrieval and analysis program that allows users to index or use a text, including very large multilingual Unicode documents. It supports the addition of tags (such as part of speech, definitions, lemma, etc.), graphics, and hyperlinks to text or multimedia files. In addition to supporting contextual and tag searching, WordCruncher also includes many analytical reports, including collocation, vocabulary dispersion, frequency distribution, and vocabulary usage.
The Tesserae project aims to provide a flexible and robust web interface for exploring intertextual parallels.
Writefull is a light-weight app that uses data from Google Books (5+ million books) and the Web to improve your writing. It compares small sections of your text to a large data set of writing found online and in Google Books. All you need to do is select a chunk of your text in your browser or text editing software, activate the Writefull popover, and choose one of its five options:
1) check the number of results (how often the chunk appears in Google Books or the Web);