Linguistic research

ANNIS is an open-source, cross-platform (Linux, Mac, Windows), browser-based search and visualization architecture for complex multi-layer linguistic corpora with diverse types of annotation. ANNIS, which stands for ANNotation of Information Structure, was originally designed to provide access to the data of SFB 632, “Information Structure: The Linguistic Means for Structuring Utterances, Sentences and Texts”. It has since been extended to serve a large number of projects annotating a variety of phenomena.

Code license: Open source, Apache License
Last updated: 16 Sep 2016

TXM

TXM is a free and open-source, cross-platform, Unicode-, XML- and TEI-based text analysis software package, supporting Windows, Mac OS X and Linux. It is also available as a J2EE-compliant portal application (GWT-based) for online access with built-in access control (see a demo portal: http://portal.textometrie.org/demo).

Code license: Open source, GNU GPL v3
Last updated: 29 Jun 2016

The goal of the Alpheios project is to help people learn how to learn languages as efficiently and enjoyably as possible, and in a way that best helps them understand their own literary heritage and culture as well as those of other peoples throughout history. One of its principal tools, a Firefox plugin, lets a reader browse a web page containing Latin, ancient Greek, or Arabic, click on a word, and get a definition and morphological analysis of that word.

Code license: Open source, GNU GPL
Last updated: 7 Jun 2016

Part-of-speech (POS) tagging software for English: the classification of words into one or more categories based on their definitions, their relationships with other words, or other context, also known as wordclass tagging. CLAWS (Constituent Likelihood Automatic Word-tagging System) uses several methods to identify parts of speech, most notably hidden Markov models (HMMs), which involve counting co-occurrences of words and wordclasses in training data and building a table of the probabilities of certain sequences of words.
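For illustration only (CLAWS itself is closed source), the sketch below shows the counting step that such an HMM approach relies on, using an invented two-sentence training set: tag-to-tag transition counts and tag-to-word emission counts are tallied and then turned into probabilities.

```python
from collections import defaultdict

# Illustrative sketch of the counting step behind an HMM tagger: tally how
# often each tag follows another and how often each word appears with each
# tag in (word, tag) training data, then convert the counts to probabilities.
# The toy training data below is invented for this example.
training = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
]

transition_counts = defaultdict(lambda: defaultdict(int))  # tag -> next tag -> count
emission_counts = defaultdict(lambda: defaultdict(int))    # tag -> word -> count

for sentence in training:
    previous_tag = "<START>"
    for word, tag in sentence:
        transition_counts[previous_tag][tag] += 1
        emission_counts[tag][word.lower()] += 1
        previous_tag = tag

def transition_probability(prev_tag, tag):
    total = sum(transition_counts[prev_tag].values())
    return transition_counts[prev_tag][tag] / total if total else 0.0

print(transition_probability("DET", "NOUN"))  # 1.0 in this toy data
```

A full tagger would combine these transition and emission probabilities (e.g. with the Viterbi algorithm) to choose the most likely tag sequence for new text.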

Code license: Closed source
Last updated: 3 May 2016

Transana is a computer program that allows researchers to transcribe and analyze large collections of video, audio, and image data.

Last updated: 10 Aug 2015

A statistical natural language parser for analyzing text to determine its grammatical structure.

Code license: GNU GPL, Open source
Last updated: 18 Jun 2015

The Natural Language Toolkit (NLTK) is an open-source Python library for text analysis and natural language processing. NLTK can tokenize strings (split a string of characters into a list of words), identify parts of speech, and perform operations based on a word's context.
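As a minimal sketch of the two operations mentioned above, assuming NLTK and its punkt and averaged_perceptron_tagger resources are installed:

```python
import nltk

# These resources ship separately from the library itself.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

text = "NLTK can tokenize strings and identify parts of speech."
tokens = nltk.word_tokenize(text)   # split the string into word tokens
tagged = nltk.pos_tag(tokens)       # attach a part-of-speech tag to each token
print(tagged)                       # e.g. [('NLTK', 'NNP'), ('can', 'MD'), ...]
```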

Last updated: 28 May 2015

The Stanford Part-of-Speech Tagger includes English, Arabic, Chinese, and German tagger modules.

Code license: GNU GPL, Open source
Last updated: 27 May 2015

TextGrid is a virtual research environment (VRE) for the humanities, providing integrated access to specialized tools, services and content, and serving as a long-term archive for research data in the humanities.

Last updated: 22 May 2015

Lexos is an online tool that enables you to "scrub" (clean) your texts, cut texts into chunks of various sizes, manage chunks and chunk sets, and choose from a suite of analysis tools for investigating those texts. Functionality includes building dendrograms, graphing rolling averages of word frequencies or ratios of words or letters, and exploring visualizations of word frequencies, including word clouds and bubble visualizations.

Code license: Open source
Last updated: 17 May 2015

This software scans one or more Word DOCX, text, and text-like files (e.g. HTML and XML files) and counts the number of occurrences of different words or phrases. There is no limit on the size of an input text file. The words or phrases found can be displayed alphabetically or by frequency. The program can be told to allow or disallow words with numerals, hyphens, apostrophes, underscores, or colons; to ignore words that are short or that occur infrequently; and to ignore words (e.g. common words such as 'the', a.k.a. stop words) contained in a specified file.
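Since the tool is closed source, the following is only an illustrative Python sketch of the general approach described above (counting word frequencies while skipping short words and words listed in a stop-word file); the function and file names are invented for the example.

```python
import re
from collections import Counter

def word_frequencies(text_path, stopword_path, min_length=3):
    """Count word frequencies in a text file, skipping stop words and short words."""
    with open(stopword_path, encoding="utf-8") as f:
        stop_words = {line.strip().lower() for line in f if line.strip()}
    with open(text_path, encoding="utf-8") as f:
        # Treat runs of letters (optionally with internal apostrophes/hyphens) as words.
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", f.read().lower())
    return Counter(
        w for w in words if len(w) >= min_length and w not in stop_words
    )

# counts.most_common() lists words by frequency; sorted(counts) lists them alphabetically.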

Code license: Closed source
Last updated: 1 May 2015

This software scans a Word DOCX file or a text file (including HTML and XML files) encoded in ANSI or UTF-8 and counts the frequencies of the different words. The words found can be displayed alphabetically or by frequency.

Code license: Closed source
Last updated: 30 Apr 2015

MorphAdorner is a Java command-line program which acts as a pipeline manager for processes performing morphological adornment of words in a text. Currently MorphAdorner provides methods for adorning text with standard spellings, parts of speech and lemmata. MorphAdorner also provides facilities for tokenizing text, recognizing sentence boundaries, and extracting names and places.

Code license: NCSA, Open source
Last updated: 21 Apr 2015

Praat is software for the phonetic analysis of speech, including support for articulatory and speech synthesis.

Code license: GNU GPL v2
Last updated: 19 Feb 2015

VARD 2 is an interactive Java tool designed to assist users of historical corpora in dealing with spelling variation, particularly in Early Modern English texts. The tool is intended as a pre-processor to other corpus linguistic methods such as keyword analysis, collocation analysis, and annotation (e.g. POS and semantic tagging), with the aim of improving the accuracy of these tools.

Last updated: 19 Feb 2015

AGTK is a suite of software components for building tools for annotating linguistic signals, i.e. time-series data documenting any kind of linguistic behavior (e.g. audio, video). Its internal data structures are based on annotation graphs, a formal framework for representing linguistic annotations of time-series data.
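As an illustrative sketch of the annotation-graph idea only (not AGTK's actual API): nodes are anchors that may carry a time offset into the signal, and arcs between nodes carry typed, labelled annotations.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    id: str
    time: Optional[float] = None  # offset in seconds, if anchored to the signal

@dataclass
class Arc:
    start: Node
    end: Node
    type: str   # e.g. "word", "phone", "speaker"
    label: str  # e.g. the transcribed word

@dataclass
class AnnotationGraph:
    nodes: List[Node] = field(default_factory=list)
    arcs: List[Arc] = field(default_factory=list)

# A single word annotation spanning 0.00-0.35 seconds of the signal.
g = AnnotationGraph()
n1, n2 = Node("n1", 0.00), Node("n2", 0.35)
g.nodes += [n1, n2]
g.arcs.append(Arc(n1, n2, "word", "hello"))
```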

Code license: Open source
Last updated: 11 Feb 2015

CorpusSearch 2 allows users to construct and search syntactically annotated corpora, including finding and counting lexical and syntactic patterns, correcting systemic errors, and coding linguistic features.

The software is released under the Mozilla Public License 1.1 (MPL 1.1).

Code license: Open source
Last updated: 11 Feb 2015

A software tool for performing concordancing (the analysis of a set of words within their immediate context) on a body of text. The tool performs a full concordance, reading and analysing every word in a text. It was initially written for the analysis of English texts, but has since been extended to cater for other Western languages. Limited support is also provided for text in East Asian scripts, such as Chinese and Korean.
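Since the tool is closed source, here is only a generic keyword-in-context (KWIC) sketch of what a concordancer does; the function is invented for illustration.

```python
def kwic(text, keyword, window=4):
    """List every occurrence of keyword together with a window of surrounding words."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower().strip(".,;:!?") == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40} | {token} | {right}")
    return lines

sample = "The cat sat on the mat, and the dog sat near the cat."
print("\n".join(kwic(sample, "sat")))
```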

Code license: Closed source
Last updated: 11 Feb 2015

cue.language is a Java library with tokenization (words, sentences, n-grams), string counting, language guessing, and stop-word detection capabilities.

Code license: Apache License, Open source
Last updated: 29 Dec 2014

Basis provides natural language processing technology for the analysis of unstructured multilingual text.

Last updated: 29 Dec 2014

EXMARaLDA (Extensible Markup Language for Discourse Annotation) is a system of concepts, data formats and tools for the computer assisted transcription and annotation of spoken language, and for the construction and analysis of spoken language corpora.

Last updated: 29 Dec 2014

The Field Linguist's Toolbox is Windows software for maintaining lexical data, and for parsing and interlinearizing text.

Last updated: 29 Dec 2014

The Transformer is a tool designed for aligned transcribed linguistic data that can convert data between a variety of formats, as well as organize, search and display the data.

Last updated: 29 Dec 2014

A software application for the playback of audio recordings. SoundScriber offers specific functionality for researchers who wish to transcribe a recording. It was originally developed for use in the Michigan Corpus of Academic Spoken English (MICASE) project and released for use by academics performing similar work.

Features:

  • Audio playback via installed audio codecs (e.g. WAV, MP3)
  • Variable speed playback
Code license: GNU GPL, Open source
Last updated: 29 Jan 2015

A CollateX-based text collation client. CollateX, run on a server independent of the URL above, is a powerful, fully automatic, baseless text collation engine for multiple witnesses. A second collation technique, ncritic, provides a slightly different baseless collation; the two engines complement each other nicely. The user can supply different files, or even URLs, and output the result in GraphML, TEI, JSON, HTML, or SVG. Fuzzy matching is an option.
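As a sketch of the same baseless collation approach, the snippet below uses the collatex Python package (assumed installed via pip install collatex); the entry above describes a web client rather than this library, and the witness texts are invented for the example.

```python
from collatex import Collation, collate

# Register two witnesses of the "same" text under sigla A and B.
collation = Collation()
collation.add_plain_witness("A", "The quick brown fox jumped over the dog.")
collation.add_plain_witness("B", "The brown fox jumped over the lazy dog.")

# collate() aligns the witnesses without designating any of them as a base;
# other output formats (e.g. JSON, HTML) are available via the output argument.
alignment_table = collate(collation, layout="vertical")
print(alignment_table)
```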

Last updated: 29 Dec 2014

WebLicht is a service-oriented architecture (SOA) for creating annotated text corpora. Development started in October 2008 as part of CLARIN-D's predecessor project D-SPIN, and further development and enhancement of WebLicht is an important goal of CLARIN-D, aiming to make WebLicht a fully-functional virtual research environment.

Last updated: 29 Dec 2014

This online tool can be used for a wide variety of annotation tasks, including visualization and collaboration.

brat is designed in particular for structured annotation, where the notes are not freeform text but have a fixed form that can be automatically processed and "interpreted" by a computer. brat supports the annotation of n-ary associations that can link together any number of other annotations participating in specific roles, and it implements a number of features relying on natural language processing techniques to support human annotation efforts.

Last updated: 29 Dec 2014

The UAM CorpusTool may be used for annotating a corpus as part of a linguistic study. The tool also allows users to search for annotation forms and compare them to one another. It provides a graphical schema editor and saves annotations in a stand-off XML format.

Last updated: 29 Dec 2014

GATE (General Architecture for Text Engineering) is a sophisticated framework that allows manual and automatic annotation as well as the processing of all kinds of language resources. GATE has a broad community of users and developers, and comes with diverse plugins for specific linguistic tasks.

Last updated: 29 Dec 2014

"WordFreak is a java-based linguistic annotation tool designed to support human, and automatic annotation of linguistic data as well as employ active-learning for human correction of automatically annotated data." (text taken from http://wordfreak.sourceforge.net/)

Last updated: 29 Dec 2014

The term "lexomics" was originally coined to describe the computer-assisted detection of "words" (short sequences of bases) in genomes. When applied to literature as we do here, lexomics is the analysis of the frequency, distribution, and arrangement of words in large-scale patterns. The current suite of lexomics tools are:

  • scrubber -- strips tags, removes stop words, applies lemma lists, and prepares texts for diviText
  • diviText -- cuts texts into chunks in one of three ways, counts words, and exports the results
Last updated: 29 Dec 2014

QDA Miner is an easy-to-use mixed-methods qualitative data analysis software package for coding, annotating, retrieving and analyzing small and large collections of documents and images. QDA Miner may be used to analyze interview or focus-group transcripts, legal documents, journal articles, and even entire books, as well as drawings, photographs, paintings, and other types of visual documents.

Last updated: 29 Dec 2014

WordCruncher is a text retrieval and analysis program that allows users to index or use a text, including very large multilingual Unicode documents. It supports the addition of tags (such as part of speech, definitions, lemmas, etc.), graphics, and hyperlinks to text or multimedia files. In addition to supporting contextual and tag searching, WordCruncher also includes many analytical reports, including collocation, vocabulary dispersion, frequency distribution, and vocabulary usage.

Last updated: 29 Dec 2014

The Tesserae project aims to provide a flexible and robust web interface for exploring intertextual parallels.

Last updated: 29 Dec 2014

Writefull is a lightweight app that uses data from Google Books (5+ million books) and the Web to improve your writing. It compares small sections of your text to a large data set of writing found online and in Google Books. All you need to do is select a chunk of your text in your browser or text editing software, activate the Writefull popover, and choose one of its five options:

1) check the number of results (how often the chunk appears in Google Books or the Web);

Code license: Closed source
Last updated: 29 Dec 2014

Word and Phrase utilizes the Corpus of Contemporary American English (COCA) to analyze texts for word frequencies, collocations, and concordance lines. Users copy and paste texts into a web interface.

Last updated: 29 Dec 2014