This tool was inspired by Victor Powell's CSV fingerprint, which he describes as a "birdseye view of the file without too much distracting detail". Breve gives you that meta view of tabular data and also lets you drill down to records and columns and edit values.

All data is handled locally, so this tool can be used with data that cannot be uploaded to the cloud.
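
The "birdseye view" idea can be illustrated with a minimal sketch that reduces each cell of a CSV file to a one-character type code. This is not Breve's or Victor Powell's implementation; the function names `classify` and `fingerprint` and the type codes are assumptions chosen for illustration.

```python
import csv
import io


def classify(cell: str) -> str:
    """Map one cell to a single-character type code (hypothetical scheme)."""
    text = cell.strip()
    if text == "":
        return "."  # empty cell
    try:
        int(text)
        return "i"  # integer
    except ValueError:
        pass
    try:
        float(text)
        return "f"  # floating-point number
    except ValueError:
        return "s"  # anything else is treated as a string


def fingerprint(csv_text: str) -> list[str]:
    """Return one string of type codes per CSV row."""
    rows = csv.reader(io.StringIO(csv_text))
    return ["".join(classify(cell) for cell in row) for row in rows]


sample = "name,age,score\nAda,36,9.5\nGrace,,7\n"
for line in fingerprint(sample):
    print(line)
```

For the sample above, the three rows reduce to `sss`, `sif` and `s.i`, so the header row and the missing value stand out at a glance. Everything runs locally on text already in memory.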

Code license: Open source
Last updated: 17 Jan 2019

Geospatial Data Abstraction Library (GDAL) is a translator library for vector and raster geospatial data formats that is released under an X/MIT style Open Source license by the Open Source Geospatial Foundation.

Code license: Open source, MIT License
Last updated: 7 Jun 2016

Geographically Encoded Objects for RSS feeds. GeoRSS was designed as a lightweight, community driven way to extend existing feeds with geographic information.

As RSS and Atom become more prevalent as a way to publish and share information, it becomes increasingly important that location is described in an interoperable manner so that applications can request, aggregate, share and map geographically tagged feeds.
RSS Map of Digital Humanities centers
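
As a rough illustration of how an application might read geographically tagged entries, the sketch below parses an Atom feed that uses the GeoRSS Simple `georss:point` element (latitude, then longitude, separated by whitespace). The feed content, the entry titles and the function name `extract_points` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs for Atom and GeoRSS Simple.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Hypothetical feed used only for illustration.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <title>Example centers</title>
  <entry>
    <title>Hypothetical Center A</title>
    <georss:point>45.256 -71.92</georss:point>
  </entry>
  <entry>
    <title>Entry without a location</title>
  </entry>
</feed>"""


def extract_points(feed_xml: str) -> list[tuple[str, float, float]]:
    """Return (title, latitude, longitude) for every entry with a georss:point."""
    root = ET.fromstring(feed_xml)
    points = []
    for entry in root.findall("atom:entry", NS):
        title = entry.findtext("atom:title", default="", namespaces=NS)
        point = entry.findtext("georss:point", namespaces=NS)
        if point is None:
            continue  # entry carries no location; skip it
        lat, lon = (float(value) for value in point.split())
        points.append((title, lat, lon))
    return points


print(extract_points(FEED))
```

Entries without a location are skipped, so a consumer can aggregate and map only the tagged items.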

Last updated: 7 Jun 2016

Overview is a tool for analyzing large sets of documents. It includes a sophisticated search engine, word clouds, entity detection, and topic-based document clustering. If that’s not good enough, you can write your own plugins using the API. It is open source and you can run it on your own computer.

It was originally designed for investigative journalists, but it’s now also used for qualitative research, social media conversation analysis, legal document review, digital humanities, and more.

Overview is built to do several types of tasks:

Code license: Open source
Last updated: 9 Mar 2016

Combined with the Leptonica Image Processing Library, Tesseract can read a wide variety of image formats and convert them to text in over 40 languages.

This code is a raw OCR engine. It has no output formatting and no UI. It can detect fixed pitch vs proportional text. Nevertheless, in 1995 this engine was in the top 3 in terms of character accuracy, and it compiles and runs on both Linux and Windows. Training code is included in the open source release.

The core developer on the project is Ray Smith (theraysmith).

Code license: Open source, Apache License
Last updated: 27 Jan 2016

The Entity Authority Tool Set (EATS) is a web application for recording, editing, using and displaying authority information about entities. It is designed to allow multiple authorities to each maintain their own independent data, while operating on a common base so that information about the same entity is all in one place. EATS also comes with client tools for automatically looking up entities in a text by name and adding appropriate TEI markup.

  • A web API for importing and exporting entity data
Code license: Open source, GNU GPL
Last updated: 26 Jan 2016

A free web-based platform that puts the power of the machine-readable web in users' hands. Using its tools, users can create an API or crawl an entire website in a fraction of the time of traditional methods, with no coding required. The highly efficient and scalable platform lets users process thousands of queries at once and get real-time data in any format they choose. It also offers an easy-to-use client library to make exporting, integrating and using data as simple as extracting it.

Code license: Closed source
Last updated: 15 Jan 2016

A free iOS app for text analysis. Textal allows you to analyze documents, tweet streams, and webpages. Create clickable text clouds based on the source data that you choose. It comes pre-loaded with a large number of public domain texts. Text clouds are easily shareable via Twitter and email.

Last updated: 18 Dec 2015

CulturalAnalytics is an R package containing functions for statistical analysis and plotting of image properties, including statistics such as the standard deviation and mean in the RGB and HSV color spaces, image entropy and histograms in greyscale (intensity) and color, and for plotting color clouds and image scatter charts.
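
The kinds of statistics listed above can be sketched in a few lines. The example below is written in Python with the standard library only and is not the CulturalAnalytics API; the function names `channel_stats` and `grey_entropy`, the tiny pixel list and the greyscale weights are assumptions made for illustration.

```python
import colorsys
import math
from collections import Counter
from statistics import mean, pstdev


def channel_stats(pixels):
    """Mean and population standard deviation per RGB and HSV channel.

    `pixels` is a list of (r, g, b) tuples with 8-bit values (0-255).
    Hue is averaged linearly here, which ignores its circular nature.
    """
    rgb = [(r / 255, g / 255, b / 255) for r, g, b in pixels]
    hsv = [colorsys.rgb_to_hsv(*p) for p in rgb]
    stats = {}
    for names, values in (("rgb", rgb), ("hsv", hsv)):
        for i, name in enumerate(names):
            channel = [v[i] for v in values]
            stats[name] = (mean(channel), pstdev(channel))
    return stats


def grey_entropy(pixels):
    """Shannon entropy (in bits) of the 8-bit greyscale intensity histogram."""
    # Assumed luma weights (ITU-R BT.601) for the greyscale conversion.
    grey = [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]
    counts = Counter(grey)
    total = len(grey)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical two-pixel "image": one black pixel, one white pixel.
pixels = [(0, 0, 0), (255, 255, 255)]
print(channel_stats(pixels)["r"])
print(grey_entropy(pixels))
```

Computing one such row of statistics per image yields the table from which scatter charts of an image collection can be plotted.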

Code license: Open source, GNU GPL
Last updated: 12 Nov 2015

Whatizit can ingest up to 500,000 terms pasted into the input box and execute any of the pre-defined text analysis pipelines.

Last updated: 23 May 2015

A free (under the GNU General Public License) toolkit for the development of document image recognition systems.


  • Custom dictionaries may be created to assist with analysis of specific record types
  • Extensible functionality
  • Optical character recognition (OCR) toolkit plugin
Code license: Open source, GNU GPL
Last updated: 22 May 2015

OHMS (Oral History Metadata Synchronizer) inexpensively and efficiently enhances access to oral history by providing users with word-level search capability and a time-correlated transcript or indexed interview connecting the textual search term to the corresponding moment in the recorded interview online.

OHMS is an open source, web-based application designed to improve the user experience you provide for oral history, no matter what CMS or repository you use. There are 2 main components of the OHMS system

Code license: Open source
Last updated: 6 Apr 2015

Bitext provides multilingual semantic technologies in the field of Text Analytics via API, with services like Entity Extraction, Concept Extraction, Sentiment Analysis, and Text Categorisation.

Last updated: 25 Mar 2015

Praat is software for the phonetic analysis of speech, including support for articulatory and speech synthesis.

Code license: GNU GPL v2
Last updated: 19 Feb 2015

The DocScanner app uses a device's built-in camera to scan documents. Features include image optimization, OCR, document type recognition (document, business card, receipt, etc.), autosorting, and the ability to upload documents to Evernote, Dropbox, and Google Drive.

Code license: Closed source
Last updated: 29 Dec 2014