Beautiful Soup is a library, written in the Python programming language, for pulling specific pieces of data out of HTML and XML files. It is especially suitable when working with data files that aren't well-formed, or are otherwise difficult to parse.

It can save programmers hours or days of work on quick-turnaround screen-scraping projects.
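
As a rough illustration of pulling data out of markup that isn't well-formed, the sketch below feeds Beautiful Soup a fragment with unclosed tags and collects the link targets. The HTML fragment is invented for this example, and the parser choice ("html.parser", which ships with Python) is an assumption; other parsers may build a slightly different tree.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy, made-up markup: the <p> and <a> tags are never closed.
html = "<p>First<p>Second <a href='/a'>link"

# "html.parser" is Python's bundled parser; it is one of several options.
soup = BeautifulSoup(html, "html.parser")

# Collect the href attribute and visible text of every <a> element found.
hrefs = [a["href"] for a in soup.find_all("a")]
link_text = [a.get_text() for a in soup.find_all("a")]

print(hrefs, link_text)
```

Even though the fragment never closes its tags, the link can still be located and read.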

Last updated: 19 Apr 2016

A Python-based XML web publishing framework which enables dynamic pipelining of XSLT transformations. Data is processed by an XML pipeline composed of several WSGI applications and middleware components.


  • Apache Cocoon Sitemap 1.0 compatible
  • WSGI modularity
  • URI pattern matching
Code license: Open source, GNU GPL
Last updated: 26 Jan 2016

epub-tools is a collection of Python tools for generating and managing epub documents from Word, RTF, DocBook, TEI and FictionBook.

Code license: BSD
Last updated: 26 Jan 2016

Plone is a powerful, flexible, open source Content Management System (CMS) built on top of Zope application server and CMF.

  • Flexible and adaptable workflow
  • Customisable
  • Free add-ons
  • Versioning, history and reverting content
  • Support for multiple mark up formats
  • Multilingual content management
  • RSS feed support
  • WebDAV and FTP support
  • Integrates with Active Directory, Salesforce, LDAP, SQL, Web Services and Oracle
Code license: Open source, GNU GPL, GNU GPL v2
Last updated: 7 Aug 2015

The Natural Language Toolkit (NLTK) is an open source Python library for text analysis and natural language processing. NLTK can tokenize strings (create a list of words from a set of characters), identify parts of speech, and perform operations based on a word's context.
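
To make "tokenize" concrete, here is a minimal sketch that uses only Python's standard `re` module. It is not NLTK's tokenizer: the `simple_tokenize` name and the pattern are assumptions chosen for illustration, and NLTK's own tokenizers handle many cases (contractions, abbreviations) that this does not.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks.

    Hypothetical helper for illustration only; not part of NLTK.
    """
    # \w+ matches a run of word characters; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("NLTK can tokenize strings.")
print(tokens)
```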

Last updated: 28 May 2015

From the web site:

music21: A toolkit for computer-aided musicology

Code license: Open source
Last updated: 27 May 2015

A free (under the GNU General Public License) toolkit for the development of document image recognition systems.


  • Custom dictionaries may be created to assist with analysis of specific record types
  • Extensible functionality
  • Optical character recognition (OCR) toolkit plugin
Code license: Open source, GNU GPL
Last updated: 22 May 2015

Scrapy is an open source programming library for web crawling and web page text extraction, written in Python. You can make calls to Scrapy code from within your own scripts and applications to automate the task of extracting data from websites.

You would typically use Scrapy to automate the task of visiting one or more web pages on a website to which you have access. Alternatively, you could use it to invoke web-based Application Programming Interfaces (APIs).

Code license: Open source
Last updated: 22 May 2015

Graphviz is open source software for graph visualization, representing structural information as diagrams of abstract graphs and networks. The package includes web and interactive graphical interfaces, and auxiliary tools, libraries, and language bindings.

Last updated: 7 May 2015

PDFMiner is a Python tool for extracting information from PDFs (not only text, but also information about fonts, encoding, and layout).

Code license: MIT License
Last updated: 1 May 2015

AGTK is a suite of software components for building tools that annotate linguistic signals, that is, time-series data documenting any kind of linguistic behavior (e.g. audio, video). The internal data structures are based on annotation graphs, a formal framework for representing linguistic annotations of time-series data.

Code license: Open source
Last updated: 11 Feb 2015

Solr is an open source enterprise search platform from the Apache Lucene project. It operates as a standalone full-text search server within an appropriate servlet container, such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.
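
Because the API is plain HTTP, a search request is just a URL. The sketch below only builds such a URL with Python's standard library and does not contact a server. The host, port and core name ("catalog") are placeholders, and `build_select_url` is a hypothetical helper; the `/select` handler and the `q`, `wt` and `rows` parameters are standard Solr query parameters.

```python
from urllib.parse import urlencode

def build_select_url(base_url, core, query, rows=10):
    """Return a Solr search URL that asks for a JSON response.

    Hypothetical helper for illustration; base_url and core are
    whatever your own Solr installation uses.
    """
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"

# Placeholder server address and core name.
url = build_select_url("http://localhost:8983", "catalog", "title:python", rows=5)
print(url)
```

Any HTTP client in any language could then fetch this URL and parse the JSON it returns.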

Code license: Apache License, Open source
Last updated: 29 Dec 2014

Apache Lucene is a Java-based high-performance text search engine library.

Code license: Apache License, Open source
Last updated: 29 Dec 2014

PostgreSQL is a powerful, open source object-relational database system that runs on all major platforms. It supports native programming interfaces for C/C++, Java, .Net, Perl, Python, Ruby, Tcl and ODBC, among others.

Last updated: 29 Dec 2014

Apache Subversion (SVN) is an open source version control system. Access to objects and changes to them are carefully controlled to prevent unauthorized access and alteration. Developers use SVN to maintain current and historical versions of files such as source code, web pages, and documentation.

Code license: Apache License
Last updated: 29 Dec 2014

Pattern is a Python web mining module with tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics) and data visualization (graph networks).

Code license: BSD, Open source
Last updated: 29 Dec 2014