Get 1T: software for processing Web 1T
The Web 1T 5-gram corpus contains n-grams, from unigrams through to 5-grams, together with frequency counts compiled from a one-trillion-word corpus. It is distributed by the Linguistic Data Consortium for researchers.
Since Web 1T is very large, querying it can be time-consuming. Get 1T is a suite of tools designed to pre-process and query Web 1T (and potentially other large corpora in a similar format). Get 1T is available as Free Software under the GNU General Public Licence (GPL), version 2 or later.
What Get 1T can do
Get 1T currently allows a large set of pre-specified queries to be run on Web 1T in a single pass over the corpus. (The number of queries is limited by available RAM; for reference, tens of millions are easily possible on standard modern desktop hardware with 1GB of RAM or more.)
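As a rough illustration of the single-pass idea (not Get 1T's actual implementation), the sketch below holds all queries in memory and streams over the corpus once, recording the count of each n-gram that matches a query. It assumes each corpus line has the form "n-gram, tab, count"; the function name batch_lookup and the sample data are hypothetical.

```python
from typing import Dict, Iterable, Set


def batch_lookup(lines: Iterable[str], queries: Set[str]) -> Dict[str, int]:
    """Return a count for every query n-gram after one pass over `lines`.

    Assumes each line looks like "w1 w2 ... wn<TAB>count" (an assumption
    about the corpus format, not something this sketch verifies).
    """
    # Every query starts at zero, so n-grams absent from the corpus
    # are still reported.
    counts = {query: 0 for query in queries}
    for line in lines:
        ngram, sep, count = line.rstrip("\n").rpartition("\t")
        # Skip malformed lines (no tab) and n-grams nobody asked about.
        if sep and ngram in counts:
            counts[ngram] += int(count)
    return counts


# Hypothetical sample data standing in for one corpus file.
sample = [
    "the quick brown\t120\n",
    "quick brown fox\t95\n",
    "brown fox jumps\t60\n",
]
result = batch_lookup(sample, {"quick brown fox", "lazy dog sleeps"})
```

Memory use in this sketch grows with the number of queries rather than with the size of the corpus, which is why the query count is bounded by RAM. For real data, the same function could be fed an open file handle in place of the sample list.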
We will very soon add a second tool that builds a lossy (hash-based) compression of the corpus, small enough to fit in RAM on modern machines, so that you can get approximate frequencies for a given n-gram using on-the-fly queries.
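To make the idea of lossy, hash-based storage concrete, here is a minimal sketch of one simple scheme. It is not necessarily the scheme the planned tool will use. N-grams are hashed into a fixed number of counter slots and the n-gram text itself is discarded, so memory use is fixed in advance, but n-grams that collide share a slot and their estimates can be too high. The class name LossyCountTable and the choice of hash function are assumptions made for illustration.

```python
import hashlib
from array import array


class LossyCountTable:
    """Fixed-size table of counters indexed by a hash of the n-gram.

    Only counts are stored, never the n-grams, so colliding n-grams
    share a slot and estimates may overshoot the true frequency.
    """

    def __init__(self, slots: int) -> None:
        self.slots = slots
        # One unsigned 64-bit counter per slot, all starting at zero.
        self.table = array("Q", [0]) * slots

    def _slot(self, ngram: str) -> int:
        # Use a stable hash; Python's built-in hash() of strings
        # changes between interpreter runs.
        digest = hashlib.blake2b(ngram.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.slots

    def add(self, ngram: str, count: int) -> None:
        self.table[self._slot(ngram)] += count

    def estimate(self, ngram: str) -> int:
        # Never lower than the true count, but possibly higher.
        return self.table[self._slot(ngram)]


# With a single slot, every n-gram collides, which shows the lossiness.
tiny = LossyCountTable(1)
tiny.add("quick brown fox", 95)
tiny.add("brown fox jumps", 60)
overestimate = tiny.estimate("quick brown fox")  # 155, not 95
```

In this sketch, more slots mean fewer collisions and better estimates, at the cost of more RAM.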
Get 1T is not a system that allows on-the-fly queries of the full set of Web 1T n-grams. Its goal is to make Web 1T usable on relatively modest hardware (modern desktop machines), and full querying of the corpus requires considerably more resources than that. (See, for example, this message to corpora-list for a brief discussion.)
Please note: we strongly advise that you keep a note of which version of Get 1T you used to produce your results. This will help you and others reproduce your experiments in the event of a bug or other problem.
The most recent release is 0.3 (17th August 2009).
See our SourceForge download page for tarballs of our releases.
If you would like to be notified when there is a new release of Get 1T, please subscribe to our get1t-announce email announcement list.
Bugs and feature requests
Please submit these through the SourceForge bug tracker.
Get 1T is described in the following article:
Tobias Hawker, Mary Gardiner and Andrew Bennetts. 2007. Practical queries of a massive n-gram database. In Proceedings of the Australasian Language Technology Workshop 2007 (ALTW 2007), Melbourne, Australia, pages 40–48.