Title: Practical Queries of a Massive n-gram Database

Authors: Tobias Hawker (School of Information Technologies, University of Sydney), Mary Gardiner (Centre for Language Technology, Macquarie University), Andrew Bennetts (Canonical Ltd.)

Abstract: Large quantities of data are an increasingly essential resource for many Natural Language Processing techniques. The Web 1T corpus, a massive resource containing n-gram frequencies derived from one trillion words of World Wide Web text, is a relatively new corpus whose scale is expected to improve performance on many data-hungry applications. In addition, a fixed resource of this kind reduces reliance on live web results as experimental data, increasing the replicability of researchers' results. However, effectively utilising a resource of this size presents significant challenges. We discuss the challenges of using a data source of this magnitude and describe strategies for overcoming them, including efficient extraction of queries containing wildcards and specialised data compression. We present a software suite, "Get 1T", implementing these techniques, released as free software for use by the natural language research community and others.

PDF: http://get1t.sourceforge.net/publications/hawker-alta2007-get1t.pdf

Bibtex: http://get1t.sourceforge.net/publications/hawker-alta2007-get1t.bib
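The abstract mentions efficient extraction of wildcard queries from n-gram frequency data. The sketch below is purely illustrative (it is not the Get 1T implementation, whose internals are not described here): it matches a batch of n-gram queries, some containing a "*" wildcard token, against tab-separated "ngram<TAB>count" lines of the kind found in the Web 1T corpus, accumulating counts in a single sequential pass. The `scan` and `matches` helpers and the sample data are invented for this example.

```python
def matches(query_tokens, ngram_tokens):
    """True if the query (with '*' wildcard tokens) matches the n-gram."""
    return len(query_tokens) == len(ngram_tokens) and all(
        q == "*" or q == t for q, t in zip(query_tokens, ngram_tokens)
    )

def scan(ngram_lines, queries):
    """Single pass over tab-separated '<ngram>\t<count>' lines,
    summing the counts of n-grams matched by each query."""
    parsed = [q.split() for q in queries]
    totals = {q: 0 for q in queries}
    for line in ngram_lines:
        ngram, count = line.rsplit("\t", 1)
        tokens = ngram.split()
        for q, qt in zip(queries, parsed):
            if matches(qt, tokens):
                totals[q] += int(count)
    return totals

# Tiny in-memory stand-in for a sorted n-gram frequency file.
sample = [
    "the big dog\t120",
    "the big cat\t95",
    "the small dog\t40",
]
print(scan(sample, ["the * dog", "the big cat"]))
# {'the * dog': 160, 'the big cat': 95}
```

Processing all queries in one pass over the data is the key point: the corpus files are far too large to load or index in memory, so each file should be read at most once regardless of how many queries are pending.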