Title: Practical Queries of a Massive n-gram Database

Authors: Tobias Hawker (School of Information Technologies, University of Sydney), Mary Gardiner (Centre for Language Technology, Macquarie University), Andrew Bennetts (Canonical Ltd.)

Abstract: Large quantities of data are an increasingly essential resource for many Natural Language Processing techniques. The Web 1T corpus, a massive resource containing n-gram frequencies derived from one trillion words of World Wide Web text, is a relatively new corpus whose scale is expected to improve performance on many data-hungry applications. In addition, a fixed resource of this kind reduces reliance on live web results as experimental data, increasing the replicability of researchers' results. However, effectively utilising a resource of this size presents significant challenges. We discuss the challenges of using a data source of this magnitude and describe strategies for overcoming them, including efficient extraction of queries containing wildcards and specialised data compression. We present a software suite, "Get 1T", implementing these techniques, released as free software for use by the natural language research community and others.

PDF: http://get1t.sourceforge.net/publications/hawker-alta2007-get1t.pdf

Bibtex: http://get1t.sourceforge.net/publications/hawker-alta2007-get1t.bib
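The abstract mentions efficient extraction of wildcard queries from n-gram frequency data. The sketch below is purely illustrative (it is not the Get 1T implementation, whose internals are not described here): it matches a batch of n-gram queries, some containing a "*" wildcard token, against tab-separated "ngram<TAB>count" lines of the kind found in the Web 1T corpus, accumulating counts in a single sequential pass. The `scan` and `matches` helpers and the sample data are invented for this example.

```python
def matches(query_tokens, ngram_tokens):
    """True if the query (with '*' wildcard tokens) matches the n-gram."""
    return len(query_tokens) == len(ngram_tokens) and all(
        q == "*" or q == t for q, t in zip(query_tokens, ngram_tokens)
    )

def scan(ngram_lines, queries):
    """Single pass over tab-separated '<ngram>\t<count>' lines,
    summing the counts of n-grams matched by each query."""
    parsed = [q.split() for q in queries]
    totals = {q: 0 for q in queries}
    for line in ngram_lines:
        ngram, count = line.rsplit("\t", 1)
        tokens = ngram.split()
        for q, qt in zip(queries, parsed):
            if matches(qt, tokens):
                totals[q] += int(count)
    return totals

# Tiny in-memory stand-in for a sorted n-gram frequency file.
sample = [
    "the big dog\t120",
    "the big cat\t95",
    "the small dog\t40",
]
print(scan(sample, ["the * dog", "the big cat"]))
# {'the * dog': 160, 'the big cat': 95}
```

Processing all queries in one pass over the data is the key point: the corpus files are far too large to load or index in memory, so each file should be read at most once regardless of how many queries are pending.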