The WebCorp suite of tools comprises three versions of the software that build on one another to provide linguistic web analysis for research, teaching and beyond. The WebCorp tools have featured in over 1700 publications by researchers across disciplines and have users worldwide.
WebCorp Live (released in December 2008 after several years of prototyping as WebCorp) was designed to test the hypothesis that the web could complement static offline text collections by providing evidence of rare, new and changing language use. Previous linguistic research relied on searches using the web interfaces of commercial search engines such as Google, but this required researchers to visit each web page manually to observe the linguistic patterns in which their search terms occurred. WebCorp Live streamlines this approach by processing the results of commercial search engines, automatically accessing the web pages and producing examples of words and phrases with the level of detail required for linguistic study. With the ability to search in multiple languages, WebCorp Live has augmented language teaching and translation in over 180 countries.
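The final step described above, producing examples of a word in its surrounding context, can be illustrated with a minimal keyword-in-context (KWIC) sketch in Python. This is not WebCorp Live's actual implementation: the function name `kwic`, the `width` parameter and the bracketed output layout are assumptions made for illustration, and retrieving pages from a search engine is left out, so the input is plain text that has already been extracted from a web page.

```python
import re


def kwic(text, term, width=30):
    """Return one keyword-in-context line per whole-word match of `term`.

    Hypothetical helper for illustration only. Matching is
    case-insensitive, and `width` is the number of characters of
    context shown on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the matched terms line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


if __name__ == "__main__":
    sample = "The web is a corpus. Is the Web really a corpus?"
    for line in kwic(sample, "web", width=15):
        print(line)
```

Aligning the search term in a central column is the conventional concordance layout, because it lets recurring patterns to the left and right of the term be read at a glance.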
While WebCorp Live uses commercial search engines as gatekeepers to the web, the goal of its sister project, the WebCorp Linguist’s Search Engine (WebCorpLSE), was to build a bespoke large-scale collection of web texts, and thus enable advanced linguistic and statistical analysis of the kind only possible in datasets of known size and composition. We developed linguistically focused web processing, annotation and search tools and used these to build a large-scale representative sample of the web (a ‘miniweb’), capturing the distribution of document formats, subject domains and web-native text types. We also constructed specialist datasets of online news and blogs with their associated comments. WebCorpLSE was supported by EPSRC, HEFCE and AHRC grants.
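To show why a dataset of known size matters, the sketch below computes a normalised frequency (occurrences per million tokens), a statistic that is only meaningful when the total token count is known. It is an illustrative sketch under simple assumptions, not WebCorpLSE code: the `tokenize` and `per_million` helpers are hypothetical names, and the regular-expression tokeniser is a deliberately crude stand-in for real linguistic processing.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokeniser (an assumption): lower-cased runs of ASCII letters."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def per_million(corpus_texts, term):
    """Return occurrences of `term` per million tokens in the corpus.

    Hypothetical helper for illustration. The denominator is the total
    token count, which is why the corpus size has to be known.
    """
    tokens = [tok for text in corpus_texts for tok in tokenize(text)]
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return counts[term.lower()] * 1_000_000 / len(tokens)


if __name__ == "__main__":
    corpus = ["The cat sat.", "The dog barked."]
    print(per_million(corpus, "the"))
```

Normalising in this way allows frequencies to be compared across datasets of different sizes, which raw hit counts from a commercial search engine do not permit.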
The WebCorpLSE software was also used to introduce A-level English Language students to empirical text study through an AHRC Knowledge Transfer Fellowship. In recent years, the subject criteria have been tightened, requiring a more in-depth understanding of linguistic concepts and analytical techniques and placing increased emphasis on independent learning. At present, the corpus linguistic approach is rarely, if ever, employed at pre-university level. For the most part, A-level students neither understand nor have access to automated analysis tools beyond the spelling and grammar checker in Microsoft Word.
Our work provided students with access to a novel, state-of-the-art teaching and independent learning aid, and distilled a wealth of linguistic knowledge, gained from previous research projects in the field of Corpus Linguistics, into a form appropriate for A-level study. Through WebCorpLSE, A-level students learn to apply corpus linguistic techniques to their language studies and their independent research projects. WebCorpLSE also provides teachers and their students with a plentiful supply of authentic language data, relevant to all aspects of the A-level syllabus.
In 2020, we expanded the work on data-driven language learning, adapting the WebCorpLSE technology and creating WebCorp Learn, a version optimised for interactive English language learning by non-native speakers. WebCorp Learn is now integrated into courses in German secondary schools through collaboration with the Teaching Solutions language consultancy.