Statistics for Corpus Linguists

3.3 The CQP interface

Author
Affiliation

Vladimir Buskin

Catholic University of Eichstätt-Ingolstadt

Abstract
This handout introduces regular expressions for advanced corpus queries in R and on corpus platforms, showing how to construct, refine, and apply search patterns to linguistic data.

Some corpus platforms such as BNCweb, CQPweb or our local KU corpora [1] support a specialised query syntax known as CQP (Corpus Query Processor), which enables users to query for strings with specific meta-attributes (text category, age, gender, etc.).

  [1] Note that the corpus platform of the Catholic University of Eichstätt-Ingolstadt, which was set up by Dr. Thomas Brunner, is only accessible to students and staff via the local eduroam network or a VPN client that holds their credentials.

    Basic use

    Once signed in on any of these platforms, the user interface will generally follow this layout:

    [Figure: CQP query of PROVIDE in the GloWbE corpus]

    • To select a different corpus, navigate to the CQPweb main menu under About CQPweb in the left menu bar.

    • To restrict search results to specific parts of the corpus, select Restricted query and choose the relevant text categories, varieties, etc.

    You can now enter a search expression into the white box using the following pattern:

    [attribute = "property"]

    For example, to retrieve all inflectional forms [2] of the verb provide, enter:

    [lemma = "provide"]

  [2] Some corpora do not support the lemma attribute and require manual listing of all inflectional forms via regular expressions, such as [word = "provid(e(s|d)?|ing)"] for provide.
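A pattern that lists the inflectional forms of provide by hand can be sanity-checked in R before it is run against a corpus. The following is a minimal sketch using base R's grepl(). The vector of candidate forms is invented for illustration, and the anchors ^ and $ are added because CQP matches a pattern against the whole token, whereas grepl() also accepts partial matches.

```r
# Candidate word forms (invented for illustration)
forms <- c("provide", "provides", "provided", "providing",
           "provider", "providd")

# CQP matches a pattern against the whole token; grepl() does not,
# so the anchors ^ and $ are added here to mimic that behaviour
pattern <- "^provid(e(s|d)?|ing)$"

# TRUE for every form the pattern accepts
matches <- grepl(pattern, forms)
names(matches) <- forms
print(matches)

# Keep only the accepted forms
hits <- forms[matches]
print(hits)
```

Only the four verb forms should be accepted. The noun provider and the misspelling providd are rejected, which is a quick way to spot a pattern that is too loose or too strict.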

    To search for a word with a specific part-of-speech (POS) tag – such as like used as a preposition – enter:

    [word = "like" & pos = "ii"]

    For a full list of POS-tags, refer to:

    • CLAWS5 (for BNC)
    • CLAWS7 (for COCA and GloWbE)
    Note

    If the exact word string is not important and you are only interested in part-of-speech sequences, you can supply the regular expression .+ to the word attribute, which matches any non-empty string. For example, to find all common nouns tagged nn that are followed by the preposition of (tagged io in CLAWS7), we can write:

    [word = ".+" & pos = "nn"][word = ".+" & pos = "io"]

    Regular expressions also allow you to match multiple POS-tags, such as all noun categories (nn, nn1, nn2, nna, etc.) or all prepositions (if, ii, io, iw). The search term below generalises the previous example, making it more inclusive:

    [word = ".+" & pos = "n.+"][word = ".+" & pos = "i.+"]

    However, be cautious with overly broad search patterns, as they might return (nearly) the entire corpus! The last one yields 22,294,282 matches.
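To see how far a tag pattern such as n.+ or i.+ reaches, it can help to test it against a list of tags in R first. The sketch below uses a small, hand-picked subset of lower-case CLAWS7 tags (not the full tagset) and again adds ^ and $ to imitate CQP's whole-token matching.

```r
# A handful of CLAWS7 tags in lower case (illustrative subset only)
tags <- c("nn", "nn1", "nn2", "nna", "np1",
          "if", "ii", "io", "iw", "vv0", "jj")

# Tags matched by the noun pattern and by the preposition pattern
noun_tags <- tags[grepl("^n.+$", tags)]
prep_tags <- tags[grepl("^i.+$", tags)]

print(noun_tags)
print(prep_tags)
```

In this subset, n.+ also picks up the proper-noun tag np1, so the pattern is broader than "common nouns". A narrower alternative would be nn.*, which stays within the tags that begin with nn.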

    Exporting and importing your results

    To export your KWIC-hits, select Download in the top-right corner and specify your output options (e.g., the size of the search window or the speaker metadata). While you can immediately read the downloaded concordance.txt file into R, it’s advisable to first perform some manual clean-up in spreadsheet software (e.g., MS Excel). This ensures all columns and rows are complete and properly formatted.

    To do this in Excel:

    1. Navigate to File > Import > Text file.

    2. Select your file and choose Delimited.

    3. Click Next.

    4. Select Tab as the delimiter.

    5. Click Next > Finish.

    Make sure to save your file, ideally with the extension .xlsx.

    From here, please refer to the unit Import/export data for further steps.
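If you prefer to inspect the tab-delimited download directly in R, the check for incomplete rows can be approximated with base R. This is a minimal sketch under stated assumptions: the file written below is a tiny mock export, and its column names and contents are invented. The real concordance.txt may be laid out differently depending on your download options.

```r
# Write a tiny mock export to a temporary file (stand-in for concordance.txt;
# column names and contents are invented)
path <- tempfile(fileext = ".txt")
writeLines(c(
  "text_id\tleft_context\tmatch\tright_context",
  "A01\tthey will\tprovide\tthe data",
  "B07\tit\tprovides\ta good overview",
  "C12\twho\tprovided"            # incomplete row: last field missing
), path)

# Count tab-separated fields per line; quote = "" and comment.char = ""
# keep apostrophes and # signs in the text from being misread
n_fields <- count.fields(path, sep = "\t", quote = "", comment.char = "")

# Line numbers (header = line 1) whose field count differs from the header
incomplete <- which(n_fields != n_fields[1])
print(incomplete)

# Read the file; fill = TRUE pads short rows with blank fields
kwic <- read.delim(path, sep = "\t", quote = "", comment.char = "",
                   stringsAsFactors = FALSE, fill = TRUE)
str(kwic)
```

Any line numbers reported in incomplete point to rows worth checking by hand before further analysis.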