31 Drawing samples
Warning
This page is still under construction. More content will be added soon!
31.1 Preparation
Load libraries:
library(tidyverse)
library(quanteda)
library(sampling)
library(data.table)
Perform query:
# Load corpus
ICE_GB <- readRDS("ICE_GB.RDS")
# Perform query
kwic_think <- kwic(ICE_GB, "think")
# Count number of observations
nrow(kwic_think)
[1] 2648
# Show first few
head(kwic_think)
Keyword-in-context with 6 matches.
[ICE_GB/S1A-001.txt, 55]            1: B > I | think | the main things that I
[ICE_GB/S1A-001.txt, 218]           1: B > I | think | the m <, >
[ICE_GB/S1A-001.txt, 534]           1: B > I | think | that the <,,
[ICE_GB/S1A-001.txt, 588]  difference <, > I | think | the main difference that I
[ICE_GB/S1A-001.txt, 675]         <, > and I | think | one of the things that
[ICE_GB/S1A-001.txt, 1049]     B > Uhm and I | think | f for for myself <
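With 2648 hits, working through every concordance line by hand is rarely practical, so a natural first step is a plain random sample. The following is a minimal sketch in base R. The data frame `hits`, its columns, and the sample size are hypothetical stand-ins for a real concordance, not part of the original query.

```r
# Toy stand-in for a concordance: one row per hit (hypothetical data)
hits <- data.frame(
  docname = rep(c("S1A-001.txt", "S1A-002.txt", "W2B-010.txt"), times = c(5, 3, 2)),
  from    = c(55, 218, 534, 588, 675, 386, 649, 1578, 12, 840)
)

# Fix the random seed so the same rows are drawn on every run
set.seed(123)

# Draw 4 row indices without replacement, then subset the data frame
n_sample   <- 4
idx        <- sample(nrow(hits), size = n_sample)
srs_sample <- hits[sort(idx), ]

srs_sample
```

A simple random sample treats every hit alike, so categories with few hits may end up with very few rows or none at all.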
31.2 Stratified sample
# Source function from GitHub
source("https://raw.githubusercontent.com/VBuskin/Stats_with_R/refs/heads/main/Custom_functions.R")
# Apply function to the output of kwic() to perform weighted sampling
stratified_sample_ICE(kwic_think, 500)
# A tibble: 501 × 8
   Text_category File_number  from    to pre               keyword post  pattern
   <chr>         <chr>       <int> <int> <chr>             <chr>   <chr> <fct>
 1 ICE_GB/S1A    001.txt       588   588 difference < , >… think   the … think
 2 ICE_GB/S1A    001.txt      1866  1866 : B > And I       think   that… think
 3 ICE_GB/S1A    001.txt      1974  1974 initial difficul… think   that… think
 4 ICE_GB/S1A    002.txt       386   386 lazy really for … think   < , … think
 5 ICE_GB/S1A    002.txt       649   649 : C > And I       think   when… think
 6 ICE_GB/S1A    002.txt      1578  1578 think there ' s I think   ther… think
 7 ICE_GB/S1A    002.txt      3621  3621 : B > Therefore I think   that… think
 8 ICE_GB/S1A    002.txt      3710  3710 : B > But I       think   < , … think
 9 ICE_GB/S1A    003.txt       427   427 , > why do you    think   phys… think
10 ICE_GB/S1A    003.txt      2613  2613 > Well I Well I   think   it '… think
# ℹ 491 more rows
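The general idea behind a stratified sample can be sketched in a few lines of base R. This is not the code of `stratified_sample_ICE()`, whose internals are not shown here. It is an illustration under stated assumptions: the toy data frame `hits`, proportional allocation, and the use of the part of the document name before the hyphen as the stratum are all assumptions made for this example.

```r
# Toy stand-in for a concordance (hypothetical data): 100 hits in three text categories
hits <- data.frame(
  docname = c(sprintf("S1A-%03d.txt", 1:60),
              sprintf("S2A-%03d.txt", 1:30),
              sprintf("W2B-%03d.txt", 1:10)),
  stringsAsFactors = FALSE
)

# Derive the stratum (text category) by dropping everything from the hyphen onwards
hits$Text_category <- sub("-.*$", "", hits$docname)

# Proportional allocation (an assumption): each category's share of the
# sample mirrors its share of all hits
n_total    <- 20
tab        <- table(hits$Text_category)
allocation <- round(n_total * as.vector(tab) / nrow(hits))
names(allocation) <- names(tab)

# Draw the allocated number of rows within each category, without replacement
set.seed(123)
rows_by_stratum <- split(seq_len(nrow(hits)), hits$Text_category)
picked <- unlist(lapply(names(rows_by_stratum), function(stratum) {
  rows <- rows_by_stratum[[stratum]]
  rows[sample.int(length(rows), size = allocation[[stratum]])]
}))
strat_sample <- hits[sort(picked), ]

# Number of sampled hits per category
table(strat_sample$Text_category)
```

Because each stratum's allocation is rounded separately, the total number of rows drawn this way can differ slightly from the requested size when the shares do not divide evenly.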