About CLEF

CLEF (Catalogue of Lexico-grammatical English Features) is a unified, structured, machine-readable catalogue of linguistic features for multi-dimensional register analysis, corpus stylistics, authorship attribution, and any research that relies on counting lexico-grammatical features in English text.

What is CLEF?

CLEF synthesizes feature sets from multiple foundational sources in register analysis and corpus linguistics. Each feature is defined in a YAML file with a human-readable definition, machine-readable detection rules, normalization metadata, source provenance, and annotated examples.

The catalogue is implementation-agnostic: it defines what each feature is and how to detect it declaratively. The forthcoming companion library pyclef will provide a Python engine that interprets the catalogue’s rules against text.

Why CLEF?

Multi-dimensional analysis (Biber, 1988) is the dominant framework for studying register variation. But the field has no single, unified catalogue of features. Different tools implement different subsets with different operationalizations.

Until now, feature definitions have lived in exactly two places: lists and tables in books and papers (human-readable but not machine-executable) or hard-coded in software (machine-executable but not human-inspectable without programming). CLEF brings these together in a single structured format.

Design Principles

Separation of spec and implementation
The catalogue defines what each feature is and how to detect it declaratively. The forthcoming companion library pyclef is one engine that interprets these rules — but the catalogue stands on its own. A linguist who has never written a line of Python can inspect, critique, and propose changes to a feature definition.
Provenance and cross-tool comparison
Every detection rule is tagged with its source. When sources use different word lists for the “same” feature, both are preserved as separate rules. This makes disagreements visible for the first time. A single feature may have hand-curated word lists from one source and automatically extracted lists from another (e.g. the USAS lexicon), each independently attributed.
Maximum granularity
Where sources merge features that could be separated, CLEF preserves the finer distinctions. Composite features with children allow aggregation when needed, but the atomics are always available.
Tiered detection methods
Not all features can be detected the same way. CLEF’s requires field makes this explicit — from simple tokenization through POS tagging, dependency parsing, semantic tagging, LLM-based detection, and fully manual human annotation. This acknowledges the reality that some features are ambiguous or resist automation and require human judgment. Rather than excluding these features, CLEF gives them first-class status alongside automated ones.
Tagger-agnostic
Detection rules are expressed in standard pattern languages (CQL for sequential patterns, Semgrex for dependency trees) rather than tied to a specific NLP toolkit. Today the engine uses spaCy; tomorrow it could be Stanza, a fine-tuned BERT, or a language model. The catalogue doesn’t change.
LLM and human annotation as first-class methods
The detection method taxonomy includes llm and human alongside automated methods. As LLMs improve at linguistic annotation tasks, features that once required expert human judgment can gain LLM-based detection rules — and the catalogue tracks both methods.
Extensibility without code
Adding a new feature means writing one YAML file. It automatically becomes available to any engine that reads the catalogue. A researcher who discovers a new feature or proposes a better word list can contribute without forking a codebase.
Annotated examples as test suite
Positive examples (with _match_ markup) and non-examples (with ~false positive~ markup) serve as both documentation and automated tests. They make edge cases and disambiguation needs visible, and they grow over time as the community encounters new cases.

Detection Rules

The catalogue contains 654 detection rules in total. Many features have multiple rules — from different sources, at different accuracy levels, or using different detection methods.

For example, a semantic feature defined by Xiao (2009) via a USAS semantic tag may also have a lexical list fallback derived from the UCREL USAS English lexicon (54,798 entries). This means the feature can be approximately detected using word lists even without a full semantic tagger. When multiple detection rules exist, they are tagged with their source so the engine (or researcher) can choose which to apply.
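One way an engine might choose among such rules is to rank them by a caller-supplied source preference and fall back to the first rule whose requirements the available pipeline can satisfy. A sketch under stated assumptions: `pick_rule` is hypothetical, and "usas" as a `requires` value stands in for a semantic-tagging tier.

```python
# Hypothetical rule selection: prefer sources in a caller-chosen order,
# skipping rules whose `requires` the current pipeline cannot satisfy.
def pick_rule(rules, available, source_order):
    ranked = sorted(rules, key=lambda r: source_order.index(r["source"]))
    for rule in ranked:
        if set(rule["requires"]) <= set(available):
            return rule
    return None

rules = [
    {"source": "xiao_2009", "requires": ["usas"], "method": "semantic_tag"},
    {"source": "usas",      "requires": ["word"], "method": "word_list"},
]
# No semantic tagger available: fall back to the lexicon-derived word list
chosen = pick_rule(rules, available=["word", "pos"],
                   source_order=["xiao_2009", "usas"])
print(chosen["method"])  # word_list
```

With a semantic tagger in the pipeline (`available=["word", "usas"]`), the same call would return the Xiao (2009) rule instead.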

Pattern Type   Rules   Description
cql              310   Corpus Query Language — sequential token matching by POS, lemma, word, dep
other            184   Semantic tagger, LLM, or human annotation (no automated pattern)
regex             87   Regular expression — surface-level character matching
parts             40   Multi-pattern rules with named parts and boolean combine expressions
semgrex           24   Stanford Semgrex — dependency tree pattern matching
computed           9   Computed from child features (composites) or cross-feature references

Pattern Languages

The catalogue uses three pattern languages, matching different kinds of detection:

Regex (Regular expressions)
For surface-level patterns — matching character sequences in word forms without requiring any NLP pipeline. Used especially for derivational morphology features (prefixes and suffixes) where the orthographic form is the signal.
\w+tion$                                   words ending in -tion
^un\w+                                     words starting with un-
\w+'(t|ll|re|ve|s|d)                       contractions
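These surface patterns need nothing beyond tokenization. As a rough illustration of how an engine might apply them (the `regex_hits` helper is hypothetical, not part of pyclef):

```python
import re

# Hypothetical helper: apply a surface regex to each token's word form.
# No NLP pipeline is involved; only the orthographic string is inspected.
def regex_hits(pattern: str, tokens: list[str]) -> list[str]:
    rx = re.compile(pattern)
    return [t for t in tokens if rx.search(t)]

tokens = "the nation 's modernization was n't uncontroversial".split()
print(regex_hits(r"\w+tion$", tokens))  # ['nation', 'modernization']
print(regex_hits(r"^un\w+", tokens))    # ['uncontroversial']
```

Note that the contraction pattern is sensitive to tokenization: many tokenizers split "n't" and "'ll" into separate tokens, so whether it fires depends on the pipeline's tokenization choices.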
CQL (Corpus Query Language)
For sequential patterns — matching tokens by POS tag, UPOS, lemma, word form, or dependency relation. The dominant query language in corpus linguistics (CQP/CWB, Sketch Engine, NoSketch Engine).
[pos="JJ"] [pos="NN|NNS"]                  two-token sequence
[lemma="be" & pos="VBZ"]                   lemma AND POS constraint
[word={words} & pos="RB"]                  word list with placeholder
[pos="IN" & dep="prep"]                    POS + dependency relation
[upos="VERB" & pos!="VBG"]                 coarse UPOS + fine-grained exclusion
[dep="aux.*"] [upos="ADV"] [upos="VERB"]   regex in dep, UPOS constraints
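A CQL constraint sequence can be interpreted as anchored regex tests over token attributes. A minimal sketch, not the pyclef implementation — attribute values are treated as full-match regexes, following CQP semantics, and the attribute names mirror the catalogue's `requires` vocabulary:

```python
import re

# One token = one dict of attributes; one constraint = one dict of
# anchored regexes that must all full-match the token's attributes.
def token_matches(constraint: dict, token: dict) -> bool:
    return all(re.fullmatch(v, token.get(k, "")) for k, v in constraint.items())

def cql_find(pattern: list[dict], tokens: list[dict]) -> list[int]:
    """Return start indices where the constraint sequence matches."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1)
            if all(token_matches(pattern[j], tokens[i + j]) for j in range(n))]

tokens = [
    {"word": "a",     "pos": "DT"},
    {"word": "quick", "pos": "JJ"},
    {"word": "fixes", "pos": "NNS"},
]
# [pos="JJ"] [pos="NN|NNS"]
print(cql_find([{"pos": "JJ"}, {"pos": "NN|NNS"}], tokens))  # [1]
```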
Semgrex (Stanford)
For structural patterns — matching dependency tree configurations. The de facto standard for dependency patterns.
{pos:/VB.*/}=verb >auxpass {lemma:be}      passive voice
{pos:/NN.*/}=noun >amod {}=adj             attributive adjective
A >rel B                                   A is head of B via relation
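The passive-voice pattern above can be checked against any dependency representation that records each token's head and relation label. A toy sketch with a hand-built parse (`find_passives` and the label inventory are illustrative only):

```python
import re

# Sketch of the passive-voice Semgrex idea: find a VB.* governor that
# has an auxpass child whose lemma is "be".
def find_passives(tokens):
    hits = []
    for child in tokens:
        head = tokens[child["head"]]
        if (child["dep"] == "auxpass" and child["lemma"] == "be"
                and re.fullmatch(r"VB.*", head["pos"])):
            hits.append(head["word"])
    return hits

# "the law was changed" — "was" is an auxpass child of "changed"
tokens = [
    {"word": "the",     "lemma": "the",    "pos": "DT",  "dep": "det",       "head": 1},
    {"word": "law",     "lemma": "law",    "pos": "NN",  "dep": "nsubjpass", "head": 3},
    {"word": "was",     "lemma": "be",     "pos": "VBD", "dep": "auxpass",   "head": 3},
    {"word": "changed", "lemma": "change", "pos": "VBN", "dep": "root",      "head": 3},
]
print(find_passives(tokens))  # ['changed']
```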

Rule Fields

Each detection rule is a YAML mapping with the following fields:

Field           Description
source          Provenance key (e.g. pybiber, mfte). Rules without a source are universal.
requires        List of token attributes needed: [word, pos, dep]
cql             CQL pattern — sequential token matching by POS, lemma, word, dep, etc.
semgrex         Semgrex pattern — dependency tree matching
regex           Regular expression — surface-level character matching
words           Word list (flat list or named dict of lists)
parts           Named sub-patterns — a dict of parts, each with cql:, semgrex:, or word_list:
combine         Boolean expression composing parts and feature codes: p1 | p2, _ & !FEAT, _ & -FEAT
anchor          Which token in a multi-token pattern to report as the hit: first (default) or last
sentence_scope  If true, pattern matching is constrained to within sentence boundaries
refines         POS tag(s) this rule reclassifies (e.g. RB). See Tag Refinement below.
s_attrs         Sentence-level attribute specs — mapping of names to CQL patterns. Enables [sent_has_q="true"] style constraints.
description     Human-readable explanation of the rule
caveat          Known limitation or accuracy warning
usas_code       USAS semantic tag code (for semantic features)
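A rule combining several of these fields might look like the following — a made-up entry for illustration, not copied from the catalogue:

```yaml
# Hypothetical detection rule (illustrative values throughout)
detection:
  - source: mfte
    requires: [word, pos, dep]
    cql: '[pos="IN" & dep="prep"]'
    anchor: first
    sentence_scope: true
    description: Prepositions attached as prepositional modifiers
    caveat: Taggers disagree on particle vs. preposition for some verbs
```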

Combine Expressions

The combine field uses boolean expressions to compose parts and exclude other features. Two NOT operators handle different kinds of feature overlap:

! exclude (index overlap)
Removes hits whose token index also appears in the excluded feature. Use when both features match the same token — e.g. _ & !PASSBY on PASS.
- subtract (count)
Subtracts the total count of another feature from this feature's count. Use when features overlap conceptually but match different tokens — e.g. _ & -WHREL_SUBJ & -WHREL_OBJ on WHSC.
| union (parts)
Combines hits from multiple named parts — e.g. p1 | p2 | p3 for multi-pattern rules.
_ self
Refers to the rule's own bare pattern. Used when the rule has a cql: or semgrex: at rule level alongside a combine expression — e.g. _ & !NOMZ & !GER.
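The three operators reduce to set and count arithmetic over hit indices. A sketch of the semantics — `apply_combine` and the feature names are illustrative, not the pyclef API:

```python
# Sketch of combine semantics: hits are token indices. `!FEAT` removes
# overlapping indices, `-FEAT` subtracts another feature's count, and
# `|` unions named parts (represented here as a pre-unioned set).
def apply_combine(own_hits, exclude=(), subtract=()):
    """own_hits: set of token indices for `_` (or the union of parts)."""
    hits = set(own_hits)
    for feat_hits in exclude:          # ! — index-overlap exclusion
        hits -= set(feat_hits)
    count = len(hits)
    for feat_hits in subtract:         # - — count subtraction
        count -= len(feat_hits)
    return hits, count

# _ & !PASSBY on PASS: agentless passives = all passives minus by-passives
pass_hits, passby_hits = {3, 9, 14}, {9}
hits, count = apply_combine(pass_hits, exclude=[passby_hits])
print(hits, count)  # {3, 14} 2
```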

Sentence-Level Attributes (s-attrs)

Some features require checking a property of the sentence a token belongs to, not just the token's immediate context. For example, WH-questions are identified partly by whether the sentence contains a question mark or an auxiliary verb — properties that may be many tokens away from the WH-word itself.

CLEF handles this with s-attributes: pre-computed sentence-level boolean properties attached to every token in the sentence. A detection rule declares the properties it needs as CQL patterns; the engine evaluates each pattern per sentence and sets the named attribute to "true" on every token in matching sentences. CQL constraints then reference these attributes like any other token property.

# Rule declaration
s_attrs:
  sent_has_q: '[word="?"]'           # sentence contains "?"
  sent_has_aux: '[upos="AUX"]'       # sentence contains an auxiliary

# CQL pattern referencing s-attributes
[pos="W.*" & sent_has_q="true"]     WH-word in a question sentence

This mechanism is general: any sentence-level property expressible as a CQL pattern can be declared and queried, without hard-coding attribute names in the engine.
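A sketch of that evaluation order, with a Python predicate standing in for the declared CQL pattern:

```python
# Sketch: evaluate a sentence-level property once per sentence and copy
# the result onto every token, so ordinary token constraints can use it.
def attach_s_attr(sentences, name, predicate):
    for sent in sentences:
        value = "true" if any(predicate(tok) for tok in sent) else "false"
        for tok in sent:
            tok[name] = value

sentences = [
    [{"word": "Why", "pos": "WRB"}, {"word": "not", "pos": "RB"}, {"word": "?", "pos": "."}],
    [{"word": "Why", "pos": "WRB"}, {"word": "matters", "pos": "VBZ"}, {"word": ".", "pos": "."}],
]
attach_s_attr(sentences, "sent_has_q", lambda t: t["word"] == "?")

# [pos="W.*" & sent_has_q="true"] — only the "Why" in the question qualifies
hits = [t["word"] for sent in sentences for t in sent
        if t["pos"].startswith("W") and t["sent_has_q"] == "true"]
print(hits)  # ['Why']
```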

Feature Format

Each feature is a YAML file in canonical key order:

Field           Description
code            Unique uppercase identifier (e.g. VBD)
biber_number    Biber's 1988 feature number (1–67)
xiao_number     Xiao's 2009 feature code (e.g. B24)
name            Human-readable name
definition      Self-contained description of the feature
normalization   Rate denominator: words, finite_verbs, nouns, etc.
parent          Code of composite parent (for atomic features)
children        List of atomic child codes (for composite features)
detection       One or more detection rules (see Rule Fields above)
examples        Positive test examples with _match_ markup
non_examples    Negative examples with ~false positive~ markup
sources         Provenance keys referencing sources.yaml
notes           Dimension loadings, connections, observations
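A compact, made-up feature file following this schema (the word list, examples, and rule are illustrative, not catalogue content):

```yaml
# Hypothetical feature file in canonical key order
code: AMP
name: Amplifiers
definition: Adverbs that scale a quality upward (e.g. "very", "extremely").
normalization: words
detection:
  - source: biber_1988
    requires: [word, pos]
    words: [absolutely, completely, entirely, extremely, totally, very]
    cql: '[word={words} & pos="RB"]'
examples:
  - "It was _extremely_ cold."
non_examples:
  - "The ~very~ idea of it."   # adjectival "very"; the pos constraint filters it
sources: [biber_1988]
```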

Composites and Atomics

When a source (e.g. MFTE) splits a Biber feature, the catalogue preserves both levels. A composite feature has children listing its atomic parts; each child has a parent back-link. For example, Biber’s “first person pronouns” (6) becomes composite FPP with children FPP1S and FPP1P.
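Aggregation over composites is then a sum over children. A sketch assuming a feature registry keyed by code (the registry shape is illustrative, not the catalogue's internal representation):

```python
# Sketch: a composite's count is the sum of its children's counts, so
# either level of granularity can be reported.
FEATURES = {
    "FPP":   {"children": ["FPP1S", "FPP1P"]},
    "FPP1S": {"parent": "FPP"},
    "FPP1P": {"parent": "FPP"},
}

def composite_count(code, counts, features=FEATURES):
    children = features[code].get("children")
    if not children:                      # atomic: counted directly
        return counts[code]
    return sum(composite_count(c, counts, features) for c in children)

counts = {"FPP1S": 7, "FPP1P": 2}        # e.g. singular vs. plural hits
print(composite_count("FPP", counts))    # 9
```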

Tag Refinement

Some detection rules reclassify tokens that a POS tagger has assigned to a broad category. For example, a detection rule for FREQ (frequency adverbs) declares refines: RB: tokens it matches are functionally "frequency adverbs," not generic adverbs. The residual RB count should only include adverbs that no specific detection rule has claimed.

CLEF handles this with two virtual layers:

pos — the POS tagger's output (immutable)
Always reflects what the tagger assigned. A query like [pos="RB"] matches all RB tokens, including those later refined.
cat — functional category (refined)
Defaults to pos but is overwritten when a detection rule with refines: RB matches. A query like [cat="RB"] matches only unrefined RB tokens — those not claimed by any specific detection rule.

Detection rules declare refinement with the refines field (e.g. refines: RB or refines: RB|NN for multiple POS tags). This makes explicit a mechanism that in other tools is implicit in processing order — see the design principle of separation of spec and implementation above.
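A sketch of the two layers, with an illustrative frequency-adverb word list standing in for the FREQ rule's actual pattern:

```python
# Sketch: `pos` is frozen tagger output; `cat` starts as a copy and is
# overwritten when a refining rule claims a token.
tokens = [
    {"word": "usually", "pos": "RB"},
    {"word": "here",    "pos": "RB"},
    {"word": "runs",    "pos": "VBZ"},
]
for t in tokens:
    t["cat"] = t["pos"]                      # cat defaults to pos

FREQ_WORDS = {"usually", "often", "always"}  # illustrative word list
for t in tokens:
    if t["pos"] == "RB" and t["word"] in FREQ_WORDS:
        t["cat"] = "FREQ"                    # refines: RB

all_rb      = [t["word"] for t in tokens if t["pos"] == "RB"]  # [pos="RB"]
residual_rb = [t["word"] for t in tokens if t["cat"] == "RB"]  # [cat="RB"]
print(all_rb, residual_rb)  # ['usually', 'here'] ['here']
```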

Sources

Key            Author            Year    Title
biber_1988 Biber, Douglas 1988 Variation across Speech and Writing
xiao_2009 Xiao, Richard 2009 Multidimensional analysis and the study of world Englishes
bohmann_2019 Bohmann, Axel 2019 Variation in English Worldwide: Varieties and Genres in a Quantitative Perspective
biber_2006 Biber, Douglas 2006 University Language — A Corpus-based Study of Spoken and Written Registers
pybiber Brown, David West & Reinhart, Alex 2026 pybiber — Python package for linguistic feature extraction and Multi-Dimensional Analysis
mfte Le Foll, Elen & Shakir, Muhammad 2023/2025 Multi-Feature Tagger of English (MFTE) — Python version
grieve_2016 Grieve, Jack 2016 Regional Variation in Written American English
grieve_2023 Grieve, Jack 2023 Register variation explains stylometric authorship analysis
leech_2009 Leech, Geoffrey; Hundt, Marianne; Mair, Christian; Smith, Nicholas 2009 Change in Contemporary English — A Grammatical Study
quirk_1985 Quirk, Randolph; Greenbaum, Sidney; Leech, Geoffrey; Svartvik, Jan 1985 A Comprehensive Grammar of the English Language
tottie_2009 Tottie, Gunnel 2009 How different are American and British English grammar? And how are they different?
collins_2015 Collins, Peter 2015 Grammatical Variation in English Worldwide
le_foll_2024 Le Foll, Elen 2024 Textbook English
baayen_1994 Baayen, R. Harald 1994 Derivational productivity and text typology
biermeier_2008 Biermeier, Thomas 2008 Word-formation in New Englishes
rohdenburg_schlueter_2009 Rohdenburg, Günter and Julia Schlüter 2009 One Language, Two Grammars?
usas Rayson, Paul; Archer, Dawn; Piao, Scott; McEnery, Tony 2004 The UCREL Semantic Analysis System

Normalization

Each feature specifies a normalization denominator — the base used when computing rates. Not every feature is normalized per total words: verb-related features, for example, use the finite verb count as their base.

Value          Meaning
words          Per total word tokens
finite_verbs   Per finite verb tokens
nouns          Per noun tokens
clauses        Per clause count
sentences      Per sentence count
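Computing a rate is then a lookup plus a division. The per-1,000 scaling below is a common reporting convention in the MD literature, not something the catalogue mandates, and `normalized_rate` is a hypothetical helper:

```python
# Sketch: the catalogue fixes only the denominator; the caller chooses
# the scaling (here, rate per 1,000 units of the denominator).
def normalized_rate(count, denominators, feature_norm, per=1000):
    return count / denominators[feature_norm] * per

denominators = {"words": 2000, "finite_verbs": 250, "nouns": 520}
print(normalized_rate(14, denominators, "words"))         # 7.0 per 1,000 words
print(normalized_rate(14, denominators, "finite_verbs"))  # 56.0 per 1,000 finite verbs
```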