About CLEF
CLEF — Catalogue of Lexico-grammatical English Features — is a unified, structured, machine-readable catalogue of linguistic features for multi-dimensional register analysis, corpus stylistics, authorship attribution, and any research that relies on counting lexico-grammatical features in English text.
What is CLEF?
CLEF synthesizes feature sets from multiple foundational sources in register analysis and corpus linguistics. Each feature is defined in a YAML file with a human-readable definition, machine-readable detection rules, normalization metadata, source provenance, and annotated examples.
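Concretely, a feature file might look like the following sketch. The field names match the Feature Format section of this document, but the feature content, pattern, and example sentence here are purely illustrative:

```yaml
# Hypothetical feature file (illustrative content only)
code: VBD
name: Past tense verbs
definition: Verb tokens carrying simple past tense morphology.
normalization: words
detection:
  - source: biber_1988
    requires: [pos]
    cql: '[pos="VBD"]'
examples:
  - "She _walked_ home."
sources:
  - biber_1988
```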
The catalogue is implementation-agnostic: it defines what each feature is and how to detect it declaratively. The forthcoming companion library pyclef will provide a Python engine that interprets the catalogue’s rules against text.
Why CLEF?
Multi-dimensional analysis (Biber, 1988) is the dominant framework for studying register variation. But the field has no single, unified catalogue of features. Different tools implement different subsets with different operationalizations.
Until now, feature definitions have lived in exactly two places: lists and tables in books and papers (human-readable but not machine-executable) or hard-coded in software (machine-executable but not human-inspectable without programming). CLEF brings these together in a single structured format.
Design Principles
- Separation of spec and implementation
- The catalogue defines what each feature is and how to detect it declaratively. The forthcoming companion library pyclef is one engine that interprets these rules — but the catalogue stands on its own. A linguist who has never written a line of Python can inspect, critique, and propose changes to a feature definition.
- Provenance and cross-tool comparison
- Every detection rule is tagged with its source. When sources use different word lists for the “same” feature, both are preserved as separate rules. This makes disagreements visible for the first time. A single feature may have hand-curated word lists from one source and automatically extracted lists from another (e.g. the USAS lexicon), each independently attributed.
- Maximum granularity
- Where sources merge features that could be separated, CLEF preserves the finer distinctions. Composite features with `children` allow aggregation when needed, but the atomics are always available.
- Tiered detection methods
- Not all features can be detected the same way. CLEF's `requires` field makes this explicit: from simple tokenization through POS tagging, dependency parsing, semantic tagging, and LLM-based detection to fully manual human annotation. This acknowledges the reality that some features are ambiguous or resist automation and require human judgment. Rather than excluding these features, CLEF gives them first-class status alongside automated ones.
- Tagger-agnostic
- Detection rules are expressed in standard pattern languages (CQL for sequential patterns, Semgrex for dependency trees) rather than tied to a specific NLP toolkit. Today the engine uses spaCy; tomorrow it could be Stanza, a fine-tuned BERT, or a language model. The catalogue doesn’t change.
- LLM and human annotation as first-class methods
- The detection method taxonomy includes `llm` and `human` alongside automated methods. As LLMs improve at linguistic annotation tasks, features that once required expert human judgment can gain LLM-based detection rules, and the catalogue tracks both methods.
- Extensibility without code
- Adding a new feature means writing one YAML file. It automatically becomes available to any engine that reads the catalogue. A researcher who discovers a new feature or proposes a better word list can contribute without forking a codebase.
- Annotated examples as test suite
- Positive examples (with `_match_` markup) and non-examples (with `~false positive~` markup) serve as both documentation and automated tests. They make edge cases and disambiguation needs visible, and they grow over time as the community encounters new cases.
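As a sketch of the last principle, the two markup conventions might appear in a feature file like this (hypothetical feature and sentences, not taken from the catalogue):

```yaml
# Hypothetical examples block (illustrative sentences)
examples:
  - "She has _lived_ here for years."   # _match_ marks the token(s) the rule should hit
non_examples:
  - "The house had a ~lived~-in look."  # ~...~ marks a token the rule must NOT count
```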
Detection Rules
The catalogue contains 654 detection rules across features. Many features have multiple rules — from different sources, at different accuracy levels, or using different detection methods.
For example, a semantic feature defined by Xiao (2009) via a USAS
semantic tag may also have a lexical list fallback
derived from the
UCREL USAS English lexicon
(54,798 entries). This means the feature can be approximately detected
using word lists even without a full semantic tagger. When multiple
detection rules exist, they are tagged with their source
so the engine (or researcher) can choose which to apply.
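Such a pair of rules might be sketched like this; the USAS code, word list, and caveat are illustrative, not taken from the catalogue:

```yaml
# Hypothetical: two independently attributed rules for one feature
detection:
  - source: xiao_2009
    requires: [usas]
    usas_code: "T1.3"                  # detect via semantic tagger
  - source: usas
    requires: [word]
    words: [afterwards, later, soon]   # lexical fallback from the USAS lexicon
    caveat: Approximate; no sense disambiguation without a semantic tagger.
```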
| Pattern Type | Rules | Description |
|---|---|---|
| `cql` | 310 | Corpus Query Language: sequential token matching by POS, lemma, word, dep |
| `other` | 184 | Semantic tagger, LLM, or human annotation (no automated pattern) |
| `regex` | 87 | Regular expression: surface-level character matching |
| `parts` | 40 | Multi-pattern rules with named parts and boolean combine expressions |
| `semgrex` | 24 | Stanford Semgrex: dependency tree pattern matching |
| `computed` | 9 | Computed from child features (composites) or cross-feature references |
Pattern Languages
The catalogue uses three pattern languages, matching different kinds of detection:
- Regex (Regular expressions)
- For surface-level patterns: matching character sequences in word forms without requiring any NLP pipeline. Used especially for derivational morphology features (prefixes and suffixes) where the orthographic form is the signal.

  ```
  \w+tion$               words ending in -tion
  ^un\w+                 words starting with un-
  \w+'(t|ll|re|ve|s|d)   contractions
  ```

- CQL (Corpus Query Language)
- For sequential patterns: matching tokens by POS tag, UPOS, lemma, word form, or dependency relation. CQL is the dominant query language in corpus linguistics (CQP/CWB, Sketch Engine, NoSketch Engine).

  ```
  [pos="JJ"] [pos="NN|NNS"]                  two-token sequence
  [lemma="be" & pos="VBZ"]                   lemma AND POS constraint
  [word={words} & pos="RB"]                  word list with placeholder
  [pos="IN" & dep="prep"]                    POS + dependency relation
  [upos="VERB" & pos!="VBG"]                 coarse UPOS + fine-grained exclusion
  [dep="aux.*"] [upos="ADV"] [upos="VERB"]   regex in dep, UPOS constraints
  ```

- Semgrex (Stanford)
- For structural patterns: matching dependency tree configurations. The de facto standard for dependency patterns.

  ```
  {pos:/VB.*/}=verb >auxpass {lemma:be}   passive voice
  {pos:/NN.*/}=noun >amod {}=adj          attributive adjective
  A >rel B                                A is head of B via relation
  ```
Rule Fields
Each detection rule is a YAML mapping with the following fields:
| Field | Required | Description |
|---|---|---|
| `source` | | Provenance key (e.g. pybiber, mfte). Rules without a source are universal. |
| `requires` | | List of token attributes needed: `[word, pos, dep]` |
| `cql` | | CQL pattern: sequential token matching by POS, lemma, word, dep, etc. |
| `semgrex` | | Semgrex pattern: dependency tree matching |
| `regex` | | Regular expression: surface-level character matching |
| `words` | | Word list (flat list or named dict of lists) |
| `parts` | | Named sub-patterns: a dict of parts, each with `cql:`, `semgrex:`, or `word_list:` |
| `combine` | | Boolean expression composing parts and feature codes: `p1 \| p2`, `_ & !FEAT`, `_ & -FEAT` |
| `anchor` | | Which token in a multi-token pattern to report as the hit: `first` (default) or `last` |
| `sentence_scope` | | If true, pattern matching is constrained to within sentence boundaries |
| `refines` | | POS tag(s) this rule reclassifies (e.g. `RB`). See Tag Refinement below. |
| `s_attrs` | | Sentence-level attribute specs: mapping of names to CQL patterns. Enables `[sent_has_q="true"]`-style constraints. |
| `description` | | Human-readable explanation of the rule |
| `caveat` | | Known limitation or accuracy warning |
| `usas_code` | | USAS semantic tag code (for semantic features) |
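A single rule drawing on several of these fields might be sketched as follows (the pattern, source, and caveat are illustrative only):

```yaml
# Hypothetical detection rule using several optional fields
- source: mfte
  requires: [word, pos, dep]
  cql: '[pos="IN" & dep="prep"]'
  anchor: first          # report the preposition itself as the hit
  sentence_scope: true   # do not match across sentence boundaries
  description: Prepositions heading a prepositional phrase.
  caveat: May include particles mistagged as IN.
```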
Combine Expressions
The `combine` field uses boolean expressions to compose parts and exclude other features. Two NOT operators handle different kinds of feature overlap; a union operator and a self placeholder round out the syntax:
- `!` exclude (index overlap)
- Removes hits whose token index also appears in the excluded feature. Use when both features match the same token, e.g. `_ & !PASSBY` on PASS.
- `-` subtract (count)
- Subtracts the total count of another feature from this feature's count. Use when features overlap conceptually but match different tokens, e.g. `_ & -WHREL_SUBJ & -WHREL_OBJ` on WHSC.
- `|` union (parts)
- Combines hits from multiple named parts, e.g. `p1 | p2 | p3` for multi-pattern rules.
- `_` self
- Refers to the rule's own bare pattern. Used when the rule has a `cql:` or `semgrex:` at rule level alongside a combine expression, e.g. `_ & !NOMZ & !GER`.
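Putting this together, a multi-part rule with a combine expression might be sketched like this (the lemmas and part names are illustrative):

```yaml
# Hypothetical rule: union of two part patterns
parts:
  p1:
    cql: '[lemma="seem" & pos="VB.*"]'
  p2:
    cql: '[lemma="appear" & pos="VB.*"]'
combine: p1 | p2   # hits from either part count toward the feature
```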
Sentence-Level Attributes (s-attrs)
Some features require checking a property of the sentence a token belongs to, not just the token's immediate context. For example, WH-questions are identified partly by whether the sentence contains a question mark or an auxiliary verb — properties that may be many tokens away from the WH-word itself.
CLEF handles this with s-attributes: pre-computed
sentence-level boolean properties attached to every token in the
sentence. A detection rule declares the properties it needs as CQL
patterns; the engine evaluates each pattern per sentence and sets the
named attribute to "true" on every token in matching
sentences. CQL constraints then reference these attributes like any
other token property.
```yaml
# Rule declaration
s_attrs:
  sent_has_q: '[word="?"]'       # sentence contains "?"
  sent_has_aux: '[upos="AUX"]'   # sentence contains an auxiliary

# CQL pattern referencing s-attributes
[pos="W.*" & sent_has_q="true"]  # WH-word in a question sentence
```
This mechanism is general: any sentence-level property expressible as a CQL pattern can be declared and queried, without hard-coding attribute names in the engine.
Feature Format
Each feature is a YAML file in canonical key order:
| Field | Required | Description |
|---|---|---|
| `code` | ✓ | Unique uppercase identifier (e.g. VBD) |
| `biber_number` | | Biber’s 1988 feature number (1–67) |
| `xiao_number` | | Xiao’s 2009 feature code (e.g. B24) |
| `name` | ✓ | Human-readable name |
| `definition` | ✓ | Self-contained description of the feature |
| `normalization` | ✓ | Rate denominator: `words`, `finite_verbs`, `nouns`, etc. |
| `parent` | | Code of composite parent (for atomic features) |
| `children` | | List of atomic child codes (for composite features) |
| `detection` | ✓ | One or more detection rules (see Rule Fields above) |
| `examples` | | Positive test examples with `_match_` markup |
| `non_examples` | | Negative examples with `~false positive~` markup |
| `sources` | ✓ | Provenance keys referencing sources.yaml |
| `notes` | | Dimension loadings, connections, observations |
Composites and Atomics
When a source (e.g. MFTE) splits a Biber feature, the catalogue preserves both levels. A composite feature has `children` listing its atomic parts; each child has a `parent` back-link. For example, Biber’s “first person pronouns” (6) becomes composite FPP with children FPP1S and FPP1P.
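In file terms, the pair might be sketched as follows (fields beyond those named above are omitted for brevity):

```yaml
# FPP.yaml: composite (sketch)
code: FPP
name: First person pronouns
children: [FPP1S, FPP1P]
---
# FPP1S.yaml: atomic child (sketch)
code: FPP1S
name: First person singular pronouns
parent: FPP
```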
Tag Refinement
Some detection rules reclassify tokens that a POS tagger has assigned to a broad category. For example, a detection rule for FREQ (frequency adverbs) declares `refines: RB`: tokens it matches are functionally "frequency adverbs," not generic adverbs. The residual RB count should then include only adverbs that no specific detection rule has claimed.
CLEF handles this with two virtual layers:
- `pos`: the POS tagger's output (immutable)
- Always reflects what the tagger assigned. A query like `[pos="RB"]` matches all RB tokens, including those later refined.
- `cat`: functional category (refined)
- Defaults to `pos` but is overwritten when a detection rule with `refines: RB` matches. A query like `[cat="RB"]` matches only unrefined RB tokens, those not claimed by any specific detection rule.
Detection rules declare refinement with the `refines` field (e.g. `refines: RB` or `refines: RB|NN` for multiple POS tags). This makes explicit a mechanism that in other tools is implicit in processing order; see the design principle of separation of spec and implementation above.
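A refining rule might be sketched like this; the word list and source attribution are illustrative, not taken from the catalogue:

```yaml
# Hypothetical FREQ rule that claims tokens away from generic RB
- source: mfte
  requires: [word, pos]
  words: [always, often, usually, rarely, never]   # illustrative list
  cql: '[word={words} & pos="RB"]'
  refines: RB   # matched tokens no longer count toward [cat="RB"]
```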
Sources
| Key | Author | Year | Title |
|---|---|---|---|
| biber_1988 | Biber, Douglas | 1988 | Variation across Speech and Writing |
| xiao_2009 | Xiao, Richard | 2009 | Multidimensional analysis and the study of world Englishes |
| bohmann_2019 | Bohmann, Axel | 2019 | Variation in English Worldwide: Varieties and Genres in a Quantitative Perspective |
| biber_2006 | Biber, Douglas | 2006 | University Language: A Corpus-based Study of Spoken and Written Registers |
| pybiber | Brown, David West & Reinhart, Alex | 2026 | pybiber: Python package for linguistic feature extraction and Multi-Dimensional Analysis |
| mfte | Le Foll, Elen & Shakir, Muhammad | 2023/2025 | Multi-Feature Tagger of English (MFTE), Python version |
| grieve_2016 | Grieve, Jack | 2016 | Regional Variation in Written American English |
| grieve_2023 | Grieve, Jack | 2023 | Register variation explains stylometric authorship analysis |
| leech_2009 | Leech, Geoffrey; Hundt, Marianne; Mair, Christian; Smith, Nicholas | 2009 | Change in Contemporary English: A Grammatical Study |
| quirk_1985 | Quirk, Randolph; Greenbaum, Sidney; Leech, Geoffrey; Svartvik, Jan | 1985 | A Comprehensive Grammar of the English Language |
| tottie_2009 | Tottie, Gunnel | 2009 | How different are American and British English grammar? And how are they different? |
| collins_2015 | Collins, Peter | 2015 | Grammatical Variation in English Worldwide |
| le_foll_2024 | Le Foll, Elen | 2024 | Textbook English |
| baayen_1994 | Baayen, R. Harald | 1994 | Derivational productivity and text typology |
| biermeier_2008 | Biermeier, Thomas | 2008 | Word-formation in New Englishes |
| rohdenburg_schlueter_2009 | Rohdenburg, Günter & Schlüter, Julia | 2009 | One Language, Two Grammars? |
| usas | Rayson, Paul; Archer, Dawn; Piao, Scott; McEnery, Tony | 2004 | The UCREL Semantic Analysis System |
Normalization
Each feature specifies a normalization denominator, the base used when computing rates. Following the convention of Biber-tradition taggers, many verb-related features are normalized per finite verb rather than per total words.
| Value | Meaning |
|---|---|
| `words` | Per total word tokens |
| `finite_verbs` | Per finite verb tokens |
| `nouns` | Per noun tokens |
| `clauses` | Per clause count |
| `sentences` | Per sentence count |
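For instance, a verb-phrase feature might declare its denominator like this (hypothetical feature fields; the engine, not the catalogue, performs the division):

```yaml
code: PASS
name: Agentless passives
normalization: finite_verbs   # rate = raw count / finite verb count
```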