About CLEF
CLEF — Catalogue of Lexico-grammatical English Features — is a unified, structured, machine-readable catalogue of linguistic features for multi-dimensional register analysis, corpus stylistics, authorship attribution, and any research that relies on counting lexico-grammatical features in English text.
What is CLEF?
CLEF synthesizes feature sets from multiple foundational sources in register analysis and corpus linguistics. Each feature is defined in a YAML file with a human-readable definition, machine-readable detection rules, normalization metadata, source provenance, and annotated examples.
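Concretely, a feature file might look like the following sketch. The field names match the Feature Format section of this document, but the feature content, pattern, and example sentence here are purely illustrative:

```yaml
# Hypothetical feature file (illustrative content only)
code: VBD
name: Past tense verbs
definition: Verb tokens carrying simple past tense morphology.
normalization: words
detection:
  - source: biber_1988
    requires: [pos]
    cql: '[pos="VBD"]'
examples:
  - "She _walked_ home."
sources:
  - biber_1988
```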
The catalogue is implementation-agnostic: it defines what each feature is and how to detect it declaratively. The forthcoming companion library pyclef will provide a Python engine that interprets the catalogue’s rules against text.
Why CLEF?
Multi-dimensional analysis (Biber, 1988) is the dominant framework for studying register variation. But the field has no single, unified catalogue of features. Different tools implement different subsets with different operationalizations.
Until now, feature definitions have lived in exactly two places: lists and tables in books and papers (human-readable but not machine-executable) or hard-coded in software (machine-executable but not human-inspectable without programming). CLEF brings these together in a single structured format.
Design Principles
- Separation of spec and implementation
- The catalogue defines what each feature is and how to detect it declaratively. The forthcoming companion library pyclef is one engine that interprets these rules — but the catalogue stands on its own. A linguist who has never written a line of Python can inspect, critique, and propose changes to a feature definition.
- Provenance and cross-tool comparison
- Every detection rule is tagged with its source. When sources use different word lists for the “same” feature, both are preserved as separate rules. This makes disagreements visible for the first time. A single feature may have hand-curated word lists from one source and automatically extracted lists from another (e.g. the USAS lexicon), each independently attributed.
- Maximum granularity
- Where sources merge features that could be separated, CLEF preserves the finer distinctions. Composite features with `children` allow aggregation when needed, but the atomics are always available.
- Tiered detection methods
- Not all features can be detected the same way. CLEF's `requires` field makes this explicit: from simple tokenization through POS tagging, dependency parsing, semantic tagging, and LLM-based detection to fully manual human annotation. This acknowledges the reality that some features are ambiguous or resist automation and require human judgment. Rather than excluding these features, CLEF gives them first-class status alongside automated ones.
- Tagger-agnostic
- Detection rules are expressed in standard pattern languages (CQL for sequential patterns, Semgrex for dependency trees) rather than tied to a specific NLP toolkit. Today the engine uses spaCy; tomorrow it could be Stanza, a fine-tuned BERT, or a language model. The catalogue doesn’t change.
- LLM and human annotation as first-class methods
- The detection method taxonomy includes `llm` and `human` alongside automated methods. As LLMs improve at linguistic annotation tasks, features that once required expert human judgment can gain LLM-based detection rules, and the catalogue tracks both methods.
- Extensibility without code
- Adding a new feature means writing one YAML file. It automatically becomes available to any engine that reads the catalogue. A researcher who discovers a new feature or proposes a better word list can contribute without forking a codebase.
- Annotated examples as test suite
- Positive examples (with `_match_` markup) and non-examples (with `~false positive~` markup) serve as both documentation and automated tests. They make edge cases and disambiguation needs visible, and they grow over time as the community encounters new cases.
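As a sketch of the last principle, the two markup conventions might appear in a feature file like this (hypothetical feature and sentences, not taken from the catalogue):

```yaml
# Hypothetical examples block (illustrative sentences)
examples:
  - "She has _lived_ here for years."   # _match_ marks the token(s) the rule should hit
non_examples:
  - "The house had a ~lived~-in look."  # ~...~ marks a token the rule must NOT count
```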
Detection Rules
The catalogue contains 654 detection rules across features. Many features have multiple rules — from different sources, at different accuracy levels, or using different detection methods.
For example, a semantic feature defined by Xiao (2009) via a USAS
semantic tag may also have a lexical list fallback
derived from the
UCREL USAS English lexicon
(54,798 entries). This means the feature can be approximately detected
using word lists even without a full semantic tagger. When multiple
detection rules exist, they are tagged with their source
so the engine (or researcher) can choose which to apply.
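Such a pair of rules might be sketched like this; the USAS code, word list, and caveat are illustrative, not taken from the catalogue:

```yaml
# Hypothetical: two independently attributed rules for one feature
detection:
  - source: xiao_2009
    requires: [usas]
    usas_code: "T1.3"                  # detect via semantic tagger
  - source: usas
    requires: [word]
    words: [afterwards, later, soon]   # lexical fallback from the USAS lexicon
    caveat: Approximate; no sense disambiguation without a semantic tagger.
```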
| Pattern Type | Rules | Description |
|---|---|---|
| `cql` | 310 | Corpus Query Language: sequential token matching by POS, lemma, word, dep |
| `other` | 184 | Semantic tagger, LLM, or human annotation (no automated pattern) |
| `regex` | 87 | Regular expression: surface-level character matching |
| `parts` | 40 | Multi-pattern rules with named parts and boolean combine expressions |
| `semgrex` | 24 | Stanford Semgrex: dependency tree pattern matching |
| `computed` | 9 | Computed from child features (composites) or cross-feature references |
Pattern Languages
The catalogue uses three pattern languages, matching different kinds of detection:
- Regex (Regular expressions)
- For surface-level patterns: matching character sequences in word forms without requiring any NLP pipeline. Used especially for derivational morphology features (prefixes and suffixes) where the orthographic form is the signal.

  ```
  \w+tion$               words ending in -tion
  ^un\w+                 words starting with un-
  \w+'(t|ll|re|ve|s|d)   contractions
  ```

- CQL (Corpus Query Language)
- For sequential patterns: matching tokens by POS tag, UPOS, lemma, word form, or dependency relation. CQL is the dominant query language in corpus linguistics (CQP/CWB, Sketch Engine, NoSketch Engine).

  ```
  [pos="JJ"] [pos="NN|NNS"]                  two-token sequence
  [lemma="be" & pos="VBZ"]                   lemma AND POS constraint
  [word={words} & pos="RB"]                  word list with placeholder
  [pos="IN" & dep="prep"]                    POS + dependency relation
  [upos="VERB" & pos!="VBG"]                 coarse UPOS + fine-grained exclusion
  [dep="aux.*"] [upos="ADV"] [upos="VERB"]   regex in dep, UPOS constraints
  ```

- Semgrex (Stanford)
- For structural patterns: matching dependency tree configurations. The de facto standard for dependency patterns.

  ```
  {pos:/VB.*/}=verb >auxpass {lemma:be}   passive voice
  {pos:/NN.*/}=noun >amod {}=adj          attributive adjective
  A >rel B                                A is head of B via relation
  ```
Rule Fields
Each detection rule is a YAML mapping with the following fields:
| Field | Required | Description |
|---|---|---|
| `source` | | Provenance key (e.g. pybiber, mfte). Rules without a source are universal. |
| `requires` | | List of token attributes needed: `[word, pos, dep]` |
| `cql` | | CQL pattern: sequential token matching by POS, lemma, word, dep, etc. |
| `semgrex` | | Semgrex pattern: dependency tree matching |
| `regex` | | Regular expression: surface-level character matching |
| `words` | | Word list (flat list or named dict of lists) |
| `parts` | | Named sub-patterns: a dict of parts, each with `cql:`, `semgrex:`, or `word_list:` |
| `combine` | | Boolean expression composing parts and feature codes: `p1 \| p2`, `_ & !FEAT`, `_ & -FEAT` |
| `anchor` | | Which token in a multi-token pattern to report as the hit: `first` (default) or `last` |
| `sentence_scope` | | If true, pattern matching is constrained to within sentence boundaries |
| `refines` | | POS tag(s) this rule reclassifies (e.g. `RB`). See Tag Refinement below. |
| `s_attrs` | | Sentence-level attribute specs: mapping of names to CQL patterns. Enables `[sent_has_q="true"]`-style constraints. |
| `description` | | Human-readable explanation of the rule |
| `caveat` | | Known limitation or accuracy warning |
| `usas_code` | | USAS semantic tag code (for semantic features) |
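A single rule drawing on several of these fields might be sketched as follows (the pattern, source, and caveat are illustrative only):

```yaml
# Hypothetical detection rule using several optional fields
- source: mfte
  requires: [word, pos, dep]
  cql: '[pos="IN" & dep="prep"]'
  anchor: first          # report the preposition itself as the hit
  sentence_scope: true   # do not match across sentence boundaries
  description: Prepositions heading a prepositional phrase.
  caveat: May include particles mistagged as IN.
```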
Combine Expressions
The `combine` field uses boolean expressions to compose parts and exclude other features. Two NOT operators handle different kinds of feature overlap; a union operator and a self placeholder round out the syntax:
- `!` exclude (index overlap)
- Removes hits whose token index also appears in the excluded feature. Use when both features match the same token, e.g. `_ & !PASSBY` on PASS.
- `-` subtract (count)
- Subtracts the total count of another feature from this feature's count. Use when features overlap conceptually but match different tokens, e.g. `_ & -WHREL_SUBJ & -WHREL_OBJ` on WHSC.
- `|` union (parts)
- Combines hits from multiple named parts, e.g. `p1 | p2 | p3` for multi-pattern rules.
- `_` self
- Refers to the rule's own bare pattern. Used when the rule has a `cql:` or `semgrex:` at rule level alongside a combine expression, e.g. `_ & !NOMZ & !GER`.
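Putting this together, a multi-part rule with a combine expression might be sketched like this (the lemmas and part names are illustrative):

```yaml
# Hypothetical rule: union of two part patterns
parts:
  p1:
    cql: '[lemma="seem" & pos="VB.*"]'
  p2:
    cql: '[lemma="appear" & pos="VB.*"]'
combine: p1 | p2   # hits from either part count toward the feature
```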
Sentence-Level Attributes (s-attrs)
Some features require checking a property of the sentence a token belongs to, not just the token's immediate context. For example, WH-questions are identified partly by whether the sentence contains a question mark or an auxiliary verb — properties that may be many tokens away from the WH-word itself.
CLEF handles this with s-attributes: pre-computed
sentence-level boolean properties attached to every token in the
sentence. A detection rule declares the properties it needs as CQL
patterns; the engine evaluates each pattern per sentence and sets the
named attribute to "true" on every token in matching
sentences. CQL constraints then reference these attributes like any
other token property.
```yaml
# Rule declaration
s_attrs:
  sent_has_q: '[word="?"]'       # sentence contains "?"
  sent_has_aux: '[upos="AUX"]'   # sentence contains an auxiliary

# CQL pattern referencing s-attributes
[pos="W.*" & sent_has_q="true"]  # WH-word in a question sentence
```
This mechanism is general: any sentence-level property expressible as a CQL pattern can be declared and queried, without hard-coding attribute names in the engine.
Feature Format
Each feature is a YAML file in canonical key order:
| Field | Required | Description |
|---|---|---|
| `code` | ✓ | Unique uppercase identifier (e.g. VBD) |
| `biber_number` | | Biber’s 1988 feature number (1–67) |
| `xiao_number` | | Xiao’s 2009 feature code (e.g. B24) |
| `name` | ✓ | Human-readable name |
| `definition` | ✓ | Self-contained description of the feature |
| `normalization` | ✓ | Rate denominator: `words`, `finite_verbs`, `nouns`, etc. |
| `parent` | | Code of composite parent (for atomic features) |
| `children` | | List of atomic child codes (for composite features) |
| `detection` | ✓ | One or more detection rules (see Rule Fields above) |
| `examples` | | Positive test examples with `_match_` markup |
| `non_examples` | | Negative examples with `~false positive~` markup |
| `sources` | ✓ | Provenance keys referencing sources.yaml |
| `notes` | | Dimension loadings, connections, observations |
Composites and Atomics
When a source (e.g. MFTE) splits a Biber feature, the catalogue preserves both levels. A composite feature has `children` listing its atomic parts; each child has a `parent` back-link. For example, Biber’s “first person pronouns” (6) becomes composite FPP with children FPP1S and FPP1P.
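In file terms, the pair might be sketched as follows (fields beyond those named above are omitted for brevity):

```yaml
# FPP.yaml: composite (sketch)
code: FPP
name: First person pronouns
children: [FPP1S, FPP1P]
---
# FPP1S.yaml: atomic child (sketch)
code: FPP1S
name: First person singular pronouns
parent: FPP
```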
Tag Refinement
Some detection rules reclassify tokens that a POS tagger has assigned to a broad category. For example, a detection rule for FREQ (frequency adverbs) declares `refines: RB`: tokens it matches are functionally "frequency adverbs," not generic adverbs. The residual RB count should then include only adverbs that no specific detection rule has claimed.
CLEF handles this with two virtual layers:
- `pos`: the POS tagger's output (immutable)
- Always reflects what the tagger assigned. A query like `[pos="RB"]` matches all RB tokens, including those later refined.
- `cat`: functional category (refined)
- Defaults to `pos` but is overwritten when a detection rule with `refines: RB` matches. A query like `[cat="RB"]` matches only unrefined RB tokens, those not claimed by any specific detection rule.
Detection rules declare refinement with the `refines` field (e.g. `refines: RB` or `refines: RB|NN` for multiple POS tags). This makes explicit a mechanism that in other tools is implicit in processing order; see the design principle of separation of spec and implementation above.
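A refining rule might be sketched like this; the word list and source attribution are illustrative, not taken from the catalogue:

```yaml
# Hypothetical FREQ rule that claims tokens away from generic RB
- source: mfte
  requires: [word, pos]
  words: [always, often, usually, rarely, never]   # illustrative list
  cql: '[word={words} & pos="RB"]'
  refines: RB   # matched tokens no longer count toward [cat="RB"]
```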
Sources
| Key | Author | Year | Title |
|---|---|---|---|
| biber_1988 | Biber, Douglas | 1988 | Variation across Speech and Writing |
| xiao_2009 | Xiao, Richard | 2009 | Multidimensional analysis and the study of world Englishes |
| bohmann_2019 | Bohmann, Axel | 2019 | Variation in English Worldwide: Varieties and Genres in a Quantitative Perspective |
| biber_2006 | Biber, Douglas | 2006 | University Language: A Corpus-based Study of Spoken and Written Registers |
| pybiber | Brown, David West & Reinhart, Alex | 2026 | pybiber: Python package for linguistic feature extraction and Multi-Dimensional Analysis |
| mfte | Le Foll, Elen & Shakir, Muhammad | 2023/2025 | Multi-Feature Tagger of English (MFTE), Python version |
| grieve_2016 | Grieve, Jack | 2016 | Regional Variation in Written American English |
| grieve_2023 | Grieve, Jack | 2023 | Register variation explains stylometric authorship analysis |
| leech_2009 | Leech, Geoffrey; Hundt, Marianne; Mair, Christian; Smith, Nicholas | 2009 | Change in Contemporary English: A Grammatical Study |
| quirk_1985 | Quirk, Randolph; Greenbaum, Sidney; Leech, Geoffrey; Svartvik, Jan | 1985 | A Comprehensive Grammar of the English Language |
| tottie_2009 | Tottie, Gunnel | 2009 | How different are American and British English grammar? And how are they different? |
| collins_2015 | Collins, Peter | 2015 | Grammatical Variation in English Worldwide |
| le_foll_2024 | Le Foll, Elen | 2024 | Textbook English |
| baayen_1994 | Baayen, R. Harald | 1994 | Derivational productivity and text typology |
| biermeier_2008 | Biermeier, Thomas | 2008 | Word-formation in New Englishes |
| rohdenburg_schlueter_2009 | Rohdenburg, Günter & Schlüter, Julia | 2009 | One Language, Two Grammars? |
| usas | Rayson, Paul; Archer, Dawn; Piao, Scott; McEnery, Tony | 2004 | The UCREL Semantic Analysis System |
Normalization
Each feature specifies a normalization denominator, the base used when computing rates. Following the convention of Biber-tradition taggers, many verb-related features are normalized per finite verb rather than per total words.
| Value | Meaning |
|---|---|
| `words` | Per total word tokens |
| `finite_verbs` | Per finite verb tokens |
| `nouns` | Per noun tokens |
| `clauses` | Per clause count |
| `sentences` | Per sentence count |
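For instance, a verb-phrase feature might declare its denominator like this (hypothetical feature fields; the engine, not the catalogue, performs the division):

```yaml
code: PASS
name: Agentless passives
normalization: finite_verbs   # rate = raw count / finite verb count
```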