Hiro Fukushima

Write-Like-Me

A Measurement Framework for Writing Voice

Applied AI
Case Study
AI, NLP, Stylometry, Computational Linguistics

Summary

A framework that models a writing voice by measuring it rather than describing it. Most attempts to make an AI write in a particular voice rely on adjectives a model cannot act on, so the imitation drifts within a paragraph. This framework treats voice as a set of measurable habits and extracts around fifty linguistic features grounded in published stylometry, the same body of methods used to attribute disputed authorship.

A seven-stage pipeline runs from raw corpus to a reusable, rule-based voice profile. It separates registers so a single number is never read out of context, and it ends with a verification stage that regenerates text and checks it against the source on fixed numeric thresholds. The method is the product. The system does not merely claim to capture a voice. It measures whether it succeeded.

The framework is public and authored end to end, built and validated across multiple registers. The concrete figures shown here are drawn from a published essay, so the method can be demonstrated without exposing private writing.

Overview

  • Built a framework that models a writing voice by measuring around fifty linguistic features, instead of describing it with adjectives a model cannot act on
  • Grounded in published stylometry: function-word analysis, moving-average lexical richness, sentence-length distribution, punctuation profiling, and register separation
  • A seven-stage pipeline runs from raw corpus to a reusable, rule-based voice profile
  • Three-tier feature extraction that degrades gracefully, from a zero-dependency baseline to optional deep-syntactic analysis
  • A verification stage that regenerates text and tests it against the source corpus on fixed numeric thresholds
  • Produces hard, checkable rules and quantitative targets rather than a loose description
  • Public framework, authored end to end

Role

  • Sole author of the framework, methodology, and implementation
  • Designed the linguistic feature taxonomy and grounded each feature in published research
  • Built the stylometric feature-extraction pipeline in Python across three dependency tiers
  • Designed the seven-stage build workflow with a human review checkpoint
  • Designed the verification protocol that measures a rebuilt voice against its source

00. Table of Contents

01. The Problem

Why instructing a model to write in a voice produces drift, and what is missing.

02. The Premise

Voice as a set of measurable habits rather than an aesthetic.

03. The Method

The stylometric foundation, from the Federalist Papers to register theory.

04. What Gets Measured

Around fifty features across three extraction tiers.

05. The Pipeline

A seven-stage workflow from corpus to a verified voice profile.

06. Verification

Regenerating text and testing it against the source on fixed thresholds.

07. Outcome

A reproducible voice system whose strength is the method behind it.

01. The Problem

The instruction to write in someone’s voice produces an imitation that holds for one sentence and loosens by the end of the paragraph. A language model reaches for an average of how educated English reads, and the specific habits that make one writer recognizable are the first thing that averaging removes. The output is competent and anonymous at the same time, which is the worst result for anyone whose writing is part of how they are known.

The common fix makes the problem worse. People describe the voice they want with adjectives, asking for prose that is "direct", or "warm", or "conversational", and those words carry almost no information a model can act on. Two writers described the same way produce nothing alike, because the description names a feeling about the writing rather than anything the writing measurably does.

The deeper issue is that a description gives you no way to tell when the output has slipped. Working from an impression, you cannot catch a sentence-length distribution that has quietly shifted or a punctuation habit that has crept back in. The voice fails silently, and the only signal is a vague sense that something is off.

A voice is not an aesthetic you describe. It is a set of habits you can measure.

02. The Premise

A writing voice is the accumulation of choices a writer makes without noticing: how long a sentence runs before it closes, how often a claim gets hedged, which small function words recur, whether the punctuation leans on the em dash or refuses it. These choices are stable within a single register and distinctive between writers, which is what makes them worth measuring rather than describing.

This is supported by work on stylistic prompting, which finds that quantified constraints steer a language model better than qualitative descriptions, and that a numerical profile paired with a few short exemplar passages outperforms either one alone. The same work finds that phrases of the "writes in a direct, conversational style" kind have close to zero measurable effect on output. The framework is built on that finding. It refuses adjectives, measures the habits, and turns them into constraints a model can be held to.

03. The Method

The measurement is not improvised. Every feature the framework extracts traces to published work in corpus linguistics and authorship attribution, the same body of methods used to settle questions of disputed authorship. Naming the lineage matters, because it is the difference between a tool that sounds rigorous and one that can be defended.

03.01 Function Words and Lexical Richness

Mosteller and Wallace established in their 1964 study of the disputed Federalist Papers that the frequencies of small function words are author-distinctive and topic-invariant, meaning they do not shift with subject matter the way content words do. Burrows formalized this into the Delta method for authorship attribution in 2002. For lexical richness the framework uses the Moving-Average Type-Token Ratio of Covington and McFall (2010), which measures vocabulary variety inside a sliding window and removes the length bias that distorts a plain type-token ratio.
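As a minimal sketch of the two measurements above, assuming a simple regex tokenizer, a ten-word function-word list chosen only for illustration, and a placeholder window of 50 tokens (the framework's actual list and window size are not stated here), the profile and the moving-average ratio could be computed along these lines:

```python
import re
from collections import Counter

# Illustrative subset only; a real function-word list would be much longer.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "but", "because", "which")

def tokenize(text):
    """Lowercase word tokens; straight apostrophes are kept so contractions stay whole."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def function_word_profile(tokens, words=FUNCTION_WORDS):
    """Rate per 1,000 tokens for each function word in the list."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return {w: 1000 * counts[w] / total for w in words}

def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio: the mean TTR over every fixed-size window."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to a plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    counts = Counter(tokens[:window])
    ttr_sum = len(counts) / window
    for i in range(window, len(tokens)):
        leaving = tokens[i - window]
        counts[leaving] -= 1
        if counts[leaving] == 0:
            del counts[leaving]
        counts[tokens[i]] += 1
        ttr_sum += len(counts) / window
    return ttr_sum / (len(tokens) - window + 1)
```

Whatever window is chosen has to stay fixed across the texts being compared, since the ratio is only comparable between samples measured with the same window.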

03.02 Sentence and Punctuation Habits

Sentence length is among the oldest stylometric features, part of the tradition of length-based counting that goes back to Mendenhall’s 1887 study of word-length distributions. The framework records the full distribution rather than the mean alone, because the median, the quartiles, and the share of very short and very long sentences carry far more signal than a single average. Punctuation is treated as a primary marker on the strength of Grieve (2007), who showed that em-dash, semicolon, and comma habits are among the most author-specific surface features available.
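A rough sketch of what recording the full distribution might look like, using only the standard library. The naive sentence splitter, the function names, and the choice of four punctuation marks are assumptions made for illustration. The cutoffs of 5 and 30 words mirror the short and long sentence shares reported later in this case study.

```python
import re
import statistics

# Assumed marks for illustration; the em dash is written as a Unicode escape.
MARKS = {"comma": ",", "semicolon": ";", "colon": ":", "em_dash": "\u2014"}

def split_sentences(text):
    """Naive split after ., ! or ? plus whitespace. Abbreviations will mis-split."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def length_distribution(text, short_max=5, long_min=30):
    """Summarize sentence lengths in words: center, spread, and both tails."""
    lengths = sorted(len(s.split()) for s in split_sentences(text))
    if not lengths:
        return {}
    if len(lengths) > 1:
        q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    else:
        q1 = q3 = lengths[0]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "q1": q1,
        "q3": q3,
        "share_short": sum(n <= short_max for n in lengths) / len(lengths),
        "share_long": sum(n >= long_min for n in lengths) / len(lengths),
    }

def punctuation_per_sentence(text):
    """Average occurrences of each mark per sentence."""
    count = len(split_sentences(text)) or 1
    return {name: text.count(mark) / count for name, mark in MARKS.items()}
```

Two texts can share a mean and still differ sharply in the short and long shares, which is the reason for keeping the tails alongside the average.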

03.03 Discourse and Stance

Connectives are classified using the taxonomy Halliday and Hasan set out in Cohesion in English (1976), separating additive, adversative, causal, and temporal links, because a writer who argues from mechanism reaches for "because" and "therefore" where another reaches for "also" and "furthermore". Hedging and boosting follow the epistemic-stance line of Hyland (1998), where what matters is not the raw rate but its calibration to the strength of the evidence, itself a marker of how carefully a writer handles uncertainty.
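The raw counts behind these two features can be sketched as follows. The lexicons are tiny stand-ins invented for illustration, not the framework's lists, and several entries ("so", "may", "then") are ambiguous in real text. The sketch covers rates only. Calibration against the strength of the evidence is a judgment that a word count does not capture.

```python
import re

# Illustrative lexicons only; real lists would be longer and checked per register.
CONNECTIVES = {
    "additive": {"also", "furthermore", "moreover"},
    "adversative": {"but", "however", "yet"},
    "causal": {"because", "therefore", "so"},
    "temporal": {"then", "afterwards", "finally"},
}
HEDGES = {"perhaps", "might", "may", "possibly", "seems"}
BOOSTERS = {"clearly", "certainly", "obviously", "definitely"}

def tokenize(text):
    """Lowercase word tokens; straight apostrophes are kept inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def connective_shares(tokens):
    """Share of each connective category among all connectives found."""
    counts = {cat: sum(t in words for t in tokens) for cat, words in CONNECTIVES.items()}
    total = sum(counts.values())
    return {cat: (n / total if total else 0.0) for cat, n in counts.items()}

def density_per_100(tokens, lexicon):
    """Occurrences of lexicon words per 100 tokens."""
    if not tokens:
        return 0.0
    return 100 * sum(t in lexicon for t in tokens) / len(tokens)
```

A writer who argues from mechanism would show a causal share well above the additive share, which is the contrast the taxonomy is there to expose.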

03.04 Register

Biber established in Variation Across Speech and Writing (1988) that texts cluster into registers, each carrying its own baseline for every feature. This is why the framework never reads a number out of context. Calling a sentence short requires knowing whether the comparison is a formal essay or a casual message, so every register is analyzed on its own terms.
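One way to picture register separation, as a sketch under the assumption that the corpus arrives already labeled by register (the labels and function names here are hypothetical): each register gets its own baseline, and a sentence is only ever compared with the baseline of the register it belongs to.

```python
import re
from statistics import mean

def sentence_lengths(text):
    """Word counts per sentence, using a naive split on end punctuation."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def register_baselines(corpus):
    """corpus maps a register name to a list of texts; registers are never pooled."""
    return {
        register: mean(n for text in texts for n in sentence_lengths(text))
        for register, texts in corpus.items()
    }

def relative_length(sentence, register, baselines):
    """Sentence length as a ratio of its own register's mean (1.0 is typical)."""
    return len(sentence.split()) / baselines[register]
```

The same six-word sentence can come out short against an essay baseline and long against a chat baseline, which is why the number means nothing until the register is named.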

Two further families of features, the LIWC cognitive-process markers of Pennebaker and colleagues and the Appraisal-theory stance system of Martin and White, are documented as the research foundation for planned work rather than presented as already running. The framework states that boundary rather than implying a coverage it does not yet have.

04. What Gets Measured

The extraction runs in three tiers so it degrades gracefully, producing a useful profile on any machine and a richer one where optional tools are installed.

04.01 Tier One, Standard Library

This tier always runs, using nothing beyond the Python standard library. It computes the function-word profile, the moving-average type-token ratio, the hapax-legomena ratio, the full sentence-length distribution, paragraph statistics, punctuation rates (commas, em dashes, semicolons, colons, parentheticals, questions, exclamations, and ellipses), hedging and booster density, pronoun rates by person, the concession rate, and the distribution of sentence-initial words.
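Three of the smaller tier-one features can be sketched with the standard library alone. The pronoun sets are partial and assumed for illustration, and the hapax ratio is taken here as once-only words over total tokens, which is one common definition and not necessarily the one the framework uses.

```python
import re
from collections import Counter

# Assumed, partial pronoun sets; reflexives and some possessives are left out.
PRONOUNS = {
    "first": {"i", "me", "my", "we", "us", "our"},
    "second": {"you", "your"},
    "third": {"he", "she", "it", "they", "him", "her", "them", "his", "its", "their"},
}

def tokenize(text):
    """Lowercase word tokens; straight apostrophes are kept inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def hapax_ratio(tokens):
    """Words that occur exactly once, as a share of all tokens."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1) / len(tokens)

def pronoun_rates(tokens):
    """Pronouns per 100 tokens, grouped by grammatical person."""
    total = len(tokens) or 1
    return {person: 100 * sum(t in words for t in tokens) / total
            for person, words in PRONOUNS.items()}

def sentence_initial_words(text):
    """Count the first word of each sentence, using a naive sentence split."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    firsts = (tokenize(s)[:1] for s in sentences)
    return Counter(words[0] for words in firsts if words)
```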

04.02 Tier Two, Readability

When the optional textstat package is present, the framework adds readability scores including Flesch-Kincaid grade and Gunning Fog, along with syllable and lexical-density statistics.

04.03 Tier Three, Deep Syntax

When a dependency parser is installed, the framework adds part-of-speech bigram frequencies, average dependency-tree depth, the passive-voice ratio, and the nominalization rate, which describe syntactic habits that surface counts cannot reach.

Combined across the tiers, the framework produces around fifty distinct measurements per register.
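Graceful degradation of this kind is commonly done by probing for optional packages before importing them. The sketch below assumes that approach. The tier labels and function names are invented for illustration, and `spacy` stands in for the unnamed dependency parser purely as an assumption. The two `textstat` calls are that package's readability functions for the scores named above.

```python
from importlib.util import find_spec

def available_tiers(readability_module="textstat", parser_module="spacy"):
    """Report which tiers can run on this machine without importing anything heavy."""
    tiers = ["tier1_stdlib"]  # always available
    if find_spec(readability_module) is not None:
        tiers.append("tier2_readability")
    if find_spec(parser_module) is not None:
        tiers.append("tier3_syntax")
    return tiers

def extract(text, tiers=None):
    """Run every tier that is available and skip the rest without failing."""
    tiers = tiers or available_tiers()
    features = {"word_count": len(text.split())}  # stand-in for the tier-one features
    if "tier2_readability" in tiers:
        import textstat  # imported lazily so tier one never depends on it
        features["flesch_kincaid_grade"] = textstat.flesch_kincaid_grade(text)
        features["gunning_fog"] = textstat.gunning_fog(text)
    # Tier-three features would be added the same way once a parser is found.
    return features
```

Importing inside the branch keeps the baseline free of third-party dependencies, so a missing package reduces the profile without breaking the run.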

05. The Pipeline

Building a voice is a seven-stage workflow that moves from a raw corpus to a verified profile, with a human checkpoint before any rule is committed.

| Stage | What happens |
| --- | --- |
| 01. Discover | Identify the corpus and confirm ownership or permission to analyze it. |
| 02. Extract | Isolate author-only text from conversation exports using a heading marker, following standard speaker-diarization practice. Warn when the corpus is too small for reliable distributions. |
| 03. Analyze | Run feature extraction over the full corpus and over each register on its own. |
| 04. Mine rules | Derive candidate rules from existing feedback files, from extreme statistics, and from negative space, the shapes a writer never uses. |
| 05. Review | Stop. Present the mined rules for the writer to confirm, correct, or extend before anything is committed. |
| 06. Emit | Select exemplar passages from the actual corpus, never invented, and write the voice profile and a self-contained skill. |
| 07. Verify | Regenerate text and measure it against the source corpus, described next. |
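The extract stage can be sketched as follows. The export format is not described here, so the `## Author` heading, the `## ` prefix, and the 2,000-word warning floor are all assumptions chosen for illustration.

```python
import warnings

def extract_author_text(export, author_heading="## Author",
                        heading_prefix="## ", min_words=2000):
    """Keep only the text under the author's heading; warn if the total is small."""
    blocks, current, keeping = [], [], False
    for line in export.splitlines():
        if line.startswith(heading_prefix):
            # A new heading closes whichever turn was open.
            if keeping and current:
                blocks.append("\n".join(current).strip())
            keeping = line.strip() == author_heading
            current = []
        elif keeping:
            current.append(line)
    if keeping and current:
        blocks.append("\n".join(current).strip())
    blocks = [b for b in blocks if b]
    word_count = sum(len(b.split()) for b in blocks)
    if word_count < min_words:
        warnings.warn(
            f"Only {word_count} author words; distributions will be directional, not precise."
        )
    return blocks
```

Anything that is not under the author's heading is dropped before analysis, so another speaker's habits never leak into the measured profile.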

The checkpoint at stage five exists because a statistic can be real and still not be a rule worth enforcing, and only the writer can make that call. The framework never invents an exemplar. Every passage it holds up as representative is drawn from the corpus itself.
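The relationship between mining and review can be pictured with a small data structure. This is a sketch only: the `CandidateRule` class, its fields, and the feature names are hypothetical. It assumes that a feature measured at zero becomes a candidate hard ban, that any other value becomes a candidate target, and that nothing is committed until the writer confirms it.

```python
from dataclasses import dataclass

@dataclass
class CandidateRule:
    feature: str
    kind: str                # "hard_ban" (never observed) or "target" (observed value)
    value: float
    confirmed: bool = False  # set only by the writer during review

def mine_candidates(profile):
    """Turn measured values into unconfirmed candidate rules."""
    rules = []
    for feature, value in profile.items():
        # Negative space: a shape the writer never uses becomes a candidate ban.
        kind = "hard_ban" if value == 0 else "target"
        rules.append(CandidateRule(feature, kind, value))
    return rules

def committed(rules):
    """Only rules the writer has confirmed reach the emitted profile."""
    return [rule for rule in rules if rule.confirmed]
```

Keeping `confirmed` false by default means that a statistic cannot become a rule by accident, which is the purpose of the stop at stage five.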

06. Verification

The framework does not trust its own output. After a profile is built, it holds out samples that were not used as exemplars, generates fresh text on the same topics in the new voice, and checks the result against the original on fixed thresholds.

| Check | Pass condition |
| --- | --- |
| Em-dash count | Matches the corpus, usually zero |
| Semicolon count | Matches the corpus |
| Sentence-length distribution | Within 20% of the corpus mean |
| Hedging density | Within 30% of the corpus rate |
| Hard-ban violations | Zero |

A failure on any threshold sends the responsible rule back to be sharpened rather than waving it through. This is the step that separates claiming a voice was captured from showing that it was.
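The threshold checks can be sketched as a single function. The dictionary keys and function names are hypothetical, and the sketch assumes that "within 20%" and "within 30%" are relative tolerances around the corpus value and that the count checks compare samples of comparable length. The source values are the ones reported in this case study. The generated values are invented to show a failure.

```python
def within(value, target, tolerance):
    """Relative tolerance check; a target of 0 can only be matched exactly."""
    return abs(value - target) <= tolerance * abs(target)

def verify(source, generated, hard_ban_violations):
    """Return each check's result and the names of the checks that failed."""
    checks = {
        "em_dash_count": generated["em_dashes"] == source["em_dashes"],
        "semicolon_count": generated["semicolons"] == source["semicolons"],
        "sentence_length": within(generated["mean_sentence_length"],
                                  source["mean_sentence_length"], 0.20),
        "hedging_density": within(generated["hedges_per_100"],
                                  source["hedges_per_100"], 0.30),
        "hard_bans": hard_ban_violations == 0,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return checks, failed

# Observed corpus values from this case study.
source = {"em_dashes": 0, "semicolons": 0,
          "mean_sentence_length": 33.0, "hedges_per_100": 0.23}
# Invented numbers standing in for a regenerated sample.
generated = {"em_dashes": 0, "semicolons": 0,
             "mean_sentence_length": 28.0, "hedges_per_100": 0.40}

checks, failed = verify(source, generated, hard_ban_violations=0)
print(failed)  # ['hedging_density']
```

Returning the failing check names, and not only a single pass or fail, is what lets the responsible rule be traced and sharpened.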

The thresholds are anchored to real observed values. The figures below come from one of the author’s published essays, so the method can be shown without exposing private writing.

| Measured feature | Value |
| --- | --- |
| Mean sentence length | 33.0 words |
| Sentences of 30 words or more | 60% |
| Sentences of 5 words or fewer | 0% |
| Commas per sentence | 2.0 |
| Em dashes | 0 |
| Semicolons | 0 |
| Hedges per 100 words | 0.23 |
| Boosters per 100 words | 0 |
| Moving-average type-token ratio | 0.745 |

These are observed values, not targets imposed in advance. Regenerated text has to land within the fixed tolerances around them, which is what turns a claim about voice into something a reader can check.

07. Outcome

Write-Like-Me is a reproducible voice system whose strength is the method and the knowledge behind it rather than the quality of any single imitation. It knows what makes a voice that voice at a level that can be measured and checked, which is a more durable thing than producing one passage that happens to sound right.

The framework supports multiple named voices, each with its own register baselines, so formal and casual registers stay separate and different writers never blend. It is public and authored end to end. The measurement code, the rule-mining workflow, and the verification protocol are the deliverable, not any one profile built with them.

The framework reports its own limits, which is part of why the measurements can be trusted. A small corpus produces directional rather than precise targets, the deepest linguistic features are documented but not yet computed, and rules derived from a single reviewer are flagged as such. Stating the boundary is what keeps the word "scientific" honest.

Imitation reproduces the surface of a voice and drifts within a paragraph. Measurement names the habits beneath it and holds them in place.