How We Analyzed 10,000 AI Posts to Map Writing Dialects
The methodology behind Bloomberry's AI Dialects research: how we collected, labeled, and analyzed thousands of AI-generated outputs to identify the consistent structural patterns that differentiate ChatGPT, Claude, and Gemini.
By Sadok Hasan
Research is only as credible as its methodology. When Bloomberry published findings about AI writing dialects (distinct, consistent patterns in how ChatGPT, Claude, and Gemini structure their outputs), the obvious question was: how do you actually know?
This post is the answer. It's a full account of how we collected our data, how we defined and measured the patterns we identified, what we found, and where the limitations of the research lie.
We're publishing this for two reasons. First, transparency: claims about how AI models write should be falsifiable, and that requires the methodology to be visible. Second, utility: if you're thinking about how to study AI writing systematically, whether you're a researcher, a journalist, or just deeply curious, the approach below might save you significant time.
The Research Question
The starting question for the AI Dialects research was observational. Writers who used multiple AI models consistently reported that the models "felt different": Claude produced more philosophical writing, ChatGPT was punchier, Gemini was more careful and less interesting. Were these impressions real, measurable, and consistent across different topics and formats?
The hypothesis we wanted to test: AI models produce distinct structural writing patterns that are consistent enough across topics and formats to constitute identifiable "dialects", and these dialects are separable from the content of what's being written, meaning the structural patterns appear regardless of whether you're writing about leadership, technology, cooking, or anything else.
Building the Prompt Corpus
The first methodological challenge was building a prompt set that would produce comparable outputs across models without introducing systematic bias toward any one model's strengths.
We wanted prompts that:
- Represented realistic professional writing use cases (not edge cases or deliberately adversarial inputs)
- Spanned multiple formats (LinkedIn posts, thought leadership paragraphs, short email copy, FAQ answers, executive summary paragraphs)
- Spanned multiple industries and topic areas to control for domain-specific vocabulary effects
- Were specific enough to produce substantial output but open enough to allow the model's natural writing style to emerge
We landed on a corpus of 200 base prompts, each submitted in five format variations, yielding 1,000 distinct prompt instances. Each instance was submitted to GPT-4o, Claude Sonnet, and Gemini Pro with identical temperature settings and no additional system prompts; the goal was to see each model's default output, not its optimized output.
This produced approximately 3,000 primary outputs. We then ran a second round of collection six months later to verify temporal consistency, bringing the total dataset to approximately 10,500 outputs.
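If you want to reproduce the collection step, a minimal sketch of the loop looks like this. The generate() helper is a hypothetical stand-in for each vendor's SDK call, and the model identifiers, format labels, and JSONL layout are illustrative assumptions rather than the exact values from our pipeline:

```python
# Minimal collection-loop sketch. generate() is a hypothetical stand-in
# for the vendor SDK call; model identifiers and format labels below are
# illustrative assumptions, not the exact values from our pipeline.
import itertools
import json

MODELS = ["gpt-4o", "claude-sonnet", "gemini-pro"]
FORMATS = ["linkedin_post", "thought_leadership", "email_copy",
           "faq_answer", "exec_summary"]

def generate(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Hypothetical wrapper: plug in the real API client for `model`."""
    raise NotImplementedError

def collect(base_prompts: list[str], out_path: str) -> None:
    """200 base prompts x 5 formats x 3 models -> ~3,000 JSONL rows."""
    with open(out_path, "w") as f:
        for prompt, fmt, model in itertools.product(base_prompts, FORMATS, MODELS):
            instance = f"Write a {fmt.replace('_', ' ')} about: {prompt}"
            text = generate(model, instance)  # identical settings per model
            f.write(json.dumps({"model": model, "format": fmt,
                                "prompt": prompt, "output": text}) + "\n")
```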
Annotation Framework: Defining AI Sentence DNA
The core methodological innovation in this research was what we call AI Sentence DNA: a structured annotation framework for the sub-sentence and structural features that characterize each model's writing.
Developing this framework was iterative. We started with a broad set of candidate features from existing NLP literature on authorship attribution and stylometric analysis, then added AI-specific features we identified through exploratory reading of the outputs.
The final annotation framework covers five categories:
1. Opening construction type. How does the output begin? Categories include: declarative statement, rhetorical question, conditional construction, narrative opening, definition opening, concession-then-claim, and direct instruction. We annotated the opening construction of every output independently from its content.
2. Hedging and qualification density. The frequency of qualifying phrases ("it's worth noting," "while it's true that," "depending on context," "this varies significantly") per 100 words. We developed a lexicon of approximately 80 hedging constructions and calculated a density score for each output; a minimal sketch of this calculation follows the list.
3. Structural format preference. The degree to which each output uses enumerated lists, headers, or other explicit structural markers versus flowing prose. Outputs were scored on a five-point scale from "fully prose" to "fully list-structured."
4. Vocabulary abstraction level. The ratio of abstract/general vocabulary to concrete/specific vocabulary. We used a combination of automated scoring against a concrete-abstract vocabulary taxonomy and manual annotation for domain-specific cases.
5. Rhetorical mode. The primary mode of argumentation: assertive (claims stated directly without qualification), exploratory (multiple perspectives examined), analytical (evidence and logic-forward), or narrative (story-driven). Each output was assigned a primary and secondary mode.
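To make the hedging-density score (category 2) concrete, here is a minimal sketch of the calculation, assuming simple case-insensitive phrase matching. The four lexicon entries shown are illustrative stand-ins for the full ~80-construction lexicon:

```python
# Sketch of the hedging-density score: hedging constructions per
# 100 words. The four entries below are illustrative; the full lexicon
# contained roughly 80 constructions.
import re

HEDGES = [
    "it's worth noting",
    "while it's true that",
    "depending on context",
    "this varies significantly",
]

def hedging_density(text: str) -> float:
    lowered = text.lower()
    hits = sum(len(re.findall(re.escape(h), lowered)) for h in HEDGES)
    n_words = len(text.split())
    return 100.0 * hits / n_words if n_words else 0.0
```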
Annotation was performed by a team of four annotators using a shared codebook. Inter-rater reliability was calculated as the mean pairwise Cohen's Kappa across 300 randomly sampled outputs annotated by all four raters. The overall Kappa was 0.73, indicating substantial agreement, which is acceptable for subjective linguistic annotation tasks.
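A sketch of that reliability calculation, assuming each rater's labels for the 300 shared outputs are stored as parallel lists keyed by rater name; scikit-learn's cohen_kappa_score handles each rater pair, and we average over the six pairs that four raters produce:

```python
# Mean pairwise Cohen's Kappa over the shared annotation sample.
# With four raters there are six pairs; labels must be aligned by output.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings: dict[str, list[str]]) -> float:
    """ratings maps rater name -> that rater's labels for the same outputs."""
    pairs = list(combinations(ratings.values(), 2))
    return sum(cohen_kappa_score(a, b) for a, b in pairs) / len(pairs)
```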
What the Data Showed
The results were clearer than we expected. Across all five annotation dimensions, the three models produced statistically distinguishable distributions.
Opening construction: GPT-4o used rhetorical question openings in 34% of outputs. Claude used them in 8%. Gemini used them in 11%. Declarative openings were Claude's most common type at 47%, compared to GPT-4o's 28%.
GPT-4o led with a rhetorical question hook more than four times as often as Claude, a gap that held across every topic and format we tested.
Hedging density: Claude's outputs had the highest hedging density across all format types, averaging 4.2 hedging constructions per 100 words. GPT-4o averaged 1.7. Gemini averaged 2.9.
Claude hedges roughly 2.5× as often as GPT-4o per 100 words. That single finding explains most of the "why does Claude feel like an essay?" complaints we hear from content creators.
Structural format: GPT-4o was the most list-forward model, with 61% of outputs scoring 3 or higher on the five-point list-preference scale. Claude was most prose-forward (73% of outputs scored 1 or 2). Gemini was intermediate.
Vocabulary abstraction: Claude produced the highest abstraction scores across all format types. GPT-4o produced the lowest (most concrete) scores. The difference was largest for LinkedIn posts, where GPT-4o's abstraction score was 0.38 versus Claude's 0.61 on a normalized scale.
On LinkedIn post prompts, Claude's vocabulary was 60% more abstract than GPT-4o's, measured on a normalized concrete-to-abstract scale. That's the "sounds like a philosophy paper" effect, quantified.
Rhetorical mode: GPT-4o's primary mode was assertive in 58% of outputs. Claude's primary mode was exploratory in 52% of outputs. Gemini's primary mode was analytical in 44% of outputs.
We named these configurations based on their most distinctive characteristics: the Motivator dialect (GPT-4o), the Philosopher dialect (Claude), and the Analyst dialect (Gemini).
Cross-Format Consistency
One of the most important findings was how consistent these patterns were across formats. We expected dialect patterns to be stronger in open-ended formats (LinkedIn posts, thought leadership paragraphs) and weaker in constrained formats (FAQ answers, bullet point summaries). The data didn't support this.
Claude's hedging density was elevated relative to GPT-4o in every format category we measured, including formats where constrained output length should theoretically reduce the opportunity for hedging. GPT-4o's list-forward preference persisted even in format types where lists weren't specified.
This consistency suggests the dialect patterns are deeply embedded in the models' output generation: they're not format-specific stylistic choices but persistent architectural tendencies.
The dialect patterns held across every format we tested, from open-ended LinkedIn posts to constrained FAQ answers. These aren't stylistic preferences. They're architectural defaults.
Cross-Topic Consistency
We also tested whether dialect patterns were consistent across topic domains. The worry here was that domain-specific vocabulary might account for the apparent differences: that Claude appears more philosophical because it's being asked philosophical questions, for example.
We controlled for this by including identical topic prompts (leadership, content creation, technology strategy, career advice) across all models and comparing model-to-model variation within topics against within-model variation across topics. Model-to-model variation was significantly larger, confirming that the dialect patterns are model-level phenomena rather than topic-level artifacts.
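As a sketch of that comparison, assume the annotated outputs sit in a pandas DataFrame with model, topic, and feature columns (the column names here are illustrative). For each topic we take the spread of per-model means, and for each model the spread of per-topic means; the dialect claim is that the first dwarfs the second:

```python
# Sketch of the consistency check on one feature (e.g., hedging density).
# model_to_model: how much models differ from each other within a topic.
# within_model:   how much one model drifts across topics.
import pandas as pd

def variation_comparison(df: pd.DataFrame, feature: str = "hedging_density"):
    cell = df.groupby(["model", "topic"])[feature].mean().reset_index()
    model_to_model = cell.groupby("topic")[feature].std().mean()
    within_model = cell.groupby("model")[feature].std().mean()
    return model_to_model, within_model  # dialects: first >> second
```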
The Connection to Architecture
The AI Dialects research started as an empirical question: do models write differently? The data confirmed they do.
The follow-up question (why do they write differently?) became the focus of our Vol. 2 research after Anthropic published its interpretability findings on functional emotional representations in Claude. The structural patterns we observed from the outside (qualification density, essayistic structure, abstract vocabulary) appear to be downstream of architectural features that Anthropic's team can now see from the inside.
The connection between the Philosopher dialect and Claude's functional emotional architecture is documented in our Vol. 2 research. The short version: the internal states that shape Claude's writing behavior produce measurable outputs that our annotation framework was capturing without knowing their source.
Limitations of the Research
This research has several meaningful limitations that should inform how you interpret the findings.
Snapshot in time. We conducted data collection at two points. Models are continuously updated, and the dialect patterns we identified may shift as training changes. We believe the architectural roots of the patterns make them more durable than surface-level features, but this should be verified with periodic re-collection.
No adversarial prompting. All prompts were designed to elicit natural outputs. Aggressive prompting (explicit instructions to write in a different style, chain-of-thought prompting, persona assignment) can shift dialect patterns substantially. The dialects we identified are default behaviors, not fixed constraints.
English only. Our prompt corpus was exclusively English-language. Dialect patterns may differ significantly across languages.
Three models. We covered the three major consumer-facing models at the time of collection. Several other models (Mistral, Llama variants, Grok) were not included. The framework is extensible but the findings are specific to the three we studied.
Annotation subjectivity. Despite reasonable inter-rater reliability, the annotation framework involves judgment calls. The hedging phrase lexicon in particular reflects our team's intuitions about what constitutes hedging β other researchers might define this differently.
What This Research Is Good For
The AI Dialects framework is useful for:
- Choosing the right model for a task. Knowing that GPT-4o defaults to the Motivator dialect and Claude defaults to the Philosopher dialect gives you a principled basis for model selection beyond general reputation.
- Editing AI outputs systematically. The annotation categories map directly to edit targets: you can check your AI outputs against the dialect patterns and make targeted revisions.
- Understanding voice loss. When AI writing doesn't sound like a specific person, the dialect patterns are usually part of the explanation: the model's characteristic construction has overridden the person's.
- Journalism and research. The methodology above is reproducible. Journalists covering AI writing differences, researchers studying model outputs, and developers building voice tools can all extend from this framework.
For the complete findings, including annotated examples and the full dialect taxonomy, see the AI Dialects research report. For the architectural explanation of why Claude writes the way it does, see Vol. 2 on emotional architecture. And if you want to understand how voice memory tools apply these insights to produce writing that sounds like a specific person rather than a dialect category, the complete guide to teaching AI your writing voice covers the practical application.
If you're a founder or professional trying to produce content that sounds like you (not like a dialect category), see how the best AI tools for personal branding apply voice learning in practice. The AI LinkedIn post generator is the fastest way to see it working directly.
Bloomberry's AI Dialects research is ongoing. If you're a researcher, journalist, or developer interested in extending this work, reach out through our research contact page.