
We Audited 57 AI Blog Posts to Google's Quality Rater Rubric

We scored 57 AI-assisted blog posts against Google's 2025 Quality Rater Guidelines. 50 Highest. 7 High. Here is the rubric and what failed.

Anil Gulecha
Ex-HackerRank, Ex-Google
TL;DR
  • We scored 57 AI-assisted blog posts at kalviumlabs.ai against an 8-dimension rubric built from Google's January 2025 Quality Rater Guidelines update: 50 Highest, 7 High, 0 Medium/Low/Lowest.
  • Four dimensions (Expertise, Authority, Main-Content Quality, User Intent) scored 3.00 across the corpus. Trust (2.63), AI-Tell Voice (2.70), Experience (2.86), and Originality (2.86) are where we lose points.
  • Every post scoring below 22/24 is in the Strategy category. Strategy posts drift toward positioning language when they are not anchored in a specific project, number, or dated event. That is the structural fix.
  • The Jan 2025 QRG specifically flags Q&A-as-structure outside FAQ sections, em-dash residuals, and paraphrased-without-value-add content as scaled-content signals. 3 of 57 tripped the first. 14 of 57 tripped the second. Zero tripped the third.
  • Rubric, scores.csv, findings.md, and the re-runnable extraction script are published at kalviumlabs.ai/audit/. Re-run it yourself.

We run an AI-assisted blog publishing pipeline. A post-writer agent drafts every post from a queue. A content-reviewer agent checks it against a quality rubric. A validator script enforces structural rules. A human editor (usually Venkat) does a final pass. Three to five external sources per post. Two posts per day, seven days a week.

That pipeline has been running since March 2026. The January 2025 Search Quality Rater Guidelines update (PDF) was the largest rewrite of the Effort and scaled-content sections since the Helpful Content system shipped. The word “paraphrased” jumped from 3 mentions in the previous version to 25. A new scoring axis called Effort, Originality, and Value-Add was formalized. Specific patterns (Q&A-as-structure outside FAQ sections, em-dashes, template transitions, certain AI-tell vocabulary) were called out by name as signals raters are now trained to flag.

We wanted to know how the current state of the blog measures against that rubric. Not a spot check. Every post.

This is what we found, what we fixed before publishing this, and how to re-run the audit yourself.

The headline result

Verdict | Score range | Count | %
Highest | 22-24       | 50    | 88%
High    | 18-21       | 7     | 12%
Medium  | 13-17       | 0     | 0%
Low     | 8-12        | 0     | 0%
Lowest  | 0-7         | 0     | 0%

The lowest-scoring post in the corpus sits at 19/24, which still falls in the High band on the QRG mapping. No post graded below High. The distribution is concentrated at the top: 31 of the 57 posts scored the maximum 24/24, and another 19 scored 23/24.

We did not build this rubric to grade ourselves kindly. It exists because we wanted a defensible, programmatically-extractable metadata layer plus a manual-read scoring layer we could re-run against any update to the guidelines. The full scoring methodology, the per-post scores, the objective metrics, and the full findings report are all published at kalviumlabs.ai/audit/. If you disagree with any score, the evidence is in the same repository.

How the rubric works

Eight dimensions, zero to three each, total zero to twenty-four, mapped to the five QRG verdict bands.

The eight dimensions:

  1. E1 Experience. First-hand evidence. Named projects, real numbers, failure modes encountered. The QRG’s 2025 update is explicit that AI cannot fake first-hand experience. This is the single most diagnostic dimension for AI-assisted content.
  2. E2 Expertise. Domain depth. Trade-off awareness. Opinions grounded in practice.
  3. A Authoritativeness. Why trust this source. Named author, credentials visible, topic-author-site alignment.
  4. T Trustworthiness. Accuracy, citations, honesty. External sources for non-trivial claims. Explicit limitations acknowledged.
  5. O Originality. New information, new framework, or first-hand data that adds to what is online. The QRG’s paraphrase-count explosion makes this critical.
  6. MC Main Content Quality. Structure, depth, hierarchy appropriate to query intent.
  7. UI User Intent. Title delivers on what the content provides. Query intent satisfied.
  8. AI-T AI-Tell Resistance. Specific to AI-assisted content. Tests for the patterns raters are now trained to catch: em-dash residuals, Q&A-heavy structure outside FAQ sections, template transitions, and the AI-tell vocabulary Google Search Central flags by name (the full blacklist is reproduced in our validator).

Each dimension is scored 0 (absent), 1 (weak), 2 (solid), 3 (exemplary). Totals map to QRG verdicts at 22-24 Highest, 18-21 High, 13-17 Medium, 8-12 Low, 0-7 Lowest. Binary flags (TL;DR present, FAQ present, byline present, internal-link count, external-citation count, em-dash count, AI-tell word count) are tracked separately and feed into the scoring but do not add to the 0-24 total directly.
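The band mapping itself is a straight threshold walk. A minimal sketch (illustrative only, not the shipped audit code):

```bash
# Illustrative helper, not the shipped audit code: map a 0-24 rubric
# total to its QRG verdict band.
verdict_for_total() {
  local total=$1
  if   (( total >= 22 )); then echo "Highest"
  elif (( total >= 18 )); then echo "High"
  elif (( total >= 13 )); then echo "Medium"
  elif (( total >= 8  )); then echo "Low"
  else                         echo "Lowest"
  fi
}

verdict_for_total 19   # prints "High": the floor of this corpus
```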

The full rubric with per-band criteria and 2025 QRG references is at kalviumlabs.ai/audit/rubric.md.

Where the points live (and where they don’t)

Per-dimension means across the 57-post corpus:

Dimension         | Mean | Posts below 3 | What that means
E2 Expertise      | 3.00 | 0             | Every post shows domain depth beyond definitional level.
A Authority       | 3.00 | 0             | Named author, credentials visible, topic-site alignment clean.
MC Quality        | 3.00 | 0             | Structure, depth, and hierarchy consistently strong.
UI Intent         | 3.00 | 0             | Titles deliver what the content provides. No bait-and-switch.
O Originality     | 2.86 | 8             | Some Strategy posts lean on borrowed analogies (chef, gym, CFO-hire).
E1 Experience     | 2.86 | 8             | Eight posts present anecdotes in third-person or abstract framing.
AI-T Voice        | 2.70 | 15            | Em-dash residuals plus occasional PAA-like Q&A structure.
T Trustworthiness | 2.63 | 20            | 35% of posts under-cite external sources for the claims they make.

Four dimensions at a perfect 3.00 across all 57 posts is not an accident. It is what the pipeline is built to enforce. The post-writer agent pulls voice samples per author. The content-reviewer agent runs a ten-point quality check. The validator script hard-fails the build if frontmatter, structure, or AI-tell rules are violated. None of those four dimensions (Expertise, Authority, Main Content Quality, User Intent) require first-hand evidence to score, which is why they are easier for an AI pipeline to hit.

The four dimensions where we lose points all do require first-hand evidence, honest sourcing, or distinctive voice. Trust is the biggest gap: 20 of 57 posts score below 3 on trustworthiness, most because they made claims that would benefit from another external citation the writer did not include. AI-T Voice is second: 15 of 57 carry residual em-dashes or a Q&A-structure pattern outside their FAQ section. Experience and Originality tie at 2.86.

The category breakdown tells us something

Category     | n  | Mean | Min | Max
Case Studies | 11 | 23.8 | 23  | 24
Technical    | 16 | 23.6 | 23  | 24
Insights     | 15 | 23.0 | 21  | 24
Strategy     | 15 | 21.9 | 19  | 24

Every post scoring below 22/24 is in the Strategy category. Case Studies and Technical posts are structurally protected: a case study has to name a project with real numbers to exist as a case study; a technical post on agentic AI failure recovery is naturally anchored in our six production agent deployments. Strategy posts are the most essay-style, the least anchored in specific builds, and the most tempted into borrowed framings (chef analogies, gym memberships, CFO hiring). That temptation is the structural problem.

The 7 sub-Highest posts, named

All seven are in the Strategy category. Three root causes, usually combined:

  1. Em-dash residuals (six of the seven had four or more em-dashes before the backfill).
  2. Q&A-heavy structure in non-FAQ sections (three had non-FAQ question-H3 ratios above 40%).
  3. Positioning-piece framing without strong first-person project anchors (all seven).

Slug                                               | Score | Primary weakness (pre-fix)
how-to-choose-an-ai-development-company            | 19    | 6 em-dashes plus 55% non-FAQ Q&A ratio. Chef analogy opening.
ai-development-india-why-startups-choose-bangalore | 20    | 1 first-person mention in 2,324 words. Three third-person anecdotes.
ai-development-services-what-you-actually-get      | 20    | 4 em-dashes. Gym-membership analogy. Two AI-tell words in a vendor-example list.
ai-for-gulf-market-uae-saudi-startups              | 20    | 1 external citation despite specific UAE PDPL and Saudi PDPL claims.
hire-ai-developers-full-time-vs-agency             | 21    | CFO-hire analogy is borrowed framing. 10 first-person mentions.
200-ai-engineers-delivery-speed                    | 21    | 2 em-dashes. Positioning-piece framing.
5-questions-i-ask-every-client-before-code         | 21    | 8 em-dashes (the highest count in the corpus).
The last one is diagnostic of how mechanical some of these failures are. 5-questions-i-ask-every-client-before-code has 89 first-person mentions (the second-highest in the entire corpus), a specific-founder opening, and a clean framework structure. On every dimension except AI-T Voice it would have been a 24. Eight residual em-dashes held it at 21.

What the 2025 QRG specifically looks for

Three patterns from the January 2025 update that raters are now trained to flag. The audit measured each:

Paraphrased-without-value-add content. The QRG’s “paraphrased” count went from 3 mentions in the previous version to 25 in this one. Zero posts in our corpus are paraphrase-heavy. Every post carries first-person voice, named cases, specific numbers, or original framework content. The content-reviewer agent plus voice samples plus validator rules together prevent the pattern. This is the single most important test a 2025-aware AI pipeline has to pass, and it is the one we pass cleanest.

Scaled-content abuse via Q&A stuffing. The QRG now explicitly flags heavy Q&A structure outside FAQ sections as a scaled-content signal (the People-Also-Ask covering pattern where every H3 is a question the writer hopes to rank for). We measured this as the proportion of non-FAQ H3 headers that are questions. 54 of 57 posts show zero non-FAQ Q&A. Two posts use a deliberate rhetorical “Four Questions That Decide It” framing where the Q&A structure is the point, not PAA-stuffing. Only one post (how-to-choose-an-ai-development-company) compounded Q&A stuffing with em-dash residuals, which is why it scored lowest.
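For reference, here is a hedged sketch of how that ratio can be computed. It assumes markdown posts where H3s start with "### " and the FAQ section opens with a literal "## FAQ" heading; the shipped extraction script may differ in the details:

```bash
# Hedged sketch of the non-FAQ Q&A ratio metric (assumes "### " H3s and
# a "## FAQ" heading; the published extraction script may differ).
qa_ratio() {
  local post="$1" body h3 q
  body=$(sed '/^## FAQ/,$d' "$post")                 # drop the FAQ section
  h3=$(printf '%s\n' "$body" | grep -c '^### ')      # all non-FAQ H3s
  q=$(printf '%s\n' "$body" | grep -c '^### .*?$')   # H3s ending in "?"
  if [ "$h3" -eq 0 ]; then echo 0; else echo $(( 100 * q / h3 )); fi
}

qa_ratio content/posts/example.md   # hypothetical path; prints a percentage
```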

Low-effort AI tells (em-dashes, template transitions, AI-tell vocabulary). Fourteen posts in the corpus (25%) had at least one em-dash before the backfill. Two posts had Tier 1 AI-tell vocabulary (the specific flagged terms are in our validator blacklist), both inside “what mediocre vendors say” example lists, which is benign in context but still trips content-lint. The em-dash count has now been zeroed out across the corpus. The flagged words have been replaced with non-triggering framings.

The fixes we shipped before publishing this post

Publishing an audit without shipping the fixes first would have made this post a marketing piece. So we ran two tiers of fixes in the 24 hours before this went live. Both are committed to the public website repository.

Tier 1: mechanical, sitewide (commit 15ffe2b, 2026-04-21)

Em-dash backfill. All 57 posts now carry zero em-dashes. Fourteen posts had at least one em-dash before the pass. The fix was a sed replacement with context-aware substitutions (period, colon, or comma depending on sentence structure) plus a content-reviewer re-check on each touched post.
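A simplified sketch of that pass, with the caveat that the shipped sed carried context rules for the period-versus-colon cases and every touched post went back through the content-reviewer agent (the content path here is illustrative):

```bash
# Simplified em-dash backfill (illustrative; the shipped pass used
# context-aware substitutions, not a blanket comma).
grep -rl -e '—' content/posts/ | while read -r post; do
  sed -i 's/ — /, /g; s/—/, /g' "$post"   # naive comma fallback
  echo "backfilled: $post"                # queue for the re-check
done
```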

External citation pass on ai-for-gulf-market-uae-saudi-startups. The post made specific claims about UAE PDPL, Saudi PDPL, SDAIA, DIFC, and ADGM data residency rules without linking to authoritative sources. It now carries five external citations to the actual regulatory pages.

Validator hardening. The validate-post.sh script that sits between the post-writer agent and the deploy pipeline now has three new hard-fail checks that map directly to the three Jan 2025 QRG red flags:

  • Minimum two external citations per post (previously a warning, now a hard fail)
  • Minimum 15 first-person mentions per 2,000 words (previously not measured)
  • Non-FAQ Q&A ratio above 50% is a hard fail, above 30% a warning (previously not measured)

These three checks run on every post the pipeline produces going forward. A post that fails any of them does not ship.
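A minimal sketch of what those checks can look like inside a validator. The real validate-post.sh is in the public repository; the word matching here is deliberately crude, and qa_ratio is the sketch helper from earlier in this post:

```bash
# Sketch of the three new hard-fail checks, not the shipped validate-post.sh.
post="$1"
words=$(wc -w < "$post")

# 1. Minimum two external citations: markdown links off our own domain.
ext=$(grep -oE '\]\(https?://[^)]+\)' "$post" | grep -vc 'kalviumlabs\.ai')
[ "$ext" -lt 2 ] && { echo "FAIL: only $ext external citations"; exit 1; }

# 2. Minimum 15 first-person mentions per 2,000 words (crude word match).
fp=$(grep -oiwE 'I|we|our|my' "$post" | wc -l)
min_fp=$(( words * 15 / 2000 ))
[ "$fp" -lt "$min_fp" ] && { echo "FAIL: first-person $fp < $min_fp"; exit 1; }

# 3. Non-FAQ Q&A ratio: hard fail above 50%, warning above 30%.
ratio=$(qa_ratio "$post")
[ "$ratio" -gt 50 ] && { echo "FAIL: non-FAQ Q&A ratio ${ratio}%"; exit 1; }
[ "$ratio" -gt 30 ] && echo "WARN: non-FAQ Q&A ratio ${ratio}%"
echo "PASS: $post"
```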

Tier 2: targeted rewrites of the lowest Strategy posts (this session)

We rewrote the three lowest-scoring Strategy posts in the corpus. The changes are surgical, not full rewrites. Each post was already structurally sound; each had a specific failure mode the audit identified.

The clearest before/after is ai-development-india-why-startups-choose-bangalore, which scored 20/24 with zero first-person mentions in 2,324 words despite being about our own work:

Before (opening):

A Series A founder in San Francisco was paying $340,000 per year for a single senior AI engineer in the Bay Area. His product had one AI feature working. His runway was 14 months. He moved to a Bangalore-based studio and shipped four AI features in the following three months at a fraction of the cost.

That is a third-person anecdote. No first-person voice, no named “we worked with” framing, no signal to a reader that the story came from our own founder conversations. A Google rater seeing that opening would not score it on the Experience dimension the way they would score a first-hand narrative.

After the Tier 2 rewrite:

One US-based founder we worked with had been budgeting $340,000 per year of total compensation for a single senior AI engineer in the Bay Area. His product had one AI feature working. His runway was 14 months. The pricing math simply did not close. He moved his next build cycle onto our Bangalore pods and we shipped four AI features in the following three months at roughly 15% of his original loaded cost.

Same underlying facts, but now it is our founder conversation, our pod deployment, and our result. The rewrite took the post from 0 first-person mentions to 20, added a second external citation (Stanford HAI’s AI Index 2024), and sprinkled “we” and “I” through the body where our experience is the actual source of the observation. Expected new score: 23-24. The other two rewrites (how-to-choose-an-ai-development-company and ai-development-services-what-you-actually-get) followed the same pattern: replace borrowed analogies with real project anchors, reframe question-structured H3s as non-question statements where the Q&A density was flagged, and lift first-person density above the new 15-per-2000-words threshold.

What this audit does not measure

This is where an audit blog post usually gets defensive. We will not.

It does not measure SEO performance. Ranking is a function of domain authority, link equity, query intent, and SERP competition, not just content quality. Our performance data lives in our weekly GSC-and-Cloudflare snapshot at data/seo/learnings.md. A post can score 24/24 on the rubric and still get zero impressions because the query has no volume, or the SERP is dominated by Wikipedia and a well-established competitor. Quality and ranking are correlated, not identical.

It does not measure conversion. PostHog tracks which posts drive book-a-call clicks. That is a separate number, and it does not correlate cleanly with the rubric score. Some of our highest-converting posts score 23/24; some score 24/24 and convert nothing.

It is not third-party validated. It is an internal audit with the biases that implies. Seven of the eight dimensions are derived from the QRG (the eighth, AI-Tell Resistance, is our own addition), and the scoring is our own manual read on a 0-3 scale. Any third party re-running it should expect a different score distribution because calibration differs.

It is not representative of “AI content” generally. It is representative of the Kalvium Labs pipeline: post-writer agent with per-author voice samples, content-reviewer agent, hard-fail validator, human editorial pass, Google Ads Keyword Planner-driven topic queue. The result is not generalizable to AI content written by a one-shot prompt. That was the whole point of building the pipeline this way (full pipeline writeup).

It does not prove the content is worth reading. Rubric scores are a necessary-but-not-sufficient condition. A post can be QRG-compliant and still be boring. We try to catch that in the human review pass, but rubric scores alone do not certify usefulness.

Re-run it yourself

The rubric, the per-post scores, the objective metadata, and the extraction script are published at kalviumlabs.ai/audit/:

  • Rubric. The 8 dimensions, scoring bands, and QRG references.
  • Scores CSV. Per-post scores across the 8 dimensions plus verdict plus flag note, for all 57 posts.
  • Metadata CSV. The objective signals (word count, em-dash count, first-person density, non-FAQ Q&A ratio, citation counts).
  • Findings report. The long-form writeup this blog post summarizes.
  • Extraction script. Regenerates the metadata CSV from the current state of the corpus.

If you want to score us harder than we scored ourselves, the evidence is there. If you want to run the same methodology on your own corpus, the rubric file is short enough to adapt in an afternoon. The January 2025 QRG itself is public (Google Search Central’s E-E-A-T documentation plus the QRG PDF linked earlier; the Helpful Content system documentation sits underneath it), and every AI-content operator should be running this kind of audit on their own work.

FAQ

Why audit AI-written content against a Google rubric at all?

Because the January 2025 Quality Rater Guidelines update was specifically calibrated around the patterns AI content without human editorial effort tends to produce. Paraphrase-heavy content. Q&A-stuffed structure. Em-dash-heavy prose. Template transitions. If your AI pipeline is producing those patterns, your content is being scored against a rubric explicitly trained to flag them. Running the audit tells you whether your pipeline output looks like effort-and-originality content or like scaled-content abuse. It is the fastest way to find out before Google’s ranking systems do.

Did you use AI to score the audit?

No. The objective metadata (word counts, em-dash counts, citation counts, first-person density, non-FAQ Q&A ratio) was extracted programmatically via a shell script. The 8-dimension 0-24 scoring was done by a human reader with the rubric open. Using AI to score AI is a circular exercise that produces impressive-looking but meaningless results. The manual read is what makes the audit defensible.

Would the same rubric work for content that is not AI-assisted?

Yes, mostly. The seven non-AI-specific dimensions (Experience, Expertise, Authority, Trust, Originality, MC Quality, User Intent) map cleanly to any editorial evaluation. The AI-Tell Resistance dimension is specific to AI-assisted content and would be irrelevant for purely human-written posts. If your content operation is 100% human-written, drop that dimension and score out of 21 instead of 24, with the verdict bands rescaled proportionally.

How often should we re-run this kind of audit?

We are planning to re-run ours every quarter, or whenever the QRG ships a substantive update. Quarterly is enough cadence to catch pipeline regressions before they compound across dozens of posts. The expensive part is the manual-read pass, which for 57 posts took one engineer about a day. That cost scales linearly with corpus size but only with the newly-added posts if you keep the scoring CSV up-to-date between audits.

Does this audit change anything about how we publish content going forward?

Yes, three things. First, the validator now hard-fails three new checks (external citations, first-person density, non-FAQ Q&A ratio) that map directly to the Jan 2025 QRG red flags. Second, the queue-guardian agent now requires every Strategy topic to carry an internal project anchor or named external source before it enters the publishing queue, which should prevent the “positioning-piece without specifics” failure mode that created the seven sub-Highest posts. Third, we will re-audit quarterly and publish each re-audit.


Building AI content or tooling and want to know whether what you are shipping passes Google’s 2025 rubric? We run this audit as part of our Content Engine engagements. Or book a 30-minute call and we will tell you honestly what we would look at if we ran the audit on your corpus.

Tags: quality rater guidelines · ai content seo · search quality rater · ai content quality · seo


Written by

Anil Gulecha
Ex-HackerRank, Ex-Google

Anil reviews every architecture decision at Kalvium Labs. He's the engineer who still ships code, making technical trade-offs on RAG vs fine-tuning, model selection, and infrastructure choices. When a CTO evaluates us, Anil is the reason they trust the work.


