So, your site has llms.txt. It has schema.org JSON-LD. It has clean .md routes and an AGENTS.md. And ChatGPT, Claude, and Perplexity still can't see it. The reason might not be your AI files at all. It's the heap of technology in front of and powering your site (a CDN, a WAF, a slow origin server) that keeps AI crawlers from ever reaching your content.
TL;DR: Four things block your AI visibility, in this order:
- A CDN or WAF rule that blocks AI crawler user-agents (GPTBot, ClaudeBot, PerplexityBot)
- CAPTCHAs or bot challenges on public content pages that AI crawlers can't pass
- A slow origin server (TTFB over 1 second) that makes AI crawlers give up before your content arrives
- JavaScript-only rendering that returns an empty HTML shell to non-browser fetchers
You can run the AI Readiness Check to see whether your site is reachable, fast enough, and serving real HTML to crawlers. For the full diagnosis (including the CDN and WAF audit it can't do for you), read on.
Last week I shipped that tool and started running it against real sites. Sites I expected to score well did not. Sites that had clearly invested in AI visibility surprised me by returning a score of F. And I mean a LOT of sites. Not because they hadn't done the work. The problem was getting access to the AI files: the tool was getting 400s, 500s, JavaScript challenges, and request timeouts before it could see any of the work. ChatGPT, Claude, and Perplexity get similar treatment to what my scanner sees, since they use the same kind of plain HTTP fetch, with no JavaScript execution and no human-shaped fingerprint. The scanner won't catch every UA-specific block (those need direct testing, covered later in this post), but for performance issues and blanket challenges, what blocks one blocks them all.
You can put the most beautifully crafted llms.txt on the internet behind a security rule that blocks AI crawlers, and to ChatGPT it doesn't exist. You can have a perfect schema.org graph with a complete set of .md routes, and if your origin server takes 4 seconds to send the first byte, GPTBot already gave up and moved on.
This post is the companion to "The Complete List of AI Files Your Website Needs in 2026". That one was "what files to add." This one is "what's in front of those files preventing them from being read."
A quick note on browsing agents. A growing class of AI tools (ChatGPT Agent, Claude's Computer Use, Perplexity Comet, Gemini browsing) drive a real browser instead of making raw HTTP requests, so they can pass JavaScript challenges and some CAPTCHAs. But they're not what cites you in AI answers. The indexing and retrieval bots that do that are still HTTP fetchers. The rule of thumb holds: a CAPTCHA in front of public content blocks the traffic you most want for AEO (answer engine optimization).
The 8 layers between an AI crawler and your content
When an AI crawler makes a request to your site, the request can travel through up to eight independent layers before it ever touches your actual page content. Any one of them can slow it down, challenge it, or reject it outright:
- DNS resolution (your registrar / DNS provider).
- The CDN edge (Cloudflare, Fastly, Akamai, AWS CloudFront, Bunny, Vercel Edge, etc.).
- A WAF or bot management layer (Cloudflare Bot Management, AWS WAF, Akamai Bot Manager, Imperva, DataDome, HUMAN/PerimeterX, Sucuri, etc.).
- A rate limiter (often in the same stack).
- A challenge or CAPTCHA service (Turnstile, reCAPTCHA, hCaptcha, vendor-specific JS challenges).
- Your origin's reverse proxy (Nginx, Caddy, your hosting provider's edge).
- Application-level security (Wordfence, Sucuri Security plugin, Cloudways/Kinsta firewall rules, custom middleware).
- The actual origin server (what most consider their "website") with its own response time, render path, and database.
Most site owners only know two or three of those eight layers well. The other five are usually defaults set by a hosting provider, a one-click WordPress installer, a "secure your site!" walkthrough from 2022, or a DevOps engineer who left the company.
This is not an attack on any particular vendor or setup. The vendors are doing what their customers asked for: stop bots. They don't always know that many AI crawlers are good bots, and customers often don't know what defaults are running on their behalf.
Does blocking AI bots hurt your Google rankings?
AI-specific bot rules generally should not hurt traditional SEO. The dominant managed rules from major CDN/WAF vendors target AI user-agents (GPTBot, ClaudeBot, CCBot, Bytespider, etc.) and explicitly exclude verified search engines like Googlebot and Bingbot. Flipping those rules on doesn't directly tank your Google rankings, and the common assumption that it does is worth correcting.
What absolutely does hurt SEO are the broader protections that often live in the same stack. "Under Attack" or interstitial JS challenge modes issue a multi-second JavaScript check to every visitor, and neither Googlebot nor AI crawlers can pass that kind of challenge. Aggressive WAF security presets like "High," "I'm Under Attack," or "Strict" have been reported to flag a meaningful share of legitimate Googlebot traffic as malicious, and browser integrity checks can block mobile Googlebot specifically since it doesn't always execute the same JS check the desktop crawler does.
Rate limits configured by IP range can catch Googlebot too, since Google's crawl infrastructure operates from a relatively concentrated range of US IPs. Geo-IP blocking that excludes North America frequently catches Googlebot and almost every AI crawler in the same blast radius. And application-level firewalls (Wordfence, Sucuri, ModSecurity rule sets) often ship with rules that treat any unusual user agent as suspicious.
Then there's performance, which hits both SEO and AEO. Page speed and Core Web Vitals are confirmed Google ranking signals, so slow LCP and slow TTFB hurt your rankings even if every bot can technically reach you. AI crawlers are also far less patient than Googlebot. SEO and performance vendors consistently report timeouts in the 1 to 5 second range, with the page being abandoned if complete HTML hasn't been delivered in that window. The AI companies themselves don't publish exact thresholds, but the directional claim is well established: much shorter than Googlebot, and they don't retry.
So, bot protection misconfigurations hurt SEO. Performance hurts both SEO and AEO. The AI-specific block hurts only AEO. Three different problems, three different fixes.
Why your site isn't showing up in ChatGPT, Claude, or Perplexity answers
In my experience running scans, four issues turn an A-grade site chock full of AI files into an empty box for ChatGPT, Claude, and Perplexity. In rough order of how often I see them, and how badly they hit:
1. Default-on AI bot blocking at your CDN or WAF
What it does: A number of CDN/security platforms have rolled out one-click "block AI bots" rules over the past two years. Cloudflare's managed rule is the most visible example, but Fastly, Akamai, and others ship similar toggles. The mechanics vary, but the effect is the same: requests with user-agents like GPTBot, ClaudeBot, CCBot, Meta-ExternalAgent, Bytespider, Amazonbot, and a long list of others get rejected at the edge, before robots.txt, before your origin, before anything else. Sometimes the rule is enabled by default for new accounts. Sometimes it gets enabled when an admin checks "yes, secure my site" during onboarding.
Why it's the biggest problem in 2026: The economics are real and ugly. AI crawlers visit publisher sites constantly to gather content. The AI tools they feed answer users directly, so very few of those users ever click through to the publisher. Cloudflare and several large publishers have reported crawl-to-referral ratios in the thousands or tens of thousands to one. The vendors didn't add these toggles to be hostile; they added them because publishers were being crawled aggressively without seeing any of the traffic that would normally justify it. The blanket-block toggle is a reasonable response to that math for the average publisher. It's a terrible response if your business model depends on showing up in AI answers.
How to find it: This is the layer most worth auditing first. The check is the same regardless of vendor: log into whatever sits in front of your origin and look for an "AI bots," "AI crawlers," or "AI scrapers" rule. That could be your CDN/edge (Cloudflare, Fastly, Akamai, AWS CloudFront, Bunny, Vercel), a dedicated bot manager (DataDome, HUMAN, Imperva, Radware), or host-integrated security (WP Engine, Kinsta, Cloudways, Pantheon). If the rule exists, look at the per-bot table (most modern dashboards have one). Anything showing "Block" is invisible to that AI service.
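If you'd rather see the effect from the outside before you go dashboard digging, a small loop from the terminal shows which user-agents get rejected at the edge. This is a sketch: the UA strings below are simplified versions of the real ones and they drift over time, so check each vendor's bot documentation before trusting a long-running test.

```bash
# Request the same page as several AI crawler user agents and print the
# HTTP status each one gets back. 200 = reachable; 403/503 = blocked or
# challenged somewhere in the stack. UA strings are simplified sketches.
URL="https://yoursite.com/your-best-page"
for UA in \
  "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  "Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)" \
  "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" \
  "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/guides/bots)"; do
  STATUS=$(curl -A "$UA" -o /dev/null -s -w "%{http_code}" "$URL")
  printf "%s  %s\n" "$STATUS" "$UA"
done
```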
The fix: Don't blanket-allow or blanket-block. Split by purpose. The pattern most publishers should run is:
| User-agent | Purpose | Recommended action |
|---|---|---|
| OAI-SearchBot, ChatGPT-User | ChatGPT search index, real-time fetch | Allow |
| PerplexityBot, Perplexity-User | Perplexity search and live answers | Allow |
| ClaudeBot, Claude-SearchBot, Claude-User | Claude indexing and live fetch | Allow |
| Googlebot, Bingbot | Traditional search (also feeds AI Overviews and ChatGPT/Bing) | Allow |
| GPTBot | OpenAI training | Allow if you want max visibility, block if you want to opt out of training |
| CCBot | Common Crawl, used to bootstrap many models | Block if you've already opted out of training |
| Google-Extended, Applebot-Extended | AI training only, separate from search | Block if opted out of training (does not affect search) |
| Bytespider | TikTok/ByteDance | Most publishers block; minimal upside |
The point: there's no single right answer, but there is a wrong one: flipping a master toggle to "block everything" without realizing the search and retrieval bots are in the same list as the training ones.
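Expressed at the robots.txt layer, the "allow search and retrieval, opt out of training" variant of that table looks roughly like the sketch below. Keep in mind this is the polite layer only (more on that in the robots.txt note further down): it asks, while the CDN/WAF rules above do the enforcing, so both layers need to agree.

```
# Illustrative sketch only; adjust to your own policy
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```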
2. CAPTCHAs and bot challenges blocking AI crawlers on public pages
What they do: Three classes of mechanism, all sharing the same failure mode for AI tools. They assume a real browser, real JavaScript execution, and a human-shaped fingerprint. AI crawlers have none of those.
The first is heuristic bot-detection modes. Many WAFs and bot managers ship "fight mode" presets that challenge anything looking automated. Useful for credential stuffing. Catastrophic for legitimate AI crawlers, which by definition look automated.
The second is "Under Attack" interstitials, designed for active DDoS situations. They issue a multi-second JS check to every visitor before letting them through. AI crawlers can't pass it. Googlebot can't pass it. Real users complain. If you ever turned this on for a bad night and forgot to turn it off, your AI visibility is currently zero.
The third is CAPTCHA and challenge widgets in front of public content pages. Cloudflare Turnstile, Google reCAPTCHA, hCaptcha, DataDome. These are all solid products doing what they're built to do: detect non-browsers and non-humans. Putting them on a login form, a comment box, or a checkout page is exactly correct. Putting them in front of the body of a blog post, a product page, a docs page, or a .md route makes that content invisible to every AI tool, full stop.
Where bot challenges are fine: in front of action endpoints, like POST handlers, sign-ups, comments, login.
Where they're not fine: in front of GET requests for public content (your web pages, your sitemap.xml, and so on).
The fix:
- Use bot challenges only on action endpoints, never on public GET routes.
- If you must protect public pages, configure a "skip" or "allow" rule for verified bots that fires before any challenge.
- Keep interstitial and "under attack" modes available for active incidents, but make sure they're off during normal operation. They're emergency tools, not steady-state defaults.
- If you use a bot-fight or heuristic mode, scope it: exclude paths under `/blog/`, `/docs/`, `/articles/`, and so on.
Verifying any of this takes one curl command; a quick version is below, and the diagnosis section later in this post walks through the full set of checks.
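A minimal sketch of that check: fetch a public page the way a crawler would and look for interstitial markers in the body. The marker strings below are common Cloudflare-style wording; your vendor's may differ.

```bash
# Fetch a public content page with a non-browser user agent and flag any
# challenge interstitial in the response body.
curl -s -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" \
  https://yoursite.com/blog/some-post \
  | grep -iE "just a moment|checking your browser|captcha" \
  && echo "Challenge detected on a public GET route" \
  || echo "No obvious challenge markers in the body"
```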
3. Slow TTFB and server response times
This is the blocker with the most surface area, because every site has a performance budget and many sites have spent it.
The numbers AI crawlers actually expect: One caveat before the numbers: AI companies don't publish exact crawler thresholds. The figures here are the working consensus across SEO consultancies and crawler-monitoring vendors, who have converged on roughly the same numbers without much in the way of a primary source. Treat them as directionally right rather than authoritative, and tighten further if your traffic suggests it.
The timeout window is most commonly cited at 1 to 5 seconds, with some sources putting the ceiling at 10. AI crawlers don't retry the way Googlebot does, so whatever you serve in that window is the only shot they take. For TTFB, reported recommendations range from 200ms (aggressive) to 800ms (the absolute ceiling); anything above 1 second is risky. HTML payload should stay under 1MB, since bigger pages risk truncation or being skipped entirely. And on Core Web Vitals, aim for LCP under 2.5s, CLS under 0.1, and a healthy INP. Those are confirmed Google ranking signals, so this isn't an AI-only concern.
What slows you down in practice: Most often it's an origin on a cheap shared host with no edge caching, since a huge portion of the internet is set up that way to keep hosting costs down. Hosting in a region far from where AI crawlers make their requests (most operate out of US East) adds latency on top of that. Then there's the application layer: database-backed page generation with no caching layer, render-blocking JavaScript in the head, 4MB hero images served uncompressed, and "headless CMS" architectures with three serverless function hops between request and HTML. Any one of those will push you past the AI crawler timeout window. Stack two or three together and you're invisible.
The fix:
- Put a CDN in front of your origin. Plenty of free or inexpensive options solve the geographic problem on their own. And this one change can make your human visitors' experience far better, too!
- Cache aggressively. AI crawler traffic is not personalized; everyone gets the same response. Cache it.
- Compress everything (Brotli for HTML, AVIF/WebP for images, gzip for JSON).
- Run `curl -w "%{time_starttransfer}\n" -o /dev/null -s https://yoursite.com/your-best-page`. If it returns more than 0.8 seconds, you have work to do.
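Two quick spot checks that cover the caching, compression, and timing points in the list above. A sketch, assuming a Unix-ish shell with curl and awk; swap in your own URL, and note that cache header names vary by CDN.

```bash
# 1. Is the page coming back compressed and from an edge cache? Look for
#    Content-Encoding plus a cache header (names vary by vendor).
curl -s -o /dev/null -D - -H "Accept-Encoding: br, gzip" \
  https://yoursite.com/your-best-page \
  | grep -iE "content-encoding|cache-control|^age:|cf-cache-status|x-cache"

# 2. Sample TTFB five times and average it, to smooth out cache warm-up
#    and network jitter rather than trusting a single run.
for i in 1 2 3 4 5; do
  curl -w "%{time_starttransfer}\n" -o /dev/null -s https://yoursite.com/your-best-page
done | awk '{ sum += $1 } END { printf "Average TTFB: %.3fs over %d runs\n", sum/NR, NR }'
```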
4. JavaScript-only rendering (SPAs and CSR)
What it does: If your site uses client-side rendering (React, Vue, or Angular SPA without SSR), Next.js app router with mostly client components and no caching, or Next.js pages router without `getServerSideProps`, the initial HTML response is a near-empty shell with a `<div id="root">` and a script tag. The actual content materializes later, after the browser executes your bundle.
Googlebot can deal with this, mostly, though its rendering pass can be delayed and isn't guaranteed. Most AI crawlers cannot deal with it at all: they do not render JavaScript and only see the raw HTML of a page, which means any critical content or navigation that depends on JS to load goes unseen.
How to spot the problem: View source on one of your pages (Ctrl+U on Windows/Linux, Option+Command+U on macOS). If you see your article text in the raw HTML, you're fine. If you only see `<div id="__next"></div>` and a bunch of script tags, that's what an AI crawler sees too.
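The same check works from the terminal, which has the advantage of fetching the page exactly the way a non-rendering crawler does. A sketch; the div IDs below are just common framework mount points and yours may differ.

```bash
# Fetch the raw HTML and look for an empty framework mount point. If this
# matches and your article text is nowhere in the same output, crawlers are
# getting the shell, not the content.
curl -s https://yoursite.com/your-best-page \
  | grep -oE '<div id="(root|__next|app)"></div>' \
  && echo "Empty mount point found: likely client-side rendered"
```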
The fix: Server-side render or pre-render your public pages.
- Next.js: use the `app` router with server components, or `getServerSideProps`/`getStaticProps` in the pages router.
- Nuxt: SSR mode (the default).
- React without a framework: introduce a pre-rendering layer (Prerender.io, react-snap) or migrate to Remix / Next.js.
- Angular: Angular Universal.
If migrating is too big a lift, a pre-rendering proxy like Prerender.io can sit in front of your origin and serve fully-rendered HTML to bots while leaving real users on the SPA. Not elegant, and certainly not my go-to, but it works.
A note on robots.txt
A common assumption is that robots.txt is where you control AI crawlers. It's not, or at least not the only place. Reputable AI crawlers (GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, CCBot) do honor robots.txt directives, so it's a perfectly valid layer for opting out of training. But it's the polite layer. It's enforcement-by-trust. The CDN, WAF, and bot-management layers above your origin are the actual enforcement. If your robots.txt says "allow GPTBot" but Cloudflare's managed AI bot rule is on, GPTBot is still blocked. Always audit both layers, and treat the infrastructure layer as authoritative.
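A quick way to compare the two layers for any single bot is to read what robots.txt says and then ask the edge directly. A sketch using GPTBot; the UA string is abbreviated, so check OpenAI's published bot documentation for the current one.

```bash
# Layer 1: the polite layer. What does robots.txt say about GPTBot?
curl -s https://yoursite.com/robots.txt | grep -i -A 2 "GPTBot"

# Layer 2: the enforcement layer. What status does the edge actually return
# to a GPTBot-shaped request?
curl -A "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -o /dev/null -s -w "Edge response for GPTBot: HTTP %{http_code}\n" \
  https://yoursite.com/your-best-page
```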
How to test if AI crawlers can reach your site
Here's the diagnostic flow I run when an AI Readiness Check comes back with a low score that doesn't match the work the owner clearly put in.
Step 1: Curl as the actual AI bots
What your origin sees is not what the edge sees. The fastest way to find out what GPTBot or ClaudeBot is getting is to pretend to be them. (Heads up: AI crawler user-agent strings drift over time as vendors release new versions. Check the official bot docs for the current version of each before relying on a long-running test.)
```bash
# As GPTBot
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://yoursite.com/your-best-page

# As ClaudeBot
curl -A "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)" \
  -I https://yoursite.com/your-best-page

# As PerplexityBot
curl -A "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://docs.perplexity.ai/guides/bots)" \
  -I https://yoursite.com/your-best-page
```
What you want to see: `HTTP/2 200`. What's bad:
- `HTTP/2 403` → blocked outright. Likely a WAF or bot rule somewhere in the stack.
- `HTTP/2 401` → some kind of auth wall in front of public content.
- `HTTP/2 429` → rate limit. Whitelist verified bots.
- `HTTP/2 503` with a challenge header → JS challenge or CAPTCHA.
- A 200 response with the body containing "Just a moment..." or "Checking your browser" → JavaScript challenge that no AI crawler will pass.
Step 1 alternative: Just ask the AI directly
If you don't want to drop into a terminal, you can ask the AI tools themselves. Open ChatGPT, Claude, or Perplexity and try:
Fetch the URL [your-page] right now. Don't use cached content. Tell me whether the fetch succeeded, the page title, the first sentence of the main body, and whether you hit any challenge, redirect, or blocking page.
When this works, you'll get back a clean answer with content that matches your live page. When it fails, the AI will tell you, usually with a specific error like "I couldn't access that page" or "I received a blocking response." That failure is exactly what real users see when they try to ask the AI about your content.
Two caveats: the tool may answer from cache (use a cache-busting query string like ?nocache=12345 to force a fresh fetch), and don't run this test in "agent" or "browse" mode since those drive a real browser and can pass challenges that the standard retrieval bots can't.
Step 2: Time the response
```bash
curl -w "TTFB: %{time_starttransfer}s\nTotal: %{time_total}s\nSize: %{size_download} bytes\n" \
  -o /dev/null -s https://yoursite.com/your-best-page
```
Run this from a US server (e.g. a small DigitalOcean droplet in NYC) for the most realistic AI-crawler-shaped reading. If TTFB is over 800ms, you're past the reported ceiling and AI crawlers may be giving up before your page arrives.
Step 3: Check what's in the raw HTML
```bash
curl -s https://yoursite.com/your-best-page | grep -c "your-actual-content-keyword"
```
Replace `your-actual-content-keyword` with a phrase that appears in your article body. If the count is zero, your content is JS-rendered and AI crawlers aren't seeing it.
Step 4: Look at server logs by user agent
Pull your last 30 days of access logs and grep for the AI user agents. If you see lots of GPTBot, ClaudeBot, and PerplexityBot hits with 200s, you're fine. If you see 403s, 503s, or nothing at all, you have a problem.
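If you have shell access to raw access logs, a rough per-bot tally takes a few lines. This sketch assumes an Nginx/Apache combined log format (status code in the ninth field) and a hypothetical log path; adjust both to your setup.

```bash
# Count hits per status code for each major AI crawler user agent.
LOG=/var/log/nginx/access.log   # hypothetical path; point at your own logs
for BOT in GPTBot ClaudeBot PerplexityBot OAI-SearchBot CCBot; do
  echo "== $BOT =="
  grep "$BOT" "$LOG" | awk '{ print $9 }' | sort | uniq -c | sort -rn
done
```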
Most CDN and bot-management dashboards will do this for you with a per-bot view if you go looking. Names vary, but the data is usually there.
Step 5: Run the AI Readiness Check
Once you've worked through the manual checks, run the AI Readiness Check to confirm what you found and catch anything you missed. It surfaces most of the above automatically: failed fetches, slow responses, JS-rendered shells, and blocked status codes, and it shows you which checks couldn't run because the site never returned content. If a check that should have passed is failing, the most common cause in 2026 is one of the four blockers above, not missing files.
Should you allow or block AI crawlers? A decision framework
Past the technical fixes, there's a deliberate decision to make. The blunt-force "block everything" defaults exist because the math is genuinely ugly for many publishers. You give crawlers your bandwidth and content for years, and they send back a click maybe once in a thousand requests. So what's the right strategy? It depends on what your site does.
Should a personal blog or small business block AI crawlers?
No. Your goal is visibility, not protection. The traffic an AI tool sends you is small but valuable, and (no offense) your content was never going to anchor anyone's training run. Allow search and retrieval bots. Decide on training bots based on principle.
Should a publisher (news, magazine, recipe site) block AI crawlers?
This is the hard case. Allowing search and retrieval keeps you in AI answers. Allowing training contributes to a system that may eventually replace clicks to your site entirely. My take (but you need to think about this): allow search and retrieval, block training, watch the data, and revisit quarterly. A growing middle path is to return 402 Payment Required instead of 403 Forbidden for training crawlers. That signals "here's how to license this" rather than just "no," and multiple infrastructure vendors are now building toward that pattern.
Should an ecommerce or SaaS site block ChatGPT, Perplexity, or Claude?
No. Do not block ChatGPT-User, PerplexityBot, or Claude-User. These are the bots that fetch your page when a user is actively asking AI for a recommendation. Blocking them is the digital equivalent of locking the front door during business hours.
Should a docs site block AI crawlers?
No, and arguably less so than anyone else. Be maximally open. Coding agents (Cursor, Claude Code, Codex) make many of their fetches as authenticated users via direct tool use, and being unreachable means being unused. This is the one category where I'd argue against any AI block at all.
AI readiness fix checklist, ranked by impact
| Priority | Fix | Where to look | Impact |
|---|---|---|---|
| 🟢 Day 1 | Audit your edge / WAF / bot manager. Allow search and retrieval bots. | Whatever sits in front of your origin | Massive |
| 🟢 Day 1 | Confirm "under attack" or full-site interstitial mode is only on during an active incident, not as a steady-state default | Top-level CDN/security dashboard | Massive |
| 🟢 Day 1 | Curl your site as GPTBot, ClaudeBot, PerplexityBot. Confirm 200s. | Terminal | Diagnostic |
| 🟡 Week 1 | Move CAPTCHAs and bot challenges off public GET routes; keep on POST/auth only | WAF / app-level rules | High |
| 🟡 Week 1 | Verify Googlebot can pass your security stack | Google Search Console URL Inspection | High (SEO) |
| 🟡 Week 1 | Get TTFB under 800ms (ideally 200ms) | Hosting + CDN + caching | High (SEO + AEO) |
| 🟠 Month 1 | Server-side render any JS-only public pages | Framework migration | High |
| 🟠 Month 1 | Compress HTML, defer non-critical JS, optimize images | Build pipeline | Medium |
| 🔵 Quarterly | Review AI crawler traffic in server logs by user agent | Logs / vendor dashboard | Strategic |
| 🔵 Ongoing | Re-run AI Readiness Check | tristandenyer.com | Strategic |