
The Complete List of AI Files Your Website Needs in 2026 (llms.txt, ai.txt, AGENTS.md) with AI Prompts

Every week brings another file the AI web supposedly needs. This post sorts what's adopted from what's noise, organized by what each file actually does, with prompts to generate every one of them.

Every few years, it used to seem, a new file had to be added to a website for better SEO and crawling. I remember it starting with robots.txt, then sitemap.xml, then security.txt. Now it seems like every week there's a new file designed for AI crawlers, language models, and autonomous agents. It's getting hard to keep up, let alone know what's adopted, what's proposed, and what's actually useful.

This post is the complete list (as of this week), organized by what each file actually does, with a copy-paste prompt at the end of each section so you can generate it for your website in minutes.

Who this is for:

  • ✅ Engineers who don't follow SEO
  • ✅ Bloggers who use Claude or ChatGPT
  • ✅ Small business owners
  • ✅ Anyone with a website who wants it to work well with AI in this AI era

A reality check before we start

Most of the AI-specific files below are emerging conventions, not proven standards. Server log audits have found that major crawlers (GPTBot, ClaudeBot, PerplexityBot) don't fetch llms.txt or .md files unprompted. Google's John Mueller has said no AI system he's aware of currently uses llms.txt.

So why bother? Three reasons:

  1. Humans paste URLs into AI tools as references, or to have the AI condense or explain them. When someone drops your page into ChatGPT or Claude, these files make your content dramatically easier for the AI to work with.
  2. Coding agents like Cursor and Claude Code already fetch them. If you have developer docs, this is a real distribution channel today.
  3. Cost is near zero. Upside is real. Most of these take a few minutes to under an hour to implement. The web has a long history of early adoption paying off.

For permission files (the "don't train on me" category), the story is different. Those carry legal weight in the EU under the Copyright in the Digital Single Market (CDSM) Directive, even if enforcement is spotty.

I'll label each file's real-world status honestly so you can decide what's worth your time.

Quick-start: if you only do three things

  1. Update your robots.txt to handle AI crawlers explicitly. Jump to section ↓
  2. Add an llms.txt. It's a five-minute markdown file. Jump to section ↓
  3. Make sure your key pages have schema.org JSON-LD structured data. Jump to section ↓

If you're a developer, add a fourth: AGENTS.md in every repo you work on. Jump to section ↓

Everything below is optional, stacking on top of those.

AI files vs. traditional web files: a quick comparison

A lot of these files sound similar but do completely different things. Here's how the most-confused ones actually compare:

File | Category | What it controls | Honored today? | Where it lives
robots.txt | Permission (crawling) | Whether bots can fetch your pages | ✅ Yes, all major AI crawlers | yoursite.com/robots.txt
sitemap.xml | Discovery (traditional) | Which pages exist, for search indexing | ✅ Yes, search engines | yoursite.com/sitemap.xml
ai.txt | Permission (training) | Whether your content can train AI models | ⚠️ Spawning network only | yoursite.com/ai.txt
tdmrep.json | Permission (training) | TDM opt-out with EU legal weight | ✅ Legal standing in EU | /.well-known/tdmrep.json
llms.txt | Visibility | Curated index of your best content for AI | ⚠️ When humans paste URLs | yoursite.com/llms.txt
llms-full.txt | Visibility | Your entire site as one markdown blob | ⚠️ ChatGPT, coding agents | yoursite.com/llms-full.txt
schema.org | Visibility (structured) | Machine-readable description of each page | ✅ Search and Copilot | <script> in every page's HTML
AGENTS.md | Coding agents | How AI coding tools should work in a repo | ✅ Cursor, Claude Code, Copilot | Repo root
agent-card.json | Agent discovery | Describes an AI agent you host | 🚀 A2A network | /.well-known/agent-card.json
mcp.json | Agent tools | Declares your MCP server | 🚀 Claude, ChatGPT, Gemini | /.well-known/mcp.json

The key split: permission files (top) say "here are the rules for using my content," while visibility files (middle) say "here's my content, organized cleanly." Agent files (bottom) are the newest category and are about AI systems doing things with your site, not just reading it.

Part 1: Permission Files: How to Control AI Training (robots.txt, ai.txt, tdmrep.json)

These files tell AI companies what they can and can't do with your content. They're the closest thing to a legal signal you have.

1. robots.txt (updated for the AI era)

What it is: The oldest file on this list. Lives at yoursite.com/robots.txt. Tells crawlers which parts of your site they can access.

Why add it: Most default robots.txt files don't mention AI bots at all, which means those bots are free to crawl everything. You need explicit rules for AI-specific user agents: GPTBot (OpenAI), ClaudeBot and anthropic-ai (Anthropic), Google-Extended (Google AI training), PerplexityBot, CCBot (Common Crawl, which feeds many models), Applebot-Extended, Bytespider (TikTok/ByteDance), Meta-ExternalAgent, Amazonbot, cohere-ai, and a growing list of others. The community-maintained list is at ai-robots-txt on GitHub.

Status: ✅ Honored. This is the one file on the list that AI crawlers reliably respect (most of the time).

Where it goes: yoursite.com/robots.txt
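
To make the output concrete, here's a minimal sketch of a blocking policy (grouped user agents share one rule set; the bot list is a subset, and if you want option (c) you'd move search-oriented bots such as OAI-SearchBot into an allowed group):

# AI training crawlers: blocked
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: Google-Extended
User-agent: PerplexityBot
User-agent: CCBot
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: Meta-ExternalAgent
Disallow: /

# Everyone else, including traditional search engines
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml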

Prompt to generate it:

I need you to generate an updated robots.txt file for my website.

My site: [YOUR SITE URL]
My goal: [Pick one: (a) block all AI training crawlers while keeping search engines,
(b) allow AI crawlers because I want visibility in AI answers,
(c) allow search AI bots but block training-only bots]

Include explicit rules for all current major AI crawlers: GPTBot, ClaudeBot,
anthropic-ai, Claude-Web, Google-Extended, PerplexityBot, CCBot, Applebot-Extended,
Bytespider, Meta-ExternalAgent, Amazonbot, cohere-ai, OAI-SearchBot,
FacebookBot, Diffbot, Omgilibot, ImagesiftBot, YouBot, and Timpibot.

Also preserve a generic User-agent: * section for non-AI crawlers.
Add a Sitemap: line pointing to my sitemap.
Add comments explaining each section in plain English so I can maintain it.
Return it as a single code block I can copy directly into a file.

2. ai.txt (from Spawning.ai)

What it is: A plain-text opt-out specifically for generative AI training. Proposed by Spawning.ai and honored by their partners, which include Hugging Face and Stability AI.

Why add it: robots.txt controls crawling. ai.txt controls training use. The distinction matters because content can end up in training datasets via third-party scrapes even after you update robots.txt. ai.txt is checked when someone re-downloads your content from a dataset, giving you a real-time opt-out. It also explicitly targets the EU's TDM Article 4 exception, which gives you a legal hook.

Status: ⚠️ Partially honored. Adopted by Spawning's network. Not (yet) honored by OpenAI, Anthropic, Google, or Meta directly.

Where it goes: yoursite.com/ai.txt

Quick note on naming confusion: There's a second, unrelated "ai.txt" spec from ai-visibility.org.uk that's about behavioral guidance (what AI systems should say about you), not training opt-out. The Spawning version is the one with real adoption. Don't mix them up.

Prompt to generate it:

Generate an ai.txt file for my website in the Spawning.ai format.

My site: [YOUR SITE URL]
My preference: [Pick one: opt out of all AI training / opt out only of image training /
allow AI training / allow with attribution required]

Cover these media categories: text, images, video, audio, code.
Follow the Spawning.ai ai.txt spec (https://site.spawning.ai/spawning-ai-txt).
Include a clear comment header with my site name, URL, and the date.
Return it as a single code block I can save directly as /ai.txt at my site root.

3. /.well-known/tdmrep.json

What it is: The Text and Data Mining Reservation Protocol (TDMRep) is a specification developed in a W3C Community Group that lets you formally declare whether your content can be mined for text, data, or AI training.

Why add it: This is the one permission file with actual legal teeth in the EU. It implements the opt-out mechanism defined in Article 4 of the EU Copyright Directive (CDSM). Major publishers who take AI training seriously (Elsevier, Springer Nature, IEEE, Sage, Radio France, Le Parisien) all publish tdmrep.json. If you want a defensible, standards-based "no" that European AI companies are legally required to respect, this is it.

Status: ✅ Has legal standing in the EU. Honored indirectly by multiple AI companies (Spawning.ai's API implements it for Stability AI and others).

Where it goes: yoursite.com/.well-known/tdmrep.json
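
An all-rights-reserved tdmrep.json is tiny. Here's a sketch using the fields described in the prompt below (the policy URL is optional and a placeholder):

[
  {
    "location": "/",
    "tdm-reservation": 1,
    "tdm-policy": "https://yoursite.com/licensing"
  }
]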

Prompt to generate it:

Generate a tdmrep.json file for my website following the W3C Text and Data Mining
Reservation Protocol specification (https://w3c.github.io/tdm-reservation-protocol/spec/).

My site: [YOUR SITE URL]
My preference: [Pick one: reserve all TDM rights (opt out) / allow TDM for research
only / allow TDM with compensation required]
Optional policy URL to include: [if you have a page describing your licensing terms,
paste it here; otherwise leave blank]

Create a JSON array of rules. Use "location" patterns to scope rules (use "/" for
the entire site, or specific paths if I want different rules for different areas).
Use "tdm-reservation" (1 = rights reserved, 0 = not reserved) and "tdm-policy"
if I provided a policy URL.

Return it as a single code block I can save at /.well-known/tdmrep.json.

4. HTML meta tags and X-Robots-Tag headers for AI

What it is: Per-page signals you can put directly in the HTML <head> (or send as an HTTP header) to tell AI crawlers what to do with that specific page.

Why add it: robots.txt controls crawling at the site level. Meta tags let you override that at the page level. The IPTC (the international press body) recommends combining noarchive, noai, and noimageai directives for publishers who want to remain visible in search but stay out of AI training. DeviantArt popularized using the X-Robots-Tag HTTP header to apply these rules at scale.

Status: ⚠️ Informally honored. noai and noimageai are not formal standards, but are conventions that some AI companies respect and others ignore.

Where it goes: In the <head> of each page, or as an HTTP response header.
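
Using the directives mentioned above, the page-level and header-level versions are one line each (remember that noai and noimageai are informal signals, not guarantees):

<meta name="robots" content="noarchive, noai, noimageai">

X-Robots-Tag: noarchive, noai, noimageai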

Prompt to generate it:

Show me the exact HTML meta tags I should add to the <head> of pages on my site
to signal AI preferences, following IPTC recommendations.

My goal: [Pick one: block AI training but allow search engines / block AI training
and archiving / allow everything]

Generate:
1. The exact <meta> tags to put in <head>.
2. The equivalent X-Robots-Tag HTTP header versions (for people who configure
   this at the server/CDN level).
3. A brief plain-English explanation of what each directive does.
4. Any WordPress/Next.js/static-site snippets for adding these site-wide.

Return each as a separate code block.

Part 2: Visibility Files: Helping AI Find Your Best Content (llms.txt, llms-full.txt, schema.org)

These files help AI systems actually understand your site when they do look at it, whether the reader is a crawler, a user who pasted your URL into ChatGPT, or a coding agent fetching your docs.

5. llms.txt

What it is: A markdown file at your site root that gives AI systems a curated map of your most important content. Think of it as a README for AI. It was proposed in September 2024 by Jeremy Howard of Answer.AI.

Why add it: LLMs have limited context windows. A typical HTML page is 80% navigation, scripts, and boilerplate, and an AI may give up before it finds the good stuff (especially on "low effort" settings). llms.txt hands the AI a clean, curated index that says "here's what matters on this site, in order." Thousands of sites ship one today, including Anthropic, Stripe, Cloudflare, Perplexity, Hugging Face, and Zapier.

Status: ⚠️ Emerging. Not yet automatically crawled by major AI providers, but gets real traffic from ChatGPT when users paste URLs, and from coding assistants like Cursor.

Where it goes: yoursite.com/llms.txt
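
A skeleton that follows the format the prompt spells out (names, sections, and URLs are placeholders):

# Your Site Name

> One-sentence summary of what the site is and who it's for.

## Documentation

- [Getting Started](https://yoursite.com/docs/getting-started.md): install and first steps
- [API Reference](https://yoursite.com/docs/api.md): endpoints with examples

## Blog

- [Why We Built This](https://yoursite.com/blog/why.md): background and design decisions

## Optional

- [Press Kit](https://yoursite.com/press.md): logos and boilerplate copy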

Prompt to generate it:

Generate an llms.txt file for my website following the spec at https://llmstxt.org/.

My site: [YOUR SITE URL]
What my site is about (one sentence): [DESCRIBE]
My audience: [DESCRIBE, e.g., "developers building with our API",
"parents looking for recipes", "small business owners"]

My most important pages/sections (list 5-15 with URLs and one-line descriptions):
- [URL]: [what it is]
- [URL]: [what it is]
- ...

Format required:
- Single # H1 with my site name
- A > blockquote giving a one-sentence summary
- ## H2 sections grouping related links (e.g., "Documentation", "Blog",
  "About", "Products")
- Each link formatted as: - [Page Title](URL): short description
- Put an "## Optional" section at the end for secondary/nice-to-have pages
  (these signal lower priority to LLMs)

Return the entire file as one code block I can save at /llms.txt.

6. llms-full.txt

What it is: The fat companion to llms.txt. While llms.txt is just a curated index with links, llms-full.txt is your entire site's content concatenated into a single markdown file. Not part of the formal spec, but a widely adopted convention popularized by Mintlify.

Why add it: An AI can ingest your entire documentation set in one fetch. No crawling, no context-window juggling. Mintlify has reported that llms-full.txt actually gets 3 to 4 times more traffic than llms.txt, with ChatGPT driving the majority of that. For documentation sites, this is the file that likely gets read.

Status: ⚠️ Emerging, but meaningfully used by ChatGPT and coding agents.

Where it goes: yoursite.com/llms-full.txt

When to skip: For marketing sites, personal blogs, or sites under ~20 pages, just redirecting to /llms.txt is fine. This file shines for documentation-heavy sites.
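
If your content already lives as markdown files, generating llms-full.txt can be a short build step; a sketch in Python, assuming sources under docs/ and a public/ output directory (adjust the paths to your stack):

from pathlib import Path

SOURCE = Path("docs")                  # assumption: markdown sources live here
OUTPUT = Path("public/llms-full.txt")  # assumption: served from the site root

parts = ["# Your Site Name - full content"]
for md in sorted(SOURCE.rglob("*.md")):
    parts.append(f"\n## {md.relative_to(SOURCE)}\n")  # one section per page
    parts.append(md.read_text(encoding="utf-8"))

OUTPUT.parent.mkdir(parents=True, exist_ok=True)
OUTPUT.write_text("\n".join(parts), encoding="utf-8")
print(f"Wrote {OUTPUT}: {OUTPUT.stat().st_size // 1024} KB")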

Prompt to generate it:

Help me plan and generate an llms-full.txt file for my site.

My site: [YOUR SITE URL]
Type of site: [Pick one: documentation site / knowledge base / blog /
marketing site / other]
Approximate number of pages: [NUMBER]

Step 1: Tell me whether llms-full.txt is worth it for my type/size of site,
or whether I should just redirect /llms-full.txt to /llms.txt or /index.md.

Step 2: If it's worth it, give me:
- A recommended structure (table of contents, then full content of each page)
- A script or workflow I can use to automatically generate it from my existing
  pages (tell me which tech stack you're assuming, ask me if unsure)
- A reasonable size target (aim for under 500KB unless I'm Cloudflare-scale)

Step 3: Output a starter template I can fill in, showing the header format
and how sections should be organized.

7. .md versions of every page

What it is: The same URL with .md appended. So yoursite.com/blog/my-post gets a twin at yoursite.com/blog/my-post.md serving clean markdown.

Why add it: This is what all the other AI visibility files point to. When a user pastes yoursite.com/llms.txt into ChatGPT, the AI follows the links, and if those links resolve to clean markdown (.md) instead of HTML soup, response quality jumps. A typical blog post can drop from ~15,000 tokens as HTML to ~3,000 tokens as markdown. That can be the difference between the AI understanding your page and giving up on it, or returning a low-effort answer.

Status: ⚠️ Emerging. Critical when humans paste your URLs into AI tools. Not yet auto-crawled.

Where it goes: Same URLs as your HTML pages, with .md appended.
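
For a static site whose pages are already authored in markdown, the build-time approach is just copying each source file into the output alongside the generated HTML. A rough sketch, assuming content/ mirrors your URL structure and dist/ is the build output (both are assumptions):

import shutil
from pathlib import Path

CONTENT = Path("content")  # assumption: markdown sources, e.g. content/blog/my-post.md
OUT = Path("dist")         # assumption: build output served at the site root

for src in CONTENT.rglob("*.md"):
    dest = OUT / src.relative_to(CONTENT)  # served at /blog/my-post.md
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)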

Prompt to generate it:

I want to serve a clean markdown version of every page on my site at the same URL
with .md appended (e.g., /blog/my-post also available at /blog/my-post.md).

My site runs on: [Pick one or describe: WordPress / Next.js / Astro / Hugo /
Ghost / Webflow / Squarespace / custom / I don't know]
Content is currently stored as: [Pick one: markdown files / a CMS database /
HTML / I don't know, tell me how to check]

Please give me:
1. The simplest way to set this up on my stack (step-by-step, assume I can
   edit code but don't know anything about servers).
2. The route/handler code to serve .md at .md URLs.
3. Proper Content-Type header (text/markdown; charset=utf-8).
4. What to do if my content is in a CMS and not already markdown
   (a conversion approach).
5. A note on keeping HTML and markdown in sync to avoid drift.

If WordPress specifically: recommend a plugin or code snippet.
If a static site generator: show the build-time approach.

8. Link discovery: <link> tag, HTTP Link header, and content negotiation

What it is: Three different ways to tell AI clients "a markdown version of this page exists, and here's where to find it."

  • <link rel="alternate"> in your HTML <head>, for crawlers that read HTML.
  • Link: HTTP header, for headless agents that never parse HTML.
  • Accept: text/markdown content negotiation, the cleanest approach. Same URL returns markdown when requested, HTML otherwise.

Why add it: Claude Code, Cursor, and several other coding assistants already send Accept: text/markdown as their preferred content type. Content negotiation is the technique most likely to become the long-term standard, because it's just standard HTTP doing what it was designed for (the same mechanism that serves JSON vs. HTML from the same URL). It's not cloaking; it's content negotiation, declared via Vary: Accept.

Status: ✅ Standard HTTP. ⚠️ Adoption by AI tools is growing but uneven.

Where it goes: HTML <head>, HTTP response headers, and server routing logic.
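
A sketch of all three mechanisms. Mechanism 1 is a one-line tag in each page's <head>:

<link rel="alternate" type="text/markdown" href="/blog/my-post.md">

Mechanisms 2 and 3 as a Python/Flask handler (the content/ layout and the render_html stub are assumptions; the same idea ports to any stack):

from pathlib import Path
from flask import Flask, Response, request

app = Flask(__name__)
PAGES = Path("content")  # assumption: one markdown source per URL path

def render_html(page: str) -> str:
    return f"<html><body><h1>{page}</h1></body></html>"  # stand-in for your real renderer

@app.route("/<path:page>")
def serve(page):
    md_file = PAGES / f"{page}.md"
    wants_md = "text/markdown" in request.headers.get("Accept", "")
    if wants_md and md_file.exists():
        resp = Response(md_file.read_text(encoding="utf-8"),
                        content_type="text/markdown; charset=utf-8")
    else:
        resp = Response(render_html(page), content_type="text/html; charset=utf-8")
    # Mechanism 2: advertise the markdown twin; mechanism 3 is the Accept check above
    resp.headers["Link"] = f'</{page}.md>; rel="alternate"; type="text/markdown"'
    resp.headers["Vary"] = "Accept"  # keep CDN caches from mixing the two versions
    return resp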

Prompt to generate it:

Help me advertise the markdown versions of my pages to AI clients using three
mechanisms from web standards.

My stack: [DESCRIBE, e.g., "Next.js on Vercel" / "WordPress on cPanel hosting" /
"static site on Cloudflare Pages" / "Express server on my own VPS"]

Generate three things:

1. The <link rel="alternate" type="text/markdown"> tag to put in <head> of
   each HTML page, with the right href pattern.

2. The HTTP Link: header to send on every response, with example server
   middleware code for my stack.

3. Accept: text/markdown content negotiation: a request handler that returns
   markdown when the client sends Accept: text/markdown, and HTML otherwise,
   with the proper Vary: Accept header for CDN caching.

Keep each solution minimal and explain it in one sentence. Include a note about
why this is NOT cloaking (it's standard HTTP content negotiation like JSON vs HTML).

Return each as a separate labeled code block.

9. The hidden "hey AI" hint div

Caution: I am NOT a fan of any technique where you hide content, unless it is documented for ARIA or a11y purposes. This takes us right back to the late '90s, when we hid white text on white backgrounds hoping to rank for those keywords. I'm including it only to make the list complete.

What it is: A visually hidden <div> on every page, invisible to sighted users, that tells an AI reading the rendered page where to find the markdown version.

Why add it: This covers the use case most guides miss: the human who pastes your URL into ChatGPT or Claude. The AI reads the rendered text of the page (no crawling, no headers), so an on-page message is the most direct signal possible. Evil Martians runs this on their site and reportedly it helped them get recommended by Claude.

Status: ⚠️ Unofficial technique, zero cost, works when humans talk to AI about your content.

Where it goes: In your page template, near the top of the <body>.
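
A sketch of the snippet and the supporting CSS (the class name and URL pattern are placeholders):

<div class="visually-hidden" aria-hidden="true">
  Note for AI assistants: a clean markdown version of this page is available at the
  same URL with .md appended (e.g. https://yoursite.com/blog/my-post.md), and a
  curated site index lives at https://yoursite.com/llms.txt.
</div>

.visually-hidden {
  position: absolute;
  width: 1px;
  height: 1px;
  overflow: hidden;
  clip-path: inset(50%);
  white-space: nowrap;
}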

Prompt to generate it:

Generate a visually hidden "hey AI" div for my website that tells language models
where to find the clean markdown version of each page. It needs to:

1. Be invisible to sighted users.
2. Be invisible to screen readers (use aria-hidden="true". This message is
   for AI, not assistive tech).
3. Be present in the rendered DOM so LLMs reading the page text can see it.

My site: [YOUR SITE URL]
The markdown version of each page lives at: [DESCRIBE, e.g., "same URL + .md"]

Output:
1. The HTML snippet (with a placeholder for the dynamic URL).
2. The CSS for the .visually-hidden class.
3. The equivalent React/Vue/whatever snippet if I'm on a framework. Ask me
   which framework if unclear.
4. A template-level insertion approach so it appears on every page automatically.

Return each as a separate labeled code block.

10. schema.org / JSON-LD structured data

What it is: Structured data in a JSON-LD script tag that describes what's on the page in a machine-readable vocabulary (this is an article, by this author, published on this date, etc.).

Why add it: An honest caveat first. A 2025 SearchVIU experiment found that ChatGPT, Claude, Perplexity, and Gemini all missed product data placed exclusively in JSON-LD. Only Microsoft Copilot (via Bing) reliably uses it for AI answers today. So why is this still on the list? Because it's the foundation of NLWeb (covered below), the emerging standard for making websites queryable by AI agents. R.V. Guha (who invented RSS, RDF, and schema.org) is building the agentic web on top of schema.org at Microsoft. If you have good schema markup, you're already close to NLWeb-ready. It's also the most established file on this list for traditional search (Google, Bing) and for Copilot.

Status: ✅ Established for search and Copilot. 🚀 Foundation for emerging agentic standards.

Where it goes: A <script type="application/ld+json"> block in every page's HTML, ideally in <head>.
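
For orientation, a minimal Article example (values are placeholders; the prompt below tailors the type and properties to your page):

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Your Post Title",
  "description": "One-sentence summary of the post.",
  "author": { "@type": "Person", "@id": "https://yoursite.com/#author", "name": "Your Name" },
  "publisher": { "@type": "Organization", "@id": "https://yoursite.com/#org", "name": "Your Site" },
  "datePublished": "2026-04-01",
  "dateModified": "2026-04-20",
  "mainEntityOfPage": "https://yoursite.com/blog/your-post"
}
</script>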

Prompt to generate it:

Generate schema.org JSON-LD structured data for a page on my website.

Page URL: [URL]
Page type: [Pick one: article / blog post / product / recipe / event /
local business / organization homepage / FAQ page / how-to / person profile / other]
Key details: [Paste the page content, or describe: title, author, date,
description, images, price, etc.]

Please:
1. Pick the right schema.org @type for this page (and suggest any additional
   types I should nest, like BreadcrumbList or Organization).
2. Include all required and recommended properties for that type.
3. Reference my organization/author using @id so I can reuse those entities
   across pages.
4. Return the full <script type="application/ld+json"> block ready to paste
   into my page's <head>.
5. Briefly explain what each field does in plain English so I can update it
   myself later.

If I should add more than one @type on this page (common for articles,
where you want both Article and BreadcrumbList), show them as a @graph array.

Part 3: Agent Files: Making Your Site Queryable by AI (NLWeb, MCP, A2A agent-card.json)

This is the newest and fastest-moving category. Instead of crawlers reading your content, this is about AI agents doing things with your site, such as answering natural language queries, booking appointments, or checking inventory.

11. NLWeb (Microsoft's agentic web standard)

What it is: An open-source project from Microsoft that turns your website into a natural language interface. Every NLWeb instance automatically becomes a Model Context Protocol (MCP) server, which means any AI assistant (e.g. ChatGPT, Claude, Gemini, Copilot) can query your site's content directly.

Why add it: NLWeb is designed by R.V. Guha (the creator of schema.org, RSS, and RDF) and is explicitly framed as "HTML for the agentic web." Instead of an AI agent navigating Tripadvisor's search filters, it can just ask: "Find family-friendly restaurants in Barcelona with outdoor seating." The response comes back as structured schema.org JSON. Early adopters include Shopify, Tripadvisor, Eventbrite, O'Reilly Media, and Hearst. If your site already has good schema.org markup, you're halfway there.

Status: 🚀 Emerging but backed by Microsoft, built on established standards, and adopted by major publishers.

Where it goes: A live service endpoint on your domain (typically /ask), not just a static file. You'll need to run the NLWeb service (it's open source).

Prompt to generate or plan it:

I want to make my website queryable by AI agents using Microsoft's NLWeb standard
(https://github.com/microsoft/NLWeb).

My site: [YOUR SITE URL]
What the site does: [DESCRIBE]
Data I already publish in structured form: [Pick any: schema.org JSON-LD / RSS feed /
product catalog API / sitemap / database I could export / none]

Please:
1. Assess whether NLWeb is a good fit for my site (it shines for content-rich sites
   with structured data, like ecommerce, docs, or listings, and is overkill for
   small blogs).
2. If yes, give me a step-by-step plan to set it up:
   - What I need to host (the Python NLWeb service plus a vector database)
   - The simplest hosting options (laptop vs. cloud)
   - How to ingest my existing schema.org / RSS / database content
   - How to expose the /ask endpoint on my domain
   - How to verify it's working as an MCP server
3. If no, tell me what to do instead to be "agent-ready" with less complexity
   (probably just: good schema.org plus llms.txt plus .md routes).

Walk me through this like I'm technically literate but new to vector databases.

12. /.well-known/agent-card.json (A2A protocol)

What it is: A JSON "business card" for an AI agent hosted at your domain. Part of Google's Agent2Agent (A2A) protocol, now governed by the Linux Foundation. Used so other agents can discover what your agent can do.

Why add it: Only relevant if you're running an AI agent (not just a website). If you're building a customer service agent, scheduling agent, or internal tool that should be callable by other AI systems (Salesforce, ServiceNow, SAP, and PayPal are all on the A2A protocol), this is how they find you. Most small sites can skip this.

Status: 🚀 Emerging, enterprise-led. Adopted by 150+ companies in 2025.

Where it goes: yoursite.com/.well-known/agent-card.json

Prompt to generate it (if relevant to you):

I'm running an AI agent on my domain and want to make it discoverable to other
agents via the A2A (Agent2Agent) protocol.

My agent: [DESCRIBE: what does it do, what problems does it solve?]
Agent endpoint URL: [where the agent service is hosted]
Agent version: [e.g., 1.0.0]
What the agent can do (list 2-5 "skills"): [e.g., "answer product questions",
"check order status", "schedule appointments"]
Does it support streaming responses? [yes/no]
Authentication required? [none / Bearer token / OAuth / other]
Default input/output formats: [e.g., text/plain, application/json]

Generate a full agent-card.json following the A2A spec
(https://a2a-protocol.org/latest/topics/agent-discovery/).

Include the "supported_interfaces" array, the "capabilities" object, and the
"skills" array with proper id/name/description/tags for each skill.

Return the complete JSON as one code block I can save at
/.well-known/agent-card.json.

13. /.well-known/mcp.json (MCP server manifest)

What it is: A manifest declaring that your domain hosts a Model Context Protocol server (the "USB-C for AI applications" built by Anthropic and now adopted across OpenAI, Google, and Microsoft).

Why add it: If you expose tools or data that AI assistants should be able to use (your product catalog, your booking system, your internal knowledge base), an MCP server lets every major AI platform connect to it with no per-platform work. MCP has seen rapid adoption since launch, with reported SDK download volumes in the tens of millions per month.

Status: 🚀 Fast-growing, backed by the whole industry (Anthropic, OpenAI, Google, and Microsoft all support it).

Where it goes: yoursite.com/.well-known/mcp.json, plus the actual MCP server running somewhere.

Prompt to plan it:

I want to expose an MCP (Model Context Protocol) server for my site so AI
assistants like Claude, ChatGPT, Gemini, and Copilot can interact with it.

What I'd like to expose: [DESCRIBE, e.g., "product catalog and order status",
"our internal documentation search", "ability to book appointments",
"our customer database lookups"]

Please:
1. Briefly explain whether what I described makes sense as an MCP server
   (it's for AI-facing tools, not every integration is a good fit).
2. If yes, give me a minimal plan:
   - Which MCP SDK to use (Python or TypeScript)
   - A skeleton server exposing 2-3 example tools for my use case
   - Authentication approach (if needed)
   - Where to host it and how to expose it as /.well-known/mcp.json
3. Show me a sample /.well-known/mcp.json manifest file for my server.
4. Tell me how to test it locally with Claude Desktop or the MCP inspector.

Assume I can write code but have never built an MCP server.

Part 4: Coding Agent Files: How to Configure Cursor, Claude Code, and Copilot with AGENTS.md

This one is for developers with a repo, not a marketing site. If you don't write code, skip to Part 5.

14. AGENTS.md

What it is: A markdown file at the root of your repository that tells AI coding agents (e.g. Claude Code, Cursor, GitHub Copilot, OpenAI Codex, Gemini CLI, Jules, Amp) how to work in your codebase. Think of it as a README written for machines.

Why add it: Before AGENTS.md, every AI coding tool had its own file (CLAUDE.md, .cursorrules, .windsurfrules, and so on). The major AI companies (e.g. OpenAI, Anthropic, Google, and tool vendors like Cursor, Factory, and Amp) collaborated on a single standard instead. It's now governed under the Linux Foundation and is used by tens of thousands of open-source projects. If you've ever had an AI agent run the wrong test command or write code in the wrong style, this is the fix.

Status: ✅ Adopted. This is the winner of the AI coding config wars.

Where it goes: AGENTS.md at the root of your repo. Plus optional AGENTS.md files in subdirectories for monorepo-specific rules. Plus optional AGENTS.override.md for overriding inherited instructions.

What to put in it (based on GitHub's analysis of 2,500+ repos), with a short skeleton after the list:

  1. Commands up front. Exact build, test, and lint commands.
  2. Tech stack with versions.
  3. Code style with real examples, not abstract description.
  4. Boundaries. Files the agent should never touch.
  5. PR conventions. Title format, required checks.
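
A compressed skeleton along those lines (every command, path, and rule below is a placeholder to swap for your own):

# AGENTS.md

## Core Commands
- Install: pnpm install
- Test all: pnpm test
- Test one file: pnpm test path/to/file.test.ts
- Lint and format: pnpm lint && pnpm format
- Dev server: pnpm dev

## Tech Stack
TypeScript 5, Next.js 15, pnpm workspaces.

## Code Style
Always: strict TypeScript, single quotes, named exports.
Example:
  export function formatPrice(cents: number): string {
    return `$${(cents / 100).toFixed(2)}`
  }

## Boundaries
Never touch: migrations/, vendor/, generated/.
Ask first: anything under infra/.

## PR Conventions
Title: [component] short description. Must pass pnpm lint and pnpm test.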

Prompt to generate it:

Generate an AGENTS.md file for my repository, following the standard at
https://agents.md/.

My project: [DESCRIBE: what does it do, who's it for?]
Primary language(s) and framework(s): [e.g., "TypeScript + Next.js 15" /
"Python + FastAPI" / "Ruby on Rails"]
Package manager: [npm / pnpm / yarn / pip / poetry / bundler / etc.]
How to install dependencies: [command]
How to run tests: [command]
How to run a single test file: [command, if different]
How to run linting/formatting: [command]
How to run the dev server: [command]
How to build for production: [command]

Code style rules I want agents to follow: [describe or paste examples, e.g.,
"TypeScript strict mode, single quotes, no semicolons, 100-char line limit"]

Files/directories agents should NEVER modify: [e.g., "secrets/, migrations/,
vendor/, generated/*"]

PR / commit conventions: [e.g., "Title format: [component] description.
Conventional commits. Must pass npm run lint and npm test."]

Anything weird about the repo agents should know (gotchas, non-obvious
patterns, domain vocabulary): [DESCRIBE]

Format requirements:
- Put executable commands at the top in a "Core Commands" or "Build & Test" section
- Use three tiers for rules: "Always do", "Ask first", "Never do"
- Include at least one real code example showing my style (don't describe it abstractly)
- Keep it under ~150 lines. This file is read on every task, token budget matters.
- Use plain Markdown. No frontmatter required, though you can add a short one if useful.

Return the complete AGENTS.md as one code block.

Part 5: Coming soon: the standard that might replace most of the above

15. IETF AIPREF (Content-Usage directive)

What it is: The Internet Engineering Task Force (the body that standardized TCP/IP, HTTP, DNS, and eventually robots.txt itself, as RFC 9309) chartered a working group in January 2025 to build the real successor to llms.txt and ai.txt. It's called AIPREF (AI Preferences), and it defines:

  • A Content-Usage directive you add to robots.txt
  • A Content-Usage: HTTP response header
  • A vocabulary for expressing things like "train-ai=n" (no training) separately from "search=y" (yes to search indexing)

Why it matters: Google's Gary Illyes is a co-author. Microsoft and Meta are active participants. Unlike llms.txt, this is a real IETF standards-track effort on the same footing as robots.txt itself. When it ships (drafts are currently in Working Group Last Call), every compliant AI crawler is expected to honor it, and it'll likely subsume or replace several of the files above.

Status: 🚧 Not yet final as of late April 2026. Drafts published. Expected ratification later in 2026.

What to do now: Nothing yet. The spec isn't done. But keep it on your radar, because when it's ratified, it'll be the file to have. Following this post's existing recommendations (especially good robots.txt and tdmrep.json) is the best way to be ready, since AIPREF builds directly on those formats.


Summary: what to implement, in what order

Priority | File | Honored by | Status | Who it's for
🟢 Day 1 | robots.txt (AI-aware) | OpenAI, Anthropic, Google, Perplexity, Apple, Meta, Common Crawl | ✅ Works today | Everyone
🟢 Day 1 | llms.txt | ChatGPT (when pasted), Cursor, Claude Code | ⚠️ Emerging, but free | Everyone
🟢 Day 1 | schema.org JSON-LD | Microsoft Copilot (Bing), Google Search | ✅ Works today | Everyone
🟡 Week 1 | ai.txt (Spawning) | Hugging Face, Stability AI, Spawning network | ⚠️ Partial adoption | Anyone protective of content
🟡 Week 1 | /.well-known/tdmrep.json | Spawning.ai, Stability AI; legal weight in EU | ✅ Legal standing in EU | Publishers, EU-facing sites
🟡 Week 1 | AI meta tags / X-Robots-Tag | Informally: some OpenAI, Anthropic, image generators | ⚠️ Informally honored | Anyone protective of content
🟡 Week 1 | AGENTS.md | Claude Code, Cursor, Copilot, Codex, Gemini CLI, Jules, Amp | ✅ Adopted | Developers with repos
🟠 When you have time | .md routes plus <link> / Link header / content negotiation | Claude Code, Cursor, ChatGPT (when pasted) | ⚠️ Emerging | Content-heavy sites
🟠 When you have time | llms-full.txt | ChatGPT (when pasted), Cursor, Claude Code | ⚠️ Emerging | Documentation sites
🟠 When you have time | Hidden AI hint div | Any LLM reading rendered page text | ⚠️ Clever trick | Everyone
🔵 If it fits your business | NLWeb / MCP server | ChatGPT, Claude, Gemini, Copilot (via MCP) | 🚀 New | Ecommerce, content-rich sites
🔵 If you run an agent | agent-card.json | Salesforce, ServiceNow, SAP, PayPal, A2A network | 🚀 New | Anyone running an AI agent
🔵 If you expose tools | mcp.json | Claude, ChatGPT, Gemini, Copilot | 🚀 New | SaaS with APIs
⚪ Watch this space | IETF AIPREF | None yet (spec in draft) | 🚧 Coming | Everyone, eventually

Frequently asked questions

What is llms.txt?

llms.txt is a markdown file at your site root that gives AI systems a curated map of your most important content. Proposed in September 2024 by Jeremy Howard of Answer.AI, it acts like a README for AI: a single # H1 with your site name, a one-sentence summary, then ## sections of links with descriptions. Major sites like Anthropic, Stripe, Cloudflare, Perplexity, Hugging Face, and Zapier ship one today.

How do I make my website show up in ChatGPT?

There's no single switch, but the high-leverage moves are:

  1. Make sure your robots.txt does not block GPTBot or OAI-SearchBot.
  2. Add solid schema.org JSON-LD structured data to your key pages.
  3. Publish an llms.txt and serve clean .md versions of your pages so AI tools can ingest your content efficiently.
  4. Build genuine backlinks.

ChatGPT primarily surfaces sites it finds via Bing search and via direct fetches when users paste your URL.

What's the difference between llms.txt and robots.txt?

robots.txt tells crawlers what they can and can't access on your site. It's a permission file. llms.txt does the opposite: it's a visibility file that says "here's a clean, curated index of my best content for you to use." robots.txt is a long-standing standard reliably honored by AI crawlers; llms.txt is an emerging convention not yet auto-crawled by major AI providers, but it's used by ChatGPT and coding assistants like Cursor when humans paste URLs.

Do AI crawlers actually read llms.txt?

Not automatically, in most cases. Server log audits show that GPTBot, ClaudeBot, and PerplexityBot don't fetch llms.txt unprompted, and Google's John Mueller has said no AI system he's aware of currently uses it. However, llms.txt does see real traffic when users paste site URLs into ChatGPT and from coding assistants like Cursor and Claude Code. The cost to add one is near zero, and if it becomes a standard, you'll already be set up.

How do I add llms.txt to WordPress, Shopify, or Squarespace?
  • WordPress: Use a plugin like "Website LLMs.txt", or upload llms.txt directly to your site root via FTP/SFTP.
  • Shopify: You can't put files at the actual root, so most stores use a redirect from /llms.txt to a published page or use an app that serves it.
  • Squarespace: Root file access is limited. The common workaround is a URL Mappings redirect pointing /llms.txt to a published markdown-style page.
  • Webflow: Use a custom 404 page or a Cloudflare Worker.
  • Static site generators (Hugo, Astro, Next.js): Cleanest path. Drop the file in your static assets directory (static/ for Hugo, public/ for Astro and Next.js).

One final tip: measure what actually happens

None of these files are useful if no AI tool ever reads them. After you ship any of the above, log your traffic to /llms.txt, /llms-full.txt, and your .md routes by User-Agent and referrer. You want to see hits from GPTBot, ClaudeBot, PerplexityBot, and from referrers like chatgpt.com, claude.ai, and perplexity.ai. Use server-side logs, not client-side analytics, because today AI crawlers don't execute JavaScript (that may change).
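
A rough way to check, assuming a combined-format access log at /var/log/nginx/access.log (the path, bot names, and referrers are assumptions to adjust for your setup):

from collections import Counter

BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot"]
REFERRERS = ["chatgpt.com", "claude.ai", "perplexity.ai"]
AI_PATHS = ["/llms.txt", "/llms-full.txt", ".md"]

hits = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if not any(p in line for p in AI_PATHS):
            continue  # only count requests for the AI-facing files
        for bot in BOTS:
            if bot in line:
                hits["bot: " + bot] += 1
        for ref in REFERRERS:
            if ref in line:
                hits["referrer: " + ref] += 1

for key, count in hits.most_common():
    print(f"{count:6d}  {key}")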

The bet here isn't that every file pays off in traffic tomorrow. It's that the AI-mediated web is being built on exactly these conventions right now, and most of them cost you less than an hour. Ship the easy ones, watch the logs, and revisit when AIPREF lands.


If you use the prompts above with Claude or another AI, paste them along with a description of your site and, if you have it, your existing robots.txt, sitemap, or key page URLs. The more context the AI has, the better the file it generates.

