Technical GEO: Making Your Site AI-Crawlable
The Technical Foundation
You can have the best content in the world, but if AI crawlers can't access, parse, or understand it, you'll never get cited.
Technical GEO ensures AI platforms can:
- Access your content (crawl permissions)
- Parse your content (rendering and performance)
- Understand your content (structured data and semantics)
Critical Requirement #1: Allow AI Crawlers
The Problem
Many sites block AI crawlers in robots.txt without realizing it—effectively making themselves invisible to ChatGPT, Claude, Perplexity, and other AI platforms.
The Major AI Crawler User-Agents
ChatGPT (OpenAI):
User-agent: GPTBot
User-agent: ChatGPT-User
Claude (Anthropic):
User-agent: ClaudeBot
User-agent: Claude-Web
Perplexity:
User-agent: PerplexityBot
Google Gemini:
User-agent: Googlebot
User-agent: Google-Extended
Common Crawl (its dataset is used by many AI systems):
User-agent: CCBot
How to Check Your robots.txt
- Visit yoursite.com/robots.txt
- Look for Disallow directives that affect AI bots
- Check if you're blocking important content directories
Bad Example (blocking AI):
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
Good Example (allowing AI):
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Googlebot
Allow: /
Selective Blocking (recommended):
# Allow GPTBot to most content, but block private areas
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /user-dashboard/
Disallow: /checkout/
Implementation Steps
- Audit current robots.txt: Check what you're currently blocking (a quick audit sketch follows this list)
- Whitelist AI bots: Add explicit Allow directives
- Protect sensitive paths: Use targeted Disallow rules for admin/private areas
- Test crawl access: Use tools like Screaming Frog to verify
- Monitor crawler activity: Check server logs for AI bot visits
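To make the audit step concrete, here is a minimal Node.js sketch that fetches your robots.txt and flags any blanket Disallow rules aimed at the AI user-agents above. It is a deliberately naive parser rather than a full robots.txt implementation, and the site URL and bot list are placeholders to adjust:
// audit-robots.mjs: naive check for blanket "Disallow: /" rules aimed at AI crawlers (Node 18+)
const site = process.argv[2] || 'https://citedify.com'
const aiBots = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'CCBot']

const text = await (await fetch(new URL('/robots.txt', site))).text()

const groups = []
let group = null
for (const raw of text.split('\n')) {
  const line = raw.split('#')[0].trim()
  if (!line.includes(':')) continue
  const field = line.slice(0, line.indexOf(':')).trim().toLowerCase()
  const value = line.slice(line.indexOf(':') + 1).trim()
  if (field === 'user-agent') {
    // consecutive User-agent lines share one group of rules
    if (!group || group.rules.length > 0) groups.push(group = { agents: [], rules: [] })
    group.agents.push(value.toLowerCase())
  } else if (group && (field === 'allow' || field === 'disallow')) {
    group.rules.push({ field, value })
  }
}

for (const bot of aiBots) {
  const matching = groups.filter(g => g.agents.includes(bot.toLowerCase()) || g.agents.includes('*'))
  const blocked = matching.some(g => g.rules.some(r => r.field === 'disallow' && r.value === '/'))
  console.log(`${bot}: ${blocked ? 'blanket Disallow found' : 'no blanket block'}`)
}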
Critical Requirement #2: Performance and Timeouts
The Timeout Problem
Many AI systems have strict timeouts:
- 1-5 seconds to retrieve content
- Long content may be truncated after timeout
- Slow sites get skipped entirely
If your page takes 6 seconds to load, AI crawlers may abandon it or only parse partial content.
Performance Targets for GEO
| Metric | Target | Why It Matters |
|---|---|---|
| Time to First Byte (TTFB) | < 200ms | AI crawlers need fast server response |
| First Contentful Paint | < 1.0s | Content must appear quickly |
| Largest Contentful Paint | < 2.5s | Main content fully visible |
| Total Page Load | < 3.0s | Before AI timeout window |
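To get a rough read on where a page sits against these targets, you can time a plain fetch from the command line. This is only an approximation measured from wherever you run it and is no substitute for real monitoring; the URL is a placeholder:
// time-fetch.mjs: rough timing of server response and full download (Node 18+)
const url = process.argv[2] || 'https://citedify.com/learn/what-is-geo'

const start = performance.now()
const res = await fetch(url)
const headersAt = performance.now() // headers received: a rough proxy for TTFB
await res.text()
const doneAt = performance.now() // full body downloaded

console.log(`Status: ${res.status}`)
console.log(`Time to headers (~TTFB): ${Math.round(headersAt - start)} ms`)
console.log(`Total download time: ${Math.round(doneAt - start)} ms`)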
How to Improve Performance
1. Optimize Server Response
- Use a CDN (Cloudflare, Vercel Edge, AWS CloudFront)
- Implement edge caching
- Optimize database queries
- Use server-side caching (Redis, Memcached)
2. Minimize JavaScript
- Server-side render (SSR) critical content
- Avoid client-side rendering for main content
- Defer non-essential scripts
- Use static generation where possible
3. Optimize Assets
- Compress images (WebP, AVIF formats)
- Minify CSS and JavaScript
- Enable Brotli/Gzip compression
- Lazy load below-the-fold content
4. Database and Backend
- Cache API responses (see the caching sketch below)
- Optimize database indexes
- Use connection pooling
- Implement query result caching
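To make the caching items above concrete, here is a minimal sketch of a Next.js API route that lets a CDN cache the response at the edge via a Cache-Control header. The fetchContent() stub and the cache lifetimes are illustrative; tune them to how often your content actually changes:
// pages/api/content.js: hypothetical API route with CDN/edge caching hints
async function fetchContent() {
  // stand-in for a real database or CMS query
  return { title: 'Complete Guide to GEO', updatedAt: '2025-01-15' }
}

export default async function handler(req, res) {
  const data = await fetchContent()
  // Let shared caches (CDNs) serve this for 5 minutes, then revalidate in the background
  res.setHeader('Cache-Control', 'public, s-maxage=300, stale-while-revalidate=600')
  res.status(200).json(data)
}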
Critical Requirement #3: JavaScript and Rendering
The JavaScript Problem
Most AI crawlers don't execute JavaScript well—or at all.
If your content requires JavaScript to display, AI platforms may see a blank page.
Example of the problem:
// BAD: Content only appears after client-side JS executes
import { useState, useEffect } from 'react'

export default function Page() {
  const [content, setContent] = useState('')

  useEffect(() => {
    fetch('/api/content').then(r => r.json()).then(setContent)
  }, [])

  return <div>{content}</div> // Empty on initial HTML
}
AI crawlers will see:
<div></div>
Solutions for JavaScript-Heavy Sites
1. Server-Side Rendering (SSR)
Render content on the server before sending HTML:
// GOOD: Content in initial HTML
export async function getServerSideProps() {
  const content = await fetchContent() // fetchContent() stands in for your own data fetching
  return { props: { content } }
}

export default function Page({ content }) {
  return <div>{content}</div>
}
AI crawlers see:
<div>Actual content here...</div>
2. Static Generation (SSG)
Pre-render pages at build time:
export async function getStaticProps() {
  const content = await fetchContent()
  return { props: { content } }
}
3. Progressive Enhancement
Ensure core content works without JavaScript:
- Critical content in HTML
- JavaScript enhances interactivity
- Graceful degradation for non-JS environments
4. Dynamic Rendering / Prerendering
Serve different versions based on user-agent (a middleware sketch follows this list):
- Regular users: Full SPA experience
- Crawlers: Pre-rendered HTML snapshot
- Tools: Prerender.io, Rendertron
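Here is one way the idea can look in practice, sketched as Next.js middleware. It assumes pre-rendered snapshots already exist under a hypothetical /prerendered/ path (for example, produced by Prerender.io or a build step), and the bot pattern is illustrative rather than exhaustive:
// middleware.js: route known crawlers to a pre-rendered snapshot (assumed to exist)
import { NextResponse } from 'next/server'

const BOT_PATTERN = /GPTBot|ClaudeBot|PerplexityBot|Googlebot/i

export function middleware(request) {
  const userAgent = request.headers.get('user-agent') || ''

  if (BOT_PATTERN.test(userAgent)) {
    // Serve the static HTML snapshot of the same path to crawlers
    const url = request.nextUrl.clone()
    url.pathname = `/prerendered${url.pathname}`
    return NextResponse.rewrite(url)
  }

  // Regular users keep the full client-side experience
  return NextResponse.next()
}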
Testing JavaScript Rendering
Check what AI crawlers see:
- Disable JavaScript in browser
- View your page source (View > Source)
- Search for your main content
- If content is missing, you have a rendering problem
Tools:
- Chrome DevTools: Disable JS in settings
- URL Inspection tool (Google Search Console; replaced the old Fetch as Google)
- Screaming Frog: Check rendered vs. source HTML
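You can also automate the check: fetch the raw HTML the way a non-rendering crawler would and search it for a phrase that should appear in your main content. A minimal sketch; the user-agent string and arguments are whatever you choose:
// check-rendering.mjs: does the initial HTML contain your key content? (Node 18+)
// Usage: node check-rendering.mjs <url> "<phrase that should appear>"
const [url, phrase] = process.argv.slice(2)

const res = await fetch(url, {
  headers: { 'User-Agent': 'GPTBot' }, // simulate a crawler that does not run JavaScript
})
const html = await res.text()

if (html.includes(phrase)) {
  console.log(`OK: found "${phrase}" in the initial HTML`)
} else {
  console.log(`MISSING: "${phrase}" is not in the initial HTML; likely a rendering problem`)
}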
Critical Requirement #4: Structured Data
Why Structured Data Matters
Structured data helps AI platforms understand:
- What type of content you have (article, product, FAQ)
- Key entities and relationships
- Facts and claims with sources
- Author information and credentials
Essential Schema Types for GEO
1. Article Schema
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Complete Guide to GEO",
  "author": {
    "@type": "Person",
    "name": "Jane Smith",
    "url": "https://example.com/authors/jane-smith"
  },
  "datePublished": "2025-01-15",
  "dateModified": "2025-01-15",
  "publisher": {
    "@type": "Organization",
    "name": "Citedify",
    "logo": {
      "@type": "ImageObject",
      "url": "https://citedify.com/logo.png"
    }
  }
}
2. FAQPage Schema
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is GEO?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "GEO (Generative Engine Optimization) is..."
    }
  }]
}
3. HowTo Schema
{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "How to Optimize for AI Search",
  "step": [{
    "@type": "HowToStep",
    "name": "Allow AI Crawlers",
    "text": "Update robots.txt to allow GPTBot..."
  }]
}
4. Organization Schema
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Citedify",
  "description": "AI visibility monitoring and GEO analytics",
  "url": "https://citedify.com",
  "sameAs": [
    "https://twitter.com/citedify",
    "https://linkedin.com/company/citedify"
  ]
}
Implementing Structured Data
Next.js Example:
export default function Article({ data }) {
  const schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": data.title,
    "datePublished": data.publishedAt,
    "author": {
      "@type": "Person",
      "name": data.author.name
    }
  }

  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(schema) }}
      />
      <article>{/* content */}</article>
    </>
  )
}
Testing Structured Data
Tools:
- Google Rich Results Test
- Schema.org Validator
- Structured Data Linter
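Beyond these validators, a quick script can confirm that the schema is actually present in the server-rendered HTML rather than injected client-side. This sketch uses a simple regex and will miss edge cases, but it is enough for a smoke test:
// check-schema.mjs: list JSON-LD blocks present in the initial HTML (Node 18+)
const url = process.argv[2]

const html = await (await fetch(url)).text()
const blocks = [...html.matchAll(/<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi)]

if (blocks.length === 0) {
  console.log('No JSON-LD found in the initial HTML')
}

for (const [, json] of blocks) {
  try {
    const data = JSON.parse(json)
    console.log('Found schema type:', data['@type'])
  } catch {
    console.log('Found a JSON-LD block that does not parse as valid JSON')
  }
}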
Advanced Technical Optimizations
1. Create an llms.txt File
For documentation and reference content, create /llms.txt:
# Citedify Documentation
## Main Pages
- https://citedify.com/docs/getting-started
- https://citedify.com/docs/api-reference
- https://citedify.com/docs/integrations
## Guides
- https://citedify.com/learn/what-is-geo
- https://citedify.com/learn/technical-geo
This helps AI platforms find your most important content.
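One low-effort way to keep the file current is to generate it at build time from a hand-picked page list. A minimal sketch; the output path (public/llms.txt, which Next.js serves from the site root) and the page list are assumptions to replace with your own:
// scripts/generate-llms-txt.mjs: write llms.txt from a hand-picked page list
import { writeFileSync } from 'node:fs'

const pages = {
  'Main Pages': [
    'https://citedify.com/docs/getting-started',
    'https://citedify.com/docs/api-reference',
  ],
  'Guides': [
    'https://citedify.com/learn/what-is-geo',
    'https://citedify.com/learn/technical-geo',
  ],
}

let output = '# Citedify Documentation\n'
for (const [section, urls] of Object.entries(pages)) {
  output += `\n## ${section}\n`
  for (const url of urls) output += `- ${url}\n`
}

writeFileSync('public/llms.txt', output)
console.log('Wrote public/llms.txt')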
2. Optimize XML Sitemap
Ensure your sitemap:
- Lists all important pages
- Includes <lastmod> dates
- Prioritizes key content with <priority>
- Updates frequently
<url>
  <loc>https://citedify.com/learn/what-is-geo</loc>
  <lastmod>2025-01-15</lastmod>
  <priority>0.9</priority>
</url>
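If your site runs on the Next.js App Router, you can generate these entries from a sitemap route instead of maintaining the XML by hand. A minimal sketch with hard-coded entries standing in for your real page data:
// app/sitemap.js: generated sitemap (Next.js App Router)
export default async function sitemap() {
  // In practice, pull this list (and lastModified dates) from your CMS or filesystem
  return [
    {
      url: 'https://citedify.com/learn/what-is-geo',
      lastModified: new Date('2025-01-15'),
      priority: 0.9,
    },
    {
      url: 'https://citedify.com/learn/technical-geo',
      lastModified: new Date('2025-01-15'),
      priority: 0.9,
    },
  ]
}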
3. Implement Canonical URLs
Prevent duplicate content issues:
<link rel="canonical" href="https://citedify.com/learn/what-is-geo" />
4. Use Semantic HTML
Structure content with proper HTML5 elements:
<!-- Good semantic structure -->
<article>
  <header>
    <h1>Article Title</h1>
    <time datetime="2025-01-15">January 15, 2025</time>
  </header>
  <section>
    <h2>Section Title</h2>
    <p>Content...</p>
  </section>
  <footer>
    <address>Author info</address>
  </footer>
</article>
Avoid:
<!-- Bad: div soup -->
<div class="article">
  <div class="title">Article Title</div>
  <div class="content">Content...</div>
</div>
5. Mobile Optimization
Ensure your site is mobile-friendly:
- Responsive design
- Readable font sizes (16px minimum)
- Touch-friendly buttons (44x44px minimum)
- No horizontal scrolling
- Fast mobile performance
6. HTTPS Only
Use HTTPS everywhere:
- SSL certificate required
- HTTP redirects to HTTPS
- Mixed content issues resolved
7. International Content
For multi-language sites:
- Use hreflang tags
- Clear language indicators
- Proper lang attribute on <html>
<link rel="alternate" hreflang="en" href="https://citedify.com/learn/what-is-geo" />
<link rel="alternate" hreflang="es" href="https://citedify.com/es/aprender/que-es-geo" />
Monitoring AI Crawler Activity
Check Server Logs
Look for AI bot visits in your access logs:
grep "GPTBot" /var/log/nginx/access.log
grep "ClaudeBot" /var/log/nginx/access.log
grep "PerplexityBot" /var/log/nginx/access.log
What to look for:
- Frequency of visits
- Pages being crawled
- Response times
- Error rates (4xx, 5xx)
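To turn those grep commands into numbers, a short script can summarize visit counts and error rates per bot. This sketch assumes a standard combined access log format and the default log path shown earlier; adjust both for your server:
// summarize-ai-bots.mjs: count AI crawler hits and error rates from an access log
import { readFileSync } from 'node:fs'

const logPath = process.argv[2] || '/var/log/nginx/access.log'
const bots = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'CCBot']
const stats = Object.fromEntries(bots.map(b => [b, { hits: 0, errors: 0 }]))

for (const line of readFileSync(logPath, 'utf8').split('\n')) {
  const bot = bots.find(b => line.includes(b))
  if (!bot) continue
  stats[bot].hits += 1
  // Combined log format: ... "GET /path HTTP/1.1" 200 1234 ...
  const status = line.match(/" (\d{3}) /)?.[1]
  if (status && Number(status) >= 400) stats[bot].errors += 1
}

for (const [bot, { hits, errors }] of Object.entries(stats)) {
  console.log(`${bot}: ${hits} requests, ${errors} errors (4xx/5xx)`)
}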
Common Issues in Logs
Problem: No AI bot visits. Solution: Check robots.txt and verify content quality.
Problem: High error rates (404, 500). Solution: Fix broken pages and improve server stability.
Problem: Long response times. Solution: Optimize performance (see the Performance section above).
Technical GEO Checklist
Before launching content, verify:
Crawler Access:
- robots.txt allows GPTBot, ClaudeBot, PerplexityBot
- No accidental blocks of important directories
- XML sitemap up-to-date and submitted
Performance:
- TTFB < 200ms
- LCP < 2.5s
- CDN configured
- Assets optimized and compressed
Rendering:
- Content visible without JavaScript
- SSR or SSG implemented for dynamic content
- Tested with JS disabled in browser
Structured Data:
- Article, Organization, or relevant schema implemented
- Schema validated with testing tools
- Author information included
HTML & Semantics:
- Semantic HTML5 elements used
- Proper heading hierarchy (H1 > H2 > H3)
- Canonical URLs set
- Meta descriptions present
Mobile & Security:
- Mobile-responsive design
- HTTPS enabled
- No mixed content issues
Monitoring:
- Server logs tracked for AI bot activity
- Performance monitoring in place
- Error tracking configured
The Bottom Line
Technical GEO is the foundation everything else builds on. You can have perfect content, but if AI crawlers can't access or understand it, you'll never get cited.
The three critical moves:
- Allow crawlers: Update robots.txt, verify access
- Optimize performance: Sub-3-second page loads, SSR for JS content
- Add structured data: Help AI understand your content type and entities
Fix these technical issues first, then layer in content optimization. Skip technical GEO, and you're invisible—no matter how great your content is.
Next up: Learn platform-specific strategies for ChatGPT, Perplexity, Claude, and Google AI.