Module 4 of 6

Technical GEO: Making Your Site AI-Crawlable

The Technical Foundation

You can have the best content in the world, but if AI crawlers can't access, parse, or understand it, you'll never get cited.

Technical GEO ensures AI platforms can:

  1. Access your content (crawl permissions)
  2. Parse your content (rendering and performance)
  3. Understand your content (structured data and semantics)

Critical Technical Requirement #1: Allow AI Crawlers

The Problem

Many sites block AI crawlers in robots.txt without realizing it—effectively making themselves invisible to ChatGPT, Claude, Perplexity, and other AI platforms.

The Major AI Crawler User-Agents

ChatGPT (OpenAI):

User-agent: GPTBot
User-agent: ChatGPT-User

Claude (Anthropic):

User-agent: ClaudeBot
User-agent: Claude-Web

Perplexity:

User-agent: PerplexityBot

Google Gemini:

User-agent: Googlebot
User-agent: Google-Extended

Common Crawl (used as a training-data source by many AI systems):

User-agent: CCBot

How to Check Your robots.txt

  1. Visit yoursite.com/robots.txt
  2. Look for Disallow directives that affect AI bots
  3. Check if you're blocking important content directories

Bad Example (blocking AI):

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Good Example (allowing AI):

User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Googlebot
Allow: /

Selective Blocking (recommended):

# Allow GPTBot to crawl most content, but keep it out of private areas
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /user-dashboard/
Disallow: /checkout/

Implementation Steps

  1. Audit current robots.txt: Check what you're currently blocking (a small audit script follows these steps)
  2. Whitelist AI bots: Add explicit Allow directives
  3. Protect sensitive paths: Use targeted Disallow for admin/private areas
  4. Test crawl access: Use tools like Screaming Frog to verify
  5. Monitor crawler activity: Check server logs for AI bot visits
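
To automate steps 1 and 4, a short script can fetch your robots.txt and flag any AI bots that are blocked outright. A minimal sketch, assuming Node 18+ for the built-in fetch; the SITE value and the simplified rule parsing are placeholders for your own setup:

// audit-robots.js - minimal robots.txt audit sketch (Node 18+, run: node audit-robots.js)
const SITE = 'https://example.com'; // replace with your domain
const AI_BOTS = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Google-Extended', 'CCBot'];

async function auditRobots() {
  const res = await fetch(`${SITE}/robots.txt`);
  const text = await res.text();

  // Walk the file, attributing Disallow rules to the user-agents listed above them
  const disallows = {}; // user-agent -> array of disallowed path prefixes
  let currentAgents = [];
  let inRules = false;

  for (const raw of text.split('\n')) {
    const line = raw.split('#')[0].trim();
    const colon = line.indexOf(':');
    if (colon === -1) continue;
    const key = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (key === 'user-agent') {
      if (inRules) { currentAgents = []; inRules = false; } // a new group starts
      currentAgents.push(value);
    } else if (key === 'allow' || key === 'disallow') {
      inRules = true;
      if (key === 'disallow' && value) {
        for (const agent of currentAgents) (disallows[agent] ??= []).push(value);
      }
    }
  }

  for (const bot of AI_BOTS) {
    const rules = disallows[bot] ?? disallows['*'] ?? [];
    console.log(
      rules.includes('/')
        ? `${bot}: BLOCKED from the entire site`
        : `${bot}: allowed (${rules.length} Disallow rule(s))`
    );
  }
}

auditRobots().catch(console.error);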

Critical Requirement #2: Performance and Timeouts

The Timeout Problem

Many AI systems have strict timeouts:

  • Typically 1-5 seconds to fetch a page
  • Responses still loading at the cutoff may be truncated
  • Slow sites may be skipped entirely

If your page takes 6 seconds to load, AI crawlers may abandon it or only parse partial content.

Performance Targets for GEO

Metric                         | Target  | Why It Matters
Time to First Byte (TTFB)      | < 200ms | AI crawlers need a fast server response
First Contentful Paint (FCP)   | < 1.0s  | Content must appear quickly
Largest Contentful Paint (LCP) | < 2.5s  | Main content is fully visible
Total page load                | < 3.0s  | Stays inside typical AI timeout windows

How to Improve Performance

1. Optimize Server Response

  • Use a CDN (Cloudflare, Vercel Edge, AWS CloudFront)
  • Implement edge caching (see the config sketch after this list)
  • Optimize database queries
  • Use server-side caching (Redis, Memcached)
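
As one way to apply the CDN and edge-caching points above on a Next.js site, compression and CDN-friendly cache headers can be configured in one place. A minimal sketch; the /learn/:slug route is illustrative:

// next.config.js - minimal sketch of compression plus CDN-friendly cache headers
module.exports = {
  compress: true, // gzip responses served by the Next.js server
  async headers() {
    return [
      {
        source: '/learn/:slug', // illustrative path - match your own content routes
        headers: [
          {
            key: 'Cache-Control',
            // let the CDN cache for an hour and serve stale copies while revalidating
            value: 'public, s-maxage=3600, stale-while-revalidate=86400',
          },
        ],
      },
    ];
  },
};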

2. Minimize JavaScript

  • Server-side render (SSR) critical content
  • Avoid client-side rendering for main content
  • Defer non-essential scripts
  • Use static generation where possible

3. Optimize Assets

  • Compress images (WebP, AVIF formats)
  • Minify CSS and JavaScript
  • Enable Brotli/Gzip compression
  • Lazy load below-the-fold content

4. Database and Backend

  • Cache API responses (a minimal caching helper follows this list)
  • Optimize database indexes
  • Use connection pooling
  • Implement query result caching
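
For API-response caching, a tiny TTL wrapper is often enough to start with. A sketch, assuming a single server process (swap in Redis or Memcached once you run multiple instances); the helper and its usage are illustrative:

// cache.js - tiny in-memory TTL cache for API or database responses
const store = new Map();

async function cached(key, ttlMs, loader) {
  const hit = store.get(key);
  if (hit && hit.expires > Date.now()) return hit.value; // still fresh

  const value = await loader(); // e.g. a database query or upstream API call
  store.set(key, { value, expires: Date.now() + ttlMs });
  return value;
}

// Usage: cache the article list for five minutes
// const articles = await cached('articles', 5 * 60 * 1000, () => db.articles.findMany());

module.exports = { cached };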

Critical Requirement #3: JavaScript and Rendering

The JavaScript Problem

Most AI crawlers don't execute JavaScript well—or at all.

If your content requires JavaScript to display, AI platforms may see a blank page.

Example of the problem:

// BAD: Content only appears after client-side JavaScript runs
import { useState, useEffect } from 'react'

export default function Page() {
  const [content, setContent] = useState('')

  useEffect(() => {
    fetch('/api/content').then(r => r.json()).then(setContent)
  }, [])

  return <div>{content}</div> // Empty in the initial HTML
}

AI crawlers will see:

<div></div>

Solutions for JavaScript-Heavy Sites

1. Server-Side Rendering (SSR)

Render content on the server before sending HTML:

// GOOD: Content in initial HTML
export async function getServerSideProps() {
  const content = await fetchContent()
  return { props: { content } }
}

export default function Page({ content }) {
  return <div>{content}</div>
}

AI crawlers see:

<div>Actual content here...</div>

2. Static Generation (SSG)

Pre-render pages at build time:

export async function getStaticProps() {
  const content = await fetchContent()
  return { props: { content } }
}
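
Both snippets above use the Next.js Pages Router. On the App Router (Next.js 13+), Server Components fetch on the server by default, so the content ships in the initial HTML without extra wiring. A sketch; the route and API URL are illustrative:

// app/learn/what-is-geo/page.jsx - App Router sketch (illustrative API URL)
export default async function Page() {
  // Runs on the server, so the fetched content is part of the initial HTML
  const res = await fetch('https://example.com/api/content/what-is-geo', {
    next: { revalidate: 3600 }, // regenerate this page at most once per hour
  });
  const content = await res.json();

  return <article>{content.body}</article>;
}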

3. Progressive Enhancement

Ensure core content works without JavaScript:

  • Critical content in HTML
  • JavaScript enhances interactivity
  • Graceful degradation for non-JS environments

4. Dynamic Rendering / Prerendering

Serve different versions based on user-agent:

  • Regular users: Full SPA experience
  • Crawlers: Pre-rendered HTML snapshot
  • Tools: Prerender.io, Rendertron (a rough middleware sketch follows)
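
A rough sketch of the user-agent switch, assuming an Express-style server and a prerender service (for example a self-hosted Rendertron) reachable at PRERENDER_URL; the bot list and origin are illustrative:

// prerender-middleware.js - serve pre-rendered HTML snapshots to known crawlers
const BOT_PATTERN = /GPTBot|ChatGPT-User|ClaudeBot|PerplexityBot|Googlebot|CCBot/i;
const PRERENDER_URL = process.env.PRERENDER_URL; // e.g. your Rendertron instance
const SITE_ORIGIN = 'https://example.com';       // your public origin

async function prerenderForBots(req, res, next) {
  // Regular visitors get the normal client-rendered app
  if (!BOT_PATTERN.test(req.get('user-agent') || '')) return next();

  // Crawlers get a pre-rendered HTML snapshot of the same URL
  const target = `${PRERENDER_URL}/render/${encodeURIComponent(SITE_ORIGIN + req.originalUrl)}`;
  const snapshot = await fetch(target);
  res.status(snapshot.status).send(await snapshot.text());
}

module.exports = prerenderForBots;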

Testing JavaScript Rendering

Check what AI crawlers see:

  1. Disable JavaScript in your browser and reload the page
  2. View the page source (right-click > View Page Source)
  3. Search for your main content
  4. If the content is missing, you have a rendering problem

Tools:

  • Chrome DevTools: Disable JS in settings
  • URL Inspection in Google Search Console (the successor to Fetch as Google)
  • Screaming Frog: Check rendered vs. source HTML
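
You can also script the raw-HTML check. A minimal sketch (Node 18+, save as check-rendering.mjs); the URL, user-agent, and marker text are placeholders for your own page:

// check-rendering.mjs - does the raw HTML (no JavaScript) contain your content?
const URL_TO_CHECK = 'https://example.com/learn/what-is-geo'; // your page
const MUST_CONTAIN = 'Generative Engine Optimization';        // text that should be server-rendered

const res = await fetch(URL_TO_CHECK, {
  headers: { 'User-Agent': 'GPTBot' }, // some sites serve different HTML to bots
});
const html = await res.text();

console.log(
  html.includes(MUST_CONTAIN)
    ? 'OK: the content is present in the initial HTML'
    : 'WARNING: content not found - it is probably rendered client-side'
);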

Critical Requirement #4: Structured Data

Why Structured Data Matters

Structured data helps AI platforms understand:

  • What type of content you have (article, product, FAQ)
  • Key entities and relationships
  • Facts and claims with sources
  • Author information and credentials

Essential Schema Types for GEO

1. Article Schema

{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Complete Guide to GEO",
  "author": {
    "@type": "Person",
    "name": "Jane Smith",
    "url": "https://example.com/authors/jane-smith"
  },
  "datePublished": "2025-01-15",
  "dateModified": "2025-01-15",
  "publisher": {
    "@type": "Organization",
    "name": "Citedify",
    "logo": {
      "@type": "ImageObject",
      "url": "https://citedify.com/logo.png"
    }
  }
}

2. FAQPage Schema

{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is GEO?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "GEO (Generative Engine Optimization) is..."
    }
  }]
}

3. HowTo Schema

{
  "@context": "https://schema.org",
  "@type": "HowTo",
  "name": "How to Optimize for AI Search",
  "step": [{
    "@type": "HowToStep",
    "name": "Allow AI Crawlers",
    "text": "Update robots.txt to allow GPTBot..."
  }]
}

4. Organization Schema

{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Citedify",
  "description": "AI visibility monitoring and GEO analytics",
  "url": "https://citedify.com",
  "sameAs": [
    "https://twitter.com/citedify",
    "https://linkedin.com/company/citedify"
  ]
}

Implementing Structured Data

Next.js Example:

export default function Article({ data }) {
  const schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": data.title,
    "datePublished": data.publishedAt,
    "author": {
      "@type": "Person",
      "name": data.author.name
    }
  }

  return (
    <>
      <script
        type="application/ld+json"
        dangerouslySetInnerHTML={{ __html: JSON.stringify(schema) }}
      />
      <article>{/* content */}</article>
    </>
  )
}

Testing Structured Data

Tools:

  • Google Rich Results Test
  • Schema.org Validator
  • Structured Data Linter
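
Alongside those validators, a quick script can confirm that every JSON-LD block on a page at least parses. A minimal sketch (Node 18+, save as check-jsonld.mjs); the URL is a placeholder:

// check-jsonld.mjs - extract JSON-LD blocks from a page and confirm they parse
const res = await fetch('https://example.com/learn/what-is-geo'); // your page
const html = await res.text();

// Naive extraction of <script type="application/ld+json"> blocks
const blocks = html.matchAll(
  /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
);

let count = 0;
for (const [, body] of blocks) {
  count += 1;
  try {
    const data = JSON.parse(body);
    console.log(`Block ${count}: valid JSON-LD (@type: ${data['@type']})`);
  } catch (err) {
    console.error(`Block ${count}: broken JSON-LD - ${err.message}`);
  }
}
if (count === 0) console.warn('No JSON-LD blocks found on this page');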

Advanced Technical Optimizations

1. Create an llms.txt File

For documentation and reference content, create /llms.txt:

# Citedify Documentation

## Main Pages
- https://citedify.com/docs/getting-started
- https://citedify.com/docs/api-reference
- https://citedify.com/docs/integrations

## Guides
- https://citedify.com/learn/what-is-geo
- https://citedify.com/learn/technical-geo

This helps AI platforms find your most important content.
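
If your site is built from code, generating llms.txt at build time keeps it from drifting out of date. A minimal Node sketch; the script path, output location, and section contents are illustrative:

// scripts/generate-llms.js - write llms.txt at build time
const fs = require('node:fs');

const sections = {
  'Main Pages': [
    'https://citedify.com/docs/getting-started',
    'https://citedify.com/docs/api-reference',
    'https://citedify.com/docs/integrations',
  ],
  'Guides': [
    'https://citedify.com/learn/what-is-geo',
    'https://citedify.com/learn/technical-geo',
  ],
};

let out = '# Citedify Documentation\n\n';
for (const [title, urls] of Object.entries(sections)) {
  out += `## ${title}\n` + urls.map((u) => `- ${u}`).join('\n') + '\n\n';
}

// public/ is where most frameworks serve static files from the site root
fs.writeFileSync('public/llms.txt', out);
console.log('Wrote public/llms.txt');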

2. Optimize XML Sitemap

Ensure your sitemap:

  • Lists all important pages
  • Includes <lastmod> dates
  • Prioritizes key content with <priority>
  • Is regenerated whenever content changes

Example entry:

<url>
  <loc>https://citedify.com/learn/what-is-geo</loc>
  <lastmod>2025-01-15</lastmod>
  <priority>0.9</priority>
</url>
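
If you run Next.js with the App Router, the sitemap can be generated from code instead of maintained by hand. A sketch with illustrative entries:

// app/sitemap.js - Next.js serves the result as /sitemap.xml
export default function sitemap() {
  return [
    {
      url: 'https://citedify.com/learn/what-is-geo',
      lastModified: '2025-01-15',
      priority: 0.9,
    },
    {
      url: 'https://citedify.com/learn/technical-geo',
      lastModified: '2025-01-15',
      changeFrequency: 'monthly',
      priority: 0.8,
    },
  ];
}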

3. Implement Canonical URLs

Prevent duplicate content issues:

<link rel="canonical" href="https://citedify.com/learn/what-is-geo" />

4. Use Semantic HTML

Structure content with proper HTML5 elements:

<!-- Good semantic structure -->
<article>
  <header>
    <h1>Article Title</h1>
    <time datetime="2025-01-15">January 15, 2025</time>
  </header>

  <section>
    <h2>Section Title</h2>
    <p>Content...</p>
  </section>

  <footer>
    <address>Author info</address>
  </footer>
</article>

Avoid:

<!-- Bad: div soup -->
<div class="article">
  <div class="title">Article Title</div>
  <div class="content">Content...</div>
</div>

5. Mobile Optimization

Ensure mobile-friendly:

  • Responsive design
  • Readable font sizes (16px minimum)
  • Touch-friendly buttons (44x44px minimum)
  • No horizontal scrolling
  • Fast mobile performance

6. HTTPS Only

Use HTTPS everywhere:

  • SSL certificate required
  • HTTP redirects to HTTPS
  • Mixed content issues resolved

7. International Content

For multi-language sites:

  • Use hreflang tags
  • Clear language indicators
  • Proper lang attribute on <html>

Example:

<link rel="alternate" hreflang="en" href="https://citedify.com/learn/what-is-geo" />
<link rel="alternate" hreflang="es" href="https://citedify.com/es/aprender/que-es-geo" />

Monitoring AI Crawler Activity

Check Server Logs

Look for AI bot visits in your access logs:

grep "GPTBot" /var/log/nginx/access.log
grep "ClaudeBot" /var/log/nginx/access.log
grep "PerplexityBot" /var/log/nginx/access.log

What to look for (a small log summarizer follows this list):

  • Frequency of visits
  • Pages being crawled
  • Response times
  • Error rates (4xx, 5xx)
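
A short script can pull these numbers out of the log for you. A sketch, assuming the standard nginx combined log format and Node 18+:

// crawler-report.js - summarize AI bot hits and error rates from an access log
const fs = require('node:fs');
const readline = require('node:readline');

const LOG_PATH = '/var/log/nginx/access.log';
const BOTS = ['GPTBot', 'ChatGPT-User', 'ClaudeBot', 'PerplexityBot', 'Google-Extended', 'CCBot'];
const stats = Object.fromEntries(BOTS.map((b) => [b, { hits: 0, errors: 0 }]));

const rl = readline.createInterface({ input: fs.createReadStream(LOG_PATH) });

rl.on('line', (line) => {
  const bot = BOTS.find((b) => line.includes(b));
  if (!bot) return;
  stats[bot].hits += 1;
  // In the combined format, the status code follows the quoted request line
  const status = line.match(/" (\d{3}) /);
  if (status && Number(status[1]) >= 400) stats[bot].errors += 1;
});

rl.on('close', () => console.table(stats));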

Common Issues in Logs

Problem: No AI bot visits
Solution: Check robots.txt, verify content quality

Problem: High error rates (404, 500)
Solution: Fix broken pages, improve server stability

Problem: Long response times
Solution: Optimize performance (see the Performance section above)

Technical GEO Checklist

Before launching content, verify:

Crawler Access:

  • robots.txt allows GPTBot, ClaudeBot, PerplexityBot
  • No accidental blocks of important directories
  • XML sitemap up-to-date and submitted

Performance:

  • TTFB < 200ms
  • LCP < 2.5s
  • CDN configured
  • Assets optimized and compressed

Rendering:

  • Content visible without JavaScript
  • SSR or SSG implemented for dynamic content
  • Tested with JS disabled in browser

Structured Data:

  • Article, Organization, or relevant schema implemented
  • Schema validated with testing tools
  • Author information included

HTML & Semantics:

  • Semantic HTML5 elements used
  • Proper heading hierarchy (H1 > H2 > H3)
  • Canonical URLs set
  • Meta descriptions present

Mobile & Security:

  • Mobile-responsive design
  • HTTPS enabled
  • No mixed content issues

Monitoring:

  • Server logs tracked for AI bot activity
  • Performance monitoring in place
  • Error tracking configured

The Bottom Line

Technical GEO is the foundation everything else builds on. You can have perfect content, but if AI crawlers can't access or understand it, you'll never get cited.

The three critical moves:

  1. Allow crawlers: Update robots.txt, verify access
  2. Optimize performance: Sub-3-second page loads, SSR for JS content
  3. Add structured data: Help AI understand your content type and entities

Fix these technical issues first, then layer in content optimization. Skip technical GEO, and you're invisible—no matter how great your content is.

Next up: Learn platform-specific strategies for ChatGPT, Perplexity, Claude, and Google AI.

Ready to Track Your GEO Performance?

Start monitoring your brand visibility across ChatGPT, Perplexity, Claude, and Google AI. Get actionable insights to improve your GEO score.