XML Sitemaps
An XML sitemap is a file that lists the URLs on your website that you want search engines to crawl and index. Think of it as a roadmap for search engine crawlers — a structured document that tells Googlebot, Bingbot, and other crawlers exactly which pages exist, when they were last updated, and (optionally) how often they change. While well-structured websites with strong internal linking can be discovered through crawling alone, XML sitemaps provide an explicit, authoritative inventory of your content that ensures nothing is missed.
Sitemaps are defined by the Sitemaps Protocol, an open standard originally developed by Google and later adopted by Microsoft, Yahoo, and other search engines. The protocol specifies a simple XML format that is easy to generate, validate, and maintain — making it an ideal target for automated quality checks.
XML Sitemap Format and Structure
An XML sitemap is a valid XML document that conforms to the Sitemaps Protocol schema. The root element is <urlset>, which contains one or more <url> elements, each representing a single page on your site. Here is a complete, minimal example:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://example.com/</loc>
<lastmod>2025-06-15</lastmod>
</url>
<url>
<loc>https://example.com/about/</loc>
<lastmod>2025-05-20</lastmod>
</url>
<url>
<loc>https://example.com/contact/</loc>
<lastmod>2025-04-10</lastmod>
</url>
</urlset>
The xmlns attribute on the <urlset> element declares the Sitemaps Protocol namespace. This is required for the sitemap to be valid. The file must be UTF-8 encoded and must be valid XML (properly nested elements, properly escaped special characters, and so on).
Required vs Optional Elements
Each <url> entry in a sitemap can contain four elements, only one of which is required:
loc (Required)
The <loc> element specifies the URL of the page. This is the only required element. The URL must be a fully qualified, absolute URL including the protocol (https://). It must match the canonical URL of the page — if your canonical URL uses https://www.example.com/, the sitemap should use the same format, not https://example.com/. URLs must be properly escaped: ampersands must be written as &amp;, and other special XML characters must be escaped as well.
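If you build sitemap entries by hand or with string templates, this escaping is easy to get wrong for URLs that carry query strings. As a rough illustration (the URL below is made up), Python's standard library can handle the XML escaping for you:

from xml.sax.saxutils import escape

# A hypothetical URL containing a raw ampersand in its query string.
raw_url = "https://example.com/search?q=shoes&page=2"

# escape() converts &, <, and > into their XML entities, making the
# value safe to place inside a <loc> element.
print(escape(raw_url))
# https://example.com/search?q=shoes&amp;page=2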
lastmod (Recommended)
The <lastmod> element specifies the date the page was last modified. Use the W3C Datetime format: either YYYY-MM-DD (e.g., 2025-06-15) or the full datetime format with timezone (2025-06-15T14:30:00+00:00). Google has stated that they use lastmod data when it is "consistently and verifiably accurate." If you set lastmod to the current date on every page regardless of actual changes, Google will learn to ignore it.
Update the lastmod date only when the content of the page actually changes in a meaningful way. Do not update it for trivial changes like fixing a typo in the footer or updating a copyright year. Inaccurate lastmod values train search engines to distrust your sitemap data.
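For build-time generators, one common source of truthful lastmod values is the modification time of the underlying content file. A minimal sketch, assuming your pages are backed by local files (the path below is hypothetical), of producing both accepted W3C Datetime forms in Python:

from datetime import datetime, timezone
from pathlib import Path

# Hypothetical content file backing a page on the site.
source = Path("content/about.md")
modified = datetime.fromtimestamp(source.stat().st_mtime, tz=timezone.utc)

print(modified.date().isoformat())            # e.g. 2025-06-15
print(modified.isoformat(timespec="seconds")) # e.g. 2025-06-15T14:30:00+00:00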
changefreq (Optional, largely ignored)
The <changefreq> element provides a hint about how frequently the page is likely to change. Valid values are: always, hourly, daily, weekly, monthly, yearly, and never. In practice, Google has stated that it largely ignores this element and relies on its own crawling data to determine crawl frequency. You may include it, but do not rely on it to control crawl behavior.
priority (Optional, largely ignored)
The <priority> element indicates the relative importance of the page within your site, on a scale from 0.0 to 1.0. The default value is 0.5. Like changefreq, Google has indicated that it largely ignores this signal. The priority is relative to your own site — it does not affect how your pages rank against pages from other sites. Because this element is so widely misused (many sites set all pages to 1.0), search engines have learned to discount it.
Sitemap Index Files for Large Sites
The Sitemaps Protocol imposes two limits on individual sitemap files:
- Maximum 50,000 URLs per sitemap file
- Maximum 50MB (uncompressed) file size per sitemap
For sites with more than 50,000 URLs, you need to split your URLs across multiple sitemap files and use a sitemap index file to tie them together. A sitemap index file has a similar structure but uses <sitemapindex> as the root element and <sitemap> entries instead of <url> entries:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://example.com/sitemap-pages.xml</loc>
<lastmod>2025-06-15</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-blog.xml</loc>
<lastmod>2025-06-14</lastmod>
</sitemap>
<sitemap>
<loc>https://example.com/sitemap-products.xml</loc>
<lastmod>2025-06-10</lastmod>
</sitemap>
</sitemapindex>
Even for smaller sites that do not exceed the 50,000 URL limit, using a sitemap index with multiple sitemaps organized by content type (pages, blog posts, products, images) is a good organizational practice. It makes debugging easier and allows you to see at a glance which sections of your site have been updated recently.
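To make the 50,000-URL limit concrete, here is a rough sketch (not a production generator) of splitting a URL inventory into multiple sitemap files and writing a matching index, using only Python's standard library. The site root and URL list are hypothetical:

from datetime import date
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # protocol limit per sitemap file
BASE = "https://example.com"  # hypothetical site root

def write_sitemap(path, urls):
    # Write one <urlset> file containing the given URLs.
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<urlset xmlns="{SITEMAP_NS}">\n')
        for url in urls:
            f.write(f"  <url><loc>{escape(url)}</loc></url>\n")
        f.write("</urlset>\n")

def write_index(path, sitemap_urls):
    # Write the <sitemapindex> file that points at each sitemap.
    today = date.today().isoformat()
    with open(path, "w", encoding="utf-8") as f:
        f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        f.write(f'<sitemapindex xmlns="{SITEMAP_NS}">\n')
        for url in sitemap_urls:
            f.write(f"  <sitemap><loc>{escape(url)}</loc>"
                    f"<lastmod>{today}</lastmod></sitemap>\n")
        f.write("</sitemapindex>\n")

# Hypothetical URL inventory; in practice this would come from your CMS or database.
all_urls = [f"{BASE}/page-{i}/" for i in range(1, 120_001)]

chunks = [all_urls[i:i + MAX_URLS] for i in range(0, len(all_urls), MAX_URLS)]
sitemap_urls = []
for n, chunk in enumerate(chunks, start=1):
    filename = f"sitemap-{n}.xml"
    write_sitemap(filename, chunk)
    sitemap_urls.append(f"{BASE}/{filename}")

write_index("sitemap.xml", sitemap_urls)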
Submitting Sitemaps to Search Consoles
Creating a sitemap is only half the job — you also need to tell search engines where to find it. There are three methods for submitting your sitemap:
Google Search Console
Log in to Google Search Console, select your property, navigate to "Sitemaps" in the left sidebar, enter the URL of your sitemap (e.g., https://example.com/sitemap.xml), and click "Submit." Google Search Console will show you the submission status, the number of URLs discovered, and any errors or warnings found in the sitemap. Check back periodically to monitor the coverage report, which shows how many of your sitemap URLs are indexed.
Bing Webmaster Tools
The process for Bing is similar. Log in to Bing Webmaster Tools, select your site, navigate to "Sitemaps" under "Configure My Site," and submit your sitemap URL. Bing provides similar coverage and error reporting.
robots.txt Reference
You can also declare your sitemap location in your robots.txt file by adding a Sitemap directive:
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
This method works for all search engines that support the Sitemaps Protocol. It is not a substitute for direct submission through search console tools (which provide error reporting and monitoring), but it serves as a helpful supplementary signal that any crawler can discover.
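Because the Sitemap directive is machine-readable, any crawler or audit script can pick it up. As one illustration, Python's standard library (3.8 or newer) exposes it directly, assuming the site's robots.txt is reachable:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # hypothetical site
parser.read()

# site_maps() returns the URLs listed in Sitemap: directives,
# or None if the file does not declare any.
print(parser.site_maps())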
Dynamic Sitemap Generation
For sites where content changes frequently — blogs, e-commerce stores, news sites, documentation sites — maintaining a static XML file by hand is impractical. Dynamic sitemap generation creates the sitemap automatically from your content database or file system, ensuring it is always up to date.
Common approaches to dynamic sitemap generation include:
- CMS plugins: WordPress has plugins like Yoast SEO and XML Sitemaps that generate and update sitemaps automatically. Most modern CMS platforms have similar built-in or plugin-based solutions.
- Framework-level solutions: Next.js, Nuxt.js, Gatsby, and other static site generators can generate sitemaps as part of their build process. Framework-specific packages like next-sitemap automate this further.
- Server-side scripts: A PHP, Python, or Node.js script can query your database for all published pages, generate the XML, and serve it at /sitemap.xml. This approach gives you full control over which pages are included and what metadata is attached (see the sketch after this list).
- Build-time generation: For static sites, generate the sitemap as part of your build process. This ensures the sitemap matches the deployed pages exactly and avoids the runtime overhead of dynamic generation.
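As a minimal sketch of the server-side approach (the choice of Flask, the get_published_pages() helper, and the page fields are all assumptions for illustration), a dynamic /sitemap.xml endpoint might look like this:

from flask import Flask, Response
from xml.sax.saxutils import escape

app = Flask(__name__)

def get_published_pages():
    # Hypothetical data-access helper; in a real app this would query
    # your database for every published page and its last-modified date.
    return [
        {"url": "https://example.com/", "lastmod": "2025-06-15"},
        {"url": "https://example.com/about/", "lastmod": "2025-05-20"},
    ]

@app.route("/sitemap.xml")
def sitemap():
    entries = []
    for page in get_published_pages():
        entries.append(
            "  <url>"
            f"<loc>{escape(page['url'])}</loc>"
            f"<lastmod>{page['lastmod']}</lastmod>"
            "</url>"
        )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )
    return Response(xml, mimetype="application/xml")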
Regardless of the generation method, always validate the output. A malformed XML sitemap (missing closing tags, improperly escaped characters, invalid dates) will be rejected by search engines. Use an XML validator or the search console tools to verify your sitemap is error-free.
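A basic well-formedness check can also be automated before deployment. The sketch below only confirms that the file parses as XML, uses the expected root element and namespace, and contains at least one <loc> entry; it is not a full schema validation:

import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def check_sitemap(path):
    # Parsing fails loudly on missing closing tags or bad escaping.
    root = ET.parse(path).getroot()
    if root.tag not in (SITEMAP_NS + "urlset", SITEMAP_NS + "sitemapindex"):
        raise ValueError(f"Unexpected root element: {root.tag}")
    locs = [el.text for el in root.iter(SITEMAP_NS + "loc")]
    if not locs:
        raise ValueError("Sitemap contains no <loc> entries")
    return locs

print(len(check_sitemap("sitemap.xml")), "URLs found")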
CodeFrog Supports Sitemap-Based Multi-URL Testing
One of CodeFrog's powerful features is its ability to read an XML sitemap and run its full suite of quality checks against every URL in the sitemap. Instead of testing pages one at a time, you can point CodeFrog at your sitemap and get a comprehensive quality report covering your entire site — accessibility, security, performance, SEO, HTML validation, and more for every page listed in the sitemap.
This sitemap-based testing approach has several advantages:
- Complete coverage: Every page in your sitemap gets tested, not just the pages you remember to check manually.
- Consistency: All pages are tested with the same rules and standards, ensuring consistent quality across your entire site.
- Regression detection: By running sitemap-based tests regularly (or as part of your CI/CD pipeline), you can detect quality regressions across any page on your site.
- Scale: Whether your sitemap has 10 URLs or 10,000, the testing process is the same. This makes quality engineering practical even for large sites.
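The same sitemap-driven enumeration is easy to reproduce in your own scripts. The sketch below is a generic illustration, not CodeFrog's implementation, and the sitemap URL is hypothetical; it fetches a sitemap, follows index entries recursively, and returns the flat list of page URLs you would feed into any batch testing tool:

import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def fetch_xml(url):
    with urllib.request.urlopen(url) as resp:
        return ET.fromstring(resp.read())

def list_page_urls(sitemap_url):
    root = fetch_xml(sitemap_url)
    if root.tag == NS + "sitemapindex":
        # Index file: gather URLs from each child sitemap.
        urls = []
        for loc in root.iter(NS + "loc"):
            urls.extend(list_page_urls(loc.text.strip()))
        return urls
    # Plain urlset: return every <loc> directly.
    return [loc.text.strip() for loc in root.iter(NS + "loc")]

pages = list_page_urls("https://example.com/sitemap.xml")  # hypothetical
print(f"{len(pages)} pages to test")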
When Sitemaps Matter Most
While XML sitemaps are a best practice for all websites, they are especially critical in certain situations:
- Large sites: Sites with thousands or millions of pages benefit enormously from sitemaps. Without one, search engines may not discover all pages through crawling alone, especially if internal linking is imperfect.
- New sites: New websites have few or no external backlinks, making it harder for crawlers to discover them. Submitting a sitemap to search consoles is one of the first things you should do when launching a new site.
- Sites with poor internal linking: If your site has orphan pages (pages with no internal links pointing to them) or a flat architecture with insufficient cross-linking, a sitemap ensures crawlers can still find every page.
- Frequently updated sites: News sites, blogs, and e-commerce stores that add content daily benefit from sitemaps with accurate lastmod dates, which signal to crawlers that there is fresh content to index.
- Sites with rich media: If your site includes videos, images, or news content that you want indexed in specialized search verticals (Google Images, Google News, Google Video), specialized sitemap extensions help search engines discover and categorize this content.
Common Sitemap Mistakes
When auditing sitemaps, watch for these frequent issues:
- Including non-canonical URLs: If a page has a canonical tag pointing to a different URL, only include the canonical URL in the sitemap.
- Including redirected URLs: Do not include URLs that return 301 or 302 redirects. Include only the final destination URLs.
- Including noindex pages: If a page has a <meta name="robots" content="noindex"> tag, do not include it in the sitemap. The contradictory signals confuse search engines.
- Stale URLs: Remove URLs that return 404 errors. A sitemap full of dead links signals poor site maintenance to search engines.
- Mixed protocols: Do not mix HTTP and HTTPS URLs in the same sitemap. All URLs should use HTTPS if your site supports it.
- Incorrect encoding: Remember that sitemaps are XML documents. Ampersands must be escaped as &amp;, and other special characters must be properly encoded.
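Several of these mistakes can be caught automatically. The following rough sketch (using the third-party requests library and a hypothetical local copy of the sitemap) flags sitemap URLs that redirect or return errors; checking for noindex tags or canonical mismatches would follow the same pattern with an HTML parser:

import xml.etree.ElementTree as ET
import requests

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

root = ET.parse("sitemap.xml").getroot()  # hypothetical local copy
for loc in root.iter(NS + "loc"):
    url = loc.text.strip()
    # allow_redirects=False so 301/302 responses are reported, not followed.
    resp = requests.head(url, allow_redirects=False, timeout=10)
    if resp.status_code in (301, 302, 307, 308):
        print(f"REDIRECT {resp.status_code}: {url} -> {resp.headers.get('Location')}")
    elif resp.status_code >= 400:
        print(f"ERROR {resp.status_code}: {url}")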
Resources
- Sitemaps.org Protocol — The official Sitemaps Protocol specification
- Google Sitemap Documentation — Google's guide to building, submitting, and managing XML sitemaps