XML Sitemaps Explained. What They Are and How to Submit One

David Galvin
20 March 2026
Read Time: 13 Minutes
Article Summary

XML sitemaps list the URLs you want search engines to find and index, making crawling more efficient. This guide covers structure, submission, limits, and when sitemaps actually matter.

An XML sitemap is a file that lists the URLs on your site you want search engines to find and index. It’s written in a standardized format (the sitemap protocol) that Googlebot, Bingbot, and other crawlers can parse automatically. Think of it as a roadmap you hand directly to search engines, pointing them toward every page that matters.

At Gorilla Marketing, we audit XML sitemaps as part of every technical SEO engagement. A well-structured sitemap won’t single-handedly fix your rankings. But a broken or bloated one can quietly prevent Google from discovering pages that should be driving traffic. This guide covers what XML sitemaps are, how they work, when you actually need one, and how to get yours submitted and validated.

What Is an XML Sitemap?

An XML sitemap is a structured file, typically located at yoursite.com/sitemap.xml, that tells search engines which URLs exist on your site. Each URL entry can include optional metadata like when the page was last modified. The file follows the sitemap protocol, an open standard originally developed by Google and later adopted by Bing, Yahoo, and other search engines.

The key distinction: an XML sitemap is for search engines, not humans. It’s machine-readable markup designed to be parsed by crawlers. There’s no styling, no navigation, no user experience to worry about. Its entire purpose is to make crawling your site more efficient.

A basic sitemap entry looks like this:


<url>
  <loc>https://example.com/page/</loc>
  <lastmod>2026-01-15</lastmod>
</url>

The <loc> tag is the only required element. It contains the full, absolute URL of the page. The <lastmod> tag is optional but genuinely useful when accurate. It tells crawlers when the page was last meaningfully updated, which can influence how soon they re-crawl it.

Two other tags exist in the protocol: <changefreq> and <priority>. Google has confirmed it ignores both. Bing has said the same. They were part of the original spec but never proved reliable because site owners routinely set every page to “daily” and “1.0.” If your sitemap includes them, they’re not hurting anything, but they’re not helping either.
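The entry shown earlier is a fragment. In a real file, the <url> elements sit inside a <urlset> root that declares the sitemap namespace. As a minimal sketch, assuming Python's standard library and a hypothetical helper called build_sitemap, a complete file could be generated like this:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build a <urlset> document from (url, lastmod) pairs.

    build_sitemap is a hypothetical helper for illustration.
    lastmod may be None, in which case the optional tag is omitted.
    """
    # Serialize the sitemap namespace as the default (unprefixed) namespace.
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url, lastmod in entries:
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = url
        if lastmod:
            ET.SubElement(url_el, f"{{{SITEMAP_NS}}}lastmod").text = lastmod
    # Returns UTF-8 XML text, including the XML declaration.
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


if __name__ == "__main__":
    print(build_sitemap([
        ("https://example.com/page/", "2026-01-15"),
        ("https://example.com/", None),
    ]))
```

The sketch leaves out <changefreq> and <priority> on purpose, since neither is used by Google or Bing.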

XML Sitemap vs HTML Sitemap

These get confused constantly, but they serve completely different purposes. An XML sitemap is a machine-readable file for search engine crawlers. An HTML sitemap is a webpage designed for human visitors, typically listing links to all major sections of a site.

HTML sitemaps were more important when search engines were less sophisticated at discovering content through crawling. They still have some value for user navigation on very large sites, but they’re not a substitute for an XML sitemap. If you’re choosing where to invest time, the XML version is what moves the needle for crawl efficiency and indexing.

Why Do XML Sitemaps Matter?


An XML sitemap helps search engines discover and index your content faster and more reliably. That’s the core value proposition, and it matters most in specific situations.

Crawl efficiency

Search engines allocate a crawl budget to every site. On a site with thousands of pages, Googlebot isn’t guaranteed to find everything through links alone. An XML sitemap gives the crawler a complete list of URLs worth visiting, reducing wasted crawls on low-value pages and ensuring high-priority content gets discovered.

For large sites (e-commerce catalogs, news publishers, enterprise platforms), this isn’t optional. It can be the difference between Google knowing about, say, 60% of your product pages and knowing about 95% of them.

Indexing new and updated content

When you publish a new page, it needs to be crawled before it can appear in search results. If that page is buried three clicks deep with no prominent internal links pointing to it, Google might not find it for weeks. A sitemap entry with an accurate <lastmod> date signals to crawlers that something new exists and is worth visiting.

The same applies to updated content. If you significantly revise a landing page, updating the <lastmod> value in your sitemap gives Google a nudge to re-crawl it sooner rather than later.

Catching orphan pages

Orphan pages are URLs with no internal links pointing to them. Crawlers discover content primarily by following links, so orphan pages are effectively invisible without a sitemap. Site redesigns, CMS migrations, and URL restructures create orphan pages more often than most teams realize. A comprehensive XML sitemap acts as a safety net.

That said, a sitemap shouldn’t be a permanent fix. If a page is worth indexing, it should have proper internal links. The sitemap catches the gaps while you fix the linking.

What Are the URL Tags in a Sitemap?

The sitemap protocol defines a small set of XML tags for each URL entry. Here’s what each one does and whether you should care about it.

loc (required)

The full URL of the page. Must be an absolute URL including the protocol (https://). Must match the canonical version of the URL, meaning if a page redirects from http:// to https://, the sitemap should contain the https:// version. Consistency between your sitemap URLs, canonical tags, and internal links matters. Mismatches create confusion for crawlers and waste crawl budget on redirect chains.

lastmod (recommended)

The date the page was last meaningfully modified, in W3C datetime format (typically YYYY-MM-DD). Google has said it uses <lastmod> as a crawl scheduling signal when the dates are accurate. The operative word is “accurate.” If your CMS updates <lastmod> every time any minor change occurs (a plugin update, a comment moderation), the signal becomes noise, and Google learns to ignore it for your site.

Keep <lastmod> honest. Update it when the on-page content actually changes in a meaningful way.
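One way to keep the date honest is to tie it to a fingerprint of the main content rather than to the save timestamp. The sketch below is an illustration only: next_lastmod is a hypothetical function, and it assumes you can extract the meaningful body text of a page and store the previous hash somewhere.

```python
import hashlib
from datetime import date


def next_lastmod(previous_hash, previous_lastmod, main_content, today=None):
    """Return (content_hash, lastmod), bumping lastmod only on real changes.

    Hypothetical helper: main_content is assumed to be the meaningful page
    body, with comments, widgets, and other incidental markup stripped out.
    """
    today = today or date.today()
    new_hash = hashlib.sha256(main_content.encode("utf-8")).hexdigest()
    if new_hash == previous_hash:
        # Nothing meaningful changed, so keep the existing date.
        return previous_hash, previous_lastmod
    # Content changed: record today's date in YYYY-MM-DD form.
    return new_hash, today.isoformat()
```

A plugin update or comment approval that leaves the body text untouched would then leave the date alone, while a genuine rewrite would move it forward.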

changefreq (ignored)

Was intended to tell crawlers how often a page changes (hourly, daily, weekly, etc.). Google ignores it entirely. Not worth setting or maintaining.

priority (ignored)

Was intended to indicate the relative importance of a URL within your site, on a 0.0 to 1.0 scale. Also ignored by Google. Most sitemap generators still include it, but it has zero effect on crawling or indexing behavior.

Sitemap Protocol Limits: What You Need to Know

The sitemap protocol sets two hard limits per sitemap file:

50,000 URLs maximum per file

50 MB maximum file size (uncompressed)

For most small-to-medium sites, these limits are irrelevant. A 200-page site fits comfortably in a single sitemap file. But for larger sites, you’ll need to split your URLs across multiple sitemap files and tie them together with a sitemap index.

Sitemap index files

A sitemap index is a file that lists your individual sitemap files. Instead of one massive sitemap, you create several smaller ones (grouped by content type, section, or however makes sense) and reference them all from a single index file, typically at sitemap.xml or sitemap_index.xml.

The structure looks like this:


<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/post-sitemap.xml</loc>
    <lastmod>2026-03-10</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/page-sitemap.xml</loc>
    <lastmod>2026-02-28</lastmod>
  </sitemap>
</sitemapindex>

This approach gives you organizational control. You can separate blog posts from product pages from category pages, making it easier to spot issues and monitor indexing by content type. Most CMS plugins generate sitemap indexes automatically.
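If you generate sitemaps yourself rather than through a plugin, splitting at the 50,000-URL limit and writing the index is mostly bookkeeping. The following is a rough sketch using Python's standard library. The function names and the sitemap-1.xml naming pattern are assumptions for illustration, and it checks only the URL count, not the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit from the sitemap protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_index(sitemap_urls):
    """Build a <sitemapindex> document listing the given sitemap files."""
    ET.register_namespace("", SITEMAP_NS)
    root = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for loc in sitemap_urls:
        sitemap_el = ET.SubElement(root, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(sitemap_el, f"{{{SITEMAP_NS}}}loc").text = loc
    xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


if __name__ == "__main__":
    urls = [f"https://example.com/product/{i}/" for i in range(120_000)]
    chunks = chunk_urls(urls)
    # Hypothetical file naming: sitemap-1.xml, sitemap-2.xml, ...
    names = [f"https://example.com/sitemap-{n}.xml"
             for n in range(1, len(chunks) + 1)]
    print(len(chunks), "sitemap files")
    print(build_index(names))
```

In practice you would more likely group by content type than by raw count, as described above, and only fall back to count-based splitting within a group that exceeds the limit.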

Character encoding

Sitemaps must use UTF-8 encoding. All URLs must be properly escaped, meaning special characters like ampersands (&), quotes, and angle brackets need to use their XML entity equivalents. Most sitemap generators handle this automatically, but hand-coded sitemaps sometimes get it wrong.
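For hand-built sitemaps, the escaping step can be sketched with Python's standard library. The helper name xml_escape_url is hypothetical. Note that XML libraries such as ElementTree escape text automatically, so this only matters when you assemble the XML as raw strings.

```python
from xml.sax.saxutils import escape


def xml_escape_url(url):
    """Escape a URL for use as text inside a <loc> element.

    escape() replaces &, < and > by default; the extra mapping
    adds the two quote characters.
    """
    return escape(url, {'"': "&quot;", "'": "&apos;"})


if __name__ == "__main__":
    print(xml_escape_url("https://example.com/search?q=shoes&size=9"))
```

The most common real-world case is an ampersand between query parameters, which must appear as &amp; in the file.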

When Do You Actually Need an XML Sitemap?

Not every site needs one. Google’s own documentation says sitemaps are “especially helpful” in certain scenarios and less critical in others. Here’s where the line falls.

You definitely need one if:

Your site has more than a few hundred pages. Crawl coverage gaps grow with site size.

Your site is new or has few external backlinks. A sitemap gives Google an alternative discovery path when inbound links are scarce.

Your pages aren’t well linked internally. Deep content four or five clicks from the homepage may not get crawled regularly without one.

You publish content frequently. Sitemaps signal new content to crawlers faster than waiting for natural discovery.

You’ve recently migrated or restructured. Post-migration, old URL structures break and new pages get orphaned. A sitemap helps crawlers find the new structure.

You might not need one if:

Your site has fewer than 100 pages, all well-linked internally, with a reasonable backlink profile. Google can discover everything through normal crawling. Even then, a sitemap doesn’t hurt and takes minimal effort to maintain.

What Should and Shouldn’t Be in Your Sitemap

This is where many sites go wrong. A sitemap should only contain URLs you want indexed. That sounds obvious, but in practice, CMS-generated sitemaps often include pages that have no business being there.

Include:

All indexable pages that return a 200 status code

Pages with a self-referencing canonical tag (or no canonical tag, in which case search engines generally treat the URL as its own canonical)

Pages that aren’t blocked by robots.txt or noindex directives

Exclude:

Pages with noindex meta tags

Redirected URLs (3xx status codes)

Error pages (4xx, 5xx)

Paginated pages that use rel="next" and rel="prev" (debatable, but generally the canonical version is sufficient)

Parameter-heavy URL variants (filtered product listings, session IDs, tracking parameters)

Admin, login, or staging pages that shouldn’t be publicly indexed

A clean sitemap signals to Google that you’ve curated the URLs deliberately. A bloated sitemap stuffed with thousands of non-indexable URLs wastes crawl budget and dilutes the signal. Google has to crawl each URL to determine whether it should be indexed, and if half your sitemap returns noindex or redirects, that’s crawl budget spent accomplishing nothing.
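The include and exclude rules above can be expressed as a simple filter over crawl data. This is a sketch under assumptions: the Page record and its field names are hypothetical, standing in for whatever your crawler or CMS export actually provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical crawl record for one URL."""
    url: str
    status: int
    noindex: bool = False
    blocked_by_robots: bool = False
    canonical: Optional[str] = None  # None means no canonical tag present


def belongs_in_sitemap(page):
    """Apply the include/exclude rules to a single page."""
    if page.status != 200:
        return False  # redirects (3xx) and errors (4xx, 5xx)
    if page.noindex or page.blocked_by_robots:
        return False  # conflicting signals
    if page.canonical is not None and page.canonical != page.url:
        return False  # canonicalised to a different URL
    return True


def sitemap_urls(pages):
    """Return the URLs that should be listed, in their original order."""
    return [page.url for page in pages if belongs_in_sitemap(page)]
```

Running a filter like this against a fresh crawl before each regeneration keeps parameter variants, redirects, and noindex pages out of the file.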

How to Reference Your Sitemap in robots.txt

Your robots.txt file should include a directive pointing search engines to your sitemap. This is one of the simplest ways to ensure crawlers find it. The format is straightforward:


Sitemap: https://example.com/sitemap.xml

You can place this directive anywhere in your robots.txt file; the bottom is a common convention. The URL must be absolute. The directive works independently of any User-agent or Disallow rules in the file. Crawlers that read robots.txt will pick up the sitemap location automatically. This is the standard mechanism for sitemap discovery when you haven’t submitted the sitemap directly through a search engine’s webmaster tools. Robots.txt plays a broader role in crawl management beyond sitemap discovery, including controlling which sections of your site crawlers can access.
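To confirm which sitemaps a robots.txt file actually declares, Python's standard urllib.robotparser module can read the Sitemap lines (its site_maps() method requires Python 3.8 or later). A small sketch, with sitemaps_from_robots as a hypothetical wrapper:

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    example = (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "\n"
        "Sitemap: https://example.com/sitemap.xml\n"
    )
    print(sitemaps_from_robots(example))
```

The parser collects Sitemap lines regardless of where they appear relative to the User-agent groups.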

How to Submit Your Sitemap to Google Search Console

Submitting your sitemap through Google Search Console is the most direct way to tell Google it exists. Here’s the process.

Sign in to Google Search Console and select the property for your site.

Navigate to “Sitemaps” in the left sidebar under the Indexing section.

Enter your sitemap URL in the “Add a new sitemap” field. This is typically sitemap.xml or sitemap_index.xml, appended to your site’s root URL.

Click “Submit.”

Google will fetch your sitemap, validate it, and report back on the number of URLs discovered. Over the following days and weeks, the Sitemaps report will show how many of those URLs have been indexed, giving you a clear ratio of submitted vs indexed pages.

If there’s a significant gap between submitted and indexed URLs, that’s a diagnostic signal. It could indicate quality issues, duplicate content, crawl blocks, or pages that Google simply doesn’t consider worth indexing. Investigating that gap is a core part of technical SEO auditing.

Monitoring sitemap status

After submission, check back periodically. Google Search Console shows the last time Google successfully fetched your sitemap, the number of URLs discovered, and any errors encountered. Common errors include:

HTTP errors (sitemap returns a 404 or 500)

Parsing errors (malformed XML)

URLs that can’t be fetched (blocked by robots.txt, DNS issues)

If your sitemap shows errors, fix them promptly. A sitemap that Google can’t fetch or parse does nothing for discovery, so you are effectively running without one.

How to Submit Your Sitemap to Bing Webmaster Tools

Bing has its own webmaster tools platform, and submitting your sitemap there follows a similar process.

Sign in to Bing Webmaster Tools at bing.com/webmasters.

Select your site from the dashboard.

Navigate to “Sitemaps” in the left menu.

Enter your sitemap URL and submit.

Bing’s tooling also reports on URL discovery and indexation. While Bing’s search market share is smaller than Google’s in most sectors, it feeds into several other search products (including DuckDuckGo’s organic results and, increasingly, AI-powered search tools). Submitting to Bing takes two minutes and broadens your crawl coverage at essentially zero cost.

Hreflang Sitemaps for International Sites

If your site targets multiple countries or languages, you can declare hreflang annotations directly in your XML sitemap. This is one of three implementation methods (the others being HTML link elements and HTTP headers), and for large international sites, it’s often the cleanest approach.

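In a hreflang sitemap, each URL entry includes <xhtml:link> elements specifying the alternate language/region versions (the enclosing <urlset> must also declare the xhtml namespace, xmlns:xhtml="http://www.w3.org/1999/xhtml"):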


<url>
  <loc>https://example.com/page/</loc>
  <xhtml:link rel="alternate" hreflang="en-us" href="https://example.com/page/" />
  <xhtml:link rel="alternate" hreflang="en-gb" href="https://example.co.uk/page/" />
  <xhtml:link rel="alternate" hreflang="x-default" href="https://example.com/page/" />
</url>

The advantage: you don’t need to add markup to every page template. For sites with thousands of pages across multiple language versions, managing hreflang in the sitemap keeps the HTML cleaner and makes bulk updates easier.

The catch: every page that declares an alternate must be confirmed by that alternate pointing back. Broken bidirectional references cause Google to ignore the hreflang entirely.
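Checking for missing return links is easy to automate once the annotations are extracted. The sketch below assumes you have already parsed the sitemap into a mapping from each page URL to the set of alternate URLs it declares. Both the data shape and the function name are hypothetical.

```python
def missing_return_links(alternates):
    """Find hreflang references that are not reciprocated.

    alternates maps each page URL to the set of URLs it lists as
    alternates (hypothetical structure). Returns sorted (page, target)
    pairs where target does not point back to page.
    """
    problems = []
    for page, targets in alternates.items():
        for target in targets:
            if target == page:
                continue  # self-reference needs no return link
            if page not in alternates.get(target, set()):
                problems.append((page, target))
    return sorted(problems)


if __name__ == "__main__":
    declared = {
        "https://example.com/page/": {
            "https://example.com/page/",
            "https://example.co.uk/page/",
        },
        # The UK page lists only itself, so the link above is one-way.
        "https://example.co.uk/page/": {"https://example.co.uk/page/"},
    }
    for page, target in missing_return_links(declared):
        print(f"{target} does not link back to {page}")
```

A target that has no entry at all in the mapping is also reported, which catches alternates that were left out of the sitemap entirely.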

Image and Video Sitemaps

The sitemap protocol supports extensions for images and videos, allowing you to surface media content that might not be discoverable through standard crawling.

Image sitemaps let you list images associated with a specific page. They are particularly valuable when images are loaded dynamically via JavaScript, embedded in carousels, or served from a CDN on a different domain. Declaring them in the sitemap helps Google discover them so they can be considered for Google Images results.

Video sitemaps let you provide metadata about video content: title, description, thumbnail URL, duration, and content URL. If your site hosts product demos, tutorials, or webinars, a video sitemap improves your chances of appearing in video search results and rich snippets.

Both extensions are optional but worth implementing if media content drives meaningful traffic. An e-commerce site with hundreds of product images or a SaaS company with a video library has real indexing gains to capture here.
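As an illustration of the image extension, the sketch below generates entries with <image:image> and <image:loc> elements. The image namespace URI is the one used in Google's image sitemap documentation as best we understand it, so check the current docs before relying on it. The function name and example URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Assumed namespace for Google's image sitemap extension.
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """Build a sitemap from a {page_url: [image_url, ...]} mapping."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url_el = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url_el, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # One <image:image> wrapper per image on the page.
            image_el = ET.SubElement(url_el, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image_el, f"{{{IMAGE_NS}}}loc").text = image_url
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return xml_bytes.decode("utf-8")


if __name__ == "__main__":
    print(build_image_sitemap({
        "https://example.com/product/boots/": [
            "https://cdn.example.com/boots-front.jpg",
            "https://cdn.example.com/boots-side.jpg",
        ],
    }))
```

Video entries follow the same pattern with their own namespace and a larger set of metadata fields.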

XML Sitemaps in WordPress: Yoast and Rank Math

If your site runs on WordPress, you likely don’t need to create a sitemap manually. Both Yoast SEO and Rank Math generate XML sitemaps automatically and keep them updated as you publish or modify content.

Yoast SEO

Yoast generates a sitemap index at yoursite.com/sitemap_index.xml that links to individual sitemaps for posts, pages, categories, and other content types. You can control which content types are included through the Yoast settings panel. By default, it excludes noindex content and includes only the canonical versions of URLs.

Rank Math

Rank Math follows a similar approach, generating a sitemap index at yoursite.com/sitemap_index.xml with sub-sitemaps by content type. It also offers granular control over which taxonomies, post types, and individual URLs appear in the sitemap.

WordPress core sitemaps

Since WordPress 5.5, the CMS includes a built-in sitemap feature at yoursite.com/wp-sitemap.xml. It’s functional but basic. If you’re running Yoast or Rank Math, their sitemaps supersede the core one and offer more control. Most SEO plugins disable the core sitemap automatically to avoid duplication.

Whichever plugin you use, verify the output. Install, enable, and then actually open the sitemap URL in a browser to confirm it contains the pages you expect and excludes the ones you don’t. Plugin defaults aren’t always right for your specific site structure.

Sitemap Validation: How to Check Yours Is Working

A sitemap that exists but contains errors is a liability. Validate yours regularly.

Manual checks: Open your sitemap URL in a browser. It should render as valid XML. Look for broken URLs, HTTP URLs on an HTTPS site, error-returning URLs, and pages that shouldn’t be included.

Google Search Console: The “Pages” report shows which submitted URLs are indexed, excluded, or errored, giving you a direct feedback loop between your sitemap and Google’s index.

Third-party tools: Screaming Frog can crawl your sitemap and cross-reference it against your actual site structure, revealing mismatches: sitemap URLs returning 404s, live pages missing from the sitemap, and canonical conflicts. Running this audit quarterly catches drift before it compounds.
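Some of the manual checks can be scripted. The following is a minimal sketch, not a full validator: audit_sitemap is a hypothetical function that works on sitemap text you have already downloaded, and it only covers malformed XML, the protocol limits, non-HTTPS URLs, and duplicates. It does not fetch each URL to confirm a 200 response.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, uncompressed


def audit_sitemap(xml_text):
    """Return a list of human-readable issues found in a sitemap file."""
    issues = []
    if len(xml_text.encode("utf-8")) > MAX_BYTES:
        issues.append("file exceeds 50 MB uncompressed")
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        issues.append(f"malformed XML: {exc}")
        return issues
    locs = [(el.text or "").strip() for el in root.findall("sm:url/sm:loc", NS)]
    if len(locs) > MAX_URLS:
        issues.append(f"{len(locs)} URLs exceeds the 50,000 limit")
    seen = set()
    for loc in locs:
        if not loc.startswith("https://"):
            issues.append(f"not an absolute https URL: {loc}")
        if loc in seen:
            issues.append(f"duplicate URL: {loc}")
        seen.add(loc)
    return issues
```

The HTTPS check assumes the site is served over HTTPS. An empty list means none of these particular problems were found, not that the sitemap is fully healthy.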

Common XML Sitemap Mistakes

Even well-maintained sites make these errors. Most are easy to fix once identified.

Including noindex URLs. If a page has a noindex directive, it shouldn’t be in the sitemap. Mixed signals waste crawl budget and confuse Search Console reporting.

Inaccurate lastmod dates. Setting <lastmod> to the current date on every generation cycle degrades its usefulness. Google learns to distrust inaccurate dates and may stop using them to schedule crawls of your site.

Mismatched URLs. If your sitemap lists https://www.example.com/page/ but your canonical tags point to https://example.com/page/, that’s a conflict. Be consistent across sitemaps, canonicals, and internal links.

Exceeding protocol limits. A single file with more than 50,000 URLs or 50 MB gets rejected. Use a sitemap index to split the load.

Including redirected URLs. Every URL in your sitemap should return a 200. Redirects waste crawl budget. List the destination URL directly.

Forgetting to update after site changes. URL restructures and CMS migrations frequently break sitemaps. If your sitemap references old URL patterns, crawlers are chasing dead links.

How Sitemaps Fit Into Your Broader Technical SEO

An XML sitemap doesn’t operate in isolation. It’s one component of how search engines discover, crawl, and understand your site. It works alongside your robots.txt configuration, canonical tags, internal linking architecture, and crawl budget management to create a coherent system.

Canonical tags tell search engines which version of a page is the definitive one. Your sitemap should only contain those canonical URLs. Internal linking is still the primary mechanism for content discovery. A sitemap supplements it but shouldn’t replace it. If your only path to a high-value page is through the sitemap, that’s a structural problem worth fixing at the linking level.

The goal is alignment. Every signal you send to search engines should tell the same story about which pages matter and how your site is organized. When sitemaps, canonicals, robots.txt, and internal links conflict, crawlers have to guess. Guessing rarely works in your favor.

Getting Your Sitemap Right

A well-configured XML sitemap is one of the most straightforward technical SEO wins available. It takes minimal effort to set up, costs nothing to maintain if your CMS handles generation, and directly improves how efficiently search engines discover and index your content.

But “set it and forget it” doesn’t work here. Sites change. Pages get added, removed, and redirected. URL structures evolve. A sitemap that was perfect six months ago might be full of stale URLs and broken references now. Build a quarterly sitemap audit into your technical SEO workflow and you’ll catch problems before they compound.

If your sitemap needs attention, or if you’re not sure whether it’s pulling its weight, get in touch. Our team audits sitemap health as part of every technical SEO engagement, connecting the dots between crawl efficiency, indexing coverage, and the pages that actually drive revenue.

David Galvin
David has been in search marketing for over 8 years, specialising in technical SEO. He focuses on the technical foundations that impact visibility, including site structure, performance, and tracking. With a solid technical grounding and hands-on experience across Linux, PHP, JavaScript, and CSS, he works to identify and resolve the issues that genuinely hold websites back. If he’s not in front of a laptop, you’ll usually find him hiking up a mountain or visiting his son in Dublin.
