Summary
Currently, the RobotsTxtSettings.DisallowLanguages setting allows administrators to exclude specific languages from being crawled by search engine robots. However, this exclusion is only applied in robots.txt, not in sitemap.xml.
As a result, languages disallowed in robots.txt can still appear in sitemap entries (including <xhtml:link rel="alternate" hreflang="...">), which may send conflicting signals to crawlers.
💡 Real-world use case
Imagine an e-commerce store that:
- Has a language version /de for testing or development purposes
- Blocks /de in robots.txt to prevent indexing
- Still sees /de alternate URLs in sitemap.xml
This can cause:
- Google Search Console warnings
- Indexing of alternate versions that should not be public
- Duplicate content penalties across language versions
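For illustration (hypothetical example.com URLs), the crawler receives two conflicting signals for the same language version:

```
# robots.txt: /de is disallowed
User-agent: *
Disallow: /de/
```

```xml
<!-- sitemap.xml: the /de alternate is still advertised -->
<url>
  <loc>https://www.example.com/product</loc>
  <xhtml:link rel="alternate" hreflang="de" href="https://www.example.com/de/product"/>
</url>
```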
✅ Recommendation
Introduce a new setting:
SitemapXmlSettings.DisallowLanguages
This would allow fine-grained control over which language versions are included in the sitemap, independently of what is disallowed in robots.txt (a rough sketch follows below).
We should not reuse RobotsTxtSettings.DisallowLanguages, because:
- It may be desirable to exclude a language from the sitemap but still allow crawlers to access it (e.g., noindex SEO experiments)
- The two systems have separate behavior and timing
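As a minimal sketch, assuming the new setting stores language IDs the same way RobotsTxtSettings.DisallowLanguages does (property name, type, and namespace are assumptions, not the final design):

```csharp
using System.Collections.Generic;
using Nop.Core.Configuration; // assumed location of ISettings

public partial class SitemapXmlSettings : ISettings
{
    // ...existing sitemap settings...

    /// <summary>
    /// Gets or sets a list of language identifiers whose pages and alternate
    /// hreflang links should be excluded from sitemap.xml (sketch only)
    /// </summary>
    public List<int> DisallowLanguages { get; set; } = new();
}
```

With an empty default, existing stores would see no change in sitemap output until an administrator adds languages to the list.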
📦 Affected areas
- SitemapModelFactory
- SitemapXmlSettings
💪 Next step
I'd be happy to prepare a PR that introduces:
- A new setting DisallowLanguages in SitemapXmlSettings
- Updates to SitemapModelFactory to respect this list when generating entries and alternate <xhtml:link> references (see the sketch below)
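A rough sketch of the factory change, assuming injected _languageService and _sitemapXmlSettings fields as in the existing pattern (names and surrounding code are illustrative, not the actual implementation):

```csharp
// Illustrative fragment inside SitemapModelFactory: filter languages before
// building <url> entries and their alternate <xhtml:link> references.
var allLanguages = await _languageService.GetAllLanguagesAsync();

var sitemapLanguages = allLanguages
    .Where(language => !_sitemapXmlSettings.DisallowLanguages.Contains(language.Id))
    .ToList();

foreach (var language in sitemapLanguages)
{
    // ...existing per-language entry and hreflang generation...
}
```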