@asafashirov
Contributor

Fixes https://github.com/pulumi/marketing/issues/1240

Adds canonical link tags to all 1,621 static HTML files in the API reference documentation to resolve Google Search Console duplicate content warnings.

The issue affected 34 URLs that were being indexed with query parameters (?iaid, ?__hstc, etc.) and both with and without index.html, so Google flagged them as duplicates with no user-selected canonical URL.

Created scripts/add-canonical-tags.js that automatically injects canonical tags during the documentation generation process. The script properly handles index.html files, converts relative canonical URLs to absolute, and is integrated into the Makefile generate target to run after TypeDoc/JavaDoc generation.
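For readers who want a concrete picture of the path handling described above, here is a minimal sketch. It is not the contents of scripts/add-canonical-tags.js: the function name `canonicalUrlFor` and the constant names are hypothetical, the URL prefix is assumed from the paths listed under Testing, and the base URL is the one quoted in the review of this PR.

```javascript
// Illustrative sketch only; names are hypothetical, not taken from the real script.
const BASE_URL = "https://www.pulumi.com"; // base URL quoted in the review of this PR
const URL_PREFIX = "/docs/reference/pkg/"; // assumed from the paths listed under Testing

// Map a file path (relative to the reference docs root) to an absolute canonical URL.
function canonicalUrlFor(relativePath) {
  // Normalize Windows separators so the result is always a URL path.
  let urlPath = relativePath.replace(/\\/g, "/");
  // Drop a leading "./" or "/" so the prefix is not doubled.
  urlPath = urlPath.replace(/^\.?\/+/, "");
  // index.html maps to its directory URL, keeping the trailing slash.
  urlPath = urlPath.replace(/(^|\/)index\.html$/, "$1");
  return BASE_URL + URL_PREFIX + urlPath;
}

console.log(canonicalUrlFor("nodejs/pulumi/policy/index.html"));
```

Pages that are not named index.html keep their file name in this sketch; whether the real script treats them the same way is not stated in the description.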

All generated HTML files now include proper canonical tags pointing to clean, absolute URLs.

Fixes Google Search Console duplicate content warnings for API reference pages by adding canonical link tags to all static HTML files in the reference documentation.

This addresses 34 URLs flagged by Google Search Console with "Duplicate without user-selected canonical" errors. These pages were being indexed with various query parameters (iaid, __hstc, __hssc, __hsfp) and both with and without index.html, so Google saw several URLs for the same page and no canonical URL telling it which one to prefer.
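To illustrate why these variants count as duplicates, the sketch below collapses several URL forms of one page into a single clean URL. It only demonstrates the idea and does not describe how Google or the PR's script works; the function name `canonicalize` and the query parameter values are placeholders.

```javascript
// Illustrative sketch; parameter values are placeholders, not real tracking data.
function canonicalize(rawUrl) {
  const url = new URL(rawUrl);
  url.search = ""; // drop query parameters such as iaid, __hstc, __hssc, __hsfp
  url.hash = "";   // drop any fragment
  // Treat /path/index.html and /path/ as the same page.
  url.pathname = url.pathname.replace(/\/index\.html$/, "/");
  return url.href;
}

const variants = [
  "https://www.pulumi.com/docs/reference/pkg/python/pulumi/",
  "https://www.pulumi.com/docs/reference/pkg/python/pulumi/index.html",
  "https://www.pulumi.com/docs/reference/pkg/python/pulumi/?iaid=abc",
  "https://www.pulumi.com/docs/reference/pkg/python/pulumi/index.html?__hstc=1&__hssc=2&__hsfp=3",
];

// All four variants reduce to one URL, which is what the canonical tag declares.
const unique = new Set(variants.map(canonicalize));
console.log([...unique]);
```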

Changes:
- Created scripts/add-canonical-tags.js to inject canonical tags into static HTML files
- Integrated the script into the Makefile generate target
- Added canonical tags to 1,621 HTML files in static-prebuilt/docs/reference/pkg/
- Canonical URLs properly strip index.html and always point to clean directory URLs
- Applied to all language references: nodejs, python, dotnet, and java

The script runs automatically during the documentation generation process, ensuring all future builds include proper canonical tags.

Impact:
- Eliminates SEO penalties for duplicate content
- Consolidates page authority to canonical URLs
- Improves search engine rankings by clarifying preferred URLs
- Query parameter variations will correctly reference the canonical URL

Testing:
Verified canonical tags are correctly added:
- /docs/reference/pkg/nodejs/pulumi/policy/index.html → /docs/reference/pkg/nodejs/pulumi/policy/
- /docs/reference/pkg/python/pulumi/index.html → /docs/reference/pkg/python/pulumi/
- All 1,621 generated HTML files now include proper canonical tags
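A check along these lines could be automated. The following is a rough sketch, assuming a regex-based helper (hypothetically named `extractCanonical`) is good enough for generated HTML; it is not a full HTML parser and is not part of this PR.

```javascript
// Illustrative sketch; extractCanonical is a hypothetical helper, not part of the PR.
// Returns the href of the first <link rel="canonical"> tag, or null if none is found.
function extractCanonical(html) {
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    if (/\brel\s*=\s*["']?canonical["']?/i.test(tag)) {
      const href = tag.match(/\bhref\s*=\s*["']([^"']*)["']/i);
      return href ? href[1] : null;
    }
  }
  return null;
}

const sample =
  '<html><head><link rel="stylesheet" href="main.css">' +
  '<link rel="canonical" href="https://www.pulumi.com/docs/reference/pkg/python/pulumi/">' +
  "</head><body></body></html>";

console.log(extractCanonical(sample));
```

Comparing the returned value with the expected clean URL for each generated file would confirm the mappings listed above.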
@asafashirov
Contributor Author

will not do

@claude
Contributor

claude bot commented Oct 27, 2025

Review Summary

This PR successfully addresses the Google Search Console duplicate content warnings by adding canonical link tags to 1,621 static HTML files in the API reference documentation. The implementation is clean, well-structured, and follows best practices.

✅ Approved Items

Script Implementation (scripts/add-canonical-tags.js)

  • Correctly generates absolute canonical URLs with proper base URL (https://www.pulumi.com)
  • Properly handles index.html files by stripping them from canonical paths (e.g., /path/index.html → /path/)
  • Includes appropriate safeguards: checks for existing canonical tags, handles missing head closing tags gracefully
  • Good error handling with try-catch blocks
  • File ends with newline as required ✓
  • Uses existing glob dependency from package.json ✓
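The safeguards listed above could look roughly like the sketch below. This is an assumption-laden illustration, not the code under review: the function name `injectCanonical` and the status strings are hypothetical, and for simplicity it skips any file that already has a canonical tag rather than checking whether the existing tag is correct.

```javascript
// Illustrative sketch; not the actual implementation in scripts/add-canonical-tags.js.
// Assumes canonicalUrl contains no characters that need HTML attribute escaping.
function injectCanonical(html, canonicalUrl) {
  // Safeguard 1: leave files alone if they already declare a canonical URL.
  if (/<link\b[^>]*\brel\s*=\s*["']?canonical\b/i.test(html)) {
    return { html, status: "skipped-existing" };
  }
  // Safeguard 2: without a closing head tag there is no safe insertion point.
  const headClose = html.search(/<\/head\s*>/i);
  if (headClose === -1) {
    return { html, status: "skipped-no-head" };
  }
  // Insert the tag immediately before </head>.
  const tag = `<link rel="canonical" href="${canonicalUrl}">\n`;
  return {
    html: html.slice(0, headClose) + tag + html.slice(headClose),
    status: "added",
  };
}

const page = "<html><head><title>t</title></head><body></body></html>";
const first = injectCanonical(page, "https://www.pulumi.com/docs/reference/pkg/python/pulumi/");
const second = injectCanonical(first.html, "https://www.pulumi.com/docs/reference/pkg/python/pulumi/");
console.log(first.status, second.status);
```

Because the second call finds the tag added by the first, running the function repeatedly leaves the file unchanged, which is the property that makes re-running it on every build safe.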

Makefile Integration (line 52)

  • Correctly integrated into the generate target after documentation generation
  • Runs at the appropriate time (after TypeDoc/JavaDoc generation)
  • Includes helpful echo message for visibility

HTML File Changes

SEO Benefits

  • Resolves duplicate content issues caused by query parameters (?iaid, ?__hstc, etc.)
  • Eliminates index.html variants being indexed separately
  • Provides clear signals to search engines about the canonical version of each page

📝 Minor Observations (No Action Required)

  1. Script placement in Makefile: The script runs in the generate target, meaning it will re-process all HTML files every time make generate is run. The script does handle this efficiently by checking for existing canonical tags and skipping files with correct tags already present.

  2. Console output: The script provides detailed logging which is helpful for debugging but may be verbose during regular builds. This is acceptable for a maintenance script.

  3. Path handling: Uses forward slashes which works correctly on Linux/macOS. The replace on line 14 ensures Windows paths are normalized, though this repository likely runs on Unix-like systems.

✅ Repository Guidelines Compliance

  • ✓ File ends with newline (verified)
  • ✓ No code comments unless needed (clean, self-documenting code)
  • ✓ Uses existing dependencies (glob from package.json)
  • ✓ Integrated into Makefile as specified in AGENTS.md
  • ✓ Proper SEO implementation addressing canonical URLs

🎯 Conclusion

This PR effectively solves the duplicate content issue and is ready to merge. The solution is maintainable, follows repository conventions, and will automatically keep canonical tags updated when documentation is regenerated.



@pulumi-bot
Collaborator
