@galaxy101quest commented Sep 28, 2025

Refined RSS ingestion to better fit website articles and improved retrieval quality by splitting long posts into meaningful sections. The result is more precise answers.

  • rss2schema.py (parse_rss_2_0): Updated the RSS→Schema.org mapping so feed items are treated as website articles with correct fields and stable URLs, rather than podcast-specific objects.

  • db_load.py (process_rss_feed): Implemented heading-aware chunking (leveraging H1–H5 and block elements) to split long articles into anchored segments for higher-precision indexing and retrieval.

  • requirements.txt (import): Added BeautifulSoup, which db_load.py now requires.
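
To illustrate the heading-aware chunking described above, here is a minimal sketch. It is not the code in db_load.py: it swaps BeautifulSoup for the standard-library `html.parser` so it runs without extra dependencies, and the names `HeadingChunker` and `chunk_html`, the set of block tags, and the use of a heading's `id` attribute as the anchor are all assumptions made for the example.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5"}      # heading levels named in the PR
BLOCKS = {"p", "li", "blockquote", "pre"}      # assumed set of block elements


class HeadingChunker(HTMLParser):
    """Hypothetical helper: start a new chunk at every H1-H5 heading."""

    def __init__(self):
        super().__init__()
        self.chunks = []          # finished chunks
        self._heading = ""        # heading of the chunk being built
        self._anchor = ""         # id attribute of that heading, if any
        self._parts = []          # body text collected so far
        self._in_heading = False
        self._heading_buf = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.flush_chunk()    # close the previous section first
            self._in_heading = True
            self._heading_buf = []
            self._anchor = dict(attrs).get("id") or ""

    def handle_endtag(self, tag):
        if tag in HEADINGS and self._in_heading:
            self._in_heading = False
            self._heading = " ".join("".join(self._heading_buf).split())
        elif tag in BLOCKS:
            self._parts.append("\n")   # keep block boundaries as line breaks

    def handle_data(self, data):
        (self._heading_buf if self._in_heading else self._parts).append(data)

    def flush_chunk(self):
        lines = [" ".join(l.split()) for l in "".join(self._parts).split("\n")]
        text = "\n".join(l for l in lines if l)
        if text:                  # sections without body text are skipped
            self.chunks.append(
                {"heading": self._heading, "anchor": self._anchor, "text": text}
            )
        self._parts = []


def chunk_html(html, base_url):
    """Split article HTML into sections, each with a URL pointing at its heading."""
    parser = HeadingChunker()
    parser.feed(html)
    parser.close()
    parser.flush_chunk()          # emit the final section
    return [
        {
            "heading": c["heading"],
            "url": f"{base_url}#{c['anchor']}" if c["anchor"] else base_url,
            "text": c["text"],
        }
        for c in parser.chunks
    ]
```

In this sketch, sections whose heading has no `id` fall back to the plain article URL, so several chunks of one article can share the same URL.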

Contributions by Misha - [email protected]

Misha Ristich and others added 3 commits September 28, 2025 10:46
@chelseacarter29
Collaborator

Hi Misha! @galaxy101quest

Thanks for the PR! This is a good start - we may want to make a few adjustments and are going to do some additional testing. A couple of things I've seen so far:

  • Entire podcast RSS feeds are now encoded as a single document instead of one document per episode.
  • Something may be off in the chunking. I encoded some articles to test it (using https://platformer.news/feed to see what article results might look like), and retrieval seems to return the same article many times, each result jumping to a different section (sometimes the comments). I need to look into it a bit more.

Guha and I also discussed adding the ability to specify the 'type' when doing a data load, so you have options.
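
As a rough sketch of what such a 'type' option could look like, the example below maps each RSS 2.0 `<item>` to its own Schema.org object, choosing `Article` or `PodcastEpisode` from a caller-supplied argument. The function name `rss_items_to_schema`, the `content_type` parameter, and the selection of output fields are hypothetical and not taken from rss2schema.py.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from a loader option to a Schema.org type.
SCHEMA_TYPES = {"article": "Article", "podcast": "PodcastEpisode"}


def rss_items_to_schema(rss_xml, content_type="article"):
    """Return one Schema.org-style dict per <item> in an RSS 2.0 feed."""
    if content_type not in SCHEMA_TYPES:
        raise ValueError(f"unknown content_type: {content_type!r}")

    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):          # one document per feed item
        doc = {
            "@context": "https://schema.org",
            "@type": SCHEMA_TYPES[content_type],
            "url": (item.findtext("link") or "").strip(),
            "name": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
        }
        # Podcast items usually carry the audio file in <enclosure>.
        enclosure = item.find("enclosure")
        if content_type == "podcast" and enclosure is not None:
            doc["associatedMedia"] = {
                "@type": "MediaObject",
                "contentUrl": enclosure.get("url", ""),
            }
        docs.append(doc)
    return docs
```

Called with `content_type="podcast"`, a feed with N episodes yields N documents, which is the per-episode behavior described in the first bullet above.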

@galaxy101quest
Author

Hi Chelsea :) @chelseacarter29

Thanks for the feedback. It's not supposed to do that, so something is off.
I have a few versions on my end. I'll check them for improvements and send them over.
I'll test both points with the RSS feed from the website you shared, so the results are easier to compare.

Yes, I was also thinking about having more options when loading data - different types might make things easier.
