Open
Labels: bug, enhancement, ingestion, wontfix, working
Description
Summary
Create a CLI utility (or extend ingestion/ingest.py) that allows a user to specify a PDF (via URL or local path), select a chunking strategy, and upsert the resulting chunks (with metadata) into the configured Pinecone index.
Requirements/Details
- CLI should accept:
  - Path to a local PDF file OR a direct URL to a PDF
  - Choice of chunking strategy (e.g., RecursiveCharacterTextSplitter, fixed length, by page); provide sensible defaults and allow easy extension
  - Pinecone API key and index name (or read them from config/secrets)
  - Optional metadata (title, authors, category, etc.), either extracted from the PDF or supplied by the user
- Load the PDF (download if URL), extract the text, and chunk it per the chosen strategy
- For each chunk, upsert to Pinecone:
  - Use OpenAI or compatible embeddings model
  - Attach all relevant metadata (document source, chunk index, chunk text, etc.)
- Print/log summary of number of chunks, upsert status, and any errors
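The inputs listed above could be wired up with `argparse` roughly as follows. This is a minimal sketch, not the existing `ingest.py` interface: the flag names `--chunk-size` and `--chunk-overlap`, the strategy names `recursive`/`fixed`/`page`, the default sizes, and the `PINECONE_INDEX_NAME` environment variable are all assumptions made for illustration.

```python
import argparse
import os


def parse_metadata(pairs):
    """Turn ["title=Foo", "category=ml"] into a dict; raise ValueError on bad items."""
    metadata = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        metadata[key.strip()] = value.strip()
    return metadata


def build_parser():
    parser = argparse.ArgumentParser(
        description="Chunk a PDF and upsert the chunks into a Pinecone index."
    )
    # Exactly one PDF source must be given: a local path or a URL.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--pdf-path", help="path to a local PDF file")
    source.add_argument("--pdf-url", help="direct URL to a PDF")
    # Strategy names and size defaults are assumptions, not project settings.
    parser.add_argument("--chunking", choices=["recursive", "fixed", "page"],
                        default="recursive")
    parser.add_argument("--chunk-size", type=int, default=1000)
    parser.add_argument("--chunk-overlap", type=int, default=200)
    # Fall back to environment variables so secrets stay out of shell history.
    parser.add_argument("--pinecone-api-key",
                        default=os.environ.get("PINECONE_API_KEY"))
    parser.add_argument("--index-name",
                        default=os.environ.get("PINECONE_INDEX_NAME"))
    # Repeatable: --metadata title="My Paper" --metadata category=ml
    parser.add_argument("--metadata", action="append", default=[],
                        metavar="KEY=VALUE")
    return parser


def parse_cli(argv=None):
    parser = build_parser()
    args = parser.parse_args(argv)
    try:
        args.metadata = parse_metadata(args.metadata)
    except ValueError as exc:
        parser.error(str(exc))
    if args.chunk_overlap >= args.chunk_size:
        parser.error("--chunk-overlap must be smaller than --chunk-size")
    return args
```

Calling `parse_cli(["--pdf-url", "https://example.com/paper.pdf", "--chunking", "page"])` returns a namespace with `pdf_path` set to `None`; passing both `--pdf-path` and `--pdf-url` makes `argparse` exit with a usage error.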
Implementation Ideas
- In ingestion/ingest.py, add a CLI parser that accepts --pdf-path, --pdf-url, --chunking, --pinecone-api-key, --index-name, and --metadata args
- Use PyPDFLoader (LangChain) or PyPDF, and optionally add an option to try both for robustness
- For chunking, use langchain.text_splitter.RecursiveCharacterTextSplitter (already used in current code) and expose its parameters (chunk size, overlap) as CLI args
- Add more strategies: by page, by heading, etc. Use a strategy pattern or factory for easy extension
- For upsert, use Pinecone client directly or via LangChain's Pinecone wrapper (see existing ingest.py logic)
- If user provides a URL, download to temp file and clean up after
- Optionally auto-extract metadata from PDF (title, authors) using PyPDF or pdfminer
- Log to console and optionally a log file; summary should include number of chunks, failures, timing, etc
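One way the strategy/factory idea above might look is a small registry that maps a strategy name to a chunking function. The sketch below uses only the standard library and plain string slicing as a stand-in for RecursiveCharacterTextSplitter. The names `STRATEGIES`, `register`, and `build_records`, the `source#chunk-N` ID scheme, and the record layout are hypothetical. Embedding vectors are deliberately left out and would be attached before the actual upsert.

```python
from typing import Callable, Dict, List, Optional

# A strategy takes the per-page text of a PDF plus size/overlap and returns chunks.
ChunkFn = Callable[[List[str], int, int], List[str]]
STRATEGIES: Dict[str, ChunkFn] = {}


def register(name: str):
    """Decorator that adds a chunking function to the registry under `name`."""
    def decorator(fn: ChunkFn) -> ChunkFn:
        STRATEGIES[name] = fn
        return fn
    return decorator


@register("fixed")
def fixed_length(pages: List[str], chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Sliding window over the joined text; consecutive chunks share `overlap` chars."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    text = "\n".join(pages)
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


@register("page")
def by_page(pages: List[str], chunk_size: int = 0, overlap: int = 0) -> List[str]:
    """One chunk per non-empty page; size and overlap are ignored."""
    return [page.strip() for page in pages if page.strip()]


def build_records(pages: List[str], source: str, strategy: str = "fixed",
                  chunk_size: int = 1000, overlap: int = 200,
                  extra: Optional[dict] = None) -> List[dict]:
    """Chunk the pages and attach source, chunk index, and chunk text as metadata."""
    if strategy not in STRATEGIES:
        raise ValueError(f"unknown strategy {strategy!r}; choose from {sorted(STRATEGIES)}")
    chunks = STRATEGIES[strategy](pages, chunk_size, overlap)
    return [
        {
            "id": f"{source}#chunk-{i}",  # hypothetical ID scheme
            "metadata": {**(extra or {}), "source": source,
                         "chunk_index": i, "text": chunk},
        }
        for i, chunk in enumerate(chunks)
    ]
```

Adding a new strategy (for example, by heading) then only requires another `@register("heading")` function, and the CLI's `--chunking` choices can be taken from `sorted(STRATEGIES)`. The summary line for the log can be derived from `len(records)`.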
Improvements/Future Extensions
- Add option to process multiple PDFs in one go (input file with URLs/paths)
- Optionally support other vector DBs (Chroma, LanceDB, etc.)
- Add chunk deduplication (hashing), or skip chunks already in DB
- Provide Python API for programmatic use, not just CLI
- Add unit tests for chunking and upsert
- Allow user to specify custom metadata or override extracted metadata
- Support asynchronous or parallel chunking/upsert for large docs
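For the deduplication item above, one possible approach is to derive each chunk's ID from a hash of its content, so re-ingesting the same text yields the same ID. The sketch below assumes SHA-256 over the source name plus whitespace-normalized text, truncated to 32 hex characters. The helper names and the batch size of 100 are illustrative, and how the set of already-stored IDs is obtained from the vector DB is left out.

```python
import hashlib
from typing import Iterable, Iterator, List, Sequence, Tuple


def chunk_id(source: str, text: str) -> str:
    """Deterministic ID: same source and same text (ignoring whitespace runs) give the same ID."""
    normalized = " ".join(text.split())
    digest = hashlib.sha256(f"{source}\n{normalized}".encode("utf-8")).hexdigest()
    return digest[:32]


def dedupe(source: str, chunks: Iterable[str],
           already_stored: Iterable[str] = ()) -> List[Tuple[str, str]]:
    """Return (id, text) pairs, skipping repeats and IDs already in the DB."""
    seen = set(already_stored)
    unique = []
    for text in chunks:
        cid = chunk_id(source, text)
        if cid in seen:
            continue
        seen.add(cid)
        unique.append((cid, text))
    return unique


def batched(items: Sequence, size: int = 100) -> Iterator[Sequence]:
    """Yield fixed-size slices so each upsert request stays small."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Because the IDs are deterministic, a repeated upsert of an unchanged chunk would overwrite the same record rather than create a duplicate, even when the `already_stored` check is skipped.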
References
- See ingestion/ingest.py for current PDF/chunking/upsert logic
- LangChain PyPDFLoader: https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/pypdf
- Pinecone docs: https://docs.pinecone.io/
- LangChain text splitters: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitter/
Status: Todo