[Feature] Chunk and Upsert Arbitrary PDF to Pinecone via CLI #34

## Summary
Create a CLI utility (or extend `ingestion/ingest.py`) that allows a user to specify a PDF (via URL or local path), select a chunking strategy, and upsert the resulting chunks (with metadata) into the configured Pinecone index.

## Requirements/Details
- CLI should accept:
  - Path to local PDF file OR a direct URL to a PDF
  - Choice of chunking strategy (e.g., `RecursiveCharacterTextSplitter`, fixed length, by page); provide a sensible default and make it easy to add new strategies
  - Pinecone API key and index name (or read from config/secrets)
  - Optional metadata (title, authors, category, etc.), either extracted from the PDF or supplied by the user
- Load the PDF (download if URL), extract the text, and chunk it per the chosen strategy
- For each chunk, upsert to Pinecone:
  - Embed the chunk with an OpenAI or compatible embedding model
  - Attach all relevant metadata (document source, chunk index, chunk text, etc.)
- Print or log a summary: number of chunks, upsert status, and any errors
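
The per-chunk upsert requirement can be sketched without committing to a particular client. This is a minimal sketch, not the existing `ingest.py` logic: `build_records`, `fake_embed`, and the `source#chunk-N` ID scheme are hypothetical names chosen for illustration, and the embedding function is injected so that any OpenAI-compatible model could be plugged in. The record shape (`id`, `values`, `metadata`) follows the dictionary form that Pinecone's upsert accepts.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical type: any callable that maps a batch of texts to one vector each.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def build_records(
    chunks: Sequence[str],
    source: str,
    embed: Embedder,
    extra_metadata: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Turn text chunks into upsert-ready records with metadata attached."""
    vectors = embed(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embedder must return exactly one vector per chunk")

    records = []
    for index, (text, vector) in enumerate(zip(chunks, vectors)):
        # User-supplied metadata first, so the required keys always win.
        metadata = {
            **(extra_metadata or {}),
            "source": source,
            "chunk_index": index,
            "text": text,
        }
        records.append(
            {"id": f"{source}#chunk-{index}", "values": vector, "metadata": metadata}
        )
    return records


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Stand-in embedder for local testing; replace with a real model call."""
    return [[float(len(t)), 0.0] for t in texts]
```

Injecting the embedder keeps the record-building step testable offline; the resulting list could then be passed to the index's upsert call in batches.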

## Implementation Ideas
- In `ingestion/ingest.py`, add a CLI parser that accepts `--pdf-path`, `--pdf-url`, `--chunking`, `--pinecone-api-key`, `--index-name`, and `--metadata` args
- Use PyPDFLoader (LangChain) or PyPDF; optionally add a fallback that tries the second loader if the first one fails
- For chunking, use `langchain.text_splitter.RecursiveCharacterTextSplitter` (already used in current code) and expose its parameters (chunk size, overlap) as CLI args
- Add more strategies: by page, by heading, etc. Use a strategy pattern or factory for easy extension
- For upsert, use Pinecone client directly or via LangChain's Pinecone wrapper (see existing `ingest.py` logic)
- If the user provides a URL, download the PDF to a temporary file and delete it afterwards
- Optionally auto-extract metadata from the PDF (title, authors) using PyPDF or pdfminer
- Log to the console and optionally to a log file; the summary should include the number of chunks, failures, and timing
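
One way the CLI parser and the strategy factory could fit together is sketched below. The flags `--pdf-path`, `--pdf-url`, `--chunking`, `--pinecone-api-key`, `--index-name`, and `--metadata` come from the list above; everything else is an assumption made for illustration: the `--chunk-size` and `--chunk-overlap` flag names, their defaults, the `PINECONE_API_KEY` environment fallback, the `KEY=VALUE` metadata format, and the `fixed` and `page` strategies, which stand in for `RecursiveCharacterTextSplitter` so the sketch runs on the standard library alone.

```python
import argparse
import os
from typing import Callable, Dict, List, Optional

# A chunker takes per-page text plus size/overlap and returns chunks.
Chunker = Callable[[List[str], int, int], List[str]]
CHUNKERS: Dict[str, Chunker] = {}


def register(name: str) -> Callable[[Chunker], Chunker]:
    """Decorator that adds a chunking strategy to the registry."""
    def decorator(fn: Chunker) -> Chunker:
        CHUNKERS[name] = fn
        return fn
    return decorator


@register("fixed")
def chunk_fixed(pages: List[str], chunk_size: int, overlap: int) -> List[str]:
    """Fixed-length character windows over the whole document."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    text = "\n".join(pages)
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


@register("page")
def chunk_by_page(pages: List[str], chunk_size: int, overlap: int) -> List[str]:
    """One chunk per non-empty page; size and overlap are ignored."""
    return [page for page in pages if page.strip()]


def parse_metadata(pairs: Optional[List[str]]) -> Dict[str, str]:
    """Parse repeated KEY=VALUE arguments into a dictionary."""
    metadata = {}
    for pair in pairs or []:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        metadata[key] = value
    return metadata


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Chunk a PDF and upsert it to Pinecone")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--pdf-path", help="path to a local PDF file")
    source.add_argument("--pdf-url", help="direct URL to a PDF")
    parser.add_argument("--chunking", choices=sorted(CHUNKERS), default="fixed")
    parser.add_argument("--chunk-size", type=int, default=1000)    # assumed default
    parser.add_argument("--chunk-overlap", type=int, default=200)  # assumed default
    parser.add_argument("--pinecone-api-key", default=os.environ.get("PINECONE_API_KEY"))
    parser.add_argument("--index-name")
    parser.add_argument("--metadata", action="append", metavar="KEY=VALUE")
    return parser
```

With a registry like this, adding a "by heading" strategy would only require a new decorated function; the `--chunking` choices update automatically because they are read from `CHUNKERS` when the parser is built.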

## Improvements/Future Extensions
- Add option to process multiple PDFs in one go (input file with URLs/paths)
- Optionally support other vector DBs (Chroma, LanceDB, etc.)
- Add chunk deduplication (hashing), or skip chunks that are already in the index
- Provide Python API for programmatic use, not just CLI
- Add unit tests for chunking and upsert
- Allow user to specify custom metadata or override extracted metadata
- Support asynchronous or parallel chunking/upsert for large docs
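
The hashing-based deduplication idea could look roughly like the following. This is a sketch under assumptions: `chunk_id` and `dedupe_chunks` are hypothetical helpers, whitespace normalization before hashing is one possible policy among several, and `known_ids` stands for whatever set of IDs has already been fetched from the index.

```python
import hashlib
from typing import Iterable, List, Sequence, Tuple


def chunk_id(text: str) -> str:
    """Content hash of a chunk, ignoring differences in whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe_chunks(
    chunks: Sequence[str],
    known_ids: Iterable[str] = (),
) -> List[Tuple[str, str]]:
    """Return (id, text) pairs, skipping repeats and already-known IDs."""
    seen = set(known_ids)
    unique = []
    for text in chunks:
        cid = chunk_id(text)
        if cid in seen:
            continue  # duplicate within this document or already stored
        seen.add(cid)
        unique.append((cid, text))
    return unique
```

If the content hash is also used as the vector ID, re-running ingestion on the same PDF overwrites existing vectors instead of creating duplicates, since an upsert replaces any record with a matching ID.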

## References
- See `ingestion/ingest.py` for current PDF/chunking/upsert logic
- LangChain PyPDFLoader: https://python.langchain.com/docs/modules/data_connection/document_loaders/integrations/pypdf
- Pinecone docs: https://docs.pinecone.io/
- LangChain text splitters: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitter/

