[Feature] Chunk and Upsert Arbitrary PDF to Pinecone via CLI #34

@karthik18495

Summary

Create a CLI utility (or extend ingestion/ingest.py) that allows a user to specify a PDF (via URL or local path), select a chunking strategy, and upsert the resulting chunks (with metadata) into the configured Pinecone index.

Requirements/Details

  • CLI should accept:
    • Path to local PDF file OR a direct URL to a PDF
    • Choice of chunking strategy (e.g., RecursiveCharacterTextSplitter, fixed length, by page, etc.); provide sensible defaults and allow easy extension
    • Pinecone API key and index name (or read from config/secrets)
    • Optional metadata (title, authors, category, etc.), either extracted from the PDF or supplied by the user
  • Load the PDF (download if URL), extract the text, and chunk it per the chosen strategy
  • For each chunk, upsert to Pinecone:
    • Use OpenAI or compatible embeddings model
    • Attach all relevant metadata (document source, chunk index, chunk text, etc.)
  • Print/log a summary of the number of chunks, upsert status, and any errors
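
To make the per-chunk metadata requirement concrete, here is a minimal sketch of how each chunk could be turned into an upsert record. It assumes records are plain dicts with `id`, `values`, and `metadata` keys, which is the shape I understand Pinecone's upsert to accept (worth verifying against the client version in use). The `build_records` helper, the `source#chunk-N` ID scheme, and the `fake_embed` stand-in are all hypothetical; a real run would pass an OpenAI or compatible embedding function instead.

```python
from typing import Callable, Dict, List, Optional, Sequence


def build_records(
    chunks: Sequence[str],
    source: str,
    embed: Callable[[Sequence[str]], List[List[float]]],
    extra_metadata: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Pair each chunk with its embedding and metadata (hypothetical helper)."""
    vectors = embed(chunks)
    if len(vectors) != len(chunks):
        raise ValueError("embed() must return one vector per chunk")

    records = []
    for index, (text, vector) in enumerate(zip(chunks, vectors)):
        # User-supplied metadata first, then the fields this issue asks for.
        metadata = dict(extra_metadata or {})
        metadata.update({"source": source, "chunk_index": index, "text": text})
        records.append(
            {
                "id": f"{source}#chunk-{index}",  # assumed ID scheme
                "values": vector,
                "metadata": metadata,
            }
        )
    return records


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Placeholder for a real embeddings call; returns toy 2-d vectors."""
    return [[float(len(t)), 0.0] for t in texts]
```

Keeping the embedding function as a parameter means the record-building step can be unit tested without network access or API keys.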

Implementation Ideas

  • In ingestion/ingest.py, add a CLI parser that accepts --pdf-path, --pdf-url, --chunking, --pinecone-api-key, --index-name, and --metadata args
  • Use PyPDFLoader (LangChain) or PyPDF directly; optionally fall back from one to the other for robustness
  • For chunking, use langchain.text_splitter.RecursiveCharacterTextSplitter (already used in current code) and expose its parameters (chunk size, overlap) as CLI args
  • Add more strategies: by page, by heading, etc. Use a strategy pattern or factory for easy extension
  • For upsert, use Pinecone client directly or via LangChain's Pinecone wrapper (see existing ingest.py logic)
  • If user provides a URL, download to temp file and clean up after
  • Optionally auto-extract metadata from PDF (title, authors) using PyPDF or pdfminer
  • Log to console and optionally to a log file; the summary should include number of chunks, failures, timing, etc.
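
A rough sketch of how the CLI parser and the strategy factory could fit together, using only the standard library. The flag names `--pdf-path`, `--pdf-url`, `--chunking`, `--pinecone-api-key`, `--index-name`, and `--metadata` come from the list above. Everything else is an assumption: the `--chunk-size` and `--chunk-overlap` names, the default values, the `PINECONE_API_KEY` environment variable, the `KEY=VALUE` metadata format, and the `CHUNKERS` registry. The two chunkers here are simple stand-ins; a `recursive` strategy wrapping RecursiveCharacterTextSplitter could be registered the same way.

```python
import argparse
import os
from typing import Callable, Dict, List

# A chunker receives the text of each page and returns a flat list of chunks.
Chunker = Callable[[List[str], int, int], List[str]]
CHUNKERS: Dict[str, Chunker] = {}


def register(name: str):
    """Decorator that adds a chunking strategy to the registry."""
    def wrap(fn: Chunker) -> Chunker:
        CHUNKERS[name] = fn
        return fn
    return wrap


@register("page")
def by_page(pages: List[str], chunk_size: int, overlap: int) -> List[str]:
    # One chunk per non-empty page; size and overlap are ignored.
    return [p for p in pages if p.strip()]


@register("fixed")
def fixed_length(pages: List[str], chunk_size: int, overlap: int) -> List[str]:
    if not 0 <= overlap < chunk_size:
        raise ValueError("need 0 <= overlap < chunk_size")
    text = "\n".join(pages)
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += chunk_size - overlap
    return chunks


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Chunk a PDF and upsert it into Pinecone.")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--pdf-path", help="path to a local PDF")
    source.add_argument("--pdf-url", help="direct URL to a PDF")
    parser.add_argument("--chunking", choices=sorted(CHUNKERS), default="fixed")
    parser.add_argument("--chunk-size", type=int, default=1000)    # assumed default
    parser.add_argument("--chunk-overlap", type=int, default=200)  # assumed default
    parser.add_argument("--pinecone-api-key", default=os.environ.get("PINECONE_API_KEY"))
    parser.add_argument("--index-name")
    parser.add_argument("--metadata", action="append", default=[], metavar="KEY=VALUE")
    return parser


def parse_metadata(pairs: List[str]) -> Dict[str, str]:
    """Turn repeated --metadata KEY=VALUE arguments into a dict."""
    out = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        out[key] = value
    return out
```

With this layout, adding a new strategy only needs a decorated function, and `--chunking` picks up the new name automatically through `choices=sorted(CHUNKERS)`.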

Improvements/Future Extensions

  • Add option to process multiple PDFs in one go (input file with URLs/paths)
  • Optionally support other vector DBs (Chroma, LanceDB, etc.)
  • Add chunk deduplication (hashing), or skip chunks already in DB
  • Provide Python API for programmatic use, not just CLI
  • Add unit tests for chunking and upsert
  • Allow user to specify custom metadata or override extracted metadata
  • Support asynchronous or parallel chunking/upsert for large docs
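
For the deduplication idea, one possible approach is to hash each chunk's whitespace-normalized text and drop repeats. This is only a sketch: `chunk_id` and `dedupe` are hypothetical names, SHA-256 is an arbitrary choice of hash, and collapsing whitespace before hashing is an assumption about what should count as "the same chunk".

```python
import hashlib
from typing import Iterable, List, Optional, Set, Tuple


def chunk_id(text: str) -> str:
    """Content hash of a chunk, ignoring differences in whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(chunks: Iterable[str], seen: Optional[Set[str]] = None) -> List[Tuple[str, str]]:
    """Return (hash, text) pairs for chunks not already in `seen`.

    `seen` can be pre-populated with hashes already stored in the DB;
    it is updated in place as new chunks are accepted.
    """
    seen = set() if seen is None else seen
    unique = []
    for text in chunks:
        digest = chunk_id(text)
        if digest in seen:
            continue  # duplicate within this run or already in the DB
        seen.add(digest)
        unique.append((digest, text))
    return unique
```

If the hash were also used as the vector ID, re-ingesting the same PDF should overwrite existing vectors rather than add duplicates, since an upsert replaces any record with a matching ID.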
