[Bug]: I faced the problem when run synthetic-data-kit CLI with API endpoint #56

@phattantran1997

Description

Version

0.0.4

Operating System

Windows

Python Version

3.8

What happened?

Hi, I tried to run the CLI command: synthetic-data-kit -c synthetic_data_kit_config.yaml create "data/output/week3_chunk0.txt" --num-pairs 25 --type qa
with this YAML config file:

# Master configuration file for Synthetic Data Kit


paths:
  input:
    pdf: "data/pdf"
    html: "data/html"
    youtube: "data/youtube"
    docx: "data/docx"
    ppt: "data/ppt"
    txt: "data/txt"
  output:
    parsed: "data/output"
    generated: "data/generated"
    cleaned: "data/cleaned"
    final: "data/final"
llm:
  provider: "api-endpoint"

api-endpoint:
  api_base: "http://localhost:11434/v1"
  model: "llama2:latest"   # Replace with the exact model name (run `ollama list` to verify)


# vllm:
#   api_base: "http://localhost:11434/api"
#   port: 8000
#   model: "llama3-3b-instruct"
#   max_retries: 3
#   retry_delay: 1.0

ingest:
  default_format: "txt"
  youtube_captions: "auto"

generation:
  temperature: 0.7
  top_p: 0.95
  chunk_size: 1022
  overlap: 64
  max_tokens: 512
  num_pairs: 25

cleanup:
  threshold: 1.0
  batch_size: 4
  temperature: 0.3

format:
  default: "jsonl"
  include_metadata: true
  pretty_json: true

prompts:
  summary: |
    Summarize this document in 3-5 sentences, focusing on the main topic and key concepts.

  qa_generation: |
    Create 25 question-answer pairs from this text for LLM training.

    Rules:
    1. Questions must be about important facts in the text
    2. Answers must be directly supported by the text
    3. Return JSON format only.

    Text:
    {text}

  qa_rating: |
    Rate each of these question-answer pairs for quality and return JSON:
    [
      {"question": "same question", "answer": "same answer", "rating": n}
    ]

The config uses the API endpoint format, but the command always returns an error:


synthetic-data-kit -c synthetic_data_kit_config.yaml create "data/output/week3_chunk0.txt" --num-pairs 25 --type qa
Loading config from: /Users/lv/Documents/Opensources/LAZYAI/backend/.venv/lib/python3.10/site-packages/synthetic_data_kit/config.yaml
Config has LLM provider set to: api-endpoint
Loading config from: synthetic_data_kit_config.yaml
Config has LLM provider set to: api-endpoint
get_llm_provider returning: api-endpoint
Using api-endpoint provider
Loading config from: synthetic_data_kit_config.yaml
Config has LLM provider set to: api-endpoint
API_ENDPOINT_KEY from environment: Found
Using API key: From env var
Using API base URL: http://localhost:11434/v1
Using api-endpoint provider
Loading config from: synthetic_data_kit_config.yaml
Config has LLM provider set to: api-endpoint
Error: 'NoneType' object cannot be interpreted as an integer
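
For reference, this is the message Python itself raises when `None` reaches a call that needs an integer, such as `range()`. The sketch below only reproduces the message. The `settings` dict and the `max_retries` lookup are hypothetical stand-ins, chosen because `max_retries` appears only in the commented-out `vllm` block of the config above; nothing here confirms that this is the failing code path inside synthetic-data-kit.

```python
# Hypothetical stand-in for the parsed "api-endpoint" section of the config above.
settings = {"api_base": "http://localhost:11434/v1", "model": "llama2:latest"}

# Assumption: some integer option (here "max_retries") is looked up but absent,
# so dict.get() returns None instead of a number.
max_retries = settings.get("max_retries")


def attempt_count(value):
    """Return how many iterations a loop over range(value) would run."""
    return len(range(value))


try:
    attempt_count(max_retries)
    message = None
except TypeError as exc:
    # range(None) raises TypeError; keep the text for comparison with the log.
    message = str(exc)

print(message)
```

If the printed text matches the last line of the log, a missing integer setting is a plausible cause worth checking, though other `None` values could produce the same message.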

Relevant log output

Steps to reproduce

  1. Update the YAML file to use the API endpoint provider.
  2. Run the CLI command synthetic-data-kit -c synthetic_data_kit_config.yaml create "data/output/week3_chunk0.txt" --num-pairs 25 --type qa
  3. The command fails with: Error: 'NoneType' object cannot be interpreted as an integer
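
To narrow this down before running the CLI, the config can be checked for integer options that are absent. This is a minimal sketch, not part of synthetic-data-kit: the helper `find_missing_ints` and the `EXPECTED_INTS` list are hypothetical, and which keys the tool actually requires is an assumption. The `config` dict mirrors the relevant parts of the YAML above; in practice it could come from `yaml.safe_load` (PyYAML).

```python
from typing import Any, Dict, List, Tuple

# Hypothetical list of (section, key) pairs assumed to need integer values.
EXPECTED_INTS: List[Tuple[str, str]] = [
    ("api-endpoint", "max_retries"),
    ("generation", "chunk_size"),
    ("generation", "max_tokens"),
    ("generation", "num_pairs"),
]


def find_missing_ints(
    config: Dict[str, Any],
    expected: List[Tuple[str, str]] = EXPECTED_INTS,
) -> List[str]:
    """Return 'section.key' names whose value is absent or not an integer."""
    missing = []
    for section, key in expected:
        value = (config.get(section) or {}).get(key)
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            missing.append(f"{section}.{key}")
    return missing


# Mirrors the relevant sections of the YAML config in this report.
config = {
    "llm": {"provider": "api-endpoint"},
    "api-endpoint": {
        "api_base": "http://localhost:11434/v1",
        "model": "llama2:latest",
    },
    "generation": {
        "chunk_size": 1022,
        "overlap": 64,
        "max_tokens": 512,
        "num_pairs": 25,
    },
}

print(find_missing_ints(config))
```

With the config from this report, the only flagged entry is `api-endpoint.max_retries`, which is consistent with that key existing only in the commented-out `vllm` block.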

Labels

bug
