
[Feature]: Fill-In-Middle dataset generation #87

@binaryblood

Description


Is your feature request related to a problem? Please describe.

Need support for generating Fill-In-Middle (FIM) datasets for training LLMs on the code-completion task.
Justification for the need:
There is a lot of proprietary software code whose APIs are not exposed to the general public unless you buy a license.
We are trying to train Llama models on one such technology. It is hard to generate data when the programming language is also a niche technology; in my case it's Groovy (not all LLMs know Groovy, only the big ones do), and the APIs are not known by any AI models.

Describe the solution you'd like

I'd like the solution to make the Fill-In-Middle tokens customizable, for example:

  1. CodeLlama uses: `<PRE>` `<SUF>` `<MID>`
  2. DeepSeek uses: `<|FIM_BEGIN|>` `<|FIM_HOLE|>` `<|FIM_END|>`
  3. CodeGemma uses: `<|fim_prefix|>` `<|fim_suffix|>` `<|fim_middle|>` `<|file_separator|>`

When a set of source files is given as input, synthetic-data-kit should produce high-quality code-completion prompts that can be used to train LLMs for code completion.
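A minimal sketch of what customizable FIM formatting could look like. The token table and the `make_fim_example` function below are illustrative assumptions, not part of synthetic-data-kit, and the prefix/suffix/middle ordering varies between model families (the sketch uses the PSM ordering: the model sees prefix and suffix, then predicts the middle):

```python
import random

# Illustrative token sets for the model families listed above.
FIM_TOKENS = {
    "codellama": {"prefix": "<PRE>", "suffix": "<SUF>", "middle": "<MID>"},
    "deepseek":  {"prefix": "<|FIM_BEGIN|>", "suffix": "<|FIM_END|>", "middle": "<|FIM_HOLE|>"},
    "codegemma": {"prefix": "<|fim_prefix|>", "suffix": "<|fim_suffix|>", "middle": "<|fim_middle|>"},
}

def make_fim_example(source: str, tokens: dict, rng: random.Random) -> str:
    """Split source at two random points and emit a PSM-style FIM string.

    Hypothetical helper: real tooling would pick split points more carefully
    and respect each model family's exact token ordering.
    """
    a, b = sorted(rng.sample(range(len(source)), 2))
    prefix, middle, suffix = source[:a], source[a:b], source[b:]
    return (
        f"{tokens['prefix']}{prefix}"
        f"{tokens['suffix']}{suffix}"
        f"{tokens['middle']}{middle}"
    )
```

Training data for a given model would then be produced by iterating this over each source file with the matching token set.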

Describe alternatives you've considered

I have considered fine-tuning a small language model on a dataset generated without the help of LLMs: I wrote a Groovy parser, parsed each file, picked random chunks, and wrote them out as prompts.
But those examples are static; they don't pick meaningful logical blocks that would help an LLM learn.
So we want to consider using the synthetic-data-kit library instead.
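To illustrate the "meaningful logical block" point above: instead of cutting random character spans, a parser can choose a complete statement as the FIM hole. The sketch below uses Python's `ast` module as a stand-in for a language-specific parser (the issue's case was a Groovy parser; `pick_logical_middle` is a hypothetical name):

```python
import ast
import random

def pick_logical_middle(source: str, rng: random.Random):
    """Choose a complete statement's span as the FIM 'middle' hole.

    Returns (prefix, middle, suffix) such that their concatenation
    reproduces the original source exactly.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    # Collect statements whose exact source span is known.
    stmts = [n for n in ast.walk(tree)
             if isinstance(n, ast.stmt) and getattr(n, "end_lineno", None)]
    node = rng.choice(stmts)
    # Convert (line, column) positions to absolute character offsets.
    start = sum(len(l) for l in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[: node.end_lineno - 1]) + node.end_col_offset
    return source[:start], source[start:end], source[end:]
```

The hole is then always a syntactically complete unit, which is exactly what the static random-chunk approach could not guarantee.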

Additional context

No response

Labels: enhancement (New feature or request)