
[Feature]: Fill-In-Middle dataset generation #87

@binaryblood

Description


Is your feature request related to a problem? Please describe.

Need support for generating Fill-In-Middle (FIM) datasets for training LLMs on the code-completion task.
Justification for the need:
There is a lot of proprietary software code whose APIs are not exposed to the general public unless you buy a license.
We are trying to train Llama models on one such technology. It is hard to generate data when the programming language is also a niche technology; in my case it's Groovy (not all LLMs know Groovy, only the big ones do), and the APIs are not known by any AI models.

Describe the solution you'd like

I'd like the solution to make the Fill-In-Middle tokens customizable, for example:

  1. CodeLlama uses: `<PRE>` `<SUF>` `<MID>`
  2. DeepSeek uses: `<|FIM_BEGIN|>` `<|FIM_HOLE|>` `<|FIM_END|>`
  3. CodeGemma uses: `<|fim_prefix|>` `<|fim_suffix|>` `<|fim_middle|>` `<|file_separator|>`

When a set of source files is given as input, synthetic-data-kit should produce high-quality code-completion prompts that can be used to train LLMs for code completion.
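A minimal sketch of what customizable FIM formatting could look like. The token table and the `make_fim_example` function below are illustrative assumptions, not part of synthetic-data-kit, and the prefix/suffix/middle ordering varies between model families (the sketch uses the PSM ordering: the model sees prefix and suffix, then predicts the middle):

```python
import random

# Illustrative token sets for the model families listed above.
FIM_TOKENS = {
    "codellama": {"prefix": "<PRE>", "suffix": "<SUF>", "middle": "<MID>"},
    "deepseek":  {"prefix": "<|FIM_BEGIN|>", "suffix": "<|FIM_END|>", "middle": "<|FIM_HOLE|>"},
    "codegemma": {"prefix": "<|fim_prefix|>", "suffix": "<|fim_suffix|>", "middle": "<|fim_middle|>"},
}

def make_fim_example(source: str, tokens: dict, rng: random.Random) -> str:
    """Split source at two random points and emit a PSM-style FIM string.

    Hypothetical helper: real tooling would pick split points more carefully
    and respect each model family's exact token ordering.
    """
    a, b = sorted(rng.sample(range(len(source)), 2))
    prefix, middle, suffix = source[:a], source[a:b], source[b:]
    return (
        f"{tokens['prefix']}{prefix}"
        f"{tokens['suffix']}{suffix}"
        f"{tokens['middle']}{middle}"
    )
```

Training data for a given model would then be produced by iterating this over each source file with the matching token set.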

Describe alternatives you've considered

I have considered fine-tuning a small language model on a dataset generated without the help of LLMs: I wrote a Groovy parser, parsed each file, picked random chunks, and wrote them out as prompts.
But those examples are static; they don't pick meaningful logical blocks that would help an LLM learn.
So we want to consider using the synthetic-data-kit library instead.
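To illustrate the "meaningful logical block" point above: instead of cutting random character spans, a parser can choose a complete statement as the FIM hole. The sketch below uses Python's `ast` module as a stand-in for a language-specific parser (the issue's case was a Groovy parser; `pick_logical_middle` is a hypothetical name):

```python
import ast
import random

def pick_logical_middle(source: str, rng: random.Random):
    """Choose a complete statement's span as the FIM 'middle' hole.

    Returns (prefix, middle, suffix) such that their concatenation
    reproduces the original source exactly.
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    # Collect statements whose exact source span is known.
    stmts = [n for n in ast.walk(tree)
             if isinstance(n, ast.stmt) and getattr(n, "end_lineno", None)]
    node = rng.choice(stmts)
    # Convert (line, column) positions to absolute character offsets.
    start = sum(len(l) for l in lines[: node.lineno - 1]) + node.col_offset
    end = sum(len(l) for l in lines[: node.end_lineno - 1]) + node.end_col_offset
    return source[:start], source[start:end], source[end:]
```

The hole is then always a syntactically complete unit, which is exactly what the static random-chunk approach could not guarantee.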

Additional context

No response

Labels: enhancement (New feature or request)