Description
Is your feature request related to a problem? Please describe.
Support is needed for generating Fill-In-the-Middle (FIM) datasets for training LLMs on the code completion task.
Justification for the need:
There is a lot of proprietary software whose APIs are not exposed to the general public unless you buy a license.
We are trying to train Llama models on one such technology. It is hard to generate data when the programming language is also a niche technology. In my case it is Groovy (not all LLMs know Groovy, only the big ones do), and the APIs are not known to any AI model.
Describe the solution you'd like
I'd like the solution to have customizable Fill-In-the-Middle tokens, for example:
- CodeLlama uses: `<PRE>`, `<SUF>`, `<MID>`
- DeepSeek uses: `<|fim_begin|>`, `<|fim_hole|>`, `<|fim_end|>`
- CodeGemma uses: `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`, `<|file_separator|>`
Given a set of source files as input, synthetic-data-kit should produce high-quality code completion examples that can be used to train LLMs for code completion.
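To make the request concrete, here is a minimal sketch of what customizable FIM formatting could look like. The marker strings are the published FIM tokens of each model family; everything else (the dict layout, the function name, the example snippet) is hypothetical and is not existing synthetic-data-kit API:

```python
# Hedged sketch: lay out one code snippet as a PSM-style
# (prefix-suffix-middle) FIM training example with per-model tokens.
# FIM_TOKENS and make_fim_example are illustrative names only.

FIM_TOKENS = {
    "codellama": {"pre": "<PRE>", "suf": "<SUF>", "mid": "<MID>"},
    "deepseek": {"pre": "<|fim_begin|>", "suf": "<|fim_hole|>", "mid": "<|fim_end|>"},
    "codegemma": {"pre": "<|fim_prefix|>", "suf": "<|fim_suffix|>", "mid": "<|fim_middle|>"},
}

def make_fim_example(code: str, start: int, end: int, model: str) -> str:
    """Cut code[start:end] out as the 'middle' span and emit
    <pre>prefix<suf>suffix<mid>middle for the chosen model."""
    t = FIM_TOKENS[model]
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{t['pre']}{prefix}{t['suf']}{suffix}{t['mid']}{middle}"

snippet = "def add(a, b):\n    return a + b\n"
start = snippet.index("a + b")
example = make_fim_example(snippet, start, start + len("a + b"), "codellama")
# example == "<PRE>def add(a, b):\n    return <SUF>\n<MID>a + b"
```

Swapping the `model` key is all it would take to retarget the same dataset at a different tokenizer, which is the customizability this request is asking for.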
Describe alternatives you've considered
I have considered fine-tuning a small language model on a dataset generated without the help of LLMs: I wrote a Groovy parser, parsed each file, picked random chunks, and wrote them out as prompts.
But those examples are static; they don't pick meaningful logical blocks that would help the LLM learn.
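The limitation of that static approach can be seen in a small sketch. This is not the author's Groovy parser, just an illustrative random-chunk baseline in Python; the function name and parameters are hypothetical:

```python
import random

def random_hole(source: str, rng: random.Random, max_lines: int = 5):
    """Mask out a random contiguous span of lines as the 'middle'.
    Because the span boundaries ignore syntax, the hole can cut a
    statement or block in half rather than covering a logical unit."""
    lines = source.splitlines(keepends=True)
    start = rng.randrange(len(lines))
    end = min(len(lines), start + rng.randint(1, max_lines))
    prefix = "".join(lines[:start])
    middle = "".join(lines[start:end])
    suffix = "".join(lines[end:])
    return prefix, middle, suffix
```

An LLM-assisted generator could instead choose spans that align with complete expressions, method bodies, or API call sites, which is what this feature request is after.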
So we would like to use the synthetic-data-kit library instead.
Additional context
No response