
Copilot AI commented Nov 18, 2025

Thanks for asking me to work on this. I will get started on it and keep this PR's description up to date as I form a plan and make progress.

Original prompt

Create a Python script for preprocessing the updated CSV dataset into a TSV format suitable for training a conditional causal LM in ReInvent4. The script should validate SMILES, canonicalize them, remove invalid entries, handle duplicates, and output in the format: prompt\ttarget\tec_score. Include RDKit checks and basic statistics (e.g., count per MODE). The dataset has ~8M rows, so make it efficient with batch processing or multiprocessing. Also, provide a simple tokenizer training script using SentencePiece for SMILES with special tokens for , , etc. Finally, include a sample HuggingFace training config YAML for fine-tuning a causal LM.

This pull request was created as a result of the prompt above, issued from Copilot chat.
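No script was produced in this PR, so what follows is only a minimal sketch of the preprocessing step the prompt describes: canonicalize SMILES, drop invalid rows, deduplicate, count rows per MODE, and write a tab-separated file. The input column names (`MODE`, `source_smiles`, `target_smiles`, `ec_score`), the way the prompt string is assembled, the header row, and the helper names `clean_rows` and `write_tsv` are all assumptions made for illustration; the actual dataset layout is not stated in the conversation. The RDKit calls used are `Chem.MolFromSmiles` and `Chem.MolToSmiles`.

```python
import csv
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, str, str]  # (prompt, target, ec_score)


def rdkit_canonicalize(smiles: str) -> Optional[str]:
    """Return RDKit's canonical SMILES, or None if the string fails to parse."""
    from rdkit import Chem  # imported lazily so the rest runs without RDKit

    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def clean_rows(
    rows: Iterable[Dict[str, str]],
    canonicalize: Callable[[str], Optional[str]] = rdkit_canonicalize,
) -> Tuple[List[Record], Counter, Counter]:
    """Canonicalize, filter, and deduplicate rows.

    Column names are hypothetical; adjust them to the real CSV header.
    Returns the kept records, a per-MODE count, and a count of dropped rows.
    """
    seen = set()
    records: List[Record] = []
    per_mode: Counter = Counter()
    dropped: Counter = Counter()
    for row in rows:
        source = canonicalize(row["source_smiles"])
        target = canonicalize(row["target_smiles"])
        if source is None or target is None:
            dropped["invalid"] += 1
            continue
        # Assumed prompt layout: the MODE tag followed by the source SMILES.
        prompt = f"{row['MODE']} {source}"
        key = (prompt, target)
        if key in seen:
            dropped["duplicate"] += 1
            continue
        seen.add(key)
        records.append((prompt, target, row["ec_score"]))
        per_mode[row["MODE"]] += 1
    return records, per_mode, dropped


def write_tsv(records: Iterable[Record], path: str) -> None:
    """Write records as tab-separated prompt, target, ec_score columns."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["prompt", "target", "ec_score"])  # header is an assumption
        writer.writerows(records)
```

With RDKit installed, the sketch could be driven by passing `csv.DictReader(open("input.csv", newline=""))` to `clean_rows` and the result to `write_tsv`. It runs in a single process and keeps the deduplication set in memory; for roughly 8M rows the canonicalization step is the natural place to add chunked multiprocessing.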



Copilot AI self-assigned this Nov 18, 2025
Copilot stopped work on behalf of xiaolitter due to an error November 18, 2025 07:36
@halx halx closed this Nov 20, 2025
@halx halx deleted the copilot/preprocess-csv-to-tsv-format branch November 20, 2025 06:42

2 participants