Description of the bug:
The prompt template for the bindingdb_ic50 task seems to be incorrect. It gives:
Question: Given the target amino acid sequence and compound SMILES string, predict their normalized binding affinity Kd from 000 to 1000, where 000 is minimum IC50 and 1000 is maximum IC50.
Drug SMILES: CC(C)c1c(C(=O)NCc2ccc(Cl)cc2)nn(-c2ccc(F)cc2)c1CC[C@@H](O)C[C@@H](O)CC(=O)[O-]
which seems to be an incomplete cut-and-paste from the kd task.
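A minimal sketch of how this kind of copy-paste mismatch could be caught automatically, using only the prompt text quoted above. The helper `check_prompt` and the field label `Target amino acid sequence` are hypothetical names chosen for illustration; the real template may label its fields differently.

```python
import re

# Prompt text copied from this report (the Question line and the one field shown).
IC50_PROMPT = (
    "Question: Given the target amino acid sequence and compound SMILES string, "
    "predict their normalized binding affinity Kd from 000 to 1000, "
    "where 000 is minimum IC50 and 1000 is maximum IC50.\n"
    "Drug SMILES: CC(C)c1c(C(=O)NCc2ccc(Cl)cc2)nn(-c2ccc(F)cc2)"
    "c1CC[C@@H](O)C[C@@H](O)CC(=O)[O-]"
)


def check_prompt(prompt, expected_measure, other_measures, required_fields):
    """Return a list of human-readable problems found in a prompt (hypothetical helper)."""
    problems = []

    # The measure the task is named after should appear in the prompt.
    if not re.search(rf"\b{re.escape(expected_measure)}\b", prompt):
        problems.append(f"never mentions '{expected_measure}'")

    # Measures belonging to other tasks should not appear at all.
    for measure in other_measures:
        if re.search(rf"\b{re.escape(measure)}\b", prompt):
            problems.append(
                f"mentions '{measure}' but the task measure is '{expected_measure}'"
            )

    # Each required input must have its own "Label: value" line.
    lines = prompt.splitlines()
    for field in required_fields:
        if not any(line.startswith(field + ":") for line in lines):
            problems.append(f"missing field '{field}'")

    return problems


if __name__ == "__main__":
    # "Target amino acid sequence" is an assumed label, not taken from the real template.
    for problem in check_prompt(
        IC50_PROMPT,
        expected_measure="IC50",
        other_measures=["Kd"],
        required_fields=["Drug SMILES", "Target amino acid sequence"],
    ):
        print(problem)
```

Run against the quoted prompt, the check flags both the stray `Kd` and the absence of a line carrying the target sequence, which matches the two symptoms described in this report.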
I ran the three tasks below with and without reasoning. Without reasoning, each produced a numerical answer, but with reasoning I see the following erroneous behaviour (for reference, the protein sequence is that of UniProt ID P04035):
1. Running the normal Kd task, Gemma recognizes the given protein sequence and correctly interprets the SMILES string. It produces a numerical answer.
2. Running this incorrect IC50 task, Gemma fails to recognize the given protein sequence and says it was not given. It also hallucinates atoms and functional groups in the drug SMILES string. It produces a numerical answer.
3. If I change the prompt to say IC50 rather than Kd, Gemma recognizes neither the SMILES string nor the protein sequence. It produces a numerical answer identical to (2).
It seems that perhaps something was wrong with the prompts used in training for IC50?
Actual vs expected behavior:
No response
Any other information you'd like to share?
No response