bindingdb_ic50 bug in TxGemma 9b-chat (not 8-bit) #179

@MauricioCafiero

Description of the bug:

The prompt template for the bindingdb_ic50 task seems to be incorrect. It mentions Kd in an IC50 task and contains no field for the target sequence:

Question: Given the target amino acid sequence and compound SMILES string, predict their normalized binding affinity Kd from 000 to 1000, where 000 is minimum IC50 and 1000 is maximum IC50.
Drug SMILES: CC(C)c1c(C(=O)NCc2ccc(Cl)cc2)nn(-c2ccc(F)cc2)c1CC[C@@H](O)C[C@@H](O)CC(=O)[O-]

which seems to be an incomplete cut-and-paste from the kd task.
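
To make the two problems concrete, here is a minimal sketch of a sanity check one could run on a rendered prompt before sending it to the model. The function name `check_ic50_prompt` is hypothetical, and the field label `Target amino acid sequence:` is an assumption about what a complete prompt would contain; only `Drug SMILES:` appears in the prompt quoted above.

```python
import re


def check_ic50_prompt(prompt: str) -> list[str]:
    """Return a list of problems found in a rendered IC50 prompt.

    Hypothetical helper, not part of TxGemma. The expected field labels
    are assumptions based on the prompt text quoted in this issue.
    """
    problems = []
    # An IC50 task should not ask for Kd.
    if re.search(r"\bKd\b", prompt):
        problems.append("mentions Kd in an IC50 task")
    # Assumed label for the protein field (case-sensitive, with colon).
    if "Target amino acid sequence:" not in prompt:
        problems.append("no target amino acid sequence field")
    # Label taken from the quoted prompt.
    if "Drug SMILES:" not in prompt:
        problems.append("no drug SMILES field")
    return problems


# The prompt exactly as reported above.
reported_prompt = (
    "Question: Given the target amino acid sequence and compound SMILES "
    "string, predict their normalized binding affinity Kd from 000 to 1000, "
    "where 000 is minimum IC50 and 1000 is maximum IC50.\n"
    "Drug SMILES: CC(C)c1c(C(=O)NCc2ccc(Cl)cc2)nn(-c2ccc(F)cc2)"
    "c1CC[C@@H](O)C[C@@H](O)CC(=O)[O-]\n"
)

print(check_ic50_prompt(reported_prompt))
```

On the reported prompt this flags both the stray Kd and the missing target sequence field, while the SMILES field passes.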

I ran the three tasks below with and without reasoning. Without reasoning, the model produced a numerical answer, but with reasoning it showed the following errant behaviour (for reference, the protein sequence is that of UniProt ID P04035):

  1. Running the normal Kd task, Gemma recognizes the given protein sequence and correctly interprets the SMILES string. It produces a numerical answer.
  2. Running this incorrect IC50 task, Gemma fails to recognize the given protein sequence and says it was not given. Further, it hallucinates atoms/functional groups in the drug SMILES string. It produces a numerical answer.
  3. If I change the prompt to say IC50 rather than Kd, Gemma doesn't recognize either the SMILES string or the protein sequence. It produces a numerical answer identical to (2).

Perhaps something was wrong with the prompts used in training for IC50?

Actual vs expected behavior:

No response

Any other information you'd like to share?

No response
