Implementation of VaultGemma Fine Tuning with Differential Privacy and Inference #244
Conversation
Summary of Changes

Hello @RubensZimbres, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request delivers a robust solution for fine-tuning the VaultGemma 1B language model, specifically tailored for medical question answering, with a strong emphasis on data privacy. It integrates state-of-the-art techniques like LoRA and differential privacy to enable efficient and secure model adaptation, making it suitable for applications involving sensitive information. The changes provide a complete, documented pipeline for both training and inference.

Highlights
Code Review
This pull request introduces a comprehensive notebook for fine-tuning VaultGemma with differential privacy. The implementation is well-structured, but there are several critical issues and inconsistencies that need to be addressed. My review focuses on a critical discrepancy in the reported privacy budget, errors in model checkpointing and loading logic that will cause failures, and various inconsistencies between the code and documentation. Addressing these points will significantly improve the reliability and usability of the notebook.
| "epsilon = privacy_engine.get_epsilon(delta=target_delta)\n", | ||
| "print(f\"Final privacy cost: ε = {epsilon:.2f} for δ = {target_delta}\")" |
There is a critical inconsistency in the reported privacy budget. The cell's output shows a final privacy cost of ε = 22.21 for δ = 0.01. However, the code is configured with target_epsilon = 3.0 and target_delta = 1e-5. This discrepancy suggests the output is from a different execution or there is a fundamental issue in the privacy accounting. The reported epsilon is also significantly higher than the target. This must be corrected to ensure the privacy claims of this notebook are valid.
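One way to keep the reported and configured budgets consistent is to let Opacus derive the noise multiplier from the target budget and to report the final cost with the same δ after a full run. A minimal sketch, assuming the notebook's `model`, `optimizer`, `train_loader`, and `num_train_epochs` variables:

```python
from opacus import PrivacyEngine

# Assumed values from the notebook's configuration cell.
target_epsilon = 3.0
target_delta = 1e-5

privacy_engine = PrivacyEngine()
# Opacus solves for the noise multiplier that meets the target (ε, δ) over the given epochs.
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    target_epsilon=target_epsilon,
    target_delta=target_delta,
    epochs=num_train_epochs,
    max_grad_norm=1.0,  # per-sample gradient clipping norm
)

# ... training loop ...

# Report the spent budget with the *same* delta used for calibration.
epsilon = privacy_engine.get_epsilon(delta=target_delta)
print(f"Final privacy cost: ε = {epsilon:.2f} for δ = {target_delta}")
```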
| " device_map=\"auto\",\n", | ||
| ")\n", | ||
| "\n", | ||
| "adapter_path = \"./final_model\"\n", |
The path to the adapter for inference is hardcoded to ./final_model. However, the training loop saves checkpoints to a dynamically generated path based on the training loss (e.g., ./final_model_acc_...). This will cause a FileNotFoundError when running the inference cell. The path should be updated to point to a valid checkpoint saved during training.
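One way to avoid the mismatch is to resolve the adapter path from what training actually wrote rather than hardcoding it. A sketch, assuming checkpoints are saved under the `./final_model_acc_*` pattern described above and that `model` is the already-loaded base model:

```python
import glob
import os

from peft import PeftModel

# Pick the most recently written checkpoint directory produced by the training loop
# (falls back to ./final_model if no dynamic checkpoint directory exists).
checkpoints = sorted(glob.glob("./final_model_acc_*"), key=os.path.getmtime)
adapter_path = checkpoints[-1] if checkpoints else "./final_model"

peft_model = PeftModel.from_pretrained(model, adapter_path)
```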
| "\n", | ||
| "# Training hyperparameters\n", | ||
| "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", | ||
| "num_train_epochs = 2\n", |
| " log_message = f\"Step {global_step}: Train Loss = {avg_train_loss:.4f}\"\n", | ||
| " \n", | ||
| " # Save checkpoint if loss is below threshold\n", | ||
| " if avg_train_loss < 0.06:\n", |
Checkpointing based on a hardcoded training loss threshold (avg_train_loss < 0.06) is unreliable. This condition may never be met, or it could be met too frequently, leading to no checkpoints or too many. A more robust strategy is to save checkpoints based on improvements in the validation loss or simply at the end of each epoch.
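A sketch of the suggested alternative, where `train_one_epoch` and `evaluate` are hypothetical helpers standing in for the notebook's existing loop body and evaluation code:

```python
best_val_loss = float("inf")

for epoch in range(num_train_epochs):
    train_one_epoch(peft_model, optimizer, train_loader)  # hypothetical: existing loop body
    val_loss = evaluate(peft_model, eval_loader)           # hypothetical: returns avg validation loss

    # Save the adapter whenever validation loss improves, instead of relying on
    # a fixed training-loss threshold that may never be reached.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        peft_model.save_pretrained(f"./final_model_epoch_{epoch}")
        print(f"Epoch {epoch}: new best validation loss {val_loss:.4f}, checkpoint saved")
```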
VaultGemma/README.md
Outdated
```python
from transformers import AutoModelForCausalLM, GemmaTokenizer
from peft import PeftModel

# Load model and adapters
model = AutoModelForCausalLM.from_pretrained("google/vaultgemma-1b")
tokenizer = GemmaTokenizer.from_pretrained("google/vaultgemma-1b")
peft_model = PeftModel.from_pretrained(model, "path/to/adapters")

# Generate response
question = "What are the symptoms of diabetes?"
response = generate_response(question)
```
| "!pip install -q -U transformers peft accelerate bitsandbytes datasets pandas\n", | ||
| "!pip install git+https://github.com/huggingface/[email protected]\n", | ||
| "!pip install kagglehub ipywidgets opacus -q" |
The transformers library is installed twice: first from pip and then immediately overwritten by an installation from a specific git commit. This is redundant. To streamline the setup, you can remove transformers from the first pip install command.
!pip install -q -U peft accelerate bitsandbytes datasets pandas
!pip install git+https://github.com/huggingface/[email protected]
!pip install kagglehub ipywidgets opacus -q
| "\n", | ||
| "# Load medical dataset\n", | ||
| "medical_data = load_dataset(\"medalpaca/medical_meadow_medical_flashcards\", split=\"train\")\n", | ||
| "data = medical_data.to_pandas().head(1000)\n", |
The number of samples to be used from the dataset is hardcoded as 1000. This makes the notebook less flexible for experimentation. It would be better to define this as a configurable variable at the top of the cell or in a dedicated configuration section.
NUM_SAMPLES = 1000
data = medical_data.to_pandas().head(NUM_SAMPLES)
| "model = AutoModelForCausalLM.from_pretrained(\n", | ||
| " model_path,\n", | ||
| " quantization_config=quantization_config,\n", | ||
| " torch_dtype=torch.bfloat16,\n", |
def generate_response(question, max_new_tokens=128, temperature=0.9, top_p=0.9):
    prompt = f"Instruction:\nAnswer this medical question concisely.\n\nQuestion:\n{question}\n\nResponse:\n"
The prompt template used for inference (Answer this medical question concisely.) is different from the one used during training (Answer this question truthfully.). This inconsistency can lead to suboptimal model performance, as the model is being prompted in a way it was not trained for. For best results, the prompt templates for training and inference should be identical.
prompt = f"Instruction:\nAnswer this question truthfully.\n\nQuestion:\n{question}\n\nResponse:\n"
Hi @RubensZimbres, could you modify it with the following points?
Done, @bebechien. The gemini-code-assist issues have been addressed as well.
bebechien
left a comment
lgtm!
Add VaultGemma Fine-tuning with Differential Privacy and Inference
Overview
This PR adds a complete pipeline for privacy-preserving fine-tuning and inference of VaultGemma 1B on medical data using LoRA adapters and differential privacy via Opacus.
What's Added
VaultGemma_FineTuning_Inference_Huggingface.ipynb

Key Features
Technical Details
Training Configuration
Privacy Guarantees
The implementation provides (ε, δ)-differential privacy guarantees through gradient clipping (max norm: 1.0) and automatic privacy accounting via Opacus.
Inference
Includes simple inference functions with adjustable generation parameters (temperature, top_p, max_new_tokens) and support for single or batch processing.
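For reference, a minimal sketch of what such a single-question inference helper can look like, assuming the fine-tuned `peft_model` and `tokenizer` from the notebook and the `build_prompt` helper sketched earlier (or an equivalent shared template); the default parameter values are illustrative:

```python
import torch

def generate_response(question, max_new_tokens=128, temperature=0.9, top_p=0.9):
    prompt = build_prompt(question)  # same template as used for the training data
    inputs = tokenizer(prompt, return_tensors="pt").to(peft_model.device)
    with torch.no_grad():
        output = peft_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p,
        )
    # Strip the prompt tokens and return only the generated answer.
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )
```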
Files Changed
VaultGemma_FineTuning_Inference_Huggingface.ipynb - Main training and inference notebook
README.md - Documentation