- This project is just a self-test: an attempt to recreate the GPT-2 model architecture from memory
- Will also try to add a PyTorch `DataLoader` (later) instead of the custom data loading followed in the initial tutorial; see the sketch after this list
- Currently not intending to set up a training loop in this project; if I do, I will add a validation split too
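A minimal sketch of what the PyTorch-native replacement could look like, assuming the corpus is already one long 1-D tensor of token ids; the class and parameter names here are hypothetical:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class TokenDataset(Dataset):
    """Serves fixed-length (input, target) pairs from one long token stream."""
    def __init__(self, tokens: torch.Tensor, block_size: int):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        # Last usable start index leaves room for the shifted target.
        return len(self.tokens) - self.block_size

    def __getitem__(self, idx):
        x = self.tokens[idx : idx + self.block_size]
        y = self.tokens[idx + 1 : idx + 1 + self.block_size]  # next-token targets
        return x, y

# Usage: yields (B, T) batches without manual pointer bookkeeping.
# loader = DataLoader(TokenDataset(tokens, block_size=1024), batch_size=8, shuffle=True)
```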
 
Partial Success
- Could remember the overall architecture (in the initial tryout I missed the final layer norm and the lm_head)
- Divided attention into a MultiHead module holding a list of Head modules on the first try
  - This is okay, but having all heads operate as a single batched matrix operation instead of a list is more efficient; see the sketch after this list
  - The list-of-heads version also deviates from the structure of the original model
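A sketch of the fused version, loosely following the GPT-2 layout: one shared qkv projection and one batched matmul for all heads, with the causal mask registered as a buffer (constructor parameters are assumptions):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """All heads computed in one batched matmul instead of a Python list of Head modules."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd)  # fused q, k, v projection
        self.c_proj = nn.Linear(n_embd, n_embd)
        # Causal mask kept as a non-trainable buffer so it moves with the module.
        self.register_buffer(
            "bias",
            torch.tril(torch.ones(block_size, block_size)).view(1, 1, block_size, block_size),
        )

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(C, dim=2)
        # (B, T, C) -> (B, n_head, T, head_dim): heads become a batch dimension.
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
        att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))  # hide future tokens
        att = F.softmax(att, dim=-1)
        y = (att @ v).transpose(1, 2).contiguous().view(B, T, C)  # re-merge heads
        return self.c_proj(y)
```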
 
- Could not remember the code for the buffer that masks out future positions in the attention scores (the `register_buffer` call in the sketch above)
- Missed out on adding the residual connections in Block on the first try; see the sketch after this list
  - Not a very bad miss; I would have caught it if I had kept a diagram near me
- Need to compensate for the growth in activation variance caused by stacking multiple residual connections; see the note after the sketch below
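A sketch of the block with both residual connections, pre-norm style as in GPT-2; it reuses the CausalSelfAttention sketch above, and the 4x MLP expansion matches the original model:

```python
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm transformer block: x flows through two residual branches."""
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)  # from the sketch above
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection 1
        x = x + self.mlp(self.ln_2(x))   # residual connection 2
        return x
```

On the variance point: every block adds into the same residual stream, so activations grow with depth. GPT-2 handles this at init time rather than with an extra normalization layer, scaling the residual-projection weights by 1/sqrt(N), where N is the number of residual layers (two per block).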
 
 
Fixed all diffs for the basic model
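For reference, a sketch of the top-level wiring where the first-try misses (the final layer norm and the lm_head) live; the module names follow the Hugging Face GPT-2 checkpoint layout, but the constructor signature is an assumption:

```python
import torch
import torch.nn as nn

class GPT(nn.Module):
    def __init__(self, vocab_size: int, block_size: int, n_layer: int, n_head: int, n_embd: int):
        super().__init__()
        self.transformer = nn.ModuleDict(dict(
            wte=nn.Embedding(vocab_size, n_embd),  # token embeddings
            wpe=nn.Embedding(block_size, n_embd),  # learned positional embeddings
            h=nn.ModuleList(Block(n_embd, n_head, block_size) for _ in range(n_layer)),
            ln_f=nn.LayerNorm(n_embd),             # the final layer norm missed on the first try
        ))
        self.lm_head = nn.Linear(n_embd, vocab_size, bias=False)  # projection to logits

    def forward(self, idx):
        B, T = idx.size()
        pos = torch.arange(T, device=idx.device)
        x = self.transformer.wte(idx) + self.transformer.wpe(pos)
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        return self.lm_head(x)
```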