-
Notifications
You must be signed in to change notification settings - Fork 133
PS and KM basic support #17
base: main
Are you sure you want to change the base?
Conversation
dgant
commented
Mar 13, 2020
- Modified download script to acquire PS/KM/HI/FA
- Added prepare.sh for PS/KM parallel and multilingual training
… PS/KM parallel and multilingual training.
| cat "${DEVTEST_HI}/dev.hi" > $DATA/valid.hi-en.hi | ||
| cat "${DEVTEST_HI}/dev.en" > $DATA/valid.hi-en.en | ||
| cat "${DEVTEST_HI}/test.hi" > $DATA/test.hi-en.hi | ||
| cat "${DEVTEST_HI}/test.en" > $DATA/test.hi-en.en |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEVTEST_PSKM and DEVTEST_HI are placeholders for where these datasets will eventually live.
|
Hi @dgant! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but we do not have a signature on file. In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR.
I've added a few comments. Most importantly, please check the contributor license process.
| download_opus_data $KM_ROOT $KM_TGT | ||
| download_opus_data $PS_ROOT $PS_TGT | ||
| download_opus_data $FA_ROOT $FA_TGT | ||
| #download_opus_data $HI_ROOT $HI_TGT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove commented code
| elif [ "$TGT" = "ps" ]; then | ||
| URLS=("${PS_OPUS_URLS[@]}") | ||
| DATASETS=("${PS_OPUS_DATASETS[@]}") | ||
| elif [ "$TGT" = "fa" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have Farsi in the original flores. Is there a reason why are you including it in the target?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the intent was to improve the Pashto performance of a multilingual model by training for Farsi as well, being a related language. The Pashto-English parallel corpus was very limited so being able to make use of the Farsi-English corpus would hopefully be a boon. I don't recall whether we tested this or if we did whether it had an impact.
| @@ -0,0 +1,90 @@ | |||
| import argparse | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add more information on what is the intended use of this script?
Deduplication is certainly useful, but newcomers might not be so familiar
|
Hi @guzmanhe. I opened this PR while I was working at FAIR with Peng-Jen and Marc'Aurelio. I left FB two years ago so my contributor license agreement status has lapsed :). I won't be making any further changes to this pull request, so if there's not interest on your end to see this through you can close it. |