This repository was archived by the owner on Nov 21, 2023. It is now read-only.

dgant commented Mar 13, 2020

  • Modified download script to acquire PS/KM/HI/FA
  • Added prepare.sh for PS/KM parallel and multilingual training

facebook-github-bot added the CLA Signed label Mar 13, 2020
cat "${DEVTEST_HI}/dev.hi" > $DATA/valid.hi-en.hi
cat "${DEVTEST_HI}/dev.en" > $DATA/valid.hi-en.en
cat "${DEVTEST_HI}/test.hi" > $DATA/test.hi-en.hi
cat "${DEVTEST_HI}/test.en" > $DATA/test.hi-en.en
Author

DEVTEST_PSKM and DEVTEST_HI are placeholders for where these datasets will eventually live.

@facebook-github-bot

Hi @dgant!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g., your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

putheakhem mentioned this pull request Mar 30, 2021
Contributor

guzmanhe left a comment

Thanks for this PR.
I've added a few comments. Most importantly, please check the contributor license process.

download_opus_data $KM_ROOT $KM_TGT
download_opus_data $PS_ROOT $PS_TGT
download_opus_data $FA_ROOT $FA_TGT
#download_opus_data $HI_ROOT $HI_TGT
Contributor

Please remove the commented-out code.

elif [ "$TGT" = "ps" ]; then
URLS=("${PS_OPUS_URLS[@]}")
DATASETS=("${PS_OPUS_DATASETS[@]}")
elif [ "$TGT" = "fa" ]; then
Contributor

We don't have Farsi in the original flores. Is there a reason why you are including it as a target?

Author

I believe the intent was to improve the Pashto performance of a multilingual model by also training on Farsi, a related language. The Pashto-English parallel corpus was very limited, so being able to use the Farsi-English corpus would hopefully be a boon. I don't recall whether we tested this or, if we did, whether it had an impact.

@@ -0,0 +1,90 @@
import argparse
Contributor

Can you add more information on the intended use of this script?
Deduplication is certainly useful, but newcomers might not be so familiar
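
For readers unfamiliar with the step the reviewer mentions, here is a minimal sketch of what deduplicating a parallel corpus can look like. It is not the script from this pull request (only its first line, `import argparse`, is visible in the thread); the function name `dedup_parallel`, the in-memory interface, and the choice to compare whole sentence pairs after stripping surrounding whitespace are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def dedup_parallel(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Drop repeated (source, target) pairs, keeping the first occurrence.

    Hypothetical helper: the two inputs are assumed to be line-aligned,
    i.e. line i of the source is the translation of line i of the target.
    """
    seen = set()
    src_out: List[str] = []
    tgt_out: List[str] = []
    # zip() stops at the shorter input, so misaligned files are silently
    # truncated here; a real script would likely check lengths first.
    for src, tgt in zip(src_lines, tgt_lines):
        # The dedup key is the whole pair, compared after stripping
        # leading/trailing whitespace (including the newline).
        key = (src.strip(), tgt.strip())
        if key in seen:
            continue
        seen.add(key)
        src_out.append(key[0])
        tgt_out.append(key[1])
    return src_out, tgt_out


if __name__ == "__main__":
    # Toy data only: the second pair repeats the first and is dropped.
    src = ["a", "a", "b"]
    tgt = ["x", "x", "y"]
    print(dedup_parallel(src, tgt))
```

Keying on the pair rather than on one side alone keeps a source sentence that appears with two different translations, which is one reasonable choice among several for low-resource data.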

Author

dgant commented Apr 14, 2022

Hi @guzmanhe. I opened this PR while I was working at FAIR with Peng-Jen and Marc'Aurelio. I left FB two years ago, so my contributor license agreement status has lapsed :). I won't be making any further changes to this pull request, so if there's no interest on your end in seeing this through, you can close it.
