I have been studying NLP, and along the way I found a model on Hugging Face that can do translation, so I would like to try it out.
Pretrained model
mBART-50 many to many multilingual machine translation
Multilingual Translation with Extensible Multilingual Pretraining and Finetuning. 2020.
Authors: Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu and Angela Fan
https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt
Code
I am assuming execution on Google Colab. First, run the following code and see whether an error occurs.
!pip install -q transformers
!pip install -q SentencePiece
# Load the pretrained model and tokenizer
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
I got the error below. Restart the runtime and re-run the code above; this time it should run without errors.
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>()
      5 model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
----> 6 tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-many-mmt")
      7
      8 # translate Arabic to English

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    104         else:
    105             raise ValueError(
--> 106                 "Couldn't instantiate the backend tokenizer from one of: \n"
    107                 "(1) a `tokenizers` library serialization file, \n"
    108                 "(2) a slow tokenizer instance to convert or \n"

ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
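If you want to confirm that the runtime actually picked up the newly installed package before reloading the model, a quick check like the following should be enough (just an optional sanity check of my own, not something required by the model card):

# Optional sanity check after the restart: if this import succeeds,
# the fast tokenizer conversion above should no longer fail.
import sentencepiece
print("sentencepiece version:", sentencepiece.__version__)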
Translating from Japanese to English
# Source text (Japanese)
article_ja = "これは日本語の文章です。3Dプリンターによるオーダーメイドプレートの有効性について検討したいと思います。"

# Tell the tokenizer which language the input text is in
tokenizer.src_lang = "ja_XX"
encoded_ja = tokenizer(article_ja, return_tensors="pt")

# Force the first generated token to be the English language code
generated_tokens = model.generate(
    **encoded_ja,
    forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
['This is a Japanese text.I would like to discuss the effectiveness of 3D printers for custom plates.']
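The same steps work for any pair of the 50 supported languages, so for repeated use it is convenient to wrap them in a small helper. Here is a minimal sketch; the function name translate and the max_length default are my own choices, not something defined by the model card:

def translate(text, src_lang, tgt_lang, max_length=200):
    """Translate text with mBART-50: set the source language on the tokenizer
    and force the target language code as the first generated token."""
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.lang_code_to_id[tgt_lang],
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

# Example: the same Japanese sentence as above
print(translate(article_ja, "ja_XX", "en_XX"))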
Translating from English to Japanese
I tried translating a paper abstract. If you pull abstracts with the PubMed API, it looks like you could read abstracts in bulk in Japanese (a rough sketch of fetching one is at the end of this post).
However, I am not sure whether it is acceptable copyright-wise to post someone else's abstract here without permission, so I will make up a plausible text instead.
The content below is entirely made up. This text is fictional.
The usefulness of custom-made plates in the treatment of coronal axial vertebral instability was investigated. In this cohort study, patients with coronary axial spine instability were divided into two groups: one with existing treatment (cemented fixation) (n=30) and the other with a custom-made plate (n=30). The results showed that the relative risk of developing postoperative complications with the custom-made plate was 0.4, which was significant at the 0.05 level of significance.
article_en = "The usefulness of custom-made plates in the treatment of coronal axial vertebral instability was investigated. In this cohort study, patients with coronary axial spine instability were divided into two groups: one with existing treatment (cemented fixation) (n=30) and the other with a custom-made plate (n=30). The results showed that the relative risk of developing postoperative complications with the custom-made plate was 0.4, which was significant at the 0.05 level of significance."
# Tell the tokenizer which language the input text is in
tokenizer.src_lang = "en_XX"
encoded_en = tokenizer(article_en, return_tensors="pt")

# Force the first generated token to be the Japanese language code
generated_tokens = model.generate(
    **encoded_en,
    forced_bos_token_id=tokenizer.lang_code_to_id["ja_XX"]
)
tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
['このコホート研究では、冠動脈軸性脊髄不安定症患者を既存の治療(セメント固定)(n=30)とカスタムメイドプレート(n=30)の2つのグループに分け、カスタムメイドプレートで術後合併症の発生の相対リスクが0.4であり、0.05レベルで有意であったことを示した。']
The translation came out pretty well!
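As for the PubMed idea mentioned above, abstracts can be fetched through the NCBI E-utilities efetch endpoint and passed to the same pipeline. The following is only a rough sketch under that assumption; the PMID is a placeholder, and whether translated abstracts may be republished is a separate copyright question:

import requests

def fetch_abstract(pmid):
    """Fetch a plain-text abstract from PubMed via the NCBI E-utilities efetch endpoint."""
    url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    response = requests.get(url, params=params)
    response.raise_for_status()
    return response.text.strip()

# abstract_en = fetch_abstract("12345678")          # placeholder PMID
# print(translate(abstract_en, "en_XX", "ja_XX"))   # translate() is the helper sketched above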