This is a re-implementation of the Google BERT model [paper] in PyTorch. I was strongly inspired by Hugging Face's code and referred to it a lot, but I tried to make my code more Pythonic and PyTorch-style. In fact, it is less than half the number of lines of HF's implementation.
(It is still not heavily tested; let me know if you find any bugs.)
cuda (8 GPUs)
Iter (loss=0.308): 100%|██████████████████████████████████████████████| 115/115 [01:19<00:00, 2.07it/s]
Epoch 1/3 : Average Loss 0.547
Iter (loss=0.303): 100%|██████████████████████████████████████████████| 115/115 [00:50<00:00, 2.30it/s]
Epoch 2/3 : Average Loss 0.248
Iter (loss=0.044): 100%|██████████████████████████████████████████████| 115/115 [00:50<00:00, 2.33it/s]
Epoch 3/3 : Average Loss 0.068
One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text, because the sentence boundaries are used for the "next sentence prediction" task.
Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't create pairs that span two documents.
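The expected corpus layout above can be checked with a small reader like the following sketch. It is only an illustration of the format (the repo's actual data loader may differ); `read_documents` is a hypothetical helper name.

```python
def read_documents(path):
    """Parse a pre-training corpus file: one sentence per line,
    blank lines separating documents. Returns a list of documents,
    each a list of sentence strings."""
    docs, sentences = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:              # a sentence inside the current document
                sentences.append(line)
            elif sentences:       # blank line closes the current document
                docs.append(sentences)
                sentences = []
    if sentences:                 # file may end without a trailing blank line
        docs.append(sentences)
    return docs
```

With documents read this way, "next sentence prediction" pairs can be drawn from consecutive sentences of the same document, and negative pairs from different documents.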
cuda (8 GPUs)
Iter (loss=5.837): : 30089it [18:09:54, 2.17s/it]
Epoch 1/25 : Average Loss 13.928
Iter (loss=3.276): : 30091it [18:13:48, 2.18s/it]
Epoch 2/25 : Average Loss 5.549
Iter (loss=4.163): : 7380it [4:29:38, 2.19s/it]
...
Training curves (1 epoch ~ 30k steps ~ 18 hours):
- Loss for Masked LM vs. iteration steps
- Loss for Next Sentence Prediction vs. iteration steps