Project: bloomberg/koan
Repository: https://github.com/bloomberg/koan
Language: C++ (98.0%)
# kōan

A word2vec negative sampling implementation with correct CBOW update. kōan only depends on Eigen.

Authors: Ozan İrsoy, Adrian Benton, Karl Stratos

Thanks to Cyril Khazan for helping kōan better scale to many threads.

## Rationale

Although continuous bag of words (CBOW) embeddings can be trained more quickly than skipgram (SG) embeddings, it is a common belief that SG embeddings tend to perform better in practice. This was observed by the original authors of word2vec [1] and also in subsequent work [2]. However, we found that popular implementations of word2vec with negative sampling, such as word2vec and gensim, do not implement the CBOW update correctly, potentially leading to misconceptions about the performance of CBOW embeddings when they are trained correctly. We release kōan so that others can efficiently train CBOW embeddings using the corrected weight update. See this technical report for benchmarks of kōan vs. gensim word2vec negative sampling implementations.
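Roughly, the discrepancy is the following (a sketch in illustrative notation of my own, not the report's; see the technical report for the precise derivation):

```latex
% Sketch of the CBOW negative-sampling update (notation is illustrative).
% The hidden state is the mean of the context vectors v_c over context C:
h = \frac{1}{|C|} \sum_{c \in C} v_c
% Loss for an output word o with label y \in \{0, 1\}
% (y = 1 for the true center word, y = 0 for a negative sample):
\ell = -\Big[\, y \log \sigma(u_o^\top h) + (1 - y) \log \sigma(-u_o^\top h) \,\Big]
% Because h averages the context vectors, the chain rule attaches a
% 1/|C| factor to the gradient of every context vector:
\frac{\partial \ell}{\partial v_c}
  = \frac{1}{|C|} \big( \sigma(u_o^\top h) - y \big)\, u_o
  \qquad \text{for each } c \in C
% Applying the unscaled hidden-state gradient to each context vector
% drops this 1/|C| factor, yielding the incorrect update described above.
```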
If you use kōan to learn word embeddings for your own work, please cite:

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.

[2] Karl Stratos, Michael Collins, and Daniel Hsu. Model-based word embeddings from decompositions of count matrices. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1282–1291, 2015.

See here for kōan embeddings trained on the English cleaned Common Crawl corpus (C4).

## Building

You need a C++17-supporting compiler to build kōan (tested with g++ 7.5.0, 8.4.0, and 9.3.0, and clang 11.0.3). To build kōan and all tests:
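The original command block was lost in this page's extraction; what follows is a minimal sketch assuming a standard out-of-source CMake build, not the verbatim README commands:

```sh
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
```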
Run tests with (assuming you are still under `build`):
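The exact test command was also cut off; with a CMake build it would typically be one of the following (an assumption; the project may instead ship a dedicated test binary):

```sh
ctest       # CMake's standard test driver
# or, with the generated Makefile:
make test
```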
## Installation

Installation is as simple as placing the `koan` binary somewhere on your `PATH`.
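For example (the destination directory is illustrative; any directory on your `PATH` works):

```sh
cp build/koan /usr/local/bin/
```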
## Quick Start

To train word embeddings on Wikitext-2, first clone and build kōan:
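A sketch, reusing the repository URL above and the build steps from the Building section:

```sh
git clone https://github.com/bloomberg/koan.git
cd koan
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make
cd ..
```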
Download and unzip the Wikitext-2 corpus:
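The download command was stripped; a sketch assuming the standard Salesforce S3 mirror of the word-level Wikitext-2 archive:

```sh
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
unzip wikitext-2-v1.zip
```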
And learn CBOW embeddings on the training fold with:
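The training invocation was stripped as well; below is an illustrative sketch in which every flag name and value is an assumption to be checked against `./build/koan --help`:

```sh
# Hypothetical flag names: CBOW toggle, training hyperparameters, input file.
./build/koan --file ./wikitext-2/wiki.train.tokens \
             --cbow true \
             --epochs 10 \
             --dim 300 \
             --negatives 5 \
             --context-size 5 \
             --min-count 2 \
             --threads 16
```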
or skipgram embeddings by running the same command with CBOW disabled (in the sketch above, `--cbow false`).

## License

Please read the LICENSE file.

## Benchmarks

See the report for more details.