The problem to solve: given a sentence, return the intent behind it (think chatbot).
Reduced example dataset (the intent is the key on the left of the dict):
data_raw = {"mk_reservation": ["i want to make a reservation",
                               "book a table for me"],
            "show_menu": ["what's the daily menu",
                          "do you serve pizza"],
            "payment_method": ["how can i pay",
                               "can i use cash"],
            "schedule_info": ["when do you open",
                              "at what time do you close"]}
I have stripped down the sentences with spaCy, and converted each word to a vector using the word2vec model loaded through the gensim library.
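Something along these lines (a simplified sketch, assuming gensim >= 4.0 and the small English spaCy model; the stop-word/lemma filtering and the sentence_to_matrix name are placeholders, not necessarily the exact pipeline):

import numpy as np
import spacy
from gensim.models import KeyedVectors

nlp = spacy.load("en_core_web_sm")
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",
                                       binary=True)
MAX_LEN, DIM = 10, 300  # sentence length cap and word2vec dimensionality

def sentence_to_matrix(sentence):
    # Strip the sentence down with spaCy (keep lemmatized alphabetic,
    # non-stop tokens), then look up each token's word2vec vector.
    tokens = [t.lemma_.lower() for t in nlp(sentence)
              if t.is_alpha and not t.is_stop]
    mat = np.zeros((MAX_LEN, DIM), dtype=np.float32)  # zero padding
    for i, tok in enumerate(tokens[:MAX_LEN]):
        if tok in kv:  # skip out-of-vocabulary words
            mat[i] = kv[tok]
    return mat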
This is what resulted from using the word2vec model GoogleNews-vectors-negative300.bin:
[[[ 5.99331968e-02 6.50703311e-02 5.03010787e-02 ... -8.00536275e-02
1.94782894e-02 -1.83010306e-02]
[-2.14406010e-02 -1.00447744e-01 6.13847338e-02 ... -6.72588721e-02
3.03986594e-02 -4.14126664e-02]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
...
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]]
[[ 4.48647663e-02 -1.03907576e-02 -1.78682189e-02 ... 3.84555124e-02
-2.29179319e-02 -2.05144612e-03]
[-5.39291985e-02 -9.88398306e-03 4.39085700e-02 ... -3.55276838e-02
-3.66208404e-02 -4.57760505e-03]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
...
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
0.00000000e+00 0.00000000e+00]]]
- This is a list of sentences, and each sentence is a list of words ( [sentences[sentence[word]]] )
- Each sentence (list) must be of size 10 words (I am padding the remaining positions with zeros)
- Each word (list) has 300 elements (the word2vec dimensionality)
By following some tutorials, I transformed this into a TensorDataset.
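Roughly like this (a simplified sketch reusing data_raw and the sentence_to_matrix helper from above; the label mapping and the batch size are arbitrary choices, not from the tutorials):

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Map each intent name to an integer class id.
intent_to_id = {name: i for i, name in enumerate(sorted(data_raw))}

# Flatten the dict into (sentence, label) pairs.
pairs = [(sent, intent_to_id[intent])
         for intent, sents in data_raw.items()
         for sent in sents]

X = np.stack([sentence_to_matrix(s) for s, _ in pairs])      # (N, 10, 300)
y = np.array([label for _, label in pairs], dtype=np.int64)  # (N,)

dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
loader = DataLoader(dataset, batch_size=4, shuffle=True)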
At this point I am very confused about how to use word2vec, and I have probably just been wasting time. As of now, I believe the embedding layer of an LSTM configuration should be built by importing the word2vec model weights using:
import gensim
import torch
import torch.nn as nn

# binary=True is needed when loading the .bin GoogleNews file
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)
weights = torch.FloatTensor(model.vectors)
word_embeddings = nn.Embedding.from_pretrained(weights)
This is not enough, as PyTorch complains that it does not accept an embedding lookup where the indices are not of integer (Long) type.
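As far as I can tell, the error comes from the forward pass rather than the import itself: nn.Embedding maps integer ids to rows of its weight matrix, so its input must be a LongTensor of indices, not the vectors themselves. A toy illustration (with a made-up 5x3 weight table):

import torch
import torch.nn as nn

emb = nn.Embedding.from_pretrained(torch.randn(5, 3))  # toy 5-word, 3-dim table
print(emb(torch.tensor([0, 2])))   # OK: integer (Long) indices
# emb(torch.tensor([0.0, 2.0]))    # fails: indices must be integers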
EDIT: I found out that importing the weight matrix from gensim word2vec is not straightforward; one has to import the word-to-index table as well.
As soon as I fix this issue, I'll post it here.
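For now, the direction I am exploring looks roughly like this (assuming gensim >= 4.0, where the word-to-index table is exposed as KeyedVectors.key_to_index; the example words are arbitrary):

import gensim
import torch
import torch.nn as nn

kv = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
weights = torch.FloatTensor(kv.vectors)
embedding = nn.Embedding.from_pretrained(weights)

# The word -> row-index table that also has to be imported:
word_index = kv.key_to_index

# The embedding layer consumes integer indices, not vectors:
ids = torch.tensor([word_index[w] for w in ["book", "table", "reservation"]],
                   dtype=torch.long)
vectors = embedding(ids)  # shape (3, 300)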
Asked by Souza; translated from Stack Overflow.