
python - How to use gensim with pytorch to create an intent classifier (with LSTM NN)?

The problem to solve: given a sentence, return the intent behind it (think chatbot).

Reduced example dataset (intent on the left of the dict):

data_raw    = {"mk_reservation" : ["i want to make a reservation",
                                   "book a table for me"],
               "show_menu"      : ["what's the daily menu",
                                   "do you serve pizza"],
               "payment_method" : ["how can i pay",
                                   "can i use cash"],
               "schedule_info"  : ["when do you open",
                                   "at what time do you close"]}

I have stripped down the sentences with spaCy and vectorized each word using the word2vec model provided by the gensim library.
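A minimal sketch of that preprocessing step, assuming gensim 4.x, the small English spaCy model, and a hypothetical helper name sentence_to_matrix:

import gensim
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")
w2v = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def sentence_to_matrix(sentence, max_len=10, dim=300):
    # Tokenize with spaCy, look up each token in word2vec,
    # and zero-pad up to max_len words of dim dimensions each.
    doc = nlp(sentence.lower())
    vectors = [w2v[token.text] for token in doc if token.text in w2v]
    matrix = np.zeros((max_len, dim), dtype=np.float32)
    for i, vec in enumerate(vectors[:max_len]):
        matrix[i] = vec
    return matrix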

This is what resulted from using the word2vec model GoogleNews-vectors-negative300.bin:

[[[ 5.99331968e-02  6.50703311e-02  5.03010787e-02 ... -8.00536275e-02
    1.94782894e-02 -1.83010306e-02]
  [-2.14406010e-02 -1.00447744e-01  6.13847338e-02 ... -6.72588721e-02
    3.03986594e-02 -4.14126664e-02]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]

 [[ 4.48647663e-02 -1.03907576e-02 -1.78682189e-02 ...  3.84555124e-02
   -2.29179319e-02 -2.05144612e-03]
  [-5.39291985e-02 -9.88398306e-03  4.39085700e-02 ... -3.55276838e-02
   -3.66208404e-02 -4.57760505e-03]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  ...
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]
  [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
    0.00000000e+00  0.00000000e+00]]]
  • This is a list of sentences, and each sentence is a list of words ( [sentences[sentence[word]]] )

  • Each sentence (list) must be of size 10 words (I am padding the remaining with zeroes)

  • Each word (list) has 300 elements (word2vec dimensions); a quick shape check follows this list
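A quick way to confirm those shape constraints (vectorized_sentences is a hypothetical name for the list of padded sentence matrices built above):

import numpy as np

X = np.array(vectorized_sentences)    # hypothetical list of padded sentence matrices
assert X.shape[1:] == (10, 300)       # 10 words per sentence, 300 dimensions per word
print(X.shape)                        # (num_sentences, 10, 300)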

By following some tutorials, I transformed this into a TensorDataset.
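For reference, a minimal sketch of that conversion, assuming hypothetical arrays X (padded sentence matrices) and y (integer-encoded intent labels):

import torch
from torch.utils.data import DataLoader, TensorDataset

# X: (num_sentences, 10, 300) float array; y: (num_sentences,) intent labels
features = torch.tensor(X, dtype=torch.float32)
targets = torch.tensor(y, dtype=torch.long)

dataset = TensorDataset(features, targets)
loader = DataLoader(dataset, batch_size=2, shuffle=True)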

At this moment, I am very confused about how to use word2vec, and I have probably been wasting time. As of now, I believe the embedding layer of an LSTM configuration should be built by importing the word2vec model weights using:

import gensim
import torch
import torch.nn as nn

# binary=True for .bin files such as GoogleNews-vectors-negative300.bin
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file', binary=True)
weights = torch.FloatTensor(model.vectors)
word_embeddings = nn.Embedding.from_pretrained(weights)

This is not enough, as PyTorch says it does not accept embedding lookups where the indices are not of INT type (nn.Embedding expects a LongTensor of word indices, not the 300-dimensional float vectors themselves).

EDIT: I found out that importing the weight matrix from gensim word2vec is not straightforward; one also has to import the word-index table.
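A minimal sketch of that fix, assuming gensim 4.x (where the word-index table is exposed as key_to_index; older versions use vocab):

import gensim
import torch
import torch.nn as nn

model = gensim.models.KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)
weights = torch.FloatTensor(model.vectors)
embedding = nn.Embedding.from_pretrained(weights)

# Encode a sentence as the integer row indices nn.Embedding expects.
word_to_index = model.key_to_index
sentence = "i want to make a reservation".split()
indices = torch.LongTensor([word_to_index[w] for w in sentence if w in word_to_index])
vectors = embedding(indices)    # shape: (num_known_words, 300)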

As soon as I fix this issue, I'll post it here.

Asked by Souza, translated from Stack Overflow.


1 Reply


You need neither a neural network nor word embeddings.

Use parse trees with NLTK, where intents are verbs (V) acting on entities (N) in a given utterance:

[Image: phrase-structure parse tree]
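A minimal sketch of that idea with NLTK's POS tagger (extract_intent is a hypothetical helper, not the exact method behind the diagram):

import nltk

nltk.download('punkt', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

def extract_intent(utterance):
    # Tag tokens, then treat the first verb as the intent
    # and the nouns as the entities it acts on.
    tagged = nltk.pos_tag(nltk.word_tokenize(utterance))
    verbs = [w for w, tag in tagged if tag.startswith('VB')]
    nouns = [w for w, tag in tagged if tag.startswith('NN')]
    return (verbs[0] if verbs else None), nouns

print(extract_intent("i want to make a reservation"))
# expected something like: ('want', ['reservation'])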

To classify a sentence, you can then use a neural net.

I personally like BERT in fast.ai.

Once again, you won't need embeddings to run the classification, and you can do it in multiple languages:

Fast.ai_BERT_ULMFit
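As one concrete alternative to the fast.ai recipe linked above (a swapped-in technique, not the answerer's exact code), a pretrained zero-shot pipeline from Hugging Face transformers can score the example intents without any embedding management:

from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

result = classifier("book a table for me",
                    candidate_labels=["mk_reservation", "show_menu",
                                      "payment_method", "schedule_info"])
print(result["labels"][0])    # highest-scoring intent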

Also, if you are working on a chatbot, you can use Named Entity Recognition to guide conversations.
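For instance, a minimal spaCy NER sketch (assuming the small English model) that pulls out slots a chatbot could act on:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("book a table for four people at 7pm on Friday")
for ent in doc.ents:
    print(ent.text, ent.label_)    # e.g. "four" CARDINAL, "7pm" TIME, "Friday" DATE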

