This article collects and summarizes typical usage examples of the Python class nltk.stem.snowball.EnglishStemmer. If you are wondering what exactly the EnglishStemmer class does, how to use it, or what code that uses it looks like in practice, the curated class examples below may help.
The following presents 20 code examples of the EnglishStemmer class, sorted by popularity by default. You can upvote the examples you like or find useful; your votes help our system recommend better Python code examples.
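Before the collected examples, here is a minimal sketch of the core API that all of them build on (the word list is chosen purely for illustration):

from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()  # the Snowball ("Porter2") stemmer for English
for word in ["running", "fairly", "generously"]:
    print(word, "->", stemmer.stem(word))
# expected output:
#   running -> run
#   fairly -> fair
#   generously -> generous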
Example 1: str_to_dict

def str_to_dict(s):
    '''
    creates dictionary of words and counts
    input: s string
    output: dictionary {word: count}
    '''
    s = s.encode('ascii', 'ignore')
    s = str(s)
    word_dict = {}
    l = re.findall(WORDRE, s)  # WORDRE is a module-level word regex in the source project
    for w in l:
        w = w.lower()                # make all letters lowercase
        if w[0] == "'":              # remove single quotes from beginning/
            w = w[1:]                # end of words in l
        elif w[-1] == "'":
            w = w[:-1]
        w = EnglishStemmer().stem(w) # stems non-noun/verbs
        w = w.encode('ascii', 'ignore')
        if w != '':
            if w not in word_dict:   # build dictionary
                word_dict[w] = 1
            else:
                word_dict[w] += 1
    return word_dict

Developer: ccr122, Project: ccr, Lines: 28, Source: parse.py
Example 2: getAllStemEntities

def getAllStemEntities(entities):
    st = EnglishStemmer()
    q = [",", ".", "!", "?", ":", ";"]
    tmp = []
    sourceEntities = [x for x in entities if len(x) > 0]
    np.random.shuffle(entities)
    for i in xrange(len(entities)):
        if len(entities[i]) == 0:
            continue
        if i % 1000 == 0:
            print i
        entities[i] = entities[i].lower()
        entities[i] = entities[i].replace(" - ", " \u2013 ", entities[i].count(" - "))
        entities[i] = entities[i].replace(" -", " \u2013", entities[i].count(" -"))
        entities[i] = entities[i].replace("- ", "\u2013 ", entities[i].count("- "))
        entities[i] = entities[i].replace("-", " - ", entities[i].count("-"))
        entities[i] = entities[i].replace(")", " )", entities[i].count(")"))
        entities[i] = entities[i].replace("(", "( ", entities[i].count("("))
        entities[i] = entities[i].replace("\u0027", " \u0027", entities[i].count("\u0027"))
        for w in q:
            entities[i] = entities[i].replace(w, " " + w, entities[i].count(w))
        word = entities[i].split(" ")
        s = ""
        for w in word:
            s += st.stem(unicode(w)) + " "
        tmp.append(s[:-1])
        if len(tmp) > 50:
            break
    return tmp, entities[: len(tmp)]

Developer: mikhaylova-daria, Project: NER, Lines: 31, Source: allFunctions.py
Example 3: Granularity

def Granularity(sentenceArray):
    for sentence in sentenceArray:
        # print(sentence)
        try:
            stemmer = EnglishStemmer()
            sentence = re.sub(r'\#.*?$', '', sentence)
            sentence = re.sub(r'\#.*? ', '', sentence)
            sentence = re.sub(r'\@.*?$', '', sentence)
            sentence = re.sub(r'\@.*? ', '', sentence)
            sentence = re.sub(r'pic.twitter.*?$', '', sentence)
            sentence = re.sub(r'pic.twitter.*? ', '', sentence)
            sentence = re.sub(r'\'m', ' am', sentence)
            sentence = re.sub(r'\'d', ' would', sentence)
            sentence = re.sub(r'\'ll', ' will', sentence)
            sentence = re.sub(r'\&', 'and', sentence)
            sentence = re.sub(r'don\'t', 'do not', sentence)
            data = stemmer.stem(sentence)
            print(data)
            from nltk.corpus import stopwords
            sentence = str(data)
            stop = stopwords.words('english')
            final = [i for i in sentence.split() if i not in stop]
            finalstring = ' '.join(final)
            os.system("printf \"" + str(finalstring) + "\n\">> stemstop/" + word)  # `word` comes from the enclosing scope in the source project
        except Exception as e:
            print(e)

Developer: PgnDvd, Project: SNLP, Lines: 29, Source: Stemmer.py
Example 4: query

def query(word):
    db = MySQLdb.connect("127.0.0.1", "dizing", "ynr3", "dizing")
    cursor = db.cursor()
    snowball_stemmer = EnglishStemmer()
    stem2 = snowball_stemmer.stem(word)
    cursor.execute("SELECT * FROM words WHERE original=%s OR stem1=%s OR stem2=%s", (word, word, stem2))
    rows = cursor.fetchall()
    words1 = dict()
    words2 = dict()
    for row in rows:
        if row[1] == word or row[3] == word:
            words1[word] = row[0]
        else:
            words2[word] = row[0]
    scenes1 = []
    scenes2 = []
    for (i, words_dict) in [(1, words1), (2, words2)]:
        wids = words_dict.values()
        for wid in wids:
            sql = "SELECT s.sentence, s.start, s.stop, s.ready, m.title FROM scene AS s, words_scenes AS ws, movie as m " + \
                  "WHERE ws.wid=%d AND ws.sid=s.sid AND s.mid = m.mid" % int(wid)
            # print sql
            cursor.execute(sql)
            rows = cursor.fetchall()
            if (i == 1): scenes1 += rows
            else: scenes2 += rows
    print scenes1
    print scenes2
    db.close()
    return scenes1 + scenes2

Developer: yasinzor, Project: videosozluk, Lines: 30, Source: query_word.py
Example 5: _execute

def _execute(self):
    corpus = mongoExtractText(self.name)
    stemmer = EnglishStemmer()
    for item in corpus:
        line = item.replace(',', ' ')
        stemmed_line = stemmer.stem(line)
        self.sentiment.append((sentiment.sentiment(stemmed_line), stemmed_line))

Developer: cevaris, Project: nebula, Lines: 8, Source: mining.py
Example 6: stem_word

def stem_word(word):
    """
    Stem words
    :param word: (str) text word
    :returns: stemmed word
    """
    stemmer = EnglishStemmer()
    return stemmer.stem(word)

Developer: vipul-sharma20, Project: tweet-analysis, Lines: 8, Source: util.py
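A quick usage sketch of this helper (expected Snowball outputs):

stem_word('running')     # -> u'run'
stem_word('generously')  # -> u'generous'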
Example 7: as_eng_postagged_doc

def as_eng_postagged_doc(doc):
    '''Uses nltk default tagger.'''
    tags = [t for _, t in nltk.pos_tag(list(doc.word))]
    stemmer = EnglishStemmer()
    lemmata = [stemmer.stem(w) for w in list(doc.word)]
    doc['pos'] = Series(tags)
    doc['lemma'] = Series(lemmata)
    return doc

Developer: estnltk, Project: pfe, Lines: 8, Source: corpus.py
Example 8: use_snowball_stemmer

def use_snowball_stemmer(self, word):
    """
    return the stemmed word using the snowball algorithm
    :param word:
    :return:
    """
    englishStemmer = EnglishStemmer()
    stemmed_word = englishStemmer.stem(word)
    return stemmed_word

Developer: soumik-dutta, Project: Keyword-Extraction, Lines: 9, Source: Stemming.py
Example 9: getLemmatizerInfo

def getLemmatizerInfo(pathArticle):
    data = open(pathArticle, "r")
    text1 = data.read().decode('utf-8')
    sourceText = text1
    links1 = []
    l = 0
    for q in text1.split():
        if q == '\ufeff':
            continue
        links1.append([text1.find(q, l), q])
        l = len(q) + 1 + text1.find(q, l)
    text1 = text1.replace(' - ', ' \u2013 ', text1.count(' - '))
    text1 = text1.replace(' -', ' \u2013', text1.count(' -'))
    text1 = text1.replace('- ', '\u2013 ', text1.count('- '))
    text1 = text1.replace('-', ' - ', text1.count('-'))
    text1 = text1.replace('(', '( ', text1.count('('))
    text1 = text1.replace(')', ' )', text1.count(')'))
    text1 = text1.replace(' \u0027', ' \u301E', text1.count(' \u0027'))
    text1 = text1.replace('\u0027', ' \u0027', text1.count('\u0027'))
    text1 = text1.split()
    if text1[0] == u'\ufeff':
        text1 = text1[1:]
    text = []
    for word in text1:
        text2 = []
        if len(word) == 0:
            continue
        while word[len(word)-1] in [',', '.', '!', '?', ':', ';']:
            text2.append(word[len(word)-1])
            word = word[:-1]
            if len(word) == 0:
                break
        text.append(word)
        for i in range(len(text2)-1, -1, -1):
            text.append(text2[i])
    out = ''
    st = EnglishStemmer()
    l = 0
    links = []
    for word in text:
        if isOk(word):  # isOk is defined elsewhere in the source project
            q = st.stem(word) + ' '
        else:
            q = word + ' '
        out += q.lower()
        links.append([l, q])
        l += len(q)
    return out, links, links1, sourceText

Developer: mikhaylova-daria, Project: NER, Lines: 55, Source: allFunctions.py
Example 10: stemming

def stemming(tweet):
    tweets = tweet.split()
    wrdStemmer = EnglishStemmer()
    stemTweet = []
    try:
        for tweet in tweets:
            tweet = wrdStemmer.stem(tweet)
            stemTweet.append(tweet)
    except:
        print("Error: Stemming")
    return " ".join(stemTweet)

Developer: RohithEngu, Project: Opinion-Summarizer, Lines: 11, Source: PreProcessing.py
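For illustration, the expected behavior on a small tweet-like string (per the Snowball rules):

stemming("cats running quickly")  # -> 'cat run quick'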
Example 11: fix_lemma_problem

def fix_lemma_problem(pred_scores, targets, space):
    from nltk.stem.snowball import EnglishStemmer
    es = EnglishStemmer()
    r = pred_scores.copy()
    lemmas = np.array([es.stem(v) for v in space.vocab])
    for i, t in enumerate(targets):
        g = es.stem(space.vocab[t])
        mask = (lemmas == g)
        # print space.vocab[t], np.sum(mask)
        r[i][mask] = -1e9
        # print r[i][mask]
    return r

Developer: stephenroller, Project: naacl2016, Lines: 12, Source: lexsub.py
Example 12: get_stemmed_keywords

def get_stemmed_keywords(keywords):
    stemmer = EnglishStemmer()
    stemmed_keywords = list(keywords)
    # split into list of list
    stemmed_keywords = [keyword.split() for keyword in stemmed_keywords]
    # stem individual words
    stemmed_keywords = [list(stemmer.stem(word) for word in keyword) for keyword in stemmed_keywords]
    # list of words to string
    stemmed_keywords = [' '.join(keyword).encode('ascii') for keyword in stemmed_keywords]
    return stemmed_keywords

Developer: bohrjoce, Project: keyword-extraction, Lines: 12, Source: evaluate_multiple.py
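A usage sketch (expected outputs under Python 2, where encode('ascii') returns a plain str):

get_stemmed_keywords(['machine learning', 'neural networks'])
# -> ['machin learn', 'neural network']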
Example 13: main

def main(fname):
    e = EnglishStemmer()
    n, a = 0, 0
    for line in open(sys.argv[1]):
        title, body, tags, creationdate, acceptedanswerid, score, viewcount = eval(line)
        # Process text into tokens
        html_tags = RX_OPEN_TAGS.findall(body)
        body = RX_TAGS.sub("", body)
        print " ".join(e.stem(s) for s in RX_NONWORD.split(body))
    M = bayes.NaiveLearner(adjust_threshold=True, name="Adjusted Naive Bayes")

Developer: andrewdyates, Project: signalfire_sap, Lines: 12, Source: parse2.py
Example 14: stemmed

def stemmed(text, snowball=False):
    """Returns stemmed text
    """
    if snowball:
        st = EnglishStemmer()
    else:
        st = PorterStemmer()
    words = wordpunct_tokenize(text)
    words = [st.stem(w) for w in words]
    text = ' '.join(words)
    return text

Developer: soodoku, Project: search-names, Lines: 12, Source: preprocess.py
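The snowball flag changes the result for some words; the classic 'fairly' case shows the difference between Porter and Snowball (Porter2):

stemmed('fairly', snowball=False)  # -> 'fairli' (Porter)
stemmed('fairly', snowball=True)   # -> 'fair'   (Snowball)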
Example 15: similarity_score

def similarity_score(word1, word2):
    """ see sections 2.3 and 2.4 of http://dx.doi.org.ezp-prod1.hul.harvard.edu/10.1109/TKDE.2003.1209005
    :type word1: string
    :type word2: string
    :return: float: between 0 and 1; similarity between two given words
    """
    stemmer = EnglishStemmer()
    if stemmer.stem(word1) == stemmer.stem(word2):
        return 1
    alpha = 0.2
    beta = 0.6
    l, h = get_path_length_and_subsumer_height(word1, word2)
    return exp((-1)*alpha*l)*((exp(beta*h)-exp((-1)*beta*h))/(exp(beta*h)+exp((-1)*beta*h)))

Developer: ReganBell, Project: QReview, Lines: 13, Source: Analyze.py
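The return expression is the cited paper's similarity measure e^(-alpha*l) * tanh(beta*h), written out with exponentials: since (e^x - e^-x)/(e^x + e^-x) = tanh(x), the final line could equivalently be written as:

from math import exp, tanh
return exp(-alpha * l) * tanh(beta * h)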
Example 16: normalize_tags

def normalize_tags():
    cursor.execute('SELECT app_id, tag, times FROM tag_app_rel;')
    all_tag_data = defaultdict(dict)
    for r in cursor:
        all_tag_data[r[0]][r[1]] = r[2]
    from nltk.stem.snowball import EnglishStemmer
    stemmer = EnglishStemmer()
    for app_id, tag_to_times in all_tag_data.iteritems():
        normalized_app_tag_dict = defaultdict(int)
        for tag, times in tag_to_times.iteritems():
            normalized_app_tag_dict[stemmer.stem(tag)] += times
        for tag, times in normalized_app_tag_dict.iteritems():
            cursor.execute('INSERT INTO tag_app_relation (app_id, tag, times) VALUES (%s, %s, %s)', (app_id, tag, times))

Developer: demien-aa, Project: CodingMan, Lines: 13, Source: services.py
Example 17: nltk_tokenizer

def nltk_tokenizer(text, min_size=4, *args, **kwargs):
    from nltk.stem.snowball import EnglishStemmer
    from nltk.corpus import stopwords as stwds
    from nltk.tokenize import TreebankWordTokenizer
    stemmer = EnglishStemmer()
    stopwords = set(stwds.words('english'))
    text = [stemmer.stem(w)
            for w in TreebankWordTokenizer().tokenize(text)
            if w not in stopwords and len(w) >= min_size]
    return text

Developer: danielcestari, Project: machine_learning, Lines: 13, Source: naive.py
Example 18: tokenize_documents

def tokenize_documents(documents):
    stop_words = stopwords.words('english') + stopwords.words('spanish')  # common words to be filtered
    english = EnglishStemmer()
    arabic = ISRIStemmer()
    punctuation = {ord(char): None for char in string.punctuation}
    def valid_word(token, filtered=stop_words):
        # Returns false for common words, links, and strange patterns
        if (token in filtered) or (token[0:4] == u'http') or \
           (token in string.punctuation):
            return False
        else:
            return True
    for doc in documents:
        row = doc[0]
        doc = doc[1]
        if doc is not None:
            # remove trailing whitespace
            doc = doc.strip()
            # remove twitter handles (words in doc starting with @)
            doc = re.sub(r"@\w+", "", doc)
            # lowercase letters
            doc = doc.lower()
            # remove punctuation
            doc = doc.translate(punctuation)
            # tokenization: handles documents with arabic or foreign characters
            tokens = nltk.tokenize.wordpunct_tokenize(doc)
            cleaned_tokens = []
            for token in tokens:
                # for valid words, correct spellings of gaddafi and stem words
                if valid_word(token):
                    if token in [u'gadhafi', u'gadafi', u'ghadhafi', u'kadhafi', u'khadafi', u'kaddafi']:
                        token = u'gaddafi'
                    else:
                        token = arabic.stem(english.stem(token))
                    cleaned_tokens.append(token)
            yield row
            yield cleaned_tokens

Developer: sharonxu, Project: nlp-twitter, Lines: 50, Source: process_text.py
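Because tokenize_documents is a generator that alternately yields a row id and that row's cleaned token list, a caller would consume it in pairs (a sketch with made-up data):

docs = [(1, u'RT @user Gaddafi speaks http://t.co/x')]
it = tokenize_documents(docs)
row = next(it)     # -> 1
tokens = next(it)  # -> the cleaned, stemmed token list for row 1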
Example 19: stem_sen

def stem_sen(list_sentences):
    stemmer = EnglishStemmer()
    # map back should be a dict with words,
    # each word maps to 3 versions: noun, adj, verb,
    # and each version is a list of pairs
    lem = WordNetLemmatizer()
    mapping_back = {}
    res_list = []
    res_sen = []
    # of course we want to return a list of sentences back as well
    for sent in list_sentences:
        tmp_list = []
        tok_list = word_tokenize(sent)
        tok_pos = nltk.pos_tag(tok_list)
        for tok, pos in tok_pos:
            if (tok.lower() in stopwords.words('english')):
                continue
            if len(tok) == 1:
                continue
            tok = lem.lemmatize(tok)
            pos = pos[:2]
            if ('NN' not in pos) and ('JJ' not in pos) and ('VB' not in pos):
                continue
            stem_tok = stemmer.stem(tok)
            if (stem_tok not in mapping_back):
                mapping_back[stem_tok] = {}
            if pos not in mapping_back[stem_tok]:
                mapping_back[stem_tok][pos] = {}
            # increase count
            if tok not in mapping_back[stem_tok][pos]:
                mapping_back[stem_tok][pos][tok] = 1
            else:
                mapping_back[stem_tok][pos][tok] += 1
            tmp_list.append(stem_tok + '-' + pos)
        res_sen.append(tmp_list)
    res_map = {}
    # do the second run through to find the most frequent mapping
    for tok in mapping_back:
        for pos in mapping_back[tok]:
            tmp_tok = tok + '-' + pos
            # find the most frequent unstemmed word corresponding to the stemmed + tagged form
            most_freq = max(mapping_back[tok][pos], key=mapping_back[tok][pos].get)
            res_map[tmp_tok] = most_freq.encode('ascii')
            res_list.append(tmp_tok)
    return res_sen, res_list, res_map

Developer: bohrjoce, Project: keyword-extraction, Lines: 49, Source: feature_extract.py
Example 20: tokenize

def tokenize(self):
    terms = word_tokenize(self.text)
    self.tokens = []
    self.lemmas = []
    stemmer = EnglishStemmer()
    lemmatizer = WordNetLemmatizer()
    for term in terms:
        try:
            self.tokens.append(stemmer.stem(term).lower())
            self.lemmas.append(lemmatizer.lemmatize(term.lower()))
        except Exception, e:
            print 'current text:', self.text
            print 'current term:', term
            print str(e)
            sys.exit(-1)

Developer: DrDub, Project: window_shopper, Lines: 15, Source: WindowExtractor.py
Note: The nltk.stem.snowball.EnglishStemmer class examples in this article were compiled by 纯净天空 from source code and documentation platforms such as GitHub and MSDocs. The snippets are selected from open-source projects contributed by many developers; copyright of the source code remains with the original authors, and its distribution and use are subject to the corresponding projects' licenses. Do not republish without permission.