This article collects typical usage examples of the Java class edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory. If you are wondering what the PTBTokenizerFactory class does, how to use it, or what real-world code that uses it looks like, the curated examples below should help.
PTBTokenizerFactory is a static nested class of edu.stanford.nlp.process.PTBTokenizer. Six code examples using the class are shown below, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps the system recommend better Java code examples.
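Before the examples, here is a minimal, self-contained sketch of the most common pattern: build a factory from an option string, obtain a Tokenizer over a Reader, and collect the tokens. The class name, sample sentence, and option values here are illustrative only.

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory;
import edu.stanford.nlp.process.Tokenizer;

public class PTBQuickStart {
    public static void main(String[] args) {
        // Options are comma-separated key=value pairs understood by PTBTokenizer.
        PTBTokenizerFactory<Word> factory =
                PTBTokenizer.PTBTokenizerFactory.newWordTokenizerFactory("americanize=false,ptb3Escaping=false");
        Tokenizer<Word> tokenizer =
                factory.getTokenizer(new StringReader("Mr. O'Neill can't attend the 3:30 p.m. meeting."));
        List<Word> tokens = tokenizer.tokenize();
        for (Word token : tokens) {
            System.out.println(token.word());
        }
    }
}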
Example 1: applyPTBTokenizer
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory; // the import this example depends on
private static List<String> applyPTBTokenizer(DocumentPreprocessor dp, boolean tokenizeNLs, boolean ptb3Escaping) {
    PTBTokenizerFactory<Word> tf = PTBTokenizer.PTBTokenizerFactory.newWordTokenizerFactory(
            "tokenizeNLs=" + tokenizeNLs + ",ptb3Escaping=" + ptb3Escaping + ",asciiQuotes=true");
    dp.setTokenizerFactory(tf);
    List<String> sentences = new ArrayList<>();
    for (List<HasWord> wordList : dp) {
        String sentence = "";
        for (HasWord word : wordList) {
            sentence += " " + splitCompounds(word.word());
        }
        sentences.add(sentence);
    }
    return sentences;
}
Author: infolis | Project: infoLink | Lines: 14 | Source file: TokenizerStanford.java
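A hedged sketch of how such a helper might be invoked from within the same class; the file name and surrounding call are made up, and splitCompounds is the project's own helper rather than part of Stanford NLP:

// Hypothetical caller inside TokenizerStanford (the method above is private static):
DocumentPreprocessor dp = new DocumentPreprocessor("input.txt");
List<String> sentences = applyPTBTokenizer(dp, /* tokenizeNLs */ true, /* ptb3Escaping */ false);
for (String sentence : sentences) {
    System.out.println(sentence.trim());
}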
Example 2: TaggerWrapper
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory; // the import this example depends on
protected TaggerWrapper(MaxentTagger tagger) {
    this.tagger = tagger;
    this.config = tagger.config;

    try {
        tokenizerFactory =
                chooseTokenizerFactory(config.getTokenize(),
                                       config.getTokenizerFactory(),
                                       config.getTokenizerOptions(),
                                       config.getTokenizerInvertible());
    } catch (Exception e) {
        System.err.println("Error in tokenizer factory instantiation for class: " + config.getTokenizerFactory());
        e.printStackTrace();
        tokenizerFactory = PTBTokenizerFactory.newWordTokenizerFactory(config.getTokenizerOptions());
    }

    outputStyle = OutputStyle.fromShortName(config.getOutputFormat());
    outputVerbosity = config.getOutputVerbosity();
    outputLemmas = config.getOutputLemmas();
    morpha = (outputLemmas) ? new Morphology() : null;
    tokenize = config.getTokenize();
    tagSeparator = config.getTagSeparator();
}
Author: benblamey | Project: stanford-nlp | Lines: 24 | Source file: MaxentTagger.java
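The line that touches PTBTokenizerFactory here is the fallback in the catch block: if the configured tokenizer factory cannot be created, tagging still proceeds with a plain PTB word tokenizer. A standalone sketch of the same fallback idiom, with a hypothetical custom factory class name:

import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory;
import edu.stanford.nlp.process.TokenizerFactory;

public class TokenizerFallback {
    @SuppressWarnings("unchecked")
    static TokenizerFactory<Word> loadOrDefault(String factoryClassName, String options) {
        try {
            // Try the user-supplied factory class first (factoryClassName is illustrative).
            return (TokenizerFactory<Word>) Class.forName(factoryClassName).newInstance();
        } catch (Exception e) {
            // Fall back to the standard PTB word tokenizer, as the TaggerWrapper constructor does.
            return PTBTokenizerFactory.newWordTokenizerFactory(options);
        }
    }
}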
Example 3: getWordsFromString
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory; // the import this example depends on
public static List<Word> getWordsFromString(String str) {
    PTBTokenizerFactory<Word> factory = (PTBTokenizerFactory<Word>) PTBTokenizer.factory();
    // Stanford's tokenizer actually changes words to American...altering our original text. Stop it!!
    factory.setOptions("americanize=false");
    Tokenizer<Word> tokenizer = factory.getTokenizer(new BufferedReader(new StringReader(str)));
    return tokenizer.tokenize();
}
Author: nchambers | Project: probschemas | Lines: 8 | Source file: Ling.java
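A hedged usage sketch for the method above; the sample text is made up. Because the factory is configured with americanize=false, a British spelling such as "colour" passes through unchanged:

List<Word> words = Ling.getWordsFromString("The U.K. spelling of \"colour\" is left untouched.");
for (Word w : words) {
    System.out.print(w.word() + " ");
}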
Example 4: chooseTokenizerFactory
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory; // the import this example depends on
protected static TokenizerFactory<? extends HasWord>
        chooseTokenizerFactory(boolean tokenize, String tokenizerFactory,
                               String tokenizerOptions, boolean invertible) {
    if (tokenize && tokenizerFactory.trim().length() != 0) {
        //return (TokenizerFactory<? extends HasWord>) Class.forName(getTokenizerFactory()).newInstance();
        try {
            @SuppressWarnings({"unchecked"})
            Class<TokenizerFactory<? extends HasWord>> clazz =
                    (Class<TokenizerFactory<? extends HasWord>>) Class.forName(tokenizerFactory.trim());
            Method factoryMethod = clazz.getMethod("newTokenizerFactory");
            @SuppressWarnings({"unchecked"})
            TokenizerFactory<? extends HasWord> factory =
                    (TokenizerFactory<? extends HasWord>) factoryMethod.invoke(tokenizerOptions);
            return factory;
        } catch (Exception e) {
            throw new RuntimeException("Could not load tokenizer factory", e);
        }
    } else if (tokenize) {
        if (invertible) {
            if (tokenizerOptions.equals("")) {
                tokenizerOptions = "invertible=true";
            } else if (!tokenizerOptions.matches("(^|.*,)invertible=true")) {
                tokenizerOptions += ",invertible=true";
            }
            return PTBTokenizerFactory.newCoreLabelTokenizerFactory(tokenizerOptions);
        } else {
            return PTBTokenizerFactory.newWordTokenizerFactory(tokenizerOptions);
        }
    } else {
        return WhitespaceTokenizer.factory();
    }
}
Author: benblamey | Project: stanford-nlp | Lines: 31 | Source file: MaxentTagger.java
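For the common configuration (tokenization enabled, no custom factory class) the helper reduces to one of the two PTBTokenizerFactory calls above. A hedged call-site sketch from inside the tagger class, with illustrative option strings:

// Invertible CoreLabel tokens, e.g. when original character offsets must be recoverable:
TokenizerFactory<? extends HasWord> tf =
        chooseTokenizerFactory(true, "", "americanize=false", true);
// Equivalent to PTBTokenizerFactory.newCoreLabelTokenizerFactory("americanize=false,invertible=true")

// Plain Word tokens, no invertibility:
TokenizerFactory<? extends HasWord> wordTf =
        chooseTokenizerFactory(true, "", "", false);
// Equivalent to PTBTokenizerFactory.newWordTokenizerFactory("")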
Example 5: getSentences1_old
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory; // the import this example depends on
public static List<String> getSentences1_old(String text, Set<String> entities) {
    text = text.trim();
    text = StringEscapeUtils.escapeHtml(text);
    text = text.replaceAll("http:.*…\\z", "");
    // Patterns for leading "RT @user" / "MT @user" retweet prefixes.
    String[] toMatch = {"\\ART\\s*@\\S+", "\\AMT\\s*@\\S+"};
    for (String t : toMatch) {
        Pattern pattern = Pattern.compile(t, Pattern.CASE_INSENSITIVE);
        String newTweet = text.trim();
        text = "";
        while (!newTweet.equals(text)) { // each loop will cut off one "RT @XXX" or "#XXX"; may need a few calls to cut all hashtags etc.
            text = newTweet;
            Matcher matcher = pattern.matcher(text);
            newTweet = matcher.replaceAll("");
            newTweet = newTweet.trim();
        }
    }
    text = text.replaceAll("-\\s*\\z", "");
    text = text.replaceAll("…\\z", "");
    text = StringEscapeUtils.unescapeHtml(text);
    text = text.trim();
    String[] parts = text.split(Extractor.urlRegExp);
    List<String> sentences = new ArrayList<String>();
    // for(int i=0;i<parts.length;i++){
    int limit = 10;
    if (limit > parts.length)
        limit = parts.length;
    for (int i = 0; i < limit; i++) {
        // parts[i]=text.replace("http://*…","");
        String text_cleaned = extractor.cleanText(parts[i]);
        // List<String> sentences_tmp=new ArrayList<String>();
        Reader reader = new StringReader(text_cleaned);
        DocumentPreprocessor dp = new DocumentPreprocessor(reader);
        dp.setTokenizerFactory(PTBTokenizerFactory.newWordTokenizerFactory("ptb3Escaping=false,untokenizable=noneDelete"));
        //prop.setProperty("tokenizerOptions", "untokenizable=noneDelete");
        Iterator<List<HasWord>> it = dp.iterator();
        while (it.hasNext()) {
            StringBuilder sentenceSb = new StringBuilder();
            List<HasWord> sentence = it.next();
            boolean last_keep = false;
            for (HasWord token : sentence) {
                if ((!token.word().matches("[,:!.;?)]")) && (!token.word().contains("'")) && !last_keep) {
                    sentenceSb.append(" ");
                }
                last_keep = false;
                if (token.word().matches("[(\\[]"))
                    last_keep = true;
                String next_word = token.toString();
                if ((next_word.toUpperCase().equals(next_word)) && (!next_word.equals("I")) && (!entities.contains(next_word)))
                    next_word = next_word.toLowerCase();
                if (next_word.equals("i")) next_word = "I";
                sentenceSb.append(next_word);
            }
            String new_sentence = sentenceSb.toString().trim();
            Character fc = new_sentence.charAt(0);
            new_sentence = fc.toString().toUpperCase() + new_sentence.substring(1);
            if (new_sentence.endsWith(":"))
                text = text.substring(0, text.length() - 3) + ".";
            sentences.add(new_sentence);
        }
        // sentences.addAll(sentences_tmp);
    }
    return sentences;
}
Author: socialsensor | Project: trends-labeler | Lines: 68 | Source file: TrendsLabeler.java
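Most of the method above is tweet-specific clean-up; the PTBTokenizerFactory part is the DocumentPreprocessor wiring. A trimmed, self-contained sketch of just that wiring, with made-up sample text:

import java.io.Reader;
import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory;

public class SentenceSplitDemo {
    public static void main(String[] args) {
        Reader reader = new StringReader("Breaking news: the match is over! Fans celebrate downtown.");
        DocumentPreprocessor dp = new DocumentPreprocessor(reader);
        // Keep raw token forms (no PTB escaping) and silently drop untokenizable characters.
        dp.setTokenizerFactory(PTBTokenizerFactory.newWordTokenizerFactory(
                "ptb3Escaping=false,untokenizable=noneDelete"));
        for (List<HasWord> sentence : dp) {
            StringBuilder sb = new StringBuilder();
            for (HasWord w : sentence) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(w.word());
            }
            System.out.println(sb);
        }
    }
}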
Example 6: getSentences1
import edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory; // the import this example depends on
public static List<String> getSentences1(String text, Set<String> entities) {
    // System.out.println(" Text as it is : " + text);
    text = TrendsLabeler.getCleanedTitleMR(text);
    String[] parts = text.split(Extractor.urlRegExp);
    List<String> sentences = new ArrayList<String>();
    // for(int i=0;i<parts.length;i++){
    int limit = 10;
    if (limit > parts.length)
        limit = parts.length;
    for (int i = 0; i < limit; i++) {
        String text_cleaned = extr.cleanText(parts[i]);
        // List<String> sentences_tmp=new ArrayList<String>();
        Reader reader = new StringReader(text_cleaned);
        DocumentPreprocessor dp = new DocumentPreprocessor(reader);
        dp.setTokenizerFactory(PTBTokenizerFactory
                .newWordTokenizerFactory("ptb3Escaping=false,untokenizable=noneDelete"));
        // dp.setTokenizerFactory(PTBTokenizerFactory.newWordTokenizerFactory("untokenizable=noneDelete"));
        Iterator<List<HasWord>> it = dp.iterator();
        while (it.hasNext()) {
            StringBuilder sentenceSb = new StringBuilder();
            List<HasWord> sentence = it.next();
            boolean last_keep = false;
            for (HasWord token : sentence) {
                if ((!token.word().matches("[,:!.;?)]"))
                        && (!token.word().contains("'")) && !last_keep) {
                    sentenceSb.append(" ");
                }
                last_keep = false;
                if (token.word().matches("[(\\[]"))
                    last_keep = true;
                String next_word = token.toString();
                if ((next_word.toUpperCase().equals(next_word))
                        && (!next_word.equals("I"))
                        && (!entities.contains(next_word)))
                    next_word = next_word.toLowerCase();
                if (next_word.equals("i"))
                    next_word = "I";
                sentenceSb.append(next_word);
            }
            String new_sentence = sentenceSb.toString().trim();
            Character fc = new_sentence.charAt(0);
            new_sentence = fc.toString().toUpperCase()
                    + new_sentence.substring(1);
            if (new_sentence.endsWith(":"))
                text = text.substring(0, text.length() - 3) + ".";
            sentences.add(new_sentence);
        }
        // sentences.addAll(sentences_tmp);
    }
    return sentences;
}
Author: socialsensor | Project: trends-labeler | Lines: 57 | Source file: TrendsLabeler.java
Note: the edu.stanford.nlp.process.PTBTokenizer.PTBTokenizerFactory examples in this article were collected from source-code and documentation hosting platforms such as GitHub/MSDocs. The snippets come from open-source projects contributed by their respective authors, and copyright remains with those authors; redistribution and use should follow each project's license. Do not reproduce without permission.