Java BoilerpipeExtractor Class Code Examples


This article collects typical usage examples of the Java class de.l3s.boilerpipe.BoilerpipeExtractor. If you are wondering how BoilerpipeExtractor is used in practice or are looking for working examples, the selected code samples below may help.

The BoilerpipeExtractor class belongs to the de.l3s.boilerpipe package. Nine code examples of the class are shown below, sorted by popularity by default.

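Example 1: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Returns the article from a document, keeping its basic HTML structure.
 *
 * @param htmlDoc the HTML document to process
 * @param docUri the URI of the document, used to resolve relative anchors in the document to absolute anchors
 * @param extractor the extractor to apply to the document
 * @return the extracted article as a String, or null if processing fails
 */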
public String process(HTMLDocument htmlDoc, URI docUri, final BoilerpipeExtractor extractor) {

    final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
    hh.setOutputHighlightOnly(true);

    TextDocument doc;

    String text = "";
    try {
        doc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
        extractor.process(doc);
        final InputSource is = htmlDoc.toInputSource();
        text = hh.process(doc, is);
    } catch (Exception ex) {
        return null;
    }


    return removeNotAllowedTags(text, docUri);
}

Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 28, Source: HtmlArticleExtractor.java

Example 2: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Parses the media (pictures, videos) out of the given document.
 * 
 * @param doc HTML document to parse the media from
 * @param extractor extractor to use
 * @return list of extracted media (size 0 if no media is found), or null if processing fails
 */
public List<Media> process(String doc, final BoilerpipeExtractor extractor) {
	final HTMLDocument htmlDoc = new HTMLDocument(doc);
	List<Media> media = new ArrayList<Media>();
	TextDocument tdoc;

	try {
		tdoc = new BoilerpipeSAXInput(htmlDoc.toInputSource()).getTextDocument();
		extractor.process(tdoc);
		final InputSource is = htmlDoc.toInputSource();
		media = process(tdoc, is);
	} catch (Exception e) {
		return null;
	}
	return media;
}

Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 23, Source: MediaExtractor.java
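Examples 1 and 2 both catch every exception and return null, which leaves the caller to remember a null check. One possible alternative, sketched below, is to wrap the extraction step in an Optional. This is not part of boilerpipe: the ThrowingExtractor interface and the tryExtract method are hypothetical stand-ins so that the sketch runs with the JDK alone.

```java
import java.util.Optional;

final class SafeExtraction {
    /** Hypothetical stand-in for an extraction call that may throw a checked exception. */
    @FunctionalInterface
    interface ThrowingExtractor {
        String extract(String html) throws Exception;
    }

    /** Runs the extractor and maps any failure to Optional.empty() instead of null. */
    static Optional<String> tryExtract(String html, ThrowingExtractor extractor) {
        try {
            return Optional.ofNullable(extractor.extract(html));
        } catch (Exception ex) {
            // Same catch-all as in the examples, but the failure is visible in the return type.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Toy extractor: strips tags with a regex (for illustration only).
        Optional<String> ok = tryExtract("<p>Hello</p>", h -> h.replaceAll("<[^>]+>", ""));
        // Extractor that always fails, to show the empty result.
        Optional<String> failed = tryExtract("<p>Hello</p>", h -> {
            throw new IllegalStateException("parse error");
        });
        System.out.println(ok.orElse("(no content)"));
        System.out.println(failed.orElse("(no content)"));
    }
}
```

Whether an Optional, an empty result, or a propagated exception is the better choice depends on how callers need to distinguish "nothing found" from "processing failed".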

Example 3: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link URL} using {@link HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 *
 * @param url
 *            The URL of the HTML document to fetch.
 * @param extractor
 *            The extractor to apply to the fetched document.
 * @return A List of enclosed links
 * @throws BoilerpipeProcessingException
 */
public List<String> process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
            .getTextDocument();
    extractor.process(doc);

    final InputSource is = htmlDoc.toInputSource();

    return process(doc, is);
}

Developer: asimihsan, Project: handytrowel, Lines of code: 22, Source: LinkExtractor.java

Example 4: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public String process(final URL url, final BoilerpipeExtractor extractor)
        throws IOException, BoilerpipeProcessingException, SAXException {
    final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

    // Added to fix bug with unicode characters not being recognized by SAX parser on AppEngine (bug while appending chars to StringBuffer by offset)
    htmlDoc.encodeEscapedCharsAsText();

    // Added to support including images in extracted HTML output
    if (includeImages)
        htmlDoc.encodeImageTagsAsText();

    final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
                                     .getTextDocument();
    extractor.process(doc);

    final InputSource is = htmlDoc.toInputSource();

    String finalHtml = process(doc, is);

    // Added to fix bug with unicode characters not being recognized by SAX parser on AppEngine (bug while appending chars to StringBuffer by offset)
    finalHtml = HTMLDocument.restoreTextEncodedEscapedChars(finalHtml, htmlDoc.getCharset().name());

    // Added to support including images in extracted HTML output
    if (includeImages)
        finalHtml = HTMLDocument.restoreTextEncodedImageTags(finalHtml, htmlDoc.getCharset().name());

    return finalHtml;
}

Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 29, Source: HTMLHighlighter.java

Example 5: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link java.net.URL} using {@link de.l3s.boilerpipe.sax.HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 * 
 * @param url
 *            The URL of the HTML document to fetch.
 * @param extractor
 *            The extractor to apply to the fetched document.
 * @return A List of enclosed {@link Image}s
 * @throws BoilerpipeProcessingException
 */
public List<Image> process(final URL url, final BoilerpipeExtractor extractor)
		throws IOException, BoilerpipeProcessingException, SAXException {
	final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

	final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
			.getTextDocument();
	extractor.process(doc);

	final InputSource is = htmlDoc.toInputSource();

	return process(doc, is);
}

Developer: BartoszJarocki, Project: android-boilerpipe, Lines of code: 22, Source: ImageExtractor.java

Example 6: process

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Fetches the given {@link URL} using {@link HTMLFetcher} and processes the
 * retrieved HTML using the specified {@link BoilerpipeExtractor}.
 * 
 * @param url
 *            The URL of the HTML document to fetch.
 * @param extractor
 *            The extractor to apply to the fetched document.
 * @return A List of enclosed {@link Image}s
 * @throws BoilerpipeProcessingException
 */
public List<Image> process(final URL url, final BoilerpipeExtractor extractor)
		throws IOException, BoilerpipeProcessingException, SAXException {
	final HTMLDocument htmlDoc = HTMLFetcher.fetch(url);

	final TextDocument doc = new BoilerpipeSAXInput(htmlDoc.toInputSource())
			.getTextDocument();
	extractor.process(doc);

	final InputSource is = htmlDoc.toInputSource();

	return process(doc, is);
}

Developer: socialsensor, Project: storm-focused-crawler, Lines of code: 24, Source: ImageExtractor.java

Example 7: main

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public static void main(String[] args) throws InterruptedException, IOException {
  List<String> lines = Files.readLines(new File("data/query.tsv"), Charsets.UTF_8);
  Set<String> ids = new HashSet<>();
  for (String line : lines) {
    ids.add(line.split("\t")[0]);
  }
  BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
  ExecutorService es = Executors.newFixedThreadPool(10);
  System.out.println(ids.size());
  DecimalFormat df = new DecimalFormat("00");
  for (String id : ids) {
    String googleHtml = Files.toString(new File("data/googlerp", id + ".html"), Charsets.UTF_8);
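    // NOTE: `pattern` is not declared in this excerpt; it must be defined elsewhere in the enclosing class.
    Matcher matcher = pattern.matcher(googleHtml);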
    int count = 0;
    while (matcher.find()) {
      count++;
      // check existence
      File docHtmlFile = new File("data/context", id + "-" + df.format(count) + ".html");
      File docTextFile = new File("data/context", id + "-" + df.format(count) + ".txt");
      if (docHtmlFile.exists() && docTextFile.exists()) {
        continue;
      }
      // get url
      String url = matcher.group(1);
      if (url.contains("wikihow") || url.contains("google")) {
        continue;
      }
      es.execute(() -> {
        System.out.println(id + " " + url);
        // download url
        try {
          String docHtml = Request.Get(url).connectTimeout(2000).socketTimeout(2000).execute()
                  .returnContent().asString();
          Files.write(docHtml, docHtmlFile, Charsets.UTF_8);
          String docText = extractor.getText(docHtml);
          Files.write(docText, docTextFile, Charsets.UTF_8);
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
  }
  es.shutdown();
  if (!es.awaitTermination(5, TimeUnit.MINUTES)) {
    System.out.println("Timeout occurs for one or some concept retrieval service.");
  }
}

Developer: ziy, Project: pkb, Lines of code: 48, Source: ContextExtractor.java
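Example 7 submits its download-and-extract tasks to a fixed thread pool, calls shutdown(), and then waits with awaitTermination(). The following is a minimal, JDK-only sketch of that submit/drain pattern. The class name PoolDrain, the pool size of 4, and the trivial counter task are assumptions made for illustration; the call to shutdownNow() on timeout is an addition that the original example does not make.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class PoolDrain {
    /** Submits taskCount trivial tasks, waits for them, and returns how many completed. */
    static int runAll(int taskCount, long timeout, TimeUnit unit) {
        ExecutorService es = Executors.newFixedThreadPool(4); // pool size is an arbitrary choice
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            // Stand-in for the "download and extract" work of the example.
            es.execute(done::incrementAndGet);
        }
        es.shutdown(); // stop accepting new tasks; already queued tasks still run
        try {
            if (!es.awaitTermination(timeout, unit)) {
                es.shutdownNow(); // timed out: cancel whatever is still pending
            }
        } catch (InterruptedException e) {
            es.shutdownNow();
            Thread.currentThread().interrupt(); // preserve the interrupt status
        }
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(runAll(20, 5, TimeUnit.SECONDS));
    }
}
```

The timeout passed to awaitTermination() bounds the whole batch, so it needs to be sized to the slowest expected download multiplied by the queue depth per thread, not to a single request.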

Example 8: downloadSearchResult

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
public static void downloadSearchResult() throws IOException, BoilerpipeProcessingException,
        SAXException, URISyntaxException, InterruptedException {
  List<String> lines = Files.readLines(new File("data/e2e-apkbc-suggested-query.tsv"),
          Charsets.UTF_8);
  Set<String> ids = new HashSet<>();
  for (String line : lines) {
    ids.add(line.split("\t")[0]);
  }
  Pattern pattern = Pattern.compile("<a href=\"([^>\"]*)\" onmousedown=\"");
  BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
  ExecutorService es = Executors.newFixedThreadPool(10);
  System.out.println(ids.size());
  DecimalFormat df = new DecimalFormat("00");
  for (String id : ids) {
    String googleHtml = Files.toString(new File("data/e2e-googlerp", id + ".html"),
            Charsets.UTF_8);
    Matcher matcher = pattern.matcher(googleHtml);
    int count = 0;
    while (matcher.find()) {
      count++;
      // check existence
      File docHtmlFile = new File("data/e2e-context", id + "-" + df.format(count) + ".html");
      File docTextFile = new File("data/e2e-context", id + "-" + df.format(count) + ".txt");
      if (docHtmlFile.exists() && docTextFile.exists()) {
        continue;
      }
      // get url
      String url = matcher.group(1);
      if (url.contains("wikihow") || url.contains("google")) {
        continue;
      }
      es.execute(() -> {
        System.out.println(id + " " + url);
        // download url
        try {
          String docHtml = Request.Get(url).connectTimeout(2000).socketTimeout(2000).execute()
                  .returnContent().asString();
          Files.write(docHtml, docHtmlFile, Charsets.UTF_8);
          String docText = extractor.getText(docHtml);
          Files.write(docText, docTextFile, Charsets.UTF_8);
        } catch (Exception e) {
          e.printStackTrace();
        }
      });
    }
  }
  es.shutdown();
  if (!es.awaitTermination(5, TimeUnit.SECONDS)) {
    System.out.println("Timeout occurs for one or some concept retrieval service.");
  }
}

Developer: ziy, Project: pkb, Lines of code: 52, Source: AutomaticProceduralKnowledgeBaseConstructor.java
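The link-harvesting step in Example 8 can be tried in isolation. The sketch below reuses the regular expression and the "wikihow"/"google" filter from the example and applies them to a small hand-written HTML string. The class name SearchResultLinks and the sample markup are assumptions made for illustration; real search-result pages may use different markup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SearchResultLinks {
    // Same expression as in the example: an anchor whose href is directly
    // followed by an onmousedown attribute.
    private static final Pattern LINK =
            Pattern.compile("<a href=\"([^>\"]*)\" onmousedown=\"");

    /** Returns all matched URLs except those containing "wikihow" or "google". */
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = LINK.matcher(html);
        while (matcher.find()) {
            String url = matcher.group(1);
            if (url.contains("wikihow") || url.contains("google")) {
                continue; // same exclusion rule as in the example
            }
            urls.add(url);
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\" onmousedown=\"x()\">A</a>"
                + "<a href=\"http://www.wikihow.com/b\" onmousedown=\"x()\">B</a>"
                + "<a href=\"http://example.org/c\">C</a>"; // no onmousedown: not matched
        System.out.println(extract(html));
    }
}
```

Because the expression depends on the exact attribute order of the page, it is brittle; an HTML parser would be a more robust choice if the markup is not under your control.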

Example 9: BoilerpipeContentHandler

import de.l3s.boilerpipe.BoilerpipeExtractor; // import the required package/class
/**
 * Creates a new boilerpipe-based content extractor, using the given
 * extraction rules. The extracted main content will be passed to the
 * <delegate> content handler.
 *
 * @param delegate
 *            The {@link ContentHandler} object
 * @param extractor
 *            Extraction rules to use, e.g. {@link ArticleExtractor}
 */
public BoilerpipeContentHandler(ContentHandler delegate, BoilerpipeExtractor extractor) {
    this.td = null;
    this.delegate = delegate;
    this.extractor = extractor;
}

Developer: kolbasa, Project: OCRaptor, Lines of code: 16, Source: BoilerpipeContentHandler.java
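The constructor in Example 9 only stores a delegate ContentHandler and an extractor; its Javadoc says the extracted main content is passed on to the delegate. The sketch below illustrates that buffer-then-forward idea using only the JDK's SAX classes. It is not the OCRaptor implementation: the class name, the UnaryOperator filter (a hypothetical stand-in for the extraction rules), and the choice to forward everything at endDocument() are assumptions.

```java
import java.io.StringReader;
import java.util.function.UnaryOperator;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class BufferingContentHandlerSketch extends DefaultHandler {
    private final ContentHandler delegate;
    private final UnaryOperator<String> filter; // hypothetical stand-in for extraction rules
    private final StringBuilder buffer = new StringBuilder();

    BufferingContentHandlerSketch(ContentHandler delegate, UnaryOperator<String> filter) {
        this.delegate = delegate;
        this.filter = filter;
    }

    @Override
    public void startDocument() throws SAXException {
        buffer.setLength(0);
        delegate.startDocument();
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        buffer.append(ch, start, length); // collect text instead of forwarding it right away
    }

    @Override
    public void endDocument() throws SAXException {
        // Apply the filter to the buffered text, then hand the result to the delegate.
        char[] out = filter.apply(buffer.toString()).toCharArray();
        delegate.characters(out, 0, out.length);
        delegate.endDocument();
    }

    /** Parses xml through the sketch handler and returns what the delegate received. */
    static String run(String xml, UnaryOperator<String> filter) {
        StringBuilder sink = new StringBuilder();
        ContentHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sink.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(xml)),
                    new BufferingContentHandlerSketch(collector, filter));
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        System.out.println(run("<doc><nav>menu </nav><p>body</p></doc>",
                s -> s.replace("menu ", "")));
    }
}
```

Buffering the whole document is what allows a filter to judge each block in the context of the others, at the cost of holding the full text in memory until the document ends.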

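Note: The de.l3s.boilerpipe.BoilerpipeExtractor examples in this article were collected from GitHub, MSDocs, and other source-code and documentation hosting platforms. The code snippets were selected from open-source projects, and copyright of the source code belongs to the original authors. For distribution and use, refer to the license of the corresponding project. Do not reproduce without permission.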
