Java HtmlMapper Class Code Examples


This article collects typical usage examples of org.apache.tika.parser.html.HtmlMapper in Java. If you are wondering how the Java HtmlMapper class is used in practice, or are looking for working examples of it, the selected code samples here may help.



The HtmlMapper class belongs to the org.apache.tika.parser.html package. Two code examples of the HtmlMapper class are shown below, sorted by popularity by default.

Example 1: prepare

import org.apache.tika.parser.html.HtmlMapper; // import the required package/class
@SuppressWarnings({ "rawtypes", "unchecked" })
@Override
public void prepare(Map conf, TopologyContext context,
        OutputCollector collector) {

    emitOutlinks = ConfUtils.getBoolean(conf, "parser.emitOutlinks", true);

    urlFilters = URLFilters.fromConf(conf);

    parseFilters = ParseFilters.fromConf(conf);

    upperCaseElementNames = ConfUtils.getBoolean(conf,
            "parser.uppercase.element.names", true);

    extractEmbedded = ConfUtils.getBoolean(conf, "parser.extract.embedded",
            false);

    String htmlmapperClassName = ConfUtils.getString(conf,
            "parser.htmlmapper.classname",
            "org.apache.tika.parser.html.IdentityHtmlMapper");

    // Resolve the configured mapper class and verify that it implements
    // HtmlMapper before it is used anywhere else.
    try {
        HTMLMapperClass = Class.forName(htmlmapperClassName);
        boolean interfaceOK = HtmlMapper.class
                .isAssignableFrom(HTMLMapperClass);
        if (!interfaceOK) {
            throw new RuntimeException("Class " + htmlmapperClassName
                    + " does not implement HtmlMapper");
        }
    } catch (ClassNotFoundException e) {
        LOG.error("Can't load class {}", htmlmapperClassName);
        // Keep the original exception as the cause so the stack trace survives.
        throw new RuntimeException("Can't load class "
                + htmlmapperClassName, e);
    }

    // instantiate Tika
    long start = System.currentTimeMillis();
    tika = new Tika();
    long end = System.currentTimeMillis();

    LOG.debug("Tika loaded in {} msec", end - start);

    this.collector = collector;

    this.eventCounter = context.registerMetric(this.getClass()
            .getSimpleName(), new MultiCountMetric(), 10);

    this.metadataTransfer = MetadataTransfer.getInstance(conf);
}
 
Developer: eorliac, Project: patent-crawler, Lines of code: 50, Source file: ParserBolt.java
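
Example 1 validates the configured class name but does not show how an instance is later created from it. The following is a minimal, dependency-free sketch of that load-validate-instantiate pattern. `MapperLoaderSketch`, `ElementMapper`, `LowerCaseMapper`, and `load` are hypothetical names introduced here for illustration; `ElementMapper` is only a local stand-in for Tika's `HtmlMapper` so the sketch runs without Tika on the classpath, and it is not how ParserBolt.java itself creates the mapper.

```java
import java.util.Locale;

public class MapperLoaderSketch {

    /** Hypothetical stand-in for org.apache.tika.parser.html.HtmlMapper. */
    public interface ElementMapper {
        String mapSafeElement(String name);
    }

    /** Hypothetical implementation that keeps every element, lower-cased. */
    public static class LowerCaseMapper implements ElementMapper {
        @Override
        public String mapSafeElement(String name) {
            return name.toLowerCase(Locale.ROOT);
        }
    }

    /**
     * Loads a class by name, checks that it implements ElementMapper,
     * and creates an instance through its no-argument constructor.
     */
    public static ElementMapper load(String className) {
        try {
            Class<?> candidate = Class.forName(className);
            if (!ElementMapper.class.isAssignableFrom(candidate)) {
                throw new IllegalArgumentException(
                        "Class " + className + " does not implement ElementMapper");
            }
            return candidate.asSubclass(ElementMapper.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing constructor, and failed instantiation.
            throw new IllegalArgumentException("Can't load class " + className, e);
        }
    }

    public static void main(String[] args) {
        // Use getName() so the binary name of the nested class is always correct.
        ElementMapper mapper = load(LowerCaseMapper.class.getName());
        System.out.println(mapper.mapSafeElement("DIV")); // prints "div"
    }
}
```

Checking `isAssignableFrom` before instantiating means a misconfigured class name fails with a clear message instead of a `ClassCastException` at first use.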


Example 2: extract

import org.apache.tika.parser.html.HtmlMapper; // import the required package/class
/**
 * Create a pull-parser from the given {@link TikaInputStream}.
 *
 * @param document file that is being extracted from
 * @param input the stream to extract from
 * @return A pull-parsing reader.
 * @throws IOException if the parsing reader cannot be created
 */
protected Reader extract(final Document document, final TikaInputStream input) throws IOException {
	final Metadata metadata = document.getMetadata();
	final ParseContext context = new ParseContext();
	final AutoDetectParser autoDetectParser = new AutoDetectParser(defaultParser);
	final Parser parser;

	if (null != digester) {
		parser = new DigestingParser(autoDetectParser, digester);
	} else {
		parser = autoDetectParser;
	}

	if (!ocrDisabled) {
		context.set(TesseractOCRConfig.class, ocrConfig);
	}

	context.set(PDFParserConfig.class, pdfConfig);

	// Set a fallback parser that outputs an empty document for empty files,
	// otherwise throws an exception.
	autoDetectParser.setFallback(FallbackParser.INSTANCE);

	// Only include "safe" tags in the HTML output from Tika's HTML parser.
	// This excludes script tags and objects.
	context.set(HtmlMapper.class, DefaultHtmlMapper.INSTANCE);

	final Reader reader;
	final Function<Writer, ContentHandler> handler;

	if (OutputFormat.HTML == outputFormat) {
		handler = (writer) -> new ExpandedTitleContentHandler(new HTML5Serializer(writer));
	} else {

		// The default BodyContentHandler is used when constructing the ParsingReader for text output, but
		// because only the body of embeds is pushed to the content handler further down the line, we can't
		// expect a body tag.
		handler = WriteOutContentHandler::new;
	}

	if (EmbedHandling.SPAWN == embedHandling) {
		context.set(Parser.class, parser);
		context.set(EmbeddedDocumentExtractor.class, new EmbedSpawner(document, context, embedOutput, handler));
	} else if (EmbedHandling.CONCATENATE == embedHandling) {
		context.set(Parser.class, parser);
		context.set(EmbeddedDocumentExtractor.class, new EmbedParser(document, context));
	} else {
		context.set(Parser.class, EmptyParser.INSTANCE);
		context.set(EmbeddedDocumentExtractor.class, new EmbedBlocker());
	}

	if (OutputFormat.HTML == outputFormat) {
		reader = new ParsingReader(parser, input, metadata, context, handler);
	} else {
		reader = new ParsingReader(parser, input, metadata, context);
	}

	return reader;
}
 
Developer: ICIJ, Project: extract, Lines of code: 66, Source file: Extractor.java
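
Example 2 relies on a mapper that only lets "safe" tags through and drops script tags and objects. The sketch below illustrates that allowlist idea in isolation. The `Mapper` interface is a local stand-in that reuses the three method names of Tika's `HtmlMapper` (`mapSafeElement`, `isDiscardElement`, `mapSafeAttribute`) so the code runs without Tika. `SafeTagMapperSketch`, `AllowlistMapper`, and the specific tag and attribute sets are assumptions made for illustration and are not Tika's actual tables.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class SafeTagMapperSketch {

    /** Local stand-in using the same three method names as Tika's HtmlMapper. */
    public interface Mapper {
        String mapSafeElement(String name);
        boolean isDiscardElement(String name);
        String mapSafeAttribute(String elementName, String attributeName);
    }

    /** Hypothetical allowlist mapper; the tag and attribute sets are illustrative only. */
    public static class AllowlistMapper implements Mapper {
        private static final Set<String> SAFE_ELEMENTS =
                new HashSet<>(Arrays.asList("p", "a", "h1", "ul", "li"));
        private static final Set<String> DISCARD_ELEMENTS =
                new HashSet<>(Arrays.asList("script", "style", "object"));
        private static final Set<String> SAFE_LINK_ATTRIBUTES =
                new HashSet<>(Arrays.asList("href", "title"));

        private static String normalize(String name) {
            return name.toLowerCase(Locale.ROOT);
        }

        /** Returns the normalized element name, or null if the element is not allowed. */
        @Override
        public String mapSafeElement(String name) {
            String element = normalize(name);
            return SAFE_ELEMENTS.contains(element) ? element : null;
        }

        /** Returns true if the element and everything inside it should be dropped. */
        @Override
        public boolean isDiscardElement(String name) {
            return DISCARD_ELEMENTS.contains(normalize(name));
        }

        /** Allows only href and title, and only on anchor elements. */
        @Override
        public String mapSafeAttribute(String elementName, String attributeName) {
            String attribute = normalize(attributeName);
            boolean isLink = "a".equals(normalize(elementName));
            return isLink && SAFE_LINK_ATTRIBUTES.contains(attribute) ? attribute : null;
        }
    }

    public static void main(String[] args) {
        Mapper mapper = new AllowlistMapper();
        System.out.println(mapper.mapSafeElement("H1"));            // prints "h1"
        System.out.println(mapper.isDiscardElement("SCRIPT"));      // prints "true"
        System.out.println(mapper.mapSafeAttribute("A", "HREF"));   // prints "href"
        System.out.println(mapper.mapSafeAttribute("a", "onclick")); // prints "null"
    }
}
```

With a real Tika dependency, a class like this would implement `org.apache.tika.parser.html.HtmlMapper` directly and be registered the same way Example 2 does it, through `context.set(HtmlMapper.class, ...)`.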



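Note: The org.apache.tika.parser.html.HtmlMapper examples in this article were collected from source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright of the source code belongs to the original authors. For distribution and use, refer to the license of the corresponding project. Do not reproduce without permission.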

