Java CrawlDatum Class Code Examples


This article collects typical usage examples of the Java class org.apache.nutch.crawl.CrawlDatum. If you are trying to work out what the CrawlDatum class is for, how to use it, or what real code that uses it looks like, the curated examples below should help.



The CrawlDatum class belongs to the org.apache.nutch.crawl package. Twenty code examples of the class are shown below, ordered by popularity by default.
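
Before the project examples, here is a minimal, self-contained sketch of how a CrawlDatum is typically created and populated. It is illustrative only: the class name, status, fetch interval, score, and metadata key used here are assumed values, not taken from any of the projects quoted below.

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDatumSketch {
  public static void main(String[] args) {
    // A newly injected, not-yet-fetched URL with a 30-day fetch interval
    // (the status and interval here are illustrative assumptions).
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED,
        30 * 24 * 60 * 60);
    datum.setScore(1.0f);
    datum.setFetchTime(System.currentTimeMillis());

    // Per-URL metadata travels with the datum as a Hadoop MapWritable,
    // which several of the examples below (e.g. injectedScore) rely on.
    MapWritable meta = datum.getMetaData();
    meta.put(new Text("example.key"), new Text("example value"));

    // Human-readable status name, e.g. "db_unfetched".
    System.out.println(CrawlDatum.getStatusName(datum.getStatus()));
    System.out.println(datum);
  }
}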

Example 1: testRedirFetchInOneSegment

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
 * Check a fixed sequence!
 */
@Test
public void testRedirFetchInOneSegment() throws Exception {
  // Our test directory
  Path testDir = new Path(conf.get("hadoop.tmp.dir"), "merge-"
      + System.currentTimeMillis());

  Path segment = new Path(testDir, "00001");

  createSegment(segment, CrawlDatum.STATUS_FETCH_SUCCESS, true, true);

  // Merge the segments and get status
  Path mergedSegment = merge(testDir, new Path[] { segment });
  byte status = checkMergedSegment(testDir, mergedSegment);

  Assert.assertEquals(CrawlDatum.STATUS_FETCH_SUCCESS, status);
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 20, Source: TestSegmentMergerCrawlDatums.java


Example 2: generate

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void generate() throws Exception {
	
	init();
	createNutchUrls();
	createNutchIndexData();
	
	Path ffetch = new Path(options.getResultPath(), CrawlDatum.FETCH_DIR_NAME);
	Path fparse = new Path(options.getResultPath(), CrawlDatum.PARSE_DIR_NAME);
	Path linkdb = new Path(segment, LINKDB_DIR_NAME);
	
	FileSystem fs = ffetch.getFileSystem(new Configuration());
	fs.rename(ffetch, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
	fs.rename(fparse, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
	fs.rename(linkdb, new Path(options.getResultPath(), LINKDB_DIR_NAME));
	fs.close();
	
	close();
}
 
Developer: thrill, Project: fst-bench, Lines: 19, Source: NutchData.java


Example 3: testFilterOutlinks

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testFilterOutlinks() throws Exception {
  conf.set(LinksIndexingFilter.LINKS_OUTLINKS_HOST, "true");
  filter.setConf(conf);

  Outlink[] outlinks = generateOutlinks();

  NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text",
          new ParseData(new ParseStatus(), "title", outlinks, metadata)),
      new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());

  Assert.assertEquals(1, doc.getField("outlinks").getValues().size());

  Assert.assertEquals("Filter outlinks, allow only those from a different host",
      outlinks[0].getToUrl(), doc.getFieldValue("outlinks"));
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 17, Source: TestLinksIndexingFilter.java


Example 4: testIt

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIt() throws ProtocolException, ParseException {
  String urlString;
  Protocol protocol;
  Content content;
  Parse parse;

  for (int i = 0; i < sampleFiles.length; i++) {
    urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];

    Configuration conf = NutchConfiguration.create();
    protocol = new ProtocolFactory(conf).getProtocol(urlString);
    content = protocol.getProtocolOutput(new Text(urlString),
        new CrawlDatum()).getContent();
    parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content)
        .get(content.getUrl());

    int index = parse.getText().indexOf(expectedText);
    Assert.assertTrue(index > 0);
  }
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 22, Source: TestPdfParser.java


Example 5: injectedScore

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Override
public void injectedScore(Text url, CrawlDatum datum)
    throws ScoringFilterException {

  // check for the presence of the depth limit key
  if (datum.getMetaData().get(MAX_DEPTH_KEY_W) != null) {
    // convert from Text to Int
    String depthString = datum.getMetaData().get(MAX_DEPTH_KEY_W).toString();
    datum.getMetaData().remove(MAX_DEPTH_KEY_W);
    int depth = Integer.parseInt(depthString);
    datum.getMetaData().put(MAX_DEPTH_KEY_W, new IntWritable(depth));
  } else { // put the default
    datum.getMetaData()
        .put(MAX_DEPTH_KEY_W, new IntWritable(defaultMaxDepth));
  }
  // initial depth is 1
  datum.getMetaData().put(DEPTH_KEY_W, new IntWritable(1));
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 19, Source: DepthScoringFilter.java


Example 6: testFixedSequence

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
 * Check a fixed sequence!
 */
@Test
public void testFixedSequence() throws Exception {
  // Our test directory
  Path testDir = new Path(conf.get("hadoop.tmp.dir"), "merge-"
      + System.currentTimeMillis());

  Path segment1 = new Path(testDir, "00001");
  Path segment2 = new Path(testDir, "00002");
  Path segment3 = new Path(testDir, "00003");

  createSegment(segment1, CrawlDatum.STATUS_FETCH_GONE, false);
  createSegment(segment2, CrawlDatum.STATUS_FETCH_GONE, true);
  createSegment(segment3, CrawlDatum.STATUS_FETCH_SUCCESS, false);

  // Merge the segments and get status
  Path mergedSegment = merge(testDir, new Path[] { segment1, segment2,
      segment3 });
  byte status = checkMergedSegment(testDir, mergedSegment);

  Assert.assertEquals(CrawlDatum.STATUS_FETCH_SUCCESS, status);
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 25, Source: TestSegmentMergerCrawlDatums.java


Example 7: fetch

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Override
protected CrawlDatum fetch(CrawlDatum datum, long currentTime) {
  lastFetchTime = currFetchTime;
  currFetchTime = currentTime;
  previousDbState = datum.getStatus();
  lastSignature = datum.getSignature();
  datum = super.fetch(datum, currentTime);
  if (firstFetchTime == 0) {
    firstFetchTime = currFetchTime;
  } else if ((currFetchTime - firstFetchTime) > (duration / 2)) {
    // simulate a modification after "one year"
    changeContent();
    firstFetchTime = currFetchTime;
  }
  return datum;
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 17, Source: TestCrawlDbStates.java


Example 8: reduce

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void reduce(Text key, Iterator<NutchWritable> values,
    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
  StringBuffer dump = new StringBuffer();

  dump.append("\nRecno:: ").append(recNo++).append("\n");
  dump.append("URL:: " + key.toString() + "\n");
  while (values.hasNext()) {
    Writable value = values.next().get(); // unwrap
    if (value instanceof CrawlDatum) {
      dump.append("\nCrawlDatum::\n").append(((CrawlDatum) value).toString());
    } else if (value instanceof Content) {
      dump.append("\nContent::\n").append(((Content) value).toString());
    } else if (value instanceof ParseData) {
      dump.append("\nParseData::\n").append(((ParseData) value).toString());
    } else if (value instanceof ParseText) {
      dump.append("\nParseText::\n").append(((ParseText) value).toString());
    } else if (LOG.isWarnEnabled()) {
      LOG.warn("Unrecognized type: " + value.getClass());
    }
  }
  output.collect(key, new Text(dump.toString()));
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 23, Source: SegmentReader.java


Example 9: testIndexHostsOnlyAndFilterInlinks

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIndexHostsOnlyAndFilterInlinks() throws Exception {
  conf = NutchConfiguration.create();
  conf.set(LinksIndexingFilter.LINKS_ONLY_HOSTS, "true");
  conf.set(LinksIndexingFilter.LINKS_INLINKS_HOST, "true");

  filter.setConf(conf);

  Inlinks inlinks = new Inlinks();
  inlinks.add(new Inlink("http://www.test.com", "test"));
  inlinks.add(new Inlink("http://www.example.com", "example"));

  NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text",
          new ParseData(new ParseStatus(), "title", new Outlink[0], metadata)),
      new Text("http://www.example.com/"), new CrawlDatum(), inlinks);

  Assert.assertEquals(1, doc.getField("inlinks").getValues().size());

  Assert.assertEquals(
      "Index only the host portion of the inlinks after filtering",
      new URL("http://www.test.com").getHost(),
      doc.getFieldValue("inlinks"));

}
 
Developer: jorcox, Project: GeoCrawler, Lines: 25, Source: TestLinksIndexingFilter.java


Example 10: fetchPage

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
 * Fetches the specified <code>page</code> from the local Jetty server and
 * checks whether the HTTP response status code matches with the expected
 * code. Also use jsp pages for redirection.
 * 
 * @param page
 *          Page to be fetched.
 * @param expectedCode
 *          HTTP response status code expected while fetching the page.
 */
private void fetchPage(String page, int expectedCode) throws Exception {
  URL url = new URL("http", "127.0.0.1", port, page);
  CrawlDatum crawlDatum = new CrawlDatum();
  Response response = http.getResponse(url, crawlDatum, true);
  ProtocolOutput out = http.getProtocolOutput(new Text(url.toString()),
      crawlDatum);
  Content content = out.getContent();
  assertEquals("HTTP Status Code for " + url, expectedCode,
      response.getCode());

  if (page.compareTo("/nonexists.html") != 0
      && page.compareTo("/brokenpage.jsp") != 0
      && page.compareTo("/redirection") != 0) {
    assertEquals("ContentType " + url, "text/html",
        content.getContentType());
  }
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 28, Source: TestProtocolHttp.java


Example 11: reduce

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void reduce(Text key, Iterator<CrawlDatum> values,
    OutputCollector<Text, CrawlDatum> output, Reporter reporter)
    throws IOException {
  boolean duplicateSet = false;

  while (values.hasNext()) {
    CrawlDatum val = values.next();
    if (val.getStatus() == CrawlDatum.STATUS_DB_DUPLICATE) {
      duplicate.set(val);
      duplicateSet = true;
    } else {
      old.set(val);
    }
  }

  // keep the duplicate if there is one
  if (duplicateSet) {
    output.collect(key, duplicate);
    return;
  }

  // no duplicate? keep old one then
  output.collect(key, old);
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 25, Source: DeduplicationJob.java


Example 12: testBlockHTML

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testBlockHTML() throws Exception {
  conf.set(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, "block-html.txt");
  filter.setConf(conf);

  for (int i = 0; i < parses.length; i++) {
    NutchDocument doc = filter.filter(new NutchDocument(), parses[i],
        new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());

    if (MIME_TYPES[i].contains("html")) {
      Assert.assertNull("Block only HTML documents", doc);
    } else {
      Assert.assertNotNull("Allow everything else", doc);
    }
  }
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 17, Source: MimeTypeIndexingFilterTest.java


Example 13: testIt

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIt() throws ProtocolException, ParseException {
  String urlString;
  Protocol protocol;
  Content content;
  Parse parse;

  for (int i = 0; i < sampleFiles.length; i++) {
    urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];

    Configuration conf = NutchConfiguration.create();
    protocol = new ProtocolFactory(conf).getProtocol(urlString);
    content = protocol.getProtocolOutput(new Text(urlString),
        new CrawlDatum()).getContent();
    parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content)
        .get(content.getUrl());

    Assert.assertEquals("121", parse.getData().getMeta("width"));
    Assert.assertEquals("48", parse.getData().getMeta("height"));
  }
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 22, Source: TestImageMetadata.java


Example 14: map

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void map(Text urlText, CrawlDatum datum, Context context)
    throws IOException, InterruptedException {

  URL url = new URL(urlText.toString());
  String out = "";
  switch (mode) {
    case MODE_HOST:
      out = url.getHost();
      break;
    case MODE_DOMAIN:
      out = URLUtil.getDomainName(url);
      break;
  }

  if (datum.getStatus() == CrawlDatum.STATUS_DB_FETCHED
      || datum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
    context.write(new Text(out + " FETCHED"), new LongWritable(1));
  } else {
    context.write(new Text(out + " UNFETCHED"), new LongWritable(1));
  }
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 22, Source: CrawlCompletionStats.java


Example 15: testRandomizedSequences

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
 *
 */
@Test
public void testRandomizedSequences() throws Exception {
  for (int i = 0; i < rnd.nextInt(16) + 16; i++) {
    byte expectedStatus = (byte) (rnd.nextInt(6) + 0x21);
    while (expectedStatus == CrawlDatum.STATUS_FETCH_RETRY
        || expectedStatus == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
      // fetch_retry and fetch_notmodified never remain in a merged segment
      expectedStatus = (byte) (rnd.nextInt(6) + 0x21);
    }
    byte randomStatus = (byte) (rnd.nextInt(6) + 0x21);
    int rounds = rnd.nextInt(16) + 32;
    boolean withRedirects = rnd.nextBoolean();

    byte resultStatus = executeSequence(randomStatus, expectedStatus, rounds,
        withRedirects);
    Assert.assertEquals(
        "Expected status = " + CrawlDatum.getStatusName(expectedStatus)
            + ", but got " + CrawlDatum.getStatusName(resultStatus)
            + " when merging " + rounds + " segments"
            + (withRedirects ? " with redirects" : ""), expectedStatus,
        resultStatus);
  }
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 27, Source: TestSegmentMergerCrawlDatums.java


Example 16: testDeduplicateAnchor

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testDeduplicateAnchor() throws Exception {
  Configuration conf = NutchConfiguration.create();
  conf.setBoolean("anchorIndexingFilter.deduplicate", true);
  AnchorIndexingFilter filter = new AnchorIndexingFilter();
  filter.setConf(conf);
  Assert.assertNotNull(filter);
  NutchDocument doc = new NutchDocument();
  ParseImpl parse = new ParseImpl("foo bar", new ParseData());
  Inlinks inlinks = new Inlinks();
  inlinks.add(new Inlink("http://test1.com/", "text1"));
  inlinks.add(new Inlink("http://test2.com/", "text2"));
  inlinks.add(new Inlink("http://test3.com/", "text2"));
  try {
    filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"),
        new CrawlDatum(), inlinks);
  } catch (Exception e) {
    e.printStackTrace();
    Assert.fail(e.getMessage());
  }
  Assert.assertNotNull(doc);
  Assert.assertTrue("test if there is an anchor at all", doc.getFieldNames()
      .contains("anchor"));
  Assert.assertEquals("test dedup, we expect 2", 2, doc.getField("anchor")
      .getValues().size());
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 27, Source: TestAnchorIndexingFilter.java


Example 17: testIndexHostsOnlyAndFilterOutlinks

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIndexHostsOnlyAndFilterOutlinks() throws Exception {
  conf = NutchConfiguration.create();
  conf.set(LinksIndexingFilter.LINKS_ONLY_HOSTS, "true");
  conf.set(LinksIndexingFilter.LINKS_OUTLINKS_HOST, "true");

  Outlink[] outlinks = generateOutlinks(true);

  filter.setConf(conf);

  NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text",
          new ParseData(new ParseStatus(), "title", outlinks, metadata)),
      new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());

  Assert.assertEquals(1, doc.getField("outlinks").getValues().size());

  Assert.assertEquals(
      "Index only the host portion of the outlinks after filtering",
      new URL("http://www.test.com").getHost(),
      doc.getFieldValue("outlinks"));
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 22, Source: TestLinksIndexingFilter.java


Example 18: indexerScore

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
    CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
    throws ScoringFilterException {

  NutchField tlds = doc.getField("tld");
  float boost = 1.0f;

  if (tlds != null) {
    for (Object tld : tlds.getValues()) {
      DomainSuffix entry = tldEntries.get(tld.toString());
      if (entry != null)
        boost *= entry.getBoost();
    }
  }
  return initScore * boost;
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 17, Source: TLDScoringFilter.java


Example 19: checkMergedSegment

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
 * Checks the merged segment and removes the stuff again.
 * 
 * @param testDir
 *          the test directory
 * @param mergedSegment
 *          the merged segment
 * @return the final status
 */
protected byte checkMergedSegment(Path testDir, Path mergedSegment)
    throws Exception {
  // Get a MapFile reader for the <Text,CrawlDatum> pairs
  MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, new Path(
      mergedSegment, CrawlDatum.FETCH_DIR_NAME), conf);

  Text key = new Text();
  CrawlDatum value = new CrawlDatum();
  byte finalStatus = 0x0;

  for (MapFile.Reader reader : readers) {
    while (reader.next(key, value)) {
      LOG.info("Reading status for: " + key.toString() + " > "
          + CrawlDatum.getStatusName(value.getStatus()));

      // Only consider fetch status
      if (CrawlDatum.hasFetchStatus(value)
          && key.toString().equals("http://nutch.apache.org/")) {
        finalStatus = value.getStatus();
      }
    }

    // Close the reader again
    reader.close();
  }

  // Remove the test directory again
  fs.delete(testDir, true);

  LOG.info("Final fetch status for: http://nutch.apache.org/ > "
      + CrawlDatum.getStatusName(finalStatus));

  // Return the final status
  return finalStatus;
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 45, Source: TestSegmentMergerCrawlDatums.java


Example 20: parseMeta

import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public Metadata parseMeta(String fileName, Configuration conf) {
  Metadata metadata = null;
  try {
    String urlString = "file:" + sampleDir + fileSeparator + fileName;
    Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
    Content content = protocol.getProtocolOutput(new Text(urlString),
        new CrawlDatum()).getContent();
    Parse parse = new ParseUtil(conf).parse(content).get(content.getUrl());
    metadata = parse.getData().getParseMeta();
  } catch (Exception e) {
    e.printStackTrace();
    Assert.fail(e.toString());
  }
  return metadata;
}
 
Developer: jorcox, Project: GeoCrawler, Lines: 16, Source: TestParseReplace.java



Note: The org.apache.nutch.crawl.CrawlDatum examples in this article were collected from source-code and documentation hosting platforms such as GitHub and MSDocs. The snippets are taken from open-source projects contributed by their authors; copyright remains with the original authors, and any distribution or use of the code should follow the corresponding project's license. Please do not republish this article without permission.

