This article collects typical usage examples of the Java class org.apache.nutch.crawl.CrawlDatum. If you are unsure what the CrawlDatum class does or how to use it, the curated class examples below should help.
The CrawlDatum class belongs to the org.apache.nutch.crawl package. Twenty code examples are listed below, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps the system recommend better Java code examples.
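Before diving into the examples, here is a minimal, self-contained sketch (not taken from any of the projects below) of what a CrawlDatum holds: the crawl state of a single URL, i.e. its status, fetch time, score, and a MapWritable of per-URL metadata.

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDatumBasics {
  public static void main(String[] args) {
    // Status plus fetch interval in seconds; here: unfetched, refetch every 30 days.
    CrawlDatum datum = new CrawlDatum(CrawlDatum.STATUS_DB_UNFETCHED, 30 * 24 * 3600);
    datum.setScore(1.0f);
    datum.setFetchTime(System.currentTimeMillis());
    // Arbitrary per-URL metadata travels with the datum as Writable key/value pairs.
    datum.getMetaData().put(new Text("example.key"), new Text("example value"));
    System.out.println(CrawlDatum.getStatusName(datum.getStatus()));
  }
}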
Example 1: testRedirFetchInOneSegment
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
* Check a fixed sequence!
*/
@Test
public void testRedirFetchInOneSegment() throws Exception {
// Our test directory
Path testDir = new Path(conf.get("hadoop.tmp.dir"), "merge-"
+ System.currentTimeMillis());
Path segment = new Path(testDir, "00001");
createSegment(segment, CrawlDatum.STATUS_FETCH_SUCCESS, true, true);
// Merge the segments and get status
Path mergedSegment = merge(testDir, new Path[] { segment });
byte status = checkMergedSegment(testDir, mergedSegment);
Assert.assertEquals(CrawlDatum.STATUS_FETCH_SUCCESS, status);
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 20, Source: TestSegmentMergerCrawlDatums.java
Example 2: generate
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void generate() throws Exception {
init();
createNutchUrls();
createNutchIndexData();
Path ffetch = new Path(options.getResultPath(), CrawlDatum.FETCH_DIR_NAME);
Path fparse = new Path(options.getResultPath(), CrawlDatum.PARSE_DIR_NAME);
Path linkdb = new Path(segment, LINKDB_DIR_NAME);
FileSystem fs = ffetch.getFileSystem(new Configuration());
fs.rename(ffetch, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
fs.rename(fparse, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
fs.rename(linkdb, new Path(options.getResultPath(), LINKDB_DIR_NAME));
fs.close();
close();
}
Developer ID: thrill, Project: fst-bench, Lines of code: 19, Source: NutchData.java
Example 3: testFilterOutlinks
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testFilterOutlinks() throws Exception {
conf.set(LinksIndexingFilter.LINKS_OUTLINKS_HOST, "true");
filter.setConf(conf);
Outlink[] outlinks = generateOutlinks();
NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text",
new ParseData(new ParseStatus(), "title", outlinks, metadata)),
new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
Assert.assertEquals(1, doc.getField("outlinks").getValues().size());
Assert.assertEquals("Filter outlinks, allow only those from a different host",
outlinks[0].getToUrl(), doc.getFieldValue("outlinks"));
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 17, Source: TestLinksIndexingFilter.java
Example 4: testIt
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIt() throws ProtocolException, ParseException {
String urlString;
Protocol protocol;
Content content;
Parse parse;
for (int i = 0; i < sampleFiles.length; i++) {
urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
Configuration conf = NutchConfiguration.create();
protocol = new ProtocolFactory(conf).getProtocol(urlString);
content = protocol.getProtocolOutput(new Text(urlString),
new CrawlDatum()).getContent();
parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content)
.get(content.getUrl());
int index = parse.getText().indexOf(expectedText);
Assert.assertTrue(index > 0);
}
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 22, Source: TestPdfParser.java
Example 5: injectedScore
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Override
public void injectedScore(Text url, CrawlDatum datum)
throws ScoringFilterException {
// check for the presence of the depth limit key
if (datum.getMetaData().get(MAX_DEPTH_KEY_W) != null) {
// convert from Text to Int
String depthString = datum.getMetaData().get(MAX_DEPTH_KEY_W).toString();
datum.getMetaData().remove(MAX_DEPTH_KEY_W);
int depth = Integer.parseInt(depthString);
datum.getMetaData().put(MAX_DEPTH_KEY_W, new IntWritable(depth));
} else { // put the default
datum.getMetaData()
.put(MAX_DEPTH_KEY_W, new IntWritable(defaultMaxDepth));
}
// initial depth is 1
datum.getMetaData().put(DEPTH_KEY_W, new IntWritable(1));
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 19, Source: DepthScoringFilter.java
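As a follow-up, here is a hedged sketch (not the plugin's actual code) of how the depth values stored above could be read back from the datum's metadata later in the crawl. The two key constants are assumptions mirroring the filter's DEPTH_KEY_W and MAX_DEPTH_KEY_W, taken here to wrap the strings "_depth_" and "_maxdepth_".

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class DepthLimitSketch {
  // Assumed metadata keys, mirroring the constants used by the filter above.
  private static final Text DEPTH_KEY_W = new Text("_depth_");
  private static final Text MAX_DEPTH_KEY_W = new Text("_maxdepth_");

  /** True if this datum has reached its per-URL depth limit. */
  public static boolean depthLimitReached(CrawlDatum datum) {
    IntWritable depth = (IntWritable) datum.getMetaData().get(DEPTH_KEY_W);
    IntWritable max = (IntWritable) datum.getMetaData().get(MAX_DEPTH_KEY_W);
    return depth != null && max != null && depth.get() >= max.get();
  }
}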
Example 6: testFixedSequence
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
* Check a fixed sequence!
*/
@Test
public void testFixedSequence() throws Exception {
// Our test directory
Path testDir = new Path(conf.get("hadoop.tmp.dir"), "merge-"
+ System.currentTimeMillis());
Path segment1 = new Path(testDir, "00001");
Path segment2 = new Path(testDir, "00002");
Path segment3 = new Path(testDir, "00003");
createSegment(segment1, CrawlDatum.STATUS_FETCH_GONE, false);
createSegment(segment2, CrawlDatum.STATUS_FETCH_GONE, true);
createSegment(segment3, CrawlDatum.STATUS_FETCH_SUCCESS, false);
// Merge the segments and get status
Path mergedSegment = merge(testDir, new Path[] { segment1, segment2,
segment3 });
byte status = checkMergedSegment(testDir, mergedSegment);
Assert.assertEquals(CrawlDatum.STATUS_FETCH_SUCCESS, status);
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 25, Source: TestSegmentMergerCrawlDatums.java
Example 7: fetch
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Override
protected CrawlDatum fetch(CrawlDatum datum, long currentTime) {
lastFetchTime = currFetchTime;
currFetchTime = currentTime;
previousDbState = datum.getStatus();
lastSignature = datum.getSignature();
datum = super.fetch(datum, currentTime);
if (firstFetchTime == 0) {
firstFetchTime = currFetchTime;
} else if ((currFetchTime - firstFetchTime) > (duration / 2)) {
// simulate a modification after "one year"
changeContent();
firstFetchTime = currFetchTime;
}
return datum;
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 17, Source: TestCrawlDbStates.java
Example 8: reduce
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void reduce(Text key, Iterator<NutchWritable> values,
OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
StringBuffer dump = new StringBuffer();
dump.append("\nRecno:: ").append(recNo++).append("\n");
dump.append("URL:: " + key.toString() + "\n");
while (values.hasNext()) {
Writable value = values.next().get(); // unwrap
if (value instanceof CrawlDatum) {
dump.append("\nCrawlDatum::\n").append(((CrawlDatum) value).toString());
} else if (value instanceof Content) {
dump.append("\nContent::\n").append(((Content) value).toString());
} else if (value instanceof ParseData) {
dump.append("\nParseData::\n").append(((ParseData) value).toString());
} else if (value instanceof ParseText) {
dump.append("\nParseText::\n").append(((ParseText) value).toString());
} else if (LOG.isWarnEnabled()) {
LOG.warn("Unrecognized type: " + value.getClass());
}
}
output.collect(key, new Text(dump.toString()));
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 23, Source: SegmentReader.java
Example 9: testIndexHostsOnlyAndFilterInlinks
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIndexHostsOnlyAndFilterInlinks() throws Exception {
conf = NutchConfiguration.create();
conf.set(LinksIndexingFilter.LINKS_ONLY_HOSTS, "true");
conf.set(LinksIndexingFilter.LINKS_INLINKS_HOST, "true");
filter.setConf(conf);
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://www.test.com", "test"));
inlinks.add(new Inlink("http://www.example.com", "example"));
NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text",
new ParseData(new ParseStatus(), "title", new Outlink[0], metadata)),
new Text("http://www.example.com/"), new CrawlDatum(), inlinks);
Assert.assertEquals(1, doc.getField("inlinks").getValues().size());
Assert.assertEquals(
"Index only the host portion of the inlinks after filtering",
new URL("http://www.test.com").getHost(),
doc.getFieldValue("inlinks"));
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 25, Source: TestLinksIndexingFilter.java
Example 10: fetchPage
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
* Fetches the specified <code>page</code> from the local Jetty server and
* checks whether the HTTP response status code matches the expected code.
* JSP pages are also used to exercise redirection.
*
* @param page
* Page to be fetched.
* @param expectedCode
* HTTP response status code expected while fetching the page.
*/
private void fetchPage(String page, int expectedCode) throws Exception {
URL url = new URL("http", "127.0.0.1", port, page);
CrawlDatum crawlDatum = new CrawlDatum();
Response response = http.getResponse(url, crawlDatum, true);
ProtocolOutput out = http.getProtocolOutput(new Text(url.toString()),
crawlDatum);
Content content = out.getContent();
assertEquals("HTTP Status Code for " + url, expectedCode,
response.getCode());
if (page.compareTo("/nonexists.html") != 0
&& page.compareTo("/brokenpage.jsp") != 0
&& page.compareTo("/redirection") != 0) {
assertEquals("ContentType " + url, "text/html",
content.getContentType());
}
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 28, Source: TestProtocolHttp.java
Example 11: reduce
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void reduce(Text key, Iterator<CrawlDatum> values,
OutputCollector<Text, CrawlDatum> output, Reporter reporter)
throws IOException {
boolean duplicateSet = false;
while (values.hasNext()) {
CrawlDatum val = values.next();
if (val.getStatus() == CrawlDatum.STATUS_DB_DUPLICATE) {
duplicate.set(val);
duplicateSet = true;
} else {
old.set(val);
}
}
// keep the duplicate if there is one
if (duplicateSet) {
output.collect(key, duplicate);
return;
}
// no duplicate? keep old one then
output.collect(key, old);
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 25, Source: DeduplicationJob.java
Example 12: testBlockHTML
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testBlockHTML() throws Exception {
conf.set(MimeTypeIndexingFilter.MIMEFILTER_REGEX_FILE, "block-html.txt");
filter.setConf(conf);
for (int i = 0; i < parses.length; i++) {
NutchDocument doc = filter.filter(new NutchDocument(), parses[i],
new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
if (MIME_TYPES[i].contains("html")) {
Assert.assertNull("Block only HTML documents", doc);
} else {
Assert.assertNotNull("Allow everything else", doc);
}
}
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 17, Source: MimeTypeIndexingFilterTest.java
Example 13: testIt
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIt() throws ProtocolException, ParseException {
String urlString;
Protocol protocol;
Content content;
Parse parse;
for (int i = 0; i < sampleFiles.length; i++) {
urlString = "file:" + sampleDir + fileSeparator + sampleFiles[i];
Configuration conf = NutchConfiguration.create();
protocol = new ProtocolFactory(conf).getProtocol(urlString);
content = protocol.getProtocolOutput(new Text(urlString),
new CrawlDatum()).getContent();
parse = new ParseUtil(conf).parseByExtensionId("parse-tika", content)
.get(content.getUrl());
Assert.assertEquals("121", parse.getData().getMeta("width"));
Assert.assertEquals("48", parse.getData().getMeta("height"));
}
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 22, Source: TestImageMetadata.java
Example 14: map
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public void map(Text urlText, CrawlDatum datum, Context context)
throws IOException, InterruptedException {
URL url = new URL(urlText.toString());
String out = "";
switch (mode) {
case MODE_HOST:
out = url.getHost();
break;
case MODE_DOMAIN:
out = URLUtil.getDomainName(url);
break;
}
if (datum.getStatus() == CrawlDatum.STATUS_DB_FETCHED
|| datum.getStatus() == CrawlDatum.STATUS_DB_NOTMODIFIED) {
context.write(new Text(out + " FETCHED"), new LongWritable(1));
} else {
context.write(new Text(out + " UNFETCHED"), new LongWritable(1));
}
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 22, Source: CrawlCompletionStats.java
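The mapper above emits ("host FETCHED", 1) or ("host UNFETCHED", 1) pairs. As an illustration only (this is not CrawlCompletionStats' actual reducer; the class name and details are assumptions), a typical companion reducer would simply sum those counts:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CompletionStatsReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  public void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long sum = 0;
    // Add up the per-host/per-domain counts emitted by the mapper.
    for (LongWritable value : values) {
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }
}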
Example 15: testRandomizedSequences
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
* Check randomized sequences of segments with varying fetch statuses.
*/
@Test
public void testRandomizedSequences() throws Exception {
for (int i = 0; i < rnd.nextInt(16) + 16; i++) {
// random fetch status: bytes 0x21 (fetch_success) through 0x26 (fetch_notmodified)
byte expectedStatus = (byte) (rnd.nextInt(6) + 0x21);
while (expectedStatus == CrawlDatum.STATUS_FETCH_RETRY
|| expectedStatus == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
// fetch_retry and fetch_notmodified never remain in a merged segment
expectedStatus = (byte) (rnd.nextInt(6) + 0x21);
}
byte randomStatus = (byte) (rnd.nextInt(6) + 0x21);
int rounds = rnd.nextInt(16) + 32;
boolean withRedirects = rnd.nextBoolean();
byte resultStatus = executeSequence(randomStatus, expectedStatus, rounds,
withRedirects);
Assert.assertEquals(
"Expected status = " + CrawlDatum.getStatusName(expectedStatus)
+ ", but got " + CrawlDatum.getStatusName(resultStatus)
+ " when merging " + rounds + " segments"
+ (withRedirects ? " with redirects" : ""), expectedStatus,
resultStatus);
}
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 27, Source: TestSegmentMergerCrawlDatums.java
Example 16: testDeduplicateAnchor
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testDeduplicateAnchor() throws Exception {
Configuration conf = NutchConfiguration.create();
conf.setBoolean("anchorIndexingFilter.deduplicate", true);
AnchorIndexingFilter filter = new AnchorIndexingFilter();
filter.setConf(conf);
Assert.assertNotNull(filter);
NutchDocument doc = new NutchDocument();
ParseImpl parse = new ParseImpl("foo bar", new ParseData());
Inlinks inlinks = new Inlinks();
inlinks.add(new Inlink("http://test1.com/", "text1"));
inlinks.add(new Inlink("http://test2.com/", "text2"));
inlinks.add(new Inlink("http://test3.com/", "text2"));
try {
filter.filter(doc, parse, new Text("http://nutch.apache.org/index.html"),
new CrawlDatum(), inlinks);
} catch (Exception e) {
e.printStackTrace();
Assert.fail(e.getMessage());
}
Assert.assertNotNull(doc);
Assert.assertTrue("test if there is an anchor at all", doc.getFieldNames()
.contains("anchor"));
Assert.assertEquals("test dedup, we expect 2", 2, doc.getField("anchor")
.getValues().size());
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 27, Source: TestAnchorIndexingFilter.java
Example 17: testIndexHostsOnlyAndFilterOutlinks
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
@Test
public void testIndexHostsOnlyAndFilterOutlinks() throws Exception {
conf = NutchConfiguration.create();
conf.set(LinksIndexingFilter.LINKS_ONLY_HOSTS, "true");
conf.set(LinksIndexingFilter.LINKS_OUTLINKS_HOST, "true");
Outlink[] outlinks = generateOutlinks(true);
filter.setConf(conf);
NutchDocument doc = filter.filter(new NutchDocument(), new ParseImpl("text",
new ParseData(new ParseStatus(), "title", outlinks, metadata)),
new Text("http://www.example.com/"), new CrawlDatum(), new Inlinks());
Assert.assertEquals(1, doc.getField("outlinks").getValues().size());
Assert.assertEquals(
"Index only the host portion of the outlinks after filtering",
new URL("http://www.test.com").getHost(),
doc.getFieldValue("outlinks"));
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 22, Source: TestLinksIndexingFilter.java
Example 18: indexerScore
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum,
CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
throws ScoringFilterException {
NutchField tlds = doc.getField("tld");
float boost = 1.0f;
if (tlds != null) {
for (Object tld : tlds.getValues()) {
DomainSuffix entry = tldEntries.get(tld.toString());
if (entry != null)
boost *= entry.getBoost();
}
}
return initScore * boost;
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 17, Source: TLDScoringFilter.java
Example 19: checkMergedSegment
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
/**
* Checks the merged segment and removes the stuff again.
*
* @param testDir
*          the test directory
* @param mergedSegment
*          the merged segment
* @return the final status
*/
protected byte checkMergedSegment(Path testDir, Path mergedSegment)
throws Exception {
// Get a MapFile reader for the <Text,CrawlDatum> pairs
MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, new Path(
mergedSegment, CrawlDatum.FETCH_DIR_NAME), conf);
Text key = new Text();
CrawlDatum value = new CrawlDatum();
byte finalStatus = 0x0;
for (MapFile.Reader reader : readers) {
while (reader.next(key, value)) {
LOG.info("Reading status for: " + key.toString() + " > "
+ CrawlDatum.getStatusName(value.getStatus()));
// Only consider fetch status
if (CrawlDatum.hasFetchStatus(value)
&& key.toString().equals("http://nutch.apache.org/")) {
finalStatus = value.getStatus();
}
}
// Close the reader again
reader.close();
}
// Remove the test directory again
fs.delete(testDir, true);
LOG.info("Final fetch status for: http://nutch.apache.org/ > "
+ CrawlDatum.getStatusName(finalStatus));
// Return the final status
return finalStatus;
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 45, Source: TestSegmentMergerCrawlDatums.java
Example 20: parseMeta
import org.apache.nutch.crawl.CrawlDatum; // import the required package/class
public Metadata parseMeta(String fileName, Configuration conf) {
Metadata metadata = null;
try {
String urlString = "file:" + sampleDir + fileSeparator + fileName;
Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
Content content = protocol.getProtocolOutput(new Text(urlString),
new CrawlDatum()).getContent();
Parse parse = new ParseUtil(conf).parse(content).get(content.getUrl());
metadata = parse.getData().getParseMeta();
} catch (Exception e) {
e.printStackTrace();
Assert.fail(e.toString());
}
return metadata;
}
Developer ID: jorcox, Project: GeoCrawler, Lines of code: 16, Source: TestParseReplace.java
Note: The org.apache.nutch.crawl.CrawlDatum class examples in this article were collected from GitHub, MSDocs, and other source-code and documentation platforms; the snippets were selected from open-source projects contributed by the community. Copyright remains with the original authors; consult each project's License before redistributing or using the code. Do not reproduce without permission.