Java WARCReaderFactory Class Code Examples


This article collects typical usage examples of the org.archive.io.warc.WARCReaderFactory class in Java. If you have been wondering what WARCReaderFactory does, how to use it, or where to find working examples, the selected code samples below may help.



The WARCReaderFactory class belongs to the org.archive.io.warc package. Twenty code examples of the class are shown below, sorted by popularity by default. You can upvote the examples you like or find useful; your feedback helps the system recommend better Java code examples.
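All of the examples below follow the same basic pattern: obtain an ArchiveReader from WARCReaderFactory, iterate over its ArchiveRecords, and close the reader. As a quick orientation, here is a minimal sketch of that pattern; it is not taken from any of the projects below, and the file name sample.warc.gz is a placeholder.

import java.io.File;
import java.io.IOException;
import org.archive.io.ArchiveReader;
import org.archive.io.ArchiveRecord;
import org.archive.io.warc.WARCReaderFactory;

public class WarcQuickStart {
    public static void main(String[] args) throws IOException {
        // Placeholder path: point this at a real WARC file (gzipped or plain).
        File warc = new File("sample.warc.gz");
        // The factory picks an appropriate reader for the file.
        ArchiveReader reader = WARCReaderFactory.get(warc);
        try {
            for (ArchiveRecord record : reader) {
                // Each record's header carries the record type, target URL, date, and length.
                System.out.println(record.getHeader().getUrl());
            }
        } finally {
            reader.close();
        }
    }
}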

Example 1: processWarc

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
private void processWarc(Path warcFile) throws IOException {
    extractorStats.addWarc(warcFile.getFileName().toString());
    InputStream is = Files.newInputStream(warcFile);
    ArchiveReader reader = WARCReaderFactory.get(warcFile.toString(), is, true);

    int i = 0;
    reader.setStrict(false);
    for (ArchiveRecord record : reader) {
        record.setStrict(false);
        extractorStats.visitedRecord();
        handleRecord(record);
        if (i++ % 1000 == 0) {
            System.err.println(extractorStats);
        }
    }
}
 
Developer: tballison | Project: SimpleCommonCrawlExtractor | Lines of code: 17 | Source file: AbstractExtractor.java


Example 2: generate

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static void generate(Path path, int numPages) throws Exception {

    Gson gson = new Gson();
    long count = 0;
    try (BufferedWriter writer = Files.newBufferedWriter(path)) {
      ArchiveReader ar = WARCReaderFactory.get(new URL(sourceURL), 0);
      for (ArchiveRecord r : ar) {
        Page p = ArchiveUtil.buildPage(r);
        if (p.isEmpty() || p.getOutboundLinks().isEmpty()) {
          log.debug("Skipping {}", p.getUrl());
          continue;
        }
        log.debug("Found {} {}", p.getUrl(), p.getNumOutbound());
        String json = gson.toJson(p);
        writer.write(json);
        writer.newLine();
        count++;
        if (count == numPages) {
          break;
        } else if ((count % 1000) == 0) {
          log.info("Wrote {} of {} pages to {}", count, numPages, path);
        }
      }
    }
    log.info("Wrote {} pages to {}", numPages, path);
  }
 
Developer: astralway | Project: webindex | Lines of code: 27 | Source file: SampleData.java


Example 3: readBz2

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
/**
 * Reads bz2 warc file
 *
 * @param file warc file
 * @throws IOException
 */
public static void readBz2(String file)
        throws IOException
{
    // decompress bz2 file to tmp file
    File tmpFile = File.createTempFile("tmp", ".warc");
    BZip2CompressorInputStream inputStream = new BZip2CompressorInputStream(
            new FileInputStream(file));

    IOUtils.copy(inputStream, new FileOutputStream(tmpFile));

    WARCReader reader = WARCReaderFactory.get(tmpFile);

    int counter = 0;
    for (ArchiveRecord record : reader) {
        System.out.println(record.getHeader().getHeaderFields());

        counter++;
    }

    FileUtils.forceDelete(tmpFile);

    System.out.println(counter);
}
 
Developer: habernal | Project: nutch-content-exporter | Lines of code: 30 | Source file: WARCReaderTest.java
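A design note on Example 3: the decompression stream, the output stream, the reader, and (on error paths) the temporary file are never cleaned up. Below is a sketch of the same logic with explicit cleanup; it assumes the same classes as the original (commons-compress BZip2CompressorInputStream, commons-io IOUtils and FileUtils, and the WARC reader classes) plus java.io.InputStream and java.io.OutputStream, and the method name readBz2Safely is illustrative, not part of the original project.

public static void readBz2Safely(String file) throws IOException {
    // Decompress the bz2 archive into a temporary plain-WARC file, as in the original.
    File tmpFile = File.createTempFile("tmp", ".warc");
    try {
        try (InputStream in = new BZip2CompressorInputStream(new FileInputStream(file));
                OutputStream out = new FileOutputStream(tmpFile)) {
            IOUtils.copy(in, out);
        }

        WARCReader reader = WARCReaderFactory.get(tmpFile);
        int counter = 0;
        try {
            for (ArchiveRecord record : reader) {
                System.out.println(record.getHeader().getHeaderFields());
                counter++;
            }
        } finally {
            reader.close();
        }
        System.out.println(counter);
    } finally {
        // Remove the temporary file even if decompression or reading fails.
        FileUtils.forceDelete(tmpFile);
    }
}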


Example 4: testARCReaderClose

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public void testARCReaderClose() {
    try {
        final File testfile = new File(ARCHIVE_DIR + testFileName);
        FileUtils.copyFile(new File(ARCHIVE_DIR + "fyensdk.warc"),
                testfile);
        
        WARCReader reader = WARCReaderFactory.get(testfile);
        WARCRecord record = (WARCRecord) reader.get(0);
        BitarchiveRecord rec =
                new BitarchiveRecord(record, testFileName);
        record.close();
        reader.close();
        testfile.delete();
    } catch (IOException e) {
        fail("Should not throw IOException " + e);
    }

}
 
Developer: netarchivesuite | Project: netarchivesuite-svngit-migration | Lines of code: 19 | Source file: WARCReaderTester.java


Example 5: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
/**
	 * @param args
	 * @throws IOException 
	 */
	public static void main(String[] args) throws IOException {
		// Set up a local compressed WARC file for reading 
		String url = "https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2014-23/segments/1404776400583.60/warc/CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz";
//		String fn = "data/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
		String fn = url;
		FileInputStream is = new FileInputStream(fn);
		// The file name identifies the ArchiveReader and indicates if it should be decompressed
		ArchiveReader ar = WARCReaderFactory.get(fn, is, true);
		
		// Once we have an ArchiveReader, we can work through each of the records it contains
		int i = 0;
		for(ArchiveRecord r : ar) {
			// The header file contains information such as the type of record, size, creation time, and URL
			System.out.println(r.getHeader());
			System.out.println(r.getHeader().getUrl());
			System.out.println();
			
			// If we want to read the contents of the record, we can use the ArchiveRecord as an InputStream
			// Create a byte array that is as long as the record's stated length
			byte[] rawData = IOUtils.toByteArray(r, r.available());
			
			// Why don't we convert it to a string and print the start of it? Let's hope it's text!
			String content = new String(rawData);
			System.out.println(content.substring(0, Math.min(500, content.length())));
			System.out.println((content.length() > 500 ? "..." : ""));
			
			// Pretty printing to make the output more readable 
			System.out.println("=-=-=-=-=-=-=-=-=");
			if (i++ > 4) break; 
		}
	}
 
Developer: TeamHG-Memex | Project: common-crawl-mapreduce | Lines of code: 36 | Source file: WARCReaderTest.java
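One caveat about Example 5: it assigns the remote https URL to fn and then opens it with FileInputStream, which only accepts local file paths, so as written the method fails unless fn is switched back to a local copy (the commented-out data/ path). To read a remote WARC directly, the WARCReaderFactory.get(URL, offset) overload used in Examples 2 and 7 applies; a minimal sketch using the same Common Crawl sample, with record handling abbreviated:

// Open a remote WARC directly by URL; offset 0 starts at the first record.
URL remote = new URL("https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/"
        + "CC-MAIN-2014-23/segments/1404776400583.60/warc/"
        + "CC-MAIN-20140707234000-00000-ip-10-180-212-248.ec2.internal.warc.gz");
ArchiveReader ar = WARCReaderFactory.get(remote, 0);
for (ArchiveRecord r : ar) {
    System.out.println(r.getHeader().getUrl());
}
ar.close();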


Example 6: initialize

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext context)
		throws IOException, InterruptedException {
	FileSplit split = (FileSplit) inputSplit;
	Configuration conf = context.getConfiguration();
	Path path = split.getPath();
	FileSystem fs = path.getFileSystem(conf);
	fsin = fs.open(path);
	arPath = path.getName();
	ar = WARCReaderFactory.get(path.getName(), fsin, true);
}
 
Developer: TeamHG-Memex | Project: common-crawl-mapreduce | Lines of code: 12 | Source file: WARCFileRecordReader.java


Example 7: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      log.error("Usage: TestParser <pathsFile> <range>");
      System.exit(1);
    }
    final List<String> loadList = IndexEnv.getPathsRange(args[0], args[1]);
    if (loadList.isEmpty()) {
      log.error("No files to load given {} {}", args[0], args[1]);
      System.exit(1);
    }

    WebIndexConfig.load();

    SparkConf sparkConf = new SparkConf().setAppName("webindex-test-parser");
    try (JavaSparkContext ctx = new JavaSparkContext(sparkConf)) {

      log.info("Parsing {} files (Range {} of paths file {}) from AWS", loadList.size(), args[1],
          args[0]);

      JavaRDD<String> loadRDD = ctx.parallelize(loadList, loadList.size());

      final String prefix = WebIndexConfig.CC_URL_PREFIX;

      loadRDD.foreachPartition(iter -> iter.forEachRemaining(path -> {
        String urlToCopy = prefix + path;
        log.info("Parsing {}", urlToCopy);
        try {
          ArchiveReader reader = WARCReaderFactory.get(new URL(urlToCopy), 0);
          for (ArchiveRecord record : reader) {
            ArchiveUtil.buildPageIgnoreErrors(record);
          }
        } catch (Exception e) {
          log.error("Exception while processing {}", path, e);
        }
      }));
    }
  }
 
Developer: astralway | Project: webindex | Lines of code: 39 | Source file: TestParser.java


Example 8: initialize

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
@Override
public void initialize(InputSplit inputSplit, TaskAttemptContext context) throws IOException,
    InterruptedException {
  FileSplit split = (FileSplit) inputSplit;
  Configuration conf = context.getConfiguration();
  Path path = split.getPath();
  FileSystem fs = path.getFileSystem(conf);
  fsin = fs.open(path);
  arPath = path.getName();
  ar = WARCReaderFactory.get(path.getName(), fsin, true);
}
 
Developer: astralway | Project: webindex | Lines of code: 12 | Source file: WARCFileRecordReader.java


Example 9: readPages

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static Map<URL, Page> readPages(File input) throws Exception {
  Map<URL, Page> pageMap = new HashMap<>();
  ArchiveReader ar = WARCReaderFactory.get(input);
  for (ArchiveRecord r : ar) {
    Page p = ArchiveUtil.buildPage(r);
    if (p.isEmpty() || p.getOutboundLinks().isEmpty()) {
      continue;
    }
    pageMap.put(URL.fromUri(p.getUri()), p);
  }
  ar.close();
  return pageMap;
}
 
Developer: astralway | Project: webindex | Lines of code: 14 | Source file: IndexIT.java


Example 10: read

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
/**
 * Reads default (gzipped) warc file
 *
 * @param file gz file
 * @throws IOException
 */
public static void read(String file)
        throws IOException
{
    WARCReader reader = WARCReaderFactory.get(new File(file));

    int counter = 0;
    for (ArchiveRecord record : reader) {
        System.out.println(record.getHeader().getHeaderFields());

        counter++;
    }

    System.out.println(counter);
}
 
Developer: habernal | Project: nutch-content-exporter | Lines of code: 21 | Source file: WARCReaderTest.java


Example 11: openFile

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
private WARCReader openFile(Path filePath) throws IOException {
    return WARCReaderFactory.get(filePath.toFile());
}
 
Developer: ViDA-NYU | Project: ache | Lines of code: 4 | Source file: WarcTargetRepository.java


Example 12: open

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static ArchiveReader open(Path path) throws IOException {
    /*
     * ArchiveReaderFactor.get doesn't understand the .open extension.
     */
    if (path.toString().endsWith(".warc.gz.open")) {
        return WARCReaderFactory.get(path.toFile());
    } else {
        return ArchiveReaderFactory.get(path.toFile());
    }
}
 
Developer: nla | Project: bamboo | Lines of code: 11 | Source file: WarcUtils.java


Example 13: testWarcCopy

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public void testWarcCopy() {
    try {
        byte[] warcBytes = (
                "WARC/1.0\r\n"
                + "WARC-Type: metadata\r\n"
                + "WARC-Target-URI: metadata://netarkivet.dk/crawl/setup/duplicatereductionjobs?majorversion=1&minorversion=0&harvestid=1&harvestnum=59&jobid=86\r\n"
                + "WARC-Date: 2012-08-24T11:42:55Z\r\n"
                + "WARC-Record-ID: <urn:uuid:c93099e5-2304-487e-9ff2-41e3c01c2b51>\r\n"
                + "WARC-Payload-Digest: sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U\r\n"
                + "WARC-IP-Address: 207.241.229.39\r\n"
                + "WARC-Concurrent-To: <urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb30>\r\n"
                + "WARC-Concurrent-To: <urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb31>\r\n"
                + "Content-Type: text/plain\r\n"
                + "Content-Length: 2\r\n"
                + "\r\n"
                + "85"
                + "\r\n"
                + "\r\n").getBytes();
        File orgFile = new File(TestInfo.WORKING_DIR, "original4copy.warc");
        FileUtils.writeBinaryFile(orgFile, warcBytes);

        File copiedFile = new File(TestInfo.WORKING_DIR, "copied.warc");
        WARCWriter writer = WARCUtils.createWARCWriter(copiedFile);
        WARCUtils.insertWARCFile(orgFile, writer);
        writer.close();

        byte[] bytes = FileUtils.readBinaryFile(copiedFile);
        //System.out.println( new String(bytes));

        WARCReader reader = WARCReaderFactory.get(copiedFile);
        Assert.assertNotNull(reader);
        ArchiveRecord record = reader.get();
        Assert.assertNotNull(record);
        ArchiveRecordHeader header = record.getHeader();
        Assert.assertNotNull(header);

        Assert.assertEquals("metadata", header.getHeaderValue("WARC-Type"));
        Assert.assertEquals("metadata://netarkivet.dk/crawl/setup/duplicatereductionjobs?majorversion=1&minorversion=0&harvestid=1&harvestnum=59&jobid=86", header.getHeaderValue("WARC-Target-URI"));
        Assert.assertEquals("2012-08-24T11:42:55Z", header.getHeaderValue("WARC-Date"));
        Assert.assertEquals("<urn:uuid:c93099e5-2304-487e-9ff2-41e3c01c2b51>", header.getHeaderValue("WARC-Record-ID"));
        Assert.assertEquals("sha1:SUCGMUVXDKVB5CS2NL4R4JABNX7K466U", header.getHeaderValue("WARC-Payload-Digest"));
        Assert.assertEquals("207.241.229.39", header.getHeaderValue("WARC-IP-Address"));
        Assert.assertEquals("<urn:uuid:e7c9eff8-f5bc-4aeb-b3d2-9d3df99afb31>", header.getHeaderValue("WARC-Concurrent-To"));
        Assert.assertEquals("text/plain", header.getHeaderValue("Content-Type"));
        Assert.assertEquals("2", header.getHeaderValue("Content-Length"));
    }
    catch (IOException e) {
        e.printStackTrace();
        Assert.fail("Unexpected exception!");
    }

}
 
Developer: netarchivesuite | Project: netarchivesuite-svngit-migration | Lines of code: 53 | Source file: WARCUtilsTester.java


Example 14: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static void main(String[] args) throws IOException, S3ServiceException {
	// We're accessing a publicly available bucket so don't need to fill in our credentials
	S3Service s3s = new RestS3Service(null);
	
	// Let's grab a file out of the CommonCrawl S3 bucket
	String fn = "common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
	S3Object f = s3s.getObject("aws-publicdatasets", fn, null, null, null, null, null, null);
	
	// The file name identifies the ArchiveReader and indicates if it should be decompressed
	ArchiveReader ar = WARCReaderFactory.get(fn, f.getDataInputStream(), true);
	
	// Once we have an ArchiveReader, we can work through each of the records it contains
	int i = 0;
	for(ArchiveRecord r : ar) {
		// The header file contains information such as the type of record, size, creation time, and URL
		System.out.println("Header: " + r.getHeader());
		System.out.println("URL: " + r.getHeader().getUrl());
		System.out.println();
		
		// If we want to read the contents of the record, we can use the ArchiveRecord as an InputStream
		// Create a byte array that is as long as all the record's stated length
		byte[] rawData = new byte[r.available()];
		r.read(rawData);
		// Note: potential optimization would be to have a large buffer only allocated once
		
		// Why don't we convert it to a string and print the start of it? Let's hope it's text!
		String content = new String(rawData);
		System.out.println(content.substring(0, Math.min(500, content.length())));
		System.out.println((content.length() > 500 ? "..." : ""));
		
		// Pretty printing to make the output more readable 
		System.out.println("=-=-=-=-=-=-=-=-=");
		if (i++ > 4) break; 
	}
}
 
Developer: Smerity | Project: cc-warc-examples | Lines of code: 36 | Source file: S3ReaderTest.java


Example 15: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
/**
 * @param args
 * @throws IOException 
 */
public static void main(String[] args) throws IOException {
	// Set up a local compressed WARC file for reading 
	String fn = "data/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
	FileInputStream is = new FileInputStream(fn);
	// The file name identifies the ArchiveReader and indicates if it should be decompressed
	ArchiveReader ar = WARCReaderFactory.get(fn, is, true);
	
	// Once we have an ArchiveReader, we can work through each of the records it contains
	int i = 0;
	for(ArchiveRecord r : ar) {
		// The header file contains information such as the type of record, size, creation time, and URL
		System.out.println(r.getHeader());
		System.out.println(r.getHeader().getUrl());
		System.out.println();
		
		// If we want to read the contents of the record, we can use the ArchiveRecord as an InputStream
		// Create a byte array that is as long as the record's stated length
		byte[] rawData = IOUtils.toByteArray(r, r.available());
		
		// Why don't we convert it to a string and print the start of it? Let's hope it's text!
		String content = new String(rawData);
		System.out.println(content.substring(0, Math.min(500, content.length())));
		System.out.println((content.length() > 500 ? "..." : ""));
		
		// Pretty printing to make the output more readable 
		System.out.println("=-=-=-=-=-=-=-=-=");
		if (i++ > 4) break; 
	}
}
 
Developer: Smerity | Project: cc-warc-examples | Lines of code: 34 | Source file: WARCReaderTest.java


Example 16: getArchiveReader

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
protected ArchiveReader getArchiveReader(final File f,
	final long offset)
throws IOException {
	if (ARCReaderFactory.isARCSuffix(f.getName())) {
		return ARCReaderFactory.get(f, true, offset);
	} else if (WARCReaderFactory.isWARCSuffix(f.getName())) {
		return WARCReaderFactory.get(f, offset);
	}
	throw new IOException("Unknown file extension (Not ARC nor WARC): "
		+ f.getName());
}
 
Developer: iipc | Project: webarchive-commons | Lines of code: 12 | Source file: ArchiveReaderFactory.java


Example 17: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static void main(String[] args) throws IOException, S3ServiceException {
		// We're accessing a publicly available bucket so don't need to fill in our credentials
		S3Service s3s = new RestS3Service(null);
		
		// Let's grab a file out of the CommonCrawl S3 bucket
		String fn = "common-crawl/crawl-data/CC-MAIN-2013-48/segments/1386163035819/warc/CC-MAIN-20131204131715-00000-ip-10-33-133-15.ec2.internal.warc.gz";
		S3Object f = s3s.getObject("aws-publicdatasets", fn, null, null, null, null, null, null);
		
		// The file name identifies the ArchiveReader and indicates if it should be decompressed
		ArchiveReader ar;
		try {
			ar = WARCReaderFactory.get(fn, f.getDataInputStream(), true);
			
			// Once we have an ArchiveReader, we can work through each of the records it contains
			int i = 0;
			for(ArchiveRecord r : ar) {
				// The header file contains information such as the type of record, size, creation time, and URL
				System.out.println("Header: " + r.getHeader());
				System.out.println("URL: " + r.getHeader().getUrl());
				
//			System.out.println(r.);
				
				
				// If we want to read the contents of the record, we can use the ArchiveRecord as an InputStream
				// Create a byte array that is as long as all the record's stated length
				byte[] rawData = new byte[r.available()];
				r.read(rawData);
				// Note: potential optimization would be to have a large buffer only allocated once
				
				// Why don't we convert it to a string and print the start of it? Let's hope it's text!
				String content = new String(rawData);
				System.out.println(content.substring(0, Math.min(500, content.length())));
				System.out.println((content.length() > 500 ? "..." : ""));
				
				// Pretty printing to make the output more readable 
				System.out.println("=-=-=-=-=-=-=-=-=");
				if (i++ > 4) break; 
			}
		} catch (ServiceException e) {
			// TODO Auto-generated catch block
			e.printStackTrace();
		}
	}
 
Developer: TeamHG-Memex | Project: common-crawl-mapreduce | Lines of code: 44 | Source file: S3ReaderTest.java


Example 18: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static void main(String[] args) throws Exception {

    if (args.length != 2) {
      log.error("Usage: LoadS3 <pathsFile> <range>");
      System.exit(1);
    }
    final List<String> loadList = IndexEnv.getPathsRange(args[0], args[1]);
    if (loadList.isEmpty()) {
      log.error("No files to load given {} {}", args[0], args[1]);
      System.exit(1);
    }

    final int rateLimit = WebIndexConfig.load().getLoadRateLimit();

    SparkConf sparkConf = new SparkConf().setAppName("webindex-load-s3");
    try (JavaSparkContext ctx = new JavaSparkContext(sparkConf)) {

      log.info("Loading {} files (Range {} of paths file {}) from AWS", loadList.size(), args[1],
          args[0]);

      JavaRDD<String> loadRDD = ctx.parallelize(loadList, loadList.size());

      final String prefix = WebIndexConfig.CC_URL_PREFIX;

      loadRDD.foreachPartition(iter -> {
        final FluoConfiguration fluoConfig = new FluoConfiguration(new File("fluo.properties"));
        final RateLimiter rateLimiter = rateLimit > 0 ? RateLimiter.create(rateLimit) : null;
        try (FluoClient client = FluoFactory.newClient(fluoConfig);
            LoaderExecutor le = client.newLoaderExecutor()) {
          iter.forEachRemaining(path -> {
            String urlToCopy = prefix + path;
            log.info("Loading {} to Fluo", urlToCopy);
            try {
              ArchiveReader reader = WARCReaderFactory.get(new URL(urlToCopy), 0);
              for (ArchiveRecord record : reader) {
                Page page = ArchiveUtil.buildPageIgnoreErrors(record);
                if (page.getOutboundLinks().size() > 0) {
                  log.info("Loading page {} with {} links", page.getUrl(), page.getOutboundLinks()
                      .size());
                  if (rateLimiter != null) {
                    rateLimiter.acquire();
                  }
                  le.execute(PageLoader.updatePage(page));
                }
              }
            } catch (Exception e) {
              log.error("Exception while processing {}", path, e);
            }
          });
        }
      });
    }
  }
 
Developer: astralway | Project: webindex | Lines of code: 54 | Source file: LoadS3.java


Example 19: main

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
public static void main(String[] args) throws Exception {

    if (args.length != 1) {
      log.error("Usage: LoadHdfs <dataDir>");
      System.exit(1);
    }
    final String dataDir = args[0];
    IndexEnv.validateDataDir(dataDir);

    final String hadoopConfDir = IndexEnv.getHadoopConfDir();
    final int rateLimit = WebIndexConfig.load().getLoadRateLimit();

    List<String> loadPaths = new ArrayList<>();
    FileSystem hdfs = IndexEnv.getHDFS();
    RemoteIterator<LocatedFileStatus> listIter = hdfs.listFiles(new Path(dataDir), true);
    while (listIter.hasNext()) {
      LocatedFileStatus status = listIter.next();
      if (status.isFile()) {
        loadPaths.add(status.getPath().toString());
      }
    }

    log.info("Loading {} files into Fluo from {}", loadPaths.size(), dataDir);

    SparkConf sparkConf = new SparkConf().setAppName("webindex-load-hdfs");
    try (JavaSparkContext ctx = new JavaSparkContext(sparkConf)) {

      JavaRDD<String> paths = ctx.parallelize(loadPaths, loadPaths.size());

      paths.foreachPartition(iter -> {
        final FluoConfiguration fluoConfig = new FluoConfiguration(new File("fluo.properties"));
        final RateLimiter rateLimiter = rateLimit > 0 ? RateLimiter.create(rateLimit) : null;
        FileSystem fs = IndexEnv.getHDFS(hadoopConfDir);
        try (FluoClient client = FluoFactory.newClient(fluoConfig);
            LoaderExecutor le = client.newLoaderExecutor()) {
          iter.forEachRemaining(path -> {
            Path filePath = new Path(path);
            try {
              if (fs.exists(filePath)) {
                FSDataInputStream fsin = fs.open(filePath);
                ArchiveReader reader = WARCReaderFactory.get(filePath.getName(), fsin, true);
                for (ArchiveRecord record : reader) {
                  Page page = ArchiveUtil.buildPageIgnoreErrors(record);
                  if (page.getOutboundLinks().size() > 0) {
                    log.info("Loading page {} with {} links", page.getUrl(), page
                        .getOutboundLinks().size());
                    if (rateLimiter != null) {
                      rateLimiter.acquire();
                    }
                    le.execute(PageLoader.updatePage(page));
                  }
                }
              }
            } catch (IOException e) {
              log.error("Exception while processing {}", path, e);
            }
          });
        }
      });
    }
  }
 
Developer: astralway | Project: webindex | Lines of code: 62 | Source file: LoadHdfs.java


Example 20: testBasic

import org.archive.io.warc.WARCReaderFactory; // import the required package/class
@Test
public void testBasic() throws IOException, ParseException {

  ArchiveReader archiveReader = WARCReaderFactory.get(new File("src/test/resources/wat.warc"));
  Page page = ArchiveUtil.buildPage(archiveReader.get());
  Assert.assertNotNull(page);
  Assert.assertFalse(page.isEmpty());

  Assert
      .assertEquals(
          "http://1079ishot.com/presale-password-trey-songz-young-jeezy-pre-christmas-bash/screen-shot-2011-10-27-at-11-12-06-am/",
          page.getUrl());
  Assert
      .assertEquals(
          "com.1079ishot>>o>/presale-password-trey-songz-young-jeezy-pre-christmas-bash/screen-shot-2011-10-27-at-11-12-06-am/",
          page.getUri());

  Assert.assertEquals("2015-04-18T03:35:13Z", page.getCrawlDate());
  Assert.assertEquals("nginx/1.6.2", page.getServer());
  Assert
      .assertEquals(
          "Presale Password &#8211; Trey Songz &#038; Young Jeezy Pre-Christmas Bash Screen shot 2011-10-27 at ",
          page.getTitle());
  Assert.assertEquals(0, page.getOutboundLinks().size());

  ArchiveReader ar2 = WARCReaderFactory.get(new File("src/test/resources/wat-18.warc"));

  int valid = 0;
  int invalid = 0;
  Iterator<ArchiveRecord> records = ar2.iterator();
  while (records.hasNext()) {
    try {
      ArchiveRecord r = records.next();
      ArchiveUtil.buildPage(r);
      valid++;
    } catch (ParseException e) {
      invalid++;
    }
  }
  Assert.assertEquals(18, valid);
  Assert.assertEquals(0, invalid);
}
 
Developer: astralway | Project: webindex | Lines of code: 43 | Source file: ArchiveUtilTest.java



Note: The org.archive.io.warc.WARCReaderFactory examples in this article were collected from GitHub, MSDocs, and other source-code and documentation hosting platforms. The snippets were selected from open-source projects contributed by their respective developers; copyright remains with the original authors, and distribution and use should follow each project's license. Do not republish without permission.

