modesty/pdf2json: A PDF file parser that converts PDF binaries to text based JSO ...

Open-source software name:

modesty/pdf2json

Open-source software URL:

https://github.com/modesty/pdf2json

Open-source programming language:

Java 94.4%

Open-source software introduction:

Pre-merge PR: Convert commonJS to ES Module. Please help to test it out and report issues before 7/31/2022. Thanks.

pdf2json

pdf2json is a node.js module that parses and converts PDF from binary to JSON format. It is built with pdf.js and extends it with interactive form elements and text content parsing outside the browser.

The goal is to enable server-side PDF parsing with interactive form elements when wrapped in a web service, and also to enable parsing a local PDF to a JSON file when used as a command line utility.

Install

npm install pdf2json

Or, install it globally:

sudo npm install pdf2json -g

To update with latest version:

sudo npm update pdf2json -g

To Run in RESTful Web Service or as Commandline Utility

  • More details can be found at the bottom of this document.

Test

After install, run command line:

npm run test

It scans and parses 260 PDF AcroForm files under ./test/pdf, runs with the -s -t -c -m command line options, and generates the primary output JSON, additional text content JSON, form fields JSON and merged text JSON file for each PDF. It usually takes ~20s on my MacBook Pro to complete; check ./test/target/ for outputs.

Test Exception Handling

After install, run command line:

npm run test-misc

It scans and parses all PDF files under ./test/pdf/misc, also runs with the -s -t -c -m command line options, and generates the primary output JSON, additional text content JSON, form fields JSON and merged text JSON file for 5 PDF files, while catching exceptions with stack traces for:

  • bad XRef entry for pdf/misc/i200_test.pdf
  • unsupported encryption algorithm for pdf/misc/i43_encrypted.pdf
  • Invalid XRef stream header for pdf/misc/i243_problem_file_anon.pdf

Test Streams

After install, run command line:

npm run parse-r

It scans 165 PDF files under ../test/pdf/fd/form_, parses with Stream API, then generates output to ./test/target/fd/form.

More test scripts with different commandline options can be found at package.json.

Code Example

  • Parse a PDF file then write to a JSON file:
    const fs = require('fs'),
        PDFParser = require("pdf2json");

    const pdfParser = new PDFParser();

    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
    pdfParser.on("pdfParser_dataReady", pdfData => {
        fs.writeFile("./pdf2json/test/F1040EZ.json", JSON.stringify(pdfData), ()=>{console.log("Done.");});
    });

    pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Or, call directly with buffer:

    fs.readFile(pdfFilePath, (err, pdfBuffer) => {
      if (!err) {
        pdfParser.parseBuffer(pdfBuffer);
      }
    })

Or, use more granular page level parsing events (v2.0.0)

    pdfParser.on("readable", meta => console.log("PDF Metadata", meta) );
    pdfParser.on("data", page => console.log(page ? "One page parsed" : "All pages parsed", page));
    pdfParser.on("error", err => console.error("Parser Error", err));
  • Parse a PDF then write a .txt file (which only contains textual content of the PDF)
    const fs = require('fs'),
        PDFParser = require("pdf2json");

    const pdfParser = new PDFParser(this,1);

    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
    pdfParser.on("pdfParser_dataReady", pdfData => {
        fs.writeFile("./pdf2json/test/F1040EZ.content.txt", pdfParser.getRawTextContent(), ()=>{console.log("Done.");});
    });

    pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");
  • Parse a PDF then write a fields.json file that only contains interactive forms' fields information:
    const fs = require('fs'),
        PDFParser = require("pdf2json");

    const pdfParser = new PDFParser();

    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
    pdfParser.on("pdfParser_dataReady", pdfData => {
        fs.writeFile("./pdf2json/test/F1040EZ.fields.json", JSON.stringify(pdfParser.getAllFieldsTypes()), ()=>{console.log("Done.");});
    });

    pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Alternatively, you can pipe input and output streams: (requires v1.1.4)

    const fs = require('fs'),
        PDFParser = require("pdf2json");
    
    const inputStream = fs.createReadStream("./pdf2json/test/pdf/fd/form/F1040EZ.pdf", {highWaterMark: 64 * 1024});
    const outputStream = fs.createWriteStream("./pdf2json/test/target/fd/form/F1040EZ.json");
    
    inputStream.pipe(new PDFParser()).pipe(new StringifyStream()).pipe(outputStream);

With v2.0.0, the last line above changes to:

    inputStream.pipe(this.pdfParser.createParserStream()).pipe(new StringifyStream()).pipe(outputStream);
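The piping examples above rely on a StringifyStream that the snippets themselves do not define. As a rough sketch of what such a stage has to do, and not pdf2json's own implementation, the hypothetical `JsonStringifyStream` below is a Node.js Transform that accepts objects and emits JSON text. `collectText` and the sample data are also made up for the demonstration.

```javascript
const { Transform, Readable, Writable } = require("stream");

// Hypothetical stand-in for a stringify stage: objects in, JSON text out.
class JsonStringifyStream extends Transform {
    constructor() {
        // Only the writable side is in object mode; the readable side emits text.
        super({ writableObjectMode: true, readableObjectMode: false });
    }

    _transform(chunk, _encoding, callback) {
        // One JSON document per incoming object.
        callback(null, JSON.stringify(chunk));
    }
}

// Pipes a readable stream into an in-memory sink and resolves with the text.
function collectText(source) {
    return new Promise((resolve, reject) => {
        let text = "";
        const sink = new Writable({
            write(chunk, _encoding, callback) {
                text += chunk.toString();
                callback();
            }
        });
        sink.on("finish", () => resolve(text));
        sink.on("error", reject);
        source.on("error", reject);
        source.pipe(sink);
    });
}

// Usage with made-up data in place of real parser output.
const demoSource = Readable.from([{ Pages: [] }]);
const demoResult = collectText(demoSource.pipe(new JsonStringifyStream()));
demoResult.then(text => console.log(text));
```

In a real pipeline the in-memory sink would be replaced by a file write stream, as in the examples above.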

For additional output streams support:

    //private methods
    #generateMergedTextBlocksStream() {
        return new Promise( (resolve, reject) => {
            const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".merged.json"), resolve, reject);
            this.pdfParser.getMergedTextBlocksStream().pipe(new StringifyStream()).pipe(outputStream);
        });
    }

    #generateRawTextContentStream() {
        return new Promise( (resolve, reject) => {
            const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".content.txt"), resolve, reject);
            this.pdfParser.getRawTextContentStream().pipe(outputStream);
        });
    }

    #generateFieldsTypesStream() {
        return new Promise( (resolve, reject) => {
            const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".fields.json"), resolve, reject);
            this.pdfParser.getAllFieldsTypesStream().pipe(new StringifyStream()).pipe(outputStream);
        });
    }

    #processAdditionalStreams() {
        const outputTasks = [];
        if (PROCESS_FIELDS_CONTENT) {//needs to generate fields.json file
            outputTasks.push(this.#generateFieldsTypesStream());
        }
        if (PROCESS_RAW_TEXT_CONTENT) {//needs to generate content.txt file
            outputTasks.push(this.#generateRawTextContentStream());
        }
        if (PROCESS_MERGE_BROKEN_TEXT_BLOCKS) {//needs to generate json file with merged broken text blocks
            outputTasks.push(this.#generateMergedTextBlocksStream());
        }
        return Promise.allSettled(outputTasks);
    }

Note: if primary JSON parsing throws an exception, none of the additional streams will be processed. See p2jcmd.js for more details.
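The ordering described in this note can be sketched independently of pdf2json. The helper below is a minimal illustration, not the code in p2jcmd.js: `runOutputs`, `primaryTask` and `additionalTasks` are hypothetical names, and each task is assumed to be a function that returns a Promise.

```javascript
// Runs the primary output first; additional outputs only start if it succeeded.
async function runOutputs(primaryTask, additionalTasks) {
    try {
        await primaryTask();
    } catch (err) {
        // Primary output failed: skip every additional output.
        return { primaryError: err, additional: [] };
    }
    // allSettled waits for every task and never rejects, so one failed
    // output does not hide the results of the others.
    const additional = await Promise.allSettled(additionalTasks.map(task => task()));
    return { primaryError: null, additional };
}

// Usage with made-up tasks: one additional output succeeds, one fails.
const demoRun = runOutputs(
    () => Promise.resolve(),
    [
        () => Promise.resolve("fields.json"),
        () => Promise.reject(new Error("disk full"))
    ]
);
demoRun.then(result => console.log(result.additional.map(r => r.status)));
```

Using `Promise.allSettled` rather than `Promise.all` means the caller can inspect each additional output separately instead of stopping at the first failure.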

API Reference

  • events:

    • pdfParser_dataError: will be raised when parsing failed
    • pdfParser_dataReady: when parsing succeeded
  • alternative events: (v2.0.0)

    • readable: first event dispatched after PDF file metadata is parsed and before processing any page
    • data: one page parsed successfully; null means the last page has been processed and signals the end of the data stream
    • error: an exception or error occurred
  • start to parse PDF file from specified file path asynchronously:

        function loadPDF(pdfFilePath);

If parsing fails, the "pdfParser_dataError" event is raised with an error object: {"parserError": errObj}. If it succeeds, the "pdfParser_dataReady" event is raised with the output data object: {"formImage": parseOutput}, which can be saved as a JSON file (on the command line) or serialized to JSON when running in a web service. Note: "formImage" is removed from v2.0.0; see breaking changes for details.

  • Get all textual content from "pdfParser_dataReady" event handler:
        function getRawTextContent();

Returns the text as a string.

  • Get all input fields information from "pdfParser_dataReady" event handler:
        function getAllFieldsTypes();

Returns an array of field objects.
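Because the API is event based, callers who prefer async/await may want to wrap the two documented events in a Promise. The following is a minimal sketch, assuming the parser object exposes `once` like a Node.js EventEmitter. `loadPdfAsPromise` and `FakeParser` are hypothetical names. `FakeParser` exists only so the wrapper can be exercised without a real PDF.

```javascript
const { EventEmitter } = require("events");

// Wraps "pdfParser_dataReady" / "pdfParser_dataError" in a Promise.
// `parser` is any object with once() and loadPDF(), e.g. a PDFParser instance.
function loadPdfAsPromise(parser, pdfFilePath) {
    return new Promise((resolve, reject) => {
        parser.once("pdfParser_dataError", errData => reject(errData.parserError));
        parser.once("pdfParser_dataReady", pdfData => resolve(pdfData));
        parser.loadPDF(pdfFilePath);
    });
}

// Stand-in parser that emits the documented events asynchronously.
class FakeParser extends EventEmitter {
    loadPDF(pdfFilePath) {
        setImmediate(() => {
            if (pdfFilePath.endsWith(".pdf")) {
                this.emit("pdfParser_dataReady", { Pages: [] });
            } else {
                this.emit("pdfParser_dataError", { parserError: new Error("not a PDF") });
            }
        });
    }
}

const okLoad = loadPdfAsPromise(new FakeParser(), "form.pdf");
const badLoad = loadPdfAsPromise(new FakeParser(), "form.txt").catch(err => err.message);
okLoad.then(data => console.log("pages:", data.Pages.length));
```

With a real parser, `getRawTextContent()` or `getAllFieldsTypes()` could then be called once the returned Promise resolves.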

Output format Reference

The parsed data currently has the following main sub-objects that describe the PDF document.

  • 'Transcoder': pdf2json version number
  • 'Agency': the main text identifier for the PDF document. If Id.AgencyId is present, it is used; otherwise the document title is used (deprecated since v2.0.0, see notes below)
  • 'Id': the XML metadata embedded in the PDF document (deprecated since v2.0.0, see notes below)
    • all forms attributes metadata are defined in "Custom" tab of "Document Properties" dialog in Acrobat Pro;
    • v0.1.22 added support for the following custom properties:
      • AgencyId: default "unknown";
      • Name: default "unknown";
      • MC: default false;
      • Max: default -1;
      • Parent: parent name, default "unknown";
    • v2.0.0: 'Agency' and 'Id' are replaced with full metadata, example: for ./test/pdf/fd/form/F1040.pdf, full metadata is:
          Meta: {
              PDFFormatVersion: '1.7',
              IsAcroFormPresent: true,
              IsXFAPresent: false,
              Author: 'SE:W:CAR:MP',
              Subject: 'U.S. Individual Income Tax Return',
              Creator: 'Adobe Acrobat Pro 10.1.8',
              Producer: 'Adobe Acrobat Pro 10.1.8',
              CreationDate: "D:20131203133943-08'00'",
              ModDate: "D:20140131180702-08'00'",
              Metadata: {
                  'xmp:modifydate': '2014-01-31T18:07:02-08:00',
                  'xmp:createdate': '2013-12-03T13:39:43-08:00',
                  'xmp:metadatadate': '2014-01-31T18:07:02-08:00',
                  'xmp:creatortool': 'Adobe Acrobat Pro 10.1.8',
                  'dc:format': 'application/pdf',
                  'dc:description': 'U.S. Individual Income Tax Return',
                  'dc:creator': 'SE:W:CAR:MP',
                  'xmpmm:documentid': 'uuid:4d81e082-7ef2-4df7-b07b-8190e5d3eadf',
                  'xmpmm:instanceid': 'uuid:7ea96d1c-3d2f-284a-a469-f0f284a093de',
                  'pdf:producer': 'Adobe Acrobat Pro 10.1.8',
                  'adhocwf:state': '1',
                  'adhocwf:version': '1.1'
              }
          }
  • 'Pages': array of 'Page' objects that describe each page in the PDF, including sizes, lines, fills and texts within the page. More info about the 'Page' object can be found in the 'Page object Reference' section
  • 'Width': the PDF page width in page unit
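Client code that has to read output from both before and after v2.0.0 can hide the differences behind two small helpers. This is only a sketch under the assumptions stated in this document: older output is wrapped in "formImage" and keeps 'Width' at the root, while newer output has no wrapper and carries 'Width' on each page. `unwrapOutput`, `pageWidth` and the sample numbers are made up for illustration.

```javascript
// Returns the root output object whether or not it is wrapped in "formImage".
function unwrapOutput(pdfData) {
    return pdfData.formImage ? pdfData.formImage : pdfData;
}

// Page width: prefer the page-level value, fall back to the root-level one.
function pageWidth(root, page) {
    return page.Width !== undefined ? page.Width : root.Width;
}

// Made-up sample objects; the numbers are illustrative only.
const oldStyle = { formImage: { Width: 38.25, Pages: [{ Height: 49.5 }] } };
const newStyle = {
    Meta: { PDFFormatVersion: "1.7" },
    Pages: [{ Width: 38.25, Height: 49.5 }]
};

for (const data of [oldStyle, newStyle]) {
    const root = unwrapOutput(data);
    console.log(root.Pages.length, pageWidth(root, root.Pages[0]));
}
```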

Page object Reference

Each page object within the 'Pages' array describes page elements and attributes with the following main fields:

  • 'Height': height of the page in page unit
  • 'Width': width of the page in page unit, moved from root to page object in v2.0.0
  • 'HLines': horizontal line array, each line has 'x', 'y' in relative coordinates for positioning, and 'w' for width, plus 'l' for length. Both width and length are in page unit
  • 'VLines': vertical line array, each line has 'x', 'y' in relative coordinates for positioning, and 'w' for width, plus 'l' for length. Both width and length are in page unit;
    • v0.4.3 added line color support. Default is 'black'; otherwise the color is set in 'clr' if found in the color dictionary, or in the 'oc' field if not found in the dictionary;
    • v0.4.4 added dashed line support. Default is 'solid', if line style is dashed line, {dsh:1} is added to line object;
  • 'Fills': an array of rectangular area with solid color fills, same as lines, each 'fill' object has 'x', 'y' in relative coordinates for positioning, 'w' and 'h' for width and height in page unit, plus 'clr' to reference a color with index in color dictionary. More info about 'color dictionary' can be found at 'Dictionary Reference' section.
  • 'Texts': an array of text blocks with position, actual text and styling information:
    • 'x' and 'y': relative coordinates for positioning
    • 'clr': a color index in the color dictionary, same 'clr' field as in the 'Fill' object. If a color cannot be found in the color dictionary, an 'oc' field is added with the 'original color' value.
    • 'A': text alignment, including:
      • left
      • center
      • right
    • 'R': an array of text runs, each text run object has the following main fields:
      • 'T': actual text
      • 'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
      • 'TS': [fontFaceId, fontSize, 1/0 for bold, 1/0 for italic]
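To make the 'Texts' structure concrete, here is a small sketch that walks the text blocks of one page and names the four positions of a 'TS' array. It assumes that 'T' values may be URI-encoded in some outputs, so it decodes them defensively and falls back to the raw value. `pageText`, `safeDecode`, `describeStyle` and the sample page are hypothetical.

```javascript
// Decodes URI-encoded text; returns the raw value if it is not valid encoding.
function safeDecode(text) {
    try {
        return decodeURIComponent(text);
    } catch (err) {
        return text;
    }
}

// Joins every run ('R') of every block in 'Texts', one block per line.
function pageText(page) {
    return (page.Texts || [])
        .map(block => (block.R || []).map(run => safeDecode(run.T)).join(""))
        .join("\n");
}

// Names the four positions of a 'TS' array.
function describeStyle(ts) {
    const [fontFaceId, fontSize, bold, italic] = ts;
    return { fontFaceId, fontSize, bold: bold === 1, italic: italic === 1 };
}

// Made-up page for illustration.
const samplePage = {
    Texts: [
        { x: 1, y: 1, clr: 0, A: "left", R: [{ T: "Form%201040", S: 9, TS: [0, 12, 1, 0] }] },
        { x: 1, y: 2, clr: 0, A: "left", R: [{ T: "Your first name", S: 3, TS: [0, 12, 0, 0] }] }
    ]
};

console.log(pageText(samplePage));
console.log(describeStyle(samplePage.Texts[0].R[0].TS));
```

Joining blocks with a newline is a simplification. A layout-aware client would sort and group blocks by their 'x' and 'y' coordinates first.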

v0.4.5 added support for field attribute information defined in an external XML file. pdf2json always tries to load a field attributes XML file based on a file name convention (the field XML file for pdfFileName.pdf must be named pdfFileName_fieldInfo.xml and be in the same directory). If found, the field information is injected.
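The naming convention can be expressed as a short path computation. This is a sketch of the convention only, not pdf2json's internal lookup code, and `fieldInfoXmlPath` is a hypothetical helper name.

```javascript
const path = require("path");

// Derives the field attributes XML path from a PDF path:
// pdfFileName.pdf -> pdfFileName_fieldInfo.xml in the same directory.
function fieldInfoXmlPath(pdfFilePath) {
    const dir = path.dirname(pdfFilePath);
    const baseName = path.basename(pdfFilePath, path.extname(pdfFilePath));
    return path.join(dir, `${baseName}_fieldInfo.xml`);
}

console.log(fieldInfoXmlPath("forms/F1040EZ.pdf"));
```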

Dictionary Reference

For the same reason that the 'Page' object has "HLines" and "VLines" arrays, the color and style dictionaries help reduce the payload size when transporting the parsing output over the wire. This dictionary data contract lets the output reference a dictionary key rather than the full definition of a color or font style. It does require the client of the payload to have the same dictionary definitions in order to make sense of the output when rendering it on screen.
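On the client side, resolving a color then becomes a lookup with two fallbacks. The sketch below uses only the first four entries of the color dictionary listed in this section, so a real client would need the full list. `resolveColor` is a hypothetical helper, and how a missing dictionary index is represented (here: absent or out of range) is an assumption.

```javascript
// First entries of the color dictionary; a real client needs the full list.
const sampleColors = ["#000000", "#ffffff", "#4c4c4c", "#808080"];

// Resolves the color of a line, fill or text object:
// 'clr' is an index into the dictionary, 'oc' carries the original color
// when no dictionary entry matched, and black is the default.
function resolveColor(item, colors) {
    if (Number.isInteger(item.clr) && item.clr >= 0 && item.clr < colors.length) {
        return colors[item.clr];
    }
    if (typeof item.oc === "string") {
        return item.oc;
    }
    return "#000000";
}

console.log(resolveColor({ clr: 1 }, sampleColors));
console.log(resolveColor({ clr: -1, oc: "#123456" }, sampleColors));
console.log(resolveColor({}, sampleColors));
```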

  • Color Dictionary
        const kColors = [
                '#000000',		// 0
                '#ffffff',		// 1
                '#4c4c4c',		// 2
                '#808080',		// 3
                '#999999',		// 4
                '#c0c0c0',		// 5
                '#cccccc',		// 6
                '#e5e5e5',		// 7
                '#f2f2f2',		// 8
                '#008000',		// 9
                '#00ff00',		// 10
                '#bfffa0',		// 11
                '#ffd629',		// 12
                '#ff99cc',		// 13
                '#004080',		// 14
                '#9fc0e1',		// 15
                '#5580ff',		// 16
                '#a9c9fa',		// 17
                '#ff0080',		// 18
                '#800080',		// 19
                '#ffbfff',		// 20
                '#e45b21',		// 21
                '#ffbfaa',		// 22
                '#008080',		// 23
                '#ff0000',		// 24
                '#fdc59f',		// 25
                '#808000',		// 26
                '#bfbf00',		// 27
                '#824100',		// 28
                '#007256',		// 29
                '#008000',		// 30
                '#000080',		// Last + 1
                '#008080',		// Last + 2
                '#800080',		// Last + 3
                '#ff0000',		// Last + 4
                '#0000ff',		// Last + 5
                '#008000',		// Last + 6
                '#000000'		// Last + 7
            ];
  • Style Dictionary:
            const kFontFaces = [
               "QuickType,Arial,Helvetica,sans-serif",							// 00 - QuickType - sans-serif variable font
               "QuickType Condensed,Arial Narrow,Arial,Helvetica,sans-serif",	// 01 - QuickType Condensed - thin sans-serif variable font
               "QuickTypePi",													// 02 - QuickType Pi
               "QuickType Mono,Courier New,Courier,monospace",					// 03 - QuickType Mono - sans-serif fixed font
               "OCR-A,Courier New,Courier,monospace",							// 04 - OCR-A - OCR readable sans-serif fixed font
               "OCR B MT,Courier New,Courier,monospace"							// 05 - OCR-B MT - OCR readable sans-serif fixed font
            ];

            const kFontStyles = [
                // Face		Size	Bold	Italic		StyleID(Comment)
                // -----	----	----	-----		-----------------
                    [0,		6,		0,		0],			//00
                    [0,		8,		0,		0],			//01
                    [0,		10,		0,		0],			//02
                    [0,		12,		0,		0],			//03
                    [0,		14,		0,		0],			//04
                    [0,		18,		0,		0],			//05
                    [0,		6,		1,		0],			//06
                    [0,		8,		1,		0],			//07
                    [0,		10,		1,		0],			//08
                    [0,		12,		1,		0],			//09
                    [0,		14,		1,		0],			//10
                    [0,		18,		1,		0],			//11
                    [0,		6,		0,		1],			//12
                    [0,		8,		0,		1],			//13
                    [0,		10,		0,		1],			//14
                    [0,		12 
                       
                    
                    
