modesty/pdf2json: A PDF file parser that converts PDF binaries to text based JSO ...


Open-source software name:

modesty/pdf2json

Open-source software URL:

https://github.com/modesty/pdf2json

Programming language:

JavaScript 94.4%

Project description:

Pre-merge PR: Convert commonJS to ES Module. Please help to test it out and report issues before 7/31/2022. Thanks.

pdf2json

pdf2json is a node.js module that parses PDF files and converts them from binary to JSON format. It is built on pdf.js and extends it with interactive form element and text content parsing outside the browser.

The goal is to enable server-side PDF parsing with interactive form elements when wrapped in a web service, and to enable parsing local PDF files to JSON files when used as a command line utility.

Install

npm install pdf2json

Or, install it globally:

sudo npm install pdf2json -g

To update to the latest version:

sudo npm update pdf2json -g

To Run in RESTful Web Service or as Commandline Utility

More details can be found at the bottom of this document.

Test

After install, run command line:

npm run test

It scans and parses 260 PDF AcroForm files under ./test/pdf, running with the -s -t -c -m command line options, and generates the primary output JSON, additional text content JSON, form fields JSON, and merged text JSON file for each PDF. It usually takes about 20 seconds on my MacBook Pro to complete; check ./test/target/ for the outputs.

Test Exception Handling

After install, run command line:

npm run test-misc

It scans and parses all PDF files under ./test/pdf/misc, also running with the -s -t -c -m command line options. It generates the primary output JSON, additional text content JSON, form fields JSON, and merged text JSON file for 5 PDF files, and catches exceptions with stack traces for:

bad XRef entry for pdf/misc/i200_test.pdf
unsupported encryption algorithm for pdf/misc/i43_encrypted.pdf
Invalid XRef stream header for pdf/misc/i243_problem_file_anon.pdf

Test Streams

After install, run command line:

npm run parse-r

It scans 165 PDF files under ../test/pdf/fd/form, parses them with the Stream API, then writes the output to ./test/target/fd/form.

More test scripts with different command line options can be found in package.json.

Code Example

Parse a PDF file then write to a JSON file:

    const fs = require('fs'),
        PDFParser = require("pdf2json");

    const pdfParser = new PDFParser();

    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
    pdfParser.on("pdfParser_dataReady", pdfData => {
        // fs.writeFile requires a completion callback
        fs.writeFile("./pdf2json/test/F1040EZ.json", JSON.stringify(pdfData), ()=>{console.log("Done.");});
    });

    pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Or, call directly with buffer:

    fs.readFile(pdfFilePath, (err, pdfBuffer) => {
      if (!err) {
        pdfParser.parseBuffer(pdfBuffer);
      }
    })

Or, use more granular page-level parsing events (v2.0.0):

    pdfParser.on("readable", meta => console.log("PDF Metadata", meta) );
    pdfParser.on("data", page => console.log(page ? "One page parsed" : "All pages parsed", page));
    pdfParser.on("error", err => console.error("Parser Error", err));
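
The page-level events can also be gathered into a single result object. The following is a minimal sketch, assuming only the event names documented here (`readable`, `data`, `error`) and that a `null` payload on `data` marks the end. `collectPages` is a hypothetical helper, and a plain `EventEmitter` stands in for the parser so the snippet runs without a PDF:

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical helper: accumulates metadata and pages from the granular events.
function collectPages(emitter) {
    const result = { meta: null, pages: [], done: false, error: null };
    emitter.on("readable", meta => { result.meta = meta; });
    emitter.on("data", page => {
        if (page === null) {
            result.done = true;          // null signals that the last page was processed
        } else {
            result.pages.push(page);
        }
    });
    emitter.on("error", err => { result.error = err; });
    return result;
}

// Stand-in emitter used only for illustration; a real parser emits these events itself.
const fakeParser = new EventEmitter();
const collected = collectPages(fakeParser);
fakeParser.emit("readable", { PDFFormatVersion: "1.7" });
fakeParser.emit("data", { Texts: [] });
fakeParser.emit("data", null);
console.log(collected.pages.length, collected.done);
```

With the real module, the same helper would presumably be called with the `pdfParser` instance before `loadPDF` is invoked.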

Parse a PDF then write a .txt file (which only contains the textual content of the PDF):

    const fs = require('fs'),
        PDFParser = require("pdf2json");

    const pdfParser = new PDFParser(this,1);

    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
    pdfParser.on("pdfParser_dataReady", pdfData => {
        fs.writeFile("./pdf2json/test/F1040EZ.content.txt", pdfParser.getRawTextContent(), ()=>{console.log("Done.");});
    });

    pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Parse a PDF then write a fields.json file that only contains interactive forms' fields information:

    const fs = require('fs'),
        PDFParser = require("pdf2json");

    const pdfParser = new PDFParser();

    pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError) );
    pdfParser.on("pdfParser_dataReady", pdfData => {
        fs.writeFile("./pdf2json/test/F1040EZ.fields.json", JSON.stringify(pdfParser.getAllFieldsTypes()), ()=>{console.log("Done.");});
    });

    pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Alternatively, you can pipe input and output streams: (requires v1.1.4)

    const fs = require('fs'),
        PDFParser = require("pdf2json");
    
    const inputStream = fs.createReadStream("./pdf2json/test/pdf/fd/form/F1040EZ.pdf", {highWaterMark: 64 * 1024});
    const outputStream = fs.createWriteStream("./pdf2json/test/target/fd/form/F1040EZ.json");
    
    inputStream.pipe(new PDFParser()).pipe(new StringifyStream()).pipe(outputStream);

With v2.0.0, the last line above changes to:

    inputStream.pipe(this.pdfParser.createParserStream()).pipe(new StringifyStream()).pipe(outputStream);

For additional output streams support:

    // private methods
    #generateMergedTextBlocksStream() {
        return new Promise((resolve, reject) => {
            const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".merged.json"), resolve, reject);
            this.pdfParser.getMergedTextBlocksStream().pipe(new StringifyStream()).pipe(outputStream);
        });
    }

    #generateRawTextContentStream() {
        return new Promise((resolve, reject) => {
            const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".content.txt"), resolve, reject);
            this.pdfParser.getRawTextContentStream().pipe(outputStream);
        });
    }

    #generateFieldsTypesStream() {
        return new Promise((resolve, reject) => {
            const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".fields.json"), resolve, reject);
            this.pdfParser.getAllFieldsTypesStream().pipe(new StringifyStream()).pipe(outputStream);
        });
    }

    #processAdditionalStreams() {
        const outputTasks = [];
        if (PROCESS_FIELDS_CONTENT) { // needs to generate fields.json file
            outputTasks.push(this.#generateFieldsTypesStream());
        }
        if (PROCESS_RAW_TEXT_CONTENT) { // needs to generate content.txt file
            outputTasks.push(this.#generateRawTextContentStream());
        }
        if (PROCESS_MERGE_BROKEN_TEXT_BLOCKS) { // needs to generate json file with merged broken text blocks
            outputTasks.push(this.#generateMergedTextBlocksStream());
        }
        return Promise.allSettled(outputTasks);
    }

Note: if primary JSON parsing throws an exception, none of the additional streams will be processed. See p2jcmd.js for more details.
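
Because `Promise.allSettled` never rejects, the caller has to inspect each result to learn whether an additional output failed. The following is a minimal sketch of that inspection. `summarizeSettled` is a hypothetical helper, not part of pdf2json, and the two tasks are stand-ins for the output streams:

```javascript
// Hypothetical helper: splits Promise.allSettled results into successes and failures.
function summarizeSettled(results) {
    const fulfilled = results.filter(r => r.status === "fulfilled").map(r => r.value);
    const rejected = results.filter(r => r.status === "rejected").map(r => r.reason);
    return { fulfilled, rejected, ok: rejected.length === 0 };
}

// Illustration with two stand-in tasks, one of which fails.
const tasks = [
    Promise.resolve("fields.json written"),
    Promise.reject(new Error("content.txt failed")),
];
Promise.allSettled(tasks).then(results => {
    const summary = summarizeSettled(results);
    console.log(summary.ok, summary.rejected.map(e => e.message));
});
```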

API Reference

events:
- pdfParser_dataError: raised when parsing fails
- pdfParser_dataReady: raised when parsing succeeds

alternative events: (v2.0.0)
- readable: the first event dispatched, after the PDF file metadata is parsed and before any page is processed
- data: emitted for each successfully parsed page; null means the last page has been processed and signals the end of the data stream
- error: an exception or error occurred

Start to parse a PDF file from the specified file path asynchronously:

        function loadPDF(pdfFilePath);

On failure, the "pdfParser_dataError" event is raised with an error object: {"parserError": errObj}. On success, the "pdfParser_dataReady" event is raised with the output data object: {"formImage": parseOutput}, which can be saved as a JSON file (on the command line) or serialized to JSON when running in a web service. Note: "formImage" is removed as of v2.0.0; see breaking changes for details.
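
The two events can be wrapped in a Promise so that callers can use `async`/`await`. The following is a minimal sketch, assuming only the event names and the `loadPDF` call described above. `parsePdf` is a hypothetical helper, and the `FakeParser` class exists only so the snippet runs without pdf2json installed:

```javascript
const { EventEmitter } = require("node:events");

// Hypothetical helper: resolves with the parsed data or rejects with the parser error.
function parsePdf(parser, pdfFilePath) {
    return new Promise((resolve, reject) => {
        parser.once("pdfParser_dataError", errData => reject(errData.parserError));
        parser.once("pdfParser_dataReady", pdfData => resolve(pdfData));
        parser.loadPDF(pdfFilePath);
    });
}

// Stand-in for the real parser, for illustration only.
class FakeParser extends EventEmitter {
    loadPDF(pdfFilePath) {
        setImmediate(() => this.emit("pdfParser_dataReady", { Pages: [], source: pdfFilePath }));
    }
}

parsePdf(new FakeParser(), "./example.pdf")
    .then(pdfData => console.log("pages:", pdfData.Pages.length))
    .catch(err => console.error("parse failed:", err));
```

In real use, `new FakeParser()` would be replaced by a `PDFParser` instance created as in the code examples above.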

Get all textual content from "pdfParser_dataReady" event handler:

        function getRawTextContent();

Returns the text as a string.

Get all input fields information from "pdfParser_dataReady" event handler:

        function getAllFieldsTypes();

Returns an array of field objects.

Output format Reference

Current parsed data has the following main sub-objects to describe the PDF document.

'Transcoder': pdf2json version number
'Agency': the main text identifier for the PDF document. If Id.AgencyId is present, it will be the same; otherwise it is set to the document title (deprecated since v2.0.0, see notes below)

'Id': the XML metadata embedded in the PDF document (deprecated since v2.0.0, see notes below)

All form attribute metadata is defined in the "Custom" tab of the "Document Properties" dialog in Acrobat Pro;
v0.1.22 added support for the following custom properties:
- AgencyId: default "unknown";
- Name: default "unknown";
- MC: default false;
- Max: default -1;
- Parent: parent name, default "unknown";
v2.0.0: 'Agency' and 'Id' are replaced with full metadata. For example, for ./test/pdf/fd/form/F1040.pdf, the full metadata is:

      Meta: {
          PDFFormatVersion: '1.7',
          IsAcroFormPresent: true,
          IsXFAPresent: false,
          Author: 'SE:W:CAR:MP',
          Subject: 'U.S. Individual Income Tax Return',
          Creator: 'Adobe Acrobat Pro 10.1.8',
          Producer: 'Adobe Acrobat Pro 10.1.8',
          CreationDate: "D:20131203133943-08'00'",
          ModDate: "D:20140131180702-08'00'",
          Metadata: {
              'xmp:modifydate': '2014-01-31T18:07:02-08:00',
              'xmp:createdate': '2013-12-03T13:39:43-08:00',
              'xmp:metadatadate': '2014-01-31T18:07:02-08:00',
              'xmp:creatortool': 'Adobe Acrobat Pro 10.1.8',
              'dc:format': 'application/pdf',
              'dc:description': 'U.S. Individual Income Tax Return',
              'dc:creator': 'SE:W:CAR:MP',
              'xmpmm:documentid': 'uuid:4d81e082-7ef2-4df7-b07b-8190e5d3eadf',
              'xmpmm:instanceid': 'uuid:7ea96d1c-3d2f-284a-a469-f0f284a093de',
              'pdf:producer': 'Adobe Acrobat Pro 10.1.8',
              'adhocwf:state': '1',
              'adhocwf:version': '1.1'
          }
      }
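
The `CreationDate` and `ModDate` values above use the PDF date string form, while the XMP entries carry the same instants in ISO 8601. The following is a minimal sketch of converting one into the other, assuming the full `D:YYYYMMDDHHmmSS` form with an optional offset as shown in the example. `pdfDateToIso` is a hypothetical helper, not part of pdf2json:

```javascript
// Hypothetical helper: converts a PDF date string such as "D:20131203133943-08'00'"
// into an ISO 8601 string. Returns null when the input does not match the full form.
function pdfDateToIso(pdfDate) {
    const pattern = /^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})([+\-Z])?(\d{2})?'?(\d{2})?'?$/;
    const match = pattern.exec(pdfDate);
    if (!match) return null;
    const [, year, month, day, hour, minute, second, sign, offHour, offMinute] = match;
    let offset = "";                     // no offset in the input: leave the time zone unspecified
    if (sign === "Z") {
        offset = "Z";
    } else if (sign) {
        offset = `${sign}${offHour || "00"}:${offMinute || "00"}`;
    }
    return `${year}-${month}-${day}T${hour}:${minute}:${second}${offset}`;
}

console.log(pdfDateToIso("D:20131203133943-08'00'")); // 2013-12-03T13:39:43-08:00
```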

'Pages': an array of 'Page' objects that describe each page in the PDF, including sizes, lines, fills, and texts within the page. More info about the 'Page' object can be found in the 'Page object Reference' section
'Width': the PDF page width in page units

Page object Reference

Each page object within the 'Pages' array describes page elements and attributes with the following main fields:

'Height': height of the page in page units
'Width': width of the page in page units, moved from root to page object in v2.0.0
'HLines': horizontal line array; each line has 'x', 'y' in relative coordinates for positioning, 'w' for width, and 'l' for length. Both width and length are in page units
'VLines': vertical line array; each line has 'x', 'y' in relative coordinates for positioning, 'w' for width, and 'l' for length. Both width and length are in page units;
- v0.4.3 added line color support. Default is 'black'; otherwise the color is set in 'clr' if found in the color dictionary, or in the 'oc' field if not found in the dictionary;
- v0.4.4 added dashed line support. Default is 'solid'; if the line style is dashed, {dsh:1} is added to the line object;
'Fills': an array of rectangular areas with solid color fills. As with lines, each 'fill' object has 'x', 'y' in relative coordinates for positioning, 'w' and 'h' for width and height in page units, plus 'clr' to reference a color by index in the color dictionary. More info about the 'color dictionary' can be found in the 'Dictionary Reference' section.
'Texts': an array of text blocks with position, actual text and styling information:
- 'x' and 'y': relative coordinates for positioning
- 'clr': a color index in the color dictionary, the same 'clr' field as in the 'Fill' object. If a color cannot be found in the color dictionary, an 'oc' field is added to hold the 'original color' value.
- 'A': text alignment, including:
  - left
  - center
  - right
- 'R': an array of text runs; each text run object has the following main fields:
  - 'T': actual text
  - 'S': style index from style dictionary. More info about 'Style Dictionary' can be found at 'Dictionary Reference' section
  - 'TS': [fontFaceId, fontSize, 1/0 for bold, 1/0 for italic]
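
The 'Texts' structure above can be flattened into simple run descriptions. The following is a minimal sketch: `describeRuns` and `decodeText` are hypothetical helpers, the sample page is hand-written rather than real parser output, and the URI decoding of 'T' is an assumption about how the text is encoded (it falls back to the raw value when decoding fails):

```javascript
// Assumption: 'T' may be URI-encoded; fall back to the raw value if decoding fails.
function decodeText(t) {
    try {
        return decodeURIComponent(t);
    } catch (e) {
        return t;
    }
}

// Hypothetical helper: flattens one page's 'Texts' array into run descriptions.
function describeRuns(page) {
    const runs = [];
    for (const block of page.Texts || []) {
        for (const run of block.R || []) {
            // 'TS' is [fontFaceId, fontSize, 1/0 for bold, 1/0 for italic]
            const [fontFaceId, fontSize, bold, italic] = run.TS || [];
            runs.push({
                x: block.x,
                y: block.y,
                text: decodeText(run.T),
                fontFaceId,
                fontSize,
                bold: bold === 1,
                italic: italic === 1,
            });
        }
    }
    return runs;
}

// Hand-written sample page for illustration; not real parser output.
const samplePage = {
    Texts: [
        { x: 1.5, y: 2.25, clr: 0, A: "left", R: [{ T: "Form%201040", TS: [0, 12, 1, 0] }] },
    ],
};
console.log(describeRuns(samplePage));
```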

v0.4.5 added support for field attribute information defined in an external XML file. pdf2json always tries to load the field attributes XML file based on a file name convention: the field XML file for pdfFileName.pdf must be named pdfFileName_fieldInfo.xml and be in the same directory. If found, the field info is injected.
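
The naming convention can be expressed as a small path computation. The following is a minimal sketch; `fieldInfoXmlPath` is a hypothetical helper that only derives the expected file name and does not reflect how pdf2json itself locates the file:

```javascript
const path = require("node:path");

// Hypothetical helper: derives the field attributes XML path from a PDF path,
// following the pdfFileName_fieldInfo.xml convention in the same directory.
function fieldInfoXmlPath(pdfFilePath) {
    const dir = path.dirname(pdfFilePath);
    const base = path.basename(pdfFilePath, path.extname(pdfFilePath));
    return path.join(dir, `${base}_fieldInfo.xml`);
}

console.log(fieldInfoXmlPath("./test/pdf/fd/form/F1040EZ.pdf"));
```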

Dictionary Reference

For the same reason that the 'Page' object has "HLines" and "VLines" arrays, the color and style dictionaries help reduce the size of the payload when transporting the parsing output over the wire. This dictionary data contract lets the output reference a dictionary key rather than the full definition of a color or font style. It does require the client of the payload to have the same dictionary definitions in order to make sense of the output when rendering it on screen.

Color Dictionary

        const kColors = [
                '#000000',		// 0
                '#ffffff',		// 1
                '#4c4c4c',		// 2
                '#808080',		// 3
                '#999999',		// 4
                '#c0c0c0',		// 5
                '#cccccc',		// 6
                '#e5e5e5',		// 7
                '#f2f2f2',		// 8
                '#008000',		// 9
                '#00ff00',		// 10
                '#bfffa0',		// 11
                '#ffd629',		// 12
                '#ff99cc',		// 13
                '#004080',		// 14
                '#9fc0e1',		// 15
                '#5580ff',		// 16
                '#a9c9fa',		// 17
                '#ff0080',		// 18
                '#800080',		// 19
                '#ffbfff',		// 20
                '#e45b21',		// 21
                '#ffbfaa',		// 22
                '#008080',		// 23
                '#ff0000',		// 24
                '#fdc59f',		// 25
                '#808000',		// 26
                '#bfbf00',		// 27
                '#824100',		// 28
                '#007256',		// 29
                '#008000',		// 30
                '#000080',		// Last + 1
                '#008080',		// Last + 2
                '#800080',		// Last + 3
                '#ff0000',		// Last + 4
                '#0000ff',		// Last + 5
                '#008000',		// Last + 6
                '#000000'		// Last + 7
            ];
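
A client rendering the output needs to turn a 'clr' index or an 'oc' value into an actual color. The following is a minimal sketch based on the rules described in the Page object Reference (dictionary index first, then 'oc', then black as the default). `resolveColor` is a hypothetical helper, and `sampleColors` is a shortened copy of the dictionary used only for illustration:

```javascript
// Hypothetical helper: resolves the display color of a line, fill, or text object.
function resolveColor(item, colors) {
    // 'clr' is an index into the color dictionary when the color was found there
    if (Number.isInteger(item.clr) && item.clr >= 0 && item.clr < colors.length) {
        return colors[item.clr];
    }
    // 'oc' carries the original color when no dictionary entry matched
    if (typeof item.oc === "string") {
        return item.oc;
    }
    return "#000000";                    // documented default is black
}

// Shortened copy of the color dictionary, for illustration only.
const sampleColors = ["#000000", "#ffffff", "#4c4c4c"];
console.log(resolveColor({ clr: 1 }, sampleColors));
console.log(resolveColor({ oc: "#336699" }, sampleColors));
console.log(resolveColor({}, sampleColors));
```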

Style Dictionary

            const kFontFaces = [
               "QuickType,Arial,Helvetica,sans-serif",							// 00 - QuickType - sans-serif variable font
               "QuickType Condensed,Arial Narrow,Arial,Helvetica,sans-serif",	// 01 - QuickType Condensed - thin sans-serif variable font
               "QuickTypePi",													// 02 - QuickType Pi
               "QuickType Mono,Courier New,Courier,monospace",					// 03 - QuickType Mono - sans-serif fixed font
               "OCR-A,Courier New,Courier,monospace",							// 04 - OCR-A - OCR readable sans-serif fixed font
               "OCR B MT,Courier New,Courier,monospace"							// 05 - OCR-B MT - OCR readable sans-serif fixed font
            ];

            const kFontStyles = [
                // Face		Size	Bold	Italic		StyleID(Comment)
                // -----	----	----	-----		-----------------
                    [0,		6,		0,		0],			//00
                    [0,		8,		0,		0],			//01
                    [0,		10,		0,		0],			//02
                    [0,		12,		0,		0],			//03
                    [0,		14,		0,		0],			//04
                    [0,		18,		0,		0],			//05
                    [0,		6,		1,		0],			//06
                    [0,		8,		1,		0],			//07
                    [0,		10,		1,		0],			//08
                    [0,		12,		1,		0],			//09
                    [0,		14,		1,		0],			//10
                    [0,		18,		1,		0],			//11
                    [0,		6,		0,		1],			//12
                    [0,		8,		0,		1],			//13
                    [0,		10,		0,		1],			//14
                    [0,		12 
                       
                    
                     
                      


