Open-source project: modesty/pdf2json
Repository: https://github.com/modesty/pdf2json
Primary language: Java 94.4%
Introduction:
pdf2json

pdf2json is a Node.js module that parses PDF files and converts them from binary to JSON format. It is built on pdf.js and extends it with interactive form element and text content parsing outside the browser. The goal is to enable server-side PDF parsing with interactive form elements when wrapped in a web service, and to enable parsing local PDFs to JSON files when used as a command-line utility.

Install
npm install pdf2json

Or, install it globally:

npm install pdf2json -g

To update with latest version:

npm update pdf2json -g

To Run in RESTful Web Service or as Commandline Utility

Test

After install, run command line:

It scans and parses 260 PDF AcroForm files under ./test/pdf, running with the -s -t -c -m command-line options, and generates the primary output JSON, additional text content JSON, form fields JSON, and merged text JSON file for each PDF. It usually takes about 20 seconds on my MacBook Pro to complete; check ./test/target/ for outputs.

Test Exception Handlings

After install, run command line:

It scans and parses all PDF files under ./test/pdf/misc, also running with the -s -t -c -m command-line options. It generates the primary output JSON, additional text content JSON, form fields JSON, and merged text JSON file for 5 PDF files, while catching exceptions with stack traces for:

Test Streams

After install, run command line:

It scans 165 PDF files under ../test/pdf/fd/form, parses them with the Stream API, then writes output to ./test/target/fd/form. More test scripts with different command-line options can be found in package.json.

Code Example

const fs = require('fs'),
    PDFParser = require("pdf2json");

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
    // fs.writeFile needs a callback; report a failed write instead of ignoring it
    fs.writeFile("./pdf2json/test/F1040EZ.json", JSON.stringify(pdfData), err => {
        if (err) console.error(err);
    });
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Or, call directly with buffer:

fs.readFile(pdfFilePath, (err, pdfBuffer) => {
    if (!err) {
        pdfParser.parseBuffer(pdfBuffer);
    }
});

Or, use more granular page level parsing events (v2.0.0):

pdfParser.on("readable", meta => console.log("PDF Metadata", meta));
pdfParser.on("data", page => console.log(page ? "One page parsed" : "All pages parsed", page));
pdfParser.on("error", err => console.error("Parser Error", err));
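
The page-level events above deliver one page per "data" event and a falsy value once all pages are parsed. A minimal sketch of collecting those pages into one array follows. It assumes only that event sequence; `collectPages`, the stand-in `fakeParser`, and the sample metadata and page objects are hypothetical names used for illustration, not part of pdf2json.

```javascript
const { EventEmitter } = require("events");

// Collects page objects from "data" events and calls onDone once a falsy
// page signals that all pages have been parsed.
function collectPages(parser, onDone) {
    const pages = [];
    let meta = null;
    parser.on("readable", m => { meta = m; });
    parser.on("data", page => {
        if (page) {
            pages.push(page);
        } else {
            onDone(null, { meta, pages });
        }
    });
    parser.on("error", err => onDone(err));
}

// Stand-in emitter (hypothetical) that replays the event sequence shown above.
const fakeParser = new EventEmitter();
let result = null;
collectPages(fakeParser, (err, out) => { result = err ? { err } : out; });

fakeParser.emit("readable", { Title: "demo" });
fakeParser.emit("data", { Texts: [] });
fakeParser.emit("data", { Texts: [] });
fakeParser.emit("data", null);

console.log(result.pages.length); // → 2
```

With a real parser, the same function would be passed the PDFParser instance before loadPDF or parseBuffer is called.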

// Write only the raw text content of the PDF (note the constructor arguments)
const fs = require('fs'),
    PDFParser = require("pdf2json");

const pdfParser = new PDFParser(this, 1);

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
    fs.writeFile("./pdf2json/test/F1040EZ.content.txt", pdfParser.getRawTextContent(), () => { console.log("Done."); });
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");
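
The examples above all wait on the same pair of events, "pdfParser_dataReady" and "pdfParser_dataError", so it can be convenient to wrap them in a Promise. The following is a sketch under that assumption only: `parseToPromise` is a hypothetical helper, not a pdf2json API, and the emitter and data object used to exercise it are stand-ins.

```javascript
const { EventEmitter } = require("events");

// Wraps the dataReady / dataError event pair in a Promise.
// `parser` is any emitter raising those events; `start` begins parsing,
// for example p => p.loadPDF(pdfFilePath) with a real PDFParser.
function parseToPromise(parser, start) {
    return new Promise((resolve, reject) => {
        parser.once("pdfParser_dataReady", resolve);
        parser.once("pdfParser_dataError", errData => reject(errData.parserError));
        start(parser);
    });
}

// Exercise the wrapper with a stand-in emitter (hypothetical data shape).
const fake = new EventEmitter();
const pending = parseToPromise(fake, p => p.emit("pdfParser_dataReady", { Pages: [] }));
pending.then(data => console.log("pages:", data.Pages.length));
```

Using `once` rather than `on` keeps the listeners from firing again if the same parser instance is reused.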

// Write only the form field types of the PDF
const fs = require('fs'),
    PDFParser = require("pdf2json");

const pdfParser = new PDFParser();

pdfParser.on("pdfParser_dataError", errData => console.error(errData.parserError));
pdfParser.on("pdfParser_dataReady", pdfData => {
    fs.writeFile("./pdf2json/test/F1040EZ.fields.json", JSON.stringify(pdfParser.getAllFieldsTypes()), () => { console.log("Done."); });
});

pdfParser.loadPDF("./pdf2json/test/pdf/fd/form/F1040EZ.pdf");

Alternatively, you can pipe input and output streams (requires v1.1.4):

const fs = require('fs'),
    PDFParser = require("pdf2json");

// highWaterMark replaces the obsolete bufferSize option of fs.createReadStream
const inputStream = fs.createReadStream("./pdf2json/test/pdf/fd/form/F1040EZ.pdf", {highWaterMark: 64 * 1024});
const outputStream = fs.createWriteStream("./pdf2json/test/target/fd/form/F1040EZ.json");

inputStream.pipe(new PDFParser()).pipe(new StringifyStream()).pipe(outputStream);

With v2.0.0, the last line above changes to:

inputStream.pipe(this.pdfParser.createParserStream()).pipe(new StringifyStream()).pipe(outputStream);

For additional output streams support:

//private methods
#generateMergedTextBlocksStream() {
    return new Promise( (resolve, reject) => {
        const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".merged.json"), resolve, reject);
        this.pdfParser.getMergedTextBlocksStream().pipe(new StringifyStream()).pipe(outputStream);
    });
}

#generateRawTextContentStream() {
    return new Promise( (resolve, reject) => {
        const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".content.txt"), resolve, reject);
        this.pdfParser.getRawTextContentStream().pipe(outputStream);
    });
}

#generateFieldsTypesStream() {
    return new Promise( (resolve, reject) => {
        const outputStream = ParserStream.createOutputStream(this.outputPath.replace(".json", ".fields.json"), resolve, reject);
        this.pdfParser.getAllFieldsTypesStream().pipe(new StringifyStream()).pipe(outputStream);
    });
}

#processAdditionalStreams() {
    const outputTasks = [];
    if (PROCESS_FIELDS_CONTENT) { // needs to generate fields.json file
        outputTasks.push(this.#generateFieldsTypesStream());
    }
    if (PROCESS_RAW_TEXT_CONTENT) { // needs to generate content.txt file
        outputTasks.push(this.#generateRawTextContentStream());
    }
    if (PROCESS_MERGE_BROKEN_TEXT_BLOCKS) { // needs to generate json file with merged broken text blocks
        outputTasks.push(this.#generateMergedTextBlocksStream());
    }
    return Promise.allSettled(outputTasks);
}

Note: if the primary JSON parsing throws an exception, none of the additional streams will be processed. See p2jcmd.js for more details.

API Reference

function loadPDF(pdfFilePath);
If it fails, the event "pdfParser_dataError" is raised with an error object: {"parserError": errObj}. If it succeeds, the event "pdfParser_dataReady" is raised with the output data object: {"formImage": parseOutput}, which can be saved as a JSON file (on the command line) or serialized to JSON when running in a web service. Note: "formImage" is removed as of v2.0.0; see breaking changes for details.

function getRawTextContent();
Returns the text as a string.

function getAllFieldsTypes();
Returns an array of field objects.

Output format Reference

Current parsed data has four main sub objects to describe the PDF document.

Page object Reference

Each page object within the 'Pages' array describes page elements and attributes with 5 main fields:

v0.4.5 added support for field attribute information defined in an external XML file. pdf2json always tries to load a field attributes XML file based on a file name convention (the field XML file for pdfFileName.pdf must be named pdfFileName_fieldInfo.xml and be in the same directory). If found, the field info is injected.

Dictionary Reference

For the same reason that the 'Page' object has "HLines" and "VLines" arrays, the color and style dictionaries help reduce the size of the payload when transporting the parsed object over the wire. This dictionary data contract lets the output reference a dictionary key rather than the full definition of a color or font style. It does require the client of the payload to have the same dictionary definitions to make sense of the output when rendering it on screen.

const kColors = [
    '#000000', // 0
    '#ffffff', // 1
    '#4c4c4c', // 2
    '#808080', // 3
    '#999999', // 4
    '#c0c0c0', // 5
    '#cccccc', // 6
    '#e5e5e5', // 7
    '#f2f2f2', // 8
    '#008000', // 9
    '#00ff00', // 10
    '#bfffa0', // 11
    '#ffd629', // 12
    '#ff99cc', // 13
    '#004080', // 14
    '#9fc0e1', // 15
    '#5580ff', // 16
    '#a9c9fa', // 17
    '#ff0080', // 18
    '#800080', // 19
    '#ffbfff', // 20
    '#e45b21', // 21
    '#ffbfaa', // 22
    '#008080', // 23
    '#ff0000', // 24
    '#fdc59f', // 25
    '#808000', // 26
    '#bfbf00', // 27
    '#824100', // 28
    '#007256', // 29
    '#008000', // 30
    '#000080', // Last + 1
    '#008080', // Last + 2
    '#800080', // Last + 3
    '#ff0000', // Last + 4
    '#0000ff', // Last + 5
    '#008000', // Last + 6
    '#000000'  // Last + 7
];
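
On the client side, a dictionary key has to be turned back into an actual color before rendering. The sketch below shows one way to do that lookup. `resolveColor` and `kColorsExcerpt` are hypothetical names, the excerpt copies only the first four entries of the table above, and falling back to a caller-supplied color for unknown keys is an assumption rather than documented pdf2json behavior.

```javascript
// Short excerpt of the color dictionary above (indices 0 to 3 only).
const kColorsExcerpt = ['#000000', '#ffffff', '#4c4c4c', '#808080'];

// Resolves a dictionary key to a CSS color string. Keys that are not valid
// indices into the dictionary resolve to the fallback color instead.
function resolveColor(key, fallback = '#000000', dict = kColorsExcerpt) {
    const valid = Number.isInteger(key) && key >= 0 && key < dict.length;
    return valid ? dict[key] : fallback;
}

console.log(resolveColor(2));             // → #4c4c4c
console.log(resolveColor(99, '#123456')); // → #123456
```

Passing the full kColors array as the third argument would make the helper cover every key in the table.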

const kFontFaces = [
    "QuickType,Arial,Helvetica,sans-serif",                         // 00 - QuickType - sans-serif variable font
    "QuickType Condensed,Arial Narrow,Arial,Helvetica,sans-serif",  // 01 - QuickType Condensed - thin sans-serif variable font
    "QuickTypePi",                                                  // 02 - QuickType Pi
    "QuickType Mono,Courier New,Courier,monospace",                 // 03 - QuickType Mono - sans-serif fixed font
    "OCR-A,Courier New,Courier,monospace",                          // 04 - OCR-A - OCR readable sans-serif fixed font
    "OCR B MT,Courier New,Courier,monospace"                        // 05 - OCR-B MT - OCR readable sans-serif fixed font
];
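
Each entry in the kFontStyles table in this section is a four-number tuple of face index, size, bold flag, and italic flag, where the face index points into kFontFaces. As a rough illustration of how a client might use the two tables together, the sketch below converts one such tuple into a CSS `font` shorthand. `styleToCssFont` and `kFontFacesExcerpt` are hypothetical names, and the `pt` unit is an assumption: the source does not state what unit the size column uses.

```javascript
// Excerpt of the font face dictionary above (indices 0 and 1 only).
const kFontFacesExcerpt = [
    "QuickType,Arial,Helvetica,sans-serif",
    "QuickType Condensed,Arial Narrow,Arial,Helvetica,sans-serif"
];

// Converts a [face, size, bold, italic] style entry into a CSS font shorthand.
// The unit defaults to "pt", which is an assumption, not a documented fact.
function styleToCssFont([face, size, bold, italic], faces = kFontFacesExcerpt, unit = "pt") {
    const parts = [];
    if (italic) parts.push("italic");
    if (bold) parts.push("bold");
    parts.push(`${size}${unit}`, faces[face]);
    return parts.join(" ");
}

console.log(styleToCssFont([0, 12, 1, 0]));
// → bold 12pt QuickType,Arial,Helvetica,sans-serif
```
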

const kFontStyles = [
    // Face  Size  Bold  Italic    StyleID(Comment)
    // ----- ----  ----  -----     -----------------
    [0,  6, 0, 0], //00
    [0,  8, 0, 0], //01
    [0, 10, 0, 0], //02
    [0, 12, 0, 0], //03
    [0, 14, 0, 0], //04
    [0, 18, 0, 0], //05
    [0,  6, 1, 0], //06
    [0,  8, 1, 0], //07
    [0, 10, 1, 0], //08
    [0, 12, 1, 0], //09
    [0, 14, 1, 0], //10
    [0, 18, 1, 0], //11
    [0,  6, 0, 1], //12
    [0,  8, 0, 1], //13
    [0, 10, 0, 1], //14
    [0, 12