Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - pdfminer - extract text behind LTFigure object

I am extracting text from PDF files using the Python pdfminer library (see docs).

However, pdfminer seems unable to extract all of the text in some files and returns an LTFigure object instead. Judging from the position of this object, it "covers" some of the text, and that text is therefore not extracted.

Both the PDF file and a short Jupyter notebook with the extraction code are in a GitHub repository I created specifically for this question:

https://github.com/druskacik/ltfigure-pdfminer

I am not an expert on how PDF files work, but common sense tells me that if I can find the text with Ctrl+F in a browser, it should be extractable.

I have considered using another library, but I also need the positions of the extracted words (to use them as input to my machine learning model), and pdfminer seems to be the only library that provides this.

Question from: https://stackoverflow.com/questions/65926516/pdfminer-extract-text-behind-ltfigure-object


1 Reply


Given that you are open to other libraries, I suggest using pdftohtml from poppler-utils to convert the PDF to XML:

!apt-get install -y poppler-utils
!pdftohtml -c -hidden -xml document.pdf output.xml

It outputs an XML file containing the text together with the top, left, width, and height values of each text box. It had no issues with the text that pdfminer fails to recognize.
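
To get from that XML to word positions usable as model features, something like the following might work. It is only a minimal sketch: the helper name `read_text_boxes` and the `TextBox` tuple are made up for illustration, and it assumes the file has the usual pdftohtml layout of `<page number="...">` elements containing `<text top="..." left="..." width="..." height="...">` elements. Note that `top` and `left` appear to be measured from the top-left corner of the page, so they are not directly comparable with pdfminer's bounding boxes.

```python
import xml.etree.ElementTree as ET
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """One text box from the XML (hypothetical container, not part of poppler)."""
    page: int
    left: float
    top: float
    width: float
    height: float
    text: str


def read_text_boxes(xml_source) -> List[TextBox]:
    """Collect every non-empty <text> element with its position.

    xml_source may be a file path or a file-like object.
    Assumes each <page> carries a "number" attribute and each <text>
    carries "top", "left", "width" and "height" attributes.
    """
    root = ET.parse(xml_source).getroot()
    boxes = []
    for page in root.iter("page"):
        page_number = int(page.get("number"))
        for element in page.iter("text"):
            # itertext() also picks up text nested in <b>, <i> or <a> children.
            text = "".join(element.itertext()).strip()
            if not text:
                continue  # skip boxes that contain only whitespace
            boxes.append(
                TextBox(
                    page=page_number,
                    left=float(element.get("left")),
                    top=float(element.get("top")),
                    width=float(element.get("width")),
                    height=float(element.get("height")),
                    text=text,
                )
            )
    return boxes
```

Calling `read_text_boxes("output.xml")` (assuming that is the file the command above produced) would return a list of boxes that can be filtered by page or sorted by `top` and `left`. Each box usually holds a run of text rather than a single word, so splitting into words and estimating per-word positions would still be a separate step.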

