python - pdfminer - extract text behind LTFigure object

Posted Oct 7, 2021 in Technique by 深蓝 (71.8m points)

I am extracting text from PDF files using the Python pdfminer library (see docs).

However, for some files pdfminer fails to extract all of the text and returns an LTFigure object instead. Judging by the position of this object, it "covers" part of the text, which is why that text is not extracted.

The PDF file and a short Jupyter notebook with the extraction code are in a GitHub repository I created specifically for this question:

https://github.com/druskacik/ltfigure-pdfminer

I am not an expert on how PDF files work, but common sense tells me that if I can find the text with Ctrl+F in a browser, it should be extractable.

I have considered using another library, but I also need the positions of the extracted words (to use them as input to my machine learning model), and pdfminer seems to be the only library that provides this.

question from:https://stackoverflow.com/questions/65926516/pdfminer-extract-text-behind-ltfigure-object

1 Reply

深蓝 · Answer 1 · 2021-10-06T19:06:59+0000

Since you are open to other libraries, I suggest using pdftohtml from poppler-utils to convert the PDF to XML:

!apt-get install -y poppler-utils
!pdftohtml -c -hidden -xml document.pdf output.xml

This produces an XML file containing the text together with the top, left, width, and height of each text box. It had no trouble with the text that pdfminer fails to extract.
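To get the words and their positions back into Python, the XML can be read with the standard library. The following is a minimal sketch, not a definitive implementation: it assumes the output contains page elements with text children carrying top, left, width, and height attributes, as described above. The function name parse_text_boxes and the SAMPLE document are made up for illustration, so check them against your own output.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the XML described above; real files will differ.
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1263" width="892">
<text top="100" left="108" width="120" height="16" font="0">Invoice number</text>
<text top="130" left="108" width="80" height="16" font="0"><b>Total:</b> 42</text>
<text top="160" left="108" width="0" height="16" font="0"> </text>
</page>
</pdf2xml>"""


def parse_text_boxes(root):
    """Return one dict per non-empty text element: page, text, and box geometry."""
    boxes = []
    for page in root.iter("page"):
        page_number = int(page.get("number", "0"))
        for node in page.iter("text"):
            # A text element may wrap inline tags such as <b>, so join all nested text.
            content = "".join(node.itertext()).strip()
            if not content:
                continue  # skip whitespace-only boxes
            boxes.append({
                "page": page_number,
                "text": content,
                "top": float(node.get("top")),
                "left": float(node.get("left")),
                "width": float(node.get("width")),
                "height": float(node.get("height")),
            })
    return boxes


# For a real file, use: root = ET.parse("output.xml").getroot()
for box in parse_text_boxes(ET.fromstring(SAMPLE)):
    print(box["page"], box["left"], box["top"], box["text"])
```

Each dict holds one text box rather than one word, so splitting boxes into words (and estimating per-word positions) would still be a separate step if the model needs word-level coordinates.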
