Extracting order pre-definable? #18
Comments
Does the |
@zevio, if you delete the pdfbox-app*jar file cached by python-pdfbox (in |
I was about to correct my suggestion. Actually, I think the issue is not directly linked to the jar file version but to the -sort option, as you said previously. The same issue currently occurs with Apache Tika, which bundles PDFBox. However, calling setSortByPosition() does not seem to work on my end, and neither does changing the configuration file in Apache Tika. Still, using the -sort option with the jar file corrects most of my issues. Surprisingly, though, I obtained much better results with OCR (Pytesseract) for PDF content extraction.
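For anyone who wants to try the -sort route directly, here is a minimal sketch that calls the PDFBox command-line app from Python. It assumes the PDFBox 2.x `ExtractText` syntax (3.x uses a different subcommand layout) and that Java is on your PATH. The helper names `build_extract_command` and `extract_text`, and all file paths, are made up for illustration and are not part of python-pdfbox.

```python
import subprocess
from pathlib import Path


def build_extract_command(jar_path, pdf_path, txt_path, sort=True):
    """Build a PDFBox 2.x ExtractText command line (hypothetical helper)."""
    cmd = ["java", "-jar", str(jar_path), "ExtractText"]
    if sort:
        # -sort asks PDFBox to order the text by its position on the page
        cmd.append("-sort")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def extract_text(jar_path, pdf_path, txt_path, sort=True):
    """Run ExtractText and return the contents of the output text file."""
    cmd = build_extract_command(jar_path, pdf_path, txt_path, sort)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


# Example call (paths are placeholders, point them at your own files):
# text = extract_text("pdfbox-app.jar", "input.pdf", "output.txt")
```

Running it once with `sort=True` and once with `sort=False` on the same page is a quick way to check whether the ordering problem really comes from that option.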
Hi Guys,
Just wondering whether the text extraction order can be defined for a PDF file. As pointed out here, is there a similar setting to adjust the extraction order?
This image shows the error.
AUB_Financials_Dec_2018_pg9.pdf
Any insights would be much appreciated.
Thanks.
Luke