While developing Paperity, we encountered the problem of extracting full text from PDF documents. Vast majority of academic papers are published as PDFs, and we wanted to unlock their contents and make them searchable in Paperity.
Extracting text from a PDF document is one of the hardest practical problems that seem easy on the first sight. It should not be much different than using a word processor, right? Absolutely wrong. PDF format is designed for laying out pages and faithfully reproducing the same visual layout everywhere, be it a screen or a printer. Therefore, it does not consist of a continuous stream of letters, words, and sentences; but of pages and objects with specific sizes and coordinates relative to the page. This is a very low-level representation that must be thoroughly preprocessed before it can be analyzed as a complete text. Moreover, PDF authoring tools apply different typographical tricks while converting the text to PDF. For example, letters “f” and “i” are typically joined in a single “glyph”, to make them look better when printed, so that, for instance, the word “justification” in your word processor becomes “justification” (note the single character that is a combination of “f” and “i”) when converted to PDF. This adds another level of complexity while extracting text. Split words at the end of lines pose another problem.