Extracting Content from Academic Papers in PDF Format

While developing Paperity, we encountered the problem of extracting full text from PDF documents. Vast majority of academic papers are published as PDFs, and we wanted to unlock their contents and make them searchable in Paperity.

Extracting text from a PDF document is one of the hardest practical problems that seem easy on the first sight. It should not be much different than using a word processor, right? Absolutely wrong. PDF format is designed for laying out pages and faithfully reproducing the same visual layout everywhere, be it a screen or a printer. Therefore, it does not consist of a continuous stream of letters, words, and sentences; but of pages and objects with specific sizes and coordinates relative to the page. This is a very low-level representation that must be thoroughly preprocessed before it can be analyzed as a complete text. Moreover, PDF authoring tools apply different typographical tricks while converting the text to PDF. For example, letters “f” and “i” are typically joined in a single “glyph”, to make them look better when printed, so that, for instance, the word “justification” in your word processor becomes “justification” (note the single character that is a combination of “f” and “i”) when converted to PDF. This adds another level of complexity while extracting text. Split words at the end of lines pose another problem.

Continue reading

Exception handling basics

bugOne of the useful concepts that came with modern languages is exception handling. In a nutshell, it is the concept of catching exceptions as they occur, instead of checking the inputs and return values of every function called. Consider this simple example:

int divide(int a, int b) {
if (b==0) {
...show an error message (division by zero) to the user...
else {
return a/b;
}
}

Now, you should never show a message inside a function, but for simplicity, please bear with me. Consider the following equivalent:

int divide(int a, int b) {
try {
return a/b;
catch (ArithmeticException e) {
...show an error message (division by zero) to the user...
}

In the second example, you are not checking for the inputs because somebody else already does. The integer division operator (/) is programmed to throw ArithmeticException when the divisor is 0, so it is enough to ‘catch’ this exception instead.

Continue reading

Can you hear the sound? A brief introduction to Soundex.

soundexOne of the common and seemingly simple problems in software development is to compare two strings. Assume that we are writing a contact list application and trying to prevent the same person from being added to the contacts more than once. Let’s say the operator tried to add a new name, new_name into the database. Consider the following Python-looking pseudo code:

for existing_name in existing_names:
  if existing_name == new_name:
    print "Error adding " + new_name + ": Person already exists"

That was really simple, right? What happens if we try to add a “JOHN SMITH” into the database, but there already exists a “John Smith”? It will get added, because the interpreter thinks that they are different names, because equality operator recognizes uppercase and lowercase versions of the same letter as different. Let’s modify our code to be smarter (actually dumber) about uppercase and lowercase:

for existing_name in existing_names:
  if existing_name.lower() == new_name.lower():
    print "Error adding " + new_name + ": Person already exists"

That instructed the computer to convert everything (both new name and the existing name) into lower case, then compare the two strings. That way, “JOHN SMITH”, “John Smith”, or even “jOhN SMiTh” all become “john smith” before comparison. Nice. It even compiles, so ship it. Wait! Here is our first customer complaint:

Dear Developer,
Your product is great, still I have a little problem. My database is full of duplicate names, some mistyped by our data entry staff. An example list is attached. Can you please send us an updated version that prevents this from happening?

Example names:
Chelsea McAlister
Chelsie McAlister
Chelsey MacAllister
Chelsee MacAlister

Your Customer

What should you do? You can reduce it a bit by educating the data entry staff, but it is impossible to prevent it from happening altogether. There must be a way that you can implement such that the application detects and warns about these collisions. What is the common thing between these names? They have many letters in common, but not all. Sometimes the length of the strings are the same, sometimes they are different. But why does a human think that they are the same person? Because although they are written differently, they are pronounced the same, or at least very similar. Here is our clue! They sound the same!

Continue reading

Floating point precision

float

A floating point. Well, a floating doughnut.

How do you compare two numbers for equality? Use the equality operator of your language-of-choice, right? Well, not exactly. For example, consider the following code (in Java):

int i = 2 * 5;
if (i == 10) {
...do something
} else {
...do something else
}

and this one:

float f = 2 * 5;
if (f == 10) {
...do something
} else {
...do something else
}

What would be the end result? The correct answer is that the first if block always returns true, while we cannot be so sure about the second one. But why?

Continue reading

String interning

javaMany beginners of Java are taught never to use == operator for string equality, but to use .equals() method instead. :

if (strMyString == "ok") {
... do something
}

The reason is that when comparing primitives, == operator checks if their value is the same. But when comparing objects (like Strings), it checks if two are actually the same object. If not, comparison returns false even if the two objects have exactly the same content.

After reading the above explanation, you may be amazed that the following code returns true:

String s1 := "foobar";
String s2 := "foo" + "bar";
return (s1 == s2);

How can the == operator think that s1 and s2 are the same object when they are clearly not?

Continue reading