Google Now Indexing Image Text

Google is now indexing scanned documents in search results. In other words, if you scan a page of text, save it as a PDF and post it to the web, it will be treated like an actual page of text rather than an image. In a post on the Official Google Blog, Product Manager Evin Levey reveals a little of what Google’s doing:

“In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document, so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words: words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.

While we’ve indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however; it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.”
