I answered this question on StackOverflow, and it was too important not to duplicate here.

QUESTION
=================

I am extracting text from OCRed TIFF files using a library and dumping it into a database. The text I am extracting comes from FORMS with fields like NAME, DOB, COUNTRY, etc. Since OCR does not know the difference between an actual value and its label, it just dumps all the text. Now I have text in the DB in the following format:

Name: MyName Address: My Address

Now the next step is to extract values like MyName and My Address from the DB. The document types may vary, hence a generic parser might not work.

What would you suggest in this situation? Should I write different parsers? I am working in .NET.

ANSWER
=================

Hello. This is a common question, and the OCR industry found a generic solution to it years ago. The solution branches into two separate directions: OCR for form processing, otherwise known as data extraction, can use one of the following two methods.

TEXT PARSING – considered an old approach, but one that still works in many situations. You are obviously experienced with it and know the pros and cons, so I will be brief. The pro is that it requires no other technology, just generic programming. The cons are that a) it requires programming, b) it is not very adaptive to variations, c) if the formatting changes over time, you may have to rewrite some spaghetti or legacy code, and d) it requires a near-perfect OCR result in order to find data successfully (e.g. a misrecognized label may result in missing data). In other words, it is great for quick and simple solutions, but not very adaptive to variations and changes. I did a lot of it back in my school and early programming days.
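To make the trade-off concrete, here is a minimal sketch of the text parsing approach in Python (the question is about .NET, but the idea carries over directly). The label list and the function name `parse_fields` are my own assumptions for illustration, not part of any library.

```python
import re

# Hypothetical label list; a real project would list the labels of its own forms.
LABELS = ["Name", "Address", "DOB", "Country"]

def parse_fields(text, labels=LABELS):
    """Split 'Label: value Label: value' text into a {label: value} dict."""
    canonical = {label.lower(): label for label in labels}
    # Any known label followed by a colon marks the start of a field.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(label) for label in labels) + r")\s*:",
        re.IGNORECASE,
    )
    matches = list(pattern.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[canonical[match.group(1).lower()]] = text[match.end():end].strip()
    return fields

print(parse_fields("Name: MyName Address: My Address"))
# {'Name': 'MyName', 'Address': 'My Address'}

# A single OCR mistake in the label is enough to lose the field entirely.
print(parse_fields("Narne: MyName"))
# {}
```

The second call shows con d) in action: the parser has no way to recover when the label itself is misread.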

DYNAMIC DATA CAPTURE – using specialized technology to locate data dynamically. Some technologies do it at the image level and feed clean data to your database. Other technologies do it at the post-OCR text level. I am most familiar with data capture at the image level, as it has several key benefits for the complex projects I have done, so I will talk more about that. The only con is that you may need to invest in a specialized software tool, but it is a tool that provides a lot of benefit. Even a plumber has to invest in tools to do his job. The benefit of image-based data extraction is that it does not depend on perfect post-OCR text: OCR output is not always perfect, so a text-based extractor has to accommodate mistakes, something the old text parsing approach cannot do. Also, in text parsing you can use only text, while in image parsing you have a ton of other information, such as lines (like in table columns), white gaps between blocks of text (such as paragraph separators), pictures, logos, checkboxes, etc.

For example, I heavily use ABBYY FlexiCapture for these types of extraction (http://www.wisetrend.com/abbyy_flexicapture.shtml). That tool allows me to define what data I need to extract and how it should be extracted. You would do something like this:

  1. Identify the format style, if more than one. If you have multiple formats, you can apply a different set of extraction rules per format.
  2. Locate the label “Name:” or some variation of it, using fuzzy search or rules to accommodate OCR mistakes, if any. Look in a certain area if more than one name occurs on the page.
  3. Locate the area that contains chars of a certain type next to the found label Name. Those chars have to fit certain criteria to be accepted as the MyName field, and all those criteria are defined through the UI (or scripting if you want).
  4. OCR the content of the area with the MyName chars. Another benefit here is that you no longer use generic OCR. You can use very specific OCR settings that apply only to your MyName area, which increases the accuracy of the OCR and the data. This is most useful for specialized data, such as part numbers, codes, addresses, etc. You can use regular expressions, dictionaries, and rules. You can be specific per field. That is not possible when full-page OCR is used.
  5. Send the clean data to DB. Before you send the data, if you want to guarantee OCR quality, most tools usually have some kind of Verification capability to visually check (requires a human) OCRed text against the image.
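The fuzzy label search in step 2 can be sketched at the text level with nothing but the Python standard library. This is only an illustration of the idea and not how FlexiCapture works internally; the function names, the 0.7 threshold, and the sample OCR output are my own assumptions.

```python
from difflib import SequenceMatcher

def find_label(words, label, threshold=0.7):
    """Return the index of the word most similar to label, or None if nothing is close."""
    best_index, best_score = None, 0.0
    for i, word in enumerate(words):
        score = SequenceMatcher(None, word.lower(), label.lower()).ratio()
        if score >= threshold and score > best_score:
            best_index, best_score = i, score
    return best_index

def extract_value(words, label, all_labels):
    """Return the words between the fuzzy-matched label and the next label."""
    start = find_label(words, label)
    if start is None:
        return None
    # Positions of the other labels mark where this field's value ends.
    stops = [find_label(words, other) for other in all_labels if other != label]
    later = [s for s in stops if s is not None and s > start]
    end = min(later) if later else len(words)
    return " ".join(words[start + 1:end])

# Hypothetical OCR output with a misread "Name:" and "Address:".
words = "Nane: MyName Addre55: 12 Main St".split()
labels = ["Name:", "Address:"]
print(extract_value(words, "Name:", labels))     # MyName
print(extract_value(words, "Address:", labels))  # 12 Main St
```

Note that a sketch like this is easily fooled when a value looks like a label (an address line containing the word "Address", for instance), which is exactly why real tools also restrict the search to a certain area of the page.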

In general, setting up these processes is much quicker and more liberating than code-based text parsing. There are plenty of scripting options and APIs available for those who want to go past the UI or need additional automation.

I have only scratched the surface, but hopefully that provides a start for your research and decision. If I have not addressed anything, please feel free to let me know.

Ilya Evdokimov, Data Capture Expert for 10+ years, CDIA+ Certified

My blog with more data capture stuff is here: http://wisetrend.com/ocr_and_data_capture_blog/

For years we have been polishing one of the most demanded and demanding areas of data capture and OCR – AP department automation.  Processing and automating the numerous variations of Invoices, Purchase Orders and Agreements is still one of the data capture industry’s larger challenges, but we are proud to offer our proven solution for this task.

About our Invoice and Purchase Order Data Capture and Processing Approach: Invoices are considered some of the more complex documents.  Luckily the technology is capable enough today that tedious text parsing is no longer necessary, and there is a set of proven methods.  Over the years we have gone through numerous projects and revised our methods of setting up those projects, and today I believe we have the best balance between the effort needed and the capabilities achieved by using the latest software features.  We bypass the single template approach, which in the past proved to be an unpredictable trap of professional services.  Today we have a repeatable and easily quantifiable method where, after the initial implementation, we can accurately estimate any further need for professional services.  Through a special hands-on training process we pass on the continuation of the setup to the client, giving them control and empowering their in-house capabilities.  In fact, the last project was run by accountants trained on FlexiCapture template creation, not IT.  Please watch out for a press release on this subject in the next few days.

This process has worked well for all participants in the recent past, and we plan to continue polishing it in the future.

Admiring the new iPhone 4G with its new and improved camera and high-definition, crystal clear screen, I immediately came up with dozens of ideas for what I could do with it. As the quality of hardware improves, what used to be negligible becomes more and more pronounced.

Think about this – 20 years ago a photograph was a photograph and no one would question those pesky pixels. With the birth of computers, digital picture viewing, and digital picture taking, picture quality became one of the most important concerns for many. As the technology improves, it only encourages an infinite race towards perfection.

Today, as the screen of the iPhone 4G improves the user’s perception of the picture, shadowy gray pictures no longer cut it.

iPhone 3GS picture of random text for OCR

Instead, we desire crispness, high-quality contrast, and most importantly appeal to our ultimate judge – the eye. A simple submission to an online OCR system through e-mail or an API can return the same image within seconds – but in a different light. The image can be deskewed (lines straightened), despeckled (pixel noise removed), and binarized (all colors reduced to black and white). This is obviously not right for pictures of people and buildings, but it does wonders on text documents, business cards, and signs.
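Of the three cleanup steps, binarization is the easiest to show in a few lines. The sketch below is only a toy: it uses a single global threshold on a tiny grayscale grid, and the function name and sample pixel values are made up for illustration. Real OCR engines typically use more adaptive methods that cope with shadows and uneven lighting.

```python
def binarize(gray, threshold=None):
    """Turn rows of 0-255 gray values into pure black (0) and white (255)."""
    pixels = [p for row in gray for p in row]
    if threshold is None:
        # Simplest possible choice: the average brightness of the whole image.
        threshold = sum(pixels) / len(pixels)
    return [[255 if p > threshold else 0 for p in row] for row in gray]

# A made-up 2x3 "photo": light gray paper with two darker ink pixels.
photo = [
    [200, 190, 60],
    [210, 50, 205],
]
print(binarize(photo))
# [[255, 255, 0], [255, 0, 255]]
```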

Image after being cleaned up through an Online OCR engine

Now one can fully utilize the new sharp screen of the iPhone 4G to view these types of images. Of course, this benefit is useful only in those cases where looking at images is desired. Otherwise, I would take it one step further and view the actual OCR result for truly digital text, the sharpest possible.

Result from OCR conversion in MS Word document

This month we are starting a Summer 2010 series of Tips & Tricks related to the OCR and form processing industry in general, with a touch of mobile image processing, 3rd party tools and utilities, and best practices.  For the past 10+ years I have been building projects for a wide range of companies and have acquired a unique perspective on what works and what does not, even though it sounds great on paper.  Having used mostly ABBYY OCR for these implementations, I plan to cover the ABBYY Recognition Server and ABBYY FlexiCapture product lines, but most of these generic approaches and tricks should work for all other OCR and Data Capture systems out there.  Stay tuned for new information in this series.

Happy OCR-ing,
Ilya Evdokimov, CDIA+

When I talk to people about the unique technique of printing text documents to image just for the purpose of running optical character recognition (OCR) or data capture on them, they are rightfully confused and think I’m a little nutz.

Why would you ever convert an already digital document back to image? I promise it’s not because I’m so fond of OCR; it actually has its purpose.

Language Detection: By converting a document to image for OCR, I can check the language of each word in the document. While I would much prefer to use a language detection tool on a digital file, no robust tool exists to do this at volume. The unique aspect of OCR engines is that they contain morphology and dictionaries. This is where OCR has improved its accuracy in the past 5 years. OCR engines attempt to identify the language of text in order to better read the document. Because this mechanism is already built into the engine, if I convert a digital file to image and OCR it, I can tell you what languages exist in that document. Additionally, while the font is a clear indicator of language, if it is not accompanied by the proper language encoding, it will not tell a digital process what the language is; in OCR there is no need for such an encoding.

Normalization of digital formats: While a PDF created in Acrobat and a PDF created in a third party tool look identical to the viewer, internally these PDF files are very different. In order to accurately parse a PDF file digitally, you need a standard format. If you do not have a standard format, you are dealing with variations both in how the document looks and in its internal structure. This becomes an overwhelming number of variations. For example, a collection of invoices has as many variations as there are invoice layouts multiplied by the number of PDF-generating applications in use. However, if you OCR the PDF and parse the result, rather than parsing it digitally, you are dealing only with the variants that exist in the invoices themselves.

However crazy it sounds, the above two are real scenarios and there are many more. I doubt that these problems will always exist, but it makes you think twice about crazy statements such as printing a digital document to image just so you can OCR it.

Chris Riley – Industry Expert

In some organizations, document preparation prior to scanning is the largest time cost in their document entry process. In all organizations, it’s an important consideration. Document preparation is the process of sorting, organizing, and preparing documents to give them the best chance of a successful scan and of accuracy in downstream software processes. Document preparation can be as simple as dividing pages into stacks small enough for a document scanner to handle, or as complex as removing staples, opening envelopes, and separating documents with separator pages.

As recognition technology advances, the need for document preparation diminishes. New technologies are allowing for automatic document separation based on templates or keywords, automatic document rotation, annotation, sorting, etc. The challenge for organizations becomes picking what document preparation step to use technology on versus manual labor. This has been a challenging question and as new technologies surface, it becomes even more challenging.

If an organization keeps its focus on return on investment, the path should become clear. A complete evaluation of the technologies will show the accuracy and the percentage of automation that can be achieved, and the amount of time and cost it will save. The tricky part of the evaluation is really in understanding the environment. Studying how document preparation is currently done, and listing all the preparation steps required for document entry, should be fairly straightforward. Listing the document preparation features that can be handled by software, and the products that have them, is a little more complex and requires an organization to spend dedicated time on it. Separating documents and barcoding them tends to be the biggest cost, and is the low-hanging fruit for automation. OCR software can determine where a document starts and ends using keywords, instead of a person manually placing separator pages or barcodes on the document.
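Keyword-based separation is simple enough to sketch. The example below assumes each scanned page has already been OCRed to plain text, and that a new document always begins on a page containing one of a few known phrases; the function name, the keywords, and the sample pages are all made up for illustration.

```python
def split_documents(pages, start_keywords):
    """Group a stream of OCRed pages into documents.

    A page containing any start keyword begins a new document;
    every other page is attached to the document before it.
    """
    documents = []
    for page_text in pages:
        lowered = page_text.lower()
        is_start = any(keyword.lower() in lowered for keyword in start_keywords)
        if is_start or not documents:
            documents.append([])
        documents[-1].append(page_text)
    return documents

# Hypothetical batch of four scanned pages.
pages = [
    "INVOICE No. 1001 Acme Corp",
    "Line items continued",
    "PURCHASE ORDER 77",
    "Invoice No. 1002 Globex",
]
batch = split_documents(pages, ["invoice no", "purchase order"])
print([len(doc) for doc in batch])
# [2, 1, 1]
```

In practice the keywords would be matched with some tolerance for OCR mistakes, and a template match is often more reliable than a single phrase.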

For most organizations the result is a combination of manual and automatic steps. The ultimate goal would be to automate every step in document preparation that can be automated and leave only those that have to be manual, such as placing documents in a scanner.

Chris Riley – Industry Expert

The two most common questions organizations ask when they are seeking document automation technology are “how fast is it?” and “how accurate is it?”. Many don’t realize that the two are at odds with each other most of the time. The more accurate a system, the slower it is, and the faster it is, the less accurate. But there is one fatal mistake in all these calculations, and that mistake is how efficiency is calculated.

Most companies who trial data capture calculate performance on the slowest step, which is optical character recognition (OCR). Literally, companies will hit the “read” button and immediately start timing until the read is complete. This is what is considered the speed of the document automation system. This is incorrect.

There is no question that OCR can be a tremendous bottleneck in the entire entry process, but poor OCR could create an even greater bottleneck. Imagine an OCR engine that reads a document with 100 characters in 1 second as compared to an engine that reads the same 100 characters in 3 seconds. Your initial thought is that the first engine would be better, but consider that the first engine may be 60% accurate, leaving 40 characters to be manually entered, and the other engine 98% accurate, leaving 2 characters to be manually entered or corrected. If you consider an average entry speed of 1.6 characters per second, then the 40 characters will take an additional 25 seconds to enter, for a total entry time of 26 seconds for the faster engine. For the slower engine it will take an additional 1.25 seconds to enter or edit the 2 wrong characters, for a total entry time of 4.25 seconds. This means that end-to-end, the slower engine is about 6 times faster in the document automation process than the faster engine.
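The arithmetic is easy to redo with your own numbers. Here is a small Python sketch of the same calculation; the function name and parameters are my own, and the model is deliberately crude (every wrong character costs the same amount of typing time).

```python
def total_entry_seconds(chars, ocr_seconds, accuracy_percent, typing_cps=1.6):
    """OCR time plus the time a person needs to key in the misread characters."""
    wrong_chars = chars * (100 - accuracy_percent) / 100
    return ocr_seconds + wrong_chars / typing_cps

# The two engines from the example: 100 characters per document.
fast_engine = total_entry_seconds(100, ocr_seconds=1, accuracy_percent=60)
slow_engine = total_entry_seconds(100, ocr_seconds=3, accuracy_percent=98)

print(f"fast engine: {fast_engine:.2f} s")   # 1 s of OCR + 40 chars of typing
print(f"slow engine: {slow_engine:.2f} s")   # 3 s of OCR + 2 chars of typing
print(f"end-to-end ratio: {fast_engine / slow_engine:.1f}x")
```

Plugging in the accuracy and read times from your own trial gives a much fairer comparison than timing the “read” button alone.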

This simple calculation illustrates the folly in assuming that a slower OCR time makes for a slower overall process. Usually focusing on accuracy has the greatest benefit for an organization, unless you are improving the speed of a slower engine with hardware, or the two engines are too close in accuracy to see a benefit.

Chris Riley – Industry Expert

Users of OCR might be surprised to learn that one of the initial and biggest drivers for the technology has yet to be fully actualized. It was believed, soon after the development of omni-font optical character recognition by Ray Kurzweil, that the greatest use of the technology would be in assisting language translation. Even Kurzweil himself very quickly used OCR technology simply to convert scanned images to text so that it could be read aloud for the blind. Some of the developers of OCR technology did not even start with any specialty in imaging but actually specialized in language and dictionary software.

The relationship of OCR technology to language is very interesting and several levels deep. For example, the modern engines show greatest improvements in accuracy by deploying more statistical language models and dictionaries vs. core recognition algorithms. In this method, language is improving the accuracy of OCR technology. For example the letter “e” in English is more frequent than the letter “c”, so in the case where there is a question between an “e” and “c”, this information is useful.
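The “e” versus “c” case can be sketched as a tiny calculation: weigh the recognizer’s confidence in each candidate by how common that letter is in the language. The frequencies below are approximate published figures for English text; the function name, the fallback value, and the confidence numbers are made up for illustration, and real engines use far richer models (dictionaries, word and n-gram statistics).

```python
# Approximate relative letter frequencies in English text.
LETTER_FREQ = {"e": 0.127, "c": 0.028}

def pick_char(candidates, default_freq=0.01):
    """Choose among candidate characters.

    candidates maps each character to the recognizer's confidence (0..1).
    The confidence is weighted by how frequent the letter is in the language.
    """
    return max(
        candidates,
        key=lambda ch: candidates[ch] * LETTER_FREQ.get(ch, default_freq),
    )

# The image is ambiguous: the language statistics tip the balance to "e".
print(pick_char({"e": 0.5, "c": 0.5}))    # e

# The image is clear: strong visual evidence still wins.
print(pick_char({"e": 0.05, "c": 0.95}))  # c
```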

But the most sought-after initial use of OCR was simply to get digital text in order to convert it to another language. The dream was to enable travelers to take pictures of foreign signs or documents and have them converted on the fly to their native language. While this was one of the biggest drivers for the further development of OCR, the roadblocks of photography, accurate language translation, and the poor processing power of mobile devices were overlooked. Because of this, the primary use of OCR became document automation and a means to reduce the cost of data entry. This focus changed the way the engines were developed, with the new focus being document OCR and not photographic OCR.

I’m confident that the dream will eventually be actualized, but I also suspect that many changes to the way OCR engines operate, and the appearance of new specialized engines, will happen first.

Chris Riley – Industry Expert

The search for greater accuracy in document automation never stops. It’s true that OCR technology has become so advanced that the jumps in accuracy with each new release are not what they were 10 years ago. Now, new versions of OCR engines contain enhancements for low quality documents and vertical document types, but general OCR can’t get much better. Because of this, modern integrations need to find new tricks. This blog is full of them, but I’m about to explain just one more: OCRing inverted text.

OCRing inverted text is nothing new. Many document types have regions where white text is printed on a black background. The modern engines have an ability to read this text. Typically it’s not as accurate as OCR of black text on a white background, but it has its unique benefits, especially with complex document types such as EOBs and driver’s licenses.

There is a trick in using inverted text OCR to increase overall OCR accuracy. The method is to first OCR a document normally, then use imaging technology to invert the image. When you invert the image, the black text on a white background switches to white text on a black background. Once the inversion is done, run OCR again. By comparing the two OCR results, you have essentially voted the same engine against itself with little effort.

Large-volume processing environments can deploy this trick without loading a second OCR engine or applying different settings. It’s important to note that when using this technique, how you compare the two results is as important as the process itself. Typically you will assign more weight to the original version of the document than to the inverted one. There you have it, one more tool for increasing the OCR accuracy of the engine you already use.
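Here is a rough sketch of the inversion and the weighted comparison in Python. It assumes that the engine reports a confidence per word and that both passes produce the same number of words in the same order; the function names, the weights, and the sample results are all made up for illustration, and the actual OCR calls are left out because they depend on the engine you use.

```python
def invert(gray):
    """Invert a grayscale image given as rows of 0-255 values."""
    return [[255 - p for p in row] for row in gray]

def vote(original, inverted, original_weight=1.0, inverted_weight=0.8):
    """Merge two OCR passes word by word.

    Each pass is a list of (word, confidence) pairs. Where the passes
    disagree, the word with the higher weighted confidence wins, and
    the original pass is trusted a little more than the inverted one.
    """
    merged = []
    for (word_a, conf_a), (word_b, conf_b) in zip(original, inverted):
        if word_a == word_b or conf_a * original_weight >= conf_b * inverted_weight:
            merged.append(word_a)
        else:
            merged.append(word_b)
    return merged

# Hypothetical results of the normal pass and the inverted pass.
normal_pass = [("Invoice", 0.95), ("T0tal", 0.40), ("Due", 0.90)]
inverted_pass = [("Invoice", 0.90), ("Total", 0.85), ("Duc", 0.60)]
print(vote(normal_pass, inverted_pass))
# ['Invoice', 'Total', 'Due']
```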

Chris Riley – Industry Expert

I often speak of unique uses of OCR, and here is yet another. OCRing video files! But why? Part of the management of rich media assets is indexing these files. Technologies such as speech recognition and optical character recognition give a greater index and search value to rich media.

By using OCR technology to find and extract text from video frames, the data can be stored as meta-data. In the simplest scenario, this is a text file that accompanies the video file. More complex environments will even tell you the minute and second at which the text occurs. Because this is not a traditional use of the technology, some special considerations apply.

First is converting and separating frames into individual image files. For the OCR to be effective it needs to work on a series of images. Although a video is only a sequence of images displayed at a high rate of speed, it’s still somewhat of a challenge to convert video files such as MPEG to a series of images. Not only that, dealing with motion blur that might occur in some frames will also be a problem.

The second challenge is dealing with frames that are repeats. Essentially, because there are so many similar images that are only slightly different from each other, the text on a series of frames might not change. A better OCR workflow will account for this and not repeat the text for every frame.

The final challenge is dealing with the variations of fonts, and often small sizes. This requires an OCR engine with specific settings for specialized OCR, and one that is very accurate on complex low quality documents.
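Collapsing repeated frames and recording when each piece of text first appears can be sketched in a few lines of Python. The sketch assumes the frames have already been extracted and OCRed, one text string per frame in order, and that the frame rate is known; the function name and the sample data are made up for illustration.

```python
def index_video_text(frame_texts, fps=25):
    """Build (mm:ss, text) entries, skipping frames that repeat the previous text."""
    entries = []
    previous = None
    for frame_number, text in enumerate(frame_texts):
        # Normalize whitespace so tiny OCR spacing differences do not count as changes.
        normalized = " ".join(text.split())
        if normalized and normalized != previous:
            seconds = frame_number // fps
            entries.append((f"{seconds // 60:02d}:{seconds % 60:02d}", normalized))
        previous = normalized
    return entries

# Hypothetical OCR output: 2 s of a caption, 1 s of nothing, 1 s of another caption.
frames = ["BREAKING NEWS"] * 50 + [""] * 25 + ["Weather  Report"] * 25
print(index_video_text(frames))
# [('00:00', 'BREAKING NEWS'), ('00:03', 'Weather Report')]
```

In real footage consecutive frames rarely OCR to exactly the same string, so the equality check would need to be replaced with a fuzzy comparison.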

I expect that in the future, this technique in conjunction with speech recognition will be used in eDiscovery, content management, and robust search of rich media files.

Chris Riley – Industry Expert
