TDWI Upside - Where Data Means Business

Improving Data Accuracy with Robotic Process Automation

We examine two important areas where automation can improve OCR performance.

Many existing OCR implementations do not take advantage of true automation. The reasons can include a lack of available staff to tend to the system, software that cannot support the necessary automation, or a lack of awareness that such automation is even possible. In my previous article (Using OCR: How Accurate is Your Data?), I introduced a way to achieve maximum automation through field-level confidence scores. Confidence scores are software-generated values attached to each field-level OCR answer that can be used to judge whether the answer is likely to be accurate.

For Further Reading:

Using OCR: How Accurate is Your Data?

The Future of Text Analytics

Trustworthy Data: The Goal of Data Quality and Governance

If OCR software generates consistent and reliable confidence scores, correct answers will generally score higher than incorrect ones. That lets you select a "threshold": answers scoring at or above it are mostly correct, and answers scoring below it are mostly incorrect and should be reviewed. OCR software that cannot provide reliable confidence scores forces your organization to review every field.
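As a rough illustration of how a threshold divides the work, the sketch below routes field-level answers either to automatic acceptance or to manual review. The `FieldResult` type, the 0.0 to 1.0 score scale, and the 0.80 threshold are all assumptions made for the example; real OCR products expose scores in their own formats.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    field: str
    value: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


def route(results, threshold):
    """Split OCR field results into auto-accepted and needs-review lists."""
    accepted = [r for r in results if r.confidence >= threshold]
    review = [r for r in results if r.confidence < threshold]
    return accepted, review


# Hypothetical sample data; the values and scores are invented.
results = [
    FieldResult("name", "John Smith", 0.97),
    FieldResult("zip", "9021O", 0.41),
    FieldResult("city", "Denver", 0.88),
]
accepted, review = route(results, threshold=0.80)
```

Only the fields in `review` would need an operator's attention; everything in `accepted` flows straight through.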

If you feel overwhelmed with the work required to truly enjoy automation and streamlined data handling, you're not alone. In our engagements with clients both large and small, we find that many businesses are using antiquated OCR technology that requires additional labor-intensive tasks to get the necessary data into their business systems.

The key to reducing your workload and improving efficiency is automation -- more appropriately called "robotic process automation," or RPA for short.

RPA is simply the automation of human tasks by software programmed with specific rules. For instance, creating an email account for a new employee was once a purely manual process. With RPA, software encoded with rules can do this automatically.

The reality is that this technology has been around for years in various forms. In the early years, one of the primary drawbacks was that someone had to encode the rules, and this was not easy -- usually involving programming. In the last several years, RPA vendors have embraced the "learn by example" philosophy, allowing relatively non-tech-savvy staff to create rules through a user interface that records their actions. For instance, using the example of email accounts, RPA software records all the mouse clicks and data entry required to provision the new account. Once recorded, the generated rules are stored in a repository to be used when invoked by some event. These sets of rules are typically referred to as "bots."

RPA and Document Capture

How can RPA help with document capture accuracy?

The analysis to identify a threshold compares OCR output and confidence scores against the actual values on a corresponding page (Using OCR: How Accurate is Your Data?). Performing this analysis ahead of time will yield better initial performance, but it can also be done in production.

There are two important areas where RPA can be used to automatically improve performance.

Collection and Analysis of Data Errors to Adjust Confidence Thresholds

If your organization verifies all OCR data because it has no field-level confidence thresholds, the key is to provide your RPA bot with data on field-level OCR errors. How? Consider a manual process:

  1. A user collects the field-level OCR answers for a field on a form along with the field-level operator actions and compares the two.
  2. Any OCR field answer that is corrected by an operator is an error. To calculate confidence thresholds, a user sorts all answers by the confidence scores.
  3. The user notes all cases where fields with relatively high confidence scores were incorrect and cases where fields with lower confidence scores were correct.
  4. Based on this review, a confidence score threshold that separates correct data from incorrect can be identified by calculating the ratio of correct answers to incorrect answers for any given threshold. The threshold with the highest ratio is selected and updated in the system.
  5. This exercise is repeated for every field.
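The manual steps above can be sketched in a few lines. This is a minimal sketch rather than any vendor's method: it assumes each sample is a (confidence score, was-the-answer-correct) pair, that an answer counts as incorrect when an operator changed it, and that ties between equally good thresholds go to the lower one so more fields are accepted automatically. The function name and sample data are hypothetical.

```python
def pick_threshold(samples):
    """Pick a confidence threshold for one field.

    samples: list of (confidence, was_correct) pairs, where was_correct
    is False when an operator corrected the OCR answer.
    Returns the candidate threshold whose accepted answers have the
    highest correct-to-incorrect ratio, or None if there are no samples.
    """
    best_threshold, best_ratio = None, -1.0
    # Candidates are the observed scores, lowest first, so that ties
    # go to the lower threshold (an assumption made for this sketch).
    for threshold in sorted({score for score, _ in samples}):
        accepted = [ok for score, ok in samples if score >= threshold]
        correct = sum(accepted)
        incorrect = len(accepted) - correct
        # No incorrect answers above the threshold: treat the ratio as infinite.
        ratio = correct / incorrect if incorrect else float("inf")
        if ratio > best_ratio:
            best_threshold, best_ratio = threshold, ratio
    return best_threshold


# Invented review data for a single "name" field.
samples = [
    (0.95, True), (0.91, True), (0.88, True), (0.84, False),
    (0.80, True), (0.62, False), (0.55, True), (0.40, False),
]
thresholds = {"name": pick_threshold(samples)}
```

With this sample data the function settles on 0.88, the lowest score above which no answer needed correction. Repeating the call per field fills out the `thresholds` dictionary. In practice you may also want to weigh how many answers fall below the threshold, since a ratio alone favors very strict cutoffs.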

Now consider using RPA. The user goes through the very same process of analysis; the RPA software observes these actions and then, via an API-level integration, updates confidence thresholds. The benefit is that once the RPA software has "learned" the process that is undertaken to arrive at the correct confidence threshold, the system can automate the process on an ongoing basis with no user intervention required.

Collection and Analysis of Common Data Values

Another factor in the accuracy of extracted data is the range of possible answers and their formats. With field-based data extraction using OCR, the "raw" answer can contain errors that can be eliminated by giving the software additional information about what a valid answer looks like.

Take, for instance, a name field. OCR might recognize "John Smith" as "J0hn Sm1th." Adding a list of potential answers can reduce these errors; providing this additional "context" during recognition helps identify likely answers. Essentially any time field-based OCR can include this context, accuracy improves.
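To make the idea concrete, here is a small sketch that approximates this kind of context as a post-recognition step: the raw OCR answer is compared against a list of known values and replaced when a close match exists. OCR engines that accept vocabularies apply them during recognition itself, so this is only an approximation. The name list, the `apply_context` function, and the 0.75 similarity cutoff are assumptions; the matching uses Python's standard `difflib` module.

```python
import difflib


def apply_context(raw_value, known_values, cutoff=0.75):
    """Return the closest known value, or the raw value if none is close enough."""
    matches = difflib.get_close_matches(raw_value, known_values, n=1, cutoff=cutoff)
    return matches[0] if matches else raw_value


# Hypothetical list of potential answers for a name field.
known_names = ["John Smith", "Jane Smith", "Joan Smythe"]

corrected = apply_context("J0hn Sm1th", known_names)
unchanged = apply_context("Xyzzy Qwert", known_names)
```

Here `corrected` becomes "John Smith", while a value with no close match is passed through untouched so an operator can still review it.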

Sometimes you may already have these additional databases available, but where this information has not been collected, RPA can build it during production. Just as with the automation of confidence thresholds, RPA can collect the answers from OCR and compare them to any corrections to create an ever-growing database of potential field values. For the name field, a database of common names can gradually be created for use in subsequent forms processing. The same can be done with fields such as city, state, ZIP code, and gender, as well as fields containing dates and amounts. Any field that has common entries or patterns can benefit from this RPA-based context creation.
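One way such a database might accumulate is sketched below. The `FieldValueStore` class and its methods are hypothetical: each time an operator confirms or corrects a field, the verified value is counted, and values seen often enough become candidates for later recognition. The minimum count of 2 is an arbitrary assumption.

```python
from collections import Counter, defaultdict


class FieldValueStore:
    """Accumulates operator-verified values per field (hypothetical design)."""

    def __init__(self):
        self._counts = defaultdict(Counter)   # field -> verified value -> count
        self.corrections = Counter()          # field -> number of OCR errors seen

    def record(self, field, ocr_value, verified_value):
        """Store the verified value and note whether OCR got it wrong."""
        self._counts[field][verified_value] += 1
        if ocr_value != verified_value:
            self.corrections[field] += 1

    def candidates(self, field, min_count=2):
        """Values verified at least min_count times, most frequent first."""
        return [v for v, c in self._counts[field].most_common() if c >= min_count]


store = FieldValueStore()
store.record("city", "Denver", "Denver")    # OCR was right
store.record("city", "Denvcr", "Denver")    # operator corrected the answer
store.record("city", "Boulder", "Boulder")  # seen only once so far
city_candidates = store.candidates("city")
```

After these three records, `city_candidates` contains only "Denver"; "Boulder" would join the list once it has been verified a second time.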


The end product of document capture could be considered RPA. After all, once configured, the result is the automation of data collection from documents. However, there are also opportunities to look at RPA from a more "micro" level by identifying specific definable and structured activities within a document automation project to automate to improve OCR results. Automation of configuring confidence score thresholds and creation of field-level context databases are two such opportunities.

Today, RPA is helping us solve parts of larger processing challenges. To leverage RPA effectively, it's important to find the right solution to make the process efficient. This may be full automation or, more likely, a combination of human and automated tasks.


About the Author

Greg Council is vice president of marketing and product management at Parascript and has over 20 years of experience. Responsible for market vision and product strategy, he oversees all aspects of Parascript software life cycles, leading the successful development and introduction of advanced technology to the marketplace.
