How OCRed PDF Productions Degrade Electronic Evidence

By Arman Gungor, February 6, 2013

Many legal teams use endorsed searchable PDFs as their preferred format for producing electronic evidence. I suspect the two most common reasons are that attorneys are very familiar with the PDF format, and that the productions can be prepared in-house using tools the firm already has.

I am generally not a fan of PDF productions because I think they lack both the advantages of a native production (e.g. maintaining the metadata and functionality of complex electronic files) and the advantages of a TIFF production accompanied by load files (e.g. flexibility and ease of use with legal review platforms). In fact, our experience shows that upon receiving a searchable PDF production, most law firms hire an outside company or engage their in-house litigation support team to convert the documents to a TIFF production with load files so that they can be loaded into a legal review platform.

More concerning to me, though, is the fact that searchable PDF productions are frequently prepared (unnecessarily) using OCR rather than extracted text.

Let’s take a look at a problematic but commonly used production workflow:

  1. Electronic files are prepared for review using e-Discovery processing (TIFFs, load files, extracted text)
  2. Documents are loaded into a review database (the database contains, among other things, metadata and extracted text)
  3. Review is performed and a list of documents to be produced and their designations is determined
  4. Images to be produced are OCRed, endorsed and converted to searchable PDFs for production

Extracted Text

Text that is captured with great accuracy during e-Discovery processing from electronic documents that are already searchable (e.g. e-mails, Excel spreadsheets).

OCR Text

Text that is obtained by recognizing the characters found in an image file and translating them to text, using a process called Optical Character Recognition (OCR). OCR is typically used in the absence of extractable text and is much less accurate.

How OCRed PDF Productions Degrade Electronic Evidence

The workflow above effectively discards the more accurate, electronically extracted text and replaces it with much less accurate OCR text. For example, let’s assume that the original electronic document was the following Excel spreadsheet:

[Image: sample Excel spreadsheet]

Figure 1 – Original Native File

The text electronically extracted from this file is as follows. As expected, formatting is not maintained but the text is 100% accurate.

Test Data Cell A1 (Sin[t]Sqrt[Abs[Cos[t]]])/(Sin[t]+7/5)-2Sin[t]+2, {t, 0, 10}
Test Data Cell A2 Test Data Cell B2

Figure 2 – Text Electronically Extracted from Original Native File

Once the native file is converted to TIFF, OCRed and exported as a searchable PDF, the text embedded in the PDF is degraded. Take a look at the following example prepared using one of the most accurate OCR engines:

Test Data Cell A1 (Sin[t]Sqrt[Abs[Cos[t]]])/(Sin[t]+7/5)-2Sin[t]+2, {t, 0, l0}
“7e&t “Data &ell ‘SZ

Figure 3 – Text Extracted from Searchable PDF Created Using OCR

As you can see in Figure 3, the plain text in cell A1 and the formula in cell B1 were recognized almost perfectly (only the final “10” came out as “l0”), but the lack of contrast in cell A2 and the complex font in cell B2 prevented the OCR engine from recognizing those characters. If you are on the receiving end of such a production and rely on keyword searches, the inaccurate text can be quite frustrating: a search for “Cell A2” would not find this document at all. Please note that the issue described above is caused not by a deficiency of the PDF format, but by the unnecessary OCR step in the workflow.
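
To make the effect on keyword searching concrete, here is a minimal Python sketch that compares the text from Figure 2 with the text from Figure 3. The two strings are copied from this article; the function names `similarity` and `keyword_hits` are hypothetical helpers written for this illustration and do not belong to any e-Discovery platform.

```python
from difflib import SequenceMatcher

# Text copied from Figure 2 (extracted) and Figure 3 (OCR) in this article.
extracted = (
    "Test Data Cell A1 (Sin[t]Sqrt[Abs[Cos[t]]])/(Sin[t]+7/5)-2Sin[t]+2, {t, 0, 10}\n"
    "Test Data Cell A2 Test Data Cell B2"
)
ocr = (
    "Test Data Cell A1 (Sin[t]Sqrt[Abs[Cos[t]]])/(Sin[t]+7/5)-2Sin[t]+2, {t, 0, l0}\n"
    "“7e&t “Data &ell ‘SZ"
)


def similarity(a: str, b: str) -> float:
    """Return a rough 0..1 similarity score between two strings."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def keyword_hits(text: str, keywords: list) -> dict:
    """Report, per keyword, whether a case-insensitive search finds it."""
    lowered = text.lower()
    return {kw: kw.lower() in lowered for kw in keywords}


if __name__ == "__main__":
    terms = ["Cell A1", "Cell A2", "Cell B2"]
    print("similarity:", round(similarity(extracted, ocr), 3))
    print("extracted :", keyword_hits(extracted, terms))
    print("ocr       :", keyword_hits(ocr, terms))
```

Run against the extracted text, all three search terms are found; run against the OCR text, only “Cell A1” is. The similarity score is only a rough indicator and is not a formal OCR accuracy metric.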

Conclusion

During e-Discovery processing, text is typically extracted along with the coordinates of each character or word. This makes it possible to export searchable PDFs with embedded extracted text if desired. Instead of OCRing images, searchable PDFs should be created through the e-Discovery platform using the original, more accurate extracted text. OCR should be reserved for non-searchable documents, such as image-only PDFs and scanned images, to complement the extracted text.
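
The rule above can be sketched as a small decision function: use the extracted text when it exists and fall back to OCR only when it does not. This is an illustration under assumptions, not how any particular platform works; the `Document` class, `text_for_production`, and the stand-in `fake_ocr` are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Document:
    doc_id: str
    # None or blank when processing found no extractable text
    # (e.g. an image-only PDF or a scanned image).
    extracted_text: Optional[str]


def text_for_production(
    doc: Document, run_ocr: Callable[[Document], str]
) -> Tuple[str, str]:
    """Return (text, source), preferring extracted text over OCR."""
    if doc.extracted_text and doc.extracted_text.strip():
        return doc.extracted_text, "extracted"
    # Only documents without usable extracted text are sent to OCR.
    return run_ocr(doc), "ocr"


def fake_ocr(doc: Document) -> str:
    """Stand-in for a real OCR engine call (hypothetical)."""
    return f"<ocr text for {doc.doc_id}>"


if __name__ == "__main__":
    docs = [
        Document("DOC-001", "Test Data Cell A1"),  # spreadsheet with text
        Document("DOC-002", None),                 # scanned image
    ]
    for d in docs:
        print(d.doc_id, text_for_production(d, fake_ocr))
```

Recording the source alongside the text also lets the producing party state which documents in the production carry OCR text and which carry extracted text.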

If some of the documents contain redactions, the redactions can be made in a platform that supports automatically mirroring the redactions in the extracted text. If this is not possible, OCR can be limited to the redacted images, while all other documents keep their extracted text.
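
Because extracted text can carry word coordinates, mirroring a redaction in the text can be as simple as dropping every word whose bounding box overlaps a redaction rectangle. The following is a simplified sketch of that idea, not a description of any specific product; `Word`, `overlaps`, `redact_text`, and the coordinate convention are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (left, top, right, bottom) in page units; an assumed convention.
Box = Tuple[float, float, float, float]


@dataclass
class Word:
    text: str
    box: Box


def overlaps(a: Box, b: Box) -> bool:
    """True when two rectangles share any interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def redact_text(
    words: List[Word], redactions: List[Box], placeholder: str = "[REDACTED]"
) -> str:
    """Rebuild the text, replacing each run of redacted words with one placeholder."""
    out = []
    in_redaction = False
    for w in words:
        hit = any(overlaps(w.box, r) for r in redactions)
        if hit and not in_redaction:
            out.append(placeholder)
        elif not hit:
            out.append(w.text)
        in_redaction = hit
    return " ".join(out)


if __name__ == "__main__":
    words = [
        Word("Account", (0, 0, 50, 10)),
        Word("number", (55, 0, 100, 10)),
        Word("12345", (105, 0, 140, 10)),
        Word("67890", (145, 0, 180, 10)),
        Word("closed", (185, 0, 230, 10)),
    ]
    print(redact_text(words, [(100, 0, 182, 10)]))
```

A production-grade implementation would also need to handle partially covered words, multi-page documents, and character-level coordinates, all of which are left out here.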

While drafting discovery agreements, legal teams should consider the distinction between extracted text and OCR and request extracted text when available.

Arman Gungor

About Arman Gungor

Arman Gungor is a certified computer forensic examiner (CCE) and an adept e-Discovery expert with over 21 years of computer and technology experience. Arman has been appointed by courts as a neutral computer forensics expert as well as a neutral e-Discovery consultant. His electrical engineering background gives him a deep understanding of how computer systems are designed and how they work.