Xerox copiers found to rewrite documents without OCR

August 7, 2013

While scanners and photocopiers may not always get the brightness or contrast right, a scan or copy that is clearly readable is generally treated as identical to the original. When text is converted to an editable form through OCR, characters can be misidentified, such as a '0' being read as an '8' or an upper-case 'I' coming out as a lower-case 'L'. But can this happen when no OCR takes place?

David Kriesel, a German computer scientist, was surprised to find that after he scanned and printed a construction plan, several figures in the print-out were clearly wrong. The tell-tale sign was that a smaller room had larger dimensions printed for it than a larger room on the same sheet. Although the scanning resolution was low, that alone could not explain the errors: dimensions such as '21.11' came out as '14.13', and the figures were clearly legible in the scan.

After scanning various documents with OCR disabled, he was able to reproduce the issue on other documents, including an invoice on which '65,40' came out as '85,40'. In this case the '8' was cleanly formed, ruling out a pixel shift or dithering. Unsurprisingly, this made him question whether incorrect figures could expose a company to legal action and, in turn, whether that company could hold the copier manufacturer liable.

Close examination of the scanned documents, together with user feedback, showed that the issue is caused by the copier's use of the JBIG2 compression algorithm. JBIG2 uses dictionary-based compression: to reduce the file size of the resulting scan, it tries to match patches of pixels in the scanned material against similar patches already in its dictionary. If a close match is found, it substitutes the dictionary patch. Unfortunately, at a low enough resolution, this technique can end up substituting patches that it judges to be similar but that contain different, perfectly readable digits.
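
The substitution effect can be illustrated with a toy model. The sketch below is not the JBIG2 codec or Xerox's implementation: it assumes tiny 5x3 glyph bitmaps, a plain pixel-difference count (Hamming distance) as the similarity measure, and a fixed threshold. The names `hamming`, `encode` and `decode` are made up for this illustration.

```python
# Toy model of lossy dictionary-based patch matching (NOT the real JBIG2 codec).
# Glyphs are hypothetical 5x3 bitmaps: '#' is ink, '.' is blank paper.
SIX = ("###",
       "#..",
       "###",
       "#.#",
       "###")
EIGHT = ("###",
         "#.#",
         "###",
         "#.#",
         "###")
ONE = (".#.",
       "##.",
       ".#.",
       ".#.",
       "###")


def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode(patches, threshold):
    """Store each patch as an index into a dictionary of patches.

    A patch reuses the first dictionary entry that differs from it in at
    most `threshold` pixels; otherwise it is added as a new entry.
    """
    dictionary, indices = [], []
    for patch in patches:
        for i, symbol in enumerate(dictionary):
            if hamming(patch, symbol) <= threshold:
                indices.append(i)  # reuse a "similar enough" patch
                break
        else:
            dictionary.append(patch)
            indices.append(len(dictionary) - 1)
    return dictionary, indices


def decode(dictionary, indices):
    """Rebuild the page from the dictionary and the stored indices."""
    return [dictionary[i] for i in indices]


page = [EIGHT, SIX, ONE]

# Threshold 0 only merges identical patches, so the page survives intact.
exact = decode(*encode(page, threshold=0))

# Threshold 1 treats the '6' as a copy of the '8' already in the dictionary.
lossy = decode(*encode(page, threshold=1))

for name, glyphs in (("threshold 0", exact), ("threshold 1", lossy)):
    print(name)
    for row in range(5):
        print("  " + "  ".join(glyph[row] for glyph in glyphs))
```

At this toy resolution the '6' and the '8' differ by a single pixel, so with a threshold of 1 the '6' is silently replaced by a cleanly drawn '8', while the '1' is far enough from both to be kept. This mirrors the '65,40' to '85,40' error described above, although a real encoder works with segmented symbols and its own matching criteria.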

Kriesel has since had a conference with Xerox, and the problem is indeed the JBIG2 compression algorithm, which the copier uses on its 'Normal' quality setting. To work around the problem, the user must choose 'higher' or 'high' for the scan quality setting, as this makes the copier use a different compression algorithm. The problem affects the Xerox WorkCentre 7535 and 7556, and potentially other models as well.