An Overview of Document Conversion

Document conversion is the process of changing the format of a document. The original document may be paper or electronic. Either way, there are many situations in which an electronic document is required in a specific format. With technology increasingly present in all aspects of business and personal life, it’s becoming ever more necessary to share documents and data between many systems. This often requires some form of conversion, as not all systems support the same file formats.

For paper documents, conversion to an electronic format is required before they can be accessed by computer. Different electronic formats offer different tradeoffs between size, quality, and features. Careful consideration needs to be given before starting a conversion project to ensure that the right format is chosen.
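To make the size side of that tradeoff concrete, here is a minimal sketch that computes the raw, uncompressed size of one scanned page. The function name and the chosen settings (a US Letter page at 300 DPI) are illustrative assumptions; real files are usually much smaller once a format applies compression.

```python
def uncompressed_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the raw (uncompressed) size in bytes of one scanned page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# US Letter page (8.5 x 11 inches) scanned at 300 DPI in three color modes.
for label, bits in [("black and white", 1), ("grayscale", 8), ("color", 24)]:
    size = uncompressed_scan_bytes(8.5, 11, 300, bits)
    print(f"{label}: {size:,} bytes")
# black and white: 1,051,875 bytes
# grayscale: 8,415,000 bytes
# color: 25,245,000 bytes
```

The same page costs 24 times more raw storage in color than in black and white, which is why the color mode and resolution are worth deciding before a large scanning job begins.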

Basic scanning, such as many people can do at home, usually results in a simple image file (TIFF, PNG, JPEG) or a PDF. In both cases (image or PDF), the computer sees the resulting document as just an image. This means that, other than viewing the document on screen, users cannot do much with it: the text cannot be searched, selected, or edited.

One way this shortcoming is addressed is by adding metadata to the electronic document. Some file formats allow users to add data such as the author, title, subject, creation date, or arbitrary tags to the file, and that data can then be searched.
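As a rough illustration of metadata search, the sketch below keeps metadata in a small in-memory record and searches it. The `DocumentRecord` class, its fields, and the `search_metadata` function are hypothetical names chosen for this example; they do not correspond to any particular file format or document management system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentRecord:
    """Metadata attached to one scanned file (hypothetical structure)."""
    path: str
    author: str = ""
    title: str = ""
    subject: str = ""
    tags: List[str] = field(default_factory=list)


def search_metadata(records, term):
    """Return records whose metadata mentions the term, ignoring case."""
    needle = term.casefold()
    matches = []
    for record in records:
        values = [record.author, record.title, record.subject, *record.tags]
        if any(needle in value.casefold() for value in values):
            matches.append(record)
    return matches


records = [
    DocumentRecord("scan_001.pdf", author="J. Smith", title="March invoice",
                   tags=["invoice", "2023"]),
    DocumentRecord("scan_002.pdf", author="A. Jones", title="Lease agreement",
                   tags=["contract"]),
]

print([r.path for r in search_metadata(records, "invoice")])  # ['scan_001.pdf']
# A word that appears only on the scanned page itself is never found,
# because the page content is still just an image.
print([r.path for r in search_metadata(records, "deposit")])  # []
```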

The downside to using only metadata is that the full text of the document still cannot be searched or edited. This limitation can be removed by processing the file with recognition software. There are a few types of software for different situations:

  1. Optical character recognition (OCR) software can recognize printed text characters and convert a document image to editable, searchable text.
  2. Intelligent character recognition (ICR) software can recognize handwritten characters (and printed characters too, although not as well as OCR). ICR works best on neatly written documents where letters don’t run together.
  3. Optical mark recognition (OMR) software can recognize items such as checkmarks in checkboxes and filled-in bubbles on forms.
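Of the three, mark recognition is the simplest to sketch. The example below assumes the scanned form is already available as rows of grayscale pixel values (0 is black, 255 is white) and that the position of each checkbox is known in advance. The function names, the threshold of 128, and the minimum fill of 40% are illustrative assumptions, not values taken from any specific OMR product.

```python
def dark_fraction(image, box, dark_threshold=128):
    """Return the share of pixels inside box that are darker than the threshold.

    image is a list of rows of grayscale values; box is (top, left, bottom, right)
    with bottom and right exclusive.
    """
    top, left, bottom, right = box
    total = 0
    dark = 0
    for row in image[top:bottom]:
        for pixel in row[left:right]:
            total += 1
            if pixel < dark_threshold:
                dark += 1
    return dark / total if total else 0.0


def read_marks(image, boxes, min_fill=0.4):
    """Map each named checkbox to True if enough of its area is filled in."""
    return {name: dark_fraction(image, box) >= min_fill
            for name, box in boxes.items()}


# A tiny 4 x 8 "form": white page with a pen stroke through the left checkbox.
page = [[255] * 8 for _ in range(4)]
for r in range(4):
    for c in range(1, 3):
        page[r][c] = 0

boxes = {"yes": (0, 0, 4, 4), "no": (0, 4, 4, 8)}
print(read_marks(page, boxes))  # {'yes': True, 'no': False}
```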

Using a combination of these systems can cover most paper document conversion demands. However, when converting paper documents to electronic ones, care has to be taken when scanning them. Wrinkles in the paper or shadows can affect the result. An even more likely problem is an angled, or skewed, scan. If the document is not correctly aligned, recognition software can fail to properly recognize some or all of the document’s text. Text clarity, fading, size, and even the language can also affect the results.
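Skew can often be measured and corrected before recognition. One common approach is the projection profile method: rotate the dark pixels by each candidate angle and keep the angle at which they pile up into the fewest, sharpest rows. The sketch below is a simplified pure-Python version under stated assumptions: dark pixels are given as (x, y) coordinates with y increasing downward, and the function name and candidate angles are chosen for illustration only.

```python
import math


def estimate_skew_degrees(dark_pixels, candidates):
    """Return the candidate angle that best aligns dark pixels into rows.

    For each angle, every pixel is rotated back by that angle and assigned
    to a row. The score is the sum of squared row counts, which is highest
    when text lines collapse into a few densely filled rows.
    """
    best_angle = 0.0
    best_score = -1
    for angle in candidates:
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        counts = {}
        for x, y in dark_pixels:
            row = round(y * cos_t - x * sin_t)
            counts[row] = counts.get(row, 0) + 1
        score = sum(c * c for c in counts.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic scan: three "text lines", each drifting downward at 2 degrees.
slope = math.tan(math.radians(2.0))
pixels = [(x, round(y0 + x * slope))
          for y0 in (20, 40, 60)
          for x in range(200)]

candidates = [i * 0.5 for i in range(-10, 11)]  # -5.0 to 5.0 in 0.5 steps
print(estimate_skew_degrees(pixels, candidates))  # 2.0
```

Once the angle is known, the image can be rotated by the opposite amount before it is passed to OCR, ICR, or OMR software.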

Regarding file formats, the most common ones used for searchable text are PDF, plain text, MS Word, XML, XHTML, and HTML. The benefit of most of these formats is that they’re industry standards that can be parsed and processed easily by computers, so most, if not all, document management systems will recognize and understand them.
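As a small illustration of why standard formats are easy for computers to process, the sketch below writes recognized page text to XML and reads it back using only Python's standard library. The element names (`document`, `title`, `page`) and the `pages_to_xml` function are assumptions made up for this example, not an established schema.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(title, pages):
    """Serialize recognized text, one element per page, to an XML string."""
    root = ET.Element("document")
    ET.SubElement(root, "title").text = title
    for number, text in enumerate(pages, start=1):
        page = ET.SubElement(root, "page", number=str(number))
        page.text = text
    return ET.tostring(root, encoding="unicode")


xml_text = pages_to_xml("Lease", ["Terms & conditions", "Signatures"])

# Any XML-aware system can parse the result back into structured data.
parsed = ET.fromstring(xml_text)
for page in parsed.findall("page"):
    print(page.get("number"), page.text)
# 1 Terms & conditions
# 2 Signatures
```

Special characters such as `&` are escaped on the way out and restored on the way in, which is one of the details a standard format handles so that each receiving system does not have to.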

With all of these details to consider, starting a document conversion project may seem daunting. However, remember that a well-planned conversion should, in the end, result in lower costs and higher productivity.
