Should I OCR this document?
Thread poster: Mark Connolly
Mark Connolly
Mexico
Spanish to English
Jun 2, 2018

I have turned down jobs before because clients told me not to OCR a PDF document. This time I accepted the job before getting the instructions, and work is thin on the ground. It turns out the document is full of tables that OCR beautifully.

I always OCR without a format, could I get away with it?


 
Kevin Fulton  Identity Verified
United States
Local time: 22:05
German to English
I don't see why not Jun 3, 2018

To be honest, I don't understand why a client might not want you to use OCR on a file. After all, how you produce a usable intermediate (i.e. working) document is your business.

However, using OCR isn't always trouble-free.

One problem with using OCR on PDF files is that all sorts of artifacts including hidden tags can be embedded in the converted file which then interfere with successful formatting in Word, for example. There are various utilities available, such as Code Zapper, or TransTools Suite which "clean up" such artifacts and help regularize fonts and spacing. Another issue is faulty character recognition – although a character or word may appear legible to the human eye, it might be misinterpreted during the OCR process. Again, a careful reading of the output document should help eliminate such errors.

If you are using a CAT tool, you don't have many alternatives to using OCR, apart from INFIX, which results in reproducing a translated PDF file after using a CAT tool.

Using OCR to reproduce tables makes perfect sense to me, assuming the process doesn't introduce spacing or formatting errors.

You might ask the client regarding the instruction not to use OCR. It's possible that the client uses DTP and hidden embedded tags interfere with the process. As mentioned above, there are utilities that remedy this issue.
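
The check for faulty character recognition described above can be partly automated before the careful read-through. A minimal sketch in Python, assuming the OCR output has been saved as plain text; the function name `flag_suspect_tokens` and the heuristic itself are illustrative assumptions, not features of any tool named in this thread:

```python
import re

# Tokens that mix letters and digits (e.g. "c0st", "l1ne") are a common
# symptom of OCR misrecognition. This is only a heuristic: legitimate
# tokens such as "A4" or "B2B" will also be flagged for human review.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def flag_suspect_tokens(text):
    """Return (line_number, token) pairs for tokens mixing letters and digits."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in MIXED.finditer(line):
            hits.append((lineno, match.group()))
    return hits

# Hypothetical OCR output: "c0st" has a zero for "o", "Payrnent" has "rn" for "m".
sample = "Total c0st per unit\nInvoice date: 2018\nPayrnent due"
for lineno, token in flag_suspect_tokens(sample):
    print(f"line {lineno}: {token}")
```

On the sample text this flags "c0st" but not "Payrnent", so it narrows down where to look rather than replacing the read-through.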


 
finnword1
United States
Local time: 22:05
English to Finnish
ignorant clients Jun 3, 2018

Ask them to send you the material in text or Word document or to OCR the material themselves.

 
Germaine  Identity Verified
Canada
Local time: 22:05
English to French
Agree with Kevin Jun 3, 2018

Using Adobe Acrobat (Standard), you can simply "save as" the PDF in one of the various formats offered, including Word and Excel, and most of the time there's little word processing to do. OCR (EN+FR) is also included, should the PDF be a scan.

Sure, the software is pricey at first, but upgrades (and you don't have to buy each and every one) are more affordable. See it as an investment. You'll be surprised by all you can do with it (and even more with Adobe Acrobat Pro). I started with version 4 and I am now using version X. I have never regretted buying it. It has been worth every cent!

P.S.: should you buy it, don't forget to install the PDF printer. You'll get better PDFs by "printing" your Word/Excel documents than by "saving as".


 
Tom in London
United Kingdom
Local time: 02:05
Member (2008)
Italian to English
I agree with F Jun 4, 2018

finnword1 wrote:

Ask them to send you the material in text or Word document or to OCR the material themselves.


Finnword's suggestion is the correct one.


 
LEXpert  Identity Verified
United States
Local time: 21:05
Member (2008)
Croatian to English
Be careful what you wish for Jun 4, 2018

Tom in London wrote:

finnword1 wrote:

Ask them to send you the material in text or Word document or to OCR the material themselves.


Finnword's suggestion is the correct one.


That often results in a slipshod effort yielding tag soup and horrible segmentation that costs you more time than it saves, especially since, if the client is going to go to the trouble of OCRing for you, they're going to figure that they might as well run it through their CAT tool and knock your price down a bit. 9 times out of 10, I can do a much better job of OCRing a file than the client can.


 
José Henrique Lamensdorf  Identity Verified
Brazil
Local time: 23:05
English to Portuguese
In memoriam
Definitely true! Jun 4, 2018

LEXpert wrote:

9 times out of 10, I can do a much better job of OCRing a file than the client can.


I always wonder why clients - particularly agencies - "lie" about having done (horrible) OCR work.

They send me a table with a sea of typos, I ask them for the original file, and they say it's all they've got.

Later they ask me to proofread a laid-out PDF to check whether they've put all my translations in the right places.


 
DZiW (X)
Ukraine
English to Russian
extra work = extra charge Jun 4, 2018

Sometimes I use FreeTM.com (the free Wordfast Anywhere), which can convert PDFs that are not too complicated or bizarre and send the result to an email box; otherwise I have to use FineReader. Either way, I do charge for this, because it takes more time and effort to get the text into shape.

Most clients know very little even about the final translation, so many simply aren't aware of what an editable document is, what types of PDF/DJVU exist, or why OCR/DTP is needed at all. In this respect, translators work as mentors and educators, teaching the ABCs.

In short, clients don't understand why exactly they must pay for something they did not ask for. When I had a similar issue and asked for an editable copy, my client insisted the file must stay intact and very reluctantly sent me a password to unprotect the PDF. I had to explain to him again that a scanned PDF is no different with or without a password, for it is but a set of images, with no text. He was surprised and wondered whether translation involves reading from a hard copy or from the screen. I was ready to cancel the deal when he suddenly replied that he understood the problem: he could only view the file as photos, without selecting words or making remarks... Finally he sent me the original DOC, and once more he was dumbfounded by the question of which final format was required: DOC, PDF or some other... Yes, since there were charts and I didn't want to get into explaining ZIP/RAR, I sent him a DOC, an RTF, and a searchable PDF... He was puzzled and asked whether he had to pay threefold)

Why, I believe it's much better than "a plain DOC file without tables and graphics", which turned out to be a DOC with scanned handwriting.


 

