Obfuscation: A valid way to protect sensitive data?
Thread poster: Hans Lenting
Hans Lenting
Netherlands
Member (2006)
German to Dutch
Aug 18, 2019

Some CAT tools (like CafeTran Espresso 10 Croissant) already offer a way to mask sensitive data at the segment level.

Following a recent discussion in this forum, I have been musing over a more thorough way to protect sensitive data at the document level. I'd welcome your opinion on the validity of this approach.

Task:

  • Translate a document of 5,000 words with sensitive data, subject field: legal, finance or patents.
  • Make sure that the originating document cannot be reconstructed.
  • Protect sensitive data.


Approach:

  • Mask all names (company names, street names, names of individual persons etc.) and numbers.
  • Merge the 5,000 words with 95,000 words of similar documents.
  • Sort the new 100,000-word document (e.g. alphabetically, one segment per line).
  • Run the document through a public MT system to create a TM.
  • Use this TM to translate the originating document of 5,000 words.
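
The steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: the function names (`mask`, `build_upload`, `build_tm`, `pretranslate`) are hypothetical, the regular expressions are a naive stand-in for real name detection, and `translate` is any callable that stands in for the public MT system. It is not how CafeTran Espresso or any other CAT tool does it.

```python
import re

# Hypothetical, naive masking patterns; real name detection needs far more care.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")  # runs of capitalised words


def mask(segment: str) -> str:
    """Replace numbers and naively detected multi-word names with placeholders."""
    segment = NUMBER.sub("<NUM>", segment)
    return NAME.sub("<NAME>", segment)


def build_upload(real_segments, decoy_segments):
    """Mask everything, merge real and decoy segments, deduplicate and sort."""
    masked = {mask(s) for s in real_segments} | {mask(s) for s in decoy_segments}
    return sorted(masked)


def build_tm(upload, translate):
    """Send each uploaded segment to the MT callable; keep source -> target pairs."""
    return {src: translate(src) for src in upload}


def pretranslate(real_segments, tm):
    """Look up each masked real segment in the TM, in original document order."""
    return [tm[mask(s)] for s in real_segments]
```

Note that the original order never has to be recovered from the sorted upload: the real document is simply looked up against the TM, segment by segment.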


Question:

  • What means would the public MT system have of retrieving sensitive data from the 100,000-word document?


(I wouldn't be surprised if there actually is a way to reconstruct the originating document.)


 
Gary Evans  Identity Verified
Germany
German to English
Come again Aug 18, 2019

Hi Hans,

Seems quite complicated to me. Are you suggesting that the text becomes mixed up at the sentence level? Beats me how you can put that soup back together again. It also looks like 95% of the translation is pointless work for the MT.

I'd personally stick to not using MT for highly sensitive translations as others have stated elsewhere here.

Regards,
Gary


DZiW (X)
Ukraine
English to Russian
+ ...
Fizzy fuzziness Aug 18, 2019

First, there's no one-to-one equivalence in translation.

Second, most CATs use "segments", significantly downgrading (=separating) the textual to lexico-grammatical level.

Third, even removing WHO-WHAT (names and numbers) is enough to overgeneralize the independent clauses, let alone garbling WHEN-WHERE-WHY-HOW specifications.

Fourth, while shuffled fragments and choppy, run-on, or loose sentences with excessive subordination and non-parallel structure render any text meaningless, such tricks take more time and effort without much gain for the translator. Why use a third-party online(?) exotic(?) MT system?

Fifth, a secret meta-language might help to some extent, yet if a perpetrator can access your TM, then what about your other papers, intermediate work, and correspondence? It's just not worth it.

IMO


 
Hans Lenting
Netherlands
Member (2006)
German to Dutch
TOPIC STARTER
Less effort than one might assume ... Aug 19, 2019

Gary Evans wrote:

Seems quite complicated to me. Are you suggesting that the text becomes mixed up at the sentence level?


Actually, nearly all the steps can be automated.

Beats me how you can put that soup back together again.


The beauty is that you don't need to put the soup back together, since you'll be using the TM to translate the originating document.

It also looks like 95% of the translation is pointless work for the MT.


That's right, but the only relevant factor here is the cost of these extra words.

However, I see a real problem: you can soon run out of fresh 'distraction' documents, since you can upload each sentence to the online MT system only once. After that, the MT system will be able to differentiate between new (real) segments and old (distracting) segments, especially in combination with IP logging and other fingerprinting techniques (which I'm quite sure they all use).
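
To make the reuse problem concrete, here is a toy model of what a provider could do on its side. Everything in it is an assumption for illustration: the `SegmentLog` class, the idea of keying the log by a client fingerprint string, and the use of SHA-256 hashes. Nothing here describes how any actual MT service works.

```python
import hashlib


class SegmentLog:
    """Toy provider-side log of segments seen per client fingerprint (hypothetical)."""

    def __init__(self):
        self._seen = {}  # fingerprint -> set of segment hashes

    def submit(self, fingerprint: str, segments):
        """Return the segments this client has never sent before, then record them."""
        seen = self._seen.setdefault(fingerprint, set())
        fresh = []
        for seg in segments:
            digest = hashlib.sha256(seg.encode("utf-8")).hexdigest()
            if digest not in seen:
                fresh.append(seg)
                seen.add(digest)
        return fresh
```

If the same distraction segments are uploaded with a second job, only the new segments come back as fresh, and those are exactly the real ones.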


 
Samuel Murray  Identity Verified
Netherlands
Member (2006)
English to Afrikaans
+ ...
@Hans Aug 19, 2019

Hans Lenting wrote:
  • Merge the 5,000 words with 95,000 words of similar documents.


    1. My annual Google Translate bill is $100. With your method, it'll be $2000.

    2. A malicious machine could use something similar to a plagiarism checker to identify which segments in your "text" are likely from public sources and therefore whatever remains are likely the confidential segments.

    3. Even if you then also sort all segments alphabetically, a neural system can conceivably exist to calculate probable original orders of segments.
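
Point 2 could look something like the sketch below. It is a minimal illustration only: the function name `likely_confidential`, the 0.9 threshold, and the idea that the attacker holds a list of public segments are all assumptions, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever a real plagiarism checker would use.

```python
from difflib import SequenceMatcher


def likely_confidential(uploaded, public_corpus, threshold=0.9):
    """Return uploaded segments that have no close match in the public corpus."""
    suspects = []
    for seg in uploaded:
        # Best similarity of this segment against any known public segment.
        best = max(
            (SequenceMatcher(None, seg.lower(), pub.lower()).ratio()
             for pub in public_corpus),
            default=0.0,
        )
        if best < threshold:
            suspects.append(seg)
    return suspects
```

Whatever survives the filter is the small set of segments that came from nowhere public, so sorting or padding the upload does not hide them.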

    Hans Lenting wrote:
  • Mask all names (company names, street names, names of individual persons etc.) and numbers.


    You could go further, and alter the 1,000 most commonly used adverbs and adjectives. If a CAT tool does this, you can ensure that a specific adverb is not always replaced with the same dummy adverb. I wonder if such a thing is feasible.


    Hans Lenting
    Netherlands
    Member (2006)
    German to Dutch
    TOPIC STARTER
    Some answers Aug 20, 2019

    Samuel Murray wrote:

    Hans Lenting wrote:
  • Merge the 5,000 words with 95,000 words of similar documents.


    1. My annual Google Translate bill is $100. With your method, it'll be $2000.


    The 95,000 words was just an arbitrary number. But indeed, your costs will increase.
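
The factor of 20 in Samuel's figures follows directly from the proportions. A quick check, assuming the MT bill scales linearly with volume (the function name `padded_cost` is made up for this illustration):

```python
def padded_cost(base_cost: float, real_words: int, decoy_words: int) -> float:
    """Scale a volume-based MT bill by the padding factor (assumes linear pricing)."""
    factor = (real_words + decoy_words) / real_words
    return base_cost * factor


print(padded_cost(100, 5_000, 95_000))  # 2000.0
```

With fewer distraction words the factor drops accordingly, but so does the cover they provide.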

    2. A malicious machine could use something similar to a plagiarism checker to identify which segments in your "text" are likely from public sources and therefore whatever remains are likely the confidential segments.


    True. That would indeed be a likely scenario.

    3. Even if you then also sort all segments alphabetically, a neural system can conceivably exist to calculate probable original orders of segments.


    When I posted my first message in this thread, I was indeed aware of this possibility. But how would this work?

    Hans Lenting wrote:
  • Mask all names (company names, street names, names of individual persons etc.) and numbers.


    Samuel Murray wrote:
    You could go further, and alter the 1000 most commonly used adverbs and adjectives.


    I cannot see the point of doing that ...

    Anyway, let's wait for someone with a better idea.


     
    DZiW (X)
    Ukraine
    English to Russian
    + ...
    Timestamps/authors/synonyms Aug 21, 2019

    Hans, are you talking about local (offline) TMs or shared/online ones?

    If the former, there's no point in bending over backwards to please numerous two-centers and spongers. Provided that a hacker can access the TMs, it's no big deal for him to get at the original correspondence, shadow copies, invoices, temporary files, whatever.

    If the latter, it's still about local security policy and individual habits/practices of every participant--including the servers.

    All in all, what about the practices of clients and agencies? How could they prove that their protection is adequate and that no sensitive data can leak on their side, I wonder?


    While a targeted attack or a custom order is still possible, no real malefactor would even consider hacking your particular PC to get TMs, unless (1) it's far too easy and (2) he knows what big fish you deal with.


     
    Hans Lenting
    Netherlands
    Member (2006)
    German to Dutch
    TOPIC STARTER
    Late answer Aug 24, 2019

    DZiW wrote:

    Hans, are you talking about local (offline) TMs or shared/online ones?


    Sorry for my late answer, but I'm talking about TMs created by MT systems. My CAT tool (CafeTran Espresso 10 Croissant) allows creation of TMs by running all segments of a translation project through MT systems. This is useful, for example, when you want to use MT but know that you won't have internet access later (during a flight, in the bush, etc.).


     
    Samuel Murray  Identity Verified
    Netherlands
    Member (2006)
    English to Afrikaans
    + ...
    On CAT tools creating MT'd TMs Aug 24, 2019

    Hans Lenting wrote:
    My CAT tool (CafeTran Espresso 10 Croissant) allows creation of TMs by running all segments of a translation project through MT systems.


    My CAT tool (WFC) doesn't have that feature but I accomplish it in 5 minutes using a combination of its features plus a little AutoIt script. There is a new feature in Wordfast Pro 5 which allows for MT to be used during pre-translation, so I suppose one could use that as well (then extract the TM from the TXLF file). Trados has an option to "use automation" during pre-translation, but I couldn't get it to work.


     

