An Expert’s Guide on How to Protect Data Using NLP

data science, NLP, scikit, python, machine learning

Protecting data and keeping sensitive information private is very important for companies and their customers. Several huge data leaks in the past have led to trust issues towards the companies involved. To get value from text data with machine learning, large collections of documents are necessary. But granting access to them can be a privacy issue for customers in finance, law, medicine, and many other fields. As a freelancer in data science and machine learning, you need access to large quantities of data to build accurate and useful models. But for some of your clients, it might be (with good reason) scary or impossible to disclose their data raw and unprotected.

So how can you work with sensitive text data at scale yet keep the contained information as secure as possible? In this article, I'm going to show you some of my methods to work with sensitive text data and discuss what the caveats are.

Get a sample dataset

To explain the methods, we use the 20 Newsgroups dataset, which is easily available through scikit-learn. The 20 Newsgroups dataset is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. So, we're facing a document classification problem here. We start by loading the data through the scikit-learn API.

from sklearn import datasets

train_dataset = datasets.fetch_20newsgroups(subset="train")
test_dataset = datasets.fetch_20newsgroups(subset="test")
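
# Assumed completion of the truncated assignments: the object returned by
# fetch_20newsgroups exposes the raw texts as .data and the labels as .target
train_texts = train_dataset.data
train_labels = train_dataset.target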

We have the following categories for the documents:

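train_dataset.target_names
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']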

Let's have a look at an example.

"From: (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host:\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tell me a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

So, this document is about cars, which we would probably have guessed.

The baseline setup

We first set up the machine learning pipeline we will be using throughout the article. For simplicity, we use a bag-of-words TF-IDF model with a naive Bayes classifier, a simple, effective, and popular method for text classification. But the proposed methods would also work with more complicated models like neural networks.

from tqdm import tqdm_notebook
from hashlib import shake_128
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def run_ml(texts, labels, tokenizer=None):
    # Hold out 15% of the documents for validation
    tr_txt, vl_txt, y_tr, y_vl = train_test_split(texts, labels,
                                                  train_size=0.85, test_size=0.15)
    # Token counts -> TF-IDF weights -> multinomial naive Bayes
    text_clf = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenizer)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultinomialNB()),
    ])
    text_clf.fit(tr_txt, y_tr)
    y_pred = text_clf.predict(vl_txt)
    print("Accuracy: {:.1%}".format(accuracy_score(y_vl, y_pred)))

To compute a performance baseline, we first run the pipeline with unprotected raw data.

run_ml(train_texts, train_labels)
Accuracy: 85.0%

Work with anonymized documents

Since we want to do machine learning with the documents, we want to preserve as much information as possible. Depending on how critical your data is, you can pick from several ways to do this. I'll show you three of them, and we'll compare their performance on our simple machine learning model. The basic idea is that (most) machine learning models treat the token Hello the same way as 16ff566c558eb688e. As long as the relationships between the tokens are preserved, everything will work fine. In the end, what the models see is just numbers.
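
To make this idea concrete, here is a minimal sketch using only the standard library. The helper names pseudonymize and counts_match are made up for this illustration, and the truncated SHA-256 is just one possible mapping. It checks that replacing every token by a deterministic stand-in leaves the token counts, which are all a bag-of-words model looks at, unchanged.

```python
from collections import Counter
from hashlib import sha256


def pseudonymize(token):
    # Hypothetical helper: map a token to a short, deterministic hex string.
    return sha256(token.encode("utf-8")).hexdigest()[:16]


def counts_match(plain, hashed):
    # True if every token occurs exactly as often as its stand-in.
    plain_counts = Counter(plain)
    hashed_counts = Counter(hashed)
    same_size = len(plain_counts) == len(hashed_counts)
    return same_size and all(
        hashed_counts[pseudonymize(tok)] == n for tok, n in plain_counts.items()
    )


docs = [
    "hello world hello".split(),
    "goodbye world".split(),
]
hashed_docs = [[pseudonymize(tok) for tok in doc] for doc in docs]

for plain, hashed in zip(docs, hashed_docs):
    print(counts_match(plain, hashed))
```

Because the mapping is deterministic, the word "world" gets the same stand-in in both documents, so relationships across documents survive as well.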

1. Remove personally identifiable information automatically

We start out with a method to automatically remove personally identifiable information such as names and locations. Depending on your dataset, you might also remove credit card information or certain IDs with this method. A simple approach is to use a named entity tagger to find this information in the text and then replace it with a placeholder, here the entity type. For tokenization and named entity recognition, we will use the awesome spaCy library.

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(train_texts[0])

Let's see what the entity tagger found.

displacy.render(doc, style="ent")
From: (where's my thing)
Subject: WHAT car is this!?
[Nntp-Posting-Host PERSON]:
Organization: [University of Maryland ORG], College Park
Lines: [15 CARDINAL]

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
[early 70s DATE]. It was called a [Bricklin GPE]. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tell me a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

- IL
---- brought to you by your neighborhood [Lerxst PERSON] ----

So, the tagger detected the person Lerxst and the organization University of Maryland. We now replace the detected entities in the text with their entity type. You can already see a caveat of this method: it didn't recognize the email address. But you could filter that out with a regular expression.

def replace_by_type(texts):
    new_texts = []
    for doc in tqdm_notebook(nlp.pipe(texts, disable=["tagger", "parser"]), total=len(texts)):
        new_txt = doc.text
        for ent in doc.ents:
            # Replace every occurrence of the entity text with its label, e.g. PERSON or ORG
            new_txt = new_txt.replace(ent.text, ent.label_)
        new_texts.append(new_txt)
    return new_texts

train_texts_type = replace_by_type(train_texts)

run_ml(train_texts_type, train_labels)
Accuracy: 79.0%

We can see that this method performs somewhat worse than the baseline (79.0% versus 85.0%). How large the drop is depends on the dataset and the problem at hand. On the downside, the method gives no guarantee that all relevant personal information is found. So, keep that in mind.
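
The email addresses that the entity tagger misses could be caught with a regular expression, as mentioned earlier. The following is a rough sketch, not a complete solution: the pattern is deliberately simple and will not cover every valid address format, and the names EMAIL_RE and mask_emails as well as the sample addresses are invented for this example.

```python
import re

# Deliberately simple pattern; it will not match every valid address format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_emails(text, placeholder="EMAIL"):
    # Replace every match with a fixed placeholder token.
    return EMAIL_RE.sub(placeholder, text)


# Made-up sample text with two made-up addresses.
sample = "From: jane.doe@example.com (Jane)\nPlease e-mail me at jd@mail.example.org."
masked = mask_emails(sample)
print(masked)
```

Such a step could be applied to each text before or after the entity replacement. The same caveat applies here: a pattern only finds what it was written to find.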

2. Hash personally identifiable information automatically

For some use cases, replacing all personal information with its type might cause problems. For example, specific locations can contain relevant information for your problem. So, we modify the previous approach: instead of replacing the detected personal information with its type, we replace it with its hash value. This keeps the information available to the machine learning model while hiding the original text.

from hashlib import sha256

def replace_by_hash(texts):
    new_texts = []
    for doc in tqdm_notebook(nlp.pipe(texts, disable=["tagger", "parser"]), total=len(texts)):
        ents = dict()
        for ent in doc.ents:
            # Assumed completion of the truncated line: hex digest of the lowercased entity text
            ents[ent.text] = sha256(ent.text.lower().encode()).hexdigest()
        new_txt = doc.text
        for ent in sorted(ents, key=len):
            # Skip very short entities to avoid replacing fragments of other words
            if len(ent) > 2:
                new_txt = new_txt.replace(ent, ents[ent])
        new_texts.append(new_txt)
    return new_texts

train_texts_hash = replace_by_hash(train_texts)

run_ml(train_texts_hash, train_labels)
Accuracy: 83.3%

This method performs well by keeping more relevant information. However, it still suffers from the privacy problems of the previous approach. Also, the hashes can be cracked by brute force or count-based statistical approaches.
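
To illustrate why plain hashes can be cracked, here is a small sketch of a dictionary attack. It assumes the entities were hashed with unsalted SHA-256 over the lowercased text, and the candidate word list is invented for this example. An attacker who can guess likely values simply hashes the guesses and looks them up.

```python
from hashlib import sha256


def hash_entity(text):
    # Unsalted SHA-256 of the lowercased entity text (assumed scheme).
    return sha256(text.lower().encode("utf-8")).hexdigest()


# What the recipient of the protected documents sees.
protected = [hash_entity("Maryland"), hash_entity("Lerxst")]

# Hypothetical attacker word list, e.g. public lists of names and places.
candidates = ["Boston", "Maryland", "Toronto", "Alice"]
lookup = {hash_entity(c): c for c in candidates}

# Any protected value that appears in the lookup table is recovered.
recovered = [lookup.get(h) for h in protected]
print(recovered)  # ['Maryland', None]
```

Common names and places are exactly the kind of values that show up in such word lists, so rare entities are safer than frequent ones.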

3. Go fully encrypted: encrypt every word in the document

One way to mitigate the problems of the previous two methods is to go fully encrypted. To keep as much information as possible, we map every word to a unique secret value that cannot easily be inverted. A hash function works well here. This does not change the vector space of our bag-of-words model but makes the text impossible for humans to read.

def encrypt_texts(texts):
    new_texts = []
    for doc in tqdm_notebook(nlp.pipe(texts, disable=["tagger", "parser"]), total=len(texts)):
        enc_text = []
        for token in doc:
            # Assumed loop body (missing in the original): hash each token with shake_128;
            # the 8-byte digest length is an illustrative choice
            enc_text.append(shake_128(token.text.encode()).hexdigest(8))
        new_texts.append(" ".join(enc_text))
    return new_texts

train_texts_enc = encrypt_texts(train_texts)

run_ml(train_texts_enc, train_labels, tokenizer=lambda doc: doc.split(" "))
Accuracy: 79.0%

We can see that we kept most of the performance compared to the baseline (79.0% versus 85.0%). One serious downside of this method is that you cannot interpret the dataset after you apply it. This hides the content from human readers, but it also makes the machine learning workflow more tedious. Also note that the method is still vulnerable to statistical attacks and brute-force attacks that try to recover the original words.
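
One possible way to make brute-force guessing harder, which is not part of the methods above, is to use a keyed hash so that nobody can recompute the mapping without a secret. The sketch below uses HMAC-SHA256 from the standard library. SECRET_KEY, keyed_token, and the 16-character truncation are assumptions made for this illustration.

```python
import hmac
import secrets
from hashlib import sha256

# Hypothetical secret; in practice the data owner would generate and keep it.
SECRET_KEY = secrets.token_bytes(32)


def keyed_token(token, key=SECRET_KEY, length=16):
    # HMAC-SHA256 of the token, truncated to `length` hex characters.
    digest = hmac.new(key, token.encode("utf-8"), sha256).hexdigest()
    return digest[:length]


tokens = "the car was a Bricklin and the doors were small".split()
encoded = [keyed_token(tok) for tok in tokens]
print(" ".join(encoded))
```

The same word still maps to the same value, so the bag-of-words model is unaffected. For the same reason, frequency-based attacks remain possible: a key only prevents an attacker from hashing guesses themselves.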


We saw three methods for working with text datasets while keeping sensitive personal or commercial information safe. One serious drawback is that they make it harder (or impossible) to diagnose your models. But they can help you get started with your projects faster. You can start from here to craft the method that best fits your use case.

Here is a quick overview of the covered methods:

Method | Keeps information? | Secure? | Legible by humans? | Speed? | Is transfer learning possible?
1. | some loss | can miss personal information but cannot be inverted | yes | fast | yes
2. | no loss | can miss personal information and can be inverted | mostly | quite fast | restricted
3. | no loss | very secure but can be brute forced | no | slow | hardly

I hope this article helps you in your day-to-day work as a data scientist, machine learning engineer or especially as a freelancer in these fields. Nonetheless, always keep in mind that none of these methods is completely safe against attacks!