Have you ever wanted to compare a list of emails with a partner without exposing the emails the partner does not already know?

Enter one-way cryptographic hash functions

Crypto is not just the magic sauce for Bitcoin: cryptographic hash functions underpin most website security.

A one-way hash function is an algorithm that takes an input of any length and outputs a fixed-length hash. Unlike encryption, there is no key and no decryption step. A strong algorithm is one where the hash is irreversible with modern computing power; that is, there is currently no practical way to go from the hashed output back to the input.

One of the most widely adopted standards for one-way hash functions is the SHA-256 algorithm. The output of SHA-256 is a 256-bit digest, usually written as a fixed-length string of 64 hexadecimal characters.

  • “Tevunah” becomes ‘a5dd7414c077318ba6c21a9620aa78aecfb86c4d65cc362366e5222d0867a9b1’
  • “tevunah” becomes ‘66514d92bd1c19543a663ccff8d31522d1d9c4a853102231154cb67dd3f33b43’

There is no way to go back from the hash to ‘tevunah’. Since “Tevunah” and “tevunah” give different hashes, we want to standardize the inputs by lower-casing every email and trimming leading and trailing whitespace.
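
As a minimal sketch of why this standardization matters, assuming Python's standard hashlib module (the helper names normalize and sha256_hex and the sample addresses are made up for this illustration):

```python
import hashlib


def normalize(email):
    """Trim surrounding whitespace and lower-case the email."""
    return email.strip().lower()


def sha256_hex(text):
    """Return the SHA-256 digest of text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


raw_a = " Matt@Tevunah.com"
raw_b = "matt@tevunah.com "

# Without normalization the two spellings hash differently.
print(sha256_hex(raw_a) == sha256_hex(raw_b))  # False

# After normalization they hash to the same value.
print(sha256_hex(normalize(raw_a)) == sha256_hex(normalize(raw_b)))  # True

# The digest is 64 hex characters whatever the input length.
print(len(sha256_hex("a")), len(sha256_hex("a" * 1000)))  # 64 64
```

Any difference in case or spacing that survives into the hash input would make two copies of the same email look like different people.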

Salting

For added security, we can add a “salt” by appending a fixed string, such as ‘somethingrandom’, to the end of every email before hashing. The examples in this post use the salt ‘table’. Both parties must use exactly the same salt, otherwise their hashes will never match.
  • Example: “ Matt@Tevunah.com” becomes “matt@tevunah.comtable”
Also, include an email known to be in both lists as a sanity check that the hashing works the same way on both sides.

Lookup tables

The primary function of salts is to defend against dictionary attacks and their pre-computed equivalent, rainbow table attacks.

By adding a random salt you substantially decrease the chance of finding an email in a lookup table of hashes for common names.

The hash of ‘helloworld’, 936a185caaa266bb9cbe981e9e05cb78cd732b0b3280eb944412bb6f8f8f07af, is found in a lookup table, but the hash of ‘helloworldtable’ is not.
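
The effect can be sketched with a toy lookup table. This is only an illustration: real lookup tables hold vastly more entries, and the three-word list and the helper name sha256_hex are assumptions made for this example.

```python
import hashlib

SALT = "table"  # the example salt used in this post


def sha256_hex(text):
    """Return the SHA-256 digest of text as a hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# A toy "lookup table": precomputed hashes of a few common strings,
# keyed by hash so an attacker can map a hash back to its input.
common_inputs = ["password", "helloworld", "letmein"]
lookup_table = {sha256_hex(word): word for word in common_inputs}

unsalted = sha256_hex("helloworld")
salted = sha256_hex("helloworld" + SALT)

print(lookup_table.get(unsalted))  # helloworld
print(lookup_table.get(salted))    # None
```

The salted hash is missed only because the table was built without knowing the salt, which is why the salt should not be a value an attacker would guess.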

In Python:

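import hashlib
import pandas as pd

# Load the list and standardize it: trim whitespace, lower-case
email_df = pd.read_csv('~/Downloads/Emailtest.csv')
email_df['email'] = email_df['email'].str.strip().str.lower()


def hash_email(email):
    salt = 'table'
    # sha256 needs bytes in Python 3, so encode the salted string first
    return hashlib.sha256((str(email) + salt).encode('utf-8')).hexdigest()


email_df['emailhash'] = email_df['email'].apply(hash_email)
# index=False keeps the pandas row index out of the shared file
email_df.to_csv('~/Downloads/Emailtesthashed.csv', index=False)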
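
The sanity check mentioned earlier, hashing an email known to be in both lists, might look like the sketch below. The address sanity-check@example.com and the helper pipelines_agree are hypothetical, and the partner's hashes are simulated here rather than read from a file.

```python
import hashlib

SALT = "table"  # must be identical on both sides
KNOWN_EMAIL = "sanity-check@example.com"  # hypothetical address agreed on by both parties


def hash_email(email, salt=SALT):
    """Standardize, salt, and hash one email."""
    cleaned = str(email).strip().lower()
    return hashlib.sha256((cleaned + salt).encode("utf-8")).hexdigest()


def pipelines_agree(our_hashes, partner_hashes, known_email=KNOWN_EMAIL):
    """True when the known email's hash appears in both sets of hashes."""
    expected = hash_email(known_email)
    return expected in our_hashes and expected in partner_hashes


ours = {hash_email(e) for e in ["Sanity-Check@example.com ", "a@example.com"]}

# A partner using the same cleaning and salt passes the check.
partner_ok = {hash_email("sanity-check@example.com")}
# A simulated partner who forgot the salt fails it.
partner_bad = {hashlib.sha256(b"sanity-check@example.com").hexdigest()}

print(pipelines_agree(ours, partner_ok))   # True
print(pipelines_agree(ours, partner_bad))  # False
```

A failed check usually means the two sides differ in salt, casing, trimming, or text encoding, which is worth settling before reading anything into the match counts.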

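In Excel, the cleaning and salting step is ‘=CONCAT(TRIM(LOWER(email_column)),"table")’. Note that this only builds the standardized, salted string. Excel has no built-in SHA-256 worksheet function, so the hashing itself still has to be done with another tool, such as the Python code above.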

We can then take the files containing just the emailhash column and match them, the equivalent of a VLOOKUP. In Python:

client_df = pd.read_csv('~/Downloads/our_list.csv')
# An inner merge keeps only the hashes present in both lists
joined_df = email_df.merge(client_df, on='emailhash', how='inner')
joined_df.to_csv('~/Downloads/joined.csv', index=False)
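
The same matching idea can be sketched without pandas, using only the standard library. All addresses below are made up, and the partner's hashes are simulated in place of a received file.

```python
import hashlib

SALT = "table"  # the example salt used in this post


def hash_email(email, salt=SALT):
    """Standardize, salt, and hash one email."""
    cleaned = str(email).strip().lower()
    return hashlib.sha256((cleaned + salt).encode("utf-8")).hexdigest()


# Our own list, in plain text.
our_emails = ["Ann@example.com", "bob@example.com", "cy@example.com"]

# What the partner sends: hashes only, never the raw emails.
partner_hashes = {
    hash_email(e)
    for e in ["bob@example.com", " ANN@example.com", "dee@example.org"]
}

# Map each of our hashes back to our own plain-text email.
our_by_hash = {hash_email(e): e for e in our_emails}

# Emails present on both sides; the partner's unmatched email stays hidden.
matched = sorted(our_by_hash[h] for h in partner_hashes if h in our_by_hash)
print(matched)  # ['Ann@example.com', 'bob@example.com']
```

Only hashes that already exist in our own list can be turned back into emails, so the partner's remaining addresses are never revealed to us in plain text.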

Hashing eliminates the need to share the full customer list: clients can share only the hashed values, and we can still determine which emails match.