HEGP CHALLENGE

Homomorphic Encryption of Genotypes and Phenotypes (HEGP)
Edit text!

The $1,000 Challenge!

To go straight to the actual challenge click here.

Crack the code! The authors announce the HEGP (heegeepee; hiːdʒiːpiː) Challenge, with a $1,000 (one thousand dollar) prize for the individual or group who can crack HEGP encrypted data. HEGP, when proven solid, will have a large impact on the way human genetics is pursued today because it will allow for sharing genotype data while preserving privacy of the individual. Not only is sharing data required for reproducible Science, there is also a large interest in hosting data on laptops and servers that need not be HIPAA compliant.

On this page we chat about cracking HEGP, Rubik cube, DNA, strawberries and Enigma.

To go straight to the actual challenge click here.

Relevance

With a publication in Genetics we present a novel Homomorphic encryption method named HEGP that allows for sharing of genotypes and phenotypes in the context of Genome Wide Association (GWA) studies. Importantly sharing but without giving away private information and making individuals identifiable. For more information check out a privacy researcher's perspective and a recent follow up paper on Using encrypted genotypes and phenotypes for collaborative genomic analyses to maintain data confidentiality by Tianjing Zhao, Fangyi Wang, Richard Mott, Jack Dekkers, and Hao Cheng, Genetics. 2024 Mar 6;226(3):iyad210. doi: 10.1093/genetics/iyad210. PMID: 38085098, where they combine Bayesian variable selection methods for genetic parameter estimation, genomic prediction, and GWAS.

With HEGP anyone can freely hand over genotype and phenotype data to anyone else and have them reproduce the results. Even better, the encoding can be stacked. So the second party can encrypt and combine their data independently and give it to a third party.

So any party can use our data to add it their own analysis. And they can share their data back with us. As such, genotype data can be stored on (public) servers and GWA analysis can be reproduced.

This is a breakthrough in FAIR data sharing and contrasts greatly with current practice of hiding/protecting genotype data and only providing summary statistics. One example is an important UK Biobank depression study where genetic markers are presented with their statistical significance. We can only assume that this study can be reproduced by a group of researchers having access to the original data. The truth is that these outcomes can not be reproduced by you or me! HEGP will make data sharing and reproducible analysis a reality.

Cite

Mott R, Fischer C, Prins P, Davies RW. Private Genomes and Public SNPs: Homomorphic Encryption of Genotypes and Phenotypes for Shared Quantitative Genetics. Genetics. 2020 Jun;215(2):359-372. doi: 10.1534/genetics.120.303153. Epub 2020 Apr 23. PMID: 32327562; PMCID: PMC7268998.

Simple explanation

Homomorphic encryption is a mathematical translation of data into an encrypted form where the result of a computation is the same for the unencrypted and encrypted forms. With HEGP a matrix of data is transformed by a high-dimensional random linear orthogonal transformation key as described in the paper (open access) and visualised in below animation (hit ). The resulting matrix scrambles the data while preserving the 'shape of the data' for analysis. One way to think about this is that when a Rubik cube gets rotated the fields change colour, but the object still maintains its shape as a three dimensional cube. With DNA the genotype/phenotype shape is typically used to predict associations between genotypes and phenotypes. An example of a phenotype is a preference for strawberry taste. An example of an associated genotype is a DNA encoded olfactory receptor.

Here we display a typical example of genome-wide association (GWA) of phenotype against genotype:

Example of genome-wide
association (GWA) of phenotype against genotype

Genes (on chromosomes) involved in some trait are marked. This is the backbone computation for finding genes involved in some trait and pursued in the UK Biobank involving half a million subjects. To find associations GWA is applied to find genes involved in, for example, cancer or COVID-19 mortality; i.e., the first step towards finding causality and potentially better treatments.

In above image data is shown before and after encryption. The unencrypted data contains three values while the encrypted data shows a normal distribution.

Enigma and why a challenge?

The Enigma machine is an encryption device developed and used in the 20th century to protect commercial, diplomatic and military communication. It was employed extensively by Nazi Germany during World War II, in all branches of the German military (source wikipedia). Enigma encrypted text by a transformation and was cracked by the Polish Cipher Bureau in 1932 and the crack was used by the allied forces to win the war. To ascertain HEGP is bullet proof, unlike ENIGMA, we invite the algorithmic inclined to crack the code and make HEGP history (one way or the other).

Animation of HEGP encrypting using step-wise linear orthogonal transformation:
Edit text!

Animation

In above animation genotypes and phenotypes are encoded in a matrix. Rows are individuals and columns are genotype markers along chromosomes as well as phenotypes. Starting with the unencrypted data it gets scrambled by the keywhile the analysis will still render the same result (the black lines are values too small to show).

Is this encryption safe?

We tried find ways of cracking the code as discussed in our paper. Brute force guessing of a solution would take more compute seconds than there are molecules in the universe.

What is the challenge?

The challenge consists of decryption of two data sets.

For the first challenge we encrypted a data set consisting of 500 individuals by 12,359 SNPs that exists somewhere in the public domain. We consider this data cracked if you can identify 50 individuals correctly. Note that this data may be derived from human, mouse, rat, nematode or plant data.

The second challenge is harder. We encrypted a mammalian data set that is not in the public domain (yet). We consider the code cracked if you compute the plaintext genotypes of this matrix correctly with mean error of under 10%.

Who wins a $1,000 check?

The best result with a publicly reproducible solution submitted by December 31st 2020 wins the HEGP Challenge and a cash prize of one thousand dollars. Points are given for improving the method. In case of a tie the prize is shared. If there is no winner we will hold the prize until someone claims it.

How to submit?

Submit your answer through a public git repository with the exact steps take to get there. The analysis should be reproducible by anyone. You can announce your solution on the website issue tracker.

To go straight to the actual challenge click here.

(image created xkcd - license CC BY-NC 2.5)