HEGP CHALLENGE

1. HEGP challenge
2. Details
3. Reference code
4. Submissions
5. Feedback
6. General introduction
7. Toy Example
8. References

1. HEGP challenge

2020-11-01 The authors announce the HEGP (heegeepee; hiːdʒiːpiː) Challenge, with a $1,000 (one thousand dollar) prize for the individual or group who can crack HEGP encrypted data.

See this page for a general introduction and relevance of HEGP

2. Details

The authors offer a reward of $1000 (one thousand US dollar) for cracking the HEGP homomorphic encryption method for genotypes and phenotypes as described in this open access paper.

The challenge consists of decryption of two genotype data sets.

In both data sets, plain text genotypes are represented numerically as matrices of allele dosages, with rows representing individuals and columns representing SNP sites or genotype markers (see below toy example). Each plain text genotype dosage takes the integer values 0,1,2. After encryption these dosages are transformed into floating point cyphertexts that are approximately normally distributed. The challenge is to recover the unencrypted original values from the cyphertext.

In HEGP, the ciphertext has been generated by matrix multiplication of the plaintext genotype matrix by a random orthogonal matrix (the key generated using Linux /dev/random and rustiefel) which is unknown to you. The task therefore is to find the inverse of this matrix, which because it is orthogonal, will be the transpose of original key.

For the first challenge1 we encrypted a data set consisting of 500 individuals by 12,359 SNP genotypes which exists in the public domain. We consider this data cracked if 50 individuals (10%) are identified correctly. Note that this data might be from human, mouse, rat, nematode or plants. Note also that these genotypes are a subset of a public dataset, and that any missing genotypes in the public data have been imputed before encryption.

The second challenge2 may be harder. We encrypted a mammalian data set that is not in the public domain (yet), consisting of 1,586 individuals and 12,561 SNPs. We consider the code cracked if you can compute the plaintext genotypes of this matrix correctly with mean error of under 10%.

The ciphertext files for challenge1 and challenge2 can be downloaded from https://files.genenetwork.org/HEGP-challenge/

The MD5 checksums of the ciphertext files are found in md5sum.txt. The full list, also for the original (secret for now) plaintext and keys is

5605626a9403ebdf96041d2dd766a516  challenge1.ciphertext.RData
148ce09c6dfc878af09874c090a7bd64  challenge1.ciphertext.txt
abbc5ce33025e723805cb14a332f6c9d  challenge1.maf.txt
29f348fd4d9fbd453f658def53dbb87a  challenge1.key.RData
b03e91f69d8124658e30562e1fa3b64c  challenge1.key.txt
dcf724ff4798f4ec6674d029f220385f  challenge1.plaintext.RData
e26367b991dfcedeaee139ffcc9a38a7  challenge1.plaintext.txt
d2db18e96a3a8173b12383aa8d606583  challenge1.sample.ids.txt

6d5d10279424a7c952ed8f9b0b086e84  challenge2.ciphertext.RData
c3706ecf12275859e3a040b23af1c9f9  challenge2.ciphertext.txt
15fa26a4d03e18f05643e917b8480216  challenge2.maf.txt
9615196bf89fb19105686a018e4c40c0  challenge2.key.RData
ecf00d15e233b6dda73e40ebdc3358d2  challenge2.key.txt
a007e8f5f085798a88180386ffbfa58f  challenge2.plaintext.RData
2b1870215dd67cfe1b19c423979ab3c7  challenge2.plaintext.txt

With this challenge we want to address the concern that individuals might be identified based on information in the public domain and/or by finding patterns in the encrypted data.

We will release the plaintext file and the key file when the challenge is cracked or after six months (which is May 1st 2021). In case the prize is unclaimed we'll create a new dataset for a fresh challenge.

The ciphertext Datasets are provided as RData objects which can be manipulated using the hegp R package. Briefly, a dataset D is a list, comprising of minimally a phenotype vector D$y, and a genotype dosage matrix D$geno (rows are individuals, columns are SNPs). There are additional components that are useful for GWAS but not relevant for the decryption challenge. Full details are in the documentation of the R package. The challenge datasets have fake phenotype data that can be ignored.

To unpack the RData original datasets and write the genotypes and allele frequencies to tab-delimited text files from R, convert with

library(readr)
load("challenge2.ciphertext.RData")
write.csv( De$geno, "challenge1.ciphertext.txt", quote=F)

To run a GWA use the mixed-model-gwas and hegp-R packages you can run in R

library(hegp)
load("challenge2.ciphertext.RData")
ge.mm=basic.mm.gwas(De,mc.cores=1)
  Built kinship dim 1586 1586
  estimated heritability 4.355436e-05
ls(ge.mm)
  [1] "bp"   "chr"  "logP"
ge.mm$logP[0:6]
  [1] 0.22838514 0.74881022 0.13431556 0.00929719 1.13027222 0.13211178

and draw a plot with

library(ggplot2)
ggplot(ge.mm) + geom_point(aes(bp, logP), colour="#000099")

To validate we ran the encrypted version (De) and then plaintext version (D) to compare outcomes with

library(hegp)
load("challenge2.ciphertext.RData")
load("challenge2.plaintext.RData")
ls()
  [1] "D"  "De"
ls(D)
  [1] "cov"  "geno" "maf"  "map"  "y"
ls(De)
  [1] "cov"  "geno" "map"  "y"
g=basic.gwas(D)
ge=basic.gwas(De)
mean(abs(g$logP-ge$logP))
  0.02951522
ge.mm=basic.mm.gwas(De,mc.cores=1)
  Built kinship dim 1586 1586
  estimated heritability 4.355436e-05
g.mm=basic.mm.gwas(D,mc.cores=1)
  Built kinship dim 1586 1586
  estimated heritability 4.356434e-05
mean(abs(g.mm$logP-ge.mm$logP))
   [1] 0.02943622

Documentation for the R encryption and GWA functions can be found here.

3. Reference code

The reference code for HEGP is published under GPLv3 licensed R code and Julia code. An example of running a GWA as presented in the paper can be found here. The algorithm with a description of brute force attack is described in the results section.

4. Submissions

Submissions should be posted in a permanent public git account (GitHub, gitlab or similar) and reward will be given in USD. The solution should be reproducible and announced on the website issue tracker. In case you don't want to use the issue tracker it is also possible to E-mail the authors to indicate where the solution is hosted. One such E-mail address is pjotr.public542 at thebird dot nl.

Important note: only real solutions will be considered that can reliably crack the key of any such dataset. If the original key somehow gets ripped from one of our laptops, for example, it does not count. Security stands or falls with how data is handled and how a key is generated. Even so, this challenge is about algorithmically cracking the code for the general case.

5. Feedback

Feedback on the challenge, this website etc., can be posted to the issue tracker.

6. General introduction

The homomorphic encryption method consists of an orthogonal transformation by multiplication by the orthogonal matrix P with a matrix containing the data y and G.

From the original paper: privacy in relation to quantitative genetic analysis. (A) A numeric phenotype vector y (left) and genotype dosage matrix G (right) are represented as colours and shades of gray. Each row of the matrix represents one individual and each column one SNP. Genotypes are encoded as imputed dosages clustered at the values Embedded Image giving the numbers of alternate alleles. (B) The same data after multiplication by an orthogonal matrix P (a rotation represented by the curved orange arrow). The genotype dosages are now represented by a continuum of real numbers. (C) The distribution of dosages for a particular SNP (column of G), clustered around 0,1,2. (D) The distribution of the same dosages after orthogonal transformation by multiplication by the orthogonal matrix P (black histogram) with the normal distribution with same mean and variance superimposed in red. (F) The normal QQ plot for the data in D, showing the transformed dosages are very close to a normal distribution. (E) A cartoon of the HEGP scheme. The top black arrow and equation show the linear mixed model relating the phenotype y to genotype G with regression coefficients β representing the allelic effects. The variance matrix for the residuals is V. After multiplication by orthogonal matrix P, plaintext data y, G and the mixed linear model are transformed as shown in orange. The likelihood and regression estimates β are preserved. HEGP, homomorphic encryption for genotypes and phenotypes; QQ, quantile–quantile.

7. Toy Example

The task is best described with an small example, a SNP dosage matrix G with 10 subjects (rows) and 8 SNPs (columns):

sub1     0    1    2    1    1    1    1    2
sub2     0    1    2    1    2    2    2    2
sub3     1    1    2    1    2    2    2    1
sub4     2    1    2    2    1    2    1    1
sub5     2    2    2    0    1    1    2    1
sub6     2    2    2    1    1    2    2    2
sub7     1    2    1    1    1    1    2    2
sub8     0    1    2    1    2    1    2    2
sub9     1    1    2    2    1    1    1    2
sub10    1    2    2    1    2    2    2    1

The first step standardises each column to have mean 0 and variance 1. This does not encrypt the data but it makes the encryption easier to process. The resulting plaintext dosage matrix H is

sub1  -1.224745e+00 -0.7745967  0.3162278 -0.1761661 -0.7745967 -0.9486833 -1.449138  0.7745967
sub2  -1.224745e+00 -0.7745967  0.3162278 -0.1761661  1.1618950  0.9486833  0.621059  0.7745967
sub3  -5.749536e-10 -0.7745967  0.3162278 -0.1761661  1.1618950  0.9486833  0.621059 -1.1618950
sub4   1.224745e+00 -0.7745967  0.3162278  1.5854946 -0.7745967  0.9486833 -1.449138 -1.1618950
sub5   1.224745e+00  1.1618950  0.3162278 -1.9378267 -0.7745967 -0.9486833  0.621059 -1.1618950
sub6   1.224745e+00  1.1618950  0.3162278 -0.1761661 -0.7745967  0.9486833  0.621059  0.7745967
sub7  -1.761818e-09  1.1618950 -2.8460499 -0.1761661 -0.7745967 -0.9486833  0.621059  0.7745967
sub8  -1.224745e+00 -0.7745967  0.3162278 -0.1761661  1.1618950 -0.9486833  0.621059  0.7745967
sub9   5.584054e-10 -0.7745967  0.3162278  1.5854946 -0.7745967 -0.9486833 -1.449138  0.7745967
sub10 -7.763129e-10  1.1618950  0.3162278 -0.1761661  1.1618950  0.9486833  0.621059 -1.1618950

Next, we sample a random orthogonal 10x10 matrix P (the key)

 [1,] -0.02512827 -0.4797328  0.07364427  0.29302653 -0.52531836 -0.06077012 -0.035656119 -0.21264789 -0.03656364  0.59199429
 [2,]  0.49690414 -0.2130092 -0.18437213 -0.08094409 -0.33989098  0.11458794  0.382316649  0.51918534 -0.29625914 -0.18714471
 [3,]  0.23650365  0.2027581  0.09184543  0.01711933  0.45947279 -0.11898452  0.549979207  0.04901441  0.03087585  0.60259531
 [4,] -0.12668661  0.2640016 -0.24657807 -0.31047136 -0.08100156  0.01898761 -0.009330209 -0.34743514 -0.77909552  0.14950124
 [5,]  0.55558644 -0.1256327 -0.11075554 -0.35350803  0.12981113  0.51028646 -0.332763562 -0.28269321  0.20446208  0.16914618
 [6,] -0.20079972 -0.1310010  0.33769938 -0.45070382  0.09236098 -0.05672098 -0.390626201  0.58184807 -0.14355153  0.31914343
 [7,]  0.30606137  0.1956211  0.67114299 -0.30536330 -0.32325985 -0.31468499  0.106176504 -0.27993855  0.04238805 -0.17152201
 [8,] -0.31057809 -0.5813563  0.28456863 -0.21173902  0.26275032  0.30931163  0.411201611 -0.21696952 -0.12053771 -0.21059453
 [9,]  0.07808339 -0.3843085 -0.42321573 -0.37251969  0.13692025 -0.68291915 -0.007551188 -0.12694060  0.15450958 -0.05619238
[10,] -0.36775701  0.2383173 -0.23348592 -0.45381168 -0.41614295  0.20969099  0.325976664  0.04578156  0.44759290  0.14716487

Then we encrypt the dosages F = PH to make the ciphertext

sub1   0.5198393  0.26551339 -0.01916052  1.4507213  0.2713387  1.18915219 -0.7379842 -1.16229696
sub2  -1.3586430 -0.22207809 -1.14214913 -0.1029684 -0.2902612 -1.24575229 -0.1165288  1.69811232
sub3  -0.1600357  1.24824503 -1.06841109 -1.0985638  0.1892755 -0.50328813  0.7284091 -0.62463676
sub4  -0.1988559  1.28854213 -0.43475181 -1.5181190  0.7876347  1.15643550  1.6061022 -0.19111906
sub5   0.1706426  0.64079385  1.16737846 -0.5553653 -0.9595529 -0.17537980 -0.6155911  0.60202029
sub6  -0.8145915 -0.03660399  1.22187582 -1.2021241  2.1778375  0.07512188  1.6196177 -0.61084795
sub7  -1.4268916 -1.30507703 -0.35896396  0.1191278  0.8610707  0.21228333 -0.1347692  0.19297276
sub8   1.7994281  1.79366813 -1.42174710 -0.9805980 -1.1053001 -0.41331874  1.0923847 -0.53943417
sub9  -0.5944334  0.12376342 -0.50869100 -0.3285796 -0.6138749 -2.04363166 -0.7562708  0.08008153
sub10 -0.7061947  0.56034104 -1.04875037  0.7321312  0.4268941 -0.12102879  0.7390060  1.80792099

This form of encryption is homomorphic with respect to many quantitative genetics analyses, particularly the mixed model GWAS.

To decrypt the ciphertext it is necessary to multiply it by the inverse of the key P, which is equal to the transpose of P because it is orthogonal.

The challenge is, in the absence of knowing either P or H, to find an orthogonal matrix Q such that QF "looks like" a genotype dosage matrix. That is, the distribution of the plaintext for a given SNP will be trimodal (or bimodal is the rarer homozygote genotype is absent from the sample) with expected modes specified by the Hardy-Weinberg equilibrium distribution. We provide the allele frequencies of the plaintext, which will help in defining these expected modes (see the maf.txt files). It is reasonable to provide this information even though it might make the encryption less secure because users of the ciphertext would need this information for some genetic analyses.

We provide the toy dataset and its encryption key as R objects in the file "toy.hegp.RData" from Toy-HEGP so that you can reproduce these analyses. E.g.

# Load a table
toy = read.table("toy2.txt",header=F,row.names=1)

library(hegp)
load("toy.hegp.RData") # loads encryption key
plaintext.toy$geno = as.matrix(toy)
c = encrypt.D(plaintext.toy, key.toy)
d = encrypt.D(c, key.toy, invert=TRUE)
mean(abs(c$geno-d$geno))
[1] 2.524352e-09

8. References

The original HEGP paper
HEGP discussion on reddit crypto forum
For more background see this page