Privacy and confidentiality of genetic information: from genetic privacy to open consent

1. Introduction

1.1. Economics of WGS have changed

Genome sequencing technology has advanced at a rapid pace, and its cost is expected to keep falling dramatically, with a $100 whole genome sequencing (WGS) expected within 3-5 years. To put things in perspective, the cost was a little over $100,000 in 2008. The collection and analysis of such data, a field known as genomics, have the potential to drive scientific research, help solve crimes and support the further development of personalised medicine. Nevertheless, the more secrets we uncover about our genetic information, the more eugenics and transhumanism take shape as potential realities.

1.2. There are individual and collective benefits ….

The benefits of this genomics revolution are potentially enormous, and the biomedical community has remarkably high hopes for such data. A recent publication in Nature Medicine by Staaf et al. has shown that WGS of tumour cells could help predict the prognosis of a patient's cancer and offer clues to identifying the most effective treatment. The development of precision medicine using that trove of data could enable the preventive care and monitoring of many diseases of genetic origin such as Marfan syndrome, early-onset Alzheimer's, or silent heart conditions. It would also reduce the time, cost, and failure rate of pharmaceutical clinical trials, thus reducing the cost of new medicines. The current average cost of bringing a new drug to market is estimated at 1.3 billion USD, and costs are expected to keep rising sharply over the next few years.

1.3. …and dangers lurking in the shadows

That race to harness the potential of genomics will usher in a new era for genetic engineering and genetic screening. Genetic engineering could lead to a new era of transhumanism and enlightenment where humans are stronger, healthier, and enjoy enhanced cognitive abilities; or it could bring us closer to a futuristic dystopia where genetic discrimination and eugenics define society. The potential for genetic discrimination, if left unchecked, has been a major concern for researchers, health professionals and the public. Some nefarious consequences of genetic discrimination were illustrated in the 1997 film Gattaca. That future is not as distant as we might believe, and genetic discrimination is already occurring in various countries. In Australia, health insurance is "community rated": premiums are based on the risk of the entire population rather than assessed case by case, so genetic information is less likely to influence health insurance coverage decisions. Even so, life insurance companies are legally allowed to "underwrite" when evaluating the genetic risks of applicants, and this underwriting can lead to higher premiums for those with higher genetic risks. Australian life insurance companies can require individuals to report genetic testing results if they have already been tested (even through direct-to-consumer tests) but cannot force individuals to take genetic tests. In Argentina, health plans discriminate against those who have disabilities or genetic conditions. Another true story exemplifying genetic discrimination was shared by Dr Noralane Lindor at the Mayo Clinic's Individualizing Medicine Conference (2012). During her study of a cancer patient, Dr Lindor also sequenced the grandchildren of her patient, two of whom turned out to carry the mutation for the same type of cancer. One of these grandchildren applied to the US Army to become a helicopter pilot. Even though genetic testing is not a required procedure for military recruitment, as soon as she revealed that she had previously undergone the aforementioned genetic test, she was rejected for the position.

1.4. Separation of metadata from genetics data is no longer privacy-preserving

Those examples are only the beginning, as only five countries (the US in 2008, the UK in 2010, France in 2002, Canada in 2017 and Malawi in 2003) have at least partial laws against genetic discrimination. Some efforts have also been made to apply those principles at the European Union level, and provisions regarding genetic testing have been implemented, but they have yet to translate into fully fledged genetic discrimination laws. One straightforward way that has been suggested to protect people against such discrimination is to separate the metadata from the genetic data and associated results. This metadata, formally known as identifiers, can include name, postal address, age, sex, sexual orientation, prior medical history and many more. However, the increased availability of genomics data, combined with inadequate or nonexistent regulations worldwide, has major implications for personal privacy. Apart from identical (monozygotic) twins, each genome is unique to a person. The genome being part of our fundamental essence, it is possible to infer precise phenotypical and biological traits of the person. For example, it would be straightforward to determine gender, ethnicity, or eye colour, among other things, from the genome alone. The length of the telomeres can even give an indication of a person's age range. Gymrek et al. have demonstrated that it is possible to reidentify a large number of people (50% of a randomly sampled set in their study) and their families using only the Y chromosome, commercial genealogy databases and freely available metadata (the age and state of residence of the participant at the time of collection) from the dataset. That ease of reidentification demonstrates that simply separating identifiers from the genomics data is not enough to protect people's privacy.
Those new threats require new principles, new regulations, and therefore new systems to protect our right to privacy while allowing research to be carried out and enabling new precision medicine treatments to be developed.
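The Gymrek-style attack described above can be sketched in a few lines. This is a hedged illustration only: all names, records and field names below are invented, and the surname-inference step (matching Y-STR markers against a commercial genealogy database) is assumed to have already happened.

```python
# Hedged sketch of a Gymrek-style linkage attack (all names and records are
# invented). An "anonymised" genomic record is linked to public records using
# only a surname inferred from Y-chromosome markers plus the coarse metadata
# left in the dataset (age and state of residence).

# Assumed prior step: surname inferred from the Y-STR haplotype via a
# commercial genealogy database.
anonymised_record = {"inferred_surname": "Keller", "age": 55, "state": "MD"}

# Public records exposing only quasi-identifiers, no genetic data.
public_records = [
    {"name": "Mark Keller", "age": 55, "state": "MD"},
    {"name": "John Keller", "age": 32, "state": "CA"},
    {"name": "Alice Smith", "age": 55, "state": "MD"},
]

def link(anon, records):
    """Return every public record matching surname, age and state."""
    return [
        r for r in records
        if r["name"].split()[-1] == anon["inferred_surname"]
        and r["age"] == anon["age"]
        and r["state"] == anon["state"]
    ]

# A single surviving match means the "anonymised" genome is reidentified.
print(link(anonymised_record, public_records))
```

The point of the sketch is that none of the attacker's inputs is the genome itself: surname, age and state alone shrink the candidate set to one.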

2. How to uniquely reidentify a grain of sand

2.1. Genetic data is highly unique

The human genome contains approximately 3 billion base pairs distributed across twenty-three pairs of chromosomes. Despite this size, it is estimated that the DNA of two individuals differs on average by 0.6%, or roughly 20 million variant positions. By comparison, the first facial recognition systems used only 21 points to uniquely identify someone. In addition to naturally occurring variants, environmental factors can introduce new variations throughout the life of an individual: viruses and radiation can cause further changes to the genome, thus creating new identifiable markers.
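As a rough sanity check, the per-pair difference count follows directly from the quoted rate (an approximation; published estimates vary with how indels and structural variants are counted):

```python
# Back-of-the-envelope check of the figures above (approximate values).
genome_size = 3_000_000_000   # ~3 billion base pairs
difference_rate = 0.006       # ~0.6% difference between two individuals
differing_positions = genome_size * difference_rate
print(differing_positions)    # ~1.8e7, in line with the ~20 million cited
```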

To make matters worse, those variations between individuals are the key to uncovering the causes of diseases, so removing them would prevent any scientific research from being conducted. Furthermore, genotypic data is never collected alone: a wide range of metadata is collected to give context and provide insights on the genotypic data. 450,000 UK Biobank whole genome sequences are already available for both public and private research. While the data has been collected with the "consent" of the patients and personal identifiers (e.g. name or home address) are removed from the data provided to researchers, it still contains an extensive amount of information on each patient: currently 628 data fields, including location, education, physical measure summaries and mental health, among others. This uniqueness and richness make reidentification vastly easier than with other types of personal data, and the data's sensitivity could result in great harm if it fell into the wrong hands.

2.2. Reidentification poses threats

Indeed, it is common knowledge that genomics data contains a lot of information regarding associations with certain diseases and kinship. Breast cancer risks are well known to be linked to the BRCA1 and BRCA2 genes; a single-nucleotide mutation causes sickle cell disease; and the leading cause of cervical cancer, the human papillomavirus (HPV), a well-known sexually transmitted infection, can be detected through the integration of the viral genome into the host genome. All that information is readily available in our genome, and trained experts with enough resources could infer a lot more still.

2.3. Threats due to inference beyond reidentification

An additional tool that could be leveraged to reidentify people is epigenetics, the study of heritable phenotype changes that do not involve alterations in the DNA sequence. One example of an epigenetic mechanism is DNA methylation, which modifies the function of genes and affects gene expression. A known disruptor of DNA methylation is smoking: a study has shown that exposure to cigarette smoke is associated with reproducible and specific DNA methylation changes in newborn and adult blood. It is therefore possible to positively identify a smoker or former smoker using epigenetics. All this auxiliary information about an individual further increases the risk of reidentification, and those insights can have real-world consequences on our lives (e.g. genetic discrimination, as highlighted in the previous section).

2.4. Kinship and corporate uses create issues

The genome, and the compilation of a bank of genomes, can reveal family relationships between different individuals. 23andMe has caused numerous controversies all over the world. Revealing (sometimes erroneously) to Korean clients that they had a substantial amount of Japanese ancestry, or to a US teenager that her father was not her biological father, was probably more information than they bargained for. Those controversies come on top of the company's questionable closeness with backers such as billionaire Facebook investor Yuri Milner and Google Ventures, and the sale of its anonymised database to Genentech for $60M. In France, both victims and perpetrators of serious crimes are subject to DNA collection. While strict rules have been put in place to protect that data and prevent abuse by the state, the absence of transparency regarding its storage and use can be the source of rightful concerns.

2.5. There is a potential threat from states and state agencies

The danger those genomic banks represent is not limited to an individual's privacy; it can also be a threat to their very life. Population stratification is a widely used method in clinical research by which populations are grouped according to differences in allele frequencies between subpopulations. Population structure frequently arises from physical separation by distance or barriers, like mountains and deserts, followed by genetic drift. This stratification is essential because an association with a given trait or disease could otherwise be attributed to the underlying structure of the population rather than to a disease-associated gene. While it is a critical step in any clinical study to uncover the actual underlying cause of a disease, it also makes it possible to positively identify people as members of specific subgroups. In the context of research, this stratification is harmless and necessary. Nevertheless, when used to single out a particular population in a given country, it could have dramatic consequences. History books are filled with examples of discrimination for racial or political reasons, and China's Uyghurs or Myanmar's Rohingya are sadly the latest examples. Uyghurs have a distinct ancestry from the majority Han Chinese, making them easily identifiable from a genetic standpoint. Research has shown that it is already possible to infer characteristic facial features from genetic data alone. That threat from the state may look distant from the point of view of Western citizens but, once a genetic data bank is established, we are only a few checks and balances away from a new dystopian society.

2.6. Criminal organisations and individuals could misuse the data

If the threat of a rogue state looks scary, the threat from a criminal organisation is one that will send chills down your spine. Allergic diseases are widespread in the Western world. It is estimated that about 20% of people are affected by allergic rhinitis, about 6% of people have at least one food allergy, and about 20% have atopic dermatitis at some point in their life. Besides, about 1–18% of people have asthma, depending on the country. Allergic diseases are strongly familial, which implies that genetic factors are at play. While the mechanisms are not yet fully understood (one usually does not inherit a particular allergy, only a predisposition to allergies), it is not unrealistic to imagine that, in a not so distant future, scientists will be able to predict someone's allergies from birth. With this capability in the hands of criminals, and only a handful of additional information, that hypothetical threat would become a very real scenario.

2.7. The consequences of individual and collective secrecy

Given the dark picture we have painted up to this point, one would be tempted to keep all genomic information secret, to anonymise it very heavily, or not to sequence it at all. Unfortunately, the consequences of any of these choices would be devastating. Absolute secrecy would prevent any research from being carried out and thus prevent potentially life-saving discoveries. At the same time, complete anonymisation of the data and removal of the associated metadata would render the genomic data wholly and utterly useless. As a matter of fact, the metadata and the variations in the genetic data are the core elements that make it possible to identify potential risk factors for a given patient and provide new knowledge in the context of research. It is therefore essential to find a balance between utility and privacy with respect to genomic data. In order to achieve that balance and unlock the potential of genomic data while preserving privacy, it is crucial to adopt the query-based paradigm for data release. Rather than publishing (de-identified) data, the system must store the data in a protected environment and allow analysts to send queries about the data. Since analyses are computed using fine-grained data, it is possible to achieve better utility and stronger privacy compared to de-identification techniques.
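The query-based paradigm can be illustrated with a minimal sketch (in-memory store, invented class and field names): raw records never leave the system, analysts only receive aggregate answers, and counts small enough to be re-identifying are suppressed.

```python
# Minimal sketch of the query-based paradigm: the store answers aggregate
# queries over the raw data but never releases individual records.
# Class name, field names and threshold are illustrative assumptions.

SUPPRESSION_THRESHOLD = 5

class ProtectedStore:
    def __init__(self, records):
        self._records = records  # raw data, never returned directly

    def count(self, predicate):
        """Answer a counting query; suppress small (re-identifying) counts."""
        n = sum(1 for r in self._records if predicate(r))
        return n if n >= SUPPRESSION_THRESHOLD else None  # None = suppressed

store = ProtectedStore(
    [{"variant_x": True, "smoker": False} for _ in range(12)]
    + [{"variant_x": True, "smoker": True} for _ in range(2)]
)

print(store.count(lambda r: r["variant_x"]))                  # 14
print(store.count(lambda r: r["variant_x"] and r["smoker"]))  # None: too few
```

The contrast with data publishing is the design point: the analyst's code sees only the two printed aggregates, while the fine-grained records stay inside the protected environment.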

The growing number of health-data breaches (510 breaches affecting 40 million people in the U.S. in 2019 alone), the use of genomic databases for law enforcement and research purposes, the ease of reidentification and the lack of transparency of personal-genomics companies are raising unprecedented privacy concerns.

3. How to approach the safe genomics conundrum

3.1. Principles of safe genomics and importance of deniability 

The threats that we have presented so far are part of a larger panel of potential threats and, while some of them are more likely to materialise than others, the vital necessity of developing new methods to solve the safe genomics challenge has become apparent.

In an ideal world, sequencing would be done from the comfort of anyone's home by receiving a kit and following simple instructions. Once all the necessary steps were finished, you would throw away the sequencer and transfer the data to a secure warehouse (local or in the cloud) of your choice. The warehouse would be secure against external attacks, host the data encrypted at all times (i.e. at rest, in transit and during processing) and be void of any potential identifiers that would allow the reidentification of the data originator. The person could then browse their data, explore its variations using dedicated tools and generate automated reports describing potential health risks, their ancestry, or interesting facts about them (polyallelism, unusual variations, etc.).

The data would have a unique identifier that would allow you to identify your own data. Each new sequencing would generate a new unique identifier and thus prevent any membership attack. The strict decorrelation of the metadata from the data itself gives way to the possibility of deniability (i.e. nobody can definitively prove the existence of that genetic data), as you would be the sole custodian of the data. Indeed, under those circumstances, it would be impossible for anyone else to demonstrate that you have that data, and thus impossible for anyone to force you to hand it over. That deniability is the ultimate privacy right for genetic data: it would prevent genetic discrimination, questionable insurance premiums and other undesirable consequences. We acknowledge that this right to deniability is a morally grey area. Nonetheless, we believe that people's safety and a fair playing field are morally more important.
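The per-sequencing identifier idea can be sketched directly (function name is illustrative): each run gets a fresh random identifier, so two uploads of the same genome cannot be correlated by their identifiers alone.

```python
# Sketch of per-sequencing identifiers: a fresh, cryptographically random,
# unlinkable ID per sequencing run. The function name is an assumption for
# illustration; only the use of a CSPRNG is essential.
import secrets

def new_sequencing_id() -> str:
    """Return a fresh 128-bit random identifier for one sequencing run."""
    return secrets.token_hex(16)

id_run_1 = new_sequencing_id()
id_run_2 = new_sequencing_id()  # same genome, new sequencing run
assert id_run_1 != id_run_2     # IDs carry no linkable information
```

Because the identifier is random rather than derived from the genome, knowing one run's ID tells an observer nothing about any other run, which is what underpins the deniability argument above.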

Ideally, such a system would satisfy some important properties: 

  • Secure against threats. The infrastructure is secure against penetration and malicious-use attacks that aim at gaining unauthorised access to the data. 

  • Flexible. Data analysts should be able to submit different queries that serve a large array of analytical purposes. This can be achieved by enabling developers to propose new algorithms that can be loaded on the platform. 

  • Privacy-preserving. The query results should consist of aggregate data, and never disclose individual-level information of people whose records are in the dataset. This guarantee should hold when analysts obtain and combine outputs for multiple queries. 

  • User-defined and auditable data control. Every activity performed on the platform (e.g., data upload, query execution, data-access request) is logged on smart contracts (self-executing protocols that can change the blockchain's global state), which automatically enforce the data-access policies (granting or denying access to all or part of the data) dynamically defined by the data owners. 

  • Open. The code of the algorithms should be open-source. This allows for better security, privacy and utility, as everybody can review and contribute to the algorithms. 

3.2. Benefits of implementing these principles: Individuals-Centred, Auditable, and Privacy-Preserving Genomics

User-defined and auditable data control gives users back full control over their data. This aspect is often overlooked but is of critical importance, as it embodies the consent of the user. By giving back full control over how, when and what data is used by external parties, it provides guarantees against misuses such as the one we witnessed during the collaboration between the NHS and Google DeepMind. While the project was laudable (monitoring and diagnosing acute kidney injury), giving Google's researchers access to sensitive information such as HIV status, mental health history and abortions was a deep violation of patients' privacy. That property would also grant users plausible deniability to prevent any unintended consequences.

In practice, in order to implement the proposed system for the secure exploration of genomic datasets with controlled and transparent data access, a novel approach must be brought forward that combines cryptographic privacy-preserving technologies, such as secure multi-party computation, with the auditability of smart contracts and the latest privacy techniques.

To establish trust and incentivise genomic data sharing, data-access requests and users' consent must be communicated transparently and maintained immutably, which would ensure auditability and deter misuse. If an individual holding their DNA data wants to use the analytics of a biotech company, the protocol must guarantee that such an analysis can be done without any leakage, and without anyone getting access to the user's data. The open-source nature of the algorithms and the idea of bringing algorithms to the data, rather than transferring data to external systems, have already been adopted by several other projects such as, in genomics, the Global Alliance for Genomics and Health (GA4GH) Beacon Project.

The security of the platform is ensured by end-to-end protection of the data (i.e., during storage, transfer, and computation). This ensures that, at any given time, the data is protected against both inside (system administrators, engineers, etc.) and outside (hackers) malicious attackers. The use of Intel SGX® and the ability to attest that the executing code is legitimate provide the necessary guarantees that no foul play is at work. 

Nonetheless, even when the data is protected end-to-end during the computation, the released aggregate results can still be used by an attacker to try to re-identify individuals whose data was part of the input dataset, through membership or attribute inference attacks. Rare-disease datasets make a perfect example of those risks: an attacker could infer with high confidence that a target individual is part of the dataset if they hold some background information about the target, such as demographic data or clinical attributes that restrict the set of matching rare-disease patients to a very low count. The use of differential privacy as the data-release mechanism can prevent the re-identification of any individual from the query results with high confidence. Multiple layers of privacy protection need to be put in place to ensure that the information revealed by query results remains small. In the context of genomics, per-user privacy budgets, low-count suppression, and output noise addition will form the foundation of the privacy mechanisms. 
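The three mechanisms just named can be combined in a short sketch. This is a hedged illustration, not a production mechanism: the class name and parameter values are assumptions, and only counting queries (sensitivity 1, Laplace noise) are covered.

```python
# Hedged sketch of per-analyst privacy budgets, low-count suppression, and
# Laplace output noise for counting queries (sensitivity 1). Class name and
# parameter values are illustrative assumptions, not recommendations.
import math
import random

class PrivateCounter:
    def __init__(self, records, total_epsilon=1.0, threshold=5):
        self._records = records
        self._budget = total_epsilon   # total privacy budget for one analyst
        self._threshold = threshold    # low-count suppression threshold

    def _laplace(self, scale):
        """Sample Laplace(0, scale) noise via inverse-transform sampling."""
        u = random.random() - 0.5
        return -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))

    def noisy_count(self, predicate, epsilon=0.1):
        """Answer a counting query, charging `epsilon` against the budget."""
        if epsilon > self._budget:
            raise RuntimeError("privacy budget exhausted")
        self._budget -= epsilon
        true_count = sum(1 for r in self._records if predicate(r))
        if true_count < self._threshold:
            return None                # suppress rare, re-identifying counts
        # Counting queries have sensitivity 1, so Laplace(1/epsilon) suffices.
        return true_count + self._laplace(1.0 / epsilon)
```

With epsilon = 0.5 the noise has scale 2, so a count of 50 true matches is typically reported within a few units of the truth, while repeated querying is bounded by the budget rather than allowing noise to be averaged away indefinitely.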

The efficient, privacy-preserving, and auditable data-discovery functionality provided by such a platform can enable several common genomics use cases on large populations of individuals, with response times in the order of a few minutes. For instance, it allows drug-target validation, whereby researchers test whether a genetic variant associated with a drug target is more common in a case group than in a control group for different populations spread out over the entire globe. The platform can also be used to determine the frequencies of genetic variants of interest in the population before recruiting for clinical trials; this capability would help mitigate recruitment failures due to overly narrow inclusion criteria. 
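The drug-target validation query reduces to comparing variant frequencies between two cohorts. A minimal sketch with invented counts (one standard way to do this is a two-proportion z-test; in the platform described above the two counts would themselves be privacy-protected aggregates):

```python
# Illustrative sketch of drug-target validation: compare the frequency of a
# variant between a case and a control group with a two-proportion z-test.
# All counts are invented for illustration.
import math

def two_proportion_z(carriers_case, n_case, carriers_ctrl, n_ctrl):
    """Return the z statistic for a difference in variant frequencies."""
    p1 = carriers_case / n_case
    p2 = carriers_ctrl / n_ctrl
    pooled = (carriers_case + carriers_ctrl) / (n_case + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_case + 1 / n_ctrl))
    return (p1 - p2) / se

# Invented counts: variant carried by 120/1000 cases vs 80/1000 controls.
z = two_proportion_z(120, 1000, 80, 1000)
print(round(z, 2))  # |z| > 1.96 suggests significance at the 5% level
```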

This approach would allow DNA research enthusiasts to explicitly consent when granting access to their DNA data and metadata for a specific processing. These capabilities would facilitate research by enlarging available data sets and removing the middlemen profiting from them, therefore freeing more funds for research itself. 

4. Conclusion 

The idea of sequencing the genome of every newborn may add years of life expectancy to future generations and further usher in the arrival of precision medicine. But science for the greater good, and the promises it holds, must not blind us to the dangers that await, and the life of servitude it might create if left without proper checks and balances.

Current regulations, where they exist, are now outdated in light of scientific and computational progress. Privacy research has not focused as much effort on medical data as it has on location data. New principles, new regulations and new privacy technologies must be brought forward to fully realise the potential of genetic data. If those changes fail to pass and societies give up essential privacy to purchase a little temporary safety, they deserve neither privacy nor safety (Benjamin Franklin would surely agree).