Why we should care about zero-knowledge proof in machine learning

Damir Valput

2022-04-10 12:42:03
Reading Time: 4 minutes

Imagine you want to share a secret with someone without actually revealing the secret. Confused? That’s how I felt the first time I heard about the idea behind the zero-knowledge proof concept. After delving a bit deeper, I’m now convinced that Data Scientists and Machine Learning Engineers alike should care a lot more about it. Here’s a brief overview of what ZKP is and why it matters.

Disclaimer: I’m not a cybersecurity expert; this post is based on research I did as an enthusiastic Data Scientist.

What is zero-knowledge proof?

In a zero-knowledge proof (ZKP) there are two parties: the prover and the verifier. The prover holds a piece of secret data that they don’t want to share with the verifier. A ZKP gives the prover a way to convince the verifier that a statement about that secret is true without revealing any information beyond the fact that the statement is true. Such a proof is probabilistic rather than absolute, so it isn’t quite as certain as revealing the data itself, but its mathematical properties of completeness and soundness make it convincing all the same.

I’m aware that this still doesn’t explain how one could possibly prove something to someone without handing over the underlying information, so let me give a very simplified example of how that is possible, without entering into the technical details and messy waters of the advanced mathematics of cryptography.

Getting behind the fundamental idea of ZKP

Let us suppose I wish to prove to someone that I know the combination to a safe, without revealing that combination to them. In other words, I want to leave them convinced that I do know the combination to the safe, without leaking any information about the combination. How could I go about doing that?

One way to do that is to ask them to write down a secret that only they know, place it in the safe and lock it. If, after being left alone with the safe for a moment, I can tell them what their secret is, the only explanation is that I opened the safe, so I have convinced them that I know the combination without ever revealing it.

In zero-knowledge, the proof leaks no information about the secret data. Beyond being zero-knowledge in that sense, a proof also needs to be (see the toy sketch after this list):

  • Complete: if the statement is true, an honest prover can always convince the verifier.
  • Sound: if the statement is false, no cheating prover can convince the verifier, except with negligible probability.
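
To make the prover-verifier exchange and the two properties above concrete, here is a minimal sketch in Python of a classic interactive proof, a Schnorr-style proof of knowledge of a discrete logarithm. This is not the safe story translated into code, just an illustration of the same idea; the tiny group parameters are an assumption chosen for readability, and a real system would use large, standardised ones.

# Toy sketch: Schnorr-style interactive zero-knowledge proof that the prover
# knows x with y = g^x (mod p), without revealing x. The parameters are tiny
# and purely illustrative; never use them in a real system.
import random

p = 2039   # small safe prime: p = 2*q + 1
q = 1019   # prime order of the subgroup generated by g
g = 4      # generator of the order-q subgroup of Z_p*

x = random.randrange(1, q)   # the prover's secret ("the safe combination")
y = pow(g, x, p)             # public value; the claim: "I know x with g^x = y"

def prover_commit():
    """Prover picks a random nonce r and sends the commitment t = g^r."""
    r = random.randrange(1, q)
    return r, pow(g, r, p)

def prover_respond(r, c):
    """Prover answers the verifier's challenge c using the secret x."""
    return (r + c * x) % q

def verifier_check(t, c, s):
    """Verifier accepts iff g^s == t * y^c (mod p)."""
    return pow(g, s, p) == (t * pow(y, c, p)) % p

# One round: commit -> challenge -> response -> check.
r, t = prover_commit()
c = random.randrange(0, q)        # verifier's unpredictable challenge
s = prover_respond(r, c)
print("Verifier convinced:", verifier_check(t, c, s))   # True (completeness)

# Soundness intuition: without knowing x, answering a random challenge
# correctly is overwhelmingly unlikely. Zero-knowledge intuition: the
# transcript (t, c, s) could be simulated without x, so it leaks nothing.

Repeating such challenge-response rounds (or, as here, drawing the challenge from a large space) makes it practically impossible for a prover who does not know the secret to keep getting lucky, which is exactly the soundness property from the list.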

Why should you, as a Data Scientist (or something similar), care?

First and foremost, zero-knowledge proofs strengthen data privacy and put people in control of their own data. For example, imagine you could prove your financial liquidity to a potential landlord without actually revealing anything about your financial data.

ZKP could allow you to convince your potential landlord that your salary is in a certain (satisfactory) range without revealing the exact number. And data privacy is a very big deal in data science.
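
As a rough illustration of the building block behind such a claim, here is a short Python sketch of a hash commitment: you can lock in your salary figure without revealing it, and a real zero-knowledge range-proof system would then prove that the committed value sits in the agreed range without ever opening the commitment. The figures and names below are made up for illustration, and the range-proof step itself is only indicated in a comment, since it takes far more than a few lines.

# Sketch of the "hide the exact number" building block: a hash commitment.
# A real ZKP range proof would then show "committed salary >= threshold"
# without opening the commitment; that step is only hinted at below.
import hashlib
import secrets

def commit(value):
    """Commit to an integer value; returns (commitment, opening nonce)."""
    nonce = secrets.token_bytes(32)                  # randomness hides the value
    digest = hashlib.sha256(nonce + str(value).encode()).hexdigest()
    return digest, nonce

def open_commitment(commitment, value, nonce):
    """Check that (value, nonce) matches the earlier commitment."""
    return hashlib.sha256(nonce + str(value).encode()).hexdigest() == commitment

salary = 52_000                          # the prover's secret (illustrative figure)
commitment, nonce = commit(salary)

# The landlord only ever sees `commitment`. A zero-knowledge range proof
# would convince them that the committed salary exceeds, say, 40_000
# without revealing `salary` or `nonce`.
print(open_commitment(commitment, salary, nonce))    # True for an honest opening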

Or imagine you have built a machine learning model: you don’t want to share its parameters with other people, yet you want to convince them to use it because it gives valid and reliable results. ZKP can help do that.

Use cases in machine learning

Zero-knowledge proof has the potential to preserve the privacy of a machine learning (ML) model while guaranteeing a certain level of accuracy and reproducibility. For example, the creator of an ML model could convince a potential user that the model achieves some level of accuracy on a data set with certain characteristics without revealing anything about the model itself. The prover’s secret data is the ML model, and the public computation is the inference the model performs on a test dataset or user input. The claim the prover, aka the ML model owner, is making is that the inference of the model on the input dataset is correct…well, almost 100% of the time (it is a probabilistic proof, after all).

The beauty of it all is that, thanks to the soundness property of ZKP, the prover cannot lie about the accuracy of the ML model, and thanks to the zero-knowledge property, the privacy of the ML model is maintained.
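
To make the division of roles concrete, here is a purely schematic Python sketch: the prover's secret is the model, the public computation is inference on a test set, and the claim is an accuracy figure. The zk_prove and zk_verify functions are placeholders I am assuming for illustration, standing in for a real zero-knowledge proof backend; they are not an actual protocol or library API.

# Schematic sketch of the roles in verifiable ML inference. zk_prove and
# zk_verify below are placeholders for a real zero-knowledge proof backend
# (assumed for illustration only); they do not perform any cryptography.

def accuracy(model, test_set):
    """The public computation: run inference and score it."""
    hits = sum(1 for x, label in test_set if model(x) == label)
    return hits / len(test_set)

def zk_prove(secret_model, public_test_set, claimed_accuracy):
    """Placeholder: a real system would emit a proof that claimed_accuracy
    is the true result of running secret_model on the public test set,
    without revealing the model's parameters."""
    return {"claim": claimed_accuracy}               # stand-in "proof" object

def zk_verify(proof, public_test_set):
    """Placeholder: a real verifier checks the proof against public inputs
    only; it never sees the model."""
    return "claim" in proof                          # stand-in check

def secret_model(x):
    """The prover's private model: here just a trivial threshold rule."""
    return int(x > 0.5)

# Prover side: the model stays private; only the claim and proof are shared.
test_set = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 0)]
claim = accuracy(secret_model, test_set)             # 1.0 on this toy set
proof = zk_prove(secret_model, test_set, claim)

# Verifier side: only the test set, the claim and the proof are visible.
print("Claimed accuracy:", claim, "accepted:", zk_verify(proof, test_set))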

In machine learning research, zero-knowledge proof protocols for some categories of machine learning algorithms have already been developed (see an example here or here). Relying on zero-knowledge proof, the provider of a machine learning model, acting as the prover, could offer a security guarantee for the product or service they provide to their users. While the model itself would remain a black box, trust in its performance would increase, addressing the burning question of trustworthiness in machine learning.

Similarly, zero-knowledge proof can be used to build data pipelines to which multiple entities contribute, each curating their own part of the pipeline, while ZKP maintains a high level of confidentiality between the different entities (see an example here).

As zero-knowledge proof gives data owners ways to protect their data, it opens the path to novel data-sharing agreements and business uses in which the data owners wouldn’t necessarily hand over the data itself. For example, a user could share verified claims about their personal data with a service provider, which could in turn offer them a more personalised service while the data itself remained confidential.

The area in which ZKP has seen the most application so far is blockchain, but most experts agree that this is only the beginning. I believe we will see more of ZKP in various areas and industries, including data science, artificial intelligence and aviation, where all of the use cases described above could prove extremely beneficial.

Author: Damir Valput

© datascience.aero