In this article, we first explain what it means for differential privacy software to have vulnerabilities. Then, we present a new class of attacks on naive implementations of differential privacy: precision-based attacks. These attacks expose vulnerabilities in multiple open-source libraries, including diffprivlib, SmartNoise Core, and OpenDP.
At Tumult Labs, we are building a hardened, production-ready platform that makes it easy to use and deploy differential privacy (DP). Differential privacy is conceptually simple. Suppose a differentially private mechanism generates a given output from a dataset.
And now, suppose we add or remove one record from this dataset, and re-run the same mechanism. With DP, we may get the same output as before, with similar probability.
When designing DP algorithms, one must:
- prove mathematically that it satisfies this guarantee,
- and implement it so it follows the proof.
This last step is often overlooked in the scientific literature. But in practice, it can be far from straightforward! Sometimes, implementations have vulnerabilities.
What does “vulnerability” mean in this context? This is when the promise of differential privacy doesn’t hold, even though it should. For example, a specific output can be obtained from the first database, but not the second. This could allow someone to detect whether someone’s information was in the data. An adversary could even find out the data associated with a particular individual! Even though that’s exactly what differential privacy should prevent.
At Tumult, we put in place several measures to increase the robustness of our software. One of them is adversarial research: we audit our code and other similar software projects to discover vulnerabilities. Before we tell you about what we found, let's start with some historical context.
The first floating-point vulnerability
In 2012, a researcher, Ilya Mironov, found vulnerabilities in naive implementations of the Laplace mechanism. This important DP primitive is about noise addition: it samples a random number from the Laplace distribution and adds it to the true value of an aggregation. In the original mechanism, the proof works with real numbers, with infinite precision. You can't write down exactly a number sampled from the Laplace distribution: it has infinitely many digits, so you wouldn’t ever be finished.
In practice, of course, computers don’t have infinite time and memory. So they approximate continuous numbers, and round the results of arithmetic operations. And this difference between theory and the implementation can create vulnerabilities.
Computers approximate real numbers using a floating-point representation. In principle, you can plot all possible 64-bit floating-point numbers on a line. Here is a much-simplified illustration of what this can look like.
Each point on this line corresponds to one possible floating-point number. Of course, this is simplified: when using 64-bit floats, there are way more than 8 numbers strictly in the interval [2, 4)! But the schema still demonstrates important properties of floating-point numbers:
- There are a finite number of them.
- Their precision changes depending on how large they are: there are the same number of floats in the interval [0.5, 1) as in [1, 2), or [2, 4), etc.
When using floating-point naively, continuous distributions can end up having holes: floating-point that can never be sampled. Say we sample a number from the Laplace distribution, and add it to the number 3. The set of possible outputs might look like this.
As you can see, some outputs will never be returned, regardless of the value of the randomness. The distribution has holes. Now, suppose we do the same operation, except we add noise to the number 4.
Note that the possible outputs are different: they depend on the original number we added noise to. This means that some outputs are distinguishing: we can only observe them with a specific input.
In this fictional example, we can get output 3.75 from input 3, but not from input 4. Conversely, we can get output 5.5 from input 4, but not input 3. This is a differential privacy vulnerability: an attacker looking at the output can distinguish between two possible input values.
Mironov found that naive implementations of the Laplace mechanism had this exact problem. He proposed a mitigation, and proved that it fixed the problem. Mironov's approach had some cost in utility, and only applied to the Laplace distribution, so some alternatives were later proposed. These alternatives were implemented in open-source projects… which leads us to our novel class of attacks, exposing vulnerabilities in naive implementations of differential privacy: precision-based attacks.
A new class of vulnerabilities: precision-based attacks
As illustrated above, each floating-point number has a finite precision. This precision depends on how large the number is. For example, for 64-bit floating-point numbers, the number 1 has a precision of 2-52: the next floating-point number is 1+2-52. Let’s zoom in.
This means that if you add 1+1.8*2-54 using 64-bit floating-point numbers, you will get exactly the same number as 1.
Bigger floating-point numbers have a coarser precision. The schema above shows that all those in the interval [1,2) are multiples of 2-52. Similarly, the ones in [0.5,1) are all multiples of 2-53, the ones in [2,4) multiples of 2-51, and so on.
Summing floating-point numbers is where it gets interesting. Suppose we sum two floating-point numbers x+y, where x is in an interval with precision 2k. Then the result is always a multiple of 2(k-1). And this happens regardless of where it ends up! Even if the sum is very close to 0, where the precision is much higher, this is still going to be true.
To see why, let’s pick some specific value, e.g. x=1.25. Its precision is 2-52, and we’ll show that no matter what number y we choose, x+y will be a multiple of 2-53. We need to look at two cases, depending on the value of y.
First, y might be lower than -0.5 or larger than 0.5.
This case is easy: y is a multiple of 2-53, and x is as well (it’s even a multiple of 2-52). So x+y is also a multiple of 2-53.
Second, y might be in [-0.5, 0.5].
In that case, the result is in a predictable zone around 1.25: x+y is at least 0.75 and at most 1.75. But in this interval, all possible floating-point numbers are multiples of 2-53. It doesn’t matter if y isn’t a multiple of 2-53: the result of the sum will be rounded anyway.
So in both cases, x+y is a multiple of 2-53. It is easy to extend this example: we can show that if x is in an interval with precision 2k, then x+y will always be a multiple of 2(k-1).
Now, why is this fact interesting? It means that when we add noise to x=1, we know for sure that the result will be a multiple of 2-53. But if we add noise to x=0, this might not be true, depending on the noise distribution. If the noise was supposed to hide the true value of the input, that’s a differential privacy violation. If the attacker observes an output that is not a multiple of 2-53... they can distinguish with 100% certainty between the two possible hypotheses x=0 vs. x=1.
Three vulnerable open-source libraries
will print something like:
This happens with all floating-point noise addition mechanisms except for Mironov’s approach. The outputs that are not multiples of 2^-53 are distinguishing events and break DP's guarantee. The following schema illustrates the vulnerability.
Interestingly, these projects had taken active steps to mitigate floating-point vulnerabilities.
- diffprivlib uses a custom approach to noise generation. In particular, this approach relies on combining samples from several distributions.
- SmartNoise Core uses a multiple-precision floating-point library to generate a hole-free noise distribution.
- OpenDP takes the same approach as SmartNoise Core. It also changes the order of some operations, and rounds in conservative directions.
Note that in SmartNoise Core and OpenDP, users have to enable a special flag to use floating-point features; the documentation warns that this option might be dangerous.
This vulnerability shows that implementing these privacy-critical primitives correctly is difficult. It suggests that implementations should implement them with great care, and prove that the implementation provides the desired guarantee. Ad hoc mitigations are not enough to reach a high level of robustness.
At Tumult Labs, we are building a differential privacy platform; this software is trusted to protect extremely sensitive data, and is currently deployed in production at large institutions. This motivates a strong concern for correctness: we are committed to researching, identifying, and fixing vulnerabilities in our software. Stay tuned for more updates on our work in this area!
Precision-based attacks expose vulnerabilities in three open-source projects.
- In diffprivlib, the Laplace mechanism, truncated Laplace mechanism, folded Laplace mechanism, bounded-domain Laplace mechanism, bounded-noise Laplace mechanism, Gaussian mechanism, analytic Gaussian mechanism, and staircase mechanism are affected.
- In SmartNoise Core, the Laplace mechanism, Gaussian mechanism, and truncated Gaussian mechanism are affected.
- In OpenDP, the Laplace mechanism, Gaussian mechanism, and analytic Gaussian mechanism are affected.
2021-11-01: diffprivlib issue sent to the team (via private email)
2021-11-08: diffprivlib issue acknowledged
2021-12-17: SmartNoise Core & OpenDP issue communicated to the team & acknowledged