User-level privacy in Tumult Analytics with privacy IDs

Tumult Analytics got a big upgrade: support for privacy IDs, allowing users to build differentially private algorithms that provide user-level privacy guarantees, even when each user may contribute multiple records.

Suppose you work on the data analytics team of a social media company. You want to share insights about user interactions on your company’s new app, while protecting each user’s data. Each user can, of course, contribute many different interactions to your dataset.

What happens if we take the most common approach to differential privacy, and protect each row individually? Think of location data, like we see in the table above: most records from a single user are likely to originate from the same place — where they live. Hiding location information from individual rows will not be enough to protect it: the total influence of a user’s many contributions taken together might still leak this sensitive information in the output!

This simple example shows that whenever a user contributes many times to a dataset, standard row-level privacy guarantees are not enough: information that is personal to users might still be revealed in the output, since their data is not limited to a single row. Instead, we need a stronger notion: user-level privacy guarantees, which protect all the data of any individual user across multiple rows.
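The gap between the two guarantees can be made concrete with a small pure-Python sketch. The toy rows, names, and helper function below are invented for illustration and are not taken from any real dataset. The point is only that removing one row changes a count by at most 1, while removing one user can change it by as many rows as that user contributed.

```python
# Hypothetical toy data: (user_id, city) pairs, one row per interaction.
rows = [
    ("alice", "Durham"), ("alice", "Durham"), ("alice", "Durham"),
    ("alice", "Durham"), ("bob", "Raleigh"), ("carol", "Durham"),
]

def count_city(data, city):
    """Number of rows located in the given city."""
    return sum(1 for _, c in data if c == city)

full = count_city(rows, "Durham")

# Row-level neighbor: the same data with a single row removed.
row_neighbor = rows[1:]
row_change = full - count_city(row_neighbor, "Durham")

# User-level neighbor: the same data with every row of one user removed.
user_neighbor = [r for r in rows if r[0] != "alice"]
user_change = full - count_city(user_neighbor, "Durham")

print(row_change)   # 1
print(user_change)  # 4
```

Noise calibrated to hide a change of 1 would not hide a change of 4, which is why the unit of protection has to be the user rather than the row.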

The requirement of user-level privacy is very common. Past releases of social media data provide a real-world example in which user-level privacy is critical. And the same is true of many other real-world deployments of differential privacy: mobility patterns, search queries, marketing insights, financial information, and so on. More generally, in any deployment of differential privacy, a critical question is always going to be: who, or what, do we want to protect in the data? As we have seen, when we want to protect individuals whose data is distributed across multiple rows, protecting individual rows is not enough.

Tumult Analytics now solves this problem with a big upgrade: privacy IDs, short for privacy identifiers, enable the protection of individuals across large datasets, no matter how many times they appear in the data. Used together with the exclusive features of Tumult Analytics, privacy IDs enable a diverse set of differentially private analyses that no other differential privacy library can support.

Privacy IDs in practice

To protect each user’s data, you need to identify them with a specific attribute, like their user ID, and protect all rows with that attribute. This is exactly why we built the privacy IDs feature: by hiding the presence of all the rows that share the same value in an identifier column, you can share data while guaranteeing user-level privacy protection.

With privacy IDs, we've added a number of features to our API, including new behaviors for query transformations and the concept of constraints. The privacy IDs feature is set up using the new AddRowsWithID protected change, which ensures that all rows with the same value in a specified column are protected.

Initializing an Analytics Session with our user activities table then comes down to passing an AddRowsWithID protected change that names the identifier column, here “User ID”.

With differential privacy, we must hide the maximum impact of the protected change by adding statistical noise. However, when each user can contribute arbitrarily many rows to our data, that maximum impact is unbounded, so we first need a bound on user contributions. This is where the new concept of constraints comes into play: before performing an aggregation, we enforce a constraint on our data.

Our query enforces the MaxRowsPerID constraint, which limits the total number of rows contributed by each privacy ID in the table. In this case, the privacy IDs are the values in the “User ID” column, and the constraint limits each identifier to at most three rows.
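What a rows-per-identifier limit does to the data can be sketched in plain Python. This is not the Tumult Analytics implementation: the function name, the tuple layout, and the choice to keep the first rows encountered are all assumptions made for illustration, and a real library may select which rows to keep differently.

```python
from collections import defaultdict

def truncate_rows_per_id(rows, id_index, max_rows):
    """Keep at most max_rows rows for each identifier value (hypothetical helper)."""
    kept = []
    seen = defaultdict(int)  # number of rows kept so far, per identifier
    for row in rows:
        key = row[id_index]
        if seen[key] < max_rows:
            kept.append(row)
            seen[key] += 1
    return kept

# Hypothetical toy table: (user_id, action) pairs.
activities = [
    ("u1", "like"), ("u1", "share"), ("u1", "like"), ("u1", "comment"),
    ("u2", "like"), ("u3", "share"), ("u3", "like"),
]

truncated = truncate_rows_per_id(activities, id_index=0, max_rows=3)
print(len(truncated))  # 6: the fourth row of u1 is dropped
```

After this step, adding or removing any single user changes the table by at most three rows, which gives the noise a finite quantity to hide.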

With this additional enforced truncation step, we are now able to evaluate the query, here finding the most frequent user actions, in a differentially private manner.
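To show why the bound matters for the aggregation itself, here is a minimal pure-Python sketch of a noisy count per action over rows that have already been limited to three per user. It is not the Tumult Analytics API: the function names, the toy rows, the use of Laplace noise, and the privacy parameter value are assumptions for illustration. The list of possible actions is treated as public, and floating-point noise sampling like this is not suitable for production use.

```python
import random
from collections import Counter

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_action_counts(rows, actions, max_rows_per_id, epsilon, rng):
    """Noisy count per action; rows must hold at most max_rows_per_id rows per user."""
    true_counts = Counter(action for _, action in rows)
    # One user can shift the counts by at most max_rows_per_id in total,
    # so the noise scale grows with that bound.
    scale = max_rows_per_id / epsilon
    return {a: true_counts.get(a, 0) + laplace_noise(scale, rng) for a in actions}

# Hypothetical rows, at most three per user: (user_id, action).
truncated = [
    ("u1", "like"), ("u1", "share"), ("u1", "like"),
    ("u2", "like"), ("u3", "share"), ("u3", "like"),
]
actions = ["like", "share", "comment"]  # assumed to be public knowledge

rng = random.Random(0)
noisy = noisy_action_counts(truncated, actions, max_rows_per_id=3, epsilon=1.0, rng=rng)
most_frequent = max(noisy, key=noisy.get)
print(most_frequent)
```

On a table this small the noise can easily change which action comes out on top; the approach becomes useful when counts are large relative to the bound divided by epsilon. With Tumult Analytics, the truncation and the noise calibration are handled by the library rather than written by hand.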

Using privacy IDs is as simple as that! To get started, check out our new tutorial, Working with privacy IDs.

Privacy IDs and advanced features in Tumult Analytics

With newly added support for privacy IDs, Tumult Analytics is now clearly ahead of all other differential privacy platforms. As we outlined in our whitepaper, Analytics was already the only framework to support joins between multiple tables, advanced privacy accounting, and parallel composition, among other features used by our customers at the U.S. Census Bureau, the Internal Revenue Service, and the Wikimedia Foundation. Privacy IDs work seamlessly with all these features, and they make some of them (like private joins) even easier to use.

For a sneak peek at some of the more advanced uses of privacy IDs, you can take a look at our second tutorial, Doing more with privacy IDs.

Final words

The setting where a single person contributes multiple rows to the data, and where user-level privacy is needed, is extremely common. It occurs with social media use cases, in data collection for ad tech or telemetry, in financial datasets, and so on. In such contexts, privacy IDs are a critical element of sharing sensitive data while providing appropriate privacy protection to the people in the data. We’re proud to ship this central feature to Tumult Analytics, and excited about the safe data sharing or publication use cases that it will unlock.

If you have any questions or comments, don’t hesitate to ping us on Slack!
