Suppose you work on the data analytics team of a social media company. You want to share insights about user interactions on your company’s new app, while protecting each user’s data. Each user can, of course, contribute many different interactions to your dataset.
What happens if we take the most common approach to differential privacy, and protect each row individually? Think of location data, like we see in the table above: most records from a single user are likely to originate from the same place — where they live. Hiding location information from individual rows will not be enough to protect it: the total influence of a user’s many contributions taken together might still leak this sensitive information in the output!
This simple example shows that whenever a user contributes many times to a dataset, standard row-level privacy guarantees are not enough: information that is personal to users might still be revealed in the output, since their data is not limited to a single row. Instead, we need a stronger notion: user-level privacy guarantees, which protect all the data of any individual user across multiple rows.
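To make the gap between the two guarantees concrete, here is a minimal sketch in plain Python. It does not use Tumult Analytics; the toy rows, user names, and the `count_by_city` helper are made up for illustration. It compares how much a per-city count changes when one row is removed versus when all rows of one user are removed.

```python
from collections import Counter

# Hypothetical toy dataset: one (user_id, city) pair per interaction.
rows = [
    ("alice", "Paris"), ("alice", "Paris"), ("alice", "Paris"),
    ("alice", "Paris"), ("bob", "Lyon"), ("carol", "Paris"),
]


def count_by_city(data):
    """Count interactions per city."""
    return Counter(city for _, city in data)


full = count_by_city(rows)

# Row-level neighbor: the same data with a single row removed.
without_one_row = count_by_city(rows[1:])

# User-level neighbor: the same data with every row from "alice" removed.
without_alice = count_by_city([r for r in rows if r[0] != "alice"])

row_level_change = full["Paris"] - without_one_row["Paris"]
user_level_change = full["Paris"] - without_alice["Paris"]
print(row_level_change, user_level_change)
```

Noise calibrated to hide the row-level change is too small to hide the user-level change, which grows with the number of rows the user contributed.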
The requirement of user-level privacy is very common. Past releases of social media data provide a real-world example in which user-level privacy is critical. And the same is true of many other real-world deployments of differential privacy: mobility patterns, search queries, marketing insights, financial information, and so on. More generally, in any deployment of differential privacy, a critical question is always going to be: who, or what, do we want to protect in the data? As we have seen, when we want to protect individuals whose data is distributed across multiple rows, protecting individual rows is not enough.
Tumult Analytics now solves this problem with a big upgrade: privacy IDs. Short for privacy identifiers, privacy IDs protect individuals across large datasets, no matter how many times they appear in the data. Combined with the other features of Tumult Analytics, they enable a diverse set of differentially private analyses that no other differential privacy library can support.
Privacy IDs in practice
To protect each user’s data, you need to identify them with a specific attribute, like their user ID, and protect all rows with that attribute. This is exactly why we built the privacy IDs feature: by hiding the presence of all the rows that share the same value in an identifier column, you can share data while guaranteeing user-level privacy protection.
With privacy IDs, we've added a number of features to our API, including new behaviors for query transformations and the concept of constraints. The privacy IDs feature is set up using the new AddRowsWithID protected change, which ensures that all rows with the same value in a specified column are protected.
Initializing an Analytics Session with our user activities table might look like this:
With differential privacy, we must hide the maximum impact of the protected change by adding statistical noise. However, when each user can contribute arbitrarily many rows to our data, that maximum impact is unbounded, and no finite amount of noise can hide it. We therefore need a bound on user contributions. This is where the new concept of constraints comes into play: before performing an aggregation, we enforce a constraint on our data:
The above query enforces the MaxRowsPerID constraint, which limits the total number of rows contributed by each privacy ID in the table. In this case, the privacy IDs are the values in the “User ID” column, and the constraint keeps at most three rows per identifier.
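The truncation that such a constraint performs can be sketched in plain Python. This is only an illustration of the idea, not the Tumult Analytics implementation: the `truncate_rows_per_id` helper and the sample rows are hypothetical, and keeping the first rows in input order is a simplifying assumption (a real library may select which rows to keep differently).

```python
from collections import defaultdict


def truncate_rows_per_id(rows, id_index, max_rows):
    """Keep at most `max_rows` rows for each distinct value of the ID column."""
    kept = []
    seen = defaultdict(int)  # number of rows already kept per ID
    for row in rows:
        key = row[id_index]
        if seen[key] < max_rows:
            kept.append(row)
            seen[key] += 1
    return kept


# Hypothetical (user_id, action) rows: user "u1" contributes five rows.
activity = [
    ("u1", "like"), ("u1", "share"), ("u1", "like"),
    ("u1", "comment"), ("u1", "like"), ("u2", "like"),
]

truncated = truncate_rows_per_id(activity, id_index=0, max_rows=3)
rows_for_u1 = sum(1 for user, _ in truncated if user == "u1")
print(len(truncated), rows_for_u1)
```

After this step, no single identifier accounts for more than three rows, so adding or removing one user changes any count over the truncated data by at most three.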
With this additional enforced truncation step, we are now able to evaluate the query, here finding the most frequent user actions, in a differentially private manner:
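Under the hood, the idea can be sketched in plain Python. This is a textbook Laplace mechanism under stated assumptions, not Tumult Analytics code or its actual noise implementation: the `laplace_noise` and `noisy_counts` helpers are hypothetical, the list of possible actions is assumed to be fixed in advance, the input is assumed to be already truncated to three rows per user, and floating-point subtleties that production libraries handle are ignored.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_counts(values, groups, max_rows_per_id, epsilon, rng):
    """Return a noisy count for each group.

    Assumes `values` was already truncated so that each user contributes
    at most `max_rows_per_id` entries. One user can then shift the counts
    by at most `max_rows_per_id` in total, which sets the noise scale.
    """
    counts = Counter(values)
    scale = max_rows_per_id / epsilon
    return {g: counts[g] + laplace_noise(scale, rng) for g in groups}


# Hypothetical actions left after truncation, and a fixed list of groups.
actions = ["like", "share", "like", "comment", "like", "like"]
groups = ["like", "share", "comment"]

result = noisy_counts(
    actions, groups, max_rows_per_id=3, epsilon=1.0, rng=random.Random(0)
)
most_frequent = max(result, key=result.get)
print(result, most_frequent)
```

A smaller row bound means less noise per count but discards more data from heavy users, so the choice of bound is a trade-off between the two.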
Using privacy IDs is as simple as that! To get started, check out our new tutorial, Working with privacy IDs.
Privacy IDs and advanced features in Tumult Analytics
With newly added support for privacy IDs, Tumult Analytics is now clearly ahead of all other differential privacy platforms. As we outlined in our whitepaper, Analytics was already the only framework to support joins between multiple tables, advanced privacy accounting, and parallel composition, among other features used by our customers at the U.S. Census Bureau, the Internal Revenue Service, and the Wikimedia Foundation. Privacy IDs work seamlessly with all these features, and they make some of them (like private joins) even easier to use.
For a sneak peek at some of the more advanced uses of privacy IDs, you can take a look at our second tutorial, Doing more with privacy IDs.
The setting where a single person contributes multiple rows to the data, and where user-level privacy is needed, is extremely common. It occurs with social media use cases, in data collection for ad tech or telemetry, in financial datasets, and so on. In such contexts, privacy IDs are a critical element of sharing sensitive data while providing appropriate privacy protection to the people in the data. We’re proud to ship this central feature in Tumult Analytics, and excited about the safe data sharing and publication use cases that it will unlock.
If you have any questions or comments, don’t hesitate to ping us on Slack!