Social web

Revealing Wikipedia usage data while protecting privacy


Wikipedia’s volunteer editors want a systematic way to prioritize their work. Which entries are being read most, and where in the world are those readers?
Differential privacy (DP) was the technology that reconciled the twin, and potentially contradictory, goals of privacy preservation and actionable insight.


summary

Empowering Wikipedia editors to decide what to edit next

Regularly updated usage data is vital for Wikipedia editors seeking to make data-driven decisions about which articles to edit next.
Daily usage data, broken down by subject and geography, is also used by researchers studying internet behavior and the information landscape.


Open data is part of the Wikimedia Foundation’s ethos. However, transparency, in particular about the location and behavior of readers, can put individuals’ privacy at risk. DP was the technology chosen to surface insight while preserving user privacy.

key outcomes

Increase in the number of data points released per day
Increase in the number of page views released per day
Spurious rate (released counts with no real traffic) held below a fraction of a percent
Drop rate (real counts suppressed from the release) held below a fraction of a percent

goals

Publish more data

When data benefits various stakeholders, making more of it available adds value to numerous communities.

Publish data more frequently

Sharing data more often can enhance its utility.

Meet the Open Data mandate, safely

Ensure data is shared widely while maintaining strict standards of privacy and security.

our process

A collaborative, calibrated process that ensures utility while maintaining privacy

Proven with industry leaders in the private and public sectors, our process delivers on your specific goals.

Define the problem
Write a problem statement that covers the rationale for the data release, the release plan, the privacy unit, candidate error metrics, and a pseudocode first draft of the algorithm to be used.
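To make this concrete, here is a minimal sketch of what such a first draft might capture for a pageview release. Everything in it is illustrative: the epsilon, sensitivity, and threshold values are placeholders, not Wikimedia’s actual choices, and the function name is ours.

```python
import numpy as np

rng = np.random.default_rng()

def dp_pageview_counts(counts, epsilon=1.0, sensitivity=1.0, threshold=20.0):
    """First-draft sketch of a DP pageview release: add Laplace noise to each
    per-(country, page) daily count, then suppress noisy counts that fall
    below an output threshold.

    `counts` maps each key in a public keyset (possibly including keys with
    zero traffic) to its true daily pageview count. With sensitivity 1, the
    privacy unit is a single pageview; protecting "one reader per day" would
    additionally require bounding each reader's contribution.
    """
    released = {}
    for key, true_count in counts.items():
        noisy = true_count + rng.laplace(scale=sensitivity / epsilon)
        if noisy >= threshold:
            released[key] = noisy
    return released
```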
Confirm the viability of using DP
Using default hyperparameters, check whether a differentially private aggregation of the data is feasible at all.
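A viability check of this kind might look like the following sketch, written against the tutorial-style API of the open-source Tumult Analytics library. Import paths and the `protected_change` argument have shifted across versions, and the file name, column names, and budget here are assumptions for illustration.

```python
from pyspark.sql import SparkSession
from tmlt.analytics.keyset import KeySet
from tmlt.analytics.privacy_budget import PureDPBudget
from tmlt.analytics.protected_change import AddOneRow
from tmlt.analytics.query_builder import QueryBuilder
from tmlt.analytics.session import Session

spark = SparkSession.builder.getOrCreate()
pageviews = spark.read.csv("pageviews.csv", header=True, inferSchema=True)

# A Session tracks the total privacy budget spent on the protected data.
session = Session.from_dataframe(
    privacy_budget=PureDPBudget(epsilon=1.0),  # placeholder budget
    source_id="pageviews",
    dataframe=pageviews,
    protected_change=AddOneRow(),  # one row = one pageview
)

# The keyset (the groups to publish) is public, so it must not depend on the data.
keys = KeySet.from_dict({
    "country": ["US", "FR", "IS"],
    "page": ["Paris", "Louvre", "Reykjavik"],
})

# Default-style first attempt: a noisy groupby-count over the keyset.
counts = session.evaluate(
    QueryBuilder("pageviews").groupby(keys).count(),
    PureDPBudget(epsilon=1.0),
)
counts.show()
```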
Decide on error metrics to optimize for
Create a set of internal error metrics against which to evaluate the data release. Every differentially private dataset has some noise added; if the noise needed to provide the privacy guarantee is too great, the dataset may no longer be useful.
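The spurious and drop rates listed under key outcomes are two such metrics. Here is a sketch of how they might be computed against ground truth during tuning; the definitions are our paraphrase, not necessarily Wikimedia’s exact ones, and the function assumes the true counts enumerate the full public keyset, including zero-traffic keys.

```python
def release_error_metrics(true_counts, released_counts):
    """Compare a DP release against ground truth.

    Drop rate: fraction of groups with real traffic that are missing from
    the release (suppressed by noise plus thresholding).
    Spurious rate: fraction of released groups that had no real traffic.
    """
    real = {key for key, count in true_counts.items() if count > 0}
    released = set(released_counts)
    drop_rate = len(real - released) / max(len(real), 1)
    spurious_rate = len(released - real) / max(len(released), 1)
    return drop_rate, spurious_rate
```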
Experiment with a wide variety of hyperparameters
Conduct a grid search over the hyperparameters (output threshold, noise type and scale, keyset, etc.) until a combination that optimizes the error metrics above is found.
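Reusing the sketch functions above, such a search might look like this. The grids and the error targets (drop rate below 1%, spurious rate below 0.1%) are hypothetical values chosen for illustration.

```python
import itertools

def grid_search(true_counts, epsilons=(0.5, 1.0, 2.0), thresholds=(10, 20, 50, 100)):
    """Return the lowest-epsilon (epsilon, threshold) pair that meets the
    illustrative error targets, averaging the noise out over trial runs."""
    best = None
    for epsilon, threshold in itertools.product(epsilons, thresholds):
        drops, spurious = [], []
        for _ in range(20):  # several trials, since each release is random
            released = dp_pageview_counts(
                true_counts, epsilon=epsilon, threshold=threshold
            )
            d, s = release_error_metrics(true_counts, released)
            drops.append(d)
            spurious.append(s)
        mean_drop = sum(drops) / len(drops)
        mean_spurious = sum(spurious) / len(spurious)
        if mean_drop < 0.01 and mean_spurious < 0.001:
            if best is None or epsilon < best[0]:
                best = (epsilon, threshold, mean_drop, mean_spurious)
    return best
```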
Productionize the pipeline
Turn the finalized aggregation into a production script, integrate error calculation and privacy-loss logging, and automate the job to run on a regular schedule.
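A hypothetical daily job tying the pieces together might look like the sketch below; the ledger file name and record fields are our invention, and in production a scheduler such as cron or Airflow would trigger the run.

```python
import datetime
import json
import pathlib

LEDGER = pathlib.Path("privacy_loss_ledger.jsonl")

def run_daily_release(true_counts, epsilon=1.0, threshold=20.0):
    """Run the tuned aggregation, compute error metrics, and append the
    day's privacy loss to an append-only ledger for auditing."""
    released = dp_pageview_counts(true_counts, epsilon=epsilon, threshold=threshold)
    drop_rate, spurious_rate = release_error_metrics(true_counts, released)
    record = {
        "date": datetime.date.today().isoformat(),
        "epsilon": epsilon,  # privacy loss consumed by this release
        "threshold": threshold,
        "drop_rate": drop_rate,
        "spurious_rate": spurious_rate,
    }
    with LEDGER.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return released
```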

the results

“With Tumult Labs' open source software and expertise in technical implementation, the Wikimedia Foundation team is now able to release more granular, equitable, and safe data about how readers are using our platforms.”
- Hal Triedman, Senior Privacy Engineer, Wikimedia Foundation

How do Wikipedia editors decide which parts of the online encyclopedia to improve? As in most modern organizations, an essential part of the decision-making process relies on data. Wikipedia’s editor community seeks to understand which parts of the site are most engaged with, and by whom. This information helps them prioritize where to add content and make other improvements. But it can also put user privacy at risk.


The Wikimedia Foundation sought Tumult’s help in applying DP in a way that would preserve privacy while still delivering actionable insights.

More data released

Wikimedia is better able to meet its mission around Open Data

More frequent data releases

Editors gain clearer insight into where to focus their attention

Safer data releases

Data serves researchers without risking privacy

More resources


Illuminating college outcomes, while protecting privacy

Case Study

Joining sensitive datasets from the Department of Education and the IRS in a way that protected privacy resulted in College Scorecard, a platform that lets students and families weigh both the cost and the documented outcomes of a range of possible degrees.


Evaluating the usability of differential privacy tools with data practitioners

Research

Differential privacy (DP) has become the gold standard in privacy-preserving data analytics, but implementing it in real-world datasets and systems remains challenging.


Tumult Analytics: a robust, easy-to-use, scalable, and expressive framework for differential privacy

White paper

Tumult Analytics is an open-source framework for releasing aggregate information from sensitive datasets with differential privacy (DP).


Unleash the power and value of your data.