For the U.S. census, keeping your data anonymous and useful is a tricky balance
As the country waits for more results from last year’s national head count, the U.S. Census Bureau is facing an increasingly tricky balancing act.
How will the largest public data source in the United States continue to protect people’s privacy while also sharing the detailed demographic information used for redrawing voting districts, guiding federal funding, and informing policymaking and research for the next decade?
Concerns have been brewing among census watchers about how the bureau will strike that balance, beginning with the redistricting data it’s on track to put out by mid-August.
That release is expected to be the first set of 2020 census statistics to come with controversial new safeguards that bureau officials say are needed to keep people anonymous in publicly available data and prevent the exploitation of their personal information. But based on early tests, many data users are alarmed that the new privacy protections could render some of the new census statistics useless.
The state of Alabama has filed a federal lawsuit to try to block the bureau from putting these new protections in place. The case is currently before a three-judge court that is expected to rule soon on a request for an emergency court order. Whichever way it goes, the case is likely to reach the U.S. Supreme Court. The legal challenge could ultimately derail the bureau’s schedule for releasing the data many state and local redistricting officials need to prepare for upcoming elections.
Here’s what else you need to know:
Why does the Census Bureau have to protect people’s privacy?
Under current law, the federal government is not allowed to release personally identifiable information from the census until 72 years after it's gathered for the constitutionally mandated tally. The bureau has relied on that promise of confidentiality to get many of the country's residents to volunteer their information once a decade, especially people of color, immigrants and other members of historically undercounted groups who may be unsure about how their responses could be used against them.
But it is becoming harder for the bureau to uphold that pledge and continue releasing statistics from the census. Advances in computing and access to voter registration lists and commercial data sets that can be cross-referenced have made it easier to trace purportedly anonymized information back to an individual person.
In search of a way out of this conundrum, the bureau has been building a new privacy protection system based on a mathematical concept known as differential privacy. Invented at Microsoft's research arm, differential privacy has served as a framework for privacy measures in smaller Census Bureau projects, as well as at some tech companies.
“Differential privacy is in every iPhone and every iPad,” says Cynthia Dwork, a computer scientist at Microsoft Research and Harvard University who co-invented differential privacy. “That may have a larger scale than the number of respondents to the U.S. decennial census, but there’s a totality and commitment to privacy that’s different here” with the bureau’s plans for 2020 census data, Dwork adds.
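At its core, differential privacy works by adding carefully calibrated random noise to each published statistic, with a tunable "privacy-loss budget" (often written as epsilon) that controls how much any one person's data can influence the output. The sketch below illustrates one common mechanism, the Laplace mechanism, applied to a simple count; it is not the bureau's actual algorithm, and the numbers and function names are invented for illustration.

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a simple count by at most 1
    (its 'sensitivity'), so Laplace noise with scale sensitivity/epsilon
    is enough to satisfy epsilon-differential privacy for that count.
    A smaller epsilon means more noise and stronger privacy.
    """
    rng = np.random.default_rng() if rng is None else rng
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: a census block with 37 residents.
rng = np.random.default_rng(0)
print(laplace_count(37, epsilon=1.0, rng=rng))  # noise scale 1: usually lands close to 37
print(laplace_count(37, epsilon=0.1, rng=rng))  # noise scale 10: can land far from 37
```

In the bureau's system, a total privacy-loss budget is divided across many tables and geographic levels, and the noisy results are then adjusted into consistent, publishable counts; the sketch above shows only the core idea of calibrated noise.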
How has the bureau protected people’s privacy in past census data?
For decades, the bureau has stripped away names and addresses from census records before turning them into anonymized data. That information is broken down by race, ethnicity, age and sex to levels as detailed as a neighborhood.
But even in a sea of statistics, certain households — particularly those in the minority of a community — can stick out because they live in isolated areas or have other distinctive characteristics that could make it easier to reveal who they are.
The agency has layered on other privacy protections over the years. It has withheld some data tables, and sometimes particular cells within tables, from the public. It has also added "noise" — small distortions that fuzz the census results — to certain tables before releasing them. And beginning with data from the 1990 count, it has used a technique called "swapping" to exchange data about certain households with data from households in different neighborhoods.
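As a rough illustration of how swapping works (the bureau's actual matching rules and swap rates are confidential, and the field names below are invented), the idea is to find households in different areas that look alike on key characteristics and exchange their geographic identifiers before the tables are built:

```python
from dataclasses import dataclass

@dataclass
class Household:
    block: str    # census block identifier
    size: int     # number of residents
    tenure: str   # other (anonymized) characteristics

def swap_pair(a: Household, b: Household) -> None:
    """Exchange geography between two households of the same size,
    leaving block-level population totals unchanged while making
    distinctive records harder to trace to a real location."""
    assert a.size == b.size and a.block != b.block
    a.block, b.block = b.block, a.block

h1 = Household(block="0101", size=4, tenure="owner")
h2 = Household(block="0202", size=4, tenure="renter")
swap_pair(h1, h2)
print(h1.block, h2.block)  # prints: 0202 0101
```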
What prompted the bureau to choose differential privacy to protect 2020 census data?
In 2016, researchers at the bureau began conducting internal experiments to test the strength of the privacy protections used for 2010 census data, and based on the results, agency officials concluded they could no longer rely on data swapping.
Using a fraction of the census data the bureau released a decade ago, the researchers were able to reconstruct a complete set of records for every person included in the 2010 census numbers. Then, after cross-referencing that reconstructed data with records bought from commercial databases, they were able to re-identify 52 million people by name, according to a court filing by John Abowd, the bureau's chief scientist. In a worst-case scenario, the bureau's researchers estimated, attackers with access to more commercial data could unmask the identities of as many as 179 million people, or 58% of the population included in the 2010 census.
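The re-identification step is essentially a database join: once individual records have been reconstructed from the published tables, an attacker matches them against an outside file on quasi-identifiers such as census block, age and sex. The toy example below, with invented records and pandas as an assumed tool, shows only that linkage step; the bureau's internal experiment was far larger and more sophisticated.

```python
import pandas as pd

# Hypothetical records (no names) reconstructed from published census tables.
reconstructed = pd.DataFrame([
    {"block": "0101", "age": 34, "sex": "F", "race": "Asian"},
    {"block": "0101", "age": 71, "sex": "M", "race": "White"},
])

# Hypothetical commercial file with names attached to the same quasi-identifiers.
commercial = pd.DataFrame([
    {"name": "J. Doe", "block": "0101", "age": 34, "sex": "F"},
])

# Joining on block, age and sex re-attaches a name to the "anonymous" record
# and reveals a characteristic (race) the commercial file never contained.
reidentified = reconstructed.merge(commercial, on=["block", "age", "sex"])
print(reidentified)
```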
To better protect people's privacy for the 2020 census, the bureau announced plans in 2017 to build a new system based on differential privacy. Officials say it allows them to add the least amount of noise needed to preserve privacy in most of the released data while balancing confidentiality and usability.
“Obviously, you know, it’s not the easiest thing to do,” the bureau’s acting director, Ron Jarmin, said at the Population Association of America’s annual meeting in May, adding that the bureau decided against data swapping and withholding certain tables as alternative safeguards. “To achieve a similar level of privacy protection with those sort of traditional methods, I think, would have produced a product that was even … less useful for data users than what we’re contemplating right now.”
How will differential privacy affect 2020 census data?
The bureau says it added no privacy-protecting noise to the new state population numbers, including those used to reallocate congressional seats and Electoral College votes, as well as the numbers for Washington, D.C., and Puerto Rico. The bureau is also planning to release the total number of housing units in each census block, as well as the number of prisons, college dorms and other group-living quarters in each block, without privacy protections.
But it remains unclear exactly how the bureau’s differential privacy plans will affect other new redistricting data that is expected out by Aug. 16, including population numbers and demographic details about counties, cities and other smaller areas.
It depends on the amount of noise the bureau chooses to add and how it tries to smooth out the effects of adding noise. Last week, bureau officials announced their final privacy settings for the new redistricting data, which the bureau said will “lead to lower noise infusion” than the amount of noise that was added to test files the bureau released in April to demonstrate the potential effects of its privacy plans. The revised algorithm “ensures the accuracy of data necessary for redistricting and Voting Rights Act enforcement,” the bureau said in its announcement.
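"Smoothing out" the effects of the noise refers to post-processing: raw noisy counts can be negative or inconsistent, so the bureau's system adjusts them into non-negative whole numbers that add up to the totals it publishes without noise. The naive sketch below, with invented numbers, conveys only the general idea; the bureau's actual system solves a much larger constrained-optimization problem across geographic levels.

```python
import numpy as np

def postprocess(noisy_counts, fixed_total):
    """Turn raw noisy counts into non-negative integers that sum to a
    total published without noise (an 'invariant'): clip negatives,
    rescale, round, then distribute any leftover units."""
    counts = np.clip(np.asarray(noisy_counts, dtype=float), 0, None)
    if counts.sum() > 0:
        counts = counts * fixed_total / counts.sum()
    rounded = np.floor(counts).astype(int)
    # Hand out the remaining units to the counts with the largest
    # fractional parts so the invariant total is hit exactly.
    remainder = fixed_total - rounded.sum()
    for i in np.argsort(counts - rounded)[::-1][:remainder]:
        rounded[i] += 1
    return rounded

print(postprocess([12.7, -1.3, 40.9], fixed_total=50))  # yields [12, 0, 38], which sums to 50
```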
Even so, data users say they still need test files produced with the finalized settings in order to gauge the usability of the new redistricting data and to compare the effects of the bureau's final plans with those of the simulated settings used for earlier test files.
“The Census Bureau recognizes that data users would benefit from access to the final set of [Privacy-Protected Microdata Files] for the redistricting data as soon as possible, and we are committed to releasing this information by September 30,” the bureau said Monday in a statement to NPR. “Given the delay in delivery of the redistricting data, however, we must prioritize production of the redistricting data so that states can commence already-delayed redistricting work.”
Separate privacy protection decisions for other 2020 data sets are expected later, after the bureau gathers more public feedback. In May, the bureau's committee of outside scientific advisers recommended delaying the release of those data sets until after the redistricting data is out, to allow more time for testing the new privacy protections.
Why have the bureau’s differential privacy plans been controversial?
Preliminary tests of the bureau's new privacy protections have left many data users worried that their ability to use 2020 census statistics could be severely limited, particularly for the data about small geographic areas and minority groups within communities that many governments rely on for planning.
Before announcing the finalized settings for the redistricting data, however, bureau officials stressed that their differential privacy plans were a work in progress as they gathered feedback from the public.
Alabama filed a federal lawsuit in March and is trying to block the bureau from using differential privacy, which the state claims will make the data unusable for redrawing voting maps. Sixteen states, most of which also have Republican-controlled legislatures, are supporting Alabama’s claims in an amicus brief.
And more lawsuits over differential privacy may be coming later, including from civil rights groups that have been monitoring the bureau’s test data to see whether the new protections make it harder to ensure fair representation of people of color during redistricting.
“At this point, it seems not at all clear that anything the bureau releases will eliminate the possibility that the Voting Rights Act and its enforcement could be adversely affected by differential privacy,” Thomas Saenz, the president and general counsel of the Mexican American Legal Defense and Educational Fund who also serves on the bureau’s National Advisory Committee on Racial, Ethnic, and Other Populations, told NPR before the bureau announced its finalized privacy settings for new redistricting data.
During a public meeting in May, that committee’s chairperson — James Tucker, a voting rights attorney for the Native American Rights Fund — noted that many civil rights organizations “believe that the privacy concerns raised by the bureau are overstated.”
“As bureau experts acknowledge, commercially available search engines that we use every day allow even the most unsophisticated data user to readily identify personally identifiable information about nearly any person they want,” Tucker said, adding that there are “widespread concerns that the Census Bureau may be rushing in its embrace of differential privacy.”
Bureau officials have previously countered similar arguments by noting that census data provides information that is often not well represented in commercial databases, including data about children, people’s self-reported racial and ethnic identities, and same-sex couples who live in the same household.
“Given that the decennial census is a mandatory survey with universal coverage, the Census Bureau believes that it has both a legal and an ethical responsibility to use the strongest privacy protection technology available,” the bureau said last year in a written response to committees of state demographers and data centers.
What happens if the courts block the bureau from using differential privacy?
The release of 2020 census redistricting data — which is already late because of the coronavirus pandemic and the Trump administration’s interference with the census schedule — could be further delayed by “multiple months” past August, Abowd, the bureau’s chief scientist, has warned in a court filing.
“This delay is unavoidable because the Census Bureau would need to develop and test new systems and software,” Abowd added, later estimating that the work could last for at least six to seven months.
Editor’s note: Apple and Microsoft are among NPR’s financial supporters.