Booz Allen: Privacy Insights

Where Algorithms Meet Accountability

VELOCITY V3. 2025 | Max Wragan, Edward Raff, and Sean Guillory

Data-level protections in the increasing fight for privacy

Privacy has long been considered a fundamental right within democratic societies, but data privacy is at risk of becoming an illusion. New technologies and the proliferation of data are making it difficult for people to keep their personal information private. In October 2024, Forbes reported that two Harvard students had used a pair of smart glasses, AI, and data from online sources to identify people they’d never previously met and locate their personal information. The authors of a study published in the journal Patterns estimated it was possible to uniquely identify 93% of people in a dataset of 60 million people using only four pieces of auxiliary data.

Protecting privacy, which includes maintaining data privacy, is an issue of public trust. When people don’t trust enterprises and institutions to value their privacy and act responsibly, their faith in those enterprises and institutions erodes. The Pew Research Center has reported that approximately 7 in 10 U.S. adults are concerned about how the government uses the data it collects on them. According to that same study, 81% of Americans say the information companies collect will be used in ways with which they are not comfortable.

Such negative headlines and stories detailing the growing threats to data privacy obscure a fundamental truth: Collecting and analyzing data is not inherently bad. On the contrary, there is significant value to be gained from sharing data safely and responsibly. When hospitals share data under confidential, mutually agreed-upon terms, for instance, doctors can do more research with external partners, which accelerates the pace of medical innovation and leads to potentially lifesaving treatments. When federal agencies tasked with maintaining public safety and national security can combine sensitive datasets, they can identify threats earlier and better protect citizens.

Putting the appropriate data privacy safeguards in place will accelerate innovation. This is particularly true in the field of AI where models need access to large swaths of useful, clean data to learn new capabilities and optimize their performance. Balancing the growing data privacy problem with the benefits of safe and responsible data sharing demands a broader understanding of the privacy landscape. On this issue, the federal government has the opportunity to lead the way by adopting new techniques and continually emphasizing the importance of data privacy. In this article, we outline challenges to data privacy and examine several techniques to mitigate them. 

Data, Data Everywhere

The root cause of the data privacy problem is the sheer volume of data in circulation. Organizations across all sectors are collecting data to an unprecedented degree. According to the company Skynova, “64% of business owners and executives” collect customer data from social media sites. Healthcare organizations collect data to improve patient care and track healthcare trends. Even the federal government collects data; as noted on the Government Accountability Office’s website, “The federal government collects and uses personal information on individuals in increasingly sophisticated ways for things like law enforcement, border control, and enhanced online interactions with citizens.”

The issue is that the techniques commonly used to protect collected data and maintain data privacy—the k-anonymity standard, summary statistics, aggregated information, predictive models hidden behind an application programming interface (API)—are not always sufficient. There’s also the threat of cybercriminals breaching a system and stealing data. All of which is to say that a promise contained within typical privacy policies—“we will never share your personal information”—leaves out an important caveat: intentionally.

Further compounding the problem is the availability of powerful algorithms and AI models, which make it easier for bad actors to cross-reference incomplete or partially redacted datasets, deanonymize them, and extract information that was previously thought to be protected. As more personal information makes its way online—whether intentionally or unintentionally—bad actors will have more data to work with, making future efforts to safeguard data privacy significantly more difficult.


A Different Approach to a Growing Problem

A family of techniques commonly referred to as differential privacy is being used by Apple, Google, Microsoft, and well-informed corners of the U.S. government to achieve data privacy. Differential privacy is a mathematical framework that introduces “noise,” or random variation, into a dataset to camouflage individual data points. Like a photo filter that blurs a person’s facial features, differential privacy limits the information you can see and extract once the data is shared, whether that’s through an API call, a database, or a machine learning algorithm.

Differential privacy is not perfect. It requires users to think about and account for the information they give away whenever they provide access to their data. It is also imperative to add noise judiciously. When the optimal level of noise is added, the aggregate information that can be extracted—such as averages, ranges, and statistical likelihoods—isn’t significantly changed by the injection of random differences into individual records. Add too much noise, however, and the accuracy of the data suffers.  
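
To make the tradeoff concrete, here is a minimal sketch of the Laplace mechanism, one standard way to add calibrated noise to an aggregate query. The function name, the age values, and the clipping bounds are illustrative assumptions, not drawn from any system described in this article.

```python
import numpy as np

def private_mean(values, lower, upper, epsilon, rng=None):
    """Release a mean with the Laplace mechanism (illustrative sketch)."""
    rng = rng or np.random.default_rng()
    values = np.clip(values, lower, upper)       # bound any one record's influence
    sensitivity = (upper - lower) / len(values)  # max change one record can cause
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return values.mean() + noise

ages = np.array([34, 29, 41, 56, 23, 38, 45, 31, 60, 27], dtype=float)
print(private_mean(ages, lower=18, upper=90, epsilon=1.0))  # noisy average age
```

Lowering epsilon adds more noise (stronger privacy, less accuracy); raising it does the opposite, which is the balancing act the paragraph above describes.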

The advantage of differential privacy is what it guarantees: When a dataset is shared under its veil, there is nothing a recipient can do to extract more data out of it than what the owner wants to reveal.
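
For readers who want the formal version, the standard statement of the epsilon-differential-privacy guarantee from the research literature (not spelled out in the article itself) is:

```latex
% For any two datasets D and D' that differ in a single record, and any set S
% of possible outputs of the randomized mechanism M:
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
% Smaller epsilon means the outputs reveal less about any one record,
% i.e., stronger privacy.
```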

Dispelling Data Privacy Misconceptions

Misconception: Organizations can fully control and audit AI data usage.
Reality: Analytics and AI systems can create outputs that are difficult to trace, making auditing data use a significant challenge for any organization.

Misconception: Data anonymization prevents sensitive data from being traced.
Reality: Anonymized data can often be reidentified by linking datasets.

Misconception: Access controls prevent unauthorized data access.
Reality: Although access controls are a good tool for preventing unauthorized access, sophisticated attackers can still exploit system vulnerabilities to gain access to sensitive data.

Misconception: Data retention policies enforce data removal within established schedules.
Reality: Legacy systems and decentralized data can result in data retention beyond policy mandates.

Misconception: Compliance with privacy regulations ensures exposure risks are mitigated.
Reality: Regulatory frameworks often lag technological advancements, exposing systems to privacy risks.

Differential Privacy in Action

The 2020 U.S. Census was one of the largest deployments of differential privacy to date, and it was also one of the highest-stakes use cases for data protection, as the data collected during the census determines political representation and the distribution of government funding. In a study published in Science, researchers verified that the methods used successfully protected respondent confidentiality. They also noted that while differential privacy preserved data accuracy at the state, regional, and national levels, accuracy was compromised at the neighborhood level, which could result in under- or over-representation of certain groups, particularly racial and ethnic minorities.

This finding illustrates an important point: In general, the larger the dataset, the easier it is to guarantee privacy without damaging data utility. However, the tradeoff is highly dependent on the goals of the analysis. If you simply want to calculate the average age of an entire population, differential privacy is achievable with as few as 100 records. If you want the average age for a specific group determined by several additional demographic criteria, you would need to add more noise to protect individual records. Similarly, differential privacy is harder to achieve when you have significant outliers. In income tax data, for example, most privacy methods struggle to disguise billionaires because their data stands out so much from all the others.
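
A back-of-the-envelope sketch, using entirely hypothetical data, shows why smaller groups need proportionally more noise for the same privacy budget:

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=100_000).astype(float)  # hypothetical population

def noisy_mean(group, epsilon=1.0, lower=18, upper=90):
    # Same Laplace-mechanism mean as in the earlier sketch
    sensitivity = (upper - lower) / len(group)
    return group.mean() + rng.laplace(0.0, sensitivity / epsilon)

whole = ages            # 100,000 records
narrow = ages[:100]     # a small demographic slice
print(abs(noisy_mean(whole) - whole.mean()))    # error is tiny
print(abs(noisy_mean(narrow) - narrow.mean()))  # expected error roughly 1,000x larger
```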

Making the Investment

Protecting privacy does not come cheaply. Many of the best techniques, including differential privacy, are just starting to proliferate in industry and currently sit at the top of the cost curve. For problems that require a custom solution, a federal agency might need to spend somewhere in the range of $1 million to $10 million to research, pilot, and develop its own differential privacy program.

These costs aren’t insurmountable and are well within the budgets of many federal agencies. They also pale in comparison with financial penalties organizations risk when they fail to properly protect people’s data. A federal judge required the Office of Personnel Management to pay a $63 million settlement to current and former federal employees and job applicants who were affected by a data breach.

Furthermore, protecting the privacy of citizens is an essential responsibility of democratic governments. With continued collaboration between government, industry, and universities, more affordable solutions will become available. For example, text classification models were once nearly impossible to train in a differentially private way because the added noise destroyed the sparsity in the datasets, dramatically increasing the time and cost of training models.

By carefully thinking through the objectives and accounting for where the noise is needed, it is now possible to reduce training time from months to minutes. Achieving this requires careful and deliberate accounting of the information in the algorithm and where it goes, but the upfront investment is well worth the long-term payoff: Once these optimizations become repeatable, the cost of subsequent deployments drops.
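
One way to picture this accounting is a privacy-budget ledger. The sketch below is illustrative only; it assumes the simplest (basic) composition rule from the differential privacy literature, under which the epsilon values of successive data releases add up.

```python
class PrivacyBudget:
    """Toy privacy-budget ledger (illustrative only). Under basic composition,
    the epsilons spent on successive data releases simply add up."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon, label):
        if self.spent + epsilon > self.total:
            raise RuntimeError(f"Budget exhausted; cannot release '{label}'")
        self.spent += epsilon
        print(f"{label}: spent {epsilon:.2f}, {self.total - self.spent:.2f} remaining")

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25, "average age")
budget.charge(0.25, "income histogram")
budget.charge(0.60, "model update")  # raises: this release would exceed the budget
```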

When it comes to making the case for more investment in data privacy, the dollar cost may not be the biggest hurdle. Far more formidable is the combination of organizational cultures that are resistant to sharing information and a lack of understanding of new data privacy solutions. When executives don’t fully understand the challenges that need to be addressed or the available solutions, they’re less likely to champion a program that requires investment and cultural change.

But a lack of understanding is not a valid reason to put off implementing data privacy solutions. Federal agencies have an obligation to advance data privacy technologies as part of responsible AI, a reality that’s been acknowledged at the highest levels. According to the 2023 Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence: “The Federal Government will enforce existing consumer protection laws and principles and enact appropriate safeguards against fraud, unintended bias, discrimination, infringements on privacy, and other harms from AI.”

Executive Order 14110 goes on to direct the National Institute of Standards and Technology (NIST) to create guidelines for agencies to evaluate the efficacy of differential-privacy guarantee protections, including those for AI. At the time this article went to press, NIST indicated that a final report was in progress.

The Strengths and Shortcomings of Three Data Protection Techniques

Synthetic data

Synthetic data allows for the creation of artificial datasets that resemble the statistical properties of real datasets. Their mirrored statistical properties enable users to glean insights through model training and data analysis without compromising the sensitive information contained within the real datasets they are designed to resemble. However, synthetic data performance and resemblance suffer on more detail-oriented tasks with smaller subsamples of the data.

Limitations and optimal use cases: Synthetic datasets lack the intricacies of real datasets, which necessitates careful data management and additional bias mitigation strategies for analysis that could affect data subsections, such as underrepresented minority groups. The use of synthetic data is best suited for situations where model precision may be less important than maintaining privacy, such as healthcare research or fraud detection.
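
As a rough illustration of the idea (not any particular product's generator), the toy sketch below fits a simple parametric model to hypothetical records and samples artificial ones that match the aggregate statistics; real synthetic data tools are considerably more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "real" records: (age, annual income) pairs
real = np.column_stack([
    rng.normal(45, 12, size=5_000),
    rng.lognormal(mean=10.8, sigma=0.5, size=5_000),
])

# Fit a simple parametric model (a multivariate normal over age and log-income)
features = np.column_stack([real[:, 0], np.log(real[:, 1])])
mean, cov = features.mean(axis=0), np.cov(features, rowvar=False)

# Draw synthetic records that mimic the aggregate statistics of the real data
sampled = rng.multivariate_normal(mean, cov, size=5_000)
synthetic = np.column_stack([sampled[:, 0], np.exp(sampled[:, 1])])

print(real.mean(axis=0), synthetic.mean(axis=0))  # similar aggregate statistics
```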

Federated learning

Federated learning reduces risks associated with data exposure by enabling machine learning models to train across decentralized devices that safeguard data by keeping it local. It brings the model to the data, rather than the data to the model. Models learn locally on devices, and the updates and improvements they gain from this training are sent back to a central server.

Limitations and optimal use cases: Within the health industry, federated learning has been deployed to protect patient data confidentiality and to improve the utility of clinical models for patients, such as the University of Pennsylvania’s Federated Tumor Segmentation (FeTS) platform. However, the technique is not well suited for all use cases, as uneven data distributions or data types create frequent obstacles and latency for federated learning applications.
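
The sketch below illustrates the core update-averaging loop, often called federated averaging, on hypothetical data. It is a simplification for illustration, not how any specific platform named in this article is implemented.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, steps=20):
    """One client's local training: a few gradient steps of linear regression
    on data that never leaves the device."""
    w = weights.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(2)
true_w = np.array([1.5, -2.0])
clients = []
for _ in range(5):  # five hypothetical devices, each with its own local data
    X = rng.normal(size=(200, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=200)
    clients.append((X, y))

global_w = np.zeros(2)
for _ in range(10):  # federated rounds: only model updates travel to the server
    updates = [local_update(global_w, X, y) for X, y in clients]
    global_w = np.mean(updates, axis=0)  # the server averages the client models
print(global_w)  # close to true_w, yet the server never saw the raw data
```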

Homomorphic encryption

Homomorphic encryption allows analysts to perform calculations on data without first decrypting it. This technique is particularly advantageous when confidentiality is paramount (e.g., financial transactions). Preserving encryption during computation also presents practical advantages in allowing external or third-party analysts to process the data without accessing raw information.

Limitations and optimal use cases: While well suited for highly sensitive but small datasets, the technique runs into practical limitations with larger datasets because its computationally intensive nature causes meaningful lags and expense at large scale.
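
As a small illustration, the sketch below assumes the open-source python-paillier library (our choice, not one named in this article); the Paillier scheme is only additively homomorphic, supporting addition and scalar multiplication rather than fully general computation, but it captures the key idea of computing on data that stays encrypted.

```python
# Assumes the third-party python-paillier package: pip install phe
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# The data owner encrypts sensitive transaction amounts before sharing them
encrypted = [public_key.encrypt(x) for x in [120.50, 310.00, 89.99]]

# An external analyst, holding only the public key, can still compute on them
encrypted_total = encrypted[0] + encrypted[1] + encrypted[2]
encrypted_adjusted = encrypted_total * 1.07  # e.g., apply a 7% adjustment

# Only the data owner, holding the private key, can read the results
print(private_key.decrypt(encrypted_total))     # approximately 520.49
print(private_key.decrypt(encrypted_adjusted))
```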

Alternative Paths to Privacy and Supplemental Techniques

Differential privacy offers a unique advantage by treating privacy risk as something that accumulates incrementally, in contrast to many other techniques that oversimplify privacy as either “protected” or “not protected.” While these alternative methods can provide a layer of data privacy and security, they are not foolproof. With growing volumes of data being collected and often insufficiently protected, the risk of exposure grows each time data is processed. This is where differential privacy stands out: It ensures that the risk of reidentification remains within the defined bounds of its mathematical framework, regardless of how much data on an individual is already available.

However, in cases where differential privacy may be too difficult to implement or situationally incompatible, alternative techniques can still help mitigate exposure risk. These techniques include synthetic data, federated learning, and homomorphic encryption. They can strengthen data protection when differential privacy isn't possible or bolster differential privacy solutions when they are in place.

Key Takeaways

  • Advanced algorithms and AI tools are making it easier to breach data privacy, with recent incidents showing how readily personal information can be extracted from seemingly protected datasets.
  • Differential privacy—a mathematical framework that introduces controlled noise into datasets—offers a promising solution, as demonstrated by its successful use in the 2020 U.S. Census.
  • While implementing robust privacy solutions requires significant investment, federal agencies have both the responsibility and opportunity to lead innovation in data privacy protection.

Meet the Authors

Max Wragan

leads AI strategy and risk management, helping Booz Allen and its clients responsibly deploy AI tools that enhance project delivery with cutting-edge technology.

Edward Raff

leads Booz Allen’s machine-learning (ML) research team, develops high-end technical talent, and disseminates the latest ML skills, techniques, and knowledge across the firm.

Sean Guillory, Ph.D.

is a lead scientist on cognitive domain operations projects and an AI program manager for the AI Rapid Prototyping Museum.
