
The Mirror Problem

Much of the AI safety discourse is framed around a speculative risk: as we build toward a superintelligent AI future, how do we ensure that the systems we create are “aligned” and non-hostile? What is overlooked, however, is that the fear of domination by AI may reflect the deeply rooted cultural values of the groups most involved in AI development: values tied to hierarchy and supremacy, the subjugation of others, and take-all-and-leave-nothing practices.

These values are visible in contemporary cultural artifacts, from how news cycles prioritize conflict over cooperation, to how advertising frames consumption as success, to how entertainment valorizes zero-sum competition. And because large-scale AI systems are trained on internet-scale corpora, these artifacts, and the dominant cultural values they carry, become active features of the training data. The real concern is therefore less a hypothetical future danger than the present encoding of logics of domination into the systems themselves (Crawford & Paglen, 2019).

Poison in the Well

Generative models reproduce the biases of their training data. When internet-scraped corpora overrepresent dominant groups and denigrate marginalized ones, models produce skewed and stereotypical outputs reflective of how they’ve been trained (Birhane & Prabhu, 2021; Denton et al., 2021).

I first noticed this when I started experimenting with image generation using GANs in 2022. At the time, GANs already struggled to generate anything coherent from a given prompt, but I noticed they did a significantly better job generating people of white/Eurocentric ethnicity than people of other ethnic groups. A dataset-inspection tool let me look at the underlying training data, where I discovered caricatures, racist memes, and stereotypes attached to non-white ethnicities. The poison wasn't theoretical; it was visible.

What started as an exploration of using GANs to generate alternate representations of canonical artworks featuring non-white ethnic groups turned out to be a rudimentary practice in dataset governance: counterfactual data generation. Later research has shown this practice can mitigate bias: for example, Hall et al. (2022) demonstrate that synthetic augmentation can be used to rebalance representational skew.
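To make that practice concrete, here is a minimal sketch of counterfactual caption augmentation, assuming a simple image–caption dataset. The attribute vocabulary, the example captions, and the downstream text-to-image generation step are hypothetical placeholders; the point is only the shape of the intervention: measure representational skew, then synthesize counterfactual examples for underrepresented groups.

```python
from collections import Counter

# Hypothetical attribute vocabulary; a real pipeline would use a vetted,
# community-reviewed taxonomy rather than a hard-coded keyword list.
ATTRIBUTE_TERMS = ["white", "black", "asian", "latino", "indigenous"]


def attribute_counts(captions):
    """Count how often each attribute term appears across the captions."""
    counts = Counter({term: 0 for term in ATTRIBUTE_TERMS})
    for caption in captions:
        tokens = set(caption.lower().split())
        for term in ATTRIBUTE_TERMS:
            if term in tokens:
                counts[term] += 1
    return counts


def counterfactual_captions(captions, source_term, target_term):
    """Swap one attribute term for another to build counterfactual prompts."""
    swapped = []
    for caption in captions:
        tokens = caption.lower().split()
        if source_term in tokens:
            swapped.append(" ".join(target_term if t == source_term else t for t in tokens))
    return swapped


captions = [
    "a portrait of a white man",
    "a white woman reading in a cafe",
    "a black man smiling",
]
print(attribute_counts(captions))
# Counterfactual prompts could then be rendered by a text-to-image model
# (generation step omitted) and added back to rebalance the dataset.
print(counterfactual_captions(captions, "white", "black"))
```

The design choice here mirrors the anecdote above: the intervention rebalances what the model sees during training rather than correcting what it emits afterward.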

Shifting Gears

This anecdote reframes the scope of AI safety. Current debates often center on downstream alignment of frontier models (Christiano, 2018). But the more immediate concern is the upstream governance of training data. If datasets reproduce dominant hierarchies and systemic bias, then model outputs will necessarily encode and amplify those patterns, regardless of what control measures are layered later (Birhane & Prabhu, 2021).

The link to existential risk is straightforward: systems trained on poisoned data don't just reflect unfairness; they build misaligned world models. Misaligned world models are unsafe because they distort decision-making at scale, creating vulnerabilities in critical domains like healthcare, governance, and security. In other words, cultural bias is not just “bad optics”; it is a failure mode that compounds into structural risk (Noble, 2018).

In technical terms, this is pre-distributional alignment: ensuring that what models learn in the first place does not embed disproportionate benefit for some while harming others. This requires a shift in AI safety towards data-centric alignment: moving the locus of “safety” from reinforcement signals applied after training, to the epistemic substrate that training draws from.

Dataset Governance Is AI Safety

This reframing suggests that dataset governance should be elevated as a central concern of AI safety. By addressing harmful cultural artifacts embedded in training corpora (Crawford & Paglen, 2019), curating or augmenting datasets with equitable counterexamples (Hall et al., 2022), and integrating perspectives from communities most affected by systemic harms, we mitigate not only issues of bias but also the very risks that alignment debates purport to solve.

This is often dismissed as an “ethics” or “fairness” problem rather than a safety one. But that distinction collapses on inspection. Biased data produces brittle models. Brittleness is not just socially harmful — it is a safety vulnerability. A model that encodes toxic stereotypes is also one that can be adversarially exploited through those representational weaknesses. In other words, bias is a robustness failure in disguise (Birhane & Prabhu, 2021).

Another common objection is that “data governance doesn’t scale.” But filtering and synthetic augmentation already scale; they are the foundation of modern RLHF pipelines and synthetic data generation methods (Schuhmann et al., 2022). What doesn’t scale yet is participatory curation, and that is precisely the kind of upstream intervention where safety research could innovate. Governance is no less scalable than alignment protocols, which themselves are experimental and resource-intensive.
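As a rough illustration of why filtering scales, here is a sketch of the map-filter shape such pipelines use, assuming caption records split into independent shards. The blocklist scorer and shard contents are stand-ins; a production system would swap in a trained toxicity or stereotype classifier, but the parallel structure is the same.

```python
from concurrent.futures import ProcessPoolExecutor

# Stand-in scorer: a real pipeline would call a learned classifier here;
# this keyword check only illustrates the shape of the filtering pass.
BLOCKLIST = {"slur_example_1", "slur_example_2"}


def toxicity_score(caption):
    """Return 1.0 if the caption hits the blocklist, else 0.0."""
    return 1.0 if set(caption.lower().split()) & BLOCKLIST else 0.0


def filter_shard(records, threshold=0.5):
    """Drop records whose caption scores at or above the threshold."""
    return [r for r in records if toxicity_score(r.get("caption", "")) < threshold]


if __name__ == "__main__":
    # Hypothetical shards; at scale each shard is a file handled by one worker.
    shards = [
        [{"caption": "a dog in a park"}, {"caption": "slur_example_1 cartoon"}],
        [{"caption": "a family cooking dinner"}],
    ]
    with ProcessPoolExecutor() as pool:
        kept = [rec for shard in pool.map(filter_shard, shards) for rec in shard]
    print(f"kept {len(kept)} of {sum(len(s) for s in shards)} records")
```

Because each shard is filtered independently, the same shape parallelizes across as many workers as the corpus demands.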

Encoding the Conditions of Misalignment

Some would argue that alignment is inherently post-training — but that misses the point. Pre-distributional alignment is both cheaper and more effective, because it prevents harms from being learned in the first place. Band-aid solutions applied after training cannot unlearn poisoned foundations.

Safety is not only about controlling hypothetical AGI (Bostrom, 2014). It is about preventing current systems from amplifying existing harms into structural risks. Data-centric alignment reframes safety as control over epistemic inputs, not just behavioral outputs. Poisoned data → poisoned world models → unsafe decision-making at scale.

And crucially, the pursuit of capability is not at odds with this shift. Genuine capability requires reliability, and reliability requires governance of the data substrate.

Absent this intervention, the “mirror problem” remains: we are not just fearing hostile superintelligence — we are actively encoding the conditions for it.

Citations

  • Birhane, A., & Prabhu, V. (2021). Large image datasets: A pyrrhic win for computer vision? WACV.
  • Crawford, K., & Paglen, T. (2019). Excavating AI: The politics of training sets for machine learning.
  • Denton, E., et al. (2021). On the genealogy of machine learning datasets. Big Data & Society.
  • Noble, S. U. (2018). Algorithms of oppression: How search engines reinforce racism. NYU Press.
  • Schuhmann, C., et al. (2022). LAION-5B: An open large-scale dataset for training next generation image-text models. arXiv:2210.08402.
  • Hall, M., et al. (2022). Synthetic data generation for fairness. FAccT.
  • Christiano, P. (2018). Alignment problem from a deep learning perspective. AI Alignment Forum.
  • Bostrom, N. (2014). Superintelligence: Paths, dangers, strategies. Oxford University Press.
