Data Quality and AI Readiness: Why Artificial Intelligence Fails Before It Even Starts

June 24, 2026

Authors: Manuela Bazzarelli, Head of Operations @ Aramix (Datrix Group); Marika Savarese, Head of Tech Innovation @ Aramix (Datrix Group)

The public debate on AI tends to focus on models. But the variable that separates projects that reach production from those that remain proofs of concept lies elsewhere: data quality and the maturity with which an organization governs it.

Over the past two years, artificial intelligence has entered companies at unprecedented speed. Attention has focused on models: generative capabilities, multimodality, cost reduction, new architectures. Yet, in the transition from experimentation to production, one recurring fact emerges: the problem is not the model. It is the data.

What the debate calls “AI readiness” — the real ability of an organization to adopt artificial intelligence systems in a scalable and reliable way — largely depends on a less visible but decisive variable: the quality of data and the maturity of its governance.

This issue is directly connected to the broader question of the engineered enterprise in the age of AI. If AI is not simply a tool to be adopted, but a technology that changes how a company produces knowledge, coordinates work and makes decisions, then data quality is not a technical prerequisite. It is the first test of the organizational architecture. Where data is fragmented, ambiguous or unmanaged, even the most advanced AI remains disconnected from the real system of the enterprise.

According to Gartner, up to 70% of the time in analytics and AI projects is spent on data preparation activities, including data access, cleaning, transformation and integration. This is not a marginal inefficiency: it is the structural condition of applied AI. Most of the work is not about intelligence, but about the information infrastructure that makes it possible.

The paradox of more powerful models and more fragile contexts

The evolution of foundation models has created a widespread, and partly misleading, perception: that AI has become less dependent on data. Latest-generation models show remarkable general-purpose capabilities, often without any specific training on the business domains in which they are deployed.

But this autonomy is only apparent.

When AI is integrated into real processes — customer operations, supply chain, risk management, compliance — it inevitably comes into contact with proprietary data: incomplete, duplicated, distributed across legacy systems and often lacking a shared semantic structure. The result is a paradox worth stating clearly: the more sophisticated models become, the more sensitive they are to the quality of the context. An advanced model does not correct inconsistent data; it amplifies it.

There is, however, a second paradox, which is discussed less often. Improving data quality reduces noise, but it does not eliminate strategic uncertainty. A dataset can be accurate, complete and up to date, and still represent only a portion of the decision-making problem. In this case, AI does not simply produce an error: it produces a more sophisticated form of false precision. The decision appears more objective because it is numerical, more reliable because it is automated, more defensible because it is traceable. But the formal precision of the output does not necessarily coincide with an understanding of the system in which that output will be used.

This point is crucial for companies. As the tradition of bounded rationality suggests, organizations do not make decisions based on a complete representation of the world, but on partial maps, satisficing criteria and limited search processes. A model can therefore optimize a local metric while worsening overall performance: improving lead scoring while reducing commercial variety; increasing pricing efficiency while weakening the customer relationship; accelerating credit scoring while embedding decision-making rigidities that are hard to detect. Data quality is therefore a necessary condition for AI readiness, but it is not a guarantee against systemic error.

What “data quality” really means

Data quality is not a single concept but a set of technical and semantic dimensions, codified in International Organization for Standardization standards such as ISO 8000 and in the literature of DAMA (the Data Management Association):

Accuracy: the data correctly represents reality
Completeness: no relevant elements are missing
Consistency: the same data does not vary across systems
Timeliness: the data is up to date with respect to its intended use
Uniqueness: absence of duplications or ambiguity
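
As a minimal sketch of how some of these dimensions could be turned into measurable ratios, the Python example below computes completeness, uniqueness, timeliness and consistency on a handful of invented records. The record layout, the field names, the 365-day freshness threshold and the second "billing" system are all assumptions made for illustration. Accuracy is left out because it requires a ground truth that the data alone cannot supply.

```python
from datetime import date

# Hypothetical customer records from a CRM; all values are illustrative.
crm = [
    {"id": 1, "email": "a@example.com", "country": "IT", "updated": date(2026, 6, 1)},
    {"id": 2, "email": None, "country": "FR", "updated": date(2025, 1, 15)},
    {"id": 2, "email": None, "country": "FR", "updated": date(2025, 1, 15)},
    {"id": 3, "email": "c@example.com", "country": "DE", "updated": date(2026, 5, 20)},
]
# Hypothetical second system holding the country for each customer id.
billing = {1: "IT", 2: "ES", 3: "DE"}


def completeness(rows, field):
    """Share of rows where the field is populated."""
    return sum(r[field] is not None for r in rows) / len(rows)


def uniqueness(rows, key):
    """Ratio of distinct key values to total rows (1.0 means no duplicates)."""
    return len({r[key] for r in rows}) / len(rows)


def timeliness(rows, field, today, max_age_days):
    """Share of rows updated within the allowed age."""
    return sum((today - r[field]).days <= max_age_days for r in rows) / len(rows)


def consistency(rows, key, field, reference):
    """Share of rows whose field agrees with the second system."""
    return sum(reference.get(r[key]) == r[field] for r in rows) / len(rows)


report = {
    "completeness(email)": completeness(crm, "email"),
    "uniqueness(id)": uniqueness(crm, "id"),
    "timeliness(updated)": timeliness(crm, "updated", date(2026, 6, 24), 365),
    "consistency(country)": consistency(crm, "id", "country", billing),
}
```

On this sample, half of the rows have an email, the duplicated id lowers uniqueness to 0.75, and half of the rows disagree with the billing system on country. A real check would run against production tables and thresholds agreed with the data owners.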

In the context of AI, however, these dimensions are not sufficient. Models do not only require correct data: they require data that can be interpreted unambiguously by probabilistic systems. In other words, what matters is machine-usability, not just formal quality.

This is far from a marginal conceptual shift, because it moves the problem from data cleaning to semantic structure — and therefore from data engineering to information architecture.

A useful way to understand this problem comes from the theory of rugged adaptive landscapes. In a simple landscape, every local improvement brings the organization closer to the best solution. In a rugged landscape, by contrast, variables are interdependent: changing one element changes the value of the others. Performance does not depend on a single choice, but on the overall configuration. In these contexts, local learning can lead to local peaks: solutions that are better than previous ones, but inferior to alternatives that would require broader, less intuitive or less immediately measurable moves.
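
The idea of a local peak can be illustrated with a toy example. The Python sketch below assumes three interdependent yes/no choices and an invented fitness table. It is not drawn from any real organization or from a specific model in the literature. A greedy search that changes one choice at a time stops at a configuration that beats its neighbors but is not the best overall.

```python
from itertools import product

# Illustrative fitness values for three interdependent binary choices.
FITNESS = {
    (0, 0, 0): 3, (1, 0, 0): 5, (0, 1, 0): 2, (0, 0, 1): 2,
    (1, 1, 0): 4, (1, 0, 1): 4, (0, 1, 1): 9, (1, 1, 1): 6,
}


def neighbors(config):
    """All configurations that differ from config in exactly one choice."""
    for i in range(len(config)):
        flipped = list(config)
        flipped[i] = 1 - flipped[i]
        yield tuple(flipped)


def local_search(start):
    """Greedy hill climbing: move to the best neighbor while it improves."""
    current = start
    while True:
        best = max(neighbors(current), key=FITNESS.get)
        if FITNESS[best] <= FITNESS[current]:
            return current
        current = best


local_peak = local_search((0, 0, 0))
global_peak = max(product((0, 1), repeat=3), key=FITNESS.get)
```

Starting from (0, 0, 0), the search settles on (1, 0, 0) with fitness 5. The best configuration, (0, 1, 1) with fitness 9, can only be reached by changing two choices, and each of those changes on its own looks like a step backwards.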

For AI, this means that the problem is not only having “clean” data. It is understanding which decision-making landscape that data is representing. If the model observes only part of the system, it will tend to optimize that part. If the fitness metric is narrow, AI will learn to serve that metric. In organizational terms, the risk is mistaking a computable representation of the problem for the problem itself: a critical issue already highlighted by studies on cognitive and experiential search, where simplified maps guide search but can also constrain it. The enterprise does not see the landscape: it sees a computable projection of it.

AI readiness: an emergent property of architecture

AI readiness is not a characteristic of a single system, but an emergent property of the entire corporate information architecture.

According to McKinsey & Company’s analyses of organizations that are able to scale AI beyond the pilot stage, several recurring elements can be identified.

The first is the presence of a unified or federated data architecture, which reduces information fragmentation and silos across functions.

The second is data traceability, often referred to as data lineage: the ability to trace every transformation that a piece of information has undergone.
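
To make the notion of lineage concrete, here is a minimal Python sketch in which every transformation records its name next to the value it produces. The `Traced` class, the step names and the sample data are hypothetical. Production systems typically delegate this to dedicated lineage or catalog tooling rather than to application code.

```python
from dataclasses import dataclass, field


@dataclass
class Traced:
    """A value bundled with the list of transformations applied to it."""
    value: object
    lineage: list = field(default_factory=list)

    def apply(self, step_name, fn):
        # Return a new Traced object so earlier states stay inspectable.
        return Traced(fn(self.value), self.lineage + [step_name])


# Hypothetical raw export with inconsistent spacing and casing.
raw = Traced(["  Alice ", "BOB", "alice"], ["source:crm_export"])

clean = (
    raw.apply("strip_whitespace", lambda xs: [x.strip() for x in xs])
       .apply("lowercase", lambda xs: [x.lower() for x in xs])
       .apply("deduplicate", lambda xs: sorted(set(xs)))
)
```

After the three steps, `clean.lineage` lists the source and every transformation in order, while `raw` is left untouched, so any intermediate state can be compared with the final one.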

The third is consistency between training and production in machine learning systems, one of the most frequent critical issues in industrial AI projects.
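
One simple way this inconsistency can show up is as a mismatch between the fields a model was trained on and the fields it receives in production. The sketch below, with invented field names and rows, compares the two by field name and value type only. It is an assumption-laden illustration that does not cover differences in value distributions or preprocessing logic.

```python
def schema_of(rows):
    """Map each field name to the set of Python type names seen in the rows."""
    schema = {}
    for row in rows:
        for name, value in row.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema


def skew(train_rows, serve_rows):
    """Report fields that differ between training and serving data."""
    train, serve = schema_of(train_rows), schema_of(serve_rows)
    return {
        "missing_in_serving": sorted(train.keys() - serve.keys()),
        "unexpected_in_serving": sorted(serve.keys() - train.keys()),
        "type_mismatch": sorted(
            k for k in train.keys() & serve.keys() if train[k] != serve[k]
        ),
    }


# Hypothetical rows: serving sends age as text and drops the segment field.
train_rows = [{"age": 41, "income": 52000.0, "segment": "retail"}]
serve_rows = [{"age": "41", "income": 52000.0, "channel": "web"}]

result = skew(train_rows, serve_rows)
```

Here the check flags `segment` as missing in serving, `channel` as unexpected, and `age` as a type mismatch, three differences that a model would otherwise absorb silently.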

Finally, governance is not separate from development, but integrated into the processes themselves.

From this perspective, AI readiness is not about models, but about the structure of the system that feeds them.

The invisible cost of poor data quality

The cost of poor data quality rarely becomes visible immediately. More often, it emerges as a progressive degradation in system performance.

Unstable predictive models, an increase in false positives, inconsistent automated decisions, or the inability to scale prototypes are recurring symptoms of a systemic problem.

The critical point is not model development, but the production phase, where data drift progressively makes predictions unreliable.

This effect is particularly evident in systems based on behavioral data, where the context evolves faster than the infrastructure designed to represent it.
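
Drift of this kind can be quantified. One commonly used measure is the Population Stability Index (PSI), which compares how a variable is distributed across bins in a reference sample and in a recent one. The Python sketch below is a simplified illustration: the samples, the bin edges and the `eps` smoothing value are assumptions, and values falling outside the edges are simply not counted in any bin.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    def shares(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                # The last bin is closed on the right so the top edge is counted.
                last = i == len(edges) - 2
                if edges[i] <= v < edges[i + 1] or (last and v == edges[-1]):
                    counts[i] += 1
                    break
        # eps avoids log(0) when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical values of one feature at training time and in production.
training = [10, 12, 15, 18, 20, 22, 25, 28, 30, 35]
production = [18, 25, 28, 30, 32, 35, 36, 38, 40, 40]
edges = [0, 20, 30, 40]

same = psi(training, training, edges)
shifted = psi(training, production, edges)
```

Comparing the training sample with itself gives a PSI of zero, while the production sample, which has moved towards higher values, gives a PSI well above 0.25, a level often quoted as a rule of thumb for a significant shift. Where to set the alert threshold remains a judgment call for each use case.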

From data lake to data product: the paradigm shift

In recent years, a paradigm shift has emerged in data management: from data lake to data product.

The traditional data lake model has often generated large information repositories, but little semantic structure, making it difficult to reuse data consistently.

The data product approach — associated with data mesh models — reverses this logic: each dataset is treated as an autonomous product, with clear ownership, documentation, SLAs and explicit quality metrics.
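
What treating a dataset as a product might look like in practice can be sketched as a small descriptor that carries ownership, documentation, an SLA and quality thresholds alongside the data. The `DataProduct` class, its fields and the example values below are hypothetical and do not follow any specific data mesh standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner: str                 # accountable team or person
    description: str           # documentation in brief
    freshness_sla_hours: int   # maximum tolerated age of the latest load
    min_completeness: float    # explicit quality threshold between 0 and 1

    def violations(self, hours_since_load, completeness):
        """Return the list of SLA or quality breaches for the observed state."""
        found = []
        if hours_since_load > self.freshness_sla_hours:
            found.append("freshness")
        if completeness < self.min_completeness:
            found.append("completeness")
        return found


# Hypothetical data product published by a sales data team.
orders = DataProduct(
    name="orders_daily",
    owner="sales-data-team",
    description="One row per confirmed order, loaded nightly.",
    freshness_sla_hours=24,
    min_completeness=0.98,
)

healthy = orders.violations(hours_since_load=6, completeness=0.995)
breached = orders.violations(hours_since_load=30, completeness=0.90)
```

The point of the sketch is that the expectations are explicit and checkable: a consumer, human or automated, can ask the product whether it currently meets its own commitments and who is responsible when it does not.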

This shift is crucial for AI readiness because it introduces a fundamental organizational principle: data is not a passive asset, but a product with clearly defined responsibilities.

LLMs and data governance: a new level of complexity

The adoption of large language models introduces a less visible but deeper transformation: the separation between data, instruction and decision becomes blurred.

In traditional systems, the pipeline is relatively traceable: dataset, feature engineering, model, output. In systems based on LLMs and retrieval-augmented generation architectures, or RAG, the chain becomes more complex: unstructured documents, embeddings, semantic retrieval, dynamic prompts, probabilistic generation and application-level post-processing.

This layering introduces what could be described as an “opacity layer”: an intermediate level that makes it difficult to reconstruct, in deterministic terms, the relationship between source data and final output.

The problem is not only technical, but one of accountability: why did the system produce this answer?
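
One partial answer is to record, for every generated response, which sources and which prompt were involved. The Python sketch below builds such an audit record using content hashes. The field names, the document identifiers and the model name are invented for illustration, and no real retrieval system or LLM is called. It makes the inputs of an answer reconstructible, although it does not explain the model's internal reasoning.

```python
import hashlib
import json


def fingerprint(text):
    """Stable SHA-256 digest used to identify an exact piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def audit_record(question, retrieved, prompt, answer, model_name):
    """Collect what is needed to reconstruct how an answer was produced."""
    return {
        "question": question,
        "model": model_name,
        "sources": [
            {"doc_id": d["doc_id"], "version": d["version"],
             "chunk_sha256": fingerprint(d["text"])}
            for d in retrieved
        ],
        "prompt_sha256": fingerprint(prompt),
        "answer_sha256": fingerprint(answer),
    }


# Hypothetical retrieved chunk, prompt and answer.
retrieved = [
    {"doc_id": "policy-042", "version": "2026-03", "text": "Refunds within 30 days."},
]
prompt = "Answer using only the context.\nContext: Refunds within 30 days.\nQ: Refund window?"
record = audit_record("Refund window?", retrieved, prompt, "30 days.", "example-llm")

# One JSON line per answer, suitable for an append-only log.
log_line = json.dumps(record, sort_keys=True)
```

Storing hashes rather than full text keeps the log compact and still lets an auditor verify that a given document version and prompt are the ones that produced the answer, provided the originals are retained elsewhere.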

The return of regulation: the EU AI Act and high-risk systems

With the European Union Artificial Intelligence Act, the European regulatory framework has introduced a key principle: responsibility lies not with the model, but with the system.

This radically changes the nature of data governance. It is no longer a support function, but a compliance requirement.

Organizations that develop or use systems classified as “high-risk” must demonstrate the traceability of the data used in models, the documentation of transformations, control over training datasets and knowledge bases, drift monitoring, and continuous risk assessment.

In this context, tools such as data catalogs, lineage tracking and model registries are no longer optional. They become essential elements of auditability and, therefore, of legitimacy.

Apparent compliance and real control

One effect already observable in organizations is the divergence between formal compliance and substantive control.

Many companies implement governance tools — catalogs, dashboards, policies — without achieving a real semantic understanding of data flows.

The problem becomes more acute in LLM-based systems, where data is often unstructured and the boundary between information and instruction, or prompt, is ambiguous. In these cases, the risk is not only technical error, but the inability to verify the origin of automated decisions.

The change introduced by European regulation is therefore conceptual before it is technical. Regulation no longer concerns isolated models, but complex socio-technical systems. As a result, even a highly performing model may be non-compliant if the information system that feeds it cannot be controlled. From this perspective, AI readiness ceases to be a measure of performance and becomes a measure of governability.

Data governance must therefore also become governance of learning. Organizations learn from experience, but experience is always selective. They see most clearly what is already in motion: active processes, acquired customers, launched products, executed campaigns and channels already under control. They see far less of what never materialized: the alternatives that were not chosen, customers lost before entering the funnel, interrupted projects and failures made invisible by reporting systems. This asymmetry generates a form of learning myopia: the organization learns mainly from what is close in time, space and the memory of success.

AI can reduce this myopia only if it is designed within an appropriate decision-making architecture. Otherwise, it can amplify it. A system trained primarily on the company’s historical experience can make the past more efficient, but not necessarily make the future more intelligent. The literature on problemistic search shows that the way an organization defines a problem shapes the search for solutions: reducing search to an automatic reaction to performance below aspiration leaves out diagnosis, representation and managerial judgment. This is why AI readiness requires audit trails, uncertainty metrics, model stress tests, controlled experiments and explicit moments of human judgment.

AI as the point of arrival

The most common mistake in adopting artificial intelligence is to consider it the starting point of digital transformation.

In reality, it is the point of arrival.

Before AI, there is data. Before data, there are the systems that generate it. And before systems, there is the organization that defines what quality, responsibility and decision-making mean.

This is why data quality is not a technical issue, but a strategic choice for the engineered enterprise. And AI readiness is not a technological checklist, but a measure of the maturity with which an organization is able to make its informational complexity readable, controllable and governable.