As AI continues to transform industries and revolutionize how we live and work, ensuring that the data used to train and deploy AI models is of the highest quality is essential. Data quality dimensions play a critical role in ensuring the trustworthiness of AI, as they provide a framework for evaluating how accurate, complete, consistent, timely, and clean data is. In this article, we explore the main dimensions of data quality – completeness, coverage, consistency, timeliness, cleanliness, uniqueness, and validity – and discuss the importance of each dimension in ensuring the reliability and accuracy of AI models. We also examine the challenges of maintaining high-quality data in large-scale AI environments, including data silos, big data overload, unbalanced data sets, inconsistent data, and data sparsity. By understanding the importance of data quality dimensions and implementing effective data management strategies, organizations can ensure the trustworthiness of their AI models and unlock the full potential of AI in their industries.

Trusted AI – The Case of Data Quality Dimensions

Data quality measures how well a dataset serves its specific purpose. The umbrella term encompasses all factors influencing whether data can be relied upon for its intended use. Managing data quality means implementing processes and tools that ensure the data is fit to serve an organization’s specific needs. Examples of data quality issues include duplicated data, incomplete data, inconsistent data, incorrect data, poorly defined data, poorly organized data, and poor data security.

Data quality rules are an integral component of data governance, the process of developing and establishing a defined, agreed-upon set of rules and standards by which all data across an organization is governed. A data quality issue is identified whenever the data does not fulfill a given expectation. Data quality can be assessed from very different angles (dimensions) and measured by very different metrics. Dimensions are the categories along which data quality can be grouped, and each dimension is instantiated as one or more quality metrics. Metrics describe how a dimension is measured, quantitatively or qualitatively, and can be tracked over time.

Here are the main data quality dimensions and how each can be measured:

  • Completeness: A measure of the data’s ability to deliver all the required features (variables). Incomplete data cannot substitute for human knowledge because it lacks a sufficient representation of the world. The number of available features is a valid proxy for data completeness.
  • Coverage: A measure of whether all relevant scenarios (e.g., data points), world states, or inputs and outputs are represented in the data. The amount of data is a valid proxy for data coverage (e.g., AI/ML model error rates tend to decrease as the number of data points doubles).
  • Consistency: The uniformity of data as it moves across applications. The same data values stored in different databases should not conflict with one another, and data should be represented in the same format (e.g., date values).
  • Timeliness: A timely data set is readily available when it is needed. For example, streaming data can be updated in real time so that it is immediately accessible to AI models. Furthermore, timely data is up-to-date and has historical depth.
  • Cleanliness: A low cleanliness score indicates a low-quality dataset containing erroneous data. Errors can originate from human mistakes or from failures of the equipment that collected the data. A secondary source of errors is inaccurate ETL operations (e.g., a flawed data view) that extract duplicate or overlapping records.
  • Uniqueness: The percentage of distinct (non-repetitive) samples in the dataset after excluding non-informative features such as IDs.
  • Validity: The extent to which the data complies with explicit and implicit business rules; we identify three types of business rules (a minimal sketch of such checks follows this list):
    • Implicit: flag extreme values without prior knowledge, for example via outlier detection.
    • Implicit with prior knowledge: flag abnormal values using general prior knowledge (not necessarily specific to the dataset); for example, a person’s age cannot be negative or over 120.
    • Explicit: rules and thresholds the user defines in advance, providing prior knowledge about specific features.
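
To make these dimensions concrete, here is a minimal sketch of how completeness, uniqueness, and validity checks might be computed with pandas. The column names, example values, and thresholds are illustrative assumptions, not part of any specific tool.

```python
# A minimal sketch of completeness, uniqueness, and validity checks with pandas.
# Column names, example values, and thresholds are illustrative assumptions.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],                 # non-informative ID column
    "age": [None, 130, 27, 27],                  # None is a gap; 130 breaks an explicit rule
    "signup_date": ["2023-01-05", "2023-02-30", "2023-03-11", "2023-03-11"],
})

# Completeness: share of non-missing cells across the whole table.
completeness = df.notna().mean().mean()

# Uniqueness: share of distinct rows after excluding the ID column.
uniqueness = df.drop(columns=["customer_id"]).drop_duplicates().shape[0] / len(df)

# Validity, explicit rule: age must lie within a user-defined range.
age_violations = (~df["age"].dropna().between(0, 120)).sum()

# Validity, implicit with prior knowledge: dates must parse to real calendar dates.
date_violations = pd.to_datetime(df["signup_date"], errors="coerce").isna().sum()

print(f"completeness={completeness:.2f}, uniqueness={uniqueness:.2f}")
print(f"age violations={age_violations}, date violations={date_violations}")
```

In practice, each of these checks would be wrapped as a metric that is tracked over time, per dataset and per column.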

The data quality dimensions fall into two categories: extrinsic dimensions (also called contextual or application-dependent dimensions), which depend on the use case at hand, and intrinsic dimensions (also called task-independent dimensions). Intrinsic dimensions, such as completeness and consistency, are easier to implement, while extrinsic dimensions include, for example, cleanliness and timeliness. A data quality issue is an automatically generated alert of a specific problem type on a single feature (i.e., a column), a single data point (i.e., a row), a group of columns, a group of rows, or the data set as a whole.

Considering the wide variety of angles and metrics that can be used to assess data quality, it is common to formulate an aggregate data quality score that is simple to understand, does not depend on the number of rows, columns, or constraints, and is standardized so it can be compared with other data quality scores. In addition to its prevalence, a data quality issue may be associated with a confidence value: the probability that the reported issue is a real problem, which is used when calculating the aggregate quality score. Whether a constraint is implicit or explicit shapes this confidence. A rule-based, specified (user-defined), or confirmed constraint (e.g., a recognizable pattern) is an explicit constraint.
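
As an illustration of how prevalence and confidence might feed into such a score, here is a simple sketch; the issue records, weighting scheme, and 0-to-1 scale are assumptions rather than a prescribed formula.

```python
# Illustrative aggregation: each detected issue carries a prevalence (share of
# affected rows) and a confidence (probability that it is a real problem).
# Confident, widespread issues lower the aggregate score the most.
issues = [
    {"type": "missing_values",   "prevalence": 0.08, "confidence": 0.95},  # explicit constraint
    {"type": "outlier_values",   "prevalence": 0.02, "confidence": 0.60},  # implicit, inferred
    {"type": "format_violation", "prevalence": 0.01, "confidence": 0.90},
]

penalty = sum(i["prevalence"] * i["confidence"] for i in issues)
quality_score = max(0.0, 1.0 - penalty)  # standardized to a comparable 0..1 range

print(f"aggregate data quality score: {quality_score:.3f}")  # 0.903
```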

An implicit constraint is inferred from the data and is associated with a confidence value indicating how sure the automatic quality assessment tool is that the alert is real and valid. Different data quality tests carry different costs, depending on how easy the issue is to identify and on the number of passes required over the data. For example, tests requiring a single pass over the data include data class scope violations, data type violations, format violations, missing values, and out-of-range values. For issues related to ethical AI, biased data, or the masking of personal, sensitive, and financial data, it is essential to implement data protection rules that enforce strict policies and predefined standards. High-quality data is not always easy to obtain and maintain, especially in a scaled AI environment with large-scale datasets.
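
The sketch below illustrates the single-pass idea described above: missing-value, format, and out-of-range checks are all evaluated in one iteration over hypothetical records. The field names, regular expression, and age threshold are assumptions.

```python
import re

# Hypothetical records and rules; all three checks run in a single pass
# over the data rather than one scan per test.
records = [
    {"email": "ana@example.com", "age": 41},
    {"email": "not-an-email",    "age": 29},
    {"email": None,              "age": 150},
]
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

violations = {"missing": 0, "format": 0, "out_of_range": 0}
for row in records:  # single pass over the data
    if row["email"] is None:
        violations["missing"] += 1
    elif not EMAIL_RE.match(row["email"]):
        violations["format"] += 1
    if not 0 <= row["age"] <= 120:
        violations["out_of_range"] += 1

print(violations)  # {'missing': 1, 'format': 1, 'out_of_range': 1}
```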

Several data quality issues need to be considered and prevented in such settings, including:

  • Inaccurate, incomplete, and improperly labeled large datasets are among the most common causes of AI project failure. While cleaning gigabytes of data might seem simple, imagine cleaning petabytes or zettabytes of data. Traditional data cleaning and error-spotting methods don’t scale, which has led to the development of AI-powered tools.
  • Data silos are data sets only accessible to a limited group or individual within an organization. Several factors can contribute to data silos in large organizations, including technical challenges in integrating data sets and issues with proprietary or security access controls. Structure breakdowns at the organizational hierarchy level often lead to data silos due to a lack of collaboration between departments. AI models can be limited by a lack of access to comprehensive data sets, resulting in lower-quality results.
  • Big data overload, i.e., throwing too much data at models, can introduce noise, since a significant amount of the data is neither usable nor relevant. All that extra data may also lead AI systems to learn more from minor nuances and variances than from the significant overall trend. More data is therefore not always better; in many cases, data quality improves by shifting the focus from big data to good data.
  • Unbalanced data sets can significantly hinder the performance of AI models. In an unbalanced data set, one class or group is overrepresented while other classes are underrepresented, so only a small portion of the data is informative about the minority classes. Rigorous exploratory data analysis (EDA) reveals such imbalance, and appropriate methodologies and algorithms can then handle supervised learning with imbalanced classes (a brief class-weighting sketch follows this list).
  • Inconsistent data refers to collecting irrelevant data for AI model training. Using clean but irrelevant data leads to the same problems as training the model on poor-quality data. When dealing with multiple data sources, inconsistency is a strong indicator of a data quality problem.
  • Data sparsity occurs when data is missing or when specific expected values appear too infrequently in vast data sets. Sparse data can degrade the performance of AI algorithms and their ability to produce accurate predictions and generalize to out-of-distribution data sets. Unless data sparsity is identified, models will be trained on insufficient data, reducing their effectiveness or accuracy.
  • Data labeling issues are fundamental to supervised AI and ML models. Data labeling is a complex, expensive, and labor-intensive task, often requiring people to attach metadata to a wide range of data types. Proper labeling of AI training data is especially important whenever large data sets are involved in supervised learning tasks.
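
As a brief illustration of one common way to handle imbalanced classes in supervised learning, the sketch below uses class weighting in scikit-learn on synthetic data; the model choice and parameters are illustrative, not a recommendation for any specific use case.

```python
# A sketch of handling an imbalanced binary dataset with class weighting;
# the synthetic data and model choice are illustrative.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# 95% of samples in class 0, 5% in class 1 -- a typical imbalance.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" reweights errors inversely to class frequency,
# so the minority class is not ignored during training.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```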

Possible mitigation: A solid data management framework should include two equally important components, data quality assurance and data quality control, to enable the development, implementation, and management of policies, strategies, and programs that govern, secure, and enhance the value of the data an organization collects. The first step in the data quality improvement process is data profiling: an initial assessment of the current state of the data sets. Data quality assessment tools should therefore include profiling capabilities and a data quality dashboard that delivers a flexible user experience and can be tailored to the data quality dimensions that matter for the use case.
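
A data profile can be as simple as a per-column summary. The sketch below, with an invented two-column dataset, shows one way such a profile might look using pandas.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return a simple per-column profile: type, missing share, distinct count, example value."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3),
        "distinct": df.nunique(),
        "example": df.apply(lambda s: s.dropna().iloc[0] if s.notna().any() else None),
    })

# Hypothetical input; in practice this would be a dataset pulled from the catalog.
data = pd.DataFrame({"country": ["US", "DE", None, "US"], "amount": [10.5, 7.0, 3.2, 10.5]})
print(profile(data))
```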

For instance, when considering a large number of data sets in the data catalog, it should be possible to quickly identify high-quality and low-quality data sets without examining the details. An aggregate data quality score should be generated automatically from a data set using an AI algorithm that evaluates many different data quality metrics; this single representative score is calculated from the metrics computed across the various data quality dimensions. In other cases, entity matching algorithms are applied to identify objects and data instances that refer to the same real-world entity (e.g., John Lee Hooker and J.L. Hooker); this task is also called duplicate identification, record linkage, or entity resolution. Joining two datasets based on shared identifiers is a crucial task for data integration and cleaning, and it is equally important to apply automatic deduplication methods to ensure there are no duplicated or overlapping values across data sets. Finally, it is crucial to have a data quality monitoring capability that enables frequent data quality checks. Combined with AI, the data quality control component can automatically detect, report, and correct data variations based on predefined business rules and parameters. Automated, real-time monitoring helps extract greater value from data sets and reduces risks and costs. To maintain the highest level of data quality, companies must remember that proper data management is key to keeping data in the best condition possible.
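
To illustrate the entity matching step mentioned above, here is a minimal sketch that flags likely duplicates by comparing normalized name strings with Python's standard difflib; the similarity threshold is an assumption that a real pipeline would tune per domain.

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    # Lowercase and drop punctuation so "J.L. Hooker" and "John Lee Hooker"
    # become directly comparable strings.
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace()).strip()

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

records = ["John Lee Hooker", "J.L. Hooker", "Bonnie Raitt"]
THRESHOLD = 0.55  # illustrative cut-off

for i in range(len(records)):
    for j in range(i + 1, len(records)):
        score = similarity(records[i], records[j])
        if score >= THRESHOLD:
            print(f"possible duplicate: {records[i]!r} ~ {records[j]!r} (score={score:.2f})")
```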

Before allowing AI models to learn from data, it is imperative to keep a close watch over the data being collected, run regular checks on it, keep it as accurate as possible, and ensure it is in the right format. Data quality issues are less likely to occur when companies stay on top of their data.