QDB: quality in database

14th International Workshop on Quality in Databases (QDB’25)

at the 51st VLDB conference on September 1, 2025, London, UK


Quality in Databases

Data quality has been a major concern of organizations for decades, leading to the introduction of standards and quality frameworks. Recent advances in artificial intelligence (AI), e.g., generative AI, have brought data quality (DQ) back into the spotlight. In enterprises, it is particularly important to build data ecosystems that can cope with the emerging challenges posed by AI-based systems. Data quality has been tackled from different perspectives: the database community has made significant advances in data profiling and data cleaning and still focuses on DQ issues like duplicate detection or missing data handling; the information systems community provides solutions for addressing DQ at an organizational level; the machine learning (ML) community focuses mainly on the development of robust models that can deal with issues in the data. We want to build upon the success of QDB 23 and QDB 24 and continue offering an open format for joint discussions between different communities on the future of DQ assessment and improvement.

QDB 2025 Proceedings: https://www.vldb.org/2025/Workshops/vldb.html#outline-container-org2a44ac9

QDB 2024 Proceedings: https://vldb.org/workshops/2024/#outline-container-org90551d8

QDB 2023 Proceedings: https://ceur-ws.org/Vol-3462/

News
Schedule
The program consists of talks on the accepted papers and two keynotes from academia.
We plan to have a poster session to spark more discussion and networking.

September 1st (all times are in London Time)
08:45 - 09:00
Opening Remarks

09:00 - 10:00
Session 1: Morning Keynote - Chair: Hazar Harmouch

 
Keynote Title: Model Lakes

Keynote Speaker: Renée J. Miller (University of Waterloo, Canada)

Keynote Abstract: Given a set of learning (AI) models, it can be hard to find models appropriate to a task, understand the models, and characterize how models differ from one another. Currently, practitioners rely on manually written documentation (model cards) to understand and choose models. However, not all models have complete and reliable documentation. As the number of models increases, the challenges of finding, differentiating, and understanding models become increasingly crucial. Inspired by research on data lakes, we introduce the concept of model lakes. We explore the question of why we should care about data quality in model lakes.

Keynote Speaker Bio: Renée J. Miller is the Canada Excellence Research Chair in Data Intelligence at the University of Waterloo. She is a Fellow of the Royal Society of Canada, Canada’s National Academy of Science, Engineering and the Humanities. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Ontario Premier’s Research Excellence Award, and an IBM Faculty Award. She formerly held the Bell Canada Chair of Information Systems at the University of Toronto and a University Distinguished Professorship at Northeastern University. She is a Fellow of the ACM and the AAAS. Her work has focused on the long-standing open problem of data integration and has achieved the goal of building practical data integration systems. She and her colleagues received the ICDT Test-of-Time Award, and she has received the CS Canada Lifetime Achievement Award in Computer Science. Professor Miller was an Editor-in-Chief of the VLDB Journal and is a former president of the non-profit Very Large Data Base (VLDB) Foundation. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor’s degrees in Mathematics and Cognitive Science from MIT.


10:00 - 10:30
Break



10:30 - 12:00
Session 2: Morning Research Session - Chair: Lisa Ehrlinger

 
Out in the Wild: Investigating the Impact of Imperfect Data on a Tabular Foundation Model
Vasileios Papastergios and Anastasios Gounaris

 
Exploring Privacy-Preserving Record Linkage: A Holistic Framework for Dataset Generation and Detailed Result Analysis
Florens Rohde, Victor Christen, and Erhard Rahm

 
Dynamic Knowledge Graph-based Measurement of Data Quality
Johannes Schrott, Rainer Meindl, Christian Lettner, Stefan Hammer, and Magdalena Leitner

 
Evolving Gracefully: Building Robust and Self-Adaptive Data Cleaning Pipelines for Schema Evolution and Uncertainty
Kevin Kramer, Valerie Restat, and Uta Störl


12:00 - 13:30
Lunch



13:30 - 15:00
Session 3: Afternoon Keynote - Chair: Hazar Harmouch

 
Keynote Title: From XAI to XEE through Influence and Provenance, and optimising models for fairness when data drifts over time: some work in progress on connecting data and models to ensure quality and trust in both.

Keynote Speaker: Paolo Missier (University of Birmingham, UK)

Keynote Abstract: In the “Data-to-AI” value chain (leading to decisions, new knowledge, insights, etc.), interventions aimed at improving the quality of the underpinning data are increasingly driven by model optimisation goals. Embracing this view, over the past few years the popular “Data-Centric AI” (DCAI) paradigm has been producing many interesting examples of model-driven quality improvement methods, including model-driven data cleaning, model-driven dataset pruning, and many more. These are all “engineering” tasks, focused on improving the effectiveness and efficiency of the entire value chain. Complementary to this, one can view “trust” in the models as a user-centred manifestation of quality, which translates into transparency requirements, and thus into the need to provide effective explanations that encompass both models and the data used to train them. Within this setting, in this talk we present two strands of ongoing work. Firstly, we aim to generalise DCAI both on the data side, controlling the impact of data drift on the model over time, and on the model side, adding fairness as a quality metric complementary to accuracy. Combining these two elements raises interesting new challenges. Secondly, we aim to provide “eXplainability End-to-End” (XEE) by combining established XAI techniques, namely Influence Functions, with our own work on tracking the provenance of training datasets through data processing pipelines. As this is mostly work in progress, expect fewer results and more preliminary ideas, hopefully leading to stimulating interactions through the talk.

Keynote Speaker Bio: Paolo Missier is Chair in Computer and Data Science at the University of Birmingham, UK, where he also serves as Director of the Data and AI Institute. He previously held academic positions at Newcastle University (2011–2023) and was a Fellow of the Alan Turing Institute (2018–2023). His career spans both academia and industry, including applied research at Bellcore in the USA (1994–2001) and consultancy for the Italian Government and private sector on data quality (2001–2004). He holds a PhD in Computer Science from the University of Manchester, where he focused on data quality in scientific workflows. His research interests lie in improving data science to improve science and health data science at scale. Since 2016, he has been Senior Associate Editor of the ACM Journal of Data and Information Quality (JDIQ).

 14:30
Poster session (continues during the coffee break)

15:00 - 15:30
Break


15:30 - 16:50
Session 4: Afternoon Research Talks - Chair: Lorena Etcheverry

 
Label Flipping For Group Fairness
Shashank Thandri and Romila Pradhan

 
PBE Meets LLM: When Few Examples Aren’t Few-Shot Enough
Shuning Zhang and Yongjoo Park

 
Towards an SLM-based Auditing of Relational Schemas and Data Quality for Practical Data Governance
Antony de Medeiros

16:50 - 17:00
Closing Remarks


Important Dates
Submission deadline: May 23, 2025, 5pm Pacific Time (extended from May 16, 2025)
Submission website: https://cmt3.research.microsoft.com/QDB2025
Notification of acceptance: June 20, 2025
Final papers due: July 8, 2025
Workshop: Monday, September 1, 2025


Call for Papers
This workshop aims to foster the exchange of novel ideas and best practices on data quality assessment and improvement in the era of AI. The event brings together experienced, senior-level data quality researchers with junior researchers and PhD students. We specifically expect junior researchers to benefit, since they get to meet the community and continue high-quality research on data quality. The suggested topics of interest include, but are not limited to:

Foundational DQ methods and assessment

  • Data profiling for data quality measurement
  • Statistical methods to detect erroneous data
  • Data lineage and provenance tracking
  • Industry-specific data quality standards and compliance
  • Data versioning and quality control
  • Benchmark data sets to evaluate DQ assurance methods

AI/ML-specific data quality

  • Data preprocessing
  • Data quality for foundation models
  • Data quality using generative AI
  • Bias detection and mitigation in training data
  • Data quality for few-shot and zero-shot learning
  • Post-training quality/fact checking
  • FAIRness in data quality
  • Explainable data cleaning
  • ML-powered methods for improving data quality

Implementation and process optimization

  • Automation of DQ assessment and improvement methods
  • Real-time data quality monitoring
  • Cost-benefit analysis of data quality improvements
  • Integration of data quality tools in MLOps pipelines

We appreciate submissions on all these topics for different domains (e.g., healthcare, mobility, production) and for various types of data (e.g., graphs, time series).


Submission
Authors are invited to submit original, unpublished full research papers and demo descriptions that are not being considered for publication in any other forum. Please submit your paper as a PDF using Microsoft's QDB CMT site. Append the category tag as a suffix to the title of the paper, for example “Data Management in the Year 3000 [Regular]” or “Spatial Database System [Demo]”. This must be done both in the paper file and in the CMT submission title. The suffix will not be part of the camera-ready copy if the paper is accepted.

Format
It is the authors' responsibility to ensure that their submissions adhere to the format detailed here. In particular, it is not allowed to modify the format with the objective of squeezing in more material. Submissions that do not comply with the formatting will be rejected without review. Note that the limit of up to 6 pages (including all figures, tables, and references) must be followed for both full papers and demos.

Publication
Accepted papers will be distributed via vldb.org.


Invited Speakers
Keynote 1: Renée J. Miller (University of Waterloo, Canada)
Keynote 2: Paolo Missier (University of Birmingham, UK)
People
Program Chairs:
Lisa Ehrlinger (Hasso Plattner Institute, University of Potsdam, Germany)
Lorena Etcheverry (Universidad de la República, Uruguay)
Hazar Harmouch (University of Amsterdam, Netherlands)

Steering Committee:
Sourav S Bhowmick (Nanyang Technological University, Singapore)
Ihab Ilyas (University of Waterloo, Canada)
Felix Naumann (Hasso Plattner Institute, University of Potsdam, Germany)

Program Committee:
