Data is a critical part of machine learning (ML) powered products. It is the basis for training, validating, and evaluating model performance. Yet, data quality issues are not uncommon, resulting in unanticipated failures in production as ML models frequently deal with incomplete and erroneous data. Starting with a good understanding of the real-world application is necessary to drive data quality, increase confidence in our model performance during training and production, and improve customer satisfaction. 

“The function of a machine learning system can be descriptive, meaning that the system uses the data to explain what happened; predictive, meaning the system uses the data to predict what will happen; or prescriptive, meaning the system will use the data to make suggestions about what action to take.”—Artificial Intelligence and the Future of Work 

ML-enabled products learn the patterns embedded in existing data in order to make predictions on new data. This contrasts with traditional programming, where developers provide step-by-step instructions on how to solve a problem, and it is the crux of the ML challenge: an ML system learns from data, but data in the real world is not static. It evolves and changes, and even the smallest change can cause the system to fail in unexpected ways and have unintended consequences. If we start by expecting that our ML-powered application will fail, then the success of an ML project depends on how well we understand the desired and acceptable user experiences, so that we can treat the ML output for what it is, a prediction with an associated probability, handle errors gracefully, and build customer trust. With that, it shouldn’t come as a surprise that real-world ML systems require both ML development and modern software development expertise, and that only a small fraction of a real-world ML system is composed of the ML code while the surrounding infrastructure is vast and complex, as highlighted in Google’s paper Hidden Technical Debt in Machine Learning Systems.

Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.

Alice Zheng and Amanda Casari highlight this relationship between data and model vividly in their book, Feature Engineering for Machine Learning:

“What we call data are observations of real-world phenomena. For instance, stock market data might involve observations of daily stock prices, announcements of earnings by individual companies, and even opinion articles from pundits. Personal biometric data can include measurements of our minute-by-minute heart rate, blood sugar level, blood pressure, etc. … Each piece of data provides a small window into a limited aspect of reality. The collection of all of these observations gives us a picture of the whole. But the picture is messy because it is composed of a thousand little pieces, and there’s always measurement noise and missing pieces. … Trying to understand the world through data is like trying to piece together reality using a noisy, incomplete jigsaw puzzle with a bunch of extra pieces.”

So, while ground truth data, which dictates the “right” answers, is needed for ML training and model generation, the performance of these models is further impacted by the quality of that ground truth and how accurately it represents the real world. Starting with the notion that data is always biased and full of measurement errors helps put us on the right path of first validating and baselining our ground truth. O’Reilly’s recent report AI Adoption in the Enterprise 2021 further highlights data quality challenges as a bottleneck to AI adoption (18%), right behind the challenge of hiring skilled people (19%).

Bottlenecks to AI adoption.

Monica Rogati’s The Data Science Hierarchy of Needs visualization best demonstrates the importance of data quality: “Yes, self-actualization (AI) is great, but you first need food, water and shelter (data literacy, collection and infrastructure).”

The data science hierarchy of needs: collect, move/store, explore/transform, aggregate/label, and learn/optimize.

It is important to highlight that data gathering and preparation are time-consuming tasks, as the training process can require large amounts of data covering as many of the variations that exist in a real-world application as possible. A recent survey by Anaconda found that data scientists spend about 45% of their time on data preparation tasks, compared to 23% on model selection and training and 11% on deploying models.

Garbage in, garbage out is a classic saying that applies well to machine learning systems. Furthermore, in ML-enabled products data quality is a double-edged sword, affecting not just the data the model is trained on but also the new data the model uses to inform future decisions. ML tasks are defined by what data exists, what we can use as input variables (i.e. features), how we go about collecting what is needed as cost-effectively as possible, and how the given input data maps to a predicted output value. We could assume that the relationship between inputs and outputs does not change, that is, that the mapping learned from historical data remains just as valid on future data. But this is a faulty assumption, because the real world is naturally complex, chaotic, and messy.
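
As a minimal sketch of what checking that assumption can look like, the snippet below compares the distribution of a single hypothetical sensor feature in historical training data against newly arriving data using a two-sample Kolmogorov–Smirnov test. The feature, the generated data, and the alert threshold are all illustrative assumptions, not part of any particular product.

```python
# Minimal sketch: detect when new data no longer follows the distribution the
# model was trained on. The feature, data, and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=7)
train_temps = rng.normal(loc=21.0, scale=1.5, size=5_000)  # historical readings
new_temps = rng.normal(loc=23.5, scale=1.5, size=1_000)    # readings after conditions changed

statistic, p_value = ks_2samp(train_temps, new_temps)
if p_value < 0.01:  # arbitrary alert threshold for this sketch
    print(f"Possible drift: KS statistic={statistic:.3f}, p-value={p_value:.2e}")
else:
    print("No significant shift detected between training and new data.")
```

In practice a check like this runs continuously against production inputs rather than once, but even a one-off comparison makes the risk of changing data tangible.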

This is your machine learning system? Yup! You pour the data into this big pile of linear algebra, then collect the answers on the other side. What if the answers are wrong? Just stir the pile until they start looking right.

Machine Learning (xkcd)

A recent publication by the Google Research team, “Everyone wants to do the model work, not the data work,” further underscores the importance of data and the impact of poor data quality in AI development. Data cascades are defined as “compounding events causing negative, downstream effects from data issues — triggered by conventional AI/ML practices that undervalue data quality.” Unfortunately, as the paper emphasizes, these data cascades, often originating upstream during data collection and labeling, had invisible compounding effects on AI models, resulting in increased project costs, wasted time and effort, and at times harm to beneficiaries.

Data cascades in high-stakes AI. Cascades are opaque and protracted, with multiplied negative impacts. Cascades are triggered in the upstream (e.g., data collection) and have impacts on the downstream (e.g., model deployment). Thick red arrows represent the compounding effects after data cascades start to become visible; dotted red arrows represent abandoning or re-starting of the ML data process. Indicators are mostly visible in model evaluation, as system metrics, and as malfunctioning or user feedback.

Google’s research paper identifies four data cascade challenges:

  • Interacting with physical world brittleness: Change is the only constant, and the real world is full of issues that can trigger data cascades. As the authors highlight: “Data cascades often appeared in the form of hardware, environmental, and human knowledge drifts.” Hardware components, such as cameras and sensors, and environmental conditions can introduce unexpected data issues from sensor drift, calibration problems, out-of-focus images, improper lighting, and more. The “human drift” factors, such as social, political, and community changes and evolving regulations and policies, add another dimension of complexity to our problem space. And while the training data for model development goes through validation, it is difficult to capture all the potential variations of real-world conditions and their implications for ML inference.
  • Inadequate application-domain expertise: Domain experts are important in defining ground truth, identifying features, and interpreting data. Engaging experts end to end throughout the AI/ML development pipeline, as the authors identified, will minimize errors and increase data quality in how data is cleaned, merged, corrected, and interpreted. However, experts alone may not be able to address potential issues when it comes to dealing with subjectivity in ground truth and finding representative data that reflects the real world well enough for the AI model to generalize.
  • Conflicting reward systems: Data and data-related operations, such as consistency in labeling, quality of data entry, and completeness of duration and frequency, are critical to the quality of ML models. However, misaligned incentives, whether due to limited budgets, a lack of data literacy training, or limited education on the importance and impact of data collection on the final product, can result in wasted time and effort.
  • Poor cross-organizational documentation: Poor documentation and lack of metadata create situations where practitioners need to make assumptions to assess quality, representativeness, and fit for use cases, ultimately resulting in discarded datasets or the need to recollect data. As the authors highlight: “In a few cases where metadata cascades were avoided, practitioners created reproducible assets for data through data collection plans, data strategy handbooks, design documents, file conventions, and field notes. For example, P46 and P47 (aquaculture, US) had an opportunity for data collection in a rare Nordic ocean environment, for which they created a data curation plan in advance and took ample field notes. A note as detailed as the time of a lunch break saved a large chunk of their dataset when diagnosing a data issue downstream, saving a precious and large dataset.”

Thankfully it is not all hopeless, as the authors shared observations from teams who have overcome the data cascade challenges:

“The teams with the least data cascades had step-wise feedback loops throughout, ran models frequently, worked closely with application-domain experts and field partners, maintained clear data documentation, and regularly monitored incoming data. Data cascades were by-and-large avoidable through intentional practices, modulo extrinsic resources (e.g., accessible application-domain experts in the region, access to monetary resources, relaxed time constraints, stable government regulations, and so on). Although the behaviour of AI systems is critically determined by data, even more so than code [111]; many of our practitioner strategies mirrored best practices in software engineering [38, 83]. Anticipatory steps like shared style guides for code, emphasising documentation, peer reviews, and clearly assigned roles—adapted to data—reduced the compounding uncertainty and build-up of data cascades.”
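
One lightweight way to adapt that kind of documentation practice to data, in the spirit of the data curation plans and field notes described above, is to write a small machine-readable manifest next to every collection session. The sketch below is a hypothetical example; the field names and values are illustrative assumptions, not a standard schema.

```python
# Sketch: record a per-session data-collection manifest next to the raw files.
# Field names and values are illustrative, not a standard schema.
import json
from datetime import datetime, timezone
from pathlib import Path

def write_manifest(session_dir: str, operator: str, sensor_id: str,
                   calibration_date: str, notes: str) -> Path:
    """Capture who collected the data, with what hardware, and any field notes."""
    manifest = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "operator": operator,
        "sensor_id": sensor_id,
        "calibration_date": calibration_date,
        "field_notes": notes,
    }
    out_dir = Path(session_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path

# Even a note as mundane as a lunch break can help diagnose a data issue later.
write_manifest("session_042", operator="field_team_a", sensor_id="cam-03",
               calibration_date="2021-05-14",
               notes="Paused 12:10-12:40 for lunch; light rain after 14:00.")
```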

Challenges with data are even more significant for embedded systems due to the constrained nature of building embedded products. The models we produce need to run within the footprint of the embedded hardware resources: the power consumption, memory size, and execution speed constraints. And, as mentioned earlier, the inputs (i.e. features) provided to the model drive the model’s overall accuracy and inference predictions. The more complex the model and the higher the accuracy needs of the output predictions, the more challenging it is for the embedded system in terms of resource needs, and the more important it is to understand the real-world application requirements.
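
To make that resource trade-off concrete, the back-of-the-envelope sketch below estimates how parameter count and numeric precision translate into on-device weight storage. The model names and parameter counts are made-up examples, and real deployments also need memory for activations, buffers, and runtime overhead.

```python
# Back-of-the-envelope estimate of weight storage on an embedded target.
# Model names and parameter counts are made-up examples; activation buffers
# and runtime overhead are ignored here but matter in practice.
BYTES_PER_PARAM = {"float32": 4, "int8": 1}

def weight_footprint_kib(num_params: int, dtype: str) -> float:
    return num_params * BYTES_PER_PARAM[dtype] / 1024

for name, params in [("tiny keyword spotter", 20_000),
                     ("small image classifier", 250_000)]:
    for dtype in ("float32", "int8"):
        print(f"{name}: {params:,} params at {dtype} is about "
              f"{weight_footprint_kib(params, dtype):.0f} KiB of weights")
```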

ML Project Lifecycle

In summary, data is a critical part of developing ML products. However, it is our understanding of the real-world application and environment, and our commitment to iterating according to the needs of the data, the model, and the constraints of the embedded solution, that help ensure we deliver on the model performance and ultimately on the product promise.

  • Start small and with the end in mind, work quickly towards an end-to-end application, and iterate often with feedback loops. Partner with domain experts throughout the development lifecycle not just for data quality but also to detect and minimize any systemic bias. 
  • Before spending the time and storage collecting potentially noisy and deficient data, start by considering the needs of the application and identifying where and how the data will be extracted. This brings the added benefit of better gauging the size of the data collection effort, and it surfaces opportunities to use a smaller, higher-quality data set to deliver the required model performance.
  • Before collecting data, consider the hardware, the application, and its environment in order to develop realistic expectations of how the data may break, and build strategies for detecting and correcting it, such as metadata, outlier detection, monitoring for data drift, and any data augmentation needs (a simple outlier-check sketch follows this list). Don’t forget the important task of data literacy: teaching the necessary data collection steps, how to label, and why it all matters.
  • Use proof-of-concept (PoC) small-scale data collection to validate the approach and feasibility. Explore using synthetic datasets to build and test the model and confirm the pipeline works end to end (see the synthetic-data sketch after this list).
  • Keep data integrity in mind throughout the project lifecycle. Data reduction is an important technique that reduces the volume of data required for training while maintaining the integrity of the data. This process not only helps reduce the costs associated with data collection and data management, but more importantly can improve model accuracy. However, domain knowledge is needed to maintain data integrity during the reduction process, as what works in one solution domain doesn’t work in another. For example, for object detection applications we may be able to reduce image resolution without impacting accuracy, but the same process will not work for detecting manufacturing defects, as the decreased image quality would also remove the details of the manufacturing imperfections we are looking for (see the resolution-reduction sketch after this list). In short, it is important to remove unnecessary details, noise, fillers, and the like from our data, while being careful to maintain data integrity and quality.
  • Apply DataOps and MLOps practices with clear documentation and adjust the level of rigor to the phase of the project (PoC projects don’t require the same level of diligence as production). Look for ways to build resilience into data collection and data pipelines by establishing end-to-end monitoring of data, model deployment, and inference. Define metrics to monitor model quality (accuracy, mean error rate) along with key business metrics that the model influences (ROI, cost, …). 
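
Picking up the outlier-detection point from the list above, here is a minimal sketch of a z-score check that flags readings far outside the range seen during training. The example readings and the 3-sigma threshold are illustrative assumptions.

```python
# Minimal sketch: flag incoming readings that sit far outside the training range.
# The example readings and the 3-sigma threshold are illustrative.
import numpy as np

train_readings = np.array([20.1, 20.4, 19.8, 21.0, 20.7, 20.2])
mean, std = train_readings.mean(), train_readings.std()

def is_outlier(value: float, n_sigma: float = 3.0) -> bool:
    return abs(value - mean) > n_sigma * std

for reading in (20.5, 35.2):  # the second value mimics a failing sensor
    print(reading, "outlier" if is_outlier(reading) else "ok")
```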
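
For the PoC and synthetic-data point, a sketch like the one below can confirm that a training and evaluation pipeline runs end to end before any expensive field collection. The features, labeling rule, and model choice are arbitrary and only meant to exercise the plumbing.

```python
# Sketch: run an end-to-end train/evaluate loop on purely synthetic data to
# validate the pipeline before collecting real data. Everything here is arbitrary.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 4))                  # four made-up sensor features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic labeling rule

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print(f"Pipeline runs; accuracy on held-out synthetic data: {model.score(X_test, y_test):.2f}")
```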
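
Finally, for the data-reduction point, the sketch below shows the kind of resolution reduction discussed above. Whether it preserves data integrity depends on whether the details the task relies on survive the reduction; the target size and file names are hypothetical.

```python
# Sketch: reduce image resolution to shrink a dataset. Acceptable only when the
# details the task depends on survive (coarse object detection may tolerate it;
# fine defect inspection may not). Target size and file names are hypothetical.
from PIL import Image

def downsample(path_in: str, path_out: str, target=(320, 240)) -> None:
    Image.open(path_in).resize(target).save(path_out)

# downsample("part_001_full.png", "part_001_small.png")
```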

For more tips and real-world examples of collecting quality data, check out our whitepaper, Designing IoT: Data Collection in Harsh Environments.

Many thanks to Riyadth Al-Kazily, Heather Brundage, Andrew Reading, & Chris Font for reading this article, correcting my mistakes, challenging my logic, and adding needed clarity and supporting arguments.