Using Generative AI with Clean Data to Survive in Shark-Infested Waters: Lean Data (Part 3)
Introduction
With all the hype around generative AI, it’s not surprising many organizations are incorporating AI into their strategic plans. The problem is, without clean training data, large language models (LLMs) are worthless.
As organizations increasingly recognize the power of Artificial Intelligence (AI) in unlocking the value of data, the process of providing high-quality training data for LLMs is critical. In part 3 of this blog post series, we discuss how a data fabric can be used to implement techniques borrowed from lean manufacturing to optimize the time required to integrate training data for LLMs and maximize business results.
In many ways, the current state of data integration resembles pre-industrial manufacturing. Instead of an assembly-line approach, individual “data craftsmen” (also known as data engineers) working in small teams, or in many cases single IT “heroes,” build bespoke data architectures that don’t scale. This is very similar to the state of software application development before the advent of DevOps.
Now that organizations are open to the idea that AI is a key component of future competitiveness, they’ll soon realize that raw data is the input that AI converts into business outcomes; this realization will drive organizations to borrow concepts from lean manufacturing, essentially creating data factories of their own. “Lean Data” is closely related to “Industry 4.0,” but whereas Industry 4.0 describes all cyber-physical systems, Lean Data concerns itself with the optimization of data manufacturing (a data pipeline) as part of a data factory (many data pipelines).
10 Key Concepts of Lean Data
Value
A core principle of Lean Data is to align data integration efforts with customer needs, cost optimization, and other AI-driven efforts to improve an organization’s competitiveness. This is similar to Lean Manufacturing but extends beyond product value to all elements of an organization’s competitive strategy. A data fabric provides a unified and holistic view of the data ecosystem, enabling organizations to focus on minimizing time to business value. Unlike a data lake or data lakehouse, a data fabric gives organizations the agility to “start with the end in mind”: design the AI strategy around business outcomes, work backward to the AI/data system, and take advantage of the ability to change data pipelines on the fly as the business changes. Utilizing agile methodologies is a key component of value in Lean Data and will be the topic of a future blog post in this series.
Value Streams
An effective data fabric approach facilitates the mapping of data flow in the integration process by providing a comprehensive view of data movement across the organization. By understanding how data flows through the fabric, organizations can optimize the integration pipeline, ensuring that the right data reaches the model training stages efficiently.
Flow
When implementing Lean Data, data fabrics must ensure a smooth, efficient, and continuous data flow by integrating data from various sources in real time, in batch, or somewhere in between, depending on the requirements for each business outcome. In 1984, Eliyahu Goldratt introduced the “Theory of Constraints” in his seminal book, “The Goal.” Connectivity is a critical limiting factor in delivering clean training data to LLMs and other data monetization efforts. To minimize these constraints, a data fabric must support the broadest set of connection methods for both legacy and modern information systems. This includes not just modern interfaces like APIs and cloud storage, but also SQL databases, flat files, SFTP sites, and other legacy data communication methods. A best practice is to leverage open source JavaScript for connectivity because of the broad array of JS connectors and software development kits (SDKs) supported by IT vendors; this creates a force multiplier for keeping components up to date via third-party vendor vulnerability detection and patching of their JS connectors. We are talking integration here, not data analytics, where there are other purpose-built options in Python, R, and other programming languages.
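To make the connectivity requirement concrete, here is a minimal sketch (in TypeScript, in the JavaScript spirit described above) of what a standardized connector abstraction could look like. The interface and class names (Connector, RestApiConnector, CsvFileConnector) are illustrative assumptions, not any particular product's API.

```typescript
import { readFile } from "node:fs/promises";

// Every source, modern or legacy, is wrapped behind the same minimal contract,
// so downstream pipeline stages never care where the records came from.
interface Connector {
  fetchRecords(): Promise<Record<string, unknown>[]>;
}

// Modern interface: a REST API source using the built-in fetch API.
class RestApiConnector implements Connector {
  constructor(private url: string) {}
  async fetchRecords(): Promise<Record<string, unknown>[]> {
    const response = await fetch(this.url);
    if (!response.ok) throw new Error(`API returned ${response.status}`);
    return (await response.json()) as Record<string, unknown>[];
  }
}

// Legacy interface: a flat file (CSV) read from disk and parsed into records.
class CsvFileConnector implements Connector {
  constructor(private path: string) {}
  async fetchRecords(): Promise<Record<string, unknown>[]> {
    const text = await readFile(this.path, "utf8");
    const [header, ...rows] = text.trim().split("\n").map((line) => line.split(","));
    return rows.map((row) =>
      Object.fromEntries(header.map((col, i) => [col.trim(), row[i]?.trim()]))
    );
  }
}

// Swapping a legacy source for a modern one is a one-line change for the pipeline.
async function ingest(source: Connector): Promise<Record<string, unknown>[]> {
  return source.fetchRecords();
}
```

The value is less the code than the contract: any SFTP drop, SQL table, or SaaS API that can be wrapped this way stops being a Theory of Constraints bottleneck for the pipelines downstream of it.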
The situation is not static; as business objectives (value) evolve, new bottlenecks will emerge, including physical constraints, business constraints, process constraints, and, most importantly, people constraints. This requires a data fabric approach that facilitates identifying and addressing bottlenecks as they occur with a policy-driven approach, like the software-defined and infrastructure-as-code (IaC) approaches seen in DevOps.
Pull
Lean Data enables a pull-based approach to data integration, where data is integrated on demand as required by the outcome. Instead of pushing all available data, the data fabric must be able to dynamically pull training data from relevant sources, either on demand (event-driven) or on a schedule, depending on the use case. The data fabric must also be able to implement automation that enables LLMs and other data requestors to request specific data subsets, thus reducing unnecessary data processing and storage costs.
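As a rough illustration of the pull principle, the sketch below shows a requestor asking for only the source, fields, and rows it needs, either on demand or on a schedule. The names here (DataRequest, readFromSource) are assumptions made for illustration, not a real fabric's API.

```typescript
type Row = Record<string, unknown>;

// A pull request describes exactly what the requestor needs, nothing more.
type DataRequest = {
  source: string;                   // logical source name registered in the fabric
  fields: string[];                 // only the columns the requestor needs
  filter?: (row: Row) => boolean;   // optional row-level predicate
};

// Stand-in for the fabric's connector layer; a real fabric would dispatch to
// whatever connector is registered for the named source.
async function readFromSource(source: string): Promise<Row[]> {
  console.log(`pulling from ${source}`);
  return [
    { sku: "A-100", qty: 3, warehouse: "East" },
    { sku: "B-200", qty: 0, warehouse: "West" },
  ];
}

async function pull(request: DataRequest): Promise<Row[]> {
  const rows = await readFromSource(request.source);
  return rows
    .filter(request.filter ?? (() => true))
    .map((row) => Object.fromEntries(request.fields.map((f) => [f, row[f]])));
}

// On demand (event-driven): run only when a training job or LLM asks for data.
pull({ source: "erp.orders", fields: ["sku", "qty"], filter: (r) => (r.qty as number) > 0 })
  .then((rows) => console.log(rows));

// Scheduled: the same request wired to a timer instead of an event.
// setInterval(() => pull({ source: "erp.orders", fields: ["sku", "qty"] }), 3_600_000);
```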
Perfection
Lean Data promotes continuous data quality improvement by incorporating data governance and validation mechanisms. This ensures that data is accurate, reliable, and compliant with quality standards before being integrated into training datasets, leading to higher model performance. While many consider evolving data privacy regulations a source of data friction, the underlying issue is trust. A data fabric that incorporates standardized capabilities for consent management, data privacy masking, data lineage, validation, logging, and error reporting actually increases trust and facilitates both broader sharing of data and continuous improvement via agile processes.
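A simplified sketch of what such a governance step might look like inside a pipeline: consent filtering, basic validation, and masking applied before rows reach a training dataset. The field names and the consent flag are assumptions chosen for illustration.

```typescript
// A hypothetical customer record; consentToTrain models a consent-management flag.
type CustomerRow = { email: string; country: string; consentToTrain: boolean; spend: number };

// Data privacy masking: keep the domain for analysis, hide the identity.
function maskEmail(email: string): string {
  const [user, domain] = email.split("@");
  return `${user.slice(0, 1)}***@${domain}`;
}

function prepareForTraining(rows: CustomerRow[]): Array<Omit<CustomerRow, "consentToTrain">> {
  return rows
    .filter((r) => r.consentToTrain)                          // consent management
    .filter((r) => r.spend >= 0 && r.email.includes("@"))     // validation
    .map(({ consentToTrain, ...rest }) => ({ ...rest, email: maskEmail(rest.email) })); // masking
}

// Only consenting, valid, masked records ever reach the training dataset.
console.log(prepareForTraining([
  { email: "ann@example.com", country: "US", consentToTrain: true, spend: 120 },
  { email: "bob@example.com", country: "DE", consentToTrain: false, spend: 80 },
]));
```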
Empowerment
Lean Data Empowerment is all about being able to trust your employees and stakeholders with significant tasks, including access to enterprise data through LLMs. LLMs that incorporate both foundational LLM training datasets and enterprise data are subject to data leakage that can put businesses at risk. Although vendors like Microsoft have announced LLM offerings with commercial protection from enterprise data leakage (like Bing Chat Enterprise), protecting against leakage outside the organization is not enough. Users, and by extension the LLMs they use, also need protection against data leakage between user roles. As an example, if an organization were to feed all sales data into an LLM, how would it prevent salespeople from poaching each other’s leads by accessing sales data through the LLM? A data fabric must provide the ability to govern the flow of and access to sensitive data, whether accessed directly by users or by LLMs through automation.
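To illustrate role-level governance, here is a minimal sketch of an authorization filter applied before an LLM (or any other requestor) sees sales data. The roles, fields, and team model are hypothetical.

```typescript
type Lead = { id: string; ownerId: string; account: string; value: number };
type Requestor = { userId: string; role: "sales_rep" | "sales_manager"; teamMemberIds: string[] };

// The fabric applies this filter before handing data to the LLM's retrieval step,
// so the model can only answer from rows the requestor is entitled to see.
function authorizeLeads(leads: Lead[], requestor: Requestor): Lead[] {
  if (requestor.role === "sales_rep") {
    return leads.filter((l) => l.ownerId === requestor.userId);
  }
  // sales_manager: scoped to their own team's leads only.
  return leads.filter((l) => requestor.teamMemberIds.includes(l.ownerId));
}

const leads: Lead[] = [
  { id: "L-1", ownerId: "u-alice", account: "Acme", value: 50_000 },
  { id: "L-2", ownerId: "u-bob", account: "Globex", value: 75_000 },
];

// A rep asking "what leads look promising?" only ever sees their own pipeline.
console.log(authorizeLeads(leads, { userId: "u-alice", role: "sales_rep", teamMemberIds: [] }));
```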
Standardization
In Lean Manufacturing, standardization refers to documenting steps, and the sequences of those steps, to create standardized tasks. In Lean Data, standardization covers not just the documentation of steps (or components) and their sequencing in data pipelines, but also the standardization of the data pipeline components themselves. By leveraging a data fabric with a minimal set of standard pipeline components, organizations can not only establish and enforce standardized data integration pipeline templates, but also drastically reduce the complexity, time, and cost required to build and maintain data pipelines, which reduces both time to data and time to decision. An effective data fabric approach uses a minimal set of data pipeline components and a drag-and-drop user interface (UI) that makes sequencing steps simple. By defining consistent data pipeline components, an effective data fabric approach also ensures uniformity across all data pipelines, minimizing integration complexities.
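The sketch below suggests what a "minimal set of standard components" can mean in practice: every pipeline is just a sequence drawn from the same small vocabulary of steps. The component names (mapFields, filterRows, runPipeline) are assumptions for illustration; in a real data fabric they would be configured through the drag-and-drop UI rather than written by hand.

```typescript
type Row = Record<string, unknown>;
type Step = (rows: Row[]) => Row[];

// Standard component: rename/map input fields to the output schema.
const mapFields = (mapping: Record<string, string>): Step => (rows) =>
  rows.map((r) =>
    Object.fromEntries(Object.entries(mapping).map(([from, to]) => [to, r[from]] as [string, unknown]))
  );

// Standard component: keep only rows that satisfy a predicate.
const filterRows = (predicate: (r: Row) => boolean): Step => (rows) => rows.filter(predicate);

// A pipeline is nothing more than an ordered list of standard steps.
const runPipeline = (rows: Row[], steps: Step[]): Row[] =>
  steps.reduce((acc, step) => step(acc), rows);

// A new pipeline is a new sequence of the same components, not new code.
const output = runPipeline(
  [{ emp_id: 7, full_name: "A. Smith", status: "active" }],
  [
    mapFields({ emp_id: "employeeId", full_name: "name", status: "status" }),
    filterRows((r) => r.status === "active"),
  ]
);
console.log(output); // [{ employeeId: 7, name: "A. Smith", status: "active" }]
```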
Just in Time
In Lean Manufacturing, just-in-time (JIT) refers to methods that reduce flow times in manufacturing systems and improve response times to customers and suppliers. While a data fabric can optimize data processing by enabling just-in-time data integration with policy-defined data pipeline components, Lean Data JIT also refers to the just-in-time creation and change management of the data integration pipelines themselves, as well as their outputs. The ability to apply agile methodologies to data integration is a requirement for meeting the value principle of Lean Data; consequently, an effective data fabric approach seeks to drive the change management cost of building and maintaining data pipelines to zero where possible and to minimize it elsewhere.
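One way to picture just-in-time change management is a pipeline defined entirely as versioned configuration, so a change becomes a small, reviewable diff applied when needed rather than a development project. The shape of the configuration below, including the connector names, is an assumption for illustration, not a real product's format.

```typescript
// A pipeline described as data: the fabric interprets this at run time, so
// changing the pipeline means changing (and versioning) this object, not code.
type PipelineConfig = {
  name: string;
  source: { connector: string; options: Record<string, string> };
  steps: { component: string; config: Record<string, unknown> }[];
  output: { schema: string; destination: string };
};

const hrPipeline: PipelineConfig = {
  name: "hr-headcount",
  source: { connector: "hr-system-a-api", options: { endpoint: "https://example.invalid/hr" } },
  steps: [
    { component: "mapFields", config: { emp_id: "employeeId", full_name: "name" } },
    { component: "filterRows", config: { field: "status", equals: "active" } },
  ],
  output: { schema: "headcount.v1", destination: "training-dataset" },
};

// Just-in-time change: swapping HR systems is a one-line diff to `source`,
// reviewed and applied like any other policy or IaC change.
const updatedPipeline: PipelineConfig = {
  ...hrPipeline,
  source: { connector: "hr-system-b-api", options: { endpoint: "https://example.invalid/hr2" } },
};
console.log(updatedPipeline.source.connector);
```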
Visual Management
A data fabric approach in support of Lean Data must offer real-time monitoring and visualization of data integration processes through intuitive user interfaces. Teams must be able to not only track data flow, processing times, and errors, but also provide reporting for the purposes of security and compliance. This empowers IT operations to make informed decisions and address issues promptly, cybersecurity professionals to perform security audits and build in security by design, and compliance professionals to ensure compliance with relevant data privacy and security regulations through data privacy by design.
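A rough sketch of the kind of per-run record that makes this visual management possible: each pipeline execution emits a small, structured event that dashboards, security audits, and compliance reports can all be built from. The field names are assumptions for illustration.

```typescript
// One structured event per pipeline run; dashboards and audit reports are
// just different views over a stream of these records.
type PipelineRunEvent = {
  pipeline: string;
  startedAt: string;      // ISO timestamp
  durationMs: number;     // processing time
  rowsIn: number;
  rowsOut: number;
  errors: string[];       // error reporting
  requestorId: string;    // who or what pulled the data (for security and compliance)
};

function summarize(events: PipelineRunEvent[]): void {
  for (const e of events) {
    const dropped = e.rowsIn - e.rowsOut;
    console.log(
      `${e.pipeline}: ${e.durationMs} ms, ${e.rowsOut}/${e.rowsIn} rows (${dropped} filtered), ` +
      `${e.errors.length} errors, requested by ${e.requestorId}`
    );
  }
}

summarize([{
  pipeline: "hr-headcount",
  startedAt: new Date().toISOString(),
  durationMs: 850,
  rowsIn: 1200,
  rowsOut: 1175,
  errors: [],
  requestorId: "llm-training-job-42",
}]);
```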
Efficiency (Waste)
In Lean Manufacturing, waste refers to everything that does not add value (the seven wastes): excess transportation, inventory, idle time, overprocessing, defects, and so on. Reducing or eliminating waste improves product quality and reduces production cost and time. In Lean Data, instead of the seven wastes, we have the 10 efficiencies. As with the wastes in Lean Manufacturing, Lean Data efficiency addresses traditional forms of waste: work and other costs that don’t add value. It also covers opportunities to improve efficiency not traditionally thought of as sources of waste. In other words, Lean Data Efficiency seeks to eliminate waste while minimizing rework (or technical debt remediation) and maximizing reuse. Data fabrics can play a key role in optimizing the 10 Efficiencies of Lean Data:
The 10 Efficiencies of Lean Data
Change Management. Data fabrics minimize the cost of change management by making all parts of data pipelines configurable and automatable via policies. Data users can request schema changes via change requests that are fulfilled in minutes instead of days. A data fabric’s modular, standardized components also make it possible to quickly connect to new data sources and reuse existing work to create new data pipelines by copying existing pipelines or parts of pipelines.
Minimize rework. To minimize rework, a data fabric approach must be able to:
Connect to legacy systems easily to bypass technical debt (until other business value exists that justifies refactoring that technical debt)
Use modular data pipeline components to minimize the amount of work required to rework existing data pipelines. Operators can implement changes by modifying small parts of a data pipeline instead of starting from scratch. For example, if an HR department wanted to change HR systems, the data fabric only requires a change to the connector and mapper; all other downstream data pipeline components remain unchanged.
Eliminate dependencies between IT systems. Data fabrics isolate changes between existing IT systems by not requiring data changes in existing systems of record. Instead, the data fabric can easily transform input data into digestible output data via a low-code pipeline management interface and automation.
Minimize data pipeline sprawl. An effective data fabric approach requires a hierarchical catalog system for managing both data pipelines and pipeline components.
Maximize Reuse. With a modular, composable architecture for building data pipelines, data fabrics maximize reuse by making it possible for operators to copy and modify existing data pipelines.
Talent. Data fabrics help organizations maximize the productivity of their most skilled engineers and developers by eliminating 95% (or more) of the custom development required to build data pipelines, using a low-code, drag-and-drop interface for building and managing them. As a result, organizations can utilize non-coding operational personnel for most data pipeline tasks, and expensive data architects and ETL developers can each support on the order of 100x more data pipelines.
Access Automation (Zero Trust). Given the sensitivity of data and the need to govern it with access and identity information for both human and non-human data requestors like LLMs, manual access management processes are too inefficient to scale to the hundreds to tens of thousands of data pipelines required. Since an effective data fabric approach requires that sensitive data be governed, data fabrics with automated access management capabilities are a requirement. By integrating employee and vendor systems with identity platforms (Azure AD, Okta, etc.), requestor identity is automatically established, and data pipelines can govern data effectively because they have accurate access and identity data at all times.
Security and Privacy Automation. In software development, DevSecOps refers to the idea that security is built into the application by design and from the beginning. DevOps engineers call this “shifting left,” meaning security implementations and reviews move to the left of a project management Gantt chart. Lean Data seeks to improve efficiency in implementing and managing cybersecurity and data privacy governance in data integrations by “shifting left” privacy and cybersecurity requirements when building data pipelines. In Lean Data this concept is referred to as “privacy and security by design.” An efficient data fabric approach requires that the data fabric include standardized modular components that automate filtering in data pipelines based on consent and requestor identity.
Data Quality. How many times have we heard CDOs and data users complain about data quality? The problem is that traditional approaches to data integration lack efficient data validation capabilities. Lean Data seeks to optimize the processes required to clean data. Data fabrics help to optimize data cleansing by:
Quickly building connections to data validation software services that clean data.
Including the capability to inject custom data checks into data pipelines (see the sketch after this list).
Interoperability. To create efficiency, data fabrics maximize interoperability by making it possible to integrate with the widest possible set of legacy and modern systems (connectors) and by streamlining the transformation of input data models to output data models.
Transfer. Data fabrics minimize data transfer costs by making it easy to configure subsets of data (input schemas) and by making those subsets policy-defined, so operators can change pipeline input schemas on the fly and LLMs and other applications can automate input schema selection.
Storage. Data fabrics minimize storage costs by eliminating the need for a data lake, warehouse, or lakehouse in most cases, providing just-in-time access to data from the primary systems of record. Data fabrics can also be used to create new systems of record with persistent storage, but this is not an aggregated lake, lakehouse, or warehouse. The only other intermediate data storage occurs when performance or cost constraints require data to be persisted as part of a data pipeline.
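As promised under the Data Quality efficiency above, here is a minimal sketch of injecting custom data checks into a pipeline. The Check type and the example rules are assumptions chosen for illustration; the point is that operators add rules as configuration rather than rewriting the pipeline.

```typescript
type Row = Record<string, unknown>;

// A custom check is just a named predicate an operator can inject into a pipeline.
type Check = { name: string; passes: (row: Row) => boolean };

function applyChecks(
  rows: Row[],
  checks: Check[]
): { clean: Row[]; rejects: { row: Row; failed: string[] }[] } {
  const clean: Row[] = [];
  const rejects: { row: Row; failed: string[] }[] = [];
  for (const row of rows) {
    const failed = checks.filter((c) => !c.passes(row)).map((c) => c.name);
    if (failed.length === 0) {
      clean.push(row);
    } else {
      rejects.push({ row, failed });
    }
  }
  return { clean, rejects };
}

// Example rules an operator might add without touching the rest of the pipeline.
const checks: Check[] = [
  { name: "non-empty id", passes: (r) => typeof r.id === "string" && r.id.length > 0 },
  { name: "positive amount", passes: (r) => typeof r.amount === "number" && r.amount > 0 },
];

const { clean, rejects } = applyChecks(
  [{ id: "A-1", amount: 42 }, { id: "", amount: -5 }],
  checks
);
console.log(clean);   // rows that are safe to pass downstream
console.log(rejects); // carries the failed rule names for error reporting and lineage
```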
Conclusion
Embracing Lean Data principles with the support of a data fabric is critical for organizations seeking to unlock the true potential of AI and derive maximum value from their data assets. By aligning data integration efforts with customer needs, optimizing data flow, and ensuring continuous data quality improvement, organizations can create efficient data manufacturing processes. Lean Data's pull-based approach and standardization of pipeline components reduce complexity and time required for data integration. Agile methodologies facilitate adapting to changing business objectives and addressing bottlenecks in real-time. Through Lean Data, organizations can eliminate waste, maximize reuse, and optimize various aspects of data integration, empowering them to stay at the forefront of innovation and competitiveness in the industry.
Please share if you like this content!
Tyler Johnson
Cofounder, CTO PrivOps