Using Generative AI with Clean Data to Survive in Shark-Infested Waters: Data Friction (Part 2)
“Fix the wiring before you turn on the light.”
Introduction
With all the hype around generative AI, it’s not surprising that many organizations are incorporating AI into their strategic plans. The problem is that without clean training data, large language models (LLMs) are worthless.
As organizations increasingly recognize the power of Artificial Intelligence (AI) in unlocking the value of data, the process of providing high-quality training data for LLMs is critical. In part 2 of this blog post series, we delve into the sources of data friction that pose significant hurdles to using enterprise data with LLMs, discussing the nature of these sources and how organizations can use a data fabric to minimize data friction and fuel the success of their AI and data strategies.
Data friction can arise due to technical limitations, incompatible formats, legal or regulatory constraints, security concerns, operational inefficiencies, or lack of interoperability, among other factors. It represents a challenge to organizations seeking to leverage data as a strategic asset, as it slows down data-driven processes, introduces delays, increases costs, and undermines the overall effectiveness of data utilization. Overcoming data friction requires addressing these barriers through technological advancements, standardization efforts, policy changes, and collaborative initiatives to enable the seamless and secure movement of data throughout its lifecycle.
6 Key Sources of Data Friction
Obsolete, Outdated IT (Technical Debt)
Legacy IT systems and outdated infrastructure can impede data integration efforts for LLMs. These systems often lack the compatibility, scalability, and agility needed to handle the volumes and complexity of data required to train models effectively. Traditionally, overcoming technical debt requires organizations to invest in modernizing their IT infrastructure, adopting cloud-based solutions, leveraging containerization, and implementing robust data integration frameworks. Unfortunately, these approaches rarely work, for reasons that include the difficulty of communicating the business value of technical debt remediation and resistance to change. Fortunately, there is a better way. By embracing a distributed data fabric approach that emphasizes maximum interoperability and minimizes the change management costs associated with data pipeline development and maintenance, organizations can significantly mitigate the impact of technical debt on AI initiatives. This approach allows them to bypass most technical debt issues and optimize data flow, resulting in a more efficient training process for LLMs.
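To make the idea concrete, here is a rough Python sketch, with hypothetical names throughout (LegacyCrmGateway, fetch_training_records, and the table and column names are all illustrative, not tied to any particular product), of how an interoperability layer can wrap a legacy system behind a stable interface so downstream LLM data pipelines never touch the legacy schema directly:

```python
# Minimal sketch: an adapter hides a legacy system behind a stable, modern
# interface. All names are illustrative assumptions.
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass
class TrainingRecord:
    """Normalized record handed to downstream LLM data pipelines."""
    record_id: str
    text: str
    source: str


class DataSource(Protocol):
    """Stable contract the rest of the pipeline codes against."""
    def fetch_training_records(self) -> Iterable[TrainingRecord]: ...


class LegacyCrmGateway:
    """Adapter around an aging CRM database; legacy quirks stay inside."""

    def __init__(self, connection):
        self._conn = connection  # e.g. a DB-API connection to the old system

    def fetch_training_records(self) -> Iterable[TrainingRecord]:
        cursor = self._conn.cursor()
        # Legacy column names and encodings are translated here, once,
        # instead of leaking into every downstream pipeline.
        cursor.execute("SELECT CUST_ID, NOTES_TXT FROM TBL_CUST_NOTES")
        for cust_id, notes in cursor.fetchall():
            yield TrainingRecord(record_id=str(cust_id),
                                 text=notes or "",
                                 source="legacy_crm")
```

The point of the pattern is that legacy quirks get translated once, at the edge, rather than remediated across every system that consumes the data.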
Data Privacy, Security & Governance Requirements
In part 1 of this blog series, we delved into how data privacy, security, and governance are paramount concerns when integrating data into LLMs. To get the most out of their AI strategy, organizations must strike a delicate balance between obtaining valuable training data for LLMs and other ML models, maintaining operational efficiency, and complying with GDPR and other data privacy regulations. Embracing a data fabric that incorporates cybersecurity and data privacy by design is essential for achieving this balance.
Data Quality
The quality of training data significantly impacts the performance and accuracy of LLMs. Challenges related to data quality include inconsistencies, incompleteness, inaccuracies, and biases within datasets. To mitigate these challenges, organizations typically invest in data cleansing, preprocessing, validation, and augmentation tooling. Unfortunately, this tooling ecosystem changes at an enormous rate, and there are literally thousands of tools to choose from, offered by startups and major tech vendors alike. How do you choose the right tooling to support your AI, analytics, and digital business transformation initiatives? It’s nearly impossible to get it right: the competitive environment that drives (or should drive) IT initiatives changes just as fast as the data tooling ecosystem, or faster. But by using a data fabric as a hot-pluggable backplane for data tooling, organizations can take a best-of-breed approach that future-proofs their data quality strategy and stands in stark contrast to large vendors with a strong financial interest in limiting interoperability. The same holds true when deciding which AI tooling to adopt.
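As a rough illustration of the “hot-pluggable backplane” idea, the Python sketch below (all names are hypothetical) treats each data quality check as a small, swappable component behind a common signature, so replacing a home-grown check with a vendor tool becomes a one-line change to a registry rather than a pipeline rewrite:

```python
# Minimal sketch of hot-pluggable data quality tooling: each check is a
# callable with a common signature, registered on a shared "backplane".
# All names are illustrative assumptions.
from typing import Callable, Dict, List

QualityCheck = Callable[[dict], List[str]]  # returns a list of issues found


def check_completeness(record: dict) -> List[str]:
    return ["missing text"] if not record.get("text") else []


def check_length(record: dict) -> List[str]:
    return ["text too short"] if len(record.get("text", "")) < 20 else []


# The backplane: the pipeline iterates over this registry. Swapping in a
# third-party tool means replacing one entry here, not rewriting the pipeline.
QUALITY_CHECKS: Dict[str, QualityCheck] = {
    "completeness": check_completeness,
    "length": check_length,
}


def validate(record: dict) -> List[str]:
    issues: List[str] = []
    for name, check in QUALITY_CHECKS.items():
        issues.extend(f"{name}: {msg}" for msg in check(record))
    return issues
```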
Talent Shortages
Building and maintaining data integrations for AI and other uses has traditionally required skilled data engineers and IT operations experts. However, there is a shortage of talent in these specialized fields. Organizations generally address this by investing in upskilling existing teams, fostering collaborations with academic institutions, and leveraging third-party expertise through partnerships or outsourcing, but these actions all come with a cost. By utilizing a low-code data fabric, organizations can:
Create a force multiplier for existing data pipeline development expertise by drastically reducing both the amount of coding required for data pipeline projects and the effort required for data pipeline change management.
Offload tasks that traditionally required data engineers, such as schema changes and data masking changes, to non-coders (IT operations managers, analysts, etc.), as shown in the sketch after this list.
Maximize the efficiency of data scientists, AI engineers, and others with an order-of-magnitude reduction in the effort required for data pipeline changes.
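Here is a minimal, hypothetical Python sketch of what “low code” can look like in practice: schema mappings and masking rules live in configuration that a non-coder can edit, while the generic pipeline code stays untouched. The field names and config shape are assumptions made for illustration:

```python
# Minimal sketch of a declarative, low-code pipeline step: schema mapping and
# PII masking are configuration, not code. All names are illustrative.
import hashlib

PIPELINE_CONFIG = {
    "rename": {"CUST_EMAIL": "email", "NOTES_TXT": "text"},  # schema mapping
    "mask": ["email"],                                       # fields to mask
}


def mask_value(value: str) -> str:
    """Irreversibly pseudonymize a value before it reaches training data."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def apply_pipeline(record: dict, config: dict = PIPELINE_CONFIG) -> dict:
    out = {config["rename"].get(k, k): v for k, v in record.items()}
    for field in config["mask"]:
        if field in out and out[field] is not None:
            out[field] = mask_value(str(out[field]))
    return out


# Usage: a schema or masking change in the source system becomes a one-line
# config edit rather than a code change built and deployed by data engineers.
print(apply_pipeline({"CUST_EMAIL": "jane@example.com", "NOTES_TXT": "Renewal call"}))
```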
Vendor Lock-In
Vendor lock-in occurs when organizations become overly dependent on specific technologies or platforms for data integration. This dependency limits flexibility, hampers innovation, and restricts the ability to switch vendors. The current IT industry approach hasn’t helped either. Technology vendors often create platform stickiness (lock-in) so they can:
Sell additional products and raise prices by controlling or limiting how their platforms work with other technologies
Make it more difficult to switch vendors
In addition, vendor solutions are generally designed to solve a narrow technical problem set; organizational and business process challenges are an afterthought. These approaches work directly against interoperability and make the process of building and maintaining data pipelines for LLM training data much slower, more costly, and more likely to fail. By embracing a data fabric that features a modular architecture and the ability to connect to anything, both modern systems (APIs, cloud storage) and legacy systems (SQL databases, text/CSV files, etc.), organizations can integrate seamlessly with multiple vendors and technologies and adapt as rapidly as the AI and data landscape evolves.
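To illustrate, here is a hypothetical Python sketch of a vendor-neutral connector layer: modern and legacy sources implement the same small interface, so switching vendors means swapping a connector rather than rewriting pipelines. The class names, URL, and file path are assumptions for the sketch:

```python
# Minimal sketch of a vendor-neutral connector layer: modern and legacy
# endpoints implement the same interface. All names are illustrative.
import csv
import json
import urllib.request
from typing import Iterable, Protocol


class Connector(Protocol):
    def read_records(self) -> Iterable[dict]: ...


class RestApiConnector:
    """Modern source: a JSON API (cloud service, SaaS vendor, etc.)."""

    def __init__(self, url: str):
        self.url = url

    def read_records(self) -> Iterable[dict]:
        with urllib.request.urlopen(self.url) as resp:
            yield from json.load(resp)  # assumes the endpoint returns a JSON array


class CsvFileConnector:
    """Legacy source: flat files exported from an older system."""

    def __init__(self, path: str):
        self.path = path

    def read_records(self) -> Iterable[dict]:
        with open(self.path, newline="", encoding="utf-8") as f:
            yield from csv.DictReader(f)


def build_training_corpus(connectors: Iterable[Connector]) -> list:
    """Pipelines depend only on the Connector interface, not on any vendor."""
    return [rec for c in connectors for rec in c.read_records()]
```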
Operational & Data Silos
Operational and data silos create barriers to data integration by segregating data and inhibiting its seamless flow across departments, systems, and business units. Traditionally, organizations attempt to break down these silos by launching change initiatives that try to shift the organization toward a data-driven culture: adopting enterprise-wide data integration strategies, encouraging cross-functional collaboration, fostering data sharing, implementing centralized data repositories, and promoting data governance practices. Unfortunately, these projects are difficult to complete and usually fall short. It is important to recognize operational and data silos as operational debt and data debt, close cousins to technical debt. And just like technical debt, it is generally better to bypass operational and data debt than to remediate it. By using a data fabric, organizations create a hybrid integration layer that keeps changes in one system from affecting other systems, essentially eliminating the need to break down operational and organizational silos in order to obtain actionable training data for LLMs.
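As a simple illustration, the hypothetical Python sketch below shows a hybrid integration layer in miniature: each silo maps its own records into one canonical shape, so a schema change in one silo only touches that silo’s mapping and never ripples into the other silos or into the LLM training pipeline. The silo names and field names are assumptions for the sketch:

```python
# Minimal sketch of a hybrid integration layer: per-silo mappers translate
# into one canonical record shape. All names are illustrative assumptions.
from typing import Callable, Dict

CanonicalRecord = dict  # {"id": ..., "text": ..., "source": ...}

SILO_MAPPERS: Dict[str, Callable[[dict], CanonicalRecord]] = {
    # The sales silo uses its own field names...
    "sales": lambda r: {"id": r["OpportunityId"], "text": r["CallNotes"], "source": "sales"},
    # ...the support silo uses different ones; neither silo has to change.
    "support": lambda r: {"id": r["ticket_id"], "text": r["body"], "source": "support"},
}


def to_canonical(silo: str, record: dict) -> CanonicalRecord:
    """Downstream LLM pipelines only ever see the canonical shape."""
    return SILO_MAPPERS[silo](record)
```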
Conclusion
Using a data fabric approach can help organizations overcome data friction and improve the training process for LLMs. The approach emphasizes interoperability and reduces change management costs, mitigating technical debt and enabling smooth integration across different systems and formats. It also helps ensure compliance with data privacy, security, and governance requirements while extracting valuable insights. A data fabric addresses data quality challenges by providing a flexible foundation for selecting data tools, keeping strategies relevant and avoiding reliance on a single vendor. It helps organizations tackle talent shortages by reducing coding requirements and empowering existing teams. Moreover, it bypasses operational and data silos, allowing organizations to obtain useful training data for LLMs without major organizational changes. Overall, adopting a data fabric approach provides a comprehensive and efficient path toward successful AI and data strategies.
Please share if you like this content!
Tyler Johnson
Cofounder, CTO PrivOps