LLMs: An Assessment From a Data Engineer

27 Jul 2024

Introduction:

Ever since the buzz around GenAI and ChatGPT began, I have been trying to understand the space and fold it into my day-to-day work as a Data Engineer. It has undoubtedly been useful for debugging, programming questions, and documentation. Personally, it played a significant role in helping me plan my Dubai itinerary, and it even contributed to naming my friend's daughter. However, it works well only if you give it appropriate, precise requirements.

There is a lot of anxiety that AI is going to take our jobs, since people feel almost everything will be automated. Are LLMs really as good as everyone claims?

In this article, we will look into the specifics of Gen AI’s role in data engineering and see where it flourishes, where it requires enhancement, and where human expertise remains irreplaceable.

Photo by Scott Graham on Unsplash

Areas of Strength:

Basic Data Querying:

Human involvement here will certainly be disrupted, though how severely depends on the complexity of the queries. For straightforward requests, even SQL practitioners won't notice much difference between their own queries and an LLM's, and self-serve analytics will catch fire in this area, which is phenomenal. To put it positively, data engineers won't need to hand-hold stakeholders nearly as much.
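As a rough illustration, here is a minimal text-to-SQL sketch. It assumes the OpenAI Python SDK; the model name, schema, and question are placeholders for whatever your warehouse and stakeholders actually use.

```python
# Minimal text-to-SQL sketch (assumes the OpenAI Python SDK; model name is illustrative).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical schema description a data engineer would supply once.
schema = """
Table orders(order_id INT, customer_id INT, order_date DATE, amount DECIMAL)
Table customers(customer_id INT, region VARCHAR, signup_date DATE)
"""

question = "Total order amount per region for July 2024"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You write ANSI SQL for the given schema. Return SQL only."},
        {"role": "user", "content": f"Schema:\n{schema}\nQuestion: {question}"},
    ],
)

print(response.choices[0].message.content)  # review before running against the warehouse
```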

Troubleshooting Pipeline Failures:

LLMs coupled with agents will slash on-call load massively. An LLM that reads stack traces or data-quality failure messages can recommend known remediations and move the issue toward resolution, potentially right from Slack. That is a substantial share of data engineering hours saved. To be clear, some hard failures will still require manual troubleshooting.
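To make the idea concrete, here is a hedged sketch: summarize a failure's stack trace with an LLM and post the suggestion to a Slack channel. The webhook URL, model name, and stack trace are placeholders, not a production on-call setup.

```python
# Sketch: ask an LLM to triage a pipeline failure and post the suggestion to Slack.
# Assumes the OpenAI Python SDK and a Slack incoming-webhook URL (placeholder below).
import requests
from openai import OpenAI

client = OpenAI()

stack_trace = """
py4j.protocol.Py4JJavaError: An error occurred while calling o123.parquet.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure ...
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are an on-call assistant for data pipelines. "
                                      "Suggest the most likely cause and a first remediation step."},
        {"role": "user", "content": stack_trace},
    ],
)
suggestion = response.choices[0].message.content

# Post the triage note to the on-call channel (placeholder webhook URL).
requests.post(
    "https://hooks.slack.com/services/XXX/YYY/ZZZ",
    json={"text": f"Pipeline failure triage:\n{suggestion}"},
    timeout=10,
)
```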

Anomaly Detection:

Large language models can flag anomalies across data of varying structure, format, source, complexity, and timeliness. They are well suited to anomaly detection because they can work across a variety of data sources, identify subtle patterns, and recognize when data isn't behaving as expected. This helps data engineers identify and fix data quality problems proactively.
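One lightweight way to try this is to hand the model daily summary statistics and ask it to flag anything unusual. The sketch below assumes the OpenAI Python SDK; the metrics are made up.

```python
# Sketch: flag anomalies in daily pipeline metrics with an LLM.
# Assumes the OpenAI Python SDK; the metrics and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical daily row counts for a table; the last day looks suspicious.
daily_row_counts = {
    "2024-07-20": 1_020_334,
    "2024-07-21": 1_015_887,
    "2024-07-22": 1_019_402,
    "2024-07-23": 12_431,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You review data-quality metrics and flag anomalies "
                                      "with a one-line explanation for each."},
        {"role": "user", "content": json.dumps(daily_row_counts)},
    ],
)
print(response.choices[0].message.content)
```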

Areas for Enhancement:

Data Cleansing:

Large Language Models (LLMs) are highly effective at data cleansing because they handle unstructured data well, can identify and correct semantic errors, and scale easily. The current state of the art also suggests substantial further improvements are possible, including fine-tuning LLMs on a specific domain or an organization's own data.

In addition, interactive review mechanisms and coupling with data quality frameworks can significantly increase the effectiveness of LLMs here. Overall, LLMs help organize and improve the entire data cleansing process, producing higher-quality datasets and, consequently, more valuable insights.
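As a rough illustration of semantic cleansing, the sketch below asks the model to standardize messy customer records. The records, target formats, and model name are all assumptions for the example.

```python
# Sketch: semantic data cleansing with an LLM.
# Assumes the OpenAI Python SDK; records, fields, and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()

dirty_records = [
    {"name": "jOHN  smith", "country": "U.S.A", "phone": "001-555 0199"},
    {"name": "María Gárcia ", "country": "espana", "phone": "+34.600.123.456"},
]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Standardize each record: title-case names, ISO 3166 "
                                      "country codes, E.164 phone numbers. Return JSON only."},
        {"role": "user", "content": json.dumps(dirty_records, ensure_ascii=False)},
    ],
)
print(response.choices[0].message.content)  # validate before loading downstream
```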

Visualization and Report Generation:

Large Language Models (LLMs) do well at visualization and documentation tasks by automating report creation, summarizing data insights, and generating narrative-driven reports. Enhancements still needed include better document organization and categorization, stronger data-driven storytelling, integration with version control systems and visualization tools for seamless output generation, and better interpretability so reports stay clear and coherent.
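For example, a narrative summary can be drafted straight from aggregated figures. The sketch below assumes the OpenAI Python SDK; the metrics are invented.

```python
# Sketch: draft a narrative report from aggregated metrics with an LLM.
# Assumes the OpenAI Python SDK; the figures and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()

weekly_summary = {
    "revenue_wow_change_pct": -4.2,
    "new_customers": 318,
    "pipeline_failures": 2,
    "slowest_dashboard": "exec_kpi_overview",
}

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Write a short stakeholder-facing summary of these weekly "
                                      "data metrics. Plain language, three sentences maximum."},
        {"role": "user", "content": json.dumps(weekly_summary)},
    ],
)
print(response.choices[0].message.content)
```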

An LLM can be a helpful starting point and produces decent boilerplate for this kind of work. Still, the generated material is hard to debug, and often it is not worth the trouble. Prompt-driven generation is still far from turning out a clean, optimized, fully accurate data pipeline.

Optimization:

LLMs are especially powerful at SQL optimization, suggesting more efficient query structures and detecting problems that slow performance. They can even recommend optimizations based on historical query data. Areas for improvement include incorporating database-specific optimization techniques into LLM training, connecting LLMs to query execution plans so their advice is context-aware, and adding feedback mechanisms so optimization strategies adapt to user preferences and observed performance metrics.
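A simple way to try this today is to hand the model both the query and its execution plan. The sketch below assumes PostgreSQL-style EXPLAIN output and the OpenAI Python SDK, with placeholder SQL.

```python
# Sketch: ask an LLM for query-tuning suggestions given the SQL and its EXPLAIN plan.
# Assumes the OpenAI Python SDK; the query, plan, and model name are illustrative.
from openai import OpenAI

client = OpenAI()

slow_query = """
SELECT c.region, SUM(o.amount)
FROM orders o JOIN customers c ON o.customer_id = c.customer_id
WHERE o.order_date >= '2024-01-01'
GROUP BY c.region;
"""

# Placeholder for the output of EXPLAIN (ANALYZE) on the query above.
explain_plan = "Seq Scan on orders  (cost=0.00..184220.00 rows=5000000) ..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "You are a SQL tuning assistant. Suggest indexes or "
                                      "rewrites, and explain the expected effect."},
        {"role": "user", "content": f"Query:\n{slow_query}\nPlan:\n{explain_plan}"},
    ],
)
print(response.choices[0].message.content)
```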

Physical Data Model:

Once the logical model is finalized, an LLM can efficiently translate that logical (or conceptual) data model into a physical model for the target architecture. The sticking point is context: it is hard to convey to the model the full picture of the pipeline, the tables, and the relationships between them.
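As one way to supply that context, the sketch below feeds the model a short logical-model description and asks for warehouse DDL. The entities, target platform, and model name are assumptions for illustration.

```python
# Sketch: translate a logical model description into physical DDL with an LLM.
# Assumes the OpenAI Python SDK; entities, target platform, and model name are illustrative.
from openai import OpenAI

client = OpenAI()

logical_model = """
Entity Customer(customer_id, name, region) has a 1-to-many relationship with
Order(order_id, customer_id, order_date, amount).
Orders arrive daily and are queried mostly by order_date and region.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": "Generate Snowflake DDL for this logical model, including "
                                      "keys and a clustering suggestion. Return SQL only."},
        {"role": "user", "content": logical_model},
    ],
)
print(response.choices[0].message.content)  # review with the team before applying
```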

Areas Where Human Expertise Remains Irreplaceable:

Logical Data Model:

Data modeling is a very soft-skills-focused task. You have to communicate with different stakeholders to understand the business context in detail. Working out what data is needed, how entities relate, and which scenarios to account for are human-oriented tasks that LLMs will continue to struggle with.

Building Pipelines and Frameworks:

Once they reach a certain level of complexity, LLMs begin to give up. Imagine asking ChatGPT, "Can you provide me the source code of Android's next version?" While LLMs can help with routine data engineering tasks, they lack the uniquely human insight and inventiveness needed to devise new solutions to complex technical challenges.

Domain Expertise and Intuition:

Human data engineers bring a unique blend of domain expertise and intuition to their work that conventional software and automation often cannot match. This intuition lets them spot subtle patterns in data that automated systems frequently miss.

Based on their knowledge of this field and their experiences in it, they can predict future trends. They can also make decisions in light of a broader context, including corporate or government needs, ethical considerations, and potential future scenarios. This holistic approach to data engineering surpasses what LLMs have achieved. Although LLMs have great power, they are typically restricted to the patterns and relationships they have been trained to recognize.

Conclusion:

On the whole, Gen AI can certainly automate routine procedures such as basic data querying, troubleshooting failing pipelines, applying repetitive fixes, and detecting anomalies. My gut tells me Gen AI will also come to excel at tasks such as Data Cleansing, Optimization, and Writing Pipelines, once teams understand its gaps and missing patterns.

But it cannot learn the purely human soft skills that matter for building logical data models or frameworks. I believe these skills, which cannot be automated, are also what lead stakeholders to ask for innovation in the first place. After all, no one asked Steve Jobs for anything like an iPhone; he imagined it before anyone knew they wanted it.

References:

https://barrmoses.medium.com/will-genai-replace-data-engineers-no-and-heres-why-708b0a27da6b

https://www.linkedin.com/pulse/navigating-data-governance-challenges-era-large-fnq5e

https://www.dataversity.net/navigating-the-risks-of-llm-ai-tools-for-data-governance/

https://www.skyflow.com/post/why-ai-governance-requires-data-governance

https://coreykeyser.medium.com/you-cant-govern-ai-without-governing-data-27bb88d8f9ea

https://www.ankursnewsletter.com/p/gpt-4-gpt-3-and-gpt-35-turbo-a-review

https://towardsdatascience.com/from-chaos-to-clarity-streamlining-data-cleansing-using-large-language-models-a539fa0b2d90

https://medium.com/@wangshally11/data-cleaning-warm-up-before-training-large-language-models-74ed1edceead