How I Built a Data Lakehouse With Delta Lake Architecture

Data Engineer Explains the Data Lakehouse Architecture

Nicholas Leong


Image by Author

“Mark my words, AI is far more dangerous than nukes.” — Elon Musk

Data runs the world now, and I write and talk about it all over my profile.

As data evolves, businesses are looking for ways to use it better. The launch of ChatGPT made many of them realize the potential of AI, and some began to wonder whether they could do something similar with their own data.

What many of them don't know is that GPT-1, the first model in the GPT family that ChatGPT is built on, was introduced back in June 2018. According to the GPT-1 paper, it had a whopping 56% accuracy score. It was not looking good at the time, but look how the tables have turned.

Screenshot by Author

The point I’m trying to make is that people ignore the work behind the scenes of GPT. One does not simply create a Large Language Model without a wealth of diverse, rich data to train it on.

Without data, there wouldn’t be a ChatGPT.

With high volumes of data come several uphill battles. As a Data Engineer, I can think of a few right off the bat.

  • Data Mining
  • Data Storage
  • Data Processing

Data Storage and Compute

ChatGPT is trained on data from the internet — Data Mining.

Diagram by Author

If you were to collect data from the internet, you might end up with a dataset so large that it exceeds the RAM or disk capacity of any single machine. Performing even basic searches on such a large dataset could consume significant computational resources. Two main costs of such operations are:

  • Storage Cost: the cost of storing huge amounts of data.
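
To make the storage point concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the function names, the 50 TB dataset, the 64 GB of RAM, and the price of $0.02 per GB per month are all hypothetical numbers I picked for the example, not real vendor pricing.

```python
def monthly_storage_cost(dataset_tb: float, price_per_gb_month: float) -> float:
    """Estimate the monthly bill for storing `dataset_tb` terabytes.

    `price_per_gb_month` is a hypothetical flat rate; real pricing is
    usually tiered and varies by provider and region.
    """
    dataset_gb = dataset_tb * 1024  # 1 TB = 1024 GB (binary convention)
    return dataset_gb * price_per_gb_month


def fits_in_memory(dataset_tb: float, ram_gb: float) -> bool:
    """Check whether the whole dataset could be loaded into RAM at once."""
    return dataset_tb * 1024 <= ram_gb


if __name__ == "__main__":
    dataset_tb = 50.0   # hypothetical size of a scraped dataset
    ram_gb = 64.0       # hypothetical RAM on a single machine
    price = 0.02        # hypothetical USD per GB per month

    cost = monthly_storage_cost(dataset_tb, price)
    print(f"Estimated storage cost: ${cost:,.2f} per month")
    print(f"Fits in RAM on one machine: {fits_in_memory(dataset_tb, ram_gb)}")
```

Even with made-up numbers, the shape of the problem is clear: the dataset is hundreds of times larger than one machine's memory, and the storage bill keeps growing with every terabyte you keep.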


