Organizations create more data than ever before. Newer, faster technologies and storage systems have risen to the challenge, but the storage elements of Artificial Intelligence can be complex.
AI data often requires high-performance, scalable storage and long retention periods. Organizations must find cost-effective storage systems to protect, manage and analyze large amounts of data. To ensure short- and long-term success, it's crucial for organizations to assess their storage and data management needs throughout an AI project.
Chinmay Arankalle, author and data engineer, has spent years working on Big Data systems. In his recent book, The Artificial Intelligence Infrastructure Workshop, Arankalle and his co-authors discuss the complexities of AI and how organizations can navigate them. The book discusses data center architecture for AI workloads, including machine learning and large data sets.
In this Q&A, Energy Exemplar Senior Data Engineer Arankalle discusses some of the factors organizations should consider in their AI storage plans. These factors include cost management, priority setting and scalability.
Editor's note: This transcript has been edited for length and clarity.
What do you think are some of the common storage and compute challenges we're facing in this Big Data era?
Chinmay Arankalle: The data field is growing at a high speed. One of the main challenges we see in front of us [is] the various kinds of data we can come across. Currently, we divide data into three categories: structured, semistructured and unstructured data. We follow ELT [extract, load, transform] in these types of high-volume data stores.
The end goal is not fixed, usually. Maybe now, the data we have has some use, [but] after 10 years, the data might have some different use altogether. Depending on the usage, we decide what format the data should be stored in. For example, if you want to query the data, the obvious choice would be a columnar data format, like Parquet, which supports ad hoc querying. It's supported by different parallel processing frameworks, like Apache Spark. But, in the future, new use cases might come up. As time progresses, the need for the data could change [for new AI models]. And that will give birth to new formats of the data.
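To illustrate why a columnar layout suits ad hoc queries, here is a minimal sketch in plain Python. It is not Parquet or Spark code; the sample records, field names and helper functions are all hypothetical. It only shows the general idea that a query over a column-oriented layout can read the columns it needs and skip the rest.

```python
from typing import Any

# Hypothetical sample records; the field names are illustrative only.
rows = [
    {"customer_id": 1, "region": "EU", "amount": 120.0},
    {"customer_id": 2, "region": "US", "amount": 75.5},
    {"customer_id": 3, "region": "EU", "amount": 30.0},
]


def to_columns(records: list[dict[str, Any]]) -> dict[str, list[Any]]:
    """Pivot row-oriented records into one list per column."""
    columns: dict[str, list[Any]] = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_amount_for_region(columns: dict[str, list[Any]], region: str) -> float:
    """Ad hoc aggregate that reads only the 'region' and 'amount' columns."""
    return sum(
        amount
        for r, amount in zip(columns["region"], columns["amount"])
        if r == region
    )


columns = to_columns(rows)
eu_total = total_amount_for_region(columns, "EU")
```

The `customer_id` column is never touched by the query. Real columnar formats add compression, encoding and per-column statistics on top of this basic idea.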
The second challenge ahead of us is how we can utilize the older formats along with the newer ones. Since we can't suddenly get rid of the old data, there should be some harmony between older data and the new data.
The third challenge is mostly about availability of the data. For example, if we have stored the data in a partitioned format, and we would really like to have subsecond latency for a particular snapshot of the data, it becomes quite difficult if the partitioning strategy is not appropriate.
Similarly, the fourth challenge in front of us is using retention features. Storage nodes have some cost behind them. We need to make some segregation and push down the required data to archives. The main part here is the retention of the data. It's very, very difficult to keep track of stored data and delete a particular customer's data. That is where retention comes into the picture.
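The partitioning and retention points can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not a real storage API: the in-memory `partitions` and `archive` dictionaries, the date-based partition key and all function names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical store: partition key (snapshot date) -> records in that partition.
partitions: dict[date, list[dict]] = {
    date(2024, 1, 1): [{"customer_id": 1, "amount": 10.0}],
    date(2024, 1, 2): [{"customer_id": 2, "amount": 20.0}],
    date(2024, 1, 3): [{"customer_id": 1, "amount": 30.0}],
}
archive: dict[date, list[dict]] = {}


def read_snapshot(day: date) -> list[dict]:
    """Look up one snapshot by its partition key; no other partition is scanned."""
    return partitions.get(day, [])


def apply_retention(today: date, keep_days: int) -> list[date]:
    """Move partitions older than the retention window into the archive."""
    cutoff = today - timedelta(days=keep_days)
    expired = sorted(day for day in partitions if day < cutoff)
    for day in expired:
        archive[day] = partitions.pop(day)
    return expired


def delete_customer(customer_id: int) -> int:
    """Delete one customer's records; this has to visit every partition and the archive."""
    removed = 0
    for store in (partitions, archive):
        for day, records in store.items():
            kept = [r for r in records if r["customer_id"] != customer_id]
            removed += len(records) - len(kept)
            store[day] = kept
    return removed


snapshot = read_snapshot(date(2024, 1, 2))
moved = apply_retention(today=date(2024, 1, 4), keep_days=2)
removed = delete_customer(1)
```

Reading a snapshot is a single key lookup because the data is partitioned by snapshot date. Deleting one customer's data is the expensive case: the customer is not part of the partition key, so every partition and the archive must be scanned.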
The last piece is quality of the data. Usually, in these types of storage, the quality is undermined when we load the data. We load the data as is; this is the practice I have seen, and it should be avoided.
What's the significance of data lakes and data lakehouses compared to data warehouses in AI?
Arankalle: Data warehousing is a 25- or 30-year-old concept where we basically store data in one place and it is cross-referenced everywhere. If we must update that piece of data, then we only need to update it once in that single place; it will be cross-referenced.
But, soon, the data warehouse concept began to fall short for new latency requirements when data grew and new data formats came into existence.