Among the most common components of modern data architecture is the use of a data lake, which is a location where data flows in to serve as a central repository.
The concept of the data lake has evolved from being just a location for data collection to a more organized approach known as a data lakehouse. Whether it's called a data lake or a data lakehouse, there is a need for certain skills and IT professionals to effectively manage the technology.
A data lake is a large open storage location that typically uses object storage as a unified repository for unstructured data coming from multiple sources. Those sources can include event streaming data, operational and transactions data and databases. While data lakes can be in on-premises environments, they are more commonly created with cloud object storage services that enable large scalable data capacity, such as Amazon Simple Storage Service (S3), Google Cloud Storage or Microsoft Azure Data Lake Storage. Data lakes first emerged to help enable big data workloads with the Apache Hadoop big data platform. A data lake architecture differs from a data warehouse in that warehouse data is transformed into a format that provides structured data and organization. A data warehouse enables users to more easily query the data and use it for data analytics and business intelligence use cases. Data warehouses also provide data governance and data management capabilities. The concept of the data lakehouse -- first coined by Databricks -- is an attempt to bring together the best of data lakes and data warehouse technologies. A data lakehouse aims to combine the ease of use and open nature of a data lake with the data warehouse's ability to easily execute queries against data. A data lakehouse provides additional structure on top of a data lake -- often with the use of a data lake table format technology, such as Delta Lake, Apache Iceberg and Apache Hudi. It also uses a query engine technology, such as Apache Spark, Presto and Trino.
Managing data within an organization can be a multi-stakeholder effort. It can involve different job roles depending on the particular use case. Data warehouses are often managed by data warehouse managers and data warehouse analysts. Those two roles involve data management and data analytics skills, which are typically tied to a specific data warehouse vendor technology. Data lake management is often the domain of data engineers, who help design, build and maintain the data pipelines that bring data into data lakes. With data lakehouses, there can often be multiple stakeholders for management in addition to data engineers, including data scientists. Business analysts also fit into the management mix. They take responsibility to ensure data quality and metadata are properly managed to support business objectives. As organizations begin to shift from data warehouse to data lake architectures, there is some overlap between the people who manage data warehouses and those who manage data lakes. Data still needs to come from multiple sources, it still needs to be governed and there is the same need for analytics so the data can be used effectively. At the executive level -- whether it's a data warehouse, data lake or data lakehouse -- a chief data officer is often the job role that is tasked with the top level of responsibility for all data use.
There are a variety of paths toward certifications for those looking to verify their skills for managing data lakes. A modern data lake or lakehouse deployment often uses cloud resources and tools from a specific vendor to enable data management and data queries. Managing a data lake is not an abstract idea. It's a hands-on effort that can benefit from specific certifications. The leading cloud and data lake vendors all have some form of training and available certification. Amazon Web Services and its S3 cloud object storage service are commonly used to enable data lakes. This certification is geared toward people with experience working with AWS services. It validates skills in using AWS data lakes and analytics services. This Google certification provides an examination that verifies skills to build, deploy and use data models that benefit from data lake and analytics services running in Google Cloud. Microsoft's Azure Data Lake Storage Gen2 is a popular option for building data lakes. With this certification, Microsoft provides validation for those looking to use Microsoft services for data lakes. This certification helps users learn and validate data lakehouse skills on the Databricks platform. This tool integrates multiple open source technologies, including Apache Spark and Delta Lake. Cloudera is often associated with the open source Hadoop big data technology, which is one of the originators of the data lake concept. This certification will validate skills required to ingest, transform, store and analyze data in the Cloudera environment. This certification is designed to help organizations that are updating from a data warehouse to a cloud data lake or data lakehouse mode. Dremio, a company that builds data lakehouse technology, has expanded its training options with Dremio University, which provides certificates of completion.