Setting up a data lake the easy way with Cloud Pak for Data

Governed access to data and information architecture provided where you want it to be: on premises or on any cloud

Wilfried Hoge
2 min read · Mar 25, 2020

Customers invest in Artificial Intelligence (AI) and Machine Learning (ML) because they see the potential to improve business processes with them. However, infusing AI and ML into those processes successfully has two important prerequisites: data and an information architecture.

Data Lake

Implementing AI and ML essentially means training models on known data and then using those models to gain insights from new data. Access to known data is therefore essential to bring AI and ML to life. The most prominent concept in the industry for providing data for model building is the Data Lake. While most people equate a Data Lake with Hadoop, the IBM Data and AI team sees it differently: as a concept for getting easy access to data. This might be a Hadoop environment, but it could also be a set of repositories built on various technologies such as object storage, relational databases, and NoSQL stores. It has always been important that moving data to a central store is an option, not a necessity. Giving Data Scientists and Analysts access to data is not sufficient on its own; a level of control is also needed. When the Data Lake is combined with the right level of control, we speak of a Governed Data Lake.
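To make the idea of a Governed Data Lake a bit more concrete, here is a minimal Python sketch. It is not the Cloud Pak for Data API. The names `Dataset`, `GovernedCatalog`, `AccessDenied`, the repository labels, and the roles are all hypothetical and chosen only for illustration. The sketch shows the two points made above: the data stays in its original repository, and every access goes through a control layer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    """A catalog entry; the data itself stays in its source repository."""
    name: str
    repository: str            # hypothetical label, e.g. "object-storage"
    allowed_roles: frozenset   # roles that are permitted to read the data


class AccessDenied(Exception):
    """Raised when a role is not permitted to read a dataset."""


class GovernedCatalog:
    """Hypothetical catalog: registers datasets and enforces access rules."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        # Only metadata is stored here; no data is copied or moved.
        self._datasets[dataset.name] = dataset

    def locate(self, name, role):
        # Return the repository holding the data, if the role may read it.
        dataset = self._datasets[name]
        if role not in dataset.allowed_roles:
            raise AccessDenied(f"role {role!r} may not read {name!r}")
        return dataset.repository


catalog = GovernedCatalog()
catalog.register(
    Dataset("clickstream", "object-storage", frozenset({"data-scientist"}))
)
catalog.register(
    Dataset("customers", "relational-db", frozenset({"data-scientist", "analyst"}))
)
```

In this sketch an analyst can locate `customers` in the relational database, while a request for `clickstream` with the same role raises `AccessDenied`. A real platform would add much more (data protection rules, lineage, query federation), which is the integration work the next section refers to.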

Cloud Pak for Data

The information architecture (IA) for implementing AI and ML is as important as access to the data. Our motto, "There’s no AI without IA", expresses this quite well. But building a suitable information architecture from scratch is a complex task: several components are necessary, and they have to be integrated to give users an end-to-end experience. Cloud Pak for Data is a platform that implements an information architecture out of the box. The services a customer needs to bring AI and ML to life are pre-integrated, and setting them up can be done in a few days. Services in Cloud Pak for Data such as Watson Knowledge Catalog, Data Virtualization, and Watson Studio make it easier than ever to create a Governed Data Lake.

Where to deploy

Cloud Pak for Data is a software solution based on Red Hat OpenShift. It provides the kind of integrated experience you find in public cloud environments, with the freedom to deploy it where you want it to be: in your data center today, at cloud provider A tomorrow, and at cloud provider B next year. Learn more at the Cloud Pak for Data homepage.



Wilfried Hoge

Analytics Architect at @IBM. Member of @D64eV, interested in Data Science, Data Lake, IoT and Machine Learning. My views are my own.