To build a data lakehouse that really hits the mark, prioritize business sponsorship over IT assumptions

If your IT department builds a data lakehouse, will business end-users come? Unfortunately, some CIOs forget they’re not working with Kevin Costner on a sequel to the movie Field of Dreams. Instead, their IT staff draws them into sponsoring an enterprise data lakehouse project. A data lakehouse combines the low operating cost of a data lake with the data management and structure features of a data warehouse on one platform.

These CIOs are genuinely shocked when almost no one wants to use the shiny new data lakehouse for business intelligence (BI) applications. They are even more astounded when the organization complains about wasted money. They expected the organization to sing their praises for an initiative to improve data integration, accessibility, and analytics.

What could possibly have gone wrong?

IT sponsorship vs. business sponsorship

When a well-intentioned CIO sponsors a data lakehouse project, the project will operate without the following:

  • Essential high-level guidance about business priorities that senior management provides.
  • Support of middle management to allocate resources to improve data quality.
  • Involvement of business analysts to understand the detailed business requirements.

A data lakehouse project dominated by IT leadership will lose momentum as development costs climb and no deliverables of value to end-users, such as reports and charts, appear. Eventually, the project is cancelled, and the reputation of the IT leadership takes a hit.

A better method is to develop BI applications with business sponsorship that IT leadership supports. This way, the focus turns to tackling specific business issues or goals rather than relying solely on IT’s assumptions about business data and needs. Stakeholders grasp the importance of the underlying data lakehouse as essential supporting infrastructure, but it doesn’t overshadow the project itself.

Technology focus vs. business benefit focus

A data lakehouse project dominated by IT staff will tend to use the latest technology to develop and operate a data lakehouse, data lake, or data warehouse. This focus occurs because the staff:

  • Is convinced the latest technology will best support robust BI applications.
  • Typically builds robust custom applications with extensive data validation, operational features, security and backup/recovery included.
  • Enjoys exploring the newest technology.
  • Is building their resumes in anticipation of a call from a headhunter.

A dramatically cheaper approach to building BI applications is to leave as much of the data as possible in the operational data stores (ODSs) where it resides. Only copy and transform data to a data lakehouse if the ODS structure is seriously unworkable in a BI context. This approach leaves more project budget to develop BI reports and charts that deliver the needed business benefits.
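As a minimal sketch of querying data where it resides, the example below defines a reporting view over an operational table instead of copying rows into a separate store. The table `orders`, the view `bi_sales_by_region`, and the sample rows are hypothetical, and an in-memory SQLite database stands in for a real ODS.

```python
import sqlite3

# Hypothetical operational table; names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        order_id INTEGER PRIMARY KEY,
        region   TEXT NOT NULL,
        amount   REAL NOT NULL
    );
    INSERT INTO orders (region, amount) VALUES
        ('West', 1200.0), ('West', 800.0), ('East', 500.0);

    -- The view reads the operational table in place; no rows are copied.
    CREATE VIEW bi_sales_by_region AS
        SELECT region, SUM(amount) AS total_sales, COUNT(*) AS order_count
        FROM orders
        GROUP BY region;
""")


def sales_by_region(connection):
    """Return (region, total_sales, order_count) rows from the BI view."""
    cursor = connection.execute(
        "SELECT region, total_sales, order_count "
        "FROM bi_sales_by_region ORDER BY region"
    )
    return cursor.fetchall()


report = sales_by_region(conn)
```

Because the view is evaluated at query time, a report built on it reflects new operational rows without a load step. Whether this performs acceptably on a production ODS is a separate question that the sketch does not answer.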

Simple data sources vs. valuable data sources

A data lakehouse project dominated by IT staff will tend to import simple internal data sources into the data lakehouse because the development effort is low. Also, the IT staff is typically unaware of useful external data sources.

A superior approach to building BI applications is to collaborate with business analysts to rank data sources in decreasing order of business value. Then, the team can add the internal or external data sources to the BI environment one at a time, each as a new release. Only add another data source once most of the previous release’s BI reports and charts have been completed. This approach minimizes time to value, ensures the most business value is achieved, and maintains stakeholder support for the BI project.
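The ranking and release gating described above can be sketched as follows. The `DataSource` class, the value scores, the source names, and the 75% reading of "most" are all assumptions made for illustration; in practice the scores would come from the business analysts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSource:
    name: str
    business_value: int  # score agreed with business analysts; higher is better
    external: bool = False


def plan_releases(sources):
    """Order sources by decreasing business value, one source per release."""
    ranked = sorted(sources, key=lambda s: s.business_value, reverse=True)
    return [(release, s.name) for release, s in enumerate(ranked, start=1)]


def ready_for_next_release(reports_done, reports_planned, threshold=0.75):
    """True once 'most' planned reports are complete (threshold is assumed)."""
    if reports_planned == 0:
        return True
    return reports_done / reports_planned >= threshold


# Hypothetical candidate sources, internal and external.
candidates = [
    DataSource("general ledger", 6),
    DataSource("sales orders", 9),
    DataSource("market price feed", 8, external=True),
]
release_plan = plan_releases(candidates)
```

With these sample scores, the sales orders source ships first and the external price feed second, ahead of a lower-value internal source that might have been easier to load.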

Elaborate architecture vs. minimal architecture

When IT architects dominate, they will design a data lakehouse using an idealized framework. The resulting architecture is often too elaborate to understand easily, challenging to load and expensive to maintain.

A superior approach to architecting a data lakehouse environment is to balance trade-offs among the following design goals carefully:

  • Query performance.
  • Minimizing the amount of data copied and transformed from ODSs.
  • Query development complexity.
  • Data lakehouse load complexity.
  • Operating and maintenance costs.

Every design idea that improves query performance, even if it adds complexity to the data lakehouse load, is worth implementing. Allowing idealized frameworks, though widely admired, to dominate the design is always a bad idea.

Data quantity vs. data quality

A data lakehouse project sponsored by the CIO will gravitate toward data quantity for the data lakehouse because the team doesn’t know which data sources are most helpful.

However, this quantity approach is blind to data quality issues. These issues will slow or inhibit the:

  • Acceptance of the data lakehouse as a functional BI environment.
  • Development of enterprise and departmental BI applications.

Poor data quality first manifests itself through these IT technical issues:

  • Hindering the integration of data from multiple sources.
  • Creating summation errors.
  • Causing software crashes.
  • Causing system performance problems.

Then poor data quality leads to these business issues:

  • A lack of confidence in reports and charts.
  • Uninformed or misinformed decision-making that adds risk.
  • Inaccurate problem analysis that adds cost.
  • Poor customer relationships that reduce sales and market share.
  • Disappointing product launches that slow growth.

A data lakehouse project sponsored by the CIO has no clout with the business to address data quality issues. The project will fail because the end-user-visible deliverables are sparse and not helpful.

A superior approach to building BI applications is to:

  • Prioritize data sources for inclusion in the BI project based on business value.
  • Expect data quality issues and allocate business resources to improve data quality.
  • Assess data sources for data quality issues. To reduce time to value, fix easy data problems first.

This approach ensures that the BI reports and charts are accurate and will build confidence in the BI applications.
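A first-pass assessment of a data source can be as simple as counting obvious problems. The sketch below counts duplicate keys and missing required values; the function `profile_records`, the field names, and the vendor rows are hypothetical, and a real assessment would cover many more checks.

```python
from collections import Counter


def profile_records(records, key_field, required_fields):
    """Count simple data quality issues in a list of dict records."""
    issues = Counter()
    seen_keys = set()
    for record in records:
        key = record.get(key_field)
        if key in seen_keys:
            issues["duplicate_key"] += 1
        seen_keys.add(key)
        for field in required_fields:
            value = record.get(field)
            # Treat None and blank strings as missing values.
            if value is None or str(value).strip() == "":
                issues[f"missing_{field}"] += 1
    return dict(issues)


# Hypothetical vendor master rows with typical problems.
vendors = [
    {"vendor_id": "V001", "name": "Acme Ltd", "country": "CA"},
    {"vendor_id": "V002", "name": "", "country": "CA"},
    {"vendor_id": "V001", "name": "Acme Limited", "country": None},
]
vendor_issues = profile_records(vendors, "vendor_id", ["name", "country"])
```

Counts like these give the business sponsors a concrete basis for deciding which problems are easy enough to fix first.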

Data inconsistencies vs. data standards

Data inconsistencies make integration difficult, complicate query development and slow query performance. Inconsistencies can occur in reference, master and transaction data. For example:

  1. Incompatible identifiers for key data such as vendor or product across IT systems.
  2. Variations on $1000 exist, such as 1,000, 1000 CDN, CDN 1000, 1000.00 or “one thousand dollars.”
  3. Variations are found in units of measure abbreviations such as kg, Kg, kilogr, and KG.
  4. Numbers are not left zero-filled.
  5. Text is right justified as opposed to left justified.
  6. Multiple date formats are used.
  7. The letter O is used instead of zero.
  8. Incorrect conversions between EBCDIC and ASCII are evident.

A data lakehouse project may ignore this issue because it complicates the ETL software that integrates data from diverse data sources. However, the result is that end-users cannot use the data lakehouse. The organization is better served by the CIO championing the setting of data standards.
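Several of the inconsistencies listed above could be handled by small normalization functions in the ETL step, as in the sketch below. The function names, the unit alias table, and the accepted date formats are assumptions, not a standard. Spelled-out amounts such as “one thousand dollars” are deliberately rejected rather than guessed at.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

# Assumed alias table; a real data standard would define the full list.
UNIT_ALIASES = {"kg": "kg", "kilogr": "kg"}
# Assumed formats; day-first is a choice that the data standard must settle.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_amount(text: str) -> Decimal:
    """Parse '$1000', '1,000', '1000 CDN', 'CDN 1000' or '1000.00'."""
    cleaned = re.sub(r"[$,\s]|CDN", "", text)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"unrecognized amount: {text!r}") from None


def normalize_unit(text: str) -> str:
    """Map unit spellings such as 'Kg', 'KG' and 'kilogr' to 'kg'."""
    key = text.strip().lower()
    if key not in UNIT_ALIASES:
        raise ValueError(f"unknown unit: {text!r}")
    return UNIT_ALIASES[key]


def zero_fill(value: str, width: int) -> str:
    """Left zero-fill a numeric identifier to a fixed width."""
    return value.strip().zfill(width)


def fix_letter_o(value: str) -> str:
    """Replace the letter O with zero; only for fields known to be numeric."""
    return value.replace("O", "0").replace("o", "0")


def parse_date(text: str) -> date:
    """Try each assumed format in turn and return the first match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")
```

Raising an error on unrecognized input, rather than passing it through, surfaces the rows that need a business decision about the applicable data standard.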

To ensure that business end-users will come, CIOs should quit listening to their ambitious techies and champion building BI applications with business sponsorship supported by IT leadership.

Yogi Schulz has over 40 years of information technology experience in various industries. Yogi works extensively in the petroleum industry. He manages projects that arise from changes in business requirements, the need to leverage technology opportunities, and mergers. His specialties include IT strategy, web strategy and project management.


The opinions expressed by our columnists and contributors are theirs alone and do not inherently or expressly reflect the views of our publication.

© Troy Media
Troy Media is an editorial content provider to media outlets and its own hosted community news outlets across Canada.