29 Jan What on earth is a Data Lake and do I need one?
The wonderful thing about being in technology is the rollercoaster of the hype cycle. Disruptive technologies burst onto the scene with catchy buzzwords, then comes the hype that sucks in billions of clicks, and then the deep trench of real-world failures, until a few startup unicorns rise from the depths with solutions we can actually pitch. As names go, Data Lake is oddly mesmerizing and serene. But it is a hardcore, real-life technology trend that has us excited. We take pride in pitching the latest technologies to clients to help them achieve their business objectives, and we have been acquiring expertise in Data Lakes.
A Data Lake is a data repository. That’s perhaps its greatest similarity to a Data Warehouse, with which it is often confused. Unlike a Data Warehouse, which holds structured data, a Data Lake holds all the data you have in its original native form, even if it is unstructured. Data structure and requirements are not defined until the data is queried. This allows the use of low-cost storage, as opposed to Data Warehousing, whose hidden storage costs are often a pain for the CIOs who invest in it. You don’t have to hunt for data across departments, as is the case in organizations where data is siloed. It’s all in one place.
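To make the idea of "structure is not defined until the data is queried" a little more concrete, here is a minimal sketch in Python. It is an illustration only: the record fields, the `RAW_LAKE` list and the function names are all made up for this example, and a real Data Lake would sit on object or distributed storage behind a query engine rather than in an in-memory list. The point is simply that records land in their raw form and a schema is applied only when a question is asked.

```python
import json

# Hypothetical raw records, kept exactly as they arrived (no upfront schema).
# In a real Data Lake these would be files on low-cost storage.
RAW_LAKE = [
    '{"type": "order", "customer": "acme", "amount": "120.50", "region": "EU"}',
    '{"type": "tweet", "user": "acme", "text": "Loving the new release!"}',
    '{"type": "order", "customer": "globex", "amount": "75", "region": "US"}',
]


def query_orders(raw_records):
    """Apply an 'order' schema at read time and skip records that don't fit."""
    rows = []
    for line in raw_records:
        record = json.loads(line)
        if record.get("type") != "order":
            continue  # other data stays in the lake untouched
        rows.append({
            "customer": str(record["customer"]),
            "amount": float(record["amount"]),  # typed only now, at query time
            "region": str(record["region"]),
        })
    return rows


def total_by_region(rows):
    """Sum order amounts per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


orders = query_orders(RAW_LAKE)
totals = total_by_region(orders)
print(totals)
```

The social media record is stored alongside the orders without anyone having decided in advance what to do with it. A different question tomorrow, say sentiment by customer, would apply a different schema to the same raw data.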
Data Lakes should not be thought of as just a technology trend. For us, Data Lakes are part of a technology strategy for a new breed of digital organization. This digital organization relies on agility. Agile organizations make smarter decisions faster. It is in this context of agile digital organizations that the concept of Data Lakes makes sense and shines.
You have to understand that most new data being created in the world is unstructured. Social media posts, images and videos are all unstructured data. Analytics is a powerful thing. Its first gift to the world was Business Intelligence, which used structured data to show you where you stood. For example: How did you do financially in the last 6 months? What have customers done? Which customers bought which product category in which region? This was great for businesses, as better planning capability, especially for financial purposes, greatly increased organizational efficiency. Structured data feeds built on top of data warehouses were set up to achieve this. But the agile digital organization goes beyond that. Its present is grounded in knowing the past, but its focus is the future. It asks: Why will customers leave? Who are the influencers I can personalize my sales efforts to, for maximum word of mouth that will contribute to consistent quarterly growth? How can I consistently satisfy my most frequent high-margin customers? It is ruthlessly focused on delighting the customer, and its processes, management and information technology all exist to enable and enhance that focus. To do that it requires insights from all sorts of data: structured, semi-structured and unstructured. That is the data its customers, and often the company’s customer-facing departments, are producing.
In an information technology strategy supporting an agile digital organization, Data Lakes can play an essential part. Such an organization defines use cases first and ensures that technology supports them, while also using technology to keep costs low. This way IT supports both business strategy and operations. These use cases can be operational in nature, like inventory optimization or cost-of-sales efficiency. Or they can be customer focused, like cross-selling and upselling. The best practice here is that a cross-functional team, led by a business executive and supported by an IT executive, leads and owns this process. Once these use cases are established, advanced analytics are used to find the data that shows how the organization should act on them.
Using advanced analytics requires access to all kinds of data in any form. Complex data modeling and algorithm-led data crawling have to be applied to this data. We are talking about massive data loads running on demand for lightning-quick near-real-time or real-time analysis. Data Lakes make this possible because they are built to scale for all kinds of data, not just structured data. There are organizations already running advanced analytics on exabytes of data using Data Lakes.
Operationally, you can even use the cloud as an enabler. That means any amount of throughput on files of any size, with no additional coding required once the Data Lake is set up. Companies like Microsoft offer enterprise-grade Data Lakes with SLAs today. If you choose to keep your Data Lake in your own data center, it uses open source technology running on stock, dependable storage, making it far more cost effective to scale from a storage point of view.
Data Lakes are at the heart of a technology solution for a new kind of agile digital organization, one whose competency in analyzing massive data sets, getting actionable insights and acting on them gives it a significant edge over rivals. Challenges remain, to be sure. Data Engineers who set up data feeds and Data Scientists who work with the business to create complex data models and algorithms are incredibly rare in most geographies. Getting every department to relinquish its data so it can be put into a Data Lake is often a far tougher cultural challenge than leaders realize. But Data Lakes mean insights that can help companies win in persistently uncertain times. For this reason alone they are here to stay.