A data lake is “a massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing ‘big data.’” Implementing one can avoid duplication of effort, encompass a company’s full data store, simplify data acquisition and storage, and democratize data within the enterprise. For details, see our previous post, What is a Data Lake?
What are the risks and challenges of data lakes?
The primary challenges are the flip sides of the benefits:
- Access to data across the organization
- Data quality and curation
- Security and access control
What is the access challenge?
It’s one thing to store vast quantities of data in a data lake, but it’s another to make it all rapidly accessible to users across a company. Big Data management systems can speed access by indexing data from disparate sources into a unified namespace that allows users to find it anywhere. Peaxy Aureum, for example, indexes data whenever it is ingested, created or modified, eliminating the need to periodically crawl the data infrastructure and improving overall performance and speed.
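To make the idea of indexing at ingest time concrete, here is a minimal sketch in Python. It is not Aureum's implementation; the class, method names, paths and storage names are all hypothetical, and a real system would persist the index rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class UnifiedIndex:
    """Toy in-memory index: logical path -> (storage system, physical path).

    Hypothetical illustration only, not a real product API.
    """
    entries: dict = field(default_factory=dict)

    def on_write(self, logical_path: str, source: str, physical_path: str) -> None:
        # Called whenever data is ingested, created or modified, so the index
        # is updated at write time and no periodic crawl is needed.
        self.entries[logical_path] = (source, physical_path)

    def find(self, prefix: str) -> list:
        # Return every logical path under a prefix, wherever the data lives.
        return sorted(p for p in self.entries if p.startswith(prefix))


index = UnifiedIndex()
# Made-up sources and paths, for illustration.
index.on_write("/sensors/turbine-7/2015-03.csv", "nas-01", "/vol3/t7/2015-03.csv")
index.on_write("/sensors/turbine-9/2015-03.csv", "object-store", "bucket-a/t9/2015-03.csv")
index.on_write("/finance/q1.xlsx", "nas-02", "/share/fin/q1.xlsx")

print(index.find("/sensors/"))
```

The point of the design is that a user asks for a logical location and never needs to know which storage system holds the file.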
What is the data quality challenge?
By definition, observes Gartner, a data lake accepts any data, without oversight or governance. By forgoing the data preprocessing used in data warehouses, data lakes contain data sets that are inconsistent, incomplete, duplicative, and out of context. It’s up to users of the data to make sense of it. Aureum now works in an edge computing model, where large amounts of unstructured data can be curated and processed down to manageable sizes that avoid bandwidth problems when the data is transferred to a data lake in the cloud.
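As a rough illustration of what curating at the edge might involve, the sketch below drops incomplete and duplicate records before anything is sent to the lake. The function name, field names and sample readings are assumptions made up for this example; real curation rules depend on the data.

```python
def curate(records, required_fields):
    """Drop incomplete and duplicate records, preserving the original order.

    Hypothetical helper: a record counts as a duplicate when all of its
    required fields match those of an earlier record.
    """
    seen = set()
    kept = []
    for record in records:
        # Skip records that are missing any required field.
        if any(record.get(name) is None for name in required_fields):
            continue
        key = tuple(record[name] for name in required_fields)
        # Skip records already seen with the same required-field values.
        if key in seen:
            continue
        seen.add(key)
        kept.append(record)
    return kept


# Made-up sensor readings, for illustration.
raw = [
    {"sensor": "t7", "ts": 1, "value": 0.5},
    {"sensor": "t7", "ts": 1, "value": 0.5},   # duplicate of the first record
    {"sensor": "t9", "ts": 1, "value": None},  # incomplete: no value
    {"sensor": "t9", "ts": 2, "value": 0.7},
]
curated = curate(raw, ["sensor", "ts", "value"])
print(len(raw), "records in,", len(curated), "records out")
```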
What are the security risks?
The lack of central oversight opens data lakes to two related security issues. One is the risk of intrusion, hacking and data tampering. The data management system that organizes the data lake should mitigate that risk with appropriate authentication, message integrity checking and encryption of data communications.
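Message integrity checking can be sketched with a keyed hash (HMAC) from Python's standard library. This shows one common technique, not necessarily what any particular product uses; the key and payload are placeholders, and a real deployment would load the key from a secrets store and pair this check with encryption in transit.

```python
import hashlib
import hmac

# Placeholder key for illustration only; never hard-code a real key.
SHARED_KEY = b"example-key-not-for-production"


def sign(message: bytes, key: bytes) -> str:
    # Compute an HMAC-SHA256 tag that the sender attaches to the message.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, key: bytes, tag: str) -> bool:
    # Recompute the tag and compare in constant time to resist timing attacks.
    return hmac.compare_digest(sign(message, key), tag)


payload = b'{"sensor": "t7", "value": 0.5}'
tag = sign(payload, SHARED_KEY)

print(verify(payload, SHARED_KEY, tag))            # unmodified message
print(verify(payload + b" ", SHARED_KEY, tag))     # message altered in transit
```

If even one byte of the message changes after signing, the recomputed tag no longer matches and the receiver can reject the data.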
The second issue is access control. As Gartner points out, “many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure.” Again, the data management system should provide access tools such as single sign-on (SSO) and role-based access control that ensure that only authorized users can see private or restricted data.
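Role-based access control comes down to a lookup from a user's role to the data classifications that role may read. The sketch below is a simplified assumption: the role names and classification labels are invented for this example, and production systems usually take roles from an identity provider behind SSO.

```python
# Hypothetical roles and data classifications, for illustration only.
ROLE_PERMISSIONS = {
    "analyst": {"public", "internal"},
    "compliance": {"public", "internal", "restricted"},
}


def can_read(role: str, classification: str) -> bool:
    # Unknown roles get an empty permission set, so access is denied by default.
    return classification in ROLE_PERMISSIONS.get(role, set())


print(can_read("analyst", "restricted"))
print(can_read("compliance", "restricted"))
```

Denying by default matters here: a role that nobody has configured should see nothing, not everything.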
So the data management system is the key?
Exactly. In the absence of centralized control, a data lake needs a sophisticated data management system that facilitates and aggregates data access across the company, supports data curation and incorporates best practices for security and access control.
Where can I learn more about data lakes?
We recommend this resource for a deeper dive into data lakes:
- How to create a data lake for fun and profit by Andrew C. Oliver in InfoWorld.
The Peaxy Executive Series is designed to explain quickly and simply what business leaders need to know about using big data and data access systems.