When structured data enters a repository — inventory, orders, customer records, invoices and other transactional data — it’s already well organized for retrieval.  You can search for an order, a customer, or an item and the database knows just where to find the information you need.

Unstructured data is a different story. It includes a wide variety of files, such as design schematics, simulations, service manuals, telemetry data, genomics or proteomics arrays, seismic readouts, IoT sensor data, office documents, media files (images, audio, or video) and other disparate data sets. Unstructured data sets are often large, can be created and stored anywhere in the company, and tend not to have standard or consistent formats.

With this data growing at such a fast pace, how do you make sure you collect, curate, and organize all of it, now and in the future?

Smart policy-based data lifecycle management

To make unstructured data available wherever it is needed, an effective data access strategy must incorporate a platform that can:

  • ingest it by classifying, categorizing, indexing and storing the information so that the platform knows what and where it is, and
  • aggregate it by grouping, replicating and providing distributed access to the information so that users and analytics applications across the company can readily access it.

Both of these activities should be governed by data policies that you define and set up for your organization. These policies embody decisions about where data is stored, and when and how it will be moved, over the course of its life. For example, an ingest policy tells the platform where to ingest, index and store new data. A migration policy tells the platform when (and where) to move data based on availability and access needs.
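
To make this concrete, here is a minimal sketch, in Python, of how such policies might be expressed. The class names, tier names and thresholds (IngestPolicy, MigrationPolicy, "flash-primary", 90 days) are hypothetical, not the configuration language of any particular platform.

```python
from dataclasses import dataclass

@dataclass
class IngestPolicy:
    """Hypothetical rule: where newly ingested data is indexed and stored."""
    match_pattern: str    # which files the rule applies to, e.g. "*.step"
    index_metadata: bool  # extract and index metadata at ingest time
    initial_tier: str     # where the data lands first, e.g. "flash-primary"

@dataclass
class MigrationPolicy:
    """Hypothetical rule: when and where data moves later in its life."""
    match_pattern: str
    after_days_idle: int  # migrate data not accessed for this many days
    target_tier: str      # e.g. "object-archive"

policies = [
    IngestPolicy("*.step", index_metadata=True, initial_tier="flash-primary"),
    MigrationPolicy("*.step", after_days_idle=90, target_tier="object-archive"),
]
```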

Data lifecycle management (DLM) policies are implemented by a data access platform using data classes and storage classes. Data classes are rules that govern how data appears to users. Storage classes define how data is managed on particular media types. Automated policies manage data to reduce total cost of ownership and deliver performance commensurate with each document’s stage in its lifecycle.
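
Continuing that illustration, the sketch below shows one way a document's lifecycle stage could select a storage class. The tier names, media types and age thresholds are invented for the example.

```python
# Illustrative storage classes: a media type plus a rough cost profile.
STORAGE_CLASSES = {
    "performance": {"media": "NVMe flash", "relative_cost": "high"},
    "capacity":    {"media": "high-density HDD", "relative_cost": "medium"},
    "archive":     {"media": "object store or tape", "relative_cost": "low"},
}

def storage_class_for(days_since_last_access: int) -> str:
    """Pick a storage class commensurate with a document's lifecycle stage."""
    if days_since_last_access < 30:
        return "performance"
    if days_since_last_access < 365:
        return "capacity"
    return "archive"
```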

Ingest: categorization, classification, indexing and initial data capture

When a data access platform ingests data, it must assign each file a permanent, immutable pathname so it is never lost and links never stop working.
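
As a rough illustration, a platform might mint such a pathname from a stable identifier assigned at ingest. The UUID-based scheme below is an assumption for the example; real platforms use their own identifier formats.

```python
import uuid

def permanent_path(tenant: str, original_name: str) -> str:
    """Mint a permanent, immutable pathname at ingest. The identifier never
    changes, even if the file is later renamed or migrated between tiers,
    so links to it keep working. (Illustrative scheme only.)"""
    file_id = uuid.uuid4().hex
    return f"/{tenant}/objects/{file_id}/{original_name}"

print(permanent_path("acme", "turbine-rev4.step"))
```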

The data is then managed within a single, unified namespace. A unified namespace is essential because it allows users and applications to look for data in a single place, even though the data itself is scattered throughout the organization (and potentially across the globe) in multiple repositories and storage systems.
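
One way to picture a unified namespace is as a mapping from a single logical path to one or more physical replicas. The paths and the resolve function below are hypothetical.

```python
# A toy unified namespace: one logical path, several physical replicas.
NAMESPACE = {
    "/acme/designs/turbine-rev4.step": [
        "nfs://dc-east/vol7/0f3a/turbine-rev4.step",
        "s3://acme-archive-eu/0f3a/turbine-rev4.step",
    ],
}

def resolve(logical_path: str) -> str:
    """Users and applications only ever see the logical path; the platform
    picks a physical location (here, simply the first replica)."""
    return NAMESPACE[logical_path][0]
```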

All data is indexed and its metadata extracted when it first enters the repository. The metadata is automatically updated whenever the documents change, so that searches always reflect the current state of the data. These smart indexes enable true Find capabilities, returning relevant documents from all data sets on the data plane.
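
A minimal sketch of ingest-time metadata extraction and re-indexing, assuming only filesystem-level metadata and an in-memory dictionary standing in for a real search index; a production platform would also extract content-specific fields.

```python
import os
import time

INDEX: dict[str, dict] = {}  # stand-in for a real search index

def extract_metadata(path: str) -> dict:
    """Filesystem-level metadata only; a real platform would also pull
    content-specific fields (EXIF tags, document authors, sensor IDs, ...)."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "extension": os.path.splitext(path)[1].lower(),
        "size_bytes": st.st_size,
        "modified": time.ctime(st.st_mtime),
    }

def index_file(path: str) -> None:
    """Run at first ingest and again whenever the file changes, so that
    searches always reflect the current state of the data."""
    INDEX[path] = extract_metadata(path)
```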

Aggregate: Grouping, replication and distribution

Although data in your company is originally created and stored in one place, it may be used elsewhere. An effective data access strategy includes a platform that will aggregate, group, replicate and distribute your data for quick and convenient access wherever it is needed — even on a different campus.

Aggregation enables edge computing, in which data is preprocessed and pre-analyzed where it is collected. Edge computing is essential for use cases in which enormous amounts of unstructured data are collected remotely and needed for quick analysis. Because transmitting all of that data to a central data lake would take too much bandwidth and time, preprocessing at the edge can reduce it to smaller, analysis-ready data sets before it moves on.

For example, a self-driving car collects too much data to transfer elsewhere for analysis, and must analyze that data instantaneously so that it can brake, accelerate or swerve as needed. The raw data is therefore processed “at the edge” of the network and reduced before moving on to the data lake.
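
A minimal sketch of that edge-side reduction, with simple per-window statistics standing in for whatever reduced-order model a real pipeline would compute.

```python
from statistics import mean

def reduce_at_edge(raw_samples: list[float], window: int = 1000) -> list[dict]:
    """Summarize high-rate sensor readings in fixed-size windows at the edge,
    so only a small, analysis-ready data set travels to the central data lake."""
    summaries = []
    for start in range(0, len(raw_samples), window):
        chunk = raw_samples[start:start + window]
        summaries.append({
            "start_index": start,
            "count": len(chunk),
            "mean": mean(chunk),
            "min": min(chunk),
            "max": max(chunk),
        })
    return summaries
```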

Replication provides redundancy and resilience. If a drive fails (and we know the one thing that is predictable is drive failure), a document can be quickly restored from another copy. The platform should ensure data integrity by periodically checking files for damage, and “heal” itself by retrieving any damaged file — and only that file — from a replica server.
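
A sketch of how integrity checking and self-healing could work, assuming content checksums and a replica that is reachable as a local path; a production platform would store checksums in its metadata index and fetch the replacement from a replica server over the network.

```python
import hashlib
import shutil

def checksum(path: str) -> str:
    """Content hash used to detect silent damage (bit rot, truncation)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_heal(local_path: str, replica_path: str, expected: str) -> bool:
    """If the local copy no longer matches its recorded checksum, restore
    just that file from a replica; nothing else is touched."""
    if checksum(local_path) == expected:
        return True                            # file is intact
    shutil.copy2(replica_path, local_path)     # heal from the replica
    return checksum(local_path) == expected
```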

Performance and scalability

In managing vast amounts of unstructured data, an effective data access strategy requires a platform that does not sacrifice speed of access, and whose expansion does not require system downtime or costly forklift upgrades (replacement of outdated hardware and other infrastructure). If the platform is appropriately architected, performance and scalability should follow from its design.

When throughput needs to grow, computer systems can be scaled vertically (“up”) by using faster hardware, or horizontally (“out”) by adding nodes in a distributed system. Horizontal scaling is strongly desirable because it lets you add capacity as needed, rather than having to plan far ahead of time.

Horizontal scaling does not depend on ever-faster hardware to keep up with throughput needs, and it minimizes total cost of ownership (TCO) by avoiding forklift upgrades. An effective data access strategy should support either type of scaling to match each company’s usage, whether that means accommodating a large number of files or a large volume of data.
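
To illustrate why scaling out works, the toy placement rule below maps each file onto whichever nodes currently exist, so adding a node spreads the load without replacing existing hardware. A real system would use consistent hashing so that adding a node relocates only a small fraction of the data; the simple modulo rule here is for illustration only.

```python
import hashlib

def node_for(file_id: str, nodes: list[str]) -> str:
    """Toy placement rule: hash the file id onto the current set of nodes."""
    h = int(hashlib.md5(file_id.encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node-1", "node-2", "node-3"]
print(node_for("design-00042", nodes))

# Scaling out: add a node rather than replacing existing hardware.
nodes.append("node-4")
print(node_for("design-00042", nodes))
```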


Takeaway

Organizations face an ever-accumulating mountain of unstructured data that is critical to their business. It is essential to craft an effective data access strategy that can ingest and aggregate data for universal access, and scale seamlessly to accommodate exponential growth.