Large modern organizations that depend on vast data sets to run their businesses may hit roadblocks when they try to retrieve data. Essential data is often unstructured and scattered among repositories across the company, and it may be decades old: original design documents, test data and the like. If users can't find it, they may have to recreate it from scratch, burning days or weeks that could be spent on higher-priority work.

How do you find the needle in the haystack when you are tackling a difficult business, computational-biology or engineering problem, such as predictive analytics?

Search vs. Find

Plain-vanilla search is unintelligent and context-free. If you search for a text string (such as a part number) on your corporate intranet, you will probably find many instances of it, some relevant and some not, but you will not find them all. You will not find it in documents on computers or servers that are disconnected from, or not searchable on, the network. You will not find it inside unstructured documents, such as CAD drawings or testbench data, that store it in their own native formats. And even if you could search all those places, the search could take hours, or even days.

An effective data access strategy requires a platform with a true “Find” capability that locates the most relevant files quickly. It can do so because those files were ingested and aggregated (identified, located, indexed, tagged, classified and categorized) when they first entered the repository. The metadata is updated automatically whenever a document changes, so the index always reflects the current state of the data. A distributed namespace and massively parallel processing enable rapid, successful searches across previously inaccessible, large-scale unstructured data.
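The ingest-time indexing described above can be illustrated with a minimal sketch: documents are tagged and added to an inverted index when they enter the repository, so a later "Find" is a direct lookup rather than a crawl. All names and structures here are illustrative, not drawn from any particular product.

```python
from collections import defaultdict

class IngestIndex:
    """Toy ingest-time index: documents are tagged and indexed on arrival,
    so later lookups are dictionary hits rather than full-text crawls."""

    def __init__(self):
        self.term_to_docs = defaultdict(set)   # inverted index: term -> doc ids
        self.metadata = {}                     # doc id -> tags/classification

    def ingest(self, doc_id, text, tags):
        # Identify, tag and index the document once, at entry time.
        self.metadata[doc_id] = tags
        for term in text.lower().split():
            self.term_to_docs[term].add(doc_id)

    def find(self, term):
        # A "Find" is a direct index lookup, not a scan of every file.
        return sorted(self.term_to_docs.get(term.lower(), set()))

idx = IngestIndex()
idx.ingest("D-100", "turbine blade part 4711 stress test", {"type": "report"})
idx.ingest("D-101", "assembly drawing part 4711", {"type": "CAD"})
print(idx.find("4711"))  # -> ['D-100', 'D-101']
```

In a real platform the index would be distributed and the ingestion pipeline would extract metadata with format-aware parsers, but the economics are the same: pay the indexing cost once at ingest, and every subsequent query is cheap.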

A multifaceted find capability

To successfully retrieve documents in a data access system, the Find engine should include the following elements:

  1. Smart indexing: Return relevant documents from across an organization based on indexes set up to identify and monitor terms, concepts and data elements that are important to your business, such as part numbers and serial numbers.
  2. Granular scope: Each index has a scope (a directory, organization, or discipline) and can include or exclude file types, subdirectories and other attributes.
  3. Custom and contextual metadata: Organizations can easily add custom metadata for indexing, which can dramatically improve the effectiveness of deep content mining later on. They can also define data policies that set the context for each index by specifically including or excluding certain files from Find results, based on their file type, attributes or access permissions.
  4. Incremental indexing: Within each index’s context, the platform monitors new or changed content and updates the index accordingly. Incremental indexing eliminates the cost (in money and time) of periodically “crawling” the namespace, and avoids the need for full-system scans.
  5. Threaded Find: An indexing technique in which each data element carries links to related material, so queries for connected content resolve much faster.
  6. Application-specific parsers: A true Find capability is not limited to searching metadata; it can look inside files and documents that store data in native formats, such as seismic data, engineering drawings, simulations and office documents. These data sets are often out of reach of traditional search engines. The Find engine accesses this deep content using specialized parsers that understand those data formats, and new parsers can be developed and added to the engine to address a business's core content.
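Two of the ideas above, pluggable format-specific parsers (item 6) and incremental re-indexing on change events (item 4), can be sketched together. This is a hypothetical illustration; the registry, file formats and function names are invented for the example.

```python
# Registry of format-specific parsers, keyed by file extension.
PARSERS = {}

def parser(extension):
    """Decorator that registers a parser for a native file format."""
    def register(fn):
        PARSERS[extension] = fn
        return fn
    return register

@parser(".txt")
def parse_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")

@parser(".dwg")
def parse_drawing(raw: bytes) -> str:
    # A real parser would understand the CAD format; here we just
    # pretend to pull an embedded part number out of the binary.
    return "PN-4711"

index = {}  # path -> extracted, searchable text

def on_change(path: str, raw: bytes):
    """Called once per new or changed file: only that entry is
    re-parsed and re-indexed, so no periodic full crawl is needed."""
    ext = path[path.rfind("."):]
    parse = PARSERS.get(ext)
    if parse:
        index[path] = parse(raw)

on_change("specs/blade.txt", b"part PN-4711 fatigue data")
on_change("drawings/blade.dwg", b"\x00\x01binary")
print([p for p, text in index.items() if "PN-4711" in text])
```

The design choice worth noting is the open registry: supporting a new native format means adding one parser, not modifying the Find engine itself.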

Curated access: The Digital Dossier concept

Repeating the same Find query multiple times is a waste of time and productivity. That's why an effective data access strategy requires a platform that makes it easy for users to assemble Find results into a “Digital Dossier” of all relevant data and documents that share a common context, such as a particular part or serial number. The system should automatically update the dossier when new documents enter that context.

A Digital Dossier offers a visual representation of all essential documents for a piece of industrial equipment or a structure, including an evergreen “birth certificate” (often a nameplate with serial number). Users add value when they assemble, edit and annotate that information. They can save, edit, apply and share queries that specify a target context, then select pertinent results and add them to the dossier. They can copy results that capture a point in time (like a particular share price), or link results that are dynamic and change over time (like the current share price). They can also include material from external sources, then sort and group that information.
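The copy-versus-link distinction above can be made concrete with a small sketch: a copied entry freezes a value at the moment it is added, while a linked entry re-reads its source every time the dossier is viewed. The class and method names are hypothetical, chosen only for this illustration.

```python
class Dossier:
    """Toy Digital Dossier: entries are (label, resolver) pairs."""

    def __init__(self, context):
        self.context = context
        self.entries = []

    def copy(self, label, value):
        # Freeze the value now; later source changes are not reflected.
        self.entries.append((label, lambda v=value: v))

    def link(self, label, source, key):
        # Resolve lazily; always reflects the source's current state.
        self.entries.append((label, lambda: source[key]))

    def view(self):
        return {label: resolve() for label, resolve in self.entries}

prices = {"ACME": 100}
d = Dossier("part PN-4711")
d.copy("price at review", prices["ACME"])
d.link("current price", prices, "ACME")
prices["ACME"] = 120  # the source changes after the dossier was built
print(d.view())  # -> {'price at review': 100, 'current price': 120}
```

The same pattern extends to documents: a copied drawing is a snapshot for the record, while a linked drawing always opens the latest revision.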

Most importantly, users can share their Digital Dossiers with others, who can work on updates collaboratively. In this way, users make ongoing contributions to the quality and utility of common data. The longevity and usefulness of these Digital Dossiers can far outlast the tenure of the individuals who helped create them.


Takeaway

Data access solutions are about illuminating data assets and making them useful, not just storing them in a dark corner of a server farm. Smart businesses understand that this strategy determines winners and losers in the marketplace.

Makers of heavy equipment in aviation, mining, oil and gas, and the automotive industry, in particular, need to leverage data seamlessly in their engineering processes. Giving engineers the ability to easily find whatever design or part number they need while working on the next generation can save millions, if not billions, of dollars over time.