One of the major benefits of creating a robust ML-enabled data pipeline is the ability to feed data into an anomaly detection algorithm. The effectiveness of the anomaly detection capability, however, is almost entirely dependent on the quality of the data pipelines feeding into it. Before we cover how this happens, let’s first look at the challenges of working with data pipelines in general.
A data pipeline consists of an origin, one or more processors, and a destination.
Origin or data source: The first step in creating a pipeline is identifying how the data source will be accessed. For example, a data source might be Modbus data from an energy storage system (ESS) used on a grid-scale battery array, a programmable logic controller (PLC) in a manufacturing center, SCADA data from one or more wind turbines, or possibly SQL or JSON data from a historian that’s used to aggregate data from multiple sensors. Accessing data at the origin may require specialized connectors that understand how to receive or access data stored in a variety of formats.
Processors and executors transform the data received from the data source. A processor might be a field mapper, which maps source data (values or names) into the format the destination expects to receive. An executor triggers a task when it receives a particular event. For example, an executor may issue an SQL query or read existing S3 data so that a processor stage can perform a specific calculation.
A data pipeline can consist of any number of processor and executor steps. A processor step is written to do a specific task in the transformation of the source data. Depending on the complexity required to transform the data, these steps could be performed consecutively or in parallel.
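A consecutive processor chain can be sketched in a few lines. This is a minimal illustration, not any particular product’s API; the field names, the `FieldMapper`-style mapping, and the unit-conversion step are all made up for the example.

```python
# Minimal sketch of a linear processor chain. Names and record shapes
# are illustrative, not tied to any specific pipeline product.

def field_mapper(record, mapping):
    """Processor step: rename source fields to the names the destination expects."""
    return {mapping.get(key, key): value for key, value in record.items()}

def celsius_to_kelvin(record):
    """Processor step: convert a value to the expected unit of measure."""
    out = dict(record)
    out["temp_k"] = out.pop("temp_c") + 273.15
    return out

def run_pipeline(record, steps):
    """Apply processor steps consecutively, one after another."""
    for step in steps:
        record = step(record)
    return record

source = {"ts": 1700000000, "temp_c": 21.5, "devId": "ess-42"}
mapping = {"devId": "device_id", "ts": "timestamp"}
steps = [lambda r: field_mapper(r, mapping), celsius_to_kelvin]
print(run_pipeline(source, steps))
# {'timestamp': 1700000000, 'device_id': 'ess-42', 'temp_k': 294.65}
```

Steps that don’t depend on each other’s output could just as easily be fanned out and run in parallel; the consecutive loop here is the simplest case.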
Destinations or locations for transformed data: In the final step of a data pipeline, the destination defines how and where the transformed data will be stored. Transformed data could be archived in a highly optimized data warehouse, appended to a time-series database, or written to either AWS S3 or Azure Blob Storage. Similar to the data source, specialized connectors may be required depending on where the transformed data will be written.
In this visual representation of a Peaxy data pipeline, several transformations are performed on source data before writing it into one of two destination databases.
Extract, transform, load (ETL) is a common type of data pipeline used for preparing structured and semi-structured data for analysis. In ETL, source data may come from telemetry, customer relationship management (CRM) systems, or enterprise resource planning (ERP) systems. The data is then transformed by matching and mapping data across the sources, applying filters, converting to an expected unit of measure, or cleaning the data to address inconsistencies or duplicates. Finally, the pipeline can load the processed data into a data warehouse.
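The transform stage described above can be sketched as a single pass over the source rows. This toy example assumes telemetry records and a CRM-style lookup keyed by an asset ID; all field names are invented for illustration.

```python
# Toy ETL transform: filter bad readings, drop duplicates, match records
# across sources, and convert units. Field names are hypothetical.

telemetry = [
    {"asset_id": "t-01", "power_kw": 1500.0},
    {"asset_id": "t-01", "power_kw": 1500.0},   # duplicate to be cleaned
    {"asset_id": "t-02", "power_kw": None},     # inconsistency to be filtered
]
crm = {"t-01": {"site": "North Ridge"}, "t-02": {"site": "South Mesa"}}

def transform(rows, lookup):
    seen, out = set(), []
    for row in rows:
        if row["power_kw"] is None:               # apply a filter
            continue
        key = (row["asset_id"], row["power_kw"])
        if key in seen:                           # address duplicates
            continue
        seen.add(key)
        merged = {**row, **lookup[row["asset_id"]]}   # match across sources
        merged["power_mw"] = merged["power_kw"] / 1000  # unit conversion
        out.append(merged)
    return out

rows = transform(telemetry, crm)
```

The load step would then append `rows` to the warehouse table in bulk.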
Working with data pipelines, especially ones in which data is streaming in from the edge, can raise security concerns. Protecting edge devices, and the communication channels between the edge and the cloud, is of critical importance. Encryption of data at rest and on the wire, the principle of least privilege, and protection against denial-of-service attacks must all be considered to ensure pipelines are secure.
Even after a data pipeline is created, it requires active management, since systems are rarely static. All pipelines will at times require components to be updated. Developing ad hoc pipelines with home-grown tools can sometimes be expedient, but it is better, and ultimately more cost-effective in the long term, to work with a partner who understands pipeline architecture. You can rely on the partner’s experience in deploying the most appropriate tools, such as Azure IoT Hub or AWS Greengrass.
Although pipelines involving edge devices must be designed to assume failures, there are many ways to minimize them. It starts with selecting the right robust or ruggedized hardware. Understanding the need for redundancy for network uplink over wired Ethernet, WiFi, 4G, or satellite can keep your edge devices in communication even when some services are offline. Local storage can be intelligently leveraged to provide persistence of data when uplinks are offline. Understanding time series databases, and how data must be replayed after a period of outage, will minimize gaps in sensor data.
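The store-and-forward pattern behind that last point can be sketched briefly. This is a minimal, assumption-laden illustration: an in-memory deque stands in for durable local storage, and the `send` uplink function is hypothetical.

```python
import collections

# Store-and-forward sketch: when the uplink is down, readings are persisted
# locally (a deque standing in for on-disk storage) and replayed in order
# once the link returns, so the time series has no gaps.

class EdgeBuffer:
    def __init__(self, send, max_samples=10_000):
        self.send = send                      # uplink function; may raise
        self.buffer = collections.deque(maxlen=max_samples)

    def publish(self, sample):
        self.buffer.append(sample)            # persist first, then try to send
        self.flush()

    def flush(self):
        while self.buffer:
            try:
                self.send(self.buffer[0])
            except ConnectionError:
                return                        # uplink still down; retry later
            self.buffer.popleft()             # remove only after a confirmed send

sent = []
link_up = False
def send(sample):
    if not link_up:
        raise ConnectionError
    sent.append(sample)

edge = EdgeBuffer(send)
edge.publish({"t": 1, "rpm": 1200})           # uplink offline: sample is buffered
edge.publish({"t": 2, "rpm": 1210})
link_up = True
edge.flush()                                  # samples replayed in original order
```

A real deployment would bound the buffer by available storage and replay into a time-series database with the original timestamps, exactly as the paragraph above describes.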
Even products such as wind turbines from a single company evolve over the years and represent their data in different ways. An older industrial product may expose telemetry in binary format, which can be accessed through CAN or Modbus. Attaching to devices often requires hardware that is less common and more difficult to interface with than Ethernet, such as RS-232, RS-422, or RS-485. Newer devices may expose a more contemporary REST API but have a limited set of sensors available. Next-generation industrial equipment can have hundreds of sensors, which can inform thousands of data points describing the status and health of the equipment, creating data volume challenges.
If compute resources are sufficient at the edge, data can be normalized there. Otherwise, the raw data is pushed to the cloud for processing. In either case, it is necessary for the data schema to be understood. Data schemas can be stored and referenced through a schema registry, which ensures a mutual understanding of what the data looks like between the producer and consumer.
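The producer/consumer agreement a schema registry provides can be illustrated with a toy, dict-based registry. A real deployment would use something like the Confluent Schema Registry with Avro or JSON Schema; the subject name, version, and field names below are all assumptions made for the example.

```python
# Toy schema registry: maps a (subject, version) pair to the expected
# field names and types, so producer and consumer share one definition
# of what the data looks like.

REGISTRY = {
    ("turbine-telemetry", 1): {"device_id": str, "timestamp": int, "power_kw": float},
}

def validate(record, subject, version):
    """Check a record against the registered schema for this subject/version."""
    schema = REGISTRY[(subject, version)]
    if set(record) != set(schema):            # missing or unexpected fields
        return False
    return all(isinstance(record[f], t) for f, t in schema.items())

good = {"device_id": "wt-07", "timestamp": 1700000000, "power_kw": 812.5}
bad = {"device_id": "wt-07", "timestamp": "yesterday", "power_kw": 812.5}
assert validate(good, "turbine-telemetry", 1)
assert not validate(bad, "turbine-telemetry", 1)
```

Versioning the schema, as the registry key does here, is what lets producers evolve their data format without silently breaking downstream consumers.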
One of the major uses of a data pipeline is to route data through anomaly detection algorithms to alert operators of potential faults. In this case, the data pipeline should include a number of additional processes to perform the following functions:
Verify that the data stream is being received at the predefined rate. For example, if the pipeline is configured to receive data every 10 seconds, it should raise an event when data hasn’t been received for more than 60 seconds.
Verify that received data matches the expected format. For example, the pipeline may be expecting a data stream consisting of 22 attributes, of which 10 are floats, 5 are integers, 5 are booleans, and 2 are timestamps. In cases where a received data stream contains nulls, infinities, or truncated values, the pipeline should increment a counter to indicate the number of invalid data samples received against the total number of received samples.
Perform required data transformations. This process should perform all the data transformations required to pass the data sample through an anomaly detection algorithm. A data transformation may consist of generating one or more columns derived from the source data. Other transformations may consist of turning a numerical value into a categorical value.
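The three functions above can be sketched as follows. The field names, cadence, and thresholds are made up for illustration; a production pipeline would wire these checks into its processor and executor steps.

```python
import math
import time

# Sketch of the three pipeline functions. All names and thresholds are
# hypothetical examples.

MAX_SILENCE_S = 60          # expected cadence is 10 s; alert after 60 s
invalid_samples = 0
total_samples = 0

def check_rate(last_received_ts, now=None):
    """Function 1: detect when the stream has gone quiet for too long."""
    now = time.time() if now is None else now
    return (now - last_received_ts) <= MAX_SILENCE_S

def check_format(sample):
    """Function 2: count samples containing nulls or infinities."""
    global invalid_samples, total_samples
    total_samples += 1
    ok = all(v is not None and not (isinstance(v, float) and not math.isfinite(v))
             for v in sample.values())
    if not ok:
        invalid_samples += 1
    return ok

def transform(sample):
    """Function 3: derive a column and bin a numeric value into a category."""
    out = dict(sample)
    out["power_mw"] = out["power_kw"] / 1000
    out["load_band"] = "high" if out["power_kw"] > 1000 else "normal"
    return out

assert check_rate(last_received_ts=100, now=150)       # 50 s of silence: ok
assert not check_rate(last_received_ts=100, now=200)   # 100 s: raise an event
assert check_format({"power_kw": 812.5})
assert not check_format({"power_kw": float("inf")})
assert transform({"power_kw": 1500.0})["load_band"] == "high"
```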
Incorporating the above three functions within a data pipeline may take multiple processor and executor steps. Where a predicted value deviates from the expected value by more than a configured threshold, an event or notification is raised, alerting operators to a potential impending fault.
In the next issue, we’ll look at how ML-enabled data pipelines can detect product anomalies in more detail.
Peaxy Lifecycle Intelligence trains a basket of machine-learning algorithms on historical failure data to discover non-trivial discrepancies in live data streams that can signal incipient component failures. These anomalies are fed into PLI’s alert management system, where the user can fully analyze the data.
By choosing to pre-emptively repair or replace equipment, operators can avert catastrophic failures. Because each asset class has a unique use case, Peaxy’s analytics experts perform the initial tuning and tweaking of algorithms. The module’s accuracy improves over time through continuous data ingestion, the use of multiple competing algorithms, and manual feedback on false-positive alerts.