Aureum 4.0 Under the Hood

August 15, 2016 Peaxy Aureum

The latest Aureum release (4.0) builds on the 3.1 features of Peaxy Find (threaded search capability) and security data services. It also improves interoperability with the Hadoop Distributed File System (HDFS) and Spark. This post gives an overview of how Aureum works and explains the technical reasons it scales and performs so well. It is more technical than most posts on this blog, but it could come in handy for your Chief Data Officer, system architect or data scientist.

Aureum – A Modern Approach to Data Access

Aureum is an integrated clustered file system and data access platform that offers dynamically expandable scale-out facilities to all applications. Its incrementally scalable file-serving infrastructure can run on commodity servers from different vendors, regardless of form factors or media type.

Aureum moves past the “data silos” model that tends to make critical business data go dark. It enables enterprises to consolidate unstructured data from across the business into a unified namespace that is easily accessible and searchable, so that data access scales with the business.

Versions 3.0 and 3.1 introduced new indexing and search capabilities, three levels of security, and HDFS and clustered SAMBA support.

Version 4.0 includes:

How Aureum Works

Aureum is a file-serving cluster built out of Virtual Machines (VMs) or Containers running on several physical servers. VMs provide better fault isolation than Containers, but are less efficient. (Note: To simplify the discussion, the rest of this post refers only to VMs, without explicitly mentioning Containers, unless a description applies to only one of the two.)

Aureum server code runs in user space on each VM in the cluster, and each VM executes a separate Linux image. Containers within the same server all run under a single Linux instance. Once assigned to Aureum, the servers run Aureum software exclusively and are fully devoted to implementing Aureum abstractions, to ensure the appropriate levels of performance and availability.

When Aureum is initially configured or new hardware is added, the system assigns storage, CPU, RAM and network ports to VMs as needed to achieve the desired trade-offs between cost and performance. This allows the cluster to grow incrementally and organically. The administrator can balance user requirements, disk capacity, network bandwidth, and hardware characteristics (mixing and matching vendors, form factors, hardware generations and types of storage devices), while the system offers a single namespace that aggregates all these physical devices.

Aureum is accessed through a small software client component that provides a POSIX-compliant interface for Linux systems or a Windows SMB interface, or via a clustered SAMBA infrastructure embedded within Aureum. This means that after Aureum has been mounted on a local client and the user has been authenticated, all applications work just as they always have with standard Windows shares.

Enabling a Scalable Distributed File System

The key function of a scalable distributed file system is to create a namespace that encompasses all the files in the system with great performance regardless of the number or size of the files.

In a file system, there is data, and there is metadata describing this data. The data structure containing the metadata is usually called a directory, subdirectory or folder. This data structure is hierarchical and the path to a particular file is called the data path. In many file systems, directories are themselves files; that is, the namespace and the data space are stored together on the same storage unit.

Usually a client has a single local storage unit, most commonly a solid-state drive (SSD) or a hard disk drive (HDD). A storage system, by contrast, has a large number of storage units, often of different capacities and performance levels. To support distributed file storage, each Aureum VM manages the storage devices available to it. The physical media Aureum can work with include all types of drives, such as SATA, SAS and SSDs.

Each Aureum VM implements either a data space service or a namespace service. Data space services manage the storage resources where user file data is stored. A namespace service stores a subset of the basic hierarchical file system namespace and keeps track of the attributes of files, directories and symbolic links, such as ownership information, access permissions, and creation and modification dates.

In Aureum, the namespace has its own storage subsystem, held entirely in random access memory (RAM) and made persistent by a journaling mechanism that backs it up to stable storage. Therefore, any pathname-related operation, such as a file lookup or open, is handled completely in RAM, avoiding disk I/O. The data structures used to implement this allow a large number of files and directories to be managed within a single VM.
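The general idea of a RAM-resident namespace backed by a journal can be illustrated with a small Python sketch. This is not Aureum's implementation: the class name JournaledNamespace, the JSON record format and the file layout are all assumptions made for illustration. It only shows the pattern of logging each change to stable storage before applying it in memory, so that lookups never touch disk and the state can be rebuilt after a restart.

```python
import json
import os
import tempfile


class JournaledNamespace:
    """Toy in-RAM namespace (hypothetical, not Aureum code) backed by a journal file."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.entries = {}  # pathname -> attribute dict, held entirely in RAM
        self._replay()

    def _replay(self):
        # Rebuild the in-memory state from the journal, e.g. after a restart.
        if not os.path.exists(self.journal_path):
            return
        with open(self.journal_path, "r", encoding="utf-8") as journal:
            for line in journal:
                self._apply(json.loads(line))

    def _apply(self, record):
        # Apply one journal record to the in-memory dictionary.
        if record["op"] == "create":
            self.entries[record["path"]] = record["attrs"]
        elif record["op"] == "remove":
            self.entries.pop(record["path"], None)

    def _log(self, record):
        # Write the record to stable storage first, then update RAM.
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps(record) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self._apply(record)

    def create(self, path, owner, mode):
        self._log({"op": "create", "path": path,
                   "attrs": {"owner": owner, "mode": mode}})

    def remove(self, path):
        self._log({"op": "remove", "path": path})

    def lookup(self, path):
        # Pure dictionary access: no disk I/O on the read path.
        return self.entries.get(path)


# Usage: make a few changes, then simulate a restart by replaying the journal.
journal_file = os.path.join(tempfile.mkdtemp(), "namespace.journal")
ns = JournaledNamespace(journal_file)
ns.create("/projects/report.txt", owner="alice", mode=0o644)
ns.create("/projects/old.txt", owner="bob", mode=0o600)
ns.remove("/projects/old.txt")

recovered = JournaledNamespace(journal_file)
print(recovered.lookup("/projects/report.txt"))
```

A production system would also compact or checkpoint the journal so that replay time stays bounded; the sketch leaves that out.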

In order to ensure high availability, both namespace and data space VMs are replicated across physical servers within software abstractions called “hyperservers”. Hyperservers provide the following benefits:

For scalability, the entire Peaxy namespace is partitioned across hyperservers. Thus the namespace itself is a collection of fragments distributed across all namespace hyperservers. The clients see a single namespace, but know where each directory is located and communicate directly with the hyperserver in charge. Because there are no intermediaries, there are no bottlenecks.
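To make the routing idea concrete, here is a minimal Python sketch of a client that holds a map from directories to hyperservers and picks the responsible one itself. How Aureum actually assigns directories to hyperservers, and how clients learn that assignment, is not described in this post; the NamespaceClient class, the hyperserver ids ("hs-0" and so on) and the nearest-mapped-ancestor rule are assumptions chosen purely for illustration.

```python
import posixpath


class NamespaceClient:
    """Toy client (hypothetical) that routes a path to the hyperserver owning its directory."""

    def __init__(self, directory_map):
        # directory_map: directory path -> hyperserver id (assumed layout).
        self.directory_map = dict(directory_map)

    def hyperserver_for(self, file_path):
        # Start at the file's parent directory and walk up until a mapped
        # directory is found; that directory's hyperserver handles the request.
        directory = posixpath.dirname(posixpath.normpath(file_path))
        while True:
            if directory in self.directory_map:
                return self.directory_map[directory]
            if directory in ("/", ""):
                raise KeyError(f"no hyperserver owns {file_path}")
            directory = posixpath.dirname(directory)


# Usage: three namespace fragments spread over three hyperservers.
client = NamespaceClient({
    "/": "hs-0",
    "/eng": "hs-1",
    "/eng/sim/runs": "hs-2",
})
for path in ("/eng/sim/runs/run42.dat", "/eng/specs/pump.pdf", "/readme.txt"):
    print(path, "->", client.hyperserver_for(path))
```

Because the lookup happens on the client, each request goes straight to one hyperserver and no central router sits in the request path.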

A few more points to make on Aureum’s technical benefits: