Hadoop makes many real-world problems much easier to solve, but its system design can also mean unnecessary duplication of data and limited access to it. One example comes from the digitization of paper-based content.

Rasterizing PDF in commercial printing

One of the key features of PDF is that each page is independent of all the other pages in a document. Instead of raster-processing the pages in a pipeline on an expensive supercomputer, you can take a number of inexpensive servers, distribute the pages among them round-robin, and rasterize them in parallel. You then assemble the images in page order and feed them to the digital press. This costs a fraction of a mini-supercomputer: scale-out instead of scale-up.
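The scheme is simple enough to sketch in a few lines. The following is a minimal illustration, not a production RIP farm: the server names and the remote rasterize_page command are hypothetical stand-ins, and page results are collected and re-sorted so the press receives them in order.

```python
# Round-robin distribution of PDF pages to a pool of inexpensive servers.
# SERVERS and the remote 'rasterize_page' command are hypothetical.
from concurrent.futures import ThreadPoolExecutor
import subprocess

SERVERS = ["rip-node-01", "rip-node-02", "rip-node-03", "rip-node-04"]

def rasterize_on(server, pdf_path, page_no, out_dir):
    """Ask one worker to rasterize a single page; return (page_no, image path)."""
    out_file = f"{out_dir}/page-{page_no:06d}.tiff"
    subprocess.run(
        ["ssh", server, "rasterize_page", pdf_path, str(page_no), out_file],
        check=True,
    )
    return page_no, out_file

def rasterize_document(pdf_path, page_count, out_dir):
    # Page i goes to server i mod len(SERVERS): round-robin assignment.
    with ThreadPoolExecutor(max_workers=len(SERVERS)) as pool:
        futures = [
            pool.submit(rasterize_on, SERVERS[i % len(SERVERS)], pdf_path, i, out_dir)
            for i in range(page_count)
        ]
        results = [f.result() for f in futures]
    # Pages were rendered in parallel; reassemble in page order for the press.
    return [path for _, path in sorted(results)]
```

Because the pages never depend on one another, adding servers shortens the run roughly in proportion to their number, which is exactly the scale-out argument.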

Limitations of Vanilla HDFS

The software used for this divide-and-conquer process is Hadoop, which consists of two pieces: MapReduce, which manages the computation, and HDFS, which stores the input PDF and the output raster images. HDFS was designed for the technology of a dozen years ago. The data moves to the computation (a time-consuming step) over the HTTP protocol, which was not designed for this task, and each block is replicated three times on local disks, which consumes a lot of disk space. Further, HDFS is not POSIX compliant, so your other programs cannot access the data directly. The HDFS layer supports only sequential access to (large) files, and its namespace is implemented by a single (possibly redundant) node that can become a bottleneck.
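To make the division of labor concrete, here is one way the Hadoop side might look as a map-only Hadoop Streaming job. This is a sketch under assumptions: the input format (one "PDF path, page number" record per line), the /press/output directory, and the rasterize_page command are illustrative, not part of any particular product.

```python
#!/usr/bin/env python3
# mapper.py -- sketch of a map-only Hadoop Streaming mapper.
# Each input line is assumed to be "<pdf_path_in_hdfs>\t<page_number>".
# 'rasterize_page' is a hypothetical stand-in for a real RIP command.
import subprocess
import sys

for line in sys.stdin:
    pdf_path, page_no = line.rstrip("\n").split("\t")
    local_pdf = "input.pdf"
    local_img = f"page-{int(page_no):06d}.tiff"

    # Pull the PDF out of HDFS into the task's working directory,
    # rasterize one page, then push the image back into HDFS.
    subprocess.run(["hdfs", "dfs", "-get", pdf_path, local_pdf], check=True)
    subprocess.run(["rasterize_page", local_pdf, page_no, local_img], check=True)
    subprocess.run(
        ["hdfs", "dfs", "-put", "-f", local_img, f"/press/output/{local_img}"],
        check=True,
    )

    # Emit "page number <TAB> HDFS path" so the job output records
    # where each raster landed.
    print(f"{page_no}\t/press/output/{local_img}")
```

Launched with the standard Hadoop Streaming jar and zero reducers, this would fan the pages out across the cluster; note that every byte of the PDF and of the rendered images passes through HDFS, which is where the limitations above start to bite.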

Systems that offer an HDFS interface on top of a scalable, highly available, general-purpose distributed file system can be readily integrated into Hadoop workflows and have major advantages. They can provide a far more robust storage foundation, avoid the HDFS limitations described above, and allow non-HDFS applications to access HDFS-generated and HDFS-managed files. As such, they are much better long-term data repositories than native HDFS clusters. Improvements in network technology greatly reduce the need to keep the data on the nodes where the computations take place, so the computational engines and the storage infrastructure can each be optimized to carry out their functions in the most cost/performance-efficient way. Peaxy Aureum is one such system.
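The practical payoff is dual access: the MapReduce job addresses the files through the HDFS-compatible interface, while downstream, non-Hadoop press software reads the very same files through an ordinary file-system path, with no export step and no second copy. The sketch below assumes a hypothetical POSIX mount point /mnt/store and the same /press/output directory used above.

```python
# Non-Hadoop code reading HDFS-managed raster images through a POSIX mount.
# The mount point /mnt/store and the output directory are hypothetical.
from pathlib import Path

PRESS_OUTPUT = Path("/mnt/store/press/output")

def pages_in_order():
    """Yield the rendered pages in page order using plain POSIX calls."""
    for image in sorted(PRESS_OUTPUT.glob("page-*.tiff")):
        yield image.read_bytes()   # no 'hdfs dfs -get', no extra copy

# e.g. stream the rendered pages, in order, straight to the press driver:
# for raster in pages_in_order():
#     press.send(raster)          # 'press' is a hypothetical driver object
```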

Sometimes a quick hack is implemented to solve a difficult problem. It works well at the moment but does not survive the test of time. This contrasts with systems built on a deep, cogent architecture, which can adapt to technological progress. Modern scale-out file systems that expose an HDFS API for MapReduce are replacing HDFS for Hadoop applications, enabling better storage with less duplication and broader access.