What Can NFS-Enabled Hadoop Do For You?
NFS -- a distributed file system protocol -- allows access to files on a remote computer in a manner similar to how a local file system is accessed.
By Sachin Sinha, Director of Big Data Analytics, ThrivOn
Hadoop Distributed File System (HDFS) runs in user space rather than as a kernel file system, so it doesn't behave like other Linux file systems. Out of the box, HDFS cannot be mounted the way traditional file systems can be. We interact with it through the HDFS command-line utilities, which means files residing in HDFS are not accessible to traditional Linux programs.
Let's say we have a Linux script that analyzes text. If our text files are in HDFS, they won't be accessible to this program. Access to files residing on HDFS is usually handled through an HDFS client or WebHDFS. This lack of seamless integration with the client's file system makes access to HDFS difficult for users and impossible for some applications.
Now consider the following scenarios, which are common in the enterprise world:
- File browsing and downloading: Users and applications want to browse the files saved on HDFS and download them
- File uploading: Users and applications want to upload files to HDFS
- Data streaming: Applications want to stream data directly to HDFS
From outside Hadoop, none of these can be achieved with plain-vanilla HDFS without going through the Java-based HDFS API. Enter Network File System (NFS), a distributed file system protocol that allows access to files on a remote computer in a manner similar to how a local file system is accessed. Enterprises have used it for years. NFS interface support is one way to give HDFS this kind of easy integration. With NFS enabled for Hadoop, files in HDFS can be browsed, downloaded, and written as if HDFS were a local file system. These are critical enterprise requirements. With NFS enablement, HDFS can be accessed through an HDFS client, the Web API, or the NFS protocol. This makes HDFS easier to access and able to support more applications and use cases.
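As a minimal sketch of what this means for an existing tool, assume the cluster has been exported over NFS and mounted at a hypothetical path such as /mnt/hdfs (the mount point, directory names, and the function name `count_words` are illustrative assumptions, not taken from any particular distribution). A plain Python script can then walk and read the files with ordinary file calls, with no Hadoop library involved:

```python
import os
from collections import Counter


def count_words(root):
    """Count whitespace-separated words in every .txt file under root.

    Only standard file-system calls are used, so root can be a local
    directory or a directory on an NFS-mounted HDFS cluster.
    """
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".txt"):
                continue
            path = os.path.join(dirpath, name)
            # Read line by line so large files are never loaded whole.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    counts.update(line.lower().split())
    return counts


if __name__ == "__main__":
    # Hypothetical mount point; substitute wherever the cluster is mounted.
    mount_root = "/mnt/hdfs/user/analyst/text"
    if os.path.isdir(mount_root):
        for word, n in count_words(mount_root).most_common(10):
            print(word, n)
    else:
        print("mount point not found:", mount_root)
```

The same function runs unchanged against a local directory, which is the point: the script does not need to know that the files live in HDFS.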
MapR's Direct Access NFS has for quite some time allowed files to be read and modified by mounting the Hadoop cluster over NFS. Other distributions are now realizing the importance of this capability in the enterprise world and are moving quickly to enable NFS as well. IBM enabled this via General Parallel File System (GPFS) in its Hadoop distribution, BigInsights. Hortonworks, on the other hand, has implemented NFS in its distribution, HDP, via a gateway: a stateless daemon that translates the NFS protocol to HDFS access protocols.
Regardless of the implementation, using this interface allows users of a Hadoop cluster to rapidly push data to HDFS in a way they are familiar with from desktop applications. It also makes it possible to script the pushing of data from any networked machine into Hadoop, including upstream preprocessing of data from other systems. The flexibility to access HDFS-resident data from Hadoop and non-Hadoop applications frees users to build more flexible big data workflows. For example, a customer may analyze a piece of data with SPSS. As part of that workflow, they may use a series of ETL steps to manipulate the data, and a MapReduce program may be the best way to execute those ETL processes. Building this workflow on plain HDFS would require additional steps, as well as moving data in and out of HDFS. Using NFS-enabled Hadoop simplifies the architecture and minimizes the data movement.
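The scripted push with upstream preprocessing described above could look something like the following sketch. The function name `push_clean_copy`, the blank-line filter, and the paths are all hypothetical; the only assumption about the cluster is that it is mounted somewhere writable over NFS. The output is written sequentially from start to finish, so the sketch does not rely on random-write support:

```python
import os


def push_clean_copy(source_path, dest_dir):
    """Drop blank lines from a text file and write the result into dest_dir.

    dest_dir is treated as an ordinary directory; in practice it would be
    a directory on the NFS-mounted cluster. Returns the output path.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest_path = os.path.join(dest_dir, os.path.basename(source_path))
    with open(source_path, encoding="utf-8") as src, \
            open(dest_path, "w", encoding="utf-8") as dst:
        for line in src:
            # Preprocessing step: keep only lines with visible content.
            if line.strip():
                dst.write(line)
    return dest_path
```

A call such as `push_clean_copy("/var/log/app/events.txt", "/mnt/hdfs/incoming")` (both paths hypothetical) would then land the cleaned file in HDFS, where a MapReduce job could pick it up.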
MapR is slightly ahead in this game: it supports random reads and writes, allowing files to be modified, overwritten, and read as required, and it enables multiple concurrent reads and writes on any file. However, other distributions are quickly catching up by adding support for version 4 of the NFS protocol, high availability, and security integration. There is no doubt about the value NFS provides in making Hadoop much easier and less expensive to use. Signs are positive that NFS-enabled Hadoop will become standard across all the distributions rather than a niche feature.
Sachin Sinha is director of big data analytics at ThrivOn. In this role, Mr. Sinha is responsible for design of innovative architectures, development of methodologies, and delivery of solutions in analytics, business intelligence, and data warehousing that help clients realize maximum value from their data assets. For over a decade, Mr. Sinha has designed, architected, and delivered data integration, data warehousing, analytics, and business intelligence solutions. Specializing in information management, Mr. Sinha's domestic and international consulting portfolio includes a broad array of organizations in the financial services, insurance, health care, pharmaceutical, and energy industries. You can contact the author at firstname.lastname@example.org.