The Hadoop Distributed File System (HDFS) follows established distributed file system design principles. It runs on commodity hardware and is designed to be highly fault-tolerant and robust.
HDFS stores very large amounts of data and provides fast access to it. The data is spread across many machines, which both protects against data loss when an individual machine fails and keeps access quick. This distributed layout also makes the data available to many applications for parallel processing.
Significant features of the Hadoop Distributed File System
- Suitable for distributed storage as well as distributed processing.
- A command-line interface for interacting with the file system.
- File permissions and authentication.
- Built-in health checks for the nodes and the cluster.
- File system data can be accessed via streaming (see the sketch after this list).
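To make the streaming-access point concrete, here is a minimal sketch in Java using Hadoop's FileSystem API. The cluster URI (hdfs://namenode:9000) and the file path are placeholders for illustration, not values from this article.

```java
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamingRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "hdfs://namenode:9000" is a placeholder; use your cluster's name node address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Open the file and stream its contents to stdout in 4 KB chunks.
        try (InputStream in = fs.open(new Path("/data/example.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```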
The Architecture
The Hadoop Distributed File System follows a master-slave architecture and has the following main components:
Name node
The name node runs on commodity hardware with a Linux operating system and the name node software installed. The machine running the name node software acts as the master server and is responsible for the following tasks:
- Regulating client access to files.
- Managing the file system namespace.
- Executing file system operations such as opening, closing, and renaming files and directories (a small example follows this list).
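As a rough illustration of the namespace operations the name node coordinates, the sketch below creates a directory, renames a file, and lists a directory through the same FileSystem API; the cluster URI and paths are hypothetical.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:9000"), new Configuration());

        // Each of these calls is a metadata operation resolved by the name node.
        fs.mkdirs(new Path("/reports/2024"));               // create a directory
        fs.rename(new Path("/reports/draft.csv"),
                  new Path("/reports/2024/final.csv"));     // rename/move a file

        // Listing a directory also touches only namespace metadata.
        for (FileStatus status : fs.listStatus(new Path("/reports/2024"))) {
            System.out.println(status.getPath() + " " + status.getLen() + " bytes");
        }
    }
}
```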
Data node
A data node is commodity hardware running a Linux operating system and the data node software. Every node in the cluster runs a data node, which manages the storage attached to that machine. Their main tasks are:
- Creating, deleting, and replicating blocks according to the instructions of the name node.
- Serving read and write requests from clients on the file system (see the write sketch after this list).
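To illustrate the write path, the following sketch creates a file and streams bytes into it; behind the scenes the client asks the name node where to place each block and then writes the data to the chosen data nodes. The path and payload are made up for the example.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:9000"), new Configuration());

        // The name node assigns blocks; the actual bytes are stored on the data nodes.
        try (FSDataOutputStream out = fs.create(new Path("/logs/events.txt"))) {
            out.write("first event record\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```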
Block
User data in HDFS is stored in files. Each file is divided into one or more segments, which are stored on separate data nodes. These segments are known as blocks. Simply put, a block is the smallest unit of data that HDFS reads or writes. The default block size is 64 MB (128 MB in Hadoop 2.x and later), but it is not set in stone and can be changed to suit your requirements.
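If you want to override the default block size, one common approach (sketched below, not the only one) is to set dfs.blocksize in the client configuration or to pass an explicit block size when creating a file. The 256 MB value, paths, and URI here are just examples.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files created by this client: 256 MB
        // ("dfs.blocksize" is the Hadoop 2.x+ property name).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Alternatively, set the block size per file via the long-form create():
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(
                new Path("/data/large-dataset.bin"),
                true,                // overwrite if it exists
                4096,                // client buffer size in bytes
                (short) 3,           // replication factor
                256L * 1024 * 1024   // block size in bytes
        )) {
            out.write(new byte[]{0x00});
        }
    }
}
```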
With this architecture, the Hadoop Distributed File System aims to provide a specific set of benefits to organizations working with Big Data. They are:
- Scalability: HDFS handles varying workloads, and for huge datasets capacity can be grown simply by adding more nodes to the cluster.
- Fault detection and recovery: Because a cluster contains a large amount of commodity hardware, component failures are frequent. HDFS detects failures automatically and redistributes the affected work to healthy nodes so that processing continues.
- Hardware and data proximity: By keeping computation close to where the data is stored, a requested job completes efficiently. Especially when huge datasets are in play, this reduces network overhead while increasing throughput.