Hadoop Notes: HDFS Rack Awareness

HDFS Rack Awareness

What is Rack Awareness? Rack awareness enables HDFS to understand a cluster topology that may include multiple racks of servers or multiple data centers, and to orchestrate its block placement accordingly. This serves the goals of data locality, fault tolerance, and resiliency.
Advantage of Rack Awareness: With a rack topology defined and rack awareness in use, HDFS can disperse data across racks or data centers to provide further fault tolerance in the event of a rack failure, a switch or network failure, or even a data center outage.
With a rack topology defined, Hadoop will place blocks on nodes according to the following strategy:

The first replica of a block is placed on a node in the cluster. This is the same node as the Hadoop client if the client is running within the cluster.
The second replica of a block is placed on a node residing on a different rack from the first replica.
The third replica, assuming a default replication factor of 3, is placed on a different node on the same rack as the second replica.
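The three placement rules above can be sketched as a small simulation. This is only an illustration of the strategy as described here, not HDFS's actual placement code: the function name place_replicas, the topology dictionary, and the node names are all hypothetical, and the sketch assumes at least two racks with at least two nodes each.

```python
import random

def place_replicas(topology, client_node=None, rng=random):
    """Pick three nodes for one block, following the three rules above.

    topology: dict mapping rack ID -> list of node names (hypothetical layout).
    client_node: node the writer runs on, or None if the client is outside the cluster.
    """
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    # First replica: the client's own node if it is in the cluster, else any node.
    first = client_node if client_node in node_rack else rng.choice(sorted(node_rack))
    # Second replica: a node on a different rack from the first.
    other_racks = [r for r in sorted(topology) if r != node_rack[first]]
    second = rng.choice(topology[rng.choice(other_racks)])
    # Third replica: a different node on the same rack as the second.
    third = rng.choice([n for n in topology[node_rack[second]] if n != second])
    return [first, second, third]

topology = {
    "/dc01/rack01": ["dc01rack01node01", "dc01rack01node02"],
    "/dc01/rack02": ["dc01rack02node01", "dc01rack02node02"],
}
replicas = place_replicas(topology, client_node="dc01rack01node02")
print(replicas)
```

With this layout the first replica lands on the client's node in rack01 and the other two land on the two nodes of rack02, so losing either rack still leaves at least one copy of the block.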

How is Rack Awareness implemented: Rack awareness is implemented using a user-supplied script, which can be written in any scripting language available on the cluster. Commonly used languages include Python, Ruby, and Bash. The script, called a rack topology script, provides Hadoop with an identifier for any given node, telling Hadoop which rack that node belongs to. A rack could be a physical rack (e.g., a 19” server rack), a particular subnet, or even an abstraction representing a data center.
The script needs to return a hierarchical location ID for each host passed in as a script argument. The hierarchical location ID is in the form /datacenter/rack. This can be pseudo-coded as follows: topology_script([datanode_hosts]) -> [rackids]
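The contract in the pseudo-code above (a list of hosts in, one location ID per host out) can be checked with a few lines of Python. This is a minimal sketch: check_resolver, example_resolver, and the regular expression describing a valid location ID are assumptions made for illustration, not part of Hadoop.

```python
import re

# One slash-prefixed component per level, e.g. /dc01/rack01.
# The allowed character set is an assumption for this sketch.
LOCATION_ID = re.compile(r"^(/[A-Za-z0-9_.-]+)+$")

def check_resolver(resolve, hosts):
    """Call resolve(hosts) and verify it returns one well-formed location ID per host."""
    rack_ids = resolve(hosts)
    if len(rack_ids) != len(hosts):
        raise ValueError("expected %d rack IDs, got %d" % (len(hosts), len(rack_ids)))
    for host, rack_id in zip(hosts, rack_ids):
        if not LOCATION_ID.match(rack_id):
            raise ValueError("bad location ID %r for host %r" % (rack_id, host))
    return dict(zip(hosts, rack_ids))

def example_resolver(hosts):
    # Hypothetical resolver: every host is assigned to the same rack.
    return ["/dc01/rack01" for _ in hosts]

mapping = check_resolver(example_resolver, ["node-a", "node-b"])
print(mapping)
```

A check like this is useful before deploying a topology script, since a resolver that returns too few IDs or a malformed ID is easier to catch here than on a running cluster.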
What happens when there is no topology script: HDFS still works if a rack topology script is not supplied. In that case, all nodes have a default location ID of /default-rack, so the whole cluster is treated as a single rack.
The script can be implemented in several different ways with no strict definition of how to accomplish this. For instance, you could embed the location ID in the hostname itself and use simple string manipulation to determine the location, as in the following example: dc01rack01node02. You could also create a lookup table or map and store this in a file or in a database.
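The lookup-table approach mentioned above might look like the following sketch. The two-column "host,location" layout, the function names, and the sample addresses are assumptions chosen for illustration; a real script would read the table from a file on disk rather than from an in-memory string.

```python
import csv
import io

DEFAULT_RACK = "/default-rack"

def load_rack_table(stream):
    """Read 'host,location_id' rows into a dict, skipping blank, short, and comment rows."""
    table = {}
    for row in csv.reader(stream):
        if len(row) < 2 or row[0].lstrip().startswith("#"):
            continue
        table[row[0].strip()] = row[1].strip()
    return table

def resolve(hosts, table):
    """Return one location ID per host, falling back to DEFAULT_RACK for unknown hosts."""
    return [table.get(h, DEFAULT_RACK) for h in hosts]

# An in-memory sample keeps the sketch runnable without a file on disk.
sample = io.StringIO(
    "# host,location\n"
    "10.0.1.12,/dc01/rack01\n"
    "10.0.2.30,/dc01/rack02\n"
)
table = load_rack_table(sample)
print("\n".join(resolve(["10.0.1.12", "10.0.9.9"], table)))
```

Compared with parsing the hostname, a table keeps working when hosts are renamed or identified by IP address, at the cost of having to keep the table in sync with the cluster.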
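Where is the topology script located: According to the Hadoop documentation, the master daemons (for HDFS, the NameNode) invoke the script to obtain rack IDs, so the script, its chosen interpreter, and any supporting data such as lookup files need to be available on those hosts, as well as on any client host whose configuration references the script.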
A sample rack topology script is provided below:

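#!/usr/bin/env python3
# input: hostname or IP address of one or more nodes, passed as arguments
# output: rack ID for each node, one per line
# example: input of dc01rack01node02 outputs /dc01/rack01
import sys

DEFAULT_RACK = "/dc01/default-rack"

for host in sys.argv[1:]:
    # Hostnames that embed the location are 16 characters long: dcNNrackNNnodeNN
    if len(host) == 16:
        dcid = host[:4]      # e.g. "dc01"
        rackid = host[4:10]  # e.g. "rack01"
        print("/" + dcid + "/" + rackid)
    else:
        # Anything else (an IP address, an unexpected name) goes to the default rack
        print(DEFAULT_RACK)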

This script is enabled using the net.topology.script.file.name configuration property in the core-site.xml configuration file:

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/rack-topology-script.py</value>
</property>


Wednesday, 24 April 2019
