Wednesday, 24 April 2019

Hadoop Configuration, Default Configuration and Configuration Precedence

Everything in Hadoop is configurable and everything has a default value (maybe not everything, but nearly everything!) 

Hadoop Configuration Basics: 
  • Every host participating in a Hadoop cluster, including clients and servers, has its own local set of configuration files, typically stored in /etc/hadoop/conf (often a symbolic link to a physical directory in $HADOOP_HOME). 
  • Most of the configuration files for core components are written in XML, with <property>, <name> and <value> tags for individual configuration properties. 
  • Many configuration properties are specific to a particular process, such as the DataNode process. These properties can also exist on nodes that are not running that process, where they are simply ignored. 
  • Changes to configuration specific to a particular daemon, such as the DataNode daemon, often require a restart of that service before they are read and take effect. The most common cause of a Hadoop daemon failing to start is an error in its configuration: an XML markup error such as a missing closing tag, a misspelled property name, or an incorrect value. 
  • Configuration properties are routinely added and removed in new releases, and existing properties can be deprecated. Often, but not always, the older analogous property names are still accepted. For instance, the core-site.xml property fs.default.name changed to fs.defaultFS in release 2.x of Hadoop; however, fs.default.name is still accepted, as the sketch below shows. 
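
As a concrete illustration, here is a minimal core-site.xml sketch using the current property name; the NameNode host and port are hypothetical placeholders:

<!-- Minimal core-site.xml sketch; namenode.example.com:8020 is a hypothetical address -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>

Setting the deprecated fs.default.name instead still works, but Hadoop logs a deprecation warning pointing you to fs.defaultFS.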

Configuration Defaults: 

Almost every configuration property has a default value. 
We can find the default values in the *-default.xml documentation (core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml) for the specific configuration file and release we are interested in. 
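
For example, hdfs-default.xml documents the default replication factor for HDFS blocks; a representative entry (description paraphrased) looks like this:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default block replication factor for new files.</description>
</property>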


For any particular application, such as a MapReduce application, you can often find the complete set of configuration properties submitted with the application, including defaults and user-defined settings, using the Configuration link in the ApplicationMaster UI (accessible from the YARN ResourceManager UI). 

Configuration Precedence: 

We know that Hadoop configuration properties can be set on master nodes, slave nodes, and even clients. In addition, if a property is not set, a default value for that property will often be used. However, when a property is set to different values in more than one location, configuration precedence comes into play. 

  • The first order of precedence for a configuration setting (beyond the defaults) is the *-site.xml file on the slave node or nodes that an application interacts with during the submission process. 

  • The next higher order of precedence is the *-site.xml file(s) on the client machine. Any property values supplied here supersede values set in the equivalent files on the slave node host(s). 

  • The highest precedence for a given configuration property value is assigned to the application being submitted, whether set in the Job object or supplied via command-line arguments such as -D dfs.replication=4. 

Conclusion: The developer submitting the application has the highest precedence when it comes to configuration, as the sketch below illustrates. 
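
To make the ordering concrete, here is a hypothetical walk-through for dfs.replication; every value and file location below is an illustrative assumption:

<!-- hdfs-site.xml on the slave node(s): first precedence beyond the default of 3 -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

<!-- hdfs-site.xml on the client machine: supersedes the slave-side value -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

<!-- Highest precedence: the submitted application itself, e.g.
     hadoop jar app.jar MyJob -D dfs.replication=4
     The job ultimately runs with dfs.replication=4. -->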
  
How to prevent a property from being overridden: 
Use a <final> tag in the configuration on your cluster (server) nodes to ensure that a particular configuration value cannot be overridden, irrespective of the order of precedence. 
  
Example: 
<property>  
  <name>dfs.blocksize</name>  
  <value>134217728</value>  
  <final>true</final>  
</property>  
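
Here 134217728 bytes is 128 MB, the default HDFS block size in Hadoop 2.x. Once a property is marked final, any later attempt to override it in the configuration load order (for example via a -D flag at submission time) is ignored, and Hadoop logs a warning about the attempt to override a final parameter.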
