HDFS Configuration Parameters (Detailed):

Common Properties (core-site.xml)

- Used to store many properties that are required by different processes in a cluster, including client processes, master node processes, and slave node processes.
fs.defaultFS

- fs.defaultFS is the property that specifies the filesystem to be used by the cluster. Most of the time this is HDFS. However, the filesystem used by the cluster could be an object store, such as S3 or Ceph, or a local or network filesystem. Note that the value specifies the filesystem scheme as well as hostname and port information for the filesystem.
- When using filesystems such as S3, you will need to provide additional AWS authentication parameters, including fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, to supply your credentials to the remote platform. This may also be the case with other remote filesystems or object stores.
- When high availability or federation is used, fs.defaultFS refers to a construct called a nameservice instead of specifying hosts explicitly (see the additional sketches after the example below).
Example:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mynamenode:8020</value>
</property>
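For illustration only, the snippets below sketch the two variations mentioned above. The bucket name, credential values, and nameservice name are all placeholders, not settings taken from a real cluster.
<!-- Hypothetical S3-backed filesystem: bucket and credentials are placeholders -->
<property>
<name>fs.defaultFS</name>
<value>s3://mybucket</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>MY_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>MY_SECRET_ACCESS_KEY</value>
</property>
<!-- Hypothetical HA/federated cluster: "mycluster" is a logical nameservice defined in hdfs-site.xml, not a hostname, so no port is given -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>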
hdfs-site.xml
dfs.namenode.name.dir and dfs.namenode.edits.dir

- The dfs.namenode.name.dir property specifies the location on the filesystem where the NameNode stores its on-disk metadata, specifically the fsimage file(s). This value is also used as the default directory for the edits files used for the NameNode's journaling function. However, that location can be set to a different directory using the dfs.namenode.edits.dir property.
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///disk1/dfs/nn,file:///disk2/dfs/nn</value>
</property>
Note that there are no spaces between the comma-delimited values. An example value is /opt/app/data01/hdfs/nn.
- A loss of the NameNode's metadata would result in the loss of all of the data stored in the cluster's distributed filesystem, potentially petabytes of data. The on-disk metadata structures are there to provide durability and crash consistency for the NameNode's metadata, which is otherwise stored in volatile memory on the NameNode host.
- The value for the dfs.namenode.name.dir property is a comma-separated list of directories (on a local or network filesystem, not on HDFS) where the fsimage files and edits files will be stored by default.
- The NameNode's metadata is written synchronously to each directory in the comma-separated list specified by dfs.namenode.name.dir. If any directory or volume in the list is unavailable, it is removed from the cached list of directories and no further attempts will be made to write to it until the NameNode is restarted.
- The parallel write operations to multiple directories provide additional fault tolerance for the NameNode's critical metadata functions. For this reason, you should always provide more than one directory, each residing on a different physical disk, volume, or disk controller, to minimize the risk of outages if one volume or channel fails.
- In some cases, such as non-HA deployments, you should specify an NFS mount point in the list of directories in dfs.namenode.name.dir, which will then store a copy of the metadata on another host, providing further fault tolerance and recovery options. Note that if you do this, you should soft mount and configure retries for the NFS mount point; otherwise the NameNode process may hang if the mount is temporarily unavailable for whatever reason (see the sketch after this list).
- The dfs.namenode.name.dir and dfs.namenode.edits.dir properties are read by the NameNode daemon upon startup. Any changes to these properties require a NameNode restart to take effect.
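A minimal sketch of the layout described above; every path and mount point here is hypothetical. The NFS directory should be soft mounted with retries configured (for example, NFS options along the lines of soft,timeo=30,retrans=3) so that the NameNode does not hang if the mount becomes unavailable.
<!-- Two local disks plus an NFS mount for an off-host copy of the metadata (hypothetical paths) -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///disk1/dfs/nn,file:///disk2/dfs/nn,file:///mnt/nfs/dfs/nn</value>
</property>
<!-- Optionally keep the edits files in their own directories -->
<property>
<name>dfs.namenode.edits.dir</name>
<value>file:///disk1/dfs/edits,file:///disk2/dfs/edits</value>
</property>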
dfs.namenode.checkpoint.dir/period/txns

There are three significant configuration properties that relate to the checkpointing function.

- The dfs.namenode.checkpoint.dir property is a comma-separated list of directories, analogous to the dfs.namenode.name.dir property discussed earlier, used to store the temporary fsimage (and, by default, edits) files that are merged during the checkpointing process.
- The dfs.namenode.checkpoint.period property specifies the maximum delay between two consecutive checkpoints, with a default value of one hour.
- The dfs.namenode.checkpoint.txns property specifies the number of transactions at which the NameNode will force a checkpoint. Checkpointing occurs when either threshold is reached (time interval or transactions). A minimal example follows.
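For reference, a minimal example using values at or near the shipped defaults (one hour and one million transactions); the checkpoint directory path is illustrative only. Note that dfs.namenode.checkpoint.period is expressed in seconds.
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/disk1/dfs/snn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
</property>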
dfs.datanode.data.dir

- dfs.datanode.data.dir is the property that specifies where the DataNode will store the physical HDFS blocks. Like the dfs.namenode.name.dir property, this too is a comma-separated list of directories with no spaces between values.
- Unlike the dfs.namenode.name.dir setting, writes to the directories specified in the dfs.datanode.data.dir property are performed in a round-robin fashion (i.e., the first block on one directory, the next block on the next, and so on).
- The configuration for each DataNode may differ: on different slave nodes, the volumes and directories may vary (a short example follows). However, when planning your cluster, it is best to homogenize as many configuration settings as possible, making the cluster easier to manage.
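As an illustration (the paths are hypothetical), a slave node with only two data disks might carry a much shorter list than the twelve-disk example shown later in this post:
<property>
<name>dfs.datanode.data.dir</name>
<value>/data01/hdfs/dn,/data02/hdfs/dn</value>
</property>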
dfs.datanode.du.reserved

- HDFS is a “greedy” filesystem. Volumes associated with the directories on the DataNodes specified by dfs.datanode.data.dir will be filled 100% with HDFS block data if left unmanaged. This is problematic because slave nodes require working space for intermediate data storage. If available disk storage is completely consumed by HDFS block storage, data locality suffers because processing activities may not be possible on the node.
- The dfs.datanode.du.reserved configuration property in hdfs-site.xml specifies the amount of space in bytes on each volume that must be reserved, and thus cannot be used for HDFS block storage. It is generally recommended to set this value to 25% of the available space, or at least 1 GB, depending upon the local storage on each DataNode (see the worked example below).
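The value is given in bytes per volume. As a rough worked example, reserving 25% of a 100 GB volume means 25 GB, which is 25 × 1024 × 1024 × 1024 = 26843545600 bytes; the snippet below uses that figure purely for illustration.
<!-- Reserve roughly 25 GB per volume for non-HDFS use (illustrative value only) -->
<property>
<name>dfs.datanode.du.reserved</name>
<value>26843545600</value>
</property>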
dfs.blocksize

- The dfs.blocksize property specifies the block size in bytes for new files written by clients, including files produced as the result of an application or job run by the client. The default is 134217728 bytes, or 128 MB.
- Although commonly thought of as a cluster- or server-based setting, dfs.blocksize is actually a client setting.
- This property can be influenced by administrators using the <final> tag on a server node, as discussed earlier (a sketch follows).
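A minimal sketch of that approach: the <final> element in a Hadoop configuration file prevents the property from being overridden by later configuration resources.
<!-- Server-side hdfs-site.xml: mark the block size as final so later configuration resources cannot override it -->
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
<final>true</final>
</property>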
dfs.replication

- The dfs.replication property, located in the hdfs-site.xml file, determines the number of block replicas created when a file is written to HDFS.
- The default is 3, which is also the recommended value.
- As with the dfs.blocksize property, the dfs.replication property is a client-side setting (a sketch follows).
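Because the value is read on the client side, a client can request a different replication factor for the files it writes, for example by carrying a lower value in its own configuration, as in the sketch below (the value 2 is illustrative). The replication factor of files that already exist in HDFS can be changed afterwards with hdfs dfs -setrep.
<!-- Client-side hdfs-site.xml: files written by this client default to 2 replicas (illustrative) -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>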
Examples of the above properties:
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/app/data01/hdfs/nn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/opt/app/data01/hdfs/snn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>${dfs.namenode.checkpoint.dir}</value>
</property>
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>21600</value>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>10000000</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/data/data01/hdfs/dn,/opt/data/data02/hdfs/dn,/opt/data/data03/hdfs/dn,/opt/data/data04/hdfs/dn,/opt/data/data05/hdfs/dn,/opt/data/data06/hdfs/dn,/opt/data/data07/hdfs/dn,/opt/data/data08/hdfs/dn,/opt/data/data09/hdfs/dn,/opt/data/data10/hdfs/dn,/opt/data/data11/hdfs/dn,/opt/data/data12/hdfs/dn</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.replication.max</name>
<value>50</value>
</property>