HDFS Configuration Parameters (Detailed):

Common Properties (core-site.xml)

- Used to store many properties that are required by different processes in a cluster, including client processes, master node processes, and slave node processes.
fs.defaultFS

- fs.defaultFS is the property that specifies the filesystem to be used by the cluster. Most of the time this is HDFS. However, the filesystem used by the cluster could be an object store, such as S3 or Ceph, or a local or network filesystem. Note that the value specifies the filesystem scheme as well as hostname and port information for the filesystem.
- When using filesystems such as S3, you will need to provide additional AWS authentication parameters, including fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey, to supply your credentials to the remote platform. This may also be the case with other remote filesystems or object stores.
- When high availability or federation is used, fs.defaultFS refers to a construct called a nameservice instead of specifying hosts explicitly (see the additional sketches after the example below).
Example:
<property>
<name>fs.defaultFS</name>
<value>hdfs://mynamenode:8020</value>
</property>
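For illustration only, the snippets below sketch the two variations mentioned above. The bucket name, credential values, and nameservice name are all placeholders, not settings taken from a real cluster.
<!-- Hypothetical S3-backed filesystem: bucket and credentials are placeholders -->
<property>
<name>fs.defaultFS</name>
<value>s3://mybucket</value>
</property>
<property>
<name>fs.s3.awsAccessKeyId</name>
<value>MY_ACCESS_KEY_ID</value>
</property>
<property>
<name>fs.s3.awsSecretAccessKey</name>
<value>MY_SECRET_ACCESS_KEY</value>
</property>
<!-- Hypothetical HA/federated cluster: "mycluster" is a logical nameservice defined in hdfs-site.xml, not a hostname, so no port is given -->
<property>
<name>fs.defaultFS</name>
<value>hdfs://mycluster</value>
</property>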
hdfs-site.xml
dfs.namenode.name.dir and dfs.namenode.edits.dir

- The dfs.namenode.name.dir property specifies the location on the filesystem where the NameNode stores its on-disk metadata, specifically the fsimage file(s). This value is also used as the default directory for the edits files used for the NameNode's journaling function. However, that location can be set to a different directory using the dfs.namenode.edits.dir property.
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///disk1/dfs/nn,file:///disk2/dfs/nn</value>
</property>
Note that there are no spaces between the comma-delimited values. An example value is /opt/app/data01/hdfs/nn.
- A loss of the NameNode's metadata would result in the loss of all of the data stored in the cluster's distributed filesystem, potentially petabytes of data. The on-disk metadata structures are there to provide durability and crash consistency for the NameNode's metadata, which is otherwise stored in volatile memory on the NameNode host.
- The value for the dfs.namenode.name.dir property is a comma-separated list of directories (on a local or network filesystem, not on HDFS) where the fsimage files and edits files will be stored by default.
- The NameNode's metadata is written synchronously to each directory in the comma-separated list specified by dfs.namenode.name.dir. If any directory or volume in the list is unavailable, it is removed from the cached list of directories and no further attempts will be made to write to it until the NameNode is restarted.
- The parallel write operations to multiple directories provide additional fault tolerance for the NameNode's critical metadata functions. For this reason, you should always provide more than one directory, each residing on a different physical disk, volume, or disk controller, to minimize the risk of outages if one volume or channel fails.
- In some cases, such as non-HA deployments, you should specify an NFS mount point in the list of directories in dfs.namenode.name.dir, which will then store a copy of the metadata on another host, providing further fault tolerance and recovery options. Note that if you do this, you should soft mount and configure retries for the NFS mount point; otherwise the NameNode process may hang if the mount is temporarily unavailable for whatever reason (see the sketch after this list).
- The dfs.namenode.name.dir and dfs.namenode.edits.dir properties are read by the NameNode daemon upon startup. Any changes to these properties require a NameNode restart to take effect.
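A minimal sketch of the layout described above; every path and mount point here is hypothetical. The NFS directory should be soft mounted with retries configured (for example, NFS options along the lines of soft,timeo=30,retrans=3) so that the NameNode does not hang if the mount becomes unavailable.
<!-- Two local disks plus an NFS mount for an off-host copy of the metadata (hypothetical paths) -->
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///disk1/dfs/nn,file:///disk2/dfs/nn,file:///mnt/nfs/dfs/nn</value>
</property>
<!-- Optionally keep the edits files in their own directories -->
<property>
<name>dfs.namenode.edits.dir</name>
<value>file:///disk1/dfs/edits,file:///disk2/dfs/edits</value>
</property>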
dfs.namenode.checkpoint.dir/period/txns

There are three significant configuration properties that relate to the checkpointing function.

- The dfs.namenode.checkpoint.dir property is a comma-separated list of directories, analogous to the dfs.namenode.name.dir property discussed earlier, used to store the temporary fsimage (and, by default, edits) files that are merged during the checkpointing process.
- The dfs.namenode.checkpoint.period property specifies the maximum delay between two consecutive checkpoints, with a default value of one hour.
- The dfs.namenode.checkpoint.txns property specifies the number of transactions at which the NameNode will force a checkpoint. Checkpointing occurs when either threshold is reached (time interval or transactions). A minimal example follows.
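For reference, a minimal example using values at or near the shipped defaults (one hour and one million transactions); the checkpoint directory path is illustrative only. Note that dfs.namenode.checkpoint.period is expressed in seconds.
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/disk1/dfs/snn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
</property>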
dfs.datanode.data.dir

- dfs.datanode.data.dir is the property that specifies where the DataNode will store the physical HDFS blocks. Like the dfs.namenode.name.dir property, this too is a comma-separated list of directories with no spaces between values.
- Unlike the dfs.namenode.name.dir setting, writes to the directories specified in the dfs.datanode.data.dir property are performed in a round-robin fashion (i.e., the first block on one directory, the next block on the next, and so on).
- The configuration for each DataNode may differ: on different slave nodes, the volumes and directories may vary (a short example follows). However, when planning your cluster, it is best to homogenize as many configuration settings as possible, making the cluster easier to manage.
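As an illustration (the paths are hypothetical), a slave node with only two data disks might carry a much shorter list than the twelve-disk example shown later in this post:
<property>
<name>dfs.datanode.data.dir</name>
<value>/data01/hdfs/dn,/data02/hdfs/dn</value>
</property>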
dfs.datanode.du.reserved

- HDFS is a “greedy” filesystem. Volumes associated with the directories on the DataNodes specified by dfs.datanode.data.dir will be filled 100% with HDFS block data if left unmanaged. This is problematic because slave nodes require working space for intermediate data storage. If available disk storage is completely consumed by HDFS block storage, data locality suffers because processing activities may not be possible on the node.
- The dfs.datanode.du.reserved configuration property in hdfs-site.xml specifies the amount of space in bytes on each volume that must be reserved, and thus cannot be used for HDFS block storage. It is generally recommended to set this value to 25% of the available space, or at least 1 GB, depending upon the local storage on each DataNode (see the worked example below).
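The value is given in bytes per volume. As a rough worked example, reserving 25% of a 100 GB volume means 25 GB, which is 25 × 1024 × 1024 × 1024 = 26843545600 bytes; the snippet below uses that figure purely for illustration.
<!-- Reserve roughly 25 GB per volume for non-HDFS use (illustrative value only) -->
<property>
<name>dfs.datanode.du.reserved</name>
<value>26843545600</value>
</property>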
dfs.blocksize

- The dfs.blocksize property specifies the block size in bytes for new files written by clients, including files produced as the result of an application or job run by the client. The default is 134217728 bytes, or 128 MB.
- Although commonly thought of as a cluster- or server-based setting, dfs.blocksize is actually a client setting.
- This property can be influenced by administrators using the <final> tag on a server node, as discussed earlier (a sketch follows).
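A minimal sketch of that approach: the <final> element in a Hadoop configuration file prevents the property from being overridden by later configuration resources.
<!-- Server-side hdfs-site.xml: mark the block size as final so later configuration resources cannot override it -->
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
<final>true</final>
</property>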
dfs.replication

- The dfs.replication property, located in the hdfs-site.xml file, determines the number of block replicas created when a file is written to HDFS.
- The default is 3, which is also the recommended value.
- As with the dfs.blocksize property, the dfs.replication property is a client-side setting (a sketch follows).
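Because the value is read on the client side, a client can request a different replication factor for the files it writes, for example by carrying a lower value in its own configuration, as in the sketch below (the value 2 is illustrative). The replication factor of files that already exist in HDFS can be changed afterwards with hdfs dfs -setrep.
<!-- Client-side hdfs-site.xml: files written by this client default to 2 replicas (illustrative) -->
<property>
<name>dfs.replication</name>
<value>2</value>
</property>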
Examples of the above properties:
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/app/data01/hdfs/nn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/opt/app/data01/hdfs/snn</value>
</property>
<property>
<name>dfs.namenode.checkpoint.edits.dir</name>
<value>${dfs.namenode.checkpoint.dir}</value>
</property>
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>21600</value>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>10000000</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/data/data01/hdfs/dn,/opt/data/data02/hdfs/dn,/opt/data/data03/hdfs/dn,/opt/data/data04/hdfs/dn,/opt/data/data05/hdfs/dn,/opt/data/data06/hdfs/dn,/opt/data/data07/hdfs/dn,/opt/data/data08/hdfs/dn,/opt/data/data09/hdfs/dn,/opt/data/data10/hdfs/dn,/opt/data/data11/hdfs/dn,/opt/data/data12/hdfs/dn</value>
</property>
<property>
<name>dfs.datanode.du.reserved</name>
<value>1073741824</value>
</property>
<property>
<name>dfs.blocksize</name>
<value>134217728</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.replication.max</name>
<value>50</value>
</property>