Wednesday, 24 April 2019

Other Hadoop Environment Scripts and Configuration Files
We are already familiar with the main configuration files such as core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml. Apart from these, the Hadoop configuration directory contains several other configuration files and environment scripts.
[sukul@server1 ~]$ cd $HADOOP_HOME/conf
[sukul@server1 conf]$ ls *xml
capacity-scheduler.xml  hadoop-policy.xml  mapred-site.xml  ssl-server.xml
core-site.xml           hdfs-site.xml      ssl-client.xml   yarn-site.xml
[sukul@server1 conf]$ ls *sh
hadoop-env.sh  mapred-env.sh  yarn-env.sh

A] hadoop-env.sh / yarn-env.sh / mapred-env.sh
  • The hadoop-env.sh script is used to source environment variables for the Hadoop daemons and processes.
  • This can include daemon JVM settings such as heap size or Java options, as well as basic variables required by many processes such as HADOOP_LOG_DIR or JAVA_HOME. (The following shows just a few lines of the hadoop-env.sh script.)
[sukul@server1 conf]$ cat hadoop-env.sh  | grep -v '^#' | sed '/^$/d'
export JAVA_HOME=/opt/app/java/jdk/jdk180/
export HADOOP_HOME_WARN_SUPPRESS=1
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/2.6.5.4-1/hadoop}
export JSVC_HOME=/usr/lib/bigtop-utils
export HADOOP_HEAPSIZE="4096"
export HADOOP_NAMENODE_INIT_HEAPSIZE="-Xms233472m"
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
export HADOOP_NAMENODE_OPTS="-server -XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly -XX:CMSInitiatingOccupancyFraction=70 -XX:ErrorFile=/opt/log/hadoop/$USER/hs_err_pid%p.log -XX:NewSize=25600m -XX:MaxNewSize=25600m -XX:MetaspaceSize=128m -XX:MaxMetaspaceSize=256m -Xloggc:/opt/log/hadoop/$USER/gc.log-`date +'%Y%m%d%H%M'` -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -Xms233472m -Xmx233472m -Dhadoop.security.logger=INFO,DRFAS -Dhdfs.audit.logger=INFO,RFAAUDIT ${HADOOP_NAMENODE_OPTS}"
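
A quick way to confirm that these JVM settings were actually picked up by a running daemon is to look at the daemon's Java process. A small sketch (assuming jps is available on the PATH and is run as a user that can see the Hadoop JVMs):

# list running Hadoop JVMs together with the flags they were started with
jps -v | grep -i namenode
# or inspect the full command line of the NameNode process
ps -ef | grep '[N]ameNode'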

  • Basically, if you need to pass an environment variable to any Hadoop process, the hadoop-env.sh file is the place to do it, as it is sourced by all Hadoop control scripts (a small sketch of adding a custom variable follows the yarn-env.sh listing below).
  • Similarly, there are other environment shell scripts such as yarn-env.sh and mapred-env.sh that are used by the corresponding processes to source the environment variables they need. (The following shows just a few lines of the mapred-env.sh and yarn-env.sh scripts.)

[sukul@server1 conf]$ cat mapred-env.sh  | grep -v '^#' | sed '/^$/d'
export HADOOP_JOB_HISTORYSERVER_HEAPSIZE=16384
export HADOOP_MAPRED_ROOT_LOGGER=INFO,RFA
export HADOOP_JOB_HISTORYSERVER_OPTS="-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/opt/log/hadoop-mapreduce/mapred/gc_trace.log -XX:ErrorFile=/opt/log/hadoop-mapreduce/mapred/java_error.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/log/hadoop-mapreduce/mapred/heap_dump.hprof"
export HADOOP_OPTS="-Dhdp.version=$HDP_VERSION $HADOOP_OPTS"
[sukul@server1 conf]$ cat yarn-env.sh  | grep -v '^#' | sed '/^$/d'
export HADOOP_YARN_HOME=/usr/hdp/2.6.5.4-1/hadoop-yarn
export YARN_LOG_DIR=/opt/log/hadoop-yarn/$USER
export YARN_PID_DIR=/var/run/hadoop-yarn/$USER
export HADOOP_LIBEXEC_DIR=/usr/hdp/2.6.5.4-1/hadoop/libexec
export JAVA_HOME=/opt/app/java/jdk/jdk180/
export HADOOP_YARN_USER=${HADOOP_YARN_USER:-yarn}
export YARN_CONF_DIR="${YARN_CONF_DIR:-$HADOOP_YARN_HOME/conf}"
if [ "$JAVA_HOME" != "" ]; then
  #echo "run java in $JAVA_HOME"
  JAVA_HOME=$JAVA_HOME
fi
if [ "$JAVA_HOME" = "" ]; then
  echo "Error: JAVA_HOME is not set."
  exit 1
fi
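
  • For example, to make an extra jar or a custom Java system property visible to every Hadoop process, it can simply be exported from hadoop-env.sh. A minimal sketch (the paths are hypothetical, not part of the sample cluster above):

# added at the end of hadoop-env.sh -- picked up by all Hadoop control scripts
export HADOOP_CLASSPATH="/opt/custom/lib/my-extra.jar:${HADOOP_CLASSPATH}"
export HADOOP_OPTS="-Djava.io.tmpdir=/opt/tmp/hadoop ${HADOOP_OPTS}"

  • After editing hadoop-env.sh, the affected daemons have to be restarted for the change to take effect.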

B] log4j.properties
  • Hadoop uses Log4j (the Java logging framework) to store and manage its log files. Log files are produced by nearly every process in Hadoop, including daemons, applications, and tasks.
  • The log4j.properties file provides the configuration for log file management, including how log records are written, where they are written, and how log files are rotated and retained.
  • The following shows a sample log4j.properties file:
[sukul@server1 ~]$ cd $HADOOP_HOME/conf
[sukul@server1 conf]$ ls log4j.properties
log4j.properties
[sukul@server1 conf]$ cat log4j.properties | grep -v '^#' | sed '/^$/d'
hadoop.root.logger=INFO,console
hadoop.log.dir=.
hadoop.log.file=hadoop.log
log4j.rootLogger=${hadoop.root.logger}, EventCounter
log4j.threshhold=ALL
log4j.appender.RFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.RFA.File=${hadoop.log.dir}/${hadoop.log.file}
log4j.appender.RFA.DatePattern=.yyyy-MM-dd
log4j.appender.RFA.MaxBackupIndex=45
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
hadoop.tasklog.taskid=null
hadoop.tasklog.iscleanup=false
hadoop.tasklog.noKeepSplits=4
hadoop.tasklog.totalLogFileSize=100
hadoop.tasklog.purgeLogSplits=true
hadoop.tasklog.logsRetainHours=12
log4j.appender.TLA=org.apache.hadoop.mapred.TaskLogAppender
log4j.appender.TLA.taskId=${hadoop.tasklog.taskid}
log4j.appender.TLA.isCleanup=${hadoop.tasklog.iscleanup}
log4j.appender.TLA.totalLogFileSize=${hadoop.tasklog.totalLogFileSize}
log4j.appender.TLA.layout=org.apache.log4j.PatternLayout
log4j.appender.TLA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
hadoop.security.logger=INFO,console
hadoop.security.log.maxfilesize=256MB
hadoop.security.log.maxbackupindex=20
log4j.category.SecurityLogger=WARN,console
hadoop.security.log.file=SecurityAuth.audit
log4j.appender.DRFAS=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFAS.File=${hadoop.log.dir}/${hadoop.security.log.file}
log4j.appender.DRFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.DRFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.DRFAS.DatePattern=.yyyy-MM-dd
log4j.appender.RFAS=org.apache.log4j.RollingFileAppender
log4j.appender.RFAS.File=${hadoop.log.dir}/${hadoop.security.log.file}
log4j.appender.RFAS.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAS.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
log4j.appender.RFAS.MaxFileSize=${hadoop.security.log.maxfilesize}
log4j.appender.RFAS.MaxBackupIndex=${hadoop.security.log.maxbackupindex}
hdfs.audit.logger=INFO,console
log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=${hdfs.audit.logger}
log4j.additivity.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit=false
log4j.appender.RFAAUDIT=org.apache.log4j.RollingFileAppender
log4j.appender.RFAAUDIT.File=${hadoop.log.dir}/hdfs-audit.log
log4j.appender.RFAAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.RFAAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.RFAAUDIT.MaxBackupIndex=180
log4j.appender.RFAAUDIT.MaxFileSize=16106127360
mapred.audit.logger=INFO,console
log4j.logger.org.apache.hadoop.mapred.AuditLogger=${mapred.audit.logger}
log4j.additivity.org.apache.hadoop.mapred.AuditLogger=false
log4j.appender.MRAUDIT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.MRAUDIT.File=${hadoop.log.dir}/mapred-audit.log
log4j.appender.MRAUDIT.layout=org.apache.log4j.PatternLayout
log4j.appender.MRAUDIT.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.MRAUDIT.DatePattern=.yyyy-MM-dd
hadoop.metrics.log.level=INFO
log4j.logger.org.apache.hadoop.metrics2=${hadoop.metrics.log.level}
log4j.logger.org.jets3t.service.impl.rest.httpclient.RestS3Service=ERROR
log4j.appender.NullAppender=org.apache.log4j.varia.NullAppender
log4j.appender.EventCounter=org.apache.hadoop.log.metrics.EventCounter
log4j.logger.org.apache.hadoop.conf.Configuration.deprecation=WARN
log4j.logger.BlockStateChange=ERROR
log4j.logger.org.apache.hadoop.hdfs.StateChange=WARN
yarn.log.dir=.
hadoop.mapreduce.jobsummary.logger=${hadoop.root.logger}
hadoop.mapreduce.jobsummary.log.file=hadoop-mapreduce.jobsummary.log
log4j.appender.JSA=org.apache.log4j.DailyRollingFileAppender
yarn.server.resourcemanager.appsummary.log.file=hadoop-mapreduce.jobsummary.log
yarn.server.resourcemanager.appsummary.logger=${hadoop.root.logger}
log4j.appender.RMSUMMARY=org.apache.log4j.RollingFileAppender
log4j.appender.RMSUMMARY.File=${yarn.log.dir}/${yarn.server.resourcemanager.appsummary.log.file}
log4j.appender.RMSUMMARY.MaxFileSize=256MB
log4j.appender.RMSUMMARY.MaxBackupIndex=20
log4j.appender.RMSUMMARY.layout=org.apache.log4j.PatternLayout
log4j.appender.RMSUMMARY.layout.ConversionPattern=%d{ISO8601} %p %c{2}: %m%n
log4j.appender.JSA.layout=org.apache.log4j.PatternLayout
log4j.appender.JSA.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
log4j.appender.JSA.DatePattern=.yyyy-MM-dd
log4j.appender.JSA.layout=org.apache.log4j.PatternLayout
log4j.logger.org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary=${yarn.server.resourcemanager.appsummary.logger}
log4j.additivity.org.apache.hadoop.yarn.server.resourcemanager.RMAppManager$ApplicationSummary=false

  • In some cases individual components have their own specific log4j.properties files, which may also be located in the Hadoop configuration directory, such as kms-log4j.properties and httpfs-log4j.properties.
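  • For instance, to keep fewer rotated daemon log files or to quieten a chatty class, only a couple of lines in log4j.properties need to change. A minimal sketch (the values are illustrative, not recommendations):

# keep 30 rotated daemon log files instead of 45
log4j.appender.RFA.MaxBackupIndex=30
# route only this noisy class to ERROR without touching the root logger
log4j.logger.org.apache.hadoop.hdfs.StateChange=ERROR

  • The log level of a class in a running daemon can also be changed temporarily (until restart) with the hadoop daemonlog -setlevel command, without editing log4j.properties at all.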

C] hadoop-metrics.properties / hadoop-metrics2.properties
  • We may also have hadoop-metrics.properties and/or hadoop-metrics2.properties files in the Hadoop configuration directory. These define which application and platform metrics are collected and where they are sent. The following shows a sample hadoop-metrics2.properties:

[sukul@server1 conf]$ cat hadoop-metrics2.properties | grep -v '^#' | sed '/^$/d'
*.period=10
*.sink.timeline.plugin.urls=file:///usr/lib/ambari-metrics-hadoop-sink/ambari-metrics-hadoop-sink.jar
*.sink.timeline.class=org.apache.hadoop.metrics2.sink.timeline.HadoopTimelineMetricsSink
*.sink.timeline.period=10
*.sink.timeline.sendInterval=60000
*.sink.timeline.slave.host.name=serv084.zbc.xyz.com
*.sink.timeline.zookeeper.quorum=serv269.zbc.xyz.com:2181,serv271.zbc.xyz.com:2181,serv267.zbc.xyz.com:2181
*.sink.timeline.protocol=http
*.sink.timeline.port=6188
*.sink.timeline.truststore.path = /etc/security/clientKeys/all.jks
*.sink.timeline.truststore.type = jks
*.sink.timeline.truststore.password = bigdata
datanode.sink.timeline.collector.hosts=serv287.zbc.xyz.com
namenode.sink.timeline.collector.hosts=serv287.zbc.xyz.com
resourcemanager.sink.timeline.collector.hosts=serv287.zbc.xyz.com
nodemanager.sink.timeline.collector.hosts=serv287.zbc.xyz.com
jobhistoryserver.sink.timeline.collector.hosts=serv287.zbc.xyz.com
journalnode.sink.timeline.collector.hosts=serv287.zbc.xyz.com
maptask.sink.timeline.collector.hosts=serv287.zbc.xyz.com
reducetask.sink.timeline.collector.hosts=serv287.zbc.xyz.com
applicationhistoryserver.sink.timeline.collector.hosts=serv287.zbc.xyz.com
resourcemanager.sink.timeline.tagsForPrefix.yarn=Queue
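
  • The sample above ships metrics to the Ambari Metrics timeline collector. For a quick local look at what a daemon emits, the metrics2 framework also provides a simple file sink; a minimal sketch (file locations are arbitrary):

# write NameNode and DataNode metrics to plain files every 10 seconds
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
*.period=10
namenode.sink.file.filename=/tmp/namenode-metrics.out
datanode.sink.file.filename=/tmp/datanode-metrics.out

  • The prefix before .sink (namenode, datanode, and so on) selects which daemon the sink applies to, just as in the timeline sink example above.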

D] Other Configuration Files:   
  • slaves file: Used by the cluster start-up scripts in the Hadoop sbin directory; it contains the list of slave (worker) nodes, one hostname per line (a small sketch follows this list).

  • hadoop-policy.xml, kms-site.xml, or ssl-server.xml: configuration files related to security or access control policies, SSL configuration, or key management (a sample hadoop-policy.xml entry is sketched below).
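
  • A slaves file is just a plain list of worker hostnames; start-dfs.sh and start-yarn.sh ssh to each of these hosts to start the worker daemons. A small sketch (the hostnames are made up):

# one worker hostname per line
serv101.zbc.xyz.com
serv102.zbc.xyz.com
serv103.zbc.xyz.com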

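  • As an illustration, hadoop-policy.xml controls service-level authorization, i.e. which users or groups may call a given Hadoop protocol. A sketch of one such entry (the group name is hypothetical):

<property>
  <!-- who may talk to the HDFS client protocol: users hdfs,yarn plus group hadoopusers -->
  <name>security.client.protocol.acl</name>
  <value>hdfs,yarn hadoopusers</value>
</property>

  • The value is a comma-separated list of users, then a space, then a comma-separated list of groups; a value of * allows everyone.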
