Cluster management is currently supported through a script that invokes the Thrift interface; the entry point is bin/JstormYarn.
jstorm-on-yarn uses a "wholesale" mode, unlike spark-on-yarn, which uses a "resale" mode: in resale mode a new AM is generated each time an application is submitted to the cluster, whereas the AM of jstorm-on-yarn is permanent. That is, a JStorm cluster has exactly one AM.
Wholesale mode raises utilization by up to 50%. For example, in our production environment resale mode (spark-on-yarn) could only use 60-70% of the physical resources, while wholesale mode can reach nearly 90%; that is a wide gap. The reasons:
* Flexible resource sharing fits many workloads better. For example, suppose application 1 uses more memory in period A and more CPU cores in period B, while application 2 uses more network IO in period A and more memory in period B. Under resale mode, the total resources needed equal the per-resource peak of application 1 plus the per-resource peak of application 2. Wholesale mode can share resources between applications over their whole lifecycle, reducing waste.
* Under resale mode, containers often crash: with the default GC strategy, most applications never trigger an old-generation GC, so when heap memory runs out the NM kills the container and the AM starts another container somewhere else. That costs scheduling overhead and state management. With wholesale mode this situation is completely avoided.
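The resource-accounting argument above can be sketched as follows. The per-period usage numbers are hypothetical, not measured values; the point is that resale mode must reserve each application's own per-resource peak, while wholesale mode only needs the peak of the combined usage.

```python
# Hypothetical usage per period: {period: {resource: amount}}.
app1 = {"A": {"mem": 40, "cpu": 8}, "B": {"mem": 10, "cpu": 24}}
app2 = {"A": {"mem": 10, "cpu": 8}, "B": {"mem": 40, "cpu": 8}}

def resale_total(*apps):
    # Resale mode: each application reserves its own peak of every
    # resource across all periods; reservations simply add up.
    total = {}
    for app in apps:
        for res in {r for period in app.values() for r in period}:
            peak = max(period.get(res, 0) for period in app.values())
            total[res] = total.get(res, 0) + peak
    return total

def wholesale_total(*apps):
    # Wholesale mode: resources are shared across applications, so only
    # the peak of the summed usage in any single period matters.
    resources = {r for app in apps for period in app.values() for r in period}
    periods = {p for app in apps for p in app}
    return {res: max(sum(app.get(p, {}).get(res, 0) for app in apps)
                     for p in periods)
            for res in resources}

# With these numbers, resale reserves 80 memory units but wholesale
# only 50, because the two memory peaks fall in different periods.
assert resale_total(app1, app2) == {"mem": 80, "cpu": 32}
assert wholesale_total(app1, app2) == {"mem": 50, "cpu": 32}
```

The gap grows when applications' peaks for the same resource land in different periods, which is exactly the scenario described above.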
So currently jstorm-on-yarn works as follows: there is one AM acting as the overall controller. It is responsible for the following tasks:
* Create nimbus (and automatically restart nimbus when it hangs)
* Receive requests and create supervisors
* Expand and shrink cluster capacity dynamically
Submitting, killing, and other topology operations interact with the nimbus and supervisors in the containers directly, just as with a standalone JStorm cluster; they do not interact with the AM.
Startup begins with start-JstormYarn.sh, which calls the JStormOnYarn class to create the AM; this class is essentially JStorm's YARN client.
The class is quite simple. It mainly sets parameters for the nimbus container, such as memory, CPU cores, jar, libjars, and the shell script (in practice start_jstorm.sh, which starts nimbus and supervisors).
It then uploads the jar and shell script to HDFS, and finally calls yarnClient.submitApplication(appContext) to create the AM.
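The client-side flow above can be sketched like this. The helper objects and parameter values are hypothetical stand-ins for illustration only; the real implementation is Java code built on the YARN client API.

```python
# Sketch of the JStormOnYarn client flow (hypothetical helpers; the real
# code is Java using the YARN client API).
class FakeHdfs:
    def __init__(self):
        self.files = []
    def upload(self, path):
        # Stands in for copying a local file to HDFS.
        self.files.append(path)

class FakeYarnClient:
    def submitApplication(self, app_context):
        # Stands in for asking the ResourceManager to launch the AM.
        return "application_0001"

def submit_jstorm_am(yarn_client, hdfs):
    # 1. Set nimbus-container parameters (values here are illustrative).
    app_context = {
        "memory_mb": 4096,
        "vcores": 2,
        "jar": "jstorm-on-yarn.jar",
        "shellscript": "start_jstorm.sh",  # starts nimbus and supervisors
    }
    # 2. Upload the jar and the shell script to HDFS.
    hdfs.upload(app_context["jar"])
    hdfs.upload(app_context["shellscript"])
    # 3. Submit; YARN launches the single, permanent AM for the cluster.
    return yarn_client.submitApplication(app_context)

hdfs = FakeHdfs()
app_id = submit_jstorm_am(FakeYarnClient(), hdfs)
```

The key point is the ordering: the uploaded files must be in HDFS before submission, because the AM container is launched from them.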
This step is implemented through the Thrift interface: the client side is JstormAM.py (automatically generated by Thrift), the server-side generated RPC interface is JstormAM, and the real processing logic lives in the JstormAMHandler class (the same pattern as Thrift usage in JStorm itself). When the AM is created, the handler initializes the Thrift server and listens for client requests.
Two important methods are addSupervisors and startNimbus. They have similar implementations: in essence, each specifies the number of containers and the CPU core count to request. Actually launching the supervisor or nimbus is done in JStormMaster, which decides whether to launch a supervisor or a nimbus based on the container's priority property.
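The priority-based dispatch in JStormMaster can be sketched as follows. The priority constants here are assumptions for illustration; check the actual values in the source.

```python
# Hypothetical priority constants: JStormMaster tells the role of a newly
# allocated container apart by the priority it was requested with.
NIMBUS_PRIORITY = 0
SUPERVISOR_PRIORITY = 1

def role_for_container(priority):
    """Decide what to launch in a container from its request priority."""
    if priority == NIMBUS_PRIORITY:
        return "nimbus"
    if priority == SUPERVISOR_PRIORITY:
        return "supervisor"
    raise ValueError(f"unknown container priority: {priority}")
```

Encoding the role in the request priority works because YARN hands the priority back with each allocated container, so the AM needs no extra bookkeeping to match allocations to requests.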
Note that when jstorm-on-yarn creates a supervisor, it does not only request the memory for the supervisor process itself: it requests a large chunk of memory (such as 20 or 40 GB), and the container holds all the workers under that supervisor. If the container dies for some reason, all of its workers are killed, which is harsher than in a standalone cluster.
To ease maintenance, add the following configuration to every storm.yaml in the YARN cluster:
jstorm.on.yarn: true
supervisor.childopts: -Djstorm-on-yarn=true
jstorm.log.dir: /home/yarn/jstorm_logs/<cluster_name>
If you need to enable nimbus auto-restart, also add the following configuration:
blobstore.dir: /yourhdfsdir
blobstore.hdfs.hostname: yourhdfshost
blobstore.hdfs.port: yourhdfsport
The configuration above overrides the default log path; all logs are redirected under /home/yarn so they are not lost after a container hangs. If more than one supervisor container is launched on the same machine, the value of supervisor.deamon.logview.port (a port YARN dynamically generates from each container's port range) is appended to the supervisor log name, so the logs are distinguishable and do not clobber each other. For example, if two supervisors' HTTP ports are 9000 and 9001, then supervisor-9000.log and supervisor-9001.log are generated. koala will add the specified port suffix to the log according to the
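The log-naming rule can be sketched as a small helper. The function name is hypothetical; it only illustrates the suffix scheme described above.

```python
def supervisor_log_name(logview_port=None):
    """Build the supervisor log file name.

    When several supervisor containers share a machine, the YARN-assigned
    supervisor.deamon.logview.port value is appended so their logs do not
    overwrite each other (hypothetical helper, illustrating the scheme).
    """
    if logview_port is None:
        return "supervisor.log"
    return f"supervisor-{logview_port}.log"

# Two supervisors on one machine with HTTP ports 9000 and 9001
# get separate log files.
assert supervisor_log_name(9000) == "supervisor-9000.log"
assert supervisor_log_name(9001) == "supervisor-9001.log"
```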
TO BE ADDED
See: http://zh.hortonworks.com/get-started/yarn/