What is the difference between start-all.sh and start-dfs.sh in Hadoop?
Hadoop's bin directory contains several scripts for launching the Hadoop DFS and Hadoop Map/Reduce daemons. The difference is in which daemons each script controls:
- start-dfs.sh – Starts the Hadoop DFS daemons (NameNode and DataNodes)
- stop-dfs.sh – Stops the Hadoop DFS daemons (NameNode and DataNodes)
- start-all.sh – Starts all Hadoop daemons (NameNode, DataNodes, JobTracker, TaskTrackers). Deprecated.
- stop-all.sh – Stops all Hadoop daemons (NameNode, DataNodes, JobTracker, TaskTrackers). Deprecated.
What is the difference between Map and Reduce and what are they used for?
Map and Reduce are terms used in Big Data processing. They are not opposites; they complement each other as two phases of the same job.
Map: A function that is applied to a set of input values and produces a set of intermediate key/value pairs.
Reduce: A function that takes the output of Map, grouped by key, and applies another function to combine the values for each key into a result.
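As a rough illustration of how the two functions fit together, here is a minimal word-count sketch in plain Python. It does not use Hadoop at all; the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for this example, and the `shuffle` step only imitates the grouping by key that a framework would do between the two phases.

```python
from collections import defaultdict


def map_words(line):
    """Map: turn one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key (stand-in for the framework's shuffle step)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, group by key, then reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "the lazy dog"]))
```

Map only ever looks at one line and Reduce only ever looks at one key, which is why the two functions work together rather than against each other.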
What is the reason behind using SSH in Hadoop? Is there a specific reason behind this requirement, or can we use Hadoop without SSH too?
I found the following mailing-list archive post very useful for this question. I will provide more details once I dig into the Hadoop scripts.
> It is not necessary to have SSH set up to run Hadoop, but it does make things easier. SSH is used by the scripts in the bin directory which start and stop daemons across the cluster (the slave nodes are defined in the slaves file), see the start-all.sh script as a starting point.
> These scripts are a convenient way to control Hadoop, but there are other possibilities. If you had another system to control daemons on your cluster then you wouldn't need SSH.
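To make the quoted explanation concrete, the sketch below shows roughly what "start daemons across the cluster over SSH" amounts to: read the hostnames from the slaves file and build one `ssh` command per host. This is an illustration only, not the actual logic of the Hadoop scripts. The helper names, the example hostnames, and the remote command `bin/hadoop-daemon.sh start datanode` are assumptions for the example, and the code only prints the commands instead of running them.

```python
def read_slaves(text):
    """Parse slaves-file content: one hostname per line, '#' starts a comment."""
    hosts = []
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.append(host)
    return hosts


def build_ssh_commands(hosts, remote_command):
    """Build one ssh argument list per host (nothing is executed here)."""
    return [["ssh", host] + list(remote_command) for host in hosts]


if __name__ == "__main__":
    # Hypothetical slaves-file content; real clusters list their own hosts.
    slaves_text = "node1.example.com\n# spare machine\nnode2.example.com\n"
    # Assumed remote command, used purely for illustration.
    remote = ["bin/hadoop-daemon.sh", "start", "datanode"]
    for command in build_ssh_commands(read_slaves(slaves_text), remote):
        print(" ".join(command))
```

Any other tool that can run a command on each node could play the same role, which is the point the quoted post makes about SSH being a convenience rather than a requirement.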