SpatialHadoop
"Supporting spatial data on Apache Hadoop"
SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data. With SpatialHadoop you can analyze huge spatial datasets on a cluster of machines.
SpatialHadoop adds new spatial data types (e.g., Point, Rectangle, and Polygon) that can be used in MapReduce programs. It also adds spatial indexes that organize the data in HDFS according to the dataset's spatial locality, allowing efficient retrieval and processing.
SpatialHadoop supports two index types: Grid index and R-tree.
This tutorial covers the Linux version of SpatialHadoop 1.2.1; the setup steps below should also work in any later version:
Single Node:
1- Download the latest version of SpatialHadoop.
Project website: http://spatialhadoop.cs.umn.edu/
2- Extract the compressed file.
Open a terminal and decompress the downloaded file.
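For example, assuming the download is a .tar.gz archive named spatialhadoop-1.2.1.tar.gz (the exact file name may differ):

tar -xzf spatialhadoop-1.2.1.tar.gz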
3- Move to the conf folder under the extracted folder.
4- Edit hadoop-env.sh by setting JAVA_HOME to your Java installation directory.
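For example, hadoop-env.sh would contain a line such as the following (the exact path depends on where your JDK is installed):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64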
5- Test your installation by moving to the bin directory:
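A quick sanity check, assuming the standard Hadoop 1.x layout, is printing the version from inside bin:

cd bin
./hadoop version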
Congratulations, SpatialHadoop now works in single-node mode. The rest of this tutorial covers cluster mode.
Cluster:
For cluster mode, consider the following example: we want to set up a cluster that consists of 4 nodes (machines), and we will assign one of the nodes to be the master, as shown in the figure:
6- Create a directory called hdfs inside the Hadoop installation folder.
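For example (matching the directories used in the configuration files below):

mkdir -p [your Directory]/hadoop-1.2.1/hdfs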
7- Configure HDFS by editing the hdfs-site.xml file under the conf folder (SpatialHadoop's own settings live in spatial-site.xml) and adding these properties:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>[your Directory]/hadoop-1.2.1/hdfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>[your Directory]/hadoop-1.2.1/hdfs/data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
8- Configure the mapred-site.xml file by adding these properties:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>node2:9001</value>
  </property>
  <property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>20</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>[your Directory]/hadoop-1.2.1/hdfs/mapred</value>
  </property>
</configuration>
9- Add node2 to the masters file.
10- Add node3, node4, and node5 to the slaves file. Both files are under the conf folder, as shown below.
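Each file lists one hostname per line, so after these steps they would look like this:

masters:
node2

slaves:
node3
node4
node5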
Now, for each node in your cluster:
11- Copy the SpatialHadoop folder to node2, node3, node4, and node5.
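One way to do this is with scp, assuming the same directory layout and a user account (here hypothetically named hadoop) on every node; repeat for node3, node4, and node5:

scp -r [your Directory]/hadoop-1.2.1 hadoop@node2:[your Directory]/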
12- On the master node, format the NameNode using the following command:
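Assuming the standard Hadoop 1.x layout, run this from the installation folder on the master node (note: formatting erases any existing HDFS data):

bin/hadoop namenode -format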
13- On the master node, start the cluster with the following command:
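In Hadoop 1.x this is the start-all.sh script, which starts the NameNode and JobTracker on the master and, over SSH, the DataNodes and TaskTrackers on the slaves:

bin/start-all.sh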
14- On the local machine, you can test your configuration with the same command line as in step 5 above.
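For example, listing the HDFS root directory or printing the cluster report should confirm that the DataNodes have joined:

bin/hadoop fs -ls /
bin/hadoop dfsadmin -report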