Friday, November 15, 2013

Mahout Tutorial

Mahout

In this tutorial I will guide you through installing Maven and Mahout on a Linux machine from the terminal, and then through testing the installation using the MovieLens dataset.

Prerequisites for Building Mahout
  • Java JDK 1.6

This is not necessary, but to keep things organized I created a folder to hold all the installation files.

$ mkdir mahout
$ cd mahout

 Maven installation:

1- I downloaded Maven 3.1.1 from http://maven.apache.org/

2-  Decompress the "apache-maven-3.1.1-bin.tar.gz"

$ tar -zxvf apache-maven-3.1.1-bin.tar.gz
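Optionally, you can add Maven's bin directory to your PATH so that mvn can be called directly. A minimal sketch, assuming you extracted the archive inside the mahout folder created above:

$ export M2_HOME=$PWD/apache-maven-3.1.1
$ export PATH=$M2_HOME/bin:$PATH
$ mvn -version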



 Mahout installation:

1- As recommended on the Building Mahout page (https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout), I checked out the latest version of Apache Mahout from the Subversion trunk:

$ svn co http://svn.apache.org/repos/asf/mahout/trunk
 
2- You will now find a folder called trunk under the mahout directory we created earlier.

$ cd trunk 

3- Build Mahout using Maven; this will take some time to finish.

$ ../apache-maven-3.1.1/bin/mvn install
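If the build takes too long because of the unit tests, a commonly used variant is to skip them with Maven's standard flag (assumption: you do not need the test results):

$ ../apache-maven-3.1.1/bin/mvn install -DskipTests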

4- If everything goes fine, you should see something like this:

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46:12.061s
[INFO] Finished at: Thu Nov 14 08:45:09 CST 2013
[INFO] Final Memory: 37M/147M
[INFO] ------------------------------------------------------------------------




5- Configure the Hadoop environment; this can be done in various ways:

  • First: export the Hadoop home directory in your shell
$ export HADOOP_HOME=<directory to your hadoop installation>

  • Second: edit the mahout script by adding these lines at the beginning of the following file
$ nano trunk/bin/mahout


HADOOP_CONF_DIR=<directory to your hadoop configuration folder>
HADOOP_HOME=<directory to your hadoop installation>
HADOOP_CLASSPATH=<directory to your hadoop installation>


 For example:
HADOOP_CONF_DIR=/hadoop-1.0.4/conf
HADOOP_HOME=/hadoop-1.0.4
HADOOP_CLASSPATH=/hadoop-1.0.4
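As a quick sanity check (not part of the original steps, so treat it as a suggestion), you can run the mahout script with no arguments after setting the variables; it should print the list of available programs:

$ export HADOOP_HOME=/hadoop-1.0.4
$ export HADOOP_CONF_DIR=/hadoop-1.0.4/conf
$ trunk/bin/mahout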


6- Download the MovieLens dataset from http://grouplens.org/; in this example I downloaded the 10M ratings dataset.
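For example, with wget (the exact download URL is an assumption; check grouplens.org for the current link to the 10M ratings dataset):

$ wget http://files.grouplens.org/datasets/movielens/ml-10m.zip
$ unzip ml-10m.zip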

7- Convert the ratings file (Rating.data) and u.user to comma-separated (CSV) files; the following Python script should do the work for you.



#!/usr/bin/python
# This script takes one input file and converts it to a CSV file by
# replacing tab, '|' or '::' delimiters with commas.
# Example: input = udata.txt  ->  output = udata_txt.csv

import sys
import os

sourceData = sys.argv[1]
filename = sourceData.split('/')[-1]
filename = filename.replace('.', '_')
outputData = os.getcwd() + '/' + filename + '.csv'
print 'source Data :', sourceData
print 'output file :', outputData

source_file = open(sourceData)
out = open(outputData, 'w')
for line in source_file:
    if '\t' in line:
        newline = line.replace('\t', ',')
    elif '|' in line:
        newline = line.replace('|', ',')
    elif '::' in line:
        newline = line.replace('::', ',')
    else:
        # skip lines with no recognized delimiter
        continue
    out.write(newline)

source_file.close()
out.close()
print 'program end'
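For example, saving the script as convert_to_csv.py (the file name is just illustrative) and running it on the ratings file produces Rating_data.csv in the current directory:

$ python convert_to_csv.py <directory>/Rating.data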

8- Start your Hadoop cluster and copy Rating_data.csv to HDFS:

$ ./hadoop fs -copyFromLocal <directory>/Rating_data.csv /
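You can confirm the file landed in HDFS by listing the root directory (run from the same Hadoop bin directory):

$ ./hadoop fs -ls /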

9- Move to the trunk/bin directory and run the item-based recommender:


$ cd trunk/bin
$ ./mahout recommenditembased --input /Rating_data.csv --numRecommendations 2 --output /output --similarityClassname SIMILARITY_PEARSON_CORRELATION

10- You should now be able to see the output in HDFS.
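For example, from the Hadoop bin directory (the part file name follows the usual MapReduce output naming convention and may differ on your cluster):

$ ./hadoop fs -ls /output
$ ./hadoop fs -cat /output/part-r-00000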

Friday, October 25, 2013

SpatialHadoop Tutorial

SpatialHadoop

"  Support spatial data on Apache Hadoop"
 



SpatialHadoop is a MapReduce extension to Apache Hadoop designed specifically to work with spatial data. With SpatialHadoop you can analyze huge spatial datasets on a cluster of machines.

SpatialHadoop adds new spatial data types (e.g., Point, Rectangle, and Polygon) that can be used in MapReduce programs. It also adds spatial indexes that organize the data in HDFS according to the dataset's spatial locality to allow efficient retrieval and processing.

SpatialHadoop supports Grid index and R-tree index structures.

This tutorial considers the Linux version of SpatialHadoop 1.2.1; these setup steps should work with any later version.

Single Node: 


1- Download the latest version of SpatialHadoop.
    Project website: http://spatialhadoop.cs.umn.edu/
   
2- Extract the compressed file. 
     Open the terminal and decompress the downloaded file.
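For example (the archive name is an assumption; use the name of the file you actually downloaded):

$ tar -zxvf spatialhadoop-1.2.1.tar.gz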

 

3- Move to the conf folder under the extracted folder.


4- Edit hadoop-env.sh by adding the directory of your Java installation as JAVA_HOME.
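For example, add a line like this to conf/hadoop-env.sh (the JDK path is an assumption; point it at your own Java installation):

export JAVA_HOME=/usr/lib/jvm/java-6-openjdk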


5- Test your installation by moving to the bin directory:
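A minimal smoke test, assuming the standard Hadoop launcher that the SpatialHadoop 1.2.1 distribution ships with:

$ cd bin
$ ./hadoop version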

 

Congratulations, SpatialHadoop now works in single-node mode. The rest of this tutorial covers cluster mode.

Cluster :


For cluster mode, consider the following example: we want to set up a cluster that consists of 4 nodes (machines), and we will assign one of the nodes (node2) to be the master.

6- Create a directory called hdfs inside the Hadoop installation folder.
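For example (assuming the extracted folder is named hadoop-1.2.1):

$ cd hadoop-1.2.1
$ mkdir hdfs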

7- Configure HDFS by editing the hdfs-site.xml and spatial-site.xml files under the conf folder and adding these properties:

<configuration>
     <property>
       <name>dfs.name.dir</name>
       <value>[your Directory]/hadoop-1.2.1/hdfs/name</value>
     </property>

     <property>
       <name>dfs.data.dir</name>
       <value>[your Directory]/hadoop-1.2.1/hdfs/data</value>
     </property>

     <property>
         <name>dfs.replication</name>
         <value>1</value>
     </property>
</configuration>

8- Configure mapred-site.xml by adding these properties:

<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>node2:9001</value>
     </property>

    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
       <value>20</value>
   </property>

    <property>
       <name>mapred.local.dir</name>
       <value>[your Directory]/hadoop-1.2.1/hdfs/mapred</value>
    </property>

</configuration>


9- Add node2 to the masters file.

10- Add node3, node4, and node5 to the slaves file.
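For example, from the conf folder (a sketch; the node names follow the example cluster above):

$ echo node2 > masters
$ printf "node3\nnode4\nnode5\n" > slaves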

Now, for each node in your cluster:
11- Copy the SpatialHadoop folder to node2, node3, node4, and node5.
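For example with scp (an assumption; rsync or any other copy mechanism works just as well):

$ for node in node2 node3 node4 node5; do scp -r hadoop-1.2.1 $node:~/ ; done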

12- Format the HDFS namenode using the following command.
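The standard Hadoop 1.x format command is assumed here; run it from the installation folder on the master node (node2):

$ bin/hadoop namenode -format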




13- Start the cluster using the following command.
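With the stock Hadoop 1.x scripts that SpatialHadoop is built on, this is typically run once from the master node, which launches the daemons on the slaves over SSH (an assumption based on standard Hadoop 1.2.1):

$ bin/start-all.sh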


14- On the local machine, you can test your configuration with the same command as in step 5 above.