Friday, November 15, 2013

Mahout Tutorial

Mahout

In this tutorial I will guide you through installing Maven and Mahout on a Linux machine from the terminal, and then testing the installation using the MovieLens dataset.

Prerequisites for Building Mahout
  • Java JDK 1.6
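You can quickly check which Java version is on your path (the exact version string will vary with your installation):

$ java -version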
This is not strictly necessary, but to keep things organized I created a folder to hold all the installation files.

$ mkdir mahout
$ cd mahout

 Maven installation:

1- I downloaded Maven 3.1.1 from http://maven.apache.org/

2-  Decompress the "apache-maven-3.1.1-bin.tar.gz"

$ tar -zxvf apache-maven-3.1.1-bin.tar.gz
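Before going further, you can sanity-check the unpacked Maven by running the bundled mvn binary with its standard -version flag (the path below assumes you are still inside the mahout folder created above):

$ ./apache-maven-3.1.1/bin/mvn -version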



 Mahout installation:

1- As recommended, I checked out the latest version of Apache Mahout from Subversion, following the instructions at https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout

$ svn co http://svn.apache.org/repos/asf/mahout/trunk
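If the svn command is not available on your machine, you will need to install Subversion first; on a Debian/Ubuntu system (an assumption, adjust for your distribution's package manager) that would look like:

$ sudo apt-get install subversion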
 
2- You will now find a folder called trunk under the mahout directory we created earlier.

$ cd trunk 

3- Build Mahout using Maven; this will take some time to finish.

$ ../apache-maven-3.1.1/bin/mvn install
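The full build also runs Mahout's unit tests, which accounts for much of the time. If you only need the binaries, Maven's standard -DskipTests property can be passed to skip them (this is a generic Maven option, not something specific to Mahout):

$ ../apache-maven-3.1.1/bin/mvn install -DskipTests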

4- If everything goes fine, you should see something like this:

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46:12.061s
[INFO] Finished at: Thu Nov 14 08:45:09 CST 2013
[INFO] Final Memory: 37M/147M
[INFO] ------------------------------------------------------------------------




5- Configure the Hadoop environment. This can be done in various ways:

  • First option: set HADOOP_HOME as an environment variable
$ export HADOOP_HOME=<directory to your hadoop installation>

  • Second option: edit the mahout script by adding these lines at the beginning of the following file
$ nano trunk/bin/mahout


HADOOP_CONF_DIR=<directory to your hadoop configuration folder>
HADOOP_HOME=<directory to your hadoop installation>
HADOOP_CLASSPATH=<directory to your hadoop installation>


 For example:
HADOOP_CONF_DIR=/hadoop-1.0.4/conf
HADOOP_HOME=/hadoop-1.0.4
HADOOP_CLASSPATH=/hadoop-1.0.4
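If you go with the first option and want the variables to survive new terminal sessions, one simple approach (assuming the same /hadoop-1.0.4 layout as in the example above) is to append the exports to your ~/.bashrc:

$ echo 'export HADOOP_HOME=/hadoop-1.0.4' >> ~/.bashrc
$ echo 'export HADOOP_CONF_DIR=/hadoop-1.0.4/conf' >> ~/.bashrc
$ source ~/.bashrc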


6- Download the MovieLens dataset (http://grouplens.org/); in this example I downloaded the 10M ratings set.

7- Convert Rating.data, which contains the ratings, and u.user into comma-separated (CSV) files. The following Python script should do the work for you.



#!/usr/bin/python
# This script takes one input file and converts it to a CSV file.
# Supported separators: tab, | or ::
# Example: input = udata.txt, output = udata_txt.csv

import sys
import os

# Input file is the first command-line argument; output goes to the current directory.
sourceData = sys.argv[1]
filename = sourceData.split('/')[-1]
filename = filename.replace('.', '_')
outputData = os.getcwd() + '/' + filename + '.csv'
print 'source Data :', sourceData
print 'output file :', outputData

source_file = open(sourceData)
out = open(outputData, 'w')
for line in source_file:
    # Replace whichever separator the line uses with a comma; skip lines with none.
    if '\t' in line:
        newline = line.replace('\t', ',')
    elif '|' in line:
        newline = line.replace('|', ',')
    elif '::' in line:
        newline = line.replace('::', ',')
    else:
        continue
    out.write(newline)

source_file.close()
out.close()
print 'program end'
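Assuming you save the script as, say, convert.py (the name is hypothetical, pick whatever you like), converting the ratings file looks like this; the resulting Rating_data.csv is written to the current directory:

$ python convert.py <directory>/Rating.data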

8- Start the Hadoop cluster and copy Rating_data.csv to HDFS:

$ ./hadoop fs -copyFromLocal <directory>/Rating_data.csv /
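You can confirm the file landed in HDFS by listing the root directory:

$ ./hadoop fs -ls /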

9- Move to the trunk/bin directory and run the item-based recommender:

$ cd trunk/bin
$ ./mahout recommenditembased --input /Rating_data.csv --numRecommendations 2 --output /output --similarityClassname SIMILARITY_PEARSON_CORRELATION

10- You should now be able to see the output in HDFS.
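To inspect the recommendations, list the output directory and print one of the part files (the exact part file name, such as part-r-00000, depends on how the Hadoop job was configured):

$ ./hadoop fs -ls /output
$ ./hadoop fs -cat /output/part-r-00000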