Friday, November 15, 2013

Mahout Tutorial

Mahout

In this tutorial I will guide you through installing Maven and Mahout on a Linux machine from the terminal, and then testing the installation using the MovieLens dataset.

Prerequisites for Building Mahout
  • Java JDK 1.6
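You can quickly check which Java version is on your path (the exact version string will vary with your installation):

$ java -version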
This is not strictly necessary, but to keep things organized I created a folder to hold all the installation files.

$ mkdir mahout
$ cd mahout

 Maven installation:

1- I downloaded Maven 3.1.1 from http://maven.apache.org/

2-  Decompress the "apache-maven-3.1.1-bin.tar.gz"

$ tar -zxvf apache-maven-3.1.1-bin.tar.gz
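Before going further, you can sanity-check the unpacked Maven by running the bundled mvn binary with its standard -version flag (the path below assumes you are still inside the mahout folder created above):

$ ./apache-maven-3.1.1/bin/mvn -version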



 Mahout installation:

1- As recommended, I checked out the latest version of Apache Mahout from Subversion, following the instructions at https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout

$ svn co http://svn.apache.org/repos/asf/mahout/trunk
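If the svn command is not available on your machine, you will need to install Subversion first; on a Debian/Ubuntu system (an assumption, adjust for your distribution's package manager) that would look like:

$ sudo apt-get install subversion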
 
2- You will now find a folder called trunk under the mahout directory we created earlier.

$ cd trunk 

3- Build Mahout using Maven; this will take some time to finish.

$ ../apache-maven-3.1.1/bin/mvn install
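The full build also runs Mahout's unit tests, which accounts for much of the time. If you only need the binaries, Maven's standard -DskipTests property can be passed to skip them (this is a generic Maven option, not something specific to Mahout):

$ ../apache-maven-3.1.1/bin/mvn install -DskipTests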

4- If everything goes fine, you should see something like this:

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46:12.061s
[INFO] Finished at: Thu Nov 14 08:45:09 CST 2013
[INFO] Final Memory: 37M/147M
[INFO] ------------------------------------------------------------------------




5- Configure the Hadoop environment. This can be done in various ways:

  • First option: set HADOOP_HOME as an environment variable
$ export HADOOP_HOME=<directory to your hadoop installation>

  • Second option: edit the mahout script by adding these lines at the beginning of the following file
$ nano trunk/bin/mahout


HADOOP_CONF_DIR=<directory to your hadoop configuration folder>
HADOOP_HOME=<directory to your hadoop installation>
HADOOP_CLASSPATH=<directory to your hadoop installation>


 For example:
HADOOP_CONF_DIR=/hadoop-1.0.4/conf
HADOOP_HOME=/hadoop-1.0.4
HADOOP_CLASSPATH=/hadoop-1.0.4
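If you go with the first option and want the variables to survive new terminal sessions, one simple approach (assuming the same /hadoop-1.0.4 layout as in the example above) is to append the exports to your ~/.bashrc:

$ echo 'export HADOOP_HOME=/hadoop-1.0.4' >> ~/.bashrc
$ echo 'export HADOOP_CONF_DIR=/hadoop-1.0.4/conf' >> ~/.bashrc
$ source ~/.bashrc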


6- Download the MovieLens dataset (http://grouplens.org/); in this example I downloaded the 10M ratings set.

7- Convert Rating.data, which contains the ratings, and u.user into comma-separated (CSV) files. The following Python script should do the work for you.



#!/usr/bin/python
# This script takes one input file and converts it to a CSV file.
# Supported separators: tab, | or ::
# Example: input = udata.txt, output = udata_txt.csv

import sys
import os

# Input file is the first command-line argument; output goes to the current directory.
sourceData = sys.argv[1]
filename = sourceData.split('/')[-1]
filename = filename.replace('.', '_')
outputData = os.getcwd() + '/' + filename + '.csv'
print 'source Data :', sourceData
print 'output file :', outputData

source_file = open(sourceData)
out = open(outputData, 'w')
for line in source_file:
    # Replace whichever separator the line uses with a comma; skip lines with none.
    if '\t' in line:
        newline = line.replace('\t', ',')
    elif '|' in line:
        newline = line.replace('|', ',')
    elif '::' in line:
        newline = line.replace('::', ',')
    else:
        continue
    out.write(newline)

source_file.close()
out.close()
print 'program end'
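Assuming you save the script as, say, convert.py (the name is hypothetical, pick whatever you like), converting the ratings file looks like this; the resulting Rating_data.csv is written to the current directory:

$ python convert.py <directory>/Rating.data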

8- Start the Hadoop cluster and copy Rating_data.csv to HDFS:

$ ./hadoop fs -copyFromLocal <directory>/Rating_data.csv /
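You can confirm the file landed in HDFS by listing the root directory:

$ ./hadoop fs -ls /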

9- Move to the trunk/bin directory and run the item-based recommender:

$ cd trunk/bin
$ ./mahout recommenditembased --input /Rating_data.csv --numRecommendations 2 --output /output --similarityClassname SIMILARITY_PEARSON_CORRELATION

10- You should now be able to see the output in HDFS.
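To inspect the recommendations, list the output directory and print one of the part files (the exact part file name, such as part-r-00000, depends on how the Hadoop job was configured):

$ ./hadoop fs -ls /output
$ ./hadoop fs -cat /output/part-r-00000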