In this tutorial I will guide you through installing Maven and Mahout on a Linux machine from the terminal, and then testing the installation with the MovieLens dataset.
Prerequisites for Building Mahout
- Java JDK 1.6
- Maven 2.2 or higher (use 3.x to build from SVN)
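A quick way to compare an installed version against these minimums is sketched below. It is only an illustration: the helper names `parse_version` and `meets_minimum` are hypothetical, and the sample strings assume the usual first line printed by `mvn -version` and `java -version`.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in: %r" % text)
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version_text, minimum):
    """True if the version inside version_text is at least the minimum version."""
    return parse_version(version_text) >= parse_version(minimum)


if __name__ == "__main__":
    # Sample strings (assumed format); paste your own tool output instead.
    print(meets_minimum("Apache Maven 3.1.1", "2.2"))
    print(meets_minimum('java version "1.6.0_45"', "1.6"))
```

Versions are compared as tuples of integers, so 3.1.1 is correctly treated as newer than 2.2, which a plain string comparison would not guarantee in general.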
This is not necessary, but to keep things organized I created a folder to hold all the installation files.
$ mkdir mahout
$ cd mahout
Maven installation:
1- I downloaded Maven 3.1.1, from
2- Decompress the downloaded archive "apache-maven-3.1.1-bin.tar.gz"
$ tar -zxvf apache-maven-3.1.1-bin.tar.gz
Mahout installation:
1- As recommended, I checked out the latest version of Apache Mahout from the following link
$ svn co http:
2- You should now find a folder called trunk under the mahout directory we created
$ cd trunk
3- Build Mahout using Maven. This will take some time to finish.
$ ../apache-maven-3.1.1/bin/mvn install
4- If everything goes fine, you should see something like this:
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 46:12.061s
[INFO] Finished at: Thu Nov 14 08:45:09 CST 2013
[INFO] Final Memory: 37M/147M
[INFO] ------------------------------------------------------------------------
5- Configure the Hadoop environment. This can be done in various ways:
- First
- Second, edit the mahout script by adding these lines at the beginning of the following file
HADOOP_CONF_DIR=<directory to your hadoop configuration folder>
HADOOP_HOME=<directory to your hadoop>
HADOOP_CLASSPATH=<directory to your hadoop>
For example:
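As a sanity check before running Mahout, the sketch below reports which of the three variables are missing. It assumes you exported them in your shell environment; values set only inside the mahout script itself will not be visible to it. The name `missing_hadoop_vars` is a hypothetical helper, not part of Hadoop or Mahout.

```python
import os

# The three variables named in the step above.
REQUIRED_VARS = ("HADOOP_CONF_DIR", "HADOOP_HOME", "HADOOP_CLASSPATH")


def missing_hadoop_vars(env=None):
    """Return the required variable names that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_hadoop_vars()
    if missing:
        print("Missing Hadoop variables: " + ", ".join(missing))
    else:
        print("All Hadoop variables are set")
```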
6- Download the MovieLens dataset (in this example I downloaded the 10M ratings set)
7- Convert the file which contains the ratings, and u.user, to comma-separated (CSV) files. The following Python script should do the work for you.
# This script takes one file and converts it to a CSV file.
# Example input: a file delimited with tabs, '|' or '::'
# input  = udata.txt
# output = udata_txt.csv (written to the current directory)
import sys
import os

sourceData = sys.argv[1]
filename = os.path.basename(sourceData).replace('.', '_')
outputData = os.path.join(os.getcwd(), filename + '.csv')
print('source Data : ' + sourceData)
print('output file : ' + outputData)

with open(sourceData) as source_file, open(outputData, 'w') as out:
    for line in source_file:
        if '\t' in line:
            newline = line.replace('\t', ',')
        elif '|' in line:
            newline = line.replace('|', ',')
        elif '::' in line:
            newline = line.replace('::', ',')
        else:
            newline = line  # no known delimiter: copy the line unchanged
        out.write(newline)  # write to the output file, not just the screen
print('program end')
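To make the conversion rule concrete, here is the same per-line logic as a small function applied to sample lines. The function name `to_csv_line` is hypothetical, and the sample lines are illustrative inputs in the tab, '|' and '::' delimited styles the script handles.

```python
def to_csv_line(line):
    """Replace the first delimiter style found in the line with commas.

    Delimiters are checked in the same order as in the script: tab, '|', '::'.
    A line containing none of them is returned unchanged.
    """
    for delimiter in ("\t", "|", "::"):
        if delimiter in line:
            return line.replace(delimiter, ",")
    return line


if __name__ == "__main__":
    samples = [
        "1::122::5::838985046\n",     # '::' delimited
        "196\t242\t3\t881250949\n",   # tab delimited
        "1|24|M|technician|85711\n",  # '|' delimited
    ]
    for sample in samples:
        print(to_csv_line(sample).rstrip("\n"))
```

Note that only one delimiter style is replaced per line, so a tab-delimited line that also contains a '|' inside a field keeps that '|' as it is.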
8- Start the cluster machines and copy Rating_data.csv to HDFS
$ ./hadoop fs -copyFromLocal <directory>/Rating_data.csv /
9- Move to the following directory /trunk/bin/ and run the item-based recommender
$ cd /trunk/bin
$ ./mahout recommenditembased --input /Rating_data.csv --numRecommendations 2 --output /output --similarityClassname SIMILARITY_PEARSON_CORRELATION
10- You should now be able to see the output in HDFS, under /output
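Once a result file has been copied out of HDFS, its lines can be read with a short parser like the sketch below. It assumes each line has the form userID, a tab, then a bracketed list of itemID:score pairs. That layout is an assumption about the Mahout version used here, so check it against your own output first. The name `parse_recommendation_line` is hypothetical.

```python
def parse_recommendation_line(line):
    """Parse 'userID<TAB>[itemID:score,itemID:score]' into (user_id, [(item_id, score), ...])."""
    user_id, _, items = line.strip().partition("\t")
    items = items.strip("[]")
    recommendations = []
    if items:
        for pair in items.split(","):
            item_id, _, score = pair.partition(":")
            recommendations.append((int(item_id), float(score)))
    return int(user_id), recommendations


if __name__ == "__main__":
    # Illustrative line in the assumed format, not real output.
    user, recs = parse_recommendation_line("1\t[1192:5.0,1210:4.5]\n")
    print(user, recs)
```

With --numRecommendations 2, each user line should carry at most two pairs.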