Mahout and Hadoop: a simple integration

Recommendation systems are common on almost every web site today, and Mahout can be used to produce recommendations for a web site. In this article I'd like to illustrate how to do it using Mahout integrated with Hadoop, which is useful for dealing with large amounts of data.


First, two questions. Is it mandatory to use Hadoop with Mahout? Can I use Mahout without Hadoop? The answers are "No" to the first and "Yes" to the second.

Nevertheless, I think combining Mahout with Hadoop is very common, and the reason is easy to explain: Mahout builds the recommendations from a large amount of data, ideally covering all the users who browse the web site, and Hadoop is designed for processing large amounts of data.

Very briefly, Hadoop is composed of two modules: HDFS (Hadoop Distributed File System) and Map/Reduce jobs. Explaining these concepts is beyond the scope of this article; I wrote about them in a previous article. To compute the data and produce the recommendations, Hadoop reads the data and runs some Map/Reduce jobs. We'll see them in more detail later in this article.

Now, take a look at the Mahout component diagram (from the official Mahout web site).


A Java Application could be a stand-alone application or a web application.

The Recommender is the engine of the system.

A DataModel is the interface to information about user preferences.

The Data Store is where the user navigation preferences are stored. It could be a file system or a database (not only a database).

The other boxes are described on the official Mahout web site, so it's pointless to copy them here.


As I said, Mahout produces recommendations, which can be user-based or item-based. Again, on the Mahout web site you can see a large number of algorithms (or similarities, as they are called) for computing user or item preferences.

Let's take a look at how to produce these recommendations. I'm sorry, but I can't explain in this article how to install Mahout and Hadoop on your machine: it's not complicated, but it's not quick to explain either, and I found a good number of articles that cover it. I used a virtual machine running under VMware with Ubuntu 9.10 installed. I tried to use Cygwin but ran into a lot of problems with file paths, so I recommend installing a virtual machine instead of going crazy with Windows vs Linux path conventions.

Mahout can build recommendations at runtime, taking one preference file and returning the recommendations directly. Instead of that approach, I used the Hadoop map/reduce jobs included in Mahout to compute the recommendations. There are two of them.

The first is named "ItemSimilarityJob" and returns a file of rows in the form "item_a item_b similarity_score"; the second is named "RecommenderJob" and returns a file of rows in the form "user [item_a:score,item_b:score,…]".

The score is always between -1 and 1 (except when using SIMILARITY_COOCCURRENCE). This value is calculated by applying the similarity algorithm, so it changes depending on which one is applied.

Explaining the differences between the algorithms is beyond the scope of this article. Each one takes a different approach to the data and, most importantly, there isn't a "correct" and an "incorrect" algorithm, or a "good" and a "bad" one.

Choosing SIMILARITY_TANIMOTO_COEFFICIENT rather than SIMILARITY_LOGLIKELIHOOD depends on the type of browsing the users do on the web site, and on whether you're confident that the recommendations produced are of good quality for the web site.

Another argument that can be applied, to keep only good-quality items, is named "threshold". This argument defines a minimum relationship score, and item pairs with a value below it are discarded.
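For example, re-running the job with the -tr option (the 0.3 cutoff here is an arbitrary value I picked purely for illustration):

```
./mahout itemsimilarity -i input/user_views.csv -o output/ -s SIMILARITY_TANIMOTO_COEFFICIENT -b -tr 0.3
```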

These are only a small part of a huge topic. Recommendation is covered in every aspect in various articles on the net, where you can find more information.

Let's start with an example to show a real application of these concepts. I'm going to produce recommendations using the users' navigation data from the web site. The preference value would always be 1 (page visited), so I avoided specifying it. These are called boolean preferences (we'll see this again on the command line).

Preference files (user_views.csv):
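Each line of a boolean preference file is simply a user_id,item_id pair, with no preference value. A file consistent with the matrix shown below would contain:

```
1,101
1,102
1,103
2,101
2,102
2,103
2,104
3,101
3,104
3,105
3,107
4,103
4,104
4,106
5,102
5,103
5,104
5,105
5,106
```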


The matrix illustrates the previous preference file:

User/Item | 101 | 102 | 103 | 104 | 105 | 106 | 107
1         |  X  |  X  |  X  |     |     |     |
2         |  X  |  X  |  X  |  X  |     |     |
3         |  X  |     |     |  X  |  X  |     |  X
4         |     |     |  X  |  X  |     |  X  |
5         |     |  X  |  X  |  X  |  X  |  X  |

As you can see, user 1 viewed items 101, 102, and 103; user 2 viewed items 101, 102, 103, and 104; and so on.

Once the file is uploaded to the Hadoop file system (HDFS), start Mahout to produce the item-to-item relations with the command below. The options are:

-i input file
-o output file
-s name of the similarity to apply
-b boolean data (no preference values)
-tr threshold value

./mahout itemsimilarity -i input/user_views.csv -o output/ -s SIMILARITY_TANIMOTO_COEFFICIENT -b

The output file, named "part-r-00000", is the following:

101	102	0.5
101	103	0.4
101	104	0.4
101	105	0.25
101	107	0.333333333
102	103	0.75
102	104	0.4
102	105	0.25
102	106	0.25
103	104	0.6
103	105	0.2
103	106	0.5
104	105	0.5
104	106	0.5
104	107	0.25
105	106	0.333333333
105	107	0.5

The first two columns are the item pair and the last column is the similarity score. The higher the value, the stronger the correlation between the items. The calculation of this value is quite easy:


c = number of users who viewed both item1 and item2

a = number of users who viewed item1

b = number of users who viewed item2

Tanimoto(item1, item2) = c / (a + b - c). For items 101 and 102: a = 3, b = 3, c = 2, so the relation is 2/(3+3-2) = 0.5.
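As a sanity check, the Tanimoto coefficient can be recomputed outside Mahout, from the view matrix above, with a few lines of plain Java (no Mahout dependency; the class and method names are mine):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TanimotoCheck {

    public static void main(String[] args) {
        // user -> set of viewed items, copied from the matrix above
        Map<Integer, Set<Integer>> views = new HashMap<>();
        views.put(1, new HashSet<>(Arrays.asList(101, 102, 103)));
        views.put(2, new HashSet<>(Arrays.asList(101, 102, 103, 104)));
        views.put(3, new HashSet<>(Arrays.asList(101, 104, 105, 107)));
        views.put(4, new HashSet<>(Arrays.asList(103, 104, 106)));
        views.put(5, new HashSet<>(Arrays.asList(102, 103, 104, 105, 106)));

        System.out.println(tanimoto(views, 101, 102)); // matches the 0.5 in the job output
        System.out.println(tanimoto(views, 102, 103)); // matches the 0.75 in the job output
    }

    // Tanimoto = c / (a + b - c): intersection over union of the two items' viewers
    static double tanimoto(Map<Integer, Set<Integer>> views, int item1, int item2) {
        int a = 0, b = 0, c = 0;
        for (Set<Integer> items : views.values()) {
            boolean v1 = items.contains(item1);
            boolean v2 = items.contains(item2);
            if (v1) a++;
            if (v2) b++;
            if (v1 && v2) c++;
        }
        return (double) c / (a + b - c);
    }
}
```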

Cool, isn't it? With log-likelihood similarity the result is the following:

101	102	0.12160727029227858
101	103	0.5423213660693733
101	104	0.5423213660693733
101	105	0.12160727029227858
101	107	0.5423213660693733
102	103	0.6905400104897509
102	104	0.5423213660693733
102	105	0.12160727029227858
102	106	0.12160727029227858
103	104	0.33569960607118654
103	105	0.6905400104897508
103	106	0.5423213660693733
104	105	0.5423213660693733
104	106	0.5423213660693733
104	107	0.33569960607118654
105	106	0.12160727029227858
105	107	0.6905400104897509

This calculation is quite a bit more complicated than the previous one: its value is "more an expression of how unlikely it is for two users to have so much overlap". As I said before, you can't say Tanimoto is better than log-likelihood; they both give you valid recommendations.
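To give an idea of where those numbers come from, here is a compact re-derivation of the 101-102 score in plain Java (not Mahout's code). It builds the 2x2 contingency table of viewers, computes the G-squared log-likelihood ratio, and scales it to a similarity as 1 - 1/(1 + LLR), which, as far as I can tell from the source, is the mapping Mahout's LogLikelihoodSimilarity applies:

```java
public class LogLikelihoodCheck {

    public static void main(String[] args) {
        // 2x2 contingency table for items 101 and 102, from the matrix above:
        // k11 = users who viewed both, k12 = only 101, k21 = only 102, k22 = neither
        long k11 = 2, k12 = 1, k21 = 1, k22 = 1;
        double llr = logLikelihoodRatio(k11, k12, k21, k22);
        double similarity = 1.0 - 1.0 / (1.0 + llr); // scale the ratio into [0, 1)
        System.out.println(similarity); // ~0.1216, as in the job output above
    }

    // G^2 = 2 * (rowEntropy + columnEntropy - matrixEntropy), unnormalized entropies
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        return 2.0 * (rowEntropy + columnEntropy - matrixEntropy);
    }

    // unnormalized entropy: sum*ln(sum) - sum of x*ln(x) over the counts
    static double entropy(long... counts) {
        double sum = 0;
        double xLogX = 0;
        for (long c : counts) {
            sum += c;
            if (c > 0) {
                xLogX += c * Math.log(c);
            }
        }
        return sum * Math.log(sum) - xLogX;
    }
}
```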

Keep in mind that Tanimoto and log-likelihood are the only similarities that can be used with boolean data.

Consume the data

So, now we have the recommendations, and the idea is to let the users see them. I looked for a good mechanism to integrate a Mahout front-end interface with the data produced by Hadoop, but I wasn't able to find a good solution. So I built a "manual" integration by moving the data from HDFS to NTFS (the file system of my OS) and making it available through a simple REST interface.

 The spring configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:context="http://www.springframework.org/schema/context"
	xmlns:task="http://www.springframework.org/schema/task"
	xsi:schemaLocation="http://www.springframework.org/schema/beans
		http://www.springframework.org/schema/beans/spring-beans.xsd
		http://www.springframework.org/schema/context
		http://www.springframework.org/schema/context/spring-context.xsd
		http://www.springframework.org/schema/task
		http://www.springframework.org/schema/task/spring-task.xsd">

	<context:component-scan base-package="it.iol.mahoutweb" />
	<task:annotation-driven />

	<bean id="fileItemSimilarity"
		class="org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity">
		<!-- placeholder path: the part-r-00000 file copied from HDFS -->
		<constructor-arg value="/path/to/part-r-00000" type="java.io.File" />
	</bean>
</beans>

And the code:

@Controller
public class ControllerMahout {

	@Autowired
	FileItemSimilarity itemSimilarity;

	@RequestMapping(value = "/Recommender/{item}",
	                method = RequestMethod.GET)
	@ResponseBody
	public String getRecommendations(@PathVariable("item") String item) {
		StringBuilder output = new StringBuilder();
		try {
			// the core call: all items similar to the requested one
			long[] items = itemSimilarity.allSimilarItemIDs(Long.parseLong(item));
			output.append("Item ").append(item).append(";");
			for (long item_output : items) {
				output.append(item_output).append(",");
			}
		} catch (Exception e) {
			output.append("Error: ").append(e.getMessage());
		}
		return output.toString();
	}

	@Scheduled(fixedDelay = 120000)
	public void reloadRecommendation() {
		// re-read the similarity file so newly computed data is picked up
		itemSimilarity.refresh(new LinkedList<Refreshable>());
	}
}
Every two minutes the source file is reloaded into the itemSimilarity object. Obviously, this interval depends on how long the calculation process takes and on the amount of data to process.

The key point of this code is the call to "allSimilarItemIDs(item)", which returns the recommendations for one item. Deploying the solution on Tomcat and browsing the URL <host>/MahoutWeb/Recommender/101, you get the recommendations for item 101:

Item 101;104,105,107,102,103,


I'm really keen on Hadoop and Mahout. I think they're having good success and have found their natural home in cloud systems; distributed computing is now a common solution to the big data problem.

In spite of these considerations, I'm not sure I'd use this solution on a web site with frequent data refreshes. Hadoop works well with big data sets but not very well with small-to-medium ones (it took 3 minutes to compute the data above on the two processors configured in my virtual machine).

I'd probably use something like "EasyRec" instead but, if you'd like to let me know your opinion, I'd be glad to read it.

