alexcuesta

My tech blog

Map-Reduce: Words per Author


I have pushed an example of map-reduce written in Python, using the lightweight framework mincemeat.py, as an exercise for a Coursera course on Big Data. The code is on GitHub:

https://github.com/alexcuesta/mapreduce-words-per-author

Don’t expect good code: this is the first time I have written anything in Python. The interesting part is the actual map-reduce implementation.

Basically, the task is to compute, for each author, how many times each term occurs across that author’s titles.

I logged each phase so you can study what it does. It’s quite tricky to follow until you see it working. There is a ‘small’ folder with a couple of documents and a few rows each, so you can easily check what the code does.

MAP

The map (and combine) phase is implemented in the mapfn function. It emits each author together with partial word counts for a single document. In other words, a given author ends up with one set of partial word counts per document (file) in which they appear.
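To make the map-and-combine idea concrete, here is a minimal sketch in plain Python. It is not the code from the repository. It assumes a hypothetical input format in which each row of a document looks like paper_id:::author1::author2:::title, and it splits titles on whitespace only. The parameter names doc_name and doc_text are illustrative.

```python
from collections import Counter


def mapfn(doc_name, doc_text):
    """Yield (author, partial_counts) pairs for a single document.

    Assumed (hypothetical) row format: paper_id:::author1::author2:::title
    """
    per_author = {}
    for line in doc_text.splitlines():
        parts = line.strip().split(":::")
        if len(parts) != 3:
            continue  # skip blank or malformed rows
        _, authors, title = parts
        words = title.lower().split()
        for author in authors.split("::"):
            # Combine step: accumulate counts per author within this document.
            per_author.setdefault(author, Counter()).update(words)
    # Emit one partial count per author for this document.
    for author, counts in per_author.items():
        yield author, counts
```

Calling mapfn once per file gives, for every author in that file, a single Counter of the words in their titles. An author who appears in several files gets several of these partial counts, which is what the next phase has to merge.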

REDUCE

The reduce phase receives all the partial counts for an author, one per document, and adds them up into the final count.
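A reduce step along these lines could be sketched as follows. Again this is an illustration rather than the repository code: it assumes each partial count arrives as a dict-like mapping from word to count, and the names author and partial_counts are hypothetical.

```python
from collections import Counter


def reducefn(author, partial_counts):
    """Merge the per-document partial counts for one author into a total."""
    total = Counter()
    for counts in partial_counts:
        # Counter.update adds the incoming counts instead of replacing them.
        total.update(counts)
    return dict(total)
```

For example, if one document contributed {"big": 2} for an author and another contributed {"big": 1, "data": 1}, the merged result for that author is {"big": 3, "data": 1}.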

 

Written by alexcuesta

May 2, 2013 at 10:59 pm

Posted in Big Data
