Map-Reduce: Words per Author
I have pushed an example of map-reduce written in Python, using the lightweight framework mincemeat.py, as part of an exercise for a Coursera course about Big Data. The code can be found on GitHub:
Don’t expect good code. It’s the first time I’ve written anything in Python. The interesting part is the actual map-reduce implementation.
Basically, the task is to compute how many times each term occurs across titles, for each author.
I logged each phase so you can study what it does. Map-reduce is quite tricky to follow until you see it working. There is a ‘small’ folder with a couple of documents and a few rows so you can easily check what the code does.
The map (and combine) phase is implemented in the mapfn function. It maps each document to its author and calculates partial word counts per document. In other words, for a given author we get partial word counts from multiple documents (files).
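As a rough sketch of what such a mapfn might look like (the `id:::author:::title` record format here is a hypothetical example, not necessarily the format used in the actual repository):

```python
from collections import Counter

def mapfn(key, value):
    """Map phase sketch: emit one partial word-count dict per document.

    Assumes (hypothetically) that each input value is a line of the form
    'id:::author:::title'. The key is the document identifier.
    """
    _, author, title = value.split(":::")
    # Partial counts for this single document, keyed by author
    yield author, dict(Counter(title.lower().split()))
```

Each call produces the partial counts for one document, so an author appearing in several documents will contribute several partial dictionaries to the reduce phase.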
The reduce phase collects all the partial counts for an author across documents and combines them into the final count.
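A minimal sketch of the corresponding reduce step, assuming the map phase emitted per-document word-count dictionaries keyed by author as above:

```python
from collections import Counter

def reducefn(key, values):
    """Reduce phase sketch: merge partial per-document counts.

    key    -- the author name
    values -- a list of word-count dicts, one per document
    Returns the combined word counts for this author.
    """
    total = Counter()
    for partial in values:
        total.update(partial)  # add counts from each document
    return dict(total)
```

The framework groups all map outputs sharing the same key (author) before calling reduce, so this function only has to sum the partial counts it receives.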