Using Stanford Phoenix++ MapReduce to count the most frequent words in the Wikipedia Database

Log of a quick and fun afternoon project.

1. Download the English-language Wikipedia database:

http://en.wikipedia.org/wiki/Wikipedia:Database_download

Download pages-articles.xml.bz2

Uncompress it to pages-articles.xml (around 45GB)
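The usual way to do this is from the command line with bzip2 -dk. If you would rather script it, here is a minimal Python sketch that streams the archive to disk without holding it in memory. The function name decompress_bz2 and the 16 MB chunk size are my own choices for illustration, not part of the original setup.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=16 * 1024 * 1024):
    """Stream-decompress src_path into dst_path, one chunk at a time.

    chunk_size is an arbitrary illustrative value (16 MB).
    """
    # bz2.open returns a file-like object that decompresses on read,
    # so the full 45GB output never has to fit in memory.
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

It would be called as decompress_bz2("pages-articles.xml.bz2", "pages-articles.xml"). Make sure the target disk has room for the uncompressed file.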

2. Download Phoenix++:

http://mapreduce.stanford.edu/

http://mapreduce.stanford.edu/plus/phoenix++-1.0.tar.gz

Run make to build the binaries, then find ‘word_count’ in the ‘tests’ folder

3. Copy word_count and pages-articles.xml to the same folder:

./word_count pages-articles.xml 1000 >wikiwords.txt

This runs the word_count program to list the 1000 most frequent words in the Wikipedia database and writes them to wikiwords.txt
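To make it clearer what is being computed, here is a small single-threaded Python sketch of the same idea: count every word, then keep the top N. It is not the Phoenix++ implementation and will be far slower on the full dump. The tokenization rule (lowercase, runs of ASCII letters) and the function names count_words and top_words are assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, compared case-insensitively.
WORD_RE = re.compile(r"[a-z]+")


def count_words(lines):
    """Return a Counter of word frequencies over an iterable of text lines."""
    total = Counter()
    for line in lines:
        # "Map" each line to its words, then "reduce" by adding to the totals.
        total.update(WORD_RE.findall(line.lower()))
    return total


def top_words(path, n=1000):
    """Return the n most frequent words in the file at path as (word, count) pairs."""
    # errors="replace" keeps the scan going if a byte sequence is not valid UTF-8.
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_words(f).most_common(n)
```

Calling top_words("pages-articles.xml", 1000) would give a list comparable in spirit to wikiwords.txt, though the exact words and counts depend on the tokenization rule.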

4. Result: (Using 8 Xeon E5 server, 128 processing cores)