Log of a quick and fun afternoon project.
1. Download the English-language Wikipedia database dump:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Download pages-articles.xml.bz2
Uncompress it to pages-articles.xml (around 45 GB)
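For reference, the decompression is just a standard bzip2 call (the -k flag keeps the original archive around):
bzip2 -dk pages-articles.xml.bz2   # produces pages-articles.xml, roughly 45 GB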
2. Download Phoenix++:
http://mapreduce.stanford.edu/
http://mapreduce.stanford.edu/plus/phoenix++-1.0.tar.gz
Build the binaries with make, then find the 'word_count' program in the 'tests' folder (rough commands below)
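Roughly, the build looks like this (assuming the default top-level Makefile also builds the test programs; the exact location of the binary may differ):
tar xzf phoenix++-1.0.tar.gz
cd phoenix++-1.0
make                        # build the library and the test programs
find . -name word_count     # locate the word_count binary under tests/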
3. Copy word_count and pages-articles.xml to the same folder
./word_count pages-articles.xml 1000 >wikiwords.txt
This runs the word_count program to count the top 1000 most frequent words in the Wikipedia database
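To sanity-check the run afterwards, peeking at the first lines of the output file is enough (assuming the results are written most-frequent-first):
head -n 20 wikiwords.txt    # peek at the top of the results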
4. Result: (using a server with 8 Xeon E5 processors, 128 processing cores)