Log of a quick and fun afternoon project.
1. Download the English-language Wikipedia database:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
Download pages-articles.xml.bz2
Uncompress it to pages-articles.xml (around 45GB)
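The decompression can be done as a stream, so the roughly 45GB of output never has to fit in memory. A minimal Python sketch using the standard-library bz2 module; the function name decompress_bz2 and the file names in the usage comment are assumptions for illustration, not part of any tool mentioned in this log.

```python
import bz2
import shutil


def decompress_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file to dst_path, one chunk at a time."""
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)


# Usage sketch (file names are assumptions; use the real name of the dump):
# decompress_bz2("pages-articles.xml.bz2", "pages-articles.xml")
```

A command-line bzip2 tool does the same job; the sketch only shows that no special tooling is needed.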
2. Download Phoenix++:
http://mapreduce.stanford.edu/
http://mapreduce.stanford.edu/plus/phoenix++-1.0.tar.gz
Run make to build the binaries, then find ‘word_count’ in the ‘tests’ folder
3. Copy word_count and pages-articles.xml to the same folder
./word_count pages-articles.xml 1000 >wikiwords.txt
This uses the word_count program to find the 1000 most frequent words in the Wikipedia dump
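For readers without Phoenix++ at hand, the counting step can be approximated in a few lines of Python. This is a sketch, not the Phoenix++ implementation. It assumes, judging from the output in step 4, that a word is a run of ASCII letters folded to upper case; top_words and its parameters are hypothetical names. It is single-threaded, so expect it to be far slower on a 45GB file.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of ASCII letters, as the
# upper-case, letters-only tokens in the results suggest.
WORD_RE = re.compile(r"[A-Za-z]+")


def top_words(lines, n):
    """Return (top n (word, count) pairs, distinct words, total words)."""
    counts = Counter()
    for line in lines:
        counts.update(w.upper() for w in WORD_RE.findall(line))
    return counts.most_common(n), len(counts), sum(counts.values())


# Usage sketch (file name as in step 3):
# with open("pages-articles.xml", encoding="utf-8", errors="replace") as f:
#     top, distinct, total = top_words(f, 1000)
```

On a line such as `<text>the &quot;cat&quot; and the hat</text>` this counts TEXT, THE and QUOT twice each, because tag names and escaped entities are runs of letters too.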
4. Result (using a server with 8 Xeon E5 processors, 128 processing cores):
—————-
Job Name: myjob.pbs
Session:
Limits: neednodes=1,nodes=1,procs=128
Resources: cput=00:09:30,mem=49870344kb,vmem=51225052kb,walltime=00:02:57
—————-
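The Resources line allows a rough sanity check: CPU time divided by wall time gives the average number of busy cores, and input size divided by wall time gives throughput. The sketch below does that arithmetic. The input size is taken as 45 * 10^9 bytes from the "around 45GB" figure in step 1, which is an assumption, and the helper names are hypothetical.

```python
def hms_to_seconds(hms):
    """Convert an 'HH:MM:SS' string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_resources(line):
    """Parse 'key=value,key=value' into a dict of strings."""
    return dict(item.split("=", 1) for item in line.split(","))


res = parse_resources(
    "cput=00:09:30,mem=49870344kb,vmem=51225052kb,walltime=00:02:57"
)
cpu_s = hms_to_seconds(res["cput"])           # 570 seconds
wall_s = hms_to_seconds(res["walltime"])      # 177 seconds

avg_busy_cores = cpu_s / wall_s               # about 3.2
input_bytes = 45e9                            # assumption: "around 45GB"
throughput_mb_s = input_bytes / wall_s / 1e6  # about 254 MB/s

print(f"average busy cores: {avg_busy_cores:.1f}")
print(f"throughput: {throughput_mb_s:.0f} MB/s")
```

An average of about 3.2 busy cores out of 128 suggests that much of the wall time went to something other than parallel computation, possibly reading the file, though the log alone cannot confirm that.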
wikiwords.txt:
Wordcount: Running…
Wordcount: Calling MapReduce Scheduler Wordcount
Wordcount: MapReduce Completed
Wordcount: Results (TOP 1000 of 34425700):
THE – 193399750
QUOT – 124981071
GT – 122310147
LT – 119499174
OF – 119433785
ID – 91232877
IN – 79301110
AND – 78304198
TO – 68535720
A – 67833125
TITLE – 56486124
REF – 49681363
TEXT – 47596541
PAGE – 41273141
AMP – 39064116
USER – 35322581
IS – 33185078
HTTP – 31938499
FOR – 30390474
FORMAT – 30158636
MODEL – 29339983
TIMESTAMP – 29309382
NS – 29050522
REVISION – 28964076
CONTRIBUTOR – 28846548
SHA – 28800926
USERNAME – 28605557
COMMENT – 27578561
ON – 27358434
CATEGORY – 27120495
NAME – 25772995
WAS – 24528542
AS – 23756873
X – 23528969
WWW – 23189689
BY – 22963410
TALK – 22949011
COM – 22768372
WITH – 19990960
PARENTID – 19708590
T – 19512107
THAT – 18856762
WIKIPEDIA – 18397499
S – 18188906
IT – 17835250
….
Total: 6318735075
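To put the counts in perspective, each one can be divided by the reported total. The sketch below parses a few of the result lines above and prints each word's share. The parsing assumes the "WORD, dash, COUNT" layout shown in this log, with either a plain hyphen or an en dash as the separator, and parse_counts is a hypothetical helper name.

```python
import re

# Assumption: each result line is "WORD <dash> COUNT", where the dash may be
# a plain hyphen or the en dash that appears in this log.
LINE_RE = re.compile(r"^([A-Z']+)\s*[-\u2013]\s*(\d+)$")


def parse_counts(lines):
    """Return a dict mapping word -> count for lines that match."""
    counts = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


sample = [
    "THE – 193399750",
    "QUOT – 124981071",
    "GT – 122310147",
    "LT – 119499174",
    "OF – 119433785",
]
total = 6318735075  # the "Total" line of the log

for word, count in parse_counts(sample).items():
    print(f"{word:5s} {100 * count / total:.2f}%")
```

THE comes out at roughly 3.06% of the reported total. QUOT, GT, LT and AMP are most likely the XML-escaped entities &quot;, &gt;, &lt; and &amp; rather than English words, since the input is raw XML; likewise TITLE, TIMESTAMP, REVISION and similar entries probably come from the dump's tag names.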