Apache Pig for batch data analysis over Hadoop

2014-08-25

These days I'm playing with Apache Pig to run data analysis jobs over Apache Hadoop. Below is a sample word cloud generated from the top word counts of the nouns in the Italian translation of the Bible.

[Image: la-sacra-bibbia-frequenza-parole.png, the generated word cloud]

Copy the file book.txt to the Hadoop distributed file system (HDFS) with:

```
hadoop-2.4.0/bin/hdfs dfs -copyFromLocal -f book.txt
```

Test the Pig job locally with:

```
pig-0.13.0/bin/pig -x local wordcount.pig
```

Run the Pig job on Hadoop with:

```
pig-0.13.0/bin/pig -x mapreduce wordcount.pig
```

Look at the results with:

```
hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part* | more
```

Copy the results to a local file with:

```
hadoop-2.4.0/bin/hdfs dfs -cat book-wordcount/part* > frequenza-parole-bibbia.txt
```

Below are the two scripts I used for this short tutorial.

Wordcount (Pig script):

```
a = load '/user/matteo/book.txt';
b = foreach a {
    line = LOWER(REPLACE((chararray)$0, '[!?\\.»«:;,\']', ' '));
    generate flatten(TOKENIZE(line)) as word;
}
c = group b by word;
d = foreach c generate group, COUNT(b) as cnt;
d_ordered = ORDER d BY cnt DESC;
store d_ordered into '/user/matteo/book-wordcount';
```

Wordcloud (R script):

```
library(wordcloud)
p = read.table(file="frequenza-parole-bibbia.txt")
png("/home/matteo/la-sacra-bibbia-frequenza-parole.png", width=900, height=900)
wordcloud(p$V1, p$V2, scale=c(8,.3), min.freq=2, max.words=200, random.order=T, rot.per=.15)
dev.off()
```
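The R script assumes the wordcloud package is already available; if it is not, install it once from CRAN before running the script (a one-off setup step, not part of the original script):

```
# wordcloud is not in base R; install it once before running the script
install.packages("wordcloud")
```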
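Since the word cloud only keeps the most frequent words anyway, you could also let Pig trim the list before it ever reaches R. A minimal sketch reusing the relations from the script above; the 3-character cutoff, the 200-word limit, and the book-wordcount-top output path are my own assumptions, not part of the original script:

```
-- sketch: drop very short words, then keep only the 200 most frequent ones
e = FILTER d BY SIZE($0) > 3L;    -- $0 is the word (the group key of d)
e_ordered = ORDER e BY cnt DESC;
top = LIMIT e_ordered 200;
store top into '/user/matteo/book-wordcount-top';
```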
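One last practical note: if you re-run the job, the final store fails because the output directory already exists on HDFS. Removing it first is the usual fix (same paths as above):

```
hadoop-2.4.0/bin/hdfs dfs -rm -r /user/matteo/book-wordcount
```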

