How to manage tweets saved in #Hadoop using #Apache #Spark SQL

2015-01-15

Instead of using the old Hadoop way (MapReduce), I suggest the newer and faster way: Apache Spark on top of Hadoop YARN. In a few lines you can open all the tweets (gzipped JSON files saved in several subdirectories, hdfs://path/to/YEAR/MONTH/DAY/*gz) and query them in a SQL-like language:

```
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="extraxtStatsFromTweets.py")
sqlContext = SQLContext(sc)

# Load every gzipped JSON tweet file and query it with plain SQL
tweets = sqlContext.jsonFile("/tmp/twitter/opensource/2014/*/*.gz")
tweets.registerTempTable("tweets")
t = sqlContext.sql("SELECT distinct createdAt, user.screenName, hashtagEntities FROM tweets")

# count_items and javaTimestampToString are helpers defined elsewhere;
# each hashtag entity is a struct whose third field is the hashtag text
tweets_by_days = count_items(t.map(lambda row: javaTimestampToString(row[0])))
stats_hashtags = count_items(t.flatMap(lambda row: row[2])
                             .map(lambda ht: ht[2].lower()))
```
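The two helpers, count_items and javaTimestampToString, are not shown in the post. As a minimal sketch of what they might look like, assuming count_items is the usual pair-and-reduceByKey counting pattern and createdAt deserializes to a Java-style millisecond epoch timestamp:

```
from datetime import datetime
from operator import add

def count_items(rdd):
    # Hypothetical helper: count occurrences of each distinct item in the RDD
    return rdd.map(lambda x: (x, 1)).reduceByKey(add).collect()

def javaTimestampToString(ts):
    # Hypothetical helper: assume a millisecond epoch timestamp and
    # reduce it to a per-day bucket such as "2014-06-15"
    return datetime.fromtimestamp(ts / 1000.0).strftime("%Y-%m-%d")
```

With helpers along these lines, tweets_by_days would be a list of (day, tweet count) pairs and stats_hashtags a list of (hashtag, frequency) pairs.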



