The Fifth International AAAI Conference on Weblogs and Social Media is holding a new data challenge using a new dataset that includes about 3 TB of social media data collected by Spinn3r between January 13 and February 14, 2011.
The dataset consists of over 386M blog posts, news articles, classifieds, forum posts and other social media content from a month that included events such as the Tunisian revolution and the Egyptian protests. The content includes the syndicated text, its original HTML as found on the web, annotations and metadata (e.g., author information, time of publication and source URL), and content extracted after removing boilerplate and chrome. The data is formatted as Spinn3r’s protostreams, an extension to Google Protocol Buffers (protobuf). It is also broken down by date, content type and language, making it easy to work with selected subsets of the data.
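For readers new to streamed protobuf data, the sketch below shows one common way a file of many serialized messages can be walked through record by record. It assumes each record is prefixed with its byte length as a base-128 varint, which is the usual protobuf "delimited" convention. That framing is an assumption here, not a statement of the protostream specification, and the function names `read_varint`, `iter_records` and `write_record` are hypothetical helpers written for this illustration. Decoding each payload would additionally require the message definitions distributed with the dataset.

```python
import io


def read_varint(stream):
    """Read one base-128 varint; return None on a clean end of stream."""
    result = 0
    shift = 0
    while True:
        byte = stream.read(1)
        if not byte:
            if shift == 0:
                return None  # no more records
            raise EOFError("stream ended inside a varint")
        value = byte[0]
        # The low 7 bits carry data; the high bit means "more bytes follow".
        result |= (value & 0x7F) << shift
        if not value & 0x80:
            return result
        shift += 7


def iter_records(stream):
    """Yield raw payloads, ASSUMING varint length-prefixed framing."""
    while True:
        length = read_varint(stream)
        if length is None:
            return
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated record")
        yield payload


def write_record(stream, payload):
    """Write one payload with a varint length prefix (used for the demo)."""
    length = len(payload)
    while True:
        low = length & 0x7F
        length >>= 7
        if length:
            stream.write(bytes([low | 0x80]))
        else:
            stream.write(bytes([low]))
            break
    stream.write(payload)


if __name__ == "__main__":
    # Round-trip two stand-in payloads through an in-memory buffer.
    demo = io.BytesIO()
    write_record(demo, b"first record")
    write_record(demo, b"second record")
    demo.seek(0)
    for raw in iter_records(demo):
        print(len(raw), raw)
```

Because the reader yields one payload at a time, a file can be processed without loading it into memory, which matters for a collection of this size.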
See the ICWSM Data Challenge pages for more information on the challenge task, its associated ICWSM workshop and procedures for data access.