Datasets are indispensable for studying opinion spam and opinion spammer detection. However, preparing such a dataset is harder than for other spam detection tasks, such as email spam and web spam, because opinion spam is subtle by nature. This website provides a dataset from a real case, Samsung probed in Taiwan over ‘fake web reviews’, reported by the BBC on 16 April 2013. It can be used to study the behavior of opinion spammers, their interactions through first posts and replies, and the detection tasks.
This dataset contains threads and posts from the Samsung board on Mobile01 from January 2011 to May 2012, together with the profiles of users who posted at least once during that period. The spam data come from two confidential spreadsheets that appear to be internally kept records of the spam posts.
The data instances are split into a training set and a test set, mainly by temporal order. Our work covers two detection tasks: spam detection and spammer detection. The files are briefly introduced below. For more details, please refer to our WWW 2015 and SIGIR 2015 papers.
(1) Spam detection for first posts
- data/first_post/train.json: The training set file for first posts.
- data/first_post/test.json: The test set file for first posts.
- data/first_post/test_star.json: The test set* file for first posts.
(2) Spam detection for replies
- data/reply/train.json: The training set file for replies.
- data/reply/test.json: The test set file for replies.
- data/reply/test_star.json: The test set* file for replies.
(3) Spammer detection
- spammer/train.json: The training set file for spammers.
- spammer/test.json: The test set file for spammers.
- data/thread_info.json: The file providing the metadata of all threads.
All files are in JSON format, and each JSON file holds a list of JSON objects, each containing the data of one record (for the post files, one post). The metadata of posts (i.e., first posts and replies) includes 'content', 'is_spam', 'nfloor', 'pnum', 'thid', 'time', 'uid' and 'uname'. The following is an example of a spam post.
'is_spam' : True,
'nfloor' : 4,
'pnum' : 1,
'thid' : '2708016',
'time' : '2012-04-26T17:47:00.000Z',
'uid' : '2092614',
'uname' : 'imCH'
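As a minimal sketch of reading one of the post files, the Python snippet below loads a JSON list of posts and counts spam versus non-spam instances. The function names load_posts and spam_counts are hypothetical, and the code assumes that 'is_spam' is stored as a JSON boolean, which the example above suggests but does not guarantee.

```python
import json
from collections import Counter


def load_posts(path):
    """Load one post file, assumed to be a JSON list of post objects."""
    # The text is Traditional Chinese in UTF-8, so the encoding is set explicitly.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def spam_counts(posts):
    """Count posts by their 'is_spam' label (assumed to be a boolean)."""
    return Counter(bool(post["is_spam"]) for post in posts)
```

For example, spam_counts(load_posts("data/first_post/train.json")) would give the label distribution of the first-post training set, provided the file is available at that relative path.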
The metadata of profiles includes 'is_spam', 'login_time', 'n_eff_posts', 'n_posts', 'n_replies', 'n_threads', 'p_phone_section', 'reg_time', 'score', and 'uid'. The following is an example of a non-spammer profile.
'is_spam' : False,
'login_time' : '2014-04-22T00:00:00.000Z',
'n_eff_posts' : 7,
'n_replies' : 7,
'n_threads' : 0,
'p_phone_section' : 100,
'reg_time' : '2011-04-19T00:00:00.000Z',
'score' : 0,
'uid' : '1955698',
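As a minimal sketch of working with the profile fields, the Python snippet below parses timestamps in the format shown in the example and derives the number of days between 'reg_time' and 'login_time'. The names parse_time and account_span_days, and the derived feature itself, are illustrative assumptions and are not taken from the papers.

```python
from datetime import datetime

# Matches timestamps such as '2011-04-19T00:00:00.000Z' from the example.
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_time(value):
    """Parse a timestamp string in the format used by the example records."""
    return datetime.strptime(value, TIME_FORMAT)


def account_span_days(profile):
    """Days between 'reg_time' and 'login_time' (a hypothetical feature)."""
    delta = parse_time(profile["login_time"]) - parse_time(profile["reg_time"])
    return delta.days


# Values copied from the non-spammer example above.
example_profile = {
    "is_spam": False,
    "login_time": "2014-04-22T00:00:00.000Z",
    "reg_time": "2011-04-19T00:00:00.000Z",
    "uid": "1955698",
}
span = account_span_days(example_profile)
```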
The metadata of threads includes 'clicks', 'fid', 'thid', 'time', 'title' and 'tot_pages'.
Please refer to our WWW 2015 and SIGIR 2015 papers for more details.
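Posts and threads share the 'thid' field, so the two can be joined. The Python sketch below is one possible way to do this: it computes the share of spam posts per thread and attaches the thread title. The function name spam_share_by_thread is hypothetical, and the code assumes that 'thid' values have the same type in the post files and in data/thread_info.json.

```python
from collections import defaultdict


def spam_share_by_thread(posts, threads):
    """Map each thid to its title and the fraction of its posts labeled spam."""
    titles = {thread["thid"]: thread["title"] for thread in threads}
    totals = defaultdict(lambda: [0, 0])  # thid -> [spam count, post count]
    for post in posts:
        counts = totals[post["thid"]]
        if post["is_spam"]:
            counts[0] += 1
        counts[1] += 1
    return {
        thid: {"title": titles.get(thid), "spam_share": spam / total}
        for thid, (spam, total) in totals.items()
    }
```

Threads that are missing from the thread metadata get a title of None rather than raising an error, which keeps the sketch usable on partial data.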
Dataset Language and Character Encoding
All text is in Traditional Chinese and encoded in UTF-8.
How to Cite the Corpus
Please cite the following two papers when referring to the Mobile01 Corpus in academic publications and papers.
Yu-Ren Chen and Hsin-Hsi Chen (2015). “Opinion Spam Detection in Web Forum: A Real Case Study.” In Proceedings of the 24th International World Wide Web Conference (WWW 2015), May 18-22, 2015, Florence, Italy. DOI: http://dx.doi.org/10.1145/2736277.2741085
Yu-Ren Chen and Hsin-Hsi Chen (2015). “Opinion Spammer Detection in Web Forum.” In Proceedings of the 38th Annual ACM SIGIR Conference (SIGIR 2015), August 9-13, 2015, Santiago, Chile. DOI: http://dx.doi.org/10.1145/2766462.2767766