A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling

KIPS Transactions on Software and Data Engineering, Vol. 6, No. 9, pp. 445-456, September 2017
DOI: 10.3745/KTSDE.2017.6.9.445

Abstract

In this paper, we propose a method for detecting spam tweets that contain unhealthy information, using an n-gram dictionary under limited labeling. Spam tweets that contain unhealthy information tend to use similar words and sentences. Based on this characteristic, we show that spam tweets can be detected effectively by applying a Naive Bayes classifier that uses n-gram dictionaries constructed from spam tweets and normal tweets. However, constructing an initial training set is very costly because a large amount of data flows through Twitter in real time. A spam detection method is therefore needed that can be applied in an environment where the initial training set is very small or nonexistent. To solve this problem, we propose a method that generates pseudo-labels by utilizing Twitter's retweet function and uses them both to build the initial training set and to update the n-gram dictionaries. Results from various experiments using 1.3 million Korean tweets collected from December 1, 2016 to December 7, 2016 show that the proposed method outperforms the spam detection methods it was compared against.
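As a rough illustration of the idea described in the abstract, the sketch below keeps one n-gram dictionary per class and scores a tweet with a Laplace-smoothed Naive Bayes rule. It is not the authors' implementation: the abstract does not state the n-gram unit, the smoothing, or how retweets are turned into pseudo-labels, so character bigrams, add-one smoothing, and the names `char_ngrams` and `NGramNaiveBayes` are all assumptions made for illustration. The `update` method simply takes a label as given, whether it came from a human annotator or from a pseudo-labeling step.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of a tweet after collapsing whitespace.

    Character n-grams are an assumption; the abstract does not say
    whether the dictionaries are built from characters or words.
    """
    s = " ".join(text.split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


class NGramNaiveBayes:
    """Hypothetical two-class Naive Bayes over per-class n-gram dictionaries."""

    LABELS = ("spam", "normal")

    def __init__(self, n=2):
        self.n = n
        # One n-gram dictionary (n-gram -> count) per class.
        self.dicts = {label: Counter() for label in self.LABELS}
        # Number of tweets seen per class, used for the class prior.
        self.docs = {label: 0 for label in self.LABELS}

    def update(self, text, label):
        """Add one labeled (or pseudo-labeled) tweet to its class dictionary."""
        self.dicts[label].update(char_ngrams(text, self.n))
        self.docs[label] += 1

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods, with add-one smoothing."""
        counts = self.dicts[label]
        total = sum(counts.values())
        vocab = len(set(self.dicts["spam"]) | set(self.dicts["normal"]))
        score = math.log((self.docs[label] + 1) / (sum(self.docs.values()) + 2))
        for gram in char_ngrams(text, self.n):
            # The extra +1 in the denominator reserves mass for unseen n-grams.
            score += math.log((counts[gram] + 1) / (total + vocab + 1))
        return score

    def predict(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


if __name__ == "__main__":
    clf = NGramNaiveBayes(n=2)
    clf.update("free coins click here now", "spam")
    clf.update("lunch with friends was great", "normal")
    print(clf.predict("click here for free coins"))
```

Because `update` only increments counts, the dictionaries can be extended incrementally as new pseudo-labeled tweets arrive, without retraining from scratch.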


Cite this paper

[KIPS Transactions Style]
H. Choi and C. H. Park, "A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling," KIPS Transactions on Software and Data Engineering, Vol.6, No.9, pp.445-456, 2017, DOI: 10.3745/KTSDE.2017.6.9.445.

[IEEE Style]
Hyeok-Jun Choi and Cheong Hee Park, "A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling," KIPS Transactions on Software and Data Engineering, vol. 6, no. 9, pp. 445-456, 2017. DOI: 10.3745/KTSDE.2017.6.9.445.

[ACM Style]
Choi, H. and Park, C. H. 2017. A Method for Twitter Spam Detection Using N-Gram Dictionary Under Limited Labeling. KIPS Transactions on Software and Data Engineering, 6, 9, (2017), 445-456. DOI: 10.3745/KTSDE.2017.6.9.445.