% Publications
% 2019
@article{Luckner2019b,
  title     = {Antyscam-Practical web spam classifier},
  author    = {Marcin Luckner and Michal Gad and Pawel Sobkowiak},
  doi       = {10.24425/ijet.2019.130255},
  issn      = {2300-1933},
  year      = {2019},
  date      = {2019-01-01},
  journal   = {International Journal of Electronics and Telecommunications},
  volume    = {65},
  number    = {4},
  pages     = {713--722},
  abstract  = {To avoid manipulation of search engine results by web spam, anti-spam systems use machine learning techniques to detect spam. However, if the system's learning set is out of date, the quality of classification falls rapidly. We present a web spam recognition system that periodically refreshes the learning set to create an adequate classifier. A new classifier is trained exclusively on data collected during the last period. We show that this strategy is better than incrementally extending the learning set. The system solves the start-up problem of an incomplete learning set by minimising the number of required learning examples and utilising external data sets. The system was tested on real data from spam traps and well-known web services: Quora, Reddit, and Stack Overflow. A test carried out over ten months shows the stability of the system and an improvement in the results of up to 60 percent by the end of the examined period.},
  keywords  = {Automatic classification, Imbalanced sets classification, Machine learning, Spam detection, Web spam detection},
  pubstate  = {published},
  tppubtype = {article}
}
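The refresh strategy described in the abstract above (train each new classifier only on data from the most recent period, rather than on an ever-growing learning set) can be sketched as follows. This is a minimal illustration of the windowing idea, not the authors' implementation; all names are hypothetical.

```python
from collections import deque

def training_sets(batches, strategy="window", window=1):
    """Yield the training set available at each period.

    batches  -- list of per-period labelled sample lists
    strategy -- "window": train only on the last `window` periods
                (the refresh strategy from the paper's abstract);
                "incremental": keep every period seen so far.
    """
    history = deque(maxlen=window if strategy == "window" else None)
    for batch in batches:
        history.append(batch)
        # flatten the retained periods into one training set
        yield [sample for period in history for sample in period]
```

With three monthly batches, the "window" strategy trains the third classifier only on month 3, while "incremental" trains it on months 1 through 3.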
% 2014
@article{Luckner2014a,
  title     = {Stable web spam detection using features based on lexical items},
  author    = {Marcin Luckner and Micha{ł} Gad and Pawe{ł} Sobkowiak},
  url       = {http://linkinghub.elsevier.com/retrieve/pii/S0167404814001151},
  doi       = {10.1016/j.cose.2014.07.006},
  issn      = {0167-4048},
  year      = {2014},
  date      = {2014-01-01},
  journal   = {Computers \& Security},
  volume    = {46},
  pages     = {79--93},
  abstract  = {Web spam is a method of manipulating search engine results by improving the ranks of spam pages. It takes various forms and lacks a consistent definition. Web spam detectors use machine learning techniques to detect spam. However, the detectors are mostly verified on data sets from the same year as the learning sets. In this paper we compare Support Vector Machine classifiers trained and tested on WEBSPAM-UK data sets from different years. To obtain stable results, we propose new lexical-based features. The HTML document, transformed into a text without HTML tags, a set of visible symbols, and a list of links, including those taken from tags, gave information about odd combinations of letters; consonant clusters; statistics on syllables, words, and sentences; and the Gunning Fog Index. Using data collected in 2006 as the learning set, we obtained very stable accuracy across the years. This choice of training set reduced the sensitivity in 2007, but that can be improved by adjusting the acceptance threshold. Finally, we show that the balance between sensitivity and specificity, measured by the Area Under the Curve (AUC), is improved by our selection of features.},
  keywords  = {Context analysis, Lexical items analysis, Regular expressions, Spam detection features, Web spam detection},
  pubstate  = {published},
  tppubtype = {article}
}
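One of the readability features named in the abstract above, the Gunning Fog Index, combines average sentence length with the share of complex words (three or more syllables). A rough sketch of the formula, using a naive vowel-run syllable counter rather than the paper's exact tokenisation:

```python
import re

def count_syllables(word):
    # crude heuristic: each run of vowels counts as one syllable,
    # with a minimum of one per word
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def gunning_fog(text):
    """Gunning Fog Index: 0.4 * (words/sentence + 100 * complex/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        return 0.0
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

Higher scores correspond to text that needs more years of education to read on a first pass; in the paper's setting such statistics serve as input features for the SVM classifier rather than as a readability verdict.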