DSpace Repository

Destek vektör makineleri ve gauss karışım modeli ile istenmeyen e-postaların tespiti = Support vector machine and gauss mixture model detection of unsolicited e-mails /


dc.creator Ateş, Nurullah, 1986- author 100308
dc.creator Küçüksille, Ecir Uğur, 1976- thesis advisor 9288
dc.creator Süleyman Demirel Üniversitesi. Fen Bilimleri Enstitüsü. Bilgisayar Mühendisliği Anabilim Dalı. 24579 issuing body
dc.date 2014.
dc.identifier http://tez.sdu.edu.tr/Tezler/TF02605.pdf
dc.description This thesis presents two content-based filtering methods for detecting spam e-mail that use Support Vector Machines (SVM), a supervised learning algorithm, and Gaussian Mixture Models, an unsupervised learning algorithm. In these methods the title and body of each e-mail were used as attributes, and the character strings in the messages were preprocessed to obtain accurate attributes. In the study carried out with Turkish messages, non-letter characters attached to the beginning or end of a character string were removed, each string was truncated to its first five characters, all letters were converted to lower case, and strings occurring fewer than three times were deleted from the candidate attribute set. The 49 character strings with the highest Mutual Information values were then chosen as attributes.
The second method used Lingspam, a specialized data set. In content filters the most important attributes are the words of the message. A word can be written in different forms depending on tense, number and so on: the English word “people” is the plural of “person”, “plays” is the plural of “play”, and “found” is the past form of “find”. When words are examined during spam filtering, it is therefore important to use the base form of each word. The Lingspam data set stores the words of its messages in their base forms, a process it calls lemmatization. The most commonly used words of the language were also removed from this data set, because they cannot distinguish spam from normal e-mail and, since they occur in messages so often, they lengthen the running time of the algorithm.
To keep their mail from being filtered, spam senders write words that may serve as attributes, such as “viagra”, in different ways: “v*i*a*g*r*a”, “v1a1g1r1a”, “v.iagra”, “viagraaaa” and even “v i ag r a”. These spellings reduce the chance of detecting spam. The original contribution of this study is the use of the Soundex algorithm, which many studies use for pronunciation-based correction, to recognize the different spellings of a word. In the second method, the 98.6% correct identification rate obtained in tests using SVM indicates that using Soundex is effective. Keywords: Spam, electronic mail, Support Vector Machine, Gauss Mixture Model, Soundex.
dc.description Thesis (Master's) - Süleyman Demirel Üniversitesi, Fen Bilimleri Enstitüsü, Bilgisayar Mühendisliği Anabilim Dalı, 2014.
dc.description Includes bibliography.
dc.language tur
dc.publisher Isparta : Süleyman Demirel Üniversitesi Fen Bilimleri Enstitüsü,
dc.subject Süleyman Demirel Üniversitesi
dc.title Destek vektör makineleri ve gauss karışım modeli ile istenmeyen e-postaların tespiti = Support vector machine and gauss mixture model detection of unsolicited e-mails /
dc.type text
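
The abstract describes using Soundex to map obfuscated spellings of a word to one attribute. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes the standard American Soundex variant (the record does not say which variant was used), handles ASCII letters only, and strips all non-letter characters before coding, which is also an assumption. The function name `soundex` is illustrative.

```python
import re

# Standard American Soundex digit classes (assumed variant).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code, or "" if the word has no ASCII letters."""
    # Assumption: drop everything that is not an ASCII letter
    # (digits, dots, asterisks, spaces) before coding.
    letters = re.sub(r"[^a-z]", "", word.lower())
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's class.
        if digit and digit != prev:
            code += digit
        # 'h' and 'w' do not separate letters of the same class; vowels do.
        if ch not in "hw":
            prev = digit
        if len(code) == 4:
            break
    # Pad short codes with zeros to the fixed length of four.
    return code.ljust(4, "0")


variants = ["viagra", "v*i*a*g*r*a", "v1a1g1r1a", "v.iagra", "viagraaaa", "v i ag r a"]
codes = {soundex(v) for v in variants}
print(codes)
```

Under these assumptions, all six spellings listed in the abstract collapse to the same code, so a filter that uses Soundex codes as attributes would count them as one word.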


