Machine Learning Algorithms in web page classification

machine learning vs web development and web mining machine learning for web applications and introduction to machine learning with web data
Dr.NaveenBansal Profile Pic
Published Date:25-10-2017
Your Website URL(Optional)
Machine Learning-Based Web Documents Categorization by Semantic Graphs 1 1 Francesco Camastra , Angelo Ciaramella , 2 1 Alessio Placitelli , and Antonino Staiano 1 Dept. of Science and Technology, University of Naples “Parthenope”, Isola C4, Centro Direzionale, I-80143, Napoli (NA), Italy camastra, angelo.ciaramella,, 2 Vitrociset s.p.a., Via Tiburtina, 1020 - 00156 Roma, Italy Abstract. This work aims to approach web pages categorization by means of semantic graphs and machine learning techniques. We propose to use a semantic graph that can provide a compact and structured representation of the concepts present in a document in order to take into account the semantic information. The semantic graph allows de- termining a map of the semantic areas contained in the document and theirrelationshipsw.r.t.aparticularconceptorterm.Thesemanticmea- sure between the terms is calculated by using the lexical database (i.e., WordNet). The document categorization is accomplished by a machine learning technique.We compare theperformance of both supervised and unsupervised techniques (i.e., Support Vector Machine and Self Orga- nizing Maps, respectively). The proposed methodology has been applied for classification and agglomeration of benchmark and real data. From the analysis of the results it can be shown that the model trained with semantic features obtains satisfactory results, in particular by using the unsupervised machine learning technique. 1 Introduction With the dramatically quick and explosive growth of information available over the Internet, World Wide Web has become a powerful platform to store, dissem- inate and retrieve information as well as mine useful knowledge 3. Information is mostly in the form of unstructured data. As the data on the web has been growing, it has lead to several problems such as increased difficulty of finding relevant information and extracting potentially useful knowledge. Web mining is an emerging research area focused on the application of data mining techniques to discover patterns from the Web. According to analysis targets, Web mining can be divided into three different types, which are Web usage mining, Web content mining and Web structure mining. In this work we address the problem of Web content mining. Web content mining extracts information from different Web sites for its access and knowledge discovery. In particular, we study a novel methodology for Web pages categorization considering the textual content.  c Springer International Publishing Switzerland 2015 75 S. Bassis et al. (eds.), Recent Advances of Neural Networks Models and Applications, Smart Innovation, Systems and Technologies 37, DOI: 10.1007/978-3-319-18164-6_876 F. Camastra et al. In the past 20 years, the number of text documents in digital form has grown exponentially 12. As a consequence of this exponential growth, great impor- tance has been put on the classification of documents into groups that describe the content of the documents. The function of a classifier is to merge text doc- uments into one or more predefined categories based on their content. Each document can belong to several categories or may present its own category. In 9 the authors review the Web-specific features and algorithms that have been explored and found to be useful for Web page classification. Most approaches described in literature do not consider the semantic information in the document and therefore in some cases may not perform adequately. In 1 an approach to incorporateconcepts from backgroundknowledgeinto document representations for text document classification (by using boosting machine learning technique) has been proposed. To extract concepts from texts, the authors have developed a detailed process, that can be used with any ontology with lexicon. Our work aims to approach web pages categorization by means of semantic graphs and 1 machine learning techniques . The semantic graph allows determining a map of the semantic areas contained in the document and their relationships w.r.t. a particular concept or term. The similarity between the terms is calculated by using the lexical database (i.e., WordNet) and the pages are represented using a TF-IDF (Term Frequency-Inverse Document) mechanism. The paper is organized as follows. In Section 2, we introduce the categorization problem of documents, and, in Section 3 the TF-IDF methodology is presented. In Section 4 we describe the semantic graph and how to use it for the the TF- IDF methodology. In Section 5, the experimental results on benchmark and real data are presented. Finally, some conclusions and future remarks are outlined. 2 Document Categorization Categorization of documents refers to the problem of automatic classification of a set of documents in classes (or categories or topics). A common approach for text classification is formed by five steps. The first step (tokenization) eliminates the punctuation signs in the text. The second step (stopping), removes from the text the so-called stopping words, i.e., common words (e.g, articles, modal verbs, prepositions) that are widespread in every text and therefore cannot be used for discriminatingatext. Thethird stepisthe stemming,whereeachtermisreduced to own lexical root (or stem)bymeansofa stemming algorithm (e.g., Porter’s algorithm). In the fourth step, the document is represented by means of a vector whose generic i-th coordinate is computed by TF-IDF (Term Frequency-Inverse Document) approach 11. Finally, the document classification is performed by a machine learning technique. The approach described above does not consider the semantic information in the document and therefore in some cases may not perform adequately. 1 The work was made when Alessio Placitelli was M. Sc. Student at University of Naples Parthenope.Machine Learning-Based Web Documents Categorization 77 3Scoring To extract information from a document we compute a score between a query term t and a document d, based on the weight of t in d. The simplest approach is to assign the weight to be equal to the number of occurrences of term t in doc- ument d (tf , Term Frequency of the term t in document d)6.Rawtermfre- t,d quency as above suffers from a critical problem: all terms are considered equally important when it comes to assessingrelevancy on a query. In fact, certain terms have little or no discriminating power in determining relevance. A mechanism for attenuating the effect of terms that occur too often in the collection to be meaningful for relevance determination. An idea could be to reduce the tf t,d weight of a term by a factor that grows with its collection frequency. It is more commonplace to use for this purpose the document frequency df , defined to be t the number of documents in the collection that contain a term t.Denotingthe total number of documents in a collection by N, we define the inverse document frequency (idf)ofaterm t as follows: N idf =log . (1) t df t Thus, the idf ofa rareterm is high, whereasthe idf ofa frequentterm is likely to be low. We now combine the definitions of term frequency and inverse document frequency, to produce a composite weight for each term in each document. The TF-IDF weighting scheme assigns to term t a weight TF-IDF in document d given by TF-IDF = tf ×idf . (2) t,d t,d t We may view each document as a vector with one component corresponding to each term in the dictionary, together with a weight for each component that is given by equation (2). 4SemanticGraph In order to take into account the semantic information, we propose to use a semantic graph that can provide a compact and structured representation of the concepts present in a document. The semantic graph allows determining a map of the semantic areas contained in the document and their relationships w.r.t. a particular concept or term, called target. The semantic weight indicates how much the document is relevant w.r.t. the target. The semantic graph is a undi- rected, fully connected graph, consisting of the terms of the document connected by relations of similarity to a target term. A semantic graph is computed starting from a single term. Let t be the term whose semantic graphhas to be computed and N is the number of the most sim- ilar terms in the document, the construction of the semantic graph is performed by means of four well-defined phases: similarity calculation, ranking, graph con- struction and semantic weight calculation. The similarity (s) between the terms78 F. Camastra et al. Fig.1. Minimum spanning tree computed by the algorithm of Kruskal is calculated by using the lexical database WordNet 7. Next, the terms more similar to t, are ranked on the basis of a properly chosen similarity metric. Now the top N terms are used as the vertices of the undirected weighted semantic graph. For each pair of vertices an edge is created. The weight of the edge is pro- portional (1−s) to the semantic distance between the terms (e.g., Lin similarity in the 0,1 interval 5). For instance, consider the construction of a semantic graph related to the target term computer based on a document containing the terms: Internet, www, cat, network, software, computer, web,and homepage.The information contained in the semantic graph can be represented by a single syn- thetic value called semantic weight (w ). This value is obtained calculating the s sum of the reconstructed weights (1 − w ) for the arcs belonging to the Mini- s mum Spanning Tree (MST) ofthe semantic graph. The weight indicates that the semantic parsed document, represented through the semantic graph, is relevant to the target word. The higher the value of the weight, the more the document refers to the subject matter from the end target. On the contrary, the smaller this value, the less the document identifies the target. In Figure 1, the minimum spanning tree computed by the algorithm of Kruskal is presented. The final se- mantic weight is proportional to this estimated weight. Summarizing, the steps ofthe proposedcategorizationprocessareasfollows.The firstthree steps arethe same as in usual categorization process (i.e., tokenization, stop words removal, stemming), in the fourth step, to each term it is associated the semantic weight, instead of the usual TF-IDF value. In Figure 2 , the use of a semantic graph in a bag of words mechanism6, is shown. Finally, using the TF-IDF vectors we have performed the document categorization by means of a machine learning technique. In this specific case, both Support Vector Machine (SVM) 2 and Self Organizing Maps (SOMs) 4 have been applied. 5 Experimental Results The proposed methodology has been applied for classification (SVM) and ag- glomeration (SOM) of two different corpora. To evaluate the performance, the results are compared with those obtained by the standard TF-IDF mechanism considering different metrics. For the SOM, we consider the quantization error (QE), the topographic error (TE) and the combined error (CE). Regarding the SVM, the percentage of documents correctly classified is evaluated (for further results as confusion matrix, measures of precision and recall see 8).Machine Learning-Based Web Documents Categorization 79 Fig.2. Bag of words mechanism and semantic graph: extraction of features from documents In a first phase of validation, the Reuters 21578 corpus has been considered 10. This corpus was issued by the multinational Reuters in 2000 and made publicly available for researchpurposes. We used only three categoriesfor a total of 390 documents divided as follows: Cocoa (55 documents), Money Supply (138 documents) and Ship (197 documents). In Tables 1 and 2, we report the results applying SOM and SVM techniques, respectively. Using SOM, in the case of semantic weights, a QE of 41.961, a TE of 0 and a CE of 48.414 are obtained. Instead the SVM, as reported in Table 1, has allowed obtaining a percentage of correct classification of 97.17% (379 documents). In the case of standard TF- IDF, SOM has detected a QE of 43.675, a TE of 0.005 and a CE of 49.820. The percentage of correct classification obtained through SVM is of 93.58% (365 papers). Successively, a scraping software has been used for analyzing a web page and extractthemaincontentexcludingtags,templatesandotherkindofunnecessary code 8. In order to build the corpus, the scraper was launched for five days (from March 15, 2013 to March 20, 2013) by performing the scraping of 1995 journalistic news of 5 different categories: politics, sport, business, science and entertainment. In Table 3, we describe the categories and information sources 2 for the corpus . The dictionary of the processed corpus is composed initially of 27305 terms. The removal of low-frequency terms leads to the elimination of 15619 words, decreasing the size of the dictionary terms to 11686 terms. The SVM training wasperformedusing a linearkernel and acost coefficient of C=1.0. The main objective of these experimental results is to compare TF-IDF and SWA approaches. For this reason we chose to use a simple linear kernel and to considerthe same weightbetween the slackvariability penalty and the margin in SVM optimization mechanism 2. A 10-fold cross-validationis performed. On 2 The corpora are available on request.80 F. Camastra et al. Table 1. SVM results: Reuters and scraped corpora SVM training set training set test set TF-IDF 93.58% 81.35% 62% SWA 97.71% 79.79% 70.36% Reuters Real data Real data Table 2. SOM results: Reuters and scraped corpora (QE = Quantization Error; TE = Topographic Error; CE = Combined Error) SOM QE TE CE QE TE CE Test set Test set TF-IDF 41.96% 0.0% 48.41% 75.04% 0.01% 81.74% 83.19% 75.12% SWA 43.67% 0.0% 49.82% 47.41% 0.04% 58.43% 86.68% 79.37% Reuters Reuters Reuters Real data Real data Real data Real data Feature Selection Table 3. Information sources used for the corpora (feed RSS) Category Information sources Politics The Guardian, The Telegraph, The Scotsman, BBC Sports The Guardian, The Independent, The Telegraph, Daily Mail, The Express, The Daily Star, The Scotsman, BBC Business The Guardian, The Telegraph, Daily Mail, The Express, BBC Science The Guardian, The Independent, The Telegraph, Daily Mail, The Express, The Daily Star, The Scotsman (Technology), BBC Entertainment The Guardian (Movie), The Telegraph, Daily Mail, The Express, The Scotsman, BBC Table 4. Test corpus Category ofdocuments Words (average) Characters (average) Politics 47 718 4224 Sport 210 523 3057 Business 165 590 3479 Science 141 584 3585 Entertainment 137 595 3448 the standard TF-IDF vectors, the classification percentage is of 81.35% (on the overall training set). In the unsupervised case, the SOM has a height of 17 and a width of 13 neurons. The estimated QE is 75.047, the TE is of 0.015 while the CE is 81.740.By using the semantic wights the SVM classification is of 80.760%. Instead, SOM has produced a QEof 47.410,a TE of0.047anda CE of58.433.In Figure 3 we show the comparison between the topological mapping of the SOM obtained with TF-IDF and by using semantic weights, respectively. The results and the figures show that by using semantic weights is possible to obtain a more regulartopographicmap.Moreover,thegeneratedmodelsareevaluatedthrougha testcorpuscomposedby700terms(seeTable4fordetails).Weobtainapercentage of 62% of correct classifications using the TF-IDF features and of 70.36% usingMachine Learning-Based Web Documents Categorization 81 a) b) Fig.3. Topographic result of the SOM: a) with TF-IDF features; b) with semantic features semantic features. The performance obtained on the trained SOM are of 83.19% ofcorrectclassification for TF-IDF and of86.68%for the semantic case. Finally, from the previous corpus we generate a reduced corpus obtained by using a semantic feature selection. The semantic feature selection is obtained by usinganaggregationapproachbasedonthesimilaritymeasuresandWordNet.In particular, the words are chosen by using an agglomerating rate and considering the most dissimilar ones for building the bag of words 8. In this case we note that the best performance are obtained using the SOM with semantic weights (79,37% of perfect classification). 6 Conclusions Inthispaper,wepresentedamethodologyforwebpagescategorizationbymeans of semantic graphs and machine learning techniques. The semantic graph allows determining a map of the semantic areas contained in the document and their relationships (i.e., semantic metric) w.r.t. a particular concept or term. The doc- ument categorization is accomplished by means of a supervised or unsupervised machinelearningtechnique,SupportVectorMachine(SVM) andSelfOrganizing Maps (SOM), respectively. The model that uses semantic features and trained on the Reuters corpus obtains better results for both SVM and SOM. For the other corpora, the best results are obtained by SOM and semantic weights. We can consider that a category may present concepts strongly correlated with the other categories and this behavior can be better managed by an unsupervised mechanism.We,however,wishtohighlightthatbyusingsemanticweightsaWeb Page can also not contain a specific term but it contains correlated concepts. In the next future the authors will focus their attention on the use of different parameters for the machine learning techniques, different semantic metrics and on the categorization of documents also using images, audio and video.82 F. Camastra et al. References 1. Bloehdorn, S., Hotho, A.: Boosting for text classification with semantic features. In: Mobasher, B., Nasraoui, O., Liu, B., Masand, B. (eds.) WebKDD 2004. LNCS (LNAI), vol. 3932, pp. 149–166. Springer, Heidelberg (2006) 2. Cortes, C., Vapnik,V.:Support-vectornetworks.Machine Learning20(3), 273–297 (1995) 3. Divya,C.:Mining ContentsinWebPages and RankingofWebPages UsingCosine Similarity. International Journal of Science and Research (IJSR) 3(4) (2014) 4. Kohonen, T.: The self-organizing map. Proceedings of the IEEE 78(9), 1464–1480 (1990) 5. Lin, D.: An information-theoretic definition of similarity. In: Proceedings of the 15th International Conference on Machine Learning, San Francisco, vol. 1, pp. 296–304 (1998) 6. Manning, C.D., Raghavan, P., Schutze, ¨ H.: Introduction to Information Retrieval. Cambridge University Press (2008) 7. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: Introduction to wordnet: An on-line lexical database. International Journal of Lexicography 3(4), 235–244 (1990) 8. Placitelli,A.P.:Categorizzazione dipaginewebmediantegrafosemanticoetecniche di machine learning, MSc dissertion, University of Naples “Parthenope” (2013) 9. Qi, X., Davison, B.D.: Web Page classification: Features and algorithms. ACM Computing Surveys (CSUR) 41(2), 12 (2009) 10. 11. Salton, G., Buckley,C.: Term-weighting approaches in automatic text retrieval. In: Information Processing and Management, pp. 513–523 (1988) 12. Trstenjaka, B., Mikacb, S., Donkoc, D.: KNN with TF-IDF based Framework for Text Categorization. Procedia Engineering 69, 1356–1364 (2014) Web Spam Detection Using Transductive−Inductive Graph Neural Networks 1 1 1,2 Anas Belahcen , Monica Bianchini , and Franco Scarselli 1 Dipartimento di Ingegneria dell’Informazione e Scienze Matematiche Università degli Studi di Siena − Siena, Italy 2 LeRMA − ENSIAS Mohammed V Souissi University − Rabat, Morocco monica,, Abstract. The Web spam detection problem has received a growing interest in the last few years, since it has a considerable impact on search engine reputations, being fundamental for the increase or the deterioration of the quality of their results. As a matter of fact, the World Wide Web is naturally represented as a graph, where nodes correspond to Web pages and edges stand for hyperlinks. In this paper, we address the Web spam detection problem by using the GNN architecture, a supervised neural network model capable of solving classification and regression problems on graphical domains. Interestingly, a GNN can act as a mixed transductive−inductive model that, during the test phase, is able to classify pages by using both the explicit memory of the classes assigned to the training examples, and the information stored in the network parameters. In this paper, this property of GNNs is evaluated on a well−known benchmark for Web spam detection, the WEBSPAM−UK2006 dataset. The obtained results are comparable to the state−of−the−art on this dataset. Moreover, the experiments show that performances of both the standard and the transductive−inductive GNNs are very similar, whereas the computation time required by the latter is significantly shorter. 1 Introduction In several application areas, data are naturally represented as graphs or trees, e.g., in computer vision, molecular biology, software engineering and natural language processing. As a matter of fact, nodes in these structures are used to represent objects, while edges determine the relationships between them. For example, the World Wide Web is commonly described by a graph, where nodes represent Web pages and edges stand for hyperlinks. In the Web graph, nodes and edges may have vector labels, collecting the information available about the page contents and the hyperlinks, respectively. Traditional machine learning approaches try to reduce graphical data into simple representations, as, e.g., a set of vectors. In this way, the topological information may © Springer International Publishing Switzerland 2015 83 S. Bassis et al. (eds.), Recent Advances of Neural Networks Models and Applications, Smart Innovation, Systems and Technologies 37, DOI: 10.1007/978-3-319-18164-6_9 84 A. Belahcen, M. Bianchini, and F. Scarselli be lost during the preprocessing step, which can deeply affect the achieved performance. On the other hand, the Graph Neural Network (GNN) model 1 is capable of processing graphs directly, without any preprocessing step. GNNs are supervised neural network models that extend the recursive paradigm, and can be applied on most of the practically useful kinds of graphs, including directed, undirected, labeled and cyclic graphs. GNNs have been successfully employed in several application domains, such as molecule classification, object localization in images, and Web page ranking. In this paper, we apply GNNs to Web spam detection, i.e., the problem of classifying a Web page as a document containing spam or not. Such a problem has received a growing interest in the last few years, due to its importance for search engines 2−4. In Web spam detection, the Web graph can be used both for learning and testing. During training, we use a small set of pages, for which the target is known, to learn the GNN parameters. Then, the trained GNN is applied on the whole Web graph, to classify the remaining Web pages. GNNs are well suited for Web spam detection, since they can learn to automatically classify pages, exploiting both the information available on the page contents and on the Web connectivity. Interestingly, a GNN can operate using two modalities: it can act as a pure inductive model or as a mixed transductive−inductive model. In the former case, during the test phase, the GNN completely relies on its parameters to classify Web pages. The actual classification of a training page is not explicitly memorized, so that such page can be even misclassified by the GNN. With the transductive−inductive modality, the classification of the training pages is explicitly added to the Web graph. In this way, the GNN operates also as a transductive model, that classifies test pages by using the information already available for the training pages, and by diffusing such an information through the Web graph. In order to evaluate our approach, we tested GNNs on a well−known benchmark for Web spam detection, the WEBSPAM−UK2006 dataset 5, finding promising preliminary results. 2 The Graph Neural Network Model Graph Neural Networks (GNNs) are a supervised connectionist model capable of solving classification and regression problems on graphical domains. One of the major advantages of GNNs is their capacity of processing graphs directly (without preprocessing), which preserves the information collected into the graph topology. In fact, graph nodes are used to represent concepts, while edges determine the relationships between them. Each concept or node in the graph is defined by its features, and also by the information contained in its neighborhood. Based on these two information sources, the GNN calculates a state , for each node , which contains the node representation (see Fig. 1). Then, using this state, the GNN produces an output that denotes the classification decision on that node. Formally, the output of a GNN is defined by the following equations: Web Spam Detection n Using Transductive−Inductive Graph Neural Networks 85 , , , , (1) , , where and are param metric functions, implemented by two feedforward neu ural networks, which express th he dependence of the state at each node on the state of f its neighborhood, and the depe endence of the node output on its state, respectively. Fig. 1. A graph and, in evidence e, the neighborhood of a node. The state x of node 3 depends on n the 3 information contained in its nei ighborhood. The transition and the output functions are, respecti ively defined as , , , , , , , , , , , , , , , , , , Moreover, , , , represent the label of , the labels of its attached edges, and the states and the e labels of the nodes in its neighborhood, respectively. In order to compute the o output defined by Eq. (1), the Banach Fixed Point Theor rem suggests the following class sic iterative scheme: 1 , , , , (2) , , for each node . Intuitively y, the computation described by Eq. (2) can be interpre eted as the activity of a networ rk consisting of units which compute and . Suc ch a network, built by replacing each node of the graph with a unit computing (see F Fig. 2), will be called the enco oding network. Each unit stores the current state at node and, when activat ted, it calculates 1 (Fig. 2). The output at is produced by another unit w which implements . 86 A. Belahcen, M. Bianchini, and F. Scarselli Fig. 2. The graph (on the left) and the corresponding encoding network (on the right). Graph nodes (circles) are replaced by ad hoc units computing and (squares). When and are implemented by feedforward neural networks, the encoding network is a recurrent network. More details on the GNN training algorithm and output computation can be found in 1. Here, it suffices to say that both training and test sets consist of a labelled graph which, in our application, is a portion of the Web graph. For the training set, also targets for some nodes are provided, which define the actual class of these nodes, i.e. whether corresponding pages are spam or not. The training procedure adapts the network parameters in order to produce the correct outputs on the supervised pages, while the test procedure uses the trained GNN to classify the remaining pages. As mentioned in Section 1, GNNs can be exploited either as a common parameterized inductive model or as a mixed transductive−inductive model. In the inductive setting, the network is fed with the Web graph, using supervised pages to adapt the GNN parameters. Hence, in this way, the information contained in the training set is used to approximate a classification function that can be used to directly classify the nodes of the Web graph. On the other hand, in the transductive−inductive model, during training, a subset of the supervised pages is assigned a label enriched with their class membership, whereas the remaining (the class membership label is unset) are used for training − i.e. they contribute to the calculus/optimization of the error function. Instead, during testing, a component of the label of each training page explicitly specifies whether such a page is spam or not (the class membership label is unset for unsupervised pages), so that the information available on the training pages is directly diffused through the Web graph. 3 The WEBSPAM−UK2006 Dataset In order to assess our approach, we evaluate the GNN model on the WEBSPAM− UK2006 dataset. Actually, the dataset was adopted in 2007 by the Web Spam Challenge, a competition held annually during the International Workshop on Web Spam Detection Using Transductive−Inductive Graph Neural Networks 87 Adversarial Information Retrieval on the Web. The Web graph is a crawl of the .uk domain that includes 77.9 million pages and over 3 billion links in 11,402 hosts. The labeling was at the host level, i.e., the assessors labeled the hosts as normal or spam. Such a benchmark is particularly suited for our purpose both because it has been used by several research groups and because it is sufficiently large to produce significant results and, at the same time, not too huge to prevent a wide experimentation. 3.1 Features Data are represented by the following features: (1) link−based features, which include, f.i., the indegree and the outdegree of hosts and their neighbors, PageRank and TrustRank; (2) content−based features, which include, f.i., the fraction of anchor and visible text, the compression rate, the corpus precision (the fraction of words in a page that belong to the set of popular terms), and the corpus recall (the fraction of popular terms that appear in the page). 3.2 Feature Preprocessing The WEBSPAM−UK2006 dataset includes 41 link−based and 96 content−based features, which are used as node labels. Due to the high number of features, we use a feedforward neural network in order to summarize and compress them into a single one. More precisely, different configurations were used for GNNs, as it follows. • Link and content−based features directly: The most significant link and content−based features are selected, using a correlation−based feature selector 6, and integrated as the node label. • Link−based feature: A feedforward neural network is employed in order to compress all the link−based features into a single output. This output is then used as the node label. • Content−based feature: A feedforward neural network is employed in order to compress all the content−based features into a single output. This output is then used as the node label. • Compressed and uncompressed features: Link and content−based features, already compressed by feedforward networks, in addition to some features directly selected (in particular, the PageRank and the TrustRank of the host and of its maximum scored page) are collected together and then used as the node label. 3.3 Teams Participating to the 2007 Web Spam Challenge We compare our results with those gained by the six teams participating to the 2007 Web Spam Challenge. The competition attracted three teams from academic institutions (Hungarian Academy of Sciences, University of Waterloo, and Chinese Academy of Sciences) and three teams from industry research laboratories (Genie 88 A. Belahcen, M. Bianchini, and F. Scarselli Knows, Microsoft, and France Télécom). Results obtained by competing teams are shown in Table 3. Notice that, after the competition, other groups have worked on these benchmarks, but their results are difficult to be comparatively evaluated, due to the fact that, in most cases, the original splitting between training and test sets has not been used. 4 Experimental Results In this section, we present the experimental results obtained by GNNs. The experiments are divided into two parts. The first one uses the original training/test splitting, already fixed in the challenge, whereas in the second one the splitting is randomly constructed from all the dataset pages. The choice of testing the approach on a random splitting is a common procedure and it is motivated by the presence of different the data distributions in the original training and test sets. Besides, for each splitting, an inductive learning model and a mixed transductive−inductive learning approach were used. As mentioned before, in the transductive−inductive setting, the training pages were divided into two groups, in order to define the error function and to simulate the transductive inference, respectively. In our experiments, two equal−size groups were randomly defined. Finally, the performance was measured by the area under the ROC curve, the F−measure, and the accuracy. 4.1 The Random Splitting The WEBSPAM−UK2006 dataset was randomly split into training (2228 hosts), validation (1000 hosts), and test (2518 hosts) sets. The results are divided according to whether GNNs are used as an inductive or a mixed transductive−inductive learning model. Table 1 shows the performance obtained by different GNN configurations. Each row represents a different simulation, as described below: - In the first experiment, the most significant link and content−based features were chosen, using a correlation−based feature selector 6. In fact, according to a preliminary experiment, ten link−based and two content−based features were selected. The performance was the lowest compared to the other configurations. - In the second and in the third experiment, a feedforward neural network was used, in order to compress all the features into a single output (each type of features has its own output). Exploiting this idea, the performance of the model increases. - In the last configuration, we combine link and content−based features, already compressed by feedforward networks, with directly selected features (i.e., PageRank and TrustRank of the host and of its maximum PageRank page). With this experiment, we obtain the highest performance. For most of the experiments, the results obtained in the transductive−inductive learning framework are slightly better. In Table 3, we compare the results of our best GNN configuration with those produced by the other teams, proving that it gains very similar performance to the winner. Web Spam Detection Using Transductive−Inductive Graph Neural Networks 89 Table 1. Performances of different GNN configurations with random splitting Accuracy F-Measure ROC Configurations Transd.-Ind. Induc. Transd.-Ind. Induc. Transd.-Ind.. Induc. Link and content directly 89,48% 89,27% 0.7550 0.7528 0.9467 0.9446 Link based (FFNN) 90,30% 90,27% 0.7680 0.7763 0.9506 0.9516 Content based (FFNN) 90,50% 90,11% 0.7532 0.7531 0.9417 0.9387 Link and content (FFNN) 94,08% 93,96% 0.8534 0.8499 0.9681 0.9717 and directly selected 4.2 The Original Splitting The splitting adopted in the Challenge was also used for the experiments. In this case, the training set includes 8415 hosts (7472 normal, 767 spam, and 175 undecided), while the test set contains 2247 hosts (651 normal, 1346 spam, and 250 undecided). The architecture used to address this classification problem is shown in Fig. 3. Fig. 3. The configuration adopted in the original splitting. In this case, content−based and link−based features compressed into single features (by feedforward networks) are used, in addition to features directly selected, to construct the whole labels for the GNN processing. The produced output will be used in a second GNN. The output of the second GNN will be the decision on the hosts, classified as spam or normal. As in the random splitting experiments, this GNN configuration can be used as an inductive or a mixed transductive−inductive model. The obtained results are shown in Table 2. Table 2. Performance comparison with the original splitting Accuracy F-Measure ROC Configurations Transd. Induc. Transd. Induc. Transd. Induc. Link and content (FFNN) 89,53% 89,23% 0.9219 0.9215 0.9496 0.9502 and directly selected 90 A. Belahcen, M. Bianchini, and F. Scarselli Based on the experiments, we observe that the standard inductive and the transductive−inductive models are comparable in terms of performance. Nevertheless, some more experiments are worth carrying out in order to clearly establish the GNN ability in addressing the proposed problem. In the following, we compare the performance obtained in our experiments with respect to those of the other competing teams, and also evaluate training times of both the transductive−inductive and the inductive models, with respect to the random and the original splitting. 4.3 Performance Comparison The experiments, based on both original and random splitting, show that our results are comparable to the best results obtained so far on the WEBSPAM−UK2006. Table 3. Comparative results Participants F1 ROC Abou et al. (Genie Knows) 0.81 0.80 Benczur et al. (Hungarian Academy of 0.91 0.93 Sciences) Cormack (University of Waterloo) 0.67 0.96 Fetterly et al. (Microsoft) 0.79 - Filoche et al. (France Télécom) 0.88 0.93 Geng et al. (Chinese Academy of Sciences) 0.87 0.93 Random splitting 0.85 0.97 Original splitting 0.92 0.95 4.4 Training Time Comparison Even if the two learning frameworks show comparable performances, they significantly differ in terms of training time (see Fig. 4). Fig. 4. Comparison between the transductive−inductive and the inductive configurations with respect to the training time Web Spam Detection Using Transductive−Inductive Graph Neural Networks 91 Actually, the training time for the inductive model is greater than that for the transductive−inductive approach, with the random splitting, of about 20%, while, with the original splitting, it increases of 76%, which means that the transductive−inductive model is as efficient as the inductive model, but it is certainly very less expensive from the computational point of view. 5 Conclusions According to experiments conducted on the WEBSPAM−UK2006 dataset, the results obtained using the random and the original splitting can be compared, in terms of performance, to the state−of−the−art results. Besides, the experiments were realized based on both transductive−inductive and inductive frameworks, which show different training times, clearly assessing the advantages of the transductive−inductive approach from the computational point of view. References 1. Scarselli, F., Gori, M., Tsoi, A.-C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE Trans. on Neural Networks 20(1), 61–80 (2009) 2. Di Noi, L., Hagenbuchner, M., Scarselli, F., Tsoi, A.-C.: Web Spam Detection by Probability Mapping GraphSOMS and Graph Neural Networks. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010, Part II. LNCS, vol. 6353, pp. 372–381. Springer, Heidelberg (2010) 3. Gyöngyi, Z., Garcia-Molina, H.: Web spam taxonomy. Adversarial information retrieval on the Web (2005) 4. Castillo, C., Donato, D., Becchetti, L., Boldi, P., Leonardi, S., Santini, M., Vigna, S.: A reference collection for Web spam. ACM SIGIR Forum 40(2), 11–24 (2006) 5. Web spam challenge (2007), n=Main.PhaseIResults 6. Hall, M.A.: Correlation–based Feature Subset Selection for Machine Learning. Hamilton, New Zealand (1998) Hubs and Communities Identification in Dynamical Financial Networks 1 1,2 3 1 Hassan Mahmoud , Francesco Masulli ,MarinaResta , Stefano Rovetta , 1 and Amr Abdulatif 1 DIBRIS, Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi, University of Genoa, via Dodecaneso 35, 16146, Genoa, Italy 2 Center for Biotechnology, Temple University, Philadelphia, USA 3 Dipartimento di Economia, University of Genoa, via Vivaldi 5, 16126, Genoa, Italy hassan.mahmoud, francesco.masulli, marina.resta,, Abstract. In this study we aim at identifying companies influencing the performance of the stock market sector. We propose an approach for constructing the similarity between stock company profiles based on the estimates of the log return similarity of stock prices and on Fuzzy Spectral Modularity community detection method to infer the network hubs and significant communities and we applied it to the Italian stock market store. Experimental results show that companies in the same sector highly affect the price change of each other. Moreover, We notice a robust temporal stability of detected communities, and the short time correlation computed with the fuzzy rand index is strong. Keywords: Communities, Dynamical Financial networks, Correlation networks, Spectral clustering, Fuzzy clustering, Stability. 1 Introduction Community discovery is difficult task since communities are hidden behind com- plicated relationships. Moreover, it is unclear how to extract coverage in real world networks. Several attempts were devoted to characterize overlapping com- munities 14, however none of them is able to efficiently infer the hidden struc- ture of the network. In financial stocks, in order to identify groups of assets highly affecting the priceofeachothercomparedtotherestofthenetwork(i.e.,communities)several approaches have been proposed able to identify the structure of the empirical correlation matrix of asset profiles, and model their hierarchies in terms of hier- archical trees and networks 13,15,19. The process of communities identification needs to make use of statistically reliable information, often the correlation ma- trix is sensitive to several factors, such as the heterogeneity of sampling, the interaction with environment, and the non-stationariety of data sources. The Italian stock market network we will analyze in this work showed a structure evolving through time progression in which there is a possibility for  c Springer International Publishing Switzerland 2015 93 S. Bassis et al. (eds.), Recent Advances of Neural Networks Models and Applications, Smart Innovation, Systems and Technologies 37, DOI: 10.1007/978-3-319-18164-6_1094 H. Mahmoud et al. some assets observations to be missed or for others to appear and affect the stock market. Finding a suitable community detection technique in complex real world networks is an open problem due to various challenges given by clustering approaches. Among them, there are the following: initialization criteria (e.g., choosing an initial number of clusters is required in partitional clustering like K-Means 8, while it is not needed in hierarchical clustering), accuracy (e.g., a main drawback of hierarchical clustering is the possible misclassification of some nodes 11, while removing edges may result in singleton clusters in graph bisection approach), stability (e.g., results may differ depending on the specific similarity measure used and on the random initialization of cluster centers in partitional clustering). This paper is organized as follows: Sec. 2 illustrates the proposed method to identify financial communities and hubs using the Fuzzy c-means Spectral Modularity community detection method we proposed in 9; Sec. 3 applies the fuzzy Rand index for measuring the stability of the overlapping evolutionary communities obtained in each time window (30 days); Sec. 4 shows the experi- mental results obtained on the Italian Stock Market data; Finally Sec. 5 shows the conclusions. 2 Financial Communities Identification In this paper we consider correlation estimators based on the Euclidean dis- tance between the log returns daily closure prices of assets pairs. After obtaining the asset similarities, we apply the Fuzzy c-means Spectral Modularity (FSM) community detection method we proposed in 9 to infer the overlapping asset communities, and identify hub assets. Finally, we measured the strength of our proposedFSM in quantifyingthe Italian stockmarketnetworkusingfuzzy Rand index 18. The approach we propose for analyzing how assets interact in financial net- works includes the following steps (also summarised in Fig.1): 1. Normalize the data. 2. Define the initial time t and the length l of the temporal window. 0 3. Within each temporal window measure the Log–Return Similarity (LRS) of stock prices at closure between each pair of following days defined as: p (t) h r (t)=ln =lnp (t)−lnp (t−1), (1) h h h p (t−1) h where p (t) refers to closure price of asset h at time t. h 4. Calculate the pairwise Euclidean distance d between the profiles of assets hi h and i (temporal windows) for each pair of assets (excluding the cases with missing observations, and replacing undefined values with zero in each asset, hence the cardinality is always the same):   l−1   2 d = (r (t )−r (t )) . (2) hi h p i p p=0Hubs and Communities Identification in Dynamical Financial Networks 95 5. Estimate the similarity between asset profiles 16 as:   −d hk sim =exp , (3) hk s where s is the dispersion, it depends on the data distribution. We estimated s experimentally using histogram analysis instead of choosing it randomly. 6. Apply the FSM-community detection method to the asset profile similarities matrix Σ=sim to infer overlapping financial communities. hk The art of identifying nodes having more influence over the network structure than others is referred to as node centrality study. A vertex with high centrality (hub) implies that it lies on considerable fractions of shortest paths connecting vertexes. As a consequence, various centrality measures have been developed in networkanalysis: among them the most known(and used) are: centrality degree, closeness, betweenness, and modularity 2,7. Network modularity is used for measuring the strength of community struc- ture in networks. High network modularity implies the existence of dense connections within communities, and of sparse links between them. Although modularity shows a resolution limit specially in case of detecting small commu- nities, it has the advantages of not requiring prior knowledge about the number or sizes of communities, and it is capable of discovering network partitions com- posed of communities having different sizes. Moreover, degree, and closeness are local measures which limit their efficiency in case of evolving networks 10,11. The Fuzzy c-means Spectral Modularity (FSM) - community detec- tion method 9, derived by the Ng et al. 12 spectral clustering algorithm 20,5,3. The main improvement introduced in the FSM is the application of the Fuzzy c-means (FCM) algorithm 1,4 for clustering in the affine subspace ∗ spanned by the first k eigenvectors. The FCM allows an instance to belong to two or more clusters at the same time, with possibly different membership degrees. This feature supports the detection of overlappingcommunities and can allow to understand the role that each node may play in different communities. 3 Measuring Evolving Communities Stability Using Fuzzy Rand Index To measure the stability in our experiments, we used the fuzzy Rand index as defined in 6, in that paper the Rand index is viewed as distance function D =1−RI. RI a=(1−u−v)×u×v;(4) b=(1−u−v)×(1−u×v); (5) c = max((u−v),0); (6) d = max((v −u),0). (7)

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.