Women’s Forums on the Dark Web

1 Introduction

The rapid development and evolution of the Internet have enabled people to access information whenever and wherever they want. Recently, with the advent of Web 2.0, the Internet has evolved toward multimedia-rich content delivery, end-user content generation, and community-based social interactions (O'Reilly 2005). More and more web forums, blogs, wikis, and other social media have been generated and become popular. Such Web 2.0 social media help enhance information sharing, opinion generation, and community-based discussions for various emerging social and political topics.

Although it has a male-dominated history, the Internet is becoming a new medium for women to share their concerns and express opinions about personal, social, and political issues (Harcourt 2000). With such a trend, the need for women to claim the Internet as an important space of their own has emerged. Women could gain equal presence or influence with men in the virtual community. In addition, their desire for gender equality continues to influence their Internet contributions and writing. Meanwhile, the increasing availability of the Internet offers marginalized groups and individuals a voice in the public sphere (Harp and Tremayne 2006; Mitra 2004). For example, Harcourt (2000) mentions the increasing voice of local Arab women on a global level through the Internet; Mitra (2004) argues that the Internet has allowed women in South Asia to be heard by the outside world.

In many disciplines, questions concerning gender differences in the context of online communication have been raised (Halbert 2004). Online gender differences (i.e., the digital gender gap in some studies), which refer to the differences between women and men in Internet use, have been shown and studied in previous research (Fountain 2000; Fuller 2004; Harp and Tremayne 2006).
Some studies point out that women are less likely to express political opinions and tend to have a less authoritative manner in their conversation style (Ogan et al. 2005). More research is critically needed to explain online gender differences in social, political, and even business (e.g., online shopping) activities.

Text classification techniques can be used to identify and analyze online gender differences by examining the discrepancy between women's and men's writing styles. Previous studies on gender classification focus on using feature-based text classification methods to examine how women and men might use language(s) differently. Those studies have shown noticeable differences between women's and men's writing styles on different types of texts, including e-mails (Corney et al. 2002), web blogs (Nowson and Oberlander 2006; Schler et al. 2006), novels (Hota et al. 2006), and nonfiction articles (Koppel et al. 2002). In the web forum context, previous studies mainly used keyword-based analysis to examine the topic differences between males and females (Guiller and Durndell 2007; Seale et al. 2006). Few studies have investigated gender differences in web forums using feature-based text classification techniques.

In this study, we propose a feature-based gender classification framework for web forums to examine the writing styles and contents (including different types of linguistic features) of female and male posters. Our analysis encompasses gender classification on an Islamic women's political forum. We compare and examine different feature sets consisting of content-free and content-specific features. We also study feature selection using the information gain (IG) heuristic. We further analyze the different topics preferred by women and men, respectively.
The results of classification using support vector machine (SVM) indicate a high level of classification accuracy, demonstrating the efficacy of this framework of gender classification for web forums.

The remainder of this chapter is organized as follows: Sect. 2 provides a review of previous research in online gender differences and online text classification studies. Section 3 describes research gaps and questions, while Sect. 4 presents our research design. Section 5 describes the experiment used to evaluate the effectiveness of the proposed framework and discusses the results. Section 6 concludes the chapter with closing remarks and future directions.

2 Literature Review

In this section, we review previous research on online gender differences and online text classification studies.

2.1 Online Gender Differences

With the increasing availability and popularity of the Internet, as well as the advent of Web 2.0, more and more women participate in community-based social media (Consalvo and Paasonen 2002). The Internet, therefore, has become a medium for women to share their political pressure and knowledge (Harcourt 2000). They are also creating their own online networks to exchange information and opinions (Sherman 2001).

The Internet is not only useful as a fast communication medium; it is also a very crucial channel of information on women's rights issues. Women use the Internet to fight against violence by building a strong layer of support through which their personal struggles can be discussed and solutions shared (Harcourt 2000). As an example, in her study, Harcourt (2000) discusses a case regarding a Muslim woman's right of choice of marriage; she argues that "we could, within hours, receive case law on the issue from other Muslim countries as well as legal and scholarly opinions and references that prove critical in winning the case." As argued by Shade (2002; see also Seale et al.
2006), the Internet is to the third-wave feminists what independent feminist presses were to the second wave, providing a means to form female networks and resist the relatively male-dominated networks in cyberspace. In sum, women want to be treated equally with men in the virtual society and claim the Internet as an important space of their own (Halbert 2004).

Along with such a trend, researchers have an increasing interest in studying online gender differences, which refers to the fact that there are differences between women and men in Internet use (Bimber 2000). The major online gender difference noted is that fewer women than men use the Internet. For example, the A. C. Nielsen CommerceNet consortium from 1999 showed that among US and Canadian Internet users, 53% were men and 47% were women; among online shoppers, 62% were men and 38% were women; and among people who reported having used the Internet in the last 24 hours for any purpose, 68% were men and 32% were women (CommerceNet 1999). In the realm of political activity, the National Election Study (NES) data shows that visitors to Internet campaign sites during the 1998 election season were 60% male and 40% female (National Election Study 1998).

However, with the rapid development and increasing availability of the Internet, more and more women are accessing the Internet to acquire information, express their ideas, and share common concerns. The May 2008 survey by the Pew Internet and American Life Project found that 73% of men and 73% of women use the Internet (Pew Internet and American Life Project 2008). In contrast, its 2004 survey reported 66% and 61% Internet use for men and women, respectively.

Although access technology is not an issue today, women and men do have differences in Internet use depending on motivation and interest in the content being produced and consumed (Harp and Tremayne 2006). Jackson et al.
(2001) found that women are more likely to use the Internet as a communication tool, and men are more likely to use it as a means of information seeking. According to Ogan et al. (2005), women are less likely to express political opinions and tend to have a less authoritative manner in their conversation style. Meanwhile, some studies (Fuller 2004; Youngs 2004) argue that women are responsible for and belong to the private sphere of life, i.e., the domestic sphere of home, family, private relations, and sexual reproduction; on the other hand, men are best suited and responsible for the public sphere and political realm, including government and commercial establishments. Harp and Tremayne (2006) use network theory and feminist theory to study online gender differences in web blogs and offer suggestions for increasing the representation of female voices in the political blogosphere.

As to online communication on web forums, previous studies have used keyword analysis to show that women and men do have different topics that they are interested in and care about (Seale et al. 2006). Seale et al. (2006) analyzed cancer-related web forums and found that women's discussions are more likely to lean toward the exchange of emotional support, including concern with the impact of illness on a wide range of other people; however, men are more likely to participate in threads on treatment information, medical personnel, and procedures. Guiller and Durndell (2007) analyzed an online course discussion board and found that women are more likely to explicitly agree with and support others and make more personal and emotional contributions than men; on the other hand, men are more likely to use authoritative language and to respond negatively in interactions than women.
2.2 Online Text Classification

In this study, we adopt online text classification techniques to study online gender differences in web forums by examining the writing style of posted messages. Online text classification has several important characteristics, including various types of problems, features, and online texts. These are summarized in the taxonomy presented in Table 19.1. Based on the proposed taxonomy, Table 19.2 shows selected previous studies dealing with online text classification. We discuss the taxonomy and related studies in detail below.

2.2.1 Different Types of Online Text Classification Problems

With the advent of Web 2.0, more and more automatic classification studies using online text-based social media data have appeared. In those studies, the investigated classification problems mainly include authorship classification, sentiment classification, and gender classification. Unlike the classical topic-based classification problem in information retrieval, social media classification relies heavily on the information and fluid writing styles of authors in various online social media, such as e-mail, blog, and forum.

Authorship classification: Authorship classification aims at determining which author produced which piece of writing by examining the styles and contents of writings produced by different authors. Previous studies have applied authorship classification to various online social media texts. De Vel and his collaborators (2000) have conducted a series of experiments to identify the authors of e-mails. They apply conventional text classification methods to authorship classification on online texts. Some more recent studies (Abbasi and Chen 2005; Li et al. 2006; Zheng et al. 2006) construct frameworks for authorship classification on online texts, with an emphasis on comparing different types of features and classification techniques.
Table 19.1 A taxonomy of online social media text classification

Problems:
- Authorship classification (P1): determines which author produced which piece of writing by examining all the writings by all the authors in the collection.
- Sentiment classification (P2): determines whether a text is objective or subjective, or whether a subjective text contains positive or negative sentiments.
- Gender classification (P3): determines whether a piece of writing was produced by a female or male by examining the writing styles and contents.

Features:
- Lexical features (F1): character- or word-based statistical measures of lexical variation.
- Syntactic features (F2): function words; punctuation; part-of-speech (POS) tags.
- Structural features (F3): features related to text organization and layout; technical features such as various file extensions, font sizes, and font colors.
- Content-specific features (F4): important keywords and phrases (e.g., n-grams) on certain topics.

Online text types:
- E-mail (T1): using e-mails as dataset.
- Web blog (T2): using web blogs as dataset.
- Web forum (T3): using web forum messages as dataset.
- Online review (T4): online reviews of products, movies, music, etc.
- Online news (T5): online news articles and news web pages.

Table 19.2 Selected previous studies in online social media text classification, mapping each study to the problems (P1–P3), features (F1–F4), and online text types (T1–T5) it covers: de Vel 2000; de Vel et al. 2001; Abbasi and Chen 2005; Zheng et al. 2006; Li et al. 2006; Abbasi and Chen 2008; Subasic and Huettner 2001; Pang et al. 2002; Turney 2002; Dave et al. 2003; Hu and Liu 2004; Gamon 2004; Wiebe et al. 2004; Grefenstette et al. 2004; Mishne 2005; Abbasi et al. 2008a; Corney et al. 2002; Schler et al. 2006; Nowson and Oberlander 2006.

A recent comprehensive study conducted by Abbasi and Chen (2008a) tested their newly developed Writeprints technique with a rich set of features on various online datasets, including e-mails, instant messages, feedback comments, and program codes. The accuracy of their algorithm reaches as high as 94% when differentiating between 100 authors.

Sentiment classification: Sentiment classification for online texts aims to analyze direction-based texts (i.e., texts containing opinions and emotions) to determine whether a text is objective or subjective, or whether a subjective text contains positive or negative sentiments. The common two-class sentiment classification problem involves classifying sentiments as positive or negative (Pang et al. 2002; Turney 2002). However, additional variations include classifying sentiments as opinionated/subjective or factual/objective (Wiebe et al. 2001; Wiebe et al. 2004). Instead of sentiments, some other studies attempt to classify emotions, including happiness, sadness, anger, horror, etc. (Grefenstette et al. 2004; Mishne 2005; Subasic and Huettner 2001). As to web forums, Abbasi et al. (2008b) develop a system for sentiment classification with a new feature selection algorithm and test its performance on both English and Arabic web forums.

Gender classification: Gender classification aims to determine whether a piece of writing was produced by a female or male by examining the writing styles and contents of female and male authors. Previous gender classification studies using automatic text classification techniques have been done on both traditional articles (e.g., novels and nonfiction articles) and online social media texts (e.g., e-mails and web blogs). As an example of gender classification on traditional articles, Koppel et al. (2002) use the exponential gradient (EG) algorithm to classify genders for both fiction and nonfiction documents.
By using a feature set combining function words and part-of-speech (POS) tags, they achieve 79.5% accuracy for fiction documents and 82.6% accuracy for nonfiction documents. After feature selection, the accuracy increased to 98% for both fiction and nonfiction documents. Another study conducted by Hota et al. (2006) classifies the gender of Shakespeare's characters based on a collection of his plays. They achieve the highest accuracy of 74.28% using support vector machine (SVM) on a feature set consisting of both content-independent and content-based features. Argamon and his collaborators (2003a) analyze writing styles and identify a set of lexical and syntactic features that differ significantly according to author gender in both fiction and nonfiction documents. In particular, they find that although the total number of nominals used by female and male authors is virtually identical, females use many more pronouns, and males use many more noun specifiers.

For online social media text, most previous gender classification studies focus on e-mails (Corney et al. 2002) and web blogs (Nowson and Oberlander 2006; Schler et al. 2006). Corney et al. (2002) use SVM to classify genders for e-mails and achieve the highest F-measure of 71.1% using the combination of lexical features, structural features, and selected gender-specific features. Nowson and Oberlander (2006) use SVM to classify genders for web blogs and achieve the highest accuracy of 91.5% using the combination of part-of-speech (POS) tags, bigrams, and trigrams as the features. Schler et al. (2006) also conduct gender classification on web blogs and emphasize the significant differences in writing styles and contents between female and male bloggers as well as among authors of different ages.

2.2.2 Features for Online Social Media Text Classification

Features are very important for online social media text classification. Good feature sets can improve the performance of the classifier.
There are four types of features that are often used in previous online social media text classification studies: lexical features, syntactic features, structural features, and content-specific features. Among them, the first three types are content-free features; the fourth type contains features related to specific topics.

Lexical features: Lexical features are character- or word-based statistical measures of lexical variation. Lexical features mainly include character-based lexical features (Argamon et al. 2003b; Gamon 2004; Yule 1938), vocabulary richness measures (Yule 1944), and word-based lexical features (de Vel 2001; Mishne 2005; Zheng et al. 2006). Examples of character-based lexical features are the total number of characters, the number of characters per sentence, the number of characters per word, and the usage frequency of individual letters. Examples of vocabulary richness measures include the number of words that occur once (hapax legomena) and twice (hapax dislegomena) and some other statistical measures defined by Yule (1944). Examples of word-based lexical features are the total number of words, the number of words per sentence, and word-length distribution.

Syntactic features: Syntactic features indicate the patterns used to form sentences. Commonly used syntactic features include function words (Koppel et al. 2002; Koppel et al. 2006; Mosteller 1964), punctuation (Baayen et al. 2002), and part-of-speech (POS) tags (Argamon et al. 1988; Baayen et al. 2002; Gamon 2004; Nowson and Oberlander 2006). These studies also demonstrate that syntactic features may be more reliable compared with lexical features. To study the writing style differences between females and males, Argamon and his collaborators (2003a) use over 1,000 features, including 467 function words and a set of POS tags.
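As an illustration of the lexical measures described above, the following sketch (ours, not the chapter's actual implementation) computes a few character-based, vocabulary richness, and word-based features for a single message; the simple regex tokenizer is an assumption:

```python
import re
from collections import Counter

def lexical_features(text: str) -> dict:
    """Compute a small sample of the lexical features (F1) described above."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    counts = Counter(words)
    total_chars = len(text)
    total_words = len(words)
    return {
        # character-based lexical features
        "total_chars": total_chars,
        "chars_per_word": total_chars / total_words if total_words else 0.0,
        # vocabulary richness measures
        "hapax_legomena": sum(1 for c in counts.values() if c == 1),
        "hapax_dislegomena": sum(1 for c in counts.values() if c == 2),
        # word-based lexical features
        "total_words": total_words,
        "words_per_sentence": total_words / len(sentences) if sentences else 0.0,
    }

feats = lexical_features("The forum helps. The forum grows daily!")
```

In a real pipeline, each forum message would be mapped to such a vector before classification.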
In their studies on authorship classification for online texts, Abbasi and Chen (2008) use up to 300 function words, 8 punctuation marks, and almost 2,300 POS tags as syntactic features.

Structural features: Structural features show the text organization and layout. They are especially useful for online social media texts (de Vel et al. 2001). Traditional structural features include greetings, signatures, the number of paragraphs, and the average paragraph length (de Vel et al. 2001; Zheng et al. 2006). Other structural features include technical features such as the use of various file extensions, font sizes, and font colors (Abbasi and Chen 2005). For example, Zheng et al. (2006) use 14 different structural features in their authorship classification study on both English and Chinese online messages. Abbasi and Chen (2008) adopt 64 structural features in their authorship classification on various types of online texts.

Content-specific features: Different from the content-free features (i.e., lexical features, syntactic features, and structural features), content-specific features are comprised of important keywords and phrases on certain topics (Martindale and McKenzie 1995; Zheng et al. 2006), such as word n-grams (Abbasi and Chen 2005; Abbasi and Chen 2008; Diederich et al. 2003; Nowson and Oberlander 2006). Usually, these features can express personal interest in a specific domain. For example, content-specific features in a discussion of computers may include "laptop" and "notebook." Previous studies have shown that content-specific features can improve the performance of online text classification (Abbasi and Chen 2005; Abbasi et al. 2008a; Schler et al. 2006; Zheng et al. 2006).
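Syntactic features of the kind discussed above can be sketched as frequency counts over a fixed inventory. The function-word list below is a tiny illustrative subset (the studies cited use far larger inventories, e.g., 150 function words in Zheng et al. 2006):

```python
import re
from collections import Counter

# Illustrative subset only; real studies use much larger function-word lists.
FUNCTION_WORDS = ["the", "of", "and", "a", "to", "in", "that", "it"]
PUNCTUATION = list(".,!?;:'\"")

def syntactic_features(text: str) -> dict:
    """Frequencies of function words (normalized by word count) and raw
    punctuation counts, a small sample of the syntactic features (F2)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    total = len(words) or 1
    word_counts = Counter(words)
    feats = {f"fw_{w}": word_counts[w] / total for w in FUNCTION_WORDS}
    char_counts = Counter(text)
    feats.update({f"punct_{p}": char_counts[p] for p in PUNCTUATION})
    return feats

feats = syntactic_features("It is the best of forums, and the worst of forums.")
```

POS-tag features would require a tagger and are omitted from this sketch.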
2.2.3 Different Types of Online Social Media Texts

In the taxonomy shown in Table 19.1, we summarize the major types of texts used in previous online text classification studies as e-mail, web blog, web forum, online review, and online news. Some previous studies use e-mails to form their datasets; others use web blogs, web forum messages, online reviews, or online news. In the taxonomy, the online review category includes the online reviews of products, movies, music, etc.; the online news category consists of online news articles and news web pages.

Some general conclusions can be drawn from Table 19.2 and the literature review. Most previous online text classification studies focus on authorship classification and sentiment classification; relatively less effort has been put into gender classification. For authorship classification, to improve the classification performance, the most recent studies (Abbasi and Chen 2005; Abbasi and Chen 2008; Zheng et al. 2006) have incorporated all four types of features, i.e., lexical features, syntactic features, structural features, and content-specific features. For sentiment classification, early studies (Dave et al. 2003; Hu and Liu 2004; Subasic and Huettner 2001; Turney 2002) often use one type of feature. Some later studies (Gamon 2004; Mishne 2005; Wiebe et al. 2004) add other types of features to improve the classification performance. Abbasi et al. (2008b) conduct sentiment classification using all four types of features. Previous gender classification studies also include different types of features; however, we have not seen a study using all four types of features.

According to different types of online texts, previous authorship classification studies mainly use e-mails (de Vel 2000), web blogs (Li et al. 2006), and web forums (Abbasi and Chen 2005; Abbasi and Chen 2008; Zheng et al. 2006); few have used online reviews or online news.
In contrast, previous sentiment classification studies mainly use web blogs (Mishne 2005), web forums (Abbasi et al. 2008a; Grefenstette et al. 2004), online reviews (Gamon 2004; Hu and Liu 2004; Pang et al. 2002; Turney 2002), and online news (Subasic and Huettner 2001; Wiebe et al. 2004); few have used e-mails. The texts used in previous gender classification studies are relatively limited, mainly e-mails (Corney et al. 2002) and web blogs (Nowson and Oberlander 2006; Schler et al. 2006). Few studies have used web forums, online reviews, or online news.

3 Research Gaps and Questions

Our review of previous literature and our conclusions point to several notable research gaps. First, few studies have investigated online gender differences in the context of web forums using feature-based gender classification techniques. Previous studies have shown the existence and evolution of online gender differences and the importance of gender roles in political movements on the web. However, in analyzing gender differences in web forums, most studies used basic keyword-based analysis. Second, to the best of our knowledge, no previous study has used both content-free features (i.e., lexical, syntactic, and structural features) and content-specific features to conduct automatic gender classification for web forums. Therefore, we raise the following research questions:

1. Can gender classification techniques be used to identify and analyze online gender differences in web forums?
2. Will the use of both content-free features (i.e., lexical, syntactic, and structural features) and content-specific features improve gender classification performance for web forums compared to using only the content-free features?
3. For relatively large feature sets, will feature selection that returns a smaller number of the most important features improve the gender classification performance for web forums?
4 Research Design

In order to address these questions, we develop a framework of feature-based gender classification on web forums. The framework includes several essential components, which we will describe in detail: web forum message acquisition, feature generation, and classification and evaluation.

4.1 Web Forum Message Acquisition

This component consists of two steps: forum message collecting and forum message parsing. First, spidering programs are developed to collect all the messages in a given open source web forum as HTML pages. After that, we build parsers to parse out the message information from the raw HTML pages and store the parsed data in a relational database.

4.2 Feature Generation

In this component, we generate different feature sets containing different types of features. By doing this, we can compare and evaluate the performance of different feature sets in gender classification for web forums in order to answer our research questions 2 and 3. There are several steps in this component: feature extraction, unigram/bigram preselection, and feature selection. Each of these steps leads to the generation of different feature sets.

Feature extraction: Different types of features are extracted based on all messages collected from a given open source web forum. In this study, we extract the lexical features (denoted by F1), syntactic features (denoted by F2), and structural features (denoted by F3) as content-free features, and unigrams (denoted by F4(unigram)) and bigrams (denoted by F4(bigram)) as content-specific features. As we described in Sect. 2.2.2, lexical features (i.e., F1) mainly include character-based lexical features, vocabulary richness measures, and word-based lexical features.
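The two-step acquisition process in Sect. 4.1 can be sketched as follows. This is a minimal illustration only: the `post`/`author`/`body` markup is hypothetical, since real spiders and parsers must be tailored to each forum's HTML layout:

```python
import re
import sqlite3

# Hypothetical markup; a real parser is forum-specific.
MESSAGE_PATTERN = re.compile(
    r'<div class="post">\s*<span class="author">(?P<author>.*?)</span>'
    r'\s*<div class="body">(?P<body>.*?)</div>\s*</div>',
    re.DOTALL,
)

def parse_messages(html: str):
    """Parse (author, body) pairs out of a collected forum HTML page."""
    return [(m["author"], m["body"].strip()) for m in MESSAGE_PATTERN.finditer(html)]

def store_messages(db: sqlite3.Connection, messages):
    """Store parsed messages in a relational table, as in Sect. 4.1."""
    db.execute("CREATE TABLE IF NOT EXISTS messages (author TEXT, body TEXT)")
    db.executemany("INSERT INTO messages VALUES (?, ?)", messages)
    db.commit()

page = ('<div class="post"><span class="author">amina</span>'
        '<div class="body">Salaam, sisters.</div></div>')
db = sqlite3.connect(":memory:")
store_messages(db, parse_messages(page))
```

In practice, the spidering step would fetch the pages that feed `parse_messages`, and a robust HTML parser would replace the regular expression.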
In this study, we adopt the character-based lexical features used in de Vel (2000), Forsyth and Holmes (1996), and Ledger and Merriam (1994); the vocabulary richness features used in Tweedie and Baayen (1998); and the word-length frequency features used in Mendenhall (1887) and de Vel et al. (2001). In total, we use 87 lexical features.

Syntactic features (i.e., F2) are important because they can indicate people's different habits of organizing sentences (Zheng et al. 2006). Function words and punctuation are often used as syntactic features. Different sets of function words, ranging from 12 to 122, have been tested in various studies (Baayen et al. 1996; Burrows 1989; de Vel et al. 2001; Holmes and Forsyth 1995; Tweedie and Baayen 1998). However, there is no generally accepted good set of function words for different applications. In this study, we adopt the large set of 150 function words used in Zheng et al. (2006), since that study also focuses on web forum messages, although it addresses authorship classification instead of gender classification. In addition to function words, we adopt the eight punctuation marks suggested by Baayen et al. (1996). Therefore, we use 158 syntactic features in total.

Structural features (i.e., F3) represent the layout of writing. De Vel (2000) introduces several structural features specifically for e-mails. Zheng et al. (2006) use 14 structural features in their authorship classification study for web forums. In a series of online text classification studies, Abbasi and his collaborators (Abbasi and Chen 2005; Abbasi and Chen 2008; Abbasi et al. 2008a) use 62 structural features related to both word structures (e.g., the number of paragraphs, the number of sentences per paragraph, etc.) and technical structures (e.g., font colors, font sizes, use of images, etc.).
Most of those 62 features concern technical structures, including 29 different font colors, 8 different font sizes, 4 types of image displays, and 7 types of hyperlinks. In this study, we adopt some of the structural features used in previous research. We choose five of the most common features that can be applied to a broad range of general web forums. All of them are related to word structures. Specifically, the five structural features are the total number of sentences in a message, the total number of paragraphs in a message, the number of sentences per paragraph in a message, the number of characters per paragraph in a message, and the number of words per paragraph in a message. We do not use structural features related to technical structures (e.g., font colors and font sizes) since some web forums may not have the related characteristics. For example, some popular (but old) web forums do not offer functions for users to change font colors and font sizes.

Unigram/bigram preselection: Although content-free features are important for online text classification, content-specific features that consist of important keywords and phrases on certain topics can be more meaningful, thus leading to relatively high representative ability. Content-specific features used in previous online text classification studies are either a relatively small number of manually selected, domain-specific keywords (Li et al. 2006; Zheng et al. 2006) or a relatively large number of n-grams automatically learned from the textual data collection (Abbasi and Chen 2005; Abbasi et al. 2008a; Abbasi et al. 2008b; Peng 2003; Schler et al. 2006). The large potential feature spaces of n-grams have been shown to be effective for online text classification (Abbasi and Chen 2008). Therefore, in this study, we use n-grams as content-specific features. Specifically, we use unigrams (i.e., F4(unigram)) and bigrams (i.e., F4(bigram)).
The unigrams and bigrams are extracted from all messages in the web forum. After removing the stopwords, we keep the unigrams and bigrams that appear more than ten times in the whole forum as our content-specific features.

By conducting feature extraction and unigram/bigram preselection, we obtain five types of features. Based on those different types of features, we build three feature sets in an incremental way: (1) feature set F1 + F2 + F3, which includes lexical features, syntactic features, and structural features; (2) feature set F1 + F2 + F3 + F4(unigram), which consists of lexical features, syntactic features, structural features, and unigrams; and (3) feature set F1 + F2 + F3 + F4(unigram) + F4(bigram), which is composed of lexical features, syntactic features, structural features, unigrams, and bigrams. This incremental order represents the evolutionary sequence of features used for online text classification (Abbasi and Chen 2008; Zheng et al. 2006). Studies (Abbasi and Chen 2008; Zheng et al. 2006) have shown that lexical and syntactic features are the foundation for structural and content-specific features. In this study, we use feature set F1 + F2 + F3, which contains only content-free features, as the baseline feature set to assess the performance of the other two proposed feature sets: feature set F1 + F2 + F3 + F4(unigram) and feature set F1 + F2 + F3 + F4(unigram) + F4(bigram), each of which also incorporates content-specific features.

The baseline feature set (i.e., feature set F1 + F2 + F3) contains 250 features: 87 lexical features, 158 syntactic features, and 5 structural features, as described before. By adding unigrams and unigrams plus bigrams as content-specific features, respectively, feature sets F1 + F2 + F3 + F4(unigram) and F1 + F2 + F3 + F4(unigram) + F4(bigram) have much larger numbers of features than the baseline feature set in general.
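The unigram/bigram preselection step above can be sketched as follows. The frequency threshold follows the text (keep n-grams occurring more than ten times forum-wide); the stopword list and tokenizer are our simplifications:

```python
import re
from collections import Counter

# Illustrative stopword list; real pipelines use a standard list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def preselect_ngrams(messages, min_count=10):
    """Keep unigrams and bigrams occurring more than `min_count` times
    across the whole forum, after stopword removal (Sect. 4.2)."""
    counts = Counter()
    for msg in messages:
        tokens = [t for t in re.findall(r"[a-z']+", msg.lower())
                  if t not in STOPWORDS]
        counts.update(tokens)                   # unigrams
        counts.update(zip(tokens, tokens[1:]))  # bigrams as token pairs
    return {ng for ng, c in counts.items() if c > min_count}

# Toy usage with a lowered threshold so the selection is visible:
forum = ["women share opinions", "women share opinions daily"]
selected = preselect_ngrams(forum, min_count=1)
```

On a real forum collection, the default `min_count=10` would be used, matching the preselection rule in the text.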
The exact numbers depend on the number of unigrams and bigrams collected from the text collection.

Feature selection: When the number of features is large, feature selection may improve classification performance by selecting an optimal subset of features (Guo and Nixon 2009). Previous classification studies using n-gram features usually include some form of feature selection in order to extract the most important words or phrases (Koppel and Schler 2003). In this study, we use the information gain (IG) heuristic to conduct feature selection because of its reported effectiveness in previous online text classification research (Abbasi and Chen 2008; Koppel and Schler 2003), thus building two selected feature sets: selected feature set F1 + F2 + F3 + F4(unigram) and selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram).

As defined in the following formula, information gain IG(C, A) measures the decrease in entropy of a class C when a feature A is provided (Quinlan 1986; Shannon 1948). The amount of the decrease reflects the additional information gained by adding feature A. In the formula, H(C) and H(C|A) represent the entropies of class C before and after observing feature A, respectively. The information gain for each feature varies over the range 0–1, with higher values indicating more information gained by providing the feature. All features with an information gain greater than 0.0025 (i.e., IG(C, A) > 0.0025) are selected. The use of such a threshold is consistent with prior work using IG for text feature selection (Abbasi et al. 2008a; Yang and Pedersen 1997):

IG(C, A) = H(C) − H(C|A),

where

H(C) = −∑_{c∈C} p(c) log₂ p(c),
H(C|A) = −∑_{a∈A} p(a) ∑_{c∈C} p(c|a) log₂ p(c|a).

4.3 Classification and Evaluation

In this study, we build five different feature sets: feature set F1 + F2 + F3, feature set F1 + F2 + F3 + F4(unigram), feature set F1 + F2 + F3 + F4(unigram) + F4(bigram), selected feature set F1 + F2 + F3 + F4(unigram), and selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram). We aim to study and compare the performance of these feature sets in order to identify the best one for web forum gender classification. Because of its frequently reported best performance in many previous online text classification studies (Abbasi and Chen 2008; Abbasi et al. 2008a; Li et al. 2006; Zheng et al. 2006), we choose SVM as the classifier.

To assess the performance of each feature set on gender classification for web forums, we adopt the standard classification performance metrics: accuracy, precision, recall, and F-measure. These metrics have been widely used in information retrieval and text classification studies (Abbasi and Chen 2008; Abbasi et al. 2008a; Li et al. 2008). In particular, accuracy measures the overall classification correctness, while precision, recall, and F-measure evaluate the correctness of each class:

Accuracy = (number of all correctly classified web forum messages) / (total number of web forum messages),

Precision(i) = (number of correctly classified web forum messages for class i) / (total number of web forum messages classified as class i),

Recall(i) = (number of correctly classified web forum messages for class i) / (total number of web forum messages in class i),

F-Measure(i) = (2 × Precision(i) × Recall(i)) / (Precision(i) + Recall(i)),

where i = 1, 2, with classes 1 and 2 being web forum messages written by female and male authors, respectively.

5 Experimental Study

To assess the effectiveness of the proposed research design, we conduct an experiment on a large and long-standing international Islamic women’s political forum.
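Before turning to the experiment, the information-gain criterion defined in Sect. 4.2 can be restated in code. This pure-Python sketch for a discrete feature A is an illustrative reimplementation, not the Weka attribute evaluator actually used in the study.

```python
import math
from collections import Counter

def entropy(labels):
    """H(C) = -sum over classes c of p(c) * log2 p(c)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(labels, feature_values):
    """IG(C, A) = H(C) - H(C|A) for one discrete feature A."""
    n = len(labels)
    conditional = 0.0
    for a in set(feature_values):
        # p(a) * H(C | A = a), summed over the feature's observed values
        subset = [c for c, v in zip(labels, feature_values) if v == a]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional
```

A feature would then be retained whenever its information gain exceeds the 0.0025 threshold used above.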
In the following, we detail the test bed, hypotheses, experimental results, and discussion.

5.1 Test Bed

We conduct our experiment on a large, international Islamic women’s political forum to evaluate our proposed framework of gender classification for web forums. We choose it for three reasons. First, it is a large, long-standing (about 4 years) international political forum and thus can be used to study the international cyber political movement. Second, it has self-reported gender information for each registered member, providing the gold standard for evaluating the performance of our automatic gender classifiers. Third, since it is a women’s forum, more females may participate, thus providing a larger number of messages written by female authors compared with other general, male-dominated web forums. We believe the international, political, and female-oriented nature of this large, active forum makes it an ideal test bed for our research.

We collected and parsed all the messages in the forum posted up to March 2007. In total, we gathered 34,695 different messages in 4,352 unique threads. The numbers of messages written by females and males are quite balanced: 17,785 messages were written by females and 16,572 by males, while an additional 338 messages have no gender information. The time span of the collected messages is from June 9, 2004 to March 13, 2007. Based on careful discussion with our political science collaborator, who has significant experience with such women’s political forums, we believe that this test bed is of high quality and has credible participant-specified gender information.

To test the performance of our classifiers, we randomly selected 100 authors, 50 females and 50 males. In total, there are 12,690 messages posted by those 100 authors. On average, each female participant produced 142.26 messages, and each male participant wrote 111.54 messages.
5.2 Hypotheses

Drawing on the online social media classification literature reviewed above, we posit that (1) adding content-specific features to the baseline content-free features can improve the performance of gender classification for web forums, and (2) conducting feature selection on a relatively large number of features can further improve that performance.

5.3 Experimental Results

We built five different feature sets: F1 + F2 + F3, F1 + F2 + F3 + F4(unigram), F1 + F2 + F3 + F4(unigram) + F4(bigram), selected F1 + F2 + F3 + F4(unigram), and selected F1 + F2 + F3 + F4(unigram) + F4(bigram). As described before, feature set F1 + F2 + F3 contains 250 content-free features. For the content-specific features (i.e., unigrams and bigrams), we selected the ones appearing more than ten times among all the messages in the forum. In total, we obtained 6,012 unigrams and 4,022 bigrams. Therefore, there are 6,262 and 10,284 features in feature sets F1 + F2 + F3 + F4(unigram) and F1 + F2 + F3 + F4(unigram) + F4(bigram), respectively. After conducting feature selection using the information gain heuristic, the two selected feature sets F1 + F2 + F3 + F4(unigram) and F1 + F2 + F3 + F4(unigram) + F4(bigram) consist of 351 and 640 features, respectively. The feature selection is carried out with Weka’s information gain attribute evaluator (Witten and Frank 2005). The two selected feature sets are much smaller than the corresponding ones without feature selection. Table 19.3 lists the number of features in each feature set.

The classification is carried out using a linear kernel with the sequential minimal optimization (SMO) algorithm (Platt 1999) included in the Weka data mining package (Witten and Frank 2005). Testing is done via tenfold cross-validation. Table 19.4 shows the precision, recall, and F-measure of gender classification on each feature set.
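The per-class metrics reported in Table 19.4 follow the definitions in Sect. 4.3 and can be computed directly from true and predicted labels. This is a generic sketch, not the Weka evaluation code used in the study.

```python
def classification_metrics(y_true, y_pred, classes=("female", "male")):
    """Accuracy plus per-class precision, recall, and F-measure."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    per_class = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        predicted = sum(1 for _, p in pairs if p == c)  # classified as c
        actual = sum(1 for t, _ in pairs if t == c)     # truly in class c
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        per_class[c] = {
            "precision": precision,
            "recall": recall,
            "f_measure": 2 * precision * recall / denom if denom else 0.0,
        }
    return accuracy, per_class
```

Under cross-validation, these metrics would be computed on the pooled held-out predictions of all ten folds.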
All three measures increase in the same way as the accuracy. The highest precision, recall, and F-measure for both classes (i.e., female and male) are achieved on the selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram).

Table 19.3 The number of features in each feature set

  Feature set                                        Number of features
  F1 + F2 + F3                                       250
  F1 + F2 + F3 + F4(unigram)                         6,262
  F1 + F2 + F3 + F4(unigram) + F4(bigram)            10,284
  Selected F1 + F2 + F3 + F4(unigram)                351
  Selected F1 + F2 + F3 + F4(unigram) + F4(bigram)   640

Table 19.4 Performance measures using different feature sets

  Feature set                               Class     Precision   Recall    F-measure
  F1 + F2 + F3                              Female    57.10%      72.00%    63.69%
                                            Male      62.20%      46.00%    52.89%
                                            Average   59.70%      59.00%    59.35%
  F1 + F2 + F3 + F4(unigram)                Female    63.00%      58.00%    60.40%
                                            Male      61.10%      66.00%    63.46%
                                            Average   62.10%      62.00%    62.05%
  F1 + F2 + F3 + F4(unigram) + F4(bigram)   Female    62.50%      70.00%    66.04%
                                            Male      65.90%      58.00%    61.70%
                                            Average   64.20%      64.00%    64.10%
  Selected F1 + F2 + F3 + F4(unigram)       Female    90.20%      74.00%    81.30%
                                            Male      78.00%      92.00%    84.42%
                                            Average   84.10%      83.00%    83.55%
  Selected F1 + F2 + F3 + F4(unigram)       Female    92.90%      78.00%    84.80%
  + F4(bigram)                              Male      81.00%      94.00%    87.02%
                                            Average   86.90%      86.00%    86.45%

5.4 Different Topics of Interest: Females and Males

In our experimental study, we achieve the highest classification accuracy of 86% on the selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram). This indicates that gender differences do exist in web forums and that the features used for classification, especially the content-specific features, have high discriminating capability to distinguish the online gender differences between female and male posters. By investigating the features in the selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram), we observe that females talk more about family members, God, peace, marriage, and goodwill; males, on the other hand, talk more about extremism, holy men, and belief.
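Whether such keyword preferences are statistically reliable can be checked with a Pearson chi-square test over a 2×2 gender-by-keyword contingency table. The tabulation below (messages with vs. without the keyword, per gender) is an assumption for illustration; the chapter does not specify how its counts were compiled.

```python
def chi_square_2x2(f_with, f_without, m_with, m_without):
    """Pearson chi-square statistic for a 2x2 gender-by-keyword table."""
    table = [[f_with, f_without], [m_with, m_without]]
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / n  # under independence
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2
```

With 1 degree of freedom, the statistic is compared against the chi-square distribution to obtain the p values reported below.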
Table 19.5 lists examples of the unigrams and bigrams preferred by females and by males, respectively, from the selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram). They are among the features with the highest information gain values and therefore show high discriminatory power. We conduct chi-square (χ²) tests to examine the statistical significance of the differences between females and males in the use of those unigrams and bigrams. A domain expert from an Islamic country provided the meanings of some of those unigrams and bigrams.

As summarized in Table 19.5, all the listed female-preferred unigrams and bigrams are statistically significant. Specifically, significant terms/words in female conversations include sis (i.e., sisters by Islamic usage), sister, mother, husband, flower, amen, alhamdulillah (i.e., thank God), inshaallaah (i.e., in God’s will), ahhah kheir (i.e., God is good), and sexually defiled. Male-preferred unigrams and bigrams are statistically significant, except for “original Arabic.” Specifically, significant terms/words in male discussions include Salafi (i.e., an extremist sect of Islam), Allah (i.e., God of Muslims), army, deviant, ijtihaad (i.e., inferring or interpreting Islamic laws), email, great scholar, Muslim intellectual, and imam Nawawi (i.e., Priest Nawawi). For the bigram “original Arabic,” although men use it more frequently than women, the difference is not statistically significant (p = 0.0606 > 0.05). This may be because the total number of its appearances in the whole forum is small and therefore cannot reach statistical significance.

Table 19.5 Examples of female- and male-preferred unigrams and bigrams from the selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram)

  Keyword               χ² value   P value   Result          Meaning
  Female-preferred unigrams and bigrams
  Sis                   456.07     0.0001    Supported       Sisters by Islamic usage
  Sister                165.08     0.0001    Supported
  Mother                123.88     0.0001    Supported
  Husband               51.87      0.0001    Supported
  Flower                9.00       0.0030    Supported
  Amen                  166.64     0.0001    Supported
  Alhamdulillah         283.85     0.0001    Supported       Thank God
  Inshaallaah           33.51      0.0001    Supported       In God’s will
  Ahhah kheir           15.16      0.0001    Supported       God is good
  Sexually defiled      5.25       0.0220    Supported
  Male-preferred unigrams and bigrams
  Salafi                377.17     0.0001    Supported       Extremist sect of Islam
  Allah                 290.30     0.0001    Supported       God of Muslims
  Army                  66.12      0.0001    Supported
  Deviant               35.79      0.0001    Supported
  Ijtihaad              57.80      0.0001    Supported       Inferring or interpreting Islamic laws
  Email                 23.81      0.0001    Supported
  Great scholar         13.89      0.0002    Supported
  Muslim intellectual   11.27      0.0008    Supported
  Imam Nawawi           26.56      0.0001    Supported       Priest Nawawi
  Original Arabic       3.52       0.0606    Not supported
  Note: Significance levels α = 0.05 and α = 0.01

The results of our experimental study show the importance of content-specific features in gender classification for web forums, which is consistent with previous gender classification studies for web blogs. For example, Nowson and Oberlander (2006) conduct a linguistic analysis of gender and personality differences in web blogs. By comparing different types of features, they find that a relatively small number of n-gram features have the best discriminating power for automatic gender detection. Schler et al. (2006) analyze a corpus of web blogs from blogger.com and find that female and male bloggers differ significantly in topics of interest. The topics significantly preferred by female bloggers include shopping, mom, cry, kiss, husband, etc.; male bloggers are more interested in topics such as Linux, Microsoft, nations, democracy, and economics.

As an important type of social media, political web forums have become a major communication channel for people to discuss and debate political, cultural, and social issues. More and more women are using this medium to share their political concerns and knowledge.
Along with this trend, researchers have developed an increased interest in studying online gender differences. By analyzing writing styles and topics of interest, our experimental results indicate that female and male participants in political web forums do have significantly different topics of interest.

6 Conclusions and Future Directions

With the rapid development and the increasing importance of the Internet in people’s daily life and work, understanding online gender differences and why they occur is becoming more and more important for both Internet service providers and users. Nowadays, more and more women are participating in cyberspace. However, this does not diminish online gender differences. Instead, discrepancies in motivation and interest in Internet use between females and males are becoming the focus of online gender difference research. In this study, we use feature-based online social media text classification techniques to investigate the online gender differences between female and male participants in web forums by examining writing styles and topics of interest. The feature-based gender classification framework we developed can be applied to other web forums.

In the framework, we examine different types of features that have been widely used in previous online text classification studies, including lexical features, syntactic features, structural features, and content-specific features. For the content-specific features, we use unigrams and bigrams automatically extracted from the whole forum instead of manually selected keywords. We build five different feature sets by adding content-specific features to the basic content-free features and conducting feature selection. According to our experimental study on a large Islamic women’s political forum, the feature sets combining both content-specific and content-free features perform significantly better than the ones consisting of only content-free features.
In addition, feature selection on large feature sets improves the classification performance significantly. The results also indicate the existence of online gender differences in web forums. The best gender classification performance is achieved using the selected feature set F1 + F2 + F3 + F4(unigram) + F4(bigram). Through further investigation of this selected feature set, we identify different topics of interest between females and males. For example, females prefer talking about family members, God, peace, and marriage; males like to talk more about extremism, holy men, and belief.

Acknowledgments This material is based upon work supported by the National Science Foundation under Grant No. CNS-0709338, “(CRI: CRD) Developing a Dark Web Collection and Infrastructure for Computational and Social Sciences.” We would also like to thank Dr. Katharina von Knop for her helpful suggestions and comments about our research test bed.

References

Abbasi, A. and H. Chen, “Applying authorship analysis to extremist-group Web forum messages,” IEEE Intelligent Systems, vol. 20, no. 5 (Special issue on artificial intelligence for national and homeland security), 2005, pp. 67–75.
Abbasi, A. and H. Chen, “Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace,” ACM Transactions on Information Systems, vol. 26, no. 2, 2008, pp. 1–29.
Abbasi, A., H. Chen, and J.F. Nunamaker, “Stylometric identification in electronic markets: scalability and robustness,” Journal of Management Information Systems, vol. 25, no. 1, 2008b, pp. 49–78.
Abbasi, A., H. Chen, and A. Salem, “Sentiment analysis in multiple languages: feature selection for opinion classification in Web forums,” ACM Transactions on Information Systems, vol. 26, no. 3, 2008a, pp. 1–34.
Argamon, S., M. Koppel, and G.
Avneri, “Routing documents according to style,” in Proceedings of the 1st International Workshop on Innovative Information Systems, Pisa, Italy, 1998.
Argamon, S., M. Koppel, J. Fine, and A. Shimoni, “Gender, genre, and writing style in formal written texts,” Text, vol. 23, no. 3, 2003a, pp. 321–346.
Argamon, S., M. Saric, and S.S. Stein, “Style mining of electronic messages for multiple authorship discrimination,” in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003b, pp. 475–480.
Baayen, R.H., H.V. Halteren, A. Neijt, and F.J. Tweedie, “An experiment in authorship attribution,” in Proceedings of the 6th International Conference on Statistical Analysis of Textual Data, 2002, pp. 69–75.
Baayen, R.H., H.V. Halteren, and F.J. Tweedie, “Outside the cave of shadows: using syntactic annotation to enhance authorship attribution,” Literary and Linguistic Computing, vol. 11, no. 3, 1996, pp. 121–132.
Bimber, B., “Measuring the gender gap on the Internet,” Social Science Quarterly, vol. 81, no. 3, 2000, pp. 868–876.
Burrows, J.F., “‘An ocean where each kind….’ Statistical analysis and some major determinants of literary style,” Computers and the Humanities, vol. 23, no. 4–5, 1989, pp. 309–321.
CommerceNet, “The CommerceNet/Nielsen Internet demographic survey (1999),” http://www.commerce.net/, 1999.
Consalvo, M. and S. Paasonen, Women and Everyday Uses of the Internet: Agency and Identity, New York: Peter Lang Publishing, 2002.
Corney, M., O. de Vel, A. Anderson, and G. Mohay, “Gender-preferential text mining of e-mail discourse,” in Proceedings of the 18th Annual Computer Security Applications Conference (ACSAC 2002), Las Vegas, 2002, pp. 282–292.
Dave, K., S. Lawrence, and D.
Pennock, “Mining the peanut gallery: opinion extraction and semantic classification of product reviews,” in Proceedings of the 12th International World Wide Web Conference (WWW’03), 2003, pp. 519–528.
de Vel, O., “Mining e-mail authorship,” paper presented at the Workshop on Text Mining, ACM International Conference on Knowledge Discovery and Data Mining (KDD 2000), Boston, MA, 2000.
de Vel, O., A. Anderson, M. Corney, and G. Mohay, “Mining e-mail content for author identification forensics,” SIGMOD Record, vol. 30, no. 4, 2001, pp. 55–64.
Diederich, J., J. Kindermann, E. Leopold, and G. Paass, “Authorship attribution with support vector machines,” Applied Intelligence, vol. 19, no. 1–2, 2003, pp. 109–123.
Forsyth, R.S. and D.I. Holmes, “Feature finding for text classification,” Literary and Linguistic Computing, vol. 11, no. 4, 1996, pp. 163–174.
Fountain, J.E., “Constructing the information society: women, information technology, and design,” Technology in Society, vol. 22, no. 1, 2000, pp. 45–62.
Fuller, J.E., “Equality in cyberdemocracy? Gauging gender gaps in on-line civic participation,” Social Science Quarterly, vol. 85, no. 4, 2004, pp. 938–957.
Gamon, M., “Sentiment classification on customer feedback data: noisy data, large feature vectors, and the role of linguistic analysis,” in Proceedings of the 20th International Conference on Computational Linguistics, 2004, pp. 841–847.
Grefenstette, G., Y. Qu, J.G. Shanahan, and D.A. Evans, “Coupling niche browsers and affect analysis for an opinion mining application,” in Proceedings of the 12th International Conference Recherche d’Information Assistée par Ordinateur, 2004, pp. 186–194.
Guiller, J. and A. Durndell, “Students’ linguistic behaviour in online discussion groups: does gender matter?” Computers in Human Behavior, vol. 23, no. 5, 2007, pp. 2240–2255.
Guo, B. and M.S.
Nixon, “Gait feature subset selection by mutual information,” IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, vol. 39, no. 1, 2009, pp. 36–46.
Halbert, D., “Shulamith Firestone: radical feminism and visions of the information society,” Information, Communication and Society, vol. 7, no. 1, 2004, pp. 115–136.
Harcourt, W., “The personal and the political: women using the Internet,” Cyberpsychology and Behavior, vol. 3, no. 5, 2000, pp. 693–697.
Harp, D. and M. Tremayne, “The gendered blogosphere: examining inequality using network and feminist theory,” Journalism and Mass Communication Quarterly, vol. 83, no. 2, 2006, pp. 247–264.
Holmes, D.I. and R.S. Forsyth, “The Federalist revisited: new directions in authorship attribution,” Literary and Linguistic Computing, vol. 10, no. 2, 1995, pp. 111–127.
Hota, S., S. Argamon, M. Koppel, and I. Zigdon, “Performing gender: automatic stylistic analysis of Shakespeare’s characters,” in Proceedings of the Digital Humanities Conference (Association for Computers in the Humanities and the Association for Literary and Linguistic Computing), 2006, pp. 100–106.
Hu, M. and B. Liu, “Mining and summarizing customer reviews,” in Proceedings of the ACM SIGKDD International Conference, 2004, pp. 168–177.
Jackson, L.A., K.S. Ervin, P.D. Gardner, and N. Schmitt, “Gender and the Internet: women communicating and men searching,” Sex Roles: A Journal of Research, vol. 44, no. 5–6, 2001, pp. 363–378.
Koppel, M., N. Akiva, and I. Dagan, “Feature instability as a criterion for selecting potential style markers,” Journal of the American Society for Information Science and Technology, vol. 57, no. 11, 2006, pp. 1519–1525.
Koppel, M., S. Argamon, and A. Shimoni, “Automatically categorizing written texts by author gender,” Literary and Linguistic Computing, vol. 17, no. 4, 2002, pp. 401–412.
Koppel, M. and J.
Schler, “Exploiting stylistic idiosyncrasies for authorship attribution,” in Proceedings of the IJCAI Workshop on Computational Approaches to Style Analysis and Synthesis, Acapulco, Mexico, 2003.
Ledger, G.R. and T.V.N. Merriam, “Shakespeare, Fletcher, and the Two Noble Kinsmen,” Literary and Linguistic Computing, vol. 9, no. 4, 1994, pp. 235–248.
Li, J., Z. Zhang, X. Li, and H. Chen, “Kernel-based learning for biomedical relation extraction,” Journal of the American Society for Information Science and Technology (JASIST), vol. 59, no. 5, 2008, pp. 756–769.
Li, J., R. Zheng, and H. Chen, “From fingerprint to writeprint,” Communications of the ACM, vol. 49, no. 4, 2006, pp. 76–82.
Martindale, C. and D. McKenzie, “On the utility of content analysis in author attribution: The Federalist,” Computers and the Humanities, vol. 29, no. 4, 1995, pp. 259–270.
Mendenhall, T.C., “The characteristic curves of composition,” Science, vol. 11, no. 11, 1887, pp. 237–249.
Mishne, G., “Experiments with mood classification,” in Proceedings of the 1st Workshop on Stylistic Analysis of Text for Information Access, Salvador, Brazil, 2005.
Mitra, A., “Voices of the marginalized on the Internet: examples from a Website for women of South Asia,” Journal of Communication, vol. 54, no. 3, 2004, pp. 492–510.
Mosteller, F., Applied Bayesian and Classical Inference: The Case of the Federalist Papers, 2nd ed., Springer, 1964.
National Election Study, “American National Election Study, 1998 pre- and post-election survey,” conducted by the Center for Political Studies of the Institute for Social Research, The University of Michigan, Ann Arbor, Inter-University Consortium for Political and Social Research, 1998.
Nowson, S. and J. Oberlander, “The identity of bloggers: openness and gender in personal Weblogs,” in Proceedings of the AAAI Spring Symposia on Computational Approaches to Analyzing Weblogs, Stanford, California, 2006.
O’Reilly, T., “What is Web 2.0? Design patterns and business models for the next generation of software,” http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-Web-20.html, 2005.
Ogan, C., F. Cicek, and M. Ozakca, “Letters to Sarah: analysis of email responses to an online editorial,” New Media and Society, vol. 7, no. 4, 2005, pp. 533–557.
Pang, B., L. Lee, and S. Vaithyanathan, “Thumbs up? Sentiment classification using machine learning techniques,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002, pp. 79–86.
Peng, F., D. Schuurmans, V. Keselj, and S. Wang, “Automated authorship attribution with character level language models,” in Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, 2003.
Pew Internet and American Life Project, http://www.pewinternet.org/trends/User_Demo_7.22.08.htm, 2008.
Platt, J., “Fast training of support vector machines using sequential minimal optimization,” in B. Schölkopf, C. Burges, and A. Smola (eds.), Advances in Kernel Methods: Support Vector Learning, Cambridge, MA: MIT Press, 1999.
Quinlan, J.R., “Induction of decision trees,” Machine Learning, vol. 1, no. 1, 1986, pp. 81–106.
Schler, J., M. Koppel, S. Argamon, and J. Pennebaker, “Effects of age and gender on blogging,” in Proceedings of the AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, Menlo Park, California, 2006, pp. 199–205.
Seale, C., S. Ziebland, and J. Charteris-Black, “Gender, cancer experience and Internet use: a comparative keyword analysis of interviews and online cancer support groups,” Social Science and Medicine, vol. 62, no. 10, 2006, pp. 2577–2590.
Shade, L.R., Gender and Community in the Social Construction of the Internet, New York: Peter Lang Publishing, 2002.
Shannon, C.E., “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, 1948, pp. 379–423.
Sherman, A.P., Cybergrrl @ Work: Tips and Inspiration for the Professional You, Berkley Trade, 2001.
Subasic, P. and A. Huettner, “Affect analysis of text using fuzzy semantic typing,” IEEE Transactions on Fuzzy Systems, vol. 9, no. 4, 2001, pp. 483–496.
Turney, P.D., “Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, 2002, pp. 417–424.
Tweedie, F.J. and R.H. Baayen, “How variable may a constant be? Measures of lexical richness in perspective,” Computers and the Humanities, vol. 32, no. 5, 1998, pp. 323–352.
Wiebe, J., T. Wilson, and M. Bell, “Identifying collocations for recognizing opinions,” in Proceedings of the ACL/EACL Workshop on Collocation, Toulouse, France, 2001.
Wiebe, J., T. Wilson, R. Bruce, M. Bell, and M. Martin, “Learning subjective language,” Computational Linguistics, vol. 30, no. 3, 2004, pp. 277–308.
Witten, I.H. and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed., San Francisco: Morgan Kaufmann, 2005.
Yang, Y. and J.O. Pedersen, “A comparative study on feature selection in text categorization,” in Proceedings of ICML ’97, 1997, pp. 412–420.
