Social Networks Analysis (SNA)
In social science, the structural approach based on the study of interaction among social actors is called social network analysis. The relationships that social network analysts study usually link individual human beings, since these social scientists believe that, beyond individual characteristics, relational links (the social structure) are indispensable for fully understanding social phenomena.
Social network analysis is used to understand the social structure that exists among entities in an organization. The defining feature of social network analysis (SNA) is its focus on the structure of relationships, ranging from casual acquaintance to close bonds.
This is in contrast with other areas of the social sciences where the focus is often on the attributes of agents rather than on the relationships between them.
SNA maps and measures formal and informal relationships to understand what facilitates or impedes the knowledge flows that bind interacting units: who knows whom, and who shares what information and how. Social network analysis focuses on uncovering the patterning of people's interaction.
SNA is based on the intuition that these patterns are important features of the lives of the individuals who display them. Network analysts believe that how an individual's life unfolds depends in large part on how that individual is tied into the larger web of social connections.
Moreover, many believe that the success or failure of societies and organizations often depends on the patterning of their internal structure, an insight that is grounded in the systematic analysis of empirical data.
With the availability of powerful computers and discrete combinatorics (especially graph theory) after 1970, the study of SNA took off as an interdisciplinary specialty. Its applications are manifold, and include organizational behavior, inter-organizational relations, the spread of contagious diseases, mental health, social support, the diffusion of information, and animal social organization.
SNA software provides the researcher with data that can be analyzed to determine centrality measures, such as the degree, betweenness, and closeness of each node. An individual's social network influences his/her social attitudes and behavior.
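Two of the centrality measures mentioned above can be sketched in a few lines of plain Python. The friendship network below is a hypothetical toy example, not data from the text; degree centrality is the fraction of other nodes an actor is directly tied to, and closeness centrality is based on shortest-path distances.

```python
from collections import deque

# Toy undirected friendship network (hypothetical illustrative data).
graph = {
    "Ann": ["Bob", "Cal"],
    "Bob": ["Ann", "Cal", "Dee"],
    "Cal": ["Ann", "Bob"],
    "Dee": ["Bob"],
}

def degree_centrality(g):
    """Fraction of the other nodes each node is directly tied to."""
    n = len(g)
    return {v: len(nbrs) / (n - 1) for v, nbrs in g.items()}

def closeness_centrality(g, source):
    """Inverse of the average shortest-path distance from `source`,
    computed with a breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in g[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

print(degree_centrality(graph)["Bob"])        # 1.0: Bob is tied to everyone else
print(closeness_centrality(graph, "Dee"))     # 0.6: Dee is peripheral
```

Dedicated SNA packages compute these (and betweenness) for much larger networks, but the underlying ideas are no more than this.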
Before collecting network data, typically through interviews, the researcher must first decide which kinds of networks and which kinds of relations will be studied:
1. One-mode versus two-mode networks: The former involve relations among a single set of similar actors, while the latter involve relations between two different sets of actors.
An example of the two-mode network would be the analysis of a network consisting of private, for-profit organizations, and their links to nonprofit agencies in a community.
Two-mode networks are also used to investigate the relationship between a set of actors and a series of events. For example, although people may not have direct ties to each other, they may attend similar events or activities in a community and in doing so, this sets up opportunities for the formation of “weak ties.”
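The person-to-event structure described above can be projected into a one-mode network of "weak ties": two people become connected whenever they attend the same event. A minimal sketch, using hypothetical attendance data:

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical two-mode (affiliation) data: people and the
# community events they attended.
attendance = {
    "book_club":  ["Ann", "Bob"],
    "fundraiser": ["Bob", "Cal", "Dee"],
    "town_hall":  ["Ann", "Dee"],
}

def project_to_one_mode(two_mode):
    """Tie two actors whenever they co-attend an event,
    counting how many events they share."""
    ties = defaultdict(int)
    for attendees in two_mode.values():
        for a, b in combinations(sorted(attendees), 2):
            ties[(a, b)] += 1
    return dict(ties)

print(project_to_one_mode(attendance))
# e.g. ("Ann", "Bob") share one event, ("Cal", "Dee") share one, ...
```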
2. Complete/whole versus ego networks: Complete/whole or sociocentric networks consist of the connections among members of a single, bounded community. Relational ties among all of the teachers in a high school are an example of a whole network.
Ego/egocentric or personal networks refer to the ties directly connecting the focal actor (the ego) to others (the ego's alters) in the network, plus the ego's views on the ties among his or her alters.
If we asked a teacher to nominate the people he/she socializes with outside of school, and then asked that teacher to indicate who in that network socializes with the others nominated, the result would be a typical ego network.
a. Egocentric network data focus on the network surrounding one node, or in other words, the single social actor. Data are on nodes that share the chosen relation(s) with the ego and on relationships between those nodes. Ego network data can be extracted from whole network data by choosing a focal node and examining only nodes connected to this ego.
Ego network data, like whole network data, can also include multiple relations; these relations can be collapsed into single networks, as when ties to people who provide companionship and emotional aid are collapsed into a single support network.
Unlike whole network analyses, which commonly focus on one or a small number of networks, ego network analyses typically sample large numbers of egos and their networks.
b. Complete/whole networks focus on all social actors rather than focusing on the network surrounding any particular actor. These networks begin from a list of included actors and include data on the presence or absence of relationships between every pair of actors.
When the researcher adopts the whole network perspective, he/she will inquire about each social actor and all other individuals to collect relational data.
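Extracting an ego network from whole-network data, as described in item (a) above, amounts to keeping the ego, its alters, and only the ties among those nodes. A minimal sketch over a hypothetical adjacency structure:

```python
# Whole-network adjacency for a hypothetical bounded group
# (illustrative names, not data from the text).
whole = {
    "Ann": {"Bob", "Cal"},
    "Bob": {"Ann", "Dee"},
    "Cal": {"Ann"},
    "Dee": {"Bob"},
}

def ego_network(whole, ego):
    """Keep the ego, its alters, and only the ties among those nodes."""
    nodes = {ego} | whole[ego]
    return {v: whole[v] & nodes for v in nodes}

print(ego_network(whole, "Ann"))
# Ann's ego network contains Ann, Bob, and Cal; Bob's tie to Dee is dropped.
```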
Text Analysis
Text analysis is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management.
Text analysis involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.
Text analysis draws on advances made in other computer science disciplines concerned with the handling of natural language because of the centrality of natural language text to its mission; text analysis exploits techniques and methodologies from the areas of information retrieval, information extraction, and corpus-based computational linguistics.
Since text analysis derives much of its inspiration and direction from seminal research on data mining, there are many high-level architectural similarities between the two systems.
For instance, text analysis adopts many of the specific types of patterns in its core knowledge discovery operations that were first introduced and vetted in data mining research. Further, both types of systems rely on preprocessing routines, pattern-discovery algorithms, and presentation-layer elements such as visualization tools to enhance the browsing of answer sets.
For text analysis systems, in contrast, preprocessing operations center on the identification and extraction of representative features from natural language documents.
These preprocessing operations are responsible for transforming unstructured data stored in document collections into a more explicitly structured intermediate format, which is not a concern relevant for most data mining systems.
The sheer size of document collections makes manual attempts to correlate data across documents, map complex relationships, or identify trends at best extremely labor-intensive and at worst nearly impossible to achieve.
Automatic methods for identifying and exploring inter-document data relationships dramatically enhance the speed and efficiency of research activities.
Indeed, in some cases, automated exploration techniques like those found in text analysis are not just a helpful adjunct but a baseline requirement for researchers to be able, in a practical way, to recognize subtle patterns across large numbers of natural language documents.
Text analysis systems, however, usually do not run their knowledge discovery algorithms on unprepared document collections. Considerable emphasis in text analysis is devoted to what is commonly referred to as preprocessing operations.
Text analysis preprocessing operations include a variety of different types of techniques culled and adapted from information retrieval, information extraction, and computational linguistics research that transform raw, unstructured, original-format content (like that which can be downloaded from document collections) into a carefully structured, intermediate data format.
Knowledge discovery operations, in turn, are operated against this specially structured intermediate representation of the original document collection.
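The transformation from raw document text to a structured intermediate representation can be sketched very simply. The example below is an assumed bag-of-words representation over a hypothetical two-document collection; real systems use far richer intermediate formats.

```python
import re
from collections import Counter

# Hypothetical raw document collection.
docs = {
    "d1": "Stock prices rose as markets rallied.",
    "d2": "Bond markets fell while stock trading slowed.",
}

def to_intermediate(text):
    """A minimal structured intermediate representation:
    lowercase token counts per document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Knowledge discovery operations then run against this index,
# not against the raw text.
index = {doc_id: to_intermediate(text) for doc_id, text in docs.items()}
print(index["d1"]["markets"])   # 1
```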
Defining Text Analysis
Text analysis can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text analysis seeks to extract useful information from data sources through the identification and exploration of interesting patterns.
In the case of text analysis, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data of the documents in these collections.
The document collection can be any grouping of text-based documents. Practically speaking, however, most text analysis solutions are aimed at discovering patterns across very large document collections. The number of documents in such collections can range from the many thousands to the tens of millions.
Document collections can be either static, in which case the initial complement of documents remains unchanged, or dynamic, a term applied to document collections characterized by their inclusion of new or updated documents over time.
Extremely large document collections, as well as document collections with very high rates of document change, can pose performance optimization challenges for various components of a text analysis system.
A document can be very informally defined as a unit of discrete textual data within a collection that usually correlates with some real-world document such as a business report, legal memorandum, email, research paper, manuscript, article, press release, or news story.
Within the context of a particular document collection, it is usually possible to represent a class of similar documents with a prototypical document.
But a document can (and generally does) exist in any number or type of collections—from the very formally organized to the very ad hoc. A document can also be a member of different document collections, or different subsets of the same document collection, and can exist in these different collections at the same time.
Documents are commonly distinguished by their degree of internal structure:
1. The document, as a whole, can be seen as a structured object.
2. Documents with extensive and consistent format elements in which field-type metadata can be inferred, such as some email, HTML web pages, PDF files, and word-processing files with heavy document templating or style-sheet constraints, are described as semistructured documents.
3. Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure—like most scientific research papers, business reports, legal memoranda, and news stories—are referred to as free-format or weakly structured documents.
Some text documents, like those generated from a WYSIWYG HTML editor, actually possess from their inception more overt types of embedded metadata in the form of formalized markup tags.
However, even a rather innocuous document demonstrates a rich amount of semantic and syntactic structure, although this structure is implicit and hidden in its textual content.
In addition, typographical elements such as punctuation marks, capitalization, numerics, and special characters, particularly when coupled with layout artifacts such as white spacing, carriage returns, underlining, asterisks, tables, columns, and so on, can often serve as a kind of "soft markup" language, providing clues that help identify important document subcomponents such as paragraphs, titles, publication dates, author names, table records, headers, and footnotes. Word sequence may also be a structurally meaningful dimension of a document.
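Such "soft markup" can be exploited with simple heuristics. The sketch below, over a small hypothetical report, uses blank lines to recover paragraphs and an all-caps leading line to guess the title; real systems combine many more cues than these two.

```python
import re

# Hypothetical weakly structured document.
raw = """ANNUAL REPORT 2019

Revenue grew modestly.
Costs were flat.

Outlook remains stable."""

def soft_structure(text):
    """Split on blank lines; treat a leading all-caps line as the title."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    title = blocks[0] if blocks and blocks[0].isupper() else None
    paragraphs = blocks[1:] if title else blocks
    return title, paragraphs

title, paragraphs = soft_structure(raw)
print(title)            # ANNUAL REPORT 2019
print(len(paragraphs))  # 2
```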
An essential task for most text analysis systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole.
Such a set of features is referred to as the representational model of a document. The set of features required to represent a document collection tends to become very large, affecting every aspect of a text analysis system's approach, design, and performance.
The high dimensionality of potentially representative features in document collections is a driving factor in the development of text analysis preprocessing operations aimed at creating more streamlined representational models.
This high dimensionality also indirectly contributes to other conditions that separate text analysis systems from data mining systems such as greater levels of pattern overabundance and more acute requirements for post-query refinement techniques.
The feature sparsity of a document collection reflects the fact that some features often appear in only a few documents, which means that the support of many patterns is quite low; furthermore, only a small percentage of all possible features for a document collection as a whole appears in any single document.
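Feature sparsity is easy to quantify once documents are represented as a document-by-feature matrix: it is simply the fraction of zero cells. The matrix below is hypothetical illustrative data.

```python
# Hypothetical document-term matrix: one row per document,
# one column per feature (term).
matrix = [
    [2, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 3, 0, 0, 1],
]

def sparsity(m):
    """Fraction of matrix cells that are zero; values near 1.0
    indicate that most features appear in only a few documents."""
    cells = sum(len(row) for row in m)
    zeros = sum(row.count(0) for row in m)
    return zeros / cells

print(round(sparsity(matrix), 3))   # most cells are zero
```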
While evaluating an optimal set of features for the representational model of a document collection, the tradeoff is between the following two conflicting objectives:
1. To achieve the correct calibration of the volume and semantic level of features to portray the meaning of a document accurately, which tends toward evaluating a relatively larger set of features
2. To identify features in a way that is most computationally efficient and practical for pattern discovery, which tends toward evaluating a smaller set of features
Commonly used document features are described below.
1. Characters: A character-level representation can include the full set of all characters for a document or some filtered subset; this feature space is the most complete of any representation of a real-world text document.
The individual component-level letters, numerals, special characters, and spaces are the building blocks of higher-level semantic features such as words, terms, and concepts.
Character-based representations that include some level of positional information (e.g., bigrams or trigrams) are more useful and common. Generally, character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.
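Character bigrams and trigrams, as mentioned above, are simply overlapping windows over the text, so position is preserved implicitly. A minimal sketch:

```python
def char_ngrams(text, n=3):
    """Overlapping character n-grams; each gram's index in the list
    carries its position in the text."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("mining"))   # ['min', 'ini', 'nin', 'ing']
```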
2. Words: Word-level features exist in the native feature space of a document. A word-level representation of a document includes a feature for each word within that document—that is, the "full text," where a document is represented by a complete and unabridged set of its word-level features.
However, most word-level document representations exhibit at least some minimal optimization and therefore consist of subsets of representative features devoid of items such as stop words, symbolic characters, and meaningless numerics.
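The word-level optimization just described can be sketched as a tokenizer followed by simple filters. The stop-word list below is a small assumed sample; production systems use lists of hundreds of entries.

```python
import re

# A small hypothetical stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def word_features(text):
    """Tokenize, then drop stop words and bare numerics;
    symbolic characters never survive tokenization."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and not t.isdigit()]

print(word_features("The merger of the 2 firms is complete."))
# ['merger', 'firms', 'complete']
```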
3. Terms: Terms are single words and multiword phrases selected directly from the corpus of a native document by means of term-extraction methodologies. Term-level features, in the sense of this definition, can only be made up of specific words and expressions found within the native document for which they are meant to be generally representative.
Hence, a term-based representation of a document is necessarily composed of a subset of the terms in that document.
Several term-extraction methodologies can convert the raw text of a native document into a series of normalized terms—that is, sequences of one or more tokenized and lemmatized word forms associated with part-of-speech tags. Sometimes an external lexicon is also used to provide a controlled vocabulary for term normalization.
Term-extraction methodologies employ various approaches for generating and filtering an abbreviated list of most meaningful candidate terms from among a set of normalized terms for the representation of a document. This culling process results in a smaller but relatively more semantically rich document representation than that found in word-level document representations.
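The culling step described above can be approximated with a frequency filter over normalized tokens. This is a deliberately naive sketch: it normalizes only by lowercasing and drops the tokenization, lemmatization, and part-of-speech tagging that real term extractors perform. The sample sentence is hypothetical.

```python
import re
from collections import Counter

def extract_terms(text, min_count=2):
    """Naive term extraction: normalize to lowercase word forms, then keep
    only candidates frequent enough to be representative of the document."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [t for t, c in counts.most_common() if c >= min_count]

doc = "Interest rates rose. Rates pressure banks, and banks raise rates."
print(extract_terms(doc))   # ['rates', 'banks']
```

The result is a smaller but more representative feature set than the full word list, which is exactly the tradeoff term-level representations aim for.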
4. Concepts: Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid categorization methodologies.
Concept-level features can be manually generated for documents but are now more commonly extracted from documents using complex preprocessing routines that identify single words, multiword expressions, whole clauses, or even larger syntactical units that are then related to specific concept identifiers.
Many categorization methodologies involve a degree of cross-referencing against an external knowledge source; for some statistical methods, this source might simply be an annotated collection of training documents.
For manual and rule-based categorization methods, the cross-referencing and validation of prospective concept-level features typically involve interaction with a “gold standard” such as a preexisting domain ontology, lexicon, or formal concept hierarchy—or even just the mind of a human domain expert.
Unlike word- and term-level features, concept-level features can consist of words not specifically found in the native document.
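A toy rule-based categorization, in the spirit of the methods above, maps trigger phrases to concept identifiers; note that the concept labels never need to appear verbatim in the document. The rule table below is an assumed, hypothetical lexicon.

```python
# Hypothetical rule-based concept lexicon: trigger phrases -> concept ids.
CONCEPT_RULES = {
    "MONETARY_POLICY": ["interest rate", "central bank"],
    "LABOR_MARKET":    ["unemployment", "hiring"],
}

def concept_features(text):
    """Assign a concept whenever one of its trigger phrases occurs."""
    lowered = text.lower()
    return sorted(c for c, triggers in CONCEPT_RULES.items()
                  if any(t in lowered for t in triggers))

print(concept_features("The central bank held its interest rate steady."))
# ['MONETARY_POLICY'] -- a feature not literally present in the sentence
```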
Term- and concept-based representations exhibit roughly the same efficiency but are generally much more efficient than character- or word-based document models. Terms and concepts reflect the features with the most condensed and expressive levels of semantic value, and there are many advantages to their use in representing documents for text analysis purposes.
Term-level representations can sometimes be more easily and automatically generated from the original source text (through various term-extraction techniques) than concept-level representations, which as a practical matter have often entailed some level of human intervention.
Concept-based representations can be processed to support very sophisticated concept hierarchies, and arguably provide the best representations for leveraging the domain knowledge afforded by ontologies and knowledge bases.
They are much better than any other feature set representation at handling synonymy and polysemy and are clearly best at relating a given feature to its various hyponyms and hypernyms.
Possible disadvantages of using concept-level features to represent documents include the relative complexity of applying the heuristics required, during preprocessing operations, to extract and validate concept-type features, and the domain dependence of many concepts.
Text analysis can leverage information from formal external knowledge sources for these domains to greatly enhance elements of their preprocessing, knowledge discovery, and presentation layer operations. A domain is defined as a specialized area of interest with dedicated ontologies, lexicons, and taxonomies of information.
Domain knowledge can be used in text analysis preprocessing operations to enhance concept extraction and validation activities; domain knowledge can play an important role in the development of more meaningful, consistent, and normalized concept hierarchies.
Advanced text analysis systems can create fuller representations of document collections by relating features by way of lexicons and ontologies in preprocessing operations and support enhanced query and refinement functionalities. Domain knowledge can be used to inform many different elements of a text analysis system:
Domain knowledge is an important adjunct to classification and concept-extraction methodologies in preprocessing operations
Domain knowledge can also be leveraged to enhance core mining algorithms and browsing operations.
Domain-oriented information serves as one of the main bases for search refinement techniques.
Domain knowledge may be used to construct meaningful constraints in knowledge discovery operations.
Domain knowledge may also be used to formulate constraints that allow users greater flexibility when browsing large result sets.
Search for Patterns and Trends
The problem of pattern overabundance can exist in all knowledge discovery activities. It is simply aggravated when interacting with large collections of text documents, and, therefore, text analysis operations must necessarily be conceived to provide not only relevant but also manageable result sets to a user.
Although text analysis preprocessing operations play the critical role of transforming the unstructured content of a raw document collection into a more tractable concept-level data representation, the core functionality of a text analysis system resides in the analysis of concept co-occurrence patterns across documents in a collection.
Indeed, text analysis systems rely on algorithmic and heuristic approaches to consider distributions, frequent sets, and various associations of concepts at an inter-document level in an effort to enable a user to discover the nature and relationships of concepts as reflected in the collection as a whole.
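Counting concept co-occurrence at the inter-document level, as described above, reduces to tallying concept pairs over the intermediate representation. The concept sets below are hypothetical illustrative data.

```python
from itertools import combinations
from collections import Counter

# Hypothetical concept-level intermediate representation:
# each document reduced to the set of concepts it mentions.
doc_concepts = {
    "d1": {"MONETARY_POLICY", "INFLATION"},
    "d2": {"MONETARY_POLICY", "INFLATION", "LABOR_MARKET"},
    "d3": {"LABOR_MARKET"},
}

def cooccurrence(doc_concepts):
    """Count how many documents each pair of concepts appears in together."""
    pairs = Counter()
    for concepts in doc_concepts.values():
        pairs.update(combinations(sorted(concepts), 2))
    return pairs

print(cooccurrence(doc_concepts)[("INFLATION", "MONETARY_POLICY")])   # 2
```

Distributions, frequent sets, and association rules over concepts all build on counts of exactly this kind.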
Text analysis methods—often based on large-scale, brute-force search directed at large, high-dimensionality feature sets—generally produce very large numbers of patterns.
This results in an overabundance problem with respect to identified patterns that is usually much more severe than that encountered in data analysis applications aimed at structured data sources.
A main operational task for text analysis systems is to enable a user to limit pattern overabundance by providing refinement capabilities that key on various specifiable measures of “interestingness” for search results. Such refinement capabilities prevent system users from getting overwhelmed by too many uninteresting results.
Several types of functionality are commonly supported within the front ends of text analysis systems:
Browsing: Most contemporary text analysis systems support browsing that is both dynamic and content-based; the browsing is guided by the actual textual content of a particular document collection and not by anticipated or rigorously prespecified structures;
User browsing is usually facilitated by the graphical presentation of concept patterns in the form of a hierarchy to aid interactivity by organizing concepts for investigation.
Navigation: Text analysis systems must enable a user to move across these concepts in such a way as to always be able to choose either a "big picture" view of the collection or to drill down on specific concept relationships.
Visualization: Text analysis systems use visualization tools to facilitate navigation and exploration of concept patterns; these use various graphical approaches to express complex data relationships.
Basic visualization tools generate static maps or graphs, essentially rigid snapshots of patterns, or carefully generated reports displayed on the screen or printed on an attached printer. State-of-the-art text analysis systems, by contrast, increasingly rely on highly interactive graphic representations of search results that permit a user to drag, pull, click, or otherwise directly interact with the graphical representation of concept patterns.
Query: Languages have been developed to support the efficient parameterization and execution of specific types of pattern discovery queries; these are required because the presentation layer of a text analysis system really serves as the front end for the execution of the system’s core knowledge discovery algorithms.
Instead of limiting a user to running only a certain number of fixed, preprogrammed search queries, text analysis systems are increasingly designed to expose much of their search functionality to the user by opening up direct access to their query languages by means of query language interfaces or command-line query interpreters.
Clustering: Text analysis systems enable clustering of concepts in ways that make the most cognitive sense for a particular application or task.
Refinement constraints: Some text analysis systems offer users the ability to manipulate, create, or concatenate refinement constraints to assist in producing more manageable and useful result sets for browsing.