Web History (The Complete Guide 2019)



Retail organizations are performing log analysis, website optimization, and customer loyalty programs by using brand and sentiment analysis and market-basket analysis. Dynamic pricing, website real-time customization, and product recommendations are the results of this analysis.


The finance industry is using big data for fraud pattern detection and to perform analyses for corruption, bribery, risk modeling, and trade analytics. This enables them to improve their customer risk evaluation and fraud detection, as well as to design programs for real-time upsell and cross-marketing offers.


The energy sector applies it to grid-failure analysis, soil analytics, predictive mechanical-failure analysis, chemical analysis, and smart-meter analysis, to name a few. Manufacturing is performing supply-chain analysis, customer-churn analysis, and part replacement, as well as the layout and design of manufacturing plants and factories.


Telecommunication firms use big data information for customer profiling, cell-tower analysis, optimizing customer experience, monitoring equipment status, and network analysis. This improves hardware maintenance, product recommendations, and location-based advertising.


Healthcare uses electronic medical records (EMR) and RFID to support hospital design, patient treatment, clinical decision support, clinical-trial analysis, and real-time instrument and patient monitoring and analysis.


Government is using big data for areas such as threat identification, government program analysis, and person-of-interest discovery. It’s the power of the data that is disrupting business and the IT industry. This blog explains the complete history of the web from its beginnings to the present day.


Evolution of the Web

Web 1.0


To start with, most websites were just a collection of static web pages. The shallow web, also known as the static web, is primarily a collection of static HTML web pages providing information about products or services offered.


After a while, the web became dynamic, delivering web pages created on the fly. The ability to create web pages from the content stored on databases enabled web developers to provide customized information to visitors.
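As a rough illustration of a page created on the fly from database content, here is a minimal sketch. It assumes a hypothetical product table held in an in-memory SQLite database; the table, its columns, and the `render_page` function are illustrative names, not taken from any particular site.

```python
import sqlite3
from html import escape

# Hypothetical product catalogue held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, category TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Road bike", "cycling", 899.0),
        ("Helmet", "cycling", 59.5),
        ("Tent", "camping", 240.0),
    ],
)


def render_page(category: str) -> str:
    """Build an HTML fragment for the category a visitor asked for."""
    rows = conn.execute(
        "SELECT name, price FROM products WHERE category = ? ORDER BY name",
        (category,),
    ).fetchall()
    # Escape values so stored text cannot inject markup into the page.
    items = "\n".join(
        f"  <li>{escape(name)}: {price:.2f}</li>" for name, price in rows
    )
    return f"<h1>{escape(category)}</h1>\n<ul>\n{items}\n</ul>"


page = render_page("cycling")
print(page)
```

The same function returns a different page for each category requested, which is the sense in which a dynamic site customizes information for the visitor.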


These sites are known as the deep web or the dynamic web. Though a visitor to such websites gets information attuned to his or her requirements, these sites provide primarily one-way interaction and limited user interactivity.


Users have no role in content generation and no means of accessing content without visiting the sites concerned. The shallow and deep websites, which have no or minimal user interaction, are now generally termed Web 1.0.


Web 2.0

In the last few years, a new class of web applications, known as Web 2.0 (or Service-Oriented Applications), has emerged.


These applications let people collaborate and share information online in seemingly new ways—examples include social networking sites such as myspace.com, media sharing sites such as YouTube.com, and collaborative authoring sites such as Wikipedia.


These second-generation webs offer smart user interfaces and built-in facilities for users to generate and edit content presented on the web and thereby enrich the content base. Besides leveraging the users’ potential in generating content, Web 2.0 applications provide facilities to keep the content under the user’s own categories (tagging feature) and access it easily (web feed tool).

This new version of web applications is also able to integrate multiple services under a rich user interface.


With the incorporation of new web technologies such as Asynchronous JavaScript and XML (AJAX), Ruby, blogs, wikis, social bookmarking, and tagging, the web is fast becoming more dynamic and highly interactive: users can not only pick content from a site but also contribute to it.


The web feed technology allows users to keep up with a site’s latest content without having to visit it.


Another feature of the new web is the proliferation of websites with APIs (application programming interfaces). An API from a web service facilitates web developers in collecting data from the service and creating new online applications based on these data.


Web 2.0 is a collection of technologies, business strategies, and social trends. It is a more interactive and dynamic application platform than its predecessor, Web 1.0.


Weblogs or Blogs


With the advent of software like WordPress and TypePad, along with blog service companies like blogger.com, the weblog is fast becoming the communication medium of the new web. Unlike traditional Hypertext Markup Language (HTML) web pages, blogs let non-programmers communicate on a regular basis.


Traditional HTML-style pages required knowledge of style, coding, and design in order to publish content that was basically read-only from the consumer’s point of view. Weblogs remove many of these constraints by providing a standard user interface that does not require customization.


Weblogs originally emerged as repositories of links but soon evolved to let authors publish content and let readers become content providers. The essence of a blog is its format: small chunks of content, referred to as posts, that are date-stamped, maintained in reverse chronological order, and able to include links, text, and images.


The biggest advancement made with weblogs is the permanence of the content, each piece of which has a unique Uniform Resource Locator (URL). This allows the content to be posted along with the comments to define a permanent record of information.


This is critical in that having a collaborative record that can be indexed by search engines will increase the utility and spread the information to a larger audience.




Wikis


A wiki is a website that promotes the collaborative creation of content. Wiki pages can be edited by anyone at any time. Informational content can be created and easily organized within the wiki environment and then reorganized as required. Wikis are currently in high demand in a large variety of fields, due to their simplicity and flexibility.


Documentation, reporting, project management, online glossaries and dictionaries, discussion groups, and general information applications are just a few examples of where the end user can provide value.


Although in principle anyone can alter content, some large-scale wiki environments have extensive role definitions that determine who can update, restore, delete, and create content.


Wikipedia, like many wiki-type projects, has readers, editors, administrators, patrollers, policy makers, subject matter experts, content maintainers, software developers, and system operators, all of whom create an environment open to sharing information and knowledge with a large group of users.


RSS Technologies

Originally developed by Netscape, RSS was intended to publish news-type information based on a subscription framework. Many Internet users have experienced the frustration of searching Internet sites for hours at a time to find relevant information.


RSS is an XML-based content-syndication protocol that allows websites to share information as well as aggregate information based on the users’ needs.


In the simplest form, RSS shares the metadata about the content without actually delivering the entire information source. An author might publish the title, description, publish date, and copyrights to anyone that subscribes to the feed.


The end user is required to have an application called an aggregator in order to receive the information. By having the RSS aggregator application, end users are not required to visit each site in order to obtain information.
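To make the aggregator idea concrete, the sketch below reads item metadata from a small RSS 2.0 document using Python's standard `xml.etree.ElementTree` module. The feed content and the `read_items` helper are invented for illustration; a real aggregator would fetch the XML from each subscribed site and handle many more edge cases.

```python
import xml.etree.ElementTree as ET

# A hypothetical feed; real aggregators download this XML from the site.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed used for illustration</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Short summary of the first post.</description>
      <pubDate>Mon, 07 Jan 2019 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Short summary of the second post.</description>
      <pubDate>Tue, 08 Jan 2019 09:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def read_items(feed_xml: str) -> list:
    """Extract the metadata of each item without fetching the full content."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                "published": item.findtext("pubDate", default=""),
            }
        )
    return items


items = read_items(FEED)
for entry in items:
    print(entry["published"], entry["title"], entry["link"])
```

Only titles, links, and dates are read here, which mirrors the point that a feed shares metadata about the content rather than the content itself.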


From an end user perspective, RSS changes the communication method from a search-and-discover model to a notification model. Users can locate content that is pertinent to their job and subscribe to it, which gives them a much faster communication stream.


Social Tagging


Social tagging describes the collaborative activity of marking shared online content with keywords or tags as a way to organize content for future navigation, filtering, or search. Traditional information architecture used a central taxonomy or classification scheme to place information into a specific predefined bucket or category.


The assumption was that trained librarians understood more about information content and context than the average user. While this might have been true for the local library, the enormous amount of content on the Internet makes this type of system unmanageable.


Tagging offers a number of benefits to the end user community. Perhaps the most important feature for the individual is the ability to bookmark information in a way that is easier to recall at a later date.


The idea of social tagging is to allow multiple users to tag content in a way that makes sense to them. By combining these tags, users create an environment where the opinions of the majority define the appropriateness of the tags themselves.


The resulting collection of popular tags is referred to as a folksonomy, that is, a folk taxonomy of important and emerging content within the user community.


The vocabulary problem is defined by the fact that different users define content in different ways. The disagreement can lead to missed information or inefficient user interactions.


One of the best examples of social tagging is Flickr, which allows users to upload images and tag them with appropriate metadata keywords. Other users who view the images can also tag them with their own idea of appropriate keywords. After a critical mass has been reached, the resulting tag collection will identify images correctly and without bias.
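The way individual tags add up to a folksonomy can be sketched in a few lines. The example below is only an illustration: the users, the tags, and the `build_folksonomy` function with its `min_users` threshold are assumptions, not a description of how Flickr or any other site works.

```python
from collections import Counter

# Hypothetical tags that four users applied to the same photo.
tags_by_user = {
    "alice": ["sunset", "beach", "Hawaii"],
    "bob": ["beach", "sunset", "vacation"],
    "carol": ["Sunset", "ocean"],
    "dave": ["beach", "sun set"],
}


def build_folksonomy(tags_by_user: dict, min_users: int = 2) -> list:
    """Keep the tags that at least `min_users` different users agree on."""
    counts = Counter()
    for tags in tags_by_user.values():
        # Normalise case and count each tag at most once per user.
        counts.update({tag.strip().lower() for tag in tags})
    popular = [(tag, n) for tag, n in counts.items() if n >= min_users]
    # Most-used tags first; ties are broken alphabetically.
    return sorted(popular, key=lambda pair: (-pair[1], pair[0]))


popular = build_folksonomy(tags_by_user)
print(popular)
```

Note that dave's "sun set" is not merged with "sunset", a small instance of the vocabulary problem: users who describe the same thing differently are not counted together unless the tags are reconciled.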


Other sites, such as iStockPhoto, have also used this technology, but more as part of a sales channel than as a community feature.


Mashups: Integrating Information

The final Web 2.0 technology describes the efforts around information integration, commonly referred to as mashups. These applications can be combined to deliver additional value that the individual parts could not on their own:


1. HousingMaps.com combines the Google mapping application with a real estate listing service on Craigslist.


2. Chicagocrime.org overlays local crime statistics on top of Google Maps so end users can see what crimes were committed recently in the neighborhood.


3. Another site synchronizes Yahoo! Inc.’s real-time traffic data with Google Maps.


Much of the work with web services will enable greater extensions of mashups and combine many different businesses and business models. Organizations, like Amazon and Microsoft, are embracing the mashup movement by offering developers easier access to their data and services.
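At its core, a mashup joins records from independent sources on a shared key. The sketch below combines a hypothetical listing service with a hypothetical geocoding source to produce map markers; all addresses, coordinates, and the `build_markers` function are invented for illustration, and no real mapping or listing API is called.

```python
# Source 1: a hypothetical rental listing service.
listings = [
    {"id": 1, "address": "12 Oak St", "rent": 1450},
    {"id": 2, "address": "98 Elm Ave", "rent": 1200},
    {"id": 3, "address": "5 Pine Rd", "rent": 1700},
]

# Source 2: a hypothetical geocoding service (address -> latitude, longitude).
geocodes = {
    "12 Oak St": (41.8781, -87.6298),
    "98 Elm Ave": (41.8819, -87.6278),
}


def build_markers(listings: list, geocodes: dict) -> list:
    """Join the two sources on the address to produce map markers."""
    markers = []
    for listing in listings:
        coords = geocodes.get(listing["address"])
        if coords is None:
            continue  # a listing without coordinates cannot be placed on a map
        lat, lon = coords
        label = f'{listing["address"]}: ${listing["rent"]}/month'
        markers.append({"lat": lat, "lon": lon, "label": label})
    return markers


markers = build_markers(listings, geocodes)
for marker in markers:
    print(marker)
```

Neither source alone can answer "where are the affordable rentals?"; the combined markers can, which is the added value the mashup provides.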


Moreover, they’re programming their services so that more computing tasks, such as displaying maps onscreen, get done on the users’ personal computers rather than on their far-flung servers.


Comparison between Web 1.0 and Web 2.0


Web 1.0 Characteristics             Web 2.0 Characteristics

Static content                      Dynamic content
Producer-based information          Participatory-based information
Messages pushed to consumer         Messages pulled by the consumer
Institutional control               Individual enabled
Top-down implementation             Bottom-up implementation
Users search and browse             Users publish and subscribe
Transactional-based interactions    Relationship-based interactions
Goal of mass adoption               Goal of niche adoption
Taxonomy                            Folksonomy


User Contributed Content


One of the basic themes of Web 2.0 is user-contributed information. The value derived from the contributed content comes not from a subject matter expert, but rather from individuals whose small contributions add up. Examples of user-contributed content include the product review systems of sites like Amazon.com and the reputation systems used with eBay.com. A common practice of online merchants is to enable their customers to review or express opinions on the products they have purchased.


Online reviews are a major source of information for consumers and demonstrated enormous implications for a wide range of management activities, such as brand building, customer acquisition and retention, product development, and quality assurance.


A person’s reputation is a valuable piece of information that can be used when deciding whether or not to interact or do business with that person. A reputation system is a bidirectional medium where buyers post feedback on sellers and vice versa.


For example, eBay buyers voluntarily comment on the quality of service, their satisfaction with the item traded, and promptness of shipping. Sellers comment about the prompt payment from buyers or respond to comments left by the buyer.


Reputation systems may be categorized into three basic types: ranking, rating, and collaborative filtering. Ranking systems use quantifiable measures of users’ behavior to generate a rating. Rating systems use explicit evaluations given by users in order to define a measure of interest or trust.


Finally, collaborative filtering systems determine the level of relationship between the two individuals before placing a weight on the information. For example, if a user has reviewed similar items in the past, then the relevancy of a new rating will be higher.
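One simple way to express "placing a weight on the information" is to weight each rating by how much the reviewer's history overlaps with that of the person reading the review. The sketch below uses Jaccard similarity of reviewed-item sets as that weight; this choice, the sample data, and the function names are assumptions made for illustration, and production systems typically use more elaborate similarity measures.

```python
def jaccard(a: set, b: set) -> float:
    """Overlap between two sets of reviewed items, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def weighted_rating(my_history: set, reviews: list):
    """Average the ratings, weighting each by reviewer similarity.

    `reviews` is a list of (reviewer_history, rating) pairs.
    Returns None when no reviewer shares any history with the user.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for reviewer_history, rating in reviews:
        weight = jaccard(my_history, reviewer_history)
        weighted_sum += weight * rating
        total_weight += weight
    if total_weight == 0:
        return None
    return weighted_sum / total_weight


# Hypothetical data: items the user reviewed, and three reviews of a new item.
my_history = {"camera", "tripod", "lens"}
reviews = [
    ({"camera", "lens", "flash"}, 5),  # similar reviewer, weight 0.5
    ({"camera", "bag"}, 3),            # some overlap, weight 0.25
    ({"novel", "cookbook"}, 1),        # no overlap, weight 0.0
]

result = weighted_rating(my_history, reviews)
print(round(result, 2))
```

With these numbers the plain average of the three ratings is 3.0, while the similarity-weighted value is about 4.33, because the rating from the unrelated reviewer carries no weight.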


Web 3.0


In current web applications, information is presented in natural language, which humans can process easily; but computers cannot manipulate natural language information on the web meaningfully.


The semantic web is an extension of the current web in which information is given a well-defined meaning, better enabling computers and people to work in cooperation. It aims to provide a universal medium for information exchange by putting documents with computer-processable meaning (semantics) on the web.


Adding semantics radically changes the nature of the web—from a place where information is merely displayed to one where it is interpreted, exchanged, and processed.


Associating meaning with content or establishing a layer of machine-understandable data enables a higher degree of automation and more intelligent applications and also facilitates interoperable services.


Semantic web technologies will enhance Web 2.0 tools and their associated data with semantic annotations and semantic knowledge representations, enabling better automatic processing of data, which in turn will enhance search mechanisms, the management of tacit knowledge, and the overall efficiency of existing knowledge management (KM) tools.


Semantic blogging, semantic wikis (such as a semantic Wikipedia), semantic-enhanced social networks, semantic-enhanced KM, and semantic-enhanced user support will multiply these benefits.


The ultimate goal of the semantic web is to support machine-facilitated global information exchange in a scalable, adaptable, extensible manner, so that information on the web can be used for more effective discovery, automation, integration, and reuse across various applications.


The three key ingredients that constitute the semantic web and help achieve its goals are semantic markup, ontology, and intelligent software agents.


Mobile Web


With numerous advances in mobile computing and wireless communications and the widespread adoption of mobile devices such as smartphones, the web is increasingly being accessed using handheld devices.


Mobile web applications could offer some additional features compared to traditional desktop web applications such as location-aware services, context-aware capabilities, and personalization.


The Semantic Web

While the web keeps growing at an astounding pace, most web pages are still designed for human consumption and cannot be processed by machines. Similarly, while web search engines help retrieve web pages, they do not offer support to interpret the results—for that, human intervention is still required.


As the size of the search results is often just too big for humans to interpret, finding relevant information on the web is not as easy as we would desire.


The existing web has evolved as a medium for information exchange among people, rather than machines. As a consequence, the semantic content, that is, the meaning of the information on a web page, is coded in a way that is accessible to human beings only.


Today’s web may be defined as the syntactic web, where information presentation is carried out by computers, and the interpretation and identification of relevant information are delegated to human beings.


With the volume of available digital data growing at an exponential rate, it is becoming virtually impossible for human beings to manage the complexity and volume of the available information. This phenomenon, often referred to as information overload, poses a serious threat to the continued usefulness of today’s web.


As the volume of web resources grows exponentially, researchers from industry, government, and academia are now exploring the possibility of creating a semantic web in which meaning is made explicit, allowing machines to process and integrate web resources intelligently.


Biologists use a well-defined taxonomy, the Linnaean taxonomy, adopted and shared by most of the scientific community worldwide. Likewise, computer scientists are looking for a similar model to help structure web content.


In 2001, Berners-Lee, Hendler, and Lassila published a revolutionary article in Scientific American titled “The Semantic Web: A New Form of Web Content That Is Meaningful to Computers Will Unleash a Revolution of New Possibilities.”


The semantic web is an extension of the current web in which information is given well-defined meaning, enabling computers and people to work in cooperation.


In the lower part of the architecture, we find three building blocks that can be used to encode text (Unicode), to identify resources on the web (URIs) and to structure and exchange information (XML). Resource Description Framework (RDF) is a simple, yet powerful data model and language for describing web resources.


The SPARQL Protocol and RDF Query Language (SPARQL) is the de facto standard used to query RDF data.


While RDF and the RDF Schema provide a model for representing semantic web data and for structuring semantic data using simple hierarchies of classes and properties, respectively, the SPARQL language and protocol provide the means to express queries and retrieve information from across diverse semantic web data sources.
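The RDF data model describes resources as subject-predicate-object triples, and a SPARQL query is built from triple patterns in which some positions are variables. The sketch below imitates both ideas with plain Python tuples; it is not an RDF or SPARQL implementation, and the `example.org` URIs, the property names, and the `match` function are hypothetical.

```python
# A tiny in-memory set of triples (subject, predicate, object).
EX = "http://example.org/"
triples = {
    (EX + "BernersLee", EX + "authorOf", EX + "SemanticWebArticle"),
    (EX + "Hendler", EX + "authorOf", EX + "SemanticWebArticle"),
    (EX + "Lassila", EX + "authorOf", EX + "SemanticWebArticle"),
    (EX + "SemanticWebArticle", EX + "publishedIn", EX + "ScientificAmerican"),
    (EX + "SemanticWebArticle", EX + "year", "2001"),
}


def match(pattern: tuple, store: set) -> list:
    """Return variable bindings for one triple pattern.

    Terms that start with '?' are variables; all other terms must match exactly.
    """
    results = []
    for triple in store:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable used twice must bind to the same value.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


# Roughly comparable to the SPARQL query:
#   SELECT ?who WHERE { ?who ex:authorOf ex:SemanticWebArticle }
pattern = ("?who", EX + "authorOf", EX + "SemanticWebArticle")
authors = sorted(b["?who"] for b in match(pattern, triples))
print(authors)
```

Because every statement has the same three-part shape, data from different sources can be pooled into one store and queried with the same patterns, which is what makes querying across diverse sources feasible.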


The need for a new language is motivated by the different data models and semantics at the level of XML and RDF, respectively.


An ontology is a formal, explicit specification of a shared conceptualization of a particular domain. Concepts are the core elements of the conceptualization and correspond to entities of the domain being described; properties and relations describe the interconnections between those concepts.


The Web Ontology Language (OWL) is the standard language for representing knowledge on the web. This language was designed to be used by applications that need to process the content of information on the web instead of just presenting information to human users.


Using OWL, one can explicitly represent the meaning of terms in vocabularies and the relationships between those terms. The Rule Interchange Format (RIF) is the W3C Recommendation that defines a framework to exchange rule-based languages on the web. Like OWL, RIF defines a set of languages covering various aspects of the rule layer of the semantic web.


Rich Internet Applications


Rich Internet applications (RIAs) are web-based applications that run in a web browser and do not require software installation but still have the features and functionality of traditional desktop applications.


The term “RIA” was introduced in a Macromedia whitepaper in March 2002. RIA represents the evolution of the browser from a static request-response interface to a dynamic, asynchronous interface.


Broadband proliferation, consumer demand, and enabling technologies, including the Web 2.0, are driving the proliferation of RIAs. RIAs promise a richer user experience and benefits—interactivity and usability that are lacking in many current applications. Some prime examples of RIA frameworks are Adobe’s Flex and AJAX, and examples of RIA include Google’s Earth, Mail, and Finance applications.


Enterprises are embracing the promises of RIAs by applying them to user tasks that demand interactivity, responsiveness, and richness. Predominant techniques such as HTML, forms, and CGI are being replaced by other, more programmer- or user-friendly approaches such as AJAX and web services.


Building a web application using fancy technology, however, does not ensure a better user experience. To add real value, developers must address.


Web Applications

Web applications differ from traditional software in their operational environment, their development approach, and the faster pace at which they are developed and deployed.


Characteristics of web applications are as follows:

Web-based systems, in general, demand good aesthetic appeal—“look and feel”— and easy navigation.


Web-based applications demand presentation of a variety of content—text, graphics, images, audio, and/or video—and the content may also be integrated with procedural processing.


Hence, their development includes the creation and management of the content and their presentation in an attractive manner, as well as a provision for subsequent content management (changes) on a continual basis after the initial development and deployment.


Web applications are meant to be used by a vast, diverse, remote community of users who have different requirements, expectations, and skill sets. Therefore, the user interface and usability features have to meet the needs of a diverse, anonymous user community.


Furthermore, the number of users accessing a web application at any time is hard to predict; there could be a “flash crowd” triggered by major events or promotions.


Web applications, especially those meant for a global audience, need to adhere to many different social and cultural sentiments and national standards—including multiple languages and different systems of units.


The ramifications of failure or user dissatisfaction can be much worse for web-based applications than for conventional IT systems. Also, web applications can fail for many different reasons.


Successfully managing the evolution, change, and newer requirements of web applications is a major technical, organizational, and management challenge. Most web applications are evolutionary in their nature, requiring (frequent) changes in content, functionality, structure, navigation, presentation, or implementation on an ongoing basis.


The frequency and degree of change of information content can be quite high; they particularly evolve in terms of their requirements and functionality, especially after the system is put into use.


In most web applications, frequency and degree of change are much higher than in traditional software applications, and in many applications, it is not possible to specify fully their entire requirements at the beginning.


There is a greater demand on the security of web applications; security and privacy needs of web-based systems are in general more demanding than those of traditional software.


Web applications need to cope with a variety of display devices and formats, and support hardware, software, and networks with vastly varying access speeds.


The proliferation of new web technologies and standards, and the competitive pressure to use them, bring their own advantages as well as additional challenges to the development and maintenance of web applications.


The evolving nature of web applications necessitates an incremental developmental process.


Web Applications Dimensions

Presentation


Presentation technologies have advanced over time, for example in their multimedia capabilities, but the core technology of the web application platform, the Hypertext Markup Language (HTML), has remained relatively stable.


Consequently, application user interfaces have to be mapped to document-oriented markup code, resulting in an impedance or a gap between the design and the subsequent implementation.


The task of communicating content in an appropriate way combines both artistic visual design and engineering disciplines. Usually, based on the audience of the website, there are numerous factors to be considered.


For example, in the international case, cultural differences may have to be accounted for, affecting not only languages but also, for example, the perception of color schemes.


Further restrictions may originate from the publishing organizations themselves that aim at reflecting the company’s brand with a corresponding corporate design or legal obligations with respect to accessibility.



Dialogue


Interactive elements in web applications often appear in the shape of forms that allow users to enter data that are used as input for further processing. More generally, the dialogue concern covers not only the interaction between humans and the application but also the interaction between arbitrary actors (including other programs) and the manipulated information space.


The flow of information is governed by the web’s interaction model, which, due to its distributed nature, differs considerably from other platforms.


The interaction model is subject to variation, as seen in recent trends toward more client-side application logic and asynchronous communication between client and server. AJAX, for example, focuses on user interfaces whose look and feel resembles that of desktop applications.




Navigation


In addition to the challenge of communicating information, there exists the challenge of making it easily accessible to the user without ending in the “lost in hyperspace” syndrome. This holds true even though the web makes use of only a subset of the rich capabilities of hypertext concepts, for example, allowing only unidirectional links.


Over time, a set of common usage patterns has evolved that helps users navigate new websites they have not visited before. Applied to web application development, navigation concepts can be extended for accessing not only static document content but also application functionality.



Process


The process dimension relates to the operations performed on the information space, which are generally triggered by the user via the web interface and whose execution is governed by the business policy.


Particular challenges arise from scenarios with frequently changing policies, demanding agile approaches with preferably dynamic wiring between loosely coupled components.


Beneath the user interface of a web application lies the implementation of the actual application logic, for which the web acts as a platform to make it available to the concerned stakeholders.


If the application is not distributed, the process dimension is hardly affected by web-specific factors, allowing for standard non-web approaches like Component-Based Software Engineering to be applied. Otherwise, service-oriented approaches account for cases where the wiring extends over components that reside on the web.




Data


Data are the content of the documents to be published; although content can be embedded in the web documents together with other dimensions like presentation or navigation, the evolution of web applications often demands a separation, using data sources such as XML files, databases, or web services. Traditional issues include the structure of the information space as well as the definition of structural linking.


In the context of the dynamic nature of web applications, one can distinguish between static information that remains stable over time and dynamic information that is subject to changes.


Depending on the media type being delivered, either the data can be persistent, that is, accessible independently of time, or it can be transient, that is, accessible as a flow, as in the case of a video stream.


Moreover, metadata can describe other data, increasing the usefulness of the data within the global information space established by the web.


Similarly, the machine-based processing of information is further supported by semantic web approaches that apply technologies like the Resource Description Framework (RDF) to make metadata statements (e.g., about web page content) and express the semantics of associations between arbitrary resources worldwide.


Search Analysis

Exploiting the data stored in the search logs of web search engines, intranets, and websites provides important insights into the information searching habits and tactics of online searchers. Web search engine companies use search logs (also referred to as transaction logs) to investigate searching trends and effects of system improvements.


This understanding can inform information system design, interface development, and information architecture construction for content collections. Search logs are an unobtrusive method of collecting significant amounts of searching data on a sizable number of system users.


A search log is an electronic record of interactions that have occurred during a searching episode between a web search engine and users searching for information on that web search engine.


The users may be humans or computer programs acting on behalf of humans. Interactions are the communication exchanges that occur between users and the system, initiated either by the user or by the system.


Most search logs are server-side recordings of interactions; the server software application can record various types of data and interactions depending on the file format that the server software supports.


The search log format is typically an extended file format, which contains data such as the client computer’s Internet Protocol (IP) address, user query, search engine access time, and referrer site, among other fields.


Search Log Analysis (SLA) is defined as the use of data collected in a search log to investigate particular research questions concerning interactions among web users, the web search engine, or the web content during searching episodes.


Within this interaction context, SLA could use the data in search logs to discern attributes of the search process, such as the searcher’s actions on the system, the system responses, or the evaluation of results by the searcher.


From this understanding, one achieves some stated objective, such as improved system design, advanced searching assistance, or better understanding of some user information searching behavior.


SLA Process


SLA involves the following three major stages:

1. Data collection involves the process of collecting the interaction data for a given period in a search log. Search logs provide a good balance between collecting a robust set of data and unobtrusively collecting that data.


Collecting data from real users pursuing needed information while interacting with real systems on the web affects the type of data that one can realistically assemble.


On a real-life system, the method of data monitoring and collecting should not interfere with the information searching process. Not only may a data collection method that interferes with the information searching process unintentionally alter that process, but such nonpermitted interference may also lead to the loss of a potential customer.


A search log typically consists of data such as

  • a. User Identification: The IP address of the customer’s computer
  • b. Date: The date of the interaction as recorded by the search engine server
  • c. The Time: The time of the interaction as recorded by the search engine server

Additionally, it could also consist of data such as

  • a. Results Page: The code representing a set of result abstracts and URLs returned by the search engine in response to a query
  • b. Language: The user’s preferred language for the retrieved web pages
  • c. Source: The federated content collection searched
  • d. Page Viewed: The URL that the searcher visited after entering the query and viewing the results page, which is also known as click-thru or click-through


2. Data preparation involves the process of cleaning and preparing the search log data for analysis. For data preparation, the focus is on importing the search log data into a relational or NoSQL database, assigning each record a primary key, cleaning the data (i.e., checking each field for bad data), and calculating standard interaction metrics that will serve as the basis for further analysis.


Data preparation consists of steps like


a. Cleaning data: Records in search logs can contain corrupted data. These corrupted records can be as a result of multiple reasons, but they are mostly related to errors when logging the data.


b. Parsing data: Using the three fields of The Time, User Identification, and Search URL, common to all web search logs, the chronological series of actions in a searching episode is recreated. The web query search logs usually contain queries from both human users and agents.


Depending on the research objective, one may be interested in only individual human interactions, those from common user terminals, or those from agents.


c. Normalizing searching episodes: When a searcher submits a query, then views a document, and returns to the search engine, the web server typically logs this second visit with the identical user identification and query, but with a new time (i.e., the time of the second visit).


This is beneficial information in determining how many of the retrieved results pages the searcher visited from the search engine, but unfortunately, it also skews the results of the query level analysis. In order to normalize the searching episodes, one must first separate these result page requests from query submissions for each searching episode.
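The separation step described above can be sketched as follows. This is a minimal illustration, not a standard procedure: the `Record` fields and the rule "same user and same query as that user's previous record means a return to the results page" are assumptions made for the example.

```python
from collections import namedtuple

# One search-log record; the field names are illustrative, not a standard schema.
Record = namedtuple("Record", ["user_id", "time", "query"])


def split_episode_records(records):
    """Split records into query submissions and result-page requests.

    Assumption: a record whose user and query match that user's
    immediately preceding record is a return visit to the results page.
    """
    submissions, page_requests = [], []
    last_query = {}  # user_id -> query text of that user's previous record
    for rec in sorted(records, key=lambda r: (r.user_id, r.time)):
        if last_query.get(rec.user_id) == rec.query:
            page_requests.append(rec)
        else:
            submissions.append(rec)
        last_query[rec.user_id] = rec.query
    return submissions, page_requests
```

Query-level counts would then be computed on the submissions only, while the page requests remain available for counting how many results pages were viewed.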


3. Data analysis involves the process of analyzing the prepared data. There are three common levels of analysis for examining search logs:


a. Session analysis


A searching episode is defined as a series of interactions within a limited duration to address one or more information needs.


This session duration is typically short, with web researchers using between 5 and 120 minutes as a cutoff; each choice of time has an impact on the results: the searcher may be multitasking within a searching episode, or the episode may be an instance of the searcher engaged in successive searching.


This session definition is similar to the definition of a unique visitor used by commercial search engines and organizations to measure website traffic. The number of queries per searcher is the session length.


Session duration is the total time the user spent interacting with the search engine, including the time spent viewing the first and subsequent web documents, except the final document.


Session duration can, therefore, be measured from the time the user submits the first query until the user departs the search engine for the last time (i.e., does not return). This viewing time of the final web document is not available since the web search engine server does not record the time stamp.


A web document is the web page referenced by the URL on the search engine’s results page. A web document may be text or multimedia and, if viewed hierarchically, may contain a nearly unlimited number of subweb documents. A web document may also contain URLs linking to other web documents.


From the results page, a searcher may click on the URL of (i.e., visit) one or more results from the listings on the results page. This is click-through analysis, and it measures the page viewing behavior of web searchers.
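As a rough sketch of session-level measures, the snippet below groups query timestamps into sessions and reports session length and session duration. The 30-minute cutoff is an assumption chosen from within the 5 to 120 minute range mentioned above, and the input layout (user ID plus a timestamp in seconds) is hypothetical.

```python
from collections import defaultdict


def sessionize(events, cutoff_seconds=30 * 60):
    """Group (user_id, timestamp_seconds) query events into sessions.

    A new session starts when the gap since the user's previous
    query exceeds the cutoff (assumed here to be 30 minutes).
    """
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, stamps in by_user.items():
        stamps.sort()
        current = [stamps[0]]
        for prev, ts in zip(stamps, stamps[1:]):
            if ts - prev > cutoff_seconds:
                sessions.append((user_id, current))
                current = []
            current.append(ts)
        sessions.append((user_id, current))
    return sessions


def summarize(session):
    """Return session length (number of queries) and duration in seconds."""
    user_id, stamps = session
    return {
        "user": user_id,
        "length": len(stamps),
        # First to last logged interaction; the viewing time of the
        # final document is not in the log, so this is a lower bound.
        "duration": stamps[-1] - stamps[0],
    }
```

Changing `cutoff_seconds` changes how many sessions are found, which is the sensitivity to the cutoff choice noted above.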


b. Query analysis


The query level of analysis uses the query as the base metric. A query is defined as a string list of one or more terms submitted to a search engine. This is a mechanical definition as opposed to an information searching definition. The first query by a particular searcher is the initial query.


A subsequent query by the same searcher that is different than any of the searcher’s other queries is a modified query. There can be several occurrences of different modified queries by a particular searcher.


A unique query refers to a query that is different from all other queries in the transaction log, regardless of the searcher. A repeat query is a query that appears more than once within the dataset, submitted by two or more searchers.


Query complexity examines the query syntax, including the use of advanced searching techniques such as Boolean and other query operators.
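The query definitions above can be applied mechanically. The following is a minimal sketch that assumes the log is a chronological list of (user ID, query) pairs; the label "identical" for a resubmitted query is an assumption of this example, not a term from the text.

```python
from collections import Counter, defaultdict


def label_queries(log):
    """Label each (user_id, query) entry as initial, modified, or identical."""
    seen = defaultdict(set)  # user_id -> queries already submitted
    labels = []
    for user_id, query in log:
        if not seen[user_id]:
            label = "initial"      # first query by this searcher
        elif query not in seen[user_id]:
            label = "modified"     # differs from all earlier queries
        else:
            label = "identical"    # resubmission (hypothetical label)
        seen[user_id].add(query)
        labels.append(label)
    return labels


def unique_and_repeat(log):
    """Return the sets of unique queries and repeat queries in the log."""
    counts = Counter(query for _, query in log)
    searchers = defaultdict(set)
    for user_id, query in log:
        searchers[query].add(user_id)
    unique = {q for q, n in counts.items() if n == 1}
    repeat = {q for q, users in searchers.items() if len(users) >= 2}
    return unique, repeat
```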


c. Term analysis

The term level of analysis naturally uses the term as the basis for analysis. A term is a string of characters separated by some delimiter such as space or some other separator.


At this level of analysis, one focuses on measures such as term occurrence, which is the frequency that a particular term occurs in the transaction log.


High Usage Terms are those terms that occur most frequently in the dataset. Term co-occurrence measures the occurrence of term pairs within queries in the entire search log. One can also calculate degrees of association of term pairs using various statistical measures.
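A small sketch of these term-level measures follows. It assumes whitespace as the delimiter and lowercases terms; pointwise mutual information (PMI) is used here only as one possible association measure, since the text does not name a specific one.

```python
import math
from collections import Counter
from itertools import combinations


def term_statistics(queries):
    """Count term occurrences and term-pair co-occurrences across queries."""
    term_occurrence = Counter()   # total occurrences of each term
    query_frequency = Counter()   # number of queries containing each term
    pair_frequency = Counter()    # number of queries containing both terms
    for query in queries:
        terms = query.lower().split()          # whitespace as the delimiter
        term_occurrence.update(terms)
        distinct = sorted(set(terms))
        query_frequency.update(distinct)
        pair_frequency.update(combinations(distinct, 2))
    return term_occurrence, query_frequency, pair_frequency


def pmi(pair, query_frequency, pair_frequency, n_queries):
    """Pointwise mutual information of a term pair that co-occurs at least once.

    The pair must be an alphabetically sorted tuple, as stored above.
    """
    a, b = pair
    p_ab = pair_frequency[pair] / n_queries
    p_a = query_frequency[a] / n_queries
    p_b = query_frequency[b] / n_queries
    return math.log2(p_ab / (p_a * p_b))
```

Calling `term_occurrence.most_common(10)` on the first returned counter would list the high usage terms.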


Web Analysis


Effective website management requires a way to map the behavior of the visitors to the site against the particular objectives and purpose of the site. Web analysis or Log file analysis is the study of the log files from a particular website. The purpose of log file analysis is to assess the performance of the website.


Every time a browser hits a particular web page, the server on which the website is hosted records data, called log files, for every action a visitor takes at that website. Log file data includes information on


  • Who is visiting the website (the visitor’s URL, or web address)
  • The IP address (numeric identification) of the computer the visitor is browsing from
  • The date and time of each visit
  • Which pages the visitor viewed, how long the visitor viewed the site
  • Other relevant data
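To make these fields concrete, here is a hedged sketch of parsing one log entry. It assumes the server writes the widely used Combined Log Format; real servers can be configured differently, and the sample line in the usage note is invented.

```python
import re
from datetime import datetime

# Pattern for one Combined Log Format line (an assumption about the
# server's configuration). The referrer and user agent are optional.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # corrupted or differently formatted record
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry
```

For example, `parse_line('203.0.113.7 - - [10/Oct/2019:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/start" "Mozilla/5.0"')` yields the visitor's IP address, the time of the request, the page requested, and the referring URL.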


Log files contain potentially useful information for anyone working with a website—from server administrators to designers to marketers—who needs to assess website usability and effectiveness.


1. Website administrators use the data in log files to monitor the availability of a website to make sure the site is online, available, and without technical errors that might prevent easy access and use. Administrators can also predict and plan for growth in server resources and monitor for unusual and possibly malicious activity.


For example, by monitoring past web usage logs for visitor activity, a site administrator can predict future activity during holidays and other spikes in usage and plan to add more servers and bandwidth to accommodate the expected traffic.


In order to watch for potential attacks on a website, administrators can also monitor web usage logs for abnormal activity on the website such as repetitive login attempts, unusually large numbers of requests from a single IP address, and so forth.
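A simple way to sketch this kind of monitoring is to count requests and failed logins per IP address and flag addresses above a threshold. The thresholds, the `/login` path, and the use of status 401 for a failed login are all assumptions for illustration; a real site would tune these to its own traffic.

```python
from collections import Counter


def flag_suspicious_ips(requests, max_requests=100, max_failed_logins=5):
    """Flag IPs with unusually many requests or repeated failed logins.

    requests: iterable of (ip, path, status) tuples from the usage log.
    """
    totals = Counter()
    failed_logins = Counter()
    for ip, path, status in requests:
        totals[ip] += 1
        # Assumed convention: a failed login is a 401 response on /login.
        if path == "/login" and status == 401:
            failed_logins[ip] += 1

    flagged = {}
    for ip in totals:
        reasons = []
        if totals[ip] > max_requests:
            reasons.append("request volume")
        if failed_logins[ip] > max_failed_logins:
            reasons.append("failed logins")
        if reasons:
            flagged[ip] = reasons
    return flagged
```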


2. Marketers can use the log files to understand the effectiveness of various on- and off-line marketing efforts. By analyzing the weblogs, marketers can determine which marketing efforts are the most effective. Marketers can track the effectiveness of online advertising, such as banner ads and other links, through the use of the referrer logs (referring URLs).


Examination of the referring URLs indicates how visitors got to the website, showing, say, whether they typed the URL (web address) directly into their web browser or whether they clicked through from a link at another site.


Weblogs can also be used to track the amount of activity from offline advertising, such as magazine and other print ads, by utilizing a unique URL in each offline ad that is run.


Unlike online advertising that shows results in log information about the referring website, offline advertising requires a way to track whether or not the ad generated a response from the viewer. One way to do this is to use the ad to drive traffic to a particular website especially established only for tracking that source.
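The distinction between referred, direct, and offline-driven traffic can be sketched as a small classifier. The host name, the campaign path, and the category labels below are hypothetical; the only assumption about the log itself is that a missing referrer is recorded as "-" or an empty string.

```python
from urllib.parse import urlparse


def traffic_source(referrer, path, own_host, campaign_paths):
    """Classify one request by where the visitor came from.

    campaign_paths maps a unique landing path (printed in an offline
    ad) to a description of that ad.
    """
    if path in campaign_paths:
        return "offline ad: " + campaign_paths[path]
    if not referrer or referrer == "-":
        return "direct"  # typed URL or bookmark: no referring page
    host = urlparse(referrer).netloc
    if host == own_host:
        return "internal"  # navigation within the site itself
    return "referral: " + host
```

Counting the returned labels over a whole log gives a breakdown of which online links and which offline ads are driving visits.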


3. Website designers use log files to assess the user experience and site usability. Understanding the user environment provides web designers with the information they need to create a successful design.


While ensuring a positive user experience on a website requires more than merely good design, log files do provide readily-available information to assist with the initial design as well as continuous improvement of the website. Web designers can find useful information about


  • a. The type of operating system (e.g., Windows XP or Linux)
  • b. The screen settings (e.g., screen resolution)
  • c. The type of browser (e.g., Internet Explorer or Mozilla) used to access the site

This information allows designers to create web pages that display well for the majority of users.

A click trail can show how a viewer navigates through the various pages of a given website; the corresponding clickstream data can show


  • What products a customer looked at on an e-commerce site
  • Whether the customer purchased those products
  • What products a customer looked at but did not purchase
  • What ads generated many click-throughs but resulted in few purchases
  • And so on
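As an illustration of the third item in the list, the sketch below finds products that each visitor looked at but did not purchase. The event layout (visitor ID, action, product) and the action names are assumptions, not a real clickstream format.

```python
from collections import defaultdict


def viewed_not_purchased(events):
    """Return, per visitor, the products viewed but never purchased.

    events: iterable of (visitor_id, action, product), where action is
    "view" or "purchase" (hypothetical labels).
    """
    viewed = defaultdict(set)
    purchased = defaultdict(set)
    for visitor, action, product in events:
        if action == "view":
            viewed[visitor].add(product)
        elif action == "purchase":
            purchased[visitor].add(product)
    return {v: viewed[v] - purchased[v] for v in viewed}
```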


By giving clues as to which website features are successful, and which are not, log files assist website designers in the process of continuous improvement by adding new features, improving on current features, or deleting unused features.


Then, by monitoring the weblogs for the impact on the user reaction, and making suitable adjustments based on those reactions, the website designer can improve the overall experience for website visitors on a continuous basis.


Internet technologies relevant for Web analysis

1. Proxy servers: A proxy server is a network server that acts as an intermediary between the user’s computer and the actual server on which the website resides; it is used to improve service for groups of users.


First, it saves the results of all requests for a particular web page for a certain amount of time. Then, it intercepts all requests to the real server to see if it can fulfill the request itself. Say user A requests a certain web page (Page 1); sometime later, user B requests the same page.


Instead of forwarding the request to the web server where Page 1 resides, which can be a time-consuming operation, the proxy server simply returns the Page 1 that it already fetched for user A. Since the proxy server is often on the same network as the user, this is a much faster operation.


If the proxy server cannot serve a stored page, then it forwards the request to the real server. Importantly, pages served by the proxy server are not logged in the log files, resulting in inaccuracies in counting site traffic.
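The effect on the log files can be shown with a toy model. The two classes below are hypothetical stand-ins, not a real proxy implementation, and cache expiry is left out for brevity.

```python
class OriginServer:
    """Toy web server that logs every request it actually receives."""

    def __init__(self):
        self.log = []

    def get(self, client, page):
        self.log.append((client, page))
        return f"<html>{page}</html>"


class CachingProxy:
    """Toy proxy that answers repeat requests from its own stored copy."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def get(self, user, page):
        if page not in self.cache:
            # Only a cache miss reaches the real server, and the server
            # sees the proxy as the client, not the individual user.
            self.cache[page] = self.origin.get("proxy", page)
        return self.cache[page]
```

If user A and then user B request the same page through one `CachingProxy`, both receive the page, but the origin server's log contains a single entry attributed to the proxy.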


Major online services (such as Facebook, MSN, and Yahoo!) and other large organizations employ an array of proxy servers in which all user requests are made through a single IP address. This situation causes weblog files to significantly under-report unique visitor traffic.


On the other hand, sometimes home users with an Internet Service Provider get assigned a new IP address each time they connect to the Internet. This causes the opposite effect of inflating the number of unique visits in the weblogs.


2. Firewalls: For the purpose of security rather than efficiency, acting as an intermediary device, a proxy server can also function as a firewall in an organization. Firewalls are used by organizations to protect internal users from outside threats on the Internet or to prevent employees from accessing a specific set of websites.


Firewalls hide the actual IP address for specific user computers and instead present a single generic IP address to the Internet for all its users. Hence, this contributes to under-reporting unique visitor traffic in web analytics.


3. Caching refers to the technique in which most web browser software keeps a copy of each web page, called a cache, in its memory. So, rather than requesting the same page again from the server (for example, if the user clicks the “back” button), the browser on the computer will display a copy of the page rather than make another new request to the server.


Many Internet Service Providers and large organizations cache web pages in an effort to serve content more quickly and reduce bandwidth usage. As with the use of proxy servers, caching poses a problem because weblog files don’t report these cached page views. Again, as a result, weblog files can significantly under-report the actual visitor count.


Veracity of Log File Data

Despite the wealth of useful information available in log files, the data also suffer from limitations.


Unique Visitors


One of the major sources of inaccuracy arises from the way in which unique visitors are measured. Traditional weblog reports measure unique visitors based on the IP address, or network address, recorded in the log file.


Because of the nature of different Internet technologies, IP addresses do not always correspond to an individual visitor in a one-to-one relationship.


In other words, there is no accurate way to identify each individual visitor. Depending on the particular situation, this causes the count of unique visitors to be either over- or under-reported.


Cookies are small bits of data that a website leaves on a visitor’s hard drive after that visitor has hit a website. Then, each time the user’s web browser requests a new web page from the server, the cookie on the user’s hard drive can be read by the server. Cookie data help in several ways:


A unique cookie is generated for each user even if multiple viewers access the same website through the same proxy server; consequently, a unique session is recorded and a more accurate visitor count can be obtained.


Cookies also make it possible to track users across multiple sessions (i.e., when they return to the site subsequently), thus enabling a computation of new versus returning visitors.


Third-party cookies enable the website to assess what other sites the visitor has visited; this enables personalization of the website in terms of the content that is displayed. Cookies are not included in normal log files. Therefore, only a web analytics solution that supports cookie tracking can utilize the benefits.
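The difference between counting visitors by IP address and by cookie can be sketched in a few lines. The input layout (IP address plus a cookie identifier per request) is an assumption; as noted above, normal log files do not carry the cookie value.

```python
def count_visitors(requests):
    """Return (unique IPs, unique cookies) for (ip, cookie_id) requests."""
    ips, cookies = set(), set()
    for ip, cookie_id in requests:
        ips.add(ip)
        cookies.add(cookie_id)
    return len(ips), len(cookies)
```

Three people behind one proxy address give `(1, 3)`, so an IP-based count under-reports; one person whose provider assigns a new address on each connection gives `(2, 1)`, so an IP-based count over-reports.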


Visitor Count


Another source of inaccuracy is in visitor count data. Most weblog reports give two possible ways to count visitors—hits and unique visits. The very definition of hits is a source of unreliability.


By definition, each time a web page is loaded, each element of the web page (i.e., different graphics on the same page) is counted as a separate “hit.”


Therefore, even with a one-page view, multiple hits are recorded as a function of the number of different elements on a given web page. The net result is that hits are highly inflated numbers.
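The inflation is easy to see by counting both numbers from the same requests. In this sketch a request counts as a page view only if its path has no extension or ends in `.html` or `.htm`; that rule is an assumption for illustration, and real analytics tools use their own definitions.

```python
import posixpath

# Assumed rule: only these extensions (or none) count as a page view.
PAGE_EXTENSIONS = {"", ".html", ".htm"}


def hits_and_page_views(paths):
    """Return (hits, page views) for a list of requested URL paths."""
    hits = 0
    page_views = 0
    for path in paths:
        hits += 1  # every requested element is one hit
        extension = posixpath.splitext(path.split("?")[0])[1].lower()
        if extension in PAGE_EXTENSIONS:
            page_views += 1
    return hits, page_views
```

A single page that loads two images and a stylesheet produces four hits but only one page view.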


In contrast, the under-reporting of visitors is a serious issue for online advertising. If the ad is cached, nobody knows that the ad was delivered. As a result, the organization delivering the ad does not get paid. Log files cannot track visitor activity from cached pages because the web server never acknowledges the request.


This deficiency is remedied by using page tagging. This technique has its origins in hit counters, which, like a car odometer, increase by one count with each additional page view.


Page tagging embeds a small piece of JavaScript code on the web page itself. Then, when the website user visits the web page, the JavaScript code is activated by the user’s browser software.


Since page tagging is located on the web page itself rather than on the server, each time the page is viewed, it is “tagged”; while server logs cannot keep track of requests for a cached page, a “tagged” page will still acknowledge and record a visit.


Moreover, rather than recording a visit in a weblog file that is harder to access, page tagging records visitor information in a database, offering increased flexibility to access the information more quickly and with more options to further manipulate the data.


Visit Duration

Weblogs do not provide an accurate way to determine visit duration. Visit duration is calculated based on the time spent between the first page request and the last page request.


If the next page request never occurs, the duration cannot be calculated and will be under-reported. Weblogs also cannot account for the user who views a page, leaves the computer for 20 minutes, and comes back and clicks to the next page. In this situation, the visit duration would be highly inflated.


Web Analysis Tools


New tools in web analytics like Google Analytics provide a stronger link between online technologies and online marketing, giving marketers more essential information lacking in earlier versions of web analytics software.


For many years, web analytics programs that delivered only simple measurements such as hits, visits, referrals, and search engine queries were not well linked to an organization’s marketing efforts to drive online traffic.


As a result, they provided very little insight to help the organization track and understand its online marketing efforts.


Trends in web analytics specifically improve both the method of data collection and the analysis of the data, providing significantly more value from a marketing perspective. These newer tools attempt to analyze the entire marketing process, from a user clicking an advertisement through to the actual sale of a product or service.


This information helps to identify not merely which online advertising is driving traffic (number of clicks) to the website and which search terms lead visitors to the site, but which advertising is most effective in actually generating sales (conversion rates) and profitability.


This integration of the web log files with other measures of advertising effectiveness is critical to provide guidance into further advertising spending.


Web analytics software has the capability to perform more insightful, detailed reporting on the effectiveness of common online marketing activities such as search engine listings, pay-per-click advertising, and banner advertising. Marketing metrics to assess effectiveness can include the following:


Cost-per-click: The total online expenditure divided by the number of click-throughs to the site.

Conversion rate: The percentage of the total number of visitors who make a purchase, sign up for a service, or complete another specific action.


Return on marketing investment: The total revenue generated from the advertising divided by the advertising expense.


Bounce rate: The number of users that visit only a single page divided by the total number of visits; one indicator of the “stickiness” of a web page.
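Three of these ratios are worked through below as plain functions. The function names and the figures in the usage note are made up for illustration.

```python
def cost_per_click(total_spend, click_throughs):
    """Total online expenditure divided by click-throughs to the site."""
    return total_spend / click_throughs


def conversion_rate(conversions, visitors):
    """Percentage of visitors who complete the target action."""
    return 100.0 * conversions / visitors


def bounce_rate(single_page_visits, total_visits):
    """Share of visits that viewed only a single page."""
    return single_page_visits / total_visits
```

For instance, spending 500 on ads that bring 2,000 clicks gives a cost-per-click of 0.25; 30 purchases from 1,500 visitors is a conversion rate of 2 percent; and 400 single-page visits out of 1,000 visits is a bounce rate of 0.4.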


Social Network Applications


Social computing is the use of social software, which is based on creating or recreating online social conversations and social contexts through the use of software and technology.


An example of social computing is the use of email for maintaining social relationships. Social Networks (SN) are social structures made up of nodes and ties; they indicate the relationships between individuals or organizations and how they are connected through social contexts.


SN operate on many levels: they play an important role in how problems are solved and how organizations are run, and they help individuals succeed in achieving their targets and goals. Computer-based social networks enable people in different locations to interact with each other socially (e.g., chat and viewable photos) over a network.


SN are very useful for visualizing patterns. There may be few or many nodes in a network, and there may be one or more different types of relations between the nodes.


To build a useful understanding of a social network, one sketches a pattern of social relationships, kinships, community structure, and so forth. The use of mathematical and graphical techniques in social network analysis is important to represent the descriptions of networks compactly and more efficiently.


Social Networks operate on many different levels from families up to nations and play a critical role in determining the way problems are solved, organizations are run and the degree to which people succeed in achieving their goals.


Popular Social Networks


This section briefly describes popular social networks like LinkedIn, Facebook, Twitter, and Google+.



LinkedIn is currently considered the de facto source of professional networking. Launched in 2003, it is the largest business-oriented social network with more than 260 million users. This network allows users to find the key people they may need to make introductions into the office of the job they may desire.


Users can also track friends and colleagues during times of promotion and hiring to congratulate them if they choose; this results in a complex social web of business connections.


In 2008, LinkedIn introduced their mobile app as well as the ability for users to not only endorse each other but also to specifically attest to individual skills that they may hold and have listed on the site. LinkedIn now supports more than 20 languages.


Users cannot upload their resumes directly to LinkedIn. Instead, a user adds skills and work history to their profile. Other users inside that social network can verify and endorse each attribute. This essentially makes a user’s presence on LinkedIn only as believable as the people they connect with.




Facebook was created by Mark Zuckerberg at Harvard College. Launched in 2004, it grew rapidly and now has more than a billion and a half users.


In 2011, Facebook introduced personal timelines to complement a user’s profile; timelines show the chronological placement of photos, videos, links, and other updates made by a user and his or her friends.


Though a user can customize their timeline as well as the kind of content and profile information that can be shared with individual users, Facebook networks rely heavily on people posting comments publicly and also tagging people in photos. Tagging is a very common practice that places people and events together, though, if required, a user can always untag himself or herself.


Conceptually, the timeline is a chronological representation of a person’s life from birth until his or her death, or present day if you are still using Facebook. A user’s life can be broken up into pieces or categories that can be more meaningfully analyzed by the algorithms run by Facebook.


These categories include Work and Education, Family and Relationships, Living, Health and Wellness, and Milestones and Experiences. Each category contains four to seven subcategories. Users have granular control over who sees what content related to them, but less so about what they see in relation to other people.


Facebook is often accused of selling user information and not fully deleting accounts after users choose to remove them. Because Facebook has such a generalized privacy policy, they can get away with handling user information in almost any way that they see fit. Facebook has done many things to improve security in recent years.


Facebook has provided users with a detailed list of open sessions under their account name and given them the ability to revoke them at will. This is to say that, if an unauthorized person accesses a user’s account or the user forgets to log out of a computer, they can force that particular connection to close.


Location and time of access are listed for each open session, so a user can easily determine if their account is being accessed from somewhere unexpected.


When viewed through a web browser, Facebook supports https. This protocol is considered secure; however, it is not supported by mobile devices. Data transmitted by Facebook to mobile devices has been proven to be in plain text, meaning if it is intercepted it is easily human readable.


Default access granted to any Facebook app includes user ID, name, profile picture, gender, age range, locale, networks, list of friends, and any information set as public. However, Global Positioning System (GPS) coordinates and information about a user’s friends require special permission.


Any of the default information can be transmitted between devices at any time without a user’s express permission, and, in the case of mobile devices, in plain, unencrypted text.




Twitter’s original idea was to design a system for individuals to share short SMS messages with a small group of people. Hence, tweets were designed to be short, which led to the limit of 140 characters per tweet. By 2013, Twitter had 200 million users sending 500 million tweets a day.


Twitter was originally designed to work with text messages. This is why the 140-character limit was put into the original design: a tweet had to fit within the length limit of a single text message.


Twitter’s original design was to create a service that a person could send a text to, and that text would not only be available online but it would then be able to resend that text to other people using the service. Subsequently, Twitter has incorporated many different sources of media.


In 2010, Twitter added a facility for online video and photo viewing without redirection to third-party sites. In 2013, Twitter added its own music service as an iPhone app.


Despite Twitter’s continued expansion of supported content, the language used in modern tweets along with some other helpful additions has continued to adhere to the 140 character limit.




Google+ is the only social network to rival Facebook’s user base with more than a billion users. The main feature of Google+ is circles; by being part of the same circle, people create focused social networks. Circles allow networks to center around ideas and products; circles are also the way that streaming content is shared between people.


Circles generate content for users and help organize and segregate with whom information is shared. A user makes circles by placing other Google+ users into them. This is done through an interface built to be very similar to Gmail and Google Maps.


When circles create content for a user, it is accumulated and displayed on their Stream. A user’s Stream is a prioritized list of any content from that user’s circles that they have decided to display. A user can control how much of a Circle’s content is included in their Stream. Circles can also be shared, either with individual users or other circles.


Because this action is a one-time share, there is no subsequent syncing after the share takes place. Without synchronous updates, and unless a Circle is shared again, it is very easy for others to have incorrect information about Circles that change on a regular basis.


If frequent updates are made and a user wants his or her network to stay up-to-date, a user may have to share a Circle quite frequently.


Google+ Pages are essentially profiles for businesses, organizations, publications, or other entities that are not related to a single individual. They can be added to Circles like normal users and share updates to user Streams in the same way. The real distinction is that Pages do not require a legal name to be attached to the associated Google account.


Google+ has a large number of additional services and support owing to its high level of integration with Google accounts including games, messenger, photo editing and saving, mobile upload and diagnostics, apps, calendars, and video streaming.


Hangouts, which is Google’s video-streaming application, is available free for use and supports up to 10 simultaneous users in a session. Hangouts can be used as a conference call solution or to create instant webcasts. Functionally, Hangouts is similar to programs like Skype.


Other Social Networks


Here are some of the other notable social networks:


1. Classmates was established in 1995 by Randy Conrads as a means for class reunions and has more than 50 million registered users. By linking together people from the same school and class year, Classmates.com provides individuals with a chance to “walk down memory lane” and get reacquainted with old classmates who have also registered with the site.


With a minimum age limit of 18 years, registration is free and anyone may search the site for classmates that they may know. Purchasing a gold membership is required to communicate with other members through the site’s email system.


User email addresses are private, and communication for paying members is handled through a double-blind email system that ensures that only paying members can make full use of the site, allowing unlimited communication and orchestration of activities for events like reunions.


2. Friendster was launched in 2002 by Jonathan Abrams as a generic social network in Malaysia. Friendster is a social network made primarily of Asian users. Friendster was redesigned and relaunched as a gaming platform in 2011 where it would grow to its current user base of more than 115 million.


Friendster filed many of the fundamental patents related to social networks. Eighteen of these patents were acquired by Facebook in 2011.


3. hi5 is a social network developed by Ramu Yalamanchi in 2003 in San Francisco, California, and was acquired by Tagged in 2011. All of the normal social network features were included, such as friend networks, photo sharing, profile information, and groups. In 2009, hi5 was redesigned as a purely social gaming network with a required age of 18 years for all new and existing users.


Several hundred games were added, and Application Programming Interfaces (APIs) were created that include support for Facebook games. This popular change boosted hi5’s user base, and at the time of acquisition, its user base was more than 80 million.


4. Orkut was a social network almost identical to Facebook that was launched in 2004 and was shut down by the end of September 2014. Orkut obtained more than 100 million users, most of which were located in India and Brazil.


5. Flickr is a photo-sharing website that was created in 2004 and was acquired by Yahoo! in 2005; photos and videos can also be accessed via Flickr. It has tens of millions of members sharing billions of images.


6. YouTube is a video-sharing website that was created in 2005 and was acquired by Google in 2006. Members, as well as corporations and organizations, post videos of themselves as well as various events and talks. Movies and songs are also posted on this website.


Social Networks Analysis (SNA)


In social science, the structural approach that is based on the study of interaction among social actors is called social network analysis.


The relationships that social network analysts study are usually those that link individual human beings since these social scientists believe that besides individual characteristics, relational links or social structure, are necessary and indispensable to fully understand social phenomena.


Social network analysis is used to understand the social structure, which exists among entities in an organization. The defining feature of social network analysis (SNA) is its focus on the structure of relationships, ranging from casual acquaintance to close bonds. This is in contrast with other areas of the social sciences where the focus is often on the attributes of agents rather than on the relationships between them.


SNA maps and measures the formal and informal relationships to understand what facilitates or impedes the knowledge flows that bind the interacting units, that is, who knows whom and who shares what information and how. Social network analysis is focused on uncovering the patterning of people’s interaction.


SNA is based on the intuition that these patterns are important features of the lives of the individuals who display them. Network analysts believe that how an individual lives depends in large part on how that individual is tied into a larger web of social connections.


Moreover, many believe that the success or failure of societies and organizations often depends on the patterning of their internal structure. The study of that patterning is guided by formal concept analysis and grounded in a systematic analysis of the empirical data.


With the availability of powerful computers and discrete combinatorics (especially graph theory) after 1970, the study of SNA took off as an interdisciplinary specialty. Its applications are manifold and include organizational behavior, inter-organizational relations, the spread of contagious diseases, mental health, social support, the diffusion of information, and animal social organization.


SNA software provides the researcher with data that can be analyzed to determine the centrality, betweenness, degree, and closeness of each node. An individual’s social network influences his/her social attitude and behavior.
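
To make the node-level measures above concrete, here is a minimal sketch in plain Python. The five-person network, its names, and all function names are hypothetical and chosen only for illustration; the betweenness routine follows Brandes' algorithm, and the closeness formula assumes a connected, undirected network. Real SNA packages offer several normalizations that this sketch does not cover.

```python
from collections import deque

# Hypothetical undirected network: each person maps to the set of people they are tied to.
GRAPH = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob", "Dan"},
    "Dan": {"Cat", "Eve"},
    "Eve": {"Dan"},
}

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly tied to."""
    n = len(graph)
    return {v: len(nbrs) / (n - 1) for v, nbrs in graph.items()}

def bfs_distances(graph, source):
    """Shortest-path length in hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness_centrality(graph):
    """(n - 1) divided by the summed distance to all other nodes (connected graph assumed)."""
    n = len(graph)
    result = {}
    for v in graph:
        total = sum(bfs_distances(graph, v).values())
        result[v] = (n - 1) / total if total else 0.0
    return result

def betweenness_centrality(graph):
    """Unnormalized betweenness: shortest paths between other pairs that pass through a node."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        order = []                       # nodes in the order BFS reaches them
        preds = {v: [] for v in graph}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in graph}    # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):        # accumulate dependencies, farthest nodes first
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Every undirected pair was visited from both endpoints, so halve the totals.
    return {v: c / 2 for v, c in score.items()}
```

In this toy network Cat has the most direct ties and also sits on the most shortest paths, while Dan has few ties but is the only bridge to Eve, which is why degree and betweenness can rank the same actors differently.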


Before collecting network data, typically through interviews, the researcher must first decide on the kinds of networks and the kinds of relations that will be studied:


1. One-mode versus two-mode networks: The former involve relations among a single set of similar actors, while the latter involve relations between two different sets of actors.


An example of a two-mode network would be the analysis of a network consisting of private, for-profit organizations and their links to nonprofit agencies in a community.


Two-mode networks are also used to investigate the relationship between a set of actors and a series of events. For example, although people may not have direct ties to each other, they may attend similar events or activities in a community and in doing so, this sets up opportunities for the formation of “weak ties.”
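
The actor-by-event case can be sketched as follows. This is a minimal illustration under assumed data: the events, the attendee names, and the function name `project_to_actors` are all hypothetical. It collapses a two-mode network into a one-mode network in which two people are tied if they attended at least one event together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical two-mode data: each community event maps to the set of people who attended it.
ATTENDANCE = {
    "town_hall": {"Ann", "Bob", "Cat"},
    "book_club": {"Bob", "Cat"},
    "fun_run": {"Cat", "Dan"},
}

def project_to_actors(attendance):
    """Collapse actor-by-event data into actor-by-actor ties weighted by shared events."""
    ties = Counter()
    for attendees in attendance.values():
        # Sorting gives each pair one canonical (alphabetical) key.
        for a, b in combinations(sorted(attendees), 2):
            ties[(a, b)] += 1
    return ties

co_attendance = project_to_actors(ATTENDANCE)
```

Here Bob and Cat share two events, while Ann and Dan share none, so any tie between them would have to run through Cat.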


2. Complete/whole versus ego networks: Complete/whole or sociocentric networks consist of the connections among members of a single, bounded community. The relational ties among all of the teachers in a high school are an example of a whole network.


Ego (egocentric or personal) networks consist of the ties directly connecting the focal actor, or ego, to the others in the network (the ego’s alters), plus the ego’s views on the ties among his or her alters.


If we asked a teacher to nominate the people he/she socializes with outside of school, and then asked that teacher to indicate who in that network socializes with the others nominated, the result would be a typical ego network.


a. Egocentric network data focus on the network surrounding one node, or in other words, the single social actor. Data are on nodes that share the chosen relation(s) with the ego and on relationships between those nodes. Ego network data can be extracted from whole network data by choosing a focal node and examining only nodes connected to this ego.


Ego network data, like whole network data, can also include multiple relations; these relations can be collapsed into single networks, as when ties to people who provide companionship and emotional aid are collapsed into a single support network.


Unlike whole network analyses, which commonly focus on one or a small number of networks, ego network analyses typically sample large numbers of egos and their networks.


b. Complete/whole networks focus on all social actors rather than focusing on the network surrounding any particular actor. These networks begin from a list of included actors and include data on the presence or absence of relationships between every pair of actors.


When the researcher adopts the whole network perspective, he or she questions each social actor about all other individuals in order to collect relational data.
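
Extracting an ego network from whole network data, as described under (a), can be sketched in a few lines. The whole network below, the names in it, and the helper names `ego_network` and `alter_ties` are hypothetical; the sketch assumes undirected ties stored as sets of neighbours.

```python
# Hypothetical whole network: each teacher maps to the set of colleagues they socialize with.
WHOLE = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Ann", "Cat"},
    "Cat": {"Ann", "Bob", "Dan"},
    "Dan": {"Cat", "Eve"},
    "Eve": {"Dan"},
}

def ego_network(graph, ego):
    """Keep only the ego, its direct alters, and the ties among those nodes."""
    members = graph[ego] | {ego}
    return {node: graph[node] & members for node in members}

def alter_ties(graph, ego):
    """Ties that connect two alters directly, without passing through the ego."""
    alters = graph[ego]
    return {frozenset((a, b)) for a in alters for b in graph[a] & alters}
```

Ann's two alters know each other, whereas Dan's two alters are connected only through Dan, which is the kind of difference ego network analyses typically compare across many sampled egos.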


Text Analysis


Text analysis is a new and exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management.


Text analysis involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of the intermediate representations, the techniques to analyze these intermediate representations (such as distribution analysis, clustering, trend analysis, and association rules), and visualization of the results.


Text analysis draws on advances made in other computer science disciplines concerned with the handling of natural language because of the centrality of natural language text to its mission; text analysis exploits techniques and methodologies from the areas of information retrieval, information extraction, and corpus-based computational linguistics.


Since text analysis derives much of its inspiration and direction from seminal research on data mining, there are many high-level architectural similarities between the two systems.


For instance, text analysis adopts many of the specific types of patterns in its core knowledge discovery operations that were first introduced and vetted in data mining research. Further, both types of systems rely on preprocessing routines, pattern-discovery algorithms, and presentation-layer elements such as visualization tools to enhance the browsing of answer sets.


In contrast to data mining systems, however, the preprocessing operations of text analysis systems center on the identification and extraction of representative features for natural language documents.


These preprocessing operations are responsible for transforming unstructured data stored in document collections into a more explicitly structured intermediate format, which is not a concern relevant for most data mining systems.


The sheer size of document collections makes manual attempts to correlate data across documents, map complex relationships, or identify trends at best extremely labor-intensive and at worst nearly impossible to achieve.


Automatic methods for identifying and exploring inter-document data relationships dramatically enhance the speed and efficiency of research activities.


Indeed, in some cases, automated exploration techniques like those found in text analysis are not just a helpful adjunct but a baseline requirement for researchers to be able, in a practical way, to recognize subtle patterns across large numbers of natural language documents.


Text analysis systems, however, usually do not run their knowledge discovery algorithms on unprepared document collections. Considerable emphasis in text analysis is devoted to what is commonly referred to as preprocessing operations.


Text analysis preprocessing operations include a variety of different types of techniques culled and adapted from information retrieval, information extraction, and computational linguistics research that transform raw, unstructured, original-format content (like that which can be downloaded from document collections) into a carefully structured, intermediate data format.


Knowledge discovery operations, in turn, are operated against this specially structured intermediate representation of the original document collection.
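
The two-stage flow described above, preprocessing into an intermediate representation and then running discovery operations against it, can be sketched as follows. This is a minimal illustration, not a real text analysis system: the stop-word list, the two raw documents, and the function names are assumptions, and the "discovery" step is only a document-frequency count.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Hypothetical raw, unstructured documents.
RAW_DOCS = {
    "memo1": "The merger of the two banks is pending.",
    "news1": "Banks report growth in lending, and the merger talks continue.",
}

def preprocess(raw_text):
    """Turn raw text into a structured bag of lowercase word features."""
    tokens = re.findall(r"[a-z0-9]+", raw_text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

# Intermediate representation: one feature-count mapping per document.
intermediate = {doc_id: preprocess(text) for doc_id, text in RAW_DOCS.items()}

def document_frequency(representation):
    """A simple discovery operation: in how many documents does each feature occur?"""
    df = Counter()
    for features in representation.values():
        df.update(features.keys())
    return df
```

Note that `document_frequency` never touches the raw text; it operates only on the intermediate representation, which is the separation the paragraphs above describe.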


Defining Text Analysis


Text analysis can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools. In a manner analogous to data mining, text analysis seeks to extract useful information from data sources through the identification and exploration of interesting patterns.


In the case of text analysis, however, the data sources are document collections, and interesting patterns are found not among formalized database records but in the unstructured textual data in the documents in these collections.


Document Collection

The document collection can be any grouping of text-based documents. Practically speaking, however, most text analysis solutions are aimed at discovering patterns across very large document collections. The number of documents in such collections can range from the many thousands to the tens of millions.


Document collections can be either static, in which case the initial complement of documents remains unchanged, or dynamic, which is a term applied to document collections characterized by their inclusion of new or updated documents over time.


Extremely large document collections, as well as document collections with very high rates of document change, can pose performance optimization challenges for various components of a text analysis system.




A document can be very informally defined as a unit of discrete textual data within a collection that usually correlates with some real-world document such as a business report, legal memorandum, email, research paper, manuscript, article, press release, or news story.


Within the context of a particular document collection, it is usually possible to represent a class of similar documents with a prototypical document.


But a document can (and generally does) exist in any number or type of collections—from the very formally organized to the very ad hoc. A document can also be a member of different document collections, or different subsets of the same document collection, and can exist in these different collections at the same time.


1. The document, as a whole, is seen as a structured object.


2. Documents with extensive and consistent format elements in which field-type metadata can be inferred (such as some email, HTML web pages, PDF files, and word-processing files with heavy document templating or style-sheet constraints) are described as semistructured documents.


3. Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure (like most scientific research papers, business reports, legal memoranda, and news stories) are referred to as free-format or weakly structured documents.


Some text documents, like those generated from a WYSIWYG HTML editor, actually possess from their inception more overt types of embedded metadata in the form of formalized markup tags.


However, even a rather innocuous document demonstrates a rich amount of semantic and syntactical structure, although this structure is implicit and hidden in its textual content.


In addition, typographical elements such as punctuation marks, capitalization, numerics, and special characters, particularly when coupled with layout artifacts such as white spacing, carriage returns, underlining, asterisks, tables, columns, and so on, can often serve as a kind of “soft markup” language, providing clues to help identify important document subcomponents such as paragraphs, titles, publication dates, author names, table records, headers, and footnotes. Word sequence may also be a structurally meaningful dimension to a document.


Document Features


An essential task for most text analysis systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole.


Such a set of features is referred to as the representational model of a document. The number of features required to represent a document collection tends to become very large, affecting every aspect of a text analysis system’s approach, design, and performance.


The high dimensionality of potentially representative features in document collections is a driving factor in the development of text analysis preprocessing operations aimed at creating more streamlined representational models.


This high dimensionality also indirectly contributes to other conditions that separate text analysis systems from data mining systems such as greater levels of pattern overabundance and more acute requirements for post-query refinement techniques.


The feature sparsity of a document collection reflects the fact that some features often appear in only a few documents, which means that the support of many patterns is quite low; furthermore, only a small percentage of all possible features for a document collection as a whole appears in any single document.
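
Feature sparsity and low support are easy to see on a toy collection. The sketch below is illustrative only: the four documents, their features, and the function names `sparsity` and `support` are hypothetical. It measures what share of the document-by-feature matrix is empty and how many documents contain a given feature.

```python
# Hypothetical collection: each document is reduced to its set of features.
DOCS = {
    "d1": {"bank", "merger", "loan"},
    "d2": {"bank", "election"},
    "d3": {"vaccine", "trial"},
    "d4": {"bank", "trial", "court"},
}

def sparsity(docs):
    """Share of cells in the document-by-feature matrix that are empty."""
    vocabulary = set().union(*docs.values())
    filled = sum(len(features) for features in docs.values())
    return 1 - filled / (len(docs) * len(vocabulary))

def support(docs, feature):
    """Fraction of documents in the collection that contain the feature."""
    return sum(feature in features for features in docs.values()) / len(docs)
```

Even in this tiny collection roughly two thirds of the matrix is empty and most features occur in a single document; with thousands of documents and features the effect is far stronger.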


While evaluating an optimal set of features for the representational model for a document collection, the tradeoff is between the following two conflicting objectives:


  • To achieve the correct calibration of the volume and semantic level of features to portray the meaning of a document accurately, which tends toward evaluating a relatively larger set of features.
  • To identify features in a way that is most computationally efficient and practical for pattern discovery, which tends toward evaluating a smaller set of features.


Commonly used document features are described below.


1. Characters: A character-level representation can include the full set of all characters for a document or some filtered subset. This feature space is the most complete of any representation of a real-world text document.


The individual component-level letters, numerals, special characters, and spaces are the building blocks of higher-level semantic features such as words, terms, and concepts.


Character-based representations that include some level of positional information (e.g., bigrams or trigrams) are more useful and common. Generally, character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.


2. Words: Word-level features exist in the native feature space of a document. A word-level representation of a document includes a feature for each word within that document, that is, the “full text,” where a document is represented by a complete and unabridged set of its word-level features.


However, most word-level document representations exhibit at least some minimal optimization and therefore consist of subsets of representative features devoid of items such as stop words, symbolic characters, and meaningless numerics.


3. Terms: Terms are single words and multiword phrases selected directly from the corpus of a native document by means of term-extraction methodologies. Term-level features, in the sense of this definition, can only be made up of specific words and expressions found within the native document for which they are meant to be generally representative.


Hence, a term-based representation of a document is necessarily composed of a subset of the terms in that document.


Several term-extraction methodologies can convert the raw text of a native document into a series of normalized terms, that is, sequences of one or more tokenized and lemmatized word forms associated with part-of-speech tags. Sometimes an external lexicon is also used to provide a controlled vocabulary for term normalization.


Term-extraction methodologies employ various approaches for generating and filtering an abbreviated list of most meaningful candidate terms from among a set of normalized terms for the representation of a document. This culling process results in a smaller but relatively more semantically rich document representation than that found in word-level document representations.


4. Concepts: Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid categorization methodologies.


Concept-level features can be manually generated for documents but are now more commonly extracted from documents using complex preprocessing routines that identify single words, multiword expressions, whole clauses, or even larger syntactical units that are then related to specific concept identifiers.


Many categorization methodologies involve a degree of cross-referencing against an external knowledge source; for some statistical methods, this source might simply be an annotated collection of training documents.


For manual and rule-based categorization methods, the cross-referencing and validation of prospective concept-level features typically involve interaction with a “gold standard” such as a preexisting domain ontology, lexicon, or formal concept hierarchy—or even just the mind of a human domain expert.


Unlike word- and term-level features, concept-level features can consist of words not specifically found in the native document.
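
The two lowest feature levels in the list above are simple enough to sketch directly. The code below is a minimal illustration with a hypothetical stop-word list and hypothetical function names; it produces positional character n-grams (bigrams, trigrams) and a lightly optimized word-level representation. Term- and concept-level features need linguistic resources or trained models and are not attempted here.

```python
import re

# Tiny illustrative stop-word list; an assumption, not a standard resource.
STOP_WORDS = {"a", "in", "is", "of", "the"}

def char_ngrams(text, n):
    """Character-level features that keep local position: every run of n adjacent characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_features(text):
    """Word-level features with stop words, symbols, and numerics filtered out."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]
```

For example, `char_ngrams("text", 3)` yields `["tex", "ext"]`, and `word_features("The price of oil is 42 dollars")` keeps only `price`, `oil`, and `dollars`.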


Term- and concept-based representations exhibit roughly the same efficiency but are generally much more efficient than character- or word-based document models. Terms and concepts reflect the features with the most condensed and expressive levels of semantic value, and there are many advantages to their use in representing documents for text analysis purposes.


Term-level representations can sometimes be more easily and automatically generated from the original source text (through various term-extraction techniques) than concept-level representations, which as a practical matter have often entailed some level of human intervention.


Concept-based representations can be processed to support very sophisticated concept hierarchies, and arguably provide the best representations for leveraging the domain knowledge afforded by ontologies and knowledge bases.


They are much better than any other feature set representation at handling synonymy and polysemy and are clearly best at relating a given feature to its various hyponyms and hypernyms.


Possible disadvantages of using concept-level features to represent documents include the relative complexity of applying the heuristics required, during preprocessing operations, to extract and validate concept-type features, and the domain dependence of many concepts.


Domain Knowledge


A domain is defined as a specialized area of interest with dedicated ontologies, lexicons, and taxonomies of information. Text analysis can leverage information from formal external knowledge sources for such domains to greatly enhance elements of its preprocessing, knowledge discovery, and presentation layer operations.


Domain knowledge can be used in text analysis preprocessing operations to enhance concept extraction and validation activities; it can also play an important role in the development of more meaningful, consistent, and normalized concept hierarchies.


Advanced text analysis systems can create fuller representations of document collections by relating features by way of lexicons and ontologies in preprocessing operations and support enhanced query and refinement functionalities. Domain knowledge can be used to inform many different elements of a text analysis system:

  • Domain knowledge is an important adjunct to classification and concept-extraction methodologies in preprocessing operations.
  • Domain knowledge can also be leveraged to enhance core mining algorithms and browsing operations.
  • Domain-oriented information serves as one of the main bases for search refinement techniques.
  • Domain knowledge may be used to construct meaningful constraints in knowledge discovery operations.
  • Domain knowledge may also be used to formulate constraints that allow users greater flexibility when browsing large result sets.


Search for Patterns and Trends


The problem of pattern overabundance can exist in all knowledge discovery activities. It is simply aggravated when interacting with large collections of text documents, and, therefore, text analysis operations must necessarily be conceived to provide not only relevant but also manageable result sets to a user.


Although text analysis preprocessing operations play the critical role of transforming the unstructured content of a raw document collection into a more tractable concept-level data representation, the core functionality of a text analysis system resides in the analysis of concept co-occurrence patterns across documents in a collection.


Indeed, text analysis systems rely on algorithmic and heuristic approaches to consider distributions, frequent sets, and various associations of concepts at an inter-document level in an effort to enable a user to discover the nature and relationships of concepts as reflected in the collection as a whole.


Text analysis methods—often based on large-scale, brute-force search directed at large, high-dimensionality feature sets—generally produce very large numbers of patterns.


This results in an overabundance problem with respect to identified patterns that is usually much more severe than that encountered in data analysis applications aimed at structured data sources.


A main operational task for text analysis systems is to enable a user to limit pattern overabundance by providing refinement capabilities that key on various specifiable measures of “interestingness” for search results. Such refinement capabilities prevent system users from getting overwhelmed by too many uninteresting results.
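
A minimal sketch of this idea, assuming that "interestingness" is measured by nothing more than support (the share of documents in which a pair of concepts co-occurs). The documents, concepts, and function names are hypothetical, and real systems offer richer measures than this single threshold.

```python
from collections import Counter
from itertools import combinations

# Hypothetical concept-level representation of four documents.
DOC_CONCEPTS = {
    "d1": {"merger", "bank", "regulator"},
    "d2": {"merger", "bank"},
    "d3": {"bank", "loan"},
    "d4": {"merger", "bank", "loan"},
}

def cooccurrence_support(doc_concepts):
    """Support of every concept pair: the share of documents containing both concepts."""
    counts = Counter()
    for concepts in doc_concepts.values():
        counts.update(combinations(sorted(concepts), 2))
    n = len(doc_concepts)
    return {pair: count / n for pair, count in counts.items()}

def refine(patterns, min_support):
    """Refinement constraint: keep only pairs at or above the threshold, strongest first."""
    kept = {pair: s for pair, s in patterns.items() if s >= min_support}
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))
```

Four small documents already produce five candidate pairs; `refine(cooccurrence_support(DOC_CONCEPTS), 0.5)` cuts these down to the two pairs that appear in at least half of the collection.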


Results Presentation


Several types of functionality are commonly supported within the front ends of text analysis systems:


Browsing: Most contemporary text analysis systems support browsing that is both dynamic and content-based; the browsing is guided by the actual textual content of a particular document collection and not by anticipated or rigorously prespecified structures.


User browsing is usually facilitated by the graphical presentation of concept patterns in the form of a hierarchy to aid interactivity by organizing concepts for investigation.


Navigation: Text analysis systems must enable a user to move across these concepts in such a way as to always be able to choose either a “big picture” view of the collection in toto or to drill down on specific concept relationships.


Visualization: Text analysis systems use visualization tools to facilitate navigation and exploration of concept patterns; these use various graphical approaches to express complex data relationships.


While basic visualization tools generate static maps or graphs that are essentially rigid snapshots of patterns or carefully generated reports displayed on the screen or printed by an attached printer, state-of-the-art text analysis systems increasingly rely on highly interactive graphic representations of search results that permit a user to drag, pull, click, or otherwise directly interact with the graphical representation of concept patterns.


Query: Query languages have been developed to support the efficient parameterization and execution of specific types of pattern discovery queries; these are required because the presentation layer of a text analysis system really serves as the front end for the execution of the system’s core knowledge discovery algorithms.


Instead of limiting a user to running only a certain number of fixed, preprogrammed search queries, text analysis systems are increasingly designed to expose much of their search functionality to the user by opening up direct access to their query languages by means of query language interfaces or command-line query interpreters.


Clustering: Text analysis systems enable clustering of concepts in ways that make the most cognitive sense for a particular application or task.


Refinement constraints: Some text analysis systems offer users the ability to manipulate, create, or concatenate refinement constraints to assist in producing more manageable and useful result sets for browsing.


Sentiment Analysis


Social media systems on the web have provided excellent platforms to facilitate and enable audience participation, engagement, and community, which has resulted in our new participatory culture.


From reviews and blogs to YouTube, Facebook, and Twitter, people have embraced these platforms enthusiastically because they enable their users to freely and conveniently voice their opinions and communicate their views on any subject across geographic and spatial boundaries.


They also allow people to easily connect with others and to share their information. This participatory web and communications revolution has transformed our everyday lives and society as a whole. It has also popularized two major research areas, namely, social network analysis and sentiment analysis.


Although social network analysis is not a new research area, as it started in the 1940s and the 1950s when management science researchers began to study social actors (people in organizations) and their interactions and relationships, social media has certainly fueled its explosive growth in the past 15 years.

Sentiment analysis, which has been a very active research area since the year 2002, essentially grew out of social media on the web.


Apart from the availability of a large volume of opinion data in social media, opinions and sentiments also have a very wide range of applications simply because opinions are central to almost all human activities. Whenever we need to make a decision, we often seek out others’ opinions.


This is true not only for individuals but also for organizations. It is thus no surprise that the industry and applications surrounding sentiment analysis have flourished since around 2006.


Because a key function of social media is for people to express their views and opinions, sentiment analysis is right at the center of research and application of social media itself. It is now well recognized that to extract and exploit information in social media, sentiment analysis is a necessary technology.


One can even take a sentiment-centric view of social network analysis and, in turn, social media content analysis because the most important information that one wants to extract from the social network or social media content is what people talk about and what their opinions are. These are exactly the core tasks of sentiment analysis.


Sentiment Analysis and Natural Language Processing (NLP)


Sentiment analysis, also called opinion mining, is the field of study that analyzes people’s opinions, sentiments, appraisals, attitudes, and emotions toward entities and their attributes as expressed in written text. The entities can be products, services, organizations, individuals, events, issues, or topics.


The field represents a large problem space. The term opinion is taken to mean the whole concept of sentiment, evaluation, appraisal, or attitude and associated information, such as the opinion target and the person who holds the opinion, and the term sentiment is taken to mean the underlying positive or negative feeling implied by opinion.


Sentiment analysis or opinion mining aims to identify positive and negative opinions or sentiments expressed or implied in text and also the targets of these opinions or sentiments.


Sentiment analysis mainly focuses on opinions that express or imply positive, negative, or neutral sentiments, also called positive, negative, or neutral opinions, respectively, in everyday language.


This type of opinion is similar to the concept of attitude in social psychology. Apart from sentiment and opinion, there are also the concepts of affect, emotion, and mood, which are psychological states of mind.


Sentences expressing opinions or sentiments, being inherently subjective, are usually subjective sentences as opposed to objective sentences that state facts. However, objective sentences can imply positive or negative sentiments of their authors too, because they may describe desirable or undesirable facts.


Sentiment analysis is a semantic analysis problem, but it is highly focused and confined because a sentiment analysis system does not need to fully “understand” each sentence or document; it only needs to comprehend some aspects of it, for example, positive and negative opinions and their targets.


Owing to some special characteristics of sentiment analysis, it allows much deeper language analyses to be performed to gain better insights into NLP than in the general setting because the complexity of the general setting of NLP is simply overwhelming.


Although general natural language understanding is still far from us, with the concerted effort of researchers from different NLP areas, we may be able to solve the sentiment analysis problem, which, in turn, can give us critical insight into how to deal with general NLP.


The experience in the past 15 years seems to indicate that rather than being a subarea of NLP, sentiment analysis is actually more like a mini version of the full NLP or a special case of the full-fledged NLP.


The reason for this is that sentiment analysis touches every core area of NLP, such as lexical semantics, co-reference resolution, word sense disambiguation, discourse analysis, information extraction, and semantic analysis. Sentiment analysis is mainly carried out at three levels:


1. Document level: Assuming that each document expresses opinions on a single entity (e.g., a single product or service), document-level sentiment classification indicates whether a whole opinion document expresses a positive or negative sentiment. For instance, given a product review, the system determines whether the review expresses an overall positive or negative opinion about the product.


2. Sentence level: Sentence-level sentiment classification indicates whether each sentence expresses a positive, negative, or neutral opinion. This level of analysis is closely related to subjectivity classification, which distinguishes sentences that express factual information (called objective sentences) from sentences that express subjective views and opinions (called subjective sentences).


3. Aspect level: If a sentence has multiple opinions, it does not make much sense to classify this sentence as positive or negative, because it may be positive about one entity but negative about another. To obtain this level of fine-grained results, we need to go to the aspect level.


Instead of looking at language units (documents, paragraphs, sentences, clauses, or phrases), aspect-level analysis directly looks at the opinion and its target (called opinion target).


Thus, the goal of this level of analysis is to discover sentiments on entities and/or their aspects. On the basis of this level of analysis, a summary of opinions about entities and their aspects can be produced.
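
An aspect-based opinion summary can be sketched as a simple aggregation, assuming the hard part (extracting the opinions from text) has already been done. The `Opinion` record, its fields, and the `summarize` function are hypothetical names introduced for illustration; they are not taken from any particular sentiment analysis library.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Opinion:
    """One extracted opinion: its target (entity and aspect) and its polarity."""
    entity: str
    aspect: str
    polarity: int  # +1 positive, -1 negative, 0 neutral

def summarize(opinions):
    """Count positive, negative, and neutral opinions per (entity, aspect) target."""
    summary = defaultdict(lambda: {"positive": 0, "negative": 0, "neutral": 0})
    for op in opinions:
        if op.polarity > 0:
            label = "positive"
        elif op.polarity < 0:
            label = "negative"
        else:
            label = "neutral"
        summary[(op.entity, op.aspect)][label] += 1
    return dict(summary)

# Hypothetical opinions, as if already extracted from a set of phone reviews.
EXTRACTED = [
    Opinion("phone", "battery", +1),
    Opinion("phone", "battery", -1),
    Opinion("phone", "screen", +1),
    Opinion("phone", "battery", +1),
]
```

Calling `summarize(EXTRACTED)` reports two positive and one negative opinion on the phone's battery and one positive opinion on its screen, a distinction that a single document-level label would hide.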


There are two different types of opinions:


A regular opinion expresses a sentiment about a particular entity or an aspect of the entity; for example, “Orange tastes very good” expresses a positive sentiment or opinion on the aspect taste of orange. This is the most common type of opinion.


A comparative opinion compares multiple entities based on some of their shared aspects, for example, “Mango tastes better than orange” compares mango and orange based on their tastes (an aspect) and expresses a preference for mango.


Sentiment analysis involves addressing the problems of opinion searching and opinion summarization at appropriate levels.

Sentiment words, also called opinion words, are words in a language that indicate desirable or undesirable states. For example, good, great, and beautiful are positive sentiment words and bad, awful, and dreadful are negative sentiment words.


Sentiment words and phrases are instrumental to sentiment analysis. A list of such words and phrases is called a sentiment lexicon. Sentiment analysis is usually undertaken in the context of a predefined lexicon.
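
A lexicon-based classifier can be sketched in a few lines. The lexicon below contains only the six example words mentioned above, and the scoring rule (sum the word polarities and take the sign) is a deliberately naive assumption: it ignores negation, intensifiers, sarcasm, and context, all of which practical systems have to handle.

```python
import re

# Toy sentiment lexicon built from the example words above; real lexicons hold thousands of entries.
SENTIMENT_LEXICON = {
    "good": 1, "great": 1, "beautiful": 1,
    "bad": -1, "awful": -1, "dreadful": -1,
}

def sentence_polarity(sentence):
    """Label a sentence by the sign of the summed polarities of its sentiment words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(SENTIMENT_LEXICON.get(word, 0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this lexicon, "Orange tastes very good" is labeled positive, while a sentence with no lexicon words, such as "The store opens at nine", falls back to neutral.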




Individuals, organizations, and government agencies are increasingly using the content in social media for decision making. If an individual wants to buy a consumer product, he or she is no longer limited to asking his or her friends and family for opinions because there are many user reviews and discussions in public forums on the web about the product.


For an organization, it may no longer be necessary to conduct surveys, opinion polls, or focus groups to gather public or consumer opinions about the organization’s products and services because an abundance of such information is publicly available.


Governments can also easily obtain public opinions about their policies and gauge the mood in other nations simply by monitoring their social media.


Sentiment analysis applications have spread to almost every possible domain, from consumer products, healthcare, tourism, hospitality, and financial services to social events and political elections.


Hundreds of companies now work in this space, from start-ups to large established corporations, and many have built or are building their own in-house capabilities, including Google, Microsoft, Hewlett-Packard, Amazon, eBay, SAS, Oracle, Adobe, Bloomberg, and SAP.


A popular application of sentiment analysis is stock market prediction. One such system identifies opinions from message board posts by classifying each post into one of three sentiment classes: bullish (optimistic), bearish (pessimistic), or neutral (neither bullish nor bearish). The resulting sentiments across all stocks are then aggregated and used to predict the stock index.
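The aggregation step can be sketched as follows. The net-ratio formula, the ticker names, and the equal weighting across stocks are assumptions chosen for illustration; the text does not specify how the original system combined the classes.

```python
from collections import Counter


def bullishness(labels):
    """Net sentiment in [-1, 1]: (bullish - bearish) / (bullish + bearish).

    Neutral posts are ignored; returns 0.0 when there is no signal.
    """
    counts = Counter(labels)
    bull, bear = counts["bullish"], counts["bearish"]
    if bull + bear == 0:
        return 0.0
    return (bull - bear) / (bull + bear)


# Hypothetical classified posts, keyed by made-up ticker symbols.
posts_by_stock = {
    "AAA": ["bullish", "bullish", "neutral", "bearish"],
    "BBB": ["bearish", "neutral"],
}

per_stock = {ticker: bullishness(labels) for ticker, labels in posts_by_stock.items()}
# Equal-weighted average across stocks as a simple index-level signal.
index_signal = sum(per_stock.values()) / len(per_stock)
```

A positive index_signal would be read as net optimism across the tracked stocks and a negative one as net pessimism; whether that actually predicts the index is an empirical question.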


Instead of using bullish and bearish sentiments, an alternative approach is to identify positive and negative public moods on Twitter and use them to predict the movement of stock market indices such as the Dow Jones, S&P 500, and NASDAQ.


The analysis shows that when emotions on Twitter fly high, that is, when people express a lot of hope, fear, or worry, the Dow goes down the next day. When people have less hope, fear, or worry, the Dow goes up.


Mobile Applications


A mobile environment is different from traditional distributed environments due to its unique characteristics such as the mobility of users or computers, the limitation of computing capacity of mobile devices, and the frequent and unpredictable disconnection of wireless networks.


Therefore, developing mobile systems differs from developing conventional distributed systems.


In other words, when designing a mobile system, we have to overcome challenges due to the physical mobility of the clients, the portability features of mobile devices and the fact that the communication is wireless. Thus, it is important that these issues are examined carefully when considering the system requirements, in terms of both functional and nonfunctional requirements.


The process of identifying the requirements of a mobile client-server system is also very different from that for a nonmobile one, for the same reasons: the mobility of users and computers, the limited computing capacity of mobile devices, and the frequent and unpredictable disconnections of wireless networks.


Mobile Computing Applications


A wireless mobile application is defined as a software application, a wireless service, or a mobile service that can be either pushed to users’ handheld wireless devices or downloaded and installed, over the air, on these devices.

Wireless applications themselves can be classified into three streams:


1. Browser-based: Applications developed using a markup language. This is similar to the current desktop browser model where the device is equipped with a browser. The wireless application protocol (WAP) follows this approach.


2. Native applications: Compiled applications where the device has a runtime environment to execute applications. Highly interactive wireless applications are only possible with this model.


Interactive applications, such as mobile computer games, are a good example. Such applications can be developed using the fast-growing Java 2 Micro Edition (J2ME) platform, and they are known as MIDlets.


3. Hybrid applications: Applications that aim at incorporating the best aspects of both streams above: the browser is used to allow the user to enter URLs to download native applications from remote servers, and the runtime environment is used to let these applications run on the device.


BlackBerry OS

Research In Motion (RIM), renamed BlackBerry Limited in 2013, is a Canadian designer, manufacturer, and marketer of wireless solutions for the worldwide mobile communications market. Products include the BlackBerry wireless email solution, wireless handhelds, and wireless modems.


RIM is the driving force behind BlackBerry smartphones and the BlackBerry solution. RIM provides a proprietary multitasking OS for the BlackBerry, which makes heavy use of specialized input devices, particularly the scroll wheel or more recently the trackball.


BlackBerry offers a strong combination of mobile phone, server software, push email, and security from a single vendor. It integrates well with other platforms, works with several carriers, and can be deployed globally for a sales force that is on the move.


It is easy to manage, has a longer than usual battery life, and has a small form-factor with an easy-to-use keyboard. BlackBerry is good for access to some of the simpler applications, such as contact list, time management, and field force applications.


Google Android

Google’s Android Mobile platform is the latest mobile platform on the block. This open-source development platform is built on the Linux kernel, and it includes an operating system (OS), a middleware stack, and a number of mobile applications.


Enterprises will benefit from Android because the availability of open-source code for the entire software stack will allow the existing army of Linux developers to create special-purpose applications that will run on a variety of mobile devices.


Android is the open-source mobile OS launched by Google. It is intuitive, user-friendly, and graphically similar to the iPhone and BlackBerry. Because it is open source, Android applications may be cheaper, which could help the platform spread. The kernel is based on Linux 2.6 and supports 2G, 3G, Wi-Fi, IPv4, and IPv6.


At the multimedia level, Android works with OpenGL and several image, audio, and video formats. Persistence is provided by SQLite. For security, Android uses SSL and encryption algorithms.


If Android makes it into phones designed specifically for the enterprise, those products will have to include technology from the likes of Sybase, Intellisync or another such company to enable security features like remote data wipe functionality and forced password changes.


Apple iOS

iPhone OS (renamed iOS in 2010) is Apple’s proprietary mobile OS. It is derived from the Mac OS X operating system used in Macintosh machines, and this optimized version runs on the iPhone and iPod Touch.

The simplicity and robustness of navigation, both in the menus and within applications, are two of the main strengths of the OS. iPhone OS is also equipped with good-quality multimedia software, including games, music, and video players. It also has a good set of tools, including image editing and a word processor.


Windows Phone

Windows Mobile, a variant of Windows CE (officially also known as Windows Embedded Compact), was initially developed for Pocket PCs but by 2002 had reached the HTC2 mobile phones. This OS was engineered to offer data and multimedia services.


By 2006, Windows Mobile had become available to the developer community. Many new applications started using the system, turning Windows Mobile into one of the most widely used systems.


Windows Mobile comes in two flavors. The smartphone edition is good for wireless email, calendaring, and voice notes. The Pocket PC edition adds mobile versions of Word, Excel, PowerPoint, and Outlook. Palm’s Treo 700w, with the full functionality of the Pocket PC edition, is a better choice for sales force professionals.


The main draw of the Windows Mobile operating system is its maker, Microsoft. Windows Mobile also actively syncs with Exchange and SQL Server, which augurs well for use by the sales force.


Mobile sales force solutions for Windows Mobile are available from companies like SAP, Siebel, PeopleSoft, and Salesforce.com as well as other leading solution providers.


Windows Mobile permits Bluetooth connections through the Winsock interface. It also allows 802.11x, IPv4, IPv6, VoIP (Voice over IP), GSM, and CDMA (Code Division Multiple Access) connections.


Some of the main applications available are Pocket Outlook (an adapted version of the desktop Outlook), Word, and Excel. It also provides Messenger, a browser, and remote desktop.


The remote desktop is an easy way to access other mobile or fixed terminals. The ActiveSync application facilitates synchronization between mobile devices and desktops.


At the multimedia level, Windows Mobile plays music and video and runs 3D applications. Security is also addressed: Secure Socket Layer (SSL), Kerberos, and encryption algorithms are available.


Mobile Web Services


Web services are the cornerstone toward building a globally distributed information system, in which many individual applications will take part; building a powerful application whose capability is not limited to local resources will unavoidably require interacting with other partner applications through web services across the Internet.


The strength of web services (WS) comes from their use of XML and related technologies to connect business applications that run on different computers, in different locations, and on different languages and platforms. The counterpart of WS in the context of mobile business processes is Mobile Web Services (MWS).


The proposed MWS is to be the base of the communications between the Internet network and wireless devices such as mobile phones, PDAs, and so forth. The integration between wireless device applications with other applications would be a very important step toward global enterprise systems.


Similar to WS, MWS is also based on the industry-standard language XML and related technologies such as SOAP, WSDL, and UDDI.
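To make the XML basis concrete, the sketch below builds a minimal SOAP 1.1 request envelope with Python's standard library. The envelope namespace is the SOAP 1.1 one; the service namespace, the GetLocation operation, and the SubscriberId element are hypothetical names invented for this example and do not belong to any real mobile web service.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.com/location"             # hypothetical service namespace


def build_request(subscriber_id):
    """Return a SOAP envelope (as text) calling a hypothetical GetLocation operation."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetLocation")
    ET.SubElement(call, f"{{{SERVICE_NS}}}SubscriberId").text = subscriber_id
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("user-42")
```

In a real deployment the operation and message shapes would come from the service's WSDL description rather than being written by hand.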


Many constraints make the implementation of WS in a mobile environment very challenging. The challenge comes from the fact that mobile devices have less power and capacity, as follows:


  • Battery power limited to a few hours
  • Small memory capacity
  • Small processors that cannot run larger applications
  • Small screen size, especially on mobile phones, which requires developing specific websites of a suitable size
  • Small keypad that makes it harder to enter data
  • Small hard disk
  • Variable speed of data communication between the device and the network


The most popular MWS is a proxy-based system where the mobile device connects to the Internet through a proxy server. Most of the processing of the business logic of the mobile application will be performed on the proxy server that transfers the results to the mobile device that is mainly equipped with a user interface to display output on its screen.


A proxy server offers another important advantage in MWS. Instead of the client application on the mobile device connecting to many service providers, which would consume most of the device’s processing power and bandwidth, the proxy communicates with the service providers, does some processing, and sends only the final result back to the mobile device.


In the realistic case where the number of mobile devices reaches the tens of millions, the proxy server would run in the cloud and the service providers would be cloud service providers.
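The proxy pattern described above can be sketched in a few lines. The provider functions, the offer fields, and the "cheapest offer" business rule are all hypothetical; the point is only that the proxy fans out to several providers and returns one compact result to the device.

```python
# Hypothetical service providers: each returns a bulky offer record.
def provider_a(query):
    return {"provider": "A", "price": 220, "details": "x" * 500}


def provider_b(query):
    return {"provider": "B", "price": 180, "details": "y" * 500}


def provider_c(query):
    return {"provider": "C", "price": 205, "details": "z" * 500}


class MobileProxy:
    """Runs the business logic server-side and returns a small payload."""

    def __init__(self, providers):
        self.providers = providers

    def handle(self, query):
        # Fan out to every provider on behalf of the mobile client.
        offers = [provider(query) for provider in self.providers]
        # Assumed business rule: keep only the cheapest offer.
        best = min(offers, key=lambda offer: offer["price"])
        # Strip bulky fields so the device only receives what it displays.
        return {"provider": best["provider"], "price": best["price"]}


proxy = MobileProxy([provider_a, provider_b, provider_c])
result = proxy.handle("flight to Lisbon")
```

The device makes one request and receives one small dictionary, instead of opening three connections and parsing three large responses itself.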


Mobile web services use existing industry-standard XML-based web services architecture to expose mobile network services to the broadest audience of developers.


Developers will be able to access and integrate mobile network services such as messaging, location-based content delivery, syndication, personalization, identification, authentication, and billing services into their applications.


This will ultimately enable solutions that work seamlessly across stationary networks and mobile environments. Customers will be able to use mobile web services from multiple devices on both wired and wireless networks.


The aim of the mobile web services effort is twofold:


1. To create a new environment that enables the IT industry and the mobile industry to create products and services that meet customer needs in a way not currently possible within the existing web services practices.


With web services being widely deployed as the SOA of choice for internal processes in organizations, there is also an emerging demand for using web services enabling mobile working and e-business.


By integrating Web Services and mobile computing technologies, consistent business models can be enabled on a broad array of endpoints: not just on mobile devices operating over mobile networks but also on servers and computing infrastructure operating over the Internet.


To make this integration happen at a technical level, mechanisms are required to expose and leverage existing mobile network services.


Also, practices for integrating the various business needs of the mobile network world and their associated enablers, such as security, must be developed. The result is a framework, such as that of the Open Mobile Alliance, that demonstrates how the web service specifications can be used and combined with mobile computing technology and protocols to realize practical and interoperable solutions.


Successful mobile solutions that help architect customers’ service infrastructures need to address security, availability, and scalability concerns both at the functional level and at the end-to-end solution level, rather than just offering fixed-point products.


What is required is a standard specification and an architecture that tie together service discovery, invocation, authentication, and other necessary components—thereby adding context and value to web services.


In this way, operators and enterprises will be able to leverage the unique capabilities of each component of the end-to-end network and shift the emphasis of service delivery from devices to the human user.


Using a combination of wireless, broadband, and wireline devices, users can then access any service on demand, with a single identity and single set of service profiles, personalizing service delivery as dictated by the situation.


There are three important requirements to accomplish user (mobile-subscriber)-focused delivery of mobile services: federated identity, policy, and federated context. Integrating identity, policy, and context into the overall mobile services architecture enables service providers to differentiate the user from the device and deliver the right service to the right user on virtually any device:


a. Federated identity: In a mobile environment, software applications and processes do not see users as individuals (e.g., mobile subscribers) tied to a particular domain, but rather as entities that are free to traverse multiple service networks.


This requirement demands a complete federated network identity model to tie the various personas of an individual without compromising privacy or loss of ownership of the associated data.


The federated network identity model allows the implementation of seamless single sign-on for users interacting with applications. It also ensures that user identity, including transactional information and other personal information, is not tied to a particular device or service, but rather is free to move with the user between service providers. Furthermore, it guarantees that only appropriately authorized parties are able to access protected information.


b. Policy: User policy, including roles and access rights, is an important requirement for allowing users not only to have service access within their home network but also to move outside it and still receive the same access to services.


Knowing who the user is and what role they fulfill at the moment they are using a particular service is essential to providing the right service in the right instance. The combination of federated identity and policy enables service providers and users to strike a balance between access rights and user privacy.


c. Federated context: Understanding what the user is doing, what they are asking for, why they are requesting it, where they are, and what device they are using is an essential requirement.


The notion of federated context means accessing and acting upon a user’s current location, availability, presence, and role, for example, at home, at work, on holiday, and other situational attributes.


This requires the intelligent synthesis of information available from all parts of the end-to-end network and allows service providers and enterprises to deliver relevant and timely applications and services to end users in a personalized manner.


For example, information about the location and availability of a user’s device may reside on the wireless network, the user’s calendar may be on the enterprise intranet, and preferences may be stored in a portal.


2. To help create web services standards that will enable new business opportunities by delivering integrated services across stationary (fixed) and wireless networks. Mobile web services use existing industry-standard XML-based web services architecture to expose mobile network services to the broadest audience of developers.


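Returning to the three requirements listed under the first aim (federated identity, policy, and federated context), the sketch below shows how they might combine in a single delivery decision. The three lookup tables stand in for separate systems (wireless network, enterprise intranet, portal); every name, role, and rule here is a hypothetical placeholder rather than part of any mobile web services specification.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for three independent sources of context.
WIRELESS_NETWORK = {"alice": "customer-site"}  # device location
INTRANET_CALENDAR = {"alice": True}            # currently in a meeting?
PORTAL_PREFERENCES = {"alice": "sms"}          # preferred delivery channel

# Hypothetical policy: which services each role may access.
ROLE_POLICY = {
    "field-engineer": {"work-orders", "parts-catalog"},
    "guest": {"parts-catalog"},
}


@dataclass
class Context:
    user_id: str
    location: str
    in_meeting: bool
    preferred_channel: str


def federated_context(user_id):
    """Synthesize one context record from the three sources."""
    return Context(
        user_id=user_id,
        location=WIRELESS_NETWORK[user_id],
        in_meeting=INTRANET_CALENDAR[user_id],
        preferred_channel=PORTAL_PREFERENCES[user_id],
    )


def deliver(user_id, role, service):
    """Apply policy first, then adapt delivery to the user's context."""
    if service not in ROLE_POLICY.get(role, set()):
        return "denied"
    ctx = federated_context(user_id)
    channel = "silent-notification" if ctx.in_meeting else ctx.preferred_channel
    return f"{service} via {channel}"
```

The decision is keyed on the user and role rather than on a particular device, which is the shift of emphasis described above.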


Delivering appealing, low-cost mobile data services, including ones that are based on mobile Internet browsing and mobile commerce, is proving increasingly difficult to achieve.


The existing infrastructure and tools, as well as the interfaces between Internet/web applications and mobile network services, remain largely fragmented. They are characterized by tightly coupled, costly, and close alliances between value-added service providers and by a complex mixture of disparate and sometimes overlapping standards (WAP, MMS, Presence, Identity, etc.) and proprietary models (e.g., proprietary interfaces).


This hinders interoperability solutions for the mobile sector and at the same time drives up the cost of application development and ultimately the cost of services offered to mobile users.


Such problems have given rise to initiatives for standardizing mobile web services. The most important of these initiatives is the Open Mobile Alliance and the mobile web services frameworks that are examined below.


Mobile Field Cloud Services


Companies that can outfit their employees with devices like PDAs, laptops, multifunction smartphones, or pagers will begin to bridge the costly chasm between the field and the back office.


For example, transportation costs for remote employees can be significantly reduced, and productivity can be significantly improved by eliminating needless journeys back to the office to file reports, collect parts, or simply deliver purchase orders.


Wireless services are evolving toward the goal of delivering the right cloud service to whoever needs it, for example, employees, suppliers, partners, and customers, at the right place, at the right time, and on any device of their choice.


The combination of wireless handheld devices and cloud service delivery technologies poses the opportunity for an entirely new paradigm of information access that in the enterprise context can substantially reduce delays in the transaction and fulfillment process and lead to improved cash flow and profitability.


A field cloud services solution automates, standardizes, and streamlines manual processes in an enterprise and helps centralize disparate systems associated with customer service life-cycle management including customer contact, scheduling and dispatching, mobile workforce communications, resource optimization, work order management, time, labor, material tracking, billing, and payroll.


A field web services solution links seamlessly all elements of an enterprise’s field service operation—customers, service engineers, suppliers, and the office—to the enterprise’s stationary infrastructure, wireless communications, and mobile devices. Field web services provide real-time visibility and control of all calls and commitments, resources, and operations.


They effectively manage business activities such as call taking and escalation, scheduling and dispatching, customer entitlements and SLAs, work orders, service contracts, time sheets, labor and equipment tracking, invoicing, resource utilization, reporting, and analytics.


Of particular interest to field services are location-based services, notification services, and service disambiguation as these mechanisms enable developers to build more sophisticated cloud service applications by providing accessible interfaces to advanced features and intelligent mobile features:


1. Location-based services provide information specific to a location using the latest positioning technologies and are a key part of the mobile web services suite. Dispatchers can use GPS or network-based positioning information to determine the location of field workers and optimally assign tasks (push model) based on geographic proximity.


Location-based services and applications enable enterprises to improve operational efficiencies by locating, tracking, and communicating with their field workforce in real time.


For example, location-based services can be used to keep track of vehicles and employees, whether they are conducting service calls or delivering products. Trucks could be pulling in or out of a terminal, visiting a customer site, or picking up supplies from a manufacturing or distribution facility.


With location-based services, applications can receive real-time status alerts (for example, estimated time of approach, arrival, departure, and duration of stop) as well as current information on traffic, weather, and road conditions for both home-office and en route employees.


2. Notification services allow critical business to proceed uninterrupted when employees are away from their desks, by delivering notifications to their preferred mobile device. Employees can thus receive real-time notification when critical events occur, such as when incident reports are completed.


The combination of location-based and notification services provides added value by enabling such services as proximity-based notification and proximity-based actuation.


Proximity-based notification is a push or pull interaction model that includes targeted advertising, automatic airport check-in, and sightseeing information. Proximity-based actuation is a push-pull interaction model, whose most typical example is payment based on proximity, for example, toll watch.


3. Service instance disambiguation helps distinguish between many similar candidate service instances, which may be available inside close perimeters. For instance, there may be many on-device payment services in the proximity of a single point of sale.


Convenient and natural ways for identifying appropriate service instances are then required, for example, relying on closeness or pointing rather than identification by cumbersome unique names.
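The proximity-based mechanisms in the list above can be sketched with a great-circle distance calculation. The worker names and coordinates are made up for illustration, and the haversine formula with a mean Earth radius of 6371 km is one common choice for this kind of estimate, not something the text prescribes.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def nearest_worker(job, workers):
    """Dispatch (push model): pick the worker closest to the job site."""
    return min(workers, key=lambda name: haversine_km(workers[name], job))


def within_range(job, workers, radius_km):
    """Proximity-based notification: everyone inside the given radius."""
    return sorted(n for n, pos in workers.items() if haversine_km(pos, job) <= radius_km)


# Hypothetical field workers and their last reported positions.
workers = {
    "ana": (48.8566, 2.3522),    # Paris
    "ben": (51.5074, -0.1278),   # London
    "chen": (52.5200, 13.4050),  # Berlin
}
job_site = (50.8503, 4.3517)     # Brussels

assigned = nearest_worker(job_site, workers)
notified = within_range(job_site, workers, 350)
```

The same closest-candidate selection could serve as a simple form of service instance disambiguation, choosing among nearby service instances by distance instead of by unique name.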


Context-Aware Mobile Applications


A mobile application is context-aware if it uses context to provide relevant information to users or to enable services for them; relevancy depends on a user’s current task (and activity) and profile (and preferences).


Apart from knowing who the users are and where they are, we need to identify what they are doing, when they are doing it, and which object they focus on. The system can define user activity by taking into account various sensed parameters like location, time, and the object that they use.


In outdoor applications, and depending on the mobile devices that are used, satellite-supported technologies, like GPS, or network-supported cell information, like GSM, IMTS, and WLAN, are applied. Indoor applications use RFID, IrDA, and Bluetooth technologies in order to estimate the users’ position in space.


While time is another significant parameter of context that can play an important role in order to extract information on user activity, the objects that are used in mobile applications are the most crucial context sources.


In mobile applications, the user can use mobile devices, like mobile phones and PDAs, as well as objects that are enhanced with computing and communication abilities. Sensors attached to artifacts provide applications with information about what the user is utilizing.


In order to present the user with the requested information in the best possible form, the system has to know the physical properties of the artifact that will be used (e.g., artifact screen’s display characteristics).


The types of interaction interfaces that an artifact provides to the user need to be modeled (e.g., whether artifact can be handled by both speech and touch techniques), and the system must know how it is designed.


Thus, the system has to know the number and position of each artifact’s sensors in order to grade context information with a level of certainty. Based on information on the artifact’s physical properties and capabilities, the system can extract information on the services that it can provide to the user.


In context-aware mobile applications, artifacts are considered context providers. They allow users to access context in a high-level abstracted form, and they inform the application’s other artifacts so that context can be used according to the application’s needs.


Users are able to establish associations between the artifacts based on the context that they provide; keep in mind that the services enabled by artifacts are provided as context. Thus, users can indicate their preferences, needs, and desires to the system by determining the behavior of the application via the artifacts they create.


The set of sensors attached to an artifact measures various parameters such as location, time, temperature, proximity, and motion; the raw data given by its sensors determine the low-level context of the artifact. Aggregating such low-level context information from various homogeneous and non-homogeneous sensors results in high-level context information.
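A minimal sketch of that aggregation step, assuming a made-up set of sensor readings and thresholds. Readings from homogeneous sensors (two thermometers) are averaged, while non-homogeneous readings (motion, proximity, time) are combined by simple rules into a labeled activity.

```python
from statistics import mean


def high_level_context(readings):
    """Aggregate raw (low-level) sensor readings into a high-level context."""
    # Homogeneous sensors: average the temperature readings.
    temperature = mean(readings["temperature_c"])
    # Non-homogeneous sensors: combine motion and proximity by rule.
    moving = any(readings["motion"])
    near_user = min(readings["proximity_m"]) < 1.0  # assumed 1 m threshold
    if near_user and moving:
        activity = "in use"
    elif near_user:
        activity = "idle near user"
    else:
        activity = "unattended"
    period = "day" if 7 <= readings["hour"] < 19 else "night"  # assumed cut-offs
    return {"activity": activity, "period": period, "temperature_c": round(temperature, 1)}


# Hypothetical low-level context of one artifact.
low_level = {
    "temperature_c": [21.0, 22.0],
    "motion": [False, True],
    "proximity_m": [0.4, 2.0],
    "hour": 10,
}
context = high_level_context(low_level)
```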


Ontology-Based Context Model


The ontology behind the context model is divided into two layers: a common one that contains the description of the basic concepts of context-aware applications and their interrelations, representing the common language among artifacts, and a private one that represents an artifact’s own description as well as the new knowledge or experience acquired from its use.


The common ontology defines the basic concepts of a context-aware application; such an application consists of a number of artifacts and their associations.


The concept of artifact is described by its physical properties and its communication and computational capabilities; the fact that an artifact has a number of sensors and actuators attached is also defined in our ontology.


Through the sensors, an artifact can perceive a set of parameters based on which the state of the artifact is defined; an artifact may also need these parameters in order to sense its interactions with other artifacts as well as with the user.


The ontology also defines the interfaces via which artifacts may be accessed in order to enable the selection of the appropriate one. The common ontology represents an abstract form of the concepts represented, especially of the context parameters, as more detailed descriptions are stored into each artifact’s private ontology.


For instance, the private ontology of an artifact that represents a car contains a full description of the different components in a car as well as their types and their relations.
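The two-layer split can be sketched as plain data structures. This is only an illustration of the idea, not an ontology language such as OWL; all class names, fields, and the car's components are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CommonDescription:
    """Common layer: concepts every artifact describes in the same way."""
    name: str
    physical_properties: dict
    sensors: list
    actuators: list
    interfaces: list  # interaction interfaces the artifact offers


@dataclass
class Artifact:
    common: CommonDescription
    # Private layer: the artifact's own detailed description and acquired knowledge.
    private: dict = field(default_factory=dict)

    def learn(self, key, value):
        """Record new knowledge or experience in the private layer."""
        self.private[key] = value


car = Artifact(
    common=CommonDescription(
        name="car",
        physical_properties={"seats": 5},
        sensors=["gps", "speed"],
        actuators=["door-lock"],
        interfaces=["speech", "touch"],
    ),
    private={
        "components": {
            "engine": {"type": "diesel", "part_of": "car"},
            "wheel": {"type": "alloy", "part_of": "car"},
        }
    },
)
car.learn("usual_driver", "alice")
```

Other artifacts would only need to understand the common layer, while the component details stay inside the car's private layer.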


The basic goal of the proposed ontology-based context model is to support a context management process, based on a set of rules that determine how decisions are made and that are applied to the existing knowledge represented by this ontology.


The rules that can be applied during such a process fall into three categories: rules for an artifact’s state assessment, which define the artifact’s state based on its low- and high-level contexts; rules for local decisions, which exploit only an artifact’s own knowledge in order to decide the artifact’s reaction (such as requesting or providing a service); and rules for global decisions, which take into account various artifacts’ states and their possible reactions in order to preserve a global state defined by the user.
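The three rule categories can be illustrated with a toy lighting scenario. The artifacts, the 100-lux threshold, the priority field, and the "at most N lamps on" global state are all hypothetical choices made for this sketch.

```python
def assess_state(artifact):
    """State-assessment rule: derive the artifact's state from its context."""
    dark = artifact["lux"] < 100  # low-level context (assumed threshold)
    return "needed" if dark and artifact["user_present"] else "idle"


def local_decision(artifact):
    """Local rule: uses only this artifact's own knowledge."""
    return "provide_light" if assess_state(artifact) == "needed" else "none"


def global_decision(artifacts, max_active):
    """Global rule: preserve a user-defined state (at most max_active lamps on)."""
    wanting = [a for a in artifacts if local_decision(a) == "provide_light"]
    wanting.sort(key=lambda a: a["priority"])  # lower number = higher priority
    return [a["name"] for a in wanting[:max_active]]


lamps = [
    {"name": "desk lamp", "lux": 40, "user_present": True, "priority": 1},
    {"name": "floor lamp", "lux": 60, "user_present": True, "priority": 2},
    {"name": "hall lamp", "lux": 300, "user_present": True, "priority": 0},
]
active = global_decision(lamps, max_active=1)
```

Here two lamps would locally decide to switch on, but the global rule keeps only the higher-priority one so that the user's overall constraint holds.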


Context Support for User Interaction


The ontology-based context model that we propose empowers users to compose their own personal mobile applications. In order to compose their applications, they first have to select the artifacts that will participate and establish their associations.


They set their own preferences by associating artifacts, denoting the sources of context that artifacts can exploit, and defining the interpretation of this context through rules in order to enable various services.


As the context acquisition process is decoupled from the context management process, users are able to create their own mobile applications avoiding the problems emerging from the adaptation and customization of applications like disorientation and system failures.


The goal of context in computing environments is to improve interaction between users and applications. This can be achieved by exploiting context, which works like implicit commands and enables applications to react to users or surroundings without the users’ explicit commands.


Context can also be used to interpret explicit acts, making interaction much more efficient. Thus, context-aware computing completely redefines the basic notions of interface and interaction.


In this section, we present how our ontology-based context model enables the use of context in order to assist human-computer interaction in mobile applications and to achieve the selection of the appropriate interaction technique. Mobile systems have to provide multimodal interfaces so that users can select the most suitable technique based on their context.


The ontology-based context model that we presented in the previous section captures the various interfaces provided by the application’s artifacts in order to support and enable such selections. Similarly, the context can determine the most appropriate interface when a service is enabled.
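As an illustration of context-driven selection among multimodal interfaces, the sketch below ranks interaction techniques according to a few context flags and returns the first one the artifact actually offers. The flag names, technique names, and ranking rules are assumptions for this example and are not taken from the model itself.

```python
def select_interface(available, context):
    """Pick the most suitable interaction technique an artifact offers."""
    preferred = []
    if context.get("hands_busy"):
        preferred.append("speech")
    if context.get("noisy") or context.get("in_meeting"):
        # Speech is unsuitable here; fall back to silent techniques.
        preferred = [p for p in preferred if p != "speech"] + ["touch", "keypad"]
    # Default ranking when no context rule applies.
    preferred += ["touch", "speech", "keypad"]
    for technique in preferred:
        if technique in available:
            return technique
    return None


while_driving = select_interface(["speech", "touch"], {"hands_busy": True})
on_factory_floor = select_interface(["speech", "touch"], {"hands_busy": True, "noisy": True})
```

Because the available techniques are passed in as data, a new interface can be supported by describing it on the artifact, without changing the selection logic.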


Ubiquitous and mobile interfaces must be proactive in anticipating needs, while at the same time working as a spatial and contextual filter for information so that the user is not inundated with requests for attention.


Context can also assist designers to develop mobile applications and manage various interfaces and interaction techniques that would enable more satisfactory and faster closure of transactions.


Ease of use is an important requirement for mobile applications. By using context according to our approach, designers are shielded from the difficult task of context acquisition and merely have to define, through simple rules, how context from the various artifacts is exploited.


Our approach presents an infrastructure capable of handling, substituting, and combining complex interfaces when necessary. The rules applied to the application’s context and the reasoning process support the application’s adaptation.


The presented ontology-based context model is easily extended; new devices, new interfaces, and novel interaction techniques can be incorporated into a mobile application by simply adding their descriptions to the ontology.


Mobile Web 2.0

Mobile Web 2.0 results from the convergence of Web 2.0 services and the proliferation of web-enabled mobile devices. Web 2.0 facilitates interactive information sharing, interoperability, user-centered design, and collaboration among users.


This convergence is leading to a new communication paradigm, where mobile devices act not only as mere consumers of information but also as complex carriers for getting and providing information, and as platforms for novel services.


Mobile Web 2.0 represents both an opportunity for creating novel services and an extension of Web 2.0 applications to mobile devices.


Managing user-generated content, content personalization, and community and information sharing is much more challenging in a context characterized by devices with limited display, computational power, storage, and connectivity.


Furthermore, novel services require support for real-time determination and communication of the user position.


Mobile Web 2.0 comprises the following:

1. Sharing services that are characterized by the publication of contents to be shared with other users. Sharing services offer the users the capability to store, organize, search, and manage heterogeneous contents.


These contents may be rated, commented on, tagged, and shared with specified users or groups, who can usually view the stored resources chronologically, by category, rating, or tags, or via a search engine.


Multimedia sharing services are related to sharing of multimedia resources, such as photos or videos. These resources are typically generated by the users that exploit the sharing service to upload and publish their own contents. Popular examples of web portals offering a multimedia sharing service include Flickr, YouTube, and Mocospace.


2. Social services that refer to the management of social relationships among the users. These comprise several kinds of services. Community management services enable registered users to maintain a list of contact details of people they know.


Their key feature is the ability to create and update a personal profile that includes information such as the user's preferences and list of contacts.


These contacts may be used in different ways depending on the purpose of the service, which may range from the creation of a personal network of business and professional contacts (e.g., LinkedIn), to the management of social events (e.g., Meetup), and up to the connection with old and new friends (e.g., Facebook).


Blogging services enable a user to create and manage a blog, that is, a sort of personal online journal, possibly focused on a specific topic of interest. Blogs are usually created and managed by an individual or a limited group of people, namely author(s), through regular entries of heterogeneous content, including text, images, and links to other resources related to the main topic, such as other blogs, web pages, or multimedia contents.


A blog is more than a simple online journal, because the large majority of blogs allow external comments on the entries. The final effect is the creation of a discussion forum that engages readers and builds a social community around a person or a topic.


Other related services may also include blogrolls (i.e., links to other blogs that the author reads) to indicate social relationships to other bloggers. The most popular portals that allow users to manage their own blog include BlogSpot and WordPress.


Microblogging services are characterized by very short message exchanges among the users. Although this class of services originates from the blogging category, there are important differences between microblogging and traditional blogs; in particular, the size of the exchanged messages is significantly smaller.


The purpose of microblogging is to capture and communicate the instantaneous thoughts or feelings of the users, and the recipients of the communication may differ from those of traditional blogs because microblogging allows authors to interact with a group of selected friends. Twitter is an example of a portal providing microblogging services.


3. Location services that tailor information and contents on the basis of the user's location. Knowledge of the user's current location may be exploited in several ways to offer value-added services.


People discovery services enable users to locate their friends. Usually these services plot the position of the user and his or her friends on a map; the geographical location of each user is uploaded to the system by a positioning system installed on the user's mobile device.


Points of interest (POIs) discovery exploits geographical information to locate POIs, such as events, restaurants, museums, and any kind of attractions that may be useful or interesting to a user. These services offer the users a list of nearby POIs selected on the basis of their personal preferences and specifications.


POIs are collected by exploiting collaborative recommendations from other users that may add a new POI by uploading its geographical location, possibly determined through a GPS positioning system installed on the mobile device. Users may also upload short descriptions, comments, tags, and images or videos depicting the place.
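
A minimal sketch of the "nearby POIs" step is shown here, assuming each POI is stored with a latitude and longitude in degrees. It uses the haversine great-circle formula with a mean Earth radius of 6371 km; the class names `PoiFinder` and `Poi` are hypothetical, and preference-based filtering is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: find points of interest within a radius of the user. */
class PoiFinder {
    static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius, an approximation

    /** A point of interest with coordinates in decimal degrees. */
    static final class Poi {
        final String name;
        final double lat;
        final double lon;

        Poi(String name, double lat, double lon) {
            this.name = name;
            this.lat = lat;
            this.lon = lon;
        }
    }

    /** Great-circle distance in kilometers between two points (haversine formula). */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Returns the POIs that lie within radiusKm of the given user position. */
    static List<Poi> nearby(List<Poi> pois, double userLat, double userLon, double radiusKm) {
        List<Poi> result = new ArrayList<>();
        for (Poi poi : pois) {
            if (distanceKm(userLat, userLon, poi.lat, poi.lon) <= radiusKm) {
                result.add(poi);
            }
        }
        return result;
    }
}
```

A linear scan like this is fine for small POI lists; a real service would more likely rely on a spatial index.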


Mobile Analytics

The objectives of mobile analytics are twofold: prediction and description—prediction of unknown or future values of selected variables, such as interests or location of mobiles, and description in terms of human behavior patterns.


Description involves gaining “insights” into mobile behaviors, whereas prediction involves improving decision making for brands, marketers, and enterprises.


This can include the modeling of sales, profits, the effectiveness of marketing efforts, and the popularity of apps and a mobile site. The key is to understand what data is being aggregated and how not only to create and report metrics on mobile activity but, more importantly, to leverage that data by mining mobile devices to improve sales and revenue.


For years, retailers have been testing new marketing and media campaigns, new pricing promotions, and the merchandising of new products with freebies and half-price deals, as well as a combination of all of these offers, in order to improve sales and revenue.


With mobiles, it has become increasingly easy to generate the data and metrics for mining and precisely calibrating consumer behaviors.


Brands and companies leveraging mobile analytics can be more adept at identifying, co-opting, and shaping consumer behavior patterns to increase profits. Brands and mobile marketers that figure out how to induce new habits can enhance their bottom lines. Inducing a new habit loop can be used to introduce new products, services, and content via the offer of coupons or deals based on the location of mobiles.


Mobile Site Analytics

Mobile site analytics can help brands and companies solve the mystery of how mobile consumers are engaging and interacting with their site.


Without dedicated customer experience metrics, brands, marketers, and companies cannot tell whether the mobile site experience actually got better or how changes in the quality of that experience affected the site’s business performance.


Visitors tend to focus on three basic things when evaluating a mobile site: usefulness, ease-of-use, and how enjoyable it is. Metrics should measure these criteria with completion rates and survey questions.


Mobile Clustering Analysis

Clustering is the partition of a dataset into subsets of “similar” data, without using a priori knowledge about properties or existence of these subsets.


For example, a clustering analysis of mobile site visitors might discover that visitors on Android devices have a high propensity to make larger purchases of, say, Apple mobiles. Clusters can be mutually exclusive (disjoint) or overlapping. Clustering can lead to the autonomous discovery of typical customer profiles.


Cluster detection is the creation of models that find mobile behaviors that are similar to each other; these clumps of similarity can be discovered using self-organizing map (SOM) software to find previously unknown patterns in mobile datasets.


Unlike classification software, which analyzes for predicting mobile behaviors, clustering is different in that the software is “let loose” on the data; there are no targeted variables. Instead, it is about exploratory autonomous knowledge discovery.


The clustering software automatically organizes itself around the data with the objective of discovering some meaningful hidden structures and patterns of mobile behaviors.
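
To make the idea of untargeted grouping concrete, here is a small sketch that clusters one numeric feature (for example, session length in minutes). It uses k-means rather than a SOM, which is a different and simpler algorithm, chosen only to keep the example short. The class name `KMeans1D` and the choice to pass starting centroids explicitly are assumptions of this sketch.

```java
import java.util.Arrays;

/** Hypothetical sketch: k-means clustering of a single numeric feature. */
class KMeans1D {

    /**
     * Repeats two steps until assignments stop changing or maxIterations is reached:
     * assign each value to its nearest centroid, then move each centroid to the mean
     * of its assigned values. The centroids array is updated in place.
     *
     * @return the cluster index assigned to each value
     */
    static int[] cluster(double[] values, double[] centroids, int maxIterations) {
        int[] labels = new int[values.length];
        Arrays.fill(labels, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: nearest centroid wins.
            for (int i = 0; i < values.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(values[i] - centroids[c]) < Math.abs(values[i] - centroids[best])) {
                        best = c;
                    }
                }
                if (labels[i] != best) {
                    labels[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break;
            }
            // Update step: each centroid becomes the mean of its members.
            for (int c = 0; c < centroids.length; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < values.length; i++) {
                    if (labels[i] == c) {
                        sum += values[i];
                        count++;
                    }
                }
                if (count > 0) {
                    centroids[c] = sum / count;
                }
            }
        }
        return labels;
    }
}
```

No target variable is supplied anywhere; the groups emerge only from how close the values are to one another.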


This type of clustering can be done to discover keywords or mobile consumer clusters, and it is a useful first step for mining mobiles. It allows for the mapping of mobiles into distinct clusters of groups without any human bias.


Clustering is often performed as a prelude to the use of classification analysis using rule-generating or decision-tree software for modeling mobile device behaviors.


Market basket analysis using a SOM is useful in situations where the marketer or brand wants to know what items or mobile behaviors occur together or in a particular sequence or pattern.


The results are informative and actionable because they can lead to the organization of offers, coupons, discounts, and the offering of new products or services that prior to the analysis were unknown.
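
The question "what occurs together" can be illustrated with plain co-occurrence counting. This is not a SOM and not a full association-rule miner; it only counts how often each pair of items appears in the same basket. The class name `CoOccurrence` and the "a + b" key format are assumptions of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical sketch: count how often pairs of items occur in the same basket. */
class CoOccurrence {

    /** Returns pair counts keyed as "itemA + itemB", with items in alphabetical order. */
    static Map<String, Integer> pairCounts(List<List<String>> baskets) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> basket : baskets) {
            // Deduplicate and sort so each pair is counted once per basket under one key.
            List<String> items = new ArrayList<>(new TreeSet<>(basket));
            for (int i = 0; i < items.size(); i++) {
                for (int j = i + 1; j < items.size(); j++) {
                    String key = items.get(i) + " + " + items.get(j);
                    counts.merge(key, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

Pairs with high counts are candidates for bundled offers or coupons.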


Clustering analyses can answer questions such as why products or services sell together, or who is buying what combinations of products or services; they can also map what purchases are made and when. Unsupervised knowledge discovery occurs when one cluster is compared to another and new insight is revealed as to why they differ.


For example, SOM software can be used to discover clusters of locations, interests, models, operating systems, mobile site visitors, and app downloads, thus enabling a marketer or developer to discover unique features of different consumer mobile groupings.


Mobile Text Analysis

Another technology that can be used for data mining mobile devices is text mining, which refers to the process of deriving, extracting, and organizing high-quality information from unstructured content, such as texts, emails, documents, messages, and comments.


Text mining means extracting meaning from social media and customer comments about a brand or company in mobile sites and app reviews.


Text mining is a variation on clustering programs; text mining software is commonly used to sort through the unstructured content in the millions of emails, chats, web forums, texts, tweets, blogs, and so on, that accumulate daily and continuously in mobile sites and mobile servers.


Text analytics generally includes tasks such as the following:

  • Categorization of taxonomies
  • Clustering of concepts
  • Entity and information extraction
  • Sentiment analysis



Text analytics is important to the data mining of mobile devices because, increasingly, companies, networks, mobile sites, enterprises, and app servers are accumulating a large percentage of their data in unstructured formats, which is impossible to analyze and categorize manually.


Text mining derives an understanding from unstructured content by identifying clustering patterns and extracting categories or mobile trends, using machine learning algorithms to organize the key concepts in that content.


Text mining can be used to gain new insight into unstructured content from multiple data sources, such as a social network of a mobile site or an app platform. Text analytical tools can convert unstructured content and parse it over to a structured format that is amenable to data mining of mobile devices via classification software.


For example, all the emails or visits that a mobile site accumulates each day can be organized into several groupings, such as those mobiles seeking information, those seeking service assistance, or those complaining about specific products, services, or brands. Text mining can also be used to gauge sentiment regarding a brand or company.
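
The grouping of messages described above can be sketched with a very simple keyword matcher. Real text mining tools use far richer models; this example only counts keyword hits per category, and the class name `MessageCategorizer`, the category names, and the keyword lists are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: assign a message to the category with the most keyword hits. */
class MessageCategorizer {
    private final Map<String, Set<String>> keywordsByCategory = new LinkedHashMap<>();

    void addCategory(String category, String... keywords) {
        keywordsByCategory.put(category, new HashSet<>(Arrays.asList(keywords)));
    }

    /** Returns the best-matching category, or "other" when no keyword matches. */
    String categorize(String message) {
        // Lowercase and split on anything that is not a letter a-z.
        String[] tokens = message.toLowerCase(Locale.ROOT).split("[^a-z]+");
        String best = "other";
        int bestHits = 0;
        for (Map.Entry<String, Set<String>> entry : keywordsByCategory.entrySet()) {
            int hits = 0;
            for (String token : tokens) {
                if (entry.getValue().contains(token)) {
                    hits++;
                }
            }
            if (hits > bestHits) {
                bestHits = hits;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

The output is a structured label per message, which is the kind of field that classification software can then consume.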


Mobile marketers, developers, and brands need to consider how to incorporate time, demographics, location, interests, and other mobile available variables into their analytics models. Clustering, text, and classification software can be used to accomplish this for various marketing and brand goals.


Clustering software analyses can be used to discover and monetize mobile mobs. Text software analyses can discover the important brand value and sentiment information being bandied about in social networks.


Finally, classification software can pinpoint important attributes of profitable and loyal mobiles. Classification often involves the use of rule-generating decision-tree programs for the segmentation of mobile data behaviors.


Mobile Classification Analysis

There are two major objectives to classification via the data mining of mobile devices: description and prediction. Description means understanding a pattern of mobile behaviors in order to gain insight, for example, into which devices are the most profitable to a mobile site and app developer.


Prediction, however, is the creation of models to support, improve, and automate decision making, such as which highly profitable mobiles to target in an ad marketing campaign via a mobile site or app.


Both description and prediction can be accomplished using classification software, such as rule-generator and decision-tree programs. This type of data mining analysis is also known as supervised learning.


For example, a mobile analyst or marketer can take advantage of segmenting the key characteristics of mobile behaviors over time to discover hidden trends and patterns of purchasing behaviors.


Machine learning technology can discover the core features of mobiles by automatically learning to recognize complex patterns and make intelligent decisions based on mobile data, such as what, when, where, and why certain mobiles have a propensity to make a purchase or download an app, while others do not.


Classifying mobiles enables the positioning of the right product, service, or content to these moving devices via precise messages on a mobile site, or the targeting of an email, text, or the creation of key features to an app.


The marketer or developer will need to use classification software known as rule-generators or decision-tree programs. Decision trees are powerful classification and segmentation programs that use a tree-like graph of decisions and their possible consequences.


Decision-tree programs provide a descriptive means of calculating conditional probabilities. Trained with historical data samples, these classification programs can be used to predict future mobile behaviors.


A decision tree takes as input an objective, such as what type of app to offer, described by a set of properties from historical mobile behaviors or conditions, such as geolocation, operating system, and device model. These mobile features can then be used to make a prediction, such as what type of app to offer to a specific mobile.


The prediction can also be a continuous value, such as total expected coupon sales, or what price to offer for an app.


When a developer or marketer needs to make a decision based on several consumer factors, such as their location, device being used and total log-in time, a decision tree can help identify which factors to consider and how each factor has historically been associated with different outcomes of that decision—such as what products or services certain mobiles are likely to purchase based on observed behavioral patterns over time.


One common advantage of using decision trees is to eliminate a high number of noisy and ineffective consumer attributes for predicting, say, “high customer loyalty” or “likely to buy” models.


Developers and marketers can start with hundreds of mobile attributes from multiple data sources and, through the use of decision trees, they can eliminate many of them in order to focus simply on those with the highest information gain as they pertain to predicting high loyalty or potential revenue growth from mobile features and behaviors.
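
Information gain, the criterion mentioned above, can be computed directly from counts. The sketch below handles the simplest case: a yes/no outcome (such as "purchased") and a yes/no attribute (such as "uses the app"). The class name `InformationGain` and the parameter names are assumptions; decision-tree programs generalize this to many attributes and values.

```java
/** Hypothetical sketch: entropy and information gain for a binary attribute and outcome. */
class InformationGain {

    /** Shannon entropy in bits of a two-class sample with the given counts. */
    static double entropy(int positives, int negatives) {
        int total = positives + negatives;
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int count : new int[] {positives, negatives}) {
            if (count > 0) {
                double p = (double) count / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    /**
     * Gain from splitting on a binary attribute: parent entropy minus the
     * size-weighted entropy of the two branches.
     */
    static double gain(int truePos, int trueNeg, int falsePos, int falseNeg) {
        int trueTotal = truePos + trueNeg;
        int falseTotal = falsePos + falseNeg;
        int total = trueTotal + falseTotal;
        if (total == 0) {
            return 0.0;
        }
        double parent = entropy(truePos + falsePos, trueNeg + falseNeg);
        double weighted = ((double) trueTotal / total) * entropy(truePos, trueNeg)
                + ((double) falseTotal / total) * entropy(falsePos, falseNeg);
        return parent - weighted;
    }
}
```

An attribute that separates buyers from non-buyers perfectly has the highest possible gain for that sample, while one that leaves both branches as mixed as the parent has a gain of zero and can be dropped.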


Mobile Streaming Analysis

The data mining of mobile devices may require the use of both deductive and inductive “streaming analytical” software that is event-driven to link, monitor, and analyze mobile behaviors. These new streaming analytical software products react to mobile consumer events in real time. There are two main types of streaming analytical products:


1. There are deductive streaming programs that operate based on user-defined business rules and are used to monitor multiple streams of data, reacting to consumer events as they take place.


2. There are also inductive streaming software products that use predictive rules derived from the data itself via clustering, text, and classification algorithms. These inductive streaming products build their rules from global models involving the segmentation and analysis from multiple and distributed mobile data clouds and networks.


These deductive and inductive software products can work with different data formats, from different locations, to make real-time predictions using multiple models from massive digital data streams.
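
The deductive variant, in which user-defined business rules react to events as they arrive, might be sketched as follows. The `RuleStream` class, the `Event` fields, and the rule names are hypothetical; a real product would also handle multiple streams, time windows, and actions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch: apply user-defined rules to each event as it arrives. */
class RuleStream {

    /** A single mobile consumer event; the fields are illustrative. */
    static final class Event {
        final String deviceId;
        final String type;
        final double amount;

        Event(String deviceId, String type, double amount) {
            this.deviceId = deviceId;
            this.type = type;
            this.amount = amount;
        }
    }

    /** A named business rule. */
    private static final class Rule {
        final String name;
        final Predicate<Event> condition;

        Rule(String name, Predicate<Event> condition) {
            this.name = name;
            this.condition = condition;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String name, Predicate<Event> condition) {
        rules.add(new Rule(name, condition));
    }

    /** Called once per incoming event; returns the names of all rules that fired. */
    List<String> onEvent(Event event) {
        List<String> fired = new ArrayList<>();
        for (Rule rule : rules) {
            if (rule.condition.test(event)) {
                fired.add(rule.name);
            }
        }
        return fired;
    }
}
```

An inductive product would differ mainly in where the conditions come from: they would be derived from clustering, text, or classification models instead of being written by hand.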

Web-Based Applications


J2EE is the result of Sun’s effort to integrate the assortment of Java technologies and APIs into a cohesive Java development platform for developing complex distributed Java applications.


Sun’s enhancement of the n-tier development model for Java, combined with the introduction of specific functionalities that ease the development of scalable server-side Web-based enterprise applications, has led to a wide adoption of Java for Web-centric application development.


Enterprise application development entails expertise in a host of areas like interprocess communications, memory management, security issues, and database-specific access queries. J2EE provides built-in support for services in all these areas, enabling developers to focus on implementing business logic rather than intricate code that supports basic application support infrastructure.


There are numerous advantages of application development in the J2EE area:


J2EE offers support for the componentization of enterprise applications that enable higher productivity via reusability of components, the rapid development of functioning applications via prebuilt functional components, higher-quality test-driven development via pretested components, and easier maintenance via cost-effective upgrades to individual components.


J2EE offers support for hardware and operating systems (OS) independence by enabling system services to be accessed via Java and J2EE rather than directly via APIs specific to the underlying systems.


J2EE offers a wide range of APIs to access and integrate with third-party products in a consistent manner, including databases, mail systems, and messaging platforms.


J2EE offers clear-cut segregation between system development, deployment, and execution, thus enabling independent development, integration, and upgrading of components.


J2EE offers specialized components that are optimized for specific types of roles in an enterprise application such as Entity Beans for handling persistent data and Session Beans for handling processing.


All the aforementioned features make possible rapid development of complex, distributed applications by enabling developers to focus on developing business logic, implementing the system without being impacted by prior knowledge of the target execution environment(s), and creating systems that can be ported more easily between different hardware platforms and operating systems (OS).


Realization of the Reference Architecture in J2EE

The Java 2 Platform, Enterprise Edition (J2EE) provides a component-based approach to implementing n-tier distributed enterprise applications. The components that make up the application are executed in runtime environments called containers.


Containers are used to provide infrastructure-type services such as lifecycle management, distribution, and security. Containers and components in the J2EE application are broadly divided into three tiers.


The client tier is typically a web browser or, alternatively, a Java application client. The middle tier contains the two primary containers of the J2EE application, namely, the web container and the Enterprise JavaBeans (EJB) container.


The function of the web container is to process client requests and generate corresponding responses, while the function of the EJB container is to implement the business logic of the application. The enterprise information systems (EIS) tier primarily consists of data sources and a number of interfaces and APIs to access the resources and other existing or legacy applications.


JavaServer Pages and Java Servlets as the User Interaction Components

JSPs and Java Servlets are meant to process and respond to web user requests. Servlets provide a Java-centric programming approach for implementing web tier functionality. The Servlet API provides an easy-to-use set of objects that process HTTP requests and generate HTML/XML responses.


JSPs provide an HTML-centric version of the Java Servlets. JSP components are document-based rather than object-based and possess built-in access to the Servlet API request and response objects as well as the user session object. JSPs also provide a powerful custom tag mechanism, enabling the encapsulation of reusable Java presentation code that can be placed directly into the JSP document.


Session Bean EJBs as Service-Based Components

Session Beans are meant for representing services provided to a client. Unlike Entity Beans, Session Beans do not share data across multiple clients; each user requesting a service or executing a transaction invokes a separate Session Bean to process the request. After processing a request, a Stateless Session Bean moves on to the next client without maintaining or sharing any data.


However, Stateful Session Beans are often constructed for a particular client and maintain a state across method invocations for a single client until the component is removed.
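
The difference between the two styles can be shown with plain Java classes. This is only an analogy: no EJB interfaces, deployment descriptors, or container services are involved, and the class names `PriceService` and `ShoppingCart` are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Stateless style: every call is self-contained, so any instance can serve any client. */
class PriceService {
    double priceWithTax(double netPrice, double taxRate) {
        return netPrice * (1 + taxRate);
    }
}

/** Stateful style: one instance per client, keeping conversational state between calls. */
class ShoppingCart {
    private final List<String> items = new ArrayList<>();

    void add(String item) {
        items.add(item);
    }

    /** Returns a copy so callers cannot modify the cart's internal list. */
    List<String> items() {
        return new ArrayList<>(items);
    }
}
```

Because `PriceService` holds no fields, a container could hand the same instance to one client after another, whereas each `ShoppingCart` must stay tied to the client that filled it.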


Entity Bean EJBs as the Business Object Components

Entity Beans are meant for representing persistent data entities within an enterprise application. One of the major component services that are provided to the Entity Beans is that of Container-Managed Persistence (CMP).


However, in the EJB 2.0 specification, CMP is limited to one table only. Any object-relational mapping involving more than a one-to-one table-object mapping is supported only through Bean-Managed Persistence (BMP).


Distributed Java Components

Java Naming and Directory Interface (JNDI) enables the naming and distribution of Java components within the reference architecture. JNDI can be used to store and retrieve any Java object.


However, JNDI is usually used to look up the component (home or remote) interfaces of enterprise beans. The client uses JNDI to look up the corresponding EJB Home interface, which enables the creation, access, or removal of instances of Session and Entity Beans.


In the case of a local Entity Bean, a method invocation is proxied directly to the bean’s implementation. In the case of a remote Entity Bean, the Home interface is used to obtain access to the remote interface, whose exposed methods are invoked using RMI.


The remote interface takes the local method call, serializes the objects that will be passed as arguments, and invokes the corresponding remote method on the distributed object.


On the remote side, these serialized objects are converted back into normal objects and the method is invoked. The process is then reversed to pass the resulting value back to the client of the remote interface.
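
The serialize-then-restore round trip that underlies a remote call can be sketched with standard Java object serialization. This example stays in one process and does not use RMI itself; the `MarshalDemo` and `Order` names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

/** Hypothetical sketch: serialize an argument to bytes and rebuild it, as a remote call would. */
class MarshalDemo {

    /** An argument object; it must be Serializable to cross the wire. */
    static final class Order implements Serializable {
        private static final long serialVersionUID = 1L;
        final String product;
        final int quantity;

        Order(String product, int quantity) {
            this.product = product;
            this.quantity = quantity;
        }
    }

    /** Serializes the order to a byte array and deserializes it into a new object. */
    static Order roundTrip(Order order) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
                out.writeObject(order); // caller side: object becomes bytes
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray()))) {
                return (Order) in.readObject(); // remote side: bytes become an object
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("Serialization round trip failed", e);
        }
    }
}
```

The restored object is a copy, which is why changes made to an argument on the remote side are not visible to the caller.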


J2EE Access to the EIS (Enterprise Information Systems) Tier

J2EE provides a number of interfaces and APIs to access resources in the EIS tier. The use of JDBC API is encapsulated primarily in the data access layer or within the CMP classes of the Entity Bean.


Data sources that map to a database are defined through JDBC and can be looked up via JNDI by a client searching for a resource. This enables the J2EE application server to provide connection pooling to different data resources; connections should be closed as soon as the task is over to prevent bottlenecks.


The various J2EE interfaces and APIs available are as follows:

Java Connector Architecture provides a standard way to build adapters to access existing enterprise applications.


JavaMail API provides a standard way to access mail server applications.


Java Message Service (JMS) provides a standard interface to enterprise messaging systems. JMS enables reliable asynchronous communication with other distributed components. JMS is used by Message-Driven Beans (MDBs) to perform asynchronous or parallel processing of messages.


Model–View–Controller Architecture

The Model 2 architecture is based on the Model–View–Controller (MVC) design pattern. A generic MVC implementation is a vital element of the reference architecture as it provides a flexible and reusable foundation for very rapid web-based application development.

The components of the MVC architecture are as follows:


  • View deals with the display on the screens presented to the user
  • Controller deals with the flow and processing of user actions
  • Model deals with the business logic


MVC architecture modularizes and isolates screen logic, control logic, and business logic in order to achieve greater flexibility and opportunity for reuse. A critical isolation point is between the presentation objects and the application back-end objects that manage the business logic and data.


This isolation allows major changes to be made to the display screens of the user interface without impacting the business logic and data components.


The view does not contain the source of data and relies on the model to furnish the relevant data. When the model updates the data, it notifies the view and furnishes the changed data so that the view can render the display with up-to-date, correct data. The controller channels information about user actions from the view to the business logic in the model for processing.


The controller enables an application design to flexibly handle things such as page navigation and access to the functionality provided by the application model in case of form submissions. Thus, the controller provides an isolation point between the model and the view, resulting in a more loosely coupled front end and back end.
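
The three roles can be illustrated with a tiny counter written in plain Java. This is a sketch of the pattern only, not of a J2EE Model 2 implementation with servlets and JSPs; the class names and the "increment" action string are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Model: owns the data and notifies listeners when it changes. */
class CounterModel {
    private int value;
    private final List<Consumer<Integer>> listeners = new ArrayList<>();

    void addListener(Consumer<Integer> listener) {
        listeners.add(listener);
    }

    void increment() {
        value++;
        for (Consumer<Integer> listener : listeners) {
            listener.accept(value); // push the changed data to each view
        }
    }

    int value() {
        return value;
    }
}

/** View: holds no data of its own and renders whatever the model supplies. */
class CounterView {
    private String lastRendered = "";

    void render(int value) {
        lastRendered = "Count: " + value;
    }

    String lastRendered() {
        return lastRendered;
    }
}

/** Controller: translates user actions into calls on the model. */
class CounterController {
    private final CounterModel model;

    CounterController(CounterModel model) {
        this.model = model;
    }

    void handle(String action) {
        if ("increment".equals(action)) {
            model.increment();
        }
    }
}
```

Wiring is done from outside with `model.addListener(view::render)`, so the view can be replaced without touching the model or the controller.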