What is Search Log Analysis?
Exploiting the data stored in the search logs of web search engines, intranets, and websites provides important insights into the information searching habits and tactics of online searchers.
Web search engine companies use search logs (also referred to as transaction logs) to investigate searching trends and the effects of system improvements. In this blog, we explain what search log analysis is.
This understanding can inform information system design, interface development, and information architecture construction for content collections. Search logs are an unobtrusive method of collecting significant amounts of searching data on a sizable number of system users.
A search log is an electronic record of interactions that have occurred during a searching episode between a web search engine and users searching for information on that web search engine.
The users may be humans or computer programs acting on behalf of humans. Interactions are the communication exchanges that occur between users and the system, initiated either by the user or by the system.
Most search logs are server-side recordings of interactions; the server software application can record various types of data and interactions depending on the file format that the server software supports.
The search log format is typically an extended file format, which contains data such as the client computer’s Internet Protocol (IP) address, user query, search engine access time, and referrer site, among other fields.
Search Log Analysis (SLA) is defined as the use of data collected in a search log to investigate particular research questions concerning interactions among web users, the web search engine, or the web content during searching episodes.
Within this interaction context, SLA could use the data in search logs to discern attributes of the search process, such as the searcher’s actions on the system, the system responses, or the evaluation of results by the searcher.
From this understanding, one achieves some stated objective, such as improved system design, advanced searching assistance, or better understanding of some user information searching behavior.
SLA involves the following three major stages:
1. Data collection involves the process of collecting the interaction data for a given period in a search log. Search logs provide a good balance between collecting a robust set of data and unobtrusively collecting that data.
Collecting data from real users pursuing needed information while interacting with real systems on the web affects the type of data that one can realistically assemble.
On a real-life system, the method of data monitoring and collecting should not interfere with the information searching process. Not only may a data collection method that interferes with the information searching process unintentionally alter that process, but such nonpermitted interference may also lead to the loss of a potential customer.
A search log typically consists of data such as
a. User Identification: The IP address of the customer’s computer
b. Date: The date of the interaction as recorded by the search engine server
c. The Time: The time of the interaction as recorded by the search engine server
Additionally, it could also consist of data such as
a. Results Page: The code representing a set of result abstracts and URLs returned by the search engine in response to a query
b. Language: The user’s preferred language for the retrieved web pages
c. Source: The federated content collection searched
d. Page Viewed: The URL that the searcher visited after entering the query and viewing the results page, which is also known as click-thru or click-through
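The fields listed above can be held in a small record type. The following is only a minimal sketch: the tab-separated layout, the field order, and the names SearchLogRecord and parse_line are assumptions made for illustration, not the format of any particular search engine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SearchLogRecord:
    user_id: str                       # IP address of the searcher's computer
    timestamp: datetime                # date and time recorded by the server
    query: str                         # query terms as entered
    page_viewed: Optional[str] = None  # click-through URL, if any


def parse_line(line: str) -> SearchLogRecord:
    """Parse one line of a hypothetical layout: ip, date, time, query, [url]."""
    parts = line.rstrip("\n").split("\t")
    ip, date, time, query = parts[:4]
    url = parts[4] if len(parts) > 4 and parts[4] else None
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return SearchLogRecord(ip, stamp, query, url)


record = parse_line(
    "192.0.2.1\t2024-01-15\t10:30:00\tcheap flights\thttp://example.com/a"
)
print(record)
```

A real log would need its own parser matched to the server's actual file format; the point here is only that each interaction becomes one structured record.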
2. Data preparation involves the process of cleaning and preparing the search log data for analysis. For data preparation, the focus is on importing the search log data into a relational or NoSQL database, assigning each record a primary key, cleaning the data (i.e., checking each field for bad data), and calculating standard interaction metrics that will serve as the basis for further analysis.
Data preparation consists of steps like
a. Cleaning data: Records in search logs can contain corrupted data. These corrupted records can arise for multiple reasons, but they are mostly related to errors when logging the data.
b. Parsing data: Using the three fields of The Time, User Identification, and Search URL, common to all web search logs, the chronological series of actions in a searching episode is recreated. The web query search logs usually contain queries from both human users and agents.
Depending on the research objective, one may be interested in only individual human interactions, those from common user terminals, or those from agents.
c. Normalizing searching episodes: When a searcher submits a query, then views a document, and returns to the search engine, the web server typically logs this second visit with the identical user identification and query, but with a new time (i.e., the time of the second visit).
This is beneficial information in determining how many of the retrieved results pages the searcher visited from the search engine, but unfortunately, it also skews the results of the query level analysis. In order to normalize the searching episodes, one must first separate these result page requests from query submissions for each searching episode.
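The cleaning and normalizing steps can be sketched roughly as follows. This is a simplified illustration, not a standard algorithm: it assumes records are (user_id, timestamp, query) tuples, treats a row with a missing user or query as corrupted, and treats every repeat of the same user and query as a result page request. The function name normalize is hypothetical.

```python
def normalize(records):
    """Collapse repeated (user, query) rows into one query submission each.

    records: iterable of (user_id, timestamp, query) tuples.
    Returns a list of dicts holding the first-seen time and the number of
    additional rows logged for the same user and query.
    """
    merged = {}
    for user_id, timestamp, query in sorted(records, key=lambda r: r[1]):
        if not user_id or not query or not query.strip():
            continue  # cleaning step: skip rows with missing fields
        key = (user_id, query)
        if key not in merged:
            merged[key] = {
                "user_id": user_id,
                "query": query,
                "first_seen": timestamp,
                "result_page_requests": 0,
            }
        else:
            # Same user, same query, later time: a return to the results page.
            merged[key]["result_page_requests"] += 1
    return list(merged.values())


rows = [
    ("10.0.0.1", 100, "python csv"),
    ("10.0.0.1", 160, "python csv"),         # returned to the results page
    ("10.0.0.1", 200, "python csv reader"),
    ("", 210, "broken"),                     # corrupted row, dropped
    ("10.0.0.2", 220, "python csv"),
]
normalized = normalize(rows)
print(normalized)
```

Keeping the count of result page requests preserves the click-through information while leaving one row per query for the query level analysis.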
3. Data analysis involves the process of analyzing the prepared data. There are three common levels of analysis for examining search logs:
a. Session analysis
A searching episode is defined as a series of interactions within a limited duration to address one or more information needs.
This session duration is typically short, with web researchers using between 5 and 120 minutes as a cutoff; each choice of time has an impact on the results: the searcher may be multitasking within a searching episode, or the episode may be an instance of the searcher engaged in successive searching.
This session definition is similar to the definition of a unique visitor used by commercial search engines and organizations to measure website traffic. The number of queries per searcher is the session length.
Session duration is the total time the user spent interacting with the search engine, including the time spent viewing the first and subsequent web documents, except the final document.
Session duration can, therefore, be measured from the time the user submits the first query until the user departs the search engine for the last time (i.e., does not return). This viewing time of the final web document is not available since the web search engine server does not record the time stamp.
A web document is the web page referenced by the URL on the search engine’s results page. A web document may be text or multimedia and, if viewed hierarchically, may contain a nearly unlimited number of subweb documents. A web document may also contain URLs linking to other web documents.
From the results page, a searcher may click on (i.e., visit) one or more results from the listings. Analyzing these visits is click-through analysis, which measures the page viewing behavior of web searchers.
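Session length and session duration, as defined above, can be computed along these lines. This is a sketch under stated assumptions: the input is one user's (timestamp, query) events, a 30-minute inactivity cutoff is chosen arbitrarily from the 5 to 120 minute range mentioned above, and the function names are made up for the example.

```python
from datetime import datetime, timedelta


def split_sessions(events, cutoff=timedelta(minutes=30)):
    """Split one user's (timestamp, query) events into sessions.

    A new session starts whenever the gap since the previous event
    exceeds the cutoff.
    """
    result, current = [], []
    for stamp, query in sorted(events):
        if current and stamp - current[-1][0] > cutoff:
            result.append(current)
            current = []
        current.append((stamp, query))
    if current:
        result.append(current)
    return result


def session_length(session):
    """Number of distinct queries in the session."""
    return len({query for _, query in session})


def session_duration(session):
    """Time from the first recorded interaction to the last one."""
    return session[-1][0] - session[0][0]


t0 = datetime(2024, 1, 15, 9, 0)
events = [
    (t0, "a"),
    (t0 + timedelta(minutes=5), "a"),     # back on the results page
    (t0 + timedelta(minutes=12), "a b"),  # modified query
    (t0 + timedelta(hours=3), "c"),       # well past the cutoff
]
split = split_sessions(events)
for s in split:
    print(session_length(s), session_duration(s))
```

Because the server never sees when the searcher leaves the final document, the duration here stops at the last logged interaction, which is why a one-event session has a duration of zero.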
b. Query analysis
The query level of analysis uses the query as the base metric. A query is defined as a string list of one or more terms submitted to a search engine. This is a mechanical definition as opposed to an information searching definition. The first query by a particular searcher is the initial query.
A subsequent query by the same searcher that is different than any of the searcher’s other queries is a modified query. There can be several occurrences of different modified queries by a particular searcher.
A unique query refers to a query that is different from all other queries in the transaction log, regardless of the searcher. A repeat query is a query that appears more than once within the dataset by two or more searchers.
Query complexity examines the query syntax, including the use of advanced searching techniques such as Boolean and other query operators.
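The query categories above translate fairly directly into code. The sketch below assumes time-ordered (user_id, query) pairs and compares queries as exact strings; the label "identical" for a resubmitted query and both function names are hypothetical additions for the example.

```python
from collections import defaultdict


def label_queries(records):
    """Label each (user_id, query) row as initial, modified, or identical."""
    seen = defaultdict(set)
    labels = []
    for user_id, query in records:
        if not seen[user_id]:
            labels.append("initial")    # first query by this searcher
        elif query not in seen[user_id]:
            labels.append("modified")   # differs from all earlier queries
        else:
            labels.append("identical")  # resubmission of an earlier query
        seen[user_id].add(query)
    return labels


def repeat_queries(records):
    """Queries submitted by two or more different searchers."""
    users_by_query = defaultdict(set)
    for user_id, query in records:
        users_by_query[query].add(user_id)
    return {q for q, users in users_by_query.items() if len(users) >= 2}


log = [
    ("u1", "jaguar"),
    ("u1", "jaguar car"),
    ("u1", "jaguar"),
    ("u2", "jaguar"),
    ("u2", "python"),
]
labels = label_queries(log)
repeats = repeat_queries(log)
print(labels)
print(repeats)
```

In practice one would probably normalize case and whitespace before comparing queries; that step is left out here to keep the definitions visible.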
c. Term analysis
The term level of analysis naturally uses the term as the basis for analysis. A term is a string of characters separated by some delimiter such as space or some other separator.
At this level of analysis, one focuses on measures such as term occurrence, which is the frequency that a particular term occurs in the transaction log.
High Usage Terms are those terms that occur most frequently in the dataset. Term co-occurrence measures the occurrence of term pairs within queries in the entire search log. One can also calculate degrees of association of term pairs using various statistical measures.
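Term occurrence and term co-occurrence can be counted with a few lines of code. This sketch assumes whitespace as the only delimiter and lowercases terms before counting; the function name term_stats is illustrative.

```python
from collections import Counter
from itertools import combinations


def term_stats(queries):
    """Return (term occurrence, term pair co-occurrence) counters."""
    occurrence = Counter()
    cooccurrence = Counter()
    for query in queries:
        terms = query.lower().split()
        occurrence.update(terms)
        # Count each unordered pair of distinct terms once per query.
        cooccurrence.update(combinations(sorted(set(terms)), 2))
    return occurrence, cooccurrence


queries = ["cheap flights", "cheap hotels", "cheap flights paris"]
occurrence, cooccurrence = term_stats(queries)
print(occurrence.most_common(2))
print(cooccurrence[("cheap", "flights")])
```

The most frequent entries of the first counter are the high usage terms; the second counter is the raw input for any association measure one might apply afterwards.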
Effective website management requires a way to map the behavior of the visitors to the site against the particular objectives and purpose of the site. Web analysis or Log file analysis is the study of the log files from a particular website. The purpose of log file analysis is to assess the performance of the website.
Every time a browser requests a web page, the server on which the website is hosted records data, in what are called log files, for every action a visitor takes on that website. Log file data include information on
Who is visiting the website (the visitor’s URL, or web address)
The IP address (numeric identification) of the computer the visitor is browsing from
The date and time of each visit
Which pages the visitor viewed, how long the visitor viewed the site
Other relevant data
Log files contain potentially useful information for anyone working with a website—from server administrators to designers to marketers—who needs to assess website usability and effectiveness.
1. Website administrators use the data in log files to monitor the availability of a website to make sure the site is online, available, and without technical errors that might prevent easy access and use. Administrators can also predict and plan for growth in server resources and monitor for unusual and possibly malicious activity.
For example, by monitoring past web usage logs for visitor activity, a site administrator can predict future activity during holidays and other spikes in usage and plan to add more servers and bandwidth to accommodate the expected traffic.
In order to watch for potential attacks on a website, administrators can also monitor web usage logs for abnormal activity on the website such as repetitive login attempts, unusually large numbers of requests from a single IP address, and so forth.
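One simple way to spot an unusually large number of requests from a single IP address is to count requests per address and flag those above a threshold. The sketch below is only an illustration: the threshold value and the function name flag_heavy_clients are assumptions, and a real monitor would also look at time windows and request types.

```python
from collections import Counter


def flag_heavy_clients(ip_addresses, threshold):
    """Return {ip: request_count} for addresses exceeding the threshold."""
    counts = Counter(ip_addresses)
    return {ip: n for ip, n in counts.items() if n > threshold}


requests = ["198.51.100.7"] * 50 + ["192.0.2.10"] * 3
suspicious = flag_heavy_clients(requests, threshold=20)
print(suspicious)
```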
2. Marketers can use the log files to understand the effectiveness of various on- and off-line marketing efforts. By analyzing the web logs, marketers can determine which marketing efforts are the most effective. Marketers can track the effectiveness of online advertising, such as banner ads and other links, through the use of the referrer logs (referring URLs).
Examination of the referring URLs indicates how visitors got to the website, showing, say, whether they typed the URL (web address) directly into their web browser or whether they clicked through from a link at another site.
Web logs can also be used to track the amount of activity from offline advertising, such as magazine and other print ads, by utilizing a unique URL in each offline ad that is run.
Unlike online advertising that shows results in log information about the referring website, offline advertising requires a way to track whether or not the ad generated a response from the viewer. One way to do this is to use the ad to drive traffic to a particular website especially established only for tracking that source.
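The referrer examination described above can be sketched as a tally of visits by referring host. The example assumes that a missing referrer is logged as an empty string or "-", and it labels those visits "direct"; both the label and the function name referrer_sources are choices made for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def referrer_sources(referrers):
    """Count visits by referring host; empty or "-" counts as direct."""
    sources = Counter()
    for ref in referrers:
        if not ref or ref == "-":
            sources["direct"] += 1
        else:
            sources[urlparse(ref).netloc or "unknown"] += 1
    return sources


sources = referrer_sources([
    "-",
    "https://news.example.com/story",
    "https://news.example.com/other",
    "",
])
print(sources)
```

A missing referrer does not strictly prove that the visitor typed the URL, since bookmarks and some privacy settings also leave the field empty, so the "direct" bucket is best read as an upper bound. A unique URL used only in one offline ad can be counted the same way, by tallying requests for that landing page.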
3. Website designers use log files to assess user experience and site usability. Understanding the user environment provides web designers with the information they need to create a successful design.
While ensuring a positive user experience on a website requires more than merely good design, log files do provide readily-available information to assist with the initial design as well as continuous improvement of the website. Web designers can find useful information about
a. The type of operating system (e.g., Windows XP or Linux)
b. The screen settings (e.g., screen resolution)
c. The type of browser (e.g., Internet Explorer or Mozilla) used to access the site
This information allows designers to create web pages that display well for the majority of users.
A click trail can show how a viewer navigates through the various pages of a given website; the corresponding clickstream data can show
What products a customer looked at on an e-commerce site
Whether the customer purchased those products
What products a customer looked at but did not purchase
What ads generated many click-throughs but resulted in few purchases
And so on
By giving clues as to which website features are successful, and which are not, log files assist website designers in the process of continuous improvement by adding new features, improving on current features, or deleting unused features.
Then, by monitoring the web logs for the impact on user reaction, and making suitable adjustments based on those reactions, the website designer can improve the overall experience for website visitors on a continuous basis.
The Veracity of Log File Data
Despite the wealth of useful information available in log files, the data also suffer from limitations.
One of the major sources of inaccuracy arises from the way in which unique visitors are measured. Traditional web log reports measure unique visitors based on the IP address, or network address, recorded in the log file.
Because of the nature of different Internet technologies, IP addresses do not always correspond to an individual visitor in a one-to-one relationship.
In other words, there is no accurate way to identify each individual visitor. Depending on the particular situation, this causes the count of unique visitors to be either over- or under-reported.
Cookies are small bits of data that a website leaves on a visitor’s hard drive after that visitor has hit the website. Then, each time the user’s web browser requests a new web page from the server, the cookie on the user’s hard drive can be read by the server. Cookie data are beneficial in several ways:
A unique cookie is generated for each user, even if multiple viewers access the same website through the same proxy server; consequently, a unique session is recorded and a more accurate visitor count can be obtained.
Cookies also make it possible to track users across multiple sessions (i.e., when they return to the site subsequently), thus enabling computation of new versus returning visitors.
Third-party cookies enable the website to assess what other sites the visitor has visited; this enables personalization of the website in terms of the content that is displayed. Cookies are not included in normal log files. Therefore, only a web analytics solution that supports cookie tracking can utilize the benefits.
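The difference between counting unique visitors by IP address and by cookie can be illustrated with a toy example. The (ip, cookie_id) pairs below are invented data, and the function name unique_visitors is hypothetical; the sketch only shows how the two counts can diverge.

```python
def unique_visitors(requests):
    """Return (count by IP address, count by cookie ID)."""
    by_ip = {ip for ip, _ in requests}
    by_cookie = {cookie for _, cookie in requests}
    return len(by_ip), len(by_cookie)


requests = [
    ("203.0.113.5", "c1"),   # three different people behind one proxy
    ("203.0.113.5", "c2"),
    ("203.0.113.5", "c3"),
    ("198.51.100.9", "c4"),  # one person seen from two addresses
    ("192.0.2.44", "c4"),
]
ip_count, cookie_count = unique_visitors(requests)
print(ip_count, cookie_count)
```

Here the proxy causes the IP count to under-report three people as one, while the visitor with a changing address is over-reported as two; the cookie count avoids both effects as long as visitors keep their cookies.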
Another source of inaccuracy is in visitor count data. Most web log reports give two possible ways to count visitors: hits and unique visits. The very definition of hits is a source of unreliability.
By definition, each time a web page is loaded, each element of the web page (i.e., different graphics on the same page) is counted as a separate “hit.”
Therefore, even with a one-page view, multiple hits are recorded as a function of the number of different elements on a given web page. The net result is that hits are highly inflated numbers.
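The inflation of hits relative to page views can be seen by classifying requests by file type. This is a rough sketch: the set of asset suffixes is an assumption for the example, and real sites would need their own rules for what counts as a page.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical list of suffixes treated as page elements rather than pages.
ASSET_SUFFIXES = {".png", ".jpg", ".gif", ".css", ".js", ".ico"}


def hits_and_page_views(request_paths):
    """Return (hits, page views) for a list of requested paths."""
    hits = len(request_paths)
    page_views = sum(
        1
        for path in request_paths
        if PurePosixPath(urlparse(path).path).suffix.lower()
        not in ASSET_SUFFIXES
    )
    return hits, page_views


one_page_load = ["/index.html", "/logo.png", "/style.css", "/app.js", "/banner.gif"]
print(hits_and_page_views(one_page_load))
```

A single page load in this example produces five hits but only one page view, which is the gap the text describes.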
In contrast, the under-reporting of visitors is a serious issue for online advertising. If the ad is cached, nobody knows that the ad was delivered. As a result, the organization delivering the ad does not get paid. Log files cannot track visitor activity from cached pages because the web server never acknowledges the request.
This deficiency is remedied by using page tagging. This technique has its origins in hit counters, which, like a car odometer, increase by one count with each additional page view.
Since page tagging is located on the web page itself rather than on the server, each time the page is viewed, it is “tagged”; while server logs cannot keep track of requests for a cached page, a “tagged” page will still acknowledge and record a visit.
Moreover, rather than recording a visit in a web log file that is harder to access, page tagging records visitor information in a database, offering increased flexibility to access the information more quickly and with more options to further manipulate the data.
Web logs do not provide an accurate way to determine the visit duration. Visit duration is calculated based on the time spent between the first page request and the last page request.
If the next page request never occurs, the duration cannot be calculated and will be under-reported. Web logs also cannot account for the user who views a page, leaves the computer for 20 minutes, and comes back and clicks to the next page. In this situation, the visit duration would be highly inflated.
Web Analysis Tools
New tools in web analytics like Google Analytics provide a stronger link between online technologies and online marketing, giving marketers more essential information lacking in earlier versions of web analytics software.
For many years, web analytics programs that delivered only simple measurements such as hits, visits, referrals, and search engine queries were not well linked to an organization’s marketing efforts to drive online traffic.
As a result, they provided very little insight to help the organization track and understand its online marketing efforts.
Trends in web analytics specifically improve both the method of data collection and the analysis of the data, providing significantly more value from a marketing perspective. These newer tools attempt to analyze the entire marketing process, from a user clicking an advertisement through to the actual sale of a product or service.
This information helps to identify not merely which online advertising is driving traffic (number of clicks) to the website and which search terms lead visitors to the site, but which advertising is most effective in actually generating sales (conversion rates) and profitability.
This integration of the web log files with other measures of advertising effectiveness is critical to provide guidance into further advertising spending.
Web analytics software has the capability to perform more insightful, detailed reporting on the effectiveness of common online marketing activities such as search engine listings, pay-per-click advertising, and banner advertising. Marketing metrics to assess effectiveness can include the following:
Cost-per-click: The total online expenditure divided by the number of click-throughs to the site.
Conversion rate: The percentage of the total number of visitors who make a purchase, sign up for a service, or complete another specific action.
Return on marketing investment: The total revenue generated from the advertising divided by the advertising expense.
Bounce rate: The number of users that visit only a single page divided by the total number of visits; one indicator of the “stickiness” of a web page.
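Three of these metrics are simple ratios and can be written out directly. The sketch below uses invented figures and hypothetical function names, and it assumes the denominators are non-zero.

```python
def cost_per_click(spend, clicks):
    """Total online expenditure divided by click-throughs."""
    return spend / clicks


def conversion_rate(conversions, visitors):
    """Percentage of visitors who completed the target action."""
    return 100.0 * conversions / visitors


def bounce_rate(single_page_visits, total_visits):
    """Share of visits that viewed only a single page."""
    return single_page_visits / total_visits


# Invented campaign figures, for illustration only.
print(cost_per_click(500.0, 2000))
print(conversion_rate(40, 2000))
print(bounce_rate(900, 2000))
```

Tracking these figures per campaign, rather than for the site as a whole, is what lets a marketer compare advertising efforts against each other.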