Secondary research is the collection and collation of information that is published and publicly available.
Also called desk research, it is frequently done to explore a topic before engaging in more expensive primary research, or to quickly gain a summary understanding of a topic without engaging too many resources.
The advantages of doing secondary research (vs. primary research) are numerous. First, secondary research is considerably less expensive than primary research.
Also, it is considerably faster to complete as it does not depend on third parties (such as recruited participants, organized focus groups, or enrolled survey participants) to obtain the information, and the sample size of third-party research reports will often be quite considerable.
Furthermore, due to scope and reputation, third-party information often has more perceived authority and impartiality than “in-house” research.
An assessment such as “BCC Research estimates that the global advanced drug delivery market should grow from roughly $178.8 billion in 2015 to nearly $227.3 billion by 2020, with a compound annual growth rate (CAGR) of 4.9%”* has more credibility than most in-house estimates.
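When quoting a third-party figure like the one above, it can be worth checking that the stated growth rate agrees with the start and end values. The sketch below is a minimal illustration of that check using the standard CAGR formula; the function name `cagr` is our own and is not taken from the cited report.

```python
def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate as a fraction (0.05 means 5% per year)."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the text, in billions of U.S. dollars.
start_2015 = 178.8
end_2020 = 227.3

rate = cagr(start_2015, end_2020, years=2020 - 2015)
print(f"Implied CAGR: {rate:.1%}")  # prints "Implied CAGR: 4.9%"
```

Here the implied rate matches the 4.9% stated in the quotation, which is a quick sign that the figures are internally consistent.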
Finally, as mentioned at the beginning of this section, secondary research is very useful to orient and define primary research: a researcher will often start a market research project by doing a quick market review to identify some of the main trends and concerns before diving into primary research.
There are some disadvantages to secondary research. First, the information is not always tailored to an organization’s requirements, and it is quite difficult to find data for emerging fields. Reports on nanomedicine are plentiful, but secondary research on a specific application merging nanomedicine and information technology will be quite scarce, so the organization will need to either (a) extrapolate from generalist research or (b) conduct primary research. Also, the data might be outdated, limiting its usefulness. Finally, most of the time, the original data used in secondary research is unverifiable: it is quite difficult to spot errors in data collection or dispute the way that data was analyzed in a consolidated report.
There are two types of secondary data: external and internal. External secondary data is information gathered from outside the organization.
This includes anything from government statistics to media sources. Internal secondary data is data that the organization is generating itself. It could be data collected from customers’ feedback, accounting and sales records, or employee experiences.
While it is possible to do secondary research using non-Internet sources, the bulk of our suggestions relate to the Internet because of its ease of use, convenience, and low cost.
Finally, we make a distinction between active and passive secondary research. Active secondary research takes place when the researcher is actively searching for information, while passive secondary research is the use of tools and software to automate data collection.
Active Secondary Research
Active secondary research takes place when the market researcher is dynamically searching for information. In this section, we will be going over the most popular and pertinent sources of data available. All of the suggested sources are web-based.
The Internet has grown to be by far the most important resource for searching for and identifying secondary research information: even most magazines and newspapers have made their archives (partially) available online.
While there are some ethical considerations for what information the researcher can use, a simple rule of thumb is that any information made available to the public is fair game for collection and review.
If the method used to obtain the information is not commonly available to the public (such as using a former employee’s password to access a restricted website area), then it is not only unethical but most definitely illegal.
In the following section, we will be going over a variety of Internet data sources that a market researcher might wish to investigate when doing secondary research, and will conclude with a few tips to increase the effectiveness of Google-based Internet searches.
Government agencies generate large bodies of information that can be used by researchers. Most of this information is free to use and can be useful at the start of a research project. Government data is usually statistical in nature and is very useful when building marketing models or trying to understand the nature of a market.
CIA World Factbook (www.cia.gov/library/publications/the-world-factbook)
U.S. Patent Search (patft.uspto.gov)
National Center for Health Statistics (www.cdc.gov/nchs/hus.htm)
National Agricultural Statistics Service (www.nass.usda.gov)
U.S. Census (www.census.gov)
Data.gov (www.data.gov) U.S. Government’s open data
Government databases that could be useful to review during a market research effort in life sciences.
Public Company Data
Companies publish a lot of useful information online. Often, companies will do market research, and will publish results as consolidated information in the corporate documentation. As such, reviewing this publicly available data is another useful way to start your research effort.
One of the first places to search online is the website of a specific company with a technology or product akin to the one that is the focus of your research. Looking at data from other companies can help to estimate a number of elements, including market size, growth, and segmentation, since you will possibly find pieces of market data.
For example, while perusing a corporate presentation, you might find a sentence on market segmentation where the company states that it believes itself to be the second-largest player in the market, and estimates that it generates 30% of the total sales.
This statement might be based on other secondary research or on internal data, so it’s important to note the origin of the statement. Some of the documents that can be reviewed online include
Annual reports, which might also be available on the websites of central depositories, such as the Securities and Exchange Commission (for American companies, www.sec.gov) or SEDAR (for Canadian companies, www.sedar.com)
Company pitch decks or presentations at investor forums
Press releases, marketing collateral, product brochures, and white papers
Video product presentations
Once these documents are identified, be sure to download them and save them to your own local drive.
Companies can change or update their content online without warning, and the information you identified might not be available any longer when you return to the website. Taking a screenshot of important information can be another way to ensure that the data isn’t lost following a website update.
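A statement of market share found in a corporate presentation, such as the 30% figure in the example earlier in this section, can be turned into a rough estimate of total market size once the company's sales are known. The sketch below is only an illustration: the sales figure is hypothetical, and the function name `implied_market_size` is our own.

```python
def implied_market_size(company_sales: float, market_share: float) -> float:
    """Estimate total market sales from one company's sales and its stated share."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be a fraction between 0 and 1")
    return company_sales / market_share


# Hypothetical: annual sales taken from the company's annual report,
# in millions of dollars.
company_sales_musd = 150.0
# Share stated in the corporate presentation (the 30% from the example).
stated_share = 0.30

market = implied_market_size(company_sales_musd, stated_share)
print(f"Implied total market: ${market:,.0f} million")
```

Because the result inherits any bias in the company's own estimate of its share, it is best treated as one data point to triangulate against other sources.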
Print media sources are published on a regular schedule by specialized companies. They include magazines and newspapers, in both their printed and online formats.
Generalist newspapers and publications, while interesting, are of limited use when doing specialized market research. Although it is possible to get general information from them (such as trends, government policies and orientations, or industry summaries), print media market research in life sciences will focus mostly on specialized content publishers, chiefly trade magazine publishers.
There are a number of trade magazines that are published on a regular basis that are of use to market researchers. Most of them are free, and access to their archives is public most of the time, although some do monetize their archives.
Listed below are some of the main publications that market researchers can use during their market research efforts. One thing to investigate is whether the publication publishes specialized inserts or special issues on focused topics, as these also provide very valuable market data.
Genetic Engineering & Biotechnology News (www.genengnews.com) Bioindustry news publication
FierceBiotech (www.fiercebiotech.com) Biotechnology news publication
Fierce Pharma (www.fiercepharma.com) Pharmaceutical news publication
BioProcess International Magazine (www.bioprocessintl.com) Manufacture of biotherapeutics
Contract Pharma (www.contractpharma.com) Pharma outsourcing industry
Trade and Industry Groups
Trade and industry groups are organizations representing multiple firms (private companies, government agencies, universities, consultants) in a common commercial activity sector. Some of the larger associations produce (or sponsor) reports relevant to their industry, which can be of use during market research.
One of the advantages of the reports produced by these organizations is that they are third-party reports and are (relatively) unbiased, but the reports are not always freely available and are sometimes reserved for members.
Associations are also used to identify participants or targets for further research as their members are routinely listed on their website. This can give the researcher a good starting point when researching a specific technology area, or if the researcher’s focus is on a specific geographical region.
Finally, many trade associations collect information from their members in the form of surveys and then share the results at large in a consolidated format, producing reports on key trends and issues of concern. The most important life science trade organizations are listed below
Biotechnology Innovation Organization (www.bio.org) Innovative healthcare, agricultural, industrial, and environmental biotechnology products
PhRMA (www.phrma.org) Innovative biopharmaceutical research companies
Scientific publications are a very useful source of information when researching market trends and future technologies. They can also be very useful to identify key opinion leaders for future research efforts, such as interviews and focus groups, and they supply valuable information to benchmark technologies.
While there are many publications available, the best approach is to target hubs that consolidate articles.
Main Scientific Article Hubs for Market Research
PubMed (www.ncbi.nlm.nih.gov/pubmed) PubMed comprises more than 26 million citations for biomedical literature
BioMed Central (www.biomedcentral.com) Peer-reviewed biomedical research specializing in open access journal publication
Science Gateway (sciencegateway.org/) Directory of links to biomedical science and research journals
Europe PMC (europepmc.org) Europe PMC also contains patents, NHS (National Health Service) guidelines, and Agricola records
Google Scholar (scholar.google.com) Freely accessible web search engine that indexes the full text or metadata of scholarly literature
Market Research Firms
Large market research firms publish syndicated reports on a regular basis. They collect data from major participants in an industry, consolidating the data and publishing it so that it can be purchased by all interested parties. The reports can be expensive but can be worth it for the time saved by purchasing a completed product (rather than engaging in a costly primary research operation).
Some firms also offer report customization for their bigger clients, enabling some level of personalization. Finally, executive summaries of these syndicated reports are sometimes publicly available, so it is possible to collect some data that can be used for data triangulation.
Competitive Start-Up Research through Specialized Websites
Researching upcoming competitors or new technologies can be accomplished by searching for specialized websites. There are a number of specialized databases that can be consulted for free that are of interest to emerging start-ups to identify early competing technologies or partnership opportunities.
CrunchBase (www.crunchbase.com) Database of innovative companies and entrepreneurs
YouNoodle (www.younoodle.com) Platform for entrepreneurship
AngelList (angel.co) Website for start-ups, angel investors, and job-seekers looking to work at start-ups
BizStats (www.bizstats.com) Business statistics and financial ratios
Blogs are websites or publication streams maintained by individuals and companies to publish information in a more interactive format.
The objective is to generate discussion between the company and its consumers, or between an expert and people interested in a topic. Blogs are also heavily used by consumers and advocacy groups to express opinions, both positive and negative, related to a topic of interest.
Blogs are one of the top trusted sources of information due to their personal nature. A Technorati survey found blogs to be one of the top five trusted sources of information on the web (after news sites, Facebook, retail sites, and YouTube).* But the fact that they are so trusted has led to a deterioration in the quality of information being shared on them.
It is quite easy for bloggers to manipulate (or be manipulated), so the information they share should be taken as an information lead, rather than being positioned as a definitive fact.
As Ryan Holiday mentions in his book Confessions of a Media Manipulator, “The economics of the Internet has created a twisted set of incentives that make traffic (to the blog) more important—and more profitable—than the truth.”
Also, the popularity of blogs has been declining, but they still remain an important information source. While many people debate the usefulness of blogs when so many ways to push information to consumers exist, blogs still have some very important characteristics compared with alternatives (e.g., social media websites such as LinkedIn or using a Facebook Feed).
The most important of these differences is that as blogs can be located directly on an organization’s website, they allow for more detailed posts and have more longevity, as they are available for as long as the organization wishes them to be online.
Researching blogs is a useful way to identify trends, consumer reactions, and key opinion leaders. Blogs from advocacy groups are especially useful to identify concerns from patients, and blogs from key opinion leaders are useful to identify upcoming trends, but care should be taken to measure these opinions against other sources of information.
For the purpose of this book, “social networks” are defined as dedicated websites or applications where users with a common objective gather and participate in discussions. These discussions create a network of social interactions, as users share messages, comments, information, opinions, experiences, and more.
As of 2016, the most popular social networks are LinkedIn (which is mainly used by the business community), Facebook (which focuses more on social interactions), and Twitter (which is used for commenting and quickly diffusing information and opinions).
However, there are hundreds of social networks catering to specialized needs for niche communities.
A very interesting one for researching trends in life sciences is the PatientsLikeMe (www.patientslikeme.com) social network, a website that bills itself as a “free website where people can share their health data to track their progress, help others, and change medicine for good.”
Social networks are a useful way to measure the customer’s pulse on a topic. Analyzing the comments found on social network pages allows the researcher to understand how people view a brand or their perception of a specific topic.
It is also possible to use social networks to recruit participants for surveys and focus groups, as well as to interact directly with them and engage in one-to-one conversations.
The popularity of social networks has led many companies to set up and maintain pages on social networks as well. Some pages are for the company in general, while others are specific product pages. When reviewing these pages, it is possible to collect and collate customer feedback on these specific products.
For example, say you were developing a carrying device for glucose monitoring devices. Using Facebook, you would find the Medtronic Diabetes Facebook page. Reading through the comments, you would find information such as:
I would like to know why the pump clips are made so cheaply now! I am considering gluing it to my pump at this point because I am so tired of it snapping off with the slightest bit of a bump and no it’s not a default in my clip itself I had two sent to me and both are just as bad as one another ... the quality of this product is going down but the prices seem to keep going up and it is truly sad ...
Both of my clips have broken in the last few months. One of them broke while I was sleeping!
I stopped using those plastic ones for that reason
Through more research, you might be able to identify an opportunity for a more sturdy carrying solution, or something completely different from the current technology.
A good model for analyzing data gathered on social media is using a slight variation of the Five Ws (Who / What / Where / When / Why).
What happened? — What was the situation that caused the posters to post messages on social media?
Who is involved? — Who is posting the message? Who else is involved? Is the poster posting for somebody else?
Where was the comment posted?—Where did the poster post his comment? Did he append his comment to a specific post, or did he start a new conversation?
Why did that happen? — Why did he post a message? Why did he use social media instead of a more conventional method (such as calling customer support)?
How can this problem be solved? — What did he hope to accomplish? What are the solutions that would resolve this situation?
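The five questions above can be used as a simple coding sheet: each comment becomes one record, and the records can then be tallied. The sketch below is one possible way to do this, not a prescribed method. The class name `CodedComment` is our own, and the field values are a hypothetical analyst's reading of the three pump-clip comments quoted earlier (for instance, whether a comment was a reply is an assumption).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CodedComment:
    """One social media comment coded with the five questions above."""
    what: str   # What happened?
    who: str    # Who is involved?
    where: str  # Where was the comment posted?
    why: str    # Why did that happen, and why post on social media?
    how: str    # How can this problem be solved?


# Hypothetical coding of the three comments; values are illustrative only.
comments = [
    CodedComment("clip snaps off", "pump user", "brand Facebook page",
                 "replacement clips were just as fragile", "sturdier clip"),
    CodedComment("both clips broke", "pump user", "reply to first comment",
                 "clips broke within a few months", "sturdier clip"),
    CodedComment("stopped using plastic clips", "pump user", "reply to first comment",
                 "clips kept breaking", "alternative carrying solution"),
]

# Tally the suggested solutions to see which need comes up most often.
solution_counts = Counter(comment.how for comment in comments)
for solution, count in solution_counts.most_common():
    print(f"{count} x {solution}")
```

With only three comments the tally is trivial, but the same structure scales to a few hundred comments collected from several pages.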
There are also many online tools that a market researcher can use to monitor customer feedback instead of using a generalist search engine. For Twitter, there are websites such as MentionMapp (http://mentionmapp.com/), which lets you create a map of users and hashtags based on your search, enabling you to identify social key opinion leaders.
Another option is TweetReach (https://tweetreach.com/), an analytical tool that generates real-time stats for any search term by searching through Facebook, Twitter, and Instagram.
Finally, Mention (https://mention.com/en/) does real-time monitoring of both the web and social media, enabling you to do competitive analysis and to find influencers. Social Searcher (https://www.social-searcher.com) is a free social media search engine that looks through available public information in many social networks (such as Twitter, Google+, Facebook, YouTube, Instagram, Tumblr, Reddit, Flickr, Dailymotion, and Vimeo) for the information you need, while also allowing users to save their searches and set up e-mail alerts.
Discussion boards are online discussion forums where users can discuss topics through posted messages, often anonymously. Their main strength is that the messages are usually more detailed than those found in blog post responses, and have longer archival periods.
One of the most useful discussion groups for life sciences is cafepharma.com, where you can find great insights on pharma marketing trends and rumors. Reddit is another good source for finding informal information.
For example, on Reddit in November 2016, someone started a discussion called “What are your thoughts on the Bayer x Monsanto merger?”* Some of the feedback included:
Could be something to do with weed legalization. I know Monsanto has wanted to grow large-scale pot farms, and that means a lot of drugs to process. Bayer has a history of processing drugs, especially for the pain relief sector. I have no proof of this whatsoever. But no one else commented and as a weed smoker this merger turned some lights on for me.
That conspiracy theory just never dies. They have stated (over and over and over) that they have no interest in marijuana.
Fun thought, but actually extremely improbable bumping up against impossible. By this logic, why not get into the tobacco market?
Growing commodity crops, which is what marijuana would be if legal, is not in the interest of a chemical company. Bayer wants to get into agricultural chemicals. More so than they already have and the easiest way to achieve that when you're sitting on mountains of cash is to buy someone.
Let’s be clear, these are rumors, not facts, akin to something a researcher might overhear in a café or a convention. And without data triangulation, these are random statements, and shouldn’t be taken at face value. Nonetheless, what they do provide is a potential clue and a research lead.
These two websites are examples of places where people familiar with the industry converge to engage in discussions. It is possible for a researcher to use these to test a hypothesis or to discreetly ask participants a question to get more potential information.
Search Engines: Tips and Tricks
Searching through the Internet using a web search engine is usually the first step in secondary research. Here are a few tips and tricks to make the research effort more efficient:
Look to the past: Sometimes, a market researcher will be looking for something specific but will conclude that the information is no longer available online. It might be an old press release that a competitor has pulled from their website, information on a previous partnership that was quietly ended, or specifications on discontinued products, for example.
In these cases, the website www.archive.org (the Internet Archive, a non-profit digital library offering free universal access to all) is especially useful. Archived in its public database are historical web snapshots of a company’s website, which can include pages, attachments, and more.
While a researcher might not have access to each version of a company’s website, there are often several snapshots taken throughout the year, enabling the researcher to identify key information that has since been removed from the live website.
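Looking up archived snapshots can also be scripted. The sketch below assumes the Internet Archive's public Wayback availability endpoint (`https://archive.org/wayback/available`), which returns JSON describing the snapshot closest to a requested date. The page address and the sample response values are hypothetical, and the helper names are our own. The code only builds the query and reads a response, so it runs without network access.

```python
from typing import Optional
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url: str, timestamp: str) -> str:
    """Build a query for the snapshot closest to `timestamp` (YYYYMMDD)."""
    params = urlencode({"url": page_url, "timestamp": timestamp})
    return f"{AVAILABILITY_ENDPOINT}?{params}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the URL of the closest archived snapshot, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical competitor page that has since been taken down.
query = availability_query("example.com/press", "20150601")
print(query)

# A response shaped like those the service returns (values are made up).
sample_response = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20150528120000/http://example.com/press",
            "timestamp": "20150528120000",
            "status": "200",
        }
    }
}
print(closest_snapshot(sample_response))
print(closest_snapshot({"archived_snapshots": {}}))  # no snapshot: prints None
```

In practice the query string would be fetched with an HTTP client and the JSON body passed to `closest_snapshot`; repeating the lookup for several dates gives a rough timeline of when content appeared or disappeared.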
Look for corporate web DNA: I originally found this technique referenced by Leonard Fuld in The Secret Language of Competitive Intelligence. It is based on the concept that every organization develops its own brand of corporate speak or pattern.
It is akin to corporate web DNA. Fuld defines it as “a unique pattern of words and phrases that form the substance of a company’s website, its press releases, and its advertisements.”
As such, if the researcher can identify a group of unique words or jargon as potential corporate DNA, he can then search the web for that exact terminology by enclosing it in quotation marks.
As an example, using Medtronic’s “Transforming technology to change lives” slogan to search the web brings up a series of white papers (old and new), job offers (both current and expired), as well as customer testimonials.
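The quotation-mark technique is easy to apply consistently with a small helper. The sketch below wraps a phrase in double quotes so that a search engine treats it as an exact phrase, then builds a search URL. The function names and the extra term "white paper" are our own illustrative choices, and the Google search URL pattern is assumed rather than taken from the source.

```python
from urllib.parse import quote_plus


def exact_phrase_query(phrase: str, *extra_terms: str) -> str:
    """Wrap a phrase in double quotes so it is matched word for word."""
    # Collapse stray whitespace and drop any quotes already in the phrase.
    cleaned = " ".join(phrase.split()).replace('"', "")
    return " ".join([f'"{cleaned}"', *extra_terms])


def google_search_url(query: str) -> str:
    """Encode the query for use in a browser address bar."""
    return "https://www.google.com/search?q=" + quote_plus(query)


query = exact_phrase_query("Transforming technology to change lives", "white paper")
print(query)
print(google_search_url(query))
```

Keeping the slogan in quotes while varying the extra terms (for example "careers" or "testimonial") is a simple way to surface different kinds of documents carrying the same corporate wording.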
Go beyond Google: If you are not finding the information you need, you can try using another search engine to obtain different search results. Some of the interesting alternative search engines include
Bing (www.bing.com) (which is reported to have a better video search option)
Board reader (boardreader.com) (which specializes in users’ points of view by searching through forums, message boards, and Reddit)
SlideShare (www.slideshare.net) (a cornucopia of PowerPoint presentations, slide decks, and webinars from past conferences)
Power Up Google
Google is the starting point for many research projects. Google accounts for over 75% of the global desktop search engine market share, as its simplicity of use and search algorithm for ranking websites in order of importance enables researchers to quickly find the information they are looking for.
When searching on Google, remember that the search engine ranks the first word as slightly more important than the second, and the second slightly more important than the third, and so on. Hence, searching for “North America vaccine trends” and “vaccine trends North America” will generate slightly different results.
Passive Secondary Research
Automated Internet research tools are a boon to market researchers, as they automatically monitor and report on specific information topics. We will be going through some of the most interesting tools that researchers can use to automate their market research.
Rich Site Summary Feeds
Rich Site Summary (RSS) feeds are a simple method to aggregate data generated by specialized websites, and efficiently supply researchers with up-to-date information on specific topics.
Three types of RSS readers exist: web-based readers (which you access through your web browser), such as Feedly (www.feedly.com) and Netvibes (www.netvibes.com); client-based readers, which you download and install on your computer, such as RSSOwl (www.rssowl.org); and those that integrate into your web browser (Firefox and Internet Explorer both offer this option).
The advantages of using RSS feeds for research and continued monitoring are multiple. First, RSS feeds save time: you can quickly subscribe to the feeds you are interested in, and quickly scan aggregated data without having to visit every single website every time.
Also, as RSS feeds update themselves automatically, you get information as it becomes available. Finally, they increase your productivity as you can quickly scan the headlines for the information you are interested in and dig deeper into the topics of interest.
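To make the idea of scanning headlines concrete, the sketch below parses a small RSS 2.0 document with Python's standard library and filters the items by keyword. The feed content, its URLs, and the function name `headlines` are invented for illustration; a real reader would download the XML from the publication's feed address first.

```python
import xml.etree.ElementTree as ET

# A made-up feed in RSS 2.0 layout (channel > item > title/link).
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Trade Magazine</title>
    <item>
      <title>New drug delivery platform announced</title>
      <link>https://example.com/news/1</link>
    </item>
    <item>
      <title>Vaccine trends in North America</title>
      <link>https://example.com/news/2</link>
    </item>
    <item>
      <title>Contract manufacturing capacity expands</title>
      <link>https://example.com/news/3</link>
    </item>
  </channel>
</rss>"""


def headlines(feed_xml: str, keyword: str = "") -> list[tuple[str, str]]:
    """Return (title, link) pairs whose title contains `keyword` (case-insensitive)."""
    root = ET.fromstring(feed_xml)
    matches = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if keyword.lower() in title.lower():
            matches.append((title, link))
    return matches


for title, link in headlines(SAMPLE_FEED, keyword="vaccine"):
    print(f"{title} -> {link}")
```

Calling `headlines(SAMPLE_FEED)` with no keyword returns every item, which mirrors the quick scan of all headlines described above.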
Google Alerts is a service offered by Google. It is useful to automatically monitor a topic by setting up a search alert.
Once set up, the Google Alert sends search results by e-mail to researchers as they occur, or on a predetermined basis in the form of a digest (once a day or once a week), at a predetermined time.
To create a Google Alert, the user only needs to go to the website www.google.com/alerts, type in their topic of interest, and customize the information feed requested (frequency, where the information will be collected from, the number of results wanted each period, and which e-mail will be receiving the information). It is possible to edit an alert if needed or delete the alert if it is no longer needed.
Google Alerts monitors major sites and news media for the topics selected and easily shares the information it collects across social media platforms such as Facebook, Twitter, and Google+.
Web Monitoring Tools
In addition to using automated alerting tools such as Google Alerts and monitoring RSS feeds, it is possible to set up specialized software and online search tools that monitor changes in specific web content (e.g., a competitor’s website).
These tools will alert you to any changes on a company’s website as they occur, making it unnecessary to continually monitor key websites.
Some software can also be configured to simultaneously monitor RSS feeds and newsgroups, making powerful all-in-one solutions. Some of the most popular software in this space includes
Website-Watcher (www.aignes.com, which is software based)
WatchThatPage (www.watchthatpage.com, which is free for individuals)
InfoMinder (www.infominder.com, which is web based)
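The basic idea behind change monitoring can be illustrated with a content fingerprint: store a hash of the page text, then compare it with a hash of the text retrieved on the next visit. The sketch below shows this generic technique only. It is not a description of how the tools listed above work internally, and the page text and function names are hypothetical.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Hash the page text after collapsing whitespace, so layout-only edits are ignored."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(previous_fingerprint: str, current_text: str) -> bool:
    """Compare a stored fingerprint with the text retrieved on a later visit."""
    return fingerprint(current_text) != previous_fingerprint


# Hypothetical text saved from a competitor's product page on three visits.
last_week = "Our product ships in Q3.\nContact sales for pricing."
reformatted = "Our product ships in Q3.   Contact sales   for pricing."
this_week = "Our product ships in Q4.\nContact sales for pricing."

baseline = fingerprint(last_week)
print(has_changed(baseline, reformatted))  # False: only the spacing differs
print(has_changed(baseline, this_week))    # True: the shipping date changed
```

A fingerprint only says that something changed, not what; keeping the saved text alongside the hash allows a line-by-line comparison when an alert fires.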
Social Media Tracking
A number of tools exist that are specifically designed to monitor social media. These tools search through popular media, such as Twitter and Facebook, and user-generated content such as blogs and comments to generate reports that can be used to identify underlying consumer trends.
One of the popular tools in this space is Keyhole (keyhole.co). This monitoring tool keeps an eye on keywords and hashtags across Twitter and Instagram and can be useful to quickly identify popular public key opinion leaders on a topic, who can then be engaged further for market research. It can also be used to identify geographic trends and estimate overall consumer sentiment.
Another useful tool is Warble (warble.co), which monitors Twitter and sends daily e-mail reports directly to the chosen e-mail account. It automates the process of monitoring Twitter, which is pretty important due to the significant mass of content that is generated on it daily.
Twilert (www.twilert.com) is a variant of Warble that, while not free, enables the researcher to monitor words according to various vectors, such as only positive, negative, or neutral tweets. Hence, it already does the first step of the analysis for the market researcher.
Online Collaborative Tools: Factr.com
Factr.com is an interesting take on automated web monitoring, as it adds collaborative capabilities that allow team members to adjust the automated web monitoring tool and comment on the information being collected. Factr.com collects RSS feed data, which is fused into a single collaborative stream.
Factr.com allows teams to collaborate in real time while the platform collects data. As a user sets up a collaborative stream, data is continually collected. Interestingly, Factr.com has been online for a few years now, and some streams have been collecting data for over three years.
While searching through RSS feeds, Factr.com can also learn and make suggestions on new sources to collect data from. It can search for keywords, and users can upload images, files, and reports to the collaborative workspace to increase collaboration.
Factr.com will also analyze tags, consolidate the research data, and generate reports, making it a useful tool to automate some of the more basic research.
Services such as Google Trends track comprehensive search results over long periods of time and are used to identify underlying trends. As such, a researcher can uncover what the potential target audience is searching for online and analyze the information to gain insights on potential customers (especially if the company is targeting end users).
Google Trends is useful to gauge overall interest in a topic by region, show the top queries related to the topic at hand, and identify related topics.
It is a useful tool when identifying other relevant topics when researching a subject. Google Trends can be very useful to research a competitor and get a quick image of its overall popularity. Some other uses can include
Researching the product features that potential clients find important by identifying what search terms are the most popular in relation to the search topic.
Gauging consumer demand by searching which products or features generate the most consumer demand.
Evaluating what consumers are searching for in competing products, by entering a competitor’s name in the topics box and looking at search queries related to the brand.
A Word on Bots and Data Scrapers
One method of collecting data that might be suggested to the researcher is the use of “bots.” Also called “scrapers,” “robots,” or “spiders,” these are types of software that the researcher sets up to run automated tasks to systematically collect data from a competitor’s website that would otherwise be cost-prohibitive to gather.
For example, a researcher could program a “bot” to collect data from a competitor’s website by setting up the bot to query the online web store and pre-order every single permutation of products to identify pricing information as well as price bundling strategies.
Another method would be to set the “scraper” to collect all data from the competitor’s website by copying its HTML code.
The use of bots raises a range of ethical and legal issues. From an ethical perspective, it is frowned upon to use a competitor’s resources in a manner exceeding that of an “average consumer.”
As such, you might contact customer services to ask a few questions, but you should not automate a “bot” to generate hundreds of queries on a website to gather the information you want.
There are also legal considerations: using a bot will often go against a website’s terms of service, many of which explicitly prohibit the use of this type of tool. Courts in many U.S. states have interpreted the terms “without authorization” and “exceed authorized access” in a rather broad fashion, leaving the researcher exposed to lawsuits and legal penalties.
While enforcement varies from one jurisdiction to the next, the researcher should take into account that he has crossed into a very gray area of market research.
Internal Secondary Data
Internal secondary data exists and is stored inside the organization. It is available exclusively to the organization and is usually generated and collected during normal business activities. Internal sources of data should always be investigated first because they are usually the quickest, most inexpensive, and most convenient source of information available.
Sources of internal secondary data include
Sales data: If the organization is commercializing its product, it has access to an invaluable resource, its own sales data. This data is already collected by the organization and organized in a way that is useful and extractable by a market researcher.
Some of the data that the researcher can look at includes sales invoices, sales inquiries, quotations, returns, and salesforce business development sheets.
From this information, territory trends, customer type, pricing and elasticity, packaging, and bundling impact can be inferred. This data can be useful to identify the most profitable customer groups and which ones to target in the future.
Financial data: All functioning organizations have accounting and financial data. This can include the cost of producing, storing, and distributing its products. It can also include data on research and development costs and burn rate.
Internal expertise: Mid-sized organizations will often have inside expertise, that is, personnel who have been with the organization for a while. These people can be tapped and interviewed to get more information on past initiatives, products, lost customers, or any other topics.
Frequently referred to as the organizational memory, they often have a wealth of undocumented knowledge that can be harvested. They might be aware of internally produced reports that might be of use, past projects, or failed product initiatives.
One of the weak points of internal data is inaccuracy, due to the fact it may be dated or the way it was collected. Also, while most of the time data can be ported from internal systems to market research data analysis tools, some legacy systems may make data conversion especially challenging.
Finally, there are confidentiality issues: some companies employing third-party researchers may hesitate to “open the books” and share either consolidated data or limited data sets.
Evaluating Your Secondary Research
Most secondary research, by its nature, is generated by a third party. As such, it is important to verify its accuracy before using it. For this, we propose the use of data triangulation.
Data triangulation is the validation of data through cross-verification of two or more independent sources to ensure its accuracy.
For example, to estimate market growth, a company could use multiple market reports, consolidating them with key opinion leaders’ market growth estimates and information gathered from government websites. Data triangulation is very useful to strengthen the company’ s assumptions going forward.
Data triangulation is especially important when dealing with sources of information that are akin to rumors and hearsay: information found on blogs, social networks, and discussion groups should never be taken at face value but rather used as leads for further research.
It has become so easy to manipulate these types of media (either purposefully, or through lack of rigorous fact-checking), mainly because of the way social media generates revenue— more clicks on the web pages means more views, which means more revenue from ads.
Hence, popularity is a much more important metric than the truth. The information found on social media should never be implicitly trusted.
Of note, it can also be interesting to triangulate primary and secondary data as a way to verify data accuracy. One way to do this would be to collect some key pieces of data and discuss them with key opinion leaders during an interview to get their opinion and perspective.
Finally, when triangulating data, it is crucially important to make sure that the sources used to triangulate do not refer back to the same initial source. Using a series of news articles and blog posts that all refer back to the same initial market report would not enable triangulation.
Likewise, more and more media stories are found to be inaccurate, having implicitly trusted blogs and built stories on inaccurate information.
To illustrate the importance of triangulation, a few years ago, I was updating a total available market/serviceable available market (TAM/SAM) model for a client. He had prepared his own model a number of years earlier and was looking to update it as well as to perform an independent validation.
But we had an issue, as my current estimates were much smaller than his previous estimates. As he believed his market was growing, he couldn't understand why my estimates were so much smaller.
After some research, I found that one of the market estimates that he had used was based on a study that defined his market segment much more broadly than what he was targeting. Because my client had used a single point of reference as his starting point, his model grossly overestimated his market.
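A triangulation check like the one described above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the figures, the source names, and the 25% tolerance are invented and do not come from the client project. The sketch first collapses sources that trace back to the same original report, then flags any estimate that strays far from the median of the remaining independent sources.

```python
from statistics import median

# Hypothetical market-size estimates (USD millions); all values are made up.
estimates = [
    {"source": "Market report A", "origin": "Report A", "value": 240.0},
    {"source": "News article citing report A", "origin": "Report A", "value": 240.0},
    {"source": "Government statistics", "origin": "Agency data", "value": 100.0},
    {"source": "Key opinion leader interview", "origin": "KOL interview", "value": 90.0},
]

def independent_estimates(rows):
    """Keep one estimate per original source so echoes are not counted twice."""
    seen = {}
    for row in rows:
        seen.setdefault(row["origin"], row)
    return list(seen.values())

def flag_divergent(rows, tolerance=0.25):
    """Return sources whose value differs from the median by more than tolerance."""
    mid = median(r["value"] for r in rows)
    return [r["source"] for r in rows if abs(r["value"] - mid) / mid > tolerance]

unique = independent_estimates(estimates)
print(len(unique), "independent sources")
print("Needs a closer look:", flag_divergent(unique))
```

A flagged source is not necessarily wrong; it is a lead to investigate, for example by checking how that report defined the market.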
Limits of Secondary Research
As mentioned in the first few pages of this blog, secondary data has a number of issues that have to be considered. There are issues related to availability, accuracy, trust, and bias.
First, there are issues of availability: not every topic will have secondary research data that exists and fits the needs of the researchers. The more specialized the topic, the more difficult it is to gather data. As such, careful consideration has to be given to how a topic has been defined.
For example, if you are researching the generic manufacturing industry, and a report describes the market size, there could be issues with how the market was defined (does the author include biosimilars?
Does the market study make a distinction between patented drugs and over-the-counter [OTC] drugs?). The researcher has to double-check the definitions used in the secondary data he chooses to use.
Second, there are issues of accuracy or the lack of data verification. A survey that you generate by yourself is verifiable, as you are able to revise your numbers quite easily. And you can trust the data that you generate as you were the one who gathered it. But a secondary market report or a news article cannot be verified easily.
You do not have access to the building blocks and have to rely on the publisher’ s honesty and accuracy. If the author is biased, has been manipulated, or has made an error, the market researcher will be at the mercy of the data that he is using. Media references, especially those that are very close to social networks, should be used as information leads.
Third, secondary data may not be sufficiently detailed, or may be based on such a small sample size that it is not useful. As with a scientific article, checking the sample size on which the author is basing his conclusions is critical when evaluating data. A data reference point without sufficient background on methodology should never be used.
Finally, as the author of market research must be careful to keep his bias in check, it is always useful to check for any potential bias in the data source being used. Some organizations and authors have their own agenda, which would taint the quality of information being shared.
Data analysis is the process of evaluating, transforming, classifying, and modeling data.
Before diving into the analysis, it’ s important to note that the market research process can be iterative. It is possible to “ jump” back and forth between the collection and analysis of data.
For example, during initial data analysis, it is possible to find out that some new research areas or some of your data is insufficient, and a new data collection effort is necessary. The researcher should account for this possibility.
Case in point, a while back, a client had done a web survey to identify some trends related to salary and incentives in healthcare. He had collected an impressive sample (over 3000 entries), and I was brought in to analyze the data.
Nonetheless, during the initial analysis of the data sample, the demographic data revealed that the overall weight of U.S. East Coast respondents was too large compared with the other U.S. geographic regions.
As such, my client had to proceed with a second data collection effort, this time focusing on the other regions if he wanted to obtain a representative sample.
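This kind of imbalance can be spotted early with a quick share-by-region check. The sketch below is only an illustration: the respondent counts, the target shares, and the 10-point margin are assumptions, not the figures from the salary survey.

```python
from collections import Counter

# Hypothetical respondent regions and target shares; both are assumptions.
responses = ["East"] * 60 + ["Midwest"] * 15 + ["South"] * 15 + ["West"] * 10
target_share = {"East": 0.30, "Midwest": 0.20, "South": 0.30, "West": 0.20}

def over_represented(regions, targets, margin=0.10):
    """Return regions whose sample share exceeds the target share by more than margin."""
    counts = Counter(regions)
    total = len(regions)
    return sorted(r for r, t in targets.items() if counts[r] / total - t > margin)

print(over_represented(responses, target_share))
```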
In this blog, we will provide an overview of the data analysis. Our objective is not to turn the reader into a statistical analyst guru, but rather to give him a better sense of how to read the data he has collected, and some tools to better understand it.
As such, we will be going over the basic elements of data clean up, followed by some words on quantitative and qualitative data analysis, and closing with some barriers to effective data analysis.
Initial Data Analysis
Once your data collection is completed, it is quite tempting to immediately start data analysis. After all, the objective of the project is to answer one or more market research questions, and once you have your data in hand, it feels as if the answers are a footstep away.
But taking time to clean up your data is crucial, and this has to be your first step before data analysis. We call this step the initial data analysis phase. While often undocumented in the final report, some researchers state that as much as
80% of the time allocated to the statistical analysis process is actually spent on data cleaning and preparation.* As such, it is essential that your research plan includes a budget (in both time and money) for this initial data analysis.
Note that at this stage, the researcher is not analyzing the data itself to find the answers to his research questions, but rather focusing on the data set to evaluate its validity and adjusting it as needed (while documenting any adjustments that are made). This is a crucial step, especially if more complex data analysis will be done afterward.
There are two main steps to initial data analysis: (1) cleaning up the data (which itself is split into a screening phase and a diagnosis/editing phase) followed by (2) the preparation and formatting of data for analysis.
Cleaning Up the Data
Cleaning up data consists of assessing the data to check for errors and inconsistencies, and then developing solutions to address the errors as well as documenting how you will be handling them.
Before starting to clean up your data, make sure you have backed up the original data set: you will most likely generate multiple different versions of your data throughout the data analysis effort, but you have to make sure that you always have a copy of the original data. That way, if you need to trace back some original data that has been transformed, you can access the original data to do so.
After backing up the data, the next step of cleaning data is to ensure that it is error-free. This is called the screening phase. You have to identify any irregular patterns or inconsistencies, and you have to resolve these issues before moving forward.
For example, you might notice that one of the data columns is empty, which was due to an error in importing the data from the online survey software to the data analysis software.
Other common errors include different variable types in the same column (mixing numeric and alphanumeric values is the most common mistake) and a shifted column due to a conversion error.
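The two screening checks mentioned above (an empty column and mixed variable types) could look something like this in Python. The rows and column names are hypothetical, and a real export would normally be loaded from a file rather than typed in.

```python
# Hypothetical survey export; None marks a value lost during import.
rows = [
    {"age": "34", "region": "East", "score": None},
    {"age": "forty", "region": "West", "score": None},
    {"age": "51", "region": "South", "score": None},
]

def empty_columns(data):
    """Columns in which every value is missing."""
    return sorted(c for c in data[0] if all(r[c] in (None, "") for r in data))

def non_numeric(data, column):
    """Row positions whose value in a numeric column cannot be parsed as a number."""
    bad = []
    for i, r in enumerate(data):
        try:
            float(r[column])
        except (TypeError, ValueError):
            bad.append(i)
    return bad

print("Empty columns:", empty_columns(rows))
print("Non-numeric ages at rows:", non_numeric(rows, "age"))
```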
Once you are satisfied that the database is error-free, you can start familiarizing yourself with the data, identifying more specific issues and correcting them, as well as removing incorrect entries: as a rule of thumb, no more than 5% of the total number of data samples should be rejected.
This is the diagnosis and editing phase of the data and includes the following components:
1. Make sure data from multiple data sources is integrated in a consistent pattern; for example, if there were multiple people doing interviews, you have to make sure the data is entered the same way for each interviewer.
2. If, before data collection, the market research plan was to remove incomplete surveys, they should be removed at this time.† If no decision was made, look at your minimum sample size; if counting these incomplete survey responses is enabling you to reach your minimum survey size, consider whether you want to (and can) do a new data collection effort.
If it is not possible (e.g., due to time constraints), any incomplete responses included should be cleaned up to avoid any compilation issues.
3. Check for inconsistencies in the data, such as duplicate records or multiple answers for the same Internet protocol (IP) address.
4. Apply any consistency checks that were built into the survey, removing inconsistent entries.
5. Exclude any unneeded data, using the predetermined inclusion or exclusion criteria: for example, if the study was designed to generate a precise image of the U.S. market, remove any non-U.S. responses that you have collected.
6. Identify the validity of outliers in the quantitative data set. These might invalidate statistical models or create inconsistent results during analysis, so it is important to verify outliers individually. Furthermore, how these outlying values were identified and handled needs to be documented.
7. Scan qualitative answers to your survey to remove entries with unusable data. For example, some participants will answer “ awesome” to every single qualitative question, while others will take a moment to complain about the company sponsoring the survey, or the survey itself.
As a rule of thumb, I expect that anywhere from 2% to 3% of survey responses will need to be removed due to unusable answers.
8. Double-check group labels and consistency in naming for category variables; while this might sound trivial, it is a time-saver to take the time to format and correctly name data labels, especially during the data analysis phase.
Once these steps are done, you should check if the data sample size is lacking, sufficient, or excessive: it is possible that after cleanup, the sample size will no longer be the necessary size to reach statistical significance and a new phase of data collection will be needed.
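Several of the steps above (incomplete surveys, out-of-scope responses, duplicate IP addresses, and documenting each removal) can be sketched as a single pass over the records. The field names and the five sample records are assumptions chosen for illustration; the order in which the rules are applied is also a choice, not a requirement.

```python
# Hypothetical survey records; field names are assumptions.
records = [
    {"id": 1, "ip": "10.0.0.1", "country": "US", "complete": True},
    {"id": 2, "ip": "10.0.0.1", "country": "US", "complete": True},   # same IP as id 1
    {"id": 3, "ip": "10.0.0.2", "country": "CA", "complete": True},   # outside study scope
    {"id": 4, "ip": "10.0.0.3", "country": "US", "complete": False},  # incomplete survey
    {"id": 5, "ip": "10.0.0.4", "country": "US", "complete": True},
]

def clean(rows):
    """Apply the exclusion rules and log the reason for each rejected record."""
    kept, log, seen_ips = [], [], set()
    for r in rows:
        if not r["complete"]:
            log.append((r["id"], "incomplete"))
        elif r["country"] != "US":
            log.append((r["id"], "non-U.S. response"))
        elif r["ip"] in seen_ips:
            log.append((r["id"], "duplicate IP"))
        else:
            seen_ips.add(r["ip"])
            kept.append(r)
    return kept, log

kept, log = clean(records)
rejection_rate = len(log) / len(records)
print(f"Kept {len(kept)} of {len(records)} records; rejected {rejection_rate:.0%}")
for record_id, reason in log:
    print(record_id, reason)
```

In this toy sample the rejection rate is far above the 5% rule of thumb, which in a real project would be a signal to review the collection method before going further.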
Preparation of Data
The final step of cleaning data is the preparation of the data for more advanced analysis. While producing summary tables can be useful, the researcher might decide to adjust the variables for advanced analysis.
Some of the data can be used in its unprocessed form, but it might be useful to rescale or standardize categories, transforming some data into a single variable, combining them into summary scores, or using more complex functions such as ratios.
As such, you might need to create new variables that will specifically address your research topic (by creating indices from scales or combining data sets that are too small to be useful categories) and format the variables so they can be processed correctly by the data analysis software.
This can include setting a value for missing data codes so that your software can handle them and correctly formatting data variables (so you don’ t have any numeric and alphanumeric mix-ups).
Demographic data is often the subject of aggregation, specifically variables related to geographic location, age, and educational level.
For example, during data collection, a researcher might have collected the state and city of each respondent, but might later realize that this level of granularity is not useful for his research objective, and decide instead to recode the data according to major geographic regions.
It is also possible to recode data if a subgroup is too small to generate useful information.
For example, if you did a web survey of 3000 participants, and fewer than 1% of participants were aged 65 and older, it might be useful to aggregate them into the closest category and redefine it: you could combine the data from the 55–64 years category with the 65 and more category into a single “55 and more” category.
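Recoding of this kind is simple to express in code. In the sketch below, the bracket labels and counts are invented for illustration; the function merges a small subgroup into its neighbor and gives the combined group a new label.

```python
from collections import Counter

# Hypothetical age brackets collected in a survey.
ages = ["25-34"] * 40 + ["35-44"] * 35 + ["55-64"] * 24 + ["65+"] * 1

def merge_small_group(values, small, into, new_label):
    """Recode two brackets into a single, renamed bracket."""
    return [new_label if v in (small, into) else v for v in values]

recoded = merge_small_group(ages, small="65+", into="55-64", new_label="55+")
print(Counter(recoded))
```

Keep the original column alongside the recoded one, so the transformation can be traced back if needed.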
Specific Issues Relating to Cleaning Up Qualitative Data
Qualitative data analysis is more complex than analyzing quantitative data. It is often produced on different media (written notes, sound or video recordings, pictures), which first require transcription into electronic media.
The transcription should be done word for word, not just summaries as summaries imply the initial transformation of data: if the transcriber is summarizing data, he is introducing some bias by choosing what information to include, which words to summarize, and so on. Some other things to consider:
The transcription document should include nominative information, such as who performed the interview, who was interviewed, as well as the time, date, and location the interview took place.
During transcription, non-verbal cues that were noted by the interviewer should be included in the transcript, as these could prove informative later on: mark them distinctively so you can quickly identify them.
If possible, try to transcribe the interviews yourself (or have someone who attended the interview or focus group transcribe the information), so you can add any thoughts or ideas that emerge during transcription.
Even if you are trying to be as faithful to the original text as possible, you can edit out verbal tics such as “ you know” or “ it’ s like,” as well as fillers (such as “ uh,” “ hmm,” and more) as these can prove quite distracting during analysis.
Data Analysis: Quantitative
Quantitative data refers to data that can be measured and numbered. Counting the number of potential clients for a product, calculating the number of products or doses a consumer uses each day, or the average distance a patient is willing to travel to visit a specialized clinic are all different types of quantitative data.
Quantitative data is easier to share since the results are easy to understand and easy to explain. There is a perception that quantitative data is “ hard data” (vs. qualitative data that is perceived as “ soft data” ). Hence, market research often has a large quantitative component to convince the target public.
Explaining how to analyze quantitative data is an ambitious endeavor, and we cannot give a complete overview of the process in the limited space we have allocated to it here.
This section starts with a presentation on the two different types of analysis: descriptive and inferential analysis.
This is followed by some presentations on univariate and bivariate data analysis models, and some simple tests you can use to verify the significance of the results, such as correlation and regression models, as well as some interesting models that can be used to analyze your data.
We conclude this section with a description of computer software packages that are available to analyze quantitative data.
Overview of Descriptive and Inferential Analysis
There are two types of quantitative analysis: descriptive and inferential analysis. When doing descriptive analysis, the information describes or summarizes the data that was collected, whereas inferential statistics are used to make inferences from the sample data for a whole population.
Descriptive analysis is used when the researcher is analyzing his data so he can identify patterns in the population he is studying. Contrary to inferential analysis, the descriptive analysis will summarize the data in the sample itself, rather than inferring that it applies to a whole population.
The advantages of doing descriptive analysis are multiple: first, doing descriptive analysis makes vast amounts of data both more manageable and organized. It is also straightforward, can be used to further research ideas, and lays the groundwork for more complex analysis.
In this blog, we will be looking at descriptive analysis from a univariate analysis angle (with a focus on frequencies of variables, central tendencies, and dispersion) and a multivariate analysis angle (with emphasis on the relationships between variables).
The restriction of descriptive analysis is that it is limited to the data you are handling. It is used to report and describe data that you collected, but cannot be used beyond the current data.
The researcher cannot use the data to interpret phenomena beyond the population from which the sample was taken. For example, if you tested how a product is used by a sample population, you could not consider the findings representative of the population at large using exclusively descriptive analysis.
Inferential statistics would be needed. So, while descriptive statistics are used to describe what is happening in the sample population alone, inferential statistics are used to infer what the population beyond the data sample is thinking.
Hence, inferential statistics are techniques we use with our data set to make generalizations about the populations from which the samples were collected.
The limitation of inferential analysis is uncertainty. While calculating the proportion of a sample in a specific situation is absolute, projecting it onto a population means that it becomes an estimate: building this estimate requires the researcher to make educated guesses, and this uncertainty will increase the likelihood of errors.
There are a number of techniques that market researchers use to examine the relationships between variables, thereby creating inferential statistics. We will look at some of the most valuable ones in market research: correlation, regression analyses, and general linear models.
Univariate analysis is the simplest way to analyze data. All that it entails is the analysis of a single variable, which can be categorical or numerical.
Since it is not being compared to one or more other variables, it is very descriptive in nature. Univariate analysis cannot be used to explain a phenomenon or to contextualize a variable in relation to other variables.
Once data is compiled, it can be shared in a simple table form, as well as in visual representations such as bar charts, histograms, frequency polygons, or pie charts.
The most frequent univariate analysis methods are frequencies of variables, central tendencies, and dispersion.
Frequency Distribution, Central Tendency, and Dispersion
The frequency distribution is a simple data analysis technique that allows the researcher to get a quick impression of the data he has collected. Using frequency distribution, he can see how often the specific values are observed and their percentages compared to the overall sample.
As such, frequency distribution enables the researcher to summarize each variable independently, and also makes it easier to subsequently engage in more complex analysis.
To illustrate frequency distributions, we will use data from a project I did a few years back. It was for a company in the dental health space.
This client had developed an innovative dental product, and was looking to evaluate how much interest dentists would have in the product, according to various distribution models. Over 100 dentists (current and past clients) participated in an online survey. The survey included both quantitative and qualitative questions.
To validate interest in the innovation, the client had asked the following question: “ How likely would you be to use [product description] if you had easy access to it and the turnaround times described above? (i.e. less than 90 minutes)”
The data was compiled and illustrated in both a table (which is more detailed and descriptive) and a graph (a more visual format to share data).
Based on this univariate analysis, there seemed to be some strong interest in the current client base, as 86.0% of participants were either “ somewhat likely,” “ likely,” or “ very likely” to use the product developed by my client.
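For readers who want to build a frequency distribution themselves, here is a minimal Python sketch. The answers are hypothetical and are not the dental survey data; the point is only to show counts and percentages per value.

```python
from collections import Counter

# Hypothetical answers on a likelihood scale; not the survey data from the text.
answers = ["very likely"] * 4 + ["likely"] * 3 + ["somewhat likely"] * 2 + ["unlikely"] * 1

def frequency_table(values):
    """Return (value, count, percent) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(v, n, round(100 * n / total, 1)) for v, n in counts.most_common()]

for value, count, pct in frequency_table(answers):
    print(f"{value:<16}{count:>3}{pct:>7}%")
```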
There are many ways to find the central tendency of a data sample. You can use either the mean (the sum of the values divided by the number of effective values), the mode (the most frequently occurring value), or the median (the value separating the upper half of a data sample from the lower half).
This is the traditional method of presenting scoring data and giving a sense of the general appreciation in a population (e.g., statements like: “ participants in our survey gave an average of 7 out of 10 for the overall look of our new product label” ).
Using a mean is the most popular way to calculate central tendency, as it includes all the values of your data sets, but it is particularly sensitive to the presence of outliers. The outlier can skew the mean, rendering the data unusable. Another reason for not using the mean is if your data itself is skewed:
if, for example, half your sample would be willing to pay $10 for your product, and the other half would be willing to pay $20, a mean of $15 would not reveal that you probably have two distinct market segments in your data rather than a single monolithic segment. As for modes, these demonstrate the most popular option in the sample.
The problem when using the mode is when it is not unique: if two values share the highest frequency, it can be challenging to interpret the data.
Another issue with the mode is that it does not give a good read of the central tendency: if you asked each participant to give a score of 1– 10 on the quickness of a product, and then half responded 1 out of 10, the mode would not be representative of the central tendency.
The median is not affected by the presence of outliers, so it gives a better idea of a typical value. It is also better to use when the distribution of the sample is skewed.
The main issue when using the median (rather than the mean) is that it is less popular as a way to communicate the central tendency, so people might not be familiar with its meaning.
Also, a median is more complex to calculate if you are not using software, which honestly should not happen too often if you are preparing a market report.
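The behavior of the three measures is easy to see with Python's built-in statistics module. The numbers below are hypothetical: a set of scores with one extreme value added, and a willingness-to-pay sample split evenly between two price points, as in the $10 and $20 example above.

```python
from statistics import mean, median, multimode

# Hypothetical scores out of 10, with one extreme value added.
scores = [6, 7, 7, 8, 7, 6, 8]
with_outlier = scores + [100]

# Hypothetical willingness to pay, split evenly between two price points.
prices = [10, 10, 10, 20, 20, 20]

print("Mean without / with outlier:", mean(scores), mean(with_outlier))
print("Median without / with outlier:", median(scores), median(with_outlier))
print("Mean price:", mean(prices), "Modes:", multimode(prices))
```

The single outlier drags the mean from 7 to more than 18 while the median stays at 7, and the two modes reveal the two segments that the mean price of 15 hides.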
Dispersion is the calculation of the distribution of values around a central value. The three main measures of dispersion that you can calculate are the range (the distance separating the highest and lowest value), the variance (the average of the squared differences from the mean), and the standard deviation (which is the square root of the variance and measures how much the numbers are spread out).
As an example, a project I worked on surveyed over 5000 individuals on their Internet usage. One of the questions asked to participants was to estimate time spent online.
Across all age groups, the average was 19 hours, with a standard deviation of 12 hours. This meant that most of the data sample had answered between 7 and 31 hours spent online (the average minus and plus one standard deviation), giving us a rough idea of the time spent online for the surveyed population.
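The same three measures can be computed in a few lines. The hours below are hypothetical and are not the Internet usage survey data. The sketch uses the population formulas (pvariance and pstdev), which match the definition of variance given above; the sample versions (variance and stdev) divide by n − 1 instead and are the usual choice when the data is a sample of a larger population.

```python
from statistics import mean, pstdev, pvariance

# Hypothetical weekly hours spent online; not the survey data from the text.
hours = [5, 10, 15, 20, 25, 30, 35]

value_range = max(hours) - min(hours)
variance = pvariance(hours)   # average of the squared differences from the mean
std_dev = pstdev(hours)       # square root of the variance
low, high = mean(hours) - std_dev, mean(hours) + std_dev

print("Range:", value_range)
print("Variance:", variance, "Standard deviation:", std_dev)
print("One standard deviation around the mean:", low, "to", high)
```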
Multivariate analysis is the concurrent analysis of two or more datasets. While univariate data analysis is used to describe a phenomenon, multivariate analysis is used to explain the phenomenon.
The objective is to compare two (or more) groups, and to identify relationships between the multiple data sets. The analysis exposes how different subgroups respond to some query. The simplest method of doing multivariate analysis is cross-tabulation (or contingency tables), while more complex tools include correlation and regression analysis.
A simple way to do multivariate analysis is to build contingency tables, which are multidimensional frequency distributions in which the frequencies of two (or more) variables are cross-tabulated. These simple tables can help the researcher find the relations between different variables.
Contingency tables are called pivot tables in Excel. A useful guide to creating a pivot table to analyze worksheet data can be found directly on the support.office.com website. For ease of use, you can follow the abbreviated link: goo.gl/XykXIv.
To properly illustrate a contingency table, we are sharing a table illustrating bivariate analysis. The data for this table was for a project I did a few years back, the theme of which focused on “ what women want.”
Surveying Australian mothers, one of the topics the survey dealt with was identifying how mothers got information relating to their pregnancy. This specific table cross-tabulated data from two questions:
a. Did you use this information source to get more information relating to your pregnancy? (Check all that apply.)
b. What is your age?
At a glance, there are four apparent key learnings we get by doing a cross-tabulation:
a. It seems that women younger than 39 years old are much more likely to use online resources to look for information than their older counterparts.
b. It seems that women younger than 39 years old are much more likely to consult their mothers for advice than their older counterparts.
c. Only half of the participants contacted the Department of Health to get information relating to their pregnancy.
d. There is an overwhelming use of pregnancy blogs and manuals across all age groups.
Overall, a contingency table is an effective and simple way to find underlying information that univariate analysis would not have been able to discern by using the data from two distinct data sets.
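Outside of Excel, a basic contingency table can be built by counting pairs of answers. The respondents below are hypothetical and do not reproduce the Australian survey; the age groups and information sources are placeholders.

```python
from collections import Counter

# Hypothetical respondents: (age group, information source used); values are made up.
respondents = [
    ("under 39", "online"), ("under 39", "online"), ("under 39", "doctor"),
    ("39 and over", "doctor"), ("39 and over", "doctor"), ("39 and over", "online"),
]

def crosstab(pairs):
    """Count how often each (row, column) combination occurs."""
    return Counter(pairs)

table = crosstab(respondents)
rows = sorted({r for r, _ in respondents})
cols = sorted({c for _, c in respondents})
for r in rows:
    print(r, {c: table[(r, c)] for c in cols})
```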
Correlations are an extension of a cross-tabulation analysis. While building a contingency table might uncover a potential relationship between multiple data sets, correlation analysis is necessary to confirm the relationship between the variables.
To confirm the existence of a correlation between two variables (as well as regression and general linear relationships), the use of specialized software such as SPSS or JMP is recommended, although it is possible to use Excel as well.
When calculating correlation, for each pair of variables, the software will provide a value ranging from −1 to 1. The closer the value is to 1, the stronger the positive relationship between the two variables. The closer it is to zero, the weaker the relationship between the two variables.
If you obtain a correlation close to −1, this means that your two data sets are negatively correlated, meaning the higher that one value goes, the lower the second one goes. As a rule of thumb, if you obtain a correlation value of 0.5 or higher (or −0.5 or lower), this means that there is a noteworthy relationship.
This becomes interesting as you start to calculate the strength between multiple sets, and you can statistically see which relationships are stronger and determine key elements or those that are important for the participants in your research.
There are two major caveats to calculating correlation. First, correlation is not causation. This means that you might have found that two data sets have a relationship, but there is no proof that one element is causing the other.
There could be multiple reasons why your data is correlated. Second, the usual correlation coefficient only measures linear relationships: two variables can be strongly related in a non-linear way and still produce a correlation close to zero.
A good alternative if you believe that there is a relationship between two data sets, but that your calculations or correlations don’ t support it, is to do a scatter plot of your data so you might be able to visualize a non-linear relationship.
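To make the linearity caveat concrete, here is a small sketch that computes the Pearson correlation coefficient by hand. The data is invented: two perfectly linear series and one curved series built from the same x values.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [-2, -1, 0, 1, 2]
linear_up = [2 * v + 1 for v in x]      # rises with x
linear_down = [-3 * v for v in x]       # falls as x rises
curved = [v ** 2 for v in x]            # clear relationship, but not linear

print(pearson(x, linear_up), pearson(x, linear_down), pearson(x, curved))
```

The curved series depends entirely on x, yet its coefficient is zero, which is exactly the situation where a scatter plot tells you more than the number does.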
Regression analysis is another way to estimate the relationship between two variables. It is used to see how an independent variable affects a dependent one.
Some prefer it to correlation since it statistically demonstrates the goodness of fit (adjusted R Square). Regression analysis can also be used for forecasting, using the regression equation as a simple linear model.
One of the main differences between correlation analysis and a regression analysis is that while correlation analysis quantifies the relationship between variables, a regression analysis can be used to predict the relationship between variables, especially if one of the two variables is one that you manipulate (e.g., time or price).
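A simple linear regression with one independent variable can be sketched as follows. The prices and units sold are hypothetical. Note that the function returns the plain R squared rather than the adjusted R squared that statistics packages also report, and that the forecast at a price of 20 extrapolates beyond the observed range, so it should be read with caution.

```python
def linear_fit(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x, with R squared."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical price points and units sold; values are made up.
price = [10, 12, 14, 16, 18]
units = [100, 92, 79, 71, 58]

slope, intercept, r_squared = linear_fit(price, units)
forecast_at_20 = intercept + slope * 20
print(f"units = {intercept:.1f} + ({slope:.2f}) * price, R squared = {r_squared:.3f}")
print("Forecast at a price of 20:", forecast_at_20)
```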
As a rule of thumb, the guidelines for the interpretation of your regression factors.
General Linear Model
General linear models are used to assess the effect of one or several predictors on one or more continuous dependent variables. They include tests such as the t-test (used to examine the difference between the means of two independent groups) and the ANOVA test (used to test the significance of differences between the means of two or more groups).
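To show what these two tests actually compute, here is a minimal sketch of the t statistic and the one-way ANOVA F statistic. The ratings are invented, the function names are my own, and I am assuming the classic pooled-variance version of the t-test. The sketch stops at the test statistic; statistics software would also convert it into a p-value.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(a, b):
    """Student's t statistic for two independent groups (pooled variance)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))

def anova_f(groups):
    """One-way ANOVA F statistic for two or more groups."""
    values = [v for g in groups for v in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((len(g) - 1) * variance(g) for g in groups)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical product ratings from three customer segments (invented)
segment_a = [4, 5, 6]
segment_b = [7, 8, 9]
segment_c = [5, 6, 7]

t = t_statistic(segment_a, segment_b)
f_two = anova_f([segment_a, segment_b])
f_three = anova_f([segment_a, segment_b, segment_c])
print(f"t = {t:.3f}, F (two groups) = {f_two:.2f}, F (three groups) = {f_three:.2f}")
```

With exactly two groups, the F statistic equals the square of the t statistic, which is one way to see that both tests belong to the same family.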
Software for Quantitative Analysis
There are two types of software that you can use for analyzing quantitative data. First, it is possible to do a lot of basic and even some intricate analyses using spreadsheet software such as Excel.
Spreadsheets can calculate anything from simple averages (with the AVERAGE and MEDIAN functions) and frequencies (using the FREQUENCY function) to more complex calculations such as standard deviations (STDEV), correlations (using the CORREL function), or even trends (using the TREND function, calculating regressions).
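For readers who prefer scripting, the basic spreadsheet calculations have rough equivalents in Python's standard library. The sketch below is only an approximate mapping: the ages are invented, and the frequency helper is my own function written to imitate how a spreadsheet frequency count buckets values (each value goes in the first bin whose upper limit it does not exceed, with one extra bucket for anything above the last limit).

```python
from bisect import bisect_left
from statistics import mean, median, stdev

def frequency(values, bins):
    """Count values per bin; `bins` holds ascending upper limits."""
    counts = [0] * (len(bins) + 1)   # last slot: values above the top limit
    for v in values:
        counts[bisect_left(bins, v)] += 1
    return counts

# Hypothetical ages of six survey participants (invented)
ages = [18, 22, 25, 25, 31, 40]

average_age = mean(ages)
median_age = median(ages)
spread = stdev(ages)                 # sample standard deviation
age_groups = frequency(ages, [20, 30])
print(round(average_age, 1), median_age, round(spread, 1), age_groups)
```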
The main advantage of using a program like Excel is that many users are already familiar with spreadsheets, so the learning curve to using it is smooth. Also, integrating data from other software (such as online websites used to collect survey data) is pretty straightforward.
Finally, as you are already formatting your data in your spreadsheet, preparing your data for presentation models (tables or graphs) is streamlined. As such, spreadsheets are a good alternative for companies that do not engage in statistical analysis on a regular basis.
For more complex analysis, or if you engage in quantitative analysis on a regular basis, obtaining a statistics software package is a necessity. There are a number of them available, but here are a few of the more popular ones.
One of the market leaders is IBM SPSS Statistics (www.spss.com). It is a powerful statistics software package used for statistical analysis in academic and business environments.
It is quite popular with scientists and marketers alike and contains a full suite of tools that are useful for any researcher. It has a wide range of charts and graphs to choose from and efficient access to statistical tests. While powerful, it is one of the costlier solutions out there.
SAS (www.sas.com) is another prevalent analytical software program that is used for advanced analytics and analysis. It is positioned as one of the most powerful software programs available but is quite complex to program and utilize.
It is one of the hardest software programs to learn but offers great data management and the ability to work with multiple files simultaneously.
A less expensive option, albeit still powerful, is JMP (www.jmp.com). Like SPSS, JMP has a number of options for data analysis and produces a full range of charts and graphs. It is very straightforward to use and easily imports data from most formats, including SPSS tables.
There are also a number of free statistics software packages available, but be aware that the learning curve is steeper than for some of the commercial packages available.
One of the popular free alternatives is PSPP (http://www.gnu.org/software/pspp/pspp.html), which, while not as powerful as the commercial alternatives, does have a similar look and feel to SPSS, which makes the transition easier. It also has an easy point-and-click interface.
Qualitative Data Analysis
Qualitative data is information that is subjective and subject to interpretation. It can include anything from stories to words, observations, pictures, and even more peculiar sources such as songs and poems.
Examples of information you can collect through qualitative analysis include personal preferences in consumer purchases, the impact of quality on customer purchasing decisions, or the influence of packaging color on acquisition decisions.
Data collected through interviews, focus groups, Delphi groups, and observation is usually qualitative in nature, but as we will see, it is possible to codify the collected data so it can be analyzed using quantitative data tools.
Qualitative research is particularly useful when the objective is to find out what people are really thinking, identify future trends, and speculate on competitive threats. Strategic thinking and careful analysis are needed to identify those trends.
In the next few pages, we will go over a proposed framework for qualitative data analysis, followed by a discussion of computational tools available to enhance qualitative data analysis.
Qualitative Data Analysis Process
Quantitative data has the advantage of being simple to analyze. It is often collected in a format that comes pre-coded and ready to analyze, whereas qualitative data needs to first be codified and labeled before proceeding with the analysis.
To be able to code qualitative data, the researcher will need to build a classification framework, following a process to categorize verbal or behavioral data to enable the classification, summarization, and tabulation of the data.
The four steps of the basic qualitative data analysis process are (1) familiarization, (2) identifying a framework, (3) sorting the data into the framework, and (4) using the framework to complete the descriptive analysis.
Some researchers suggest doing an early analysis step before diving into the complete qualitative data analysis. This implies analyzing some qualitative data during data collection.
They believe that it is an effective method to optimize ongoing research and generate new ideas for collecting better data. It can also give you an early idea of what your analysis framework will look like.
I believe this should be avoided in studies that are exploratory in nature, to ensure that the first few interviews do not direct you down a path that hasn’t yet been validated.
For example, if you do an interview with a few doctors to explore the important factors for signing up for your new service, and they focus on the login and payment screen, there is no indication that these concerns are valid throughout your population, and focusing on them too early in the study might bias the final outcome.
Step One: Familiarization
The first step of qualitative data analysis is to familiarize yourself with your data and what it looks like, and then start to visualize the necessary effort for data analysis.
This means relistening to multiple interviews, reading multiple interview transcripts, and reviewing open-ended survey answers as well as rereading the notes you collected during the research effort.
If you worked as part of a team, a debriefing meeting will be useful to discuss high-level impressions as well, but be careful that these discussions don’t bias the next analysis steps.
Step Two: Identifying a Framework
The qualitative data framework is the coding plan that you will be using to organize your information. During this step, you will be identifying high-level data patterns that you will use to build your first framework.
This framework will consist of codes, which are tags or labels that are used to assign meaning to qualitative information. They are usually attached to pieces of text, such as words, expressions, phrases, and even paragraphs. They can take the form of a simple word, or can be more complex, in the form of expressions and metaphors.
These codes are used to create meaning through clustering: the more often a code appears during analysis, the more significant the concept behind it. These code clusters are then used to analyze data, identify patterns, and generate recommendations or concepts.
When developing your coding, it is possible to use three types of codes:
1. Description codes: Codes that describe the phenomena, and require very little interpretation.
2. Interpretative codes: When you have a bit more knowledge about what you are researching, you can start assigning more complex coding by using codes that integrate some interpretative elements.
3. Pattern codes: These are codes that you can start using when some of the more complex elements such as patterns, themes, and causal links are identified in more depth. This might require you to go over some earlier analysis to verify if some codes need to be updated.
There are multiple methodologies that can be used to build a framework and start identifying your codes, but we will be going over three of the more common ones. These are manual iteration, automated text analytics, and directed content analysis.
Method 1: Manual iteration
Manual iteration is the most complex and lengthy method to develop a framework, but it is the one least likely to result in you having to go over your data twice.
The first step is to read through a few records to get a high-level appreciation of which phenomena are occurring most often. If the bulk of your qualitative data is interviews, review a few transcripts and start identifying key trends. If it’s focus groups, read through the transcripts and moderator notes to identify those common elements.
If it’s a survey with qualitative open-ended questions, read the first 100 responses, and identify keywords, ideas, and thoughts in a separate column of the data analysis tool that you are using.
Afterward, go through the new column of keywords that you generated to identify patterns and build a framework for analysis. If a consistent framework cannot be found, read a few more interviews, or a second group of 100 survey responses, and then look through your notes to identify that pattern.
To illustrate the process in an online survey, I have included a short extract from a project I analyzed a few years back.
To give some context, my client, a company active in event organization and media platforms targeting the Generation Y public, had organized a survey targeting its subscribers. One of the issues they investigated was the purchasing patterns for healthy drinks.
Hence, they had asked participants to think about the last drink they had purchased, and the reason that had led to that purchase over another competing brand or product. My client had collated the data and brought me in to analyze it.
Hence, the first step was to build an analysis framework. I chose to do it through manual iteration since the sample was quite large and complex, and the context was exploratory. To demonstrate, I have included a few statements taken from the survey, followed by some tags that I felt were suitably related.
Other patterns were identified around themes such as “locally made,” “great website,” “celebrity endorsement,” “advertising,” and more. Not all of these keywords were kept in the final version of the framework: some keywords, which were present in the first few entries, had very little presence in the survey responses overall and were discarded.
Hence, since the process is iterative, some keywords that are identified in the first step might be dropped later on, while others are added during the more in-depth analysis. Some overlap might also occur, as well as splitting up some keywords of your framework.
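The pruning of keywords described above can be sketched as a simple tally. Everything in this example is invented for illustration (the tags, the handful of responses, and the cut-off of two mentions); in a real project the threshold is a judgment call that depends on your sample size.

```python
from collections import Counter

# Hypothetical keyword column: the tags noted next to each survey response
tagged_responses = [
    ["taste", "price"],
    ["friend's recommendation"],
    ["price", "brand"],
    ["taste"],
    ["celebrity endorsement"],
    ["price", "taste", "healthy"],
]

MIN_MENTIONS = 2  # assumed cut-off; adjust to the size of your sample

counts = Counter(tag for tags in tagged_responses for tag in tags)
framework = sorted(tag for tag, n in counts.items() if n >= MIN_MENTIONS)
dropped = sorted(tag for tag, n in counts.items() if n < MIN_MENTIONS)

print("Keep for now:", framework)
print("Candidates to drop or merge:", dropped)
```

Since the process is iterative, I would rerun this tally after each new batch of responses rather than treat the first result as final.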
You might note that there is a high level of subjectivity in this process. While I use “friend’s recommendation” to denote the phenomenon of “a purchase due to a recommendation of a non-family-related person,” you might prefer using the code “third-party recommendation” to widen the definition to “a purchase due to a recommendation of a third party.”
This is all part of the strength and weakness of qualitative data. Future iterations will enable you to refine your framework: you might find that there is enough of a distinction between friends and parents to warrant a second category. Or you may not. Flexibility is the key at this stage.
This is also the reason why two researchers need to work in tandem on some of the bigger projects. By going over the work independently, and then combining the results afterward, the researchers are able to develop some level of consistency.
The framework that you build will allow quicker analysis of the responses that you have collected. By the end of this step, you will have identified a series of keywords and will be ready to advance to step three.
Method 2: Automated text analytics
This method can be used with the qualitative analysis of data sets for which the answers are short yet in high quantity, such as qualitative answers in an online survey. The idea is to use a word cloud generator.
Word cloud generators are platforms that count the number of times each word appears in a data sample, assign it a relative density, and then generate a visual representation according to the word density of the top words.
There are many free word cloud generators that exist online such as WordItOut (worditout.com/word-cloud/create) and WordCloud (www.jasondavies.com/wordcloud/), which you can use to generate your own word cloud for both analysis and presentation purposes.
After choosing your word cloud generator, copy a random number of entries for the topic you are analyzing into the text box. The platform will generate a word cloud, giving more prominence to the words that appear more frequently.
These words can be used as a basis for the first framework for data analysis. Be advised that while this method is much easier and faster than manual iteration, this shortcut will likely overlook some keywords, and some revision of your framework might be needed later on.
To illustrate, using the same data set as used in the first method, we generated a word cloud using 100 random answers from participants. The example was generated using the WordCloud website.
Remember to exclude common words to increase the visual effect of the word cloud. In this case, we excluded words such as “because,” “around,” and other common words and articles that are generally not important.
A quick review of the word cloud lets us appreciate the emphasis that participants place on the quality of the product, the brand name, the reputation, recommendations, and reviews, for example.
Using the word cloud, we can start the next step of analysis with a half-dozen or more code words such as quality, price, reviews, friend’s recommendation, brand, taste, trust, and healthy, for example.
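Under the hood, a word cloud generator is essentially counting words after removing the common ones. Here is a minimal sketch of that counting step. The answers and the list of excluded words are invented for illustration, and this is not the code of any particular word cloud website.

```python
import re
from collections import Counter

# Common words to exclude (an assumed, deliberately short list)
STOP_WORDS = {"because", "around", "the", "a", "an", "and", "of",
              "it", "my", "was", "were", "has"}

def word_frequencies(answers, stop_words=STOP_WORDS):
    """Count how often each word appears, ignoring case and stop words."""
    counts = Counter()
    for answer in answers:
        words = re.findall(r"[a-z']+", answer.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

# Hypothetical open-ended survey answers (invented)
answers = [
    "Because of the quality and the price",
    "My friend recommended it",
    "Good quality, good reviews",
    "The brand has a good reputation",
    "Price was good and reviews were great",
]

freqs = word_frequencies(answers)
top = freqs.most_common(3)   # the words that would appear largest in the cloud
print(top)
```

Notice that the most frequent word here is “good,” which says little on its own. This is the kind of keyword that the shortcut surfaces but that a human reviewer would reword or drop when building the framework.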
Method 3: Directed content analysis
Finally, some researchers use directed content analysis. Using this methodology, some of the codes are developed before data analysis, using theory and a review of the literature or secondary research to build the question guide and the theoretical framework to guide analysis. Using this method, additional codes are added as the analysis progresses.
The advantage of this method is that the existing framework lets the researcher jump into the analysis phase faster, and the question guide can be built according to the developed analysis framework.
Nonetheless, this is less useful in exploratory contexts and can bias some of the data collection as well as the analysis.
Coming back to our example presented earlier in this blog, if we had decided to use directed content analysis for our project, there are several ways we could have built our framework:
If we had built it on the client’s experience (hence a “hypothetical framework”), the client and I would most probably have used terms such as cool, price, and friend’s recommendation as starting points for our framework. (Based on a quick review of our e-mail conversations, these are the terms he expected to emerge.)
Alternatively, I could have reviewed an article such as “Marketing to the Generations”* from the Journal of Behavioral Studies in Business and identified “prestige,” “uniqueness,” “pricing,” and “referrals” as starting points.
In both cases, the results are the same: a pre-existing framework is imposed over qualitative data. Even if there is a speed advantage, there is a potential negative impact on the analysis of exploratory data that should not be underestimated.
Step Three: Coding the Data Using the Framework
Once a suitable framework has been built, it is time to assign the codes to your data. As this is an iterative process, expect to modify your framework during analysis: you will most likely remove, merge, and add codes throughout your analysis process.
This might mean that some data elements have to be reviewed multiple times. This is a normal step of qualitative data analysis. In this section, we will illustrate two examples of coding qualitative data, one for interview data and one for survey data.
To code interview data, I usually use a three-column analysis worksheet. The first column has coding that is attributed to a sentence. The second column is the interview itself, with phrase segments allocated to phenomena, and the third column is general comments and thoughts that emerge during analysis.
These thoughts are often very useful during the iterative process, to help adjust the coding categories, to help define them precisely, and to prepare for the next step, pattern coding.
To illustrate this process, here is a short extract from the analysis of an interview I did a few years ago. It was for a company that was planning to offer outsourcing services to large companies and needed to better understand the trends impacting the need for specialized outsourcing abilities. The company needed to understand decision-making processes and identify growth opportunities.
The codification then enables quick clustering and analysis: instead of looking through hundreds of pages of interviews, you can quickly scan and look for patterns or codes that reappear often, and then group these comments to craft your story. Codes make the interview more intelligible, and the recurrence of certain codes signals the emergence of regular themes.
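To show how the three-column worksheet supports this quick scanning, here is a minimal sketch that groups interview segments by code and counts how often each code recurs. The codes, segments, and comments are invented for illustration and are not taken from the actual interview.

```python
from collections import defaultdict

# Hypothetical worksheet rows: (code, interview segment, analyst comment)
worksheet = [
    ("cost pressure", "We outsource mainly to keep fixed costs down.",
     "Links to budget cycle?"),
    ("specialized skills", "We simply cannot hire those profiles locally.",
     ""),
    ("cost pressure", "Every year the budget gets tighter.",
     "Second mention of budgets"),
]

# Group the segments under their code so recurring themes stand out
segments_by_code = defaultdict(list)
for code, segment, _comment in worksheet:
    segments_by_code[code].append(segment)

recurrence = {code: len(segments) for code, segments in segments_by_code.items()}

for code, segments in segments_by_code.items():
    print(f"{code} ({recurrence[code]} mentions)")
    for segment in segments:
        print("  -", segment)
```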
Survey analysis follows a similar approach. Following up on the survey on purchasing patterns for healthy drinks, we coded all the responses from the participants. Each statement was individually analyzed and assigned a code from our framework. As the analysis proceeds, new codes can emerge, and others can be created by consolidating existing ones.
For example, during the consolidation stage, the categories “user reviews” and “online reviews (blog)” were consolidated into user reviews, as the number of responses did not justify keeping those two categories separated.
To ease codification, we built an analysis matrix. Qualitative data was stacked in the rows, while each column was assigned a framework item. As the process was iterative, it was possible to add codes as new elements were found. To best illustrate this, I have included a sample extract of the matrix that was generated.
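The structure of such an analysis matrix can be sketched as follows, with one row per response and one column per framework item, including the consolidation of two review categories into one. The responses shown are invented for illustration and are not the actual survey data.

```python
# Framework items (columns) after consolidation
CODES = ["quality", "price", "user reviews"]

# Codes that were merged into another code during consolidation
MERGED = {"online reviews (blog)": "user reviews"}

# Hypothetical coded responses: (statement, codes assigned by the analyst)
coded_responses = [
    ("Good quality for the price", ["quality", "price"]),
    ("A blog I follow reviewed it", ["online reviews (blog)"]),
    ("Other buyers rated it highly", ["user reviews"]),
    ("It was the cheapest option", ["price"]),
]

def build_matrix(responses, codes, merged=None):
    """Return one row of 0/1 flags per response, one column per code."""
    merged = merged or {}
    matrix = []
    for _statement, assigned in responses:
        normalized = {merged.get(code, code) for code in assigned}
        matrix.append([1 if code in normalized else 0 for code in codes])
    return matrix

matrix = build_matrix(coded_responses, CODES, MERGED)
totals = [sum(column) for column in zip(*matrix)]   # mentions per code

for (statement, _), row in zip(coded_responses, matrix):
    print(row, statement)
print(dict(zip(CODES, totals)))
```

Adding a code later only means appending it to the list of columns and rebuilding the matrix, which fits the iterative nature of the process.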
Step Four: Use the Framework for Descriptive Analysis
Once data is analyzed, the codes are used to create significance through the use of clustering. The more often that codes appear during analysis, the more expressive the concept behind them. These clusters are then used to identify patterns and generate recommendations.
As for interviews, the next step is pattern coding. Here, the objective is to group the categories into smaller sets of themes and constructs. Hence, during this step, recurring phenomena emerge, and you can see the most important ones.
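Pattern coding can be sketched as a second level of grouping: each code is assigned to a broader theme, and the mentions are added up per theme. The counts and the theme names below are invented for illustration; choosing which codes belong together remains the analyst's judgment.

```python
from collections import Counter

# Hypothetical number of times each code was assigned during analysis
code_counts = {
    "price": 12,
    "quality": 9,
    "friend's recommendation": 7,
    "user reviews": 5,
    "brand": 4,
}

# Assumed grouping of codes into broader themes
THEMES = {
    "price": "value",
    "quality": "value",
    "friend's recommendation": "social proof",
    "user reviews": "social proof",
    "brand": "reputation",
}

theme_counts = Counter()
for code, count in code_counts.items():
    theme_counts[THEMES[code]] += count

# Themes ordered from most to least recurring
ranked_themes = theme_counts.most_common()
print(ranked_themes)
```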
Finally, remember that it is very important to carefully define each theme; if they are not properly defined, some blurred lines might complicate the analysis, and render the framework invalid, or ill-defined.
Qualitative data analysis software can ease the burden of manual transcription and reduce miscalculations due to human error. Automated searches, auto-coding, and integrated analysis tools all facilitate researcher tasks.
The use of software for data analysis should be guided by many factors, such as the type and amount of data being manipulated, the time needed to learn to use the software versus the time analysis is expected to begin, cost constraints, and the need to share the data across multiple researchers.
Nonetheless, one has to realize that while using qualitative data software will make the codifying and analysis process that much easier, it also distances the researcher from the data, with some loss of the opportunity to understand it.
Some of the most popular qualitative data software programs include:
ATLAS.ti (www.atlasti.com): Allows codification of data, as well as the visualization of complex relationships between data sets.
QDA Miner (www.provalisresearch.com): Includes both coding and analysis components, and enables teams to share data virtually. A free version (QDA Miner Lite) also exists.
Some Final Notes about Qualitative Analysis
As we have mentioned numerous times throughout this blog, qualitative data is often alleged to be less “real” than quantitative data. This might be due to the ease with which it can be manipulated, or how the data sample is often so small relative to large quantitative studies that it is dismissed as anecdotal.
This is simply not true, and qualitative researchers can follow a number of practices to refute this perception, such as using a rigorous qualitative analysis framework, using computational tools to assist qualitative analysis, and using mixed methods of data analysis.
Also, as qualitative data is more subjective in nature, it is often useful to work with multiple analysts when analyzing it (depending on the project scope, of course).
If the budget allows it, two (or more) different analysts should analyze the same data following the same methodology, which would be followed by a convergence step and discussion on divergent results and interpretation.
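The convergence step can be prepared with a simple comparison of the two analysts' codes. In this sketch, the codes are invented, and the agreement rate is a plain percentage of matching codes, which is only one possible (and fairly basic) measure of consistency.

```python
def compare_coders(codes_a, codes_b):
    """Return the share of matching codes and the positions that diverge."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both analysts must code the same set of responses")
    divergent = [i for i, (a, b) in enumerate(zip(codes_a, codes_b)) if a != b]
    agreement = 1 - len(divergent) / len(codes_a)
    return agreement, divergent

# Hypothetical codes given by two analysts to the same five responses
analyst_1 = ["price", "quality", "brand", "price", "taste"]
analyst_2 = ["price", "quality", "reputation", "price", "quality"]

agreement, divergent = compare_coders(analyst_1, analyst_2)
print(f"Agreement: {agreement:.0%}")
for i in divergent:
    print(f"Response {i + 1}: {analyst_1[i]!r} vs {analyst_2[i]!r} -> discuss")
```

The list of divergent responses becomes the agenda for the discussion between analysts, and a low agreement rate is often a sign that the code definitions need tightening.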
Obstacles to Effective Analysis
There are a number of issues a researcher has to keep in mind when doing the analysis. Doing so will play a key role in conducting the data analysis in a coherent manner.
Confusing Facts and Opinions
When collecting data, the researcher will be faced with a number of different data sets originating from multiple sources. It is important to remember that not all of them will be facts, and should be treated accordingly. A fact should be irrefutable.
This can become confusing as you do online research and gather multiple statements that are in conflict: multiple different market research firms can estimate market growth differently.
It will be up to the researcher to select which one is the most credible, and define it as such: use words such as “estimate” and “believe” that convey confidence, while not falsely conveying confirmation.
A fact is something that has occurred, something that is true. Do not confuse opinions and beliefs with facts.
Researchers are human and can engage in research with a set of preconceived ideas and biases. It is important to be able to recognize this to be able to manage them and minimize their impact.
One of the biggest forms of researcher bias is confirmation bias, where the researcher will have an inclination to retain data that favors some preconceived ideas and dismiss ideas that go against these preconceived ideas.
Other biases can include correspondence bias (over-emphasizing a personality-based explanation for behaviors observed in others) and hindsight bias (the inclination toward seeing past events as predictable).
Complexity of Data
As you engage in market research, complexity can become an issue, especially if your data sources are increasing, and mixed in nature. You might find yourself in a situation where you need to reduce complexity.
One way to do so is to try to simplify complex information and distill it into a single sentence. While you might lose some of the richness of the data, you can always go back to your original data (especially if your coding framework allows quick referencing).
Future of Data Analysis
As we move forward, there is a definite trend toward the computerization of data analysis. The emergence of software such as IBM Watson (a powerful analytics tool in its own right) will shift the way people think about and do data analysis.
Speed, accuracy, and sexy visualization tools, all within reach with a couple of clicks, will change the relationship between the researcher and the data. Furthermore, as machine learning advances, we will see an improvement in basic data analytics steps such as data cleanup and even predictive analysis.
Big data will also come into play. In the past, we might have looked at a data set as a unique data point.
As we go forward, the emergence of big data means that the first step of analysis will be to link new data to many other sources simultaneously, trying to make sense of it in a global sense, not only the context in which it was gathered.
Furthermore, the way data is generated is shifting dramatically. Our mobile phones and devices (from a Fitbit to our credit and loyalty cards) are all individually generating data, so the future of data analysis is one where an increasing amount of data is being generated from multiple locations, and converged into massive databases.
Nonetheless, these tools should not replace the researcher’s ability to interpret and read the data.
The use of these tools will not teach the user basic statistics, and without these basic skills, it becomes very easy to believe the data “because my computer says so.” The ability to interpret data is crucial going forward, even if tools evolve in directions that make this easier.