Natural Language Processing: Statistical Methods

Dr. Gorden Morse, France. Published: 22-07-2017
A standard reference on statistical methods in natural language processing is the textbook below; its publication details and table of contents are reproduced here.

Foundations of Statistical Natural Language Processing
Christopher D. Manning and Hinrich Schütze
The MIT Press, Cambridge, Massachusetts / London, England
© 1999 Massachusetts Institute of Technology. Second printing, with corrections, 2000.
Typeset in 10/13 Lucida Bright by the authors using LaTeX2e. Printed and bound in the United States of America.
ISBN 0-262-13360-1. Library of Congress call number P98.5.S83 M36 1999; catalog card number 99-21137. Includes bibliographical references and index.
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

Contents

List of Tables
List of Figures
Table of Notations
Preface
Road Map

I Preliminaries

1 Introduction
  1.1 Rationalist and Empiricist Approaches to Language
  1.2 Scientific Content
    1.2.1 Questions that linguistics should answer
    1.2.2 Non-categorical phenomena in language
    1.2.3 Language and cognition as probabilistic phenomena
  1.3 The Ambiguity of Language: Why NLP Is Difficult
  1.4 Dirty Hands
    1.4.1 Lexical resources
    1.4.2 Word counts
    1.4.3 Zipf's laws
    1.4.4 Collocations
    1.4.5 Concordances
  1.5 Further Reading
  1.6 Exercises

2 Mathematical Foundations
  2.1 Elementary Probability Theory
    2.1.1 Probability spaces
    2.1.2 Conditional probability and independence
    2.1.3 Bayes' theorem
    2.1.4 Random variables
    2.1.5 Expectation and variance
    2.1.6 Notation
    2.1.7 Joint and conditional distributions
    2.1.8 Determining P
    2.1.9 Standard distributions
    2.1.10 Bayesian statistics
    2.1.11 Exercises
  2.2 Essential Information Theory
    2.2.1 Entropy
    2.2.2 Joint entropy and conditional entropy
    2.2.3 Mutual information
    2.2.4 The noisy channel model
    2.2.5 Relative entropy or Kullback-Leibler divergence
    2.2.6 The relation to language: Cross entropy
    2.2.7 The entropy of English
    2.2.8 Perplexity
    2.2.9 Exercises
  2.3 Further Reading

3 Linguistic Essentials
  3.1 Parts of Speech and Morphology
    3.1.1 Nouns and pronouns
    3.1.2 Words that accompany nouns: Determiners and adjectives
    3.1.3 Verbs
    3.1.4 Other parts of speech
  3.2 Phrase Structure
    3.2.1 Phrase structure grammars
    3.2.2 Dependency: Arguments and adjuncts
    3.2.3 X' theory
    3.2.4 Phrase structure ambiguity
  3.3 Semantics and Pragmatics
  3.4 Other Areas
  3.5 Further Reading
  3.6 Exercises

4 Corpus-Based Work
  4.1 Getting Set Up
    4.1.1 Computers
    4.1.2 Corpora
    4.1.3 Software
  4.2 Looking at Text
    4.2.1 Low-level formatting issues
    4.2.2 Tokenization: What is a word?
    4.2.3 Morphology
    4.2.4 Sentences
  4.3 Marked-up Data
    4.3.1 Markup schemes
    4.3.2 Grammatical tagging
  4.4 Further Reading
  4.5 Exercises

II Words

5 Collocations
  5.1 Frequency
  5.2 Mean and Variance
  5.3 Hypothesis Testing
    5.3.1 The t test
    5.3.2 Hypothesis testing of differences
    5.3.3 Pearson's chi-square test
    5.3.4 Likelihood ratios
  5.4 Mutual Information
  5.5 The Notion of Collocation
  5.6 Further Reading

6 Statistical Inference: n-gram Models over Sparse Data
  6.1 Bins: Forming Equivalence Classes
    6.1.1 Reliability vs. discrimination
    6.1.2 n-gram models
    6.1.3 Building n-gram models
  6.2 Statistical Estimators
    6.2.1 Maximum Likelihood Estimation (MLE)
    6.2.2 Laplace's law, Lidstone's law and the Jeffreys-Perks law
    6.2.3 Held out estimation
    6.2.4 Cross-validation (deleted estimation)
    6.2.5 Good-Turing estimation
    6.2.6 Briefly noted
  6.3 Combining Estimators
    6.3.1 Simple linear interpolation
    6.3.2 Katz's backing-off
    6.3.3 General linear interpolation
    6.3.4 Briefly noted
    6.3.5 Language models for Austen
  6.4 Conclusions
  6.5 Further Reading
  6.6 Exercises

7 Word Sense Disambiguation
  7.1 Methodological Preliminaries
    7.1.1 Supervised and unsupervised learning
    7.1.2 Pseudowords
    7.1.3 Upper and lower bounds on performance
  7.2 Supervised Disambiguation
    7.2.1 Bayesian classification
    7.2.2 An information-theoretic approach
  7.3 Dictionary-Based Disambiguation
    7.3.1 Disambiguation based on sense definitions
    7.3.2 Thesaurus-based disambiguation
    7.3.3 Disambiguation based on translations in a second-language corpus
    7.3.4 One sense per discourse, one sense per collocation
  7.4 Unsupervised Disambiguation
  7.5 What Is a Word Sense?
  7.6 Further Reading
  7.7 Exercises

8 Lexical Acquisition
  8.1 Evaluation Measures
  8.2 Verb Subcategorization
  8.3 Attachment Ambiguity
    8.3.1 Hindle and Rooth (1993)
    8.3.2 General remarks on PP attachment
  8.4 Selectional Preferences
  8.5 Semantic Similarity
    8.5.1 Vector space measures
    8.5.2 Probabilistic measures
  8.6 The Role of Lexical Acquisition in Statistical NLP
  8.7 Further Reading

III Grammar

9 Markov Models
  9.1 Markov Models
  9.2 Hidden Markov Models
    9.2.1 Why use HMMs?
    9.2.2 General form of an HMM
  9.3 The Three Fundamental Questions for HMMs
    9.3.1 Finding the probability of an observation
    9.3.2 Finding the best state sequence
    9.3.3 The third problem: Parameter estimation
  9.4 Implementation, Properties, and Variants
    9.4.1 Implementation
    9.4.2 Variants
    9.4.3 Multiple input observations
    9.4.4 Initialization of parameter values
  9.5 Further Reading

10 Part-of-Speech Tagging
  10.1 The Information Sources in Tagging
  10.2 Markov Model Taggers
    10.2.1 The probabilistic model
    10.2.2 The Viterbi algorithm
    10.2.3 Variations
  10.3 Hidden Markov Model Taggers
    10.3.1 Applying HMMs to POS tagging
    10.3.2 The effect of initialization on HMM training
  10.4 Transformation-Based Learning of Tags
    10.4.1 Transformations
    10.4.2 The learning algorithm
    10.4.3 Relation to other models
    10.4.4 Automata
    10.4.5 Summary
  10.5 Other Methods, Other Languages
    10.5.1 Other approaches to tagging
    10.5.2 Languages other than English
  10.6 Tagging Accuracy and Uses of Taggers
    10.6.1 Tagging accuracy
    10.6.2 Applications of tagging
  10.7 Further Reading
  10.8 Exercises

11 Probabilistic Context Free Grammars
  11.1 Some Features of PCFGs
  11.2 Questions for PCFGs
  11.3 The Probability of a String
    11.3.1 Using inside probabilities
    11.3.2 Using outside probabilities
    11.3.3 Finding the most likely parse for a sentence
    11.3.4 Training a PCFG
  11.4 Problems with the Inside-Outside Algorithm
  11.5 Further Reading
  11.6 Exercises

12 Probabilistic Parsing
  12.1 Some Concepts
    12.1.1 Parsing for disambiguation
    12.1.2 Treebanks
    12.1.3 Parsing models vs. language models
    12.1.4 Weakening the independence assumptions of PCFGs
    12.1.5 Tree probabilities and derivational probabilities
    12.1.6 There's more than one way to do it
    12.1.7 Phrase structure grammars and dependency grammars
    12.1.8 Evaluation
    12.1.9 Equivalent models
    12.1.10 Building parsers: Search methods
    12.1.11 Use of the geometric mean
  12.2 Some Approaches
    12.2.1 Non-lexicalized treebank grammars
    12.2.2 Lexicalized models using derivational histories
    12.2.3 Dependency-based models
    12.2.4 Discussion
  12.3 Further Reading
  12.4 Exercises

IV Applications and Techniques

13 Statistical Alignment and Machine Translation
  13.1 Text Alignment
    13.1.1 Aligning sentences and paragraphs
    13.1.2 Length-based methods
    13.1.3 Offset alignment by signal processing techniques
    13.1.4 Lexical methods of sentence alignment
    13.1.5 Summary
    13.1.6 Exercises
  13.2 Word Alignment
  13.3 Statistical Machine Translation
  13.4 Further Reading

14 Clustering
  14.1 Hierarchical Clustering
    14.1.1 Single-link and complete-link clustering
    14.1.2 Group-average agglomerative clustering
    14.1.3 An application: Improving a language model
    14.1.4 Top-down clustering
  14.2 Non-Hierarchical Clustering
    14.2.1 K-means
    14.2.2 The EM algorithm
  14.3 Further Reading
  14.4 Exercises

15 Topics in Information Retrieval
  15.1 Some Background on Information Retrieval
    15.1.1 Common design features of IR systems
    15.1.2 Evaluation measures
    15.1.3 The probability ranking principle (PRP)
  15.2 The Vector Space Model
    15.2.1 Vector similarity
    15.2.2 Term weighting
  15.3 Term Distribution Models
    15.3.1 The Poisson distribution
    15.3.2 The two-Poisson model
    15.3.3 The K mixture
    15.3.4 Inverse document frequency
    15.3.5 Residual inverse document frequency
    15.3.6 Usage of term distribution models
  15.4 Latent Semantic Indexing
    15.4.1 Least-squares methods
    15.4.2 Singular Value Decomposition
    15.4.3 Latent Semantic Indexing in IR
  15.5 Discourse Segmentation
    15.5.1 TextTiling
  15.6 Further Reading
  15.7 Exercises

16 Text Categorization
  16.1 Decision Trees
  16.2 Maximum Entropy Modeling
    16.2.1 Generalized iterative scaling
    16.2.2 Application to text categorization
  16.3 Perceptrons
  16.4 k Nearest Neighbor Classification
  16.5 Further Reading

Tiny Statistical Tables
Bibliography
Index
The book's front matter also includes a List of Tables and a List of Figures, giving the captions of all tables and figures keyed to page numbers in the printed edition.
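To give a flavor of the estimation techniques covered in Chapter 6, here is a minimal Python sketch (not code from the book) of maximum likelihood and Lidstone-smoothed bigram estimates; the toy corpus, function name, and default parameter are illustrative assumptions.

```python
from collections import Counter

def bigram_probability(tokens, vocab_size, lam=0.5):
    """Return a function P(w2 | w1) estimated with Lidstone smoothing.

    lam=1 gives Laplace's law (add-one), lam=0.5 the Jeffreys-Perks law,
    and lam=0 reduces to the maximum likelihood estimate (MLE).
    """
    unigrams = Counter(tokens[:-1])                      # counts of conditioning words
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))      # counts of adjacent word pairs

    def prob(w1, w2):
        return (bigrams[(w1, w2)] + lam) / (unigrams[w1] + lam * vocab_size)

    return prob

# Toy corpus, purely for illustration.
corpus = "the cat sat on the mat the cat ate".split()
p = bigram_probability(corpus, vocab_size=len(set(corpus)))
print(p("the", "cat"))   # smoothed estimate of P(cat | the)
print(p("the", "dog"))   # an unseen bigram still gets non-zero probability
```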
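Chapters 9 and 10 develop Markov model taggers and the Viterbi algorithm; the following is a small self-contained sketch of Viterbi decoding over made-up toy parameters (states, transition, and emission probabilities are invented for the example, not the book's own figures or pseudocode).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observation sequence."""
    # V[t][s] = probability of the best path ending in state s at time t
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            best_prev, best_prob = max(
                ((r, V[t - 1][r] * trans_p[r][s] * emit_p[s].get(observations[t], 0.0))
                 for r in states),
                key=lambda x: x[1],
            )
            V[t][s] = best_prob
            back[t][s] = best_prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path

# Hypothetical toy tagger parameters, for illustration only.
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
           "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
           "VB": {"DT": 0.5, "NN": 0.4, "VB": 0.1}}
emit_p = {"DT": {"the": 0.9}, "NN": {"dog": 0.5, "barks": 0.1}, "VB": {"barks": 0.6}}
print(viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p))
# -> ['DT', 'NN', 'VB']
```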
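As a glimpse of the information retrieval material in Chapter 15, this rough sketch computes one common variant of tf.idf weights and cosine similarity for a tiny hypothetical document collection; the weighting scheme shown is a standard one, not necessarily the exact formulas the book tabulates.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of tokens) to a {term: tf.idf weight} dict."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequencies
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (1 + math.log(c)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Toy document collection, purely for illustration.
docs = ["the cat sat on the mat".split(),
        "the dog sat on the log".split(),
        "cats and dogs".split()]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # similarity of document 1 and document 2
```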
