Lecture Notes on Natural Language Processing

Published Date: 11-07-2017
CS224d: Deep Learning for Natural Language Processing
Richard Socher, PhD

Welcome
1.  CS224d logistics
2.  Introduction to NLP, deep learning and their intersection

Richard Socher, 3/31/16

Course Logistics
•  Instructor: Richard Socher (Stanford PhD, 2014; now Founder/CEO at MetaMind)
•  TAs: James Hong, Bharath Ramsundar, Sameep Bagadia, David Dindi, and others
•  Time: Tuesday, Thursday 3:00-4:20
•  Location: Gates B1
•  There will be 3 problem sets (with lots of programming), a midterm and a final project
•  For syllabus and office hours, see http://cs224d.stanford.edu/
•  Slides uploaded before each lecture; video and lecture notes after

Prerequisites
•  Proficiency in Python
   •  All class assignments will be in Python. There is a tutorial here
•  College calculus and linear algebra (e.g. MATH 19 or 41, MATH 51)
•  Basic probability and statistics (e.g. CS 109 or another stats course)
•  Equivalent knowledge of CS229 (Machine Learning):
   •  cost functions
   •  taking simple derivatives
   •  performing optimization with gradient descent

Grading Policy
•  3 problem sets: 15% x 3 = 45%
•  Midterm exam: 15%
•  Final course project: 40%
   •  Milestone: 5% (2% bonus if you have your data and ran an experiment)
   •  Attend at least 1 project advice office hour: 2%
   •  Final write-up, project and presentation: 33%
   •  Bonus points for an exceptional poster presentation
•  Late policy
   •  7 free late days, to use as you please
   •  Afterwards, 25% off per day late
   •  PSets not accepted after 3 late days per PSet
   •  Does not apply to the final course project
•  Collaboration policy: read the student code book and Honor Code
   •  Understand what counts as 'collaboration' and what counts as an 'academic infraction'

High-Level Plan for Problem Sets
•  The first half of the course and the first 2 PSets will be hard, to really understand the basics
•  PSet 1 is in pure Python code (numpy etc.)
   •  Released on April 4th
•  New: PSets 2 & 3 will be in TensorFlow, a library for putting together new neural network models quickly (see the special lecture)
•  PSet 3 will be shorter to increase time for the final project
•  Libraries like TensorFlow (or Torch) are becoming standard tools
   •  But there are still some problems

What is Natural Language Processing (NLP)?
•  Natural language processing is a field at the intersection of
   •  computer science
   •  artificial intelligence
   •  and linguistics.
•  Goal: for computers to process or "understand" natural language in order to perform tasks that are useful, e.g.
   •  question answering
•  Fully understanding and representing the meaning of language (or even defining it) is an elusive goal.
•  Perfect language understanding is AI-complete

NLP Levels
[figure shown on slide]

(A tiny sample of) NLP Applications
•  Applications range from simple to complex:
   •  Spell checking, keyword search, finding synonyms
   •  Extracting information from websites, such as product prices, dates, locations, people or company names
   •  Classifying the reading level of school texts, or the positive/negative sentiment of longer documents
   •  Machine translation
   •  Spoken dialog systems
   •  Complex question answering

NLP in Industry
•  Search (written and spoken)
•  Online advertisement
•  Automated/assisted translation
•  Sentiment analysis for marketing or finance/trading
•  Speech recognition
•  Automating customer support

Why is NLP hard?
•  Complexity in representing, learning and using linguistic/situational/world/visual knowledge
   •  "Jane hit June and then she fell/ran."
•  Ambiguity: "I made her duck"

What's Deep Learning (DL)?
•  Deep learning is a subfield of machine learning
•  Most machine learning methods work well because of human-designed representations and input features
•  For example: features for finding named entities like locations or organization names (Finkel, 2010)
   [Shown on the slide: Table 3.1 from Finkel (2010), "Features used by the CRF for the two tasks: named entity recognition (NER) and template filling (TF)": current word, previous word, next word, character n-grams of all lengths <= 6, current POS tag, surrounding POS tag sequence, current word shape, surrounding word shape sequence, presence of word in left window (size 4 for NER, size 9 for TF), presence of word in right window (size 4 for NER, size 9 for TF).]
•  Machine learning becomes just optimizing weights to best make a final prediction

Machine Learning vs Deep Learning
•  Machine learning in practice:
   •  describing your data with features a computer can understand (domain specific, requires PhD-level talent)
   •  a learning algorithm optimizing the weights on the features

What's Deep Learning (DL)?
•  Representation learning attempts to automatically learn good features or representations
•  Deep learning algorithms attempt to learn (multiple levels of) representation and an output
•  From "raw" inputs x (e.g. words)

On the history and term of "Deep Learning"
•  We will focus on different kinds of neural networks, the dominant model family inside deep learning
•  Only clever terminology for stacked logistic regression units?
   •  Somewhat, but there are interesting modeling principles (end-to-end) and actual connections to neuroscience in some cases
•  We will not take a historical approach, but instead focus on methods which work well on NLP problems now
•  For the history of deep learning models (starting in the 1960s), see: "Deep Learning in Neural Networks: An Overview" by Schmidhuber

Reasons for Exploring Deep Learning
•  Manually designed features are often over-specified, incomplete, and take a long time to design and validate
•  Learned features are easy to adapt and fast to learn
•  Deep learning provides a very flexible, (almost?) universal, learnable framework for representing world, visual and linguistic information
•  Deep learning can learn unsupervised (from raw text) and supervised (with specific labels like positive/negative)

Reasons for Exploring Deep Learning (continued)
•  In 2006 deep learning techniques started outperforming other machine learning techniques. Why now?
   •  DL techniques benefit more from a lot of data
   •  Faster machines and multicore CPUs/GPUs help DL
   •  New models, algorithms, ideas lead to improved performance (first in speech and vision, then NLP)

Deep Learning for Speech
•  The first breakthrough results of "deep learning" on large datasets happened in speech recognition (phonemes/words)
•  Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition, Dahl et al.
(2010)

   Acoustic model          Recog \ WER      RT03S FSH     Hub5 SWB
   Traditional features    1-pass -adapt    27.4          23.6
   Deep Learning           1-pass -adapt    18.5 (-33%)   16.1 (-32%)

Deep Learning for Computer Vision
•  Most deep learning groups have (until 2 years ago) focused on computer vision
•  Breakthrough paper: ImageNet Classification with Deep Convolutional Neural Networks by Krizhevsky et al., 2012
   [Shown on the slide: PASCAL vs. ILSVRC example images and feature visualizations from Zeiler and Fergus (2013). The figure caption, from Russakovsky et al.: the ILSVRC dataset contains many more fine-grained classes than the standard PASCAL VOC benchmark; for example, instead of the PASCAL "dog" category there are 120 different breeds of dogs in the ILSVRC2012-2014 classification and single-object localization tasks.]
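The history slide above asks whether deep learning is "only clever terminology for stacked logistic regression units". A minimal pure-Python sketch of what stacking buys: two stacked layers of logistic units compute XOR, which no single logistic unit can. The weights here are hand-set for illustration only; in practice they would be learned by gradient descent on a cost function, per the CS229 prerequisites.

```python
import math

def logistic(z):
    # A single "logistic regression unit": squash a weighted sum into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def unit(weights, bias, inputs):
    # One unit = logistic regression over its inputs.
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)

def two_layer_net(x):
    # Hidden layer: two logistic units computing intermediate features.
    h1 = unit([10.0, 10.0], -5.0, x)     # roughly x1 OR x2
    h2 = unit([-10.0, -10.0], 15.0, x)   # roughly NOT (x1 AND x2)
    # Output layer: one more logistic unit on the hidden representation,
    # i.e. logistic regression "stacked" on logistic regression.
    return unit([10.0, 10.0], -15.0, [h1, h2])

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(x, round(two_layer_net(x)))  # reproduces XOR: 0, 1, 1, 0
```

The point of the "interesting modeling principles" remark is visible even in this toy: the hidden layer is an intermediate representation of the input that makes the final prediction linearly easy.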
Deep Learning + NLP = Deep NLP
•  Combine the ideas and goals of NLP, and use representation learning and deep learning methods to solve them
•  Several big improvements in recent years across different NLP
   •  levels: speech, morphology, syntax, semantics
   •  applications: machine translation, sentiment analysis and question answering
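The lecture frames machine learning as "optimizing weights to best make a final prediction" via a cost function, simple derivatives and gradient descent, and lists sentiment analysis among the applications. A self-contained toy sketch tying those pieces together (the data and all names are invented for illustration; real Deep NLP would replace the hand-built bag-of-words features below with learned word representations):

```python
import math

# Hypothetical labeled data: 1 = positive sentiment, 0 = negative.
train = [("good great fun", 1), ("bad boring awful", 0),
         ("great plot good acting", 1), ("awful pacing bad jokes", 0)]

vocab = sorted({tok for text, _ in train for tok in text.split()})
w = {word: 0.0 for word in vocab}  # one weight per word
b = 0.0

def predict(text):
    # Logistic regression over a bag of words: score = sum of word weights.
    z = sum(w.get(tok, 0.0) for tok in text.split()) + b
    return 1.0 / (1.0 + math.exp(-z))

# Gradient descent on the cross-entropy cost: the derivative of the cost
# with respect to the score is (prediction - label), and each word present
# in the text passes that gradient on to its own weight.
lr = 0.5
for _ in range(200):
    for text, y in train:
        err = predict(text) - y
        for tok in text.split():
            w[tok] -= lr * err
        b -= lr * err

print(predict("good fun"))    # close to 1 (positive)
print(predict("boring bad"))  # close to 0 (negative)
```

This is the "machine learning in practice" pipeline from the slides in miniature: human-designed features (word presence) plus a learning algorithm that only optimizes weights; deep learning replaces the first half with learned representations.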
