Apache Spark Tutorial (PPT)

Uploaded by DannyConnolly (Switzerland). Published: 14-07-2017
Using Apache Spark
Pat McDonough - Databricks

2-5× less code. Up to 10× faster on disk, 100× in memory.

What is Spark?
A fast and expressive cluster computing system, compatible with Apache Hadoop.
Efficient:
•  General execution graphs
•  In-memory storage
Usable:
•  Rich APIs in Java, Scala, Python
•  Interactive shell

Key Concepts
Write programs in terms of transformations on distributed datasets.
Resilient Distributed Datasets (RDDs):
•  Collections of objects spread across a cluster, stored in RAM or on disk
•  Built through parallel transformations
•  Automatically rebuilt on failure
Operations:
•  Transformations (e.g. map, filter, groupBy)
•  Actions (e.g. count, collect, save)

Working With RDDs
Transformations build new RDDs from existing ones; an action finally returns a value to the driver.
  textFile = sc.textFile("SomeFile.txt")
  linesWithSpark = textFile.filter(lambda line: "Spark" in line)
  linesWithSpark.count()    # 74
  linesWithSpark.first()    # "Apache Spark"

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.
  lines = spark.textFile("hdfs://...")                       # base RDD
  errors = lines.filter(lambda s: s.startswith("ERROR"))     # transformed RDD
  messages = errors.map(lambda s: s.split("\t")[2])
  messages.cache()
  messages.filter(lambda s: "mysql" in s).count()            # action
  messages.filter(lambda s: "php" in s).count()
[Diagram: the driver sends tasks to workers; each worker reads one block of the file and keeps its filtered messages partition in cache (Cache 1-3) for reuse across queries.]
Full-text search of Wikipedia: 60 GB on 20 EC2 machines; 0.5 sec from cache vs. 20 s on disk.

Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data.
  msgs = textFile.filter(lambda s: s.startswith("ERROR"))
                 .map(lambda s: s.split("\t")[2])
[Diagram: HDFS file → filter (func = startswith(...)) → filtered RDD → map (func = split(...)) → mapped RDD]
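The log-mining code above is shown as it appears on the slides, against an elided HDFS path. As a minimal self-contained sketch (not from the original deck), assuming PySpark, a local master, and a hypothetical tab-separated log file logs/app.log, it could be run like this:

  # Minimal sketch of the log-mining example; assumes PySpark is installed,
  # runs locally, and each ERROR line has at least three tab-separated fields.
  from pyspark import SparkContext

  sc = SparkContext("local[2]", "LogMining")        # local master, 2 threads

  lines = sc.textFile("logs/app.log")               # base RDD (hypothetical path)
  errors = lines.filter(lambda s: s.startswith("ERROR"))
  messages = errors.map(lambda s: s.split("\t")[2])
  messages.cache()                                  # keep the filtered data in memory

  print(messages.filter(lambda s: "mysql" in s).count())
  print(messages.filter(lambda s: "php" in s).count())

  sc.stop()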
Language Support
Standalone programs can be written in Python, Scala, and Java; interactive shells exist for Python and Scala. Java and Scala are faster due to static typing, but Python is often fine.
Python:
  lines = sc.textFile(...)
  lines.filter(lambda s: "ERROR" in s).count()
Scala:
  val lines = sc.textFile(...)
  lines.filter(x => x.contains("ERROR")).count()
Java:
  JavaRDD<String> lines = sc.textFile(...);
  lines.filter(new Function<String, Boolean>() {
    Boolean call(String s) { return s.contains("error"); }
  }).count();

Interactive Shell
•  The fastest way to learn Spark
•  Available in Python and Scala
•  Runs as an application on an existing Spark cluster, or can run locally

Administrative GUIs
http://<standalone master>:8080 (by default)

Software Components
•  Spark runs as a library in your program (one SparkContext instance per app)
•  Runs tasks locally or on a cluster manager: Mesos, YARN, or standalone mode
•  Accesses storage systems via the Hadoop InputFormat API; can use HBase, HDFS, S3, …
[Diagram: your application creates a SparkContext, which runs tasks either in local threads or through a cluster manager on workers running Spark executors; all of them read HDFS or other storage.]

Task Scheduler
•  General task graphs
•  Automatically pipelines functions
•  Data locality aware
•  Partitioning aware, to avoid shuffles
[Diagram: a DAG of RDDs A-F split into Stage 1 (groupBy), Stage 2 (map, filter), and Stage 3 (join); legend: box = RDD, shaded box = cached partition.]

Advanced Features
•  Controllable partitioning - speed up joins against a dataset
•  Controllable storage formats - keep data serialized for efficiency, replicate to multiple nodes, cache on disk
•  Shared variables: broadcasts, accumulators
•  See the online docs for details

Local Execution
•  Just pass local or local[k] as the master URL
•  Debug using local debuggers
   -  For Java / Scala, just run your program in a debugger
   -  For Python, use an attachable debugger (e.g. PyDev)
•  Great for development & unit tests

Cluster Execution
•  Easiest way to launch is EC2:
     ./spark-ec2 -k keypair -i id_rsa.pem -s slaves \
       [launch|stop|start|destroy] clusterName
•  Several options for private clusters:
   -  Standalone mode (similar to Hadoop's deploy scripts)
   -  Mesos
   -  Hadoop YARN
•  Amazon EMR: tinyurl.com/spark-emr

Using the Shell
Launching:
  spark-shell
  pyspark                  (set IPYTHON=1 for an IPython shell)
Modes:
  MASTER=local ./spark-shell                 # local, 1 thread
  MASTER=local[2] ./spark-shell              # local, 2 threads
  MASTER=spark://host:port ./spark-shell     # cluster

SparkContext
•  Main entry point to Spark functionality
•  Available in the shell as the variable sc
•  In standalone programs, you'd make your own (see later for details)

Creating RDDs
Turn a Python collection into an RDD:
  sc.parallelize([1, 2, 3])
Load a text file from the local FS, HDFS, or S3:
  sc.textFile("file.txt")
  sc.textFile("directory/*.txt")
  sc.textFile("hdfs://namenode:9000/path/file")
Use an existing Hadoop InputFormat (Java/Scala only):
  sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations
  nums = sc.parallelize([1, 2, 3])
Pass each element through a function:
  squares = nums.map(lambda x: x * x)           # [1, 4, 9]
Keep elements passing a predicate:
  even = squares.filter(lambda x: x % 2 == 0)   # [4]
Map each element to zero or more others:
  nums.flatMap(lambda x: range(x))              # [0, 0, 1, 0, 1, 2]
(range(x) is a sequence of the numbers 0, 1, …, x-1)

Basic Actions
  nums = sc.parallelize([1, 2, 3])
Retrieve RDD contents as a local collection:
  nums.collect()     # [1, 2, 3]
Return the first K elements:
  nums.take(2)       # [1, 2]
Count the number of elements:
  nums.count()       # 3
Merge elements with an associative function:
  nums.reduce(lambda x, y: x + y)   # 6
Write elements to a text file:
  nums.saveAsTextFile("hdfs://file.txt")
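The shell snippets above rely on the sc variable that the shell provides. As a minimal sketch (not from the original deck) of the same basics inside a standalone program that creates its own SparkContext, assuming PySpark and a local master:

  # Minimal standalone sketch of the transformations and actions above;
  # assumes PySpark and creates its own SparkContext on a local master.
  from pyspark import SparkContext

  sc = SparkContext("local", "RDDBasics")

  nums = sc.parallelize([1, 2, 3])
  squares = nums.map(lambda x: x * x)             # transformation: [1, 4, 9]
  even = squares.filter(lambda x: x % 2 == 0)     # transformation: [4]
  flat = nums.flatMap(lambda x: range(x))         # transformation: [0, 0, 1, 0, 1, 2]

  print(nums.collect())                    # action: [1, 2, 3]
  print(even.take(1))                      # action: [4]
  print(flat.count())                      # action: 6
  print(nums.reduce(lambda x, y: x + y))   # action: 6

  sc.stop()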
Working with Key-Value Pairs
Spark's "distributed reduce" transformations operate on RDDs of key-value pairs.
Python:
  pair = (a, b)
  pair[0]   # => a
  pair[1]   # => b
Scala:
  val pair = (a, b)
  pair._1   // => a
  pair._2   // => b
Java:
  Tuple2 pair = new Tuple2(a, b);
  pair._1   // => a
  pair._2   // => b
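The extract ends here. As a hedged illustration (not from the original slides) of a "distributed reduce" over such pairs, a word count using reduceByKey could look like this, assuming PySpark and a hypothetical local file input.txt:

  # Word count as a sketch of a key-value "distributed reduce" (reduceByKey);
  # assumes PySpark and a hypothetical input file input.txt.
  from pyspark import SparkContext

  sc = SparkContext("local", "WordCount")

  lines = sc.textFile("input.txt")
  words = lines.flatMap(lambda line: line.split())    # one word per element
  pairs = words.map(lambda w: (w, 1))                 # key-value pairs (word, 1)
  counts = pairs.reduceByKey(lambda a, b: a + b)      # sum the counts per word

  print(counts.take(5))
  sc.stop()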