Question? Leave a message!


DannyConnolly Profile Pic
Published Date:14-07-2017
Website URL
Using Apache Spark Pat McDonough - Databricks Apache Spark The Spark Community +You  INTRODUCTION TO APACHE SPARK 2-­‐5×  less  code   Up  to  10×  faster  on  disk,   100×  in  memory   What is Spark? Fast and Expressive Cluster Computing System Compatible with Apache Hadoop     Efficient Usable •  Rich APIs in Java, •  General execution Scala, Python graphs •  Interactive shell •  In-memory storage Key Concepts Write programs in terms of transformations on distributed datasets Resilient Distributed Datasets Operations •  Collections of objects spread •  Transformations" across a cluster, stored in RAM (e.g. map, filter, or on Disk groupBy) •  Built through parallel •  Actions" transformations (e.g. count, collect, save) •  Automatically rebuilt on failure Working With RDDs textFile = sc.textFile(”SomeFile.txt”) RDD RDD RDD RDD Action Value Transformations linesWithSpark.count() 74 linesWithSpark.first() Apache Spark linesWithSpark = textFile.filter(lambda line: "Spark” in line)Example: Log Mining Load error messages from a log into memory, then interactively search for various patterns Base  RDD   Transformed  RDD   Cache  1   lines = spark.textFile(“hdfs://...”) Worker   results   errors = lines.filter(lambda s: s.startswith(“ERROR”)) tasks   Block  1   messages = s: s.split(“\t”)2) Driver   messages.cache() Acon   Cache  2   messages.filter(lambda s: “mysql” in s).count() Worker   messages.filter(lambda s: “php” in s).count() Cache  3   . . . Block  2   Worker   Full-­‐text  search  of  Wikipedia   •  60GB  on  20  EC2  machine   Block  3   •  0.5  sec  vs.  20s  for  on-­‐disk  Scaling Down 100   80   60   40   20   0   Cache   25%   50%   75%   Fully   disabled   cached   %  of  working  set  in  cache   Execution  time  (s)   69   58   41   30   12  Fault Recovery RDDs track lineage information that can be used to efficiently recompute lost data msgs = textFile.filter(lambda s: s.startsWith(“ERROR”)) .map(lambda s: s.split(“\t”)2) HDFS  File   Filtered  RDD   Mapped  RDD   filter   map   (func  =  startsWith(…))   (func  =  split(...))  Language Support Standalone Programs Python lines = sc.textFile(...) • Python, Scala, & Java lines.filter(lambda s: “ERROR” in s).count() Interactive Shells Scala •  Python & Scala val lines = sc.textFile(...) lines.filter(x = x.contains(“ERROR”)).count() Performance Java •  Java & Scala are faster due JavaRDDString lines = sc.textFile(...); to static typing lines.filter(new FunctionString, Boolean() Boolean call(String s) •  …but Python is often fine return s.contains(“error”); ).count(); Interactive Shell •  The Fastest Way to Learn Spark •  Available in Python and Scala •  Runs as an application on an existing Spark Cluster… •  OR Can run locally Administrative GUIs h5p://Standalone  Master:8080  (by  default)  JOB EXECUTION Software Components Your  applicaon   •  Spark runs as a library in your SparkContext   program (1 instance per app) •  Runs tasks locally or on Cluster   Local   cluster manager   threads   –  Mesos, YARN or standalone Worker   Worker   mode Spark   Spark   executor   executor   •  Accesses storage systems via Hadoop InputFormat API HDFS  or  other  storage   –  Can use HBase, HDFS, S3, … Task Scheduler •  General task B:   A:   graphs F:   •  Automatically Stage  1   groupBy   pipelines functions C:   D:   E:   •  Data locality aware join   •  Partitioning aware" to avoid shuffles map   filter   Stage  2   Stage  3   =  RDD   =  cached  partition  Advanced Features •  Controllable partitioning – Speed up joins against a dataset •  Controllable storage formats – Keep data serialized for efficiency, replicate to multiple nodes, cache on disk •  Shared variables: broadcasts, accumulators •  See online docs for details Local Execution •  Just pass local or localk as master URL •  Debug using local debuggers – For Java / Scala, just run your program in a debugger – For Python, use an attachable debugger (e.g. PyDev) •  Great for development & unit tests Cluster Execution •  Easiest way to launch is EC2:" " ./spark-ec2 -k keypair –i id_rsa.pem –s slaves \" launchstopstartdestroy clusterName •  Several options for private clusters: –  Standalone mode (similar to Hadoop’s deploy scripts) –  Mesos –  Hadoop YARN •  Amazon EMR: WORKING WITH SPARK