Apache Spark Introduction (PPT)

Intro to Apache Spark
http://databricks.com/
download slides: training.databricks.com/workshop/itas_workshop.pdf
Licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License

Intro: Success Criteria
By end of day, participants will be comfortable with the following:
• open a Spark Shell
• develop Spark apps for typical use cases
• tour of the Spark API
• explore data sets loaded from HDFS, etc.
• review of Spark SQL, Spark Streaming, MLlib
• follow-up courses and certification
• developer community resources, events, etc.
• return to workplace and demo use of Spark

01: Getting Started
Installation (hands-on lab: 20 min)

Installation:
Let's get started using Apache Spark, in just four easy steps…
databricks.com/spark-training-resources
for class, copy from the USB sticks
NB: please do not install/run Spark using:
• Homebrew on MacOSX
• Cygwin on Windows

Step 1: Install Java
JDK 6/7 on MacOSX or Windows
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
• follow the license agreement instructions
• then click the download for your OS
• need JDK instead of JRE (for Maven, etc.)

Step 2: Download Spark
we will use Spark 1.1.0
1. copy from the USB sticks
2. double click the archive file to open it
3. connect into the newly created directory
for a fallback: spark.apache.org/downloads.html

Step 3: Run Spark Shell
we'll run Spark's interactive shell…
within the "spark" directory, run:
./bin/spark-shell
then from the "scala>" REPL prompt, let's create some data…
val data = 1 to 10000

Step 4: Create an RDD
create an RDD based on that data…
val distData = sc.parallelize(data)
then use a filter to select values less than 10…
distData.filter(_ < 10).collect()

Checkpoint: what do you get for results?
gist.github.com/ceteri/f2c3486062c9610eac1d#file-01-repl-txt
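Putting Steps 3 and 4 together, here is a minimal sketch of the full shell session; the REPL banners and res numbers vary by Spark version, but filtering 1 to 10000 for values below 10 yields the integers 1 through 9:

// run inside ./bin/spark-shell, where "sc" (the SparkContext) is already defined
val data = 1 to 10000                 // a local Scala range
val distData = sc.parallelize(data)   // distribute it as an RDD
distData.filter(_ < 10).collect()     // returns Array(1, 2, 3, 4, 5, 6, 7, 8, 9)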
Installation: Optional Downloads: Python
For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform:
store.continuum.io/cshop/anaconda/

Installation: Optional Downloads: Maven
Java builds later also require Maven, which you can download at:
maven.apache.org/download.cgi

02: Getting Started
Spark Deconstructed (lecture: 20 min)

Spark Deconstructed:
Let's spend a few minutes on this Scala thing…
scala-lang.org/
Scala Crash Course, Holden Karau:
lintool.github.io/SparkTutorial/slides/day1_Scala_crash_course.pdf
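If you skip the linked crash course, here is a small illustrative sketch (not taken from those slides) of the Scala constructs the examples below rely on: val bindings, anonymous functions, and the underscore shorthand for a single argument:

val nums = List(1, 2, 3, 4, 5)            // immutable value binding
val doubled = nums.map(n => n * 2)        // anonymous function: n => n * 2
val small = nums.filter(_ < 3)            // "_" stands in for the single argument
val firstWords = List("a b", "c d").map(_.split(" ")).map(arr => arr(0))   // chaining; arr(0) indexes an Array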
Spark Deconstructed: Log Mining Example

// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
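Nothing in the example requires a real HDFS cluster. A hedged sketch of a local variant, using a few hypothetical in-memory log lines in place of the hdfs://... path, can be pasted straight into spark-shell:

// hypothetical sample log lines, tab-separated: LEVEL <TAB> message
val sampleLines = Seq(
  "ERROR\tmysql connection refused",
  "INFO\trequest served",
  "ERROR\tphp fatal error",
  "ERROR\tdisk full")

val linesLocal = sc.parallelize(sampleLines)                        // base RDD from local data
val errorsLocal = linesLocal.filter(_.startsWith("ERROR"))          // keep only ERROR lines
val messagesLocal = errorsLocal.map(_.split("\t")).map(r => r(1))   // keep the message field
messagesLocal.cache()

messagesLocal.filter(_.contains("mysql")).count()   // action 1, returns 1
messagesLocal.filter(_.contains("php")).count()     // action 2, returns 1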
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster… submitting code to be evaluated on it:
[diagram: Driver program connected to three Workers]

Spark Deconstructed: Log Mining Example
At this point, take a look at the transformed RDD operator graph:

scala> messages.toDebugString
res5: String =
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)

Spark Deconstructed: Log Mining Example
The following slides repeat the same code, highlighting one part at a time while the cluster diagram fills in: the Driver submits the job to the three Workers, and each Worker reads one HDFS block (block 1, block 2, block 3) of the log file.
[diagram: Driver plus three Workers, each holding one HDFS block]
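To see the same kind of lineage on the local variant sketched earlier (assuming the messagesLocal RDD defined there), the shell can print its debug string and storage level; the exact RDD class names and console line numbers vary by Spark version:

// lineage of the locally built RDD
println(messagesLocal.toDebugString)

// cache() is shorthand for persist() at the default MEMORY_ONLY level,
// so the second action can reuse the in-memory partitions instead of recomputing them
println(messagesLocal.getStorageLevel)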