Writing a Standalone Program with the Spark API

This post follows the Self-Contained Applications section of the Spark website and builds a small standalone program in Scala.

(1) First, install sbt by following the official documentation. I used the RPM package:

curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
sudo yum install sbt

(2) Next, create a SparkApp directory under /home with the following layout:

bash-4.1# find /home/SparkApp/
/home/SparkApp/
/home/SparkApp/simple.sbt
/home/SparkApp/src
/home/SparkApp/src/main
/home/SparkApp/src/main/scala
/home/SparkApp/src/main/scala/SimpleApp.scala

The contents of simple.sbt are as follows:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
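
The %% operator in the dependency line asks sbt to append the Scala binary version to the artifact name. As a sketch, assuming scalaVersion stays on the 2.10.x line, the same dependency could be written with a single % and the suffix spelled out by hand:

```scala
// Assumed-equivalent form of the dependency line above for scalaVersion 2.10.x:
// "%%" appends the Scala binary version ("_2.10") to the artifact name,
// whereas a single "%" uses the artifact name exactly as written.
libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.3.0"
```

The %% form is usually preferable, because the suffix then follows scalaVersion automatically if that setting changes.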

The SimpleApp.scala program is as follows:

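import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val logFile = "file:///usr/local/spark/README.md" // Should be some file on your system
    // The master URL is not set here; it is supplied by spark-submit (--master)
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    // Read the file with at least 2 partitions and cache it, since it is scanned twice
    val logData = sc.textFile(logFile, 2).cache()
    // Count the lines that contain "a" and the lines that contain "b"
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
    // Release the resources held by the SparkContext
    sc.stop()
  }
}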
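
The counting logic can also be tried without Spark. The following is a minimal sketch using plain Scala collections; the CountSketch object, the countContaining helper, and the sample lines are made up for illustration and are not part of the Spark example.

```scala
// Plain-Scala sketch of the counting logic in SimpleApp (no Spark required).
// CountSketch, countContaining and the sample lines are hypothetical.
object CountSketch {
  // Count the lines that contain the given substring.
  def countContaining(lines: Seq[String], needle: String): Long =
    lines.count(_.contains(needle)).toLong

  def main(args: Array[String]): Unit = {
    val lines = Seq("apache spark", "big data", "hello", "")
    val numAs = countContaining(lines, "a")
    val numBs = countContaining(lines, "b")
    // Prints: Lines with a: 2, Lines with b: 1
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
```

In the Spark version, filter and count have the same meaning, but they operate on an RDD whose partitions may be processed in parallel.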

(3) Run the sbt package command in the project directory (/home/SparkApp) to build the jar file:

bash-4.1# sbt package
......
[success] Total time: 89 s, completed May 25, 2015 10:16:51 PM

(4) Run the program with the spark-submit script:

bash-4.1# /usr/local/spark/bin/spark-submit --class "SimpleApp" --master local[4] target/scala-2.10/simple-project_2.10-1.0.jar
......
Lines with a: 60, Lines with b: 29

As shown, the program prints the expected result.
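
The jar path passed to spark-submit in step (4) follows from the name, version, and scalaVersion settings in simple.sbt. The sketch below reproduces that derivation under two assumptions: that the project name is lowercased with runs of non-word characters replaced by a hyphen, and that the binary version is the major.minor part of scalaVersion. JarPathSketch and its methods are hypothetical helpers for illustration, not an sbt API.

```scala
import java.util.Locale

// Hypothetical helper, not an sbt API: shows how the jar path used in step (4)
// can be derived from the settings in simple.sbt.
object JarPathSketch {
  // Lowercase the project name and replace runs of non-word characters with "-".
  def normalize(name: String): String =
    name.toLowerCase(Locale.ENGLISH).replaceAll("""\W+""", "-")

  // "2.10.4" -> "2.10": keep only the major.minor part (assumed rule).
  def binaryVersion(scalaVersion: String): String =
    scalaVersion.split('.').take(2).mkString(".")

  def jarPath(name: String, version: String, scalaVersion: String): String = {
    val bin = binaryVersion(scalaVersion)
    s"target/scala-$bin/${normalize(name)}_$bin-$version.jar"
  }

  def main(args: Array[String]): Unit =
    // Prints: target/scala-2.10/simple-project_2.10-1.0.jar
    println(jarPath("Simple Project", "1.0", "2.10.4"))
}
```

If any of the three settings changes, the path given to spark-submit has to change with it.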