개요

spark tutorial을 오랜만에 돌려 본 뒤 맥 기준환경 설정 및 튜토리얼 실행방법을 정리해둔다.

adoptopenjdk8 설치

brew tap adoptopenjdk/openjdk
brew install --cask adoptopenjdk8

Scala Install 설치

brew install coursier/formulas/coursier && cs setup

#.bashrc에 추가
#export PATH="$PATH:/Users/lks21c/Library/Application Support/Coursier/bin"

cs install scala:3.1.1 && cs install scalac:3.1.1

파일구성 및 빌드

자바 메이븐 프로젝트 처럼 아래와 같이 디렉토리를 기본 구성해 준다.

src/main/scala/SimpleApp.scala

YOUR_SPARK_HOME를 적절한 경로로 바꾸어준다. 필자는 ~/spark로 심볼릭 링크를 걸어서 만들어두었다.

/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}

bundle.sbt

name := "Simple Project"

version := "1.0"

scalaVersion := "2.12.15"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"

파일 정리

아래와 같이 파일이 구성되며 된다.

$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala

패키지 빌드

$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.12/simple-project_2.12-1.0.jar

검증하기

아래 커맨드를 실행해보면 정상 실행됨을 확인 할 수 있다.

$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.12/simple-project_2.12-1.0.jar

그러면 아래와 같은 로그를 얻을 수 있다.

$ ~/spark/bin/spark-submit   --class "SimpleApp"   --master local[4]   ~/repo/simple_spark/target/scala-2.12/simple-project_2.12-1.0.jar > a.txt
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/03/15 14:56:03 INFO SparkContext: Running Spark version 3.2.1
22/03/15 14:56:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/15 14:56:03 INFO ResourceUtils: ==============================================================
22/03/15 14:56:03 INFO ResourceUtils: No custom resources configured for spark.driver.
22/03/15 14:56:03 INFO ResourceUtils: ==============================================================
22/03/15 14:56:03 INFO SparkContext: Submitted application: Simple Application
22/03/15 14:56:03 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/03/15 14:56:03 INFO ResourceProfile: Limiting resource is cpu
22/03/15 14:56:03 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/03/15 14:56:03 INFO SecurityManager: Changing view acls to: lks21c
22/03/15 14:56:03 INFO SecurityManager: Changing modify acls to: lks21c
22/03/15 14:56:03 INFO SecurityManager: Changing view acls groups to:
22/03/15 14:56:03 INFO SecurityManager: Changing modify acls groups to:
22/03/15 14:56:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(lks21c); groups with view permissions: Set(); users  with modify permissions: Set(lks21c); groups with modify permissions: Set()
22/03/15 14:56:03 INFO Utils: Successfully started service 'sparkDriver' on port 50946.
22/03/15 14:56:03 INFO SparkEnv: Registering MapOutputTracker
22/03/15 14:56:04 INFO SparkEnv: Registering BlockManagerMaster
22/03/15 14:56:04 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
22/03/15 14:56:04 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
22/03/15 14:56:04 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
22/03/15 14:56:04 INFO DiskBlockManager: Created local directory at /private/var/folders/46/lkn4wlfd7v3gppr3wvxnps700000gn/T/blockmgr-885f123e-9ba1-4cd8-8a14-e1756d4c0c13
22/03/15 14:56:04 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/03/15 14:56:04 INFO SparkEnv: Registering OutputCommitCoordinator
22/03/15 14:56:04 INFO Utils: Successfully started service 'SparkUI' on port 4040.
22/03/15 14:56:04 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://igwangscbookpro:4040
22/03/15 14:56:04 INFO SparkContext: Added JAR file:/Users/lks21c/repo/simple_spark/target/scala-2.12/simple-project_2.12-1.0.jar at spark://igwangscbookpro:50946/jars/simple-project_2.12-1.0.jar with timestamp 1647323763277
22/03/15 14:56:04 INFO Executor: Starting executor ID driver on host igwangscbookpro
22/03/15 14:56:04 INFO Executor: Fetching spark://igwangscbookpro:50946/jars/simple-project_2.12-1.0.jar with timestamp 1647323763277
22/03/15 14:56:04 INFO TransportClientFactory: Successfully created connection to igwangscBookPro/192.168.1.248:50946 after 30 ms (0 ms spent in bootstraps)
22/03/15 14:56:04 INFO Utils: Fetching spark://igwangscbookpro:50946/jars/simple-project_2.12-1.0.jar to /private/var/folders/46/lkn4wlfd7v3gppr3wvxnps700000gn/T/spark-59387458-0823-4846-93c4-617d446fcff4/userFiles-c8c462e0-3369-46ce-a1c6-4afb7358e3b6/fetchFileTemp4064235344669326924.tmp
22/03/15 14:56:04 INFO Executor: Adding file:/private/var/folders/46/lkn4wlfd7v3gppr3wvxnps700000gn/T/spark-59387458-0823-4846-93c4-617d446fcff4/userFiles-c8c462e0-3369-46ce-a1c6-4afb7358e3b6/simple-project_2.12-1.0.jar to class loader
22/03/15 14:56:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 50948.
22/03/15 14:56:04 INFO NettyBlockTransferService: Server created on igwangscbookpro:50948
22/03/15 14:56:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/03/15 14:56:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, igwangscbookpro, 50948, None)
22/03/15 14:56:04 INFO BlockManagerMasterEndpoint: Registering block manager igwangscbookpro:50948 with 366.3 MiB RAM, BlockManagerId(driver, igwangscbookpro, 50948, None)
22/03/15 14:56:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, igwangscbookpro, 50948, None)
22/03/15 14:56:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, igwangscbookpro, 50948, None)
22/03/15 14:56:05 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
22/03/15 14:56:05 INFO SharedState: Warehouse path is 'file:/Users/lks21c/repo/simple_spark/spark-warehouse'.
22/03/15 14:56:05 INFO InMemoryFileIndex: It took 62 ms to list leaf files for 1 paths.
22/03/15 14:56:08 INFO FileSourceStrategy: Pushed Filters:
22/03/15 14:56:08 INFO FileSourceStrategy: Post-Scan Filters:
22/03/15 14:56:08 INFO FileSourceStrategy: Output Data Schema: struct<value: string>
22/03/15 14:56:09 INFO CodeGenerator: Code generated in 278.353833 ms
22/03/15 14:56:09 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 338.2 KiB, free 366.0 MiB)
22/03/15 14:56:09 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 32.6 KiB, free 365.9 MiB)
22/03/15 14:56:09 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on igwangscbookpro:50948 (size: 32.6 KiB, free: 366.3 MiB)
22/03/15 14:56:10 INFO SparkContext: Created broadcast 0 from count at SimpleApp.scala:9
22/03/15 14:56:10 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
22/03/15 14:56:10 INFO DAGScheduler: Registering RDD 8 (count at SimpleApp.scala:9) as input to shuffle 0
22/03/15 14:56:10 INFO DAGScheduler: Got map stage job 0 (count at SimpleApp.scala:9) with 1 output partitions
22/03/15 14:56:10 INFO DAGScheduler: Final stage: ShuffleMapStage 0 (count at SimpleApp.scala:9)
22/03/15 14:56:10 INFO DAGScheduler: Parents of final stage: List()
22/03/15 14:56:10 INFO DAGScheduler: Missing parents: List()
22/03/15 14:56:10 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[8] at count at SimpleApp.scala:9), which has no missing parents
22/03/15 14:56:10 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 19.5 KiB, free 365.9 MiB)
22/03/15 14:56:10 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 9.0 KiB, free 365.9 MiB)
22/03/15 14:56:10 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on igwangscbookpro:50948 (size: 9.0 KiB, free: 366.3 MiB)
22/03/15 14:56:10 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1478
22/03/15 14:56:10 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[8] at count at SimpleApp.scala:9) (first 15 tasks are for partitions Vector(0))
22/03/15 14:56:10 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks resource profile 0
22/03/15 14:56:10 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0) (igwangscbookpro, executor driver, partition 0, PROCESS_LOCAL, 4850 bytes) taskResourceAssignments Map()
22/03/15 14:56:10 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/03/15 14:56:11 INFO FileScanRDD: Reading File path: file:///Users/lks21c/spark/README.md, range: 0-4512, partition values: [empty row]
22/03/15 14:56:11 INFO CodeGenerator: Code generated in 39.607875 ms
22/03/15 14:56:11 INFO MemoryStore: Block rdd_3_0 stored as values in memory (estimated size 5.1 KiB, free 365.9 MiB)
22/03/15 14:56:11 INFO BlockManagerInfo: Added rdd_3_0 in memory on igwangscbookpro:50948 (size: 5.1 KiB, free: 366.3 MiB)
22/03/15 14:56:11 INFO CodeGenerator: Code generated in 16.053417 ms
22/03/15 14:56:11 INFO CodeGenerator: Code generated in 45.85025 ms
22/03/15 14:56:11 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2160 bytes result sent to driver
22/03/15 14:56:11 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 786 ms on igwangscbookpro (executor driver) (1/1)
22/03/15 14:56:11 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/03/15 14:56:11 INFO DAGScheduler: ShuffleMapStage 0 (count at SimpleApp.scala:9) finished in 1.116 s
22/03/15 14:56:11 INFO DAGScheduler: looking for newly runnable stages
22/03/15 14:56:11 INFO DAGScheduler: running: Set()
22/03/15 14:56:11 INFO DAGScheduler: waiting: Set()
22/03/15 14:56:11 INFO DAGScheduler: failed: Set()
22/03/15 14:56:11 INFO CodeGenerator: Code generated in 48.862792 ms
22/03/15 14:56:11 INFO SparkContext: Starting job: count at SimpleApp.scala:9
22/03/15 14:56:11 INFO DAGScheduler: Got job 1 (count at SimpleApp.scala:9) with 1 output partitions
22/03/15 14:56:11 INFO DAGScheduler: Final stage: ResultStage 2 (count at SimpleApp.scala:9)
22/03/15 14:56:11 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 1)
22/03/15 14:56:11 INFO DAGScheduler: Missing parents: List()
22/03/15 14:56:11 INFO DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[11] at count at SimpleApp.scala:9), which has no missing parents
22/03/15 14:56:11 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 11.0 KiB, free 365.9 MiB)
22/03/15 14:56:11 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 5.5 KiB, free 365.9 MiB)
22/03/15 14:56:11 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on igwangscbookpro:50948 (size: 5.5 KiB, free: 366.2 MiB)
22/03/15 14:56:11 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1478
22/03/15 14:56:11 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[11] at count at SimpleApp.scala:9) (first 15 tasks are for partitions Vector(0))
22/03/15 14:56:11 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks resource profile 0
22/03/15 14:56:11 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 1) (igwangscbookpro, executor driver, partition 0, NODE_LOCAL, 4453 bytes) taskResourceAssignments Map()
22/03/15 14:56:11 INFO Executor: Running task 0.0 in stage 2.0 (TID 1)
22/03/15 14:56:11 INFO ShuffleBlockFetcherIterator: Getting 1 (60.0 B) non-empty blocks including 1 (60.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
22/03/15 14:56:11 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms
22/03/15 14:56:11 INFO Executor: Finished task 0.0 in stage 2.0 (TID 1). 2691 bytes result sent to driver
22/03/15 14:56:11 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 1) in 150 ms on igwangscbookpro (executor driver) (1/1)
22/03/15 14:56:11 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
22/03/15 14:56:11 INFO DAGScheduler: ResultStage 2 (count at SimpleApp.scala:9) finished in 0.175 s
22/03/15 14:56:11 INFO DAGScheduler: Job 1 is finished. Cancelling potential speculative or zombie tasks for this job
22/03/15 14:56:11 INFO TaskSchedulerImpl: Killing all running tasks in stage 2: Stage finished
22/03/15 14:56:11 INFO DAGScheduler: Job 1 finished: count at SimpleApp.scala:9, took 0.215160 s
22/03/15 14:56:12 INFO DAGScheduler: Registering RDD 16 (count at SimpleApp.scala:10) as input to shuffle 1
22/03/15 14:56:12 INFO DAGScheduler: Got map stage job 2 (count at SimpleApp.scala:10) with 1 output partitions
22/03/15 14:56:12 INFO DAGScheduler: Final stage: ShuffleMapStage 3 (count at SimpleApp.scala:10)
22/03/15 14:56:12 INFO DAGScheduler: Parents of final stage: List()
22/03/15 14:56:12 INFO DAGScheduler: Missing parents: List()
22/03/15 14:56:12 INFO DAGScheduler: Submitting ShuffleMapStage 3 (MapPartitionsRDD[16] at count at SimpleApp.scala:10), which has no missing parents
22/03/15 14:56:12 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 19.5 KiB, free 365.9 MiB)
22/03/15 14:56:12 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 9.0 KiB, free 365.9 MiB)
22/03/15 14:56:12 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory on igwangscbookpro:50948 (size: 9.0 KiB, free: 366.2 MiB)
22/03/15 14:56:12 INFO SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1478
22/03/15 14:56:12 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 3 (MapPartitionsRDD[16] at count at SimpleApp.scala:10) (first 15 tasks are for partitions Vector(0))
22/03/15 14:56:12 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks resource profile 0
22/03/15 14:56:12 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 2) (igwangscbookpro, executor driver, partition 0, PROCESS_LOCAL, 4850 bytes) taskResourceAssignments Map()
22/03/15 14:56:12 INFO Executor: Running task 0.0 in stage 3.0 (TID 2)
22/03/15 14:56:12 INFO BlockManager: Found block rdd_3_0 locally
22/03/15 14:56:12 INFO Executor: Finished task 0.0 in stage 3.0 (TID 2). 2117 bytes result sent to driver
22/03/15 14:56:12 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 2) in 53 ms on igwangscbookpro (executor driver) (1/1)
22/03/15 14:56:12 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
22/03/15 14:56:12 INFO DAGScheduler: ShuffleMapStage 3 (count at SimpleApp.scala:10) finished in 0.090 s
22/03/15 14:56:12 INFO DAGScheduler: looking for newly runnable stages
22/03/15 14:56:12 INFO DAGScheduler: running: Set()
22/03/15 14:56:12 INFO DAGScheduler: waiting: Set()
22/03/15 14:56:12 INFO DAGScheduler: failed: Set()
22/03/15 14:56:12 INFO SparkContext: Starting job: count at SimpleApp.scala:10
22/03/15 14:56:12 INFO DAGScheduler: Got job 3 (count at SimpleApp.scala:10) with 1 output partitions
22/03/15 14:56:12 INFO DAGScheduler: Final stage: ResultStage 5 (count at SimpleApp.scala:10)
22/03/15 14:56:12 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 4)
22/03/15 14:56:12 INFO DAGScheduler: Missing parents: List()
22/03/15 14:56:12 INFO DAGScheduler: Submitting ResultStage 5 (MapPartitionsRDD[19] at count at SimpleApp.scala:10), which has no missing parents
22/03/15 14:56:12 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 11.0 KiB, free 365.9 MiB)
22/03/15 14:56:12 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 5.5 KiB, free 365.8 MiB)
22/03/15 14:56:12 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on igwangscbookpro:50948 (size: 5.5 KiB, free: 366.2 MiB)
22/03/15 14:56:12 INFO SparkContext: Created broadcast 4 from broadcast at DAGScheduler.scala:1478
22/03/15 14:56:12 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 5 (MapPartitionsRDD[19] at count at SimpleApp.scala:10) (first 15 tasks are for partitions Vector(0))
22/03/15 14:56:12 INFO TaskSchedulerImpl: Adding task set 5.0 with 1 tasks resource profile 0
22/03/15 14:56:12 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 3) (igwangscbookpro, executor driver, partition 0, NODE_LOCAL, 4453 bytes) taskResourceAssignments Map()
22/03/15 14:56:12 INFO Executor: Running task 0.0 in stage 5.0 (TID 3)
22/03/15 14:56:12 INFO ShuffleBlockFetcherIterator: Getting 1 (60.0 B) non-empty blocks including 1 (60.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
22/03/15 14:56:12 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
22/03/15 14:56:12 INFO Executor: Finished task 0.0 in stage 5.0 (TID 3). 2648 bytes result sent to driver
22/03/15 14:56:12 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 3) in 31 ms on igwangscbookpro (executor driver) (1/1)
22/03/15 14:56:12 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool
22/03/15 14:56:12 INFO DAGScheduler: ResultStage 5 (count at SimpleApp.scala:10) finished in 0.045 s
22/03/15 14:56:12 INFO DAGScheduler: Job 3 is finished. Cancelling potential speculative or zombie tasks for this job
22/03/15 14:56:12 INFO TaskSchedulerImpl: Killing all running tasks in stage 5: Stage finished
22/03/15 14:56:12 INFO DAGScheduler: Job 3 finished: count at SimpleApp.scala:10, took 0.061911 s
22/03/15 14:56:12 INFO SparkUI: Stopped Spark web UI at http://igwangscbookpro:4040
22/03/15 14:56:12 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/03/15 14:56:12 INFO MemoryStore: MemoryStore cleared
22/03/15 14:56:12 INFO BlockManager: BlockManager stopped
22/03/15 14:56:12 INFO BlockManagerMaster: BlockManagerMaster stopped
22/03/15 14:56:12 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/03/15 14:56:12 INFO SparkContext: Successfully stopped SparkContext
22/03/15 14:56:12 INFO ShutdownHookManager: Shutdown hook called
22/03/15 14:56:12 INFO ShutdownHookManager: Deleting directory /private/var/folders/46/lkn4wlfd7v3gppr3wvxnps700000gn/T/spark-6f2f3488-5715-4efc-9b3a-d23f6ab5b0b9
22/03/15 14:56:12 INFO ShutdownHookManager: Deleting directory /private/var/folders/46/lkn4wlfd7v3gppr3wvxnps700000gn/T/spark-59387458-0823-4846-93c4-617d446fcff4

Reference

  • https://spark.apache.org/docs/latest/quick-start.html