Like spark-shell, a SageMaker notebook initializes the Spark session variable at startup.
Even without defining it yourself, running print(spark) shows that a value has already been assigned.
If you assign an extra Spark session variable on top of the built-in one, you will run into all sorts of errors.
A Glue job script, on the other hand, requires you to declare the Spark session variable yourself.
[ SageMaker ]
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glueContext = GlueContext(SparkContext.getOrCreate())
df = spark.createDataFrame(
    [(1, "foo", "USA"), (2, "bar", "KOREA")],
    ["id", "label", "country"],
)
df.printSchema()
df.createOrReplaceTempView("test")
spark.sql("select * from test").show()
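Since the session comes pre-initialized, you can verify it before writing any code of your own. A minimal check, assuming a SageMaker notebook attached to a Glue development endpoint (the printed address is illustrative):

# `spark` works without any assignment in the notebook
print(spark)          # e.g. <pyspark.sql.session.SparkSession object at 0x...>
print(spark.version)  # confirms the session is live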
[ Glue Job Script ]
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session  # unlike the notebook, the session must be declared explicitly
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# ...
job.commit()
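Once `spark` is declared this way, the notebook's DataFrame code runs unchanged inside the job body. A minimal sketch of what could fill the elided section above, reusing the sample data from the notebook example:

df = spark.createDataFrame(
    [(1, "foo", "USA"), (2, "bar", "KOREA")],
    ["id", "label", "country"],
)
df.createOrReplaceTempView("test")
spark.sql("select * from test").show()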