apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ...
  namespace: ...
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/load-balancer-name: ...
    alb.ingress.kubernetes.io/scheme: internal
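For context, a minimal sketch of the rest of the manifest; the Service name and port are hypothetical, and the annotations above stay unchanged:

spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service  # hypothetical Service name
                port:
                  number: 80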
spark
  .read                   // read the data
  .format("jdbc")         // not only "jdbc" but other formats such as "kafka" are available
  .load()
  .join(...)              // join (merge) with other data
  .where(...)             // filter rows, and/or
  .selectExpr(...)        // select only the columns you need
  .repartition(5, "col1") // define how many ways, and by which key, the work is distributed
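A runnable PySpark sketch of the same pipeline; the JDBC URL, paths, table, and column names below are all hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dfOrders = (spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/shop")  # hypothetical JDBC URL
    .option("dbtable", "orders")                      # hypothetical table
    .load())

dfUsers = spark.read.parquet("s3://bucket/users/")    # hypothetical path

dfResult = (dfOrders
    .join(dfUsers, "user_id")         # merge with other data
    .where("amount > 100")            # filter rows
    .selectExpr("user_id", "amount")  # keep only the needed columns
    .repartition(5, "user_id"))       # 5 partitions, keyed by user_id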
spark.executor.instances = 10
spark.executor.cores = 10
spark.executor.memory = 30g                # GiB
spark.executor.memoryOverheadFactor = 0.1  # overhead defaults to executorMemory * 0.10 (at least 384 MiB)
spark.memory.fraction = 0.8
spark.memory.storageFraction = 0.5
spark.memory.offHeap.enabled = false
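Worked through, these numbers explain the limit in the error below: YARN sizes each container at spark.executor.memory plus the overhead, i.e. 30 GiB + 30 GiB * 0.1 = 33 GiB. Inside the heap, the unified execution/storage region is (30 GiB - 300 MiB reserved) * 0.8, roughly 23.8 GiB, and spark.memory.storageFraction = 0.5 protects half of that region for cached data.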
ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason:
Container killed by YARN for exceeding memory limits. 33.2 GB of 33 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.
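One hedged way to apply the suggestion from PySpark; the 6g figure is only an illustrative guess, and spark.executor.memoryOverhead is the current name for the deprecated spark.yarn.executor.memoryOverhead:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .config("spark.executor.memory", "30g")
    .config("spark.executor.memoryOverhead", "6g")  # hypothetical bump; the container cap becomes 36 GiB
    .getOrCreate())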
dfInitial = spark.read.load(...)
dfFiltered = dfInitial.select(...).where(...).cache()  # mark dfFiltered for caching
dfJoined = (...)

# Calling an action triggers execution:
# the transformations run, dfFiltered is computed, and then
# dfFiltered is cached in memory across the executors
dfJoined.write.save(...)
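A runnable version of the same flow, assuming hypothetical S3 paths and columns; unpersist() releases the cached blocks once they are no longer needed:

dfInitial = spark.read.parquet("s3://bucket/events/")         # hypothetical path
dfFiltered = dfInitial.select("user_id", "amount").where("amount > 0").cache()

dfUsers = spark.read.parquet("s3://bucket/users/")            # hypothetical path
dfJoined = dfFiltered.join(dfUsers, "user_id")

dfJoined.write.mode("overwrite").parquet("s3://bucket/out/")  # action: computes and caches dfFiltered
dfFiltered.unpersist()                                        # free executor memory when done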
df = spark.read.load(...)
df.createOrReplaceTempView("PROPERTY_META")
spark.sql("SELECT * FROM PROPERTY_META ..")
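Concretely, assuming a hypothetical Parquet source; the temp view is scoped to the current SparkSession:

df = spark.read.parquet("s3://bucket/property_meta/")  # hypothetical path
df.createOrReplaceTempView("PROPERTY_META")
spark.sql("SELECT property_id, COUNT(*) AS cnt FROM PROPERTY_META GROUP BY property_id").show()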
# Because repartition was called without a key column,
# rows for the same property_id (e.g., 2101) are split across
# several partitions = connections and inserted separately
df\
    .repartition(10)\
    .write\
    .mode("append")\
    .format("jdbc")\
    .option("numPartitions", "10")\
    .save()  # JDBC url/dbtable options elided
# Keyed on property_id, the partitions split so that each property_id
# is written through a single connection, as the sketch below shows
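A sketch of that column-keyed write, with the JDBC url/dbtable options elided as in the snippet above:

df\
    .repartition(10, "property_id")\
    .write\
    .mode("append")\
    .format("jdbc")\
    .option("numPartitions", "10")\
    .save()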
CREATE TABLE ...
(
    ...
    PRIMARY KEY (property_id, part)
)
df = spark.read.csv(...)
dfSelected = df.selectExpr("...")

df.rdd.id()          # id of the RDD backing df
dfSelected.rdd.id()  # a different id: every transformation returns a new DataFrame, hence a new RDD