A question about pyspark
I'm wondering if anyone on this board has been playing with pyspark.
I'm currently going through the online docs and ran into a question; the URL is:
http://spark.apache.org/docs/latest/programming-guide.html#understanding-closures-a-nameclosureslinka
It has the following code example and explanation:
Consider the naive RDD element sum below, which may behave differently
depending on whether execution is happening within the same JVM.
A common example of this is when running Spark in local mode
(--master = local[n]) versus deploying a Spark application
to a cluster (e.g. via spark-submit to YARN):
The way I read it, if you run in local mode the value of counter can change,
but running on a cluster it cannot, is that right?
However, when I run the code below in local mode, the value of counter never changes at all.
Am I misreading it, or is there some setting I still need to configure?
counter = 0
rdd = sc.parallelize(data)  # assumes `sc` (SparkContext) and `data` are already defined

# Wrong: Don't do this!!
def increment_counter(x):
    global counter
    counter += x  # mutates the task's own copy of `counter`, not the driver's

rdd.foreach(increment_counter)
print("Counter value: ", counter)
--
※ Posted from: PTT (ptt.cc), From: 61.220.35.20
※ Article URL: https://www.ptt.cc/bbs/Python/M.1479355074.A.489.html