What are shared variables in Spark?

In Spark, shared variables are variables that can be shared between the driver and the tasks running on the cluster's worker nodes. Spark supports two types of shared variables: broadcast variables and accumulators.

  1. Broadcast Variables: Broadcast variables let the programmer cache a read-only variable on every node in the cluster so that tasks can use it without shipping a separate copy with each task. This reduces the cost of distributing the variable and improves runtime efficiency.
# Create a broadcast variable in Python (assumes an existing SparkContext `sc`)
broadcast_var = sc.broadcast([1, 2, 3])

# Use the broadcast variable inside a task
def my_func(value):
    # Read the cached list on the worker via .value
    return [num * value for num in broadcast_var.value]

rdd = sc.parallelize([10, 20, 30])
rdd.map(my_func).collect()
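To illustrate the efficiency point above, a common pattern is to broadcast a lookup table once and reference it from tasks instead of capturing it in every task's closure. The table and data below are hypothetical, shown only as a minimal sketch assuming the same SparkContext `sc`:

# Sketch (hypothetical data): broadcast a lookup table once,
# then read the node-local cached copy from each task via .value
country_names = sc.broadcast({"CN": "China", "US": "United States"})

codes = sc.parallelize(["CN", "US", "CN"])
names = codes.map(lambda code: country_names.value.get(code, "unknown")).collect()
print(names)  # ['China', 'United States', 'China']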
  2. Accumulators: Accumulators allow multiple tasks across the cluster to add to a shared variable, which is useful for counters and other aggregations. Tasks can only add to an accumulator; the driver reads the accumulated result. They are typically used to collect statistics while a job runs.
# Create an accumulator in Python (assumes an existing SparkContext `sc`)
accum = sc.accumulator(0)

# Use the accumulator inside a task
def my_func(value):
    accum.add(value)  # tasks can only add to the accumulator
    return value

rdd = sc.parallelize([1, 2, 3])
rdd.map(my_func).collect()
print(accum.value)  # the driver reads the result: 6

Be cautious when using shared variables, as they can lead to inconsistent state when many tasks update them: broadcast variables must be treated as read-only, and accumulator updates performed inside transformations may be applied more than once if a task is re-executed. Carefully consider each use case to keep results correct and reliable.
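For example, Spark only guarantees exactly-once application of accumulator updates when they happen inside an action; updates made inside a transformation such as map may be re-applied on task retries. A minimal sketch of the safer pattern, using hypothetical log data and the same SparkContext `sc` as above:

# Sketch: update the accumulator inside an action (foreach) rather than a
# transformation, so re-executed tasks do not double-count.
error_count = sc.accumulator(0)

def count_errors(line):
    if "ERROR" in line:
        error_count.add(1)

logs = sc.parallelize(["INFO ok", "ERROR disk full", "ERROR timeout"])
logs.foreach(count_errors)   # action: each task's update is applied once
print(error_count.value)     # read on the driver -> 2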
