What is an RDD in Spark?

1 year ago

Isabella Edwards

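An RDD (Resilient Distributed Dataset) is the most basic data abstraction in Spark: an immutable collection of records that is split into partitions across the nodes of a cluster so that it can be processed in parallel. RDDs can be created by parallelizing an existing collection, by reading HDFS files or other data sources, or by transforming another RDD. They support two kinds of operations: transformations (such as map and filter), which are lazy and only describe a new RDD, and actions (such as count and collect), which trigger the computation and return a result. An RDD can be cached in memory so that repeated use does not recompute it, which speeds up iterative and interactive processing. Fault tolerance comes from lineage: each RDD records the transformations used to build it, so a lost partition can be recomputed from its source rather than restored from a replica.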
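
The behaviour described above (lazy transformations, actions that trigger work, caching, and lineage) can be sketched with a small toy model. This is plain Python, not Spark's API: `ToyRDD` is a hypothetical class whose method names are borrowed from the real RDD methods, and it assumes a single machine with no partitions or cluster, so it only illustrates the evaluation model.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class ToyRDD:
    """Single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute: Callable[[], Iterator], lineage: List[str]):
        self._compute = compute        # recipe to (re)build the data on demand
        self.lineage = lineage         # names of the steps that produced this RDD
        self._cache: Optional[list] = None
        self._cache_enabled = False

    @classmethod
    def parallelize(cls, data: Iterable) -> "ToyRDD":
        snapshot = tuple(data)         # immutable copy of the source collection
        return cls(lambda: iter(snapshot), ["parallelize"])

    def _iterate(self) -> Iterator:
        if self._cache is not None:    # already materialized: reuse it
            return iter(self._cache)
        if self._cache_enabled:        # first use after cache(): materialize once
            self._cache = list(self._compute())
            return iter(self._cache)
        return self._compute()         # otherwise recompute from the lineage

    # Transformations: lazy, each returns a new ToyRDD and leaves the parent untouched.
    def map(self, fn: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (fn(x) for x in self._iterate()),
                      self.lineage + ["map"])

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._iterate() if pred(x)),
                      self.lineage + ["filter"])

    def cache(self) -> "ToyRDD":
        self._cache_enabled = True     # nothing is computed until an action runs
        return self

    # Actions: these actually run the recipe.
    def collect(self) -> list:
        return list(self._iterate())

    def count(self) -> int:
        return sum(1 for _ in self._iterate())


calls = []                             # records every time square() really runs


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = ToyRDD.parallelize(range(1, 6))
evens = numbers.map(square).filter(lambda x: x % 2 == 0).cache()
calls_before_action = len(calls)       # 0: transformations alone do no work

first = evens.collect()                # [4, 16]; square() runs for all 5 inputs
size = evens.count()                   # 2; served from the cache, no recomputation
```

In this sketch, `evens.lineage` is `["parallelize", "map", "filter"]`. Real Spark keeps a comparable record per partition, which is what lets it rebuild lost data after a node failure.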