What is RDD in Spark?

1 year ago

William Carter

1 minute

RDD (Resilient Distributed Dataset) is the most fundamental data abstraction in Spark: an immutable, partitioned collection of elements that can be processed in parallel across the nodes of a cluster. An RDD can be created from external data sources such as Hadoop file systems, HBase, or Cassandra, or derived by applying transformations to other RDDs. RDDs are also fault-tolerant: each RDD records the lineage of transformations that produced it, so partitions lost to a node failure can be recomputed automatically.
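To make the ideas of immutability, partitioning, and recovery concrete, here is a minimal pure-Python sketch. It is a toy model rather than Spark's API: `ToyRDD`, `from_list`, and the `cached` dictionary are hypothetical names invented for illustration, and real Spark distributes partitions across executors instead of keeping them in one process.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)  # frozen: a ToyRDD cannot be modified once created
class ToyRDD:
    """Hypothetical stand-in for an RDD; not part of Spark's API."""
    num_partitions: int
    # Recomputes the elements of one partition from scratch.
    compute: Callable[[int], List[int]]

    @staticmethod
    def from_list(data: List[int], num_partitions: int) -> "ToyRDD":
        # Split the source data round-robin into one slice per partition.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return ToyRDD(num_partitions, lambda i: list(chunks[i]))

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        # A transformation returns a new ToyRDD whose compute function
        # calls back into its parent; that chain is the lineage.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.compute(i)])

    def filter(self, pred: Callable[[int], bool]) -> "ToyRDD":
        return ToyRDD(self.num_partitions,
                      lambda i: [x for x in self.compute(i) if pred(x)])

    def collect(self) -> List[int]:
        # Gather the elements of every partition into one list.
        return [x for i in range(self.num_partitions) for x in self.compute(i)]


source = ToyRDD.from_list(list(range(10)), num_partitions=3)
squares_of_evens = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Materialise each partition, as if each one lived on a different node.
cached = {i: squares_of_evens.compute(i)
          for i in range(squares_of_evens.num_partitions)}

# Simulate a node failure that loses partition 1.
lost = cached.pop(1)

# Recovery: recompute only the missing partition from its lineage.
recovered = squares_of_evens.compute(1)
cached[1] = recovered
```

In this sketch, `filter` and `map` never touch `source`; each returns a new object, and the lost partition is rebuilt by replaying those transformations on the original slice alone, without recomputing the other partitions.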