Hadoop Read/Write Process Explained

The read and write process in Hadoop mainly consists of two parts: the HDFS read/write process and the MapReduce read/write process.

1. The read and write process in HDFS:
   1. Writing process: When a client needs to write data to HDFS, the data is first split into blocks (usually 128 MB) and replicated. The HDFS client asks the NameNode to allocate each block; the NameNode records the block metadata in its EditLog and keeps the block-to-DataNode mapping in the BlocksMap. The client then streams each block to a pipeline of DataNodes, which store it on local disk, replicate it, and send acknowledgements back. Finally, the NameNode updates the metadata and the write result is returned to the client.
   2. Reading process: When a client wants to read data from HDFS, it sends a read request to the NameNode, which returns the locations of the file's blocks. The HDFS client then reads the blocks directly from the DataNodes and assembles them into the complete file (see the first sketch after this list).
2. The read and write process in MapReduce:
   1. Reading process: In a MapReduce job, the input data is typically read from HDFS. The framework splits the input into InputSplits, with each InputSplit corresponding to the input of one Map task, and then assigns the InputSplits to Map tasks running on different nodes in the cluster.
   2. Writing process: The output data of a MapReduce job is usually written to HDFS. Each Map task writes its intermediate results to temporary files on the local disk; the Reduce tasks fetch and merge the intermediate results from all Map tasks and write the final result to HDFS (see the second sketch after this list).
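To make the HDFS part concrete, here is a minimal Java sketch using the standard `org.apache.hadoop.fs.FileSystem` API. The NameNode address (`hdfs://namenode:8020`) and the file path are placeholders for illustration only, not values from this article; adjust them to your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class HdfsReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; replace with your cluster's fs.defaultFS.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/tmp/example.txt"); // hypothetical path

            // Write: the client obtains block allocations from the NameNode,
            // then streams the bytes to a pipeline of DataNodes.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hadoop\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read: the client fetches block locations from the NameNode,
            // then reads the blocks directly from the DataNodes.
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(
                         new InputStreamReader(in, StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}
```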
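For the MapReduce part, the classic WordCount job is a convenient sketch of the flow described above: Map tasks consume InputSplits read from HDFS, their intermediate output goes to the worker's local disk, and Reduce tasks merge that output and write the final result back to HDFS. The input and output paths are taken from the command line and are placeholders.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Each Map task processes one InputSplit read from HDFS; its output is
    // written as intermediate data on the local disk of the worker node.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce tasks fetch and merge the intermediate Map outputs, then write
    // the final result back to HDFS under the job's output path.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output locations on HDFS; both paths are placeholders.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```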

In general, the read/write process of Hadoop can be summarized as follows: when data is written, it is first split into blocks and replicated, the metadata is recorded, and the blocks are stored on DataNodes; when data is read, the block locations are looked up first, then the blocks are read from the DataNodes and merged.
