Pig Data Deduplication: DISTINCT Guide
In Pig, you can deduplicate data with the DISTINCT operator in Pig Latin. DISTINCT removes duplicate tuples from a relation and keeps one copy of each unique tuple. It compares entire tuples, so two records count as duplicates only when every field matches.
The following example uses DISTINCT to remove duplicate records:
-- Load the data
data = LOAD 'inputData.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);
-- Remove duplicate tuples
unique_data = DISTINCT data;
-- Store the deduplicated data
STORE unique_data INTO 'outputData' USING PigStorage(',');
In the example above, the script loads the input data, removes duplicate tuples with DISTINCT, and stores the deduplicated result at the specified output path.
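Because DISTINCT compares whole tuples, it cannot deduplicate on a single field directly. The sketch below shows two common workarounds, reusing the schema from the example above. The relation names (names, unique_names, by_id, one_per_id) and the output paths are illustrative assumptions, and the nested LIMIT assumes a Pig version that supports LIMIT inside a nested FOREACH block.

```pig
data = LOAD 'inputData.txt' USING PigStorage(',') AS (id:int, name:chararray, age:int);

-- Option 1: project only the field(s) that define a duplicate, then apply DISTINCT.
-- The result contains just the projected fields.
names = FOREACH data GENERATE name;
unique_names = DISTINCT names;

-- Option 2: keep one full record per id by grouping and taking one tuple per group.
-- Without an ORDER inside the block, which record survives is not guaranteed.
by_id = GROUP data BY id;
one_per_id = FOREACH by_id {
    first = LIMIT data 1;
    GENERATE FLATTEN(first);
};

STORE unique_names INTO 'uniqueNames' USING PigStorage(',');
STORE one_per_id INTO 'onePerId' USING PigStorage(',');
```

Option 1 is the simpler choice when only the key fields are needed downstream. Option 2 preserves the remaining columns, at the cost of choosing an arbitrary record for each key.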