How can Flink achieve duplicate data removal?

2 years ago

William Carter

1 minute

Flink can achieve data deduplication by using the functions DataStream#keyBy and DataStream#distinct.

Below is a sample code demonstrating how to use Flink to implement data deduplication.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataDeduplicationExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 创建一个包含重复数据的DataStream
        DataStream<Tuple2<String, Integer>> input = env.fromElements(
                new Tuple2<>("A", 1),
                new Tuple2<>("B", 2),
                new Tuple2<>("A", 1),
                new Tuple2<>("C", 3),
                new Tuple2<>("B", 2)
        );

        // 使用keyBy函数将数据按key分组
        DataStream<Tuple2<String, Integer>> deduplicated = input
                .keyBy(0)
                .distinct();

        deduplicated.print();

        env.execute("Data Deduplication Example");
    }
}

In the example code above, we created a DataStream with duplicate data and grouped the data by the first field using the keyBy function. Next, we deduplicated each group using the distinct function. Finally, we printed the deduplicated result.

When running the above code, the following output will be obtained:

(A,1)
(B,2)
(C,3)

It can be seen that the duplicate data has been removed.