How to read Kafka data for statistics with Flink?

In Flink, we can use FlinkKafkaConsumer to read data from Kafka for analysis. The following example code demonstrates how to use FlinkKafkaConsumer and KeyedStream to count the occurrences of each key in the Kafka data.

First, add the required dependencies to your project:

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>
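The `${flink.version}` placeholder assumes a Maven property defined in your POM. As a hedged example (the version shown is illustrative; use whichever release matches your cluster):

```xml
<properties>
    <!-- Illustrative version; pick the release that matches your cluster -->
    <flink.version>1.14.6</flink.version>
</properties>
```

Note that the `_2.11` suffix in the artifact IDs refers to the Scala binary version the connector was built against; choose the suffix that matches your setup.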

Next, you can use the following code to read Kafka data for analysis:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import java.util.Properties;

public class KafkaDataStatistics {

    public static void main(String[] args) throws Exception {
        // Set up the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Configure the Kafka connection
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("group.id", "flink-consumer");

        // Read data from Kafka
        FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<>("topic", new SimpleStringSchema(), props);
        DataStream<String> kafkaStream = env.addSource(kafkaConsumer);

        // Count occurrences per key
        DataStream<Tuple2<String, Integer>> result = kafkaStream
                .keyBy(value -> value) // group by key
                .process(new CountProcessFunction());

        // Print the results
        result.print();

        // Execute the job
        env.execute("Kafka Data Statistics");
    }

    // Custom KeyedProcessFunction that keeps a running count per key
    public static class CountProcessFunction extends KeyedProcessFunction<String, String, Tuple2<String, Integer>> {
        private ValueState<Integer> countState;

        @Override
        public void open(Configuration parameters) throws Exception {
            ValueStateDescriptor<Integer> countDescriptor = new ValueStateDescriptor<>("count", Integer.class);
            countState = getRuntimeContext().getState(countDescriptor);
        }

        @Override
        public void processElement(String value, Context ctx, Collector<Tuple2<String, Integer>> out) throws Exception {
            Integer count = countState.value();
            if (count == null) {
                count = 0;
            }
            count++;
            countState.update(count);
            out.collect(new Tuple2<>(ctx.getCurrentKey(), count));
        }
    }
}

In the code above, FlinkKafkaConsumer reads data from Kafka and exposes it as a DataStream. The stream is then partitioned by key with keyBy(), and a custom KeyedProcessFunction keeps a running count for each key in ValueState, emitting an updated (key, count) pair for every record. Finally, the results are printed with print() and the job is started with env.execute().
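To make the stateful counting semantics concrete, here is a minimal, Flink-free sketch of the same per-key logic: each incoming record increments a counter for its key, and the updated count is returned. The class and method names here are illustrative only, not part of any Flink API.

```java
import java.util.HashMap;
import java.util.Map;

public class PerKeyCounter {
    // Plays the role of the per-key ValueState in the Flink job
    private final Map<String, Integer> counts = new HashMap<>();

    // Mirrors processElement(): bump the counter for this key, return the new count
    public int process(String key) {
        int next = counts.getOrDefault(key, 0) + 1;
        counts.put(key, next);
        return next;
    }

    public static void main(String[] args) {
        PerKeyCounter counter = new PerKeyCounter();
        for (String word : new String[] {"a", "b", "a", "a"}) {
            System.out.println(word + " -> " + counter.process(word));
        }
    }
}
```

The difference in the real job is that Flink manages this state per key, fault-tolerantly, across parallel instances, which is why the example uses ValueState instead of a plain map.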

Please note that the code above is only a simple counting illustration; you may need to adjust it for your own data format and statistical logic.
