Apache Beam Timestamps & Watermarks Guide
In Apache Beam, you can use Timestamps and Watermarks provided by the Apache Beam SDK to control the temporal properties of data. Timestamps are used to specify the timestamp of data elements, while Watermarks are used to control the progress of the data stream.
To control the temporal properties of data, you can use the ParDo function in the data processing pipeline to specify the timestamp of data elements. For example, you can use the WithTimestamps function to set the timestamp for data elements.
PCollection<MyData> myData = ... // 获取数据集
PCollection<MyData> timestampedData = myData.apply(ParDo.of(new DoFn<MyData, MyData>() {
@ProcessElement
public void processElement(ProcessContext c) {
MyData data = c.element();
Instant timestamp = ... // 指定时间戳
c.outputWithTimestamp(data, timestamp);
}
}));
After specifying the timestamp of the data elements, you can also use Window operators to allocate data into windows in order to control the time attributes of the data stream. For example, you can use the FixedWindows function to allocate data elements into fixed size time windows.
PCollection<MyData> windowedData = timestampedData.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));
Finally, Watermarks can be used to control the progress of data streams. They indicate the current progress of the data stream, and Apache Beam will use Watermarks to control the processing and triggering of data. The generation logic of Watermarks can be specified by setting the WatermarkEvaluator function.
PCollection<MyData> input = ... // 输入数据集
PCollection<MyData> output = input.apply(WithTimestamps.of(new MyTimestampFunction()))
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(1))));
PTransform<PCollection<MyData>, PCollection<MyResult>> transform = ... // 定义数据处理转换
PCollection<MyResult> finalOutput = output.apply(transform);
pipeline.run();
By using the above methods, it is possible to flexibly control the time attributes of data in Apache Beam, enabling more precise data processing and windowing operations.