{"id":28053,"date":"2024-03-16T09:39:12","date_gmt":"2024-03-16T09:39:12","guid":{"rendered":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/"},"modified":"2024-03-22T12:31:28","modified_gmt":"2024-03-22T12:31:28","slug":"how-can-flink-deduplicate-data-in-kafka","status":"publish","type":"post","link":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/","title":{"rendered":"How can Flink deduplicate data in Kafka?"},"content":{"rendered":"<p>In Flink, deduplicating data from Kafka can be achieved using the following methods:<\/p>\n<ol>\n<li>Organize by key<\/li>\n<li>decrease<\/li>\n<li>combine<\/li>\n<li>To bend or crease something to form pleats or layers.<\/li>\n<\/ol>\n<pre class=\"post-pre\"><code>DataStream&lt;MyData&gt; stream = env.addSource(<span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">FlinkKafkaConsumer<\/span>&lt;&gt;(...));\r\n\r\nDataStream&lt;MyData&gt; deduplicatedStream = stream\r\n    .keyBy(data -&gt; data.getId())  <span class=\"hljs-comment\">\/\/ \u6309\u7167 id \u5b57\u6bb5\u8fdb\u884c\u5206\u7ec4<\/span>\r\n    .reduce((data1, data2) -&gt; data1);  <span class=\"hljs-comment\">\/\/ \u4f7f\u7528 reduce \u64cd\u4f5c\u7b26\u5c06\u76f8\u540c id \u7684\u6570\u636e\u53bb\u91cd<\/span>\r\n<\/code><\/pre>\n<ol>\n<li>group by key<\/li>\n<li>Functioning process<\/li>\n<li>Function that can be used to flat map data and process it efficiently, typically used for handling rich data structures.<\/li>\n<\/ol>\n<pre class=\"post-pre\"><code>DataStream&lt;MyData&gt; stream = env.addSource(<span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">FlinkKafkaConsumer<\/span>&lt;&gt;(...));\r\n\r\nDataStream&lt;MyData&gt; deduplicatedStream = stream\r\n    .keyBy(data -&gt; data.getUniqueId())  <span class=\"hljs-comment\">\/\/ \u6309\u7167\u552f\u4e00\u6807\u8bc6\u7b26\u8fdb\u884c\u5206\u7ec4<\/span>\r\n    .process(<span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">DeduplicateFunction<\/span>());  <span class=\"hljs-comment\">\/\/ \u81ea\u5b9a\u4e49 ProcessFunction \u5b9e\u73b0\u53bb\u91cd\u903b\u8f91<\/span>\r\n\r\n<span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-keyword\">static<\/span> <span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title class_\">DeduplicateFunction<\/span> <span class=\"hljs-keyword\">extends<\/span> <span class=\"hljs-title class_\">ProcessFunction<\/span>&lt;MyData, MyData&gt; {\r\n    <span class=\"hljs-keyword\">private<\/span> ValueState&lt;Boolean&gt; seen;\r\n\r\n    <span class=\"hljs-meta\">@Override<\/span>\r\n    <span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-keyword\">void<\/span> <span class=\"hljs-title function_\">open<\/span><span class=\"hljs-params\">(Configuration parameters)<\/span> <span class=\"hljs-keyword\">throws<\/span> Exception {\r\n        seen = getRuntimeContext().getState(<span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">ValueStateDescriptor<\/span>&lt;&gt;(<span class=\"hljs-string\">\"seen\"<\/span>, Boolean.class));\r\n    }\r\n\r\n    <span class=\"hljs-meta\">@Override<\/span>\r\n    <span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-keyword\">void<\/span> <span class=\"hljs-title function_\">processElement<\/span><span class=\"hljs-params\">(MyData data, Context ctx, Collector&lt;MyData&gt; out)<\/span> <span class=\"hljs-keyword\">throws<\/span> Exception {\r\n        <span class=\"hljs-keyword\">if<\/span> (seen.value() == <span class=\"hljs-literal\">null<\/span>) {\r\n            seen.update(<span class=\"hljs-literal\">true<\/span>);\r\n            out.collect(data);\r\n        }\r\n    }\r\n}\r\n<\/code><\/pre>\n<p>It is important to note that the above method can only remove duplicates from adjacent data. If the data volume is large or the data distribution is uneven, it may result in performance issues. If you need to deduplicate data from the entire Kafka, consider using Flink&#8217;s state backend like RocksDB to store processed data identifiers in the state and regularly clean up expired data.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In Flink, deduplicating data from Kafka can be achieved using the following methods: Organize by key decrease combine To bend or crease something to form pleats or layers. DataStream&lt;MyData&gt; stream = env.addSource(new FlinkKafkaConsumer&lt;&gt;(&#8230;)); DataStream&lt;MyData&gt; deduplicatedStream = stream .keyBy(data -&gt; data.getId()) \/\/ \u6309\u7167 id \u5b57\u6bb5\u8fdb\u884c\u5206\u7ec4 .reduce((data1, data2) -&gt; data1); \/\/ \u4f7f\u7528 reduce \u64cd\u4f5c\u7b26\u5c06\u76f8\u540c id \u7684\u6570\u636e\u53bb\u91cd group [&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-28053","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v21.5 (Yoast SEO v21.5) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How can Flink deduplicate data in Kafka? - Blog - Silicon Cloud<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How can Flink deduplicate data in Kafka?\" \/>\n<meta property=\"og:description\" content=\"In Flink, deduplicating data from Kafka can be achieved using the following methods: Organize by key decrease combine To bend or crease something to form pleats or layers. DataStream&lt;MyData&gt; stream = env.addSource(new FlinkKafkaConsumer&lt;&gt;(...)); DataStream&lt;MyData&gt; deduplicatedStream = stream .keyBy(data -&gt; data.getId()) \/\/ \u6309\u7167 id \u5b57\u6bb5\u8fdb\u884c\u5206\u7ec4 .reduce((data1, data2) -&gt; data1); \/\/ \u4f7f\u7528 reduce \u64cd\u4f5c\u7b26\u5c06\u76f8\u540c id \u7684\u6570\u636e\u53bb\u91cd group [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog - Silicon Cloud\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/SiliCloudGlobal\/\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-16T09:39:12+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-03-22T12:31:28+00:00\" \/>\n<meta name=\"author\" content=\"Emily Johnson\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SiliCloudGlobal\" \/>\n<meta name=\"twitter:site\" content=\"@SiliCloudGlobal\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Emily Johnson\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\"},\"author\":{\"name\":\"Emily Johnson\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/3b041b19cffc258705478ecfab895378\"},\"headline\":\"How can Flink deduplicate data in Kafka?\",\"datePublished\":\"2024-03-16T09:39:12+00:00\",\"dateModified\":\"2024-03-22T12:31:28+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\"},\"wordCount\":126,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#organization\"},\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\",\"url\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\",\"name\":\"How can Flink deduplicate data in Kafka? - Blog - Silicon Cloud\",\"isPartOf\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#website\"},\"datePublished\":\"2024-03-16T09:39:12+00:00\",\"dateModified\":\"2024-03-22T12:31:28+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.silicloud.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How can Flink deduplicate data in Kafka?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#website\",\"url\":\"https:\/\/www.silicloud.com\/blog\/\",\"name\":\"Silicon Cloud Blog\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#organization\"},\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#organization\",\"name\":\"Silicon Cloud Blog\",\"url\":\"https:\/\/www.silicloud.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png\",\"contentUrl\":\"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png\",\"width\":1024,\"height\":1024,\"caption\":\"Silicon Cloud Blog\"},\"image\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/SiliCloudGlobal\/\",\"https:\/\/twitter.com\/SiliCloudGlobal\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/3b041b19cffc258705478ecfab895378\",\"name\":\"Emily Johnson\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/a5cb4e73d02ab1d79f2dfe919389ff7c1de072baa97686392031c03d858cc358?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/a5cb4e73d02ab1d79f2dfe919389ff7c1de072baa97686392031c03d858cc358?s=96&d=mm&r=g\",\"caption\":\"Emily Johnson\"},\"url\":\"https:\/\/www.silicloud.com\/blog\/author\/emilyjohnson\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"How can Flink deduplicate data in Kafka? - Blog - Silicon Cloud","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/","og_locale":"en_US","og_type":"article","og_title":"How can Flink deduplicate data in Kafka?","og_description":"In Flink, deduplicating data from Kafka can be achieved using the following methods: Organize by key decrease combine To bend or crease something to form pleats or layers. DataStream&lt;MyData&gt; stream = env.addSource(new FlinkKafkaConsumer&lt;&gt;(...)); DataStream&lt;MyData&gt; deduplicatedStream = stream .keyBy(data -&gt; data.getId()) \/\/ \u6309\u7167 id \u5b57\u6bb5\u8fdb\u884c\u5206\u7ec4 .reduce((data1, data2) -&gt; data1); \/\/ \u4f7f\u7528 reduce \u64cd\u4f5c\u7b26\u5c06\u76f8\u540c id \u7684\u6570\u636e\u53bb\u91cd group [&hellip;]","og_url":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/","og_site_name":"Blog - Silicon Cloud","article_publisher":"https:\/\/www.facebook.com\/SiliCloudGlobal\/","article_published_time":"2024-03-16T09:39:12+00:00","article_modified_time":"2024-03-22T12:31:28+00:00","author":"Emily Johnson","twitter_card":"summary_large_image","twitter_creator":"@SiliCloudGlobal","twitter_site":"@SiliCloudGlobal","twitter_misc":{"Written by":"Emily Johnson","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/#article","isPartOf":{"@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/"},"author":{"name":"Emily Johnson","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/3b041b19cffc258705478ecfab895378"},"headline":"How can Flink deduplicate data in Kafka?","datePublished":"2024-03-16T09:39:12+00:00","dateModified":"2024-03-22T12:31:28+00:00","mainEntityOfPage":{"@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/"},"wordCount":126,"commentCount":0,"publisher":{"@id":"https:\/\/www.silicloud.com\/blog\/#organization"},"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/","url":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/","name":"How can Flink deduplicate data in Kafka? - Blog - Silicon Cloud","isPartOf":{"@id":"https:\/\/www.silicloud.com\/blog\/#website"},"datePublished":"2024-03-16T09:39:12+00:00","dateModified":"2024-03-22T12:31:28+00:00","breadcrumb":{"@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-deduplicate-data-in-kafka\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.silicloud.com\/blog\/"},{"@type":"ListItem","position":2,"name":"How can Flink deduplicate data in Kafka?"}]},{"@type":"WebSite","@id":"https:\/\/www.silicloud.com\/blog\/#website","url":"https:\/\/www.silicloud.com\/blog\/","name":"Silicon Cloud Blog","description":"","publisher":{"@id":"https:\/\/www.silicloud.com\/blog\/#organization"},"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.silicloud.com\/blog\/#organization","name":"Silicon Cloud Blog","url":"https:\/\/www.silicloud.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png","contentUrl":"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png","width":1024,"height":1024,"caption":"Silicon Cloud Blog"},"image":{"@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/SiliCloudGlobal\/","https:\/\/twitter.com\/SiliCloudGlobal"]},{"@type":"Person","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/3b041b19cffc258705478ecfab895378","name":"Emily Johnson","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/a5cb4e73d02ab1d79f2dfe919389ff7c1de072baa97686392031c03d858cc358?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/a5cb4e73d02ab1d79f2dfe919389ff7c1de072baa97686392031c03d858cc358?s=96&d=mm&r=g","caption":"Emily Johnson"},"url":"https:\/\/www.silicloud.com\/blog\/author\/emilyjohnson\/"}]}},"_links":{"self":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts\/28053","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/comments?post=28053"}],"version-history":[{"count":1,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts\/28053\/revisions"}],"predecessor-version":[{"id":62326,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts\/28053\/revisions\/62326"}],"wp:attachment":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/media?parent=28053"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/categories?post=28053"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/tags?post=28053"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}