{"id":19802,"date":"2024-03-15T19:28:33","date_gmt":"2024-03-15T19:28:33","guid":{"rendered":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/"},"modified":"2024-03-21T16:34:17","modified_gmt":"2024-03-21T16:34:17","slug":"how-can-flink-achieve-duplicate-data-removal","status":"publish","type":"post","link":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/","title":{"rendered":"How can Flink achieve duplicate data removal?"},"content":{"rendered":"<p>Flink's DataStream API does not provide a distinct function (DataSet#distinct belongs to the legacy DataSet API). Instead, data deduplication can be achieved by partitioning the stream with DataStream#keyBy and using keyed state to drop records whose key has already been seen.<\/p>\n<p>Below is sample code demonstrating this approach.<\/p>\n<pre class=\"post-pre\"><code><span class=\"hljs-keyword\">import<\/span> org.apache.flink.api.common.functions.RichFilterFunction;\r\n<span class=\"hljs-keyword\">import<\/span> org.apache.flink.api.common.state.ValueState;\r\n<span class=\"hljs-keyword\">import<\/span> org.apache.flink.api.common.state.ValueStateDescriptor;\r\n<span class=\"hljs-keyword\">import<\/span> org.apache.flink.api.java.tuple.Tuple2;\r\n<span class=\"hljs-keyword\">import<\/span> org.apache.flink.streaming.api.datastream.DataStream;\r\n<span class=\"hljs-keyword\">import<\/span> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;\r\n\r\n<span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title class_\">DataDeduplicationExample<\/span> {\r\n    <span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-keyword\">static<\/span> <span class=\"hljs-keyword\">void<\/span> <span class=\"hljs-title function_\">main<\/span><span class=\"hljs-params\">(String[] args)<\/span> <span class=\"hljs-keyword\">throws<\/span> Exception {\r\n        <span class=\"hljs-type\">StreamExecutionEnvironment<\/span> <span class=\"hljs-variable\">env<\/span> <span class=\"hljs-operator\">=<\/span> StreamExecutionEnvironment.getExecutionEnvironment();\r\n\r\n        <span class=\"hljs-comment\">\/\/ Create a DataStream that contains duplicate data<\/span>\r\n        DataStream&lt;Tuple2&lt;String, Integer&gt;&gt; input = env.fromElements(\r\n                <span 
class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">Tuple2<\/span>&lt;&gt;(<span class=\"hljs-string\">\"A\"<\/span>, <span class=\"hljs-number\">1<\/span>),\r\n                <span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">Tuple2<\/span>&lt;&gt;(<span class=\"hljs-string\">\"B\"<\/span>, <span class=\"hljs-number\">2<\/span>),\r\n                <span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">Tuple2<\/span>&lt;&gt;(<span class=\"hljs-string\">\"A\"<\/span>, <span class=\"hljs-number\">1<\/span>),\r\n                <span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">Tuple2<\/span>&lt;&gt;(<span class=\"hljs-string\">\"C\"<\/span>, <span class=\"hljs-number\">3<\/span>),\r\n                <span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">Tuple2<\/span>&lt;&gt;(<span class=\"hljs-string\">\"B\"<\/span>, <span class=\"hljs-number\">2<\/span>)\r\n        );\r\n\r\n        <span class=\"hljs-comment\">\/\/ Use the keyBy function to group the data by key, then keep only the first record of each key<\/span>\r\n        DataStream&lt;Tuple2&lt;String, Integer&gt;&gt; deduplicated = input\r\n                .keyBy(value -&gt; value.f0)\r\n                .filter(<span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">FirstSeenFilter<\/span>());\r\n\r\n        deduplicated.print();\r\n\r\n        env.execute(<span class=\"hljs-string\">\"Data Deduplication Example\"<\/span>);\r\n    }\r\n\r\n    <span class=\"hljs-comment\">\/\/ Keyed state records whether the current key has already been emitted<\/span>\r\n    <span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-keyword\">static<\/span> <span class=\"hljs-keyword\">class<\/span> <span class=\"hljs-title class_\">FirstSeenFilter<\/span> <span class=\"hljs-keyword\">extends<\/span> <span class=\"hljs-title class_\">RichFilterFunction<\/span>&lt;Tuple2&lt;String, Integer&gt;&gt; {\r\n        <span class=\"hljs-keyword\">private<\/span> <span class=\"hljs-keyword\">transient<\/span> ValueState&lt;Boolean&gt; seen;\r\n\r\n        <span class=\"hljs-meta\">@Override<\/span>\r\n        <span class=\"hljs-keyword\">public<\/span> <span class=\"hljs-type\">boolean<\/span> <span class=\"hljs-title function_\">filter<\/span><span class=\"hljs-params\">(Tuple2&lt;String, Integer&gt; value)<\/span> <span class=\"hljs-keyword\">throws<\/span> Exception {\r\n            <span class=\"hljs-comment\">\/\/ Obtain the state handle lazily on first use<\/span>\r\n            <span class=\"hljs-keyword\">if<\/span> (seen == <span class=\"hljs-literal\">null<\/span>) {\r\n                seen = getRuntimeContext().getState(<span class=\"hljs-keyword\">new<\/span> <span class=\"hljs-title class_\">ValueStateDescriptor<\/span>&lt;&gt;(<span class=\"hljs-string\">\"seen\"<\/span>, Boolean.class));\r\n            }\r\n            <span class=\"hljs-keyword\">if<\/span> (seen.value() != <span class=\"hljs-literal\">null<\/span>) {\r\n                <span class=\"hljs-keyword\">return<\/span> <span class=\"hljs-literal\">false<\/span>; <span class=\"hljs-comment\">\/\/ duplicate key: drop the record<\/span>\r\n            }\r\n            seen.update(<span class=\"hljs-literal\">true<\/span>);\r\n            <span class=\"hljs-keyword\">return<\/span> <span class=\"hljs-literal\">true<\/span>; <span class=\"hljs-comment\">\/\/ first record for this key: keep it<\/span>\r\n        }\r\n    }\r\n}\r\n<\/code><\/pre>\n<p>In the example code above, we created a DataStream with duplicate data and grouped the data by the first field using the keyBy function. Next, we deduplicated each group with FirstSeenFilter, which keeps a flag in keyed state and passes only the first record of each key. 
Finally, we printed the deduplicated result. Because a flag is kept for every key, the state grows with the number of distinct keys on an unbounded stream.<\/p>\n<p>When the job runs with a parallelism of 1, the following output will be obtained (with a higher parallelism, each line is prefixed with the subtask index and the order may differ):<\/p>\n<pre class=\"post-pre\"><code>(A,1)\r\n(B,2)\r\n(C,3)\r\n<\/code><\/pre>\n<p>It can be seen that the duplicate data has been removed.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Flink's DataStream API does not provide a distinct function (DataSet#distinct belongs to the legacy DataSet API). Instead, data deduplication can be achieved by partitioning the stream with DataStream#keyBy and using keyed state to drop records whose key has already been seen. Below is sample code demonstrating this approach. [&hellip;]<\/p>\n","protected":false},"author":8,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_import_markdown_pro_load_document_selector":0,"_import_markdown_pro_submit_text_textarea":"","footnotes":""},"categories":[1],"tags":[],"class_list":["post-19802","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v21.5 (Yoast SEO v21.5) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>How can Flink achieve duplicate data removal? 
- Blog - Silicon Cloud<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"How can Flink achieve duplicate data removal?\" \/>\n<meta property=\"og:description\" content=\"Flink's DataStream API does not provide a distinct function (DataSet#distinct belongs to the legacy DataSet API). Instead, data deduplication can be achieved by partitioning the stream with DataStream#keyBy and using keyed state to drop records whose key has already been seen. Below is sample code demonstrating this approach. [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog - Silicon Cloud\" \/>\n<meta property=\"article:publisher\" content=\"https:\/\/www.facebook.com\/SiliCloudGlobal\/\" \/>\n<meta property=\"article:published_time\" content=\"2024-03-15T19:28:33+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-03-21T16:34:17+00:00\" \/>\n<meta name=\"author\" content=\"William Carter\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@SiliCloudGlobal\" \/>\n<meta name=\"twitter:site\" content=\"@SiliCloudGlobal\" \/>\n<meta name=\"twitter:label1\" content=\"Written 
by\" \/>\n\t<meta name=\"twitter:data1\" content=\"William Carter\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\"},\"author\":{\"name\":\"William Carter\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/f697031891aacefc4b681d139781d3c0\"},\"headline\":\"How can Flink achieve duplicate data removal?\",\"datePublished\":\"2024-03-15T19:28:33+00:00\",\"dateModified\":\"2024-03-21T16:34:17+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\"},\"wordCount\":96,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#organization\"},\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\",\"url\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\",\"name\":\"How can Flink achieve duplicate data removal? 
- Blog - Silicon Cloud\",\"isPartOf\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#website\"},\"datePublished\":\"2024-03-15T19:28:33+00:00\",\"dateModified\":\"2024-03-21T16:34:17+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.silicloud.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"How can Flink achieve duplicate data removal?\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#website\",\"url\":\"https:\/\/www.silicloud.com\/blog\/\",\"name\":\"Silicon Cloud Blog\",\"description\":\"\",\"publisher\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#organization\"},\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#organization\",\"name\":\"Silicon Cloud Blog\",\"url\":\"https:\/\/www.silicloud.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png\",\"contentUrl\":\"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png\",\"width\":1024,\"height\":1024,\"caption\":\"Silicon Cloud 
Blog\"},\"image\":{\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/www.facebook.com\/SiliCloudGlobal\/\",\"https:\/\/twitter.com\/SiliCloudGlobal\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/f697031891aacefc4b681d139781d3c0\",\"name\":\"William Carter\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/1786698071dd8d74bec894b512f9e3c610c3a2a32985f67e688976cee3c8bbef?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/1786698071dd8d74bec894b512f9e3c610c3a2a32985f67e688976cee3c8bbef?s=96&d=mm&r=g\",\"caption\":\"William Carter\"},\"url\":\"https:\/\/www.silicloud.com\/blog\/author\/williamcarter\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"How can Flink achieve duplicate data removal? - Blog - Silicon Cloud","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/","og_locale":"en_US","og_type":"article","og_title":"How can Flink achieve duplicate data removal?","og_description":"Flink's DataStream API does not provide a distinct function (DataSet#distinct belongs to the legacy DataSet API). 
Instead, data deduplication can be achieved by partitioning the stream with DataStream#keyBy and using keyed state to drop records whose key has already been seen. Below is sample code demonstrating this approach. [&hellip;]","og_url":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/","og_site_name":"Blog - Silicon Cloud","article_publisher":"https:\/\/www.facebook.com\/SiliCloudGlobal\/","article_published_time":"2024-03-15T19:28:33+00:00","article_modified_time":"2024-03-21T16:34:17+00:00","author":"William Carter","twitter_card":"summary_large_image","twitter_creator":"@SiliCloudGlobal","twitter_site":"@SiliCloudGlobal","twitter_misc":{"Written by":"William Carter","Est. 
reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/#article","isPartOf":{"@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/"},"author":{"name":"William Carter","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/f697031891aacefc4b681d139781d3c0"},"headline":"How can Flink achieve duplicate data removal?","datePublished":"2024-03-15T19:28:33+00:00","dateModified":"2024-03-21T16:34:17+00:00","mainEntityOfPage":{"@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/"},"wordCount":96,"commentCount":0,"publisher":{"@id":"https:\/\/www.silicloud.com\/blog\/#organization"},"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/","url":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/","name":"How can Flink achieve duplicate data removal? 
- Blog - Silicon Cloud","isPartOf":{"@id":"https:\/\/www.silicloud.com\/blog\/#website"},"datePublished":"2024-03-15T19:28:33+00:00","dateModified":"2024-03-21T16:34:17+00:00","breadcrumb":{"@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.silicloud.com\/blog\/how-can-flink-achieve-duplicate-data-removal\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.silicloud.com\/blog\/"},{"@type":"ListItem","position":2,"name":"How can Flink achieve duplicate data removal?"}]},{"@type":"WebSite","@id":"https:\/\/www.silicloud.com\/blog\/#website","url":"https:\/\/www.silicloud.com\/blog\/","name":"Silicon Cloud Blog","description":"","publisher":{"@id":"https:\/\/www.silicloud.com\/blog\/#organization"},"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.silicloud.com\/blog\/#organization","name":"Silicon Cloud Blog","url":"https:\/\/www.silicloud.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png","contentUrl":"https:\/\/www.silicloud.com\/blog\/wp-content\/uploads\/2023\/11\/EN-SILICON-Full.png","width":1024,"height":1024,"caption":"Silicon Cloud Blog"},"image":{"@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/www.facebook.com\/SiliCloudGlobal\/","https:\/\/twitter.com\/SiliCloudGlobal"]},{"@type":"Person","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/f697031891aacefc4b681d139781d3c0","name":"William 
Carter","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.silicloud.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/1786698071dd8d74bec894b512f9e3c610c3a2a32985f67e688976cee3c8bbef?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/1786698071dd8d74bec894b512f9e3c610c3a2a32985f67e688976cee3c8bbef?s=96&d=mm&r=g","caption":"William Carter"},"url":"https:\/\/www.silicloud.com\/blog\/author\/williamcarter\/"}]}},"_links":{"self":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts\/19802","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/users\/8"}],"replies":[{"embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/comments?post=19802"}],"version-history":[{"count":1,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts\/19802\/revisions"}],"predecessor-version":[{"id":53563,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/posts\/19802\/revisions\/53563"}],"wp:attachment":[{"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/media?parent=19802"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/categories?post=19802"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.silicloud.com\/blog\/wp-json\/wp\/v2\/tags?post=19802"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}