{"id":36828,"date":"2023-10-01T05:01:37","date_gmt":"2023-04-21T12:35:38","guid":{"rendered":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/"},"modified":"2024-05-04T02:18:34","modified_gmt":"2024-05-03T18:18:34","slug":"%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82","status":"publish","type":"post","link":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/","title":{"rendered":"I Want to Try Apache Spark Now"},"content":{"rendered":"<p>It is a little late in the day, but I had never tried Apache Spark, so I decided to give it a go.<br \/>\nI want to understand it and try doing something with its machine learning library, so that I can study it in more depth.<br \/>\nThese are just my study notes.<\/p>\n<p>For the MLlib part, I also consulted a separate article.<\/p>\n<div><img decoding=\"async\" class=\"post-images\" title=\"\" src=\"https:\/\/cdn.silicloud.com\/blog-img\/blog\/img\/657d2b2537434c4406c498d8\/2-0.png\" alt=\"Screenshot 2016-07-24 1.01.06.png\" \/><\/div>\n<h1>Installation and setup<\/h1>\n<ul class=\"post-ul\">\n<li style=\"list-style-type: none;\">\n<ul class=\"post-ul\">Java 1.8<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul class=\"post-ul\">\n<li style=\"list-style-type: none;\">\n<ul class=\"post-ul\">Mac<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<ul class=\"post-ul\">Apache Spark 1.6.x<\/ul>\n<ol>\n<li style=\"list-style-type: 
none;\">\n<ol>Download Spark from the official download page.<\/ol>\n<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<ol>Extract the archive, set SPARK_HOME, and add its bin directory to PATH.<\/ol>\n<pre class=\"post-pre\"><code>&gt; tar xfv spark-1.6.1-bin-hadoop2.6.tgz\r\n\r\n&gt; mv spark-1.6.1-bin-hadoop2.6 spark-1.6.1\r\n\r\n<\/code><\/pre>\n<pre class=\"post-pre\"><code><span class=\"nb\">export <\/span><span class=\"nv\">SPARK_HOME<\/span><span class=\"o\">=<\/span>\/usr\/local\/project\/apache-spark\/spark-1.6.1\r\n<span class=\"nb\">export <\/span><span class=\"nv\">PATH<\/span><span class=\"o\">=<\/span><span class=\"nv\">$PATH<\/span>:<span class=\"nv\">$SPARK_HOME<\/span>\/bin\r\n<\/code><\/pre>\n<p>Run the following command to reload .bashrc:<br \/>\nsource ~\/.bashrc<\/p>\n<ol>Check that the PATH is set correctly.<\/ol>\n<pre class=\"post-pre\"><code>&gt; spark-shell\r\n\r\nUsing Spark's repl log4j profile: org\/apache\/spark\/log4j-defaults-repl.properties\r\nTo adjust logging level use sc.setLogLevel(\"INFO\")\r\nWelcome to\r\n      ____              __\r\n     \/ __\/__  ___ _____\/ \/__\r\n    _\\ \\\/ _ \\\/ _ `\/ __\/  '_\/\r\n   \/___\/ .__\/\\_,_\/_\/ \/_\/\\_\\   version 1.6.1\r\n      \/_\/\r\n\r\nUsing Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)\r\nType in expressions to have them evaluated.\r\nType :help for more information.\r\nSpark context available as sc.\r\n16\/07\/23 23:53:02 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)\r\n16\/07\/23 23:53:03 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)\r\n16\/07\/23 23:53:09 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0\r\n16\/07\/23 23:53:09 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException\r\n16\/07\/23 23:53:13 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)\r\n16\/07\/23 23:53:14 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)\r\n16\/07\/23 23:53:20 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0\r\n16\/07\/23 23:53:20 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException\r\nSQL context available as sqlContext.\r\n\r\nscala&gt;\r\n<\/code><\/pre>\n<p>If the shell starts up like this, the setup is done for now.<\/p>\n<h1>A quick try<\/h1>\n<p>A good way to start is simply to work through the Quick Start page.<\/p>\n<pre class=\"post-pre\"><code>&gt; pyspark\r\n\r\n&gt; textFile = sc.textFile(\"README.md\")\r\n\r\n&gt; textFile.count()\r\n\r\n&gt; textFile.first()\r\n\r\n<\/code><\/pre>\n<h1>More from the Quick Start<\/h1>\n<p>This is also from the Quick Start; I tried it out.<\/p>\n<pre class=\"post-pre\"><code>&gt;&gt;&gt; def max(a, b):\r\n...   if a &gt; b:\r\n...     return a\r\n...   else:\r\n...     
return b\r\n...\r\n&gt;&gt;&gt; textFile.map(lambda line: len(line.split())).reduce(max)\r\n&gt;&gt;&gt; wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)\r\n&gt;&gt;&gt; wordCounts.collect()\r\n16\/07\/25 12:58:36 INFO SparkContext: Starting job: collect at &lt;stdin&gt;:1\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Registering RDD 9 (reduceByKey at &lt;stdin&gt;:1)\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Got job 5 (collect at &lt;stdin&gt;:1) with 2 output partitions\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Final stage: ResultStage 6 (collect at &lt;stdin&gt;:1)\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 5)\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 5)\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Submitting ShuffleMapStage 5 (PairwiseRDD[9] at reduceByKey at &lt;stdin&gt;:1), which has no missing parents\r\n16\/07\/25 12:58:36 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 8.2 KB, free 195.2 KB)\r\n16\/07\/25 12:58:36 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.2 KB, free 200.4 KB)\r\n16\/07\/25 12:58:36 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on localhost:54829 (size: 5.2 KB, free: 511.1 MB)\r\n16\/07\/25 12:58:36 INFO SparkContext: Created broadcast 6 from broadcast at DAGScheduler.scala:1006\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 5 (PairwiseRDD[9] at reduceByKey at &lt;stdin&gt;:1)\r\n16\/07\/25 12:58:36 INFO TaskSchedulerImpl: Adding task set 5.0 with 2 tasks\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID 9, localhost, partition 0,PROCESS_LOCAL, 2149 bytes)\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID 10, localhost, partition 1,PROCESS_LOCAL, 2149 bytes)\r\n16\/07\/25 12:58:36 INFO Executor: Running task 0.0 in stage 5.0 (TID 9)\r\n16\/07\/25 12:58:36 INFO Executor: Running task 1.0 in stage 5.0 (TID 10)\r\n16\/07\/25 12:58:36 INFO 
HadoopRDD: Input split: file:\/usr\/local\/project\/apache-spark\/spark-1.6.2\/README.md:1679+1680\r\n16\/07\/25 12:58:36 INFO HadoopRDD: Input split: file:\/usr\/local\/project\/apache-spark\/spark-1.6.2\/README.md:0+1679\r\n\/usr\/local\/project\/apache-spark\/spark-1.6.2\/python\/lib\/pyspark.zip\/pyspark\/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling\r\n\/usr\/local\/project\/apache-spark\/spark-1.6.2\/python\/lib\/pyspark.zip\/pyspark\/shuffle.py:58: UserWarning: Please install psutil to have better support with spilling\r\n16\/07\/25 12:58:36 INFO PythonRunner: Times: total = 70, boot = 12, init = 10, finish = 48\r\n16\/07\/25 12:58:36 INFO PythonRunner: Times: total = 65, boot = 6, init = 10, finish = 49\r\n16\/07\/25 12:58:36 INFO Executor: Finished task 0.0 in stage 5.0 (TID 9). 2318 bytes result sent to driver\r\n16\/07\/25 12:58:36 INFO Executor: Finished task 1.0 in stage 5.0 (TID 10). 2318 bytes result sent to driver\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID 9) in 150 ms on localhost (1\/2)\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID 10) in 146 ms on localhost (2\/2)\r\n16\/07\/25 12:58:36 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks have all completed, from pool\r\n16\/07\/25 12:58:36 INFO DAGScheduler: ShuffleMapStage 5 (reduceByKey at &lt;stdin&gt;:1) finished in 0.154 s\r\n16\/07\/25 12:58:36 INFO DAGScheduler: looking for newly runnable stages\r\n16\/07\/25 12:58:36 INFO DAGScheduler: running: Set()\r\n16\/07\/25 12:58:36 INFO DAGScheduler: waiting: Set(ResultStage 6)\r\n16\/07\/25 12:58:36 INFO DAGScheduler: failed: Set()\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Submitting ResultStage 6 (PythonRDD[12] at collect at &lt;stdin&gt;:1), which has no missing parents\r\n16\/07\/25 12:58:36 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 5.1 KB, free 205.4 KB)\r\n16\/07\/25 12:58:36 INFO 
MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 3.2 KB, free 208.6 KB)\r\n16\/07\/25 12:58:36 INFO BlockManagerInfo: Added broadcast_7_piece0 in memory on localhost:54829 (size: 3.2 KB, free: 511.1 MB)\r\n16\/07\/25 12:58:36 INFO SparkContext: Created broadcast 7 from broadcast at DAGScheduler.scala:1006\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 6 (PythonRDD[12] at collect at &lt;stdin&gt;:1)\r\n16\/07\/25 12:58:36 INFO TaskSchedulerImpl: Adding task set 6.0 with 2 tasks\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Starting task 0.0 in stage 6.0 (TID 11, localhost, partition 0,NODE_LOCAL, 1894 bytes)\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Starting task 1.0 in stage 6.0 (TID 12, localhost, partition 1,NODE_LOCAL, 1894 bytes)\r\n16\/07\/25 12:58:36 INFO Executor: Running task 0.0 in stage 6.0 (TID 11)\r\n16\/07\/25 12:58:36 INFO Executor: Running task 1.0 in stage 6.0 (TID 12)\r\n16\/07\/25 12:58:36 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks\r\n16\/07\/25 12:58:36 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks\r\n16\/07\/25 12:58:36 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms\r\n16\/07\/25 12:58:36 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms\r\n16\/07\/25 12:58:36 INFO PythonRunner: Times: total = 10, boot = -97, init = 106, finish = 1\r\n16\/07\/25 12:58:36 INFO Executor: Finished task 1.0 in stage 6.0 (TID 12). 3633 bytes result sent to driver\r\n16\/07\/25 12:58:36 INFO PythonRunner: Times: total = 15, boot = -101, init = 116, finish = 0\r\n16\/07\/25 12:58:36 INFO Executor: Finished task 0.0 in stage 6.0 (TID 11). 
3853 bytes result sent to driver\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Finished task 1.0 in stage 6.0 (TID 12) in 61 ms on localhost (1\/2)\r\n16\/07\/25 12:58:36 INFO TaskSetManager: Finished task 0.0 in stage 6.0 (TID 11) in 64 ms on localhost (2\/2)\r\n16\/07\/25 12:58:36 INFO TaskSchedulerImpl: Removed TaskSet 6.0, whose tasks have all completed, from pool\r\n16\/07\/25 12:58:36 INFO DAGScheduler: ResultStage 6 (collect at &lt;stdin&gt;:1) finished in 0.070 s\r\n16\/07\/25 12:58:36 INFO DAGScheduler: Job 5 finished: collect at &lt;stdin&gt;:1, took 0.294915 s\r\n[(u'when', 1), (u'R,', 1), (u'including', 3), (u'computation', 1), (u'using:', 1), (u'guidance', 2), (u'Scala,', 1), (u'environment', 1), (u'only', 1), (u'rich', 1), (u'Apache', 1), (u'sc.parallelize(range(1000)).count()', 1), (u'Building', 1), (u'guide,', 1), (u'return', 2), (u'Please', 3), (u'Try', 1), (u'not', 1), (u'Spark', 13), (u'scala&gt;', 1), (u'Note', 1), (u'cluster.', 1), (u'.\/bin\/pyspark', 1), (u'params', 1), (u'through', 1), (u'GraphX', 1), (u'[run', 1), (u'abbreviated', 1), (u'[project', 2), (u'##', 8), (u'library', 1), (u'see', 1), (u'\"local\"', 1), (u'[Apache', 1), (u'will', 1), (u'#', 1), (u'processing,', 1), (u'for', 11), (u'[building', 1), (u'provides', 1), (u'print', 1), (u'supports', 2), (u'built,', 1), (u'[params]`.', 1), (u'available', 1), (u'run', 7), (u'tests](https:\/\/cwiki.apache.org\/confluence\/display\/SPARK\/Useful+Developer+Tools).', 1), (u'This', 2), (u'Hadoop,', 2), (u'Tests', 1), (u'example:', 1), (u'-DskipTests', 1), (u'Maven](http:\/\/maven.apache.org\/).', 1), (u'programming', 1), (u'running', 1), (u'against', 1), (u'site,', 1), (u'comes', 1), (u'package.', 1), (u'and', 10), (u'package.)', 1), (u'prefer', 1), (u'documentation,', 1), (u'submit', 1), (u'tools', 1), (u'use', 3), (u'from', 1), (u'For', 2), (u'.\/bin\/run-example', 2), (u'fast', 1), (u'systems.', 1), (u'&lt;http:\/\/spark.apache.org\/&gt;', 1), (u'Hadoop-supported', 1), (u'way', 1), (u'README', 
1), (u'MASTER', 1), (u'engine', 1), (u'building', 2), (u'usage', 1), (u'instance:', 1), (u'with', 3), (u'protocols', 1), (u'And', 1), (u'this', 1), (u'setup', 1), (u'shell:', 2), (u'project', 1), (u'following', 2), (u'distribution', 1), (u'detailed', 2), (u'have', 1), (u'stream', 1), (u'is', 6), (u'higher-level', 1), (u'tests', 2), (u'1000:', 2), (u'sample', 1), (u'[\"Specifying', 1), (u'Alternatively,', 1), (u'file', 1), (u'need', 1), (u'You', 3), (u'instructions.', 1), (u'different', 1), (u'programs,', 1), (u'storage', 1), (u'same', 1), (u'machine', 1), (u'Running', 1), (u'which', 2), (u'you', 4), (u'A', 1), (u'About', 1), (u'sc.parallelize(1', 1), (u'locally.', 1), (u'Hive', 2), (u'optimized', 1), (u'uses', 1), (u'Version\"](http:\/\/spark.apache.org\/docs\/latest\/building-spark.html#specifying-the-hadoop-version)', 1), (u'variable', 1), (u'The', 1), (u'data', 1), (u'a', 8), (u'\"yarn\"', 1), (u'Thriftserver', 1), (u'processing.', 1), (u'.\/bin\/spark-shell', 1), (u'Python', 2), (u'Spark](#building-spark).', 1), (u'clean', 1), (u'the', 21), (u'requires', 1), (u'talk', 1), (u'help', 1), (u'Hadoop', 3), (u'high-level', 1), (u'find', 1), (u'web', 1), (u'Shell', 2), (u'how', 2), (u'graph', 1), (u'run:', 1), (u'should', 2), (u'to', 14), (u'module,', 1), (u'given.', 1), (u'directory.', 1), (u'must', 1), (u'SparkPi', 2), (u'do', 2), (u'Programs', 1), (u'Many', 1), (u'YARN,', 1), (u'using', 2), (u'Example', 1), (u'Once', 1), (u'HDFS', 1), (u'Because', 1), (u'name', 1), (u'Testing', 1), (u'refer', 2), (u'Streaming', 1), (u'SQL', 2), (u'them,', 1), (u'analysis.', 1), (u'set', 2), (u'Scala', 2), (u'thread,', 1), (u'individual', 1), (u'examples', 2), (u'changed', 1), (u'runs.', 1), (u'Pi', 1), (u'More', 1), (u'Python,', 2), (u'Versions', 1), (u'its', 1), (u'version', 1), (u'wiki](https:\/\/cwiki.apache.org\/confluence\/display\/SPARK).', 1), (u'`.\/bin\/run-example', 1), (u'Configuration', 1), (u'command,', 2), (u'can', 6), (u'core', 1), 
(u'Guide](http:\/\/spark.apache.org\/docs\/latest\/configuration.html)', 1), (u'MASTER=spark:\/\/host:7077', 1), (u'Documentation', 1), (u'downloaded', 1), (u'distributions.', 1), (u'Spark.', 1), (u'Spark\"](http:\/\/spark.apache.org\/docs\/latest\/building-spark.html).', 1), (u'[\"Building', 1), (u'`examples`', 2), (u'on', 5), (u'package', 1), (u'of', 5), (u'APIs', 1), (u'pre-built', 1), (u'Big', 1), (u'or', 3), (u'learning,', 1), (u'locally', 2), (u'overview', 1), (u'one', 2), (u'(You', 1), (u'Online', 1), (u'versions', 1), (u'your', 1), (u'threads.', 1), (u'&gt;&gt;&gt;', 1), (u'spark:\/\/', 1), (u'contains', 1), (u'system', 1), (u'start', 1), (u'build\/mvn', 1), (u'basic', 1), (u'configure', 1), (u'that', 2), (u'N', 1), (u'\"local[N]\"', 1), (u'DataFrames,', 1), (u'particular', 2), (u'be', 2), (u'an', 3), (u'easiest', 1), (u'Interactive', 2), (u'cluster', 2), (u'page](http:\/\/spark.apache.org\/documentation.html)', 1), (u'&lt;class&gt;', 1), (u'example', 3), (u'are', 1), (u'Data.', 1), (u'mesos:\/\/', 1), (u'computing', 1), (u'URL,', 1), (u'in', 5), (u'general', 2), (u'To', 2), (u'at', 2), (u'1000).count()', 1), (u'if', 4), (u'built', 1), (u'no', 1), (u'Java,', 1), (u'MLlib', 1), (u'also', 4), (u'other', 1), (u'build', 3), (u'online', 1), (u'several', 1), (u'[Configuration', 1), (u'class', 2), (u'programs', 2), (u'documentation', 3), (u'It', 2), (u'graphs', 1), (u'.\/dev\/run-tests', 1), (u'first', 1), (u'latest', 1)]\r\n&gt;&gt;&gt;\r\n<\/code><\/pre>\n<h1>Working through the Spark Programming Guide<\/h1>\n<h2>Getting started<\/h2>\n<pre class=\"post-pre\"><code>rdd = sc.parallelize(range(1, 10)).map(lambda x: (x, 'a' * x))\r\nrdd.saveAsSequenceFile('~\/hoge.txt')\r\nsorted(sc.sequenceFile('~\/hoge.txt').collect())\r\n[(1, u'a'), (2, u'aa'), (3, u'aaa'), (4, u'aaaa'), (5, u'aaaaa'), (6, u'aaaaaa'), (7, u'aaaaaaa'), (8, u'aaaaaaaa'), (9, 
u'aaaaaaaaa')]\r\n<\/code><\/pre>\n<h2>Machine learning library<\/h2>\n<p>I followed the MLlib documentation as I went.<\/p>\n<p>Collaborative filtering (RDD-based API) is often mentioned in the context of recommendation engines, so let's try it.<\/p>\n<p>The test data is available <a href=\"https:\/\/github.com\/apache\/spark\/blob\/master\/data\/mllib\/als\/test.data\">here<\/a>.<\/p>\n<p>Finally, call predict on the trained model. The test data contains the entry 3,2,5.0 (user 3, product 2, rating 5.0), so the prediction should come out close to 5.<\/p>\n<pre class=\"post-pre\"><code>from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating\r\ndata = sc.textFile(\"data\/mllib\/als\/test.data\")\r\nratings = data.map(lambda l: l.split(',')).map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))\r\nrank = 10\r\nnumIterations = 10\r\nmodel = ALS.train(ratings, rank, numIterations)\r\nmodel.predict(3, 2)\r\n\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\u30fb\r\n\u30fb\u30fb\u30fb\u30fb\r\n4.996948080474724\r\n<\/code><\/pre>\n<p>The result is 4.99, which is almost 5, so overall I think it looks good.<\/p>\n<p>Examples outside MLlib will be covered in detail in a separate article.<br \/>\nThat is all for now.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It is a little late in the day, but I had never tried Apache Spark, so I decided to give it a go. I want to understand it and try using the machine 
[&hellip;]<\/p>\n","protected":false},"author":5,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-36828","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v21.5 (Yoast SEO v21.5) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark - Blog - Silicon Cloud<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.silicloud.com\/zh\/blog\/\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528apache-spark\u3002\/\" \/>\n<meta property=\"og:locale\" content=\"zh_CN\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark\" \/>\n<meta property=\"og:description\" content=\"\u867d\u7136\u73b0\u5728\u6709\u70b9\u665a\u4e86\uff0c\u4f46\u6211\u4e00\u76f4\u6ca1\u6709\u53bb\u5c1d\u8bd5\u8fc7Apache Spark\uff0c\u6240\u4ee5\u6211\u51b3\u5b9a\u8bd5\u4e00\u8bd5\u3002 \u6211\u60f3\u8981\u7406\u89e3\u5b83\uff0c\u5e76\u5c1d\u8bd5\u4f7f\u7528\u673a [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.silicloud.com\/zh\/blog\/\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528apache-spark\u3002\/\" \/>\n<meta property=\"og:site_name\" content=\"Blog - Silicon Cloud\" \/>\n<meta property=\"article:published_time\" content=\"2023-04-21T12:35:38+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-05-03T18:18:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/cdn.silicloud.com\/blog-img\/blog\/img\/657d2b2537434c4406c498d8\/2-0.png\" \/>\n<meta name=\"author\" content=\"\u6e05, \u5b87\" \/>\n<meta name=\"twitter:card\" 
content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"\u4f5c\u8005\" \/>\n\t<meta name=\"twitter:data1\" content=\"\u6e05, \u5b87\" \/>\n\t<meta name=\"twitter:label2\" content=\"\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 \u5206\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/\",\"url\":\"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/\",\"name\":\"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark - Blog - Silicon Cloud\",\"isPartOf\":{\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/#website\"},\"datePublished\":\"2023-04-21T12:35:38+00:00\",\"dateModified\":\"2024-05-03T18:18:34+00:00\",\"author\":{\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/#\/schema\/person\/1a6ecd3d914d22a5ac32791ffc1fbd8e\"},\"breadcrumb\":{\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/#breadcrumb\"},\"inLanguage\":\"zh-Hans\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"\u9996\u9875\",\"item\":\"https:\/\/www.silicloud.com\/zh\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache 
Spark\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/#website\",\"url\":\"https:\/\/www.silicloud.com\/zh\/blog\/\",\"name\":\"Blog - Silicon Cloud\",\"description\":\"\",\"inLanguage\":\"zh-Hans\"},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/#\/schema\/person\/1a6ecd3d914d22a5ac32791ffc1fbd8e\",\"name\":\"\u6e05, \u5b87\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/4b2016c18459a605fc469c7566608f5686491baa112d0871ee613f61b7210565?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/4b2016c18459a605fc469c7566608f5686491baa112d0871ee613f61b7210565?s=96&d=mm&r=g\",\"caption\":\"\u6e05, \u5b87\"},\"url\":\"https:\/\/www.silicloud.com\/zh\/blog\/author\/qingyu\/\"},{\"@type\":\"ImageObject\",\"inLanguage\":\"zh-Hans\",\"@id\":\"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/#local-main-organization-logo\",\"url\":\"\",\"contentUrl\":\"\",\"caption\":\"Blog - Silicon Cloud\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. 
-->","yoast_head_json":{"title":"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark - Blog - Silicon Cloud","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.silicloud.com\/zh\/blog\/\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528apache-spark\u3002\/","og_locale":"zh_CN","og_type":"article","og_title":"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark","og_description":"\u867d\u7136\u73b0\u5728\u6709\u70b9\u665a\u4e86\uff0c\u4f46\u6211\u4e00\u76f4\u6ca1\u6709\u53bb\u5c1d\u8bd5\u8fc7Apache Spark\uff0c\u6240\u4ee5\u6211\u51b3\u5b9a\u8bd5\u4e00\u8bd5\u3002 \u6211\u60f3\u8981\u7406\u89e3\u5b83\uff0c\u5e76\u5c1d\u8bd5\u4f7f\u7528\u673a [&hellip;]","og_url":"https:\/\/www.silicloud.com\/zh\/blog\/\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528apache-spark\u3002\/","og_site_name":"Blog - Silicon Cloud","article_published_time":"2023-04-21T12:35:38+00:00","article_modified_time":"2024-05-03T18:18:34+00:00","og_image":[{"url":"https:\/\/cdn.silicloud.com\/blog-img\/blog\/img\/657d2b2537434c4406c498d8\/2-0.png"}],"author":"\u6e05, \u5b87","twitter_card":"summary_large_image","twitter_misc":{"\u4f5c\u8005":"\u6e05, \u5b87","\u9884\u8ba1\u9605\u8bfb\u65f6\u95f4":"7 \u5206"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/","url":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/","name":"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark - Blog - Silicon 
Cloud","isPartOf":{"@id":"https:\/\/www.silicloud.com\/zh\/blog\/#website"},"datePublished":"2023-04-21T12:35:38+00:00","dateModified":"2024-05-03T18:18:34+00:00","author":{"@id":"https:\/\/www.silicloud.com\/zh\/blog\/#\/schema\/person\/1a6ecd3d914d22a5ac32791ffc1fbd8e"},"breadcrumb":{"@id":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/#breadcrumb"},"inLanguage":"zh-Hans","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"\u9996\u9875","item":"https:\/\/www.silicloud.com\/zh\/blog\/"},{"@type":"ListItem","position":2,"name":"\u6211\u73b0\u5728\u60f3\u5c1d\u8bd5\u4f7f\u7528Apache Spark"}]},{"@type":"WebSite","@id":"https:\/\/www.silicloud.com\/zh\/blog\/#website","url":"https:\/\/www.silicloud.com\/zh\/blog\/","name":"Blog - Silicon Cloud","description":"","inLanguage":"zh-Hans"},{"@type":"Person","@id":"https:\/\/www.silicloud.com\/zh\/blog\/#\/schema\/person\/1a6ecd3d914d22a5ac32791ffc1fbd8e","name":"\u6e05, \u5b87","image":{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/www.silicloud.com\/zh\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/4b2016c18459a605fc469c7566608f5686491baa112d0871ee613f61b7210565?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/4b2016c18459a605fc469c7566608f5686491baa112d0871ee613f61b7210565?s=96&d=mm&r=g","caption":"\u6e05, 
\u5b87"},"url":"https:\/\/www.silicloud.com\/zh\/blog\/author\/qingyu\/"},{"@type":"ImageObject","inLanguage":"zh-Hans","@id":"https:\/\/www.silicloud.com\/zh\/blog\/%e6%88%91%e7%8e%b0%e5%9c%a8%e6%83%b3%e5%b0%9d%e8%af%95%e4%bd%bf%e7%94%a8apache-spark%e3%80%82\/#local-main-organization-logo","url":"","contentUrl":"","caption":"Blog - Silicon Cloud"}]}},"_links":{"self":[{"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/posts\/36828","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/comments?post=36828"}],"version-history":[{"count":2,"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/posts\/36828\/revisions"}],"predecessor-version":[{"id":95517,"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/posts\/36828\/revisions\/95517"}],"wp:attachment":[{"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/media?parent=36828"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/categories?post=36828"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.silicloud.com\/zh\/blog\/wp-json\/wp\/v2\/tags?post=36828"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}