Trying the Apache Kudu Quickstart VM

Since Apache Kudu (http://kudu.apache.org/) has graduated to a top-level Apache project (TLP), I decided to try it out using the latest Quickstart VM.

What is Apache Kudu?

Apache Kudu is a brand-new open-source storage engine, developed from scratch. It is positioned within the Hadoop ecosystem as a complement to HDFS and HBase. Kudu makes it possible to query frequently added and updated data quickly; like an ordinary database, it supports inserting, updating, and deleting data in real time.

An excerpt from the Kudu blog: http://kudu.apache.org/2016/07/25/asf-graduation.html

Optimized for fast scans, Kudu is particularly well suited to hosting time-series data and various kinds of operational data. In addition to its impressive scan speed, Kudu supports many operations familiar from traditional databases, including real-time insert, update, and delete operations. Kudu embraces a "bring your own SQL" philosophy and supports access from multiple query engines, including other Apache projects such as Drill, Spark, and Impala (incubating).

According to the documentation (http://kudu.apache.org/overview.html#architecture), Kudu:
– uses an extremely fast columnar storage engine
– is distributed and fault tolerant
– is ready for next-generation storage technologies such as Intel 3D XPoint.

Kudu is not meant to replace HDFS or HBase but to complement them. For scanning large volumes of data, HDFS is faster; for random reads and writes, HBase is faster. Neither, however, is well suited to fast analytics on frequently updated data.

Like Apache Parquet, Kudu stores data in a columnar layout. Unlike HBase, each column has its own type, so the data can be encoded and compressed efficiently. When an analysis only needs some of the columns, there is no need to read the rest, so disk I/O is reduced and queries run faster. Kudu thus combines support for inserts, updates, and deletes with an efficient columnar storage engine optimized for column access.
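As a loose illustration of why a columnar layout helps (toy data, not Kudu's actual on-disk format): storing each column contiguously means a scan that needs one column reads only that column's values, and a same-typed column encodes compactly.

```python
# Toy row-store vs. column-store comparison (illustrative only;
# this is NOT Kudu's real storage format).
rows = [
    (1, "UA", 120),
    (2, "AA", 95),
    (3, "UA", 130),
]

# Row layout: values of mixed types interleaved record by record.
row_store = [value for row in rows for value in row]

# Columnar layout: each column stored contiguously with a single type.
ids, airlines, counts = (list(col) for col in zip(*rows))

# A query that only needs passenger counts touches just one list,
# instead of skipping over every other field of every record.
total = sum(counts)

# Same-typed, often-repetitive columns compress well, e.g. with
# dictionary encoding:
dictionary = sorted(set(airlines))                  # ['AA', 'UA']
encoded = [dictionary.index(a) for a in airlines]   # [1, 0, 1]
```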

References
– Kudu: a new storage engine for Hadoop that complements HDFS and HBase
– Overview of Kudu information

Quickstart VM

You can get a Quickstart VM for VirtualBox from the official Kudu website. It is not a distributed environment, but it is a good option if you want to try Kudu quickly.
http://kudu.apache.org/docs/quickstart.html

Environment

    • MacBook Pro (OS X El Capitan)
    • VirtualBox 5.0.10

Running the bootstrap script

Fetch and run Kudu's bootstrap script. Open a terminal on the Mac and run the following command.

$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash

The VirtualBox image file starts downloading automatically. Since it is about 1.3 GB, this may take a while.

$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash
+ : https://github.com/cloudera/kudu-examples.git
+ git clone https://github.com/cloudera/kudu-examples.git
Cloning into 'kudu-examples'...
remote: Counting objects: 366, done.
remote: Total 366 (delta 0), reused 0 (delta 0), pack-reused 366
Receiving objects: 100% (366/366), 69.88 KiB | 0 bytes/s, done.
Resolving deltas: 100% (119/119), done.
Checking connectivity... done.
+ pushd kudu-examples
~/Downloads/kudu-examples ~/Downloads
+ pushd demo-vm-setup
~/Downloads/kudu-examples/demo-vm-setup ~/Downloads/kudu-examples ~/Downloads
+ ./setup-kudu-demo-vm.sh
Downloading Virtualbox Image file: http://cloudera-kudu-beta.s3.amazonaws.com/cloudera-quickstart-vm-5.7.1-kudu-virtualbox.ova
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1361M  100 1361M    0     0  3069k      0  0:07:34  0:07:34 --:--:-- 1138k

When the download completes, the VM boots in VirtualBox.

100 1361M  100 1361M    0     0  3216k      0  0:07:13  0:07:13 --:--:-- 3892k
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Interface 'vboxnet6' was successfully created
Importing VM cloudera-quickstart-vm-5.7.1-kudu-virtualbox.ova...
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Interpreting /Users/kawasaki/Downloads/kudu-examples/demo-vm-setup/cloudera-quickstart-vm-5.7.1-kudu-virtualbox.ova...
OK.
Disks:
  vmdisk1   68719476736 -1  http://www.vmware.com/interfaces/specifications/vmdk.html#streamOptimized   cloudera-quickstart-vm-5.7.1-kudu-virtualbox-disk1.vmdk -1  -1  

Virtual system 0:
 0: Suggested OS type: "RedHat_64"
    (change with "--vsys 0 --ostype <type>"; use "list ostypes" to list all possible values)
 1: VM name specified with --vmname: "kudu-demo"
 2: No. of CPUs specified with --cpus: 2
 3: Guest memory specified with --memory: 6144 MB
 4: Network adapter: orig NAT, config 3, extra slot=0;type=NAT
 5: Network adapter: orig NAT, config 3, extra slot=1;type=NAT
 6: CD-ROM
    (disable with "--vsys 0 --unit 6 --ignore")
 7: IDE controller, type PIIX4
    (disable with "--vsys 0 --unit 7 --ignore")
 8: IDE controller, type PIIX4
    (disable with "--vsys 0 --unit 8 --ignore")
 9: Hard disk image: source image=cloudera-quickstart-vm-5.7.1-kudu-virtualbox-disk1.vmdk, target path=/Users/kawasaki/VirtualBox VMs/cloudera-quickstart-vm-5.7.1-kudu-virtualbox/cloudera-quickstart-vm-5.7.1-kudu-virtualbox-disk1.vmdk, controller=7;channel=0
    (change target path with "--vsys 0 --unit 9 --disk path";
    disable with "--vsys 0 --unit 9 --ignore")
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully imported the appliance.
Waiting for VM "kudu-demo" to power on...
VM "kudu-demo" has been successfully started.
Wait until services become available.
Updating the /etc/hosts file requires sudo rights.
Password:
kudu1.png

Once everything is done, open another terminal and try connecting over SSH. In my local environment, a leftover setup from an earlier trial prevented me from reaching quickstart.cloudera, so I added the following to /etc/hosts.

127.0.0.1   localhost   localhost.domain
192.168.62.100  quickstart.cloudera quickstart

The SSH username and password are both demo.

$ ssh demo@quickstart.cloudera
demo@quickstart.cloudera's password: 
Last login: Tue Jul 26 05:16:28 2016 from 192.168.60.1
[demo@quickstart ~]$ 

I logged in successfully and confirmed that kudu and impalad are running.

[demo@quickstart ~]$ ps aux |grep kudu
kudu      3035  0.7  0.5 547028 31164 ?        Sl   05:51   0:01 /usr/lib/kudu/sbin/kudu-master --flagfile=/etc/kudu/conf/master.gflagfile
kudu      3174  0.6  0.2 1245340 13968 ?       Sl   05:51   0:01 /usr/lib/kudu/sbin/kudu-tserver --flagfile=/etc/kudu/conf/tserver.gflagfile
demo      3811  0.0  0.0 107452   948 pts/0    S+   05:55   0:00 grep kudu
[demo@quickstart ~]$ ps aux |grep impalad
impala    2999  2.4  2.7 4170804 163588 ?      Sl   05:51   0:05 /usr/lib/impala/sbin/impalad -log_dir=/var/log/impala -catalog_service_host=127.0.0.1 -state_store_port=24000 -use_statestore -state_store_host=127.0.0.1 -be_port=22000
demo      3813  0.0  0.0 107452   948 pts/0    S+   05:55   0:00 grep impalad
[demo@quickstart ~]$ 

Trying it out

Follow along with the tutorial from the quickstart page.

Loading sample data

Load the SFO passenger data.

$ hdfs dfs -mkdir /data
$ hdfs dfs -put examples/SFO_Passenger_Data/MonthlyPassengerData_200507_to_201506.csv /data

Starting the Impala shell

Connect to Kudu using Impala, the distributed SQL engine. Open another terminal and start impala-shell.

$ ssh demo@quickstart.cloudera -t impala-shell
demo@quickstart.cloudera's password: 
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.7.0-cdh5-IMPALA_KUDU-cdh5 RELEASE (build fc36c3c7fbbbdfb0e8b1b0e6ee7505531a384550)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.7.0-cdh5-IMPALA_KUDU-cdh5 (fc36c3c) built on Tue Jun 14 23:55:45 PDT 2016)

To see live updates on a query's progress, run 'set LIVE_SUMMARY=1;'.
***********************************************************************************
[quickstart.cloudera:21000] > 

Creating a table

Copy and paste the following into the Impala shell to create the passenger_data_raw table. This is an ordinary Impala/Hive external table over the data uploaded to HDFS.

CREATE EXTERNAL TABLE passenger_data_raw (
  id int,
  activity_period int,
  operating_airline string,
  airline_iata_code string,
  published_airline string,
  published_airline_iata_code string,
  geo_summary string,
  geo_region string,
  activity_type_code string,
  price_category_code string,
  terminal string,
  boarding_area string,
  passenger_count bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/';

Running a test query

Let's try a query. (The first run may take some time because the metadata is not yet cached; the second run completes in about 0.5 seconds.)

> SELECT count(*) FROM passenger_data_raw;
Query: select count(*) FROM passenger_data_raw
+----------+
| count(*) |
+----------+
| 13901    |
+----------+
Fetched 1 row(s) in 5.69s

Creating a Kudu table

I tried CREATE TABLE AS SELECT to create a Kudu table from the external table's data on HDFS, but it returned an error.

> CREATE TABLE passenger_data
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'passenger_data',
'kudu.master_addresses' = '127.0.0.1',
'kudu.key_columns' = 'id'
 ) AS SELECT * FROM passenger_data_raw;
Query: create TABLE passenger_data
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'passenger_data',
'kudu.master_addresses' = '127.0.0.1',
'kudu.key_columns' = 'id'
 ) AS SELECT * FROM passenger_data_raw
ERROR: AnalysisException: A data distribution must be specified using a DISTRIBUTE BY clause.

As the error message says, a data distribution must be specified with a DISTRIBUTE BY clause. The documentation explains that either DISTRIBUTE BY RANGE or DISTRIBUTE BY HASH must be used.
http://www.cloudera.com/documentation/betas/kudu/0-9-0/topics/kudu_impala.html

This time I chose hash partitioning rather than range partitioning, splitting the data by the id column of the source table (passenger_data_raw) into 16 buckets.

DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
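The idea behind this clause can be sketched as follows (a simplified model using Python's built-in hash; Kudu's actual partitioning hash differs):

```python
from collections import Counter

# Simplified model of DISTRIBUTE BY HASH (id) INTO 16 BUCKETS:
# each row's primary key is hashed, and the row is assigned to one
# of 16 buckets (tablets), spreading reads and writes across
# tablet servers. Kudu's real hash function is different; this
# only illustrates the mechanism.
NUM_BUCKETS = 16

def bucket_for(row_id: int) -> int:
    return hash(row_id) % NUM_BUCKETS

# The 13,901 rows of the sample table would be spread over all
# 16 buckets.
distribution = Counter(bucket_for(i) for i in range(13901))
```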

The table was created, although with warnings:

> CREATE TABLE passenger_data
                            > DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
                            > TBLPROPERTIES(
                            > 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
                            > 'kudu.table_name' = 'passenger_data',
                            > 'kudu.master_addresses' = '127.0.0.1',
                            > 'kudu.key_columns' = 'id'
                            >  ) AS SELECT * FROM passenger_data_raw;
Query: create TABLE passenger_data
DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'passenger_data',
'kudu.master_addresses' = '127.0.0.1',
'kudu.key_columns' = 'id'
 ) AS SELECT * FROM passenger_data_raw
+-----------------------+
| summary               |
+-----------------------+
| Inserted 13901 row(s) |
+-----------------------+
WARNINGS: Error converting column: 12 TO BIGINT (Data is: Other)
file: hdfs://quickstart.cloudera:8020/data/MonthlyPassengerData_200507_to_201506.csv
record: 
11381,201309,"Atlas Air, Inc",5Y,"Atlas Air, Inc",5Y,Domestic,US,Deplaned,Other,Other,Other,65
Error converting column: 12 TO BIGINT (Data is: Other)
file: hdfs://quickstart.cloudera:8020/data/MonthlyPassengerData_200507_to_201506.csv
record: 
11382,201309,"Atlas Air, Inc",5Y,"Atlas Air, Inc",5Y,Domestic,US,Thru / Transit,Other,Other,Other,3

Fetched 1 row(s) in 1.86s
[quickstart.cloudera:21000] > 
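The warnings are explained by the records shown in them: FIELDS TERMINATED BY ',' splits on every comma, including the commas inside the quoted airline name "Atlas Air, Inc", which shifts the fields so that column 12 (passenger_count) receives the string Other. A quote-aware CSV parse of the same record shows the difference (a small side check, not part of the tutorial):

```python
import csv
import io

# One of the offending records from the warnings above.
record = ('11381,201309,"Atlas Air, Inc",5Y,"Atlas Air, Inc",5Y,'
          'Domestic,US,Deplaned,Other,Other,Other,65')

# What FIELDS TERMINATED BY ',' effectively does: a naive split.
naive = record.split(",")
# 15 fields instead of 13; field 12 is the string 'Other', hence
# "Error converting column: 12 TO BIGINT (Data is: Other)".

# A quote-aware parse recovers the intended 13 fields.
proper = next(csv.reader(io.StringIO(record)))
# proper[2] == 'Atlas Air, Inc' and proper[-1] == '65'
```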

Querying the data in Kudu

Query the table created in Kudu. As with the earlier query, the first run takes a while because the Hive metastore data is not yet cached, but subsequent queries are fast.

> SELECT sum(passenger_count) AS total, operating_airline FROM passenger_data
  GROUP BY operating_airline
  HAVING total IS NOT null
  ORDER BY total DESC LIMIT 10;
Query: select sum(passenger_count) AS total, operating_airline FROM passenger_data
  GROUP BY operating_airline
  HAVING total IS NOT null
  ORDER BY total DESC LIMIT 10
+-----------+----------------------------------+
| total     | operating_airline                |
+-----------+----------------------------------+
| 105363917 | United Airlines - Pre 07/01/2013 |
| 51319845  | United Airlines                  |
| 32657456  | SkyWest Airlines                 |
| 31727343  | American Airlines                |
| 23801507  | Delta Air Lines                  |
| 23685267  | Virgin America                   |
| 22507320  | Southwest Airlines               |
| 16235520  | US Airways                       |
| 11860630  | Alaska Airlines                  |
| 6706438   | JetBlue Airways                  |
+-----------+----------------------------------+
Fetched 10 row(s) in 0.46s
[quickstart.cloudera:21000] > 

Updating data

This is where Kudu shows its real strength. The dataset has a problem: the same airline appears under two different names. Let's fix that with an UPDATE.

> UPDATE passenger_data
                            >   SET operating_airline="United Airlines"
                            >   WHERE operating_airline LIKE "United Airlines - Pre%";
Query: update passenger_data
  SET operating_airline="United Airlines"
  WHERE operating_airline LIKE "United Airlines - Pre%"

Fetched 0 row(s) in 0.15s
[quickstart.cloudera:21000] >

Run the query again to confirm: the two United Airlines entries have been merged.

> SELECT sum(passenger_count) AS total, operating_airline FROM passenger_data
                            >   GROUP BY operating_airline
                            >   HAVING total IS NOT null
                            >   ORDER BY total DESC LIMIT 10;
Query: select sum(passenger_count) AS total, operating_airline FROM passenger_data
  GROUP BY operating_airline
  HAVING total IS NOT null
  ORDER BY total DESC LIMIT 10
+-----------+--------------------+
| total     | operating_airline  |
+-----------+--------------------+
| 156683762 | United Airlines    |
| 32657456  | SkyWest Airlines   |
| 31727343  | American Airlines  |
| 23801507  | Delta Air Lines    |
| 23685267  | Virgin America     |
| 22507320  | Southwest Airlines |
| 16235520  | US Airways         |
| 11860630  | Alaska Airlines    |
| 6706438   | JetBlue Airways    |
| 6266220   | Northwest Airlines |
+-----------+--------------------+
Fetched 10 row(s) in 0.46s
[quickstart.cloudera:21000] >

The data was updated correctly! HDFS is an append-only file system, which makes updating data a challenge. With Kudu, data can be updated easily, and frequently updated data can be analyzed quickly.

Web UI screenshots

    Kudu master (http://quickstart.cloudera:8051)
kudu-master.png
kudu-master2.png
    Kudu tablet server (http://quickstart.cloudera:8050)
kudu-tablet1.png
kudu-tablet2.png

Summary

Kudu makes it possible to quickly analyze data that is continually being added and updated. Although Kudu has not yet had a stable release, it is already running successfully on hundreds of nodes in production environments!

Addendum

Impala 2.3 and later offer the LIVE_PROGRESS and LIVE_SUMMARY options, but they do not appear to be supported with Kudu, so nothing was displayed. A pity.
http://www.cloudera.com/documentation/enterprise/latest/topics/impala_live_progress.html
https://asciinema.org/a/1rv7qippo0fe7h5k1b6k4nexk
