Trying the Apache Kudu Quickstart VM
Apache Kudu (http://kudu.apache.org/) has become an Apache Top-Level Project (TLP), so I tried it out using the latest Quickstart VM.
What is Apache Kudu?
Apache Kudu is a brand-new open-source storage engine developed from scratch. It sits within the Hadoop ecosystem and complements HDFS and HBase. With Kudu, you can run fast queries over data that is frequently added and updated; like an ordinary database, it supports real-time inserts, updates, and deletes.
An excerpt from the Kudu blog (http://kudu.apache.org/2016/07/25/asf-graduation.html):
Optimized for fast scans, Kudu is particularly well suited to hosting time-series data and various types of operational data. In addition to its impressive scan speed, Kudu supports many operations available in traditional databases, including real-time insert, update, and delete operations. Kudu enables a "bring your own SQL" philosophy, and supports being accessed by multiple different query engines including other Apache projects such as Drill, Spark, and Impala (incubating).
According to the documentation (http://kudu.apache.org/overview.html#architecture), Kudu:
– uses an extremely fast columnar storage engine
– is distributed and fault-tolerant
– is designed for next-generation storage technologies such as Intel 3D XPoint
Kudu is not meant to replace HDFS or HBase but to complement them. HDFS is faster for scanning large volumes of data, and HBase is faster for random reads and writes; neither, however, is a good fit for fast analytics over frequently updated data, and that is the gap Kudu fills.
Like Apache Parquet, Kudu stores data column by column. Unlike HBase, each column has its own type, so values can be encoded and compressed efficiently. When an analysis touches only some of the columns, the rest never need to be read or written, which means less disk I/O and more speed. In short, Kudu combines support for adding, updating, and deleting data with an efficient columnar engine optimized for column-oriented access.
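To make the column-pruning point concrete, here is a small illustrative query (against the passenger_data table built later in this post); a columnar engine only has to read two of the table's thirteen columns from disk:

-- Only operating_airline and passenger_count are scanned;
-- the other eleven columns are never read.
SELECT operating_airline, sum(passenger_count) AS total
FROM passenger_data
GROUP BY operating_airline;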
References
– Kudu: a new storage engine for Hadoop that complements HDFS and HBase
– An overview of Kudu resources
Quickstart VM
The official Kudu website provides a Quickstart VM that runs on VirtualBox. It is not a distributed environment, but it is a handy option if you want to try Kudu quickly.
http://kudu.apache.org/docs/quickstart.html
Environment
- Macbook Pro (OSX El Capitan)
- VirtualBox 5.0.10
Running the bootstrap script
Fetch and run Kudu's bootstrap script. Open a terminal on the Mac and run the following command:
$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash
The VirtualBox image file starts downloading automatically. At around 1.2 GB, this can take a while.
$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash
+ : https://github.com/cloudera/kudu-examples.git
+ git clone https://github.com/cloudera/kudu-examples.git
Cloning into 'kudu-examples'...
remote: Counting objects: 366, done.
remote: Total 366 (delta 0), reused 0 (delta 0), pack-reused 366
Receiving objects: 100% (366/366), 69.88 KiB | 0 bytes/s, done.
Resolving deltas: 100% (119/119), done.
Checking connectivity... done.
+ pushd kudu-examples
~/Downloads/kudu-examples ~/Downloads
+ pushd demo-vm-setup
~/Downloads/kudu-examples/demo-vm-setup ~/Downloads/kudu-examples ~/Downloads
+ ./setup-kudu-demo-vm.sh
Downloading Virtualbox Image file: http://cloudera-kudu-beta.s3.amazonaws.com/cloudera-quickstart-vm-5.7.1-kudu-virtualbox.ova
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 1361M 100 1361M 0 0 3069k 0 0:07:34 0:07:34 --:--:-- 1138k
Once the download completes, the VM boots up in VirtualBox.
100 1361M 100 1361M 0 0 3216k 0 0:07:13 0:07:13 --:--:-- 3892k
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Interface 'vboxnet6' was successfully created
Importing VM cloudera-quickstart-vm-5.7.1-kudu-virtualbox.ova...
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Interpreting /Users/kawasaki/Downloads/kudu-examples/demo-vm-setup/cloudera-quickstart-vm-5.7.1-kudu-virtualbox.ova...
OK.
Disks:
vmdisk1 68719476736 -1 http://www.vmware.com/interfaces/specifications/vmdk.html#streamOptimized cloudera-quickstart-vm-5.7.1-kudu-virtualbox-disk1.vmdk -1 -1
Virtual system 0:
0: Suggested OS type: "RedHat_64"
(change with "--vsys 0 --ostype <type>"; use "list ostypes" to list all possible values)
1: VM name specified with --vmname: "kudu-demo"
2: No. of CPUs specified with --cpus: 2
3: Guest memory specified with --memory: 6144 MB
4: Network adapter: orig NAT, config 3, extra slot=0;type=NAT
5: Network adapter: orig NAT, config 3, extra slot=1;type=NAT
6: CD-ROM
(disable with "--vsys 0 --unit 6 --ignore")
7: IDE controller, type PIIX4
(disable with "--vsys 0 --unit 7 --ignore")
8: IDE controller, type PIIX4
(disable with "--vsys 0 --unit 8 --ignore")
9: Hard disk image: source image=cloudera-quickstart-vm-5.7.1-kudu-virtualbox-disk1.vmdk, target path=/Users/kawasaki/VirtualBox VMs/cloudera-quickstart-vm-5.7.1-kudu-virtualbox/cloudera-quickstart-vm-5.7.1-kudu-virtualbox-disk1.vmdk, controller=7;channel=0
(change target path with "--vsys 0 --unit 9 --disk path";
disable with "--vsys 0 --unit 9 --ignore")
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%
Successfully imported the appliance.
Waiting for VM "kudu-demo" to power on...
VM "kudu-demo" has been successfully started.
Wait until services become available.
Updating the /etc/hosts file requires sudo rights.
Password:

Once everything finishes, open another terminal and try connecting over SSH. On my machine, leftovers from a previous Quickstart environment prevented communication with quickstart.cloudera, so I added the following to /etc/hosts:
127.0.0.1 localhost localhost.domain
192.168.62.100 quickstart.cloudera quickstart
The SSH username and password are both demo.
$ ssh demo@quickstart.cloudera
demo@quickstart.cloudera's password:
Last login: Tue Jul 26 05:16:28 2016 from 192.168.60.1
[demo@quickstart ~]$
The login succeeded, and I confirmed that kudu and impalad are running.
[demo@quickstart ~]$ ps aux |grep kudu
kudu 3035 0.7 0.5 547028 31164 ? Sl 05:51 0:01 /usr/lib/kudu/sbin/kudu-master --flagfile=/etc/kudu/conf/master.gflagfile
kudu 3174 0.6 0.2 1245340 13968 ? Sl 05:51 0:01 /usr/lib/kudu/sbin/kudu-tserver --flagfile=/etc/kudu/conf/tserver.gflagfile
demo 3811 0.0 0.0 107452 948 pts/0 S+ 05:55 0:00 grep kudu
[demo@quickstart ~]$ ps aux |grep impalad
impala 2999 2.4 2.7 4170804 163588 ? Sl 05:51 0:05 /usr/lib/impala/sbin/impalad -log_dir=/var/log/impala -catalog_service_host=127.0.0.1 -state_store_port=24000 -use_statestore -state_store_host=127.0.0.1 -be_port=22000
demo 3813 0.0 0.0 107452 948 pts/0 S+ 05:55 0:00 grep impalad
[demo@quickstart ~]$
Tutorial
Follow the tutorial on the download page.
Loading sample data
Load the SFO passenger data.
$ hdfs dfs -mkdir /data
$ hdfs dfs -put examples/SFO_Passenger_Data/MonthlyPassengerData_200507_to_201506.csv /data
Starting the Impala shell
Connect to Kudu through Impala, the distributed SQL engine. Open yet another terminal and start impala-shell.
$ ssh demo@quickstart.cloudera -t impala-shell
demo@quickstart.cloudera's password:
Starting Impala Shell without Kerberos authentication
Connected to quickstart.cloudera:21000
Server version: impalad version 2.7.0-cdh5-IMPALA_KUDU-cdh5 RELEASE (build fc36c3c7fbbbdfb0e8b1b0e6ee7505531a384550)
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.7.0-cdh5-IMPALA_KUDU-cdh5 (fc36c3c) built on Tue Jun 14 23:55:45 PDT 2016)
To see live updates on a query's progress, run 'set LIVE_SUMMARY=1;'.
***********************************************************************************
[quickstart.cloudera:21000] >
Creating a table
Copy and paste the following into the Impala shell to create the passenger_data_raw table. This is an ordinary Impala/Hive external table over the data uploaded to HDFS.
CREATE EXTERNAL TABLE passenger_data_raw (
id int,
activity_period int,
operating_airline string,
airline_iata_code string,
published_airline string,
published_airline_iata_code string,
geo_summary string,
geo_region string,
activity_type_code string,
price_category_code string,
terminal string,
boarding_area string,
passenger_count bigint
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/';
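As an optional sanity check of my own (not part of the tutorial), you can peek at a few rows to confirm the external table parses the CSV as expected:

-- Read a handful of rows straight from the CSV on HDFS.
SELECT id, operating_airline, passenger_count
FROM passenger_data_raw
LIMIT 5;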
Running a test query
Let's try a query. (The first run can take a while because the table metadata has not been cached yet; the second run completes in about 0.5 seconds.)
> SELECT count(*) FROM passenger_data_raw;
Query: select count(*) FROM passenger_data_raw
+----------+
| count(*) |
+----------+
| 13901 |
+----------+
Fetched 1 row(s) in 5.69s
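As an aside, the first-run slowness is the metadata cache warming up. When the underlying files change outside of Impala (for example via hdfs dfs -put), the standard Impala statements below refresh the cached metadata; this is my own note, not a step from the tutorial.

-- Reload the file and block metadata for a table whose HDFS files changed.
REFRESH passenger_data_raw;
-- Or mark the metadata stale so it is fully reloaded on the next access.
INVALIDATE METADATA passenger_data_raw;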
Creating a Kudu table
I tried CREATE TABLE AS SELECT to build a Kudu table from the external table's data on HDFS, but it returned an error.
> CREATE TABLE passenger_data
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'passenger_data',
'kudu.master_addresses' = '127.0.0.1',
'kudu.key_columns' = 'id'
) AS SELECT * FROM passenger_data_raw;
Query: create TABLE passenger_data
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'passenger_data',
'kudu.master_addresses' = '127.0.0.1',
'kudu.key_columns' = 'id'
) AS SELECT * FROM passenger_data_raw
ERROR: AnalysisException: A data distribution must be specified using a DISTRIBUTE BY clause.
According to the documentation, this error means a data distribution has to be specified with a DISTRIBUTE BY clause, using either DISTRIBUTE BY RANGE or DISTRIBUTE BY HASH.
http://www.cloudera.com/documentation/betas/kudu/0-9-0/topics/kudu_impala.html
This time I chose hash rather than range partitioning, splitting the source table (passenger_data_raw) into 16 buckets on its id column:
DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
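For reference, the range-partitioned variant described in the same documentation would look roughly like the line below; treat it as an untested sketch with made-up split points, since I went with hashing instead.

-- Hypothetical alternative: range-partition on id with explicit split rows.
DISTRIBUTE BY RANGE (id) SPLIT ROWS ((5000), (10000))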
The table was created, although with warnings:
> CREATE TABLE passenger_data
> DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
> TBLPROPERTIES(
> 'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
> 'kudu.table_name' = 'passenger_data',
> 'kudu.master_addresses' = '127.0.0.1',
> 'kudu.key_columns' = 'id'
> ) AS SELECT * FROM passenger_data_raw;
Query: create TABLE passenger_data
DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
TBLPROPERTIES(
'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
'kudu.table_name' = 'passenger_data',
'kudu.master_addresses' = '127.0.0.1',
'kudu.key_columns' = 'id'
) AS SELECT * FROM passenger_data_raw
+-----------------------+
| summary |
+-----------------------+
| Inserted 13901 row(s) |
+-----------------------+
WARNINGS: Error converting column: 12 TO BIGINT (Data is: Other)
file: hdfs://quickstart.cloudera:8020/data/MonthlyPassengerData_200507_to_201506.csv
record:
11381,201309,"Atlas Air, Inc",5Y,"Atlas Air, Inc",5Y,Domestic,US,Deplaned,Other,Other,Other,65
Error converting column: 12 TO BIGINT (Data is: Other)
file: hdfs://quickstart.cloudera:8020/data/MonthlyPassengerData_200507_to_201506.csv
record:
11382,201309,"Atlas Air, Inc",5Y,"Atlas Air, Inc",5Y,Domestic,US,Thru / Transit,Other,Other,Other,3
Fetched 1 row(s) in 1.86s
[quickstart.cloudera:21000] >
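The warnings come from rows whose 13th field (passenger_count, column index 12) contains the literal string "Other" rather than a number; Impala cannot convert it to BIGINT and loads NULL instead, which is also why the queries below filter with HAVING total IS NOT null. A quick check of my own to count the affected rows:

-- Rows whose passenger_count failed the BIGINT conversion are NULL.
SELECT count(*)
FROM passenger_data_raw
WHERE passenger_count IS NULL;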
Querying data in Kudu
Query the table created in Kudu. As with the earlier query, the first run takes longer because the Hive metastore data has not been cached yet, but subsequent queries are fast.
> SELECT sum(passenger_count) AS total, operating_airline FROM passenger_data
GROUP BY operating_airline
HAVING total IS NOT null
ORDER BY total DESC LIMIT 10;
Query: select sum(passenger_count) AS total, operating_airline FROM passenger_data
GROUP BY operating_airline
HAVING total IS NOT null
ORDER BY total DESC LIMIT 10
+-----------+----------------------------------+
| total | operating_airline |
+-----------+----------------------------------+
| 105363917 | United Airlines - Pre 07/01/2013 |
| 51319845 | United Airlines |
| 32657456 | SkyWest Airlines |
| 31727343 | American Airlines |
| 23801507 | Delta Air Lines |
| 23685267 | Virgin America |
| 22507320 | Southwest Airlines |
| 16235520 | US Airways |
| 11860630 | Alaska Airlines |
| 6706438 | JetBlue Airways |
+-----------+----------------------------------+
Fetched 10 row(s) in 0.46s
[quickstart.cloudera:21000] >
Updating data
This is where Kudu's real strength shows. The dataset we just loaded has a problem: the same airline appears under different names. Let's fix it with an UPDATE.
> UPDATE passenger_data
> SET operating_airline="United Airlines"
> WHERE operating_airline LIKE "United Airlines - Pre%";
Query: update passenger_data
SET operating_airline="United Airlines"
WHERE operating_airline LIKE "United Airlines - Pre%"
Fetched 0 row(s) in 0.15s
[quickstart.cloudera:21000] >
Let's run the query again to confirm; the inconsistent names are gone:
> SELECT sum(passenger_count) AS total, operating_airline FROM passenger_data
> GROUP BY operating_airline
> HAVING total IS NOT null
> ORDER BY total DESC LIMIT 10;
Query: select sum(passenger_count) AS total, operating_airline FROM passenger_data
GROUP BY operating_airline
HAVING total IS NOT null
ORDER BY total DESC LIMIT 10
+-----------+--------------------+
| total | operating_airline |
+-----------+--------------------+
| 156683762 | United Airlines |
| 32657456 | SkyWest Airlines |
| 31727343 | American Airlines |
| 23801507 | Delta Air Lines |
| 23685267 | Virgin America |
| 22507320 | Southwest Airlines |
| 16235520 | US Airways |
| 11860630 | Alaska Airlines |
| 6706438 | JetBlue Airways |
| 6266220 | Northwest Airlines |
+-----------+--------------------+
Fetched 10 row(s) in 0.46s
[quickstart.cloudera:21000] >
The data really was updated! HDFS is a write-once (append-only) file system, which makes updating data a challenge; with Kudu, data can be updated easily, and frequently updated data can be analyzed quickly.
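Incidentally, UPDATE is not the only DML statement the Kudu-enabled Impala build adds; the documentation linked above also covers INSERT and DELETE on Kudu tables. A hypothetical round trip with a made-up row:

-- Insert a single fabricated row; id is the Kudu primary key.
INSERT INTO passenger_data VALUES (99999, 201507, 'Example Air', 'EX',
  'Example Air', 'EX', 'Domestic', 'US', 'Enplaned', 'Other', 'Terminal 1', 'A', 123);
-- And remove it again.
DELETE FROM passenger_data WHERE id = 99999;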
Web UI screenshots
- Kudu tablet-server (http://quickstart.cloudera:8051)
Summary
With Kudu, the kind of data that keeps being added and updated can now be analyzed quickly. Kudu has not yet had an official release, but it is already running successfully on hundreds of nodes in production!
Appendix
Impala 2.3 and later offer the LIVE_PROGRESS and LIVE_SUMMARY options, but they do not seem to be supported with Kudu, so nothing was displayed. A shame.
http://www.cloudera.com/documentation/enterprise/latest/topics/impala_live_progress.html
https://asciinema.org/a/1rv7qippo0fe7h5k1b6k4nexk