Importing CSV files into Elasticsearch with the elasticsearch-loader tool
Overview
The following are the steps for uploading a CSV file to Elasticsearch with elasticsearch-loader. Compared with Logstash, this approach is simpler and faster.
What is elasticsearch-loader?
A Python tool for bulk loading data files (json, parquet, csv, tsv) into Elasticsearch.
GitHub –> GitHub Repository.
Supported environments
python \ es   5.6.16   6.8.0   7.1.1
2.7           V        V       V
3.7           V        V       V
Installation
$ sudo pip install elasticsearch-loader
Usage
We will use a CSV file like this:
$ cat test.csv
id,name,age,address
01,taro,12,tokyo
02,hanako,13,kyoto
03,ichiro,16,osaka
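Internally, each CSV row becomes one JSON document. A rough sketch of that conversion, using only the standard csv module (this is an illustration, not the loader's actual code):

```python
import csv
import io

# Sample CSV content, matching test.csv above
CSV_TEXT = """id,name,age,address
01,taro,12,tokyo
02,hanako,13,kyoto
03,ichiro,16,osaka
"""

def csv_to_docs(text):
    """Parse CSV text into the dicts that get bulk-indexed.

    Note: every value stays a string -- the loader does no type
    coercion, so "age" is indexed as "12", not 12.
    """
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

docs = csv_to_docs(CSV_TEXT)
print(docs[0]["name"], docs[0]["age"])
```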
Run the following command to load the CSV file into Elasticsearch.
$ elasticsearch_loader --es-host <host:port> --index <IndexName> --type <TypeName> csv <FileName>
$ elasticsearch_loader --es-host 192.168.1.1:9200 --index student --type type csv test.csv
{'index': u'student', 'bulk_size': 500, 'http_auth': None, 'es_conn': <Elasticsearch([{u'host': u'192.168.1.1', u'port': 9200}])>, 'encoding': u'utf-8', 'keys': [], 'use_ssl': False, 'update': False, 'id_field': None, 'as_child': False, 'index_settings_file': None, 'timeout': 10.0, 'progress': False, 'ca_certs': None, 'with_retry': False, 'verify_certs': False, 'type': u'type', 'es_host': (u'192.168.1.1:9200',), 'delete': False}
[####################################]
Results
If the specified index and type do not exist, they are created automatically and the documents are registered successfully.
$ curl -H "Content-Type: application/json" -XGET 'http://192.168.1.1:9200/student/type/_search?pretty'
{
"took" : 2,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 3,
"max_score" : 1.0,
"hits" : [
{
"_index" : "student",
"_type" : "type",
"_id" : "gECMZXABW66WYIIZTexw",
"_score" : 1.0,
"_source" : {
"age" : "12",
"address" : "tokyo",
"id" : "01",
"name" : "taro"
}
},
{
"_index" : "student",
"_type" : "type",
"_id" : "gkCMZXABW66WYIIZTexw",
"_score" : 1.0,
"_source" : {
"age" : "16",
"address" : "osaka",
"id" : "03",
"name" : "ichiro"
}
},
{
"_index" : "student",
"_type" : "type",
"_id" : "gUCMZXABW66WYIIZTexw",
"_score" : 1.0,
"_source" : {
"age" : "13",
"address" : "kyoto",
"id" : "02",
"name" : "hanako"
}
}
]
}
}
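Note in the output above that every field, including age, was indexed as a string, since the loader performs no type coercion. If typed fields are needed, a mapping can be supplied up front via --index-settings-file. A minimal sketch that writes such a settings file (the field types and shard count here are assumptions, and the single-type mapping layout targets ES 6.x; adjust for your cluster version):

```python
import json

# Hypothetical settings file for --index-settings-file: creates the
# index with typed fields instead of letting everything default to text.
index_settings = {
    "settings": {"number_of_shards": 5, "number_of_replicas": 1},
    "mappings": {
        "type": {  # must match the --type name used on the command line
            "properties": {
                "id": {"type": "keyword"},
                "name": {"type": "keyword"},
                "age": {"type": "integer"},
                "address": {"type": "keyword"},
            }
        }
    },
}

with open("student_settings.json", "w") as f:
    json.dump(index_settings, f, indent=2)
```

It would then be passed along with the other options, e.g. `elasticsearch_loader --es-host 192.168.1.1:9200 --index student --type type --index-settings-file student_settings.json csv test.csv`.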
Help
$ elasticsearch_loader -h
Usage: elasticsearch_loader [OPTIONS] COMMAND [ARGS]...
Options:
-c, --config-file TEXT Load default configuration file from esl.yml
--bulk-size INTEGER How many docs to collect before writing to
Elasticsearch (default 500)
--es-host TEXT Elasticsearch cluster entry point. (default
http://localhost:9200)
--verify-certs Make sure we verify SSL certificates
(default false)
--use-ssl Turn on SSL (default false)
--ca-certs TEXT Provide a path to CA certs on disk
--http-auth TEXT Provide username and password for basic auth
in the format of username:password
--index TEXT Destination index name [required]
--delete Delete index before import? (default false)
--update Merge and update existing doc instead of
overwrite
--progress Enable progress bar - NOTICE: in order to
show progress the entire input should be
collected and can consume more memory than
without progress bar
--type TEXT Docs type. TYPES WILL BE DEPRECATED IN APIS
IN ELASTICSEARCH 7, AND COMPLETELY REMOVED
IN 8. [required]
--id-field TEXT Specify field name that be used as document
id
--as-child Insert _parent, _routing field, the value is
same as _id. Note: must specify --id-field
explicitly
--with-retry Retry if ES bulk insertion failed
--index-settings-file FILENAME Specify path to json file containing index
mapping and settings, creates index if
missing
--timeout FLOAT Specify request timeout in seconds for
Elasticsearch client
--encoding TEXT Specify content encoding for input files
--keys TEXT Comma separated keys to pick from each
document
-h, --help Show this message and exit.
Commands:
csv
json FILES with the format of [{"a": "1"}, {"b": "2"}]
parquet
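Since the example above let Elasticsearch auto-generate _id values, re-running the import would create duplicate documents. Passing --id-field id makes the IDs deterministic, so re-imports overwrite instead of duplicating. A sketch of the action/source line pairs this produces (an illustration of the Elasticsearch _bulk API format, not the loader's exact internals):

```python
import json

docs = [
    {"id": "01", "name": "taro", "age": "12", "address": "tokyo"},
    {"id": "02", "name": "hanako", "age": "13", "address": "kyoto"},
]

def bulk_lines(docs, index="student", doc_type="type", id_field="id"):
    """Yield the action/source line pairs of the Elasticsearch bulk API.

    With a fixed _id per document, indexing the same row twice updates
    the existing document rather than creating a duplicate.
    """
    for doc in docs:
        yield json.dumps({"index": {"_index": index, "_type": doc_type,
                                    "_id": doc[id_field]}})
        yield json.dumps(doc)

payload = "\n".join(bulk_lines(docs)) + "\n"
print(payload)
```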
That's all.