Moving data from one part of your storage infrastructure to another is a difficult process. At least, it used to be. With the right tools and infrastructure, many of the pain points of a traditional data migration can now be eliminated. Sqoop greatly simplifies moving data from MySQL into Hive and makes it much easier to run analytical workloads on Hadoop.
Prerequisites: a Hadoop environment with Sqoop and Hive installed and running. To speed things up, we will use the Cloudera QuickStart VM (which requires at least 4 GB of RAM), although you can also use the Hortonworks Data Platform sandbox (which requires at least 8 GB of RAM). Since my laptop has only 8 GB of RAM, I am using the Cloudera VM image here.
If you run the Cloudera/HDP VM in VirtualBox, you also get easy access to a wide range of pre-installed Hadoop ecosystem components, including MySQL, Oozie, Hadoop, Hive, ZooKeeper, Storm, Kafka, Spark, and more.
In the Cloudera VM, open a terminal and make sure MySQL is installed:
shell> mysql --version
mysql  Ver 14.14 Distrib 5.1.66, for redhat-linux-gnu (x86_64) using readline 5.
We need a database of our own for this example, so create one in MySQL:
mysql> create database sqoop;
Then switch to it and create a customer table:
mysql> use sqoop;
mysql> create table customer(id varchar(3), name varchar(20), age varchar(3), salary integer(10));
Query OK, 0 rows affected (0.09 sec)

mysql> desc customer;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id     | varchar(3)  | YES  |     | NULL    |       |
| name   | varchar(20) | YES  |     | NULL    |       |
| age    | varchar(3)  | YES  |     | NULL    |       |
| salary | int(10)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
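The table then needs some sample rows. The original insert statements are not shown here, but statements along these lines (values taken from the query output below) would produce the same data:

mysql> insert into customer values
    ('1', 'John', '30', 80000), ('2', 'Kevin', '33', 84000),
    ('3', 'Mark', '28', 90000), ('4', 'Jenna', '34', 93000),
    ('5', 'Robert', '32', 100000), ('6', 'Zoya', '40', 60000),
    ('7', 'Sam', '37', 75000), ('8', 'George', '31', 67000),
    ('9', 'Peter', '23', 70000), ('19', 'Alex', '26', 74000);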
mysql> select * from customer;
+------+--------+------+--------+
| id   | name   | age  | salary |
+------+--------+------+--------+
| 1    | John   | 30   | 80000  |
| 2    | Kevin  | 33   | 84000  |
| 3    | Mark   | 28   | 90000  |
| 4    | Jenna  | 34   | 93000  |
| 5    | Robert | 32   | 100000 |
| 6    | Zoya   | 40   | 60000  |
| 7    | Sam    | 37   | 75000  |
| 8    | George | 31   | 67000  |
| 9    | Peter  | 23   | 70000  |
| 19   | Alex   | 26   | 74000  |
+------+--------+------+--------+
As you can see, the customer table does not have a primary key, and I have added only a handful of records to it. By default, Sqoop identifies the primary key column of a table (if present) and uses it as the splitting column. The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized slices of that range.
If the primary key values are not uniformly distributed across the range, the tasks will be unbalanced. In that case, you should explicitly choose a different column with the --split-by argument, for example --split-by id. Since our table has no primary key at all, we have to specify --split-by ourselves; a sketch of how the split column and the degree of parallelism interact follows below.
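To make the splitting behavior concrete: for --split-by id, Sqoop first runs a bounding query of the form SELECT MIN(`id`), MAX(`id`) FROM `customer` (you can see it in the job log further down) and then divides that range among the map tasks. The number of map tasks defaults to 4 and can be changed with --num-mappers (or -m). The command below is only an illustrative sketch, not part of the run shown in this article, and the target directory is hypothetical:

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
  --table customer --split-by id --num-mappers 2 \
  --target-dir /user/cloudera/ingest/raw/customers_2maps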
Since we want to import this table straight into Hive, we add --hive-import to the Sqoop command:
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root \
  -P \
  --split-by id \
  --columns id,name \
  --table customer \
  --target-dir /user/cloudera/ingest/raw/customers \
  --fields-terminated-by "," \
  --hive-import \
  --create-hive-table \
  --hive-table sqoop_workspace.customers
Here is what each option in the Sqoop command does:
--connect – the JDBC connection string for the source database.
--username – the database user name.
-P – prompts for the password on the console. You can also use --password, but that is not recommended because the password then shows up in job execution logs, which can cause problems. One workaround is to store the database password in a file on HDFS and supply it at runtime.
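--split-by – the column used to split the work among parallel map tasks (id here, since the table has no primary key).
--columns – the columns to import from the source table (id and name in this example).
--table – the MySQL table to read from.
--target-dir – the HDFS directory the imported data is written to.
--fields-terminated-by – the field delimiter used in the output files (a comma here).
--hive-import – import the table into Hive rather than leaving it only on HDFS.
--create-hive-table – create the Hive table as part of the import (the job fails if the table already exists).
--hive-table – the target Hive table, in <database>.<table> form (sqoop_workspace.customers here).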
As you can see below, Sqoop runs as a map-reduce job. Note that I use -P to supply the password here; alternatively, you can keep the command non-interactive by reading the password from a file, as sketched below.
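A minimal sketch of that file-based approach, using Sqoop's --password-file option (the file name, its location, and the target directory are illustrative; the file should be readable only by the user running the job and contain nothing but the password):

# store the password in a restricted file on HDFS (illustrative path)
echo -n "mysecret" | hdfs dfs -put - /user/cloudera/.mysql-password
hdfs dfs -chmod 400 /user/cloudera/.mysql-password

# reference it from the import instead of -P
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root \
  --password-file /user/cloudera/.mysql-password \
  --table customer --split-by id --target-dir /user/cloudera/ingest/raw/customers_pwfile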
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/03/01 12:59:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
Enter password:
16/03/01 12:59:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/03/01 12:59:54 INFO tool.CodeGenTool: Beginning code generation
16/03/01 12:59:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/03/01 13:00:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.jar
16/03/01 13:00:01 WARN manager.MySQLManager: It looks like you are importing from mysql.
16/03/01 13:00:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
16/03/01 13:00:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
16/03/01 13:00:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
16/03/01 13:00:01 INFO mapreduce.ImportJobBase: Beginning import of customer
16/03/01 13:00:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
16/03/01 13:00:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/03/01 13:00:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/03/01 13:00:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/01 13:00:11 INFO db.DBInputFormat: Using read commited transaction isolation
16/03/01 13:00:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `customer`
16/03/01 13:00:11 WARN db.TextSplitter: Generating splits for a textual index column.
16/03/01 13:00:11 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/03/01 13:00:11 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/03/01 13:00:11 INFO mapreduce.JobSubmitter: number of splits:4
16/03/01 13:00:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456782715090_0004
16/03/01 13:00:13 INFO impl.YarnClientImpl: Submitted application application_1456782715090_0004
16/03/01 13:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1456782715090_0004/
16/03/01 13:00:13 INFO mapreduce.Job: Running job: job_1456782715090_0004
16/03/01 13:00:47 INFO mapreduce.Job: Job job_1456782715090_0004 running in uber mode : false
16/03/01 13:00:48 INFO mapreduce.Job:  map 0% reduce 0%
16/03/01 13:01:43 INFO mapreduce.Job:  map 25% reduce 0%
16/03/01 13:01:46 INFO mapreduce.Job:  map 50% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job:  map 100% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job: Job job_1456782715090_0004 completed successfully
16/03/01 13:01:48 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=548096
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=409
        HDFS: Number of bytes written=77
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=4
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=216810
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=216810
        Total vcore-seconds taken by all map tasks=216810
        Total megabyte-seconds taken by all map tasks=222013440
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=409
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=2400
        CPU time spent (ms)=5200
        Physical memory (bytes) snapshot=418557952
        Virtual memory (bytes) snapshot=6027804672
        Total committed heap usage (bytes)=243007488
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=77
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Transferred 77 bytes in 104.1093 seconds (0.7396 bytes/sec)
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Retrieved 10 records.
16/03/01 13:01:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 13:01:49 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/jars/hive-common-1.1.0-cdh5.5.0.jar!/hive-log4j.properties
OK
Time taken: 2.163 seconds
Loading data to table sqoop_workspace.customers
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00001': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00002': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00003': User does not belong to supergroup
Table sqoop_workspace.customers stats: [numFiles=4, totalSize=77]
OK
Time taken: 1.399 seconds
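Before moving on to Hive, you can optionally confirm the four part files Sqoop wrote into the Hive warehouse directory. This check was not part of the original run; the path is taken from the log above:

hdfs dfs -ls /user/hive/warehouse/sqoop_workspace.db/customers
hdfs dfs -cat /user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000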
Finally, verify the result in Hive:
hive> show databases;
OK
default
sqoop_workspace
Time taken: 0.034 seconds, Fetched: 2 row(s)
hive> use sqoop_workspace;
OK
Time taken: 0.063 seconds
hive> show tables;
OK
customers
Time taken: 0.036 seconds, Fetched: 1 row(s)
hive> show create table customers;
OK
CREATE TABLE `customers`(
  `id` string,
  `name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='4',
  'totalSize'='77',
  'transient_lastDdlTime'='1456866115')
Time taken: 0.26 seconds, Fetched: 18 row(s)
hive> select * from customers;
OK
1 John
2 Kevin
19 Alex
3 Mark
4 Jenna
5 Robert
6 Zoya
7 Sam
8 George
9 Peter
Time taken: 1.123 seconds, Fetched: 10 row(s)
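Note that the rows come back in the order of the four part files written by the mappers, not in id order, which is why 19 Alex appears between 2 Kevin and 3 Mark: the id column is a varchar, so the splits were computed textually (as the TextSplitter warning in the job log pointed out) and '19' landed in the same split as '1' and '2'. If you want the rows ordered numerically, a simple illustrative query such as the following will do it:

hive> select cast(id as int) as id_int, name from customers order by id_int;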
And that's it! Migrating data from MySQL to Hive really is that simple.