Moving data from one part of your storage infrastructure to another is a difficult process. At least, it used to be. With the right tools and infrastructure, many of the pain points of a traditional data migration can now be eliminated. Sqoop greatly simplifies moving data from MySQL into Hive and makes it much easier to run analytical workloads on Hadoop.
Prerequisites: a Hadoop environment with Sqoop and Hive installed and running. To speed things up, we will use the Cloudera QuickStart VM (which requires at least 4 GB of RAM), although you can also use the Hortonworks Data Platform sandbox (which requires at least 8 GB of RAM). Since my laptop has only 8 GB of RAM, I am using the Cloudera VM image here.
If you run the Cloudera/HDP VM in VirtualBox, you also get easy access to a wide range of pre-installed Hadoop ecosystem components, including MySQL, Oozie, Hadoop, Hive, ZooKeeper, Storm, Kafka, Spark, and more.
In the Cloudera VM, open a terminal and make sure MySQL is installed:
shell> mysql --version
mysql  Ver 14.14 Distrib 5.1.66, for redhat-linux-gnu (x86_64) using readline 5.
We need a database of our own for this example, so create one in MySQL:
mysql> create database sqoop;
Then switch to it and create a customer table:
mysql> use sqoop;
mysql> create table customer(id varchar(3), name varchar(20), age varchar(3), salary integer(10));
Query OK, 0 rows affected (0.09 sec)

mysql> desc customer;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id     | varchar(3)  | YES  |     | NULL    |       |
| name   | varchar(20) | YES  |     | NULL    |       |
| age    | varchar(3)  | YES  |     | NULL    |       |
| salary | int(10)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
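The table then needs some sample rows. The original insert statements are not shown here, but statements along these lines (values taken from the query output below) would produce the same data:

mysql> insert into customer values
    ('1', 'John', '30', 80000), ('2', 'Kevin', '33', 84000),
    ('3', 'Mark', '28', 90000), ('4', 'Jenna', '34', 93000),
    ('5', 'Robert', '32', 100000), ('6', 'Zoya', '40', 60000),
    ('7', 'Sam', '37', 75000), ('8', 'George', '31', 67000),
    ('9', 'Peter', '23', 70000), ('19', 'Alex', '26', 74000);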
mysql> select * from customer;
+------+--------+------+--------+
| id   | name   | age  | salary |
+------+--------+------+--------+
| 1    | John   | 30   | 80000  |
| 2    | Kevin  | 33   | 84000  |
| 3    | Mark   | 28   | 90000  |
| 4    | Jenna  | 34   | 93000  |
| 5    | Robert | 32   | 100000 |
| 6    | Zoya   | 40   | 60000  |
| 7    | Sam    | 37   | 75000  |
| 8    | George | 31   | 67000  |
| 9    | Peter  | 23   | 70000  |
| 19   | Alex   | 26   | 74000  |
+------+--------+------+--------+
As you can see, the customer table does not have a primary key, and I have added only a handful of records to it. By default, Sqoop identifies the primary key column of a table (if present) and uses it as the splitting column. The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized slices of that range.
If the primary key values are not uniformly distributed across the range, the tasks will be unbalanced. In that case, you should explicitly choose a different column with the --split-by argument, for example --split-by id. Since our table has no primary key at all, we have to specify --split-by ourselves; a sketch of how the split column and the degree of parallelism interact follows below.
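To make the splitting behavior concrete: for --split-by id, Sqoop first runs a bounding query of the form SELECT MIN(`id`), MAX(`id`) FROM `customer` (you can see it in the job log further down) and then divides that range among the map tasks. The number of map tasks defaults to 4 and can be changed with --num-mappers (or -m). The command below is only an illustrative sketch, not part of the run shown in this article, and the target directory is hypothetical:

sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P \
  --table customer --split-by id --num-mappers 2 \
  --target-dir /user/cloudera/ingest/raw/customers_2maps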
Since we want to import this table straight into Hive, we add --hive-import to the Sqoop command:
sqoop import \
  --connect jdbc:mysql://localhost:3306/sqoop \
  --username root \
  -P \
  --split-by id \
  --columns id,name \
  --table customer \
  --target-dir /user/cloudera/ingest/raw/customers \
  --fields-terminated-by "," \
  --hive-import \
  --create-hive-table \
  --hive-table sqoop_workspace.customers
Here is what each option in the Sqoop command does:
--connect – the JDBC connection string for the source database.
--username – the database user name.
-P – prompts for the password on the console. You can also use --password, but that is not recommended because the password then shows up in job execution logs, which can cause problems. One workaround is to store the database password in a file on HDFS and supply it at runtime.
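--split-by – the column used to split the work among parallel map tasks (id here, since the table has no primary key).
--columns – the columns to import from the source table (id and name in this example).
--table – the MySQL table to read from.
--target-dir – the HDFS directory the imported data is written to.
--fields-terminated-by – the field delimiter used in the output files (a comma here).
--hive-import – import the table into Hive rather than leaving it only on HDFS.
--create-hive-table – create the Hive table as part of the import (the job fails if the table already exists).
--hive-table – the target Hive table, in <database>.<table> form (sqoop_workspace.customers here).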
As you can see below, Sqoop runs as a map-reduce job. Note that I use -P to supply the password here; alternatively, you can keep the command non-interactive by reading the password from a file, as sketched below.
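A minimal sketch of that file-based approach, using Sqoop's --password-file option (the file name, its location, and the target directory are illustrative; the file should be readable only by the user running the job and contain nothing but the password):

# store the password in a restricted file on HDFS (illustrative path)
echo -n "mysecret" | hdfs dfs -put - /user/cloudera/.mysql-password
hdfs dfs -chmod 400 /user/cloudera/.mysql-password

# reference it from the import instead of -P
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root \
  --password-file /user/cloudera/.mysql-password \
  --table customer --split-by id --target-dir /user/cloudera/ingest/raw/customers_pwfile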
sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/03/01 12:59:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
Enter password:
16/03/01 12:59:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
16/03/01 12:59:54 INFO tool.CodeGenTool: Beginning code generation
16/03/01 12:59:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 12:59:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
Note: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/03/01 13:00:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.jar
16/03/01 13:00:01 WARN manager.MySQLManager: It looks like you are importing from mysql.
16/03/01 13:00:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
16/03/01 13:00:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
16/03/01 13:00:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
16/03/01 13:00:01 INFO mapreduce.ImportJobBase: Beginning import of customer
16/03/01 13:00:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
16/03/01 13:00:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/03/01 13:00:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/03/01 13:00:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/03/01 13:00:11 INFO db.DBInputFormat: Using read commited transaction isolation
16/03/01 13:00:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `customer`
16/03/01 13:00:11 WARN db.TextSplitter: Generating splits for a textual index column.
16/03/01 13:00:11 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
16/03/01 13:00:11 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
16/03/01 13:00:11 INFO mapreduce.JobSubmitter: number of splits:4
16/03/01 13:00:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456782715090_0004
16/03/01 13:00:13 INFO impl.YarnClientImpl: Submitted application application_1456782715090_0004
16/03/01 13:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1456782715090_0004/
16/03/01 13:00:13 INFO mapreduce.Job: Running job: job_1456782715090_0004
16/03/01 13:00:47 INFO mapreduce.Job: Job job_1456782715090_0004 running in uber mode : false
16/03/01 13:00:48 INFO mapreduce.Job:  map 0% reduce 0%
16/03/01 13:01:43 INFO mapreduce.Job:  map 25% reduce 0%
16/03/01 13:01:46 INFO mapreduce.Job:  map 50% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job:  map 100% reduce 0%
16/03/01 13:01:48 INFO mapreduce.Job: Job job_1456782715090_0004 completed successfully
16/03/01 13:01:48 INFO mapreduce.Job: Counters: 30
    File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=548096
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=409
        HDFS: Number of bytes written=77
        HDFS: Number of read operations=16
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=8
    Job Counters
        Launched map tasks=4
        Other local map tasks=5
        Total time spent by all maps in occupied slots (ms)=216810
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=216810
        Total vcore-seconds taken by all map tasks=216810
        Total megabyte-seconds taken by all map tasks=222013440
    Map-Reduce Framework
        Map input records=10
        Map output records=10
        Input split bytes=409
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=2400
        CPU time spent (ms)=5200
        Physical memory (bytes) snapshot=418557952
        Virtual memory (bytes) snapshot=6027804672
        Total committed heap usage (bytes)=243007488
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=77
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Transferred 77 bytes in 104.1093 seconds (0.7396 bytes/sec)
16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Retrieved 10 records.
16/03/01 13:01:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
16/03/01 13:01:49 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/jars/hive-common-1.1.0-cdh5.5.0.jar!/hive-log4j.properties
OK
Time taken: 2.163 seconds
Loading data to table sqoop_workspace.customers
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00001': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00002': User does not belong to supergroup
chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00003': User does not belong to supergroup
Table sqoop_workspace.customers stats: [numFiles=4, totalSize=77]
OK
Time taken: 1.399 seconds
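Before moving on to Hive, you can optionally confirm the four part files Sqoop wrote into the Hive warehouse directory. This check was not part of the original run; the path is taken from the log above:

hdfs dfs -ls /user/hive/warehouse/sqoop_workspace.db/customers
hdfs dfs -cat /user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000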
Finally, verify the result in Hive:
hive> show databases;
OK
default
sqoop_workspace
Time taken: 0.034 seconds, Fetched: 2 row(s)
hive> use sqoop_workspace;
OK
Time taken: 0.063 seconds
hive> show tables;
OK
customers
Time taken: 0.036 seconds, Fetched: 1 row(s)
hive> show create table customers;
OK
CREATE TABLE `customers`(
  `id` string,
  `name` string)
COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='4',
  'totalSize'='77',
  'transient_lastDdlTime'='1456866115')
Time taken: 0.26 seconds, Fetched: 18 row(s)
hive> select * from customers;
OK
1 John
2 Kevin
19 Alex
3 Mark
4 Jenna
5 Robert
6 Zoya
7 Sam
8 George
9 Peter
Time taken: 1.123 seconds, Fetched: 10 row(s)
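Note that the rows come back in the order of the four part files written by the mappers, not in id order, which is why 19 Alex appears between 2 Kevin and 3 Mark: the id column is a varchar, so the splits were computed textually (as the TextSplitter warning in the job log pointed out) and '19' landed in the same split as '1' and '2'. If you want the rows ordered numerically, a simple illustrative query such as the following will do it:

hive> select cast(id as int) as id_int, name from customers order by id_int;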
And that's it! Migrating data from MySQL to Hive really is that simple.