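Reposted

Extracting text from PDF files: working with PDF Parser and XPDF

A web search turns up many development packages for extracting text from PDF files, and there are plenty for PHP alone. Below are the three libraries I have come across in practice: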

1. PDFLIB TET http://www.pdflib.com/en/download/tet/

2. PDF Parser http://www.pdfparser.org/

3. XPDF http://www.foolabs.com/xpdf/

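My first impression favored PDFlib TET, because it also offers features such as image extraction. It is a commercial library, however, so all I could do was stare at its 200-plus pages of English documentation without getting to use it. As someone who likes to learn, I still hope an expert will come along and show how it is done!

This article uses PDF Parser and XPDF to extract the text from PDF files.

Test environment:

Alibaba Cloud + Ubuntu 12.04 + Apache 2 + PHP 5.3.10 + MySQL 5.6 (the project as a whole uses the ThinkPHP framework; this feature is only one part of it)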

PDF Parser

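Preparation:

Download the project source, pdfparser-master.zip, from the official site listed above;

Unpack the archive and copy the Smalot folder inside src (it holds the project's core source code) into ThinkPHP/Library (the folder where the ThinkPHP framework keeps third-party libraries);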

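Rename the source files, for example page.php becomes page.class.php (the .class.php suffix is the class-file naming that ThinkPHP's class loader expects, not a PHP-wide convention);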

Experiment:

Write a class that calls the library above. The code is as follows:

<?php
namespace Admin\Controller;
use Think\Controller;

class PdfParseController extends Controller {
    // Parse a PDF file and print its text
    function parse() {
        // Read the file path from the request parameter
        $path = $_GET['path'];
        // Create the library's Parser object
        $parser = new \Smalot\PdfParser\Parser();
        // parseFile() takes the PDF file path and returns a Document object
        $document = $parser->parseFile($path);
        // Get all pages
        $pages = $document->getPages();
        // Extract the text page by page
        foreach ($pages as $page) {
            echo $page->getText();
        }
    }
}
?>

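In this project, the parse() method of the class above is invoked by a request from the front end. Because of network latency and similar issues, the call is made asynchronously with Ajax so that the UI stays responsive.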

// JS file: parse() is called when the button on the page is clicked
var xmlHttp = null;

function parse() {
    //alert("start");
    var path = document.getElementById("pdffile").value; // get the file path

    // Request URL; the path is encoded because it is itself a URL
    var url = "http://***.***.***.***/***/***/PdfParse/parse?path=" + encodeURIComponent(path);

    request(url, function (result) {
        // Callback
        //alert(result);
        document.getElementsByName("context")[0].value = result;
    });
}

function request(url, onsuccess) {
    // Get an XMLHttpRequest object for the asynchronous request
    if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else {
        alert("Browser does not support HTTP Request");
        return; // nothing to send the request with
    }

    xmlHttp.onreadystatechange = function () {
        if (xmlHttp.readyState == 4) {
            if (xmlHttp.status == 200) {
                // The request completed successfully
                onsuccess(xmlHttp.responseText);
            }
        }
    };
    xmlHttp.open("GET", url, true);
    xmlHttp.send();
}
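Because the PDF address travels inside the query string of the request, its reserved characters have to be percent-encoded, or an address with its own `?` or `&` would be cut short on the server side. The following is only a sketch of that step: `buildRequestUrl` is a hypothetical helper and the `example.com` address is made up for illustration.

```javascript
// Hypothetical helper: build the request URL with the PDF address encoded.
function buildRequestUrl(endpoint, pdfPath) {
    // encodeURIComponent escapes ':', '/', '?', '&', '=' and '#', so the whole
    // address arrives on the server as a single "path" value.
    return endpoint + "?path=" + encodeURIComponent(pdfPath);
}

// Made-up address that carries its own query string
var exampleUrl = buildRequestUrl("/PdfParse/parse", "http://example.com/a.pdf?id=1&v=2");
console.log(exampleUrl);
// /PdfParse/parse?path=http%3A%2F%2Fexample.com%2Fa.pdf%3Fid%3D1%26v%3D2
```

PHP decodes the value again when it fills `$_GET['path']`, so the controller does not need to change.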

<!-- Page markup -->
<body>
    <tr>
        <td>Parse document:</td>
        <td>
            <select id="pathtype" name="pathtype" style="width:60px;">
                <option value="url">URL</option>
            </select>
            <input type="text" id="pdffile" name="pdffile" style="width:500px">
        </td>
        <td colspan="10" >
            <input type="button" class="input_button" name="parse" value="Parse" onclick="parse()" />
        </td>
    </tr>
</body>

Test URL: http://www.cffex.com.cn/tzgg/jysgg/201512/W020151204630497494614.pdf


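Pros: it can parse a PDF file directly from a web address, with no need to download it first;

Cons: some of the extracted text comes out garbled; image extraction is not supported.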

XPDF

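Preparation:

Download the packages from the official site listed above: xpdfbin-linux-3.04.tar.gz and xpdf-chinese-simplified.tar.gz;

Install xpdf 3.04 into the target directory (/usr/local in this case):

tar zxvf xpdfbin-linux-3.04.tar.gz -C /usr/local # extract into the install directory

cd /usr/local/xpdfbin-linux-3.04 # enter the extracted folder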

cat INSTALL

cd bin32/

cp ./* /usr/local/bin/

cd ../doc/

mkdir -p /usr/local/man/man1

mkdir -p /usr/local/man/man5

cp *.1 /usr/local/man/man1

cp *.5 /usr/local/man/man5

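At this point the extraction tools are installed and can be called from the shell to process English documents. Supporting other languages requires installing the matching language support package. The steps below install the Simplified Chinese package: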

cp sample-xpdfrc /usr/local/etc/xpdfrc

tar zxvf xpdf-chinese-simplified.tar.gz -C /usr/local

cd /usr/local/xpdf-chinese-simplified

mkdir -p /usr/local/share/xpdf/chinese-simplified

cp -r Adobe-GB1.cidToUnicode ISO-2022-CN.unicodeMap EUC-CN.unicodeMap GBK.unicodeMap CMap /usr/local/share/xpdf/chinese-simplified/

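Copy the contents of the add-to-xpdfrc file from the extracted chinese-simplified folder into /usr/local/etc/xpdfrc.

Calling the tool from the shell (W020151204630497494614.pdf has already been downloaded into the current directory):

pdftotext W020151204630497494614.pdf # language package not used, output is garbled

pdftotext -layout -enc GBK W020151204630497494614.pdf # no garbled text

Experiment:

Write a class that calls the command above. The code is as follows: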

<?php
namespace Admin\Controller;
use Think\Controller;

class PdfParseController extends Controller {
    function pdftotxt() {
        // Read the file URL from the request parameter
        $path = $_GET['path'];
        // Download the file
        $file_name = $this->download($path);
        // Extract the text; escapeshellarg() keeps the file name from being
        // interpreted by the shell, and the trailing "-" writes to stdout
        $content = shell_exec('/usr/local/bin/pdftotext -layout -enc GBK ' . escapeshellarg($file_name) . ' -');
        // Convert the text from GBK to UTF-8
        $content = mb_convert_encoding($content, 'UTF-8', 'GBK');
        // Delete the downloaded file
        unlink($file_name);
        echo $content;
    }

    // Download the file and return its local name
    function download($file_url) {
        // Check that the parameter is set and not empty
        if (!isset($file_url) || trim($file_url) == '') {
            return '500';
        }

        // File-name part of the URL, including the extension
        $file_name = basename($file_url);

        $content = file_get_contents($file_url);
        file_put_contents($file_name, $content);

        return $file_name;
    }
}
?>

As before, the pdftotxt() method of the class above is called through an asynchronous request from the front end.

var xmlHttp = null;

function pdftotxt() {
    var path = document.getElementById("pdffile").value; // get the file path

    // Request URL; the path is encoded because it is itself a URL
    var url = "http://***.***.***.***/***/***/PdfParse/pdftotxt?path=" + encodeURIComponent(path);

    request(url, function (result) {
        // Callback
        //alert(result);
        document.getElementsByName("context")[0].value = result;
    });
}

function request(url, onsuccess) {
    // Get an XMLHttpRequest object for the asynchronous request
    if (window.XMLHttpRequest) {
        xmlHttp = new XMLHttpRequest();
    } else if (window.ActiveXObject) {
        xmlHttp = new ActiveXObject("Microsoft.XMLHTTP");
    } else {
        alert("Browser does not support HTTP Request");
        return; // nothing to send the request with
    }

    xmlHttp.onreadystatechange = function () {
        if (xmlHttp.readyState == 4) {
            if (xmlHttp.status == 200) {
                // The request completed successfully
                onsuccess(xmlHttp.responseText);
            }
        }
    };
    xmlHttp.open("GET", url, true);
    xmlHttp.send();
}
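The request() helper keeps its XMLHttpRequest object in the shared global xmlHttp and stays silent when the server answers with anything other than 200, so clicking both buttons in quick succession, or hitting a failed download, is hard to diagnose. One possible variant is sketched below. The name `requestText`, the `onerror` callback and the injectable `XhrCtor` parameter are all assumptions added for illustration, not part of the original project, and the ActiveXObject fallback is left out for brevity.

```javascript
// Hypothetical variant of request(): the XHR object is local to each call and
// failures are reported through an optional onerror callback.
// XhrCtor can be injected so the helper can be exercised outside a browser.
function requestText(url, onsuccess, onerror, XhrCtor) {
    var Ctor = XhrCtor || (typeof XMLHttpRequest !== "undefined" ? XMLHttpRequest : null);
    if (!Ctor) {
        if (onerror) onerror(new Error("XMLHttpRequest is not available"));
        return;
    }
    var xhr = new Ctor();
    xhr.onreadystatechange = function () {
        if (xhr.readyState !== 4) return; // 4 means the request is done
        if (xhr.status === 200) {
            onsuccess(xhr.responseText);
        } else if (onerror) {
            onerror(new Error("HTTP status " + xhr.status));
        }
    };
    xhr.open("GET", url, true);
    xhr.send();
}
```

With a helper along these lines, parse() and pdftotxt() could each pass their own success and error callbacks without touching shared state.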

<body>

    <tr>
        <td>Parse document:</td>
        <td>
            <select id="pathtype" name="pathtype" style="width:60px;">
                <option value="url">URL</option>
            </select>
            <input type="text" id="pdffile" name="pdffile" style="width:500px">
        </td>
        <td colspan="10" >
            <input type="button" class="input_button" name="parse" value="Parse" onclick="parse()" />
        </td>
        <td colspan="10" >
            <input type="button" class="input_button" name="exchange" value="Convert" onclick="pdftotxt()" />
        </td>
    </tr>
</body>

Test URL: http://www.cffex.com.cn/tzgg/jysgg/201512/W020151204630497494614.pdf
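Since whatever is typed into the pdffile field is fetched by the server as-is, a quick sanity check in the browser before sending the request can save a round trip. The sketch below assumes a browser that provides the URL constructor. `isHttpPdfUrl` is a hypothetical helper, not part of the original project, and it does not replace validation on the server.

```javascript
// Hypothetical helper: accept only absolute http(s) addresses whose path ends in ".pdf".
function isHttpPdfUrl(value) {
    var parsed;
    try {
        parsed = new URL(String(value).trim());
    } catch (e) {
        return false; // not an absolute URL
    }
    var isHttp = parsed.protocol === "http:" || parsed.protocol === "https:";
    return isHttp && /\.pdf$/i.test(parsed.pathname);
}

// The test address used in this article passes the check
console.log(isHttpPdfUrl("http://www.cffex.com.cn/tzgg/jysgg/201512/W020151204630497494614.pdf"));
```

Calling it at the top of parse() or pdftotxt() and returning early on false would be one way to wire it in.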
