    行无际: PDF-to-HTML tool, wrapping the pdf2htmlEX command-line tool with Spring Boot

    Author: 行无际  Date: 2021-05-04 18:20

    Convert PDF to HTML without losing text or format.

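    This project uses Spring Boot to wrap the pdf2htmlEX command-line tool as a web service, which makes converting PDF to HTML more convenient.

    For details on the pdf2htmlEX command-line tool, see:
    https://github.com/pdf2htmlEX/pdf2htmlEX

    The pdf2html-service source code is at: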
    https://github.com/iflyendless/pdf2html-service

    Quick start

    # Pull the image
    docker pull iflyendless/pdf2html-service:1.0.1
    
    # Start the container
    docker run --name pdf2html -p 8686:8686 -d --rm iflyendless/pdf2html-service:1.0.1
    

    Usage:

    curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/example.pdf'
    

    Note: /pdfs/example.pdf is the absolute path of the PDF file.

    Unzip html.zip in the current directory to see the converted HTML files along with 000-task.txt.
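
    000-task.txt is a short summary of the conversion. Judging from the recordTaskResult method shown later in this post, it holds total:, success: and failed: counters followed by a failed-pdfs: section listing the file names. A minimal Java sketch for reading it back, where TaskSummary is a hypothetical helper and not part of pdf2html-service:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of pdf2html-service: parses the lines of 000-task.txt.
class TaskSummary {
    int total;
    int success;
    int failed;
    final List<String> failedPdfs = new ArrayList<>();

    static TaskSummary parse(List<String> lines) {
        TaskSummary summary = new TaskSummary();
        boolean inFailedList = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // blank separator lines carry no data
            }
            if (inFailedList) {
                summary.failedPdfs.add(line); // everything after "failed-pdfs:" is a file name
            } else if (line.equals("failed-pdfs:")) {
                inFailedList = true;
            } else if (line.startsWith("total:")) {
                summary.total = Integer.parseInt(line.substring("total:".length()));
            } else if (line.startsWith("success:")) {
                summary.success = Integer.parseInt(line.substring("success:".length()));
            } else if (line.startsWith("failed:")) {
                summary.failed = Integer.parseInt(line.substring("failed:".length()));
            }
        }
        return summary;
    }
}
```

    Feeding it the lines read with java.nio.file.Files.readAllLines gives the counters and the list of failed file names in one object.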

    Building the image

    # Download the code
    git clone https://github.com/iflyendless/pdf2html-service.git
    
    # Enter the project
    cd pdf2html-service
    
    # Package, skipping unit tests
    mvn clean package -DskipTests
    
    # build docker image
    docker build -t pdf2html-service:1.0.1 .
    

    If the image build fails, check whether the JDK version available at https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/ matches the version downloaded in the Dockerfile.

    Starting the container

    docker run --name pdf2html -p 8686:8686 -d --rm pdf2html-service:1.0.1
    

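    To set additional parameters, pass them with -e when starting the container:

    # Maximum number of child processes running at the same time; set it according to available system resources (default 15)
    -e PDF2HTML_MAX_PROCESS=15
    
    # Maximum timeout for running the /usr/local/bin/pdf2htmlEX command; the s suffix means seconds (default 600s)
    -e PDF2HTML_COMMAND_TIMEOUT=600s
    

    For example: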

    docker run --name pdf2html -p 8686:8686 -e PDF2HTML_MAX_PROCESS=10 -e PDF2HTML_COMMAND_TIMEOUT=60s -d --rm pdf2html-service:1.0.1
    

    For more settings, see the application.yml file in the resources directory.

    HTTP API

    (1) Check the version

    curl http://localhost:8686/api/version
    

    (2) View the configuration

    curl http://localhost:8686/api/config
    

    (3) Upload multiple PDFs and download the HTML zip archive

    curl -o html.zip --request POST 'localhost:8686/api/pdf2html' --form 'files=@/pdfs/001.pdf' --form 'files=@/pdfs/002.pdf' --form 'files=@/pdfs/003.pdf'
    

    Note: /pdfs/001.pdf is the absolute path of the PDF file.

    (4) Query the metrics exposed by the program

    curl http://localhost:8686/api/metric
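
    Judging from the controller code later in this post, the response contains the semaphore's availablePermits (free conversion slots) and queueLength (an estimate of the number of threads waiting for a slot). A small JDK-only sketch of how those two numbers behave; SemaphoreMetricDemo and its snapshot method are made up for illustration:

```java
import java.util.concurrent.Semaphore;

// Illustration only: shows what the two semaphore figures mean.
class SemaphoreMetricDemo {

    // Returns {availablePermits, queueLength} while `busy` of `maxProcess` permits are held.
    static int[] snapshot(int maxProcess, int busy) throws InterruptedException {
        Semaphore semaphore = new Semaphore(maxProcess);
        semaphore.acquire(busy); // pretend `busy` conversions are running
        try {
            return new int[] {semaphore.availablePermits(), semaphore.getQueueLength()};
        } finally {
            semaphore.release(busy);
        }
    }
}
```

    With the default of 15 permits and 4 conversions in flight, availablePermits is 11. A queueLength above 0 suggests that requests are waiting because every slot is taken.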
    

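    Troubleshooting

    # Enter the container
    docker exec -it pdf2html bash
    
    # Go to the log directory
    cd /opt/pdf2html-service/logs
    
    # Go to the directory holding PDFs that failed to convert
    cd /tmp/pdf2html-service/failed-pdfs
    
    # Run pdf2htmlEX by hand to convert a PDF (see --help for the options)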
    pdf2htmlEX --help
    
    

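    Implementation

    Calling the pdf2htmlEX command-line tool by hand every time is inconvenient; wrapping it in a web service makes it easier to use. The full source code is at:
    https://github.com/iflyendless/pdf2html-service

    Approach

    The pdf2htmlEX command-line tool has fairly complex dependencies and is troublesome to compile, so the JDK can be installed directly on top of the official Docker image. A web application written quickly with Spring Boot then accepts HTTP requests, calls the pdf2htmlEX command-line tool in the background to convert each PDF to HTML, compresses the generated HTML into a zip archive, and lets the user download it.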
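
    The last step, packing the generated files into a zip, can be sketched with the JDK alone. The project itself uses Hutool's ZipUtil for this; HtmlZipper below is a hypothetical stand-in that only looks at files directly under the directory and puts them at the root of the archive, which is a simplification:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical JDK-only stand-in for the zip step.
class HtmlZipper {

    // Writes every .html and .txt file directly under dir into a zip on out
    // and returns the entry names. Closing the zip stream also closes out.
    static List<String> zipHtml(Path dir, OutputStream out) throws IOException {
        List<Path> files;
        try (Stream<Path> stream = Files.list(dir)) {
            files = stream
                    .filter(Files::isRegularFile)
                    .filter(p -> {
                        String name = p.getFileName().toString().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".txt");
                    })
                    .sorted()
                    .collect(Collectors.toList());
        }
        List<String> names = new ArrayList<>();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path file : files) {
                String name = file.getFileName().toString();
                zip.putNextEntry(new ZipEntry(name));
                Files.copy(file, zip); // stream the file content into the current entry
                zip.closeEntry();
                names.add(name);
            }
        }
        return names;
    }
}
```

    Uploaded PDFs sit in the same directory as the generated HTML, so filtering by extension keeps them out of the download.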

    The Dockerfile is as follows:

    # pdf2htmlex image
    FROM pdf2htmlex/pdf2htmlex:0.18.8.rc1-master-20200630-Ubuntu-bionic-x86_64
    
    ENV TZ='CST-8'
    ENV LANG C.UTF-8
    
    # apt
    RUN sed -i s@/archive.ubuntu.com/@/mirrors.aliyun.com/@g /etc/apt/sources.list
    RUN apt-get clean && apt-get update
    RUN apt-get install -y vim curl htop net-tools
    
    # vim
    RUN echo "set fileencodings=utf-8,ucs-bom,gb18030,gbk,gb2312,cp936" >> /etc/vim/vimrc
    RUN echo "set termencoding=utf-8" >> /etc/vim/vimrc
    RUN echo "set encoding=utf-8" >> /etc/vim/vimrc
    
    # jdk
    ADD https://enos.itcollege.ee/~jpoial/allalaadimised/jdk8/jdk-8u291-linux-x64.tar.gz /tmp/
    RUN tar -zxf /tmp/jdk-*.tar.gz -C /opt/ && rm -f /tmp/jdk-*.tar.gz && mv /opt/jdk* /opt/jdk
    
    ENV JAVA_HOME /opt/jdk
    ENV PATH ${JAVA_HOME}/bin:$PATH
    
    # pdf2html-service
    COPY target/pdf2html-service-*.tar.gz /tmp/
    RUN tar -zxf /tmp/pdf2html-service-*.tar.gz -C /opt/ && rm -f /tmp/pdf2html-service-*.tar.gz
    
    ENTRYPOINT [""]
    WORKDIR /opt/pdf2html-service
    CMD ["bash","-c","./start.sh && tail -f /dev/null"]
    

    Dependencies

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <java.version>1.8</java.version>
        <maven.build.timestamp.format>yyyyMMdd</maven.build.timestamp.format>
        <hutool.version>5.6.3</hutool.version>
    </properties>
    
    <dependencies>
    
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
    
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
        </dependency>
    
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-exec</artifactId>
            <version>1.3</version>
        </dependency>
    
        <dependency>
            <groupId>cn.hutool</groupId>
            <artifactId>hutool-all</artifactId>
            <version>${hutool.version}</version>
        </dependency>
    
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
        </dependency>
    
    </dependencies>
    

    This is a Spring Boot application:

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.boot.context.properties.ConfigurationPropertiesScan;
    
    @SpringBootApplication
    @ConfigurationPropertiesScan
    public class Pdf2HtmlService {
    
        public static void main(String[] args) {
            SpringApplication.run(Pdf2HtmlService.class, args);
        }
    }
    

    Configuration

    application.yml looks roughly like this:

    server:
      port: ${APP_PORT:8686}
      servlet.context-path: /
    
    pdf2html:
      # /usr/local/bin/pdf2htmlEX --zoom 1.3
      command: ${PDF2HTML_COMMAND:/usr/local/bin/pdf2htmlEX --zoom 1.3 --quiet 1}
      command-timeout: ${PDF2HTML_COMMAND_TIMEOUT:600s}
      work-dir: ${PDF2HTML_WORK_DIR:/tmp/pdf2html-service}
      max-process: ${PDF2HTML_MAX_PROCESS:15}
    
    spring:
      application:
        name: pdf2html-service
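
    Each ${NAME:default} placeholder above takes the value of the environment variable NAME when it is set (for example through docker run -e) and falls back to the text after the colon otherwise. The sketch below only mimics that lookup against a plain map. It is not Spring's resolver, which also consults other property sources, and PlaceholderSketch is a made-up name:

```java
import java.util.Map;

// Illustration of the ${NAME:default} lookup, not Spring's implementation.
class PlaceholderSketch {

    // Returns env's value for NAME if present, otherwise the default after the first colon.
    static String resolve(String placeholder, Map<String, String> env) {
        if (!placeholder.startsWith("${") || !placeholder.endsWith("}")) {
            return placeholder; // not a placeholder, use the text literally
        }
        String body = placeholder.substring(2, placeholder.length() - 1);
        int colon = body.indexOf(':');
        String name = colon < 0 ? body : body.substring(0, colon);
        String fallback = colon < 0 ? null : body.substring(colon + 1);
        String value = env.get(name);
        return value != null ? value : fallback;
    }
}
```

    Passing System.getenv() as the map reproduces the container case: with -e PDF2HTML_MAX_PROCESS=10 the lookup yields 10, and without it the default 15.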
    

    The corresponding Pdf2HtmlProperties class:

    import lombok.Data;
    import org.springframework.boot.context.properties.ConfigurationProperties;
    
    import java.time.Duration;
    
    @Data
    @ConfigurationProperties(prefix = "pdf2html")
    public class Pdf2HtmlProperties {
    
        private String command;
    
        private String workDir;
    
        private Duration commandTimeout;
    
        // Maximum number of child processes running at the same time; set it according to system performance
        private int maxProcess;
    }
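
    The commandTimeout field is a java.time.Duration, and the configured value is a string such as 600s that Spring Boot converts during binding. To make the idea concrete, here is a deliberately small parser for values of that shape. SimpleDuration is hypothetical, handles only the ms, s, m and h suffixes, and is not the converter Spring Boot uses:

```java
import java.time.Duration;

// Hypothetical mini-parser for values such as "600s"; not Spring Boot's converter.
class SimpleDuration {

    static Duration parse(String text) {
        String value = text.trim().toLowerCase();
        if (value.endsWith("ms")) { // check "ms" first, since it also ends with "s"
            return Duration.ofMillis(number(value, 2));
        }
        if (value.endsWith("s")) {
            return Duration.ofSeconds(number(value, 1));
        }
        if (value.endsWith("m")) {
            return Duration.ofMinutes(number(value, 1));
        }
        if (value.endsWith("h")) {
            return Duration.ofHours(number(value, 1));
        }
        throw new IllegalArgumentException("Unsupported duration: " + text);
    }

    // Parses the numeric part that precedes a unit suffix of the given length.
    private static long number(String value, int suffixLength) {
        return Long.parseLong(value.substring(0, value.length() - suffixLength).trim());
    }
}
```

    The resulting Duration can be turned into the millisecond value that a timeout API typically expects with toMillis().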
    

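    A brief explanation of these settings:

    • command: the exact command used to invoke the pdf2htmlEX command-line tool; for the available options see pdf2htmlEX --help.
    • command-timeout: the command is invoked asynchronously through the Apache commons-exec library, which supports a maximum timeout. For details on using commons-exec, see: https://commons.apache.org/proper/commons-exec/tutorial.html
    • work-dir: the working directory of the web application. When a request arrives, the PDF files are first written to a subdirectory of this directory, the HTML that pdf2htmlEX generates also lands there by default, and the generated HTML files are then zipped and written to the response. Note also that PDFs that fail to convert are copied to failed-pdfs under work-dir, which makes problems easier to reproduce and investigate.
    • max-process: because this implementation invokes the command-line tool fully asynchronously, the number of commands running at once must be capped. Otherwise a burst of requests can spawn a large number of child processes in a short time, which not only hurts performance badly but may also freeze the system. This setting caps the number of child processes running at the same time and should be set according to system performance. The cap is enforced with the JDK's java.util.concurrent.Semaphore.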
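
    The max-process idea can be tried in isolation. In the sketch below a thread pool stands in for the child processes, a Semaphore caps how many tasks are in flight, and a CountDownLatch waits for all of them, mirroring the pattern the controller uses. BoundedLauncher and the 20 ms sleep are assumptions made for the demonstration:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Demonstration of capping asynchronous work with a Semaphore; not project code.
class BoundedLauncher {

    // Runs `tasks` simulated conversions with at most `maxProcess` in flight
    // and returns the highest concurrency that was observed.
    static int run(int tasks, int maxProcess) throws InterruptedException {
        Semaphore semaphore = new Semaphore(maxProcess);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        ExecutorService pool = Executors.newCachedThreadPool(); // stands in for child processes
        try {
            for (int i = 0; i < tasks; i++) {
                semaphore.acquire(); // blocks the submitter once maxProcess tasks are in flight
                pool.execute(() -> {
                    try {
                        int now = running.incrementAndGet();
                        peak.accumulateAndGet(now, Math::max);
                        Thread.sleep(20); // pretend to convert a PDF
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        running.decrementAndGet();
                        semaphore.release(); // free the slot, as the result handler callbacks do
                        latch.countDown();
                    }
                });
            }
            latch.await(); // wait until every task has finished
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }
}
```

    Calling BoundedLauncher.run(10, 3) never reports a peak above 3, because a task only starts after its permit has been acquired and gives the permit back only after it has stopped counting as running.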

    API implementation

    The API implementation is not complicated, and the key parts are commented. It is as follows:

    import cn.hutool.core.collection.CollUtil;
    import cn.hutool.core.io.FileTypeUtil;
    import cn.hutool.core.io.FileUtil;
    import cn.hutool.core.util.ArrayUtil;
    import cn.hutool.core.util.CharsetUtil;
    import cn.hutool.core.util.IdUtil;
    import cn.hutool.core.util.ZipUtil;
    import com.github.iflyendless.config.Pdf2HtmlProperties;
    import lombok.Getter;
    import lombok.extern.slf4j.Slf4j;
    import org.apache.commons.exec.*;
    import org.springframework.web.bind.annotation.*;
    import org.springframework.web.multipart.MultipartFile;
    
    import javax.annotation.PostConstruct;
    import javax.annotation.Resource;
    import javax.servlet.ServletOutputStream;
    import javax.servlet.http.HttpServletResponse;
    import java.io.File;
    import java.io.FileFilter;
    import java.net.URLEncoder;
    import java.util.*;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.Semaphore;
    
    @Slf4j
    @RestController
    @RequestMapping("/api")
    public class Pdf2HtmlController {
    
        private static final String PDF = "pdf";
        private static final String FAILED_PDF_DIR = "failed-pdfs";
        private static final String TASK_FILE = "000-task.txt";
    
        @Resource
        private Pdf2HtmlProperties pdf2HtmlProperties;
    
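        // Limits the number of pdf2htmlEX child processes running at the same time
        private static Semaphore semaphore;
    
        // PDFs that fail to convert are written to this directory so the cause can be investigated by hand later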
        private static File failedPdfDir;
    
        @PostConstruct
        public void init() {
            semaphore = new Semaphore(pdf2HtmlProperties.getMaxProcess());
            failedPdfDir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), FAILED_PDF_DIR));
        }
    
        @GetMapping("/version")
        public Object version() {
            return "1.0.1";
        }
    
        @GetMapping("/config")
        public Object config() {
            return pdf2HtmlProperties;
        }
    
        @GetMapping("/metric")
        public Object metric() {
            Map<String, Object> semaphoreMap = new LinkedHashMap<>();
            semaphoreMap.put("availablePermits", semaphore.availablePermits());
            semaphoreMap.put("queueLength", semaphore.getQueueLength());
    
            Map<String, Object> metricMap = new LinkedHashMap<>();
            metricMap.put("semaphore", semaphoreMap);
    
            return metricMap;
        }
    
        @PostMapping("/pdf2html")
        public void pdf2html(@RequestParam("files") MultipartFile[] files,
                             HttpServletResponse response) {
            if (ArrayUtil.isEmpty(files)) {
                log.warn("No files were uploaded");
                return;
            }
    
            File dir = FileUtil.mkdir(FileUtil.file(pdf2HtmlProperties.getWorkDir(), IdUtil.simpleUUID()));
    
            try (ServletOutputStream outputStream = response.getOutputStream()) {
                List<File> fileList = new ArrayList<>(files.length);
                for (MultipartFile f : files) {
                    if (f == null || f.isEmpty()) {
                        continue;
                    }
                    // Write to the local working directory
                    File localFile = FileUtil.writeFromStream(f.getInputStream(), FileUtil.file(dir, f.getOriginalFilename()));
                    // Only process PDF files
                    if (isPdf(localFile)) {
                        fileList.add(localFile);
                    }
                }
    
                if (CollUtil.isEmpty(fileList)) {
                    return;
                }
    
                long start = System.currentTimeMillis();
    
                int size = fileList.size();
                CountDownLatch latch = new CountDownLatch(size);
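                // Collects the PDFs that failed to convert
                Map<String, Throwable> failedMap = new ConcurrentHashMap<>();
    
                for (File file : fileList) {
                    // Limit the number of child processes here:
                    // the call below is asynchronous, so this prevents a burst of child processes
                    semaphore.acquire();
                    // Invoke the pdf2htmlEX command-line tool asynchronously
                    invokeCommand(dir, file, latch, failedMap);
                }
    
                // Wait for all child processes to finish
                latch.await();
    
                log.info("pdf2html took {}ms in total, number of PDFs: {}", System.currentTimeMillis() - start, size);
    
                // Write the statistics to 000-task.txt and copy the PDFs that failed to convert to a fixed directory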
                recordTaskResult(size, failedMap, dir, fileList);
    
                // Set the response headers before writing the body: headers added after
                // the response has been committed are ignored
                response.addHeader("Content-Disposition",
                        "attachment;fileName=" + URLEncoder.encode(dir.getName() + ".zip", "UTF-8"));
                response.addHeader("Content-type", "application/zip");
    
                // Zip the generated HTML files together with 000-task.txt and write the archive to the response
                ZipUtil.zip(outputStream, CharsetUtil.CHARSET_UTF_8, true, new FileFilter() {
                    @Override
                    public boolean accept(File pathname) {
                        if (pathname.isDirectory()) {
                            return true;
                        }
                        String name = pathname.getName().toLowerCase();
                        return name.endsWith(".html") || name.endsWith(".txt");
                    }
                }, dir);
            } catch (Throwable e) {
                log.error("pdf2html error", e);
            } finally {
                FileUtil.del(dir);
            }
        }
    
        /**
         * Runs the pdf2htmlEX command-line tool using Apache commons-exec
         * For details see: https://commons.apache.org/proper/commons-exec/tutorial.html
         */
        public void invokeCommand(File workDir, File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
            String filePath = file.getAbsolutePath();
    
            String line = String.format("%s --dest-dir %s %s", pdf2HtmlProperties.getCommand(), workDir.getAbsolutePath(), filePath);
            CommandLine commandLine = CommandLine.parse(line);
    
            // Timeout handling for the command (the watchdog expects milliseconds)
            ExecuteWatchdog watchdog = new ExecuteWatchdog(pdf2HtmlProperties.getCommandTimeout().toMillis());
            // Callback invoked when the command finishes
            ResultHandler resultHandler = new ResultHandler(file, latch, failedMap);
    
            Executor executor = new DefaultExecutor();
            executor.setExitValue(0);
            executor.setWatchdog(watchdog);
    
            try {
                executor.execute(commandLine, resultHandler);
            } catch (Throwable e) {
                semaphore.release();
                String fileName = file.getName();
                if (!failedMap.containsKey(fileName)) {
                    failedMap.put(fileName, e);
                }
                latch.countDown();
    
                log.error("invokeCommand failed, command: {}", line, e);
            }
        }
    
        public static boolean isPdf(File file) {
            try {
                return PDF.equalsIgnoreCase(FileTypeUtil.getType(file));
            } catch (Exception e) {
                log.error("Failed to detect the file type, file: {}", file.getAbsolutePath(), e);
                return false;
            }
        }
    
        public static void recordTaskResult(int total, Map<String, Throwable> failedMap, File workDir, List<File> pdfs) {
            List<String> list = new ArrayList<>();
            list.add("total:" + total);
            list.add("success:" + (total - failedMap.size()));
            list.add("failed:" + failedMap.size());
    
            list.add("");
            list.add("failed-pdfs:");
            list.add("");
    
            Set<String> failedNames = failedMap.keySet();
            list.addAll(failedNames);
    
            // Record a rough summary of the task
            FileUtil.writeLines(list, FileUtil.file(workDir, TASK_FILE), CharsetUtil.CHARSET_UTF_8);
    
            // Copy the PDFs that failed to convert to another directory; they may need further handling later
            if (CollUtil.isNotEmpty(failedNames)) {
                for (File pdf : pdfs) {
                    String name = pdf.getName();
                    if (failedNames.contains(name)) {
                        File dest = FileUtil.file(failedPdfDir, name);
                        if (dest.exists()) {
                            dest = FileUtil.file(failedPdfDir, IdUtil.simpleUUID() + "-" + name);
                        }
                        FileUtil.copyFile(pdf, dest);
                    }
                }
            }
        }
    
        /**
         * Implement this according to the actual business logic; here it just logs the error
         */
        public static class ResultHandler implements ExecuteResultHandler {
    
            private final File file;
            private final CountDownLatch latch;
            private final Map<String, Throwable> failedMap;
    
            @Getter
            private int exitValue = -8686;
    
            public ResultHandler(File file, CountDownLatch latch, Map<String, Throwable> failedMap) {
                this.file = file;
                this.latch = latch;
                this.failedMap = failedMap;
            }
    
            @Override
            public void onProcessComplete(int exitValue) {
                semaphore.release();
                this.latch.countDown();
    
                this.exitValue = exitValue;
            }
    
            @Override
            public void onProcessFailed(ExecuteException e) {
                semaphore.release();
                this.failedMap.put(this.file.getName(), e);
                this.latch.countDown();
    
                log.error("pdf2html failed, file: {}", this.file.getAbsolutePath(), e);
            }
        }
    }
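
    The watchdog timeout used in invokeCommand can also be approximated with the JDK alone, which may help when experimenting outside the service. The sketch below is not how the project does it (the project relies on commons-exec); TimedCommand is a hypothetical helper built on ProcessBuilder and Process.waitFor, and it runs synchronously rather than through a callback:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

// Hypothetical JDK-only counterpart to the commons-exec watchdog; blocking, not asynchronous.
class TimedCommand {

    // Runs the command and returns its exit code, or an empty result if it
    // had to be killed because it exceeded the timeout.
    static OptionalInt run(List<String> command, long timeoutMillis)
            throws IOException, InterruptedException {
        boolean windows = System.getProperty("os.name").startsWith("Windows");
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true);
        // Discard output so the child cannot block on a full pipe (works on Java 8).
        builder.redirectOutput(new File(windows ? "NUL" : "/dev/null"));
        Process process = builder.start();
        if (!process.waitFor(timeoutMillis, TimeUnit.MILLISECONDS)) {
            process.destroyForcibly(); // timed out: kill the child
            process.waitFor();         // and wait until it is really gone
            return OptionalInt.empty();
        }
        return OptionalInt.of(process.exitValue());
    }
}
```

    An empty result corresponds to the timeout case that the service records as a failed PDF. A present value other than 0 corresponds to a command that finished but reported an error.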
    

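    Closing notes

    I am not very familiar with front-end development, so I did not spend time building even a simple page. If you know front-end development and are interested in this tool, it would be great if you put a page together. Also, if you know of a better tool or implementation for converting PDF to HTML, feel free to leave a comment.

    Jotted down in passing, for the convenience of everyone.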
