Converting PDF to Word completely free? How did I not know about this?

October 6, 2019
Notes

Reading time: about three minutes

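First, let's look at a PDF file.

We pick the abstract of one of the papers in it.

After converting it with our Python code:

Pretty neat, isn't it?

Most online PDF-to-Word converters charge a fee, usually per page. With the Python code below, which extracts the text of a PDF and writes it into a .docx file, we can do the conversion completely free. Let's see how it works!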

First, here are the modules we need to install:

attrs==17.4.0
lxml==4.1.1
pdfminer3k==1.3.1
pluggy==0.6.0
ply==3.11
py==1.5.2
pytest==3.4.1
python-docx==0.8.6
six==1.11.0

They can all be installed with the pip package manager.

Install each module the same way.

Alternatively, put the modules into a requirements.txt file and run

pip install -r requirements.txt

to install them all at once.

Now we can start coding!

First, import the modules we need:

import os
from io import StringIO
from io import open
from concurrent.futures import ProcessPoolExecutor

from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import process_pdf
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from docx import Document

Then define the folder the PDF files are read from and the folder the Word files are written to.

pdf_folder = r'/Users/wuyuqing/Desktop/Code/pdf2word/pdf'
word_folder = r'/Users/wuyuqing/Desktop/Code/pdf2word/word'
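The script assumes both folders already exist; if the output folder is missing, saving the .docx file will fail. A minimal sketch of a check that could run before the conversion, assuming a hypothetical helper named prepare_folders (not part of the original script):

```python
import os

def prepare_folders(pdf_folder, word_folder):
    """Check the input folder and create the output folder if needed.

    Hypothetical helper for illustration; the original script has no such step.
    """
    if not os.path.isdir(pdf_folder):
        raise FileNotFoundError('PDF folder not found: ' + pdf_folder)
    # exist_ok=True makes this a no-op when the folder is already there.
    os.makedirs(word_folder, exist_ok=True)
    return word_folder
```

Calling prepare_folders(pdf_folder, word_folder) once at startup would be enough, since every conversion writes into the same output folder.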

Next, we define the functions we will use:

def read_from_pdf(file_path):
    with open(file_path, 'rb') as file:
        resource_manager = PDFResourceManager()
        return_str = StringIO()
        lap_params = LAParams()

        device = TextConverter(
            resource_manager,
            return_str,
            laparams=lap_params)
        process_pdf(resource_manager, device, file)
        device.close()

        content = return_str.getvalue()
        return_str.close()
        return content

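We open the file in binary mode and read its content. Most of the work is done by the process_pdf function. Here is how the library implements it (this code ships with the API and is shown only for reference; you do not need to write it yourself):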

def process_pdf(rsrcmgr, device, fp, pagenos=None, maxpages=0, password='',
                caching=True, check_extractable=True):
    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)
    # Create a PDF document object that stores the document structure.
    doc = PDFDocument(caching=caching)
    # Connect the parser and document objects.
    parser.set_document(doc)
    doc.set_parser(parser)
    # Supply the document password for initialization.
    # (If no password is set, give an empty string.)
    doc.initialize(password)
    # Check if the document allows text extraction. If not, abort.
    if check_extractable and not doc.is_extractable:
        raise PDFTextExtractionNotAllowed(
            'Text extraction is not allowed: %r' % fp)
    # Create a PDF interpreter object.
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    # Process each page contained in the document.
    for (pageno, page) in enumerate(doc.get_pages()):
        if pagenos and (pageno not in pagenos):
            continue
        interpreter.process_page(page)
        if maxpages and maxpages <= pageno + 1:
            break
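The final loop in process_pdf decides which pages get processed. To make the effect of pagenos and maxpages easier to see, here is a small standalone sketch that mirrors only that loop on plain page indices. The function name pages_processed is made up for illustration and is not part of pdfminer:

```python
def pages_processed(total_pages, pagenos=None, maxpages=0):
    """Return the 0-based page indices the loop in process_pdf would handle.

    Illustration only: it copies the skip/stop conditions of that loop
    and does not touch any PDF.
    """
    processed = []
    for pageno in range(total_pages):
        # Skip pages that are not in the requested set (if one was given).
        if pagenos and pageno not in pagenos:
            continue
        processed.append(pageno)
        # Stop once the page index reaches the maxpages limit.
        if maxpages and maxpages <= pageno + 1:
            break
    return processed

print(pages_processed(4))                                  # [0, 1, 2, 3]
print(pages_processed(10, maxpages=2))                     # [0, 1]
print(pages_processed(10, pagenos={0, 2, 5}))              # [0, 2, 5]
print(pages_processed(10, pagenos={0, 2, 5}, maxpages=3))  # [0, 2]
```

Note from the last call that maxpages is compared against the page index, not against the number of pages already processed, so combining it with pagenos can stop earlier than one might expect.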

Next, we save the extracted text as a docx document:

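import unicodedata

# Assumed helper: the original listing calls remove_control_characters but
# never defines it. This version drops every character whose Unicode
# category starts with 'C' (control, format, and similar).
def remove_control_characters(text):
    return ''.join(ch for ch in text if unicodedata.category(ch)[0] != 'C')

def save_text_to_word(content, file_path):
    doc = Document()
    # One paragraph per line of extracted text.
    for line in content.split('\n'):
        paragraph = doc.add_paragraph()
        paragraph.add_run(remove_control_characters(line))
    doc.save(file_path)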

# Wrap the two functions together
def pdf_to_word(pdf_file_path, word_file_path):
    content = read_from_pdf(pdf_file_path)
    save_text_to_word(content, word_file_path)

That completes the main functionality.

Now let's call the function that reads each PDF and generates a docx:

tasks = []
with ProcessPoolExecutor(max_workers=5) as executor:
    for file in os.listdir(pdf_folder):
        # Only handle files with a .pdf extension.
        extension_name = os.path.splitext(file)[1]
        if extension_name != '.pdf':
            continue
        file_name = os.path.splitext(file)[0]
        pdf_file = pdf_folder + '/' + file
        word_file = word_folder + '/' + file_name + '.docx'
        print('Processing: ', file)
        result = executor.submit(pdf_to_word, pdf_file, word_file)
        tasks.append(result)

# Wait until every submitted task has finished, then exit.
while True:
    exit_flag = True
    for task in tasks:
        if not task.done():
            exit_flag = False
    if exit_flag:
        print('Done')
        exit(0)

That is all it takes to generate the docx files. Simple, isn't it?
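One thing the loop above does not do is look at the result of each task, so a PDF that fails to convert goes unnoticed. A possible alternative, sketched here under the assumption of a hypothetical convert_all helper (not from the original script), uses as_completed to collect the errors:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def convert_all(jobs, convert, executor_cls=ProcessPoolExecutor, max_workers=5):
    """Run convert(src, dst) for every (src, dst) pair in jobs.

    Hypothetical helper for illustration. Returns a dict that maps each
    failed source path to the exception raised while converting it.
    """
    failures = {}
    with executor_cls(max_workers=max_workers) as executor:
        # Remember which source file each future belongs to.
        futures = {executor.submit(convert, src, dst): src for src, dst in jobs}
        for future in as_completed(futures):
            src = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
            except Exception as exc:
                failures[src] = exc
    return failures
```

With the functions from this article it could be called as convert_all(jobs, pdf_to_word), where jobs is a list of (pdf_file, word_file) pairs. On platforms that start worker processes by spawning a fresh interpreter, that call should sit under an if __name__ == '__main__': guard.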
