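Step-by-Step Tutorial: Extracting Tables and Text from PDFs (with Code)
- October 6, 2019
- Notes
About pdfplumber
pdfplumber works best on machine-generated PDFs rather than scanned ones. It is built on pdfminer and pdfminer.six.
Supported versions (at the time of writing): Python 2.7, 3.1, 3.4, 3.5, and 3.6.
Installing pdfplumber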
pip install pdfplumber
To use pdfplumber's visual debugging tools, you also need ImageMagick installed on your computer (https://imagemagick.org/index.php). Installation instructions are available here:
http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-debian
The specific parameters, the extraction workflow, and the visualization are shown through the examples below. For more detail, download the package linked at the end of the article and explore it yourself.
Example 1
import pdfplumber

pdf = pdfplumber.open("../pdfs/ca-warn-report.pdf")
p0 = pdf.pages[0]
im = p0.to_image()
im  # in a notebook, this displays the rendered page
Use .extract_table to get the data:
table = p0.extract_table()
table[:3]
Use pandas to turn the list into a DataFrame, and strip the stray spaces that appear inside some of the dates.
import pandas as pd

df = pd.DataFrame(table[1:], columns=table[0])
for column in ["Effective", "Received"]:
    df[column] = df[column].str.replace(" ", "")
All done!
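The same cleanup can also be sketched without pandas. The snippet below is a minimal illustration, assuming the table is a list of rows whose first row is the header; the function name rows_to_records and the sample rows are invented for the demo and are not taken from the report.

```python
def rows_to_records(table, strip_columns=()):
    """Turn a list-of-lists table (header first) into a list of dicts.

    Spaces are removed from the columns named in strip_columns, mirroring
    the .str.replace(" ", "") step applied to the DataFrame.
    """
    header, *body = table
    records = []
    for row in body:
        record = dict(zip(header, row))
        for column in strip_columns:
            value = record.get(column)
            if value is not None:  # skip cells that have no text
                record[column] = value.replace(" ", "")
        records.append(record)
    return records


# Hypothetical sample data, shaped like a list of rows with a header row
sample_table = [
    ["Company", "Effective", "Received"],
    ["Acme Corp", "06/30 /2015", "04/29/ 2015"],
    ["Example Inc", None, "05/01/2015"],
]

records = rows_to_records(sample_table, strip_columns=["Effective", "Received"])
print(records[0])
```

A list of dicts like this can be written out with csv.DictWriter or passed to pd.DataFrame later if pandas is available.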
How does this actually work?
In pdfplumber's visual debugging output, red lines mark the lines pdfplumber found on the page, blue circles mark the intersections of those lines, and light-blue shading marks the cells derived from those intersections.
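To make the chain from lines to intersections to cells concrete, here is a simplified sketch. It is not pdfplumber's implementation: it assumes perfectly axis-aligned lines that span the whole grid and are described only by their x or y positions (hypothetical inputs), whereas pdfplumber works with real line segments and tolerances.

```python
from itertools import product


def grid_intersections(xs, ys):
    """Every crossing point of vertical lines at xs and horizontal lines at ys."""
    return [(x, y) for y, x in product(sorted(ys), sorted(xs))]


def grid_cells(xs, ys):
    """Cells as (x0, top, x1, bottom) boxes between neighbouring lines."""
    xs, ys = sorted(xs), sorted(ys)
    return [
        (x0, top, x1, bottom)
        for top, bottom in zip(ys, ys[1:])
        for x0, x1 in zip(xs, xs[1:])
    ]


# Hypothetical positions: three vertical and three horizontal lines
vertical_xs = [0, 100, 250]
horizontal_ys = [0, 20, 40]

points = grid_intersections(vertical_xs, horizontal_ys)
cells = grid_cells(vertical_xs, horizontal_ys)
print(len(points), "intersections ->", len(cells), "cells")
print(cells[0])
```

Each cell is bounded by four neighbouring intersection points, which is why a missing line (and therefore missing intersections) causes cells to merge or disappear.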
Example 2: Extracting graphical data from a PDF
import pdfplumber

report = pdfplumber.open("../pdfs/ag-energy-round-up-2017-02-24.pdf").pages[0]
im = report.to_image()
im
Page objects have a .curves attribute, which holds a list of the curve objects found on the page. This report contains 12 curves, four per chart:
len(report.curves)
# 12
report.curves[0]
Pass them to .draw_lines to see where the curves are:
im.draw_lines(report.curves, stroke="red", stroke_width=2)
Cycling through a four-color palette makes the individual curves easier to tell apart:
im.reset()
colors = ["gray", "red", "blue", "green"]
for i, curve in enumerate(report.curves):
    # pick the next color, wrapping around after the fourth curve
    stroke = colors[i % len(colors)]
    im.draw_circles(curve["points"], radius=3, stroke=stroke, fill="white")
    im.draw_line(curve["points"], stroke=stroke, stroke_width=2)
im
Example 3
import pdfplumber

pdf = pdfplumber.open("../pdfs/background-checks.pdf")
p0 = pdf.pages[0]
im = p0.to_image()
im
Use PageImage.debug_tablefinder() to inspect the table:
im.reset().debug_tablefinder()
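The default settings correctly identify the table's vertical boundaries, but they miss the horizontal boundaries between each group of five states/territories. So:
Use custom .extract_table settings:
- Because the columns are separated by ruled lines, we use vertical_strategy="lines"
- Because the rows are mostly separated by gutters between the text, we use horizontal_strategy="text"
- Because the left and right ends of the text are not quite flush with the vertical lines, we use intersection_x_tolerance: 15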
table_settings = {
    "vertical_strategy": "lines",
    "horizontal_strategy": "text",
    "intersection_x_tolerance": 15,
}
im.reset().debug_tablefinder(table_settings)
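The idea behind horizontal_strategy="text" can be illustrated with a simplified sketch: treat the top edge of each line of words as an implied row boundary. The function below is an illustration under stated assumptions, not pdfplumber's code. The word dictionaries only mimic the "top" key that word objects carry, and implied_row_edges, tolerance, and min_words are hypothetical names.

```python
def implied_row_edges(words, tolerance=3, min_words=2):
    """Group words whose "top" is within tolerance of the previous word's "top"
    and return one y position (the smallest top) per sufficiently large group."""
    groups = []
    for word in sorted(words, key=lambda w: w["top"]):
        if groups and word["top"] - groups[-1][-1]["top"] <= tolerance:
            groups[-1].append(word)
        else:
            groups.append([word])
    # A group with too few words is treated as a stray label, not a table row
    return [group[0]["top"] for group in groups if len(group) >= min_words]


# Hypothetical words from three text rows plus one stray label
words = [
    {"text": "Alabama", "top": 100.0}, {"text": "1,234", "top": 100.4},
    {"text": "Alaska", "top": 112.1}, {"text": "56", "top": 112.0},
    {"text": "Arizona", "top": 124.2}, {"text": "789", "top": 124.2},
    {"text": "note", "top": 150.0},
]

print(implied_row_edges(words))
```

Combining implied horizontal edges like these with the explicit vertical lines gives the table finder enough intersections to build one cell per state per column.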
table = p0.extract_table(table_settings)
for row in table[:5]:
    print(row)
Clean up the data (drop the header rows, footer rows, and so on):
core_table = table[3:3+56]
" • ".join(core_table[0])
" • ".join(core_table[-1])
COLUMNS = [
    "state",
    "permit",
    "handgun",
    "long_gun",
    "other",
    "multiple",
    "admin",
    "prepawn_handgun",
    "prepawn_long_gun",
    "prepawn_other",
    "redemption_handgun",
    "redemption_long_gun",
    "redemption_other",
    "returned_handgun",
    "returned_long_gun",
    "returned_other",
    "rentals_handgun",
    "rentals_long_gun",
    "private_sale_handgun",
    "private_sale_long_gun",
    "private_sale_other",
    "return_to_seller_handgun",
    "return_to_seller_long_gun",
    "return_to_seller_other",
    "totals",
]
from collections import OrderedDict

def parse_value(i, x):
    # the first column is the state name; keep it as text
    if i == 0:
        return x
    # empty cells become None
    if x == "":
        return None
    # everything else is a count such as "1,234"
    return int(x.replace(",", ""))

def parse_row(row):
    return OrderedDict((COLUMNS[i], parse_value(i, cell)) for i, cell in enumerate(row))

data = [parse_row(row) for row in core_table]

Here is the first row, parsed:
data[0]
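To see what parse_value and parse_row do without the PDF at hand, here is a self-contained sketch of the same logic applied to a made-up row. The three-column DEMO_COLUMNS list and the sample values are hypothetical. A plain dict is used instead of OrderedDict, which behaves equivalently on Python 3.7+ where dicts preserve insertion order.

```python
DEMO_COLUMNS = ["state", "permit", "handgun"]  # hypothetical, shortened column list


def parse_value(i, x):
    """Keep the first column (the state name) as text; parse the rest as integers."""
    if i == 0:
        return x
    if x == "":
        return None  # empty cell: no value reported
    return int(x.replace(",", ""))  # "1,234" -> 1234


def parse_row(row, columns=DEMO_COLUMNS):
    return {columns[i]: parse_value(i, cell) for i, cell in enumerate(row)}


sample_row = ["Alabama", "12,345", ""]  # made-up values
parsed_row = parse_row(sample_row)
print(parsed_row)
```

Removing the thousands separator before calling int() matters: int("12,345") raises ValueError, so the replace step is what makes the numeric columns parseable.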
Example 4
import pdfplumber
import re
from collections import OrderedDict

pdf = pdfplumber.open("../pdfs/san-jose-pd-firearm-sample.pdf")
p0 = pdf.pages[0]
im = p0.to_image()
im
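We draw a rectangle around every char object that pdfplumber detects. Doing so shows that each line in the body of the report has the same width, and that every field is padded with space (" ") characters. This means we can parse these lines the same way we would parse a standard fixed-width data file.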
im.reset().draw_rects(p0.chars)
Use the page's .extract_text(…) method to grab the text of every character on the page, line by line:
text = p0.extract_text()
print(text)
Clean up the data (drop the header, footer, and so on):
# capture everything between the dashed rule under the column header ending in LOCATION and the "Flags = e" footer
core_pat = re.compile(r"LOCATION[\-\s]+(.*)\n\s+Flags = e", re.DOTALL)
core = re.search(core_pat, text).group(1)
print(core)
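The effect of this pattern is easier to see on a small synthetic page. The sketch below uses made-up text (not the real report) that only imitates the layout the pattern relies on: a column header ending in LOCATION, a dashed rule, the body lines, and a footer that starts with indented "Flags = e".

```python
import re

# Made-up page text that imitates the report layout
sample_text = (
    "FIREARM REPORT\n"
    "TYPE   MAKE   LOCATION\n"
    "------ ------ --------\n"
    "PISTOL ACME   LOCKER 1\n"
    "RIFLE  ACME   LOCKER 2\n"
    "   Flags = e: evidence\n"
    "Page 1\n"
)

# [\-\s]+ skips the dashed rule and line breaks after the header;
# re.DOTALL lets (.*) span several lines up to the footer
core_pat = re.compile(r"LOCATION[\-\s]+(.*)\n\s+Flags = e", re.DOTALL)

match = core_pat.search(sample_text)
if match is None:
    raise ValueError("report body not found; the layout may differ")
core = match.group(1)
print(core)
```

Checking for None before calling .group(1) is a small addition over the original one-liner; it gives a clearer error when a page does not follow the expected layout.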
In this report, each firearm takes up two lines. The code below splits the core table into two-line groups, then parses out the fields based on the number of characters in each field:
lines = core.split("\n")
# pair each even-indexed line with the line that follows it
line_groups = list(zip(lines[::2], lines[1::2]))
print(line_groups[0])
def parse_row(first_line, second_line):
    # slice each fixed-width field by its character offsets, then trim the padding
    return OrderedDict([
        ("type", first_line[:20].strip()),
        ("item", first_line[21:41].strip()),
        ("make", first_line[44:89].strip()),
        ("model", first_line[90:105].strip()),
        ("calibre", first_line[106:111].strip()),
        ("status", first_line[112:120].strip()),
        ("flags", first_line[124:129].strip()),
        ("serial_number", second_line[0:13].strip()),
        ("report_tag_number", second_line[21:41].strip()),
        ("case_file_number", second_line[44:64].strip()),
        ("storage_location", second_line[68:91].strip())
    ])

parsed = [parse_row(first_line, second_line) for first_line, second_line in line_groups]
parsed[:2]
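The pairing-and-slicing approach can be tried on synthetic data. The sketch below uses hypothetical field offsets and made-up records (not the real report layout) to show how lines[::2] and lines[1::2] pair up consecutive lines and how fixed character offsets recover each field.

```python
# Hypothetical layout: (field name, start, end) character offsets per line
FIRST_LINE_FIELDS = [("type", 0, 10), ("make", 10, 20)]
SECOND_LINE_FIELDS = [("serial_number", 0, 10), ("storage_location", 10, 20)]


def slice_fields(line, fields):
    """Cut a fixed-width line into stripped fields by character offsets."""
    return {name: line[start:end].strip() for name, start, end in fields}


def parse_record(first_line, second_line):
    record = slice_fields(first_line, FIRST_LINE_FIELDS)
    record.update(slice_fields(second_line, SECOND_LINE_FIELDS))
    return record


# Made-up records, two lines each, every field padded with spaces to width 10
lines = [
    "PISTOL".ljust(10) + "ACME".ljust(10),
    "SN0001".ljust(10) + "LOCKER 1".ljust(10),
    "RIFLE".ljust(10) + "EXAMPLE".ljust(10),
    "SN0002".ljust(10) + "LOCKER 2".ljust(10),
]

# Even-indexed lines are first lines, odd-indexed lines are second lines
line_groups = list(zip(lines[::2], lines[1::2]))
records = [parse_record(first, second) for first, second in line_groups]
print(records)
```

Keeping the offsets in a table such as FIRST_LINE_FIELDS, rather than inline in the function, makes it easier to adjust the layout when a report's column widths change. Note that zip silently drops a trailing unpaired line, so an odd number of lines is worth checking for.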
Display the result as a DataFrame:
import pandas as pd

columns = list(parsed[0].keys())
pd.DataFrame(parsed)[columns]