
How to Extract Tables from PDF Files with Python


admin

August 4, 2025, 18:34

Required library: pdfplumber

Many libraries can process PDF documents, for example PyMuPDF, PyPDF2, pdfplumber, and pikepdf, and each has its own strengths and uses. For simple extraction of text, images, and tables, I chose pdfplumber. The link below gives an overview of the library:

https://pypi.org/project/pdfplumber/

Before writing any code, install the library with the following command:

pip install pdfplumber

Code for extracting table text

import pdfplumber


def extract_tables(file_path):
    """Print every table row found in the PDF at file_path."""
    try:
        with pdfplumber.open(file_path) as pdf:
            for page in pdf.pages:
                # Each table is a list of rows; each row is a list of cells.
                tables = page.extract_tables()
                if tables is not None and len(tables)>0:
                    for table in tables:
                        if table is not None and len(table)>0:
                            for row in table:
                                # print(row)
                                print(' '.join(map(str,row)))
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' does not exist.")
    except Exception as e:
        print(f"Error: An error occurred while processing the file: {e}")

Code walkthrough

with pdfplumber.open(file_path) as pdf:

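The with statement manages a resource, here the opened PDF file. It ensures the file is closed properly whether the code succeeds or fails. pdfplumber.open(file_path) opens the PDF at the given path, and the resulting object is bound to the variable pdf.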
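
To see why the with statement matters, here is a minimal sketch that uses a hypothetical stand-in class, FakePDF (it is not part of pdfplumber), which records whether it has been closed. It only illustrates that the cleanup step runs even when the body of the with block raises an error.

```python
class FakePDF:
    """Hypothetical stand-in for an opened PDF; not part of pdfplumber."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and also when an exception leaves the block.
        self.closed = True
        return False  # do not swallow the exception


def use_pdf(fail):
    """Enter a with block, optionally raise inside it, report closed state."""
    pdf = FakePDF()
    try:
        with pdf:
            if fail:
                raise ValueError("simulated processing error")
    except ValueError:
        pass
    return pdf.closed


print(use_pdf(fail=False))
print(use_pdf(fail=True))
```

In both calls the stand-in ends up closed, which is the guarantee the article relies on for the real PDF file.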

for page in pdf.pages:

This for loop iterates over every page of the PDF. pdf.pages is a list of all the page objects in the document, and the loop binds each one in turn to the variable page.

tables = page.extract_tables()

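Note that extract_tables() here is not the function we defined ourselves. It is pdfplumber's own extract_tables() method, which automatically detects and extracts all the tables on the current page. The result is a list in which each element represents one table, and that list is assigned to the variable tables.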
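
The nesting is easier to picture with data. The sample below is hand-written for illustration, not real pdfplumber output, and assumes a page with two small tables. The helper name describe is hypothetical. A cell can come back as None, so one is included.

```python
# Hand-written sample in the shape returned by page.extract_tables():
# a list of tables, each table a list of rows, each row a list of cells.
sample_tables = [
    [["Name", "Qty"], ["Apple", "3"], ["Pear", None]],
    [["Total"], ["3"]],
]


def describe(tables):
    """Return (row_count, column_count) for every table in the list."""
    return [(len(table), len(table[0])) for table in tables]


print(describe(sample_tables))
```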

if tables is not None and len(tables)>0:

This condition checks that tables is not None and that the list contains at least one element. Only when both hold, meaning at least one table was extracted, does the block below run.
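
Because None and an empty list are both falsy in Python, the same check could also be written as `if tables:`. The following sketch, with hypothetical helper names, compares the two forms on a few hand-written inputs.

```python
def has_tables_explicit(tables):
    """The check as written in the article."""
    return tables is not None and len(tables) > 0


def has_tables_short(tables):
    """The shorter truthiness-based equivalent."""
    return bool(tables)


# None, an empty list, and a list holding one single-row table.
candidates = [None, [], [[["a", "b"]]]]
results = [(has_tables_explicit(c), has_tables_short(c)) for c in candidates]
print(results)
```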

for table in tables:

This for loop iterates over each table in the tables list extracted in the previous step. On each iteration, table refers to the table currently being processed.

if table is not None and len(table)>0:

This condition again makes sure that the current table is not empty and contains at least one row.

for row in table:

table is itself a list. This innermost for loop iterates over each row of the table and binds it to the variable row.

print(' '.join(map(str,row)))

Here I use join() and map() to turn each row of the table into a single line of text.

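map(str, row): converts every element of the row list to a string.
' '.join(…): a string method that uses a single space as the separator and joins all the strings produced by map() into one string. In other words, the cells of each row are joined with spaces into a single string, which is then printed.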
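
One consequence of map(str, row) is that a None cell is printed as the literal text "None". The sketch below shows this on a hand-written row, next to a hypothetical helper, row_to_line, that renders None as empty text instead. The helper is an illustration, not part of the original code.

```python
def row_to_line(row, sep=" "):
    """Join a row's cells with sep, rendering None cells as empty text."""
    return sep.join("" if cell is None else str(cell) for cell in row)


row = ["Pear", None, 2.5]           # hand-written sample row

original_line = " ".join(map(str, row))   # the article's approach
none_aware_line = row_to_line(row)        # None-aware variant

print(repr(original_line))
print(repr(none_aware_line))
```

Which form is preferable depends on whether the literal word "None" is acceptable in the printed output.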

Code for extracting tables and saving them as CSV

import csv

import pdfplumber


def extract_tables2csv(file_path):
    """Save each table in the PDF to its own CSV file and print its rows."""
    try:
        with pdfplumber.open(file_path) as pdf:
            for i,page in enumerate(pdf.pages):
                tables = page.extract_tables()
                if tables is not None and len(tables)>0:
                    for j,table in enumerate(tables):
                        if table is not None and len(table)>0:
                            # File name: 1-based page number, 0-based table index.
                            csv_filename = f'table_{i+1}_{j}.csv'
                            with open(csv_filename, 'w', newline='') as csvfile:
                                writer = csv.writer(csvfile)
                                writer.writerows(table)
                            print(f'Page {i+1}:\n')
                            for row in table:
                                print(row)
    except FileNotFoundError:
        print(f"Error: The file '{file_path}' does not exist.")
    except Exception as e:
        print(f"Error: An error occurred while processing the file: {e}")

Code walkthrough

for i,page in enumerate(pdf.pages):

This for loop iterates over every page of the PDF. enumerate() yields the page index i (starting from 0) together with the page object page.
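
Since the index starts at 0, the code later adds 1 to obtain a human-friendly page number. As a small aside, enumerate() also accepts a start argument that produces the same numbering directly. The list of strings below is a stand-in for real page objects.

```python
pages = ["page-a", "page-b", "page-c"]  # stand-ins for page objects

# Zero-based index plus one, as in the article's code.
zero_based = [(i + 1, page) for i, page in enumerate(pages)]

# The same numbering using enumerate's start argument.
one_based = list(enumerate(pages, start=1))

print(one_based)
print(zero_based == one_based)
```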

for j,table in enumerate(tables):

This nested for loop iterates over each table in the tables list returned by page.extract_tables(). enumerate() yields the table index j (starting from 0) together with the table object table.

csv_filename = f'table_{i+1}_{j}.csv'

The i and j obtained above are combined into the name of the CSV file to be saved. For example, the first table on the first page is named table_1_0.csv.
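
To make the naming scheme concrete, the sketch below lists the file names it would produce for a hypothetical document with two tables on page 1, none on page 2, and one on page 3. The helper name csv_filename_for and the table counts are assumptions made for illustration.

```python
def csv_filename_for(page_index, table_index):
    """Mirror the article's naming: 1-based page, 0-based table index."""
    return f"table_{page_index + 1}_{table_index}.csv"


# Hypothetical number of tables found on each page of a 3-page PDF.
tables_per_page = [2, 0, 1]

names = [
    csv_filename_for(i, j)
    for i, count in enumerate(tables_per_page)
    for j in range(count)
]
print(names)
```

Note that the page number in the name is 1-based while the table number is 0-based, exactly as in the original code.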

with open(csv_filename, 'w', newline='') as csvfile:

This with statement opens a file for writing. open(csv_filename, 'w', newline='') opens the file name defined above in write mode 'w'. The newline='' argument prevents extra blank lines from appearing in the written CSV file. The opened file object is bound to the variable csvfile.
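
The writing step can be tried without a PDF. The sketch below writes a hand-written sample table to a temporary directory and reads it back. The helper name write_table and the explicit encoding='utf-8' are assumptions added here; the original code relies on the platform's default encoding.

```python
import csv
import os
import tempfile


def write_table(path, table):
    """Write a table (a list of rows) to path as CSV."""
    # newline='' lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as csvfile:
        csv.writer(csvfile).writerows(table)


sample_table = [["Name", "Qty"], ["Apple", "3"], ["Pear", None]]

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "table_1_0.csv")
    write_table(path, sample_table)
    with open(path, newline="", encoding="utf-8") as csvfile:
        rows_read = list(csv.reader(csvfile))

print(rows_read)
```

The None cell is written as an empty field, so it reads back as an empty string.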