Oracle Data Pump export with the PARALLEL parameter: is a single table really exported in parallel?

Oracle Data Pump (expdp/impdp) is a server-side logical export/import tool, commonly used by DBAs for data migration: from database A to database B, from schema A to schema B, and so on.

So here is the question:

On Oracle 11.2.0.4, when a single table is exported with the PARALLEL parameter, is the export genuinely parallel? With PARALLEL=2, are two processes really spawned, each exporting half of the data?

 

1. Test: export two different tables with PARALLEL=2

$ expdp scott/tiger directory=dump dumpfile=D%U.dmp parallel=2 cluster=n tables=dept,emp
Total estimation using BLOCKS method: 128 KB
. . exported "SCOTT"."DEPT"                              5.929 KB       4 rows
. . exported "SCOTT"."EMP"                               8.562 KB      14 rows
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
  /home/oracle/D01.dmp
  /home/oracle/D02.dmp
  
One of the two files is large and the other small; they are nowhere near evenly split. And since dump files are binary, there is no way to see what they actually contain, so inspecting the dump files directly is a dead end.

2. Reading the MOS note

Parallel Capabilities of Oracle Data Pump (Doc ID 365459.1)    
  
For every export operation, Data Pump estimates how much disk space each table data object in the export job will consume (in bytes).
This is done whether or not the user uses the ESTIMATE parameter. The estimate is printed in the log file and displayed on the client's
standard output device. The estimate is for table row data only; it does not include metadata. This estimate is used to determine how many
PX processes will be applied to the table data object, if any. The columns of the tables are examined to determine if direct path,
external tables, or both methods can be used. For direct path, the parallel number for the table data object is always one since direct
path does not support parallel unload of a table data object. PX processes are only used with external tables. If the external tables
method is chosen, Data Pump will determine the maximum number of PX processes that can work on a table data object. It does this by
dividing the estimated size of the table data object by 250 MB and rounding the result down. If the result is zero or one, then PX
processes are not used to unload the table. If a job is not big enough to make use of the maximum parallel number, then the user will
not see the maximum number of active workers and Parallel Execution Processes. For example, if there is one 800 MB table, and it has
been determined that external tables will be used, there will be one worker for the metadata, one worker for the data, and three PX
processes. As mentioned above, the worker process for the data acts as the coordinator for the PX processes and does not count toward
the parallel total. So, if a user specifies PARALLEL = 10, the degree of parallelism is actually four. The user will only see one active
worker in the STATUS display. Data Pump is working optimally; the job is too small for the specified degree of parallelism.

In short: Data Pump estimates each table's export size (row data only, no metadata) and uses that estimate to decide how many PX processes, if any, to apply. There are two unload methods, direct path and external tables; direct path never unloads a table in parallel (its degree is always 1), and PX processes are used only with the external tables method. With external tables, the maximum number of PX processes is the estimated size divided by 250 MB, rounded down; a result of zero or one means no PX processes are used at all.

At first read this passage is a bit confusing, so let's verify it with some tests.
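Before testing, the 250 MB rule from the note can be written down as a quick sketch (Python is used here purely for illustration; the constant and the rounding behavior are taken verbatim from the MOS text):

```python
MB = 1024 * 1024

def max_px_processes(estimated_bytes):
    # MOS 365459.1: divide the estimated table size by 250 MB and round
    # down; a result of zero or one means no PX processes are used at all.
    n = estimated_bytes // (250 * MB)
    return 0 if n <= 1 else n

print(max_px_processes(800 * MB))   # 800 / 250 rounds down to 3 PX processes
print(max_px_processes(400 * MB))   # 400 / 250 rounds down to 1, so no PX at all
```

This matches the note's own example: an 800 MB table unloaded via external tables gets three PX processes.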

3. Simulate an 800 MB non-partitioned table and export it with PARALLEL=10. What happens?

First, a side question: is the export size estimated from optimizer statistics, or from DBA_SEGMENTS?

SQL> select sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='A';
SUM(BYTES)/1024/1024
--------------------
                 824

SQL> select owner,table_name,NUM_ROWS,BLOCKS*8/1024,SAMPLE_SIZE from dba_tables where owner='SCOTT' and table_name='A';

OWNER  TABLE_NAME  NUM_ROWS  BLOCKS*8/1024  SAMPLE_SIZE
-----  ----------  --------  -------------  -----------
SCOTT  A

(NUM_ROWS and SAMPLE_SIZE are empty: no optimizer statistics have been gathered on the table.)

Export:

$ expdp scott/tiger directory=dump dumpfile=D%U.dmp parallel=10 tables=a
Estimate in progress using BLOCKS method...
Processing object type TABLE_EXPORT/TABLE/TABLE_DATA
Total estimation using BLOCKS method: 824 MB
Processing object type TABLE_EXPORT/TABLE/TABLE
. . exported "SCOTT"."A"                                 708.3 MB 7315680 rows
******************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
  /u01/dump/D01.dmp
  /u01/dump/D02.dmp
  /u01/dump/D03.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:03:58 2020 elapsed 0 00:00:11
$ ls -lrt D*.dmp
-rw------- 1 oracle oinstall 245936128 Aug  5 06:03 D03.dmp
-rw------- 1 oracle oinstall 248098816 Aug  5 06:03 D02.dmp
-rw------- 1 oracle oinstall 248860672 Aug  5 06:03 D01.dmp


$expdp \'/ as sysdba\' attach=SYS_EXPORT_TABLE_01 
Job: SYS_EXPORT_TABLE_01
  Owner: SCOTT                          
  Operation: EXPORT                         
  Creator Privs: TRUE                           
  GUID: AC156DF4AC940F1DE053453CA8C0F1FA
  Start Time: Wednesday, 05 August, 2020 6:03:48
  Mode: TABLE                          
  Instance: tt
  Max Parallelism: 10
  EXPORT Job Parameters:
  Parameter Name      Parameter Value:
     CLIENT_COMMAND        scott/******** directory=dump dumpfile=D%U.dmp parallel=10 tables=a 
  State: EXECUTING                      
  Bytes Processed: 0
  Current Parallelism: 10
  Job Error Count: 0
  Dump File: /u01/dump/D01.dmp
    bytes written: 4,096
  Dump File: /u01/dump/D%u.dmp
  
Worker 1 Status:
  Process Name: DW00
  State: EXECUTING                      
  Object Schema: SCOTT
  Object Name: A
  Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 1
  Total Objects: 1
  Worker Parallelism: 1
  
Worker 2 Status:
  Process Name: DW01
  State: EXECUTING                      
  Object Schema: SCOTT
  Object Name: A
  Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 1
  Total Objects: 1
  Worker Parallelism: 3

Self-answer: Data Pump calls this value an estimate, but in fact it takes the number straight from DBA_SEGMENTS.
Total estimation using BLOCKS method: 824 MB

Combined with the last paragraph of the MOS note, this now makes sense. PARALLEL=10 was requested, but effectively only two worker processes ran: one worker exports the metadata and, once that is finished, helps write a little data into the dump files;
the other worker is in charge of the data and, per the algorithm, gets 3 PX helper processes to speed up the unload (824 / 250 = 3, rounded down).
This also confirms that for a non-partitioned table, a parallel export does not actually fan out across extra workers.
Which raises the next question: for a single-table export, does the PARALLEL parameter actually make it faster, or not?
With PARALLEL=10, the job took 11 seconds.
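The effective degree of parallelism for this 824 MB table can be worked out the same way (a small sketch following the MOS description: one metadata worker plus the PX helpers, while the data worker only coordinates and is not counted):

```python
MB = 1024 * 1024

def effective_degree(estimated_bytes, requested_parallel):
    """Effective degree per MOS 365459.1: one metadata worker plus the PX
    processes; the data worker coordinates the PX slaves and is not counted."""
    px = estimated_bytes // (250 * MB)   # divide by 250 MB, round down
    px = 0 if px <= 1 else px            # 0 or 1 means no PX at all
    return min(requested_parallel, 1 + px)

print(effective_degree(824 * MB, 10))   # requested 10, effective degree is 4
```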

$ expdp scott/tiger directory=dump dumpfile=D1%U.dmp tables=a

******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D101.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:12:40 2020 elapsed 0 00:00:15

15 s versus 11 s: not much of a gap. So for a single non-partitioned table, PARALLEL does not meaningfully speed up the export.

4. Simulate a partitioned table with 800 MB of data: how does the export time change with PARALLEL?

create table RANGE_PART_TAB(id number,
    deal_date date, area_code number, contents varchar2(4000))
 partition by range(deal_date)
 (
 partition p1 values less  than(to_date('2020-02-01','yyyy-mm-dd')),
 partition p2 values less  than(to_date('2020-03-01','yyyy-mm-dd')),
 partition p3 values less  than(to_date('2020-04-01','yyyy-mm-dd')),
 partition p4 values less  than(to_date('2020-05-01','yyyy-mm-dd')),
 partition p5 values less  than(to_date('2020-06-01','yyyy-mm-dd')),
 partition p6 values less  than(to_date('2020-07-01','yyyy-mm-dd')),
 partition p7 values less  than(to_date('2020-08-01','yyyy-mm-dd')),
 partition p8 values less  than(to_date('2020-09-01','yyyy-mm-dd')),
 partition p9 values less  than(to_date('2020-10-01','yyyy-mm-dd')),
 partition p10 values less than(to_date('2020-11-01','yyyy-mm-dd')));
 
 insert into range_part_tab (id,deal_date,area_code,contents)
 select rownum,
          to_date(to_char(to_date('20200101','yyyymmdd'),'J')+ trunc(dbms_random.value(0,300)),'J'),
ceil(dbms_random.value(590,599)),
 rpad('*',400,'*')
 from dual
 connect by rownum <= 100000;
Rerun the self-insert several times to grow the table:

SQL> r
  1* insert into range_part_tab select * from range_part_tab
800000 rows created.
SQL> commit;
 SQL> select sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB';
SUM(BYTES)/1024/1024
--------------------
                 792
SQL> select PARTITION_NAME,sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB' group by PARTITION_NAME order by 1;

PARTITION_NAME  SUM(BYTES)/1024/1024
--------------  --------------------
P1                                80
P10                               72
P2                                80
P3                                80
P4                                80
P5                                80
P6                                80
P7                                80
P8                                80
P9                                80

10 rows selected.

[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D2%U.dmp tables=RANGE_PART_TAB parallel=10

Worker 1 Status:
  Process Name: DW00
  State: EXECUTING
  Object Schema: SCOTT
  Object Name: RANGE_PART_TAB
  Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
  Completed Objects: 10
  Total Objects: 10
  Completed Rows: 138,592
  Completed Bytes: 58,754,176
  Percent Done: 77
  Worker Parallelism: 1

Worker 2 Status:
  Process Name: DW01
  State: WORK WAITING

******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
  /u01/dump/D201.dmp
  /u01/dump/D202.dmp
  /u01/dump/D203.dmp
  /u01/dump/D204.dmp
  /u01/dump/D205.dmp
  /u01/dump/D206.dmp
  /u01/dump/D207.dmp
  /u01/dump/D208.dmp
  /u01/dump/D209.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:26:04 2020 elapsed 0 00:00:22

[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D3%U.dmp tables=RANGE_PART_TAB
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:28:14 2020 elapsed 0 00:00:20

The test was a letdown: far from ten workers each taking one partition, a single worker exported the partitions one after another, serially. So much for that naive assumption.

Could this be the 250 MB per-segment threshold again?

Grow each partition of the test table several times over.

Sure enough, it is the 250 MB threshold: once each partition segment exceeds 250 MB, parallelism truly kicks in, and the export runs with 10 active worker processes.
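Based only on what this experiment shows (this is an inference from the observed behavior, not a documented formula), the number of concurrently active workers for a partitioned table might be modeled like this:

```python
MB = 1024 * 1024

def active_workers(partition_bytes, requested_parallel):
    """Hypothetical model inferred from this test: partitions whose segment
    exceeds ~250 MB can each be picked up by a worker of their own, while
    smaller partitions are exported serially by a single worker."""
    big = sum(1 for b in partition_bytes if b > 250 * MB)
    return max(1, min(requested_parallel, big))

# First attempt: ten ~80 MB partitions, one worker does everything serially
print(active_workers([80 * MB] * 10, 10))    # -> 1
# After growing: ten ~300 MB partitions, ten workers run concurrently
print(active_workers([300 * MB] * 10, 10))   # -> 10
```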

SQL> select PARTITION_NAME,sum(bytes)/1024/1024 from dba_segments where owner='SCOTT' and segment_name='RANGE_PART_TAB' group by PARTITION_NAME;

PARTITION_NAME  SUM(BYTES)/1024/1024
--------------  --------------------
P7 309
P4 304
P1 312
P2 288
P3 304
P6 296
P10 264
P9 300
P8 312
P5 312

10 rows selected.
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D4%U.dmp tables=RANGE_PART_TAB parallel=10
Worker 1 Status:
Process Name: DW00
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 2 Status:
Process Name: DW01
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 3 Status:
Process Name: DW02
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 4 Status:
Process Name: DW03
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 5 Status:
Process Name: DW04
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 6 Status:
Process Name: DW05
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 7 Status:
Process Name: DW06
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 8 Status:
Process Name: DW07
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 9 Status:
Process Name: DW08
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1

Worker 10 Status:
Process Name: DW09
State: EXECUTING
Object Schema: SCOTT
Object Name: RANGE_PART_TAB
Object Type: TABLE_EXPORT/TABLE/TABLE_DATA
Completed Objects: 1
Total Objects: 10
Worker Parallelism: 1
******************************************************************************
Dump file set for SCOTT.SYS_EXPORT_TABLE_01 is:
/u01/dump/D401.dmp
/u01/dump/D402.dmp
/u01/dump/D403.dmp
/u01/dump/D404.dmp
/u01/dump/D405.dmp
/u01/dump/D406.dmp
/u01/dump/D407.dmp
/u01/dump/D408.dmp
/u01/dump/D409.dmp
/u01/dump/D410.dmp
Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:37:20 2020 elapsed 0 00:00:40
[oracle@test dump]$ expdp scott/tiger directory=dump dumpfile=D5%U.dmp tables=RANGE_PART_TAB

Now compare the elapsed times.

Job "SCOTT"."SYS_EXPORT_TABLE_01" successfully completed at Wed Aug 5 06:39:14 2020 elapsed 0 00:01:21

 

PARALLEL=10 versus no PARALLEL: 40 s to 80 s, a 1:2 ratio. A substantial saving, and the larger the data, the more pronounced the benefit.

 

 

Summary:

1. For a single table smaller than 250 MB, PARALLEL is meaningless; the effective degree is always 1.

2. For a single table with PARALLEL=2, one worker exports the metadata while the other exports the data; after the metadata is done, the first worker keeps writing data into the dump files, but there is almost no real parallel effect.

3. For a large non-partitioned table, a high PARALLEL setting only adds a few PX helper processes on the data side; the efficiency gain was not rigorously measured here, but the initial comparison showed little benefit.

4. For a partitioned table, PARALLEL only pays off when individual partitions exceed 250 MB; if all partitions are small, it is effectively the same as no parallelism. The larger the data volume, the more real the parallelism becomes.
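Pulling these conclusions together, here is a rough rule-of-thumb helper (again a sketch based only on these tests and MOS 365459.1, not an official formula):

```python
MB = 1024 * 1024
THRESHOLD = 250 * MB   # the 250 MB cutoff from MOS 365459.1

def parallel_pays_off(partition_bytes=None):
    """Rule of thumb from the tests above: a real speedup was only seen for
    partitioned tables whose individual partitions exceed 250 MB. For
    non-partitioned tables the PX helpers made little measurable difference."""
    if not partition_bytes:          # non-partitioned table
        return False
    return any(b > THRESHOLD for b in partition_bytes)

print(parallel_pays_off())                   # non-partitioned -> False
print(parallel_pays_off([80 * MB] * 10))     # small partitions -> False
print(parallel_pays_off([300 * MB] * 10))    # large partitions -> True
```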
