Có cách nào để sử dụng FORALL để chèn dữ liệu từ một mảng không?

Tôi thực sự quan tâm đến việc cái gì sẽ nhanh hơn, vì vậy tôi đã thử nghiệm một số cách có thể có để so sánh chúng:

đơn giản executemany không cần thủ thuật.
tương tự với APPEND_VALUES gợi ý bên trong câu lệnh.
union all cách tiếp cận bạn đã thử trong một câu hỏi khác. Điều này sẽ chậm hơn ở trên vì nó tạo ra thực sự rất lớn tuyên bố (có thể yêu cầu nhiều mạng hơn chính dữ liệu). Sau đó, nó sẽ được phân tích cú pháp ở phía DB, điều này cũng sẽ tiêu tốn rất nhiều thời gian và bỏ qua tất cả các lợi ích (không nói về giới hạn kích thước tiềm năng). Sau đó, tôi đã executemany 'ed nó để kiểm tra với các phần không để xây dựng một câu lệnh duy nhất cho 100 nghìn bản ghi. Tôi không sử dụng cách nối các giá trị bên trong câu lệnh vì muốn giữ an toàn.
insert all . Nhược điểm giống nhau, nhưng không có công đoàn. So sánh nó với union phiên bản.
tuần tự hóa dữ liệu trong JSON và thực hiện giải mã hóa ở phía DB với json_table . Hiệu suất tiềm năng tốt với một câu lệnh ngắn và truyền dữ liệu đơn lẻ với chi phí JSON thấp.
FORALL được đề xuất của bạn trong thủ tục trình bao bọc PL / SQL. Phải giống với executemany kể từ khi làm tương tự, nhưng ở phía cơ sở dữ liệu. Chi phí chuyển đổi dữ liệu thành bộ sưu tập.
FORALL giống nhau , nhưng với cách tiếp cận cột để truyền dữ liệu:chuyển danh sách giá trị cột đơn giản thay vì kiểu phức tạp. Sẽ nhanh hơn nhiều so với FORALL với bộ sưu tập vì không cần phải tuần tự hóa dữ liệu thành loại bộ sưu tập.

Tôi đã sử dụng Cơ sở dữ liệu tự trị của Oracle trong Đám mây Oracle với tài khoản miễn phí. Mỗi phương thức được thực thi 10 lần trong vòng lặp với cùng một tập dữ liệu đầu vào gồm 100k bản ghi, bảng được tạo lại trước mỗi lần kiểm tra. Đây là kết quả mà tôi nhận được. Thời gian chuẩn bị và thời gian thực hiện ở đây là quá trình chuyển đổi dữ liệu ở phía máy khách DB lần lượt gọi chính nó.

>>> t = PerfTest(100000)
>>> t.run("exec_many", 10)
Method:  exec_many.
    Duration, avg: 2.3083874 s
    Preparation time, avg: 0.0 s
    Execution time, avg: 2.3083874 s
>>> t.run("exec_many_append", 10)
Method: exec_many_append.
    Duration, avg: 2.6031369 s
    Preparation time, avg: 0.0 s
    Execution time, avg: 2.6031369 s
>>> t.run("union_all", 10, 10000)
Method:  union_all.
    Duration, avg: 27.9444233 s
    Preparation time, avg: 0.0408773 s
    Execution time, avg: 27.8457551 s
>>> t.run("insert_all", 10, 10000)
Method: insert_all.
    Duration, avg: 70.6442494 s
    Preparation time, avg: 0.0289269 s
    Execution time, avg: 70.5541995 s
>>> t.run("json_table", 10)
Method: json_table.
    Duration, avg: 10.4648237 s
    Preparation time, avg: 9.7907693 s
    Execution time, avg: 0.621006 s
>>> t.run("forall", 10)
Method:     forall.
    Duration, avg: 5.5622837 s
    Preparation time, avg: 1.8972456000000002 s
    Execution time, avg: 3.6650380999999994 s
>>> t.run("forall_columnar", 10)
Method: forall_columnar.
    Duration, avg: 2.6702698000000002 s
    Preparation time, avg: 0.055710800000000005 s
    Execution time, avg: 2.6105702 s
>>>

Cách nhanh nhất chỉ là executemany , không có quá nhiều ngạc nhiên. Điều thú vị ở đây là APPEND_VALUES không cải thiện truy vấn và trung bình mất nhiều thời gian hơn, vì vậy điều này cần được điều tra thêm.

Giới thiệu về FORALL :như mong đợi, mảng riêng lẻ cho mỗi cột mất ít thời gian hơn vì không có chuẩn bị dữ liệu cho nó. Nó ít nhiều có thể so sánh với executemany , nhưng tôi nghĩ chi phí PL / SQL đóng một số vai trò ở đây.

Một phần thú vị khác đối với tôi là JSON:phần lớn thời gian được dành cho việc viết LOB vào cơ sở dữ liệu và tuần tự hóa, nhưng bản thân truy vấn rất nhanh. Có thể thao tác ghi có thể được cải thiện theo một cách nào đó với chuncsize hoặc một số cách khác để chuyển dữ liệu LOB vào câu lệnh select, nhưng đối với mã của tôi, nó không có cách tiếp cận rất đơn giản và dễ hiểu với executemany .

Cũng có những cách tiếp cận khả thi mà không có Python mà nên nhanh hơn như các công cụ gốc dành cho dữ liệu bên ngoài, nhưng tôi đã không thử nghiệm chúng:

Dưới đây là mã tôi đã sử dụng để thử nghiệm.

import cx_Oracle as db
import os, random, json
import datetime as dt


class PerfTest:
  
  def __init__(self, size):
    self._con = db.connect(
      os.environ["ora_cloud_usr"],
      os.environ["ora_cloud_pwd"],
      "test_low",
      encoding="UTF-8"
    )
    self._cur = self._con.cursor()
    self.inp = [(i, "Test {i}".format(i=i), random.random()) for i in range(size)]
  
  def __del__(self):
    if self._con:
      self._con.rollback()
      self._con.close()
 
#Create objets
  def setup(self):
    try:
      self._cur.execute("drop table rand")
      #print("table dropped")
    except:
      pass
  
    self._cur.execute("""create table rand(
      id int,
      str varchar2(100),
      val number
    )""")
    
    self._cur.execute("""create or replace package pkg_test as
  type ts_test is record (
    id rand.id%type,
    str rand.str%type,
    val rand.val%type
  );
  type tt_test is table of ts_test index by pls_integer;
  
  type tt_ids is table of rand.id%type index by pls_integer;
  type tt_strs is table of rand.str%type index by pls_integer;
  type tt_vals is table of rand.val%type index by pls_integer;
  
  procedure write_data(p_data in tt_test);
  procedure write_data_columnar(
    p_ids in tt_ids,
    p_strs in tt_strs,
    p_vals in tt_vals
  );

end;""")
    self._cur.execute("""create or replace package body pkg_test as
  procedure write_data(p_data in tt_test)
  as
  begin
    forall i in indices of p_data
      insert into rand(id, str, val)
      values (p_data(i).id, p_data(i).str, p_data(i).val)
    ;
    
    commit;

  end;
  
  procedure write_data_columnar(
    p_ids in tt_ids,
    p_strs in tt_strs,
    p_vals in tt_vals
  ) as
  begin
    forall i in indices of p_ids
      insert into rand(id, str, val)
      values (p_ids(i), p_strs(i), p_vals(i))
    ;
    
    commit;
    
  end;

end;
""")

 
  def build_union(self, size):
      return """insert into rand(id, str, val)
    select id, str, val from rand where 1 = 0 union all
    """ + """ union all """.join(
      ["select :{}, :{}, :{} from dual".format(i*3+1, i*3+2, i*3+3)
        for i in range(size)]
    )
 
 
  def build_insert_all(self, size):
      return """
      """.join(
      ["into rand(id, str, val) values (:{}, :{}, :{})".format(i*3+1, i*3+2, i*3+3)
        for i in range(size)]
    )


#Test case with executemany
  def exec_many(self):
    start = dt.datetime.now()
    self._cur.executemany("insert into rand(id, str, val) values (:1, :2, :3)", self.inp)
    self._con.commit()
    
    return (dt.timedelta(0), dt.datetime.now() - start)
 
 
#The same as above but with prepared statement (no parsing)
  def exec_many_append(self):
    start = dt.datetime.now()
    self._cur.executemany("insert /*+APPEND_VALUES*/ into rand(id, str, val) values (:1, :2, :3)", self.inp)
    self._con.commit()
    
    return (dt.timedelta(0), dt.datetime.now() - start)


#Union All approach (chunked). Should have large parse time
  def union_all(self, size):
##Chunked list of big tuples
    start_prepare = dt.datetime.now()
    new_inp = [
      tuple([item for t in r for item in t])
      for r in list(zip(*[iter(self.inp)]*size))
    ]
    new_stmt = self.build_union(size)
    
    dur_prepare = dt.datetime.now() - start_prepare
    
    #Execute unions
    start_exec = dt.datetime.now()
    self._cur.executemany(new_stmt, new_inp)
    dur_exec = dt.datetime.now() - start_exec

##In case the size is not a divisor
    remainder = len(self.inp) % size
    if remainder > 0 :
      start_prepare = dt.datetime.now()
      new_stmt = self.build_union(remainder)
      new_inp = tuple([
        item for t in self.inp[-remainder:] for item in t
      ])
      dur_prepare += dt.datetime.now() - start_prepare
      
      start_exec = dt.datetime.now()
      self._cur.execute(new_stmt, new_inp)
      dur_exec += dt.datetime.now() - start_exec

    self._con.commit()
    
    return (dur_prepare, dur_exec)


#The same as union all, but with no need to union something
  def insert_all(self, size):
##Chunked list of big tuples
    start_prepare = dt.datetime.now()
    new_inp = [
      tuple([item for t in r for item in t])
      for r in list(zip(*[iter(self.inp)]*size))
    ]
    new_stmt = """insert all
    {}
    select * from dual"""
    dur_prepare = dt.datetime.now() - start_prepare
    
    #Execute
    start_exec = dt.datetime.now()
    self._cur.executemany(
      new_stmt.format(self.build_insert_all(size)),
      new_inp
    )
    dur_exec = dt.datetime.now() - start_exec

##In case the size is not a divisor
    remainder = len(self.inp) % size
    if remainder > 0 :
      start_prepare = dt.datetime.now()
      new_inp = tuple([
        item for t in self.inp[-remainder:] for item in t
      ])
      dur_prepare += dt.datetime.now() - start_prepare
      
      start_exec = dt.datetime.now()
      self._cur.execute(
        new_stmt.format(self.build_insert_all(remainder)),
        new_inp
      )
      dur_exec += dt.datetime.now() - start_exec

    self._con.commit()
    
    return (dur_prepare, dur_exec)

    
#Serialize at server side and do deserialization at DB side
  def json_table(self):
    start_prepare = dt.datetime.now()
    new_inp = json.dumps([
      { "id":t[0], "str":t[1], "val":t[2]} for t in self.inp
    ])
    
    lob_var = self._con.createlob(db.DB_TYPE_CLOB)
    lob_var.write(new_inp)
    
    start_exec = dt.datetime.now()
    self._cur.execute("""
    insert into rand(id, str, val)
    select id, str, val
    from json_table(
      to_clob(:json), '$[*]'
      columns
        id int,
        str varchar2(100),
        val number
    )
    """, json=lob_var)
    dur_exec = dt.datetime.now() - start_exec
    
    self._con.commit()
    
    return (start_exec - start_prepare, dur_exec)


#PL/SQL with FORALL
  def forall(self):
    start_prepare = dt.datetime.now()
    collection_type = self._con.gettype("PKG_TEST.TT_TEST")
    record_type = self._con.gettype("PKG_TEST.TS_TEST")
    
    def recBuilder(x):
      rec = record_type.newobject()
      rec.ID = x[0]
      rec.STR = x[1]
      rec.VAL = x[2]
      
      return rec

    inp_collection = collection_type.newobject([
      recBuilder(i) for i in self.inp
    ])
    
    start_exec = dt.datetime.now()
    self._cur.callproc("pkg_test.write_data", [inp_collection])
    dur_exec = dt.datetime.now() - start_exec
    
    return (start_exec - start_prepare, dur_exec)


#PL/SQL with FORALL and plain collections
  def forall_columnar(self):
    start_prepare = dt.datetime.now()
    ids, strs, vals = map(list, zip(*self.inp))
    start_exec = dt.datetime.now()
    self._cur.callproc("pkg_test.write_data_columnar", [ids, strs, vals])
    dur_exec = dt.datetime.now() - start_exec
    
    return (start_exec - start_prepare, dur_exec)

  
#Run test
  def run(self, method, iterations, *args):
    #Cleanup schema
    self.setup()

    start = dt.datetime.now()
    runtime = []
    for i in range(iterations):
      single_run = getattr(self, method)(*args)
      runtime.append(single_run)
    
    dur = dt.datetime.now() - start
    dur_prep_total = sum([i.total_seconds() for i, _ in runtime])
    dur_exec_total = sum([i.total_seconds() for _, i in runtime])
    
    print("""Method: {meth}.
    Duration, avg: {run_dur} s
    Preparation time, avg: {prep} s
    Execution time, avg: {ex} s""".format(
      inp_s=len(self.inp),
      meth=method,
      run_dur=dur.total_seconds() / iterations,
      prep=dur_prep_total / iterations,
      ex=dur_exec_total / iterations
    ))