Inserting Large Amounts of Data into SQLite with Python

January 9, 2020
Notes

Preface

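When I wrote a Python crawler for proxy IPs, I first used SQLite to store the IP data. SQLite is simple, flexible, lightweight and open source, and the whole database behaves like an ordinary file on the file system. Inserting large amounts of crawled data, however, turned out to be very slow. After reading up on the problem, I found the cause: when every INSERT is followed by a commit, each insert becomes its own transaction, which is roughly equivalent to opening the database file once per insert. This generates a large amount of I/O and costs a lot of time. The code below commits the transaction after every insert; the figures after it are the time taken by each insert, in seconds.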

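    # Method of the database helper class; assumes `import sqlite3` at module level.
    def insert(self, sql, data):
        '''
        Insert data into the table.
        :param sql: parameterized INSERT statement
        :param data: iterable of parameter tuples, one per row
        :return: None
        '''
        if sql is not None and sql != '':
            if data is not None:
                cu = self.getcursor()
                try:
                    for d in data:
                        cu.execute(sql, d)
                        # Commit the transaction after every single insert
                        self.conn.commit()
                except sqlite3.Error as why:
                    print("insert data failed:", why.args[0])
                cu.close()
        else:
            print("sql is empty or None")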

0.134000062943  0.132999897003  0.236999988556  0.134000062943  0.120000123978  0.155999898911  0.131999969482  0.142000198364  0.119999885559  0.176000118256  0.124000072479  0.115999937057  0.111000061035  0.119999885559

Committing explicitly in a single transaction

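Committing only after a whole batch has been inserted keeps the pending statements in memory and writes them to the database in one step at commit time. The database file then only has to be opened once per batch, which improves efficiency significantly.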

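    # Method of the database helper class; assumes `import sqlite3` at module level.
    def insert(self, sql, data):
        '''
        Insert data into the table.
        :param sql: parameterized INSERT statement
        :param data: iterable of parameter tuples, one per row
        :return: None
        '''
        if sql is not None and sql != '':
            if data is not None:
                cu = self.getcursor()
                try:
                    for d in data:
                        cu.execute(sql, d)
                except sqlite3.Error as why:
                    print("insert data failed:", why.args[0])
                # Commit once, after the whole batch has been inserted
                self.conn.commit()
                cu.close()
        else:
            print("sql is empty or None")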

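The times for inserting batches of 20 rows are shown below, in seconds. A batch of 20 rows now takes about as long as a single row did when committing after every insert, which is a clear improvement.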

0.263999938965  0.117000102997  0.194999933243  0.263000011444  0.131000041962  0.15399980545  0.143000125885  0.12299990654  0.128000020981  0.121999979019  0.203999996185
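The two approaches above can be compared side by side with a small script. This is only a minimal sketch, not the original crawler code: the `proxy` table, its columns, and the helper names (`open_db`, `insert_commit_each`, `insert_commit_once`, `timed`, `count_rows`) are assumptions made for illustration. It uses `executemany` inside a `with conn:` block, so the whole batch is committed once, or rolled back if any row fails.

```python
import os
import sqlite3
import tempfile
import time

# Hypothetical table and statement, used only for this illustration.
INSERT_SQL = "INSERT INTO proxy (ip, port) VALUES (?, ?)"


def open_db(path):
    """Open a database and create the hypothetical proxy table."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS proxy (ip TEXT, port INTEGER)")
    conn.commit()
    return conn


def insert_commit_each(conn, rows):
    """Commit after every row: one transaction per INSERT."""
    for row in rows:
        conn.execute(INSERT_SQL, row)
        conn.commit()


def insert_commit_once(conn, rows):
    """Insert all rows in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(INSERT_SQL, rows)


def timed(func, conn, rows):
    """Return the wall-clock time func(conn, rows) takes, in seconds."""
    start = time.perf_counter()
    func(conn, rows)
    return time.perf_counter() - start


def count_rows(conn):
    """Return the number of rows currently in the proxy table."""
    return conn.execute("SELECT COUNT(*) FROM proxy").fetchone()[0]


if __name__ == "__main__":
    rows = [("10.0.0.%d" % i, 8080) for i in range(100)]
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_db(os.path.join(tmp, "proxy.db"))
        print("commit per row:", timed(insert_commit_each, conn, rows))
        print("single commit: ", timed(insert_commit_once, conn, rows))
        conn.close()
```

The absolute numbers depend heavily on the disk and operating system, so the script is meant for comparing the two strategies on one machine rather than for reproducing the figures above.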

Synchronous writes and prepared statements

Both methods are based mainly on the article "提升SQLite数据插入效率低、速度慢的方法" (methods for speeding up slow, inefficient SQLite inserts).

Synchronous writes

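In SQLite, database configuration parameters are set through pragmas. The synchronous option has three settings covered here: FULL, NORMAL and OFF (recent SQLite versions also accept EXTRA). The official documentation describes them as follows.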
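From Python, the pragma can be read and changed through an ordinary `execute` call. The following is a minimal sketch; the helper names `get_synchronous` and `set_synchronous` and the `SYNC_NAMES` mapping are assumptions introduced here, not part of the `sqlite3` module. The pragma should be set before any insert starts a transaction, because SQLite does not allow the setting to be changed inside one.

```python
import sqlite3

# Integer values reported by "PRAGMA synchronous" for the three modes above.
SYNC_NAMES = {0: "OFF", 1: "NORMAL", 2: "FULL"}


def get_synchronous(conn):
    """Return the current synchronous setting as an integer."""
    return conn.execute("PRAGMA synchronous").fetchone()[0]


def set_synchronous(conn, mode):
    """Set synchronous to OFF, NORMAL or FULL.

    The mode is checked against a fixed list before it is formatted
    into the PRAGMA statement.
    """
    mode = mode.upper()
    if mode not in SYNC_NAMES.values():
        raise ValueError("unsupported synchronous mode: %r" % mode)
    conn.execute("PRAGMA synchronous = %s" % mode)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print("before:", SYNC_NAMES.get(get_synchronous(conn)))
    set_synchronous(conn, "off")
    print("after: ", SYNC_NAMES.get(get_synchronous(conn)))
    conn.close()
```

The setting applies to the connection it was issued on, so in this sketch it would have to be repeated for each new connection.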

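When synchronous is set to FULL (2), the SQLite database engine pauses at critical moments to make sure the data has actually been written to disk. This ensures that the database is not corrupted after a system crash or a power failure. FULL synchronous is very safe, but slow.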

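When synchronous is set to NORMAL, the SQLite database engine still pauses at most critical moments, but less often than in FULL mode. In NORMAL mode there is a very small (though non-zero) chance that a power failure corrupts the database. In practice, though, in that situation your hard disk has most likely already become unusable, or some other unrecoverable hardware error has occurred.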

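With synchronous OFF (0), SQLite continues without pausing as soon as it has handed the data to the operating system. If the application running SQLite crashes, the data is not harmed, but the database may be corrupted if the operating system crashes or the power fails while data is being written. On the other hand, some operations can be 50 times faster or more with synchronous OFF. In SQLite 2 the default was NORMAL; in SQLite 3 it was changed to FULL.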
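To see how much the setting matters on a particular machine, the per-row-commit worst case can be timed under each mode. This is a rough sketch under stated assumptions: the function name `time_inserts`, the `proxy` table and the row contents are invented for illustration, and the measured speedup will vary with hardware, file system and SQLite build, so the 50x figure should not be expected to appear exactly.

```python
import os
import sqlite3
import tempfile
import time


def time_inserts(path, mode, n_rows=100):
    """Insert n_rows with a commit per row under the given synchronous mode.

    Returns (elapsed_seconds, row_count).
    """
    if mode not in ("OFF", "NORMAL", "FULL"):
        raise ValueError("unsupported synchronous mode: %r" % mode)
    conn = sqlite3.connect(path)
    try:
        # Set the pragma first, before any transaction has been started.
        conn.execute("PRAGMA synchronous = %s" % mode)
        # Hypothetical table, used only for this measurement.
        conn.execute("CREATE TABLE IF NOT EXISTS proxy (ip TEXT, port INTEGER)")
        conn.commit()
        start = time.perf_counter()
        for i in range(n_rows):
            conn.execute("INSERT INTO proxy (ip, port) VALUES (?, ?)",
                         ("10.0.0.%d" % (i % 256), 8080))
            conn.commit()  # worst case: one transaction per row
        elapsed = time.perf_counter() - start
        count = conn.execute("SELECT COUNT(*) FROM proxy").fetchone()[0]
    finally:
        conn.close()
    return elapsed, count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for mode in ("FULL", "NORMAL", "OFF"):
            # Use a separate database file per mode so runs do not interfere.
            path = os.path.join(tmp, "proxy_%s.db" % mode.lower())
            elapsed, count = time_inserts(path, mode)
            print("%-6s %d rows in %.4f s" % (mode, count, elapsed))
```

Since OFF trades durability for speed, a measurement like this is most useful for data that can be crawled again if the database is lost.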