小内存如何对两个大型列表求差集? - V2EX
推荐学习书目
Learn Python the Hard Way
Python Sites
PyPI -Python Package Index
http://diveintopython.org/toc/index.html
Pocoo
值得关注的项目
PyPy
Celery
Jinja2
Read the Docs
gevent
pyenv
virtualenv
Stackless Python
Beautiful Soup
结巴中文分词
Green Unicorn
Sentry
Shovel
Pyflakes
pytest
Python 编程
pep8 Checker
Styles
PEP 8
Google Python Style Guide
Code Style from The Hitchhiker's Guide
ioven

小内存如何对两个大型列表求差集?

  •  1
     
  •   ioven Oct 27, 2017 5892 views
    This topic created in 3119 days ago, the information mentioned may be changed or developed.

    求 a 与 b 的差集,内存限制 1G

    # 500w a = [...] # 5000w b = [...] # 记录为 10 - 100 不等字符串 

    不知是不是关键词不对,搜索到的方案都是 set、numpy,或 set(b) 遍历 a,内存实在扛不住啊

    求指点,速度慢些也可以接受

    Supplement 1    Oct 28, 2017

    感谢各位大佬指点,放上解决方案方便后来同学

    shell 版

    sort -u -o a.txt a.txt sort -u -o b.txt b.txt comm -23 a.txt b.txt > results.txt 先排序,后求差集 

    python sqlite 版

    import sqlite3 cOnn= sqlite3.connect('test.db') c = conn.cursor() c.execute('CREATE TABLE IF NOT EXISTS a (v TEXT)') c.execute('CREATE TABLE IF NOT EXISTS b (v TEXT)') with open(r'a.txt', 'r') as f_in: for n, i in enumerate(f_in): c.execute('INSERT INTO a VALUES (?)', (i.strip(), )) with open(r'b.txt', 'r') as f_in: for n, i in enumerate(f_in): c.execute('INSERT INTO b VALUES (?)', (i.strip(), )) c.execute('CREATE INDEX bv ON b(v)') c.execute('SELECT v FROM a WHERE v NOT IN (SELECT v FROM b)') for i in c: print(i) 

    感谢 @CRVV 大佬的代码

    24 replies    2017-11-21 22:42:24 +08:00
    kristingna
        1
    kristingna  
       Oct 27, 2017
    你可以试试 Spark
    yuyang
        2
    yuyang  
       Oct 27, 2017 via Android
    先外部排序,然后比较
    qwertyegg
        3
    qwertyegg  
       Oct 27, 2017
    100 个 char 的 String 占内存 240byte,最大占用 240 byte * 50M = 1200MB,你的记录只要不是极端的情况下,1GB 内存轻松放进去呀
    qwertyegg
        4
    qwertyegg  
       Oct 27, 2017
    @qwertyegg 好像算错了,是 12000MB。
    qwertyegg
        5
    qwertyegg  
       Oct 27, 2017
    @qwertyegg 那也不难,把 b 拆成 10 个,每次从 b 里面读 5M 个记录,做 10 次差集不就可以了。
    buliugu
        6
    buliugu  
       Oct 27, 2017   1
    bitmap
    Xs0ul
        7
    Xs0ul  
       Oct 27, 2017
    把 a b 都按首字母拆开成一堆小文件,a b 每对首字母相同的做差,再把结果合并起来。
    geelaw
        8
    geelaw  
       Oct 27, 2017
    似乎光是 a、b 本身的大小已经远远超过 1GB 了。

    你是希望附加空间在 1GB 之内吗?
    neosfun
        9
    neosfung  
       Oct 27, 2017 via iPhone
    Trie 树
    owenliang
        10
    owenliang  
       Oct 27, 2017   1
    1,数据集本身就大于内存,所以数据集肯定在磁盘。
    2,内存无法排序所有数据,所以必须外排序。
    3,一旦 2 个文件有序,那么就可以 2 路归并。
    CRVV
        11
    CRVV  
       Oct 27, 2017   1
    大概试了一下,在我的机器上,含生成随机数据一共大约 400 秒,其中建索引 80 秒,查询 60 秒。
    内存占用不超过 20 M,数据库文件 6.4 G

    import sqlite3
    import string
    import random

    cOnn= sqlite3.connect('test.db')

    c = conn.cursor()

    c.execute('CREATE TABLE IF NOT EXISTS a (v TEXT)')
    c.execute('CREATE TABLE IF NOT EXISTS b (v TEXT)')

    letters = string.ascii_letters + string.digits

    for _ in range(5_000_000):
    ____random_string = ''.join((random.choice(letters) for _ in range(random.randint(10, 100))))
    ____c.execute('INSERT INTO a VALUES (?)', (random_string, ))

    for _ in range(50_000_000 // 4_500_000):
    ____c.execute('INSERT INTO b SELECT * FROM a LIMIT 4500000')

    c.execute('CREATE INDEX bv ON b(v)')

    c.execute('SELECT v FROM a WHERE v NOT IN (SELECT v FROM b)')

    count = 0
    for _ in c:
    ____count += 1

    print(count)

    conn.commit()
    conn.close()
    NoAnyLove
        12
    NoAnyLove  
       Oct 28, 2017
    如果记录的长度比较均匀的话,那么按照长度分组之后再来做运算不知道内存够不够
    swulling
        13
    swulling  
       Oct 28, 2017 via iPhone
    内存够用 hash 表,比如 b 减 a,就用 hash 表
    内存不够,比如 a 减 b,那就用 B 树
    kaneg
        14
    kaneg  
       Oct 28, 2017 via iPhone
    把小份数据读到内存,大的在文件中逐条读,求二者的交集,结果必然不大于之前的记录,然后分别求之前两份原始数据的差集,最后合并两份差集即可。
    clino
        15
    clino  
       Oct 28, 2017 via Android
    我也和楼上一样想用 sqlite
    herozhang
        16
    herozhang  
       Oct 28, 2017 via iPhone
    把 swap 分区设大一点,就可以了,哈哈哈
    clino
        17
    clino  
       Oct 28, 2017
    楼主能不能说下 shell 和 python sqlite 版运行时间分别是多少?
    zhicheng
        18
    zhicheng  
       Oct 28, 2017
    如果不是完全随机数,可以压缩一下。只要能把小的那个塞进内存就行了。
    ioven
        19
    ioven  
    OP
       Oct 28, 2017
    @clino 用测试数据来看,sqlite 大概是 shell 两倍时间、空间占用,但 sqlite 处理更加灵活,这点差距完全可以接受
    stanjia
        20
    stanjia  
       Oct 29, 2017
    省心法本机安个 hadoop
    clino
        21
    clino  
       Oct 30, 2017
    @ioven 如果 sqlite 操作都放在一个事务里面,估计时间优化得比较短
    ioven
        22
    ioven  
    OP
       Oct 30, 2017
    @clino 多谢提醒,默认是智能事务,试试手工开启事务看速度有没有提升
    shamashii
        23
    shamashii  
       Nov 21, 2017
    生成 110s,比较 120s,实验时感觉坑点竟然在于生成随机字符串效率,求改进
    ```
    import timeit
    def main():
    import h5py, cyrandom
    allchr = "".join((chr(i) for i in range(33,127)))
    pspool = [[cyrandom.choice(allchr) for _ in range(cyrandom.randint(10, 100))] for x in range(100000)]

    chunkl = []
    for _ in range(5000000):
    b1 = cyrandom.choice(pspool)
    cyrandom.shuffle(b1)
    chunkl.append(''.join(b1).encode('utf-8'))

    f = h5py.File('h5.h5','w')
    for k in range(50000000//5000000):
    l = [str(k).encode('utf-8')]
    # cyrandom.shuffle(chunkl)
    print(k)
    f.create_dataset(str(k), data=chunkl+l,)
    del chunkl
    f.close()

    def query():
    import h5py
    f = h5py.File('h5.h5','a')
    wbw = set(f['0'].value)
    count = []
    for k in f.keys():
    print(k)
    for x in f[k].value:
    if x not in wbw:
    count.append(x)
    print(count)
    f.close()

    print(timeit.timeit(main, number=1))
    print(timeit.timeit(query, number=1))
    ```
    shamashii
        24
    shamashii  
       Nov 21, 2017
    About     Help     Advertise     Blog     API     FAQ     Solana     3272 Online   Highest 6679       Select Language
    创意工作者们的社区
    World is powered by solitude
    VERSION: 3.9.8.5 59ms UTC 14:09 PVG 22:09 LAX 07:09 JFK 10:09
    Do have faith in what you're doing.
    ubao msn snddm index pchome yahoo rakuten mypaper meadowduck bidyahoo youbao zxmzxm asda bnvcg cvbfg dfscv mmhjk xxddc yybgb zznbn ccubao uaitu acv GXCV ET GDG YH FG BCVB FJFH CBRE CBC GDG ET54 WRWR RWER WREW WRWER RWER SDG EW SF DSFSF fbbs ubao fhd dfg ewr dg df ewwr ewwr et ruyut utut dfg fgd gdfgt etg dfgt dfgd ert4 gd fgg wr 235 wer3 we vsdf sdf gdf ert xcv sdf rwer hfd dfg cvb rwf afb dfh jgh bmn lgh rty gfds cxv xcv xcs vdas fdf fgd cv sdf tert sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf sdf shasha9178 shasha9178 shasha9178 shasha9178 shasha9178 liflif2 liflif2 liflif2 liflif2 liflif2 liblib3 liblib3 liblib3 liblib3 liblib3 zhazha444 zhazha444 zhazha444 zhazha444 zhazha444 dende5 dende denden denden2 denden21 fenfen9 fenf619 fen619 fenfe9 fe619 sdf sdf sdf sdf sdf zhazh90 zhazh0 zhaa50 zha90 zh590 zho zhoz zhozh zhozho zhozho2 lislis lls95 lili95 lils5 liss9 sdf0ty987 sdft876 sdft9876 sdf09876 sd0t9876 sdf0ty98 sdf0976 sdf0ty986 sdf0ty96 sdf0t76 sdf0876 df0ty98 sf0t876 sd0ty76 sdy76 sdf76 sdf0t76 sdf0ty9 sdf0ty98 sdf0ty987 sdf0ty98 sdf6676 sdf876 sd876 sd876 sdf6 sdf6 sdf9876 sdf0t sdf06 sdf0ty9776 sdf0ty9776 sdf0ty76 sdf8876 sdf0t sd6 sdf06 s688876 sd688 sdf86