Efficiency of fuzzy matching against tens of millions of database records


This topic was created 3066 days ago; the information mentioned may have changed or developed.

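Existing database records, with each field's weight in parentheses (女 = female, 男 = male):

| ID | Name (50%) | Gender (10%) | Phone (20%) | Native place (10%) | Email (10%) |
|----|------------|--------------|-------------|--------------------|-------------|
| 1 | 张三 | 女 | 13800001111 | 山西太原 | [email protected] |
| 2 | 张三 | 男 | 13800001111 | 山西太原 | [email protected] |
| 3 | 李四 | 男 | 15611112345 | 湖南长沙 | [email protected] |
| 4 | 王五 | 男 | 17022220000 | 广东广州 | [email protected] |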

New record: 5, 李四, 女, 15611112345, 广西桂林, [email protected]

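The requirement is to run a similarity match whenever a record is added. The rule is:

similarity = (李四 == 李四) × 50% + (男 == 女) × 10% + (15611112345 == 15611112345) × 20% + (湖南长沙 == 广西桂林) × 10% + ([email protected] == [email protected]) × 10%
= 1 × 50% + 0 × 10% + 1 × 20% + 0 × 10% + 1 × 10%
= 80% ≥ 80%

Each field is compared for exact equality and weighted according to the rule, so with the threshold set to 80%, record 5 and record 3 count as similar.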
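The weighted rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the poster's code: the field names (`name`, `gender`, `phone`, `hometown`, `email`) are made up here, and the email values are placeholders because the real addresses are obfuscated in the post. The worked example implies records 3 and 5 share an email, so the placeholder is the same for both.

```python
# Field weights in percent, as given in the post; they sum to 100.
WEIGHTS = {"name": 50, "gender": 10, "phone": 20, "hometown": 10, "email": 10}

def similarity(a: dict, b: dict) -> int:
    """Sum the weights of the fields on which both records agree exactly."""
    return sum(w for field, w in WEIGHTS.items() if a[field] == b[field])

# Hypothetical field names; the email is a placeholder value.
record3 = {"name": "李四", "gender": "男", "phone": "15611112345",
           "hometown": "湖南长沙", "email": "lisi@example.com"}
record5 = {"name": "李四", "gender": "女", "phone": "15611112345",
           "hometown": "广西桂林", "email": "lisi@example.com"}

score = similarity(record5, record3)
print(score, score >= 80)  # prints: 80 True
```

Name (50), phone (20) and email (10) match, which gives the 80% from the worked example.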

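The problem: the existing-records table holds upwards of ten million rows, and every newly added record has to be matched for similarity against the existing ones, but scanning the whole database each time is far too slow. Is there a better approach or algorithm? Or some other data structure that could replace the relational database for storing the records, so that matching time stays at roughly the one-minute level?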


17 replies 2017-12-05 17:22:21 +08:00

forkme

Dec 4, 2017

mpich

Dec 4, 2017

ES?

forkme

Dec 4, 2017

@mpich What's ES? I don't follow.

kxxoling

Dec 4, 2017 via iPad

@forkme Elasticsearch

gamexg

Dec 4, 2017

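@forkme #3 ES refers to Elasticsearch.
The requirement looks a lot like one in an earlier thread (from someone doing "小货"), though theirs only needed to match similar address books.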

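But looking at the requirement more closely, it isn't really hard, is it?
Index the name column and filter on the name first: the other four fields add up to only 50%, so a record whose name doesn't match can never reach 80%.
Also, the number of records sharing the same name shouldn't be large, so after that just iterate over them directly.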

If the database is under heavy load, add a KV store to hold the names.
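A minimal sketch of the idea in this reply, assuming an in-memory dict stands in for the name index or KV store (a real deployment would use a database index on the name column or an external store). Field names and email values are hypothetical placeholders, since the post obfuscates the real addresses. Only records with the same name are ever scored.

```python
from collections import defaultdict

FIELDS = ("id", "name", "gender", "phone", "hometown", "email")
NAME_WEIGHT = 50  # a name match is mandatory to reach the threshold
OTHER_WEIGHTS = {"gender": 10, "phone": 20, "hometown": 10, "email": 10}
THRESHOLD = 80

# name -> records with that name; a plain dict stands in for the KV store.
by_name = defaultdict(list)

def add_record(row: tuple) -> dict:
    """Store a record under its name and return it as a dict."""
    rec = dict(zip(FIELDS, row))
    by_name[rec["name"]].append(rec)
    return rec

def find_similar(new: dict) -> list:
    """Return (id, score) for same-name records scoring >= THRESHOLD."""
    hits = []
    for old in by_name.get(new["name"], []):  # candidates: same name only
        score = NAME_WEIGHT + sum(
            w for field, w in OTHER_WEIGHTS.items() if old[field] == new[field])
        if score >= THRESHOLD:
            hits.append((old["id"], score))
    return hits

# Sample rows from the post; emails are made-up placeholders.
for row in [
    (1, "张三", "女", "13800001111", "山西太原", "zhangsan@example.com"),
    (2, "张三", "男", "13800001111", "山西太原", "zhangsan@example.com"),
    (3, "李四", "男", "15611112345", "湖南长沙", "lisi@example.com"),
    (4, "王五", "男", "17022220000", "广东广州", "wangwu@example.com"),
]:
    add_record(row)

new_record = dict(zip(
    FIELDS, (5, "李四", "女", "15611112345", "广西桂林", "lisi@example.com")))
print(find_similar(new_record))  # prints: [(3, 80)]
```

The cost of a lookup depends on how many stored records share the new record's name, not on the size of the whole table, which is the point of filtering on the name first.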

zhengxiaowai

Dec 4, 2017

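ES is too heavy; a data set in the tens of millions doesn't need it, and the learning curve isn't friendly.

I'd recommend PostgreSQL's full-text search with tsquery, where results can be ranked (ts_rank); it performs well and is quite easy to pick up.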

ytmsdy

Dec 4, 2017

Try putting it on an SSD and see..


tomczhen

Dec 4, 2017

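I don't know how your current SQL is written, but I suspect this 100-point scheme will cause trouble later: if one more criterion is added, do the proportions have to be redistributed? Does all the related business code have to change?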

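-- a = name, b = gender, c = phone, d = native place, e = email.
-- The name must match (50%); each other field adds its weight in units of 10%.
-- "records" stands in for the real table name ("table" is a reserved word).
select a, b, c, d, e, sum(weights) as score from (
    select a, b, c, d, e, 1 as weights from records where a = '李四' and b = '女'
    union all
    select a, b, c, d, e, 2 as weights from records where a = '李四' and c = '15611112345'
    union all
    select a, b, c, d, e, 1 as weights from records where a = '李四' and d = '广西桂林'
    union all
    select a, b, c, d, e, 1 as weights from records where a = '李四' and e = '[email protected]'
) t  -- the derived table needs an alias in MySQL and older PostgreSQL
group by a, b, c, d, e
having sum(weights) >= 3;  -- 50% (name) + 30% = 80%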
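One possible way to address the concern about redistributing weights is to keep them in a single mapping and generate the scoring expression from it. The sketch below is not the poster's code: it uses Python's built-in `sqlite3` with an in-memory database, a hypothetical `records` table with made-up column names, and placeholder emails. It relies on SQLite evaluating a comparison to 0 or 1, and it scores in a single pass over the same-name rows instead of four `union all` branches.

```python
import sqlite3

# Weights in units of 10%. The name is mandatory (50%), so it is a WHERE filter.
WEIGHTS = {"gender": 1, "phone": 2, "hometown": 1, "email": 1}
THRESHOLD = 3  # 50% for the name + 30% from the other fields = 80%

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT,"
             " gender TEXT, phone TEXT, hometown TEXT, email TEXT)")
conn.execute("CREATE INDEX idx_records_name ON records (name)")
# Sample rows from the post; emails are made-up placeholders.
conn.executemany("INSERT INTO records VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "张三", "女", "13800001111", "山西太原", "zhangsan@example.com"),
    (2, "张三", "男", "13800001111", "山西太原", "zhangsan@example.com"),
    (3, "李四", "男", "15611112345", "湖南长沙", "lisi@example.com"),
    (4, "王五", "男", "17022220000", "广东广州", "wangwu@example.com"),
])

def find_similar(new: dict) -> list:
    """Return (id, score) rows whose name matches and score >= THRESHOLD."""
    # Column names come from WEIGHTS above, never from user input.
    score = " + ".join(f"({col} = :{col}) * {w}" for col, w in WEIGHTS.items())
    sql = (f"SELECT id, {score} AS score FROM records "
           f"WHERE name = :name AND {score} >= :threshold ORDER BY id")
    return conn.execute(sql, {**new, "threshold": THRESHOLD}).fetchall()

new_record = {"name": "李四", "gender": "女", "phone": "15611112345",
              "hometown": "广西桂林", "email": "lisi@example.com"}
print(find_similar(new_record))  # prints: [(3, 3)]
```

Adding a criterion then means adding a column and one entry in `WEIGHTS`; whether the threshold should change with it remains a business decision.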