Gse v0.20.0 released: high-performance word segmentation in Go, with performance and code optimizations and more tests
    vcaesar  2018-10-10 03:16:07 +08:00  3971 views

    Efficient word segmentation in Go, with support for English, Chinese, Japanese, and other languages.

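    The dictionary is implemented as a double-array trie. The segmentation algorithm finds the shortest path weighted by word frequency, using dynamic programming.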

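    It supports two segmentation modes, normal and search engine, as well as user dictionaries and part-of-speech tagging, and it can run as a JSON RPC service.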

    Project: https://github.com/go-ego/gse

    package main

    import (
        "fmt"

        "github.com/go-ego/gse"
    )

    func main() {
        var seg gse.Segmenter
        seg.LoadDict("zh,testdata/test_dict.txt,testdata/test_dict1.txt")

        text1 := []byte("你好世界, Hello world")
        segments := seg.Segment(text1)
        fmt.Println(gse.ToString(segments))
    }

    Lethe River

    Add

    • [NEW] Add slice() and string() func and test
    • [NEW] Add more test
    • [NEW] Optimize textSliceToString splicing speed
    • [NEW] Update LoadDict() log.Printf and optimize read dict log
    • [NEW] Add ToString() and ToSlice() default value and update test
    • [NEW] ToString and ToSlice use early return instead of else, and update code
    • [NEW] Update server code
    • [NEW] Add token equals() func and test
    • [NEW] Add search mode example
    • [NEW] Optimize file defer close
    • [NEW] Segment returns nil instead of an empty array
    • [NEW] Update pkg to newest ( optimize cedar code )

    • [NEW] Update and refactoring segment test code
    • [NEW] Update dictionary and static demo
    • [NEW] Refactoring gse benchmark code
    • [NEW] Update and simplify test code

    Update

    • [NEW] Make the issue template clearer
    • [NEW] Update godoc, pull_request_template.md and issue_template.md
    • [NEW] Update README.md to use uniform naming
    • [NEW] Update godoc
    • [NEW] Update README.md, add searchMode docs
    • [NEW] Optimize handling of Japanese segmentation errors
    • [NEW] Update code style and name style
    • [NEW] Update examples and benchmark code
    • [NEW] Add Travis ci go1.11 support

    Fix

    • [FIX] Update examples lang fix #4
    • [FIX] Fix typo for example
    • [FIX] Fix LoadDict() godoc error
    • [FIX] Fix sub-word error
    • [FIX] Fix nil pointer panic in segmentWords when dict is nil
    • [FIX] Update README.md Fixed Release badge

    See Commits for more details, after Apr 27.

    10 replies    2018-10-10 20:09:54 +08:00
    1  yanaraika  2018-10-10 06:51:49 +08:00
    It's 8102 already; at least use a Markov model.
    2  vway (OP)  2018-10-10 08:43:03 +08:00
    @yanaraika We will consider adding an HMM later.
    3  JeffKing  2018-10-10 08:55:48 +08:00 via iPhone
    It's 8102 already; at least use CRF segmentation.
    4  enenaaa  2018-10-10 09:31:14 +08:00
    What corpus were the word frequencies computed from?
    5  dilu  2018-10-10 09:51:50 +08:00
    Supporting the OP first. Also, are there any learning resources on word segmentation you could share? I'm very interested in this.
    6  realpg (PRO)  2018-10-10 10:36:13 +08:00
    Shouldn't English segmentation simply be based on spaces and punctuation?
    7  vway (OP)  2018-10-10 19:21:47 +08:00
    @JeffKing We will consider adding it.
    8  vway (OP)  2018-10-10 20:07:33 +08:00
    @realpg For now, the main work is still optimizing how some terminator characters are handled.
    9  vway (OP)  2018-10-10 20:08:33 +08:00
    @dilu There is plenty of material on Baidu or Google.
    10  vway (OP)  2018-10-10 20:09:54 +08:00
    @enenaaa They come from jieba (结巴分词).