Using Python to analyze my own browsing habits in Chrome - V2EX
  •   Fikhtengol Feb 12, 2013 8007 views
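    Today is the second day of the Lunar New Year (初二). Back home after a day out with nothing to do, I was browsing around and suddenly realized that I never stay focused when I write code: I keep clicking over here for a bit, then over there. So I wanted to check my browsing history and see which pages I visit regularly and which ones I visit most. I use Chrome, but its history page doesn't seem to offer that, so I decided to write it myself.
    Chrome's history lives in ~/.config/google-chrome/Default/. Opening the History file in there shows gibberish; it turns out to be stored as an SQLite database, which makes things easy. Open the History file and list its tables: there are several, and at a glance one named urls looked like the place where visited URLs are stored. It even has a visit_count column. That's the one, so let's start selecting: SELECT * FROM urls ORDER BY visit_count DESC LIMIT 0,10? A lot of the rows that came back belonged to the same site. Fine, so next I wanted totals per host. That means extracting the host from each URL, summing the counts of identical hosts, then sorting and printing.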
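The exploration described above can be sketched with the standard `sqlite3` module. This is only a minimal sketch, not the OP's script: the helpers `list_tables`, `top_urls` and `make_demo_db` are hypothetical names, and the in-memory table with made-up rows stands in for the real History file. It assumes only what the post states, namely that there is a `urls` table with `url` and `visit_count` columns.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database (hypothetical helper)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def top_urls(conn, n=10):
    """Return the n most-visited (url, visit_count) rows (hypothetical helper)."""
    return conn.execute(
        "SELECT url, visit_count FROM urls ORDER BY visit_count DESC LIMIT ?",
        (n,),
    ).fetchall()


def make_demo_db():
    """Build an in-memory stand-in for the History file with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT, visit_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO urls (url, visit_count) VALUES (?, ?)",
        [
            ("http://example.com/a", 5),
            ("http://example.com/b", 7),
            ("http://example.org/", 3),
        ],
    )
    return conn


conn = make_demo_db()
print(list_tables(conn))
for url, count in top_urls(conn, 2):
    print(count, url)
conn.close()
```

With the real file, `sqlite3.connect()` would be given the path to History instead of `":memory:"`. The two example.com rows in the demo show why per-URL counts alone are not enough and a per-host total is the natural next step.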

    #!/usr/bin/env python3
    '''
    Analyse the user's Chrome browsing behavior.
    '''
    import sqlite3
    from urllib.parse import urlparse


    class AnalyseChrome:
        '''
        Chrome's history log is written as an SQLite database, saved by
        default in ~/.config/google-chrome/Default/History on Ubuntu.
        '''

        def __init__(self, db="/home/lijun/.config/google-chrome/Default/History"):
            '''Init the AnalyseChrome with the Chrome history db path.'''
            self.cn = sqlite3.connect(db)
            self.cu = self.cn.cursor()

        def get_sql_res(self, sql, params=()):
            '''Run a query; return (rows, "") or (None, error message).'''
            try:
                self.cu.execute(sql, params)
            except sqlite3.Error as e:
                print(e)
                return None, str(e)
            return self.cu.fetchall(), ""

        def show_table(self, name="%"):
            '''Show the tables in the History db whose name matches `name`.'''
            # Bind the pattern as a parameter instead of formatting it into the SQL.
            sql = "SELECT * FROM sqlite_master WHERE type='table' AND name LIKE ?;"
            return self.get_sql_res(sql, (name,))

        def clear(self):
            self.cn.close()

        def top_n(self, n, orderby="host"):
            '''
            Return the top n urls or hosts the user visits most frequently.
            Groups by host by default.
            '''
            if orderby not in ("host", "url"):
                return None, "error argv in top_n"
            sql = "select url,visit_count from urls;"
            res, errmsg = self.get_sql_res(sql)
            if not res:
                return None, errmsg
            # First select every url and visit_count from the urls table,
            # then sum the counts per host (or per url) in a dict.
            # Merging only neighbouring rows of a url-sorted result would list
            # a host twice when it shows up under both http:// and https://.
            # Then sort with Python's list.sort() and return the top n.
            # Maybe it's not quick enough or easy enough. Max heap?
            # My history is not that big.
            counts = {}
            for url, visit_count in res:
                key = urlparse(url).netloc if orderby == "host" else url
                if not key:
                    continue
                counts[key] = counts.get(key, 0) + visit_count
            uniq_res = [[key, count] for key, count in counts.items()]
            uniq_res.sort(key=lambda x: x[1], reverse=True)
            return uniq_res[:n], ""


    if __name__ == "__main__":
        ac = AnalyseChrome()

        tb, errormsg = ac.show_table('urls')
        if tb:
            for i in tb:
                print(i)

        res, errormsg = ac.top_n(20, "host")
        if res:
            for no, i in enumerate(res, 1):
                print(no, i)
        else:
            print(errormsg)
        ac.clear()
    That's a start. Next one could work out each host's share of the visits, or look at the visits within a given period...
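The two follow-up ideas above could start from something like the sketch below. It is not part of the OP's script. `host_shares`, `chrome_time_to_datetime` and `visited_between` are hypothetical helper names, and the time handling rests on two assumptions about Chrome's schema that the post does not confirm: that timestamps such as `last_visit_time` in the `urls` table are stored as microseconds since 1601-01-01 UTC, and that such a column is available to select.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Chrome timestamps count microseconds since 1601-01-01 UTC.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(microseconds):
    """Convert an assumed Chrome timestamp to an aware UTC datetime."""
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


def host_shares(host_counts):
    """Turn [(host, count), ...] into [(host, count, percent), ...]."""
    total = sum(count for _, count in host_counts)
    if total == 0:
        return []
    return [(host, count, 100.0 * count / total) for host, count in host_counts]


def visited_between(rows, start, end):
    """Keep (url, timestamp) rows whose timestamp falls in [start, end)."""
    return [
        (url, ts)
        for url, ts in rows
        if start <= chrome_time_to_datetime(ts) < end
    ]


# Made-up (host, count) pairs, shaped like the output of a per-host top-n.
for host, count, percent in host_shares([("example.com", 3), ("example.org", 1)]):
    print(f"{host}: {count} visits ({percent:.1f}%)")
```

`host_shares` takes the kind of `(host, count)` list a per-host top-n produces. `visited_between` expects `start` and `end` to be timezone-aware datetimes.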
    17 replies
    #1  Fikhtengol (OP)  Feb 12, 2013
    I haven't looked closely at the other history files yet. There should be plenty more to dig out of them. Anyone interested in digging in?
    #2  Fikhtengol (OP)  Feb 12, 2013
    Whoa, copying the code straight from my editor into the post dropped all the indentation. How do I fix that?
    #4  paloalto  Feb 12, 2013
    @Fikhtengol You can paste the code into a gist.
    #5  zythum  Feb 12, 2013
    Go to https://gist.github.com, then paste the link back here.
    #7  Fikhtengol (OP)  Feb 12, 2013
    @paloalto Add-ons... is there one for Chrome?
    #9  lowstz  Feb 12, 2013
    Use the following for the db path to cut down on hard-coding (it needs import os):
    db = os.path.expanduser('~/.config/google-chrome/Default/History')
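    #10  cyr1l  Feb 12, 2013
    I thought the OP was in the second year of junior high (初二) and got a scare. Turns out it meant today's date... but today is already the third day (初三). Even counting from when you posted, 12 hours and 57 minutes ago, it was already the third. #nevermindthedetails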
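    #11  CaoZ  Feb 12, 2013
    collections.Counter is a great tool for this.

    Does analyzing the file directly, instead of going through the browser's API, count as a hack? Still, the Python version is simple and direct...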
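As a rough illustration of the `collections.Counter` suggestion, the per-host totals could be accumulated as below. This is a sketch rather than the OP's code: `top_hosts` is a hypothetical name, and `rows` is assumed to be the `(url, visit_count)` pairs selected from the `urls` table.

```python
from collections import Counter
from urllib.parse import urlparse


def top_hosts(rows, n):
    """Sum visit counts per host and return the n largest as (host, total)."""
    totals = Counter()
    for url, visit_count in rows:
        host = urlparse(url).netloc
        if host:  # skip URLs without a host, e.g. about:blank
            totals[host] += visit_count
    return totals.most_common(n)


# Made-up rows for illustration only.
sample = [
    ("http://a.com/x", 2),
    ("https://a.com/y", 3),
    ("http://b.com/", 4),
    ("about:blank", 9),
]
print(top_hosts(sample, 2))
```

Because the totals are keyed by host, the rows do not need to be sorted first, and `most_common(n)` replaces the manual sort and slice.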
    #12  Fikhtengol (OP)  Feb 13, 2013
    Heh, I just ran pwd and copied the path straight in.
    #13  Fikhtengol (OP)  Feb 13, 2013
    #14  Fikhtengol (OP)  Feb 13, 2013
    @caoz I'm afraid that writing too much Python will make me go soft; the data structures are just too convenient...
    #15  oppih28  Feb 13, 2013 via iPhone
    #16  cloverstd  Feb 14, 2013
    database is locked
    #17  Fikhtengol (OP)  Feb 14, 2013
    @cloverstd Close Chrome; the database is locked by Chrome while it is running.
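Closing Chrome is the fix given in the thread. A possible alternative, which nobody in the thread suggested, is to query a copy of the file instead of the live one. The sketch below assumes that approach, and `open_history_copy` is a hypothetical helper. A copy taken while Chrome is running may miss very recent visits or, if taken mid-write, be inconsistent, so closing Chrome remains the safer route.

```python
import os
import shutil
import sqlite3
import tempfile


def open_history_copy(db_path):
    """Copy the History file into a fresh temp dir and open the copy.

    Returns (connection, temp_dir). The caller closes the connection and
    removes temp_dir when done.
    """
    tmp_dir = tempfile.mkdtemp(prefix="chrome-history-")
    copy_path = os.path.join(tmp_dir, "History")
    shutil.copy2(db_path, copy_path)
    return sqlite3.connect(copy_path), tmp_dir
```

Typical use would be to call `open_history_copy()` with the expanded History path, run the queries on the returned connection, then call `conn.close()` and `shutil.rmtree(tmp_dir)`.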