How do I get the title of a web page in Python?

Docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
I used bs4 as @kehr suggested, following the example from that link.
soup.title.string
# u'The Dormouse's story'

I still ran into garbled text while using it: the title I get from this site with this method comes out garbled. Is there a fix?
http://www.joy.cn/

URL

Title

Source

24 replies 2017-08-30 17:21:12 +08:00

nervouna

Apr 18, 2014

Are you sure you only need the title? That is about the simplest thing you can scrape from a page...
Google Beautiful Soup

ericls

Apr 18, 2014 via Android

Typing this on my phone:
from pyquery import PyQuery as pq

d = pq(url=url)
d('title').text()

Not sure if it's right.

fay

Apr 18, 2014

While building my own site, 莲蓬网, I needed to fetch page titles and other metadata, so I extracted that part of the code: https://github.com/fay/pagemeta

tonghuashuai

Apr 18, 2014

#!/usr/bin/env python
# coding: utf-8
# Python 2 (urllib.urlopen, print statement)

import re
import urllib


def get_title(url):
    title = ''
    c = urllib.urlopen(url)
    html = c.read()

    # Capture only the text between the tags; allow attributes,
    # any letter case, and newlines inside the title.
    p = r'<title[^>]*>(.*?)</title>'
    target = re.findall(p, html, re.I | re.S)

    if target:
        title = target[0].strip()

    return title


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    title = get_title(url)
    print title

A simple implementation; no exception handling.

tonghuashuai

Apr 18, 2014

Oops, editor line numbers ended up in my paste.

davidli

Apr 18, 2014

The post above has it right.

A site's title should always be inside <title></title>, right?
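
For reference, the "title lives in <title>" idea can also be sketched with only the standard library, without a regex. This is a rough Python 3 sketch, not code from anyone in the thread; the names TitleParser and extract_title are made up for illustration, and it assumes the page has already been decoded to text.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> matches too.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the stripped title text, or None if there is no <title>."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    title = "".join(parser.parts).strip()
    return title or None
```

Unlike a plain string split, this tolerates attributes on the tag and upper-case tag names, and it returns None instead of raising when the page has no title.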

yhf

Apr 18, 2014 via iPhone

With BeautifulSoup you can grab whichever node you like.

lifemaxer

Apr 18, 2014

Taking the current page as an example:
soup = BeautifulSoup(content)  # content is the HTML of the current page
a = soup.find_all('title')[0].get_text()

hao1032

Apr 18, 2014

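@tonghuashuai
This will break in real-world use: the encoding of the title you get back is unknown, so you also have to read the charset declared in the page and then decode with it.

@lifemaxer I'll try this one and see whether the garbled-text problem shows up.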

hao1032

Apr 18, 2014

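@tonghuashuai
This will break in real-world use: the encoding of the title you get back is unknown, so you also have to read the charset declared in the page and then decode with it.
Even worse, the charset declared in the page is not necessarily correct (I have run into sites like that), and decoding with it then fails.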
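
The "read the declared charset, then decode" step might look roughly like the following. It is a standard-library Python 3 sketch under stated assumptions: declared_charset and decode_page are hypothetical helper names, the regex only covers the common <meta> forms, and undecodable bytes are replaced rather than raising, which hides (but does not fix) a wrong declaration.

```python
import codecs
import re

# Matches charset=... in <meta charset="gbk"> and in
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
META_CHARSET_RE = re.compile(br'<meta[^>]+charset\s*=\s*["\']?\s*([\w.:-]+)', re.I)


def declared_charset(body, content_type=None):
    """Return the charset a page claims to use, or None if unknown/invalid."""
    candidates = []
    if content_type:
        m = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type, re.I)
        if m:
            candidates.append(m.group(1))
    # Only scan the start of the document, where <meta> is expected.
    m = META_CHARSET_RE.search(body[:4096])
    if m:
        candidates.append(m.group(1).decode("ascii"))
    for name in candidates:
        try:
            codecs.lookup(name)  # reject names Python has no codec for
            return name
        except LookupError:
            continue
    return None


def decode_page(body, content_type=None, default="utf-8"):
    """Decode raw bytes with the declared charset, replacing undecodable bytes."""
    charset = declared_charset(body, content_type) or default
    return body.decode(charset, errors="replace")
```

Here body is the raw bytes of the response and content_type is the value of the Content-Type header, if one was sent. The header is checked before the <meta> tag; that ordering is a choice made for this sketch.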

Crossin

Apr 18, 2014

@hao1032 Then use chardet to detect the encoding.
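
chardet does statistical detection through chardet.detect(). The sketch below is not chardet; it is a much cruder standard-library stand-in (Python 3) that simply tries a few encodings in order and keeps the first one that decodes cleanly. The name guess_decode and the candidate list are assumptions chosen for Chinese pages, and trial decoding can guess wrong, especially on short inputs.

```python
def guess_decode(body, candidates=("utf-8", "gb18030", "big5")):
    """Try each candidate encoding strictly; return (text, encoding_used)."""
    for name in candidates:
        try:
            return body.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises.
    return body.decode("latin-1"), "latin-1"
```

UTF-8 goes first because it is the strictest of the three: bytes in another encoding rarely happen to be valid UTF-8, while the reverse is not true.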

lm902

Apr 19, 2014

text.split("<title>")[1].split("</title>")[0]

kehr

Apr 20, 2014

I just did some web scraping. I recommend BeautifulSoup; it's painless.

Docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

hao1032

May 24, 2014

@kehr I'm using bs4, following the example from the link you posted.
soup.title.string
# u'The Dormouse's story'

I still ran into garbled text while using it: the title I get from this site with this method comes out garbled. Is there a fix?
http://www.joy.cn/

cloverstd

May 25, 2014

The garbled text may be because the response is gzip-compressed.
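
If that is the cause, the body has to be decompressed before any decoding or parsing. urllib does not undo Content-Encoding: gzip for you, whereas requests normally does. A minimal Python 3 sketch, assuming the raw bytes are already in hand; maybe_gunzip is a made-up helper name, and it checks the gzip magic bytes rather than the response header:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body):
    """Return decompressed bytes if body looks gzip-compressed, else body unchanged."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body
```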

binux

May 25, 2014

@hao1032 requests

caomu

May 25, 2014

1. Is there a similar service online?
> For this one, what comes to mind is YQL...

Sylv

May 25, 2014

@hao1032

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.joy.cn")
r.encoding = requests.utils.get_encodings_from_content(r.content)[0]
soup = BeautifulSoup(r.text)
print soup.title.string

Reference: http://liguangming.com/python-requests-ge-encoding-from-headers

ccbikai

May 25, 2014 via Android

@Sylv Reply #18 uses the same approach as mine. If the output is still garbled, change the last line to print soup.title.get_text()

dbow

May 25, 2014

Using lxml with an XPath expression is the fastest: "/html/head/title"
i.e.
from lxml import etree
etree.HTML(content).xpath("/html/head/title")[0].text  # .text gives the title string

diaoleona

May 27, 2014

@dbow Couldn't agree more.

hao1032

May 29, 2014

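@Sylv That one needs an extra import. I read this article: http://www.cnblogs.com/todoit/archive/2013/04/08/3008513.html
and added from_encoding="gb18030"; in my test it works for this site.
If I run into problems later, I'll switch to your method.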
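
One likely reason gb18030 works here: it is a superset of GBK, which in turn is a superset of GB2312, so it can decode pages labelled with either of the older names. A small Python 3 illustration using made-up sample strings, not data from the site in question:

```python
# Bytes as a GB2312-labelled page and a GBK-labelled page would send them.
gb2312_bytes = "网页标题".encode("gb2312")
gbk_bytes = "網頁標題".encode("gbk")  # traditional characters: in GBK, not in GB2312

# The gb18030 codec decodes both byte strings correctly.
simplified = gb2312_bytes.decode("gb18030")
traditional = gbk_bytes.decode("gb18030")

# Decoding the GBK bytes strictly as gb2312 fails, which is how a page that
# declares gb2312 but contains GBK-only characters ends up garbled or erroring.
try:
    gbk_bytes.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False
```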

xiaowangge

Sep 5, 2014

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests
from lxml import html

response = requests.get('http://www.joy.cn/')

# Parse the body into a tree
parsed_body = html.fromstring(response.text)

print ''.join(parsed_body.xpath('//title/text()'))

funnybunny00

Aug 30, 2017

A godsend.