
a.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<article mdate="2017-05-28" key="journals/acta/Saxena96">
  <author>Sanjeev Saxena</author>
  <author>q</author>
  <author>w</author>
  <title>Parallel Integer Sorting and Simulation Amongst CRCW Models</title>
  <pages>607-619</pages>
  <year>1996</year>
  <volume>33</volume>
  <journal>Acta Inf</journal>
  <number>7</number>
  <url>db/journals/acta/acta33.html#Saxena96</url>
  <ee>htt</ee>
</article>

Python code:
from xml.sax.handler import ContentHandler, EntityResolver
from xml.sax import parse
from itertools import combinations


class DBLP(ContentHandler, EntityResolver):
    passthrough = False
    paper_authors = []
    currTag = ''

    def startElement(self, name, attrs):
        if name == 'article':
            self.passthrough = True
        elif name == 'author' and self.passthrough:
            self.currTag = 'author'

    def endElement(self, name):
        if name == 'article':
            self.passthrough = False
            self.generate_paper_info()
            self.paper_authors = []
        elif name == 'author':
            self.currTag = ''

    def characters(self, chars):
        if self.passthrough and self.currTag == 'author':
            self.paper_authors.append(chars)

    def generate_paper_info(self):
        with open('dblp.txt', 'w') as f:
            if len(self.paper_authors) < 2:
                print 'Only one author'
            else:
                for info in combinations(self.paper_authors, 2):
                    f.write('{0} {1}\n'.format(info[0], info[1]))
                    print 'Write one piece of cooperation user'


parse('a.xml', DBLP())

Error message:
Traceback (most recent call last):
  File "dblp.py", line 40, in <module>
    parse('a.xml', DBLP())
  File "E:\Python2.7.12\lib\xml\sax\__init__.py", line 33, in parse
    parser.parse(source)
  File "E:\Python2.7.12\lib\xml\sax\expatreader.py", line 110, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "E:\Python2.7.12\lib\xml\sax\xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "E:\Python2.7.12\lib\xml\sax\expatreader.py", line 213, in feed
    self._parser.Parse(data, isFinal)
  File "E:\Python2.7.12\lib\xml\sax\expatreader.py", line 397, in external_entity_ref
    "")
  File "E:\Python2.7.12\lib\xml\sax\saxutils.py", line 349, in prepare_input_source
    f = urllib.urlopen(source.getSystemId())
  File "E:\Python2.7.12\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "E:\Python2.7.12\lib\urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "E:\Python2.7.12\lib\urllib.py", line 469, in open_file
    return self.open_local_file(url)
  File "E:\Python2.7.12\lib\urllib.py", line 483, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] : 'dblp.dtd'

Divider ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
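<!DOCTYPE dblp SYSTEM "dblp.dtd">

Removing the line above from a.xml makes the error go away. I thought of overriding EntityResolver so that the parser ignores the DTD, but I don't know how to write the override.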
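
One possible way to write that override, as a minimal sketch rather than a definitive fix: resolveEntity() can return an InputSource backed by an empty in-memory stream, so the parser never tries to open dblp.dtd. Note that xml.sax.parse() only installs the content handler, so the resolver has to be registered with setEntityResolver() on a parser obtained from make_parser(). The names AuthorHandler and parse_authors are made up for this illustration, and the handler only collects author names in memory instead of writing dblp.txt.

```python
from io import BytesIO
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, EntityResolver
from xml.sax.xmlreader import InputSource


class AuthorHandler(ContentHandler, EntityResolver):
    """Hypothetical handler: collects <author> text for each <article>."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.papers = []       # one list of author names per article
        self._authors = None   # authors of the article being read, else None
        self._text = None      # text chunks of the current <author>, else None

    def resolveEntity(self, publicId, systemId):
        # Return an empty in-memory stream instead of letting the parser
        # open systemId ("dblp.dtd"), so no file or URL is touched.
        source = InputSource()
        source.setByteStream(BytesIO(b""))
        return source

    def startElement(self, name, attrs):
        if name == "article":
            self._authors = []
        elif name == "author" and self._authors is not None:
            self._text = []

    def characters(self, content):
        # characters() may deliver a single text node in several chunks.
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "author" and self._text is not None:
            self._authors.append("".join(self._text))
            self._text = None
        elif name == "article" and self._authors is not None:
            self.papers.append(self._authors)
            self._authors = None


def parse_authors(source):
    """Parse a file name or file-like object; return the author lists."""
    handler = AuthorHandler()
    parser = make_parser()
    parser.setContentHandler(handler)
    # Register the resolver explicitly; xml.sax.parse() would not do this.
    parser.setEntityResolver(handler)
    parser.parse(source)
    return handler.papers
```

With the a.xml from the question, parse_authors('a.xml') should return a single list holding the three author names. Another option that should have the same effect is to leave the resolver alone and call parser.setFeature(feature_external_ges, False), with feature_external_ges imported from xml.sax.handler, which tells the parser not to load external entities at all.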