Let's say I have a sitemap.xml file with this data:
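<url>
<loc>http://domain.com/pag1</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag2</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>http://domain.com/pag3</loc>
<lastmod>2012-08-25</lastmod>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>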
I want to extract all the locations from it (the data between <loc> and </loc>).
The output should look like this:
http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3
How can I do this?
Answer
You can use a Python script for this.
This script prints any link that starts with http and sits between two tags:
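import re

# Read the sitemap line by line and print every http/https URL found between tags.
# [^<]+ stops at the next "<", so several tags on one line are matched separately.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(https?://[^<]+)<', line):
            print(url)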
In your case, the following script finds all the data wrapped in <loc> tags:
import re

# Print only the URLs that are wrapped in <loc>...</loc>
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(https?://[^<]+)</loc>', line):
            print(url)
Here is a nice tool for experimenting with regular expressions if you are not familiar with them.
If you need to load a remote file, you can use the following code:
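A pattern can also be tried directly in the Python interpreter before running it on the real file. The sketch below uses a hypothetical one-line sample (not taken from the sitemap above) and compares a greedy `.+` with the stricter `[^<]+`, which matters when several <loc> tags share a line:

```python
import re

# Hypothetical sample: two <loc> entries on the same line
sample = '<loc>http://domain.com/pag1</loc><loc>http://domain.com/pag2</loc>'

# Greedy ".+" runs to the last </loc> on the line, so both URLs merge into one match
greedy = re.findall(r'<loc>(http://.+)</loc>', sample)

# "[^<]+" stops at the next "<", so each URL is matched separately
strict = re.findall(r'<loc>(http://[^<]+)</loc>', sample)

print(greedy)
print(strict)
```

When every <loc> sits on its own line, as in the question, both patterns give the same result.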
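import re
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen

# Download the sitemap (assumed to be UTF-8) and print every URL wrapped in <loc> tags
text = urlopen('http://server.com/sitemap.xml').read().decode('utf-8')
for url in re.findall(r'<loc>(https?://[^<]+)</loc>', text):
    print(url)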
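As an alternative to regular expressions, the same extraction can be sketched with the standard library's XML parser, which does not depend on how the file is broken into lines. This is only a sketch: the helper name `extract_locs` and the sample string are hypothetical, and it assumes the file declares the standard sitemaps.org namespace on its root element.

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol; assumed to be declared in the file
SITEMAP_NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip()
            for loc in root.findall('.//sm:loc', SITEMAP_NS)
            if loc.text]

# Hypothetical sample mirroring the data in the question
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://domain.com/pag1</loc><lastmod>2012-08-25</lastmod></url>
  <url><loc>http://domain.com/pag2</loc><lastmod>2012-08-25</lastmod></url>
</urlset>"""

for url in extract_locs(sample):
    print(url)
```

If the sitemap has no namespace declaration, the `sm:` prefix would match nothing, and a plain `.//loc` search would be needed instead.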