Sunday, 11 August 2019

url - Extract Links from a sitemap(xml)


Let's say I have a sitemap.xml file with this data:



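<url>
  <loc>http://domain.com/pag1</loc>
  <lastmod>2012-08-25</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
</url>
<url>
  <loc>http://domain.com/pag2</loc>
  <lastmod>2012-08-25</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
</url>
<url>
  <loc>http://domain.com/pag3</loc>
  <lastmod>2012-08-25</lastmod>
  <changefreq>weekly</changefreq>
  <priority>0.9</priority>
</url>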


I want to extract all the locations from it (the data between <loc> and </loc>).


The output should look like this:


http://domain.com/pag1
http://domain.com/pag2
http://domain.com/pag3

How can I do this?



Answer



You can use a Python script here.


This script gets any link that starts with http:// and sits directly between two tags:


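import re

# Python 3: print every http:// URL found between '>' and '<' on each line.
# The non-greedy .+? stops at the first '<' instead of the last one.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'>(http://.+?)<', line):
            print(url)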

In your case, the following script finds all data wrapped in <loc> tags:


import re

# Python 3: print only the URLs wrapped in <loc>...</loc>.
with open('sitemap.xml', 'r') as f:
    for line in f:
        for url in re.findall(r'<loc>(http://.+?)</loc>', line):
            print(url)

Here is a nice tool to play with regular expressions if you are not familiar with them.
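
To see what the <loc> pattern captures without touching a file, a small sketch like the one below can be run in Python 3. The sample string and the names SAMPLE and extract_locs are made up for illustration. It also suggests why a non-greedy .+? is the safer choice when several tags share one line, as in a minified sitemap.

```python
import re

# Hypothetical sample: two entries on a single line, as in a minified sitemap.
SAMPLE = (
    "<url><loc>http://domain.com/pag1</loc><priority>0.9</priority></url>"
    "<url><loc>http://domain.com/pag2</loc><priority>0.9</priority></url>"
)

def extract_locs(text):
    """Return every http:// URL wrapped in <loc>...</loc> (non-greedy match)."""
    return re.findall(r"<loc>(http://.+?)</loc>", text)

# For comparison: a greedy .+ runs to the LAST </loc> on the line,
# so both entries collapse into one oversized match.
greedy = re.findall(r"<loc>(http://.+)</loc>", SAMPLE)

print(extract_locs(SAMPLE))  # two separate URLs
print(len(greedy))           # a single merged match
```

When every <loc> element sits on its own line, as in the question, both variants return the same result; the difference only shows up once tags share a line.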


If you need to load a remote file, you can use the following code:


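import re
from urllib.request import urlopen

# Python 3: urllib2 became urllib.request. The response yields bytes,
# so each line is decoded before the regex is applied.
with urlopen('http://server.com/sitemap.xml') as f:
    for raw in f:
        line = raw.decode('utf-8')
        for url in re.findall(r'<loc>(http://.+?)</loc>', line):
            print(url)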
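
The regular expressions above work for a simple sitemap, but they skip https:// links and any element split across lines. As an alternative sketch (not part of the original answer), the standard-library xml.etree.ElementTree parser can read the locations instead. This assumes the sitemap declares the usual sitemaps.org 0.9 namespace; the names SITEMAP_NS, SAMPLE, and loc_urls are illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumption: the sitemap uses the standard sitemaps.org 0.9 namespace.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical in-memory sitemap used only for this demonstration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://domain.com/pag1</loc>
    <lastmod>2012-08-25</lastmod>
  </url>
  <url>
    <loc>https://domain.com/pag2</loc>
  </url>
</urlset>
"""

def loc_urls(xml_text):
    """Return the text of every <loc> element under <url>, in document order."""
    # Parse from bytes so the encoding declaration in the prolog is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]

for url in loc_urls(SAMPLE):
    print(url)
```

For a file on disk, ET.parse('sitemap.xml').getroot() gives the same kind of root element, and the findall call stays unchanged.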
