How to keep all your websites in sync with scraping technology
Introduction
Many of our clients maintain multiple websites, or multiple domains within a single website. For example, "store.example.com" and "www.example.com" will almost always be different websites. In many instances, the different websites are maintained by different vendors; however, most have at least some elements in common, such as logos, headers, footers, etc.
Often, when deploying a new site, these shared elements are simply copied and pasted from the main site. This is quick and easy to do, but the downside of this approach is that when someone inevitably changes a logo or other shared element on the main site, the element that should be universal across both sites will go out of sync. Such "quiet" errors might go unnoticed for long stretches of time. When they are noticed, it might take some digging to figure out what changed and whether any other websites or subdomains have been affected.
How often does this happen? A quick Google search for "store." turned up store.playstation.com. Here are screenshots of the footers of store.playstation.com and www.playstation.com:
[Screenshot: footer of store.playstation.com]
[Screenshot: footer of www.playstation.com]
Now, these two footers might be different for deliberate reasons that we're not aware of, but most likely this is just a copy-and-paste that has gone out of sync.
Case in point: a client of ours recently decided to change the look and feel of their main site's header and footer. That site is supported by a different vendor, but to preserve uniformity, they asked us to make analogous changes to the header and footer of a subdomain website that we support. To avoid synchronization problems like those described above, we decided to have the subdomain dynamically pull in (or "scrape") the header and footer from the main site. Since we wanted the change to be reflected on all pages of the subdomain, we decided to add a template tag to the subdomain's base template. If you're unfamiliar with template tags in Django, you might want to check out "How to create custom template tags and filters" from the Django Project website.
Part I - The Template Tag
First, we added a module "app_tags.py" to the "templatetags" directory of the relevant app. Here's a starting version of the code:
my_app/templatetags/app_tags.py
import requests
from bs4 import BeautifulSoup
from django import template

register = template.Library()

@register.filter
def get_external(domain):
    resp = requests.get(domain)
    soup = BeautifulSoup(resp.text, features="html.parser")
    header = soup.find("header")
    return str(header)
This function uses the excellent Requests library for making the request to the external URL, and the equally excellent Beautiful Soup for parsing the resulting HTML. In our case, the header we wanted to copy was conveniently enclosed in <header> tags -- hence the call to soup.find("header"). The return value of find() is a Beautiful Soup object, but all we want from it right now is the raw HTML from <header> to </header>. We can obtain this by converting the object into a string, and then we can just plug it into our template. Speaking of that template, let's look at it now.
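If you want to sanity-check this logic outside of Django first, the same Requests and Beautiful Soup calls can be run from a plain Python shell. The URL below is just a placeholder, and not every page will have a <header> element:

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")  # placeholder URL
soup = BeautifulSoup(resp.text, features="html.parser")
header = soup.find("header")

# find() returns None if the tag is absent, so guard before converting
if header is not None:
    print(str(header)[:200])  # first 200 characters of the raw HTML
else:
    print("No <header> tag found on this page.")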
Part II - The Base Template
Let's use this as our starting base template:
templates/base_template.html
{% load app_tags %}
{% with domain='https://example.com' %}
{# DOCTYPE, <html> tag, <head> tags, etc. #}
<body>
{{ domain|get_external|safe }}
...
{# rest of page content #}
...
</body>
{% endwith %}
Note that we stored the URL of the external "main" website in the domain variable. This saves us from typing out the value every time we need it, which is less error prone and also makes the value easy to change. That could come in handy if, for example, the client were to make further changes to the header in their dev environment and we wanted to mirror those changes in our own dev environment.
We {% load %} the template tag we created above, and when we reach the spot for our own header, all we have to do is send domain through the get_external() function. We also pipe the output through Django's built-in safe filter so that the HTML will not be escaped.
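As a side note, if you'd rather not hard-code the URL in the template at all, one option (not something we did here) is to keep it in Django settings and expose it with a simple tag. The EXTERNAL_SITE_DOMAIN setting name below is hypothetical:

my_app/templatetags/app_tags.py (hypothetical addition)

from django import template
from django.conf import settings

register = template.Library()

@register.simple_tag
def external_domain():
    # EXTERNAL_SITE_DOMAIN is an assumed setting, e.g. in settings.py:
    # EXTERNAL_SITE_DOMAIN = "https://example.com"
    return getattr(settings, "EXTERNAL_SITE_DOMAIN", "https://example.com")

The template could then read {% external_domain as domain %} instead of hard-coding the value in the {% with %} tag, letting each environment (dev, staging, production) point at its own copy of the main site.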
And that's pretty much the basic idea. However, we discovered an extra wrinkle in the case of our client...
Part III - Replacing Relative Paths with Absolute Paths
The external header included some images and links whose src and href attributes pointed to relative paths. When we plopped those same relative paths into our template on the subdomain, we got a lot of broken images and links. For those things to work correctly, we needed to replace the relative paths with absolute paths.
We used Beautiful Soup again to find any tags with a src or href attribute that began with a forward slash ("/"), indicating a relative path. To do this, we passed a regular expression (regex) to Beautiful Soup's find_all() method.
To any value matching the regex, we then prepended the value of domain as passed in from the template. This gave us an absolute path to the linked resource.
The "domain" variable is just a string, so we could have just used a +
operator to concatenate it with the relative path, taking care to put exactly one forward slash in between. However, we wanted our code to reflect an awareness of the nature and meaning of the data being manipulated, so instead of doing that, we joined the pieces together using Python's built-in urlparse
and urlunparse
methods. Passing a url to urlparse()
returns a six-part "ParseResult" object that separates out the URL's scheme ("http" or "https"), its network location (subdomain, domain, etc), and other components. We can then call that object's _replace()
method to change one or more of these components. In our case that meant providing the relative path to the resource we wanted to link to. Afterwards, calling urlunparse()
on the modified object returns the absolute path we're looking for.
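Here's a quick illustration of that round trip, using a made-up URL and path:

from urllib.parse import urlparse, urlunparse

parsed = urlparse("https://example.com")
print(parsed)
# ParseResult(scheme='https', netloc='example.com', path='',
#             params='', query='', fragment='')

# Swap in a relative path scraped from the page, then reassemble
absolute = urlunparse(parsed._replace(path="/images/logo.png"))
print(absolute)  # https://example.com/images/logo.png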
Let's revisit the template tag code to see what those changes look like:
my_app/templatetags/app_tags.py
import re
from urllib.parse import urlparse, urlunparse

import requests
from bs4 import BeautifulSoup
from django import template

register = template.Library()

# Matches attribute values that begin with a forward slash, i.e. relative paths
rel_path_regex = re.compile("^/")

@register.filter
def get_external(domain):
    resp = requests.get(domain)
    soup = BeautifulSoup(resp.text, features="html.parser")
    header = soup.find("header")
    parsed_domain = urlparse(domain)
    # Convert image tags with a relative "src" to absolute
    for tag in header.find_all("img", src=rel_path_regex):
        tag.attrs["src"] = urlunparse(parsed_domain._replace(path=tag.attrs["src"]))
    # Convert anchor tags with a relative "href" to absolute
    for tag in header.find_all("a", href=rel_path_regex):
        tag.attrs["href"] = urlunparse(parsed_domain._replace(path=tag.attrs["href"]))
    return str(header)
Conclusion
In the introduction to this post, we said that we were tasked with bringing over both the header and the footer for use on the subdomain. Naturally, we parsed out the footer in the same way, by assigning a variable footer to the return value of soup.find("footer"). But we also wanted to avoid calling the function twice in the template, since doing so would mean making two identical requests to the external website. We solved this by having get_external() return a Python dictionary containing both the header and the footer; then in the template, we surrounded both the header and footer sections in a {% with %} block, like this:
templates/base_template.html
{% load app_tags %}
{% with domain='https://example.com' %}
{# DOCTYPE, <html> tag, <head> tags, etc. #}
<body>
{% with externals=domain|get_external %}
{{ externals.header|safe }}
...
{# rest of page content #}
...
{{ externals.footer|safe }}
...
{% endwith %}
</body>
{% endwith %}
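For completeness, here's a sketch of what that dictionary-returning version of get_external() might look like. The original post doesn't show it; this assumes the same imports and rel_path_regex as before, with the path-rewriting logic factored into a helper:

my_app/templatetags/app_tags.py

@register.filter
def get_external(domain):
    resp = requests.get(domain)
    soup = BeautifulSoup(resp.text, features="html.parser")
    parsed_domain = urlparse(domain)

    def absolutize(section):
        # Rewrite relative src/href attributes within a single section,
        # then return that section's raw HTML as a string
        if section is None:
            return ""
        for tag in section.find_all("img", src=rel_path_regex):
            tag.attrs["src"] = urlunparse(parsed_domain._replace(path=tag.attrs["src"]))
        for tag in section.find_all("a", href=rel_path_regex):
            tag.attrs["href"] = urlunparse(parsed_domain._replace(path=tag.attrs["href"]))
        return str(section)

    return {
        "header": absolutize(soup.find("header")),
        "footer": absolutize(soup.find("footer")),
    }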
Finally, you might be wondering whether we should really be scraping the main site every single time a page under the subdomain is requested. For this client, we decided that we could get away with this approach because internet traffic to the subdomain amounted to only a few dozen hits per day. However, for a more heavily trafficked website, one would want to use some sort of caching system. Such a system could also ensure that the subdomain's headers remained in place even if connectivity to the external site were disrupted, or if the maintainers of the external site made an unpredictable change to their page's HTML, such as removing the <header> or <footer> tags that we use to locate those sections of the document.
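We didn't need this for our low-traffic subdomain, but as a sketch of the idea, Django's low-level cache API could wrap the scrape so that the external site is hit at most once per timeout period. The scrape_external() helper, cache key, and timeout below are all hypothetical names, not part of our actual code:

my_app/templatetags/app_tags.py (hypothetical caching variant)

from django.core.cache import cache

CACHE_KEY = "external_site_sections"  # arbitrary cache key
CACHE_TIMEOUT = 60 * 15               # fifteen minutes; tune to taste

@register.filter
def get_external_cached(domain):
    sections = cache.get(CACHE_KEY)
    if sections is None:
        # scrape_external() stands in for the scraping logic shown above
        sections = scrape_external(domain)
        cache.set(CACHE_KEY, sections, CACHE_TIMEOUT)
    return sections

Serving a stale copy when the external site is unreachable would take a bit more work -- for example, catching Requests exceptions and falling back to a never-expiring backup entry.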
We hope you have enjoyed this tour through one of our little problems! Happy coding!
Project Requirements
- Django (https://pypi.org/project/Django/)
- Requests (https://pypi.org/project/requests/)
- BeautifulSoup (https://pypi.org/project/beautifulsoup4/)