An archive. Websites can be archived too!

Case Study: How to archive a website created in Drupal 8 to HTML?

21.09.2018

What to do with an old, outdated website that you would like to keep online? The perfect solution is to archive it to pure HTML code. We will demonstrate it on the example of a drupalcamp.pl website created in Droopler, based on Drupal 8.

Why archive pages at all?

Sometimes websites have their expiration date. It may result from the life cycle of the technology used to build it or simply because the website was created for an event or some special occasion. When you organise a music festival, for example, the website is no longer up to date and necessary after it ends. When you have long forgotten websites on your server, their code may be so outdated that it will turn into a threat in the future. If, for some reason, you want to keep your websites on the Internet, you have to take into account the cost of their constant maintenance and updating.

What are the costs of an unused website?

The cost of maintenance depends to a large extent on the technology used. Let's focus on Drupal 8 since it is one of the safest CMS available on the market. Updates to D8 are published monthly, and every six months a version containing new functionalities is published. This means you need to install a fresh release of Drupal 12 times a year and test our website to see if it's still working, so you can stay on top of updates. We know from experience that this can be very time-consuming.

On the other hand, if you decide against upgrading, your website is now at risk of being attacked and poses a threat to other websites on your server. Shortcomings in the security department may lead to much higher costs than updating your code on an ongoing basis.

The question arises – how to avoid maintenance costs and keep the website online? A great compromise between sentiment and cost-effectiveness is the conversion to pure HTML code.

What are the advantages and disadvantages of pure HTML?

Deploying websites written in pure HTML is in a sense a return to the roots. In the age of advanced CMSs, hardly anyone remembers that a website can be created without the use of server-side interpreted languages, such as PHP.

Why writing pages using HTML only was forgotten?

  • Due to difficulties in updating their content.
  • Because it is not possible to reuse the code for global elements (such as header, main menu, footer).
  • Due to the static nature of HTML, which makes it difficult to create administrative pages.

So why convert an unused Drupal 8 website to pure HTML?

  • This will result in a rapid increase in the speed of operation of all subpages, including those which have been the slowest so far.
  • It will be very difficult to attack the website if you configure the server properly.
  • Updating the code will become completely unnecessary, the cost of maintenance will be practically zero.

What will be the limitations of a website converted to HTML?

  • Changes to the content will become more time-consuming. The developer will include them in a local copy and then generate the HTML version for publication on the server.
  • Dynamic elements such as forms will stop working.

How to adapt a website for archiving?

Not every website is suitable for archiving right away. First of all, you should make sure that none of the subpages contains any elements requiring PHP scripts to work:

  • Contact forms (they can be replaced with embedded Google Forms).
  • Search engines (they can be replaced with Google search on the website).
  • Views Exposed Filters.
  • AJAX in views.

It is also necessary to disable error messages sent by the server – especially when copying a website from localhost. During archiving you should use settings as similar to production as possible, including CSS/JS aggregation and lack of additional diagnostic information generated by Twig.

At the beginning of the article, we promised to describe the conversion to HTML, based on a real example. Therefore, we are going to present the method of archiving the drupalcamp.pl website, dedicated to the DrupalCamp 2018 conference organised by us. This is a cyclical event, but each subsequent year we prepare a completely new website. Once DrupalCamp has taken place, we leave this page up as a souvenir, archived to HTML at a separate address.

The website of DrupalCamp conference in Wrocław.

What preparations did drupalcamp.pl require? First of all, we removed the subpages with the forms, which were no longer needed, since the conference has already ended. We made sure that all views worked without AJAX on the programme subpages. We also carried out a quick JS audit to eliminate potential code issues when PHP is disabled.

The archiving process

We use the httrack software, which is available under the GNU GPL3 license, in order to automatically archive pages. It is available for Windows, Linux, OSX and Android. We use httrack via a Linux console. There are plenty of switches and options available in the documentation, here is our command to make a 1:1 copy of a website, while maintaining the link structure:

httrack http://example.com -O output_dir --disable-security-limits --max-rate=99999999999 -K3 -X -%P -wqQ%v --robots=0 -N "%h%p/%n.%t"
  • --disable-security-limits - disables the built-in transfer limits, this is useful when our local server is the source.
  • --max-rate - sets the maximum transmission speed.
  • -%P - tries to recognise all possible links to files on the website.
  • -K3 - does not change the links on the pages.
  • -N "%h%p/%n.%t" - does not change file names.
  • -X - on the next command, deletes files from the archived version that were deleted in the original.
  • -wqQ%v - standard mode, silent, with a list of processed files on the screen.

The resulting page image is not yet completely finished. The subpages are in files such as about-us.html, instead of about-us/index.html. A simple script will fix this problem:

#!/bin/bash
for f in $(find output_dir/example.com -name "*.html" -type f); do 
	if [[ $f = *"/index.html" ]]; then
		echo "Omitting $f"		
	else
		echo "Processing $f"
		mkdir "${f%.html}"
		mv $f "${f%.html}/index.html"
	fi
done

The copy created in this way will be indistinguishable from the original to the observer. This is important for the preservation of existing positions in Internet search engines.

Httrack’s compatibility with Drupal

Drupal 8 is not fully compatible with httrack. The main problem is the responsive images presented via the <picture> tag. Proper conversion to HTML requires providing httrack with hints for additional downloads.

The drupalcamp.pl website that we archived is based on Droopler, an in-house, free of charge distribution of Drupal 8. In version 1.3 of Droopler, we have implemented full support for httrack, which helps with identifying and downloading all graphics files used on the website.

How did we “improve” the compatibility with httrack? We used a very simple solution in the form of hooks preparing a list of files to download. Hints for the bot are placed in the <head> section of the page, as subsequent <link> elements:

<link href="/sites/default/files/styles/responsive_image_2000/public/blog/node_52/35080887_779262195604057_3638740630118596608_o.jpg?itok=YkFsAytN" rel="droopler:c0527d:img0" />
<link href="/sites/default/files/styles/responsive_image_1200/public/blog/node_52/35080887_779262195604057_3638740630118596608_o.jpg?itok=OEsKzsbg" rel="droopler:c0527d:img1" />

These elements are recognised by httrack and downloaded to the copy. Thanks to this, we can maintain full responsiveness of the images. The excess code is usually deleted from the console by means of a regular expression.

Conversion results

The result of conversion to HTML is very satisfactory in our case. We’ve got a folder with files of a total size of about 20 MB. As one would expect, the access time to an HTML file is a few milliseconds, meaning that it is negligible. It remains very low even under heavy loads. So far, this value on the production server has oscillated around 200ms (of course for users who weren’t logged in, with active cache). Under the load, it increased to about 700ms.

We checked the correctness of export using Screaming Frog SEO Spider software. It did not detect 404 errors, which means that the archiving was 100% successful. Also, the browser consoles do not show any JS errors.

It can be expected that in the next few days the DrupalCamp 2018 website will be finally retired and replaced by pure HTML version. In this way, we will avoid the need to update it and, therefore, we won’t incur additional costs. If there is a need to make changes to the content, we will make them on a local version, based on Drupal, and then automatically generate a new HTML page. We encourage you to take advantage of our experience!

Need a team of Drupal and PHP web development experts?

Contact us now!