SWC Blog

How to save a web page as PDF, HTML or an image/screenshot (after running the javascript)

We need to save our web pages (of hikes, complete with maps) as PDFs, so they work offline, i.e. we need to save the page after all the javascript has run and the map tiles have loaded.

Previously, we used PhantomJS. It was designed for testing (save the page as HTML, and perform tests to make sure it has rendered correctly), but it also had a save as PDF facility. However, there were issues with the way the maps were displayed, and some inconsistencies with the CSS.

Now there is a much easier way. Run Google Chrome from the command line, in 'headless' mode, with the --print-to-pdf command line option. There is also a --dump-dom option (save as HTML) for testers, and a --screenshot option (save as an image), for documentation writers.

At the time of writing (July 2017), this works on Mac and Linux only - Windows support is coming.

Step 1 : Install Chrome

On Ubuntu: https://www.ubuntuupdates.org/ppa/google_chrome

Step 2 : Run Google Chrome in 'headless mode' from the command line

google-chrome-stable --headless --disable-gpu --print-to-pdf=output.pdf https://www.example.com

Testing is easy. The PDF should look the same as print preview in Chrome. We previously had issues with the rendering engine being different in PhantomJS and Chrome. So now, if some CSS, like page-break-inside, is implemented in Chrome, it'll be implemented in the PDF as well. Happy days.

Or, for a screenshot:

google-chrome-stable --headless --disable-gpu --screenshot=output.png https://www.example.com

Or, for the HTML (the contents of the <body> tag)

google-chrome-stable --headless --disable-gpu --dump-dom https://www.example.com > example.html

So now there is no excuse for not testing. For example, I test that a particular map tile (https://maps.google.com/...png) is included in a page.
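A check like that can be scripted. Here is a minimal sketch in Python, assuming google-chrome-stable is on the PATH as in the commands above. The function names and the tile URL in the usage note are made up for illustration, they are not part of Chrome.

```python
import subprocess


def dump_dom(url, chrome="google-chrome-stable"):
    """Return the HTML headless Chrome prints for url, after the javascript has run."""
    result = subprocess.run(
        [chrome, "--headless", "--disable-gpu", "--dump-dom", url],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,  # raise an error if Chrome exits with a failure code
    )
    return result.stdout


def page_includes(html, url_prefix):
    """True if the rendered HTML mentions a URL starting with url_prefix."""
    return url_prefix in html
```

To use it, call something like page_includes(dump_dom("https://www.example.com"), "https://maps.example.com/tiles/"), swapping in the tile URL your own map loads.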

For more information

https://developers.google.com/web/updates/2017/04/headless-chrome


Compare Apache and NginX

Which web server should you use?

If you don't already know the right answer for you, then it's Apache!

Apache was (almost) the original web server, and runs over 50% of web servers. It's Open Source, it has lots of modules, is easy to use, highly configurable, can be used with PHP, Perl, Java, Python, etc, and most websites use it.

NginX (Engine X) is a newer web server that runs on approaching 20% of web servers. Its design is innately much more efficient than Apache's. It's used mainly by large, very high traffic websites as a front end for static resources like images, HTML, CSS, Javascript. Applications (CGI, PHP, Java etc) are then handled by back end Apache or Node servers.

Some example differences:

  • Apache uses a separate process or thread for each connection, which spends much of its time waiting for the next browser request (while still using memory). NginX uses a shared queuing system instead. Think of Apache as 1 till per customer, and NginX as a single queue of customers at a bank of tills.
  • NginX has a server config file, where Apache can also have per directory ones (.htaccess). That's great for individual developers to tweak their own project, but it means that the server has to check each directory in the path to a file on each request - quite an overhead. NginX is faster, Apache is easier to use.
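To get a feel for the per directory overhead, this rough Python sketch lists the .htaccess files that would be looked for on a single request. It assumes overrides are allowed from the document root downwards. The function name and the paths are invented for the example, and this is not how Apache is implemented internally.

```python
from pathlib import PurePosixPath


def htaccess_lookups(document_root, request_path):
    """List the .htaccess files checked for one request, assuming
    overrides are allowed from the document root downwards."""
    root = PurePosixPath(document_root)
    # directories between the document root and the requested file
    parts = PurePosixPath(request_path.lstrip("/")).parent.parts
    lookups = [str(root / ".htaccess")]
    current = root
    for part in parts:
        current = current / part
        lookups.append(str(current / ".htaccess"))
    return lookups


for path in htaccess_lookups("/var/www", "/walks/kent/map.html"):
    print(path)
```

So a file two directories deep costs three extra file checks on every request, before any content is served.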

So, in summary

  • for a typical web server, use Apache.
  • but if you get millions of requests a day, use NginX as the front end, with Apache (or Node or whatever) as the back end application server.

By the way, the figures, from late 2016, that show the Unix/Linux web servers Apache (50%+) and NginX (~20%) growing, also show IIS (the Windows web server) fading to less than 10%. Most of the rest is taken by companies like Google, who run their own proprietary web servers.

If the figures were web traffic, rather than the number of web servers, NginX would score much higher.

HTTPS and OCSP Stapling for Apache

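"OCSP Stapling" is a way of caching part of the SSL verification process on a web server in an HTTPS connection.

The web server caches the certificate authority's answer to "is this certificate still valid?" and sends it along with the certificate, so the check doesn't have to be made with the certificate authority for each new connection.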

Here's how to enable it


# server config only (outside any virtual host)
SSLStaplingCache shmcb:/tmp/stapling_cache(128000)
# server config or virtual host
SSLUseStapling on


And this amazingly thorough test enables you to check you've set this, and pretty much everything else, correctly.

Brotli Compression for webservers

Brotli is a new, more efficient form of file compression from Google - smaller size, less CPU - what's not to love?

Unfortunately, the Apache module for it hasn't been released yet, and web browsers are only just starting to support it.

It is the way of the future, but its time (as of early 2017) has not yet come.

Enable HTTP/2 (was SPDY) on your Apache webserver

The HTTP/1.1 protocol, which most web servers use, has been around since 1997 - almost since the start of the web.

Now there's an alternative which most browsers support, HTTP/2 - the protocol formerly known in its infancy as Google SPDY.

It's a binary protocol (not plain text like HTTP/1.1), and is generally more efficient all round for both server and browser.

For example, it solves the multiplexing (getting multiple files in parallel from the same website) problem, so you no longer need to do domain name sharding (i.e. splitting your resources over several domain names - a.example.com, b.example.com, c.example.com) to increase throughput.

Before you begin! Browsers only seem to support HTTP/2 on secure HTTPS websites, so you need to upgrade your website to HTTPS first.

So how do you enable it?

First, enable the Apache HTTP2 module. This is a bit dependent upon your flavour of Unix - you might have to use a2enmod, or more simply, something like

LoadModule http2_module /usr/lib/apache2/modules/mod_http2.so

Then, add this to the server config (or virtual host)

# set http/2 with http/1.1 as a fallback
Protocols h2 http/1.1
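If you want to check from a script which protocol your server agrees to, here is a small Python sketch. It offers h2 and http/1.1 during the HTTPS handshake (via ALPN, which is how browsers ask for HTTP/2) and reports what the server picked. The function names are my own, and it assumes the server is reachable on port 443.

```python
import socket
import ssl


def alpn_context():
    """TLS context that offers HTTP/2 first, then HTTP/1.1, via ALPN."""
    context = ssl.create_default_context()
    context.set_alpn_protocols(["h2", "http/1.1"])
    return context


def negotiated_protocol(host, port=443):
    """Return the protocol the server picked: 'h2', 'http/1.1', or None."""
    with socket.create_connection((host, port), timeout=10) as sock:
        with alpn_context().wrap_socket(sock, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling negotiated_protocol("www.example.com") should return 'h2' once HTTP/2 is enabled, and 'http/1.1' (or None) if the server fell back.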

When you've done it, there's a test tool here.

The HTTP X-Frame-Options header

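The X-Frame-Options header can be used to stop other websites opening your webpages inside a FRAME or IFRAME.

You can set it in server config, virtual hosts or .htaccess. It has to be sent as a real HTTP response header - browsers ignore it if you put it in an HTML meta tag.

So this HTML example, which you may see suggested, has no effect

<!-- dont open this page in an iframe (ignored by browsers) -->
<meta name="header" content="X-Frame-Options: DENY" >

Here is a server config example, which does work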

# allow pages from this domain name to open pages in iframes, no one else
Header always append X-Frame-Options SAMEORIGIN

This page explains all

The HTTP Content Security Policy header

This is an interesting header. It allows you to set rules for different classes of assets. An asset is an image, video, javascript, ajax, etc. request included in one of your webpages.

For example, a bank might set "only allow assets from this server". So if a page links to an external javascript URL, a browser will ignore it.

You can also set a 'report violations' URL, so the browser will send a warning message to the website. It's a POST request to a URL you set in the header.
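The body of that POST request is JSON, with the details under a "csp-report" key. As a rough sketch of what the receiving script might do, this Python function pulls out the most useful fields. The function name and the sample values are invented for illustration.

```python
import json


def summarise_report(body):
    """Pull the main fields out of a CSP violation report (the POST body)."""
    report = json.loads(body).get("csp-report", {})
    return {
        "page": report.get("document-uri"),
        "blocked": report.get("blocked-uri"),
        "directive": report.get("violated-directive"),
    }


sample = json.dumps({
    "csp-report": {
        "document-uri": "https://www.example.com/walk.html",
        "blocked-uri": "https://evil.example.net/x.js",
        "violated-directive": "script-src 'self'",
    }
})
print(summarise_report(sample))
```

In practice you would log the result somewhere, so you can see which pages are tripping the policy.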

The full syntax is available here, but here are a few examples

The policy can be set in a webpage,


<!-- 
this is an HTML example
- everything comes from this website, 
- except images can come from any secure website 
-->
<meta http-equiv="Content-Security-Policy" content="default-src 'self'; img-src https:">

or in the server config.

# server config, virtual hosts, .htaccess (needs mod_headers)
#
# This example sets
# - by default, everything must be loaded from this domain
# - except images from anywhere
# - except javascript from self or the 2 listed domain names
Header always set Content-Security-Policy "default-src 'self'; img-src *; script-src 'self' cdn.jquery.com maps.google.com"

# this example sets the reporting url
Header always set Content-Security-Policy "default-src 'self'; report-uri https://www.example.com/violation-reports.cgi"
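If your pages are generated by a script, it can be easier to build the policy from a list of rules than to hand edit one long string. A minimal Python sketch, which rebuilds the first policy above (build_csp is just a name I've picked, not a library function):

```python
def build_csp(directives):
    """Join {directive: [sources]} into a Content-Security-Policy value."""
    return "; ".join(
        "{} {}".format(name, " ".join(sources))
        for name, sources in directives.items()
    )


policy = build_csp({
    "default-src": ["'self'"],
    "img-src": ["*"],
    "script-src": ["'self'", "cdn.jquery.com", "maps.google.com"],
})
print(policy)
```

The directives come out in the order you list them (on Python 3.7 or later), and the result is the value to send in the Content-Security-Policy header.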