MHTML Generation and Loading

As implemented in Chrome

Overview

In Chrome, static snapshots of web pages are captured and saved in a modified MHTML format (RFC 2557), a web page archive format that combines HTML and its resources into a single file. A set of modifications, summarized in this document, has been made to the existing MHTML generator and loader in order to:

  • Improve security and privacy
  • Ensure the snapshot mimics the original page as closely as possible

There are ongoing efforts to add better support for creating packages of files for use on the web, as drafted in the Packaging on the Web spec. However, it will take a long time to get everything sorted out and agreed upon, so Chrome built its own support on top of the existing MHTML spec, with the improvements described in this document.

Generating MHTML

The user can request MHTML in two ways:

  1. Foreground: the user has loaded a page and wants to save what they’re seeing. On desktop, “Save as MHTML” is not exposed by default: the user needs to either enable the “Save Page as MHTML” feature in chrome://flags/ or install an extension to download the page as MHTML. On Android, the user can press the “Download” button to save the current page.
  2. Background: the user has not loaded a page and wants the UA to download and save it once connectivity is better. This is only available on Android.

To save a web page as MHTML, the MHTML generator traverses the DOM tree and serializes all elements. It also encloses the following resources as parts of the multipart message:

  • subframes
  • images
  • CSS
  • fonts
  • object elements that contain iframe, image or SVG resources. Object elements that contain plugin resources are not enclosed, since plugins are not loaded because MHTML is sandboxed.

Any other resources, such as scripts, audio and video, are not included. Note that what gets captured is the current state of the DOM tree, which has already been affected by the scripts that ran. Scripts are not allowed to run any further because of the sandboxing protection.

Use cid URL to denote subframe source

Chromium supports out-of-process iframes (OOPIFs), which allow a child frame of a page to be rendered by a different process than its parent frame. In this case, the parent frame only knows the original URL of the child frame, not its final URL. If the two URLs differ because of a redirect, the mismatch causes the child frame resource not to be found. To solve this, a unique Content-ID is generated for each child frame. The src attribute of the iframe element is replaced by the corresponding Content-ID (cid:) URL, and a Content-ID header is added to the resource for this child frame. We do this even if the child frames happen to be identical.

From: <Saved by Blink>
Content-Type: multipart/related;
        type="text/html";
        boundary="----MultipartBoundary--…----"

------MultipartBoundary--…----
Content-Type: text/html
Content-ID: <frame-47385-7ac60b6b-1791-4ca5-a995-f3576b2518e1@mhtml.blink>
Content-Transfer-Encoding: quoted-printable
Content-Location: http://example.org/sample.html

<html><body>
<iframe src=3D"cid:frame-47459-8f0ff1a6-dced-4528-9139-2d3b8f63db56@mhtml.blink">
</iframe>
</body></html>
------MultipartBoundary--…----
Content-Type: text/html
Content-ID: <frame-47459-8f0ff1a6-dced-4528-9139-2d3b8f63db56@mhtml.blink>
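
The rewrite described above can be sketched roughly as follows. This is an illustrative Python sketch, not Blink's code: the function names and the frame_number parameter are hypothetical, and the ID layout merely imitates the IDs shown in the example.

```python
import uuid


def new_frame_content_id(frame_number: int) -> str:
    """Return a unique Content-ID shaped like the IDs in the example.

    frame_number is a hypothetical per-frame integer; the real generator
    may derive that part of the ID differently.
    """
    return f"frame-{frame_number}-{uuid.uuid4()}@mhtml.blink"


def rewrite_iframe_src(attrs: dict, content_id: str) -> dict:
    """Return a copy of an iframe's attributes with src set to cid:<id>."""
    rewritten = dict(attrs)
    rewritten["src"] = f"cid:{content_id}"
    return rewritten


def content_id_header(content_id: str) -> str:
    """Header line to emit on the MIME part that holds the child frame."""
    return f"Content-ID: <{content_id}>"


cid = new_frame_content_id(47459)
print(rewrite_iframe_src({"src": "http://example.org/redirecting-frame"}, cid))
print(content_id_header(cid))
```

Because the iframe and its MIME part are tied together by the generated ID rather than by a URL, a redirect between the original and the final URL no longer matters when the archive is loaded.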

Expand inline iframes

All inline iframes, i.e. those that use the srcdoc attribute, are expanded into standalone subframe resources. That is, the content of the srcdoc attribute is written to a new subframe resource identified by a Content-ID URL, and the srcdoc attribute of the iframe element is replaced with a src attribute pointing to this resource.

For example, the following inline iframe

<iframe srcdoc="&lt;body&gt;…&lt;/body&gt;"></iframe>

will be converted to:

<iframe src=3D"cid:frame-47459-8f0ff1a6-dced-4528-9139-2d3b8f63db56@mhtml.blink">
</iframe>
</body></html>
------MultipartBoundary--…----
Content-Type: text/html
Content-ID: <frame-47459-8f0ff1a6-dced-4528-9139-2d3b8f63db56@mhtml.blink>

<body>
</body>

 

Use binary encoding to reduce file size (Android offline snapshot only)

RFC 2045 defines the encodings allowed for body content and leaves it to the application to pick an appropriate one. On Android, all resources use the binary Content-Transfer-Encoding in order to reduce the size of the MHTML file. On all other platforms, image resources are encoded as base64 and non-image resources as quoted-printable. We are going to switch to binary encoding on all platforms.

Base64-encoded data tends to be about 1.37 times the size of the original data. Extra time is also needed to encode to and decode from base64.
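
The 1.37 figure can be reproduced with a little arithmetic: base64 turns every 3 input bytes into 4 characters (a factor of about 1.33), and MIME bodies are conventionally wrapped at 76 characters per line, each followed by CRLF. A small Python check, assuming that wrapping; the 57,000-byte random payload is an arbitrary stand-in for image data:

```python
import base64
import os

# Arbitrary payload size, chosen as a multiple of 57 so every line is full.
payload = os.urandom(57_000)

# encodebytes() wraps at 76 characters and ends each line with "\n";
# MIME bodies use CRLF, so convert the line endings.
encoded = base64.encodebytes(payload).replace(b"\n", b"\r\n")

# Each 57 input bytes become 76 characters plus CRLF, i.e. 78 output bytes.
ratio = len(encoded) / len(payload)
print(f"{ratio:.4f}")
```

With binary encoding the ratio is 1.0, which is where the size saving comes from.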

From: <Saved by Blink>
Content-Type: multipart/related;
        type="text/html";
        boundary="----MultipartBoundary--…----"

------MultipartBoundary--…----
Content-Type: text/html
Content-Transfer-Encoding: binary
Content-Location: http://example.org/sample.html

<html><body>
<img src="http://example.org/images/sample.png">
</body></html>
------MultipartBoundary--…----
Content-Type: image/png
Content-Transfer-Encoding: binary
Content-Location: http://example.org/images/sample.png

Style elements

The CSS rules of a style element can be updated dynamically, independently of the CSS text contained in the style element. To deal with this inconsistency, we do not serialize any style elements. Instead, we scan document.styleSheets for style sheets that originate from style elements and emit link elements that point to the serialized CSS text.

For example, the following style element has dynamically updated CSS:

<style id="s1" type="text/css">h1: {color: green;}</style>
<script>
  document.getElementById("s1").sheet.insertRule("h1 { color: red; }", 0);
</script>

It will be serialized to the following:

<link rel="stylesheet" type="text/css" href="cid:css-some-random-uuid@mhtml.blink" />
------MultipartBoundary--…----
Content-Type: text/css
Content-Transfer-Encoding: quoted-printable
Content-Location: cid:css-some-random-uuid@mhtml.blink

@charset "utf-8";

h1 { color: red; }

Strip meta elements containing Content-Security-Policy

Meta elements can declare Content-Security-Policy directives. These constraints were already enforced when the original document was loaded. Since only the rendered resources, i.e., iframes, are encapsulated in the saved MHTML page, there is no need to carry the directives over. Moreover, if the directives were kept in the MHTML, subframes referenced through the cid: scheme could be prevented from loading. As a workaround, meta elements that contain Content-Security-Policy directives are excluded from serialization. All other meta elements are unaffected.

<meta http-equiv="Content-Security-Policy" content="img-src 'self'; child-src 'self'">

Strip script elements and attributes

Since script execution is disabled for MHTML, there is no value in keeping scripts around in the MHTML. Elements and attributes that contain scripts are not serialized, including:

  • Any script element.
<script>…</script>
  • Any onEVENT attribute: any attribute whose name begins with “on” is treated as an event binding attribute and is therefore stripped.
<body onload="foo();">
  <a onclick="bar();">
  • Any attribute containing a javascript: URI, such as href and src, and some SVG attributes like from and to.
<a href="javascript:foo();">

<svg xmlns="http://www.w3.org/2000/svg">
  <rect id="rect" height="100">
    <animate from="javascript:foo();" to="javascript:bar();" />
  </rect>
</svg>

Inline iframe content, provided through the srcdoc attribute of iframe elements, can also contain scripts.

<iframe srcdoc="&lt;body&gt;…&lt;/body&gt;"></iframe>

We don’t need to exclude this attribute, since it is rewritten as a src attribute that holds a link instead of HTML content. Any scripts in the inline content are removed when that content is serialized.
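
The stripping rules from this section and the Content-Security-Policy section before it can be expressed as two small predicates. The following Python sketch only illustrates the rules as stated in the text; the function names are hypothetical, attributes are modelled as a plain dict, and the case-insensitive matching is an assumption.

```python
def should_strip_element(tag: str, attrs: dict) -> bool:
    """True for elements that are dropped: scripts and CSP meta elements."""
    tag = tag.lower()
    if tag == "script":
        return True
    if tag == "meta":
        http_equiv = attrs.get("http-equiv", "")
        return http_equiv.strip().lower() == "content-security-policy"
    return False


def should_strip_attribute(name: str, value: str) -> bool:
    """True for attributes that carry script: on* handlers, javascript: URIs."""
    if name.lower().startswith("on"):
        return True
    return value.strip().lower().startswith("javascript:")


def filter_attributes(attrs: dict) -> dict:
    """Keep only the attributes that survive serialization."""
    return {k: v for k, v in attrs.items() if not should_strip_attribute(k, v)}


print(filter_attributes({"href": "javascript:foo();", "onclick": "bar();", "id": "a"}))
```

Note that the attribute test is deliberately broad: it matches by name prefix and URI scheme rather than by a list of known attributes, which is why SVG attributes such as from and to are covered as well.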

Strip attributes that could send tracing pings

The ping attribute of the anchor element, which is used to send an HTTP POST request to the specified URL for monitoring or tracking purposes, is stripped if present.

<a href="http://www.sample.org" ping="http://www.sample.org/trace">

Strip hidden elements

If elements are marked as hidden, they are not going to be made visible later, because script execution is disabled. However, we don’t want to target all hidden elements, since some hidden elements can affect layout or CSS matching. To play it safe, we only strip the following:

  • Input elements with type=“hidden”
<input type="hidden" name="foo" value="bar">
  • Any HTML element with the “hidden” attribute
<p hidden>Invisible</p>

We do not strip:

  • HTML elements with the “visibility:hidden” style. They still participate in layout and take up space even though they are hidden, so they need to be kept.
  • HTML elements with the “display:none” style. They do not participate in layout, but some websites use nth-child CSS selectors to target them: https://crbug.com/697970.
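
Put together, the check looks only at the markup and never at computed style. A minimal Python sketch of that decision, assuming attributes arrive as a dict with lower-case names and that a boolean attribute maps to an empty string (should_strip_hidden is a hypothetical name):

```python
def should_strip_hidden(tag: str, attrs: dict) -> bool:
    """Apply the two conservative rules from the text; CSS is never consulted."""
    if "hidden" in attrs:  # e.g. <p hidden>
        return True
    # e.g. <input type="hidden" name="foo" value="bar">
    return tag.lower() == "input" and attrs.get("type", "").lower() == "hidden"


print(should_strip_hidden("input", {"type": "hidden", "name": "foo"}))
print(should_strip_hidden("div", {"style": "display:none"}))
```

Elements hidden purely through CSS fall through to the final return and are kept, which matches the two exceptions listed above.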

Strip iframes that are injected into head

Some web pages have scripts that create iframes and insert them into the head element. These iframes are not displayed. If they were preserved as part of MHTML serialization, they would show up as static children of the head element. When a page, including an MHTML page, is loaded, all iframes found in head are moved to body, which would make these iframes visible. To prevent this, iframes that are descendants of head are not saved.

When srcset is used, move the selected URL to src

On a high-DPI device, a different image from srcset may be loaded. For example, 2x.png is loaded on a high-DPI device:

<img src="1x.png" srcset="2x.png 2x">

When the <img> element is serialized, the srcset attribute is not saved. Instead, the src attribute is set to the URL of the image that was actually loaded. To ensure that this image is shown in the same bounding box, the rendered width and height are added if they weren’t already present.

<img src="http://…/2x.png" width="…" height="…">
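
As a rough illustration of this attribute rewrite, here is a Python sketch. The function name and parameters are hypothetical; loaded_url stands for the URL of the image the renderer actually selected, and the rendered size is assumed to be supplied by the caller.

```python
def serialize_img_attrs(attrs: dict, loaded_url: str,
                        rendered_width: int, rendered_height: int) -> dict:
    """Return the attributes to write for an <img> in the snapshot."""
    # srcset is never saved.
    out = {k: v for k, v in attrs.items() if k != "srcset"}
    # src always points at the image that was actually loaded.
    out["src"] = loaded_url
    # Only fill in dimensions the page did not specify itself.
    out.setdefault("width", str(rendered_width))
    out.setdefault("height", str(rendered_height))
    return out


print(serialize_img_attrs({"src": "1x.png", "srcset": "2x.png 2x"},
                          "http://example.org/2x.png", 100, 50))
```

Pinning the width and height matters because the 2x image has twice the intrinsic size of the 1x image and would otherwise be laid out larger when the snapshot is viewed without srcset.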

Remove popup overlays

Some web pages display popups that can obstruct the user’s view of the main content. A popup may be triggered automatically by a script, or shown as the result of the user clicking a link or button. The former is most often used for ads or surveys, while the latter may be requested by the user in order to sign in, enter codes or take other actions. Obstruction is more likely when the page is loaded in background mode, since the user doesn’t have a chance to close these popups before the MHTML is saved.

When the page is loaded in background mode, we search for DOM elements that satisfy the following criteria:

  • The element is visible, with non-zero width and height.
  • The bounding box of the element contains the center point of the viewport.
  • The z-index value is greater than 50 (experimental value, to be tuned).

Once found, these elements, including all their children, are skipped during MHTML serialization.

<style type="text/css">
.overlay {
  position: fixed;
  z-index: 1000;
  top: 0;
  left: 0;
  width: 100%;
  height: 100%;
  …
}
.modal {
  position: fixed;
  z-index: 1001;
  left: 50%;
  top: 50%;
  width: 200px;
  height: 100px;
  margin-left: -100px;
  margin-top: -50px;
  …
}
</style>
<div class="overlay"></div>
<div class="modal">Popup</div>
<div>Some text.</div>
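
The three criteria can be read as a single predicate over an element's layout box. The Python sketch below assumes that all three criteria must hold together; Box, is_popup_overlay and the 800x600 viewport are hypothetical, and the box coordinates are the ones the CSS above would produce in such a viewport.

```python
from dataclasses import dataclass

Z_INDEX_THRESHOLD = 50  # experimental value quoted in the text


@dataclass
class Box:
    left: float
    top: float
    width: float
    height: float
    z_index: int
    visible: bool = True


def is_popup_overlay(box: Box, viewport_width: float, viewport_height: float) -> bool:
    """True if the element should be skipped as a popup overlay."""
    if not box.visible or box.width <= 0 or box.height <= 0:
        return False
    cx, cy = viewport_width / 2, viewport_height / 2
    contains_center = (box.left <= cx <= box.left + box.width
                       and box.top <= cy <= box.top + box.height)
    return contains_center and box.z_index > Z_INDEX_THRESHOLD


overlay = Box(left=0, top=0, width=800, height=600, z_index=1000)
modal = Box(left=300, top=250, width=200, height=100, z_index=1001)
text = Box(left=0, top=0, width=800, height=20, z_index=0)
for name, box in [("overlay", overlay), ("modal", modal), ("text", text)]:
    print(name, is_popup_overlay(box, 800, 600))
```

Under these assumptions the overlay and the modal are both skipped, while the ordinary text block is kept.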

Save shadow DOM contents

The contents of a shadow DOM tree are serialized into a template element that is inserted at the end of the shadow host element. A special attribute, shadowmode, is set on the template element to record the mode of the preserved shadow DOM tree. In addition, another attribute, shadowdelegatesfocus, may be set if the delegatesFocus flag needs to be set when the shadow tree is created.

<style>span { font-size:12px; }</style>
<div id="host" class="example">
  <p>Hello, placeholder!</p>
  <template shadowmode="open">
    <span>Hello, shadow!</span>
    <style>span { color:blue; }</style>
  </template>
</div>

Notes:

  1. The original children of shadow host elements are still serialized, since they might be carried into the shadow DOM tree via distribution.
  2. Nodes distributed via <content> or <slot> are not visited. <content> and <slot> elements are serialized in their original form.
  3. If elements in the original page happen to carry shadowmode attributes, those attributes are removed during serialization.

Add and update headers to support P2P sharing

The captured MHTML page can be shared with other parties across devices. The receiving party, i.e., Chrome on another device, may need to find and show basic information about the shared page, such as its title and URL.

The title can be found in the Subject header. Chrome used to match IE by replacing every character that is not printable ASCII with “?”. Instead, we now use the technique from RFC 2047 to encode such text. The value of the Subject header is:

  • If all the characters in the title are printable ASCII, the title itself.
  • Otherwise, the title text encoded as:

=?utf-8?Q?encoded_text?=

where “utf-8” is the charset chosen to represent the title and “Q” selects the Quoted-Printable-like encoding that converts the text to 7-bit printable ASCII characters.

Notes:

  • A long title line can be split into multiple lines using the soft line break “CRLF+SPACE/TAB” per RFC 2047. The soft line break “=CRLF”, which RFC 2045 defines for breaking long lines in the message body, should NOT be used.
  • White space characters should always be encoded, regardless of where they appear, per RFC 2047.
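
To make the encoding rule concrete, here is a minimal Python sketch of the Subject value computation. It is not Chrome's implementation: encode_subject is a hypothetical name, the sketch conservatively escapes every byte except ASCII letters and digits, and it does not fold long titles across lines.

```python
import string

# Bytes emitted as-is inside the encoded word; everything else becomes =XX.
SAFE = frozenset((string.ascii_letters + string.digits).encode("ascii"))


def encode_subject(title: str) -> str:
    """Return the Subject value for a page title."""
    # Titles made only of printable ASCII (0x20-0x7E) are used unchanged.
    if all(0x20 <= ord(ch) <= 0x7E for ch in title):
        return title
    encoded = "".join(
        chr(b) if b in SAFE else f"={b:02X}"
        for b in title.encode("utf-8")
    )
    return f"=?utf-8?Q?{encoded}?="


print(encode_subject("Example page"))
print(encode_subject("Example=\u261d"))
```

The second call yields =?utf-8?Q?Example=3D=E2=98=9D?=: the literal "=" is escaped as =3D and the three UTF-8 bytes of U+261D follow. Spaces inside an encoded word come out as =20, in line with the note on white space.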

 

The URL can be found in the Content-Location header of the first multipart section, but reading it requires parsing the whole multipart structure. To avoid this, a custom header, Snapshot-Content-Location, is added. The URL value of this header should contain only ASCII characters: any non-ASCII input characters are UTF-8 encoded and %-escaped to ASCII.

Header Name                  Header Value
Snapshot-Content-Location    The final URL of the main page.
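
The escaping of non-ASCII characters can be sketched in a few lines of Python. to_ascii_url is a hypothetical helper that applies exactly the rule stated above and nothing more; in particular, it does not re-escape characters that are already ASCII.

```python
def to_ascii_url(url: str) -> str:
    """Percent-escape the UTF-8 bytes of every non-ASCII character."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in url
    )


print(to_ascii_url("https://example.com/caf\u00e9"))
```

Here the character U+00E9 is encoded as the two UTF-8 bytes C3 A9, so the header value ends in caf%C3%A9.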

 

Here is an example of an MHTML document with the updated Subject header and the added Snapshot-Content-Location header:

From: <Saved by Blink>
Subject: =?utf-8?Q?Example=3D=E2=98=9D?=
Date: Sat, 30 Jun 2017 22:24:53 -0000
MIME-Version: 1.0
Snapshot-Content-Location: https://example.com/
Content-Type: multipart/related;
        type="text/html";
        boundary="----MultipartBoundary--…----"

------MultipartBoundary--…----
Content-Type: text/html
Content-ID: <frame-135364-6e4d2ef2-86f3-4e42-b385-0f81739a3fd7@mhtml.blink>
Content-Transfer-Encoding: quoted-printable
Content-Location: https://example.com/
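
A receiving application can recover both values from the top-level headers alone. The sketch below uses Python's standard email package as a stand-in for whatever parser the receiver has; SNAPSHOT is a made-up sample and read_title_and_url is a hypothetical helper.

```python
from email.header import decode_header, make_header
from email.parser import HeaderParser

# Made-up sample: top-level headers followed by an unparsed body.
SNAPSHOT = (
    "From: <Saved by Blink>\r\n"
    "Subject: =?utf-8?Q?Example=3D=E2=98=9D?=\r\n"
    "MIME-Version: 1.0\r\n"
    "Snapshot-Content-Location: https://example.com/\r\n"
    "\r\n"
    "(multipart body omitted)\r\n"
)


def read_title_and_url(text: str):
    """Return (title, url) from the top-level headers of an MHTML document."""
    # HeaderParser reads the headers and leaves the body unparsed.
    headers = HeaderParser().parsestr(text)
    raw_subject = headers.get("Subject", "")
    # decode_header() undoes the RFC 2047 encoded word, if there is one.
    title = str(make_header(decode_header(raw_subject)))
    return title, headers.get("Snapshot-Content-Location")


print(read_title_and_url(SNAPSHOT))
```

Since neither value depends on the multipart body, the receiver never has to locate the first part or decode its content.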

 

Loading MHTML

MIME Types of MHTML files

The MHTML files that can be opened and rendered by Chrome should have the MIME type “multipart/related” or “message/rfc822”.

MHTML files eligible to load

Loading from file URLs

MHTML files addressed by file:// URLs are loaded, provided the file URL ends with the .mht or .mhtml suffix. The omnibox shows the file URL.

MHTML files hosted on remote servers (http/https URLs) are NOT loaded. Instead, they are downloaded.

Offline snapshot (Android only)

On Android, snapshots of web pages can be captured in MHTML archive format, i.e., by pressing the Download icon. These web archive files are stored like local MHTML archive files, with a digest of each MHTML file computed and recorded in an internal database. The digest is used to verify that the MHTML file, which was created by the UA from trusted input and stored in a public directory, has not been modified. The file’s claimed URL can be deemed trusted only if the file is unchanged. If the network is disconnected or in poor condition when the user accesses these URLs, the snapshots from these files are loaded instead. The omnibox still shows the http/https URL, marked with an offline status, rather than the location of the MHTML file.

Opening from external apps via intent dispatching (Android Only)

On Android, any app can construct and dispatch an Intent to open an MHTML file. Chrome interprets the Intent and opens the requested MHTML file under these conditions:

  • If a file:// URL is passed in the Intent, the URL should end with the .mht or .mhtml extension and the Intent type should be multipart/related, message/rfc822 or empty.
  • If a content:// URL is passed in the Intent, the Intent type should be either multipart/related or message/rfc822.
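
The two conditions above amount to a small decision function. The following Python sketch is an illustration only: can_open_from_intent is a hypothetical name, and the case-insensitive comparison of the extension and the MIME type is an assumption the text does not confirm.

```python
from typing import Optional
from urllib.parse import urlsplit

MHTML_TYPES = {"multipart/related", "message/rfc822"}


def can_open_from_intent(url: str, intent_type: Optional[str]) -> bool:
    """Decide whether an Intent's URL and type qualify as loadable MHTML."""
    parts = urlsplit(url)
    mime = (intent_type or "").lower()
    if parts.scheme == "file":
        has_suffix = parts.path.lower().endswith((".mht", ".mhtml"))
        # For file URLs, an empty type is accepted as well.
        return has_suffix and (mime == "" or mime in MHTML_TYPES)
    if parts.scheme == "content":
        # content:// URLs carry no usable extension, so the type must be set.
        return mime in MHTML_TYPES
    return False


print(can_open_from_intent("file:///sdcard/Download/page.mhtml", None))
print(can_open_from_intent("content://com.example.provider/doc/42", ""))
```

The asymmetry is the point of the sketch: a file URL can be recognized by its extension, whereas a content URL can only be recognized by the declared type.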

Sandboxing

MHTML is loaded in full sandboxing mode, with the sole exception that new top-level windows may be opened. That is, the `Content-Security-Policy: sandbox allow-top-navigation-by-user-activation allow-popups-to-escape-sandbox` header is forced to be present.

  • The document is loaded into a unique origin, which means all same-origin checks fail.
  • Script execution is disabled. This includes JavaScript loaded via script tags, inline event handlers and scripts.
  • Forms cannot be submitted.
  • Plugins do not load.
  • Features that trigger automatically (auto-focused form elements, auto-playing videos, etc.) are blocked.

Note that a new top-level window is only allowed when the user clicks a link with a blank target.

Instantiating Shadow DOM Contents

After the MHTML page is loaded, Chrome creates shadow roots from <template shadowmode=…> elements that were captured from the shadow DOM trees at serialization. Any scripts and event handlers in shadow DOM templates will not be run.

 

If attaching a shadow root fails, either because the host is an invalid element or because it already has a shadow root, that template is skipped.

 

We’ll migrate to https://github.com/whatwg/dom/issues/510 if/when it’s standardized.

Disabling Form Control Elements

Because form submission and script execution are disabled, the user gains nothing by interacting with a form. To make this clear to the user, all form control elements are shown in a disabled state.

Network Requests

Resource requests

All the sub-resources needed to load and render the page should be contained in the MHTML archive. Any attempt to load them from the network is blocked.

Navigation requests

Only user-initiated navigations, i.e. users clicking on static links, are allowed.

  • Navigations originating from the main frame are allowed.
  • A navigation originating from a subframe will:
    • be blocked if it navigates within the subframe
    • open a new window if the top-level target is specified (target="_top").
    • open a new window if the blank target is specified (target="_blank").
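
These navigation rules can be summarized as a decision function. The Python sketch below is hypothetical (Decision and decide_navigation are not Chrome names), and it assumes that subframe navigations with any target other than _top or _blank are blocked, which the text does not spell out.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    NEW_WINDOW = "open in new window"


def decide_navigation(user_initiated: bool, from_main_frame: bool,
                      target: str = "") -> Decision:
    """Map a navigation request in an MHTML page to an outcome."""
    if not user_initiated:
        return Decision.BLOCK
    if from_main_frame:
        return Decision.ALLOW
    if target in ("_top", "_blank"):
        return Decision.NEW_WINDOW
    # Assumption: anything else stays inside the subframe and is blocked.
    return Decision.BLOCK


print(decide_navigation(True, True))
print(decide_navigation(True, False, "_top"))
print(decide_navigation(True, False))
```

Opening a new window for _top, instead of replacing the top-level document, keeps the snapshot itself in place while still letting the user follow the link.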

Tracing ping requests

All the tracing ping requests, currently only those pings from anchor elements, will be blocked.

Acknowledgements

Special thanks to Dmitry Titov, Justin DeWitt, Lukasz Anforowicz, Nate Chapin and Jay Civelli, who proposed and implemented changes to improve MHTML support in Chrome.

Thanks to Jeffrey Yasskin for editing this specification. Many thanks to Domenic Denicola, Mike West and Hayato Ito for their great feedback.
