Improving language negotiation
Being a multilingual user on the web can sometimes be a frustrating experience. Depending on the device you’re using and the language it happens to be configured to, some websites will try to accommodate you, and give you a translated version of their content, even though you may prefer the non-translated original.
That is why a recent Intent to experiment around language negotiation caught my eye, as it seemed like a good opportunity to improve on the status quo. It also stirred up some controversy around the tradeoff it is making between user agency, internationalization and privacy.
In this post, I’ll try to cover what the intent is trying to solve, what’s currently broken with language content negotiation on the web, and how I think we can do better.
# Status quo
The Accept-Language header is sent with every request and contains a priority-ordered list of languages that the user prefers. The server can then choose to use it to reply with one of the preferred languages, and mark that using the Content-Language header.
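To make the status quo concrete, here is a minimal sketch of what server-side negotiation can look like. The function names and the matching rule (comparing only the primary language subtag) are assumptions made for illustration; real servers and frameworks differ in the details.

```python
def parse_accept_language(header: str) -> list[str]:
    """Return language tags ordered by descending q-value (default q=1.0)."""
    weighted = []
    for position, item in enumerate(header.split(",")):
        tag, _, params = item.strip().partition(";")
        tag = tag.strip().lower()
        if not tag:
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed weight as "not acceptable"
        if quality > 0:
            # Sort by weight first, then by position in the header.
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]


def negotiate(header: str, available: list[str], default: str) -> str:
    """Pick the first preferred language the site offers (assumed matching rule)."""
    for tag in parse_accept_language(header):
        primary = tag.split("-")[0]
        for candidate in available:
            if candidate.lower().split("-")[0] == primary:
                return candidate
    return default  # nothing matched: fall back to the site's default


# A French-first user who also reads English, on a Spanish/English site.
chosen = negotiate("fr-CH, fr;q=0.9, en;q=0.8", ["es", "en"], default="es")
print("Content-Language:", chosen)
```

Note that the full ordered list is what makes this work in a single round trip, and it is also what exposes the user's language profile to every server.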
Sending out an ordered list of preferred languages is not a problem for monolingual users of fairly popular languages, especially if these users’ IP addresses are coming from a country that speaks that language. But for users who speak multiple languages and browse the web from countries where those languages are not dominant, that list can reveal a lot of information.
Combined with the IP address the user’s request is coming from and other bits of information, the list of languages can form a user-specific fingerprint that identifies the user across the web, regardless of their cookie state.
# Chrome’s Proposal
The Chrome team wants to fix that by limiting the information sent with requests to the user’s single highest priority language. That’s an approach Safari also broadly takes (even if it’s slightly more complex than that), so there’s some precedent there.
Where the Chrome team’s approach differs is by allowing sites to reply with an Avail-Language header that contains a list of languages the content is available in. Then if the Content-Language value is one that the user doesn’t understand, the browser can retry the request using a language that is on both the user’s list and the Avail-Language list (if one exists).
To demonstrate the above with an example, let’s say our user understands both French and English, with French being the highest priority language. If that user goes to a site destined mainly for a Spanish-speaking audience, the request to that site would be sent with an Accept-Language: fr header, and the response may have a Content-Language: es-ES header as well as an Avail-Language: es, en one.
The browser would then understand that the response is in a language the user doesn’t understand, but that a variant of the content that the user could understand exists. It would then send another request with an Accept-Language: en header to get the site’s English variant and present it to the user.
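The browser-side decision in that example can be sketched roughly as follows. This is my reading of the flow described above, not the proposal's actual algorithm: the function names are hypothetical, and comparing only primary subtags (so that es-ES counts as es) is a simplifying assumption.

```python
from typing import Optional


def primary(tag: str) -> str:
    """Reduce a language tag such as 'es-ES' to its primary subtag 'es'."""
    return tag.strip().lower().split("-")[0]


def pick_retry_language(
    user_languages: list[str],   # the user's full list, highest priority first
    content_language: str,       # value of the response's Content-Language
    avail_language: str,         # value of the response's Avail-Language
) -> Optional[str]:
    """Return the language to retry with, or None if no retry is needed or possible."""
    understood = [primary(tag) for tag in user_languages]
    if primary(content_language) in understood:
        return None  # the user can already read this response
    offered = {primary(tag) for tag in avail_language.split(",") if tag.strip()}
    for language in understood:  # keep the user's priority order
        if language in offered:
            return language
    return None  # no overlap: the user is left with the original response


# The example from the text: French-first user, Spanish site with an English variant.
retry = pick_retry_language(["fr", "en"], "es-ES", "es, en")
print("Retry with Accept-Language:", retry)
```

The important detail is that the retry only happens after the first response has arrived, which is where the costs discussed next come from.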
# Cost
In theory the above proposal would work similarly to today’s status quo and the user would eventually end up with the language they prefer in front of them.
But that doesn’t come for free.
On the client-side, the users would get their content with a significant delay, due to the request’s retry.
As far as servers go, they would end up sending two responses instead of one in those cases. That can result in some server overload, or at the very least, wasted CPU cycles.
And if I understand the proposal correctly, that cost would be borne every time the user browses to the site, because the site’s Avail-Language response will not be cached to be used in future navigations to make better decisions.
In practice, I suspect things would be even worse.
The proposal counts on sites actively adopting Avail-Language as part of their language negotiation, in order to give users content in the language they expect. And while that may happen over time, in the meantime multilingual users will end up seeing sites in languages they do not understand.
Beyond that, the proposal doesn’t take advantage of the opportunity to improve language negotiation on the web and improve the experience of multilingual users.
# A better path?
# Original language
In my opinion as a tri-lingual user, the above proposal carries over one fundamental mistake from the current language negotiation protocol. It assumes that users have a single “preferred language”.
My theory is that multilingual users have a favorite language per site, which is the site’s original language, assuming it is one they are fluent in.
It also kinda makes sense - because translation is lossy, as a user you’d want the highest fidelity content you can decipher, which is the content in its original language.
Therefore, I believe that we can avoid the tradeoff between privacy and internationalization/performance altogether! It's perfectly fine for the browser to send only a single language as part of the Accept-Language list! But that language needs to be influenced by the site’s original language.
Taking into account the site’s original language doesn’t reveal any extra information about the user, and hence is perfectly safe from a privacy perspective.
Browsers can know the site’s original language through any (or all) of the following:
- We could define a new header (e.g. Original-Language) that browser-affiliated crawlers could use to get that information and then distribute it to browsers as a configuration.
  - Maybe we don’t even need a new header and crawlers can detect the default language of the site when a bogus Accept-Language value is sent in the request.
- Browsers could use the site’s domain as a heuristic that indicates its original language.
- We could generate open source lists mapping sites (or a popular subset of them) to their original languages that various browsers could use. We could generate those lists using the above heuristics and/or site-provided headers.
Assuming that the original language is known, the Accept-Language algorithm can be as minimal as:
- If the site’s original language is included in the user’s preferred languages, send that language in the Accept-Language header.
- Otherwise, send the user’s highest priority language.
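The two rules above fit in a few lines. Here is a minimal sketch, assuming the browser already knows the site's original language (or has no data for it, represented as None); the function names and the primary-subtag comparison are my own illustrative choices.

```python
from typing import Optional


def primary(tag: str) -> str:
    """Reduce a language tag such as 'en-GB' to its primary subtag 'en'."""
    return tag.strip().lower().split("-")[0]


def accept_language_for(
    site_original: Optional[str],  # site's original language, None if unknown
    user_languages: list[str],     # user's preferences, highest priority first
) -> str:
    """Choose the single language to send in Accept-Language."""
    if not user_languages:
        raise ValueError("the user must have at least one preferred language")
    if site_original is not None:
        target = primary(site_original)
        for tag in user_languages:
            if primary(tag) == target:
                return tag  # rule 1: the original language is one the user reads
    return user_languages[0]  # rule 2: fall back to the top preference


user = ["fr", "en"]
print(accept_language_for("en", user))  # English-language site
print(accept_language_for("es", user))  # Spanish-language site
```

For the French-and-English user from the earlier example, an English-language site gets en, while a Spanish-language site gets fr. In both cases the server only ever sees one language, and no retry is needed.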
# Passive resource requests
Beyond the above, I also think we can safely freeze the values of Accept-Language headers on requests for passive subresources (e.g. images, fonts). By “freeze”, I mean we can change the value of these headers to en for all users, regardless of their language preference.
While in theory sites could use language negotiation to e.g. serve different images or fonts to certain users, I suspect that (almost) never happens in practice. We could validate those assumptions by adding use counters for resource responses that include a Content-Language header, which indicates the server did some negotiation, and then further investigate these cases.
If freezing of these header values works out, we might be able to even try and remove these headers entirely in the future for passive resources, if we see that no servers rely on them.
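As a rough sketch of both ideas, the snippet below freezes the header for passive requests and counts passive responses that carry Content-Language. The destination names, the counter name, and both function names are assumptions for illustration, not an existing browser API.

```python
from collections import Counter

# Assumed set of "passive" request destinations; the text names images and fonts.
PASSIVE_DESTINATIONS = frozenset({"image", "font"})
FROZEN_VALUE = "en"

use_counters: Counter = Counter()


def accept_language_header(destination: str, negotiated_value: str) -> str:
    """Send the frozen value for passive subresources, the real one otherwise."""
    if destination in PASSIVE_DESTINATIONS:
        return FROZEN_VALUE
    return negotiated_value


def record_response(destination: str, response_headers: dict[str, str]) -> None:
    """Count passive responses that appear to have been language-negotiated."""
    has_content_language = any(
        name.lower() == "content-language" for name in response_headers
    )
    if destination in PASSIVE_DESTINATIONS and has_content_language:
        use_counters["passive_response_with_content_language"] += 1


print(accept_language_header("image", "fr"))     # frozen for every user
print(accept_language_header("document", "fr"))  # navigation keeps the real value
record_response("font", {"Content-Language": "fr"})
print(dict(use_counters))
```

If such a counter stayed close to zero in the wild, that would support freezing, and later removing, the header on these requests.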
# Conclusion
While I believe that Chrome’s proposal is a good step in the right direction in favor of user privacy, I think it makes some unnecessary tradeoffs that are not ideal for performance or (at least in the practical short term) for internationalization.
The alternative I propose here does not come for free, as browsers would need to maintain “original language” lists or heuristics. But I believe it would provide a better user experience, while giving users better privacy.