Not everything is UTF-8
One of them is the encoding (or lack thereof) in Mercurial and how it affects how we write code in both Python and Rust. As easy as it was to explain the issue to said developer, in the few instances of asking around for help on implementation details (mostly to get information about what had already been done and what I needed to do myself) I've noticed that not everyone I'd interacted with outside of our circle of VCS developers even understood the problem I was trying to solve.
Please note that I am not pointing fingers or accusing anyone of being disingenuous: just about everyone I talked to was very much trying to help me and to understand what it was that I wanted to solve in the first place. I usually don't have that much trouble explaining things to people in those situations, so I figured this warranted a full blog post.
The core issue
There Ain’t No Such Thing As Plain Text
This is a quote from Joel Spolsky, best known as the co-founder and (until recently) CEO of Stack Overflow. It's from an article of his from 2003 called The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!). Read that one first and then come back, because it covers a lot of the "general not-VCS-related" encoding stuff that serves as a basis for the rest of this post, and it is still relevant today.
In version control software like Mercurial, we cannot make any assumptions about the contents of tracked files or their encoding. For all we know, a given file could be a binary file, a text file in some encoding, or even a mixed-encoding file: it is a very real and relevant need for a VCS to be able to track and manipulate data without assuming it to be text (of any encoding).
Take the following example:
$ hg init test-repo
$ cd test-repo
$ echo -n "Raphaël Gomès" > foo  # assuming UTF-8 default
$ hg commit -Am "UTF-8"
$ iconv -f UTF-8 -t WINDOWS-1252 foo > foo2
$ mv foo2 foo
$ hg commit -Am "WINDOWS-1252"
Here, we create a new empty repository, create the (UTF-8) file foo containing my name, commit it, then convert it from UTF-8 to WINDOWS-1252 and commit that.
Running HGPLAIN= hg export (HGPLAIN ensures you are not customizing the output, for example with a separate diff tool) will show the correct bytes in each "half" of the diff if your terminal encoding is set to UTF-8 or CP1252: no bytes are lost by Mercurial. Even without changing encodings in a commit, simply using an encoding other than UTF-8, like KOI-8, would be unusable if the diff algorithm were not encoding-agnostic. Because the bytes are sent as-is by Mercurial, all the user has to do is use a terminal set to the right encoding, and everything will be fine: nowhere did the user need to provide encoding information.
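To make the byte-level view concrete, here is a minimal sketch of the two revisions of foo as raw bytes. The constant and function names are made up for illustration, and this is not how Mercurial's diff is implemented: it only shows that two byte sequences can be compared without ever decoding them.

```rust
/// "Raphaël Gomès" encoded as UTF-8: `ë` is C3 AB and `è` is C3 A8.
const NAME_UTF8: &[u8] = b"Rapha\xc3\xabl Gom\xc3\xa8s";

/// The same name encoded as Windows-1252: `ë` is EB and `è` is E8.
const NAME_CP1252: &[u8] = b"Rapha\xebl Gom\xe8s";

/// Returns the index of the first byte at which two revisions differ,
/// or `None` if they are identical. Nothing is decoded along the way,
/// so the comparison works for any encoding, or for no encoding at all.
fn first_difference(old: &[u8], new: &[u8]) -> Option<usize> {
    let common = old.iter().zip(new).take_while(|(a, b)| a == b).count();
    if common == old.len() && common == new.len() {
        None
    } else {
        Some(common)
    }
}
```

The Windows-1252 revision is not valid UTF-8, so any tool that insisted on decoding it first would have to reject it or alter it, whereas the byte comparison simply reports where the two revisions diverge.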
But forget binary files for a minute: their diffs are usually useless compared to a hexdump and we could also use LFS for them, right? Couldn't users just convert the rest of their repositories to UTF-8 and be done with it? I think that every developer, myself included, would be much happier if they didn't have to consider multiple encodings and if text were UTF-8 everywhere... but the world is unfortunately more complicated than that.
Say you're designing a new VCS from scratch in Rust or, in my case, rewriting core parts of a VCS in Rust: which type do you use to manipulate file contents? If your answer was String, you've just disqualified any file that isn't UTF-8 from being tracked by your VCS at any point in the history. That means that anyone converting from Mercurial to your shiny new system will lose at least part of their history, if not all of it: for example, you can't losslessly convert a repo whose early revisions used ISO/IEC 8859-5, not to mention any binary or mixed-encoding files (common in translation files). What type do you use to represent a file path? If your answer was String, you've made valid UNIX and Windows MBCS paths impossible to represent in your software. If your answer was OsString (or a type built on top of it), good guess, but it is also wrong in our use-case: file paths tracked by Mercurial need to be abstracted away from the current OS, otherwise you open yourself up to normalization and cross-OS/cross-FS compatibility issues that stem from the distributed nature of Mercurial.
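As a rough sketch of what "abstracted away from the current OS" can look like, a tracked path can be stored as plain bytes with a fixed separator. RepoPath and TrackedFile below are hypothetical names invented for this example, not Mercurial's actual types.

```rust
/// Hypothetical OS-independent path: raw bytes, always `/`-separated,
/// never routed through `String` or `OsString`.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct RepoPath(Vec<u8>);

impl RepoPath {
    fn new(bytes: &[u8]) -> Self {
        RepoPath(bytes.to_vec())
    }

    fn as_bytes(&self) -> &[u8] {
        &self.0
    }

    /// Splits on the `/` byte only, whatever the host OS uses natively,
    /// and skips empty components.
    fn components(&self) -> impl Iterator<Item = &[u8]> {
        self.0.split(|&b| b == b'/').filter(|c| !c.is_empty())
    }
}

/// Hypothetical tracked file: neither the path nor the contents are
/// assumed to be text in any encoding.
#[derive(Debug, Clone, PartialEq, Eq)]
struct TrackedFile {
    path: RepoPath,
    contents: Vec<u8>,
}
```

Converting such a path to something the OS understands would then happen only at the boundary, when the file is actually read from or written to disk.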
An ecosystem issue
I will be using Rust as the reference language, but this applies to all programmers of all languages, from embedded to web developers. Most of the time you might not have to take encoding into account because you're interacting only with UTF-8, as you have for the past 10 years: if that's the case, I'm happy for you.
But if you're doing anything that may handle text (or data) of unknown origin, I urge you to ask yourself "should there be a bytes API?". Too many times I've stumbled across a library that provides interesting functionality but assumes everything to be UTF-8 when there is no real need for it.
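One way to offer a bytes API without giving up the convenient string one is to implement the logic on byte slices and keep the str version as a thin wrapper. The following is only a sketch with made-up function names, using trailing-whitespace trimming as a stand-in for whatever the library really does.

```rust
/// Core implementation: works on raw bytes, no encoding is assumed.
/// Only ASCII whitespace is trimmed; every other byte is left untouched.
fn trim_trailing_whitespace_bytes(line: &[u8]) -> &[u8] {
    let end = line
        .iter()
        .rposition(|b| !b.is_ascii_whitespace())
        .map_or(0, |i| i + 1);
    &line[..end]
}

/// Convenience wrapper for callers that do have UTF-8 text.
/// ASCII bytes never occur inside a multi-byte UTF-8 sequence, so the
/// trimmed length always falls on a character boundary.
fn trim_trailing_whitespace(line: &str) -> &str {
    let trimmed_len = trim_trailing_whitespace_bytes(line.as_bytes()).len();
    &line[..trimmed_len]
}
```

Callers with UTF-8 text lose nothing, and callers with Windows-1252, KOI-8 or binary data are no longer locked out.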
I think part of the reason is that Rust is one of the few languages that actually handles string types correctly: String, OsString and CString all play a distinct role that is needed to properly represent strings. String is for UTF-8 data, OsString is for strings in your OS's representation (that may not be UTF-8), and CString is for compatibility with C. This last one could die in theory in a world where C didn't exist, but Free Pascal didn't win so here we are. Because Rust makes it easy to properly handle UTF-8 data through String, developers are empowered to... sometimes do the wrong thing: in my opinion this is absolutely not a flaw in Rust, but merely a side-effect of how misunderstood encoding issues are. The decision not to have dedicated types and APIs for bytestrings in the Rust stdlib was probably made for the same reason as any other omission: to keep it minimal.
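The distinct roles of the standard string types show up in what each one refuses to hold. The wrapper functions below are illustrative names only; the calls inside them are plain stdlib.

```rust
use std::ffi::{CString, OsString};

/// `String` must be valid UTF-8, so arbitrary bytes may be rejected.
fn to_string_strict(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

/// `OsString` holds whatever the OS hands out. Converting it to a
/// `String` can fail, in which case the original value is handed back.
fn os_to_string(value: OsString) -> Result<String, OsString> {
    value.into_string()
}

/// `CString` is NUL-terminated for C, so interior NUL bytes are rejected.
/// It does not require the bytes to be UTF-8.
fn to_c_string(bytes: &[u8]) -> Option<CString> {
    CString::new(bytes).ok()
}
```

None of the three is a general-purpose container for file contents of unknown encoding: for that, a plain byte vector or slice is the type that makes no promises.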
Even well-known, widely used crates like regex, made by programmers that definitely understand the underlying issue, did not offer an interface for anything other than UTF-8 strings (regex#85) until a few versions in, after an issue was opened. There probably are other reasons why this feature wasn't implemented earlier, but to me this underlines the lack of attention that this problem receives.
Please, look at your crates/packages/gems/whathaveyou and take a minute to think about whether that UTF-8/Unicode restriction is really necessary.
Because "There Ain’t No Such Thing As Plain Text", we do a lot of bytestring manipulation in Mercurial; in Python that would be b"this is a bytestring!", and in Rust you would use a byte vector or slice (Vec<u8> or &[u8]), or maybe a dedicated crate.
The initial question I had for the people I mentioned at the beginning of the article was as follows: is there a crate that allows me to do bytestring formatting the way format! does String formatting? I wasn't able to find anything online in a good hour or so of searching, but I might have missed something. A particular person I interacted with was adamant that "implementing Display is enough", but Display writes through a Formatter, and that only handles str. So all the formatting-related macros in the Rust stdlib understandably produce UTF-8 strings, because Rust is voting for a UTF-8 future, which I am all for.
That, however, does not help me solve my issue. Even Python, which had bytestring formatting in Python 2, removed it in Python 3.0 and only re-introduced it in 3.5 after it was made clear that it is a very real need, albeit a somewhat niche one.
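To give an idea of the kind of functionality in question, here is a deliberately tiny sketch of bytestring formatting as a plain function. The name format_bytes and the {} placeholder syntax are assumptions made for this example; a real solution would more likely be a macro with compile-time checks.

```rust
/// Hypothetical helper: substitutes each `{}` in `template` with the next
/// entry of `args`, copying every byte verbatim (no UTF-8 validation).
/// If there are more placeholders than arguments, the extra placeholders
/// are left in the output unchanged.
fn format_bytes(template: &[u8], args: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::with_capacity(template.len());
    let mut next_arg = args.iter();
    let mut i = 0;
    while i < template.len() {
        if template[i..].starts_with(b"{}") {
            match next_arg.next() {
                Some(arg) => out.extend_from_slice(arg),
                None => out.extend_from_slice(b"{}"),
            }
            i += 2;
        } else {
            out.push(template[i]);
            i += 1;
        }
    }
    out
}
```

The arguments can be in any encoding, or none: a Windows-1252 user name and a binary hash can be spliced into the same output without either being validated or converted.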
I'm planning on writing a macro for that very purpose soon and putting it in a crate. If anyone already has similar functionality somewhere, I'd be happy to not do this work; otherwise I'll keep you posted.