If it has Text, it is Pain

I've been meaning to write tests for Wave for a while. Honestly, in my 12 years of dev I never actually wrote one outside of the odd academic homework assignment, so I had no idea how to approach it. After a couple of videos and presentations about general unit testing, TDD and stuff, I actually got intrigued, and when a coworker discovered a bug in Wave just today that left me baffled, I thought, okay, let's dissect this little Käfer.

Prerequisites: What does Wave do here actually

If you have not authored an article on Wave before, let me give you some background. When you want to load a specific article in the article view, there are three methods to find it, two actively used and one historical. The first and most direct, the “permalink” if you will, is going to /article/{id}, with the UUIDv4 ID of the article. This is used for stuff like the article editor and during the draft and review process, when the other methods don't work, and maybe one day there will be a “share permalink” option that will always link to that, although once published the second method is effectively immutable as well.

Now, that ID-based approach isn't nice for a public-facing thing: it is ugly to crawl, search engines don't like it, it's hard for a user to recall or copy, and if you share the link externally, no one can tell what you are sharing. Many sites, me with GeeksList included, fix this by adding the title in an escaped form to the path alongside the ID, and only really use the ID for lookup. With Wave I chose a different path: I wanted complete ID-less resolution. To achieve this, initially, I just encoded the title whenever I needed to put it in a link, and decoded it when looking for an article. For backwards compatibility this approach works to this day, if you have not yet used the new method. The new method gives some control to the user and improves URL niceness. Every article now also has a Slug property that will be inserted just after the date part of the URL, so if it is set to if-it-has-text-it-is-pain and the publish date is the 1st of May 2024, the full relative URL will be /2024/05/01/if-it-has-text-it-is-pain. Now, if you don't want to decide on a slug yourself, or don't really know what that is, it will be generated for you.
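As a rough sketch of how such a link comes together, assuming nothing more than a publish date and a slug: the `ArticleUrl` helper and its parameters are made up for illustration and are not Wave's actual code.

```csharp
using System;
using System.Globalization;

static class UrlSketch {
	// Hypothetical helper: builds /yyyy/MM/dd/slug from a publish date and a slug.
	// The quoted '/' and InvariantCulture keep the output independent of the server's locale.
	public static string ArticleUrl(DateTime publishDate, string slug) {
		string datePart = publishDate.ToString("yyyy'/'MM'/'dd", CultureInfo.InvariantCulture);
		return $"/{datePart}/{slug}";
	}
}
```

Called with the 1st of May 2024 and if-it-has-text-it-is-pain, this yields /2024/05/01/if-it-has-text-it-is-pain.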

And here things go wrong.

The Initial Concept

Unlike titles, which can have up to 256 characters, slugs are limited to a length of 64, but even without that limit the following could still cause issues in edge cases. In order to generate a slug, a logical first step is to do what has been done before: encode the title of the article. So we take the title, lowercase it, replace - with + and spaces with -, and then call Uri.EscapeDataString on it, to make it a safe slug for a URL.
Now, in order to make a URL absolutely error-proof for the web, you want to use only ASCII characters, excluding special characters. So something like a comma will be encoded to %2C.
And now you truncate it to 64 characters, right?
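A minimal sketch of that first idea, not the original Wave code. The `NaiveSlug` name is made up, and the sketch assumes the replacements happen after escaping (as in the final code further down), because escaping afterwards would turn the + into %2B.

```csharp
using System;

static class NaiveSlugSketch {
	// Hypothetical helper mirroring the steps described above
	public static string NaiveSlug(string title) {
		string escaped = Uri.EscapeDataString(title.ToLowerInvariant())
			.Replace("-", "+")    // a literal dash becomes a plus ...
			.Replace("%20", "-"); // ... so the dash is free to stand in for spaces
		// the naive part: cut at 64 without looking at what is being cut
		return escaped[..Math.Min(escaped.Length, 64)];
	}
}
```

For the title of this article, this gives if-it-has-text%2C-it-is-pain, with the comma escaped as described.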

Hold on Salomon

Here we begin with the problem my coworker caused. You see, he is a very… wordy kinda man, some evil tongues would suggest he spent way too much time in academia and getting certifications, but that shall not influence our tests. His title for an article was long, very long, more than 64 characters long. And as it happens, the German language contains some characters not fit for consumption by an Internet Explorer version 6 or something, and the title contained the word “zurück” at a very bad location. The “ü” was at position 62 of the title, and you might guess what happened next. This character gets encoded to %C3%BC, so inserted at position 62 and cut off at 64, your slug now ends in %C3. You may question now, what character is this? None, it is none, and that fucks with things.
The article saved as normal, but trying to view it ended in a 404. Wave couldn't find it, because the dangling %C3 fucked with the escape/unescape fuckery I had to do because Blazor auto-escapes route parameters.
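To make the failure concrete, here is a small sketch with a made-up title that puts the ü right at the cut-off. The `BrokenSlug` helper and the filler characters are assumptions for illustration; only the escape-then-cut-at-64 step is taken from the description above.

```csharp
using System;

static class DanglingEscapeSketch {
	// 61 filler characters, then "ü" as the 62nd character, then more filler.
	// In the escaped string the ü becomes %C3%BC at indices 61..66,
	// so a cut at 64 keeps only its first three characters.
	public static string BrokenSlug() {
		string title = new string('a', 61) + "ü" + new string('b', 20);
		string escaped = Uri.EscapeDataString(title);
		return escaped[..64];
	}
}
```

The result is 61 times a followed by %C3, which is half of a two-byte UTF-8 sequence and therefore not a character at all.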

So to save the child, a fix was needed. And the problem was, I did not know the escape sequence for “ü” was %C3%BC, because looking at the escapes for comma and question mark, I assumed it's a Unicode codepoint kinda situation, and the full escape must have been like %C3BC or something. So I did a math: subtracted the escaped slug's length from the unescaped, modulo 3 and subtracted from 3, took that away from the 64 length subtraction, which resulted in 62, so the new slug truncation worked like a charm… here.

And only here
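Why the codepoint assumption falls over: percent-encoding works on UTF-8 bytes, one %XX block per byte, so the escaped length of a character that gets escaped is three times its UTF-8 byte count. A minimal sketch, with the `EscapedLength` helper being a made-up name; it only holds for characters that are actually escaped, unreserved ASCII like letters and digits stays one character long.

```csharp
using System;
using System.Text;

static class EscapeLengthSketch {
	// Hypothetical helper: predicted length of the percent-encoding of a text element
	// that needs escaping, i.e. one "%XX" block per UTF-8 byte
	public static int EscapedLength(string textElement) =>
		3 * Encoding.UTF8.GetByteCount(textElement);
}
```

A comma is one byte (3 characters escaped), ü is two bytes (6), € is three bytes (9) and an emoji is typically four bytes (12).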

Opening Pandora's Box

With that pushed to prod I thought, hey, this is a prime opportunity to write a test for this. So I wrote some, a couple of red tests, some were already green of course, and one test case was for a special character at position 62 or 63.

They failed.

As previously stated, the escape sequence is a multiple of these 3-character-long blocks starting with a percent sign, so my initial assumption was wrong. Here I tried to do some more math, took the difference of before and after escape, counted the question mark and multiplied by 3, and broke some simpler tests I had written earlier that were green before. So I said, okay, fuck all of this, we are doing this professional style. So I did write… a lot of edge case tests.

Yes, I tested simple special characters at positions 61, 62 and 63, as well as a special character that is 3 escape sequences long, the € character, at 55, 56 and 57. Getting these to pass, after a lot of trial and error and some hard thinking, took of course the forbidden magic. As I thought and thought, my knowledge of automata from my IT theory lectures came up, and looking at the problem I realized this was approaching Turing Machine level complexity, so it was time for regex. No matter what I did, there was only one way: match uninterrupted sequences of percent, a-f or digit, a-f or digit, whose start index is before the cut-off limit and which would go over it. So I ended up with this:

// needs: using System.Linq; using System.Text.RegularExpressions;
string baseSlug = potentialNewSlug ?? Title;
baseSlug = baseSlug.ToLowerInvariant()[..Math.Min(64, baseSlug.Length)];
string slug = Uri.EscapeDataString(baseSlug).Replace("-", "+").Replace("%20", "-");

// I hate my life
int escapeTrimOvershoot = 0;
if (slug.Length > 64) {
	// Escape sequences come with a % and two hex digits, there may be up to 3 of such sequences
	// per character escaping ('?' has %3F, but € has %E2%82%AC), so we need to find the last group
	// of such an escape parade and see if it's going over by less than 9, because then we need to
	// remove more characters in the truncation, or we end up with a partial escape sequence.. parade
	// Regex.Matches instead of Regex.Match: Match stops at the first parade in the slug, so an
	// escaped comma early in the title would hide the parade sitting at the cut-off
	escapeTrimOvershoot = 64 - Regex.Matches(slug, @"(%[a-fA-F\d][a-fA-F\d])+")
		.Cast<Match>()
		.Last(m => m.Index < 64).Index;
	if (escapeTrimOvershoot > 9) escapeTrimOvershoot = 0;
}

Slug = slug[..Math.Min(slug.Length, 64 - escapeTrimOvershoot)];

So yea, this got all my tests passing.
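To get a feel for what that pattern picks up, here is a small sketch that lists every escape run in a slug with its start index and length. The `EscapeRuns` helper and the sample slug are made up for illustration; only the pattern is taken from the code above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class EscapeRunSketch {
	// Returns (index, length) for every uninterrupted run of %XX blocks in the slug
	public static (int Index, int Length)[] EscapeRuns(string slug) =>
		Regex.Matches(slug, @"(%[a-fA-F\d][a-fA-F\d])+")
			.Cast<Match>()
			.Select(m => (m.Index, m.Length))
			.ToArray();
}
```

For the slug zur%C3%BCck%2C-10%E2%82%AC this reports runs at index 3 (length 6), index 11 (length 3) and index 17 (length 9). Note that two escaped characters directly next to each other merge into a single run, so a run can be longer than any single character's escape.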

Wait, just 9?

Yes, smart reader, you might now have an objection to this code. First of all, there is a thing that reliably produces an escape sequence with 4 escape blocks: emojis.
My answer to that? I spent like four hours building this, if you put emojis into your titles, go suffer.
Second of all, what if the escape sequence starts less than 9 characters before the cut-off, but wouldn't actually reach it? Wouldn't it also get rid of extra non-escape characters? Like in the “zurück” example, if it were to start at like character 58, wouldn't it cut off üc even though it would fit?
Yes.
... I do not care.
... I am tired.
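For the less tired: one way both objections could be sidestepped is to not truncate the escaped string at all, but to escape one code point at a time and stop before the first one that no longer fits. This is a hypothetical alternative, not what Wave does. The `TruncatedSlug` name and `maxLength` parameter are made up, and it assumes .NET Core 3.0 or later for `Rune` and `EnumerateRunes`.

```csharp
using System;
using System.Text;

static class RuneSlugSketch {
	// Hypothetical alternative: build the slug piece by piece, so an escape
	// sequence is either added completely or not at all
	public static string TruncatedSlug(string title, int maxLength = 64) {
		var slug = new StringBuilder();
		foreach (Rune rune in title.ToLowerInvariant().EnumerateRunes()) {
			// same per-character transformation as the code above
			string piece = Uri.EscapeDataString(rune.ToString())
				.Replace("-", "+")
				.Replace("%20", "-");
			// stop at the first piece that would go over the limit
			if (slug.Length + piece.Length > maxLength) break;
			slug.Append(piece);
		}
		return slug.ToString();
	}
}
```

With the ü starting at index 58 the full %C3%BC still fits and is kept; at index 61 the slug simply ends before it, and a four-block emoji is handled the same way without any magic 9.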

Conclusion

I wanted to start with some simple tests, to get my feet wet.

If it does text transformation, it is never simple.

Never.

So be wary, reader: ye who shall approach the Hydra of Unicode must tread carefully when truncating its extremities, as it may grow two or three or four more in their place.

About the Author

Mia Rose Winter
