Wave

If it has Text, it is Pain

Mia Rose Winter



I've been meaning to write tests for Wave for a while. Honestly, in my 12 years of dev I never actually wrote one outside of the odd academic homework assignment, so I had no idea how to approach it. After a couple of videos and recorded talks about general unit tests, TDD and stuff, I actually got intrigued, and when a coworker discovered a bug in Wave just today that baffled me, I thought, okay, let's dissect this little Käfer (that's German for “beetle”).

Prerequisites: What does Wave actually do here?

If you have not authored an article on Wave before, let me give you some background. When you want to load a specific article in the article view, there are three methods to find it, two actively used and one historical. The first and most direct, the “permalink” if you will, is going to /article/{id}, with the UUIDv4 ID of the article. This is used for stuff like the article editor and during the draft and review process, when the other method doesn't work, and maybe one day there will be a “share permalink” option that will always link to that, although once published the second method is also pretty much immutable.

Now, that ID-based approach isn't nice for a public-facing thing: it is ugly to crawl, search engines don't like it, it's hard for a user to recall or copy, and if you share the link externally, no one can tell what you are sharing. Many sites, including my own GeeksList, fix this by adding the title in an escaped form to the path, as well as the ID, and only really use the ID for lookup. With Wave, I chose to go a different path: I wanted complete ID-less resolution. To achieve this, initially, I just encoded the title whenever I needed to put it in a link, and decoded it when looking for an article. For backwards compatibility, this approach works to this day... if you have not yet used the new method. The new method gives the user some control and makes the URLs nicer. Every article now also has a Slug property that will be inserted just after the date part of the URL, so if it is set to if-it-has-text-it-is-pain and the publish date is the 1st of May 2024, the full relative URL will be /2024/05/01/if-it-has-text-it-is-pain. Now, if you do not want to decide on a slug yourself, or don't really know what that is, it will be generated for you.
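
As a minimal sketch of how such a path could be put together: the helper name ArticlePath is made up for illustration, and Wave's actual routing code is not shown in this article. The date part is formatted with the invariant culture so the separators are always slashes, whatever the server locale is:

```csharp
using System;
using System.Globalization;

// Hypothetical helper, not Wave's actual code: date part first, then the slug.
static string ArticlePath(DateTime publishDate, string slug) =>
    "/" + publishDate.ToString("yyyy'/'MM'/'dd", CultureInfo.InvariantCulture) + "/" + slug;

Console.WriteLine(ArticlePath(new DateTime(2024, 5, 1), "if-it-has-text-it-is-pain"));
// /2024/05/01/if-it-has-text-it-is-pain
```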

And here things go wrong.

The Initial Concept

Slugs, as opposed to titles with their 256 characters, have a limited length of 64, but even without that limit the following could still cause issues in edge cases. In order to generate a slug, a logical first step is to do what has been done before: encode the title of the article. So we take the title, lowercase it, replace - with + and spaces with -, and then call Uri.EscapeDataString on it, to make it a safe slug for a URL.
Now, in order to make a URL absolutely error proof for the web, you want to use only ASCII characters, excluding special characters. So something like a comma will be encoded to %2C.
And now you truncate it to 64 characters, right?

Hold on, Solomon

Here we begin with the problem my coworker caused. You see, he is a very… wordy kinda man, some evil tongues would suggest he spent way too much time in academia and getting certifications, but that shall not influence our tests. His title for an article was long, very long, more than 64 characters long. And as it happens, the German language contains some characters not fit for consumption by an Internet Explorer version 6 or something: the title contained the word “zurück” at a very bad location. The “ü” was at position 62 of the title, and you might guess what happened next. This character gets encoded to %C3%BC, so inserted at position 62 and cut off at 64, your slug now ends in %C3. You may question now, what character is this? None, it is none, and that fucks with things.
The article saved as normal, but trying to view it ended in a 404. Wave couldn't find it, because the dangling %C3 fucked with the escape/unescape fuckery I had to do because Blazor auto-escapes route parameters.
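
The failure is easy to reproduce in isolation. The following is a minimal sketch, assuming the naive escape-then-cut order described above. NaiveSlug is a hypothetical helper, not Wave's code, and the title is just filler that puts the “ü” at position 62. Note that the lowercase “ü” escapes to %C3%BC (the uppercase “Ü” would be %C3%9C):

```csharp
using System;

// Hypothetical helper: escape the whole title first, then cut at the limit.
static string NaiveSlug(string title, int maxLength = 64)
{
    string escaped = Uri.EscapeDataString(title.ToLowerInvariant().Replace(' ', '-'));
    return escaped[..Math.Min(escaped.Length, maxLength)];
}

// 61 filler characters put the "ü" at position 62 of the title.
string title = new string('a', 61) + "ück";
string slug = NaiveSlug(title);

Console.WriteLine(Uri.EscapeDataString("ü")); // %C3%BC
Console.WriteLine(slug.Length);               // 64
Console.WriteLine(slug[^3..]);                // %C3
```

The last three characters are the first half of the two-block escape for “ü”, which on its own is not a valid UTF-8 sequence.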

So to save the child, a fix was needed. And the problem was, I did not know the escape sequence for “ü” was %C3%BC, because looking at the escapes for comma and question mark, I assumed it was a Unicode codepoint kinda situation, and the full escape must have been like %C3BC or something. So I did a math: subtracted the escaped slug's length from the unescaped one, took that modulo 3 and subtracted it from 3, then took the result away from the 64 used for the truncation. That gave 62, so the new slug truncation worked like a charm… here.

And only here

Opening Pandora's Box

With that pushed to prod I thought, hey, this is a prime opportunity to write a test for this. So I wrote some: a couple of red tests, some that were already green of course, and one test case for a special character at position 62 or 63.

They failed.

As previously stated, the escape sequence is a multiple of these 3-character-long blocks starting with a percent sign, so my initial assumption was wrong. Here I tried to do some more math, took the difference of before and after escape, counted the question mark and multiplied by 3, and broke some simpler tests I had written earlier that were green before. So I said, okay, fuck all of this, we are doing this professional style. So I did write… a lot of edge case tests.

Yes, I tested simple special characters at position 61, 62 and 63, as well as a special character that is 3 escape blocks long, the € character, at 55, 56 and 57. What it took to pass all of this, after a lot of trial and error and some hard thinking, was of course the forbidden magic. As I thought and thought, my knowledge of automata from my IT theory lectures came up, and looking at the problem I realized this was approaching Turing Machine level complexity, so it was time for regex. No matter what I did, there was only one way: match uninterrupted sequences of percent, a-f or digit, a-f or digit, whose start index is before the cut-off limit and which would go over it. So I ended up with this:

string baseSlug = potentialNewSlug ?? Title;
baseSlug = baseSlug.ToLowerInvariant()[..Math.Min(64, baseSlug.Length)];
string slug = Uri.EscapeDataString(baseSlug).Replace("-", "+").Replace("%20", "-");
		
// I hate my life
int escapeTrimOvershoot = 0;
if (slug.Length > 64) {
	// Escape sequences come with a % and two hex digits, there may be up to 3 of such sequences
	// per character escaping ('?' has %3F, but € has %E2%82%AC), so we need to find the last group
	// of such an escape parade and see if it's going over by less than 9, because then we need to 
	// remove more characters in the truncation, or we end up with a partial escape sequence.. parade
	escapeTrimOvershoot = 64 - Regex.Match(slug,
		@"(?<escape>(%[a-fA-F\d][a-fA-F\d])+)",
		RegexOptions.None | RegexOptions.ExplicitCapture)
	.Groups.Values.Last(g => g.Index < 64).Index;
	if (escapeTrimOvershoot > 9) escapeTrimOvershoot = 0;
}

Slug = slug[..Math.Min(slug.Length, 64 - escapeTrimOvershoot)];

So yea, this got all my tests passing.
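
The “up to 3” in the code comment comes from how percent-encoding works: Uri.EscapeDataString emits one %XX block per UTF-8 byte of the character. A small sketch to check that for the characters mentioned so far (EscapedLength is a hypothetical helper name):

```csharp
using System;
using System.Text;

// Hypothetical helper: length of a string after Uri.EscapeDataString.
static int EscapedLength(string s) => Uri.EscapeDataString(s).Length;

foreach (string s in new[] { "?", "ü", "€" })
{
    int bytes = Encoding.UTF8.GetByteCount(s);
    Console.WriteLine($"{Uri.EscapeDataString(s)}: {bytes} UTF-8 byte(s), {EscapedLength(s)} characters");
}
// %3F: 1 UTF-8 byte(s), 3 characters
// %C3%BC: 2 UTF-8 byte(s), 6 characters
// %E2%82%AC: 3 UTF-8 byte(s), 9 characters
```

Three blocks of three characters each is where the 9 in the overshoot check comes from.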

Wait, just 9?

Yes, smart reader, you might now have an objection to this code. First of all, there is a thing that reliably produces an escape sequence with 4 escape blocks: emojis.
My answer to that? I spent like four hours building this, if you put emojis into your titles, go suffer.
Second of all, what if the escape sequence is shorter than 9, but wouldn't reach the cut-off length? Wouldn't the truncation also get rid of extra non-escape characters? Like in the “zurück” example, if the “ü” were to start at like character 58, wouldn't it cut off “üc” even though it would fit?
Yes.
... I do not care.
... I am tired.
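
For the less tired reader, here is one possible way around both objections, as a rough sketch and not what Wave does: instead of escaping everything and then repairing the cut, escape the title one text element at a time and stop before the limit is exceeded. BuildSlug and its maxLength parameter are hypothetical names, and the space and hyphen handling imitates the Replace calls in the code above.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not Wave's implementation: builds the slug one text
// element (user-perceived character) at a time and stops before the limit.
static string BuildSlug(string title, int maxLength = 64)
{
    var slug = new StringBuilder();
    TextElementEnumerator elements =
        StringInfo.GetTextElementEnumerator(title.ToLowerInvariant());

    while (elements.MoveNext())
    {
        string element = elements.GetTextElement();
        string piece = element switch
        {
            " " => "-",                        // spaces become hyphens
            "-" => "+",                        // literal hyphens become plus signs
            _ => Uri.EscapeDataString(element) // everything else is escaped as a whole
        };

        // Stop as soon as the next complete piece would not fit.
        if (slug.Length + piece.Length > maxLength) break;
        slug.Append(piece);
    }

    return slug.ToString();
}

// "ü" at position 62 does not fit and is dropped whole: 61 characters remain.
Console.WriteLine(BuildSlug(new string('a', 61) + "ück").Length); // 61
// "ü" at position 58 fits, and so does the "c" after it.
Console.WriteLine(BuildSlug(new string('a', 57) + "ück")[^7..]);  // %C3%BCc
```

Because every piece is appended whole or not at all, a partial escape sequence cannot occur, no matter whether a character needs one, three or four blocks. The price is one EscapeDataString call per character, which should hardly matter for titles of at most 256 characters.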

Conclusion

I wanted to start with some simple tests, to get my feet wet.

If it does text transformation, it is never simple.

Never.

So be wary, reader: ye who shall approach the Hydra of Unicode must tread carefully when truncating its extremities, as it may grow two or three or four more in their place.

About the Author

Mia Rose Winter

Software Developer / Project Manager. Full-time cat Woman and bisexual menace. Really not liking tech these days, I have more fun writing stories and books. Developer of GeeksList, Just Short It and Wave.
