As an interview question, it's usually asked just about the technical bits of doing an in-place swap of 8-bit items to reverse their order (regardless of what characters those might actually represent). At the same time, especially if you're interviewing a relatively senior person, you could at least hope to hear some questions about the specification and the exact form of the input. Even if you direct them back to the simple case of just swapping 8-bit items, knowing whether or not they think in broader terms than that may be valuable.

If you do have to deal with a broad range of inputs, you just about have to think in terms of a "stack", a bit like a network stack. You have to build your software in a number of layers, each of which applies a fairly specific set of transforms in a specific order. This lets you keep each part of the transformation simple enough that you can keep it under control, and stand a reasonable chance of making it meet its requirements.

I'll outline one possibility that I have found at least somewhat workable. I'm the first to admit that there may be others who have better ideas though. At least to me, this seems a bit like brute-force engineering, with little real elegance.

You normally want to start by converting any other representation to UCS-4 (aka UTF-32). For this, you'd generally prefer to rely on input from the user rather than attempt to figure out the encoding on your own. In some cases, you can be sure a particular sequence of octets does not follow the rules of a particular encoding scheme, but you can rarely (if ever) be sure that it does follow a particular encoding scheme.

You can then, optionally, normalize the input to one of the four Unicode normalization forms. In this case, you'd probably want to apply the "NFKC" transformation: compatibility decomposition followed by canonical composition. This will (where possible) convert combining diacritical forms (such as the U+0301 that Jon mentioned) into single code points (e.g., an "A" followed by U+0301 would be converted to "Latin capital A with acute", U+00C1).

You then walk through all the characters from beginning to end, breaking the string into actual characters and, if there are (still) combining diacritic marks, keeping them with the characters they modify. The result of this will typically be an index of the actual characters in the string, such as the position and length of each.

You then reverse the order of those complete characters, typically by using the index you created in the previous step.

You then (again, optionally) apply another Unicode normalization process, such as NFD (canonical decomposition). This will turn the aforementioned "Latin capital A with acute" back into two code points: a "Latin capital A" and a "combining acute". If your input happened to contain a U+00C1 to start with, however, that would be converted into two code points as well.

You then encode the sequence of UCS-4 code points into the desired encoding (UTF-8, UTF-16, etc.).

Note that the Unicode normalization steps can (and often will) change the number of code points needed to store the string, so if you include them, you can no longer plan on the result fitting into the original storage.
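As a rough illustration, the layers described above (decode, optionally normalize, index the characters together with their combining marks, reverse, optionally normalize again, encode) might be sketched in Python as follows. The function name `reverse_text` and the parameters `pre_form` and `post_form` are made up for this example, and the clustering rule is a simplification: it only attaches code points with a non-zero canonical combining class to the preceding character, which is less than full grapheme-cluster segmentation.

```python
import unicodedata
from typing import List, Optional, Tuple


def reverse_text(data: bytes,
                 encoding: str = "utf-8",
                 pre_form: Optional[str] = "NFKC",
                 post_form: Optional[str] = "NFD") -> bytes:
    """Reverse encoded text while keeping combining marks with their base.

    Hypothetical helper; `encoding` should come from the caller rather
    than being guessed from the bytes.
    """
    # Layer 1: decode into a sequence of code points (a Python str).
    text = data.decode(encoding)

    # Layer 2 (optional): compose where possible, e.g. A + U+0301 -> U+00C1.
    if pre_form is not None:
        text = unicodedata.normalize(pre_form, text)

    # Layer 3: build an index of (start, length) for each complete character.
    # Simplifying assumption: a code point with a non-zero canonical
    # combining class belongs to the character before it.
    index: List[Tuple[int, int]] = []
    for pos, ch in enumerate(text):
        if index and unicodedata.combining(ch) != 0:
            start, length = index[-1]
            index[-1] = (start, length + 1)
        else:
            index.append((pos, 1))

    # Layer 4: reverse the order of the complete characters using the index.
    result = "".join(text[start:start + length]
                     for start, length in reversed(index))

    # Layer 5 (optional): normalize again, e.g. decompose with NFD.
    if post_form is not None:
        result = unicodedata.normalize(post_form, result)

    # Layer 6: encode into the desired output encoding.
    return result.encode(encoding)


if __name__ == "__main__":
    original = "A\u0301bc".encode("utf-8")   # A + combining acute, b, c
    print(reverse_text(original).decode("utf-8") == "cbA\u0301")  # True
```

With the default forms, a two-byte UTF-8 input containing only U+00C1 comes back as three bytes (an "A" plus U+0301), which is the storage growth mentioned above. Passing `pre_form=None` and `post_form=None` skips both optional layers and leaves the code points themselves unchanged.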