About this blog…

I am employed by Netnod as head of engineering, research and development and am, among other things, chair of the Security and Stability Advisory Committee at ICANN. You can find my CV and photos of me on this page.

As I wear so many hats, I find it necessary to have somewhere to express my personal views on things. This is where that happens. Postings on this blog, or on Facebook, Twitter etc., fall under this policy.

The views expressed in this post are mine and do not necessarily reflect the views of Netnod or any of the other organisations I have connections to.

Python and Unicode

As some of you saw, I wrote earlier about problems I had with wide Unicode codepoints. The cause was that the Python I use is not compiled with support for them. Naive as I was, I thought this was one of the things that had been fixed in Python 3.x, but it was not. On such a build, codepoints that need more than 16 bits are stored as surrogate pairs, which makes a workaround possible. It is not pretty, and possibly not the most efficient Python code out there (note that it uses some data structures that I have already created elsewhere):

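def unicodeCaseFold(c):
  # unicodeSpecialCaseFoldingDict is created elsewhere; it maps a
  # codepoint (int) to its case-folded string.
  result = ""
  maxlen = len(c)
  i = 0
  while i < maxlen:
    value = ord(c[i])
    j = i + 1
    if (0xD800 <= value <= 0xDBFF and j < maxlen
        and 0xDC00 <= ord(c[j]) <= 0xDFFF):
      # High surrogate followed by a low surrogate: this is a surrogate
      # pair, so combine the two halves into one codepoint
      value = 0x10000 + ((value - 0xD800) << 10) + (ord(c[j]) - 0xDC00)
      j = i + 2
    if value in unicodeSpecialCaseFoldingDict:
      result = result + unicodeSpecialCaseFoldingDict[value]
    else:
      result = result + c[i:j].lower()
    # Advance past the one or two code units just handled
    i = j
  return result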
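
For reference, here is roughly how a codepoint above U+FFFF maps to a surrogate pair and back. This is only a minimal sketch: the names to_surrogates, from_surrogates and codepoints are made up for illustration and are not part of the code above or of any library.

```python
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (lead) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trail) surrogates


def to_surrogates(codepoint):
    """Split a codepoint above U+FFFF into a (high, low) surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("codepoint does not need a surrogate pair")
    offset = codepoint - 0x10000          # 20 significant bits remain
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one codepoint."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


def codepoints(s):
    """Yield the codepoints of s, joining any surrogate pairs found in it."""
    i = 0
    while i < len(s):
        value = ord(s[i])
        if (HIGH_START <= value <= HIGH_END and i + 1 < len(s)
                and LOW_START <= ord(s[i + 1]) <= LOW_END):
            value = from_surrogates(value, ord(s[i + 1]))
            i += 1                        # skip the low surrogate as well
        i += 1
        yield value
```

Iterating with codepoints(s) gives the same values whether the string holds surrogate pairs or whole codepoints, so a lookup table keyed on codepoints can be used on either kind of Python build.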

Updated: Some “<” and “>” were not visible earlier as I forgot to encode them properly in HTML… Now fixed.

Updated: Corrected some bugs in the code.
