From 5ec19118ea597c05021b4d0ba92586886aed3bc3 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Haoran=20S=2E=20Diao=20=28=E5=88=81=E6=B5=A9=E7=84=B6=29?= <0@hairydiode.xyz> Date: Sat, 25 Nov 2023 04:50:42 -0800 Subject: Unicode jank writeup --- cont/unihome.html | 322 +++++++++++++++++++++++++++++------------------------- 1 file changed, 176 insertions(+), 146 deletions(-) diff --git a/cont/unihome.html b/cont/unihome.html index 953f0dd..ec7e540 100644 --- a/cont/unihome.html +++ b/cont/unihome.html @@ -2,149 +2,179 @@ 123456789-223456789-323456789-423456789-523456789-623456789-723456789-8234567890 一二三四-->[TITLE] [DATE] -------------------------------------------------------------------------------- -[SETTITLE]We Have Unicode at Home -[SETDATE]6-30-2023 -Preface - it's just uses more memory, handwriting in the 70s, arabic/farsi - terminals, historically never existed an ascii only time. telegram codes -busybox - bash - sed - awk - grep - bc - iconv - xxd - read - sort - uniq - cat -tmux -kbd - console-braille -zpix bdf - 30M -zpix ttf - 4.5MiB -jizji - 1.3M -misaki - 747K -Google - 2.7 MB -LinBiolinumTI.pfb - 860KiB -HanaMinA - 22M,30M -unifont - 11.7MiB -Latex 2.9GiB -cm-super - 57.8MiB - just european languages + cyrillic -cbfonts - 70.6MiB -ensembl human genome - 4.5GiB -Rant - Aesthetics vs. Function - cool-retro-term, pixel fonts, monospace of chinese vs english - - - - The text confusion - In the beginning there was not the command line. There was wall - paintings bone etc - - Inefficiency - The only first class data types on a computer are int, - uint, and float. Why is there not universal way to - display/store them on posix systems, 256 combos per byte, - only 9 used, less than 5% efficiency - - HTML v. inefficient, easy to grep kinda - Json, v. inefficient - - Data confusion - IME - table takes in keypresses, spits out unicode character - keypresses should be own type, but is ascii, - what happens when different keyboard layout? 
- - What happens if typing russiand and want to use - vim or press C-c? - - Big table, very simple datatype, not first class - - Tree/files, super simple datatype, not first - class, file argument woes - - Display: - simply doing an OR required like 3 processes - because every program required different text - representation of the same data, even though - first class data type - - no language has first class lexer, closest is - awk - - bdf file ridiculously inefficient, keywords too - long, actual data is 2x by hexadec - representation - - bdf file is just a big table w/ 2d array as - output , very simple data type, have to do 1000 - conversions for input (decimal codepoint vs - 32bit vs utf-8), and output (2d array of bits vs - hex representation of the same) - - Big table no way to sort to make more efficient - - Representation - Forced to represent all out data so that the lowest - common denominator teletype in 1970s new jersey can - print it if we were to send it directly over serial - not just a bash issue: JSON, HTML, PDB, even - PDF/postscript - Ascii isn't event text, can't write accents or - directiona quotes or nn or even a bar over a letter. - Flipside, nobody who doesn't use posix knows or cares - what ~ and | are. 
- - Regex, same basic thing, 30 different variants, because - forced to represent as text with no specialized symbols - - same with code, every language has its own way of - representing a code block, none of which are - particularly legible - - if should be one key press and one byte - - In-band vs out of band - no universal way to embed data, json has directional - brackets, backslash hell is the norm, completely - avoidable, but the text obsession means type info is - ignored - guis - all based off of one dumb xerox experiment - all have same issues - lossy data display - no interop of actual data - no open loop input - no way to store input as its own data/scripting - - in memory data: - no interop, spend all your time using framework - libraries to convert data around. It's not just - a bash issue - - weird selection of first class data types, why - is text 1st class and not a mesh or a linked - list? -Rant - In the beginning, there was not a command line. In the beginning, there - was iron oxide pigment on torch lit cave walls, then there were stylus - indentations on clay, patterns carved on turtle shell, knots - tied in string, grooves cut in vinyl, and finally discrete states stored - in a great multitude of mechanisms. The universal datatype is not text, - it is uint_256, IEEE floating points.
+So as we all know, the Linux console font is limited to 512 glyphs, and the
+console lives in kernel space. So I wrote a workaround that displays unicode
+characters as braille graphics (assuming your linux console font has braille
+characters), using only userland busybox.
+
+
+--------------------------=[Part 1. Braille Graphics]=--------------------------
+Braille graphics are actually really easy: the braille block goes from U+2800
+to U+28FF, with the lower 8 bits corresponding to the dots in each braille
+character in the following order:
+
+#0 3
+#1 4
+#2 5
+#6 7
+
+with 0 being the lowest bit and 7 being the highest bit.
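As a rough sketch of the bit layout above (this is not the code from the
writeup): the hypothetical helper braille_codepoint below takes bit numbers
from the chart and adds the matching power of two to 10240, which is U+2800 in
decimal. Addition stands in for OR here, so it assumes each bit number is
passed at most once.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the original script.
# Usage: braille_codepoint BIT...   (each BIT is 0-7, as in the chart)
# Prints the decimal codepoint of the braille character with those dots set.
braille_codepoint() {
    cp=10240                          # U+2800, the empty braille cell
    for bit in "$@"; do
        weight=1
        i=0
        while [ "$i" -lt "$bit" ]; do # weight = 2^bit
            weight=$((weight * 2))
            i=$((i + 1))
        done
        cp=$((cp + weight))           # same as OR while each bit is used once
    done
    printf '%d\n' "$cp"
}

braille_codepoint 0 3        # top row, both dots: prints 10249 (U+2809)
braille_codepoint 0 1 2 6    # whole left column: prints 10311 (U+2847)
```

The output is a plain codepoint; it still has to be encoded as UTF-8 before
the console can show it.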
+
+UTF-8 encodes codepoints in this range with three bytes:
+
+1110xxxx 10xxxxxx 10xxxxxx
+
+where x represents the bits of the codepoint. U+2800 converted to UTF-8 is
+therefore 0xE2A080 (big endian), or 14852224 in decimal (I'll explain why
+decimal is relevant later).
+
+If you take each pixel in the buffer, shift it to the bit position given by the
+chart above (adjusted for where that bit lands in the UTF-8 encoding), and OR it
+into the base value, you get your desired braille character.
+
+The script avoids bitwise operators, and its hex to decimal conversion calls a
+separate process, so it works in decimal and adds where it would OR. Our code
+ends up looking like this:
+
+        if [ "${rawbuff[((1+4*$2))]:((1+2*$1)):1}" == "1" ];then
+                num=$(($num + 16))
+        fi
+
+        where $num starts off as 14852224, rawbuff is a raw pixel buffer where
+        each row is stored as a string with '1' marking a filled-in pixel, and
+        the x and y positions of the braille block being rendered are in $1 and
+        $2.
+
+The above code takes the value of the raw pixel buffer at position (1,1)
+relative to the current braille block and, if it is set, adds 16 (1 shifted left
+by 4), which is equivalent to ORing bit 4 into the rendered braille character.
+
+
+I also wrote some code to take commands that draw in the raw pixel buffer as
+well.
+
+code here
+
+----------------=[Part 2. Rendering BDF fonts with only busybox]=---------------
+BDF is a human legible bitmap font format where each character entry looks like:
+
+STARTCHAR uni6D69
+ENCODING 28009
+SWIDTH 1000 0
+DWIDTH 8 0
+BBX 7 7 0 -1
+BITMAP
+98
+1C
+A8
+3E
+80
+9C
+9C
+ENDCHAR
+
+The first line names the character (here by its unicode codepoint), followed by
+some info I don't care about, and then the bitmap data, where each row of pixels
+is stored as one line of hex. Converting that hex to binary gives the
+"raw pixel format" from before. So all we really need to do is write a small awk
+script to find the relevant bitmap lines, then convert to binary and display it
+with the previous braille display script.
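A minimal sketch of that awk step, under some assumptions: bdf_rows is a
hypothetical name, the font is read from stdin, and the glyph is picked by the
decimal number on its ENCODING line. This is not the script from the writeup,
just one way the lookup and the hex to binary conversion could go using only
POSIX awk functions.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the original script.
# Usage: bdf_rows DECIMAL_CODEPOINT < font.bdf
# Prints each bitmap row of that glyph as a string of 0s and 1s.
bdf_rows() {
    awk -v want="$1" '
        BEGIN {
            # Lookup table: position of a hex digit (1-based) -> 4 binary digits.
            split("0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111", bin, " ")
            hex = "0123456789ABCDEF"
        }
        $1 == "ENCODING" { found = ($2 == want) }    # decimal codepoint line
        $1 == "ENDCHAR"  { in_bitmap = 0; found = 0 }
        in_bitmap && found {
            line = toupper($1)
            row = ""
            for (i = 1; i <= length(line); i++)
                row = row bin[index(hex, substr(line, i, 1))]
            print row
        }
        $1 == "BITMAP" { in_bitmap = 1 }             # rows start on the next line
    '
}

# The glyph from the entry above, trimmed to the lines the sketch looks at.
bdf_rows 28009 <<'EOF'
STARTCHAR uni6D69
ENCODING 28009
BBX 7 7 0 -1
BITMAP
98
1C
A8
3E
80
9C
9C
ENDCHAR
EOF
```

For this glyph the first row, 98, comes out as 10011000. Each hex row is padded
to a whole byte, so with a BBX width of 7 the last column is padding.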
+
+Complete Character Display code here
+
+-------------------------=[Part 3. UTF-8 Shenanigans.]=-------------------------
+One annoying thing about utf-8 is that if you want to get the codepoint of a
+particular character in a utf-8 string, you have to do some iconv trickery where
+you first convert it to UTF-32, then convert it to hex.
+
+Another problem is that BDF stores the codepoint as DECIMAL!!!!! You see that
+line "STARTCHAR uni6D69"? That's just the name of the character, it could
+theoretically be anything. The actual line storing the codepoint is
+"ENCODING 28009". So we have to convert from hex to decimal, which is a
+surprisingly convoluted procedure in bash.
+
+All this is done in a wrapper script that reads all the input from stdin and
+displays it using all the fonts in a directory given as its argument.
+
+wrapper script code here
+
+----------------------------=[Part 4. Practical Use]=---------------------------
+So remember the janky bash based IM from last time? I modified it to use the
+braille display from before. I also wrote a little script that displays all the
+non-ASCII characters in the previously focused tmux pane, so together we can
+both display and input utf-8 characters in the linux console using tmux.
+
+see the code and writeup
+
+
+"Screenshots" below:
+
+Bash running in tmux
+[usernm@cm│[usernm@cmphostname ~]$ mkdir 帖 │乔
+phostname │[usernm@cmphostname ~]$ cd 帖 │pdr
+~]$ ud │[usernm@cmphostname 帖]$ vim 天干 │⢠⠋⣏⡁⡆⡇⠀⠀⠁
+⡤⡧⡄⠀⡧⠄⠀⠀ │ │⢹⠔⢅⠇⡇⡇⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠸⠠⠊⠀⠥⠇⠀⠀⠂
+⠁⠏⠁⠧⠤⠇⠀⠀ │ │⣲⡪⢰⣓⣲⠀⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀ │ │⠒⣱⠘⡖⡞⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠩⠜⠠⠃⠧⠇⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │ │⢠⠴⠥⠤⡄⠀⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀ │ │⠸⢭⠭⡭⠇⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠤⠊⠀⠣⠤⠇⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │ │
+⠉⠉⢹⠉⠉⠁⠀⠀ │ │
+⠉⠉⡝⡍⠉⠁⠀⠀ │ │
+⠤⠊⠀⠈⠢⠄⠀⠀ │ │
+⠈⠉⢹⠉⠉⠀⠀⠀ │ │
+⠒⠒⢺⠒⠒⠂⠀⠀ │ │
+⠀⠀⠸⠀⠀⠀⠀⠀ │ │
+[usernm@cm│ │
+phostname │ │
+~]$ │ │
+ │ │
+The left pane is displaying all the unicode characters in the primary terminal
+(remember, on the linux console they would all just be squares), and the right
+pane is the input method, which displays candidate characters in bash.
+
+Vim running in tmux
+⡇⡇⡇⡖⠓⡆⠀⠀ │甲乙丙丁 │之 鐻
+⠁⠏⠁⠧⠤⠇⠀⠀ │ 最常用 │azn
+⡤⡧⡄⠀⡧⠄⠀⠀ │~ │⠤⠤⠼⠤⢤⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │~ │⠀⠀⣀⠔⠁⠀⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │~ │⠔⠉⠒⠤⠤⠄⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀ │~ │⣊⡂⣀⣗⣒⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │~ │⢺⡂⣗⢗⡖⡃⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │~ │⠽⠴⠑⠝⠘⠄⠀⠀
+⠉⠉⢹⠉⠉⠁⠀⠀ │~ │
+⠉⠉⡝⡍⠉⠁⠀⠀ │~ │
+⠤⠊⠀⠈⠢⠄⠀⠀ │~ │
+⠈⠉⢹⠉⠉⠀⠀⠀ │~ │
+⠒⠒⢺⠒⠒⠂⠀⠀ │~ │
+⠀⠀⠸⠀⠀⠀⠀⠀ │~ │
+[usernm@cm│~ │
+phostname │~ │
+~]$ ud │~ │
+⣏⣉⣹⣉⣉⡇⠀⠀ │~ │
+⠧⠤⢼⠤⠤⠇⠀⠀ │~ │
+⠀⠀⠸⠀⠀⠀⠀⠀ │~ │
+⠉⠉⢉⠝⠋⠀⠀⠀ │~ │
+⢀⠔⠁⠀⠀⡀⠀⠀ │~ │
+⠣⠤⠤⠤⠤⠃⠀⠀ │~ │
+⣉⣉⣹⣉⣉⡁⠀⠀ │~ │
+⡇⢀⠜⢄⠀⡇⠀⠀ │~ │
+⠇⠁⠀⠀⠥⠇⠀⠀ │~ │
+⠉⠉⢹⠉⠉⠁⠀⠀ │~ │
+⠀⠀⢸⠀⠀⠀⠀⠀ │~ │
+⠀⠠⠼⠀⠀⠀⠀⠀ │~ │
+⢸⠭⠭⠭⢽⠀⠀⠀ │~ │
+⢹⠭⡏⡭⠭⡅⠀⠀ │~ │
+⠚⠉⠇⠬⠪⠄⠀⠀ │~ │
+⡖⣓⣚⣒⡓⡆⠀⠀ │~ │
+⢀⣓⣲⣒⣃⠀⠀⠀ │~ │
+⠘⠀⠸⠀⠚⠀⠀⠀ │~ │
+⢸⣉⣹⣉⣹⠀⠀⠀ │~ │
+⢸⠤⢼⠤⢼⠀⠀⠀ │~ │
+⠎⠀⠸⠀⠼⠀⠀⠀ │~ │
+[usernm@cm│~ │
+phostname │~ │
+~]$ │-- INSERT -- 2,11-15 All │
-- 
cgit v1.1