author Haoran S. Diao (刁浩然) <0@hairydiode.xyz> 2023-11-25 04:50:42 -0800
committer Haoran S. Diao (刁浩然) <0@hairydiode.xyz> 2023-11-25 04:50:42 -0800
commit 5ec19118ea597c05021b4d0ba92586886aed3bc3
tree 247205b30d86e833f944dbcb15aedc686d4c85fc
parent e613befcced9bf6427bea15f9f39fa45fa900326
Unicode jank writeup
-rw-r--r-- cont/unihome.html 322
1 files changed, 176 insertions, 146 deletions
diff --git a/cont/unihome.html b/cont/unihome.html
index 953f0dd..ec7e540 100644
--- a/cont/unihome.html
+++ b/cont/unihome.html
@@ -2,149 +2,179 @@
123456789-223456789-323456789-423456789-523456789-623456789-723456789-8234567890
一二三四-->[TITLE] [DATE]
--------------------------------------------------------------------------------
-[SETTITLE]We Have Unicode at Home
-[SETDATE]6-30-2023
-Preface
- it's just uses more memory, handwriting in the 70s, arabic/farsi
- terminals, historically never existed an ascii only time. telegram codes
-busybox
- bash
- sed
- awk
- grep
- bc
- iconv
- xxd
- read
- sort
- uniq
- cat
-tmux
-kbd
- console-braille
-zpix bdf
- 30M
-zpix ttf
- 4.5MiB
-jizji
- 1.3M
-misaki
- 747K
-Google
- 2.7 MB
-LinBiolinumTI.pfb
- 860KiB
-HanaMinA
- 22M,30M
-unifont
- 11.7MiB
-Latex 2.9GiB
-cm-super
- 57.8MiB
- just european languages + cyrillic
-cbfonts
- 70.6MiB
-ensembl human genome
- 4.5GiB
-Rant
- Aesthetics vs. Function
- cool-retro-term, pixel fonts, monospace of chinese vs english
-
-
-
- The text confusion
- In the beginning there was not the command line. There was wall
- paintings bone etc
-
- Inefficiency
- The only first class data types on a computer are int,
- uint, and float. Why is there not universal way to
- display/store them on posix systems, 256 combos per byte,
- only 9 used, less than 5% efficiency
-
- HTML v. inefficient, easy to grep kinda
- Json, v. inefficient
-
- Data confusion
- IME
- table takes in keypresses, spits out unicode character
- keypresses should be own type, but is ascii,
- what happens when different keyboard layout?
-
- What happens if typing russiand and want to use
- vim or press C-c?
-
- Big table, very simple datatype, not first class
-
- Tree/files, super simple datatype, not first
- class, file argument woes
-
- Display:
- simply doing an OR required like 3 processes
- because every program required different text
- representation of the same data, even though
- first class data type
-
- no language has first class lexer, closest is
- awk
-
- bdf file ridiculously inefficient, keywords too
- long, actual data is 2x by hexadec
- representation
-
- bdf file is just a big table w/ 2d array as
- output , very simple data type, have to do 1000
- conversions for input (decimal codepoint vs
- 32bit vs utf-8), and output (2d array of bits vs
- hex representation of the same)
-
- Big table no way to sort to make more efficient
-
- Representation
- Forced to represent all out data so that the lowest
- common denominator teletype in 1970s new jersey can
- print it if we were to send it directly over serial
- not just a bash issue: JSON, HTML, PDB, even
- PDF/postscript
- Ascii isn't event text, can't write accents or
- directiona quotes or nn or even a bar over a letter.
- Flipside, nobody who doesn't use posix knows or cares
- what ~ and | are.
-
- Regex, same basic thing, 30 different variants, because
- forced to represent as text with no specialized symbols
-
- same with code, every language has its own way of
- representing a code block, none of which are
- particularly legible
-
- if should be one key press and one byte
-
- In-band vs out of band
- no universal way to embed data, json has directional
- brackets, backslash hell is the norm, completely
- avoidable, but the text obsession means type info is
- ignored
- guis
- all based off of one dumb xerox experiment
- all have same issues
- lossy data display
- no interop of actual data
- no open loop input
- no way to store input as its own data/scripting
-
- in memory data:
- no interop, spend all your time using framework
- libraries to convert data around. It's not just
- a bash issue
-
- weird selection of first class data types, why
- is text 1st class and not a mesh or a linked
- list?
-Rant
- In the beginning, there was not a command line. In the beginning, there
- was iron oxide pigment on torch lit cave walls, then there were stylus
- indentations on clay, patterns carved on turtle shell, knots
- tied in string, grooves cut in vinyl, and finally discrete states stored
- in a great multitude of mechanisms. The universal datatype is not text,
- it is uint_256, IEEE floating points.
+So as we all know, the Linux console lives in kernel space and is limited to
+512 glyphs. So I wrote a workaround that displays Unicode characters as braille
+graphics (assuming your Linux console font has braille characters), using only
+userland busybox.
+
+
+--------------------------=[Part 1. Braille Graphics]=--------------------------
+Braille graphics are actually really easy. The braille block goes from U+2800
+to U+28FF, with the lower 8 bits corresponding to the dots in each braille
+character in the following order:
+
+#0 3
+#1 4
+#2 5
+#6 7
+
+with 0 being the lowest bit and 7 being the highest bit.
+
+UTF-8 encodes this codepoint with three bytes:
+
+1110xxxx 10xxxxxx 10xxxxxx
+
+where x represents the bits of the codepoint, therefore U+2800 converted to
+UTF-8 is 0xE2A080 (big endian) or 14852224 in decimal (I'll explain why decimal
+is relevant later).
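The arithmetic above can be checked from the shell. This is a minimal sketch, not code from the linked bbrll script: the function name utf8_3byte_decimal is made up for illustration, and it only covers codepoints in the three-byte range (U+0800 to U+FFFF). It uses nothing but division, modulo, and addition.

```shell
# utf8_3byte_decimal CODEPOINT  (hypothetical helper, for illustration only)
# Prints the three UTF-8 bytes of a codepoint in U+0800..U+FFFF packed into
# one decimal number, using only division, modulo, and addition.
utf8_3byte_decimal() {
    cp=$1
    b1=$(( 224 + cp / 4096 ))        # 1110xxxx: top 4 bits of the codepoint
    b2=$(( 128 + (cp / 64) % 64 ))   # 10xxxxxx: middle 6 bits
    b3=$(( 128 + cp % 64 ))          # 10xxxxxx: low 6 bits
    echo $(( b1 * 65536 + b2 * 256 + b3 ))
}

utf8_3byte_decimal 10240   # U+2800 is 10240 in decimal; prints 14852224
```

Because the low six bits of the codepoint sit in the last byte and the next two bits sit at the bottom of the middle byte, dots 0 to 5 keep their weights (1 to 32) while dots 6 and 7 become 256 and 512 in the packed number.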
+
+If you take each pixel in the buffer, shift it to the bit position given in the
+chart above (adjusted for where that bit lands in the UTF-8 encoding), and OR
+it with the encoded base codepoint, you get your desired braille character.
+
+The catch is that the script keeps everything in decimal, since converting
+from hex takes a separate process, and writes each OR as an addition (safe
+here, because each bit is set at most once). So our code ends up looking like
+this:
+
+    # Pixel (1,1) of the block is dot 4, so add 2^4 = 16 if it is filled in.
+    if [ "${rawbuff[((1+4*$2))]:((1+2*$1)):1}" == "1" ];then
+        num=$(($num + 16))
+    fi
+
+    where $num starts off as 14852224, the raw pixel buffer stores each row as
+    a string in which '1' represents a filled-in pixel, and the x and y
+    position of the braille block currently being rendered are in $1 and $2.
+
+The above code takes the value of the raw pixel buffer at position (1,1)
+relative to the current braille block, shifts it by 4 (adding 16 sets bit 4),
+then ORs it with the braille character being rendered.
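For a whole cell, the same idea might look like the sketch below. It is not the bbrll code: braille_cell and add_dot are hypothetical helpers, and the 2x4 block is passed as four two-character row strings rather than read from a rawbuff array. The bytes are printed through octal escapes, which assumes a printf that accepts \NNN in its format string.

```shell
# add_dot PIXEL WEIGHT: add WEIGHT to $num when PIXEL is "1".
add_dot() {
    if [ "$1" = 1 ]; then num=$((num + $2)); fi
}

# braille_cell ROW0 ROW1 ROW2 ROW3  (hypothetical helper, not the bbrll
# interface): each ROW is two characters, '1' meaning a filled-in pixel.
braille_cell() {
    num=14852224              # UTF-8 bytes of U+2800 as one decimal number
    add_dot "${1%?}" 1        # dot 0: left column, row 0
    add_dot "${2%?}" 2        # dot 1: left column, row 1
    add_dot "${3%?}" 4        # dot 2: left column, row 2
    add_dot "${1#?}" 8        # dot 3: right column, row 0
    add_dot "${2#?}" 16       # dot 4: right column, row 1
    add_dot "${3#?}" 32       # dot 5: right column, row 2
    add_dot "${4%?}" 256      # dot 6: lands in the middle UTF-8 byte
    add_dot "${4#?}" 512      # dot 7: lands in the middle UTF-8 byte
    # Split the decimal back into three bytes and print them as octal escapes.
    for byte in $((num / 65536)) $((num / 256 % 256)) $((num % 256)); do
        printf "\\$(printf '%03o' "$byte")"
    done
}

braille_cell 00 01 00 00; echo   # only pixel (1,1) is set, so this adds 16
```

Addition stands in for OR here only because every dot is added at most once per cell.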
+
+
+I also wrote some code to take commands that draw in the raw pixel buffer as
+well.
+
+code <a href="https://hairydiode.xyz/cgit/bbrll/tree/bbrll">here</a>
+
+----------------=[Part 2. Rendering BDF fonts with only busybox]=---------------
+BDF is a human-readable bitmap font format where each character entry looks like:
+
+STARTCHAR uni6D69
+ENCODING 28009
+SWIDTH 1000 0
+DWIDTH 8 0
+BBX 7 7 0 -1
+BITMAP
+98
+1C
+A8
+3E
+80
+9C
+9C
+ENDCHAR
+
+The first line is the character's name (here it spells out the codepoint in
+hex), and ENCODING holds the Unicode codepoint in decimal. After some metrics I
+don't care about comes the bitmap data, where each row of pixels is stored as
+one line of hex. Convert the hex to binary and you get the "raw pixel format"
+from before, so all we really need is a small awk script that finds the
+relevant bitmap lines, converts them to binary, and displays them with the
+previous braille display script.
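A rough idea of what such an awk script could look like is below. It is a sketch under assumptions, not the linked fontd script: bdf_rows is a made-up name, it looks a glyph up by its decimal ENCODING value, and it ignores the BBX offsets entirely.

```shell
# bdf_rows FONT.bdf DECIMAL_CODEPOINT  (hypothetical helper)
# Prints the bitmap rows of one glyph as strings of 0s and 1s.
bdf_rows() {
    awk -v want="$2" '
        BEGIN {
            # Lookup table: bits[n+1] is hex digit n written as four bits.
            split("0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111", bits, " ")
            hex = "0123456789ABCDEF"
        }
        $1 == "ENCODING" { found = ($2 == want) }   # is this the glyph we want?
        $1 == "ENDCHAR"  { inmap = 0 }              # bitmap section is over
        inmap && found {
            row = ""
            line = toupper($1)
            for (i = 1; i <= length(line); i++)
                row = row bits[index(hex, substr(line, i, 1))]
            print row
        }
        $1 == "BITMAP" { inmap = 1 }                # rows start on the next line
    ' "$1"
}
```

Calling it as bdf_rows some-font.bdf 28009 on a font containing the entry above would print seven rows, the first being 10011000 (hex 98). The rows are padded to a whole byte, so the trailing bits beyond the glyph width are always 0.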
+
+Complete Character Display code <a href="https://hairydiode.xyz/cgit/bbrll/tree/fontd">here</a>
+
+-------------------------=[Part 3. UTF-8 Shenanigans.]=-------------------------
+One annoying thing about UTF-8 is that if you want to get the codepoint of a
+particular character in a UTF-8 string, you have to do some iconv trickery where
+you first convert it to UTF-32, then convert it to hex.
+
+Another problem is that BDF stores the codepoint as DECIMAL!!!!! You see that
+line "STARTCHAR uni6D69"? That's just the name of the character; it could
+theoretically be anything. The actual line storing the codepoint is
+"ENCODING 28009", so we have to convert from hex to decimal, which is a
+surprisingly convoluted procedure in bash.
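One possible shape for the two conversions is sketched below. It is not taken from the wrapper script: hex_to_dec and char_to_encoding are hypothetical names, od stands in for whatever hex dumper the script really uses, and it assumes an iconv that knows the UTF-32BE encoding name and a printf that accepts 0x-prefixed numbers.

```shell
# hex_to_dec HEX  (hypothetical helper): print a hex number in decimal.
hex_to_dec() {
    printf '%d\n' "0x$1"
}

# char_to_encoding CHAR  (hypothetical helper): print the decimal codepoint
# of a single UTF-8 character, in the form BDF uses on its ENCODING line.
char_to_encoding() {
    # UTF-32BE is the raw codepoint as four big-endian bytes; dump them as hex.
    hex=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 | tr -d ' \n')
    hex_to_dec "$hex"
}

hex_to_dec 6D69   # the uni6D69 glyph from Part 2; prints 28009
```

Each call still costs at least one extra process (two for the iconv route), which adds up when it runs once per character.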
+
+All this is done in a wrapper script that reads all the input from stdin and
+displays it using every font in a directory given as its argument.
+
+wrapper script code <a href="https://hairydiode.xyz/cgit/bbrll/tree/fontd">here</a>
+
+----------------------------=[Part 4. Practical Use]=---------------------------
+So remember the janky bash based IM from last time? I modified it to use the
+braille display from before. I also wrote a little script that displays all the
+non-ASCII characters in the previously focused tmux pane, so together we can
+both display and input utf-8 characters in the linux console using tmux.
+
+see the <a href="https://hairydiode.xyz/cgit/bim">code</a> and <a href="https://hairydiode.xyz/jankime">writeup</a>
+
+
+"Screenshots" below:
+
+Bash running in tmux
+[usernm@cm│[usernm@cmphostname ~]$ mkdir 帖 │乔
+phostname │[usernm@cmphostname ~]$ cd 帖 │pdr
+~]$ ud │[usernm@cmphostname 帖]$ vim 天干 │⢠⠋⣏⡁⡆⡇⠀⠀⠁
+⡤⡧⡄⠀⡧⠄⠀⠀ │ │⢹⠔⢅⠇⡇⡇⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠸⠠⠊⠀⠥⠇⠀⠀⠂
+⠁⠏⠁⠧⠤⠇⠀⠀ │ │⣲⡪⢰⣓⣲⠀⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀ │ │⠒⣱⠘⡖⡞⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠩⠜⠠⠃⠧⠇⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │ │⢠⠴⠥⠤⡄⠀⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀ │ │⠸⢭⠭⡭⠇⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠤⠊⠀⠣⠤⠇⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │ │
+⠉⠉⢹⠉⠉⠁⠀⠀ │ │
+⠉⠉⡝⡍⠉⠁⠀⠀ │ │
+⠤⠊⠀⠈⠢⠄⠀⠀ │ │
+⠈⠉⢹⠉⠉⠀⠀⠀ │ │
+⠒⠒⢺⠒⠒⠂⠀⠀ │ │
+⠀⠀⠸⠀⠀⠀⠀⠀ │ │
+[usernm@cm│ │
+phostname │ │
+~]$ │ │
+ │ │
+The left pane is displaying all the Unicode characters in the primary terminal
+(remember, on the Linux console they would all just be squares), and the right
+pane is the input method, which displays candidate characters in bash.
+
+Vim running in tmux
+⡇⡇⡇⡖⠓⡆⠀⠀ │甲乙丙丁 │之 鐻
+⠁⠏⠁⠧⠤⠇⠀⠀ │ 最常用 │azn
+⡤⡧⡄⠀⡧⠄⠀⠀ │~ │⠤⠤⠼⠤⢤⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │~ │⠀⠀⣀⠔⠁⠀⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │~ │⠔⠉⠒⠤⠤⠄⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀ │~ │⣊⡂⣀⣗⣒⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀ │~ │⢺⡂⣗⢗⡖⡃⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀ │~ │⠽⠴⠑⠝⠘⠄⠀⠀
+⠉⠉⢹⠉⠉⠁⠀⠀ │~ │
+⠉⠉⡝⡍⠉⠁⠀⠀ │~ │
+⠤⠊⠀⠈⠢⠄⠀⠀ │~ │
+⠈⠉⢹⠉⠉⠀⠀⠀ │~ │
+⠒⠒⢺⠒⠒⠂⠀⠀ │~ │
+⠀⠀⠸⠀⠀⠀⠀⠀ │~ │
+[usernm@cm│~ │
+phostname │~ │
+~]$ ud │~ │
+⣏⣉⣹⣉⣉⡇⠀⠀ │~ │
+⠧⠤⢼⠤⠤⠇⠀⠀ │~ │
+⠀⠀⠸⠀⠀⠀⠀⠀ │~ │
+⠉⠉⢉⠝⠋⠀⠀⠀ │~ │
+⢀⠔⠁⠀⠀⡀⠀⠀ │~ │
+⠣⠤⠤⠤⠤⠃⠀⠀ │~ │
+⣉⣉⣹⣉⣉⡁⠀⠀ │~ │
+⡇⢀⠜⢄⠀⡇⠀⠀ │~ │
+⠇⠁⠀⠀⠥⠇⠀⠀ │~ │
+⠉⠉⢹⠉⠉⠁⠀⠀ │~ │
+⠀⠀⢸⠀⠀⠀⠀⠀ │~ │
+⠀⠠⠼⠀⠀⠀⠀⠀ │~ │
+⢸⠭⠭⠭⢽⠀⠀⠀ │~ │
+⢹⠭⡏⡭⠭⡅⠀⠀ │~ │
+⠚⠉⠇⠬⠪⠄⠀⠀ │~ │
+⡖⣓⣚⣒⡓⡆⠀⠀ │~ │
+⢀⣓⣲⣒⣃⠀⠀⠀ │~ │
+⠘⠀⠸⠀⠚⠀⠀⠀ │~ │
+⢸⣉⣹⣉⣹⠀⠀⠀ │~ │
+⢸⠤⢼⠤⢼⠀⠀⠀ │~ │
+⠎⠀⠸⠀⠼⠀⠀⠀ │~ │
+[usernm@cm│~ │
+phostname │~ │
+~]$ │-- INSERT -- 2,11-15 All │