Unicode jank writeup

author: Haoran S. Diao (刁浩然) <0@hairydiode.xyz> 2023-11-25 04:50:42 -0800
committer: Haoran S. Diao (刁浩然) <0@hairydiode.xyz> 2023-11-25 04:50:42 -0800
commit: 5ec19118ea597c05021b4d0ba92586886aed3bc3 (patch)
tree: 247205b30d86e833f944dbcb15aedc686d4c85fc
parent: e613befcced9bf6427bea15f9f39fa45fa900326 (diff)
1 files changed, 176 insertions, 146 deletions
diff --git a/cont/unihome.html b/cont/unihome.html
index 953f0dd..ec7e540 100644
--- a/cont/unihome.html
+++ b/cont/unihome.html
@@ -2,149 +2,179 @@
 123456789-223456789-323456789-423456789-523456789-623456789-723456789-8234567890
 一二三四-->[TITLE]                                                      [DATE]
 --------------------------------------------------------------------------------
-[SETTITLE]We Have Unicode at Home
-[SETDATE]6-30-2023
-Preface
-	it's just uses more memory, handwriting in the 70s, arabic/farsi
-	terminals, historically never existed an ascii only time. telegram codes
-busybox
-	bash
-	sed
-	awk
-	grep
-	bc
-	iconv
-	xxd
-	read
-	sort
-	uniq
-	cat
-tmux
-kbd
-	console-braille
-zpix bdf
-	30M
-zpix ttf
-	4.5MiB
-jizji
-	1.3M
-misaki
-	747K
-Google
-	2.7 MB
-LinBiolinumTI.pfb
-	860KiB
-HanaMinA
-	22M,30M
-unifont
-	11.7MiB
-Latex 2.9GiB
-cm-super
-	57.8MiB
-		just european languages + cyrillic
-cbfonts
-	70.6MiB
-ensembl human genome
-	4.5GiB
-Rant
-	Aesthetics vs. Function
-	cool-retro-term, pixel fonts, monospace of chinese vs english
-	
-	
-
-	The text confusion
-		In the beginning there was not the command line. There was wall
-		paintings bone etc
-
-		Inefficiency
-			The only first class data types on a computer are int,
-			uint, and float. Why is there not universal way to
-			display/store them on posix systems, 256 combos per byte,
-			only 9 used, less than 5% efficiency
-
-			HTML v. inefficient, easy to grep kinda 
-			Json, v. inefficient
-
-		Data confusion
-			IME
-			table takes in keypresses, spits out unicode character
-				keypresses should be own type, but is ascii,
-				what happens when different keyboard layout?
-
-				What happens if typing russiand and want to use
-				vim or press C-c?
-
-				Big table, very simple datatype, not first class
-
-				Tree/files, super simple datatype, not first
-				class, file argument woes
-
-			Display:
-				simply doing an OR required like 3 processes
-				because every program required different text
-				representation of the same data, even though
-				first class data type
-				
-				no language has first class lexer, closest is
-				awk
-
-				bdf file ridiculously inefficient, keywords too
-				long, actual data is 2x by hexadec
-				representation
-
-				bdf file is just a big table w/ 2d array as
-				output , very simple data type, have to do 1000
-				conversions for input (decimal codepoint vs
-				32bit vs utf-8), and output (2d array of bits vs
-				hex representation of the same)
-
-			Big table no way to sort to make more efficient
-
-		Representation
-			Forced to represent all out data so that the lowest
-			common denominator teletype in 1970s new jersey can
-			print it if we were to send it directly over serial
-				not just a bash issue: JSON, HTML, PDB, even
-				PDF/postscript
-			Ascii isn't event text, can't write accents or
-			directiona quotes or nn or even a bar over a letter.
-			Flipside, nobody who doesn't use posix knows or cares
-			what ~ and | are.
-
-			Regex, same basic thing, 30 different variants, because
-			forced to represent as text with no specialized symbols
-
-			same with code, every language has its own way of
-			representing a code block, none of which are
-			particularly legible
-
-			if should be one key press and one byte
-
-		In-band vs out of band
-			no universal way to embed data, json has directional
-			brackets, backslash hell is the norm, completely
-			avoidable, but the text obsession means type info is
-			ignored
-		guis
-			all based off of one dumb xerox experiment
-			all have same issues
-				lossy data display
-				no interop of actual data
-				no open loop input
-				no way to store input as its own data/scripting
-		
-			in memory data:
-				no interop, spend all your time using framework
-				libraries to convert data around. It's not just
-				a bash issue
-
-				weird selection of first class data types, why
-				is text 1st class and not a mesh or a linked
-				list?
-Rant
-	In the beginning, there was not a command line. In the beginning, there
-	was iron oxide pigment on torch lit cave walls, then there were stylus
-	indentations on clay, patterns carved on turtle shell, knots
-	tied in string, grooves cut in vinyl, and finally discrete states stored
-	in a great multitude of mechanisms. The universal datatype is not text,
-	it is uint_256, IEEE floating points.
+As we all know, the Linux console is limited to a 512-glyph font, and lives in
+kernel space. So I wrote a workaround that displays Unicode characters using
+braille characters (assuming your Linux console font has them), using only
+userland busybox.
+
+
+--------------------------=[Part 1. Braille Graphics]=--------------------------
+Braille graphics are actually really easy: the braille block goes from U+2800
+to U+28FF, with the lower 8 bits corresponding to the dots in each braille
+character in the following order:
+
+#0 3
+#1 4
+#2 5
+#6 7
+
+with 0 being the lowest bit and 7 being the highest bit. 
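The chart above can be written as a tiny lookup. This is only a sketch: dot_bit is a hypothetical helper name, not something taken from bbrll, and it assumes columns are numbered 0-1 and rows 0-3.

```shell
# dot_bit X Y: print the bit index (0-7) of the dot at column X (0 or 1)
# and row Y (0 to 3) inside one braille cell, following the chart above.
# Hypothetical helper for illustration only.
dot_bit() {
	if [ "$2" -eq 3 ]; then
		# bottom row: bits 6 (left) and 7 (right)
		echo $((6 + $1))
	else
		# top three rows: bits 0-2 on the left, 3-5 on the right
		echo $(($2 + 3 * $1))
	fi
}

dot_bit 1 1	# prints 4
dot_bit 0 3	# prints 6
```

So the dot one column over and one row down is bit 4, which is the bit the rendering code further down ends up setting.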
+
+UTF-8 encodes this codepoint with three bytes:
+
+1110xxxx 	10xxxxxx 	10xxxxxx
+
+where x represents the bits of the codepoint, therefore U+2800 converted to
+UTF-8 is 0xE2A080 (big endian) or 14852224 in decimal (I'll explain why decimal
+is relevant later).
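To check the arithmetic, here is a minimal sketch of the three-byte packing using shell arithmetic. utf8_pack3 is a made-up name, and it only handles codepoints in the three-byte range (U+0800 to U+FFFF).

```shell
# utf8_pack3 CODEPOINT: print the three UTF-8 bytes of CODEPOINT packed into
# one big-endian integer, in decimal. Hypothetical helper, three-byte range
# only (U+0800 to U+FFFF).
utf8_pack3() {
	cp=$1
	b1=$(( 0xE0 | (cp >> 12) ))		# 1110xxxx: top 4 bits
	b2=$(( 0x80 | ((cp >> 6) & 0x3F) ))	# 10xxxxxx: middle 6 bits
	b3=$(( 0x80 | (cp & 0x3F) ))		# 10xxxxxx: low 6 bits
	echo $(( (b1 << 16) | (b2 << 8) | b3 ))
}

utf8_pack3 $((0x2800))	# prints 14852224, i.e. 0xE2A080
```

Note that braille bits 0-5 stay in the low six bits of the last byte, while bits 6 and 7 land in the second byte, so in the packed value they are worth 256 and 512 rather than 64 and 128.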
+
+If you take each pixel in the buffer, shift it to the bit position given by the
+chart above (adjusted for where that bit lands in the UTF-8 encoding), and OR
+it into the encoded base codepoint, you get your desired braille character.
+
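+Rather than lean on bitwise operators, or call a separate process to convert
+between hex and decimal, the script keeps everything in decimal and uses
+addition in place of OR (each dot is added at most once, so the result is the
+same). So our code ends up looking like this: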
+
+	# bit 4 is the dot at column 1, row 1 of the current braille block
+	if [ "${rawbuff[((1+4*$2))]:((1+2*$1)):1}" == "1" ]; then
+		num=$(($num + 16))
+	fi
+
+	where $num starts off as 14852224, we have a raw pixel buffer where each
+	row is stored as a string where '1' represents a filled in pixel, and
+	the x and y position of the braille block we are currently rendering
+	are in $1 and $2.
+
+The above code takes the value of the raw pixel buffer at position (1,1)
+relative to the current braille block, shifts it by 4, then ORs it with the
+rendered braille character (adding 16 is the same as ORing in 1 shifted left by
+4, since that bit starts out clear).
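Repeating that step for all eight dots gives a whole cell. The following is a rough sketch rather than the bbrll code: braille_cell and put_utf8_3 are hypothetical names, and each row of the cell is passed as a two-character string instead of being sliced out of an array.

```shell
# braille_cell R0 R1 R2 R3: each argument is one row of the 2x4 cell as a
# two-character string, '1' meaning a filled pixel. Prints the packed UTF-8
# value in decimal. Hypothetical helper for illustration.
braille_cell() {
	num=14852224		# 0xE2A080, the empty cell U+2800
	left="1 2 4 256"	# value of bits 0, 1, 2, 6 in the packed encoding
	right="8 16 32 512"	# value of bits 3, 4, 5, 7 in the packed encoding
	for row in "$1" "$2" "$3" "$4"; do
		lw=${left%% *}; left=${left#* }
		rw=${right%% *}; right=${right#* }
		case $row in 1?) num=$((num + lw)) ;; esac
		case $row in ?1) num=$((num + rw)) ;; esac
	done
	echo "$num"
}

# put_utf8_3 N: write the three bytes packed in N, using octal escapes since
# \x escapes are not portable across printf implementations.
put_utf8_3() {
	printf "\\$(printf '%03o' $(( $1 >> 16 )))"
	printf "\\$(printf '%03o' $(( ($1 >> 8) & 255 )))"
	printf "\\$(printf '%03o' $(( $1 & 255 )))"
}

braille_cell 00 01 00 00	# prints 14852240 (base + 16)
put_utf8_3 "$(braille_cell 11 11 11 11)"; echo
```

Addition is safe here only because each weight is added at most once to a base whose dot bits are all zero.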
+
+
+I also wrote some code to take commands that draw in the raw pixel buffer as
+well. 
+
+code <a href="https://hairydiode.xyz/cgit/bbrll/tree/bbrll">here</a>
+
+----------------=[Part 2. Rendering BDF fonts with only busybox]=---------------
+BDF is a human-readable bitmap font format where each character entry looks like:
+
+STARTCHAR uni6D69
+ENCODING 28009
+SWIDTH 1000 0
+DWIDTH 8 0
+BBX 7 7 0 -1
+BITMAP
+98
+1C
+A8
+3E
+80
+9C
+9C
+ENDCHAR
+
+The first line is the character's name (here derived from the Unicode
+codepoint), followed by some info I don't care about, and then the bitmap data
+of the character, where each row is stored as a line of hex. If we convert the
+hex to binary, we get the "raw pixel format" from before, so all we really need
+to do is write a small awk script to find the relevant bitmap lines, then
+convert them to binary and display them with the previous braille display
+script.
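As a sketch of that idea (not the fontd script itself), the awk program below pulls out one glyph by its ENCODING value and expands each hex digit into four pixels. bdf_rows is a hypothetical name, and the font is read from stdin to keep the example self-contained.

```shell
# bdf_rows CODE: read a BDF font on stdin and print the bitmap rows of the
# glyph whose ENCODING is CODE, one string of 0s and 1s per row.
# Hypothetical helper; sticks to plain awk features (no strtonum).
bdf_rows() {
	awk -v code="$1" '
	BEGIN {
		split("0000 0001 0010 0011 0100 0101 0110 0111 " \
		      "1000 1001 1010 1011 1100 1101 1110 1111", nib, " ")
	}
	$1 == "ENCODING" { hit = ($2 == code) }
	$1 == "ENDCHAR"  { hit = 0; inmap = 0 }
	inmap && hit {
		row = ""
		for (i = 1; i <= length($1); i++) {
			d = index("0123456789ABCDEF", toupper(substr($1, i, 1)))
			row = row nib[d]
		}
		print row
	}
	$1 == "BITMAP" { inmap = 1 }
	'
}

# The sample glyph from above; the first row printed is 10011000.
bdf_rows 28009 <<'EOF'
STARTCHAR uni6D69
ENCODING 28009
SWIDTH 1000 0
DWIDTH 8 0
BBX 7 7 0 -1
BITMAP
98
1C
A8
3E
80
9C
9C
ENDCHAR
EOF
```

Each output row has eight columns because BDF pads rows to whole bytes; only the first 7 (the BBX width) are real pixels for this glyph.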
+
+Complete Character Display code <a href="https://hairydiode.xyz/cgit/bbrll/tree/fontd">here</a>
+
+-------------------------=[Part 3. UTF-8 Shenanigans.]=-------------------------
+One annoying thing about UTF-8 is that if you want to get the codepoint of a
+particular character in a UTF-8 string, you have to do some iconv trickery where
+you first convert it to UTF-32, then convert it to hex.
+
+Another problem is that BDF stores the codepoint as DECIMAL!!!!! You see that
+line "STARTCHAR uni6D69"? That's just the name of the character, it could
+theoretically be anything. The actual line storing the codepoint is
+"ENCODING 28009". So we have to convert from hex to decimal, which is a
+surprisingly convoluted procedure in bash.
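One possible way to chain the two conversions, sketched under a few assumptions: iconv and od are available, the argument is a single character, and the shell's arithmetic expansion accepts 0x literals (bash does). char_to_encoding is a hypothetical name, and the wrapper script may well do this differently.

```shell
# char_to_encoding CHAR: print the decimal codepoint of a single UTF-8
# character, i.e. the number that appears on a BDF ENCODING line.
# Hypothetical helper; assumes iconv, od and 0x literals in $(( )).
char_to_encoding() {
	# UTF-32BE is just the codepoint as four big-endian bytes
	hex=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 | tr -d ' \n')
	echo $((0x$hex))
}

char_to_encoding 浩	# prints 28009
```

Asking for UTF-32BE explicitly avoids the byte order mark that plain UTF-32 output can carry, so the hex dump is exactly the codepoint.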
+
+All this is done in a wrapper script that reads its input from stdin and
+displays it using all the fonts in a directory given as its argument.
+
+wrapper script code <a href="https://hairydiode.xyz/cgit/bbrll/tree/fontd">here</a>
+
+----------------------------=[Part 4. Practical Use]=---------------------------
+So remember the janky bash-based IM from last time? I modified it to use the
+braille display from before. I also wrote a little script that displays all the
+non-ASCII characters in the previously focused tmux pane, so together we can
+both display and input UTF-8 characters in the Linux console using tmux.
+
+see the <a href="https://hairydiode.xyz/cgit/bim">code</a> and <a href="https://hairydiode.xyz/jankime">writeup</a>
+
+
+"Screenshots" below:
+
+Bash running in tmux
+[usernm@cm│[usernm@cmphostname ~]$ mkdir 帖                          │乔
+phostname │[usernm@cmphostname ~]$ cd 帖                             │pdr
+~]$ ud    │[usernm@cmphostname 帖]$ vim 天干                         │⢠⠋⣏⡁⡆⡇⠀⠀⠁
+⡤⡧⡄⠀⡧⠄⠀⠀  │                                                          │⢹⠔⢅⠇⡇⡇⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀  │                                                          │⠸⠠⠊⠀⠥⠇⠀⠀⠂
+⠁⠏⠁⠧⠤⠇⠀⠀  │                                                          │⣲⡪⢰⣓⣲⠀⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀  │                                                          │⠒⣱⠘⡖⡞⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀  │                                                          │⠩⠜⠠⠃⠧⠇⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀  │                                                          │⢠⠴⠥⠤⡄⠀⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀  │                                                          │⠸⢭⠭⡭⠇⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀  │                                                          │⠤⠊⠀⠣⠤⠇⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀  │                                                          │
+⠉⠉⢹⠉⠉⠁⠀⠀  │                                                          │
+⠉⠉⡝⡍⠉⠁⠀⠀  │                                                          │
+⠤⠊⠀⠈⠢⠄⠀⠀  │                                                          │
+⠈⠉⢹⠉⠉⠀⠀⠀  │                                                          │
+⠒⠒⢺⠒⠒⠂⠀⠀  │                                                          │
+⠀⠀⠸⠀⠀⠀⠀⠀  │                                                          │
+[usernm@cm│                                                          │
+phostname │                                                          │
+~]$       │                                                          │
+          │                                                          │
+The left pane is displaying all the Unicode characters in the primary terminal
+(remember, on the Linux console they would all just be squares), and the right
+pane is the input method, which displays candidate characters in bash.
+
+Vim running in tmux
+⡇⡇⡇⡖⠓⡆⠀⠀  │甲乙丙丁                                                  │之 鐻
+⠁⠏⠁⠧⠤⠇⠀⠀  │        最常用                                            │azn
+⡤⡧⡄⠀⡧⠄⠀⠀  │~                                                         │⠤⠤⠼⠤⢤⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀  │~                                                         │⠀⠀⣀⠔⠁⠀⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀  │~                                                         │⠔⠉⠒⠤⠤⠄⠀⠀
+⡤⡧⡄⠀⡧⠄⠀⠀  │~                                                         │⣊⡂⣀⣗⣒⠀⠀⠀
+⡇⡇⡇⡖⠓⡆⠀⠀  │~                                                         │⢺⡂⣗⢗⡖⡃⠀⠀
+⠁⠏⠁⠧⠤⠇⠀⠀  │~                                                         │⠽⠴⠑⠝⠘⠄⠀⠀
+⠉⠉⢹⠉⠉⠁⠀⠀  │~                                                         │
+⠉⠉⡝⡍⠉⠁⠀⠀  │~                                                         │
+⠤⠊⠀⠈⠢⠄⠀⠀  │~                                                         │
+⠈⠉⢹⠉⠉⠀⠀⠀  │~                                                         │
+⠒⠒⢺⠒⠒⠂⠀⠀  │~                                                         │
+⠀⠀⠸⠀⠀⠀⠀⠀  │~                                                         │
+[usernm@cm│~                                                         │
+phostname │~                                                         │
+~]$ ud    │~                                                         │
+⣏⣉⣹⣉⣉⡇⠀⠀  │~                                                         │
+⠧⠤⢼⠤⠤⠇⠀⠀  │~                                                         │
+⠀⠀⠸⠀⠀⠀⠀⠀  │~                                                         │
+⠉⠉⢉⠝⠋⠀⠀⠀  │~                                                         │
+⢀⠔⠁⠀⠀⡀⠀⠀  │~                                                         │
+⠣⠤⠤⠤⠤⠃⠀⠀  │~                                                         │
+⣉⣉⣹⣉⣉⡁⠀⠀  │~                                                         │
+⡇⢀⠜⢄⠀⡇⠀⠀  │~                                                         │
+⠇⠁⠀⠀⠥⠇⠀⠀  │~                                                         │
+⠉⠉⢹⠉⠉⠁⠀⠀  │~                                                         │
+⠀⠀⢸⠀⠀⠀⠀⠀  │~                                                         │
+⠀⠠⠼⠀⠀⠀⠀⠀  │~                                                         │
+⢸⠭⠭⠭⢽⠀⠀⠀  │~                                                         │
+⢹⠭⡏⡭⠭⡅⠀⠀  │~                                                         │
+⠚⠉⠇⠬⠪⠄⠀⠀  │~                                                         │
+⡖⣓⣚⣒⡓⡆⠀⠀  │~                                                         │
+⢀⣓⣲⣒⣃⠀⠀⠀  │~                                                         │
+⠘⠀⠸⠀⠚⠀⠀⠀  │~                                                         │
+⢸⣉⣹⣉⣹⠀⠀⠀  │~                                                         │
+⢸⠤⢼⠤⢼⠀⠀⠀  │~                                                         │
+⠎⠀⠸⠀⠼⠀⠀⠀  │~                                                         │
+[usernm@cm│~                                                         │
+phostname │~                                                         │
+~]$       │-- INSERT --                            2,11-15       All │