From 15a09725bac29ac484d3ee26242f4b1e68072fee Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Haoran=20S=2E=20Diao=20=28=E5=88=81=E6=B5=A9=E7=84=B6=29?= <0@hairydiode.xyz> Date: Sat, 25 Nov 2023 04:53:08 -0800 Subject: Unihome finally published --- index.html | 2 +- jankime.html | 8 +- unihome.html | 324 ++++++++++++++++++++++++++++++++--------------------------- 3 files changed, 183 insertions(+), 151 deletions(-) diff --git a/index.html b/index.html index c524128..3c7e5e7 100644 --- a/index.html +++ b/index.html @@ -43,13 +43,13 @@ Where's all the other stuff you host from this domain? My Mastodon Instance Where's all the content? Scroll Down +[] [Janky IME] 6-29-2023 [PGP Public Key] 6-26-2018 [Control Systems Club Project Workflow] 1-24-2019 [Matrix Homeserver] 3-17-2019 [Moving This Site] 6-26-2018 [Making This Site] 11-13-2017 -[MIT Decisions Countdown Clock] 3-09-2019 [Omnicom Writeup] 1-12-2018 [Control Systems Club] 2-21-2018 [Control Systems Club Web Controlled Servo Instructions] 1-24-2019 diff --git a/jankime.html b/jankime.html index 6d1454a..75656c7 100644 --- a/jankime.html +++ b/jankime.html @@ -20,7 +20,7 @@ 123456789-223456789-323456789-423456789-523456789-623456789-723456789-8234567890 一二三四-->Janky IME 6-29-2023 -------------------------------------------------------------------------------- -UPDATE: This IME is now tmux based, old xdotool version is still here +UPDATE: This IME is now tmux based, old xdotool version is still here UPDATE2: I have created the most cursed thing in existance. 
Full unicode display and input support in the linux console using only userland bash/busybox and @@ -67,7 +67,7 @@ The Implementation: Input is read with read in a loop - CODE: + CODE: OIFS=$IFS export IFS=""; read -rsn1 i IFS=$OIFS @@ -82,7 +82,7 @@ The Implementation: I then simply run grep ^$code\s, rearrange the columns with awk and sort, then take out the ranking column - CODE: + CODE: opt=$(grep "^$code\s" ~/lang/zh/boshiamy/ibus-boshiamy/boshiamy.txt |\ #remove simplfied grep -v 98|\ @@ -95,7 +95,7 @@ The Implementation: conversion of the input characters from line seperated to space seperated was done for free. However this makes the code less portable - CODE: + CODE: char=$(echo $opt | awk "{print \$1}") ... tmux send-key -t "!" "$char" diff --git a/unihome.html b/unihome.html index c9744b3..e0a7b88 100644 --- a/unihome.html +++ b/unihome.html @@ -1,6 +1,6 @@ -We Have Unicode at Home + @@ -18,152 +18,184 @@ -------------------------------------------------------------------------------- We Have Unicode at Home 6-30-2023 +一二三四--> -------------------------------------------------------------------------------- -Preface - it's just uses more memory, handwriting in the 70s, arabic/farsi - terminals, historically never existed an ascii only time. telegram codes -busybox - bash - sed - awk - grep - bc - iconv - xxd - read - sort - uniq - cat -tmux -kbd - console-braille -zpix bdf - 30M -zpix ttf - 4.5MiB -jizji - 1.3M -misaki - 747K -Google - 2.7 MB -LinBiolinumTI.pfb - 860KiB -HanaMinA - 22M,30M -unifont - 11.7MiB -Latex 2.9GiB -cm-super - 57.8MiB - just european languages + cyrillic -cbfonts - 70.6MiB -ensembl human genome - 4.5GiB -Rant - Aesthetics vs. Function - cool-retro-term, pixel fonts, monospace of chinese vs english - - - - The text confusion - In the beginning there was not the command line. There was wall - paintings bone etc - - Inefficiency - The only first class data types on a computer are int, - uint, and float. 
Why is there not universal way to - display/store them on posix systems, 256 combos per byte, - only 9 used, less than 5% efficiency - - HTML v. inefficient, easy to grep kinda - Json, v. inefficient - - Data confusion - IME - table takes in keypresses, spits out unicode character - keypresses should be own type, but is ascii, - what happens when different keyboard layout? - - What happens if typing russiand and want to use - vim or press C-c? - - Big table, very simple datatype, not first class - - Tree/files, super simple datatype, not first - class, file argument woes - - Display: - simply doing an OR required like 3 processes - because every program required different text - representation of the same data, even though - first class data type - - no language has first class lexer, closest is - awk - - bdf file ridiculously inefficient, keywords too - long, actual data is 2x by hexadec - representation - - bdf file is just a big table w/ 2d array as - output , very simple data type, have to do 1000 - conversions for input (decimal codepoint vs - 32bit vs utf-8), and output (2d array of bits vs - hex representation of the same) - - Big table no way to sort to make more efficient - - Representation - Forced to represent all out data so that the lowest - common denominator teletype in 1970s new jersey can - print it if we were to send it directly over serial - not just a bash issue: JSON, HTML, PDB, even - PDF/postscript - Ascii isn't event text, can't write accents or - directiona quotes or nn or even a bar over a letter. - Flipside, nobody who doesn't use posix knows or cares - what ~ and | are. 
- - Regex, same basic thing, 30 different variants, because - forced to represent as text with no specialized symbols - - same with code, every language has its own way of - representing a code block, none of which are - particularly legible - - if should be one key press and one byte - - In-band vs out of band - no universal way to embed data, json has directional - brackets, backslash hell is the norm, completely - avoidable, but the text obsession means type info is - ignored - guis - all based off of one dumb xerox experiment - all have same issues - lossy data display - no interop of actual data - no open loop input - no way to store input as its own data/scripting - - in memory data: - no interop, spend all your time using framework - libraries to convert data around. It's not just - a bash issue - - weird selection of first class data types, why - is text 1st class and not a mesh or a linked - list? -Rant - In the beginning, there was not a command line. In the beginning, there - was iron oxide pigment on torch lit cave walls, then there were stylus - indentations on clay, patterns carved on turtle shell, knots - tied in string, grooves cut in vinyl, and finally discrete states stored - in a great multitude of mechanisms. The universal datatype is not text, - it is uint_256, IEEE floating points. +So as we all know, the Linux console is limited to 512 characters, and lives in +kernel space. So I wrote a workaround that displays unicode characters using +braille (assuming your linux console font has braille characters) characters +using only userland busybox. + + +--------------------------=[Part I. Braille Graphics]=-------------------------- +Braille graphics are actually really easy, the braille block goes from U+2800 +to U+28FF, with the lower 8 bits corresponding to the dots in each braille +character in the following order: + +#0 3 +#1 4 +#2 5 +#6 7 + +with 0 being the lowest bit and 7 being the highest bit. 
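The dot chart above can be sketched in plain POSIX shell. This is a minimal sketch, not the code from this page: the function name `braille_codepoint` and the four-row argument format are assumptions made for illustration. It works on the codepoint itself and uses only addition, with U+2800 written as 10240 in decimal.

```shell
#!/bin/sh
# Hypothetical helper: turn one 2x4 cell of pixels into a braille codepoint.
# Expects exactly four arguments, the rows of the cell from top to bottom,
# each two characters long, where '1' is a filled in pixel.
braille_codepoint() {
    cp=10240    # U+2800, the empty braille pattern, in decimal
    row=0
    for pixels in "$@"; do
        # Value of the left and right dot in this row, taken from the chart:
        # bits 0,1,2,6 in the left column and bits 3,4,5,7 in the right one.
        case $row in
            0) left=1  right=8 ;;
            1) left=2  right=16 ;;
            2) left=4  right=32 ;;
            3) left=64 right=128 ;;
        esac
        case $pixels in 1?) cp=$((cp + left)) ;; esac
        case $pixels in ?1) cp=$((cp + right)) ;; esac
        row=$((row + 1))
    done
    printf 'U+%04X\n' "$cp"
}

# Left column filled in, plus the bottom right dot.
braille_codepoint 10 10 10 11
```

With the left column and the bottom right dot set, this prints U+28C7. Each dot is added at most once, so plain addition gives the same result as an OR.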
+
+UTF-8 encodes this codepoint with three bytes
+
+1110xxxx 10xxxxxx 10xxxxxx
+
+where x represents the bits of the codepoint, therefore U+2800 converted to
+UTF-8 is 0xE2A080 (big endian) or 14852224 in decimal (I'll explain why decimal
+is relevant later).
+
+If you take the pixel buffer, shift each pixel according to the above chart
+(adjusted for where the UTF-8 encoding puts each bit), and OR the result with
+the base codepoint, you get your desired braille character.
+
+The script does this without bitwise operators and keeps every value in
+decimal, so that nothing has to go through a separate process for conversion
+from hex to decimal. So our code ends up looking like this:
+
+    if [ "${rawbuff[((1+4*$2))]:((1+2*$1)):1}" == "1" ];then
+        num=$(($num + 16))
+    fi
+
+    where $num starts off as 14852224, we have a raw pixel buffer where each
+    row is stored as a string where '1' represents a filled in pixel, and
+    the x and y position of the braille block we are currently rendering
+    are in $1 and $2.
+
+The above code takes the value of the raw pixel buffer at position (1,1)
+relative to the current braille block, shifts it by 4, then ORs it with the
+rendered braille character. Adding 16 sets the same bit as an OR would, as long
+as each pixel is only added once.
+
+
+I also wrote some code to take commands that draw in the raw pixel buffer as
+well.
+
+code here
+
+----------------=[Part 2. Rendering BDF fonts with only busybox]=---------------
+BDF is a human legible bitmap font format where each character entry looks like:
+
+STARTCHAR uni6D69
+ENCODING 28009
+SWIDTH 1000 0
+DWIDTH 8 0
+BBX 7 7 0 -1
+BITMAP
+98
+1C
+A8
+3E
+80
+9C
+9C
+ENDCHAR
+
+The first line names the character by its unicode codepoint, followed by some
+info I don't care about, and the bitmap data of the character where each row is
+stored as a line converted to hex. You can tell that if we convert the hex to
+binary, it will be the "raw pixel format" from before. So all we really need to
+do is write a small awk script to find the relevant bitmap lines, then convert
+them to binary and display them with the previous braille display script.
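The awk step described above could look roughly like this. It is a sketch under assumptions, not the script linked from this page: the function name `bdf_rows` and its arguments are made up for illustration, and the lookup is done on the decimal ENCODING value. It avoids gawk-only functions so that a busybox awk should be able to run it.

```shell
#!/bin/sh
# Hypothetical helper: print the bitmap rows of one glyph as '0'/'1' strings.
# $1 is the decimal codepoint (the ENCODING value), $2 is the BDF file,
# or - to read the font from stdin.
bdf_rows() {
    awk -v want="$1" '
        BEGIN {
            # Lookup table from one hex digit to four binary digits.
            split("0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111", nib, " ")
            hex = "0123456789ABCDEF"
        }
        # Track whether the current character entry is the wanted one.
        $1 == "ENCODING" { found = ($2 == want) }
        $1 == "ENDCHAR"  { found = 0; inmap = 0 }
        # Inside the wanted BITMAP section: expand each hex digit of the row.
        found && inmap {
            row = ""
            line = toupper($1)
            for (i = 1; i <= length(line); i++)
                row = row nib[index(hex, substr(line, i, 1))]
            print row
        }
        # Rows start on the line after BITMAP.
        found && $1 == "BITMAP" { inmap = 1 }
    ' "$2"
}

# Demo with the character entry shown above.
bdf_rows 28009 - <<'EOF'
STARTCHAR uni6D69
ENCODING 28009
SWIDTH 1000 0
DWIDTH 8 0
BBX 7 7 0 -1
BITMAP
98
1C
A8
3E
80
9C
9C
ENDCHAR
EOF
```

The first row printed for this entry is 10011000, which is 98 in hex. Each output line has the same shape as one row of the raw pixel buffer, a string of '0' and '1' characters.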
+
+Complete Character Display code here
+
+-------------------------=[Part 3. UTF-8 Shenanigans.]=-------------------------
+One annoying thing about UTF-8 is that if you want to get the codepoint of a
+particular character in a UTF-8 string, you have to do some iconv trickery where
+you first convert it to UTF-32, then convert it to hex.
+
+Another problem is that BDF stores the codepoint as DECIMAL!!!!! You see that
+line "STARTCHAR uni6D69"? That's just the name of the character, it could
+theoretically be anything. The actual line storing the codepoint is
+"ENCODING 28009". So we have to convert from hex to decimal, which is a
+surprisingly convoluted procedure in bash.
+
+All this is done in a wrapper script that reads all the input from stdin and
+displays it using all the fonts in a directory given as its argument.
+
+wrapper script code here
+
+----------------------------=[Part 4. Practical Use]=---------------------------
+So remember the janky bash-based IME from last time? I modified it to use the
+braille display from before. I also wrote a little script that displays all the
+non-ASCII characters in the previously focused tmux pane, so together we can
+both display and input UTF-8 characters in the linux console using tmux.
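The codepoint lookup from Part 3 (UTF-8 to UTF-32, then hex, then the decimal number BDF wants) can be sketched as follows. This is an illustration under assumptions rather than the wrapper script itself: the function name `codepoint_of` is made up, it uses `od` where the original may use another hex dumper, and it assumes an iconv that knows UTF-32BE and a non-empty argument.

```shell
#!/bin/sh
# Hypothetical helper: print the decimal codepoint of the first character of $1.
codepoint_of() {
    # UTF-32BE stores each character as four bytes holding the codepoint,
    # so the first eight hex digits are the codepoint of the first character.
    hex=$(printf '%s' "$1" | iconv -f UTF-8 -t UTF-32BE | od -An -tx1 |
          tr -d ' \n' | cut -c1-8)
    # Arithmetic expansion reads the 0x prefix as hex and prints decimal,
    # which is the form used on the ENCODING line of a BDF file.
    echo "$((0x$hex))"
}

# The three bytes E6 B5 A9 are the UTF-8 encoding of U+6D69.
codepoint_of "$(printf '\346\265\251')"
```

For U+6D69 this prints 28009, the same number as on the ENCODING line of the BDF entry shown in Part 2.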
+ +see the code and writeup + + +"Screenshots" below: + +Bash running in tmux +[usernm@cm│[usernm@cmphostname ~]$ mkdir 帖 │乔 +phostname │[usernm@cmphostname ~]$ cd 帖 │pdr +~]$ ud │[usernm@cmphostname 帖]$ vim 天干 │⢠⠋⣏⡁⡆⡇⠀⠀⠁ +⡤⡧⡄⠀⡧⠄⠀⠀ │ │⢹⠔⢅⠇⡇⡇⠀⠀⠀ +⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠸⠠⠊⠀⠥⠇⠀⠀⠂ +⠁⠏⠁⠧⠤⠇⠀⠀ │ │⣲⡪⢰⣓⣲⠀⠀⠀ +⡤⡧⡄⠀⡧⠄⠀⠀ │ │⠒⣱⠘⡖⡞⠀⠀⠀ +⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠩⠜⠠⠃⠧⠇⠀⠀ +⠁⠏⠁⠧⠤⠇⠀⠀ │ │⢠⠴⠥⠤⡄⠀⠀⠀ +⡤⡧⡄⠀⡧⠄⠀⠀ │ │⠸⢭⠭⡭⠇⠀⠀⠀ +⡇⡇⡇⡖⠓⡆⠀⠀ │ │⠤⠊⠀⠣⠤⠇⠀⠀ +⠁⠏⠁⠧⠤⠇⠀⠀ │ │ +⠉⠉⢹⠉⠉⠁⠀⠀ │ │ +⠉⠉⡝⡍⠉⠁⠀⠀ │ │ +⠤⠊⠀⠈⠢⠄⠀⠀ │ │ +⠈⠉⢹⠉⠉⠀⠀⠀ │ │ +⠒⠒⢺⠒⠒⠂⠀⠀ │ │ +⠀⠀⠸⠀⠀⠀⠀⠀ │ │ +[usernm@cm│ │ +phostname │ │ +~]$ │ │ + │ │ +Leftpane is displaying all the unicode characters in the primary terminal +(remember, on the linux console they would all just be squares), and right pane +is the input method, which displays candidate characters in bash. + +Vim running in tmux +⡇⡇⡇⡖⠓⡆⠀⠀ │甲乙丙丁 │之 鐻 +⠁⠏⠁⠧⠤⠇⠀⠀ │ 最常用 │azn +⡤⡧⡄⠀⡧⠄⠀⠀ │~ │⠤⠤⠼⠤⢤⠀⠀⠀ +⡇⡇⡇⡖⠓⡆⠀⠀ │~ │⠀⠀⣀⠔⠁⠀⠀⠀ +⠁⠏⠁⠧⠤⠇⠀⠀ │~ │⠔⠉⠒⠤⠤⠄⠀⠀ +⡤⡧⡄⠀⡧⠄⠀⠀ │~ │⣊⡂⣀⣗⣒⠀⠀⠀ +⡇⡇⡇⡖⠓⡆⠀⠀ │~ │⢺⡂⣗⢗⡖⡃⠀⠀ +⠁⠏⠁⠧⠤⠇⠀⠀ │~ │⠽⠴⠑⠝⠘⠄⠀⠀ +⠉⠉⢹⠉⠉⠁⠀⠀ │~ │ +⠉⠉⡝⡍⠉⠁⠀⠀ │~ │ +⠤⠊⠀⠈⠢⠄⠀⠀ │~ │ +⠈⠉⢹⠉⠉⠀⠀⠀ │~ │ +⠒⠒⢺⠒⠒⠂⠀⠀ │~ │ +⠀⠀⠸⠀⠀⠀⠀⠀ │~ │ +[usernm@cm│~ │ +phostname │~ │ +~]$ ud │~ │ +⣏⣉⣹⣉⣉⡇⠀⠀ │~ │ +⠧⠤⢼⠤⠤⠇⠀⠀ │~ │ +⠀⠀⠸⠀⠀⠀⠀⠀ │~ │ +⠉⠉⢉⠝⠋⠀⠀⠀ │~ │ +⢀⠔⠁⠀⠀⡀⠀⠀ │~ │ +⠣⠤⠤⠤⠤⠃⠀⠀ │~ │ +⣉⣉⣹⣉⣉⡁⠀⠀ │~ │ +⡇⢀⠜⢄⠀⡇⠀⠀ │~ │ +⠇⠁⠀⠀⠥⠇⠀⠀ │~ │ +⠉⠉⢹⠉⠉⠁⠀⠀ │~ │ +⠀⠀⢸⠀⠀⠀⠀⠀ │~ │ +⠀⠠⠼⠀⠀⠀⠀⠀ │~ │ +⢸⠭⠭⠭⢽⠀⠀⠀ │~ │ +⢹⠭⡏⡭⠭⡅⠀⠀ │~ │ +⠚⠉⠇⠬⠪⠄⠀⠀ │~ │ +⡖⣓⣚⣒⡓⡆⠀⠀ │~ │ +⢀⣓⣲⣒⣃⠀⠀⠀ │~ │ +⠘⠀⠸⠀⠚⠀⠀⠀ │~ │ +⢸⣉⣹⣉⣹⠀⠀⠀ │~ │ +⢸⠤⢼⠤⢼⠀⠀⠀ │~ │ +⠎⠀⠸⠀⠼⠀⠀⠀ │~ │ +[usernm@cm│~ │ +phostname │~ │ +~]$ │-- INSERT -- 2,11-15 All │
-- cgit v1.1