zakuarbor

The Issue With Default in Switch Statements with Enums

2025-07-05T00:00:00-04:00

Reading the coding standards at a company I recently joined showed me the issue with the default label in a switch statement and why it’s prohibited when the switch is used to enumerate through an enum. A default label is a convenient way to handle edge cases and is often used to handle errors. However, when working with enums, the programmer usually intends to handle every possible value in the enum, and a default label silently hides any value that was forgotten. To catch this mishap, programmers enable -Wswitch or -Werror=switch in their compiler. For instance, let’s suppose I have an enum named Suit to represent the different suits in a deck of cards.

enum Suit {
  Diamonds,
  Hearts,
  Clubs,
  Spades
};

Let’s suppose I forget to enumerate through Spades:

switch(suit) {
  case Diamonds:
    printf("Diamonds\n");
    break;
  case Hearts:
    printf("Hearts\n");
    break;
  case Clubs:
    printf("Clubs\n");
    break;
}

Then I’ll get the following warning:

$ LC_MESSAGES=C gcc -Wswitch /tmp/test.c
/tmp/test.c: In function ‘main’:
/tmp/test.c:12:3: warning: enumeration value ‘Spades’ not handled in switch [-Wswitch]
   12 |   switch(suit) {
      |   ^~~~~~

Note: LC_MESSAGES=C just instructs GCC to print its messages in the default C locale (English), since my system is in French.

Based on GCC Documentation on Warning Options:

-Wswitch
  Warn whenever a switch statement has an index of enumerated type and lacks a 
  case for one or more of the named codes of that enumeration. 
  (The presence of a default label prevents this warning.) 
  case labels outside the enumeration range also provoke warnings when this 
  option is used. This warning is enabled by -Wall. 

Based on the documentation, we should no longer see the warning if we add a default label:

switch(suit) {
  case Diamonds:
    printf("Diamonds\n");
    break;
  case Hearts:
    printf("Hearts\n");
    break;
  case Clubs:
    printf("Clubs\n");
    break;
  default:
    /* before C23, a label must be followed by a statement */
    break;
}

And as expected, we see no warnings:

$ LC_MESSAGES=C gcc  -Wswitch /tmp/test.c
$

However, I noticed a similar warning option in the documentation which catches this mistake even with the default label:

-Wswitch-enum
    Warn whenever a switch statement has an index of enumerated type and lacks 
    a case for one or more of the named codes of that enumeration. case labels 
    outside the enumeration range also provoke warnings when this option is used. 

So regardless of whether we have a default label or not, the warning appears:

$ LC_MESSAGES=C gcc  -Wswitch-enum /tmp/test.c
/tmp/test.c: In function ‘main’:
/tmp/test.c:12:3: warning: enumeration value ‘Spades’ not handled in switch [-Wswitch-enum]
   12 |   switch(suit) {
      |   ^~~~~~

On a side note, -Wall will not catch this mistake if a default is present:

$ LC_MESSAGES=C gcc  -Wall /tmp/test.c
$ 

This is because -Wall enables most warnings, but not all of them. Based on the documentation, -Wall enables -Wswitch, not -Wswitch-enum.
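One way to keep the compiler check while still guarding against invalid values is to leave default out of the switch and handle unexpected values after it. The following is only a minimal sketch: suit_name is a hypothetical helper name, and the "Unknown" fallback is an assumption for illustration rather than anything the coding standard prescribes.

```c
enum Suit {
  Diamonds,
  Hearts,
  Clubs,
  Spades
};

/* Hypothetical helper: returns a printable name for a suit.
 * There is deliberately no default label, so -Wswitch (enabled by -Wall)
 * can still warn when an enumerator is added but not handled here. */
const char *suit_name(enum Suit suit) {
  switch (suit) {
    case Diamonds: return "Diamonds";
    case Hearts:   return "Hearts";
    case Clubs:    return "Clubs";
    case Spades:   return "Spades";
  }
  /* Only reached when suit holds a value outside the enum. */
  return "Unknown";
}
```

Because every case returns, anything that falls out of the switch has to be a value outside the enum, so the error handling that default usually provides is not lost.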

MicroBlog 2024 Edition

2025-02-23T00:00:00-05:00

In 2023, I was fascinated to learn about a revival of the old internet, where chaos and nostalgia ensue, on neocities, which is an attempt to recreate the community that geocities once provided. However, I never progressed much beyond creating an initial introduction page. Almost a year later, in August 2024, I decided to take advantage of the free time I recently gained from taking a break from school to start posting shorter content, as a way to write something quick and potentially more personal, since this blog site has become more of a technical blog than a personal one. That doesn’t mean I won’t post the odd blog post that isn’t technical, nor do I guarantee any professionalism (it is the internet, and a blog site that I maintain without any sponsorship or monetary gain). Any chaos or out-of-context content I may post in the future will be contained in my neocities site while it lasts, which should spare this blog site from any oddities for the time being.

While my microblog isn’t exactly short, it is definitely shorter than my typical technical blog posts. You will probably see some parallels between my microblog and the posts I publish here. This is because some of the technical micro posts are quick scratch notes that give an overview of a topic I wish to write about on this blog site.

You can visit my microblog if you are interested: Random Bits

Complete List

[2024-12-29] New Laptop: Framework 16
[2024-12-29] Utilizing Aliases and Interactive Mode to Force Users to Think Twice Before Deleting Files
[2024-12-20] Stack Overflow: The Case of a Small Stack
[2024-12-17] Jekyll Cache Saving the Day
[2024-11-09] QNX is 'Free' to Use
[2024-10-08] [Preview] Manually Verifying an Email Signature
[2024-10-06] [Preview] Half-Width and Full-Width Characters
[2024-09-18] Mixing Number and String
[2024-08-30] `.` At The End of a URL
[2024-08-28] Splitting Pdfs into Even and Odd Pages
[2024-08-28] Executing Script Loophole
[2024-08-24] Replacing main()
[2024-08-18] Editing GIFS and Creating 88x31 Buttons
[2024-08-10] multiple definition of `variable` ... first defined here
[2024-08-04] Delusional Dream of a OpenPower Framework Laptop
[2024-08-04] 2024 Update

New Laptop: Framework 16

December 29, 2024

micro

Ever since Linus Tech Tips (LTT) introduced Framework, a repairable and modular laptop, back in 2021, I have wanted one for myself. I have loved the idea of modular electronics ever since PhoneBloks introduced their idea of modular phones. Modular electronics are usually highly repairable because one can easily swap a faulty component for a new one instead of going to a repair shop or dumping the device into the garbage. Bringing the desktop experience of upgrading parts such as the CPU, RAM and storage to the laptop was very appealing. Electronics of the past were much easier to repair and upgrade, but these days laptops are designed in ways that make them hard to upgrade, such as using soldered RAM. Laptops are also not as repairable as they once were because more components are integrated into the SoC, which allows manufacturers to design a significantly more compact and sleeker device. An SoC has more benefits than just compactness: it can also help with power efficiency and speed, as it can be optimized for fast access between the CPU and memory. While there could be engineering reasons for soldering RAM, it likely also encourages consumers to purchase a new laptop instead.

A Framework laptop and its various parts. Source: Framework

The Framework laptop is great, but every criticism you have heard about it holds true. Cost is the biggest issue. As Framework is a small company, it cannot build at scale like the other OEMs, so you will be paying an extremely hefty price to obtain a modular laptop. You could get a laptop with better specs from other OEMs for way less than what Framework charges. The laptop is not suitable for regular consumers and is way more expensive than a luxury laptop (aka MacBooks). There are other issues with the Framework laptop, but I consider these to be the cost of modularity rather than the cost of Framework. As I mentioned earlier, there are tradeoffs between modularity and integrating everything into an SoC. When you get a Framework laptop, you are buying it for its modularity and repairability. For instance, on a Framework 16 you can see the outlines of the various sliders around the keyboard and touchpad. In addition, you can clearly see the outlines of each expansion card on the laptop.

On a very positive note, you can swap the expansion cards to fit your needs, and for those who care about colors, you can easily swap the colors of the screen bezel and the panels surrounding the keyboard. You can also change what sits around the keyboard, such as adding a numpad, swapping the keyboard for an RGB keyboard, or getting an LED matrix panel. The flexibility to change the expansion cards was the biggest appeal of the laptop for me, as you get to choose which IO ports will be HDMI, USB-A, or USB-C (with some restrictions).

I should keep this brief as this is a microblog … Anyhow, now that I have access to my first dedicated GPU, I can play video games that aren’t Minesweeper, Solitaire, Starcraft (Broodwar) and PC ports of old games like Final Fantasy 7. Ever since players were forced to move to Counter-Strike 2, I had not been able to play Counter-Strike on my old Lenovo Gen 7 X1 Carbon laptop. I was surprised by how noisy the laptop can be when playing Counter-Strike 2, though that is likely due to my inexperience with video games that require a dedicated GPU (and I am playing on a laptop, which is probably not the best idea if you want to play video games). Here are the specs:

$ neofetch
             .',;::::;,'.                zaku@fedora 
         .';:cccccccccccc:;,.            ----------- 
      .;cccccccccccccccccccccc;.         OS: Fedora Linux 40 (Workstation Edition) x86_64 
    .:cccccccccccccccccccccccccc:.       Host: Laptop 16 (AMD Ryzen 7040 Series) AJ 
  .;ccccccccccccc;.:dddl:.;ccccccc;.     Kernel: 6.11.4-201.fc40.x86_64 
 .:ccccccccccccc;OWMKOOXMWd;ccccccc:.    Uptime: 5 hours, 46 mins 
.:ccccccccccccc;KMMc;cc;xMMc:ccccccc:.   Packages: 2254 (rpm), 12 (flatpak) 
,cccccccccccccc;MMM.;cc;;WW::cccccccc,   Shell: bash 5.2.26 
:cccccccccccccc;MMM.;cccccccccccccccc:   Resolution: 1920x1080 
:ccccccc;oxOOOo;MMM0OOk.;cccccccccccc:   DE: GNOME 46.6 
cccccc:0MMKxdd:;MMMkddc.;cccccccccccc;   WM: Mutter 
ccccc:XM0';cccc;MMM.;cccccccccccccccc'   WM Theme: Adwaita 
ccccc;MMo;ccccc;MMW.;ccccccccccccccc;    Theme: Adwaita [GTK2/3] 
ccccc;0MNc.ccc.xMMd:ccccccccccccccc;     Icons: Adwaita [GTK2/3] 
cccccc;dNMWXXXWM0::cccccccccccccc:,      Terminal: gnome-terminal 
cccccccc;.:odl:.;cccccccccccccc:,.       CPU: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics (16) @ 5.263GHz 
:cccccccccccccccccccccccccccc:'.         GPU: AMD ATI c4:00.0 Phoenix1 
.:cccccccccccccccccccccc:;,..            GPU: AMD ATI Radeon RX 7600/7600 XT/7600M XT/7600S/7700S / PRO W7600 
  '::cccccccccccccc::;,.                 Memory: 7192MiB / 31386MiB

On the Blender Open Data benchmark:

monster: 130.805407
junkshop: 85.742239
classroom: 64.374681

These scores are significantly better than what my X1 Carbon achieved (higher numbers are better).

Utilizing Aliases and Interactive Mode to Force Users to Think Twice Before Deleting Files

December 29, 2024

micro linux

I previously mentioned that I lost a file by accidentally overwriting it with the cp command. This got me thinking about why this would be impossible on my work laptop: there, I am constantly bombarded with a prompt asking me to confirm that I intend to overwrite the file.

$ cp 2024-12-01-template.md 2024-12-30-alias-interactive.md
cp: overwrite '2024-12-30-alias-interactive.md'?

Commands like mv and cp have an interactive flag, -i, that prompts before overwriting a file, as seen in man 1 cp:

-i, --interactive
              prompt before overwrite (overrides a previous -n option)

To force everyone at work to have this flag enabled, they made cp and mv an alias in our default shell configs:

alias cp="cp -i"
alias mv="mv -i"

You can verify this with the type command:

$ type cp
cp is aliased to `cp -i'
$ type mv
mv is aliased to `mv -i'

Stack Overflow: The Case of a Small Stack

December 20, 2024

micro stack qnx C/C++

Years ago, an intern asked me to debug a mysterious crash in code that seemed entirely innocent. While I no longer recall what the code was about, we stripped the program down to a single line in main, yet it still crashed.

Source:

int main() {
    /* 1024 * 1024 * 1024 bytes = 1 GiB on the stack */
    char buf[1024*1024*1024];
}

Result:

# ./prog-arm64 

Process 630803 (prog-arm64) terminated SIGSEGV code=1 fltno=11 ip=00000025333267f0 mapaddr=00000000000007f0 ref=000000443dd5dc50
Memory fault (core dumped) 

This bewildered all of the interns as it made absolutely no sense. Through our investigation, there were two things we noticed:

The program worked on our local machines but not on our target virtual machine
We were allocating an extremely large buffer in the stack which was unusual

It turns out the intern wanted to allocate a 1 MiB buffer for some networking or driver related ticket, but the code above actually requests 1 GiB. If I recall correctly, our target only had 512 MB of RAM, so this could have explained the mysterious crash. But even a 1 MiB buffer on the stack was too large for our target:

Source:

int main() {
    /* 1024 * 1024 bytes = 1 MiB on the stack */
    char buf[1024*1024];
}

Result:

# ./prog-arm64 

Process 696339 (prog-arm64) terminated SIGSEGV code=1 fltno=11 ip=0000004de7e7a7ec mapaddr=00000000000007ec ref=000000383b19fbe0
Memory fault (core dumped) 

One thing I purposely omitted was that our target was running QNX, a realtime operating system. If we were to take a look at the documentation:

A process’s main thread starts with an automatically allocated 512 KB stack – QNX SDP 8.0 - Stack Allocation

This shocked all of us, since 1 MiB was not a large buffer in 2021, when we had plenty of memory on our personal systems at home.

Note 1: The target used in the example was aarch64le. The example also crashes on amd64 (x86_64), but only if you add something such as a print statement.

Note 2: QNX 8.0 was released to the general public in late 2023 or early 2024, so the actual target at the time the question was asked was running either QNX 7.0 or QNX 7.1 (I do not recall which version).

As noted, triggering the crash on AMD64 (x86_64) requires more fiddling, which came as a surprise to me. A slightly more detailed version will be released shortly on my blog, including a very brief explanation of why AMD64 doesn’t crash unless something extra, like a call to puts, is added.
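A common way to sidestep a small default stack is to place large buffers on the heap instead. The following is a minimal sketch under that assumption: make_buffer and BUF_SIZE are hypothetical names, and this is not taken from the original ticket or from how the crash was eventually resolved.

```c
#include <stdlib.h>
#include <string.h>

/* 1 MiB: larger than the 512 KB main-thread stack quoted above */
#define BUF_SIZE (1024 * 1024)

/* Hypothetical helper: allocates the buffer on the heap rather than the
 * stack. Returns NULL if the allocation fails; the caller must free()
 * the returned pointer when done. */
char *make_buffer(void) {
    char *buf = malloc(BUF_SIZE);
    if (buf != NULL) {
        /* Zero the buffer so every byte is actually touched. */
        memset(buf, 0, BUF_SIZE);
    }
    return buf;
}
```

The buffer then lives outside the thread's stack, so its size is limited by available memory rather than by the stack size the operating system hands out.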

Jekyll Cache Saving the Day

December 17, 2024

micro jekyll cache

I was in the midst of publishing a post announcing that QNX released a non-commercial license, which allows hobbyists to fiddle around with it, when I accidentally overwrote my file using the cp command. This effectively killed my mood as I did not want to rewrite everything from scratch. I then recalled that Jekyll creates a cache to speed up the build process when converting markdown to HTML.

$ ls -ld .?* 
drwxr-xr-x. 1 zaku zaku 204 Dec 16 23:47 .git
-rw-r--r--. 1 zaku zaku   0 Oct 20 19:55 .gitignore
drwxr-xr-x. 1 zaku zaku  32 Oct 20 19:56 .jekyll-cache

If we traverse into the cache and into Jekyll--Converters--Markdown, we see a lot of directories whose names appear to be in hex:

.jekyll-cache/Jekyll/Cache/Jekyll--Converters--Markdown$ ls
0e  1c  22  24  2e  37  3f  44  47  53  57  5d  62  66  6e  74  7b  84  8d  90  91  9c  a7  a9  aa  ab  b1  b3  b6  c1  c6  cb  d4  d5  e1  e2  ea  f9  fc

Using my trusty tool grep, I was able to patch together pieces of my work. However, as the purpose of Jekyll--Converters--Markdown is to cache markdown files that have been converted to HTML, I had to clean the result up a bit. Regardless, it was much faster than rewriting the entire article.

QNX is 'Free' to Use

November 9, 2024

micro qnx

Recently on Hacker News, someone from developer relations at QNX announced that QNX is now free for anything non-commercial. QNX also made an announcement to the LinkedIn community, which is where I learned about it. For those who are not familiar with QNX, it is a proprietary realtime operating system targeted at embedded systems and is installed in over 255 million vehicles. QNX has a great reputation as a reliable and safe embedded system to build software on top of, due to its microkernel architecture and its compliance with many industrial and engineering design processes, which makes it easier for customers to certify their software for safety-critical systems. What makes QNX appealing is a discussion for another time, but for me this is a good opportunity to fiddle around with the system. I was previously denied a license by my university, which had an agreement with QNX, and my attempts to get an educational license years ago did not go far.

Previously, to gain access to QNX, one had to either purchase a commercial license from QNX or have an academic license. This kept hobbyists from having access to the operating system. With the non-commercial license, QNX is now open to those who are interested in running an RTOS in their hobby projects and to open source developers who want to port their software to QNX. QNX is POSIX compliant, but as it was not open for public use, companies had to port open source projects such as ROS (Robot Operating System, which isn’t an actual OS) to QNX themselves. QNX also mentions that the non-commercial license allows one to develop training materials and books on QNX, which are frankly scarce outside of QNX-authorized materials (i.e. QNX training, Foundry27, and the QNX documentation).

While the announcement is welcome news for someone like me who would love to tinker around, this is yet another product entering the hobbyist community late. The reason for the success of UNIX, Linux, RISC-V, and ARM is how easily available they are to hobbyists and students, who later bring them to their workplace or make them better. Closing access to technology is a recipe for disaster in the long term when it comes to gaining a market advantage. This is exactly why we see cloud corporations enticing students and hobbyists with free (limited) access to their products and even, at times, sponsoring events targeted at them. Being open source has made Linux, BSD, and FreeRTOS the dominant operating systems among the tinkering community and given them wide adoption in the market. Over the years, we have seen customers shift from commercial and custom grade hardware and software towards open source or off-the-shelf solutions, including in safety-critical applications such as SpaceX using Linux and non-radiation-hardened CPUs. IBM, for instance, has been late in developing an ecosystem of developers for its Cloud, Database and Power Architecture. In recent years IBM has done a good job of creating free developer-focused training that makes use of its own technologies. However, it is plainly obvious that IBM has failed to capture the mainstream interest of hobbyists, who much prefer other cloud providers such as AWS, Google Cloud, Linode, and Digital Ocean. The SPARC and POWER architectures were open-sourced by their respective owners far too late, by which time developers had shifted towards RISC-V and ARM, as those architectures are either more open or easier to obtain (such as through the Raspberry Pi Foundation).

While I have not done any sentiment analysis of this announcement, I think overall this move is a good first step towards developing an ecosystem of developers who appreciate and understand the QNX architecture, but it has also been met with skepticism. For reference, QNX has messed with the community twice before, which explains the deep mistrust from experienced developers. The top comment on Hacker News does a great job of summarizing the skepticism. QNX used to have a bigger hobbyist community, and open source projects such as Firefox would have a build for QNX, but that all died when QNX closed its doors to the community. Years later, the QNX source code was made available for the public to read (though probably with restrictions), but that availability was shut down after QNX was acquired by Blackberry, which does not have the best reputation among developers (hence why, from my understanding, Blackberry phones failed to capture the market despite once being a market leader).

Regardless, I plan to create a few materials on QNX in the coming months and perhaps a follow-up to QNX Adaptive Partitioning System, as it has been ranked in the top 5 of Google search results (though I doubt it had many readers given the small population of QNX developers):

Google Search Console from July 9 2023 - Nov 8 2024 which had 308 clicks

[Preview] Manually Verifying an Email Signature

October 8, 2024

micro gpg signing

I noticed that the neocities community loves using protonmail, and some even share their public key to enable fully encrypted communication. While I care about cyber security more than the average human, I do not care enough to require others to encrypt their emails and sign their messages so that I can verify the authenticity of the messages I receive.

Out of curiosity, I decided to see how one would manually verify the signature of an email to ensure that the email has not been tampered with and comes from the person it claims to be from. I won’t go into how digital signatures work, as those details will be posted shortly on my blog.

Import Alice’s public key:

$ gpg --import publickey-alice@proton.me.asc 
gpg: key : public key "alice@proton.me " imported
gpg: Total number processed: 1
gpg:               imported: 1

Download the email .eml file and the signature

$ ls signature.asc 'GPG Signing test.eml'
'GPG Signing test.eml'   signature.asc

Extract the message to verify from .eml file

This is where things get difficult. The downloaded email *.eml has a lot of unneeded information that needs to be discarded. I highly recommend that you make a copy of the email file because the process does take a while to get used to.

The content of the message starts after you see the following header (the hash will differ):

 This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
 --------7005887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69bc70a

So for instance, let’s look at the following file:

 MIME-Version: 1.0
 Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg=pgp-sha512; boundary="------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415"; charset=utf-8

 This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415
 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

Then the first line of the signed message is:

 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

Where the signed message ends is a source of confusion. Many people on the internet tell you to put everything between the first boundary and the second boundary into a new file. The boundary they are referring to is the line after This is an OpenPGP/MIME signed message (RFC 4880 and 3156), which consists of a run of dashes followed by a hash.

 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415

 //email content

 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415

Despite my many attempts, I had no success until I realized you have to delete all trailing newlines. One thing I noticed is that the hash on the first line of the signed message also appears on the last line of the signed message.

 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

The first line of the signed file

The hash on the first line of the signed message is: ff35159c3ebf11234dd954191b3141592 so our file should also end with this hash.

If our message looks something like this:

 MIME-Version: 1.0
 Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg=pgp-sha512; boundary="------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415"; charset=utf-8

 This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415
 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

 ...

 -----------------------ff35159c3ebf11234dd954191b3141592
 Content-Type: application/pgp-keys; filename="publickey - alice@proton.me - .asc"; name="publickey-alice@proton.me.asc"
 Content-Transfer-Encoding: base64
 Content-Disposition: attachment; filename="publickey-alice@proton.me.asc"; name="publickey - alice@proton.me - .asc"

 ABCDEF0x4ZjZkeGxSL0xUABCDEFmltotlUR0ABCDEFWaABCDEFE9PQP9ABCDEFAABCDEFtLUVORCBABCED
 ABCDEFEABCDEFFWSBCTE9DSy0tLABCDE==
 -----------------------ff35159c3ebf11234dd954191b3141592--

 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415

Then the signed message should be

 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

 ...

 -----------------------ff35159c3ebf11234dd954191b3141592

 ...

 -----------------------ff35159c3ebf11234dd954191b3141592
 Content-Type: application/pgp-keys; filename="publickey - alice@proton.me - .asc"; name="publickey-alice@proton.me.asc"
 Content-Transfer-Encoding: base64
 Content-Disposition: attachment; filename="publickey-alice@proton.me.asc"; name="publickey - alice@proton.me - .asc"

 ABCDEF0x4ZjZkeGxSL0xUABCDEFmltotlUR0ABCDEFWaABCDEFE9PQP9ABCDEFAABCDEFtLUVORCBABCED
 ABCDEFEABCDEFFWSBCTE9DSy0tLABCDE==
 -----------------------ff35159c3ebf11234dd954191b3141592--
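The extraction described above can be sketched in code. This is only a rough illustration under stated assumptions: extract_signed_part is a hypothetical helper, it expects the whole .eml to already be in memory as a string, and it takes the full delimiter line (the dashes plus hash that follow the OpenPGP/MIME notice) as an argument. It follows the observation above that the trailing newlines before the closing boundary must be dropped.

```c
#include <string.h>

/* Hypothetical helper: copies the signed part of a multipart/signed email
 * into out. delim is the full delimiter line as it appears in the body
 * (dashes followed by the hash).
 * Returns the number of bytes written, or -1 on failure. */
long extract_signed_part(const char *eml, const char *delim,
                         char *out, size_t cap) {
    /* The signed part starts on the line after the first delimiter. */
    const char *first = strstr(eml, delim);
    if (first == NULL) return -1;
    const char *start = strchr(first, '\n');
    if (start == NULL) return -1;
    start++;

    /* It ends just before the next occurrence of the delimiter. */
    const char *end = strstr(start, delim);
    if (end == NULL) return -1;

    /* Drop trailing newlines (and carriage returns) before the delimiter. */
    while (end > start && (end[-1] == '\n' || end[-1] == '\r')) end--;

    size_t len = (size_t)(end - start);
    if (len + 1 > cap) return -1;
    memcpy(out, start, len);
    out[len] = '\0';
    return (long)len;
}
```

The result would then be written to message.txt. A real tool would also need to confirm that the delimiter starts at the beginning of a line, which this sketch does not check.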

Verify the signature: gpg --verify signature.asc message.txt

 $ gpg --verify signature.asc message.txt 
 gpg: Signature made Mon 07 Oct 2024 11:29:48 PM EDT
 gpg:                using EDDSA key 
 gpg: Good signature from "alice@proton.me " [unknown]
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:          There is no indication that the signature belongs to the owner.
 Primary key fingerprint: 

In practice, no one verifies the digital signatures of emails manually. Any sane individual will use an email client that automates the verification process for them. This was a quick preview of a blog post I will be writing in the next few days that will go into email signatures in more detail, with better explanations and diagrams.

[Preview] Half-Width and Full-Width Characters

October 6, 2024

micro encoding

Those of us who live and work in English will probably never think about how characters are encoded, that is, how characters such as the very letters you see on the screen are represented by a number, like 65 for ‘A’ in ASCII, which fits in 1 byte, such as a char in C.

I was not aware of the existence of full-width and half-width characters until a friend asked me to briefly explain, at a high level, the difference in how these characters are represented. For those like me who weren’t aware that Japanese text mixes zenkaku (full-width) and hankaku (half-width) characters, look at the image below or visit this webpage: https://mailmate.jp/blog/half-width-full-width-hankaku-zenkaku-explained

Based on the article I shared, half-width characters take up 1 byte while full-width characters take up 2 bytes (they can also be called double-byte characters). I do believe this depends on the encoding used. For me, the most obvious distinction between half-width and full-width characters is how much graphical space they consume, as is evident from both the image above and the one below:

Full and Half Width encoded on UTF-8 as seen through Vim

While I read and typed Korean during my younger years when I was forced to learn it, it never clicked for me how much space Korean takes up graphically. It is obvious in hindsight, but it was nonetheless interesting. Looking at the file sizes in bytes, we can see that the number 1 in UTF-8 takes 1 byte as a half-width character and 3 bytes as a full-width character:

$ stat -c "%n,%s" -- halfwidth-utf8.txt fullwidth-utf8.txt 
halfwidth-utf8.txt,1
fullwidth-utf8.txt,3

One confusion I had was understanding the difference between UTF-8 and UTF-16, and the following exercise helped me understand it:

UTF-8 encodes each character in 1 to 4 bytes
UTF-16 encodes each character in 2 or 4 bytes

As you can tell, UTF-8 and UTF-16 are variable-length encodings, meaning they take up more or fewer bytes depending on the character being encoded. We can see this by comparing the Arabic numeral 1 with the Chinese numeral 一:

$ stat -c "%n,%s" -- halfwidth-1.txt chinese-1.md 
halfwidth-1.txt,1
chinese-1.md,3

In UTF-8, 1 takes up 1 byte, which is unsurprising as ASCII characters have a great advantage in UTF-8 compared to characters from Asian languages, while 一 takes up 3 bytes.
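These byte counts follow from how UTF-8 packs the bits of a code point. Here is a minimal sketch of an encoder, where utf8_encode is a hypothetical helper name and the code points U+0031 (the digit 1) and U+4E00 (一) are the ones behind the two files compared above.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as UTF-8 into out (room for 4 bytes).
 * Returns the number of bytes written, or 0 for an invalid code point. */
size_t utf8_encode(uint32_t cp, unsigned char out[4]) {
    if (cp <= 0x7F) {                 /* ASCII range: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {               /* 3 bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {             /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                         /* beyond the Unicode range */
}
```

Anything up to U+007F fits in a single byte, which is why plain ASCII text is the same size in ASCII and UTF-8, while code points from U+0800 to U+FFFF, which covers most CJK characters, need three.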

Note: Do not attempt to display UTF-16 encoded files on the terminal without changing your locale (or whatever it is called). It will not display nicely. Vim on my machine will automatically open the file as UTF-16LE.

Let’s inspect the contents of the files for the half-width character 1 and the full-width character １ in hex:

$ cat halfwidth-1.txt; echo ""; xxd halfwidth-1.txt; cat fullwidth-1.txt ; echo ""; xxd fullwidth-1.txt 
1
00000000: 31                                       1
１
00000000: efbc 91                                  ...

As we can see, the half-width character 1 in UTF-8 is represented as 0x31, meaning only one byte is required. However, the full-width digit １ is represented as 0xEFBC91. Now let’s compare this with UTF-16:

$ cat halfwidth-utf16.txt; echo ; xxd halfwidth-utf16.txt; cat fullwidth-utf16.txt; echo; xxd fullwidth-utf16.txt 
1
00000000: 0031                                     .1
�
00000000: ff11                                     ..

Note: To view UTF-16 on VIM run on command mode (i.e. press esc to exit current mode and press : to enter command mode): e ++enc=utf-16be fullwidth-utf16.txt

As expected, UTF-16 represents code points in the upper range very well: １ (full-width 1) is now represented with only 2 bytes, unlike the 3 that were required in UTF-8. The same cannot be said for code points in the lower range, such as our half-width digit 1, which now takes 2 bytes because 0x00 is prepended to its hex representation.
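The UTF-16 side can be sketched the same way. This is a minimal illustration assuming big-endian byte order, as in the xxd output above; utf16be_encode is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one Unicode code point as big-endian UTF-16 into out (room for
 * 4 bytes). Returns the number of bytes written (2 or 4), or 0 for an
 * invalid code point. */
size_t utf16be_encode(uint32_t cp, unsigned char out[4]) {
    if (cp > 0x10FFFF) return 0;                /* beyond the Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* reserved for surrogates */

    if (cp <= 0xFFFF) {
        /* One 16-bit unit: the code point itself, high byte first. */
        out[0] = (unsigned char)(cp >> 8);
        out[1] = (unsigned char)(cp & 0xFF);
        return 2;
    }

    /* Two 16-bit units (a surrogate pair) for everything above U+FFFF. */
    cp -= 0x10000;
    uint32_t high = 0xD800 | (cp >> 10);
    uint32_t low  = 0xDC00 | (cp & 0x3FF);
    out[0] = (unsigned char)(high >> 8);
    out[1] = (unsigned char)(high & 0xFF);
    out[2] = (unsigned char)(low >> 8);
    out[3] = (unsigned char)(low & 0xFF);
    return 4;
}
```

For U+0031 this produces 00 31, and for U+FF11 it produces ff 11, matching the two dumps. Only code points above U+FFFF, such as most emoji, take the 4-byte form.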

I will be writing a more detailed look into encoding at my blog in the coming days. This is just a quick preview.

Mixing Number and String

September 18, 2024

micro programming

A recent post has gotten somewhat popular on the web and is something many of us can relate to. In the case of many, including the author, the issue stems from how YAML treats strings and numbers. As a rule of thumb, I suggest avoiding any potential confusion by always putting quotes around a string to ensure the value is treated as a string as intended. The crux of the post was that their Git commit hash inconveniently happened to be 556474e378, which is a very rare kind of value to get. Recall that scientific notation has the form \d+(\.\d+)?E-?\d+, such as 8.5E-10 to refer to 8.5 x 10^-10. The issue with mixing numbers and strings is that things can go in very unexpected ways, as they did for the author, whose 556474e378 was treated as 556474 x 10³⁷⁸. While I do not have any specific examples in mind, I have definitely run into this kind of issue before, where I mixed up a string and a number and got undesired behavior. However, I do not think I have ever had my numbers interpreted as scientific notation.
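The same trap is easy to reproduce outside of YAML. The sketch below uses C's strtod to ask whether an entire string can be read as a floating-point number. parses_as_number is a hypothetical helper, and strtod follows C's rules rather than YAML's, so this only illustrates the ambiguity and does not model any YAML parser.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: reports whether the whole of text can be read as a
 * floating-point number by strtod. This uses C's number syntax (which also
 * accepts things like "inf" and hex floats), not YAML's. */
bool parses_as_number(const char *text) {
    char *end = NULL;
    (void)strtod(text, &end);
    /* It is a number only if something was consumed and nothing is left. */
    return end != text && *end == '\0';
}
```

With this check, 556474e378 is accepted as a number because the e is read as an exponent marker, while a hash such as 556474f378 is not, since parsing stops at the f. A value that is meant to be a string has to be marked as one, for example with quotes.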

`.` At The End of a URL

August 30, 2024

micro dns network

I recently learned that website addresses can be terminated with a ., such as www.google.com. or https://neocities.org.. However, this does not work for all websites. I was skimming through Network for Dummies during work, and while it doesn’t cover anything useful for the work I am trying to do (if you have taken a network course before, don’t bother reading this book unless you are bored like I was¹), learning that a website address can end with a . was a surprise.

The book states that If a domain name ends with a trailing dot, ..., and the domain name is said to be a fully qualified domain name (FQDN). The difference between an absolute name (FQDN) and a relative name is important when working with DNS and can cause an “internet outage” if handled incorrectly, as one user on Hacker News comments. Based on an article (linked by a Stack Overflow user), websites that fail to handle . in their domain names are the ones in violation of RFC 1738, or at least are not heeding its recommendations.
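As an illustration of what handling the trailing dot might involve, here is a minimal sketch with two hypothetical helpers. It assumes, without having checked any particular web server, that the sites that break compare host names literally and would work if the name were normalized first.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true when name is written as an absolute domain name,
 * i.e. it ends with a trailing dot. */
bool is_absolute_name(const char *name) {
    size_t len = strlen(name);
    return len > 0 && name[len - 1] == '.';
}

/* Copies name into out without its trailing dot (if any), so that
 * "neocities.org." and "neocities.org" compare equal afterwards.
 * Returns false if out is too small. */
bool strip_trailing_dot(const char *name, char *out, size_t cap) {
    size_t len = strlen(name);
    if (len > 0 && name[len - 1] == '.') len--;
    if (len + 1 > cap) return false;
    memcpy(out, name, len);
    out[len] = '\0';
    return true;
}
```

Normalizing like this is only appropriate for comparisons. For DNS resolution itself the dot matters, since it is what marks the name as absolute rather than relative.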

Notes:

¹ While Network for Dummies was surprisingly fun to read thanks to the author’s writing style, it lacks technical depth, which should come as no surprise.

Splitting Pdfs into Even and Odd Pages

August 28, 2024

micro printer pdf utilities

During the winter break I obtained an old Xerox XE88 Workstation Printer released in 2000, the year the media worried about Y2K wreaking havoc on our digital infrastructure, which thankfully never happened at the scale we all feared. Though of course a bug will eventually creep in and wreak havoc (e.g. the Crowdstrike Falcon update). But I digress. Using this printer was filled with frustration, as it is a relic from the past that is not meant to be used in 2024. Firstly, the printer requires a parallel port, which no modern computer comes equipped with. I had to drag out my last surviving desktop from my childhood, which originally came with Windows Me that we immediately replaced with the glorious Windows XP that we all know, love and dearly miss. As it turns out, a few months after my first use of the printer, my PS/2 mouse stopped working. I do not know if the PS/2 port or the PS/2 mouse is broken. I did manage to find another PS/2 mouse, but as it was water damaged from a basement leak a few years ago, there was little chance it would work. Not having a mouse made this task much harder, but I digress.

Parallel Port

PS/2 Port typically found in desktops from the 90s

Setting aside the hardware struggles of operating such a printer in 2024, the printer does not support duplex printing, so I had to run commands on my PDFs on my Linux machine before transferring the files to my Windows XP machine (thankfully there are USB ports on this desktop that work, or else I would have had to dust off my 3.5 inch floppy disks and CDs). Splitting PDFs into even and odd pages turns out to be very simple:

pdftk A="${file}" cat Aodd output "${file}-odd.pdf"
pdftk A="${file}" cat Aeven output "${file}-even.pdf"

As I am printing a bunch of papers on Trusted Computing, I needed to split a lot of PDFs. Doing this by hand gets quite tedious, so I wrote a simple shell script:

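for file in *.pdf; do
  base="${file%.pdf}"  # strip the extension so the outputs are named paper-odd.pdf rather than paper.pdf-odd.pdf
  pdftk A="${file}" cat Aodd output "${base}-odd.pdf"
  pdftk A="${file}" cat Aeven output "${base}-even.pdf"
done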

Executing Script Loophole

August 28, 2024

micro script linux

I recently came across an article discussing an attempt to close a loophole that bypasses the normal execute permission bit. Exploiting a program’s suid and euid to gain higher privileges is a commonly known technique called privilege escalation. The article does not cover this, but it describes a flaw in the way Linux currently handles the execution of scripts. I do not know why privilege escalation came to mind, but as I usually write nonsensical things anyway, I shall keep it here for now. The article gives a neat example where a script does not have the execute bit set but can still be run by invoking it via an interpreter.

$ ls -l evil-script.py 
-rw-r--r--. 1 zaku zaku 86 Aug 28 00:20 evil-script.py
$ ./evil-script.py
bash: ./evil-script.py: Permission denied
$ python3 evil-script.py 
Evil script has been invoked. Terror shall fill this land

As you can see, the script has no execute bit set. However, the script can still be run by feeding it to the interpreter, because the interpreter only needs to be able to read the file. I had never considered this a security loophole, but after reading the article, I realized there are some concerns with allowing scripts to run in a way that bypasses the file’s permissions. I have always had the habit of leaving many of my interpreted scripts non-executable and feeding them to the interpreter out of laziness (I know setting the execute bit is a one-time thing, but I am just too lazy to run chmod).

The article covers some promising approaches, so I do expect a solution to be merged into the kernel sometime in the near future, which will force me to change my habits once the interpreters adopt the change. Though if interpreters do adopt it, I expect quite a few production and CI/CD servers to be impacted, as there will always be someone like me who is too lazy to set the execute bit on their scripts.

One benefit of closing this loophole is that it forces users to make the conscious choice to set the execute bit, similar to how we have to mark the flatpaks we download as executable (at least from my personal experience) before we can run them.

Replacing main()

August 24, 2024

micro gcc C/C++

Any beginner C programmer will know that the first function executed in any program is the main() function. However, that is not the entire truth. Just like the Bohr and Lewis diagrams we learned in high school chemistry, this is an oversimplification. From my knowledge, the first function executed in a binary once the loader has run is _start().

Without going into any details, we can replace main() with another function such as foo() (sorry for the lack of creativity).

#include <stdio.h>
#include <stdlib.h>

int foo() {
  printf("Called foo\n");
  exit(0);
}

int main() {
  printf("Called main\n");
  return 0;
}

If we compile with -e <entry>, where <entry> is the name of the function replacing main(), we can see the following results:

$ gcc foo.c -e foo
$ ./a.out 
Called foo

We can also use objdump and nm to see where the start address of the C code is (here I am making a distinction between the first entry point of the C code and that of the binary).

$  objdump -f ./a.out | grep start
start address 0x0000000000401136
$ nm ./a.out | grep foo
0000000000401136 T foo

Few Notes

You must define main() even if it’s not going to be used. CPP Reference states this explicitly:

Every C program coded to run in a hosted execution environment contains the definition (not the prototype) of a function named main, which is the designated start of the program.

Neglecting to define main results in an error like the following:
```
$ gcc foo.c
/usr/bin/ld: /usr/lib/gcc/x86_64-redhat-linux/14/../../../../lib64/crt1.o: in function `_start':
(.text+0x1b): undefined reference to `main'
collect2: error: ld returned 1 exit status
```
The C program entry must call exit() to terminate if it is not main(), or else a segfault will occur:
```
$ ./a.out 
Called foo
Segmentation fault (core dumped)
```
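A backtrace via gdb won’t give much information as to why. It is probably best to consult glibc. Essentially, it is likely because the entry point is not called like a regular function, so there is no return address on the stack for it to return to. On x86-64 Linux the top of the stack at process entry holds argc rather than a return address, which would explain the bogus 0x1 frame below. Normally _start never returns: it ends up calling exit to terminate the program, which probably does some cleanup via atexit and sets the exit status $? to some value.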
```
(gdb) bt 
#0  0x0000000000000001 in ?? ()
#1  0x00007fffffffdd46 in ?? ()
#2  0x0000000000000000 in ?? ()
```

Random Links for later Research

https://vishalchovatiya.com/posts/crt-run-time-before-starting-main/
https://www.gnu.org/software/hurd/glibc/startup.html
https://stackoverflow.com/questions/63543127/return-values-in-main-vs-start

Editing GIFS and Creating 88x31 Buttons

August 18, 2024

micro gifs gimp

Lately I have been learning how to edit GIFs, and it is painstakingly difficult to remove the background from a GIF without using an AI tool, especially when the image has over 70 frames. There is likely an easier way to edit GIFs, but I had to manually edit over 50 frames, erasing the clouds with the eraser tool frame by frame, which took some time to finish.

Original:

Flying Pikachu Transparent Balloon Pikachu Sticker from Flying Pikachu Transparent Stickers

Result:

However, if you are not editing a GIF but rather trying to incorporate the GIF into your 88x31 buttons, it turns out to be quite simple. Following the instructions from a video on YouTube, I managed to create a few simple 88x31 buttons that feature cats, coffee, and the two programs I am studying or have finished studying (i.e. doing a 2nd degree):

To resize the GIFs, I used the ezgif resize tool to set the height to 31px since I didn’t know how to resize GIFs in GIMP, as it would open a bunch of layers. I guess I could have used ffmpeg, but using an online tool is just more convenient and easier. I do wonder if there are other standard antipixel button sizes aside from 80x15 pixels, because a height of 31 pixels is quite limiting. It’s amazing what the community has been able to do with such a limited number of pixels.

For instance, the Bash button I made has the subtitle “THE BOURNE-AGAIN SHELL”, which is very hard to make out. I am assuming the standard practice is to render the button as a GIF and display the text on the next frame. That way users would be able to see the full text.

multiple definition of `variable` ... first defined here

August 10, 2024

micro gcc C/C++

Randomly I decided to compile some old projects I worked on, and I was surprised to see a few compilation errors in an assembler I wrote years back. As it has been many years since I last touched most of the projects I looked at, I was pleased to see the compiler catching obvious mistakes I had made in the past. Though it did come as a surprise that the compiler I used years ago never complained about such obvious mistakes. The reason for the last error, and its solution, were not immediately clear to me:

$ make
gcc -o assembler assembler.c symbol_table.c parser.c  -fsanitize=address -lasan
/usr/bin/ld: /tmp/cc1MoBol.o:(.bss+0x0): multiple definition of `table'; /tmp/cc0B4XxW.o:(.bss+0x0): first defined here
/usr/bin/ld: /tmp/cc1MoBol.o:(.bss+0x81): multiple definition of `__odr_asan.table'; /tmp/cc0B4XxW.o:(.bss+0x40): first defined here

At first I thought I might have made a stupid mistake and defined the struct called table twice, but all I could find was symbol_table.h, the file that declared the variable, being included by assembler.c and parser.c. This led me to conclude that there must have been a behavioral change in the compiler between GCC 9 and GCC 14. After some quick googling and going through the release notes, it turns out that starting from GCC 10, GCC defaults to -fno-common:

GCC now defaults to -fno-common. As a result, global variable accesses are more efficient on various targets. In C, global variables with multiple tentative definitions now result in linker errors. With -fcommon such definitions are silently merged during linking.

In the Porting to GCC 10 webpage, the developers of GCC note:

A common mistake in C is omitting extern when declaring a global variable in a header file. If the header is included by several files it results in multiple definitions of the same variable

To resolve this issue, one can either silently ignore the mistake and compile with -fcommon, or correctly declare the global variable in the header with the extern keyword and define it in exactly one source file.

Delusional Dream of a OpenPower Framework Laptop

August 4, 2024

micro framework powerpc

Framework is a company that makes modular and repairable laptops and has captured the interest of tech enthusiasts over the past 4 years. Currently Framework laptops are limited to the x86-64 architecture, supporting Intel CPUs and, since 2023, AMD CPUs. Although Framework laptops are not entirely open source, from my understanding they have open sourced a decent chunk of their work, which allows third-party development of components and makes it possible for other companies such as DeepComputing to release a mainboard that runs a RISC-V CPU. While the new mainboard will not be usable for everyday applications, it is a step toward a more open ecosystem and an exciting one for Framework, RISC-V and the broader community of open-source advocates. This announcement makes me wonder about the possibility of OpenPower running on a Framework laptop. As with RISC-V, there isn’t an easily accessible way to obtain a consumer product running on OpenPower (aside from Raptor Computing with their extremely expensive machines). There is the PowerPC Notebook project, run by a group of volunteers who are trying to bring an open source PowerPC notebook into the hands of hobbyists. It would be interesting if the OpenPower community could also partner with Framework to develop a mainboard once the project is complete and the software is more mature. However, this would be a difficult step, as there is no dedicated company like DeepComputing that will pour resources into making this happen. Interest in OpenPower is low and overshadowed by the wider industry interest in bringing the ARM and RISC-V architectures to consumers. IBM made a huge mistake in open sourcing the POWER architecture too late. But one could always dream (even if it’s delusional) :D

2024 Update

August 4, 2024

micro site

Website

In the past year I have been very lazy, as is evident from my lack of activity on my personal blog. I'm now trying to pick up blogging again. It's hard to believe that it's been almost an entire year since I created this Neocities site, which has had almost 0 updates since. I've been thinking about how to use this site since I already have a blog on GitHub Pages. Honestly, I forgot this corner existed, and it’s been bothering me that I couldn’t write my random, nonsensical thoughts because my main blog wouldn’t be a suitable medium until now. So, I’ve decided that this corner will be a microblog where I can share random articles and thoughts. A microblog is different from a regular blog in that the content is much shorter. This space will allow me to quickly jot down something random. I hope that a collection of these random posts will evolve into a blog post or spark an idea for my final year thesis or project.

How are my studies going?

I’m still studying Mathematics, but I’ve lost much of my initial interest in the field after taking a few third-year courses. I ended up taking fewer Math courses, which puts my ability to graduate on time at risk. Listening to lectures and reading about abstract groups and rings made me really miss programming and computer science. Despite this, there were still some Math courses I enjoyed, such as Combinatorics and Real Analysis. However, I didn’t last long in the follow-up Real Analysis courses that covered Stone-Weierstrass and Commutative C* Algebra. Feeling tired of abstract Mathematics, I decided to take a break and pursue an internship at a telecommunications enterprise.

What am I doing Now?

As mentioned, I am currently doing a year-long internship with a telecommunications enterprise. Although the job isn't very exciting, it's a welcome break from Mathematics. This would typically be a great chance to catch up on my Computer Science studies by delving into textbooks and online resources, but I’ve been quite lazy. Instead, I've been focusing on learning French, a language I've always wanted to master. I started learning French in elementary school, as it’s a requirement in Canada. While it might make more sense to learn my mother tongue, I’m opting to learn French, which might seem confusing to some. For context, I don't have an English name and was born in some Asian country but I am unable to communicate with others in my mother tongue.

this: the implicit parameter in OOP

2025-02-11T00:00:00-05:00

I was recently reminded that the variable this is an implicit parameter passed to all (non-static) methods in OOP languages such as C++. We can observe this by comparing a regular function with a method belonging to some class:

#include <iostream>

void greet() {
    std::cout << "Hello World\n";
}

class Human {
public:
    void greet() {
        std::cout << "Hello World\n";
    }
};

int main() {
    greet();
    Human human = Human();
    human.greet();
}

Output:

$ g++ test.C
$ ./a.out 
Hello World
Hello World

Furthermore, their resulting mangled names do not indicate that the function/method takes in any arguments:

$ nm a.out  | grep greet
0000000000401126 T _Z5greetv
000000000040115c W _ZN5Human5greetEv

C++ mangles symbol names to encode information such as a function’s class and parameter types, which gives the linker more information to work with. One obvious problem name mangling solves is function overloading, where the same function identifier can take a different number or different types of parameters. The v suffix in the mangled names indicates that the parameter list is void, i.e. the function takes no parameters. This holds for Human::greet as well because, as the title suggests, this is an implicit parameter, meaning it’s a “parameter” that the compiler passes into the method on our behalf. However, this can only be observed by inspecting the assembly code. A language that explicitly passes a reference to the object itself is Python, where a typical constructor would look like the following:

class Human:
    def __init__(self, name, age):
        self.name =  name
        self.age = age

Anyhow, let’s observe the assembly code. Note: I’ll be only showing the code of interest.

For greet:

Dump of assembler code for function _Z5greetv:
   0x0000000000401126 <+0>:	push   %rbp
   0x0000000000401127 <+1>:	mov    %rsp,%rbp
   0x000000000040112a <+4>:	mov    $0x402280,%esi

For Human::greet:

Dump of assembler code for function _ZN5Human5greetEv:
   0x000000000040115c <+0>:     push   %rbp
   0x000000000040115d <+1>:     mov    %rsp,%rbp
   0x0000000000401160 <+4>:     sub    $0x10,%rsp
   0x0000000000401164 <+8>:     mov    %rdi,-0x8(%rbp)
   0x0000000000401168 <+12>:	mov    $0x402280,%esi

In x86-64 assembly (using the System V calling convention found on Linux), the first few integer or pointer arguments of a function are passed in the registers rdi, rsi, rdx, etc., and only the remaining ones are passed on the stack. Since greet has no parameters, it goes straight to storing the address of our constant string “Hello World\n” into the esi register:

(gdb) x/1s 0x402280
0x402280:	"Hello World\n"

However, in our method Human::greet, the rdi register, which typically holds the first argument of a function, is being used:

mov    %rdi,-0x8(%rbp)

We can assume that whatever the rdi register is holding is an 8-byte value, which also happens to be the size of a pointer on x86-64. This is our implicit argument, this, which contains the address of the object itself. We can observe this via gdb:

(gdb) p &human
$2 = (Human *) 0x7fffffffdc4f
...
(gdb) i r rdi
rdi            0x7fffffffdc4f      140737488346191

where we see that the rdi register contains the same address as our object human: 0x7fffffffdc4f.

We can also replicate this on ARM (AArch64), where x0 (w0 being its lower 32-bit half) will be set to the address of our object human, using Compiler Explorer:

Human::greet():
 stp	x29, x30, [sp, #-32]!
 mov	x29, sp
 str	x0, [sp, #24]
...

As you can observe, x0 is also stored as an 8-byte value on the stack (i.e. 32 - 24 = 8). Running this on a QNX ARM image (I was too lazy to flash a new OS onto my Raspberry Pi), we can observe that the x0 register does indeed contain the same address as our object human, which is what this represents:

(gdb) p &human
$1 = (Human *) 0x81c60
...
Dump of assembler code for function _ZN5Human5greetEv:
test.C:
9	    void greet() {
<+0>:	stp	x29, x30, [sp, #-32]!
<+4>:	mov	x29, sp
<+8>:	str	x0, [sp, #24]
...

(gdb) i r x0                     
x0             0x81c60             531552

view is just vim

2025-01-24T00:00:00-05:00

I recently found out by accident at work that vim and view are the same thing when I happened to be editing a file in view instead of my beloved vim editor.

Note: This is a follow up post from my microblog

For context, view is often used in lieu of vi to open a file as read-only while retaining the same shortcuts as vi. This is why it surprised me that I could modify a file when view was supposed to be read-only. If we take a look at the documentation:

$ man -Leng view
VIM(1)                                                                  General Commands Manual                                                                  VIM(1)

NAME
       vim - Vi IMproved, a programmer's text editor

SYNOPSIS
       vim [options] [file ..]
       vim [options] -
       vim [options] -t tag
       vim [options] -q [errorfile]

       ex gex
       view
       gvim gview vimx evim eview
       rvim rview rgvim rgview

Interestingly, the man page for view points to vim, and we can see all sorts of other editor variants listed along with it, such as gvim. The AIX 7.3 documentation states that view “Starts the vi editor in read-only mode.” This is indeed evident when I take a look at how view is defined on my system (Fedora 41):

$ cat /usr/bin/view
#!/usr/bin/sh

# run vim -R if available
if test -f /usr/bin/vim
then
  exec /usr/bin/vim -R "$@"
fi

# run vi otherwise
exec /usr/libexec/vi -R "$@"

where -R is a flag for Read-only mode:

       -R          Read-only  mode.  The 'readonly' option will be set.  You can still edit the buffer, but will be prevented from accidentally overwriting a file.  If
                   you do want to overwrite a file, add an exclamation mark to the Ex command, as in ":w!".  The -R option also implies the -n option (see above).  The
                   'readonly' option can be reset with ":set noro".  See ":help 'readonly'".

Vim Oddities

What I found particularly odd was that at work, on one system, view was simply a symlink to vi:

$ realpath view
/usr/bin/vi

while on another machine, the two had the same md5sum (the md5sum below is for illustration purposes; I replicated the behavior on my local machine):

zaku@fedora:/usr/bin$ md5sum view
8fe562f5dd43b70c38f10ee2ec3310ca  view
zaku@fedora:/usr/bin$ md5sum vim
8fe562f5dd43b70c38f10ee2ec3310ca  vim

This odd behavior confused me, so I decided to run an experiment, seeing as the only difference between view and vim on both systems at work was their names:

$ ln -s /usr/bin/vim view-pika
$ ls -l view-pika
lrwxrwxrwx. 1 zaku zaku 12 22 janv. 22:52 view-pika -> /usr/bin/vim

And it BEHAVED THE SAME as view. Thus I concluded that vim behaves differently depending on the name of the command being executed. More precisely, if the program is started under a name beginning with view, it opens vim as read-only, which it determines by looking at argv[0]. This is confirmed by the source code on GitHub, in main.c::parse_command_name():

    if (STRNICMP(initstr, "view", 4) == 0)

where initstr = gettail((char_u *)parmp->argv[0]); as suspected. This explains why pika-view did not work but view-pika did: it only compares the first 4 characters of the file name in argv[0] to see if it starts with view. If you inspect the code further, you’ll see that vim has many faces.

This behavior is fully documented in the man page, which I had never noticed:

Vim behaves differently, depending on the name of the command (the executable may still be the same file).

       vim       The "normal" way, everything is default.

       ex        Start in Ex mode.  Go to Normal mode with the ":vi" command.  Can also be done with the "-e" argument.

       view      Start in read-only mode.  You will be protected from writing the files.  Can also be done with the "-R" argument.

       gvim gview
                 The GUI version.  Starts a new window.

       gex       Starts a new gvim window in Ex mode. Can also be done with the "-e" argument to gvim

       vimx      Starts gvim in "Vi" mode similar to "vim", but with additional features like xterm clipboard support

       evim eview
                 The GUI version in easy mode.  Starts a new window.  Can also be done with the "-y" argument.

       rvim rview rgvim rgview
                 Like the above, but with restrictions.  It will not be possible to start shell commands, or suspend Vim.  Can also be done with the "-Z" argument.

Extra Random Information on VIM and VI

1.) Viewing Compilation Flags

That was all I wanted to look at regarding view and vim. One interesting timbit about vim is that you can see what appear to be its compilation flags by running vim --version:

         fichier vimrc système : "/etc/vimrc"
     fichier vimrc utilisateur : "$HOME/.vimrc"
  2e fichier vimrc utilisateur : "~/.vim/vimrc"
  3e fichier vimrc utilisateur : "~/.config/vim/vimrc"
      fichier exrc utilisateur : "$HOME/.exrc"
 fichier de valeurs par défaut : "$VIMRUNTIME/defaults.vim"
               $VIM par défaut : "/usr/share/vim"
Compilation : gcc -c -I. -Iproto -DHAVE_CONFIG_H -O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wno-complain-wrong-lang -Werror=format-security -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection -mtls-dialect=gnu2 -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -D_GNU_SOURCE -D_FILE_OFFSET_BITS=64 -DSYS_VIMRC_FILE=/etc/vimrc -D_REENTRANT -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 
Édition de liens : gcc -Wl,--enable-new-dtags -Wl,-z,relro -Wl,--as-needed -Wl,-z,pack-relative-relocs -Wl,-z,now -Wl,--build-id=sha1 -Wl,-z,relro -Wl,--as-needed -Wl,-z,pack-relative-relocs -Wl,-z,now -specs=/usr/lib/rpm/redhat/redhat-hardened-ld -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -Wl,--build-id=sha1 -specs=/usr/lib/rpm/redhat/redhat-package-notes -L/usr/local/lib -o vim -lm -lselinux -lncurses -lsodium -lacl -lattr -lgpm

2.) One unique thing about vim is that it is charityware. If you simply run vim, the intro screen will ask you to help children in Uganda through ICCF Holland.

3.) Vimconf is held in Japan. This indicates that vim either has a strong presence in Japan or a very dedicated fanbase.

4.) Ubuntu 24.04 ships vim.tiny, likely a more stripped down version of vim

5.) The vi packaged on a QNX virtual target is elvis, an enhanced clone of vi. QNX probably ships elvis as the default editor due to its small size compared to vim (though this also means fewer features). The QNX Raspberry Pi 4 image, though, ships with regular vim. As with vim, renaming elvis to view will open the editor in read-only mode.

# vi --version
elvis 2.2.0
Copyright (c) 1995-2003 by Steve Kirkendall
Permission is granted to redistribute the source or binaries under the terms of
of the Perl `Clarified Artistic License', as described in the doc/license.html
file.  In particular, unmodified versions can be freely redistributed.
Elvis is offered with no warranty, to the extent permitted by law.

v.s.

   system vimrc file: "$VIM/vimrc"
     user vimrc file: "$HOME/.vimrc"
 2nd user vimrc file: "~/.vim/vimrc"
 3rd user vimrc file: "~/.config/vim/vimrc"
      user exrc file: "$HOME/.exrc"
       defaults file: "$VIMRUNTIME/defaults.vim"
  fall-back for $VIM: "/builds/workspace/build/stage/target/qnx/usr/share/vim
"
Compilation: aarch64-unknown-nto-qnx8.0.0-gcc -mlittle-endian -mlittle-endian -c -I. -Iproto -DHAVE_CONFIG_H -mlittle-endian -I/builds/workspace/build/stage/target/qnx/usr/include -I/builds/workspace/build/qnx_sdp/target/qnx/usr/include -mlittle-endian -O2 -Wall -fplugin=/builds/workspace/build/qnx_sdp/host/linux/x86_64/usr/lib/gcc/aarch64-unknown-nto-qnx8.0.0/12.2.0/plugin/cmdline_save.so -fplugin=srcversion -fplugin-arg-srcversion-path=/builds/workspace/build/qnx_sdp/target/qnx -fplugin-arg-srcversion-path=/builds/workspace/build/code -fplugin-arg-srcversion-path=/builds/workspace/build/stage/target/qnx -fplugin-arg-srcversion-path=/builds/workspace/build/qnx_sdp/host/linux/x86_64 -fplugin-arg-srcversion-buildid=vim_br-main_be-800-16 -g -D_REENTRANT -U_FORTIFY_SOURCE -D_FORTIFY_SOURCE=1 
Linking: aarch64-unknown-nto-qnx8.0.0-gcc -mlittle-endian -mlittle-endian -mlittle-endian -L/builds/workspace/build/stage/target/qnx/aarch64le/lib -L/builds/workspace/build/stage/target/qnx/aarch64le/usr/lib -L/builds/workspace/build/qnx_sdp/target/qnx/aarch64le/lib -L/builds/workspace/build/qnx_sdp/target/qnx/aarch64le/usr/lib -Wl,-Map,install.map -Wl,--build-id=md5 -Wl,--as-needed -o vim -lm -lsocket -lncurses -liconv -lintl 

The Sign of Char

2025-01-20T00:00:00-05:00

Note: This is a follow up post from my microblog

WARNING: I am no expert in assembly. The last and only time I ever wrote assembly was computing the Fibonacci sequence 8 years ago for the MIPS architecture.

The following line has an ambiguous value:

char i = -1;

The issue with the above line is that the value of i is not immediately obvious, as compilers for different architectures may treat plain char as signed or unsigned. The signedness of a data type can simply be thought of as whether or not there is a dedicated sign bit that indicates whether the number is positive or negative.

A Quick Review of Signedness

The size of char is 1 byte as defined in the C specification (C99 3.7.1), which corresponds to 8 bits on virtually all modern platforms. This gives char the ability to represent 2⁸ = 256 values, which is more than enough to represent all 128 characters of ASCII as well as encodings such as JIS X 0201 that extend ASCII slightly to use the remaining slots (ASCII itself only maps 128 values).

There are different ways to represent negative numbers, but the most common, at least from what I recall, is two’s complement. From what I read online, the advantage of two’s complement is that arithmetic operations work the same way regardless of whether a number is negative or positive. It also avoids having a concept of negative 0, which would be quite odd to deal with.

Two’s complement is quite simple but it does require you to be familiar with binary, since that is how computers represent any piece of data. The most significant bit (the leftmost bit) indicates whether the number is negative. If it is set (i.e. 1 or true), the number is negative and we must apply two’s complement to retrieve its magnitude in decimal.

In our case, let’s look at how -1 is represented using two’s complement: 1111 1111 or 0xFF

Invert all bits:
```
1111 1111 => 0000 0000
```

Add 1:

  0000 0000
+ 0000 0001
-----------
  0000 0001

Since the result is 1 and we know the most significant bit was set (or else we would not have had to apply two’s complement), 1111 1111 represents -1.

Signedness of Char in ARM

In the “Signedness of Chars” section (Chapter 19 - Portability) of his book Linux Kernel Development, Robert Love notes that some systems, such as ARM, treat char as unsigned, which goes against the intuition of us AMD64 (x86-64) programmers. Effectively, the value of i will be read back as 255 rather than -1. The reason for this is apparently performance.

Let’s verify this on my Raspberry Pi 4 machine running Linux:

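#include <stdio.h>

int main(void) {
    char i = -1;  // whether plain char is signed is up to the implementation
    if (i == 255) {
        printf("char is unsigned\n");
    }
    if (i == -1) {
        printf("char is signed\n");
    }
    return 0;
}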

Result: char is unsigned

Let’s examine under the hood (using Godbolt GCC 14.2 with no optimization enabled):

unsigned char i = -1;
signed char j = -1;

The corresponding assembly is:

; unsigned char i = -1;
mov	w0, #0xffffffff            	// #-1
strb	w0, [sp, #15]

; signed char j = -1
mov	w0, #0xffffffff            	// #-1
strb	w0, [sp, #14]

As you can observe, both signed and unsigned char result in the same set of instructions to store the value. The difference should lie in how the compiler treats each variable afterwards, such as by using signed or unsigned instructions.

i++;
j++;

The corresponding assembly is:

; i++
ldrb	w0, [sp, #15] ; w0 = -1 = 255 (treated as unsigned)
add	w0, w0, #0x1      ; let's ignore the fact that it'll overflow
strb	w0, [sp, #15]

;j++
ldrsb	w0, [sp, #14] ; w0 = -1 (treated as signed)
and	w0, w0, #0xff
add	w0, w0, #0x1
and	w0, w0, #0xff
strb	w0, [sp, #14]

As we can see, the variable signed j utilizes ldrsb instead of ldrb to load a signed byte and generates significatly more instructions than incrementing the unsigned i.

Let’s focus our attention to ldrsb which is loading the value (a byte) pointed by sp - 14 which corresponds to the value of j. 0xFF is 255 if we treat it as unsigned but we must be able to distinguish between the number 255 and -1. Recall that w0 is a 32 bit register but we are only loading a single byte which is 8 bits long. This is where the sign extend comes into the story.

After ldrb loads the byte 0xFF, w0 will look like the following:

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	1	1	1	1	1	1	1
0				0				0				0				0				0				F				F

Notice how bits 8-31 are set to 0. This is called zero extension, whereby the byte value is extended with 0s to obtain a 32-bit word. Meanwhile, ldrsb loads the byte and then sign-extends it to 32 bits. Since the sign bit (bit 7) of 0xFF is 1, the remaining upper bits 8-31 are all set to 1:

31	30	29	28	27	26	25	24	23	22	21	20	19	18	17	16	15	14	13	12	11	10	9	8	7	6	5	4	3	2	1	0
1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1
F				F				F				F				F				F				F				F
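The same zero extension and sign extension can be reproduced in C by widening a byte through an unsigned or a signed 8-bit type. This is only a sketch: the helper names zero_extend and sign_extend are mine, and the cast to int8_t assumes the usual two's complement wrap-around that GCC and Clang provide (the conversion is implementation-defined before C23).

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only. */

/* Widen a byte the way ldrb does: the upper 24 bits become 0. */
uint32_t zero_extend(uint8_t byte) {
    return (uint32_t)byte;
}

/* Widen a byte the way ldrsb does: the upper 24 bits copy bit 7.
   Assumes the cast to int8_t wraps (true for GCC and Clang). */
uint32_t sign_extend(uint8_t byte) {
    return (uint32_t)(int32_t)(int8_t)byte;
}
```

Calling both helpers with 0xFF gives 0x000000FF and 0xFFFFFFFF respectively, matching the two bit tables above. A byte whose bit 7 is clear, such as 0x7F, comes out the same from both.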

After loading the byte (as signed) into w0, there are two extra instructions that differ between incrementing the signed and the unsigned char:

and	w0, w0, #0xff
add	w0, w0, #0x1
and	w0, w0, #0xff

Why these and instructions are necessary is still not clear to me. They “truncate” w0 such that all bits above the lowest 8 are set to 0 (if I understood this correctly). I know we are only interested in adding to a single byte, but I was under the impression that truncation wouldn’t be necessary since strb only stores the lowest byte back to memory. Of course, I expect these and instructions to disappear once optimization is enabled. As this is a simple program, I do not think it’s worth the effort to look into this in further detail.

Unsigned Char in Other Architectures

ARM is not the only architecture that treats char as unsigned. trofi also wrote a nice overview of the signedness of char on other architectures after encountering a bug in SQLite whereby SQLite would hang (i.e. be stuck in an infinite loop) on the PowerPC architecture. After looking at various architectures, he concluded that ARM, PowerPC, and s390 have unsigned char.

Signedness based on OS

The size and range of each data type is not solely based on the architecture, as different OSes can impose their own conventions as well. On the same architecture, the size of long differs between 64-bit Linux and 64-bit Windows (i.e. LP64 v.s. LLP64).

So amongst common 64 bit OSes, there are two different implementations of the sizes of int, long and long long. UNIX-based systems tend to use length of 4/8/8 (in bytes, as returned by sizeof()), whereas Windows uses 4/4/8. In a different terminology, 4/8/8 is called LP64 (long and pointers 64 bit) and 4/4/8 is LLP64 (long long and pointers 64 bit).

Portable C and long

The differences between the size of `long int` on Linux and Windows
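To check which data model a given compiler and OS combination uses, one can simply look at what sizeof reports. Below is a minimal sketch. The function name data_model is hypothetical, and it only recognises the models mentioned here plus the common 32-bit ILP32.

```c
#include <stddef.h>

/* Hypothetical helper: names the data model based on the sizes the
   compiler reports for int, long, long long and pointers (in bytes). */
const char *data_model(void) {
    if (sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8) {
        return "LP64";   /* 4/8/8: typical for 64-bit Linux and other UNIX-like systems */
    }
    if (sizeof(int) == 4 && sizeof(long) == 4 &&
        sizeof(long long) == 8 && sizeof(void *) == 8) {
        return "LLP64";  /* 4/4/8: 64-bit Windows */
    }
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4) {
        return "ILP32";  /* common 32-bit targets */
    }
    return "other";
}
```

Printing the returned string from main on each target is enough to see the difference. When the width matters, the fixed-width types from stdint.h (int32_t, int64_t) avoid the question entirely.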

I do not have Windows running on an ARM processor to check the signedness of char there, but as for macOS, I did manage to ask a random stranger to confirm the signedness. Interestingly, macOS running on Apple’s ARM chips such as the M3 treats char as signed. In QNX on ARM, char is unsigned as I expected, so it’s just macOS being weird. I wonder if there is a technical or historical reason for this. Perhaps it was meant to ease porting x86 code to ARM by hiding portability differences between the two architectures, but that’s just speculation on my part.

Conclusion

Therefore, to make your code portable, explicitly state whether a char is signed or unsigned instead of making assumptions whenever its value may lie outside of 0 to 127. The C standard guarantees that the size of char is 1 byte but leaves its signedness up to the implementation.
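A minimal sketch of what this looks like in practice: CHAR_MIN from limits.h tells you how plain char behaves on the current target, while signed char and unsigned char behave the same everywhere. The function names below are mine, for illustration only.

```c
#include <limits.h>

/* Returns 1 when plain char is signed on this target, 0 when it is unsigned. */
int plain_char_is_signed(void) {
    return CHAR_MIN < 0;
}

/* With an explicit signed char, -1 stays -1 on every architecture. */
int explicit_signed_value(void) {
    signed char c = -1;
    return c;
}

/* With an explicit unsigned char, 255 stays 255 on every architecture. */
int explicit_unsigned_value(void) {
    unsigned char c = 255;
    return c;
}
```

On my x86-64 machine I would expect plain_char_is_signed() to return 1, and 0 on the Raspberry Pi. GCC also accepts -fsigned-char and -funsigned-char to force one behaviour for plain char, though spelling out the type in the code keeps it independent of compiler flags.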


Utilizing Aliases and Interactive Mode to Force Users to Think Twice Before Deleting Files

2024-12-29T00:00:00-05:00

I previously mentioned in my microblog that I lost a file by accidentally overwriting it with the cp command. This got me thinking about why this would be impossible on my work laptop: there, I am constantly bombarded with a prompt to confirm my intention to overwrite a file.

$ cp 2024-12-01-template.md 2024-12-30-alias-interactive.md
cp: overwrite '2024-12-30-alias-interactive.md'?

Commands like mv and cp have an interactive flag -i to prompt before overwriting a file, as seen in man 1 cp:

-i, --interactive
              prompt before overwrite (overrides a previous -n option)

To force everyone at work to have this flag enabled, they made cp and mv aliases in our default shell configs:

alias cp="cp -i"
alias mv="mv -i"

Which you can also verify using the type command:

$ type cp
cp is aliased to `cp -i'
$ type mv
mv is aliased to `mv -i'

Stack Overflow: The Case of a Small Stack

2024-12-29T00:00:00-05:00

Source:

int main() {
    char buf[1024*1024*1024];
}

Result:

# ./prog-arm64 

Process 630803 (prog-arm64) terminated SIGSEGV code=1 fltno=11 ip=00000025333267f0 mapaddr=00000000000007f0 ref=000000443dd5dc50
Memory fault (core dumped) 

This bewildered all of the interns as it made absolutely no sense. Through our investigation, there were two things we noticed:

The program worked on our local machines but not on our target virtual machine
We were allocating an extremely large buffer in the stack which was unusual

int main() {
	char buf[1024*1024];
}

Result:

# ./prog-arm64 

Process 696339 (prog-arm64) terminated SIGSEGV code=1 fltno=11 ip=0000004de7e7a7ec mapaddr=00000000000007ec ref=000000383b19fbe0
Memory fault (core dumped) 

One thing I purposely omitted was that our target was running QNX, a realtime operating system. If we were to take a look at the documentation:

A process’s main thread starts with an automatically allocated 512 KB stack – QNX SDP 8.0 - Stack Allocation

This shocked all of us since 1 MiB is not a large buffer in 2021, when we had plenty of memory on our own personal systems at home.
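One way to sidestep a small default stack (not necessarily what we ended up doing at the time) is to move the large buffer off the stack and onto the heap. A minimal sketch, where use_heap_buffer is a hypothetical name and BUF_SIZE mirrors the 1 MiB buffer from the example:

```c
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE (1024 * 1024)  /* 1 MiB, same size as the buffer that crashed */

/* Allocates the buffer with malloc instead of placing it on the stack.
   Returns 1 if the whole buffer could be written, 0 if allocation failed. */
int use_heap_buffer(void) {
    char *buf = malloc(BUF_SIZE);
    if (buf == NULL) {
        return 0;
    }
    memset(buf, 'x', BUF_SIZE);           /* touch every byte of the buffer */
    int ok = (buf[BUF_SIZE - 1] == 'x');
    free(buf);
    return ok;
}
```

The heap is not bound by the 512 KB main thread stack, so the same 1 MiB allocation no longer depends on how the target sizes its stacks.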

Note 1: The target used in the example was aarch64le. The crash can also be reproduced on amd64 (x86_64), but it requires adding something such as a print statement

Investigating why AMD64 (x86_64) seems unaffected

Note: Everything below is nothing shocking nor interesting. I just felt like keeping it there.

The behavior on AMD64 (x86_64), as noted, requires more fiddling to trigger a crash, which came as a surprise to me. From my understanding of the documentation, the stack size should still be 512 KB. Suspecting there could be some optimization going on, I fiddled around with the compiler settings and added some code to see if I could trigger the crash. It turns out that if I make a call to printf, the program will indeed crash as desired.

Source Code:

#include <stdio.h>
int main() {
  char buf[1024*1024];
  printf("Hello World\n");
}

Result:

# ./prog-amd64  

Process 2977812 (prog-amd64) terminated SIGSEGV code=1 fltno=11 ip=0000002c51b107f6 mapaddr=00000000000007f6 ref=0000003f4ece4b58
Memory fault (core dumped) 

To test my hypothesis that there was optimization under the hood, I generated the assembly (i.e. pass -S to qcc):

main:
.LFB0:
        .file 1 "prog.c"
        .loc 1 2 12
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        subq    $1048592, %rsp

With much disappointment, my hypothesis was incorrect. We can see that the stack pointer does indeed move by at least 1 MiB (1024 x 1024 = 1048576). As this assembly file is only an intermediate output (we still need to run the assembler and linker to make the program executable), I then proceeded to run the program in the debugger in hopes that I could save my hypothesis (spoiler: my initial hypothesis is false).

(gdb) disassemble
Dump of assembler code for function main:
   0x0000000008048791 <+0>:     push   %rbp
   0x0000000008048792 <+1>:     mov    %rsp,%rbp
   0x0000000008048795 <+4>:     sub    $0x100010,%rsp
   0x000000000804879c <+11>:    mov    0x182d(%rip),%rax        # 0x8049fd0
   0x00000000080487a3 <+18>:    mov    (%rax),%rcx
   0x00000000080487a6 <+21>:    mov    %rcx,-0x8(%rbp)
   0x00000000080487aa <+25>:    xor    %ecx,%ecx
   0x00000000080487ac <+27>:    mov    $0x0,%eax
   0x00000000080487b1 <+32>:    mov    %eax,%edx
   0x00000000080487b3 <+34>:    mov    0x1816(%rip),%rax        # 0x8049fd0
   0x00000000080487ba <+41>:    mov    -0x8(%rbp),%rsi
   0x00000000080487be <+45>:    sub    (%rax),%rsi
   0x00000000080487c1 <+48>:    je     0x80487c8 <main+55>
   0x00000000080487c3 <+50>:    call   0x8048620 <__stack_chk_fail@plt>
=> 0x00000000080487c8 <+55>:    mov    %edx,%eax

As we can see from the assembly above, the stack pointer does move at least by 1MiB so the theory of optimization is definitely ruled out. Going through the program via the debugger using stepi I notice the following:

   0x00000000080487be <+45>:    sub    (%rax),%rsi
   0x00000000080487c1 <+48>:    je     0x80487c8 <main+55>
   0x00000000080487c3 <+50>:    call   0x8048620 <__stack_chk_fail@plt>
=> 0x00000000080487c8 <+55>:    mov    %edx,%eax

The instruction pointer skipped the call to <__stack_chk_fail@plt>, which is part of the stack guard that is added to mitigate stack buffer overflows (whether intentional or not). Essentially, a stack guard inserts a small value known as the canary between the stack variables and the return address. If the return address were overwritten by an overflow, then the canary value would be overwritten too. Checking whether the canary has been overwritten can be done in either of two ways:

canary - original_canary != 0
canary ^ original_canary != 0

If either of the two evaluates to true, then the program will jump to the fail function to terminate the program. In our program, it would seem that the canary, which appears to be held in register rax with the value of 0x8049fd0, was not overwritten. I will now attempt to walk through with you what exactly is going on with my limited knowledge in Assembly (I’m going to use the excuse that I am a Mathematics student to excuse my lack of assembly knowledge :D):

For simplicity, I am going to modify the assembly above to use friendlier notation when referencing addresses and write some pseudocode in C syntax (I’ll be omitting some details so it’s not a one-to-one replication). In the instructions from <+11> to <+21>, we are storing the canary value 8 bytes below the base pointer:

<+11>:    mov    0x182d(%rip),%rax        # 0x8049fd0
<+18>:    mov    (%rax),%rcx
<+21>:    mov    %rcx,-0x8(%rbp)

rax = 0x8049fd0
rcx = rax
*(rbp-8) = rcx

This value is then compared with the rax register, which is again loaded with the original canary value at instruction address <+34>. The generated assembly code uses the first method to check whether the canary value has been overwritten, by subtracting the two canary values:

<+34>:    mov    0x1816(%rip),%rax        # 0x8049fd0
<+41>:    mov    -0x8(%rbp),%rsi
<+45>:    sub    (%rax),%rsi

rax = 0x8049fd0;//store the original canary value into rax (this value will ideally be not modified)
rsi = *(rbp-8); //store our canary value to register rsi (this value could be modified if we have a buffer overflow)
result = rsi - rax

As the canary value was not modified, the result is 0. The je at instruction address <+48> will skip the next instruction, the call to __stack_chk_fail@plt (instruction address <+50>).

Note: I did not read into the function __stack_chk_fail@plt, so maybe it does more checks to see if the canary failed, since it has chk in the name

As our program skipped __stack_chk_fail@plt, the program does not crash.
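The canary walkthrough above can be imitated in plain C. This is only a simulation under my own assumptions: the struct frame layout, BUF_LEN and copy_and_check are made-up names, and a real stack guard is emitted by the compiler rather than written by hand. The "frame" is just a byte array with a small buffer followed by the saved canary.

```c
#include <stdint.h>
#include <string.h>

#define BUF_LEN 8  /* hypothetical size of the local buffer */

/* Simulated stack frame: BUF_LEN bytes of buffer followed by the saved canary. */
struct frame {
    unsigned char bytes[BUF_LEN + sizeof(uint64_t)];
};

/* Copies n bytes of input to the start of the frame, then reports whether
   the canary survived (1) or was overwritten (0). */
int copy_and_check(const unsigned char *input, size_t n, uint64_t reference) {
    struct frame f;
    uint64_t stored;

    if (n > sizeof f.bytes) {
        n = sizeof f.bytes;  /* keep the simulation inside the array */
    }
    memcpy(f.bytes + BUF_LEN, &reference, sizeof reference);  /* prologue: save the canary */
    memcpy(f.bytes, input, n);  /* careless copy that may run past BUF_LEN */
    memcpy(&stored, f.bytes + BUF_LEN, sizeof stored);        /* epilogue: reload the canary */

    return (stored - reference) == 0;  /* same idea as the sub + je pair */
}
```

Copying 4 bytes leaves the canary alone and the function returns 1. Copying 12 bytes spills into the canary and it returns 0, which is the point where the real program would call __stack_chk_fail.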

Now let’s take a quick look into why adding a print statement triggers the crash:

=> 0x00000000080487f6 <+37>:    call   0x8048650 <puts@plt>
   0x00000000080487fb <+42>:    mov    $0x0,%eax
   0x0000000008048800 <+47>:    mov    %eax,%edx
   0x0000000008048802 <+49>:    mov    0x17c7(%rip),%rax        # 0x8049fd0
   0x0000000008048809 <+56>:    mov    -0x8(%rbp),%rsi
   0x000000000804880d <+60>:    sub    (%rax),%rsi
   0x0000000008048810 <+63>:    je     0x8048817 <main+70>
   0x0000000008048812 <+65>:    call   0x8048660 <__stack_chk_fail@plt>
   0x0000000008048817 <+70>:    mov    %edx,%eax
   0x0000000008048819 <+72>:    leave
   0x000000000804881a <+73>:    ret
End of assembler dump.
(gdb) stepi

Program received signal SIGSEGV, Segmentation fault.

Immediately we can see that the stack guard is not the reason for the crash but rather the call to puts@plt. Let’s compare the registers of two programs right before the crash is triggered, where the first is from a program with a valid buffer size:

(gdb) i r
...
rbp            0x81ce0             0x81ce0
rsp            0x818d0             0x818d0
...

v.s.

(gdb) info r
...
rbp            0x81ce0             0x81ce0
rsp            0xfffffffffff81cd0  0xfffffffffff81cd0
...

Only the stack pointer rsp differs, which is to be expected. To understand the crash, we first need to recall the fact that each function call has its own stack frame.

Side Note: Stacks

QNX is ‘Free’ to Use

2024-11-09T00:00:00-05:00

While I have not done any sentiment analysis of this announcement, I think overall this move is a good first step to develop an ecosystem of developers who appreciate and understand the QNX architecture, but it is also met with skepticism. For reference, QNX has messed with the community twice before, which explains the big mistrust from experienced developers. The top comment on Hacker News does a great job summarizing the skepticism. QNX used to have a bigger hobbyist community in the past, where open source projects such as Firefox would have a build for QNX, but that all died when QNX closed its doors to the community. Years later, the QNX source code was made available for the public to read (though probably with restrictions), but the source code availability was later shut down after QNX was acquired by BlackBerry, which does not have the best reputation in the developer community (hence why BlackBerry phones failed to capture the market, from my understanding, despite once being a market leader).

Regardless, I have plans to create a few materials on QNX in the coming months and perhaps create a follow-up to QNX Adaptive Partitioning System, as it has been ranked in the top 5 of Google search results (though I doubt it had many readers given the small population of QNX developers):

Google Search Console from July 9 2023 - Nov 8 2024 which had 308 clicks

Verifying Email Signature Manually

2024-10-12T00:00:00-04:00

I noticed that the neocities community loves using protonmail, and some even share their public key to enable fully encrypted communication. What makes protonmail special is its focus on privacy and security. All emails sent between Proton Mail users are end-to-end encrypted, meaning not even Proton can access the messages. However, when communicating with non-Proton Mail users such as those on Gmail and Outlook, the communication is not end-to-end encrypted by default. This does not mean the encryption utilized by Gmail and Outlook is inadequate. The vast majority of emails are encrypted in transit using TLS, the very same encryption that protects your password when you log into your bank or your credit card number when you buy something online.

Aside: If you are curious about protonmail’s encryption scheme: https://proton.me/support/proton-mail-encryption-explained

What is the Purpose of a Digital Signature

Depending on your sense of security, TLS encryption may not be sufficient. There are a few issues with just relying on TLS encryption:

Loss of Privacy: Companies like Google and Microsoft have access to your data. Depending on their policies, your emails could be used for training purposes, released to government authorities, or be leaked due to a security breach
Potential For Data to be Compromised: Even if you trust the company to respect your privacy, it does not mean the company has good security practices, and it could be attacked by a state-sponsored actor. With data potentially not being encrypted at rest, or not being encrypted properly, your data could be leaked to malicious actors

Since communication outside of ProtonMail is not end-to-end encrypted, if one wants to maintain the security level of their communication, both parties need to send emails encrypted with each other’s public key. Therefore, it is not uncommon to see people on the internet share their public key for others to communicate with them.

Personally, I am fine with using Gmail and Outlook for all my email communication but nonetheless, I thought it would be interesting to see how one would manually verify the signature of an email. One other use case of public key cryptography is signing. Encryption refers to obfuscating the original message to ensure confidentiality (to the best of one’s knowledge). Digital signing does not ensure confidentiality but authenticity. In other words, digital signing is a process to verify that the email has not been tampered with and comes from the person whom they claim to be. With man in the middle attacks, it is possible for an attacker to intercept and modify the original message. Here are some purposes (and potential uses) of digital signatures:

Authenticity: A verification that you are indeed talking to the person whom you think you are talking to
- this assumes that the private key of the other party is kept secret and secured and you are given the public key somehow in a secured and trusted way
Integrity: The ability to detect if the message has been tampered with (similar to a tamper tape/seal on very sensitive envelopes or products)
Attestation: I really should be careful about what I mean by “attestation”. I am referring to the sender attesting, for legal purposes, that they are indeed the one communicating. Similar to how we sign documents to attest to their accuracy and to our agreement with the terms outlined in a contract, digital signatures can also be used for similar purposes. A better word for this process is notarization.

While authenticity and “attestation” (by my definition) sound similar, there is a key difference between the two. Authenticity is for the receiver to verify they are indeed talking to the person they believe to be in contact with. “Attestation” is a way to legally bind the user to a contract. Therefore, if a digital signature is ever used for the purpose of entering a contract, one should ensure they use separate keys for signing and encryption. When you communicate with others using public key encryption, you are obviously not signing every message as if it was a legal contract. This is something I probably need to remember myself as I delve more into security.

One interesting aspect of digital signatures is protecting software from supply chain attacks. If you ever download software from a big open source project like Fedora, they will often provide you with either a hash or a signature. A hash can be used to verify that the file has not been tampered with. However, this does not provide authenticity. Authenticity can only be obtained through the usage of a digital signature. If an attacker manages to infiltrate a server, they could potentially replace the file and its associated hash with their own malicious file and its hash. The client will not be able to protect themselves from this supply chain attack, as both the file and the hash posted on the project’s website have been compromised. With a digital signature, one can verify the authenticity of the file and have assurance that the file has not been tampered with. However, this does require one to already have the public key beforehand, as the attacker could have also compromised the server that shares the project’s public key.

Commits can also be signed in Git. GitHub has a feature to mark a commit or tag as verified if it was signed and GitHub was able to verify the signature.

I mentioned that digital signatures can provide authenticity, but this is not entirely true. This is true if you have obtained the public key from a trusted source such as from the entity you are communicating with. This is where digital certificates can help.

Anyhow, that was enough rambling, time to go into the details of how to verify email signatures.

How to Verify a Digital Signature

Digital signatures work by having the sender (Alice) sign the message with their private key. With this, the receiver (Bob) can use the sender’s (Alice’s) public key to verify the message. From my understanding, the signature is often appended to the email message so that the receiver can easily obtain the signature when they receive the email. This could differ when using digital signatures for different purposes, such as downloading software from the publisher’s site. Wikipedia has a good diagram to visualize this process:

A diagram illustrating how the process of signing and verifying a digital signature works. Extracted from Wikipedia

I will not go into how to sign an email as my focus is on how to verify an email signature. More specifically, I will be using ProtonMail to automatically sign my email and send the email to my Gmail account.

Step 1: Obtain the Public Key

There are a few methods to obtain a public key, such as from the organization’s website or as an attachment to the email. This is likely the most vulnerable step in the entire process, as an attacker could upload their own public key to a vulnerable website or masquerade as the person you expect to be communicating with, such as by using an email address that closely resembles a trusted identity or is spoofed to appear legitimate, as seen with Outlook in 2021. Protonmail offers an option to send a public key to those outside of the Protonmail ecosystem automatically. While this method isn’t flawed (I ain’t a cybersecurity expert) per se, it does make me think twice about the validity of the public key that has been sent to me, as using a compromised key could make this entire verification process go wrong. However, to initiate communication that is encrypted end to end, this is a necessary step. While I do not have a clear picture of certificates, they probably could alleviate this issue by having a trusted third party, called the certificate authority, verify the identity of the sender.

Left: Alice sending an email to Bob with her publickey and signature. Right: Instructions to verify Fedora ISO

Step 2: Import (Alice’s/Sender’s) Public Key

Once Alice’s (i.e. the sender) public key has been obtained, the key needs to be imported to the public keyring. I do not understand why the keys always have to be imported rather than just being specified to be honest. Perhaps it’s because I am using the public key as an armoured ASCII asc rather than the GNU Privacy Guard gpg public keyring file. Though I am not going to bother verifying this.

To import a public key: gpg --import

$ gpg --import publickey-alice@proton.me.asc 
gpg: key : public key "alice@proton.me " imported
gpg: Total number processed: 1
gpg:               imported: 1 

We can verify the import with: gpg --list-public-keys

$ gpg --list-public-keys alice@proton.me
pub   ed25519 2024-01-10 [SC]
      
uid           [ unknown] alice@proton.me 
sub    2024-01-10 [E]

Step 3: Download the Email Message

This step does vary depending on your email client but on Gmail, one can simply download the email by clicking on the kebab menu (the three dots or ellipses) found on the right side of the email as shown below:

This will download the email in the electronic mail format .eml which is not the signed email. .eml files have a lot of extra information that is packaged over the signed email. We will need to extract the content that has been signed to verify the message.

Step 3: Extract the Content Containing the Signed Email

The content of the email that needs to be extracted is the data that has been signed with Alice’s private key to create the signature. The file will look something like the following:

An edited email file that does not contain attachments aside from the signature

Step 4: Extract Signed Message

As mentioned in the previous step, we need to remove all the extra data in the email file that isn’t part of the signed message. You should make a backup of the email file because this is easy to mess up if you do not know what you are doing, like I did:

$ cp 'GPG KEY no publickey attachments.eml' 'GPG KEY no publickey attachments.eml.bak'
$ ls 'GPG KEY no publickey attachments.eml'*
'GPG KEY no publickey attachments.eml'  'GPG KEY no publickey attachments.eml.bak'

The content of the message starts after you see the following header (the hash will differ):

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--------AAAAAAAAAAAA

where --------AAAAAAAAAAAA is our boundary, as clearly denoted earlier in the file.

This means the very first line of the signed file is:

Content-Type: multipart/mixed;boundary=---------------------BBBBBBBBBBBB

The contents of the signed message are enclosed within the boundary (not including the boundary itself), as shown below (remove trailing newlines):

The contents of the signed message

One thing I noticed is that the hash on the first line of the signed message also appears on its last line. For instance, in our example that would be BBBBBBBBBBBB. Therefore, our file should also end with this hash.

For instance, if our message looked along the lines of:

MIME-Version: 1.0
 Content-Type: multipart/signed; protocol="application/pgp-signature"; micalg=pgp-sha512; boundary="------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415"; charset=utf-8

 This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415
 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

 ...

 -----------------------ff35159c3ebf11234dd954191b3141592
 Content-Type: application/pgp-keys; filename="publickey - alice@proton.me - .asc"; name="publickey-alice@proton.me.asc"
 Content-Transfer-Encoding: base64
 Content-Disposition: attachment; filename="publickey-alice@proton.me.asc"; name="publickey - alice@proton.me - .asc"

 ABCDEF0x4ZjZkeGxSL0xUABCDEFmltotlUR0ABCDEFWaABCDEFE9PQP9ABCDEFAABCDEFtLUVORCBABCED
 ABCDEFEABCDEFFWSBCTE9DSy0tLABCDE==
 -----------------------ff35159c3ebf11234dd954191b3141592--

 --------3141887d7abcdefgbe09e18825fd164103abcdefgf8c40b59382649cd69b31415

Then the signed message would be:

 Content-Type: multipart/mixed;boundary=---------------------ff35159c3ebf11234dd954191b3141592

 ...

 -----------------------ff35159c3ebf11234dd954191b3141592

 ...

 -----------------------ff35159c3ebf11234dd954191b3141592
 Content-Type: application/pgp-keys; filename="publickey - alice@proton.me - .asc"; name="publickey-alice@proton.me.asc"
 Content-Transfer-Encoding: base64
 Content-Disposition: attachment; filename="publickey-alice@proton.me.asc"; name="publickey - alice@proton.me - .asc"

 ABCDEF0x4ZjZkeGxSL0xUABCDEFmltotlUR0ABCDEFWaABCDEFE9PQP9ABCDEFAABCDEFtLUVORCBABCED
 ABCDEFEABCDEFFWSBCTE9DSy0tLABCDE==
 -----------------------ff35159c3ebf11234dd954191b3141592--
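Doing this extraction by hand is error prone, so here is a rough sketch of how it could be automated. This is my own illustration and not something gpg or any mail library provides: first_mime_part is a hypothetical helper that takes the whole .eml text and the outer boundary string (as written in the boundary= parameter, without the two leading dashes that MIME adds) and returns the part between the first two boundary lines, without the line break that precedes the second one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of GnuPG or any mail library.
   Returns a pointer to the first MIME part delimited by "--<boundary>" lines
   and stores its length in *len. Returns NULL if the delimiters are missing. */
const char *first_mime_part(const char *eml, const char *boundary, size_t *len) {
    char delim[256];
    size_t blen = strlen(boundary);

    if (blen + 4 > sizeof delim) {
        return NULL;
    }
    /* A delimiter line is a line break followed by "--" and the boundary. */
    memcpy(delim, "\n--", 3);
    memcpy(delim + 3, boundary, blen + 1);

    const char *open = strstr(eml, delim);
    if (open == NULL) {
        return NULL;
    }
    const char *start = strchr(open + 1, '\n');  /* end of the delimiter line */
    if (start == NULL) {
        return NULL;
    }
    start++;  /* the part begins on the next line */

    const char *close = strstr(start - 1, delim);
    if (close == NULL) {
        return NULL;
    }
    /* close points at the line break before the next delimiter,
       so that trailing newline is already excluded. */
    const char *end = close < start ? start : close;
    if (end > start && end[-1] == '\r') {
        end--;  /* also drop the CR of a CRLF line ending */
    }
    *len = (size_t)(end - start);
    return start;
}
```

Writing the returned bytes to message.txt should give the same file as the manual steps. I have only reasoned about simple messages like the example here, so treat it as a starting point rather than a complete MIME parser.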

Step 5: Verify the Email Signature

Verify the signature: gpg --verify signature.asc message.txt

$ gpg --verify signature.asc message.txt 
 gpg: Signature made Mon 07 Oct 2024 11:29:48 PM EDT
 gpg:                using EDDSA key 
 gpg: Good signature from "alice@proton.me " [unknown]
 gpg: WARNING: This key is not certified with a trusted signature!
 gpg:          There is no indication that the signature belongs to the owner.
 Primary key fingerprint: 

While the signature has been verified: Good signature, we do see a warning about the key not being certified.

(Optional) Step 6: Validate Imported Public Key

Upon reading the GnuPG manual, I found instructions to verify the imported public key by checking whether the key’s fingerprint matches the one you are expecting from Alice (the sender). This does involve Alice letting Bob know her key’s fingerprint somehow, whether that be by email, text, voice call, or on paper delivered to Bob. Let’s pretend the fingerprint of Alice’s public key, transmitted to you through a trusted source, is:

768B 218A CCD7 AA34 9830  52D8 9BD4 1A08 9D98 BC02

We can verify whether the public key really came from Alice by checking the public key’s fingerprint and seeing if it matches:

$ gpg --edit-key alice@proton.me
...
gpg> fpr
pub   ed25519/[redacted] 2024-01-10 alice@proton.me 
 Primary key fingerprint: 768B 218A CCD7 AA34 9830  52D8 9BD4 1A08 9D98 BC02

To validate Alice’s public key (proceed with caution), we must sign the key with our own private key:

gpg> sign

pub  ed25519/[redacted]
     created: 2024-01-10  expires: never       usage: SC  
     trust: unknown       validity: unknown
 Primary key fingerprint: 768B 218A CCD7 AA34 9830  52D8 9BD4 1A08 9D98 BC02

     alice@proton.me 

Are you sure that you want to sign this key with your
key "Bob " ([redacted])

Really sign? (y/N) yes

gpg> quit
Save changes? (y/N) y

However, this does not suffice to change the validity. On serverfault, Baker does a good job explaining that TRUST != VALIDITY. I am guessing that, due to differences in the default settings of gpg, I need to set my trust level to 5 (ultimate) to remove this warning:

gpg> trust
pub  ed25519/[redacted]
     created: 2024-01-10  expires: never       usage: SC  
     trust: unknown       validity: unknown
sub  cv25519/[redacted]
     created: 2024-01-10 expires: never       usage: E   
[ unknown] (1). alice@proton.me 

Please decide how far you trust this user to correctly verify other users' keys
(by looking at passports, checking fingerprints from different sources, etc.)

  1 = I don't know or won't say
  2 = I do NOT trust
  3 = I trust marginally
  4 = I trust fully
  5 = I trust ultimately
  m = back to the main menu

Your decision? 5

...

Please note that the shown key validity is not necessarily correct
unless you restart the program.

Now if we take a look at the verification, we no longer see the warnings.

$ gpg --verify signature.asc message.txt
gpg: Signature made Mon 07 Oct 2024 11:29:48 PM EDT
gpg:                using EDDSA key 
gpg: Good signature from "alice@proton.me " [ultimate]

Conclusion

In practice, no one verifies the digital signatures of emails manually. Any sane individual will use an email client that automates the verification process for them. As most individuals are not aware of digital signing and email encryption, I’ll probably not set up my email client for work, school, and personal email to automatically verify, sign, and encrypt emails unless I am required to. This does mean I am exposing myself to the spying eyes of my email providers, remaining susceptible to man-in-the-middle attacks, and potentially having my personal information leaked.

To summarize the steps:

Import the keys: gpg --import
Extract the signed message (this includes any attachments that are not the signature itself)
Verify the email: gpg --verify signature.asc message.txt

A Quick Look Into Half-Width and Full-Width Characters

2024-10-07T00:00:00-04:00

A friend of mine has been asking me a few questions about encoding for a paper he is working on. While I don’t understand what his research is on, what I can gather is that he is analyzing Japanese texts and that it involves understanding character encodings. Character encoding is not a topic that most native-English programmers are familiar with. The most that the average programmer will know is the existence of ASCII and UTF-8 encoding. If we are using anything beyond the English alphabet and Arabic numerals (i.e. 1, 2, 3, 4, 5, 6, …) then we can utilize UTF-8, else use ASCII.

I am sure most of us have encountered random garbage characters such as � or □ (U+25A1) when trying to read documents that mix English and some foreign language, or seen random garbage on media displays such as infotainment displays when trying to listen to music from Asia.

Chinese characters not displaying correctly. Extracted from Developing Linguistic Corpora: a Guide to Good Practice

I was not aware of the existence of full-width and half-width characters until the friend asked me to briefly explain the differences between the two from a technical aspect. For those like me who weren’t aware that Japanese mixes zenkaku (full-width) and hankaku (half-width) characters, look at the image below or visit the following webpage for more explanation: https://mailmate.jp/blog/half-width-full-width-hankaku-zenkaku-explained

As you can see, half-width characters unsurprisingly takes up less space visually than the full-width characters.

Full and Half Width encoded on UTF-8 as seen through Vim

There is also an implication for the amount of data half-width and full-width characters consume (though this does depend on the encoding). For Western audiences, we know that an ASCII character takes up 1 byte and can be represented as a char in C.

Extending ASCII

One interesting fact about ASCII is that it only maps 128 characters (of which only 95 are printable). Recall that an ASCII character is stored in 1 byte, which is made up of 8 bits. Doing the math, 8 bits can represent 2^8 = 256 values, while ASCII only needs 7 bits (2^7 = 128). This leaves the remaining 128 values unmapped.
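The arithmetic above can be checked with a few lines of Python. This is only a sketch of the counting argument; the variable names are my own and it relies on Python's `str.isprintable` to classify characters.

```python
# ASCII defines code points 0..127, so it needs only 7 of the 8 bits in a byte.
ascii_code_points = 2 ** 7          # 128 values defined by ASCII
byte_values = 2 ** 8                # 256 values a single byte can hold
unused = byte_values - ascii_code_points

# Printable ASCII runs from 0x20 (space) to 0x7E (~); the rest are control codes.
printable = [chr(c) for c in range(ascii_code_points) if chr(c).isprintable()]

print(ascii_code_points, byte_values, unused, len(printable))  # 128 256 128 95
```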

ASCII Table. Extracted from Wikipedia

This allows other languages and programmers to take advantage of the unused half by extending ASCII with extra characters, such as the accented letters é, è, ç, à of European languages in the ISO 8859 family (e.g. Latin-8), or Katakana characters in JIS C 6220 (JIS X 0201) in 1969. JIS C 6220 does change a few characters, so it is not exactly an extension of ASCII. Ignoring those few differences, we can see that the Katakana characters are mapped in the upper half, from 0xA1 to 0xDF.
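To see the 0xA1 to 0xDF range without the table, here is a small Python sketch. One assumption to flag: I am using Python's `shift_jis` codec as a stand-in for JIS X 0201, since its single-byte range 0xA1..0xDF is the JIS X 0201 katakana block. Unicode places these characters in the half-width block starting at U+FF61.

```python
# Assumption: Python's shift_jis codec stands in for JIS X 0201 here, since its
# single-byte range 0xA1..0xDF is the JIS X 0201 katakana block.
katakana = bytes(range(0xA1, 0xE0)).decode("shift_jis")

print(len(katakana))                                   # 63 characters, one per byte
print(katakana)
print(hex(ord(katakana[0])), hex(ord(katakana[-1])))   # 0xff61 0xff9f
```

Note that the first few positions (0xA1 to 0xA5) hold half-width punctuation rather than Katakana letters.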

JIS C 6220 which is also known as JIS X 0201. Extracted from Wikipedia

ISO 8859 encodings such as Latin-8, on the other hand, seem to be direct extensions of ASCII, where 0xA1 - 0xFF contain characters from several European languages such as French, Finnish and the Celtic languages.
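A quick way to convince yourself that Latin-8 leaves ASCII untouched is to decode the lower 128 byte values with both codecs and compare. This is a sketch using Python's built-in `iso8859_14` codec; the variable names are mine.

```python
ascii_bytes = bytes(range(0x80))
# The lower half decodes identically under ASCII and Latin-8.
same_lower_half = ascii_bytes.decode("ascii") == ascii_bytes.decode("iso8859_14")
print(same_lower_half)

# Accented letters live in the upper half and still fit in one byte.
for ch in "éèçà":
    print(ch, ch.encode("iso8859_14").hex())
```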

ISO 8859-14 (Latin-8) Encoding. Extracted from Wikipedia

Aside: UTF-8 vs. UTF-16

Based on the article I shared, half-width characters take up 1 byte while full-width characters take up 2 bytes (and can therefore be called double-byte characters). I believe this depends on the encoding used. Looking at the file sizes, we can see that the number 1 in UTF-8 takes 1 byte as a half-width character and 3 bytes as a full-width character:

$ stat -c "%n,%s" -- halfwidth-utf8.txt fullwidth-utf8.txt 
halfwidth-utf8.txt,1
fullwidth-utf8.txt,3

One confusion I had was understanding the difference between UTF-8 and UTF-16, and the following exercise helped me understand it:

UTF-8 encodes each character in 1 to 4 bytes
UTF-16 encodes each character in either 2 or 4 bytes

$ stat -c "%n,%s" -- halfwidth-1.txt chinese-1.md 
halfwidth-1.txt,1
chinese-1.md,3

In UTF-8, 1 takes up 1 byte, which is unsurprising as ASCII characters have a great advantage in UTF-8 compared to characters from Asian languages such as Chinese, where the character for 1, 一, consumes 3 bytes.

Let’s inspect the contents of the files for the half-width character 1 and the full-width character １ in hex:
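The same sizes can be reproduced without creating any files. The sketch below is my own illustration: the `samples` dictionary and `encoded_sizes` helper are made-up names, and the emoji is an extra example I added to show the 4-byte case in both encodings.

```python
samples = {
    "half-width 1": "1",           # U+0031
    "full-width 1": "\uff11",      # U+FF11
    "Chinese 1": "\u4e00",         # U+4E00
    "emoji": "\U0001F600",         # outside the Basic Multilingual Plane
}

def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

The half-width 1 comes out as (1, 2), the full-width and Chinese characters as (3, 2), and the emoji as (4, 4), which matches the 1-4 and 2-or-4 byte ranges listed earlier.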

$ cat halfwidth-1.txt; echo ""; xxd halfwidth-1.txt; cat fullwidth-1.txt ; echo ""; xxd fullwidth-1.txt 
1
00000000: 31                                       1
１
00000000: efbc 91                                  ...

$ cat halfwidth-utf16.txt; echo ; xxd halfwidth-utf16.txt; cat fullwidth-utf16.txt; echo; xxd fullwidth-utf16.txt 
1
00000000: 0031                                     .1
�
00000000: ff11                                     ..

Note: To view a UTF-16 file in Vim, press Esc to leave the current mode, then enter the command :e ++enc=utf-16be fullwidth-utf16.txt
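To see where the bytes ef bc 91 come from, here is a sketch of the 3-byte UTF-8 layout applied to U+FF11 by hand. The helper `utf8_three_bytes` is a hypothetical name of mine, it only handles the 3-byte range, and it ignores the surrogate range that a real encoder would reject.

```python
def utf8_three_bytes(code_point):
    """Encode a code point in U+0800..U+FFFF using the 3-byte UTF-8 layout."""
    assert 0x0800 <= code_point <= 0xFFFF
    return bytes([
        0xE0 | (code_point >> 12),           # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),   # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),          # 10xxxxxx: low 6 bits
    ])

full_width_one = "\uff11"
manual = utf8_three_bytes(ord(full_width_one))
print(manual.hex())                                  # efbc91
print(full_width_one.encode("utf-16-be").hex())      # ff11
```

UTF-16 (big endian) simply stores the code point U+FF11 as the two bytes ff 11, while UTF-8 spreads the same 16 bits over three bytes with marker bits in front.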

Half-Width and Full-Width in Japanese Specific Encodings

I mentioned JIS C 6220 (JIS X 0201) earlier, which used the fact that ASCII leaves the upper 128 values of a byte unused, allowing the Japanese to add Katakana support. It is not a direct extension though, as the Japanese did change the lower 128 characters slightly to localize them, such as replacing \ with the Japanese Yen sign ¥. Full-width Japanese characters apparently started to appear in 1978 with JIS C 6226, where Kanji can be displayed.

A more recent standard is Shift-JIS, standardized in 1997, which is apparently currently the second most used encoding among .jp (Japanese) websites. Based on a survey from October 7, 2024, Shift-JIS is still used by 4.8% of .jp websites and EUC-JP by 2.3%, with the remainder going to UTF-8. As mentioned previously, it would seem that for Japanese encodings such as Shift-JIS, half-width characters not only have a smaller width but also require half the number of bytes. Half-width characters do not take fewer bytes in general, but in Shift-JIS that would seem to be the case:

Hex Representation of ア and ｱ. Credits to charset.7jp.net

As you may notice, I am using the same example as the article, but I opted to generate my own image. The blog for some reason added 0x0D0A, which corresponds to CRLF (i.e. \r\n), making it less obvious to readers that the full-width character takes 2 bytes and the half-width character only takes 1 byte. I don’t know Japanese, but according to the article both characters have the same phonetic sound, and I am pretty sure the two are the same in written (i.e. handwritten) language. The likely reason for this behavior is the fact that Shift-JIS is an extension of JIS X 0201:1997, the very same encoding that first introduced Katakana (though the edition differs), and encodes the double-byte characters from JIS X 0208:1997, according to Wikipedia.

Note: a 1-byte character can also be referred to as a single-byte character, while a 2-byte character can be referred to as a double-byte character.

Based on the above image, we can make the following observations:

Full-Width characters take 2 bytes in Shift-JIS
Half-Width characters take 1 byte in Shift-JIS
UTF-8 and UTF-16 do not seem very optimized for Japanese characters, taking 3 bytes and 4 bytes respectively (the UTF-16 figure presumably includes a 2-byte byte order mark, as the character itself takes 2 bytes in UTF-16)

Unsurprisingly, Shift-JIS was designed for Japanese and is therefore more space efficient than more international/universal encodings like UTF-8. According to my friend and the article, Japan still requires users to switch between full-width and half-width characters. I have no clue as to why, but I have heard that Asian countries such as Japan and Korea can be slow to modernize their digital infrastructure despite being technology leaders and innovators. The article suggests it is due to bureaucracy and a work culture that does not encourage taking risks and sees no need to fix what isn’t broken.
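These observations can be reproduced with Python's built-in codecs. This is a sketch rather than what the image was generated with. Note that Python's plain `utf-16` codec prepends a 2-byte byte order mark, while `utf-16-be` does not.

```python
full_width = "\u30a2"   # ア, full-width katakana A
half_width = "\uff71"   # ｱ, half-width katakana A

for name in ("shift_jis", "utf-8", "utf-16-be", "utf-16"):
    sizes = [len(ch.encode(name)) for ch in (full_width, half_width)]
    print(name, sizes)

print(full_width.encode("shift_jis").hex(), half_width.encode("shift_jis").hex())
```

Shift-JIS gives 2 bytes and 1 byte, UTF-8 gives 3 bytes for both, and UTF-16 gives 2 bytes for both without a byte order mark or 4 bytes with one.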

The remaining content is not relevant to the title but is a refresher on hex

Review of HEX

Computers work in binary, which consists of only 0 and 1 (i.e. base 2). The decimal system we all use is base 10. Hexadecimal is base 16 and tends to be the favorite way to represent a series of bytes due to its more compact form (or at least that’s what it seems like to me). Hexadecimal digits have 16 values: 0-9 and A-F. In binary, a single bit can represent 2 values, which can be expressed as 2^1. This means that 4 bits can represent 2^4 = 16 values, so a single hexadecimal digit can be represented using only 4 bits. Two hexadecimal digits will therefore take 8 bits = 1 byte. That is why the half-width character ｱ takes up one byte, as it is 0xB1 in Shift-JIS. B1 consists of two hexadecimal digits, hence 8 bits and therefore 1 byte. The full-width character ア is 0x8341, which consists of 4 hexadecimal digits and therefore 4 * 4 bits = 16 bits or 2 bytes.
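The digit counting can be written down as a tiny Python sketch. The helper name `byte_length_from_hex` is made up for illustration and assumes an even number of hex digits.

```python
def byte_length_from_hex(hex_digits):
    """Each hex digit is 4 bits, so two digits make one byte."""
    bits = len(hex_digits) * 4
    return bits // 8

print(byte_length_from_hex("B1"), format(0xB1, "08b"))        # 1 10110001
print(byte_length_from_hex("8341"), format(0x8341, "016b"))   # 2 1000001101000001

# Decoding the same hex strings as Shift-JIS gives back the two characters.
print(bytes.fromhex("B1").decode("shift_jis"), bytes.fromhex("8341").decode("shift_jis"))
```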
