Representing the World to a Computer

It is not an easy task. We get to "know" about the world outside our heads by using five senses. Even though our sense organs are quite different from one another, they send the information they gather to the brain through a common framework called the nervous system. Electrical impulses are the common carriers of that information. However, the information from the different sensory organs is encoded in vastly different ways, and our knowledge of how that encoding works is at most partial.
Nevertheless, the final destination of all the signals is the human brain. Once there, all the information is mysteriously stitched together to form a unified picture, which we consciously consume and turn into "knowledge". So far so good. The entire process works efficiently under the hood.
The problem arises when we try to "publish" that "knowledge" for universal consumption. Well, we use languages - English, French, Spanish and so on. We have a great many of them. For computers we have languages too, like C, Java and Python.
But those languages are meant for giving commands to a computer. They cannot represent information in any meaningful way; they act more as carriers of information.
A concrete example would be appropriate at this point. Let us say we are trying to tell a computer about a car. For a human recipient the message would be this: "Here we have a car, which can be described as a machine that is able to move on roads. For movement, it uses four wheels. The wheels are arranged in pairs, with one pair coming after the other longitudinally - that is, longitudinal to the road. This specific instance of "car" has a wheel dimension of 16 inches radius, 215 mm width and 65% sidewall thickness. A car has to have an engine, wherein petrol or diesel is burnt inside metal cylinders. The sum of the cylinder volumes of this instance is 1299 cc..."
I can go on and on in that fashion; it would not make the computer any wiser. The whole story has to be rewritten in a structured format, and the computer needs to be told about the format ahead of the actual data. A programming language helps us define that structure.
Defining a real-life entity in terms of attribute-value pairs is the most common and basic method. So a computer gets to know about a car through an arrangement like this:
wheels: 4
wheel_radius: 16
wheel_width: 215
wheel_side: 65
engine_fuel: Petrol
engine_type: DOHC
engine_cylinders: 4
engine_bhp: 89
....
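In a language like Python (used here purely as an illustration; the attribute names are made up for the example), this flat arrangement maps naturally onto a dictionary of attribute-value pairs:

# A flat attribute-value description of one particular car.
# Attribute names and values are illustrative, not a real schema.
car = {
    "wheels": 4,
    "wheel_radius": 16,       # inches
    "wheel_width": 215,       # millimetres
    "wheel_side": 65,         # sidewall, as a percentage of the width
    "engine_fuel": "Petrol",
    "engine_type": "DOHC",
    "engine_cylinders": 4,
    "engine_bhp": 89,
}

print(car["engine_fuel"])     # -> Petrol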
One may use certain data structures to enhance the representation. Something like this:
wheels:
    number: 4
    radius: 16
    width: 215
    side: 65
engine:
    fuel: Petrol
    type: DOHC
    cylinders: 4
    bhp: 89
....
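The same idea sketched in Python: nested dictionaries group the wheel and engine attributes under their own keys. The names are, again, illustrative:

# The same car, with related attributes grouped together.
car = {
    "wheels": {
        "number": 4,
        "radius": 16,    # inches
        "width": 215,    # millimetres
        "side": 65,      # sidewall percentage
    },
    "engine": {
        "fuel": "Petrol",
        "type": "DOHC",
        "cylinders": 4,
        "bhp": 89,
    },
}

print(car["engine"]["cylinders"])    # -> 4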
After the data structure and the actual data, one needs to think about encoding. Encoding is a numerical representation of all the symbols available in a language - in other words, the letters of the alphabet and the numerals.
There are various encoding schemes. ASCII is the oldest one:

[ASCII code table]

And Unicode is the latest one:

[Unicode code table]

Please note that the second table uses hexadecimal numbers for the sake of brevity.
While the former defines 128 characters, the latter defines as many as 144,697, covering all the major languages of the world.
One interesting thing to note here is that every character, from a humble "a" to a complicated "ǣ", is represented by a non-negative integer which is, in itself, arbitrary and devoid of any deeper scheme. The number here plays the role of a symbol, and you cannot perform arithmetic operations on a symbol.
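Python exposes this mapping directly: ord() gives the integer (the code point) assigned to a character, and chr() goes the other way. A small sketch, with the hexadecimal form shown for comparison with the Unicode convention mentioned above:

# Characters are nothing more than agreed-upon integers.
print(ord("a"))        # -> 97
print(ord("A"))        # -> 65
print(ord("ǣ"))        # -> 483
print(hex(ord("ǣ")))   # -> 0x1e3, i.e. U+01E3

print(chr(65))         # -> A
# Arithmetic on these integers says nothing about the symbols:
# 97 - 65 is 32, but "a" minus "A" has no meaning.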
To cut a long story short, inside a computer a number may represent a quantity or an entity. The number 65 may represent a person's weight or the letter "A". The associated attribute sets the right context and determines the role of 65.
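A tiny illustration of that context, again in Python with made-up attribute names: the same integer 65 is a quantity under one attribute and a symbol under another:

person = {"weight_kg": 65}        # 65 as a quantity
letter = chr(65)                  # 65 as the symbol "A"

print(person["weight_kg"] + 5)    # -> 70; arithmetic makes sense here
print(letter)                     # -> A; arithmetic would be meaningless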
The most obvious question would be: Why do we have to carry the overhead of encoding and decoding? The simple answer is: Computers cannot comprehend anything other than numbers. It is a limitation of the underlying electronic hardware.
In fact, the "underlying hardware" is even less capable than you might think. The numerical representations of data have to be reduced further to binary numbers; only then can any computationally significant task be performed on them, whether the data sits in primary or secondary memory.
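For instance, assuming Python once more as the illustration language, the 65 above is ultimately handled as a pattern of bits, and text is stored as a sequence of such byte values:

print(format(65, "08b"))           # -> 01000001, the bit pattern behind 65 / "A"
print("A".encode("utf-8"))         # -> b'A', a single byte with value 65
print("ǣ".encode("utf-8"))         # -> b'\xc7\xa3', two bytes
print(list("ǣ".encode("utf-8")))   # -> [199, 163]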
So what does that leave us with? Infinite possibilities of expression, bounded by language - natural as well as programming.

July 20th, 2022