Hash Functions and Information Security
Author
Aramís Concepción Durán
Some things we miraculously find all around us once we have discovered them, even though we never noticed them before. Hash functions are a good example of this. They are an important ingredient in many information technology applications.
What are hash functions?
A hash function takes a data object of any size and generates a fixed-length hash value from it. Every hash function has two essential properties:
1. It always generates the same hash value from the same data object.
2. It always generates different hash values from different data objects.
It follows that we can use each hash value as a unique identifier for the data object from which it was generated. In this respect, the hash value of a data object is similar to a person's fingerprints:
1. We will always find the same fingerprints on the same person.
2. Different people always have different fingerprints.
This is why hash values are often referred to as fingerprints. Let us first look at how hash functions and hash values can be useful to us in information technology.
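The properties described here can be observed directly. The following is a minimal sketch in Python, assuming the standard library's hashlib module and SHA-256 as the hash function; the helper name fingerprint is made up for this illustration.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hash value of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


short_input = b"a"
long_input = b"a" * 1_000_000

# Property 1: the same data object always yields the same hash value.
print(fingerprint(short_input) == fingerprint(b"a"))

# Property 2: different data objects yield different hash values.
print(fingerprint(short_input) != fingerprint(long_input))

# The hash value has the same length no matter how large the input is.
print(len(fingerprint(short_input)), len(fingerprint(long_input)))
```

Both comparisons print True, and both lengths are 64, because SHA-256 always produces 256 bits, written here as 64 hexadecimal characters.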
Application: Detecting corrupted data
Polaroid pictures fade over time, the paper in old books crumbles and the paint on classic oil paintings flakes off. Digital data objects seem unaffected by such decay processes, but appearances are deceptive! It happens again and again that small errors arise unnoticed in digital data files that remain on a data carrier for a long time. How can we detect such errors? Quite simply: by using a hash function and generating hash values from our data objects. When we retrieve a dusty data carrier from the attic years later and want to make sure that the data objects are unchanged, we apply the same hash function to the data objects again and compare the old with the new hash values. If the values match, we can be sure that the data is unchanged. If the hash values do not match, then the data is corrupted.
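The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SHA-256 is chosen as the hash function, and the function names file_hash and is_unchanged are hypothetical.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, recorded_hash: str) -> bool:
    """Compare the file's current hash value with the one recorded earlier."""
    return file_hash(path) == recorded_hash
```

We would call file_hash once when archiving a file, store the result somewhere safe, and call is_unchanged years later with the stored value. The stored hash values must of course not sit on the same decaying data carrier without a second copy, otherwise they can be corrupted too.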
Application: Detecting transmission errors
When data is transmitted over computer networks, it also happens frequently that a 1 that was sent reaches its destination as a 0, or vice versa. The transmission protocols on the Internet therefore use efficient hash functions: after each data transmission, they compare short hash values to check whether an error has occurred and, if necessary, request a retransmission.
Admittedly, these protocol mechanisms only protect against technical transmission errors, not against intentional and targeted manipulation of the data. If we want to protect ourselves against intentional data manipulation, we must use a so-called cryptographic hash function for this purpose.
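How a short hash value reveals a flipped bit can be sketched as follows. The example assumes CRC32 from Python's zlib module as a stand-in; real protocols define their own checksums in their specifications, and the helper names checksum and flip_bit are invented for this illustration.

```python
import zlib


def checksum(payload: bytes) -> str:
    """Return the CRC32 checksum of payload as 8 hexadecimal characters."""
    return format(zlib.crc32(payload), "08x")


def flip_bit(payload: bytes, bit_index: int) -> bytes:
    """Simulate a transmission error by inverting a single bit."""
    damaged = bytearray(payload)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


sent = b"hash functions are everywhere"
received = flip_bit(sent, 3)

# The receiver recomputes the checksum and compares it with the one sent along.
if checksum(received) != checksum(sent):
    print("Transmission error detected, requesting retransmission")
```

A checksum like this catches accidental bit flips, but an attacker who changes the data can simply recompute and replace the checksum as well.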
Cryptographic hash functions
As the second essential property of every hash function, we have stated that it always generates different hash values from different data objects. Unfortunately, this is not entirely true. Because there are more possible data objects than there are hash values of a fixed length, it cannot be avoided in theory that a hash function generates the same hash value for certain pairs of data objects. When this occurs, we speak of a collision. The important thing is that collisions are so unlikely in practice that we can safely ignore the theoretical possibility. For an ordinary hash function, it is sufficient that a collision is unlikely for two randomly chosen data objects, even if the data objects differ only in a few bits.
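To see that collisions become unavoidable once the hash value is short enough, we can deliberately cripple a hash function. The sketch below assumes a made-up tiny_hash that keeps only the first 16 bits (4 hexadecimal characters) of SHA-256; it is not a function anyone uses in practice.

```python
import hashlib


def tiny_hash(data: bytes) -> str:
    """A deliberately weak hash: only the first 16 bits of SHA-256."""
    return hashlib.sha256(data).hexdigest()[:4]


def find_collision() -> tuple[bytes, bytes]:
    """Hash the inputs b"0", b"1", b"2", ... until two share a hash value."""
    seen: dict[str, bytes] = {}
    counter = 0
    while True:
        candidate = str(counter).encode()
        value = tiny_hash(candidate)
        if value in seen:
            return seen[value], candidate
        seen[value] = candidate
        counter += 1


first, second = find_collision()
print(first, second, tiny_hash(first), tiny_hash(second))
```

With only 65,536 possible hash values, the search must succeed after at most 65,537 inputs and usually succeeds much earlier. Every additional bit in the hash value makes the same search harder, which is why real hash functions produce much longer values.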
A cryptographic hash function, on the other hand, makes a stronger claim: a collision must not occur in practice even if a capable organisation, equipped with specialised mathematical knowledge, high-performance computers and a lot of patience, makes a targeted effort to find one. We measure the quality of a cryptographic hash function by how resistant it is to such efforts. High-quality cryptographic hash functions are used in particular in information security applications.
Application: Detecting targeted data manipulation
When I install an operating system, the first step is often to download a disk image from the website of a Linux distribution. These images are very large files. Downloading them can take a long time, and serving them can put a high load on the distribution's servers. Therefore, it is good to obtain these disk images from alternative sources, for example from so-called download mirrors or a torrent swarm. How can we make sure that what we have downloaded is exactly the disk image that is made available on the official website? You guessed it: we use a cryptographic hash function. Together with the disk images, the corresponding hash values, generated with various cryptographic hash functions, are published as well. After downloading, we can generate a hash value again from the disk image and compare it with the value published on the official website. If the values match, we can be sure that it is the original file, bit for bit. Because cryptographic hash functions, not ordinary ones, are used for this purpose, we are protected not only from transmission errors but also from targeted manipulation of the disk image.
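The comparison described above might look like this in Python. The sketch assumes that the hash values are published as a text file with one line per image in the form "hash value, whitespace, file name", as written by tools such as sha256sum; the function names and the file name example.iso are hypothetical.

```python
import hashlib


def parse_checksum_list(text: str) -> dict[str, str]:
    """Map file names to hash values from lines of the form '<hash>  <name>'."""
    published: dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        value, name = line.strip().split(maxsplit=1)
        # Some tools prefix the file name with '*' to mark binary mode.
        published[name.lstrip("*")] = value.lower()
    return published


def matches_published(data: bytes, name: str, published: dict[str, str]) -> bool:
    """Check downloaded data against the hash value published for name."""
    return hashlib.sha256(data).hexdigest() == published.get(name)


# Stand-in for a real download: a few bytes instead of a large disk image.
image = b"pretend this is a disk image"
official_list = hashlib.sha256(image).hexdigest() + "  example.iso\n"

published = parse_checksum_list(official_list)
print(matches_published(image, "example.iso", published))
print(matches_published(image + b"!", "example.iso", published))
```

The first check succeeds and the second fails, because a single appended byte changes the hash value. The protection only holds if the list of hash values itself comes from the official website and not from the same mirror as the image.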
Various cryptographic hash procedures
Examples of cryptographic hash methods used in practice today are MD5, SHA-1, SHA-256 and SHA-512. It is common to represent the generated hash values as strings consisting of the digits "0" to "9" and the letters "a" to "f", but the hash methods differ in the length of the generated hash values. Let's look at the hash values of the hash methods mentioned for the string "Increase Your Skills".
"Increase Your Skills"
--MD5------> 6c54f6572d41393a77070ff2fa089fbe (32 characters)
--SHA-1----> 1ddd3ae23ab8823447621899820a0e6a (40 characters)
             eae30c43
--SHA-256--> de800a5c324b81f61cc107b8eefab2f1 (64 characters)
             eda32e6f1a0bca3105839415b99d2b9a
--SHA-512--> 41a84b6ae00109cb131893f1568a5e35 (128 characters)
             9f8366a53d74505c490f4d20844916d7
             180a45446284a94f5feca16135423451
             feaa612c1ac0bbf334c31c3c4142818f
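Values like these can be generated with Python's hashlib module. The sketch below assumes that the string is encoded as UTF-8 without a trailing newline; a different encoding or an extra newline would produce entirely different hash values. On some restricted Python builds, MD5 may not be available.

```python
import hashlib

text = "Increase Your Skills".encode("utf-8")

# Length of each hash value in hexadecimal characters, keyed by method name.
lengths: dict[str, int] = {}

for name in ("md5", "sha1", "sha256", "sha512"):
    hex_value = hashlib.new(name, text).hexdigest()
    lengths[name] = len(hex_value)
    print(f"{name:<7} {len(hex_value):>3} characters  {hex_value}")
```

Each hexadecimal character stands for 4 bits, so the 32, 40, 64 and 128 characters correspond to hash values of 128, 160, 256 and 512 bits.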
The MD5 and SHA-1 methods have not been considered secure for several years and are no longer recommended for use in cryptographic applications. For MD5 in particular, very efficient methods for calculating collisions are known. However, both methods are still used to generate checksums that are of high quality but not cryptographically secure. When selecting a hash procedure, always consider whether you only want to protect against technical data corruption or also against targeted data manipulation. To protect against targeted data manipulation, you should use a secure cryptographic hash procedure. To protect against technical data corruption, you can use a weaker hash procedure and benefit from the shorter hash values and faster calculation. For applications where collision resistance is not a high priority, you can even use a simple checksum procedure such as CRC32, which is particularly fast to compute and generates especially short checksums.
"Increase your Skills" --CRC32 --> 7e27dca0 (8 characters)