In this article, we are going to describe the SHA-2 and MD5 algorithms. Both of these hash functions are widely used in modern computer systems. The SHA-2 algorithm is used for cryptographic applications such as password storage and as a proof-of-work for the Bitcoin cryptocurrency. The MD5 algorithm is a much faster hashing algorithm but it is not cryptographically secure. Its main application is data integrity verification.
The objective of this article is to give you a sense of the anatomy of modern widely-used hash functions.
SHA-2 is a family of hash functions which builds upon the groundwork laid out by SHA-0 and SHA-1. The SHA-2 family of hash functions all share the central operations but differ in the size of hash generated. In this article we focus on the 256-bit implementation of SHA-2 (known as SHA-256) but most of the analysis extends to the alternative SHA-2 implementations (for 224, 384 and 512 bit hashes).
In 2010 the US government moved to using solely SHA-2 algorithms over SHA-1 algorithms due to the increased security that comes from SHA-2 which offers longer hashes and therefore increased difficulty in finding collisions. A hash function that generates an x-bit hash can be broken by a brute force search through the 2x possible outputs. Similarly, finding a collision requires 2x/2 evaluations under a birthday attack. Therefore the computational difficultly of attacks by brute force or birthday-style form upon hash functions increases exponentially with hash length.
Similarly to the SHA-1 algorithm the SHA-256 algorithm also starts by defining constants and padding the input message with its length a 1 and 0s before then splitting it into 32-bit messages and running rounds of bitwise operations upon each message segment. SHA-256 includes rightward shifts in the bits as one of its operations expanding the repertoire of operations beyond those in the SHA-1 family. Although the implementation is similar to that of SHA-1, the expanded repertoire of operations, increased rounds of condensing and longer output renders it more secure.
SHA-256 is used in the proof of work and authentication mechanism for the cryptocurrency Bitcoin, the secure messaging application PGP (Pretty Good Privacy), the SSH protocol and in government document integrity checks.
There are known vulnerabilities to the SHA-2 which break pre-image resistance. There are specific vulnerabilities to length extension attacks wherein an attacker can use h(m1) and knowledge of the length of m1 to compute h(m1|m2) for some attacker-controlled m2.
A further family of algorithms SHA-3 were developed in 2012 and released in 2015 to further increase security. However, it is yet to gain widespread support and implementation and will be unlikely to do so until significant flaws in SHA-2 are found that necessitate an update. SHA-3 is not derived from the design of the SHA-2 family of algorithms. The switch to SHA-3 algorithms is particularly recommended for 512-bit hashes where there is the greatest need for security and the greatest difference between the SHA-2 and SHA-3 algorithms. SHA-3 design uses sponge construction in which data is ‘absorbed’ into a data structure before being ‘squeezed’ out to generate a hash.
MD5 stands for the 5th in a series of message digest hash functions. It is currently used widely for checksum calculations and data integrity verification.
The message digest family of hash functions depend solely on the content of any message for generating a hash from it. The original MD hash function was never published and neither was MD3 leaving functions MD2, MD4 and MD5 which built upon each other to strengthen security. Weaknesses in MD2 and MD4 were found and thought to be fixed in MD5. The MD5 hash function is no longer considered cryptographically secure as collisions and breaks have been found.
The MD5 algorithm takes a variable length input and generates a 128 bit hash. The input message is padded so that its length is divisible by 512 so that it may then be split into 512 bit sections. This padding consists of a leading 1 and as many 0s as are necessary.
The 512 bit pieces are then further split into 32-bit messages. The messages are then subject to four rounds of manipulation by the functions below. A different function is used in each round. The hashes are then generally displayed in hexadecimal form for brevity.
The MD5 algorithm was originally designed for checking data integrity which, if well implemented, it is still able to do thanks to its design as a checksum and fast computation time. However, it is not considered suitable for cryptographic use in digital signatures and password storage.
There exists a known collision attack that can be implemented in seconds on standard home computing hardware. The advance in graphical processing units and hardware more generally since the hash functions inception in 1991 has also enabled effective brute force attacks. Despite these flaws a survey in 2015 found that the MD5 hash function was still widely used even by some anti-virus and security software providers.
There are many attacks, both theoretical and practical, on the MD5 function both for forging legitimate hashes and finding collisions. Many of these attacks find and then exploit knowledge of the weak-avalanche effect whereby a small change in the input yields dramatic changes in the resultant hash. An understanding of the relationship between changes in the inputs and changes in the resulting hash enables collisions to be found.
The specifics of the successful attacks on the MD5 hash function are beyond the scope of this introductory note but can be found in the initial sections of “How to Break MD5 and Other Hash Functions” (Wang & Yu) and “On Collisions for MD5” (Stevens, 2007).
Rainbow tables have been published for attacks upon the MD5 hash function which enables anyone to break the hash function.