Hash functions are **one-way functions** that reduce the size of the input to generate an output of a **fixed size**. The output is known as the **hash** of the input. Hash functions are one way implies that given that hash of a particular input, it is really difficult (practically impossible) to reconstruct the input (completely or partially).

# Properties of Hash Functions

The hash functions used in cases where security is important. They have certain properties that make them **cryptographically secure**.

Formally, a cryptographic hash function, *h*, takes as an input a message of arbitrary length and produces a message digest or ‘hash’ of fixed length. To be cryptographically secure and of practical use it must satisfy the following properties:

- Given some
*y*, it is computationally infeasible to find an input *m*’ such that *h*(*m*’) = *y*. - This requires that
*h* be a **one-way** or **pre-image resistant** function. - Note that if
*y* is the message digest of some message* m* we are not looking for *m* but simply some value *m*’ (which may be equal to *m*) that generates the hash *y* when passed into *h*.

- It is computationally infeasible to find messages
*m*_{1} and *m*_{2} such that *h*(*m*_{1}) = *h*(*m*_{2}). This is the condition that *h* be **strongly collision-free**. - It is sometimes possible to weaken this requirement to that of being
**weakly collision-free** or **second pre-image resistant**. This requires that given some value *x*, it is computationally infeasible to find a value *x*’ ≠ *x* such that *h*(*x*’) = *h*(*x*).

- Given a message,
*m*, the message digest *h*(*m*) can be computed **very quickly**.

# Applications of Hash Functions

Given that the hash depends on the input to the hash function and will change with the input hash functions are used to ensure that messages have not been tampered with (message authentication, digital signatures, checksums) and to check for equality while preserving secrecy / efficiently (for example, checking if password is correct without storing the actual password, or checking for duplicates in lists of large items). They are also used as proof-of-work (for example, in cryptocurrencies like bitcoin), error-correcting codes, randomization and to make cryptographic algorithms more efficient.

**Checksums, message authentication, etc**: Hash functions are generally used as follows. The message *m* and its hash *h*(*m*) is sent to some recipient as (*m*, *H*) where *H* = *h*(*m*). To check if errors or tampering have occurred the recipient computes *h*(*M*) and compares it to *H* for equality. If any errors or tampering have occurred it is highly likely that *h*(*M*) ≠ *H*, due to the collision-free properties of the hash function *h*.

# Collisions in a Hash Function

A **collision** occurs when the hash of two distinct inputs yields the same output. Collisions are inevitable given that hash functions reduce input of arbitrary size to an output of fixed size.

Collisions are harmful in cryptographic applications because they lead to ambiguity in how to ensure that the hash does indeed belong to the message sent and not some different message which happens to share a hash. For example, this extremely important with digital signatures where the message to be signed is verified by a hash. It may be that there are two messages sharing a hash one of which you would like to sign and the other you would reject. It could also be that two files share a hash where one is safe and the other is nefarious.

This highlights the importance of **collision-free** or **pre-image resistant** hash functions. For cryptographically secure hash functions it is generally considered to be improbable to the point of near impossibility that two meaningful messages will share a hash. Moreover, the probability that both messages are meaningful in the language of communication is also extremely low. That is, if two messages collide it is likely one will be unreadable and thereby unusable in an attack.

For this reason attacks on hash functions tend to be concerned with using the knowledge of the hash to compute the original input which is being kept secret (for example, trying to figure out the password given the hash).

# SHA-1 Algorithm

In this section, we will take a detailed look at a particular hash function - SHA-1. SHA stands for **S**ecure **H**ash **A**lgorithm.

## Overview

SHA-1 or the second iteration (after the flawed SHA-0) of the Secure Hash Algorithm was developed by the National Security Agency in the US and given to the National Institute of Standards and Technology. It was for a long time the recommended hash function for governmental use (since replaced by SHA-2). The version control system git still uses SHA-1 hashing as a checksum to check document integrity and consistency.

SHA-1 uses an iterative process to generate a 160-bit hash. The input message *m* is broken into a set of fixed size blocks where the final block is padded to be the same size as the others. The messages are then processed in a sequence of rounds using a compression function *h*’ which combines the current block and the result from the previous round of hash function. This differentiates SHA-1 from the message digest family of hash functions because the compression function takes in more than just the message body so while the hash is dependent on the message alone, in each step it also processes the partial hash of the message so far, thereby increasing security. The hash function starts with an initial value *X*_{0} and defines a subsequent series *X*_{j} = *h*’(*X*_{j-1}, *m*_{j}).

The skill in designing SHA-1 came in creating the compression function *h*’. The function is designed so that changing a single input bit affects as many output bits as possible. In cryptography, this property is known as **diffusion** or the **avalanche effect**. It is an important property to have to make computing the inverse difficult. SHA-1 uses the input bits far more than previous hash functions (of the message digest family in particular) in order to increase security.

## Message

SHA-1 begins by padding the original message with a 1 followed by as many trailing zeroes as needed to reach the next multiple of 512 bits in length less 64-bits. Subsequently the 64-bit representation of the length of the message (*T*) is appended. The message is then split into 512 bit blocks *m*_{1}, …, *m*_{L} which are then input to the hash function one by one.

As an example, if the original message had 1640 bits then a 1 and 343 0s are appended to attain a message of 1984 (= 2048 – 64, the next multiple of 512 less the space needed to append the message length) bits in length. As 1640 is equivalent to 11001101000_{2} in binary representation we append a further 53 0s followed by 11001101000 as the 64-bit representation of the message length. It is this final message of 2048 bits that is broken into blocks of 512-bits.

## Definitions and Base Functions

In the description of the algorithm the following standard bitwise logic operations are used (representing bitwise and, or, addition and negation).

\land\ \ \ \ \lor\ \ \ \ \oplus\ \ \ \ \lnot

Furthermore + will be used to represent addition mod 2^{32} and ↵*r* shall be used to denote a left shift by *r* positions with the beginning wrapping around to the end.

The hash function begins by defining the following functions.

f_t(B,C,D)=\begin{cases}
(B\ \land\ C)\ \lor\ ((\lnot B\ \land\ D)\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ if\ 0\leq t\leq19\\
B\ \oplus\ C\ \oplus\ D\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ if \ 20\leq t\leq39\\
(B\ \land\ C)\ \lor\ (B\ \land\ D)\ \ \lor\ (C\land\ D)\ \ if\ 40\leq t\leq59\\
B\ \oplus\ C\ \oplus\ D\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ if\ 60\leq t\leq79
\end{cases}

And the following constants *K*_{0}, …, *K*_{79} written in hexadecimal form.

K_t=\begin{cases}
5A827999\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ if\ 0\leq t\leq19\\
6ED9EBA1\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ if \ 20\leq t\leq39\\
BF1BBCDC\ \ \ \ \ \ \ \ \ \ \ \ if\ 40\leq t\leq59\\
CA62C1D6\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ if\ 60\leq t\leq79
\end{cases}

## Step-by-Step SHA-1 Algorithm

We may now summarize the SHA-1 Algorithm.

- Begin by padding the original message with a 1 followed by as many trailing zeroes are needed to reach the next multiple of 512 bits in length less 64-bits.
- Append the 64-bit representation of the length of the message.
- Split the message into 512 bit blocks
*m*_{1}, …, *m*_{L} which are then input to the hash function one by one. - Initialize values
*H*_{0} = 67452301, *H*_{1} = EFCDA89, *H*_{2} = 98BADCFE, *H*_{3} = 10325476 and *H*_{4} = C3D2E1F0. - For each 512-bit message block (
*m*_{i}) perform the following: - Split each message into 32-bit blocks:
*m*_{i} = *W*_{0}|*W*_{1}|…|*W*_{15}. - For
*t* from 16 to 79 let *W*_{t }= (*W*_{t-3} ⊕ *W*_{t-8} ⊕* W*_{t-14} ⊕* W*_{t-16}) ↵1. - Let A =
*H*_{0}, *B* = *H*_{1}, *C* = *H*_{2}, *D* = *H*_{3}, *E* = *H*_{4}. - For
*t *= 0 to 79 do the following steps in succession: -
*T* = (*A* ↵ 5) + *f*_{t}(*B*, *C*, *D*) + *E* + *W*_{t} + *K*_{t} -
*E* = *D* -
*D* = *C* -
*C* = (*B* ↵ 30) -
*B* = *A* -
*A* = *T*

- Let
*H*_{0} = *H*_{0} + *A*, *H*_{1} = *H*_{1} + *B*, *H*_{2} = *H*_{2} + *C*, *H*_{3} = *H*_{3} + *D* and *H*_{4} = *H*_{4} + *E*.

- Output the 160-bit has value formed by concatenating
*H*_{0}, *H*_{1}, *H*_{2}, *H*_{3} and *H*_{4}.

## Conclusion

All of the operations involved are very fast to compute as many of them are simple bitwise operations. This makes SHA-1 a fast and practical algorithm to implement. The process can also be iterated as many times as necessary to digest the whole message and therefore scales to messages of different lengths.

While each step of the algorithm is simple the accumulation of steps to form the hash function as a whole becomes complex and difficult to follow.

For examples of implementing the SHA-1 hash function see this example. To see how the algorithm was broken see Shattered It which details the first collision found which took the processing power equivalent to 6,500 years of single-CPU computations and 110 years of single-GPU computations.

# Hash Functions vs Encryption Algorithms

Hash functions are not a sufficient replacement for full encryption due to the fact that they lose information. Hash functions also lack a 'dehashing' function that plays the part of a decryption algorithm yielding the input when given the output (this is impossible by the nature of hash functions being lossy). Hash functions can however be considered cryptographically secure (by the definition above) since the output is obscured from the input. That's to say that they are one-way functions - the input cannot feasibly be computed from the output. This therefore means that the correctly computed hash of some message can be sent to any recipient without concern for the secrecy of the message itself.

# Summary

Hash functions are quick to compute and if well designed secure for the purposes of verifying data integrity, reducing the size of messages without losing security and for verification of digital signatures.

These functions generate fixed length output and should do so in a manner that is ‘one-way’ meaning that the input may not be deduced by the output. Since hash functions reduce the size of inputs the compression is necessarily ‘lossy’ and therefore some inputs may share a hash. This is known as a collision. Finding collisions leads to ambiguity as to the input and thereby undermines a hash function’s use in verifying messages.

Standard attacks on hash functions come in the form of brute force attacks, dictionary tables and more sophisticated rainbow tables. It is not possible to generate a truly collision free hash function but to be considered cryptographically secure a hash function must be designed such that it is infeasible to find a collision or determine the input when given its hash.