{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "3d50812a-1695-4fa8-90c6-e0aed9b51bc8",
      "metadata": {},
      "source": [
        "---\n",
        "title: Cryptographic hash functions\n",
        "description: In this lesson we will look at cryptographic hash functions which see extensive use in quick validation and authentication.\n",
        "---\n",
        "\n",
        "{/* cspell:ignore  onewayfn, injectivity, randomoracle, WXYZ, VXYZ, Brassard, Høyer, QRAM, Chailloux, Schrottenloher, Shen  */}\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "48916182-96be-446c-a13e-e05a7093c4ce",
      "metadata": {},
      "source": [
        "# Cryptographic hash functions\n",
        "\n",
        "In this lesson we will look at cryptographic hash functions which see extensive use in quick validation and authentication.\n",
        "\n",
        "By the end of the lesson we will have covered:\n",
        "\n",
        "*   What cryptographic hash functions are\n",
        "*   Python code examples demonstrating the use of hash functions\n",
        "*   A look at applications of cryptographic hashing\n",
        "*   The security of cryptographic hashing\n",
        "*   Threats to these algorithms from both classical and quantum computers\n",
        "\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "bfc93002-5016-4f8e-a27f-6d1039bbc578",
      "metadata": {},
      "source": [
        "## Introduction to cryptographic hashing\n",
        "\n",
        "[Hash functions](https://en.wikipedia.org/w/index.php?title=Cryptographic_hash_function\\&oldid=1159000809) represent a valuable construct in cryptography as they help enable validation with confidentiality. As such, hash functions form an important component of mechanisms for data authentication and integrity, such as [hash-based message authentication codes](https://en.wikipedia.org/w/index.php?title=HMAC\\&oldid=1158164088) (HMAC) and [digital signatures](https://en.wikipedia.org/w/index.php?title=Digital_signature\\&oldid=1161173191). This article will discuss the basic ideas and security considerations underpinning cryptographic hash functions and outline potential vulnerabilities from the advent of quantum computing.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "8021f79c-3463-4d2e-aed1-f3a87938f7d0",
      "metadata": {},
      "source": [
        "## Basic rationale and design of hash functions\n",
        "\n",
        "There are many situations where authentication and integrity verification need to be performed cheaply and without revealing private information to the party performing the validation.\n",
        "\n",
        "For example, when downloading software from a remote server, an efficient mechanism is needed to verify that the software actually downloaded has not been modified since being created by the original author of the software. Similarly, when authenticating users of web applications, it would be desirable to use a mechanism that does not involve physically storing or transmitting the actual passwords, which can potentially compromise their confidentiality.\n",
        "\n",
        "*Cryptographic hash functions* (CHFs) address such needs efficiently and securely.\n",
        "\n",
        "Fundamentally, a cryptographic hash function takes an input (or *message*) of arbitrary length and returns a fixed-size string of n-bits as output. The output of a <DefinitionTooltip definition=\"Cryptographic hash functions take an input (or message) and produce a fixed-size output, which is typically a string of characters of a fixed length.\">CHF</DefinitionTooltip> is also referred to as a *digest*.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "bad81855-a824-412b-a266-e01a9ddeaeeb",
      "metadata": {},
      "source": [
        "A useful CHF should satisfy several key properties:\n",
        "\n",
        "1.  **Uniformity:** The digests produced by a CHF should be distributed uniformly and should look random. The aim is to ensure the output leaks no information about the input.\n",
        "2.  **Determinism:** For a given input, a CHF must always produce the same <DefinitionTooltip definition=\"Refers to a fixed-length string or hash value generated from input data of arbitrary size.\">digest</DefinitionTooltip>, that is, it must be deterministic.\n",
        "3.  **Irreversibility:** A CHF is a <DefinitionTooltip definition=\"A mathematical function that is relatively easy to compute in one direction but extremely difficult to reverse.\">*one-way function*</DefinitionTooltip> in that, given a digest, it should be infeasible to invert the hashing and obtain the input.\n",
        "4.  **Approximate injectivity:** While CHFs are many-to-one functions, they should appear to look like one-to one functions. This is achieved by combining an enormous output space size with the avalanche effect whereby tiny changes in the input lead to wildly divergent digests.  This characteristic is known as approximate injectivity.\n",
        "\n",
        "Given this, it's possible to validate a piece of data against the original instance by comparing a digest of the data to a digest of the original.\n",
        "\n",
        "*   If the two digests match, we can be confident with high probability that the data is identical to the original.\n",
        "*   If the digests differ, we can be sure that the data was tampered with or is otherwise inauthentic.\n",
        "\n",
        "Since the CHF digests themselves do not reveal the actual contents of the data or the original, they enable validation while preserving privacy.\n",
        "\n",
        "<details>\n",
        "  <summary>\n",
        "    <en>Mathematical description</en>\n",
        "  </summary>\n",
        "\n",
        "  A hash function $H$ can be defined as:\n",
        "\n",
        "  $H : Σ^* \\rightarrow \\{0,1\\}^n$\n",
        "\n",
        "  where $Σ^*$ is the set of all possible strings which we may consider to be binary strings of any length.\n",
        "\n",
        "  The fact that the size of the input domain $Σ^*$ of $H$ is unbounded while that of the co-domain $\\{0,1\\}^n$ is bounded means that $H$ is necessarily many-to-one mapping infinitely many inputs to any given n-bit string.\n",
        "\n",
        "  The properties of uniformity and determinism are nicely encapsulated within the <DefinitionTooltip definition=\"An idealized cryptographic model often used in theoretical discussions and analyses of cryptographic protocols and schemes, it allows for clear and rigorous analysis of security properties.\">*random oracle model*</DefinitionTooltip> of cryptographic hashing.\n",
        "</details>\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "50354fe2-08cd-4876-8a7d-b65fc86e551a",
      "metadata": {},
      "source": [
        "## Example of cryptographic hashing in Python with SHA-256\n",
        "\n",
        "This simple example demonstrates cryptographic hashing using the popular [SHA-256](https://en.wikipedia.org/w/index.php?title=SHA-2\\&oldid=1155462953) algorithm as provided by the `cryptography` Python library.\n",
        "\n"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "8a55a611-63fe-473f-9545-50a90747b680",
      "metadata": {},
      "source": [
        "First we show how a minor difference in plain texts leads to a very large difference in the hash digests.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "id": "c3776040-036f-4b13-ae18-8fc3ba4c8398",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "The two messages differ by 1 characters\n"
          ]
        }
      ],
      "source": [
        "# Begin by importing some necessary modules\n",
        "from cryptography.hazmat.backends import default_backend\n",
        "from cryptography.hazmat.primitives import hashes\n",
        "\n",
        "\n",
        "# Helper function that returns the number of characters different in two strings\n",
        "def char_diff(str1, str2):\n",
        "    return sum(str1[i] != str2[i] for i in range(len(str1)))\n",
        "\n",
        "\n",
        "# Messages to be hashed\n",
        "message_1 = b\"Buy 10000 shares of WXYZ stock now!\"\n",
        "message_2 = b\"Buy 10000 shares of VXYZ stock now!\"\n",
        "\n",
        "print(f\"The two messages differ by { char_diff(message_1, message_2)} characters\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "4ccc7cdd-6235-4309-946b-ed8bd8169d3e",
      "metadata": {},
      "source": [
        "The two messages differ in exactly one character.\n",
        "\n",
        "Next, we instantiate `hash` objects to enable the hashing process, which involves calls to two methods: `update` and `finalize`.\n",
        "\n",
        "We see that due to the avalanche effect in the SHA-256 CHF, a one-character difference in input messages yields two very different digests.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "id": "e60af579-f19f-41ff-8ff8-ccaad6b00e21",
      "metadata": {},
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "digest-1: 6e0e6261b7131bd80ffdb2a4d42f9d042636350e45e184b92fcbcc9646eaf1e7\n",
            "digest-2: 6b0abb368c3a1730f935b68105e3f3ae7fd43d7e786d3ed3503dbb45c74ada46\n",
            "The two digests differ by 57 characters\n"
          ]
        }
      ],
      "source": [
        "# Create new SHA-256 hash objects, one for each message\n",
        "chf_1 = hashes.Hash(hashes.SHA256(), backend=default_backend())\n",
        "chf_2 = hashes.Hash(hashes.SHA256(), backend=default_backend())\n",
        "\n",
        "# Update each hash object with the bytes of the corresponding message\n",
        "chf_1.update(message_1)\n",
        "chf_2.update(message_2)\n",
        "\n",
        "# Finalize the hash process and obtain the digests\n",
        "digest_1 = chf_1.finalize()\n",
        "digest_2 = chf_2.finalize()\n",
        "\n",
        "# Convert the resulting hash to hexadecimal strings for convenient printing\n",
        "digest_1_str = digest_1.hex()\n",
        "digest_2_str = digest_2.hex()\n",
        "\n",
        "# Print out the digests as strings\n",
        "print(f\"digest-1: {digest_1_str}\")\n",
        "print(f\"digest-2: {digest_2_str}\")\n",
        "\n",
        "print(f\"The two digests differ by { char_diff(digest_1_str, digest_2_str)} characters\")"
      ]
    },
    {
      "attachments": {},
      "cell_type": "markdown",
      "id": "5b0d1239-070f-42dc-8d41-2d4462d7bf09",
      "metadata": {},
      "source": [
        "## Applications of cryptographic hashing\n",
        "\n",
        "The unique properties of CHFs make them suitable for a wide array of applications:\n",
        "\n",
        "*   **Data integrity checks:** Hash functions can be used to create a <DefinitionTooltip definition=\"Commonly used technique in cryptography to detect errors during data transmission involving adding a value(which is often derived from the data itself) to the data to create a checksum value. The recipient can then perform the same calculation to compare checksums to verify authenticity.\">checksum</DefinitionTooltip> for a set of data. Any modifications to the data, intentional or not, will result in a different checksum, alerting systems or users to the change. The checksum is also typically much more compact than the original data, which makes checksum comparisons very fast.\n",
        "\n",
        "![Fig 1: Secure hashing for data integrity checks](https://quantum.cloud.ibm.com/learning/images/courses/quantum-safe-cryptography/cryptographic-hash-functions/data-integrity.avif)\n",
        "\n",
        "*Figure 1. Secure hashing for data integrity*\n",
        "\n",
        "*   **Digital signatures:** Cryptographic hashes are essential to the functioning of digital signatures as they involve comparing cryptographically hashed messages to establish authenticity and integrity while preserving privacy.\n",
        "\n",
        "![Fig 2: Digital signatures](https://quantum.cloud.ibm.com/learning/images/courses/quantum-safe-cryptography/cryptographic-hash-functions/digital-signature.avif)\n",
        "\n",
        "*Figure 2. Digital signatures*\n",
        "\n",
        "*   **Blockchain and cryptocurrencies:** Cryptocurrencies like Bitcoin rely heavily on CHFs, particularly in creating transaction integrity and enabling consensus mechanisms like *proof of work*.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "a1b8348e-3f6f-4e57-96a6-c02d96b04e5e",
      "metadata": {},
      "source": [
        "## Security of cryptographic hashing\n",
        "\n",
        "The security of a CHF is typically assessed based on resistance to two types of attacks: [pre-image](https://en.wikipedia.org/w/index.php?title=Preimage_attack\\&oldid=1145879411) and [collision](https://en.wikipedia.org/w/index.php?title=Collision_attack\\&oldid=1138239916).\n",
        "\n",
        "### Pre-image resistance\n",
        "\n",
        "*Pre-image resistance* means that given a digest, it should be infeasible to find the input.\n",
        "\n",
        "This is related to the one-way property of CHFs.\n",
        "\n",
        "A good CHF is designed in such a way that a party wishing to conduct a pre-image attack has no better option than a brute force approach, which has time complexity $2^n$.\n",
        "\n",
        "<details>\n",
        "  <summary>\n",
        "    <em>mathematical details</em>\n",
        "  </summary>\n",
        "\n",
        "  Given a CHF $H$ and digest $g$, it should be computationally infeasible to find any input $m$ from the pre-image of $g$ whereby $H(m) = g$.\n",
        "</details>\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9ef9cc2f-0be6-4961-8a4d-aad775d5e8be",
      "metadata": {},
      "source": [
        "### Collision resistance\n",
        "\n",
        "*Collision resistance* means that it is difficult to find two different inputs that hash to the same digest.\n",
        "\n",
        "A *cryptographic hash collision* occurs when two inputs hash to the same digest. While collisions inevitably exist given the many-to-one nature of CHFs, a good CHF nevertheless makes it infeasible to locate one at will.\n",
        "\n",
        "Collision resistance is crucial for applications like digital signatures and certificates, where it could be disastrous if a malicious party were able to create a forgery that hashes to the same value.\n",
        "\n",
        "<details>\n",
        "  <summary>\n",
        "    <em>mathematical details of hash collisions</em>\n",
        "  </summary>\n",
        "\n",
        "  $m_1, m_2$ can be found such that $H(m_1) = H(m_2)$.\n",
        "</details>\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "c28ba5d3-80a6-4493-8a30-e040ab500164",
      "metadata": {},
      "source": [
        "### Hash length\n",
        "\n",
        "Collision resistance is a harder requirement than pre-image resistance and necessitates output lengths twice as long as that needed for pre-image resistance. This is because a brute force attack known as the [birthday attack](https://en.wikipedia.org/w/index.php?title=Birthday_attack\\&oldid=1161075943), which can be used to identify hash collisions, has time complexity $2^{n/2}$.\n",
        "\n",
        "In the absence of cryptanalytic weaknesses, the security of a hash function is therefore primarily influenced by its hash length. The longer the hash, the more secure it is, as it becomes harder to mount brute force attacks.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "633c27c1-aa95-4be7-8d20-a84cfe921e6c",
      "metadata": {},
      "source": [
        "## Commonly used cryptographic hash functions\n",
        "\n",
        "The following table lists some commonly used cryptographic hash functions, along with their hash lengths and primary application domains:\n",
        "\n",
        "| Hash Function | Output Length (bits) | Common Applications                                        |\n",
        "| ------------- | -------------------- | ---------------------------------------------------------- |\n",
        "| MD5           | 128                  | File integrity checking, older systems, non-crypto uses    |\n",
        "| SHA-1         | 160                  | Legacy systems, Git for version control                    |\n",
        "| SHA-256       | 256                  | Cryptocurrency (Bitcoin), digital signatures, certificates |\n",
        "| SHA-3         | Variable (up to 512) | Various cryptographic applications, successor to SHA-2     |\n",
        "| Blake2        | Variable (up to 512) | Cryptography, replacing MD5/SHA-1 in some systems          |\n",
        "| Blake3        | Variable (up to 256) | Cryptography, file hashing, data integrity                 |\n",
        "\n",
        "*   [MD5](https://en.wikipedia.org/w/index.php?title=MD5\\&oldid=1161719955) and [SHA-1](https://en.wikipedia.org/w/index.php?title=SHA-1\\&oldid=1161401043), while still seen in less sensitive applications places, are considered deprecated in terms of collision resistance and are not recommended for new systems.\n",
        "    SHA-256, part of the [SHA-2](https://en.wikipedia.org/w/index.php?title=SHA-2\\&oldid=1155462953) family, is widely used and currently secure for most applications.\n",
        "*   [SHA-3](https://en.wikipedia.org/w/index.php?title=SHA-3\\&oldid=1161463128) is a newer standard that was selected by NIST in 2015 as the winner of the NIST hash function competition. It's designed to be a drop-in replacement for SHA-2, but it also has some different internal characteristics and is resistant to certain types of attacks that might be effective against SHA-2.\n",
        "*   [Blake2 and Blake3](https://en.wikipedia.org/w/index.php?title=BLAKE_\\(hash_function\\)\\&oldid=1161463384) are cryptographic hash functions that are faster than MD5, SHA-1, SHA-2, and SHA-3, but at least as secure as the latest standard, SHA-3. They are increasingly being used in new systems, particularly where speed is important.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "9c29c838-d006-4861-8892-5e95a53937b0",
      "metadata": {},
      "source": [
        "### Quantum risks to traditional cryptographic hashing\n",
        "\n",
        "The primary quantum threat to cryptographic hashing is posed by brute force attacks.\n",
        "\n",
        "Given a particular digest, an attacker will try out random inputs until they locate one which produces the same digest.\n",
        "\n",
        "With $n$ bits in the input, there are $2^n$ possible values. Therefore, the attacker needs to try out around $2^{n-1}$ inputs to have a better than 50% chance of success.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "046592c2-f94d-47cf-ae23-a80f84d7dbf6",
      "metadata": {},
      "source": [
        "#### Grover's algorithm\n",
        "\n",
        "For such an unstructured search context, Grover's algorithm can provide a quadratic speedup using a technique known as <DefinitionTooltip definition=\"Technique used in algorithms to leverage certain phenomenons in the quantum interface which increase the probability (amplitude) of the correct answer when searching through a large set of possibilities/definitions.\">quantum amplitude amplification</DefinitionTooltip>, reducing the time complexity of a pre-image attack to $2^{n/2}$.\n",
        "\n",
        "In practical terms, this means that a 256-bit CHF, which is currently considered secure against pre-image attacks by classical computers, would provide the same level of security as a 128-bit CHF when faced with a quantum attacker utilizing Grover's algorithm.\n",
        "\n",
        "Grover's algorithm by itself is not known to provide additional quantum speedups with respect to collision attacks beyond the limit set by the [birthday attack](https://en.wikipedia.org/w/index.php?title=Birthday_attack\\&oldid=1161075943), which can be carried out on classical computers. Since the classical birthday attack already requires CHFs to employ hash lengths that are twice as long as needed for pre-image resistance, the fact that Grover search effectively halves the hash length with respect to pre-image attacks does not pose a practical threat.\n",
        "\n",
        "For example, in the case of SHA-256, performing on the order of $2^{128}$ operations to execute a Grover-assisted pre-image attack would still be infeasible in the foreseeable future.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "fd61f809-f3d7-4d14-9381-a99a1b58fab5",
      "metadata": {},
      "source": [
        "#### BHT algorithm\n",
        "\n",
        "A quantum algorithm that combines aspects of the birthday attack with Grover search was proposed in 1997 by [Brassard, Høyer, and Tapp](http://arxiv.org/abs/quant-ph/9705002) (BHT) and affords a theoretical scaling of $O(2^{n/3})$ for finding hash collisions. However, this improved scaling is predicated on the existence of quantum random access memory [QRAM](http://arxiv.org/abs/2205.14023) technology, which currently does not exist.\n",
        "\n",
        "Without QRAM, the realizable scaling is $\\tilde{O}(2^{2n/5})$ and for hash lengths currently in use, this marginal improvement in collision-finding capability relative to the birthday attack does not represent a critical threat.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "id": "862061f9-1fcf-4380-98bb-f409438d8d6e",
      "metadata": {},
      "source": [
        "## Summary\n",
        "\n",
        "Cryptographic hash functions are an essential component for ensuring data integrity and privacy in digital information systems and find widespread application in many contexts.\n",
        "\n",
        "The security requirements of CHFs are mainly understood in terms of their resistance to pre-image and collision attacks. For well-designed CHFs, the hash length is a good proxy for the security level.\n",
        "\n",
        "While the advent of quantum computers executing the Grover and BHT algorithms in theory affects the pre-image and collision resistance of traditional CHFs, the long hash lengths already in use today means that modern cryptographic hashing algorithms, such as SHA-3, are likely to remain secure barring the discovery of as-yet-unknown cryptanalytic attacks.\n",
        "\n",
        "The relevance of CHFs lies in their role as a fundamental building block for quantum-resistant cryptographic schemes, ensuring that digital information remains secure even in the face of potential future advancements in quantum computing algorithms and technologies.\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "id": "a1b8767d",
      "source": "© IBM Corp., 2017-2026"
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "state": {},
        "version_major": 2,
        "version_minor": 0
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 4
}