MD5 Collision Attacks: A Deep Dive
Index
- What are hash functions?
- What is MD5?
- How are hash functions used in general applications?
- Collision Attacks
- Implications of Collision
- Generating Two Colliding Files
- Colliding two Programs – Huge Implications!
- BONUS: The Rogue CA Attack: How Fake Certificates Were Created
1. What are hash functions?
A hash function is a mathematical function that generates a unique output of a fixed length for every input given. This input (data files, images, executable files) could be of a variable length for which a fixed length output is generated.
They are one-way functions which means that it is easy to compute a hash value from a pre-image, but hard to generate a pre-image that hashes to a particular hash value. This important property is called irreversibility.
Since a hash value is unique to every input, hashes act as a fingerprint of any file. This property is useful to validate data and provide integrity. Users can be sure that a file has not been tampered with if its hash is verified. Even small changes in the input can produce dramatically different hashes.
Although we will be focusing on MD5 in this article, a few well known hash functions are SHA-256, SHA-512, MD5, GOST etc.
People often confuse hashing with encryption, but they serve entirely different purposes. Encryption scrambles data to protect the original message, allowing it to be decrypted later, while hashing generates a fixed-length “fingerprint” of data that cannot be reversed to reveal the original content. Another key difference lies in the output length. Encrypted data maintains the same length as the original message, while a hash always produces a fixed-length output, regardless of the input size.
2. What is MD5?
MD5 (Message-Digest Algorithm 5) is a cryptographic hash function that processes an input to produce a fixed 128-bit hash value. MD5 was one in a series of hash algorithms designed by Professor Ronald Rivest at MIT in 1992. Today, it has been replaced by the Secure Hashing Algorithms.
How MD5 works is by changing a 128-bit state as we go through the message in 64 byte blocks. The input message can be padded to ensure its length is a multiple of 64 bytes. The internal 128-bit state undergoes a function, which takes two inputs, a 64-byte data block and the outcome of the previous iteration (multiple rounds of bitwise operations, modular additions, and logical functions). After processing all blocks, the final 128-bit hash is produced.
As an example, let’s calculate the hashes of two different strings.
$ echo -n Hello | md5sum
8b1a9953c4611296a827abf8c47804d7
$ echo -n Hello! | md5sum
952d2c56d0485958336747bcdd98590d
We can see how just a slight change drastically changes the hash value.
For years, MD5 was widely used in digital signatures, file integrity verification, and certificate generation. However, its vulnerability to collisions has rendered it obsolete for secure applications.
3. How are hash functions used in general applications?
In the real world, hashing is used to ensure the authenticity of any file since no two files can have the same hash. This property is utilized in a variety of applications such as –
- Data Integrity Verification:
Hash functions are used to generate checksums or hash values for softwares and files. When files are transmitted or stored, their hash values are computed and compared at the destination to detect any alterations or corruption. This is common in software distribution to ensure data remains unchanged. For example, VirtualBox has listed SHA256 hashes of all its versions here https://download.virtualbox.org/virtualbox/4.3.8/SHA256SUMS
- Sensitive Information Storage:
Storing plain-text passwords is risky, so systems use hash functions to store hashed versions of passwords instead. During login, the entered password is hashed and compared to the stored hash. Even if the database is compromised, attackers cannot retrieve the original passwords, as the hash function is one-way and irreversible.
4. Collision Attacks
Now, ideally, 2 different strings should always have different hashes. But in some rare cases, if two distinct pieces of data share the same hash value, it is known as a hash collision attack.
As an example consider these two strings,
b77dff821f18233a8c1b4eee00210cb59408df00a81548f8563f053c862570ab 2432cfcf8c7ec3d1684b5c471b89645de43ecc7ba66773a473d4f94eb170f7e7 c43dcea432fbb2f9da481409a03ca8f67235dd627c365b158b86d78f105f414d 92e68eb87374f773c26998ebb6aa6375b1885f9fab133ee5a8706b9e5a709f35
and
b77dff821f18233a8c1b4eee00210cb594085f00a81548f8563f053c862570ab 2432cfcf8c7ec3d1684b5c479b89645ce43ecc7ba66773a473d4794eb170f7e7 c43dcea432fbb2f9da481409a03ca8f672355d627c365b158b86d78f105f414d 92e68eb87374f773c26998eb36aa6376b1885f9fab133ee5a870eb9e5a709f35
Each of these strings has MD5 hash 7098015cc18e1b20f1ef289d0c81a65e
The core issue with MD5 is its vulnerability to collisions as we saw above. The first demonstration of MD5’s insecurity was announced on 17 August 2004, when collisions for the full MD5 were announced by Xiaoyun Wang, Dengguo Feng, Xuejia Lai, and Hongbo Yu.
Suppose, we have two strings P and Q such that:
MD5(P) = MD5(Q)
This means that an identical data string T, when appended to both inputs will still result in identical hashes.
MD5(P + T) = MD5(Q + T)
This works because the MD5 algorithm processes data in blocks and the internal state after hashing P and Q is identical.
The hashes will also be the same, if we prepend identical data to both input blocks.
MD5(T + P) = MD5(T + Q)
Altogether, we can see the resulting hash will be the same whether we prepend or append a string to P and Q,
MD5(T + P + T’) = MD5(T + Q + T’)
This flaw undermines the reliability of MD5 which can be escalated to an attack. If an attacker can create two different files with the same hash, they can manipulate systems that rely on MD5 for authentication or verification. This can lead to severe security breaches, such as forging digital signatures or tricking software validation mechanisms.
5a. Generating Two Colliding Files
A practical demonstration of MD5 collisions can be performed using md5collgen, a tool designed to generate colliding files. You can get the tool here: https://github.com/zhijieshi/md5collgen
Steps:
- Use md5collgen to generate two binary files, a.bin and b.bin.
- $ ./md5collgen -o a.bin b.bin
MD5 collision generator v1.5
by Marc Stevens (http://www.win.tue.nl/hashclash/)
Using output filenames: ‘a.bin’ and ‘b.bin’
Using initial value: 0123456789abcdeffedcba9876543210
Generating first block: ………
Generating second block: S11……
Running time: 6.306789 s
- Compute their MD5 hashes:
- md5sum a.bin b.bin
- Observe that both files have the same hash, despite having different contents.
So, what this tool did was create two 128 bit sequences that happen to collide given a starting vector IHV0 (0123456789abcdeffedcba9876543210, see above).
5b. Colliding two Programs – Huge Implications!
As we’ve stated, one use of hashing is to verify data integrity i.e. whether a file is intact or tampered. MD5 collisions can also be exploited in executable programs, creating different code paths within a program.
As we saw earlier, given that
MD5(P) = MD5(Q)
the following property holds true for MD5 hash collisions,
MD5(T + P + T’) = MD5(T + Q + T’)
Let’s use it to our advantage. Here we shall create two executables out of a harmless looking program. This program simply compares two character arrays to make a decision. If the arrays are similar, a harmless code is executed, but if they are dissimilar, an evil code is executed.
#include <stdio.h>
#include <string.h>
// padding # (to make prefix a multiple of 64)
unsigned char padding[32] = {
0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23,
0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23,
0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23, 0x23};
// an array of ‘A’s
unsigned char A[128] = {
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41,
0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41, 0x41};
// an array of ‘B’s
unsigned char B[128] = {
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42,
0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42, 0x42};
int main() {
if (memcmp(&A, &B, 128) == 0) {
printf(“Good program\n”); // harmless }
else {
printf(“Evil program\n”); // system(“rm -rf /”) }
}
This program is available here.
If we compile this program into a binary, we know that there will be some 0x41 (A) and 0x42 (B) somewhere in the executable file. To view this, you can use a hex editor like ghex.
We’ll create two versions of the program: one harmless (takes the if branch) and one malicious (takes the else branch). To do this, we insert two 128-byte MD5-colliding blocks where the arrays A and B are compared. Observe in the image above, that at the offset 0x3040 (12352 in decimal), the prefix length is a multiple of 64 bytes, which aligns with MD5’s block size.
We first extract this prefix using the head command, then use it with md5collgen to generate two colliding blocks. Next, we extract the common suffix starting at byte 12608 (i.e., 12352 + 256). Finally, we reconstruct two different binaries by concatenating the prefix, one of the colliding blocks (either A or B), and the suffix. This yields two programs with identical MD5 hashes but diverging behavior.
Let’s do it step by step
$ gcc main.c -o main
$ head c 12352 main > prefix
$ tail -c 12608 a.out > suffix
$ ./md5collgen/md5collgen -p prefix o prefix1 prefix2
MD5 collision generator v1.5
by Marc Stevens (http://www.win.tue.nl/hashclash/)
Using output filenames: ‘prefix1’ and ‘prefix2’
Using prefixfile: ‘prefix’
Using initial value: 86160354c2aeff261ac1e4c57b8c5b86
Generating first block:………………………………………
Generating second block: S11…………………
Running time: 73.810965 s
Take a look at the prefixes (prefix1 and prefix2) now. Observe that the array A (starts after padding #) has been replaced with two different colliding blocks. Let’s call them block1 (in prefix1) and block2 (in prefix 2).
Since, we need another array (B) before the suffix, edit the prefixes in the following manner:
- Edit prefix1 to append another block1: Simply copy the last 128 bits and paste them again at the end. Save.
- Edit prefix2 to append block1 after block2: Copy the last 128 bits from prefix1 (block1) and paste them again at the end of prefix2. Save.
Finally, our prefixes look like this:
- prefix1: common-prefix + block1 + block1
- prefix2: common-prefix + block2 + block1
Look at the example below to get an idea.
Let’s put everything together
$ cat prefix1 suffix > good-program
$ cat prefix2 suffix > evil-program
If we look at the outputs of those two programs, the difference is visible. Still, observe that they collide!
$./good-program
Good program
$ ./evil-program
Evil program
$ md5sum good-program evil-program
fd54ee3ec8df9232e772bbaac642cfe7 good-program
fd54ee3ec8df9232e772bbaac642cfe7 evil-program
The real world implications of this simple example are huge. Suppose a trusted authority signs a safe and legitimate software with MD5. An attacker is then able to create a malicious version of the program with the same hash. The adversary can distribute this malicious software and will easily be trusted by the authority since it has the same hash.
To demonstrate the above scenario, please download this flask application and follow along: https://github.com/singhanveshak/MD5-collision-demo
This application accepts only those files which are trusted. The way it does this is by keeping a list of trusted hashes. By default, it trusts the file good-program in the Programs directory. Try uploading anything else and it fails but when you try to upload the evil-program, it is accepted since its hash matches out. This is a serious vulnerability.
5c. BONUS: Impersonating Malicious Websites as Legitimate
One of the most notorious MD5 attacks was the Rogue Certificate Authority (CA) Attack, which allowed attackers to disguise fake websites as legitimate by forging SSL certificates. Seems confusing? Let’s break it down.
Let’s first discuss certificates and some important terms related to it.
Certificates: are a way to verify the authenticity and trustworthiness of a website. These certificates are issued by Certificate Authorities (CA).
Each certificate contains the following information:
- Public Key: Used for encrypting data and verifying signatures.
- Certificate Info: Details like issuer, subject, validity period, and certificate usage.
- Digital Signature: Created by the CA to verify authenticity.
- Signature Algorithm: Specifies the algorithm used to sign the certificate.
Now, any OS or browser trusts some certificate authorities by default. In Linux, you can go to /var/share/ca-certificates/mozilla to view them. All CA’s or websites signed by these root CAs are trustworthy.
When you open a secure webpage, you see a lock symbol 🔒 in your browser. There you can view the certificate of the website, and who has granted it. As you can see below.
Here, you can see that *.google.com is signed by CA WR2 which is in turn signed by root CA called GTS. This is called the Chain of Trust.
The Attack Strategy:
The rogue CA attack is a clever and dangerous exploitation of weaknesses in the MD5 algorithm, which was once widely used for securing digital certificates. Here’s how it works: the attacker starts by submitting a legitimate certificate request to a trusted Certificate Authority (CA) that still uses MD5 to generate digital signatures. This request looks completely normal and valid, so the CA processes it without suspicion.
However, the attacker has also prepared a second, malicious certificate—a rogue intermediary CA certificate—that is designed to produce the exact same MD5 hash as the legitimate one. This is called an MD5 collision, where two different pieces of data end up with identical hashes. Once the CA signs the legitimate certificate and returns it, the attacker takes that signature and applies it to their rogue certificate. Because both certificates share the same hash, the signature works perfectly on the rogue one, making it appear as if it was legitimately issued by the CA.
Now armed with this rogue CA certificate, the attacker can impersonate any website they choose—whether it’s a bank, an e-commerce platform, or any other secure site. When users visit these fake sites, their browsers will show all the usual signs of security: HTTPS in the address bar, a padlock icon, and messages confirming that the certificate is valid. This makes phishing attacks virtually undetectable and allows attackers to steal sensitive information or manipulate communications without raising alarms.
Consider reading this article to the original thrill of confounding the whole Public Key Infrastructure with just $600 and a few playstations! MD5 considered harmful today
Conclusion
Despite MD5’s widespread use in the past, real-world attacks like the Rogue CA incident have demonstrated the catastrophic consequences of relying on weak methods. SHA512, SHA256 are good alternatives to MD5. Which is why it should not be used for security purposes or when collision resistance is important. People should keep their system up to date, and developers should use secure algorithms in their designs. This marks the end of our article.
Hope you enjoyed reading this. Keep learning ✨