Home .NET Quick search formatching objects by their checksums as an example of finding duplicate images

Quick search formatching objects by their checksums as an example of finding duplicate images

by admin

Source data :

  • setof objects with attributes
  • The ability to roughly identify an object by matching it with a checksum.

Ultimate Goal :

  • getlists of objects forwhich it is easy to identify matches.

The idea behind the algorithm is to create suffix tree each node of which keeps one byteof the checksum. When we getthe checksum of the next object, we start moving from the root of the tree deep down;ifwe don’t find a node for the next bytein the sequence, we create one. Reaching the end of the checksum and creating an end node we write to it the parameters of the object. We end up with a list of final nodes, ifthe final node contains a description of more than one object we assume that these objects are identical.
Here is a general view of the source text for better understanding, which can be used as a template for real-world problems.

using System;
using System.Collections. Generic ;
/// <summary>
/// Object with attributes
/// </summary>
public class HashObject
{
public string attribute1 { get ; set ;}
public string attribute2 { get ; set ; }
}
/// <summary>
/// Tree node, stores one byteof checksum
/// and a list of objects with a checksum that
/// represents the path to this node
///</summary>
public class HashTreeNode
{
/// <summary>
/// Node's own value
///</summary>
private byte myPart;
public byte MyPart
{
get { return myPart; }
}
public HashTreeNode( byte part)
{
myPart = part;
}
/// <summary>
/// If the node is a finite node, it points to objects that have a given checksum
///</summary>
public List <HashObject> files = null ;
/// <summary>
/// The following nodes
/// </summary>
private List <HashTreeNode> NextNodes = new List <HashTreeNode> ();
/// <summary>
/// Search for an end node for a specified checksum
///</summary>
public HashTreeNode FindEndNode( byte [] crc, int position)
{
HashTreeNode endNode = FindSubNodes(crc[position]);
if (position < crc.Length - 1)
return endNode.FindEndNode(crc, position + 1);
else return endNode;
}
/// <summary>
/// Search/create a subset for the current part of the checksum
///</summary>
private HashTreeNode FindSubNodes( byte part)
{
lock (NextNodes)
{
for ( int i = 0; i < NextNodes.Count; i++)
if (NextNodes[i].MyPart.Equals(part))
return NextNodes[i];
NextNodes.Add( new HashTreeNode(part));
return NextNodes[NextNodes.Count - 1];
}
}
}
/// <summary>
/// Tree, contains root nodes
/// </summary>
public class HashTree
{
/// <summary>
/// Root nodes of the tree
/// </summary>
List <HashTreeNode> Nodes = new List <HashTreeNode> ();
/// <summary>
/// Search the tree for a node with a specified checksum
/// if there is no such node, it will be created
/// </summary>
/// <param name="crc"> Crc32 file</param>
/// <returns> </returns>
public HashTreeNode CheckOnTree( byte [] crc)
{
HashTreeNode root = FindNode(crc[0]);
return root.FindEndNode(crc, 1);
}
/// <summary>
/// Find or create a root node for the passed checksum
///</summary>
private HashTreeNode FindNode( byte part)
{
lock (Nodes)
{
for ( int i = 0; i < Nodes.Count; i++)
if (Nodes[i].MyPart.Equals(part))
return Nodes[i];
Nodes.Add( new HashTreeNode(part));
return Nodes[Nodes.Count - 1];
}
}
}
* This source code was highlighted with Source Code Highlighter

As an example, a simple application to search for duplicate images in the specified directory is implemented. As a result we get an html report file.
This project is made in Microsoft Visual Studio 2008, and it is compiled for Windows x86.
The project was tested on its own folder with photos, test result :
Files in the directory : 4326
Of which photos : 4131
Weight of images : 8, 33 Gb
Search time for duplicates : 117 seconds (2 minutes) + 6 seconds to generate the report and copy duplicate files to the report folder.
Found : 90 duplicates.
Link to application and sources :
ifolder.ru/19211139
In case anyone is interested, the files are checked not by extension, but by their signature, by checking the file header. Signatures.cs is a class in the project.
Pros :

  • relatively high speed
  • ease of implementation and data storage

Minuses :

  • the entire data structure is stored in RAM. If the number of pictures will be measured in hundreds of thousands or millions, we need to rethink the way we store the tree. Or find another algorithm.

I will be grateful to those minususus post, who will give links to better known to them algorithms or point out the shortcomings of the described.

You may also like