
Topic: PHP - determining if two .jpg files are the same?

 
Author Thread
Pugzly
Paranoid (IV) Inmate

From: 127.0.0.1
Insane since: Apr 2000

posted 10-31-2005 16:40

I'm doing some screen scraping, and part of the data is a .jpg for each record. If there is no photo for a given record, the source gives a generic "photo not available" image. The file name format matches those that are valid, so I can't verify just by filename. File sizes vary amongst the records, so I'm reluctant to look for each .jpg that matches the generic file just by file size.

Is there a way to take two images, and verify if they are the same? So I could take a known generic image and run a comparison against each that I get? Does that make sense?



(Edited by Pugzly on 10-31-2005 16:41)

poi
Paranoid (IV) Inmate

From: France
Insane since: Jun 2002

posted 10-31-2005 17:21

Using GD you could resize the pictures to the same size (say ~128x128), then compute the difference pixel by pixel. If the difference exceeds a given threshold too often, the images are different. Alas, that technique will be slow if you want to compare many images.

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 10-31-2005 18:33

This is usually something that MD5 is used for (specifically md5_file). Though that can be slow depending on the size of the images.
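A minimal sketch of the md5_file approach, assuming both images are already on disk. The helper name `same_file` is made up for illustration; `md5_file()` and `filesize()` are standard PHP functions.

```php
<?php
// Hypothetical helper: true when two files are byte-for-byte identical
// (as far as an MD5 digest can tell).
function same_file($pathA, $pathB)
{
    // Cheap early exit: files of different sizes can never be identical.
    if (filesize($pathA) !== filesize($pathB)) {
        return false;
    }
    $hashA = md5_file($pathA);
    $hashB = md5_file($pathB);
    // md5_file() returns false when a file cannot be read.
    if ($hashA === false || $hashB === false) {
        return false;
    }
    return $hashA === $hashB;
}
```

When one side of the comparison is always the same known image, its hash only needs to be computed once and can be stored as a string.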

You can probably open them up as text files and just read something like the first 10 lines or however many are the header and see if they're the same.
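A rough sketch of the "just read the start of the file" idea, assuming a byte count rather than a line count, since JPEG data is binary and not line-oriented. The helper name `same_prefix` and the 1024-byte default are arbitrary choices for illustration. A match only proves that the prefixes match: images produced by the same encoder at the same dimensions may well share their header bytes, so treat this as a quick filter rather than proof.

```php
<?php
// Hypothetical helper: compare only the first $length bytes of two files.
function same_prefix($pathA, $pathB, $length = 1024)
{
    // Read each file as a binary string, starting at offset 0.
    $headA = file_get_contents($pathA, false, null, 0, $length);
    $headB = file_get_contents($pathB, false, null, 0, $length);
    if ($headA === false || $headB === false) {
        return false; // unreadable file
    }
    return $headA === $headB;
}
```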



.:[ Never resist a perfect moment ]:.

(Edited by bitdamaged on 10-31-2005 21:05)

Pugzly
Paranoid (IV) Inmate

From: 127.0.0.1
Insane since: Apr 2000

posted 11-01-2005 18:42

Interesting theory on an MD5 hash. They are all the same width and height (mug shots). I'll test the MD5 method - that should work.

poi - are you saying use GD to check for a difference?

poi
Paranoid (IV) Inmate

From: France
Insane since: Jun 2002

posted 11-01-2005 18:48

Yep. Use GD to compare the color of each pixel of the 2 images. Before that, resize both images to an arbitrary size to significantly decrease the number of pixels to compare.

I proposed that method because I thought you wanted to compare 2 images regardless of their file size, file name or resolution.
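A minimal sketch of the resize step, assuming the GD extension with JPEG support is available. The helper name `load_thumbnail` and the 32-pixel default are assumptions for illustration; `imagecreatefromjpeg()`, `imagecreatetruecolor()` and `imagecopyresampled()` are the GD functions doing the work. Stretching to a square ignores the aspect ratio, which should not matter as long as both images get the same treatment.

```php
<?php
// Hypothetical helper: load a JPEG and shrink it to $size x $size pixels.
// Returns a GD image, or false if the file cannot be decoded.
function load_thumbnail($path, $size = 32)
{
    $source = imagecreatefromjpeg($path);
    if ($source === false) {
        return false;
    }
    $thumb = imagecreatetruecolor($size, $size);
    // Copy the whole source image into the smaller canvas with smoothing.
    imagecopyresampled(
        $thumb, $source,
        0, 0, 0, 0,
        $size, $size,
        imagesx($source), imagesy($source)
    );
    return $thumb;
}
```

Two thumbnails produced this way have the same dimensions, so they can be walked pixel by pixel with `imagecolorat()`.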



(Edited by poi on 11-01-2005 18:49)

Pugzly
Paranoid (IV) Inmate

From: 127.0.0.1
Insane since: Apr 2000

posted 11-01-2005 20:00

poi - I'm a little green on GD syntax. Do you have an example?

In this case, all photos are the same dimensions, just slightly different file sizes.

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 11-01-2005 21:09

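poi's suggestion physically checks whether the two images are identical. If you know the height and width, you'd loop through the pixels with something like this:

// Returns true only if every pixel matches; both images must have the same dimensions.
function images_identical($image_resource1, $image_resource2, $imagewidth, $imageheight) {
    for ($i = 0; $i < $imagewidth; $i++) {
        for ($j = 0; $j < $imageheight; $j++) {
            // Bail out at the first pixel whose color differs.
            if (imagecolorat($image_resource1, $i, $j) != imagecolorat($image_resource2, $i, $j)) {
                return false;
            }
        }
    }
    return true;
}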

http://us3.php.net/manual/en/function.imagecolorat.php



.:[ Never resist a perfect moment ]:.

(Edited by bitdamaged on 11-01-2005 21:10)

Pugzly
Paranoid (IV) Inmate

From: 127.0.0.1
Insane since: Apr 2000

posted 11-01-2005 22:51

Ah. I see. I have been playing with the md5 theory, and it looks like I can do that without taking a performance hit (I'm comparing nearly 18000 images while I cURL them).

poi
Paranoid (IV) Inmate

From: France
Insane since: Jun 2002

posted 11-01-2005 23:11

That's it.

The technique can be tweaked, by using a threshold, to tolerate subtle differences caused by different JPG compression ratios.

Just bear in mind that imagecolorat( ... ) returns a bitfield value in which bits 16->23 hold the red component, bits 8->15 the green one and bits 0->7 the blue one. Hence you can take advantage of bitwise operations to compute the difference.

[edit] 18,000 images, ouch. Comparing each pixel of all those images is gonna take years.
If you have to compare all those images, you'll have to create a kind of hash table based on multi-resolution versions of them, and compare the images at the lowest resolution first, then at each higher resolution until you find a significant difference. [/edit]

[edit2] The MD5 technique sounds really good, alas it accepts no difference at all. [/edit2]
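A sketch of the threshold idea, assuming two truecolor GD images of equal size (which is what `imagecreatefromjpeg()` produces). In a truecolor value, red sits in bits 16-23, green in bits 8-15 and blue in bits 0-7. The helper names and the default tolerance of 16 are assumptions for illustration, not values from this thread.

```php
<?php
// Split a truecolor value from imagecolorat() into its channels.
function rgb_channels($color)
{
    return array(
        ($color >> 16) & 0xFF, // red:   bits 16-23
        ($color >> 8) & 0xFF,  // green: bits 8-15
        $color & 0xFF          // blue:  bits 0-7
    );
}

// Largest per-channel difference between two truecolor values (0-255).
function channel_difference($colorA, $colorB)
{
    list($r1, $g1, $b1) = rgb_channels($colorA);
    list($r2, $g2, $b2) = rgb_channels($colorB);
    return max(abs($r1 - $r2), abs($g1 - $g2), abs($b1 - $b2));
}

// Hypothetical helper: two same-sized GD images count as similar when at most
// $maxBadPixels pixels differ by more than $tolerance on any channel.
function images_similar($imgA, $imgB, $tolerance = 16, $maxBadPixels = 0)
{
    $width = imagesx($imgA);
    $height = imagesy($imgA);
    if ($width !== imagesx($imgB) || $height !== imagesy($imgB)) {
        return false;
    }
    $bad = 0;
    for ($x = 0; $x < $width; $x++) {
        for ($y = 0; $y < $height; $y++) {
            $diff = channel_difference(imagecolorat($imgA, $x, $y), imagecolorat($imgB, $x, $y));
            // Count the pixel as bad, and stop as soon as the budget is spent.
            if ($diff > $tolerance && ++$bad > $maxBadPixels) {
                return false;
            }
        }
    }
    return true;
}
```

Raising `$maxBadPixels` corresponds to the "exceeds a threshold too often" rule: a few noisy pixels are allowed before the images are declared different.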



(Edited by poi on 11-01-2005 23:17)

(Edited by poi on 11-01-2005 23:19)

Pugzly
Paranoid (IV) Inmate

From: 127.0.0.1
Insane since: Apr 2000

posted 11-02-2005 06:56

So far, the MD5 method is working. I found two different hashes that represented files I did NOT want, and I'm going through them now. Seems to work great.

poi
Paranoid (IV) Inmate

From: France
Insane since: Jun 2002

posted 11-02-2005 09:23

Great! It's gonna be one billion times faster than comparing the pixels.

Do you make the MD5 from the image files themselves or from the first N pixels of the images?
Have you tried to compare the same image compressed at different ratios? Or is such a case impossible in your context?

Pugzly
Paranoid (IV) Inmate

From: 127.0.0.1
Insane since: Apr 2000

posted 11-02-2005 22:23

No - it's actually turned out to be fairly simple. I found one record with the 'no photo available' image, and got the MD5 hash on it. I then just did a SQL query in my db, and cycled through the records, cURLing the photo, running an MD5 hash on it, and deleting it if it matched (since I can't MD5 them remotely, I have to get them first).
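A sketch of that loop with one variation, named plainly: instead of saving each photo and then running `md5_file()` on it, this hashes the downloaded bytes in memory with `md5()`, which yields the same digest as `md5_file()` would for a file holding those bytes. The function names, the `$photos` array shape and the way URLs are obtained are all assumptions; the SQL query is left out.

```php
<?php
// Hypothetical helper: download a URL into memory, or return false on failure.
function fetch_bytes($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    return curl_exec($ch);
}

// True when the downloaded bytes hash to the known placeholder digest.
function is_placeholder($bytes, $placeholderHash)
{
    return md5($bytes) === strtolower($placeholderHash);
}

// Fetch each photo, skip the placeholder, save the rest.
// $photos maps a local file path to its (assumed) remote URL.
function save_real_photos(array $photos, $placeholderHash, $delaySeconds = 5)
{
    $saved = array();
    foreach ($photos as $localPath => $url) {
        $bytes = fetch_bytes($url);
        if ($bytes !== false && !is_placeholder($bytes, $placeholderHash)) {
            file_put_contents($localPath, $bytes);
            $saved[] = $localPath;
        }
        sleep($delaySeconds); // pause between requests to spare the remote server
    }
    return $saved;
}
```

If more than one unwanted image turns up, `$placeholderHash` could become an array of digests checked with `in_array()`.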

Due to bandwidth concerns, I sleep(5) between each call (don't want to cheese off the State Police!). Total time to implement the script was under half an hour, including some trial and error. It'll take a while to get all 18000+ images....

I am running md5_file() against the entire file. Some trials showed that was the best method. Since the images aren't dynamically created, I've yet to see any problems.

The only issue I've really come up against is with cURL. I can't seem to write the files in a folder other than the one the script is in. I'm sure I'm missing something somewhere, but it's really not that critical.
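The thread does not say what went wrong here, so the following is only a sketch of one way to have cURL write into another folder, under the assumption that the folder already exists. Two common suspects it checks for are a path that does not resolve to the intended directory and missing write permission. The helper names are hypothetical; `CURLOPT_FILE` is the standard option for sending the response body to an open file handle.

```php
<?php
// Hypothetical helper: build a destination path inside $dir from the URL's file name.
function target_path($dir, $url)
{
    $name = basename((string) parse_url($url, PHP_URL_PATH));
    return rtrim($dir, '/\\') . DIRECTORY_SEPARATOR . $name;
}

// Hypothetical helper: stream a URL straight into a file in $dir.
// Returns the saved path, or false on failure.
function download_to_dir($url, $dir)
{
    if (!is_dir($dir) || !is_writable($dir)) {
        return false; // wrong path or missing write permission
    }
    $path = target_path($dir, $url);
    $fp = fopen($path, 'wb'); // binary mode matters for images on Windows
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FILE, $fp); // write the response body to $fp
    $ok = curl_exec($ch);
    fclose($fp);
    return $ok ? $path : false;
}
```

Passing an absolute directory, for example one built from `dirname(__FILE__)`, avoids surprises when the script's working directory is not the script's own folder.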
