
Preserved Topic: PHP file search and regexps

 
butcher
Paranoid (IV) Inmate

From: New Jersey, USA
Insane since: Oct 2000

posted 04-22-2001 04:46

I'm trying to write a script in PHP that opens a directory, picks out the files whose names end in html, and then searches each of those files for .gif or .jpeg file names. The return for each image needs to be the whole path (e.g. from src="images/showtime/winners/topleft_01.gif"), only without the *src=* or the quotes at either end. I've gotten as far as opening the directory, extracting all the html files, and opening a file and reading it. I don't know what function to use to look through the file with the regexp (which I need help with also) to extract the .gif or .jpeg path names. Here's what I have so far:

code:

<?
$dir_name = "/Windows/Desktop/server_stuff/HTML/";

$dir = opendir($dir_name);
$file_list = "<ul>";

while ($file_name = readdir($dir)) {
if (ereg("html$", $file_name)) {
$html_file = "$dir_name$file_name";
$f = fopen($html_file,r);
$fd = fread($f, filesize ($html_file));
if (ereg("gif$

mr.maX
Maniac (V) Mad Scientist

From: Belgrade, Serbia
Insane since: Sep 2000

posted 04-22-2001 10:56

Here you go:

<?php
$dir_name = "/Windows/Desktop/server_stuff/HTML/";

$dir = opendir($dir_name);
$file_list = "<ul>";

// readdir() returns false when there are no more entries
while (($file_name = readdir($dir)) !== false) {
    // only look at files whose name ends in "html" (case-insensitive)
    if (preg_match("/html\$/i", $file_name)) {
        $html_file = "$dir_name$file_name";
        $f = fopen($html_file, "r");
        $fd = fread($f, filesize($html_file));
        fclose($f);
        // $arrayofimages[2] collects the SRC value of every <IMG> tag
        if (preg_match_all("/<IMG(.+?)SRC=\"?([^\"' >]+)/i", $fd, $arrayofimages)) {
            foreach ($arrayofimages[2] as $img) {
                $file_list .= "<LI>$img";
            }
        }
    }
}

$file_list .= "</ul>";

closedir($dir);
?>

BTW I'm using Perl-compatible regular expressions (preg* functions). Also, note that image filenames with spaces (e.g. SRC="bla bla.jpeg") won't be matched correctly, but that's OK, because a space has to be written as "%20" (e.g. SRC="bla%20bla.jpeg") anyway...
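Since the original question asked only for .gif and .jpeg paths, here is a minimal sketch of how the loop body above could be narrowed down. The function name extract_image_paths, the sample HTML, and the decision to also accept .jpg are assumptions for illustration, not part of the posted script.

```php
<?php
// Hypothetical helper (not from the thread): returns the SRC values of
// <IMG> tags in $html whose path ends in .gif, .jpg or .jpeg.
function extract_image_paths(string $html): array
{
    $paths = [];
    // Same pattern as in the post above, written as a single-quoted string.
    if (preg_match_all('/<IMG(.+?)SRC="?([^"\' >]+)/i', $html, $matches)) {
        foreach ($matches[2] as $src) {
            // Keep only paths with an image extension we care about.
            if (preg_match('/\.(gif|jpe?g)$/i', $src)) {
                $paths[] = $src;
            }
        }
    }
    return $paths;
}

// Made-up sample markup: two wanted images and one .png that is skipped.
$sample = '<img border=0 src="images/showtime/winners/topleft_01.gif">'
        . '<IMG SRC=logo.png><img alt="x" src="photo.JPEG">';
print_r(extract_image_paths($sample));
```

The extension check is done as a second, separate match so the main pattern stays identical to the one posted.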

butcher
Paranoid (IV) Inmate

From: New Jersey, USA
Insane since: Oct 2000

posted 04-22-2001 19:18

Thanks mr.maX, works like a charm.

I have one more question for you, or anybody else that knows regexp's. Could you explain:

(preg_match("/html\$/i", $file_name))

and

(preg_match_all("/<IMG(.+?)SRC=\"?([^\"' >]+)/i", $fd, $arrayofimages)){
foreach ($arrayofimages[2] as $img)

in terms I might understand, so that I can begin to get a grip on regular expressions. I would really love to eventually learn to write my own, and the only way for me to do that is to see the I/O from some working models, and the explanations to go with them.

Thanks again

jiblet
Paranoid (IV) Inmate

From: Minneapolis, MN, USA
Insane since: May 2000

posted 04-23-2001 03:42

Regular expressions are really not all that complicated; they just look that way. Try man grep in your Unix shell for an explanation, or search around the net, though most explanations are less than complete given the versatility and the number of special cases possible. If you have a Mac, BBEdit has a very good explanation.

/html\$/i

First, the pattern to be matched goes inside slashes, followed by any modifiers. In this case the modifier is 'i', which makes the search case-insensitive.

The $ represents the end of the string being searched (or the end of a line in multi-line mode). I'm not sure why he put the backslash before it; usually you use that to escape special characters to their literal meaning, i.e. to search for a literal html$. But I am a relative regexp newbie, so I'm sure Max will set me straight.
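To see the I/O for that first pattern, here is a small sketch run against a few made-up file names (the names are purely illustrative). The pattern is written in single quotes here, so no backslash is needed in front of the $.

```php
<?php
// Hypothetical file names, for illustration only.
$names = ['index.html', 'ABOUT.HTML', 'html_notes.txt', 'page.htm'];

$matched = [];
foreach ($names as $name) {
    // '$' anchors the match at the end of the name; 'i' ignores case.
    if (preg_match('/html$/i', $name)) {
        $matched[] = $name;
    }
}

print_r($matched); // index.html and ABOUT.HTML
```

Note that the pattern does not require a dot, so a name like "myhtml" would match too; something like '/\.html$/i' would be stricter.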

The second regexp:

/<IMG(.+?)SRC=\"?([^\"' >]+)/i

Take the first parenthesized portion. The period matches any character, the + means match the previous character 1 or more times (ie. match any character 1 or more times), the ? modifies the + so that it returns the shortest possible match. Confused? Generally regexps find the longest possible match, which in this case could be not only the rest of the tag, but also everything leading up to the src of the last image in your string.
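A quick sketch of the difference, using a made-up string with two image tags. Only the ? after the + changes between the two calls.

```php
<?php
// Illustrative input: two IMG tags on one line.
$html = '<IMG alt="a" SRC="one.gif"><IMG alt="b" SRC="two.gif">';

// Lazy: (.+?) stops at the first SRC= it can reach.
preg_match('/<IMG(.+?)SRC=/i', $html, $lazy);

// Greedy: (.+) runs on to the last SRC= in the string.
preg_match('/<IMG(.+)SRC=/i', $html, $greedy);

echo $lazy[1], "\n";   // just the attributes of the first tag
echo $greedy[1], "\n"; // swallows the whole first tag as well
```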

Next we have the escaped quotes for the purposes of PHP rather than the regexp.

This is followed by a ?, which means zero or one occurrence of the previous character, so the opening quote is optional (the meaning explained earlier applies only when the ? follows a quantifier such as +, * or even another ?).
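As a small illustration, a ? placed directly after a literal character makes that character optional, which is what lets the pattern cope with both quoted and unquoted SRC values. The two input strings below are made up.

```php
<?php
// "? makes the opening double quote optional (zero or one occurrence).
$pattern = '/SRC="?([^"\' >]+)/i';

preg_match($pattern, '<IMG SRC="a.gif">', $quoted); // attribute in quotes
preg_match($pattern, '<IMG SRC=b.gif>', $bare);     // attribute without quotes

echo $quoted[1], "\n"; // a.gif
echo $bare[1], "\n";   // b.gif
```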

Finally, we have the second parenthesized portion:

([^\"' >]+)

The brackets indicate a set of characters, but the ^ at the front indicates that it is to match anything BUT those characters. Then we have the + indicating one or more occurrences of anything but those characters. The group therefore matches everything after SRC= (and the optional opening quote) up to, but not including, the next double quote, single quote, space or greater-than. The backslash in there is again only PHP's escape for the double quote, not part of the set.
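Here is a short sketch of that negated set on its own, with made-up input. Once PHP has processed the double-quoted string, the set holds four characters (double quote, single quote, space, greater-than), so a backslash in the input is matched like any other character.

```php
<?php
// After PHP parses this string, the regexp is /([^"' >]+)/
$pattern = "/([^\"' >]+)/";

// Stops at the single quote.
preg_match($pattern, "images/top.gif' width=10>", $m);

// A backslash is not in the set, so it is matched; stops at the double quote.
preg_match($pattern, 'images\\top.gif"', $b);

echo $m[1], "\n"; // images/top.gif
echo $b[1], "\n"; // images\top.gif
```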

The last part of this conundrum is the:

foreach ($arrayofimages[2] as $img)

The reason for this is that $arrayofimages is actually a 2-dimensional array. preg_match_all places an array of the matches of the whole regexp in $arrayofimages[0], an array of the matches of the first parenthesized expression in [1], the matches of the second in [2], and so on.
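To make the layout concrete, here is a sketch that runs the pattern over a made-up string with two tags and dumps the result. It relies on the default ordering of preg_match_all (PREG_PATTERN_ORDER).

```php
<?php
// Illustrative input with one quoted and one unquoted SRC.
$html = '<IMG SRC="a.gif"><IMG border=0 SRC=b.jpeg>';

preg_match_all('/<IMG(.+?)SRC="?([^"\' >]+)/i', $html, $arrayofimages);

// [0] => full matches, [1] => first group, [2] => second group
print_r($arrayofimages);

foreach ($arrayofimages[2] as $img) {
    echo $img, "\n"; // a.gif, then b.jpeg
}
```

If one array per match is more convenient, passing PREG_SET_ORDER as the fourth argument groups the results that way instead.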

butcher
Paranoid (IV) Inmate

From: New Jersey, USA
Insane since: Oct 2000

posted 04-23-2001 21:27

Thanks jiblet. I appreciate you taking the time for such a complete response.

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 04-23-2001 22:29

Just to answer the question about the "\$": this is backslashed, like the quotes, for PHP (I'm assuming), not for the regexp. Otherwise PHP will start looking for a variable called "$/i" and probably break.
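A small sketch of what PHP does with the dollar sign inside double quotes. As far as I can tell, in this particular pattern the $ is followed by a slash, which cannot start a variable name, so PHP would leave it alone even without the backslash; the escape is still the safer habit, because a $ followed by a letter does get replaced. The variable $html below is made up to show that case.

```php
<?php
$escaped   = "/html\$/i"; // the backslash is consumed by PHP
$single    = '/html$/i';  // single quotes: no variable parsing at all
$unescaped = "/html$/i";  // "$/" is not a variable name, so it stays literal

var_dump($escaped === $single);   // bool(true)
var_dump($unescaped === $single); // bool(true)

// Where the escape really matters: $ followed by a valid variable name.
$html = 'oops';
$interpolated = "/$html/i";
echo $interpolated, "\n"; // /oops/i
```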


Walking the Earth like Kane

mr.maX
Maniac (V) Mad Scientist

From: Belgrade, Serbia
Insane since: Sep 2000

posted 04-24-2001 07:05

Bitdamaged is right. The "preg*" functions are only wrappers for their counterparts in the PCRE library, and PHP passes them a string that is parsed first by PHP and after that by the PCRE library. So, if in some case you really need to escape a special RegEx character, you'll have to use a double backslash.
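A minimal sketch of the two parsing stages, using a literal dot as the special character (the file names are made up). The double backslash in the PHP source becomes a single backslash by the time PCRE sees the pattern.

```php
<?php
// PHP turns "\\." into \. and "\$" into $ before PCRE gets the pattern.
$pattern = "/\\.html\$/i";

echo $pattern, "\n"; // /\.html$/i

$with_dot    = preg_match($pattern, 'index.html'); // dot present
$without_dot = preg_match($pattern, 'indexhtml');  // no dot before "html"

var_dump($with_dot, $without_dot); // int(1), int(0)
```

Writing the pattern in single quotes ('/\.html$/i') avoids most of the doubling, since PHP only treats \\ and \' specially there.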

Oh and one more thing. RegEx modifier "g" (greedy mode) is always turned on in PHP (you don't have to specify it like "/bla/g")...

jiblet
Paranoid (IV) Inmate

From: Minneapolis, MN, USA
Insane since: May 2000

posted 04-25-2001 18:52

What implementation of Regexps isn't greedy by default? Perl?

mr.maX
Maniac (V) Mad Scientist

From: Belgrade, Serbia
Insane since: Sep 2000

posted 04-25-2001 19:20

Yes, in PERL you have full control over RegExes.

But, if you need "ungreedy" mode in PHP, you can use "U" modifier (i.e. "/bla/U").
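Here is a small sketch of the "U" modifier on a made-up string. In PHP's PCRE functions quantifiers are greedy unless told otherwise; U flips that default for the whole pattern, while a ? after a single quantifier flips it just for that one. (For what it's worth, Perl's /g stands for global matching, which is what preg_match_all does in PHP.)

```php
<?php
// Illustrative input: two bold elements side by side.
$text = '<b>one</b><b>two</b>';

preg_match('/<b>(.+)<\/b>/', $text, $default);   // greedy by default
preg_match('/<b>(.+)<\/b>/U', $text, $ungreedy); // U inverts greediness
preg_match('/<b>(.+?)<\/b>/', $text, $lazy);     // per-quantifier alternative

echo $default[1], "\n";  // one</b><b>two
echo $ungreedy[1], "\n"; // one
echo $lazy[1], "\n";     // one
```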
