Topic awaiting preservation: Perl REGEX problem

 
Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 17:52

I have a perl script where I want to search for strings in HTML like:

"<table class=MsoNormalTable border=1 cellspacing=0 cellpadding=0 style='border-collapse:collapse;border:none'> "

and replace them with:

" <table>"

However, when I run my Perl script on the HTML, it doesn't make the correct replacement. Here's the substitution I'm trying (note: other substitutions are working, e.g. "s/<meta.*>//gi;" removes "<meta http-equiv=Content-Type content="text/html; charset=windows-1252">"):

"s/<table.*>/<table>\n/gi;"

Thanks in advance for the help!

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 07-16-2003 20:20

I can't see why it shouldn't work...

try just replacing '<table' with '' and step off from there?

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 20:46

It's not working either, Petskull. Here's the whole script; maybe the context of the rest of the script will help with the troubleshooting.

code:
#!/usr/bin/perl

$ARGV[0] =~ /(.*)\.([^\.]*)/;
$outfile = "$1_cleaned.$2";

open (INFILE, "<$ARGV[0]");
open (OUTFILE, ">$outfile");

$outline="";

while($outline=<INFILE> ){


$outline =~ s/\n/ /gi; #removes all linefeed characters
#$outline =~ s/<span.*>//gi; #removes open span tags
#$outline =~ s/<td.*>/<td>/gi; #removes TD attributes
#$outline =~ s/&nbsp;//gi; #removes non breaking spaces
#$outline =~ s/<b>/<strong>/gi; #replaces open bold w/ strong
#$outline =~ s/<\/b>/<\/strong>/gi; #replaces closing bold w/ closing strong
#$outline =~ s/<meta.*>//gi; #removes meta tags
#$outline =~ s/<div.*>//gi; #removes div open tags
#$outline =~ s/<span.*>//gi; #removes closing span tags
#$outline =~ s/<\/span>//gi; #removes closing span tags
#$outline =~ s/<p .*>/<p>\n/gi;
#$outline =~ s/<p$*>/<p>\n/gi;
#$outline =~ s/<br.*>//gi;
#$outline =~ s/<style.*\s+\S+\n//gi;
#$outline =~ s/<\/div>//gi; #removes div close tags
#$outline =~ s/<body.*>/<bodymatter>\n/gi;
#$outline =~ s/<html>/<\?xml version\=\"1\.0\" encoding\=\"UTF\-8\"\?>\n<dtbook version\=\"1\.1\.0\">\n/gi;
#$outline =~ s/<head>/<head>\n/gi; #inserts CR after <head> tag
#$outline =~ s/<\/head>/\n<\/head>\n<book>\n/gi;
#$outline =~ s/<\/title>/<\/title>\n/gi;
#$outline =~ s/<\/p>/<\/p>\n/gi;
$outline =~ s/<tab.*>/<table>\n/gi;
#$outline =~ s/<\/table>/<\/table>\n/gi; #inserts linefeed after closing table tag
#$outline =~ s/<\/tr>/<\/tr>\n/gi;
#$outline =~ s/<tr>/<tr>\n/gi;
#$outline =~ s/<\/td>/<\/td>\n/gi;



#$outline =~ s///gi;


print OUTFILE "$outline";
}

print OUTFILE "$outline";
close INFILE;
close OUTFILE;



Emperor
Maniac (V) Mad Scientist with Finglongers

From: Cell 53, East Wing
Insane since: Jul 2001

posted 07-16-2003 20:59

I'm not sure if it is significant, but it says:

quote:
$outline =~ s/<tab.*>/<table>\n/gi;



and it might be better to do what you have in your first question:

code:
$outline =~ s/<table.*>/<table>\n/gi;



Not that I can see any reason why this is causing your problem.

___________________
Emps

FAQs: Emperor

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 21:51

LOL did you guys notice that the non breaking space above wasn't escaped....

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 07-16-2003 22:49

Um I don't get it this works for me.



.:[ Never resist a perfect moment ]:.

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 23:19

I don't get it either...it's friggin driving me crazy

hyperbole
Paranoid (IV) Inmate

From: Madison, Indiana, USA
Insane since: Aug 2000

posted 07-17-2003 18:06

Does the <table ....> tag span more than one line?

If the <table ...> tag looks like

code:
<table
class=MsoNormalTable
border=1
cellspacing=0
cellpadding=0
style='border-collapse:collapse;border:none'>



your regular expression won't see it.

You can correct this by adding the 's' flag to the end of the expression: s/<table.*>/<table>/isg

Also you might want to change the expression you are using so that it will stop at the first '>'. The way the expression is written it will eat the rest of the file to the last '>'.

Try something like s/<table[^>]*>/<table>/isg.
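One caveat worth spelling out: the 's' flag only changes what '.' matches, so it helps only when the whole tag sits inside a single string. A script that reads one line at a time never has the full multi-line tag in $outline. The sketch below illustrates this on a made-up sample (the $html text and variable names are assumptions for illustration, not taken from the real input file):

```perl
use strict;
use warnings;

# Hypothetical sample input: one <table ...> tag broken across lines.
my $html = <<'END';
<p>before</p>
<table
class=MsoNormalTable
border=1>
<tr><td>cell</td></tr>
END

# Line by line: no single line holds both "<table" and ">", so nothing matches.
my $by_line = join '',
    map { (my $l = $_) =~ s/<table.*>/<table>/i; $l } split /^/, $html;

# Whole string: [^>]* also matches newlines and stops at the first ">".
(my $whole = $html) =~ s/<table[^>]*>/<table>/gi;

print $by_line eq $html ? "line-by-line: unchanged\n" : "line-by-line: changed\n";
print $whole;
```

Because a negated character class such as [^>] matches newlines on its own, this form does not even need the 's' flag once the tag is in one string.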




-- not necessarily stoned... just beautiful.

jiblet
Paranoid (IV) Inmate

From: Minneapolis, MN, USA
Insane since: May 2000

posted 07-17-2003 18:28

Yes, you will want to change all your regexps. Remember they are greedy by default, so .*> will match everything up to the last >. The only reason any of them work right now is because you don't have the 's' modifier at the end of the regexp, so it is doing one line at a time.

s/<table[^>]*>/<table>/isg

is the way I usually do such regexps. Alternatively you could do:

s/<table.*?>/<table>/isg

Which simply makes the .* non-greedy (i.e. it matches the shortest possible string instead of the longest possible). Either way the regular expression still chokes if an attribute contains a > as part of its value (though you would want to encode that as &gt; anyway).
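A minimal sketch of the difference, run on a made-up single line (the $line string is an assumption for illustration):

```perl
use strict;
use warnings;

# Hypothetical one-line input with more tags after the <table ...> tag.
my $line = '<table border=1 cellpadding=0><tr><td>x</td></tr>';

(my $greedy = $line) =~ s/<table.*>/<table>/i;     # .* runs to the LAST ">"
(my $lazy   = $line) =~ s/<table.*?>/<table>/i;    # .*? stops at the FIRST ">"
(my $negcls = $line) =~ s/<table[^>]*>/<table>/i;  # [^>]* cannot cross a ">"

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "negcls: $negcls\n";
```

The greedy version swallows the row and cell along with the tag, leaving only "<table>", while the other two keep everything after the first ">".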

-jiblet

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-17-2003 18:58

Hey neighbor...I live in West Lafayette, Indiana!

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 07-18-2003 06:06

hey--- could you post or email me the entire script?

you sure you didn't miss a ';' on a previous line or something...

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-18-2003 14:32

Here's what I have so far. Some of the substitutions work and some don't; everything else in the script is flawless.


code:
#!/usr/bin/perl


$ARGV[0] =~ /(.*)\.([^\.]*)/;
$outfile = "$1_cleaned.$2";

open (INFILE, "<$ARGV[0]");
open (OUTFILE, ">$outfile");

$outline="";

while($outline=<INFILE> ){


$outline =~ s/\n/ /gi;
$outline =~ s/<body.*?>/<bodymatter>\n/gi;
$outline =~ s/<p class=footnote>/<footnote>/gi;
$outline =~ s/<p.*?>/<p>/gi;
$outline =~ s/<b>/<strong>/gi;
$outline =~ s/<\/b>/<\/strong>/gi;
$outline =~ s/<td.*>/<td>/gi;
$outline =~ s/&nbsp;//gi;
$outline =~ s/<meta.*>//gi;
$outline =~ s/<div.*?>//gi;
$outline =~ s/<span.*?>/[0]/ig;
$outline =~ s/<\/span>//gi;
$outline =~ s/<\/body>/<\/bodymatter>/gi;
$outline =~ s/<\/p>/<\/p>\n/gi;
$outline =~ s/<br.*?>//gi;
$outline =~ s/<+\s*STYLE(.*?)>+.+<+\s*\/STYLE(.*?)>+//gis;
$outline =~ s/<\/div>//gi;
$outline =~ s/<html>/<\?xml version\=\"1\.0\" encoding\=\"UTF\-8\"\?>\n<dtbook version\=\"1\.1\.0\">\n/gi;
$outline =~ s/<head>/<head>\n/gi;
$outline =~ s/<\/head>/\n<\/head>\n<book>\n/gi;
$outline =~ s/<\/title>/<\/title>\n/gi;
$outline =~ s/<tab.*>/<table>\n/gi;
$outline =~ s/<\/table>/<\/table>\n/gi;
$outline =~ s/<\/tr>/<\/tr>\n/gi;
$outline =~ s/<tr>/<tr>\n/gi;
$outline =~ s/<\/td>/<\/td>\n/gi;



#$outline =~ s///gi;


print OUTFILE "$outline";
}

print OUTFILE "$outline";
close INFILE;
close OUTFILE;



Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 07-21-2003 02:03

how about a typical infile that you would use?


Code - CGI - links - DHTML - Javascript - Perl - programming - Magic - http://www.twistedport.com
ICQ: 67751342

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 08-07-2003 17:57

To fix the problems I've been having with line breaks, I strip out all \n at the beginning of my script and add them back in later to fix the formatting. I've read that this is a really common practice among Perl programmers.

Piper
Paranoid (IV) Inmate

From: California
Insane since: Jun 2000

posted 08-08-2003 15:45

One thing you might consider doing is reading in the entire file before you run it through your regexes:

code:
use Fcntl qw/:flock/;

# $infile and $outfile are assumed to be set earlier in the script.

# Read in the entire file. Unless you are working with files that
# are several MB in size, this will be *much* faster.
open my $in, '<', $infile or die "Can't open $infile for reading: $!";
flock $in, LOCK_SH or die "Can't lock $infile: $!";
read $in, my $html, -s $in;
close $in;

# run your regexes on $html here

# Write your parsed html to a file
open my $out, '>', $outfile or die "Can't open $outfile for writing: $!";
print {$out} $html;
close $out or die "Can't close $outfile: $!";



Regexes are expensive. This way you will only be running your regexes once per file instead of once per line of the file.
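For comparison, here is a rough sketch of the same whole-file idea using the common `local $/` slurp idiom. The input is a made-up string read through an in-memory filehandle so the example runs without a file on disk; $source and its contents are assumptions for illustration only:

```perl
use strict;
use warnings;

# Hypothetical input, held in memory instead of a real file.
my $source = "<html>\n<table\nborder=1>\n<tr><td>a</td></tr>\n</table>\n</html>\n";

# Opening a reference to a scalar gives an in-memory filehandle.
open my $in, '<', \$source or die "Can't open in-memory file: $!";
my $html = do { local $/; <$in> };   # with $/ undefined, <$in> returns everything
close $in;

# The substitution now runs once over the whole document, so a tag
# that spans several lines is seen in one piece.
$html =~ s/<table[^>]*>/<table>/gi;

print $html;
```

With a real file, the open call would take the filename in place of \$source; the rest stays the same.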

~Charlie

[This message has been edited by Piper (edited 08-08-2003).]
