Topic awaiting preservation: Perl REGEX problem (Page 1 of 1)


Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 17:52

I have a perl script where I want to search for strings in HTML like:

"<table class=MsoNormalTable border=1 cellspacing=0 cellpadding=0 style='border-collapse:collapse;border:none'> "

and replace them with:

" <table>"

However when I run my perl script on the html it doesn't make the correct replacement....here's the string I'm trying (note - other substitutions are working, ex. "s/<meta.*>//gi;" replaces "<meta http-equiv=Content-Type content="text/html; charset=windows-1252">"):

"s/<table.*>/<table>\n/gi;"

Thanks in advance for the help!

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 07-16-2003 20:20

I can't see why it shouldn't work...

try just replacing '<table' with '' and step off from there?

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 20:46

It's not working either, Petskull. Here's the whole script; maybe seeing the rest of it in context will help with the troubleshooting.

code:

#!/usr/bin/perl

$ARGV[0] =~ /(.*)\.([^\.]*)/;
$outfile = "$1_cleaned.$2";

open (INFILE, "<$ARGV[0]");
open (OUTFILE, ">$outfile");

$outline="";

while($outline=<INFILE> ){

	$outline =~ s/\n/ /gi;                  #removes all linefeed characters
	#$outline =~ s/<span.*>//gi;            #removes open span tags
	#$outline =~ s/<td.*>/<td>/gi;          #removes TD attributes
	#$outline =~ s/&nbsp;//gi;              #removes non breaking spaces
	#$outline =~ s/<b>/<strong>/gi;         #replaces open bold w/ strong
	#$outline =~ s/<\/b>/<\/strong>/gi;     #replaces closing bold w/ closing strong
	#$outline =~ s/<meta.*>//gi;            #removes meta tags
	#$outline =~ s/<div.*>//gi;             #removes div open tags
	#$outline =~ s/<span.*>//gi;            #removes closing span tags
	#$outline =~ s/<\/span>//gi;            #removes closing span tags
	#$outline =~ s/<p .*>/<p>\n/gi;
	#$outline =~ s/<p$*>/<p>\n/gi;
	#$outline =~ s/<br.*>//gi;
	#$outline =~ s/<style.*\s+\S+\n//gi;
	#$outline =~ s/<\/div>//gi;             #removes div close tags
	#$outline =~ s/<body.*>/<bodymatter>\n/gi;
	#$outline =~ s/<html>/<\?xml version\=\"1\.0\" encoding\=\"UTF\-8\"\?>\n<dtbook version\=\"1\.1\.0\">\n/gi;
	#$outline =~ s/<head>/<head>\n/gi;      #inserts CR after <head> tag
	#$outline =~ s/<\/head>/\n<\/head>\n<book>\n/gi;
	#$outline =~ s/<\/title>/<\/title>\n/gi;
	#$outline =~ s/<\/p>/<\/p>\n/gi;
	$outline =~ s/<tab.*>/<table>\n/gi;
	#$outline =~ s/<\/table>/<\/table>\n/gi; #inserts linefeed after closing table tag
	#$outline =~ s/<\/tr>/<\/tr>\n/gi;
	#$outline =~ s/<tr>/<tr>\n/gi;
	#$outline =~ s/<\/td>/<\/td>\n/gi;

	#$outline =~ s///gi;

	print OUTFILE "$outline";
}

print OUTFILE "$outline";
close INFILE;
close OUTFILE;

Emperor
Maniac (V) Mad Scientist with Finglongers

From: Cell 53, East Wing
Insane since: Jul 2001

posted 07-16-2003 20:59

I'm not sure if it is significant, but the script says:

quote:
$outline =~ s/<tab.*>/<table>\n/gi;

and it might be better to do what you have in your first question:

code:

$outline =~ s/<table.*>/<table>\n/gi;

Not that I can see any reason why this is causing your problem.

___________________
Emps

FAQs: Emperor

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 21:51

LOL did you guys notice that the non breaking space above wasn't escaped....

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 07-16-2003 22:49

Um I don't get it this works for me.

.:[ Never resist a perfect moment ]:.

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-16-2003 23:19

I don't get it either...it's friggin driving me crazy

hyperbole
Paranoid (IV) Inmate

From: Madison, Indiana, USA
Insane since: Aug 2000

posted 07-17-2003 18:06

Does the <table ....> tag span more than one line?

If the <table ...> tag looks like

code:

<table
       class=MsoNormalTable
       border=1
       cellspacing=0
       cellpadding=0
       style='border-collapse:collapse;border:none'>

your regular expression won't see it.

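You can correct this by adding the 's' flag to the end of the expression, so that '.' also matches newlines: s/<table.*>/<table>/isg. Note that this only helps if the whole tag is in the string being matched; as long as the script reads one line at a time, the pieces of the tag still arrive in separate strings.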

Also you might want to change the expression you are using so that it will stop at the first '>'. The way the expression is written it will eat the rest of the file to the last '>'.

Try something like s/<table[^>]*>/<table>/isg.

-- not necessarily stoned... just beautiful.
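
To illustrate the multi-line case described above, here is a minimal sketch. The helper names clean_by_line and clean_whole and the sample input are assumptions made up for this example; they are not part of the posted script. It compares running the suggested substitution on each line separately (as the while loop does) with running it once on the whole text.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the substitution to each line separately,
# the way the while(<INFILE>) loop in the posted script does.
sub clean_by_line {
    my ($html) = @_;
    my $out = '';
    for my $piece (split /^/, $html) {    # split into lines, keeping "\n"
        my $line = $piece;
        $line =~ s/<table[^>]*>/<table>/gi;
        $out .= $line;
    }
    return $out;
}

# Hypothetical helper: apply the same substitution to the whole text at once.
# [^>] also matches newlines, so no 's' flag is needed for this pattern.
sub clean_whole {
    my ($html) = @_;
    $html =~ s/<table[^>]*>/<table>/gi;
    return $html;
}

# Made-up sample: an opening tag spread over three lines.
my $sample = "<table\n  class=MsoNormalTable\n  border=1>\n<tr><td>x</td></tr>\n</table>\n";

print clean_by_line($sample);   # the opening tag is left untouched
print clean_whole($sample);     # the opening tag becomes <table>
```

Line by line, no single line contains both '<table' and the closing '>', so nothing matches; on the whole text the tag is found and replaced.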

jiblet
Paranoid (IV) Inmate

From: Minneapolis, MN, USA
Insane since: May 2000

posted 07-17-2003 18:28

Yes, you will want to change all your regexps. Remember they are greedy by default, so .*> will match everything up to the last >. The only reason any of them work right now is because you don't have the 's' modifier at the end of the regexp, so it is doing one line at a time.

s/<table[^>]*>/<table>/isg

is the way I usually do such regexps. Alternatively you could do:

s/<table.*?>/<table>/isg

This simply makes the .* non-greedy (i.e., it matches the shortest possible string instead of the longest possible). Either way, the regular expression still chokes if an attribute contains a > as part of its value (though you would want to encode that as &gt; anyway).

-jiblet
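
The difference between the three forms discussed above can be seen on a single line that holds more than one tag. This is only a sketch; the sub names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, one per pattern discussed in the thread.
sub strip_greedy  { my ($s) = @_; $s =~ s/<table.*>/<table>/gi;    return $s }
sub strip_lazy    { my ($s) = @_; $s =~ s/<table.*?>/<table>/gi;   return $s }
sub strip_negated { my ($s) = @_; $s =~ s/<table[^>]*>/<table>/gi; return $s }

# Made-up sample: several tags on one line.
my $line = q{<table border=1><tr><td class=x>cell</td></tr>};

print strip_greedy($line),  "\n";  # <table>
print strip_lazy($line),    "\n";  # <table><tr><td class=x>cell</td></tr>
print strip_negated($line), "\n";  # <table><tr><td class=x>cell</td></tr>

# The caveat about '>' inside an attribute value applies to both safer forms.
my $tricky = q{<table title="a>b" border=1>};
print strip_lazy($tricky),    "\n";  # <table>b" border=1>
print strip_negated($tricky), "\n";  # <table>b" border=1>
```

The greedy version swallows everything up to the last '>' on the line, which is why it only appears to work when each line holds a single tag.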

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-17-2003 18:58

Hey neighbor...I live in West Lafayette, Indiana!

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 07-18-2003 06:06

hey--- could you post or email me the entire script?

you sure you didn't miss a ';' on a previous line or something...

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-18-2003 14:32

Here's what I have so far. Some of the substitutions work and some don't; everything else in the script is flawless.

code:

#!/usr/bin/perl

$ARGV[0] =~ /(.*)\.([^\.]*)/;
$outfile = "$1_cleaned.$2";

open (INFILE, "<$ARGV[0]");
open (OUTFILE, ">$outfile");

$outline="";

while($outline=<INFILE> ){

	$outline =~ s/\n/ /gi;
	$outline =~ s/<body.*?>/<bodymatter>\n/gi;
	$outline =~ s/<p class=footnote>/<footnote>/gi;
	$outline =~ s/<p.*?>/<p>/gi;
	$outline =~ s/<b>/<strong>/gi;
	$outline =~ s/<\/b>/<\/strong>/gi;
	$outline =~ s/<td.*>/<td>/gi;
	$outline =~ s/&nbsp;//gi;
	$outline =~ s/<meta.*>//gi;
	$outline =~ s/<div.*?>//gi;
	$outline =~ s/<span.*?>/[0]/ig;
	$outline =~ s/<\/span>//gi;
	$outline =~ s/<\/body>/<\/bodymatter>/gi;
	$outline =~ s/<\/p>/<\/p>\n/gi;
	$outline =~ s/<br.*?>//gi;
	$outline =~ s/<+\s*STYLE(.*?)>+.+<+\s*\/STYLE(.*?)>+//gis;
	$outline =~ s/<\/div>//gi;
	$outline =~ s/<html>/<\?xml version\=\"1\.0\" encoding\=\"UTF\-8\"\?>\n<dtbook version\=\"1\.1\.0\">\n/gi;
	$outline =~ s/<head>/<head>\n/gi;
	$outline =~ s/<\/head>/\n<\/head>\n<book>\n/gi;
	$outline =~ s/<\/title>/<\/title>\n/gi;
	$outline =~ s/<tab.*>/<table>\n/gi;
	$outline =~ s/<\/table>/<\/table>\n/gi;
	$outline =~ s/<\/tr>/<\/tr>\n/gi;
	$outline =~ s/<tr>/<tr>\n/gi;
	$outline =~ s/<\/td>/<\/td>\n/gi;

	#$outline =~ s///gi;

	print OUTFILE "$outline";
}

print OUTFILE "$outline";
close INFILE;
close OUTFILE;

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 07-21-2003 02:03

how about a typical infile that you would use?


Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 08-07-2003 17:57

To fix the problems I've been having with line breaks, I strip out all \n at the beginning of my script and add them back in later to fix formatting. I've read that this is a really common practice among Perl programmers.
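
A rough sketch of that strip-then-restore approach, assuming the whole document is already held in one string. The sub name normalize_breaks and the choice of which closing tags get a line break are assumptions for illustration only; they are not taken from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: join the text into one line, clean the tags,
# then put line breaks back after selected closing tags.
sub normalize_breaks {
    my ($html) = @_;
    $html =~ s/\r?\n/ /g;                        # every line break becomes a space
    $html =~ s/<table[^>]*>/<table>/gi;          # tags can no longer be split across lines
    $html =~ s{(</(?:p|tr|td|table)>)}{$1\n}gi;  # restore a break after these closing tags
    return $html;
}

# Made-up sample input.
my $sample = "<table\n border=1>\n<tr><td>x</td></tr>\n</table>\n";
print normalize_breaks($sample);
```

Note that this only works if the line breaks are removed from the complete text before the tag substitutions run; removing them one input line at a time still leaves a split tag in separate strings.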

Piper
Paranoid (IV) Inmate

From: California
Insane since: Jun 2000

posted 08-08-2003 15:45

One thing you might consider doing is reading in the entire file before you run it through your regexes:

code:

use Fcntl qw/:flock/;

# Read in the entire file.  Unless you are working with files that
# are several MB's in size, this will be *much* faster.
open  HTML, "< $infile" or die "Can't read open ($infile): $!";
flock HTML, LOCK_SH;
read  HTML, my $html, -s HTML;
close HTML;

# run your regexes on $html here

# Write your parsed html to a file
open  HTML, "> $outfile" or die "Can't write open ($outfile): $!";
print HTML $html;
close HTML;

Regexes are expensive. This way you will only be running your regexes once per file instead of once per line of the file.

~Charlie

[This message has been edited by Piper (edited 08-08-2003).]
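
For comparison, here is a variant of the same whole-file idea using lexical filehandles, three-argument open and an undefined $/ (the input record separator) to read everything in one go. This is a sketch rather than a drop-in replacement: the sub names slurp, spew and clean_file are made up for this example, the file locking from the code above is left out, and only the one table substitution is shown.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file into one string.
sub slurp {
    my ($path) = @_;
    open my $in, '<', $path or die "Can't open $path for reading: $!";
    local $/;                      # no record separator: <$in> reads to end of file
    my $text = <$in>;
    close $in;
    return defined $text ? $text : '';
}

# Hypothetical helper: write a string to a file, replacing its contents.
sub spew {
    my ($path, $text) = @_;
    open my $out, '>', $path or die "Can't open $path for writing: $!";
    print {$out} $text;
    close $out or die "Can't close $path: $!";
    return;
}

# Hypothetical driver: run the substitutions once over the whole document.
sub clean_file {
    my ($infile, $outfile) = @_;
    my $html = slurp($infile);
    $html =~ s/<table[^>]*>/<table>/gi;   # add the other substitutions here
    spew($outfile, $html);
    return;
}

# Example use, assuming two file names are given on the command line.
clean_file($ARGV[0], $ARGV[1]) if @ARGV == 2;
```

Because the whole document is in one string, a tag that spans several lines is matched like any other.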