Preserved Topic: Parsing HTML/Regexp Shortcomings (Page 1 of 1)

jiblet
Paranoid (IV) Inmate

From: Minneapolis, MN, USA
Insane since: May 2000

posted 07-02-2002 00:31

So I am working on my new templating system, and one of the things I want to be able to do is automatically translate something like:

<div class="foobar">xxxxxxxx</div>

into:

<table yyyy><tr><td>xxxxxxxx</td></tr></table>

The purpose is to move up to standards-compliant design without sacrificing Netscape 4 compatibility.

Now I'm very proficient with regexps, so I have the tools to solve the problem. The only difficulty is, of course, dealing with nested tags. XML parsers handle this problem beautifully, but from what I can tell, they don't offer a good way to change the tags themselves. I was wondering if any of you have any ideas about an efficient way to deal with this conundrum (in PHP). It's safe to assume this document will be well-formed since I hope for all these pages to validate as XHTML 1.0 Transitional.

Currently the best idea I can come up with is searching for something like
/<div [^>]*class="[^"]*"[^>]*>.*?<\/div>/si

Then, taking the matched string, I would search it for internal occurrences of '<div' and loop on that count, building a new regexp that searches for the appropriate number of closing </div> tags. Seems like there should be a better way...
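
For illustration, here is a minimal sketch of why the pattern above needs that extra pass. The sample markup in $html is hypothetical; the pattern is the one quoted in the post. The lazy .*? stops at the first </div> it sees, which belongs to the inner <div>, so the match is cut short.

```php
<?php
// Hypothetical sample markup: a classed <div> that contains a nested <div>.
$html = '<div class="foobar">outer <div>inner</div> tail</div>';

// The pattern from the post, unchanged.
$pattern = '/<div [^>]*class="[^"]*"[^>]*>.*?<\/div>/si';

preg_match($pattern, $html, $m);

// The match ends at the inner closing tag, not the outer one.
echo $m[0], "\n";

// Every additional '<div' inside the match means one more closing tag to skip.
$extra = substr_count($m[0], '<div') - 1;
echo $extra, "\n";
```

With deeper nesting the count grows, and the rebuilt regexp has to be re-run each time it swallows further opening tags.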


[This message has been edited by jiblet (edited 07-02-2002).]

Tyberius Prime
Paranoid (IV) Mad Scientist with Finglongers

From: Germany
Insane since: Sep 2001

posted 07-02-2002 08:34

Hm, maybe you should give XSLT (or something along those lines... style sheets something) a try. It's specifically designed to translate one kind of XML into another kind of XML, provided, of course, your source is in XHTML. So you can say: replace <div> tags which have class XXX with <td>XXX</td>. But look it up, I never worked with it :-)

so long,

Tyberius Prime

Maniac (V) Mad Scientist

From: Belgrade, Serbia
Insane since: Sep 2000

posted 07-02-2002 20:49

As TP said, you should investigate the possibilities of XSLT...

XSLT (Extensible Stylesheet Language (XSL) Transformations) specification:
PHP XSLT extension:
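
As a rough idea of what such a transformation could look like, here is a sketch. It assumes a current PHP build with the DOM and XSL extensions loaded (the DOMDocument and XSLTProcessor classes), which is not necessarily the extension linked above. The function name divs_to_tables is made up for this example, and the class name "foobar" is borrowed from the first post. The input is a bare fragment with no XHTML namespace; a real XHTML document declares one, and the match patterns would then need a namespace prefix.

```php
<?php
// Assumption: PHP with the DOM and XSL extensions available (XSLTProcessor).
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <!-- Identity template: copy anything no other template matches. -->
  <xsl:template match="@*|node()">
    <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
  </xsl:template>
  <!-- Rewrite classed divs as a one-cell table; nested divs are handled recursively. -->
  <xsl:template match="div[@class='foobar']">
    <table><tr><td><xsl:apply-templates select="node()"/></td></tr></table>
  </xsl:template>
</xsl:stylesheet>
XSL;

// Hypothetical helper: returns the transformed markup, or null if XSLT is unavailable or fails.
function divs_to_tables(string $xml, string $xsl): ?string
{
    if (!class_exists('DOMDocument') || !class_exists('XSLTProcessor')) {
        return null; // extensions not compiled into this PHP build
    }
    $style = new DOMDocument();
    $style->loadXML($xsl);
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $proc = new XSLTProcessor();
    $proc->importStylesheet($style);
    $out = $proc->transformToXml($doc);
    return is_string($out) ? $out : null;
}

$input = '<body><div class="foobar">a<div class="foobar">b</div></div><div>c</div></body>';
$result = divs_to_tables($input, $xsl);
echo $result === null ? "XSL extension not available\n" : $result;
```

Because the processor works on a parsed tree, nesting is handled for free: the inner classed div is rewritten by the same template, and the unclassed div is copied through untouched.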

jiblet
Paranoid (IV) Inmate

From: Minneapolis, MN, USA
Insane since: May 2000

posted 07-03-2002 21:29

Good idea, guys. I checked it out and it looks very promising. The only problems are that a) it is not currently compiled into our PHP build, so it would require our tech's assistance, and b) I want this code to be portable, since I plan on releasing the source at some time in the future. Nevertheless, XSLT seems like a very important technology to be familiar with.

It turned out to be a pretty interesting problem to solve. As long as we are searching for simple strings and not regexps, I came up with what I think is a pretty cool solution. Check out the code; maybe someone will find it useful:

//Make transformations based on browser specs.
preg_match("/<body[^>]*>(.*)<\/body>/s", $content, $matches);
$body = $matches[1];

//This block locates each <div>'s starting and ending points so that they can be substr'ed out and regex'ed accordingly.
//This code assumes well-formed XHTML code, and should be easily portable to handle other tags.
$open_divs = array();
$close_divs = array();
$open_offset = 0;
$close_offset = 0;
$pos = true;
$num_divs = 0;

//Tally locations of opening and closing <div> tags (as well as the end of opening tags since they are variable length)
//strpos() returns 0 for a match at the very start of the string, so compare with !== false rather than != false.
while ($pos !== false) {
    $pos = strpos($body,'<div',$open_offset);
    if ($pos !== false) {
        $open_divs[$num_divs]['open'] = $pos;
        $open_offset = $pos+3;
        $open_divs[$num_divs]['tag_end'] = strpos($body,'>',$open_offset);
    }
    $pos = strpos($body,'</div',$close_offset);
    if ($pos !== false) {
        $close_divs[$num_divs] = $pos;
        $close_offset = $pos+4;
    }
    $num_divs++;
}
//$num_divs gets incremented even on the last loop when search fails, so it must be decremented once.
$num_divs--;

//For each closing div, match it up with its opening div.
//We loop through the opening tags until we find that the next opening tag begins after the current closing tag.
//Then we know that the previous opening tag matches the current closing tag and we make the assignment.
foreach ($close_divs as $this_closer) {

    //Find the first available opening tag
    $current = 0;
    while(isset($open_divs[$current]['close'])) {
        $current++;
    }

    //Loop through until we find the correct tag and assign the closing tag to it.
    $assigned = false;
    while(!$assigned) {
        //Look for the next opening tag that has not been paired yet.
        $next = $current + 1;
        $next_exists = false;
        while (!$next_exists && $next < $num_divs) {
            if (!isset($open_divs[$next]['close'])) {
                $next_exists = true;
            } else {
                $next++;
            }
        }
        if(!$next_exists || $open_divs[$next]['open'] > $this_closer) {
            $open_divs[$current]['close'] = $this_closer;
            $assigned = true;
        } else {
            $current = $next;
        }
    }
}
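
For comparison, the same pairing can be sketched in a single pass with a stack. This is a different technique from the loop above, not the code from the post: it uses preg_match_all only to list the tags in document order, pushes each opening tag, and pops the most recent one at each closing tag. The function name pair_divs is made up for this example; the 'open', 'tag_end' and 'close' keys mirror the ones used above. Like the original, it assumes well-formed markup.

```php
<?php
// Hypothetical helper: pair every opening <div ...> with its closing </div>.
// Returns a list of ['open' => int, 'tag_end' => int, 'close' => int|null],
// ordered by the offset of the opening tag.
function pair_divs(string $body): array
{
    // List every opening or closing div tag together with its byte offset.
    preg_match_all('/<(\/?)div\b[^>]*>/i', $body, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $pairs = [];
    $stack = []; // indexes into $pairs for divs that are still open
    foreach ($tags as $tag) {
        [$text, $offset] = $tag[0];
        if ($tag[1][0] === '') {
            // Opening tag: record where it starts and where its '>' sits.
            $pairs[] = [
                'open'    => $offset,
                'tag_end' => $offset + strlen($text) - 1,
                'close'   => null,
            ];
            $stack[] = count($pairs) - 1;
        } elseif ($stack) {
            // Closing tag: it belongs to the most recently opened div.
            $pairs[array_pop($stack)]['close'] = $offset;
        }
    }
    return $pairs;
}

$body = '<div class="a">x<div>y</div>z</div><div>w</div>';
foreach (pair_divs($body) as $p) {
    echo $p['open'], ' ', $p['tag_end'], ' ', $p['close'], "\n";
}
```

With the offsets in hand, each div's contents can be pulled out with substr($body, $p['tag_end'] + 1, $p['close'] - $p['tag_end'] - 1) before rewriting the tags.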

