
Topic awaiting preservation: Regexp reverse-templating (Page 1 of 1)

 
Hub-izer
Bipolar (III) Inmate

From: The little green dot at the center of your monitor
Insane since: Jul 2003

posted 12-02-2003 23:22

Hello,

I'm trying to extract data from a string. For example, I deal with the string:

"<title>my site</title>

<span class='news'>news item 1</span>
<span class='news'>news item 2</span>

<span>random fact</span><br>orangutans smell<br>
<span>random map</span><br>oceania<br>"

I am looking for a regexp that will find chunks of data inside or between other chunks of data, and extract all instances of it for me. For example, if I feed "<title>{title_data}</title>" to it, it will return "my site" in "title_data". Or if I feed it "<span>{rnd_fact}</span><br>{fact_text}<br>" then it will find me "orangutans smell" in "fact_text[0]" and "oceania" in "fact_text[1]", etc. I know there's a way to do this; in fact, PHP's regexp functions will do the work of setting the appropriate variables into arrays and stuff, but I don't know much about regexps or how I would accomplish this.
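
A minimal sketch of the reverse-templating idea described above, assuming placeholders are written as {name} with names made of letters, digits and underscores. The helper names template_to_regex and scrape are hypothetical, not part of PHP. Each placeholder becomes a lazy named group, and everything else in the template is escaped and matched literally.

```php
<?php
// Hypothetical helper: turns a template such as "<title>{title_data}</title>"
// into a PCRE pattern with one named group per {placeholder}.
function template_to_regex(string $template): string
{
    // Split on {name}, keeping the captured names in the result.
    // Even indexes are literal text, odd indexes are placeholder names.
    $parts = preg_split('/\{(\w+)\}/', $template, -1, PREG_SPLIT_DELIM_CAPTURE);
    $regex = '';
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Literal text: escape regex metacharacters and the delimiter.
            $regex .= preg_quote($part, '/');
        } else {
            // Placeholder: capture as little as possible under its name.
            $regex .= '(?P<' . $part . '>.*?)';
        }
    }
    // s lets "." span newlines, i ignores case.
    return '/' . $regex . '/si';
}

// Hypothetical helper: returns array('name' => array of every match).
function scrape(string $template, string $html): array
{
    preg_match_all(template_to_regex($template), $html, $matches);
    // Keep only the named groups, dropping the numeric keys.
    return array_filter($matches, 'is_string', ARRAY_FILTER_USE_KEY);
}

$html = "<title>my site</title>\n\n"
      . "<span class='news'>news item 1</span>\n"
      . "<span class='news'>news item 2</span>\n\n"
      . "<span>random fact</span><br>orangutans smell<br>\n"
      . "<span>random map</span><br>oceania<br>";

$title = scrape('<title>{title_data}</title>', $html);
$facts = scrape('<span>{rnd_fact}</span><br>{fact_text}<br>', $html);

print_r($title); // title_data holds "my site"
print_r($facts); // fact_text holds "orangutans smell" and "oceania"
```

Because the groups are lazy, a placeholder at the very end of a template would match nothing, so this sketch only suits templates that end in literal text.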

In effect, I am looking for a good screen scraper in PHP, but I have exhaustively searched and not found a proper one.

Thanks,
Hub-izer



bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 12-03-2003 00:18

Well here's a quickie
http://www.bitdamaged.com/testpages/regtest/source.phps

You can see the results here
http://www.bitdamaged.com/testpages/regtest/


The regular expression I used was

/>([^<]+)</i

The break down is this

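/ ... /i <-- the slashes are the delimiters, and the trailing i makes the match case-insensitive
> <-- look for a closing bracket
( <-- creates a group
[^<] <-- look for anything that's not a "<"
+ <-- look for those anything-not-a-"<" characters one or more times
) <-- close the group
< <-- look for an opening bracket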

Basically it just uses the preg_match_all function, which stores matches into an array of arrays called $matches (actually I named it matches, you can call it whatever), where $matches[0] is everything that matched my entire expression. So what's in there would look like

>something<

$matches[1] is an array of everything I matched within my parentheses so that would be

something


This would do a lot of the grunt work for you. It can just get more complicated from here.
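
A short sketch of that approach on two lines of the sample data, assuming the input sits in a variable called $html (a name chosen here for illustration). It also shows one catch: the newline between two tags sits between a ">" and a "<" too, so whitespace-only captures need filtering out.

```php
<?php
$html = "<span class='news'>news item 1</span>\n<span class='news'>news item 2</span>";

// Same pattern as above: the text between a ">" and the next "<".
preg_match_all('/>([^<]+)</i', $html, $matches);

// $matches[0] holds the full matches, brackets included.
// $matches[1] holds only what the parentheses captured.
// Trim each capture and drop the ones that were only whitespace.
$texts = array_values(array_filter(array_map('trim', $matches[1]), 'strlen'));

print_r($texts); // "news item 1" and "news item 2"
```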



.:[ Never resist a perfect moment ]:.

Skaarjj
Maniac (V) Mad Scientist

From: :morF
Insane since: May 2000

posted 12-03-2003 00:41

For this you could do a preg_replace_callback, which allows you to use one regex function to match the whole string, then pass it to a function wherein you can use more regex to match what's inside the string.

PHP->PCRE Pattern Syntax
PHP->preg_replace_callback()

so, for example, say you store the text you're feeding into it in the variable $input and the initial regex function in $search, for the title search you would do:

code:
$input = '<title>Welcome to my site</title>';
$search = '/(<title>)(.*)(<\/title>)/Ui'; // U = ungreedy, i = case-insensitive
$title_data = preg_replace_callback($search, 'my_callback_function', $input);



There...see how easy that was? Now, of course, since you're defining your own function, it must be defined in your script for preg_replace_callback to find it. So above the regex piece I just showed you, you'd put:

code:
function my_callback_function($matches) // feel free to name this anything you want, but don't forget to change the name in the regex function too
{
    $input = $matches[2]; // the text between the title tags
    $search = '/(my site)/Ui';
    // '$1' puts the matched text back unchanged; swap in whatever replacement you need.
    // You could, of course, put another callback in here for more complicated searches.
    $final_data = preg_replace($search, '$1', $input);
    return $final_data;
}



Now, in there I've used the 'matches' array. This is the array of data that the callback sends on to the function you define. It's arranged like this:

$matches[0] is the complete set of matched data "<title>Welcome to my site</title>"
$matches[1] is the first search term returned "<title>"
$matches[2] is the second search term returned "Welcome to my site"
$matches[3] is the third search term returned "</title>"

Of course, you will have more items in your array the more search terms you have. In regex, the search terms are the parts with () around them. They hold either literal data, like "(<title>)", or a wildcard with a quantifier, like "(.*)", which here (with the ungreedy U modifier) means 'return everything up to the contents of the next search term'.
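
If all you want is the captured text rather than a rewritten string, preg_match fills $matches with the same layout and no callback is needed. A minimal sketch, reusing the $input and $title_data names from the example above:

```php
<?php
$input = '<title>Welcome to my site</title>';

// Three groups: opening tag, contents, closing tag.
// The U modifier makes .* ungreedy, the i modifier ignores case.
$found = preg_match('/(<title>)(.*)(<\/title>)/Ui', $input, $matches);

// $matches[0] is '<title>Welcome to my site</title>'
// $matches[1] is '<title>'
// $matches[2] is 'Welcome to my site'
// $matches[3] is '</title>'
$title_data = $found ? $matches[2] : null;

echo $title_data, "\n";
```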

Well, that's about it in a nutshell...if you have any questions, feel free to ask.

[This message has been edited by Skaarjj (edited 12-03-2003).]

Perfect Thunder
Paranoid (IV) Inmate

From: Milwaukee
Insane since: Oct 2001

posted 12-03-2003 12:15

Good info Skaarjj -- what with your work on the Grail, you're quickly becoming a regex go-to man!

Cell 1250 :: alanmacdougall.com :: Illustrator tips

Skaarjj
Maniac (V) Mad Scientist

From: :morF
Insane since: May 2000

posted 12-03-2003 17:05

Oooh...I've always wanted to be a 'man' with some word tacked onto the start

nah seriously...thanks PT...it means a lot

Hub-izer
Bipolar (III) Inmate

From: The little green dot at the center of your monitor
Insane since: Jul 2003

posted 12-03-2003 18:40

Thanks, Skaarjj!
http://students.washington.edu/jbarbero/public/concept/scraper/source.php
That I think is identical to what you showed me, in essence. Do you know why it produces the error? http://students.washington.edu/jbarbero/public/concept/scraper/scraper.php


Hub-izer
Bipolar (III) Inmate

From: The little green dot at the center of your monitor
Insane since: Jul 2003

posted 12-03-2003 21:00

Unfortunately, I don't know much about regular expressions, so I can't debug this one, but here's your code: http://students.washington.edu/jbarbero/public/concept/scraper/testsrc.php
and here's what it does: http://students.washington.edu/jbarbero/public/concept/scraper/test.php

Hub-izer

Hub-izer
Bipolar (III) Inmate

From: The little green dot at the center of your monitor
Insane since: Jul 2003

posted 12-16-2003 21:59

Generic PHP scraper class, pretty darn good now, HTTP extension kinda slow for now. http://students.washington.edu/jbarbero/public/scraper/scraper_src.php
