
Topic awaiting preservation: Pattern Matching a URL

 
WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 01-28-2003 02:07

I am having a bit of trouble coming up with a Perl-style pattern to match a generic URL.

The basic idea is to be able to do some generic link checking of all URLs on a page, which is giving me a little bit of difficulty. I assume that the pattern must have been produced already, but after hours of googling I have not been able to find it, so I am trying it on my own. There are a number of different link types I can come up with, which makes it even harder. For right now I am going to restrict myself to the http:// protocol, as I figure a simple "or" for the different protocols would be easy to add later.

The rule I am working on currently is rather simple:
*]http://[a-zA-Z1-9._/#]*

Which will match URLs that begin with http://, but misses those URLs that begin with a file name such as something.html or /something.html, or even something as odd as #somewhere.

I am going to keep searching, but I was wondering if anyone has come across and saved a reference to something similar.

Note: I am not looking for Perl code in order to do this, since I do not want it to be language specific.
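
A minimal sketch of the limitation described above. Python is used only for illustration, since the pattern itself is not tied to any language. The rule below is an assumed reading of the one in the post (with the digit range widened to 0-9), and the sample hrefs are made up.

```python
import re

# Assumed reading of the rule above: "http://" followed by a run of
# letters, digits, dots, underscores, slashes and hashes.
HTTP_ONLY = re.compile(r"http://[a-zA-Z0-9._/#]*")

# Hypothetical href values of the kinds mentioned in the post.
hrefs = [
    "http://www.example.com/index.html",
    "something.html",
    "/something.html",
    "#somewhere",
]

def split_by_match(values):
    """Return (matched, missed) lists for the http-only rule."""
    matched = [v for v in values if HTTP_ONLY.fullmatch(v)]
    missed = [v for v in values if not HTTP_ONLY.fullmatch(v)]
    return matched, missed

matched, missed = split_by_match(hrefs)
print("matched:", matched)
print("missed: ", missed)
```

Only the absolute http:// link ends up in the matched list; the relative forms and the fragment are all missed, which is the gap the post describes.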

Perfect Thunder
Paranoid (IV) Inmate

From: Milwaukee
Insane since: Oct 2001

posted 01-28-2003 03:24

The PHP.net section on Perl-compatible regular expressions has lots of code in the comments sections. Try this page and scan down to the entry that starts:

quote:
Here's an expression that breaks apart any standard URI/URL into all it's various components (including username/password and query string)...



The regex this guy produced probably does exactly what you want.

WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 01-28-2003 03:43

Thanks a lot. That is exactly what I am looking for. The pattern is below; I am going to attempt to deconstruct it and will post that when I do. I just wanted to let you know that I really appreciate it.

code:
"((.*?):\/\/)?(([^:]*) :([^@]*)@)?([^\/:]*)( :([^\/]*))?([^\?]*\/?)?(\?(.*))?"



[had to add a space before two of the colons to get rid of smilies; take those spaces out before using the pattern]
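
A rough sketch of what the pattern captures, assuming the two spaces before the colons are removed. It is written in Python only as an example; the group names in the dict and the sample URL are my own and do not come from the PHP.net comment.

```python
import re

# The pattern from the post with the two anti-smiley spaces removed.
URL_PARTS = re.compile(
    r"((.*?):\/\/)?(([^:]*):([^@]*)@)?([^\/:]*)(:([^\/]*))?([^\?]*\/?)?(\?(.*))?"
)

def split_url(url):
    """Return the interesting capture groups as a dict (names are hypothetical)."""
    # Every part of the pattern is optional, so match() always succeeds.
    m = URL_PARTS.match(url)
    return {
        "scheme": m.group(2),
        "user": m.group(4),
        "password": m.group(5),
        "host": m.group(6),
        "port": m.group(8),
        "path": m.group(9),
        "query": m.group(11),
    }

# Made-up sample URL, not one from the thread.
parts = split_url("http://user:pass@www.example.com:8080/path/page.html?x=1&y=2")
for name, value in parts.items():
    print(f"{name:9}{value}")
```

Because every group is optional, the pattern also "matches" strings that are not URLs at all, so it is better suited to splitting a known URL into parts than to deciding whether a string is one.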



[This message has been edited by WarMage (edited 01-28-2003).]

Perfect Thunder
Paranoid (IV) Inmate

From: Milwaukee
Insane since: Oct 2001

posted 01-28-2003 05:41

Well, I could deconstruct 80% of it off the top of my head. There's nothing really zany going on with it -- no backreferences or anything arcane like that. But it'll probably be a good experience for you to do it yourself.

WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 01-28-2003 12:40

Yeah, after looking at it, I found it not to be that tough.

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 01-28-2003 17:16

we did this a while back... lemme find the thread...


Code - CGI - links - DHTML - Javascript - Perl - programming - Magic - http://www.twistedport.com
ICQ: 67751342

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 01-28-2003 17:46

while searching for the old thread, I decided to post a reference for that code that WarMage spit out:

code:
(            # begin a 'unity' (the optional scheme part)
(.*?)        # any characters, zero or more, but as few as possible (the ? makes the * lazy)
:\/\/        # the colon and two front slashes (escaped)
)            # end the 'unity'
?            # zero (0) or one (1) of that whole thing that came before
(            # begin a second chunk (the optional user:password@ part)
([^:]*)      # anything BUT a colon- no matter how many times
:            # a colon
([^@]*)      # anything BUT a '@'- no matter how many times
@            # a '@'
)            # end this second chunk
?            # zero (0) or one (1) of that whole chunk that came before
([^\/:]*)    # anything BUT a front slash or a colon- no matter how many times (the host)
(            # begin 'big ol' slice (the optional :port part)
:            # a colon
([^\/]*)     # anything BUT a front slash- no matter how many times
)            # end 'big ol' slice
?            # zero (0) or one (1) of the last parenthesis that came before
(            # hunk number 4 (the path)
[^\?]*       # anything BUT a question mark- no matter how many times
\/?          # zero (0) or one (1) slash(es)
)            # end hunk number 4
?            # zero (0) or one (1) of the thing that came before
(\?(.*))     # question mark and more characters after it (the query string)
?            # zero (0) or one (1) of that query-string group, so the query is optional
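
For anyone who wants to keep comments like these next to the pattern itself, many regex engines have an extended (verbose) mode that ignores whitespace and "#" comments. Here is a small sketch using Python's re.VERBOSE, assuming the anti-smiley spaces are removed; the labels in the comments and the sample URL are my own.

```python
import re

# One-line form of the pattern (anti-smiley spaces removed).
ONE_LINE = re.compile(
    r"((.*?):\/\/)?(([^:]*):([^@]*)@)?([^\/:]*)(:([^\/]*))?([^\?]*\/?)?(\?(.*))?"
)

# Same pattern in verbose mode; whitespace and "#" comments are ignored.
COMMENTED = re.compile(
    r"""
    ( (.*?) :\/\/ )?            # optional scheme, e.g. "http://"
    ( ([^:]*) : ([^@]*) @ )?    # optional user:password@
    ([^\/:]*)                   # host
    ( : ([^\/]*) )?             # optional :port
    ( [^\?]* \/? )?             # path
    ( \? (.*) )?                # optional ?query
    """,
    re.VERBOSE,
)

sample = "ftp://www.example.com:21/pub/file.txt"  # made-up URL
# Both forms should capture exactly the same groups.
assert ONE_LINE.match(sample).groups() == COMMENTED.match(sample).groups()
print(COMMENTED.match(sample).groups())
```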



take a look at it... you're complicating yourself WAY more than you have to... you really only need to grab everything after 'http://' (or https: or ftp: or whatever) until the end of the word..
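
A minimal sketch of that simpler idea, in Python for illustration only. "End of the word" is assumed here to mean the first whitespace, quote or angle bracket; the pattern name and the page fragment are made up.

```python
import re

# Assumed reading of "until the end of the word": stop at whitespace,
# quotes or angle brackets.
SIMPLE_URL = re.compile(r"""(?:https?|ftp)://[^\s"'<>]+""")

# Made-up page fragment for illustration.
page = (
    'See <a href="http://www.example.com/a.html">this</a> and '
    "ftp://ftp.example.com/pub/ or https://example.org/x?y=1 for more."
)

urls = SIMPLE_URL.findall(page)
print(urls)
```

Like the first rule in the thread, this only finds absolute links; relative hrefs and fragments would still need separate handling.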

be back with... umm... something..



Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 01-28-2003 17:55

the code goes like this
/^\b(http:

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 01-28-2003 18:10

the biggest issue I see for that one is the "www" part. That last one isn't really that efficient, and it would also allow XSS attacks (stopping those is one of the big reasons to parse submitted URLs, and they seem to be really hard to kill). Basically the second one only looks for some string starting with http:// or https?://www, so anything at all can follow it.

A sample would be:

code:
< a href="http://www.<script>alert('document.cookie')</script>">here</a>
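
A small sketch of the problem, in Python for illustration. The pattern from the previous post is cut off, so LOOSE below is only a hypothetical stand-in for a prefix-only check, and STRICT is one possible tighter rule (an assumption, not a complete URL validator).

```python
import re

# Hypothetical stand-in for a loose check: only the prefix is tested.
LOOSE = re.compile(r"^https?://www")

# Hypothetical stricter check: the whole value must be built from a
# conservative set of characters (no quotes, spaces or angle brackets).
STRICT = re.compile(r"https?://[A-Za-z0-9._~:/?#@!$&()*+,;=%-]+")

payload = "http://www.<script>alert('document.cookie')</script>"
harmless = "http://www.example.com/page.html?x=1"

print(bool(LOOSE.search(payload)))       # loose check lets the payload through
print(bool(STRICT.fullmatch(payload)))   # strict check rejects it
print(bool(STRICT.fullmatch(harmless)))  # ordinary URL still passes
```

The point is only that a check anchored at the start of the string says nothing about what follows the prefix; matching the whole value against an allowed character set closes that particular hole.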






.:[ Never resist a perfect moment ]:.


[This message has been edited by bitdamaged (edited 01-28-2003).]

Petskull
Maniac (V) Mad Scientist

From: 127 Halcyon Road, Marenia, Atlantis
Insane since: Aug 2000

posted 01-28-2003 19:27

aren't you turning '<' to '&lt;' anyway?
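
A minimal sketch of that escaping step, using html.escape from Python's standard library as one example of such a conversion (the thread does not say which language or function the site actually uses).

```python
import html

# The sample payload from the post above.
payload = "http://www.<script>alert('document.cookie')</script>"

# html.escape turns &, < and > into entities; with quote=True (the default)
# it also escapes double and single quotes.
escaped = html.escape(payload)
print(escaped)
```

Once the angle brackets are entities, the browser shows the text instead of running it, whatever the URL pattern let through.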



[This message has been edited by Petskull (edited 01-28-2003).]
