Topic: Web scrape project (Page 1 of 1)

 
Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-05-2006 17:40

I have searched through the Asylum for web-scrape-related topics and found a few threads on WGET. What I am looking for, however, is ideas on how to convert simple table data (e.g. sports scores) into queryable XML or MySQL data. I have tried using preg_replace to add IDs to the TDs, but it is very difficult to give a particular TD the correct ID, for example to tell first-quarter scores from second-quarter scores, because the regex would have to know which column the TD is in. Basically, since the source TDs carry no IDs of their own, it is very tough to label them correctly. Has anyone else had success doing this? If so, I would be interested in your methodology. Thanks in advance for any helpful posts!

(Edited by Boudga on 07-05-2006 17:41)

poi
Paranoid (IV) Inmate

From: Norway
Insane since: Jun 2002

posted 07-05-2006 23:35

Are you sure you can't use an HTML parser? That is, do some of the work on the client side. When I have to scrape something, I usually create a dummy DIV that I don't attach to the current document tree and set its innerHTML to the content of the page I want to scrape. That way the browser's HTML parser takes care of invalid markup and creates a correct DOM tree. Then I can easily use DOM methods, XPath, string methods on the cleaned innerHTML, or whatever else to scrape the content I need.

Otherwise, I suggest you write a simplistic XML-ish parser that creates a sort of DOM tree. As you seem interested in TABLEs, you could limit your parser to the TABLE tags and their descendants. From there it should be easy to grab the TRs and TDs.
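
To make the table-only parser idea concrete, here is a minimal sketch. It is written in Python with the standard library's html.parser rather than the PHP the question mentions, so treat it as an illustration of the approach and not a drop-in solution. The class name TableRowParser, the helper scrape_scores, the sample markup, and the column names are all made up for the example. It assumes a single, non-nested table with no header row, and it names each cell by its position in the row, which sidesteps the need to inject IDs with a regex.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every TD/TH cell, grouped by TR; other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the TR currently open, or None
        self._cell = None   # text fragments of the cell currently open, or None

    def _close_cell(self):
        # Store the open cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Store the open row (if any, and if non-empty) in the result.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()      # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th"):
            self._close_cell()     # tolerate a missing </td>
            if self._row is None:
                self._row = []     # tolerate a cell with no enclosing <tr>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        # Only text inside a cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_row()          # flush a row left open at end of input


def scrape_scores(html, columns):
    """Return one dict per table row, naming each cell by its column position."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return [dict(zip(columns, row)) for row in parser.rows]


# Hypothetical sample: deliberately sloppy markup with missing closing tags.
SAMPLE = """
<table>
  <tr><td>Lions<td>7</td><td>3</td>
  <tr><td>Bears</td><td>0</td><td>10</td></tr>
</table>
"""
COLUMNS = ["team", "q1", "q2"]  # hypothetical column names, chosen by position

games = scrape_scores(SAMPLE, COLUMNS)
print(games)
```

Each row comes back as a dict keyed by column name, so games[0]["q2"] is the Lions' second-quarter score. From there the rows can be written out as XML elements or inserted into a database table with one column per key. Nested tables, colspan and header rows would need extra handling that this sketch leaves out.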


