Topic awaiting preservation: Web scrape project (Page 1 of 1)

Boudga
Maniac (V) Mad Scientist

From: Jacks raging bile duct....
Insane since: Mar 2000

posted 07-05-2006 17:40

I have searched through the Asylum for web scrape related topics and have found a few threads on WGET. However, I am looking for ideas on how to convert simple table data (e.g. sports scores) into queryable XML or MySQL data. I have tried using preg_replace to add IDs to the TDs, but it is very difficult to give a particular TD the correct ID, say to differentiate first quarter scores from second quarter scores, because the regex would have to know which column the TD is in. Basically, when the source markup has no IDs, it's very tough to assign the TD IDs correctly. Has anyone else had success doing this? If so, I would be interested in knowing your methodology. Thanks in advance for any helpful posts!

(Edited by Boudga on 07-05-2006 17:41)

poi
Paranoid (IV) Inmate

From: Norway
Insane since: Jun 2002

posted 07-05-2006 23:35

Are you sure you can't use an HTML parser? Read: do some of the work on the client side. When I have to scrape something, I usually create a dummy DIV tag that I don't place in the current Document tree, and set its innerHTML to the content of the page I want to scrape. That way the HTML parser of the browser takes care of invalid markup and creates a correct DOM tree. Then I can easily use DOM methods / XPath / String methods on the cleaned innerHTML / whatever to scrape the content I need.

Otherwise, I suggest you make a simplistic XML-ish parser that creates a sort of DOM tree. As you seem interested in TABLEs, you could limit your parser to the TABLE tags and their descendants. From there it should be easy to grab the TRs and TDs.
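
A rough sketch of that second suggestion, written in Python rather than PHP purely for illustration. It uses the standard library's html.parser to react only to TR/TD/TH tags, so a cell's column is simply its position within the row and no regex or ID rewriting is needed. The class name, the COLUMNS list, the sample markup and the games/game element names are all made up for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableParser(HTMLParser):
    """Collects the text of every TD/TH cell, grouped by TR; other tags are ignored."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the TR currently open, or None
        self._cell = None   # text fragments of the cell currently open, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()          # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()         # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None


def rows_to_xml(rows, columns):
    """Name each cell after its column position and build a small XML tree."""
    root = ET.Element("games")
    for cells in rows:
        game = ET.SubElement(root, "game")
        for name, value in zip(columns, cells):
            ET.SubElement(game, name).text = value
    return root


# Hypothetical field names, one per column, in table order.
COLUMNS = ["team", "q1", "q2"]

# Hypothetical sample markup; some closing tags are deliberately missing.
SAMPLE = """
<table>
  <tr><th>Team<th>Q1<th>Q2</tr>
  <tr><td>Lions<td>7<td>10
  <tr><td>Bears</td><td>3</td><td>14</td></tr>
</table>
"""

parser = TableParser()
parser.feed(SAMPLE)
parser.close()
body = parser.rows[1:]                 # skip the header row
scores = rows_to_xml(body, COLUMNS)
print(ET.tostring(scores, encoding="unicode"))
```

Each game element ends up with team, q1 and q2 children, so first and second quarter scores are told apart by column position alone. The same rows list could just as well feed INSERT statements for a MySQL table instead of the XML step.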