Topic awaiting preservation: Web scrape project
Maniac (V) Mad Scientist From: Jacks raging bile duct....
posted 07-05-2006 17:40
I have searched through the Asylum for web-scrape-related topics and have found a few threads on wget. What I am looking for, however, is ideas on how to convert simple table data (e.g. sports scores) into queryable XML or MySQL data. I have tried preg_replace and adding IDs to the TDs, but it is very difficult to add the correct ID to a particular TD, for example to tell first-quarter scores from second-quarter scores, because the regex would have to know which column the cell is in to assign the right ID. Basically, when the source markup has no IDs to begin with, it's very tough to assign the TD IDs correctly. Has anyone else had success doing this? If so, I would be interested in knowing your methodology. Thanks in advance for any helpful posts!
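To illustrate the column problem described above: a parser that walks the table cell by cell knows each cell's position within its row, so the position itself can stand in for the missing ID. The following is only a minimal sketch in Python using the standard library's html.parser, not a method anyone in the thread proposed. The class name TableRows, the function table_to_records, and the sample table (team, q1, q2) are all hypothetical, and the sketch assumes the first row holds the column names.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Use the header row as field names; a cell's position picks its field."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical sample data, not taken from any real page.
SAMPLE = """
<table>
  <tr><th>team</th><th>q1</th><th>q2</th></tr>
  <tr><td>Hawks</td><td>7</td><td>14</td></tr>
  <tr><td>Lions</td><td>3</td><td>10</td></tr>
</table>
"""

records = table_to_records(SAMPLE)
print(records)
```

Each record is a plain dictionary keyed by column name, so a first-quarter score and a second-quarter score are told apart by position rather than by an ID injected with a regex. From there the rows could be written to XML or inserted into a database table.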
Paranoid (IV) Inmate From: Norway
posted 07-05-2006 23:35
Are you sure you can't use an HTML parser? Read: do some of the work on the client side. When I have to scrape something, I usually create a dummy DIV that I don't place in the current document tree, and set its innerHTML to the content of the page I want to scrape. That way the browser's HTML parser takes care of invalid markup and creates a correct DOM tree. Then I can easily use DOM methods, XPath, string methods on the cleaned innerHTML, or whatever else to scrape the content I need.
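The approach above runs in a browser. As a rough analogue outside the browser, here is a hedged Python sketch of the same idea: let a forgiving parser cope with sloppy markup, build a clean tree, then query the tree instead of the raw text. It swaps the browser's parser for the standard library's html.parser and the DOM for xml.etree.ElementTree, so it is an illustration of the principle, not the technique described in the post. The names LooseTable and rows_to_xml, the field names, and the sample markup (which deliberately omits the closing td and tr tags) are all assumptions made for the example.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class LooseTable(HTMLParser):
    """Read table cells even when closing </td> and </tr> tags are missing."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments of the open cell, or None

    def _close_cell(self):
        if self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_cell()  # a new row implicitly ends the open cell
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self._close_cell()  # a new cell implicitly ends the previous one
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th", "tr", "table"):
            self._close_cell()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def close(self):
        super().close()
        self._close_cell()  # flush a cell left open at end of input


def rows_to_xml(rows, fields):
    """Build <scores><game><team>..</team><q1>..</q1>..</game>..</scores>."""
    root = ET.Element("scores")
    for row in rows:
        game = ET.SubElement(root, "game")
        for name, value in zip(fields, row):
            ET.SubElement(game, name).text = value
    return root


# Hypothetical, deliberately sloppy markup: no </td> or </tr> anywhere.
MESSY = "<table><tr><td>Hawks<td>7<td>14<tr><td>Lions<td>3<td>10</table>"

parser = LooseTable()
parser.feed(MESSY)
parser.close()

# Field names are assumed; the position of each cell selects its name.
root = rows_to_xml(parser.rows, ["team", "q1", "q2"])

# ElementTree supports a limited XPath subset, enough for simple lookups.
hawks_q2 = [e.text for e in root.findall("game[team='Hawks']/q2")]
print(ET.tostring(root, encoding="unicode"))
print(hawks_q2)
```

The resulting tree can be serialized with ET.tostring and saved as the queryable XML asked about in the first post. For pages with badly broken markup, a full HTML5 parser would likely be more robust than this small state machine.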