Topic awaiting preservation: Can anyone suggest a good way to get keywords from referral URLs

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-07-2005 15:07

Hi all, just wondering if anyone can suggest a good way to get the keywords used in a referral address.

What I want to do is get the search engine, i.e. google.com (this may be .co.uk etc.), and the keywords that were entered.

For example

http://www.google.com.au/search?sourceid=navclient&ie=UTF-8&rls=GGLD,GGLD:2004-17,GGLD:en&q=ozone+asylum

the &q=ozone+asylum was the search term used.

I'm guessing regular expressions will be required.

What I'm actually doing is reading a list of past referrals for hits to a website; I want to show a list of the ten most-used keywords. Not sure of the best way to do this - I was thinking arrays, but a temporary MySQL table is possibly a faster way.

Any ideas? Thanks.

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-07-2005 15:14

One thing I forgot to mention: obviously there will be different URLs to handle, so yahoo will need something slightly different, etc. I was thinking of storing the search engine (yahoo, google, etc.) in a small table with the expression to use to get the keywords for that search engine. Hopefully it can be done with all the escaping chars etc.
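
Something along these lines is what I have in mind - patterns hard-coded here rather than in the table, $referer standing in for the referral URL, and I'm only guessing at yahoo's parameter:

$engines = array(
    'google' => '/[?&]q=([^&]*)/i',  // google puts the terms in q=
    'yahoo'  => '/[?&]p=([^&]*)/i',  // I believe yahoo uses p= instead
);

$keywords = false;
foreach ($engines as $name => $pattern) {
    if (strpos($referer, $name) !== false && preg_match($pattern, $referer, $m)) {
        $keywords = urldecode($m[1]);  // decodes %xx escapes and turns '+' into spaces
        break;
    }
}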

jiblet
Paranoid (IV) Inmate

From: Minneapolis
Insane since: May 2000

posted 06-07-2005 17:17

In PHP try this:

preg_match('/[?&]q=([^&]*)/i', $_SERVER['HTTP_REFERER'], $matches); // anchor on ? or & so params like aq= don't match
$words = explode('+', $matches[1]); // explode(), not split(): split() takes a regex and a bare '+' is invalid
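
Or, skipping the regex, PHP's built-in URL functions should get you the same thing:

$parts = parse_url($_SERVER['HTTP_REFERER']);  // break the URL into scheme, host, query, ...
parse_str($parts['query'], $params);           // decode the query string into an array
$words = explode(' ', $params['q']);           // parse_str has already turned '+' into spaces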

Now each search term is in $words. You can loop over that and INSERT into a table. Personally I would make a table with a 'word' field and a 'count' field and just increment the count if a word already exists. Takes up less space, and keeps the data up-to-date all the time.
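
For the increment idea, a rough sketch - the keywords table (word as primary key, plus a count column) and the old mysql_* calls are just for illustration:

foreach ($words as $word) {
    $w = mysql_real_escape_string(urldecode($word));  // decode any %xx escapes left in the word
    mysql_query("UPDATE keywords SET count = count + 1 WHERE word = '$w'");
    if (mysql_affected_rows() == 0) {
        // first time we've seen this word
        mysql_query("INSERT INTO keywords (word, count) VALUES ('$w', 1)");
    }
}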

Otherwise you might try just using some web stats software like AWStats.

-jiblet

(Edited by jiblet on 06-07-2005 17:19)

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 06-07-2005 17:34

Check out this tutorial from ALA.

It's got downloadable code that you can probably tweak. It's pretty easy.



.:[ Never resist a perfect moment ]:.

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-08-2005 01:29

Awesome, thanks guys.

Jiblet: the problem, I guess, with just incrementing one table is that I want to be able to show the results for a set of dates as well as for the entire period.

What I'm planning on doing at the moment is, when I store a hit to the site, store the search engine and keywords for this hit - OR, if it's not identified, then just store the referrer.

I'm not sure if this is the best way; I'm struggling to find a method I'm 100% happy with. Simply storing the referrer URL and parsing the results when you look them up will, I'm guessing, be too slow - if it wasn't for that, and the wasted storage space, then this would be ideal.

The problem with NOT storing the referral URL is that you limit yourself down the track: what if there was a search engine you didn't have in the list, what if you want to get internal page movement, etc.?

It's really not the simplest thing to design =/.

Any other suggestions would be much appreciated.

jiblet
Paranoid (IV) Inmate

From: Minneapolis
Insane since: May 2000

posted 06-08-2005 15:54

Well, in terms of limiting yourself, keep in mind the availability of log files, which will be the best thing to store for complete records.

But getting back to the performance issue... it will indeed be impossibly slow to parse terms out of a reasonably sized list of query strings on the fly. If you want the simplest idea to solve the date problem, just use one table which contains fields for date and searchTerm. Then you can query something like

SELECT searchTerm, COUNT(*) FROM table WHERE date > xxxxxxx AND date < xxxxxxx GROUP BY searchTerm

Keep in mind both of these columns will need full indexes to be efficient. If you want to store the full queries, that should go in a separate table, and then you put a foreign key field in this table to link them. If you want to store search engines they should also probably go in a separate table which is then linked to the full query table. Also, to save space you could put searchTerms in their own table so that you only repeat ID#s instead of full words.
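
That also gives you the top-ten list the first post asked about; something like this sketch, where hits is just an alias I've picked:

$sql = "SELECT searchTerm, COUNT(*) AS hits
        FROM table
        WHERE date > xxxxxxx AND date < xxxxxxx
        GROUP BY searchTerm
        ORDER BY hits DESC
        LIMIT 10";
$res = mysql_query($sql);
while ($row = mysql_fetch_assoc($res)) {
    echo $row['searchTerm'] . ': ' . $row['hits'] . "\n";
}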

-jiblet

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-09-2005 00:29

I did explore the log file option but to be honest didn't find any benefit from it. For starters it didn't have all of the information I wanted, and I would have to parse it anyway - so working with the database turned out to be easier to handle/customize. I don't see any performance benefit working with the log files either; from what I've found it can be much faster working with an indexed table than a flat file. I could be wrong?


I thought I would give something a try: simply reading through my database of logs and entering any known search engine referral into an array. Reading through 300,000 records, roughly 30,000 of which were actually search engine referrals, I put them in an array and did a custom usort.

Voilà, I had a sorted list of keywords in roughly 2.5 seconds; 1 second of that was taken up by the sort.
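
Roughly what I did, for anyone interested (arsort() would do the job of my custom usort):

$counts = array();
foreach ($referers as $ref) {  // $referers here stands for the URLs read out of my log table
    if (preg_match('/[?&]q=([^&]*)/i', $ref, $m)) {
        $term = urldecode($m[1]);
        $counts[$term] = isset($counts[$term]) ? $counts[$term] + 1 : 1;
    }
}
arsort($counts);  // highest count first; the first ten keys are the top ten keywords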

I tried the table thing, parsing them, putting them into another table and sorting them, but that took 15 seconds -> far too long.

The way you have mentioned above is something I was considering but haven't tried yet; while it would involve redundant data, it might be worth it. Actually, the query you have given me there wouldn't work, would it? The GROUP BY won't allow me to count the occurrences of the searchTerms?


I will give the other method a try today: one table for searchTerms with their ID, one table for search engines, etc. My other decision is then whether to keep the referral string in my main log. The other thing I need to do is track internal movement within the site, which is what the majority of records will be - I need to think of a good way to handle this also. I may end up with a lot of tables T.T!!

jiblet
Paranoid (IV) Inmate

From: Minneapolis
Insane since: May 2000

posted 06-10-2005 15:46

Well, the first thing is that you need to have log files with referrers in them... if you don't, then you're out of luck already (although maybe you can get your host to change to the long log format?). Using the log files isn't easier, but a) it gives you a retrospective view of past hits and, more importantly, b) many web statistics packages already track search terms exactly like you're talking about. I use AWStats, and it will give you month-by-month or year-by-year stats for everything, although it falls a little short if you want to query an arbitrary time interval. As to the performance of using the log files: yes, it's slow, but log file analyzers cache the parsed contents in a database so subsequent runs are fast.

Regarding the slow table method you tried... it's unclear where the bottleneck is. If you are taking query strings out of a database, parsing them, then inserting the individual words into another table, you're going to have a huge amount of overhead just on database calls. You have to get the data into the database in the format you need as you go, because running 30,000 queries through an API is never going to be fast. The other thing is, do you have an index on the term column? If you don't then sorting will be slow, but if you do then it will be sorted as you go.
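
If you do end up bulk-loading after the fact, batching helps a lot; for example, one multi-row INSERT (the terms table is invented for the example) instead of 30,000 single-row ones:

$values = array();
foreach ($words as $w) {
    $values[] = "('" . mysql_real_escape_string($w) . "')";
}
// one round trip to the server instead of one query per word
mysql_query("INSERT INTO terms (word) VALUES " . implode(', ', $values));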

Now regarding your last question, absolutely that query will work. In fact, I highly suggest reading the MySQL Wizardry article over at devshed just to get some idea of the scope of what even a relatively primitive RDBMS such as MySQL can do. There are a lot of other great articles over there too, about indexes, joins, and normalization all of which are mercifully brief and get to the point rather effectively. Good luck.

-jiblet

bitdamaged
Maniac (V) Mad Scientist

From: 100101010011 <-- right about here
Insane since: Mar 2000

posted 06-10-2005 17:16

I'd recommend parsing the logs on some sort of cron job every day (or hour, whatever) and then inserting the results into an indexed table, as opposed to doing an insert every time someone hits the page.

One of the bottlenecks you are going to hit is that while indexing helps speed up searches, it slows down inserts and updates because everything has to be reindexed. If you are doing this on every visit to the page you're going to add significant overhead to page load.



.:[ Never resist a perfect moment ]:.

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-10-2005 17:38
quote:

bitdamaged said:

I'd recommend parsing the logs on some sort of cron job every day (or hour, whatever) and then inserting the results into an indexed table, as opposed to doing an insert every time someone hits the page. One of the bottlenecks you are going to hit is that while indexing helps speed up searches, it slows down inserts and updates because everything has to be reindexed. If you are doing this on every visit to the page you're going to add significant overhead to page load.



I thought this could be an issue, but after testing it the overheads really are minimal and not noticeable. The other problem I'm up against is that some of the sites I'm working with are on Windows servers, hence no cron jobs. I know indexing has an adverse effect on writing data, but from what I've tested the writing is still super quick and not noticeable, thankfully.

quote:

jiblet said:

Regarding the slow table method you tried... it's



I did try various ways here: indexing them all, indexing just the keywords, etc. The quickest time I got was quite slow - the best method here will be to have a separate table, dynamically adding keywords as they are used. I think the string length could have been quite a bottleneck; a few more tests required, methinks!

quote:

jiblet said:

Now regarding your last question, absolutely that query will work



So the GROUP BY method can somehow count them as well? Interesting, that would be very useful - I will read up, there must be something I am missing!!! I was going to use subqueries etc., but I'm currently limited to MySQL 4.0.2, which doesn't support them yet - so I'm stuck with your normal joins etc.

I guess GROUP BY must have a different effect to DISTINCT then? I'll do some reading and try to clear this up!

Thanks all for the help, will let you know how it goes. Any thoughts on using an array like I did? I was surprised how quick it was.

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-10-2005 17:51

Hmm, that was a great article jiblet, I think I have some SQL restructuring to do - can't wait to try this stuff out. I have skimmed this stuff before but never really applied it; I can see how it is very useful though! Thanks for that.
