Topic: MySQL Max Record Performance (Page 1 of 1)

 
WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 06-27-2006 03:45

I am going to be creating a table that will store the clicks that are processed via a website.

clicks
---
click_id - the autoincrement key
page_id - the id of the page accessed
user_id - the id of the user who submitted the hit
ip_address - the ip address where the hit originated
created_at - the time that the click was registered

This table could very quickly contain a whole lot of data. In the short term I do not believe it unreasonable to think in the range of 100 million to 1 billion rows.
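To make the discussion concrete, here is a minimal sketch of how that table might be declared - the column types and sizes are assumptions, only the column list above comes from the post:

CREATE TABLE clicks (
    click_id    BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,  -- the autoincrement key
    page_id     INT UNSIGNED    NOT NULL,                 -- page accessed
    user_id     INT UNSIGNED    NOT NULL,                 -- user who submitted the hit
    ip_address  VARCHAR(15)     NOT NULL,                 -- originating IP (dotted quad)
    created_at  DATETIME        NOT NULL,                 -- when the click was registered
    PRIMARY KEY (click_id)
);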

My first question is: what record set size would be too large to persist in this kind of table, given that it will be frequently queried? Queries might be:

SELECT count(1) FROM clicks WHERE user_id = ?
SELECT count(1) FROM clicks WHERE page_id = ?
SELECT count(1) FROM clicks WHERE created_at > ? AND created_at < ? GROUP BY user_id

Or other queries of a similar fashion.
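For counts like these to avoid scanning the whole table, the filtered columns would need secondary indexes. A hedged sketch (the index names are made up), plus the EXPLAIN check that shows whether a query actually uses them:

ALTER TABLE clicks
    ADD INDEX idx_user    (user_id),
    ADD INDEX idx_page    (page_id),
    ADD INDEX idx_created (created_at);

-- "type" should come back as "ref" or "range" rather than "ALL" (full scan)
EXPLAIN SELECT COUNT(1) FROM clicks WHERE user_id = 42;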

With these kinds of constraints in mind, is it unreasonable to think of keeping all of this data in this table?

The largest result set I have worked with was something in the neighborhood of 10k records, and there were some performance issues at that level.

If it is not possible to keep the data in this table and still perform CRUD operations on it in a reasonable amount of time, what are some suggested solutions? Is there a performance degradation curve? Or different tiers (0 to 10k rows, 10k to 100k rows) where different performance hits are observed?

Dan @ Code Town

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-27-2006 07:35

I've done this exact same thing on a stats database I was working on, involving simple and complex queries. I have since jumped onto another project, but I did do quite a bit of testing and stat generation.

I had a similar starting table, probably with a few more columns, and just did a simple query test to check performance over an increasing number of records. auto_id was the primary index.

Query String ("SELECT auto_id FROM hitlog");


Records      Time (sec)
 50000       0.7395
100000       1.4825
150000       2.2949
200000       3.064
250000       3.6559
300000       4.4364
350000       5.4508
400000       5.887
450000       6.9888
500000       7.3012
550000       8.435
600000       9.0402

I tried various methods of indexing to see if performance changed but they all returned very similar results.

This is pretty much the most basic of queries, purely listing the data. I don't think I was even printing it out. Unfortunately I don't remember too many specifics; it was done in PHP though.

So for a straight query, as expected, the results are pretty much linear with regard to time. This depends on the query, however, and is not always the case.
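That linear curve is what you would expect when every row has to be read. A selective WHERE on an indexed column behaves very differently; a small sketch against the same hitlog table (the user_id column and its index are assumptions):

-- Returns every row: work grows linearly with the size of the table.
SELECT auto_id FROM hitlog;

-- With an index on user_id, only the matching index entries are visited,
-- so the time depends on how many hits that user has, not on the table size.
SELECT COUNT(*) FROM hitlog WHERE user_id = 42;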


As far as how reasonable it is, it isn't really different from working with log files; with indexing it's probably even faster. Down the track the tables will probably need to be archived, however, which is probably what I will end up doing. (It takes quite a while to update my awstats log on my server!)

Will be back on this project in a few months actually.

WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 06-27-2006 16:01

Thanks for the information.

That is a shame; 600,000 records could be generated in a single day, and 9 seconds to query them could be a pain.

It is nice to have real-time data. I might have to implement a schema that compresses past data into one table and archives it into another. This might be reasonable in that I would still have an audit trail saved, and would also have access to all of the statistics.
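A hedged sketch of what that compress-then-archive split could look like - the clicks_daily rollup table, its columns and the date literals are all assumptions:

-- Roll one day's raw clicks up into one row per page...
INSERT INTO clicks_daily (page_id, day, click_count)
SELECT page_id, DATE(created_at), COUNT(*)
FROM clicks
WHERE created_at >= '2006-06-26' AND created_at < '2006-06-27'
GROUP BY page_id, DATE(created_at);

-- ...after which the raw rows for that day can be moved to an archive table,
-- keeping the live table small while preserving the audit trail.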

It is a bit more complex, but I guess that is a trade-off that you must accept when working with huge record sets.

Dan @ Code Town

DmS
Maniac (V) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 06-27-2006 18:51

I can't see how it can take 9 seconds for a simple select on an indexed column with only 600 000 rows in the db?!?!?

Oh, sorry, that query returns all rows right?

What is a must on all queries against any significant number of rows is correct indexing and thought-out WHERE clauses. The good old SELECT * FROM x WHERE created > '2006-01-01' AND created < '2006-01-31' will kill the db very fast with those volumes! Especially if created happens to be anything other than an indexed unix timestamp...
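Assuming the date column is indexed, the shape of the predicate decides whether that index can be used at all; a small sketch (table and column names assumed):

-- A plain range comparison on the bare column can use an index on created_at:
SELECT COUNT(*) FROM clicks
WHERE created_at >= '2006-01-01' AND created_at < '2006-02-01';

-- Wrapping the column in a function defeats the index and forces a full scan:
SELECT COUNT(*) FROM clicks
WHERE MONTH(created_at) = 1 AND YEAR(created_at) = 2006;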

The two largest tables in our systems are approximately 6 million rows and somewhere around 250-300 million rows per 3 months.
I've posted this link before: http://www.mysql.com/why-mysql/case-studies/mysql-hp-ongame-casestudy.pdf - it's a case study on the db system where I work.

It is indeed possible to query these amounts, but it takes a lot of serious hardware and very good knowledge of how MySQL interprets and executes the query and how the indexing engine works.

To get to those volumes you are going to have a LOT of write operations in the db...
If you are contemplating storing the amounts you are talking about on a shared server... forget it.
If you have a dedicated server with loads of very fast disk, perhaps - that is, if you can gain access to tune it for massive data volumes.
If you have a dedicated server just for MySQL and a SAN for the data, yup...

The best way to set it up is actually two servers: one for writes (Create, Update and Delete), then you set up replication to a slave where you query the data. This eliminates a lot of the problems you get when tables are locked for writing while you try to read them. Also, with that amount of data you are certainly talking about a lot of statistical queries that will be murder on performance, and those you can't run on a live server that is constantly updated.
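A minimal sketch of that write-master/read-slave split - the host name, user, password and binlog coordinates are placeholders (the real file and position come from SHOW MASTER STATUS):

# On the master (my.cnf): enable the binary log and give the server an id
#   [mysqld]
#   log-bin = mysql-bin
#   server-id = 1
# On the slave (my.cnf): just a distinct id
#   [mysqld]
#   server-id = 2

-- On the slave, point it at the master and start replicating:
CHANGE MASTER TO
    MASTER_HOST='master.example.com',
    MASTER_USER='repl',
    MASTER_PASSWORD='secret',
    MASTER_LOG_FILE='mysql-bin.000001',
    MASTER_LOG_POS=4;
START SLAVE;

-- Writes (INSERT/UPDATE/DELETE) go to the master; the heavy statistical
-- SELECTs run against the slave.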

Next is the setup of indexes in the tables... that is something you should consult a real DBA about; it's an art when you are talking about these volumes.

To get away with this I'd probably set it up very carefully, then measure how much data I'm getting per 24h, and based on that decide how many days of data I need instant access to. Then I'd set up a periodic dump of that data to a storage solution that isn't performance sensitive - it could be files or a separate db that you can query without concerns. But be prepared to run out of disk a lot sooner than you think.
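A hedged sketch of that periodic dump, assuming an archive table with the same structure and a 30-day live window (the table names and the cutoff are assumptions):

-- Copy rows older than the cutoff into the archive, then remove them from the live table.
INSERT INTO clicks_archive
SELECT * FROM clicks
WHERE created_at < DATE_SUB(CURDATE(), INTERVAL 30 DAY);

DELETE FROM clicks
WHERE created_at < DATE_SUB(CURDATE(), INTERVAL 30 DAY);

-- Or dump the old rows to a file instead of a second table:
--   mysqldump --where="created_at < '2006-05-28'" statsdb clicks > clicks_archive.sql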


/Dan

{cell 260} {Blog}
-{" Computer:
"As a Quantum Supercomputer I take advantage of Zeno's paradox"
C: "Imagine a photon that must travel from A to B. The photon travels half the distance to B. From its new location it travels half the new distance. And again, it travels half that new distance."
C: "It continually works to get to B but it never arrives."
Human: "So you keep getting closer to finishing your task but never actually do?"
C: "Hey, coders make a living doing that..."
"}-

(Edited by DmS on 06-27-2006 18:54)

WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 06-27-2006 22:21

Dan,

Thanks for all that information. Very useful.

Dan @ Code Town

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 06-28-2006 01:58

Yeah, the query I did isn't totally useful, but it still shows how the speed is affected linearly for that sort of thing. I did some other tests with WHERE clauses but I don't think I wrote them down.


I guess overall, if you don't need to be exact time/date specific, you can make it much more efficient, since you can store your precomputed data and retrieve it quickly. If you are doing what I was doing and calculating the totals every single time, then it will take 10 seconds or so.

Thanks for the info DMS, good read.

WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 06-28-2006 02:28

Now that is a good idea.

Just store the data in the clicks table, but at the same time keep the totals (the numbers for current reporting needs) in a separate table. This would make things much faster for those running the reports, but would still persist all of the data in case it needs to be audited because of abnormalities.

Thanks!

Dan @ Code Town

DmS
Maniac (V) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 06-28-2006 20:54

Yup, I agree, not a dumb idea at all.
If your primary goal is to keep track of how many clicks certain areas/tasks get, you can easily set up a table structure on the side where you just increment one row, UPDATE click_summary_user SET nr_of_click = nr_of_click + 1, for each hit. That way you will only have one row per tracker to query to get the instant summaries.

A simple but perhaps useful aggregation of very detailed data.
However, as soon as you need something outside of the fixed aggregation you must be able to query a db that won't suffer from long and complex queries.
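A minimal sketch of that counter-on-the-side idea (everything apart from the column names in DmS's UPDATE is an assumption); each hit either creates the row or bumps the existing counter:

CREATE TABLE click_summary_user (
    user_id      INT UNSIGNED NOT NULL,
    nr_of_click  INT UNSIGNED NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id)
);

-- Per hit: insert the row the first time, otherwise increment the counter.
INSERT INTO click_summary_user (user_id, nr_of_click)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE nr_of_click = nr_of_click + 1;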

/D

{cell 260} {Blog}
-{" Computer:
"As a Quantum Supercomputer I take advantage of Zeno's paradox"
C: "Imagine a photon that must travel from A to B. The photon travels half the distance to B. From its new location it travels half the new distance. And again, it travels half that new distance."
C: "It continually works to get to B but it never arrives."
Human: "So you keep getting closer to finishing your task but never actually do?"
C: "Hey, coders make a living doing that..."
"}-


