
Topic awaiting preservation: mySQL - when two tables are better than one..

 
H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 05-26-2005 01:19

Does anyone know a lot about databases, particularly MySQL?

As far as performance goes, I was wondering just when two tables are better than one. I know that sometimes you have different kinds of data that should be put in separate tables, but what about when you are just logging large amounts of data?

I have a project where many clients will be sending logs to the central database throughout the day, so there might be 20 clients sending 100 or more entries each to the database, storing the current date/time, which client sent the data, and a bunch of other data based on what the client machine is doing. (The PCs are connected to manufacturing equipment, so they log things like start/stop/error events of machines.) Roughly 50,000+ records a month could be added.

So at what point do I split it up into more tables, say per month or per year? Should I have a 2004 table and a 2005 table, or Jan-2004, Feb-2004, etc. tables? Or do I not split the table up at all? It will take 10 years for the table to reach 2 GB, after all, and at that point I could move the oldest year into a "history" table since it won't really be referenced.
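(For the "history" idea, I imagine it could be as simple as something like this - the table and column names here are just made up for illustration:)

code:

-- move everything older than the cutoff into a history table
-- (assumes machine_log_history has the same structure as machine_log)
INSERT INTO machine_log_history
SELECT * FROM machine_log WHERE log_date < '2004-01-01';

DELETE FROM machine_log WHERE log_date < '2004-01-01';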

Obviously for reading data, having it all in one table would simplify things, but does it affect performance?

I have done some tests and time trials, adding 50,000 records and timing SQL queries via PHP, and the increase in time seemed linear. This suggests that it doesn't matter whether all of the records are in one table or one per year; the time depends purely on how many records you are actually querying. This was a simple query, however, and I imagine more complicated queries, such as those involving DISTINCT or GROUP BY, may not increase linearly in time as the size does.

So I'm just wondering everyone's thoughts on this, what people have experienced, and what is best to do. Thanks all!

DmS
Maniac (V) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 05-26-2005 09:51

Performance-wise you can have several hundred million rows in a table, as long as you use indexes correctly for your conditions and don't run queries that create full-table scans.
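For example (assuming a log table with a date column - the names here are invented), EXPLAIN will tell you whether a query can use an index or is forced into a full-table scan:

code:

CREATE INDEX idx_log_date ON machine_log (log_date);

-- "key: idx_log_date" in the output means the index is used;
-- "type: ALL" means a full-table scan
EXPLAIN SELECT * FROM machine_log
WHERE log_date BETWEEN '2005-01-01' AND '2005-02-01';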

As for splitting into history...
That, for me, is decided by how the customer wants to extract stats from it.
However, as long as the data is in the db, whether in one table or in 100, you can always create custom queries, so that shouldn't really be a problem.

/Dan

{cell 260} {Blog}
-{ "There are two major products that come out of Berkeley: LSD and UNIX. We don't believe this to be a coincidence." - Jeremy S. Anderson }-

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 05-26-2005 09:57
quote:

DmS said:

Performance-wise you can have several hundred million rows in a table, as long as
you use indexes correctly for your conditions and don't run queries that create
full-table scans




Yeah, I did some tests, and checking through 100,000 records took pretty much exactly twice as long as checking through 50,000 records - good to see this holds all the way up to the millions.

I guess for some things I might be better off creating query-specific tables?

This would depend, I guess. Let's say you have a table with 20 columns, and you do a query only pulling data from 5 of those columns. Is this any slower than pulling those 5 columns from a table which ONLY has 5 columns to begin with? So is there any point splitting up a table into more tables (column-wise), or does it really not matter?

If you're not querying the extra columns, do they have any effect on speed? - I guess that is the ultimate question there.

At the end of the day, if I have 10 years' worth of data in the one table, I can always split it up at any point - or put it back together; it's not that big an issue. I wonder if I should have a backup table though and duplicate the data in case of corruption... hmm...

WarMage
Maniac (V) Mad Scientist

From: Rochester, New York, USA
Insane since: May 2000

posted 05-26-2005 14:06

Ideally you should normalize all of your data, at least down to second normal form. Try a search on database normalization.

http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html

You will save on storage space, which I believe is the number one advantage of normalization. It really helps with data integrity, and normalized databases are much easier to scale.
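As a rough sketch of what that looks like in practice (hypothetical names), a value that repeats on every row moves into its own table, and the big table keeps only a small id:

code:

CREATE TABLE country (
    country_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name       VARCHAR(52) NOT NULL
);

-- log rows store the 4-byte id instead of repeating the name
CREATE TABLE log_entry (
    log_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    country_id INT NOT NULL,
    log_date   DATETIME NOT NULL
);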

You are not going to see a noticeable performance gain or loss, as Dan said, as long as you have your indexes set up correctly.

Dan @ Code Town

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 05-26-2005 14:24

Yeah, that's interesting; I guess I have done that with other databases. This one is a bit different, really: each record will also log the date/time of the record, and every column will pretty much contain a unique entry from the client. There are maybe one or two things I could put in another table, but to no real advantage.

Since it's all automatic data entry, data integrity isn't a big issue - you won't get users making mistakes etc.

I'm doing a hits database for my website as well, which follows similar principles to the other database I am doing to monitor the machines, so I will try with this first.

My columns include (a rough table sketch follows the list):
-autoid (autoincrement, primary key)
-user_ip (indexed)
-page (the page that someone is viewing)
-log_date
-proxy
-host
-referer
-country
-city
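Roughly, with the column types guessed, the sketch I have in mind is:

code:

CREATE TABLE hits (
    autoid   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    user_ip  VARCHAR(15) NOT NULL,
    page     VARCHAR(255) NOT NULL,
    log_date DATETIME NOT NULL,
    proxy    VARCHAR(255),
    host     VARCHAR(255),
    referer  VARCHAR(255),
    country  VARCHAR(52),
    city     VARCHAR(100),
    INDEX idx_user_ip (user_ip)
);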

I guess I could probably move country & city to another table with IDs, but that just means an extra query when I want to get the data - and from what I can tell, if anything it will only slow the query down. (From tests, in most cases it seems faster to do it all in one query if possible; multiple tables can also increase your overheads.)

Page I could probably move to another table too; not too sure if this would be advantageous.

In the end I will want to get the search engine and keywords out of the referer, so the other option I have here is to make more columns and parse the string when I log the data, or to do this when I query the data.
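(If I parse at query time, I guess MySQL's string functions could pull a keyword parameter out of the referer - e.g., assuming a Google-style "q=" parameter:)

code:

-- take what follows 'q=' and cut it off at the next '&'
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(referer, 'q=', -1), '&', 1) AS keywords
FROM hits
WHERE referer LIKE '%q=%';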

Overall maybe it's just a really plain, straightforward database - because not much of the data is grouped and I need to be able to search data within any custom time period, I'm not so sure I can take advantage of relationships etc.


Any thoughts on this? Thanks

Tyberius Prime
Paranoid (IV) Mad Scientist with Finglongers

From: Germany
Insane since: Sep 2001

posted 05-26-2005 15:18

The second normal form can increase your performance - if you have to shuffle *a lot* of bytes, returning just an id instead of a country name will shave off transfer time.

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 05-27-2005 01:41

So instead of storing the country name, I would store an ID for the country which matches up to another table that has the country name?

In that document it says to use second normal form to reduce the risk of errors while the user is inputting data (since they input the same thing many times over), but it doesn't say anything regarding performance.

I'm really unsure how using second normal form would speed up transfer at all; I mean, you have to get the ID and from that get the country - it's an extra step.

Surely it's quicker just to get the country straight up?

Reading further in the article, he even mentions that normalisation can decrease performance and that de-normalisation is often done to increase performance. Interesting, I will have to look that up!

edit:: ** Also, isn't MySQL a non-relational database, so it probably wouldn't benefit as much from normalisation? Or can you create relationships somehow?

(Edited by H][RO on 05-27-2005 02:08)

Tyberius Prime
Paranoid (IV) Mad Scientist with Finglongers

From: Germany
Insane since: Sep 2001

posted 05-27-2005 09:02

Look at what data you have to actually transfer from the hard disk:

a) Storing country names:
number of records * varchar(52) // that'd be "United Kingdom of Great Britain and Northern Ireland", the longest country name I could find.
So let's say you have a million records: you need to transfer 52 megabytes of data from the hard disk to the CPU.

b) Storing an id.
number of records * sizeof(int) // typically sizeof(int) = 4 bytes.
For a million records: you transfer 4 megabytes. You'll be able to do all necessary comparisons right away on those four bytes. The only time you'll need to do the actual lookup into the country table is when you want to print it. And boy, I'd suggest you actually do that replacement in your programming language of choice - saving on the bandwidth between the webserver and the database server.

So in b) you'd transfer only 1/13 of the data. Guess what: CPU speed gets faster faster than data transfer does. Eventually, everything will be stored in a compressed format, just to lessen that bottleneck.
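Concretely (with invented names), the difference between a) and b) is whether the comparison runs on a variable-length string or a plain 4-byte integer:

code:

-- b) compares 4-byte ints: small reads, fast comparisons
SELECT COUNT(*) FROM hits WHERE country_id = 61;

-- a) compares strings: every row drags the full name off the disk
SELECT COUNT(*) FROM hits WHERE country = 'Australia';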

so long,

->Tyberius Prime

H][RO
Bipolar (III) Inmate

From: Australia
Insane since: Oct 2002

posted 05-27-2005 09:31

Thanks TB, I see what you mean. I guess I could load all of the countries into an array or something for the replacement and do it that way, so they are loaded once and that's that.

jiblet
Paranoid (IV) Inmate

From: Minneapolis
Insane since: May 2000

posted 06-01-2005 00:06

I started writing a response to this over the weekend, but then I got sidetracked. I just wanted to address a few things you mentioned, H][RO:

quote:
Yeah, I did some tests, and checking through 100,000 records took pretty much exactly twice as long as checking through 50,000 records



If you are SELECTing all those records then this is fine. However, if you are querying for a specific category of rows using a WHERE clause, this indicates that you do not have an index defined over the column(s) that you are querying by. If you have a proper index then the time will scale logarithmically, which means the database is truly scalable. The difference is like looking for a phone number: with a short random list of names on a piece of paper you can find any number by scanning it, but imagine if a whole phone book were not alphabetized... that's what not having an index does to your query.
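For instance (made-up names), an index over the column(s) in your WHERE clause is what turns the linear scan into a logarithmic B-tree lookup:

code:

-- composite index covering the conditions in the WHERE clause
CREATE INDEX idx_client_date ON machine_log (client_id, log_date);

-- this query can now seek straight to the matching rows
SELECT * FROM machine_log
WHERE client_id = 7
  AND log_date >= '2005-05-01';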

quote:
I guess I could probably move country & city to another table with IDs, but that just means an extra query when I want to get the data - and from what I can tell, if anything it will only slow the query down. (From tests, in most cases it seems faster to do it all in one query if possible; multiple tables can also increase your overheads.)



You should familiarize yourself with the different types of JOINs available in MySQL. JOINs also require proper indexes, but will perform extremely well given a proper database structure.
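For example, with the hypothetical country lookup table sketched earlier in the thread:

code:

-- INNER JOIN: only hits that have a matching country row
SELECT h.page, c.name
FROM hits h
INNER JOIN country c ON c.country_id = h.country_id;

-- LEFT JOIN: all hits, with NULL for the name when the country is unknown
SELECT h.page, c.name
FROM hits h
LEFT JOIN country c ON c.country_id = h.country_id;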

quote:
Reading further in the article, he even mentions that normalisation can decrease performance and that de-normalisation is often done to increase performance. Interesting, I will have to look that up!



Overall I think that's a dangerous generalization, because it gives the impression that normalization incurs a performance penalty. Certain operations will be slightly slower in a normalized database (INSERTs), but others will be much faster (querying for rows that share some data). In general, if you're not really clear on how a database works internally, normalization is the better option.

quote:
Also, isn't MySQL a non-relational database, so it probably wouldn't benefit as much from normalisation? Or can you create relationships somehow?



MySQL is definitely a relational database. It does not support all the advanced features that are required of an ACID-compliant database, but neither do the majority of applications require full ACID compliance. Historically MySQL has not had built-in relationship management (i.e. foreign keys), but that just means that the application programmer has to manage relationships themselves. I'm not sure about all the new features in the latest version of MySQL, but they're definitely addressing a lot of the common complaints. As mentioned above, relationships are queried for using JOINs.
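(For what it's worth, MySQL does enforce foreign keys on InnoDB tables - MyISAM parses the clause but ignores it - so a declared relationship looks something like this, names made up:)

code:

CREATE TABLE hits (
    hit_id     INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    country_id INT NOT NULL,
    FOREIGN KEY (country_id) REFERENCES country (country_id)
) ENGINE=InnoDB;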

-jiblet

