Designing a large WWW/DB-system - input appreciated
Hello all,
As some of you may know, I've shared my experience with tuning Apache and MySQL in this forum from time to time, but this time I've come across a challenge that can't be met with simple tuning, and I really need input from people with practical experience running large, distributed systems.
This will be somewhat lengthy - I apologise for that, but I'll keep it as short as possible.
Current setup:
Dell PE 1750 server with 2 x 3 GHz Xeons, 2 GB RAM, 2 x 73 GB SCSI in RAID1
Red Hat Enterprise Linux 3, Apache 2, MySQL 4.1
An identical server is set up in a second datacenter, and MySQL is replicated between them over the WAN.
Currently we split the load 66/33 between the two datacenters via round-robin DNS, because one server alone cannot cope with the load during spikes. Ideally we would split 50/50, with either datacenter able to run the show on its own; we can't do that yet, but once our line provider upgrades our connection that should be fixed.
Apache takes 250-300 million hits per month, while MySQL hovers around 1000 queries per second during work hours and peaks around 2400 qps. The limiting factor is disk I/O (the systems run a 533 MHz FSB) and, to some extent, CPU.
MySQL usage:
Select/Insert/Update/Delete (%): 75/04/14/00
- that is, not exactly your usual 90/05/05/00 web server usage. Disk I/O has to be able to handle a lot of writes.
The current database is around 3 GB and the largest table contains some 35 million rows. "Live" tables (accessed and/or updated on every web server hit) contain up to 2 million rows. We use MyISAM exclusively at the moment but are considering InnoDB for the tables that take most of the writes.
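For reference, the switch I have in mind there is just a per-table engine change plus some InnoDB tuning. A minimal sketch - the table name and the values are placeholders, not our real config:

    -- Convert a write-heavy table from MyISAM to InnoDB (table name is a placeholder)
    ALTER TABLE live_hits ENGINE=InnoDB;

    -- my.cnf settings we would have to tune at the same time (values are guesses for a 2 GB box):
    -- innodb_buffer_pool_size        = 1G    (as much RAM as we can spare)
    -- innodb_log_file_size           = 256M  (larger logs help a write-heavy load)
    -- innodb_flush_log_at_trx_commit = 1     (or 2, trading some durability for less disk I/O)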
The challenge: Scale this system to 10X in the next 12-18 months on a limited budget.
I'm considering the following setups but would like any kind of input you may want to share:
1)
One monster server that handles all database writes and replicates the databases to a number of border servers that handle all web serving and database reads. A load balancer distributes the load evenly among the border servers.
The secondary datacenter setup is identical to the above; databases are replicated between the two main database servers.
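To make option 1 concrete, the wiring I'm picturing is plain MySQL master/slave replication, roughly like this (hostnames, user and password are placeholders, and the application would have to send writes to the master and reads to its local border server):

    # my.cnf on the write master
    [mysqld]
    server-id = 1
    log-bin   = mysql-bin

    # my.cnf on each border (read-only) server
    [mysqld]
    server-id = 2        # must be unique per slave
    read-only

    -- On each border server, point it at the master and start replicating:
    CHANGE MASTER TO
        MASTER_HOST='db-master.example.com',
        MASTER_USER='repl',
        MASTER_PASSWORD='secret',
        MASTER_LOG_FILE='mysql-bin.000001',
        MASTER_LOG_POS=4;
    START SLAVE;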
2)
A number of border servers running both the web service and MySQL Cluster. Currently this is hampered by the fact that the database has to fit into the physical memory of the cluster nodes, which isn't realistic with a 30+ GB database.
3)
The perfect solution: one that scales smoothly as the load increases, can recover from database failures within minutes rather than half an hour or more (try repairing a MySQL table with 35 million rows...and then imagine repairing 350 million rows), and can do live replication without any problems. The only problem with this perfect solution is that I haven't found it yet.
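For reference, the recovery I'm complaining about in option 3 is the standard MyISAM routine, which on our biggest table already eats the better part of half an hour (table name is a placeholder):

    -- Check and repair a crashed MyISAM table
    CHECK TABLE hit_log;
    REPAIR TABLE hit_log QUICK;   -- QUICK rebuilds only the index file, not the data file

    -- or offline with the server stopped: myisamchk --recover /var/lib/mysql/mydb/hit_log.MYI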

I've been talking to a number of sales reps but would really like some comments from someone who doesn't have a product to sell - any thoughts you may have on this are much appreciated.