Thursday, April 26, 2007

Why Cb is slow: SQL query trouble

Yesterday and today I am back at the CUBIC to meet up with my former colleagues, which gives me the opportunity to proper debugging of the Cb server instabilities. Earlier I turned on this useful: MySQL option:

log_slow_queries = /var/log/mysql/mysql-slow.log
which pointed out a serious problem:
# Time: 070427  8:30:04
# User@Host: pg[pg] @ []
# Query_time: 426 Lock_time: 0 Rows_sent: 50 Rows_examined: 4096407
SELECT t1.tag AS tag, COUNT(DISTINCT posts.blog_id) AS count FROM tags AS
t1, tags AS t2, posts WHERE t1.post_id = t2.post_id AND t1.tag != t2.tag
AND t2.tag='Visualization' AND posts.post_id = t1.post_id GROUP BY t1.tag
HAVING count >= 1 ORDER BY count DESC LIMIT 50;

The time consumed in this example, 426 seconds, is already stupidly long, but it can be even worse. Now the problem really seems to be in the number of rows examined which is slightly over 4 milion, while the tags table really only has about 35 thousand entries. The reason why it actually is slow, is that during this query it massively reads and write from the harddisk. That is, 20-30 MB a second for about the time it takes to do the query. It is obvious that that leads to server instabilities.

Next step is to understand what this query is supposed to do, and why the hell it is actually making so many entries. Euan, if you are tuned in, the blog_id and post_id columns only contains NULL, which might cause the row explosion?

Update: I tracked this query down to the functionality get_similar_tags and disabled that for now, until I get it fixed.


Jim said...

Hi Egon,

Is there a tool in MySQL that allows you to look at the query plan? In Postgres you would use EXPLAIN ANALYZE. It looks like MySQL might need some help partitioning the query (it's building an intermediate structure with 4M, so it's unlikely to be doing it the best possible way...). It could also be that a couple of judiciously placed indices would speed this up no end.

One of the problems is that indices or no indices, that count(distinct) is going to hurt. Might it be possible to to have a count field on the tags table and update this periodically?


Egon Willighagen said...

Jim, it actually seems to be caused by blog_id being NULL, as I suggested in the item. I now extended the query with the clause 'WHERE blog_id != NULL', and that seems to solve the problem.

However, the field should actually no be empty, so rebuilding that data right now.

Egon Willighagen said...

Oh, btw.. Yes, MySQL has EXPLAIN too.

Egon Willighagen said...

Problem fixed. I had to rerun the tag extraction, and lost a considerable part of the tag table, but at least the system is performant again.

Stew said...


The query just finds the post_ids of posts that are tagged with something in particular ('vizualization' in this case) then gets a list of other tags assigned to those posts ordered by frequency.

It really doesn't scale, unfortunately.

Any idea why blog_id was empty?

Egon Willighagen said...

Stew, yes and no. I am not sure what the code was supposed to do, but wrote a patch to make sure it got fixed in Cb.