Fantastic and unorthodox solution to a database scaling problem

FriendFeed is one of my favorite sites to visit.  It aggregates and shows what users are doing or saying across the Internet.  As you can imagine, the database system required to store and query all these activities would be:

1) growing at an astronomical rate

2) be so furiously busy during query times

3) even the simple act of adding/removing an index would take an insane amount of itme

I’ve always wondered what FriendFeed’s secret to develop and maintain a such database system considering how quickly it runs on their site.  It turns out that they were starting to encounter a bottleneck that was making it difficult for them to add new features.

Thinking out of the box, Bret Taylor who works on the FF database system (MySQL, no less!) shed a bit of light on how they do a such outstanding job in moving past the bottleneck.

http://bit.ly/R3gU6

One of the major points was to stop using standard indexes on columns and instead use tables for this purpose.  This would sound risky at first but it turns out it was the right move to make, provided proper checks were continually carried out.  Their performance graphics show that this change was a huge success.


Comments   Databases, Web/Tech
Related posts:

PostgreSQL: How to have a column with a gap-less sequence with no skips in the incremental values

Best approach for a “gap-less” sequence

By defining a column in your table to be of type SERIAL, you can guarantee that column will increase in value and be unique. However, you cannot guarantee that the values will be gapless.


Comments   Databases
Related posts:

Replicate data from MS SQL Server to PostgreSQL

There’s an excellent post by Magnus Hagander on how to set up your MS SQL master server to replicate data to a PostgreSQL slave server.

This kind of setup would make sense if it was company’s requirements that critical data operations are done on MS SQL server but it was desirable to avoid shelling out more cash for additional licenses for slave MS SQL servers to host  databases that were read-only for reporting/analysis purposes.


Comments   Databases
Related posts:

PostgreSQL – Showing Tables/Views containing a fieldname

CREATE OR REPLACE FUNCTION public.”Table_and_Views_With_Fieldname”(fieldname name)
RETURNS SETOF name AS
$BODY$

SELECT pg_class.relname

FROM (pg_attribute JOIN pg_class ON ((pg_attribute.attrelid = pg_class.oid)))

WHERE ((pg_attribute.attname = $1) AND ((pg_class.relkind =
‘r’::”char”) or (pg_class.relkind = ‘v’::”char”)))
order by pg_class.relname;

$BODY$
LANGUAGE ’sql’ VOLATILE;


Comments   Databases
Related posts:

Materialized views with PostgreSQL

Database views can range in complexity and cost from a simple single table select query to multiple tables joins/aggregate views.  If you find that some of your high-cost views are being run on a frequent basis and are having a large impact on your database performance, you may want to consider materialized views.

To achieve near instantaneous access to the data results of these high-cost views, materialized views can be used to do the work in advance and store the results inside a table ready for rapid access.

Serious candidates for materialized views are those views that are based on tables that don’t change very often. Since materialized views are basically a snapshot of the result of a complex query at a point in time, you will want to use several possible refresh strategies for periodic updates:

    * Force a refresh every time data is being selected so will always return the most recent “live data”
    * X seconds since the last refresh and data is being selected
    * A preset time interval of X seconds since the last refresh, whether or not data was requested or not.

While other commercialized databases are able to auto-materialize views, you can achieve the same results in PostgreSQL by using triggers and PL/pgsql language.

Jonathan Gardner’s “Materialized views in PostgreSQL” post cover this topic well.  I noticed that it’s been written in 2004 and am wondering what the progress is on this?  Does the latest PostgreSQL version natively support this yet?  Still some more research to do ahead.  The awesome PostgreSQL General Bits over at Varlena has more on this topic.  (still written in 2004!!)

Essentially, Jonathan covers three different kinds of materialized views and each comes with their pros/cons:

Snapshot materialized views:

   Where you just simply remove all the rows of the materialized view table and refill the table with the updated data from a view.  You can do a complete refresh of the materialized view as needed.  This approach is good for views with underlying tables that may significantly change over time but the latest data doesn’t need to be immediately accessed by the users (maybe updating once or twice a day is satisfactory enough).
  However, if there were only a few changes in the underlying tables, the very expensive operation of a complete refresh from scratch would be somewhat of overkill.  Another  downside of this approach is that during the wait between each complete refresh of the table, the data of the materialized view can start to get out of sync with the underlying tables when the data starts to change.

Eager materialized views:
   As the name implies, this approach eagerly keeps the materialized view up to date as much as possible by using triggers on the underlying tables of a view.  The triggers take care of each row of the materialized view during the inserting, deleting, or updating.  The net result is that the materialized view is always kept up to date as data changes in the underlying tables.
    This approach makes sense if you expect only a few updates to the underlying tables over time.  This could also affect performance for the worse if all of sudden, some of the underlying tables get hit with many updates over a short period of time. The database will struggle to keep the materialized view constantly in sync with the underlying tables.

Very Lazy materialized view:

  This materialized view is the trade-off you would have if you met halfway between the two approaches listed above.  Essentially, this materialized view is the same as eager materialized view but instead of immediately updating the materialized view, a log of changes is kept.  When the time comes, you can start the refresh and the log of changes will be immediately applied towards the materialized view.  This works very nicely if there are only a few updates which can be applied quickly.  The updating would be much less expensive than a complete refresh using a snapshot materialized view.
  On the opposite side of the coin, if by the time you were ready to refresh the materialized view, there were many updates logged to be made against the materialized view, they all can be grouped together and applied in one go.  This approach avoids the expensive overhead associated with eager materialized view when a high number of updates hit the system.  The downside of a very lazy materialized view is the same problem with out of sync data during the wait between each refresh, once the underlying tables start to change.
  However, with this approach you would be able to better manage the overhead by controlling how often to refresh the very lazy materialized view!  You could first check to see how many updates need to be applied to the materialized view.  If there are only a few, you could have them immediately executed.  If there is a large log of updates to be carried out, you could force a wait until the opportune time when the database server load is lower before unloading them on to the materialized view.

If you decide to use a materialized view and need to choose which approach to use, you need to ask yourself “How up-to-date does the copy of the data need to be and how expensive is it to keep it updated?”


Comments   Databases
Related posts:

KDE application KPoGre for PostgreSQL

Just out:

A KDE graphical client for PostgreSQL database-  KPoGre


Comments   Databases
Related posts:

Resetting a PostgreSQL sequence

To reset a sequence so that the next time it’s used it returns 1:

select setval(’my_sequence_sq1′, 1, false);

The first parameter is the sequence name to set the value of, the next is the value to set it to, and the third field is whether the sequence should return this value or the next increment when it is used next.


Comments   Databases
Related posts:

Merlin Moncure does it again: How to do recursive sets in PostgreSQL using plpgsql

This is an excellent post on how to handle BOMs (Bill of Materials) inside PostgreSQL.

You can take an item number and explode it into all the sub-parts that make up the item as well as the sub-parts of those sub-parts. PostgreSQL doesn’t natively support nested sets yet but thanks to plpgsql, it can still handle this requirement.

The comments of the entry has several revisions so be sure to check those out too!

———————————–
CREATE OR REPLACE FUNCTION expand_bom_children(in_children refcursor, in_level int4)
RETURNS void AS
$BODY$declare
   bom_rec record;
   children refcursor;
level integer;

begin
   if in_level > 10 then
      raise exception ‘nested too deep’;
end if;

loop
   fetch in_children into bom_rec;

   if not found then
      exit;
   end if;

   if bom_rec.assembly then
      insert into result values(bom_rec.item_no, in_level);
      open children for
         select ** from bom
         where parent_item_no = bom_rec.item_no;
         level := in_level + 1;

         perform expand_bom_children(children, level);

      close children;
   else
      insert into result values(bom_rec.item_no, in_level);
   end if;

end loop;
end;

$BODY$
LANGUAGE ‘plpgsql’ VOLATILE;

CREATE OR REPLACE FUNCTION expand_bom(in_item_no “varchar”)
RETURNS void AS
$BODY$
declare
   children refcursor;
   bom_rec record;
begin
   select into bom_rec ** from bom where item_no = in_item_no;

   if bom_rec.assembly then
      insert into result values(in_item_no, 0);
      open children for
         select ** from bom
         where parent_item_no = bom_rec.item_no;

         perform expand_bom_children(children, 1);

      close children;
else
   insert into result values(in_item_no, 0);
end if;
end;
$BODY$
LANGUAGE ‘plpgsql’ VOLATILE;

I use it by issuing the following SQL:

delete from result;
select expand_bom(’1′);
select * from result;

Technorati Tags:


Comments   Databases
Related posts:

(new) Make row-wise comparisons work per SQL in PostgreSQL 8.2

the link

One of the cool SQL features of the next PostgreSQL 8.2 that you can add as a new trick in your arsenal of effective database querying strategies:

You’ll be able to do this elegant SQL statement:

select * from table where (col1, col2, col3) > (value1, value2, value3) order by a,b,c;

If you have an index key on col1, col2, and col3, then you’ll notice a performance increase as well!

From the link above:
Queries that answer questions like:
I need all orders for the account ‘xyz’ with the status ‘hold’ before the date ‘2/8/2006′ will thank you if you do a

create index o_idx on order(account, status, date);
select * from order where (account, status, date) between (’xyz’, ‘hold’, 1/1/1000′) and (’xyz’, ‘hold’, 2/8/2006′) order by account, status, date;

Technorati Tags:


Comments   Databases
Related posts:

PG Lightning Admin Available for $5.00!

Hurry and get this great PostgreSQL administrating software on the cheap!

PG Lightning Admin is Available for $5.00 (limited time only) with $1 per sale donated to the Postgresql Project IF 1000 units are sold by March 21st. This is the full version with no strings attached.

http://www.amsoftwaredesign.com

Spread the word, this is a easy way to bring a cool 1000 dollars to the project.


Comments   Databases
Related posts:

Peek into what SQL queries are being passed to your PostgreSQL database

Try the ngrep – network grep!!

For those of you who have tried setting stats_command_string in postgresql.conf to view the pg_stat_activity so that you could watch SQL queries being pushed into PostgreSQL, you know how much of an overhead there is and how the overall performance can suffer.  There is also a short pause between each polling so you can’t see ALL of the queries- just those that happen to be passed at the time of polling.

If you have wanted to snoop on each the SQL queries that are being passed to your PostgreSQL database but didn’t want to have to resort to tcpdump and also not have a major impact on your database perfomance- you should seriously consider ngrep!

/>> ngrep ‘SELECT|select’ port 5432

then you’ll start seeing a scrolling window of all the SELECT queries that pass through the PostgreSQL database!

Technorati Tags:


Comments   Databases
Related posts:

Simple examples using PostgreSQL’s new OUT parameter

OUT parameter sql & plpgsql examples


Comments   Databases
Related posts:

A kludge to do a month DateDiff() function inside PostgreSQL

This is not very robust but is functional in most cases:

date_part(’days’,date_trunc(’month’, now()) – date_trunc(’month’, “Document_Date”))::int4 / 30::int4 as “Months_Ago”

The pitfall: while this works OK for recent months, this doesn’t work well if you go way back in time when the days count offset starts to exceed 30 days.  The wrong month count will be returned!

Update:
Joseph (via a comment) suggested that I offload this date calculation to a special date PHP function that would be able to give me more accurate date calculation results.  I realize that PHP or Perl have awesome data calculation functions/modules that are available to use.  I could even have used plpgperl inside the PosgreSQL database along with the Date::Calc Perl module.

The reason I didn’t want to go this route is because several tables that I work with are massive and can go into hundreds of thousands of rows, such as the Sales table.  I definitely don’t want to have to return the whole table to a client (be it PHP, Crystal Reports, MS Access, etc) and have to filter through each row to find the sales information that falls in the months that are of interest.

I would much prefer to do the filtering at the database level and only have to return several thousand results.  This has two benefits:  other than the apparent reduction in the network bandwidth to return the limited set of results, the database can also cache this query into memory in case it’s run again in the next few minutes.  Another benefit is that since this “MONTHS_AGO” column is built into the view, it’s possible to immediately use it in any client and there is no need to have to find/use or develop a date calculation function for each of the platforms.

I also wanted to avoid even using plpgperl and Date::Calc since this would incur significant overhead when I have to repeatly use the perl interpreter on every row of the large table.

However, after some more investigation and testing:

A much better data calculation follows (Thanks to the concept of time interval in PostgreSQL)!  Plpgsql can be a bit convoulted at times:

(date_part(’year’,age(date_trunc(’month’::text, now()), date_trunc(’month’::text, “DOCDATE”) )) * 12)
 +
(date_part(’month’,age(date_trunc(’month’::text, now()), date_trunc(’month’::text, “DOCDATE”) )) )
AS “MONTHS_AGO”

This works for dates that go way back in time beyond 5 years!


Comments   Databases
Related posts:

Bash script to backup PostgreSQL databases

From the pgsql-admin mailing list:

Hello, I have written a backup script that will vacuum, analyze and backup every postgresql database. 

(1) Modify logfile and backup_dir variables to suite your needs.(2) I have a trust relationship so I am never prompted for a password on connection.(3) Add an entry to crontab to perform the backups nightly or whenever you wish.(4) Have fun.

# Crontab starts here.00 01 * * * /path/to/script/backup > /dev/null 2>&1#--------------------------------------------------

# Backup script starts here.

#!/bin/bash# Location of the backup logfile.logfile="/path/to/logfile.log"# Location to place backups.backup_dir="/directory/to/place/backups"touch $logfiletimeslot=`date +%H-%M`databases=`psql -h localhost -U postgres -q -c "\l" | sed -n 4,/\eof/p | grep -v rows\) | awk {'print $1'}`

for i in $databases; do        timeinfo=`date '+%T %x'`        echo "Backup and Vacuum complete at $timeinfo for time slot $timeslot on database: $i " >> $logfile        /server/pgsql/bin/vacuumdb -z -h localhost -U postgres $i >/dev/null 2>&1        /server/pgsql/bin/pg_dump $i -h 127.0.0.1 | gzip > "$backup_dir/postgresql-$i-$timeslot-database.gz"done#-------------------------------------------------

Comments   Databases
Related posts:

Filtering out data in a table using SQL subqueries with an aggregate like MAX()

If you have a large denormalized table which you want to use to locate certain rows that have a MAX of a certain value, you can do something like the following:

For example, if you had a table full of orders by customers along with the order dates and you just wanted to see when it was the last time a customer ordered from you:

—————————————-
Latest_Order_Numbers_of_Customers View:
—————————————-
SELECT
 
Invo_Tables_Indexed.DOCDATE,
Invo_Tables_Indexed.CUSTNMBR,
Invo_Tables_Indexed.DOCNUMBR
 
FROM Invo_Tables_Indexed
INNER  JOIN
(
 SELECT
     Invo_Tables_Indexed.CUSTNMBR,
     MAX(Invo_Tables_Indexed.DOCDATE) AS DOCDATE
 FROM Invo_Tables_Indexed
 
 GROUP BY Invo_Tables_Indexed.CUSTNMBR
) as ITI
ON
Invo_Tables_Indexed.CUSTNMBR = ITI.CUSTNMBR AND
Invo_Tables_Indexed.DOCDATE = ITI.DOCDATE
 
GROUP BY
Invo_Tables_Indexed.DOCNUMBR,
Invo_Tables_Indexed.DOCDATE,
Invo_Tables_Indexed.CUSTNMBR;


Comments   Databases
Related posts:

« Previous Entries