Sunday, December 12, 2010

Connection Pooling and Sizing of Connection Pools

 

A question that does not get asked as often as it should is: what is the best size for a connection pool to a database? The answer to that question can have a major impact on the response times your users experience, as well as on the throughput of the system.

Luckily for us, a number of people on the internet have provided a simple answer to that question for different technologies – “It Depends!”.

http://dev.mysql.com/tech-resources/articles/connection_pooling_with_connectorj.html

http://searchdomino.techtarget.com/expert/KnowledgebaseAnswer/0,289625,sid4_gci851755,00.html

http://stackoverflow.com/questions/405352/mysql-connection-pooling-question-is-it-worth-it

So, let us talk about the factors to take into consideration when planning the optimal size of the connection pool.

  1. A connection pool that is too small means that queries have to wait for a connection before they execute. This delay can be quite large – on the order of seconds – on a loaded system.
  2. A very large connection pool, on the other hand, has its own problems:
    • First of all, connection pools are fairly heavy in terms of memory usage. This will affect the memory the OBIEE server uses, which may or may not be a concern depending on whether you are on a 64-bit platform and how much RAM is available.
    • Secondly, if you have a large connection pool, all queries will be passed straight through to the database. Once the load exceeds the capacity of the database box and/or its disks, performance will degrade.

That means you need two pieces of information to correctly size the connection pool – the database’s capacity for parallelism and the expected concurrency. The former typically depends on the database you run, the machine specs, and the operating system; your DBA will be able to help determine it based on past experience. Think of this as the “supply” of connections.

Here is a little spreadsheet to help determine the “demand” for connections. The numbers here are, of course, made up for illustration only:

| Factor | Number | Cumulative Effect |
| --- | --- | --- |
| No. of named users | 5000 | 5000 |
| % logged on at peak times | 10% | 500 |
| No. of reports on a dashboard | 10 | 5000 |
| % of time spent running a dashboard – e.g. if a dashboard runs in 5 s and think time between runs is 30 s, this is 5/(5+30) | 14% | 714 |
| Cache miss ratio – this will be high for highly customized reports | 80% | 571 |
| No. of physical queries per logical SQL | 3 | 1713 |
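To make the cumulative arithmetic explicit, here is a minimal sketch – using the illustrative numbers from the table above – that reproduces the final demand figure. It only selects from dual, so it can be run against any Oracle schema:

-- 5000 named users x 10% logged on x 10 reports per dashboard
--   x 5/(5+30) run-to-think ratio x 80% cache miss ratio x 3 queries per logical request
select round(5000 * 0.10 * 10 * (5 / (5 + 30)) * 0.80 * 3) as required_connections
from dual;
-- returns roughly 1714, in line with the ~1700 figure used below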

This number – approximately 1700 – gives you a number of sizing guidelines. For example, if your database will gracefully handle only about 600 connections/queries at a time, then you know you need approximately 3 (1700/600) databases to handle this load correctly.

Also, if you find – for example due to memory issues – that an OBIEE instance will only scale up to a connection pool of 200, you know you will need a cluster of about 8 or 9 (1700/200) instances of OBIEE to handle this load.

Of course, you can also play with the numbers above to try to get the required connection pool size down. For example, if by creating better aggregates you are able to bring the dashboard running time from 5 s down to 1 s, this can mean a roughly 5x reduction in the number of required connections.
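To see how much that improvement buys you, here is the same illustrative calculation with a 1-second dashboard (all other factors unchanged):

-- run-to-think ratio drops from 5/(5+30), about 14%, to 1/(1+30), about 3%
select round(5000 * 0.10 * 10 * (1 / (1 + 30)) * 0.80 * 3) as required_connections
from dual;
-- returns roughly 387 – a 4-5x reduction from the ~1714 above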


Wednesday, August 25, 2010

Cache Purging – Session Variables

There is another way to purge the cache for selected tables, based on a pull mechanism rather than a push: use session variables to detect when data in the underlying database has changed.

Here is the sequence of things one has to do:

1) Create an ETL_TIMESTAMP table in your warehouse as follows:

create table ETL_TIMESTAMP
(
  ETL_TIMESTAMP TIMESTAMP(6)
);

2) At the end of every ETL run, run the following script to refresh this timestamp:

truncate table ETL_TIMESTAMP;

insert into ETL_TIMESTAMP select sysdate from dual;

3) Now create a session variable – populated by a session initialization block – so that every new session picks up the ETL refresh timestamp when it starts, using the following SQL:

select etl_timestamp from etl_timestamp


 

4) Finally, you want to make sure that cache entries are discarded when the value of this variable changes, which means the variable must be referenced in the query. This can be done in a number of ways – one of which is to reference it in the logical table source (LTS).

Modify every LTS (or at least the ones for which cache purges are required) by adding a filter to its content (the ‘WHERE clause’) that references the variable.

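The exact expression depends on your model, but a minimal sketch of such a filter – added to the LTS content – might look like the following. It is always true, so it does not change query results, but it forces the session variable (assumed here to be named ETL_TIMESTAMP, as created in step 3) into the generated query and hence into the cache key:

-- hypothetical LTS content filter; always true, but references the session variable
VALUEOF(NQ_SESSION."ETL_TIMESTAMP") = VALUEOF(NQ_SESSION."ETL_TIMESTAMP")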

Now cache entries will be automatically purged every time you run ETL.

Pros and Cons

Pros

  1. The most important difference between this method and the others we discussed here and here is that, being a pull mechanism, it is fairly resilient to changes in physical configuration. If you add or remove servers from a clustered environment, or have Dev / UAT / QA / Production BI servers pointed at the same warehouse database, each of the caches will automatically refresh when it should.
  2. Since this is session level, there is no latency to a cache refresh. As soon as the new data appears on the database, all new sessions will see the new data.

Cons

  1. Configuration may be hard if every LTS is to be modified.
  2. Existing sessions might see some old cache hits. If you have long running sessions, then you may need to enhance this to use both a session variable and a repository variable to provide staleness guarantees. The latter is left as an exercise for the reader.

Wednesday, August 18, 2010

Cache Purging – API calls

 

In the last post, we saw that you can take control of your cache entries and purge them on a table-by-table basis. However, if you are like most BI departments, your tables do not have independent refresh frequencies. You probably run ETL once (twice? three times?) a day, and that ETL updates nearly every table in the warehouse. In this case, clearing the cache is simply a matter of issuing the following command via ODBC to the BI server at the end of your ETL process.

 

Call SAPurgeAllCache()

 

Of course, if you have more than one BI server (e.g. in a cluster) you would have to issue this more than once – once for each server. There are also API calls for purging the cache by database or table, but they are required less often. If you do need to use these APIs, see this great post by John.
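As a sketch, that call can be delivered at the end of the ETL job by running a short script through each BI server’s ODBC DSN (for example with the nqcmd utility), once per server in the cluster. The selective variants take the database/catalog/schema/table names as defined in the repository; the names below are purely illustrative:

-- selective purges (names must match the rpd; these are illustrative)
call SAPurgeCacheByDatabase( 'My Warehouse' );
call SAPurgeCacheByTable( 'My Warehouse', 'My Catalog', 'DWH', 'W_SALES_F' );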

Pros and Cons

Pros

  1. This is much simpler to implement than event polling, especially in environments with more than one BI server. There is no need for an event polling table in the database, no need for separate schemas for each BI server, and no ongoing communication between the ETL and metadata teams except a one-time (hopefully) handoff of the list of BI servers to your ETL developer.
  2. Since this is a call directly into the BI server, there is no latency, whereas with event polling entries could stay in the cache for up to an extra 10 minutes (the polling interval).

Cons

  1. This is a synchronous API. If a BI server happens to be down – for administration or due to unintended failure – it might never receive this call, and will serve stale data when it is brought online later.
  2. It is hard to add or remove servers from the cluster without modifying ETL scripts. Moreover, if another BI server points to the same environment – say the QA environment – it will never have its cache purged.

Friday, August 13, 2010

Cache Purging – Event Polling Tables

 

We saw in our last post that setting timeouts for physical tables is not enough to ensure correctness when data changes and that some mechanism is required to signal to the OBIEE cache that it should purge old entries.

There are various ways in which this can be accomplished. In simple systems, one option would be to use the ‘Event Polling Table’.

The idea behind this table is simple – it serves as a message bus. You put into it the names of the tables whose data has changed. The OBIEE server polls this table for update events and purges cache entries that reference those tables. It also removes those entries from the polling table – so as to avoid purging them again at the next polling interval.

Here is how one configures the event polling table.

Step 1: Create the polling table on the database. For Oracle, the create table statement is:

create table UET (
                UpdateType Integer not null,              -- set to 1 to signal a table update
                UpdateTime date DEFAULT SYSDATE not null, -- when the event was recorded
                DBName char(40) null,                     -- database name as defined in the rpd
                CatalogName varchar(40) null,             -- catalog name as defined in the rpd
                SchemaName varchar(40) null,              -- schema name as defined in the rpd
                TableName varchar(40) not null,           -- table name as defined in the rpd
                Other varchar(80) DEFAULT NULL            -- reserved; leave null
);

Step 2: Import this table into the rpd and set up the connection pool info. Remember that polling does not run in a user session – so the connection pool needs a well-defined user name and password (i.e. :USER will not work).


Step 3: Declare this table to be the event polling table by going to Tools –> Utilities, and set the polling frequency – it is recommended that this be no lower than 10 minutes. A smaller polling interval reduces the possibility of stale cache hits, but will impact end-user throughput.

Step 4: Add code to your ETL process so that when a table is updated, rows are inserted into the event polling table. Remember to use the DB, schema and table names as defined in the rpd, not as defined on the database.
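As a sketch, the statement your ETL process would issue after refreshing a fact table might look like the following (the database, schema and table names are illustrative – they must match the physical layer of the rpd):

-- signal that a table has been refreshed; names must match the rpd, not the database
insert into UET (UpdateType, UpdateTime, DBName, CatalogName, SchemaName, TableName)
values (1, SYSDATE, 'My Warehouse', NULL, 'DWH', 'W_SALES_F');
commit;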

Pros and Cons

Pros

  1. Event polling is a very reliable way to purge cache for only those tables that have been updated, and is fairly easy to implement.
  2. Because of its high selectivity, it can be used in environments that have a high refresh rate. In particular, if you have a warehouse where some tables update frequently – say every 10 minutes – you can use event polling to purge only the cache entries that reference those tables.

Cons

  1. The entries in the event polling table reference the db, schema, catalog and table names as defined in the repository, not in the database. This means that it requires communication between the ETL and metadata developers on a regular basis to keep things working correctly.
  2. The above is compounded by the fact that entries involving aliases are not driven by the name of the base table. If you have ten aliases (say) for your calendar table, a row needs to be inserted into the event polling table for each of those 10!
  3. Since there are practical limitations on how often you can poll, this still allows stale cache hits for up to 10 minutes.
  4. In any environment that has more than one BI server querying a single database (e.g. in a clustered environment), this cannot be implemented. In such an environment the first server to find a record for DATE_D will delete its cache entries for it – but also the row in the event polling table. This means that the other servers in the cluster will never purge their cache for that table, and will continue to serve old data. Users will see different results for reports depending upon which server they are routed to.

Tuesday, August 10, 2010

Cache purging

One of the most powerful – and lowest-effort – features provided by the OBIEE stack for improving throughput is caching. For reports that have been executed before, OBIEE will serve the results from the mid-tier instead of going back to re-query the database.

This has two advantages – first, a query that hits the cache returns faster since it avoids the round trip to the database, and second, by reducing the percentage of queries that reach the database, it improves the response time of the queries that do hit the database.

However, a query that does not go to the database cannot, by definition, know whether the data on the database has changed. Hence an important part of the caching strategy is defining how to avoid getting stale data out of the cache. In this post I will examine the various options and their relative merits.

Cache Persistence Time

Your initial reaction may be to say that since we run ETL only once every day (or whatever your frequency may be), it is okay to set the cache to expire every 24 hours. OBIEE even provides an option – the cache persistence time on the physical table – to expire cache entries after a given time period.


Unfortunately, cache persistence time is measured from the time the cache entry was created, not from when the data is refreshed. This means that even if you set it to your ETL refresh interval, you will still get some reports answered out of (stale) cache after the underlying data has changed. For example, imagine that ETL starts every night at 1 am, and that a report is run at 12:30 am and then again at 2:30 am, just two hours later. The second run will hit the cache entry created at 12:30 am, since less than 24 hours have passed – even though the data in the underlying tables has changed completely.

So it is clear that one needs some way of signaling changes on the database to the OBIEE server so it can purge the obsolete cache entries. In subsequent posts, I will talk about the various options available to you – as well as their pros and cons.

Wednesday, July 7, 2010

Tuning Data Warehouse Models on Oracle - 2

In the last post we talked about how, in a fact table, even high selectivity – 10% – does not lead to a reduction in the number of blocks read.

This means that if you keep 10 years of data in the fact table – and select just 1 year – the database is, on average, going to have to scan the full table. This can – and will – be bad for performance. What one would ideally like to do is scan just a tenth of the data, and to accomplish that one has to ensure that each year is stored on the minimum number of blocks. This co-location of related records can be accomplished with three distinct tools:

  1. Index Organized table
  2. Partitioning 
  3. Physical Reordering via CTAS

Index Organized Table

The easiest way to keep the rows ordered is to tell Oracle to keep them that way – i.e. via an index organized table. The syntax is simple – you just have to create the table with an organization index clause:

create table sales (
  …
) organization index;

The table data is stored with the primary key index instead of in its own blocks. If the primary key is prefixed with the columns you want to sort on – then records with a given value will appear together. For example,

create table sales (
  …,
  constraint PK_SALES primary key (year, sales_id)
) organization index;

will keep all records for a given year together.

However, we do not recommend actually doing this for fact tables because in our experience this does not work well with bitmap indexes.

Partitioning

If the number of values in the field that you want to co-locate by is small, you can typically partition by that field – for example, you may do something resembling the following:

create table sales (…)
partition by range (year)
(
  partition p1 values less than (2005) tablespace u1,
  partition p2 values less than (2006) tablespace u2,
  partition p3 values less than (2007) tablespace u3,
  partition p4 values less than (2008) tablespace u4,
  partition p5 values less than (MAXVALUE) tablespace u5
);

You would, of course, have to maintain this as years are added. This is made much easier by the use of interval partitioning, introduced in 11g.

 

create table sales (…)
partition by range (year)
interval (1)
(
  partition p1 values less than (2005) tablespace u1
);

This is great when the number of partitions is variable – i.e. as data for future years comes in.

The thing to watch out for is when the cardinality of the column you are co-locating on is high. For example, if you were partitioning on date instead of year, then instead of 10 partitions you could end up with 3,650.

CTAS

The last way to keep rows with a given value together is to simply recreate the table using CTAS. For example,

create table sales_ordered as select * from sales order by sale_date;

Note that this option, unlike the other two, does not provide any knowledge of the physical ordering to the optimizer. However, the benefits of co-locating the rows (as discussed in http://navi-obiee.blogspot.com/2010/04/tuning-data-warehouse-models-on-oracle.html) will continue to be available.

 

In a warehouse environment, we recommend that you always be aware of the physical order of data in your fact tables. If the cardinality of your most common filtering attribute is low, partitioning may be the best way to order the rows. If it is high, then CTAS with an ORDER BY may be your only solution.

Thursday, April 1, 2010

Tuning Data Warehouse Models on Oracle

While not strictly related to OBIEE – I am currently working on a number of database optimization tasks, and will discuss a couple of ideas to be aware of when designing a warehouse schema for the Oracle database.

Let me start with a simple question:

Assume you have a table of 1000 blocks, with a column ‘Digit’ (that goes from 0–9, obviously). If the table is properly indexed on ‘Digit’ and you query all records where Digit = ’0’, how many data blocks will you have to read?

Simply applying the unitary method gives you:

(1/10) * 1000 = 100 blocks.

Unfortunately, it turns out that even though this might be a reasonable estimate for an OLTP application – this can be rather dramatically wrong for a fact table in a DSS environment.

Why? Well, if you have designed your star schema correctly, you only have some surrogate keys, and some measures in the fact table (no large varchars). Assuming 5 (4 byte) surrogate keys, and 10 (8 byte) measures – your record length is

5×4 + 10×8 = 100 bytes

Now, your block size may be 8K, and so the number of records that fit in a single block is 8K/100 ≈ 80 records.
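If you want to check the actual figure for one of your own fact tables, the optimizer statistics expose it directly. A rough sketch – SALES_TEST is an illustrative table name, and statistics must already have been gathered:

-- average row length and rows per block, from gathered statistics
select num_rows, blocks, avg_row_len, round(num_rows / blocks) as rows_per_block
from user_tables
where table_name = 'SALES_TEST';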

By now you are wondering what this has to do with the original question. A lot, as it turns out. Remember, all reads and writes are for the block as a whole – and a block has to be read if *any* of its rows matches the criteria. What is the probability that, of the 80 records in a given block, none of them has Digit = 0?

P(no 0’s in a block) = P(1st record not 0) × P(2nd record not 0) × … × P(80th record not 0)

= (0.9) ^ 80

≈ 0.00022

Which means that in such an environment, the number of blocks you would expect to read is 1000 × (1 − 0.00022) = 999.8 blocks – a very different answer from the simplistic 100 we first got.
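You can verify this behaviour empirically: ROWIDs encode the block a row lives in, so counting the distinct blocks that contain at least one qualifying row shows how much of the table a Digit = 0 filter really touches (again a sketch, with an illustrative table name):

-- how many distinct blocks hold at least one row with Digit = 0?
select count(distinct dbms_rowid.rowid_block_number(rowid)) as blocks_with_digit_0
from sales_test
where digit = 0;

-- compare with the number of blocks holding any rows at all
select count(distinct dbms_rowid.rowid_block_number(rowid)) as blocks_with_rows
from sales_test;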

And this is not all – if you use table compression (which gives you, say, a 50% gain) and a large block size (recommended with compression), you may have as many as:

32K / (100 bytes × 50% compression ratio) ≈ 640 records per block

at which point the probability that a block contains no qualifying row is roughly 0.9^640 ≈ 10^-30 – i.e. essentially every block in the table will be read.

In the next post – I will talk about what this means in terms of optimizer behavior.