Wednesday, August 25, 2010

Cache Purging – Session Variables

There is another way to purge the cache for selected tables, based on a pull mechanism – not push – which is to use session variables to detect when data in the underlying database has changed.

Here is the sequence of things one has to do:

1) Create an ETL_TIMESTAMP table in your warehouse as follows:

create table ETL_TIMESTAMP
(
  ETL_TIMESTAMP TIMESTAMP(6)
);

2) At the end of every ETL run, run the following script to refresh this timestamp:

truncate table ETL_TIMESTAMP;

insert into ETL_TIMESTAMP select sysdate from dual;
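One caveat with the truncate-and-insert approach: between the two statements the table is briefly empty, so a session starting at exactly that moment would find no row for its initialization block. A minimal alternative – a suggestion, not part of the original recipe – is to seed the table with a single row once and then refresh it in place:

-- Assumes ETL_TIMESTAMP was seeded with exactly one row;
-- updating in place means the table is never empty.
update ETL_TIMESTAMP set ETL_TIMESTAMP = sysdate;
commit;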

3) Now create a session variable so that every new session picks up the value of the ETL refresh timestamp when it starts – by using the following SQL in its initialization block:

select etl_timestamp from etl_timestamp

[Screenshot: session variable and its initialization block defined in the Administration Tool]

4) Finally, you want to make sure that cache entries are discarded when the value of this variable changes. For that, the variable must be referenced in the query. This can be done in a number of ways – one of which is to reference it in the logical table source (LTS).

Modify every LTS (or at least those for which cache purges are required) by editing the ‘WHERE clause’, as in the sketch below:

[Screenshot: logical table source Content tab, with the WHERE clause referencing the session variable]
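A minimal sketch of such a predicate, assuming the session variable from step 3 is named ETL_TIME (the variable name and the tautological form are illustrative):

VALUEOF(NQ_SESSION.ETL_TIME) = VALUEOF(NQ_SESSION.ETL_TIME)

The comparison is always true, so it filters no rows, but it makes every query against this LTS reference the session variable. Since a cache hit requires the values of any referenced session variables to match, entries created before the latest ETL run stop matching as soon as new sessions pick up the new timestamp.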

Now cache entries will be automatically purged every time you run ETL.

Pros and Cons

Pros

  1. The most important difference between this method and the ones discussed in the previous two posts is that, being a pull mechanism, it is fairly resilient to changes in physical configuration. If you add or remove servers from a clustered environment, or have Dev / UAT / QA / Production BI servers pointed at the same warehouse database, each of the caches will automatically refresh when it should.
  2. Since this is session level, there is no latency to a cache refresh. As soon as the new data appears on the database, all new sessions will see the new data.

Cons

  1. Configuration may be hard if every LTS is to be modified.
  2. Existing sessions might see some stale cache hits. If you have long-running sessions, you may need to enhance this approach to use both a session variable and a repository variable to provide staleness guarantees. The latter is left as an exercise for the reader.

Wednesday, August 18, 2010

Cache Purging – API calls


In the last post, we saw that you can take control of your cache entries and purge them on a table-by-table basis. However, if you are like most BI departments, your tables do not have independent refresh frequencies. You probably run ETL once (twice? three times?) a day, and that ETL updates nearly every table in the warehouse. In this case, clearing the cache is simply a matter of issuing the following command via ODBC to the BI server at the end of your ETL process.


Call SAPurgeAllCache()


Of course, if you have more than one BI server (e.g. in a cluster) you would have to issue this more than once – once for each server. There are also API calls for purging the cache by database or table, but these are needed less often. If you do need to use these APIs, see this great post by John.
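For reference, the finer-grained calls look like the following. All of the names here are placeholders, and they must match the names defined in the repository, not those on the physical database:

-- Purge every cache entry that references the given physical database.
Call SAPurgeCacheByDatabase( 'My_DB' );

-- Purge entries that reference one specific physical table.
Call SAPurgeCacheByTable( 'My_DB', 'My_Catalog', 'My_Schema', 'W_ORDERS_F' );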

Pros and Cons

Pros

  1. This is much simpler to implement than event polling, especially in environments with more than one BI server. There is no need for a database table as there is with event polling, no need for separate schemas for each BI server, and no communication between the ETL and metadata teams except a one-time (hopefully) handoff of the list of BI servers to your ETL developer.
  2. Since this is a call directly into the BI server, there is no latency, whereas with event polling entries can stay in the cache for up to an extra polling interval (10 minutes or more).

Cons

  1. This is a synchronous API. If a BI server happens to be down – for administration or due to an unintended failure – it might never receive this call, and will serve stale data when it is brought back online later.
  2. It is hard to add or remove servers from the cluster without modifying ETL scripts. Moreover, if another BI server points to the same warehouse – say one in the QA environment – it will never have its cache purged.

Friday, August 13, 2010

Cache Purging – Event Polling Tables


We saw in our last post that setting timeouts for physical tables is not enough to ensure correctness when data changes and that some mechanism is required to signal to the OBIEE cache that it should purge old entries.

There are various ways in which this can be accomplished. In simple systems, one option is to use the ‘Event Polling Table’.

The idea behind this table is simple – it serves as a message bus. You insert into it the names of the tables whose data has changed. The OBIEE server polls this table for update events and purges cache entries that reference those tables. It also removes the processed rows from the polling table – so as to avoid purging the same entries again at the next polling interval.

Here is how one configures the event polling table.

Step 1: Create the polling table on the database. For Oracle, the create table statement is:

create table UET (
                UpdateType Integer not null,
                UpdateTime date DEFAULT SYSDATE not null,
                DBName char(40) null,
                CatalogName varchar(40) null,
                SchemaName varchar(40) null,
                TableName varchar(40) not null,
                Other varchar(80) DEFAULT NULL
);

Step 2: Import this table into the rpd and set up the connection pool information. Remember that polling does not run within a user session – so the connection pool needs a well-defined user name and password (i.e. :USER will not work).

[Screenshot: connection pool settings for the event polling table]

Step 3: Declare this table to be the event polling table by going to Tools → Utilities

[Screenshot: Oracle BI Event Tables utility in the Administration Tool]

and set the polling frequency – it is recommended that this be at least 10 minutes. A smaller polling interval reduces the possibility of stale cache hits but will impact end-user throughput.

Step 4: Add code to your ETL process so that when a table is updated, rows are inserted into the event polling table, as in the sketch below. Remember to use the DB, schema, and table names as defined in the rpd, not as defined on the database.
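A minimal sketch of such an insert – My_DB, My_Schema, and W_ORDERS_F are placeholder names that must match the repository:

-- Signal that the table known to the rpd as W_ORDERS_F has been refreshed.
-- UpdateType is always 1 for standard update events.
insert into UET (UpdateType, UpdateTime, DBName, CatalogName, SchemaName, TableName)
values (1, SYSDATE, 'My_DB', NULL, 'My_Schema', 'W_ORDERS_F');
commit;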

Pros and Cons

Pros

  1. Event polling is a very reliable way to purge cache for only those tables that have been updated, and is fairly easy to implement.
  2. Because of its high selectivity, it can be used in environments that have a high refresh rate. In particular, if you have a warehouse where some tables update frequently – say every 10 minutes – you can use event polling to delete only the entries that reference those tables.

Cons

  1. The entries in the event polling table reference the database, catalog, schema, and table names as defined in the repository, not in the database. This means that keeping it working correctly requires regular communication between the ETL and metadata developers.
  2. The above is compounded by the fact that entries involving aliases are not driven by the name of the base table. If you have, say, ten aliases for your calendar table, a row needs to be inserted into the event polling table for each of those ten!
  3. Since there are practical limitations on how often you can poll, this still allows stale cache hits for up to a full polling interval (10 minutes or more).
  4. In any environment that has more than one BI server polling a single database (e.g. a clustered environment), this cannot be implemented safely. The first server to find a record for, say, DATE_D will delete its own cache entries for that table – but also the row in the event polling table. This means that the other servers in the cluster will never purge their cache for that table, and will continue to serve old data. Users will see different results for the same report depending upon which server they are routed to.

Tuesday, August 10, 2010

Cache purging

One of the most powerful, and lowest-effort, features provided by the OBIEE stack for improving throughput is caching. For reports that have been executed before, OBIEE will serve the results from the mid-tier instead of going back to re-query the database.

This has two advantages – first, a query that hits the cache returns faster since it avoids the round trip to the database, and second, by reducing the percentage of queries that reach the database, it improves the response time for the queries that do.

However, a query that does not reach the database cannot, by definition, know whether the data on the database has changed. Hence an important part of any caching strategy is defining how to avoid serving stale data out of the cache. In this and the following posts I will examine the various options and their relative merits.

Cache Persistence Time

Your initial reaction may be to say that since we run ETL only once every day (or whatever your frequency may be), it is okay to set the cache to expire every 24 hours. OBIEE even provides an option to expire cache entries after a given time period:

[Screenshot: cache persistence time setting in the physical table properties]

Unfortunately, cache persistence time is measured from the time the cache entry was created, not from when the data was last refreshed. This means that even if you set it to your ETL refresh interval, some reports will still be answered out of (stale) cache although the underlying data has changed. For example, imagine the case where ETL starts every night at 1 am:

[Figure: timeline of the nightly ETL run starting at 1 am]

You might be tempted to set the cache persistence time and assume that will take care of purging stale data after every ETL. Unfortunately, this does not work. To understand why, consider a report run at 12:30 am.

[Figure: the report run at 12:30 am, before the ETL]

and then again at 2:30 am, just two hours later.

[Figure: the same report run at 2:30 am, after the ETL]

The second run will hit the cache entry created at 12:30 am, since less than 24 hours have passed – even though the data in the underlying tables has changed completely. In fact, that entry can continue to serve stale results until 12:30 am the following night.

So it is clear that one needs some way of signaling changes on the database to the OBIEE server so it can purge the obsolete cache entries. In subsequent posts, I will talk about the various options available to you – as well as their pros and cons.