6. Common Questions about the Engine


6.1 What are some pointers for configuring Online?

Failing that advice, try:

One of the most common problems experienced is transaction logging. If you do have to use logging, HEED THESE WORDS:

Make your logs big. Very big. Huge.
Make lots of logs. Plenty.
Run continuous backups. Continuously.
Lower the default Transaction high water marks from:
   LTXHWM          80         # Long TX high-water mark (percent)
   LTXEHWM         90         # Long TX exclusive high-water mark (percent)
   to:
   LTXHWM          70         # Long TX high-water mark (percent)
   LTXEHWM         80         # Long TX exclusive high-water mark (percent)

A small band of merry men ran forth and insisted that these LTX* figures need to go lower still.

Jim Frison (jfrison@aol.com), for one, speaks from painful experience:

Long transactions and completely locked up Informix systems have been the bane of my existence for the last two years. I strongly recommend that the defaults be revised to lower values. We gradually reduced them to their current levels (LTXHWM=65, LTXEHWM=75) and I am tempted to take them to 60,70 to feel safer. We have 8 logs and definitely ran into trouble at 70, 80. Change the defaults.

Note: Ken Miles (kenm@vcd.hp.com) uses:

   LTXHWM          40         # Long TX high-water mark (percent)
   LTXEHWM         65         # Long TX exclusive high-water mark (percent)

His site deadlocked with the 70/80 settings. Indeed, Clem W. Akins (cwakins@jabba.rmc.com) claims that the high-water marks should be less than 50%, to allow room for a roll-back. Have enough logs that whatever number you choose is enough for your largest long transaction; 70/80 is not nearly low enough.

Jonathan Leffler (johnl@informix.com) suggests:

With a reasonable log configuration (say more than 4 logs), the 70/80 limit is very unlikely to run you into trouble, unless you are mass inserting or deleting extremely small records (say 1-8 bytes, though I'm tempted to suggest 1-4 bytes). I do not recall seeing systems run into trouble with the 80/90 figures which are the default in tbconfig.std.

Reducing these parameters to less than 50% is not something I'd recommend; there is definitely an upper limit on how much log space you can require to do a rollback, because the CLM records (which are the cause of the trouble) are of finite size (16 bytes, if memory serves me right), and there is at most one CLM for each operation performed by the user. Each operation has a 12-byte header plus some data, such as the whole record which is inserted or deleted, or both the before and after images for an updated record. Unless these records are tiny, the amount of space used to generate the LTX dwarfs the amount of space needed to roll it back.

I suppose that if you have multiple users who regularly and simultaneously start operations which become LTX transactions, then you may have problems. But that is at least partly a question of education, and at least partly a question of operational controls on the OnLine system.

I consider that you have to be trying quite hard to run into trouble with LTXHWM = 70, LTXEHWM = 80. I also admit that the problems which occur if this is wrong are so severe that erring on the side of caution is advisable.

I (kerry@kcbbs.gen.nz) cheekily ask: Definitive enough for you? ;-)

Monitor the running of the continuous backups so that when it asks for a new tape you can mount one. (Alternatively, use a disk file instead of a tape; make it big enough to handle a week's worth of logs, archive daily and clear it.)
Never, ever, ever kill an sqlturbo. tbmode -z it.
Teach users not to do things likely to cause a long transaction error, e.g. "LOAD FROM .... INSERT INTO ....".
When doing DELETEs, lock the table exclusively (otherwise you may cause a lock overflow).
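A minimal sketch of the exclusive-lock approach (the orders table from the stores demo database is used purely for illustration):

   BEGIN WORK;
   LOCK TABLE orders IN EXCLUSIVE MODE;
   DELETE FROM orders WHERE order_date < MDY(1,1,1995);
   COMMIT WORK;

With the table locked exclusively, the engine holds a single table lock instead of one lock per deleted row.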

(cwakins@jabba.rmc.com (Clem W. Akins)):

Re-write deletes/purges to commit every row
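One hedged sketch of that idea (database, table and column names are purely illustrative): rather than literally committing every row, the purge is broken into month-sized chunks, each in its own transaction, so neither the lock table nor the logical logs have to hold the whole purge at once.

#!/bin/sh
# purge a year of old orders one month at a time
for month in 1 2 3 4 5 6 7 8 9 10 11 12
do
    dbaccess stores - <<EOF
    BEGIN WORK;
    DELETE FROM orders WHERE order_date < MDY($month, 1, 1995);
    COMMIT WORK;
EOF
done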

If the logs get filled up, manually run the auto-backup from tbmonitor to free them. This works when the tbtape dies, and the logs fill from regular use (maybe not from a single huge event).

(rodger@hpdmd48.boi.hp.com (Rodger Anderson)):

Be careful using temporary tables. Use the WITH NO LOG clause when selecting into a temporary table.
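For example (a sketch using the stores demo tables):

   SELECT order_num, customer_num, ship_date
     FROM orders
    WHERE ship_date IS NULL
     INTO TEMP t_unshipped WITH NO LOG;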

Number of locks (Clem Akins cwakins@jabba.rmc.com)

Depending on your application and the number of users you will have concurrently, increase this. A lot. The max is one quarter million, default is 2,000. I always set mine to be >100,000 and <200,000. Remember, a massive delete (such as can be easily run from SQL) can generate tens of thousands of locks in a hurry. This will cost you some memory but not much.

If you run out of locks, and the rollback of the transaction cannot complete because it runs into the Long Transaction problem, the entire Online session may get corrupted and could require a complete reload.
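In tbconfig terms that simply means raising the LOCKS parameter, something like this (150,000 is just an illustrative value in the range Clem suggests):

   LOCKS           150000     # Maximum number of locks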

Number of licenses used (Clem Akins cwakins@jabba.rmc.com)

If you use the SQL user menu to launch another informix application (one you wrote in 4gl, say) the user will have consumed 2 licenses! One for the menu, and one for the application. You can quickly run out of licenses this way. We use a shell script for the front-end user menu and have it run one informix application at a time.
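A minimal sketch of such a front-end (the program names are hypothetical; fglgo is the 4GL-RDS runner):

#!/bin/sh
# simple front-end menu: only one Informix application runs at a time,
# so each user consumes only one licence
while :
do
    echo "1) Order entry   2) Reports   q) Quit"
    read choice
    case "$choice" in
        1) fglgo orders.4gi ;;
        2) fglgo reports.4gi ;;
        q) exit 0 ;;
        *) echo "Please choose 1, 2 or q" ;;
    esac
done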

Disk Allocation (Clem Akins cwakins@jabba.rmc.com)

Store different parts of your database on different spindles on different controllers. For example, store the root db, the log space, the data tables all on *different* disks, and if your bottleneck then becomes controller I/O, run these different disk drives on different controllers. Ensure your tables are in contiguous disk space by

  1. Creating each table in a specific dbspace, which is linked to a specific disk partition. You can choose which partition receives which data, and so influence where on the disk the data physically resides. Data on the inner sectors is denser than data on the outer sectors, and reads faster.
  2. Allocating enough space in one extent with the EXTENT SIZE qualifier to the CREATE TABLE command (see the sketch after this list).
  3. Loading each table immediately after creation, before the next table is created, to REALLY put the data in contiguous space.
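A minimal sketch of steps 1-3 (the dbspace name and extent sizes are purely illustrative, and the column list is abbreviated):

   { create the table in a chosen dbspace with a big enough first extent }
   CREATE TABLE orders
   (
       order_num       SERIAL NOT NULL,
       order_date      DATE,
       customer_num    INTEGER
   ) IN datadbs1
     EXTENT SIZE 8000
     NEXT SIZE 2000;

   { load immediately, before the next table is created, so the rows
     land in the freshly allocated contiguous extent }
   LOAD FROM "orders.unl" INSERT INTO orders;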


6.2 What do I need to use MS-Access/Visual Basic with my Informix database?

Doug Mason explains:

There are several parts necessary to get an ODBC-compliant program talking to your Informix Database.

Li Chung Yuen (s920715@hp9000.csc.cuhk.hk) helpfully lists this database configuration :

  The Engine                    The Client

  Informix 5.01                 MS Access 2.0
  I-Star 5.0                    Q+E ODBC 1.2 
                                I-Net 5.0 
                                PC/TCP 2.2

... but regretfully informs us: It works, but the performance is not good.


6.3 THIS SPACE INTENTIONALLY LEFT BLANK

6.4 What error messages are caused by locking problems?

The following SQL error codes can occur because of locking problems:

-233: Cannot read record that is locked by another user.
-240: Could not delete a row.
-244: Could not do a physical-order read to fetch next row.
-245: Could not position within a file via an index.
-246: Could not do an indexed read to get the next row.
-250: Cannot read record from file for update.
-263: Could not lock row for update.
-271: Could not insert new row into the table.
-346: Could not update a row in the table.
-348: Could not read a row from the table.
-378: Record currently locked by another user.

The following ISAM error codes can occur because of locking problems:

-78: Deadlock detected
-79: No record locks available.
-107: ISAM error: record is locked.
-113: ISAM error: the file is locked.
-134: ISAM error: no more locks.
-143: ISAM error: deadlock detected.
-144: ISAM error: key value locked.
-154: ISAM error: Deadlock Timeout Expired - Possible Deadlock.

Errors -78 through -154 can also be shown as positive values under some circumstances. The error message files record the same message for both -78 and +78. This is not true of the SQL error messages.

If you are working on System V based machines (rather than BSD), the system error number leads to certain errors being mis-reported. The locking errors which may cause problems are:

-45: SysV -- Deadlock detected
-45: BSD  -- Operation not supported on socket

-46: SysV -- No record locks available
-46: BSD  -- Protocol family not supported.

-56: SysV -- File locking deadlock error
-56: BSD  -- Socket is already connected.

In each case, the error would be reported using the BSD message, but would actually have the System V meaning.

This list was generated from the document "Informix Error Messages" (December 1991) Part No 000-7121. Neither of these lists is guaranteed to be definitive, nor are they guaranteed to remain unchanged with new versions of the software. However, the document used includes Version 5.00 error messages as well as 4.10 error messages.


6.5 What should I use instead of kill -9?

The controversy over the kills has to do with what Informix is trying to execute at the time. A badly timed abortion of some process can damage the updating and logging of transactions in one or more databases under the engine's control.

This can leave things in an unrecoverable state.

(davek@newjersey.informix.com (Dave Kosenko))

Of course, killing a front end process is different from killing an engine (generally, it is safer). Your best bet is, of course, to code signal handling into your application. This means catching signals in your esql/c programs and using the DEFER command in 4gl (or not using DEFER and just using the interrupt to terminate it). In esql/c, catch your signal, then issue a sqlint() followed by an sqlexit(), then exit the application.

Now it's time for a bit of truth regarding engines and signals. The following signals are IGNORED by sqlturbo (as of 5.0):

        SIGHUP
        SIGINT
        SIGQUIT
        SIGTSTP

You may send any of these signals to your heart's content; however, it will do nothing to stop a sqlturbo.

Three signals are caught:

        SIGTERM
        SIGUSR1
        SIGPIPE

Each of these causes a different action on the part of the sqlturbo. SIGUSR1 is what is used by tbmode -z. However, tbmode also diddles with some shared memory flags before sending this signal. If you send it yourself, it will generally not have any effect since the appropriate shared memory flags have not been set. Anyway, there is no advantage to sending SIGUSR1 yourself over letting tbmode -z do it for you.

To understand the remaining two, you need a basic understanding of how sqlturbo works. Basically, once it gets started up, it will sit in an infinite loop reading messages from the pipe (from the client process) and taking actions based on that message. When the action is complete, and you come back to the top of the loop, a return message will have been placed on the pipe (which the client reads and processes). Normally, this loop will only be terminated when you exit your client application (or use sqlexit(), which has the same effect).

A SIGPIPE is generated whenever a process attempts to write to a pipe that has no reader. We catch SIGPIPE and set a flag. At the top of the loop mentioned above, this flag is checked. So, if the client goes away (so there is no reader on that end of the pipe), when sqlturbo tries to write its return message to the pipe, a SIGPIPE is generated, we set the flag and continue. Since the very next action after writing anything to the pipe is to return to the main loop, the flag is checked and the loop is exited. sqlexit() accomplishes this by simply closing the client end of the pipe.

Generally, whenever a client goes away, the engine should follow suit. If it is blocked on a read from the pipe (waiting for the next request) it will get an EOF and terminate. If it attempts to write a return message to the pipe, it will get a SIGPIPE. When the engine is busy doing work when the client goes away, though, the situation changes. Until it attempts a write on the pipe, or tries to read from the pipe, it will not know that the client has gone. Eventually, it will get around to its write and terminate, but that could take a while if it is doing something that takes a lot of time.

Finally, we have our SIGTERM. This signal is caught and another global flag is set. This flag is checked all over the place in the sqlturbo code, and control is passed back to the main loop (with a message indicating that the statement was interrupted passed back: -213); the engine goes back to blocking on the read from the pipe. Now the flag is not checked after every statement; more like after every "unit of work" where the definition depends on the type of activity. So it may not have an immediate effect, but it will have an effect pretty quickly. If the client has gone away, the write of the -213 to the pipe will generate a SIGPIPE and the loop will be terminated, or the read will get EOF and the loop will be terminated.

So, if tbmode -z does not work, and you cannot shut your system down (really the best alternative), the next best thing to do is send a SIGTERM to the sqlturbo (causing it to interrupt its current work) followed by a SIGPIPE (if the SIGTERM alone doesn't do it). If this combination has no effect after a reasonable amount of time, you really should do a shutdown. Any other signal sending is likely to bring the system down anyway (abort mode).
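A minimal sketch of that sequence (1234 stands for the sqlturbo's process id):

tbmode -z 1234      # always try this first
kill -TERM 1234     # ask the engine to interrupt its current work (-213)
sleep 60            # give it a reasonable amount of time
kill -PIPE 1234     # if it is still running, make it think the client has gone away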

(johnl@informix.com (Jonathan Leffler)):

As a matter of idle fact, killing an sqlexec (SE) is even worse than killing an sqlturbo (OnLine). With OnLine, there is a supervisory process (tbinit) which will detect the problem and either clean up after the dead process or bring the system to a halt before any damage is done. With SE, there is no such supervisory process, and if an sqlexec dies part way through a transaction, the data will appear committed for ever more. Until you try to do a rollforward database, that is. But by then, who knows how many decisions have been made on the faulty data which wasn't rolled back? This too is a controversial point of view. Note that this applies strictly to the SE engine, not to an application running it. That can die with impunity as long as the Engine survives.

brent@advgroup.co.nz (Brent Jackson) writes:

With OnLine 4.00 we had difficulty with tbmode -z. We found that kill -1 and -15 were ignored by the process, and kill -9 could bring the engine down if it hit at the wrong time. However, we found that a kill -8 was a reliable and safe way of killing the process.

Note however, since OnLine 4.10 (through to 5.06) we have had no problems with tbmode -z, though (as already mentioned) it can sometimes take a while (~ 30 minutes) to work.


6.6 How can I tell what version I'm running?

6.6.1 Look in the ASCII files:

The files suffixed with "-cr" in $INFORMIXDIR/etc purport to give the versions of your current software, e.g.:

cat $INFORMIXDIR/etc/OnLine-cr (Version of Online)

6.6.2 Ask the programs by passing the -V switch:

   $INFORMIXDIR/lib/sqlturbo -V     (Version of Online)
   $INFORMIXDIR/lib/sqlexec[d]  -V  (Version of SE)
   fglpc -V                         (Version of 4GL-RDS)
   ... you get the idea ...

A handy little shell script from Paul T. Pryor (ptp@fallschurch-acirs2.army.mil) is included in Appendix H which will list the versions of all Informix products you have installed.

6.6.3 Ask the engine itself:

   SELECT * FROM systables
    WHERE tabname = " VERSION"

This has the number of the last installed version of the engine. If that version has been de-installed then this data will be incorrect.


6.7 Locking in pre-6.0 engines

Prior to Informix 6.0, the engine used a locking mechanism (adjacent key locking) which involves locking the index key adjacent to the key actually being locked.

Attached is the edited version of the Tech Info describing the Informix Twin Problem for pre-6.0 engines.

Albert E. Whale aewhale@access.hky.com

6.7.1 What is the "Twin Problem"?

The "Twin Problem" is a situation that can arise when inserting and deleting rows in databases with logging. Suppose that two processes, Px and Py, are updating a table that has a unique index. Px deletes a row with key value 8; then Py inserts a row with key value 8. Later, Px decides to roll back its transaction. If we allow Px to roll back, there will now be two rows with key value 8, violating the uniqueness constraint on the index.

Py therefore must not be allowed to insert its row until Px completes its transaction. But if only row locks are used, there is no way to prevent Py from inserting the row. Once Px deletes the row, the lock is gone; a row that is not there cannot be locked.

A similar problem can occur when Isolation Mode is set to Repeatable Read. Suppose Px, using Repeatable Read, locks the set of rows with key value 5. These rows now cannot be deleted or updated. However, new rows can still be added. Py can insert a row with key value 5, or update a row with key value 2 to key value 5. This would violate the Repeatable Read.

If the Repeatable Read selection criteria did not include a filter on an indexed column, the only solution is to place a shared lock on the entire table. This prevents any other process from altering the table while the repeatable read is in progress. However, when the selection makes use of an indexed column, OnLine makes use of adjacent key locking to prevent both this and the "Twin Problem".

6.7.2 How does this work and what are the ramifications?

What is the effect on the "Twin Problem"? When Px deletes the row with key value 8, it first tests value 8 to make sure it is not already locked, then it places a lock on key value 9, and deletes value 8. When Py attempts to insert a row with key value 8, it first has to test value 9 to see if it is locked. Since Px has locked value 9, Py cannot insert value 8. Note that if there is no value 9, Px will lock value 10, which not only prevents Py from inserting a value 8, but also prevents Py from inserting a value 9. If 8 was the highest value, the "infinity" value is locked, preventing Py from inserting any value higher than 8.

This also protects reads with Isolation Mode of Committed Read or greater, by alerting the reading process to uncommitted insertions or deletions from the selected set. If Py selects all rows "WHERE key_val < 10", and Px has deleted key value 8, Py will encounter the lock on key value 9, alerting it to the fact that there is an uncommitted insertion or deletion of a row. If there is no value 9, 10, or 11, Px will lock key value 12; however, Py will still encounter the lock, because the way in which OnLine determines that it has found all required rows is to read the index values until it reaches the first one which does NOT fit the criteria, which in this case would be 12. Note that Py would also receive a locking error if Py selected all rows "WHERE key_val > 10", even though the row actually deleted was 8.

These examples have dealt with issues on unique indexes. However, adjacent key locking also applies to non-unique indexes, as in the Repeatable Read problem. When Px, in Repeatable Read, selects all rows "WHERE value = 5" on a non-unique indexed value, it locks each row with value 5 as it is read. When it passes the last rowid with indexed value 5, it reads the first indexed value 6, which alerts it that it has completed its search, and locks it as well. If Py now attempts to insert a row with value 5, it will first test the adjacent key. If the rowid of the new row is less than that of any one of the existing rows with value 5, it will test the key of the first existing row with rowid greater than itself, and find it locked, as the Repeatable Read locked all the keys for value 5. If the new rowid is greater than any of the existing rows, it will test the key value 6, and find it locked as well. Thus Py is prevented from inserting any new rows with value 5 until the Repeatable Read is completed.

Another side-effect of adjacent key locking is reduced concurrency of insertions. If Px inserts the key value 9 into a unique index, Py will now receive a locking error if it attempts to insert the value 8.


6.7.3 What OS bugs may affect Online?

cmm@trac3000.ueci.com (Colin McGrath) writes:-

The ONLINE_7.2 OS Release notes only mention:
6. OS Patch information:
   OS version is Digital UNIX V3.2D.  The KAIO patch from DEC is required.
   The patch number for various OS versions are:
   OS v3.2de1    OSF360-070
   OS v3.2de2    OSF365-360070
   OS v3.2f      OSF370-360070
   OS v3.2g      OSF375-360070
   For OS v4.0 and up, this patch comes with the OS.

Also in Digital UNIX V4.0:-

       A data corruption problem can occur when the parameter new-wire-method
       is turned on. The new-wire-method parameter is only available in V4.0
       and later releases. All versions V4.0 and later ship with the default
       being new-wire-method enabled.

       Digital UNIX Engineering is investigating the exact cause of the problem; 
        however, until we fully understand the circumstances surrounding this, you should
       follow the recommended workaround detailed below.
 
 All I know right now, is that under very heavy loads (with OPS and DRD)
 this data corruption problem can occur. Even if you are not running
 Oracle Parallel Server, we recommend you implement the workaround
 because we don't fully know the cause of the problem. Systems running
 Digital UNIX V3.2x are not affected by this problem.
 
     BLITZ TITLE: DIGITAL UNIX DATA CORRUPTION WITH SHARED MEMORY

         It is the Strong Recommendation of Digital UNIX Engineering that 
         this workaround be implemented on all systems running Digital UNIX
         V4.0 and above. Failure to do so can result in undetected data 
         corruption.

djw@smooth1.demon.co.uk (David Williams) writes:-

Recently in comp.databases.informix there have been several reports of performance problems with HP-UX version 10.x. These are problems related to shared memory and have been fixed in 10.20. I would recommend that people upgrade to 10.20.

ismolic@open.hr (Ivica Smolic) writes:-

From the release notes of Informix OnLine 7.23 for DEC Alpha: "It is not recommended to set AFF_SPROC to 0. CPU 0 is a master CPU and DIGITAL does not recommend to bind process to this processor 0."

On 8th Dec 1997 deanr@innotts.co.uk (Dean Robinson) writes:-

[Editors note: This is about upgrading from Unixware 2.1.1 to version 2.1.2 and problems starting Online 7.23.UC1]

I had the same problem, until I rebuilt the kernel as follows:

cd /etc/conf/bin
./idtune -f HCPULIM 0x7FFFFFFF
./idtune -f SCPULIM 0x7FFFFFFF
./idtune -f HFSZLIM 0x7FFFFFFF
./idtune -f SFSZLIM 0x7FFFFFFF
./idtune -f HDATLIM 0x7FFFFFFF
./idtune -f SDATLIM 0x7FFFFFFF
./idtune -f HSTKLIM 0x7FFFFFFF
./idtune -f SSTKLIM 0x7FFFFFFF
./idtune -f HCORLIM 0x7FFFFFFF
./idtune -f SCORLIM 0x7FFFFFFF
./idtune -f HFNOLIM 0x00000800
./idtune -f SFNOLIM 0x00000800
./idtune -f HVMMLIM 0x7FFFFFFF
./idtune -f SVMMLIM 0x7FFFFFFF
./idbuild -B

Reboot, and your Informix engine should now work OK.

On 13th Jan 1999 scooper@thecompounders.com (Steven L Cooper) wrote:-

Problem

----------

oninit: Fatal error in initializing ASF with 'ASF_INIT_DATA' flags; asfcode='-25580'. [Editor's note on SCO Openserver 5.0.4]

Solution

----------

Kernel parameters are normally tuned with configure(ADM) on OpenServer, but some of the changes you want to make aren't supported by that, so use idtune(ADM). Something like:

cd /etc/conf/cf.d
../bin/idtune SEMMNI 2048 -max 8192
../bin/idtune SEMMNS 2048 -max 8192
../bin/idtune SEMMAP 2048 -max 8192
../bin/idtune SHMMAX 671088640
./link_unix -y

On 25th Mar 1999 heikog@informix.com (Heiko Giesselmann) writes:-

To use more than 1.75GB of shared memory on HP-UX 10.20 some HP patch has to be installed. HP should be able to help out with the current id of the patch required.

To use more than 1.75GB shared memory with Informix you have to 'chatr' all executables that attach to shared memory (onstat, ontape, ...). See the man pages for chatr. Please note that onbar cannot be chatr'ed with IDS versions earlier than 7.24.UC5 and therefore onbar won't be able to attach to shared memory if shared memory is created by a chatr'ed oninit.

To improve performance you might also try to do 'chatr +pd64M $INFORMIXDIR/bin/oninit' to increase shared memory page size. That should make a noticeable performance difference for instances with larger shared memory segments.


6.8 Space reclamation

In the good old days of SE, a table with a high level of activity (i.e. deletions/inserts) became riddled with old data row slots which were available for re-write, and the index began to look like swiss cheese. To correct this periodic unload/loads or cluster manipulation or ALTER TABLE (same as an unload/load but more user-friendly) were needed to keep index and data table efficiency at peak. [See section 8.1, part 3]

Is this also true of Online? Or does Online do this automagically?

OnLine has assorted "reclamation" stuff that it does, so that you should not wind up with a "swiss cheese" situation (reuse of freed pages, page compression, etc.).

What OnLine will NOT do is return pages allocated to an extent (or an empty extent, for that matter) back to the dbspace. Once an extent has been allocated to a table, it will always be allocated to that table (unless, of course, you drop or alter the table).

(davek@informix.com (David Kosenko))


6.9 Online Disk Fragmentation:

Raw Online space does get very fragmented, especially if you are short of room in the partition. Table extents become very small and numerous and it causes performance problems. Tables can generally have about 240 extents, and it is possible to reach that value and come to a grinding halt.

We regularly monitor all our production systems' extents and partition fragmentation. Tbunloading and tbloading the database back typically puts all the data for each table in the first extent (if there is enough contiguous room). We then carefully set all the next extents on the tables, to accommodate future growth. If tbunloading is unreasonable because of the database size, we will create an empty new table, with the first extent set, and copy the data from the old table to the new table.

(anasaz!qip.naomi@enuucp.eas.asu.edu (Naomi Walker))
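A minimal sketch of the copy-to-a-new-table approach described above (dbspace name, extent sizes and the abbreviated column list are purely illustrative):

   { create the replacement with a properly sized first extent }
   CREATE TABLE new_orders
   (
       order_num       SERIAL NOT NULL,
       order_date      DATE,
       customer_num    INTEGER
   ) IN datadbs1
     EXTENT SIZE 20000
     NEXT SIZE 4000;

   INSERT INTO new_orders SELECT * FROM orders;

   RENAME TABLE orders TO old_orders;
   RENAME TABLE new_orders TO orders;

Remember that indexes and permissions have to be recreated on the new table before the old one is dropped.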


6.10 Why do UPDATE STATISTICS?

(Cathy Kipp)

The optimizer, which determines what read strategy to use when searching for data, uses the statistics to work out the most efficient way to read the data. These statistics can quickly fall out of date after a large number of insertions or deletes. Running UPDATE STATISTICS will give the optimizer a better shot at the data.

UPDATE STATISTICS FOR tablename can be run if only one or a few tables have changed - this is particularly useful for large databases. Be warned that the tables are locked while the statistics are gathered.

Information on how the optimizer has decided to process a query is available in a local file called sqexplain.out after a SET EXPLAIN ON command has been issued. This can be invaluable in improving query efficiency.
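For example (a sketch using the orders table from the stores demo database):

   UPDATE STATISTICS FOR TABLE orders;

   SET EXPLAIN ON;
   SELECT * FROM orders WHERE customer_num = 104;
   { the chosen query plan is appended to sqexplain.out in the current directory }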


6.11 How do we get serial field starting values for a table?

You can't. The starting value for a SERIAL column is not preserved anywhere. Once any data has been inserted, there is no way to tell what value was set.


6.12 Can I count the number of rows SELECTed, without using SELECT COUNT?

No. Nor with SELECT COUNT(*) either, because by the time you get to execute the SELECT *, someone may have changed the tables.

Stefan Gutsche, sg@ivu-berlin.de adds: Another option, faster than SELECT COUNT(*) is to:

1. UPDATE STATISTICS FOR TABLE [Table]

2. SELECT nrows FROM systables WHERE tabname = [Table]

It really is faster than SELECT COUNT(*), and to be safe you should lock the table in exclusive mode while you do it.
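As a sketch (the table name is purely illustrative; lock the table first, as Stefan suggests, if the count must stay exact while other users are active):

   UPDATE STATISTICS FOR TABLE orders;

   SELECT nrows FROM systables WHERE tabname = "orders";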


6.13 Is the number of rows processed in the SQLERRD structure accurate?

... or is it just an estimate?

The info in the SQLERRD item is only filled in after you've processed the last row. It therefore isn't much help in the context of SELECT.

It is very useful in the context of INSERT, UPDATE and DELETE statements.

In all cases, it is an exact number.

With the more recent Engines, the estimates of the cost and number of rows to be returned are also available. For more information look in the Informix Guide to SQL Reference Manuals under 'Error checking with SQLCA' (or under 6.0, the individual ESQL manuals).


6.14 Interpreting SysColumns.ColType and SysColumns.ColLength

The system catalogue table SysColumns documents the type of the column in encoded form using values from 2 columns, ColType and ColLength.

A detailed explanation of how to interpret these fields can be found in Appendix E, while a 4GL function which will interpret the values for you can be found in Appendix F.

If you do not have a copy of the Appendixes, and are a 4GL programmer, you could hack the sample 4GL source in $INFORMIXDIR/demo/fglbe/ex30.4gl


6.15 More tuning suggestions for Online

Tom Hogarty (hogarty.pad@sni.de), whose address bounces, suggests:

Points, from the bottom up:

  1. Check that the Unix system is performing properly by using sar or other Unix tools to monitor CPU and disk activity. Look for bottlenecks, i.e. contention on a particular disk.
  2. Think about reconfiguring your Informix installation so that the four logical units (rootdbs, physical log, logical logs and your database dbspace) reside on physically separate devices. BTW, if you don't use raw devices on your system (not always the best), then check it out, RTFM, and find out whether your platform can take advantage of raw devices. Further refinement is possible by placing heavily used tables in a separate dbspace/device. The broad picture is to have all Unix resources working with you and not against each other.
  3. Monitor the state of your data by checking your extents from time to time; use tbcheck -ce to get warnings on extents that have overreached the 8-extent recommendation. Informix has to work very hard to run with tables spread over a lot of extents (we used to call it data fragmentation). Reorganise the data with an unload/delete/load on the offending table and calculate the next extent size to suit your own system's needs.
  4. Further to point 3, if the system catalog tables (systables, syscolumns, etc.) have managed to run to more than 8 extents then a dbexport of your database is advisable. When the export runs it produces an .sql file to control the dbimport of the data; you may edit this file to include an ALTER TABLE command (preferably near the beginning of the control file/.sql) for these system catalog tables so that you MODIFY NEXT SIZE nnn, e.g. ALTER TABLE systables MODIFY NEXT SIZE 100; The value of nnn for each affected table may be ascertained by running tbcheck -pT <dbname>:owner.<tabname>; read how many pages have been used in the current schema, convert that to Kb and make that the size of your next extent (minus the first extent, if you want to get picky).
  5. Various releases of dbexport have their own canny characteristics and, in particular, one feature (nudge, nudge) of my current version is that it aborts because of a swapping problem. If you live with such a version then su to root and change the obvious(?) ulimits to unlimited (i.e. ulimit -v unlimited), su to informix and run the dbexport from there.
  6. Another feature(?) of dbexport in some versions is the omission of the NEXT EXTENT for tables and/or the LOCK MODE of tables; these can be preserved by running the two scripts which follow this posting.

I know that tuning/optimisation is/can be a complex subject and would not hope to cover it with any sufficient completeness here, but there is always the Fine Manual to refer to, along with:

     DATABASE TUNING...        by Dennis E. Shasha ISBN 0-13-205246-6
     SED & AWK                 by Dale Dougherty   ISBN 0-937175-59-5
     SYSTEM PERFORMANCE TUNING by Mike Loukides    ISBN 0-937175-60-9
     Informix OnLine Performance Tuning            ISBN 0-13-124322-5
                               by Elizabeth Suto Published by Prentice Hall

The first is general db tuning, the second is an invaluable guide to sed/awk which I find a great help in managing the various tbstat/tbcheck outputs, and the third is a guide to Unix system tuning. Elizabeth Suto's book has been described as "both excellent and concise". Joe Lumbley's well publicised Survival Guide may also be worth a look.

I hope some of you can further refine and correct my outpouring so that the message is clear: optimise resources, remove bottlenecks and keep on reading the manual.

The scripts mentioned in part 6:

# Name    : lmode
# Author  : Tom Hogarty
# Program : Integrated script/awk program to read/format/output
#           the current (row level) lock modes in a database.
# Syntax  : lmode >>modes.sql
#           ( update the new db calling [isql|dbaccess] <dbname> modes )
#           call as user informix
>/tmp/tom
echo " select 'ALTER TABLE ', owner, tabname
       from systables
       where locklevel = 'R' " >/tmp/tom
# edit the following line
[isql|dbaccess] <dbname> - </tmp/tom  |
awk '
    $0 ~ /ALTER TABLE/ { tabname=hoch1$3hoch2$4
                         printf " %12s %24s %34s \n",
                         "ALTER TABLE", tabname, " LOCK MODE (ROW) ; commit work ;"
    }  ' hoch1=\"  hoch2=\".


# Name    : nextx
# Author  : Tom Hogarty
# Program : Integrated script/awk program to read/format/output
#           the current next extent sizes in a database.
# Syntax  : nextx >>extents.sql
#           ( update the new db calling [isql|dbaccess] <dbname> extents )
#           call as user informix
>/tmp/tom
echo " select 'EXTENTS',
               owner,
               tabname,
               nextsize
         from systables
         where owner  <>  'informix'
         order by systables.owner,
                  systables.tabname;" >/tmp/tom

# edit the following line
[isql|dbaccess] <dbname> - </tmp/tom |
awk '
    $0 ~ /EXTENTS/ {  tabname=hoch1$2hoch2$3
                      nextsize=$4*2
                      printf "%12s %24s %18s %4s %15s\n",
                      "ALTER TABLE ",tabname," MODIFY NEXT SIZE ",
                       nextsize, "; COMMIT WORK ;"
    }  ' hoch1=\" hoch2=\".

On 11th Dec 1997 kagel@ns1.bloomberg.com (Art S. Kagel) wrote:-

BTW: I noticed your NETTYPE parameter for onipcshm in CPU VPs with 1 listener. If you increase this to 4 listeners there will be a listener in each CPU VP and you will 1) spread the load across the CPU VPs better and 2) be more responsive. The problem, I have found, is that the VPs with listeners take most of the work onto themselves, which makes them too busy to listen, but when they free up to listen they are not busy so they add the next job to their own queue instead of handing it off to another CPU VP. So 3 of your CPU VPs are in wait loops waiting for work most of the time while CPU VP1 is swamped. Check the CPU time distribution over the VPs in onstat -g glo with the current config and with 4 listeners and you can convince yourself. We started with 28 CPU VPs and 8 listeners as Informix consultants thought we should only need a few listeners. There were always some applications that were periodically paused waiting for service but 20 of the CPU VPs were never busy. We increased to 28 listeners and voila! No more delays and better VP utilization.

On 28th Oct 1997 kagel@ns1.bloomberg.com (Art S. Kagel) wrote:-

OK. Basically a NET VP (shown as class tli, soc, ipx, shm, str in the onstat -g glo report) is a Virtual Processor whose only job is to poll the various connections for new sessions and additional queries on existing sessions and pass the work off to CPU VPs as needed. The CPU VPs are doing actual query work for various sessions and in addition can handle exactly one poll thread to do listening itself. Basically it is best for networked connection types to have listeners in NET VPs and shared memory connection listeners in CPU VPs. Shared memory listeners in NET VPs will poll incessantly burning CPU time. Network listener threads in CPU VPs use more system call overhead. The difference has to do with the fact that the NET VPs have nothing better to do than monitor the connections and so can block waiting for network activity but they cannot block on shared memory operations and so have to poll in a loop. Inversely the CPU VPs cannot block waiting for network activity, they have better things to do with their time, and so have to poll network connections each time they have completed some user thread operation. Meanwhile the very fact that they are busy most of the time doing real work means that they can only poll shared memory connections occasionally which reduces the polling overhead for shared memory connections versus NET VPs.


6.16 DATETIME, INTERVAL and other tricks

Jack Parker has provided "a useful record of the sorts of problems you have to deal with when using DATETIME and INTERVAL types". It's a bit large to put in the main body of the FAQ, but here's a slightly edited copy of his introduction. The full, unabridged, document appears later as Appendix I.

A word about datetime/intervals:

As most of you know a DATETIME identifies a moment in time. For example: On Oct 11, 1987 at 11:05:32.32AM my wife and I exchanged wedding vows.

Say we have a recurring event - an appointment which occurs every month at a particular time - say on the 4th at 11:30. In this case we don't care about the year or the month, just the day and the hour.

A datetime datatype has the granularity to measure whatever you want. It can go from YEAR to FRACTIONs (5 decimal places) of a second - depending on your OS (syntax supports 5 places, but our OS only fills two of them).

There is a reasonable discussion of all this in Appendix J of TFM (4gl ref).

An Interval measures a time span. For example '2 hours ago'. While this is not a datetime concept, it uses much the same syntax and you can specify it in the same manner (YEARs through FRACTION). So 'how long have you been married' is an interval question. 'When did you get married' is a datetime question.

Who in their right mind cares? Anyone who wishes to mark a specific moment in time.

It makes no sense to 'add' two instants in time (or two datetimes) - what you want to do is take an instant in time (DATETIME) and add a time span (INTERVAL). "In two days I'm getting married". Sure this makes sense talking about it like this, but it's not 'intuitively obvious to the most casual observer' when working with the stuff. It's real easy to screw up what you put and where you put it. I probably wasted half of the time I spent on this trying to manipulate one datetime with respect to another datetime.
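A couple of sketches of the distinction (the dummy SELECT against systables is just a convenient way to evaluate an expression, and the literals are only illustrative):

   { DATETIME + INTERVAL gives another DATETIME: "in two days ..." }
   SELECT CURRENT YEAR TO MINUTE + INTERVAL(2) DAY TO DAY
     FROM systables WHERE tabname = "systables";

   { DATETIME - DATETIME gives an INTERVAL: "how long have you been married?" }
   SELECT CURRENT YEAR TO MINUTE - DATETIME(1987-10-11 11:05) YEAR TO MINUTE
     FROM systables WHERE tabname = "systables";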

---

Like I said at the start, see Appendix I if you want to learn more.


6.17 Reading SE Audit Trails

From: lester@access.digex.net (Lester Knutsen)

Informix SE has a feature (not available in Informix OnLine) to create an audit trail of all adds, deletes and updates to a table. This can be used to trace which user is changing critical data. The procedures to create an audit file are simple and well documented. However, how to read or use the audit file is not well documented. The SQL syntax to create an audit trail is:

create audit for table_name in "pathname"

The full pathname is required for this command.

An Informix audit file is structured the same as the original data file (.dat file) with a header. One way to access an audit file is to convert it into an Informix database table. Then you can perform searches on the changes made to your data. The following steps will create an Informix database table out of an audit file. As an example, I will use the stores database and the orders table that come with Informix products. I recommend that you try this in a test environment first. To create an audit file on the orders table, type the following SQL command in dbaccess or isql.

create audit for orders in "/usr/lester/demo/orders.audit"

Every change to the orders table will be captured in this file. The next step is to create a way of loading this into a database and using the data.

  1. Create an SQL schema for the new audit table. The schema will be based on the table you are auditing, with an additional five-field header. You can use dbschema to do this by typing:

       dbschema -s stores -t orders a_orders.sql

  2. Edit the a_orders.sql script and add the additional fields for the audit header. The audit file includes the following five header fields:
a_type         char(2)   Type of record where aa = added, dd = deleted,
                         rr = before update image, ww = after update
                         image.
a_time         integer   Integer internal time value.
a_process_id   smallint  Process ID that changed the record.
a_usr_id       smallint  User ID that changed the record.
a_rowid        integer   Original rowid.

You will also need to change the table name in the SQL script produced by dbschema to the name you want to call the audit table. I like to use the old table name with an "a_" prefix. The Index statements will also need to be changed. There must be one index on this table for the next step with bcheck to work. The old unique indexes should be removed because in the audit file the same record could appear for multiple changes. Change the index statements to use the new table you are creating. The following example is the script for the orders table:

create table a_orders   (
     a_type         char(2),
     a_time         integer,
     a_process_id   smallint,
     a_usr_id       smallint,
     a_rowid        integer,
     order_num      serial not null,
     order_date     date,
     customer_num   integer,
     ship_instruct  char(40),
     backlog        char(1),
     po_num         char(10),
     ship_date      date,
     ship_weight    decimal(8,2),
     ship_charge    money(6,2),
     paid_date      date  );
create index a_order_num_idx on a_orders ( order_num );

  3. Create the table with the new name. This should produce an empty table ready to hold your audit file data.

  4. Copy the audit file to replace the new table data file. This step will destroy any data in the a_orders.dat table, so proceed with caution. Look up the pathname of the data file created for this table. One way is to perform the following select:
select dirpath from systables where tabname = "a_orders"

On my system dirpath was "a_order007". Change to the directory where the database files are located and copy the audit file to replace the a_order007.dat file.

cd stores.dbs
cp /usr/lester/demo/orders.audit a_order007.dat

  5. After overwriting the ".dat" file the data and the index will be out of sync. Use bcheck, the Informix index repair tool, to fix the index file. Type the following command:

       bcheck a_order007

You now have an Informix table of your audit records and you can run SQL or build perform screens to see changes made to your data. Repeat steps 4 and 5 every time you need to update the table. The following is an example SQL statement. The results show an add (aa), change ( rr and ww) and a delete (dd) on the orders table.

select a_type, a_time, a_process_id, a_usr_id, a_rowid, order_num
  from a_orders

a_type a_time    a_process_id a_usr_id a_rowid order_num
aa     759109477 823          200      16      1016
rr     759109502 823          200      15      1015
ww     759109502 823          200      15      1015
dd     759109516 823          200      16      1016


6.18 Recovering broken (or missing) indexes in SE

When your table's .idx file magically disappears from your .dbs directory, firstly be assured that you're not alone in experiencing this phenomenon, and secondly here's how you can regain access to your data.

From Rudy Fernandes:

  1. Create another table, say junk, within the same database. junk must have an IDENTICAL structure to the table with the deleted idx. You will find a dat and idx file created.
  2. Copy/rename the junk table's idx file to the idx deleted.

    e.g. cp junk___123.idx deleted105.idx

    (N.B. the idx file's prefix will be identical to that of its dat)

  3. Run a 'bcheck -s' on the original table.

    e.g. bcheck -s deleted105

  4. That's it!

6.19 What about Replication?

clema@informix.com (Clem Akins) writes:

It goes like this:

Data Replication, now called High-availability Data Replication, is the master/slave setup designed for disaster recovery.

Enterprise Replication (used to be Continuous Data Replication) is the far more powerful new capability. You can choose to replicate just columns, or selected rows, at different times, and deal with problems in any of several different ways.


On 8th Dec 1997 ddailey@informix.com (Doug Dailey) wrote:-

The problem is that you are wanting to do Enterprise Replication, not HDR. If your intention is to only replicate certain data sets from individual tables, then your best bet is to change your mode of thinking to ER.

There are skimpy release notes on this, SNMPDOC_7.2 and EREPDOC_7.2 in your $INFORMIXDIR/etc directory. But yes, you can replicate specific data from tables based on the WHERE filters you set up. The hardware medium (platform) can be different, and the disk layout and configuration of the instances can be independent of one another. You will need Replication Manager 7.22.TC1, which runs on an NT 3.51 or greater Workstation, in order to run replication, although there is also an API that comes with 7.22.UC1 ESQL/C (or by default with the 7.23.x engine) and is located in the $INFORMIXDIR/demo/esql directory.

Following is info on starting the SNMP and sub-agent for SCO:

SNMP Information

The following files are used for the master agent on SCO. SCO's SNMP is PEER based, and INFORMIX does not ship a master agent with our product on this platform. These files are located in the /etc directory:

Snmpd {executable that starts the master agent}

lrwxrwxrwx 1 root root 32 May 22 1996 snmpd -> /opt/K/SCO/tcp/2.0.0Cl/etc/snmpd

snmpd.comm {needs the following entry}

hskgsk96 0.0.0.0 WRITE

snmpd.conf {needs the following entry}

descr=SCO TCP/IP Runtime Release 2.0.0
objid=SCO.1.2.0.0
contact= Doug Dailey       {this is just what I used}
location= Lenexa           {same thing}
smux_reserved=system
smux_reserved=snmp
smux_reserved=smux

snmpd.peers {needs the following entry}

"cdragt" 0.0 ""

snmpd.trap {not necessary, used for event trapping}

Notes on SCO and starting the agent:

It is not necessary to set SR_AGT_CONF_DIR if you are using the master agent of the OS. This is the basic rule for all platforms using SNMP master supplied by the OS vendor.

Start the master agent with the following syntax:

/etc/snmpd log_level 6 & {0-6, 6 being the most detail. Do a man on snmpd on SCO to get more detail}

The log file will be created in /usr/adm/snmpd.log.

Start the cdr sub-agent the same way you would on any other system:

$INFORMIXDIR/bin/cdragt -t 3 -f /tmp/cdr.out &

Here is also some info on a base test in getting this up and running:

Below are steps for setting up replication on the UNIX side for the SUN Solaris platform. You will have to substitute the portions regarding the master and sub-agent with the above info regarding SCO:

I. Installation:

II. Configuration:

env for test1
setenv INFORMIXDIR /testing/ddailey
setenv INFORMIXSERVER test1tcp
setenv PATH ${INFORMIXDIR}/bin:/tsgoldopt/bin:${PATH}
setenv ONCONFIG onconfig.test1
setenv LD_LIBRARY_PATH $INFORMIXDIR/lib:$INFORMIXDIR/lib/esql
setenv SR_AGT_CONF_DIR $INFORMIXDIR/snmp/snmpr

env for test2
setenv INFORMIXDIR /testing/ddailey
setenv INFORMIXSERVER test2tcp
setenv PATH ${INFORMIXDIR}/bin:/tsgoldopt/bin:${PATH}
setenv ONCONFIG onconfig.test2
setenv LD_LIBRARY_PATH $INFORMIXDIR/lib:$INFORMIXDIR/lib/esql
setenv SR_AGT_CONF_DIR $INFORMIXDIR/snmp/snmpr

sqlhosts
****************************************************************************
test1   group   -       -       i=1
test1tcp     ontlitcp        sunset             test1tcp g=test1

test2   group   -       -       i=2
test2tcp     ontlitcp        sunset             test2tcp g=test2

onconfig.test1
****************************************************************************
ROOTNAME        rootdbs         # Root dbspace name
ROOTPATH        /testing/ddailey/test1
ROOTOFFSET      0               # Offset of root dbspace into device (Kbytes)
ROOTSIZE        20000           # Size of root dbspace (Kbytes)

SERVERNUM       60              # Unique id corresponding to a OnLine instance
DBSERVERNAME    test1tcp                # Name of default database server
DBSERVERALIASES                 # List of alternate dbservernames
NETTYPE                         # Configure poll thread(s) for nettype
RESIDENT        0               # Forced residency flag (Yes = 1, No = 0)

MULTIPROCESSOR  0               # 0 for single-processor, 1 for multi-processor
NUMCPUVPS       1               # Number of user (cpu) vps
SINGLE_CPU_VP   0               # If non-zero, limit number of cpu vps to one

onconfig.test2
****************************************************************************
ROOTNAME        rootdbs         # Root dbspace name
ROOTPATH        /testing/ddailey/test2
ROOTOFFSET      0               # Offset of root dbspace into device (Kbytes)
ROOTSIZE        20000           # Size of root dbspace (Kbytes)

SERVERNUM       61              # Unique id corresponding to a OnLine instance
DBSERVERNAME    test2tcp                # Name of default database server
DBSERVERALIASES                 # List of alternate dbservernames
NETTYPE                         # Configure poll thread(s) for nettype
RESIDENT        0               # Forced residency flag (Yes = 1, No = 0)

MULTIPROCESSOR  0               # 0 for single-processor, 1 for multi-processor
NUMCPUVPS       1               # Number of user (cpu) vps
SINGLE_CPU_VP   0               # If non-zero, limit number of cpu vps to one

IV. Starting the Agents:

1. As root, go to $INFORMIXDIR/bin (in ksh)

./snmpdm -apall >/tmp/snmp.out 2>&1 &

2. As informix, go to $INFORMIXDIR/bin (in ksh)

./cdragt -t 5 -f /tmp/cdr.out 2>&1 &

V. Installing Rep Manager

1. Installed as Administrator and used at least the NT 3.51 version of 7.22.TC1 rep manager

2. REPHOSTS file:

test1   group   -       -       i=1
test1tcp     ontlitcp        sunset             test1tcp g=test1

test2   group   -       -       i=2
test2tcp     ontlitcp        sunset             test2tcp g=test2

VI. Test

1. create table on instance A : create table test (col1 char(1) primary key, col2 char(5));
   create table on instance B : create table test (col1 char(1) primary key, col2 char(5));
   inserted 4 rows into table test on instance A prior to defining the replicate
2. databases created in rootdbs with logging
3. synchronized the two servers through Replication Manager
4. defined a replicate with the following properties:

     update anywhere
     transaction scope
     ris and ats spooling
     select * from test where test = "cc"
     ignore conflict resolution
     continuous replication
     test1 and test2 instances
     all participants defined the same? NO

5. update test set col2 = "cc";  {yes to all}
   /* updates not converted to inserts on instance B */
   /* subsequent updates only propagated on newly inserted rows */

6.20 How well does Online scale?

Nils.Myklebust@ccmail.telemax.no writes:
:   smb@oasis.icl.co.uk (Stuart Britt) writes:
:  Does anyone have any info on the following problem?  We are using Informix
:  version 7.10.UC1 for Intel UnixWare 2.01 on single processor and SMP
:  machines.
:  
:  Using Informix on 1 processor we achieve good performance.  If we switch to
:  2 processors, then our performance actually degrades.  We have been given
:  some advice on tuning the system for dual processor machines and we have 
:  tried different settings for virtual processors and processor affinity.
:  We managed to get an improvement, but the best scaling we can get is
:  about 1.25.
:  
:  If we move to a quad processor system the problem goes away and we 
:  approach expected performance levels.
:  
:  Has anyone else experienced this?

This conforms exactly to what Informix said at the user conference this summer. According to that, you can't really expect any performance increase until you have 4 processors. This is because the extra resources required to do the switching between virtual processors take up more time than you save with an additional processor.

Not long ago someone also said that on SCO Unix even a 4 processor configuration didn't give that much increased performance. As we also use Intel hardware and plan to upgrade to multiple processors it would be very interesting for us to hear what your performance increase is.

6.21 How do I tune KAIO?

jmiller@informix.com (John Miller)

One way KAIO can hurt performance is when it is no longer asynchronous. This can happen on some platforms when a large number of I/O operations are being requested. The nice thing about this is that it is usually tunable. On HP there is an environment variable called IFMX_HPKAIO_NUM_REQ; by default this value is set at 1000. If you are doing a lot of I/O this might not be high enough. On DEC I believe it needs to be tuned in the kernel, and for other platforms you will have to find out what the threshold is and how to tune it.

On 2nd Oct 1998 Eric_Melillo@agsea.com (Eric Melillo) writes:-

An environment variable IFMX_HPKAIO_NUM_REQ is provided to specify the number of simultaneous KAIO requests that are processed by the server at one time. The value can be set in the range of 10 to 5000, the default being 1000. If the error

"KAIO: out of OS resources, errno = %d, pid = %d" occurs, consider increasing the value of IFMX_HPKAIO_NUM_REQ.

On 14th Dec 1997 djw@smooth1.demon.co.uk (David Williams) writes:-

KAIO is definitely tunable under AIX; go into smit and there is a whole section on Asynchronous I/O:-

Change / Show Characteristics of Asynchronous I/O

Type or select values in entry fields.

Press Enter AFTER making all desired changes.

                                                        [Entry Fields]
  MINIMUM number of servers                          [1]                       #
  MAXIMUM number of servers                          [10]                      #
  Maximum number of REQUESTS                         [4096]                    #
  Server PRIORITY                                    [39]                      #
  STATE to be configured at system restart            defined                 +

The help for each field is :-

MINIMUM Indicates the minimum number of kernel processes dedicated to asynchronous I/O processing. Since each kernel process uses memory, this number should not be large when the amount of asynchronous I/O expected is small.

MAXIMUM Indicates the maximum number of kernel processes dedicated to asynchronous I/O processing. There can never be more than this many asynchronous I/O requests in progress at one time, so this number limits the possible I/O concurrency.

REQUESTS Indicates the maximum number of asynchronous I/O requests that can be outstanding at one time. This includes requests that are in progress as well as those that are waiting to be started. The maximum number of asynchronous I/O requests cannot be less than the value of AIO_MAX, as defined in the /usr/include/sys/limits.h file, but can be greater. It would be appropriate for a system with a high volume of asynchronous I/O to have a maximum number of asynchronous I/O requests larger than AIO_MAX.

PRIORITY Indicates the priority level of kernel processes dedicated to asynchronous I/O. The lower the priority number, the more favored the process is in scheduling. Concurrency is enhanced by making this number slightly less than the value of PUSER, the priority of a normal user process. It cannot be made lower than the values of PRI_SCHED. PUSER and PRI_SCHED are defined in the /usr/include/sys/pri.h file.

STATE Indicates the state to which asynchronous I/O is to be configured during system initialization. The possible values are defined, which indicates that the asynchronous I/O will be left in the defined state and not available for use, and available, indicating that asynchronous I/O will be configured and available for use.

Hence KAIO is very configurable under AIX 3.2.5p; I expect other versions of UNIX are similarly configurable, even if the information is hard to find!

On 17th Sep 1998 mdstock@informix.com (Mark D. Stock) writes:-

Applications can use the aio_read and aio_write subroutines to perform asynchronous disk I/O. Control returns to the application from the subroutine as soon as the request has been queued. The application can then continue processing while the disk operation is being performed.

Although the application can continue processing, a kernel process (kproc) called a server is in charge of each request from the time it is taken off the queue until it completes. The number of servers limits the number of asynchronous disk I/O operations that can be in progress in the system simultaneously. The number of servers can be set with smit (smit->Devices->Asynchronous I/O->Change/Show Characteristics of Asynchronous I/O->{MINIMUM|MAXIMUM} number of servers or smit aio) or with chdev. The minimum number of servers is the number to be started at system boot. The maximum limits the number that can be started in response to large numbers of simultaneous requests.

The default values are minservers=1 and maxservers=10. In systems that seldom run applications that use asynchronous I/O, this is usually adequate. For environments with many disk drives and key applications that use asynchronous I/O, the default is far too low. The result of a deficiency of servers is that disk I/O seems much slower than it should be. Not only do requests spend inordinate lengths of time in the queue, but the low ratio of servers to disk drives means that the seek-optimization algorithms have too few requests to work with for each drive.

For environments in which the performance of asynchronous disk I/O is critical and the volume of requests is high, we recommend that:

maxservers should be set to at least 10 * (number of disks accessed asynchronously)
minservers should be set to maxservers / 2

This could be achieved for a system with 3 asynchronously accessed disks with:

# chdev -l aio0 -a minservers='15' -a maxservers='30'

On 3rd Oct 1998 jmiller@informix.com (John F. Miller III) writes:-

Here is the real story behind Informix/HP KAIO over the years. It is long, so for those of you who make it to the end I have included some HP KAIO tuning tips.

There have been three different versions of HP KAIO over the years, which are basically the same except for their polling methods. You can tell which version of KAIO you have by the line written to the online.log at startup:

Version 1 "HPUX Version XXX -> Using select based KAIO" Version 2 "HPUX Version XXX -> Using flag style KAIO" Version 3 "HPUX Version XXX ->Using flag/select style KAIO"

Version 1

It uses the select() system call to poll the OS for I/O completion status. This causes the kernel to use excess amounts of system time due to the expensive nature of the select() system call on HP.

Advantages: Very fast
Disadvantages: As the system gets busy the select() system call gets very expensive
Online Versions: 7.1, 7.2, 7.3

Version 2

It uses a memory address registered with the kernel to poll the OS for I/O completion status. When the server issues an I/O request through KAIO, it must poll the HP-UX asynchronous driver to determine when the request has been fulfilled. If a request has not yet been filled, the I/O thread may choose to go to sleep for 10 milliseconds before checking the request status again. Up through ODS 7.1, the I/O thread would sleep using a select() system call. This was changed in 7.2 to use the newer and lighter nanosleep() system call. Performance tests showed this was lighter on the CPU, and still gave us the desired sleep behavior. Sometime after 7.2 was released (we believe), nanosleep() was changed to correctly conform to its IEEE standard. One of the impacts is that we can no longer sleep for just 10 milliseconds. On average, the I/O thread is now sleeping for almost twice that long (when it does not need to) and this is why we believe KAIO performance has gotten so much worse.

Advantages: Significant reduction in resources
Disadvantages: When the system is not busy the KAIO poll thread will wait for I/O an extra 6 to 8 ms (Defect #92132); not supported prior to HPUX 10.10
Online Versions: 7.2, 7.3

Version 3

It uses the memory address registered with the kernel to poll the OS for I/O completion status when the system is busy, but when the system goes idle and waits on I/O completions it does not use nanosleep() as in version 2; instead it uses the select() system call to wait.

Advantages: Significant reduction in resources; very fast; a direct response to Defect #92132
Disadvantages: None found to date
Online Versions: 7.30.UC3 and above, and 9.14.UC1 and above

Question:

Won't using the select() system call cause the same performance issues as KAIO version 1?

Answer:

No it will not. The problem with version 1 was that as the system became busier the amount of I/O increased, which caused an increase in the number of select() calls made. In the new method (version 3) we only use the select() system call if the system is idle and waiting on I/O. In addition there is a benefit to using the select() system call: the ability to interrupt the timer if an I/O completes before the timer expires. This gives early notification of the I/O.

Tuning Tips for HP KAIO

RESIDENCY

In version 7.30, for best KAIO performance set the RESIDENT flag in the onconfig to -1. In versions prior to 7.30, set the RESIDENT flag to 1. This avoids a significant amount of kernel locking that must be done on non-resident segments.

NOTE:

Not setting the RESIDENT flag in the onconfig will cause an additional 8KB of memory to be allocated, because the online system must then ensure that this flag is on its own OS memory page. If the RESIDENT flag is set this is not necessary.

IFMX_HPKAIO_NUM_REQ

This specifies the maximum number of simultaneous KAIO requests online can process at one time. It can currently be set between HPKAIO_MIN_CONCURRENT (10) and HPKAIO_MAX_CONCURRENT (5000), with a default value of HPKAIO_DEF_CONCURRENT (1000). The only drawback to setting this higher is memory. Below is the formula to determine how much memory will be used:

bytes of memory = (# of cpuvps) * ((12 * IFMX_HPKAIO_NUM_REQ) + 4)

The default setting should take about 12KB per CPU VP. If you see the error "KAIO: out of OS resources, errno = %d, pid = %d" then you should consider raising IFMX_HPKAIO_NUM_REQ.
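
As an illustration only (the ksh wrapper, variable names and values below are assumptions, not anything supplied by Informix or taken from the post above), applying these two tips before starting the engine might look like this; the memory estimate simply restates the formula above, and RESIDENT itself lives in the onconfig file:

# hypothetical ksh fragment run before starting the engine on HP-UX
# (RESIDENT is set to -1 for 7.30+, or 1 for earlier versions, in $ONCONFIG)

NUM_REQ=1000                        # between 10 and 5000, default 1000
CPUVPS=4                            # your NUMCPUVPS setting

IFMX_HPKAIO_NUM_REQ=$NUM_REQ
export IFMX_HPKAIO_NUM_REQ

echo "KAIO request tables will use about $(( CPUVPS * (12 * NUM_REQ + 4) )) bytes"

oninit                              # start the engine as usual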

6.22 Why does ontape not notice when I change device names in onconfig?

kagel@bloomberg.com (Art S. Kagel) writes:-

This is only a problem for LTAPEDEV. Ontape does read the TAPEDEV and TAPESIZE parameters from the ONCONFIG for archives but for logical log backups only the reserved page version of LTAPEDEV and LTAPESIZE are used. Onmonitor modifies both the ONCONFIG and the reserved pages.

6.23 How do I reduce checkpoint times?

johnl@informix.com (Jonathan Leffler) wrote:-

If the performance nose-dives at checkpoint time, seriously consider reducing LRU_MIN_DIRTY and LRU_MAX_DIRTY. Don't try to go below LRU_MIN_DIRTY 1, LRU_MAX_DIRTY 2. The default values (30/20 in 4.00, and 60/50 later) are, in my not so very humble opinion, rubbish for any system with more than about 1 MB of shared memory. Most databases are read a lot more than they are modified.

On 13th Dec 1997 djw@smooth1.demon.co.uk (David Williams) wrote:-

kagel@bloomberg.com (Art S. Kagel) wrote:

"I agree that additional cleaners are of limited use during normal LRU writes but they are needed for fastest checkpointing. Since at checkpoint time each chunk is assigned to another cleaner thread until the threads are exhausted, and since that one thread, as has already been pointed out, is only scheduling the actual I/Os with either the AIO VPs or KAIO VPs, that thread will block on each I/O that it schedules and single thread writes to your disks. You must have multiple cleaners, even with a single CPU VP, since the other cleaners can wake and schedule the I/Os that they are responsible for while the first cleaner is blocked waiting for I/O service...My point is that the single cleaner thread has to wait for the issued I/O to complete. It does relinquish the CPU VP to other threads so that other work is not blocked

So I recommend 1.5 * min(#chunks,#LRUs) as the number of cleaners; keep the 2-2.5 * NUMCPUs idea for multiple CPUs in mind, constrained by a maximum useful value of about 64 for multiple CPU systems."
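
Purely as an illustration of the onconfig parameters being discussed (the values are examples, not a recommendation for any particular site), a small instance with, say, 3 chunks and 8 LRU queues, using lower dirty thresholds and Art's cleaner formula, might carry:

   LRU_MAX_DIRTY   10         # start cleaning LRU queues at 10% dirty
   LRU_MIN_DIRTY   5          # stop cleaning at 5% dirty
   CLEANERS        5          # ~1.5 * min(3 chunks, 8 LRUs), rounded up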

djw@smooth1.demon.co.uk (David Williams) continued

I've tried on a single CPU machine at work, here are my findings.


  Initial Setup for tests
  =======================

  Online 7.12.UC1 on an ICL Teamserver running DRS/NX.
  
  3 chunks , 2K pagesize

  Dbspace  File  size          Device

  rootdbs  root1 5000 pages on /dev/c0d0s21     Disk 1
  dbs1     chk1  1500 pages on /dev/c0d0s21     Disk 1
  dbs2     chk2  1500 pages on /dev/c1d4s1      Disk 2

  1 CPU VP, 2 AIO VPs.

 
  Database setup via dbaccess
  ===========================

  create database dlo    # This is in the root dbspace and has no logging
  create table t1
   (
   c1 char(2020)
   ) in dbs1;

  create table t2
  (
  c1 char(2020)
  ) in dbs2;


  
  Source code for test program
  ============================


  2.c

  #include <stdio.h>
  #include <sys/types.h>
  #include <time.h>

  main()
  {
  time_t t;
  int i,a;

  a=doit1(0);     /* doit1() and doit2() are the 4GL functions listed below */
  
  t=time(NULL);
  fprintf(stderr,"%s\n",ctime(&t));
  
  a=doit2(0);
  t=time(NULL);
  fprintf(stderr,"%s\n",ctime(&t));
  return(0);
  }

  1.4gl 


  FUNCTION doit1()
     DATABASE dlo
     DELETE FROM t1
     DELETE FROM t2
  END FUNCTION

  FUNCTION doit2()
     DEFINE i INTEGER

     FOR i = 1 TO 1000
        INSERT INTO t1 VALUES("11")
        INSERT INTO t2 VALUES("22")
     END FOR
  END FUNCTION


  Run this program once to populate the tables...


  Disk layout
  ===========

  oncheck -pe now shows

  rootdbs = sysmaster+sysutil.
  
  chunk2 (chk2) = 
                          START LENGTH
    OTHER RESERVED PAGES  0     2
    CHUNK FREE LIST       2     1
    TBLSPACE TBLSPACE     3     50
    dlo.t1                53    1008
    FREE                  1061  439 
  

  chunk3 (chk3) = 
                          START LENGTH
    OTHER RESERVED PAGES  0     2
    CHUNK FREE LIST       2     1
    TBLSPACE TBLSPACE     3     50
    dlo.t2                53    1008
    FREE                  1061  439  


  Online configuration
  ====================

  OK so we have 2 tables on 2 disks

  We have

  NUMAIOVPS 2
  NUMCPUVPS 1
  BUFFERS 2000
  Physical Log Size 2000
  LRU_MAX_DIRTY 99
  LRU_MIN_DIRTY 98

  This is a small instance so I've forced more I/O to occur at 
  checkpoint time.
 

  Test procedure
  ==============

  1. start online                   
    # Note this is from 'cold'
    # Since the program has been run once we have 'preallocated 
    # our extents. Since we delete and reinsert the same rows
    # with no indexes, we have exactly the same I/O's occuring
    # to the same pages on both runs.

  2. Run program above once.
  3. onmode -c  # Force checkpoint
  4. onstat -m  # Note checkpoint time.

  Repeat steps 2, 3 and 4 two more times


  Timing with CLEANERS = 1
  ========================

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  17:12:18 17:12:59 41         4/4/4/5/5/2        24 
  17:13:50 17:14:30 40         4/4/5/4/4/1        22
  17:15:33 17:16:37 64         4/4/4/5/5/1        23

  Timing with CLEANERS = 2
  ========================

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  17:18:28 17:19:04 36         4/3/3/3/3/1        17 !!!!  
  17:20:09 17:20:44 35         4/3/4/2/3/1        17
  17:21:45 17:22:20 35         4/3/3/2/2/1        15


  With 1 AIO VP (just as a control, or rather I forgot to set it to 2 the
  first time around!!)

  1 CLEANER

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  16:40:37 16:41:17 40         4/4/4/5/5/2        24
  16:42:22 16:43:01 39         5/4/4/4/4/1        22
  16:44:12 16:44:52 40         4/4/4/5/4/1        22

  2 CLEANERS

        Program             
  Start    End      Run Time   Checkpoint Times   Total checkpoint time
  16:31:12 16:31:52 40         4/4/4/5/4/2        23
  16:32:45 16:33:24 39         4/4/5/4/4/1        23
  16:35:10 16:35:50 40         4/4/4/4/4/1        21  


  Summary
  =======

  With 1 AIO VP                 checkpoint=22-24 secs program = 39-40
  With 2 AIO VPs and 1 CLEANER  checkpoint=22-24 secs program = 39-40
  With 2 AIO VPs and 2 CLEANERS checkpoint=15-17 secs program = 35-36!!


  Conclusion
  ==========

  Checkpoints are faster, and LRU cleaning is faster, with more cleaners,
  even on a single CPU VP machine.


6.24 Why does Online show up as "oninit -r" processes?

On 17th Dec 1997 kagel@bloomberg.com (Art S. Kagel) wrote:-

The engine was brought up by a cold restore (ontape -r). Ontape is only a manager of the restore process; the oninits do all the work, so after restoring the reserved pages itself, ontape starts up the Informix engine to do the rest. The undocumented -r argument to oninit notifies the master oninit process to coordinate with ontape and not attempt to read anything beyond the reserved pages (the DATABASE TABLESPACE and TABLESPACE TABLESPACE are normally read at startup, but they are not there yet). When the restore completes the engine is in quiescent mode, and apparently someone ran an onmode -m to bring it online. When you next shut down and restart that instance the "problem" will vanish.


6.25 What are these CDRD_ threads?

On 19th Dec 1997 satriguy@aol.com (SaTriGuy) wrote:-

>Right now we are seeing that there are almost 700 informix threads on
>the primary instance. I get this using 'onstat -u'. All these threads
>are owned by informix. The Flags field of the 'onstat -u' o/p has
>"Y--P--D" (Y - waiting on a condition, P - Primary thread for a session,
>D - deamon thread) for all the above threads. The 'pid' of the above
>sessions is 0 in the sysesssions table.

On 7.2x with Enterprise Replication, there are a lot of threads that might be generated. If the majority of them start with CDRD_, then they are the "data sync" threads. If there appear to be an excess of them in comparison to the repl threads, then there could be a problem.

6.26 What does onstat -h do?

On 8th Sep 1998 Gary.Cherneski@MCI.Com (Gary Cherneski) wrote:-

Hope this helps:

# of chains     A count of the number of hash chains for which the chain
                length is the same (ie how many chains have the length shown
                under "of len").

of len          The length of those chains.

total chains    The total number of hash chains (buckets) created.

hashed buffs    The number of buffer headers hashed into the hash buckets.

total buffs     The total number of buffers in the buffer pool.

6.27 How do I start up Extended Parallel Server?

On 27th Aug 1998 tschaefe@mindspring.com (Tim Schaefer) wrote:-

Regular startup, every day after initialization:

   xctl -b -C oninit

First time (and ONLY the first time) initialization:

   xctl -b -C oninit -iy

You should be on 8.2 now so as to take advantage of any potential bug fixes, etc.

6.28 How many AIO VPs should I have?

On 16th Sep 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

WHOA HORSEY! Informix's recommendation for 7.10 was 2-3 AIO VPs per chunk; for 7.14 they made some improvements and changed that to 1.5-2 AIO VPs per chunk; and as of 7.21, after another round of optimizing AIO VPs, the recommendation is 1-1.5 AIO VPs per chunk (leaning toward 1). I have not seen any recommendation for 7.3, but with the other speedups in the 7.3 code stream I'd say that 1 AIO VP per chunk is now a maximum. I have always found these recommendations to be MORE than sufficient. I actually only count active chunks when determining AIO VP needs. For example, on one pair of servers I have 279 chunks and growing, but many of these hold one month's data for four years of history (ie all September data in one dbspace made of 8-12 2GB chunks), so only one out of twelve dbspaces (and their chunks) is active, and I do very well with 127 AIO VPs (io/wp 0.6 -> 1.3 with most at 0.8 -> 1.1). IBM's recommendation is excessive in the extreme.

Run onstat -g ioq to monitor queue length by chunk, and onstat -g iov to monitor the activity of the KAIO and AIO VPs (BTW you can determine which are being used this way). If almost all of your AIO VPs are performing at least 1.0 io/wp, with many showing 1.5 or more, then you need more AIO VPs; if not, not. An io/wp value below 1.0 means that an AIO VP was awakened to handle a pending I/O, but before it could get to the required page another, previously busy, AIO VP finished what it was doing and took care of it. In effect, having some of your VPs below 1.0 means you could even do without those VPs except at peak. Conversely, if you have any AIO VP with <0.5 io/wp you can probably reduce the number of AIO VPs accordingly, since more than half the time these VPs awaken there is nothing for them to do; they are just wasting cycles and taking up swap space.
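
A rough way to automate that check is sketched below. It assumes only that your version's onstat -g iov output has a column literally headed io/wp (the column position is located from the header line rather than hard-coded); the 0.5/1.0 thresholds are the ones given above:

onstat -g iov | awk '
/io\/wp/ {
        # header line: remember which column holds the io/wp figure
        for (i = 1; i <= NF; i++) if ($i == "io/wp") col = i
        next
}
col > 0 && $col ~ /^[0-9]+\.?[0-9]*$/ {
        if ($col + 0 >= 1.0) print "busy AIO/KAIO VP (consider adding VPs): " $0
        if ($col + 0 <  0.5) print "idle AIO/KAIO VP (candidate to remove): " $0
}'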

If it turns out, as I expect, that the VPs are not the culprit, look to your disk farm. Do you use singleton disk drives on a single controller? Work toward the ultimate setup: RAID1+0 of at least eight mirrored pairs, spread across at least 4 controllers (2 primary, 2 mirror, preferably with 4 additional backup controllers handling the same drives), an 8 or 16K stripe size, and 500MB-1GB cache per controller. Expensive, but no I/O problems. Anyway: RAID1+0, small stripe size.

6.29 How should I use mirroring + striping?

On 8th Oct 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

RAID 0+1 (ie mirroring two stripe sets to each other): If a drive on one side of the mirror fails, that entire stripe set is useless and the whole stripe needs to be recovered from the mirror. Also, you can only recover from multiple drive failures if all failed drives are on the same side of the mirror. (The probability of data loss from a multiple drive failure is 50%.) This can be a real problem if you buy drives in large orders, which can come from the same manufacturing batch and may all be prone to the same failure. (Don't poo poo it, it happened to us! A whole batch of IBM 2GB drives all failed at 100,000 hours of operation; we had as many as 20 drives failing on the same day until they were all replaced. We got lucky and data loss was minimal.) The performance penalty during drive rebuild can be as much as 90% of capacity. Also, until the drive is replaced, the read performance boost that mirroring gives you is completely lost.

RAID 1+0 (ie stripe several mirrored pairs together): If a drive fails, only that one drive needs to be rebuilt, and the read performance boost that mirroring gives you is only lost for that one pair; until the drive is replaced the other pairs continue to perform. During drive replacement the performance penalty cannot exceed about 90% of 1/Nth of capacity, where N is the number of pairs in the stripe set. In addition you are protected against multiple drive failures even before the defunct drive is replaced, unless both drives of a mirrored pair are lost (probability 1/2N). Total rebuild time is 1/Nth of RAID 0+1 also.

On 11th May 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

On the Fragmentation -vs- Striping issue I am going to disagree, in part. I believe that you should BOTH stripe across as many disks and controllers as possible (I use at least 10 drives but prefer 20-30 spread over 2 to 3 separate controllers) and then fragment your critical tables also. In this way you are unlikely to develop any REAL hot-spots. Striping should use a 16K stripe size so that any single Informix BIGREAD comes from a single drive while readahead affects several drives. Also, when creating the 2GB virtual drive devices on the stripe set, try to arrange things so that each successive virtual drive begins on a different disk, and so that the stripe is built such that successive drives are on different controllers, if possible. This will go a long way toward eliminating any hotspots. With this scheme only hitting the same 8 pages repeatedly could possibly create a hotspot, and Informix's BUFFER cache will prevent that from happening anyway.

Your administration load of tracking I/Os to eliminate hotspots will dwindle SIGNIFICANTLY with this scheme.

On 24th Aug 1998 Gary.Cherneski@MCI.Com (Gary D. Cherneski) wrote:-

I am a Sr. DBA managing large (> 1 TB) Informix OLTP databases. We use Veritas striping and mirroring (RAID 10). For down and dirty OLTP using indexes, it's great. We round-robin all our table data (many tables in excess of 300 million rows) and have all of our indexes in a single dbspace.

Your assumption with what your Informix SE said is correct. This configuration will not work optimally (but it will work) for DSS type queries. In addition, large multi-table joins have a somewhat diminished performance, even if they use indexes.

If your primary need is OLTP but you still must maintain a strong DSS type environment, you might try using RAID 10 with veritas striping/mirroring and an expression based fragmentation scheme for the data and indexes. This would give you the ability for multiple threads when performing DSS type queries as well as more threads for multi-table joins. The downside of this is that it is harder to maintain (all of the frag expressions.)

If DSS type query performance is not a factor but you have a significant amount of multi-table joins that are critical, you could try round-robin'ing your data and using expression based fragmentation for your indexes. This would give a better performance for those multi-table joins and is not as terrible to maintain.

If DSS type query and multi-table join query performance is not a factor, I would go with the scheme we use: round-robin your data in n buckets and use one dbspace for indexes. This by far is simpler to maintain since you do not have to watch frag expressions as they grow and this will deliver the OLTP performance you need.

Additionally: Are you considering "hot spares" where veritas will automatically migrate a disk slice to a spare drive/slice in the event of problems? It helps with redundancy but watch out for the long sync times when disk trays or controllers go south.

6.30 Why is shared memory not freed when my session ends?

On 9th Feb 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

The memory used by your cursors and prepared statements (4GL implicitly prepares statements even if you did not explicitly prepare any) is freed when the cursor is freed. The communications areas allocated to support your query were freed when your program exited; they are kept for reuse until then. But the connection specific data structures allocated in shared memory are not freed when your program exits. They are kept for future reuse by another process connecting. You should see that if another copy of the process is executed after the first has exited, the memory used does not continue to increase; the allocated structures are reused.

6.31 Do detached indexes take up more space?

On 18th Feb 1998 mcollins@us.dhl.com (Mark Collins) wrote:-

Detached indexes actually take more space. In an attached index, each key includes a one-byte delete flag and a four-byte row location. With a detached index, you also have a four-byte tablespace identifier that points to the base table's tablespace.

This can be verified by running oncheck against an attached index vs. a detached index. Samples of such onchecks are included here:

 
       Attached index
   Level 2 Node 28 Prev 0 Next 27
   Key:    4900001:
   Rowids:       201
   Key:    4900002:
   Rowids:       202

 
       Detached index
   Level 2 Node 3 Prev 0 Next 2
   Key:    4900001:
   Fragids/Rowids:    30001f/     101
   Key:    4900002:
   Fragids/Rowids:    30001f/     102

Note that instead of just "Rowids", we now have "Fragids/Rowids." The fragid value points to the tablespace number of the base table. Regardless of the name "fragids", let me assure you that the table used in this case was not fragmented. This was a simple table defined in a specific dbspace, with the index defined with an "in dbspace" clause as well.

For experimental purposes, I tried the "create index" directing the index into the same dbspace as the base table, and also into a separate dbspace. The oncheck results were the same in both cases. Thus, anytime you specify "in dbspace" when creating an index, even if you specify the same dbspace as for the table, the index is considered detached.

As a result, you may want to modify the space calculations for estimating index pages. I know I wasn't able to calculate index space correctly until I discovered the extra four bytes.

6.32 How do I reduce bufwaits?

On 12th May 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

First, bufwaits is NEVER zero for a REAL working server. That said, it is desirable to keep bufwaits to a minimum. A rule of thumb I use is to keep bufwaits to 2-10% of either dskreads or bufwrits (or <3% of the total), since both of these operations require locking an LRU and are most likely to cause a bufwait. There are several things that affect bufwaits:

1) More buffers == fewer bufwaits but with your cache %ages this seems OK.

2) More LRUS == fewer bufwaits. This is critical! The more LRUS the better. Avoid LRUS=64 and LRUS=96 as there seems to be a bug in the LRU hash algorithm that causes bufwaits to skyrocket at these particular values (and LRUS=128 triggers a harmless but scary bug at engine startup that complains about zero length Physical and Logical log files). I know that the following values are fine: 4, 16, 32, 127.

3) Lowering LRU_MAX_DIRTY and LRU_MIN_DIRTY can ensure that there are sufficient clean buffers available when needed. You can see if this is the problem by looking at FG Writes on the onstat -F report. Foreground writes should be zero or a fraction of a percent of the total buffer writes. If not, you need to be flushing more frequently or you need more buffers; again, your buffer count looks fine.
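
Purely as an illustration of the knobs from points 1 and 2 (the numbers are examples, not tuned recommendations; point 3 is the LRU_MAX_DIRTY/LRU_MIN_DIRTY pair shown in 6.23), the onconfig entries involved look like this:

   BUFFERS         2000       # point 1: more buffers == fewer bufwaits
   LRUS            32         # point 2: more LRU queues; avoid 64 and 96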

On 29th August 2000 spamwillemsspam@soudal.com (Daniel Willems) wrote:-

Here is a correction for the script which measures bufwaits and readaheads


onstat -p | /usr/xpg4/bin/awk '

/[a-zA-Z]/{ 
 for (i=1; i <= NF ; i++ ) { 
               name[i]=$i ;
 }

 } 
/[0-9]/{ 
 for (i=1; i <= NF ; i++ ) { 
               content[name[i]]=$i ;
 }
 }
END { 
 ixdaRA = content["ixda-RA"];
 idxRA = content["ixd-RA"];
 daRA = content["da-RA"];
 RApgsused = content["RA-pgsused"];
 print "Read Utilization (UR): ",((RApgsused /( ixdaRA + idxRA + daRA) )*100),"%";

 bufwaits = content["bufwaits"];
 bufwrits = content["bufwrits"];
 pagreads = content["pagreads"];
 print "BufWaits Ratio(BR): ", ((bufwaits/(pagreads + bufwrits)) *100),"%";
 
 }
'

6.33 What status codes does onstat return?

On 21st May 1998 mdstock@informix.com (Mark D. Stock) wrote:-

As a point of interest, you can check the current mode of IDS by checking the status ($?) after onstat -.

The values returned by 'onstat -' are:
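
For example, a minimal monitoring sketch (the mapping of each status value to an engine mode should be taken from your release notes; the test against 5 for on-line mode below is an assumption to verify locally):

onstat - > /dev/null 2>&1
mode=$?
echo "onstat - exited with status $mode"

if [ $mode -ne 5 ]      # 5 = On-Line is an assumption; check your release notes
then
        echo "engine is not on-line (status $mode)"
fi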

6.34 Why does ontape dislike norewind devices?

On 11th Jun 1998 davek@summitdata.com (David Kosenko) wrote:-

It is primarily an issue with multiple-volume backups, and with restores. When ontape switches to writing the second volume of an archive set, it first reads the header from the tape to be sure that the tape was actually switched (i.e. it won't write the second volume over the first volume), then "rewinds" the tape. It does the "rewind" by closing the device, which, on a normal (i.e. not no-rewind) driver will cause the tape to rewind. If you use a norewind driver in this case, ontape will read a bit of the tape, corresponding to the header it would write. It would then close the device, reopen it and start writing the new header. With the norewind driver, that header would be one header's worth into the tape, rather than at the start. If you tried to restore from that tape, the restore would fail.

I don't recall offhand if the norewind is an issue with the first tape in a volume or not. It's been a while since I dove into the gritty details.

6.35 Any known issues with shared memory connections?

Yes: in your sqlhosts file, shared memory and TCP connections should not use the same service name.

On 16th Jun 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

Yes, as I said, there will be a noticeable increase in system calls per second if you do this. With only one CPU VP and one NET VP you may not be able to discern the effect, but we run 8 NET VPs and have shared memory listeners in all 28 CPU VPs that we run. In our case, not only is there a noticeable, and quantifiable, increase in system calls when the network and shared memory connections share a service, there is a noticeable slowdown in system responsiveness. This is true even though we have 4 out of 32 CPUs reserved for UNIX services. I'd call that a drawback!

On 16th Jun 1998 davek@summitdata.com (David Kosenko) wrote:-

In fact, the "servername" acting as a "placeholder" in the shared memory connection entry is used as the basis for a filename, /INFORMIXTMP/.inf.XXXXXX (where XXXXX is the servicename you specified for the shared memory connection) that is used by clients accessing the server via the shared memory connection (the file contains a count of the number of shared memory poll threads running, among other stuff I forget now). So you can only use that "servicename" for one shared memory connection on a single Unix box.

6.36 Any fast backup methods under IDS 7.3?

Yes, use 7.3's support for external backup methods.

On 16th Jun 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

This was intended for use as a VERY fast method, namely in combination with 7.30's support for multiple mirrors. You can have two mirrors defined for each chunk: take the engine to external backup mode, disable one set of mirrors, put the engine back online, back up the disabled mirrors offline, then re-enable the secondary mirror. Alternatively, you can just leave the additional mirrors offline and use them as the backup directly, then disable the other mirror to use as the backup and bring the original backup mirrors back online to be caught up. This is great for EXTREMELY large databases, as an "archive" only takes about 2 minutes, the time needed to disable one of the mirrors and re-enable the other, and if it can be scheduled at a time when there will be no updates there will be no effect on production.

6.37 Is IDS 7.3 more robust when crashes occur?

On 17th Jun 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

Version 7.3 adds another event handler that traps all of those fatal crashes that the event alarm handler could not handle. The problem was that the event alarms were triggered by the thread that crashed, and for certain kinds of crashes it could not trap itself. I beat them up about that for over a year, and the result is the monitoring thread in V7.30+ that watches for another thread to go down. In addition, 7.30 will usually not go down in these cases; only the one affected oninit will crash and the engine will stay up (a side effect of the monitor thread is that it can clean things up and determine whether it is safe to continue or whether the engine needs to be brought down). Also, if you have any version before 7.21 there were many trappable crashes that were simply not caught. The ultimate answer? Upgrade!

On 9th Jan 1999 djw@smooth1.demon.co.uk (David Williams) wrote:-

I believe Art is talking about the new ONCONFIG parameter SYSALARMPROGRAM.
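
As an illustration of how such a handler is wired up (the path and the argument list below are assumptions to check against your release notes, not something taken from Art's post), you point the onconfig parameter at an executable script:

   SYSALARMPROGRAM /usr/informix/etc/log_alarm.sh   # path is arbitrary

and a minimal /usr/informix/etc/log_alarm.sh might simply log what it is given:

#! /bin/sh
# Assumed ALARMPROGRAM-style arguments: severity, class id, class message,
# specific message, see-also file.  Verify against your version's documentation.
echo "`date`: SYSALARM: $*" >> /tmp/ids_alarms.log
exit 0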

6.38 Is raw disk faster than cooked files?

On 22nd Jun 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

...the safety issue of cooked files is no longer a problem. The big problem with cooked files is still performance. All writes and reads to/from cooked files MUST go through the UNIX buffer cache. This means an additional copy from the server's output buffer to the UNIX buffer page, then a synchronous write to disk. This is opposed to a write to a raw device, where the server's output buffer is written directly to disk without the intervening copy. This is just faster. Anyone who has written anything that can test this can attest to the difference in speed. Here are my test results:


FileType	Sync?	Times (real/user/system) 2run avg
---------------	-----	-----------------------------------------
Filesystem file	  N     14.40/3.70/2.52
		  Y	15.02/3.61/2.63
Cooked disk	  N	12.81/3.74/2.24
		  Y	13.42/3.84/2.43
Raw disk	  N	 9.32/3.67/1.52
		  Y	 9.40/3.66/1.44

From this you can clearly see the cost of cooked files and of synced cooked files. The tests were done with a version of my ul.ec utility modified to optionally open the output file O_SYNC. A cooked disk partition is almost 50% slower than a raw disk partition, and cooked filesystem files are almost 60% slower. The penalty for O_SYNC is an additional 5% for cooked files and negligible for RAW files (as expected). The test file was 2.85MB written using 4K I/O pages (the default fopen application buffer size), which should simulate Informix performance. The Cooked and Raw disk partition tests were conducted against a singleton 9GB Fast Wide SCSI II drive using the raw and cooked device files corresponding to the same drive.

6.39 Should online run as root or informix?

On 1st Jul 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

The ownership of the running oninit processes should be root if the engine is started from the system startup scripts, and informix if started by user informix from the command line. It really does not matter much in general, except that on different platforms there may be a requirement for the owner to be root in order for processor affinity, core dumping, NOAGE, etc to function properly. Since oninit is an SUID program its real and effective userids may differ if it is not started as root, which can cause problems with core dumping and with changing process status (such as aging and affinity) on certain systems.

6.40 Should online chunks be owned by root or informix?

On 1st Jul 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

DATABASE CHUNK FILES SHOULD ALWAYS BE OWNED BY USER INFORMIX, GROUP INFORMIX, AND PERMISSIONS 660! ALWAYS! PERIOD. On certain platforms, DG/UX among them, the volume managers create device files for logical devices on the fly at boot time, so the permissions of these files are reset to root root 666 on reboot. In this case the rc startup script for Informix, which must run after the volume manager script and after the filesystems are mounted, should modify the permissions on all active database chunk files before starting the engine. This can be done with a manually maintained chunk list or using the following loop:


for fil in `$INFORMIXDIR/bin/oncheck -pr|awk '/Chunk path/{print $3;}'` 
do
	echo "Fixing permissions for $fil"
	chown informix $fil
	chgrp informix $fil
	chmod 660 $fil
done

This will work for all versions of online (substituting tbcheck for 5.x) except for 5.06 which fails if you have an empty chunk slot in the chunk table page from having deleted a chunk.

6.41 Can archives hang the system waiting for a tape?

On 9th Jan 1999 djw@smooth1.demon.co.uk (David Williams) wrote:-

In the next section I believe Art is talking about how Online 5.x archives which wait for a tape change end up holding a latch, which stops checkpoints from completing and hence can hang the system.

On 1st Jul 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

In 5.xx the description that Jonathan Leffler and Scott Black give is just about right, and indeed if the tape needs changing or is hung the checkpoint will wait for the tape change to complete.

However, this was changed in 7.xx. Actually, the physical log pages are copied into a set of temp tables, one per dbspace being backed up, at each checkpoint, before the physical log is cleared. When the archive thread completes a dbspace it copies all of the preimage pages from the corresponding temp table out to the tape. This prevents the archive from stopping the checkpoint except momentarily if a tape change is needed, and is also why IDS 7.x uses DBSPACETEMP space during archives. This change is one main reason that 7.xx archives run much faster than 5.xx archives did.

6.42 Is it worth binding Online VPs myself?

On 18th Aug 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

There can be great benefit. Sometimes, on some systems (especially NUMA architecture systems), there are performance gains, above those achieved by the ONINIT affinity parameters, to be had by affining the VPs yourself. I use a set of scripts that run onstat -g glo to a file and scan the report for the pids of VPs, one VP class at a time, affining each class, round robin, across the available processors. Doing this, and switching back to AIO VPs, we achieved 50% better performance than with ONINIT affinity and KAIO threads, which Informix reports is 50% faster than ONINIT affinity and AIO VPs. One trick is to affine the AIO VPs to the CPUs in the opposite order to the CPU VPs (ie affine CPU VP #1 to CPU #1 but AIO VP #1 to CPU #12, etc). Since the lowest numbered VPs of each class do most of the work, this keeps the busiest CPU VPs from interfering with the busiest AIO VPs.

6.43 Subqueries are not working under IDS 7.3 - help?

On 9th Jan 1999 djw@smooth1.demon.co.uk (David Williams) wrote:-

IDS 7.3 includes a new feature called subquery flattening, where some subqueries are converted into joins for performance reasons. This functionality contains some bugs, so Informix introduced an environment variable, NO_SUBQF. Set this to 1 before starting Online to disable the subquery flattening feature.
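
For example, assuming the instance is started with oninit, set the variable in the environment that starts the engine:

NO_SUBQF=1
export NO_SUBQF
oninit          # subquery flattening is now disabled for this instance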

6.44 What is a .FCx release of Online?

On 21st Oct 1998 com@netinfo-moldova.com (Octav Chiriac) wrote:-

From the IDS 7.3 SQL documentation (I don't remember whether it is the Guide or the Reference):

dbinfo('version','os') returns the operating system identifier within the version string:

On 30th Oct 1998 tgirsch@iname.com (Thomas J. Girsch) wrote:-

The second letter goes:

Subsequent releases increment the letter to D, E, etc.

6.45 How many semaphores does online need?

On 28th Oct 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

The actual number of semaphores is equal to the number of possible shared memory connections and is allocated in semaphore groups (in my system the groups each contain 100 semaphores; the actual number per group depends on the kernel configuration). The engine will allocate as many maximum size groups as needed, plus a smaller group if needed, to complete the needs of each listening VP separately. For example, on my development machine I have:

NETTYPE ipcshm,8,250,CPU

That is 8 listeners with 250 connections each, for a total of 2000 possible connections. There are 24 semaphore groups allocated on the system (see ipcs -s): each of the 8 listeners has two 100-semaphore groups and one 50-semaphore group to make up the 250 needed.

You can configure the maximum available semaphores and the maximum size of each group in the kernel configuration files, often /etc/system, but that is system dependent. Be sure to allow for additional semaphores beyond your Informix requirements for other applications when you configure the kernel. It is best to set these values to AT LEAST the values recommended in the release notes for your platform, even if you need fewer semaphores, but don't be shy about increasing these values beyond the release notes numbers if you need them.
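
As a worked example of that arithmetic (a sketch only; 100 semaphores per group is whatever your kernel actually allows, as noted above):

# NETTYPE ipcshm,8,250,CPU with groups of at most 100 semaphores
LISTENERS=8 CONNS=250 GROUPSIZE=100
FULL=$(( CONNS / GROUPSIZE ))                   # 2 full groups per listener
PART=$(( CONNS % GROUPSIZE ))                   # plus one group of 50
NGROUPS=$(( LISTENERS * (FULL + (PART > 0 ? 1 : 0)) ))
echo "$NGROUPS semaphore groups, $(( LISTENERS * CONNS )) semaphores in total"
# -> 24 semaphore groups, 2000 semaphores in total (compare with ipcs -s)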

6.46 Does IDS support chunks more than 2Gb in size?

On 28th Oct 1998 june_t@hotmail.com (June Tong) wrote:-

IDS *CANNOT* handle chunks over 2GB in size, on a 2K pagesize machine, (4GB on a 4K pagesize platform) and to tell people that it will is irresponsible. Your customer is asking for trouble, and is going to end up with corruption that is completely irretrievable. Everything will be "almost fine" until they write to the next chunk (e.g. writing to chunk 4 if your 15GB chunk is chunk 3), and the new chunk REPLACES THE PAGES past 2GB in your big chunk, or your big chunk overwrites the pages in the next chunk. And then, your only option will be to restore from an archive taken BEFORE this overwriting took place, and THEN export all your data to ASCII format and re-create your entire instance with 2GB chunks.

(sigh, something else for the FAQ)

If IDS is letting you create chunks over 2GB (or 4GB on a 4K pagesize platform), then this is a MAJOR bug that someone out there ought to report and get fixed ASAP.

On 23rd October 2000 obnoxio@hotmail.com (Obnoxio The Clown) wrote:-

In 7.x and 9.2x, the chunk size limit is still 2GB. 8.3x has a platform dependent (but much larger) chunk size limit.

Rumour has it that 9.30 will have a bigger chunk size. Allegedly.

6.47 Why am I getting lots of busy waits?

On 26th Jan 1999 rdavis@rmi.net (Bob) wrote:-

This is probably not a big deal.

When a cpu vp has no work to do, it will go into a busy wait so that UNIX does not swap the process out. After roughly 1000 spins, the vp will check again. At this point, if there is still no work to be done, a semop is performed and UNIX swaps the cpu vp process out to the UNIX queue.

If your system is relatively inactive, this is nothing to worry about as your primary cpu vp is handling most of the work. If your system is loaded, however, this could be indicative of a lower level problem.

6.48 Why do checkpoints sometimes hang?

On 17th Mar 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

The checkpoint starts by issuing a block that stops new transactions beginning until the checkpoint completes, then it waits for all threads in critical section to free themselves and block waiting for the checkpoint thread. Only when all threads requiring critical section latches are blocked can the checkpoint continue. Here you are waiting for a thread in a, probably HUGE, rollback which is a critical section itself, before the checkpoint can start.

6.49 What is this mt.c mentioned in online crashes?

On 17th March 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

I think that mt.c is the source for the mainline of the monitor thread in the misc VP that watches to make sure the other VPs stay online.

6.50 Why does onstat sometimes fail with a changing data structure error?

On 15th July 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

Onstat reads dynamic data structures in shared memory without latching them, to avoid affecting online operations. Since many of these structures involve links and pointers and are in a constant state of flux, it is possible that while following a set of links onstat may find itself in a loop or at a dead-end. In this case it aborts gracefully, having given up any hope of continuing, with the very true message: "Changing data structure forced command termination." Running the onstat report again usually works. Also, this was more common in the earlier 7.2x versions, so upgrading may alleviate the problem somewhat.

6.51 What are the restrictions with Workgroup Server?

On 29th July 1999 murray@quanta.co.nz (Murray Wood) wrote:-

The Informix (Workgroup Server) box I received contains a small piece of paper with the following restrictions listed:

6.52 How can I tell how well Online buffering is working per table?

On 6th August 1999 martyn.hodgson@eaglestar.co.uk (Martyn Hodgson) wrote:-

I've been looking at trying to analyse our IDS buffer effectiveness, in terms of page reuse by object. The query:


select a.dbsname, a.tabname, round(sum(c.reusecnt) / count(*),2) as avg_reuse, 
count(*) as num_buffs
from systabnames a, sysptnext b, sysbufhdr c
where c.pagenum between b.pe_phys and 
b.pe_phys + b.pe_size - 1
and   a.partnum = b.pe_partnum
group by 1,2
order by 4,1,2

seems to give the number of pages of each object in the buffer, and the average number of times each buffer is reused. onstat -p simply gives this percentage for the whole of the buffer, not by object.
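
To try it, save the SELECT above in a file (the name here is arbitrary) and run it against the sysmaster database with dbaccess:

# save the query above as, say, buffer_reuse.sql, then:
dbaccess sysmaster buffer_reuse.sql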

6.53 What can I do if Online ends up hung?

On 30th September 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

Sounds like one or more VPs are hung.

Time to trash the engine and restart. Check onstat -R and onstat -l to make sure that all data and log buffers are safe on disk (if not, try onmode -c to force a checkpoint, but that is likely to hang as well). Run onstat -g glo and find the PID of the master oninit process. When you have found it (it will be VP #1, a CPU VP) do a kill -TERM followed by a kill -PIPE; the master oninit will exit. After a minute or two the admin VP will notice that the master VP is deceased and initiate a forced shutdown.

Now restart the engine. Fast recovery should work fine even if you could not force the checkpoint.

Obviously this is a last resort thing. OH, I have not tried this on 7.3x with the "stay online" feature enabled. If you have it enabled the engine may stay up after you do this instead of crashing. If it was the master VP that was hung you may now be OK to try a normal shutdown, otherwise you may have to kill -TERM, kill -PIPE all of the VPs.
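
A sketch of that last-resort sequence (commands only; the pid must be read by eye from the onstat -g glo report, since its layout varies between versions):

onstat -R ; onstat -l      # check that data and log buffers are safe on disk
onmode -c                  # try to force a checkpoint (may itself hang)

onstat -g glo              # note the pid of the master oninit (VP #1, class cpu)
MASTER_PID=12345           # <- substitute the pid you just found

kill -TERM $MASTER_PID
kill -PIPE $MASTER_PID
# after a minute or two the admin VP notices the master VP is gone and forces
# a shutdown; then restart with oninit and let fast recovery run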

6.54 How do I fix permission on Informix binaries?

On 18th August 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

Here are my scripts. The shell script must be run by root and it will in turn invoke the awk script. Put both scripts in your path and make them executable:

fixperms.sh:


#! /bin/sh
# fixperms.sh - must be run as root

cd $INFORMIXDIR
for lists in etc/*files; do
	fixperms.awk $lists | sh
done



fixperms.awk:


#! /usr/bin/awk -f
# Adjust above for location of your favorite version of awk

BEGIN {
	notyet=1;
}
notyet==1 {
	if ($1 == ".") {
		notyet=0;
	} else {
		next;
	}
}
{
	file=$1;
	owner=$2;
	group=$3;
	perms=$4;

	if (length( file ) > 0) {
		printf "chown %s %s\n", owner, file;
		printf "chgrp %s %s\n", group, file;
		printf "chmod %s %s\n", perms, file;
	}
}

6.55 How do I detect network errors with Online?

On 17th August 1999 y_dovgart@tci.ukrtel.net (Yuri Dovgart) wrote:-

Another problem can be in your network: monitor your connections through 'netstat -e' and look at the 'Errors' field. At the Informix level, use 'onstat -ntd' to see the accepted/rejected entries.

6.56 What if I get a mt_notifyvp time out error with Online?

On 17th December 1998 kagel@bloomberg.net (Art S. Kagel) wrote:-

There are several bugs that cause an mt_notifyvp time out. This indicates that the admin VP, which monitors the other VPs, has determined that one of the VPs has not responded to its current job, or heartbeated its continued busy status, in too long, so the engine assumes that that VP is hung or crashed and tries to shut itself down. Sometimes that cannot happen because things are just too badly hosed; other times the engine comes down but the master oninit is the one that is hung, and since it is also the one that normally removes shared memory, the memory stays around and onstat can query it, but as the onstat -d messages indicated, it cannot get a response from the oninit processes. Running onmode -ky will usually bring down any remaining oninits and the shared memory.

6.57 How do I tune buffer waits?

On 14th October 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

BUFWAITS can have three causes:

To determine if your BUFWAITS is reasonable given your load (some waits are unavoidable as many processes may just need the same buffers at the same time and so will have to wait) calculate the BUFWAIT Ratio (BR):

BR = (bufwaits / (dskreads + bufwrits)) * 100.00

A reasonable BR value is under 10% with the ideal at 7% or less. Anything over 10% is usually death.
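
The ratio can be pulled straight out of onstat -p with a little awk, in the same spirit as the script in 6.32 (a sketch only; it assumes the field names bufwaits, dskreads and bufwrits appear in the onstat -p header rows, and it reads those headers rather than hard-coding column positions):

onstat -p | awk '
/[a-zA-Z]/ { for (i = 1; i <= NF; i++) name[i] = $i }            # header rows
/^[0-9. ]+$/ { for (i = 1; i <= NF; i++) val[name[i]] = $i }     # data rows
END {
        den = val["dskreads"] + val["bufwrits"]
        if (den > 0)
                printf("BUFWAIT ratio BR = %.2f%%\n", (val["bufwaits"] / den) * 100.0)
}'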

6.58 Why should I not use RAID 5?

On 1st December 1999 kagel@bloomberg.net (Art S. Kagel) wrote:-

There are two problems with RAID5. The first is performance, which is the one most people notice, and if you can live with write throughput which is 50% of the equivalent RAID0 stripe set then that is fine. The performance hit is caused because RAID5 ONLY reads the one drive containing the requested sector, leaving the other drives free to return other sectors from different stripe blocks. This is the reason that RAID5 is preferred to RAID3 or RAID4 for filesystems; this feature improves small random read performance. However, since the parity and the balance of the stripe block were not read, if you rewrite the block (which databases do far more frequently than filesystems) the other drives must all be read, a new parity calculated, and then both the modified block and the parity block must be written back to disk. This READ-WRITE-READ-WRITE for each modified block is the reason RAID5 is so poor in terms of write throughput. Large RAID controller caches and controller-firmware-level RAID implementations alleviate the problem somewhat, but not completely, and write performance still hovers at around half what a pure stripe (RAID0) would get.

The second problem, despite what others have said, IS a FUNDAMENTAL problem with the design of RAID5, which various implementors have tried to correct with varying levels of success. The problem is that if a drive fails slowly over time, known as partial media failure, where periodically a sector or two goes bad, this is NOT detected by RAID5's parity, and so is propagated to the parity when that sector is rewritten. This means that if another drive fails catastrophically, its data will be rebuilt utilizing damaged parity, resulting in two sectors with garbage. Now this may not even be noticed for a long time, as modern SCSI drives automatically remap bad sectors to a set of sectors set aside for the purpose, but the corrected error is NOT reported to the OS or the administrators. Over time, if the drive is going, it will run out of remap sectors and will have to begin returning data reconstructed from the drive's own ECC codes.

Eventually the damage will exceed the ECC's ability to rebuild a single bit error per byte and will return garbage.

RAID3 and RAID4 are superior in both areas. In both, all drives are read for any block, which improves sequential read performance (Informix Read Ahead depends on sequential read performance) over RAID5, and parity can be (and in most implementations IS) checked at read time, so partial media failure problems can be detected. Write performance is approximately the same as RAID0 for large writes or smaller stripe block sizes. One problem with early implementations of RAID3/4 was slow parity checking, since it has to be calculated for every read and every write. Modern controller based RAID systems use the on-board processor on the SCSI controller to perform the parity checks without impacting system performance by tying up a system CPU to check and produce parity. These RAID levels require exactly the same number of drives as RAID5.

RAID10 provides the best protection and performance, with read performance exceeding any other RAID level (since both drives of a mirrored pair can be reading different sectors in parallel) and write performance closest to pure striping. Indeed, in a hardware/firmware implemented RAID10 array with on-board cache, apparent write throughput can exceed RAID0 for brief periods, due to the two drives of each pair being written to independently, though the gain is not sustainable over time.

A third problem with ALL of RAID3/4/5, from which RAID10 does not suffer, is multiple drive failure. (Ever get a batch of 200 bad drives? We have!) If one drive in a RAID3/4/5 array fails catastrophically, you are at risk for complete data loss if ANY of the remaining 4 (or more) drives should fail before the original failed drive can be replaced and rebuilt. With RAID10, since it is made up as a stripe set of N mirrored pairs, when a drive fails you are only at risk for complete data loss if that one drive's particular mirror partner should fail. Make each mirrored pair from drives selected from different manufacturers' lots and the probability of this happening becomes vanishingly small.

Fourth problem. During drive rebuild in RAID3/4/5 (and RAID01 mirrored stripe sets), the performance of the array can degrade by as much as 80%! Some RAID systems let you tune the relative priority of rebuild versus production to reduce the performance hit to as low as about 40% degradation, but this will increase the recovery time, increasing the number of production requests that are degraded and increasing the risk of the previous problem with a second drive failure. With RAID10, since only one drive is involved in mirror recovery, the array's performance (for a 4 drive array) is degraded by at most 80% only for reads and writes against the failed pair, and only slightly (due to controller traffic) for accesses to the other drives; on average, since the one pair comprises only 20% of accesses, performance is affected no more than 16% during recovery, and the risk of catastrophic data loss is reduced.

On 31st January 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

Editor's note RAID 0+1 is striping followed by mirroring and RAID 1+0 (RAID 10) is mirroring followed by striping

The problem is that unless you are using hardware RAID01, the mirror software/firmware does not know it is mirroring a stripe set and so can only recover the entire logical drive that has failed from its mirror. The logical drives being mirrored are the entire stripe sets, so that is the unit of recovery. IFF you are using a hardware RAID01, or I suppose a sophisticated software-only RAID01, then it is POSSIBLE, though not likely, for the firmware to have enough knowledge of the layout to realize that only one drive of the stripe set is down and to recover only that one drive. HOWEVER, in this case you do not really have RAID01: since each drive in the stripe sets is mirrored individually, what you have, functionally, is RAID10 with the unfortunate name of RAID01!

There is clearly confusion about these two pseudo RAID levels, mostly because they are not officially defined levels but levels that developed in the field when users, and later vendors, combined striping and mirroring to gain the advantages of both and ended up creating these new levels. Both the mirror over stripe and stripe over mirror approaches came to be known as RAID 0+1 and it became rather confusing. About two years ago, to help clear things up, I proposed that we separate the two approaches by calling one RAID01 and the other RAID10 based on the order in which the stripe or mirror were applied, so RAID01 refers to stripe then mirror and RAID10 to mirror then stripe. Some of my messages were in response to cross-posts with RAID related newsgroups so the proposal received some wider exposure than just Informix folk.

Unfortunately there is little discussion about the differences between RAID01 and RAID10 anywhere else than here though from discussions I have had with two controller manufacturers the RAID01/RAID10 as described has become a pretty standard way of referring to the stripe/mirror combinations.

On 19th May 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

RAID10 for data, and RAID1 or (separate) RAID10 for logs and the rootdbs. Backups can go anywhere, since they should be swept to tape immediately, or at least frequently, if written to disk. If you do not sweep them immediately, consider rcp'ing them to another system on the network immediately for safety.

Considering that write performance of RAID5 is 50% that of RAID0, RAID3 or RAID4, and that RAID10 does a bit better than straight RAID0, why even consider RAID5 for the MOST I/O intensive part of the server, the logical and physical (at least in 7.xx) log dbspaces? One could make the case that the RAID5 safety issue is moot for the logs, since the physical log is transitory and the logical logs are constantly backed up (you are using one of the continuous log backup options, I hope), and that for a server with a low transaction rate write performance is not an issue either; but you are describing a server with a very high transaction rate initially and even more down the road! Logical log write performance is CRITICAL to server performance in such a system!

On 13th June 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

OK, I will deal with why RAID10 (and by extension RAID1) is not as much at risk. Remember that the problem is not the partial media failure trashing the data on the one drive that is progressively failing over time, but that good data will be read from another stripe member (call it drive #1) for the stripe block containing damaged data on the failing drive (call it drive #2) and modified, causing a read of the remaining stripe members and the calculation of new parity, after which the modified block will be written back to drive 1 and the parity back to that stripe's parity drive (perhaps drive #3). Since the data from drive 2 was not needed by the application (here IDS), the fact that it was damaged was not detected, and so the new parity was calculated using garbage, resulting in a parity that can ONLY be used accurately to recreate the trashed block on drive 2. Now suppose that drive #4 suffers a catastrophic failure and has to be replaced. The damaged drive has continued to fail and now returns a different pattern of bits than was used to calculate the trashed parity block on drive 3. Now when the missing block data on the drive 4 replacement is calculated it too will become garbage, and two disk blocks are now unusable; the damage has propagated.

That dealt with what happens if the bad block was NOT read directly and detected by IDS. If it was, IDS will mark the chunk OFFLINE and refuse to use it until you repair the damage. The only way you can do that is to restore from backup, or to remove the partially damaged drive and try to rebuild it from the parity as if it had completely failed. HOWEVER, if all, or even several, of the drives in that array are from the same manufacturing lot, or are even of similar age, there is a good chance that the previous problem has already trashed the parity of other blocks, so you may well be reconstructing a new drive that has more bad data blocks than the one it replaces.

With RAID10, each drive in each mirrored pair is written independently. If a block on drive 1a is trashed, the data on drive 1b (its mirror) is fine. If the bad data is read from the drive that is failing (say 1a), the engine will recognize it and mark the chunk down. All you have to do is remove drive 1a and mark the chunk back online, rebuilding the mirror online. No problems, and there is less chance that there are other damaged blocks on the one remaining mirror than on any of the 5 or more drives in a RAID5 stripe.

If the data is NOT read from 1a but from 1b and modified, it will be rewritten to BOTH drives, improving the chances that it will be correctly readable if read from the failing 1a next time, simply because the flux changes will have been renewed; if the platter is too far gone we are just back to the possibility that the bad block will be read and flagged by IDS later. In no case can the data on 2a/2b, 3a/3b, 4a/4b, etc be damaged. Yes, if we were talking about ANY old data file on RAID10 the damage might propagate, but since IDS has its own methods for detecting bad data reads this probability is vanishingly small (the damage to the block would have to NOT alter the contents of the first 28 bytes or the last 4 to 1020 bytes, thus not damaging the page header, page trailer, or the slot table, to go undetected by IDS).

On 30th June 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

RAID5 and any striping scheme (RAID0, RAID3, RAID4, or RAID10) will give you some of the load balancing advantages of fragmentation, but not the parallelism that fragmentation together with PDQ can give you, and not the fragment elimination advantage either.

RAID5 gives 1/2 the write performance of RAID0 (striping alone), and RAID10 gives a slight write performance increase over RAID0, so RAID5 is less than 1/2 the write performance of RAID10. In addition, DG/EMC Clariions have an excellent RAID10 implementation which will outperform Informix mirroring by a noticeable margin, and you will not have to waste the parity drives. In your case, using RAID10 instead of RAID5 + mirror (i.e. RAID1 on top of RAID5 - what is that, RAID51?) will actually cost you less or give you more storage for the same number of drives!

In addition, there is a PROBLEM with using Clariion RAID5 with Informix. Though DG/EMC deny it (I have seen it happen for years), if there is a RAID5 error, i.e. a lost drive, Informix will receive I/O errors for several seconds when the parity data reconstruction first kicks in, and will mark your chunks offline even though the RAID5 is correcting the problem and returning good data. This does not happen with Clariion RAID10 when a drive fails!

Use Clariion RAID10 and no Informix mirrors!

Yes - if you want to take advantage of parallel searching you MUST use Informix fragmentation and fragment the table across multiple dbspaces.
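
Note: purely as a hedged sketch of that point (the database, table, column and dbspace names are invented), a round-robin fragmented table can be created like this, and PDQ must also be enabled for the session to get parallel scans:

# create a table fragmented across three dbspaces (names are examples only)
dbaccess mydb - <<EOF
create table orders
(
  order_num  serial,
  order_date date
)
fragment by round robin in dbspace1, dbspace2, dbspace3;
EOF

# enable PDQ for the querying session, e.g. from the shell:
export PDQPRIORITY=50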

6.59 Are indexes bigger under 9.2?

On 11th September 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

Yes

6.60 Why is Online 9.14 so slow?

On 25th August 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

AHAH! This could be the whole problem. IDS/US 9.1x had MAJOR performance problems on anything even remotely OLTP-ish. Definitely upgrade to IDS v9.21, and look to eventually upgrade to 9.30 when it comes out and has had time to stabilize. 9.1 was based on the VERY OLD 7.0 codebase, while 9.21 is based on the 7.31 codebase and is MUCH faster, with FULL OLTP support.

6.61 How do I configure DD_HASHMAX and DD_HASHSIZE?

On 13th September 2000 djw@smooth1.demon.co.uk (David Williams) wrote:-

NOTE: These previously undocumented parameters under 7.x are now documented in the IDS 2000 manuals!!

On 31st August 2000 andy@kontron.demon.co.uk (Andy Lennard) wrote:-

In case it may be useful (?) here's a tacky script that I've used to work out what my DD_HASHSIZE and DD_HASHMAX should be set to.

It may save a bit of trial-and-error.


#! /usr/bin/ksh

# Attempt to work out 'optimum' values for DD_HASHSIZE and DD_HASHMAX
#
# DD_HASHSIZE sets the number of slots in the dictionary
# DD_HASHMAX sets the max number of table names allowed in each slot
#
# for efficient searching of the dictionary the max number of tablenames
# present in a given slot should be small
#
#
# A Lennard 20-sep-1999
#

if [[ $# -lt 1 ]]; then
   echo Usage: $0 database [database...]
   exit 1
fi

# make a list of all the databases we need to consider
#  remember to include sysmaster too...
db_list=\"sysmaster\"
while [ $# -gt 0 ]
do
  db_list=$db_list,\"$1\"
  shift
done

# all tables get put into the dictionary, not just user ones
dbaccess sysmaster - <<EOF
output to temp$$ without headings
select tabname
from systabnames
where dbsname in ($db_list)
EOF

nawk '
BEGIN {
  
  # make an array for numeric representation of ascii
  for (i = 0; i < 128; i++) {
    c = sprintf("%c", i);
    char[c] = i;
  }

  # initialise a counter for the number of tables in the database
  n_tables = 0;

}
{
  # skip null lines
  if (length( $1 ) == 0) { next; }
}
{
  # on non-blank lines...

  # save the table name
  tabname[n_tables] = $1;

  # evaluate the checksum of the characters of the table name
  s=0;
  for (i = 1; i <= length($1); i++) {
    s += char[substr($1, i, 1)];
  }
  sum[n_tables] = s;

  n_tables++;
}

# awk has no local variable declarations; locals are declared as extra parameters
function isprime(i, j) {
  for (j = 2; j < i/2; j++) {
    if (i%j == 0) return 0;
  }
  return 1;
}

END {

  printf("%d tablenames found\n", n_tables);

  # look through the checksums to find the max checksum
  max_sum = 0;
  for (i = 0; i< n_tables; i++) {
    if (sum[i] > max_sum) {
      max_sum = sum[i];
    }
  }

  # then use this max checksum as a top limit when going through finding
  # out how many table names have an identical checksum

#  printf(" checksum            tablename\n --------            --------
-\n");
  max_same_sum = 0;
  for (j = 0; j<= max_sum; j++) {
    same_sum = 0;
    for (i = 0; i< n_tables; i++) {
      if (sum[i] == j) {
#        printf(" %8d %20s\n", sum[i], tabname[i]);
        same_sum ++;
      }
    }
    if (same_sum > max_same_sum) {
      max_same_sum = same_sum;
    }
  }

  printf("Some table names have the same checksum,\n");
  printf(" the theoretical minimum value of DD_HASHMAX is %d\n\n",
max_same_sum);

  printf(" DD_HASHSIZE     Max names\n")
  printf("                 per slot\n")

  # Now it is reckoned that DD_HASHSIZE should be a prime number
  # so loop over the prime numbers, ignoring the really small ones
  for (prime = 17; prime < 1000; prime += 2 ) {

    if (!isprime(prime)) {
      continue;
    }

    # initialise the slots that the tablenames would be stored in
    for (j = 0; j < prime; j++) {
      slot[j] = 0;
    }

    # for each table..
    for (j = 0; j < n_tables; j++) {

      # work out which slot the table name will hash to
      this_slot = sum[j] % prime;

      # increment the number of table names that would be stored in this slot
      slot[this_slot]++;

#      printf "Table %s will go into slot %d\n", tabname[j], this_slot;

    }

    # now see which slot has the greatest number of names in it
    max_names = 0;
    which_slot = 0;
    for (j = 0; j < prime; j++) {

#      printf "Slot %d has %d names\n", j, slot[j]

      # does this slot contain more names than the greatest seen so far?
      if (slot[j] > max_names ) {
         max_names = slot[j];
         which_slot = j;
      }
    }

#    printf " slot %d has the max number of names (%d)\n\n", which_slot,
max_names

    if (max_names == max_same_sum) {
      printf("%8d         %5d  -- matches theoretical minimum
DD_HASHMAX\n", prime, max_names);
    } else {
      printf("%8d         %5d\n", prime, max_names);
    }
  }
}
' temp$$

rm temp$$
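
Note: once the script has suggested a slot count, the two parameters go straight into the onconfig file; the values below are purely illustrative (use the prime the script reports as keeping names-per-slot near the theoretical minimum, and set DD_HASHMAX at or a little above that minimum). As far as I know they are read at shared memory initialisation, so the engine must be restarted for a change to take effect:

   DD_HASHSIZE     101        # number of hash slots in the dictionary cache (example)
   DD_HASHMAX      10         # max table names allowed per slot (example)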


6.62 How does Online page size vary across platforms?

On 13th September 2000 heiko.giesselmann@informix.com (Heiko Giesselmann) wrote:-

Page size is not related to the word width of a specific version, i.e. on a given platform 32 bit and 64 bit versions will have the same page size (otherwise you would have to convert the page format of the complete instance disk space going from a 32 bit to a 64 bit version).

As far as I know the only platforms with 4K pages are NT and AIX (again, regardless of 32 bit or 64 bit versions).
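
Note: a quick, hedged way to confirm the page size on a given port is to look at the buffer size reported on the last line of the onstat -b output, since buffers are one page in size (the figures below are only an example):

   onstat -b | tail -1
   # e.g. "... 2000 total, 2048 hash buckets, 2048 buffer size"
   #      a 2048 byte buffer size means a 2K page on this port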

6.63 What should I check when Online hangs?

On 14th September 2000 murray@quanta.co.nz (Murray Wood) wrote:-

Check the OS is still running.
Check the Informix message log.
Check the status of the oninit processes - are they accumulating CPU time?
Check whether the onstat -u output is changing (a sketch of these and related onstat checks follows after the lists below).

On 14th September 2000 cobravert99@hotmail.com () wrote:-

A few other things to check....

   Make sure your instance isn't rolling back from a long transaction.

   Check to see if you are in a deadlock situation.

   Use onstat -g lmx (on 9.x anyway) to make sure you aren't running into a mutex lock bug.

   If you have shared memory dumps turned on and run into an Assert Fail, it could take a while for the dump to finish.

   Make sure that NETTYPE is high enough for the number of users accessing the database (it can sometimes seem hung).

   Check to see if AFDEBUG is set to 1.

On 15th October 2000 bryce@hrnz.co.nz (Bryce Stenberg) wrote:-

1  Check that the logical logs are not full.
2  Check the LRUs and the page cleaners.
3  Increase the number of open files.
4  Check the status of the chunks.
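
Note: purely as a hedged illustration of the checks suggested above, the usual onstat incantations are roughly:

   onstat -m        # tail of the message log
   onstat -u        # user threads; run it twice - if nothing changes, suspect a hang
   onstat -l        # logical logs - are they all full or awaiting backup?
   onstat -F        # page cleaner activity
   onstat -d        # dbspace and chunk status (look for chunks marked down)
   onstat -g lmx    # mutexes being waited on (9.x)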

6.64 Is IDS affected by Windows service packs?

On 3rd September 2000 andy@kontron.demon.co.uk (Andy Lennard) wrote:-

I was talking with our Informix help support people last week about this matter - as in what service pack versions are certified to what Informix products.

If I remember correctly (and I was referring to the product we use here - Informix Dynamic Server - Workgroup Edition for NT):

Service packs are cumulative - later SPs include the updates of earlier SPs - so you don't have to apply them sequentially.

Also note that if you are running a Windows NT network with some servers on SP3 and others on SP4 or later, you will have problems: in SP4 Microsoft changed the way security is handled, causing synchronisation problems within the domain - see M$ Knowledge Base article Q197488.

6.65 When does onbar read the onconfig file?

Can anyone tell me whether a modification to the BAR_MAX_BACKUP parameter in the onconfig file requires the engine to be "bounced" for the change to take effect? The version is 7.31.UC5.

On 31st October 2000 kernoal.stephens@AUTOZONE.COM () wrote:-

No, the instance does not need to be bounced. onbar reads the onconfig file when it runs.
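
Note: as a hedged illustration (the value is only an example), that means you can simply edit the parameter and start the next backup:

   # onconfig: allow, say, 4 parallel onbar processes
   BAR_MAX_BACKUP  4

   # the next backup picks up the new value without a restart
   onbar -b -L 0              # level 0 backup of all storage spaces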

6.66 Why are archives so slow and do they use the buffer cache?

On 23rd October 2000 kagel@bloomberg.net (Art S. Kagel) wrote:-

Improving your cache ratio will not help the archiving speed, since the onarchive thread which does the actual page reading for ontape, onarchive, and onbar reads the physical disk into a small set of private buffers, to avoid causing the buffer cache to thrash and affecting performance for user threads. Sorry. How many chunks are defined? Is this a long-existing instance that has been upgraded, over time, to 7.31FC6? You may be running into the page timestamp bug, which is not fixed until the C8 maintenance release (though you can ask for a back-port patch to FC6 if you have maintenance; it is a backward-compatible fix).

The problem is that there are pages with timestamps so old that the timestamp values are wrapping to negative, so to prevent them from wrapping again and overtaking the old pages the engine restamps the oldest pages during each archive. Unfortunately it seems that it stops archiving to do this, which lets the tape stop and then have to be repositioned and spun up to speed again. The patch apparently just gathers the page numbers that need restamping and batches the updates after the archive is complete, or assigns a separate thread to do the job.

6.67 How does NETTYPE work and how do I tune it?

On 20th October 2000 heiko.giesselmann@informix.com (Heiko Giesselmann) wrote:-

NETTYPE establishes an actual limit for shared memory connections only. In the shared memory case the nettype settings basically define the number of 'memory slots' that clients can use to pass messages to the database server.

For all other connection types it is rather a tuning parameter that helps the engine estimate how much memory it should set aside for networking purposes. If there is a performance problem it will show up as non-zero values in the 'q-exceed' columns of the 'global network information' section of the 'onstat -g ntt' output.

Versions 9.21 and later include an 'alloc/max' column in the 'onstat -g ntt' output that lets you check how many network buffers are currently allocated and the maximum number of buffers used. These values give you an indication of what has to be configured for NETTYPE. More information on this topic can be found in the MaxConnect manual.
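
Note: as a hedged illustration of the syntax (the numbers are examples, not recommendations), each NETTYPE entry gives the protocol, the number of poll threads, the connections per poll thread, and the VP class the poll threads run on:

   NETTYPE         ipcshm,1,50,CPU    # shared memory: 1 poll thread, 50 connections
   NETTYPE         soctcp,2,100,NET   # TCP sockets: 2 poll threads, 100 connections each

   # then watch for pressure in the q-exceed and alloc/max columns:
   onstat -g ntt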