Wednesday, September 25, 2013

OOW Notes - High Volume Text Indexing

Tuesday 12:00 - HighVolume text indexing with Oracle Text - Roger Ford


create index <index_name> on <table_name> (column_name) index type is ctxsys.context;


select from table_name
where
contains (product_name, 'monitor NEAR full hd', 99) > 0
order by score(99) desc;


index types:


  • context (primary)
  • ctxcat
  • ctxrule
  • Xquery Full Text


context index is inverted
consists list of words each with an associated posting list
posting lists can be very long (requires IO)
are stored in $I table
SGA works as is on this object always pin this table to SAG or flash cache if possible


Common Pitfalls


Memory


  • when indexing, postings lists are built in memory then flushed to disk when memory is exhausted
  • frequent flushes = small fragments


in memory controlled by MEMORY parameter


create index index_name
on table_name(col)
index type ctxsys.context parameters ('memory 1G');


exec ctxsys.ctx_adm.set_parameter('MAX_INDEX_MEMORY','10G');


Substring index / UTF8


context indexes are token (word) based


using leading wildcards are slow


need to turn on SUBSTRING_INDEX wordlist option


increases index build time
    1. : 5x long
    2. : %20 long


AL32UTF8 has significant indexing overhead (consider single byte character set)


Frequent syncs


  • context indexes are asynchronous
  • sync(on commit) will cause index fragmentation
  • can be cured by ctx_ddl.optimize_index


transactional will require unindexed documents to be scanned by all queries
use transactional only with sync(on commit)


if syncing frequently, optimize frequently
use 12c feature (near real time indexes)


Big IO

  • previously $I index were stored inline BLOBs
  • new storage attribute BIG_IO uses secure file lobs

Seperate offsets

posting lists consists of document Ids and word position in the document

some queries does not need word position

new separate_offset storage attribute will hugely reduce IO for these queries
will make proximity (NEAR), phase and zone section searcher will be a bit slower

Near Real Time Indexes

new storage attribute STAGE_ITAB

  • creates two level index
  • small enough to keep in memory
  • large main index stored in the disk
  • updates are moved from staging index to main index and optimized as they move
  • requires BIG_IO storage parameter in order to use

Query Filter Cache

some expressions are common to many queries
security restriction: .. “(sales or marketing) WITHIN dept”
postings entries for these queries may be very large
postings will likely be cached in sga but still need to be decompressed and merges

query filter cache, caches the results of subquery in memory
query operator ctxFilterCache(subquery)

Pattern stop class

specify regular expression exclusion list for index
large indexes can be %75 junk

  • reference codes
  • sequential/numeric data
  • wrongly indexed binary data

example:
  • exclude all words longer than 20 chars
  • exclude all words which are mixed alpha and numeric

Partitioning

  • partitioned index can be created on partitioned base tables
  • use local keyword for indexes
  • currently only range partitioning is supported
  • partitioned indexes looks like separate indexes on partitions

  • better query performance on subset
  • less interference between indexing and querying
  • caching of more common target positions

Parallelization

parallel index creation is supported by using parallel n clause
can mix local and parallel but max 1 process per partition (db restriction)

to do intra-partition parallelization

  • create index using unusable keyword
  • run dbms_pclxutil.build_part_index and specify degrees on parallelization

index is unavailable until complete
no option for incremental builds

manual partition builds

create index using nopopulate parameter
build individual partitions using

alter index index_name rebuild partition partn parallel n;

allows complete control over parallelization
index is usable as soon as one operation is built

unlike unusable option; can build incremental changes by loading new partition and rebuilding

no sync required for non changing data
faster use partition swapping

distributed index builds

  • indexes can be built on farm of machines
  • partitioned base table on master with local indexes
  • non partitioned table on remote instance
  • build index on remote table, then transfer to master using transportable tablespace

swap in index using exchange partitioning
alter table table_name
exchange partition main_table_p10
with table remote_table
including indexes
without validation;


  • almost instant ($K IOT must be copied)
  • new data immediately searchable

No comments:

Post a Comment