Ergem PEKER 's Oracle Blog: OOW Notes - High Volume Text Indexing

Wednesday, September 25, 2013

OOW Notes - High Volume Text Indexing

Tuesday 12:00 - HighVolume text indexing with Oracle Text - Roger Ford

create index <index_name> on <table_name> (column_name) index type is ctxsys.context;

select from table_name
where
contains (product_name, 'monitor NEAR full hd', 99) > 0
order by score(99) desc;

index types:

context (primary)
ctxcat
ctxrule
Xquery Full Text

context index is inverted
consists list of words each with an associated posting list
posting lists can be very long (requires IO)
are stored in $I table
SGA works as is on this object always pin this table to SAG or flash cache if possible

Common Pitfalls

Memory

when indexing, postings lists are built in memory then flushed to disk when memory is exhausted
frequent flushes = small fragments

in memory controlled by MEMORY parameter

create index index_name
on table_name(col)
index type ctxsys.context parameters ('memory 1G');

exec ctxsys.ctx_adm.set_parameter('MAX_INDEX_MEMORY','10G');

Substring index / UTF8

context indexes are token (word) based

using leading wildcards are slow

need to turn on SUBSTRING_INDEX wordlist option

increases index build time

: 5x long
: %20 long

AL32UTF8 has significant indexing overhead (consider single byte character set)

Frequent syncs

context indexes are asynchronous
sync(on commit) will cause index fragmentation
can be cured by ctx_ddl.optimize_index

transactional will require unindexed documents to be scanned by all queries
use transactional only with sync(on commit)

if syncing frequently, optimize frequently
use 12c feature (near real time indexes)

Big IO

previously $I index were stored inline BLOBs
new storage attribute BIG_IO uses secure file lobs

Seperate offsets

posting lists consists of document Ids and word position in the document

some queries does not need word position

new separate_offset storage attribute will hugely reduce IO for these queries

will make proximity (NEAR), phase and zone section searcher will be a bit slower

Near Real Time Indexes

new storage attribute STAGE_ITAB

creates two level index
small enough to keep in memory
large main index stored in the disk
updates are moved from staging index to main index and optimized as they move
requires BIG_IO storage parameter in order to use

Query Filter Cache

some expressions are common to many queries

security restriction: .. “(sales or marketing) WITHIN dept”

postings entries for these queries may be very large

postings will likely be cached in sga but still need to be decompressed and merges

query filter cache, caches the results of subquery in memory

query operator ctxFilterCache(subquery)

Pattern stop class

specify regular expression exclusion list for index

large indexes can be %75 junk

reference codes
sequential/numeric data
wrongly indexed binary data

example:

exclude all words longer than 20 chars
exclude all words which are mixed alpha and numeric

Partitioning

partitioned index can be created on partitioned base tables
use local keyword for indexes
currently only range partitioning is supported
partitioned indexes looks like separate indexes on partitions

better query performance on subset
less interference between indexing and querying
caching of more common target positions

Parallelization

parallel index creation is supported by using parallel n clause

can mix local and parallel but max 1 process per partition (db restriction)

to do intra-partition parallelization

create index using unusable keyword
run dbms_pclxutil.build_part_index and specify degrees on parallelization

index is unavailable until complete

no option for incremental builds

manual partition builds

create index using nopopulate parameter

build individual partitions using

alter index index_name rebuild partition partn parallel n;

allows complete control over parallelization

index is usable as soon as one operation is built

unlike unusable option; can build incremental changes by loading new partition and rebuilding

no sync required for non changing data

faster use partition swapping

distributed index builds

indexes can be built on farm of machines
partitioned base table on master with local indexes
non partitioned table on remote instance
build index on remote table, then transfer to master using transportable tablespace

swap in index using exchange partitioning

alter table table_name

exchange partition main_table_p10

with table remote_table

including indexes

without validation;

almost instant ($K IOT must be copied)
new data immediately searchable

Ergem PEKER 's Oracle Blog

Pages

Wednesday, September 25, 2013

OOW Notes - High Volume Text Indexing

No comments:

Post a Comment

About Me

Search This Blog

My Documents

Blog Archive (monthly)

group by (subject)

OCP 10g

RAC Expert

OCP 11g

My Blog List

mostly visited links

Total Pageviews

warning