Wednesday, December 31, 2008

Searching over structured and unstructured data in 11G

For those of you who have to deal with searches involving both structured and unstructured data, there is a small neat feature introduced to Oracle Text Context indexes in Oracle 11G which you may find to be very useful.

The example

Let's say you have a books table like this:
SQL> create table books
2 (
3 author varchar2(30),
4 title varchar2(30),
5 published date,
6 text clob
7 ) lob (text) store as securefile
8 (
9 enable storage in row
10 cache
11 );

Table created.
I'm going to populate it using some test data:
SQL> declare
2 l_clob clob;
3 begin
4 for obj_cur in (
5 select owner, object_name, created
6 from dba_objects
7 where object_type in ('PROCEDURE','PACKAGE','PACKAGE BOD
Y','FUNCTION','TRIGGER')
8 ) loop
9 insert into books values (obj_cur.owner, obj_cur.object_name, ob
j_cur.created, empty_clob())
10 returning text into l_clob;
11
12 dbms_lob.open(l_clob, dbms_lob.lob_readwrite);
13
14 for text_cur in (
15 select text
16 from dba_source
17 where owner=obj_cur.owner
18 and name=obj_cur.object_name
19 order by type, line
20 ) loop
21 dbms_lob.writeappend(l_clob, dbms_lob.getlength(text_cur
.text), text_cur.text);
22 end loop;
23
24 dbms_lob.close(l_clob);
25 end loop;
26 end;
27 /

PL/SQL procedure successfully completed.

SQL> commit;

Commit complete.
What I have basically done there is dumped the source text of some of my database objects into books table. Object owner goes as an author, object_name represents a title, object creation date goes as published and, of course, the source itself goes as book's text.

The search

Let's say I'm interested in all the books which write about procedures. To do that kind of search using Oracle Text is fairly straightforward.

First of all, we want to make sure that underscore symbol goes as part of a token. It is not really required for our example, however, it is generally a good idea to do since we want stuff like DBMS_OUTPUT or DBMS_SQL appear as a single tokens instead of being split in half due to underscore symbol treated as a special character:
SQL> begin
2 ctx_ddl.create_preference('books_lexer', 'BASIC_LEXER');
3 ctx_ddl.set_attribute('books_lexer', 'printjoins', '_');
4 end;
5 /

PL/SQL procedure successfully completed.
All I have to do now is to create my index:
SQL> CREATE INDEX ctx_books_text
2 ON books (text)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS('filter ctxsys.null_filter lexer books_lexer memory 64m');

Index created.
And do the search:
SSQL> select author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 order by score(1) desc;

622 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2979142934

-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 785 | 1576K| | 352 (1)| 00:00:05 |
| 1 | SORT ORDER BY | | 785 | 1576K| 2104K| 352 (1)| 00:00:05 |
| 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 785 | 1576K| | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | | 12 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
19748 bytes sent via SQL*Net to client
801 bytes received via SQL*Net from client
43 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
622 rows processed
We got 622 books which took us 486 LIOs. So far so good.

Let's add something else

Things will get a bit interesting (or complicated, depending on the perspective) once you'll try to mix both structured and unstructured data in one search. If I will take the above search, but this time I'm interested only in books written by author XDB, the search will look like this:
SQL> select author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2979142934

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 18 | 37026 | 13 (8)| 00:00:01 |
| 1 | SORT ORDER BY | | 18 | 37026 | 13 (8)| 00:00:01 |
|* 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 18 | 37026 | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 12 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("AUTHOR"='XDB')
3 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
1448 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed
Note that I've got only 37 books this time, however, it took me the same amount of resources. It is not a surprise, after all, because condition on an author column went as a filter predicate. In case your table has some indexes on structured columns already:
create index i_books_author on books (author);

Index created.
You can try to search using B*Tree index:
SQL> select /*+ index(books i_books_author */ author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 806035686

-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 288 (1)| 00:00:04 |
| 1 | SORT ORDER BY | | 47 | 96679 | 288 (1)| 00:00:04 |
|* 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 47 | 96679 | 287 (0)| 00:00:04 |
|* 3 | INDEX RANGE SCAN | I_BOOKS_AUTHOR | 88 | | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)
3 - access("AUTHOR"='XDB')

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
9 recursive calls
0 db block gets
223 consistent gets
0 physical reads
0 redo size
1454 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed
Though a bit better, it just really went the other way around this time. We are now fetching all books written by XDB and then passing each book to Oracle Text for filtering.

The problem

As long as one of your structured/ustructured predicates have a good selectivity, the search using only either of indexes will produce good results, there is not much to post-filter, after all. However, if both predicates are selective only when applied together, you need something else to speed the things up. Indexes combined through BITMAP AND can produce some improvements:
SQL> select author, title, published
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 3589384316

----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 17 (6)| 00:00:01 |
| 1 | SORT ORDER BY | | 47 | 96679 | 17 (6)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID | BOOKS | 47 | 96679 | 16 (0)| 00:00:01 |
| 3 | BITMAP CONVERSION TO ROWIDS | | | | | |
| 4 | BITMAP AND | | | | | |
| 5 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 6 | INDEX RANGE SCAN | I_BOOKS_AUTHOR | | | 1 (0)| 00:00:01 |
| 7 | BITMAP CONVERSION FROM ROWIDS| | | | | |
| 8 | SORT ORDER BY | | | | | |
|* 9 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 12 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

6 - access("AUTHOR"='XDB')
9 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
57 consistent gets
0 physical reads
0 redo size
1454 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
37 rows processed
We are down to only 57 LIOs. However, the entire thing still requires us to fetch all relevant rowids from both indexes followed by a merge. On huge document sets each step may output quite a bit of data and require substantial processing resources.

The new filter option in 11G

11G's Oracle Text allows you to specify structured data columns in a filter section while creating Oracle Text indexes:
SQL> CREATE INDEX ctx_books_text
2 ON books (text)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 FILTER BY author
5 PARAMETERS('filter ctxsys.null_filter lexer books_lexer memory 64m');

Index created.
Let's see how the search will look like now:
SQL> select author, title, published
2 from books
3 where contains(text, 'procedure') > 0
4 and author='XDB';

37 rows selected.


Execution Plan
----------------------------------------------------------
Plan hash value: 2443340341

----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 11 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| BOOKS | 47 | 96679 | 11 (0)| 00:00:01 |
|* 2 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 8 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CTXSYS"."CONTAINS"("TEXT",'procedure')>0)
filter("AUTHOR"='XDB')

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
48 consistent gets
0 physical reads
0 redo size
1357 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
37 rows processed
Note only 48 LIOs. While it may not be a solid improvement over merged indexes example, the important point here is that Oracle was able to get all required data from a context index alone.

What about 10G?

There is no filter option for context indexes in 10G, however, that doesn't mean that there is nothing you could do to help you facilitate your mixed structured and unstructured data searches. I'll blog about what could be done in 10G next time which, by the way, can be applied to 11G as well.

No comments:

Post a Comment