The example
Let's say you have a books table like this:
SQL> create table booksI'm going to populate it using some test data:
2 (
3 author varchar2(30),
4 title varchar2(30),
5 published date,
6 text clob
7 ) lob (text) store as securefile
8 (
9 enable storage in row
10 cache
11 );
Table created.
SQL> declareWhat I have basically done there is dumped the source text of some of my database objects into books table. Object owner goes as an author, object_name represents a title, object creation date goes as published and, of course, the source itself goes as book's text.
2 l_clob clob;
3 begin
4 for obj_cur in (
5 select owner, object_name, created
6 from dba_objects
7 where object_type in ('PROCEDURE','PACKAGE','PACKAGE BOD
Y','FUNCTION','TRIGGER')
8 ) loop
9 insert into books values (obj_cur.owner, obj_cur.object_name, ob
j_cur.created, empty_clob())
10 returning text into l_clob;
11
12 dbms_lob.open(l_clob, dbms_lob.lob_readwrite);
13
14 for text_cur in (
15 select text
16 from dba_source
17 where owner=obj_cur.owner
18 and name=obj_cur.object_name
19 order by type, line
20 ) loop
21 dbms_lob.writeappend(l_clob, dbms_lob.getlength(text_cur
.text), text_cur.text);
22 end loop;
23
24 dbms_lob.close(l_clob);
25 end loop;
26 end;
27 /
PL/SQL procedure successfully completed.
SQL> commit;
Commit complete.
The search
Let's say I'm interested in all the books which write about procedures. To do that kind of search using Oracle Text is fairly straightforward.
First of all, we want to make sure that underscore symbol goes as part of a token. It is not really required for our example, however, it is generally a good idea to do since we want stuff like DBMS_OUTPUT or DBMS_SQL appear as a single tokens instead of being split in half due to underscore symbol treated as a special character:
SQL> beginAll I have to do now is to create my index:
2 ctx_ddl.create_preference('books_lexer', 'BASIC_LEXER');
3 ctx_ddl.set_attribute('books_lexer', 'printjoins', '_');
4 end;
5 /
PL/SQL procedure successfully completed.
SQL> CREATE INDEX ctx_books_textAnd do the search:
2 ON books (text)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 PARAMETERS('filter ctxsys.null_filter lexer books_lexer memory 64m');
Index created.
SSQL> select author, title, publishedWe got 622 books which took us 486 LIOs. So far so good.
2 from books
3 where contains(text, 'procedure', 1) > 0
4 order by score(1) desc;
622 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2979142934
-------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 785 | 1576K| | 352 (1)| 00:00:05 |
| 1 | SORT ORDER BY | | 785 | 1576K| 2104K| 352 (1)| 00:00:05 |
| 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 785 | 1576K| | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | | 12 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
19748 bytes sent via SQL*Net to client
801 bytes received via SQL*Net from client
43 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
622 rows processed
Let's add something else
Things will get a bit interesting (or complicated, depending on the perspective) once you'll try to mix both structured and unstructured data in one search. If I will take the above search, but this time I'm interested only in books written by author XDB, the search will look like this:
SQL> select author, title, publishedNote that I've got only 37 books this time, however, it took me the same amount of resources. It is not a surprise, after all, because condition on an author column went as a filter predicate. In case your table has some indexes on structured columns already:
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;
37 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2979142934
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 18 | 37026 | 13 (8)| 00:00:01 |
| 1 | SORT ORDER BY | | 18 | 37026 | 13 (8)| 00:00:01 |
|* 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 18 | 37026 | 12 (0)| 00:00:01 |
|* 3 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 12 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("AUTHOR"='XDB')
3 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
486 consistent gets
0 physical reads
0 redo size
1448 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed
create index i_books_author on books (author);You can try to search using B*Tree index:
Index created.
SQL> select /*+ index(books i_books_author */ author, title, publishedThough a bit better, it just really went the other way around this time. We are now fetching all books written by XDB and then passing each book to Oracle Text for filtering.
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;
37 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 806035686
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 288 (1)| 00:00:04 |
| 1 | SORT ORDER BY | | 47 | 96679 | 288 (1)| 00:00:04 |
|* 2 | TABLE ACCESS BY INDEX ROWID| BOOKS | 47 | 96679 | 287 (0)| 00:00:04 |
|* 3 | INDEX RANGE SCAN | I_BOOKS_AUTHOR | 88 | | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)
3 - access("AUTHOR"='XDB')
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
9 recursive calls
0 db block gets
223 consistent gets
0 physical reads
0 redo size
1454 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
37 rows processed
The problem
As long as one of your structured/ustructured predicates have a good selectivity, the search using only either of indexes will produce good results, there is not much to post-filter, after all. However, if both predicates are selective only when applied together, you need something else to speed the things up. Indexes combined through BITMAP AND can produce some improvements:
SQL> select author, title, publishedWe are down to only 57 LIOs. However, the entire thing still requires us to fetch all relevant rowids from both indexes followed by a merge. On huge document sets each step may output quite a bit of data and require substantial processing resources.
2 from books
3 where contains(text, 'procedure', 1) > 0
4 and author='XDB'
5 order by score(1) desc;
37 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 3589384316
----------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 17 (6)| 00:00:01 |
| 1 | SORT ORDER BY | | 47 | 96679 | 17 (6)| 00:00:01 |
| 2 | TABLE ACCESS BY INDEX ROWID | BOOKS | 47 | 96679 | 16 (0)| 00:00:01 |
| 3 | BITMAP CONVERSION TO ROWIDS | | | | | |
| 4 | BITMAP AND | | | | | |
| 5 | BITMAP CONVERSION FROM ROWIDS| | | | | |
|* 6 | INDEX RANGE SCAN | I_BOOKS_AUTHOR | | | 1 (0)| 00:00:01 |
| 7 | BITMAP CONVERSION FROM ROWIDS| | | | | |
| 8 | SORT ORDER BY | | | | | |
|* 9 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 12 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
6 - access("AUTHOR"='XDB')
9 - access("CTXSYS"."CONTAINS"("TEXT",'procedure',1)>0)
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
57 consistent gets
0 physical reads
0 redo size
1454 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
2 sorts (memory)
0 sorts (disk)
37 rows processed
The new filter option in 11G
11G's Oracle Text allows you to specify structured data columns in a filter section while creating Oracle Text indexes:
SQL> CREATE INDEX ctx_books_textLet's see how the search will look like now:
2 ON books (text)
3 INDEXTYPE IS CTXSYS.CONTEXT
4 FILTER BY author
5 PARAMETERS('filter ctxsys.null_filter lexer books_lexer memory 64m');
Index created.
SQL> select author, title, publishedNote only 48 LIOs. While it may not be a solid improvement over merged indexes example, the important point here is that Oracle was able to get all required data from a context index alone.
2 from books
3 where contains(text, 'procedure') > 0
4 and author='XDB';
37 rows selected.
Execution Plan
----------------------------------------------------------
Plan hash value: 2443340341
----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 47 | 96679 | 11 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID| BOOKS | 47 | 96679 | 11 (0)| 00:00:01 |
|* 2 | DOMAIN INDEX | CTX_BOOKS_TEXT | | | 8 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("CTXSYS"."CONTAINS"("TEXT",'procedure')>0)
filter("AUTHOR"='XDB')
Note
-----
- dynamic sampling used for this statement
Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
48 consistent gets
0 physical reads
0 redo size
1357 bytes sent via SQL*Net to client
372 bytes received via SQL*Net from client
4 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
37 rows processed
What about 10G?
There is no filter option for context indexes in 10G, however, that doesn't mean that there is nothing you could do to help you facilitate your mixed structured and unstructured data searches. I'll blog about what could be done in 10G next time which, by the way, can be applied to 11G as well.