Uncategorized

How to Index Files Residing in the OS File System

Today we are featuring the FILE_DATASTORE which is used to index files that reside in the operating file system. There is no need to upload the documents into the database. File names are stored in a varchar column, one document per row.

We are demonstrating this with a set of recipes in PDF format and want to be able to search their content for example to find all the recipes with chicken.

Create a directory to place all the recipes

mkdir /jbarba/Recipes

Place the files in this directory, in my case I have 5 files for testing. ls with -1 (minus 1) list each file in its own row

ls -1
Output:
Balsamic-Raspberry Chicken.pdf
Grilled Mongolian Pork Chops.pdf
Homestyle Kalua Pork with Cabbage in a Slow Cooker.pdf
Make-Ahead Slow-Cooker Asian Peach Chicken Thighs.pdf
Watermelon Sangria.pdf

Create a directory for the setup files

mkdir /jbarba/rsearch
cd /jbarba/rsearch

Create the file containing the list of file names of your recipes

ls -1 /jbarba/Recipes > /jbarba/rsearch/list_files.txt
more /jbarba/rsearch/list_files.txt
Output:
Balsamic-Raspberry Chicken.pdf
Grilled Mongolian Pork Chops.pdf
Homestyle Kalua Pork with Cabbage in a Slow Cooker.pdf
Make-Ahead Slow-Cooker Asian Peach Chicken Thighs.pdf
Watermelon Sangria.pdf

We are using sql loader thus using a control file. Create the load.ctl file in rsearch directory

vi /jbarba/rsearch/load.ctl
LOAD DATA
INFILE 'list_files.txt'
  INTO TABLE documents
  (filename CHAR(100),
   id     "DOCS_SEQ.NEXTVAL")

Connect to the database and verify the Oracle Text Option is installed:

sqlplus / as sysdba
select comp_id from dba_registry where comp_id='CONTEXT';
Output:
COMP_ID
------------------------------
CONTEXT

Create the schema user
We are naming the user rsearch

conn / as sysdba
drop user rsearch cascade;
create user rsearch identified by rsearch;
grant connect, resource to rsearch;
alter user rsearch default tablespace users;

 

We also need a role for using the FILE DATASTORE and grant CTXAPP to rsearch

create role FILE_ACCESS;
exec ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE','FILE_ACCESS');
grant FILE_ACCESS to rsearch;
grant CTXAPP to rsearch;

 

Create table and sequence under rsearch schema

conn rsearch/rsearch
DROP TABLE DOCUMENTS PURGE;
CREATE TABLE DOCUMENTS (
id NUMBER(10) PRIMARY KEY,
filename VARCHAR2(100)
);

DROP SEQUENCE docs_seq;
CREATE SEQUENCE docs_seq;

 

Run the sqlloader to fill the names of the recipes into DOCUMENTS table

host sqlldr userid=rsearch/rsearch control=/jbarba/rsearch/load.ctl log=/jbarba/rsearch/load.log bad=/jbarba/rsearch/load.bad

SQL*Loader: Release 11.2.0.3.0 – Production on Fri Jul 1 13:36:26 2016

Copyright (c) 1982, 2011, Oracle and/or its affiliates. All rights reserved.
Commit point reached – logical record count 5

Connecting as rsearch our recipe search schema user we create a file datastore from FILE_DATASTORE and specify our path

conn rsearch/rsearch
select * from documents;
Output:
         ID FILENAME
----------- ----------------------------------------------------------------------------------------------------
          1 Balsamic-Raspberry Chicken.pdf
          2 Grilled Mongolian Pork Chops.pdf
          3 Homestyle Kalua Pork with Cabbage in a Slow Cooker.pdf
          4 Make-Ahead Slow-Cooker Asian Peach Chicken Thighs.pdf
          5 Watermelon Sangria.pdf
exec ctx_ddl.drop_preference('RECIPES_DIR');
begin
 ctx_ddl.create_preference('RECIPES_DIR','FILE_DATASTORE');
 ctx_ddl.set_attribute('RECIPES_DIR','PATH','/jbarba/Recipes');
end;
/
drop index docsx;
create index docsx on documents(filename) indextype is ctxsys.context
parameters ('filter ctxsys.auto_filter datastore recipes_dir');

We verify there are no errors

select * from ctx_user_index_errors;
Output:
no rows selected

Take a look at the tables that make up a text index

select * from cat;
Output:
TABLE_NAME                     TABLE_TYPE
------------------------------ -----------
DOCS_SEQ                       SEQUENCE
DOCUMENTS                      TABLE
DR$DOCSX$I                     TABLE
DR$DOCSX$K                     TABLE
DR$DOCSX$N                     TABLE
DR$DOCSX$R                     TABLE

6 rows selected.

We can now run some queries with the contains function to use the Text index.

What recipes can I make that include chicken?

select id, filename from documents where contains (filename, 'chicken') > 0;
Output:
         ID FILENAME
----------- ----------------------------------------------------------
          1 Balsamic-Raspberry Chicken.pdf
          4 Make-Ahead Slow-Cooker Asian Peach Chicken Thighs.pdf

How about cilantro?

select id, filename from documents where contains (filename, 'cilantro') > 0;
Output:
         ID FILENAME
----------- ----------------------------------------------------------
          4 Make-Ahead Slow-Cooker Asian Peach Chicken Thighs.pdf

 

To cleanup:

conn rsearch/rsearch

DROP SEQUENCE docs_seq;
 DROP TABLE documents purge;
 exec ctx_ddl.drop_preference('RECIPES_DIR');
 host rm /jbarba/Recipes/*
 host rm /jbarba/rsearch/*
 host rmdir /jbarba/Recipes
 host rmdir /jbarba/rsearch

conn / as sysdba

DROP ROLE FILE_ACCESS;
 exec ctxsys.ctx_adm.set_parameter('FILE_ACCESS_ROLE','');
 drop user rsearch cascade;

Sure this made me hungry. Enjoy!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s