IndustryDocumentsDataAPI v7

The document provides a comprehensive guide on using the Industry Documents Library (IDL) Solr API for accessing and querying document data. It details how to retrieve documents in various formats, including XML and JSON, and explains the parameters for querying, paging, and using operators. Additionally, it includes examples and notes on specific query syntax and limitations to enhance user experience when interacting with the Solr server.

Uploaded by

Aurelius Noble

INDIVIDUAL DOCUMENTS
SEARCH RESULTS
QUERY NOTES

Industry Documents Library Solr API

The Industry Documents Library (IDL) uses Apache Solr to index the document corpus.
Users who want to access the data programmatically can query the IDL Solr server
directly. This allows the user to easily export documents to another system, execute
search queries, and process search results programmatically. Data can be exported in
these formats: xml, json, python, ruby, php, and csv.

Individual Documents
Each document is uniquely identified by an ID. The ID is an alphanumeric string. It consists
of 8 characters with four letters followed by four digits, e.g. kylw0221. The ID is not case
sensitive: KYLW0221, kYLw0221, kylw0221 all refer to the same document.
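
The ID rule above can be checked before a request is sent. The following is a minimal sketch, assuming the pattern is exactly four letters followed by four digits as described; the function name normalize_id is hypothetical and not part of the IDL API.

```python
import re

# Pattern taken from the description above: four letters, then four digits.
_ID_PATTERN = re.compile(r"[a-z]{4}[0-9]{4}")


def normalize_id(raw_id: str) -> str:
    """Return the lowercase form of a document ID, or raise ValueError."""
    candidate = raw_id.strip().lower()
    if not _ID_PATTERN.fullmatch(candidate):
        raise ValueError(f"not a valid document ID: {raw_id!r}")
    return candidate


# IDs are not case sensitive, so all spellings map to one lowercase form.
print(normalize_id("KYLW0221"))
print(normalize_id("kYLw0221"))
```

Lowercasing the ID on the client side is a convenience for comparing or caching IDs; the server accepts any casing.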

To access a document’s metadata in xml format, please query the Industry Documents
Library (IDL) Solr server with the ID.

For example, to retrieve the metadata for the document with ID kylw0221:

https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=id:kylw0221

The response looks like this:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">10</int>
<lst name="params">
<str name="q">id:kylw0221</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">kylw0221</str>
<str name="tid">ctg37j00</str>
<arr name="collection">
<str>Depositions and Trial Testimony (DATTA)</str>
</arr>
<arr name="availability">
<str>no restrictions</str>
<str>public</str>
</arr>
<arr name="case">
<str>
Engle Progeny; Andy R. Allen, Sr. and Patricia L. Allen, Case No. 16-
2007-CA-008311-BXXX-MA, Case No. 2008-CA-15000
</str>
</arr>
<str name="title">
In Re: Engle Progeny Cases Tobacco Litigation. Pertains to Andy R.
Allen, Sr., as Personal Representative for the Estate of Patricia L.
Allen. Jury Trial
</str>
<str name="documentdate">2014 November 25</str>
<arr name="type">
<str>trial transcript</str>
</arr>
<int name="pages">155</int>
<str name="bates">figlarj20141125</str>
<str name="witness">Figlar, James, Ph.D.</str>
<str name="dateaddeducsf">2015 March 05</str>
</doc>
</result>
</response>
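
An XML response of this shape can be turned into plain Python dictionaries with the standard library. The sketch below is illustrative only: it assumes that fields appear as str, int, or arr elements as in the response above, it uses a shortened copy of that response as sample input, and the function name parse_docs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response shown above, used as offline sample input.
SAMPLE = """<response>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">kylw0221</str>
<arr name="type"><str>trial transcript</str></arr>
<int name="pages">155</int>
</doc>
</result>
</response>"""


def parse_docs(xml_text):
    """Convert each <doc> element into a dict keyed by the name attribute."""
    docs = []
    root = ET.fromstring(xml_text)
    for doc in root.iter("doc"):
        record = {}
        for field in doc:
            name = field.get("name")
            if field.tag == "arr":
                # Multi-valued field: collect the text of every child element.
                record[name] = [(item.text or "").strip() for item in field]
            elif field.tag == "int":
                record[name] = int(field.text)
            else:
                record[name] = (field.text or "").strip()
        docs.append(record)
    return docs


docs = parse_docs(SAMPLE)
print(docs[0]["id"], docs[0]["pages"], docs[0]["type"])
```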

The parameters of interest are:

Parameter   Description   Comment
q           query         id:[ID]
wt          writer type   xml (default), json, python, ruby, php, csv

To extract the same data in json format, simply append &wt=json to the URL:
https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=id:kylw0221&wt=json

The response looks like this:

{
"responseHeader":{
"status":0,
"QTime":2,
"params":{
"indent":"true",
"q":"id:kylw0221",
"wt":"json"}},
"response":{"numFound":1,"start":0,"docs":[
{
"id":"kylw0221",
"tid":"ctg37j00",
"collection":["Depositions and Trial Testimony (DATTA)"],
"availability":["no restrictions",
"public"],
"case":["Engle Progeny; Andy R. Allen, Sr. and Patricia L. Allen, Case No. 16-2007-CA-008311-BXXX-MA, Case No. 2008-CA-15000"],
"title":"In Re: Engle Progeny Cases Tobacco Litigation. Pertains to Andy R. Allen, Sr., as Personal Representative for the Estate of Patricia L. Allen. Jury Trial",
"documentdate":"2014 November 25",
"type":["trial transcript"],
"pages":155,
"bates":"figlarj20141125",
"witness":"Figlar, James, Ph.D.",
"dateaddeducsf":"2015 March 05"}]
}}
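
The same lookup can be scripted. This is a minimal sketch using only the Python standard library; the helper names build_document_url, first_doc, and fetch_document are hypothetical, and the network call is defined but not executed here. Note that urlencode percent-encodes the colon in the query, which is an equivalent form of the URL shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the examples in this document.
BASE_URL = "https://metadata.idl.ucsf.edu/solr/ltdl3/query"


def build_document_url(doc_id, writer="json"):
    """Build the lookup URL for one document ID in the given writer type."""
    return BASE_URL + "?" + urlencode({"q": f"id:{doc_id}", "wt": writer})


def first_doc(payload):
    """Return the first document of a parsed JSON response, or None."""
    docs = payload["response"]["docs"]
    return docs[0] if docs else None


def fetch_document(doc_id):
    """Fetch one document's metadata over the network (not called here)."""
    with urlopen(build_document_url(doc_id), timeout=30) as resp:
        return first_doc(json.load(resp))


print(build_document_url("kylw0221"))
```

Calling fetch_document("kylw0221") would be expected to return a dictionary with keys such as id, collection, and pages, matching the JSON response above.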

Searches
Users can also run queries against the Solr server and extract results in the desired format.
To prevent user-issued queries from overloading the Solr server, we return 100 records at a
time. Users can page through the results by appending &start=[number] to the request. Deep
paging with start is expensive for a Solr server, so the start value cannot be bigger than
10,000. To go beyond that limit, you will need to use cursorMark.

The parameters of interest are:

Parameter   Description   Comment
q           query         The default is to return all records: if no q is
                          passed in, or if q=* or q=*:*, all records are
                          returned. Use the Boolean operators AND, OR, and NOT
                          to refine your query. Please see the “Query Notes”
                          section below for more details on how to query the
                          Solr server.
wt          writer type   xml (default), json, python, ruby, php, csv
start       start         For paging through results. 100 records are returned
                          at a time.
                          NOTE: Using start is costly for the Solr server when
                          paging deeply. The start value cannot be bigger than
                          10,000. To page beyond that limit, you have to use
                          cursorMark. See the Solr documentation:
                          https://solr.apache.org/guide/8_5/pagination-of-results.html

For example, to query for all documents with author Glantz and extract the data in xml
format:

https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=author:glantz

The response looks like:

<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
<lst name="params">
<str name="q">author:glantz</str>
</lst>
</lst>
<result name="response" numFound="1189" start="0">
<doc>
<str name="id">qpnl0052</str>
<str name="tid">dpl03c00</str>
<arr name="collection">
<str>RJ Reynolds</str>
</arr>
<str name="area">
OPERATIONS; ENGINEERING; BOHANON HR; SR PRINCIPAL ENGINEER
</str>
<arr name="availability">
<str>public</str>
<str>no restrictions</str>
</arr>
<str name="topic">CTR/TIRC/TI; TOBACCO INSTITUTE</str>
<str name="box">NA; RJR9941</str>
<arr name="case">
<str>US RESEARCH AND MANUFACTURING DOCUMENT PRODUCTION</str>
</arr>
<str name="title">
ECNOMIC IMPACT OF GOVERNMENT MANDATED SMOKING RESTRICTIONS ON THE
RESTAURANT INDUSTRY.
</str>
<arr name="author">
<str>RJR</str>
<str>BOHANON H</str>
<str>GLANTZ S</str>
<str>TI</str>
<str>GLANTZ & SMITH</str>
<str>OSHA</str>
<str>WAXMAN H</str>
<str>SENATE</str>
<str>EPA</str>
<str>CRAIG</str>
<str>NCSU</str>
<str>NRA</str>
</arr>
<str name="documentdate">1995 April 19</str>
<arr name="type">
<str>report</str>
</arr>
<int name="pages">24</int>
<arr name="mentioned">
<str>LIST OF FOOTNOTES</str>
<str>HANSEN & LOTT</str>
<str>MADD</str>
</arr>
<str name="description">Marginalia; Y</str>
<str name="bates">525616666-525616689</str>
<str name="minnesotarequestnumber">US RESEARCH AND MANUFACTURING
DOCUMENT PRODUCTION</str>
<str name="dateshipped">2015 August 06</str>
<str name="dateaddeducsf">2003 January 15</str>
<str name="dateaddedindustry">2002 October 15</str>
<str name="datemodifiedindustry">2012 April 17; 2015 August 06</str>
</doc>
…

The attribute numFound="1189" shows that there are 1189 records that match this query.
Since we only return 100 records at a time, paging is needed to access all records.

To see the next 100 records, append &start=100 to the URL:

https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=author:glantz&start=100

To see the 100 records after that, append &start=200 to the URL, and continue in steps of 100. For this query, the last page is &start=1100, which returns the remaining 89 records.

https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=author:glantz&start=200
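
The start values needed to walk a result set follow directly from numFound. The sketch below assumes the fixed page size of 100 described above; the function name page_starts is hypothetical.

```python
def page_starts(num_found, page_size=100):
    """Return the start offsets needed to page through num_found records."""
    return list(range(0, num_found, page_size))


starts = page_starts(1189)
last_page_size = 1189 - starts[-1]
print(len(starts), starts[-1], last_page_size)  # 12 1100 89
```

For the author:glantz example with numFound="1189", this gives twelve requests, from start=0 to start=1100, with 89 records on the last page.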

To find author:glantz only in the tobacco industry, add AND industry:tobacco to the query:
https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=(author:glantz AND industry:tobacco)

To find all records from the Brown & Williamson collection with type letter but NOT with
brand Kool, note that you cannot use & directly in the URL query, because & is a special URL
character. You must substitute it with %26. (Also see the Ampersand section under Query
Notes.) The query should look like this:
https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=((collection:"brown %26 williamson" AND type:letter) NOT brand:kool)

Example with cursorMark deep paging

• You must have a sort order. This example uses sort by id desc
• Initial request must pass cursorMark=*
• Response will have a nextCursorMark value, let’s call it XXX
• Subsequent requests pass nextCursorMark value into &cursorMark=XXX

Initial Request:
https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=tobacco&wt=json&cursorMark=*&sort=id%20desc

At the end of the response, you will see:

"nextCursorMark":"AoEoenp5eDAyMTY="}

Next Request:
https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=tobacco&wt=json&cursorMark=AoEoenp5eDAyMTY=&sort=id%20desc

Pseudo code for looping through search results using cursorMark:

// NOTE: This is not real code.

// This is pseudo code to demo the logic of how to loop through
// the result set using cursorMark.
// Please translate to your programming language of choice.

$params = [ q => $some_query, sort => 'id asc', cursorMark => '*' ]
$done = false
while (not $done) {
    $results = fetch_solr($params)

    // write logic here to do something with $results

    if ($params[cursorMark] == $results[nextCursorMark]) {
        $done = true
    }
    $params[cursorMark] = $results[nextCursorMark]
}
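
One possible Python translation of the pseudo code is sketched below. It is an illustration rather than an official client: the function names fetch_solr and iter_docs are hypothetical, the network call is injected so that the loop can be tried offline, and the demo at the end uses made-up cursor values and records in place of a real server.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://metadata.idl.ucsf.edu/solr/ltdl3/query"


def fetch_solr(params):
    """Send one request to the Solr server and return the parsed JSON."""
    with urlopen(BASE_URL + "?" + urlencode(params), timeout=60) as resp:
        return json.load(resp)


def iter_docs(query, fetch=fetch_solr, sort="id asc"):
    """Yield every matching document, following nextCursorMark."""
    params = {"q": query, "wt": "json", "sort": sort, "cursorMark": "*"}
    while True:
        results = fetch(params)
        yield from results["response"]["docs"]
        next_mark = results["nextCursorMark"]
        # The result set is exhausted when the cursor stops advancing.
        if next_mark == params["cursorMark"]:
            break
        params["cursorMark"] = next_mark


# Offline demo: a stand-in for the server with invented cursor marks.
FAKE_PAGES = {
    "*": ([{"id": "aaaa0001"}], "mark1"),
    "mark1": ([{"id": "aaaa0002"}], "mark2"),
    "mark2": ([], "mark2"),
}


def fake_fetch(params):
    docs, next_mark = FAKE_PAGES[params["cursorMark"]]
    return {"response": {"docs": docs}, "nextCursorMark": next_mark}


ids = [doc["id"] for doc in iter_docs("tobacco", fetch=fake_fetch)]
print(ids)
```

Against the real server, iter_docs("tobacco") would use fetch_solr by default; urlencode takes care of escaping the cursor value and the space in the sort parameter.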

Query Notes
The Industry Documents Library (IDL) Solr server does not have the syntax sugar that the
Industry Documents Library (IDL) website adds for the user. But most of the queries
entered on the website can be directly issued to the Solr server with a few exceptions.

Dates:
The date fields are stored as a string in the format of “4-digit-year month 2-digit-day”, for
example “2001 November 09”.

Searching for an exact date can be done on the date field, for example
q=documentdate:"2001 November 09".

However, range searches do not work on string fields. We also store the date fields in ISO
format that is meant for date range searches. If one wants to search for documents dated
between 2001 and 2011, the search must be done on the documentdateiso field
(q=documentdateiso:[2001-01-01T00:00:00Z TO 2011-12-31T00:00:00Z]).
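
Both kinds of date query can be generated from Python date objects. This is a minimal sketch, assuming the string format and the documentdateiso range syntax described above; the helper names exact_date_query and date_range_query are hypothetical.

```python
from datetime import date

# English month names written out, to avoid locale-dependent formatting.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def exact_date_query(day, field="documentdate"):
    """Build an exact-match query in the 'YYYY Month DD' string format."""
    return f'{field}:"{day.year:04d} {MONTHS[day.month - 1]} {day.day:02d}"'


def date_range_query(start, end, field="documentdateiso"):
    """Build a range query on the ISO date field."""
    return (
        f"{field}:[{start.isoformat()}T00:00:00Z "
        f"TO {end.isoformat()}T00:00:00Z]"
    )


exact = exact_date_query(date(2001, 11, 9))
ranged = date_range_query(date(2001, 1, 1), date(2011, 12, 31))
print(exact)
print(ranged)
```

The resulting strings are values for the q parameter and still need URL encoding before they are placed in a request.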

Bates search:
When searching for a Bates number on the Solr server, please use the batesexpanded field
code (q=batesexpanded:XXXXX). This is because the bates field of a document can cover a
range of Bates numbers. Querying the batesexpanded field finds a single Bates number
within such a range.

cited:yes/cited:no:
The query cited:yes works on the website, but it needs to be issued as q=cited:* to the
Solr server directly.
Similarly, cited:no needs to be issued as q=-cited:* or q=NOT cited:* to the Solr server.
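
A script that accepts website-style queries could apply this translation before calling the Solr server. The sketch below handles only the two terms named above; the function name translate_cited is hypothetical.

```python
# Website syntax on the left, the equivalent Solr syntax on the right.
_CITED_TERMS = {"cited:yes": "cited:*", "cited:no": "-cited:*"}


def translate_cited(term):
    """Map cited:yes / cited:no to Solr syntax; leave other terms alone."""
    return _CITED_TERMS.get(term.strip().lower(), term)


print(translate_cited("cited:yes"))
print(translate_cited("cited:no"))
print(translate_cited("author:glantz"))
```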

Ampersand &
Some field values contain an ampersand, for example Brown & Williamson. The ampersand
is a special character in URLs, so when searching for values that contain one, please use
%26 in its place.

For example: https://metadata.idl.ucsf.edu/solr/ltdl3/query?q=collection:"brown %26 williamson"
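
When requests are built in code, a URL-encoding function can perform this substitution automatically. The sketch below uses urlencode from the Python standard library, which replaces & with %26 and also escapes the quotes, the colon, and the spaces (as +); parse_qs is used only to show that the original query is recovered.

```python
from urllib.parse import parse_qs, urlencode

query = 'collection:"brown & williamson"'

# urlencode escapes every reserved character in the value, including &.
encoded = urlencode({"q": query})
print(encoded)

# Decoding the query string returns the original value unchanged.
decoded = parse_qs(encoded)["q"][0]
print(decoded)
```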

Operators:
Use the operators AND, OR, and NOT to refine your search. Always use parentheses to
indicate how you want your query to be interpreted.
For example:
q=author:glantz AND type:letter OR brand:kool

The above query will give you unexpected results because it does not contain parentheses.

You MUST use parentheses to explicitly indicate how you want the query to be executed:

q=((author:glantz AND type:letter) OR brand:kool)

OR
q=(author:glantz AND (type:letter OR brand:kool))
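
One way to make sure every operator is explicitly grouped is to build queries through a small helper that always adds the parentheses. This is an illustrative sketch; the function name group is hypothetical.

```python
def group(operator, *clauses):
    """Join clauses with AND or OR and wrap the result in parentheses."""
    if operator not in ("AND", "OR"):
        raise ValueError(f"unsupported operator: {operator!r}")
    return "(" + f" {operator} ".join(clauses) + ")"


# The two readings of the ambiguous query, spelled out explicitly.
q1 = group("OR", group("AND", "author:glantz", "type:letter"), "brand:kool")
q2 = group("AND", "author:glantz", group("OR", "type:letter", "brand:kool"))
print(q1)
print(q2)
```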
