Download UNIMARC XML files from BNIweb and convert them to Parquet with DuckDB.
The Parquet dump is available at https://atomotic.github.io/bni/bni.parquet (70M) and can be used with DuckDB Shell.
The following steps are available in the Justfile.
Scrape all XML URLs (tools needed: curl, pup, and sd)
curl -s "http://bni.bncf.firenze.sbn.it/bniweb/menu.jsp" \
| pup 'a attr{href}' \
| grep elenco_fasc \
| sd "&amp;" "&" \
| sd "elenco_fasc" "scaricaxml" \
> links.txt
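To illustrate what the two substitutions in this pipeline do to a single link, here is a minimal sketch. It uses sed in place of sd so that it runs without extra tools, and the sample href is hypothetical: the real links in menu.jsp may carry different query parameters.

```shell
# rewrite_link: turn an HTML-escaped "&amp;" back into "&" and point the
# link at the XML download endpoint instead of the issue listing.
# sed stands in for sd here; the two substitutions are the same idea.
rewrite_link() {
  printf '%s\n' "$1" | sed -e 's/&amp;/\&/g' -e 's/elenco_fasc/scaricaxml/'
}

# Hypothetical href, for illustration only.
sample='elenco_fasc.jsp?anno=2015&amp;fasc=03'
result=$(rewrite_link "$sample")
echo "$result"
```

Each line written to links.txt is a relative URL of this shape, which the download step then appends to the bniweb base URL.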
Download all XML files (tools needed: wcurl and GNU parallel)
parallel wcurl --curl-options="--remote-header-name" "http://bni.bncf.firenze.sbn.it/bniweb/{}" :::: links.txt
mkdir -p xml
mv *.xml xml/
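Before loading the files, it can help to confirm that nothing was silently skipped. The sketch below compares the number of links with the number of XML files; check_downloads is a hypothetical helper, and it assumes each link yields exactly one distinctly named file, which the source does not guarantee.

```shell
# check_downloads LINKS_FILE XML_DIR
# Prints both counts and returns non-zero when they differ.
# Assumption: one downloaded .xml file per non-empty line in LINKS_FILE.
check_downloads() {
  links=$(grep -c . "$1")    # non-empty lines in the link list
  files=$(find "$2" -maxdepth 1 -type f -name '*.xml' | wc -l | tr -d ' ')
  echo "links=$links files=$files"
  [ "$links" -eq "$files" ]
}

# Usage: check_downloads links.txt xml
```

If the counts differ, rerunning the download step for the missing links is usually simpler than starting over.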
Load all XML files into DuckDB (tools needed: Go and GNU parallel)
go build
parallel -j1 ./bni {} ::: xml/*.xml
Export from DuckDB to Parquet
duckdb bni.ddb "copy bni to bni.parquet (format parquet);"
Size comparison
du -h bni.ddb bni.parquet
1.2G bni.ddb
67M bni.parquet
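To put the two sizes in perspective, a quick calculation gives the ratio between them. This is a rough sketch: it treats 1.2G as 1.2 × 1024 M, which is an assumption about how du rounds its human-readable output.

```shell
# Ratio of the DuckDB file size to the Parquet file size,
# using the figures reported by du above (1.2G and 67M).
ratio=$(awk 'BEGIN { printf "%.1f", 1.2 * 1024 / 67 }')
echo "bni.ddb is about ${ratio}x larger than bni.parquet"
```

The Parquet file comes out roughly 18 times smaller than the DuckDB database.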
duckdb
The schema: the data column contains the full UNIMARC record converted to JSON.
D DESCRIBE SELECT * FROM 'https://atomotic.github.io/bni/bni.parquet';
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ id │ VARCHAR │ YES │ │ │ │
│ isbn │ VARCHAR │ YES │ │ │ │
│ title │ VARCHAR │ YES │ │ │ │
│ data │ VARCHAR │ YES │ │ │ │
│ source │ VARCHAR │ YES │ │ │ │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
D .mode line
D SELECT id,title,isbn,source FROM 'https://atomotic.github.io/bni/bni.parquet' WHERE title LIKE '%biblioteco%' LIMIT 5;
id = USM1959877
title = Biblioteche e biblioteconomia
isbn = 9788843075294
source = xml/Monografie201503.xml

id = PAV0095007
title = I fondamenti della biblioteconomia
isbn = 9788870758474
source = xml/Monografie201601.xml

id = SBT0014568
title = Conferimento della laurea magistrale ad honorem in scienze archivistiche e biblioteconomiche a Michele Casalini
isbn = 9788864538822
source = xml/Monografie201904.xml

id = MOD1738924
title = Guida alla biblioteconomia moderna
isbn = 9788893574013
source = xml/Monografie202204.xml

id = SBT0045209
title = Principi, approcci e applicazioni della biblioteconomia comparata
isbn = 9788855186063
source = xml/Monografie202301.xml