Showing 1–9 of 9 results for author: Naiman, J

Searching in archive cs.
  1. arXiv:2408.01556  [pdf, other]

    astro-ph.IM cs.DL cs.IR

    pathfinder: A Semantic Framework for Literature Review and Knowledge Discovery in Astronomy

    Authors: Kartheik G. Iyer, Mikaeel Yunus, Charles O'Neill, Christine Ye, Alina Hyk, Kiera McCormick, Ioana Ciuca, John F. Wu, Alberto Accomazzi, Simone Astarita, Rishabh Chakrabarty, Jesse Cranney, Anjalie Field, Tirthankar Ghosal, Michele Ginolfi, Marc Huertas-Company, Maja Jablonska, Sandor Kruk, Huiling Liu, Gabriel Marchidan, Rohit Mistry, J. P. Naiman, J. E. G. Peek, Mugdha Polimera, Sergio J. Rodriguez, et al. (5 additional authors not shown)

    Abstract: The exponential growth of astronomical literature poses significant challenges for researchers navigating and synthesizing general insights or even domain-specific knowledge. We present Pathfinder, a machine learning framework designed to enable literature review and knowledge discovery in astronomy, focusing on semantic searching with natural language instead of syntactic searches with keywords.…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: 25 pages, 9 figures, submitted to AAS journals. Comments are welcome, and the tools mentioned are available online at https://pfdr.app

  2. arXiv:2309.11549  [pdf, other]

    cs.DL astro-ph.IM

    Large Synthetic Data from the arXiv for OCR Post Correction of Historic Scientific Articles

    Authors: Jill P. Naiman, Morgan G. Cosillo, Peter K. G. Williams, Alyssa Goodman

    Abstract: Scientific articles published prior to the "age of digitization" (~1997) require Optical Character Recognition (OCR) to transform scanned documents into machine-readable text, a process that often produces errors. We develop a pipeline for the generation of a synthetic ground truth/OCR dataset to correct the OCR results of the astrophysics literature holdings of the NASA Astrophysics Data System (…

    Submitted 20 September, 2023; originally announced September 2023.

    Comments: 6 pages, 1 figure, 1 table; training/validation/test datasets and all model weights to be linked on Zenodo on publication

  3. arXiv:2309.06126  [pdf, other]

    astro-ph.IM astro-ph.CO astro-ph.GA astro-ph.HE cs.CL cs.LG

    AstroLLaMA: Towards Specialized Foundation Models in Astronomy

    Authors: Tuan Dung Nguyen, Yuan-Sen Ting, Ioana Ciucă, Charlie O'Neill, Ze-Chang Sun, Maja Jabłońska, Sandor Kruk, Ernest Perkowski, Jack Miller, Jason Li, Josh Peek, Kartheik Iyer, Tomasz Różański, Pranav Khetarpal, Sharaf Zaman, David Brodrick, Sergio J. Rodríguez Méndez, Thang Bui, Alyssa Goodman, Alberto Accomazzi, Jill Naiman, Jesse Cranney, Kevin Schawinski, UniverseTBD

    Abstract: Large language models excel in many human-language tasks but often falter in highly specialized domains like scholarly astronomy. To bridge this gap, we introduce AstroLLaMA, a 7-billion-parameter model fine-tuned from LLaMA-2 using over 300,000 astronomy abstracts from arXiv. Optimized for traditional causal language modeling, AstroLLaMA achieves a 30% lower perplexity than Llama-2, showing marke…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: 6 pages, 3 figures, submitted to IJCNLP-AACL 2023. Comments are welcome. The model can be found on Hugging Face: https://huggingface.co/universeTBD/astrollama

  4. arXiv:2302.11583  [pdf, other]

    cs.DL astro-ph.IM

    The Digitization of Historical Astrophysical Literature with Highly-Localized Figures and Figure Captions

    Authors: Jill P. Naiman, Peter K. G. Williams, Alyssa Goodman

    Abstract: Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, after they have been processed with Optical Character Recognition (OCR), w…

    Submitted 22 February, 2023; originally announced February 2023.

    Comments: 29 pages, 10 figures, accepted for publication in the International Journal on Digital Libraries, special issue follow up to TPDL 2022 conference. arXiv admin note: substantial text overlap with arXiv:2209.04460

  5. arXiv:2301.10781  [pdf, other]

    cs.DL

    Generalizability in Document Layout Analysis for Scientific Article Figure & Caption Extraction

    Authors: Jill P. Naiman

    Abstract: The lack of generalizability -- in which a model trained on one dataset cannot provide accurate results for a different dataset -- is a known problem in the field of document layout analysis. Thus, when a model is used to locate important page objects in scientific literature such as figures, tables, captions, and math formulas, the model often cannot be applied successfully to new domains. While…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: 9 pages, 3 figures, submitted as part of AEOLIAN Workshop 5: Making More Sense With Machines: AI/ML Methods for Interrogating and Understanding Our Textual Heritage in the Humanities, Natural Sciences, and Social Sciences

  6. arXiv:2209.04460  [pdf, other]

    astro-ph.IM cs.DL

    Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

    Authors: J. P. Naiman, Peter K. G. Williams, Alyssa Goodman

    Abstract: Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OC…

    Submitted 9 September, 2022; originally announced September 2022.

    Comments: 16 pages, 3 figures, accepted to TPDL 2022

  7. arXiv:2110.13819  [pdf, other]

    cs.CV cs.GR cs.LG eess.IV

    CloudFindr: A Deep Learning Cloud Artifact Masker for Satellite DEM Data

    Authors: Kalina Borkiewicz, Viraj Shah, J. P. Naiman, Chuanyue Shen, Stuart Levy, Jeff Carpenter

    Abstract: Artifact removal is an integral component of cinematic scientific visualization, and is especially challenging with big datasets in which artifacts are difficult to define. In this paper, we describe a method for creating cloud artifact masks which can be used to remove artifacts from satellite imagery using a combination of traditional image processing together with deep learning based on U-Net.…

    Submitted 26 October, 2021; originally announced October 2021.

  8. arXiv:2006.00084  [pdf, other]

    astro-ph.IM astro-ph.EP cs.GR

    Clustering-informed Cinematic Astrophysical Data Visualization with Application to the Moon-forming Terrestrial Synestia

    Authors: Patrick D. Aleo, Simon J. Lock, Donna J. Cox, Stuart A. Levy, J. P. Naiman, A. J. Christensen, Kalina Borkiewicz, Robert Patterson

    Abstract: Scientific visualization tools are currently not optimized to create cinematic, production-quality representations of numerical data for the purpose of science communication. In our pipeline Estra, we outline a step-by-step process from a raw simulation into a finished render as a way to teach non-experts in the field of visualization how to achieve production-quality outputs on their own…

    Submitted 29 May, 2020; originally announced June 2020.

    Comments: 19 pages, 16 figures, submitted to MNRAS

  9. Cinematic Visualization of Multiresolution Data: Ytini for Adaptive Mesh Refinement in Houdini

    Authors: Kalina Borkiewicz, J. P. Naiman, Haoming Lai

    Abstract: We have entered the era of large multidimensional datasets represented by increasingly complex data structures. Current tools for scientific visualization are not optimized to efficiently and intuitively create cinematic production quality, time-evolving representations of numerical data for broad impact science communication via film, media, or journalism. To present such data in a cinematic envi…

    Submitted 1 November, 2018; v1 submitted 8 August, 2018; originally announced August 2018.

    Comments: 24 pages, 14 figures