A unified Julia package for searching scientific repositories, libraries, and databases
DataScout.jl provides a consistent interface to search across multiple scientific and academic data sources. Whether you're looking for research papers, books, datasets, or web content, DataScout.jl makes it easy to find what you need.
using Pkg
Pkg.add("DataScout")
Or from the Julia REPL:
] add DataScout
using DataScout
# Search Wikipedia
results = search("Julia programming language", source=:wikipedia, max_results=5)
# Search academic papers (requires API key)
set_api_key!(:core, "your-core-api-key")
papers = search("machine learning", source=:core, max_results=10)
# Search books
books = search("data science", source=:openlibrary, max_results=5)
# View results
println(results)
Tip: An empty query returns an empty DataFrame (no error), so you can safely pass user input without pre-validation.
Source | Symbol | API Key Required | Description |
---|---|---|---|
CORE | :core | Yes | Academic papers and research articles |
OpenAlex | :openalex | Optional | Scholarly works and publications |
Zenodo | :zenodo | No | Research data and publications |
Figshare | :figshare | No | Research outputs and datasets |
Project Gutenberg | :gutenberg | No | Free ebooks |
Open Library | :openlibrary | No | Books and library catalog |
Wikipedia | :wikipedia | No | Encyclopedia articles |
DuckDuckGo | :duckduckgo | No | Web search results |
SearxNG | :searxng | No | Privacy-focused search |
Whoogle | :whoogle | No | Privacy-focused Google search |
Internet Archive | :internetarchive | No | Digital library and archives |
Why DataScout: unified schema across sources, built-in retries and rate limiting, simple API key management, and strong error handling.
Some services require API keys for access:
# Set API keys
set_api_key!(:core, "your-core-api-key")
set_api_key!(:openalex, "your-openalex-api-key") # Optional
# Get API keys
api_key = get_api_key(:core)
API keys are stored in ~/.datascout/config.toml and persist between sessions.
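If you want to inspect the stored keys outside DataScout.jl, Julia's standard TOML library can read the file. The sketch below is illustrative only: the `[api_keys]` table layout and the helper name `read_api_keys` are assumptions, not documented DataScout.jl behavior, so check your own config file before relying on it. It runs against a temporary file rather than the real config.

```julia
using TOML

# Assumed layout (not confirmed by the DataScout.jl docs):
#   [api_keys]
#   core = "your-core-api-key"
# Hypothetical helper: return the key table, or an empty Dict if the file is absent.
function read_api_keys(path::AbstractString)
    isfile(path) || return Dict{String, Any}()
    config = TOML.parsefile(path)
    return get(config, "api_keys", Dict{String, Any}())
end

# Demonstrate on a temporary file so the real config is never touched.
demo_path = joinpath(mktempdir(), "config.toml")
write(demo_path, """
[api_keys]
core = "your-core-api-key"
""")

keys_found = read_api_keys(demo_path)
println(collect(keys(keys_found)))
```

For normal use, `get_api_key` remains the supported way to read a key.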
For services like SearxNG and Whoogle, you can specify custom instances:
# Using environment variables
ENV["SEARXNG_INSTANCE"] = "https://my-searxng-instance.com"
ENV["WHOOGLE_INSTANCE"] = "https://my-whoogle-instance.com"
# Or pass as parameters
results = search("query", source=:searxng, instance="https://custom-instance.com")
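The two configuration routes above suggest a simple resolution order. The following sketch shows one plausible way to combine them; `resolve_instance` is a hypothetical helper, and the rule that an explicit `instance` argument takes precedence over the environment variable is an assumption, not something the package documentation states.

```julia
# Hypothetical helper, not part of DataScout.jl: choose the instance URL for a
# self-hosted source. An explicit argument wins (assumed precedence); otherwise
# fall back to the named environment variable.
function resolve_instance(env_var::AbstractString;
                          instance::Union{AbstractString, Nothing}=nothing,
                          env=ENV)
    url = instance === nothing ? get(env, env_var, nothing) : instance
    url === nothing && error("No instance configured: pass `instance=` or set $env_var")
    return String(rstrip(url, '/'))   # normalize away a trailing slash
end

# A plain Dict stands in for ENV so the example does not modify the process environment.
fake_env = Dict("SEARXNG_INSTANCE" => "https://my-searxng-instance.com/")

from_env = resolve_instance("SEARXNG_INSTANCE"; env=fake_env)
from_arg = resolve_instance("SEARXNG_INSTANCE"; instance="https://custom-instance.com", env=fake_env)
println(from_env)
println(from_arg)
```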
using DataScout
using DataFrames  # provides nrow and DataFrame used below
# Simple Wikipedia search
results = search("quantum computing", source=:wikipedia)
println("Found $(nrow(results)) results")
println(results.title[1]) # First result title
println(results.url[1]) # First result URL
# Search for academic papers
set_api_key!(:core, "your-api-key")
papers = search("climate change", source=:core, max_results=20)
# Keep only results that list at least one author
papers_with_authors = filter(row -> !ismissing(row.authors), papers)
# Display results
for i in 1:min(5, nrow(papers))
    println("Title: $(papers.title[i])")
    println("Authors: $(papers.authors[i])")
    println("URL: $(papers.url[i])")
    println("---")
end
function search_multiple_sources(query, sources=[:wikipedia, :openalex, :zenodo])
    all_results = DataFrame()
    for source in sources
        try
            results = search(query, source=source, max_results=5)
            # cols=:union tolerates sources that return different columns
            all_results = vcat(all_results, results, cols=:union)
        catch e
            @warn "Failed to search $source: $e"
        end
    end
    return all_results
end
# Search across multiple sources
results = search_multiple_sources("artificial intelligence")
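Results merged from several sources can contain the same URL more than once, or rows with no URL at all. The sketch below is one way to post-process a combined table. It relies only on the `url` and `source` columns and on DataFrames.jl; the sample rows and the helper name `dedupe_by_url` are made up for illustration and are not part of DataScout.jl.

```julia
using DataFrames

# Hypothetical helper: drop rows without a URL, then keep the first row per URL.
function dedupe_by_url(df::DataFrame)
    with_url = dropmissing(df, :url)   # returns a new DataFrame
    return unique(with_url, :url)      # first occurrence of each URL is kept
end

# Made-up rows standing in for combined search results.
combined = DataFrame(
    title = ["AI overview", "AI overview (mirror)", "Untitled"],
    url = Union{String, Missing}["https://example.org/ai", "https://example.org/ai", missing],
    source = ["wikipedia", "openalex", "zenodo"],
)

deduped = dedupe_by_url(combined)
println(nrow(deduped))
```

Keeping the first occurrence means the order of `sources` decides which copy survives, so list the preferred source first.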
CORE (:core): literature reviews, academic search portals, or internal tools where PDF links and authors are important.
set_api_key!(:core, "YOUR_CORE_KEY")
df = search("graph neural networks", source=:core, max_results=5)
OpenAlex (:openalex): topic exploration, citation-based workflows, profile building.
https://doi.org/...
df = search("federated learning", source=:openalex, max_results=5)
Zenodo (:zenodo): dataset discovery, research artifacts in pipelines.
df = search("climate dataset", source=:zenodo, max_results=5)
Figshare (:figshare): media, datasets, and supplementary materials.
df = search("microscopy", source=:figshare, max_results=5)
Project Gutenberg (:gutenberg): classic texts for NLP experiments and demos.
df = search("sherlock holmes", source=:gutenberg, max_results=5)
Open Library (:openlibrary): bibliographic enrichment and library apps.
df = search("data visualization", source=:openlibrary, max_results=5)
Wikipedia (:wikipedia): quick encyclopedic lookups in UIs or chatbots.
df = search("julia programming language", source=:wikipedia, max_results=5)
DuckDuckGo (:duckduckgo): general web results with privacy focus.
df = search("reproducible research tooling", source=:duckduckgo, max_results=5)
SearxNG (:searxng): meta-search with custom instances for enterprise.
ENV["SEARXNG_INSTANCE"] = "https://searx.example.org"
df = search("open data portals", source=:searxng, max_results=5)
Whoogle (:whoogle): privacy-preserving Google front-end.
ENV["WHOOGLE_INSTANCE"] = "https://whoogle.example.org"
df = search("state of the art summarization", source=:whoogle, max_results=5)
Internet Archive (:internetarchive): archives, media, and historical documents.
df = search("old computing magazines", source=:internetarchive, max_results=5)
All search functions return a DataFrame with the following columns:
title::Union{String, Missing} - Title of the result
url::Union{String, Missing} - URL to access the resource
authors::Union{Vector{String}, Missing} - List of authors (when available)
source::Union{String, Missing} - Source name
id::Union{String, Missing} - Unique identifier from the source
DataScout.jl includes comprehensive error handling:
# Graceful handling of network errors
try
    results = search("test query", source=:core)
catch e
    @error "Search failed: $e"
    results = DataFrame() # Empty results
end
# Built-in retry mechanism for transient failures
# Automatic rate limiting to respect API limits
# Detailed error logging for debugging
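To give a feel for what a retry mechanism for transient failures looks like, here is a generic sketch built on Julia's standard `retry` and `ExponentialBackOff`. It does not show DataScout.jl's internal implementation; `flaky_search` is a made-up stand-in that fails twice before succeeding, and the delay values are arbitrary.

```julia
# Stand-in for a network call: fails on the first two attempts, then succeeds.
attempts = Ref(0)
function flaky_search(query)
    attempts[] += 1
    attempts[] < 3 && error("transient failure (attempt $(attempts[]))")
    return "results for $query"
end

# Wrap the call: up to 3 retries, with delays starting at 0.1 s and growing by a factor of 2.
robust_search = retry(flaky_search;
                      delays=ExponentialBackOff(n=3, first_delay=0.1, factor=2.0))

result = robust_search("test query")
println(result)
println("attempts: ", attempts[])
```

If every attempt fails, `retry` rethrows the last error, so a surrounding `try`/`catch` like the one above is still useful.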
Behavioral guarantees:
DataScout.jl automatically handles rate limiting for each service:
~/.datascout/config.toml
Rate-limiting state is persisted in ~/.datascout/state.toml to smooth behavior across sessions.
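As a rough illustration of per-service rate limiting, the sketch below enforces a minimum interval between requests. It is a hypothetical, in-memory example: the type `IntervalLimiter`, the functions `seconds_to_wait` and `throttle!`, and the 0.2 s interval are all assumptions for demonstration, and it does not reproduce DataScout.jl's actual algorithm or its persistence to state.toml.

```julia
# Minimal interval-based limiter (hypothetical, not DataScout.jl's code).
mutable struct IntervalLimiter
    min_interval::Float64   # seconds required between requests
    last_request::Float64   # time of the previous request, in seconds
end

# Start with no previous request recorded.
IntervalLimiter(min_interval) = IntervalLimiter(min_interval, -Inf)

# How long the caller must still wait before the next request is allowed.
seconds_to_wait(l::IntervalLimiter, now::Real) =
    max(0.0, l.last_request + l.min_interval - now)

# Sleep if needed, then record the request time. Returns the wait that was required.
function throttle!(l::IntervalLimiter)
    wait_s = seconds_to_wait(l, time())
    wait_s > 0 && sleep(wait_s)
    l.last_request = time()
    return wait_s
end

limiter = IntervalLimiter(0.2)
first_wait = throttle!(limiter)    # nothing to wait for on the first call
second_wait = throttle!(limiter)   # waits out the remainder of the interval
println((first_wait, second_wait))
```

Separating the pure calculation (`seconds_to_wait`) from the sleeping step keeps the timing logic easy to test without real delays.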
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
git clone https://github.com/liyakhathshaik/DataScout.jl.git
cd DataScout.jl
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. -e 'using Pkg; Pkg.test()'
This project is licensed under the MIT License - see the LICENSE file for details.
DataScout.jl - Making scientific data discovery simple and unified!