DataScout.jl 🔍

A unified Julia package for searching scientific repositories, libraries, and databases

DataScout.jl provides a consistent interface to search across multiple scientific and academic data sources. Whether you're looking for research papers, books, datasets, or web content, DataScout.jl makes it easy to find what you need.

Features

Installation

using Pkg
Pkg.add("DataScout")

Or from the Julia REPL (the leading ] switches the prompt to Pkg mode):

] add DataScout

Quick Start

using DataScout

# Search Wikipedia
results = search("Julia programming language", source=:wikipedia, max_results=5)

# Search academic papers (requires API key)
set_api_key!(:core, "your-core-api-key")
papers = search("machine learning", source=:core, max_results=10)

# Search books
books = search("data science", source=:openlibrary, max_results=5)

# View results
println(results)

Tip: An empty query returns an empty DataFrame (no error), so you can safely pass user input without pre-validation.

Supported Sources

| Source | Symbol | API Key Required | Description |
|---|---|---|---|
| CORE | :core | ✅ | Academic papers and research articles |
| OpenAlex | :openalex | ❌ | Scholarly works and publications |
| Zenodo | :zenodo | ❌ | Research data and publications |
| Figshare | :figshare | ❌ | Research outputs and datasets |
| Project Gutenberg | :gutenberg | ❌ | Free ebooks |
| Open Library | :openlibrary | ❌ | Books and library catalog |
| Wikipedia | :wikipedia | ❌ | Encyclopedia articles |
| DuckDuckGo | :duckduckgo | ❌ | Web search results |
| SearxNG | :searxng | ❌ | Privacy-focused search |
| Whoogle | :whoogle | ❌ | Privacy-focused Google search |
| Internet Archive | :internetarchive | ❌ | Digital library and archives |

Why DataScout: unified schema across sources, built-in retries and rate limiting, simple API key management, and strong error handling.

Configuration

API Keys

Some services require API keys for access:

# Set API keys
set_api_key!(:core, "your-core-api-key")
set_api_key!(:openalex, "your-openalex-api-key")  # Optional

# Get API keys
api_key = get_api_key(:core)

API keys are stored in ~/.datascout/config.toml and persist between sessions.
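
To illustrate how this kind of persistence can work, here is a minimal sketch that writes and reads keys with Julia's standard TOML library. The `api_keys` table layout and the helpers `save_key` and `load_key` are assumptions made for this example, not DataScout.jl's actual file format or internals. The sketch writes to a temporary directory so it never touches ~/.datascout.

```julia
using TOML  # standard library since Julia 1.6

# Hypothetical helper: store `key` for `service` in the TOML file at `path`.
function save_key(path::AbstractString, service::Symbol, key::AbstractString)
    config = isfile(path) ? TOML.parsefile(path) : Dict{String,Any}()
    keys_table = get!(config, "api_keys", Dict{String,Any}())
    keys_table[String(service)] = key
    mkpath(dirname(path))
    open(path, "w") do io
        TOML.print(io, config)
    end
    return nothing
end

# Hypothetical helper: return the stored key, or `nothing` if none is set.
function load_key(path::AbstractString, service::Symbol)
    isfile(path) || return nothing
    config = TOML.parsefile(path)
    return get(get(config, "api_keys", Dict{String,Any}()), String(service), nothing)
end

# Use a temporary directory instead of the real configuration file.
config_path = joinpath(mktempdir(), "config.toml")
save_key(config_path, :core, "your-core-api-key")
println(load_key(config_path, :core))      # the key that was just saved
println(load_key(config_path, :openalex))  # no key stored for this service
```

Because the file is re-read on every lookup, a key saved in one session is visible in the next, which is the behavior described above.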

Custom Instances

For services like SearxNG and Whoogle, you can specify custom instances:

# Using environment variables
ENV["SEARXNG_INSTANCE"] = "https://my-searxng-instance.com"
ENV["WHOOGLE_INSTANCE"] = "https://my-whoogle-instance.com"

# Or pass as parameters
results = search("query", source=:searxng, instance="https://custom-instance.com")
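
When both an environment variable and an `instance` argument are present, some precedence rule has to decide which wins. The sketch below assumes the explicit argument takes priority, then the environment variable, then a default. That ordering, the helper `resolve_instance`, and the placeholder default URL are all assumptions for illustration; check the package source for the actual behavior.

```julia
# Hypothetical helper: pick the instance URL for a request.
# Assumed precedence: explicit argument, then environment variable, then default.
function resolve_instance(; instance::Union{Nothing,AbstractString}=nothing,
                          env_var::AbstractString="SEARXNG_INSTANCE",
                          default::AbstractString="https://searxng.example.org")
    url = if instance !== nothing
        instance
    else
        get(ENV, env_var, default)
    end
    return String(rstrip(url, '/'))  # normalize away trailing slashes
end

# withenv sets the variable only for the duration of the block.
withenv("SEARXNG_INSTANCE" => "https://my-searxng-instance.com/") do
    println(resolve_instance())                                        # from ENV
    println(resolve_instance(instance="https://custom-instance.com"))  # argument wins
end
```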

Usage Examples

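using DataScout
using DataFrames  # nrow comes from DataFrames.jl

# Simple Wikipedia search
results = search("quantum computing", source=:wikipedia)
println("Found $(nrow(results)) results")
if nrow(results) > 0  # indexing an empty result set would throw
    println(results.title[1])  # First result title
    println(results.url[1])    # First result URL
end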

Academic Research

# Search for academic papers
set_api_key!(:core, "your-api-key")
papers = search("climate change", source=:core, max_results=20)

# Filter results: keep only papers that list authors
papers_with_authors = filter(row -> !ismissing(row.authors), papers)

# Display results
for i in 1:min(5, nrow(papers))
    println("Title: $(papers.title[i])")
    println("Authors: $(papers.authors[i])")
    println("URL: $(papers.url[i])")
    println("---")
end
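
Multi-Source Search

using DataFrames  # provides DataFrame and vcat(...; cols=:union)

# Search several sources and combine the results into one DataFrame
function search_multiple_sources(query, sources=[:wikipedia, :openalex, :zenodo])
    all_results = DataFrame()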
    
    for source in sources
        try
            results = search(query, source=source, max_results=5)
            all_results = vcat(all_results, results, cols=:union)
        catch e
            @warn "Failed to search $source: $e"
        end
    end
    
    return all_results
end

# Search across multiple sources
results = search_multiple_sources("artificial intelligence")

Real-World Use Cases by Source

Result Format

All search functions return a DataFrame with the following columns:

Error Handling

DataScout.jl includes comprehensive error handling:

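using DataFrames  # for the empty DataFrame fallback

# Graceful handling of network errors.
# Assigning the value of the try block keeps `results` defined afterwards;
# a variable first assigned inside try is local to that block.
results = try
    search("test query", source=:core)
catch e
    @error "Search failed: $e"
    DataFrame()  # Empty results
end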

# Built-in retry mechanism for transient failures
# Automatic rate limiting to respect API limits
# Detailed error logging for debugging
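
For readers who want to see what a retry mechanism for transient failures can look like, here is a minimal sketch with exponentially growing delays. The helper `with_retries` and its parameters are hypothetical and written for this example; they do not describe DataScout.jl's internal retry logic, which may use different limits and delays.

```julia
# Hypothetical helper: call `f()` up to `attempts` times, sleeping between
# failures with exponentially growing delays (base_delay, 2*base_delay, ...).
# The last failure is rethrown so the caller still sees the error.
function with_retries(f; attempts::Integer=3, base_delay::Real=0.5)
    attempts >= 1 || throw(ArgumentError("attempts must be at least 1"))
    for attempt in 1:attempts
        try
            return f()
        catch err
            attempt == attempts && rethrow()
            delay = base_delay * 2.0^(attempt - 1)
            @warn "Attempt $attempt failed; retrying in $delay s" exception = err
            sleep(delay)
        end
    end
end

# Simulate a call that fails twice before succeeding.
calls = Ref(0)
result = with_retries(attempts=4, base_delay=0.01) do
    calls[] += 1
    calls[] < 3 && error("transient failure")
    "ok after $(calls[]) calls"
end
println(result)  # prints "ok after 3 calls"
```

In real use the do-block would wrap a network call; the simulated failure here only keeps the example runnable offline.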

Behavioral guarantees:

Performance and Rate Limiting

DataScout.jl automatically handles rate limiting for each service:

Rate-limiting state is persisted in ~/.datascout/state.toml to smooth behavior across sessions.
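
One simple way to carry rate-limiting state across sessions is to store the time of the last request per source and compute how long to wait before the next one. The sketch below shows that idea with the standard TOML library. The file layout, the helper names, and the one-second spacing are assumptions for illustration, not the contents of DataScout.jl's state.toml.

```julia
using TOML  # standard library since Julia 1.6

# Hypothetical helper: seconds still to wait before the next request is allowed,
# given the time of the last request and a minimum spacing between requests.
function seconds_to_wait(last_request::Real, current::Real, min_interval::Real)
    return max(0.0, min_interval - (current - last_request))
end

# Hypothetical helper: load the persisted state, or start empty.
function load_state(path::AbstractString)
    return isfile(path) ? TOML.parsefile(path) : Dict{String,Any}()
end

# Hypothetical helper: record the time of a request to `source` and save it.
function record_request!(path::AbstractString, source::Symbol, current::Real=time())
    state = load_state(path)
    state[String(source)] = Float64(current)
    open(io -> TOML.print(io, state), path, "w")
    return state
end

# Use a temporary file instead of the real state file.
state_path = joinpath(mktempdir(), "state.toml")
record_request!(state_path, :core, 100.0)

# A later session reloads the file and still knows when :core was last called.
last_core = load_state(state_path)["core"]
println(seconds_to_wait(last_core, 100.4, 1.0))  # still inside the 1 s window
println(seconds_to_wait(last_core, 105.0, 1.0))  # window has passed, no wait
```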

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Development Setup

git clone https://github.com/liyakhathshaik/DataScout.jl.git
cd DataScout.jl
julia --project=. -e 'using Pkg; Pkg.instantiate()'
julia --project=. -e 'using Pkg; Pkg.test()'

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Support


DataScout.jl - Making scientific data discovery simple and unified! 🔍✨