universities 0.0.3 · MIT License

Comprehensive worldwide universities dataset with TypeScript API, CLI tools, and data processing utilities. Includes web scraping capabilities with respectful rate limiting for enriching university data.

Package Exports

  • universities


Universities

Worldwide universities dataset & enrichment toolkit – instant access to a global list of institutions plus an optional, rate‑limited scraper that augments each with descriptive, academic, and classification metadata via a TypeScript API & CLI.


Overview

universities is an evolving TypeScript/Node.js library and CLI that provides a structured, extensible dataset of the world's universities along with an enrichment pipeline that (optionally) visits institutional homepages to extract additional metadata.

Core goals:

  1. Provide immediate, zero‑network access to a clean base list of universities (name, domains, country info, website) sourced from the public world universities dataset.
  2. Offer an enrichment layer (opt‑in) that scrapes each university homepage respectfully (rate‑limited + retries) to infer or collect:
    • Descriptions / taglines / motto
    • Contact and location hints
    • Founding year
    • Academic programs & faculties (heuristic extraction)
    • Social media links
    • Institutional classification (public/private, research, technical, community, etc. — heuristic)
    • Degree levels (undergraduate / graduate / doctoral)
    • Data quality scoring for traceability
  3. Expose ergonomic programmatic APIs for search, filtering, and statistics.
  4. Provide a CLI for quick querying, enrichment, and aggregated stats generation.
  5. Remain transparent, reproducible, and respectful of target sites (configurable concurrency, caching, resumability, optional full‑dataset execution).

NOTE: Full automatic enrichment of every university (≈9k+) can take considerable time and should be run thoughtfully to avoid undue load on remote servers. The base dataset works instantly without enrichment.

Key Features

  • Base dataset loader (CSV → strongly typed objects)
  • In‑memory repository with searching, filtering, sorting and basic statistics
  • Extensible domain model (University, Program, Faculty, ranking + classification enums)
  • Heuristic scraper with retry + rate limiting queue
  • Batch enrichment script with per‑university JSON caching (resumable)
  • CLI with subcommands: list, enrich, stats
  • TypeScript declarations for consumption in TS or JS projects
  • Modular architecture to allow swapping scraping strategies or adding alternate data sources later (e.g., APIs, ranking feeds)

Installation

Install locally (library usage inside another project):

npm install universities

Or for global CLI usage (optional):

npm install -g universities

After a global install you can invoke the CLI via the universities command (see CLI section below). When using as a dependency, import from the package entry points.

Quick Start (CLI)

List the first 5 US universities:

universities list --country-code US --limit 5

Search by name fragment:

universities list --name polytechnic --limit 10

Output JSON instead of a table:

universities list --country-code CA --json --limit 3

Enrich a single university (fetch + parse homepage):

universities enrich https://www.mit.edu/

View aggregated stats (counts by type, size, etc.; results improve once enriched data exists):

universities stats

Programmatic Usage

import { loadBaseUniversities } from 'universities/dist/data/loadBase';
import { UniversityRepository } from 'universities/dist/repository/UniversityRepository';

async function example() {
  const base = await loadBaseUniversities();
  const repo = new UniversityRepository(base);

  const results = repo.search({ countryCode: 'US', name: 'state', limit: 20 });
  console.log(results.slice(0, 3));
  console.log(repo.stats());
}

example();

The scraper (UniversityScraper) is intentionally decoupled and lazily imported in the CLI to avoid pulling ESM‑only dependencies when they are not needed. For programmatic enrichment, import it directly:

import { UniversityScraper } from 'universities/dist/scraper/UniversityScraper';
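The lazy-import pattern mentioned above can be sketched in isolation. In this sketch, 'node:crypto' stands in for the ESM‑only scraper dependency; the function name and shape are illustrative, not the package's API:

```typescript
// Sketch of the lazy-import pattern used by the CLI: the heavy module is
// loaded only when the code path that needs it actually runs. 'node:crypto'
// is a stand-in here for the ESM-only scraper dependencies.
async function lazyHash(input: string): Promise<string> {
  const { createHash } = await import('node:crypto');
  return createHash('sha256').update(input).digest('hex');
}
```

Because the import() call executes only inside the function body, commands like list and stats never pay the cost of loading the module.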

Data Model (Simplified)

interface University {
  id: string; // Stable hash/id generation
  name: string;
  country: string;
  countryCode: string;
  alphaTwoCode?: string; // If present in source
  webPages: string[]; // One or more homepage URLs
  domains: string[]; // Domain(s)
  stateProvince?: string;
  // Enriched fields (optional until scraping):
  description?: string;
  motto?: string;
  foundingYear?: number;
  location?: string;
  contact?: { email?: string; phone?: string; address?: string };
  programs?: { name: string; degreeLevels?: string[] }[];
  faculties?: { name: string; description?: string }[];
  social?: { twitter?: string; facebook?: string; instagram?: string; linkedin?: string; youtube?: string };
  classification?: { type?: string; degreeLevel?: string[] };
  dataQuality?: { score: number; factors: string[] };
  enrichedAt?: string; // ISO timestamp when enrichment occurred
}

See the full definitions in src/types/University.ts for exhaustive enum types, search options, stats structure, and classification helpers.
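The "stable hash/id" noted in the id field comment could be generated along these lines. This is a hypothetical sketch: the package's actual id scheme, hash choice, and input fields are not documented here and may differ:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical stable-id sketch: derive a short, deterministic id from the
// country code and name so repeated dataset loads assign the same id.
function universityId(name: string, countryCode: string): string {
  const key = `${countryCode}|${name}`.trim().toLowerCase();
  return createHash('sha1').update(key).digest('hex').slice(0, 12);
}
```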

Architecture Overview

Layered design:

  1. Source Layer (world-universities.csv) – raw dataset.
  2. Loader (loadBaseUniversities) – parses CSV into partial University objects.
  3. Domain Types (types/University.ts) – strongly typed schema + enums + search contracts.
  4. Repository (UniversityRepository) – in‑memory indexing, filtering, sorting, basic statistics.
  5. Scraper (UniversityScraper) – fetch + parse homepage, extraction heuristics, classification & data quality scoring (rate limited via queue).
  6. Enrichment Script (scripts/enrich.ts) – orchestrates batch scraping with caching to data/cache/*.json and writes aggregated enriched dataset.
  7. CLI (cli.ts) – user interface for listing, enrichment, and stats.
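The loader step (layer 2) can be illustrated with a naive row parser. This assumes a simple countryCode,name,url column layout; the real world-universities.csv columns and the loadBaseUniversities implementation may differ, and a production loader should use a proper CSV parser that handles quoted commas:

```typescript
// Naive CSV-row sketch for the loader layer. Assumes `countryCode,name,url`
// columns with no quoted fields (an assumption, not the package's format).
interface BaseUniversity {
  countryCode: string;
  name: string;
  webPages: string[];
}

function parseUniversityRow(line: string): BaseUniversity | undefined {
  const [countryCode, name, url] = line.split(',').map((s) => s.trim());
  if (!countryCode || !name) return undefined;
  return { countryCode, name, webPages: url ? [url] : [] };
}
```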

Scraper Heuristics (High-Level)

  • Fetch with retry & jitter backoff.
  • Extract <meta name="description">, first meaningful paragraph, or tagline patterns.
  • Look for contact info via regex (emails, phone numbers, address fragments).
  • Infer founding year via patterns like Established 18xx|19xx|20xx.
  • Identify program/faculty keywords in navigation or section headers.
  • Collect social links by domain match (twitter.com, facebook.com, etc.).
  • Classify type (public/private/research/technical/community) by keyword/phrase heuristics.
  • Score data quality based on number & diversity of successfully extracted fields.
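One of the heuristics above, founding-year inference, can be sketched as a small pattern match. This is illustrative only; the scraper's real patterns are presumably broader:

```typescript
// Illustrative founding-year heuristic: match phrases such as
// "Established 1861" or "founded in 1901" and return the year.
function inferFoundingYear(text: string): number | undefined {
  const match = text.match(/\b(?:established|founded)(?:\s+in)?\s+((?:18|19|20)\d{2})\b/i);
  return match ? Number(match[1]) : undefined;
}
```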

Performance & Respectful Crawling

  • Concurrency controlled by a queue (configurable).
  • Optional pauses / resume; per‑record caching prevents redundant fetches.
  • Future roadmap includes robots.txt parsing & adaptive politeness windows.
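A minimal version of such a concurrency-controlled queue might look like the following. The project uses a dedicated queue dependency for this; the hand-rolled sketch below only illustrates the idea of capping in-flight requests:

```typescript
// Minimal concurrency-limited runner: at most `concurrency` tasks in flight.
async function runLimited<T>(
  tasks: Array<() => Promise<T>>,
  concurrency: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim the next task index
      results[i] = await tasks[i]();
    }
  }
  const workerCount = Math.max(1, Math.min(concurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}
```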

Enrichment Workflow

The batch enrichment script is optional and can be executed when you purposely want deeper metadata.

npm run build
node dist/scripts/enrich.js --concurrency 3 --resume

Flags (planned / implemented):

Flag                  Description
--concurrency <n>     Number of parallel fetches (default is modest to avoid overloading sites).
--resume              Skip universities already cached in data/cache/.
--limit <n>           (Planned) Process only the first N universities for sampling.
--country-code <CC>   (Planned) Restrict enrichment to a country subset.

Outputs:

  • data/cache/{universityId}.json – per‑university enriched snapshot.
  • data/enriched-universities.json – aggregated enriched dataset (written after run).

CLI Reference

Command        Purpose                                                      Key Options
list           Filter & display base (or partially enriched) universities   --name, --country, --country-code, --limit, --json
enrich <url>   Enrich a single university homepage                          (none yet; uses internal defaults)
stats          Show aggregated statistics                                   (none)

Examples:

universities list --name technology --limit 8
universities list --country-code GB --json --limit 5
universities enrich https://www.stanford.edu/
universities stats

Roadmap

  • Full dataset enrichment pipeline automation & snapshot publishing
  • Dual ESM + CJS distribution build (current workaround: lazy import for ESM‑only deps)
  • Robots.txt compliance & politeness policy configuration
  • Advanced classification (continent/region inference, size estimation heuristics, ranking ingestion)
  • Pluggable enrichment modules (e.g., ranking APIs, accreditation feeds)
  • Incremental persistent store (SQLite / LiteFS / DuckDB) for historical deltas
  • Comprehensive test suite (scraper mocks, repository edge cases, CLI integration)
  • Documentation site (API reference, enrichment metrics dashboard)
  • Progressive enrichment resume with queuing telemetry
  • Data provenance & reproducibility manifest (hashes, run metadata)

Testing

Run unit and integration tests:

npm test

Coverage reports are emitted to coverage/.

Contributing

We welcome contributions! Suggested steps:

  1. Fork & create a feature branch.
  2. Install dependencies: npm install.
  3. Run npm run build & ensure tests pass.
  4. Add or update tests for your change.
  5. Follow lint & formatting (npm run lint, npm run format).
  6. Submit a PR referencing any related issues.

Please consult (or propose) a CONTRIBUTING.md for evolving guidelines. Ethical scraping considerations and rate limiting are especially important—avoid aggressive concurrency.

  • This project performs only light, homepage‑level scraping by default.
  • Always respect target site terms of service and robots.txt (planned feature for enforcement).
  • Do not use the enrichment pipeline to harvest personal data beyond institutional metadata.
  • Consider running enrichment in batches with conservative concurrency settings.

Troubleshooting

Issue                                 Cause                                                            Resolution
ERR_REQUIRE_ESM when using CLI list   ESM-only dependency (p-queue) pulled into non-enrichment path    Resolved via lazy import; update to the latest package version
Empty enrichment fields               Site structure variation                                         Re-run later or inspect the HTML; heuristics will improve over time
Slow enrichment run                   Network latency / conservative concurrency                       Increase --concurrency cautiously

Security

No secrets are stored. If you identify a security concern (e.g., vulnerable dependency or scraping misuse vector) please open an issue with reproduction details or use private disclosure if sensitive.

License

This repository is distributed under the terms of the MIT License. See LICENSE for details.

Acknowledgements

Inspired by the open university datasets community and contributors who maintain baseline CSV resources. Future improvements will strive for transparency, repeatability, and respectful data gathering.


Give a ⭐ if you find this useful and feel free to open issues for ideas or enhancements.


Generated documentation improvements are iterative; feel free to propose edits.

Technologies

TypeScript · Prettier · ESLint · Jest · Node.js · npm


Development (repository checkout)

Install dependencies:

npm install

Run in development:

npm start

or

npm run dev

Build:

npm run build

Thanks to all Contributors 💪

  • Thank you for considering contributing.
  • Feel free to submit feature requests, UI updates, and bug reports as issues.
  • Check out the Contribution Guidelines for more information.
