Google Researcher MCP Server

Empower AI assistants with robust, persistent, and secure web research capabilities.

This server implements the Model Context Protocol (MCP), providing a suite of tools for Google Search, content scraping, and Gemini AI analysis. It's designed for performance and reliability, featuring a persistent caching system, comprehensive timeout handling, and enterprise-grade security.

Why Use This Server?
Features
System Architecture
YouTube Transcript Extraction
Getting Started
Usage
Performance & Reliability
Security
- OAuth 2.1 Authorization
- Available Scopes
Testing
Troubleshooting
Contributing
License

Why Use This Server?

Extend AI Capabilities: Grant AI assistants access to real-time web information and powerful analytical tools.
Maximize Performance: Drastically reduce latency for repeated queries with a sophisticated two-layer persistent cache (in-memory and disk).
Reduce Costs: Minimize expensive API calls to Google Search and Gemini by caching results.
Ensure Reliability: Prevent failures and ensure consistent performance with comprehensive timeout handling and graceful degradation.
Flexible & Secure Integration: Connect any MCP-compatible client via STDIO or HTTP+SSE, with enterprise-grade OAuth 2.1 for secure API access.
Open & Extensible: MIT licensed, fully open-source, and designed for easy modification and extension.

Features

Core Research Tools:
- google_search: Find information using the Google Search API.
- scrape_page: Extract content from websites and YouTube videos with robust transcript extraction.
- analyze_with_gemini: Process text using Google's powerful Gemini AI models.
- research_topic: A composite tool that combines search, scraping, and analysis into a single, efficient operation.
YouTube Transcript Extraction:
- Robust YouTube transcript extraction with comprehensive error handling: 10 distinct error types with clear, actionable messages.
- Intelligent retry logic with exponential backoff: Automatic retries for transient failures (network issues, rate limiting, timeouts).
- User-friendly error messages and diagnostics: Clear feedback when transcript extraction fails, with specific reasons.
Advanced Caching System:
- Two-Layer Cache: Combines a fast in-memory cache for immediate access with a persistent disk-based cache for durability.
- Custom Namespaces: Organizes cached data by tool, preventing collisions and simplifying management.
- Manual & Automated Persistence: Offers both automatic, time-based cache saving and manual persistence via a secure API endpoint.
Robust Performance & Reliability:
- Comprehensive Timeouts: Protects against network issues and slow responses from external APIs.
- Graceful Degradation: Ensures the server remains responsive even if a tool or dependency fails.
- Dual Transport Protocols: Supports both STDIO for local process communication and HTTP+SSE for web-based clients.
Enterprise-Grade Security:
- OAuth 2.1 Protection: Secures all HTTP endpoints with modern, industry-standard authorization.
- Granular Scopes: Provides fine-grained control over access to tools and administrative functions.
Monitoring & Management:
- Administrative API: Exposes endpoints for monitoring cache statistics, managing the cache, and inspecting the event store.
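The two-layer, namespaced cache described above can be pictured with a minimal sketch. Every name here (`TwoLayerCache`, `PersistentStore`, `MapStore`) is hypothetical and not taken from the server's source, and the "disk" layer is stood in for by an in-memory map so the example stays self-contained; a real persistent layer would read and write files.

```typescript
// Hypothetical sketch of a two-layer cache; not the server's actual implementation.
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // epoch milliseconds
}

// Stand-in for the disk layer. A real one would serialize to files.
interface PersistentStore {
  get(key: string): string | undefined;
  set(key: string, serialized: string): void;
}

class MapStore implements PersistentStore {
  private data = new Map<string, string>();
  get(key: string): string | undefined {
    return this.data.get(key);
  }
  set(key: string, serialized: string): void {
    this.data.set(key, serialized);
  }
}

class TwoLayerCache {
  private memory = new Map<string, CacheEntry<unknown>>();
  private disk: PersistentStore;
  private now: () => number;

  constructor(disk: PersistentStore, now: () => number = Date.now) {
    this.disk = disk;
    this.now = now;
  }

  // Namespacing by tool keeps identical keys from different tools apart.
  private fullKey(namespace: string, key: string): string {
    return `${namespace}:${key}`;
  }

  set<T>(namespace: string, key: string, value: T, ttlMs: number): void {
    const entry: CacheEntry<T> = { value, expiresAt: this.now() + ttlMs };
    const k = this.fullKey(namespace, key);
    this.memory.set(k, entry);
    this.disk.set(k, JSON.stringify(entry));
  }

  get<T>(namespace: string, key: string): T | undefined {
    const k = this.fullKey(namespace, key);
    let entry = this.memory.get(k) as CacheEntry<T> | undefined;
    if (!entry) {
      // Memory miss: fall back to the persistent layer and promote the hit.
      const raw = this.disk.get(k);
      if (raw !== undefined) {
        entry = JSON.parse(raw) as CacheEntry<T>;
        this.memory.set(k, entry);
      }
    }
    if (!entry || entry.expiresAt <= this.now()) return undefined;
    return entry.value;
  }
}
```

Writing through to both layers on every `set` keeps the sketch simple; the feature list mentions timed and manual persistence, so the real server presumably batches disk writes instead.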

System Architecture

The server is built on a layered architecture designed for clarity, separation of concerns, and extensibility.

graph TD
    subgraph "Client"
        A[MCP Client]
    end

    subgraph "Transport Layer"
        B[STDIO]
        C[HTTP-SSE]
    end

    subgraph "Core Logic"
        D{MCP Request Router}
        E[Tool Executor]
    end

    subgraph "Tools"
        F[google_search]
        G[scrape_page]
        H[analyze_with_gemini]
        I[research_topic]
    end

    subgraph "Support Systems"
        J[Persistent Cache]
        K[Event Store]
        L[OAuth Middleware]
    end

    A -- Connects via --> B
    A -- Connects via --> C
    B -- Forwards to --> D
    C -- Forwards to --> D
    D -- Routes to --> E
    E -- Invokes --> F
    E -- Invokes --> G
    E -- Invokes --> H
    E -- Invokes --> I
    F & G & H & I -- Uses --> J
    D -- Uses --> K
    C -- Protected by --> L

    style J fill:#f9f,stroke:#333,stroke-width:2px
    style K fill:#ccf,stroke:#333,stroke-width:2px
    style L fill:#f99,stroke:#333,stroke-width:2px

For a more detailed explanation, see the Full Architecture Guide.

YouTube Transcript Extraction

The server includes a robust YouTube transcript extraction system that provides reliable access to video transcripts with comprehensive error handling and automatic recovery mechanisms.
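The `scrape_page` tool description states that both `youtube.com/watch?v=` and `youtu.be/` URLs are supported. As a rough illustration of what recognizing those two formats involves, here is a hypothetical helper (`extractYouTubeVideoId` is not a function from this project, and the server's own parsing may accept more variants):

```typescript
// Hypothetical helper; handles only the two URL formats named in this README.
function extractYouTubeVideoId(rawUrl: string): string | null {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return null; // not a parseable absolute URL
  }
  const host = url.hostname.replace(/^www\./, "");
  if (host === "youtu.be") {
    // Short links carry the ID as the first path segment.
    const id = url.pathname.split("/")[1];
    return id ? id : null;
  }
  if (host === "youtube.com" && url.pathname === "/watch") {
    // Regular watch links carry the ID in the `v` query parameter.
    return url.searchParams.get("v") || null;
  }
  return null;
}
```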

Key Features

Comprehensive Error Classification: Identifies 10 distinct error types with clear, actionable messages
Intelligent Retry Logic: Exponential backoff mechanism for transient failures (max 3 attempts)
Production Optimizations: End-to-end transcript extraction tests run 91% faster, with 80% less log output
User-Friendly Feedback: Clear error messages explaining why transcript extraction failed

Supported Error Types

| Error Code | Description | User Action |
| --- | --- | --- |
| `TRANSCRIPT_DISABLED` | Video owner disabled transcripts | Try a different video |
| `VIDEO_UNAVAILABLE` | Video no longer available | Verify the URL and video status |
| `VIDEO_NOT_FOUND` | Invalid video ID or URL | Check the YouTube URL format |
| `NETWORK_ERROR` | Network connectivity issues | System will retry automatically |
| `RATE_LIMITED` | YouTube API rate limiting | System will retry with backoff |
| `TIMEOUT` | Request timed out | System will retry automatically |
| `PARSING_ERROR` | Transcript data parsing failed | Contact support if persistent |
| `REGION_BLOCKED` | Video blocked in server region | Use proxy if needed |
| `PRIVATE_VIDEO` | Video requires authentication | Use public videos only |
| `UNKNOWN` | Unexpected error occurred | Contact support with details |

Retry Behavior

The system automatically retries failed requests for transient errors:

Maximum Attempts: Up to 3 attempts for NETWORK_ERROR, RATE_LIMITED, and TIMEOUT
Exponential Backoff: Progressive delays between retries to avoid overwhelming YouTube's API
Smart Recovery: Only retries errors that are likely to succeed on subsequent attempts
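The retry rules above can be sketched as follows. The three retryable codes and the limit of 3 attempts come from this README; the 500 ms base delay, the doubling factor, and all function names are assumptions made for illustration, since the actual delays are not documented here.

```typescript
// Retryable codes and attempt limit are from the README; delays are assumed.
const RETRYABLE_CODES = new Set(["NETWORK_ERROR", "RATE_LIMITED", "TIMEOUT"]);
const MAX_ATTEMPTS = 3;

function isRetryable(code: string): boolean {
  return RETRYABLE_CODES.has(code);
}

// Reads a `code` property off a thrown value, defaulting to "UNKNOWN".
function errorCode(err: unknown): string {
  if (typeof err === "object" && err !== null && "code" in err) {
    return String((err as { code: unknown }).code);
  }
  return "UNKNOWN";
}

// Delay after the n-th failed attempt: base * 2^(n-1).
function backoffDelayMs(failedAttempts: number, baseMs = 500): number {
  return baseMs * 2 ** (failedAttempts - 1);
}

async function withRetry<T>(
  operation: () => Promise<T>,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      // Give up on permanent errors or once the attempt budget is spent.
      if (!isRetryable(errorCode(err)) || attempt >= MAX_ATTEMPTS) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

Passing `sleep` as a parameter is a design choice for the sketch: it lets a test substitute an instant sleep and inspect the delays that were requested.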

Example Error Messages

When transcript extraction fails, users receive clear, specific error messages:

Failed to retrieve YouTube transcript for https://www.youtube.com/watch?v=xxxx.
Reason: TRANSCRIPT_DISABLED - The video owner has disabled transcripts.

Failed to retrieve YouTube transcript for https://www.youtube.com/watch?v=xxxx after 3 attempts.
Reason: NETWORK_ERROR - A network error occurred.
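A client that wants to branch on the error code can pull it out of the message text. The sketch below assumes the messages always follow the two-line shape of the examples above; `parseTranscriptFailure` and `TranscriptFailure` are hypothetical names, and a different message format would simply yield `null`.

```typescript
// Hypothetical parser for the message shape shown in the examples above.
interface TranscriptFailure {
  url: string;
  code: string;
  detail: string;
  attempts: number | null; // null when the message does not mention attempts
}

function parseTranscriptFailure(message: string): TranscriptFailure | null {
  // First line: URL, optionally followed by " after N attempts", then a period.
  const head =
    /Failed to retrieve YouTube transcript for (\S+?)(?: after (\d+) attempts)?\.(?:\s|$)/.exec(message);
  // Second line: "Reason: CODE - human-readable detail".
  const reason = /Reason: ([A-Z_]+) - (.+)/.exec(message);
  if (!head || !reason) return null;
  return {
    url: head[1],
    code: reason[1],
    detail: reason[2],
    attempts: head[2] !== undefined ? Number(head[2]) : null,
  };
}
```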

For complete technical details, see the YouTube Transcript Extraction Documentation.

Getting Started

Prerequisites

Node.js: Version 18.0.0 or higher.
API Keys:
OAuth 2.1 Provider (for HTTP transport): An external authorization server (e.g., Auth0, Okta) to issue JWTs.

Installation & Setup

Clone the Repository:

git clone https://github.com/zoharbabin/google-research-mcp.git
cd google-research-mcp

Install Dependencies:
```
npm install
```
Configure Environment Variables: Create a .env file by copying the example and filling in your credentials.
```
cp .env.example .env
```
Now, open .env in your editor and add your API keys and OAuth configuration. See the comments in .env.example for detailed explanations of each variable.

Running the Server

Development Mode: For development with automatic reloading on file changes, use:
```
npm run dev
```
This command uses tsx to watch for changes and restart the server.
Production Mode: First, build the TypeScript project into JavaScript, then start the server:
```
npm run build
npm start
```

Upon successful startup, you will see confirmation that the transports are ready:

✅ stdio transport ready
🌐 SSE server listening on http://127.0.0.1:3000/mcp

Usage

Available Tools

The server provides a suite of powerful tools for research and analysis. Each tool is designed with detailed descriptions and annotations to be easily understood and utilized by AI models.

| Tool | Title | Description | Parameters |
| --- | --- | --- | --- |
| `google_search` | Google Web Search | Searches the web using the Google Custom Search API to find relevant web pages and resources. Ideal for finding current information, discovering authoritative sources, and locating specific documents. Results are cached for 30 minutes. | `query` (string, required): The search query. Use specific, targeted keywords for best results.<br>`num_results` (number, optional, default: 5): The number of search results to return (1-10). |
| `scrape_page` | Web Page & YouTube Content Extractor | Extracts text content from web pages and YouTube videos with robust transcript extraction capabilities. Features comprehensive error handling with 10 distinct error types (TRANSCRIPT_DISABLED, VIDEO_UNAVAILABLE, NETWORK_ERROR, etc.), automatic retry logic with exponential backoff for transient failures, and user-friendly error messages. Supports both youtube.com/watch?v= and youtu.be/ URL formats. Results are cached for 1 hour. | `url` (string, required): The URL of the web page or YouTube video to scrape. YouTube URLs automatically extract transcripts when available. |
| `analyze_with_gemini` | Gemini AI Text Analysis | Processes and analyzes text content using Google's Gemini AI models. It can summarize, answer questions, and generate insights from provided text. Large texts are automatically truncated. Results are cached for 15 minutes. | `text` (string, required): The text content to analyze.<br>`model` (string, optional, default: "gemini-2.0-flash-001"): The Gemini model to use (e.g., `gemini-2.0-flash-001`, `gemini-pro`). |
| `research_topic` | Comprehensive Topic Research Workflow | A powerful composite tool that automates the entire research process: it searches for a topic, scrapes the content from multiple sources, and synthesizes the findings with Gemini AI. It's designed for resilience and provides comprehensive analysis. | `query` (string, required): The research topic or question.<br>`num_results` (number, optional, default: 3): The number of sources to research (recommended: 2-5). |
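The parameter defaults, ranges, and cache lifetimes from the tool table can be written down as types and small helpers on the client side. This is only a sketch: the helper names are hypothetical, and whether the server itself rejects or clamps out-of-range values is not stated here, so rejecting them is an assumption.

```typescript
// Cache lifetimes quoted in the tool table, in milliseconds.
const CACHE_TTL_MS = {
  google_search: 30 * 60 * 1000,
  scrape_page: 60 * 60 * 1000,
  analyze_with_gemini: 15 * 60 * 1000,
} as const;

interface SearchLikeArgs {
  query: string;
  num_results?: number;
}

// Shared check: non-empty query, integer num_results with a per-tool default.
function normalizeArgs(
  args: SearchLikeArgs,
  defaultResults: number,
  max: number,
): Required<SearchLikeArgs> {
  const query = args.query.trim();
  if (query === "") throw new Error("query is required");
  const n = args.num_results ?? defaultResults;
  if (!Number.isInteger(n) || n < 1 || n > max) {
    throw new RangeError(`num_results must be an integer from 1 to ${max}`);
  }
  return { query, num_results: n };
}

// google_search: default 5, documented range 1-10.
function normalizeGoogleSearchArgs(args: SearchLikeArgs): Required<SearchLikeArgs> {
  return normalizeArgs(args, 5, 10);
}

// research_topic: default 3. The table only recommends 2-5, so the
// upper bound of 10 used here is an assumption, not a documented limit.
function normalizeResearchTopicArgs(args: SearchLikeArgs): Required<SearchLikeArgs> {
  return normalizeArgs(args, 3, 10);
}
```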

Client Integration

STDIO Client (Local Process)

Ideal for local tools and CLI applications.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "node",
  args: ["dist/server.js"]
});
// The SDK expects both a name and a version for the client.
const client = new Client({ name: "test-client", version: "1.0.0" });
await client.connect(transport);

const result = await client.callTool({
  name: "google_search",
  arguments: { query: "Model Context Protocol" }
});
console.log(result.content[0].text);

// YouTube transcript extraction example
const youtubeResult = await client.callTool({
  name: "scrape_page",
  arguments: { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" }
});
console.log(youtubeResult.content[0].text);

HTTP+SSE Client (Web Application)

Suitable for web-based clients. Requires a valid OAuth 2.1 Bearer token.

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

// The client MUST obtain a valid OAuth 2.1 Bearer token from your
// configured external Authorization Server before making requests.
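const transport = new StreamableHTTPClientTransport(
  new URL("http://localhost:3000/mcp"),
  {
    // Headers in requestInit are sent with every request to the server.
    // Replace the placeholder with a real access token.
    requestInit: {
      headers: { Authorization: "Bearer YOUR_ACCESS_TOKEN" }
    }
  }
);
const client = new Client({ name: "test-client", version: "1.0.0" });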
await client.connect(transport);

const result = await client.callTool({
  name: "google_search",
  arguments: { query: "Model Context Protocol" }
});
console.log(result.content[0].text);

// YouTube transcript extraction with error handling
try {
  const youtubeResult = await client.callTool({
    name: "scrape_page",
    arguments: { url: "https://www.youtube.com/watch?v=dQw4w9WgXcQ" }
  });
  console.log("Transcript:", youtubeResult.content[0].text);
} catch (error) {
  // Not every failure carries `content` (e.g. a dropped connection),
  // so fall back to the plain error message.
  const detail = error?.content?.[0]?.text ?? error?.message ?? String(error);
  if (detail.includes("TRANSCRIPT_DISABLED")) {
    console.log("Video owner has disabled transcripts");
  } else if (detail.includes("VIDEO_NOT_FOUND")) {
    console.log("Video not found - check the URL");
  } else {
    console.log("Transcript extraction failed:", detail);
  }
}

Management API

The server provides several administrative endpoints for monitoring and control. Access to these endpoints is protected by OAuth scopes.

| Method | Endpoint | Description | Required Scope |
| --- | --- | --- | --- |
| `GET` | `/mcp/cache-stats` | View cache performance statistics. | `mcp:admin:cache:read` |
| `GET` | `/mcp/event-store-stats` | View event store usage statistics. | `mcp:admin:event-store:read` |
| `POST` | `/mcp/cache-invalidate` | Clear specific cache entries. | `mcp:admin:cache:invalidate` |
| `POST` | `/mcp/cache-persist` | Force the cache to be saved to disk. | `mcp:admin:cache:persist` |
| `GET` | `/mcp/oauth-scopes` | Get documentation for all OAuth scopes. | Public |
| `GET` | `/mcp/oauth-config` | View the server's OAuth configuration. | `mcp:admin:config:read` |
| `GET` | `/mcp/oauth-token-info` | View details of the provided token. | Requires authentication |
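As an illustration of calling these endpoints, the sketch below prepares the URL, method, and headers for a few of them. The paths and methods are copied from the table; the helper names and the base URL are hypothetical (the base URL is borrowed from the startup log shown earlier). `/mcp/cache-invalidate` is left out because its request body is not documented here.

```typescript
// Endpoint paths and methods are copied from the table above.
// Names such as ADMIN_ENDPOINTS and buildAdminRequest are illustrative only.
interface AdminEndpoint {
  method: "GET" | "POST";
  path: string;
  isPublic: boolean;
}

const ADMIN_ENDPOINTS: Record<string, AdminEndpoint> = {
  cacheStats: { method: "GET", path: "/mcp/cache-stats", isPublic: false },
  eventStoreStats: { method: "GET", path: "/mcp/event-store-stats", isPublic: false },
  cachePersist: { method: "POST", path: "/mcp/cache-persist", isPublic: false },
  oauthScopes: { method: "GET", path: "/mcp/oauth-scopes", isPublic: true },
};

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string> };
}

function buildAdminRequest(baseUrl: string, name: string, token?: string): PreparedRequest {
  const endpoint = ADMIN_ENDPOINTS[name];
  if (!endpoint) throw new Error(`Unknown endpoint: ${name}`);
  const headers: Record<string, string> = {};
  if (!endpoint.isPublic) {
    // Protected endpoints need an OAuth Bearer token carrying the right scope.
    if (!token) throw new Error(`${endpoint.path} requires a Bearer token`);
    headers["Authorization"] = `Bearer ${token}`;
  }
  return {
    url: new URL(endpoint.path, baseUrl).toString(),
    init: { method: endpoint.method, headers },
  };
}
```

The returned `url` and `init` can be passed straight to `fetch(url, init)`. The response format is not described in this README, so inspect it before relying on specific fields.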

Performance & Reliability

The server has been optimized for production use with significant performance improvements and reliability enhancements:

YouTube Transcript Extraction Performance

91% Performance Improvement: End-to-end tests for YouTube transcript extraction are now 91% faster
80% Log Reduction: Streamlined logging reduces noise while maintaining diagnostic capabilities
Production Controls: Environment-based configuration allows fine-tuning of retry behavior and timeouts

System Reliability

Intelligent Error Recovery: Automatic retry with exponential backoff for transient failures
Graceful Degradation: The system continues operating even when individual components encounter issues
Comprehensive Error Classification: 10 distinct error types provide precise feedback for troubleshooting
Resource Optimization: Efficient memory and CPU usage patterns for high-volume operations

Monitoring & Diagnostics

Enhanced Logging: Detailed but efficient logging for production debugging
Performance Metrics: Built-in performance tracking for all major operations
Error Analytics: Structured error reporting for operational insights

These optimizations ensure the server can handle production workloads efficiently while providing reliable service even under adverse conditions.

Security

OAuth 2.1 Authorization

The server implements OAuth 2.1 authorization for all HTTP-based communication, ensuring that only authenticated and authorized clients can access its capabilities.

Protection: All endpoints under /mcp/ (except for public documentation endpoints) are protected.
Token Validation: The server validates JWTs (JSON Web Tokens) against the configured JWKS (JSON Web Key Set) URI from your authorization server.
Scope Enforcement: Each tool and administrative action is mapped to a specific OAuth scope, providing granular control over permissions.

For a complete guide on setting up OAuth, see the Security Configuration Guide.

Available Scopes

Tool Execution Scopes

mcp:tool:google_search:execute
mcp:tool:scrape_page:execute
mcp:tool:analyze_with_gemini:execute
mcp:tool:research_topic:execute

Administrative Scopes

mcp:admin:cache:read
mcp:admin:cache:invalidate
mcp:admin:cache:persist
mcp:admin:event-store:read
mcp:admin:config:read
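The tool scopes above follow the pattern `mcp:tool:<tool name>:execute`, which makes a client-side pre-check easy to sketch. The function names below are hypothetical, and the example assumes the token's scopes arrive as a single space-delimited string, which is common for a JWT `scope` claim but depends on your authorization server.

```typescript
// Builds the scope name for a tool, following the pattern in the list above.
function requiredScopeForTool(toolName: string): string {
  return `mcp:tool:${toolName}:execute`;
}

// Assumes a space-delimited scope string; some providers use an array instead.
function parseScopes(scopeClaim: string): Set<string> {
  return new Set(scopeClaim.split(/\s+/).filter((s) => s.length > 0));
}

// True when the granted scopes include the one needed to run the tool.
function canExecuteTool(scopeClaim: string, toolName: string): boolean {
  return parseScopes(scopeClaim).has(requiredScopeForTool(toolName));
}
```

A check like this only avoids a wasted round trip. The server remains the authority and enforces scopes on every request.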

Testing

The project maintains a high standard of quality through a combination of end-to-end and focused component tests.

| Script | Description |
| --- | --- |
| `npm test` | Runs all focused component tests (`*.spec.ts`) using Jest. |
| `npm run test:e2e` | Executes the full end-to-end test suite for both STDIO and SSE transports. |
| `npm run test:coverage` | Generates a detailed code coverage report. |

For more details on the testing philosophy and structure, see the Testing Guide.

Troubleshooting

For diagnostics, see the administrative endpoints listed under Management API above (for example, `GET /mcp/cache-stats` and `GET /mcp/oauth-token-info`).

Contributing

We welcome contributions of all kinds! This project is open-source under the MIT license and we believe in the power of community collaboration.

⭐ Star this repo if you find it useful.
🍴 Fork it to create your own version.
💡 Report issues if you find bugs or have suggestions for improvements.
🚀 Submit PRs for bug fixes, new features, or documentation enhancements.

To contribute code, please follow our Contribution Guidelines.

License

This project is licensed under the MIT License. See the LICENSE file for details.