User guideAI DetectionScanning repositories
AI Detection

Scanning repositories

Learn how to scan GitHub repositories to detect AI/ML usage across code, workflows, and infrastructure.

Overview

AI Detection scans GitHub repositories for AI and machine learning usage in your codebase. It finds "shadow AI" (AI usage that hasn't been formally documented or approved) and keeps an inventory of detected AI technologies.

The scanner checks source files, dependency manifests, CI/CD workflows, container definitions, and model files against 100+ AI/ML patterns (OpenAI, TensorFlow, PyTorch, LangChain, etc.). A 2-phase LLM vulnerability pipeline also checks for the 10 OWASP LLM Top 10 vulnerability types. All results are stored and can be reviewed from the scan results page.

Starting a scan

To scan a repository, enter the GitHub URL in the input field. You can use either the full URL format (https://github.com/owner/repo) or the short format (owner/repo). Click Scan to begin the analysis.

By default, only public repositories can be scanned. To scan private repositories, configure a GitHub Personal Access Token in AI Detection → Settings.
AI Detection scan page with repository URL input field
Enter a GitHub repository URL to start scanning

Scan progress

The scan runs through a few stages. A progress indicator shows which file is being analyzed, how many files have been processed, and how many findings have been found so far.

  • Cloning: Downloads the repository to a temporary cache
  • Scanning: Analyzes files for AI/ML patterns and security issues
  • Completed: Results are ready for review

You can cancel an in-progress scan at any time by clicking Cancel. The partial results are discarded and the scan is marked as cancelled in the history.

Scan progress indicator showing files being analyzed
Real-time progress shows current file, total processed, and findings discovered

Statistics dashboard

The scan page shows statistics about your AI Detection activity in card form:

  • Total scans: Number of scans performed, with a count of completed scans
  • Repositories: Unique repositories that have been scanned
  • Total findings: Total AI/ML detections across all scans
  • Libraries: AI/ML library imports and dependencies detected
  • API calls: Direct API calls to AI providers (OpenAI, Anthropic, etc.)
  • Security issues: Hardcoded secrets and model file vulnerabilities
Statistics are only displayed after you have completed at least one scan. The dashboard automatically updates as new scans are completed.

Understanding results

Scan results are organized into nine tabs covering different aspects of AI/ML usage in your codebase:

  • Libraries: Detected AI/ML frameworks and packages
  • Vulnerabilities: OWASP LLM Top 10 vulnerability findings detected via 2-phase LLM analysis
  • API calls: Direct integrations with AI provider APIs
  • Models: References to AI/ML model files and pre-trained models
  • RAG: Retrieval-Augmented Generation components and vector databases
  • Agents: AI agent frameworks and autonomous system components
  • Secrets: Hardcoded API keys and credentials
  • Security: Model file vulnerabilities and security issues
  • Compliance: EU AI Act compliance mapping and checklist

Libraries tab

The Libraries tab lists all detected AI/ML technologies. Each row shows the library name, provider, risk level, confidence, and file count. Click a row to see specific file paths and line numbers. The scanner checks source imports, dependency manifests, Dockerfiles, and docker-compose files.

Risk levels indicate the potential data exposure:

  • High risk: Data is sent to external cloud APIs. Potential for data leakage and compliance issues.
  • Medium risk: Can connect to cloud APIs depending on how it's configured. Check your usage.
  • Low risk: Runs locally. Data stays on your infrastructure.

Confidence levels indicate detection certainty:

  • High: Direct match, like an explicit import or dependency declaration
  • Medium: Likely match but with some ambiguity (generic utility imports, etc.)
  • Low: Possible match; needs manual verification

Governance status

You can assign a governance status to any library finding to track your review. Click the status icon on a row to change it:

  • Reviewed: Finding has been examined but no decision made yet
  • Approved: Usage is authorized and compliant with organization policies
  • Flagged: Requires attention or is not approved for use
Libraries tab showing detected AI/ML frameworks with risk and confidence levels
Detected AI/ML libraries with provider, risk level, confidence, and governance status

Vulnerabilities tab

The Vulnerabilities tab shows findings from the 2-phase LLM vulnerability pipeline. Phase 1 runs a regex pre-filter against known patterns. Phase 2 sends candidates to an LLM for analysis using type-specific rubric prompts.

It covers all 10 OWASP LLM Top 10 vulnerability types:

  • LLM01: Prompt injection: Untrusted input concatenated into prompts without sanitization
  • LLM02: Insecure output handling: LLM output passed to dangerous sinks (eval, SQL, shell commands)
  • LLM03: Training data poisoning: Insecure deserialization or untrusted model sources
  • LLM04: Model denial of service: Missing token limits, timeouts, or rate limiting on LLM calls
  • LLM05: Supply chain: Unpinned dependency versions, untrusted model URLs, missing checksum validation
  • LLM06: Sensitive info disclosure: PII, session tokens, or credentials passed to LLM context
  • LLM07: Insecure plugin design: Tools or plugins that accept raw input without validation or schemas
  • LLM08: Excessive agency: Agents with broad tool access, no human-in-the-loop, auto-approve patterns
  • LLM09: Overreliance: No human review, no confidence thresholds, silent failures without fallbacks
  • LLM10: Model theft: Model files in public directories or served without authentication

Each finding shows severity, confidence, a description, and a suggested fix. Findings that share file paths with library, agent, or security findings get a cross-reference badge so you can see related detections together.

LLM vulnerability detection needs an LLM key configured and vulnerability scanning enabled in AI Detection → Settings. Without it, only the regex pre-filter runs, and findings may be less accurate.

API Calls tab

The API Calls tab shows direct calls to AI provider APIs found in your code, like OpenAI, Anthropic, and Google AI.

API call findings include:

  • REST API endpoints: Direct HTTP calls to AI provider APIs (e.g., api.openai.com)
  • SDK method calls: Usage of official SDKs (e.g., openai.chat.completions.create() or client.chat.completions.create())
  • Framework integrations: LangChain, LlamaIndex, and other framework API calls
  • CI/CD pipeline usage: AI service secrets referenced in GitHub Actions workflows (e.g., ${{ secrets.OPENAI_API_KEY }})
All API call findings are marked as high confidence since they indicate direct integration with AI services.

Secrets tab

The Secrets tab finds hardcoded API keys and credentials in your code. Move these to environment variables or a secrets manager.

The scanner detects common AI provider API key patterns:

  • OpenAI API keys: Keys starting with sk-...
  • Anthropic API keys: Keys starting with sk-ant-...
  • Google AI API keys: Keys starting with AIza...
  • Other provider keys: AWS, Azure, Cohere, and other AI service credentials
Security risk
Hardcoded secrets get exposed if the repository goes public or is accessed by someone it shouldn't be. Rotate any exposed credentials right away.

Security tab

The Security tab shows findings from model file analysis. Serialized model files (.pkl, .pt, .h5) can contain malicious code that runs when loaded. The scanner looks for system command execution, network access, and code injection patterns.

Security findings include severity levels and compliance references:

  • Critical: Direct code execution risk. Investigate immediately.
  • High: Indirect execution or data exfiltration risk
  • Medium: Potentially dangerous depending on context
  • Low: Informational, minimal risk
Security risk
Don't load model files flagged as Critical until you've verified them. Malicious models can run arbitrary code on your system when loaded with standard ML frameworks.
Security tab showing model file vulnerabilities with severity levels
Security findings with severity, CWE references, and OWASP ML Top 10 mappings

Models tab

The Models tab lists AI/ML model file references found in your code: pre-trained models, checkpoints, and model loading patterns.

  • Pre-trained models: References to Hugging Face models, OpenAI models, and other hosted models
  • Local model files: Model weights stored in the repository (.pt, .h5, .onnx, etc.)
  • Model loading code: Code that loads or initializes ML models

RAG tab

The RAG (Retrieval-Augmented Generation) tab lists components used in RAG systems, which combine retrieval with generative AI.

  • Vector databases: Integrations with Pinecone, Qdrant, Chroma, Weaviate, and other vector stores
  • Embedding models: Code that generates embeddings for documents or queries
  • Retrieval pipelines: LangChain retrievers, LlamaIndex query engines, and similar patterns

Agents tab

The Agents tab shows AI agent frameworks and autonomous system components found in the repo.

  • Agent frameworks: LangChain agents, CrewAI (including @agent and @crew decorators), AutoGen, Swarm, and similar frameworks
  • MCP servers: Model Context Protocol server implementations and configuration files (mcp.json, claude_desktop_config.json)
  • Tool usage: Code that defines or uses tools for AI agents
  • Planning components: Task planning and execution orchestration code
AI agents often have access to external systems and data. Review agent implementations for security and compliance issues.

Compliance tab

The Compliance tab maps scan findings to EU AI Act requirements and generates a checklist based on the AI technologies found in your code.

The compliance mapping covers key requirement categories:

  • Transparency: Requirements for disclosing AI system usage and capabilities
  • Data governance: Requirements for data quality, bias prevention, and privacy
  • Documentation: Technical documentation and record-keeping obligations
  • Human oversight: Requirements for human supervision of AI systems
  • Security: Cybersecurity and resilience requirements

Infrastructure and CI/CD detection

The scanner also checks infrastructure and CI/CD config files, catching AI usage that lives in deployment pipelines and containers rather than application code.

GitHub Actions workflows

YAML workflow files (.yml, .yaml) are checked for AI service references: GitHub Actions that call AI providers, and secrets like OPENAI_API_KEY or ANTHROPIC_API_KEY in workflow environment variables.

Docker and container images

Dockerfiles and docker-compose files are scanned for AI/ML container images. Detected images include:

  • GPU compute: NVIDIA CUDA and NGC container images
  • ML frameworks: PyTorch, TensorFlow, and Hugging Face container images
  • Inference servers: Ollama, vLLM, and NVIDIA Triton Inference Server
  • ML operations: MLflow tracking and serving containers

MCP server configuration

The scanner picks up Model Context Protocol (MCP) config files like mcp.json and claude_desktop_config.json. These define MCP servers that give AI assistants access to external tools and data, so they're flagged as agent-type findings.

Export and visualization

After a scan completes, you can:

  • Risk scoring: Calculate an AI Governance Risk Score (AGRS) across 5 risk dimensions. You can also enable LLM-enhanced analysis for written summaries, recommendations, and suggested risks to add to your risk register.
  • View graph: Open an interactive dependency graph. Nodes are findings, edges are inferred dependencies based on shared files and providers.
  • Export AI-BOM: Download scan results as an AI Bill of Materials (AI-BOM) in JSON. The format is CycloneDX-inspired and includes all detected components, providers, risk levels, and file locations.
AI-BOM for compliance
The AI-BOM export gives you a structured inventory of AI components you can use for regulatory submissions, vendor assessments, or internal documentation.

Compliance references

Security findings include standard reference IDs for compliance reporting:

  • CWE: Common Weakness Enumeration, the standard catalog for software security weaknesses (e.g., CWE-502 for deserialization)
  • OWASP ML Top 10: OWASP Machine Learning Security Top 10, covering the top ML security risks (e.g., ML06 for AI Supply Chain Attacks)

Scanning private repositories

To scan private repositories, you must configure a GitHub Personal Access Token (PAT) with the repo scope. Navigate to AI Detection → Settings to add your token. The token is encrypted at rest and used only for git clone operations.

For instructions on creating a GitHub PAT, see the official GitHub documentation at https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token.

AI Detection settings page with GitHub token configuration
Configure GitHub Personal Access Token for private repository scanning
NextRisk scoring
Scanning repositories - AI Detection - VerifyWise User Guide