I Built a Tool That Turns Any GitHub Repo Into an Interactive Dependency Graph: Here's How!

Originally published at lucybatten.substack.com

A deep dive into the real pipeline behind CodeAtlas: AST parsing, import resolution, force-directed graphs, and everything in between.

Code is a graph... but every tool we have forces us to read it like a book.

I built CodeAtlas to fix this. It takes any GitHub repository URL, clones it, parses every file using two separate AST parsers, resolves every import to an actual file, builds a dependency graph, and renders it as an interactive force-directed visualisation. You can see the entire structure of a codebase in seconds, click any node to read the file in a Monaco editor, filter by depth, and understand architecture that would otherwise take hours to infer.

GitHub: CodeAtlas

The Architecture

CodeAtlas has three distinct layers: the backend (clone, index, and build the graph), the parsers (Babel for JS/TS, Python's ast module for Python), and the frontend (React, BFS filtering, and D3 rendering).

Let me go through each layer in detail.


The Backend: Cloning, Indexing, and Graph Construction

Cloning

The entry point is a POST /analyze endpoint in server.js. The first thing it does is clone the repository:

async function cloneRepo(repoUrl) {
  await fs.remove(TEMP_DIR);
  await fs.mkdir(TEMP_DIR);
  await simpleGit().clone(repoUrl, TEMP_DIR, ["--depth", "1"]);
}

The --depth 1 flag is critical. A shallow clone fetches only the latest commit, not the full history. For large repositories this is the difference between a 2-second clone and a 45-second clone. CodeAtlas never needs git history: it only needs the current state of the files, so shallow cloning is always correct here.

fs-extra's remove call before mkdir ensures the temp directory is clean before each clone. Without this, a previous failed run could leave stale files that contaminate the new analysis.


File Tree Indexing

After cloning, buildIndex walks the file tree and builds a map of every relevant file and its raw imports:

function buildIndex(dir) {
  const index = {};

  function walk(folder) {
    const items = fs.readdirSync(folder, { withFileTypes: true });

    for (const item of items) {
      if (IGNORE.has(item.name)) continue;

      const fullPath = path.join(folder, item.name);

      if (item.isDirectory()) {
        walk(fullPath);
        continue;
      }

      if (!item.name.match(/\.(js|ts|tsx|py)$/)) continue;

      try {
        const content = fs.readFileSync(fullPath, "utf8");

        const imports = [
          ...content.matchAll(/from\s+['"](.*?)['"]/g),
          ...content.matchAll(/require\(['"](.*?)['"]\)/g),
        ].map(m => m[1]);

        const rel = path.relative(dir, fullPath);
        index[rel] = { imports: imports.slice(0, 30) };
      } catch (e) {} // skip unreadable files
    }
  }

  walk(dir);
  return index;
}

The IGNORE set is doing important work here:

const IGNORE = new Set([
  "node_modules", "dist", "build",
  ".git", "coverage", ".next", ".cache"
]);

node_modules alone can contain tens of thousands of files. Including it would make the graph useless… you’d be visualising the entire npm ecosystem rather than the project’s own code. dist and build are generated code that duplicates the source. .next contains Next.js build artefacts. None of these contain information about the project’s architecture.

The withFileTypes: true option on readdirSync is a performance detail worth noting. It returns Dirent objects which already know whether each entry is a file or directory, avoiding a separate stat call per entry. On repos with thousands of files this is meaningfully faster.


Import Resolution

Raw import strings like ./utils need to be resolved to actual files. The resolveImport function handles this:

function resolveImport(file, imp, allFiles) {
  if (!imp.startsWith(".")) return null;

  const base = path.dirname(file);

  const possiblePaths = [
    path.normalize(path.join(base, imp)),
    path.normalize(path.join(base, imp + ".js")),
    path.normalize(path.join(base, imp + ".ts")),
    path.normalize(path.join(base, imp + ".tsx")),
    path.normalize(path.join(base, imp, "index.js")),
    path.normalize(path.join(base, imp, "index.ts")),
  ];

  for (const p of possiblePaths) {
    if (allFiles.has(p)) return p;
  }

  return null;
}

The first thing it does is discard any import that doesn't start with a dot. This filters out all third-party packages (react, lodash, express) which live in node_modules and aren't part of the project's own dependency graph. Only relative imports (starting with ./ or ../) represent relationships between the project's own files.

The resolution order tries the import path as-is first, then appends common extensions, then checks for index files inside a directory of that name. This mirrors how Node.js’s own module resolution works, so it produces the same result as the runtime would.

allFiles is a Set: each resolution check is O(1). Multiply that across potentially thousands of imports in a large repo and the total resolution step stays fast.

Graph Construction

Once the index is built, indexToGraph assembles the final data structure:

function indexToGraph(index) {
  const fileList = Object.keys(index);
  const fileSet = new Set(fileList);

  const nodes = fileList.map(id => ({ id }));
  const links = [];

  for (const file of fileList) {
    for (const imp of index[file].imports) {
      const resolved = resolveImport(file, imp, fileSet);
      if (resolved) {
        links.push({ source: file, target: resolved });
      }
    }
  }

  return { nodes, links, backLinks: {} };
}

The graph format (nodes, links, backLinks) is designed specifically for D3's force simulation on the frontend. nodes is an array of objects with an id. links is an array of { source, target } pairs using those same ids. backLinks is the reverse dependency index: for any given file, which files import it. (indexToGraph returns it empty; the Babel parser below populates it.)
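Deriving the reverse index from the links array is a single pass. A minimal sketch, using a hypothetical two-importer graph:

```javascript
// Hypothetical graph: two files both import utils.js.
const links = [
  { source: "src/app.js", target: "src/utils.js" },
  { source: "src/api.js", target: "src/utils.js" },
];

// For each forward edge source -> target, record source as an
// importer of target.
const backLinks = {};
for (const { source, target } of links) {
  if (!backLinks[target]) backLinks[target] = [];
  backLinks[target].push(source);
}

console.log(backLinks["src/utils.js"]); // [ 'src/app.js', 'src/api.js' ]
```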


The JavaScript/TypeScript Parser: Babel AST

parser.js is the more powerful of the two parsers. Instead of regex, it uses Babel to parse source files into full Abstract Syntax Trees and then traverses those trees to extract imports.

An AST is a tree representation of source code where every construct (a function declaration, an import statement, a variable assignment) becomes a typed node. Parsing text into an AST is the first step every compiler and linter performs. Using ASTs means the parser understands code structure rather than matching text patterns.

Parsing Files

export function parseFile(filePath) {
  try {
    const code = fs.readFileSync(filePath, "utf-8");

    const ast = parser.parse(code, {
      sourceType: "unambiguous",
      plugins: [
        "typescript",
        "jsx",
        "dynamicImport",
        "classProperties",
      ],
      errorRecovery: true,
    });

    const imports = [];

    traverse.default(ast, {
      ImportDeclaration({ node }) {
        imports.push(node.source.value);
      },
      CallExpression({ node }) {
        if (
          node.callee.name === "require" &&
          node.arguments.length === 1 &&
          node.arguments[0].type === "StringLiteral"
        ) {
          imports.push(node.arguments[0].value);
        }
      },
    });

    return imports;
  } catch (err) {
    return [];
  }
}

Several configuration decisions here are worth explaining.

sourceType: "unambiguous" tells Babel to figure out whether the file is a CommonJS module or an ES module by looking at whether it contains any import/export statements, rather than requiring you to specify upfront. Real codebases are messy and mix both styles.

errorRecovery: true is essential in practice. Real codebases contain files that don’t parse cleanly: files with experimental syntax, partially written code, or syntax errors that have been introduced but not yet caught. Without error recovery, one bad file would crash the entire parsing pipeline for the whole repo. With it, Babel does its best and returns whatever AST it can construct from the valid portions.

The CallExpression handler catches require() calls. These show up differently in the AST than import statements (they're function calls rather than declarations), so they need their own handler. The check that node.callee.name === "require" and that the single argument is a StringLiteral ensures we only capture simple require('./path') patterns and not dynamic requires like require(getModuleName()).
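The guard can be exercised on its own with hand-built mock nodes (real nodes come from @babel/parser; the mocks here only carry the fields the guard inspects):

```javascript
// The same guard as in the CallExpression handler, extracted as a predicate.
function isSimpleRequire(node) {
  return (
    node.callee.name === "require" &&
    node.arguments.length === 1 &&
    node.arguments[0].type === "StringLiteral"
  );
}

// require('./utils') -- a literal path, so it is captured.
const staticRequire = {
  callee: { name: "require" },
  arguments: [{ type: "StringLiteral", value: "./utils" }],
};

// require(getModuleName()) -- the argument is a CallExpression, so it is skipped.
const dynamicRequire = {
  callee: { name: "require" },
  arguments: [{ type: "CallExpression" }],
};

console.log(isSimpleRequire(staticRequire));  // true
console.log(isSimpleRequire(dynamicRequire)); // false
```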

Walking the Folder

parseFolderJS handles the file tree walk and graph construction for the JS/TS parser:

export function parseFolderJS(folderPath) {
  const graph = { nodes: [], links: [], backLinks: {} };
  const filesMap = {};

  function walk(dir) {
    const entries = fs.readdirSync(dir, { withFileTypes: true });
    for (let entry of entries) {
      if (IGNORE.has(entry.name)) continue;
      const fullPath = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(fullPath);
      } else if (entry.isFile() && fullPath.match(/\.(js|ts|jsx|tsx)$/)) {
        filesMap[fullPath] = fullPath;
      }
    }
  }

  walk(folderPath);

  for (let fullPath in filesMap) {
    const fileId = toRelative(fullPath);
    graph.nodes.push({ id: fileId });

    const imports = parseFile(fullPath);

    for (let imp of imports) {
      if (!imp.startsWith(".")) continue;

      let resolved = path.resolve(path.dirname(fullPath), imp);

      const possible = [
        resolved,
        resolved + ".ts",
        resolved + ".tsx",
        resolved + ".js",
        resolved + ".jsx",
        resolved + "/index.ts",
        resolved + "/index.tsx",
        resolved + "/index.js",
        resolved + "/index.jsx",
      ];

      const found = possible.find((p) => filesMap[p]);

      if (found) {
        const targetId = toRelative(found);
        graph.links.push({ source: fileId, target: targetId });

        if (!graph.backLinks[targetId]) graph.backLinks[targetId] = [];
        graph.backLinks[targetId].push(fileId);
      }
    }
  }

  // Deduplicate links
  graph.links = Array.from(
    new Set(graph.links.map((l) => `${l.source}->${l.target}`))
  ).map((str) => {
    const [source, target] = str.split("->");
    return { source, target };
  });

  return graph;
}

The filesMap object serves a dual purpose: it stores all file paths so they can be looked up during resolution, and its keys are exactly the absolute paths we need to check against during the possible.find() call.

Building backLinks inline during the main loop is efficient: each time a link is added forward (source → target), the reverse index is updated simultaneously. By the end of the loop, backLinks[file] contains every file that imports file, with no second pass needed.

The deduplication at the end handles a real edge case: a file might import from the same module in multiple ways, a static import at the top and a dynamic import inside a function, or two different named imports from the same module in separate import statements. Both would produce the same source → target edge. The Set-based deduplication collapses these into a single edge before the data reaches the frontend.
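The round trip through strings can be sketched in isolation:

```javascript
// Encode each edge as a "source->target" string, collapse repeats in a
// Set (which preserves insertion order), then decode back to objects.
const links = [
  { source: "a.js", target: "b.js" },
  { source: "a.js", target: "b.js" }, // duplicate: e.g. static + dynamic import
  { source: "b.js", target: "c.js" },
];

const deduped = Array.from(
  new Set(links.map((l) => `${l.source}->${l.target}`))
).map((str) => {
  const [source, target] = str.split("->");
  return { source, target };
});

console.log(deduped.length); // 2
```

One caveat of the string encoding: it would break if a file path ever contained the literal sequence "->", which is vanishingly rare in practice.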


The Python Parser

parser_py.py handles Python repositories using Python’s own ast module:

def parse_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            tree = ast.parse(f.read(), filename=file_path)
        except Exception:
            return []

    imports = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                imports.append(name.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.append(node.module)

    return imports

Python has two import syntaxes that map to different AST node types. import os produces an ast.Import node where node.names is a list of alias objects (there can be multiple: import os, sys). from pathlib import Path produces an ast.ImportFrom node where node.module is the module name being imported from.

The folder parser maps module names to file paths:

for imp in imports:
    target = imp.replace(".", "/") + ".py"

Python uses dots as namespace separators: from utils.helpers import something maps to utils/helpers.py. Replacing dots with slashes converts the module path back to a filesystem path. This is a heuristic: it works correctly for relative project imports but doesn’t distinguish between standard library imports (os, sys) and project files. Standard library modules simply won’t exist as files in the repo, so they produce dangling target nodes in the graph rather than edges to real files. This is an area for future improvement.
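For illustration, the heuristic re-expressed in JavaScript (note the /g flag: Python's str.replace swaps every dot, so the JS version must too):

```javascript
// Map a Python module path to a filesystem path: dots become
// separators, then ".py" is appended.
const moduleToPath = (imp) => imp.replace(/\./g, "/") + ".py";

console.log(moduleToPath("utils.helpers")); // utils/helpers.py
console.log(moduleToPath("os"));            // os.py -- no such file in the repo,
                                            // so it becomes a dangling node
```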


The Frontend: React, BFS, and D3

State Architecture

App.tsx is the application root and manages all state:

const [graphData, setGraphData] = useState<GraphData | null>(null);
const [selectedFile, setSelectedFile] = useState<string | null>(null);
const [fileContent, setFileContent] = useState<string | null>(null);
const [focusMode, setFocusMode] = useState(true);
const [depth, setDepth] = useState(2);
const [loading, setLoading] = useState(false);

graphData holds the raw graph from the API, the complete set of nodes and links for the entire repository. displayData is a derived version, computed by useMemo, that represents what’s actually shown in the graph at any given moment based on focus mode and depth settings. Separating raw data from display data means toggling focus mode or changing depth never triggers a new API call… it just recomputes the view over the existing data.

The API URL switches automatically based on the Vite environment flag:

const API = import.meta.env.DEV
  ? "http://localhost:8080"
  : "https://codeatlas-production-e4f8.up.railway.app";

API Response Normalisation

The handleAnalyze function contains some defensive normalisation worth explaining:

const graph = raw.graph ?? raw;

const formattedGraph: GraphData = {
  nodes: graph.nodes.map((n: any) =>
    typeof n === "string" ? { id: n } : n
  ),
  links: graph.links.map((l: any) => ({
    source: typeof l.source === "string" ? l.source : l.source?.id,
    target: typeof l.target === "string" ? l.target : l.target?.id,
  })),
  backLinks: graph.backLinks || {},
};

The raw.graph ?? raw fallback handles two different response shapes from the backend, one where the graph is nested under a graph key, one where it’s the root object. This kind of defensive normalisation is common when a frontend is evolving alongside its backend.

The source/target normalisation in the links map addresses a D3 behaviour: D3’s force simulation mutates link objects during the simulation, replacing string ids with references to the actual node objects. So after the simulation runs, link.source is no longer the string "src/App.tsx" but the node object { id: "src/App.tsx", x: 123, y: 456 }. The frontend normalises both forms everywhere it needs to compare or display link endpoints.
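That repeated ternary could live in a small helper; a sketch (the name endpointId is mine, not from the codebase):

```javascript
// Accept either a pre-simulation string id or a post-simulation node
// object, and always return the id.
const endpointId = (end) => (typeof end === "string" ? end : end?.id);

console.log(endpointId("src/App.tsx"));                         // src/App.tsx
console.log(endpointId({ id: "src/App.tsx", x: 123, y: 456 })); // src/App.tsx
console.log(endpointId(undefined));                             // undefined
```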

Focus Mode: BFS Traversal

The most technically interesting part of the frontend is the focus mode implementation using useMemo:

const displayData = useMemo(() => {
  if (!graphData) return null;
  if (!focusMode || !selectedFile) return graphData;

  const visited = new Set<string>();
  const queue: { id: string; level: number }[] = [
    { id: selectedFile, level: 0 },
  ];

  while (queue.length) {
    const { id, level } = queue.shift()!;

    if (visited.has(id) || level > depth) continue;
    visited.add(id);

    for (const l of graphData.links || []) {
      const source = typeof l.source === "string" ? l.source : l.source?.id;
      const target = typeof l.target === "string" ? l.target : l.target?.id;

      if (!source || !target) continue;

      if (source === id && !visited.has(target)) {
        queue.push({ id: target, level: level + 1 });
      }
      if (target === id && !visited.has(source)) {
        queue.push({ id: source, level: level + 1 });
      }
    }
  }

  return {
    nodes: (graphData.nodes || []).filter((n) => visited.has(n.id)),
    links: (graphData.links || []).filter((l) => {
      const s = typeof l.source === "string" ? l.source : l.source?.id;
      const t = typeof l.target === "string" ? l.target : l.target?.id;
      return s && t && visited.has(s) && visited.has(t);
    }),
    backLinks: graphData.backLinks || {},
  };
}, [graphData, focusMode, selectedFile, depth]);

This is a bidirectional BFS: it traverses both forward edges (files that the selected file imports) and backward edges (files that import the selected file) up to depth hops away. The level counter on each queue entry tracks how many hops from the origin each node is, and nodes beyond depth are not enqueued.

The result is a subgraph centred on the selected file that shows its immediate neighbourhood in the dependency graph. Depth 1 shows only direct imports and importers. Depth 2 shows imports of imports. Depth 5 shows almost everything reachable.

Using useMemo with [graphData, focusMode, selectedFile, depth] as dependencies means the BFS only re-runs when one of those values changes. The computation is pure: same inputs, same output, so memoisation is safe and effective.
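The traversal itself has nothing React-specific about it. Here it is as a plain function over a toy graph (the function name and graph are mine, for illustration):

```javascript
// Bidirectional BFS: return the set of node ids within `depth` hops of
// `start`, following edges in both directions.
function neighbourhood(links, start, depth) {
  const visited = new Set();
  const queue = [{ id: start, level: 0 }];

  while (queue.length) {
    const { id, level } = queue.shift();
    if (visited.has(id) || level > depth) continue;
    visited.add(id);

    for (const { source, target } of links) {
      if (source === id && !visited.has(target))
        queue.push({ id: target, level: level + 1 });
      if (target === id && !visited.has(source))
        queue.push({ id: source, level: level + 1 });
    }
  }
  return visited;
}

// Toy chain a -> b -> c -> d, focused on b with depth 1:
const links = [
  { source: "a", target: "b" },
  { source: "b", target: "c" },
  { source: "c", target: "d" },
];

console.log([...neighbourhood(links, "b", 1)]); // [ 'b', 'a', 'c' ]
```

Note that d is excluded: it is two hops from b, beyond the depth limit, exactly as in the useMemo version.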

File Inspection

Clicking a node triggers both a local state update and an API call:

const handleNodeClick = async (id: string) => {
  setSelectedFile(id);

  const res = await fetch(`${API}/file?path=${encodeURIComponent(id)}`);
  const data = await res.json();
  setFileContent(data?.content || "");
};

The encodeURIComponent call is important: file paths can contain characters like +, #, or spaces that would corrupt a URL query parameter without encoding.
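A quick demonstration with a deliberately awkward (hypothetical) path:

```javascript
// '/' would be read as a path separator, '#' would start a fragment,
// and '+' can decode as a space on some servers -- all get escaped.
const id = "src/c++ helpers/graph#v2.ts";

console.log(encodeURIComponent(id));
// src%2Fc%2B%2B%20helpers%2Fgraph%23v2.ts
```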

The file content is passed to Monaco editor, which provides VS Code-quality syntax highlighting and navigation in the browser:

<Editor
  height="100%"
  language={getLanguage(selectedFile)}
  value={fileContent}
  theme="vs-dark"
  options={{
    readOnly: true,
    minimap: { enabled: false },
    fontSize: 13,
  }}
/>

Language detection is handled by file extension:

const getLanguage = (file: string | null) => {
  if (!file) return "plaintext";
  if (file.endsWith(".ts") || file.endsWith(".tsx")) return "typescript";
  if (file.endsWith(".js") || file.endsWith(".jsx")) return "javascript";
  if (file.endsWith(".py")) return "python";
  return "plaintext";
};

Monaco uses this to apply the correct grammar for syntax highlighting, bracket matching, and token colouring.

The Sidebar

The sidebar shows imports and dependents for the selected file:

{/* IMPORTS */}
{(graphData.links || [])
  .filter((l) => {
    const s = typeof l.source === "string" ? l.source : l.source?.id;
    return s === selectedFile;
  })
  .map((l, i) => {
    const t = typeof l.target === "string" ? l.target : l.target?.id;
    return <li key={i}>→ {t}</li>;
  })}

{/* DEPENDENTS */}
{(graphData.backLinks?.[selectedFile] || []).map((f, i) => (
  <li key={i}>← {f}</li>
))}

If you found this useful, starring the repo is the best thing you can do: it helps other developers find it.

GitHub: CodeAtlas

(link to live demo on GitHub page)
