I Built a Tool That Turns Any GitHub Repo Into an Interactive Dependency Graph: Here's How!

Originally published at lucybatten.substack.com

A deep dive into the real pipeline behind CodeAtlas: AST parsing, import resolution, force-directed graphs, and everything in between.

Code is a graph... but every tool we have forces us to read it like a book.

I built CodeAtlas to fix this. It takes any GitHub repository URL, clones it, parses every file using two separate AST parsers, resolves every import to an actual file, builds a dependency graph, and renders it as an interactive force-directed visualisation. You can see the entire structure of a codebase in seconds, click any node to read the file in a Monaco editor, filter by depth, and understand architecture that would otherwise take hours to infer.

GitHub: CodeAtlas

The Architecture

CodeAtlas has three distinct layers: the backend (clone, index, and build the graph), the parsers (Babel for JS/TS, Python's ast module for Python), and the frontend (React, BFS filtering, and D3 rendering).

Let me go through each layer in detail.


The Backend: Cloning, Indexing, and Graph Construction

Cloning

The entry point is a POST /analyze endpoint in server.js. The first thing it does is clone the repository:

async function cloneRepo(repoUrl) {
  await fs.remove(TEMP_DIR);
  await fs.mkdir(TEMP_DIR);
  await simpleGit().clone(repoUrl, TEMP_DIR, ["--depth", "1"]);
}

The --depth 1 flag is critical. A shallow clone fetches only the latest commit, not the full history. For large repositories this is the difference between a 2-second clone and a 45-second clone. CodeAtlas never needs git history: it only needs the current state of the files, so shallow cloning is always correct here.

fs-extra's remove call before mkdir ensures the temp directory is clean before each clone. Without this, a previous failed run could leave stale files that contaminate the new analysis.


File Tree Indexing

After cloning, buildIndex walks the file tree and builds a map of every relevant file and its raw imports:

function buildIndex(dir) {
  const index = {};

  function walk(folder) {
    const items = fs.readdirSync(folder, { withFileTypes: true });

    for (const item of items) {
      if (IGNORE.has(item.name)) continue;

      const fullPath = path.join(folder, item.name);

      if (item.isDirectory()) {
        walk(fullPath);
        continue;
      }

      if (!item.name.match(/\.(js|ts|tsx|py)$/)) continue;

      try {
        const content = fs.readFileSync(fullPath, "utf8");

        const imports = [
          ...content.matchAll(/from\s+['"](.*?)['"]/g),
          ...content.matchAll(/require\(['"](.*?)['"]\)/g),
        ].map(m => m[1]);

        const rel = path.relative(dir, fullPath);
        index[rel] = { imports: imports.slice(0, 30) };
      } catch (e) {} // skip unreadable files
    }
  }

  walk(dir);
  return index;
}

The IGNORE set is doing important work here:

const IGNORE = new Set([
  "node_modules", "dist", "build",
  ".git", "coverage", ".next", ".cache"
]);

node_modules alone can contain tens of thousands of files. Including it would make the graph useless… you’d be visualising the entire npm ecosystem rather than the project’s own code. dist and build are generated code that duplicates the source. .next contains Next.js build artefacts. None of these contain information about the project’s architecture.

The withFileTypes: true option on readdirSync is a performance detail worth noting. It returns Dirent objects which already know whether each entry is a file or directory, avoiding a separate stat call per entry. On repos with thousands of files this is meaningfully faster.


Import Resolution

Raw import strings like ./utils need to be resolved to actual files. The resolveImport function handles this:

function resolveImport(file, imp, allFiles) {
  if (!imp.startsWith(".")) return null;

  const base = path.dirname(file);

  const possiblePaths = [
    path.normalize(path.join(base, imp)),
    path.normalize(path.join(base, imp + ".js")),
    path.normalize(path.join(base, imp + ".ts")),
    path.normalize(path.join(base, imp + ".tsx")),
    path.normalize(path.join(base, imp, "index.js")),
    path.normalize(path.join(base, imp, "index.ts")),
  ];

  for (const p of possiblePaths) {
    if (allFiles.has(p)) return p;
  }

  return null;
}

The first thing it does is discard any import that doesn't start with a dot. This filters out all third-party packages (react, lodash, express) which live in node_modules and aren't part of the project's own dependency graph. Only relative imports (starting with ./ or ../) represent relationships between the project's own files.

The resolution order tries the import path as-is first, then appends common extensions, then checks for index files inside a directory of that name. This mirrors how Node.js’s own module resolution works, so it produces the same result as the runtime would.

allFiles is a Set: each resolution check is O(1). Multiply that across potentially thousands of imports in a large repo and the total resolution step stays fast.

Graph Construction

Once the index is built, indexToGraph assembles the final data structure:

function indexToGraph(index) {
  const fileList = Object.keys(index);
  const fileSet = new Set(fileList);

  const nodes = fileList.map(id => ({ id }));
  const links = [];

  for (const file of fileList) {
    for (const imp of index[file].imports) {
      const resolved = resolveImport(file, imp, fileSet);
      if (resolved) {
        links.push({ source: file, target: resolved });
      }
    }
  }

  return { nodes, links, backLinks: {} };
}

The graph format (nodes, links, backLinks) is designed specifically for D3's force simulation on the frontend. nodes is an array of objects with an id. links is an array of { source, target } pairs using those same ids. backLinks is the reverse dependency index: for any given file, which files import it. (indexToGraph returns it empty; the Babel parser below populates it.)
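Deriving the reverse index from the links array is a single pass. A minimal sketch, using a hypothetical two-importer graph:

```javascript
// Hypothetical graph: two files both import utils.js.
const links = [
  { source: "src/app.js", target: "src/utils.js" },
  { source: "src/api.js", target: "src/utils.js" },
];

// For each forward edge source -> target, record source as an
// importer of target.
const backLinks = {};
for (const { source, target } of links) {
  if (!backLinks[target]) backLinks[target] = [];
  backLinks[target].push(source);
}

console.log(backLinks["src/utils.js"]); // [ 'src/app.js', 'src/api.js' ]
```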


The JavaScript/TypeScript Parser: Babel AST

parser.js is the more powerful of the two parsers. Instead of regex, it uses Babel to parse source files into full Abstract Syntax Trees and then traverses those trees to extract imports.

An AST is a tree representation of source code where every construct (a function declaration, an import statement, a variable assignment) becomes a typed node. Parsing text into an AST is the first step every compiler and linter performs. Using ASTs means the parser understands code structure rather than matching text patterns.

Parsing Files

export function parseFile(filePath) {
  try {
    const code = fs.readFileSync(filePath, "utf-8");

    const ast = parser.parse(code, {
      sourceType: "unambiguous",
      plugins: [
        "typescript",
        "jsx",
        "dynamicImport",
        "classProperties",
      ],
      errorRecovery: true,
    });

    const imports = [];

    traverse.default(ast, {
      ImportDeclaration({ node }) {
        imports.push(node.source.value);
      },
      CallExpression({ node }) {
        if (
          node.callee.name === "require" &&
          node.arguments.length === 1 &&
          node.arguments[0].type === "StringLiteral"
        ) {
          imports.push(node.arguments[0].value);
        }
      },
    });

    return imports;
  } catch (err) {
    return [];
  }
}

Several configuration decisions here are worth explaining.

sourceType: "unambiguous" tells Babel to figure out whether the file is a CommonJS module or an ES module by looking at whether it contains any import/export statements, rather than requiring you to specify upfront. Real codebases are messy and mix both styles.

errorRecovery: true is essential in practice. Real codebases contain files that don’t parse cleanly: files with experimental syntax, partially written code, or syntax errors that have been introduced but not yet caught. Without error recovery, one bad file would crash the entire parsing pipeline for the whole repo. With it, Babel does its best and returns whatever AST it can construct from the valid portions.

The CallExpression handler catches require() calls. These show up differently in the AST than import statements (they're function calls rather than declarations), so they need their own handler. The check that node.callee.name === "require" and that the single argument is a StringLiteral ensures we only capture simple require('./path') patterns and not dynamic requires like require(getModuleName()).
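The guard can be exercised on its own with hand-built mock nodes (real nodes come from @babel/parser; the mocks here only carry the fields the guard inspects):

```javascript
// The same guard as in the CallExpression handler, extracted as a predicate.
function isSimpleRequire(node) {
  return (
    node.callee.name === "require" &&
    node.arguments.length === 1 &&
    node.arguments[0].type === "StringLiteral"
  );
}

// require('./utils') -- a literal path, so it is captured.
const staticRequire = {
  callee: { name: "require" },
  arguments: [{ type: "StringLiteral", value: "./utils" }],
};

// require(getModuleName()) -- the argument is a CallExpression, so it is skipped.
const dynamicRequire = {
  callee: { name: "require" },
  arguments: [{ type: "CallExpression" }],
};

console.log(isSimpleRequire(staticRequire));  // true
console.log(isSimpleRequire(dynamicRequire)); // false
```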

Walking the Folder

parseFolderJS handles the file tree walk and graph construction for the JS/TS parser:

export function parseFolderJS(folderPath) {
  const graph = { nodes: [], links: [], backLinks: {} };
  const filesMap = {};

  function walk(dir) {
    const entries = fs.readdirSync(dir, { withFileTypes: true });
    for (let entry of entries) {
      if (IGNORE.has(entry.name)) continue;
      const fullPath = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        walk(fullPath);
      } else if (entry.isFile() && fullPath.match(/\.(js|ts|jsx|tsx)$/)) {
        filesMap[fullPath] = fullPath;
      }
    }
  }

  walk(folderPath);

  for (let fullPath in filesMap) {
    const fileId = toRelative(fullPath);
    graph.nodes.push({ id: fileId });

    const imports = parseFile(fullPath);

    for (let imp of imports) {
      if (!imp.startsWith(".")) continue;

      let resolved = path.resolve(path.dirname(fullPath), imp);

      const possible = [
        resolved,
        resolved + ".ts",
        resolved + ".tsx",
        resolved + ".js",
        resolved + ".jsx",
        resolved + "/index.ts",
        resolved + "/index.tsx",
        resolved + "/index.js",
        resolved + "/index.jsx",
      ];

      const found = possible.find((p) => filesMap[p]);

      if (found) {
        const targetId = toRelative(found);
        graph.links.push({ source: fileId, target: targetId });

        if (!graph.backLinks[targetId]) graph.backLinks[targetId] = [];
        graph.backLinks[targetId].push(fileId);
      }
    }
  }

  // Deduplicate links
  graph.links = Array.from(
    new Set(graph.links.map((l) => `${l.source}->${l.target}`))
  ).map((str) => {
    const [source, target] = str.split("->");
    return { source, target };
  });

  return graph;
}

The filesMap object serves a dual purpose: it stores all file paths so they can be looked up during resolution, and its keys are exactly the absolute paths we need to check against during the possible.find() call.

Building backLinks inline during the main loop is efficient: each time a link is added forward (source → target), the reverse index is updated simultaneously. By the end of the loop, backLinks[file] contains every file that imports file, with no second pass needed.

The deduplication at the end handles a real edge case: a file might import from the same module in multiple ways, a static import at the top and a dynamic import inside a function, or two different named imports from the same module in separate import statements. Both would produce the same source → target edge. The Set-based deduplication collapses these into a single edge before the data reaches the frontend.
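The round trip through strings can be sketched in isolation:

```javascript
// Encode each edge as a "source->target" string, collapse repeats in a
// Set (which preserves insertion order), then decode back to objects.
const links = [
  { source: "a.js", target: "b.js" },
  { source: "a.js", target: "b.js" }, // duplicate: e.g. static + dynamic import
  { source: "b.js", target: "c.js" },
];

const deduped = Array.from(
  new Set(links.map((l) => `${l.source}->${l.target}`))
).map((str) => {
  const [source, target] = str.split("->");
  return { source, target };
});

console.log(deduped.length); // 2
```

One caveat of the string encoding: it would break if a file path ever contained the literal sequence "->", which is vanishingly rare in practice.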


The Python Parser

parser_py.py handles Python repositories using Python’s own ast module:

def parse_file(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        try:
            tree = ast.parse(f.read(), filename=file_path)
        except Exception:
            return []

    imports = []

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for name in node.names:
                imports.append(name.name)
        elif isinstance(node, ast.ImportFrom):
            if node.module:
                imports.append(node.module)

    return imports

Python has two import syntaxes that map to different AST node types. import os produces an ast.Import node where node.names is a list of alias objects (there can be multiple: import os, sys). from pathlib import Path produces an ast.ImportFrom node where node.module is the module name being imported from.

The folder parser maps module names to file paths:

for imp in imports:
    target = imp.replace(".", "/") + ".py"

Python uses dots as namespace separators: from utils.helpers import something maps to utils/helpers.py. Replacing dots with slashes converts the module path back to a filesystem path. This is a heuristic: it works correctly for relative project imports but doesn’t distinguish between standard library imports (os, sys) and project files. Standard library modules simply won’t exist as files in the repo, so they produce dangling target nodes in the graph rather than edges to real files. This is an area for future improvement.
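For illustration, the heuristic re-expressed in JavaScript (note the /g flag: Python's str.replace swaps every dot, so the JS version must too):

```javascript
// Map a Python module path to a filesystem path: dots become
// separators, then ".py" is appended.
const moduleToPath = (imp) => imp.replace(/\./g, "/") + ".py";

console.log(moduleToPath("utils.helpers")); // utils/helpers.py
console.log(moduleToPath("os"));            // os.py -- no such file in the repo,
                                            // so it becomes a dangling node
```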


The Frontend: React, BFS, and D3

State Architecture

App.tsx is the application root and manages all state:

const [graphData, setGraphData] = useState<GraphData | null>(null);
const [selectedFile, setSelectedFile] = useState<string | null>(null);
const [fileContent, setFileContent] = useState<string | null>(null);
const [focusMode, setFocusMode] = useState(true);
const [depth, setDepth] = useState(2);
const [loading, setLoading] = useState(false);

graphData holds the raw graph from the API, the complete set of nodes and links for the entire repository. displayData is a derived version, computed by useMemo, that represents what’s actually shown in the graph at any given moment based on focus mode and depth settings. Separating raw data from display data means toggling focus mode or changing depth never triggers a new API call… it just recomputes the view over the existing data.

The API URL switches automatically based on the Vite environment flag:

const API = import.meta.env.DEV
  ? "http://localhost:8080"
  : "https://codeatlas-production-e4f8.up.railway.app";

API Response Normalisation

The handleAnalyze function contains some defensive normalisation worth explaining:

const graph = raw.graph ?? raw;

const formattedGraph: GraphData = {
  nodes: graph.nodes.map((n: any) =>
    typeof n === "string" ? { id: n } : n
  ),
  links: graph.links.map((l: any) => ({
    source: typeof l.source === "string" ? l.source : l.source?.id,
    target: typeof l.target === "string" ? l.target : l.target?.id,
  })),
  backLinks: graph.backLinks || {},
};

The raw.graph ?? raw fallback handles two different response shapes from the backend, one where the graph is nested under a graph key, one where it’s the root object. This kind of defensive normalisation is common when a frontend is evolving alongside its backend.

The source/target normalisation in the links map addresses a D3 behaviour: D3’s force simulation mutates link objects during the simulation, replacing string ids with references to the actual node objects. So after the simulation runs, link.source is no longer the string "src/App.tsx" but the node object { id: "src/App.tsx", x: 123, y: 456 }. The frontend normalises both forms everywhere it needs to compare or display link endpoints.
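That repeated ternary could live in a small helper; a sketch (the name endpointId is mine, not from the codebase):

```javascript
// Accept either a pre-simulation string id or a post-simulation node
// object, and always return the id.
const endpointId = (end) => (typeof end === "string" ? end : end?.id);

console.log(endpointId("src/App.tsx"));                         // src/App.tsx
console.log(endpointId({ id: "src/App.tsx", x: 123, y: 456 })); // src/App.tsx
console.log(endpointId(undefined));                             // undefined
```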

Focus Mode: BFS Traversal

The most technically interesting part of the frontend is the focus mode implementation using useMemo:

const displayData = useMemo(() => {
  if (!graphData) return null;
  if (!focusMode || !selectedFile) return graphData;

  const visited = new Set<string>();
  const queue: { id: string; level: number }[] = [
    { id: selectedFile, level: 0 },
  ];

  while (queue.length) {
    const { id, level } = queue.shift()!;

    if (visited.has(id) || level > depth) continue;
    visited.add(id);

    for (const l of graphData.links || []) {
      const source = typeof l.source === "string" ? l.source : l.source?.id;
      const target = typeof l.target === "string" ? l.target : l.target?.id;

      if (!source || !target) continue;

      if (source === id && !visited.has(target)) {
        queue.push({ id: target, level: level + 1 });
      }
      if (target === id && !visited.has(source)) {
        queue.push({ id: source, level: level + 1 });
      }
    }
  }

  return {
    nodes: (graphData.nodes || []).filter((n) => visited.has(n.id)),
    links: (graphData.links || []).filter((l) => {
      const s = typeof l.source === "string" ? l.source : l.source?.id;
      const t = typeof l.target === "string" ? l.target : l.target?.id;
      return s && t && visited.has(s) && visited.has(t);
    }),
    backLinks: graphData.backLinks || {},
  };
}, [graphData, focusMode, selectedFile, depth]);

This is a bidirectional BFS: it traverses both forward edges (files that the selected file imports) and backward edges (files that import the selected file) up to depth hops away. The level counter on each queue entry tracks how many hops from the origin each node is, and nodes beyond depth are not enqueued.

The result is a subgraph centred on the selected file that shows its immediate neighbourhood in the dependency graph. Depth 1 shows only direct imports and importers. Depth 2 shows imports of imports. Depth 5 shows almost everything reachable.

Using useMemo with [graphData, focusMode, selectedFile, depth] as dependencies means the BFS only re-runs when one of those values changes. The computation is pure: same inputs, same output, so memoisation is safe and effective.
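The traversal itself has nothing React-specific about it. Here it is as a plain function over a toy graph (the function name and graph are mine, for illustration):

```javascript
// Bidirectional BFS: return the set of node ids within `depth` hops of
// `start`, following edges in both directions.
function neighbourhood(links, start, depth) {
  const visited = new Set();
  const queue = [{ id: start, level: 0 }];

  while (queue.length) {
    const { id, level } = queue.shift();
    if (visited.has(id) || level > depth) continue;
    visited.add(id);

    for (const { source, target } of links) {
      if (source === id && !visited.has(target))
        queue.push({ id: target, level: level + 1 });
      if (target === id && !visited.has(source))
        queue.push({ id: source, level: level + 1 });
    }
  }
  return visited;
}

// Toy chain a -> b -> c -> d, focused on b with depth 1:
const links = [
  { source: "a", target: "b" },
  { source: "b", target: "c" },
  { source: "c", target: "d" },
];

console.log([...neighbourhood(links, "b", 1)]); // [ 'b', 'a', 'c' ]
```

Note that d is excluded: it is two hops from b, beyond the depth limit, exactly as in the useMemo version.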

File Inspection

Clicking a node triggers both a local state update and an API call:

const handleNodeClick = async (id: string) => {
  setSelectedFile(id);

  const res = await fetch(`${API}/file?path=${encodeURIComponent(id)}`);
  const data = await res.json();
  setFileContent(data?.content || "");
};

The encodeURIComponent call is important: file paths can contain characters like +, #, or spaces that would corrupt a URL query parameter without encoding.
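A quick demonstration with a deliberately awkward (hypothetical) path:

```javascript
// '/' would be read as a path separator, '#' would start a fragment,
// and '+' can decode as a space on some servers -- all get escaped.
const id = "src/c++ helpers/graph#v2.ts";

console.log(encodeURIComponent(id));
// src%2Fc%2B%2B%20helpers%2Fgraph%23v2.ts
```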

The file content is passed to Monaco editor, which provides VS Code-quality syntax highlighting and navigation in the browser:

<Editor
  height="100%"
  language={getLanguage(selectedFile)}
  value={fileContent}
  theme="vs-dark"
  options={{
    readOnly: true,
    minimap: { enabled: false },
    fontSize: 13,
  }}
/>

Language detection is handled by file extension:

const getLanguage = (file: string | null) => {
  if (!file) return "plaintext";
  if (file.endsWith(".ts") || file.endsWith(".tsx")) return "typescript";
  if (file.endsWith(".js") || file.endsWith(".jsx")) return "javascript";
  if (file.endsWith(".py")) return "python";
  return "plaintext";
};

Monaco uses this to apply the correct grammar for syntax highlighting, bracket matching, and token colouring.

The Sidebar

The sidebar shows imports and dependents for the selected file:

{/* IMPORTS */}
{(graphData.links || [])
  .filter((l) => {
    const s = typeof l.source === "string" ? l.source : l.source?.id;
    return s === selectedFile;
  })
  .map((l, i) => {
    const t = typeof l.target === "string" ? l.target : l.target?.id;
    return <li key={i}>→ {t}</li>;
  })}

{/* DEPENDENTS */}
{(graphData.backLinks?.[selectedFile] || []).map((f, i) => (
  <li key={i}>← {f}</li>
))}

If you found this useful, starring the repo is the best thing you can do: it helps other developers find it.

GitHub: CodeAtlas

(link to live demo on GitHub page)
