# Build a Data Science Query Language in Python Using Lark
What if you could write something like this:
```
DATA [1, 2, 3, 4, 5]
SUM
MEAN
STD
```
…and have it behave like a mini data science engine?
In this tutorial, we’ll build a **Domain-Specific Language (DSL)** for data analysis using:
- Python
- Lark (parser library)
- NumPy
---
# What Are We Building?
We are creating a **custom query language** that:
- Accepts a dataset
- Runs statistical commands
- Prints results
---
# Step 1: Install Dependencies
```bash
pip install lark numpy
```

# Step 2: Define the Grammar

The grammar defines how our language looks.

```python
from lark import Lark, Transformer
import numpy as np

grammar = """
start: data command+

data: "DATA" list

command: "SUM"  -> sum
       | "MEAN" -> mean
       | "STD"  -> std
       | "MAX"  -> max
       | "MIN"  -> min

list: "[" NUMBER ("," NUMBER)* "]"

%import common.NUMBER
%import common.WS
%ignore WS
"""
```
## Explanation

`start: data command+`

- A program must start with `DATA`
- It is followed by one or more commands

`data: "DATA" list`

- Defines the dataset input

Example:

```
DATA [1, 2, 3]
```

**Commands**

- `SUM` → `sum`
- `MEAN` → `mean`
- `STD` → `std`
- `MAX` → `max`
- `MIN` → `min`

These aliases map text to function names: `-> sum` tells Lark to call the `sum()` method on our Transformer.

**List rule**

`list: "[" NUMBER ("," NUMBER)* "]"`

This accepts a bracketed, comma-separated list of numbers; `("," NUMBER)*` means "zero or more repetitions of a comma followed by a number".

**Ignore spaces**

`%ignore WS`

- Allows flexible formatting, since all whitespace is skipped by the lexer
# ⚙️ Step 3: Build the Interpreter

Now we convert parsed text into execution.

```python
class DLangInterpreter(Transformer):
    def data(self, items):
        self.data = np.array([float(x) for x in items[0]])
        return self.data
```

## Explanation

- `items[0]` is the list of numbers produced by the `list` rule
- We convert it to a NumPy array
- We store it in `self.data` so the commands can reuse it (note that this instance attribute shadows the `data()` method; renaming it, e.g. to `self.values`, would be cleaner)
# Step 4: Add Operations

These methods also go inside `DLangInterpreter`:

**SUM**

```python
def sum(self, _):
    print(np.sum(self.data))
```

**MEAN**

```python
def mean(self, _):
    print(np.mean(self.data))
```

**STD**

```python
def std(self, _):
    print(np.std(self.data))
```

**MAX**

```python
def max(self, _):
    print(np.max(self.data))
```

**MIN**

```python
def min(self, _):
    print(np.min(self.data))
```

## Explanation

- Each method name matches a grammar alias
- `_` is the unused parse-tree argument
- NumPy performs the computation
- Each result is printed immediately
# Step 5: Parse the List

```python
def list(self, items):
    return items
```

## Explanation

- Returns the list of `NUMBER` tokens
- The result is passed on to the `data()` method
# Step 6: Create the Parser

```python
parser = Lark(grammar, parser="lalr", transformer=DLangInterpreter())
```

## Explanation

- `parser="lalr"` selects the fast LALR(1) parsing algorithm
- Passing the `transformer` makes Lark call our methods during parsing, so the program executes as it is parsed

Now run a script file:

```python
with open("example.dl") as f:
    code = f.read()

parser.parse(code)
```
# Example `example.dl`

```
DATA [10, 20, 30, 40]
SUM
MEAN
MAX
```

## ✅ Output

```
100.0
25.0
40.0
```

(The results are NumPy floats, so each prints with a decimal point.)
# How It Works (Flow)

```
Text Input
    ↓
Parser (Lark)
    ↓
Grammar Rules Match
    ↓
Transformer Methods Trigger
    ↓
NumPy Executes
    ↓
Output Printed
```
# ✨ Why This Is Powerful

- You built a mini programming language
- Clean separation of:
  - Syntax (grammar)
  - Execution (Transformer)
- Easily extensible
# Next Features You Can Add

1. Filtering

   ```
   FILTER > 10
   ```

2. Sorting

   ```
   SORT ASC
   ```

3. CSV support

   ```
   DATA file.csv
   ```

4. Chaining

   ```
   DATA [1, 2, 3, 4]
   FILTER > 2
   MEAN
   ```
# Final Thought

This is how real systems like:

- SQL
- the Pandas query engine
- Spark

…start at a basic level.

You just built the foundation of a data query engine.
# If You Liked This

- Drop a like ❤️
- Follow for more AI + Systems content
- And try extending this DSL yourself!