Fabricio Olivetti de França and Gabriel Kronberger. 2023. Reducing Overparameterization of Symbolic Regression Models with Equality Saturation. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '23). Association for Computing Machinery, New York, NY, USA, 1064–1072. https://doi.org/10.1145/3583131.3590346
Symbolic Regression algorithms are prone to overparameterization:
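As a concrete illustration (with hypothetical parameter values), a model such as t0*(t1*x + t2) carries three numerical parameters but only two effective degrees of freedom, since it collapses to a*x + b with a = t0*t1 and b = t0*t2:

```python
# Overparameterized: t0*(t1*x + t2) has three parameters but only
# two effective ones, since it collapses to a*x + b with
# a = t0*t1 and b = t0*t2 (values here are arbitrary examples).
t0, t1, t2 = 2.0, 3.0, -1.0
a, b = t0 * t1, t0 * t2

for x in [0.0, 0.5, 1.0, 2.0]:
    assert abs(t0 * (t1 * x + t2) - (a * x + b)) < 1e-12

print(a, b)  # the collapsed, minimal parameterization
```

Equality saturation can discover this kind of collapse automatically by rewriting the expression into its smaller equivalent form.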




Sometimes the set of rules induces exponential growth of the e-graph...
Fabricio Olivetti de França and Gabriel Kronberger. 2025. Improving Genetic Programming for Symbolic Regression with Equality Graphs. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '25). Association for Computing Machinery, New York, NY, USA, 989–998. https://doi.org/10.1145/3712256.3726383
Neutral perturbations: navigability despite selective pressure.
Bloat introduces new patterns without affecting fitness.
New opportunities to find better solutions later.
This has other implications:
Ideally we would be able to:
Key differences to GP:

Critical Difference diagram of the average ranks of the tested methods calculated by the average MSE over the 30 runs of each dataset

CD diagram of the ranks calculated over average model size.
Fabricio Olivetti de França and Gabriel Kronberger. 2025. REGGression: an Interactive and Agnostic Tool for the Exploration of Symbolic Regression Models. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO '25). Association for Computing Machinery, New York, NY, USA, 4–12. https://doi.org/10.1145/3712256.3726385
More info at https://github.com/folivetti/reggression
In any case, the purpose here is to show how we can use reggression to explore alternative models.
We can create an initial e-graph for this dataset using eggp:
from eggp import EGGP
import pandas as pd
reg = EGGP(gen=200, nPop=200, maxSize=25,
           nonterminals="add,sub,mul,div,log,power,sin,cos,abs,sqrt",
           simplify=True, optRepeat=2, optIter=20, folds=2,
           dumpTo="vlad.egg")
reg.fit(x_sel, y_sel)
We are saving the final e-graph into the file named vlad.egg so we can explore it after the search.
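The snippet above assumes x_sel and y_sel are already defined. For a self-contained run, they could be synthesized from the ground-truth form revisited later in this post, using the fitted parameter values shown there (t0 = -3.15, t1 = -0.83); this data-generation step is a sketch, not part of eggp itself:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Ground-truth form used later in this post; parameter values taken
# from the fitted result shown below (t0 = -3.15, t1 = -0.83).
def ground_truth(x, t0=-3.15, t1=-0.83):
    return np.exp(x / t0) * x**3 * (np.cos(x) * np.sin(x)**2 - t1)

x_sel = rng.uniform(0.0, 5.0, size=(200, 1))
y_sel = ground_truth(x_sel[:, 0]) + rng.normal(0.0, 0.01, size=200)

# reggression reads the data back from CSV, so save it as well.
pd.DataFrame({"x0": x_sel[:, 0], "y": y_sel}).to_csv("vlad.csv", index=False)
```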
Now, let's load the e-graph into reggression:
from reggression import Reggression
egg = Reggression(dataset="vlad.csv", loadFrom="vlad.egg")
If we look at the top-5 models, we can see small variations of the top-performing model with similar fitness (negative MSE) values.
egg.top(5)[["Latex", "Fitness", "Size"]]
| Latex | Fitness | Size |
|---|---|---|
| | -0.00415306 | 16 |
| | -0.00425244 | 18 |
| | -0.00430326 | 18 |
| | -0.00430774 | 19 |
| | -0.0043503 | 14 |
Some of these functions behave similarly while others display a different behavior when looking outside of the training region:

We can retrieve the top expressions filtered by size and plot them (model_top is a plotting helper from the accompanying notebook):
model_top(egg.top(n=10, filters=["size <= 10"]), 10, x, y)
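Since egg.top returns a pandas DataFrame, the same size filter could also be applied on the client side; a sketch with a stand-in DataFrame (the rows below are hypothetical):

```python
import pandas as pd

# Stand-in for the DataFrame returned by egg.top (hypothetical rows).
top = pd.DataFrame({
    "Latex":   ["f1", "f2", "f3"],
    "Fitness": [-0.0041, -0.0042, -0.0044],
    "Size":    [16, 9, 7],
})

# Client-side equivalent of the "size <= 10" filter.
small = top[top["Size"] <= 10]
print(small)
```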
Let us investigate the distribution of tokens among the top 1000 generated expressions:
egg.distributionOfTokens(top=1000)
| Pattern | Count | AvgFit |
|---|---|---|
| x0 | 2604 | -0.00359749 |
| t0 | 1006 | -0.009312 |
| t1 | 981 | -0.00941213 |
| t2 | 955 | -0.00937039 |
| t3 | 806 | -0.00893546 |
| t4 | 466 | -0.00910986 |
| t5 | 144 | -0.00786632 |
| t6 | 1 | -0.013187 |
| Abs(v0) | 465 | -0.00810496 |
| Sin(v0) | 74 | -0.0115615 |
| Cos(v0) | 3029 | -0.00309273 |
| Sqrt(v0) | 32 | -0.00845579 |
| Square(v0) | 27 | -0.00967352 |
| Log(v0) | 10 | -0.00972384 |
| Exp(v0) | 45 | -0.0118458 |
| Cube(v0) | 38 | -0.00867039 |
| (v0 + v1) | 3405 | -0.00275121 |
| (v0 - v1) | 351 | -0.00848634 |
| (v0 * v1) | 2139 | -0.0042415 |
| (v0 / v1) | 68 | -0.00815694 |
Sine and cosine rank next, while the exponential is rarely used and, when it does appear, has a worse average fitness than the other tokens.
The reason for this could be that fitting parameters inside an exponential function can be tricky depending on the initial values.
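This sensitivity is easy to see on a one-parameter toy problem y = exp(a*x): for initial guesses far below the true value of a, the predictions are close to zero everywhere, the loss surface is almost flat, and a gradient-based optimizer gets barely any signal to move. A sketch (not eggp's actual optimizer):

```python
import numpy as np

x = np.linspace(0.0, 3.0, 50)
y = np.exp(-2.0 * x)          # data from y = exp(a*x), true a = -2

def mse(a):
    return np.mean((np.exp(a * x) - y) ** 2)

def grad(a, h=1e-5):          # finite-difference dMSE/da
    return (mse(a + h) - mse(a - h)) / (2 * h)

# Far below the truth the loss surface is flat: almost no gradient.
print(grad(-15.0))   # tiny
# Near the truth the loss changes quickly: a clear descent direction.
print(grad(-1.0))    # much larger in magnitude
```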
We can verify that by plotting the top 5 expressions matching the pattern exp(v0)*v1:
egg.top(n=5, pattern="exp(v0)*v1")
As we can see, it is still not a very good fit, as expected.
We can try our luck with another SR method, such as Operon [4], and insert the obtained expressions into the e-graph:
from pyoperon.sklearn import SymbolicRegressor
regOp = SymbolicRegressor(objectives=['mse', 'length'], max_length=20,
                          allowed_symbols='add,sub,mul,div,square,sin,cos,exp,log,sqrt,abs,constant,variable')
regOp.fit(x_sel, y_sel)
with open("equations.operon", "w") as f:
    for eq in regOp.pareto_front_:
        eqstr = regOp.get_model_string(eq['tree'])
        fitness = -eq['mean_squared_error']
        # the fitness is written twice, once per fitness column
        print(f"{eqstr},{fitness},{fitness}", file=f)
egg.importFromCSV("equations.operon")
Still no luck! But we didn't make things easy for SR anyway!
We can insert the ground-truth expression to see whether the parameter optimization is capable of converging to the true parameters and if the fitness is better than what we have.
egg.insert("exp(x0/t0)*(x0^3)*(cos(x0)*(sin(x0)^2)-t1)")
| Latex | Fitness | Parameters |
|---|---|---|
| | -0.00256414 | [-3.15, -0.83] |
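The same check can be reproduced outside of reggression with scipy's curve_fit, fitting t0 and t1 of the ground-truth form directly; a sketch on synthetic data, with a hypothetical initial guess:

```python
import numpy as np
from scipy.optimize import curve_fit

# Ground-truth form from this post, with t0 and t1 as free parameters.
def model(x, t0, t1):
    return np.exp(x / t0) * x**3 * (np.cos(x) * np.sin(x)**2 - t1)

rng = np.random.default_rng(0)
x = np.linspace(0.1, 5.0, 200)
y = model(x, -3.15, -0.83) + rng.normal(0.0, 0.01, size=x.size)

# From a reasonable initial guess the optimizer recovers the
# parameters reported by reggression above.
params, _ = curve_fit(model, x, y, p0=[-3.0, -1.0])
print(params)   # close to [-3.15, -0.83]
```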
We can also use reggression to check whether two or more expressions are equivalent. Let's say we want to see whether (x0 + 3)^2 - 9 and x0*(x0 + 6) are the same function.
First, we create an empty e-graph:
newegg = Reggression(dataset="vlad.csv", loss="MSE")
Next, we add both expressions while storing their e-class ids:
eid1 = newegg.insert("(x0 + 3)**2 - 9").Id.values[0]
eid2 = newegg.insert("x0*(x0 + 6)").Id.values[0]
print(eid1, eid2)
> 6, 9
Initially, their ids will differ, since so far the e-graph treats them as distinct expressions.
Now, the main idea is to run equality saturation to produce the equivalent forms of each of these expressions following a set of rules, such as:
If the set of rules is sufficient to produce at least one common expression starting from both the first and the second expression, they will eventually be merged, and their e-class ids will become the same.
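For this particular pair, the equivalence is the algebraic identity (x + 3)^2 - 9 = x^2 + 6x = x(x + 6); a quick exact check over integers (a sanity check of the identity, not what the e-graph does internally):

```python
# (x + 3)**2 - 9  expands to  x**2 + 6*x,  which factors to  x*(x + 6).
# The identity is exact over the integers, so a brute check suffices.
assert all((x + 3)**2 - 9 == x * (x + 6) for x in range(-1000, 1000))
print("equivalent on all tested points")
```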
We can run some iterations of equality saturation using the command:
newegg.eqsat(5)
And, now, their ids should be the same!
print("Id of the first equation: \n", newegg.report(eid1).loc[0:1, ["Info", "Training"]])
print("Id of the second equation: \n", newegg.report(eid2).loc[0:1, ["Info", "Training"]])
> Id of the first equation: 16
> Id of the second equation: 16
After running equality saturation, we can also retrieve a sample of the equivalent expressions for that e-class id:
newegg.getNExpressions(eid1, 10)
Leading to:
We can also measure the similarity between two expressions as the percentage of e-class ids they share.
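This boils down to a set-overlap score over the e-class ids reachable from each expression's root; a minimal sketch with hypothetical id sets (reggression computes the actual sets for you):

```python
def shared_eclass_ratio(ids_a, ids_b):
    """Jaccard-style overlap between two sets of e-class ids."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b)

# Hypothetical e-class ids reachable from two expression roots.
print(shared_eclass_ratio({1, 4, 7, 16}, {2, 4, 16, 23}))  # 2 shared of 6 total
```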

Python library and CLI
pip install eggp
pip install reggression
pip install symregg
https://github.com/folivetti/eggp
https://github.com/folivetti/reggression
https://github.com/folivetti/symregg
digraph G {
  exp1 [label="exp"]
  neg [label="-"]
  x1 [label="x"]  x2 [label="x"]  x3 [label="x"]
  x4 [label="x"]  x5 [label="x"]  x6 [label="x"]
  cos1 [label="cos"]  sin1 [label="sin"]
  cos2 [label="cos"]  sin2 [label="sin"]
  pow1 [label="^"]  pow2 [label="^"]
  minus [label="-"]
  mul1 [label="*"]  mul2 [label="*"]  mul3 [label="*"]
  mul4 [label="*"]  mul5 [label="*"]  mul6 [label="*"]

  mul1 -> mul2
  mul2 -> pow1
  mul2 -> mul3
  mul3 -> cos1
  cos1 -> x3
  mul3 -> mul4
  mul4 -> sin1
  sin1 -> x4
  mul4 -> mul5
  mul5 -> minus
  minus -> mul6
  minus -> 1
  mul6 -> cos2
  cos2 -> x5
  mul6 -> pow2
  pow2 -> sin2
  sin2 -> x6
  pow2 -> 2
  mul1 -> exp1
  exp1 -> neg
  neg -> x1
  pow1 -> x2
  pow1 -> 3
}