Tensor Logic Paper Explanation

Think of this paper as proposing a new language for building AI systems — one that uses a single, simple building block and combines the strengths of two worlds: neural (learn-from-data, scalable) and symbolic (rules, logic, explainability).

Here’s the core idea and why it matters, explained with analogies.

What’s the single idea?

Replace many programming constructs with one simple thing: tensor equations. Everything — rules, facts, neural layers, probability models — is written as equations that join and sum tensors (multi-dimensional arrays).
The paper shows these tensor equations can express logic rules and database joins, and they are exactly the same form as many neural network operations (Einstein summation / einsum). So symbolic rules and neural nets are the same “shape” mathematically.

Analogy: LEGO bricks

Imagine you had two toy kits: one full of complicated gears and rule-books (symbolic AI), and another with flexible rubber bricks that you can stretch and glue to fit anything (neural nets). Tensor logic says: let’s make a single LEGO brick that can be snapped together to form both gears and rubber shapes. With that one brick you can build logical rules, neural nets, probabilistic models — everything.

How relations (facts/rules) map to tensors (representation)

A relation like Parent(Alice, Bob) can be seen as a sparse table or a sparse Boolean tensor (mostly zeros, ones where the relation holds).
Joins (e.g., combining Parent and Ancestor) correspond to summing over a shared index in tensor algebra (einsum) and then thresholding (turning any positive count into true/false).
Projection (selecting some columns/indices) is just summing out other indices of the tensor.

Analogy: Mailing lists and spreadsheets

A relation is like a sparse spreadsheet where you only write the rows that matter. A tensor equation is like a formula that says “for every matching row in these sheets, add these values and then check if the result is nonzero.”

Inference (how the system answers questions)

Forward chaining: repeatedly apply tensor equations to derive new tensors (facts) until nothing new appears — like running repeated matrix operations.
Backward chaining: treat each equation as a function and compute only what’s needed for a query — like recursively resolving definitions.

Analogy: Recipe book vs. on-demand cooking

Forward chaining is like pre-making all dishes in a big buffet (compute everything). Backward is like cooking only the dish someone orders (compute on demand).

Learning (how parameters are trained)

Because everything is tensor algebra, automatic differentiation (backprop) works naturally. Gradients are also tensor equations.
You can learn weights like in neural nets, and you can “invent predicates” (analogous to hidden features) by tensor decompositions (e.g., Tucker decomposition).

Analogy: Sculpting with clay that learns

Think of weights as clay shape that you tweak so the overall sculpture (output) matches examples. The derivative tells you how to push the clay.

How common AI systems fit in

Neural nets: MLPs, CNNs, RNNs, GNNs, Transformers — all expressed as compact tensor equations (einsums, nonlinearities, projections). The paper sketches how to write these architectures as tensor logic programs.
Symbolic AI: Datalog (logic programs) maps directly — Boolean tensors and einsums implement joins and projections.
Kernel methods and probabilistic graphical models: can be written as tensor equations (kernels as Gram matrices, factors as tensors; marginalization = projection).

Analogy: One universal pattern

Like discovering that many different knitting patterns are actually sequences of the same three stitches; once you know those, you can make any sweater pattern.

Reasoning in embedding space (the novel, powerful idea)

Instead of only storing discrete facts, embed objects and relations as vectors/tensors. Combine them with tensor equations to do reasoning directly in embedding space.
If embeddings are random high-dimensional vectors, superposition and dot products let you check membership and reconstruct relations with small error (like Bloom filters or holographic representations).
If embeddings are learned, similar objects “borrow” inferences from each other (analogy-based reasoning). You can control how much borrowing happens with a temperature parameter: low temperature = strict deductive reasoning; higher temperature = softer, analogical inferences.
This approach gives: scalability, learning, and — important — transparency and controllable reliability (you can make it behave like logical deduction by lowering temperature).

Analogy: Fuzzy library catalog

Imagine each book has a fingerprint. Instead of exact shelf lookup, you query by fingerprint similarity: near matches suggest related books. Embedding-based reasoning is like combining fingerprints and rules to infer which books are relevant, with a knob to make the system strict or fuzzy.

Scaling up (practicality)

For large sparse data, two approaches:
1. Hybrid: use a database engine for sparse parts and GPU tensor ops for dense parts.
2. Full GPU via decomposition: turn sparse tensors into dense low-rank decompositions (Tucker), operate on those. That introduces small, controllable errors but scales well and fits learning.

Analogy: City map vs. GPS tiles

Either keep the sparse paper map and look up locations (DB + GPU), or pre-render tiles (dense decompositions) to serve fast zoomed-in views — the latter uses more approximation but is faster.

Why it could matter

Unifies symbolic and neural approaches in a mathematically clean way.
Makes reasoning transparent and (if desired) logically sound, while keeping generalization and scalability of embeddings/neural nets.
Could reduce hallucinations (if you tune temperature toward deductive behavior) and make models auditable, because intermediate tensor values can be inspected.
Acts like a new programming language for AI where the main primitive is tensor equations, potentially replacing many ad hoc toolchains.

Analogy: Universal tool

Think of tensor logic as inventing a Swiss Army knife optimized for AI: one blade (tensor equation) but many tools fold out — symbolic rules, neural modules, probabilistic factors, all handled consistently.

Limitations / caveats (brief)

Converting huge sparse datasets into dense decompositions trades exactness for efficiency; you must choose embedding sizes wisely.
Implementing full systems, tooling, and adoption are nontrivial engineering challenges.
Some logical features (e.g., full Prolog-style unification with functions) need extra work.

Final one-liner

Tensor logic says: treat rules, data, and neural computations as the same tensor operations. Use that single, uniform language to build AI systems that are learnable, scalable, and more transparent — with a built-in way to smoothly move between strict logic and flexible analogy.

If you want, I can:

Give a short example showing a simple rule and its corresponding tensor equation.
Walk through how a small toy transformer or a simple logic rule is written in tensor logic.
Here are two compact, concrete examples in tensor logic style:
1. a simple symbolic rule turned into a tensor equation (the “Aunt” example), and
2. a tiny attention head (toy transformer) expressed as tensor equations.
  I’ll show the rule/equations, then explain each line in plain English.

Example 1 — a simple logic rule (Aunt)

Natural-language rule: “x is Aunt of z if x is Sister of y and y is Parent of z.”
Datalog-style rule: Aunt(x, z) ← Sister(x, y), Parent(y, z)
Tensor-logic form (compact):

Aunt[x, z] = step( Sister[x, y] Parent[y, z] )

Explanation line-by-line:
- Sister[x, y] and Parent[y, z] are Boolean tensors (matrices) where entry = 1 means the relation holds.
- Writing Sister[x, y] Parent[y, z] uses an einsum: it multiplies and sums over the shared index y. That corresponds to the database join on y.
- The result is a numeric tensor whose (x,z) entry counts how many y witnesses exist making both relations true.
- step(...) applies a threshold: any positive count becomes 1 (true), zero stays 0. That recovers the Boolean relation Aunt(x,z).
- So the single tensor equation implements the join + projection + booleanization step of the original logical rule.

Why this is handy:

The same einsum expression Sister[x,y] Parent[y,z] is also a standard matrix/tensor operation used in neural code — so logic joins and neural linear algebra look identical in this language.

Example 2 — a tiny attention head (toy transformer)

Goal: one attention head computing output vectors for each position p from input embeddings X[p,d].
Key operations: produce Query/Key/Value, score queries vs keys, softmax weights, weighted sum of values.
Tensor-logic equations (toy, minimal indices):

Query[b, p, k] = WQ[b, k, d] X[p, d] Key[b, p, k] = WK[b, k, d] X[p, d] Val[b, p, v] = WV[b, v, d] X[p, d]

Scores[b, p, q] = Query[b, p, k] Key[b, q, k] / sqrt(Dk) AttnWeights[b, p, q] = softmax_q( Scores[b, p, q] ) AttnOut[b, p, v] = AttnWeights[b, p, q] Val[b, q, v]

Explanation line-by-line:
- X[p, d] is the input embedding for position p and embedding dimension d.
- WQ, WK, WV are learned weight tensors that linearly map inputs to query/key/value spaces. The extra index b denotes a block or layer (keeps formulas general).
- Query[b, p, k] = WQ[b, k, d] X[p, d] is an einsum: for each block b, each position p, and query-dim k, sum over d to get the query vector.
- Scores[b, p, q] = Query[b, p, k] Key[b, q, k] / sqrt(Dk) computes the dot product between Query at position p and Key at position q (sum over k), scaled by sqrt(Dk).
- AttnWeights[b, p, q] = softmax_q(Scores[b,p,q]) applies softmax along index q for each fixed (b,p), turning raw scores into normalized attention weights.
- AttnOut[b, p, v] = AttnWeights[b, p, q] Val[b, q, v] is another einsum: for each (b,p,v) take weighted sums over positions q of the value vectors, yielding the attended output vectors.

Why this is the same pattern as the logic example:

Both the logic join (Sister[x,y] Parent[y,z]) and the attention weighted sum (AttnWeights[b,p,q] Val[b,q,v]) are einsums: multiply then sum over a shared index. The only difference is the semantics (Boolean join vs. soft-weighted sum) and the nonlinearities (step vs. softmax).

Short walk-through of a tiny end-to-end flow (how the attention head answers a query)

Start from X[p,d] (the inputs).
Compute Query, Key, Val by linear einsums with WQ/WK/WV.
Compute Scores = dot(Query, Key) and AttnWeights = softmax over positions.
Compute AttnOut = weighted einsum of Val with AttnWeights.
Optionally pass AttnOut through a small MLP (also tensor equations) and continue stacking blocks.

Extra — showing relation extraction from embeddings (very short)

If Emb[x,d] is an embedding matrix for objects and EmbR[i,j] stores a superposition of tuple embeddings for relation R, you can test tuple (A,B) with:

Score[A,B] = EmbR[i, j] Emb[A, i] Emb[B, j]
- Large Score[A,B] ≈ relation true; small ≈ false. This is the embedding-space analogue of R(A,B) lookup.

If you’d like next:

I can give a tiny numeric toy (very small matrices) and show the actual multiplications step-by-step so you can see numbers flow through an Aunt-rule or an attention head.
Or I can show how to express forward vs backward chaining in the tensor equations with a very small knowledge base. Which would you prefer?

Below are two compact, concrete walk-throughs you requested.

Numeric toy: the “Aunt” rule computed step-by-step
Rule: Aunt(x,z) ← Sister(x,y), Parent(y,z)
We represent Sister[x,y] and Parent[y,z] as small Boolean matrices and compute the einsum plus step.

Setup (small domain with 3 people: A, B, C)

indices: x ∈ {A=0, B=1, C=2}, y ∈ {A,B,C}, z ∈ {A,B,C}

Matrices (rows = first arg, columns = second arg):

Sister[x,y] (1 if row is sister of column)

Sister = [ [0, 1, 0], # A is sister of B [1, 0, 0], # B is sister of A [0, 0, 0] # C has no sister records here ]
Parent[y,z] (1 if row is parent of column)

Parent = [ [0, 0, 0], # A is not parent of anyone here [0, 0, 1], # B is parent of C [0, 0, 0] ]

Compute join Counts[x,z] = Sister[x,y] * Parent[y,z] summing over y

Do elementwise multiplication for each y and sum across y:

Manual per (x,z):

For x=A (row 0)
- z=A: sum_y Sister[0,y]*Parent[y,0] = (S[0,0]_P[0,0]) + (S[0,1]_P[1,0]) + (S[0,2]_P[2,0]) = 0_0 + 1_0 + 0_0 = 0
- z=B: sum = 0_0 + 1_0 + 0*0 = 0
- z=C: sum = 0_0 + 1_1 + 0*0 = 1
- For x=B (row 1)
- z=A: sum = 1_0 + 0_0 + 0*0 = 0
- z=B: sum = 1_0 + 0_0 + 0*0 = 0
- z=C: sum = 1_0 + 0_1 + 0*0 = 0
- For x=C (row 2)
- all z: Sister row is all zeros → sums are 0

So Counts matrix:

Counts = [ [0, 0, 1], [0, 0, 0], [0, 0, 0] ]

Apply step (threshold to Boolean: positive → 1)

Aunt = step(Counts) yields:

Aunt = [ [0, 0, 1], # Aunt(A,C) = 1 [0, 0, 0], [0, 0, 0] ]

Interpretation

Aunt(A,C) is true because A is sister of B and B is parent of C.
This shows how the einsum/join and projection (sum over y) plus threshold implement the logical rule.
Tiny attention head: numeric toy with step-by-step multiplications
Toy sequence length P=2 (positions p=0,1)
Embedding dimension d=2
Query/key dim k=2 and value dim v=2
Single block/head (omit block index for simplicity)

Inputs

X[p,d]:

X = [ [1.0, 0.0], # position 0 embedding [0.0, 1.0] # position 1 embedding ]

Weights (small, chosen for clarity)

WQ[k,d] = WK[k,d] = WV[v,d] (use same for simplicity)

WQ = WK = WV = [ [1.0, 0.0], # row for dim 0 [0.0, 1.0] ]

Step 1 — compute Query, Key, Value

Query[p,k] = WQ[k,d] * X[p,d] (sum over d)
- p=0: Query[0] = [1_1 + 0_0, 0_1 + 1_0] = [1.0, 0.0]
- p=1: Query[1] = [1_0 + 0_1, 0_0 + 1_1] = [0.0, 1.0]
Similarly Key = Query, Val = Query (because weights identical here)

So

Query = Key = Val = [ [1.0, 0.0], # p=0 [0.0, 1.0] # p=1 ]

Step 2 — compute raw scores Scores[p,q] = Query[p,k] * Key[q,k] (dot over k), scale by sqrt(Dk)

Dk = 2 → sqrt(Dk) ≈ 1.4142. We’ll divide by 1.4142 for scaling.

Compute dot products:

Scores[0,0] = dot(Query[0],Key[0]) = 1_1 + 0_0 = 1
Scores[0,1] = dot(Query[0],Key[1]) = 1_0 + 0_1 = 0
Scores[1,0] = dot(Query[1],Key[0]) = 0_1 + 1_0 = 0
Scores[1,1] = dot(Query[1],Key[1]) = 0_0 + 1_1 = 1

Scaled scores (divide by sqrt(2)):

Scores = [ [1/1.4142, 0], [0, 1/1.4142] ] ≈ [ [0.7071, 0.0], [0.0, 0.7071] ]

Step 3 — softmax over q for each p: AttnWeights[p,q] = softmax_q(Scores[p,q])

For p=0: softmax([0.7071, 0.0]) → exponentiate and normalize
- exp(0.7071) ≈ 2.028, exp(0) = 1 → sum ≈ 3.028
- weights ≈ [2.028/3.028, ⅓.028] ≈ [0.67, 0.33]
For p=1: same pattern, weights ≈ [0.33, 0.67] (because scores [0,0.7071])

So

AttnWeights ≈ [ [0.67, 0.33], # weights for p=0 attending to q=0 and q=1 [0.33, 0.67] ]

Step 4 — weighted sum of values: AttnOut[p,v] = AttnWeights[p,q] * Val[q,v] (sum over q)

For p=0:
- AttnOut[0] = 0.67 * Val[0] + 0.33 * Val[1] = 0.67*[1,0] + 0.33*[0,1] = [0.67, 0.33]
For p=1:
- AttnOut[1] = 0.33*[1,0] + 0.67*[0,1] = [0.33, 0.67]

Result

AttnOut ≈ [ [0.67, 0.33], [0.33, 0.67] ]

Interpretation

Each position’s output is a mixture of both positions’ value vectors, weighted by attention. Because Query matched Key best at the same position, the self-weight is stronger (≈0.67) but there is still cross-position influence (≈0.33).
Forward vs Backward chaining (tiny KB example using the Aunt scenario) We use the same small KB as in (1).

Knowledge base (facts)

Sister(A,B), Sister(B,A), Parent(B,C)

Forward chaining (compute closure)

Start with the fact tensors Sister and Parent.
Apply the rule equation Aunt[x,z] = step(Sister[x,y] Parent[y,z]).
As in numeric example, this computes Aunt(A,C).
Forward chaining would continue if there were more rules that use Aunt to derive new facts; it repeatedly applies applicable tensor equations until no new nonzero tensor entries appear.

Backward chaining (query-driven)

Ask query: Is Aunt(A,C)?
Backward chaining treats Aunt[x,z] equation as a function: to prove Aunt(A,C):
- Look for y such that Sister(A,y) and Parent(y,C) hold.
- Check facts: Sister(A,B) true and Parent(B,C) true → succeed.
In tensor logic terms, backward chaining would compute only the needed slices:
- compute Sister[0,y] row and Parent[y,2] column, then sum over y and apply step.
- This avoids computing the whole Aunt matrix if we only need one entry.

Analogy recap

Forward chaining = “precompute the whole buffet” (build the complete Aunt table).
Backward chaining = “cook the requested plate on demand” (compute only Aunt(A,C) when asked).

Would you like:

A tiny runnable Python snippet that performs the exact arithmetic above (matrix multiplications / softmax) so you can tweak numbers and run it yourself?
Or a small example that shows learned embeddings and an embedding-space lookup for a relation (the paper’s embedding trick)?

Below is a tiny runnable Python snippet that performs the exact arithmetic from the two walk-throughs: (1) the Aunt rule via a Boolean einsum + step, and (2) a tiny attention head (queries, keys, values, scaled dot-product, softmax, weighted sum). You can copy-paste and run this with Python 3 and NumPy installed.

Code Snippet

# python
import numpy as np

def step(x):
    return (x > 0).astype(int)

def softmax(a, axis=-1):
    a_max = np.max(a, axis=axis, keepdims=True)
    e = np.exp(a - a_max)
    return e / np.sum(e, axis=axis, keepdims=True)

# 1) Aunt example: boolean matrices for Sister and Parent over domain [A,B,C]
Sister = np.array([
    [0, 1, 0],   # A is sister of B
    [1, 0, 0],   # B is sister of A
    [0, 0, 0]
], dtype=int)

Parent = np.array([
    [0, 0, 0],   # A has no children here
    [0, 0, 1],   # B is parent of C
    [0, 0, 0]
], dtype=int)

# Einsum join over index y: Counts[x,z] = sum_y Sister[x,y] * Parent[y,z]
Counts = np.einsum('xy,yz->xz', Sister, Parent)
Aunt = step(Counts)

print("Counts (Sister join Parent):\n", Counts)
print("Aunt (after step):\n", Aunt)
# Interpretation: Aunt[0,2] == 1 means Aunt(A,C) is true

# 2) Tiny attention head example
# Inputs: sequence length P=2, embedding dim d=2
X = np.array([
    [1.0, 0.0],  # position 0
    [0.0, 1.0],  # position 1
])

# Simple weights: map d->k (queries/keys) and d->v (values)
WQ = np.array([[1.0, 0.0],
               [0.0, 1.0]])  # shape (k,d)
WK = WQ.copy()
WV = WQ.copy()

# Compute Query/Key/Value: shape (P, k) and (P, v)
Query = np.einsum('kd,pd->pk', WQ, X)  # WQ @ X.T  but einsum gives (k,p)->transpose
Query = Query.T  # shape (P,k)
Key   = np.einsum('kd,pd->pk', WK, X).T
Value = np.einsum('kd,pd->pk', WV, X).T

print("\nQuery:\n", Query)
print("Key:\n", Key)
print("Value:\n", Value)

# Scaled dot-product attention
Dk = Query.shape[1]
scores = np.einsum('pk,qk->pq', Query, Key) / np.sqrt(Dk)  # shape (P,P)
attn_weights = softmax(scores, axis=1)  # softmax over q (columns) for each p (row)
attn_out = np.einsum('pq,qv->pv', attn_weights, Value)  # weighted sum of values

print("\nScores (scaled dot products):\n", scores)
print("Attention weights (softmax rows):\n", attn_weights)
print("Attention output vectors:\n", attn_out)