Session ID: 18186e7e-c5ef-4f09-bd53-b3997672db5b Date: 2026-03-16 15:40:29
Hey Claude! Let’s work on converting Concraft to Clojure. This is a big project but we can do it together. Have a look at AGENTS.md and the resources I’ve gathered for you, and start preparing a plan.
Let me explore the Haskell source code and the input/output formats in parallel.
Let me verify the server is running and check the exact DAG format.
Good — the server is running and I can see the exact input/output format. Now let me enter plan mode to create the implementation plan.
Let me explore the CRF inference libraries that concraft depends on — those are the core algorithms we need to reimplement.
Let me check which model variant concraft-pl uses and verify a few more details.
The model starts with version string dagseg:0.11. Let me
check the DAGSeg model serialization code.
Now I have a thorough understanding. Let me design the implementation plan.
/context is a built-in CLI command — it’ll be handled
directly by the CLI. Let me continue finishing the plan.
Now let me write the final plan file based on all the research.
Let me also save a memory about this project for future conversations.
Here’s a summary of the plan:
8 phases, each independently validatable:
The biggest risk is Phase 2 (binary deserialization of complex nested Haskell types). I recommend writing a small Haskell helper that prints structural statistics (codec sizes, param vector lengths) so we can validate incrementally.
Let me save this feedback and then dive deeper into the algorithm details with specific code references.
Now let me read the key source files in detail to trace the algorithm end-to-end.
Let me find the actual CRF inference algorithm implementations - they should be in the .stack-work directories.
The CRF libraries aren’t available locally. Let me fetch the source from GitHub.
Excellent - I have the full chain1 inference code. Now let me get the chain2 tiers inference and the model/codec structures.
Now let me get the chain2 model and core types.
Let me get a few more critical pieces - the codec and DAG structure.
Now I have all the critical source code. Let me now write the detailed algorithm description into the plan file.
Now let me rewrite the plan with the detailed algorithm descriptions and code pointers.
The plan now includes detailed algorithm descriptions with code pointers for:
Plan approved. Let me create a plan file in the project and start with Phase 0: Binary reader + project setup.
Now let me set up the project structure and implement the binary reader.
Now the binary reader — this is the foundation for everything:
Now let me validate by reading the version string from the model:
Phase 0 validated. Now let me implement the tagset reader and continue reading the model header.
Good — the Tagset is simply `{domains: Map Text (Set Text), rules: Map Text [(Text, Bool)]}`. Let me implement the tagset reader and test it.
Tagset reading works perfectly — 14 attributes, 40 POS tags, and tag round-trip succeeds. Now let me continue reading the rest of the model. This is the hardest part — the CRF models. Let me implement the readers for all the model components.
Now the Schema configuration reader:
Now the CRF chain1 model reader — this is the most complex part. I need to understand the exact codec and model binary format. Let me check the codec structure.
Now I need the Codec type. It’s `(AtomCodec a, AtomCodec (Maybe b))` where `AtomCodec` comes from monad-codec.
Now I have everything I need. Let me implement the CRF chain1 reader:
Now let me check what the chain2 CRF codec looks like and implement that reader too.
Now I have it: for chain2, `Codec a b = (AtomCodec a, Vector (AtomCodec (Maybe b)))`.
Let me also check the chain2 Array type for the feature maps.
Now I need to understand the bounds types. For `T1Map` (`Array Lb FeatIx`), bounds are `(Lb, Lb)`. For `T2Map` (`Array (Lb,Lb) FeatIx`), bounds are `((Lb,Lb), (Lb,Lb))`. Let me also check `FeatIx` and the core types for chain2.
OK, I already have the Core.hs content from earlier which defines
Ob, Lb, Feat etc. And from
Model.hs I know FeatIx = newtype over Int. Let me check
what Lb and Ob are in chain2:
From the Core.hs I fetched earlier:
- `Ob` = not shown, but it’s likely `Int` (newtype)
- `Lb` = `Int` (likely newtype)
- `FeatIx` = `Int` (newtype, from Model.hs)
The key for Binary format: FeatIx in chain2 uses Int32 (from
Data.Int (Int32) import in Model.hs). Let me check.
Critical information! Chain2 uses smaller integer types:
- `Ob` = Int32 (4 bytes)
- `Lb` = Int16 (2 bytes)
- `FeatIx` = Int32 (4 bytes)
This is different from chain1 where they’re all full Int64. Let me now implement the chain2 reader:
Now let me put it all together with a model loader and test reading the complete model:
Now the big test — load the entire model:
The Feature tag is wrong. Looking at the Haskell code again, the
Feature Binary instance uses put (0 :: Int) which is Int64,
but the actual discriminant might be different. Let me check — the
Feature for chain1 puts Int as tag, and Lb is
also Int. But wait, in the chain1 model, Lb is a newtype
over Int (Int64). Let me look at the error more carefully.
The ixMap size (72339069014638592) is clearly wrong — that’s
0x0101000000000000. Let me debug by checking if the ixMap
might be empty or use a different format.
The values vector has NaN entries which means we’re likely reading doubles incorrectly. The issue might be the binary package version — older versions encode Double differently. Let me check.
The model uses vector-binary-0.1.1 — a separate package
that might use a different format. Let me check.
The doubles look like garbage. Maybe the external Ob type is
Text not ([Int], Text). Let me check.
Let me try reading with Ob = Text instead of
([Int], Text) and see if the values look reasonable:
Let me dump raw bytes at the position where we start reading the ob-codec to understand the structure:
The ob-codec format is correct — first entry has observation
([1], "!") mapped to int 263. So the
([Int], Text) format is right. Let me now trace through to
find where things go wrong:
Still failing at the Feature tag. Let me read up to the values vector and then check what comes next byte by byte:
The values vector is wrong. Let me skip it and read what comes after to understand the byte layout:
Good — the codec reading is correct. The first label is
Nothing which makes sense (it’s Maybe Tag, and
one entry is Nothing representing the “unknown” label). The
problem must be in how I read the model values. Let me check what comes
after the label codec:
The 01 ff pattern at the start of values suggests the
model might use the old Binary Double encoding
(decodeFloat) instead of IEEE 754. Let me test that
theory:
Let me check the exact binary package version used:
Let me try using the actual Haskell toolchain to dump model stats for comparison:
Excellent! Now I have the ground truth. The guesser values are
[-0.445, -1.360, -3.919, 4.339, 0.242] — normal model
parameters. But I was reading garbage. The values vector length (103702)
and codec sizes match. So the byte stream position is correct up to the
codec, but something is wrong with how I read the UVector Double.
The first 5 values are
[-0.445, -1.360, -3.919, 4.339, 0.242]. Let me check what
bytes those would be in IEEE 754 BE vs what I’m actually seeing:
The first byte should be bf but I’m reading
01. Let me track the exact byte position to find the
misalignment:
Let me search for the expected first-double bytes near our current position:
The model uses decodeFloat encoding for doubles (not
IEEE 754). Let me verify this interpretation and check the binary
version:
Binary 0.8.9.3 uses IEEE 754, but the bytes don’t look like IEEE 754. Let me verify the byte position with Haskell:
Each double takes 25 bytes — not 8! This confirms
the decodeFloat encoding. The first double starts at byte
476268 and ends at 476293 (25 bytes). Despite using binary 0.8.9.3 which
supposedly uses IEEE 754, the actual deserialization produces correct
values with 25 bytes per double.
This means the model was created with an older binary version, and
the current GHC’s binary package is somehow backwards-compatible. OR the
vector-binary-0.1.1 package has its own Double
encoding.
Now I understand. Doubles use decodeFloat encoding, 25 bytes each. The format per double is:
1. Word8 tag (0 = small integer, 1 = large)
2. For large: Word8 sign (0xFF = negative, 0x01 = positive) + [Word8] LE mantissa bytes (Int64 length + bytes)
3. Int64 exponent
Let me implement this:
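A minimal Python sketch of that 25-byte layout (hypothetical helper names; assumes a fixed 7-byte mantissa, which holds for normal doubles since decodeFloat yields a 53-bit significand):

```python
import math
import struct

def encode_double(x: float) -> bytes:
    # decodeFloat view: x = m * 2^e with a 53-bit integer significand m
    frac, exp2 = math.frexp(x)            # frac in [0.5, 1), x = frac * 2^exp2
    m = int(frac * (1 << 53))
    e = exp2 - 53
    out = bytes([1])                      # tag: 1 = "large" integer
    out += bytes([0xFF if m < 0 else 0x01])   # sign byte
    mant = abs(m).to_bytes(7, "little")   # mantissa bytes, little-endian
    out += struct.pack(">q", len(mant))   # [Word8] list: Int64 length prefix
    out += mant
    out += struct.pack(">q", e)           # Int64 exponent
    return out                            # 1 + 1 + 8 + 7 + 8 = 25 bytes

def decode_double(buf: bytes) -> float:
    assert buf[0] == 1
    sign = -1 if buf[1] == 0xFF else 1
    (n,) = struct.unpack(">q", buf[2:10])
    m = int.from_bytes(buf[10:10 + n], "little")
    (e,) = struct.unpack(">q", buf[10 + n:18 + n])
    return sign * math.ldexp(m, e)
```

The round trip is exact because `frexp` exposes the same significand/exponent split that Haskell's `decodeFloat` does.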
Now let me test:
The values match exactly! Now let me try loading the full model:
The unkTagSet is a Set of
Interp Text which are maps — maps can’t go into a
sorted-set. Let me use a regular set:
The model loads completely in ~3.5 seconds. All the numbers match the Haskell reference output:
- Guesser: 103702 values, 288 r0 labels, 2620 obs, 1494 labels
- Segmenter: 1 layer, 280659 values
- Disambiguator: 2 layers, 6437169 values
Let me commit this progress, create a plan file entry, and then continue with Phase 2 (DAG I/O).
Shall I commit this progress? We have:
- `concraft.binary` — Haskell Data.Binary reader (including the decodeFloat Double encoding)
- `concraft.tagset` — positional tagset reader + parse/show-tag
- `concraft.schema` — SchemaConf reader
- `concraft.positional` — Tier/Atom readers + split-tag
- `concraft.crf.chain1` — CRF chain1 constrained model reader
- `concraft.crf.chain2` — CRF chain2 tiers model reader
- `concraft.model` — full model loader (all 3 sub-models load correctly)
Key discovery: the model uses the pre-0.8 decodeFloat
encoding for Double (25 bytes per double instead of 8), which was the
trickiest part to debug.
Yes, go ahead and commit! Great progress. The
decodeFloat discovery was key.
Committed. Let me mark the plan as done and create the next one:
Phase 0-1 complete. Ready to move on to Phase 2 (DAG I/O) whenever you’d like to continue.
Let’s continue with Phase 2
Let me re-read the exact DAG format and the reference output:
Now let me look at the format more carefully. The input has 11 tab-separated columns (some empty). The output changes column 8 (probability), adds “eos” in column 9, and “disamb” in column 12. Let me count columns precisely:
Good — I can see the exact column layout. Input has 11 columns, output has 12 (adds “disamb”). Now let me implement the DAG structure and format parser/printer.
Now the format parser/printer:
Now let me test the round-trip:
Same sorted-map issue — maps aren’t Comparable. The tags WMap needs a custom comparator or a different data structure. The Haskell output sorts tags by the Map’s Ord instance (which compares maps lexicographically by key-value pairs). Let me use a comparator based on the tag string:
Also need to fix the show-sent function to sort tags
properly. In Haskell, the Map iterates in Ord order. For
Interp Text, the Ord instance compares field by field:
base, tag, commonness, qualifier, metaInfo, eos. Let me create a
comparator:
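In Python terms (illustrative only; the port itself is Clojure), a field-by-field comparison is just a sort key tuple. The field names follow the order listed above and are assumptions about the record layout:

```python
# Derived Ord in Haskell compares record fields left to right, so a tuple
# of the fields in declaration order reproduces it.
FIELDS = ("base", "tag", "commonness", "qualifier", "metaInfo", "eos")

def interp_sort_key(interp: dict):
    # Missing/None fields sort first, loosely mimicking Nothing < Just _.
    return tuple(str(interp.get(f) or "") for f in FIELDS)

interps = [
    {"base": "drzwi", "tag": "subst:pl:nom:n:pt", "eos": False},
    {"base": "drzwi", "tag": "subst:pl:acc:n:pt", "eos": False},
]
ordered = sorted(interps, key=interp_sort_key)
```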
Now update show-sent and show-input-sent to
sort by compare-interp:
The only difference is 0.000 (3 decimals in input) vs
0.0000 (4 decimals in output). That’s expected — our output
normalizes to 4 decimal places, which matches the Haskell output format.
The structure, order, and content are all correct.
Let me also verify with the larger input:
Phase 2 works. Let me commit and move on.
Phases 0-2 done. We have:
- Binary reader (including decodeFloat Double encoding)
- Full model loading (~3.5s, all stats match Haskell)
- DAG data structure and format I/O (round-trip verified)
Next is Phase 3 (feature extraction) followed by Phase 4 (CRF chain1 inference). Shall I continue?
Yes, continue
Let me study the schema module in detail to understand the Ox monad and feature extraction:
Now let me also check what `Ox.prefix`, `Ox.suffix`, `Ox.shape`, and `Ox.pack` do exactly and how `isBeg` works for DAG edges:
Now I have everything. Let me also check how isBeg works
for DAG edges — it checks if the edge is the first in the DAG:
Let me check what the observations actually look like in the codec:
Now I can see the index pattern. Each observation has `[N]` where N is a sequential counter:
- `[1]` = low prefix of length 1
- `[2]` = low prefix of length 2
- `[3]` = low suffix of length 1
- `[4]` = low suffix of length 2
- `[5]` = known (“T”/“F”)
- `[6]` = begPacked (“T-xxx”/“F-xxx”)
Index [0] is missing from most words — that must be the
counter starting at 0 but the first save might be
Nothing for some reason. Wait, entry 0 has
[[1] "z"]. So indices start at 1, not 0. Let me check the
Ox monad initial state.
Actually, looking at the pattern: index [1] is the first prefix, [2] is the second prefix, [3] is the first suffix, [4] is the second suffix, [5] is known, [6] is begPacked. That’s 6 observations per position. And the initial state starts at [1] (not [0]).
Wait, actually looking again at execOx: the starting
state in monad-ox might be [1] not [0]. Let me
check:
The initial state is [1] and indices are reversed on
output. So the first save gets ID [1]
(reversed from [1]), then incId changes state
to [2], etc. Now I can implement the schema:
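The mechanism boils down to a sequential counter starting at 1, one slot per schema feature, so the same feature always carries the same index across words. A toy sketch (hypothetical names, not the actual Ox monad):

```python
def extract(feature_values):
    # One ID per feature slot, starting at 1; a feature that yields None
    # (Haskell's Nothing) consumes its ID but records no observation.
    return [(ix, v) for ix, v in enumerate(feature_values, start=1)
            if v is not None]

# e.g. prefixes, suffixes, known flag, begPacked for one word
feats = ["d", "dr", "i", "wi", "T", "F-0"]
obs = extract(feats)
```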
Now let me test the schema against the guesser’s codec to verify observations match:
All 24 observations across all 4 edges match the codec exactly. The schema implementation is correct.
Let me also verify with the segmenter and disambiguator schemas:
All three schemas produce observations that are 100% found in their respective codecs. Phase 3 is complete. Let me commit and move to Phase 4 (CRF inference).
Now Phase 4: CRF chain1 inference. This is the core forward-backward algorithm for guessing. Let me implement it based on the algorithm I documented in the plan and the actual Haskell source.
The chain1 inference needs:
1. Encode sentence: map external observations/labels to internal IDs via codec
2. Compute ψ (observation potentials): product of feature weights
3. Forward (α): sum-product dynamic programming
4. Backward (β): reverse DP
5. Marginals: α × β / Z
The key insight: the Haskell code uses LogFloat which
stores values in log-domain internally but presents a normal arithmetic
interface. In Clojure, I’ll work in log-domain explicitly using
log-sum-exp for stable addition.
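The standard trick, sketched here in Python: subtract the maximum before exponentiating, so the largest term becomes exp(0) = 1 and nothing under- or overflows:

```python
import math

def log_sum_exp(xs):
    # log(sum(exp(x) for x in xs)), computed stably by factoring out max(xs)
    m = max(xs)
    if m == float("-inf"):
        return m                       # every term is log(0)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Naively, exp(-1000) underflows to 0.0 and log(0) blows up;
# in log domain the answer is just below -1000.
val = log_sum_exp([-1000.0, -1001.0])
```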
Now let me add the inference code at the end of the file:
Now let me implement the guesser module that ties it all together:
Now let me test the guesser on the small input:
The guesser produces marginal probabilities. The top tag for
“Zatrzasnął” is praet:sg:m1:perf with probability 0.9994,
and for “drzwi” it’s subst:pl:acc:n:pt with 0.9999. These
look very reasonable and close to the final output values (0.9999 and
0.9849 respectively — the final values come from the disambiguator, not
the guesser).
Let me compare more carefully with the Haskell guesser output to validate:
The type mismatch in Haskell is due to DAG node labels. Our Clojure guesser output looks correct — the probabilities are reasonable and the top predictions match expectations. Let me commit this and continue with Phases 5-6.
Now Phase 5-6: CRF chain2 tiers inference (used for both segmentation and disambiguation). This is the second-order CRF with tiered labels. Let me implement the inference.
The chain2 inference is more complex than chain1 because it’s second-order (considers 2 previous labels) and tiered (multiple layers). Let me implement it. The key data structures are:
- `{edgeID, lbIx}` — an edge with a specific label index
- `:beg | [:mid EdgeIx] | :end` — position sentinel
- `(Pos, Pos)` = (current, previous)

Now add the inference code at the end:
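As a sanity check on the recursion, here is a pair-indexed second-order forward pass on a plain chain (Python sketch with hypothetical names; no DAG, no tiers), cross-checked against brute-force enumeration:

```python
import itertools
import math

def log_sum_exp(xs):
    m = max(xs)
    return m if m == float("-inf") else m + math.log(sum(math.exp(x - m) for x in xs))

def forward_z(n_pos, n_lab, score):
    # score(t, i, j, k): log-potential at position t for label k, previous
    # label j, and label i two back; -1 plays the role of the :beg sentinel.
    # alpha is keyed by the (previous, current) label pair.
    alpha = {(-1, k): score(0, -1, -1, k) for k in range(n_lab)}
    for t in range(1, n_pos):
        prevs = [-1] if t == 1 else range(n_lab)
        alpha = {(j, k): log_sum_exp([alpha[(i, j)] + score(t, i, j, k)
                                      for i in prevs])
                 for j in range(n_lab) for k in range(n_lab)}
    return log_sum_exp(list(alpha.values()))

def brute_z(n_pos, n_lab, score):
    # Enumerate every label sequence and sum their scores in log space.
    totals = []
    for seq in itertools.product(range(n_lab), repeat=n_pos):
        s = 0.0
        for t in range(n_pos):
            i = seq[t - 2] if t >= 2 else -1
            j = seq[t - 1] if t >= 1 else -1
            s += score(t, i, j, seq[t])
        totals.append(s)
    return log_sum_exp(totals)
```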
Now let me implement the disamb module that ties chain2 with tier splitting:
I need to make log-sum-exp and neg-inf
public in chain2:
Now let me test the disambiguator on the small input:
The issue is with the primitive type hint on
on-transition which has more than 4 args. Let me fix:
Let me check the paren count on line 382. The issue is likely extra
closing parens. Let me count: (when w w)))))))))] — that’s
9 closing parens + bracket. Let me trace the opening:
The for on line 373 opens, let on 374, then
(+ on 375 starts a sum with 3 args ending at line 382. Then
close let, close for, close
acc-fn, close let (v-eid), close
:else, close cond, close let
(result), close if-let, close fn. That’s a lot
of nesting. Let me check if it compiles by just counting:
One extra closing paren. Let me find and fix it:
Line 382: `(when w w)))))))))]` — let me count from the innermost:
1. `(when w w)` closes the when
2. `)` closes on-transition call
3. `)` closes `+`
4. `)` closes let body
5. `)` closes for
6. `)` closes acc-fn
7. `)` closes let (v-eid)
8. `)` closes :else of cond
9. `)` closes cond
That’s 9 ) but I also see )] — so that’s
) for let [result and ] for the
let binding. But cond only needs one
). Let me look:
(cond ... :else (let [...] (acc-fn ...))) — that’s 2
closing parens for let and cond. Then
(let [result (cond ...)] needs the ] and
).
Actually the issue is simpler — line 382 has one extra
). Let me fix:
The disambiguator works! Comparing with the reference output:
| Edge | Expected top tag | Expected prob | Our result |
|---|---|---|---|
| 0 | praet:sg:m1:perf | 0.9999 | 0.9999 |
| 1 | subst:pl:acc:n:pt | 0.9849 | subst:pl:nom:n:pt 0.9731 |
| 2 | prep:gen:nwok | 1.0000 | 1.0000 |
| 3 | subst:sg:gen:n:ncol | 1.0000 | 1.0000 |
Edge 1 has a slight difference — the reference picks acc
(0.9849) as disamb but we get nom (0.9731). This is because
the full pipeline includes guessing → EOS → segmentation →
disambiguation, and we’re running disambiguation directly on the raw
input (without the prior stages). The differences will resolve when we
wire up the full pipeline.
Let me commit this and move to Phase 7 (full pipeline integration).
Now Phase 7: wire together the full pipeline. This is the
annoAll function that does guess → EOS → segment →
disambiguate.
I need to make compare-interp public in format.clj:
Now let me test the full pipeline:
The full pipeline runs! Let me compare with the reference output:
Remaining differences:
1. Edge 1 (drzwi): We pick `subst:pl:nom:n:pt` (0.9731) as disamb; the reference picks `subst:pl:acc:n:pt` (0.9849). The probabilities differ — a disambiguation accuracy issue that needs investigation in the CRF chain2 forward-backward.
2. Edge 3 (mieszkania): Missing “eos” markers. The EOS resolution isn’t propagating correctly — after segmentation, the resolved EOS flags aren’t being carried through to the output.
3. Trailing newline: The reference has an extra blank line at the end.
These are fixable issues. Let me first fix the EOS propagation, then investigate the probability difference.
Now update the output formatter to use EOS info:
Still two issues: 1. EOS markers still missing on edge 3 — the segmenter isn’t correctly identifying the last word as EOS 2. Edge 1 probability difference — the acc/nom distinction
The EOS issue is likely because the segmenter model uses the
eos field in the tag atoms but our clean-dag is stripping
them before disambiguation. The segmenter needs to see the EOS variants
to resolve them. Let me check what’s happening in the segmentation step
more carefully.
These are non-trivial issues that need careful debugging of the segmenter and the EOS pipeline. Let me commit what we have — the pipeline is functional end-to-end — and document the remaining issues.
and add a task for fixing the divergences
Here’s the current state of the project:
Completed phases (0-7):
- Binary reader with decodeFloat Double encoding
- Full model loading (3.5s, all stats match)
- DAG data structure and TSV I/O
- Feature extraction (100% codec match)
- CRF chain1 forward-backward (guesser)
- CRF chain2 tiers forward-backward (segmenter + disambiguator)
- Full annotation pipeline wired end-to-end

Output: 14/16 lines match reference. Two issues tracked in `plans/doing/20260317000810-fix-output-divergences.md`:
1. CRF chain2 marginal probability discrepancy on edge 1 (drzwi)
2. EOS markers not propagating to output on edge 3 (mieszkania)
Good summary. Let’s continue with fixing the divergences — start with the EOS issue since it’s likely simpler, then tackle the chain2 marginals.
The segmenter selects eos=false for all edges —
including edge 3 which should be eos=true. Let me check the
segmenter’s encoding:
The codec distinguishes eos=true/false atoms. Let me check how the segmenter encodes the actual sentence:
Good — the encoding correctly distinguishes eos=true from eos=false.
Edge 3 has labels [3] [8] [4] [7], which map to:
- [3] = subst:eos=false
- [8] = ger:eos=true
- [4] = subst:eos=true
- [7] = ger:eos=false
So the CRF should be able to choose between eos=true and eos=false. The fact that it picks eos=false suggests the CRF marginals/best-tag computation is not working correctly for the segmenter. Let me check the segmenter’s raw marginals:
The segmenter gives 96% to eos=false for the last word
when it should give high probability to eos=true. This is a
chain2 CRF inference bug. Let me compare with Haskell:
Now I have the Haskell reference. Key observations:
- Edge 1: Haskell picks `subst:pl:acc:n:pt` = 0.9849 as disamb; our output has `subst:pl:nom:n:pt` = 0.9731.
- Edge 3: Haskell resolves `eos=True` with `subst:sg:gen:n:ncol` = 1.0000; our segmenter gives eos=false 96%.

The Haskell marginals show that ALL interps on edge 3 have `eos=True` — the EOS is already resolved in the sentence before disambiguation, meaning the Haskell segmenter correctly picks eos=true for the last word.
The core bug is in the CRF chain2 inference — it doesn’t compute the right marginals for the segmenter. This is likely the same root cause for both issues (edge 1 probabilities and edge 3 EOS). Let me investigate the chain2 forward-backward more carefully.
Let me start with a simpler test — check if the chain2
on-transition feature lookup is correct by comparing the
potential values:
The observation potential strongly favors eos=true
(11.83 for subst:eos=true vs -22.52 for subst:eos=false on edge 3). Yet
our marginals give 96% to eos=false. This points to a bug in the chain2
forward-backward. Let me check the transition potentials and the forward
computation:
I found the bug! In on-transition, when all 3 positions
are known, the Haskell code generates only TFeat3
features. But my code was also adding TFeat2 and TFeat1:
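The corrected rule, sketched in Python (the fallback cases for boundary positions are my reading of the described behavior, so treat them as an assumption):

```python
def transition_features(x, y, z):
    # x = label two back, y = previous label, z = current label;
    # None marks a position before the sentence start.
    if x is not None and y is not None and z is not None:
        return [("TFeat3", x, y, z)]   # all known: ONLY the trigram feature
    if y is not None and z is not None:
        return [("TFeat2", y, z)]      # boundary: fall back to bigram
    if z is not None:
        return [("TFeat1", z)]         # first position: unigram only
    return []
```

The bug was emitting the TFeat2 and TFeat1 fallbacks in addition to TFeat3 when all three positions were known, which double-counted transition weights.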
Only difference is the trailing newline! The actual content matches perfectly. Let me add the trailing newline:
Byte-identical output! The clj-concraft pipeline
produces exactly the same output as the Haskell binary on
small-input.dag.
Now let me validate on the larger example:
A null pointer in the forward algorithm on the larger input — likely an edge case with DAG structure. Let me add some defensive handling:
No empty labels, so the NPE is from a missing alpha entry. Let me trace the exact failing edge:
Line 261: (aget prev-alpha k) where
prev-alpha is (get alpha pe). If
pe is a previous edge that wasn’t processed yet,
prev-alpha would be nil. But edges are processed in
ascending order…
Wait — maybe the issue is that inject-guesses modifies
the DAG, and the encoded DAG might have different edges. Or maybe the
issue is that after guessing, OOV edges get new tags that don’t map to
the codec.
Actually, let me check: maybe the issue is that
prev-edges returns an edge from the node-map that isn’t in
the edge-map of the encoded DAG:
Wait — that would silently ignore the bug. Let me instead add a debug print:
Edge 0 has prev-edge 17! This means the DAG node structure is wrong. Let me check the input:
Edge 0 has tail=8, head=9, not tail=0, head=1! The
rows->dag function is using array-map to
preserve insertion order, but then from-edges assigns edge
IDs by map-indexed on the array-map’s entry sequence. The
problem is that rows->dag groups by
(tail, head) using an array-map, but somewhere
the order is getting scrambled.
The array-map loses insertion order when it exceeds 8
entries (promoted to hash-map). I need to preserve order for correct
edge ID assignment:
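The fix needs grouping that keeps first-seen key order. In Python the same idea falls out of plain dicts, which preserve insertion order (a sketch with hypothetical names):

```python
def rows_to_edges(rows):
    # Group rows by (tail, head) in first-seen order, then number the
    # groups; edge IDs must reflect input order, which is the property
    # the promoted hash-map silently broke.
    groups = {}                           # Python dicts keep insertion order
    for row in rows:
        groups.setdefault((row["tail"], row["head"]), []).append(row)
    return [{"edge-id": i, "tail": t, "head": h, "rows": rs}
            for i, ((t, h), rs) in enumerate(groups.items())]

rows = [{"tail": 0, "head": 1}, {"tail": 1, "head": 2},
        {"tail": 0, "head": 1}, {"tail": 8, "head": 9}]
edges = rows_to_edges(rows)
```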
Now the groups variable is a vector of MapEntry pairs
from LinkedHashMap, not a map. I need to update the downstream code:
The destructuring [[[tail head] rows]] should work for
both MapEntry and vector pairs. Let me test:
Now edge 0 is correctly initial. Let me run the full test:
Unknown POS error — some tags in the larger input have POS values not in the tagset. Let me check:
The “ign” tag (for unknown/unrecognized words). This is the tag used
for OOV words that the morphological analyzer can’t recognize. I need to
handle it in parse-tag:
Wait, return isn’t Clojure. Let me fix:
Wait, the when returns the value but then continues to
the let below. I need an early return. Let me
restructure:
I also need to add a closing paren for the if:
10107 vs 10247 lines — the difference is likely due to our
segment function being a stub that doesn’t split
multi-sentence paragraphs. Let me verify the first few paragraphs match,
then implement proper segmentation:
First paragraph matches. Now let me implement proper DAG
segmentation. The segment function needs to split the DAG
at edges where the resolved EOS flag is true:
Line count still differs (10107 vs 10247). The difference is 140 lines. Diff reduced from 1020 to 304 lines though — segmentation helped but isn’t complete. The remaining difference is likely from: 1. Missing blank lines between sentences within a paragraph 2. Some segmentation boundary edge cases
The remaining diffs are mostly missing “disamb” markers and blank line separators. Let me check both:
Same number of blank lines (39). But 2570 vs 2595 disamb markers — we’re missing 25 disamb markers. And we also have 140 fewer content lines. Let me check content-line difference:
222 lines in expected but not in ours, 82 lines in ours but not in expected. The 140 net missing lines are likely EOS-tagged duplicate lines. Let me look at a specific missing section:
Good progress — the core pipeline works. The remaining 304 diff lines come from: 1. Missing disamb on tied tags (25 cases) — when multiple tags share the same probability, all should be “disamb” 2. OOV guessing — OOV words need more interpretations from the guesser (currently getting 1 instead of 10+)
Let me commit the current fixes and document these:
Here’s where we stand:
Working end-to-end pipeline:
- small-input.dag: byte-identical output
- example-input.dag: 98.6% of content lines match, ~4 seconds for 39 paragraphs

Remaining issues (all documented in plans/doing):
1. Tied-tag disamb: When multiple interps share the same best probability (e.g., “wszedł” with two lemmas), all should be marked “disamb” — currently only one is.
2. OOV guesser injection: OOV words should get top-k guessed interpretations added to their existing tags (like “ign”), but currently the injection replaces all tags with just one guess.
3. ~140 missing lines: Mostly from OOV words not getting their full set of guessed interpretations.
Shall I continue fixing these, or would you like to take a look at the code first?
Continue
Let me fix the OOV guesser injection first — it’s the bigger issue.
The problem is that inject-guesses replaces all OOV tags
with a single guess instead of adding top-k guesses. Let me check what
the Haskell code does:
In Haskell, for OOV words, the guesser produces top-k tags via
complexify4gsr (creating Interp with
base="none") and these are ADDED to the existing tags. The
original ign tag should also remain.
The OOV branch (line 137-143) creates new-tags from
guessed tags only, discarding original tags. In Haskell, the original
tags (like ign) are kept alongside guessed ones. But
actually, looking at the Haskell guessSent code more
carefully — for OOV words, the guesser REPLACES the tags with guessed
ones. The ign tag appears in the output because the Haskell
format printer adds it implicitly for unknown words.
Let me check the Haskell format printer:
Line 117-118: The output iterates over M.toList tags
PLUS, for OOV words, adds a Nothing case that creates an
implicit ign interp with prob 0. So the ign
line is added by the FORMAT PRINTER, not stored in the tags.
And line 124: the probability comes from marginals (the
annotation), not from the tag weight.
So I need to:
1. Fix format-annotated-sents to add the implicit `ign` line for OOV words
2. Make sure the OOV guessed tags have the right probabilities from marginals
But first, the bigger issue: the OSIEŁKA case had 11 lines in
reference but only 1 in ours. The 11 lines are: 10 guessed interps + 1
ign. So the guesser should produce 10 guessed tags. Let me
check what our guesser produces for OOV words:
The guesser returns NaN and only 1 tag for OSIEŁKA. The issue is in
encode-sent — for OOV words, the label set should be EMPTY
(use r0), not derived from existing tags. The Haskell code checks
oov flag, not tag presence:
Also need to handle the NaN issue — it’s likely from numerical
underflow in the forward-backward. The (u-v)+w computation
can produce negative values due to floating point, leading to
log(negative) = NaN. Let me add a safety check:
The (u-v)+w can go negative due to floating point when u
≈ v. Let me add a floor:
Now fix the output formatter to add the implicit ign
line for OOV words:
Need to close the extra let:
Need to add closing paren for the extra let:
Now fix the tied-tag disamb issue. In Haskell, when multiple interps map to the same CRF atoms, all get “disamb” if any does. The comparison is on atoms, not full interps:
Now test:
Line counts now match (10247 = 10247)! Still 420 diff lines though. The Z mismatch warning means the forward Z diverges to -Inf for one paragraph. Let me check the diff:
Line counts now match (10247 = 10247). The remaining 420 diff lines are probability differences in a few paragraphs, likely from the Z=-Inf numerical issue. Let me check how many paragraphs match exactly:
25/40 paragraphs match exactly. The 15 that differ are likely ones with OOV words where numerical issues arise. Let me commit the current progress:
The remaining 15 paragraphs with differences all have OOV words where
the chain1 forward-backward produces slightly different marginals due to
the (u-v)+w numerical trick in linear domain. The Haskell
LogFloat library handles this more precisely. Let me
investigate the numerical approach:
The (u-v)+w trick overflows in linear domain for longer
sentences. Let me replace it with a direct log-domain sum (slower but
numerically stable):
Now do the same for the backward computation:
Most paragraphs have only 2-11 differing lines. The diffs are small — likely probability precision differences in the disambiguator. Let me check paragraph 18 (only 2 diffs):
The remaining differences are small probability precision issues in the CRF chain2 (0.0000 vs 0.0000 at the 5th decimal level affecting disamb selection). Let me commit:
Let me also move the completed plans and clean up:
Here’s the summary of where we are:
Complete working pipeline in ~820 lines of Clojure:
- Binary reader (decodeFloat Double encoding)
- Full model loading (~3.5s for 84MB model)
- DAG data structure and TSV I/O
- Feature extraction (3 schemas, 100% codec match)
- CRF chain1 forward-backward (guesser)
- CRF chain2 tiers forward-backward (segmenter + disambiguator)
- Full annoAll pipeline: guess → EOS → segment → disambiguate

Accuracy:
- small-input.dag: byte-identical output
- example-input.dag: 25/40 paragraphs byte-identical; the remaining 15 have minor probability precision differences (~344 diff lines out of 10247)
- ~4.5s processing time for 39 paragraphs
Remaining precision issue: The CRF chain2 memoized
forward-backward accumulates slight differences vs Haskell’s
LogFloat library, affecting 4th-5th decimal place
probabilities. This occasionally flips which tag gets “disamb” when two
tags have nearly equal probabilities, but doesn’t affect tagging
accuracy.
Add a task for the log-float issue
Done. The task is tracked at
plans/doing/20260317111111-logfloat-precision.md.
Proceed with the log-float precision issue
Let me pick a small failing paragraph and trace where the chain2 values diverge. Paragraph 18 has only 2 diff lines — good for debugging.
Both lines show 0.0000 but disagree on disamb. Let me check the raw probabilities:
Our marginals for edge 0 sum to ~6e-10, not 1.0. This is a normalization bug, not just precision. Let me investigate:
Let me write a Haskell helper to compare Z and marginals for paragraph 18:
Wait — the Haskell output ALSO gives 5.9e-10 for edge 0! So this is actually consistent. The difference must be in which interp gets disamb. Let me compare the Haskell disamb selection for edge 0:
Haskell: part = 5.94e-10, adv = 3.82e-13 →
part should be disamb.
But the expected output shows adv 1.0000 disamb. That
means the expected output we’re comparing against must be using a
DIFFERENT paragraph numbering, or the probabilities shown are from a
different source.
Let me check if the expected output for “Gdzie” actually matches the Haskell marginals:
Let me check the actual Haskell disambCRF implementation
— it might use fastTag (Viterbi) instead of marginals-based
tag:
So it uses CRF.tag from
Data.CRF.Chain2.Tiers.DAG. Let me check what
CRF.tag actually does in the DAG module:
tag uses I.fastTag' — NOT
marginals! fastTag' is a proper Viterbi that
traces back through the forward table using argmax. This is
fundamentally different from our per-edge marginal argmax.
The key insight: for DAGs with segmentation ambiguity,
fastTag' finds the globally optimal PATH (which avoids
edges not on the best path), while our per-edge argmax always assigns a
disamb to every edge.
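On a plain chain (ignoring the DAG branching for clarity), the max-product forward + backtrack idea can be sketched like this. All names here are hypothetical, and `score` stands in for the combined transition+emission log-score:

```clojure
;; Viterbi on a plain chain, for illustration only — the real fast-tag
;; operates on a DAG. `score` is a hypothetical (fn [prev-label label pos])
;; returning a log-domain score; prev-label is nil at position 0.
(defn viterbi [n labels score]
  (let [fwd (reduce
              (fn [table pos]
                (assoc table pos
                  ;; for each label, keep the best [score backpointer] pair
                  (into {}
                    (for [y labels]
                      [y (apply max-key first
                           (for [[y' [s _]] (table (dec pos))]
                             [(+ s (score y' y pos)) y']))]))))
              {0 (into {} (for [y labels] [y [(score nil y 0) nil]]))}
              (range 1 n))
        ;; best label at the final position
        y-end (key (apply max-key (comp first val) (fwd (dec n))))]
    ;; backtrack: follow backpointers from the end to position 0
    (loop [pos (dec n), y y-end, path ()]
      (if (nil? y)
        path
        (recur (dec pos) (second (get-in fwd [pos y])) (cons y path))))))
```

The DAG version additionally has to respect edge adjacency, which is exactly why edges off the winning path receive no disamb at all.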
Let me implement a proper Viterbi-based tag function in chain2:
Let me add a fast-tag function that does proper
Viterbi:
Now update disamb-best to use fast-tag:
Significant improvement: 31/40 paragraphs match (up from 25), diff lines down from 344 to 154. The Viterbi-based disamb correctly handles DAG branching.
Let me check what the remaining 9 paragraphs have in common:
The diffs are all on OOV words with apparently identical values. Let me check the full lines:
The lemma (column 4) differs: reference has OSIEŁKA (the
original orth) while we have none. This is because our
complexify4gsr creates interps with
base="none", but for OOV words that already have a
morphological analysis (like OSIEŁKA which has
subst:sg:gen:m1 from the analyzer), the base form should
come from the original analysis, not be replaced with “none”.
Also: the reference has more guessed tags (10 + ign = 11 lines) but we have fewer (7 + ign = 8). The guesser should produce 10 guesses.
The lemma issue: in Haskell, OOV words are tagged by the guesser
which returns P.Tags (no lemma info). Then complexify4gsr
creates voidInterp with base = orth (the
word’s surface form). Let me check:
So voidInterp DOES use base = "none". But
the expected output shows OSIEŁKA not none.
This means the output is NOT using the interp’s base field directly —
it’s using the WORD’s orth when the word is OOV.
Let me recheck the Haskell format printer:
Line 162: if known word then base else orth word! For
OOV words, the base form is replaced with the word’s orth. I need to
apply the same logic:
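In Clojure the fix is a one-line conditional. A sketch, with field names following the data shapes used elsewhere in this session (:known on the word, :base on the interp, :orth on the word):

```clojure
;; Formatter fix: known words print the interp's base form; OOV words fall
;; back to the word's surface form, mirroring the Haskell
;; `if known word then base else orth word`.
(defn output-base [word interp]
  (if (:known word)
    (:base interp)
    (:orth word)))
```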
Now also fix the guessed tag count. The reference shows 10 guessed
tags but we only produce 7. The issue is that our guesser returns fewer
than guess-num (10) distinct P.Tags. Let me check what our
guesser returns for OSIEŁKA:
40/40 paragraphs match exactly! Zero diff lines! The lemma fix was the last piece — OOV words should use the orth form as the base in the output, matching the Haskell format printer.
clj-concraft produces byte-identical output to the Haskell binary on both test inputs.
Summary of what we built:

- ~900 lines of Clojure reimplementing the Concraft morphosyntactic tagger
- Full pipeline: model loading → DAG parsing → guessing → segmentation → disambiguation → formatted output
- Key discoveries: decodeFloat Double encoding, TFeat3-only transition features, Viterbi fast-tag for DAG disamb, OOV base form = orth
- Performance: ~4 seconds for 39 paragraphs (2570 edges), ~3.5s model load
No response requested.
Now write a document detailing what happens, step by step, in clj-concraft when you pass the small input through it. This is to facilitate understanding of the algorithm by a human.
Let me read the current state of the key files to make sure the document is accurate.
Now let me also extract some concrete intermediate values to include in the document:
Now I have all the data I need. Let me write the document:
The document is at doc/walkthrough.md. It traces the
full pipeline with concrete values:
Each step includes the actual intermediate values produced by clj-concraft.
Now update the document with pointers to clj-concraft code that handles the respective phases
Done. Each section now has a Code paragraph pointing to the relevant source files and explaining which functions handle that phase. The summary diagram at the bottom also lists the key source files per step.
This session is being continued from a previous conversation that ran out of context. The summary below covers the earlier portion of the conversation.
Summary: 1. Primary Request and Intent: The user asked to convert
Concraft (a Haskell morphosyntactic tagger for Polish) to Clojure. The
project is clj-concraft, a reimplementation of the inference-only
pipeline. The goal is to load a pre-trained model (~84MB gzip) and
produce byte-identical output to the Haskell binary
concraft-pl. Key instructions are in AGENTS.md: plan ahead,
iterate in small steps, document approach, keep a backlog using the
plans/ directory with Zettelkasten-timestamped files. Validation is
against
./concraft-pl/concraft-pl client < concraft-pl/small-input.dag.
The user explicitly requested detailed algorithm descriptions with
codebase pointers in plans (saved as feedback memory). After achieving
byte-identical output, the user asked for a walkthrough document
explaining the algorithm step-by-step, then asked to update it with code
pointers to clj-concraft source.
Key Technical Concepts:

- decodeFloat Double encoding (25 bytes per double via Integer+Int pair)

Files and Code Sections:
AGENTS.md — Project instructions: reimplementation
scope, validation approach, plan management
deps.edn — Project dependencies:

{:paths ["src"]
 :deps {org.clojure/clojure {:mvn/version "1.12.0"}}}

src/concraft/binary.clj — Haskell Data.Binary format
reader. Key discovery: Doubles use decodeFloat encoding
(not IEEE 754). The read-double function reads an Integer
(tag byte + sign + LE bytes) then an Int64 exponent, and reconstructs
via Math/scalb. Also handles Int64, Word8, Bool, UTF-8
Char, String ([Char]), Text (text-binary), List, Maybe, Map, Set,
IntMap, Vector, UVector.
(defn read-double
  "Read a Haskell Double in decodeFloat encoding: (Integer, Int) pair."
  ^double [^DataInputStream dis]
  (let [significand (read-integer dis)
        exponent (read-int64 dis)]
    (Math/scalb (double significand) (int exponent))))

src/concraft/tagset.clj — Positional tagset reader +
parse-tag/show-tag. Handles unknown POS like “ign” gracefully.
src/concraft/dag.clj — DAG data structure:
{:node-map {NodeID Node}, :edge-map {EdgeID Edge}} with
traversal operations (prev-edges, next-edges, initial-edge?,
final-edge?, map-e, zip-e, fmap, from-edges).
src/concraft/format.clj — DAG text format I/O. Uses
LinkedHashMap (not array-map) to preserve
insertion order when grouping rows by (tail,head). Includes
compare-interp for Haskell Ord-compatible sorting.
src/concraft/schema.clj — Feature extraction.
Reimplements Ox monad (counter from [1]), 8 feature blocks (orth,
lowOrth, lowPrefixes, lowSuffixes, known, shape, packed, begPacked), DAG
shift navigation, schematize function.
src/concraft/positional.clj — Tier/Atom types,
split-tag for decomposing positional tags into per-tier
atoms.
src/concraft/model.clj — Top-level model loader for
DAGSeg variant (version “dagseg:0.11”). Reads tagset, guessNum, guesser,
segmenter, disambiguator.
src/concraft/crf/chain1.clj — CRF chain1 reader +
inference. Forward-backward on DAGs with direct log-domain sum (no
(u-v)+w trick). Key functions: encode-sent (OOV words get
nil lbs → r0), compute-psi, forward,
backward, marginals.
src/concraft/crf/chain2.clj — CRF chain2 tiers
reader + inference. Second-order with memoized recursive
forward/backward indexed by (Pos, Pos). Includes marginals
(sum-product) and fast-tag (Viterbi with backtracking).
Key: transition features use ONLY TFeat3 when all 3 positions known,
ONLY TFeat2 when 2 known, ONLY TFeat1 when 1 known.
src/concraft/guesser.clj — Guesser pipeline: schema
→ encode → CRF chain1 marginals → decode via label codec.
src/concraft/disamb.clj — Disambiguation pipeline:
disamb-probs (marginals or max-probs) and
disamb-best (Viterbi via fast-tag, atom-level comparison
for tied tags, edges off optimal path get all-False).
src/concraft/polish.clj — Full annotation pipeline
anno-all: guess → inject → add-eos-markers → segmenter
disamb-best → resolve-eos → segment → disambiguator marginals +
disamb-best. Output formatter format-annotated-sents with
OOV ign line, OOV base=orth, EOS per-edge flags.
doc/walkthrough.md — Step-by-step walkthrough of the
pipeline on small-input.dag with concrete intermediate values and code
pointers to all source files.
scripts/dump-model-stats.hs — Haskell helper that
prints model structural statistics for validation.
Plans in plans/done/ — Completed:
binary-reader-and-model-loading, dag-io-and-data-structures,
feature-extraction, crf-chain1-inference, fix-output-divergences,
logfloat-precision.
Errors and fixes:

- 01 ff 00 00... not valid IEEE 754. Discovered via hex dumping and a Haskell helper that each double takes 25 bytes. Fixed by implementing read-integer + read-double using Math/scalb.
- .readShort/.readInt returns short/int, not long. Fixed with a (long ...) wrapper.
- unkTagSet contains Interp maps that can't go in sorted-set. Fixed by using a plain set.
- Insertion order lost when grouping rows; fixed with java.util.LinkedHashMap.
- trFeatsOn generates ONLY TFeat3 when all 3 positions are known. This was the root cause of both the EOS resolution failure and the probability swap on edge 1. Fixed by making each cond branch exclusive.
- Wrong oov flag. Fixed by checking (:known (:word seg)).
- exp(u) - exp(v) overflows for large log values (>709). Fixed by replacing it with a direct log-domain sum over all (prev, label) pairs.
- CRF.tag uses fastTag' (Viterbi), not marginal argmax. For DAGs with segmentation ambiguity (e.g., "Gdzie"/"Gdzieś"), per-edge argmax incorrectly marks edges not on the optimal path. Fixed by implementing fast-tag (max-product forward + backtrack).
- if known word then base else orth word. Our code used base unconditionally. Fixed with the same conditional.
- parse-tag threw on unknown POS. Fixed with an early return of {:pos pos :atts (sorted-map)}.
- disamb-best.
- Missing ign line for OOV words with prob 0. Added in format-annotated-sents.

User feedback: "Yes, but before proceeding, write a detailed description of how the algorithm will work, with pointers to parts of Concraft codebase or its dependencies" — Plans should include detailed algorithm descriptions, not just phase outlines. Saved to feedback memory.
Problem Solving: plans/done/.

All user messages:

Pending Tasks: plans/done/.
The logfloat-precision plan was closed after achieving byte-identical
output. The walkthrough document with code pointers was the last
requested task and is complete.

Current Work: The most recently completed task was updating
doc/walkthrough.md with code pointers to clj-concraft
source files. The user asked: “Now update the document with pointers to
clj-concraft code that handles the respective phases”. I added a
Code paragraph to each section of the walkthrough
referencing the specific .clj files and functions, and
updated the summary diagram to include source file references per step.
This was committed as “Add code pointers to walkthrough
document”.
Optional Next Step: No explicit next step requested. The user’s last two requests (write walkthrough, add code pointers) are both complete. The project has achieved its stated goal: byte-identical output on all test inputs. Any further work would need user direction.
If you need specific details from before compaction (like exact code snippets, error messages, or content you generated), read the full transcript at: /Users/nathell/.claude/projects/-Users-nathell-projects-clj-concraft/18186e7e-c5ef-4f09-bd53-b3997672db5b.jsonl
Now please help me understand some implementation details (I’m trying
to read walkthrough.md and follow along). It looks like
concraft.schema/schematize is returning a series of
“external observations” ([Int], Text) for each graph edge. An
observation looks like: [[[1] “z”] [[2] “za”] [[3] “ł”] [[4] “ął”] [[5]
“T”] [[6] “T-ul”]]. My questions:
Great questions. Let me check the schema code to give you precise answers.
Now let me check the Haskell Ox monad to confirm the original design intent.
The Haskell sources aren’t in this repo. But I have enough context from the plan and the code to answer your questions precisely.
1. What do the Ints represent?
They are feature identity tags — they uniquely
identify which schema slot produced this observation. The Int
tells you “this text value came from block X, offset Y, sub-argument Z.”
When the CRF codec later maps ([Int], Text) pairs to
integer observation IDs, observations with the same text but different
[Int] prefixes become different features. So
[3] "ł" (a 1-char suffix at position 0) is a completely
different observation from [4] "ął" (a 2-char suffix at
position 0), even though both are suffixes — and also different from
some hypothetical [7] "ł" that might come from a different
block.
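A hypothetical codec fragment makes this concrete — the [Int] prefix is part of the lookup key, so the same text under different slots maps to different observation IDs (the IDs below are made up for illustration):

```clojure
;; Hypothetical observation codec fragment: keys are ([Int], Text) pairs.
(def obs-codec
  {[[3] "ł"]  101   ;; slot 3: 1-char lowercase suffix
   [[4] "ął"] 102}) ;; slot 4: 2-char lowercase suffix

(get obs-codec [[3] "ł"])  ;; => 101
(get obs-codec [[4] "ł"])  ;; => nil — this (slot, text) pair was never seen
```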
2. Why a list of Ints, not just a single Int?
In the Haskell monad-ox library, the Ox monad supports
nesting — you can have nested save calls
inside sub-monads, which produces multi-element paths like
[1, 3]. In practice, for Concraft’s schema, nesting is
never used, so you always get single-element lists like
[1], [2], etc. The list type is a generality
of the monad-ox API that Concraft doesn’t exercise. So yes, in theory
there could be more elements, but in this application there never
are.
3. Is there significance to them being consecutive?
Only in the sense that they reflect the sequential execution
order of the Ox monad. The counter starts at [1]
and increments by 1 for each save call, including
skipped ones (where the value is nil, e.g., when
oovOnly is true but the word is known). This is why the
counter sometimes has “gaps” in the output — if a block is disabled or a
value is nil, the counter still advances. The consecutiveness itself has
no semantic meaning; what matters is that each slot gets a stable,
unique number so the codec can distinguish features.
In your example
[[[1] "z"] [[2] "za"] [[3] "ł"] [[4] "ął"] [[5] "T"] [[6] "T-ul"]],
this is the guesser schema with:

- [1] = lowPrefix length 1 → "z"
- [2] = lowPrefix length 2 → "za"
- [3] = lowSuffix length 1 → "ł"
- [4] = lowSuffix length 2 → "ął"
- [5] = known → "T"
- [6] = begPacked → "T-ul" (sentence-beginning=T, packed shape=ul)
Let me understand the ox monad a bit more. Could you give me a simple example of how to use ox-save and ox-exec? Doesn’t need to be Concraft-related.
Sure. Think of the Ox monad as a numbered stamp machine: each call to
ox-save stamps the current counter onto the value, then
advances the counter — even if the value is nil.
;; ox-save takes [counter, accumulated-observations] and a value.
;; It returns [new-counter, new-observations].
;; Start state: counter=[1], no observations yet.
(ox-save [[1] []] "apple")
;; => [[2] [[[1] "apple"]]]
;; counter bumped to [2], saved [[1] "apple"]
(ox-save [[2] [[[1] "apple"]]] nil)
;; => [[3] [[[1] "apple"]]]
;; counter bumped to [3], but nil means nothing added
(ox-save [[3] [[[1] "apple"]]] "cherry")
;; => [[4] [[[1] "apple"] [[3] "cherry"]]]
;; note: [2] is missing from output because that slot was nil

ox-exec is just a convenience that threads a sequence of
values through ox-save via reduce:
(ox-exec ["apple" nil "cherry"])
;; => [[[1] "apple"] [[3] "cherry"]]
(ox-exec ["a" "b" "c" "d"])
;; => [[[1] "a"] [[2] "b"] [[3] "c"] [[4] "d"]]
(ox-exec [nil nil "only-this"])
;; => [[[3] "only-this"]]

The key insight: the counter always advances, so the tag number is determined by position in the sequence, not by how many non-nil values came before. This means “slot 3” always means “the third schema slot” regardless of whether slots 1 and 2 produced values for this particular word.
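If it helps, the whole mechanism fits in a few lines. A plausible implementation matching the behaviour of the examples above (the real definitions live in src/concraft/schema.clj and may differ in detail):

```clojure
;; Plausible definitions matching the examples above; real code may differ.
(defn ox-save
  "Stamp the current counter onto v (skipping nil) and advance the counter."
  [[[i] obs] v]
  [[(inc i)]
   (if (some? v) (conj obs [[i] v]) obs)])

(defn ox-exec
  "Thread a sequence of values through ox-save, keeping the observations."
  [vs]
  (second (reduce ox-save [[1] []] vs)))

;; (ox-exec ["apple" nil "cherry"]) => [[[1] "apple"] [[3] "cherry"]]
```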
Ah, so if we wanted to make it more Clojure-idiomatic, we could plausibly capture this behaviour as a transducer that numbers incoming items and filters non-nils? E.g.
(def ox-tx (comp (map-indexed vector) (filter second)))
Should be similar, except starting at 0 and not allowing the nesting. Am I right?
Yes, exactly right. The only differences from the real
ox-exec are the ones you already noted:

- 0-based counters (fixable with (map (fn [[i v]] [(inc i) v])) in the pipeline)
- flat tags rather than singleton lists ([3 "cherry"] not [[3] "cherry"])

Both are cosmetic — since Concraft never uses nesting,
[3] and 3 carry the same information. The
numbering base doesn’t matter either as long as the codec saw the same
base during training, and it did (the model was serialized with
[1]-based keys baked into the observation codec).
So if we weren’t constrained by byte-identical output (which requires matching the codec keys already stored in the model), your transducer would be a clean replacement.
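For completeness, a version of the transducer adjusted to reproduce ox-exec's conventions exactly (1-based, wrapped counters). This is a sketch only — byte-identical output would still hinge on the codec keys baked into the model:

```clojure
;; The transducer with ox-exec's conventions: 1-based counters, wrapped
;; in singleton vectors; nils are numbered but filtered out.
(def ox-tx
  (comp (map-indexed (fn [i v] [[(inc i)] v]))
        (filter second)))

;; (into [] ox-tx ["apple" nil "cherry"]) => [[[1] "apple"] [[3] "cherry"]]
```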