RAG & AI Guide

How to Convert Lotus 1-2-3 Spreadsheets for RAG Pipelines and LLM Ingestion

A practical guide to turning legacy .123, .wk1, .wk3, and .wk4 archives into AI-ready tabular data — locally, securely, and at scale.

TL;DR

Convert Lotus 1-2-3 workbooks to CSV for value-only analytics and embeddings, and to Markdown when you want LLMs to keep table structure. Avoid PDF as an intermediate — it destroys the grid the model actually needs. Lotus Converter does this locally in bulk so regulated rows never leave your network.

The Problem: Legacy Spreadsheets Are Invisible to AI

If your finance, ops, or engineering team has been around for 20+ years, a non-trivial slice of your institutional knowledge is still in Lotus 1-2-3 files: chart-of-accounts ledgers, actuarial tables, plant cost models, rate-case workpapers, customer pricing books. These binary workbooks cannot be read by modern AI systems, vector databases, or embedding models. To an LLM, your historical numbers simply do not exist.

Building a RAG (Retrieval-Augmented Generation) system or private LLM that ignores your legacy spreadsheets means your AI is missing decades of quantitative context — the very numbers analysts ask follow-up questions about.

Step-by-Step: Lotus 1-2-3 to Vector Database

The workflow for making legacy Lotus workbooks AI-ready:

Step 1: Inventory Your Source Archives

Locate your .123, .wk1, .wk3, .wk4, .wks, .wb1, .wb2, and .wb3 files. They're typically scattered across departmental network drives, retired file servers, and backup media. Lotus Converter scans entire folder trees recursively and can even detect Lotus files that have lost their extension by inspecting header bytes.
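The inventory pass above can be sketched in a few lines of Python. This is a simplified stand-in, not Lotus Converter's actual detection logic: the header check below only tests that the file begins with a 0x0000 BOF record type word (two zero bytes), which Lotus 1-2-3 family files share, whereas real detection inspects the full BOF record.

```python
from pathlib import Path

# Extensions named in this guide. Header sniffing below is a simplified
# illustration -- production detection reads the full BOF record.
LOTUS_EXTS = {".123", ".wk1", ".wk3", ".wk4", ".wks", ".wb1", ".wb2", ".wb3"}

def looks_like_lotus(path: Path) -> bool:
    """Heuristic: Lotus workbooks open with a BOF record whose type word
    is 0x0000 (little-endian), so the first two bytes are 00 00."""
    try:
        head = path.read_bytes()[:4]
    except OSError:
        return False
    return len(head) == 4 and head[:2] == b"\x00\x00"

def inventory(root: str) -> list[Path]:
    """Recursively collect files matching Lotus extensions, plus
    extensionless files whose header bytes look like a Lotus BOF."""
    found = []
    for p in Path(root).rglob("*"):
        if not p.is_file():
            continue
        if p.suffix.lower() in LOTUS_EXTS or (not p.suffix and looks_like_lotus(p)):
            found.append(p)
    return sorted(found)
```

Running this over a departmental share gives you the candidate list before any conversion happens.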

Step 2: Batch-Convert to CSV and Markdown

Run Lotus Converter against the archive. Pick CSV when each sheet is one logical table and you want maximum token efficiency. Pick Markdown when the workbook has captions, totals, and notes that an LLM should read alongside the grid. Everything happens locally — no workbooks leave your machine.
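To make the CSV-versus-Markdown trade-off concrete: once a sheet is out as CSV, turning it into a Markdown table for LLM ingestion is trivial. A stdlib-only sketch (a real pipeline might use pandas instead):

```python
import csv
from io import StringIO

def csv_to_markdown(csv_text: str) -> str:
    """Render a converted CSV sheet as a Markdown table so the LLM
    keeps the header/row structure alongside any surrounding notes."""
    rows = list(csv.reader(StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

sample = "Year,Steel ($/ton)\n1998,310\n1999,295"
print(csv_to_markdown(sample))
```

CSV stays the more token-efficient form; the Markdown rendering is worth the extra tokens when the grid needs to sit next to captions and totals.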

Step 3: Chunk Per Sheet, Not Per File

A single .wk4 file usually contains multiple worksheets. Treat each sheet as its own document for chunking — that keeps row context coherent and prevents the "summary tab" from polluting the "detail tab" embeddings. Markdown headings and CSV file names give you natural breakpoints.
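A minimal sketch of per-sheet chunking, assuming the converter writes one CSV per sheet with a hypothetical `workbook__sheet.csv` naming scheme (your converter's actual naming may differ):

```python
from pathlib import Path

def chunk_per_sheet(export_dir: str) -> list[dict]:
    """One retrieval document per converted sheet, not per workbook,
    so the summary tab never bleeds into the detail tab's embeddings."""
    chunks = []
    for f in sorted(Path(export_dir).glob("*.csv")):
        # Assumed naming convention: plant_costs__Summary.csv
        workbook, _, sheet = f.stem.partition("__")
        chunks.append({
            "id": f.stem,
            "text": f.read_text(encoding="utf-8"),
            "metadata": {"workbook": workbook,
                         "sheet": sheet or "Sheet1",
                         "source": str(f)},
        })
    return chunks
```

Each dict becomes one document in the embedding step, with its sheet identity preserved as metadata.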

Step 4: Generate Embeddings

Run your chunks through an embedding model (OpenAI, Cohere, or local models such as Sentence-BERT and BGE). Clean tabular text in means higher-quality vectors out, which means better numeric retrieval. Tag each chunk with file, sheet, year, and business unit so retrieval can filter on metadata.
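The shape of this step, with a deliberately crude hashing stand-in where a real embedding model (OpenAI, Cohere, Sentence-BERT, BGE) would go -- the point is the metadata plumbing, not the vector math:

```python
import hashlib
import math

def embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: hash tokens into a fixed-size bag-of-words
    vector and L2-normalize. Swap in a real model in production."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def embed_chunks(chunks: list[dict]) -> list[dict]:
    """Attach a vector to each chunk while keeping its file/sheet/year/
    business-unit metadata intact for filtered retrieval later."""
    return [{**c, "vector": embed(c["text"])} for c in chunks]
```

Whatever model you substitute, keep the metadata dict riding alongside the vector; it is what lets retrieval answer "1998 only" questions cheaply.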

Step 5: Load Into Your Vector Store

Store embeddings in Pinecone, Weaviate, Chroma, pgvector, or any vector database. Your legacy quantitative knowledge is now queryable by your RAG pipeline alongside modern XLSX, DOCX, and PDF sources.
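What the store has to do is simple enough to show in miniature. This toy in-memory class is a stand-in for Pinecone, Weaviate, Chroma, or pgvector (each of which has its own client API): brute-force cosine search with an optional metadata filter.

```python
def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5) or 1.0
    return num / den

class MiniVectorStore:
    """Toy stand-in for a real vector database: upsert chunks,
    then query by vector with an optional exact-match metadata filter."""
    def __init__(self):
        self.items = []

    def upsert(self, item: dict):  # item: {"id", "vector", "text", "metadata"}
        self.items.append(item)

    def query(self, vector, top_k=3, where=None):
        pool = [i for i in self.items
                if not where
                or all(i["metadata"].get(k) == v for k, v in where.items())]
        return sorted(pool, key=lambda i: cosine(vector, i["vector"]),
                      reverse=True)[:top_k]
```

Real stores add persistence and approximate-nearest-neighbor indexes, but the upsert/filter/query contract is the same one your RAG pipeline will code against.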

Step 6: Query with RAG (and Tools)

When an analyst asks "what did our 1998 plant cost model assume for steel input prices?", your RAG system retrieves the relevant CSV/Markdown chunks. For numeric reasoning, pair retrieval with a sandboxed code-interpreter tool so the model can actually re-run the math instead of guessing.
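The tool-side half of that pairing is worth spelling out. Once retrieval surfaces the right CSV chunk, a code-interpreter tool re-runs the arithmetic over the actual rows instead of letting the model estimate from the text. A stdlib sketch (a sandboxed interpreter would typically reach for pandas or DuckDB):

```python
import csv
from io import StringIO

def recompute_mean(chunk_text: str, column: str) -> float:
    """Re-run a numeric aggregate over the retrieved CSV chunk so the
    answer comes from the data, not from the model's guess."""
    rows = list(csv.DictReader(StringIO(chunk_text)))
    values = [float(r[column]) for r in rows]
    return sum(values) / len(values)

chunk = "Year,Steel ($/ton)\n1998,310\n1999,295\n"
print(recompute_mean(chunk, "Steel ($/ton)"))  # 302.5
```

The retrieved chunk grounds the answer; the tool call makes the number exact.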

Format Comparison: Which Output is Best for AI?

Not all spreadsheet outputs are created equal for LLM ingestion. Here's how the common targets compare:

| Factor | CSV | Markdown | XLSX | PDF | Raw .wk4 / .123 |
| --- | --- | --- | --- | --- | --- |
| LLM Readability | Excellent | Excellent | Good (needs parser) | Fair | None |
| Token Efficiency | Highest | High | Low (XML overhead) | Low (extraction noise) | N/A |
| Structure Preservation | Rows and columns | Tables, headings, captions | Sheets, formulas, formats | Layout-dependent | Binary format |
| Embedding Quality | High | High | Medium (after parsing) | Medium (noisy) | N/A |
| Tool / Code-Interpreter | Native (pandas, DuckDB) | Convertible | Native (openpyxl) | Requires OCR | No tooling |
| Processing Complexity | Direct ingestion | Direct ingestion | XLSX parser | PDF parser / OCR | Needs Lotus library |
| Best For | RAG over numeric tables; agents with code tools | RAG over annotated workbooks with notes | Re-using the workbook in Excel | Human reading, archiving | Nothing (legacy only) |

What's the best format to feed legacy spreadsheets into an LLM?

CSV for raw tabular numerics where each sheet is a clean table and you want analysts' agents to query with pandas or DuckDB. Markdown for workbooks where comments, totals, and section headings carry meaning the model should keep. Avoid PDF as an intermediate — PDF extraction destroys the row/column grid that makes spreadsheets useful to AI.

Why Local Processing Matters for AI Pipelines

Most teams want to build private RAG systems specifically to keep regulated data — financial models, customer pricing, employee records — off third-party servers. Using a cloud-based converter to prepare those files for a private AI defeats the purpose. Lotus Converter processes everything on your machine, maintaining a complete chain of custody from legacy workbook to vector database.

This is especially critical for banks and credit unions (GLBA), healthcare and benefits admins (HIPAA), regulated utilities and public-sector records (rate cases and FOIA), and any enterprise with SOX or GDPR obligations.


Ready to make your legacy spreadsheets AI-ready?

Download the free trial and convert up to 25 workbooks. See how quickly Lotus 1-2-3 becomes clean, structured CSV and Markdown for your RAG pipeline.

Free trial: full app features, up to 25 files, on Windows 10 or 11.

The same free trial is available whether you install from the Microsoft Store (recommended) or the offline MSI (suited to air-gapped or scripted deployments; coming soon) - pick the option that fits your PC or IT policy.