sheetwise

SpreadsheetLLM: A Python Package for Encoding Spreadsheets for Large Language Models

This package implements the key components from the SpreadsheetLLM research: - SheetCompressor: Efficient encoding framework with three modules - Chain of Spreadsheet: Multi-step reasoning approach - Vanilla encoding methods with cell addresses and formats

Additional features include: - Formula extraction and analysis - Multi-sheet workbook support - Advanced table detection - Visualization tools

Based on the research paper: “SpreadsheetLLM: Encoding Spreadsheets for Large Language Models” by Microsoft Research Team

class sheetwise.SpreadsheetLLM(compression_params=None, enable_logging=False)[source]

Main class integrating all SheetWise components. Includes Offline SQL and JSON export capabilities.

__init__(compression_params=None, enable_logging=False)[source]

Initialize SheetWise framework.

Parameters:
  • compression_params (Dict[str, Any]) – Parameters for SheetCompressor

  • enable_logging (bool) – Enable detailed logging for debugging

auto_configure(df)[source]

Auto-configure compression parameters.

Return type:

Dict[str, Any]

compress_and_encode_for_llm(df)[source]

Original Markdown encoding (retained for compatibility).

Return type:

str

compress_spreadsheet(df)[source]
Return type:

Dict[str, Any]

encode_compressed_for_llm(compressed_result)[source]

Generate text representation (Markdown).

Return type:

str

encode_to_json(df)[source]

Encode compressed spreadsheet data into structured JSON. Ideal for piping into other scripts or APIs.

Return type:

str

encode_vanilla(df, include_format=False)[source]
Return type:

str

get_encoding_stats(df)[source]

Get basic statistics.

Return type:

Dict[str, Any]

load_from_file(filepath)[source]

Load spreadsheet from file with robust type detection. Detects file type using magic numbers (signatures) rather than extensions.

Return type:

DataFrame

process_qa_query(df, query)[source]

Process QA query using Chain of Spreadsheet.

Return type:

Dict[str, Any]

query_sql(df, sql_query, params=None)[source]

Run a SQL query against the DataFrame using DuckDB with enhanced security.

Parameters:
  • df (DataFrame) – The dataframe to query (registered as table ‘input_data’)

  • sql_query (str) – SQL query. Use ‘input_data’ to refer to the dataframe. Example: “SELECT * FROM input_data WHERE Year > ?”

  • params (Union[list, Dict[str, Any], None]) – Optional parameters for the query to prevent SQL injection. Supports list (for ‘?’) or dict (for ‘$name’) parameters.

Return type:

DataFrame

Returns:

Result as a new DataFrame

class sheetwise.SheetCompressor(k=4, use_extraction=True, use_translation=True, use_aggregation=True)[source]

Main compression framework combining all three modules. Optimized for memory efficiency.

__init__(k=4, use_extraction=True, use_translation=True, use_aggregation=True)[source]
compress(df, inplace=False)[source]

Apply compression pipeline to spreadsheet data.

Parameters:
  • df (DataFrame) – Input DataFrame

  • inplace (bool) – If True, attempts to minimize memory copies (CAUTION: modifies data flow) Note: Pandas operations often return copies anyway, but this flag prevents the initial full copy.

Return type:

Dict[str, Any]

Returns:

Compressed representation

class sheetwise.VanillaEncoder[source]

Vanilla spreadsheet encoding with cell addresses and formats

encode_to_markdown(df, include_format=False)[source]

Encode spreadsheet to Markdown-like format

Parameters:
  • df (DataFrame) – Input DataFrame

  • include_format (bool) – Whether to include format information

Return type:

str

Returns:

Markdown-style string representation

class sheetwise.ChainOfSpreadsheet(compressor=None)[source]

Implements deterministic ‘Chain of Spreadsheet’ reasoning. Uses fuzzy matching and heuristic scoring instead of LLMs.

__init__(compressor=None)[source]
process_query(df, query)[source]

Process a query using deterministic search logic.

Parameters:
  • df (DataFrame) – Input DataFrame

  • query (str) – Natural language query (e.g., “Revenue in 2023”)

Return type:

Dict[str, Any]

Returns:

Dictionary containing identified tables and search scores.

class sheetwise.CellInfo(address, value, data_type, format_string=None, row=0, col=0)[source]

Information about a spreadsheet cell

__init__(address, value, data_type, format_string=None, row=0, col=0)
col: int = 0
format_string: Optional[str] = None
row: int = 0
address: str
value: Any
data_type: str
class sheetwise.TableRegion(top_left, bottom_right, rows, cols, confidence=0.0)[source]

Represents a detected table region in the spreadsheet

__init__(top_left, bottom_right, rows, cols, confidence=0.0)
confidence: float = 0.0
top_left: str
bottom_right: str
rows: range
cols: range
sheetwise.create_realistic_spreadsheet()[source]

Create a realistic large, sparse spreadsheet with multiple tables

class sheetwise.FormulaParser[source]

Extracts, analyzes and simplifies Excel formulas from spreadsheets. Optimized for memory usage with streaming reads.

CELL_REF_PATTERN = re.compile('([A-Z]+[0-9]+|[A-Z]+\\:[A-Z]+|[0-9]+\\:[0-9]+|[A-Z]+[0-9]+\\:[A-Z]+[0-9]+)')
__init__()[source]

Initialize the formula parser.

build_dependency_graph()[source]

Build a graph of cell dependencies based on extracted formulas.

Return type:

Dict[str, Set[str]]

encode_formulas_for_llm(formulas=None)[source]

Generate LLM-friendly encoding.

Return type:

str

extract_cell_references(formula)[source]

Extract all cell references from a formula.

Return type:

List[str]

extract_formulas(excel_path, sheet_name=None)[source]

Extract all formulas from an Excel file using Memory-Efficient Streaming.

Parameters:
  • excel_path (str) – Path to the Excel file

  • sheet_name (Optional[str]) – Optional specific sheet to parse (saves time)

Return type:

Dict[str, str]

Returns:

Dictionary mapping cell addresses to formulas

get_formula_impact(cell_address)[source]

Analyze the impact of a formula cell.

Return type:

Dict[str, Any]

simplify_formula(formula)[source]

Generate a simplified explanation using robust tokenization. Handles nested functions better than Regex.

Return type:

str

class sheetwise.FormulaDependencyAnalyzer(formula_parser=None)[source]

Specialized analyzer for formula dependencies.

__init__(formula_parser=None)[source]
find_calculation_chains()[source]
Return type:

List[List[str]]

identify_critical_cells()[source]
Return type:

List[str]

class sheetwise.CompressionVisualizer(enable_interactive=True)[source]

Visualization tools for spreadsheet compression analysis. Now includes Interactive HTML Reports.

__init__(enable_interactive=True)[source]
compare_original_vs_compressed(original_df, compressed_result)[source]
create_data_density_heatmap(df, title='Data Density Heatmap')[source]

Generate a heatmap showing data density in the spreadsheet.

Return type:

Figure

generate_html_report(original_df, compressed_result)[source]

Legacy static report (Backwards compatibility)

Return type:

str

generate_interactive_report(original_df, compressed_result, filename='report.html')[source]

Generate a standalone interactive HTML report for auditing compression.

Features: - Side-by-side view (Original vs Compressed) - Highlighted “Removed” regions - Click-to-scroll navigation

Return type:

str

visualize_anchors(df, anchors, title='Structural Anchors')[source]

Visualize structural anchors identified in the spreadsheet.

Return type:

Figure

class sheetwise.WorkbookManager[source]

Manages multi-sheet workbooks and cross-sheet references.

This class provides utilities to: 1. Load and process entire Excel workbooks with multiple sheets 2. Handle cross-sheet references and relationships 3. Compress entire workbooks 4. Identify inter-sheet relationships

__init__()[source]

Initialize the workbook manager.

compress_workbook(compressor)[source]

Compress all sheets in the workbook.

Parameters:

compressor – SheetCompressor instance to use for compression

Return type:

Dict[str, Any]

Returns:

Dictionary with compression results for each sheet

detect_cross_sheet_references()[source]

Detect references between sheets using formula analysis.

Return type:

Dict[str, Set[str]]

Returns:

Dictionary mapping sheet names to sets of referenced sheets

encode_workbook_for_llm(compression_results)[source]

Generate LLM-ready encoding of the entire workbook.

Parameters:

compression_results (Dict[str, Any]) – Output from compress_workbook

Return type:

str

Returns:

LLM-ready text representation of the workbook

get_sheet_importance_ranking()[source]

Rank sheets by their importance in the workbook.

Return type:

List[Tuple[str, float]]

Returns:

List of (sheet_name, importance_score) tuples, sorted by importance

get_sheet_relationship_graph()[source]

Generate a directed graph of sheet relationships.

Return type:

Dict[str, Any]

Returns:

Dictionary with graph representation of sheet relationships

load_workbook(excel_path)[source]

Load all sheets from an Excel workbook.

Parameters:

excel_path (str) – Path to the Excel file

Return type:

Dict[str, DataFrame]

Returns:

Dictionary mapping sheet names to dataframes

class sheetwise.SmartTableDetector(min_table_size=2, max_empty_ratio=0.7, header_detection=True)[source]

Advanced table detection with enhanced capabilities.

This class provides utilities to: 1. Detect multiple tables in spreadsheets 2. Identify table headers and structures 3. Classify tables by type 4. Handle complex table layouts

__init__(min_table_size=2, max_empty_ratio=0.7, header_detection=True)[source]

Initialize the detector.

Parameters:
  • min_table_size (int) – Minimum number of rows/columns to consider a table

  • max_empty_ratio (float) – Maximum ratio of empty cells allowed in a table

  • header_detection (bool) – Whether to detect headers

detect_tables(df)[source]

Detect multiple tables in a spreadsheet.

Parameters:

df (DataFrame) – Input DataFrame

Return type:

List[EnhancedTableRegion]

Returns:

List of detected enhanced table regions

extract_tables_to_dataframes(df)[source]

Extract all tables from a spreadsheet into separate dataframes.

Parameters:

df (DataFrame) – Input DataFrame

Return type:

Dict[str, DataFrame]

Returns:

Dictionary mapping table names to extracted DataFrames

class sheetwise.TableType(value)[source]

Types of tables that can be detected.

DATA_TABLE = 'data_table'
PIVOT_TABLE = 'pivot_table'
MATRIX = 'matrix'
FORM = 'form'
MIXED = 'mixed'
SPARSE = 'sparse'
class sheetwise.EnhancedTableRegion(top_left, bottom_right, rows, cols, confidence=1.0, table_type=TableType.DATA_TABLE, has_headers=False, header_rows=None, header_cols=None)[source]

Extended table region with additional metadata.

__init__(top_left, bottom_right, rows, cols, confidence=1.0, table_type=TableType.DATA_TABLE, has_headers=False, header_rows=None, header_cols=None)
confidence: float = 1.0
property end_col: int
property end_row: int
has_headers: bool = False
header_cols: List[int] = None
header_rows: List[int] = None
property start_col: int
property start_row: int
table_type: TableType = 'data_table'

Modules

chain

Chain of Spreadsheet reasoning implementation (Offline Edition).

classifiers

Data type classification utilities for spreadsheet cells.

cli

Command line interface for SheetWise.

compressor

Main compression framework combining all modules.

core

Main SpreadsheetLLM class integrating all components (Offline Edition).

data_types

Data types and structures used throughout the SpreadsheetLLM package.

detectors

Table detection utilities.

encoders

Encoding utilities for spreadsheet data.

extractors

Compression modules for SpreadsheetLLM framework (Enhanced).

formula_parser

Formula parsing and analysis utilities for spreadsheets.

smart_tables

Advanced table detection and classification utilities.

utils

Utility functions and demo data creation.

visualizer

Visualization utilities for spreadsheet compression.

workbook

Multi-sheet workbook handling and cross-sheet reference management.