sheetwise package¶

SpreadsheetLLM: A Python Package for Encoding Spreadsheets for Large Language Models

This package implements the key components from the SpreadsheetLLM research: - SheetCompressor: Efficient encoding framework with three modules - Chain of Spreadsheet: Multi-step reasoning approach - Vanilla encoding methods with cell addresses and formats

Additional features include: - Formula extraction and analysis - Multi-sheet workbook support - Advanced table detection - Visualization tools

Based on the research paper: “SpreadsheetLLM: Encoding Spreadsheets for Large Language Models” by Microsoft Research Team

class sheetwise.SpreadsheetLLM(compression_params=None, enable_logging=False)[source]¶

Bases: object

Main class integrating all SheetWise components. Includes Offline SQL and JSON export capabilities.

__init__(compression_params=None, enable_logging=False)[source]¶

Initialize SheetWise framework.

Parameters:

compression_params (Dict[str, Any]) – Parameters for SheetCompressor
enable_logging (bool) – Enable detailed logging for debugging

auto_configure(df)[source]¶

Auto-configure compression parameters.

Return type:: Dict[str, Any]

compress_and_encode_for_llm(df)[source]¶

Original Markdown encoding (retained for compatibility).

Return type:: str

compress_spreadsheet(df)[source]¶

Return type:: Dict[str, Any]

encode_compressed_for_llm(compressed_result)[source]¶

Generate text representation (Markdown).

Return type:: str

encode_to_json(df)[source]¶

Encode compressed spreadsheet data into structured JSON. Ideal for piping into other scripts or APIs.

Return type:: str

encode_vanilla(df, include_format=False)[source]¶

Return type:: str

get_encoding_stats(df)[source]¶

Get basic statistics.

Return type:: Dict[str, Any]

load_from_file(filepath)[source]¶

Load spreadsheet from file with robust type detection. Detects file type using magic numbers (signatures) rather than extensions.

Return type:: DataFrame

process_qa_query(df, query)[source]¶

Process QA query using Chain of Spreadsheet.

Return type:: Dict[str, Any]

query_sql(df, sql_query, params=None)[source]¶

Run a SQL query against the DataFrame using DuckDB with enhanced security.

Parameters:

df (DataFrame) – The dataframe to query (registered as table ‘input_data’)
sql_query (str) – SQL query. Use ‘input_data’ to refer to the dataframe. Example: “SELECT * FROM input_data WHERE Year > ?”
params (Union[list, Dict[str, Any], None]) – Optional parameters for the query to prevent SQL injection. Supports list (for ‘?’) or dict (for ‘$name’) parameters.

Return type:

DataFrame

Returns:

Result as a new DataFrame

class sheetwise.SheetCompressor(k=4, use_extraction=True, use_translation=True, use_aggregation=True)[source]¶

Bases: object

Main compression framework combining all three modules. Optimized for memory efficiency.

__init__(k=4, use_extraction=True, use_translation=True, use_aggregation=True)[source]¶

compress(df, inplace=False)[source]¶

Apply compression pipeline to spreadsheet data.

Parameters:

df (DataFrame) – Input DataFrame
inplace (bool) – If True, attempts to minimize memory copies (CAUTION: modifies data flow) Note: Pandas operations often return copies anyway, but this flag prevents the initial full copy.

Return type:

Dict[str, Any]

Returns:

Compressed representation

class sheetwise.VanillaEncoder[source]¶

Bases: object

Vanilla spreadsheet encoding with cell addresses and formats

encode_to_markdown(df, include_format=False)[source]¶

Encode spreadsheet to Markdown-like format

Parameters:

df (DataFrame) – Input DataFrame
include_format (bool) – Whether to include format information

Return type:

str

Returns:

Markdown-style string representation

class sheetwise.ChainOfSpreadsheet(compressor=None)[source]¶

Bases: object

Implements deterministic ‘Chain of Spreadsheet’ reasoning. Uses fuzzy matching and heuristic scoring instead of LLMs.

__init__(compressor=None)[source]¶

process_query(df, query)[source]¶

Process a query using deterministic search logic.

Parameters:

df (DataFrame) – Input DataFrame
query (str) – Natural language query (e.g., “Revenue in 2023”)

Return type:

Dict[str, Any]

Returns:

Dictionary containing identified tables and search scores.

class sheetwise.CellInfo(address, value, data_type, format_string=None, row=0, col=0)[source]¶

Bases: object

Information about a spreadsheet cell

__init__(address, value, data_type, format_string=None, row=0, col=0)¶

col: int = 0¶

format_string: Optional[str] = None¶

row: int = 0¶

address: str¶

value: Any¶

data_type: str¶

class sheetwise.TableRegion(top_left, bottom_right, rows, cols, confidence=0.0)[source]¶

Bases: object

Represents a detected table region in the spreadsheet

__init__(top_left, bottom_right, rows, cols, confidence=0.0)¶

confidence: float = 0.0¶

top_left: str¶

bottom_right: str¶

rows: range¶

cols: range¶

sheetwise.create_realistic_spreadsheet()[source]¶: Create a realistic large, sparse spreadsheet with multiple tables

class sheetwise.FormulaParser[source]¶

Bases: object

Extracts, analyzes and simplifies Excel formulas from spreadsheets. Optimized for memory usage with streaming reads.

CELL_REF_PATTERN = re.compile('([A-Z]+[0-9]+|[A-Z]+\\:[A-Z]+|[0-9]+\\:[0-9]+|[A-Z]+[0-9]+\\:[A-Z]+[0-9]+)')¶

__init__()[source]¶: Initialize the formula parser.

build_dependency_graph()[source]¶

Build a graph of cell dependencies based on extracted formulas.

Return type:: Dict[str, Set[str]]

encode_formulas_for_llm(formulas=None)[source]¶

Generate LLM-friendly encoding.

Return type:: str

extract_cell_references(formula)[source]¶

Extract all cell references from a formula.

Return type:: List[str]

extract_formulas(excel_path, sheet_name=None)[source]¶

Extract all formulas from an Excel file using Memory-Efficient Streaming.

Parameters:

excel_path (str) – Path to the Excel file
sheet_name (Optional[str]) – Optional specific sheet to parse (saves time)

Return type:

Dict[str, str]

Returns:

Dictionary mapping cell addresses to formulas

get_formula_impact(cell_address)[source]¶

Analyze the impact of a formula cell.

Return type:: Dict[str, Any]

simplify_formula(formula)[source]¶

Generate a simplified explanation using robust tokenization. Handles nested functions better than Regex.

Return type:: str

class sheetwise.FormulaDependencyAnalyzer(formula_parser=None)[source]¶

Bases: object

Specialized analyzer for formula dependencies.

__init__(formula_parser=None)[source]¶

find_calculation_chains()[source]¶

Return type:: List[List[str]]

identify_critical_cells()[source]¶

Return type:: List[str]

class sheetwise.CompressionVisualizer(enable_interactive=True)[source]¶

Bases: object

Visualization tools for spreadsheet compression analysis. Now includes Interactive HTML Reports.

__init__(enable_interactive=True)[source]¶

compare_original_vs_compressed(original_df, compressed_result)[source]¶

create_data_density_heatmap(df, title='Data Density Heatmap')[source]¶

Generate a heatmap showing data density in the spreadsheet.

Return type:: Figure

generate_html_report(original_df, compressed_result)[source]¶

Legacy static report (Backwards compatibility)

Return type:: str

generate_interactive_report(original_df, compressed_result, filename='report.html')[source]¶

Generate a standalone interactive HTML report for auditing compression.

Features: - Side-by-side view (Original vs Compressed) - Highlighted “Removed” regions - Click-to-scroll navigation

Return type:: str

visualize_anchors(df, anchors, title='Structural Anchors')[source]¶

Visualize structural anchors identified in the spreadsheet.

Return type:: Figure

class sheetwise.WorkbookManager[source]¶

Bases: object

Manages multi-sheet workbooks and cross-sheet references.

This class provides utilities to: 1. Load and process entire Excel workbooks with multiple sheets 2. Handle cross-sheet references and relationships 3. Compress entire workbooks 4. Identify inter-sheet relationships

__init__()[source]¶: Initialize the workbook manager.

compress_workbook(compressor)[source]¶

Compress all sheets in the workbook.

Parameters:: compressor – SheetCompressor instance to use for compression
Return type:: Dict[str, Any]
Returns:: Dictionary with compression results for each sheet

detect_cross_sheet_references()[source]¶

Detect references between sheets using formula analysis.

Return type:: Dict[str, Set[str]]
Returns:: Dictionary mapping sheet names to sets of referenced sheets

encode_workbook_for_llm(compression_results)[source]¶

Generate LLM-ready encoding of the entire workbook.

Parameters:: compression_results (Dict[str, Any]) – Output from compress_workbook
Return type:: str
Returns:: LLM-ready text representation of the workbook

get_sheet_importance_ranking()[source]¶

Rank sheets by their importance in the workbook.

Return type:: List[Tuple[str, float]]
Returns:: List of (sheet_name, importance_score) tuples, sorted by importance

get_sheet_relationship_graph()[source]¶

Generate a directed graph of sheet relationships.

Return type:: Dict[str, Any]
Returns:: Dictionary with graph representation of sheet relationships

load_workbook(excel_path)[source]¶

Load all sheets from an Excel workbook.

Parameters:: excel_path (str) – Path to the Excel file
Return type:: Dict[str, DataFrame]
Returns:: Dictionary mapping sheet names to dataframes

class sheetwise.SmartTableDetector(min_table_size=2, max_empty_ratio=0.7, header_detection=True)[source]¶

Bases: object

Advanced table detection with enhanced capabilities.

This class provides utilities to: 1. Detect multiple tables in spreadsheets 2. Identify table headers and structures 3. Classify tables by type 4. Handle complex table layouts

__init__(min_table_size=2, max_empty_ratio=0.7, header_detection=True)[source]¶

Initialize the detector.

Parameters:

min_table_size (int) – Minimum number of rows/columns to consider a table
max_empty_ratio (float) – Maximum ratio of empty cells allowed in a table
header_detection (bool) – Whether to detect headers

detect_tables(df)[source]¶

Detect multiple tables in a spreadsheet.

Parameters:: df (DataFrame) – Input DataFrame
Return type:: List[EnhancedTableRegion]
Returns:: List of detected enhanced table regions

extract_tables_to_dataframes(df)[source]¶

Extract all tables from a spreadsheet into separate dataframes.

Parameters:: df (DataFrame) – Input DataFrame
Return type:: Dict[str, DataFrame]
Returns:: Dictionary mapping table names to extracted DataFrames

class sheetwise.TableType(value)[source]¶

Bases: Enum

Types of tables that can be detected.

DATA_TABLE = 'data_table'¶

PIVOT_TABLE = 'pivot_table'¶

MATRIX = 'matrix'¶

FORM = 'form'¶

MIXED = 'mixed'¶

SPARSE = 'sparse'¶

class sheetwise.EnhancedTableRegion(top_left, bottom_right, rows, cols, confidence=1.0, table_type=TableType.DATA_TABLE, has_headers=False, header_rows=None, header_cols=None)[source]¶

Bases: TableRegion

Extended table region with additional metadata.

__init__(top_left, bottom_right, rows, cols, confidence=1.0, table_type=TableType.DATA_TABLE, has_headers=False, header_rows=None, header_cols=None)¶

confidence: float = 1.0¶

property end_col: int¶

property end_row: int¶

has_headers: bool = False¶

header_cols: List[int] = None¶

header_rows: List[int] = None¶

property start_col: int¶

property start_row: int¶

table_type: TableType = 'data_table'¶

top_left: str¶

bottom_right: str¶

rows: range¶

cols: range¶

sheetwise package¶

Submodules¶