sheetwise package

SpreadsheetLLM: A Python Package for Encoding Spreadsheets for Large Language Models

This package implements the key components from the SpreadsheetLLM research: - SheetCompressor: Efficient encoding framework with three modules - Chain of Spreadsheet: Multi-step reasoning approach - Vanilla encoding methods with cell addresses and formats

Additional features include: - Formula extraction and analysis - Multi-sheet workbook support - Advanced table detection - Visualization tools

Based on the research paper: “SpreadsheetLLM: Encoding Spreadsheets for Large Language Models” by Microsoft Research Team

class sheetwise.SpreadsheetLLM(compression_params=None, enable_logging=False)[source]

Bases: object

Main class integrating all SheetWise components. Includes Offline SQL and JSON export capabilities.

__init__(compression_params=None, enable_logging=False)[source]

Initialize SheetWise framework.

Parameters:
  • compression_params (Dict[str, Any]) – Parameters for SheetCompressor

  • enable_logging (bool) – Enable detailed logging for debugging

auto_configure(df)[source]

Auto-configure compression parameters.

Return type:

Dict[str, Any]

compress_and_encode_for_llm(df)[source]

Original Markdown encoding (retained for compatibility).

Return type:

str

compress_spreadsheet(df)[source]
Return type:

Dict[str, Any]

encode_compressed_for_llm(compressed_result)[source]

Generate text representation (Markdown).

Return type:

str

encode_to_json(df)[source]

Encode compressed spreadsheet data into structured JSON. Ideal for piping into other scripts or APIs.

Return type:

str

encode_vanilla(df, include_format=False)[source]
Return type:

str

get_encoding_stats(df)[source]

Get basic statistics.

Return type:

Dict[str, Any]

load_from_file(filepath)[source]

Load spreadsheet from file with robust type detection. Detects file type using magic numbers (signatures) rather than extensions.

Return type:

DataFrame

process_qa_query(df, query)[source]

Process QA query using Chain of Spreadsheet.

Return type:

Dict[str, Any]

query_sql(df, sql_query, params=None)[source]

Run a SQL query against the DataFrame using DuckDB with enhanced security.

Parameters:
  • df (DataFrame) – The dataframe to query (registered as table ‘input_data’)

  • sql_query (str) – SQL query. Use ‘input_data’ to refer to the dataframe. Example: “SELECT * FROM input_data WHERE Year > ?”

  • params (Union[list, Dict[str, Any], None]) – Optional parameters for the query to prevent SQL injection. Supports list (for ‘?’) or dict (for ‘$name’) parameters.

Return type:

DataFrame

Returns:

Result as a new DataFrame

class sheetwise.SheetCompressor(k=4, use_extraction=True, use_translation=True, use_aggregation=True)[source]

Bases: object

Main compression framework combining all three modules. Optimized for memory efficiency.

__init__(k=4, use_extraction=True, use_translation=True, use_aggregation=True)[source]
compress(df, inplace=False)[source]

Apply compression pipeline to spreadsheet data.

Parameters:
  • df (DataFrame) – Input DataFrame

  • inplace (bool) – If True, attempts to minimize memory copies (CAUTION: modifies data flow) Note: Pandas operations often return copies anyway, but this flag prevents the initial full copy.

Return type:

Dict[str, Any]

Returns:

Compressed representation

class sheetwise.VanillaEncoder[source]

Bases: object

Vanilla spreadsheet encoding with cell addresses and formats

encode_to_markdown(df, include_format=False)[source]

Encode spreadsheet to Markdown-like format

Parameters:
  • df (DataFrame) – Input DataFrame

  • include_format (bool) – Whether to include format information

Return type:

str

Returns:

Markdown-style string representation

class sheetwise.ChainOfSpreadsheet(compressor=None)[source]

Bases: object

Implements deterministic ‘Chain of Spreadsheet’ reasoning. Uses fuzzy matching and heuristic scoring instead of LLMs.

__init__(compressor=None)[source]
process_query(df, query)[source]

Process a query using deterministic search logic.

Parameters:
  • df (DataFrame) – Input DataFrame

  • query (str) – Natural language query (e.g., “Revenue in 2023”)

Return type:

Dict[str, Any]

Returns:

Dictionary containing identified tables and search scores.

class sheetwise.CellInfo(address, value, data_type, format_string=None, row=0, col=0)[source]

Bases: object

Information about a spreadsheet cell

__init__(address, value, data_type, format_string=None, row=0, col=0)
col: int = 0
format_string: Optional[str] = None
row: int = 0
address: str
value: Any
data_type: str
class sheetwise.TableRegion(top_left, bottom_right, rows, cols, confidence=0.0)[source]

Bases: object

Represents a detected table region in the spreadsheet

__init__(top_left, bottom_right, rows, cols, confidence=0.0)
confidence: float = 0.0
top_left: str
bottom_right: str
rows: range
cols: range
sheetwise.create_realistic_spreadsheet()[source]

Create a realistic large, sparse spreadsheet with multiple tables

class sheetwise.FormulaParser[source]

Bases: object

Extracts, analyzes and simplifies Excel formulas from spreadsheets. Optimized for memory usage with streaming reads.

CELL_REF_PATTERN = re.compile('([A-Z]+[0-9]+|[A-Z]+\\:[A-Z]+|[0-9]+\\:[0-9]+|[A-Z]+[0-9]+\\:[A-Z]+[0-9]+)')
__init__()[source]

Initialize the formula parser.

build_dependency_graph()[source]

Build a graph of cell dependencies based on extracted formulas.

Return type:

Dict[str, Set[str]]

encode_formulas_for_llm(formulas=None)[source]

Generate LLM-friendly encoding.

Return type:

str

extract_cell_references(formula)[source]

Extract all cell references from a formula.

Return type:

List[str]

extract_formulas(excel_path, sheet_name=None)[source]

Extract all formulas from an Excel file using Memory-Efficient Streaming.

Parameters:
  • excel_path (str) – Path to the Excel file

  • sheet_name (Optional[str]) – Optional specific sheet to parse (saves time)

Return type:

Dict[str, str]

Returns:

Dictionary mapping cell addresses to formulas

get_formula_impact(cell_address)[source]

Analyze the impact of a formula cell.

Return type:

Dict[str, Any]

simplify_formula(formula)[source]

Generate a simplified explanation using robust tokenization. Handles nested functions better than Regex.

Return type:

str

class sheetwise.FormulaDependencyAnalyzer(formula_parser=None)[source]

Bases: object

Specialized analyzer for formula dependencies.

__init__(formula_parser=None)[source]
find_calculation_chains()[source]
Return type:

List[List[str]]

identify_critical_cells()[source]
Return type:

List[str]

class sheetwise.CompressionVisualizer(enable_interactive=True)[source]

Bases: object

Visualization tools for spreadsheet compression analysis. Now includes Interactive HTML Reports.

__init__(enable_interactive=True)[source]
compare_original_vs_compressed(original_df, compressed_result)[source]
create_data_density_heatmap(df, title='Data Density Heatmap')[source]

Generate a heatmap showing data density in the spreadsheet.

Return type:

Figure

generate_html_report(original_df, compressed_result)[source]

Legacy static report (Backwards compatibility)

Return type:

str

generate_interactive_report(original_df, compressed_result, filename='report.html')[source]

Generate a standalone interactive HTML report for auditing compression.

Features: - Side-by-side view (Original vs Compressed) - Highlighted “Removed” regions - Click-to-scroll navigation

Return type:

str

visualize_anchors(df, anchors, title='Structural Anchors')[source]

Visualize structural anchors identified in the spreadsheet.

Return type:

Figure

class sheetwise.WorkbookManager[source]

Bases: object

Manages multi-sheet workbooks and cross-sheet references.

This class provides utilities to: 1. Load and process entire Excel workbooks with multiple sheets 2. Handle cross-sheet references and relationships 3. Compress entire workbooks 4. Identify inter-sheet relationships

__init__()[source]

Initialize the workbook manager.

compress_workbook(compressor)[source]

Compress all sheets in the workbook.

Parameters:

compressor – SheetCompressor instance to use for compression

Return type:

Dict[str, Any]

Returns:

Dictionary with compression results for each sheet

detect_cross_sheet_references()[source]

Detect references between sheets using formula analysis.

Return type:

Dict[str, Set[str]]

Returns:

Dictionary mapping sheet names to sets of referenced sheets

encode_workbook_for_llm(compression_results)[source]

Generate LLM-ready encoding of the entire workbook.

Parameters:

compression_results (Dict[str, Any]) – Output from compress_workbook

Return type:

str

Returns:

LLM-ready text representation of the workbook

get_sheet_importance_ranking()[source]

Rank sheets by their importance in the workbook.

Return type:

List[Tuple[str, float]]

Returns:

List of (sheet_name, importance_score) tuples, sorted by importance

get_sheet_relationship_graph()[source]

Generate a directed graph of sheet relationships.

Return type:

Dict[str, Any]

Returns:

Dictionary with graph representation of sheet relationships

load_workbook(excel_path)[source]

Load all sheets from an Excel workbook.

Parameters:

excel_path (str) – Path to the Excel file

Return type:

Dict[str, DataFrame]

Returns:

Dictionary mapping sheet names to dataframes

class sheetwise.SmartTableDetector(min_table_size=2, max_empty_ratio=0.7, header_detection=True)[source]

Bases: object

Advanced table detection with enhanced capabilities.

This class provides utilities to: 1. Detect multiple tables in spreadsheets 2. Identify table headers and structures 3. Classify tables by type 4. Handle complex table layouts

__init__(min_table_size=2, max_empty_ratio=0.7, header_detection=True)[source]

Initialize the detector.

Parameters:
  • min_table_size (int) – Minimum number of rows/columns to consider a table

  • max_empty_ratio (float) – Maximum ratio of empty cells allowed in a table

  • header_detection (bool) – Whether to detect headers

detect_tables(df)[source]

Detect multiple tables in a spreadsheet.

Parameters:

df (DataFrame) – Input DataFrame

Return type:

List[EnhancedTableRegion]

Returns:

List of detected enhanced table regions

extract_tables_to_dataframes(df)[source]

Extract all tables from a spreadsheet into separate dataframes.

Parameters:

df (DataFrame) – Input DataFrame

Return type:

Dict[str, DataFrame]

Returns:

Dictionary mapping table names to extracted DataFrames

class sheetwise.TableType(value)[source]

Bases: Enum

Types of tables that can be detected.

DATA_TABLE = 'data_table'
PIVOT_TABLE = 'pivot_table'
MATRIX = 'matrix'
FORM = 'form'
MIXED = 'mixed'
SPARSE = 'sparse'
class sheetwise.EnhancedTableRegion(top_left, bottom_right, rows, cols, confidence=1.0, table_type=TableType.DATA_TABLE, has_headers=False, header_rows=None, header_cols=None)[source]

Bases: TableRegion

Extended table region with additional metadata.

__init__(top_left, bottom_right, rows, cols, confidence=1.0, table_type=TableType.DATA_TABLE, has_headers=False, header_rows=None, header_cols=None)
confidence: float = 1.0
property end_col: int
property end_row: int
has_headers: bool = False
header_cols: List[int] = None
header_rows: List[int] = None
property start_col: int
property start_row: int
table_type: TableType = 'data_table'
top_left: str
bottom_right: str
rows: range
cols: range

Submodules