<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Telio Blog]]></title><description><![CDATA[The latest blog posts, announcements, and guides.]]></description><link>https://blog.gettelio.com/</link><image><url>https://blog.gettelio.com/favicon.png</url><title>Telio Blog</title><link>https://blog.gettelio.com/</link></image><generator>Ghost 5.75</generator><lastBuildDate>Mon, 06 Apr 2026 13:17:23 GMT</lastBuildDate><atom:link href="https://blog.gettelio.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Scaling EV Repairs: How RepairWise Handles 1,000+ Daily Messages with AI]]></title><description><![CDATA[RepairWise uses 24/7 Telio AI agents to manage 1,000+ daily messages, automating 60% of customer SMS and speeding up responses by 2.85x.]]></description><link>https://blog.gettelio.com/how-repairwise-handles-1000-daily-messages/</link><guid isPermaLink="false">6953fd9432554a0001a7602e</guid><category><![CDATA[Customer Stories]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Tue, 06 Jan 2026 07:39:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2025/12/preview-13.png" medium="image"/><content:encoded><![CDATA[<h3 id="highlights">Highlights</h3><ul><li><strong>Company:</strong> <a href="https://www.repairwise.pro/?ref=blog.gettelio.com">RepairWise</a></li><li><strong>Industry:</strong> Electric Vehicle Diagnostics and Repair</li><li><strong>Challenge:</strong> managing 1,000+ daily messages across customer SMS and repair shop communications while scaling operations</li><li><strong>Solution:</strong> dual AI agent system powered by Telio integrated with:<ul><li><a href="https://docs.gettelio.com/integrations/dialpad?ref=blog.gettelio.com">Dialpad</a>, real-time SMS communications with customers</li><li><a href="https://docs.gettelio.com/integrations/postgres-supabase?ref=blog.gettelio.com" rel="noreferrer">PostgreSQL Database</a>, customers, orders, and repair shops information</li><li><a href="https://docs.gettelio.com/integrations/web?ref=blog.gettelio.com" rel="noreferrer">Repair Manuals</a>, proprietary documentation for EV models</li><li><a href="https://docs.gettelio.com/integrations/google-docs?ref=blog.gettelio.com" rel="noreferrer">Google Docs</a>, response templates and guidelines</li><li><a href="https://docs.gettelio.com/integrations/web?ref=blog.gettelio.com" rel="noreferrer">Help Center</a>, public-facing FAQ and support content</li></ul></li><li><strong>Results:</strong><ul><li>60% of customer SMS fully automated</li><li>40% of customer messages drafted for one-click review and send</li><li>2.85x faster median response time</li><li>100% of repair shop communications receive AI-generated drafts</li><li>Same 9-person team of service advisors and technicians handles growing volume without additional headcount</li></ul></li></ul><h3 id="about-repairwise">About RepairWise</h3><img src="https://blog.gettelio.com/content/images/2025/12/preview-13.png" alt="Scaling EV Repairs: How RepairWise Handles 1,000+ Daily Messages with AI"><p>RepairWise is revolutionizing how electric vehicle owners get their cars serviced. The platform remotely diagnoses EV issues and provides instant online quotes, making repairs more transparent and convenient. 
RepairWise connects thousands of EV owners with a nationwide network of qualified repair shops across the United States.</p><h3 id="the-challenge-two-communication-channels-one-scaling-problem">The Challenge: Two Communication Channels, One Scaling Problem</h3><p>As RepairWise expanded its network of customers and repair shop partners, the volume of daily communications became overwhelming for its lean team.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">&quot;We were handling over a thousand messages every single day. On one side, EV owners were texting us constantly with questions about their diagnostics, repair statuses, and appointments. On the other side, our partner mechanics needed technical guidance, parts information, and software update assistance. Both channels were growing faster than we could hire.&quot; </em></i><br><br>&#x2014; James Castillo, Co-Founder &amp; Head of Engineering</div></div><p><strong>On the customer side:</strong> EV owners communicate via SMS through Dialpad, asking about diagnostics, scheduling, pricing, insurance, repair status, and general vehicle health questions.</p><p><strong>On the shop side:</strong> auto mechanics use RepairWise&apos;s proprietary portal and messaging system for technical guidance, diagnostic assistance, parts logistics, software updates, and customer coordination.</p><p>As the business scaled, this structure wasn&apos;t sustainable without a fundamental change in how they handled communications.</p><h3 id="the-solution-two-specialized-ai-agents-working-in-parallel">The Solution: Two Specialized AI Agents Working in Parallel</h3><p>RepairWise built two distinct AI agents using Telio, each tailored to its communication channel and audience. Rather than a one-size-fits-all approach, they created specialized agents that understand the unique context and integrate with the exact knowledge sources needed for that role.</p><p><strong>Agent #1: The Service Advisor for Customer SMS</strong></p><p>The first agent operates as a virtual service advisor, handling SMS conversations with EV owners through Dialpad&apos;s real-time integration.</p><p>How it works:</p><ol><li><strong>Dialpad receives the SMS</strong> with Telio&apos;s real-time integration immediately capturing it</li><li><strong>Analyzes the inquiry</strong> to understand customer intent and categorize it</li><li><strong>Accesses comprehensive context</strong> about the customer and car history, help center articles, and response templates</li><li><strong>Generates review-ready drafts</strong> that service advisors can approve and send with one click</li><li><strong>Automatically performs actions</strong> for 60% of messages to update the order status and send a reply SMS</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">&quot;The Service Advisor agent knows everything about a customer&apos;s diagnostic history, their vehicle, and their past interactions. 
It&apos;s like having our best service advisor assistant available 24/7, but it never gets tired or overwhelmed.&quot;</em></i><br><br>&#x2014; James Castillo, Co-Founder &amp; Head of Engineering</div></div><p><strong>Agent #2: The Technical Expert for Mechanics</strong></p><p>The second AI agent serves as a master technician, supporting the auto repair shop partners through RepairWise&apos;s messaging system and portal.</p><p>How it works:</p><ol><li><strong>Monitors the proprietary messaging system</strong> for inquiries from repair shop mechanics</li><li><strong>The AI agent examines</strong> the technical question or issue</li><li><strong>It searches connected systems</strong>,<strong> </strong>including repair manuals for specific EV models and the order &amp; customer database</li><li><strong>It generates a comprehensive draft</strong> with repair guidance, diagnostic steps, parts information, or customer communication suggestions</li><li><strong>A human technician reviews</strong> the draft and sends it with a single click</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">&quot;The Technical Expert agent handles incredibly complex questions. It can reference specific repair procedures from our manuals, cross-reference diagnostic codes with known issues.</em></i>&quot;<br><br>&#x2014; James Castillo, Co-Founder &amp; Head of Engineering</div></div><h3 id="the-results-scaling-support-without-scaling-staff">The Results: Scaling Support Without Scaling Staff</h3><p>The impact of RepairWise&apos;s dual-agent system has been transformative for its operations.</p><ul><li><strong>Volume handled:</strong> over 1,000 daily messages now flow through AI-assisted systems, with every single response, whether fully automated or drafted, generated by AI agents working alongside the human team.</li><li><strong>Automation rate:</strong> 60% of customer SMS messages are fully automated, while the remaining 40% benefit from AI-drafted responses that reduce human effort to &quot;review and click send&quot;.</li><li><strong>Response speed:</strong> the 2.85x improvement in median response time means customers get answers faster, leading to higher satisfaction and reduced inquiry volume from follow-up questions.</li><li><strong>Technical accuracy:</strong> by connecting to proprietary repair manuals and order databases, the technical agent provides mechanics with accurate, model-specific guidance that previously required extensive research.</li><li><strong>Team efficiency:</strong> the same 9-person team continues handling increasing volume without burnout. They&apos;ve shifted from being message responders to quality reviewers and complex problem solvers.</li></ul><h3 id="why-this-approach-works">Why This Approach Works</h3><p>RepairWise&apos;s success with AI-powered messaging automation demonstrates several key principles.</p><p><strong>Specialized agents for specialized audiences.</strong> Rather than building one general-purpose agent, RepairWise created two agents optimized for their specific domains. The Service Advisor agent speaks the language of concerned EV owners, while the Technician agent communicates with the precision and depth mechanics require.</p><p><strong>Human-in-the-loop for quality control.</strong> The 60/40 automation split for customer messages and 100% draft generation for technical communications ensures accuracy while maximizing efficiency. 
This hybrid approach delivers speed without sacrificing the personal touch.</p><p><strong>Production-safe architecture.</strong> By aggregating all data into our scalable open-source <a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com" rel="noreferrer">BemiDB</a> database with built-in replication, RepairWise eliminated any risk of AI queries impacting their core platform. The AI agents can securely retrieve all synchronized data across various knowledge sources with granular permissions and aggregated queries running up to 2000x faster than regular PostgreSQL, while significantly reducing LLM token consumption.</p><h3 id="looking-ahead-revolutionizing-auto-repair">Looking Ahead: Revolutionizing Auto Repair</h3><p>For RepairWise, AI-powered communication automation represents more than just efficiency gains. It&apos;s the foundation for ambitious growth plans in the rapidly expanding EV repair market and investment in making their existing team more effective.</p><p>As RepairWise continues expanding its network of repair shops, adding support for new EV models, and onboarding more customers, their support infrastructure scales automatically.</p><hr><h3 id="ready-to-scale-your-customer-communication-with-ai-agents">Ready to Scale Your Customer Communication with AI Agents?</h3><p>If you&apos;re managing high-volume 24/7 communications and need to scale without proportional headcount growth, AI agents can transform your operations.</p><p><a href="https://gettelio.com/?ref=blog.gettelio.com">Telio</a> integrates with various tools and systems, allowing you to build intelligent automation that understands your business context and scales with your growth.</p>]]></content:encoded></item><item><title><![CDATA[How Rollups Scales Fintech Support: Automating 1,000s of Emails in Front]]></title><description><![CDATA[Rollups scales support with Telio AI agents that handle 100+ daily emails and integrate with Front, Notion, Attio, and Mintlify to draft context-aware replies.]]></description><link>https://blog.gettelio.com/how-rollups-scales-fintech-support/</link><guid isPermaLink="false">695303b832554a0001a75f52</guid><category><![CDATA[Customer Stories]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Tue, 30 Dec 2025 21:51:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2026/01/preview--1-.png" medium="image"/><content:encoded><![CDATA[<h3 id="highlights">Highlights</h3><ul><li><strong>Company:</strong> <a href="https://rollups.com/?ref=blog.gettelio.com" rel="noreferrer">Rollups</a></li><li><strong>Industry:</strong> Financial Technology / Cap Table Management</li><li><strong>Challenge:</strong> Managing thousands of emails while maintaining personalized and accurate responses</li><li><strong>Solution:</strong> Email automation with Telio AI agents integrated with:<ul><li><a href="https://docs.gettelio.com/integrations/front?ref=blog.gettelio.com" rel="noreferrer">Front</a>, shared email inbox</li><li><a href="https://docs.gettelio.com/integrations/attio?ref=blog.gettelio.com" rel="noreferrer">Attio</a>, CRM</li><li><a href="https://docs.gettelio.com/integrations/notion?ref=blog.gettelio.com" rel="noreferrer">Notion</a>, internal knowledge center and style guide</li><li><a href="https://docs.gettelio.com/integrations/web?ref=blog.gettelio.com" rel="noreferrer">Help Center</a>, public-facing website</li></ul></li><li><strong>Results:</strong> Fully automated email draft generation with context-aware responses</li></ul><h3 
id="about-rollups">About Rollups</h3><img src="https://blog.gettelio.com/content/images/2026/01/preview--1-.png" alt="How Rollups Scales Fintech Support: Automating 1,000s of Emails in Front"><p>Rollups helps startups and growth-stage companies take control of their equity by consolidating stakeholders into a single vehicle. It is part of AngelList&#x2019;s ecosystem of brands, and it is trusted by over 50,000 investors and companies. Rollups offers two primary solutions: Roll Up Vehicles (RUVs) for raising capital and Consolidation Vehicles (CVs) for cleaning up complex cap tables.</p><h3 id="the-challenge-scaling-support-in-a-complex-domain">The Challenge: Scaling Support in a Complex Domain</h3><p>As Rollups grew, their team received a flood of repetitive questions that prevented them from focusing on complex customer issues.</p><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">&quot;We were getting the same questions over and over. Questions about how Roll Up Vehicles work, what documents are needed, and other basic support questions. Our team was spending hours each day answering these.&quot;</em></i><br><br>&#x2014; Sumukh Sridhara, Rollups</div></div><p>The numbers told the story:</p><ul><li>Thousands of emails flowing through their system</li><li>A significant portion were repetitive, common questions</li><li>Response times stretched during peak periods</li><li>Team stretched thin between routine and complex inquiries</li></ul><h3 id="the-solution-ai-agents-that-think-like-support-experts">The Solution: AI Agents That Think Like Support Experts</h3><p>Rollups built AI agents using Telio with a strategic approach. Rather than replacing their human support, they built an AI layer that handles routine inquiries automatically with a human-in-the-loop approach.</p><h3 id="how-it-works-in-practice">How It Works in Practice</h3><p>When a customer support email arrives in Front, Telio&apos;s AI agent:</p><ol><li><strong>Reads and analyzes</strong> the customer inquiry</li><li><strong>Searches connected systems</strong> for relevant information: help center content, internal knowledge center and style guide articles, CRM records.</li><li><strong>Drafts a complete response</strong> that matches Rollups&apos; tone and includes accurate, contextual information</li><li><strong>Presents the draft</strong> to the human team for review and one-click sending directly in the Front interface without switching tools or disrupting the workflow.</li></ol><div class="kg-card kg-callout-card kg-callout-card-blue"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">&quot;By automating the email reply drafting process, we&apos;ve eliminated the &apos;blank page&apos; problem for our team. 
The AI does the research and writing, using our existing help collateral, and our team provides the final human touch.&quot;</em></i><br><br>&#x2014; Sumukh Sridhara, Rollups</div></div><h3 id="the-results-more-time-for-what-matters">The Results: More Time for What Matters</h3><p>The impact on Rollups&apos; support operation has been transformative:</p><ul><li><strong>Volume handled:</strong> Thousands of emails now flow through the AI-assisted system, with all replies drafted automatically.</li><li><strong>Response quality:</strong> Answers tailored to each customer&#x2019;s specific situation, using the same trusted knowledge sources as the human team.</li><li><strong>Team focus:</strong> Support specialists now spend their time on high-value activities like advising on complex consolidations, rather than answering questions that the team has seen before.</li><li><strong>Scalability:</strong> As Rollups continues growing their customer base, their support capacity scales automatically without proportional headcount increases.</li></ul><h3 id="why-this-approach-works">Why This Approach Works</h3><p>Rollups&apos; success with AI-powered support comes down to two key factors.</p><p><strong>Connected knowledge systems</strong>. By integrating Front with their internal knowledge center, style guides, help center, and CRM, the AI agent has access to holistic information, ensuring responses are accurate and aligned with the company&apos;s style.</p><p><strong>Human-in-the-loop design</strong>. The AI generates email drafts that can be reviewed by a human and sent with a single click. This hybrid approach combines the speed of automation with the judgment and nuance of human oversight.</p><h3 id="looking-ahead">Looking Ahead</h3><p>For Rollups, AI-powered email automation represents more than just efficiency gains. 
It&apos;s about scaling their ability to deliver an exceptional customer experience as they continue to onboard companies at every stage, from pre-seed to Series D+ startups.</p><hr><h3 id="ready-to-transform-your-support-operations">Ready to Transform Your Support Operations?</h3><p>If you&apos;re handling high volumes of repetitive customer inquiries and your team is stretched thin, our AI agents can help.</p><p><a href="https://gettelio.com/?ref=blog.gettelio.com" rel="noreferrer">Telio</a> integrates with various tools and services, allowing you to create intelligent email automation that actually understands your business and customers.</p>]]></content:encoded></item><item><title><![CDATA[Telio Achieves SOC 2 Type II Compliance: Secure 24/7 AI Agents]]></title><description><![CDATA[We’re excited to announce a major milestone in our commitment to data security and trust: Telio is now officially SOC 2 Type II certified.]]></description><link>https://blog.gettelio.com/telio-achieves-soc-2-type-ii-compliance/</link><guid isPermaLink="false">6952b2db32554a0001a75eb4</guid><category><![CDATA[News]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Tue, 14 Oct 2025 16:58:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2025/12/preview-14.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.gettelio.com/content/images/2025/12/preview-14.png" alt="Telio Achieves SOC 2 Type II Compliance: Secure 24/7 AI Agents"><p>We&#x2019;re excited to announce a major milestone in our commitment to data security and trust: <strong>Telio is now officially SOC 2 Type II certified</strong>, the gold standard for security compliance in the tech industry.</p><h3 id="what-is-soc-2-type-ii">What is SOC 2 Type II?</h3><p><strong>SOC 2</strong> is an auditing framework developed by the American Institute of Certified Public Accountants that evaluates how well a service provider protects customer data.</p><p>A Type I report only checks whether security controls were correctly designed at a single point in time. A <strong>Type II</strong> report, on the other hand, is much more rigorous, requiring an independent auditor to monitor our systems for several months to demonstrate that our security controls are actually working in practice.</p><p>Telio has independently verified that it meets strict requirements across five Trust Service Criteria:</p><ul><li><strong>Security</strong>: Our systems and infrastructure are protected against unauthorized access, both physical and logical.</li><li><strong>Availability</strong>: Our services operate reliably and are available 24/7 as promised to our customers.</li><li><strong>Processing Integrity</strong>: System processing is complete, valid, accurate, timely, and authorized.</li><li><strong>Confidentiality</strong>: Sensitive information is protected as committed or agreed upon.</li><li><strong>Privacy</strong>: Personal information is collected, used, retained, disclosed, and disposed of in accordance with our privacy policies and applicable regulations.</li></ul><h3 id="why-this-matters-for-our-customers"><strong>Why This Matters </strong>for Our Customers</h3><p>As an <strong>AI-powered platform that handles thousands of customer interactions</strong> through voice calls, text messages, and emails, we understand the critical importance of data security and privacy. 
Our customers trust us with sensitive business information and customer data every single day, and this certification validates that trust.</p><p>For businesses considering Telio for their customer support needs, SOC 2 Type II provides additional confidence that we take data protection seriously. Customers can request our SOC 2 report directly as part of due diligence or procurement reviews.</p><h3 id="what-this-means-moving-forward">What This Means Moving Forward</h3><p>Security isn&apos;t a &quot;one-and-done&quot; checkbox for us, it&#x2019;s part of our DNA. Being HIPAA-compliant and now also achieving SOC 2 Type II is just one milestone in our long-term roadmap to provide the most secure and human-like AI agents.</p><p><strong>We will continue to maintain and enhance our security practices</strong>, undergo regular audits, and stay ahead of emerging threats and industry standards. Learn more about our security at <a href="https://gettelio.com/security?ref=blog.gettelio.com" rel="noreferrer">gettelio.com/security</a> and compliance at <a href="https://trust.gettelio.com/?ref=blog.gettelio.com" rel="noreferrer">trust.gettelio.com</a>.</p><hr><p><strong>Ready to experience secure, 24/7 AI-powered customer support?</strong> </p><p>Visit <a href="https://gettelio.com/?ref=blog.gettelio.com" rel="noreferrer">gettelio.com</a> to learn more about how Telio can transform your customer service while keeping your data safe and secure.</p>]]></content:encoded></item><item><title><![CDATA[Product Analytics Queries Without Database Meltdown]]></title><description><![CDATA[Stop product dashboard meltdowns: the story of scaling Postgres, going to a warehouse, and evaluating more modern solutions.]]></description><link>https://blog.gettelio.com/product-analytics-db/</link><guid isPermaLink="false">68ae7578e3a2820001805d93</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Fri, 02 May 2025 15:54:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2025/08/image-762-min.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.gettelio.com/content/images/2025/08/image-762-min.png" alt="Product Analytics Queries Without Database Meltdown"><p>I&apos;ve personally watched this journey unfold &#x2014; what starts as shipping a simple dashboard can eventually lead to temporary patches, complexity creep, and full rearchitectures just to avoid killing the transactional database. This&apos;ll walk you through the journey I&apos;ve seen a team take many years ago at a fast-scaling startup and help you short-circuit this cycle, saving months of discovery pain along the way.</p><h2 id="meltdown">Meltdown</h2><p>One of our biggest international customers launched a flash sale that drove 50x their normal transaction volume. At 3 AM, our on-call was paged and Slack lit up with alerts. &quot;DATABASE CPU: CRITICAL.&quot; &quot;API LATENCY: CRITICAL.&quot; &quot;CHECKOUT FAILURE RATE: CRITICAL.&quot;. This customer eventually churned due to the lost revenue from the incident. The culprit? Constant refreshing of their dashboard to track sale success.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/05/image-760--1-.jpg" class="kg-image" alt="Product Analytics Queries Without Database Meltdown" loading="lazy" width="2446" height="1216"></figure><p>Let&#x2019;s take a closer look at the query:</p><pre><code class="language-sql">SELECT 
  date_trunc(&apos;month&apos;, purchase_date) as month,
  customer_region,
  AVG(customer_age) as avg_age,
  SUM(purchase_amount) as total_revenue,
  COUNT(DISTINCT customer_id) as unique_customers
FROM purchases
JOIN customers ON purchases.customer_id = customers.id
WHERE purchase_date &gt; NOW() - INTERVAL &apos;12 months&apos;
GROUP BY date_trunc(&apos;month&apos;, purchase_date), customer_region
ORDER BY month, total_revenue DESC;
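
-- A hedged aside, not part of the original incident: prefixing a query with
-- EXPLAIN (ANALYZE, BUFFERS) prints the plan with per-node timings and buffer reads,
-- so the costs described below can be measured rather than guessed. A slimmed-down
-- probe against the same (assumed) schema might look like:
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(DISTINCT customer_id)
FROM purchases
WHERE purchase_date &gt; NOW() - INTERVAL &apos;12 months&apos;;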
</code></pre><p>When this query executes in a transactional database, several resource-intensive operations occur:</p><ol><li><strong>Full table scan</strong>: Without proper indexes, the database reads every row in both tables to find matching records.</li><li><strong>Join operation</strong>: The database builds an in-memory hash table for the join, which can consume significant RAM when joining large tables.</li><li><strong>Grouping operation</strong>: Another in-memory hash table gets built to track each unique group (month + region).</li><li><strong>Distinct counting</strong>: The <code>COUNT(DISTINCT customer_id)</code> requires tracking all unique customer IDs seen so far, adding more memory pressure.</li><li><strong>Sorting</strong>: Finally, the results get sorted by month and revenue, potentially needing disk if the result set is large enough.</li></ol><p>In transactional databases like PostgreSQL, Multiversion concurrency control (MVCC) avoids locking by letting each query see a consistent snapshot of the data. This works by checking the visibility of each row version at read time. But for large analytical queries&#x2014;especially those that scan millions of rows and compute aggregates&#x2014;these per-row checks can spike CPU usage. Add operations like sorting, grouping, and COUNT(DISTINCT), and memory and disk I/O can quickly become bottlenecks. Even though MVCC avoids blocking other queries directly, a single heavyweight query can still monopolize enough resources to severely impact overall performance.</p><h2 id="the-%E2%80%98just-scale-it%E2%80%99-phase">The &#x2018;Just Scale It&#x2019; Phase</h2><p>After turning off feature flags or killing queries, the natural first step was to optimize the existing system.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/02/Just-use-Postgres-3.png" class="kg-image" alt="Product Analytics Queries Without Database Meltdown" loading="lazy"></figure><h2 id="indexes">Indexes</h2><p>The quickest win was to add indexes to speed up the query execution:</p><pre><code class="language-sql">CREATE INDEX idx_purchases_date 
ON purchases(purchase_date);

CREATE INDEX idx_purchases_customer_region_date 
ON purchases(customer_region, purchase_date);

CREATE INDEX idx_customers_region 
ON customers(region);
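
-- Illustrative sketch rather than something the team actually ran: a covering index that
-- INCLUDEs the aggregated columns (PostgreSQL 11+) can enable index-only scans for the
-- purchases side of the query, at the cost of yet another index to update on every write.
CREATE INDEX idx_purchases_date_covering
ON purchases(purchase_date)
INCLUDE (customer_id, purchase_amount);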
</code></pre><p>Indexes create specialized data structures (typically B-trees) that organize record locations in a predefined order. They contain key values and pointers to the actual data rows, allowing the database to locate records matching specific criteria without sequentially scanning the entire table. For our query, these indexes help the database quickly identify relevant purchase records by date range, find matching customer regions, and efficiently join the tables together.</p><p>Performance improved drastically at first, dropping the dashboard&#x2019;s 99th percentile load time to about 10 seconds. But the indexes had some downsides:</p><ol><li><strong>Write performance degradation</strong>: Every write operation has to now also update each index. As more indexes get added for new dashboard charts, this overhead grows linearly, and slows down critical writes.</li><li><strong>Bloat</strong>: Indexes can consume lots of disk space, increasing infra and maintenance costs. I&#x2019;ve seen production databases with indexes more than double the size of the actual data.</li><li><strong>Limited gains</strong>: Indexes help with filtering but not with aggregations like COUNT(DISTINCT).</li></ol><h3 id="read-replica">Read Replica</h3><p>Indexes gave us some breathing room, but analytical and transactional workloads were still competing for the same resources. I pointed out to the team that we had a read replica that was heavily underutilized - sitting at just 12% CPU utilization and low disk IOPS.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/05/Group-15991--1-.jpg" class="kg-image" alt="Product Analytics Queries Without Database Meltdown" loading="lazy" width="1671" height="1385"></figure><p>Routing all dashboard queries to the read replica worked surprisingly well, with the primary database seeing a 15% CPU reduction and dashboard p99 latency improving by 30%.</p><p>Since read replicas maintain exact copies of the primary, any index the team needed to add for dashboard charts still had to exist on the primary. This meant write performance on the primary was still degraded. There were still improvements to make since the dashboard could still take a few seconds to load.</p><h3 id="materialized-views">Materialized Views</h3><p>The team next needed a way to pre-compute dashboard query results:</p><pre><code class="language-sql">CREATE MATERIALIZED VIEW monthly_regional_metrics AS
SELECT
  date_trunc(&apos;month&apos;, purchase_date) as month,
  customer_region,
  AVG(customer_age) as avg_age,
  SUM(purchase_amount) as total_revenue,
  COUNT(DISTINCT customer_id) as unique_customers
FROM purchases
JOIN customers ON purchases.customer_id = customers.id
GROUP BY date_trunc(&apos;month&apos;, purchase_date), customer_region;
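
-- Sketch of the maintenance side (assumed names match the view above): a concurrent
-- refresh keeps the view readable while it runs, but it requires a unique index and
-- still recomputes the full result set, which is the operational cost noted below.
CREATE UNIQUE INDEX idx_monthly_regional_metrics
ON monthly_regional_metrics (month, customer_region);

REFRESH MATERIALIZED VIEW CONCURRENTLY monthly_regional_metrics;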

</code></pre><p>Materialized views pre-compute and store query results, turning slow queries into simple table lookups. This successfully made the dashboard p99 latency drop 50%, but introduced the new challenge of the operational complexity of maintaining fresh views as data scale grew. </p><h3 id="table-partitioning">Table Partitioning</h3><p>The team didn&#x2019;t actually partition data, but I&#x2019;ll list it here to be thorough:</p><pre><code class="language-sql">CREATE TABLE purchases (
  id SERIAL,
  customer_id INTEGER,
  purchase_date TIMESTAMP,
  purchase_amount DECIMAL(10,2)
) PARTITION BY RANGE (purchase_date);

CREATE TABLE purchases_2019_q1 PARTITION OF purchases
  FOR VALUES FROM (&apos;2019-01-01&apos;) TO (&apos;2019-04-01&apos;);

CREATE TABLE purchases_2019_q2 PARTITION OF purchases
  FOR VALUES FROM (&apos;2019-04-01&apos;) TO (&apos;2019-07-01&apos;);
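
-- Hypothetical read to illustrate partition pruning: because the WHERE clause bounds
-- purchase_date, the planner scans only purchases_2019_q1 and skips the other partitions.
SELECT SUM(purchase_amount)
FROM purchases
WHERE purchase_date &gt;= &apos;2019-01-01&apos; AND purchase_date &lt; &apos;2019-04-01&apos;;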

</code></pre><p>Partitioning divides large tables into smaller physical chunks. When a query specifies a date range, the database can scan only relevant partitions instead of the entire table.</p><h3 id="server-scaling">Server Scaling</h3><p>As our data volume continued to grow, the team considered a few other scaling approaches:</p><p><strong>Vertical Scaling</strong>: The most straightforward of which is upgrading the hardware and giving it more resources. This continues to buy some time.</p><p><strong>Horizontal Scaling</strong>: The most complex approach is distributing data across multiple servers called shards. This means implementing application-level logic to determine which shard to access when reading or writing data.</p><p>The approaches that were tried helped temporarily but didn&apos;t address the fundamental architectural issue, and the mismatch was becoming more apparent as the company scaled.</p><h2 id="the-architectural-mismatch">The Architectural Mismatch</h2><p>Transactional databases such as PostgreSQL excel at:</p><ul><li>Quick point lookups and small range scans</li><li>High concurrency for many small operations</li><li>Strong data consistency guarantees</li></ul><p>Analytical workloads like a dashboard have fundamentally different characteristics:</p><ul><li>They scan across table columns rather than accessing specific records</li><li>They aggregate data across millions of rows</li><li>They often need to access only a few columns from wide tables</li></ul><p>When our dashboard query needed just <code>purchase_date</code> and <code>purchase_amount</code> from a 20-column table with say a million rows, PostgreSQL still reads all 20 million values. This massive I/O inefficiency is why our CPU and disk metrics kept hitting the ceiling.</p><p>That&#x2019;s why databases made for analytical workloads store data in columnar format.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/05/image-1.png" class="kg-image" alt="Product Analytics Queries Without Database Meltdown" loading="lazy" width="2604" height="1326"></figure><p>This columnar storage provides massive benefits for analytical queries:</p><ul><li>Only needed columns are pulled from disk</li><li>Similar data in columns compresses better</li><li>Operations can process batches of the same data type simultaneously</li><li>Filtering happens before data is loaded into memory</li></ul><p>Eventually, the temporary dashboard scaling workarounds weren&#x2019;t enough and the team needed to move to an analytics optimized architecture.</p><h2 id="the-data-warehouse">The Data Warehouse</h2><p>Before all scaling options were fully exhausted, the company had invested in setting up Snowflake with proper data pipelines, and all product teams were encouraged to leverage this new analytics infrastructure.</p><p>With the help from the data engineering team, they implemented a traditional extract-transform-load (ETL) architecture:</p><ol><li><strong>Extraction</strong>: Airflow DAGs would run scheduled jobs that pulled data from our production PostgreSQL database</li><li><strong>Transformation</strong>: The extracted data underwent transformation steps before loading - cleaning values, applying business rules, and restructuring data.</li><li><strong>Loading</strong>: Finally, the transformed data was loaded into Snowflake tables optimized for analytical workloads.</li></ol><p>For the dashboard to display this processed data, the team implemented a reverse ETL process that completed a data 
round-trip: production DB &#x2192; Snowflake &#x2192; back to production DB, but now with pre-computed aggregations ready for fast querying. This dropped dashboard load times to milliseconds!</p><h3 id="the-batch-reality-vs-real-time-promise">The Batch Reality vs. Real-Time Promise</h3><p>The team initially accepted a 24-hour refresh cycle for the dashboard metrics, comfortable with this temporary limitation based on the data team&apos;s roadmap promising real-time capabilities &quot;soon.&quot; Our dashboard would clearly display: &quot;Data updated daily at 4:00 AM UTC.&quot;</p><p>The company&apos;s leadership was understanding about this refresh cycle during the transition, but customers accustomed to seeing their latest transactions reflected immediately were confused by the sudden shift to day-old data. This created an unexpected support burden.</p><h3 id="unexpected-complexities">Unexpected Complexities</h3><p>What started as a simple architectural improvement quickly revealed hidden operational costs:</p><p><strong>Pipeline fragility</strong>: Weekly pipeline failures would result in cryptic error messages that the product engineering team struggled to troubleshoot, often requiring escalation to the busy data engineering team.</p><p><strong>Dependency challenges</strong>: The promised real-time capabilities kept getting delayed as the data team discovered the true complexity of implementing Change Data Capture and Kafka streaming infrastructure.</p><p>While this architecture was theoretically correct, the company underestimated the operational overhead. Our mid-sized startup didn&apos;t actually have the petabyte-scale data that would justify the complexity, and our engineering team lacked deep data engineering expertise. The organizational impact was equally problematic - our product team became dependent on an overextended data team, creating unclear ownership and delayed issue resolution.</p><p>For a company with a large data engineering org and massive data volumes, this architecture makes perfect sense. But for our relatively straightforward analytics needs, we had overcomplicated our infrastructure without considering the long-term operational and organizational costs.</p><h2 id="modern-analytical-solutions">Modern Analytical Solutions</h2><p>Today, teams have simpler options available. Several analytical databases can bridge the gap between transactional and analytical workloads without as much overhead as Snowflake.</p><h3 id="duckdb-the-embedded-analytical-engine">DuckDB: The Embedded Analytical Engine</h3><p>DuckDB has emerged as a powerful columnar analytical database that can be embedded directly into applications - essentially SQLite for analytics:</p><ul><li><strong>Embedded architecture</strong>: Runs inside a host process with bindings for languages like Python, with the ability to directly place data into structures like NumPy arrays</li><li><strong>Columnar-vectorized execution</strong>: Uses a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a &quot;vector&quot;) is processed in one operation</li><li><strong>Interoperability</strong>: Seamlessly works with the broader data science ecosystem</li></ul><p>DuckDB is ideal for data scientists and analysts who need fast analytical capabilities integrated directly into their data workflows without the overhead of setting up a separate database server. 
It excels at scenarios like exploratory data analysis, ad-hoc queries, and processing moderately large datasets locally.</p><h3 id="clickhouse-high-performance-analytical-database">ClickHouse: High-Performance Analytical Database</h3><p>Originally developed at Yandex for their analytics, ClickHouse now powers analytics at companies like Cloudflare and Uber. It is a column-oriented DBMS built for analytical workloads at extreme scale:</p><ul><li><strong>Distributed architecture</strong>: Designed to scale horizontally across clusters of commodity hardware</li><li><strong>Vectorized execution</strong>: Processes data in parallel using SIMD instructions for maximum performance</li><li><strong>Real-time ingestion</strong>: Can ingest large amounts of data with the ClickPipes integration</li></ul><p>ClickHouse shines when dealing with massive datasets where query performance is critical. </p><h3 id="bemidb-the-analytical-read-replica">BemiDB: The Analytical Read Replica</h3><p><em>Disclaimer: I&#x2019;m a BemiDB open source contributor.</em></p><p>BemiDB is a data warehouse that can be used as a PostgreSQL read replica optimized for analytics:</p><ol><li><strong>Automatic replication</strong>: Data syncs from Postgres with no pipeline code</li><li><strong>Open columnar format</strong>: Stores data either on a local file system or on S3-compatible object storage</li><li><strong>Full Postgres compatibility</strong>: Uses the same SQL syntax as PostgreSQL </li></ol><p>BemiDB is ideal for teams trying to scale for analytics while staying close to PostgreSQL. The use cases where it shines are in-app analytics, BI queries, and centralizing PostgreSQL data.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/02/BemiDB-2.png" class="kg-image" alt="Product Analytics Queries Without Database Meltdown" loading="lazy"></figure><h3 id="materialize-streaming-sql-for-real-time-analytics">Materialize: Streaming SQL for Real-Time Analytics</h3><p>Materialize takes a different approach by focusing on real-time analytics using a streaming architecture:</p><ol><li><strong>Incremental view maintenance</strong>: Updates query results as data changes</li><li><strong>Streaming architecture</strong>: Processes data changes in real-time as they occur</li><li><strong>Differential dataflow</strong>: Uses sophisticated algorithms to minimize computation</li></ol><p>Materialize shines for use cases requiring real-time analytics on constantly changing data, including fraud detection, real-time notifications, and operational dashboards. </p><h3 id="choosing-the-right-solution-for-your-needs">Choosing the Right Solution for Your Needs</h3><p>The best choice depends on your specific needs:</p>
<!--kg-card-begin: html-->
<table>
<thead>
<tr>
<th>Solution</th>
<th>Best For</th>
<th>Key Advantage</th>
</tr>
</thead>
<tbody>
<tr>
<td>DuckDB</td>
<td>Embedded analytics, local analysis</td>
<td>Lightweight, no server needed</td>
</tr>
<tr>
<td>ClickHouse</td>
<td>Massive scale analytics</td>
<td>Extreme query performance</td>
</tr>
<tr>
<td>BemiDB</td>
<td>Simplicity, Postgres compatibility</td>
<td>Single Docker image, direct Postgres replication</td>
</tr>
<tr>
<td>Materialize</td>
<td>Real-time streaming analytics</td>
<td>Incremental view maintenance</td>
</tr>
</tbody>
</table>
<!--kg-card-end: html-->
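<p>To make the &quot;lightweight&quot; end of that comparison concrete, here is a small, hedged sketch (the file name and columns are hypothetical): DuckDB can run the earlier dashboard-style aggregation directly against a Parquet export, in-process, with no database server to operate.</p><pre><code class="language-sql">-- Assumes purchases.parquet was exported beforehand; DuckDB reads it in place.
SELECT
  date_trunc(&apos;month&apos;, purchase_date) AS month,
  SUM(purchase_amount) AS total_revenue
FROM read_parquet(&apos;purchases.parquet&apos;)
GROUP BY 1
ORDER BY 1;
</code></pre>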
<h2 id="a-simpler-path-forward">A Simpler Path Forward</h2><p>Instead of the painful progression most companies follow:</p><ol><li>Query production directly until it breaks</li><li>Add indexes until writes become slow</li><li>Build complex ETL pipelines to data warehouses</li><li>Create reverse ETL to bring data back to the application</li></ol><p>Consider this simpler approach:</p><ol><li>Keep transactional workloads on your production database</li><li>Use a columnar analytical engine for reports and dashboards</li><li>Only move to a full data warehouse when you truly need to integrate many different data sources or process petabytes of data</li></ol><p>This approach would&#x2019;ve saved that team months of trouble. The mid-sized startup needed operational simplicity and the team first and foremost wanted to eliminate late-night pages, pipeline debugging, and the performance dance of trying to make the transactional database not meltdown. The right solution for you will depend on your specific requirements around data volume, real-time needs, and data org maturity.</p><hr><p><em>There are several open-source solutions mentioned worth exploring, including </em><a href="https://github.com/duckdb/duckdb?ref=blog.gettelio.com"><em>DuckDB</em></a><em>, </em><a href="https://github.com/ClickHouse/ClickHouse?ref=blog.gettelio.com"><em>ClickHouse</em></a><em>, </em><a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com"><em>BemiDB</em></a><em>, and </em><a href="https://github.com/MaterializeInc/materialize?ref=blog.gettelio.com"><em>Materialize</em></a><em>. Each has different trade-offs that might make it the right fit for your specific needs.</em></p>]]></content:encoded></item><item><title><![CDATA[Proprietary Data Analytics Platforms are a Trap]]></title><description><![CDATA[Proprietary data analytics platforms are a trap—because of high costs, vendor lock-in, and unnecessary complexity. Modern tech makes them obsolete.]]></description><link>https://blog.gettelio.com/cloud-data-scam/</link><guid isPermaLink="false">68ae74fde3a2820001805d7f</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Fri, 14 Mar 2025 12:50:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2025/08/Group-15987.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.gettelio.com/content/images/2025/08/Group-15987.png" alt="Proprietary Data Analytics Platforms are a Trap"><p>When analytical queries and workloads scale out of transactional databases like Postgres, most companies default to a cloud vendor like Snowflake, connecting and transforming their data in the process. Proprietary data platforms are the industry standard here. We have one of the fastest growing open source database projects, and every advisor tells us to &#x201C;build a cloud platform&#x201D;. It&#x2019;s what the industry expects and also the easiest revenue model.</p><p>But we&#x2019;re taking a contrarian path. I&#x2019;ll explain why we think it&#x2019;s time to rethink data analytics and why current platforms are a trap&#x2014;namely because of high costs, vendor lock-in, and unnecessary complexity.</p><h3 id="complexity"><strong>Complexity</strong></h3><p>Cloud data analytics is a scam, <em>today</em>. Snowflake and other old proprietary cloud data warehouses were built on the assumption that distributed clusters, data movement orchestration, and constant infrastructure maintenance are necessary for analytics at scale. 
Let&#x2019;s revisit if this Hadoop-era way of thinking still holds true with today&#x2019;s technical shifts:</p><ul><li>Hardware advances: the raw power of a single server today is orders of magnitude more powerful than what it used to be. A single machine with 64 CPU cores can easily scan 1TB of data. And since <a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/?ref=blog.gettelio.com">98% of Redshift queries</a> on &gt;10TB datasets scan less than 1TB, distributed systems and CAP theorem tradeoffs aren&#x2019;t necessary for analytical workloads anymore.</li><li>Embedded analytics engines: projects like <a href="https://github.com/duckdb/duckdb?ref=blog.gettelio.com">DuckDB</a> show you can quickly scan and analyze billions of rows of data on even your laptop. It runs fully in-process, sharing memory with your application, which means there&#x2019;s no separate database server or network overhead.</li><li>Separation of compute and storage: with accessible durable object storage like S3, you can store massive datasets cheaply without needing always-on infrastructure. You only spin up compute for the data you actually query, which reduces overhead&#x2014;especially since most queries only scan a fraction of what&#x2019;s stored.</li></ul><p>Together, these shifts enable a simpler architecture &#x2014; you can self-host without the bloat and cost of managing complex infrastructure. For example, our <a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com">BemiDB</a> project is just a single Docker image. It embeds DuckDB, automatically syncs data from Postgres databases to an analytics optimized S3 bucket, and leverages Postgres&#x2019; wire and query protocol to speak Postgres. In practice, this means spinning up a fast and scalable data warehouse can be as simple as:</p><pre><code class="language-bash">&gt; curl -sSL https://raw.githubusercontent.com/BemiHQ/BemiDB/refs/heads/main/scripts/install.sh | bash

&gt; ./bemidb sync --pg-sync-interval 10m --pg-database-url postgres://&lt;user&gt;:&lt;pass&gt;@&lt;host&gt;:5432/&lt;dbname&gt;

&gt; ./bemidb start

&gt; psql postgres://localhost:54321/bemidb
bemidb=&gt; SELECT country, COUNT(*) FROM users GROUP BY country;</code></pre><h3 id="the-lock-in-tax"><strong>The lock in tax</strong></h3><p>Snowflake and similar vendors store data in proprietary formats, meaning you&#x2019;re effectively stuck and unable to leave. Especially as you scale, this means exorbitant fees and no flexibility of using any other tools and services with your data. This is in stark contrast to modern open table formats like <a href="https://github.com/apache/iceberg?ref=blog.gettelio.com">Apache Iceberg</a> which are now becoming the standard.</p><p>Iceberg provides a consistent structure for your data on object storage that keeps it readable by all data tools and services. Paired with open-standard columnar files like <a href="https://github.com/apache/parquet-format?ref=blog.gettelio.com">Apache Parquet</a>, your data remains fully portable.</p><p>Additionally, proprietary cloud query engines limit where you can run and optimize workloads. Open source query engines let you deploy on VMs, bare metal, or containers&#x2014;whatever fits your performance and security needs.</p><p>Embracing open formats and open source helps avoid the lock-in tax and keeps your data truly yours.</p><h3 id="monetization"><strong>Monetization</strong></h3><p>Data teams end up paying well over $1,000 per month on cloud data warehouses&#x2014;often climbing to tens of thousands when you factor in ETL pipelines, data egress, and infrastructure overhead. By storing data in cheap, durable object storage within the same region and running queries on a modest VM, you can cut those monthly costs down to hundreds.</p><figure class="kg-card kg-image-card"><img src="https://media0.giphy.com/media/v1.Y2lkPTc5MGI3NjExZW5tbDFsc3g3cjIzaHM4N2g4czAyb3VsdTkzdWRpbGhlMXg2bnc4aiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/987p6CfE8uzSg/giphy.gif" class="kg-image" alt="Proprietary Data Analytics Platforms are a Trap" loading="lazy"></figure><p>We&#x2019;re a venture backed startup that&#x2019;s after profit, but our mission is to simplify data, and we encourage everyone to use our open source and self-host. Not everything can be packaged into a single Docker container, so we charge for support and extra features that require integrating additional components.</p><h3 id="give-self-hosted-a-try"><strong>Give self hosted a try</strong></h3><p>The exorbitant cloud costs and lock-in are worth it when the alternative is complex infrastructure to build or maintain. Unless you&#x2019;re in the <a href="https://motherduck.com/blog/big-data-is-dead/?ref=blog.gettelio.com">big data one percent</a>, this isn&#x2019;t the case anymore for data analytics.</p><p><a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com"><em>Check out our GitHub repo</em></a><em> and give BemiDB a star! 
We&#x2019;re always pushing to make data simpler and more open.</em></p>]]></content:encoded></item><item><title><![CDATA[Data Analytics with PostgreSQL: The Ultimate Guide]]></title><description><![CDATA[In this blog post, we will compare the main techniques and approaches for running data analytics with PostgreSQL.]]></description><link>https://blog.gettelio.com/analytics-with-postgresql/</link><guid isPermaLink="false">68ae7421e3a2820001805d6d</guid><category><![CDATA[Engineering]]></category><category><![CDATA[Guide]]></category><dc:creator><![CDATA[Evgeny Li]]></dc:creator><pubDate>Mon, 10 Feb 2025 14:51:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2025/08/Analytics-with-Postgres-4.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.gettelio.com/content/images/2025/08/Analytics-with-Postgres-4.png" alt="Data Analytics with PostgreSQL: The Ultimate Guide"><p><strong>TL;DR</strong></p><p>In this blog post, we will compare the main techniques and approaches for running data analytics with PostgreSQL:</p><ol><li><strong>&quot;Just use Postgres&quot; and Scale It</strong><ol><li>Indexes: choosing index types, multicolumn/partial/expression indexes</li><li>Materialized Views: denormalization</li><li>Table Partitioning: declarative partitioning</li><li>Server Scaling: read replica, vertical scaling, horizontal sharding</li></ol></li><li><strong>Install PostgreSQL Extensions</strong><ol><li>Foreign Data Wrappers: columnar storage formats, Parquet, ETL</li><li>Analytics Query Engines: vectorized execution, parallel execution on GPU, DuckDB</li><li>Super Extensions: TimescaleDB and Citus</li></ol></li><li><strong>Integrate with Analytics Databases</strong><ol><li>BemiDB: read replica, vectorized engine and columnar storage, open table formats, Iceberg</li><li>ClickHouse: PostgreSQL wire protocol, vectorized engine and columnar storage, logical replication</li></ol></li><li><strong>Use Proprietary Solutions</strong><ol><li>Google Cloud AlloyDB: hybrid transactional and analytical workloads</li><li>EDB Analytics Accelerator: proprietary extension on PostgreSQL with replication</li><li>Crunchy Data Warehouse: proprietary extensions on PostgreSQL</li><li>Firebolt: forked ClickHouse with PostgreSQL dialect</li></ol></li></ol><p>Each of them has their pros and cons, so we will take a look at them and see which one may be the best for your use case.</p><div class="kg-card kg-callout-card kg-callout-card-grey"><div class="kg-callout-text"><i><em class="italic" style="white-space: pre-wrap;">The PostgreSQL ecosystem has matured and become capable of dealing with analytical workloads in recent years. There is no need to bring heavy big-data tools like Apache Spark or build data pipelines with Kafka anymore.</em></i></div></div><hr><h2 id="just-use-postgres-and-scale-it"><strong>&quot;Just use Postgres&quot; and Scale It</strong></h2><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/02/Just-use-Postgres-3.png" class="kg-image" alt="Data Analytics with PostgreSQL: The Ultimate Guide" loading="lazy" width="2848" height="1640"></figure><p>PostgreSQL is a highly flexible and powerful database. It simplifies your data stack by serving as a single transactional (OLTP) engine for a range of purposes&#x2014;including full-text search, queueing, caching, and event streaming&#x2014;while also being tunable for analytics (OLAP) queries. 
Below are the main built-in features.</p><p><a href="https://www.postgresql.org/docs/current/indexes.html?ref=blog.gettelio.com" rel="noreferrer"><strong>Indexes</strong></a></p><p>To identify slow queries and their bottlenecks, we can use PostgreSQL&apos;s <code>EXPLAIN ANALYZE</code> which will show the sequential and index scans. Then, depending on the data structure and queries, you can choose an appropriate index type such as:</p><ul><li>B-tree for equality <code>=</code> and range conditions like <code>&gt;</code> or <code>&lt;</code></li><li>GIN for composite values like <code>JSONB</code> or <code>ARRAY</code></li></ul><p>See the full list of <a href="https://www.postgresql.org/docs/current/indexes-types.html?ref=blog.gettelio.com" rel="noreferrer">supported PostgreSQL index types</a> and the optimized operators that work with them.</p><p>You can also create specialized indexes:</p><ul><li>Multicolumn index (for columns frequently used together in queries): </li></ul><pre><code class="language-sql">CREATE INDEX index_name ON table (column1, column2);</code></pre><ul><li>Partial index<strong> </strong>(for frequent filtering a subset of rows):</li></ul><pre><code class="language-sql">CREATE INDEX index_name ON table (column)
  WHERE column &gt; 1 AND column &lt; 1000;</code></pre><ul><li>Expression index (for computed expressions or functions):</li></ul><pre><code class="language-sql">CREATE INDEX index_name ON table (LOWER(column));</code></pre><p><a href="https://www.postgresql.org/docs/current/rules-materializedviews.html?ref=blog.gettelio.com" rel="noreferrer"><strong>Materialized Views</strong></a></p><p>Denormalization is a strategy that allows improving read performance at the expense of adding redundant copies of data, similarly to a cache. One example is the use of PostgreSQL materialized views that can pre-compute and store query results in a table-like form for frequently accessed data.</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">CREATE MATERIALIZED VIEW summary_sales AS
  SELECT seller_id, invoice_date, sum(invoice_amount) AS sales_amount
  FROM invoices
  WHERE invoice_date &lt; CURRENT_DATE
  GROUP BY seller_id, invoice_date;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Creating a materialized view</span></p></figcaption></figure><p>PostgreSQL also allows adding indexes on materialized views and querying them like it is a regular table:</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">CREATE INDEX summary_sales_index
  ON summary_sales (seller_id, invoice_date);

SELECT * FROM summary_sales
  WHERE seller_id = 1 AND invoice_date = &apos;2025-01-01&apos;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Creating an index and querying a materialized view</span></p></figcaption></figure><p><a href="https://www.postgresql.org/docs/current/ddl-partitioning.html?ref=blog.gettelio.com" rel="noreferrer"><strong>Table Partitioning</strong></a></p><p>Partitioning is a technique that allows splitting a large table into smaller physical ones called partitions. This helps improve query performance by scanning only relevant partitions instead of the entire large table.</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">CREATE TABLE invoices (issued_on DATE NOT NULL)
  PARTITION BY RANGE (issued_on);

CREATE TABLE invoices_2025_01 PARTITION OF invoices
    FOR VALUES FROM (&apos;2025-01-01&apos;) TO (&apos;2025-02-01&apos;);

CREATE TABLE invoices_2025_02 PARTITION OF invoices
    FOR VALUES FROM (&apos;2025-02-01&apos;) TO (&apos;2025-03-01&apos;);
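
-- Optionally, a DEFAULT partition catches rows that do not match any other partition
CREATE TABLE invoices_default PARTITION OF invoices DEFAULT;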
</code></pre><figcaption><p><span style="white-space: pre-wrap;">Declarative partitioning of a table by range</span></p></figcaption></figure><p>Native PostgreSQL declarative partitioning treats the partitioned table like a &#x201C;virtual&#x201D; one that delegates reads/writes to the underlying partitions:</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">-- Stores the row into invoices_2025_01
INSERT INTO invoices (issued_on) VALUES (&apos;2025-01-31&apos;);
-- Stores the row into invoices_2025_02
INSERT INTO invoices (issued_on) VALUES (&apos;2025-02-01&apos;);

-- Selects the rows from invoices_2025_01
SELECT * FROM invoices WHERE issued_on = &apos;2025-01-31&apos;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Writing and reading a partitioned table</span></p></figcaption></figure><p><strong>Server Scaling</strong></p><p>There are a few server scaling options available for PostgreSQL when it comes to running analytical workloads.</p><ul><li>Read replica. To offload read queries from the primary server, you can set up one or more PostgreSQL read replicas.</li><li>Vertical scaling. This is the most straightforward approach. When a server can&apos;t handle the load, consider upgrading the hardware and giving it more resources.</li><li>Horizontal scaling. Another option is to use sharding by distributing data across multiple servers called shards. For example, you can implement application-level logic to determine which shard to access when reading or writing data.</li></ul><h3 id="postgresql-pros">PostgreSQL Pros</h3><ul><li>Powerful general-purpose transactional database (OLTP) that can handle basic analytical workloads (OLAP) out of the box.</li><li>&quot;Just use Postgres&quot; allows keeping the data stack as simple as possible without adding additional tools and services.</li></ul><h3 id="postgresql-cons">PostgreSQL Cons</h3><ul><li>Creating indexes tailored for specific queries negatively impacts the &quot;write&quot; performance for transactional queries and resource usage.</li><li>Materialized views as a &quot;cache&quot; require manual maintenance and can become increasingly slow to refresh as the data volume grows.</li><li>Table partitioning is a leaky abstraction that doesn&apos;t perfectly encapsulate the details and doesn&apos;t play nice with triggers, constraints, etc.</li><li>Scaling up servers is often a short-term solution that buys some time but doesn&apos;t solve the underlying performance issues.</li><li>Horizontal scaling using sharding significantly increases the complexity of the database architecture and introduces engineering and operational overhead.</li><li>Requires continuously spending significant engineering resources to gain meaningful long-term performance improvements.</li><li>Further tuning and optimization may not be possible when executing varied ad-hoc analytical queries.</li></ul><hr><h2 id="install-postgresql-extensions"><strong>Install PostgreSQL Extensions</strong></h2><p>PostgreSQL has a rich ecosystem of extensions that enhance its functionality in different areas, including data analytics. There are a few different approaches that these extensions take, which can be grouped into the categories described below.</p><p><strong>Foreign Data Wrappers</strong></p><p>One of the most popular data storage file formats used in analytics is Parquet. Think of it as &quot;CSV on steroids for analytics&quot;. Here are its key features:</p><ul><li>Columnar storage format with data organized by columns rather than rows</li><li>Excellent compression due to storing similar data together in columns</li><li>Strongly typed schema with explicitly defined types for each column</li></ul><p>The common pattern is extracting data from PostgreSQL (or other data sources), transforming it if necessary, and loading it, for example, in Parquet format to S3. This extract, transform, and load process is called ETL.</p><p>To query Parquet files using PostgreSQL, you can use foreign data wrappers (FDW). This is a mechanism that allows accessing data stored outside the database as if it were stored in a local table. 
The most popular PostgreSQL extensions for Parquet are:</p><ul><li><a href="https://github.com/adjust/parquet_fdw?ref=blog.gettelio.com" rel="noreferrer">parquet_fdw</a> for reading Parquet files from a local file system</li><li><a href="https://github.com/pgspider/parquet_s3_fdw?ref=blog.gettelio.com" rel="noreferrer">parquet_s3_fdw</a> for reading Parquet files from S3-compatible object storage</li></ul><p>Here is an example of how these foreign data wrappers can be used:</p><figure class="kg-card kg-code-card"><pre><code class="language-sql">-- Set up the extension
CREATE EXTENSION parquet_s3_fdw;
CREATE SERVER parquet_s3_server FOREIGN DATA WRAPPER parquet_s3_fdw;
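
-- S3 credentials are typically supplied via a user mapping; the option names
-- below are an assumption and may differ between extension versions
CREATE USER MAPPING FOR public SERVER parquet_s3_server
  OPTIONS (user &apos;s3_access_key_id&apos;, password &apos;s3_secret_access_key&apos;);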

-- Create a foreign table
CREATE FOREIGN TABLE invoices (issued_on DATE NOT NULL)
  SERVER parquet_s3_server
  OPTIONS (filename &apos;s3://bucket/dir/invoices.parquet&apos;);

-- Querying data from a Parquet file
SELECT issued_on FROM invoices;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Querying Parquet data using a foreign data wrapper</span></p></figcaption></figure><p><strong>Analytics Query Engines</strong></p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/02/Postgres-extensions-1.png" class="kg-image" alt="Data Analytics with PostgreSQL: The Ultimate Guide" loading="lazy" width="2703" height="1071"></figure><p>To get the best performance when running analytical workloads, using a columnar storage format like Parquet is often not enough because the PostgreSQL query engine still processes each row sequentially.</p><p>Enter <a href="https://github.com/duckdb/duckdb?ref=blog.gettelio.com" rel="noreferrer">DuckDB</a>, a columnar-vectorized query execution engine. Think of DuckDB as &quot;SQLite for analytics&quot; that is lightweight and can store all its data on a disk or in memory. Here are its key features:</p><ul><li>It uses a columnar-vectorized engine, which supports parallel execution and can efficiently process large batches of values, a.k.a. vectors.</li><li>It is a single-binary program without any external dependencies that can run on all operating systems or can be embedded into another program.</li><li>It provides universal SQL access to data formats such as Parquet, CSV, and JSON, and to sources such as remote S3 buckets, API endpoints, and Excel.</li><li>Similarly to PostgreSQL, it also supports extensions to improve its core functionality.</li></ul><p>DuckDB version 1.0 was released in 2024, and since then many new PostgreSQL extensions that embed DuckDB or build their own query engines have been developed. Here are some of them:</p><ul><li><a href="https://github.com/duckdb/pg_duckdb?ref=blog.gettelio.com" rel="noreferrer">pg_duckdb</a> embeds DuckDB and can query Parquet files from object storage</li><li><a href="https://github.com/Mooncake-Labs/pg_mooncake?ref=blog.gettelio.com" rel="noreferrer">pg_mooncake</a> embeds DuckDB and uses columnar storage within PostgreSQL</li><li><a href="https://github.com/paradedb/pg_analytics?ref=blog.gettelio.com" rel="noreferrer">pg_analytics</a> embeds DuckDB and uses foreign data wrappers to read from S3</li><li><a href="https://github.com/heterodb/pg-strom?ref=blog.gettelio.com" rel="noreferrer">pg-strom</a> uses a parallel execution engine that can leverage GPU cores</li></ul><figure class="kg-card kg-code-card"><pre><code class="language-sql">-- Set up the extension
CREATE EXTENSION pg_duckdb;

-- Querying data from a Parquet file
SELECT issued_on FROM read_parquet(&apos;s3://bucket/dir/invoices.parquet&apos;);</code></pre><figcaption><p><span style="white-space: pre-wrap;">Querying Parquet data using pg_duckdb extension</span></p></figcaption></figure><p><strong>Super Extensions</strong></p><p>There are some extensions that significantly alter how PostgreSQL works. Such extensions are sometimes called &quot;super extensions&quot; and very often installed on a dedicated PostgreSQL server.</p><p>These extensions are not designed for data analytics, but some of their features can still improve PostgreSQL query execution and performance. Here are some of the extensions:</p><ul><li><a href="https://github.com/timescale/timescaledb?ref=blog.gettelio.com" rel="noreferrer">timescaledb</a> is designed to turn PostgreSQL into a time-series database. It can automatically partition tables by time-columns, use hybrid row-columnar store, and refresh materialized views incrementally.</li><li><a href="https://github.com/citusdata/citus?ref=blog.gettelio.com" rel="noreferrer">citus</a> is designed to turn PostgreSQL into a distributed database. It can automatically shard tables across servers and use columnar storage for compression and query performance. </li></ul><h3 id="extensions-pros">Extensions Pros</h3><ul><li>Adding new features to PostgreSQL while keeping everything under one roof.</li><li>Variety of open-source extensions that can provide great flexibility and customization to your PostgreSQL.</li><li>Querying data stored in compressed columnar format and/or bringing an analytical query engine to improve performance.</li></ul><h3 id="extensions-cons">Extensions Cons</h3><ul><li>Performance overhead when running analytical queries within the same PostgreSQL that can negatively affect transactional queries.</li><li>Very limited support for installable extensions in managed PostgreSQL services. For example, here is the AWS Aurora&#xA0;<a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraPostgreSQLReleaseNotes/AuroraPostgreSQL.Extensions.html?ref=blog.gettelio.com#AuroraPostgreSQL.Extensions.16" rel="nofollow">allowlist</a>.</li><li>Increased dependency management and maintenance complexity when using extensions or upgrading PostgreSQL/extension versions.</li><li>Manual data syncing and data mapping using ETL pipelines or within PostgreSQL from native row-based storage to columnar storage for best performance.</li><li>Some TimescaleDB features are not available under an open-sourced license: incremental materialized views, compression, and query optimizations.</li></ul><hr><h2 id="integrate-with-analytics-databases"><strong>Integrate with Analytics Databases</strong></h2><p>There are a few OLAP databases that can integrate with PostgreSQL databases and are also compatible, so you can use PostgreSQL tools like database drivers and adapters as usual.</p><p><a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com" rel="noreferrer"><strong>BemiDB</strong></a></p><p><em>Disclaimer: I&#x2019;m a BemiDB contributor. And even though I&#x2019;m biased and want more people to use BemiDB as a simple solution to the PostgreSQL data analytics problem, I&#x2019;ll try to be as objective as possible.</em></p><p>BemiDB is a read replica optimized for analytics. It connects to a PostgreSQL database, automatically syncs data into a compressed columnar storage, and uses a Postgres-compatible analytics query engine to read the data.</p><p>Here are its main key features:</p><ul><li>Single binary that can be run on any machine. 
The compute is stateless and separated from storage, making it easier to run and manage.</li><li>Embeds DuckDB, a columnar-vectorized query execution engine optimized for analytical workloads, to improve performance.</li><li>Uses an open columnar format for tables with compression. The data can be stored either on a local file system or on S3-compatible object storage.</li><li>Postgres-integrated, both on the SQL dialect and table data level. In other words, all <code>SELECT</code> queries executed on a primary PostgreSQL server can be ported to BemiDB as is.</li></ul><figure class="kg-card kg-code-card"><pre><code class="language-sh"># Sync data from PostgreSQL
bemidb --pg-database-url postgres://localhost:5432/dbname sync

# Start BemiDB
bemidb start

# Query BemiDB as a PostgreSQL read replica
psql postgres://localhost:54321/bemidb -c &quot;SELECT COUNT(*) FROM table_from_postgres&quot;</code></pre><figcaption><p><span style="white-space: pre-wrap;">Running BemiDB as a read replica optimized for analytics</span></p></figcaption></figure><p>We&apos;ve already described the Parquet data format and its benefits. The next evolutionary step is using open table formats, such as Iceberg, which BemiDB uses under the hood. These formats use Parquet files to store data in compressed columnar format and stitch them together using metadata files according to format specifications. This helps add smaller Parquet data files incrementally instead of fully rewriting files on every data change.</p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/02/BemiDB-2.png" class="kg-image" alt="Data Analytics with PostgreSQL: The Ultimate Guide" loading="lazy" width="2876" height="1094"></figure><p>With open table formats like Iceberg, in addition to query performance benefits, it&apos;s possible to achieve things that are not possible with PostgreSQL. For example, data interoperability across different databases/tools/services and schema evolution/time travel enabling access to historical versions.</p><p><a href="https://github.com/ClickHouse/ClickHouse?ref=blog.gettelio.com" rel="noreferrer"><strong>ClickHouse</strong></a></p><p>ClickHouse is a column-oriented database designed for real-time analytics. The company recently acquired <a href="https://github.com/PeerDB-io/peerdb?ref=blog.gettelio.com" rel="noreferrer">PeerDB</a>, which allows syncing data from PostgreSQL into ClickHouse in real-time. It also has some basic PostgreSQL compatibility allowing you to connect and run ClickHouse SQL queries via the PostgreSQL wire protocol.</p><p>Here are its key features:</p><ul><li>Distributed processing across multiple servers in a cluster enabling horizontal scalability.</li><li>Vectorized query execution engine optimized for analytical workloads.</li><li>Columnar storage that enables data compression and improved query performance.</li><li>ClickHouse is optimized for inserting large batches of rows, usually between 10K and 100K rows.</li></ul><p>PeerDB behaves like an ETL tool that connects to PostgreSQL databases using <a href="https://www.postgresql.org/docs/current/logical-replication.html?ref=blog.gettelio.com" rel="noreferrer">logical replication</a> and <a href="https://www.postgresql.org/docs/current/logicaldecoding.html?ref=blog.gettelio.com" rel="noreferrer">decoding</a>. The transformations can be performed using custom Lua scripts.</p><h3 id="analytics-databases-pros">Analytics Databases Pros</h3><ul><li>The best performance tailored specifically for analytical workloads.</li><li>Integrate with PostgreSQL databases and replicate data into a scalable columnar storage format. 
Minimal impact on PostgreSQL performance, resource usage, and internal configuration.</li><li>BemiDB consists of a single binary allowing to easily run and manage an optimized for analytics PostgreSQL read replica.</li><li>ClickHouse allows batch data inserts directly, bypassing PostgreSQL.</li></ul><h3 id="analytics-databases-cons">Analytics Databases Cons</h3><ul><li>They are not PostgreSQL and don&apos;t support many PostgreSQL-specific features or extensions.</li><li>Increased system complexity with extra server processes running in addition to PostgreSQL.</li><li>BemiDB&#xA0;doesn&apos;t support direct Postgres-compatible write operations (yet), so it can only work as a read replica.</li><li>ClickHouse is quite different from PostgreSQL and OLTP databases in many ways: different SQL dialect, limitations on data mutability, no support for ACID (atomicity, consistency, isolation, durability), and many others.</li></ul><hr><h2 id="use-proprietary-solutions"><strong>Use Proprietary Solutions</strong></h2><p>Due to PostgreSQL&apos;s popularity, many companies started building their custom proprietary solutions for analytics either on top of PostgreSQL or making them Postgres-compatible.</p><p><a href="https://cloud.google.com/products/alloydb?ref=blog.gettelio.com" rel="noreferrer"><strong>Google Cloud AlloyDB</strong></a></p><figure class="kg-card kg-image-card"><img src="https://blog.bemi.io/content/images/2025/02/AlloyDB.png" class="kg-image" alt="Data Analytics with PostgreSQL: The Ultimate Guide" loading="lazy" width="2765" height="998"></figure><p>AlloyDB is a managed PostgreSQL-compatible database for hybrid transactional and analytical workloads (HTAP). It can replace PostgreSQL for transactional queries and also deliver good performance for analytical queries.</p><p>Here are its main features:</p><ul><li>Enhanced query processing layers in PostgreSQL kernel for performance and shared storage in a region.</li><li>Embeds a proprietary vectorized engine and storage with an additional columnar format.</li><li>Query planner that automatically chooses an execution&#xA0;fully on columnar data, fully on row-oriented data, or a hybrid of the two.</li><li>Has a downloadable version called AlloyDB Omni that can also run on AWS and Azure in a Docker container.</li></ul><p><a href="https://www.enterprisedb.com/products/analytics?ref=blog.gettelio.com" rel="noreferrer"><strong>EDB Analytics Accelerator</strong></a></p><p>EDB (a.k.a. EnterpriseDB) Postgres AI is a data platform for both transactional and analytical workloads. The analytics product is powered by PostgreSQL and the proprietary extension called PGAA.</p><p>Here are the main analytics features:</p><ul><li>Vectorized query engine optimized for columnar data formats.</li><li>Tiered storage, with hot data on a disk and cold data in object storage in open table formats.</li><li>Storage and compute separation with dedicated PostgreSQL replicas for analytical queries.</li></ul><p><a href="https://www.crunchydata.com/products/warehouse?ref=blog.gettelio.com" rel="noreferrer"><strong>Crunchy Data Warehouse</strong></a></p><p>Crunchy Data is a company that specializes in providing services, support, and solutions for PostgreSQL. 
The company released Crunchy Data Warehouse in 2024, an analytics database built on PostgreSQL.</p><p>The main features include:</p><ul><li>The latest versions of PostgreSQL with proprietary extensions.</li><li>Integrated DuckDB query engine by delegating parts of the query to it for vectorized execution.</li><li>S3 for storage with an Iceberg table format that can be queried with tools like Apache Spark.</li></ul><p><a href="https://www.firebolt.io/?ref=blog.gettelio.com" rel="noreferrer"><strong>Firebolt</strong></a></p><p>Firebolt is a cloud data warehouse. It started by forking ClickHouse to implement better storage and compute decoupling, along with other improvements. In 2024, they added Postgres SQL dialect compatibility.</p><ul><li>Vectorized query execution engine, ACID compliant.</li><li>Proprietary columnar data format and tiered storage in memory, local SSD, and S3.</li></ul><h3 id="proprietary-solutions-pros">Proprietary Solutions Pros</h3><ul><li>Fully managed cloud data warehouses optimized for analytical workloads.</li></ul><h3 id="proprietary-solutions-cons">Proprietary Solutions Cons</h3><ul><li>Vendor lock-in and limited control over the source code and data.</li><li>Very limited or no support at all for installable extensions. For example, here is the GCP AlloyDB&#xA0;<a href="https://cloud.google.com/alloydb/docs/reference/extensions?ref=blog.gettelio.com" rel="noreferrer">allowlist</a>.</li><li>Crunchy Data and Firebolt require manual data syncing using ETL pipelines or within PostgreSQL from native row-based storage to columnar storage.</li><li>Can be more expensive compared to other alternatives.</li></ul><hr><h2 id="conclusion"><strong>Conclusion</strong></h2><p>There is a wide variety of options for handling data analytics with PostgreSQL.</p><ul><li>Just using PostgreSQL and scaling it can be a great starting point for simpler analytical needs initially, allowing to keep the data stack simple.</li><li>With enough PostgreSQL expertise and access to custom extensions, installing them can help improve analytical performance within PostgreSQL.</li><li>If you don&apos;t want to spend time tuning PostgreSQL and all you need is a simple read replica optimized for analytics, then BemiDB is the best choice.</li><li>If you deal with many terabytes of mostly append-only data and don&apos;t mind switching to another SQL dialect, then ClickHouse is a great choice.</li><li>And if you already host PostgreSQL on platforms like GCP or EDB, then choosing their analytics solutions can reduce the number of data providers.</li></ul><hr><p><em>Check out the </em><a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com"><em>BemiDB GitHub repo</em></a><em> if you want to give it a shot. And subscribe to our blog if you want to learn more about PostgreSQL and data analytics.</em></p>]]></content:encoded></item><item><title><![CDATA[When Postgres Indexing Went Wrong]]></title><description><![CDATA[It’s important to understand basics of indexing and best practices around them for preventing system downtime. 
]]></description><link>https://blog.gettelio.com/indexing/</link><guid isPermaLink="false">68ae707be3a2820001805d4e</guid><category><![CDATA[Engineering]]></category><dc:creator><![CDATA[Arjun Lall]]></dc:creator><pubDate>Mon, 23 Sep 2024 05:08:00 GMT</pubDate><media:content url="https://blog.gettelio.com/content/images/2025/08/image-671--2-.png" medium="image"/><content:encoded><![CDATA[<img src="https://blog.gettelio.com/content/images/2025/08/image-671--2-.png" alt="When Postgres Indexing Went Wrong"><p>Indexing in Postgres seems simple, but it&#x2019;s important to understand the basics of how it really works and the best practices for preventing system downtime.</p><p>TLDR: Be careful when creating indexes &#x2014; a lesson I learned the hard way when concurrent indexing failed silently.</p><h2 id="critical-incident">Critical incident</h2><p>At a previous company, we managed a high-volume Postgres instance with billions of rows of transactional data. As we scaled, query performance became a key priority, and one of the first optimizations was adding indexes. To avoid downtime, we used <code>CREATE INDEX CONCURRENTLY</code>, which allows indexing large tables without locking out writes for hours. Initially, p99 query performance improved dramatically.</p><p>A few weeks later, another team launched a new feature that was built to rely heavily on the new index. Everything seemed routine&#x2014;until the traffic spiked.</p><p>At first, the problem was subtle. A few queries took longer than expected. But within hours, the load began to spike. Query response times slowed to a crawl, and some requests were timing out.</p><p>We couldn&#x2019;t immediately see why. The index was in place; a quick <code>EXPLAIN ANALYZE</code> confirmed it was being used. But users were still experiencing massive slowdowns, and we were on the brink of a full-scale production outage.</p><p>It wasn&#x2019;t until we checked the server logs that we pieced together what happened:</p><pre><code class="language-sql">CREATE INDEX CONCURRENTLY idx_email_2019 ON users_2019 (email);
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 54321; blocked by process 54322.
</code></pre><h2 id="concurrent-indexing-can-fail-silently"><strong>Concurrent indexing can fail (silently)</strong></h2><p>Concurrent indexing needs more total work than a standard index build and takes much longer to complete. It uses a two-phase approach that helps avoid locking the table:</p><ul><li><strong>Phase 1:</strong> A snapshot of the current data gets taken, and the index is built on that.</li><li><strong>Phase 2:</strong> Postgres then catches up with any changes (inserts, updates, or deletes) that happened during phase 1.</li></ul><p>Because this process spans multiple transactions, the <code>CREATE INDEX</code> command might fail partway through (for example, on a deadlock), leaving an incomplete index behind. An &#x201C;invalid&#x201D; index is ignored during querying, but this oversight can have serious consequences if not monitored.</p><pre><code>postgres=# \d users_2019
       Table &quot;public.users_2019&quot;
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 email  | text    |           |          |
  ...   |         |           |          |
Indexes:
    &quot;idx_email_2019&quot; btree (email) INVALID
</code></pre><p>In our case, the issue was amplified by the fact that our data was partitioned. The index had failed on some partitions but not others, leading to a situation where some queries were using the index while others were hitting unindexed partitions. This imbalance resulted in uneven query performance and significantly increased load on the system.</p><p>If we hadn&#x2019;t caught it when we did, we would have faced a full-blown production outage, impacting every user on the platform.</p><h2 id="best-practices-for-postgres-indexing"><strong>Best practices for Postgres indexing</strong></h2><p>To help others navigate this terrain, here are some best practices for Postgres indexing that can prevent these issues:</p><h3 id="avoid-dangerous-operations"><strong>Avoid dangerous operations</strong></h3><p>Always use the <code>CONCURRENTLY</code> flag when creating indexes in production. Without it, even smaller tables can block writes for an unacceptably long time, leading to system downtime. While <code>CONCURRENTLY</code> takes more CPU and I/O, the trade-off is worth it to maintain availability. Keep in mind that concurrent index builds can only happen one at a time on the same table, so plan accordingly for large datasets.</p><h3 id="monitor-concurrent-index-creation-closely"><strong>Monitor concurrent index creation closely</strong></h3><p>Don&#x2019;t take successful index creation for granted. The system view <code>pg_stat_progress_create_index</code> can be queried for progress reporting while indexing is taking place.</p><pre><code class="language-sql">postgres=# SELECT * FROM pg_stat_progress_create_index;
-[ RECORD 1 ]------+---------------------------------------
pid                | 896799
datid              | 16402
datname            | postgres
relid              | 17261
index_relid        | 136565
command            | CREATE INDEX CONCURRENTLY
phase              | building index: loading tuples in tree
lockers_total      | 0
lockers_done       | 0
current_locker_pid | 0
blocks_total       | 0
blocks_done        | 0
tuples_total       | 10091384
tuples_done        | 1775295
partitions_total   | 0
partitions_done    | 0
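-- The phase field and the tuples_done/tuples_total ratio above show how far the build has progressed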
</code></pre><h3 id="manually-validate-indexes"><strong>Manually validate indexes</strong></h3><p>If you don&#x2019;t check your indexes, you might think you&#x2019;re able to rely on them when you can&#x2019;t. And although an invalid index gets ignored during querying, it still consumes update overhead. Common causes for index failures include:</p><ul><li>Deadlocks: Index creation might conflict with ongoing transactions, leading to deadlocks.</li><li>Disk Space: Large indexes may fail due to insufficient disk space.</li><li>Constraint Violations: Creating unique indexes on columns with non-unique data will result in failures.</li></ul><p>You can find all invalid indexes by running the following:</p><pre><code>SELECT * FROM pg_class, pg_index WHERE pg_index.indisvalid = false AND pg_index.indexrelid = pg_class.oid;
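
-- A more compact variant that lists just the invalid index names:
SELECT indexrelid::regclass AS invalid_index
FROM pg_index
WHERE NOT indisvalid;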
</code></pre><p>You can also query the <code>pg_stat_all_indexes</code> and <code>pg_statio_all_indexes</code> system views to verify that the index is being accessed.</p><h3 id="fix-invalid-indexes">Fix invalid indexes</h3><p>Invalid indexes can be recovered using the <code>REINDEX</code> command. It&#x2019;s the same as dropping and recreating the index, except it would also lock out reads that attempt to use that index (if not specifying <code>CONCURRENTLY</code>). Note that <code>CONCURRENTLY</code> reindexing isn&#x2019;t supported in versions below Postgres 12.</p><pre><code class="language-sql">REINDEX INDEX CONCURRENTLY idx_users_email_2019;
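
-- The equivalent manual fix is to drop the invalid index and rebuild it:
-- DROP INDEX CONCURRENTLY idx_users_email_2019;
-- CREATE INDEX CONCURRENTLY idx_users_email_2019 ON users_2019 (email);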
</code></pre><p>If a problem occurs while rebuilding the indexes, it&#x2019;d leave behind a new invalid index suffixed with&#xA0;<code>_ccnew</code>. Drop it and retry&#xA0;<code>REINDEX CONCURRENTLY</code>.</p><pre><code class="language-sql">postgres=# \d users_2019
       Table &quot;public.users_2019&quot;
 Column |  Type   | Collation | Nullable | Default
--------+---------+-----------+----------+---------
 email  | text    |           |          |
Indexes:
    &quot;idx_users_email_2019&quot; btree (email) INVALID
    &quot;idx_users_email_2019_ccnew&quot; btree (email) INVALID
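
-- Drop the leftover invalid index, then retry the rebuild
postgres=# DROP INDEX CONCURRENTLY idx_users_email_2019_ccnew;
postgres=# REINDEX INDEX CONCURRENTLY idx_users_email_2019;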
</code></pre><p>If the invalid index is suffixed with <code>_ccold</code>, it&#x2019;s the original index that wasn&#x2019;t fully replaced. You can safely drop it, as the rebuild has succeeded.</p><h3 id="create-partition-indexes-consistently"><strong>Create partition indexes consistently</strong></h3><p>For newly created partitioned tables or small tables (&lt;100k rows), you can simply create indexes synchronously on the parent table, and they will automatically propagate to all partitions, including any created in the future.</p><pre><code>CREATE INDEX idx_users_email ON users (email);
</code></pre><p>But it&#x2019;s currently not possible to use the <code>CONCURRENTLY</code> flag when creating an index on the root partitioned table. What you should use instead is the <code>ONLY</code> flag. This tells the parent table to not apply the index recursively to children, so the table isn&#x2019;t locked.</p><pre><code class="language-sql">-- Create an index on the parent table (metadata only operation);
CREATE INDEX idx_users_email ON ONLY users (email);
</code></pre><p>This creates an invalid index first. Then we can create indexes for each partition and attach them to the parent index:</p><pre><code class="language-sql">CREATE INDEX CONCURRENTLY idx_users_email_2019
    ON users_2019 (email);
ALTER INDEX idx_users_email
    ATTACH PARTITION idx_users_email_2019;

CREATE INDEX CONCURRENTLY idx_users_email_2020
    ON users_2020 (email);
ALTER INDEX idx_users_email
    ATTACH PARTITION idx_users_email_2020;

-- Repeat for all partitions
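
-- After the last ATTACH PARTITION, the parent index should no longer be INVALID:
SELECT indisvalid FROM pg_index WHERE indexrelid = &apos;idx_users_email&apos;::regclass;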
</code></pre><p>Only once all partitions are attached will the index on the root table automatically be marked as valid. The parent itself is just a &#x201C;virtual&#x201D; table without any storage, but it serves to ensure that all partitions maintain a consistent indexing strategy.</p><h3 id="check-the-query-execution-plan"><strong>Check the query execution plan</strong></h3><p>Using the <code>EXPLAIN ANALYZE</code> command provides a comprehensive view of the query execution plan, detailing how Postgres processes your query. This breakdown is essential for verifying that the expected indexes are being utilized effectively.</p><pre><code class="language-sql">EXPLAIN ANALYZE SELECT * FROM users_2019 WHERE email = &apos;arjun@bemi.io&apos;;

Index Scan using idx_users_email_2019 on users_2019  (cost=0.15..0.25 rows=1 width=48) (actual time=0.015..0.018 rows=1 loops=1)
  Index Cond: (email = &apos;arjun@bemi.io&apos;::text)
Planning Time: 0.123 ms
Execution Time: 0.028 ms
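
-- A Seq Scan here instead of an Index Scan would mean the planner is not using the index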
</code></pre><h3 id="remove-unused-indexes"><strong>Remove unused indexes</strong></h3><p>Sometimes the indexes we add aren&#x2019;t as valuable as expected. To prune our indexes to optimize write performance, we can check which indexes haven&#x2019;t been used:</p><pre><code class="language-sql">select 
    indexrelid::regclass as index, relid::regclass as table 
from 
    pg_stat_user_indexes 
    JOIN pg_index USING (indexrelid) 
where 
    idx_scan = 0 and indisunique is false;
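
-- Indexes confirmed to be unused can then be dropped without blocking writes:
-- DROP INDEX CONCURRENTLY index_name;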
</code></pre><p>By implementing these best practices, you can avoid scary mistakes. Remember to monitor, validate, and understand the implications of your indexing strategy. The cost of overlooking these details can be significant, and a proactive approach will help you maintain a stable and efficient database.</p><p><em>When Postgres indexing isn&apos;t enough to scale, check out the </em><a href="https://github.com/BemiHQ/BemiDB?ref=blog.gettelio.com" rel="noreferrer"><em>BemiDB</em></a><em> for handling analytical workloads on Postgres.</em></p>]]></content:encoded></item></channel></rss>