Build Your Own AI Voice Assistant: ESP32-S3, Home Assistant & Local Audio (No cloud needed)

By The Maker Team December 07, 2025
In this Build Guide:
  • The Chip: Why you need an ESP32-S3 (and why the C3 won't work).
  • The Backend: Installing Whisper and Piper in Home Assistant.
  • The Parts: INMP441 Mic + MAX98357A Amp.
  • The Wiring: Understanding the I2S Audio Protocol.
  • The Result: A private smart speaker for under $15.

In our previous guides, we loved the ESP32-C3 for temperature sensors and WLED. It is cheap and efficient. But today, we are building ears and a mouth for your home.

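For audio processing, the C3 is too weak: it has a single 160 MHz core and no support for external PSRAM. Detecting a wake word like "Hey Jarvis" locally, without sending audio to the cloud, means running a small neural network on the chip all the time. For that we need the ESP32-S3, which has two 240 MHz cores and vector instructions aimed at this kind of workload.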

Hardcore Hardware

This is an intermediate build involving I2S audio protocols. Love getting into the weeds of datasheets? Search for the "Electronics" or "PCB Design" tags on Great Meets to find other hardware hackers in your city.


Step 1: Preparing Home Assistant (The Brains)

Before we wire up the hardware, we need to ensure Home Assistant has the "brains" to understand English and talk back. We need to install three add-ons.

  1. Go to Settings -> Add-ons -> Add-on Store.
  2. Install Whisper (Speech-to-Text). This converts your voice recording into text.
  3. Install Piper (Text-to-Speech). This creates the computer voice that talks back.
  4. Install openWakeWord. This add-on runs wake word detection inside Home Assistant. The S3 detects the wake word on the chip itself, so treat this one as a fallback for devices that cannot.

Once installed, go to Settings -> Voice Assistants and make sure you have a pipeline active that uses these three services. This is the "server" your ESP32 will talk to.


Step 2: The Shopping List

Unlike a smart plug, we are building this from components. You will need:

The Brain

ESP32-S3 DevKit. Make sure it is the S3 version. The N16R8 variant (16 MB flash, 8 MB PSRAM) is best. The S3 adds vector instructions that speed up the small neural network behind wake word detection.

The Ears

INMP441 Microphone. An omnidirectional I2S microphone. It captures high-quality digital audio.

The Mouth

MAX98357A Amplifier. This takes digital audio from the ESP32 and powers a small 3W speaker.


Step 3: The Wiring (I2S Protocol)

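We are using I2S (Inter-IC Sound), which is a standard for connecting digital audio devices. Each device needs 3 signal wires: Clock (BCLK), Word Select (LRC), and Data (DIN/DOUT). In this build the microphone and the amplifier share the two clock lines, and each gets its own data line.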

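ESP32-S3 Pin    Microphone (INMP441)    Amplifier (MAX98357A)
3.3V            VDD                     -
5V (or 3.3V)    -                       Vin
GND             GND                     GND
GPIO 41         SCK                     BCLK
GPIO 42         WS                      LRC
GPIO 40         SD (Serial Data)        -
GPIO 39         -                       DIN

Power the INMP441 from 3.3V only; it is not rated for 5V. The MAX98357A runs on either rail and is louder on 5V.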

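Note: You can change these GPIO pins in the software; the ones above work on many S3 dev boards. The INMP441 also has an L/R pin that selects which I2S channel it transmits on (GND for left, VDD for right). Do not leave it floating, and make sure it matches the channel the software listens on.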
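
If you expect to move pins around, it can help to name them once at the top of the ESPHome file you will create in Step 4. Here is a minimal sketch using ESPHome substitutions. The substitution names are our own invention for illustration, and the GPIO numbers simply mirror the table above:

```yaml
# Hypothetical substitution names; GPIO numbers match the wiring table.
substitutions:
  i2s_bclk_pin: GPIO41   # shared bit clock: mic SCK + amp BCLK
  i2s_lrclk_pin: GPIO42  # shared word select: mic WS + amp LRC
  mic_data_pin: GPIO40   # INMP441 SD -> ESP32
  amp_data_pin: GPIO39   # ESP32 -> MAX98357A DIN

i2s_audio:
  - id: i2s_bus
    i2s_lrclk_pin: ${i2s_lrclk_pin}
    i2s_bclk_pin: ${i2s_bclk_pin}
```

The microphone and speaker entries can then use ${mic_data_pin} and ${amp_data_pin} in the same way, so rewiring means editing one line instead of hunting through the file.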

Getting Static or Screeching?

Audio hardware is sensitive to power noise. If your speaker is buzzing, you might need a decoupling capacitor across the amplifier's power pins, shorter wires, or a cleaner power supply. Stuck? Search for an "Audio Engineer" on Great Meets and message them for troubleshooting tips.


Step 4: The Software (ESPHome)

We will use ESPHome to program the chip. You will need a specific configuration that includes the "Micro Wake Word" component.

Create a new device in ESPHome and use this configuration block for the I2S setup:

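i2s_audio:
  - id: i2s_bus
    i2s_lrclk_pin: GPIO42  # word select: mic WS + amp LRC
    i2s_bclk_pin: GPIO41   # bit clock: mic SCK + amp BCLK

microphone:
  - platform: i2s_audio
    id: board_microphone
    i2s_audio_id: i2s_bus
    i2s_din_pin: GPIO40    # INMP441 SD
    adc_type: external     # external I2S mic, not an on-chip ADC
    pdm: false             # the INMP441 outputs standard I2S, not PDM

speaker:
  - platform: i2s_audio
    id: board_speaker
    i2s_audio_id: i2s_bus
    i2s_dout_pin: GPIO39   # MAX98357A DIN
    dac_type: external     # external I2S amp, not an on-chip DAC
    mode: mono             # one speaker; newer ESPHome releases may expect "channel: mono" instead

voice_assistant:
  microphone: board_microphone
  speaker: board_speaker
  noise_suppression_level: 2  # 0 (off) to 4 (strongest)
  auto_gain: 31dBFS           # 0dBFS (off) to 31dBFS
  volume_multiplier: 2.0      # software gain applied to the mic signal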
  volume_multiplier: 2.0

Step 5: The Test

Once you flash the ESP32-S3, Home Assistant will auto-discover it. Add it, and then go to Settings -> Voice Assistants.

  1. Select your device.
  2. Choose your Wake Word (e.g., "Okay Nabu" or "Hey Jarvis").
  3. Speak!

When you speak the wake word, the ESP32-S3 detects it locally. It then streams the audio that follows to the Whisper add-on (Step 1), which converts it to text. Home Assistant processes the command and hands the reply text to Piper. Piper turns it into speech, and that audio is streamed back to the ESP32-S3 and played through your DIY speaker.
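
If something in that chain goes quiet, logging each stage is a quick way to see where it stops. Here is a rough sketch using the voice assistant's event triggers, assuming the logger component is enabled (it is in the default ESPHome device template). These triggers are meant to be merged into the voice_assistant block from Step 4, not added as a second block:

```yaml
voice_assistant:
  # ...keep the microphone, speaker, and gain settings from Step 4 here...
  on_listening:
    - logger.log: "Listening for a command"
  on_stt_end:
    # x holds the text Whisper produced from your speech
    - logger.log:
        format: "Whisper heard: %s"
        args: ["x.c_str()"]
  on_tts_start:
    # x holds the reply text that Piper is about to speak
    - logger.log:
        format: "Piper will say: %s"
        args: ["x.c_str()"]
  on_error:
    - logger.log:
        format: "Pipeline error %s: %s"
        args: ["code.c_str()", "message.c_str()"]
```

Open the device's log in ESPHome, say the wake word, and watch which line is the last to appear.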


Conclusion

You have just built a device that handles voice commands much like an Amazon Echo, but keeps every step of the audio processing on your own network. It costs about $15 in parts and gives you total control over the hardware.

Build It Together

Soldering tiny wires to an ESP32 can be daunting. Why not host a "Build Night"? Great Meets lets you find others who want to learn. Create a local meetup or just find a buddy to share shipping costs on parts.