From Spooky Ambitions to Practical Lessons: Overwhelmed Animatronics Powered by a Local VLM


The dream was simple enough: an AI-powered Halloween skeleton, affectionately dubbed “Skelly,” greeting trick-or-treaters with personalized welcomes based on their costumes. The reality, as often happens in the world of rapid prototyping and ambitious side projects, proved… more complicated. This post details the lessons learned from a somewhat chaotic Halloween night deployment, focusing on the security and privacy implications inherent in edge AI systems like Skelly, and outlining strategies for a more controlled – and successful – iteration next year. We’ll dive into the design choices, the unexpected challenges, and how leveraging local Vision Language Models (VLMs) can be a powerful tool for privacy-focused applications.

The Initial Vision: A Local AI Halloween Greeter

The core concept revolved around a Radxa Zero 3W, a connected USB webcam, the skeleton’s built-in speaker driven by a MAX98357A mono amplifier, and the animatronics of a pre-built Halloween skeleton. The plan was to capture images, feed them into an offline VLM like those available through LM Studio (running on an AMD Strix Halo platform), analyze the costumes (with Google’s Gemma 3 27B), and generate a custom greeting delivered via text-to-speech (TTS) using Piper TTS. The original inspiration came from Alex Volkov’s work at Weights & Biases, which utilized a similar setup with Google AI Studio, ElevenLabs, Cartesia, and ChatGPT.

I opted for a fully offline approach to prioritize privacy. Capturing images that include children requires careful consideration, and sending that data to external APIs introduces significant risks. Local processing eliminates those concerns, albeit at the cost of increased complexity in model management and resource requirements.

The Halloween Night Reality: Overwhelmed by the Queue

The biggest issue wasn’t technical – it was human. We anticipated a trickle of small groups, perhaps one to three treaters approaching Skelly at a time, uttering a polite “trick or treat.” Instead, we were met with waves of ten-plus children lining up like attendees at a concert. The system simply couldn’t handle the rapid influx.

The manual trigger approach – snapping pictures on demand – quickly became unsustainable. We struggled to process images fast enough before the next wave arrived. Privacy concerns also escalated as we attempted manual intervention, leading us to abandon the effort and join our kids in traditional trick-or-treating. The lack of good reproducible artifacts was a direct consequence of these issues; we were too busy firefighting to collect meaningful data.

Security Considerations: A Deep Dive into Edge AI Risks

This experience highlighted several critical risk considerations for edge AI deployments, particularly those involving physical interaction and potentially sensitive data like images of children:

  • Data Capture & Storage: Even with offline processing, the captured images represent a potential privacy breach if compromised. Secure storage is paramount – encryption at rest and in transit (even locally) is essential. Consider minimizing image retention time or implementing automated deletion policies (see the sketch after this list).
  • Model Integrity: The VLM itself could be targeted. A malicious actor gaining access to the system could potentially replace the model with one that generates inappropriate responses or exfiltrates data. Model signing and verification are crucial.
  • GPIO Control & Physical Access: The Radxa Zero 3W’s GPIO pins, controlling the animatronics, represent a physical attack vector. Unrestricted access to these pins or the network could allow an attacker to manipulate Skelly in unintended ways.
  • Network Exposure (Even Offline): While we aimed for complete offline operation, the system still had network connectivity for initial model downloads and updates. This creates a potential entry point for attackers.
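
As a concrete example of the retention point above, here’s a minimal sketch of an automated deletion sweep. The capture directory, retention window, and sweep interval are assumptions; adapt them to your own deployment.

import time
from pathlib import Path

# Assumed capture directory and retention window; adjust for your setup.
CAPTURE_DIR = Path("/var/lib/skelly/captures")
RETENTION_SECONDS = 10 * 60  # keep captured images for at most ten minutes

def purge_old_captures():
    """Delete captured images older than the retention window."""
    cutoff = time.time() - RETENTION_SECONDS
    for image in CAPTURE_DIR.glob("*.jpg"):
        if image.stat().st_mtime < cutoff:
            image.unlink(missing_ok=True)

if __name__ == "__main__":
    while True:
        purge_old_captures()
        time.sleep(60)  # sweep once a minute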

Reimagining Skelly: Controlling the Chaos

Next year’s iteration will focus on mitigating these risks through a combination of controlled interactions, robust security measures, and optimized processing. Here’s the plan:

1. Photo Booth Mode: Abandoning the “ambush” approach in favor of a dedicated photo booth setup. A backdrop and clear visual cues will encourage people to interact with Skelly in a more predictable manner.

2. Motion-Triggered Capture: Replacing voice activation with a motion sensor. This provides a consistent trigger mechanism, allowing us to time image capture and processing effectively (a sketch combining this with the rate limiting from item 3 follows this list).

3. Timing & Rate Limiting: Implementing strict timing controls to prevent overwhelming the system. A delay between captures will allow sufficient time for processing and response generation.

4. Visual Indicators & Auditory Cues: Providing clear feedback to users – a flashing light indicating image capture, a cheerful phrase confirming costume recognition, and a countdown timer before the greeting is delivered. This enhances user experience and encourages cooperation.

5. Enhanced GPIO Controls: Restricting access to the GPIO pins using Linux capabilities or mount namespaces, and limiting physical access to Skelly itself, are key to reducing tampering.
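
To make the motion trigger and rate limiting concrete, here’s a minimal sketch of a motion-triggered capture loop with a cooldown. It assumes the libgpiod v1 Python bindings and a PIR sensor on a hypothetical gpiochip3 line 5; the chip, offset, and cooldown values are illustrative, not the final design.

import time

import gpiod  # libgpiod v1 Python bindings (python3-libgpiod)

# Hypothetical PIR motion sensor input; adjust chip and offset to your wiring.
PIR_LINE = 5
COOLDOWN_SECONDS = 15  # minimum gap between greetings to avoid pile-ups

chip = gpiod.Chip("gpiochip3")
pir = chip.get_line(PIR_LINE)
pir.request(consumer="skelly-motion", type=gpiod.LINE_REQ_EV_RISING_EDGE)

last_trigger = 0.0
while True:
    if pir.event_wait(sec=1):  # block up to one second for an edge
        pir.event_read()       # consume the event
        now = time.monotonic()
        if now - last_trigger >= COOLDOWN_SECONDS:
            last_trigger = now
            print("Motion detected: capture image and greet here")
        # else: still cooling down, ignore the trigger

The cooldown doubles as the rate limiter: any trigger that arrives during the window is simply dropped.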

Leveraging Local VLMs: A Python Example

The power of local VLMs lies in their ability to understand images without relying on external APIs. Here’s a simplified example demonstrating how to capture an image from a USB webcam and prompt Ollama with a costume greeting request using Python:

import base64

import cv2
import requests

# Configuration
OLLAMA_API_URL = "http://localhost:11434/api/generate"  # Adjust if necessary
MODEL = "gemma3:27b"  # Or your preferred vision-capable model
PROMPT = (
    "You are an AI assistant controlling a Halloween animatronic. "
    "The attached image shows a person (or people) in costume. "
    "Identify the costume in one short phrase, then respond with a "
    "friendly greeting that references the costume. Use a cheerful tone."
)

def capture_image(camera_index=0):
    """Captures a single JPEG-encoded frame from the specified webcam."""
    cap = cv2.VideoCapture(camera_index)
    if not cap.isOpened():
        raise IOError("Cannot open webcam")
    ret, frame = cap.read()
    cap.release()
    if not ret:
        raise IOError("Failed to capture image")
    _, img_encoded = cv2.imencode('.jpg', frame)
    return img_encoded.tobytes()

def prompt_ollama(image_data):
    """Prompts Ollama with the image data and returns the response."""
    payload = {
        "model": MODEL,
        "prompt": PROMPT,
        # Ollama's /api/generate accepts base64-encoded images for
        # multimodal models via the "images" field.
        "images": [base64.b64encode(image_data).decode("utf-8")],
        "stream": False,  # Set to True for streaming responses
    }
    response = requests.post(OLLAMA_API_URL, json=payload)
    response.raise_for_status()  # Raise an exception for bad status codes
    return response.json()["response"]


if __name__ == "__main__":
    try:
        image_data = capture_image()
        greeting = prompt_ollama(image_data)
        print("Generated Greeting:", greeting)
    except Exception as e:
        print("Error:", e)

Important Notes:

  • This is a simplified example and requires the cv2 (OpenCV) and requests libraries. Install them using pip install opencv-python requests.
  • Ensure Ollama is running and the specified model is downloaded, e.g. via ollama pull gemma3:27b.
  • The image is base64-encoded and passed in the images field of Ollama’s /api/generate request. Adjust this if your VLM serving stack expects a different format.
  • Error handling is minimal; implement more robust error checking in a production environment.

System Flow Diagram: Whisper to Piper via Ollama

Here’s a flow diagram illustrating the complete system architecture:

This diagram highlights the key components and data flow: Whisper listens for the “trick or treat” wake word, which triggers image capture from the webcam (next year, a motion sensor will take over this role). Ollama then processes the image to generate a costume description and greeting, and Piper TTS converts that text into audio delivered through Skelly’s speaker.
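
As a sketch of the final leg of that pipeline, here’s one way to hand a generated greeting to Piper from Python. It assumes the piper CLI is installed, a voice model such as en_US-lessac-medium.onnx has been downloaded, and ALSA’s aplay is available for playback:

import subprocess

def speak(text, voice="en_US-lessac-medium.onnx", wav_path="/tmp/greeting.wav"):
    """Render text to speech with the Piper CLI, then play it via aplay."""
    # Piper reads the text to synthesize from stdin.
    subprocess.run(
        ["piper", "--model", voice, "--output_file", wav_path],
        input=text.encode("utf-8"),
        check=True,
    )
    subprocess.run(["aplay", wav_path], check=True)

speak("Happy Halloween! I love your pirate costume!")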

Conclusion: Building Secure & Engaging Edge AI Experiences

The Halloween night debacle served as a valuable learning experience. While the initial vision was ambitious, it lacked the necessary controls and security measures for a real-world deployment. By focusing on controlled interaction, robust security practices, and leveraging the power of local VLMs like those available through Ollama or LM Studio, we can create engaging and privacy-focused edge AI experiences that are both fun and secure. The key is to anticipate potential challenges, prioritize user safety, and build a system that’s resilient against both accidental mishaps and malicious attacks. The future of animatronics powered by local VLMs is bright – let’s make sure it’s also safe!

Bringing Skelly to Life: A Radxa Zero 3W Powered AI Halloween Skeleton

Halloween is my favorite time of year, and I love creating interactive experiences for trick-or-treaters. Last year, Alex Volkov’s project on Weights & Biases – building an AI-powered skeleton using a Raspberry Pi – sparked an idea. This year, I wanted to take that concept further, aiming for a smaller footprint and increased processing power by leveraging the Radxa Zero 3W single board computer. This blog post details my journey of transforming a standard Home Depot 3ft Halloween Classics Animated LED Dancing Skeleton into a locally AI-driven greeter. We’ll cover everything from dismantling the original animatronic to wiring up the Radxa Zero 3W and setting the stage for integrating local vision models to recognize costumes.

The Inspiration & Goals

Volkov’s project was brilliant: using online AI services like Google AI Studio, ElevenLabs (for voice), Cartesia, and ChatGPT to create a responsive skeleton that could greet trick-or-treaters. However, relying on cloud services introduces latency, requires a stable internet connection, and could raise privacy concerns – not ideal for the often chaotic Halloween night. My goal was to replicate the interactive experience but move all processing local, using a more compact and powerful board than the Raspberry Pi 4. The Radxa Zero 3W seemed like the perfect fit. It packs significant punch in a tiny form factor, offering Wi-Fi connectivity, Bluetooth, and ample GPIO pins for controlling the animatronic components.

Disassembly & Component Identification: Getting to Know Skelly

The first step was understanding how the original skeleton worked. This involved carefully dismantling the 3ft dancing skeleton. Start by removing the back of the skull and chest plate; this provides access to the control board, battery pack, motors, and speaker. Be gentle – these animatronics aren’t built for extensive tinkering!

Inside the skull, you’ll find a DC motor controlling the mouth movement (yellow positive, white ground wires) and LEDs illuminating the eyes (red positive, black ground wires).

Photo of the skeleton's internal components after removing the skull backplate, highlighting the mouth motor and eye LEDs

Under the chest plate you’ll find the speaker (two blue wires) and another DC motor powering the body/arm movements (red positive, black ground wires). All of these wires converge on a small control board.

Photo of the skeleton's internal components after removing the chest backplate, highlighting the control board, battery pack, body motor, and speaker

It’s crucial to document everything as you go. I took numerous photos and created a wiring diagram to ensure I could reassemble everything correctly (or at least understand where things went if something went wrong!).

Close-up documenting the Home Depot 3ft Halloween Classics Animated LED Dancing Skeleton control board wiring

The original manufacturer uses transistors and capacitors to compensate for the fluctuating battery voltage – each AA cell delivers roughly 1.4V to 1.66V, so the three-cell pack lands somewhere around 4.5V to 5V. This is a good reminder that relying solely on the battery pack’s power output isn’t ideal; we’ll address this later.

Wiring Up the Radxa Zero 3W: The Heart of the Operation

The plan was to intercept the signals going to each component – mouth motor, eye LEDs, body motor, and speaker – and control them via the Radxa Zero 3W’s GPIO pins. This required carefully unsoldering these wires from the original control board.

Once unsoldered, I connected each wire to a set of jumper pins, allowing me to easily breadboard and test connections before committing to permanent soldering. This also provides flexibility for future modifications.

Note: I included a 220 Ohm resistor inline to help prevent the eye LEDs from burning out. It’s not required, but it’s recommended to avoid damaging the LEDs during tinkering.

Close-up photo showing the wires from the skeleton's components connected to jumper cables.

Here’s the GPIO pinout I utilized on the Radxa Zero 3W (gpiochip3):

  • PIN_7 (gpiochip3 20) – Mouth motor open/close
  • PIN_11 (gpiochip3 1) – Eye LEDs illumination
  • PIN_15 (gpiochip3 8) – Body motor for dancing movement

The Radxa Zero 3W’s official documentation outlines the 40-pin GPIO interface. Since we’re using pins 12, 40 and 35 for our mono amp (more on that later), this leaves a good selection of readily available pins on gpiochip3 to control relays.

  • Pin 7: GPIO3_C4 (also PWM14_M0)
  • Pin 11: GPIO3_A1
  • Pin 15: GPIO3_B0
  • Pin 16: GPIO3_B1 (also PWM8_M0)

Note: PWM (Pulse-Width Modulation) can be used on Pin 7 and Pin 16 by enabling the correct device tree overlay using rsetup and/or u-boot. This allows for finer control over LEDs (dimming) and DC motor speeds, but wasn’t necessary for this initial implementation.

Powering Skelly: The Radxa Zero 3W to the Rescue

Initially, I considered powering the components directly from the battery pack. However, as mentioned earlier, the inconsistent voltage proved problematic. The solution? Leverage the Radxa Zero 3W’s 5V GPIO power rails! This provides a stable and reliable power source for all components.

To manage the current requirements of the motors and LEDs, I incorporated relays inline with each component’s wiring. Relays act as electrically controlled switches, allowing the Radxa Zero 3W to control the flow of power from its 5V output to the skeleton’s components.

Diagram showing how the Radxa Zero 3W controls the skeleton's components via relays

Relay Implementation: The Switching Mechanism

Each relay requires a control pin on the GPIO, which when activated, allows power to flow through it. The wiring is as follows:

  1. Connect the Radxa Zero 3W’s 5V output to each relay’s coil power input (VCC) and to its common (COM) switched terminal.
  2. Ground each relay and component to the Radxa Zero 3W’s ground pins.
  3. Connect the control pin on the GPIO to the relay’s control input.
  4. The output power of each relay connects to the corresponding component’s jumper (mouth motor, body motor, eye LEDs).

When the GPIO pin is set HIGH (active state), the relay closes, allowing power to flow from the Radxa Zero 3W to the component. When the pin is LOW, the relay opens, cutting off the power supply. This effectively gives us programmatic control over each animatronic function.
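
Here’s a minimal sketch of exercising one of these relays from Python, using the libgpiod v1 bindings and the gpiochip3 line offsets from the pinout above; the jaw-chatter loop is just for testing:

import time

import gpiod  # libgpiod v1 Python bindings (python3-libgpiod)

# Line offsets on gpiochip3, from the pinout above
MOUTH_LINE = 20  # PIN_7  - mouth motor relay
EYES_LINE = 1    # PIN_11 - eye LED relay
BODY_LINE = 8    # PIN_15 - body motor relay

chip = gpiod.Chip("gpiochip3")
mouth = chip.get_line(MOUTH_LINE)
mouth.request(consumer="skelly", type=gpiod.LINE_REQ_DIR_OUT)

# Chatter the jaw: close the relay for half a second, then release
for _ in range(3):
    mouth.set_value(1)  # relay closes, motor powered
    time.sleep(0.5)
    mouth.set_value(0)  # relay opens, motor off
    time.sleep(0.5)

mouth.release()
chip.close()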

Breadboarding & Testing: Bringing it All Together

I started by breadboarding everything – connecting the Radxa Zero 3W, relays, jumpers, and a temporary power source to verify functionality. This is where patience is key! Double-check all connections before applying power. A multimeter is your best friend during this phase. Once I confirmed that each component responded correctly to the GPIO signals, I removed the breadboard and connected everything directly to the jumper wires for a more permanent connection.

Mounting & Final Assembly: Skelly Gets an Upgrade

With the wiring complete, it was time to mount the Radxa Zero 3W inside the skeleton’s chest cavity. I repurposed the original control board’s mounting point and used some 2.5mm standoffs to secure the Radxa Zero 3W in place. This ensured a snug fit without interfering with any existing components.

 Photo showing the Radxa Zero 3W and relays mounted inside the skeleton’s chest cavity

I then stuffed all the “guts” back into the body, wrote a simple web control page using Flask to test the functionality remotely, and ran through final testing. Success! Skelly was now responding to commands from my computer, ready for the next phase: AI integration.
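
For reference, here’s a minimal sketch of what such a Flask control page could look like, reusing the gpiod relay setup from earlier; the route scheme and port are illustrative rather than exactly what I ran:

import gpiod
from flask import Flask

app = Flask(__name__)

# Reserve each relay control line once at startup
chip = gpiod.Chip("gpiochip3")
lines = {}
for name, offset in {"mouth": 20, "eyes": 1, "body": 8}.items():
    line = chip.get_line(offset)
    line.request(consumer="skelly-web", type=gpiod.LINE_REQ_DIR_OUT)
    lines[name] = line

@app.route("/<part>/<state>")
def control(part, state):
    """Toggle a component, e.g. GET /eyes/on or GET /mouth/off."""
    if part not in lines or state not in ("on", "off"):
        return "unknown component or state", 404
    lines[part].set_value(1 if state == "on" else 0)
    return f"{part} {state}"

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)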

The Next Phase: Local Vision Models & Costume Recognition

With Skelly reasonably buttoned up and his basic movements working, I’m moving on to the most exciting part of the project – leveraging a connected camera and locally hosted vision models to determine trick-or-treaters’ costumes. This will involve using tools like LM Studio on an AMD AI workstation to run models like Gemma 3 for vision-to-text. I plan to document this process in a future blog post, so stay tuned! But up next: a deep dive on leveraging device tree overlays and I2S via GPIO to power the existing hardware speaker with a MAX98357A mono amp.

Resources & Further Exploration

This project has been a fantastic learning experience, combining hardware tinkering with software development and AI integration. The Radxa Zero 3W proved to be an excellent platform for this application, offering the power and flexibility needed to bring Skelly to life. I hope this blog post inspires you to create your own interactive Halloween experiences!