Voice Control of IoT Devices via Mobile Application
Voice control of IoT devices in a mobile app is not just "add SiriKit" or "integrate Google Assistant". It is a separate logic layer: speech recognition, intent extraction, mapping to device commands, and feedback. Each step breaks in its own way.
Two Fundamentally Different Approaches
Built-in voice assistants (Siri Shortcuts, Google Assistant Actions) work through the cloud and require explicit permission. Siri Shortcuts on iOS are available via INPlayMediaIntent and INSendMessageIntent, but arbitrary IoT commands need AppIntent (iOS 16+), the Swift framework for describing intents. Example: "Hey Siri, turn off kitchen light" → Siri calls TurnOffLightIntent in your app, which sends an MQTT command. Latency is 2–4 seconds through Apple's cloud, with no guarantees when offline.
Local recognition is a different story. On iOS this is SFSpeechRecognizer with SFSpeechAudioBufferRecognitionRequest. Since iOS 13 it supports an on-device mode (requiresOnDeviceRecognition = true) that sends no audio to the cloud. On Android there is the SpeechRecognizer API (which goes through Google's cloud), or Vosk / whisper.cpp for fully offline recognition.
For IoT apps where operation on the local network without internet matters, the choice is clear: local recognition plus offline NLU.
NLU: From Text to Device Command
You recognized "turn on kitchen light and raise temperature to twenty two" — now extract:
- intents: turn_on, set_temperature
- entities: device_type=light, location=kitchen; device_type=thermostat, value=22
For simple cases a rule-based approach suffices: a verb-to-intent dictionary plus device/room dictionaries built from the user's database. Build a regex or a simple intent matcher against the existing device list.
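A minimal sketch of such a matcher in Python — the dictionaries and phrase patterns here are illustrative assumptions; a real app would populate them from the user's device database:

```python
import re

# Hypothetical dictionaries; in a real app these come from the user's DB.
INTENT_VERBS = {
    "turn on": "turn_on",
    "turn off": "turn_off",
    "set temperature to": "set_temperature",
    "raise temperature to": "set_temperature",
}
KNOWN_DEVICES = {"light": "light", "thermostat": "thermostat"}
KNOWN_ROOMS = {"kitchen", "bedroom", "living room"}

def match_intent(text):
    """Map a recognized phrase to an intent + entities via dictionary lookup.

    Numbers are matched as digits only; spelled-out numbers ("twenty two")
    would need an extra normalization pass (e.g. Duckling).
    """
    text = text.lower().strip()
    for phrase, intent in INTENT_VERBS.items():
        if phrase in text:
            entities = {}
            for word, device in KNOWN_DEVICES.items():
                if word in text:
                    entities["device_type"] = device
            for room in KNOWN_ROOMS:
                if room in text:
                    entities["location"] = room
            number = re.search(r"\d+", text)
            if number:
                entities["value"] = int(number.group())
            return {"intent": intent, "entities": entities}
    return None  # nothing matched — trigger the clarification flow
```

The upside of this approach: it cannot hallucinate a device that does not exist, because everything is matched against the user's own device list.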
For complex scenarios there is Rasa NLU (self-hosted), or Duckling for numeric values. In Flutter, integrate via an HTTP request to a local server on the home network, or via dart:ffi for an embedded model.
A real example: a smart-apartment project, 35 devices, Russian language. We trained a simple fastText model on ~500 command examples, converted it to .tflite, and ran it via tflite_flutter. Accuracy on household commands was 94%. The misses were on compound commands (two actions in one phrase), solved by a preprocessing step that splits the phrase on the conjunctions "and", "then", "later".
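That preprocessing step can be sketched in a few lines. The original project split on Russian conjunctions; English equivalents are shown here for illustration:

```python
import re

# Conjunctions treated as command separators (the project used Russian
# "и", "затем", "потом"; English equivalents shown here).
SPLIT_PATTERN = re.compile(r"\s*(?:\band\b|\bthen\b|\blater\b|,)\s*")

def split_compound(phrase):
    """Split a compound voice command into single-action sub-commands,
    each of which is then fed to the NLU model separately."""
    parts = SPLIT_PATTERN.split(phrase.lower())
    return [p for p in parts if p]
```

The word boundaries (`\b`) matter: without them, "island" or "brandy" would be split on the embedded "and".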
Feedback and Edge Cases
Push-to-talk vs always-on. Always-on listening on mobile is a battery killer. We recommend a push-to-talk button in the app plus an optional wake word via the Porcupine SDK (Picovoice). Porcupine runs locally and consumes under 5% CPU when idle.
What if the device is not recognized? Don't stay silent. Return a voice response via AVSpeechSynthesizer (iOS) / TextToSpeech (Android): state what was understood and ask the user to clarify. The user may not be looking at the screen, so audio feedback is essential.
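A sketch of what such a fallback might say — the function name and message wording are hypothetical; the resulting string is what you would hand to the platform TTS engine:

```python
def build_clarification(understood_intent, unknown_device, known_devices):
    """Compose a spoken fallback when the target device can't be resolved.

    The returned string goes to the platform TTS engine
    (AVSpeechSynthesizer on iOS / TextToSpeech on Android).
    """
    options = ", ".join(sorted(known_devices))
    return (
        f"I understood the command '{understood_intent}', "
        f"but I don't know a device called '{unknown_device}'. "
        f"Available devices: {options}. Which one did you mean?"
    )
```

Listing the available devices out loud doubles as discoverability: the user learns the exact names the app will recognize.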
In Flutter, use flutter_tts for synthesis and speech_to_text as a unified API over the platform engines. Important: on Android 11+, SpeechRecognizer requires the RECORD_AUDIO permission with an explicit rationale shown to the user (the result is handled in onRequestPermissionsResult). Without a clear rationale, Google Play flags the app as a policy violation.
MQTT Integration
Voice command → NLU → device command → MQTT topic publish. Latency from button press to device response: on-device recognition ~300–800 ms, NLU ~50 ms, MQTT publish under 50 ms on a local broker. The total feels instant.
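The last hop — mapping an NLU result to an MQTT message — can be a pure function. The topic scheme below (`home/<room>/<device>/set`) is an assumption for illustration; a real app would follow its broker's conventions:

```python
import json

def to_mqtt_message(intent, entities, home_id="home"):
    """Map an NLU result to an MQTT topic + JSON payload.

    Topic scheme home/<room>/<device>/set is illustrative only.
    """
    location = entities.get("location", "all")
    device = entities.get("device_type", "unknown")
    topic = f"{home_id}/{location}/{device}/set"
    payload = {"command": intent}
    if "value" in entities:
        payload["value"] = entities["value"]
    return topic, json.dumps(payload)

# Publishing with a client such as paho-mqtt is then roughly:
#   client.publish(topic, payload)  # a few ms on a local broker
```

Keeping the mapping pure (no network calls) makes it trivial to unit-test the entire voice → command pipeline without a broker.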
Cloud recognition adds another 1.5–3 seconds. For Russian, Google's cloud Speech-to-Text works well; Apple's Speech framework does worse on IoT-specific terms like "dimmer", "receiver", "relay".
Timeline
Push-to-talk with cloud recognition and simple command mapping takes 2–3 weeks. Offline recognition + NLU + wake word + TTS feedback takes 6–10 weeks. Pricing depends on the number of languages, platforms, and offline requirements.