I've been working on the beginnings of this project. I've outlined the components I'm working on in some docs, and I figured I'd go through them here.
There are the following components to this project:
Speech to Text
This is the component which converts speech to text commands. I am choosing to use the CMU Sphinx system, which is the only fully open-source speech recognition engine (the rest are sponsored by corporations, and some require licensing). I'm choosing this because all of the others require API access, which 1) could become expensive, and 2) opens this up to surveillance. This does make it harder - the other systems are easier to use and set up. But this is one of the most important components in terms of security and privacy - I don't want any compromises here.
- installation of sphinxbase
- use of pocketsphinx & pocketsphinx-python
It's a pretty complex process - it looks like it requires some model training, etc. I also ran into trouble with my mic setup on the Raspberry Pi - the small mic I ordered just didn't do the trick, and I needed to use my headset - which is fine for now, but I'm going to have to solve the mic problem at some point. I decided to hold off on this part, work more on the inner components, finish the basics I can run from a console, and then build out the audio components (speech-to-text and text-to-speech).
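When I do get back to the audio side, the listening loop could look something like this. This is just a sketch using the `LiveSpeech` iterator from the pocketsphinx Python package; `handle_command` is a placeholder for the parser described below, and the loop assumes a working mic and the default models.

```python
def handle_command(text):
    """Placeholder: hand recognized text off to the command parser."""
    print("heard:", text)

def is_useful(hypothesis):
    """Drop empty or whitespace-only hypotheses before parsing."""
    return bool(hypothesis and hypothesis.strip())

def listen_forever():
    """Transcribe continuously from the default microphone.
    Needs sphinxbase, pocketsphinx, and the default acoustic/language models."""
    from pocketsphinx import LiveSpeech
    for phrase in LiveSpeech():
        text = str(phrase)  # each phrase yields a text hypothesis
        if is_useful(text):
            handle_command(text)
```

Nothing here is tied to the mic hardware except `listen_forever` itself, so the rest of the pipeline can be developed and tested from the console in the meantime.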
Command Parser
This component will parse the commands from the text generated by CMU Sphinx. To begin with, this will basically be a pass-through. But eventually, it will take a complex sentence and figure out what to do with it.
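As a pass-through, the parser can be almost trivial - first word is the command, the rest are modifiers. The dictionary shape here is my own sketch, not a settled interface:

```python
def parse(text):
    """Pass-through parser: treat the first word as the command and the
    remaining words as modifiers. Later this should handle full sentences."""
    words = text.lower().split()
    if not words:
        return None  # nothing recognizable was said
    return {"command": words[0], "modifiers": words[1:]}
```

So "weather tomorrow" becomes `{"command": "weather", "modifiers": ["tomorrow"]}`, which is enough for the engine to look the command up in the registry.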
Registry
This is a database registry of:
- possible commands and modifiers
- resource details (just to start):
- Google Calendar
- Some weather resource (NOAA?)
- Texts for output
At first, I was thinking I need to choose something light and free, and I thought about SQLite. But I realized, in thinking about what I really needed, that a NoSQL solution was going to be a better bet. I'm pretty familiar with MongoDB, but it looks rough to get working on Raspbian, so I looked a bit more, and found that CouchDB should work pretty well. So I'm going with that.
I have four databases so far (CouchDB's rough equivalent of tables): commands, resources, outputs, and user. The user database will hold the necessary data and secrets needed for this to work with services like Google.
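Setting that up from Python might look like the sketch below, using the CouchDB-Python (`couchdb`) package. The sample command document is illustrative - I haven't settled on a schema - and `init_registry` assumes a CouchDB instance running on the default port:

```python
# An illustrative command document - the fields are my current guess
# at what the engine will need, not a final schema.
SAMPLE_COMMAND = {
    "type": "command",
    "name": "weather",
    "resource": "noaa",         # which resource this command consults
    "output": "weather_report", # which output template to use
}

def init_registry(url="http://localhost:5984/"):
    """Create the four registry databases and store a sample command.
    Requires a running CouchDB instance and the couchdb package."""
    import couchdb
    server = couchdb.Server(url)
    for name in ("commands", "resources", "outputs", "user"):
        if name not in server:
            server.create(name)
    doc_id, _rev = server["commands"].save(dict(SAMPLE_COMMAND))
    return doc_id
```

Since CouchDB documents are just JSON, adding a field later doesn't require any migration, which is part of why NoSQL felt like the better bet here.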
Each command set needs a specific resource to refer to in order to create the right output. I want to make these modules akin to Alexa "Skills" and make it easy to create new ones. It would be best if all of the data necessary to connect to a new module could be configured inside the registry, so that no new code would need to be written. I don't know how likely that is to pull off.
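The idea would be something like this: a resource document that carries its own endpoint and tells a generic fetcher where to find the useful value in the JSON response. The NOAA URL and field path here are assumptions for illustration, not a tested integration:

```python
import json
import urllib.request

# A resource described entirely by registry data - if this works, adding a
# new JSON API means adding a document like this, not writing new code.
WEATHER_RESOURCE = {
    "type": "resource",
    "name": "noaa",
    "url": "https://api.weather.gov/points/{lat},{lon}",  # illustrative
    "path": ["properties", "forecast"],  # keys to walk in the response
}

def walk(data, path):
    """Follow a list of keys down into nested JSON data."""
    for key in path:
        data = data[key]
    return data

def fetch(resource, **params):
    """Generic fetcher: build the URL from the registry entry and walk
    the configured path into the JSON response."""
    with urllib.request.urlopen(resource["url"].format(**params)) as resp:
        return walk(json.load(resp), resource["path"])
```

Anything that doesn't fit the "one URL, one JSON path" pattern (OAuth for Google Calendar, say) will probably still need real code, which is where my doubt comes in.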
Output Engine
This is the component that waits for commands to come from the parser, reads the registry, decides what output to create based on the commands, and could possibly create output based on incoming data from a different resource.
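At its core that's a dispatch loop. Here's a minimal sketch with an in-memory dict standing in for the registry; the handlers and command names are placeholders:

```python
import datetime

def time_handler(modifiers):
    """Example built-in: report the current time."""
    return datetime.datetime.now().strftime("It is %H:%M")

def repeat_handler(modifiers):
    """Example built-in: echo the modifiers back."""
    return " ".join(modifiers) if modifiers else "nothing to repeat"

# Stand-in for the CouchDB registry: command name -> handler.
REGISTRY = {"time": time_handler, "repeat": repeat_handler}

def run(parsed):
    """Take a parsed command ({"command": ..., "modifiers": [...]}),
    look it up in the registry, and return the output text."""
    if parsed is None or parsed["command"] not in REGISTRY:
        return "Sorry, I don't know that command."
    return REGISTRY[parsed["command"]](parsed["modifiers"])
```

The real version would pull the handler wiring from the commands database instead of a hardcoded dict, and the returned text would go to the text-to-speech component.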
Text to Speech
There are a number of text-to-speech options that are truly open source and not sponsored by a company. I'm going to start with espeak-ng, which has a Python wrapper. I might also try flite, which is also from CMU.
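Even without the wrapper, espeak-ng can be driven from Python through its command-line interface, which is a reasonable fallback to start with. The `-v` (voice) and `-s` (speed in words per minute) flags are standard espeak-ng options; the defaults here are just my guesses:

```python
import subprocess

def espeak_args(text, voice="en", wpm=160):
    """Build the espeak-ng command line: -v selects the voice,
    -s sets the speaking rate in words per minute."""
    return ["espeak-ng", "-v", voice, "-s", str(wpm), text]

def say(text):
    """Speak the text aloud; requires espeak-ng to be installed."""
    subprocess.run(espeak_args(text), check=True)
```

Swapping in flite later would mostly mean changing `espeak_args`, which keeps the engine code above independent of the speech backend.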
I'm excited to be doing this project, and I'm looking forward to seeing how it evolves. The very beginnings are already up on GitHub, so if you want to track its development, that's where to look (and I'll be blogging about it here as well).