I've had some success with this using rigging and weights, in a manner similar to what Nate described. It's a slightly curious setup, but it looks something like this:
- Starting with an "oh" phoneme shape, create a mesh that maps the lips of the mouth.
- Create a copy of this mesh for every other mouth shape (duplicate the mesh and change the image path).
- Reposition the vertices on each mesh to be in the appropriate positions for the mouth pose. The idea is to have all of your mouth shapes share a common set of vertices.
- Create a set of bones to control parts of the mouth.
- For each mouth shape, position the bones in the appropriate locations, and bind them to the mesh. Be sure to position the bones before binding them to the mesh.
- After you've bound the bones to all the mouth shapes, restore the bones to their original position for the "oh" phoneme (or whichever phoneme you want for your setup pose).
This will give you a setup where you can move the mouth bones around, and each mouth shape will distort to match the bone position as closely as possible. This way you don't have to create specific animations to transition from one mouth shape to another. You just move the bones, and change the visible attachment at the correct point in time. So for example, to transition from "oh" to "aah", stretch the edges of the mouth wide, and narrow the center gap for the lips.
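To make the "move the bones, then swap the attachment" idea concrete, here is a minimal Python sketch. It is not the editor's or runtime's API. The names `BoneKey`, `bone_position`, `visible_attachment`, the attachment names, and all the numbers are made up for illustration, and it assumes plain linear interpolation between bone keys.

```python
from dataclasses import dataclass


@dataclass
class BoneKey:
    """A hypothetical translate key for one mouth bone."""
    time: float
    x: float
    y: float


def lerp(a, b, t):
    return a + (b - a) * t


def bone_position(keys, time):
    """Linearly interpolate a bone's position between keys sorted by time."""
    if time <= keys[0].time:
        return (keys[0].x, keys[0].y)
    for prev, nxt in zip(keys, keys[1:]):
        if time <= nxt.time:
            t = (time - prev.time) / (nxt.time - prev.time)
            return (lerp(prev.x, nxt.x, t), lerp(prev.y, nxt.y, t))
    return (keys[-1].x, keys[-1].y)


def visible_attachment(switches, time):
    """Return the last attachment switched in at or before `time`."""
    current = switches[0][1]
    for switch_time, name in switches:
        if switch_time <= time:
            current = name
    return current


# Assumed example: the mouth-corner bone slides outward over 0.4 s,
# and the visible mesh is swapped from "oh" to "aah" halfway through.
corner_keys = [BoneKey(0.0, 10.0, 0.0), BoneKey(0.4, 18.0, 0.0)]
attachment_switches = [(0.0, "mouth-oh"), (0.2, "mouth-aah")]

for t in (0.0, 0.1, 0.2, 0.3, 0.4):
    print(t, bone_position(corner_keys, t), visible_attachment(attachment_switches, t))
```

Because every mouth mesh is bound to the same bones, the swap at 0.2 s lands on a shape that is already being pulled toward the same bone positions, which is what hides the switch.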
The key is that the mouth bones must be in the proper position for the mouth shape when they are bound to the mesh. You can use weighting to smooth out how each shape distorts as the bone positions move around. Some shapes don't always distort well, but with the right weighting, I found this setup surprisingly effective.
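Here is a rough Python sketch of why the bind position matters and what the weights do. It assumes a simplified, translation-only version of weighted skinning (no bone rotation or scale), and the `bind` and `skin` helpers, bone names, and coordinates are all hypothetical rather than anything from the actual tool.

```python
def bind(vertex, bone_positions, weights):
    """Record the vertex as an offset from each bone at bind time.

    Returns a list of ((offset_x, offset_y), weight) pairs, one per bone.
    """
    return [
        ((vertex[0] - bx, vertex[1] - by), w)
        for (bx, by), w in zip(bone_positions, weights)
    ]


def skin(binding, bone_positions):
    """Weighted average of where each bone would carry the vertex."""
    x = sum(w * (bx + ox) for ((ox, _), w), (bx, _) in zip(binding, bone_positions))
    y = sum(w * (by + oy) for ((_, oy), w), (_, by) in zip(binding, bone_positions))
    return (x, y)


# Assumed "aah" pose: bones sit where this shape wants them at bind time.
aah_bones = [(10.0, 0.0), (0.0, 5.0)]   # mouth corner, upper lip
vertex = (6.0, 3.0)                     # one lip vertex of the "aah" mesh

even = bind(vertex, aah_bones, [0.5, 0.5])
corner_heavy = bind(vertex, aah_bones, [0.75, 0.25])

# With the bones back in the bind pose, the vertex is undistorted.
print(skin(even, aah_bones))            # (6.0, 3.0)

# Pull the corner bone inward, toward an "oh"-like pose.
oh_bones = [(6.0, 0.0), (0.0, 5.0)]
print(skin(even, oh_bones))             # (4.0, 3.0)
print(skin(corner_heavy, oh_bones))     # (3.0, 3.0)
```

The vertex only sits in its drawn position when the bones are where they were at bind time, which is why each shape is bound with the bones already in that shape's pose. Changing the weights changes how far the vertex follows each bone, which is the knob for smoothing out a shape that distorts badly.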