Dev Log 2 - A complete system

I already have a basic landing page, Supabase-based auth login, and a simple generation form set up. The exciting part is designing the backend. The way I'm thinking it needs to work is this: it all starts with a single source of truth. It's an AI-generated product, but that doesn't mean we accept slop. In my experience, the thing AI is really good at is controlled, clearly defined, short contexts.


So most of this engine that spits out the audio will be an orchestration system that goes through multiple steps of synthesis, from text to audio and finally stitching.

This is the preliminary "no wrong ideas" version of the engine. As I started building it out, I was thinking in terms of an interface where a non-technical person can come into a portal, edit a text document, and the engine I build is then able to map it out into these different parts.

I built out a basic Django admin setup today with simple serializers, viewsets, and basic model definitions. I also added basic JWT-based authentication (login/registration).

However, the more I think about this idea of a document-based view for someone to edit, the less it makes sense. Why shouldn't it be a simple table where we can add a list of locations? When we set up a location, we give it lat/long coordinates, and the person setting it up can just edit the block of background text for it. The text and audio snippets would then simply be mapped to this.

Relationally, we can also have a table configured with an ordered list of locations, mapped to the keywords that are requested from the frontend. So the idea of short/medium/long can all be pulled from there, with future extensions as required. This might also mean it's easier to split the text and audio snippet tables by short/medium/long. I think the key is that we will start with these open descriptors of length and maybe move from there to something more sophisticated. But right now this is how it is appearing to me as I think through it.
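One cheap way to keep the length descriptors open-ended while still keying the snippet tables on a fixed set of values is a small alias map. A minimal sketch (all names hypothetical):

```python
# Hypothetical mapping from whatever keyword the frontend sends
# to the canonical length descriptor used as the snippet-table key.
LENGTH_ALIASES = {
    "short": "short", "quick": "short",
    "medium": "medium", "standard": "medium",
    "long": "long", "full": "long",
}

def canonical_length(keyword: str) -> str:
    # Unknown keywords fall back to "medium" rather than erroring.
    return LENGTH_ALIASES.get(keyword.lower(), "medium")
```

New descriptors then become one-line additions to the map instead of schema changes.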

A main table of locations will hold the long source-of-truth text. Another table, tours, can be configured for now as a simple ordering of locations from the locations table. The frontend sends a request saying "I want {tour-name} at length short." We then go through the tours table, find the ordered location ids, look each one up in the short-audio snippet table in that order, stitch them together, and send the result back to the user. At a rudimentary level, this is what we're striving for in phase 1 of this product.
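The retrieval flow above can be sketched as a single function. The dict-backed "tables" and names here are assumptions, not the real schema:

```python
def build_tour_audio(tour_name: str, length: str,
                     tours: dict[str, list[int]],
                     snippets: dict[tuple[int, str], str]) -> list[str]:
    """Resolve a request like '{tour-name}, length=short' into an
    ordered list of audio snippet references, ready for stitching."""
    location_ids = tours[tour_name]  # ordered ids from the tours table
    return [snippets[(loc_id, length)] for loc_id in location_ids]

# Toy data standing in for the tours and short-audio snippet tables.
tours = {"riverside": [3, 1, 2]}
snippets = {
    (1, "short"): "audio/1-short.mp3",
    (2, "short"): "audio/2-short.mp3",
    (3, "short"): "audio/3-short.mp3",
}
playlist = build_tour_audio("riverside", "short", tours, snippets)
# playlist preserves the tour's configured order: 3, then 1, then 2
```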

The engine itself should be able to edit the locations and tours tables, and based on those, synthesise the text snippets and subsequently the audio. We must also be able to track when changes in the locations table, particularly the text or name, have not yet been propagated to the text and audio tables, and perform that update. I'm not sure how RAG and vectorization play into all of this, but that will remain one of the things to figure out as we go.
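One way to track that propagation, assuming we're free to add a column: store a hash of the source text alongside each generated snippet, and compare on read or on a periodic sweep. A minimal sketch with hypothetical names:

```python
import hashlib

def content_hash(text: str) -> str:
    # Fingerprint of the location's source text at generation time.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_stale(location_text: str, snippet_source_hash: str) -> bool:
    # True when the snippet was generated from an older source text
    # and needs its text (and then audio) re-synthesised.
    return content_hash(location_text) != snippet_source_hash

source = "Built in 1889 for the World's Fair..."
snippet_hash = content_hash(source)  # stored next to the generated snippet
edited = source + " Renovated in 2024."  # an editor changes the location
```

An `updated_at` timestamp comparison would work too; the hash just survives round-trips and no-op saves.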

The main questions that come up for me are these:

  • How do we bring more customization into the locations?
  • How do we use location data to give transitional, on-the-fly instructions during audio-snippet gathering, so it feels dynamic when locations are customized?
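On the second question, since each location already carries lat/long, one naive starting point is to compute the distance between consecutive stops and pick a canned transition line by distance band. A sketch under those assumptions (names and thresholds hypothetical):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    # Great-circle distance between two lat/long points, in km.
    lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def transition_line(distance_km: float) -> str:
    # Hypothetical rule: choose a transition by walking distance,
    # inserted between two snippets during stitching.
    if distance_km < 0.3:
        return "Just ahead is our next stop."
    if distance_km < 1.0:
        return "A short walk brings us to the next location."
    return "We'll travel a little further for the next stop."
```

Later these canned lines could themselves be synthesised per pair of locations, but distance-banded templates are enough to make a customized tour order feel intentional.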