Building a recommendation engine from scratch:
Step 1: Finalizing the model type.
- Recommendation systems come in many types, catering to different users. Common types: collaborative filtering, content-based filtering, and hybrids of the two. For podcast recommendation, I believe we should use a hybrid, so that we can recommend what people with similar interests like and also items matching the user's own interests (e.g. the same author).
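Conceptually, a hybrid recommender can be as simple as a weighted blend of the two signals. A minimal sketch, where the weight `alpha` and the example scores are illustrative assumptions:

```python
# Minimal sketch of hybrid scoring: blend a content-based score with a
# collaborative-filtering score. The weight `alpha` is a tunable assumption.
def hybrid_score(content_score, collab_score, alpha=0.5):
    """Weighted blend of the two recommendation signals (both in [0, 1])."""
    return alpha * content_score + (1 - alpha) * collab_score

# Example: a podcast that matches the user's own interests strongly (0.9)
# but is less popular among similar users (0.4).
print(hybrid_score(0.9, 0.4, alpha=0.6))
```

In practice `alpha` would be tuned on held-out data rather than fixed by hand.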
Step 2: Data Collection and Filtering.
- The most common podcast apps are Spotify, Google Podcasts, Apple Podcasts, etc. Spotify provides an API for data collection, but for individual users' listening data we can use datasets from Kaggle.
- Since we are going with the hybrid version, we need to collect user data, basically user interests: authors, podcast genres, number of views, number of likes, etc. We can also collect how long the user listened to each podcast, ignoring a podcast if they listened for less than a minute.
- We will finally have a podcast dataset containing audio, author, genre/categories, description, audio length, ID, average rating, etc. We will also have a client dataset (unstructured) containing each client's podcast interests, referenced by podcast ID: their detailed listening history (with listen duration) and the people they follow.
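As a sketch of what these two datasets might look like, here is a hypothetical layout in pandas. All column names and values are illustrative assumptions, including the one-minute filter suggested above:

```python
import pandas as pd

# Hypothetical schema for the podcast dataset (columns are assumptions).
podcasts = pd.DataFrame({
    "podcast_id": [101, 102],
    "author": ["Alice", "Bob"],
    "genre": ["tech", "history"],
    "audio_length_sec": [1800, 2400],
    "avg_rating": [4.5, 3.9],
})

# Hypothetical client listening history, referencing podcasts by ID.
history = pd.DataFrame({
    "user_id": [1, 1, 2],
    "podcast_id": [101, 102, 101],
    "listened_sec": [1750, 40, 900],
})

# Drop interactions shorter than one minute, as suggested above.
history = history[history["listened_sec"] >= 60]
print(history)
```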
Step 3: Usefulness and Similarity Index.
- Usefulness (content-based filtering): a utility function that estimates the usefulness of a podcast (item) for a user; it depends on the content of the current podcast and the content of previously liked podcasts. "Content" refers to the parameters collected in Step 2.
- Similarity (collaborative filtering): a similarity index measures how alike two users are based on their content preferences and pools them into groups of similar users (k-nearest neighbors), using cosine similarity or Euclidean distance.
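The similarity index can be sketched with plain NumPy. The per-genre listening counts below are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two user-preference vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical per-genre listening counts for two users
# (columns: tech, comedy, history).
user_a = [5, 0, 3]
user_b = [4, 1, 2]
print(round(cosine_similarity(user_a, user_b), 3))
```

Identical preference vectors score 1.0 and orthogonal ones score 0.0, which is what makes cosine similarity convenient for pooling users into neighborhoods.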
Step 4: Modelling.
- Basic approach: use NLP on podcast transcripts to extract categories, genre, talking speed, etc., and use machine learning to generate recommendations based on these features, the user's past history, and similar users' interests.
- Build the model with Keras on a TensorFlow backend.
- We will use TensorFlow to feed new streaming data into the model even after it has been trained (e.g. via tf.placeholder in TensorFlow 1.x, or the tf.data API in newer versions).
- Here, we can analyse the podcast audio via speech-to-text conversion and then apply NLP feature extraction to the transcript to obtain the categories and description of a podcast when they are not given.
Other Important Points Regarding Model:
- Label encoding: so that categorical values share common references across different dataframes.
- Build a description for each podcast: use the podcast's transcript and the description given by the author, and select the top 30 important words (we can keep a list that filters out commonly used words); for a more generalized model, use Step 3.
- Use Gensim's Word2Vec or GloVe embeddings to extract the important words.
- Split the data into training (75%), testing (15%), and optionally cross-validation (10%).
- A CNN-LSTM combination works nicely in NLP, especially with GloVe embeddings: feature extraction for the description and the podcast.
- Keyword extraction using cosine similarity.
- LightFM for the recommender model.
- SVD (Singular Value Decomposition) model with cosine similarity for the hybrid recommender.
- Generate recommendations and predict user ratings (collaborative filtering on users' ratings).
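The label-encoding point above can be sketched without any library; this is a minimal stand-in for what e.g. scikit-learn's LabelEncoder does:

```python
# Minimal label-encoding sketch: map each category to an integer so the
# same genre gets the same code in every dataframe. The genre values are
# illustrative assumptions.
genres = ["tech", "history", "comedy", "tech", "comedy"]
mapping = {g: i for i, g in enumerate(sorted(set(genres)))}
encoded = [mapping[g] for g in genres]
print(mapping)   # {'comedy': 0, 'history': 1, 'tech': 2}
print(encoded)   # [2, 1, 0, 2, 0]
```

The same `mapping` dict must be reused for every dataframe so the codes stay consistent.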
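The "top important words" idea can be sketched as a stop-word-filtered frequency count. The stop-word list, the sample transcript, and the cutoff `k` are illustrative assumptions; in practice TF-IDF or the Word2Vec/GloVe embeddings mentioned above would do a better job:

```python
from collections import Counter

# Tiny illustrative stop-word list; a real one (e.g. NLTK's) is far larger.
STOP_WORDS = {"the", "a", "an", "and", "is", "to", "of", "in", "we", "it"}

def top_keywords(transcript, k=30):
    """Return the k most frequent non-stop-words in a transcript."""
    words = [w for w in transcript.lower().split() if w.isalpha()]
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(k)]

transcript = "the history of rome and the fall of the roman empire in history"
print(top_keywords(transcript, k=3))
```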
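The 75/15/10 split above can be sketched with a shuffled index; the dataset size and random seed here are arbitrary:

```python
import numpy as np

# Shuffle row indices, then cut at 75% and 90% to get
# train / test / cross-validation partitions.
rng = np.random.default_rng(42)
n = 1000
idx = rng.permutation(n)
train, test, val = np.split(idx, [int(0.75 * n), int(0.90 * n)])
print(len(train), len(test), len(val))  # 750 150 100
```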
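A CNN-LSTM feature extractor for the podcast descriptions might look like the following Keras sketch. The vocabulary size, sequence length, and layer sizes are illustrative assumptions, and the trainable 100-dimensional embedding layer stands in for pretrained GloVe vectors:

```python
from tensorflow.keras import layers, models

# Illustrative sizes: 10k-word vocabulary, 200-token descriptions,
# 100-dim embeddings (matching common GloVe vectors).
VOCAB, SEQ_LEN, EMBED_DIM = 10_000, 200, 100

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, EMBED_DIM),        # would be GloVe-initialized
    layers.Conv1D(64, kernel_size=5, activation="relu"),  # local n-gram features
    layers.MaxPooling1D(pool_size=2),
    layers.LSTM(64),                           # sequence-level features
    layers.Dense(32, activation="relu"),       # podcast feature vector
])
print(model.output_shape)
```

The 32-dimensional output vector is what would feed the similarity computations of Step 3.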
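The SVD point can be sketched with NumPy on a tiny user-by-podcast rating matrix. All ratings below are made up, and 0 marks an unrated podcast:

```python
import numpy as np

# Hypothetical 3-user x 4-podcast rating matrix (0 = unrated).
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

# Factorize and keep only the top-k latent factors; the low-rank
# reconstruction fills in estimates for the unrated entries.
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Predicted rating for user 0 on the podcast they have not rated (column 2).
print(round(R_hat[0, 2], 2))
```

A production system would use a sparse-aware factorization (e.g. the LightFM model mentioned above) rather than dense SVD with zeros treated as ratings, but the low-rank idea is the same.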
In the next parts of this series, we will code for each of the steps above. Until then!
If you enjoyed the article, please clap and share with your coder friends!
Thanks for stopping by! Happy Coding!