Engaging with YouTube: How to Get Liked


Introduction

What makes people find and click on music on YouTube? Does the way a video is posted have anything to do with how popular it gets? In parts 1-3, I try and partly succeed in predicting how big a music-type video will get. For parts 4-6, I end up pivoting to identifying the most engaging video tags, in an effort to optimize engagement per view for a potential advertiser. In the end, I try to answer the age-old question: What the hell is trap music, actually?

For engagement visualizations, jump to Part 5.

  • Part 1: YouTube
  • Part 2: The Data
  • Part 3: Failing To Regress
  • Part 4: Engagement by Genre
  • Part 5: Conclusion
  • Part 6: Bonus Word Clouds

Part 1: YouTube

    What's in a video?

  • A title, about 7 words on average, often following a format like Song - Artist (Official Music Video)
  • Title Words

  • 10-15 Tags, not visible, meant to help the video show up in searches (e.g. rap, Cardi B, concert)
  • Tag Words

  • A longer description, around 100 words, usually providing detailed information about the artist and the uploader
  • Description Words

    Getting the Data

    • The YouTube API allows for automatic calls to its query function, which returns a list of relevant videos
    • Each individual video can then be queried for summary statistics
    • Time-stamped information, such as when views or comments happened, is not available
    • The query 'budget' makes it very inefficient to get specific comments
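As a rough sketch of the two-step pull described above (search first, then look up statistics for each result), the snippet below only builds the request URLs. It assumes the YouTube Data API v3 `search` and `videos` endpoints; the post does not say which client or version was actually used, and `API_KEY`, `build_search_url`, and `build_stats_url` are hypothetical names. No request is sent.

```python
from urllib.parse import urlencode

# Assumed base URL for the YouTube Data API v3 (not stated in the post).
API_BASE = "https://www.googleapis.com/youtube/v3"
API_KEY = "YOUR_API_KEY"  # placeholder; supply your own key


def build_search_url(query, page_token=None, max_results=50):
    """Build the URL for one page of video search results for `query`."""
    params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "key": API_KEY,
    }
    if page_token:
        # Token returned by the previous page, used to fetch the next one.
        params["pageToken"] = page_token
    return f"{API_BASE}/search?{urlencode(params)}"


def build_stats_url(video_ids):
    """Build the URL for summary statistics on a batch of video IDs."""
    params = {
        "part": "snippet,statistics,contentDetails",
        "id": ",".join(video_ids),
        "key": API_KEY,
    }
    return f"{API_BASE}/videos?{urlencode(params)}"
```

Batching several IDs into one statistics call is one way to stretch the query budget, since every request counts against it.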

    Part 2: The Data

    What can we get?

    Videos present a remarkably strong relationship across orders of magnitude:

    Description Words

    • By taking the base-10 log, we can see a clear trend across orders of magnitude
    • Actual and log modes:
      • Average Views per video: 53 Million (M) (7.7 on the graph)
      • Peak of Views per video: 10 M (6.96)
      • Peak of Likes per video: 25 thousand (k) (4.4)
      • Peak of Comments per video: 1.3 k (3.13)
      • Peak of Dislikes per video: 870 (2.94)
    • Which means the percentages of engagement per view are very consistent
      • Likes / View: .28%
      • Comments / View: .01%
      • Dislikes / View: .01%
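The arithmetic behind these percentages can be checked directly from the log-scale peaks quoted above. This is only a worked example using the rounded graph values, so the results match the listed rates to rounding; the variable names are illustrative.

```python
# Peak positions on the base-10 log axis, as quoted in the list above.
log_peaks = {"views": 6.96, "likes": 4.4, "comments": 3.13, "dislikes": 2.94}

# Convert each log value back to an actual count.
peaks = {name: 10 ** value for name, value in log_peaks.items()}

# Engagement per view, expressed as a percentage of the peak view count.
rates = {
    name: 100 * count / peaks["views"]
    for name, count in peaks.items()
    if name != "views"
}

for name, rate in rates.items():
    print(f"{name} / view: {rate:.2f}%")
```

Because the ratios are differences on the log axis (for example 4.4 - 6.96 for likes), a constant vertical gap between the curves is what makes the per-view percentages consistent.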

    What's Up with the Bump?

    The bump on the lower side of each curve brings up an important note about how the data was collected. When I ran the search, I pulled as many results as YouTube would give for each letter of the alphabet (about 350 videos per letter). I think the bump has to do with how YouTube finds relevant videos: it searches for the most videos that match the string, sorted again by year, and cuts off anything below the threshold. Sure enough, if we look at the results over a few years, most of the bump comes from newer videos, which suggested restricting the remainder of the analysis to the well-behaved videos with over 10,000 views. (As a bonus, the restriction only affects about 500 of the 8,500 unique entries in the data.)

    Bump over time

    Duration Trends

    Most videos fall within 2.5-5 minutes


    Because people don't watch very long or short videos


    Part 3: Failing To Regress

    The Setup

    • The original goal was to predict what went into a big hit (100 M + Views), and to predict views in general
    • Considered a number of features, including:
      • Video Aspects - Duration (missing lyrics)
      • Text Features - title, tags, description: word vectors, sentiment, length; title features: featuring artist, letters in word
      • Publication Date - weekend, friday, day of year, day of month, year (controls for more time to see a video after release)
      • Meta - has caption, high def vs. standard, content rating
    • Number to beat: using only likes, dislikes, and comments, we are able to predict views with r2 .67
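A few of the title and publication-date features listed above could be derived along these lines. This is a minimal sketch, not the post's actual pipeline: the function name, the feature names, the "featuring" pattern, and the timestamp format are all assumptions.

```python
import re
from datetime import datetime

# Assumed spellings of a "featuring artist" marker in a title.
FEAT_PATTERN = re.compile(r"\b(feat|ft|featuring)\b", re.IGNORECASE)


def extract_features(title, published_at):
    """Derive simple title and date features for one video.

    `published_at` is assumed to look like '2017-06-16T04:00:00Z'.
    """
    published = datetime.strptime(published_at, "%Y-%m-%dT%H:%M:%SZ")
    lowered = title.lower()
    return {
        "title_words": len(title.split()),
        "has_featuring": bool(FEAT_PATTERN.search(title)),
        "count_a": lowered.count("a"),
        "count_p": lowered.count("p"),
        "is_friday": published.weekday() == 4,   # Monday is 0
        "is_weekend": published.weekday() >= 5,
        "day_of_year": published.timetuple().tm_yday,
        "day_of_month": published.day,
        "year": published.year,
    }


features = extract_features(
    "Song - Artist ft. Pitbull (Official Music Video)",
    "2017-06-16T04:00:00Z",
)
```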

    "Results"

    • Tried a number of models, PCA, grid search, with random forest scoring highest at r2 .48, .9 MAE on the log (or +/- 1 order of magnitude)
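To make the two scores concrete, here is a small sketch of how r2 and MAE are computed on log-scale predictions. The sample values are made up for illustration and are not from the data set.

```python
def r2_score(actual, predicted):
    """Share of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up log10(views) values, purely illustrative.
log_actual = [5.0, 6.0, 7.0, 8.0]
log_predicted = [5.5, 5.5, 7.5, 7.5]

r2 = r2_score(log_actual, log_predicted)
mae = mean_absolute_error(log_actual, log_predicted)

# An error of 0.9 on the log10 scale is a multiplicative factor in views.
factor = 10 ** 0.9
```

Since the error lives on the log scale, an MAE of .9 means a typical prediction is off by a factor of about 8 in raw views, which is why it reads as roughly one order of magnitude.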
    • Analysis of engagement on biggest misses showed that I was missing the 'it' factor
    • Model was still 'good enough' to identify a number of new potential hits that weren't showing up in the main charts

    What can we Know? Rules of Thumb

    By combining feature importance from Gradient Boost and sign of Linear Regression coefficients, it's possible to identify the most significant potential view boosters:

    1. Duration - shorter is better
    2. Year - older videos have higher total views, probably due to a combination of (1) more time to accumulate views and (2) a quirk of the ETL phase, which probably failed to retrieve old videos with lower view counts
    3. Day of year: release earlier in the year, although recalling the midsummer dip, it is probably safe to say winter videos do better
    4. Longer descriptions with more positive tone do better
    5. The number of tags is much more important than their sentiment
    6. Longer titles do not do well
    7. Licensed content is more viewed
    8. Including a caption seems to help visibility
    9. Use the letter ‘a’ in the title a bunch, but not ‘p’
    10. Content cool enough to be prohibited in certain regions is more popular
    11. The Pitbull Effect: Include a featuring artist for an easy 14% bump
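The way the two model outputs are combined into the list above can be sketched as follows: importance sets the ranking, and the sign of the linear coefficient sets the direction. The function name and all numbers below are invented for illustration; they are not the fitted values from the analysis.

```python
def rank_boosters(importances, coefficients):
    """Order features by importance and tag each with its coefficient sign."""
    ranked = sorted(importances, key=importances.get, reverse=True)
    return [
        (name, "+" if coefficients[name] > 0 else "-")
        for name in ranked
    ]


# Hypothetical values: importances as a tree model might report them,
# coefficients as a linear regression on the same features might.
importances = {"duration": 0.30, "year": 0.25, "n_tags": 0.10}
coefficients = {"duration": -0.4, "year": -0.2, "n_tags": 0.1}

ranking = rank_boosters(importances, coefficients)
```

Tree-based importances say how much a feature matters but not in which direction, which is why the linear coefficients are borrowed for the sign.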

    Part 4: Engagement by Genre

    The Pivot

    After failing to satisfactorily predict views, I realized that what I had was a general approach for describing overall engagement. Views are a measure of engagement, as are likes, dislikes, and comments: Something about a video provokes different kinds of reactions from people. We don't need to perfectly predict any one element if we can describe things in a useful way.

    After messing around with some unsupervised classification, I found what should have been obvious in the first place: genres are pre-existing classes that behave differently. Using the tags and continuously adding popular words to new or existing genres, I was able to identify a number of keywords:

    • Variant: acoustic, cover, instrumental, lyric, lyrics
    • Blues : blues, blue, delta, rhythm, lee hooker
    • Christian : christ, christian, faith, worship
    • Classical : bach, beethoven, classical, composer, concerto, debussy, ensemble, orchestra, piano, symphony, sonata
    • Country : country, western, horse, america, american, soldier, road, home, alabama, denver, haggard, coe
    • Dubstep : dub, dubstep, skrillex, bass
    • EDM: aoki, club, edm, house, dance, dj, electr, electronic, electronica, techno, trance, ultra
    • Extended : live, album, festival
    • Folk : folk, banjo, indie
    • Halloween : creepy, halloween, eerie, horror, wolves
    • Hit : hit, interscope, new, official, single, sony, warner, vevo
    • Italian : singolo, nuovo, ultimo
    • Jazz : jazz, new orleans, rag, ragtime, swing
    • Kpop : kpop, korea, korean
    • Latin : latin, musica, reggaeton
    • Love Songs : amore, amor, breakup, break-up, love, need
    • Other Rock : grunge, heavy, metal, punk
    • Pop : clean, pop
    • Rap : rap, hip, hop, hiphop, r&b
    • Reggae : reggae, marley
    • Remix : remix
    • Romanian : romania, romanian
    • Relax : ambient, chill, concentration, downtempo, estudiar, dormir, meditate, meditation, relajar, relax, relaxing, relaxation, trabaja, sleep, sleeping, study, zen
    • Rock : rock, roll, rocknroll
    • Trap : lean, trap

    These keywords do not produce pure classifications, which is actually useful for understanding overlap between genres.
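A keyword lookup of this kind might look like the sketch below, which uses a handful of the genres listed above. The matching rule (whole words or phrases inside a tag, case-insensitive) is an assumption, as is the function name; the post does not say exactly how tags were matched.

```python
import re

# A subset of the keyword lists above.
GENRE_KEYWORDS = {
    "Trap": ["lean", "trap"],
    "Remix": ["remix"],
    "Latin": ["latin", "musica", "reggaeton"],
    "Rap": ["rap", "hip", "hop", "hiphop", "r&b"],
}


def genres_for(tags):
    """Return every genre whose keywords appear as whole words in the tags."""
    text = " | ".join(tag.lower() for tag in tags)
    found = set()
    for genre, keywords in GENRE_KEYWORDS.items():
        # \b keeps 'trap' from matching inside longer words like 'trapeze'.
        if any(re.search(rf"\b{re.escape(kw)}\b", text) for kw in keywords):
            found.add(genre)
    return found


labels = genres_for(["Trap Remix", "reggaeton"])
```

Returning a set rather than a single label is what lets one video count toward several genres at once.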

    genre_correlation

    For example, trap music is:

    • 22% Remixes (mainly of hit songs)
    • 18% Latin Music
    • 14% Dubstep
    • 13% Covers and instrumentals
    • 7% Rap samples
    • 6% Festival music
    • 5% Trying to scare people into listening
    • 3% EDM
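A breakdown like this can be computed from multi-label genre assignments by counting, among the videos in one genre, how often each other label also appears. The sketch below assumes each video is represented as a set of genre labels; the function name and sample data are made up.

```python
def overlap_shares(labelled_videos, genre):
    """Percent of `genre` videos that also carry each other genre label."""
    in_genre = [labels for labels in labelled_videos if genre in labels]
    counts = {}
    for labels in in_genre:
        for other in labels - {genre}:
            counts[other] = counts.get(other, 0) + 1
    return {other: 100 * n / len(in_genre) for other, n in counts.items()}


# Made-up label sets, one per video.
videos = [
    {"Trap", "Remix"},
    {"Trap", "Latin"},
    {"Trap"},
    {"Trap", "Remix", "Latin"},
    {"Pop"},
]

shares = overlap_shares(videos, "Trap")
```

Because a video can carry several labels, the shares for one genre need not add up to 100%.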

    Once the data was classified, I controlled for views to find expected likes, dislikes, comments, and like / dislike ratio on each video. With a standard deviation of this projection, I was then able to convert actual counts into a deviation from expected, which allowed for direct comparison of videos across view counts. Without this, the characteristics of each genre would be dominated by the average views of the videos in it.
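One way to express this "deviation from expected" is to fit a straight line to log counts against log views and divide each residual by the spread of the residuals. The post does not specify the exact model, so the log-log linear fit, the function names, and the sample numbers here are all assumptions.

```python
import math
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y on x."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def deviations_from_expected(views, counts):
    """Standardized residuals of log10(counts) given log10(views).

    Assumes all values are positive and the fit is not perfect.
    """
    log_views = [math.log10(v) for v in views]
    log_counts = [math.log10(c) for c in counts]
    slope, intercept = fit_line(log_views, log_counts)
    residuals = [
        y - (slope * x + intercept) for x, y in zip(log_views, log_counts)
    ]
    spread = pstdev(residuals)
    return [r / spread for r in residuals]


# Made-up views and likes for four videos.
views = [1e4, 1e5, 1e6, 1e7]
likes = [30, 250, 3000, 25000]
z_scores = deviations_from_expected(views, likes)
```

A positive score means a video drew more likes than its view count would predict, which puts small and large videos on the same scale.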

    Exploring Engagement Across Genres

    Interpreting These Charts

    The numbers shown for engagement are in terms of standard deviation from expected, in order to allow for comparability across metrics. Loosely speaking, it translates to percentage. It's more of a 'more or less' than a substantive metric at this point, so don't get too caught up.

    That being said, for each metric, one 'unit' translates to roughly the following percentage of expected:

    • Likes: 50%
    • Comments: 60%
    • Dislikes: 50%
    • Like / Dislike Ratio: 50%

    Visualizations

    Engagement in Most Viewed Videos of Major Genres

    Most Viewed Videos from each of the biggest categories, sized by favorability (likes / dislikes)

    Note that the spread is considerably higher on higher view count videos in general

    Engagement Stats By Genre

    Move Slider to isolate videos by view count (slight variance in aggregation method from actual)

    Engagement Stats By Genre, Measures Isolated

    Move Slider to isolate videos by view count

    Engagement Stats By Genre, Across Views

    Select Genres to Compare engagement across views

    View Range by Genre

    Spread of log of Views

    Part 5: Conclusion

    The following chart plots likes and comments, scaled by average view counts, for the genres studied.

    They are colored by the ratio of likes to dislikes (Higher Ratio: More Green)

    Halloween music, rock and roll, and Kpop all provoke the highest rates of engagement per view, followed by love songs, trap, and EDM. Any advertising campaign would do well to target any of these areas.

    Some trends to see here:

    • Videos with less engagement are more favorably liked - there is some threshold that provokes a reaction, and the bar for likes is lower than dislikes / comments
    • The biggest genres are the most normal, clustered within the main band (latin, pop, etc.)
    • Rock fans are quite vocal
    • Highest engagement genres: rock, kpop, halloween (spooky music), EDM, love songs, Trap
    • Classical music provokes comments, but not likes - presumably, they're above such petty interaction

    While this data is informative, there is always room for improvement in the analysis and presentation of findings. Some examples below:

    • Filter out other holiday or event-specific music
    • Incorporate metrics to compare genres across different view counts (e.g. low, medium, high)
    • Host a function to examine individual songs and genres (currently available in notebook)
    • Adjust ratio-based model to predict and compare views
    • Expand tags used to identify existing genres, and isolate new modes of engagement not directly related to genre
    • Continue to refine and engineer new features for use in the regression model, such as a variable to identify 'normal' length songs
    • Examine characteristics of heavy hitter channels like Vevo, and try to classify them
    • Identify tag words with high cross-over to understand how genres are related

    Part 6: Bonus Word Clouds

    Christian Music

    christian words

    Classical Music

    classical words

    Dubstep (music?)

    Dubstep words

    EDM

    EDM words

    Halloween Songs

    halloween words

    Latin Music

    Latin words

    Hit Music

    Hit words

    Rap Music

    Rap words

    Relaxation and Study Music

    Relax words

    Trap

    Trap words

    Source Code