Chapter 3 Data transformation

In this section, we will discuss how to merge multiple movie data. It can be tricky to merge data from multiple sources. Talk is cheap, click the view source button to see the code if you are interested!

3.1 Input data file

  • basics contains following information for each movies:
  • akas contains all alias for each movies:
  • ratings contains IMDb rating and votes:
  • principals contains principal cast/crew for each movies
  • Budget contains rev/cost info

3.2 Merge & Preprocess Data

3.2.1 Limit the scope

At this report, we dive in the “movies”, and especially those with box office (go the thereter). But the original dataset contains all these 10 types:

  • We only focus on movie and tvMovie (as some of them have box office, like High School Musical 3: Senior Year). And we will use them to map other data.

  • Also, we delete movie start at 2022 or later.

3.2.2 Merge movie-wised data

  • For basic and rating data, we can use the title id tconst to merge directly.

  • But the for the budget data acquired from numbers, we need to map them by movie name and it’s alias. Is is a tedious process to locate the IDMB id(tconst) value of 6,196 movies in theaters :

    1. Safest way: use movie name and their year to find id

      • use movies name + RealeaseYear in budget to map primary title + year from basics of IDMB. -> ✔️ matches 4495 rows

      • use movies name + RealeaseYear in budget to map original title + year from basics of IDMB. -> ✔️ matches 137 rows

      • use movies name(drop punctuation) + RealeaseYear in budget to map title(drop punctuation) + year from akas of IDMB. -> ✔️ matches 410 rows

    2. Not the best way: only use its name, and for multiple movies that share a same name, only consider the oen with highest number of votes in IMDB. -> ✔️ matches 1012 rows

    3. Last way: fuzzy matching name + year - like Star Wars Ep. VII: The Force Awakens with Star Wars: Episode VII -> ✔️ matches 135 rows

    4. -> ⚠️ Last 7 movies that can’t be matched to IMDB: consider its rev/cost data as missing.

After a tedious clean, we get this data to map them to imdb database:

Then, we need to understand how the NA is encoded in the dataset. So we calculate 15 most common value among each columns:

  • Apparently, \N, for example, is likely to be one kind of tag for missing value among:
SIGN_of_MISSING <- c('\\N', "", "[]", "   #N/A", "#N/A")

3.2.3 Merge person-wised data

At this part we merge pricinpals to the movies they are in, and calculate the weighted average rating for each actor, actress and director.

3.2.4 Merge Movie/genres-wised data

Noted that the genres column is not a cleaned column. To preserve this 1-movie-to-multi-genres relation, we

  • assign main genres, second genres and third genres based on reverse order of the popularity of genres to each movie. For example, if a movie is tagged with Crime, Action, and Adventure. We will label Crime (177m votes) as main genre, Adventure (250m votes) as second, and Action (306m votes) as third.
  • create a one-to-many mapping table for each movie in order to analysis, e.g, aveage votes for each genres.

3.3 Create d3 data

Lastly, we calculate the all-time best movies, the result is stored in data/processing/d3input.csv.