Chapter 3 Data transformation
In this section, we will discuss how to merge multiple movie data. It can be tricky to merge data from multiple sources. Talk is cheap, click the view source button to see the code if you are interested!
3.1 Input data file
basicscontains following information for each movies:
akascontains all alias for each movies:
ratingscontains IMDb rating and votes:
principalscontains principal cast/crew for each movies
Budgetcontains rev/cost info
3.2 Merge & Preprocess Data
3.2.1 Limit the scope
At this report, we dive in the “movies”, and especially those with box office (go the thereter). But the original dataset contains all these 10 types:
We only focus on
movieandtvMovie(as some of them have box office, like High School Musical 3: Senior Year). And we will use them to map other data.Also, we delete movie start at 2022 or later.
3.2.2 Merge movie-wised data
For basic and rating data, we can use the title id
tconstto merge directly.But the for the
budgetdata acquired from numbers, we need to map them by movie name and it’s alias. Is is a tedious process to locate the IDMB id(tconst) value of 6,196 movies in theaters :Safest way: use movie name and their year to find id
use
movies name+RealeaseYearin budget to mapprimary title+yearfrombasicsof IDMB. -> ✔️ matches 4495 rowsuse
movies name+RealeaseYearin budget to maporiginal title+yearfrombasicsof IDMB. -> ✔️ matches 137 rowsuse
movies name(drop punctuation) +RealeaseYearin budget to maptitle(drop punctuation) +yearfromakasof IDMB. -> ✔️ matches 410 rows
Not the best way: only use its name, and for multiple movies that share a same name, only consider the oen with highest number of votes in IMDB. -> ✔️ matches 1012 rows
Last way: fuzzy matching name + year - like Star Wars Ep. VII: The Force Awakens with Star Wars: Episode VII -> ✔️ matches 135 rows
-> ⚠️ Last 7 movies that can’t be matched to IMDB: consider its rev/cost data as missing.
After a tedious clean, we get this data to map them to imdb database:
Then, we need to understand how the NA is encoded in the dataset. So we calculate 15 most common value among each columns:
- Apparently,
\N, for example, is likely to be one kind of tag for missing value among:
SIGN_of_MISSING <- c('\\N', "", "[]", " #N/A", "#N/A")3.2.3 Merge person-wised data
At this part we merge pricinpals to the movies they are in, and calculate the weighted average rating for each actor, actress and director.
3.2.4 Merge Movie/genres-wised data
Noted that the genres column is not a cleaned column. To preserve this 1-movie-to-multi-genres relation, we
- assign
main genres,second genresandthird genresbased on reverse order of the popularity of genres to each movie. For example, if a movie is tagged with Crime, Action, and Adventure. We will label Crime (177m votes) as main genre, Adventure (250m votes) as second, and Action (306m votes) as third.
- create a one-to-many mapping table for each movie in order to analysis, e.g, aveage votes for each genres.
3.3 Create d3 data
Lastly, we calculate the all-time best movies, the result is stored in data/processing/d3input.csv.