Chapter 3 Data transformation
In this section, we will discuss how to merge multiple movie data. It can be tricky to merge data from multiple sources. Talk is cheap, click the view source button to see the code if you are interested!
3.1 Input data file
basics
contains following information for each movies:
akas
contains all alias for each movies:
ratings
contains IMDb rating and votes:
principals
contains principal cast/crew for each movies
Budget
contains rev/cost info
3.2 Merge & Preprocess Data
3.2.1 Limit the scope
At this report, we dive in the “movies”, and especially those with box office (go the thereter). But the original dataset contains all these 10 types:
We only focus on
movie
andtvMovie
(as some of them have box office, like High School Musical 3: Senior Year). And we will use them to map other data.Also, we delete movie start at 2022 or later.
3.2.2 Merge movie
-wised data
For basic and rating data, we can use the title id
tconst
to merge directly.But the for the
budget
data acquired from numbers, we need to map them by movie name and it’s alias. Is is a tedious process to locate the IDMB id(tconst
) value of 6,196 movies in theaters :Safest way: use movie name and their year to find id
use
movies name
+RealeaseYear
in budget to mapprimary title
+year
frombasics
of IDMB. -> ✔️ matches 4495 rowsuse
movies name
+RealeaseYear
in budget to maporiginal title
+year
frombasics
of IDMB. -> ✔️ matches 137 rowsuse
movies name
(drop punctuation) +RealeaseYear
in budget to maptitle
(drop punctuation) +year
fromakas
of IDMB. -> ✔️ matches 410 rows
Not the best way: only use its name, and for multiple movies that share a same name, only consider the oen with highest number of votes in IMDB. -> ✔️ matches 1012 rows
Last way: fuzzy matching name + year - like Star Wars Ep. VII: The Force Awakens with Star Wars: Episode VII -> ✔️ matches 135 rows
-> ⚠️ Last 7 movies that can’t be matched to IMDB: consider its rev/cost data as missing.
After a tedious clean, we get this data to map them to imdb database:
Then, we need to understand how the NA is encoded in the dataset. So we calculate 15 most common value among each columns:
- Apparently,
\N
, for example, is likely to be one kind of tag for missing value among:
<- c('\\N', "", "[]", " #N/A", "#N/A") SIGN_of_MISSING
3.2.3 Merge person
-wised data
At this part we merge pricinpals to the movies they are in, and calculate the weighted average rating for each actor, actress and director.
3.2.4 Merge Movie/genres
-wised data
Noted that the genres
column is not a cleaned column. To preserve this 1-movie-to-multi-genres relation, we
- assign
main genres
,second genres
andthird genres
based on reverse order of the popularity of genres to each movie. For example, if a movie is tagged with Crime, Action, and Adventure. We will label Crime (177m votes) as main genre, Adventure (250m votes) as second, and Action (306m votes) as third.
- create a one-to-many mapping table for each movie in order to analysis, e.g, aveage votes for each genres.
3.3 Create d3 data
Lastly, we calculate the all-time best movies, the result is stored in data/processing/d3input.csv
.