>>>> 1. Touhou Project
Perhaps you have seen the black-and-white shadow-art video "Bad Apple!!", or noticed the words "Touhou Project" while listening to songs in various music communities, without knowing what lies behind them. Perhaps you asked in the comments, "What anime is this? Where can I watch it?", only to be met with a cold reception or even veiled ridicule, and so reluctantly filed it away as a strange symbol. If so, I hope this article gives you a chance to revisit, from the perspective of data, this wonderful world built by countless fan creators.
First of all, Touhou Project (hereafter "Touhou") refers to the series of danmaku (barrage) shooting games and music albums released by the doujin circle Team Shanghai Alice. In a broad sense, it also includes the many doujin or commercial derivative works based on the originals, making it an umbrella term that covers games, music, comics, novels, animation, and other fields.
Note 1: "Doujin" is independent of whether a work is original or derivative. It can be understood simply as non-commercial creative activity in which the author also takes charge of publishing, printing, or distribution.
Note 2: Team Shanghai Alice (also rendered "Shanghai Alice Phantom Orchestra") is not located in China. Its only actual member is ZUN, whose real name is Ota Junya.
Since ZUN, the original author, announced in 2004 that derivative works were permitted, Touhou has gradually caught fire in the doujin scene and attracted a large number of outstanding creators. Since then it has never been absent from the doujin "Big Three". The plot of the original games is very simple, leaving players with one girl after another, each with a distinctive personality and powerful abilities. Many characters do not even have a line of dialogue or a portrait, yet countless derivative works have breathed souls into them. Gensokyo, an illusory place, seems to have existed for a long time.
In my humble opinion, the charm of Touhou comes mainly from its enduring derivative works. Continuously attracting a large group of fans with high-level creative ability is uncommon among fandoms; it has something of the freedom of the early Internet and of open-source communities. Although some official Touhou works are commercial publications, this has not dampened ZUN's enthusiasm for continuing to take part in doujin creation.
>>>> 2. The Honest, Plodding Crawler (Abridged)
(For the full version, please read the author's blog post)
Because my energy is limited, I cannot collect and analyze Touhou-related works across all fields in a short time. For now, I take only Bilibili (Station B) as a representative and examine Touhou Project at the level of video submissions.
Touhou-related videos have been submitted in large numbers ever since Station B was founded; the earliest, av2, is a Touhou music collection uploaded by the site owner, bishi. By combining how the pages behave with the browser console, the different data interfaces can gradually be worked out. After some investigation, several useful JSON APIs were identified.
The basic idea of the crawler at this stage is to traverse the av number of every video submission, use an API to fetch its tags to decide whether it is a target video, and, if so, use the other APIs to fetch its data. After roughly 20,000 Touhou videos had been crawled this way, the hit rate kept falling, and the number of clearly invalid submissions (with incomplete metadata) had already reached 4,000. While collating and documenting the APIs I had found on Station B, I decided to switch to another, imperfect solution: a recursive search starting from the recommendation lists.
Since the metadata of 20,000 Touhou video submissions had already been collected, they can serve as the initial search set. Each step traverses the recommendation list of one submission in the set, merges any newly found targets into the search set, and removes the submissions that have already been searched. When the search set is empty, the crawler stops.
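The search just described can be sketched as a simple frontier expansion. This is only a minimal illustration under stated assumptions, not the crawler actually used: `get_recommended` and `is_target` are hypothetical callables standing in for the recommendation-list and tag APIs, and the toy dictionaries replace real network calls.

```python
from collections import deque
from typing import Callable, Iterable, Set


def expand_from_seeds(
    seeds: Iterable[int],
    get_recommended: Callable[[int], Iterable[int]],
    is_target: Callable[[int], bool],
) -> Set[int]:
    """Grow a set of target videos by following recommendation lists."""
    found = set(seeds)        # every target discovered so far
    pending = deque(found)    # targets whose recommendations are still unvisited
    while pending:            # the crawl ends when nothing is left to visit
        current = pending.popleft()
        for candidate in get_recommended(current):
            if candidate not in found and is_target(candidate):
                found.add(candidate)
                pending.append(candidate)
    return found


# Toy stand-ins for the real API calls (hypothetical ids, not real av numbers).
RECOMMENDATIONS = {1: [2, 3], 2: [1, 4], 3: [5], 4: [], 5: [6], 6: []}
TOUHOU_IDS = {1, 2, 4, 5, 6}   # video 3 is off-topic

result = expand_from_seeds(
    seeds=[1],
    get_recommended=lambda av: RECOMMENDATIONS.get(av, []),
    is_target=lambda av: av in TOUHOU_IDS,
)
print(sorted(result))
```

In this toy graph, videos 5 and 6 are on-topic but reachable only through the off-topic video 3, so they are never found. That is one reason the approach is imperfect.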
The following table lists the main data fields obtained by this crawl, which are spread across four MongoDB collections.
Submission av number
cid of each part (P)
Uploader (UP) of the submission
Duration of each part
Danmaku (barrage) metadata and danmaku text
Uploader's follower count
Popularity statistics such as views, coins, and favorites
>>>> 3. Dongfang Qiuwen Data
Using the upload time in the submission records, the submissions can be aggregated by month and then accumulated, which yields the following cumulative submission curve.
A chart of the monthly submission volume is also attached here; it can be read together with the popularity analysis later on.
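As a rough sketch of the aggregation behind these two charts, the snippet below groups upload times by month and accumulates the counts using only the standard library. It assumes upload times are stored as Unix timestamps and interprets them in UTC; both are assumptions for illustration, since the storage format of the crawled records is not specified here.

```python
from collections import Counter
from datetime import datetime, timezone
from itertools import accumulate


def monthly_and_cumulative(upload_timestamps):
    """Return [(month, count, cumulative_count)] sorted by month."""
    per_month = Counter(
        datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
        for ts in upload_timestamps
    )
    months = sorted(per_month)
    counts = [per_month[m] for m in months]
    return list(zip(months, counts, accumulate(counts)))


# Made-up sample upload times, built from dates for readability.
SAMPLE_UPLOADS = [
    datetime(2010, 1, 5, tzinfo=timezone.utc).timestamp(),
    datetime(2010, 1, 20, tzinfo=timezone.utc).timestamp(),
    datetime(2010, 3, 1, tzinfo=timezone.utc).timestamp(),
]

rows = monthly_and_cumulative(SAMPLE_UPLOADS)
for month, count, running_total in rows:
    print(month, count, running_total)
```

Months without any submission are simply absent from the result, so a plotting step would need to fill them in with zeros if a continuous time axis is wanted.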
As noted earlier, Station B contains many invalid or deleted submissions. In this article, a video is judged valid if both its part (P) information (a submission has at least one part) and its real-time popularity statistics are non-empty.
Totals: the crawl yielded 59,611 valid submissions, of which 6,049 have multiple parts, and the total number of parts is close to 100,000. The total video duration is 76,341,480 seconds, or about 883 days. The total number of danmaku (including those outside the current danmaku pool) is 12,430,822. A total of 8,838 uploaders took part. This is roughly equivalent to everything Station B accumulated in its first year and a half.
Extremes: the submission with the most parts is the "Touhou Doujin Music Museum Series", with 11,644 parts in total. Its uploader has kept it updated through the recent C92 event, yet these ten thousand-odd tracks are only the tip of the iceberg of Touhou doujin music. Touhou music is a very interesting subject in its own right; one circle has even written a book, "A Survey of Touhou Project Music and Music Theory", from the perspective of music theory.
The most-favorited submission is the old "Bad Apple!!" shadow-art video. The submission with the most danmaku is an advanced danmaku game from the Flash-player era (it exploited Flash vulnerabilities to move elements on the player so as to dodge the danmaku), and it is also a kichiku (remix) video visible only to members. The submission with the most coins is the original "Secret Seal Activity Record (Month)" by a domestic doujin circle. Among all uploaders who have submitted Touhou videos, the one with the most submissions is "The New Moon in Gensokyo".
Next, let's look at how the submissions are distributed across zones. As submissions to Station B kept growing, zone design has always been a hard problem. The once-stable layout consisted of six zones: animation, music, games, entertainment, collections, and new anime. The "animation zone" originally just borrowed the Japanese word "douga", which in Chinese simply means a posted video, not a serialized TV animation. Later, Station B restricted the zone to animated shorts that are self-made or produced with tools such as 3D modeling software. For a detailed introduction to the zones, see Station B's zoning guidelines.
MMD･3D, short film･hand-drawn･dubbing, general, and MAD･AMV in the pie chart are all sub-zones of the animation zone, so about half of the Touhou video submissions belong to it. This is not surprising: although the original Touhou games are danmaku shooters, the characters and stories are evidently more interesting to fans than working out how to clear a stage.
Tags and clustering
The importance of tags was explained in the crawler section above. Now all submissions are aggregated by tag and sorted by the number of submissions per tag, and the 200 most frequent tags are used to generate the following word cloud. At the very least, this word cloud reflects what Station B users know or care about in Touhou Project.
Readers unfamiliar with Touhou can spot the main characters or their nicknames here. Note that some characters go by several names; for example, the tags "Flandre Scarlet", "Flandre", and "Second Young Lady" all refer to the same character.
A total of 39,525 distinct tags were obtained from the submission metadata. More than half of them have been used no more than five times across the whole site. Tags were originally intended to make retrieval easier, but many tags on Station B are merely comments or complaints. This is probably the consequence of paying little attention to tagging in the early days and relying solely on members' self-discipline.
Now co-occurrence analysis is applied to only the 500 most frequent tags. First, count how many times any two tags are attached to the same video submission; the resulting counts form the co-occurrence matrix C, where C(i, j) is the number of times the i-th tag and the j-th tag appear together. (A more refined model could also use a second-order co-occurrence matrix.) Each row of the co-occurrence matrix gives a vector representation of a tag (the "embedding" often mentioned in natural language processing). Here we write a most_similar function that uses cosine similarity as the measure, and look up the tags (vectors) closest to "Flandre Scarlet".
The result looks quite good. (Well, I know that anyone who has read this far must be a good-natured lolicon, right?)
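The co-occurrence matrix and the most_similar lookup described above can be sketched in a few lines of plain Python. This is a toy reconstruction rather than the original code: the sample tag lists are invented, and putting each tag's own frequency on the diagonal is an assumption of this sketch (the text does not say how C(i, i) was handled).

```python
from itertools import combinations
from math import sqrt


def build_cooccurrence(tag_lists, vocabulary):
    """Count how often each pair of vocabulary tags appears on the same video."""
    index = {tag: i for i, tag in enumerate(vocabulary)}
    size = len(vocabulary)
    matrix = [[0] * size for _ in range(size)]
    for tags in tag_lists:
        present = sorted({index[t] for t in tags if t in index})
        for i in present:
            matrix[i][i] += 1              # assumption: diagonal holds the tag's own count
        for i, j in combinations(present, 2):
            matrix[i][j] += 1
            matrix[j][i] += 1              # keep the matrix symmetric
    return matrix


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(tag, matrix, vocabulary, topn=3):
    """Rank the other tags by cosine similarity of their co-occurrence rows."""
    row = matrix[vocabulary.index(tag)]
    scores = [
        (other, cosine(row, matrix[k]))
        for k, other in enumerate(vocabulary)
        if other != tag
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented sample data: the tag list of four imaginary submissions.
VOCAB = ["Flandre Scarlet", "Remilia Scarlet", "Izayoi Sakuya",
         "Hakurei Reimu", "Kirisame Marisa"]
VIDEO_TAGS = [
    ["Flandre Scarlet", "Remilia Scarlet", "Izayoi Sakuya"],
    ["Flandre Scarlet", "Remilia Scarlet"],
    ["Hakurei Reimu", "Kirisame Marisa"],
    ["Hakurei Reimu", "Kirisame Marisa", "Izayoi Sakuya"],
]

C = build_cooccurrence(VIDEO_TAGS, VOCAB)
for name, score in most_similar("Flandre Scarlet", C, VOCAB, topn=2):
    print(name, round(score, 3))
```

For 500 tags a NumPy array would be the more natural container; nested lists are used here only to keep the sketch dependency-free.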
With a notion of distance or similarity in hand, the natural next step is clustering.
One survey article divides the commonly used clustering algorithms into nine categories. Hierarchical clustering is chosen here, in its usual bottom-up form (agglomerative clustering). At the start, each sample point is its own cluster; then, given a distance metric (affinity) and a definition of the distance between clusters (linkage criterion), the two closest clusters are merged repeatedly. Commonly used distance measures include Euclidean distance, Manhattan distance, and cosine similarity. Commonly used linkage criteria include the distance between cluster centroids (centroid), the average distance between clusters (average), and the maximum distance between clusters (complete).
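To make the merge procedure concrete, here is a deliberately naive sketch of bottom-up clustering with cosine distance and average linkage. It is written for readability, not speed (each merge rescans all cluster pairs), and it is not the scikit-learn implementation used in this article; the four sample vectors are made up.

```python
from math import sqrt


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def average_linkage(cluster_a, cluster_b, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(
        cosine_distance(vectors[i], vectors[j])
        for i in cluster_a
        for j in cluster_b
    )
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative(vectors, n_clusters):
    """Merge the two closest clusters until only n_clusters remain."""
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(vectors))]   # one cluster per sample
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]     # merge b into a
        del clusters[b]
    return [sorted(c) for c in clusters]


# Made-up 2-D vectors: samples 0 and 1 point one way, 2 and 3 another.
SAMPLE_VECTORS = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(agglomerative(SAMPLE_VECTORS, n_clusters=2))
```

Swapping `average_linkage` for a function that returns the maximum pairwise distance would give complete linkage instead.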
Feeding the co-occurrence matrix generated earlier into scikit-learn's hierarchical clustering implementation gave only mediocre clusters, and tuning the parameters did not help. On reflection, I suspected that a few overwhelmingly dominant tags were distorting the scale of the data, so the co-occurrence matrix was regenerated after removing dominant tags such as "Touhou" and "Touhou PROJECT". Singular value decomposition (SVD) was then used to reduce the dimensionality of each tag's vector representation. After several rounds of adjustment and re-clustering, a satisfactory result was finally obtained:
As you can see, these clusters correspond to the characters of different works in the original Touhou series. Did you spot your favorite characters?
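The two preprocessing steps mentioned above (dropping dominant tags, then reducing dimensionality with SVD) might look roughly like the following NumPy sketch. Several things here are assumptions for illustration: the tiny matrix is invented, the dominant tags are removed by deleting their rows and columns rather than by recounting from the raw data, and the rows are L2-normalised afterwards so that Euclidean and cosine comparisons agree.

```python
import numpy as np


def drop_tags(matrix, vocabulary, dominant):
    """Remove the rows and columns that belong to overly dominant tags."""
    keep = [i for i, tag in enumerate(vocabulary) if tag not in dominant]
    reduced = matrix[np.ix_(keep, keep)]
    return reduced, [vocabulary[i] for i in keep]


def svd_embed(matrix, n_components):
    """Project each row onto the top singular directions and L2-normalise."""
    u, s, _ = np.linalg.svd(matrix.astype(float), full_matrices=False)
    embedded = u[:, :n_components] * s[:n_components]
    norms = np.linalg.norm(embedded, axis=1, keepdims=True)
    return embedded / np.where(norms == 0, 1.0, norms)


# Invented co-occurrence counts; "Touhou" co-occurs with everything.
TAGS = ["Touhou", "Flandre", "Remilia", "Reimu", "Marisa"]
COUNTS = np.array([
    [9, 4, 4, 5, 5],
    [4, 4, 3, 0, 0],
    [4, 3, 4, 0, 0],
    [5, 0, 0, 5, 4],
    [5, 0, 0, 4, 5],
])

reduced, names = drop_tags(COUNTS, TAGS, dominant={"Touhou"})
embedding = svd_embed(reduced, n_components=2)
print(names)
print(np.round(embedding @ embedding.T, 3))   # pairwise cosine similarities
```

The resulting low-dimensional rows can then be handed to whichever clustering routine is in use.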
Co-occurrence analysis is in fact a technique commonly used in recommender systems (Wartena et al., 2009). The tag home pages on Station B currently show a "related tags" column, which may well be built from a tag co-occurrence matrix too.
Video popularity analysis
Station B keeps real-time statistics on the views, danmaku (barrage comments), comments, favorites, coins, shares, and so on of every submitted video. The newest submissions crawled date up to October 6, 2017, and the real-time popularity statistics are as of October 21, 2017.
Station B's ranking algorithm has been updated many times. Ignoring the specific rank, we first count, month by month, the Touhou videos that entered the single-day ranking (top 100). It turns out that the counts before 2012 are very low. There are three possible reasons: 1. the ranking mechanism had not been introduced before 2012; 2. some damaging incident led to a large-scale withdrawal of submissions; 3. problems arose during a site revision or migration.
Looking only from 2013 onward (registration opened in May 2013), the monthly number of ranked videos oscillates with a period of roughly six months. After the second half of 2014 (the "Rui-zhan" era), Station B began to expand from the pan-ACG community into other areas, and the zones grew from the original six to the current 14. Even so, although the share of Touhou-related submissions on the single-day ranking has dropped slightly, it has not been fatally hurt and can even be said to remain strong. In addition, I found that THVideo, a Touhou-only danmaku site founded by the uploader "Roasted Night Sparrow" and active from May 2014 to July 2016, diverted more than 6,000 Touhou video submissions, most of which were not posted to Station B at the same time.
Now let's explore the most basic popularity metric: view count.
View count is the number of times a video has been played (even for just one second), counted via cookies; it serves as the baseline for the six popularity metrics the crawler can obtain. The following analysis excludes 725 videos that can only be viewed by members (for these submissions, the view count returned by the API is
Visualizing the distribution of view counts directly gives an extremely right-skewed "long-tail" curve. The usual strategy in this case is to take the logarithm so that the plot conveys more information. Even on a logarithmic scale, the distribution of view counts is still fairly right-skewed. The median read directly off the figure is about 1,000 (the actual median is 1,408). If a submission with fewer than 1,000 views is considered an absolutely low-view submission, about 10,000 of the more than 13,000 videos in the game zone fall into this category. This is also to be expected: if you have watched the real-time submission counters on Station B's home page, you will have noticed that the game zone always grows the fastest. After all, the barrier to recording one's screen is far lower than that of making an animated short.
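As a small illustration of why the logarithm helps, the sketch below buckets view counts by order of magnitude and computes the median and the share of low-view submissions. The sample numbers are made up; only the 1,000-view threshold comes from the text above.

```python
import math
import statistics
from collections import Counter


def log10_histogram(view_counts):
    """Bucket view counts by order of magnitude (floor of log10)."""
    buckets = Counter()
    for views in view_counts:
        if views > 0:                      # log10 is undefined for 0
            buckets[math.floor(math.log10(views))] += 1
    return dict(sorted(buckets.items()))


def low_view_share(view_counts, threshold=1000):
    """Fraction of submissions with fewer views than the threshold."""
    return sum(v < threshold for v in view_counts) / len(view_counts)


# Made-up view counts spanning several orders of magnitude.
SAMPLE_VIEWS = [3, 40, 800, 1500, 2000, 90000, 2500000]

histogram = log10_histogram(SAMPLE_VIEWS)
print(histogram)
print(statistics.median(SAMPLE_VIEWS))
print(round(low_view_share(SAMPLE_VIEWS), 3))
```

On a linear axis the single 2,500,000-view video would squeeze everything else into the first bin, whereas the order-of-magnitude buckets keep the bulk of the distribution visible.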
Next come favorites, coins, and shares. These three metrics reflect how much the audience appreciates a video, and they are obviously all much smaller than the view count. In my everyday experience of browsing Station B, the three generally decrease by about an order of magnitude in turn. Because favoriting costs almost nothing, the favorite count is generally the most important metric after the view count, followed by coins. Coins are the basic economic unit circulating among Station B users (unlike tokens such as Q coins) and are a relatively limited resource; each user can give at most two coins to a video. The smallest of the three is generally the share count, because even when viewers are momentarily blown away, sharing to a social network carries additional social pressure.
Take the logarithm of the metrics above and draw two-dimensional KDE plots:
View count and favorite count turn out to be approximately linearly related (the Pearson correlation coefficient of the raw data, computed with numpy, is 0.804). Where the data are densest, the view count is about 20 times the favorite count. Looking also at the distributions of coins and shares, a submission with fewer than 1,000 views can hardly earn more than a dozen coins or more than a few shares. In addition, submissions that made the single-day ranking generally need at least 20,000 views.
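For readers who want to see what the correlation figure measures, here is a dependency-free sketch of the Pearson coefficient on raw and on log-transformed values. The article's number was computed with numpy; this hand-written version and the sample data are for illustration only.

```python
from math import log10, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up (views, favorites) pairs with a ratio of roughly 20 to 1.
SAMPLE_VIEWS = [1200, 3400, 560, 20000, 870]
SAMPLE_FAVORITES = [60, 150, 30, 1100, 40]

r_raw = pearson(SAMPLE_VIEWS, SAMPLE_FAVORITES)
r_log = pearson([log10(v) for v in SAMPLE_VIEWS],
                [log10(f) for f in SAMPLE_FAVORITES])
print(round(r_raw, 3), round(r_log, 3))
```

With numpy available, `np.corrcoef(views, favorites)[0, 1]` returns the same quantity. On heavily skewed data the raw coefficient is dominated by a few very popular videos, which is one reason to look at the log-scale plot as well.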
For visualizing high-dimensional data, besides combining different low-dimensional subspaces, one can also use a parallel coordinates plot. Here we use syntagmatic's D3 extension library and select the submissions with higher favorite counts. A screenshot follows (see the original author's blog for the interactive version).
The color in the parallel coordinates plot indicates the coin-to-favorite ratio: the closer to red, the higher the value, and the closer to blue, the lower.
After exploring the data, I opened my own favorites on Station B and manually checked some of the works I consider high quality. They all come from MMD and hand-drawn animation in the animation zone; some are authorized reposts with added subtitles or even dubbing, and some are uploaders' original works. Not all of them made the top 100 of the single-day ranking, but almost all share one trait: a noticeably higher coin-to-favorite ratio, generally above 20%, with the coin count in a few cases approaching or even exceeding the favorite count. In terms of interaction, on both desktop and mobile, favoriting and giving coins are the easiest actions to reach besides playing. The single-day ranking reflects how a submission performs within a single day, which may not be an especially fair measure of its overall quality. So I examined all submissions in the animation and music zones with more than 100 coins and more than 4,000 views, and imported those with a coin-to-favorite ratio above 16% into a public Google Sheet (the "archaeology list"). Feel free to consult it. If it proves useful, it could gradually be turned into a dedicated navigation site for Touhou video submissions.
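The filter behind the "archaeology list" can be expressed as a short predicate. The thresholds below are the ones stated above; the record layout (`zone`, `views`, `coins`, `favorites`) and the sample rows are hypothetical, since the actual query ran against the crawler's database.

```python
def coin_favorite_ratio(video):
    """Coins divided by favorites; infinite if there are coins but no favorites."""
    favorites = video["favorites"]
    if favorites == 0:
        return float("inf") if video["coins"] > 0 else 0.0
    return video["coins"] / favorites


def shortlist(videos, zones=("animation", "music"),
              min_coins=100, min_views=4000, min_ratio=0.16):
    """Keep well-received submissions with a high coin-to-favorite ratio."""
    return [
        v for v in videos
        if v["zone"] in zones
        and v["coins"] > min_coins
        and v["views"] > min_views
        and coin_favorite_ratio(v) > min_ratio
    ]


# Hypothetical records; the ids are not real av numbers.
SAMPLE_VIDEOS = [
    {"id": 1, "zone": "animation", "views": 50000, "coins": 900, "favorites": 3000},
    {"id": 2, "zone": "game", "views": 80000, "coins": 500, "favorites": 1000},
    {"id": 3, "zone": "music", "views": 6000, "coins": 120, "favorites": 2000},
    {"id": 4, "zone": "music", "views": 3000, "coins": 150, "favorites": 300},
    {"id": 5, "zone": "music", "views": 9000, "coins": 400, "favorites": 350},
]

kept_ids = [v["id"] for v in shortlist(SAMPLE_VIDEOS)]
print(kept_ids)
```

Record 5 is the kind of case mentioned above in which the coin count exceeds the favorite count.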
The first round of crawling also collected information on the uploaders who had submitted Touhou videos. Next, I wanted to know when all of these uploaders registered their accounts:
It seems that most uploaders are long-time users (before May 2013, registration was open only a few times a year). Uploaders with more than 200 video submissions: ['The New Moon in Gensokyo','Finger Cat','R 10.Hun','MITO Mizuo','Pawa Twins','Zusa Meow Go Soy Sauce','犬走椛様','Roasted Night Sparrow','Ibuki Xiaoqiu','One-character text','Pain','White River Kotomi','Josany','Long Qi Huanya','kikou',' bili tourist','a passing magician grapefruit','despair is the first chapter of a dream','Xiaoyi.','Yuyu Marisa','⑨practice dream⑨','dead soul butterfly','20024JoK','The Phantom of Gensokyo','Fangji Manju','Lucky Good Mood','meiling','Miko of Asaya Shrine','Shao Daxiao','Little Bird Yushingo','killbillwillil',' Instant noodles m']; if you are interested, you can follow them. Here the uploader's name is added to the parallel coordinates plot. Several popular uploaders (The New Moon in Gensokyo, Roasted Night Sparrow, Pawa Twins...) are generally red (see the original blog for the interactive version), which shows that submissions by well-known uploaders tend to earn a higher coin-to-favorite ratio.
Danmaku (literally "barrage") originally refers to dense military fire, such as a battleship's anti-aircraft barrage. In danmaku shooting games, it refers to the bullets fired by the player's craft or by enemies. The concept was later extended to on-screen comments, first introduced by Niconico (Station N): the words and sentences that sweep across the screen while a video plays look very much like the barrage in a danmaku shooter, hence the name. Station B is regarded as having inherited the danmaku concept from Station N, but unlike Station N it also kept the comment section of traditional online video sites, perhaps in the hope that users would consciously separate danmaku from comments. In addition, Station B provided danmaku-blocking rules early on to protect the viewing experience, yet it has still been hard to resist the decline in the quality of the user base that came with growing popularity.
Danmaku on Station B can be divided into two types by nature. The first type reflects the audience's comments on or descriptions of the current scene; it generally carries a good deal of information and has some descriptive value. The second type exists mainly to heighten the expressiveness of the current scene; regardless of the length of its text, it generally carries little information and is mostly an emotional outlet. The second type includes pure noise of the ^6666* kind, which is widely blocked through filter settings, as well as all sorts of danmaku-specific memes. Either way, danmaku are full of out-of-vocabulary words, special symbols, mixed languages, and typos, and the context linking one danmaku to the next depends mainly on the video content, so mining danmaku text inevitably differs from more traditional scenarios.
Because danmaku text contains a great many words absent from conventional corpora, word segmentation requires first trying to extract new words from the full text of ordinary danmaku; before that, though, a moderate amount of text cleaning must be done.
At this point, the statistics show more than 7.1 million danmaku containing more than 57 million characters in total, drawn from 8,974 distinct characters.
(If each character were the size of the font on your screen, these danmaku laid end to end would stretch from Hangzhou East to Shanghai Hongqiao.) Following the knowledge-base-free word extraction method described on the Matrix67 blog, compute the mutual information and the left and right neighbor entropy of each candidate word, and keep the candidates for which both are as large as possible.
Mutual information reflects how cohesive a candidate word is. For example, the cohesion of the three-character word "电影院" (cinema) is given by the smaller of the two ratios p(电影院) / (p(电)·p(影院)) and p(电影院) / (p(电影)·p(院)). The left and right neighbor entropy reflects how freely a candidate word combines with its context. Building prefix and suffix tries over all strings makes it possible to complete all of these calculations. The Java implementation by sing1ee on GitHub is used here (with parts of the code modified to suit the actual characteristics of danmaku).
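The two scores can be illustrated on a tiny corpus with plain counters instead of tries. This is a simplified sketch, not the Java implementation referred to above: probabilities are estimated as n-gram count divided by total character count, cohesion is returned as the raw ratio (an implementation may well use its logarithm), and entropy uses the natural logarithm.

```python
from collections import Counter
from math import log


def ngram_counts(texts, max_len):
    """Count every substring of length 1..max_len in the corpus."""
    counts = Counter()
    for text in texts:
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def cohesion(word, counts, total_chars):
    """Smallest p(word) / (p(left) * p(right)) over all two-way splits.

    Requires len(word) >= 2 and that every piece occurs in counts.
    """
    p_word = counts[word] / total_chars
    return min(
        p_word / ((counts[word[:k]] / total_chars) * (counts[word[k:]] / total_chars))
        for k in range(1, len(word))
    )


def entropy(counter):
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * log(c / total) for c in counter.values())


def neighbor_entropy(word, texts):
    """Return (left_entropy, right_entropy) of the characters around word."""
    left, right = Counter(), Counter()
    for text in texts:
        start = text.find(word)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1
            end = start + len(word)
            if end < len(text):
                right[text[end]] += 1
            start = text.find(word, start + 1)
    return entropy(left), entropy(right)


# Toy corpus: "ab" appears in varied contexts, so it looks like a word.
TEXTS = ["xaby", "zabw", "ab"]
COUNTS = ngram_counts(TEXTS, max_len=2)
TOTAL = sum(len(t) for t in TEXTS)

print(round(cohesion("ab", COUNTS, TOTAL), 3))
print(tuple(round(h, 3) for h in neighbor_entropy("ab", TEXTS)))
```

A candidate is kept only when both the cohesion and the smaller of its two neighbor entropies clear their thresholds.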
With a maximum candidate length of 5, a minimum cohesion of 1, and a minimum left and right neighbor entropy of 2, 42,368 candidate words were finally selected, about 96% of which occur in the danmaku text with a frequency below one in ten thousand. Because the candidates include many ordinary words, a conventional Chinese lexicon was subtracted from the candidate set. (If the labeling work can be completed, a "danmaku-specific lexicon" could gradually be built up.) The 200 most frequent new words generate the following word cloud:
Comparing the high-frequency danmaku words with the earlier tag words shows that the two overlap heavily in meaning; it is just that tags tend to use full official names while danmaku use abbreviations. The new words extracted from the danmaku were imported into jieba to perform word segmentation. While checking the segmentation, I found that words containing hiragana or katakana were not parsed. Tracing through the jieba source code, I found that its regular-expression filter keeps only Chinese characters, letters, digits, and a few symbols. Fortunately, most danmaku containing kana appear in lyrics and subtitles, so this does not affect the overall picture. Now the segmented danmaku text is loaded into gensim (a Python implementation of word2vec) with the word-vector dimension set to 120, to see what kind of word-vector model can be trained.
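The kana issue can be reproduced with a simplified character filter of the kind described above. The pattern below is an illustrative assumption and not a copy of jieba's actual expression; it only shows why text outside the allowed ranges is cut out, and how adding the hiragana and katakana blocks (U+3040 to U+30FF) would keep it.

```python
import re

# Simplified filter: CJK ideographs plus ASCII letters and digits.
# Illustrative assumption only, not jieba's real pattern.
HAN_ONLY = re.compile(r"[\u4e00-\u9fd5a-zA-Z0-9]+")

# The same filter extended with the hiragana and katakana blocks.
HAN_AND_KANA = re.compile(r"[\u4e00-\u9fd5\u3040-\u30ffa-zA-Z0-9]+")


def kept_spans(pattern, text):
    """Return the runs of text that survive the character filter."""
    return pattern.findall(text)


SAMPLE_TEXT = "東方のうたBGM"

print(kept_spans(HAN_ONLY, SAMPLE_TEXT))      # kana are dropped, text is split
print(kept_spans(HAN_AND_KANA, SAMPLE_TEXT))  # the whole string survives
```

With the first pattern the kana run disappears and the remaining text is split into separate blocks before segmentation even starts, which matches the behavior observed on the danmaku.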
As in the earlier co-occurrence analysis, similarity is again measured by the cosine. Let's look at some interesting word vectors:
>>> First question: Who is Merry <<<
(Who knows: did Yukari dream she was Merry, or did Merry dream she was Yukari?)
>>> Second question: Where is Hakugyokurou <<<
(In the Netherworld, of course; the others are also place names in Gensokyo)
>>> Third question: Which CP do you like <<<
>>> Fourth question: Besides Touhou, what other "faiths" appear in the danmaku <<<
Because fan-made settings are so widespread, Touhou Project has almost no absolutely strict character design. Now let's try to use the word vectors to explore how closely some characters are related. With the word-vector model trained earlier, lists of synonymous character names can be compiled semi-automatically. After merging the synonymous names of the characters from Embodiment of Scarlet Devil through Mountain of Faith, a new word-vector model was trained. From it, 34 characters were selected, their pairwise similarities were computed, and a character relationship graph was drawn with Gephi. The thickness of a line indicates the similarity between two characters. The most striking thing in the graph is a character who does not even have a formal name, Daiyousei (the Great Fairy, nicknamed "Dai-chan"), who has the highest relatedness among these 34 characters; one cannot help recalling the dark setting in "The Hidden Biosphere".
Character relationships in Embodiment of Scarlet Devil, Perfect Cherry Blossom, Imperishable Night, and Mountain of Faith
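One plausible way to hand such pairwise similarities to Gephi is a CSV edge list with Source, Target, and Weight columns, which Gephi's spreadsheet importer accepts. The sketch below assumes the character vectors are already available as a dictionary; the three 2-D vectors are made up and merely stand in for the trained word vectors.

```python
import csv
import io
from itertools import combinations
from math import sqrt


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def edge_rows(vectors, min_similarity=0.0):
    """One (source, target, weight) row per character pair above the threshold."""
    rows = []
    for a, b in combinations(sorted(vectors), 2):
        similarity = cosine(vectors[a], vectors[b])
        if similarity > min_similarity:
            rows.append((a, b, round(similarity, 4)))
    return rows


def to_edge_csv(rows):
    """Serialise the rows as CSV text with a Source,Target,Weight header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up vectors standing in for trained character embeddings.
CHARACTER_VECTORS = {
    "Cirno": [1.0, 0.0],
    "Daiyousei": [1.0, 1.0],
    "Hakurei Reimu": [0.0, 1.0],
}

EDGES = edge_rows(CHARACTER_VECTORS)
print(to_edge_csv(EDGES))
```

Mapping the Weight column to edge thickness in Gephi then gives the kind of line widths described above.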
Many hardcore Touhou fans have suggested that Station B set up a dedicated "Touhou zone"; I believe this is really unnecessary. If users are guided to add and remove tags, supplemented by further dimensions such as danmaku topic extraction and audience characteristics, it should be possible to design a flexible zoning mechanism that also avoids the bloat of fixed zone navigation. However, it may be more important than a flexible partitioning system (the mobile app has been fully deployed).
The development of Japan's commercial ACG industry is inseparable from decades of doujin creation. Doujin creation can serve as a talent reserve for commercial ACG, and also as an experimental probe of where the industry might go. Compared with commercial ACG works backed by huge investment, Touhou Project, with its unique charm, has been a miracle of the Heisei era.
This article examined the spread and development of Touhou Project since the founding of bilibili, an important hub of ACG culture in China. Along the way, the tags of submissions were clustered effectively and the popularity of submissions was explored along several dimensions. Finally, moderate text mining was performed on the danmaku sent by viewers: important new words were extracted, a word-vector model was built, and the relationships between characters were analyzed.
Wartena, C., Brussee, R., & Wibbels, M. (2009, November). Using tag co-occurrence for recommendation. In Ninth International Conference on Intelligent Systems Design and Applications (ISDA '09) (pp. 273-278). IEEE.
Gu Sen. (2012, August 10). Sociolinguistics in the Internet Age: Text Data Mining Based on SNS. Matrix67 Blog.