social bookmarking

Research Methodology based on RSS Feeds and Social Bookmarking (Part 3, analysis)

In this part I analyze the data presented in Part 2. Below are some charts, which I comment on at the bottom of the post. The analysis applies strictly to this simple scenario.

General data for both searches (blog and news)

Chart 1: entries per day (with all key words combined)

Chart 2: entries per day of the week (all keywords combined)

Blog search

Chart 3: entries per day (broken down by subscription)

Chart 4: ratio between unique and repeated entries

Chart 5: preselections and bookmarks per day, next to unique and total entries

News search

Chart 6: entries per day (broken down by subscription)

Chart 7: ratio between unique and repeated entries

Chart 8: preselections and bookmarks per day, next to unique and total entries

Some obvious things first

Looking at charts 3 and 6, where the number of new entries per day is broken down by search, say, “dissemination knowledge culture web 2.0” and “culture web 2.0”, we can see that, at least in this very specific scenario, the fewer keywords used, the higher the number of results. This should not come as a surprise, since search results in general behave the same way: fewer keywords yield more general results.

In a study of information diffusion through the blogosphere, Gruhl et al. [1] analyzed a set of 401,021 postings from 11,804 RSS feeds, with an average of 2k-10k posts per day, and found that weekends are low in intensity while activity peaks midweek. As my humble observation shows, this is also the case here.
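The weekday breakdown behind a chart like Chart 2 can be sketched in a few lines of Python. This is only an illustration: the function name and the sample counts are made up, and the real figures are the ones shown in the charts.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def entries_per_weekday(daily_counts):
    """Sum daily entry counts by day of the week.

    daily_counts maps a date to the number of entries collected that day.
    """
    totals = Counter()
    for day, count in daily_counts.items():
        totals[WEEKDAYS[day.weekday()]] += count
    # Return every weekday, including those without any entries.
    return {name: totals[name] for name in WEEKDAYS}

# Made-up counts for a single week, for illustration only.
sample = {
    date(2008, 3, 19): 25,  # Wednesday
    date(2008, 3, 20): 22,  # Thursday
    date(2008, 3, 21): 18,  # Friday
    date(2008, 3, 22): 9,   # Saturday
    date(2008, 3, 23): 7,   # Sunday
    date(2008, 3, 24): 20,  # Monday
    date(2008, 3, 25): 24,  # Tuesday
}
print(entries_per_weekday(sample))
```

Feeding the function the counts of the whole collection period, instead of one week, would give the totals per day of the week.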

Number of unique entries

When comparing charts 4 and 7, one can clearly see that the blog search turns up a much higher ratio of unique entries: 81.5 % of the blog search feeds were unique, whereas only about half of the news feeds (56.85 %) were not repeated. I assume this has to do with how many blogs there are in the blogosphere, compared to the rather limited universe of news providers searched by Google. What I also noticed during the collection phase is that many news feeds tend to be repeated for 1-2 cycles, showing up again on the next day of collection. In addition, because the sets of keywords are progressively reduced in complexity but still share some of the same words, results shown for the more complex searches (with more words) are likely to show up as well for those of lower complexity (with fewer words).
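Counting unique and repeated entries across subscriptions could be automated roughly as follows. This sketch assumes that an entry is identified by its URL, which is a simplification; the subscription labels and URLs are placeholders, not collected data.

```python
def count_unique_and_repeated(entries):
    """Split collected entries into unique and repeated, keyed by URL.

    entries is a list of (subscription, url) pairs in collection order.
    The first time a URL is seen it counts as unique; any later
    occurrence, in the same or in another subscription, counts as repeated.
    """
    seen = set()
    unique = repeated = 0
    for _subscription, url in entries:
        if url in seen:
            repeated += 1
        else:
            seen.add(url)
            unique += 1
    return unique, repeated

# Hypothetical entries; the URLs are placeholders.
collected = [
    ("dissemination knowledge web 2.0", "http://example.org/post-1"),
    ("knowledge web 2.0", "http://example.org/post-1"),  # same post, simpler query
    ("knowledge web 2.0", "http://example.org/post-2"),
    ("culture web 2.0", "http://example.org/post-2"),
    ("culture web 2.0", "http://example.org/post-3"),
]
unique, repeated = count_unique_and_repeated(collected)
print(f"unique: {unique}, repeated: {repeated}, "
      f"unique share: {unique / len(collected):.1%}")
```

The same post turning up under a more complex and a simpler query is counted once as unique and once as repeated, which is the overlap effect described above.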

Preselections and bookmarks

Charts 5 and 8 show the number of preselections and bookmarks for both the blog and news searches. Here it is clear that the blogosphere is a much more fertile ground when it comes to results. Of all unique entries, 18.16 % of the blog feeds were preselected for further reading (by adding a star in Google Reader), while only 9.04 % of the news feeds resulted in preselections. In relation to the total amount of entries (unique and non-unique), this difference becomes even more blatant: 14.80 % of all blog search feeds became preselections, as opposed to about a third of that (5.14 %) in the news search.

Bookmarks stretch the difference even further. In the blog search, bookmarked entries make up 6.69 % of all unique entries and 5.45 % of all entries. These figures are 2.11 % and 1.20 % for the news search, respectively. That means that the blog search returns roughly 3 times as many bookmarks per unique entry as the news search, and about 4.5 times as many when all entries are taken into consideration.
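The bookmark rates and the ratios between them can be recomputed from the counts in Tables 1 and 2 of Part 2. The snippet below is just a check of the arithmetic; the helper name share is my own.

```python
# Counts taken from Tables 1 and 2 in Part 2.
blog = {"total": 642, "unique": 523, "preselections": 95, "bookmarks": 35}
news = {"total": 584, "unique": 332, "preselections": 30, "bookmarks": 7}

def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

for base in ("unique", "total"):
    blog_rate = share(blog["bookmarks"], blog[base])
    news_rate = share(news["bookmarks"], news[base])
    # The ratio tells how many times higher the blog rate is.
    print(f"bookmarks per {base} entries: blog {blog_rate:.2f} %, "
          f"news {news_rate:.2f} %, ratio {blog_rate / news_rate:.2f}")
```

Swapping "bookmarks" for "preselections" in the loop gives the corresponding comparison for the preselection rates.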

[1] D. Gruhl, R. V. Guha, D. Liben-Nowell, and A. Tomkins. Information diffusion through blogspace. In WWW, pages 491-501, 2004.

rss
social bookmarking
web 2.0

Research Methodology based on RSS Feeds and Social Bookmarking (Part 2, data collection)

As I described in Part 1, I have been running a little research project to verify whether RSS feeds generated by blog and news searches are of any worth. In the following parts I will present the final data, some analysis and conclusions. Please refer to the original post for a more detailed explanation of the methodology and its motivations.

Scope

The scope of this research is very narrow, and its results apply only to the selected keywords below. Therefore, the results should not be generalized. A more representative study would need a much larger sample of random keywords, possibly in different languages, using different search engines and more rigorous data collection. Nonetheless, the results shed some light on using RSS feeds as an additional way of gathering information, especially in the exploratory phases of research.

Data collection

As described in the introductory post, I created a series of subscriptions with some keywords for a Google blog search and a Google news search. The keywords were taken from the first part of the title of my master’s thesis, “Dissemination of Knowledge and Culture on Web 2.0: a Case for Brazilian Music”, and used for the different searches at different levels of complexity, as follows:

  • “dissemination knowledge culture web 2.0”
  • “dissemination knowledge web 2.0”
  • “dissemination culture web 2.0”
  • “knowledge culture web 2.0”
  • “knowledge web 2.0”
  • “culture web 2.0”

I observed each subscription on Google Reader daily, for a 30-day period, from 19 March until 17 April. For each set I considered the following variables:

  • Total Feeds: the daily number of feeds for a subscription
  • Overrides: how many repeated feeds were found within a subscription or across the whole set
  • Preselections: feeds that have been marked (or starred in Google Reader) for further reading
  • Primary leads: interesting leads that may serve as a reference derived from a preselection
  • Secondary leads: any interesting lead found within a primary lead
  • Bookmarks: effective sources found in the chain of leads that are bookmarked and tagged
  • New RSS feeds: new RSS subscriptions generated as a result of good content found within a blog or site that contains other articles of interest, and may serve as an expanding source
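For anyone who prefers to keep such a log in code rather than in a spreadsheet, the variables above map naturally onto a small record type. This is a sketch only: the class and field names are hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DailyRecord:
    """One day's observations for one subscription (names are hypothetical)."""
    subscription: str         # e.g. "blog: knowledge web 2.0"
    day: date
    total_feeds: int = 0      # all feeds returned that day
    overrides: int = 0        # repeated feeds
    preselections: int = 0    # starred for further reading
    primary_leads: int = 0
    secondary_leads: int = 0
    bookmarks: int = 0
    new_rss_feeds: int = 0

    @property
    def unique_feeds(self) -> int:
        # Unique feeds are what is left after removing the repeats.
        return self.total_feeds - self.overrides

# Made-up values for illustration.
record = DailyRecord("blog: knowledge web 2.0", date(2008, 3, 19),
                     total_feeds=12, overrides=3, preselections=2)
print(record.unique_feeds)
```

Summing the fields of all records for one search type would give consolidated figures of the kind shown in the tables.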

Table 1: Consolidated data table for the Google Blog Search (all keywords combined)

                 Total          % (of unique items)   % (of preselections)   % (of all leads)
Total entries    642            ---                   ---                    ---
Unique entries   523 (81.5 %)   81.5 %                ---                    ---
Preselections    95             18.16 %               ---                    ---
Primary leads    25             4.78 %                26.32 %                58.14 %
Secondary leads  18             --- *                 15.13 %                41.86 %
Bookmarks        35             6.69 %                36.84 %                81.40 %
New RSS Feeds    4              0.76 %                4.21 %                 9.30 %

Table 2: Consolidated data table for the Google News Search (all keywords combined)

                 Total          % (of unique items)   % (of preselections)   % (of all leads)
Total entries    584            ---                   ---                    ---
Unique entries   332 (56.85 %)  56.85 %               ---                    ---
Preselections    30             9.04 %                ---                    ---
Primary leads    6              1.81 %                20.00 %                75.00 %
Secondary leads  2              --- *                 0.79 %                 25.00 %
Bookmarks        7              2.11 %                23.33 %                87.50 %
New RSS Feeds    0              0.00 %                0.00 %                 0.00 %

dissemination of knowledge
rss
social bookmarking
web 2.0

Research Methodology based on RSS Feeds and Social Bookmarking (Part 1)

The research methodology for my master’s thesis is largely based on RSS feeds, with good leads leading to tagged bookmarks (with del.icio.us) for future reference. The best possible outcome is not only a usable concrete reference, but also other RSS feeds that may serve as an expanding source.

The methodology

In order to demonstrate and test the effectiveness of the methodology, I have been running a little experiment for 9 days now. My thesis runs under the title “Dissemination of Knowledge and Culture on Web 2.0: a Case for Brazilian Music”. Using the first part of the title, I have created RSS feeds based on Google Blog and News searches. I have tried to make Technorati and Digg subscriptions, but they have returned unreliable results*, especially with Google Reader**, my RSS reader of choice.

The queries, without prepositions and conjunctions and at different levels of complexity, are as follows:

  • x.1 “dissemination knowledge culture web 2.0”
  • x.2 “dissemination knowledge web 2.0”
  • x.3 “dissemination culture web 2.0”
  • x.4 “knowledge culture web 2.0”
  • x.5 “knowledge web 2.0”
  • x.6 “culture web 2.0”

Every search query has essentially become an RSS feed subscription for a blog search (A) and a news search (B), ergo A.1, A.2, B.1, B.2, etc. With the subscriptions in place, I have used a few variables to track each of them individually on a daily basis:

  • Total Feeds: the daily number of feeds for a subscription
  • Overrides: how many repeated feeds were found within a subscription or across the whole set
  • Pre-selections: feeds that have been marked (or starred in Google Reader) for further reading
  • Primary leads: interesting*** leads that may serve as a reference derived from a pre-selection
  • Secondary leads: any interesting lead found within a primary lead (e.g. a hyperlink inside of a lead****)
  • Bookmarks: effective sources found in the chain of leads that are bookmarked and tagged
  • New RSS feeds: new rss subscriptions generated as a result of good content found within a blog or site that contains other articles of interest, and may serve as an expanding source
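The labelling scheme described above (A.1 to A.6 for the blog search, B.1 to B.6 for the news search) can be generated mechanically. The following is a small sketch of that bookkeeping; the function and variable names are mine, and it deliberately does not build any actual feed URLs.

```python
QUERIES = [
    "dissemination knowledge culture web 2.0",
    "dissemination knowledge web 2.0",
    "dissemination culture web 2.0",
    "knowledge culture web 2.0",
    "knowledge web 2.0",
    "culture web 2.0",
]
SEARCHES = {"A": "blog search", "B": "news search"}

def build_subscriptions(queries, searches):
    """Return a dict mapping labels such as 'A.1' to (search type, query)."""
    return {
        f"{letter}.{number}": (search, query)
        for letter, search in searches.items()
        for number, query in enumerate(queries, start=1)
    }

subscriptions = build_subscriptions(QUERIES, SEARCHES)
for label, (search, query) in subscriptions.items():
    print(f"{label}: {search} for \"{query}\"")
```

Six queries times two search types gives twelve subscriptions, each of which is then tracked daily with the variables listed above.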

Preliminary results

Within the 9-day period between 19 March 2008 and 27 March 2008, the blog search (A) and the news search (B) returned, for all their subscriptions combined, a total of 182 and 198 entries respectively, a rate of 20.22 articles per day for the blog search and 22 for the news search. When we look a bit closer at the numbers, we see that the news search had a higher number of repeats, 78 articles, whereas the blog search had only 37. Therefore, the blog search had a higher rate of unique articles, 79.67% (145 entries), as opposed to 60.61% (120 entries) for the news search.

The blog search had 27 pre-selections out of 145 unique entries, an 18.62% ratio. After my readings, 6 out of these 27 turned out to be primary leads (22.22%). In the news search, 18 out of 120 unique entries became pre-selections, a rate of 15%, and 4 of these pre-selections became primary leads, coincidentally also a rate of 22.22%.

To make it more readable:

(A) Blog search:
Total entries: 182
Repeated entries: 37 (20.33% of total entries)
Unique entries: 145 (79.67% of total entries)
Pre-selections: 27 (18.62% of unique entries)
Primary leads: 6 (22.22% of pre-selections)
Secondary leads: 4
Bookmarks: 7 (70% of all leads combined, 25.93% of pre-selections, 4.83% of unique entries)
New RSS Feeds: 2 (20% of all leads combined, 7.41% of pre-selections, 1.38% of all unique entries)

(B) News search:
Total entries: 198
Repeated entries: 78 (39.39% of total entries)
Unique entries: 120 (60.61% of total entries)
Pre-selections: 18 (15% of unique entries)
Primary leads: 4 (22.22% of pre-selections)
Secondary leads: 2
Bookmarks: 5 (83.33% of all leads combined, 27.78% of pre-selections, 4.17% of unique entries)
New RSS Feeds: 0
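The percentages in the two lists above follow directly from the raw counts. As a sanity check, here is a minimal Python sketch that recomputes the main ones; the function and key names are my own choice.

```python
def summarize(total, repeated, preselections, primary, secondary, bookmarks):
    """Recompute the main percentages from the raw counts of one search."""
    unique = total - repeated
    leads = primary + secondary  # all leads combined
    return {
        "unique": unique,
        "unique_pct_of_total": round(100 * unique / total, 2),
        "preselections_pct_of_unique": round(100 * preselections / unique, 2),
        "primary_pct_of_preselections": round(100 * primary / preselections, 2),
        "bookmarks_pct_of_leads": round(100 * bookmarks / leads, 2),
        "bookmarks_pct_of_unique": round(100 * bookmarks / unique, 2),
    }

# Counts from the preliminary results above.
blog = summarize(total=182, repeated=37, preselections=27,
                 primary=6, secondary=4, bookmarks=7)
news = summarize(total=198, repeated=78, preselections=18,
                 primary=4, secondary=2, bookmarks=5)

for name, summary in (("blog", blog), ("news", news)):
    print(name, summary)
```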

Early observations

After 9 days of observation, it seems safe to say that news searches have a much higher repeat rate than blog searches. This may be due to the limited number of sources used by the Google News search and to differences in their search algorithms. Without getting into the subjective quality of the content (more on that later), both the blog and the news searches gave roughly similar results in relation to the number of bookmarks generated.

I will be trying to expand this observation a bit further and will be posting the results here.


jp
Notes:

* Technorati and Digg did not produce feeds if the first search did not return any results.

** Maybe I am part of a vendor lock-in, but I have also used Google Reader as my RSS reader of choice. Although it lacks basic sorting functions and its statistics tools are limited to only the last 30 days, it has served me well for the purpose of this study, and it’s online, accessible from anywhere.

*** Interesting is a very broad term here. It denotes what I would personally find fit to use as a reference for my thesis research. Naturally, other subjects would have preselected and bookmarked different entries. The purpose of this analysis is primarily to describe how RSS feeds can be used as an exploratory research method, not the quality of the leads found in the results.

**** Secondary leads also include any leads found within the navigational path starting at the primary lead. For example, a preselection has turned up a good primary lead, say, a blog entry from Prof. John Doe. In his article, Prof. Doe mentions the work of Dr. Lorem Ipsum (a secondary lead). In this secondary lead, the article of Dr. Ipsum, there’s a link to another piece of information, and so it goes.

rss
social bookmarking
web 2.0
