Rankings Study

Lead: Chris Dreyer (rankings.io)

Support: Jiddu Alexander, Cédric Scherer & Daniel Kupka (frontpagedata.com)

Last updated on June 18, 2020

1 Introduction

With Google evaluating sites based on a variety of ranking factors, knowing which ranking factors to focus your SEO strategy on for the biggest impact is crucial.

Several large-scale data studies, mainly conducted by SEO vendors, have sought to uncover the relevance and importance of certain ranking factors. However, in our view, most of the studies contain severe statistical and methodological flaws. In addition, all studies have taken an industry-wide perspective, neglecting the peculiarities of certain industries/niches.

The main goal of this study is to provide guidance on the relationship between SEO features and Google organic search results within the personal injury practice niche.

The study was conducted between January and March 2020.

 

1.1 Methodology

  • Step 1 Keyword Selection: To obtain relevant search queries, we first downloaded a city data set with a total of 28,000 city names (https://simplemaps.com/data/us-cities). Second, we combined each city with the two relevant keyword phrases, namely “car accident lawyer” and “personal injury lawyer”. For Minneapolis, for example, the following keyword combinations were possible: “car accident lawyer minneapolis”, “minneapolis car accident lawyer”, “minneapolis personal injury lawyer” and “personal injury lawyer minneapolis”.
  • Step 2 Data Extraction: To extract backlink data as well as the SERP results, we uploaded the keyword combinations to Ahrefs Keyword Explorer and downloaded the respective data. Please note: for the vast majority of search query combinations in Ahrefs, search volume was too low to yield any meaningful data points. We therefore filtered for those search query combinations that exhibited higher search volumes. For example, given “car accident lawyer minneapolis” (high monthly search volume) and “minneapolis car accident lawyer” (low monthly search volume), we kept the SERPs and associated data points for “car accident lawyer minneapolis”. That way, we ensured that we look only at the most relevant data points with the highest search volume and avoid duplicated data (see the code sketch after this list). The data extraction was conducted in January 2020.
  • Step 3 Data Mining: In the last step of sourcing the raw data, we extracted the following data points from the URLs: “title”, “meta_description”, “h1_tag”, “h2_tag”, “h3_tag”, “word_count”, “images_amount”, “videos_amount”, “broken_links_amount”, “internal_links_amount”, “unique_internal_links_amount”, “external_links_amount”, “unique_external_links_amount”, “no_follow_links_amount”, “follow_links_amount”, “links_anchor_text”, “schema_markup_exists”, “domain_name_registration_date”, “page_size_html”, “facebook_exists”, “linkedin_exists”, “pinterest_exists”, “instagram_exists” and “youtube_exists”, plus derived variables such as the age of the domain.
  • Step 4 Data Analysis: The data was analysed and processed for selected features to show whether they have a positive or negative trend on Google ranking positions. Polynomial regression was applied to all numeric variables. Linear regression was used on yes-or-no variables such as HTTPS, as well as on numeric variables to provide simple average trends. For some variables, outlier behaviour was identified, mostly caused by the larger, more authoritative domains (e.g. lawyers.findlaw); this was accounted for in the regression analysis. The potential of each feature for the average website was derived and ranked to show the features with the most to gain. Lastly, an Extreme Gradient Boosting (XGBoost) machine learning algorithm was tested with the Leave-One-Feature-Out method to determine the importance of each feature (see the Appendix). The analysis was conducted between February and March 2020.
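For illustration, below is a minimal Python sketch of Steps 1 and 2: generating the keyword variants per city and keeping only the highest-volume variant per set of words. The file name us-cities.csv and the get_volume helper are hypothetical placeholders; the actual pipeline used the simplemaps export and Ahrefs data.

    import csv

    PHRASES = ["car accident lawyer", "personal injury lawyer"]

    def keyword_combinations(city):
        """Build the query variants for one city (phrase before and after)."""
        return [q for p in PHRASES for q in (f"{p} {city}", f"{city} {p}")]

    # "us-cities.csv" stands in for https://simplemaps.com/data/us-cities
    with open("us-cities.csv", newline="") as f:
        cities = [row["city"].lower() for row in csv.DictReader(f)]

    queries = [q for city in cities for q in keyword_combinations(city)]

    def dedupe_by_volume(queries, get_volume):
        """Keep one variant per word set: the one with the highest volume.

        get_volume is a placeholder for a lookup into the Ahrefs export.
        """
        best = {}
        for q in queries:
            key = frozenset(q.split())  # variants differ only in word order
            if key not in best or get_volume(q) > get_volume(best[key]):
                best[key] = q
        return list(best.values())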
 

1.2 Cleaning the Data: What Information Do We Keep for Analysis?

We removed all URLs with HTTP status codes other than 200 from the data set. Unfortunately, due to anti-mining mechanisms on some of the directory websites, we weren't able to get page-level data for yelp.com (430 observations; 2.7% of total), avvo.com (413 observations; 2.6%), and lawyers.com (217; 1.7%). However, we decided to include those three larger domains in the backlink and domain rating analysis.

In addition, while data on referring domains was provided, Ahrefs did not provide any data points on the number of backlinks. Throughout the report, we therefore use the terms backlinks and referring domains interchangeably. Ahrefs also did not give us any data on URL rating.

Furthermore, we only took URLs into account that ranked in organic search results. Hence, the final data set contains all organic links returned at positions 1 to 20 in Google search with an HTTP 200 status code.
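A minimal pandas sketch of this cleaning step, assuming the raw SERP export lives in a data frame with status_code, link_type and position columns (the column names are our assumption, not the actual schema):

    import pandas as pd

    raw = pd.read_csv("serp_raw.csv")  # hypothetical export of the raw SERP data

    clean = raw[
        (raw["status_code"] == 200)         # HTTP 200 only
        & (raw["link_type"] == "organic")   # organic results, no sitelinks
        & (raw["position"].between(1, 20))  # positions 1 to 20
    ].copy()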

 

1.3 What does the Clean Data Look Like?

The resulting final data set contained 8,201 unique URLs (excluding avvo.com, lawyers.com, and yelp.com data). After the data cleaning step, 22 raw variables with a total of 305,537 data points were available for further analyses. Most of these variables have a sample size of approximately 14,500 values. Eight variables had a considerable amount of missing information, with a minimum sample size of around 12,900 (cost per click and year of registration) and around 13,080 (page size and information on social media channels).

The data set contains 818 distinct keywords. On average, the keywords have a monthly search volume of 114 with a mean Ahrefs difficulty score of 13.5, and the URLs have 32.4 referring domains.

For each of the variables of interest, we visualize the average values per position on Google. We use only data on organic links and exclude sitelinks, which are also provided by Google (and are assigned to the same position).
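As a sketch of how such per-position averages and trends can be computed, assuming the clean data frame from the previous section, with word_count as the example feature and an illustrative polynomial degree of 2 (the report does not state the exact degree used):

    import numpy as np

    # average feature value for each organic position (1 to 20)
    avg = clean.groupby("position")["word_count"].mean()

    # polynomial trend through the per-position averages
    coeffs = np.polyfit(avg.index, avg.values, deg=2)
    trend = np.polyval(coeffs, avg.index)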

2 Research Findings

In this section, we analyse how different ranking factors relate to higher organic positions in the Search Engine Results Pages (SERPs).

More specifically, we look at the following factors:

  • Domain Factors
  • Site-level Factors
  • Backlink Factors
  • Page-level Factors
  • Brand Signals
  • Other Factors
 

2.0.1 The Role of Lawyer Directory Domains

Before we delve into the analysis, it is important to showcase the role of lawyer directory websites in the SERPs. A large share of the lawyer directories (lawyers.findlaw, attorneys.superlawyers, justia, expertise, yelp) rank in top positions; avvo, thumbtack and lawyers.com are the exceptions.

For instance, lawyers.findlaw pages rank on average in 4th position with a median of 0 referring domains. The graph below shows the distribution of positions for the pages of the largest domains compared to the rest (“Other”).

Throughout the report, we distinguish URLs between the larger domains and smaller ones to provide a more granular analysis.

Domain                 | # of Records | % of total | Position (average) | Backlinks (median)
lawyers.findlaw        | 897          | 5.6        | 4.0                | 0
attorneys.superlawyers | 789          | 4.9        | 6.2                | 0
justia                 | 570          | 3.5        | 6.5                | 2
yelp                   | 430          | 2.7        | 6.1                | 0
avvo                   | 413          | 2.6        | 13.7               | 1
expertise              | 343          | 2.1        | 5.5                | 1
thumbtack              | 305          | 1.9        | 11.2               | 0
lawyers                | 271          | 1.7        | 13.6               | 0
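The summary above can be reproduced with a simple aggregation over the cleaned data; a sketch, assuming domain, url, position and referring_domains columns (the column names are our assumption):

    DIRECTORIES = {"lawyers.findlaw", "attorneys.superlawyers", "justia",
                   "yelp", "avvo", "expertise", "thumbtack", "lawyers"}

    # bucket every URL as either one of the large directories or "Other"
    clean["domain_group"] = clean["domain"].where(
        clean["domain"].isin(DIRECTORIES), other="Other")

    summary = clean.groupby("domain_group").agg(
        records=("url", "size"),
        avg_position=("position", "mean"),
        median_backlinks=("referring_domains", "median"),
    )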

 

2.1 Domain Factors

 

2.1.1 Older Domains Tend to Rank Higher in Google

Key takeaways:

  • Overall, our analysis indicates that older domains tend to rank higher in the SERPs. In general, roughly one position in the SERPs is gained for every 6 years of domain age.
 

2.2 Site-level Factors

 

2.4 Page-level Factors

 

2.4.1 A Page on an Authoritative Domain Will Rank Higher Than a Page on a Domain with Less Authority

Key takeaways:

  • Throughout the data, pages with higher domain ratings tend to rank higher in organic SERPs. On average, a 12-point increase in domain rating translates into a one-position gain in the SERPs.
  • There is a group of domains with domain ratings of over 75, consisting mainly of directory domains. Domain rating is a feature where they stand out from the rest of the pages.
 

2.4.2 Pages with Higher Word Count Tend to Outrank Pages with Lower Word Count

Key takeaways:

  • According to our data, text-rich pages rank higher. The sweet spot lies around 3,000 words. For URLs that belong to less authoritative domains, approximately every 700 additional words may lead to an increase of one position (up to a maximum of 3,000 words).
 

2.4.3 Higher # of Images Correlates with Higher Positions

Key takeaways:

  • The number of images on a page correlates positively with higher Google Rankings.
 

2.4.6 Optimized Titles, Meta Descriptions, H1 and H2 Tags Help with SERP Rankings

Caption: “Source: Rankings.io”

Key takeaways:

  • Our data set consists of longtail keywords, meaning that each search query is composed of at least 4 words. With regards to the matching, if 3 out of 4 words pair up with the title tag, for example, we assign a score of 75%. Conversely, if only 1 word matches up with the title, we assign a score of 25%. This approach has been applied to all HTML tags (see the sketch after this list).
  • Exact matches are common in the title, meta description, H1 and H2 tags. Up to 3 positions can be gained by exact matching instead of not matching at all.
  • For the domain name, its URL sub-directories and the H3 tags, exact matching is rare; however, a partial match with the domain name or its URL sub-directories is clearly beneficial for SERP rankings, according to our data.
  • We also noticed that most HTML tags are already optimised. Therefore, only a few websites can still benefit from adding relevant keywords to their tags.
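Below is a minimal sketch of the matching score described above; it is our reconstruction of the scoring logic, not the study's actual code:

    def match_score(query, tag_text):
        """Share of query words that also appear in the HTML tag text."""
        query_words = query.lower().split()
        tag_words = set(tag_text.lower().split())
        hits = sum(word in tag_words for word in query_words)
        return hits / len(query_words)

    # 3 of the 4 query words appear in this (made-up) title -> 0.75, i.e. 75%
    match_score("car accident lawyer minneapolis",
                "Minneapolis Car Accident Attorneys - Example Firm")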
 

2.4.7 The Number of Keywords a Page Ranks for Does Not Have a Significant Impact on Organic SERP Rankings

Key takeaways:

  • If a page ranks for several other keywords, this may give Google an internal signal of quality. However, for our data at hand, this claim cannot be confirmed. There is too much uncertainty to identify any positive or negative correlation between the number of keywords a page ranks for and its corresponding position.
  • The pattern holds true regardless of whether a page belongs to the large directory domains or not. In fact, the larger domains rank for around the same number of keywords as the smaller domains do.
 

2.5 Brand Signals

 

2.5.1 Pages That Rank Higher Also Have Brand Signals

Key takeaways:

  • On average, domains with well-positioned pages have Facebook, Instagram and LinkedIn accounts. The only exception in our data is YouTube. While this trend does not indicate that social signals boost rankings, we rather hypothesise that better-positioned domains have set up social media accounts as part of their branding strategy.

3 Other findings

 

3.1 Difficulty Map

Key takeaways:

  • In total, our data set contains websites from 321 unique cities in the US.
  • We find spatial hotspots along the West coast (California) and the East coast (Florida, Washington, New York City, Boston).
  • Houston has on average the highest difficulty score (43.6), followed by Chicago (41.5), Salt Lake City (35), Corpus Christi (33.8), Topeka (32), Philadelphia (31.5), Boston (31.1), and Orlando (31).
  • The lowest (non-zero) scores can be found in Hollywood, Ontario, Huntington, and Roswell (all 0.5).
  • 32 cities have the lowest possible average score of 0%, e.g. Iowa City, Portsmouth, Key West and Richmond Hill.
  • The mean is 13.4 for all cities and key phrases.

4 Areas with Potential Improvement

Based on our data, we created a plot indicating the potential positions of improvement for the average website. It implies that most websites would benefit from adding external links, both do-follow and no-follow, in order to increase organic SERP rankings. The same holds true for adding additional images and increasing word count.

On the other hand, HTML tags such as title tags, as well as adding an SSL certificate, show low potential given that most pages have already optimised for them.

Appendix

 

Extra maps

 

Leave-One-Feature-Out importance: the model is not complex enough to highlight any variable as important

We chose a recently developed technique called Leave-One-Feature-Out importance (LOFO). The idea behind LOFO is to iteratively remove one independent variable at a time from the data set and measure how much predictive power is lost compared to the full model. If the prediction accuracy is not affected at all, the feature can be considered irrelevant for the task. On the other hand, removing an important feature should cause a large loss of accuracy.
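A minimal sketch of the LOFO idea, implemented directly with XGBoost and cross-validation rather than a dedicated LOFO library; the feature matrix X, the target y (positions) and all hyperparameters are illustrative assumptions, not the study's exact setup:

    from sklearn.model_selection import cross_val_score
    from xgboost import XGBRegressor

    def lofo_importance(X, y, cv=5):
        """Drop each feature in turn; importance = loss in CV score."""
        model = XGBRegressor(n_estimators=200, max_depth=3)
        base = cross_val_score(model, X, y, cv=cv).mean()
        return {
            # a positive value means the feature helped the model
            col: base - cross_val_score(model, X.drop(columns=col), y, cv=cv).mean()
            for col in X.columns
        }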

The Leave-One-Feature-Out model did not show any single feature to be critical to position ranking. The algorithm did not suffer or gain consistently when any one feature was left out for prediction. This process would work well if our prediction power were perfect. However, by Google's standards our available data set is small, and the algorithm is relatively simple compared to the multiple machine learning algorithms Google uses for its search results.

Below is a graph of the difference in model performance on the whole data set compared to the whole data set minus one variable, once for each variable. This should highlight which variables are more and less important. However, when running this for 25 iterations, it turned out that, within the variance, not a single variable could be selected as important in improving or hurting the model.