
Crawling is the new API – a legal and technical rough guide for the travel industry

For the past six years, we have been asking ourselves one fundamental question for our business.

“How do we get data from the transport providers we want to integrate?”

NB: This is an analysis by Veit Blumschein, CEO of FromAtoB.

After following the old-fashioned way of kindly asking each partner for access to its API, we faced the fact that all the money in the world would not suffice to finance this approach.

Therefore, we had to make a decision – either we wait for the market to open up for us or we open up the market ourselves, as our own crawling framework enables us to use the best interface every B2C company is offering: their own website.

We realized that we actually didn’t need any help from our partners for the integration: no maintained API, no complex security measures, no effortful documentation and no expensive technical support that the partner had to provide us with.

There are three perspectives I want to explain in order to prove my statement: technical, market-specific and legal, followed by an outlook and the key learnings from six years on the quest for gathering high quality travel data.

1. The technical perspective

A crawler is best described as a program that simulates a user’s behavior on a website, following all the steps a user takes with a browser, such as entering search parameters (e.g. destination, date, etc.), requesting a result by clicking on the search button and then scanning through the results.

That is the simple yet genius system of crawlers: everything a website shows to its users and customers, the crawler can also read, extract and store.
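
To make this concrete, here is a minimal sketch (in Python, not our production code – the URL and parameter names are invented) of a crawler performing a search the same way a user’s browser would, by sending the same request the search form sends:

    import requests

    # Hypothetical provider URL and query parameters -- a real site defines its own.
    params = {
        "origin": "Berlin",
        "destination": "Munich",
        "date": "2016-06-01",
    }
    response = requests.get("https://www.example-provider.com/search",
                            params=params, timeout=10)
    response.raise_for_status()
    html = response.text  # everything the site shows a user is now available to parse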


The extent and quality of code libraries and open source frameworks exploded over the past decade.

This especially concerns technologies that embody a fundamental principle of the open source movement: freely and easily accessible data – which is exactly what crawlers deliver.

This development affects all elements of a crawler.

There are analysis tools that allow us to request a search result without even loading the entire page, the downloader can decide while reading a website’s data whether it is worth storing, and advanced parsers filter the stored content almost in real time, turning unreadable markup into structured, re-usable data such as XML or JSON.
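
As a rough illustration (the CSS selectors and field names below are assumptions, not any particular provider’s markup), such a parser might look like this:

    import json
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    connections = []
    for row in soup.select("div.search-result"):          # assumed result container
        connections.append({
            "departure": row.select_one(".departure-time").get_text(strip=True),
            "arrival": row.select_one(".arrival-time").get_text(strip=True),
            "price": row.select_one(".price").get_text(strip=True),
        })

    print(json.dumps(connections, indent=2))  # unreadable markup becomes re-usable JSON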

Since most of the basic libraries are open source, our developers have the opportunity to adapt them to our specific needs.

For example, we taught our crawlers to adapt themselves to certain changes on the websites (for instance when a provider changes its layout) and to notify us if they fail.
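
The idea behind this tolerance to change can be sketched as trying several selectors that have worked in the past for the same field and raising an alert only when none of them matches any more (the selector list and the notify() hook below are illustrative placeholders, not our actual framework):

    # Try a list of selectors known from previous layouts; alert if all of them fail.
    FALLBACK_PRICE_SELECTORS = [".price", "span.fare-amount", "[data-price]"]

    def extract_price(soup, notify):
        for selector in FALLBACK_PRICE_SELECTORS:
            node = soup.select_one(selector)
            if node:
                return node.get_text(strip=True)
        notify("Price selector broken - provider layout probably changed")
        return None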

In general, the change of a website is a problem for a crawler – in fact, it is very similar to the change of an official API – which also happens from time to time.

In preparation for this article, I reviewed the changes we had to make for APIs over the past years and compared them with the number of adaptations we had to make to our crawlers due to structural changes of the websites.

In total, the ratio is only 1:2, and it has been decreasing over the years thanks to the tolerance to change (e.g. in design) that we taught our crawlers (see above).

Another interesting fact: if we compare downtimes, the ratio is almost the other way around, which means that – in our sample – websites are more available than regular APIs.

Of course, there are technical protection mechanisms that websites can use to fend off crawlers.

This is a sub-topic that has to be addressed, although in our experience (probably owing to our business model) it has not been an obstacle we had to overcome with further technical countermeasures.

In most cases, we have found that partners love to share their data with us, as we are a distribution channel for them.

Hence, it is standard for us to crawl websites with the partner’s prior consent. However, there are server-based intrusion detection and protection mechanisms.

The most common are blocking the IP addresses of the requesting server, rendering results in JavaScript, cloaking parts of the content or requesting Captchas from users – but all these hurdles can be overcome with state-of-the-art technology, cloud-based servers and free-of-charge frameworks.
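
For JavaScript-rendered results, for example, one common approach (a sketch using the open-source Selenium project rather than any specific stack; the URL and selector are hypothetical) is to drive a real browser headlessly and read the page only after it has rendered:

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless")           # no visible browser window needed
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://www.example-provider.com/search?from=Berlin&to=Munich")
        rows = driver.find_elements(By.CSS_SELECTOR, "div.search-result")  # assumed selector
        results = [row.text for row in rows]     # text of the fully rendered results
    finally:
        driver.quit()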

2. Market-specific concerns

This brings us directly to travel market specific problems and concerns. As mentioned above, with a good sales pitch, using your partner’s website as an API is a win-win strategy for both parties.

This has several very simple reasons: many players – even big ones – do not possess an efficient API, apart from some GDS-based interfaces that are extremely expensive and limited.

Although fromAtoB depends on the quality of data from its partners, I understand why some providers choose not to set up their own API.

You suddenly have another interface, apart from your own website, that you have to maintain and protect via complex security measures.

Documentation for third parties needs to be crafted and these third parties need to be supported with technical manpower.

Those are a lot of investments you have to pre-finance before you know whether there is a business case. Our answer: if you do not have an API, we do not need one.

We use the best maintained interface any company with e-commerce ambitions has: the website (or in some cases the mobile app).

In the past few years, we even switched back several times from using an existing API to crawling the websites, exactly for the reasons mentioned above.

In our cases: the API documentation was in Spanish; the partner twice forgot to tell us that they had changed the API keys; the partner’s backend changed and the API went down (but the website did not).

For all these reasons, we are now also offering providers we are crawling the opportunity to use our API so that they can redistribute their data to third parties (affiliates for instance).

“We don’t want you to crawl our website, as we fear that you will kill our servers with too many requests.”

That is a concern we are facing with many partners. Yes, and that was once true back in the 90s, when distributed crawlers sometimes had the same harmful effect as DDoS attacks.

But with modern “minimally invasive” crawlers you can specifically request the content you are looking for, without loading images, style sheets, videos, ads, etc.

Self-learning caching logic reduces this issue even more, especially for crawlers with a high data demand, as similar requests can be answered from our own previously stored database without requesting the partner’s website again.
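
A minimal sketch of such a cache (the fixed TTL and the fetch_from_website() function are placeholders – a self-learning version would adjust freshness per route instead of using one static window):

    import time

    CACHE = {}
    TTL_SECONDS = 600  # assume a result stays "fresh enough" for ten minutes

    def cached_search(origin, destination, date, fetch_from_website):
        key = (origin, destination, date)
        hit = CACHE.get(key)
        if hit and time.time() - hit[0] < TTL_SECONDS:
            return hit[1]                       # answered from our own stored results
        result = fetch_from_website(origin, destination, date)
        CACHE[key] = (time.time(), result)      # stored for the next similar request
        return result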

This is also the solution for another travel specific concern: the so-called “scan-to-book ratio” (also: look to book). The pure horror of every meta-search engine and website administrator – at least back in the days.

If we probe the causes of this problem, it is not, as one would expect, the cost of server capacity or DDoS concerns (as mentioned above).

No. It is, or at least was, in most cases caused by GDSs and their business model, in which a travel company had to pay for requesting and using their own data.

But since many companies nowadays host their own data or use purely performance-based third-party data services, this issue is slowly fading.

3. Legal implications

Several European countries have long been waiting for a precedent-setting judgment from a national supreme court on how to deal with crawlers, finally clarifying long-unanswered questions such as whether the search results on websites are copyrighted and how far a virtual householder’s rights extend.

Finally, the Bundesgerichtshof (the German supreme court) settled these questions in a case between Ryanair and Beins Travel Group (Cheaptickets.de), ruling that booking and search portals do not breach competition law when crawling data from a website.

The court goes even a step further by allowing booking portals to charge customers a commission.

The court follows the argument of the portal, which justifies the fee with a service for searching and processing the data.

However, I would like to emphasize once again what I have already mentioned above: the prior consent of the website operator is, in most cases, easy to get.

4. Outlook

Gathering data via crawlers is actually just the first step; covering the entire booking process on a larger scale via crawlers and bots comes next.

In the flight industry, there are already several online travel agencies that are not only crawling and extracting flight data from an airline’s website, but are also completing the entire booking process via a crawler (or rather a bot) by simulating the payment process on the website.

For several reasons, this is an easier process in the flight industry (e.g. via payment with a virtual credit card) than, for example, in the railway or even the public transportation sector.
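
Very roughly, such a booking bot is just the crawling approach extended to form submission (the checkout URL, field names and card number below are purely illustrative, and any real integration would of course need the provider’s consent and proper payment handling):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get("https://www.example-airline.com/checkout")  # hypothetical booking page
        driver.find_element(By.NAME, "passenger_name").send_keys("Jane Doe")
        driver.find_element(By.NAME, "card_number").send_keys("4111111111111111")  # virtual credit card
        driver.find_element(By.CSS_SELECTOR, "button.confirm-booking").click()
    finally:
        driver.quit()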

Still, in general the approach and solution stay the same and are definitely feasible within the next few years.

Key learnings and implications for the travel industry

Now, six years later, there are three very simple but essential learnings:

  • Crawlers make data fully accessible via websites: APIs are from the ’90s – crawlers are today’s interfaces!
  • Accordingly, the GDSs’ monopoly as a single source of travel data will fade
  • Even booking and payment will be handled via intelligent bots

What does this mean for the travel industry?

  • Does somebody who occasionally needs travel data have to build a crawler for every single website? No – there are specialized providers, such as Travelfusion or ourselves, offering that data to third parties through a web service
  • Meta-search engines using advanced crawler technology for specific sectors – such as aviation, hospitality, shuttles or bus services – will advance further, but they are only the first phase in the evolution of travel technology
  • Interlinking these sectors via easily accessible data and overcoming the disruption to book all-in-one through bots will trigger the next evolutionary phase: intermodal journey planners.



About the Writer :: Kevin May

Kevin May was a co-founder and member of the editorial team from September 2009 to June 2017.

 

Comments


  1. Roberto Da Re

    Having “some” experience in the field of scanning/scraping/crawling, just a few comments on this interesting topic thread:
    1. there are intelligent scrapers that know where an airline flies to and only pass “relevant” requests to that site, so there is no “waste of resource” but more the opportunity for the site to display their product whenever they have something to offer, even if originally the user is not aware of such airline offering that route.
    2. Some suppliers of Airline Res systems charge the airline if they want to distribute via an XML API. Screen scraping is a way to get around that cost for some of them
    3. Screen scraping does require constant support and a distributed infrastructure that gets more expensive almost by the day. An API is actually more complex to implement and still requires maintenance and support, however, it is normally done within a controlled environment which allows efficient companies to be more in control of costs and resources.
    4. AIR is a relatively standardised product .. so scraping is manageable, once you go beyond AIR, complexity goes up tenfold.

    If the objective is competitive monitoring and you are only after a subset of data, scraping can do the job; but if you want a reliable, predictable and properly integrated product offering, API is the way to go.

     
    • Timothy O'Neil-Dunne

      … and thus speaks the voice of experience.
      However, “intelligent” scrapers are few and far between. Schedules can be dynamic, and for this to work there needs to be a perfect match of city pair and schedule. Ryanair’s schedule (like many LCCs’) can be all over the place. Just an examination of the dates of flights and the timings reveals that this is more than a little problem. Managing that in real time is pretty hard because the schedule will be different tomorrow.

      Cheers

      Timothy

       
  2. Denis Tsyplakov

    Sorry if my comment is a bit naive – I’ve been working in the travel industry for just 11 months, but have a 20-year technical IT background.
    Technically, I believe it is almost impossible to avoid scraping. Nowadays there are lots of frameworks that allow you to implement, in just a few weeks, a distributed scraper using a headless browser, a swarm worker pool and a set of browsing patterns that mimic human behavior.
    The ethical side is much more interesting. I think we need to distinguish two cases.
    1. Scraping rates – on a mature, competitive market every player has a natural right to be aware of prices in the “shop next door”. I believe scraping rates is fair and acceptable. If I am selling apples, I have a natural right to check how much apples cost in another shop.
    2. Scraping content – creating and maintaining high-quality content costs money, lots of money. If some companies invest in content and others take it for free, it does not look fair, and it can lead to a decrease in content quality. But scraping is a physical reality; we cannot simply prohibit it. So the industry should find a way to live with the reality of scraping while maintaining a high quality of service. Maybe in the end we will even reach a situation where content is crowd-funded by major industry players, like Wikipedia (and we would solve the issue of hotel matching once and for all).

     
  3. Jonathan Boffey

    @Tim: Did I manage to avoid the technology debate? 🙂

    There is no doubt that the provenance of the data will be key to achieving commercial success. Trio already have a client who wants to use our technology to monitor abandoned bookings due to (cache attributed) pricing/availability differences arising between the search and booking steps and that’s at the ‘OTA’ level before any crawling.

    Another client once said to me that scrapers always leave a signature because they always have to fill in their pricing grid – just like the old “battleships” game. Whilst the crawling technologies may have moved on so have the real-time analysis systems. Cat and mouse………..

     
    • Denis Tsyplakov

      Scraping and anti-scraping really is like cat and mouse. My wife, when she is searching for a vacation tour, definitely matches some pattern, because two OTA sites have banned her and constantly show her a captcha :-). It is really difficult to distinguish a smart swarm scraper from an active user.
      Do scrapers dream of electric sheep?

       
  4. Timothy O'Neil-Dunne

    You have to love this topic. It is such an elephant in the room of how the consumer web expectation collides with legacy technology AND commercial & operational processes and practices.

    Maybe I am just playing my old curmudgeonly self in saying … “this will end in tears”

    Cheers

     
  5. Alexey Abolmasov

    Oops.

    Does that mean that I have hired IT people to program and support the product, sales people to connect operators, data people to dig into the dirty data and clean it up, double extra smart people to standardize, validate and prepare this data for distribution, set up all the servers and paid for running the infrastructure plus all overheads; that we then spend months and years to cook and spice all that public transport data from really hard-to-reach places and companies, combining buses, ferries, flights, railways, sweat, blood and tears – and then you just come and pick it up by crawling, ready and validated?

    This world _is_ data these days, data is the value; if you need data – come and ask and there could be business involved.

    Either way, I’m not too happy about this new way of interpreting a publicly exposed Apache server as an “API”. The API and static files run on separate servers and hence unload the public interfaces – I value my users’ response times. Also worth mentioning is that I lose control over the data coming from my services – especially if there is a 12Go label on it. Data in the public transport sector quickly goes sour (sometimes within hours), and if some web service claims that “12Go said there is a bus at 12:35 from A to B”, I must be sure I did everything to provide this data in a good way. If your bot posts it as coming from “public sources” – there is no trust in that data. Especially if we provide fake data mixins.

    Again, we live in the world of data and interconnecting APIs. Collaborate & cooperate is the key – not parse & go. APIs are smart enough these days.

     
  6. Valentin Dombrovsky

    “No. It is, or at least was, in most cases caused by GDSs and their business model, in which a travel company had to pay for requesting and using their own data.

    But since many companies nowadays host their own data or use purely performance-based third-party data services, this issue is slowly fading”.

    Sorry, do you mean that “look to book” is not an issue anymore and you don’t need to pay the GDS for inquiries? This part is a bit unclear to me.

     
  7. Jonathan Boffey

    It is interesting that there are OTAs out there who have set up meta-search APIs because they need to manage the impact crawlers have. This becomes a cost the OTA has to bear because the OTA wants to maintain its response times for direct consumer traffic that is response time sensitive. Ironically the next cost is that the meta-search site then wants paid for the referrals too. So the meta-search site just ate part of the OTA’s lunch or did it just invite the OTA to a bigger banquet? All part of the evolution of on-line travel? What goes around usually comes around. Expect the crawlers to get crawled and the meta-searchers to get meta-meta-searched? Who foots the bill? No doubt the laws of economics will prevail over the technology.

     
    • Timothy O'Neil-Dunne

      JB
      The missing element in your comment is the issue of data quality. Once data has been presented and crawled it starts its decay. Frequently by the time it is presented the data is aged.

      Today there are three sources of “offer” data:
      1. Calculated by owner/responsible party e.g. Airline or GDS
      2. Calculated by third party and not necessarily accurate (e.g. ITA engine)
      3. Cached data – pre-calculated and re-used.

      Google/ITA’s calculations may be technically right BUT if they are not supported through a PCA or other similar agreement then it’s – and this is the case in most instances – JUST an educated guess. It has no basis in law as a formal offer, and the result is nothing more than an indicator of the price you could pay. Guess what? Google DOESN’T need to care. It has no obligation to tell the truth and confirm that this is its interpretation of what the real result should be. Conversely, the airlines sign PCA deals with the GDSs, and the results might actually be of LESSER technical quality, but contractually the airlines are obliged to honour them.

      My team does constant analysis of website and direct engine results. Our tracking, however, is very expensive to do on a consistent basis, so frankly we don’t do it as often as we would like. BUT I can tell you that just going from the OFFER result to the actually available result in the purchase path shows a quality degradation of about 15% that we can observe. Note that we estimate the actual total rates are far higher. Looking at the statistical distribution, because there is so much focus on the most frequently travelled routes, we see that those are updated most frequently and are monitored (consequently their quality is higher). The edge cases have the higher percentage of bad data, but because they are so seldom accessed it doesn’t seem to matter.

      The strange thing is that a quirk of the traditional contract process ONLY guarantees the data when you have completed the transaction. It’s neither yours nor is the final price determined until you have paid for the ticket. Seems a tad odd when you think about it, but it’s the way it’s always been.

      Finally, the cached data, whether scraped or calculated, is just plain wrong. It might have arrived at the same number, but it’s not guaranteed.

      The poor consumer, who is used to seeing on other sectors’ websites and in traditional outlets a price that is either present or absent, finds this additional variable hard to comprehend. E.g. a book at this price is either present or absent.

      Without the element of trust provided by the “owner” of the data allocated to the product – this conundrum will always exist. The issue of how the consumer then sees the results as Trustworthy has not till now really had an impact because everyone is bad. Personalization and transparency is creating more pressure and more complexity. Result? Greater confusion for the consumer.

      Gotta love the GDSs and airlines’ founding fathers for this strange mess they created.

      Cheers

       
  8. Paul Byrne

    Our experience with website scraping has been horrendous. Frequent changes to sites and ensuing breakage, leading to constant support and development changes, have led us to steer clear of this.
    I agree this presents challenges when no API is or can be presented as an alternative. However, your statement:
    APIs are from the ’90s – crawlers are today’s interfaces!
    seems a bit bizarre to me.

    I would argue quite the opposite, and the explosion and growth of open APIs, as can be seen on the ProgrammableWeb site, attest to the fact that APIs are the building block for innovation in the travel industry. From July 2008 to Oct 2013, there has been a 10X growth in the number of publicly available APIs. These APIs are powering web sites and mobile apps and act as additional revenue channels.

    Furthermore, APIs provide airlines with control over these individual channels and the ability to offer relevant product, priced accordingly.

    For some reason, from my development days, scraping web sites still gives me the jitters!

     
  9. Carlos Baez

    Crawling is certainly an option available when an API does not exist but crawling a site without the owner’s permission does raise many questions, not just technical or even legal but ethical as well. Equally, even if you decide crawling is your best option, certain etiquette should also be taken into consideration, I have seen websites go down because of sudden peaks of load caused by crawlers generating too much traffic and subsequently just getting blocked as spam traffic.

    Would also be interesting to hear from the author about the crawling tools used.

     
  10. Timothy O'Neil-Dunne

    This is a minefield. It is a really good idea to work with this topic and consider if you are a good guy or a bad guy.

    I regard this as the extent of Newton’s 2nd Law. In all cases there are always 4 possible results of the relationship.

    1. Both sides agree
    2. Side 1 agrees – side 2 doesn’t agree
    3. Side 1 doesn’t agree – side 2 does agree
    4. Both sides disagree.

    Since only the positive case is fully allowable – only 1 condition set works.

    Let me give an example cited to me by Ryanair.

    Ryanair gets hit hard by scrapers who don’t know (or want to know or don’t care) what they are doing. Thus they are hit with requests for NYC to LAX. A route they don’t serve.

    Why?

    Since the cost of the web search is so small as to be negligible – crawlers and scrapers (the less nice term) are simply indiscriminate about the crawling process. Why should Ryanair have to implement a complex (and expensive) set of services to accommodate the scrapers? What control do the scrapers give over the content that is clearly NOT theirs?

    It gets ugly very fast.

    The legal case cited here is but one of MANY cases. Ryanair – correctly in my view – takes the position that since they EXPRESSLY forbid scraping then any scraping or crawling is bad. For specific details go here: http://www.ryanair.com/en/terms-of-use/ The courts in Europe are not consistent in this situation.

    However, underneath all of this is a basic truth. Travel websites in general (direct or indirect sellers) calculate the results for each offer (either cached or calculated in real time). Thus the cost to deliver it in time and effort (and thus real, hard cost) is high. There is no easy answer to this problem of scraping and crawling – however, for the commercial and technical teams reading this, please do consider the impact of your actions on the scraped site if you are the one scraping.

    This problem will not go away. And as I have frequently stated – “verified availability” is one solution to the problem. Both sides – scrapers and scraped need to step up their game. Good guys will do this. Bad guys will carry on abusing. That is a fundamental truth.

    Cheers

    Timothy

     
    • Daniele Beccari

      I think the world is not binary.
      We get asked almost daily by travel sites if we can scrape their B2C site.
      While others reject scraping because it screws up their analytics.
      As long as both parties are happy, the rest is just a different technical solution.

       
    • Mike Putman

      Great feedback to an interesting article. There are the Ryanairs of the world, but there is a host of other LCCs that are looking for distribution but don’t have the means to pull it off.

       
 
 
