Wikipedia would tell you that screen scraping, also known as data scraping, is a process through which information is gathered from the Internet and later aggregated for some particular use.
This is quite an easy explanation, yet let’s try to describe screen scraping in terms that even someone who has never seen a computer could understand. Imagine a city library, a place that holds thousands of books. In order to understand what the books are about, how useful they are and what their main subject is, it is necessary to read, grade and categorise them. This may sound like quite a big effort, and it is indeed an enormous amount of work.
When it comes to screen scraping, we deal not with books but with web pages. As this data is digitised, all of the aforementioned processes can be performed without any human effort. So if you can imagine a robot going through every page of every book at lightning speed and then placing each book on the correct shelf in the library, then you can imagine screen scraping technology too.
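To make the “robot librarian” idea concrete, here is a minimal sketch in Java of the core step of screen scraping: taking the raw HTML of a page and pulling out a piece of information a human would otherwise read off the screen. The HTML here is a made-up sample, and real scrapers typically use a proper HTML parser rather than a regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A minimal sketch of the "robot librarian": given the raw HTML of a page
// (a hard-coded sample here), extract the title that a human reader would
// otherwise have to look up by hand.
public class PageScraper {

    static String extractTitle(String html) {
        // Non-greedy match of whatever sits between <title> and </title>
        Matcher m = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL).matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Moby Dick</title></head>"
                    + "<body>Call me Ishmael...</body></html>";
        System.out.println(extractTitle(html)); // prints "Moby Dick"
    }
}
```

Run over thousands of pages, this same extract-and-record loop is what lets a scraper “shelve” content automatically.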
One of the best examples of screen scraping is actually something we use on a daily basis: Google. This search engine works rather simply from the user’s point of view: you input a search phrase and within milliseconds it displays the results. To perform this process so fast, Google needs to score each page for relevancy, and of course this scoring is not done at query time. Google’s algorithms crawl web pages, analyse them, extract data from text and links, perform mapping, scoring and tons of other activities, followed by indexing. If Google relied not on algorithms but on human effort to build its search engine, it would most probably be the largest company in the world in terms of employee count.
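The indexing step mentioned above can be sketched in a few lines of Java. The idea of an inverted index is simply a map from each word to the set of pages containing it, so a query can be answered without re-reading every page. The page names and texts below are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// A toy inverted index: "crawled" pages (hard-coded here) are split into
// words, and each word is mapped to the set of pages containing it.
public class ToyIndex {

    static Map<String, Set<String>> buildIndex(Map<String, String> pages) {
        Map<String, Set<String>> index = new HashMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String word : page.getValue().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    index.computeIfAbsent(word, k -> new TreeSet<>()).add(page.getKey());
                }
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("pageA", "Screen scraping gathers data");
        pages.put("pageB", "Search engines score data");
        Map<String, Set<String>> index = buildIndex(pages);
        System.out.println(index.get("data"));     // [pageA, pageB]
        System.out.println(index.get("scraping")); // [pageA]
    }
}
```

A real search engine adds scoring on top of this lookup, but the scraping-then-indexing pipeline is the same in spirit.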
In general, screen scraping technology allows fast and objective processing of data from the Internet.
This is a rather tricky question. Back in the 16th century, Francis Bacon said that knowledge is power, and the phrase has only been gaining relevancy since then. Data scraping technologies allow us to quickly extract and analyse data from the world wide web. Considering that there are roughly 200,000,000 active websites and nearly 1 billion active hostnames, it would be one heck of an effort to extract the data from them without involving any technology. Hence, we can say that data scraping is useful because it allows humans to access and assess information that simply would not be available otherwise.
Let’s tackle this with an example too. Imagine there is no Internet. Terrifying, huh? Okay, let’s not go to extremes: there is Internet, breathe normally... but imagine there are no search engines. You have just invented an extremely funny joke that you would like to share with the whole world. Well, you will most probably send it to a few friends of yours. You could post it on your Facebook and hope it goes viral. However, with no data scraping in place and no search engines to visit, there is simply no easy way for anyone to find your joke. Of course, if someone is dedicated enough to ask around for your name, look up your Facebook profile and find this content on your wall, the data can still be reached, but the whole process becomes very time-consuming.
Thus we can say that the main advantage of data scraping is that it makes data accessible.
As a rule, there is always criticism when it comes to online privacy and the information that can be found about a person or an item on the web. Data scraping technology is not actually tightly connected to this issue, as the visibility of publicly available information is mostly a matter of indexing. In other words: data scraping good, indexing bad.
Another common concern related to online data scraping is that the information can be manipulated. The official name for this kind of Internet data manipulation is Search Engine Optimisation, and it has been providing tons of high-paying jobs for many years. As a rule, a certain Google algorithm scores each page and decides how high it should appear in the search results. The bad news is that yes, algorithms can be tricked; however, the same could arguably happen if humans were scoring the data. The good news is that the algorithms are constantly updated and webpages that try to trick them get penalised, so there is always an improvement taking place.
By now you should understand the basics of web scraping technology from A to Z, so let’s talk more about the technologies involved in this process. In general, there are two major approaches to the problem: either using fully-fledged web browsers (headless WebKit is among the most popular choices), or implementing a low-level scraper in the developer’s language of choice, for example good ol’ Java.
There are some developers that use WebKit, but why would anyone do it? I don’t really know. Java screen scraping is a natural way of accessing web data for one simple reason: Java runs on any platform (Linux, OS X, Windows), including mobile platforms like Android, so it can be easily integrated into most software stacks.
Besides that, Java excels in speed. Usually it takes only one request to scrape the data from a webpage. In contrast, WebKit-based technologies perform far more requests, as they download CSS and a lot of other data that does not add any value. This is why screen scraping in Java is superior.
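The request-count argument can be illustrated with a short sketch: a low-level scraper needs a single GET for the HTML, while a full browser would additionally fetch every referenced stylesheet, script and image. The sample HTML and the regex below are simplified illustrations, covering only the common src/href forms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Lists the sub-resources a full browser would download on top of the one
// HTML request a low-level scraper makes. The sample page is made up.
public class SubresourceCounter {

    static List<String> subresources(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = Pattern
            .compile("(?:src|href)=\"([^\"]+\\.(?:css|js|png|jpg))\"")
            .matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<link href=\"style.css\"><script src=\"app.js\"></script>"
                    + "<img src=\"logo.png\"><a href=\"/about\">About</a>";
        // A scraper issues 1 request; a browser would issue 1 + 3 here
        System.out.println(subresources(html)); // [style.css, app.js, logo.png]
    }
}
```

On a real page with dozens of assets, the gap between one request and a browser’s full load grows accordingly.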
Now let’s come to the most interesting part of this article. As you know, your financial information cannot be accessed publicly, as it is secured by your bank. But is that good or bad? Generally it is a good thing, of course, as you would rarely want to disclose your financial information to the general public. However, in many instances you would be much better off sharing your financial information with another secure institution than keeping it totally under lock and key.
Let me tell you the story of how I turned from an opponent of financial data sharing into a person in favour of it. Back in 2012, when I started working on kontomierz.pl, I was a bit sceptical about the whole process of connecting your account to the application and letting it import your transactions. However, as a developer, I quickly understood that this portal is actually a secure one, as it uses an applet that does not send any credentials to kontomierz.pl servers. As I worked on this product, I eventually decided to try it myself once I was sure of its security... and, promotions aside, it became my favourite tool for personal finance management. In the past I simply had my paycheque and recurring expenses, knowing that whatever was left over became my savings. It was quite hard for me to control my expenses, categorise them and spot any “hidden” or “new” charge from my bank. It is of course possible to use a PFM app without data scraping technology, but it would require manually uploading CSV files from my bank. I cannot say that this would be a showstopper for me, yet a PFM app with automatic synchronisation is definitely the next level of managing your finances.
Also, with more and more useful financial services appearing, we no longer use one bank account for the whole set of our financial products. It is common to see people with a checking account in one institution, a loan from another bank and a currency deposit in a third. Applying for every new financial service is rather time-consuming and often requires paying a visit in person. Using screen scraping technology, it is now possible to offer a “log in with your bank account” option, which is essentially the “log in with Facebook” of the financial services realm. There is no need to fill out bulky application forms, upload your documents, wait for their verification and so on: all of the registration and identification data can simply be supplied by the bank where you are already registered.
As financial institutions gain access to your banking history, they can also often supply you with more attractive offers based on your previous performance at other institutions. We can say that via banking data scraping the banks not only get to Know-Your-Client, but also come to understand the client much better. As the bank knows the customer’s consumption patterns for financial services, it can adapt its offer and make it more suitable for the customer that has just joined.
Now we have come to the part where I am supposed to tell you why Kontomatik technology is exactly what you need when it comes to extracting financial data. Even though I work for Kontomatik, I will try to be as objective as possible. If you disagree, please do comment; it would be highly appreciated and could potentially lead us to improve. So, let’s find out why Kontomatik is so useful.
First of all, Kontomatik is written in pure Java. Unlike some of our competitors that use WebKit, our technology lets us extract data faster and with a higher rate of success.
Speed is of course important, but when it comes to finances, security is king. Kontomatik offers both SaaS and on-premises solutions to fit the needs of different clients, and both offer a very high level of data security.
Another great point of our software is its flexibility. Just as you can choose among different integration models depending on the needs of your organisation, you can also get financial data aggregation customised to fit the needs of your app. In general, we simply supply you with the technology, which is like a blank canvas: you are the artist who creates the app the way you want it using Kontomatik.
Finally, it is also necessary to mention that Kontomatik provides a dedicated server for every single client. This way you are not only getting enhanced security, but also a higher degree of reliability. Fewer requests take place, hence the chances of something going wrong are smaller too. Also, less stress is placed on your servers, less bandwidth is used and you can actually complete more imports in the same amount of time!
I hope you have enjoyed this article and found it educational. Please do not hesitate to comment and give me your feedback. Thanks!