Big Data Strategy
The Big Data Strategy Conference is coming back to Lithuania. The list of technologies used to solve Big Data problems is ever expanding, raising questions such as: What should be used? How should it be used? When should it be used? Join one of Europe's premier events, where answers to those questions will be revealed. The Big Data Strategy Conference will feature the perfect combination of inspirational speeches and training talks by world-leading speakers and industry experts, as well as personal experiences and practical suggestions for technologies such as Hadoop, Couchbase, HP Vertica, MS SQL Server and many others.
This presentation introduces the Analytics Platform System (APS), a massively parallel appliance for Big Data built on a shared-nothing architecture. It is fully integrated with other Microsoft products, i.e., Azure, the BI stack, ETL, and Complex Event Processing. Columnar storage and PolyBase will be presented: columnar storage is a technology heavily optimized for reporting and analytics, while PolyBase is a query extension that allows data stored in the database and files on Hadoop to be queried simultaneously, both via T-SQL. This makes it possible to apply BI tools to information stored in Hadoop. The role of APS in large analytics projects will be discussed using a real-life example.
In the last 5-10 years, the industry has witnessed dozens of new NoSQL databases emerge, turning topics such as schema-less design and scaling into hot buzzwords. These NoSQL databases have taken a different approach to solving current scaling and Big Data problems, sometimes offering niche products, sometimes innovating on a given aspect, sometimes compromising on their CAP guarantees. Surprisingly to some, however, NoSQL databases share at least one common pattern: they were all built from scratch. Their storage engines, replication techniques, journaling and ACID support (if any) were all coded from zero. Yet these are among the most complex problems in the software industry, and they were implemented without leveraging the previously existing state of the art. From an engineering perspective, this is not what we have all been told: DRY. Wouldn't it be possible to construct a NoSQL database by layering it on top of a relational database? Wouldn't it be possible to "tune" a relational database to behave as a NoSQL database, so as to focus on being schema-less, scalable and anything else needed, but without re-inventing the wheel on "basic" concerns such as journaling or durability? Enter ToroDB. ToroDB is an open-source project that behaves as a NoSQL database but runs on top of PostgreSQL, one of the most respected and reliable relational databases. ToroDB offers a document interface and implements the MongoDB wire protocol, making it compatible with existing MongoDB drivers and applications. But ToroDB stores data in PostgreSQL, something that is transparent to database clients. Rather than storing JSON documents as blobs or using PostgreSQL 9.4's fantastic jsonb data type, ToroDB takes an innovative approach: it transforms document data into a relational representation in a fully automated way that requires no user intervention or configuration. The benefits of storing document data relationally are quite significant.
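To make the idea concrete, here is a minimal sketch (in Python, purely for illustration) of the general document-to-relational mapping pattern: split a nested document into flat per-level "tables". The function name, table naming and column layout are assumptions for this example, not ToroDB's actual schema.

```python
# Illustrative sketch (NOT ToroDB's actual schema): split a JSON-like
# document into flat per-nesting-level row sets keyed by document id.
def flatten_document(doc_id, doc, path="root", tables=None):
    """Recursively map one document into rows, one table per nesting level."""
    if tables is None:
        tables = {}
    row = {"doc_id": doc_id}
    for key, value in doc.items():
        if isinstance(value, dict):
            # Subdocuments go to their own table, linked by doc_id.
            flatten_document(doc_id, value, f"{path}.{key}", tables)
        else:
            row[key] = value
    tables.setdefault(path, []).append(row)
    return tables

tables = flatten_document(1, {"name": "Ada", "address": {"city": "Vilnius"}})
# tables["root"] holds the scalar fields; tables["root.address"] the subdocument.
```

Each resulting row set maps naturally onto a relational table, which is what lets ordinary SQL tooling see inside the documents.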
This presentation will give an overview and provide examples of two components of the HP Predictive Analytics platform: HP Vertica and HP Distributed R. Distributed R is an open-source, scalable and high-performance engine for the R language. Designed for data scientists, HP Distributed R accelerates large-scale machine learning, statistical analysis, and graph processing. The secret is in how HP Distributed R splits tasks between multiple processing nodes to vastly reduce execution time and enable users to analyze much larger data sets. Best of all, HP Distributed R retains the familiar R look and feel, and data scientists can continue to use their existing statistical packages. The Vertica Analytics Platform is easy to use and deploy, so users across an organization (not just DBAs) can get up and running quickly and immediately analyze mission-critical data. As a distributed shared-nothing database, Vertica has built-in statistical and data mining tools that can be leveraged using SQL, as well as the extensibility to scale your existing R code.
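The split-tasks-then-merge pattern described above can be sketched in a few lines. This is plain Python (not R or the Distributed R API) and only illustrates the general scheme: partition the data, compute partial results on parallel workers, then combine them cheaply.

```python
from multiprocessing import Pool

# Sketch of the split/combine pattern behind engines like Distributed R:
# partition the data, compute partial results in parallel, merge at the end.
def partial_stats(chunk):
    # Each worker returns just (sum, count), not the raw data.
    return (sum(chunk), len(chunk))

def distributed_mean(data, workers=2):
    size = max(1, len(data) // workers)
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(workers) as pool:
        partials = pool.map(partial_stats, chunks)   # parallel map step
    total, count = map(sum, zip(*partials))          # cheap merge step
    return total / count

print(distributed_mean(list(range(1, 101))))  # 50.5
```

Because only small partial results cross node boundaries, the same shape scales to data sets far larger than any single node's memory.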
With its popularity, ease of development, and performance benefits, Apache Spark is primed to become the next general processing layer for Hadoop, succeeding MapReduce. We will discuss the main benefits of Spark, the Resilient Distributed Dataset (RDD) programming paradigm, and the Scala functional programming language, which together express common distributed programming patterns in a concise and elegant way never before seen in the Hadoop world.
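The RDD paradigm mentioned above chains functional transformations over distributed collections. As a rough illustration, here is the classic word count expressed with the same flatMap / map / reduceByKey shape, in plain single-machine Python rather than Spark's actual distributed API:

```python
from itertools import chain

# Plain-Python sketch mirroring the RDD-style transformation chain
# (Spark runs these steps distributed across a cluster).
lines = ["big data", "big spark", "spark streaming"]

words = chain.from_iterable(line.split() for line in lines)   # ~ flatMap
pairs = ((word, 1) for word in words)                          # ~ map

def reduce_by_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value               # ~ reduceByKey
    return counts

print(reduce_by_key(pairs))  # {'big': 2, 'data': 1, 'spark': 2, 'streaming': 1}
```

In Scala on Spark the same pipeline reads almost identically, which is exactly the conciseness the talk highlights.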
Handling Big Data doesn't have to be a big challenge. Use Microsoft's technologies, based on open-source projects like Hadoop, advanced machine learning algorithms and easy-to-use visualization tools, to get started fast on Big Data. In this session we will discuss examples and use cases from various industry verticals and show how to build a Big Data solution, from collecting data to visualizing it.
SmartRecruiters is the Hiring Success Platform providing everything that companies need to transform their recruiting into effective talent marketing and sales machines and hire the best candidates. It includes Recruitment Marketing, Collaborative Hiring and a Modern Platform that solves unique customization, compliance, integration and analytics needs. In this talk we discuss several aspects of optimizing a product by using data. We will start by looking at how best to search for patterns in customer behavior and usage data, follow with how to make predictions based on historical data, and end with how to fine-tune your UI based on your findings. At the end of the presentation, several interesting conclusions about recruiting drawn from data will be shown.
Everyone is talking about Big Data nowadays and the challenges of working with large amounts of data. But how does one go from a plain Big Data idea to an actual company? The presentation will cover the challenges, failures and hard lessons learned while building such a company. Some parts will be applicable to project managers starting Big Data initiatives in larger companies.
Humans use their senses to gather data from the world, and there are multiple ways in which we could interact with computers. Text search engines are pretty old: Google is 17 years old. How about visual search? We will present some practical applications of image recognition technology to show that in the next few years visual search could be helping us in day-to-day life.
Vinted is an online lifestyle marketplace and a social network geared towards young women. We currently have 10 million members and we process and analyze up to 1 billion events daily. To overcome the limitations of our initial MySQL-based analytics solution, we evaluated Hive, Impala and Scalding, finally arriving at a solution built on Spark, Kafka and HBase. This is a lessons-learned talk about our bumpy road to Big Data analytics on the Hadoop platform. We will cover our Kafka-based data ingestion pipeline, fact table preparation and data aggregation, and talk about how all of this leads to sub-second slicing of pre-aggregated data cubes. In addition, we will mention how our pipeline is reused for ad-hoc data analysis with the help of interactive notebooks.
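The reason pre-aggregated cubes can be sliced in under a second is that the expensive work happens once, at ingestion time. A toy sketch of the idea (plain Python; the event fields, key layout and helper names are invented for this example, and the real system stores the cube in HBase):

```python
from collections import defaultdict

# Sketch: events are rolled up once per (country, day), so a later
# slice is a cheap key-filtered lookup, not a scan of raw events.
events = [
    {"country": "LT", "day": "2015-06-01", "clicks": 3},
    {"country": "LT", "day": "2015-06-01", "clicks": 2},
    {"country": "DE", "day": "2015-06-01", "clicks": 7},
]

cube = defaultdict(int)
for e in events:                       # done once, in the ingestion pipeline
    cube[(e["country"], e["day"])] += e["clicks"]

def slice_by_country(cube, country):   # served interactively, sub-second
    return {k: v for k, v in cube.items() if k[0] == country}

print(slice_by_country(cube, "LT"))    # {('LT', '2015-06-01'): 5}
```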
Big Data without access is just a very expensive set of ones and zeros. For a great reporting solution it is paramount that your users are able to access data in a way that matters to them, and not only access it, but access it simply and naturally. An API is a great solution to this problem, but the devil is in the details. It sounds easy until you realize that you have to support multiple data warehouse solutions and ad hoc data slices. In this presentation I will share the experience and lessons learned from building and consuming such a data API (project code name: Einstein).
Part one of this talk will take a look at the development of Percona Server, a MySQL drop-in replacement. To provide context, we will also cover the MySQL ecosystem, including major forks, patches and users. Part two will cover one of the major challenges faced when developing a database server: defining and implementing the right feature set, given the resource constraints of a small development team and strong competition, while still achieving quality goals. Part three will focus on performance and look at our strategy for bridging the gap between the peak performance numbers used in marketing graphs and actually delivering a fast server in production. More specifically, this talk will hone in on strategies for dealing with the hazards of stalls and performance variance.
Sometimes serving reports with data a few hours, or even minutes, old is not enough: clients want to see what happens in real time. Traditional batch-based ETL (extract, transform, load) techniques, which served us for years, are unable to cope with our current needs. In this talk we will describe our journey from batch-based ETL to stream processing. So, if you want to hear how we run a cluster, manage resources with Mesos and do stream processing with Storm, come to this presentation.
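The core shift from batch ETL to streaming can be captured in a tiny sketch (plain Python, not the talk's Storm/Mesos stack; class and method names are invented): the batch job recomputes a metric over everything on a schedule, while a stream processor updates it per event as data arrives.

```python
# Sketch of the batch-vs-stream difference for a single metric.
def batch_count(all_events):
    # Rerun periodically over the full data set; results lag by hours.
    return len(all_events)

class StreamingCounter:
    """Updated once per incoming event, so the value is always current."""
    def __init__(self):
        self.count = 0

    def on_event(self, event):
        # In a real system this would be called by the stream processor
        # (e.g. a Storm bolt consuming from a queue).
        self.count += 1
        return self.count

counter = StreamingCounter()
for event in ["view", "click", "view"]:
    counter.on_event(event)
print(counter.count)  # 3
```

The streaming version trades the simplicity of "recompute from scratch" for state management, which is where frameworks like Storm come in.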
Apache Spark is an easy-to-use and fast engine for large-scale data processing, optimized for multi-stage in-memory operations. In typical use cases, Apache Spark's batch processing model can be much faster than Hadoop, up to 100x. But Spark offers more: additional modules allow you to run SQL queries against massive datasets, analyze streams or graphs and create machine learning models, and you can combine all these features to write complex analytics applications. In this talk I will introduce the concepts behind streaming and machine learning algorithms on Apache Spark and show how you can use them to get insights from your streaming data.
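One of the central streaming concepts the talk refers to is windowing: computing a metric over only the most recent events. A minimal single-machine sketch of a fixed-size sliding window (plain Python, not Spark Streaming's actual API; the class name is invented):

```python
from collections import deque

# Sketch of a fixed-size sliding window, the idea behind windowed
# operations in stream processing (not the actual Spark API).
class SlidingAverage:
    def __init__(self, size):
        self.window = deque(maxlen=size)   # oldest events fall out automatically

    def add(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)

avg = SlidingAverage(size=3)
print([avg.add(v) for v in [10, 20, 30, 40]])  # [10.0, 15.0, 20.0, 30.0]
```

In Spark Streaming the same idea appears as windowed transformations over micro-batches, with the framework handling distribution and fault tolerance.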
This is a war story. It tells of a small victory in a war that we're currently losing. The story starts like this: "The Adform Big Data Lake was born as a small and beautiful pool. A single spring of fresh water fed the Lake. Time passed, and more and more springs brought their precious water to the Lake. The people of Adform were very happy to see it. But one day a terrible Evil started growing in the darkest depths of the Lake..."
While talking about one of Ivinco's projects, I will introduce Sphinx Search and MySQL as an efficient alternative to today's more traditional big data systems (Hadoop, Solr, etc.). This talk will describe our architecture decisions when building a scalable backend for a social media data search engine. Starting with a MySQL/InnoDB cluster which now stores 120TB+ of text data (sharding strategies for different types of data, adding large amounts of incoming data with low latency, ensuring high availability), I'll introduce Sphinx Search and its capabilities (indexing strategies, configuration, advanced features, how Sphinx compares to Elasticsearch and Solr). Finally, I'll outline how we monitor system health and ensure great system performance.
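One common sharding strategy at this scale is stable hash-based routing, sketched below in Python. This is only an illustration of the general technique; the shard count, key format and function name are assumptions, and the talk's actual per-data-type strategies may differ.

```python
import zlib

# Illustrative hash-based shard routing (one common strategy, not
# necessarily the one used in the project described in the talk).
NUM_SHARDS = 16

def shard_for(document_id):
    # CRC32 gives a stable hash, so the same id always routes to the
    # same MySQL shard, regardless of which application node asks.
    return zlib.crc32(document_id.encode()) % NUM_SHARDS

print(shard_for("post:12345") == shard_for("post:12345"))  # True: routing is stable
```

Stable routing is what makes it possible to add incoming data with low latency: writers can compute the target shard locally, without consulting a central directory.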
Most companies still use old-fashioned MS MOLAP solutions, wasting time on development, scaling, hardware and licenses. In this session we'll talk about Adform's experience transitioning to a Vertica+ROLAP solution. The talk will cover:
* Why we decided to adopt a disruptive technology instead of optimizing existing ones;
* When sexy SQL defeats elegant MDX;
* The nightmare we faced before the transition and how we reconnected wagons from one running train to another, as well as the resource benefits that resulted from the transition;
* Vertica+ROLAP optimization tips;
* How Vertica fits within the Hadoop data lake and our philosophy for Big Data disaster recovery.
Once Vinted.com (a peer-to-peer marketplace to sell, buy and swap clothes) grew larger, demanding more advanced analytics, we needed a simple yet scalable and flexible data-cubing engine. The existing alternatives (e.g. Cubert, Kylin, Mondrian) did not seem to fit, being too complex or not flexible enough, so we ended up building our own with Spark. We'll present:
- how DataFrames have proven to be the most flexible tool for fact preparation and cube input (cf. typesafe Parquet-Avro schemas);
- how we support multivalued dimensions;
- how we use Algebird aggregators for defining and computing our metrics;
- how simple it is to get good cubing performance by pre-aggregating input before cubing, with the help of Algebird aggregators that are Semigroup-additive for free;
- our HBase key design and optimizations such as bulk-loading into HBase, and how we read the cube back from HBase.
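The pre-aggregation trick mentioned above relies on the metric being Semigroup-additive: partial sums can be merged in any order. A toy sketch in plain Python (not the talk's Spark/Algebird code; the dimensions and data are invented) showing why rolling up on the full key first shrinks the cube's input:

```python
from collections import defaultdict

# Sketch of semigroup-style cubing: because the metric (a sum) is
# additive, rows can be pre-aggregated on their full dimension key
# before expanding every dimension subset. None means "all values".
facts = [
    ("LT", "ios", 1), ("LT", "ios", 1), ("LT", "android", 1), ("DE", "ios", 1),
]

# Step 1: pre-aggregate on the full (country, platform) key.
pre = defaultdict(int)
for country, platform, n in facts:
    pre[(country, platform)] += n

# Step 2: expand each pre-aggregated row to all dimension subsets.
cube = defaultdict(int)
for (country, platform), n in pre.items():
    for key in [(country, platform), (country, None), (None, platform), (None, None)]:
        cube[key] += n

print(cube[("LT", None)])   # 3: all LT facts, any platform
print(cube[(None, None)])   # 4: grand total
```

Since step 2 multiplies row counts by the number of dimension subsets, shrinking its input in step 1 is where most of the performance win comes from.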