forked from Azure/usql
params.json
{"name":"U-SQL","tagline":"U-SQL Language Team Site","body":"### Introducing U-SQL\r\n\r\nU-SQL is a data processing language that unifies the benefits of SQL with the expressive power of your own code. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in file, object, and relational stores such as Azure SQL Database.\r\n\r\n### What is this site?\r\nThis site is the front door to the [U-SQL GitHub repo](https://github.com/microsoftbigdata/usql), where you can find libraries, tools, and more for extending U-SQL. More importantly, it's a direct line back to the team for bugs, feature requests, and suggestions. We will also post design proposals here and track features coming in upcoming releases. We're super excited about U-SQL, and we hope that you are too.\r\n\r\n### Let's Write U-SQL!\r\n\r\nLet’s assume that I have downloaded my Twitter history of all my tweets, retweets, and mentions as a CSV file and placed it into my Azure Data Lake Store. In this case I know the schema of the data I want to process, and for starters I just want to count the number of tweets for each of the authors in the tweet “network”:\r\n\r\n\t@t = EXTRACT date string\r\n\t\t\t, time string\r\n\t\t\t, author string\r\n\t\t\t, tweet string\r\n\t\tFROM \"/input/MyTwitterHistory.csv\"\r\n\t\tUSING Extractors.Csv();\r\n\t\r\n\t@res = SELECT author\r\n\t\t, COUNT(*) AS tweetcount\r\n\t\tFROM @t\r\n\t\tGROUP BY author;\r\n\t\r\n\tOUTPUT @res TO \"/output/MyTwitterAnalysis.csv\"\r\n\tORDER BY tweetcount DESC\r\n\tUSING Outputters.Csv();\r\n\r\nThe above U-SQL script shows the three major steps of processing data with U-SQL:\r\n\r\n1. Extract data from your source. Note that you just schematize it in your query with the EXTRACT statement. The data types are based on C# data types, and I use the built-in Extractors library to read and schematize the CSV file.\r\n2. Transform using SQL and/or custom user-defined operators (which we will cover another time). 
In the example above, it is a familiar SQL expression that does a GROUP BY aggregation.\r\n3. Output the result either into a file or into a U-SQL table to store it for further processing.\r\n\r\nNote that U-SQL’s SQL keywords have to be upper-case to differentiate them syntactically from C# expressions that use the same keywords with a different meaning. Also notice that each of the expressions is assigned to a variable (@t and @res). This allows U-SQL to transform and combine data step by step, expressed as an incremental expression flow using functional lambda composition (similar to what you find in the Pig language). The execution framework then composes the expressions into a single expression. That single expression can be globally optimized and scaled out in a way that isn’t possible if expressions are executed line by line.\r\n\r\nGoing back to our example, I now want to add information about the people mentioned in the tweets and extend my aggregation to return how often people in my tweet network are authoring tweets and how often they are being mentioned. Because I can use C# to operate on the data, I can use an inline C# LINQ expression to extract the mentions into an ARRAY. Then I turn the array into a rowset with EXPLODE and apply the EXPLODE to each row’s array with a CROSS APPLY. I union the authors with the mentions, but need to drop the leading @-sign to align it with the author values. 
This is done with another C# expression where I take the Substring starting at position 1.\r\n\r\n\t@t = EXTRACT date string\r\n\t\t\t, time string\r\n\t\t\t, author string\r\n\t\t\t, tweet string\r\n\t\tFROM \"/input/MyTwitterHistory.csv\"\r\n\t\tUSING Extractors.Csv();\r\n\t\t\r\n\t@m = SELECT new SQL.ARRAY<string>(\r\n\t\t\t\t\ttweet.Split(' ').Where(x => x.StartsWith(\"@\"))) AS refs\r\n\t\tFROM @t;\r\n\t\r\n\t@t = SELECT author, \"authored\" AS category\r\n\t\tFROM @t\r\n\t\tUNION ALL\r\n\t\tSELECT r.Substring(1) AS r, \"mentioned\" AS category\r\n\t\tFROM @m CROSS APPLY EXPLODE(refs) AS Refs(r);\r\n\t\r\n\t@res = SELECT author\r\n\t\t\t\t, category\r\n\t\t\t\t, COUNT(*) AS tweetcount\r\n\t\tFROM @t\r\n\t\tGROUP BY author, category;\r\n\t\r\n\tOUTPUT @res TO \"/output/MyTwitterAnalysis.csv\"\r\n\tORDER BY tweetcount DESC\r\n\tUSING Outputters.Csv();\r\n\r\n\r\n### Why U-SQL?\r\n\r\nIf you analyze the characteristics of Big Data analytics, several requirements arise naturally for an easy-to-use yet powerful language:\r\n\r\n* Process any type of data. From analyzing botnet attack patterns in security logs to extracting features from images and videos for machine learning, the language needs to enable you to work on any data.\r\n* Use custom code easily to express your complex, often proprietary business algorithms. The example scenarios above may all require custom processing that is often not easily expressed in standard query languages, ranging from user-defined functions to custom input and output formats.\r\n* Scale efficiently to any size of data without you focusing on scale-out topologies, plumbing code, or limitations of a specific distributed infrastructure.\r\n\r\nHow do existing Big Data languages stack up against these requirements?\r\n\r\nSQL-based languages (such as [Hive](http://hive.apache.org) and others) provide you with a declarative approach that natively does the scaling, parallel execution, and optimizations for you. 
This makes them easy to use, familiar to a wide range of developers, and powerful for many standard types of analytics and warehousing. However, their extensibility model and support for non-structured data and files are often bolted on and harder to use. For example, even if you just want to quickly explore your data in a file or remote data source, you need to create catalog objects to schematize file data or remote sources before you can query them, which reduces your agility. And although SQL-based languages often have several extensibility points for custom formatters, user-defined functions, and aggregators, these are rather complex to build, integrate, and maintain, with varying degrees of consistency in the programming models.\r\n\r\nProgramming language-based approaches to processing Big Data, for their part, provide an easy way to add your custom code. However, a programmer often has to explicitly code for scale and performance, often down to managing the execution topology and workflow, such as the synchronization between the different execution stages or the scale-out architecture. This code can be difficult to write correctly and to optimize for performance. Some frameworks support declarative components such as language-integrated queries or embedded SQL support. However, the SQL may be integrated as strings and thus lack tool support, and the extensibility integration may be limited, hard to optimize (because the procedural code does not guard against side effects), and may not provide for reuse.\r\n\r\nTaking the issues of both SQL-based and procedural languages into account, we designed U-SQL from the ground up as an evolution of the declarative SQL language with native extensibility through user code written in C#. 
This unifies both paradigms, unifies structured, unstructured, and remote data processing, unifies the declarative and custom imperative coding experience, and unifies the experience around extending your language capabilities.\r\n\r\nU-SQL is built on the lessons from Microsoft’s internal experience with [SCOPE](http://www.vldb.org/pvldb/1/1454166.pdf) and existing languages such as T-SQL, ANSI SQL, and Hive. For example, we base our SQL and programming language integration and the execution and optimization framework for U-SQL on SCOPE, which currently runs hundreds of thousands of jobs each day internally. We also align the metadata system (databases, tables, etc.), the SQL syntax, and language semantics with T-SQL and ANSI SQL, the query languages most of our SQL Server customers are familiar with. And we use C# data types and the C# expression language so you can seamlessly write C# predicates and expressions inside SELECT statements and use C# to add your custom logic. Finally, we looked to Hive and other Big Data languages to identify patterns and data processing requirements and integrate them into our framework.\r\n\r\n\r\n### Authors and Contributors\r\n@MikeRys, @SaveenR, @mwinkle\r\n","google":"UA-69266020-1","note":"Don't delete this file! It's used internally to help with page regeneration."}